├── .gitignore
├── LICENSE
├── README.md
└── exercises.md


/.gitignore:
--------------------------------------------------------------------------------
  1 | # Byte-compiled / optimized / DLL files
  2 | __pycache__/
  3 | *.py[cod]
  4 | *$py.class
  5 | 
  6 | # C extensions
  7 | *.so
  8 | 
  9 | # Distribution / packaging
 10 | .Python
 11 | build/
 12 | develop-eggs/
 13 | dist/
 14 | downloads/
 15 | eggs/
 16 | .eggs/
 17 | lib/
 18 | lib64/
 19 | parts/
 20 | sdist/
 21 | var/
 22 | wheels/
 23 | pip-wheel-metadata/
 24 | share/python-wheels/
 25 | *.egg-info/
 26 | .installed.cfg
 27 | *.egg
 28 | MANIFEST
 29 | 
 30 | # PyInstaller
 31 | #  Usually these files are written by a python script from a template
 32 | #  before PyInstaller builds the exe, so as to inject date/other infos into it.
 33 | *.manifest
 34 | *.spec
 35 | 
 36 | # Installer logs
 37 | pip-log.txt
 38 | pip-delete-this-directory.txt
 39 | 
 40 | # Unit test / coverage reports
 41 | htmlcov/
 42 | .tox/
 43 | .nox/
 44 | .coverage
 45 | .coverage.*
 46 | .cache
 47 | nosetests.xml
 48 | coverage.xml
 49 | *.cover
 50 | *.py,cover
 51 | .hypothesis/
 52 | .pytest_cache/
 53 | 
 54 | # Translations
 55 | *.mo
 56 | *.pot
 57 | 
 58 | # Django stuff:
 59 | *.log
 60 | local_settings.py
 61 | db.sqlite3
 62 | db.sqlite3-journal
 63 | 
 64 | # Flask stuff:
 65 | instance/
 66 | .webassets-cache
 67 | 
 68 | # Scrapy stuff:
 69 | .scrapy
 70 | 
 71 | # Sphinx documentation
 72 | docs/_build/
 73 | 
 74 | # PyBuilder
 75 | target/
 76 | 
 77 | # Jupyter Notebook
 78 | .ipynb_checkpoints
 79 | 
 80 | # IPython
 81 | profile_default/
 82 | ipython_config.py
 83 | 
 84 | # pyenv
 85 | .python-version
 86 | 
 87 | # pipenv
 88 | #   According to pypa/pipenv#598, it is recommended to include Pipfile.lock in version control.
 89 | #   However, in case of collaboration, if having platform-specific dependencies or dependencies
 90 | #   having no cross-platform support, pipenv may install dependencies that don't work, or not
 91 | #   install all needed dependencies.
 92 | #Pipfile.lock
 93 | 
 94 | # PEP 582; used by e.g. github.com/David-OConnor/pyflow
 95 | __pypackages__/
 96 | 
 97 | # Celery stuff
 98 | celerybeat-schedule
 99 | celerybeat.pid
100 | 
101 | # SageMath parsed files
102 | *.sage.py
103 | 
104 | # Environments
105 | .env
106 | .venv
107 | env/
108 | venv/
109 | ENV/
110 | env.bak/
111 | venv.bak/
112 | 
113 | # Spyder project settings
114 | .spyderproject
115 | .spyproject
116 | 
117 | # Rope project settings
118 | .ropeproject
119 | 
120 | # mkdocs documentation
121 | /site
122 | 
123 | # mypy
124 | .mypy_cache/
125 | .dmypy.json
126 | dmypy.json
127 | 
128 | # Pyre type checker
129 | .pyre/
130 | 


--------------------------------------------------------------------------------
/LICENSE:
--------------------------------------------------------------------------------
 1 | MIT License
 2 | 
 3 | Copyright (c) 2022 Hugh Perkins
 4 | 
 5 | Permission is hereby granted, free of charge, to any person obtaining a copy
 6 | of this software and associated documentation files (the "Software"), to deal
 7 | in the Software without restriction, including without limitation the rights
 8 | to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
 9 | copies of the Software, and to permit persons to whom the Software is
10 | furnished to do so, subject to the following conditions:
11 | 
12 | The above copyright notice and this permission notice shall be included in all
13 | copies or substantial portions of the Software.
14 | 
15 | THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
16 | IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
17 | FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
18 | AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
19 | LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
20 | OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
21 | SOFTWARE.
22 | 


--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
 1 | # cpu-tutorial
 2 | 
 3 | Exercises to guide you through building your own CPU, from scratch, in verilog
 4 | 
 5 | ## pre-requisites
 6 | 
 7 | - you should already have a grounding in verilog
 8 |     - one way to do this is to work your way through the questions at https://hdlbits.01xz.net/wiki/Main_Page
 9 |     - I didn't finish these, but I did many of them; you can see the ones I did personally at https://hdlbits.01xz.net/wiki/Special:VlgStats/3D7115FE8A440C29
10 |     - note: I'm not affiliated, I just found it was quite useful for me
11 |     - make sure you complete the `game of life` problem, since you are going to be building tons of finite state machines, and this problem is good practice
12 | - to learn verilog, hdlbits isn't enough: this just provides short-term validation, practice, and dopamine-stimulation that you are in fact learning
13 |     - one resource that is quite useful for learning is https://www.doulos.com/knowhow/verilog/ (note: I'm not affiliated, I just found it worked well for me)
14 |     - you can google around the web for other resources as you go
15 | - you'll need to install some simulators and synthesizers. I use:
16 |      - [iverilog](http://iverilog.icarus.com/)
17 |      - [verilator](https://www.veripool.org/verilator/)
18 |      - [yosys](https://github.com/YosysHQ/yosys)
19 | - you'll need a text editor, or development environment. I use the following:
20 |     - [Visual Studio Code](https://code.visualstudio.com/)
21 | - know at least one programming language you can use to write host-side code, such as assemblers
22 |     - python or C++ are both ok. I used python initially
23 | 
24 | Note that *all* the above resources can be used without needing to buy licenses or similar.
25 | 
26 | ## Exercises
27 | 
28 | - [exercises.md](/exercises.md)
29 | 


--------------------------------------------------------------------------------
/exercises.md:
--------------------------------------------------------------------------------
  1 | 1. build simulatable verilog, that will read some arbitrary 16-bit hexadecimal numbers from a text file, and output them onto the screen (using $display for output, you can use $readmemh to load the file)
  2 | 2. create a module 'proc' that will output the numbers 1 to 10, changing the number output on each clock positive edge. Create a driver module, which will provide a clock and reset to the module; and will use $display to print the numbers output by the module.
  3 | 3. modify the modules from 2. so that:
  4 |     - the driver module loads the hexadecimal file from 1., into a memory, and provides this memory to `proc`
  5 |     - `proc` iterates over the memory contents, giving each value to the driver module, one at a time
  6 |     - the driver module uses $display to print out each value
  7 |     - it's possible to create the memory in driver module, and then send the entire memory into the proc, using iverilog (this won't synthesize with yosys, but we can think about that later)
  8 | 4. split the hexadecimal file into two sections:
  9 |     - first numbers contain instructions, which we will talk about in a second
 10 |     - second set of numbers are the numbers we want to print out, just as before
 11 |     - for now, create a single 16-bit instruction, which we'll denote as `OUTLOC`, which will contain a memory location, and will send the contents of that memory location to the driver module
 12 |     - you'll have to find a way to encode the memory location inside the instruction
 13 |         - for example, the last 8 bits of the instruction could represent the instruction type, e.g. `1` could mean `OUTLOC`
 14 |         - and the first 8 bits of the instruction could represent the location in memory to output
 15 |     - run the driver, and check the outputs are ok
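
One way to picture the encoding from step 4, sketched in Python (the host-side language suggested in the README). This assumes the layout suggested above: instruction type in the last (low) 8 bits, memory location in the first (high) 8 bits, and `1` meaning `OUTLOC`. The names `encode_outloc` and `decode` are hypothetical, just for illustration.

```python
OUTLOC = 1  # assumed instruction-type value, as suggested in step 4


def encode_outloc(location: int) -> int:
    """Pack an OUTLOC instruction into a single 16-bit word."""
    if not 0 <= location <= 0xFF:
        raise ValueError("location must fit in 8 bits")
    return (location << 8) | OUTLOC


def decode(instr: int) -> tuple[int, int]:
    """Split a 16-bit word into (instruction type, location)."""
    return instr & 0xFF, (instr >> 8) & 0xFF


if __name__ == "__main__":
    word = encode_outloc(64)
    print(f"{word:04x}")  # the text you would put in the hex file
    print(decode(word))
```

The `decode` function mirrors what `proc` would do with bit slices on the instruction word.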
 16 | 5. since creating the instructions is kind of tedious, create a python script (or C++, or whatever language you like), that will take a text file with assembly code, and convert it into the hexadecimal instructions. The assembly code can look something like the following:
 17 | ```
 18 | outloc 64
 19 | outloc 68
 20 | outloc 72
 21 | outloc 76
 22 | 
 23 | location 64:
 24 |     abcd
 25 |     1234
 26 |     dead
 27 |     beef
 28 | ```
 29 |     - check that you can run your assembler to produce machine code, and then run your verilog simulation, to run the machine code, and the outputs look correct
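
A minimal sketch of such an assembler in Python, under some loud assumptions: it reuses the hypothetical encoding from step 4 (type in the low 8 bits, location in the high 8 bits), it treats every address as a word index, and it emits an `@address` line (which `$readmemh` accepts as a hexadecimal load address) for each `location N:` block. The function name `assemble` is made up for this example; adapt the addressing if your memory is laid out differently.

```python
OUTLOC = 1  # assumed instruction-type value from step 4


def assemble(source: str) -> list[str]:
    """Convert assembly text into the lines of a $readmemh hex file."""
    out = []
    in_data = False  # True while we are inside a `location N:` block
    for raw in source.splitlines():
        line = raw.strip()
        if not line:
            continue
        parts = line.split()
        if parts[0] == "outloc":
            loc = int(parts[1])
            if not 0 <= loc <= 0xFF:
                raise ValueError(f"location out of range: {loc}")
            word = (loc << 8) | OUTLOC
            out.append(f"{word:04x}")
            in_data = False
        elif parts[0] == "location":
            addr = int(parts[1].rstrip(":"))
            out.append(f"@{addr:04x}")  # $readmemh address marker (hex)
            in_data = True
        elif in_data:
            out.append(f"{int(line, 16):04x}")  # data words are already hex
        else:
            raise ValueError(f"cannot parse: {line!r}")
    return out


if __name__ == "__main__":
    program = "outloc 64\noutloc 65\n\nlocation 64:\n    abcd\n    1234\n"
    print("\n".join(assemble(program)))
```

Writing the returned lines to a text file gives something the driver from step 3 can load with `$readmemh`.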
 30 | 6. __registers__
 31 |     - Modify your proc module so that it has a memory to store 32 registers, which we will denote x0 to x31.
 32 |     - add a new instruction `LI`, which will load a number into a register
 33 |         - for example `li x1, 123` will load the number 123 into register x1
 34 |     - add an instruction `outr` which will print out the value of a register, eg `outr x1` will print out the value of x1
 35 |     - create some assembly code to test these two instructions, assemble it, and run it
 36 |     - check the output looks ok
 37 | 7. __memory__
 38 |    - move the memory out of proc/driver modules, into a new file, e.g. `mem.sv`
 39 |       - you will need to design an appropriate protocol for `mem.sv`, to allow reads and writes by `proc.sv`
 40 |       - you will need to design an appropriate protocol so that the driver module can write the initial hexadecimal instructions and data into the memory
 41 |       - easiest way might be to create a second 'write port' into `mem.sv`, that only the driver module uses
 42 |       - assemble and run your assembly programs, and check they continue to run ok
 43 |       - you will need to somehow handle that reading in instructions from memory will now take more than a single clock cycle
 44 |           - at this point, you will probably want to start making proc.sv be a finite state machine
 45 |           - (if you've no idea how to start with this, you could try some of the FSM problems in hdlbits, if you haven't already; and make sure you completed the 'game of life' problem, if you didn't already)
 46 | 8. add `load` and `store` instructions, that will load the value of a register from a memory location, or store the value of a register into a memory location, respectively. e.g. `load x1 64`, will load the value from memory location 64 into register x1; and `store x2 68` will store the value from register `x2` into memory location `68`.
 47 |     - you can add a new instruction `outloc`, if you wish, that outputs the value at a particular memory location, e.g. `outloc 64` will send the value at location 64 to the driver for output
 48 |     - both load and store will likely need multiple clock cycles, so you will need to continue working with the finite state machine you created in step 7
 49 | 9. __RISCV__
 50 |     - migrate your instructions to be RISC-V compliant
 51 |     - don't modify `out` or `outloc` or `outr` for now, since these are not RISC-V instructions
 52 |     - you'll need to change your instructions to be 32-bit
 53 |     - you'll need to implement `li` as a pseudoinstruction, that does something like `addi x5, x0, 123`, which takes the value of register `x0` (always 0), adds 123, and places the result into register x5
 54 |     - you'll need to look up the binary representations used in RISC-V in the RISC-V volume 1, unprivileged spec, https://github.com/riscv/riscv-isa-manual/releases/download/Ratified-IMAFDQC/riscv-spec-20191213.pdf
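
As a sanity check for your assembler, here is a small Python sketch of how `li` could be lowered to `addi`, using the I-type layout from the RISC-V unprivileged spec (immediate in bits 31:20, `rs1` in 19:15, `funct3` in 14:12, `rd` in 11:7, opcode in 6:0). The function names `encode_addi` and `encode_li` are hypothetical. It only handles values that fit in a signed 12-bit immediate; larger constants would need something like `lui` followed by `addi`.

```python
ADDI_OPCODE = 0b0010011  # OP-IMM major opcode
ADDI_FUNCT3 = 0b000


def encode_addi(rd: int, rs1: int, imm: int) -> int:
    """Encode `addi rd, rs1, imm` as a 32-bit I-type instruction word."""
    if not (0 <= rd < 32 and 0 <= rs1 < 32):
        raise ValueError("register index must be in 0..31")
    if not -2048 <= imm <= 2047:
        raise ValueError("immediate must fit in 12 signed bits")
    return (
        ((imm & 0xFFF) << 20)  # two's-complement 12-bit immediate
        | (rs1 << 15)
        | (ADDI_FUNCT3 << 12)
        | (rd << 7)
        | ADDI_OPCODE
    )


def encode_li(rd: int, value: int) -> int:
    """Pseudoinstruction `li rd, value`, lowered to `addi rd, x0, value`."""
    return encode_addi(rd, 0, value)


if __name__ == "__main__":
    print(f"{encode_li(5, 123):08x}")
```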
 55 | 10. add arithmetic, like `add`, `sub`, `mul`, using the simple verilog operators `+` and `-`, `*`
 56 |     - write assembly to test this
 57 |     - run, and check output is ok
 58 | 11. build an assembly program to add the numbers 1 to 5
 59 |     - you'll need to add at least one branch instruction to do this
 60 |     - this means you'll need to add the ability to create labels to your assembler
 61 |         - eg something like `somelabel:` is a label called `somelabel`
 62 |     - for now, you can simply allow backwards jumps only, which simplifies your assembler, since you only need a single pass
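
A sketch of single-pass, backwards-only label handling in Python. It assumes every line assembles to exactly one 4-byte instruction (so `li` expands to a single `addi`), that a label sits on its own line, and that the last operand of a branch is the label. The function name `resolve_backward_labels` is made up; it only rewrites labels into byte offsets and leaves the actual encoding to the rest of your assembler.

```python
BRANCHES = {"beq", "bne", "blt", "bge", "bltu", "bgeu"}


def resolve_backward_labels(source: str) -> list[str]:
    """Replace branch labels with byte offsets relative to the branch."""
    labels = {}  # label name -> byte address where it was defined
    resolved = []
    pc = 0
    for raw in source.splitlines():
        line = raw.strip()
        if not line:
            continue
        if line.endswith(":"):
            labels[line[:-1]] = pc  # remember where this label points
            continue
        mnemonic, _, rest = line.partition(" ")
        operands = [op.strip() for op in rest.split(",")] if rest else []
        if mnemonic in BRANCHES:
            target = operands[-1]
            if target not in labels:
                # single pass: only labels we have already seen are known
                raise ValueError(f"forward or unknown label: {target}")
            operands[-1] = str(labels[target] - pc)
        resolved.append(" ".join([mnemonic, ", ".join(operands)]).strip())
        pc += 4
    return resolved


if __name__ == "__main__":
    program = """
    li x1, 0
    li x2, 5
    loop:
    add x1, x1, x2
    addi x2, x2, -1
    bne x2, x0, loop
    outr x1
    """
    print("\n".join(resolve_backward_labels(program)))
```

The example program adds 5 + 4 + 3 + 2 + 1 into `x1`, which is one way to do this exercise with only a backwards branch.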
 63 | 12. __propagation delay__ at some point you need to start synthesizing your design, and doing things like:
 64 |     - gate-level simulation
 65 |     - measuring propagation delay
 66 |     - measuring die area
 67 |     - let's start with measuring propagation delay
 68 |     - there are various repos around which say they can do this (e.g. [OpenTimer](https://github.com/OpenTimer/OpenTimer)), however I didn't have much/any success in using them, so I built my own script, [https://github.com/hughperkins/VeriGPU/blob/main/verigpu/timing.py](https://github.com/hughperkins/VeriGPU/blob/main/verigpu/timing.py)
 69 |     - you are free to measure propagation delay however you want, but I feel you do need to start measuring it :)
 70 |     - obviously, my own opinion is that using my own script is the easiest way, but your mileage may vary :)
 71 |     - the way my script works is:
 72 |          - first use yosys to synthesize your verilog down to gate-level cells, using the OSU018 tech, [https://github.com/hughperkins/VeriGPU/tree/main/tech/osu018](https://github.com/hughperkins/VeriGPU/tree/main/tech/osu018)
 73 |          - assign a relative propagation delay, relative to that of a single NAND gate, to each node in the tree, based loosely on the propagation delays in [https://web.engr.oregonstate.edu/~traylor/ece474/reading/SAED_Cell_Lib_Rev1_4_20_1.pdf](https://web.engr.oregonstate.edu/~traylor/ece474/reading/SAED_Cell_Lib_Rev1_4_20_1.pdf)
 74 |          - walk the graph, finding the longest path between flip-flops, outputs and inputs, as a sum of the propagation delays of the walked nodes
 75 |     - in any case, measure somehow the propagation delay of your `proc.sv` module
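
To make the graph-walk idea concrete, here is a toy Python sketch. It is not a copy of `timing.py`: the netlist, the cell names and the delay numbers are all invented for illustration, and it assumes the combinational logic between flip-flops forms an acyclic graph.

```python
from functools import lru_cache

# Hypothetical netlist: cell name -> (relative delay, names of cells driving it).
# Delays are made-up numbers, relative to a single NAND gate.
NETLIST = {
    "in_a": (0.0, []),
    "in_b": (0.0, []),
    "nand1": (1.0, ["in_a", "in_b"]),
    "xor1": (2.0, ["nand1", "in_b"]),
    "nor1": (1.5, ["in_a"]),
    "out": (0.0, ["xor1", "nor1"]),
}


def longest_delay(netlist: dict, end: str) -> float:
    """Return the largest summed delay over any path that ends at `end`."""

    @lru_cache(maxsize=None)
    def arrival(node: str) -> float:
        delay, drivers = netlist[node]
        # a node is ready once its slowest driver is ready, plus its own delay
        return delay + max((arrival(d) for d in drivers), default=0.0)

    return arrival(end)


if __name__ == "__main__":
    print(longest_delay(NETLIST, "out"))
```

For a real design you would build the dictionary from the yosys output and run this for every flip-flop input and module output, keeping the maximum. The recursion is fine for a toy graph; a deep netlist may need an iterative walk instead.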
 76 | 13. __div__ create a division module, that will divide two integers, returning the result and the remainder
 77 |    - note that using the verilog `/` operator will result in the division running in a single cycle
 78 |    - this is *very* slow: high propagation delay. you can measure using whatever approach you settled on in 12
 79 |    - so you will likely want to make the division run over multiple cycles somehow (e.g. 32 cycles, one for each bit, for example; whatever it takes to keep the propagation delay short)
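
One common multi-cycle approach is shift-and-subtract (restoring) division, which produces one quotient bit per step. Below is a Python reference model of that idea, where each loop iteration stands in for one clock cycle. This is only a sketch to check your hardware against; the name `divide_bitwise` is hypothetical. It raises on a zero divisor, whereas the RISC-V spec defines results for division by zero, so the hardware version would need to handle that case itself.

```python
def divide_bitwise(dividend: int, divisor: int, width: int = 32) -> tuple[int, int]:
    """Unsigned restoring division; returns (quotient, remainder)."""
    if divisor == 0:
        raise ZeroDivisionError("divisor is zero")
    limit = 1 << width
    if not (0 <= dividend < limit and 0 < divisor < limit):
        raise ValueError("operands must be unsigned and fit in `width` bits")
    quotient = 0
    remainder = 0
    for bit in range(width - 1, -1, -1):  # one iteration ~ one clock cycle
        # shift the next dividend bit into the running remainder
        remainder = (remainder << 1) | ((dividend >> bit) & 1)
        if remainder >= divisor:
            remainder -= divisor
            quotient |= 1 << bit
    return quotient, remainder


if __name__ == "__main__":
    print(divide_bitwise(100, 7))
```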
 80 | 14. __divu__, __remu__ : use the module from the previous step to implement `divu` and `remu` instructions in `proc.sv`
 81 | 15. create an assembler program that outputs the first few prime numbers
 82 |     - you can do this two ways: sieve of Eratosthenes, or iterating over each integer, and checking for factors
 83 |     - the first way doesn't need either `divu` or `remu`, so try the second way, even if you do the sieve of Eratosthenes too
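
Before writing this in assembler, it can help to have a host-side reference to compare the output against. A minimal Python sketch of the second way (checking each integer for factors), written with plain loops so it maps fairly directly onto branches and `remu`. The function name `first_primes` is made up.

```python
def first_primes(count: int) -> list[int]:
    """Return the first `count` primes using trial division."""
    primes = []
    candidate = 2
    while len(primes) < count:
        divisor = 2
        is_prime = True
        while divisor * divisor <= candidate:
            if candidate % divisor == 0:  # this is what `remu` would compute
                is_prime = False
                break
            divisor += 1
        if is_prime:
            primes.append(candidate)
        candidate += 1
    return primes


if __name__ == "__main__":
    print(first_primes(10))
```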
 84 | 16. add float support. I used `zfinx`, i.e. using same registers for both floats and integers, but you could use the more standard `F` extension of RISC-V
 85 |     - ensure that at least the following work for now:
 86 |          - `li x1, 1.23`
 87 |          - `outr.s x1`  output x1 as a float
 88 |          - load and store for floats (either using `lw` and `sw` if using zfinx, or using `flw` and `fsw` if using `F`)
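
For `li x1, 1.23`, your assembler needs the 32-bit IEEE 754 single-precision bit pattern of the constant. A small Python sketch using the standard `struct` module; the helper names `float_to_bits` and `bits_to_float` are hypothetical. Note that a full 32-bit pattern does not fit into a single `addi`, so the assembler would have to expand this `li` into more than one instruction (for example `lui` plus `addi`) or load the constant from memory.

```python
import struct


def float_to_bits(value: float) -> int:
    """Return the 32-bit single-precision pattern for `value`."""
    return struct.unpack("<I", struct.pack("<f", value))[0]


def bits_to_float(bits: int) -> float:
    """Inverse of float_to_bits, handy for checking `outr.s` output."""
    return struct.unpack("<f", struct.pack("<I", bits))[0]


if __name__ == "__main__":
    bits = float_to_bits(1.23)
    print(f"{bits:08x}")
    print(bits_to_float(bits))
```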
 89 | 17. add addition for floats
 90 | 18. add subtraction for floats
 91 | 19. add multiplication for floats
 92 | 20. add division for floats
 93 | 21. write a matrix multiplication assembler program
 94 |     - this gets pretty fiddly
 95 | 22. since writing assembler for the matrix multiplication was getting fiddly, let's migrate to use a compiler to be able to convert e.g. C/C++ programs into assembler, which we can then assemble and run
 96 |     - you can use the `clang++` compiler provided with llvm
 97 |     - if you use the options `-S -target riscv32`, then `clang++` will convert your C++ into RISC-V assembler
 98 |     - a few complications:
 99 |         - what you'll need to write in the C or C++ is a function
100 |         - you'll need to add additional 'header' assembler in front of the resulting assembler to jump into this function
101 |         - then halt afterwards
102 |         - you'll therefore need to implement jump instructions (and possibly a `halt` instruction, if you don't already have one)
103 |         - you'll need to be able to handle forward jumps, so you'll need to make your assembler handle this somehow (e.g. using two passes)
104 |         - you'll need to load any parameters into the registers `a0`, `a1`, etc
105 |         - you'll need to make `a0`, `a1` etc be aliases to the appropriate risc-v `x` registers
106 |         - you'll also need to add in other aliases, such as `sp`
107 |         - you'll need to point `sp` towards the top of some unused memory before jumping into the function
108 |         - if you want to output anything, you'll need to make your code call a function `void out(int value);`
109 |             - and you'll need to implement this in assembler, as part of the 'header' assembler
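
For the register aliases, a lookup table in your assembler is probably enough. A Python sketch following the standard RISC-V ABI register names; the helper names `build_abi_aliases` and `canonical_register` are made up for this example.

```python
def build_abi_aliases() -> dict[str, str]:
    """Map RISC-V ABI register names onto x0..x31."""
    aliases = {
        "zero": "x0", "ra": "x1", "sp": "x2", "gp": "x3", "tp": "x4",
        "fp": "x8",  # fp is another name for s0
        "s0": "x8", "s1": "x9",
    }
    for i in range(3):       # t0..t2 -> x5..x7
        aliases[f"t{i}"] = f"x{5 + i}"
    for i in range(8):       # a0..a7 -> x10..x17
        aliases[f"a{i}"] = f"x{10 + i}"
    for i in range(2, 12):   # s2..s11 -> x18..x27
        aliases[f"s{i}"] = f"x{16 + i}"
    for i in range(3, 7):    # t3..t6 -> x28..x31
        aliases[f"t{i}"] = f"x{25 + i}"
    return aliases


ABI_ALIASES = build_abi_aliases()


def canonical_register(name: str) -> str:
    """Translate an ABI alias to its x-register; other names pass through."""
    return ABI_ALIASES.get(name, name)


if __name__ == "__main__":
    print(canonical_register("a0"), canonical_register("sp"))
```

Running every register operand through `canonical_register` before encoding means the rest of the assembler only ever sees `x` names.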
110 | 23. at this point, you could also consider migrating from using your own assembler, to using the clang/llvm assembler, e.g. `llvm-mc`, or `clang` itself with `-c`
111 | 24. once you've got to this point, you can already run fairly complex programs
112 | 25. your next steps will be things like:
113 |     - adding instruction caching, and memory caching in general
114 |     - adding parallel instruction execution
115 | 
116 | (Note: you can see my own processor at https://github.com/hughperkins/VeriGPU , which I basically wrote by doing approximately what I've outlined above :) (I did a few GPU-specific things too; but I'm making this a CPU tutorial, not a GPU tutorial; happy to extend this for GPU if interest))
117 | 


--------------------------------------------------------------------------------