├── .gitignore
├── lab2
    └── figs
    │   ├── dve.png
    │   ├── fir.png
    │   ├── q1_a.png
    │   ├── q1_b_1.png
    │   ├── q1_b_2.png
    │   ├── q1_c.png
    │   ├── new_wave.png
    │   ├── vlsi_flow.png
    │   └── display_wave.png
├── lab6
    ├── figs
    │   ├── drc.png
    │   ├── dp_div.png
    │   ├── layout.png
    │   └── lvs_smiley.png
    └── spec.md
├── project
    ├── figs
    │   ├── csrw.png
    │   ├── RV32I_Base_Instruction_Set.pdf
    │   └── RV32I_Base_Instruction_Set.png
    ├── README.md
    ├── final.md
    ├── checkpoint2.md
    ├── checkpoint3.md
    ├── checkpoint4.md
    ├── overview.md
    └── checkpoint1.md
├── lab1
    ├── figs
    │   └── x2gomacos.png
    └── spec.md
├── lab4
    ├── figs
    │   ├── view_icons.png
    │   ├── timing_debug.png
    │   ├── clock_tree_nets.png
    │   ├── gcd_coprocessor.pdf
    │   ├── gcd_coprocessor.png
    │   ├── innovus_window.png
    │   ├── clock_tree_debugger.png
    │   ├── sky130
    │   │   ├── innovus_window.png
    │   │   ├── timing_debug.png
    │   │   ├── clock_tree_nets.png
    │   │   ├── clock_tree_debugger.png
    │   │   └── critical_path_highlight.png
    │   └── critical_path_highlight.png
    └── spec_sky130.md
├── lab3
    ├── figs
    │   ├── block-diagram.pdf
    │   └── block-diagram.png
    ├── spec.md
    └── spec_sky130.md
├── README.md
└── lab5
    ├── spec_sky130.md
    └── spec.md


/.gitignore:
--------------------------------------------------------------------------------
1 | *.DS_Store


--------------------------------------------------------------------------------
/lab2/figs/dve.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/EECS150/asic_labs_sp22/HEAD/lab2/figs/dve.png


--------------------------------------------------------------------------------
/lab2/figs/fir.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/EECS150/asic_labs_sp22/HEAD/lab2/figs/fir.png


--------------------------------------------------------------------------------
/lab6/figs/drc.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/EECS150/asic_labs_sp22/HEAD/lab6/figs/drc.png


--------------------------------------------------------------------------------
/lab2/figs/q1_a.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/EECS150/asic_labs_sp22/HEAD/lab2/figs/q1_a.png


--------------------------------------------------------------------------------
/lab2/figs/q1_b_1.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/EECS150/asic_labs_sp22/HEAD/lab2/figs/q1_b_1.png


--------------------------------------------------------------------------------
/lab2/figs/q1_b_2.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/EECS150/asic_labs_sp22/HEAD/lab2/figs/q1_b_2.png


--------------------------------------------------------------------------------
/lab2/figs/q1_c.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/EECS150/asic_labs_sp22/HEAD/lab2/figs/q1_c.png


--------------------------------------------------------------------------------
/lab6/figs/dp_div.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/EECS150/asic_labs_sp22/HEAD/lab6/figs/dp_div.png


--------------------------------------------------------------------------------
/lab6/figs/layout.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/EECS150/asic_labs_sp22/HEAD/lab6/figs/layout.png


--------------------------------------------------------------------------------
/lab2/figs/new_wave.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/EECS150/asic_labs_sp22/HEAD/lab2/figs/new_wave.png


--------------------------------------------------------------------------------
/project/figs/csrw.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/EECS150/asic_labs_sp22/HEAD/project/figs/csrw.png


--------------------------------------------------------------------------------
/lab1/figs/x2gomacos.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/EECS150/asic_labs_sp22/HEAD/lab1/figs/x2gomacos.png


--------------------------------------------------------------------------------
/lab2/figs/vlsi_flow.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/EECS150/asic_labs_sp22/HEAD/lab2/figs/vlsi_flow.png


--------------------------------------------------------------------------------
/lab4/figs/view_icons.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/EECS150/asic_labs_sp22/HEAD/lab4/figs/view_icons.png


--------------------------------------------------------------------------------
/lab6/figs/lvs_smiley.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/EECS150/asic_labs_sp22/HEAD/lab6/figs/lvs_smiley.png


--------------------------------------------------------------------------------
/lab2/figs/display_wave.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/EECS150/asic_labs_sp22/HEAD/lab2/figs/display_wave.png


--------------------------------------------------------------------------------
/lab3/figs/block-diagram.pdf:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/EECS150/asic_labs_sp22/HEAD/lab3/figs/block-diagram.pdf


--------------------------------------------------------------------------------
/lab3/figs/block-diagram.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/EECS150/asic_labs_sp22/HEAD/lab3/figs/block-diagram.png


--------------------------------------------------------------------------------
/lab4/figs/timing_debug.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/EECS150/asic_labs_sp22/HEAD/lab4/figs/timing_debug.png


--------------------------------------------------------------------------------
/lab4/figs/clock_tree_nets.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/EECS150/asic_labs_sp22/HEAD/lab4/figs/clock_tree_nets.png


--------------------------------------------------------------------------------
/lab4/figs/gcd_coprocessor.pdf:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/EECS150/asic_labs_sp22/HEAD/lab4/figs/gcd_coprocessor.pdf


--------------------------------------------------------------------------------
/lab4/figs/gcd_coprocessor.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/EECS150/asic_labs_sp22/HEAD/lab4/figs/gcd_coprocessor.png


--------------------------------------------------------------------------------
/lab4/figs/innovus_window.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/EECS150/asic_labs_sp22/HEAD/lab4/figs/innovus_window.png


--------------------------------------------------------------------------------
/lab4/figs/clock_tree_debugger.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/EECS150/asic_labs_sp22/HEAD/lab4/figs/clock_tree_debugger.png


--------------------------------------------------------------------------------
/lab4/figs/sky130/innovus_window.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/EECS150/asic_labs_sp22/HEAD/lab4/figs/sky130/innovus_window.png


--------------------------------------------------------------------------------
/lab4/figs/sky130/timing_debug.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/EECS150/asic_labs_sp22/HEAD/lab4/figs/sky130/timing_debug.png


--------------------------------------------------------------------------------
/lab4/figs/critical_path_highlight.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/EECS150/asic_labs_sp22/HEAD/lab4/figs/critical_path_highlight.png


--------------------------------------------------------------------------------
/lab4/figs/sky130/clock_tree_nets.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/EECS150/asic_labs_sp22/HEAD/lab4/figs/sky130/clock_tree_nets.png


--------------------------------------------------------------------------------
/lab4/figs/sky130/clock_tree_debugger.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/EECS150/asic_labs_sp22/HEAD/lab4/figs/sky130/clock_tree_debugger.png


--------------------------------------------------------------------------------
/lab4/figs/sky130/critical_path_highlight.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/EECS150/asic_labs_sp22/HEAD/lab4/figs/sky130/critical_path_highlight.png


--------------------------------------------------------------------------------
/project/figs/RV32I_Base_Instruction_Set.pdf:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/EECS150/asic_labs_sp22/HEAD/project/figs/RV32I_Base_Instruction_Set.pdf


--------------------------------------------------------------------------------
/project/figs/RV32I_Base_Instruction_Set.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/EECS150/asic_labs_sp22/HEAD/project/figs/RV32I_Base_Instruction_Set.png


--------------------------------------------------------------------------------
/project/README.md:
--------------------------------------------------------------------------------
 1 | # EECS 151/251A ASIC Project Specification: RISC-V Processor Design
 2 | 
 3 | 
 4 | - [Project Overview](overview.md) : Introduction, Project setup and Grading
 5 | - [Checkpoint 1](checkpoint1.md) :  ALU design and Pipeline diagram 
 6 |     - Apr 1 (Friday), 2022
 7 | - [Checkpoint 2](checkpoint2.md) : Fully functioning core
 8 |     - Apr 15 (Friday), 2022
 9 | - [Checkpoint 3](checkpoint3.md) : Cache
10 |     - Apr 22 (Friday), 2022
11 | - [Checkpoint 4](checkpoint4.md) : Synthesis, PAR & Power
12 |     - Apr 29 (Friday), 2022
13 | - [Final Deliverables](final.md): 
14 |     - Final Interview/Checkoff: May 6, 2022
15 |     - Report: May 9, 2022
16 | # Resources:
17 | [RISC-V Instruction Set Manual](https://riscv.org/technical/specifications/) (Volume 1, Unprivileged Spec)
18 | 


--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
 1 | # EECS 151/251A ASIC Labs Fall 21
 2 | 
 3 | This lab course consists of 6 labs and a final project. The labs go through the ASIC design flow, from RTL through GDS. 
 4 | These labs are now available in two process technologies, 
 5 | the [ASAP7 7nm Predictive PDK](http://asap.asu.edu/asap/) (a non-implementable finFET technology developed for educational purposes)
 6 | and the [Skywater 130nm PDK](https://skywater-pdk.readthedocs.io/en/latest/) (a real open-source 130nm CMOS process developed by Google and Skywater foundries).
 7 | 
 8 | ## ASAP7 Labs
 9 | - [Lab 1: Getting Around the Compute Environment](lab1/spec.md)
10 | - [Lab 2: Simulation](lab2/spec.md)
11 | - [Lab 3: Logic Synthesis](lab3/spec.md)
12 | - [Lab 4: Floorplanning, Placement, Power, and CTS](lab4/spec.md)
13 | - [Lab 5: Parallelization and Routing](lab5/spec.md)
14 | - [Lab 6: SRAM Integration, DRC, LVS](lab6/spec.md)
15 | 
16 | ## ASIC Final Project
17 | This project guides students through writing their own CPU core and cache, and pushing this design through the ASIC flow to achieve a physical design. 
18 | - [Project Overview](project/overview.md) : Introduction, Project setup and Grading
19 | - [Checkpoint 1](project/checkpoint1.md) :  ALU design and Pipeline diagram 
20 | - [Checkpoint 2](project/checkpoint2.md) : Fully functioning core
21 | - [Checkpoint 3](project/checkpoint3.md) : Cache
22 | - [Checkpoint 4](project/checkpoint4.md) : Synthesis, PAR & Power
23 | 
24 | ## Sky130 Labs
25 | Alternate versions of the ASAP7 labs above use the Skywater 130nm PDK instead. Lab 6 is omitted because (1) the Sky130 SRAMs are currently not mature enough to be used for educational purposes, and (2) for DRC/LVS, the Sky130 Calibre decks are still under NDA, and while the open-source decks are available (for use with Magic and Netgen), our ASIC design flow does not currently support these open-source EDA tools. To learn about SRAMs, DRC, and LVS, please follow the ASAP7 version of Lab 6 above.
26 | - [Lab 1: Getting Around the Compute Environment](lab1/spec.md)
27 | - [Lab 2: Simulation](lab2/spec_sky130.md)
28 | - [Lab 3: Logic Synthesis](lab3/spec_sky130.md)
29 | - [Lab 4: Floorplanning, Placement, Power, and CTS](lab4/spec_sky130.md)
30 | - [Lab 5: Parallelization and Routing](lab5/spec_sky130.md)
31 | 


--------------------------------------------------------------------------------
/project/final.md:
--------------------------------------------------------------------------------
 1 | # EECS 151/251A ASIC Project Specification: Final Deliverables
 2 | <p align="center">
 3 | Prof. Sophia Shao
 4 | </p>
 5 | <p align="center">
 6 | TAs (ASIC): Dima Nikiforov
 7 | </p>
 8 | <p align="center">
 9 | Department of Electrical Engineering and Computer Science
10 | </p>
11 | <p align="center">
12 | College of Engineering, University of California, Berkeley
13 | </p>
14 | 
15 | ---
16 | 
17 | ## Final Project Deliverables
18 | 
19 | By now you should have designed a fully-functional processor from scratch! Your design should pass all assembly tests at your reported maximum frequency. Your
 20 | design should also pass all of the benchmark tests at your reported maximum frequency, and you
21 | should report the cycle count for each of those tests. By the due date (Monday, May 9, 2022), each
22 | team needs to push their final commits to their team’s git repository. Only the final commit before the
23 | due date will be graded, so be very, very careful that you have submitted everything required. To be
24 | graded you must submit the following items:
25 | * `src/*.v`
26 | * `build/syn-rundir/reports/*`
27 | * `build/par-rundir/timingReports/*`
28 | * `build/par-rundir/innovus.log*`
29 | 
 30 | These files will be used to check processor functionality and will show us your critical path, maximum operating frequency, and area. During the interview session (Friday, May 6, 2022), the
31 | professors and the GSI will be interviewing each team to gauge understanding of various concepts
32 | learned in the project, understand more about each team’s design process, and provide feedback. Your
33 | final report needs to answer the following questions:
34 | 
35 | 1. Show the final pipeline diagram, and explain the functionality of different submodules in your design, how control signals are generated, memory structure, etc.
36 | 
37 | 2. What is the post-synthesis critical path length? What sections of the processor does the critical
38 | path pass through? Why is this the critical path?
39 | 
40 | 3. Show a screenshot of the final floorplan. Also include a screenshot of the clock tree debugger results.  Discuss your floorplanning strategy and the quality of your clock tree results.
41 | 
42 | 4. What is the post-place-and-route critical path length? What sections of the processor does the
43 | critical path pass through? Why is this the critical path? If it is different from the post-synthesis
44 | critical path, why?
45 | 
46 | 5. What is the area utilization of the final design? Also include the total core area you used in PnR and the density.
47 | 
48 | 6. What is the Innovus-estimated power consumption of the final design?
49 | 
50 | 7. What is the number of cycles that your design takes to run the benchmarks? What changes/optimizations
51 | have you done to try and optimize for these tests?
52 | 
53 | 8. What is the post-place-and-route runtime (in seconds) of each benchmark? 
54 |    *Use the number of cycles from RTL simulation, and minimum clock period to meet timing for place-and-route (design doesn't have to pass post-PAR simulations with this clock period).*
55 |    
56 | 9. If there are bugs in your design still, explain what is working and what isn't.  What was your debugging process?  Where are the bugs localized?
57 | 
58 | 10. Explain any other optimizations you made for your design.
59 | 
60 | 11. Is there anything you would like to tell the staff before we grade your project?
61 | 
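For question 8, the runtime is the RTL cycle count multiplied by the minimum post-PAR clock period. A minimal sketch of that arithmetic follows; the benchmark names, cycle counts, and clock period are placeholders chosen for illustration, not measured results.

```python
def runtime_seconds(cycles: int, clock_period_ns: float) -> float:
    """Runtime = cycle count x clock period, converted from ns to s."""
    return cycles * clock_period_ns * 1e-9


# Placeholder values for illustration only; substitute your own results.
example_cycles = {"bmark_a": 1_000_000, "bmark_b": 250_000}
example_period_ns = 2.0

for name, cycles in example_cycles.items():
    print(f"{name}: {runtime_seconds(cycles, example_period_ns):.6f} s")
```

Use the cycle count from RTL simulation and the smallest clock period that meets timing after place-and-route, as the question specifies.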
62 | 
63 | If you worked with a partner you do not need separate reports. If you are having issues with your
64 | partner please contact the GSI privately as soon as possible.
65 | 
66 | 
67 | 
68 | ## Acknowledgement
69 | 
70 | This project is the result of the work of many EECS151/251 GSIs over the years including:
71 | Written By:
72 | - Nathan Narevsky (2014, 2017)
73 | - Brian Zimmer (2014)
74 | Modified By:
75 | - John Wright (2015,2016)
76 | - Ali Moin (2018)
77 | - Arya Reais-Parsi (2019)
78 | - Cem Yalcin (2019)
79 | - Tan Nguyen (2020)
80 | - Harrison Liew (2020)
81 | - Sean Huang (2021)
82 | - Daniel Grubb, Nayiri Krzysztofowicz, Zhaokai Liu (2021)
83 | - Dima Nikiforov (2022)
84 | 


--------------------------------------------------------------------------------
/project/checkpoint2.md:
--------------------------------------------------------------------------------
  1 | # EECS 151/251A ASIC Project Specification: Checkpoint 2
  2 | <p align="center">
  3 | Prof. Sophia Shao
  4 | </p>
  5 | <p align="center">
  6 | TAs (ASIC): Dima Nikiforov
  7 | </p>
  8 | <p align="center">
  9 | Department of Electrical Engineering and Computer Science
 10 | </p>
 11 | <p align="center">
 12 | College of Engineering, University of California, Berkeley
 13 | </p>
 14 | 
 15 | ---
 16 | ## Fully functioning core
 17 | 
 18 | ### 1. Additional Instructions
 19 | #### 1.1 Control and Status Register (CSR)
 20 | To run the testbenches, you need to add a few new instructions that help with
 21 | debugging and creating testbenches. Read through Chapter 9 in the RISC-V specification. A CSR
 22 | (control and status register) is state that is stored independently of the register file and the memory.
 23 | While there are 2^12 possible CSR addresses, you will only use one of them (`tohost = 0x51E`). The
 24 | `tohost` register is monitored by the test harness, and simulation ends when a value is written to this
 25 | register. A value of 1 indicates success; a value greater than 1 gives clues as to the location of the failure.
 26 | There are 2 CSR-related instructions that you will need to implement:
 27 | 1. `csrw tohost,t2` (short for `csrrw x0,csr,rs1` where `csr = 0x51E`)
 28 | 2. `csrwi tohost,1` (short for `csrrwi x0,csr,zimm` where `csr = 0x51E`)
 29 | 
 30 | `csrw` writes the value held in register rs1 to the addressed CSR. `csrwi` writes the 5-bit immediate
 31 | (encoded in the rs1 field of the instruction, zero-extended) to the addressed CSR. Note that you do not need to write to rd (writing to x0 does nothing).
 32 | 
 33 | <p align="center">
 34 | <img src="./figs/csrw.png" width="800" />
 35 | </p>
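As a cross-check for your decoder, the two instructions can be assembled by hand from the standard RISC-V CSR instruction layout (`csr` in bits 31:20, `rs1`/`zimm` in 19:15, `funct3` in 14:12, `rd` in 11:7, opcode in 6:0). The sketch below is a small Python model, not part of the provided project files; the helper names `encode_csr` and `decode_csr` are made up for illustration.

```python
OPCODE_SYSTEM = 0b1110011  # major opcode shared by the CSR instructions
FUNCT3_CSRRW = 0b001
FUNCT3_CSRRWI = 0b101
CSR_TOHOST = 0x51E


def encode_csr(funct3, csr, rs1_or_zimm, rd=0):
    """Pack the fields of a CSR instruction into a 32-bit word."""
    return (csr << 20) | (rs1_or_zimm << 15) | (funct3 << 12) | (rd << 7) | OPCODE_SYSTEM


def decode_csr(inst):
    """Split a 32-bit instruction word into its CSR-format fields."""
    return {
        "opcode": inst & 0x7F,
        "rd": (inst >> 7) & 0x1F,
        "funct3": (inst >> 12) & 0x7,
        "rs1_or_zimm": (inst >> 15) & 0x1F,
        "csr": (inst >> 20) & 0xFFF,
    }


csrw_tohost_t2 = encode_csr(FUNCT3_CSRRW, CSR_TOHOST, 7)   # t2 is register x7
csrwi_tohost_1 = encode_csr(FUNCT3_CSRRWI, CSR_TOHOST, 1)  # immediate value 1
print(hex(csrw_tohost_t2))  # 0x51e39073
print(hex(csrwi_tohost_1))  # 0x51e0d073
```

The only difference your control logic needs to act on is `funct3`: for `csrrw` the rs1 field names a register to read, while for `csrrwi` the same five bits are the value itself.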
 36 | 
 37 | ### 2. Details
 38 | Your job is to implement the core of the 3-stage RISC-V CPU. 
 39 | 
 40 | #### 2.1 Reset
 41 | Your CPU will have an input reset signal that the testbench toggles. Once out of reset, 
 42 | your CPU should start at PC address `0x2000` (defined as `PC_RESET` in `src/const.vh`)
 43 | and begin executing instructions.
 44 | 
 45 | #### 2.2 Misaligned Addresses
 46 | According to the RISC-V ISA spec, reads and writes to memory addresses not aligned to a 32-bit word boundary (or 16-bit for halfword) should cause an exception. In this project, for the purpose of simplicity, we ignore the misaligned bits (i.e. set them to zero), which is done in `Memory151.v` by only using address bits `31:2`. 
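To illustrate what dropping the low address bits means, here is a minimal Python model of the behavior described above. It assumes 32-bit byte addresses; the function names are hypothetical and do not come from `Memory151.v`.

```python
def word_index(byte_addr: int) -> int:
    """Keep address bits 31:2, i.e. the index of the 32-bit word."""
    return (byte_addr & 0xFFFFFFFF) >> 2


def aligned_addr(byte_addr: int) -> int:
    """Byte address with the two misaligned bits forced to zero."""
    return byte_addr & 0xFFFFFFFC


# All four byte addresses inside one word map to the same word.
for addr in (0x2000, 0x2001, 0x2002, 0x2003):
    print(hex(addr), "->", hex(aligned_addr(addr)), "word", word_index(addr))
```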
 47 | 
 48 | 
 49 | ### 3. File Structure
 50 | Implement the datapath and control logic for your RISC-V processor in the file `Riscv151.v`. Make
 51 | sure that the inputs and outputs remain the same, since this module connects to the memory system
 52 | for system-level testing. If you look at `riscv_test_harness.v` you can see a testbench that
 53 | is provided. Target this testbench in your `sim-rtl.yml` file by changing the `tb_name` key to
 54 | `rocketTestHarness`.
 55 | 
 56 | ### 4. Running the Test
 57 | This testbench will load a program into the instruction memory, and will then run until the exit code
 58 | register has been set. 
 59 | There is also a timeout to make sure that the simulation does not run forever. 
 60 | You should only be running this test
 61 | suite after you have eliminated some of the bugs using single instruction tests, as described below.
 62 | 
 63 | ### 5. Running assembly tests
 64 | We have provided a suite of assembly tests to help you debug all of the instructions you need to implement.
 65 | To run all of them:
 66 | ```
 67 | make sim-rtl test_asm=all
 68 | ```
 69 | This will generate .out files in the `asm_output/` directory, and summarize which tests passed and
 70 | failed. You can also run a single asm test with the following command:
 71 | ```
 72 | make sim-rtl test_asm=simple.out
 73 | ```
 74 | If you would like to generate waveforms for a single test:
 75 | ```
 76 | make sim-rtl test_asm=simple.vpd
 77 | ```
 78 | `simple` may be replaced with any of the available tests defined in the `Makefile`.
 79 | 
 80 | You can read the assembly code of the programs by looking at the dump file. Comments in the code
 81 | will help you understand what is happening.
 82 | ```
 83 | cd tests/asm/
 84 | vim addi.dump
 85 | ```
 86 | Last, you can see the hex code that is loaded directly into the memory by looking at the hex file.
 87 | ```
 88 | cd tests/asm/
 89 | vim addi.hex
 90 | ```
 91 | 
 92 | 
 93 | ### 6. Checkpoint 2 Deliverables
 94 | *Checkoff due: Apr 15 (Friday), 2022*
 95 | 
 96 | Congratulations! You’ve started the design of your datapath by implementing your pipeline diagram, and written and thoroughly tested a key component in your processor and should now be well-versed in testing Verilog modules. Please answer the following questions to be checked off by a TA.
 97 | 
 98 | 1. Show that all of the assembly tests pass
 99 | 
100 | 2. Show your final pipeline diagram, updated to match the code.
101 | 
102 | 3. Push your implementation and updated pipeline diagram to your repo.
103 | 
104 | ---
105 | 
106 | 
107 | ## Acknowledgement
108 | 
109 | This project is the result of the work of many EECS151/251 GSIs over the years including:
110 | Written By:
111 | - Nathan Narevsky (2014, 2017)
112 | - Brian Zimmer (2014)
113 | Modified By:
114 | - John Wright (2015,2016)
115 | - Ali Moin (2018)
116 | - Arya Reais-Parsi (2019)
117 | - Cem Yalcin (2019)
118 | - Tan Nguyen (2020)
119 | - Harrison Liew (2020)
120 | - Sean Huang (2021)
121 | - Daniel Grubb, Nayiri Krzysztofowicz, Zhaokai Liu (2021)
122 | - Dima Nikiforov (2022)
123 | 


--------------------------------------------------------------------------------
/project/checkpoint3.md:
--------------------------------------------------------------------------------
  1 | # EECS 151/251A ASIC Project Specification: Checkpoint 3
  2 | <p align="center">
  3 | Prof. Sophia Shao
  4 | </p>
  5 | <p align="center">
  6 | TAs (ASIC): Dima Nikiforov
  7 | </p>
  8 | <p align="center">
  9 | Department of Electrical Engineering and Computer Science
 10 | </p>
 11 | <p align="center">
 12 | College of Engineering, University of California, Berkeley
 13 | </p>
 14 | 
 15 | ---
 16 | 
 17 | ## Cache
 18 | A processor operates on data in memory. Memory can hold billions of bits, which can either be instructions or data. In a VLSI design, it is a very bad idea to store this many bits close to the processor. The
 19 | chip area required would be huge - consider how many DRAM chips your PC has, and that DRAM cells
 20 | are much smaller than SRAM cells (which can actually be implemented in the same CMOS process).
 21 | Moreover, the entire processor would have to slow down to accommodate delays in the large memory
 22 | array. Instead, caches are used to create the illusion of a large memory with low latency.
 23 | 
 24 | Your task is to implement a (relatively) simple cache for your RISC-V processor, based on some
 25 | predefined SRAM macros (memory arrays) and the interface specified below.
 26 | 
 27 | ### 1 Cache overview
 28 | When you request data at a given address, the cache will see if it is stored locally. If it is (cache hit), it
 29 | is returned immediately. Otherwise if it is not found (cache miss), the cache fetches the bits from the
 30 | main memory.
 31 | Caches store data in “ways.” A way is a logical element which contains valid bits, tag bits, and data.
 32 | The simplest type of cache is direct-mapped (a 1-way cache). A cache stores data in larger units (lines)
 33 | than single words. In each way, a given address may only occupy a single location, determined by the
 34 | lowest bits of the cache line address. The remaining address bits are called the “tag” and are stored so
 35 | that we can check if a given cache line belongs to a given address. The valid bit indicates which lines
 36 | contain valid data.
 37 | Multi-way caches allow more flexibility in what data is stored in the cache, since there are multiple
 38 | locations for a line to occupy (the number of ways). For this reason, a “replacement policy” is needed.
 39 | This is used to decide which way’s data to evict when fetching new data. For this project you may use
 40 | any policy you wish, but pseudo-random is recommended.
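The tag/index/offset split described above can be sketched in Python. The parameters below assume a 32-bit address and the minimum configuration this checkpoint asks for (a 512-byte, direct-mapped cache with 512-bit, i.e. 64-byte, lines); the function name `split_address` is hypothetical.

```python
ADDR_BITS = 32     # assumed CPU byte-address width
CACHE_BYTES = 512  # minimum cache size for this checkpoint
LINE_BYTES = 64    # one 512-bit cache line
WAYS = 1           # direct-mapped

NUM_SETS = CACHE_BYTES // (LINE_BYTES * WAYS)
OFFSET_BITS = LINE_BYTES.bit_length() - 1  # log2 of the line size
INDEX_BITS = NUM_SETS.bit_length() - 1     # log2 of the number of sets
TAG_BITS = ADDR_BITS - INDEX_BITS - OFFSET_BITS


def split_address(addr):
    """Return (tag, index, offset) for a CPU byte address."""
    offset = addr & (LINE_BYTES - 1)
    index = (addr >> OFFSET_BITS) & (NUM_SETS - 1)
    tag = addr >> (OFFSET_BITS + INDEX_BITS)
    return tag, index, offset


print(NUM_SETS, OFFSET_BITS, INDEX_BITS, TAG_BITS)
print(split_address(0x2044))
```

Increasing `WAYS` while keeping `CACHE_BYTES` fixed reduces the number of sets, so index bits move into the tag.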
 41 | 
 42 | ### 2 Guidelines and requirements
 43 | You have been given the interface of a cache (`Cache.v`) and your next task is to implement the cache.
 44 | EECS151 students should build a direct-mapped cache, and EECS251 students are required to implement a cache that either:
 45 | 
 46 | 1. is configurable to be either direct-mapped or at least 2-way set associative; or
 47 | 
 48 | 2. is set-associative with configurable associativity.
 49 | 
 50 | You are welcome to implement a more performant cache if you desire.
 51 | Your cache should be at least 512 bytes; if you wish to increase the size, implement the 512-byte
 52 | cache first and upgrade later.
 53 | Use the SRAMs that are available in
 54 | 
 55 | `/home/ff/eecs151/hammer/src/hammer-vlsi/technology/asap7/sram_compiler/memories/behavioral/sram_behav_models.v`
 56 | 
 57 | for your data and tag arrays.
 58 | 
 59 | 
 60 | The pin descriptions for these SRAMs are as follows:
 61 | 
 62 | |             |                                                 |
 63 | |-------------|-------------------------------------------------|
 64 | | `A`         | Address                                         |
 65 | | `CE`        | clock edge                                      |
 66 | | `OEB`       | output enable bar (tie this to 0)               |
 67 | | `WEB`       | write enable bar (1 is a read, 0 is a write)    |
 68 | | `CSB`       | chip select bar (tie this to 0)                 |
 69 | | `BYTEMASK`  | write byte mask                                 |
 70 | | `I`         | write data                                      |
 71 | | `O`         | read data                                       |
 72 | 
 73 | You should use cache lines that are 512 bits (16 words) for this project. The memory interface is
 74 | 128 bits, meaning that you will require multiple (4) cycles to perform memory transactions.
 75 | Below find a description of each signal in `Cache.v`:
 76 | 
 77 | |                      |                                        |
 78 | |----------------------|----------------------------------------|
 79 | | `clk`                  | clock |
 80 | | `reset`                | reset |
 81 | | `cpu_req_valid`        | The CPU is requesting a memory transaction |
 82 | | `cpu_req_rdy`          | The cache is ready for a CPU memory transaction |
 83 | | `cpu_req_addr`         | The address of the CPU memory transaction |
 84 | | `cpu_req_data`         | The write data for a CPU memory write (ignored on reads) |
 85 | | `cpu_req_write`        | The 4-bit write mask for a CPU memory transaction (each bit corresponds to the byte address within the word). `4'b0000` indicates a read. |
 86 | | `cpu_resp_val`         | The cache has output valid data to the CPU after a memory read |
 87 | | `cpu_resp_data`        | The data requested by the CPU |
 88 | | `mem_req_val`          | The cache is requesting a memory transaction to main memory |
 89 | | `mem_req_rdy`          | Main memory is ready for the cache to provide a memory address |
 90 | | `mem_req_addr`         | The address of the main memory transaction from the cache. Note that this address is narrower than the CPU byte address since main memory has wider data. |
 91 | | `mem_req_rw`           | 1 if the main memory transaction is a write; 0 for a read. |
 92 | | `mem_req_data_valid`   | The cache is providing write data to main memory. |
 93 | | `mem_req_data_ready`   | Main memory is ready for the cache to provide write data. |
 94 | | `mem_req_data_bits`    | Data to write to main memory from the cache (128 bits/4 words). |
 95 | | `mem_req_data_mask`    | Byte-level write mask to main memory. May be `16'hFFFF` for a full write. |
 96 | | `mem_resp_val`         | The main memory response data is valid. |
 97 | | `mem_resp_data`        | Main memory response data to the cache (128 bits/4 words). |
 98 | 
 99 | 
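Because a 512-bit line moves over a 128-bit interface, each refill or write-back is four memory transfers. The Python sketch below shows one way to derive the four main-memory addresses for the line containing a CPU byte address. It assumes that `mem_req_addr` counts 128-bit (16-byte) memory words, which is one reading of "narrower than the CPU byte address"; check the provided memory model before relying on it. The function name is hypothetical.

```python
LINE_BYTES = 64      # one 512-bit cache line
MEM_WORD_BYTES = 16  # one 128-bit memory transfer (assumed addressing unit)
BEATS_PER_LINE = LINE_BYTES // MEM_WORD_BYTES


def line_beat_addresses(cpu_byte_addr):
    """Memory-word addresses of the beats that make up one cache line."""
    line_base = cpu_byte_addr & ~(LINE_BYTES - 1)  # align down to the line
    first = line_base // MEM_WORD_BYTES            # drop the 4 byte-offset bits
    return [first + beat for beat in range(BEATS_PER_LINE)]


print(BEATS_PER_LINE)
print([hex(a) for a in line_beat_addresses(0x2044)])
```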
100 | To design your cache, start by outlining where the SRAMs should go. You should include an SRAM
101 | per way for data, and a separate SRAM per way for the tags. Depending on your implementation, you
102 | may want to implement the valid bits in flip flops or as part of the tag SRAM.
103 | 
104 | Next you should develop a state machine that covers all the events that your cache needs to handle
105 | for both hits and misses. You can do it without an explicit state machine, but you will suffer. Keep in
106 | mind you will need to write any valid data back to main memory before you start refilling the cache (you
107 | can use a write-back or a write-through policy). Both of these transactions will take multiple cycles.
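One possible shape for such a state machine, modeled in Python rather than Verilog, is sketched below. It assumes a write-back policy and three states; the state names, the `next_state` function, and its arguments are all hypothetical, and the real ready/valid handshakes are collapsed into a single `mem_ready` flag.

```python
from enum import Enum, auto

BEATS_PER_LINE = 4  # 512-bit line / 128-bit memory interface


class State(Enum):
    IDLE = auto()       # serving hits, waiting for a request
    WRITEBACK = auto()  # sending the dirty victim line to main memory
    REFILL = auto()     # receiving the requested line from main memory


def next_state(state, beat, *, req_valid=False, hit=False,
               victim_dirty=False, mem_ready=True):
    """Return (next_state, next_beat) for one clock cycle."""
    if state is State.IDLE:
        if req_valid and not hit:
            return (State.WRITEBACK if victim_dirty else State.REFILL), 0
        return State.IDLE, 0
    if not mem_ready:
        return state, beat      # stall until memory handles the current beat
    if beat < BEATS_PER_LINE - 1:
        return state, beat + 1  # more beats left in this line
    if state is State.WRITEBACK:
        return State.REFILL, 0  # victim written, now fetch the new line
    return State.IDLE, 0        # refill done, the request can be served


# Walk through one miss that evicts a dirty line.
state, beat = next_state(State.IDLE, 0, req_valid=True, hit=False, victim_dirty=True)
trace = [state.name]
while state is not State.IDLE:
    state, beat = next_state(state, beat)
    trace.append(state.name)
print(" -> ".join(trace))
```

A write-through design would not need the `WRITEBACK` state for evictions, but every store would then involve main memory; that trade-off is yours to make.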
108 | 
109 | ### 3 Changes to the flow for this checkpoint
110 | You should now be able to pass the `bmark` test. The test suite includes many C programs that do
111 | various things to test your processor and cache implementation. You can observe the number of cycles
112 | that each bmark test takes to run by opening `bmark_output/*.out` and taking note of the number
113 | on the last line. The `make sim-rtl test_bmark=all` target will also print this number for you.
114 | To run a specific benchmark (e.g., cachetest), run
115 | ```
116 | make sim-rtl test_bmark=cachetest.out
117 | ```
118 | After completing your cache, run the tests with both the cache included and with the fake memory
119 | (`no_cache_mem`) included. To use `no_cache_mem`, be sure to have `+define+no_cache_mem` in the
120 | `simOptions` variable in the `sim-rtl.yml` file. To use your cache, comment out `+define+no_cache_mem`.
121 | Take note of the cycle counts for both; you should see the cycle counts increase when you use the cache.
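As a minimal sketch of collecting the cycle counts described above, the following shell function prints the last line of every `.out` file in a directory. The function name `print_cycle_counts` is hypothetical and not part of the provided flow; it assumes only that each `bmark_output/*.out` file ends with its cycle count on the last line, as stated above.

```shell
# Hypothetical helper (not part of the provided Makefile flow).
# Prints "<file name>: <last line>" for every .out file in the given directory.
print_cycle_counts() {
    dir="${1:-bmark_output}"
    for f in "$dir"/*.out; do
        [ -e "$f" ] || continue   # skip when no .out files exist
        printf '%s: %s\n' "$(basename "$f")" "$(tail -n 1 "$f")"
    done
}
```

Running `print_cycle_counts bmark_output` once with the cache and once with `no_cache_mem` gives two lists that can be compared side by side.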
122 | 
123 | ### 4 Checkpoint 3 Deliverables
124 | *Checkoff due: Apr 29 (Friday), 2022*
125 | 
126 | 1. Show that all of the assembly tests and final tests pass using the cache
127 | 
128 | 2. Show the block diagram of your cache
129 | 
130 | 3. What was the difference in the cycle count for the `bmark` test with the perfect memory and the
131 | cache?
132 | 
133 | 4. Show your final pipeline diagram, updated to match the code
134 | 
135 | ---
136 | 
137 | 
138 | ## Acknowledgement
139 | 
140 | This project is the result of the work of many EECS151/251 GSIs over the years including:
141 | Written By:
142 | - Nathan Narevsky (2014, 2017)
143 | - Brian Zimmer (2014)
144 | Modified By:
145 | - John Wright (2015,2016)
146 | - Ali Moin (2018)
147 | - Arya Reais-Parsi (2019)
148 | - Cem Yalcin (2019)
149 | - Tan Nguyen (2020)
150 | - Harrison Liew (2020)
151 | - Sean Huang (2021)
152 | - Daniel Grubb, Nayiri Krzysztofowicz, Zhaokai Liu (2021)
153 | - Dima Nikiforov (2022)
154 | 


--------------------------------------------------------------------------------
/lab5/spec_sky130.md:
--------------------------------------------------------------------------------
  1 | # EECS 151/251A ASIC Lab 5: Parallelization and Routing
  2 | 
  3 | <p align="center">
  4 | Prof. Bora Nikolic
  5 | </p>
  6 | <p align="center">
  7 | TAs: Daniel Grubb, Nayiri Krzysztofowicz, Zhaokai Liu
  8 | </p>
  9 | <p align="center">
 10 | Department of Electrical Engineering and Computer Science
 11 | </p>
 12 | <p align="center">
 13 | College of Engineering, University of California, Berkeley
 14 | </p>
 15 | 
 16 | ## Overview
 17 | 
 18 | Like last week, this lab has two parts. For the first part, we will continue to develop our GCD
 19 | coprocessor by improving its performance. After that, we will continue the physical design flow by
 20 | performing routing.
 21 | 
 22 | To begin this lab, get the project files and set up your environment by typing the following commands:
 23 | 
 24 | ```shell
 25 | git clone /home/ff/eecs151/fa21/sky130/lab5-sky130.git
 26 | cd lab5-sky130
 27 | ```
 28 | 
 29 | ## Design
 30 | 
 31 | One way we can improve the performance of our GCD coprocessor is by parallelizing the compute.
 32 | We can do this by including multiple GCD units in our design, and routing traffic to them as they
 33 | become available.
 34 | 
 35 | You will find that the solution to last week’s lab (`fifo.v` and `gcd_coprocessor.v`) is included. The
 36 | test has been modified to check the total number of cycles taken by the coprocessor to complete the
 37 | tests. Run `make sim-rtl` to run the new testbench on the solution code. Take note of the number
 38 | of cycles that the tests take without modification, as you will need it to calculate your speedup.
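The speedup calculation can be sketched as a small shell function. This is only an illustration: the name `percent_speedup` is hypothetical, the example numbers are made up, and it assumes speedup is defined as the ratio of the baseline cycle count to the new cycle count, expressed as a percentage above 1x. If your section uses a different definition (for example, percentage reduction in cycles), adjust the formula accordingly.

```shell
# Hypothetical helper: percent speedup = (baseline / new - 1) * 100.
# $1 = baseline cycle count, $2 = cycle count of the modified design.
percent_speedup() {
    awk -v b="$1" -v n="$2" 'BEGIN { printf "%.1f\n", (b / n - 1) * 100 }'
}

# Example with made-up cycle counts:
percent_speedup 300 200
```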
 39 | 
 40 | Your task is to edit `gcd_coprocessor.v` to bring the cycle count below 225 cycles. We will do
 41 | this by using two instances of GCD.
 42 | 
 43 | You will find RTL that connects the datapath and controller into one module in `gcd_unit.v`. You
 44 | may find this useful when refactoring the `gcd_coprocessor`, since you will need fewer wires to place
 45 | both GCD instances.
 46 | 
 47 | You will also find stub code for an arbiter, which you should complete. We will use the arbiter
 48 | to route traffic to GCD units and preserve the response ordering. Most of your design can be
 49 | implemented with combinational logic, but you will need some state to remember which GCD
 50 | block contains the earliest data to preserve ordering.
 51 | 
 52 | ---
 53 | ## Question 1: Design
 54 | 
 55 | **a.) Submit your code (`gcd_coprocessor.v` and `gcd_arbiter.v`) with your lab assignment.**
 56 | 
 57 | **b.) How many cycles did your simulation take? What was the % speedup?**
 58 | 
 59 | ---
 60 | 
 61 | ## Automated Flow
 62 | 
 63 | In the last lab, we only focused on the PAR flow through CTS. In this lab, we will go through the full flow.
 64 | Routing is the next major flow step. Prior to the actual routing step, Innovus uses a
 65 | basic routing engine with errors and shorts, but ignores these errors and simply tries to
 66 | get an estimate of delays and parasitics. Once post-CTS
 67 | optimization is done, it switches to a different tool that actually legalizes routing and tries to eliminate
 68 | shorts while meeting timing. Routing is one of the most
 69 | computationally heavy tasks of digital IC design and can take days to complete for complicated designs. 
 70 | This will be reflected in the runtime in this lab.
 71 | 
 72 | After routing is complete, a post-Route optimization is run to ensure no timing violations
 73 | remain. Post-Route optimization typically has little freedom to move cells around, and it tries to
 74 | meet the timing constraints mostly by tweaking the length of the routes. You may see some DRC
 75 | (Design Rule Check) errors caused by the sky130 technology library, after routing.
 76 | 
 77 | First, synthesize the design:
 78 | 
 79 | ```shell
 80 | make syn
 81 | ```
 82 | 
 83 | Then, simulate the synthesized design to make sure it still works:
 84 | 
 85 | ```shell
 86 | make sim-gl-syn
 87 | ```
 88 | 
 89 | Once your synthesized design passes the test, you can start the PAR flow:
 90 | 
 91 | ```shell
 92 | make par
 93 | ```
 94 | 
 95 | The PAR command will take a long time to complete, as it runs through all stages of PAR. 
 96 | Check out the iterations that Innovus runs through during optimization.  You can see some of the metrics that Innovus is using.
 97 | Once it completes, take a look at the build directory as in the previous labs. You might see additional files
 98 | compared to the `syn-rundir`; that’s because the PAR flow incorporates the RC and parasitic delays, in addition to the cell delays. Open `build/par-rundir/gcd_coprocessor.setup.par.spef`
 99 | and search for the first occurrence of `D_NET`. What does it say about the first net? You may find
100 | [this wiki page](https://en.wikipedia.org/wiki/Standard_Parasitic_Exchange_Format#Parasitics) helpful. *(thought experiment #1 : get a sense of the units at the top and orders of magnitude of the RC parasitics in the SPEF file. If we used a 5nm technology library, do you expect the resistance to generally increase or decrease? How about the capacitance?)*
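A minimal sketch of the SPEF inspection described above, assuming GNU or BSD `grep` (for the `-m` option). The function name `spef_overview` is hypothetical; the `*T_UNIT`, `*C_UNIT`, `*R_UNIT`, `*L_UNIT` and `*D_NET` keywords are standard SPEF keywords.

```shell
# Hypothetical helper: show the unit declarations and the first net of a SPEF file.
# $1 = path to the .spef file.
spef_overview() {
    spef="$1"
    # Unit declarations (time, capacitance, resistance, inductance) from the header.
    grep -E '^\*[TCRL]_UNIT' "$spef"
    # First detailed-net entry, prefixed with its line number in the file.
    grep -n -m 1 '^\*D_NET' "$spef"
}
```

For example, `spef_overview build/par-rundir/gcd_coprocessor.setup.par.spef` prints the units first, so the total capacitance on the `*D_NET` line can be read in the right order of magnitude.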
101 | 
102 | ---
103 | ## Question 2: Automated Flow
104 | 
105 | a.) Check the post-Synthesis timing report
106 | (`syn-rundir/reports/final_time_ss_100C_1v60.setup_view.rpt`) and post-PAR timing report (`par-rundir/timingReports/gcd_coprocessor_postRoute_all.tarpt`). 
107 | **What are the critical paths of your post-PAR and post-Synthesis designs?**
108 | **Are they the same path?**
109 | **How does this critical path compare to your single-unit critical path?**
110 | 
111 | b.) Iterate on your design by modifying `design.yml` to find a rough estimate (no need to be too
112 | precise) of the shortest clock period you can use before you start running into setup violations. 
113 | **Given the number of cycles it takes to complete the testbench, what is the shortest time in which your design can finish the computation?**
114 | 
115 | c.) Open the post-CTS timing report (`par-rundir/hammer_cts_debug/hammer_cts_all.tarpt`) and the post-PAR 
116 | timing report (`par-rundir/timingReports/gcd_coprocessor_postRoute_all.tarpt`). 
117 | **Find a common path (same start and end sequential elements). What differences do you notice within the paths?**
118 | 
119 | ---
120 | 
121 | ## Innovus Commands
122 | 
123 | As in the previous lab, we will look at the contents of `par.tcl` that Hammer generates and follow
124 | along using Innovus. 
125 | *(thought experiment #2 : open the `par.tcl` and search for the command `set_db add_fillers_cells`. Based on the names of the cells specified by this command, what do you think is the function of the filler cells?)*
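One way to locate the command mentioned in the thought experiment is a plain `grep`. This is a sketch only: the function name `show_filler_cells` is hypothetical, and the path to `par.tcl` is passed in as an argument because its exact location depends on your build directory.

```shell
# Hypothetical helper: print every line of a Tcl script that mentions
# add_fillers_cells, prefixed with its line number.
# $1 = path to par.tcl (defaults to par.tcl in the current directory).
show_filler_cells() {
    grep -n 'add_fillers_cells' "${1:-par.tcl}"
}
```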
126 | 
127 | Navigate to the directory `build/par-rundir` and type:
128 | 
129 | ```shell
130 | innovus -common_ui
131 | ```
132 | 
133 | This will open the Innovus shell. Next, type `read_db gcd_coprocessor_FINAL` to load the current design
134 | database from the latest PAR flow. This will help us to avoid re-running the entire flow. To see
135 | all the reporting commands, type `help report*` in the Innovus shell and read through the options
136 | available to you.
137 | 
138 | ---
139 | ## Question 3: Innovus Reports
140 | 
141 | **a.) What is the area consumed by your design?**
142 | **What percentage of the total area does the arbiter occupy?**
143 | 
144 | **b.) Submit a screenshot of your setup slack histogram.**
145 | **Compared with the histogram you obtained in Lab 4, does your new slack distribution support the observed performance improvements you obtained in your coprocessor?**
146 | 
147 | ---
148 | 
149 | After you are done with the flow, it is time to simulate our newly printed post-PAR netlist. Type
150 | the following command:
151 | 
152 | ```shell
153 | make sim-gl-par
154 | ```
155 | 
156 | This will use the same testbench, but will now use the post-PAR netlist of your design, backannotated with delays and parasitics from PAR. Make sure to adjust the `CLOCK_PERIOD` variable in `sim-gl-par.yml` to match the clock period you obtained from PAR. Note, however, that the exact
157 | clock period may not work and you may need to relax it slightly.
158 | 
159 | After running `make sim-gl-par` you can run power analysis using:
160 | 
161 | ```shell
162 | make power-par
163 | ```
164 | 
165 | Navigate to `power-rundir/activePowerReports` and open `ss_100C_1v60.setup_view.rpt`. Do
166 | the power estimation numbers match your expectation?
167 | 
168 | ---
169 | ## Question 4: Trade-offs
170 | 
171 | **a.)** Re-run the flow using your old design. 
172 | To prevent your `build` directory from being overwritten, set the `OBJ_DIR` Make variable to a different name (e.g. `make par OBJ_DIR=build2`).
173 | Using the area and power values from Innovus, 
174 | **how does the performance improvement of the dual-unit design compare with its increase in area and power consumption relative to your old design?**
175 | 
176 | **b.)** Modify your `gcd_coprocessor.v` to take an input parameter in terms of number of clock cycles we
177 | want our design to meet (`parameter TARGET_NUMBER_OF_CYCLES`) for this given testbench. Your
178 | code should generate a low area, low power design if the number is greater than what your simple
179 | GCD coprocessor can achieve, and it should generate the dual-unit design if it is lower. 
180 | **Submit your code.**
181 | 
182 | c.) (Optional) Using a rough estimate of target number of cycles versus number of units in the design,
183 | write code that will generate 1-8 units depending on the performance demand. Do NOT do this
184 | by writing out every possible case explicitly. You can limit the number of units to powers of two
185 | (1,2,4,8) if it makes your life easier.
186 | 
187 | ---
188 | 
189 | ## Lab Deliverables
190 | 
191 | ### Lab Due 11:59 PM, Friday Oct 15th, 2021
192 | 
193 | - Submit a written report with all 4 questions answered to Gradescope
194 | - Checkoff with an ASIC lab TA
195 | 
196 | ## Acknowledgements
197 | 
198 | This lab is the result of the work of many EECS151/251 GSIs over the years including:
199 | 
200 | - Nathan Narevsky (2014, 2017)
201 | - Brian Zimmer (2014)
202 | - Cem Yalcin (2019)
203 | 
204 | Modified By:
205 | - John Wright (2015,2016)
206 | - Ali Moin (2018)
207 | - Arya Reais-Parsi (2019)
208 | - Cem Yalcin (2019)
209 | - Tan Nguyen (2020)
210 | - Harrison Liew (2020)
211 | - Sean Huang (2021)
212 | - Daniel Grubb, Nayiri Krzysztofowicz, Zhaokai Liu (2021)
213 | 


--------------------------------------------------------------------------------
/lab5/spec.md:
--------------------------------------------------------------------------------
  1 | # EECS 151/251A ASIC Lab 5: Parallelization and Routing
  2 | <p align="center">
  3 | Prof. Sophia Shao
  4 | </p>
  5 | <p align="center">
  6 | TAs (ASIC): Erik Anderson, Roger Hsiao, Hansung Kim, Richard Yan
  7 | </p>
  8 | <p align="center">
  9 | Department of Electrical Engineering and Computer Science
 10 | </p>
 11 | <p align="center">
 12 | College of Engineering, University of California, Berkeley
 13 | </p>
 14 | 
 15 | ## Overview
 16 | 
 17 | Like last week, this lab has two parts. For the first part, we will continue to develop our GCD
 18 | coprocessor by improving its performance. After that, we will continue the physical design flow by
 19 | performing routing.
 20 | 
 21 | To begin this lab, set up your environment by sourcing the `eecs151.bashrc` file, as usual, and then get the project files by typing the following commands:
 22 | 
 23 | ```shell
 24 | source /home/ff/eecs151/asic/eecs151.bashrc
 25 | ```
 26 | 
 27 | ```shell
 28 | git clone /home/ff/eecs151/labs/lab5.git
 29 | cd lab5
 30 | ```
 31 | 
 32 | ## Design
 33 | 
 34 | One way we can improve the performance of our GCD coprocessor is by parallelizing the compute.
 35 | We can do this by including multiple GCD units in our design, and routing traffic to them as they
 36 | become available.
 37 | 
 38 | You will find that the solution to last week’s lab (`fifo.v` and `gcd_coprocessor.v`) is included. The
 39 | test has been modified to check the total number of cycles taken by the coprocessor to complete the
 40 | tests. Run `make sim-rtl` to run the new testbench on the solution code. Take note of the number
 41 | of cycles that the tests take without modification, as you will need it to calculate your speedup.
 42 | 
 43 | Your task is to edit `gcd_coprocessor.v` to bring the cycle count below 225 cycles. We will do
 44 | this by using two instances of GCD.
 45 | 
 46 | You will find RTL that connects the datapath and controller into one module in `gcd_unit.v`. You
 47 | may find this useful when refactoring the `gcd_coprocessor`, since you will need fewer wires to place
 48 | both GCD instances.
 49 | 
 50 | You will also find stub code for an arbiter, which you should complete. We will use the arbiter
 51 | to route traffic to GCD units and preserve the response ordering. Most of your design can be
 52 | implemented with combinational logic, but you will need some state to remember which GCD
 53 | block contains the earliest data to preserve ordering.
 54 | 
 55 | ---
 56 | ## Question 1: Design
 57 | 
 58 | **a.) Submit your code (`gcd_coprocessor.v` and `gcd_arbiter.v`) with your lab assignment.**
 59 | 
 60 | **b.) How many cycles did your simulation take? What was the % speedup?**
 61 | 
 62 | ## Checkoff 1: Design
 63 | Demonstrate your simulation's functionality and explain your design/approach.
 64 | 
 65 | ---
 66 | 
 67 | ## Automated Flow
 68 | 
 69 | In the last lab, we only focused on the PAR flow through CTS. In this lab, we will go through the full flow.
 70 | Routing is the next major flow step. Prior to the actual routing step, Innovus uses a
 71 | basic routing engine that produces errors and shorts, but it ignores them and simply tries to
 72 | get an estimate of delays and parasitics. Once post-CTS
 73 | optimization is done, it switches to a different tool that actually legalizes routing and tries to eliminate
 74 | shorts while meeting timing. Routing is one of the most
 75 | computationally heavy tasks of digital IC design and can take days to complete for complicated designs. 
 76 | This will be reflected in the runtime in this lab.
 77 | 
 78 | After routing is complete, a post-Route optimization is run to ensure no timing violations
 79 | remain. Post-Route optimization typically has little freedom to move cells around, and it tries to
 80 | meet the timing constraints mostly by tweaking the length of the routes. You may see some DRC
 81 | (Design Rule Check) errors caused by the 7nm technology library, after routing.
 82 | 
 83 | First, synthesize the design:
 84 | 
 85 | ```shell
 86 | make syn
 87 | ```
 88 | 
 89 | Then, simulate the synthesized design to make sure it still works:
 90 | 
 91 | ```shell
 92 | make sim-gl-syn
 93 | ```
 94 | 
 95 | Once your synthesized design passes the test, you can start the PAR flow:
 96 | 
 97 | ```shell
 98 | make par
 99 | ```
100 | 
101 | The PAR command will take a long time to complete, as it runs through all stages of PAR. 
102 | Check out the iterations that Innovus runs through during optimization.  You can see some of the metrics that Innovus is using.
103 | Once it completes, take a look at the build directory as in the previous labs. You might see additional files
104 | compared to the `syn-rundir`; that’s because the PAR flow incorporates the RC and parasitic delays, in addition to the cell delays. Open `build/par-rundir/gcd_coprocessor.PVT_0P63V_100C.par.spef`
105 | and search for the first occurrence of `D_NET`. What does it say about the first net? You may find
106 | [this wiki page](https://en.wikipedia.org/wiki/Standard_Parasitic_Exchange_Format#Parasitics) helpful. *(thought experiment #1 : get a sense of the units at the top and orders of magnitude of the RC parasitics in the SPEF file. If we used a 5nm technology library, do you expect the resistance to generally increase or decrease? How about the capacitance?)*
107 | 
108 | ## Question 2: Automated Flow
109 | 
110 | a.) Check the post-Synthesis timing report
111 | (`syn-rundir/reports/final_time_PVT_0P63V_100C.setup_view.rpt`) and post-PAR timing report (`par-rundir/timingReports/gcd_coprocessor_postRoute_all.tarpt`). 
112 | **What are the critical paths of your post-PAR and post-Synthesis designs?**
113 | **Are they the same path?**
114 | **How does this critical path compare to your single-unit critical path?**
115 | 
116 | b.) Iterate on your design by modifying `design.yml` to find a rough estimate (no need to be too
117 | precise) of the shortest clock period you can use before you start running into setup violations. 
118 | **Given the number of cycles it takes to complete the testbench, what is the shortest time in which your design can finish the computation?**
119 | 
120 | c.) Open the post-CTS timing report (`par-rundir/hammer_cts_debug/hammer_cts_all.tarpt`) and the post-PAR 
121 | timing report (`par-rundir/timingReports/gcd_coprocessor_postRoute_all.tarpt`). 
122 | **Find a common path (same start and end sequential elements). What differences do you notice within the paths?**
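For part b, the total computation time is the cycle count multiplied by the clock period. A minimal sketch follows; the function name `total_runtime` is hypothetical, the example numbers are made up, and the result is in whatever time unit your clock period uses.

```shell
# Hypothetical helper: total runtime = number of cycles * clock period.
# $1 = cycle count from the testbench, $2 = clock period (any time unit).
total_runtime() {
    awk -v c="$1" -v p="$2" 'BEGIN { printf "%.2f\n", c * p }'
}

# Example with made-up values: 200 cycles at a period of 2.5 time units.
total_runtime 200 2.5
```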
123 | 
124 | ---
125 | 
126 | ## Innovus Commands
127 | 
128 | As in the previous lab, we will look at the contents of `par.tcl` that Hammer generates and follow
129 | along using Innovus. 
130 | *(thought experiment #2 : open the `par.tcl` and search for the command `set_db add_fillers_cells`. Based on the names of the cells specified by this command, what do you think is the function of the filler cells?)*
131 | 
132 | Navigate to the directory `build/par-rundir` and type:
133 | 
134 | ```shell
135 | innovus -common_ui
136 | ```
137 | 
138 | This will open the Innovus shell. Next, type `read_db gcd_coprocessor_FINAL` to load the current design
139 | database from the latest PAR flow. This will help us to avoid re-running the entire flow. To see
140 | all the reporting commands, type `help report*` in the Innovus shell and read through the options
141 | available to you.
142 | 
143 | ## Checkoff 2: Innovus Commands
144 | Explain the PAR flow, or ask some questions about any steps you don't understand.
145 | 
146 | ---
147 | ## Question 3: Innovus Reports
148 | 
149 | **a.) What is the area consumed by your design?**
150 | **What percentage of the total area does the arbiter occupy?**
151 | 
152 | **b.) Submit a screenshot of your setup slack histogram.**
153 | **Compared with the histogram you obtained in Lab 4, does your new slack distribution support the observed performance improvements you obtained in your coprocessor?**
154 | 
155 | ---
156 | 
157 | After you are done with the flow, it is time to simulate our newly printed post-PAR netlist. Type
158 | the following command:
159 | 
160 | ```shell
161 | make sim-gl-par
162 | ```
163 | 
164 | This will use the same testbench, but will now use the post-PAR netlist of your design, backannotated with delays and parasitics from PAR. Make sure to adjust the `CLOCK_PERIOD` variable in `sim-gl-par.yml` to match the clock period you obtained from PAR. Note, however, that the exact
165 | clock period may not work and you may need to relax it slightly.
166 | 
167 | After running `make sim-gl-par` you can run power analysis using:
168 | 
169 | ```shell
170 | make power-par
171 | ```
172 | 
173 | Navigate to `build/power-rundir/activePowerReports.PVT_0P63V_100C.setup_view/` and open `power.rpt`. Do
174 | the power estimation numbers match your expectation?
175 | 
176 | ---
177 | ## Question 4: Trade-offs
178 | 
179 | **a.)** Re-run the flow using your old design. 
180 | To prevent your `build` directory from being overwritten, set the `OBJ_DIR` Make variable to a different name (e.g. `make par OBJ_DIR=build2`).
181 | Using the area and power values from Innovus, 
182 | **how does the performance improvement of the dual-unit design compare with its increase in area and power consumption relative to your old design?**
183 | 
184 | **b.)** Modify your `gcd_coprocessor.v` to take an input parameter in terms of number of clock cycles we
185 | want our design to meet (`parameter TARGET_NUMBER_OF_CYCLES`) for this given testbench. Your
186 | code should generate a low area, low power design if the number is greater than what your simple
187 | GCD coprocessor can achieve, and it should generate the dual-unit design if it is lower. 
188 | **Submit your code.**
189 | 
190 | Hint: Use the Verilog `generate` syntax for choosing between designs. See [here](https://www.chipverify.com/verilog/verilog-generate-block) for documentation on how to use the `generate` syntax.
191 | 
192 | c.) (Optional) Using a rough estimate of target number of cycles versus number of units in the design,
193 | write code that will generate 1-8 units depending on the performance demand. Do NOT do this
194 | by writing out every possible case explicitly. You can limit the number of units to powers of two
195 | (1,2,4,8) if it makes your life easier.
196 | 
197 | ---
198 | 
199 | ## Lab Deliverables
200 | 
201 | ### Lab Due 11:59 PM, 2 weeks after your registered lab section. (Oct. 17 for lab section 1)
202 | 
203 | - Submit a written report with all 4 questions answered to Gradescope
204 | - Checkoff with an ASIC lab TA
205 | 
206 | ## Acknowledgements
207 | 
208 | This lab is the result of the work of many EECS151/251 GSIs over the years including:
209 | 
210 | - Nathan Narevsky (2014, 2017)
211 | - Brian Zimmer (2014)
212 | - Cem Yalcin (2019)
213 | 
214 | Modified By:
215 | - John Wright (2015,2016)
216 | - Ali Moin (2018)
217 | - Arya Reais-Parsi (2019)
218 | - Cem Yalcin (2019)
219 | - Tan Nguyen (2020)
220 | - Harrison Liew (2020)
221 | - Sean Huang (2021)
222 | - Daniel Grubb, Nayiri Krzysztofowicz, Zhaokai Liu (2021)
223 | - Dima Nikiforov (2022)
224 | - Roger Hsiao (2022)
225 | 


--------------------------------------------------------------------------------
/project/checkpoint4.md:
--------------------------------------------------------------------------------
  1 | # EECS 151/251A ASIC Project Specification: Checkpoint 4
  2 | <p align="center">
  3 | Prof. Sophia Shao
  4 | </p>
  5 | <p align="center">
  6 | TAs (ASIC): Dima Nikiforov
  7 | </p>
  8 | <p align="center">
  9 | Department of Electrical Engineering and Computer Science
 10 | </p>
 11 | <p align="center">
 12 | College of Engineering, University of California, Berkeley
 13 | </p>
 14 | 
 15 | ---
 16 | 
 17 | ## 1 Synthesis, PAR, & Power
 18 | 
 19 | ### 1.1 Performing Synthesis and PAR
 20 | Make sure your design is backed up at this point.
 21 | 
 22 | The setup for synthesis and PAR is similar to what we have used in the labs during the class,
 23 | with some formatting differences.  In `par.yml`, there is extra guidance 
 24 | for how to do placement constraints. Based on how you implemented the caches from the previous
 25 | checkpoint, you will need to modify these constraints to match the master SRAM cell used as well as
 26 | the path. Now you should be ready to proceed to Synthesis and PAR. As in the previous labs, execute
 27 | the following:
 28 | ```shell
 29 | export HAMMER_HOME=$PWD/hammer
 30 | source hammer/sourceme.sh
 31 | ```
 32 | The first thing you should do before simulating is to make the SRAM libraries with the command:
 33 | ```
 34 | make srams
 35 | ```
 36 | If you want to make sure the RTL has been pointed to correctly, you can try running the asm tests
 37 | from this environment. To do so, type the following command:
 38 | ```
 39 | make sim-rtl test_asm=all
 40 | ```
 41 | The command generates the `simv` file, which is the simulation executable, then iterates through all
 42 | the asm tests using the root `Makefile`. If everything looks fine, you can proceed to Synthesis and
 43 | PAR:
 44 | ```
 45 | make clean
 46 | make srams
 47 | make syn
 48 | make par
 49 | ```
 50 | If everything went smoothly, you should now have a circuit laid out. To view the layout, go to
 51 | `build/par-rundir/` directory and type
 52 | ```
 53 | ./generated_scripts/open_chip
 54 | ```
 55 | You are expected to record and document your area, power and clock frequency performance (as
 56 | determined by your critical path). To verify that your design works after PAR, use the following command:
 57 | ```
 58 | make sim-gl-par test_asm=all
 59 | ```
 60 | Some final notes:
 61 | * You may also want to generally make sure that the post-synthesis netlist passes tests before moving onto post-PAR simulation, because the latter can be slower and will complicate your debugging with any PAR-related failures you may have (e.g. incomplete wiring of signal or clock nets
 62 | due to a bad floorplan).
 63 | * There is a new constraint added to `syn.yml` under the key `vlsi.inputs.delays`. 
 64 | The external memory model in `riscv_test_harness.v` generates a delayed version of the signals
 65 | going into your CPU (see the parameter `INPUT_DELAY`). Annotating these delays for synthesis/PAR 
 66 | is necessary in order to capture this effect when the tools perform timing analysis. If you are
 67 | curious, this gets translated into `build/syn-rundir/pin_constraints_fragment.sdc`
 68 | as input to synthesis. After synthesis, the relevant pin delays are encoded in
 69 | `build/syn-rundir/riscv_top.mapped.sdc`. These are Synopsys Design Constraint
 70 | format files. Do not touch this delay constraint except to update the value to your clock period
 71 | divided by 5.
 72 | * As described in Lab 6, the ASAP7 dummy SRAMs do not have complete timing information.
 73 | This is most apparent in gate-level simulations because the SRAMs do not provide any SDF
 74 | timing annotation. You may find that despite meeting timing in synthesis and PAR, you will
 75 | likely need to increase the gate-level simulation clock period for the benchmarks to pass.
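The input delay rule in the notes above (clock period divided by 5) can be sketched as a one-line calculation. The function name `input_delay` is hypothetical and the example value is made up; the result is in the same time unit as the clock period you pass in.

```shell
# Hypothetical helper: input delay = clock period / 5.
# $1 = target clock period (any time unit).
input_delay() {
    awk -v p="$1" 'BEGIN { printf "%.2f\n", p / 5 }'
}

# Example with a made-up clock period of 1000 time units.
input_delay 1000
```

Recompute this value each time you change the target clock period so that the delay constraint stays in step with it.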
 76 | 
 77 | ### 2 Checkpoint 4 Deliverables
 78 | *Checkoff due: May 6 (Friday), 2022*
 79 | 
 80 | 
 81 | 1. Show that all of the assembly tests and final tests pass using the cache in a post-PAR simulation
 82 | 
 83 | 2. Show your layout, and explain your design considerations when creating the floorplan
 84 | 
 85 | 3. Show your final pipeline diagram, updated to match the code
 86 | 
 87 | ---
 88 | 
 89 | ## 3 Beyond Checkpoint 4: CPU Optimization
 90 | 
 91 | ### 3.1 Optimizing for frequency
 92 | Beyond functionality, your final project grade will be determined by the maximum operating frequency
 93 | of your processor, determined by the critical path. You will also want to optimize for the number of
 94 | cycles that your processor takes to execute certain programs, more on that later. The critical path will
 95 | be dependent on how aggressively you ask the tools to optimize the design, by changing the target clock
 96 | period in the `syn.yml` file.
 97 | 
 98 | When Innovus is finished, look at the timing report for the critical path. In some cases, it is possible
 99 | to modify your Verilog to improve the critical path by moving pipeline stage registers. However in other
100 | cases, timing can only be improved by tweaking settings in `syn.yml` and `par.yml`.
101 | 
102 | Be sure to back up (meaning check in or branch) your working design before attempting to move
103 | logic, because functionality is worth much more of your grade than maximum frequency.
104 | 
105 | You are allowed to add additional pipeline stages, but remember that you will need to deal with the additional hazards that accompany them.
106 | Be careful that adding additional stages does not increase the overall execution time.
107 | Your final performance metric is not only based on the clock speed at which your design will run, so keep
108 | that in mind before heavily modifying your design.
109 | 
110 | Note for bonus grading: due to the SRAM timing issue described above, the maximum frequency
111 | you achieved in PAR (not gate-level simulation) is most accurate and should be what you report for
112 | frequency.
113 | 
114 | ### 3.2 Optimizing for number of cycles
115 | We are providing you tests that are the output of example C programs to run on your processor. They
116 | are meant to be representative of different types of programs, each of which may take extra cycles
117 | to execute for a variety of reasons including, but not limited to, cache
118 | misses and branch/jump stalls. A more complicated cache structure may be able to reduce some of the
119 | time spent waiting for memory accesses, but it may not be optimal for all cases. If you implement a
120 | configurable cache, you are allowed to set the cache settings differently on a per-test basis, but you will need
121 | to add those pins to the top level Riscv151 file as well as the testbench with compile flags for VCS. In
122 | terms of dealing with branching and jumping, you can implement any type of branch predictor that you
123 | want to. A branch predictor in its simplest form will always choose to take (or not take) the branch and
124 | then figure out if it was incorrect, and if so go back to where the instruction memory should have gone,
125 | making sure that any additional instructions that were started do not change the state of the CPU. This
126 | means that there should be no writes to memory or any registers for those instructions.
127 | 
128 | The list of final tests is contained within the Makefile under the variable `bmark_tests`, which
129 | includes a few tests that are meant to actually test the performance of your design. These tests are longer
130 | C programs that are meant to test different aspects of your design and how you handle different types
131 | of hazards. To run these longer tests you can run the following commands, like in checkpoint #3:
132 | ```
133 | make sim-rtl test_bmark=all
134 | ```
135 | You may need to increase the number of cycles for timeout for some of the longer tests (like sum,
136 | replace and cachetest) to pass.
137 | 
138 | ### 3.3 Optimizing for power
139 | **DISCLAIMER:** The infrastructure to do power analysis in this project is very different from gate-level
140 | simulation and power analysis so far. Doing this optimization is *purely optional* and should only be
141 | tried after you can pass the benchmarks normally! **Proceed at your own risk!**
142 | 
143 | You have the ability to also find out the power consumption of your processor for the various provided benchmarks. 
144 | The value of this is to figure out whether the way you wrote your logic is efficient
145 | and avoids extra switching activity. Simplifying instruction decode logic, forwarding paths, etc. can result
146 | in lower power consumption!
147 | 
148 | Near the bottom of `sim-gl-par.yml`, you will see a few lines:
149 | ```
150 | execute_sim: false
151 | # Below is for power analysis. See the spec for instructions!
152 | # execution_flags_append:
153 | # - "+loadmem=../../tests/asm/addi.hex"
154 | # - "+max-cycles=10000"
155 | ```
156 | If you reverse the comments (i.e. comment out `execute_sim: false` and uncomment the
157 | rest), this tells Hammer to run the `simv` executable with the addi test, instead of having the `Makefile`
158 | in the root folder run the executable. This is currently the only way that we can get Hammer
159 | to generate the SAIF file with our benchmark hex files. To proceed with the simulation of addi in this
160 | case:
161 | ```
162 | make sim-gl-par test_asm=addi.out
163 | ```
164 | You will find that it will do a simulation twice due to how the root `Makefile` is configured. The
165 | first one should pass (after a lot of printing each cycle number), while the second one should also pass
166 | like you have seen so far; ignore this second simulation. You should now see a `ucli.saif` file in
167 | `build/sim-rundir`. Then, as in previous labs, run Voltus:
168 | ```
169 | make power-par
170 | ```
171 | And you should get static and dynamic power reports in `build/power-rundir`.
172 | 
173 | Some closing recommendations:
174 | 
175 | * This infrastructure only allows us to run one benchmark at a time. To run a different benchmark,
176 | replace the hex file in the `execution_flags_append` list, and alter the `max-cycles` value
177 | as necessary (see the `*_timeout_cycles` variables in the root Makefile for the numbers).
178 | * Due to the ASAP7 PDK’s dummy SRAMs, we can’t measure SRAM power, and thus can’t find
179 | out how power-efficient our caching is. Therefore, the best benchmark to run would be an
180 | arithmetic-heavy one that relies heavily on the register file (but the provided benchmarks require
181 | memory accesses). If you have lots of time on your hands, we encourage you to find power
182 | numbers for the **final** benchmark, but **you will not be graded on power performance.**
183 | 
184 | ---
185 | 
186 | 
187 | ## Acknowledgement
188 | 
189 | This project is the result of the work of many EECS151/251 GSIs over the years including:
190 | Written By:
191 | - Nathan Narevsky (2014, 2017)
192 | - Brian Zimmer (2014)
193 | Modified By:
194 | - John Wright (2015,2016)
195 | - Ali Moin (2018)
196 | - Arya Reais-Parsi (2019)
197 | - Cem Yalcin (2019)
198 | - Tan Nguyen (2020)
199 | - Harrison Liew (2020)
200 | - Sean Huang (2021)
201 | - Daniel Grubb, Nayiri Krzysztofowicz, Zhaokai Liu (2021)
202 | - Dima Nikiforov (2022)
203 | 


--------------------------------------------------------------------------------
/project/overview.md:
--------------------------------------------------------------------------------
  1 | # EECS 151/251A ASIC Project Specification RISC-V Processor Design: Overview
  2 | <p align="center">
  3 | Prof. Sophia Shao
  4 | </p>
  5 | <p align="center">
  6 | TAs (ASIC): Dima Nikiforov
  7 | </p>
  8 | <p align="center">
  9 | Department of Electrical Engineering and Computer Science
 10 | </p>
 11 | <p align="center">
 12 | College of Engineering, University of California, Berkeley
 13 | </p>
 14 | ## 1. Introduction
 15 | 
 16 | The primary goal of this project is to familiarize students with the methods and tools of digital design. In order to make the project both interesting and useful, we will guide you through the implementation of a CPU that is intended to be integrated on a modern SoC. Working alone or in teams of 2, you will be designing a simple 3-stage CPU that implements the RISC-V ISA, developed here at UC Berkeley. If you work in a team, you must both have a complete understanding of your entire project code, and you will both receive the same grade.
 17 | 
 18 | Your first and most important goal is to write a functional implementation of your processor. To better expose you to real design decisions, you will also be tasked with improving the performance of your processor. You will be required to meet a minimum performance to be specified later in the project.
 19 | 
 20 | You will use Verilog HDL to implement this system. You will be provided with some testbenches to verify your design, but you will be responsible for creating additional testbenches to exercise your entire design. Your target implementation technology will be the ASAP7 7nm Educational PDK, a predictive model technology used for instruction. The project will give you experience designing synthesizable RTL (Register Transfer Level) code, resolving hazards in a simple pipeline, building interfaces, and approaching system-level optimization.
 21 | 
 22 | Your first step will be to map our high level specification to a design which can be translated into a hardware implementation. You will then generate and debug that implementation in Verilog. These steps may take significant time if you do not put effort into your system architecture before attempting implementation. After you have built a working design, you will be optimizing it for speed in the 7nm technology that we have been using this semester.
 23 | 
 24 | 
 25 | ### 1.1 RISC-V
 26 | The final project for this class will be a VLSI implementation of a RISC-V (pronounced risk-five) CPU. RISC-V is an instruction set architecture (ISA) developed here at UC Berkeley. It was originally developed for computer architecture research and education purposes, but recently there has been a push towards commercialization and industry adoption. For the purposes of this lab, you don’t need to delve too deeply into the details of RISC-V. However, it may be good to familiarize yourself with it, as this will be at the core of your final project. Check out the official [RISC-V Instruction Set Manual](https://riscv.org/technical/specifications/) (Volume 1, Unprivileged Spec) and explore http://riscv.org for more information.
 27 | - Read sections 2.2 and 2.3 to understand how the different types of instructions are encoded. 
 28 | - Read sections 2.4, 2.5, 2.6, and 9.1 and think about how each of the instructions will use the ALU.
 29 | 
 30 | ### 1.2 Project phases
 31 | Your project will consist of two different phases: front-end and back-end. Within each phase, you will have multiple checkpoints that will ensure you are making consistent progress. These checkpoints will contribute (although not significantly) to your final grade. You are free to make design changes after they have been checked off.
 32 | 
 33 | In the first phase (front-end), you will design and implement a 3-stage RISC-V processor in Verilog, and run simulations to test for functionality. At this point, you will only have a functional description of your processor that is independent of technology (there are no standard cells yet). You are highly encouraged to finish each checkpoint early, and each checkpoint will be released before the due date of the ongoing one. Everything will take much longer than you expect, and finishing early gives you more time to improve your QoR (Quality of Results, e.g. clock period).
 34 | 
 35 | In the second phase (back-end), you will implement your front-end design in the ASAP7 7nm kit using the VLSI tools you used in lab. When you have finished phase 2, you will have a design that could move onto fabrication if this were a real technology process. You will have about 2 weeks to complete the second phase after its release.
 36 | 
 37 | ### 1.3 Philosophy
 38 | This document is meant to describe a high-level specification for the project and its associated support hardware. You can also use it to help lay out a plan for completing the project. As with any design you will encounter in the professional world, we are merely providing a framework within which your project must fit.
 39 | 
 40 | You should consider the GSI(s) a source of direction and clarification, but it is up to you to produce a fully-functional design and its physical implementation. Ultimately the responsibility of designing and debugging your solution lies on you.
 41 | 
 42 | ### 1.4 General Project Tips
 43 | Be sure to use top-down design methodologies in this project. We began by taking the problem of designing a basic computer system, modularizing it into distinct parts, and then refining those parts into manageable checkpoints. You should take this scheme one step further; we have given you each checkpoint, so break each into smaller, manageable pieces.
 44 | 
 45 | As with many engineering disciplines, digital design has a normal development cycle. In the norm, after modularizing your design, your strategy should roughly resemble the following steps:
 46 | 
 47 | - **Design** your modules well, make sure you understand what you want before you begin to code.
 48 | 
 49 | - **Code** exactly what you designed; do not try to add features without redesigning.
 50 | 
 51 | - **Simulate** thoroughly; writing a good testbench is as much a part of creating a module as actually coding it.
 52 | - **Debug** completely; anything which can go wrong with your implementation will.
 53 | 
 54 | Some general tips when designing complex RTL modules:
 55 | 
 56 | * Document your project thoroughly as you go
 57 |   * comment your Verilog
 58 |   * before making any RTL changes, **modify your pipeline diagram first to visualize this change**, doing this:
 59 |     * may reveal the change is actually infeasible
 60 |     * ensures that you and your partner have the same view of your processor's operation
 61 | * Split the module operation into data/control paths and design each separately
 62 |   * Start with the simplest possible implementation
 63 |   * Make changes incrementally and always test your module after each change, no matter how small
 64 |   * Finish the required features first before attempting any extra features
 65 | * Use GitHub version control features like commits, branches, etc.
 66 | * Save your work often and rely on redundancy (e.g. copy files from `/scratch` to your home directory often to ensure they're backed up)
 67 | * Parallelize work as much as possible (e.g. start writing CPU RTL as you finish your diagram, work on CPU and Cache in parallel, start physical design as you finish your cache)
 68 | 
 69 | 
 70 | This project is divided into checkpoints. Each checkpoint will be due 1 to 2 weeks after its release, but the next checkpoint will be released early. Use this to your advantage: try to get ahead so that you have additional time to debug. Your TA will clarify the specific timeline for your semester.
 71 | 
 72 | The most important goal is to design a functional processor. This alone is 50-60% of the final grade, and you must have it **working completely** to receive any credit for performance.
 73 | 
 74 | ---
 75 | 
 76 | ## 2. Front-end design (Phase 1)
 77 | 
 78 | The first phase in this project is designed to guide the development of a three-stage pipelined RISC-V CPU that will be used as a base system for your back-end implementation.
 79 | Phase 1 will last for 5 weeks and has weekly checkpoints.
 80 | 
 81 | - Checkpoint 1: ALU design and pipeline diagram
 82 | - Checkpoint 2: Core implementation
 83 | - Checkpoint 3: Core + memory system implementation 
 84 | - Checkpoint 4: Physical Design
 85 | 
 86 | 
 87 | ### 2.1 Adding SSH Key
 88 | First you must add an SSH key to your GitHub account, to allow you to push to your project repo from the instructional machines without entering your GitHub password each time. You may run these commands in any location on any instructional machine (the SSH key will be stored in your home directory and thus work on all machines).
 89 | ```shell
 90 | ssh-keygen -t ed25519 -C "your_email@example.com"
 91 | # hit Enter to each prompt (leave response blank)
 92 | cat ~/.ssh/id_ed25519.pub
 93 | # Then select and copy the contents of the id_ed25519.pub file
 94 | # displayed in the terminal to your clipboard
 95 | ```
 96 | 
 97 | In your browser, navigate to [https://github.com/settings/ssh/new](https://github.com/settings/ssh/new) (log into your GitHub account if needed). You should see the `SSH Keys / Add New` page. Enter the following values:
 98 | * Title: `something descriptive (ex. eecs151)`
 99 | * Key: `paste the contents of the id_ed25519.pub file`
100 | 
101 | Then click the green `Add SSH key` button.
102 | 
103 | ### 2.2 Project Git Repo
104 | The skeleton files for the project will be delivered as a git repository provided by the staff. You should clone this repository as follows. It is highly recommended to familiarize yourself with git and use it to manage your development.
105 | 
106 | 
107 | ```shell
108 | git clone /home/ff/eecs151/labs/project_skeleton /path/to/my/project
109 | ```
110 | 
111 | To get a team repo, fill out the google form via the link on Piazza with your team information. Please do this even if you are working alone, as these git repos will be used for version control and as part of the final checkoff. You will receive an email with an invite link to your project repo, which you should click to join before following the directions below. 
112 | 
113 | An example working flow to be able to pull from the skeleton as well as push/pull with your team repository is shown below:
114 | 
115 | 
116 | ```shell
117 | cd /path/to/my/project
118 | git remote add myOrigin git@github.com:EECS150/fa21_asic_teamXX
119 | ```
120 | 
121 | Then to pull changes from the skeleton, you would need to type:
122 | ```shell
123 | git pull origin master
124 | ```
125 | 
126 | To pull changes from your team repository you would type:
127 | ```shell
128 | git pull myOrigin master
129 | ```
130 | 
131 | And to push changes to your team repository (please do not attempt to push to the skeleton repository), you would usually want to pull first (above) and then type:
132 | ```shell
133 | git push myOrigin master
134 | ```
135 | 
136 | ---
137 | 
138 | ## 3. Grading
139 | 
140 | ### EECS 151:
141 | |                   |           |
142 | |-------------------|---------|
143 | |  **70%**          |   Functionality at project due date: Your design will be subjected to a comprehensive test suite and your score will reflect how many of the tests your implementation passes.
144 | |  **25%**          |   Final Report and Final Interview: If your design is not 100% functional, this is your opportunity to explain your bugs and recoup points.
145 | |  **5%**           |   Checkpoints: Each check-off is worth 1.25%. If you accomplished all of your checkpoints on time, you will receive full credit in this category.
146 | |  **Bonus 5%**     |   Performance at project due date: You must have a fully working design to score points in this section. You will receive up to 5 bonus points as your performance improves relative to your peers. Performance will be calculated using the Iron Law: IPC * F
147 | 
148 | ### EECS 251A:
149 | |                 |           |
150 | |-----------------|---------|
151 | |   **60%**       |  Functionality at project due date: Your design will be subjected to a comprehensive test suite and your score will reflect how many of the tests your implementation passes.
152 | |   **10%**       |  Set-Associative Cache: Implementation and performance of the configurable set-associative cache.
153 | |   **25%**       |  Final Report and Final Interview: If your design is not 100% functional, this is your opportunity to explain your bugs and recoup points.
154 | |   **5%**        |  Checkpoints: Each check-off is worth 1.25%. If you accomplished all of your checkpoints on time, you will receive full credit in this category.
155 | |   **Bonus 5%**  |  Performance at project due date: You must have a fully working design to score points in this section. You will receive up to 5 bonus points as your performance improves relative to your peers. Performance will be calculated using the Iron Law: IPC * F
156 | 
157 | ## Acknowledgement
158 | 
159 | This project is the result of the work of many EECS151/251 GSIs over the years including:
160 | Written By:
161 | - Nathan Narevsky (2014, 2017)
162 | - Brian Zimmer (2014)
163 | Modified By:
164 | - John Wright (2015,2016)
165 | - Ali Moin (2018)
166 | - Arya Reais-Parsi (2019)
167 | - Cem Yalcin (2019)
168 | - Tan Nguyen (2020)
169 | - Harrison Liew (2020)
170 | - Sean Huang (2021)
171 | - Daniel Grubb, Nayiri Krzysztofowicz, Zhaokai Liu (2021)
172 | - Dima Nikiforov (2022)
173 | 


--------------------------------------------------------------------------------
/project/checkpoint1.md:
--------------------------------------------------------------------------------
  1 | # EECS 151/251A ASIC Project Specification: Checkpoint 1
  2 | <p align="center">
  3 | Prof. Sophia Shao
  4 | </p>
  5 | <p align="center">
  6 | TAs (ASIC): Dima Nikiforov
  7 | </p>
  8 | <p align="center">
  9 | Department of Electrical Engineering and Computer Science
 10 | </p>
 11 | <p align="center">
 12 | College of Engineering, University of California, Berkeley
 13 | </p>
 14 | 
 15 | ---
 16 | 
 17 | ## ALU design and pipeline diagram
 18 | The ALU that we will implement in this lab is for a RISC-V instruction set architecture. Pay close attention to the design patterns and how the ALU is intended to function in the context of the RISC-V processor. In particular it is important to note the separation of the datapath and control used in this system which we will explore more here.
 19 | 
 20 | The specific instructions that your ALU must support are shown in the tables below. The branch condition should not be calculated in the ALU. Depending on your CPU implementation, your ALU may or may not need to do anything for branch, jump, load, and store instructions (i.e., it can just output 0).
 21 | 
 22 | ---
 23 | 
 24 | ### 1. Making a pipeline diagram
 25 | 
 26 | 
 27 | The first step in this project is to make a pipeline diagram of your processor, as described in lecture. You only need to make a diagram of the datapath (not the control). Each stage should be clearly separated with a vertical line, and flip-flops will form the boundary between stages. It is a good idea to name signals depending on what stage they are in (e.g. `s1_killf`, `s2_rd0`). Also, it is a good idea to separately name the input/output (D/Q) of a flip-flop (e.g. `s0_next_pc`, `s1_pc`). Draw your diagram in a drawing program (Inkscape, Google Drawings, draw.io or any program you want), because you will need to keep it up-to-date as you build your processor. As such, we recommend you leave plenty of space between diagram elements to make it easier to insert changes as your project evolves.
 28 | It helps to print out scratch copies while you are debugging your processor and to keep your drawings revision-controlled with git. Once you have finished your initial datapath design, you will implement the main building block in the datapath—the ALU.
 29 | 
 30 | ---
 31 | 
 32 | ### 2. ALU functional specification
 33 | Given specifications about what the ALU should do, you will create an ALU in Verilog and write a test harness to test the ALU.
 34 | 
 35 | The encoding of each instruction is shown in the table below. There is a detailed functional description of each of the instructions in Section 2.4 of the [RISC-V Instruction Set Manual](https://riscv.org/technical/specifications/) (Volume 1, Unprivileged Spec). Pay close attention to the functional description of each instruction as there are some subtleties. 
 36 | 
 37 | <p align="center">
 38 | <img src="./figs/RV32I_Base_Instruction_Set.png" width="800" />
 39 | </p>
 40 | 
 41 | ---
 42 | 
 43 | ### 3. Project Files
 44 | We have provided a skeleton directory structure to help you get started.
 45 | 
 46 | Inside, you should see a `src` folder, as well as a `tests` folder. The `src` folder contains all of
 47 | the Verilog modules for this phase, and the `tests` folder contains some RISC-V test binaries for your processor.
 48 | 
 49 | ---
 50 | 
 51 | ### 4. Testing the Design
 52 | Before writing any of the modules, you will first write the tests so that once you’ve written the modules you’ll be able to test them immediately. This is effectively Test-driven Development (TDD). Writing tests first is good practice: it forces you to write thorough tests, and ensures that tests will exist when you need to rapidly iterate through module design tweaks. Thorough understanding of the expected functionality is key to writing good tests (or RTL). You will be expected to write unit tests for any modules that you design and implement, and to write integration tests. Unit tests will verify the functionality of individual modules against your specification. Integration tests verify that all the modules work as a system once you connect them together.
 53 | 
 54 | #### 4.1 Verilog Testbench
 55 | One way of testing Verilog code is with testbench Verilog files. The outline of a test bench file has been provided for you in ``ALUTestbench.v``. There are several key components to this file:
 56 | - `` `timescale 1ns / 1ps`` - This specifies, in order, the reference time unit and the precision. This example sets the unit delay in the simulation to 1ns (i.e. `#1` = 1ns) and the precision to 1ps (i.e. the finest delay you can set is `#0.001` = 1ps).
 57 | - The clock is generated by the code below. Since the ALU is only combinational logic, this is not necessary, but it will be a helpful reference once you have sequential elements.
 58 |     - The ``initial`` block sets the clock to 0 at the beginning of the simulation. You should be sure to only change your stimulus when the clock is falling, since the data is captured on the rising edge. Otherwise, it will not only be difficult to debug your design, but it will also cause hold time violations when you run gate level simulation.
 59 |     - You must use an always block without a sensitivity list (the `@` part of an always statement) to cause the clock to run automatically.
 60 |     ```verilog
 61 |         parameter Halfcycle = 5; //half period is 5ns
 62 |         localparam Cycle = 2*Halfcycle;
 63 |         reg Clock;
 64 |         // Clock Signal generation:
 65 |         initial Clock = 0;
 66 |         always #(Halfcycle) Clock = ~Clock;
 67 |     ```
 68 | - ``task checkOutput``; - this task contains Verilog code that you would otherwise have to copy paste many times. Note that it is not the same thing as a function (as Verilog also has functions).
 69 | - ``{$random} & 31'h7FFFFFFF `` - $random generates a pseudorandom 32-bit integer. A bitwise AND will mask the result for smaller bit widths.
 70 | 
 71 | For these two modules, the inputs and outputs that you care about are ``opcode``, ``funct``, ``add_rshift_type``, ``A``, ``B`` and ``Out``. To test your design thoroughly, you should work through every possible `opcode`, `funct`, and ``add_rshift_type`` that you care about, and verify that the correct Out is generated from the A and B that you pass in.
 72 | 
 73 | The test bench generates random values for ``A`` and ``B`` and computes ``REFout = A + B``. It also contains calls to ``checkOutput`` for load and store instructions, for which the ALU should perform addition. It will be up to you to write tests for the remaining combinations of opcode, funct, and ``add_rshift_type`` to test your other instructions.
 74 | 
 75 | Remember to restrict ``A`` and ``B`` to reasonable values (e.g. masking them, or making sure that they are not zero) if necessary to guarantee that a function is sufficiently tested. Please also write tests where the inputs and the output are hard-coded. These should be corner cases that you want to be certain are stressed during testing.
 76 | 
 77 | ---
 78 | 
 79 | #### 4.2 Test Vector Testbench
 80 | An alternative way of testing is to use a test vector, which is a series of bit arrays that map to the inputs and outputs of your module. The inputs can be all applied at once if you are testing a combinational logic block or applied over time for a sequential logic block (e.g. an FSM).
 81 | 
 82 | You will write a Verilog testbench that takes the parts of the bit array that correspond to the inputs of the module, feeds those to the module, and compares the output of the module with the output bits of the bit array. The bit vector should be formatted as follows:
 83 | ```verilog
 84 | [106:100] = opcode
 85 | [99:97] = funct
 86 | [96] = add_rshift_type
 87 | [95:64] = A
 88 | [63:32] = B
 89 | [31:0] = REFout
 90 | ```
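The bit layout above can be sketched as a small Python helper that packs the six fields into one 107-character line for `testvectors.input`. This is an illustrative sketch, not the provided `ALUTestGen.py`: the name `pack_test_vector` is hypothetical, and the example assumes `add_rshift_type` is 0 for `ADD`.

```python
def pack_test_vector(opcode, funct, add_rshift_type, a, b, ref_out):
    """Return a 107-character string of 0s and 1s, most significant bit first."""
    # (value, width) pairs in the order they appear from bit 106 down to bit 0.
    fields = [
        (opcode, 7),           # [106:100]
        (funct, 3),            # [99:97]
        (add_rshift_type, 1),  # [96]
        (a, 32),               # [95:64]
        (b, 32),               # [63:32]
        (ref_out, 32),         # [31:0]
    ]
    bits = ""
    for value, width in fields:
        if not 0 <= value < (1 << width):
            raise ValueError("value %d does not fit in %d bits" % (value, width))
        bits += format(value, "0%db" % width)
    return bits


# Example: ADD (opcode 0110011, funct3 000) with A = 1, B = 2, REFout = 3.
vector = pack_test_vector(0b0110011, 0b000, 0, 1, 2, 3)
print(len(vector))
print(vector)
```

Each line printed this way is one entry that `$readmemb` can load into a 107-bit wide `reg` array.
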
 91 | Open up the skeleton provided to you in ``ALUTestVectorTestbench.v``. You need to complete the module by making use of ``$readmemb`` to read in the test vector file (named testvectors.input), writing some assign statements to assign the parts of the test vectors to registers, and writing a for loop to iterate over the test vectors.
 92 | 
 93 | The syntax for a for loop can be found in ``ALUTestbench.v``. ``$readmemb`` takes as its arguments a filename and a reg vector, e.g.:
 94 | 
 95 | ```verilog
 96 | reg [5:0] bar [0:20];
 97 | $readmemb("foo.input", bar);
 98 | ```
 99 | 
100 | #### 4.3 Writing Test Vectors
101 | Additionally, you will also have to generate actual test vectors to use in your test bench. A test vector can either be generated in Verilog (like how we generated ``A``, ``B`` using the random number generator and iterated over the possible opcodes and functs), or using a scripting language like Python. Since we have already written a Verilog test bench for our ALU and decoder, we will tackle writing a few test vectors by hand, then use a script to generate test vectors more quickly.
102 | 
103 | Test vectors are of the format specified above, with the 7 opcode bits occupying the left-most bits. In the tests folder, create the file testvectors.input and add test vectors for the following instructions to the end (i.e. manually type the 107 zeros and ones required for each test vector): ``SLT``, ``SLTU``, ``SRA``, and ``SRL``.
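
When writing these four vectors by hand, it helps to compute the expected `REFout` independently. The following Python sketch models the 32-bit semantics of the four instructions; the helper names are hypothetical, and it assumes `B` carries the shift amount in its low 5 bits, as the provided script does.

```python
MASK32 = 0xFFFFFFFF


def to_signed32(x):
    """Interpret the low 32 bits of x as a two's-complement signed value."""
    x &= MASK32
    return x - (1 << 32) if x & 0x80000000 else x


def ref_slt(a, b):
    """SLT: signed comparison, result is 1 or 0."""
    return int(to_signed32(a) < to_signed32(b))


def ref_sltu(a, b):
    """SLTU: unsigned comparison, result is 1 or 0."""
    return int((a & MASK32) < (b & MASK32))


def ref_srl(a, b):
    """SRL: logical right shift, zeros are shifted in from the left."""
    return (a & MASK32) >> (b & 0x1F)


def ref_sra(a, b):
    """SRA: arithmetic right shift, the sign bit is replicated."""
    return (to_signed32(a) >> (b & 0x1F)) & MASK32


# Corner cases where the signed and unsigned views disagree.
print(ref_slt(0xFFFFFFFF, 1), ref_sltu(0xFFFFFFFF, 1))
print(hex(ref_srl(0x80000000, 4)), hex(ref_sra(0x80000000, 4)))
```

Inputs with the most significant bit set are good hand-written corner cases, because they are exactly where `SLT`/`SLTU` and `SRA`/`SRL` produce different results.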
104 | 
105 | In the same directory, we’ve also provided a test vector generator ``ALUTestGen.py`` written in Python. We used this generator to generate the test vectors provided to you. If you’re curious, you can read the next paragraph and poke around in the file. If not, feel free to skip ahead to the next section.
106 | 
107 | The script ``ALUTestGen.py`` is located in ``tests``. Run it so that it generates a test vector file in the tests folder. Keep in mind that this script makes a couple assumptions that aren’t necessary and may differ from your implementation:
108 | 
109 | - Jump, branch, load and store instructions will use the ALU to compute the target address. 
110 | 
111 | - For all shift instructions, `A` is shifted by `B`. In other words, `B` is the shift amount.
112 | 
113 | - For the `LUI` instruction, the value to load into the register is fed in through the `B` input.
114 | 
115 | You can either match these assumptions or modify the script to fit with your implementation. All the methods to generate test vectors are located in the two Python dictionaries ``opcodes`` and ``functs``. The lambda methods contained (separated by commas) are respectively: the function that the operation should perform, a function to restrict the ``A`` input to a particular range, and a function to restrict the ``B`` input to a particular range.
116 | 
117 | **If you modify the Python script**, run the generator to make new test vectors. This will overwrite the ``testvectors.input`` file, so if you want to save your handwritten test vectors, rename the file before running the script, then append them once the file has been generated.
118 | ```shell
119 | python ALUTestGen.py
120 | ```
121 | This will write the test vector into the file ``testvectors.input``. Use this file as the target test vector
122 | file when loading the test vectors with ``$readmemb``.
123 | 
124 | ---
125 | 
126 | ### 5. Writing Verilog Modules
127 | For this exercise, we’ve provided the module interfaces for you. They are logically divided into a control (``ALUdec.v``) and a datapath (``ALU.v``). The datapath contains the functional units while control contains the necessary logic to drive the datapath. You will be responsible for implementing these two modules. Descriptions of the inputs and outputs of the modules can be found in the first few lines of each file. The ALU should take an ``ALUop`` and its two inputs ``A`` and ``B``, and provide an output dependent on the ``ALUop``. The operations that it needs to support are outlined in the Functional Specification. Don’t worry about sign extensions; they should take place outside of the ALU. The ALU decoder uses the ``opcode``, ``funct``, and ``add_rshift_type`` to determine the ``ALUop`` that the ALU should execute. The ``funct`` input corresponds to the ``funct3`` field from the ISA encoding table. The ``add_rshift_type`` input is used to distinguish between ``ADD/SUB``, ``SRA/SRL``, and ``SRAI/SRLI``; you will notice that each of these pairs has the same ``opcode`` and ``funct3``, but differs in the ``funct7`` field.
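
The decoding decision described above can be modelled in a few lines of Python before committing it to Verilog. This is only a sketch: the returned strings are placeholder names rather than the macros in ``ALUop.vh``, the constant names are made up for this example, and only the register-register and register-immediate arithmetic opcodes are handled.

```python
OPC_REG_REG = 0b0110011  # R-type arithmetic (ADD, SUB, SRL, SRA, ...)
OPC_REG_IMM = 0b0010011  # I-type arithmetic (ADDI, SRLI, SRAI, ...)

SIMPLE_OPS = {
    0b001: "SLL",
    0b010: "SLT",
    0b011: "SLTU",
    0b100: "XOR",
    0b110: "OR",
    0b111: "AND",
}


def decode_alu_op(opcode, funct3, add_rshift_type):
    """Return a placeholder ALU operation name, or None for unhandled opcodes."""
    if opcode not in (OPC_REG_REG, OPC_REG_IMM):
        return None  # branches, loads, stores, LUI, etc. are left out of this sketch
    if funct3 == 0b000:
        # There is no SUBI: for ADDI this bit belongs to the immediate,
        # so only the register-register form may select SUB.
        if opcode == OPC_REG_REG and add_rshift_type:
            return "SUB"
        return "ADD"
    if funct3 == 0b101:
        return "SRA" if add_rshift_type else "SRL"
    return SIMPLE_OPS[funct3]


print(decode_alu_op(OPC_REG_REG, 0b101, 1))
print(decode_alu_op(OPC_REG_IMM, 0b000, 1))
```

The `ADDI` case is one of the subtleties worth a dedicated test: a negative immediate sets the same instruction bit that distinguishes `SUB` from `ADD`.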
128 | 
129 | You will find the case statement useful, which has the following syntax:
130 | ```verilog
131 | always@(*) begin
132 |     case(foo)
133 |         3'b000: // something happens here
134 |         3'b001: // something else happens here
135 |         3'b010, 3'b011: // you can have more than
136 |                         // one case do the same thing
137 |         default: // everything else
138 | endcase end
139 | ```
140 | 
141 | To make your job easier, we have provided two Verilog header files: ``Opcode.vh`` and ``ALUop.vh``. They provide, respectively, macros for the opcodes and functs in the ISA and macros for the different ALU operations. You should feel free to change ``ALUop.vh`` to optimize the ``ALUop`` encoding, but if you change ``Opcode.vh``, you will break the test bench skeleton provided to you. You can use these macros by placing a backtick in front of the macro name, e.g.:
142 | 
143 | ```verilog
144 | case(opcode)
145 |  `OPC_STORE:
146 |  ```
147 | is the equivalent of:
148 | ```verilog
149 | case(opcode)
150 | 7'b0100011:
151 | ```
152 | 
153 | ---
154 | 
155 | ### 6. Running the Simulation
156 | 
157 | Open the file ``sim-rtl.yml`` and set the testbench’s name to ``ALUTestbench``.
158 | 
159 | ```yaml
160 | tb_name: &TB_NAME "ALUTestbench"
161 | ```
162 | 
163 | By typing ```make sim-rtl``` you will run the ALU simulation. You may change the testbench’s name to ```ALUTestVectorTestbench``` to use the test vector testbench.
164 | 
165 | Once you have a working design, you should see the following output when you run either of the given testbenches:
166 | ```shell
167 | # ALL TESTS PASSED! 
168 | ```
169 | 
170 | To clean the simulation directory from previous simulations’ files, type ``make clean``.
171 | 
172 | 
173 | ---
174 | 
175 | ### 7. Viewing Waveforms
176 | 
177 | As in the previous labs, you should use DVE to view waveforms.
178 | 1. List of the modules involved in the test bench. You can select one of these to have its signals show up in the object window.
179 | 2. Object window - this lists all the wires and regs in your module. You can add signals to the waveform view by selecting them, right-clicking, and doing Add To Wave.
180 | 3. Waveform viewer - The signals that you add from the object window show up here. You can navigate the waves by searching for specific values, or going forward or backward one transition at a time.
181 | As an example of how to use the waveform viewer, suppose you get the following output when you run ``ALUTestbench``:
182 | 
183 | ```shell
185 | # FAIL: Incorrect result for opcode 0110011, funct: 101:, add_rshift_type: 1
186 | #       A: 0x92153524, B: 0xffffde81, DUTout: 0x490a9a92, REFout: 0xc90a9a92
187 | ```
188 | 
189 | The ``$display()`` statement actually already tells you everything you need to know to fix your bug, but you’ll find that this is not always the case. For example, if you have an FSM and you need to look at multiple time steps, the waveform viewer presents the data in a much neater format. If your design had more than one clock domain, it would also be nearly impossible to tell what was going on with only ``$display()`` statements.
190 | 
191 | Add all the signals from ``ALUTestbench`` to the waveform viewer. In the resulting window, the two highlighted boxes contain the tools for navigation and zoom; hover over an icon to find out what it does. You can find the location (time) in the waveform viewer where the test bench failed by searching for the value of ``DUTout`` printed by the ``$display()`` statement above (in this case, ``0x490a9a92``):
192 | 
193 | 1. Select ``DUTout``.
194 | 2. Click Edit > Wave Signal Search > Search for Signal Value > ``0x490a9a92``.
195 | 
196 | Now you can examine all the other signal values at this time. Compare the ``DUTout`` and ``REFout`` values at this time, and you should see that they are similar but not quite the same. From the ``opcode``, ``funct``, and ``add_rshift_type``, you know that this is supposed to be an ``SRA`` instruction, but it looks like your ALU performed an ``SRL`` instead. However, you wrote
197 | ```verilog
198 | Out = A >>> B[4:0];
199 | ```
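200 | That looks like it should work, but it doesn’t! The ``>>>`` operator only shifts in copies of the sign bit when its left operand is signed, and ``A`` is unsigned by default, so zeros are shifted in. You need to tell Verilog to treat A as a signed
201 | number for SRA to work as you wish. You change the line to say: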
202 | ```verilog
203 | Out = $signed(A) >>> B[4:0];
204 | ```
205 | After making this change, you run the tests again and cross your fingers. Hopefully, you will see the line:
206 | ```shell
207 | # ALL TESTS PASSED!
208 | ```
209 | If not, you will need to debug your module until all tests from the test vector file and the hard-coded test cases pass.
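
The mismatch in the failing test above can be reproduced by hand. The following sketch is not part of the lab files: it uses shell arithmetic to compute a logical and an arithmetic right shift of `A` by `B[4:0]`. The helper names `srl32` and `sra32` are made up for this illustration, and the sketch assumes that the shell's `>>` shifts negative numbers arithmetically, which is the usual behavior on common platforms.

```shell
# Reproduce the failing test vector by hand (illustration only).
A=0x92153524
B=0xffffde81
shamt=$(( B & 0x1f ))   # B[4:0], which is 1 for this vector

# Logical right shift: the vacated upper bits are filled with zeros.
srl32() { printf '0x%08x\n' $(( ($1 & 0xffffffff) >> $2 )); }

# Arithmetic right shift: sign-extend the 32-bit value first, so the
# vacated upper bits are filled with copies of bit 31.
sra32() {
    local v=$(( (($1 & 0xffffffff) ^ 0x80000000) - 0x80000000 ))
    printf '0x%08x\n' $(( (v >> $2) & 0xffffffff ))
}

srl32 "$A" "$shamt"   # 0x490a9a92, the DUTout value
sra32 "$A" "$shamt"   # 0xc90a9a92, the REFout value
```

The two results differ only in the top bit, which is why ``DUTout`` and ``REFout`` look "similar but not quite the same" in the waveform.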
210 | 
211 | ---
212 | 
213 | ### 8. Checkpoint #1: Simple test program
214 | *Checkoff due: Apr 1 (Friday), 2022*
215 | 
216 | 
217 | Congratulations! You’ve started the design of your datapath by drawing a pipeline diagram, and written and thoroughly tested a key component in your processor. You should now be well-versed in testing Verilog modules. Please answer the following questions to be checked off by a TA:
218 | 1. Show your pipeline diagram, and explain when writes and reads occur in the register file and memory relative to the pipeline stages.
219 | 2. Show your working ALU test bench files to your TA and explain your hard-coded cases. You should also be able to show that the tests for the test vectors generated by the Python script and your hard-coded test vectors both work.
220 | 3. In ALUTestbench, the inputs to the ALU were generated randomly. When would it be preferable to perform an exhaustive test rather than a random test?
221 | 4. What bugs, if any, did your test bench help you catch?
222 | 5. For one of your bugs, come up with a short assembly program that would have failed had you not caught the bug. In the event that you had no bugs and wrote perfect code the first time, come up with an assembly program to stress the SRA bug mentioned in the above section.
223 | 
224 | ## Acknowledgement
225 | 
226 | This project is the result of the work of many EECS151/251 GSIs over the years including:
227 | Written By:
228 | - Nathan Narevsky (2014, 2017)
229 | - Brian Zimmer (2014)
230 | Modified By:
231 | - John Wright (2015,2016)
232 | - Ali Moin (2018)
233 | - Arya Reais-Parsi (2019)
234 | - Cem Yalcin (2019)
235 | - Tan Nguyen (2020)
236 | - Harrison Liew (2020)
237 | - Sean Huang (2021)
238 | - Daniel Grubb, Nayiri Krzysztofowicz, Zhaokai Liu (2021)
239 | - Dima Nikiforov (2022)
240 | 


--------------------------------------------------------------------------------
/lab1/spec.md:
--------------------------------------------------------------------------------
  1 | 
  2 | # EECS 151/251A ASIC Lab 1: Getting Around the Compute Environment
  3 | <p align="center">
  4 | Prof. Sophia Shao
  5 | </p>
  6 | <p align="center">
  7 | TAs (ASIC): Dima Nikiforov
  8 | </p>
  9 | <p align="center">
 10 | Department of Electrical Engineering and Computer Science
 11 | </p>
 12 | <p align="center">
 13 | College of Engineering, University of California, Berkeley
 14 | </p>
 15 | 
 16 | ## Overview
 17 | 
 18 | The process of VLSI design is different from developing software, designing analog circuits, and even FPGA-based design. Instead of using a single graphical user interface (GUI) or environment (e.g. Eclipse, Cadence Virtuoso, or Xilinx Vivado), VLSI design is done using dozens of command line interface tools on a Linux machine.  These tools primarily use text files as their inputs and outputs, and include GUIs mainly for visualization rather than design.  Therefore, familiarity with Linux, text manipulation, and scripting is required to successfully complete the labs this semester.
 19 | 
 20 | The goal of this lab is to introduce some basic techniques needed to use the computer aided design (CAD) tools that are taught in this class. Mastering the topics in this lab will help you save hours of time in later labs and make you a much more efficient chip designer. While you go through this lab, focus on how these techniques will allow you to automate tasks and improve your efficiency. Chip design requires plenty of iteration, so being able to perform trials and identify errors quickly is key to success.
 21 | 
 22 | ## Administrative Info
 23 | 
 24 | This lab, like all labs, will be turned in electronically using Gradescope. Please upload a PDF document with the answers to the six questions in the lab.
 25 | 
 26 | ### Getting an Instructional Account
 27 | 
 28 | You are required to get an EECS instructional account to login to the workstations in the lab, since you will be doing all your work on these machines (whether you're working remotely or in-person). This can be done by using WebAcct here: http://inst.eecs.berkeley.edu/webacct.
 29 | 
 30 | Once you log in using your CalNet ID, you can click on 'Get a new account' in the eecs151 row. Once the account has been created, you can email your class account form to yourself to have a record of your account information.  You can follow the instructions on the emailed form to change your Linux password by running `ssh update.eecs.berkeley.edu` and following the prompts.
 31 | 
 32 | ## Logging into the Classroom Servers
 33 | 
 34 | The servers used for this class are primarily `eda-[1-11].eecs.berkeley.edu`.  You may also use the `c111-[1-17].eecs.berkeley.edu` machines
 35 | (which are physically located in Cory 111/117), although those will be shared with the FPGA lab. You can access all of these machines remotely through SSH.
 36 | 
 37 | ### Remote Access
 38 | 
 39 | It is important that you can remotely access the instructional servers. There are two convenient ways to remotely access our
 40 | lab machines: SSH (Secure SHell) and X2Go.
 41 | First, select a machine. The accessible machines are `eda-X`, where X is a number from 1 to 11,
 42 | and `c111-X`, where X is a number from 1 to 17. The fully qualified DNS name (FQDN) of
 43 | your machine is then `eda-X.eecs.berkeley.edu` or `c111-X.eecs.berkeley.edu`. For example,
 44 | if you select machine `eda-8`, the FQDN would be `eda-8.eecs.berkeley.edu`.
 45 | You can use any lab machine, but our lab machines aren’t very powerful; if everyone
 46 | uses the same one, everyone will find that their jobs perform poorly. ASIC design tools are resource
 47 | intensive and will not run well when there are too many simultaneous users on these machines. We
 48 | recommend that every time you log into a machine, you check its load at https://hivemind.eecs.berkeley.edu/
 49 | (for the `eda-X` machines) or with `top` once you are logged in. If it is heavily loaded, consider
 50 | using a different machine. If you notice other `eecs151` users with jobs consuming excessive
 51 | resources, feel free to reach out to the GSIs about it.
 52 | Next, note your instructional class account name - the one that looks like `eecs151-YYY`, for example
 53 | `eecs151-abc`. This is the account you created at the start of this lab.
 54 | 
 55 | 
 56 | #### SSH: Linux, BSD, macOS
 57 | 
 58 | SSH is the de facto remote terminal tool for Linux and BSD systems (which includes macOS). It
 59 | lets you login to a text console from anywhere (as long as you have network connectivity). SSH
 60 | also comes as a standard utility in almost all Linux and BSD systems.
 61 | If you’re using Linux or BSD, you should be able to access your workstation through SSH by running:
 62 | 
 63 | ```shell
 64 | ssh eecs151-YYY@eda-X.eecs.berkeley.edu
 65 | ```
 66 | 
 67 | In our examples, this would be:
 68 | 
 69 | ```shell
 70 | ssh eecs151-abc@eda-8.eecs.berkeley.edu
 71 | ```
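
If you switch machines often, a small shell function can build the host name for you. This is only a sketch: `lab_fqdn` is a hypothetical helper, not something provided by the course, and the account name is the example one used above.

```shell
# Hypothetical helper (not part of the lab files): expand a short
# machine name such as eda-8 into its fully qualified name.
lab_fqdn() { printf '%s.eecs.berkeley.edu\n' "$1"; }

user="eecs151-abc"          # example account name from the text above
host="$(lab_fqdn eda-8)"

# Print the command instead of running it, so this is safe to try anywhere.
echo "ssh ${user}@${host}"
```

Replacing `echo` with the command itself would open the connection.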
 72 | 
 73 | The SSH protocol also enables file transfer between your local and lab machines via the `sftp` and
 74 | `scp` utilities. **WARNING: please only transfer files needed for your reports and nothing else, particularly files relating to CAD tool commands or process technologies!!!**
 75 | 
 76 | 
 77 | #### SSH: Windows
 78 | 
 79 | The classic and most lightweight way to use SSH on Windows is PuTTY (https://www.putty.org/). Download it and login with the FQDN above as the Host and your instructional account
 80 | username. You can also use WinSCP (winscp.net) for file transfer over SSH.
 81 | Advanced users may wish to install Windows Subsystem for Linux (https://docs.microsoft.com/en-us/windows/wsl/install-win10, Windows 10 build 16215 or later) or Cygwin (cygwin.com) and use SSH, SFTP, and SCP through there.
 82 | 
 83 | 
 84 | #### SSH Session Management
 85 | 
 86 | Because all your work will be done remotely, we recommend that you utilize SSH session management tools and that all terminal-based work be done over SSH. This would allow your remote terminal sessions to remain active even if your SSH session disconnects, intentionally or not.
 87 | The two most common session managers are tmux and screen. These run persistently on the
 88 | remote workstation, are highly customizable, and can greatly improve your productivity.
 89 | Here are some good tmux and screen tutorials:
 90 | * https://www.hamvocke.com/blog/a-quick-and-easy-guide-to-tmux/
 91 | * https://www.rackaid.com/blog/linux-screen-tutorial-and-how-to/
 92 | 
 93 | 
 94 | #### X2Go
 95 | 
 96 | For situations in which you need a graphical interface (waveform debugging, layout viewing, etc.),
 97 | you should use X2Go. This is a faster and more reliable alternative to traditional X forwarding over SSH. X2Go is also recommended because it connects to a persistent graphical
 98 | desktop environment, which continues running even if your internet connection drops.
 99 | Download the X2Go client for your platform from the website: https://wiki.x2go.org/doku.php/download:start.
100 | 
101 | Note: macOS sometimes blocks the X2Go download/install. If it does, follow the directions here: https://support.apple.com/en-us/HT202491.
102 | 
103 | To use X2Go, you need to create a new session (look under the Session menu). Give the session any
104 | name, it doesn’t matter, but set the Host field to the FQDN of your lab machine and the User field
105 | to your instructional account username. For “Session type”, select “GNOME”. Here’s an example from macOS:
106 | 
107 | <p align="center">
108 | <img src="./figs/x2gomacos.png" width="500" />
109 | </p>
110 | 
111 | 
112 | ### Getting Started
113 | 
114 | After you log in to one of these servers, you are ready to start the lab.  You have a limited amount of space in your home directory, so we recommend completing work in the `/scratch/` directory, and then copying any important results to your home directory.
115 | 
116 | To begin, get the lab files by typing the following commands:
117 | 
118 | ```shell
119 | mkdir /scratch/<your-eecs-username>
120 | cd /scratch/<your-eecs-username>
121 | git clone /home/ff/eecs151/labs/lab1
122 | cd lab1
123 | ```
124 | 
125 | 
126 | 
127 | ## Linux Basics
128 | 
129 | You will need to learn how to use Linux so that you can understand what programs are running
130 | on the server, manipulate files, launch programs, and debug problems. Please read through the
131 | tutorial here: http://linuxcommand.org/lc3_learning_the_shell.php
132 | 
133 | To use the CAD tools in this class, you will need to load the class environment. All of the tools
134 | are already installed on the network filesystem, but by default users do not have the tools in their
135 | path. Try locating a program that is already installed (vim) and another which is not (innovus)
136 | by default:
137 | 
138 | ```shell
139 | which vim
140 | which innovus
141 | ```
142 | 
143 | The vim program has been installed in `/usr/bin/vim`. If you show the contents of `/usr/bin`,
144 | you will notice that you can launch any of the programs there by typing its filename. This is because
145 | `/usr/bin` is in the environment variable `$PATH`, which contains the directories to search as a
146 | colon-separated list.
147 | 
148 | ```shell
149 | echo $PATH
150 | ```
151 | 
152 | To be able to access the CAD tools, you will need to append their location to the `$PATH` variable:
153 | 
154 | ```shell
155 | source /home/ff/eecs151/tutorials/eecs151.bashrc
156 | echo $PATH
157 | which innovus
158 | ```
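
To see what a setup script like this changes, it can help to view `$PATH` one directory per line. The sketch below also shows one common way to append a directory without adding it twice. The directory `/tmp/example-tools/bin` is a placeholder chosen for illustration; it is not where the class tools live.

```shell
# Print each directory in $PATH on its own line, in search order.
echo "$PATH" | tr ':' '\n'

# Hypothetical tool directory, used only for illustration.
tool_dir="/tmp/example-tools/bin"

# Append the directory only if it is not already listed, so that
# sourcing a setup script twice does not grow $PATH.
case ":$PATH:" in
    *":$tool_dir:"*) ;;                 # already present, nothing to do
    *) PATH="$PATH:$tool_dir" ;;
esac

# Show the last entry of the updated list.
echo "$PATH" | tr ':' '\n' | tail -n 1
```

Directories are searched from left to right, so an appended directory is consulted only when no earlier entry provides the program.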
159 | 
160 | 
161 | #### Question 1: Common terminal tasks
162 | 
163 | For 1-6 below, submit the command/keystrokes needed to generate the desired result.  For 1-4, try generating only the desired result (no extraneous info). 
164 | 
165 | 1. List the 5 most recently modified items in `/usr/bin`
166 | 2. What directory is `git` installed in?
167 | 3. Show the hidden files in your lab directory (the one you cloned from `/home/ff/eecs151/labs/lab1`)
168 | 4. What version of Vim is installed? Describe how you figured this out.
169 | 5. Copy the files in this lab to `/scratch` and then delete the copy.
170 | 6. Run `ping www.google.com`, suspend it, then kill the process. Then run it in the background, report its PID, then kill the process.
171 | 7. Run `top` and report the average CPU load, the highest CPU job, and the amount of memory used (just report the results for this question; you don't need to supply the command/how you got it).
172 | 
173 | 
174 | There are a few miscellaneous commands to analyze disk usage on the servers.
175 | 
176 | ```shell
177 | du -ch --max-depth=1 .
178 | df -H
179 | ```
180 | 
181 | Finally, your instructional accounts have disk usage quotas. Find out how much you are allocated
182 | and how much you are using:
183 | 
184 | ```shell
185 | quota -s
186 | ```
187 | 
188 | By default, you should be using the Bash shell (these labs are designed for Bash, not Csh). The
189 | Bash Guide (guide.bash.academy) is a great resource for users at all levels of Bash proficiency.
190 | 
191 | 
192 | ## Using Text Editors
193 | 
194 | Much of the time you spend designing chips will go to writing scripts in a text editor.
195 | Therefore becoming proficient at editing text is a vital skill. Unlike Java or C programming, there
196 | is no integrated development environment (IDE) for writing these scripts. However, many of the
197 | advantages of IDEs can be obtained by using the proper editor. In this class, we will be using
198 | either Vim or Emacs. Editors such as gedit or nano are not allowed.
199 | 
200 | If you have never used Vim, please follow the tutorial here: http://www.openvim.com/tutorial.html (If you would prefer to learn Emacs, you can read http://www.gnu.org/software/emacs/tour/ and run the Emacs built-in tutorial with Ctrl-h followed by t). Feel free to search for other
201 | resources online to learn more.
202 | 
203 | #### Question 2: Common editor tasks
204 | 
205 | For each task below, describe the keys you need to press to accomplish the action in the file `force_regs.ucli`.
206 | 
207 | 1. Delete 5 lines
208 | 2. Search for the text `clock`
209 | 3. Replace the text `dut` with `device_under_test`
210 | 4. Jump to the end of the file
211 | 5. Go to line 42
212 | 6. Reload the file (in case it was modified in another window)
213 | 7. Save and exit
214 | 
215 | #### Alternative Editors
216 | 
217 | While Vim is a powerful editor and ubiquitous in Linux environments, there are alternatives that might be more suitable for different use cases. A modern graphical text editor is Visual Studio Code, which supports editing text files through an SSH session. Because Visual Studio Code renders text on the client machine, it can be useful for students with high-latency or irregular internet connections, where X2Go or Vim over SSH can feel unresponsive. To set up Visual Studio Code for remote development, please follow the tutorial here: https://code.visualstudio.com/docs/remote/ssh-tutorial
218 | 
219 | ## Regular Expressions
220 | 
221 | Regular expressions allow you to perform complex ’Search’ or ’Search and Replace’ operations.
222 | Please work through the tutorial here: http://regexone.com
223 | 
224 | Regular expressions can be used from many different programs: Vim, Emacs, grep, sed, Python,
225 | etc. From the command line, use grep to search, and sed to search and replace.
226 | 
227 | Unfortunately, deciding which characters need to be escaped can be somewhat confusing. For
228 | example, to find all instances of `dcdc_unit_cell_x`, where `x` is a single digit number, using grep:
229 | 
230 | ```shell
231 | grep "unit_cell_[0-9]\{1\}\." force_regs.ucli
232 | ```
233 | 
234 | And you can do the same search in Vim:
235 | 
236 | ```vim
237 | vim force_regs.ucli
238 | /unit_cell_[0-9]\{1\}\.
239 | ```
240 | 
241 | Notice how you need to be careful about which characters get escaped (the `[` is not escaped but `{` is). Now
242 | imagine we want to add a leading 0 to all of the single digit numbers. The match string in sed
243 | could be:
244 | 
245 | ```shell
246 | sed -e 's/\(unit_cell_\)\([0-9]\{1\}\.\)/\10\2/' force_regs.ucli
247 | ```
248 | 
249 | sed, Vim, and grep all use “Basic Regular Expressions” by default. For regular expressions heavy
250 | with special characters, sometimes it makes more sense to assume most characters except `a-zA-Z0-9`
251 | have special meanings (and they get escaped with `\` only to match them literally). This is called
252 | “Extended Regular Expressions”, and `?+{}()` no longer need to be escaped. A great resource
253 | for learning more is http://en.wikipedia.org/wiki/Regular_expression#POSIX_basic_and_extended. In Vim, you can do this with `\v`:
254 | 
255 | ```vim
256 | :%s/\v(unit_cell_)([0-9]{1}\.)/\10\2/
257 | ```
258 | 
259 | And in sed, you can use the -r flag:
260 | 
261 | ```shell
262 | sed -r -e 's/(unit_cell_)([0-9]{1}\.)/\10\2/' force_regs.ucli
263 | ```
264 | 
265 | And in grep, you can use the -E flag:
266 | 
267 | ```shell
268 | grep -E "unit_cell_[0-9]{1}\." force_regs.ucli
269 | ```
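
To check that the basic and extended forms above really select the same lines, you can try them on a tiny file. The sketch below builds its own three-line sample; the contents are made up for illustration and are not taken from the real `force_regs.ucli`.

```shell
# Build a small sample file (made-up contents, not the real force_regs.ucli).
sample="$(mktemp)"
cat > "$sample" <<'EOF'
force top.dcdc_unit_cell_3.clk 0
force top.dcdc_unit_cell_12.clk 1
force top.dcdc_unit_cell_7.reset 0
EOF

# Basic syntax: braces must be escaped to act as a repetition count.
bre="$(grep 'unit_cell_[0-9]\{1\}\.' "$sample")"

# Extended syntax: braces are special without a backslash.
ere="$(grep -E 'unit_cell_[0-9]{1}\.' "$sample")"

# Both select the single-digit cells (3 and 7) and skip unit_cell_12.
echo "$bre"
[ "$bre" = "$ere" ] && echo "BRE and ERE results match"

# The basic-syntax sed substitution shown earlier pads those same
# lines with a leading zero and leaves unit_cell_12 alone.
padded="$(sed -e 's/\(unit_cell_\)\([0-9]\{1\}\.\)/\10\2/' "$sample")"
echo "$padded"
rm -f "$sample"
```

The trailing `\.` in the pattern is what excludes `unit_cell_12`: after one digit the pattern demands a literal dot, and `12.` has a second digit there instead.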
270 | 
271 | sed and grep can be used for many purposes beyond text search and replace. For example, to find
272 | all files in the current directory with filenames that contain a specific text string:
273 | 
274 | ```shell
275 | find . | grep "\.ucli"
276 | ```
277 | 
278 | Or to delete all lines in a file that contain a string:
279 | 
280 | ```shell
281 | sed -e '/reset/d' force_regs.ucli
282 | ```
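
Note that `sed` prints the edited text to standard output and leaves the input file unchanged unless you redirect the result. The sketch below demonstrates this on a throwaway file; the file names and contents are invented for illustration. It also shows that `find` can match file names on its own with `-name`.

```shell
workdir="$(mktemp -d)"
printf 'force a.reset 0\nforce a.clk 1\n' > "$workdir/demo.ucli"

# sed writes the filtered text to stdout; the input file is left untouched.
sed -e '/reset/d' "$workdir/demo.ucli"

# Redirect to a new file to keep the result.
sed -e '/reset/d' "$workdir/demo.ucli" > "$workdir/demo.noreset.ucli"

# find can also match names itself, without piping into grep.
find "$workdir" -name '*.ucli'

lines_before="$(wc -l < "$workdir/demo.ucli")"
lines_after="$(wc -l < "$workdir/demo.noreset.ucli")"
echo "original: $lines_before lines, filtered copy: $lines_after lines"
rm -rf "$workdir"
```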
283 | 
284 | #### Question 3: Fun with Regular Expressions
285 | 
286 | For each regular expression, provide an answer for both basic and extended mode (`sed` and `sed -r`).
287 | You are allowed to use multiple commands to perform each task. Operate on the `force_regs.ucli` file.
288 | 
289 | 1. Change all x's surrounding numbers to angle brackets. For example, `regx15xx79x` becomes `reg<15><79>`. Hint: remember to enable global substitution.
290 | 2. Make every number in the file be exactly 3 digits with padded leading zeros (except the last 0 on each line). E.g., lines 120 and 121 should read:
291 | 
292 | ```
293 | force -deposit rocketTestHarness.dut.Raven003Top_withoutPads.TileWrap.
294 | ... .io_tilelink_release_data.sync_w002r.rq002_wptr_regx000x.Q 0
295 | force -deposit rocketTestHarness.dut.Raven003Top_withoutPads.TileWrap.
296 | ... .io_tilelink_release_data.fifomem.mem_regx015xx098x.Q 0
297 | ```
298 | 
299 | 
300 | ## File Permissions
301 | 
302 | A tutorial about file permissions can be found here: http://www.tutorialspoint.com/unix/unix-file-permission.htm
303 | 
304 | #### Question 4: Understanding File Permissions
305 | 
306 | For each task below, please provide the commands that result in the correct permissions being set. Make no assumptions about the file's existing permissions. Operate on the `run_always.sh` script.
307 | 
308 | 1. Change the script to be executable by you and no one else.
309 | 2. Add permissions for everyone in your group to be able to execute the same script.
310 | 3. Make the script writable by you and everyone in your group, but unreadable by others.
311 | 4. Change the owner of the file to be `eecs151` (Note: you will not be able to execute this command, so just provide the command itself)
312 | 
313 | 
314 | ## Using Makefiles
315 | 
316 | Makefiles are a simple way to string together a bunch of different shell tasks in an intelligent
317 | manner. This lets you automate repetitive tasks and save time,
318 | since make only rebuilds targets whose dependencies have changed. Please read
319 | through the following tutorial here: http://www.cs.colby.edu/maxwell/courses/tutorials/maketutor/ (optional). Further documentation on make can be found here: http://www.gnu.org/software/make/manual/make.html.
320 | 
321 | Let’s look at a simple makefile to explain a few things about how they work - this is not meant to
322 | be anything more than a very brief overview of what a makefile is and how it works. If you look at
323 | the Makefile in the provided folder in your favorite text editor, you can see the following lines:
324 | 
325 | ```shell
326 | output_name = force_regs.random.ucli
327 | 
328 | $(output_name): force_regs.ucli
329 |     awk 'BEGIN{srand();}{if ($$1 != "") { print $$1,$$2,$$3,int(rand()*2)}}' $< > $@
330 | 
331 | clean:
332 |     rm -f $(output_name)
333 | ```
334 | 
335 | While this may look like a lot of random characters, let us walk through each part of it to see that
336 | it really is not that complicated.
337 | 
338 | Makefiles are generally composed of rules, which tell Make how to execute a set of commands to
339 | build a set of targets from a set of dependencies. A rule typically has this structure:
340 | 
341 | ```shell
342 | targets: dependencies
343 |     commands
344 | ```
345 | 
346 | **It is very important that indentation in Makefiles uses tabs, not spaces.**
347 | The two rules in the above Makefile have the targets `clean` and `$(output_name)`. Here,
348 | `output_name` is the name of a variable within the Makefile, which means that it can be overridden
349 | from the command line. This can be done with the following command:
350 | 
351 | ```shell
352 | make output_name=foo.txt
353 | ```
354 | 
355 | This will result in the output being written to `foo.txt` instead of `force_regs.random.ucli`.
356 | Generally, a rule will run every time its dependencies have been updated more recently than
357 | its own targets, so by editing/updating the `force_regs.ucli` file (including via the `touch` command), you can regenerate the `output_name` target. This is different from a bash script such as `runalways.sh`, which will always generate `force_regs.random.ucli` regardless of whether
358 | `force_regs.ucli` is updated or not.
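
The timestamp comparison described above can be imitated with the shell's `-nt` ("newer than") file test. This is only a rough sketch of the idea, not how Make is implemented: `needs_rebuild` is a hypothetical helper, the files are empty stand-ins created in a temporary directory, and the dates are arbitrary.

```shell
workdir="$(mktemp -d)"
dep="$workdir/force_regs.ucli"
target="$workdir/force_regs.random.ucli"

# Give the files explicit timestamps: the target is newer than its dependency.
touch -t 202201010000 "$dep"
touch -t 202201020000 "$target"

# Hypothetical helper: succeeds when the dependency is newer than the target.
needs_rebuild() { [ "$1" -nt "$2" ]; }

needs_rebuild "$dep" "$target" && echo "rebuild" || echo "up to date"   # up to date

# Updating the dependency makes it newer than the target.
touch -t 202201030000 "$dep"
needs_rebuild "$dep" "$target" && echo "rebuild" || echo "up to date"   # rebuild
```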
359 | 
360 | Inside the `$(output_name)` rule, the `awk` command contains several doubled `$$` characters. This is because
361 | in normal `awk` the field variables are written `$1`, `$2`, and so on, but in a Makefile you have to escape each `$`
362 | so that Make passes it through to the shell instead of expanding it as a Make variable. In Make, the character to do that is `$`, so `$1` is written `$$1`.
363 | 
364 | The other characters after the awk script are also special characters to make. The `$<` is the first
365 | dependency of that target, the `>` simply redirects the output of awk, and the `$@` is the name of the
366 | target itself. This allows users to create makefiles that can be reusable, since you are operating on
367 | a dependency and outputting the result into the name of your own target.
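
To see the effect of the escaping, here is the same `awk` program as you would type it directly in a shell, with single `$` signs. The input lines are made up to resemble `force_regs.ucli`, so treat this as a sketch rather than the exact lab data.

```shell
# Made-up input lines in the same shape as force_regs.ucli,
# including one blank line that the awk program should skip.
input="$(mktemp)"
printf 'force -deposit top.a.Q 0\n\nforce -deposit top.b.Q 0\n' > "$input"

# Typed directly in a shell, awk fields use a single dollar sign.
# Inside a Makefile recipe the same fields are written $$1, $$2, $$3,
# because Make consumes one dollar sign before the shell sees the command.
result="$(awk 'BEGIN{srand();}{if ($1 != "") { print $1,$2,$3,int(rand()*2)}}' "$input")"
echo "$result"
rm -f "$input"
```

Each non-blank line keeps its first three fields, and the fourth field is replaced by a random 0 or 1, which is what the Makefile rule writes into `force_regs.random.ucli`.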
368 | 
369 | #### Question 5: Makefile Targets
370 | 
371 | 1. Add a new make rule that will create a file called `foo.txt`.  Make it also run the `output_name` rule.
372 | 2. Name at least two ways that you could have the makefile regenerate the `output_name` target after its rule has been run.
373 | 
374 | 
375 | ## Comparing Files
376 | 
377 | Comparing text files is another useful skill. The tools generally behave as black
378 | boxes, so comparing output files to prior output files is an important debugging technique.
379 | 
380 | From the command line, you can use `diff` to compare files:
381 | 
382 | ```shell
383 | diff force_regs.ucli force_regs.random.ucli
384 | ```
385 | 
386 | You can also compare the contents of directories (the `-q` flag will summarize the results to only
387 | show the names of the files that differ, and the `-r` flag will recurse through subdirectories).
388 | For Vim users, there is a useful built-in `diff` tool:
389 | 
390 | ```shell
391 | vimdiff force_regs.ucli force_regs.random.ucli
392 | ```
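
As a sketch of the directory comparison mentioned above, the following creates two small made-up "run" directories and compares them. All file names and contents are invented for illustration.

```shell
workdir="$(mktemp -d)"
mkdir "$workdir/run1" "$workdir/run2"
printf 'a 0\nb 0\n' > "$workdir/run1/out.txt"
printf 'a 0\nb 1\n' > "$workdir/run2/out.txt"
printf 'same\n' > "$workdir/run1/log.txt"
printf 'same\n' > "$workdir/run2/log.txt"

# diff exits with status 0 when the files match and 1 when they differ,
# which makes it easy to use in scripts.
if diff "$workdir/run1/log.txt" "$workdir/run2/log.txt" > /dev/null; then
    echo "log.txt: identical"
fi

# -q only names the files that differ; -r descends into subdirectories.
# "|| true" keeps a script running even though diff reports differences.
summary="$(diff -qr "$workdir/run1" "$workdir/run2" || true)"
echo "$summary"
rm -rf "$workdir"
```

Only `out.txt` is reported, since the two `log.txt` files are identical.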
393 | 
394 | 
395 | ## Version Control with Git
396 | 
397 | Version control systems help track how files change over time and make it easier for collaborators
398 | to work on the same files and share their changes. We use git to distribute the lab files so that
399 | bug fixes can easily be incorporated into your files. Please go through the following tutorial:
400 | https://try.github.io
401 | 
402 | #### Question 6: Checking Git Understanding
403 | 
404 | Submit the command required to perform the following tasks:
405 | 
406 | 1. What is the difference between your current Makefile and the file you started with?
407 | 2. How do you make a new branch?
408 | 3. What is the SHA of the version you checked out?
409 | 
410 | ## Customization
411 | 
412 | Many of the commands and tools you will use on a daily basis can be customized. This can
413 | dramatically improve your productivity. Some tools (e.g. vim and bash) are customized using “dotfiles,” which are hidden files in your home directory (e.g. `.bashrc` and `.vimrc`) that contain a series of commands which set variables, create aliases, or change settings. Try adding the following lines to your `.bashrc`, then restart your session or run
414 | `source ~/.bashrc`. Now when you change directories, you no longer need to type `ls` to show the directory contents.
415 | 
416 | ```shell
417 | function cd {
418 |     builtin cd "$@" && ls -F
419 | }
420 | ```
421 | 
422 | The following links are useful for learning how to make some common customizations. You should
423 | read these but are not required to turn in anything for this section.
424 | * https://www.digitalocean.com/community/tutorials/an-introduction-to-useful-bash-aliases-and-functions
425 | * http://statico.github.io/vim.html
426 | 
427 | 
428 | ## Lab Deliverables
429 | 
430 | ### Lab Due: 11 AM, Friday January 28th, 2022
431 | 
432 | - Submit a written report with all 6 questions answered to Gradescope
433 | 
434 | ## Acknowledgement
435 | 
436 | This lab is the result of the work of many EECS151/251 GSIs over the years including:
437 | Written By:
438 | - Nathan Narevsky (2014, 2017)
439 | - Brian Zimmer (2014)
440 | Modified By:
441 | - John Wright (2015,2016)
442 | - Ali Moin (2018)
443 | - Arya Reais-Parsi (2019)
444 | - Cem Yalcin (2019)
445 | - Tan Nguyen (2020)
446 | - Harrison Liew (2020)
447 | - Sean Huang (2021)
448 | - Daniel Grubb, Nayiri Krzysztofowicz, Zhaokai Liu (2021)
449 | - Dima Nikiforov (2022)
450 | 


--------------------------------------------------------------------------------
/lab3/spec.md:
--------------------------------------------------------------------------------
  1 | # EECS 151/251A ASIC Lab 3: Logic Synthesis
  2 | <p align="center">
  3 | Prof. Sophia Shao
  4 | </p>
  5 | <p align="center">
  6 | TAs (ASIC): Dima Nikiforov
  7 | </p>
  8 | <p align="center">
  9 | Department of Electrical Engineering and Computer Science
 10 | </p>
 11 | <p align="center">
 12 | College of Engineering, University of California, Berkeley
 13 | </p>
 14 | 
 15 | 
 16 | 
 17 | ## Overview
 18 | For this lab, you will learn how to translate RTL code into a gate-level netlist in a process called
 19 | synthesis. In order to successfully synthesize your design, you will need to understand how to
 20 | constrain your design, learn how the tools optimize logic and estimate timing, analyze the critical
 21 | path of your design, and simulate the gate-level netlist.
 22 | To begin this lab, get the project files by typing the following commands:
 23 | 
 24 | ```shell
 25 | git clone /home/ff/eecs151/labs/lab3.git
 26 | cd lab3
 27 | ```
 28 | 
 29 | You should add the following lines to the `.bashrc` file in your home folder
 30 | (for more information about what `.bashrc` does, see https://www.tldp.org/LDP/abs/html/sample-bashrc.html)
 31 | so that every time
 32 | you open a new terminal you have the paths for the tools set up properly.
 33 | 
 34 | ```shell
 35 | source /home/ff/eecs151/tutorials/eecs151.bashrc
 36 | export HAMMER_HOME=/home/ff/eecs151/hammer
 37 | source ${HAMMER_HOME}/sourceme.sh
 38 | ```
 39 | 
 40 | Type
 41 | 
 42 | ```shell
 43 | which genus
 44 | ```
 45 | 
 46 | to see if the shell prints out the path to the Cadence Genus Synthesis program (which we will be
 47 | using for this lab). If it does not work, add the lines to your `.bash_profile` in your home folder
 48 | as well. Try to open a new terminal to see if it works. The file `eecs151.bashrc` sets various
 49 | environment variables in your system such as where to find the CAD programs or license servers.
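
If you want to check several tools at once, a short loop around `command -v` works in any POSIX shell. In this sketch, `check_tool` is a hypothetical helper written for illustration; `genus` and `innovus` are the program names used in these labs.

```shell
# Hypothetical helper: report where a program would be found on $PATH.
check_tool() {
    if command -v "$1" > /dev/null 2>&1; then
        echo "$1: $(command -v "$1")"
    else
        echo "$1: not found on PATH"
    fi
}

# Check each tool; on a machine without the class environment loaded,
# both are reported as not found.
for tool in genus innovus; do
    check_tool "$tool"
done
```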
 50 | 
 51 | 
 52 | ## Synthesis Environment
 53 | To perform synthesis, we will be using Cadence Genus. However, we will not be interfacing with
 54 | Genus directly; instead, we will use Hammer. Just like in lab 2, we have set up the basic Hammer
 55 | flow for your lab exercises using a Makefile.
 56 | 
 57 | In this lab repository, you will see two sets of input files for Hammer. The first set is
 58 | the source code for our design, which you will explore in the next section. The second set consists of
 59 | some YAML files (`inst-env.yml`, `asap7.yml`, `design.yml`, `sim-rtl.yml`, `sim-gl-syn.yml`) that
 60 | configure the Hammer flow. Of these YAML files, you should only need to modify `design.yml`,
 61 | `sim-rtl.yml` and `sim-gl-syn.yml` in order to configure the synthesis and simulation for your
 62 | design.
 63 | 
 64 | 
 65 | Hammer is already set up at `/home/ff/eecs151/hammer` with all the required plugins for Cadence
 66 | Synthesis (Genus) and Place-and-Route (Innovus), Synopsys Simulator (VCS), and Mentor Graphics
 67 | DRC and LVS (Calibre). You should not need to install it in your own home directory. **These
 68 | Hammer plugins are under NDA. They are provided to us for educational purposes.
 69 | They should never be copied outside of instructional machines under any circumstances, or else we risk losing access to the tools in the future!!!**
 70 | 
 71 | Let us take a look at some parts of the `design.yml` file:
 72 | 
 73 | ```yaml
 74 | gcd.clockPeriod: &CLK_PERIOD "1ns"
 75 | ```
 76 | 
 77 | This option sets the target clock speed for our design. A more stringent target (a shorter clock
 78 | period) will make the tool work harder and use higher-power gates to meet the
 79 | constraints. A more relaxed timing target allows the tool to focus on reducing area and/or power.
 80 | In `sim-rtl.yml`:
 81 | 
 82 | ```yaml
 83 | defines:
 84 |   - "CLOCK_PERIOD=1.00"
 85 | ```
 86 | 
 87 | This option sets the clock period used during simulation. It is generally useful to separate the two as
 88 | you might want to see how the circuit performs under different clock frequencies without changing
 89 | the design constraints. Continuing from `design.yml`:
 90 | 
 91 | ```yaml
 92 | gcd.verilogSrc: &VERILOG_SRC
 93 |   - "src/gcd.v"
 94 |   - "src/gcd_datapath.v"
 95 |   - "src/gcd_control.v"
 96 | ```
 97 | 
 98 | and in `sim-rtl.yml`:
 99 | 
100 | ```yaml
101 | sim.inputs:
102 |   input_files:
103 |     - "src/gcd.v"
104 |     - "src/gcd_datapath.v"
105 |     - "src/gcd_control.v"
106 |     - "src/gcd_testbench.v"
107 | ```
108 | 
109 | These specify the files for synthesis and simulation. Moving on, we have:
110 | 
111 | ```yaml
112 | vlsi.inputs.clocks: [
113 |   {name: "clk", period: *CLK_PERIOD, uncertainty: "0.1ns"}
114 | ]
115 | ```
116 | 
117 | This is where we specify to Hammer that we intend on using the `CLK_PERIOD` we defined earlier
118 | as the constraint for our design. We will see more detailed constraints in later labs.
119 | 
120 | ## Understanding the example design
121 | We have provided a circuit described in Verilog that computes the greatest common divisor (GCD)
122 | of two numbers. Unlike the FIR filter from the last lab, in which the testbench constantly provided
123 | stimuli, the GCD algorithm takes a variable number of cycles, so the testbench needs to know when
124 | the circuit is done to check the output. This is accomplished through a “ready/valid” handshake
125 | protocol. This protocol shows up in many places in digital circuit design.
126 | See [this document](https://inst.eecs.berkeley.edu/~eecs151/fa21/files/verilog/ready_valid_interface.pdf) on the course website for more background.
127 | The GCD top level is shown in the figure below.
128 | 
129 | <p align="center">
130 | <img src="./figs/block-diagram.png" width="600" />
131 | </p>
132 | 
133 | The GCD module declaration is as follows:
134 | 
135 | ```v
136 | module gcd#( parameter W = 16 )
137 | (
138 |   input clk, reset,
139 |   input [W-1:0] operands_bits_A,    // Operand A
140 |   input [W-1:0] operands_bits_B,    // Operand B
141 |   input operands_val,               // Are operands valid?
142 |   output operands_rdy,              // ready to take operands
143 | 
144 |   output [W-1:0] result_bits_data,  // GCD
145 |   output result_val,                // Is the result valid?
146 |   input result_rdy                  // ready to take the result
147 | );
148 | ```
149 | 
150 | On the `operands` boundary, nothing will happen until GCD is ready to receive data (`operands_rdy`).
151 | When this happens, the testbench will place data on the operands (`operands_bits_A` and `operands_bits_B`),
152 | but GCD will not start until the testbench declares that these operands are valid (`operands_val`).
153 | Then GCD will start.
154 | 
155 | The testbench needs to know that GCD is not done. This will be true as long as `result_val` is 0
156 | (the results are not valid). Also, even if GCD is finished, it will hold the result until the testbench is
157 | prepared to receive the data (`result_rdy`). The testbench will check the data when GCD declares
158 | the results are valid by setting `result_val` to 1.
159 | 
160 | The contract is that if the interface declares it is ready while the other side declares it is valid, the
161 | information must be transferred.
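
As a rough illustration of this contract, the following Python sketch marks the cycles in which a transfer takes place, that is, the cycles where ready and valid are both 1 at the same clock edge. It is not part of the lab files: the function name `transfers` and the example traces are made up for illustration.

```python
def transfers(valid, ready):
    """Return the cycle indices where a ready/valid transfer happens.

    valid, ready: per-cycle samples (0 or 1) of the two handshake signals,
    taken at each rising clock edge.
    """
    return [cycle for cycle, (v, r) in enumerate(zip(valid, ready)) if v and r]


# Hypothetical traces: the producer raises valid in cycle 1 and holds it
# until the consumer becomes ready in cycle 3.
valid = [0, 1, 1, 1, 0]
ready = [0, 0, 0, 1, 1]
print(transfers(valid, ready))  # [3]
```

Only cycle 3 counts as a transfer: in cycles 1 and 2 the data is valid but the receiver is not ready, and in cycle 4 the receiver is ready but there is nothing valid to take.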
162 | 
163 | Open `src/gcd.v`. This is the top-level of GCD and just instantiates `gcd_control` and `gcd_datapath`.
164 | Separating files into control and datapath is generally a good idea. Open `src/gcd_datapath.v`.
165 | This file stores the operands, and contains the logic necessary to implement the algorithm (subtraction and comparison). Open `src/gcd_control.v`. This file contains a state machine that handles
166 | the ready-valid interface and controls the mux selects in the datapath. Open `src/gcd_testbench.v`.
167 | This file sends different operands to GCD, and checks to see if the correct GCD was found. Make
168 | sure you understand how this file works. Note that the inputs are changed on the negative edge
169 | of the clock. This will prevent hold time violations for gate-level simulation, because once a clock
170 | tree has been added, the input flops will register data at a time later than the testbench’s rising
171 | edge of the clock.
172 | 
173 | Now simulate the design by running `make sim-rtl`. The waveform is located under `build/sim-rundir/`.
174 | Open the waveform in DVE (you may need to scroll down in DVE to find the testbench) and try
175 | to understand how the code works by comparing the waveforms with the Verilog code. It might
176 | help to sketch out a state machine diagram and draw the datapath.
177 | 
178 | ---
179 | ### Question 1: Understanding the algorithm
180 | 
181 | By reading the provided Verilog code and/or viewing the RTL level simulations, demonstrate that
182 | you understand the provided code:
183 | 
184 | **a) Draw a table with 5 columns (cycle number, value of `A_reg`, value of `B_reg`, `A_next`, `B_next`) and fill in all of the rows for the first test vector (GCD of 27 and 15).**
185 | 
186 | **b) In `src/gcd_testbench.v`, the inputs are changed on the negative edge of the clock to prevent hold time violations. Is the output checked on the positive edge of the clock or the negative edge of the clock? Why?**
187 | 
188 | **c) In `src/gcd_testbench.v`, what will happen if you change `result_rdy = 1;` to `result_rdy = 0;`? What state will the `gcd_control.v` state machine be in?**
189 | 
190 | ---
191 | ### Question 2: Testbenches
192 | **a) Modify `src/gcd_testbench.v` so that intermediate steps are displayed in the format below.**
193 | **Include a copy of the code you wrote in your writeup (this should be approximately 3-4 lines).**
194 | 
195 | ```shell
196 |  0: [ ...... ] Test ( x ), [ x == x ]  (decimal)
197 |  1: [ ...... ] Test ( x ), [ x == 0 ]  (decimal)
198 |  2: [ ...... ] Test ( x ), [ x == 0 ]  (decimal)
199 |  3: [ ...... ] Test ( x ), [ x == 0 ]  (decimal)
200 |  4: [ ...... ] Test ( x ), [ x == 0 ]  (decimal)
201 |  5: [ ...... ] Test ( x ), [ x == 0 ]  (decimal)
202 |  6: [ ...... ] Test ( 0 ), [ 3 == 0 ]  (decimal)
203 |  7: [ ...... ] Test ( 0 ), [ 3 == 0 ]  (decimal)
204 |  8: [ ...... ] Test ( 0 ), [ 3 == 27 ] (decimal)
205 |  9: [ ...... ] Test ( 0 ), [ 3 == 12 ] (decimal)
206 | 10: [ ...... ] Test ( 0 ), [ 3 == 15 ] (decimal)
207 | 11: [ ...... ] Test ( 0 ), [ 3 == 3 ]  (decimal)
208 | 12: [ ...... ] Test ( 0 ), [ 3 == 12 ] (decimal)
209 | 13: [ ...... ] Test ( 0 ), [ 3 == 9 ]  (decimal)
210 | 14: [ ...... ] Test ( 0 ), [ 3 == 6 ]  (decimal)
211 | 15: [ ...... ] Test ( 0 ), [ 3 == 3 ]  (decimal)
212 | 16: [ ...... ] Test ( 0 ), [ 3 == 0 ]  (decimal)
213 | 17: [ ...... ] Test ( 0 ), [ 3 == 3 ]  (decimal)
214 | 18: [ passed ] Test ( 0 ), [ 3 == 3 ]  (decimal)
215 | 19: [ ...... ] Test ( 1 ), [ 7 == 3 ]  (decimal)
216 | ```
217 | 
218 | ---
219 | ## Synthesis
220 | Synthesis is the process of converting your Verilog RTL description into technology (or platform, in the case of
221 | FPGAs) specific gate-level Verilog. These gates are different from the “and”, “or”, “xor” etc. primitives in Verilog. While the logic primitives correspond to gate-level operations, they do not have
222 | a physical representation outside of their symbol. A synthesized gate-level Verilog netlist only contains
223 | cells with corresponding physical aspects: they have a transistor-level schematic with transistor
224 | sizes provided, a physical layout containing information necessary for fabrication, timing libraries
225 | providing performance specifications, etc. Some synthesis tools also output assign statements that
226 | refer to pass-through interfaces, but no logic operation is performed in these assignments (not even
227 | simple inversion!).
228 | 
229 | 
230 | Open the Makefile to see the available targets that you can run. You don’t have to know all of
231 | these for now. The Makefile provides shorthands to various Hammer commands for synthesis,
232 | placement-and-routing, or simulation. Read [Hammer-Flow](https://hammer-vlsi.readthedocs.io/en/latest/Hammer-Flow/index.html) if you want to get more detail.
233 | 
234 | The first step is to have Hammer generate the necessary supplementary Makefile (`build/hammer.d`). To do so, type the
235 | following command in the lab directory:
236 | 
237 |     make buildfile
238 | 
239 | This generates a file with make targets specific to the constraints we have provided inside the YAML
240 | files. If you have not run `make clean` after simulating, this file should already be generated. `make buildfile` also copies and extracts a tarball of the ASAP7 PDK to your local workspace. It will
241 | take a while to finish the first time you run this command. The extracted PDK is not deleted when
242 | you do `make clean` to avoid unnecessarily rebuilding the PDK. To explicitly remove it, you need to
243 | remove the build folder (and you should do it once you finish the lab to save your allocated disk
244 | space since the PDK is huge). To synthesize the GCD, use the following command:
245 | 
246 |     make syn
247 | 
248 | This runs through all the steps of synthesis.
249 | By default, Hammer puts the generated objects under the `build` directory. Go to `build/syn-rundir/reports`.
250 | There are five text files here that contain very useful information about
251 | the synthesized design that we just generated. Go through these files and familiarize yourself with
252 | these reports. One report of particular note is `final_time_PVT_0P63V_100C.setup.view.rpt`. The
253 | name of this file represents that it is a timing report, with the Process Voltage Temperature corner
254 | of 0.63 V and 100 degrees C, and that it contains the setup timing checks. Another important file
255 | is `build/syn-rundir/gcd.mapped.v`. This is your synthesized gate-level Verilog. Go through it
256 | to see what the RTL design has become to represent it in terms of technology-specific gates. Try
257 | to follow an input through these gates to see the path it takes until the output.
258 | These files are useful for debugging and evaluating your design.
259 | 
260 | Now open the `final_time_PVT_0P63V_100C.setup.view.rpt` file and look at the first block of text
261 | you see. It should look similar to this:
262 | 
263 | ```text
264 | Path 1: MET (474 ps) Setup Check with Pin GCDdpath0/A_reg_reg[15]/CLK->D
265 | View: PVT_0P63V_100C.setup_view
266 | Group: clk
267 | Startpoint: (R) GCDdpath0/B_reg_reg[5]/CLK
268 | Clock: (R) clk
269 | Endpoint: (F) GCDdpath0/A_reg_reg[15]/D
270 | Clock: (R) clk
271 |                   Capture      Launch
272 |      Clock Edge:+    1000           0
273 |     Src Latency:+       0           0
274 |     Net Latency:+       0 (I)       0 (I)
275 |         Arrival:=    1000           0
276 |           Setup:-      25
277 |     Uncertainty:-       0
278 |   Required Time:=     975
279 |    Launch Clock:-       0
280 |       Data Path:-     501
281 |           Slack:=     474
282 | #---------------------------------------------------------------------------------------------------------------------
283 | # Timing Point              Flags Arc      Edge Cell                      Fanout  Load Trans Delay Arrival  Instance
284 | #                                                                                 (fF)  (ps)  (ps)    (ps)  Location
285 | #---------------------------------------------------------------------------------------------------------------------
286 | GCDdpath0/B_reg_reg[5]/CLK  -     -        R    (arrival)                     16     -     0     -       0  (-,-)
287 | GCDdpath0/B_reg_reg[5]/QN   -     CLK->QN  R    ASYNC_DFFHx1_ASAP7_75t_SL      5   3.3    42    48      48  (-,-)
288 | GCDdpath0/g1181/Y           -     A->Y     F    INVx1_ASAP7_75t_SL             2   1.2    20    10      58  (-,-)
289 | GCDdpath0/g1162__8246/Y     -     A->Y     F    OR2x2_ASAP7_75t_SL             2   1.3    12    17      76  (-,-)
290 | GCDdpath0/g1152__6260/Y     -     A1->Y    F    AO32x1_ASAP7_75t_SL            1   0.7    13    19      95  (-,-)
291 | GCDdpath0/g1144__2883/Y     -     C1->Y    R    AOI322xp5_ASAP7_75t_SL         1   0.7    47    19     114  (-,-)
292 | GCDdpath0/g1138__5115/Y     -     B2->Y    F    AOI221xp5_ASAP7_75t_SL         1   0.7    37    14     128  (-,-)
293 | GCDdpath0/g1137__1881/Y     -     A2->Y    R    O2A1O1Ixp33_ASAP7_75t_SL       3   2.2    72    36     164  (-,-)
294 | GCDctrl0/g446__5526/Y       -     B->Y     F    NAND2xp5_ASAP7_75t_SL          2   1.3    36    17     182  (-,-)
295 | GCDctrl0/g444/Y             -     A->Y     R    INVx1_ASAP7_75t_SL            18  10.0   102    52     234  (-,-)
296 | GCDdpath0/g1265/Y           -     A->Y     F    INVx1_ASAP7_75t_L             17   9.4    91    63     297  (-,-)
297 | GCDdpath0/g1232__9945/Y     -     B->Y     R    NOR2xp33_ASAP7_75t_L          16   9.0   304   154     451  (-,-)
298 | GCDdpath0/g1193__6417/Y     -     C1->Y    F    AOI222xp33_ASAP7_75t_SL        1   0.7   124    51     501  (-,-)
299 | GCDdpath0/A_reg_reg[15]/D   -     -        F    ASYNC_DFFHx1_ASAP7_75t_SL      1     -     -     0     501  (-,-)
300 | #---------------------------------------------------------------------------------------------------------------------
301 | ```
302 | 
303 | This is one of the most common ways to assess the critical paths in your circuit.
304 | The setup timing report lists timing paths in ascending order of **slack**, which is the extra delay the signal can have before a setup
305 | violation occurs. The first block therefore describes the critical path of the design.
306 | Each row represents a **timing arc** from one gate to the next, and the whole block is the **timing
307 | path** between two flip-flops (or in some cases between latches). The `MET` at the top of the block
308 | indicates that the timing requirements have been met and there is no violation. If there were, this
309 | indicator would have read `VIOLATED`. Since our critical path meets the timing requirements with
310 | 474 ps of slack, we can run this synthesized design with a period equal to the clock period
311 | (1000 ps) minus the critical path slack (474 ps), which is 526 ps.
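
The arithmetic behind these numbers can be checked with a few lines of Python. This is only a sketch that copies the values from the example report above; the variable names are chosen for illustration and are not produced by Genus or Hammer.

```python
# Values copied from the example report (all in ps).
clock_period = 1000   # "Clock Edge" of the capture clock
setup_time = 25       # "Setup"
uncertainty = 0       # "Uncertainty"
data_path = 501       # "Data Path" (arrival time at the endpoint)

required_time = clock_period - setup_time - uncertainty
slack = required_time - data_path
min_period = clock_period - slack

print(f"required time = {required_time} ps")
print(f"slack         = {slack} ps")
print(f"min period    = {min_period} ps (about {1e3 / min_period:.2f} GHz)")
```

Keep in mind that this estimate applies to the netlist synthesized for a 1000 ps target. Re-running synthesis with a tighter constraint lets the tool pick different gates, so the achievable period may differ from 526 ps.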
312 | 
313 | ---
314 | 
315 | ### Question 3: Reporting Questions
316 | **a) Which report would you look at to find the total number of each different standard cell that the design contains?**
317 | 
318 | **b) Which report contains area breakdown by modules in the design?**
319 | 
320 | **c) What is the cell used for `A_reg_reg[7]`? How much leakage power does `A_reg_reg[7]` contribute? How did you find this?**
321 | 
322 | ---
323 | 
324 | ### Question 4: Synthesis Questions
325 | **a) Looking at the total number of instances of sequential cells synthesized and the number of `reg` definitions in the Verilog files, are they consistent? If not, why?**
326 | 
327 | **b) Modify the clock period in the `design.yml` file to make the design go faster. What is the highest clock frequency this design can operate at in this technology?**
328 | 
329 | ---
330 | 
331 | ### Synthesis: Step-by-step
332 | 
333 | Typically, we will be roughly following the above section’s flow, but it is also
334 | useful to know what is going on underneath. In this section,
335 | we will look at the steps Hammer takes to get from RTL Verilog to all the outputs we saw in the
336 | last section.
337 | 
338 | First, type `make clean` to clean the environment of previous build’s files. Then, use `make buildfile`
339 | to generate the supplementary Makefile as before. Now, we will modify the `make syn` command to
340 | only run the steps we want. Go through the following commands in the given order:
341 | 
342 |     make redo-syn HAMMER_EXTRA_ARGS="--stop_after_step init_environment"
343 | 
344 | In this step, Hammer invokes Genus to read the technology libraries and the RTL Verilog files, as well as the constraints we
345 | provided in the `design.yml` file.
346 | Hammer will exit with an error, which is expected as Hammer looks for the final synthesis output
347 | files to gauge its success. We have not yet generated the gate-level Verilog, so we know Hammer will display an error after every step except the last one.
348 | 
349 |     make redo-syn HAMMER_EXTRA_ARGS="--stop_after_step syn_generic"
350 | 
351 | This step is the **generic synthesis** step. In this step, Genus converts our RTL read
352 | in the previous step into an intermediate format, made up of technology-independent generic gates. These
353 | gates are purely for gate-level functional representation of the RTL we have coded, and are going
354 | to be used as an input to the next step. This step also performs logical optimizations on our design
355 | to eliminate any redundant/unused operations.
356 | 
357 |     make redo-syn HAMMER_EXTRA_ARGS="--stop_after_step syn_map"
358 | 
359 | This step is the **mapping** step. Genus takes its own generic gate-level output and converts it to
360 | our ASAP7-specific gates. This step further optimizes the design given the gates in our technology.
361 | That being said, this step can also increase the number of gates from the previous step as not
362 | all gates in the generic gate-level Verilog may be available for our use and they may need to be
363 | constructed using several, simpler gates.
364 | 
365 |     make redo-syn HAMMER_EXTRA_ARGS="--stop_after_step add_tieoffs"
366 | 
367 | In some designs, the pins in certain cells are hardwired to 0 or 1, which requires a tie-off cell.
368 | 
369 |     make redo-syn HAMMER_EXTRA_ARGS="--stop_after_step write_regs"
370 | 
371 | This step is purely for the benefit of the designer. For some designs, we may need to have a list
372 | of all the registers in our design. In this lab, the list of regs is used in post-synthesis simulation to
373 | generate the `force_regs.ucli`, which sets initial states of registers.
374 | 
375 |     make redo-syn HAMMER_EXTRA_ARGS="--stop_after_step generate_reports"
376 | 
377 | The reports we have seen in the previous section are generated during this step.
378 | 
379 |     make redo-syn HAMMER_EXTRA_ARGS="--stop_after_step write_outputs"
380 | 
381 | This step writes the outputs of the synthesis flow. This includes the gate-level `.v` file we looked at
382 | earlier in the lab. Other outputs include the design constraints (such as clock frequencies, output
383 | loads etc., in `.sdc` format) and delays between cells (in `.sdf` format).
384 | 
385 | ## Post-Synthesis Simulation
386 | From the root folder, type the following command:
387 | 
388 |     make sim-gl-syn
389 |     
390 | This will run a post-synthesis simulation using annotated delays from the `gcd.mapped.sdf` file.
391 | 
392 | ---
393 | 
394 | ### Checkoff 1: Synthesis Understanding 
395 | Demonstrate that your synthesis flow works correctly, and be prepared to explain the synthesis steps at a high level.
396 | 
397 | ---
398 | 
399 | ### Question 5: Delay Questions
400 | Check the waveforms in DVE. 
401 | 
402 | **a) Report the clk-q delay of `state[0]` in `GCDctrl0` at 17.5 ns and submit a screenshot of the waveforms showing how you found this delay.**
403 | 
404 | **b) Which line in the sdf file specifies this delay and what is the delay?**
405 | 
406 | **c) Is the delay from the waveform the same as from the sdf file? Why or why not?**
407 | 
408 | ---
409 | 
410 | ## Build Your Divider
411 | In this section, you will build a parameterized divider of unsigned integers. Some initial code has
412 | been provided to help you get started. To keep the control logic simple, the divider module uses an input
413 | signal `start` to begin the computation at the next clock cycle, and drives an output signal `done`
414 | HIGH when the division result is valid. The inputs `dividend` and `divisor` should be registered
415 | when `start` is HIGH. You are not required to handle corner cases such as dividing by 0. You are
416 | free to modify the skeleton code to implement a ready/valid interface instead, but it is not required.
417 | 
418 | It is suggested that you implement the divide algorithm described [here](http://bwrcs.eecs.berkeley.edu/Classes/icdesign/ee141_s04/Project/Divider%20Background.pdf). Use the **Divide Algorithm Version 2** (slide 9).
419 | A simple testbench skeleton is also provided to you. You should change it to add more test vectors,
420 | or test your divider with different bitwidths. You need to change the file `sim-rtl.yml` to use your
421 | divider instead of the GCD module when testing.
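
When adding test vectors, it can help to have a software reference to compare against. The following Python sketch is one possibility and is not part of the provided skeleton; the function name `restoring_divide` and its interface are assumptions made for illustration. It performs a generic bit-serial restoring division (shift the remainder left, try to subtract the divisor, keep the result only if it is non-negative). This follows the same shift-and-subtract idea, but may differ in detail from Version 2 in the slides.

```python
def restoring_divide(dividend, divisor, width=4):
    """Bit-serial restoring division of unsigned `width`-bit integers.

    Returns (quotient, remainder). Division by zero is not handled,
    matching the lab's statement that corner cases are not required.
    """
    assert 0 <= dividend < (1 << width) and 0 < divisor < (1 << width)
    remainder = 0
    quotient = 0
    for i in range(width - 1, -1, -1):
        # Shift the next dividend bit (MSB first) into the remainder.
        remainder = (remainder << 1) | ((dividend >> i) & 1)
        quotient <<= 1
        # Trial subtraction: keep it only if the result is non-negative.
        if remainder >= divisor:
            remainder -= divisor
            quotient |= 1
    return quotient, remainder


# Exhaustive self-check for the 4-bit case against Python's own operators.
for a in range(16):
    for b in range(1, 16):
        assert restoring_divide(a, b) == (a // b, a % b)

print(restoring_divide(13, 3))  # (4, 1)
```

The loop runs once per bit, which mirrors a hardware divider that takes `width` cycles between `start` and `done`. The expected quotient and remainder it produces can be pasted into the Verilog testbench as additional test vectors.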
422 | 
423 | ---
424 | 
425 | ### Question 6: Synthesize your divider
426 | **a) Push your 4-bit divider design through the synthesis tool, and determine its critical path, cell area, and maximum operating frequency from the reports. You might need to re-run synthesis multiple times to determine the maximum achievable frequency.**
427 | 
428 | **b) Change the bitwidth of your divider to 32 bits. What are the critical path, area, and maximum operating frequency now?**
429 | 
430 | **c) Include your divider code and testbench in the report. Add comments to explain your testbench and why it provides sufficient coverage for your divider module.**
431 | 
432 | ---
433 | 
434 | ## Lab Deliverables
435 | 
436 | ### Lab Due: 11:59 PM, Friday February 18th, 2021
437 | 
438 | - Submit a written report with all 6 questions answered to Gradescope
439 | - Checkoff with an ASIC lab TA
440 | 
441 | ## Acknowledgement
442 | 
443 | This lab is the result of the work of many EECS151/251 GSIs over the years including:
444 | Written By:
445 | - Nathan Narevsky (2014, 2017)
446 | - Brian Zimmer (2014)
447 | Modified By:
448 | - John Wright (2015,2016)
449 | - Ali Moin (2018)
450 | - Arya Reais-Parsi (2019)
451 | - Cem Yalcin (2019)
452 | - Tan Nguyen (2020)
453 | - Harrison Liew (2020)
454 | - Sean Huang (2021)
455 | - Daniel Grubb, Nayiri Krzysztofowicz, Zhaokai Liu (2021)
456 | - Dima Nikiforov (2022)
457 | 


--------------------------------------------------------------------------------
/lab6/spec.md:
--------------------------------------------------------------------------------
  1 | # EECS 151/251A ASIC Lab 6: SRAM Integration, DRC, LVS
  2 | <p align="center">
  3 | Prof. Sophia Shao
  4 | </p>
  5 | <p align="center">
  6 | TAs (ASIC): Dima Nikiforov
  7 | </p>
  8 | <p align="center">
  9 | Department of Electrical Engineering and Computer Science
 10 | </p>
 11 | <p align="center">
 12 | College of Engineering, University of California, Berkeley
 13 | </p>
 14 | 
 15 | ## Overview
 16 | In this lab, we will go over two very important concepts. First we will look at the basics of using
 17 | circuits beyond standard cells in VLSI designs. The most common example of this is SRAM,
 18 | which is a dense addressable memory block used in most VLSI designs. You will learn about SRAM
 19 | in more detail later in the lectures, but the [Wikipedia article on SRAM](https://en.wikipedia.org/wiki/Static_random-access_memory) provides a good starting
 20 | point. SRAM is treated as a hard macro block in VLSI flow. It is created separately from the
 21 | standard cell libraries. The process for adding other custom, analog, or mixed signal circuits will
 22 | be similar to what we use for SRAMs. In your project, you will use the SRAMs extensively for
 23 | data caching. It is important to know how to design a digital circuit and run a CAD flow with
 24 | those hard macro blocks. The lab exercises will help you get familiar with SRAM interfacing. We
 25 | will use an example design of computing a dot product of two vectors to walk you through how to
 26 | use the SRAM blocks.
 27 | 
 28 | Next, we will take a cursory glance at part of the “signoff” flow: design rule checking (DRC) and
 29 | layout-versus-schematic (LVS). DRC checks all the geometry in the post-PAR’d layout to see that
 30 | it meets all the design rules for the process technology. LVS checks for discrepancies between
 31 | the actual layout and the netlist that the PAR tool thinks it laid out.
 32 | In a purely standard-cell based design, LVS will almost never report a mismatch. However, once you start
 33 | integrating hard macros like SRAMs and custom analog cells, LVS can reveal unconnected pins,
 34 | unintended shorts between power/ground/signals, and more that would prevent the circuit from
 35 | working. Often, these stem from improper abstraction of the macro cells for the PAR tool.
 36 | 
 37 | To begin this lab, get the project files and set up your environment by typing the following command and sourcing the `eecs151.bashrc` file, as usual:
 38 | 
 39 | ```shell
 40 | git clone /home/ff/eecs151/labs/lab6.git
 41 | ```
 42 | 
 43 | You should also clean up the build directory generated from the previous labs to save some disk space.
 44 | 
 45 | For this lab, there are many Make targets that will be run, some of which you have explored in
 46 | previous labs. The following list describes what each one does for future reference, but **do not run them right now!**
 47 | 
 48 | ```shell
 49 | # This command gets all the relevant SRAM configurations (file pointers) for the ASAP7 library
 50 | make srams
 51 | 
 52 | # This command runs RTL simulation
 53 | make sim-rtl
 54 | 
 55 | # This command runs Synthesis using Cadence Genus tool
 56 | make syn
 57 | 
 58 | # This command runs Post-Synthesis gate-level simulation
 59 | make sim-gl-syn
 60 | 
 61 | # This command runs Placement-and-Routing using Cadence Innovus tool
 62 | make par
 63 | 
 64 | # This command runs Post-PAR gate-level simulation
 65 | make sim-gl-par
 66 | 
 67 | # This command runs Post-PAR power estimation
 68 | make power-par
 69 | 
 70 | # This command runs DRC using Mentor Calibre tool
 71 | make drc
 72 | 
 73 | # This command runs LVS using Mentor Calibre tool
 74 | make lvs
 75 | ```
 76 | 
 77 | The configuration files (`*.yml` files) are intended to provide you more flexibility when you have a
 78 | large design project, and you want to test the modules separately before final integration. You can
 79 | simply set the top-level module to the one you care about in these configuration files. Don’t hesitate
 80 | to make changes to those files whenever you want to test out your new modules. This structure
 81 | will also be used in the final project, so please take the exercises in this lab as a final practice run
 82 | with the CAD flow so that you will become more productive when
 83 | working on your project. At the very least, you should be aware of which files to change for the tasks
 84 | that you want to carry out. We will run through small to moderate designs to get a sense of the
 85 | entire flow. Please let the TAs know if you have any feedback or suggestions on how to improve
 86 | the tool flow, or if you encounter any tooling issues.
 87 | 
 88 | ## SRAM Modeling and Abstraction
 89 | Open the file `src/dot_product.v`. This Verilog module implements a vector dot product of two
 90 | vectors of unsigned integers a and b. The module first reads elements of the vectors one-by-one via
 91 | the ready/valid interfaces and stores them to two SRAMs, one for each vector.
 92 | 
 93 | Note: You will see some `REGISTER_R_CE` blocks in `dot_product.v`. These are used by some
 94 | iterations of this lab to remove the `reg` ambiguity that exists in Verilog. You may refer to
 95 | `/home/ff/eecs151/verilog_lib/EECS151.v` to see their definition, but in essence they are structural descriptions of registers that are unambiguously translated to flip-flops when written in this
 96 | fashion. You may use these constructs or normal verilog syntax.
 97 | 
 98 | Let’s look at one particular SRAM module instantiation to understand its interface. The functions
 99 | of select ports are annotated here:
100 | 
101 | ```v
102 | SRAM2RW16x16 sram (
103 |     .CE1(),   // clock edge (clock signal)
104 |     .CE2(),
105 | 
106 |     .WEB1(),  // Write Enable Bar (HIGH: Read, LOW: Write)
107 |     .WEB2(),
108 |     .OEB1(),  // Output Enable Bar (always tie to LOW)
109 |     .OEB2(),
110 |     .CSB1(),  // Chip Select Bar (always tie to LOW)
111 |     .CSB2(),
112 | 
113 |     .A1(),    // Address pin
114 |     .A2(),
115 |     .I1(),    // Input Data pin
116 |     .I2(),
117 |     .O1(),    // Output Data pin
118 |     .O2()
119 | );
120 | ```
121 | 
122 | This `SRAM2RW16x16` is a dual-port Read/Write memory block of sixteen 16-bit entries. This means
123 | there is a 4-bit address for selecting those 16-bit entries. The SRAM can be clocked with two
124 | independent clock signals. Also, to write to an SRAM, we need to set the `WEBi` signal to LOW. The
125 | signals `OEBi` and `CSBi` should be set to LOW. SRAMs are synchronous-write and synchronous-read;
126 | the read data is only available at the next rising edge, and the write data is only written at the
127 | next rising edge.
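
To make the synchronous timing concrete, here is a toy Python model of a single port of such an SRAM. It is only a sketch for building intuition and is not derived from `sram_behav_models.v`: the class name `SyncSram`, the method `rising_edge`, and the choice to leave the output unchanged during a write are all assumptions, and `CSB`/`OEB` are assumed to be tied LOW.

```python
class SyncSram:
    """Toy model of one port of a synchronous-read, synchronous-write SRAM.

    Assumptions for illustration only: CSB and OEB are tied LOW, and the
    output register is left unchanged during a write cycle.
    """

    def __init__(self, depth=16, width=16):
        self.mem = [0] * depth
        self.mask = (1 << width) - 1
        self.addr_bits = (depth - 1).bit_length()  # 16 entries -> 4 address bits
        self.out = 0                               # registered read data (O pin)

    def rising_edge(self, web, addr, data_in=0):
        """Apply the pin values sampled at one rising clock edge."""
        if web == 0:                 # WEB LOW: write
            self.mem[addr] = data_in & self.mask
        else:                        # WEB HIGH: read, data visible after this edge
            self.out = self.mem[addr]
        return self.out


sram = SyncSram()
sram.rising_edge(web=0, addr=3, data_in=0xBEEF)  # write 0xBEEF to address 3
print(hex(sram.out))                              # still 0x0: nothing read yet
sram.rising_edge(web=1, addr=3)                   # read address 3
print(hex(sram.out))                              # 0xbeef
```

The point to take away is that both operations need a clock edge: a value written in one cycle can only be observed on the output after a later read edge, so the surrounding logic must account for this one-cycle read latency.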
128 | 
129 | Where are those SRAMs coming from? Because SRAMs are not made out of standard cells, and
130 | are rather built using different units that do not conform to our PAR flow, they are pre-compiled
131 | and stored in separate databases. These cells are then instantiated by Innovus as black boxes,
132 | and are connected to the rest of the circuit as specified in your Verilog. In order to generate the
133 | database that Innovus will use, type the following command:
134 | 
135 | ```shell
136 | make srams
137 | ```
138 | 
139 | For simulation purposes, a Verilog behavioral model for the SRAMs from the HAMMER repository
140 | is used. This is automatically set up in `build/sram_generator-output.json` and points to `/home/ff/eecs151/hammer/src/hammer-vlsi/technology/asap7/sram_compiler/memories/behavioral/sram_behav_models.v`.
141 | 
142 | This file includes models for various types of SRAMs. You can find SRAMs that have only a single port for Read and Write, or SRAMs with different address widths and data widths. For your final
143 | project, you need to select the appropriate SRAM models to meet the specification. The SRAM
144 | models in this file are only intended for simulation. **Do not include this file in your project configuration for Synthesis or PAR**; otherwise, it will corrupt your post-Synthesis or
145 | post-PAR netlist.
146 | 
147 | For Synthesis and PAR, the SRAMs must be abstracted away from the tools, because the only
148 | things that the flow is concerned about at these stages are the timing characteristics and the outer
149 | layout geometry of the SRAM macros. The ASAP7 PDK does not come with SRAMs by default,
150 | so a graduate student (Sean Huang) graciously created some dummy models for us to use. They
151 | are located at:
152 | 
153 | ```shell
154 | # Liberty Timing File       -- delay information
155 | /home/ff/eecs151/hammer/src/hammer-vlsi/technology/asap7/sram_compiler/memories/lib/ 
156 | 
157 | # Library Exchange Format   -- placement information
158 | /home/ff/eecs151/hammer/src/hammer-vlsi/technology/asap7/sram_compiler/memories/lef/ 
159 | 
160 | # Graphical Database System -- final layout information
161 | /home/ff/eecs151/hammer/src/hammer-vlsi/technology/asap7/sram_compiler/memories/gds/ 
162 | 
163 | ```
164 | 
165 | #### Liberty Timing Files (*.lib)
166 | [Liberty files](http://web.engr.uky.edu/~elias/lectures/LibertyFileIntroduction.pdf) must be generated for macros at every relevant process, voltage, and temperature
167 | (PVT) corner that you are using for setup and hold timing analysis. Detailed models contain
168 | descriptions of what each pin does, the delays depending on the load given in tables, and power
169 | information. There are also 3 types of Liberty files: [CCS, ECSM, and NLDM](https://chitlesh.ch/wordpress/liberty-ccs-ecsm-or-ndlm/), which trade off
170 | accuracy against tool runtime.
171 | If you open up a file for the
172 | SRAMs we are using, you will see that they are very basic because these are fake timing models.
173 | Note that your post-synthesis and post-PAR
174 | timing reports will differ from gate-level simulation due to these inaccuracies.
175 | 
176 | 
177 | 
178 | 
179 | 
180 | #### Library Exchange Format (*.lef)
181 | [LEF files](http://web.engr.uky.edu/~elias/lectures/LibertyFileIntroduction.pdf) must be generated for macros in order to denote
182 | where pins are located and encode any obstructions (places where the PAR tool cannot place other
183 | cells or routing). The quality of LEFs is very important to get clean layouts. Again, our SRAM
184 | LEFs are fake, so they may present some issues with routing and DRC.
185 | 
186 | 
187 | #### Graphical Database System (*.gds)
188 | [GDS files](https://www.artwork.com/gdsii/gdsii/) must be generated
189 | for macros to encode the entire detailed layout, and get merged with the PAR’d layout before
190 | running DRC, LVS, and sending the design off to the fabrication house.
191 | 
192 | ---
193 | 
194 | ### Question 1: Understanding SRAMs
195 | **a)** Open the file `sram_behav_models.v` (located in the HAMMER repository). 
196 | **What are the different SRAM sizes available?** 
197 | **What is the difference between the `SRAM1RW*` and `SRAM2RW*` variants?** 
198 | Hint: take some time to look at the Verilog implementation to understand what it does. You will need to use this SRAM model in the final project.
199 | 
200 | **b)** In the same file, select an SRAM instance that has a BYTEMASK pin. 
201 | **What is the SRAM model (in terms of number of Read/Write ports, address width, data/word width)?** 
202 | **Briefly describe the purpose of the BYTEMASK pin. In which situations do you think it is useful?**
203 | 
204 | *c) (Ungraded thought experiment #1) SRAM libraries in real process technologies are much larger than the list you see in `sram_behav_models.v`. What features do you think are important for real SRAM libraries? Think in terms of number of ports, masking, improving yield, or anything else you can think of. What would these features do to the size of the SRAM macros?*
205 | 
206 | *d) (Ungraded thought experiment #2) SRAMs should be integrated very densely in a circuit’s layout. To build large SRAM arrays, often times many SRAM macros are tiled together, abutted on one or more sides. Knowing this, take a guess at how SRAMs are laid out.*
207 | 
208 | *i) In ASAP7, there are 9 metal layers, but realistically only 7 are available for routing, since the top 2 layers are reserved for robust power distribution, as you saw in Lab 4. How many layers should a well-designed SRAM macro use (i.e. block off from PAR routing), at maximum?*
209 | 
210 | *ii) Where should the pins on SRAMs be located, if you want to maximize the ability for them to abut together?*
211 | 
212 | ---
213 | 
214 | ## A Vector Dot Product with SRAMs
215 | Take a moment to read through the file `src/dot_product.v` to understand the control logic for
216 | writing to and reading from SRAMs. The two SRAMs are first filled with vector data up to the
217 | vector size; after that, they are read for the dot product computation.
218 | To run RTL simulation, type the following command:
219 | 
220 | ```shell
221 | make sim-rtl
222 | ```
223 | 
224 | To inspect the RTL simulation waveform, type the following commands:
225 | 
226 | ```shell
227 | cd build/sim-rundir
228 | dve -vpd vcdplus.vpd
229 | ```
230 | 
231 | The simulation takes 35 cycles to complete, which makes sense: it spends the first 16 cycles
232 | reading data from vectors `a` and `b`, then performs the dot product computation in 16 cycles, plus
233 | a few extra cycles for various state transitions. The goal is not to build the most efficient dot product
234 | implementation, but rather to give you an introductory example of how to interface with
235 | SRAMs.
236 | 
237 | Next, we will perform PAR on the circuit.
238 | 
239 | ```shell
240 | make par
241 | ```
242 | 
243 | This command will invoke synthesis as well if it has not been run already (however, make sure to re-run synthesis if you updated your `design.yml` file). After PAR finishes,
244 | you can open the floorplan of the design by running:
245 | 
246 | ```shell
247 | cd build/par-rundir
248 | ./generated-scripts/open_chip
249 | ```
250 | 
251 | This will launch the Cadence Innovus GUI and load your final design database. You should expect to
252 | see the floorplan as in the following image. Don’t forget to disable M8, V8, M9, V9 on the right
253 | pane to see the unobstructed floorplan.
254 | 
255 | <p align="center">
256 | <img src="./figs/layout.png" width="400" />
257 | </p>
258 | 
259 | This floorplan has two SRAM instances: `sram_a` and `sram_b`. The placement constraints for those
260 | SRAMs are given in the file `design.yml`, in the block shown below. You can look at `build/par-rundir/floorplan.tcl`
261 | to see how HAMMER translated these constraints into Innovus floorplanning commands. Note that
262 | you should:
263 | 
264 | - Always generate a placement constraint for hard macros like SRAMs, because Innovus is not
265 | able to auto-place them in a valid location most of the time.
266 | - Ensure that the hierarchical path to the macro instance is specified correctly, otherwise Innovus will not know what to place.
267 | - Pre-calculate valid locations for the macros. This will involve:
268 |   - Looking at the LEF file to find out its width and height (e.g. 12.384um × 77.184um for
269 | `SRAM2RW16x16`) to make sure it fits within the core boundary/desired area.
270 |   - Legalizing the x and y coordinates. These generally need to be a multiple of a technology
271 | grid to avoid layout rule violations. The most conservative rule of thumb is a multiple
272 | of the site height (height of a standard cell row, which is 1.08um in this technology).
273 |   - Ensuring that the macros receive power. You can see that the SRAMs in the picture
274 | above are placed beneath the M5 power straps. This is because the SRAM’s power pins
275 | are on M4.
276 | 
277 | ```yaml
278 | - path: "dot_product/sram_a"
279 |   type: hardmacro
280 |   x: 35.64
281 |   y: 10.8
282 |   width: 12.384
283 |   height: 77.184
284 |   orientation: r0
285 |   top_layer: M4
286 | 
287 | - path: "dot_product/sram_b"
288 |   type: hardmacro
289 |   x: 71.28
290 |   y: 10.8
291 |   width: 12.384
292 |   height: 77.184
293 |   orientation: r0
294 |   top_layer: M4
295 | ```
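The legalization rule above can be checked with a few lines of arithmetic. The following is a minimal Python sketch, not part of the lab flow: the helper names `is_on_grid` and `snap_down` are hypothetical, and it assumes the conservative rule of thumb quoted above (coordinates are multiples of the 1.08um site height). `Decimal` is used so that values such as 35.64 are compared exactly rather than as binary floating point.

```python
from decimal import Decimal

# Site height quoted in the text above (height of a standard cell row, in um).
SITE_HEIGHT_UM = Decimal("1.08")


def is_on_grid(coord_um, grid_um=SITE_HEIGHT_UM):
    """Return True if the coordinate is an exact multiple of the grid."""
    return Decimal(coord_um) % grid_um == 0


def snap_down(coord_um, grid_um=SITE_HEIGHT_UM):
    """Round a non-negative coordinate down to the nearest grid multiple."""
    return (Decimal(coord_um) // grid_um) * grid_um


if __name__ == "__main__":
    # The x/y values used for sram_a and sram_b in the constraint block above.
    for name, x, y in [("sram_a", "35.64", "10.8"), ("sram_b", "71.28", "10.8")]:
        print(name, "x on grid:", is_on_grid(x), "y on grid:", is_on_grid(y))
    # An arbitrary coordinate that is not a multiple of the site height.
    print("40.0 snaps down to", snap_down("40.0"))
```

If you move a macro, a check like this can catch an off-grid coordinate before you spend time on a PAR run.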
296 | 
297 | You can play around with those constraints to change the SRAM placement to a geometry you like.
298 | If you change the placement constraint only in `design.yml` and only want to redo PAR (skipping
299 | synthesis), you can do:
300 | 
301 | ```shell
302 | make redo-par HAMMER_EXTRA_ARGS='-p build/sram_generator-output.json -p design.yml'
303 | ```
304 | 
305 | Finally, we will perform post-PAR gate-level simulation and power estimation.
306 | 
307 | ```shell
308 | make sim-gl-par
309 | make power-par
310 | ```
311 | 
312 | Theoretically, if you don’t have any setup/hold time violations, your post-PAR gate-level simulation
313 | should pass. However, as mentioned above, when you are pushing the timing constraints, the
314 | incomplete SRAM timing libraries may cause the gate-level simulation to fail. One manifestation
315 | of this is the PAR tool trying to use very large clock buffers (x16 size) in the presence of SRAMs,
316 | which sometimes cannot be placed in the floorplan because they are too wide. At the bottom of
317 | `design.yml`, these buffers are set to “don’t use” for PAR.
318 | 
319 | ---
320 | ### Question 2: Using a different SRAM
321 | **a)** Modify the dot product design to use only one instantiation of a *dual-port, 5-bit address width, and 16-bit data width SRAM*. In this SRAM, you want to store vector `a` to the first 16 entries of the SRAM, and store vector `b` to the remaining entries of the SRAM. You can use the dot product code given to you as a starting point, but please implement your design in `src/dot_product_1SRAM.v`.
322 | **Include a screenshot of the code you added when modifying `dot_product` to be `dot_product_1SRAM`.**
323 | 
324 | **b)** Run PAR (remember to update your SRAM placement constraints) and find the post-PAR critical path in your design: with a step size of 0.1ns, reduce the PAR clock period until your design has a setup violation. 
325 | **Describe that path based on your Verilog source (roughly).**
326 | **Can you give a strategy to improve the timing based on the path that you find?** 
327 | You don’t have to implement it. Just provide a brief description of how you should fix it.
328 | 
329 | **c) What is the final performance (latency – in terms of nanoseconds) of your single-SRAM vector dot product design (post PAR)?**
330 | Remember that Latency (ns) = Number of Post-PAR simulation cycles × Lowest Post-PAR clock period. Make sure to run Post-PAR simulation with that clock period when you finish the PAR process.
331 | 
332 | **d) Screenshot the final floorplan of your single-SRAM dot product design to the report, as well as the power report, timing report, and area report.** 
333 | The SRAMs will have 0 power due to incomplete LIBs. Show where this appears in the power reports.
334 | 
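The clock period sweep in part b) and the latency formula in part c) amount to simple arithmetic, sketched below in Python. The function names are hypothetical, and the numbers in the example (35 cycles, a 2.0ns starting period) are placeholders only, not expected results for your design. Periods are kept in integer picoseconds so that the 0.1ns steps stay exact.

```python
def candidate_periods_ps(start_ps, step_ps=100, count=5):
    """List clock periods to try, stepping down by step_ps (100 ps = 0.1 ns)."""
    return [start_ps - i * step_ps for i in range(count)]


def latency_ns(sim_cycles, clock_period_ps):
    """Latency (ns) = number of post-PAR simulation cycles x clock period."""
    return sim_cycles * clock_period_ps / 1000


if __name__ == "__main__":
    # Placeholder values: substitute your own cycle count and lowest passing period.
    print(candidate_periods_ps(2000))
    print(latency_ns(35, 2000), "ns")
```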
335 | ### Checkoff 1: Modified Design
336 | Explain the updated vector unit design, and show the updated layout.
337 | 
338 | ---
339 | ### [Optional, Extra Credit] Question 3: Divide Your Vector Dot Products
340 | **Note: this question is extra credit. You will be awarded up to 20% extra credit on this lab report.**
341 | 
342 | Imagine we would like to compute the division of two dot products of vectors of unsigned integers. Open the file `src/dp_div.v`, connect two single-SRAM vector dot product modules with the divider you implemented in Lab 4 (the divider should have Ready/Valid interfaces for input and output) via FIFOs. If you implement a correct Ready/Valid mechanism for each block, connecting those blocks is simply a matter of wiring relevant signals at the interfaces. One dot product produces dividend input, and the other provides divisor input to your Divider. Then write a testbench for your new `dp_div` module based on `dot_product_tb.v`, where the test cases are simple yet non-trivial (don't worry about covering edge cases with these). Refer to the figure below for the high-level overview of the design.
343 | 
344 | **What is the number of cycles it takes to run a design of 16-element vectors with 16-bit datapath (for both dot product modules and divider module)?**
345 | **Screenshot the floorplan, collect the power report, timing report, and area report at a clock period that your design can meet (i.e., you don’t have to find the maximum achievable frequency).**
346 | **Zip your code and power, timing, area reports and submit it to the separate code assignment on Gradescope instead of pasting them into your lab PDF.** 
347 | Start early, since the tools take a long time!
348 | 
349 | 
350 | To receive full credit, you should make sure that your final implementation has no latches (one way
351 | to check is to open the Genus log file and search for “latch”). Also, your post-PAR gate-level simulation
352 | should pass the test in the testbench code.
353 | 
354 | <p align="center">
355 | <img src="./figs/dp_div.png" width="500" />
356 | </p>
357 | 
358 | 
359 | ---
360 | 
361 | ## DRC and LVS
362 | [DRC](https://en.wikipedia.org/wiki/Design_rule_checking) and [LVS](https://en.wikipedia.org/wiki/Layout_Versus_Schematic) are two of the most important “signoff” checks. DRC checks that all of the geometries
363 | in the layout conform to process fabrication rules. Without a DRC “clean” design, the fabrication
364 | house will not accept your design! LVS checks that the PAR tool’s conception of the circuit is actually
365 | matched by the generated layout. LVS extracts a connectivity netlist from your physical layout by
366 | tracing wires to/from transistors and pins and then tries to match up transistors and nets between
367 | the netlist reported by the PAR tool and its layout-extracted netlist. DRC and LVS are run in our
368 | environment using an industry-standard tool, Mentor Graphics Calibre. This section is intended
369 | only as a brief introduction to these steps of the flow; you will not need to do them for your final
370 | project.
371 | 
372 | To run DRC and view the results:
373 | 
374 | ```shell
375 | make drc
376 | cd build/drc-rundir
377 | ./generated-scripts/view_drc
378 | ```
379 | 
380 | Your layout will open in Calibre DESIGNrev (or CalibreDRV for short), followed by a window listing
381 | the results. Together, they look like this (using the dual-SRAM design, sorting the violations from
382 | most common to least):
383 | 
384 | 
385 | <p align="center">
386 | <img src="./figs/drc.png" width="750" />
387 | </p>
388 | 
389 | 
390 | We can see that our design is not clean. The rule-checking decks (Calibre script files) are incomplete
391 | for this PDK, so this is expected. The design rule manual (DRM) for this technology is extracted
392 | to your working directory under `build/tech-asap7-cache/extracted/ASAP7_PDKandLIB.tar/ASAP7_PDKandLIB_v1p5/asap7PDK_r1p5.tar.bz2/asap7PDK_r1p5/docs/asap7_drm.pdf`.
393 | 
394 | In a design without SRAMs (i.e. not this lab or your project, but you can try it on previous labs),
395 | we can run LVS and view the results similarly as follows:
396 | 
397 | 
398 | ```shell
399 | make lvs
400 | cd build/lvs-rundir
401 | ./generated-scripts/view_lvs
402 | ```
403 | 
404 | Again, there are some issues with this PDK that would preclude generating LVS clean results.
405 | However, LVS is especially useful to run early on, after you get a first PAR database, to catch
406 | shorts, etc. that may pop up especially when integrating hard macros like SRAMs. After fixing
407 | everything, you will see this at the end of a long design process, which is always super satisfying!
408 | 
409 | 
410 | <p align="center">
411 | <img src="./figs/lvs_smiley.png" width="300" />
412 | </p>
413 | 
414 | ---
415 | ### Question 4: DRC and LVS
416 | a) Scroll to the bottom of the DRC result summary report in `build/drc-rundir/drc_results.rpt`.
417 | **For the cell `dot_product` (or whatever you named your single-SRAM vector dot product), how many total violation results do you have? How many rules did you violate?** 
418 | Note: the result count is in the format `hierarchical_count` (`flat_count`), which would disagree if you have many
419 | instances of a submodule in the design. 
420 | **Please report the hierarchical count.**
421 | 
422 | b) Skim through Chapter 1.2 of the DRM (`build/tech-asap7-cache/extracted/ASAP7_PDKandLIB.tar/ASAP7_PDKandLIB_v1p5/asap7PDK_r1p5.tar.bz2/asap7PDK_r1p5/docs/asap7_drm.pdf`). 
423 | **For the violated rule with the highest number of occurrences less than 1000, provide a brief description of what the rule requires based on the naming convention and descriptions in Table 1.2.1 of the DRM.**
424 | 
425 | *c) (Ungraded thought experiment #3) If the DRC rule decks are perfect, the way you floorplan your design has a large impact on whether your design can be DRC clean. What things do you think can cause violations? What about other things that are constrained in PAR other than the floorplan?*
426 | 
427 | *d) (Ungraded thought experiment #4) At first, it may seem odd that the netlist that the PAR tool thinks the layout corresponds to could be different from the netlist extracted from the actual layout. What reasons can you think of that could cause mismatches? Which of these causes might cause the LVS tool to slow down dramatically as it tries to extract/compare? Would you be able to catch any of these discrepancies if doing a post-PAR gate-level simulation in lieu of LVS, and why?*
428 | 
429 | ### Checkoff 2: DRC and LVS Demo
430 | Show the DRC and LVS results, and explain the meaning of what you see.
431 | 
432 | 
433 | ---
434 | 
435 | ## Lab Deliverables
436 | 
437 | ### Lab Due: 11:59 PM, Friday April 1st, 2022
438 | 
439 | - Submit a written report with all 3 questions (4 if doing extra credit) answered to Gradescope
440 | - Checkoff with an ASIC lab TA
441 | 
442 | ## Acknowledgement
443 | 
444 | This lab is the result of the work of many EECS151/251 GSIs over the years including:
445 | 
446 | Written By:
447 | - Nathan Narevsky (2014, 2017)
448 | - Brian Zimmer (2014)
449 | 
450 | Modified By:
451 | - John Wright (2015,2016)
452 | - Ali Moin (2018)
453 | - Arya Reais-Parsi (2019)
454 | - Cem Yalcin (2019)
455 | - Tan Nguyen (2020)
456 | - Harrison Liew (2020)
457 | - Sean Huang (2021)
458 | - Daniel Grubb, Nayiri Krzysztofowicz, Zhaokai Liu (2021)
459 | - Dima Nikiforov (2022)
460 | 


--------------------------------------------------------------------------------
/lab3/spec_sky130.md:
--------------------------------------------------------------------------------
  1 | # EECS 151/251A ASIC Lab 3: Logic Synthesis
  2 | <p align="center">
  3 | Prof. Bora Nikolic
  4 | </p>
  5 | <p align="center">
  6 | TAs: Daniel Grubb, Nayiri Krzysztofowicz, Zhaokai Liu
  7 | </p>
  8 | <p align="center">
  9 | Department of Electrical Engineering and Computer Science
 10 | </p>
 11 | <p align="center">
 12 | College of Engineering, University of California, Berkeley
 13 | </p>
 14 | 
 15 | 
 16 | 
 17 | ## Overview
 18 | For this lab, you will learn how to translate RTL code into a gate-level netlist in a process called
 19 | synthesis. In order to successfully synthesize your design, you will need to understand how to
 20 | constrain your design, learn how the tools optimize logic and estimate timing, analyze the critical
 21 | path of your design, and simulate the gate-level netlist.
 22 | To begin this lab, get the project files by typing the following commands:
 23 | 
 24 | ```shell
 25 | git clone /home/ff/eecs151/labs/lab3.git
 26 | cd lab3
 27 | ```
 28 | 
 29 | You should add the following lines to the `.bashrc` file in your home folder
 30 | (for more information about what `.bashrc` does, see https://www.tldp.org/LDP/abs/html/sample-bashrc.html)
 31 | so that every time
 32 | you open a new terminal you have the paths for the tools setup properly.
 33 | 
 34 | ```shell
 35 | source /home/ff/eecs151/tutorials/eecs151.bashrc
 36 | export HAMMER_HOME=/home/ff/eecs151/hammer
 37 | source ${HAMMER_HOME}/sourceme.sh
 38 | ```
 39 | 
 40 | Type
 41 | 
 42 | ```shell
 43 | which genus
 44 | ```
 45 | 
 46 | to see if the shell prints out the path to the Cadence Genus Synthesis program (which we will be
 47 | using for this lab). If it does not work, add the lines to your `.bash_profile` in your home folder
 48 | as well. Try logging in again or opening a new terminal to see if it works. The file `eecs151.bashrc` sets various
 49 | environment variables in your system such as where to find the CAD programs or license servers.
 50 | 
 51 | 
 52 | ## Synthesis Environment
 53 | To perform synthesis, we will be using Cadence Genus. However, we will not be interfacing with
 54 | Genus directly; instead, we will use HAMMER. Just like in lab 2, we have set up the basic HAMMER
 55 | flow for your lab exercises using a Makefile.
 56 | 
 57 | In this lab repository, you will see two sets of input files for HAMMER. The first set is
 58 | the source code for our design, which you will explore in the next section. The second set is
 59 | a group of YAML files (`inst-env.yml`, `sky130.yml`, `design-sky130.yml`, `sim-rtl.yml`, `sim-gl-syn.yml`) that
 60 | configure the HAMMER flow. Of these YAML files, you should only need to modify `design.yml`,
 61 | `sim-rtl.yml` and `sim-gl-syn.yml` in order to configure synthesis and simulation for your
 62 | design.
 63 | 
 64 | 
 65 | HAMMER is already set up at `/home/ff/eecs151/hammer` with all the required plugins for Cadence
 66 | Synthesis (Genus) and Place-and-Route (Innovus), Synopsys Simulator (VCS), and Mentor Graphics
 67 | DRC and LVS (Calibre). You should not need to install it in your own home directory. **These
 68 | HAMMER plugins are under NDA. They are provided to us for educational purposes.
 69 | They should never be copied outside of instructional machines under any circumstances, or else we risk losing access to the tools in the future!!!**
 70 | 
 71 | Let us take a look at some parts of the `design.yml` file:
 72 | 
 73 | ```yaml
 74 | gcd.clockPeriod: &CLK_PERIOD "1ns"
 75 | ```
 76 | 
 77 | This option sets the target clock speed for our design. A more stringent target (a lower clock
 78 | period) will make the tool work harder and use higher-power gates to meet the clock
 79 | period. A more relaxed target (a higher clock period) lets the tool focus on reducing area and/or power.
 80 | In `sim-rtl.yml`:
 81 | 
 82 | ```yaml
 83 | defines:
 84 |   - "CLOCK_PERIOD=1.00"
 85 | ```
 86 | 
 87 | This option sets the clock period used during simulation. It is generally useful to separate the two as
 88 | you might want to see how the circuit performs under different clock frequencies without changing
 89 | the design constraints. Continuing from `design.yml`:
 90 | 
 91 | ```yaml
 92 | gcd.verilogSrc: &VERILOG_SRC
 93 |   - "src/gcd.v"
 94 |   - "src/gcd_datapath.v"
 95 |   - "src/gcd_control.v"
 96 | ```
 97 | 
 98 | and in `sim-rtl.yml`:
 99 | 
100 | ```yaml
101 | sim.inputs:
102 |   input_files:
103 |     - "src/gcd.v"
104 |     - "src/gcd_datapath.v"
105 |     - "src/gcd_control.v"
106 |     - "src/gcd_testbench.v"
107 | ```
108 | 
109 | These specify the files for synthesis and simulation. Moving on, we have:
110 | 
111 | ```yaml
112 | vlsi.inputs.clocks: [
113 |   {name: "clk", period: *CLK_PERIOD, uncertainty: "0.1ns"}
114 | ]
115 | ```
116 | 
117 | This is where we specify to HAMMER that we intend on using the `CLK_PERIOD` we defined earlier
118 | as the constraint for our design. We will see more detailed constraints in the later labs.
119 | 
120 | ## Understanding the example design
121 | We have provided a circuit described in Verilog that computes the greatest common divisor (GCD)
122 | of two numbers. Unlike the FIR filter from the last lab where the testbench constantly provided
123 | stimuli, the GCD algorithm takes a variable number of cycles, so the testbench needs to know when
124 | the circuit is done to check the output. This is accomplished through a “ready/valid” handshake
125 | protocol. This protocol is ubiquitous, and a flavor of it will appear both in the class project
126 | and later on in other blocks you will encounter throughout your career. The block diagram is shown
127 | in the figure below.
128 | 
129 | <p align="center">
130 | <img src="./figs/block-diagram.png" width="600" />
131 | </p>
132 | 
133 | The GCD module declaration is as follows:
134 | 
135 | ```v
136 | module gcd#( parameter W = 16 )
137 | (
138 |   input clk, reset,
139 |   input [W-1:0] operands_bits_A,    // Operand A
140 |   input [W-1:0] operands_bits_B,    // Operand B
141 |   input operands_val,               // Are operands valid?
142 |   output operands_rdy,              // ready to take operands
143 | 
144 |   output [W-1:0] result_bits_data,  // GCD
145 |   output result_val,                // Is the result valid?
146 |   input result_rdy                  // ready to take the result
147 | );
148 | ```
149 | 
150 | On the `operands` boundary, nothing will happen until GCD is ready to receive data (`operands_rdy`).
151 | When this happens, the testbench will place data on the operands (`operands_bits_A` and `operands_bits_B`),
152 | but GCD will not start until the testbench declares that these operands are valid (`operands_val`).
153 | Then GCD will start.
154 | 
155 | The testbench needs to know that GCD is not done. This will be true as long as `result_val` is 0
156 | (the results are not valid). Also, even if GCD is finished, it will hold the result until the testbench is
157 | prepared to receive the data (`result_rdy`). The testbench will check the data when GCD declares
158 | the results are valid by setting `result_val` to 1.
159 | 
160 | The main contract is that if one side of the interface declares it is ready, and the other side declares valid, the
161 | information must be transferred.
162 | 
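The contract can be illustrated with a small cycle-by-cycle model. This is only a Python sketch of the handshake rule stated above, not the lab's Verilog: the function name `transfers` and the example signal values are made up for illustration, and each list index stands for one clock cycle.

```python
def transfers(ready, valid, data):
    """Return the data items that are transferred, given per-cycle signals.

    A transfer happens only in cycles where the receiver asserts ready
    and the sender asserts valid at the same time.
    """
    moved = []
    for rdy, val, item in zip(ready, valid, data):
        if rdy and val:
            moved.append(item)
    return moved


if __name__ == "__main__":
    # Cycle 0: valid but not ready, so nothing moves.
    # Cycle 1: ready but not valid, so nothing moves.
    # Cycles 2 and 3: both asserted, so the data on the bus is transferred.
    ready = [0, 1, 1, 1]
    valid = [1, 0, 1, 1]
    data = [27, 27, 27, 15]
    print(transfers(ready, valid, data))
```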
163 | Open `src/gcd.v`. This is the top-level of GCD and just instantiates `gcd_control` and `gcd_datapath`.
164 | Separating files into control and datapath is generally a good idea. Open `src/gcd_datapath.v`.
165 | This file stores the operands, and contains the logic necessary to implement the algorithm (subtraction and comparison). Open `src/gcd_control.v`. This file contains a state machine that handles
166 | the ready-valid interface and controls the mux selects in the datapath. Open `src/gcd_testbench.v`.
167 | This file sends different operands to GCD, and checks to see if the correct GCD was found. Make
168 | sure you understand how this file works. Note that the inputs are changed on the negative edge
169 | of the clock. This will prevent hold time violations for gate-level simulation, because once a clock
170 | tree has been added, the input flops will register data at a time later than the testbench’s rising
171 | edge of the clock.
172 | 
173 | Now simulate the design by running `make sim-rtl`. The waveform is located under `build/sim-rundir/`.
174 | Open the waveform in DVE (you may need to scroll down in DVE to find the testbench) and try
175 | to understand how the code works by comparing the waveforms with the Verilog code. It might
176 | help to sketch out a state machine diagram and draw the datapath.
177 | 
178 | ---
179 | 
180 | ### Question 1: Understanding the algorithm
181 | 
182 | By reading the provided Verilog code and/or viewing the RTL level simulations, demonstrate that
183 | you understand the provided code:
184 | 
185 | **a) Draw a table with 5 columns (cycle number, value of `A_reg`, value of `B_reg`, next value of `A_reg`, next value of `B_reg`) and fill in all of the rows for the first test vector (GCD of 27 and 15).**
186 | 
187 | **b) In `src/gcd_testbench.v`, the inputs are changed on the negative edge of the clock to prevent hold time violations. Is the output checked on the positive edge of the clock or the negative edge of the clock? Why?**
188 | 
189 | **c) In `src/gcd_testbench.v`, what will happen if you change `result_rdy = 1;` to `result_rdy = 0;`? What state will `gcd_control.v` state machine be in?**
190 | 
191 | ---
192 | ### Question 2: Testbenches
193 | **a) Modify `src/gcd_testbench.v` so that intermediate steps are displayed in the format below. Include a copy of the code you wrote in your writeup (this should be approximately 3-4 lines).**
194 | 
195 | ```shell
196 |  0: [ ...... ] Test ( x ), [ x == x ]  (decimal)
197 |  1: [ ...... ] Test ( x ), [ x == 0 ]  (decimal)
198 |  2: [ ...... ] Test ( x ), [ x == 0 ]  (decimal)
199 |  3: [ ...... ] Test ( x ), [ x == 0 ]  (decimal)
200 |  4: [ ...... ] Test ( x ), [ x == 0 ]  (decimal)
201 |  5: [ ...... ] Test ( x ), [ x == 0 ]  (decimal)
202 |  6: [ ...... ] Test ( 0 ), [ 3 == 0 ]  (decimal)
203 |  7: [ ...... ] Test ( 0 ), [ 3 == 0 ]  (decimal)
204 |  8: [ ...... ] Test ( 0 ), [ 3 == 27 ] (decimal)
205 |  9: [ ...... ] Test ( 0 ), [ 3 == 12 ] (decimal)
206 | 10: [ ...... ] Test ( 0 ), [ 3 == 15 ] (decimal)
207 | 11: [ ...... ] Test ( 0 ), [ 3 == 3 ]  (decimal)
208 | 12: [ ...... ] Test ( 0 ), [ 3 == 12 ] (decimal)
209 | 13: [ ...... ] Test ( 0 ), [ 3 == 9 ]  (decimal)
210 | 14: [ ...... ] Test ( 0 ), [ 3 == 6 ]  (decimal)
211 | 15: [ ...... ] Test ( 0 ), [ 3 == 3 ]  (decimal)
212 | 16: [ ...... ] Test ( 0 ), [ 3 == 0 ]  (decimal)
213 | 17: [ ...... ] Test ( 0 ), [ 3 == 3 ]  (decimal)
214 | 18: [ passed ] Test ( 0 ), [ 3 == 3 ]  (decimal)
215 | 19: [ ...... ] Test ( 1 ), [ 7 == 3 ]  (decimal)
216 | ```
217 | ---
218 | 
219 | ## Synthesis
220 | Synthesis is the process of converting RTL Verilog files into technology (or platform, in the case of
221 | FPGAs) specific gate-level Verilog. These gates are different from the “and”, “or”, “xor” etc. primitives in Verilog. While the logic primitives correspond to gate-level operations, they do not have
222 | a physical representation outside of their symbol. A synthesized gate-level Verilog only contains
223 | cells with corresponding physical aspects: they have a transistor-level schematic with transistor
224 | sizes provided, a physical layout containing information necessary for fabrication, timing libraries
225 | providing performance specifications etc. Some synthesis tools also output assign statements that
226 | refer to pass-through interfaces, but no logic operation is performed in these assignments (not even
227 | simple inversion!).
228 | 
229 | 
230 | Open the Makefile to see the available targets that you can run. You don’t have to know all of
231 | these for now. The Makefile provides shorthands to various HAMMER commands for synthesis,
232 | placement-and-routing, or simulation. Read [Hammer-Flow](https://hammer-vlsi.readthedocs.io/en/latest/Hammer-Flow/index.html) if you want to get more detail.
233 | 
234 | To start the synthesis process of the GCD module you just analyzed, the first step is to make
235 | HAMMER generate the necessary supplemental Makefile (`build/hammer.d`). To do so, type the
236 | following command in the lab directory:
237 | 
238 |     make buildfile
239 | 
240 | This generates a file with make targets specific to the constraints we have provided inside the YAML
241 | files. If you have not run `make clean` after simulating, this file should already be generated. `make buildfile` 
242 | also modifies a few files from the Sky130 PDK and stores them to your local workspace. 
243 | The extracted PDK is not deleted when
244 | you do `make clean` to avoid unnecessarily rebuilding the PDK. To explicitly remove it, you need to
245 | remove the build folder (and you should do it once you finish the lab to save your allocated disk
246 | space since the PDK is huge). To synthesize the GCD, use the following command:
247 | 
248 |     make syn
249 | 
250 | This runs through all the steps necessary to generate the gate-level Verilog. The final lines of output
251 | you will see are a list of all the registers in the design. There should be all the bits of `A_reg_reg`,
252 | `B_reg_reg` and the state registers.
253 | 
254 | By default, HAMMER puts the generated objects under the directory build. Go to `build/syn-rundir/reports`. 
255 | There are five text files here that contain very useful information about
256 | the synthesized design that we just generated. Go through these files and familiarize yourself with
257 | these reports. One report of particular note is `final_time_ss_100C_1v60.setup_view.rpt`. The
258 | name of this file represents that it is a timing report, with the Process Voltage Temperature corner
259 | of 1.6 V and 100 degrees C, and that it contains the setup timing checks. Another important file
260 | is `build/syn-rundir/gcd.mapped.v`. This is your synthesized gate-level Verilog. Go through it
261 | to see what the RTL design has become to represent it in terms of technology-specific gates. Try
262 | to follow an input through these gates to see the path it takes until the output. While these files
263 | are rarely ever read by humans, you may sometimes find yourself going through these during the
264 | process of debugging.
265 | 
266 | Now open the `final_time_ss_100C_1v60.setup_view.rpt` file and look at the first block of text
267 | you see. It should look similar to this:
268 | 
269 | ```text
270 | Path 1: MET (212 ps) Setup Check with Pin GCDdpath0/A_reg_reg[15]/CLK->D
271 |            View: ss_100C_1v60.setup_view
272 |           Group: clk
273 |      Startpoint: (R) GCDdpath0/A_reg_reg[1]/CLK
274 |           Clock: (R) clk
275 |        Endpoint: (F) GCDdpath0/A_reg_reg[15]/D
276 |           Clock: (R) clk
277 | 
278 |                      Capture       Launch     
279 |         Clock Edge:+    5000            0     
280 |        Src Latency:+       0            0     
281 |        Net Latency:+       0 (I)        0 (I) 
282 |            Arrival:=    5000            0     
283 |                                               
284 |              Setup:-     293                  
285 |        Uncertainty:-     500                  
286 |      Required Time:=    4207                  
287 |       Launch Clock:-       0                  
288 |          Data Path:-    3995                  
289 |              Slack:=     212                  
290 | 
291 | #--------------------------------------------------------------------------------------------------------------------------
292 | #          Timing Point            Flags    Arc   Edge           Cell             Fanout Load Trans Delay Arrival Instance 
293 | #                                                                                        (fF)  (ps)  (ps)   (ps)  Location 
294 | #--------------------------------------------------------------------------------------------------------------------------
295 |   GCDdpath0/A_reg_reg[1]/CLK       -       -      R     (arrival)                     16    -     0     0       0    (-,-) 
296 |   GCDdpath0/A_reg_reg[1]/Q         -       CLK->Q F     sky130_fd_sc_hd__dfrtp_1       2  8.4   128   756     756    (-,-) 
297 |   GCDdpath0/g815/Y                 -       A->Y   R     sky130_fd_sc_hd__inv_2         2 11.1    99   135     891    (-,-) 
298 |   GCDdpath0/g812/Y                 -       A->Y   F     sky130_fd_sc_hd__inv_2         2  5.5    37    75     966    (-,-) 
299 |   GCDdpath0/sub_45_24_g546__2346/Y -       A_N->Y F     sky130_fd_sc_hd__nand2b_1      2  6.4   145   322    1287    (-,-) 
300 |   GCDdpath0/sub_45_24_g482__9315/Y -       A->Y   R     sky130_fd_sc_hd__nand2_1       1  5.8   122   155    1442    (-,-) 
301 |   GCDdpath0/sub_45_24_g480__6161/Y -       A->Y   F     sky130_fd_sc_hd__nand2_2       3 11.5   120   151    1593    (-,-) 
302 |   GCDdpath0/sub_45_24_g468__3680/Y -       A->Y   R     sky130_fd_sc_hd__nand3_1       1  3.7   115   136    1729    (-,-) 
303 |   GCDdpath0/sub_45_24_g467__6783/Y -       A->Y   F     sky130_fd_sc_hd__nand2_1       4 14.4   250   253    1982    (-,-) 
304 |   GCDdpath0/sub_45_24_g465__8428/Y -       A->Y   R     sky130_fd_sc_hd__nand2_1       2  7.5   145   218    2200    (-,-) 
305 |   GCDdpath0/sub_45_24_g464/Y       -       A->Y   F     sky130_fd_sc_hd__clkinv_1      1  3.6    78   137    2337    (-,-) 
306 |   GCDdpath0/sub_45_24_g459__5477/X -       A1->X  F     sky130_fd_sc_hd__a21o_2        7 23.1   146   447    2784    (-,-) 
307 |   GCDdpath0/sub_45_24_g455__2346/Y -       A->Y   R     sky130_fd_sc_hd__nand2_1       2  6.9   130   166    2950    (-,-) 
308 |   GCDdpath0/sub_45_24_g447__1881/Y -       A2->Y  F     sky130_fd_sc_hd__o21ai_1       1  5.7   139   169    3119    (-,-) 
309 |   GCDdpath0/sub_45_24_g440__1617/Y -       B->Y   F     sky130_fd_sc_hd__xnor2_1       1  3.6   111   244    3363    (-,-) 
310 |   GCDdpath0/g1627__5122/X          -       B1->X  F     sky130_fd_sc_hd__a22o_1        1  3.6    82   350    3714    (-,-) 
311 |   GCDdpath0/g1596__1666/X          -       B1->X  F     sky130_fd_sc_hd__a21o_1        1  3.1    64   282    3995    (-,-) 
312 |   GCDdpath0/A_reg_reg[15]/D        -       -      F     sky130_fd_sc_hd__dfrtp_1       1    -     -     0    3995    (-,-) 
313 | #--------------------------------------------------------------------------------------------------------------------------
314 | 
315 | ```
316 | 
317 | This is one of the most common ways to assess the critical paths in your circuit.
318 | The setup timing report lists the timing paths in ascending order of **slack**, which is the extra delay the signal can have before a setup
319 | violation occurs. The first block therefore describes the critical path of the design.
320 | Each row represents a **timing arc** from one pin to the next, and the whole block is the **timing
321 | path** between two flip-flops (or in some cases between latches). The `MET` at the top of the block
322 | indicates that the timing requirements have been met and there is no violation. If there were one, this
323 | indicator would read `VIOLATED`. Since our critical path meets the timing requirements with
324 | 212 ps of slack, we can run this synthesized design with a period equal to the clock period
325 | (5000 ps) minus the critical path slack (212 ps), which is 4788 ps.
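
The arithmetic in the report header can be reproduced with a few lines of Python. This is only a sketch using the numbers from the example report above; the variable names are ours, and the resulting frequency is a first estimate, since re-running synthesis with a tighter clock period may change the critical path.

```python
# Values copied from the example setup report above (all in ps).
clock_period = 5000
setup_time = 293
uncertainty = 500
data_path_delay = 3995

# Required time = capture clock arrival - setup time - clock uncertainty.
required_time = clock_period - setup_time - uncertainty

# Slack = required time - data path arrival time.
slack = required_time - data_path_delay

# Estimated minimum period and the corresponding frequency.
min_period = clock_period - slack
max_freq_mhz = 1e6 / min_period  # a period of 1 ps corresponds to 1e6 MHz

print(required_time, slack, min_period, round(max_freq_mhz, 1))
# prints: 4207 212 4788 208.9
```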
326 | 
327 | ---
328 | ### Question 3: Reporting Questions
329 | **a) Which report would you look at to find the total number of each different standard cell that the design contains?**
330 | 
331 | **b) Which report contains area breakdown by modules in the design?**
332 | 
333 | **c) What is the cell used for `A_reg_reg[7]`? How much leakage power does this contribute? How did you find this?**
334 | 
335 | ---
336 | 
337 | ### Question 4: Synthesis Questions
338 | **a) Looking at the total number of sequential cells synthesized and the number of `reg` definitions in the Verilog files, are they consistent? If not, why?**
339 | 
340 | **b) Modify the clock period in the `design.yml` file to make the design go faster. What is the highest clock frequency this design can operate at in this technology?**
341 | 
342 | ---
343 | 
344 | ### Synthesis: Step-by-step
345 | 
346 | While for the remainder of the semester we will be roughly following the above section’s flow, it is
347 | useful as a digital IC design engineer to know what is going on during the process. In this section,
348 | we will look at the steps HAMMER takes to get from RTL Verilog to all the outputs we saw in the
349 | last section.
350 | 
351 | First, type `make clean` to clean the environment of previous build’s files. Then, use `make buildfile`
352 | to generate the supplementary Makefile as before. Now, we will modify the `make syn` command to
353 | only run the steps we want. Go through the following commands in the given order:
354 | 
355 |     make redo-syn HAMMER_EXTRA_ARGS="--stop_after_step init_environment"
356 | 
357 | HAMMER flow will exit with an error. This is expected, as HAMMER looks for the final output
358 | files to gauge its success. We have not yet generated the gate-level Verilog, so we know beforehand
359 | that every step except the last one is going to end with an error. In this step, HAMMER invokes
360 | Genus to read the technology libraries and the RTL Verilog files, as well as the constraints we
361 | provided in the `design.yml` file.
362 | 
363 |     make redo-syn HAMMER_EXTRA_ARGS="--stop_after_step syn_generic"
364 | 
365 | This step is the **generic synthesis** step. In this step, Genus converts our RTL Verilog files read
366 | in the previous step to an intermediate format, using technology-independent generic gates. These
367 | gates are purely for gate-level functional representation of the RTL we have coded, and are going
368 | to be used as an input to the next step. This step also performs logical optimizations on our design
369 | to eliminate any redundant/unused operations.
370 | 
371 |     make redo-syn HAMMER_EXTRA_ARGS="--stop_after_step syn_map"
372 | 
373 | This step is the **mapping** step. Genus takes its own generic gate-level output and converts it to
374 | our Sky130-specific gates. This step further optimizes the design given the gates in our technology.
375 | That being said, this step can also increase the number of gates from the previous step as not
376 | all gates in the generic gate-level Verilog may be available for our use and they may need to be
377 | constructed using several, simpler gates.
378 | 
379 |     make redo-syn HAMMER_EXTRA_ARGS="--stop_after_step add_tieoffs"
380 | 
381 | In some designs, the pins in certain cells are hardwired to 0 or 1. Since modern technology does
382 | not directly connect cells to Vdd or ground, the tie-off cells are added in this step.
383 | 
384 |     make redo-syn HAMMER_EXTRA_ARGS="--stop_after_step write_regs"
385 | 
386 | This step is purely for the benefit of the designer. For some designs, we may need to have a list
387 | of all the registers in our design. In this lab, the list of regs is used in post-synthesis simulation to
388 | generate the `force_regs.ucli`, which sets initial states of registers.
389 | 
390 |     make redo-syn HAMMER_EXTRA_ARGS="--stop_after_step generate_reports"
391 | 
392 | The reports we have seen in the previous section are generated during this step.
393 | 
394 |     make redo-syn HAMMER_EXTRA_ARGS="--stop_after_step write_outputs"
395 | 
396 | This step writes the outputs of the synthesis flow. This includes the gate-level `.v` file we looked at
397 | earlier in the lab. Other outputs include the design constraints (such as clock frequencies, output
398 | loads etc., in `.sdc` format) and delays between cells (in `.sdf` format).
399 | 
400 | ## Post-Synthesis Simulation
401 | From the root folder, type the following commands:
402 | 
403 |     make sim-gl-syn
404 |     
405 | This will run a post-synthesis simulation using annotated delays from the `gcd.mapped.sdf` file.
406 | 
407 | ---
408 | ### Question 5: Delay Questions
409 | **a) Check the waveforms in DVE. Submit a screenshot and report the clk-q delay of `state[0]` in `GCDctrl0` at 17.5 ns. Which line in the sdf file specifies this delay?**
410 | 
411 | ---
412 | 
413 | ## Build Your Divider
414 | Now that you understand how to use the tools to synthesize and simulate the GCD implementation,
415 | you will build a parameterized divider of unsigned integers in this section. Some initial code has
416 | been provided to you to get started. To keep the control logic simple, the divider module uses input
417 | signal `start` to begin the computation at the next clock cycle, and asserts output signal `done` to
418 | HIGH when the division result is valid. The input `dividend` and `divisor` should be registered
419 | when `start` is HIGH. You are not required to handle corner cases such as dividing by 0. You are
420 | free to modify the skeleton code to adopt ready/valid instead, but it is not required.
421 | 
422 | It is suggested that you implement the divide algorithm described [here](http://bwrcs.eecs.berkeley.edu/Classes/icdesign/ee141_s04/Project/Divider%20Background.pdf). Use the **Divide Algorithm Version 2** (slide 9).
423 | A simple testbench skeleton is also provided to you. You should change it to add more test vectors,
424 | or test your divider with different bitwidths. You need to change the file `sim-rtl.yml` to use your
425 | divider instead of the GCD module when testing.
426 | 
427 | ---
428 | ### Question 6: HAMMER your divider
429 | **1. Push your 4-bit divider design through the tools, and determine its critical path, cell area, and maximum operating frequency from the reports. You might need to rerun synthesis multiple times to determine the maximum achievable frequency.**
430 | 
431 | **2. Change the bitwidth of your divider to 32 bits. What are the critical path, area, and maximum operating frequency now?**
432 | 
433 | **3. Submit your divider code and testbench in the report. Add comments to explain your testbench and why it provides sufficient coverage for your divider module.**
434 | 
435 | ---
436 | ## Lab Deliverables
437 | 
438 | ### Lab Due: 11:59 PM, Friday September 24th, 2021
439 | 
440 | - Submit a written report with all 6 questions answered to Gradescope
441 | - Checkoff with an ASIC lab TA
442 | 
443 | ## Acknowledgement
444 | 
445 | This lab is the result of the work of many EECS151/251 GSIs over the years including:
446 | Written By:
447 | - Nathan Narevsky (2014, 2017)
448 | - Brian Zimmer (2014)
449 | Modified By:
450 | - John Wright (2015,2016)
451 | - Ali Moin (2018)
452 | - Arya Reais-Parsi (2019)
453 | - Cem Yalcin (2019)
454 | - Tan Nguyen (2020)
455 | - Harrison Liew (2020)
456 | - Sean Huang (2021)
457 | - Daniel Grubb, Nayiri Krzysztofowicz, Zhaokai Liu (2021)
458 | 


--------------------------------------------------------------------------------
/lab4/spec_sky130.md:
--------------------------------------------------------------------------------
  1 | # EECS 151/251A ASIC Lab 4: Floorplanning, Placement, Power, and CTS
  2 | 
  3 | <p align="center">
  4 | Prof. Bora Nikolic
  5 | </p>
  6 | <p align="center">
  7 | TAs: Daniel Grubb, Nayiri Krzysztofowicz, Zhaokai Liu
  8 | </p>
  9 | <p align="center">
 10 | Department of Electrical Engineering and Computer Science
 11 | </p>
 12 | <p align="center">
 13 | College of Engineering, University of California, Berkeley
 14 | </p>
 15 | 
 16 | ## Overview
 17 | This lab consists of three parts. For the first part, you will be writing a GCD coprocessor that could be included alongside a general-purpose CPU (like your final project). You will then learn how the tools can create a floorplan, route power straps, place standard cells, perform timing optimizations, and generate a clock tree for your design. Finally, you will get a slight head start on your project by writing part of the ALU.
 18 | 
 19 | To begin this lab, get the project files and set up your environment by typing the following command and sourcing the `eecs151.bashrc` file, as usual:
 20 | 
 21 | ```shell
 22 | git clone /home/ff/eecs151/fa21/sky130/lab4-sky130.git
 23 | ```
 24 | 
 25 | You should also clean up the build directory generated from the previous labs to save some disk space.
 26 | 
 27 | ## Writing Your Coprocessor
 28 | 
 29 | Take a look at the `gcd_coprocessor.v` file in the src folder. You will see the following empty Verilog module.
 30 | 
 31 | ```verilog
 32 | module gcd_coprocessor #( parameter W = 32 )(
 33 |   input clk,
 34 |   input reset,
 35 |   input operands_val,
 36 |   input [W-1:0] operands_bits_A,
 37 |   input [W-1:0] operands_bits_B,
 38 |   output operands_rdy,
 39 |   output result_val,
 40 |   output [W-1:0] result_bits,
 41 |   input result_rdy
 42 | );
 43 | 
 44 | // You should be able to build this with only structural verilog!
 45 | // Define wires
 46 | // Instantiate gcd_datapath
 47 | // Instantiate gcd_control
 48 | // Instantiate request FIFO
 49 | // Instantiate response FIFO
 50 | 
 51 | endmodule
 52 | 
 53 | ```
 54 | First notice the parameter `W`. `W` is the data width of your coprocessor; the input data and output data will all be this bitwidth. Be sure to pass this parameter on to any submodules that may use it! You should implement a coprocessor that can handle 4 outstanding requests at a time. For now, you will use a FIFO (First-In, First-Out) block to store requests (operands) and responses (results).
 55 | 
 56 | A FIFO is a sequential logic element which accepts (enqueues) valid data and outputs (dequeues) it in the same order when the next block is ready to accept. This is useful for buffering between the producer of data and its consumer. When the input data is valid (`enq_val`) and the FIFO is ready for data (`enq_rdy`), the input data is enqueued into the FIFO. There are similar signals for the output data. This interface is called a “decoupled” interface, and if implemented correctly it makes modular design easy (although sometimes with performance penalties).
 57 | 
 58 | This FIFO is implemented with a 2-dimensional array of data called `buffer`. There are two pointers: a read pointer `rptr` and a write pointer `wptr`. When data is enqueued, the write pointer is incremented. When data is dequeued, the read pointer is incremented. Because the FIFO depth is a power of 2, we can leverage the fact that addition rolls over and the FIFO will continue to work. However, once the read and write pointers are the same, we don’t know if the FIFO is full or empty. We fix this by writing to the `full` register when they are the same and we just enqueued, and clearing the `full` register otherwise.
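
The pointer and `full` flag behavior described above can be sketched as a small behavioral model in Python. This is not the Verilog you need to write: it handles one operation at a time (no simultaneous enqueue and dequeue), and the names `FifoModel` and `deq_val` are assumptions made for illustration.

```python
class FifoModel:
    """Behavioral model of a power-of-2-depth FIFO with a `full` flag."""

    def __init__(self, depth=4):
        assert depth > 0 and depth & (depth - 1) == 0, "depth must be a power of 2"
        self.depth = depth
        self.buffer = [None] * depth
        self.rptr = 0
        self.wptr = 0
        self.full = False

    @property
    def enq_rdy(self):
        # The FIFO can accept data whenever it is not full.
        return not self.full

    @property
    def deq_val(self):
        # Equal pointers mean "empty" unless the full flag is set.
        return self.full or self.rptr != self.wptr

    def enqueue(self, data):
        assert self.enq_rdy
        self.buffer[self.wptr] = data
        self.wptr = (self.wptr + 1) % self.depth  # wraps like fixed-width addition
        self.full = self.wptr == self.rptr        # pointers met right after an enqueue

    def dequeue(self):
        assert self.deq_val
        data = self.buffer[self.rptr]
        self.rptr = (self.rptr + 1) % self.depth
        self.full = False                         # a dequeue always frees a slot
        return data
```

After four enqueues into a depth-4 model, both pointers are back at 0 and only the `full` flag distinguishes this state from the empty one.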
 59 | 
 60 | A partially written FIFO has been provided for you in `fifo.v`. Using the information above, complete the FIFO implementation so that it behaves as expected.
 61 | 
 62 | 
 63 | Then, finish the coprocessor implementation in `gcd_coprocessor.v`, so that the GCD unit and FIFOs are connected like in the following diagram. Note the connection between the `gcd_datapath` and `gcd_control` should be very similar to that in Lab 3’s `gcd.v` and that clock and reset are omitted from the diagram. You will need to think about how to manage a ready/valid decoupled interface with 2 FIFOs in parallel.
 64 | 
 65 | 
 66 | <p align="center">
 67 | <img src="./figs/gcd_coprocessor.png" width="600" />
 68 | </p>
 69 | 
 70 | A testbench has been provided for you (`gcd_coprocessor_testbench.v`). You can run the testbench to test your code by typing `make sim-rtl` in the root directory as before.
 71 | 
 72 | ---
 73 | ### Question 1: Design
 74 | 
 75 | **a) Submit your code (`gcd_coprocessor.v` and `fifo.v`) and show that your code works (VCS output is fine).**
 76 | 
 77 | ---
 78 | 
 79 | ## Introducing Place and Route
 80 | 
 81 | In this lab, you will begin to implement your GCD coprocessor in physical layout–the next step towards making it a real integrated circuit. Place & Route (P&R or PAR) itself is a much longer process than synthesis, so for this lab we will look at the first few (and arguably most important) steps: floorplanning, placement, power straps, and clock tree synthesis (CTS). The rest will be introduced in the next lab.
 82 | 
 83 | ### Setting up for P&R
 84 | 
 85 | We will first bring our design to the point we stopped in Lab 3. Synthesize your design:
 86 | 
 87 | 
 88 | ```shell
 89 | make syn
 90 | ```
 91 | 
 92 | Before proceeding, make sure your design is working correctly. It should meet timing at the default 10ns clock period in the setup corner with plenty of slack.
 93 | 
 94 | ### Floorplanning & Placement
 95 | Floorplanning is the process of allocating area to the design as well as putting constraints on how this area is utilized. Floorplanning is often the most important factor for determining a physical circuit’s performance, because intelligent floorplanning can assist the tool in minimizing the delays in the design, especially if the total area is highly constrained.
 96 | 
 97 | Floorplan constraints can be “hard” or “soft”. “Hard” constraints generally involve pre-placement of “macros”, which can be anything from memory elements (SRAM arrays, in an upcoming lab) to analog black boxes (like PLLs or LDOs). “Soft” constraints are generally guided placements of hierarchical modules in the design (e.g. the datapath, controller, and FIFOs in your coprocessor), towards certain regions of the floorplan. Generally, the P&R tool does a good job of placing hierarchical modules optimally, but sometimes, a little human assistance is necessary to eke out the last bit of performance.
 98 | 
 99 | In this lab, we will just look at allocating a custom sized area to our design, specified in the `design-sky130.yml` file. Open up this file and locate the following text block:
100 | 
101 | ```yaml
102 | # Placement Constraints
103 | vlsi.inputs.placement_constraints:
104 |   - path: "gcd_coprocessor"
105 |     type: "toplevel"
106 |     x: 0
107 |     y: 0
108 |     width: 150
109 |     height: 150
110 |     margins:
111 |       left: 10
112 |       right: 10
113 |       top: 10
114 |       bottom: 10
115 |   - path: "gcd_coprocessor/GCDpath0"
116 |     type: "placement"
117 |     x: 50
118 |     y: 50
119 |     width: 50
120 |     height: 50
121 | 
122 | # Pin placement constraints
123 | vlsi.inputs.pin_mode: generated
124 | vlsi.inputs.pin.generate_mode: semi_auto
125 | vlsi.inputs.pin.assignments: [
126 |   {pins: "*", layers: ["met2", "met4"], side: "bottom"}
127 | ]
128 | ```
129 | 
130 | The `vlsi.inputs.placement_constraints` block specifies two floorplan constraints. The first one denotes the origin `(x, y)`, size `(width, height)` and border margins of the top-level block `gcd_coprocessor`. The second one denotes a soft placement constraint on the GCD datapath to be roughly in the center of the floorplan. For complicated designs, floorplans of major modules are often defined separately, and then assembled together hierarchically.
131 | 
132 | Pin constraints are also shown here. All that we need to see is that all pins are located at the bottom boundary of the design, on metal 2 and metal 4 layers. Pin placement becomes very important in a hierarchical design, if modules need to abut each other.
133 | 
134 | Placement is the process of placing the synthesized design (structural connection of standard cells) onto the specified floorplan. While minor cells (such as bulk connection cells, antenna-effect prevention cells, I/O buffers...) are placed separately, in between various stages of the design flow, “placement” usually refers to the initial placement of the standard cells.
135 | 
136 | After the cells are placed, they are not “locked”–they can be moved around by the tool during subsequent optimization steps. However, initial placement tries its best to place the cells optimally, obeying the floorplan constraints and using complex heuristics to minimize the parasitic delay caused by the connecting wires between cells and timing skew between synchronous elements (e.g. flip-flops, memories). Poor placement (as well as poor aspect ratio of the floorplan) can result in congestion of wires later on in the design, which may prevent successful routing.
137 | 
138 | ### Power
139 | 
140 | 
141 | In the middle of the `sky130.yml` file, you will see this block, which contains parameters to HAMMER’s power strap auto-calculation API:
142 | 
143 | ```yaml
144 | # Power Straps
145 | par.power_straps_mode: generate
146 | par.generate_power_straps_method: by_tracks
147 | par.blockage_spacing: 2.0
148 | par.generate_power_straps_options:
149 |   by_tracks:
150 |     strap_layers:
151 |       - met2
152 |       - met3
153 |       - met4
154 |       - met5
155 |     pin_layers:
156 |       - met5
157 |     track_width: 6
158 |     track_width_met5: 2
159 |     track_spacing: 1
160 |     track_start: 10
161 |     power_utilization: 0.25
162 |     power_utilization_met5: 1
163 | ```
164 | 
165 | Power must be delivered to the cells from the topmost metal layers all the way down to the transistors, in a fashion that minimizes the overall resistance of the power wires without eating up all the resources that are needed for wiring the cells together. You will learn about power distribution briefly at the end of this course’s lectures, but the preferred method is to place interconnected grids of fat wires on every metal layer. There are tools to check the quality of the power distribution network, which, like the post-P&R simulations you did in Lab 2, calculate how the current drawn by the circuit is transiently distributed across the power grid.
166 | 
167 | You should not need to touch this block of yaml, because the parameters are tuned for meeting design rules in this technology. However, the important parameter is `power_utilization`, which specifies that approximately 25% of the available routing space on each metal layer should be reserved for power, with the exception of metal 5, which should have 100% coverage.
168 | 
169 | ### Clock Tree Synthesis (CTS): Overview
170 | 
171 | Clock Tree Synthesis (CTS) is arguably the next most important step in P&R behind floorplanning. Recall that up until this point, we have not talked about the clock that triggers all the sequential logic in our design. This is because the clock signal is assumed to arrive at every sequential element in our design at the same time. The synthesis tool makes this assumption and so does the initial cell placement algorithm. In reality, the sequential elements have to be placed wherever makes the most sense (e.g. to minimize delays between them). As a result, there is a different amount of delay to every element from the top-level clock pin that must be “balanced” to maintain the timing results from synthesis. We shall now explore the steps the P&R tool takes to solve this problem and why it is called Clock Tree Synthesis.
172 | 
173 | ### Pre-CTS Optimization
174 | 
175 | Pre-CTS optimization is the first round of Static Timing Analysis (STA) and optimization performed on the design. It is performed after the initial cell placement and has considerable freedom to move cells around to optimize your design to meet setup checks. Hold errors are not checked during pre-CTS optimization. Because we do not have a clock tree in place yet, we do not know when the clock will arrive at each sequential element, hence we don’t know if there are hold violations. The tool therefore assumes that every sequential element receives the clock ideally at the same time, and tries to balance out the delays in data paths to ensure no setup violations occur. In the end, it generates a timing report, very similar to the ones we saw in the last lab.
176 | 
177 | ### Clock Tree Clustering and Balancing
178 | Most of CTS is accomplished after initial optimization. The CTS algorithm first clusters groups of sequential elements together, mostly based on their position in the design relative to the top-level clock pin and common clock gating logic. The number of elements in each cluster is selected so that it does not present too large a load to a driving cell. These clusters of sequential elements are the “leaves” of the clock tree attached to branches.
179 | 
180 | Next, the CTS algorithm tries to ensure that the delays from the top-level clock pin to the leaves are all the same. It accomplishes this by adding and sizing clock buffers between the top-level pin and the leaves. There may be multiple stages of clock buffering, depending on how physically large the design is. Each clock buffer that drives multiple loads is a branching point in the clock tree, and strings of clock buffers in a row are essentially the “trunks”. Finally, the top-level clock pin is considered the “root” of the clock tree.
181 | 
182 | The CTS algorithm may go through many iterations of clustering and balancing. It will try to minimize the depth of the tree (called *insertion delay*, i.e. the delay from the root to the leaves) while simultaneously minimizing the *skew* (difference in insertion delay) between each leaf in the tree. The deeper the tree, the harder it is to meet both setup and hold timing (*thought experiment #1*: why is this?).
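
As a rough illustration of the two quantities being minimized, the sketch below computes the worst-case insertion delay and the global skew from a set of root-to-leaf clock delays. The leaf names and delay values are hypothetical and are not taken from any report in this lab.

```python
# Hypothetical root-to-leaf clock delays in ps (illustrative values only).
leaf_delays_ps = {
    "A_reg[0]": 412,
    "A_reg[1]": 405,
    "B_reg[0]": 398,
    "state[0]": 420,
}

# Worst-case insertion delay: the longest delay from the root to any leaf.
insertion_delay_ps = max(leaf_delays_ps.values())

# Global skew: the spread between the latest and earliest clock arrival.
skew_ps = max(leaf_delays_ps.values()) - min(leaf_delays_ps.values())

print(insertion_delay_ps, skew_ps)
# prints: 420 22
```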
183 | 
184 | ### Post-CTS Optimization
185 | Post-CTS optimization is then performed, where the clock is now a real signal that is being distributed unequally to different parts of the design. In this step, the tool fixes setup and hold time violations simultaneously. Often, fixing one error may introduce one or more new errors (*thought experiment #2*: why is this?), so this process is iterative until it reaches convergence (which may or may not meet your timing constraints!). Fixing these violations involves resizing, adding/deleting, and even moving the logic and clock cells.
186 | 
187 | After this stage of optimization, the clock tree and clock routing are fixed. In the next lab, you will finish the P&R flow, which finalizes the rest of the routing, but it is usually the case that if your design is unable to meet timing after CTS, there’s no point continuing!
188 | 
189 | ## Compiling the Design with HAMMER
190 | 
191 | Now that we went over the flow (at least at a high level), it is time to actually perform these steps. Type the following commands to perform the above described operations:
192 | 
193 | ```shell
194 | make syn-to-par
195 | make redo-par HAMMER_EXTRA_ARGS="--stop_after_step clock_tree"
196 | ```
197 | 
198 | 
199 | The first command here translates the outputs of the synthesis tool to conform to the inputs expected by the P&R tool. The second command is similar to the partial synthesis commands we used in the last lab. It tells HAMMER to do the PAR flow until it finishes CTS, then stop. Under the hood, for this lab, HAMMER uses Cadence Innovus as the back-end tool to perform P&R. HAMMER waits until Innovus is done with the P&R steps through post-CTS optimization, then exits. You will see that HAMMER again gives you an error - similar to last lab when HAMMER expected a synthesized output, this time HAMMER expects the full flow to be completed and gives an error whenever it can’t find some collateral expected of P&R.
200 | 
201 | Once done, look into the `build/par-rundir` folder. Similar to how all the synthesis files were placed under the `build/syn-rundir` folder in the previous lab, this folder holds all the P&R files. Go ahead and open the `par.tcl` file in a text editor. HAMMER generated this file for Innovus to consume in batch mode, and inside are Innovus Common UI commands as a TCL script.
202 | 
203 | While we will be looking through some of these commands in a bit, first take a look at `timingReports`. You should only see the pre-CTS timing reports. `gcd_coprocessor_preCTS_all.tarpt.gz` contains the report in a gzipped archive. The remaining files also contain useful information regarding capacitances, length of wires etc. You may view these directly using Vim, decompress them using `gzip -d`, or navigate through them with Caja, the file browser.
204 | 
205 | Going back a level, in `par-rundir`, the folder `hammer_cts_debug` has the post-CTS timing reports. The two important archives are `hammer_cts_all.tarpt.gz` and `hammer_cts_all_hold.tarpt.gz`. These contain the setup and hold timing analyses results after post-CTS optimization. Look into the hold report (you may actually see some violations!). However, any violation should be small (<1 ps) and because we have a lot of margins during design (namely the `design.yml` file has “clock uncertainty” set to 100 ps), these small violations are not of concern, but should still be investigated in a real design.
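
If you prefer to scan a gzipped report without decompressing it first, a short Python helper along these lines can work. It is only a sketch: the function name `count_lines_with` is ours, and the assumption that failing paths are marked with the text `VIOLATED` (as in the synthesis report from the last lab) should be checked against the actual Innovus report format.

```python
import gzip


def count_lines_with(path, marker):
    """Count lines in a gzipped text report that contain `marker`."""
    # "rt" opens the archive in text mode; undecodable bytes are replaced.
    with gzip.open(path, "rt", errors="replace") as report:
        return sum(1 for line in report if marker in line)
```

For example, `count_lines_with("hammer_cts_debug/hammer_cts_all_hold.tarpt.gz", "VIOLATED")` would count the matching lines in the hold report, assuming it is run from `build/par-rundir`.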
206 | 
207 | ## Visualizing the Results
208 | 
209 | From the `build/par-rundir` folder, execute the following in a terminal with graphics (X2Go highly recommended for low latency):
210 | 
211 | ```shell
212 | ./generated-scripts/open_chip
213 | ```
214 | The Innovus GUI will pop up with your layout and your terminal is now the Innovus shell. After the window opens, click anywhere inside the black window at the center of the GUI and press “F” to zoom-to-fit. You should see your entire design, which should look roughly similar to the one below once you disable the via4 and met5 layers (because recall that the power straps in these metal layers were set to 100% coverage) using the right panel by unchecking their respective boxes under the “V” column:
215 | 
216 | <p align="center">
217 | <img src="./figs/sky130/innovus_window.png" width="500" />
218 | </p>
219 | 
220 | 
221 | Take a moment to familiarize yourself with the Innovus GUI. You should also toggle between the floorplan, amoeba, and placement views using the buttons in the top right corner of the screen that look like this: <img src="./figs/view_icons.png" width="40" />  and examine how the actual placement of the GCD datapath in amoeba view doesn’t follow our soft placement guidance in floorplan view. This is because our soft placement guidance clearly places the datapath farther away from the pins and would result in a worse clock tree!
222 | 
223 | Now, let’s take a look at the clock tree a couple different ways. In the right panel, under the “Net” category, hide from view all the types of nets except “Clock”. Your design should now look approximately like this, which shows the clock tree routing:
224 | 
225 | <p align="center">
226 | <img src="./figs/sky130/clock_tree_nets.png" width="500" />
227 | </p>
228 | 
229 | 
230 | We can also see the clock tree in its “tree” form by going to the menu Clock → CCOpt Clock Tree Debugger and pressing OK in the popup dialog. A window should pop up looking approximately like this:
231 | 
232 | <p align="center">
233 | <img src="./figs/sky130/clock_tree_debugger.png" width="500" />
234 | </p>
235 | 
236 | 
237 | The red dots are the “leaves”, the green triangles are the clock buffers, the blue dots are clock gates (they are used to save power), and the green pin on top is the clock pin or the clock “root”. The numbers on the left side denote the insertion delay in ps.
238 | 
239 | Now, let’s visualize our critical path. Go to the menu Timing → Debug Timing and press OK in the popup dialog. A window will pop up that looks approximately like this:
240 | 
241 | <p align="center">
242 | <img src="./figs/sky130/timing_debug.png" width="500" />
243 | </p>
244 | 
245 | Examine the histogram. This shows the number of paths for every amount of slack (on the x-axis), and you always want to see a green histogram! The shape of the histogram is a good indicator of how good your design is and how hard the tool is working to meet your timing constraints (*thought experiment #3:* how so, and what would be the ideal histogram shape?).
246 | 
247 | Now right-click on Path 1 in this window (the critical path), select Show Timing Analyzer and Highlight Path, and select a color. A window will pop up, which is a graphical representation of the timing reports you saw in the `hammer_cts_debug` folder. Poke around the tabs to see all the different representations of this critical path. Back in the main Innovus window, the critical path will be highlighted, showing the chain of cells along the path and the approximate routing it takes to get there, which may look something like this (with all Layers disabled):
248 | 
249 | <p align="center">
250 | <img src="./figs/sky130/critical_path_highlight.png" width="500" />
251 | </p>
252 | 
253 | ---
254 | 
255 | ### Question 2: Interpreting P&R Timing Reports
256 | **a) What is the critical path of your design pre- and post-CTS? Is it the same as the post-synthesis critical path?**
257 | 
258 | **b) Look in the post-CTS text timing report (`hammer_cts_debug/hammer_cts.all.tarpt`). Find a path inside which the same cell is used more than once. Identify the delay of those instances of that common cell. Can you explain why they are different?**
259 | 
260 | **c) What is the skew between the clock that arrives at the flip-flops at the beginning and end of the post-CTS critical path? Does this skew help or hurt the timing margin calculation?**
261 | 
262 | d) (UNGRADED thought experiment #1) Why is it harder to meet both setup and hold timing constraints if the clock tree has large insertion delay?
263 | 
264 | e) (UNGRADED thought experiment #2) Why can fixing one setup or hold error introduce one or more new errors? Is a new error more likely to be of the same type or a different type, and why?
265 | 
266 | f) (UNGRADED thought experiment #3) P&R tools have a goal to minimize power while ensuring that all paths have >0ps of slack. What might a timing path histogram look like in a design that has maximized the frequency it can run at while meeting this goal? Given the histogram obtained here, does it look like we can increase our performance? What might we need to improve/change?
267 | 
268 | ---
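
When reading the timing reports for Question 2, it can help to write down the textbook setup and hold checks explicitly. The sketch below is a simplified model, assuming a single clock and ignoring clock uncertainty, derating, and other terms that a real report includes. All numbers are hypothetical and in ps, and the function names are chosen for illustration only.

```python
def setup_slack(period, launch_clk, capture_clk, data_delay_max, t_setup):
    """Setup check: data must arrive before the next capture edge minus setup time.

    launch_clk / capture_clk are clock arrival times at the two flip-flops;
    data_delay_max is clock-to-Q plus the slowest logic delay on the path.
    """
    required = period + capture_clk - t_setup
    arrival = launch_clk + data_delay_max
    return required - arrival

def hold_slack(launch_clk, capture_clk, data_delay_min, t_hold):
    """Hold check: data must stay stable until hold time after the same capture edge."""
    required = capture_clk + t_hold
    arrival = launch_clk + data_delay_min
    return arrival - required

# Hypothetical values in ps. Skew is capture_clk - launch_clk.
for capture in (300, 350):
    skew = capture - 300
    s = setup_slack(period=2000, launch_clk=300, capture_clk=capture,
                    data_delay_max=1500, t_setup=100)
    h = hold_slack(launch_clk=300, capture_clk=capture,
                   data_delay_min=120, t_hold=50)
    print(f"skew={skew:3d} ps  setup slack={s} ps  hold slack={h} ps")
```

Comparing the two printed rows shows how the same skew enters the setup and hold equations with opposite signs.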
269 | 
270 | When you are done, you may exit Innovus by closing the GUI window.
271 | ## Under the Hood: Innovus
272 | While HAMMER hides many tool-specific commands from the end user, most IC companies interface with Innovus directly, and it is useful to know which tool-specific commands you are running in case you need to debug your circuit step-by-step. Therefore, we will now look into `par.tcl` and follow along using Innovus. Make sure you are in the directory `build/par-rundir` and type:
273 | 
274 | ```shell
275 | innovus -common_ui
276 | ```
277 | Now, follow `par.tcl` command-by-command, copying and pasting the commands to the Innovus shell and looking at the GUI for any changes. You may skip the `puts` commands, as they just tell the tool to print out what it's doing, and the `write_db` commands, which write a checkpoint database between each step of the P&R flow. The steps after which you will see significant changes are listed below. As you progress through the steps, feel free to zoom in to investigate what is going on with the design, look at the extra TCL files that are sourced, and cross-reference the commands with the command reference manual at `/home/ff/eecs151/labs/manuals/TCRcom.pdf`.
278 | 
279 | 1. After the command sourcing `floorplan.tcl`
280 | 2. After the command sourcing `power_straps.tcl`
281 | 3. After the command `edit_pin`
282 | 4. After the command `place_opt_design`
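
If you would like a shorter list of commands to paste, the skipping rule above can be sketched in a few lines of Python. This is only an illustration: it assumes each command sits on its own line, the function name `commands_to_run` is hypothetical, and the sample text is made up rather than copied from a real `par.tcl`.

```python
SKIPPED_COMMANDS = ("puts", "write_db")

def commands_to_run(tcl_text):
    """Return the Tcl lines worth pasting, dropping puts/write_db, comments, and blanks."""
    kept = []
    for line in tcl_text.splitlines():
        stripped = line.strip()
        if not stripped or stripped.startswith("#"):
            continue
        # Assumption: the command name is the first word on the line.
        if stripped.split()[0] in SKIPPED_COMMANDS:
            continue
        kept.append(stripped)
    return kept

# Made-up sample, not the contents of a real par.tcl.
sample = """\
puts "source floorplan.tcl"
source floorplan.tcl
write_db pre_place_opt_design
puts "place_opt_design"
place_opt_design
"""

for command in commands_to_run(sample):
    print(command)
```

To try it on the real file, read `par.tcl` with `open("par.tcl").read()` from inside `build/par-rundir` and pass the text to `commands_to_run`.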
283 | 
284 | After the `ccopt_design` command is run, you may see a bunch of white X markers on your design. These are some Design Rule Violations (DRVs), indicating Innovus didn’t quite comply with the technology’s requirements. Ignore these for the purposes of this lab.
285 | 
286 | ---
287 | 
288 | ### Question 3: Understanding P&R Steps
289 | 
290 | **a) Submit a snapshot of your design for each of the four steps described above (use whichever Innovus view you deem is most appropriate to show the changes of each step). Make sure the via4 and met5 layers are not visible, and your design is zoomed-to-fit. Describe how the design layout changes for each major step in their respective figure captions.**
291 | 
292 | **b) Examine the power straps on met1, in relation to the cells. You will need to zoom in far enough to see the net label on the straps. What does their pattern tell you about how digital standard cells are constructed?**
293 | 
294 | **c) Take note of the orientations of power straps and routing metals. If you were to place pins on the right side of this block instead of the bottom, what metal layers could they be on?**
295 | 
296 | ---
297 | 
298 | Now zoom in to one of the cells and click the box next to “Cell” on the right panel of the GUI. This will show you the internal routing of the standard cells. While by default we have this off, it may prove useful when investigating DRVs in a design. You can now exit the application by closing the GUI window.
299 | 
300 | ## Project Preparation
301 | ---
302 | ### Question 4: ALU
303 | In this question, you will be designing and testing an ALU for use later in the semester. A header file containing define statements for operations (`ALUop.vh`) is provided inside the `src` directory of this lab. This file has already been included in an ALU template given to you in the same folder (`ALU.v`), but you may need to modify the include statement to match the correct path of the header file. Compare the `ALUop` input of your ALU against the define statements inside the header file to select the function the ALU performs. The functions are defined below:
304 |   
305 | | Op Code |                         Definition                        |
306 | |:-------:|:---------------------------------------------------------:|
307 | |   ADD   |                        Add A and B                        |
308 | |   SUB   |                     Subtract B from A                     |
309 | |   AND   |                    Bitwise `and` A and B                    |
310 | |    OR   |                     Bitwise `or` A and B                    |
311 | |   XOR   |                    Bitwise `xor` A and B                    |
312 | |   SLT   |        Perform a signed comparison, Out=1 if  A < B       |
313 | |   SLTU  |      Perform an unsigned comparison, Out = 1 if A < B     |
314 | |   SLL   |   Logical shift left A by an amount indicated by B[4:0]   |
315 | |   SRA   | Arithmetic shift right A by an amount indicated by B[4:0] |
316 | |   SRL   |   Logical shift right A by an amount indicated by B[4:0]  |
317 | |  COPY_B |                    Output is equal to B                   |
318 | |   XXX   |                        Output is 0                        |
319 | 
320 | Given these definitions, complete `ALU.v` and write a testbench for `ALU.v` that checks all of these operations with random inputs at least 100 times per operation and outputs a PASS/FAIL indicator. For this lab, we will only check for effort and not correctness, but you will need it to work later!
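
One way to get expected values for a testbench is a small software reference model. The Python sketch below follows the definitions in the table, assuming a 32-bit datapath (suggested by the `B[4:0]` shift amount). The op names are plain strings chosen for illustration, with `"SRA"` standing for the arithmetic shift right; they are not the macros in `ALUop.vh`. The helpers `alu_model` and `make_vectors` are hypothetical, and how the vectors are fed to your Verilog testbench is up to you.

```python
import random

MASK = 0xFFFFFFFF  # assumes a 32-bit datapath

def to_signed(x):
    """Interpret a 32-bit pattern as a two's-complement integer."""
    x &= MASK
    return x - (1 << 32) if x & 0x80000000 else x

def alu_model(op, a, b):
    """Return the expected 32-bit ALU output for one operation."""
    a &= MASK
    b &= MASK
    shamt = b & 0x1F  # shift amount comes from B[4:0]
    results = {
        "ADD": a + b,
        "SUB": a - b,
        "AND": a & b,
        "OR": a | b,
        "XOR": a ^ b,
        "SLT": int(to_signed(a) < to_signed(b)),
        "SLTU": int(a < b),
        "SLL": a << shamt,
        "SRA": to_signed(a) >> shamt,  # Python's >> on negative ints sign-extends
        "SRL": a >> shamt,
        "COPY_B": b,
    }
    # Any other op code (the XXX row) yields 0; the mask wraps results to 32 bits.
    return results.get(op, 0) & MASK

def make_vectors(op, count=100, seed=0):
    """Generate (a, b, expected) tuples with random 32-bit operands."""
    rng = random.Random(seed)
    vectors = []
    for _ in range(count):
        a = rng.getrandbits(32)
        b = rng.getrandbits(32)
        vectors.append((a, b, alu_model(op, a, b)))
    return vectors

# Print a few hex vectors that a testbench could compare against.
for a, b, expected in make_vectors("SRA", count=3):
    print(f"{a:08x} {b:08x} {expected:08x}")
```

Random vectors rarely hit corner cases such as overflow or a shift amount of zero, so it may be worth adding a few directed tests as well.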
321 | 
322 | ---
323 | 
324 | ## Lab Deliverables
325 | 
326 | ### Lab Due: 11:59 PM, Friday October 1st, 2021
327 | 
328 | - Submit a written report with all 4 questions answered to Gradescope
329 | - Checkoff with an ASIC lab TA
330 | 
331 | ## Acknowledgement
332 | 
333 | This lab is the result of the work of many EECS151/251 GSIs over the years including:
334 | 
335 | Written By:
336 | - Nathan Narevsky (2014, 2017)
337 | - Brian Zimmer (2014)
338 | 
339 | Modified By:
340 | - John Wright (2015, 2016)
341 | - Ali Moin (2018)
342 | - Arya Reais-Parsi (2019)
343 | - Cem Yalcin (2019)
344 | - Tan Nguyen (2020)
345 | - Harrison Liew (2020)
346 | - Sean Huang (2021)
347 | - Daniel Grubb, Nayiri Krzysztofowicz, Zhaokai Liu (2021)
348 | 


--------------------------------------------------------------------------------