├── LICENSE
├── README.md
├── advanced-caches.md
├── assets
│   ├── 12_function-call-inlining.png
│   ├── 12_scheduling-if-conversion.png
│   ├── 15_cache-organization.png
│   ├── 15_direct-mapped-cache.png
│   ├── 15_offset-index-tag-setassociative.png
│   ├── 16_mapping-virtual-physical-memory.png
│   ├── 16_multilevel-page-table-structure.png
│   ├── 16_multilevel-page-tables.png
│   ├── 16_program-view-of-memory.png
│   ├── 16_two-level-page-table-example.png
│   ├── 16_why-virtual-memory.png
│   ├── 17_aliasing-in-virtually-accessed-caches.png
│   ├── 17_loop-interchange.png
│   ├── 17_overlap-misses.png
│   ├── 17_prefetch-instructions.png
│   ├── 17_tlb-cache-hit.png
│   ├── 17_vipt-cache-aliasing.png
│   ├── 17_virtually-indexed-physically-tagged.png
│   ├── 19_connecting-dram.png
│   ├── 19_memory-chip-organization.png
│   ├── 1bh2bc.png
│   ├── 1bit-statemachine.png
│   ├── 20_connecting-io-devices.png
│   ├── 20_magnetic-disks.png
│   ├── 22_centralized-shared-memory.png
│   ├── 22_distributed-memory.png
│   ├── 22_smt-cache-tlb.png
│   ├── 22_smt-hardware-changes.png
│   ├── 24_cache-coherence-problem.png
│   ├── 24_directory-entry.png
│   ├── 24_msi-coherence.png
│   ├── 24_write-update-snooping-coherence.png
│   ├── 25_how-is-llsc-atomic.png
│   ├── 25_synchronization-example.png
│   ├── 26_data-races-and-consistency.png
│   ├── 27_multi-core-power-and-performance.png
│   ├── 2bit-history-predictor.png
│   ├── 2bit-statemachine.png
│   ├── 7_all-inst-same-cycle.png
│   ├── 7_duplicating-register-values.png
│   ├── 7_false-dependencies-after-renaming.png
│   ├── 7_ilp-example.png
│   ├── 7_ilp-ipc-discussion.png
│   ├── 7_ilp-vs-ipc.png
│   ├── 7_rat-example.png
│   ├── 7_waw-dependencies.png
│   ├── 8_dispatch-gt1-ready.png
│   ├── 8_issue-example.png
│   ├── 8_tomasulo-review-1.png
│   ├── 8_tomasulos-algorithm-the-picture.png
│   ├── 8_write-result.png
│   ├── 9_reorder-buffer.png
│   ├── branch-target-buffer.png
│   ├── diminishing-returns.png
│   ├── fabrication-yield.png
│   ├── full-predication-example.png
│   ├── hierarchical-predictor.png
│   ├── history-shared-counters.png
│   ├── historypredictor.png
│   ├── memory-wall.png
│   ├── movz-movn-performance.png
│   ├── static-power.png
│   ├── tournament-predictors.png
│   └── weaktakenflipflop.png
├── branches.md
├── cache-coherence.md
├── cache-review.md
├── cheat-sheet-midterm.md
├── compiler-ilp.md
├── course-information.md
├── fault-tolerance.md
├── ilp.md
├── instruction-scheduling.md
├── introduction.md
├── many-cores.md
├── memory-consistency.md
├── memory-ordering.md
├── memory.md
├── metrics-and-evaluation.md
├── multi-processing.md
├── pipelining.md
├── predication.md
├── reorder-buffer.md
├── storage.md
├── synchronization.md
├── virtual-memory.md
├── vliw.md
└── welcome.md
/LICENSE:
--------------------------------------------------------------------------------
1 | MIT License
2 |
3 | Copyright (c) 2019 David Harris
4 |
5 | Permission is hereby granted, free of charge, to any person obtaining a copy
6 | of this software and associated documentation files (the "Software"), to deal
7 | in the Software without restriction, including without limitation the rights
8 | to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
9 | copies of the Software, and to permit persons to whom the Software is
10 | furnished to do so, subject to the following conditions:
11 |
12 | The above copyright notice and this permission notice shall be included in all
13 | copies or substantial portions of the Software.
14 |
15 | THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
16 | IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
17 | FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
18 | AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
19 | LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
20 | OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
21 | SOFTWARE.
22 |
--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
1 | # cs6290-notes
2 | Notes for CS6290, High Performance Computer Architecture
3 |
--------------------------------------------------------------------------------
/advanced-caches.md:
--------------------------------------------------------------------------------
1 | ---
2 | id: advanced-caches
3 | title: Advanced Caches
4 | sidebar_label: Advanced Caches
5 | ---
6 |
7 | [🔗Lecture on Udacity (1.5 hr)](https://classroom.udacity.com/courses/ud007/lessons/1080508587/concepts/last-viewed)
8 |
9 | ## Improving Cache Performance
10 | Average Memory Access Time: \\(\text{AMAT} = \text{hit time} + \text{miss rate} * \text{miss penalty} \\)
11 | - Reduce hit time
12 | - Reduce miss rate
13 | - Reduce miss penalty
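
To make the formula concrete, here is a minimal C sketch (the numbers are made up for illustration) showing how pulling each of the three levers lowers AMAT:

```c
#include <stdio.h>

/* AMAT = hit time + miss rate * miss penalty (all times in cycles). */
static double amat(double hit_time, double miss_rate, double miss_penalty) {
    return hit_time + miss_rate * miss_penalty;
}

int main(void) {
    /* Hypothetical baseline: 2-cycle hit, 10% miss rate, 20-cycle miss penalty. */
    printf("baseline:           %.2f\n", amat(2.0, 0.10, 20.0)); /* 4.00 */
    printf("halve hit time:     %.2f\n", amat(1.0, 0.10, 20.0)); /* 3.00 */
    printf("halve miss rate:    %.2f\n", amat(2.0, 0.05, 20.0)); /* 3.00 */
    printf("halve miss penalty: %.2f\n", amat(2.0, 0.10, 10.0)); /* 3.00 */
    return 0;
}
```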
14 |
15 | ## Reduce Hit Time
16 | - Reduce cache size (bad for miss rate)
17 | - Reduce cache associativity (bad for miss rate)
18 | - Overlap cache hit with another hit
19 | - Overlap cache hit with TLB hit
20 | - Optimize lookup for common case
21 | - Maintain replacement state more quickly
22 |
23 | ### Pipelined Caches (overlap with another hit)
24 | - Multiple Cycles to access cache
25 | - Access comes in cycle N (hit)
26 | - Access comes in cycle N+1 (hit) - has to wait
27 | - Hit time = actual hit + wait time
28 | - Pipeline in 3 stages (for example):
29 |     - Stage 1: Use the index to select the set, and read out the tags and valid bits in that set
30 |     - Stage 2: Combine the per-way comparisons to determine whether we hit overall and in which way, and begin reading the data
31 |     - Stage 3: Finish the data read, using the offset to select the right part of the cache block, and return it
32 |
33 | ## TLB and Cache Hit
34 | Hit time = TLB access time + Cache access time
35 | 
36 | This represents the "Physically Indexed Physically Tagged" cache (PIPT), since both the tags and indexes are physical
37 |
38 | ## Virtually Accessed Cache
39 | Also called VIVT (Virtually Indexed Virtually Tagged). The virtual address accesses the cache directly to get the data, and we only go to the TLB for the physical address on a cache miss. The cache is now tagged by virtual address instead of physical, so the hit time is simply the cache hit time. An additional benefit could be skipping the TLB entirely on a cache hit, but in reality we still need TLB information such as permissions (RWX), so we still perform a TLB lookup even on a hit.
40 |
41 | A bigger issue is that virtual address is specific to each process - so, virtual addresses that are the same will likely be pointing to different physical locations. This can be solved by flushing the cache on a context switch, which causes a burst of cache misses each time we come back to that process.
42 |
43 | ## Virtually Indexed Physically Tagged (VIPT)
44 | In this cache, we use the index bits of the virtual address to select the correct set in the cache and look at the tags in that set. Meanwhile, we also use the page number portion of the virtual address to look up the frame number in the TLB. This frame number is compared against the tags in the indexed set. The hit time for this cache is the max of the cache hit time and the TLB lookup time, since these activities occur in parallel. Additionally, we do not have to flush the cache on a context switch, because the cache lines are tagged with the physical address.
45 | 
46 |
47 | ## Aliasing in Virtually Accessed Caches
48 | Aliasing is a problem with virtually accessed caches because multiple virtual addresses could legitimately point to the same physical location in memory, yet be cached separately due to different indices from the virtual addresses. If one location is modified, a cached version of the other virtual address may not see the change.
49 |
50 | 
51 |
52 | ## VIPT Cache Aliasing
53 | Depending on how the cache is constructed, it is possible to avoid aliasing issues with VIPT. In this example, the index and offset bits are both part of the page offset, and since VIPT caches use a physical tag, the same block of memory is always found in the cache even when it is accessed through multiple virtual addresses. But this requires the cache to be constructed in a way that takes advantage of this - the number of sets must be limited so that the index and offset bits stay within the bits used for the page offset.
54 | 
55 |
56 | ## Real VIPT Caches
57 | \\(\text{Cache Size} \leq \text{Associativity} * \text{Page Size}\\)
58 | - Pentium 4:
59 | - 4-way SA x 4kB \\(\Rightarrow\\) L1 is 16kB
60 | - Core 2, Nehalem, Sandy Bridge, Haswell:
61 | - 8-way SA x 4kB \\(\Rightarrow\\) L1 is 32kB
62 | - Skylake (rumored):
63 | - 16-way SA x ?? \\(\Rightarrow\\) L1 is 64kB
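
A quick way to sanity-check a design against the constraint - the index and block-offset bits must fit inside the page offset, which is the same as saying \\(\text{Cache Size} \leq \text{Associativity} * \text{Page Size}\\). A small C sketch (the cache parameters below are just examples):

```c
#include <stdio.h>

/* A VIPT cache behaves like a PIPT cache (no aliasing) when the index and
   block-offset bits all come from the page offset, i.e. when
   cache_size <= associativity * page_size.                               */
static int vipt_alias_free(long cache_size, long associativity, long page_size) {
    return cache_size <= associativity * page_size;
}

int main(void) {
    long page = 4096; /* 4kB pages */
    printf("16kB, 4-way: %s\n", vipt_alias_free(16 * 1024, 4, page) ? "ok" : "aliasing possible");
    printf("32kB, 8-way: %s\n", vipt_alias_free(32 * 1024, 8, page) ? "ok" : "aliasing possible");
    printf("64kB, 8-way: %s\n", vipt_alias_free(64 * 1024, 8, page) ? "ok" : "aliasing possible");
    return 0;
}
```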
64 |
65 | ## Associativity and Hit Time
66 | - High Associativity:
67 | - Fewer conflicts \\(\rightarrow\\) Miss Rate \\(\downarrow\\) (good)
68 | - Larger VIPT Caches \\(\rightarrow\\) Miss Rate \\(\downarrow\\) (good)
69 | - Slower hits (bad)
70 | - Direct Mapped:
71 | - Miss Rate \\(\uparrow\\) (bad)
72 | - Hit Time \\(\downarrow\\) (good)
73 |
74 | Can we cheat on associativity to take the good things from high associativity and also get better hit time from direct mapped?
75 |
76 | ## Way Prediction
77 | - Set-Associative Cache (miss rate \\(\downarrow\\))
78 | - Guess which line in the set is the most likely to hit (hit time \\(\downarrow\\))
79 | - If no hit there, normal set-associative check
80 |
81 | ### Way Prediction Performance
82 | | | 32kB, 8-way SA | 4kB Direct Mapped | 32kB, 8-way SA with Way Pred |
83 | |--------------|:---:|:---:|:------:|
84 | | Hit Rate | 90% | 70% | 90% |
85 | | Hit Latency | 2 | 1 | 1 or 2 |
86 | | Miss Penalty | 20 | 20 | 20 |
87 | | AMAT | 4<br>(\\(2+0.1\*20\\)) | 7<br>(\\(1+0.3\*20\\)) | 3.3 (1.3+2)<br>(\\(0.7\*1+0.3\*2+0.1\*20\\)) |
88 |
89 | ## Replacement Policy and Hit Time
90 | - Random
91 | - Nothing to update on cache hit (good)
92 | - Miss Rate \\(\uparrow\\) (bad)
93 | - LRU
94 | - Miss Rate \\(\downarrow\\) (good)
95 | - Update lots of counters on hit (bad)
96 | - We want benefit of LRU miss rate, but we need less activity on a cache hit
97 |
98 | ## NMRU Replacement
99 | (Not Most Recently Used)
100 | - Track which block in set is MRU (Most Recently Used)
101 | - On Replacement, pick a non-MRU block
102 |
103 | N-Way Set-Associative Tracking of MRU
104 | - 1 MRU pointer/set (instead of N LRU counters)
105 |
106 | This is slightly weaker than LRU since we only track which block is the MRU, not the full ordering of the remaining blocks - but we get some of the benefit of LRU without all the counter overhead.
107 |
108 | ## PLRU (Pseudo-LRU)
109 | - Keep one bit per line in set, init to 0
110 | - Every time a line is accessed, set the bit to 1
111 | - If we need to replace something, choose a block with bit set to 0
112 | - Setting a bit would eventually leave every bit at 1 - when that is about to happen, we set that line's bit and zero out all of the other bits.
113 | - Right after this reset, only the MRU line has its bit set, so this provides at least the benefit of NMRU
114 |
115 | Thus, at any point in time, PLRU is at least as good as NMRU, but still not quite as good as true LRU. But, we still get the benefit of not having to perform a lot of activities - just simple bit operations.
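
A minimal sketch of the PLRU bookkeeping for a single set (the 8-way size is just an example):

```c
#include <stdio.h>
#include <string.h>

#define WAYS 8

/* One PLRU bit per line in the set; all bits start at 0. */
static unsigned char bit[WAYS];

/* On every access (hit or fill), mark the line as recently used.
   If that would leave every bit set, zero out all the other bits first. */
static void plru_touch(int way) {
    int others_all_set = 1;
    for (int i = 0; i < WAYS; i++)
        if (i != way && bit[i] == 0) { others_all_set = 0; break; }
    if (others_all_set)
        memset(bit, 0, sizeof bit);
    bit[way] = 1;
}

/* On a replacement, pick any line whose bit is still 0. */
static int plru_victim(void) {
    for (int i = 0; i < WAYS; i++)
        if (bit[i] == 0)
            return i;
    return 0; /* unreachable: plru_touch never leaves all bits set */
}

int main(void) {
    plru_touch(3);
    plru_touch(5);
    printf("victim: way %d\n", plru_victim()); /* way 0 - ways 3 and 5 are protected */
    return 0;
}
```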
116 |
117 | ## Reducing the Miss Rate
118 | What are the causes of misses?
119 |
120 | Three Cs:
121 | - Compulsory Misses
122 | - First time this block is accessed
123 | - Would still be a miss in an infinite cache
124 | - Capacity Misses
125 | - Block evicted because of limited cache size
126 | - Would be a miss even in a fully-associative cache of that size
127 | - Conflict Misses
128 | - Block evicted because of limited associativity
129 | - Would not be a miss in fully-associative cache
130 |
131 | ## Larger Cache Blocks
132 | - More words brought in on a miss
133 | - Miss Rate \\(\downarrow\\) when spatial locality is good (good)
134 | - Miss Rate \\(\uparrow\\) when spatial locality is poor (bad)
135 | - As block size increases, miss rate decreases for a while, and then begins to increase again.
136 |     - The local minimum is the optimal block size (64B on a 4kB cache)
137 |     - For larger caches, the minimum occurs at a much larger block size (e.g. 256B for a 256kB cache)
138 |
139 | ## Prefetching
140 | - Guess which blocks will be accessed soon
141 | - Bring them into cache ahead of time
142 | - This moves the memory access time to before the actual load
143 | - Good guess - eliminate a miss (good)
144 | - Bad guess - "Cache Pollution", and we get a cache miss (bad)
145 | - We might have also used a cache spot that could have been used for something else, causing an additional cache miss.
146 |
147 | ### Prefetch Instructions
148 | We can add an instruction that lets the programmer request explicit prefetches based on their knowledge of the program. The example below shows that this isn't necessarily easy - choosing the right prefetch "distance" is difficult. If it is too small, the prefetch arrives too late and we get no benefit. If it is too large, we prefetch too early and the data may already be evicted by the time we need it. An additional complication is that you must know something about the hardware cache to set this value appropriately, which makes the code not very portable.
149 | 
150 |
151 | ### Hardware Prefetching
152 | - No change to program
153 | - HW tries to guess what will be accessed soon
154 | - Examples:
155 | - Stream Buffer (sequential - fetch next block)
156 |     - Stride Prefetcher (detects accesses that are a fixed distance apart and keeps fetching ahead at that stride - see the sketch after this list)
157 | - Correlating Prefetcher (keeps a history to correlate block access based on previous access)
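
A rough sketch of the stride-detection idea for a single load instruction (a real prefetcher keeps a small table of these, indexed by the load's PC; the addresses below are arbitrary):

```c
#include <stdio.h>

/* Track the last address and stride seen by one load instruction; once the
   same stride is seen twice in a row, prefetch the next address at that stride. */
static long last_addr = -1;
static long last_stride = 0;

static void on_access(long addr) {
    long stride = (last_addr >= 0) ? addr - last_addr : 0;
    if (stride != 0 && stride == last_stride)
        printf("prefetch address %ld\n", addr + stride); /* confident: stride repeated */
    last_stride = stride;
    last_addr = addr;
}

int main(void) {
    /* e.g. a loop that reads every 64 bytes */
    on_access(0);
    on_access(64);
    on_access(128);  /* stride 64 confirmed -> prefetch 192 */
    on_access(192);  /* prefetch 256 */
    return 0;
}
```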
158 |
159 | ## Loop Interchange
160 | Swaps the order of nested loops to fix poor cache access patterns. In the example below, the inner loop was touching only one element per cache block. A good compiler should interchange these loops so that cache accesses are as sequential as possible. This is not always possible - the compiler has to prove the program is still correct after swapping the loops, which it does by showing there are no dependencies between iterations.
161 | 
162 |
163 | ## Overlap Misses
164 | Reduces Miss Penalty. A good out-of-order processor will continue doing whatever work it can after a cache miss, but eventually it runs out of things to do while waiting on the load. Some caches are blocking, meaning that while a miss is being serviced the cache cannot accept any other requests. This creates a lot of inactive time. Other caches are non-blocking: the processor uses Memory-Level Parallelism to overlap multiple misses, minimizing the inactive time and keeping the processor working.
165 | 
166 |
167 | ### Miss-Under-Miss Support in Caches
168 | - Miss Status Handling Registers (MSHRs)
169 | - Info about ongoing misses
170 | - Check MSHRs to see if any match
171 | - No Match (Miss) \\(\Rightarrow\\) Allocate an MSHR, remember which instruction to wake up
172 | - Match (Half-Miss) \\(\Rightarrow\\) Add instruction to MSHR
173 | - When data comes back, wake up all instructions waiting on this data, and release MSHR
174 | - How many MSHRs do we want?
175 |         - 2 is good, 4 is even better, and there are still benefits from even larger numbers like 16-32.
176 |
177 | ## Cache Hierarchies
178 | Reduces Miss Penalty. Multi-Level Caches:
179 | - Miss in L1 cache goes to L2 cache
180 | - L1 miss penalty \\(\neq\\) memory latency
181 | - L1 miss penalty = L2 Hit Time + (L2 Miss Rate)*(L2 Miss Penalty)
182 | - Can have L3, L4, etc.
183 |
184 | ### AMAT With Cache Hierarchies
185 | \\(\text{AMAT} = \text{L1 hit time} + \text{L1 miss rate} * \text{L1 miss penalty} \\)
186 |
187 | \\(\text{L1 Miss Penalty} = \text{L2 hit time} + \text{L2 miss rate} * \text{L2 miss penalty} \\)
188 |
189 | \\(\text{L2 Miss Penalty} = \text{L3 hit time} + \text{L3 miss rate} * \text{L3 miss penalty} \\)
190 |
191 | ... etc, until:
192 |
193 | \\(\text{LN Miss Penalty} = \text{Main Memory Latency}\\) (Last Level Cache - LLC)
194 |
195 | ### Multilevel Cache Performance
196 |
197 | |          | 16kB | 128kB | No cache | L1 = 16kB<br>L2 = 128kB |
198 | |----------|:----:|:-----:|:--------:|:--------------------------:|
199 | | Hit Time | 2 | 10 | 100 | 2 for L1<br>12 for L2 |
200 | | Hit Rate | 90% | 97.5% | 100% | 90% for L1<br>75% for L2 |
201 | | AMAT | 12 | 12.5 | 100 | 5.5 |
202 | | (calc) | \\(2+0.1\*100\\) | \\(10+0.025\*100\\) | \\(100+0\*100\\) | \\(2+0.1\*(10+0.25\*100)\\) |
203 |
204 | Combining caches like this provides much better overall performance (AMAT of 5.5) than either cache achieves on its own.
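
A small C sketch that reproduces the last column by applying the AMAT formula level by level (it uses the 10-cycle L2 hit time from the (calc) row and a 100-cycle memory latency):

```c
#include <stdio.h>

static double amat(double hit_time, double miss_rate, double miss_penalty) {
    return hit_time + miss_rate * miss_penalty;
}

int main(void) {
    double memory_latency  = 100.0;
    /* L1 miss penalty = L2 hit time + L2 local miss rate * memory latency */
    double l1_miss_penalty = amat(10.0, 0.25, memory_latency);        /* 35.0 */
    /* Overall AMAT = L1 hit time + L1 miss rate * L1 miss penalty */
    printf("AMAT = %.1f cycles\n", amat(2.0, 0.10, l1_miss_penalty)); /* 5.5  */
    return 0;
}
```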
205 |
206 | ### Hit Rate in L2, L3, etc.
207 | When the 128kB cache is used alone it has a 97.5% hit rate, but when used as an L2 behind the L1 it only shows a 75% hit rate - so it looks like it has become a worse cache, which is misleading.
208 |
209 | 75% is the "local hit rate" - the hit rate the cache actually observes. In this case, 90% of accesses hit in L1, so the L2 never saw them (and most of those would have been L2 hits).
210 |
211 | 97.5% is the "global hit rate", which is the overall hit rate for any access to the cache.
212 |
213 | ### Global vs. Local Hit Rate
214 | - Global Hit Rate: 1 - Global Miss Rate
215 | - Global Miss Rate: \\(\frac{\text{# of misses in this cache}}{\text{# of all memory references}}\\)
216 | - Local Hit Rate: \\(\frac{\text{# of hits}}{\text{# of accesses to this cache}}\\)
217 | - Misses per 1000 instructions (MPKI)
218 | - Similar to Miss Rate, but instead of being based on memory references, it normalizes based on number of instructions
219 |
220 | ### Inclusion Property
221 | - Block is in L1 Cache
222 |     - May or may not be in L2
223 | - Has to also be in L2 (Inclusion)
224 | - Cannot also be in L2 (Exclusion)
225 |
226 | When L1 has a hit, the LRU counters in L2 are not updated - so over time, L2 may decide to replace blocks that are still frequently accessed in L1. Inclusion is therefore not guaranteed on its own. One way to enforce it is an "inclusion bit" on each L2 block, set while the block is also in L1, which keeps the L2 block from being replaced.
227 |
228 | [🎥 See Lecture Example (4:44)](https://www.youtube.com/watch?v=J8DQG9Pvp3U)
229 |
230 | Helpful explanation in lecture notes about why Inclusion is desirable:
231 |
232 | > Student Question: What is the point of inclusion in a multi-level cache? Why would the effort/cost be spent to try and enforce an inclusion property?
233 | >
234 | > I would have guessed that EXCLUSION would be a better thing to work towards, get more data into some part of the cache. I just don't get why you would want duplicate data taking up valuable cache space, no matter what level.
235 | >
236 | > Instructor Answer: Inclusion makes several things simpler. When doing a write-back from L1, for example, inclusion ensures that the write-back is a L2 hit. Why is this useful? Well, it limits how much buffering we need and how complicated things will be. If the L1 cache is write-through, inclusion ensures that a write that is a L1 hit will actually happen in L2 (not be an L2 miss). And for coherence with private L1 and L2 caches (a la Intel's i3/i5/i7), inclusion allows the L2 cache to filter requests from other processors. With inclusion, if the request from another processor does not match anything in our L2, we know we don't have that block. Without inclusion, even if the block does not match in L2, we still need to probe in the L1 because it might be there.
237 |
238 | *[AMAT]: Average Memory Access Time
239 | *[LRU]: Least-Recently Used
240 | *[MPKI]: Misses Per 1000 Instructions
241 | *[MSHR]: Miss Status Handling Register
242 | *[MSHRs]: Miss Status Handling Registers
243 | *[NMRU]: Not-Most-Recently Used
244 | *[PIPT]: Physically Indexed Physically Tagged
245 | *[PLRU]: Pseudo-Least Recently Used
246 | *[RWX]: Read, Write, Execute
247 | *[SA]: Set-Associative
248 | *[TLB]: Translation Look-Aside Buffer
249 | *[VIPT]: Virtually Indexed Physically Tagged
250 | *[VIVT]: Virtually Indexed Virtually Tagged
--------------------------------------------------------------------------------
/assets/12_function-call-inlining.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/drharris/cs6290-notes/6c7ff7aae95a066ecf9eb626f2adfca28f0cf25a/assets/12_function-call-inlining.png
--------------------------------------------------------------------------------
/assets/12_scheduling-if-conversion.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/drharris/cs6290-notes/6c7ff7aae95a066ecf9eb626f2adfca28f0cf25a/assets/12_scheduling-if-conversion.png
--------------------------------------------------------------------------------
/assets/15_cache-organization.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/drharris/cs6290-notes/6c7ff7aae95a066ecf9eb626f2adfca28f0cf25a/assets/15_cache-organization.png
--------------------------------------------------------------------------------
/assets/15_direct-mapped-cache.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/drharris/cs6290-notes/6c7ff7aae95a066ecf9eb626f2adfca28f0cf25a/assets/15_direct-mapped-cache.png
--------------------------------------------------------------------------------
/assets/15_offset-index-tag-setassociative.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/drharris/cs6290-notes/6c7ff7aae95a066ecf9eb626f2adfca28f0cf25a/assets/15_offset-index-tag-setassociative.png
--------------------------------------------------------------------------------
/assets/16_mapping-virtual-physical-memory.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/drharris/cs6290-notes/6c7ff7aae95a066ecf9eb626f2adfca28f0cf25a/assets/16_mapping-virtual-physical-memory.png
--------------------------------------------------------------------------------
/assets/16_multilevel-page-table-structure.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/drharris/cs6290-notes/6c7ff7aae95a066ecf9eb626f2adfca28f0cf25a/assets/16_multilevel-page-table-structure.png
--------------------------------------------------------------------------------
/assets/16_multilevel-page-tables.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/drharris/cs6290-notes/6c7ff7aae95a066ecf9eb626f2adfca28f0cf25a/assets/16_multilevel-page-tables.png
--------------------------------------------------------------------------------
/assets/16_program-view-of-memory.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/drharris/cs6290-notes/6c7ff7aae95a066ecf9eb626f2adfca28f0cf25a/assets/16_program-view-of-memory.png
--------------------------------------------------------------------------------
/assets/16_two-level-page-table-example.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/drharris/cs6290-notes/6c7ff7aae95a066ecf9eb626f2adfca28f0cf25a/assets/16_two-level-page-table-example.png
--------------------------------------------------------------------------------
/assets/16_why-virtual-memory.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/drharris/cs6290-notes/6c7ff7aae95a066ecf9eb626f2adfca28f0cf25a/assets/16_why-virtual-memory.png
--------------------------------------------------------------------------------
/assets/17_aliasing-in-virtually-accessed-caches.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/drharris/cs6290-notes/6c7ff7aae95a066ecf9eb626f2adfca28f0cf25a/assets/17_aliasing-in-virtually-accessed-caches.png
--------------------------------------------------------------------------------
/assets/17_loop-interchange.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/drharris/cs6290-notes/6c7ff7aae95a066ecf9eb626f2adfca28f0cf25a/assets/17_loop-interchange.png
--------------------------------------------------------------------------------
/assets/17_overlap-misses.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/drharris/cs6290-notes/6c7ff7aae95a066ecf9eb626f2adfca28f0cf25a/assets/17_overlap-misses.png
--------------------------------------------------------------------------------
/assets/17_prefetch-instructions.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/drharris/cs6290-notes/6c7ff7aae95a066ecf9eb626f2adfca28f0cf25a/assets/17_prefetch-instructions.png
--------------------------------------------------------------------------------
/assets/17_tlb-cache-hit.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/drharris/cs6290-notes/6c7ff7aae95a066ecf9eb626f2adfca28f0cf25a/assets/17_tlb-cache-hit.png
--------------------------------------------------------------------------------
/assets/17_vipt-cache-aliasing.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/drharris/cs6290-notes/6c7ff7aae95a066ecf9eb626f2adfca28f0cf25a/assets/17_vipt-cache-aliasing.png
--------------------------------------------------------------------------------
/assets/17_virtually-indexed-physically-tagged.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/drharris/cs6290-notes/6c7ff7aae95a066ecf9eb626f2adfca28f0cf25a/assets/17_virtually-indexed-physically-tagged.png
--------------------------------------------------------------------------------
/assets/19_connecting-dram.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/drharris/cs6290-notes/6c7ff7aae95a066ecf9eb626f2adfca28f0cf25a/assets/19_connecting-dram.png
--------------------------------------------------------------------------------
/assets/19_memory-chip-organization.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/drharris/cs6290-notes/6c7ff7aae95a066ecf9eb626f2adfca28f0cf25a/assets/19_memory-chip-organization.png
--------------------------------------------------------------------------------
/assets/1bh2bc.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/drharris/cs6290-notes/6c7ff7aae95a066ecf9eb626f2adfca28f0cf25a/assets/1bh2bc.png
--------------------------------------------------------------------------------
/assets/1bit-statemachine.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/drharris/cs6290-notes/6c7ff7aae95a066ecf9eb626f2adfca28f0cf25a/assets/1bit-statemachine.png
--------------------------------------------------------------------------------
/assets/20_connecting-io-devices.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/drharris/cs6290-notes/6c7ff7aae95a066ecf9eb626f2adfca28f0cf25a/assets/20_connecting-io-devices.png
--------------------------------------------------------------------------------
/assets/20_magnetic-disks.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/drharris/cs6290-notes/6c7ff7aae95a066ecf9eb626f2adfca28f0cf25a/assets/20_magnetic-disks.png
--------------------------------------------------------------------------------
/assets/22_centralized-shared-memory.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/drharris/cs6290-notes/6c7ff7aae95a066ecf9eb626f2adfca28f0cf25a/assets/22_centralized-shared-memory.png
--------------------------------------------------------------------------------
/assets/22_distributed-memory.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/drharris/cs6290-notes/6c7ff7aae95a066ecf9eb626f2adfca28f0cf25a/assets/22_distributed-memory.png
--------------------------------------------------------------------------------
/assets/22_smt-cache-tlb.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/drharris/cs6290-notes/6c7ff7aae95a066ecf9eb626f2adfca28f0cf25a/assets/22_smt-cache-tlb.png
--------------------------------------------------------------------------------
/assets/22_smt-hardware-changes.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/drharris/cs6290-notes/6c7ff7aae95a066ecf9eb626f2adfca28f0cf25a/assets/22_smt-hardware-changes.png
--------------------------------------------------------------------------------
/assets/24_cache-coherence-problem.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/drharris/cs6290-notes/6c7ff7aae95a066ecf9eb626f2adfca28f0cf25a/assets/24_cache-coherence-problem.png
--------------------------------------------------------------------------------
/assets/24_directory-entry.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/drharris/cs6290-notes/6c7ff7aae95a066ecf9eb626f2adfca28f0cf25a/assets/24_directory-entry.png
--------------------------------------------------------------------------------
/assets/24_msi-coherence.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/drharris/cs6290-notes/6c7ff7aae95a066ecf9eb626f2adfca28f0cf25a/assets/24_msi-coherence.png
--------------------------------------------------------------------------------
/assets/24_write-update-snooping-coherence.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/drharris/cs6290-notes/6c7ff7aae95a066ecf9eb626f2adfca28f0cf25a/assets/24_write-update-snooping-coherence.png
--------------------------------------------------------------------------------
/assets/25_how-is-llsc-atomic.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/drharris/cs6290-notes/6c7ff7aae95a066ecf9eb626f2adfca28f0cf25a/assets/25_how-is-llsc-atomic.png
--------------------------------------------------------------------------------
/assets/25_synchronization-example.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/drharris/cs6290-notes/6c7ff7aae95a066ecf9eb626f2adfca28f0cf25a/assets/25_synchronization-example.png
--------------------------------------------------------------------------------
/assets/26_data-races-and-consistency.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/drharris/cs6290-notes/6c7ff7aae95a066ecf9eb626f2adfca28f0cf25a/assets/26_data-races-and-consistency.png
--------------------------------------------------------------------------------
/assets/27_multi-core-power-and-performance.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/drharris/cs6290-notes/6c7ff7aae95a066ecf9eb626f2adfca28f0cf25a/assets/27_multi-core-power-and-performance.png
--------------------------------------------------------------------------------
/assets/2bit-history-predictor.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/drharris/cs6290-notes/6c7ff7aae95a066ecf9eb626f2adfca28f0cf25a/assets/2bit-history-predictor.png
--------------------------------------------------------------------------------
/assets/2bit-statemachine.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/drharris/cs6290-notes/6c7ff7aae95a066ecf9eb626f2adfca28f0cf25a/assets/2bit-statemachine.png
--------------------------------------------------------------------------------
/assets/7_all-inst-same-cycle.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/drharris/cs6290-notes/6c7ff7aae95a066ecf9eb626f2adfca28f0cf25a/assets/7_all-inst-same-cycle.png
--------------------------------------------------------------------------------
/assets/7_duplicating-register-values.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/drharris/cs6290-notes/6c7ff7aae95a066ecf9eb626f2adfca28f0cf25a/assets/7_duplicating-register-values.png
--------------------------------------------------------------------------------
/assets/7_false-dependencies-after-renaming.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/drharris/cs6290-notes/6c7ff7aae95a066ecf9eb626f2adfca28f0cf25a/assets/7_false-dependencies-after-renaming.png
--------------------------------------------------------------------------------
/assets/7_ilp-example.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/drharris/cs6290-notes/6c7ff7aae95a066ecf9eb626f2adfca28f0cf25a/assets/7_ilp-example.png
--------------------------------------------------------------------------------
/assets/7_ilp-ipc-discussion.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/drharris/cs6290-notes/6c7ff7aae95a066ecf9eb626f2adfca28f0cf25a/assets/7_ilp-ipc-discussion.png
--------------------------------------------------------------------------------
/assets/7_ilp-vs-ipc.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/drharris/cs6290-notes/6c7ff7aae95a066ecf9eb626f2adfca28f0cf25a/assets/7_ilp-vs-ipc.png
--------------------------------------------------------------------------------
/assets/7_rat-example.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/drharris/cs6290-notes/6c7ff7aae95a066ecf9eb626f2adfca28f0cf25a/assets/7_rat-example.png
--------------------------------------------------------------------------------
/assets/7_waw-dependencies.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/drharris/cs6290-notes/6c7ff7aae95a066ecf9eb626f2adfca28f0cf25a/assets/7_waw-dependencies.png
--------------------------------------------------------------------------------
/assets/8_dispatch-gt1-ready.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/drharris/cs6290-notes/6c7ff7aae95a066ecf9eb626f2adfca28f0cf25a/assets/8_dispatch-gt1-ready.png
--------------------------------------------------------------------------------
/assets/8_issue-example.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/drharris/cs6290-notes/6c7ff7aae95a066ecf9eb626f2adfca28f0cf25a/assets/8_issue-example.png
--------------------------------------------------------------------------------
/assets/8_tomasulo-review-1.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/drharris/cs6290-notes/6c7ff7aae95a066ecf9eb626f2adfca28f0cf25a/assets/8_tomasulo-review-1.png
--------------------------------------------------------------------------------
/assets/8_tomasulos-algorithm-the-picture.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/drharris/cs6290-notes/6c7ff7aae95a066ecf9eb626f2adfca28f0cf25a/assets/8_tomasulos-algorithm-the-picture.png
--------------------------------------------------------------------------------
/assets/8_write-result.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/drharris/cs6290-notes/6c7ff7aae95a066ecf9eb626f2adfca28f0cf25a/assets/8_write-result.png
--------------------------------------------------------------------------------
/assets/9_reorder-buffer.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/drharris/cs6290-notes/6c7ff7aae95a066ecf9eb626f2adfca28f0cf25a/assets/9_reorder-buffer.png
--------------------------------------------------------------------------------
/assets/branch-target-buffer.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/drharris/cs6290-notes/6c7ff7aae95a066ecf9eb626f2adfca28f0cf25a/assets/branch-target-buffer.png
--------------------------------------------------------------------------------
/assets/diminishing-returns.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/drharris/cs6290-notes/6c7ff7aae95a066ecf9eb626f2adfca28f0cf25a/assets/diminishing-returns.png
--------------------------------------------------------------------------------
/assets/fabrication-yield.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/drharris/cs6290-notes/6c7ff7aae95a066ecf9eb626f2adfca28f0cf25a/assets/fabrication-yield.png
--------------------------------------------------------------------------------
/assets/full-predication-example.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/drharris/cs6290-notes/6c7ff7aae95a066ecf9eb626f2adfca28f0cf25a/assets/full-predication-example.png
--------------------------------------------------------------------------------
/assets/hierarchical-predictor.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/drharris/cs6290-notes/6c7ff7aae95a066ecf9eb626f2adfca28f0cf25a/assets/hierarchical-predictor.png
--------------------------------------------------------------------------------
/assets/history-shared-counters.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/drharris/cs6290-notes/6c7ff7aae95a066ecf9eb626f2adfca28f0cf25a/assets/history-shared-counters.png
--------------------------------------------------------------------------------
/assets/historypredictor.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/drharris/cs6290-notes/6c7ff7aae95a066ecf9eb626f2adfca28f0cf25a/assets/historypredictor.png
--------------------------------------------------------------------------------
/assets/memory-wall.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/drharris/cs6290-notes/6c7ff7aae95a066ecf9eb626f2adfca28f0cf25a/assets/memory-wall.png
--------------------------------------------------------------------------------
/assets/movz-movn-performance.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/drharris/cs6290-notes/6c7ff7aae95a066ecf9eb626f2adfca28f0cf25a/assets/movz-movn-performance.png
--------------------------------------------------------------------------------
/assets/static-power.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/drharris/cs6290-notes/6c7ff7aae95a066ecf9eb626f2adfca28f0cf25a/assets/static-power.png
--------------------------------------------------------------------------------
/assets/tournament-predictors.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/drharris/cs6290-notes/6c7ff7aae95a066ecf9eb626f2adfca28f0cf25a/assets/tournament-predictors.png
--------------------------------------------------------------------------------
/assets/weaktakenflipflop.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/drharris/cs6290-notes/6c7ff7aae95a066ecf9eb626f2adfca28f0cf25a/assets/weaktakenflipflop.png
--------------------------------------------------------------------------------
/branches.md:
--------------------------------------------------------------------------------
1 | ---
2 | id: branches
3 | title: Branches
4 | sidebar_label: Branches
5 | ---
6 |
7 | [🔗Lecture on Udacity (2.5 hr)](https://classroom.udacity.com/courses/ud007/lessons/3618489075/concepts/last-viewed)
8 |
9 | ## Branch in a Pipeline
10 |
11 | ```mipsasm
12 | BEQ R1, R2, Label
13 | ```
14 |
15 | A branch instruction like this compares R1 and R2, and if they are equal jumps to `Label` (usually by encoding, in the immediate field of the instruction, the difference between the PC of the next instruction and the PC of the labeled instruction, so the hardware can simply add this offset when the branch is taken).
16 |
17 | In a 5-stage pipeline, the compare happens in the third (ALU) stage. Meanwhile, two more instructions have already moved into the Fetch and Read stages. If we happened to fetch the correct instructions after the branch, they continue through the pipeline with no penalty. Otherwise we must flush them and take a 2-instruction penalty.
18 |
19 | Thus, it never pays to simply not fetch something as you always take the penalty, and it is better to take a penalty only sometimes than all the time. Another important note is that when fetching the instruction after `BEQ`, we know nothing about the branch itself yet except its address (we don't even know it's a branch at all yet), but we already must make a prediction of whether it's a taken branch or not.
20 |
21 | ## Branch Prediction Requirements
22 |
23 | - Using only the knowledge of the instruction's PC
24 | - Guess PC of next instruction to fetch
25 | - Must correctly guess:
26 | - is this a branch?
27 | - is it taken?
28 | - if taken, what is the target PC?
29 |
30 | The first two guesses can be combined into "is this a taken branch?"
31 |
32 | ### Branch Prediction Accuracy
33 |
34 | $$ CPI = 1 + \frac{mispred}{inst} * \frac{penalty}{mispred} $$
35 |
36 | The \\( \frac{mispred}{inst} \\) part is determined by the predictor accuracy. The \\(\frac{penalty}{mispred} \\) part is determined by the pipeline (where in the pipeline we figure out the misprediction).
37 |
38 | Assumption below: 20% of all instructions are branches (common in programs).
39 |
40 | | Accuracy \\(\downarrow\\) | Resolve in 3rd stage | Resolve in 10th stage |
41 | |---|:---:|:---:|
42 | | 50% for BR<br>100% all other | \\(1 + 0.5\*0.2\*2\\)<br>\\(= 1.2\\) | \\(1 + 0.5\*0.2\*9\\)<br>\\(= 1.9\\) |
43 | | 90% of BR<br>100% all other | \\(1 + 0.1\*0.2\*2\\)<br>\\(= 1.04\\) | \\(1 + 0.1\*0.2\*9\\)<br>\\(= 1.18\\) |
44 | | _(Speedup)_ | _1.15_ | _1.61_ |
45 |
46 | Conclusions: A better branch predictor will help regardless of the pipeline, but the _amount_ of help changes with the pipeline depth.
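
A minimal C sketch that reproduces the table above (20% of instructions are branches; the penalty is 2 or 9 cycles depending on where the branch resolves):

```c
#include <stdio.h>

/* CPI = 1 + (mispredictions/instruction) * (penalty/misprediction) */
static double cpi(double branch_frac, double mispredict_rate, double penalty) {
    return 1.0 + branch_frac * mispredict_rate * penalty;
}

int main(void) {
    double penalties[2] = { 2.0, 9.0 };   /* resolve in 3rd stage vs. 10th stage */
    for (int p = 0; p < 2; p++) {
        double poor = cpi(0.20, 0.50, penalties[p]); /* 50%-accurate predictor */
        double good = cpi(0.20, 0.10, penalties[p]); /* 90%-accurate predictor */
        printf("penalty %.0f: CPI %.2f -> %.2f, speedup %.2f\n",
               penalties[p], poor, good, poor / good);
    }
    return 0;
}
```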
47 |
48 | ### Performance with Not-Taken Prediction
49 |
50 | |   | Refuse to Predict<br>(stall until sure) | Predict Not-Taken<br>(always increment PC) |
51 | |---|---|---|
52 | | Branch | 3 cycles | 1 or 3 cycles |
53 | | Non-Branch | 2 Cycles | 1 cycle |
54 |
55 | Thus, Predict Not-Taken always wins over refusing to predict
56 |
57 | ### Predict Not-Taken
58 |
59 | Operation: Simply increment PC (no special hardware or memory, since we have to do this anyway)
60 |
61 | Accuracy:
62 | * 20% of instructions are branches
63 | * 60% of branches are taken
64 | * \\(\Rightarrow\\) Correctness: 80% (non-branches) + 8% (non-taken branches)
65 | * \\(\Rightarrow\\) Incorrect 12% of time
66 | * CPI = 1 + 0.12*penalty
67 |
68 | ## Why We Need Better Prediction?
69 | |   | Not Taken<br>88% | Better<br>99% | Speedup |
70 | |---|:---:|:---:|:---:|
71 | | 5-stages<br>(3rd stage) | \\(1 + 0.12\*2\\)<br>\\(CPI = 1.24\\) | \\(1 + 0.01\*2\\)<br>\\(CPI = 1.02\\) | \\(1.22\\) |
72 | | 14-stages<br>(11th stage) | \\(1 + 0.12\*10\\)<br>\\(CPI = 2.2\\) | \\(1 + 0.01\*10\\)<br>\\(CPI = 1.1\\) | \\(2\\) |
73 | | 11th stage<br>(4 inst/cycle) | \\(0.25 + 0.12\*10\\)<br>\\(CPI = 1.45\\) | \\(0.25 + 0.01\*10\\)<br>\\(CPI = 0.35\\) | \\(4.14\\) |
74 |
75 | If we have a deeper pipeline or are able to execute more instructions per cycle, the better predictor is more important than in simpler processors, because the cost of misprediction is much higher (more instructions lost with a misprediction).
76 |
77 | ### Better Prediction - How?
78 |
79 | Predictor must compute \\(PC_{next}\\) based only on knowledge of \\(PC_{now}\\). This is not much information to decide on. It would help if we knew:
80 | * Is it a branch?
81 | * Will it be taken?
82 | * What is the offset field of the instruction?
83 |
84 | But we don't know any of these because we're still fetching the instruction. We do, however, know the history of how \\(PC_{now}\\) has behaved in the past. So, we can go from: \\(PC_{next} = f(PC_{now})\\) to:
85 |
86 | $$ PC_{next} = f(PC_{now}, history[PC_{now}]) $$
87 |
88 | ## BTB - Branch Target Buffer
89 |
90 | The predictor takes \\(PC_{now}\\) and uses it to index into a table called the BTB, whose output is our best guess at the next PC. Later, when the branch executes, we know the actual \\(PC_{next}\\) and can compare it with the predicted one. If they don't match, it is handled as a misprediction and the BTB is updated.
91 |
92 | 
93 |
94 | One problem: How big does the BTB need to be? We want it to have single-cycle latency (small). However, each entry needs to contain an entire 64-bit address, and we need one entry for each PC we might fetch. Thus, the BTB would need to be as large as the program itself. How do we make it smaller?
95 |
96 | ### Realistic BTB
97 |
98 | First, we don't need an entry for every possible PC. It's enough to have entries for the PCs likely to execute soon. For example, in a loop of 100 instructions, a BTB with about 100 entries is enough.
99 |
100 | Perhaps we find through testing that a 1024-entry BTB can still be accessed in one cycle. How do we map 64-bit PCs to this 1024-entry table? Keep in mind that any delay in computing the mapping from PC to BTB index would force us to shrink the BTB further to compensate.
101 |
102 | The way we do this is by simply taking the last 10 bits of the PC (in the 1024-entry case). While this means distant instructions will eventually overwrite the BTB entries used by the current instructions, it ensures that instructions located near each other map to different BTB entries while they execute. This particularly helps predict branch behavior in loops and smaller programs.
103 |
104 | If instructions are word-aligned, the last two bits of the PC are always `0b00`. Therefore we can ignore them and index using bits 11-2 in the 1024-entry case.
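
In code, the index computation for a hypothetical 1024-entry BTB with word-aligned instructions is just a shift and a mask:

```c
#include <stdint.h>
#include <stdio.h>

#define BTB_ENTRIES 1024 /* must be a power of two */

/* Drop the always-zero bits [1:0], then keep the next 10 bits (bits 11-2 of the PC). */
static uint32_t btb_index(uint64_t pc) {
    return (uint32_t)(pc >> 2) & (BTB_ENTRIES - 1);
}

int main(void) {
    /* Adjacent instructions map to adjacent entries; PCs 4kB apart collide. */
    printf("%u %u %u\n", btb_index(0x400000), btb_index(0x400004), btb_index(0x401000));
    return 0;
}
```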
105 |
106 | ## Direction Predictor
107 | The BHT is used like the BTB, but the entry is a single bit that tells us whether a PC is:
108 | - [0] not a taken branch (PC++)
109 | - [1] a taken branch (use BTB)
110 |
111 | The entries are accessed by the least significant bits of the PC (e.g. bits 11-2 in the 1024-entry case). Once the branch resolves, we update the BHT accordingly. Because each entry holds so little data, the table can have many more entries than the BTB.
112 |
113 | ### Problems with 1-bit Predictor
114 | It works well if an instruction is always taken or always not taken. It also predicts well when taken branches >>> not taken, or not taken >>> taken. Every "switch" (anomaly) between an instruction being taken and not taken causes two mispredictions: one at the anomaly itself, and one more on the next occurrence, because the prediction has now been flipped.
115 |
116 | So, the 1-bit predictor does not do so well if the taken to not-taken ratio is not very large. It also does not perform well on short loops, because this same anomaly occurs every time the loop is executed again (the not-taken branch at the previous loop exit causes a misprediction when the loop is entered the next time).
117 |
118 | ## 2-Bit Predictor (2BP, 2BC)
119 | This predictor fixes the behavior of the 1-bit predictor during the anomaly. The upper bit behaves like the 1-bit predictor, but it adds a hysteresis (or "conviction") bit.
120 |
121 | 
122 |
123 | | Prediction Bit | Hysteresis Bit | Description |
124 | |:---:|:---:|---|
125 | | 0 | 0 | Strong Not-Taken |
126 | | 0 | 1 | Weak Not-Taken |
127 | | 1 | 0 | Weak Taken |
128 | | 1 | 1 | Strong Taken |
129 |
130 | Basically, the upper bit controls the final prediction, but the lower bit allows the predictor to flow towards the opposite prediction. Thus a `0b00` state would require 2 taken branches in a row to change its prediction toward a taken branch. This prevents the case of a single anomaly causing multiple mispredictions, in that the behavior itself must be changing for the prediction to change.
131 |
132 | 
133 |
134 | It can be called a 2-bit counter because it simply increments on taken branches and decrements on not-taken branches, saturating at the two strong states. This makes the predictor easy to implement.
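
A sketch of the saturating counter in C (counter values 0-3 correspond to the four states in the table above):

```c
#include <stdio.h>

/* 2-bit saturating counter: 0,1 predict not-taken; 2,3 predict taken. */
static int predict_taken(unsigned c) {
    return c >= 2;                       /* the upper bit is the prediction */
}

static unsigned update(unsigned c, int taken) {
    if (taken)
        return c < 3 ? c + 1 : 3;        /* saturate at Strong Taken     */
    else
        return c > 0 ? c - 1 : 0;        /* saturate at Strong Not-Taken */
}

int main(void) {
    unsigned c = 0;                      /* initialize to Strong Not-Taken  */
    c = update(c, 1);                    /* one anomalous taken branch...   */
    printf("%d\n", predict_taken(c));    /* ...still predicts not-taken (0) */
    c = update(c, 1);                    /* a second taken branch in a row  */
    printf("%d\n", predict_taken(c));    /* now the prediction flips (1)    */
    return 0;
}
```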
135 |
136 | ### 2-Bit Predictor Initialization
137 | The question is - where do we start with the predictor? If we start in a strong state and are correct, then we have no mispredictions. If we are wrong, it costs two mispredictions before it corrects itself. If we start in a weak state and are right, we also have perfect prediction. However, if we were wrong, it only costs a single misprediction before it corrects itself. This leads us to think it's always best to start in a weak state.
138 |
139 | However, consider the state where a branch flips between Taken and Not-Taken; by starting in a strong state, we mispredict half the time. Starting in a weak state, we _constantly_ mispredict.
140 |
141 | 
142 |
143 | So, while it may seem better to start in one state or another, in reality there is no way to predict what state is best for the incoming program, and indeed there is always some sequence of taken/not-taken that can result in constant misprediction. Therefore, it is best to simply initialize in the easiest state to start with, typically `0b00`.
144 |
145 | ### 1-bit to 2-bit is Good... 3-bit, 4-bit?
146 |
147 | If there is benefit in moving to 2-bit, what about just adding more bits? This really serves to increase the cost (larger BHT to serve all PCs), but is only good when anomalous outcomes happen in streaks. How often does this happen? Sometimes. Maybe 3 bits might be worth it, but likely not 4. Best to stick with 2BP.
148 |
149 | ## History-Based Predictors
150 | These predictors function best with a repeatable pattern (e.g. "N-T-N-T-N-T..." or "N-N-T-N-N-T..."). These are 100% predictable, just not with simple n-bit counters. So, how do we learn the pattern?
151 |
152 | 
153 |
154 | In this case our prediction is done by looking "back" at previous steps. So in the first pattern, we know when the history is N, predict T, and vice-versa. In the second pattern, we look back two steps. So when the history is NN, predict T. When it is NT, predict N. And when it is TN, predict N.
155 |
156 | ### 1-bit History with 2-bit Counters
157 | In this predictor, each BHT entry has a single history bit and two 2-bit counters. The history bit selects which of the two counters to use for the prediction.
158 |
159 | 
160 |
161 | While this type of predictor works great for patterns like these, it still mispredicts 1/3 of the time on an "NNT-NNT-NNT" type pattern: the outcome that follows an N is N half the time and T the other half, so the counter used after an N can be right at most 50% of the time.
162 |
163 | ### 2-bit History Predictor
164 | This predictor works the same way as the 1-bit history predictor, but now we have 2 bits of history used to index into a 2BC[4] array. This perfectly predicts both the (NT)* and (NNT)* pattern types and is a pretty good predictor for other patterns. However, it wastes one 2BC on the (NNT)* case and two 2BCs on the (NT)* case.
165 |
166 | 
167 |
168 | ### N-bit History Predictor
169 |
170 | We can generalize to state that an N-bit history predictor can successfully predict all taken patterns of \\(length \leq N+1\\), but will cost \\(N+2*2^N\\) bits per entry and waste most 2BCs. So, while increasing N will give us the ability to predict longer patterns, we do so at rapidly increasing cost with more waste.
171 |
172 | ## History-Based Predictors with Shared Counters
173 |
174 | Instead of \\(2^N\\) counters per entry, we want to use \\(\approx N\\) counters. The idea is to share 2BCs between entries instead of each entry having its own counter.
175 |
176 | We can do this with a Pattern History Table (PHT). This table simply keeps PC-indexed history bits (N bits per entry), which are combined (XORed) with bits of the PC to index into the BHT, each entry of which is just a single 2BC. Thus it is very possible for two PC/history combinations to share the same BHT entry.
177 |
178 | 
179 |
180 | Thus, small patterns will only use a few counters, leaving many other counters for longer, more complex patterns to use. This can still result in wasted space, but not nearly as much as the exponential increase of the N-bit history predictor. The downside, of course, is that some branches with particular histories may overlap with other branches/histories. But, if the BHT is large enough (and it can be with each entry being a single 2BC), this should happen rarely.
181 |
182 | ### PShare and GShare Predictors
183 | This shared counters predictor is called PShare -> "P"rivate history, "Share"d counters. This is good for even-odd and 8-iteration loops.
184 |
185 | Another option is GShare, or "G"lobal history, "Share"d counters. A single global history register (the outcomes of the most recent branches, whichever branches they were) is shared by all entries, and the PC+History operation (XOR) is used to index into the BHT. This is good for correlated branches - which are very common in programs, since the outcome of one branch is often related to the outcome of a previous branch.
186 |
187 | Which to use? Both! Use GShare for correlated branches, and PShare for single branches with shorter history.
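
A sketch of the GShare indexing and update, assuming (hypothetically) a 4096-entry BHT of 2-bit counters and a 12-bit global history register; PShare would look the same except the history would come from a per-PC entry in the PHT instead of the single global register:

```c
#include <stdint.h>

#define BHT_BITS 12
#define BHT_SIZE (1u << BHT_BITS)

static uint8_t  bht[BHT_SIZE];   /* one 2-bit counter per entry (stored in a byte)          */
static uint32_t global_history;  /* outcomes of the last BHT_BITS branches, newest in bit 0 */

static uint32_t gshare_index(uint64_t pc) {
    /* XOR the PC bits (dropping the aligned low bits) with the global history. */
    return ((uint32_t)(pc >> 2) ^ global_history) & (BHT_SIZE - 1);
}

static int gshare_predict(uint64_t pc) {
    return bht[gshare_index(pc)] >= 2;   /* taken if the counter is in the upper half */
}

static void gshare_update(uint64_t pc, int taken) {
    uint32_t i = gshare_index(pc);       /* same index that was used for the prediction */
    if (taken && bht[i] < 3) bht[i]++;
    if (!taken && bht[i] > 0) bht[i]--;
    global_history = ((global_history << 1) | (uint32_t)taken) & (BHT_SIZE - 1);
}
```

In use, `gshare_predict` would be consulted at fetch and `gshare_update` called once the branch resolves.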
188 |
189 | ## Tournament Predictor
190 | We have two predictors, PShare and GShare, each of which is better at predicting certain types of branches. A meta-predictor (array of 2BCs) is used not as a branch predictor, but rather as a predictor of which other predictor is more likely to yield a correct prediction for the current branch. At each step, you "train" each individual predictor based on the outcome, and you also train the meta-predictor on how well each predictor is doing.
191 |
192 | 
193 |
194 | ## Hierarchical Predictor
195 | Like a tournament predictor, but instead of combining two good predictors, it is using one good and one ok predictor. The idea is that good predictors are expensive, and some branches are very easy to predict. So the "ok" predictor can be used for these branches, and the better predictor can be saved for the branches that are more difficult to predict.
196 |
197 | | | Tournament | Hierarchical |
198 | |---|---|---|
199 | |Predictors|2 good predictors|1 good, 1 ok|
200 | |Updates|Update both for every decision|Update OK-pred on every decision<br>Update Good-pred only if OK-pred not good|
201 | |(Other)|Good predictors are both chosen to be balanced between performance and cost|Can use a combination of predictors of differing quality|
202 |
203 | In this example, the 2BC simple predictor will be used for most branches, but if it is doing a poor job the branch is added to the Local predictor. Similarly it could also go to the Global predictor. Over time the CPU is "trained" on how to handle each branch.
204 | 
205 |
206 | ## Return Address Stack (RAS)
207 | For any branch, we need to know the direction (taken, not taken) and the target. The previous predictors cover direction (BHT) and target (BTB) well for most types of branches (conditional ones like `BEQ` and `BNE`, unconditional ones like `JUMP` and `CALL`, etc.). However, there is one type of branch, the function return, that is always taken (so direction prediction is fine) but whose target is hard to predict, because the function can be called from many different places and so the return address changes from call to call. The BTB is not good at predicting a target that can change each time.
208 |
209 | The RAS is a predictor dedicated to predicting function returns. The idea is that upon each function call, we push the return address (PC+4) on the RAS. When we return, we pop from this stack to get the correct target address.
210 |
211 | Why not simply use the program's stack in memory? The prediction has to be available during fetch, alongside the other predictors, and a separate, very small hardware stack lets the RAS produce the target that quickly.
212 |
213 | What happens if we exceed the size of the RAS? Two choices:
214 | - Don't push
215 | - Wrap around - this is the best approach (the overwritten entries belong to main and other top-level functions, which return rarely)
216 |
217 | In the end, remember this is just another predictor and mispredictions are allowed - the real return address is still on the program's stack, so execution remains correct; we only pay the misprediction cost. The goal is to predict the greatest number of calls and returns correctly, so the wrap-around approach is best.
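
A minimal sketch of a RAS with the wrap-around policy (the size and names here are hypothetical):

```cpp
#include <cstdint>

// Hypothetical 16-entry return address stack that wraps around when full.
constexpr int kRasEntries = 16;

struct ReturnAddressStack {
    uint32_t entries[kRasEntries] = {};
    int top = 0;  // index of the next slot to push into

    // On a CALL: push PC+4. When the stack overflows we simply wrap around,
    // overwriting the oldest entries (those belong to the outermost,
    // rarely-returning functions such as main).
    void push(uint32_t return_pc) {
        entries[top] = return_pc;
        top = (top + 1) % kRasEntries;
    }

    // On a RET: pop the most recent entry as the predicted target.
    uint32_t pop() {
        top = (top - 1 + kRasEntries) % kRasEntries;
        return entries[top];
    }
};
```

After an overflow, a pop for one of the outermost calls may return a stale address; that is just a misprediction, and the pipeline recovers from it as it would for any other mispredicted branch.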
218 |
219 | ### How do we know it's a `RET`?
220 | This is all still during the IF phase, so how do we even know if the instruction is a `RET`? One simple way is to use a single-bit predictor to predict whether an instruction is a `RET` or not. This is very accurate (an instruction at that PC is likely to always be a `RET`).
221 |
222 | Another approach is called "Predecoding". Most of the time instructions are coming from the processor cache and are only loaded from memory when not in the cache. This strategy involves storing part of the decoded instruction along with the instruction in cache. For example, if an instruction is 32-bits, maybe we store 33 bits, where the extra bit tells us if it is a `RET` or not. The alternative is decoding this every time the instruction is fetched, which becomes more expensive. Modern processors store a lot of information during the predecode step such that the pipeline can move quickly during execution.
223 |
224 | *[2BP]: 2-bit Predictor
225 | *[2BC]: 2-Bit Counter
226 | *[2BCs]: 2-Bit Counters
227 | *[ALU]: Arithmetic Logic Unit
228 | *[BHT]: Branch History Table
229 | *[BTB]: Branch Target Buffer
230 | *[CPI]: Cycles Per Instruction
231 | *[IF]: Instruction Fetch
232 | *[PC]: Program Counter
233 | *[PHT]: Pattern History Table
234 | *[RAS]: Return Address Stack
235 | *[XOR]: Exclusive-OR
236 |
--------------------------------------------------------------------------------
/cache-coherence.md:
--------------------------------------------------------------------------------
1 | ---
2 | id: cache-coherence
3 | title: Cache Coherence
4 | sidebar_label: Cache Coherence
5 | ---
6 |
7 | [🔗Lecture on Udacity (2hr)](https://classroom.udacity.com/courses/ud007/lessons/907008654/concepts/last-viewed)
8 |
9 | ## Cache Coherence Problem
10 | 
11 | As each core reads from memory, it may be using its own cache, and writes on other cores do not automatically update the copies held by every core. A core may therefore operate on stale values across many reads and writes. This is called an incoherent state. We instead need Cache Coherence - even though the cache is split among many cores, it should behave as a single memory.
12 |
13 | ## Coherence Definition
14 | 1) Read R from address X on core C1 returns the value written by the most recent write W to X on C1, if no other core has written to X between W and R.
15 | - Case when only one core is operating on a location, should behave like a uniprocessor.
16 | 2) If C1 writes to X and C2 reads after a sufficient time, and there are no other writes in-between, C2's read returns the value from C1's write.
17 |     - After a "sufficient time", a core must read the "new" values produced by other cores using that location.
18 | - This requires some kind of active design to make this work correctly.
19 | 3) Writes to the same location are serialized: any two writes to X must be seen to occur in the same order on all cores.
20 | - All cores should agree on which values were written in a particular order.
21 |
22 | ## How to Get Coherence
23 | Options:
24 | 1) No caches (really bad performance)
25 | 2) All cores share same L1 cache (bad performance)
26 | 3) ~~Private write-through caches~~
27 | (not really coherent because it doesn't prevent stale reads)
28 | 4) Force read in one cache to see write made in another.
29 | 1) Pick one of these:
30 | * Broadcast writes to update other caches (Write-Update Coherence)
31 | * Writes prevent hits to other copies (Write-Invalidate Coherence)
32 | 2) And, pick one of these:
33 | * All writes are broadcast on shared bus (Snooping) - bus can be a bottleneck
34 | * Each block assigned an "Ordering Point" (Directory)
35 |
36 | ## Write-Update Snooping Coherence
37 | Two caches with valid bit (V), tag (T) and data, both connected to the same bus with memory. We have the following instructions:
38 | ```
39 | Core 0 Core 1
40 | 1 RD A WR A=1
41 | 2 WR A=2 WR A=3
42 | ```
43 | The first instruction executes on Core 0, misses in Cache 0, and the block is pulled from memory (A=0). Cache 1 could see this read on the bus but does not care, since it only reacts to writes. Core 1 executes its write, updates A in its own cache, and puts the write on the bus. Cache 0 sees it and updates itself with the new value.
44 |
45 | Then the second pair of instructions executes (close to) simultaneously. With a single bus, only one core can send on the bus at a time, so they must arbitrate to see which "wins". Both modify their own cache at the same time (Cache 0: A=2, Cache 1: A=3). Let's say Core 0 wins the arbitration and sends A=2 on the bus. Cache 1 then updates to A=2, and _then_ sends its own A=3 on the bus, updating both caches.
46 |
47 | In this way, at any point in time all processors agree on the values A has taken, in the same order, because the bus serializes the writes.
48 |
49 | 
50 |
51 | ### Write-Update Optimization 1: Memory Writes
52 | Memory is on the same bus, yet is very slow, and thus it becomes a big bottleneck as it cannot keep up with all the writes. We therefore want to avoid unnecessary memory writes. The solution is one we have seen before - add a "dirty" bit to each cache block. The cache with the dirty bit set is then responsible for answering any requests for that block. If a cache with the dirty bit set snoops a write on the bus, it unsets the dirty bit and is no longer responsible. Memory is only written when a block whose dirty bit is set is replaced. Thus, the last writer is responsible for writing to memory.
53 |
54 | * Dirty Bit benefits:
55 | * Write to memory only when block replaced
56 | * Read from memory only if no cache has that block in a dirty state.
57 |
58 | ### Write-Update Optimization 2: Bus Writes
59 | With the previous optimization, memory is no longer as busy, but the bus still gets all writes, and now becomes the bottleneck in the system. This write does have to happen, because this is a Write-Update coherence system. But, what about the writes that aren't in any other cache? Those updates are wasted.
60 |
61 | We can add a "Shared" bit to each block in the cache that tells us whether other caches are using that block. Additionally, there is another line on the bus that is pulled high on a memory read, so each cache can easily tell if other caches already have that block; if so, the reader sets its Shared bit to 1. A cache can also pull this line high when it sees a write to a block it already has in its cache, signaling to the writing cache that the block is in use elsewhere so it can set its Shared bit.
62 |
63 | When a write happens to a block with Shared=1, the cache knows to broadcast that write on the bus; otherwise it does not. In this way, we get the Write-Update behavior when there is sharing, but otherwise we do not add extra traffic to the shared bus.
64 |
65 | ## Write-Invalidate Snooping Coherence
66 | This works similarly to Write-Update with both optimizations (dirty/shared) applied. However, it differs in that instead of broadcasting the entire new value on every write, we only broadcast the fact that the block has been modified. Other caches containing that block simply unset their Valid bit. The next time such a cache reads that block, it misses and must re-read it; the request is answered by the cache holding the block with Dirty=1, or else by memory.
67 |
68 | The disadvantage of Write-Invalidate is that every reader takes a cache miss when reading something another cache has written. The advantage is that repeated writes don't force other caches to keep updating blocks they may not need. Additionally, when a write sends an invalidation, the writing cache can unset its Shared bit (all other copies are now invalid and will have to re-read the block at some point), which reduces bus activity further.
69 |
70 | ## Update vs. Invalidate Coherence
71 | | Application Does... | Update | Invalidate |
72 | |---|---|---|
73 | | Burst of writes to one address | Each write sends an update (bad) | First write invalidates, other accesses are just hits (good) |
74 | | Write different words in same block | Update sent for each word (bad) | First write invalidates, other accesses are just hits (good) |
75 | | Producer-Consumer WR then RD | Producer sends updates, consumer hits (good) | Producer invalidates, consumer misses and re-reads (bad) |
76 |
77 | And, the winner is... Invalidate! Overall, Invalidate is just slightly better in regard to these activities, but where it really shines is:
78 |
79 | | | | |
80 | |---|---|---|
81 | | Thread moves to another core | Keep updating the old core's cache (horrible) | First write to each block invalidates, then no traffic (good) |
82 |
83 | ## MSI Coherence
84 | A block can be in one of three states:
85 | 1. I(nvalid): (V=0, or not in cache)
86 | * Local Read: Move to Shared state, **put RD on bus**
87 | * Local Write: Move to Modified state, **put WR on bus**
88 | * Snoop RD/WR on bus: (Remain in Invalid state)
89 | 2. S(hared): (V=1, D=0)
90 | * Local Read: (Remain in Shared state)
91 | * Local Write: Move to Modified state, **Put Invalidation on Bus**
92 | * Snoop WR on bus: Move to Invalid state
93 | * Snoop RD on bus: (Remain in Shared state)
94 | 3. M(odified): (V=1, D=1)
95 | * Local Read: (Remain in Modified state)
96 | * Local Write: (Remain in Modified state)
97 | * Snoop WR on bus: Move to Invalid state, **WR back**
98 | * (can delay WR and then proceed again once this WR is done)
99 | * Snoop RD on bus: Move to Shared state, **WR back**
100 |
101 | 
102 |
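The per-block transitions listed above can be sketched as a small state machine. This is purely illustrative; the event names and structure are assumptions, not actual hardware:

```cpp
enum class State { I, S, M };          // MSI states for one cache block
enum class Event {
    LocalRead, LocalWrite,             // requests from this core
    SnoopRead, SnoopWrite, SnoopInv    // requests observed on the bus
};

// Returns the next state; comments note the bus action / write-back implied.
State msi_next(State s, Event e) {
    switch (s) {
    case State::I:
        if (e == Event::LocalRead)  return State::S;  // put RD on bus
        if (e == Event::LocalWrite) return State::M;  // put WR on bus
        return State::I;                              // snoops: stay Invalid
    case State::S:
        if (e == Event::LocalWrite) return State::M;  // put invalidation on bus
        if (e == Event::SnoopWrite || e == Event::SnoopInv) return State::I;
        return State::S;                              // local/snooped reads: stay Shared
    case State::M:
        if (e == Event::SnoopWrite) return State::I;  // write back, then invalidate
        if (e == Event::SnoopRead)  return State::S;  // write back, then share
        return State::M;                              // local reads/writes just hit
    }
    return s;
}
```
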
103 | ### Cache-to-Cache Transfers
104 | * Core1 has block B in M state
105 | * Core2 puts RD on bus
106 | * Core1 has to provide data... but how?
107 | * Solution 1: Abort/Retry
108 | * Core1 cancels Core2's request ("abort" signal on bus)
109 | * Core1 does normal write-back to memory
110 | * Core2 retries, gets data from memory
111 | * Problem: double the memory latency (write and read)
112 | * Solution 2: Intervention
113 | * Core1 tells memory it will supply the data ("intervention" signal on bus)
114 | * Core1 responds with data
115 | * Memory picks up data (both caches are now in shared state and think the block is not dirty, so memory also needs to update)
116 | * Problem: more complex hardware
117 | * Most use some form of Intervention, to avoid performance hit.
118 |
119 | ### Avoiding Memory Writes on Cache-to-Cache Transfers
120 | * C1 has block in M state
121 | * C2 wants to read, C1 responds with data -> C1: S, C2: S
122 | * C2 writes -> C1: I, C2: M
123 | * Maybe this repeats a few times, but memory write happens every time C1 responds with data
124 | * C1 read, C2 responds with data -> C1: S, C2: S
125 | * C3 read, memory provides data (memory read, even though either cache could respond)
126 | * C4 read, memory provides data (...)
127 |
128 | So, we want to avoid these memory read/writes if another cache already has the data.
129 |
130 | We need a cache whose copy is no longer in the M state to remain responsible for:
131 | * Giving data to other caches
132 | * Eventually writing block to memory
133 |
134 | We need to know which cache is "responsible" for the data. New State: O(wned).
135 |
136 | ## MOSI Coherence
137 | * Like MSI, except:
138 | * M => snoop A Read => O (not S)
139 | * Memory does not get accessed!
140 | * O State, like S, except
141 | * Snoop a read => Provide Data
142 | * Write-Back to memory if replaced
143 | * M: Exclusive Read/Write Access, Dirty
144 | * S: Shared Read Access, Clean
145 | * O: Shared Read Access, Dirty (only one cache can hold the block in this state)
146 |
147 | ### M(O)SI Inefficiency
148 | * Thread-Private Data
149 | * All Data in a single-threaded program
150 | * Stack in Multi-threaded programs
151 | * Data Read, then Write
152 | * I -> Miss -> S -> Invalidation -> M
153 | * Uniprocessor: V=0 => Read - Miss => V=1 => Hit => D=1
154 | * We want to avoid this Invalidation step that is unnecessary for thread-private data. For this, we will add another state: (E)xclusive
155 |
156 | ## The E State
157 | * M: Exclusive Access (RD/WR), Dirty
158 | * S: Shared Access (RD), Clean
159 | * O: Shared Access (RD), Dirty
160 | * E: Exclusive Access (RD/WR), Clean
161 |
162 | For the Read/Write loop, here are how each model works (with bus traffic required)
163 |
164 | | | MSI | MOSI | MESI | MOESI |
165 | |--------|:-----------------:|:-----------------:|:-----------------:|:-----------------:|
166 | | `RD A` | I -> S (miss) | I -> S (miss) | I -> E (miss) | I -> E (miss) |
167 | | `WR A` | S -> M (inv) | S -> M (inv) | E -> M | E -> M |
168 |
169 | So, with MESI/MOESI, we get the same behavior as if we had run the same sequence of accesses on a uniprocessor. The E state allows the write to be a simple cache hit, which is exactly what we're looking for.
170 |
171 | ## Directory-Based Coherence
172 | * Snooping: broadcast requests so others see them, and to establish ordering
173 | * Bus becomes a bottleneck
174 | * Snooping does not work well with > 8-16+ cores
175 | * Non-Broadcast Network
176 | * How do we observe requests we need to see?
177 | * How do we order requests to the same block?
178 |
179 | ### Directory
180 | * Distributed across cores
181 | * Each "slice" serves a set of blocks
182 | * One entry for each block it serves
183 | * Entry tracks which caches have block (in non-I state)
184 | * Order of accesses determined by "home" slice
185 | * Caches still have same states: MOESI
186 | * When we send a request to read or write, it no longer gets broadcast on a bus, it is communicated to a single directory.
187 |
188 | ### Directory Entry
189 | * 1 Dirty Bit (tells us whether some cache holds the block dirty, so a write-back may be needed)
190 | * 1 Bit per Cache: present in that cache
191 |
192 | In this example with 8 cores, a read request from Cache 0 for block B would be sent to the home slice for block B (instead of being broadcast on a bus). The directory responds with the data (from memory) and tells Cache 0 it has Exclusive access. The directory then sets its Dirty bit and the Presence[0] bit.
193 |
194 | Cache 1 then performs a write request of B. The directory forwards that write to Cache 0, which moves to Invalid state and can then either return the data (since in E state), or just ignore and acknowledge the invalidation. The directory unsets bit Presence[0] (because it is in I state), sets bit Presence[1] and Cache 1 moves to the M state.
195 | 
196 |
197 | ### Directory Example
198 | [🎥 View lecture video (4:59)](https://www.youtube.com/watch?v=lZZYILcQ68Y)
199 |
200 | ## Cache Misses with Coherence
201 | * 3 Cs: Compulsory, Conflict, Capacity
202 | * Another C: Coherence Miss
203 | * Example: If we read something, somebody else writes it, and we read it again
204 | * So, 4 Cs now!
205 | * Two types of coherence misses:
206 | * True Sharing: Different cores access same data
207 | * False Sharing: Different cores access different data, _but in the same block_
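
A classic way to produce false sharing (a hypothetical illustration, not from the lecture): two threads update different counters that happen to sit in the same cache block, so the block bounces between the two caches even though no datum is actually shared.

```cpp
#include <thread>

// Two counters that will almost certainly land in the same cache block.
// Each thread touches only its own counter (no true sharing), yet every
// write invalidates the other core's copy of the block: false sharing.
struct Counters {
    long a = 0;
    long b = 0;   // padding 'a' out to a full block (e.g. 64 bytes) would avoid this
} counters;

int main() {
    std::thread t1([] { for (int i = 0; i < 1000000; ++i) counters.a++; });
    std::thread t2([] { for (int i = 0; i < 1000000; ++i) counters.b++; });
    t1.join();
    t2.join();
    return 0;
}
```

Padding each counter out to its own block (or putting them in separate allocations) would make these coherence misses disappear.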
208 |
209 |
--------------------------------------------------------------------------------
/cache-review.md:
--------------------------------------------------------------------------------
1 | ---
2 | id: cache-review
3 | title: Cache Review
4 | sidebar_label: Cache Review
5 | ---
6 |
7 | [🔗Lecture on Udacity (1.5 hr)](https://classroom.udacity.com/courses/ud007/lessons/1025869122/concepts/last-viewed)
8 |
9 | ## Locality Principle
10 | Things that will happen soon are likely to be close to things that just happened.
11 |
12 | ### Memory References
13 | Accessed Address `x` Recently
14 | - Likely to access `x` again soon (Temporal Locality)
15 | - Likely to access addresses close to `x` too (Spatial Locality)
16 |
17 | ### Locality and Data Accesses
18 | Consider the example of a library (large information but slow to access). Accessing a book (e.g. Computer Architecture) has Temporal Locality (likely to look up the same information again) and also Spatial Locality (likely to look up other books on Computer Architecture.) In this example, consider options a student can do:
19 | 1. Go to the library every time to find the info he needs, go back home
20 | 2. Borrow the book and bring it home
21 | 3. Take all the books and build a library at home
22 |
23 | Option 1 does not benefit from locality (inconvenient). Option 3 is still slow to access and requires a lot of space and does not benefit from locality. Option 2 is a good tradeoff for being able to find information quickly without being overwhelmed with data. This is like a Cache, where we bring only certain information to the processor for faster access.
24 |
25 | ## Cache Lookups
26 | We need it to be fast, so it must be small. Not everything will fit.
27 | Access
28 | - Cache Hit - found it in the cache (fast access)
29 | - Cache Miss - not in the cache (access slow memory)
30 | - Copy this location to the cache
31 |
32 | ## Cache Performance
33 | Properties of a good cache:
34 | - Average Memory Access Time (AMAT)
35 | - \\(\text{AMAT} = \text{hit time} + \text{miss rate} * \text{miss penalty} \\)
36 | - hit time \\(\Rightarrow\\) small and fast cache
37 | - miss rate \\(\Rightarrow\\) large and/or smart cache
38 | - miss penalty \\(\Rightarrow\\) main memory access time
39 | - "miss time" = hit time + miss penalty
40 | - Alternate way to see it: \\(AMAT = (1-rate_{miss}) * t_{hit} + rate_{miss}*t_{miss} \\)
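
As a quick worked example (hypothetical numbers): with a 1-cycle hit time, a 10% miss rate, and a 20-cycle miss penalty, \\(\text{AMAT} = 1 + 0.1 \times 20 = 3\\) cycles.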
41 |
42 | ## Cache Size in Real Processors
43 | Complication: several caches
44 |
45 | L1 Cache - Directly service read/write requests from the processor
46 | - 16KB - 64KB
47 | - Large enough to get \\(\approx\\) 90% hit rate
48 | - Small enough to hit in 1-3 cycles
49 |
50 | ## Cache Organization
51 | 
52 |
53 | A cache is a table indexed by some bits of the address. Each entry contains data plus information used to determine whether the lookup is a hit. The size of data per entry (block size) is chosen as a balance between having enough data to exploit spatial locality, without bringing in too much data that will never be accessed. Typically 32-128 bytes is ideal.
54 |
55 | ### Cache Block Start Address
56 | - Anywhere? 64B block => 0..63, 1..64, 2..65. (cannot effectively index into cache, contains overlap)
57 | - Aligned! 64B block => 0..63, 64..127, ... (easy to index using bits of the address, no overlap)
58 |
59 | ### Blocks in Cache and Memory
60 | Memory is divided into blocks. Cache is divided into lines (of block size). This line is a space/slot in which a block can be placed.
61 |
62 | ### Block Offset and Block Number
63 | Block Offset is the lower N bits (where 2^N = block size) that tell us where in the block to index. Block Number is the upper M bits (where M+N = address length) that tell us how to index the blocks for selection.
64 |
65 | ### Cache Tags
66 | In addition to the data, the cache keeps a list of "tags", one for each cache line. When accessing a block, we compare the block number to the tags in the cache to determine which line (if any) holds that block. The line plus the offset then tell us exactly which data to access.
67 |
68 | ### Valid Bit
69 | What happens if the tags are initialized to some value that matches the tag of a valid address? We also need a "valid bit" for each cache line to tell us that specific cache line is valid and can be properly read. An added benefit is that we don't need to worry about clearing tag and data all the time - we can just clear the valid bit.
70 |
71 | ## Types of Caches
72 | - Fully Associative: Any block can be in any line
73 | - Set-Associative: N lines where a block can be
74 | - Direct-Mapped: A particular block can go into 1 line (Set-associative with N==1)
75 |
76 | ### Direct-Mapped Cache
77 | Each block number maps to a single cache line where it can be placed. Typically the lowermost bits of the block number are used as the index into the cache. The tag is still needed to tell us which block is actually in that cache line, but it now only needs the bits of the block number that were not used for the index.
78 | 
79 |
80 | #### Upside and Downside of Direct-Mapped Cache
81 | - Look in only one place (fast, cheap, energy-efficient)
82 | - Block must go in one place
83 | - Conflicts when two blocks are used that map to the same cache line - increased cache miss rate
84 |
85 | ### Set-Associative Caches
86 | N-Way Set-Associative: a block can be in one of N lines (each set has N lines, and a particular block can be in any line of that set)
87 |
88 | #### Offset, Index, Tag for Set-Associative
89 | Similar to before - the lower offset bits of the address still determine where within the cache line the data is. The index bits (determined by how many sets there are) select which set the line may be in. The tag is the remaining upper bits.
90 | 
91 |
92 | ### Fully-Associative Cache
93 | Still have lower offset bits, but there is now no index. All remaining bits are the tag. Any block can be in any cache line, but this means to find something in cache you have to look at every line to see if that is it.
94 |
95 | ### Direct-Mapped and Fully Associative
96 | Direct-Mapped = 1-way set associative
97 |
98 | Fully Associative = N-way set associative, where N = the total number of lines (a single set)
99 |
100 | Always start with offset bits, based on block size. Then, determine index bits. Rest of the bits is the tag.
101 |
102 | - Offset bits = \\(log_2(\text{block size})\\)
103 | - Index bits = \\(log_2(\text{# of sets})\\)
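
As a quick sketch, here is how an address splits into tag/index/offset for a hypothetical cache with 32-byte blocks and 64 sets (all names and sizes are made up for illustration):

```cpp
#include <cstdint>
#include <cstdio>

// Hypothetical geometry: 32-byte blocks (5 offset bits) and 64 sets (6 index bits).
constexpr uint32_t kOffsetBits = 5;
constexpr uint32_t kIndexBits  = 6;

int main() {
    uint32_t addr = 0x12345678;  // an arbitrary example address

    uint32_t offset = addr & ((1u << kOffsetBits) - 1);                 // lowest bits
    uint32_t index  = (addr >> kOffsetBits) & ((1u << kIndexBits) - 1); // next bits
    uint32_t tag    = addr >> (kOffsetBits + kIndexBits);               // the rest

    std::printf("tag=0x%x index=%u offset=%u\n",
                (unsigned)tag, (unsigned)index, (unsigned)offset);
    return 0;
}
```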
104 |
105 | ## Cache Replacement
106 | - Set is full
107 | - Miss -> Need to put new block in set
108 | - Which block do we kick out?
109 | - Random
110 | - FIFO
111 | - LRU (Least Recently Used)
112 | - Hard to implement but exploits locality
113 | - Could implement via NMRU (Not Most Recently Used)
114 |
115 | ### Implementing LRU
116 | LRU has a separate set of counters (one per line). The counter is set to max when the line is accessed, and all other counters are decremented. A counter of 0 value represents least recently used and that line could be replaced.
117 |
118 | For an N-way SA cache:
119 | - Cost: N log2(N)-bit counters
120 | - Energy: Change N counters on every access
121 | - (Expensive in both hardware and energy)
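
A rough sketch of the counters for one set (illustrative only; it decrements only the counters above the accessed line's old value, a slightly more careful variant of the scheme described above that keeps the counters a permutation of 0..N-1):

```cpp
#include <cstdint>

// LRU counters for one set of a hypothetical 4-way set-associative cache.
// Convention from the notes: max value = most recently used, 0 = LRU victim.
constexpr int kWays = 4;

struct SetLru {
    uint8_t counter[kWays] = {0, 1, 2, 3};   // some initial ordering

    // On an access to 'way': that way becomes max; every counter that was
    // above it is decremented, preserving the relative order of the rest.
    void touch(int way) {
        uint8_t old = counter[way];
        for (int w = 0; w < kWays; ++w)
            if (counter[w] > old) counter[w]--;
        counter[way] = kWays - 1;
    }

    // On a miss, replace the way whose counter is 0 (least recently used).
    int victim() const {
        for (int w = 0; w < kWays; ++w)
            if (counter[w] == 0) return w;
        return 0;  // unreachable while counters remain a permutation
    }
};
```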
122 |
123 | [🎥 Link to lecture (5:11)](https://www.youtube.com/watch?v=bq6N7Ym81iI)
124 |
125 | ## Write Policy
126 | Do we bring blocks that we write into the cache?
127 | - Write-Allocate
128 | - Most are write-allocate due to locality - if we write something we are likely to access it
129 | - No-Write-Allocate
130 |
131 | Do we write just to cache or also to memory?
132 | - Write-Through (update mem immediately)
133 | - Very unpopular
134 | - Write-Back (write to cache, but write to mem when block is replaced)
135 | - Takes advantage of locality, more desirable
136 |
137 | If you have a write-back cache you also want write-allocate (with a write miss, we want future writes to go to the cache).
138 |
139 | ### Write-Back Cache
140 | If we're replacing a block in cache, do we know we need to write it to memory?
141 | - If we did write to the block, write to memory
142 | - If we did not write to the block, no need to write
143 | - Use a "Dirty Bit" to determine if the block has been written to.
144 | - 0 = "clean" (not written since last brought from memory)
145 | - 1 = "dirty" (need to write back on replacement)
146 |
147 | [🎥 Link to example (3:19)](https://www.youtube.com/watch?v=xU0ICkgTLTo)
148 |
149 | ## Cache Summary
150 | [🎥 First Part (2:32)](https://www.youtube.com/watch?v=MWpy5bBxl5A)
151 |
152 | [🎥 Second Part (2:32)](https://www.youtube.com/watch?v=DhxAIKaCEBY)
--------------------------------------------------------------------------------
/cheat-sheet-midterm.md:
--------------------------------------------------------------------------------
1 | ---
2 | id: cheat-sheet-midterm
3 | title: Midterm Cheat Sheet
4 | sidebar_label: Cheat Sheet - Midterm
5 | ---
6 |
7 | A denser collection of important pieces of information covering lectures from Introduction to VLIW
8 |
9 | ## Power Consumption Types
10 |
11 | Two kinds of power a processor consumes.
12 | 1. Dynamic Power - consumed by activity in a circuit
13 | Computed by \\(P = \tfrac 12 \*C\*V^2\*f\*\alpha\\), where
14 | \\(C = \text{capacitance}\\), \\(V = \text{power supply voltage}\\), \\(f = \text{frequency}\\), \\(\alpha = \text{activity factor (% of transistors active each clock cycle)}\\)
15 | 2. Static Power - consumed when powered on, but idle
16 | - The power it takes to maintain the circuits when not in use.
17 | - V \\(\downarrow\\), leakage \\(\uparrow\\)
18 |
19 | ## Performance
20 |
21 | * Latency (time start \\( \rightarrow \\) done)
22 | * Throughput (#/second) (not necessarily 1/latency due to pipelining)
23 | * Speedup - "X is N times faster than Y" (X new, Y old)
24 | * Speedup = speed(X)/speed(Y)
25 | * Speedup = throughput(X)/throughput(Y) = IPC(X)/IPC(Y)
26 | * Speedup = latency(Y)/latency(X) = \\(\frac{CPI(Y)\*CTime(Y)}{CPI(X)\*CTime(X)}\\) (notice Y/X reversal)
27 | * Can also multiply by nInst(Y)/nInst(X) factor
28 | * Performance ~ Throughput ~ 1/Latency
29 | * Averages of ratios (e.g. speedup across benchmarks) must use the geometric mean
30 | * \\(\text{geometric mean} = \sqrt[n]{a_1\*a_2\*...a_n}\\)
31 | * Iron Law of Performance:
32 | * **CPU Time** = (# instructions in the program) * (cycles per instruction) * (clock cycle time)
33 | * clock cycle time = 1/freq
34 | * For unequal instruction times: \\(\sum_i (IC_i\* CPI_i) * \frac{\text{time}}{\text{cycle}}\\)
35 | * Amdahl's Law - overall effect due to partial change
36 | * \\(speedup = [(1-frac_{enh}) + \frac{frac_{enh}}{speedup_{enh}}]^{-1}\\)
37 | * \\( frac_{enh} \\) represents the fraction of the execution **TIME**
38 | * Consider diminishing returns by improving the same area of code.
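
For example (hypothetical numbers): if the enhanced part is 50% of execution time and is sped up 4x, \\(speedup = [(1-0.5) + \frac{0.5}{4}]^{-1} = 1.6\\), not 4.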
39 |
40 | ## Pipelining
41 | * CPI = (base CPI) + (stall/mispred rate %)*(mispred/stall penalty in cycles)
42 | * Dependence - property of the program alone
43 | * Control dependence - one instruction executes based on result of previous instruction
44 | * Data Dependence
45 | * RAW (Read-After-Write) - "Flow" or "True" Dependence
46 | * WAW (Write-After-Write) - "Output", "False", or "Name" dependence
47 | * WAR (Write-After-Read) - "Anti-", "False", or "Name" dependence
48 | * Hazard - when a dependence results in incorrect execution
49 | * Handled by Flush, Stall, and Fix values
50 | * More Stages \\( \rightarrow \\) more hazards (CPI \\( \uparrow \\)), but less work per stage ( cycle time \\( \downarrow \\))
51 | * 5-stage pipeline: Fetch-Decode-Execute-Memory-Write
52 |
53 | ## Branch Prediction
54 | * Predictor must compute \\(PC_{next}\\) based only on knowledge of \\(PC_{now}\\)
55 | 1. Guess "is this a branch?"
56 | 2. Guess "is it taken?"
57 | 3. "and if so, what is the target PC"
58 | * Accuracy: \\( CPI = 1 + \frac{mispred}{inst} * \frac{penalty}{mispred} \\)
59 | * \\( \frac{mispred}{inst} \\) is determined by the predictor accuracy.
60 | * \\(\frac{penalty}{mispred} \\) is determined by the pipeline depth at misprediction.
61 | * Types of Predictors and components
62 | * Not-Taken: Always assume branch is not taken
63 | * Historical: \\(PC_{next} = f(PC_{now}, history[PC_{now}])\\)
64 | * Branch Target Buffer (BTB): simple table of next best guessed PC based on current PC
65 | * Use last N bits of PC (not counting final 2-4 alignment bits) to index into this table
66 | * Branch History Table (BHT): like BTB, but entry is a single bit that tells us Taken/Not-Taken
67 | * 1-bit history does not handle switches in behavior well
68 | * 2-bit history adds hysteresis (strong and weak taken or not-taken)
69 | * 3-bit, 4-bit? Depends on pattern of anomalies, but may not be worth it.
70 | * History-Based Predictor: Looks at last N states of taken/not-taken.
71 | * Example: in sequence NNTNNT, if history is NN we know to predict T.
72 | * 1-bit history with 2-bit counters (1 historical state, 2-bit prediction)
73 | * 2-bit history with 2-bit counters (2 historical states, 2-bit prediction)
74 | * Shared counters - Pattern History Table (PC-indexed N bits per entry, XOR to index into BHT with 2-bit counter entries)
75 | * PShare - "private history, shared counters" - inner loops, smaller patterns
76 | * GShare - "global history, shared counters" - correlated branches across program
77 | * Tournament Predictor
78 | * Meta-Predictor feeds into a PShare and GShare, and is trained based on the results of each.
79 | * Hierarchical Predictor
80 | * Uses one good and one ok predictor (use "ok" predictor for easy branches, and "good" predictor for difficult ones)
81 | * Return Address Stack (RAS) - dedicated to predicting function returns
82 | * Should be very small structure, quick, and accurate (fairly deterministic)
83 | * Can still mispredict - wrap around stack due to limited space.
84 | * Can know an instruction is a `RET` via 1BC or pre-decoding instructions
85 |
86 | ## Predication
87 | Attempts to do the work of both directions of a branch and simply waste the work if wrong (to avoid control hazards)
88 | * If-Conversion (takes normal code and makes it perform both paths)
89 | ```cpp
90 | if(cond) { |>| x1 = arr[i];
91 | x = arr[i]; |>| x2 = arr[j];
92 | y = y+1; |>| y1 = y+1;
93 | } else { |>| y2 = y-1;
94 | x = arr[j]; |>| x = cond ? x1 : x2;
95 | y = y-1; } |>| y = cond ? y1 : y2;
96 | ```
97 | * MIPS operands to help with this (and what `x = cond ? x1 : x2;` looks like)
98 | |inst | operands | does |
99 | |---|---|---|
100 | | `MOVZ` | Rd, Rs, Rt | `if(Rt == 0) Rd=Rs;` |
101 | | `MOVN` | Rd, Rs, Rt | `if(Rt != 0) Rd=Rs;` |
102 | ```mipsasm
103 | R3 = cond
104 | R1 = ... x1 ...
105 | R2 = ... x2 ...
106 | MOVN X, R1, R3
107 | MOVZ X, R2, R3
108 | ```
109 | * If-Conversion takes more instructions to do the work, but avoids any penalty, so is typically more performant.
110 |
111 | ## Instruction Level Parallelism (ILP)
112 | ILP is the IPC when the processor does the entire instruction in 1 cycle, and can do any number of instructions in the same cycle (while obeying true dependencies)
113 | * Register Allocation Table (RAT) is used for renaming registers
114 | * Steps to get ILP value:
115 | 1. Rename Registers - use RAT
116 | 2. "Execute" - ensure no false dependencies, determine when instructions are executed
117 | 3. Calculate ILP = (\# instructions)/(\# cycles)
118 | 1. Pay attention to true dependencies, trust renaming to handle false dependencies.
119 | 2. Be mindful to count how many cycles being computed over
120 | 3. Assume ideal hardware - all instructions that can compute, will.
121 | 4. Assume perfect same-cycle branch prediction
122 | * IPC should never assume "perfect processor", so ILP \\(\geq\\) IPC.
123 |
124 | ## Instruction Scheduling (Tomasulo)
125 | 
126 | All of these things happen every cycle:
127 | 1. Issue
128 | Take next from IQ, determine inputs, get free RS and enqueue, tag destination reg of instruction
129 | 2. Dispatch
130 | As operand values become available on the result (broadcast) bus, move instructions whose operands are ready from the RS to execution
131 | 3. Write Result (Broadcast)
132 | When execution is complete, put tag and result on bus, write to RF, update RAT, free RS
133 |
134 | ## ReOrder Buffer (ROB)
135 | Used to prevent issues with exceptions and mispredictions to ensure results are not committed to the actual register before previous instructions have completed.
136 |
137 | Correct Out-Of-Order Execution
138 | * Execute Out-Of-Order
139 | * Broadcast Out-Of-Order
140 | * Write values to registers In-Order!
141 |
142 | ROB is a structure (Register | Value | Done) that sits between the RAT and the RS. The RS now only dispatches instructions and does not have to wait for the result to be broadcast before freeing a spot. The RF is only written once that instruction is complete and all previous instructions have been written. The ROB ensures no wrong values are committed to registers; upon an exception it can flush and move to the exception handler.
143 |
144 | ## Memory Ordering
145 | Uses the Load-Store Queue (LSQ) to handle read/write memory dependencies. This allows forwarding from stores to loads without having to access cache/memory. Instructions may execute out of order, but memory accesses are kept in order. The LSQ considers the following when being used:
146 | 1. `LOAD`: Which earlier `STORE` can I get a value from?
147 | 2. `STORE`: Which later `LOAD`s do I need to give my value to?
148 |
149 | - Issue: Need a ROB entry and an LSQ entry
150 | - Execute Load/Store: compute the address, produce the value (simultaneously)
151 | - (`LOAD` only) Write Result and Broadcast it
152 | - Commit Load/Store: Free ROB & LSQ entries
153 | - (`STORE` only) Send write to memory
154 |
155 | ## Compiler ILP
156 | * Tree Height Reduction uses associativity in operations to avoid chaining dependencies (e.g. x+y+z+a becomes (x+y)+(z+a)).
157 | * Instruction Scheduling attempts to reorder instructions to fill in any natural "stalls"
158 | * Loop Unrolling performs multiple iterations of the loop during one actual branching. (unroll once means 2 iterations before branch).
159 | * Function Call Inlining takes a function and copies the work to the main program, providing opportunities for scheduling or eliminating call/ret.
160 |
161 | ## VLIW
162 | Combines multiple instructions into one large one - requires extensive compiler support, but lowers hardware cost.
--------------------------------------------------------------------------------
/compiler-ilp.md:
--------------------------------------------------------------------------------
1 | ---
2 | id: compiler-ilp
3 | title: Compiler ILP
4 | sidebar_label: Compiler ILP
5 | ---
6 |
7 | [🔗Lecture on Udacity (1 hr)](https://classroom.udacity.com/courses/ud007/lessons/972428795/concepts/last-viewed)
8 |
9 |
10 | ## Can compilers help improve IPC?
11 |
12 | * Limited ILP
13 | * Due to dependence chains (each instruction depends on the one before it)
14 | * Limited "Window" into program
15 | * Independent instructions are far apart
16 |
17 | ## Tree Height Reduction
18 | Consider a program that performs the operation `R8 = R2 + R3 + R4 + R5`:
19 | ```mipsasm
20 | ADD R8, R2, R3
21 | ADD R8, R8, R4
22 | ADD R8, R8, R5
23 | ```
24 | Obviously this creates a dependence chain. Instead we could group the instructions like `R8 = (R2 + R3) + (R4 + R5)`, or:
25 | ```mipsasm
26 | ADD R8, R2, R3
27 | ADD R7, R4, R5
28 | ADD R8, R8, R7
29 | ```
30 | This allows the first two instructions to execute in parallel. Tree Height Reduction in a compiler uses associativity to accomplish this. But, it should be considered that not all operations are associative.
31 |
32 | ## Make Independent Instructions Easier to Find
33 | Can use various techniques (to follow)
34 | 1. Instruction Scheduling (branch-free sequences of instructions)
35 | 2. Loop Unrolling (and how it interacts with Instruction Scheduling)
36 | 3. Trace Scheduling
37 |
38 | ## Instruction Scheduling
39 | Different from Tomasulo's algorithm that takes place in the processor, but attempts to accomplish a similar thing. Take this sequence of instructions:
40 |
41 | ```mipsasm
42 | Loop:
43 | LW R2, 0(R1)
44 | ADD R2, R2, R0
45 | SW R2, 0(R1)
46 | ADDI R1, R1, 4
47 | BNE R1, R3, Loop
48 | ```
49 | On each cycle, it may look more like this:
50 | ```
51 | 1. LW R2, 0(R1)
52 | 2. (stall)
53 | 3. ADD R2, R2, R0
54 | 4. (stall)
55 | 5. (stall)
56 | 6. SW R2, 0(R1)
57 | 7. ADDI R1, R1, 4
58 | 8. (stall)
59 | 9. (stall)
60 | 10. BNE R1, R3, Loop
61 | ```
62 | Can we move something into that first stall to help ILP? Cycles 3 and 6 cannot move because those are dependent. Is it possible to move the ADDI from cycle 7 to cycle 2, since it does not depend on anything else?
63 | ```mipsasm
64 | Loop:
65 | LW R2, 0(R1)
66 | ADDI R1, R1, 4
67 | ADD R2, R2, R0
68 | SW R2, -4(R1)
69 | BNE R1, R3, Loop
70 | ```
71 | The `ADDI` can move as-is, but we need to correct the offset in the `SW` instruction to compensate for this. From a cycle analysis, we have eliminated cycles 7 (moved to 2), and the stall from 8-9 (the existing stalls in 4-5 will also compensate for the `ADDI` delay). So instead of 10 cycles, the loop now runs in 7 cycles.
72 |
73 | ## Scheduling and If-Conversion
74 |
75 | In the following example, orange and green are two different branches, and using predication we attempt to execute both and throw away the wrong results later when the branch executes. So using If-Conversion, the program executes these in order.
76 |
77 | 
78 |
79 | For the purposes of scheduling, we can always perform scheduling optimizations within each functional block (e.g. instructions inside the orange section), and also between consecutive blocks (white-orange). We can even reschedule instructions on either side of the branches, and maintain correctness.
80 |
81 | ## If-Convert a Loop
82 | We know how to if-convert branches, but what about a loop?
83 |
84 | ```mipsasm
85 | Loop:
86 | LW R2, 0(R1) #(stall cycle after)
87 | ADD R2, R2, R3 #(stall cycle after)
88 | SW R2, 0(R1)
89 | ADDI R1, R1, 4 #(stall cycle after)
90 | BNE R1, R5, Loop
91 | ```
92 | With scheduling, we get something like this (notice decrease in wasted cycles)
93 | ```mipsasm
94 | Loop:
95 | LW R2, 0(R1)
96 | ADDI R1, R1, 4
97 | ADD R2, R2, R3 #(stall cycle after)
98 | SW R2, -4(R1)
99 | BNE R1, R5, Loop
100 | ```
101 | Instead of the BNE, we could use something like If-Conversion to pull in the next iteration's instructions and fill that last stall cycle. But each iteration would require a new predicate to be created, and eventually most work would only be done if a whole chain of predicates is true; the overhead outweighs the performance gained. So we cannot really If-Convert a loop, but we can do something called Loop Unrolling.
102 |
103 | ## Loop Unrolling
104 | ```cpp
105 | for(i=1000; i != 0; i--) {
106 | a[i] = a[i] + s;
107 | }
108 | ```
109 | compiles to...
110 | ```mipsasm
111 | LW R2, 0(R1)
112 | ADD R2, R2, R3
113 | SW R2, 0(R1)
114 | ADDI R1, R1, -4
115 | BNE R1, R5, Loop
116 | ```
117 |
118 | With loop unrolling, we try to do a few iterations of the loop during one iteration, maybe:
119 | ```cpp
120 | for(i=1000; i != 0; i=i-2) {
121 | a[i] = a[i] + s;
122 | a[i-1] = a[i-1] + s; // we could try unrolling a few more times
123 | }
124 | ```
125 | which now compiles to:
126 | ```mipsasm
127 | LW R2, 0(R1)
128 | ADD R2, R2, R3
129 | SW R2, 0(R1)
130 | LW R2, -4(R1)
131 | ADD R2, R2, R3
132 | SW R2, -4(R1)
133 | ADDI R1, R1, -8
134 | BNE R1, R5, Loop
135 | ```
136 | So the process is to take the work, copy it twice, adjust offsets, then adjust the final loop counter and branch instructions as needed. This example was an "unroll once". Unrolling twice would perform 3 iterations before branching.
137 |
138 | ### Loop Unrolling Benefits: # Instructions \\(\downarrow \\)
139 |
140 | In the example above, we start from 5 instructions * 1000 loops = 5000. After loop unrolling we have 8 instructions * 500 loops = 4000. From the Iron Law, we know Execution Time = (# inst)(CPI)(Cycle Time). So just by decreasing instructions by 20% there is a significant effect on overall performance.
141 |
142 | ### Loop Unrolling Benefits: CPI \\(\downarrow \\)
143 | Assume a processor with 4-Issue, In-Order, with perfect branch prediction. We can view how the iterations span over cycles:
144 | |`Loop:` | 1 | 2 | 3 | 4 | 5 | 6 |
145 | |--- |---|---|---|---|---|---|
146 | |`LW R2, 0(R1)` | x | | | x | | |
147 | |`ADD R2, R2, R3` | | x | | | x | |
148 | |`SW R2, 0(R1)` | | | x | | | x |
149 | |`ADDI R1, R1, -4` | | | x | | | x |
150 | |`BNE R1, R5, Loop`| | | | x | | ... |
151 | For an overall CPI of 3/5. With scheduling, we can do the following:
152 | |`Loop:` | 1 | 2 | 3 | 4 | 5 | 6 |
153 | |--- |---|---|---|---|---|---|
154 | |`LW R2, 0(R1)` | x | | x | | x | |
155 | |`ADDI R1, R1, -4` | x | | x | | x | |
156 | |`ADD R2, R2, R3` | | x | | x | | x |
157 | |`SW R2, 4(R1)` | | | x | | x | |
158 | |`BNE R1, R5, Loop`| | | x | | x | ... |
159 | For an overall CPI of 2/5 with scheduling. This is a significant boost. Now, what about loop unrolling (unrolled once)?
160 | |`Loop:` | 1 | 2 | 3 | 4 | 5 | 6 |
161 | |--- |---|---|---|---|---|---|
162 | |`LW R2, 0(R1)` | x | | | | | x |
163 | |`ADD R2, R2, R3` | | x | | | | |
164 | |`SW R2, 0(R1)` | | | x | | | |
165 | |`LW R2, -4(R1)` | | | x | | | |
166 | |`ADD R2, R2, R3` | | | | x | | |
167 | |`SW R2, -4(R1)` | | | | | x | |
168 | |`ADDI R1, R1, -8` | | | | | x | |
169 | |`BNE R1, R5, Loop`| | | | | | x |
170 | So it takes 5 cycles to do 8 instructions, for a CPI of 5/8. This is slightly worse when only looking at a few iterations, but overall it will perform much better (since we need half the iterations). Finally... with unrolling once and scheduling:
171 | |`Loop:` | 1 | 2 | 3 | 4 |
172 | |--- |---|---|---|---|
173 | |`LW R2, 0(R1)` | x | | x | |
174 | |`LW R10, -4(R1)` | x | | | x |
175 | |`ADD R2, R2, R3` | | x | | |
176 | |`ADD R10, R10, R3`| | x | | |
177 | |`ADDI R1, R1, -8` | | x | | |
178 | |`SW R2, 8(R1)` | | | x | |
179 | |`SW R10, 4(R1)` | | | x | |
180 | |`BNE R1, R5, Loop`| | | x | |
181 | So it takes 3 cycles to perform 8 instructions, for CPI of 3/8. This is slightly better than with scheduling alone, and when the benefit of loop unrolling over time (half the iterations) are considered, it is a significant improvement. Unrolling provides more opportunities for scheduling to reorder things to optimize for parallelism, in addition to reducing the overall number of instructions.
182 |
183 | ### Unrolling Downside?
184 | A few reasons we may not always unroll loops.
185 |
186 | 1. Code Bloat (in terms of lines of code after compilation)
187 | 2. What if number of iterations is unknown (e.g. while loop)?
188 | 3. What if number of iterations is not a multiple of N? (N = number of unrolls)
189 |
190 | Solutions to 2 and 3 do exist, but are beyond the scope of this course (may be in a compilers course).
191 |
192 | ## Function Call Inlining
193 |
194 | Function Call Inlining is an optimization that takes a function and copies the work inside the main program.
195 |
196 | 
197 |
198 | This has the benefits of:
199 | * Eliminating call/return overheads (reduces # instructions)
200 | * Better scheduling (reduces CPI)
201 | * Without inlining, the compiler can only schedule instructions around and inside the function block, but by inlining it can schedule all instructions together.
202 | * With fewer instructions and reduced CPI, execution time improves dramatically.
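
A small made-up source-level illustration of the idea (the compiler performs the equivalent transformation on the compiled code):

```cpp
// Before inlining: every call pays call/return overhead, and the compiler
// can only schedule instructions within each body separately.
int add_scaled(int a, int b, int s) { return a + b * s; }

int sum_before(const int* x, int n, int s) {
    int total = 0;
    for (int i = 0; i < n; ++i)
        total = add_scaled(total, x[i], s);   // call + return every iteration
    return total;
}

// After inlining: the function body is copied into the caller, eliminating
// the call/return and letting the compiler schedule the whole loop body at once.
int sum_after(const int* x, int n, int s) {
    int total = 0;
    for (int i = 0; i < n; ++i)
        total = total + x[i] * s;             // inlined body
    return total;
}
```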
203 |
204 | ### Function Call Inlining Downside
205 |
206 | The main downside to inlining, like in Loop Unrolling, is code bloat. Instead of abstracting the code into its own space and using call/return, it replicates the function body each time it is called. Each time it is inlined, it increases the total program size. Therefore we must be judicious about the usage of inlining. Ideally we select smaller functions to inline, and primarily those that result in fewer instructions when inlined compared to the overhead of setting up parameters, call, and return.
207 |
208 | ## Other IPC-Enhancing Compiler Stuff
209 |
210 | * Software Pipelining
211 |   * Treat a loop as a pipeline: interleave instructions from different iterations among the cycles so that correctness is maintained but dependencies no longer cause stalls
212 | * Trace Scheduling ("If-Conversion on steroids")
213 | * Combine the common path into one block of scheduled code, provide a way to branch and "fix" the wrong path execution if the uncommon path is determined to occur.
214 |
215 | *[ALU]: Arithmetic Logic Unit
216 | *[CPI]: Cycles Per Instruction
217 | *[ILP]: Instruction-Level Parallelism
218 | *[IPC]: Instructions per Cycle
219 | *[IQ]: Instruction Queue
220 | *[LB]: Load Buffer
221 | *[LSQ]: Load-Store Queue
222 | *[LW]: Load Word
223 | *[OOO]: Out Of Order
224 | *[PC]: Program Counter
225 | *[RAT]: Register Allocation Table (Register Alias Table)
226 | *[RAW]: Read-After-Write
227 | *[ROB]: ReOrder Buffer
228 | *[SB]: Store Buffer
229 | *[SW]: Store Word
230 | *[WAR]: Write-After-Read
231 | *[WAW]: Write-After-Write
232 | *[RAR]: Read-After-Read
--------------------------------------------------------------------------------
/course-information.md:
--------------------------------------------------------------------------------
1 | ---
2 | id: course-information
3 | title: Course Information
4 | sidebar_label: Course Information
5 | ---
6 |
7 | ## Are you ready for this course?
8 |
9 | Complete the prerequisite check at [🔗HPC0](https://classroom.udacity.com/courses/ud219/)
10 |
11 | ## Textbook
12 | There are no required readings. For personal knowledge, a recommended textbook is **Computer Architecture** by Hennessy and Patterson. [🔗This textbook is available electronically via the GT Library](https://ebookcentral-proquest-com.prx.library.gatech.edu/lib/gatech/detail.action?docID=787253).
13 |
14 | ## Sample Tests
15 | ### Sample Midterms
16 | * [Midterm 1 (without solutions)](https://www.udacity.com/wiki/hpca/SampleMidterms/Midterm1)
17 | * [Midterm 2 (with solutions)](https://www.udacity.com/wiki/hpca/SampleMidterms/Midterm2)
18 |
19 | ### Sample Finals
20 | * [Final 1 (without solutions)](https://www.udacity.com/wiki/hpca/sample-final/samplefinal1)
21 | * [Final 2 (with solutions)](https://www.udacity.com/wiki/hpca/sample-final/samplefinal2)
22 |
23 | ## Problem Set Solutions
24 | * [Metrics and Evaluation Problem Set Solutions](https://www.udacity.com/wiki/hpca/Problem_Set_Solutions/MetricsAndEval)
25 | * [Pipelining Problem Set Solutions](https://www.udacity.com/wiki/hpca/Problem_Set_Solutions/Pipelining)
26 | * [Branches Problem Set Solutions](https://www.udacity.com/wiki/hpca/Problem_Set_Solutions/Branches)
27 | * [Predication Problem Set Solutions](https://www.udacity.com/wiki/hpca/Problem_Set_Solutions/Predication) (links broken)
28 | * [Problem 1](https://www.udacity.com/wiki/hpca/problem-set-solutions/predication/problem-1)
29 | * [Problems 2-6](https://www.udacity.com/wiki/hpca/problem-set-solutions/predication/problem-2)
30 | * [Problems 7-9](https://www.udacity.com/wiki/hpca/problem-set-solutions/predication/problem-3)
31 | * [Problems 10-14](https://www.udacity.com/wiki/hpca/problem-set-solutions/predication/problem-4)
32 | * [Problems 15-19](https://www.udacity.com/wiki/hpca/problem-set-solutions/predication/problem-5)
33 | * [ILP Problem Set Solutions](https://www.udacity.com/wiki/hpca/Problem_Set_Solutions/ILP)
34 | * [Instruction Scheduling Problem Set Solutions](https://www.udacity.com/wiki/hpca/Problem_Set_Solutions/Instruction_Scheduling)
35 | * [ReOrder Buffer Problem Set Solutions](https://www.udacity.com/wiki/hpca/Problem_Set_Solutions/ROB)
36 | * [Interrupts & Exceptions Problem Set Solutions](https://www.udacity.com/wiki/hpca/Problem_Set_Solutions/Interrupts_Exceptions)
37 | * [Virtual Memory Problem Set Solutions](https://www.udacity.com/wiki/hpca/Problem_Set_Solutions/VirtualMemory)
38 | * [Advanced Caches Problem Set Solutions](https://www.udacity.com/wiki/hpca/Problem_Set_Solutions/Advanced_Caches)
39 | * [Memory Problem Set Solutions](https://www.udacity.com/wiki/hpca/Problem_Set_Solutions/Memory)
40 | * [Multiprocessing Problem Set Solutions](https://www.udacity.com/wiki/hpca/Problem_Set_Solutions/Mulitprocessing)
41 | * [Memory Consistency Problem Set Solutions](https://www.udacity.com/wiki/hpca/problem-set-solutions/memory-consistency)
42 |
43 | ## Additional Resources
44 |
45 | * [Assembly Language Programming](https://www.udacity.com/wiki/hpca/assemblyLanguageProgramming)
46 | * [Linux/Unix Commands](https://www.udacity.com/wiki/hpca/reviewLinuxCommands)
47 | * [C++](https://www.udacity.com/wiki/hpca/reviewC++)
48 | * [Glossary](https://www.udacity.com/wiki/hpca/glossary/Glossary)
49 |
50 | ## External Lectures and Materials
51 |
52 | These are some links I personally found helpful to explain some of the course concepts.
53 |
54 | * Branch Handling and Branch Prediction - CMU Computer Architecture (Prof. Onur Mutlu)
55 | * [2014 Video](https://www.youtube.com/watch?v=06OAhsPL-1k), [2013 Video](https://www.youtube.com/watch?v=XkerLktFtJg), [Slides](http://course.ece.cmu.edu/~ece447/s13/lib/exe/fetch.php?media=onur-447-spring13-lecture11-branch-prediction-afterlecture.pdf)
56 |
57 |
58 |
59 |
--------------------------------------------------------------------------------
/fault-tolerance.md:
--------------------------------------------------------------------------------
1 | ---
2 | id: fault-tolerance
3 | title: Fault Tolerance
4 | sidebar_label: Fault Tolerance
5 | ---
6 |
7 | [🔗Lecture on Udacity (1.5hr)](https://classroom.udacity.com/courses/ud007/lessons/872590122/concepts/last-viewed)
8 |
9 | ## Dependability
10 | Quality of delivered service that justifies relying on the system to provide that service.
11 | - Specified Service = what behavior should be
12 | - Delivered Service = actual behavior
13 |
14 | System has components (modules)
15 | - Each module has an ideal specified behavior
16 |
17 | ## Faults, Errors, and Failures (+ Example)
18 | * **Fault** - module deviates from specified behavior
19 | * Example: Programming mistake
20 | * Add function that works fine, except 5+3 = 7
21 | * Latent Error (only a matter of time until activated)
22 | * **Error** - actual behavior within system differs from specified behavior
23 | * Example: (Activated Fault, Effective Error)
24 | * We call add() with 5 and 3, get 7, and put it in some variable
25 | * **Failure** - System Deviates from specified behavior
26 | * Example: 5+3=7 -> Schedule a meeting for 7am instead of 8am
27 |
28 | An error starts with a fault, but a fault may not necessarily become an error. A failure starts with an error, but an error may not necessarily result in a failure. Example: `if(add(5, 3) > 0)` activates the fault (we get 7 instead of 8) and produces an error, but not a failure, since the end result (the branch taken) is not affected.
29 |
30 | ## Reliability, Availability
31 | System is in one of two states:
32 | * Service Accomplishment
33 | * Service Interruption
34 |
35 | Reliability:
36 | * Measure continuous Service Accomplishment
37 | * Mean Time to Failure (MTTF)
38 |
39 | Availability:
40 | * Service Accomplishment as a fraction of overall time
41 | * Need to know: Mean Time to Repair (MTTR)
42 | * Availability = \\(\frac{MTTF}{MTTF+MTTR}\\)
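
As a quick worked example (hypothetical numbers): with MTTF = 999 hours and MTTR = 1 hour, Availability = \\(\frac{999}{999+1} = 99.9\%\\).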
43 |
44 | ## Kinds of Faults
45 | By Cause:
46 | * HW Faults - HW fails to perform as designed
47 | * Design Faults - SW bugs, HW design mistakes (FDIV bug)
48 | * Operation Faults - Operator, user mistakes
49 | * Environmental Faults - Fire, power failures, sabotage, etc.
50 |
51 | By Duration:
52 | * Permanent - Once we have it, it doesn't get corrected (wanted to see what's inside processor, and now it's in 4 pieces)
53 | * Intermittent - Last for a while, but recurring (overclock - temporary crashes)
54 | * Transient - Something causes things to stop working correctly, but it fixes itself eventually
55 |
56 | ## Improving Reliability and Availability
57 | * Fault Avoidance
58 | * Prevent Faults from occurring at all
59 | * Example: No coffee in server room
60 | * Fault Tolerance
61 | * Prevent Faults from becoming Failures
62 | * Example: Redundancy, e.g. ECC for memory
63 | * Speed up Repair Process (availability only)
64 | * Example: Keep a spare hard disk in drawer
65 |
66 | ## Fault Tolerance Techniques
67 | * Checkpointing (Recover from error)
68 | * Save state periodically
69 | * If we detect errors, restore the saved state
70 | * Works well for many transient and intermittent failures
71 | * If this takes too long, it has to be treated like a service interruption
72 | * 2-Way Redundancy (Detect error)
73 | * Two modules do the same work and compare results
74 | * Roll back if results are different
75 | * 3-Way Redundancy (Detect and recover from error)
76 | * 3 modules (or more) do the same work and vote on correctness
77 | * Fault in one module does not become an error overall.
78 | * Expensive - 3x the hardware required, but can tolerate *any* fault in one module
79 |
80 | ## N-Module Redundancy
81 | * N=2 - Dual-Module Redundancy
82 | * Detect but not correct one faulty module
83 | * N=3 - Triple-Module Redundancy
84 | * Correct one faulty module
85 | * N=5 - (example: space shuttle)
86 | * 5 computers perform operation and vote
87 | * 1 Wrong Result \\(\Rightarrow\\) normal operation
88 | * 2 Wrong Results \\(\Rightarrow\\) abort mission
89 | * Still no failure from this: 3 outvote the 2
90 | * 3 Wrong Results: failure can be catastrophic (too many broken modules)
91 | * Abort with 2 failures so that this state should never be reached
92 |
93 | ## Fault Tolerance for Memory and Storage
94 | * Dual/Triple Module Redundancy - Overkill (typically better for computation)
95 | * Error Detection, Correction Codes
96 | * Parity: One extra bit (XOR of all data bits)
97 | * Fault flips one bit \\(\Rightarrow\\) Parity does not match data
98 | * ECC: example - SECDED (Single Error Correction, Double Error Detection)
99 |     * Can correct any single-bit flip, and can detect (but not correct) any double-bit flip
100 | * Example: ECC DRAM modules
101 | * Disks can use even fancier codes (e.g. Reed-Solomon)
102 | * Detect and correct multiple-bit errors (especially streaks of flipped bits)
103 | * RAID (for hard disks)
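
A tiny sketch of the parity idea from the list above (illustrative only): the parity bit is the XOR of all data bits, so any single flipped bit makes the recomputed parity disagree with the stored one.

```cpp
#include <cstdint>

// Parity bit = XOR of all data bits (here, over a single 64-bit word).
bool parity(uint64_t data) {
    bool p = false;
    for (int i = 0; i < 64; ++i) p ^= (data >> i) & 1u;
    return p;
}

int main() {
    uint64_t data = 0x123456789ABCDEF0;
    bool stored = parity(data);   // stored alongside the data

    data ^= (1ull << 17);         // a fault flips a single bit

    // The recomputed parity no longer matches, so the error is detected
    // (parity alone cannot tell which bit flipped, so it cannot correct it).
    bool detected = (parity(data) != stored);
    return detected ? 0 : 1;
}
```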
104 |
105 | ## RAID - Redundant Array of Independent Disks
106 | * Several disks playing the role of one disk (can be larger and/or more reliable than the single disk)
107 | * Each disk detects errors using codes
108 | * We know which disk has the error
109 | * RAID should have:
110 | * Better performance
111 | * Normal Read/Write accomplishment even when:
112 | * It has a bad sector
113 | * Entire disk fails
114 | * RAID 0, 1, etc...
115 |
116 | ### RAID 0: Striping (to improve performance)
117 | Disks can only read one track at a time, since the head can be in only one position. RAID 0 takes two disks and "stripes" the data across each disk such that consecutive tracks can be accessed simultaneously with the head in a single position. This results in up to 2x the data throughput and reduced queuing delay.
118 |
119 | However, reliability is worse than a single disk:
120 | * \\(f\\) = failure rate for a single disk
121 | * failures/disk/second
122 | * Single-Disk MTTF = \\(\frac{1}{f}\\) (MTTDL: Mean time to data loss = \\(MTTF_1\\))
123 | * N disks in RAID0
124 | * \\(f_N = N*f_1 \Rightarrow MTTF_N = MTTDL_N = \frac{MTTF_1}{N}\\)
125 | * 2 Disks \\( \Rightarrow MTTF_2 = \frac{MTTF_1}{2} \\)
126 |
127 | ### RAID 1: Mirroring (to improve reliability)
128 | Same data on both disks
129 | * Write: Write to each disk
130 | * Same performance as 1 disk alone
131 | * Read: Read any one disk
132 | * 2x throughput of one disk alone
133 | * Can tolerate any faults that affect one disk
134 | * Two copies -> can we only detect an error, not correct it?
135 |   * Not in this case: the ECC on each sector tells us which copy has the fault, so the good copy can be used to correct it
136 |
137 | Reliability:
138 | * \\(f\\) = failure rate for a single disk
139 | * failures/disk/second
140 | * Single-Disk MTTF = \\(\frac{1}{f}\\) (MTTDL: Mean time to data loss = \\(MTTF_1\\))
141 | * 2 disks in RAID1
142 | * \\(f_N = N*f_1 \Rightarrow\\)
143 | * both disks OK until \\(\frac{MTTF_1}{2}\\)
144 | * remaining disk lives on for \\(MTTF_1\\) time
145 |   * \\(MTTDL_{RAID1-2} = \frac{MTTF_1}{2}+MTTF_1\\) (Assumes no disk replaced)
146 | * But we do replace failed disks!
147 | * Both disks ok until \\(\frac{MTTF_1}{2}\\)
148 | * Disk fails, have one OK disk for \\(MTTR_1\\)
149 | * Both disks ok again until \\(\frac{MTTF_1}{2}\\)
150 | * So, overall MTTDL :
151 | * (when \\(MTTR_1 \ll MTTF_1\\), probability of second disk failing during MTTR = \\(\frac{MTTR_1}{MTTF_1}\\))
152 | * \\(MTTDL_{RAID1-2} = \frac{MTTF_1}{2} * \frac{MTTF_1}{MTTR_1}\\) (second factor is 1/probability)
153 |
154 | ### RAID 4: Block-Interleaved Parity
155 | * N disks
156 | * N-1 contain data, striped like RAID 0
157 | * 1 disk has parity blocks
158 |
159 | | Disk 0 | | Disk 1 | | Disk 2 | | Disk 3 |
160 | |----------|---|-----------|---|-----------|---|-------------|
161 | | Stripe 0 | ⊕ | Stripe 1 | ⊕ | Stripe 2 | = | Parity 0,1,2 |
162 | | Stripe 3 | ⊕ | Stripe 4 | ⊕ | Stripe 5 | = | Parity 3,4,5 |
163 |
164 | Data from each stripe is XOR'd together to result in parity information on disk 3. If any one of the disks fail, the data can then be reconstructed by XOR-ing all the other disks, including parity.
165 |
166 | * Write: read and write the data disk, and read and write the parity disk (4 accesses; see below)
167 | * Read: Read 1 disk
168 |
169 | Performance and Reliability:
170 | * Reads: throughput of N-1 disks
171 | * Writes: 1/2 throughput of single disk (primary reason for RAID 5)
172 | * MTTF:
173 | * All disks ok for \\(\frac{MTTF_1}{N}\\)
174 | * If no repair, we are left with an (N-1)-disk RAID 0, which loses data after another \\(\frac{MTTF_1}{N-1}\\) on average \\(\rightarrow\\) bad idea
175 | * With repair \\(\Rightarrow\\) the chance of a second disk failing during the repair window is \\(\frac{(N-1)\*MTTR_1}{MTTF_1}\\), so multiply by the inverse of this factor, \\(\frac{MTTF_1}{(N-1)\*MTTR_1}\\)
176 | * \\(MTTF_{RAID4} = \frac{MTTF_1 \* MTTF_1}{N\*(N-1)\*MTTR_1}\\)
177 |
178 | Writes: [🎥 View Lecture Video (2:19)](https://www.youtube.com/watch?v=3QXaSzM2fE8)
179 | - If we compute the XOR of the old vs. new data we are writing, we get exactly the bit flips this write makes to the data. XOR-ing that against the old parity block applies the same flips to the parity, giving the new parity (see the sketch below).
180 | - Thus, the parity disk is a bottleneck for writes (a write to any data disk must also update it) \\(\Rightarrow\\) RAID 5
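
A minimal sketch of that parity update for a small write (names are illustrative; the four disk accesses are: read old data, read old parity, write new data, write new parity):

```cpp
#include <cstdint>

// RAID 4/5 small write: compute the new parity without reading the other
// data disks, using only the old data block and the old parity block.
uint8_t updated_parity(uint8_t old_data, uint8_t new_data, uint8_t old_parity) {
    uint8_t flipped_bits = old_data ^ new_data; // which bits this write changes
    return old_parity ^ flipped_bits;           // apply the same flips to the parity
}
```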
181 |
182 | ### RAID 5: Distributed Block-Interleaved Parity
183 | * Like RAID 4, but parity is spread among all disks:
184 | | Disk 0 | | Disk 1 | | Disk 2 | | Disk 3 |
185 | |----------|---|-----------|---|-----------|---|-------------|
186 | | Stripe 0 | ⊕ | Stripe 1 | ⊕ | Stripe 2 | = | Parity 0,1,2 |
187 | | Parity 3,4,5 | = | Stripe 3 | ⊕ | Stripe 4 | ⊕ | Stripe 5 |
188 | | Stripe 6 | | Parity 6,7,8 | | Stripe 7 | | Stripe 8 |
189 | | Stripe 9 | | Stripe 10 | | Parity 9,10,11 | | Stripe 11|
190 | | Stripe 12 | ⊕ | Stripe 13 | ⊕ | Stripe 14 | = | Parity 12,13,14 |
191 | * Read Performance: N * throughput of 1 disk
192 | * Write Performance: 4 accesses/write, but distributed: N/4 * throughput of 1
193 | * Reliability: same as RAID4 - if we lose any one disk, we have a problem
194 | * Capacity: same as RAID4 (still sacrifice one disk worth of parity over the array)
195 |
196 | Key takeaway: We should always choose RAID5 over RAID4, since we lose nothing with reliability or capacity, and gain in throughput, all without additional hardware requirements.
197 |
198 | ### RAID 6?
199 | * Two "parity" blocks per group
200 | * Can work when 2 failed stripes/group
201 | * One parity block
202 | * Second is a different type of check-block
203 | * When 1 disk fails, use parity
204 | * When 2 disks fail, solve equations to retrieve data
205 | * RAID 5 vs. RAID 6
206 | * 2x overhead
207 | * More write overhead (6/WR vs 4/WR)
208 | * Only helps reliability when disk fails, then another fails before we replace the first one (low probability, thousands of years MTTDL)
209 |
210 | Is RAID 6 overkill?
211 | * RAID5: disk fails, 3 days to replace
212 | * Very low probability of another failing in those 3 days (assuming independent failure)
213 | * Failures can be related!
214 | * RAID5, 5 disks, 1 disk fails (#2)
215 | * System says "Replace disk #2"
216 | * Operator gets replacement disk and inserts it into spot 2
217 | * But... numbering was 0,1,2,3,4! Operator pulled the wrong disk.
218 | * Now we have two failed disks (one hard failure, one operator error)
219 | * RAID 6 would have survived this scenario, since it tolerates two failed disks at once (one hard failure plus one operator error).
220 |
221 | The above scenario sounds silly, but with very long MTTF values, replacing a RAID disk becomes a rare activity, and is prone to operator error. A real-life personal scenario I (author of these notes) encountered was not properly grounding myself when replacing a disk in a hot-swappable RAID5. When inserting it into the housing, my theory is that I discharged static electricity into the disk below the one I was replacing, and the array failed to rebuild due to an error on that disk. Thankfully we were able to move the platters from that drive into a new one and the array rebuilt ok, avoiding the need to use week-old backups. So, RAID6 is indeed a bit overkill, but could have prevented this scenario, which resulted in lots of system downtime.
222 |
223 |
224 | *[MTTDL]: Mean Time to Data Loss
225 | *[MTTF]: Mean Time to Failure
226 | *[MTTR]: Mean Time to Repair
--------------------------------------------------------------------------------
/ilp.md:
--------------------------------------------------------------------------------
1 | ---
2 | id: ilp
3 | title: ILP
4 | sidebar_label: ILP
5 | ---
6 |
7 | [🔗Lecture on Udacity (54 min)](https://classroom.udacity.com/courses/ud007/lessons/3615429333/concepts/last-viewed)
8 |
9 | ## ILP All in the Same Cycle
10 | Ideal scenario is all instructions executing during the same cycle. This may work sometimes for some instructions, but typically there will be dependencies that prevent this, as below.
11 |
12 | 
13 |
14 | ## The Execute Stage
15 |
16 | Can forwarding help? In the previous example, Inst 1 would be able to forward the result to Instruction 2 in the next cycle, but not during the same cycle. Instruction 2 would need to be stalled until the next cycle. But, if Instructions 3-5 do not have dependencies, they can continue executing during the current cycle.
17 |
18 | ## RAW Dependencies
19 | Even the ideal processor that can execute any number of instructions per cycle still has to obey RAW dependencies, which creates delays and affects overall CPI. So, ideal CPI can never be 0. For example, for instructions 1, 2, 3, 4, and 5, where there is a dependency between 1-2, 3-4, and 4-5, it would take 3 cycles (cycle 1 executes 1 and 3, cycle 2 executes 2 and 4, cycle 3 executes 5), for a total CPI of 3/5 = 0.60. If every instruction had a dependency with the next, you can't do anything in parallel and the minimum CPI is 1.
20 |
21 | ## WAW Dependencies
22 |
23 | In the example below, the second instruction has a data dependency on the first and is delayed by one cycle. The remaining instructions have no dependencies and could all execute in the first cycle. However, the last instruction also writes R4: if it writes R4 in the first cycle, the delayed instruction 2 will overwrite it a cycle later, leaving R4 with a final value that does not match program order. The processor therefore needs a way to detect this write-after-write dependency and delay the last instruction's write enough cycles to keep the final value correct.
24 |
25 | 
26 |
27 | ## Removing False Dependencies
28 |
29 | RAW dependencies are "true" dependencies - one instruction truly depends on data from another and the dependency must be obeyed. WAR and WAW dependencies are "false" (name) dependencies. They are named this because there is nothing fundamental about them - they are the result of using the same register to store results. If the second instruction used a different register value, there would be no dependency.
30 |
31 | ## Duplicating Register Values
32 |
33 | In the case of a false dependency, you could simply duplicate registers by using separate versions of them. In the below example, you can store two versions of R4 - one on the 2nd instruction, and another on the 4th, and we remember both. The dependency can be resolved when the future instruction that uses R4 can "search" through those past versions and select the most recent.
34 |
35 | 
36 |
37 | Likewise, instruction 3 must also search among the "versions" of R4 produced by instructions 2 and 4 and determine that the version it needs is from instruction 2. This is possible and correct, but keeping multiple versions is very complicated.
38 |
39 | ## Register Renaming
40 |
41 | Register renaming separates registers into two types:
42 | - Architectural = registers that programmer/compiler use
43 | - Physical = all places value can actually go within the processor
44 |
45 | As the processor fetches and decodes instructions, it "rewrites" the program to use physical registers. This requires a table called the Register Allocation Table (RAT). This table says which physical register has a value for which architectural register.
46 |
47 | ### RAT Example
48 | 
49 |
50 | ## False Dependencies After Renaming?
51 |
52 | In the below example, you can see the true dependencies in purple, and the output/anti dependencies in green. In our renamed program, only the true dependencies remain. This also results in a much lower CPI.
53 |
54 | 
55 |
56 | ## Instruction Level Parallelism (ILP)
57 |
58 | ILP is the IPC when:
59 | - Processor does entire instruction in 1 cycle
60 | - Processor can do any number of instructions in the same cycle
61 | - Has to obey True Dependencies
62 |
63 | So, ILP is really what an ideal processor can do with a program, subject only to obeying true dependencies. ILP is a property of a ***program***, not of a processor.
64 |
65 | Steps to get ILP:
66 | 1. Rename Registers - use RAT
67 | 2. "Execute" - ensure no false dependencies, determine when instructions are executed
68 | 3. Calculate ILP = (\# instructions)/(\# cycles)
69 |
70 | ### ILP Example
71 |
72 | Tips:
73 | 1. You don't have to first do renaming, just pay attention to true dependencies, and trust renaming to handle false dependencies.
74 | 2. Be mindful to count how many cycles you're computing over
75 | 3. Make sure you're dividing the right direction (instructions/cycles)
76 |
77 | 
78 |
79 | ## ILP with Structural and Control Dependencies
80 |
81 | When computing ILP we only consider true dependencies, not false dependencies. But what about structural and control dependencies?
82 |
83 | When considering ILP, there are no structural dependencies. Those are caused by lack of hardware parallelism. ILP assumes ideal hardware - any instructions that can possibly compute in one cycle will do so.
84 |
85 | For control dependencies, we assume perfect same-cycle branch prediction (even predicted before it is executed). For example, below we see that the branch is predicted at the point of program load, such that the label instruction will execute in the first cycle.
86 |
87 | | | 1 | 2 | 3 |
88 | |------------------|:---:|:---:|:---:|
89 | | `ADD R1, R2, R3` | x | | |
90 | | `MUL R1, R1, R1` | | x | |
91 | | `BNE R5, R1, Label` | | | x |
92 | | ... | | | |
93 | | `Label:` `MUL R5, R7, R8` | x | | |
94 |
95 | ## ILP vs IPC
96 |
97 | ILP is not equal to IPC except on a perfect/ideal out-of-order processor. So IPC should be computed based on the properties of the processor that it is run on, as seen below (note: consider the IPC was calculated ignoring the "issue" property).
98 |
99 | 
100 |
101 | Therefore, we can state ILP \\(\geq\\) IPC, as ILP is calculated using no processor limitations.
102 |
103 | ### ILP and IPC Discussion
104 |
105 | The following are considerations when thinking about effect of processor issue width and order to maximize IPC.
106 |
107 | 
108 |
109 |
110 | *[ALU]: Arithmetic Logic Unit
111 | *[CPI]: Cycles Per Instruction
112 | *[ILP]: Instruction-Level Parallelism
113 | *[IPC]: Instructions per Cycle
114 | *[PC]: Program Counter
115 | *[RAT]: Register Allocation Table
116 | *[RAW]: Read-After-Write
117 | *[WAR]: Write-After-Read
118 | *[WAW]: Write-After-Write
119 | *[RAR]: Read-After-Read
--------------------------------------------------------------------------------
/instruction-scheduling.md:
--------------------------------------------------------------------------------
1 | ---
2 | id: instruction-scheduling
3 | title: Instruction Scheduling
4 | sidebar_label: Instruction Scheduling
5 | ---
6 |
7 | [🔗Lecture on Udacity (1.5 hr)](https://classroom.udacity.com/courses/ud007/lessons/3643658790/concepts/last-viewed)
8 |
9 | ## Improving IPC
10 | * ILP can be good, >> 1.
11 |
12 | * To get IPC \\(\approx\\) ILP, need to handle control dependencies (=> Branch Prediction).
13 |
14 | * Need to eliminate WAR and WAW data dependencies (=> Register Renaming).
15 |
16 | * Handle RAW data dependencies (=> Out of Order Execution)
17 |
18 | * Structural Dependencies (=> Invest in Wider Issue)
19 |
20 | ## Tomasulo's Algorithm
21 |
22 | Originally used in the IBM 360/91, Tomasulo's algorithm determines which instructions have their inputs ready and which do not. It includes a form of register renaming, and is very similar to what is used today.
23 |
24 | | Original Tomasulo | What is used today |
25 | | --- | --- |
26 | | Floating Point instructions | All instruction types |
27 | | Fewer instructions in "window" | Hundreds of instructions considered |
28 | | No exception handling | Explicit exception handling support |
29 |
30 | ### The Picture
31 |
32 | 
33 |
34 | Instructions come from the Fetch unit in order and put into Instruction Queue (IQ). The next available instruction from the IQ is put into a Reservation Station (RS), where they sit until parameters become ready. There is a floating point register file (REGS), and these registers are put into the RS when they are ready.
35 |
36 | When an instruction is ready, it goes into the execution unit. Once the result is available it is broadcast onto a bus. This bus goes to the register file and also the RS (so they will be available when needed). Notice multiple inputs from this bus on the RS - this is because an instruction may need the value to be latched into both operands.
37 |
38 | Finally, if the instruction deals with memory (load/store), it will go to the address generation unit (ADDR) for the address to be computed. It will then go to the Load Buffer (LB, provides just the address A) or Store Buffer (SB, provides both the data D and the address A). When it comes back from memory (MEM), its value is also broadcast on the bus, which also feeds into the SB (so stores can get their values when available).
39 |
40 | Some terminology: sending an instruction from the IQ into a reservation station (or into the load/store path) is called the "Issue". Sending an instruction from the RS to its execution unit is called the "Dispatch". And when the instruction is ready to broadcast its result, that step is called the "Write Result" or "Broadcast".
41 |
42 | ## Issue
43 |
44 | 1. Take next (in program order) instruction from IQ
45 | 2. Determine where inputs come from (probably a RAT here)
46 | - register file
47 | - another instruction
48 | 3. Get free/available RS (of the correct kind)
49 | - If all RS are busy, we simply don't issue anything this cycle.
50 | 4. Put instruction in RS
51 | 5. Tag destination register of the instruction
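
A minimal sketch of the structures this bookkeeping implies (field names are illustrative and not from the lecture; TA Nolan's pseudocode further below walks through the full algorithm):

```cpp
#include <array>
#include <optional>

// An RS operand is either an actual value or a tag naming the RS that will produce it.
struct Operand {
    bool ready;   // true: 'value' holds the operand; false: waiting on 'tag'
    double value;
    int tag;      // RS number we are waiting to capture from the broadcast bus
};

// One reservation station entry.
struct RSEntry {
    bool busy = false; // free/available?
    int opcode;
    Operand src1, src2;
};

// RAT: for each architectural register, either empty (value is in the register
// file) or the RS number that will produce the register's latest value.
using RAT = std::array<std::optional<int>, 32>;
```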
52 |
53 | ### Issue Example
54 |
55 | [🎥 Link to Video](https://www.youtube.com/watch?v=I2qMY0XvYHA) (best seen and not read)
56 |
57 | 
58 |
59 | ## Dispatch
60 |
61 | [🎥 Link to Video](https://www.youtube.com/watch?v=bEB7sZTP8zc)
62 |
63 | As the operand values an instruction is waiting on (identified by their RS tags) are broadcast on the result bus, the RS captures them; once all operands are available, the instruction is moved from the RS to the actual execution unit (e.g. ADD, MUL) to be executed.
64 |
65 | ### Dispatch - >1 ready
66 |
67 | What happens if more than one instruction in the RS is ready to execute at the same time? How should Dispatch choose which to execute?
68 | - Oldest first (easy to do, medium performance)
69 | - Most dependencies first (hard to do, best performance)
70 | - Random (easy to do, worst performance)
71 |
72 | 
73 |
74 | All of these are fine with respect to correctness because of the RAT, but since we cannot know the future, we cannot really tell which instruction has the most dependents, even though dispatching it first would free up the most waiting instructions and give the best performance. Oldest-first is the usual compromise between performance and ease of implementation.
75 |
76 | ## Write Result (Broadcast)
77 |
78 | Once the result is ready, it is put on the bus (with its RS tag). It is then used to update the register file and RAT, and then the RS is freed for the next instruction to use.
79 |
80 | 
81 |
82 | ### More than 1 Broadcast
83 |
84 | What if in the previous example, the ADD and MUL complete simultaneously - which one is broadcast first?
85 |
86 | Possibility 1: a separate bus for each unit. This allows both to broadcast in the same cycle, but requires twice the comparators (every waiting operand must now compare its tag against both buses to select the correct result).
87 |
88 | Possibility 2: Select a higher priority unit based on some heuristic. For example, if one unit is slower, give it higher priority on the WR bus since it's likely the instruction has been waiting longer.
89 |
90 | ### Broadcast "Stale" Result
91 |
92 | Consider a situation in which an instruction is ready to broadcast from RS4, but RS4 is nowhere in the RAT (perhaps overwritten by a new instruction that was both using and writing to the same register, now the RAT will reflect the new RS tag). What will happen during broadcast?
93 |
94 | When updating the RS, broadcast works normally: any operand still waiting on RS4 captures the value. In the RAT and register file we do nothing - since no RAT entry points to RS4 anymore, the value will never be needed by future instructions, only by ones already waiting in the RS.
95 |
96 | ## Tomasulo's Algorithm - Review
97 |
98 | 
99 |
100 | The key is that for each instruction it will follow the steps (issue, capture, dispatch, WR). Also each cycle, some instruction will be issued, another instruction will be captured, another dispatched, another broadcasting.
101 |
102 | Because all of these things happen every cycle, we need to consider some things:
103 | 1. Can we do same-cycle issue->dispatch? No - on an issue we are writing to the RS, and that instruction is not yet recognizable as a dispatchable instruction.
104 | 2. Can we do same-cycle capture->dispatch? Typically No - on a capture the RS updates its status from "operands missing" to "operands available". But this is technically possible with more hardware.
105 | 3. Can we update RAT entry for both Issue and WR on same cycle? Yes - we just need to ensure that the one being issued is the end result of the RAT. The WR is only trying to point other instructions to the right register, which will also be updated this cycle.
106 |
107 | ## Load and Store Instructions
108 |
109 | As we had dependencies with registers, we have dependencies through memory that must be obeyed or eliminated.
110 | - RAW: SW to address, then LW from address
111 | - WAR: LW then SW
112 | - WAW: SW, SW to same address
113 |
114 | What do we do in Tomasulo's Algorithm?
115 | - Do loads and stores in-order
116 | - Identify dependencies, reorder, etc. - more complicated than with registers, so Tomasulo chose not to do this.
117 |
118 | ## Tomasulo's Algorithm - Long Example
119 | _These examples are best viewed as videos, so links are below..._
120 |
121 | 1. [🎥 Introduction](https://www.youtube.com/watch?v=2M5NQFAaILk)
122 | 2. [🎥 Cycles 1-2](https://www.youtube.com/watch?v=GC8Cp-M0o6Q)
123 | 3. [🎥 Cycles 3-4](https://www.youtube.com/watch?v=G0Kap6eq_Ys)
124 | 4. [🎥 Cycles 5-6](https://www.youtube.com/watch?v=I1VOoFhrnio)
125 | 5. [🎥 Cycles 7-9](https://www.youtube.com/watch?v=wmrPTJpUnV4)
126 | 6. [🎥 Cycles 10-end](https://www.youtube.com/watch?v=ZQ6Tdrs16_U)
127 | 7. [🎥 Timing Example](https://www.youtube.com/watch?v=ZqbhHjFSBoI)
128 |
129 | ## Additional Resources
130 |
131 | From TA Nolan, here is a list of things that can prevent a CPU from moving forward with an instruction.
132 | ```
133 | Issue:
134 | Instructions must be issued in order.
135 | Only a certain number of instructions can be issued in one cycle.
136 | An RS entry of the right type must be available.
137 | An ROB entry must be available.
138 |
139 | Dispatch:
140 | The RS must have actual values for each operand.
141 | An ALU or processing unit must be available.
142 | Only a certain number of instructions from each RS can be dispatched in one cycle.
143 |
144 | Execution:
145 | No limitations.
146 |
147 | Broadcast:
148 | Only a certain number of instructions may broadcast in the same cycle.
149 |
150 | Commit:
151 | Instructions must be committed in order.
152 | Only a certain number of instructions can be committed in one cycle.
153 | ```
154 |
155 | Also from TA Nolan, an attempt to document how Tomasulo works
156 |
157 | ```
158 | While there is an instruction to issue
159 | If there is an empty appropriate RS entry
160 | Put opcode into RS entry.
161 | For each operand
162 | If there is an RS number in the RAT
163 | Put the RS number into the RS as an operand.
164 | else
165 | Put the register value into the RS as an operand.
166 | Put RS number into RAT entry for the destination register.
167 | Take the instruction out of the instruction window.
168 |
169 | For each RS
170 | If RS has instruction with actual values for operands
171 | If the appropriate ALU or processing unit is free
172 | Dispatch the instruction, including operands and the RS number.
173 |
174 | For each ALU
175 | If the instruction is complete
176 | If a RAT entry has #
177 | Put value in corresponding register.
178 | Erase RAT entry.
179 | For each RS waiting for it
180 | Put the result into the RS.
181 | Free the ALU.
182 | Free the RS.
183 | ```
184 |
185 | Finally from TA Nolan, a worksheet to keep track of necessary information when approaching a problem with IS:
186 | ```
187 | Same cycle:
188 | free RS & use RS (?)
189 | issue & dispatch (no)
190 | capture & dispatch (no)
191 | execute & broadcast (no)
192 | reuse ROB (no)
193 |
194 | # / cycle:
195 | issue (1)
196 | broadcast (1)
197 |
198 | Dispatch priority (oldest, random)
199 |
200 | # of universal RS's:
201 |
202 | ALU's:
203 | operation # RS's # ALU's Pipelined?
204 |
205 | Exe time:
206 | operation cycles
207 |
208 | broadcast priority (slower ALU)
209 |
210 | # of ROB's:
211 | ```
212 |
213 | *[ALU]: Arithmetic Logic Unit
214 | *[CPI]: Cycles Per Instruction
215 | *[ILP]: Instruction-Level Parallelism
216 | *[IPC]: Instructions per Cycle
217 | *[IQ]: Instruction Queue
218 | *[LB]: Load Buffer
219 | *[LW]: Load Word
220 | *[PC]: Program Counter
221 | *[RAT]: Register Allocation Table (Register Alias Table)
222 | *[RAW]: Read-After-Write
223 | *[SB]: Store Buffer
224 | *[SW]: Store Word
225 | *[WAR]: Write-After-Read
226 | *[WAW]: Write-After-Write
227 | *[RAR]: Read-After-Read
--------------------------------------------------------------------------------
/introduction.md:
--------------------------------------------------------------------------------
1 | ---
2 | id: introduction
3 | title: Introduction
4 | sidebar_label: Introduction
5 | ---
6 |
7 | [🔗Lecture on Udacity (39 min)](https://classroom.udacity.com/courses/ud007/lessons/3627649022/concepts/last-viewed)
8 |
9 | ## Computer Architecture
10 |
11 | ### What is it?
12 |
13 | Architecture - design a *building* that is well-suited for its purpose
14 | Computer Architecture - design a *computer* that is well-suited for its purpose.
15 |
16 | ### Why do we need it?
17 |
18 | 1. Improve Performance (e.g. speed, battery life, size, weight, energy efficiency, ...)
19 | 2. Improve Abilities (3D graphics, debugging support, security, ...)
20 |
21 | Takeaway: Computer Architecture takes available hardware (fabrication, circuit designs) to create faster, lighter, cheaper, etc. computers.
22 |
23 | ## Technology Trends
24 |
25 | If you design given current technology/parts, you get an obsolete computer by the time the design is complete. Must take into account technological trends and anticipate future technology
26 |
27 | ### Moore's Law
28 |
29 | Used to predict technology trends. Every 18-24 months, you get 2x transistors for same chip area:
30 | \\( \Rightarrow \\) Processor speed doubles
31 | \\( \Rightarrow \\) Energy/Operation cut in half
32 | \\( \Rightarrow \\) Memory capacity doubles
33 |
34 | ### Memory Wall
35 |
36 | Consequence of Moore's Law. IPS and Capacity double every 2 years. If Latency only improves 1.1x every two years, there is a gap between latency and speed/capacity, called the Memory Wall. Caches are used to fill in this gap.
37 |
38 | 
39 |
40 | ## Processor Speed, Cost, Power
41 |
42 | | | Speed | Cost | Power |
43 | | ---| ----- | ---- | ---- |
44 | | | 2x | 1x | 1x |
45 | | OR | 1.1x | .5x | .5x |
46 |
47 | Improvements may differ, it's not always speed. Which one of the above do you really want? It depends.
48 |
49 | ## Power Consumption
50 |
51 | Two kinds of power a processor consumes.
52 | 1. Dynamic Power - consumed by activity in a circuit
53 | 2. Static Power - consumed when powered on, but idle
54 |
55 | ### Active Power
56 | Can be computed by
57 |
58 | $$P = \tfrac 12 \*C\*V^2\*f\*\alpha$$
59 | Where
60 | $$\\\\C = \text{capacitance}\\\\V = \text{power supply voltage}\\\\f = \text{frequency} \\\\ \alpha = \text{activity factor (some percentage of transistors active each clock cycle)}$$
61 |
62 | For example, if we cut size/capacitance in half but put two cores on a chip, the active power is relatively unchanged. To get power improvements, you really need to lower the voltage.
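
A quick numeric illustration of that point (all component values below are made up for the example):

```cpp
#include <cstdio>

// Dynamic power: P = 1/2 * C * V^2 * f * alpha
double dynamic_power(double c, double v, double f, double alpha) {
    return 0.5 * c * v * v * f * alpha;
}

int main() {
    double base     = dynamic_power(1e-9, 1.2, 2e9, 0.2);       // one core
    double two_half = 2 * dynamic_power(0.5e-9, 1.2, 2e9, 0.2); // two half-size cores
    double low_v    = 2 * dynamic_power(0.5e-9, 0.9, 2e9, 0.2); // same, but lower voltage
    printf("%.2f W  %.2f W  %.2f W\n", base, two_half, low_v);  // 0.29  0.29  0.16
    return 0;
}
```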
63 |
64 | ### Static Power
65 |
66 | The power it takes to maintain the circuits when not in use.
67 | 
68 | Decreasing the voltage reduces the electrical "pressure" on the transistors, which causes more leakage and thus more static power. Increasing the voltage, however, means more active power. So there is a voltage at which total power (static + dynamic) is minimized.
69 |
70 | ## Fabrication Cost and Yield
71 |
72 | Circuits are printed onto a silicon wafer, which is divided up into small chips that are then packaged. Each packaged chip is tested and either discarded or shipped/sold. Larger chips are more likely to contain a defect (they cover more wafer area), so fewer of them work and each working chip costs more.
73 |
74 | $$ \text{fabrication yield} = \tfrac{\text{working chips}}{\text{chips on wafer}} $$
75 |
76 | 
77 |
78 | Fabrication cost example: If a wafer costs $5000 and has approximately 10 defects, how much does each chip cost in the following example? Remember \\( \text{chip cost} = \tfrac{\text{wafer cost}}{\text{\# working chips per wafer}} \\) (the "Yield" column below is the number of working chips per wafer).
79 |
80 | | Size | Chips/Wafer | Yield | Cost |
81 | |------|-------------|-------|--------|
82 | |Small | 400 | 390 | $12.80 |
83 | |Large | 96 | 86 | $58.20 |
84 | | Huge | 20 | 11 | $454.55|
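
A minimal sketch reproducing the table's arithmetic (results match the table up to rounding):

```cpp
#include <cstdio>

int main() {
    double wafer_cost = 5000.0;
    // {name, chips per wafer, working chips per wafer} from the table above
    struct Row { const char* name; int total; int working; };
    Row rows[] = {{"Small", 400, 390}, {"Large", 96, 86}, {"Huge", 20, 11}};

    for (const Row& r : rows) {
        double yield = (double)r.working / r.total; // fraction of chips that work
        double cost = wafer_cost / r.working;       // cost per *working* chip
        printf("%-5s yield=%4.1f%%  cost=$%.2f\n", r.name, 100 * yield, cost);
    }
    return 0;
}
```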
85 |
86 | Benefit from Moore's Law in two ways:
87 | * Smaller \\( \Rightarrow \\) much lower cost
88 | * Same Area \\( \Rightarrow \\) faster for same cost
89 |
90 | *[IPS]: Instructions per Second
--------------------------------------------------------------------------------
/many-cores.md:
--------------------------------------------------------------------------------
1 | ---
2 | id: many-cores
3 | title: Many Cores
4 | sidebar_label: Many Cores
5 | ---
6 |
7 | [🔗Lecture on Udacity (45 min)](https://classroom.udacity.com/courses/ud007/lessons/913459012/concepts/last-viewed)
8 |
9 | ## Many-Core Challenges
10 | - Cores \\(\uparrow\\) \\(\Rightarrow\\) Coherence Traffic \\(\uparrow\\)
11 | - Writes to shared location \\(\Rightarrow\\) invalidations and misses
12 | - Cores \\(\uparrow\\) \\(\Rightarrow\\) more writes/sec \\(\Rightarrow\\) bus throughput \\(\uparrow\\)
13 | - Bus: one request at a time \\(\Rightarrow\\) bottleneck
14 | - We need:
15 | - Scalable on-chip network that allows traffic to grow with number of cores
16 | - Directory coherence (so we don't rely on bus to serialize everything)
17 |
18 | ## Network On Chip
19 | Consider a linear bus - as we add more cores, it increases the length of the bus (and thus may be slower), and there is also more traffic. It gets very bad with more cores.
20 |
21 | Now consider the cores arranged in a mesh network, where each core communicates with its adjacent cores. Any core can still reach any other core through this network, but many such communications can be in flight at once, so the total throughput is several times an individual link's throughput.
22 |
23 | As we continue to add cores, it simply increases the size of the mesh. While the amount of total traffic increases, so does the total number of links and thus the total throughput of the network.
24 | - Cores \\(\uparrow\\) \\(\Rightarrow\\) Links \\(\uparrow\\) \\(\Rightarrow\\) Available Throughput \\(\uparrow\\)
25 | - This is also good for chip building because the links don't intersect each other - it is somewhat flat and easily built in silicon
26 |
27 | Can build various network topologies:
28 | - Bus - single link connected to all cores
29 | - Mesh - each core is connected to adjacent cores
30 | - Torus - A mesh where "ends" are also linked to each other in both horizontal/vertical directions
31 | - More advanced or less "flat": Flattened Butterfly, Hypercube, etc.
32 |
33 | ## Many-Core Challenges 2
34 | - Cores \\(\uparrow\\) \\(\Rightarrow\\) Coherence Traffic \\(\uparrow\\)
35 | - Scalable on-chip network (e.g. a mesh)
36 | - Directory coherence
37 | - Cores \\(\uparrow\\) \\(\Rightarrow\\) Off-Chip Traffic \\(\uparrow\\)
38 | - \# cores \\(\uparrow\\) \\(\Rightarrow\\) \# of on-chip caches \\(\uparrow\\)
39 | - Misses/core same \\(\Rightarrow\\) cores \\(\uparrow\\) \\(\Rightarrow\\) mem requests \\(\uparrow\\)
40 | - \# of pins \\(\uparrow\\), but not \\(\approx\\) to \# of cores \\(\Rightarrow\\) bottleneck
41 | - Need to reduce # of memory requests/core
42 | - Last level cache (L3) shared, size \\(\uparrow \approx \\) # cores \\(\uparrow\\)
43 | - One big LLC \\(\Rightarrow\\) SLOW ... one "entry point" \\(\Rightarrow\\) Bottleneck (Bad)
44 | - Distributed LLC
45 |
46 | ## Distributed LLC
47 | - Logically a single cache (block is not replicated on each cache)
48 | - But sliced up so each tile (core + local caches) gets part of it
49 | - L3 size = # cores * L3 slice size
50 | - On an L2 miss, we must request block from correct L3 slice. How do we know which slice to ask?
51 | - Round-robin by cache index (last few bits of set number)
52 | - May not be good for locality
53 | - Round-robin by page #
54 | - OS can map pages to make accesses more local
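
A minimal sketch of the two slice-selection options above (block size, page size, and tile count are assumed values):

```cpp
#include <cstdint>
#include <cstdio>

const uint64_t kBlockSize = 64;   // bytes per cache block (assumed)
const uint64_t kPageSize  = 4096; // bytes per page (assumed)
const uint64_t kNumTiles  = 16;   // tiles, each holding one slice of the L3

// Option 1: interleave by block number (the low bits of the set index):
// spreads traffic evenly but ignores locality.
uint64_t slice_by_block(uint64_t addr) { return (addr / kBlockSize) % kNumTiles; }

// Option 2: interleave by page number: the OS can then map a thread's
// pages so that its L3 slice is the local one.
uint64_t slice_by_page(uint64_t addr) { return (addr / kPageSize) % kNumTiles; }

int main() {
    uint64_t addr = 0x12345678;
    printf("block-interleaved slice: %llu\n", (unsigned long long)slice_by_block(addr));
    printf("page-interleaved slice:  %llu\n", (unsigned long long)slice_by_page(addr));
    return 0;
}
```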
55 |
56 | ## Many-Core Challenges 2 (again)
57 | - Cores \\(\uparrow\\) \\(\Rightarrow\\) Coherence Traffic \\(\uparrow\\)
58 | - Scalable on-chip network (e.g. a mesh)
59 | - Directory coherence
60 | - Cores \\(\uparrow\\) \\(\Rightarrow\\) Off-Chip Traffic \\(\uparrow\\)
61 | - Large Shared Distributed LLC
62 | - Coherence Directory too large (to fit on chip)
63 | - Entry for each memory block
64 | - Memory many GB \\(\Rightarrow\\) billions of entries? \\(\Rightarrow\\) can't fit
65 |
66 | ## On-Chip Directory
67 | - Home node? Same as LLC slice! (we'll be looking at that node anyway)
68 | - Entry for every memory block? No.
69 | - Partial directory
70 | - Directory has limited # of entries
71 | - Allocate entry only for blocks that have at least 1 presence bit set (only blocks that might be in at least one of the private caches)
72 | - If it's only in the LLC or memory, we don't need a directory entry (it would be all zeroes anyway)
73 | - Run out of directory entries?
74 | - Pick an entry to replace (LRU), say entry E
75 | - Invalidation to all tiles with Presence bit set
76 | - Remove entry E and put new entry there
77 | - This is a new type of cache miss, caused by directory replacement
78 |
79 | ## Many-Core Challenges 3
80 | - Cores \\(\uparrow\\) \\(\Rightarrow\\) Coherence Traffic \\(\uparrow\\)
81 | - Scalable on-chip network (e.g. a mesh)
82 | - Directory coherence
83 | - Cores \\(\uparrow\\) \\(\Rightarrow\\) Off-Chip Traffic \\(\uparrow\\)
84 | - Large Shared Distributed LLC
85 | - Coherence Directory too large (to fit on chip)
86 | - Distributed Partial Directory
87 | - Power budget split among cores
88 | - Cores \\(\uparrow\\) \\(\Rightarrow\\) W/core \\(\downarrow\\) \\(\Rightarrow\\) f and V \\(\downarrow\\) \\(\Rightarrow\\) 1 thread program is slower with more cores!
89 |
90 | ## Multi-Core Power and Performance
91 | 
92 |
93 | Because each core must fit within a smaller share of the chip's power budget, each core runs noticeably slower.
94 |
95 | ## No Parallelism \\(\Rightarrow\\) boost frequency
96 | - "Turbo" clocks when running on one core
97 | - Example: Intel's Core I7-4702MQ (Q2 2013)
98 | - Design Power: 37W
99 | - 4 cores, "Normal" clock 2.2GHz
100 | - "Turbo" clock 3.2GHz (1.45x normal \\(\Rightarrow\\) 3x power)
101 | - Why not 4x the power? The one active core must spread its heat through the neighboring (idle) cores; at about 3x the per-core power, the chip's total heat stays roughly what it would be with all four cores at the normal 2.2GHz
102 | - Example: Intel's Core I7-4771 (Q3 2013)
103 | - Design Power: 84W
104 | - 4 cores, "Normal" clock 3.5GHz
105 | - "Turbo" clock 3.9GHz (1.11x normal \\(\Rightarrow\\) 1.38x power)
106 | - Meant for desktop, so can cool the chip more effectively at high power
107 | - But this means the chip already runs almost as hot as it can get, so we don't have much more room to increase power further
108 |
109 | ## Many-Core Challenges 4
110 | - Cores \\(\uparrow\\) \\(\Rightarrow\\) Coherence Traffic \\(\uparrow\\)
111 | - Scalable on-chip network (e.g. a mesh)
112 | - Directory coherence
113 | - Cores \\(\uparrow\\) \\(\Rightarrow\\) Off-Chip Traffic \\(\uparrow\\)
114 | - Large Shared Distributed LLC
115 | - Coherence Directory too large (to fit on chip)
116 | - Distributed Partial Directory
117 | - Power budget split among cores
118 | - "Turbo" when using only one core
119 | - OS Confusion
120 | - Multi-threading, cores, chips - all at same time!
121 |
122 | ## SMT, Cores, Chips, ...
123 | All combined:
124 | - Dual socket motherboard (two chips)
125 | - 4 cores on each chip
126 | - Each core 2-way SMT
127 | - 16 threads can run
128 |
129 | What if we run 3 threads?
130 | - Assume OS assigns them to the first 3 spots, but maybe two of those are actually the same core (because SMT), and the first half of those spots are actually on the same chip.
131 | - A smarter policy would be to put them on separate chips if possible, and then separate cores if possible, to maximize all benefits.
132 | - So, the OS needs to be very aware of what hardware is available, and smart enough to use it effectively.
133 |
134 |
135 | *[LLC]: Last-Level Cache
136 | *[SMT]: Simultaneous Multi-Threading
--------------------------------------------------------------------------------
/memory-consistency.md:
--------------------------------------------------------------------------------
1 | ---
2 | id: memory-consistency
3 | title: Memory Consistency
4 | sidebar_label: Memory Consistency
5 | ---
6 |
7 | [🔗Lecture on Udacity (32 min)](https://classroom.udacity.com/courses/ud007/lessons/914198580/concepts/last-viewed)
8 |
9 | ## Memory Consistency
10 | - Coherence: Defines order of accesses to **the same address**
11 | - We need this to share data among threads
12 | - Does not say anything about accesses to different locations
13 | - Consistency: Defines order of accesses to **different addresses**
14 | - Does this even matter?
15 |
16 | ### Consistency Matters
17 | Stores are performed at commit and thus occur in program order, but an out-of-order processor can reorder loads, so another thread may observe results that would not be possible from the program logic alone.
18 |
19 | Examples: [🎥 View lecture video (3:59)](https://www.youtube.com/watch?v=uh8gF64345I), [🎥 View quiz from lecture (3:33)](https://www.youtube.com/watch?v=4X-DciIfFcc)
20 |
21 | ### Why We Need Consistency
22 | - Data-Ready Flag Synchronization
23 | - See quiz from previous section. Effects of branch prediction can cause loads to happen before the data is ready (according to some flag)
24 | - Thread Termination
25 | - Thread A creates Thread B
26 | - Thread B works, updates data
27 | - Thread A waits until B exits (system call)
28 | - Branch prediction may cause this to execute before the following:
29 | - Thread B done, OS marks B done
30 |
31 | We need an additional set of ordering restrictions, beyond coherence alone!
32 |
33 | ## Sequential Consistency
34 | The result of any execution should be as if accesses executed by each processor were executed in-order, and accesses among different processors were arbitrarily interleaved.
35 |
36 | - Simplest Implementation: remove the "as if"
37 | - A core performs next access only when ALL previous accesses are complete
38 | - This works fine, but it is really, really bad for performance (MLP = 1)
39 |
40 | ### Better Implementation of Sequential Consistency
41 | - Core can reorder loads
42 | - Detect when SC may be violated \\(\Rightarrow\\) Fix
43 | - How? In the ROB we have all instructions in program order, regardless of execution order.
44 | - If instructions actually executed in order, we're ok
45 | - If instructions are out of order but no other stores were done to any of the reordered addresses in the meantime, we're also ok.
46 | - Example: `LW A` and `LW B` are program order, but they get executed out of order. There is nothing such as a `SW B` after the `LW B`, so by the time we hit `LW A` it is still consistent with what it would have been in-order.
47 | - If we see a store after a load that is out of order (e.g. execution order of `LW B` then `SW B` then `LW A`), we then know that load has to be replayed, because it would have been a different value in program order. We know it is replayable because it is still beyond the commit point in the ROB, so we are capable of replaying it.
48 | - In order to detect such a violation, for anything we load out of order we need to monitor coherence traffic until we get back in order.
49 |
50 | ## Relaxed Consistency
51 | An alternative approach to SC is to not set the expectation to the programmers for SC, but something slightly less strict.
52 | - Four types of ordering
53 | 1. `WR A` \\(\rightarrow\\) `WR B`
54 | 2. `WR A` \\(\rightarrow\\) `RD B`
55 | 3. `RD A` \\(\rightarrow\\) `WR B`
56 | 4. `RD A` \\(\rightarrow\\) `RD B`
57 | - Sequential Consistency: All must be obeyed!
58 | - Relaxed Consistency: some of these types need not be obeyed at all the times
59 | - Read/Read (#4) best example
60 | - How do we write correct programs if we cannot expect this ordering to be consistent?
61 |
62 | ### Relaxed Consistency implementation
63 | - Allow reordering of "Normal" accesses
64 | - Add special non-reorderable accesses (separate instructions than normal `LW`/`SW`)
65 | - Must use these when ordering in the program matters
66 | - Example: x86's memory fence instruction (called `MSYNC` in the lecture; the actual x86 instruction is `MFENCE`)
67 | - Normally, all accesses may be reordered
68 | - But, no reordering across the `MSYNC`
69 | - [`LW`, `SW`, `LW`, `SW`] \\(\rightarrow\\) `MSYNC`, \\(\rightarrow\\) [`LW`, `SW`, ...]
70 | - Acts as a barrier between operations that expect things to be in order
71 | ```cpp
72 | while(!flag);
73 | MSYNC
74 | //use data -- ensures the data is correctly synchronized to the flag state regardless of reordering
75 | ```
76 | - When to `MSYNC` for synchronization? After a lock/"acquire", but before an unlock/"release"
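
This is essentially what modern memory models expose to programmers. As a hedged modern analogue (C++11 atomics rather than the lecture's `MSYNC`, but the same idea of restricting reordering only where it matters):

```cpp
#include <atomic>
#include <cstdio>
#include <thread>

std::atomic<bool> flag{false};
int data = 0; // ordinary ("normal") variable, freely reorderable on its own

void producer() {
    data = 42;                                   // normal store
    flag.store(true, std::memory_order_release); // earlier accesses may not move below this
}

void consumer() {
    while (!flag.load(std::memory_order_acquire)) {} // later accesses may not move above this
    printf("%d\n", data);                            // guaranteed to print 42
}

int main() {
    std::thread t1(producer), t2(consumer);
    t1.join(); t2.join();
    return 0;
}
```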
77 |
78 | ## Data Races and Consistency
79 | 
80 |
81 | One key point is that when creating a data-race-free program, we may want to debug in a SC environment until proper synchronization ensures that we are free of any potential data races. Once we are confident of that, we can move to a more relaxed consistency model and rely on synchronization to ensure consistency is maintained. If the program is not data race free, then anything can happen!
82 |
83 | ## Consistency Models
84 | - Sequential Consistency
85 | - Relaxed Consistency
86 | - Weak (distinguishes between synchronization and non-synchronization accesses, and prevents any synchronization-related accesses from being reordered, but all others may be freely reordered)
87 | - Processor
88 | - Release (distinguishes between lock and unlock (acquire/release) accesses, and does not allow non-synchronization accesses to be reordered across this lock/unlock boundary)
89 | - Lazy Release
90 | - Scope
91 | - ...
92 |
93 |
94 | *[MLP]: Memory Level Parallelism
95 | *[SC]: Sequential Consistency
96 | *[ROB]: Re-Order Buffer
--------------------------------------------------------------------------------
/memory-ordering.md:
--------------------------------------------------------------------------------
1 | ---
2 | id: memory-ordering
3 | title: Memory Ordering
4 | sidebar_label: Memory Ordering
5 | ---
6 |
7 | [🔗Lecture on Udacity (30 min)](https://classroom.udacity.com/courses/ud007/lessons/937498641/concepts/last-viewed)
8 |
9 | ## Memory Access Ordering
10 |
11 | With the ROB and Tomasulo's algorithm, we can enforce the order of register dependencies between instructions. Is there an equivalent for memory accesses (loads/stores)? Can they be reordered?
12 |
13 | So far, we have:
14 | - Eliminated Control Dependencies
15 | - Branch Prediction
16 | - Eliminated False Dependencies on registers
17 | - Register Renaming
18 | - Obey RAW register dependencies
19 | - Tomasulo-like scheduling
20 |
21 | What about memory dependencies?
22 | ```mipsasm
23 | SW R1, 0(R3)
24 | LW R2, 0(R4)
25 | ```
26 | In this case, if 0(R3) and 0(R4) point to different addresses, then order does not matter. But if they point to the same address, then we need a way to handle this dependency and keep them in order.
27 |
28 | ## When does Memory Write Happen?
29 |
30 | At commit! Any instruction may execute but not be committed (due to exceptions, branch misprediction), so memory should not be written until the instruction is ready to be committed.
31 |
32 | But if we write at the commit, where does a `LOAD` get its data?
33 |
34 | ## Load Store Queue (LSQ)
35 |
36 | We need an LSQ to provide values from stores to loads, since the actual memory will not be written until a commit. We use the following structure:
37 |
38 | | L/S | Addr | Val | C |
39 | | --- | --- | --- |---|
40 | | `` | | | |
41 |
42 | - L/S: bit indicating whether the entry is a load or a store
43 | - Addr: address being accessed
44 | - Val: value being stored or loaded
45 | - C: whether the instruction has completed
46 |
47 | An example of using the LSQ:
48 | 1. [🎥 Part 1](https://www.youtube.com/watch?v=utRgthVxAYk)
49 | 2. [🎥 Part 2](https://www.youtube.com/watch?v=mbwf-CoA5Zg)
50 |
51 | For each load instruction, the LSQ is checked to see whether an earlier store in the queue has already written to that address. If so, "Store-to-Load Forwarding" supplies the value to the load directly; if not, the load goes to memory. In some cases a store may not yet know its address (it still needs to be computed), so a load may not realize there was an earlier store to its address and would obtain an incorrect value. There are some options to handle this:
52 | 1. All mem instructions in-order (very low performance)
53 | 2. Wait for all previous store addresses to complete (still bad performance)
54 | 3. Go anyway - the problem will occur, but additional logic must be used to recover from this situation.
55 |
56 | ### Out-Of-Order Load Store Execution
57 |
58 | | # | Inst | Notes |
59 | |---| --- | --- |
60 | | 1 | `LOAD R3 = 0(R6)` | Dispatched immediately, but cache miss |
61 | | 2 | `ADD R7 = R3 + R9` | Depends on #1, waiting... |
62 | | 3 | `STORE R4 -> 0(R7)` | Depends on #2, waiting... |
63 | | 4 | `SUB R1 = R1 - R2` | Dispatched immediately, R1 good |
64 | | 5 | `LOAD R8 = 0(R1)` | R1 ready, so runs and cache hit |
65 |
66 | In this example, instructions 1-3 are delayed due to a cache miss, but instructions 4 and 5 are able to execute. `R8` now contains the value from address `0(R1)`. Later, instructions 1-3 will execute, and address `0(R7)` contains the value of `R4`. If the addresses in `R1` and `R7` do not match, then this is ok. However, if they are the same number (given that `R7` was calculated in instruction 2 and we did not know this yet), then we have a problem, because R8 contains the previous value from that address, not the value it should have had with the ordering.
67 |
68 | ### In-Order Load Store Execution
69 |
70 | Similar to previous example, we may execute some instructions out of order (#4 may execute right away), but all loads and stores are in order. A store instruction is considered "done" (for purposes of ordering) when the address is known, at which point it can check to ensure any other instructions do not conflict with it. Of course this is not very high-performance, as the final `LOAD` instruction is delayed considerably - and if it is a cache miss, then the performance takes an even greater hit.
71 |
72 | ## Store-to-Load Forwarding
73 |
74 | Load:
75 | - Which earlier `STORE` do we get value from?
76 |
77 | Store:
78 | - Which later `LOAD`s do I give my value to?
79 |
80 | The LSQ is responsible for deciding these things.
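
A minimal sketch of the load-side search (structure names are illustrative; it ignores complications such as stores whose addresses are not yet known):

```cpp
#include <cstdint>
#include <optional>
#include <vector>

struct LSQEntry {
    bool is_store;  // load or store
    uint64_t addr;  // address being accessed
    uint64_t value; // value being stored (or loaded)
};

// For the load at position load_idx (entries are in program order), return the
// value of the closest earlier store to the same address; if there is none,
// the load must go to the cache/memory instead.
std::optional<uint64_t> forward(const std::vector<LSQEntry>& lsq, size_t load_idx) {
    for (size_t i = load_idx; i-- > 0; )  // walk backwards toward older entries
        if (lsq[i].is_store && lsq[i].addr == lsq[load_idx].addr)
            return lsq[i].value;          // store-to-load forwarding
    return std::nullopt;                  // no earlier store: go to memory
}
```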
81 |
82 | ## LSQ Example
83 |
84 | [🎥 Example Video (5:29)](https://www.youtube.com/watch?v=eHVLMgfy-Jc)
85 |
86 | A key point from this is the connection to Commits as we learned in the ROB lesson. The stores are only written to data cache when the associated instruction is committed - thereby ensuring that exceptions or mispredicted branches do not affect the memory itself. At any point execution could stop and the state of everything will be ok.
87 |
88 | ## LSQ, ROB, and RS
89 |
90 | Load/Store:
91 | - Issue
92 | - Need a ROB entry
93 | - Need an LSQ entry (similar to needing an RS in a non-Load/Store issue)
94 | - Execute Load/Store:
95 | - Compute the Address
96 | - Produce the Value
97 | - _(Can happen simultaneously)_
98 | - (LOAD only) Write-Result
99 | - And Broadcast it
100 | - Commit Load/Store
101 | - Free ROB & LSQ entries
102 | - (STORE only) Send write to memory
103 |
104 |
105 | *[ALU]: Arithmetic Logic Unit
106 | *[CPI]: Cycles Per Instruction
107 | *[ILP]: Instruction-Level Parallelism
108 | *[IPC]: Instructions per Cycle
109 | *[IQ]: Instruction Queue
110 | *[LB]: Load Buffer
111 | *[LSQ]: Load-Store Queue
112 | *[LW]: Load Word
113 | *[OOO]: Out Of Order
114 | *[PC]: Program Counter
115 | *[RAT]: Register Allocation Table (Register Alias Table)
116 | *[RAW]: Read-After-Write
117 | *[ROB]: ReOrder Buffer
118 | *[SB]: Store Buffer
119 | *[SW]: Store Word
120 | *[WAR]: Write-After-Read
121 | *[WAW]: Write-After-Write
122 | *[RAR]: Read-After-Read
--------------------------------------------------------------------------------
/memory.md:
--------------------------------------------------------------------------------
1 | ---
2 | id: memory
3 | title: Memory
4 | sidebar_label: Memory
5 | ---
6 |
7 | [🔗Lecture on Udacity (23 min)](https://classroom.udacity.com/courses/ud007/lessons/872590120/concepts/last-viewed)
8 |
9 | ## How Memory Works
10 | * Memory Technology: SRAM and DRAM
11 | * Why is memory slow?
12 | * Why don't we use cache-like memory?
13 | * What happens when we access main memory?
14 | * How do we make it faster?
15 |
16 | ## Memory Technology SRAM and DRAM
17 | SRAM - Static Random Access Memory - Faster
18 | - Retains data while power supplied (good)
19 | - Several transistors per bit (bad)
20 |
21 | DRAM - Dynamic Random Access Memory - Slower
22 | - Will lose data if we don't refresh it (bad)
23 | - Only one transistor per bit (good)
24 |
25 | Both of these lose data when power is not supplied
26 |
27 | ## One Memory Bit
28 | ### SRAM
29 | * [🎥 See SRAM (4:13)](https://www.youtube.com/watch?v=mwNqzc1o5zM)
30 |
31 | SRAM is a matrix of wordlines and bitlines. The wordline selects which word is active, acting as the "on" switch for that word's transistors. These transistors connect the bitlines to an inverter feedback loop that holds the data. Another transistor connects the other side of the inverter loop to a second bitline carrying the inverted value of the same bit. This makes writes easier (the inverter loop is driven from both sides at the same time) and reads more reliable (we look at the difference between the bit and not-bit lines).
32 |
33 | ### DRAM
34 | * [🎥 See DRAM (4:03)](https://www.youtube.com/watch?v=3s7zsLU83bY)
35 |
36 | In DRAM we similarly have a transistor controlled by the wordline, but the main difference is that our memory is no longer stored in a feedback loop, but in a simple capacitor that is charged on a write to memory. The problem is that a transistor is not a perfect switch - it's a tiny bit leaky, which means the capacitor will lose its charge over time. So, we need to write that bit again from time to time. Additionally, reads are destructive, in that they also drain the capacitor.
37 |
38 | ### SRAM/DRAM Thoughts
39 |
40 | SRAM is often called a 6T cell (2 access transistors + 2 transistors in each of the two inverters). DRAM is a 1T cell (a single transistor controlling the capacitor). You might expect its footprint to be a transistor plus a capacitor (and we want a large capacitor so it retains the value longer), but DRAM uses a "trench cell", burying the capacitor underneath the transistor, so the footprint is effectively that of a single transistor. Additionally, we don't need the second bitline as in SRAM, so space is conserved and the cost per bit is much lower overall.
41 |
42 | ## Memory Chip Organization
43 | A Row Address is passed into a Row Decoder, which decides which wordline to activate. There is a memory cell at each intersection of a wordline and a bitline. Once a row is activated, the bits in that word flow down the bitlines at a small voltage and are collected in a Sense Amplifier that produces full 0 and 1 values. These signals go into a Row Buffer, a storage element for the row. The value is then fed into a Column Decoder, which takes a Column Address (which bit to read) and outputs a single bit. This structure can then be replicated to provide more bits.
44 | 
45 |
46 | After this happens, the destructive read means the value needs to be restored, so the sense amplifier drives the bitlines again to rewrite the row. Thus, every read is really a read-then-write. The DRAM must also refresh all rows periodically to counteract leakage. We can't rely on the processor to do this for us, because data that is accessed often is probably in a cache and won't be touched in memory. So there is a Refresh Row Counter that loops through all rows within the required refresh period: for a refresh period T and N rows, a row must be refreshed every T/N seconds. DRAM has many rows and T is well under a second, so this is a lot of required refresh activity.
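
A quick back-of-the-envelope for the T/N point (the refresh period and row count below are typical assumed values, not numbers from the lecture):

```cpp
#include <cstdio>

int main() {
    double refresh_period_s = 0.064; // T: a typical DRAM refresh period, 64 ms
    int rows = 8192;                 // N: rows per bank (assumed)
    double per_row_s = refresh_period_s / rows;
    printf("refresh one row every %.2f microseconds\n", per_row_s * 1e6); // ~7.81 us
    return 0;
}
```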
47 |
48 | To write to memory, we still read a row, and once it is latched into the row buffer we provide the new value to the column decoder, which then will write it into the row buffer, and the sense amplifier uses the value to write back on the bitlines. Thus, a write is also a Read-Then-Write, similar to the read.
49 |
50 | ## Fast Page Mode
51 | "Page" is the series of bitlines that can be read by selecting the proper row. It can be many bits long. Fast Page mode is a technique in which once a row is selected, the value is latched into the row buffer and successive bit reads will happen faster.
52 | - Fast Page Mode
53 | - Open a page
54 | - Row Address
55 | - Select Row
56 | - Sense Amplifier
57 | - Latch into Row Buffer
58 | - Read/Write Cycle (can be many read and write operations on that page)
59 | - Close the page
60 | - Write data from row buffer back to memory row
61 |
62 | ## Connecting DRAM to the Processor
63 | Once a memory request has missed all levels of cache, it is communicated externally through the Front Side Bus into the Memory Controller. The Memory Controller channels these requests into one of several DRAM channels. Thus, the total memory latency includes all the overhead of the communication over the FSB, the activities of the Memory Controller, in addition to the memory access itself.
64 | 
65 |
66 | Some recent processors integrate the memory controller, eliminating the FSB communication and letting the processor talk to DRAM directly. This can be a significant overall latency savings. It also forced standardization of the protocols connecting DRAM to the processor, so that this integration works without a complete chip redesign.
67 |
68 |
69 |
70 | *[DRAM]: Dynamic Random Access Memory
71 | *[SRAM]: Static Random Access Memory
--------------------------------------------------------------------------------
/metrics-and-evaluation.md:
--------------------------------------------------------------------------------
1 | ---
2 | id: metrics-and-evaluation
3 | title: Metrics and Evaluation
4 | sidebar_label: Metrics and Evaluation
5 | ---
6 |
7 | [🔗Lecture on Udacity (45 min)](https://classroom.udacity.com/courses/ud007/lessons/3650739106/concepts/last-viewed)
8 |
9 | ## Performance
10 |
11 | * Latency (time start \\( \rightarrow \\) done)
12 | * Throughput (#/second) (not necessarily 1/latency due to pipelining)
13 |
14 | ### Comparing Performance
15 |
16 | "X is N times faster than Y"
17 |
18 | Speedup = N = speed(X)/speed(Y)
19 | \\( \Rightarrow \\) N = Throughput(X)/Throughput(Y)
20 | \\( \Rightarrow \\) N = Latency(Y)/Latency(X)
21 | (notice Y and X for each)
22 |
23 | ### Speedup
24 |
25 | Speedup > 1 \\( \Rightarrow \\) Improved Performance
26 | * Shorter execution time
27 | * Higher throughput
28 |
29 | Speedup < 1 \\( \Rightarrow \\) Worse Performance
30 |
31 | Performance ~ Throughput
32 |
33 | Performance ~ 1/Latency
34 |
35 | ## Measuring Performance
36 |
37 | How do we measure performance?
38 |
39 | Actual User Workload:
40 | * Many programs
41 | * Not representative of other users
42 | * How do we get workload data?
43 |
44 | Instead, use Benchmarks
45 |
46 | ### Benchmarks
47 |
48 | Programs and input data agreed upon for performance measurements. Typically there is a benchmark suite, comprised of multiple programs, each representative of some kind of application.
49 |
50 | #### Benchmark Types
51 | * Real Applications
52 | * Most Representative
53 | * Most difficult to set up
54 | * Kernels
55 | * Find most time-consuming part of application
56 | * May not be easy to run depending on development phase
57 | * Best to run once a prototype machine is available
58 | * Synthetic Benchmarks
59 | * Behave similarly to kernels but simpler to compile
60 | * Typically good for design studies (choosing among design options before real applications can run)
61 | * Peak Performance
62 | * In theory, how many IPS
63 | * Typically only good for marketing
64 |
65 | #### Benchmark Standards
66 |
67 | A benchmarking organization takes input from academia, user groups, and manufacturers, and curates a standard benchmark suite. Examples: TPC (databases, web), EEMBC (embedded), SPEC (engineering workstations, raw processors). For example, SPEC includes GCC, Perl, BZIP, and more.
68 |
69 | ### Summarizing Performance
70 |
71 | #### Average Execution Time
72 |
73 | | | Comp X | Comp Y | Speedup |
74 | |---|--------|--------|---------|
75 | | App A | 9s | 18s | 2.0 |
76 | | App B | 10s | 7s | 0.7 |
77 | | App C | 5s | 11s | 2.2 |
78 | | AVG | 8s | 12s | 1.5 |
79 |
80 | If you simply average speedups, you get 1.63 instead. Speedup of average execution times is not the same as simply averaging speedups on individual applications.
81 |
82 | #### Geometric Mean
83 |
84 | If we want to be able to average speedups, we need to use geometric means for both average times and speedups. This results in the same value whether you are taking the geometric mean of individual speedups, or speedup of geometric mean of execution times.
85 |
86 | For example, in the table above we would obtain:
87 | | | Comp X | Comp Y | Speedup |
88 | |---|--------|--------|---------|
89 | | Geo Mean | 7.66s | 11.15s | 1.456 |
90 |
91 | Geometric mean of speedup values (2.0, 0.7, 2.2) also result in 1.456.
92 |
93 | $$ \text{geometric mean} = \sqrt[n]{a_1\*a_2\*...a_n} $$
94 |
95 | As a general rule, if you are trying to average things that are ratios (speedups are ratios), you cannot simply average them. Use the geometric mean instead.
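
A minimal sketch computing it both ways on the table above (both give the same answer, unlike the arithmetic mean):

```cpp
#include <cmath>
#include <cstdio>
#include <vector>

double geo_mean(const std::vector<double>& v) {
    double product = 1.0;
    for (double x : v) product *= x;
    return std::pow(product, 1.0 / v.size());
}

int main() {
    std::vector<double> x = {9, 10, 5}, y = {18, 7, 11};           // execution times
    std::vector<double> speedups = {18.0 / 9, 7.0 / 10, 11.0 / 5}; // per-app speedups
    printf("geomean(Y) / geomean(X) = %.2f\n", geo_mean(y) / geo_mean(x)); // 1.46
    printf("geomean(speedups)       = %.2f\n", geo_mean(speedups));        // 1.46
    return 0;
}
```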
96 |
97 | ## Iron Law of Performance
98 |
99 | **CPU Time** = (# instructions in the program) * (cycles per instruction) * (clock cycle time)
100 |
101 | Each component allows us to think about the computer architecture and how it can be changed:
102 | * Instructions in the Program
103 | * Algorithm
104 | * Compiler
105 | * Instruction Set
106 | * Cycles Per Instruction
107 | * Instruction Set
108 | * Processor Design
109 | * Clock Cycle Time
110 | * Processor Design
111 | * Circuit Design
112 | * Transistor Physics
113 |
114 | ### Iron Law for Unequal Instruction Times
115 |
116 | When instructions have different number of cycles, sum them individually:
117 | $$ \sum_i (IC_i\* CPI_i) * \frac{\text{time}}{\text{cycle}} $$
118 |
119 | where \\( IC_i \\) is the instruction count for instruction \\( i \\), and \\(CPI_i\\) is the cycles for instruction \\( i \\).
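
A minimal sketch of that sum (the instruction mix, per-class CPIs, and clock are made-up values):

```cpp
#include <cstdio>

int main() {
    // Hypothetical instruction mix: {instruction count, CPI for that class}
    struct Class { double count; double cpi; };
    Class mix[] = {{50e6, 1.0},   // ALU ops
                   {20e6, 2.0},   // loads/stores
                   {10e6, 3.0}};  // branches
    double cycle_time = 0.5e-9;   // 0.5 ns per cycle (2 GHz clock)

    double cycles = 0;
    for (const Class& c : mix) cycles += c.count * c.cpi;  // sum_i IC_i * CPI_i
    printf("CPU time = %.3f s\n", cycles * cycle_time);    // 0.060 s
    return 0;
}
```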
120 |
121 | ## Amdahl's Law
122 |
123 | Used when an enhancement affects only part of the program (or only certain instructions). What is the overall effect on speedup?
124 |
125 | $$ speedup = \frac{1}{(1-frac_{enh}) + \frac{frac_{enh}}{speedup_{enh}}} $$
126 |
127 | where \\( frac_{enh} \\) represents the fraction of the execution **TIME** enhanced by the changes, and \\( speedup_{enh} \\) represents the amount that change was sped up.
128 |
129 | NOTE: Always ensure the fraction represents TIME, not any other quantity (cycles, etc.). First, convert changes into execution time before the change, and execution time after the change.
130 |
131 | ### Implications
132 |
133 | Compare these enhancements:
134 | * Enhancement 1
135 | * Speedup of 20x on 10% of time
136 | * \\( \Rightarrow \\) speedup = 1.105
137 | * Enhancement 2
138 | * Speedup of 1.6x on 80% of time
139 | * \\( \Rightarrow \\) speedup = 1.43
140 |
141 | Even an infinite speedup in enhancement 1 only yields 1.111 overall speedup.
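
A minimal sketch of the formula, reproducing the numbers above:

```cpp
#include <cstdio>

// Amdahl's Law: overall speedup when a fraction of execution *time* is enhanced.
double amdahl(double frac_enh, double speedup_enh) {
    return 1.0 / ((1.0 - frac_enh) + frac_enh / speedup_enh);
}

int main() {
    printf("20x on 10%% of time : %.3f\n", amdahl(0.10, 20.0)); // 1.105
    printf("1.6x on 80%% of time: %.3f\n", amdahl(0.80, 1.6));  // 1.429
    printf("infinite on 10%%    : %.3f\n", amdahl(0.10, 1e12)); // ~1.111
    return 0;
}
```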
142 |
143 | Takeaway: Make the common case fast
144 |
145 | ### Lhadma's Law
146 |
147 | * Amdahl: Make common case fast
148 | * Lhadma: Do not mess up the uncommon case too badly
149 |
150 | Example:
151 | * Improvement of 2x on 90%
152 | * Slow down rest by 10x
153 | * Speedup = \\( \frac{1}{\frac{0.1}{0.1} + \frac{0.9}{2}} = \frac{1}{1+0.45} \approx 0.7 \\) (a 10x slowdown is a "speedup" of 0.1, hence the \\( \frac{0.1}{0.1} \\) term)
154 | * \\( \Rightarrow \\) Net slowdown, not speedup.
155 |
156 | ### Diminishing Returns
157 |
158 | Consequence of Amdahl's law. If you keep trying to improve the same area, you get diminishing returns on the effort. Always reconsider what is now the dominant part of the execution time.
159 |
160 | 
161 |
162 | *[IPS]: Instructions per Second
163 | *[CPI]: Cycles per Instruction
--------------------------------------------------------------------------------
/multi-processing.md:
--------------------------------------------------------------------------------
1 | ---
2 | id: multi-processing
3 | title: Multi-Processing
4 | sidebar_label: Multi-Processing
5 | ---
6 |
7 | [🔗Lecture on Udacity (1.25hr)](https://classroom.udacity.com/courses/ud007/lessons/1097109180/concepts/last-viewed)
8 |
9 | ## Flynn's Taxonomy of Parallel Machines
10 | * How many instruction streams
11 | * How many data streams
12 |
13 | | | Instruction Streams | Data Streams | |
14 | |:---:|:---:|:---:|---|
15 | | SISD<br>(uniprocessor) | 1 | 1 | Single Instruction Single Data |
16 | | SIMD<br>(vector, SSE/MMX) | 1 | > 1 | Single Instruction Multiple Data |
17 | | MISD<br>(stream processor?) | > 1 | 1 | Multiple Instruction Single Data<br>(rare) |
18 | | MIMD<br>(multiprocessor) | > 1 | > 1 | Multiple Instruction Multiple Data<br>(most today, multi-core) |
19 |
20 | ## Why Multiprocessors?
21 | * Uniprocessors already \\(\approx\\) 4-wide
22 | * Diminishing returns from making it even wider
23 | * Faster?? \\( \text{frequency} \uparrow \Rightarrow \text{voltage} \uparrow \Rightarrow \text{power} \uparrow^3 \Rightarrow \\) (fire)
24 | * But Moore's Law continues!
25 | * 2x transistors every 18 months
26 | * \\(\Rightarrow\ \approx\\)2x performance every 18 months (assuming we can use all the cores)
27 |
28 | ## Multiprocessor Needs Parallel Programs!
29 | * Sequential (single-threaded) code is _a lot_ easier to develop
30 | * Debugging parallel code is _much_ more difficult
31 | * Performance scaling _very_ hard to get
32 |     * (the ideal: as we scale # cores, performance scales equally, i.e. linearly)
33 | * In reality, performance will improve to a point, but then levels off. We can keep improving the program, but it will still max out at some level of performance no matter how many cores.
34 |
35 | ## Centralized Shared Memory
36 | Each core has its own cache, but all cores are connected to the same bus that shares main memory and I/O. Cores can send data to each other by writing to a location in main memory, and similarly they can share I/O. This architecture is called a Symmetric Multiprocessor (SMP); it is also an example of Uniform Memory Access (UMA), because every core sees the same access time to main memory.
37 | 
38 |
39 | ### Centralized Main Memory Problems
40 | * Memory Size
41 | * Need large memory \\(\Rightarrow\\) slow memory
42 | * Memory Bandwidth
43 | * Misses from all cores \\(\Rightarrow\\) memory bandwidth contention
44 | * As we add more cores, it starts to serialize against memory accesses, so we lose performance benefit
45 |
46 | Works well only for smaller machines (2, 4, 8, 16 cores)
47 |
48 | ## Distributed ~~Shared~~ Memory
49 | This uses message passing for a core to access another core's memory. If accesses are not in the local cache or the memory slice local to the core, it actually sends a request for this memory through the network to the core that has it. This system is also called a cluster or a multi-computer, since each core is somewhat like an individual computer.
50 | * 
51 |
52 | These computers tend to scale better - not because network communication is faster, but rather because the programmer is forced to explicitly think about distributed memory access and will tend to minimize communication.
53 |
54 | ## A Message-Passing Program
55 | ```cpp
56 | #define ASIZE 1024
57 | #define NUMPROC 4
58 | double myArray[ASIZE/NUMPROC]; // each processor allocates only 1/4 of entire array (assumes elements have been distributed somehow)
59 | double mySum = 0;
60 | for(int i=0; i < ASIZE/NUMPROC; i++)
61 | mySum += myArray[i];
62 | if(myPID==0) {
63 | for(int p=1; p < NUMPROC; p++) {
64 |       double pSum; // must be a double to match the sum each processor sends
65 | recv(p, pSum); // receives the sums from the other processors
66 | mySum += pSum;
67 | }
68 | printf("Sum: %lf\n", mySum);
69 | } else {
70 | send(0, mySum); // sends the local processor's sum to p0
71 | }
72 | ```
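For comparison, here is a minimal sketch of the same reduction written against a real message-passing API (MPI). This is my own illustration rather than lecture code, and it assumes each rank already holds its chunk of the array (filled with placeholder data here so the example runs):

```cpp
#include <mpi.h>
#include <cstdio>

#define ASIZE   1024
#define NUMPROC 4   // e.g. run with: mpirun -np 4 ./a.out

int main(int argc, char** argv) {
    MPI_Init(&argc, &argv);
    int myPID;
    MPI_Comm_rank(MPI_COMM_WORLD, &myPID);      // this process's rank (0..NUMPROC-1)

    double myArray[ASIZE / NUMPROC];            // local chunk; assume data already distributed
    for (int i = 0; i < ASIZE / NUMPROC; i++)
        myArray[i] = 1.0;                       // placeholder data so the example runs

    double mySum = 0.0, totalSum = 0.0;
    for (int i = 0; i < ASIZE / NUMPROC; i++)
        mySum += myArray[i];

    // MPI_Reduce plays the role of the explicit recv/send loop above:
    // every rank contributes mySum, and rank 0 receives the combined total.
    MPI_Reduce(&mySum, &totalSum, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);

    if (myPID == 0)
        printf("Sum: %lf\n", totalSum);

    MPI_Finalize();
    return 0;
}
```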
73 |
74 | ## A Shared Memory Program
75 | Benefit: No need to distribute the array!
76 |
77 | ```cpp
78 | #define ASIZE 1024
79 | #define NUMPROC 4
80 | shared double array[ASIZE]; // each processor needs the whole array, but it is shared
81 | shared double allSum = 0; // total sum in shared memory
82 | shared mutex sumLock; // shared mutex for locking
83 | double mySum = 0;
84 | for(int i=myPID*ASIZE/NUMPROC; i < (myPID+1)*ASIZE/NUMPROC; i++) {
85 | mySum += array[i]; // sums up this processor's own part of the array
86 | }
87 | lock(sumLock); // lock mutex for shared memory access
88 | allSum += mySum;
89 | unlock(sumLock); // unlock mutex
90 | // <<<<< Need a barrier here to prevent p0 from displaying result until other processes finish
91 | if(myPID==0) {
92 | printf("Sum: %lf\n", allSum);
93 | }
94 | ```
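For comparison (again my own sketch, not lecture code), the same pattern with standard C++ shared-memory primitives; joining the threads before printing plays the role of the barrier noted in the comment above:

```cpp
#include <cstdio>
#include <mutex>
#include <thread>
#include <vector>

#define ASIZE   1024
#define NUMPROC 4

double array[ASIZE];   // shared: visible to all threads, no distribution needed
double allSum = 0;     // shared total
std::mutex sumLock;    // protects allSum

void worker(int myPID) {
    double mySum = 0;
    for (int i = myPID * ASIZE / NUMPROC; i < (myPID + 1) * ASIZE / NUMPROC; i++)
        mySum += array[i];                      // sum this thread's slice
    std::lock_guard<std::mutex> guard(sumLock); // lock(sumLock) ... unlock(sumLock)
    allSum += mySum;
}

int main() {
    for (int i = 0; i < ASIZE; i++) array[i] = 1.0;  // placeholder data

    std::vector<std::thread> threads;
    for (int p = 0; p < NUMPROC; p++) threads.emplace_back(worker, p);
    for (auto& t : threads) t.join();  // join acts as the barrier before printing

    printf("Sum: %lf\n", allSum);
    return 0;
}
```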
95 |
96 | ## Message Passing vs Shared Memory
97 |
98 | | | Message Passing | Shared Memory |
99 | |-------------------|:---------------:|:-------------:|
100 | | Communication | Programmer | Automatic |
101 | | Data Distribution | Manual | Automatic |
102 | | Hardware Support | Simple | Extensive |
103 | | Programming | | |
104 | | - Correctness | Difficult | Less Difficult |
105 | | - Performance | Difficult | Very Difficult |
106 |
107 | Key difference: Shared memory is easier to get a correct solution, but there may be a lot of issues making it perform well. With Message Passing, typically once you've gotten a correct solution you also have solved a large part of the performance problem.
108 |
109 | ## Shared Memory Hardware
110 | * Multiple cores share physical address space
111 | * (all can issue addresses to any memory address)
112 | * Examples: UMA, NUMA
113 | * Multi-threading by time-sharing a core
114 | * Same core \\(\rightarrow\\) same physical memory
115 | * (not really benefiting from multi-threading)
116 | * Hardware multithreading in a core
117 | * Coarse-Grain: change thread every few cycles
118 | * Fine-Grain: change thread every cycle (more HW support)
119 | * Simultaneous Multi-Threading (SMT): any given cycle we can be doing multiple instructions from different threads
120 | * Also called Hyperthreading
121 |
122 | ## Multi-Threading Performance
123 | [🎥 View lecture video (8:25)](https://www.youtube.com/watch?v=ZpqeeHFWxes)
124 |
125 | Without multi-threading, processor activity in terms of width is very limited - there will be many waits in which no work can be done. As it switches from thread to thread (time sharing), the activity of each thread is somewhat sparse.
126 |
127 | In a Chip Multiprocessor (CMP), each core runs different threads. While the activity happens simultaneously in time, each core still has to deal with the sparse workload created by the threads it is running. (Total cost 2x)
128 |
129 | With Fine-Grain Multithreading, we have one core, but with separate sets of registers for different threads. Each thread's work switches from cycle to cycle, taking advantage of the periods of time in which a thread is waiting on something to be ready to do the next work. This keeps the core busier overall, at the slight expense of extra hardware requirements. (Total cost \\(\approx\\) 1.05x)
130 |
131 | With SMT, we can go one step further by mixing instructions from different threads into the same cycle. This dramatically reduces the total idle time of the processor. This requires much more hardware to support. (Total cost \\(\approx\\) 1.1x)
132 |
133 | ## SMT Hardware Changes
134 | * Fetch
135 | * Need additional PC to keep track of another thread
136 | * Decode
137 | * No changes needed
138 | * Rename
139 |   * Renamer does the same thing as it would for a single thread
140 | * Need another RAT for the other thread
141 | * Dispatch/Execution
142 | * No changes needed
143 | * ROB
144 | * Need a separate ROB with its own commit point
145 | * Alternatively: interleave in single ROB to save HW expense (more often in practice)
146 | * Commit
147 | * Need a separate ARF for the other thread.
148 | 
149 |
150 | Overall: cost is not that much higher - the most expensive parts do not need duplication to support SMT.
151 |
152 | ## SMT, D$, TLB
153 | With a VIVT cache, we simply send the virtual address into the cache and use the data that comes out. However, each thread may have different address spaces, so the cache has no clue which of the two it is getting. So with VIVT, we risk getting the wrong data!
154 |
155 | With a VIPT cache (w/o aliasing), the virtual address supplies the index while the TLB supplies the physical tag, so things will work correctly as long as the TLB provides the right translation. So for VIPT and PIPT both, SMT addressing will work fine as long as the TLB is thread-aware. One way to ensure this is with a thread bit on each TLB entry, and to pass the thread ID into the TLB lookup.
156 | 
157 |
158 | ## SMT and Cache Performance
159 | * Cache on the core is shared by all SMT threads
160 | * Fast data sharing (good)
161 | ```
162 | Thread 0: SW A
163 | Thread 1: LW A <--- really good performance
164 | ```
165 | * Cache capacity (and associativity) shared (bad)
166 | * If WS(TH0) + WS(TH1) - WS(TH0, TH1) > D$ size:
167 | \\(\Rightarrow\\) Cache Thrashing
168 |   * If WS(TH0) < D$ size (the thread alone fits in the cache) but the combined working sets do not:
169 |     \\(\Rightarrow\\) SMT performance can be worse than running the threads one-at-a-time
170 | * If the threads share a lot of data and fit into cache, maybe SMT performance doesn't have as many issues.
171 |
172 |
173 |
174 | *[CMP]: Chip Multiprocessor
175 | *[ARF]: Architectural Register File
176 | *[SMT]: Simultaneous Multi-Threading
177 | *[SMP]: Symmetric Multiprocessor
178 | *[WS]: Working Set
179 | *[D$]: Data Cache
180 | *[PIPT]: Physically Indexed Physically Tagged
181 | *[TLB]: Translation Look-Aside Buffer
182 | *[VIPT]: Virtually Indexed Physically Tagged
183 | *[VIVT]: Virtual Indexed Virtually Tagged
184 | *[ALU]: Arithmetic Logic Unit
185 | *[IQ]: Instruction Queue
186 | *[LB]: Load Buffer
187 | *[LW]: Load Word
188 | *[PC]: Program Counter
189 | *[RAT]: Register Allocation Table (Register Alias Table)
190 | *[ROB]: ReOrder Buffer
191 | *[SB]: Store Buffer
192 | *[SW]: Store Word
193 | *[UMA]: Uniform Memory Access
194 | *[NUMA]: Non-Uniform Memory Access
--------------------------------------------------------------------------------
/pipelining.md:
--------------------------------------------------------------------------------
1 | ---
2 | id: pipelining
3 | title: Pipelining
4 | sidebar_label: Pipelining
5 | ---
6 |
7 | [🔗Lecture on Udacity (54 min)](https://classroom.udacity.com/courses/ud007/lessons/3650589023/concepts/last-viewed)
8 |
9 | Given a certain latency, the first unit of work still takes the full latency, but successive units complete one after another immediately thereafter. (Oil pipeline vs. traveling with a bucket)
10 |
11 | ## Pipelining in a Processor
12 | Each logical phase of operation in the processor (fetch, decode, ALU, memory, write) operates as a pipeline in which successive instructions move through each phase immediately behind the current instruction, instead of waiting for the entire instruction to complete.
13 |
14 | 
15 |
16 | ## Pipeline CPI
17 | Is pipeline CPI always equal to 1?
18 |
19 | Counting the initial fill, CPI \\( \rightarrow \\) 1 as # instructions \\( \rightarrow \infty \\). If the pipeline stalls (one stage cannot proceed), then on the next cycle the stage after it does no work. Therefore, CPI is always > 1 if any stalls are possible. For example, if one stall cycle occurs for every 5 instructions, CPI = 6 cycles / 5 instructions = 1.2.
20 |
21 | ## Processor Pipeline Stalls
22 |
23 | In the example below, we see one example of a pipeline stall in a processor in a program like this:
24 |
25 | ```mipsasm
26 | LW R1, ...
27 | ADD R2, R1, 1
28 | ADD R3, R2, 1
29 | ```
30 |
31 | 
32 |
33 | The second instruction depends on the result of the first instruction, and must wait for it to complete the ALU and MEM phases before it can proceed. Thus, the CPI is actually much greater than 1.
34 |
35 | ## Processor Pipeline Stalls and Flushes
36 |
37 | ```mipsasm
38 | LW R1, ...
39 | ADD R2, R1, 1
40 | ADD R3, R2, 1
41 | JUMP ...
42 | SUB ... ...
43 | ADD ... ...
44 | SHIFT
45 | ```
46 |
47 | In this case, we have a jump, but we don't know where yet (maybe the address is being manipulated by previous instructions). Instructions following the jump instruction are loaded into the pipeline. When we get to the ALU phase, the `JUMP` will be processed, but this means the next instruction is not the `SUB ... ...` or `ADD ... ...`. These instructions are then flushed from the pipeline and the correct instruction (`SHIFT`) is fetched.
48 |
49 | 
50 |
51 | ## Control Dependencies
52 |
53 | For the following program, the `ADD`/`SUB` instructions have a control dependence on `BEQ`. Similarly, the instructions after `label:` also have a control dependence on `BEQ`.
54 |
55 | ```mipsasm
56 | ADD R1, R1, R2
57 | BEQ R1, R3, label
58 | ADD R2, R3, R4
59 | SUB R5, R6, R8
60 | label:
61 | MUL R5, R6, R8
62 | ```
63 |
64 | We estimate that:
65 | - 20% of instructions are branch/jump
66 | - slightly more than 50% of all branch/jump instructions are taken
67 |
68 | On a 5-stage pipeline, CPI = 1 + 0.1*2 = 1.2 (10% of the time (50% * 20%) an instruction spends two extra cycles)
69 |
70 | With a deeper pipeline (more stages), the number of wasted instructions increases.
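As a rough sketch with the same assumptions (20% branches, ~50% taken), if the branch is resolved in stage \\(k\\), a taken branch wastes \\(k-1\\) fetched instructions:

$$ CPI \approx 1 + (0.2 \times 0.5)(k-1) $$

With \\(k=3\\) as above this gives 1.2; a much deeper pipeline that resolves branches in stage 11 would give \\(1 + 0.1 \times 10 = 2.0\\).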
71 |
72 | ## Data Dependencies
73 |
74 | ```mipsasm
75 | ADD R1, R2, R3
76 | SUB R7, R1, R8
77 | MUL R1, R5, R6
78 | ```
79 |
80 | This program has 3 dependencies:
81 |
82 | 1. Lines 1 and 2 have a RAW (Read-After-Write) dependence on R1 (also called Flow dependence, or TRUE dependence). The `ADD` instruction must be completed before the `SUB` instruction.
83 |
84 | 2. Lines 1 and 3 have a WAW (Write-After-Write) dependence on R1: the `ADD` must write R1 before the `MUL` does, otherwise R1 is left holding a stale, incorrect value. This is also called an Output dependence.
85 |
86 | 3. Lines 2 and 3 have a WAR (Write-After-Read) dependence on R1 in that the `SUB` instruction must use the value of R1 before the `MUL` instruction overwrites it. This is also called an Anti-dependence because it reverses the order of the flow dependence.
87 |
88 | WAW and WAR dependencies are also called "False" or "Name" dependencies. RAR (Read-After-Read) dependencies do not matter since the value could not have changed in-between and is thus safe to read.
89 |
90 | ### Data Dependencies Quiz
91 |
92 | (Included for additional study help). For the following program, select which dependencies exist:
93 |
94 | ```mipsasm
95 | MUL R1, R2, R3
96 | ADD R4, R4, R1
97 | MUL R1, R5, R6
98 | SUB R4, R4, R1
99 | ```
100 |
101 | | | RAW | WAR | WAW |
102 | |---|---|---|---|
103 | | \\( I1 \rightarrow I2 \\) | x | - | - |
104 | | \\( I1 \rightarrow I3 \\) | - | - | x |
105 | | \\( I1 \rightarrow I4 \\) | - | - | - |
106 | | \\( I2 \rightarrow I3 \\) | - | x | - |
107 |
108 | ## Dependencies and Hazards
109 |
110 | Dependence - property of the program alone.
111 |
112 | Hazard - when a dependence results in incorrect execution.
113 |
114 | For example, in a 5-stage pipeline, a dependency that is 3 instructions apart may not cause a hazard, since the result will be written before the dependent instruction reads it.
115 |
116 | ## Handling of Hazards
117 |
118 | First, detect hazard situations. Then, address it by:
119 | 1. Flush dependent instructions
120 | 2. Stall dependent instruction
121 | 3. Fix values read by dependent instructions
122 |
123 | Must use flushes for control dependencies, because the instructions that come after the hazard are the wrong instructions.
124 |
125 | For data dependence, we can stall the next instruction, or fix the values it reads by forwarding the result to the correct stage of the pipeline (e.g. "keep" the value inside the ALU stage for the next instruction to use). Forwarding does not always work, because the value we need may be produced later in time than the point at which it is needed. In these cases we must stall.
126 |
127 | ## How Many Stages?
128 |
129 | For an ideal CPI = 1, we consider the following:
130 |
131 | More Stages \\( \rightarrow \\) more hazards (CPI \\( \uparrow \\)), but less work per stage ( cycle time \\( \downarrow \\))
132 |
133 | From iron law, Execution Time = # Instructions * CPI * Cycle Time
134 |
135 | \# Stages should be chosen to balance CPI and cycle time (there is some minimum in execution time where cycle time has decreased without yet causing too many additional hazards). Additionally, consider that more stages consume more power (the same work is done at a faster clock with more latches per stage).
136 |
137 | * Performance (execution time) \\( \Rightarrow \\) 30-40 stages
138 | * Manageable Power Consumption \\( \Rightarrow \\) 10-15 stages
139 |
140 | *[ALU]: Arithmetic Logic Unit
141 | *[CPI]: Cycles Per Instruction
142 | *[PC]: Program Counter
143 | *[RAW]: Read-After-Write
144 | *[WAR]: Write-After-Read
145 | *[WAW]: Write-After-Write
146 | *[RAR]: Read-After-Read
--------------------------------------------------------------------------------
/predication.md:
--------------------------------------------------------------------------------
1 | ---
2 | id: predication
3 | title: Predication
4 | sidebar_label: Predication
5 | ---
6 |
7 | [🔗Lecture on Udacity (30 min)](https://classroom.udacity.com/courses/ud007/lessons/3617709440/concepts/last-viewed)
8 |
9 | ## Predication
10 | Predication is another way we can try to deal with control hazards and dependencies by dividing the work into both directions.
11 |
12 | Branch prediction
13 | - Guess where it is going
14 | - No penalty if correct
15 | - Huge penalty if wrong (misprediction cost ~50 instructions)
16 |
17 | Predication
18 | - Do work of both directions
19 | - Waste up to 50%
20 | - Throw away work from wrong path
21 |
22 | | Construct/Type | Branch Prediction | Predication | Winner |
23 | | --- | --- | --- | --- |
24 | | Loop | Very good, tends to take same path more often | Extra work is outside the loop and often wasted | Predict |
25 | | Call/Ret | Can be perfect, always takes this path | Any other branch is always wasted | Predict|
26 | | Large If/Then/Else | May waste large number of instructions | Either way we lose a large number of instructions | Predict |
27 | | Small If/Then/Else | May waste a large number of instructions | Only waste a few instructions | Predication, depending on BPred accuracy |
28 |
29 | ## If-Conversion
30 | Compiler takes code like this:
31 | ```cpp
32 | if(cond) {
33 | x = arr[i];
34 | y = y+1;
35 | } else {
36 | x = arr[j];
37 | y = y-1;
38 | }
39 | ```
40 | And changes it to this (work of both paths):
41 | ```cpp
42 | x1 = arr[i];
43 | x2 = arr[j];
44 | y1 = y+1;
45 | y2 = y-1;
46 | x = cond ? x1 : x2;
47 | y = cond ? y1 : y2;
48 | ```
49 |
50 | There is still a question of how to do the conditional assignment. For example if we do something like this:
51 |
52 | ```mipsasm
53 | BEQ ...
54 | MOV X, X2
55 | B Done
56 | MOV X, X1
57 | ... etc.
58 | ```
59 | then we haven't done much, because we just converted one branch into another branch and possibly have two mispredictions. We need a move instruction that is conditional on some flag being set.
60 |
61 | ## Conditional Move
62 | In MIPS instruction set there are the following instructions
63 |
64 | |inst | operands | does |
65 | |---|---|---|
66 | | `MOVZ` | Rd, Rs, Rt | `if(Rt == 0) Rd=Rs;` |
67 | | `MOVN` | Rd, Rs, Rt | `if(Rt != 0) Rd=Rs;` |
68 |
69 | So we can implement our conditional assignment `x = cond ? x1 : x2;` as follows:
70 |
71 | ```mipsasm
72 | R3 = cond
73 | R1 = ... x1 ...
74 | R2 = ... x2 ...
75 | MOVN X, R1, R3
76 | MOVZ X, R2, R3
77 | ```
78 |
79 | Similarly in x86 there are many instructions (`CMOVZ`, `CMOVNZ`, `CMOVGT`, etc.) that operate based on the flags, where the move operation completes based on the condition codes provided.
80 |
81 | ## MOVZ/MOVN Performance
82 | Is the If-Conversion really faster? Consider the following example. The branched loop can average 10.5 instructions given a large penalty. The If-Converted form uses more instructions to do the work, but never receives a penalty, for an average of 4 instructions.
83 |
84 | 
85 |
86 | ## Full Predication HW Support
87 |
88 | For MOV_cc we need a separate opcode. For Full Predication, we add condition bits to _every_ instruction. For example, the Itanium instruction has 41 bits, where the least significant 6 bits specify a "qualifying predicate". The predicates are actually small 1-bit registers, and the 6-bit code tells the processor which of the 64 one-bit predicate registers to use.
89 |
90 | ## Full Predication Example
91 | In this example, the same code as before now uses a predication construct. `MP.EQZ` sets up predicates `p1` and `p2`. The `ADDI` instructions then use the proper predicate to determine whether they actually take effect. This avoids needing extra registers, special instruction codes, etc.
92 |
93 | 
94 |
95 |
96 |
97 | *[2BP]: 2-bit Predictor
98 | *[2BC]: 2-Bit Counter
99 | *[2BCs]: 2-Bit Counters
100 | *[ALU]: Arithmetic Logic Unit
101 | *[BHT]: Branch History Table
102 | *[BTB]: Branch Target Buffer
103 | *[CPI]: Cycles Per Instruction
104 | *[IF]: Instruction Fetch
105 | *[PC]: Program Counter
106 | *[PHT]: Pattern History Table
107 | *[RAS]: Return Address Stack
108 | *[XOR]: Exclusive-OR
109 |
--------------------------------------------------------------------------------
/reorder-buffer.md:
--------------------------------------------------------------------------------
1 | ---
2 | id: reorder-buffer
3 | title: ReOrder Buffer
4 | sidebar_label: ReOrder Buffer
5 | ---
6 |
7 | [🔗Lecture on Udacity (2 hr)](https://classroom.udacity.com/courses/ud007/lessons/945398787/concepts/last-viewed)
8 |
9 | ## Exceptions in Out Of Order Execution
10 |
11 | Consider this set of instructions as a result of Tomasulo:
12 |
13 | | | Instruction | Issue | Disp | WR |
14 | |---|---|---|---|---|
15 | | 1 | `DIV F10, F0, F6` | 1 | 2 | 42 |
16 | | 2 | `L.D F2, 45(R3)` | 2 | 3 | 13 |
17 | | 3 | `MUL F0, F2, F4` | 3 | 14 | 19 |
18 | | 4 | `SUB F8, F2, F6` | 4 | 14 | 15 |
19 |
20 | Let's say that in instruction 1, F6 == 0, causing a Divide By Zero exception in cycle 40. Normally an exception would cause the PC to be saved, execution would go to an exception handler, and then operations would resume on return. However, in this case, by the time the exception occurs, F0 has been overwritten by instruction 3, so re-executing instruction 1 would produce an incorrect result. This can also happen if a Load has a page fault, among various other causes.
21 |
22 | ## Branch Misprediction in OOO Execution
23 |
24 | Similarly to before, what happens if branch misprediction has caused a register to be overwritten before another instruction completes?
25 |
26 | ```mipsasm
27 | DIV R1, R3, R4
28 | BEQ R1, R2, Label
29 | ADD R3, R4, R5
30 | ...
31 | DIV ...
32 | Label:
33 | ...
34 | ```
35 |
36 | If `DIV` takes many cycles, it will take a long time before we realize the branch was mispredicted. In the meantime, the third instruction may have completed and updated `R3`. When we realize the misprediction, we should behave as if the wrong branch had never executed, but this is now impossible.
37 |
38 | Another issue is with Phantom Exceptions. Let's say the branch was mispredicted, and the program went on and the final DIV statement caused an exception - it could be in the exception handler before we even know the branch was mispredicted and the statement never should have run.
39 |
40 | ## Correct OOO Execution
41 | - Execute OOO
42 | - Broadcast OOO
43 | - But deposit values to registers In-Order!
44 |
45 | We need a structure called a ReOrder Buffer:
46 | - Remembers program order
47 | - Keeps results until safe to write to registers
48 |
49 | ## ReOrder Buffer (ROB)
50 |
51 | This structure contains the register that will be updated, the value it should be updated to, and whether the corresponding instruction is really done and ready to be written. The instructions are in program order.
52 |
53 | 
54 |
55 | There are also pointers to the next entry to allocate at issue, and the next entry to be committed (creating a standard FIFO buffer structure via head/tail pointers).
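A minimal sketch of what such a structure might look like (my own illustration; the field names and the 64-entry size are assumptions, not the lecture's design):

```cpp
#include <cstdint>

// One ROB entry: destination register, the value to write, and status flags.
struct RobEntry {
    int      destReg;   // architectural register to update at commit
    uint64_t value;     // result produced by the instruction (valid once done)
    bool     done;      // has the result been broadcast yet?
    bool     exception; // tagged if this instruction raised an exception
};

// The ROB itself is a circular FIFO with head (commit) and tail (issue) pointers.
struct ReorderBuffer {
    static const int SIZE = 64;
    RobEntry entries[SIZE];
    int head = 0;   // next entry to commit (oldest instruction)
    int tail = 0;   // next free entry to allocate at issue
    int count = 0;  // number of occupied entries

    bool full()  const { return count == SIZE; }
    bool empty() const { return count == 0; }

    int allocate() {                 // at issue: reserve the tail entry
        int idx = tail;
        entries[idx].done = false;
        entries[idx].exception = false;
        tail = (tail + 1) % SIZE;
        count++;
        return idx;                  // this index is what the RAT points to
    }

    bool canCommit() const {         // commit only the oldest entry, and only once done
        return !empty() && entries[head].done;
    }
};
```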
56 |
57 | [🎥 How ROB fits into Tomasulo](https://www.youtube.com/watch?v=0w6lXz71eJ8) |
58 | [🎥 Committing a WR to registers](https://www.youtube.com/watch?v=p6clkAsUV7E)
59 |
60 | Basically, the ROB takes the place of the link between the RAT and the RS; both now point to this intermediary ROB structure. So the RS is only concerned with dispatching instructions, and doesn't have to wait for the result to be broadcast before freeing its spot (because the RS entry itself is no longer the register alias that has to be resolved).
61 |
62 | ## Hardware Organization with ROB
63 |
64 | Things are much the same as before. We still have an IQ that feeds into one or more RS, and we have a RAT that may have entries pointing to an RF. The primary difference is where before the RAT may have pointed to the RS, it will now point to the ROB instead. On each cycle, the ROB may commit its result to the RF and RS if the associated instruction is considered "done".
65 |
66 | ## Branch Misprediction Recovery
67 |
68 | Describing how ROB works on a branch misprediction:
69 | - [🎥 Part 1](https://www.youtube.com/watch?v=GG09xSZ32MU) |
70 | - [🎥 Part 2](https://www.youtube.com/watch?v=ozc8ceuw-GA)
71 |
72 | The summary is that once the commit pointer reaches a mispredicted branch, we simply empty the ROB (issue pointer = commit pointer). The register file contains the correct registers. The RAT is modified to point to the associated registers. And finally, the next fetch will fetch from the correct PC. Therefore, the ROB makes recovery from a mispredicted branch fairly straightforward.
73 |
74 | ## ROB and Exceptions
75 |
76 | Reminder: two scenarios:
77 | 1. Exception in long-running instruction (or page fault)
78 | - Simply flush the ROB and move to exception handler
79 | - ROB ensures that no wrong registers have been committed and the program is in a good state for the exception handler
80 | 2. Phantom exceptions for mispredicted branch
81 | - The instruction with an exception is tagged as an exception in the ROB
82 | - When we figure out the branch is mispredicted it is handled like any other misprediction and thus the exception does not get handled (as it doesn't reach the commit point for the instruction with the exception)
83 |
84 | ## Commit == Outside view of "Executed"
85 |
86 | [🎥 Video explanation](https://www.youtube.com/watch?v=vpPjDW48v90). The main point is that ROB makes it so the processor itself may see instructions executed in any order with mispredicted branches, but the programmer's view of the program's execution is that all instructions are committed in order with no wrong branches taken.
87 |
88 | ## RAT Updates on Commit
89 |
90 | [🎥 Video explanation](https://www.youtube.com/watch?v=PYFg7QOfcvI). This example walks through how the ROB, RAT, and REGS work together during the commit phase. At a basic level, as instructions in the ROB are completed, it updates the register values accordingly. However, it will only update the RAT (to change its pointer from a ROB entry to a Register) if it just committed the ROB instruction the RAT entry was pointing to. For example, if the RAT entry for `R3` was pointing to `ROB2`, when `ROB2` executes it will commit the update to the associated register, and also update the RAT entry `R3` to point to `R3` now. In this way, registers and RAT are both kept updated only when the instruction is finally committed.
91 |
92 | ## ROB Example
93 | _These examples are best viewed as videos, so links are below..._
94 |
95 | 1. [🎥 Cycles 1-2](https://www.youtube.com/watch?v=39AFF5Qq5DI)
96 | 2. [🎥 Cycles 3-4](https://www.youtube.com/watch?v=c3hFm_DOUA0)
97 | 3. [🎥 Cycles 5-6](https://www.youtube.com/watch?v=4nZN_mLcCJo)
98 | 4. _(cycles 7-12 are "fast forwarded")_
99 | 5. [🎥 Cycles 13-24](https://www.youtube.com/watch?v=bE3IFvoChyw)
100 | 6. [🎥 Cycles 25-43](https://www.youtube.com/watch?v=HmURweRTsU4)
101 | 7. [🎥 Cycles 44-48](https://www.youtube.com/watch?v=V0nywwV0lKU)
102 | 8. [🎥 Timing Example](https://www.youtube.com/watch?v=f9IcEtKTz8k)
103 |
104 | ## Unified Reservation Stations
105 |
106 | With separate RS for separate units (e.g. ADD, MUL), running out of RS spots for one unit will often prevent instructions for the other unit from being issued too (because instructions must be issued in order). The structures themselves are functionally the same, so all RS can be unified into one larger array, allowing more total instructions to be issued. However, this requires additional logic in the dispatch unit to target the correct functional unit. Reservation stations are expensive, so it may be better to add that logic than to let stations go unused.
107 |
108 | ## Superscalar
109 |
110 | Previous examples were limited to one instruction per cycle. For superscalar, we need to consider the following:
111 | * Fetch > 1 inst/cycle
112 | * Decode > 1 inst/cycle
113 | * Issue > 1 inst/cycle (still should be in order)
114 | * Dispatch > 1 inst/cycle
115 | * May require multiple units of each functional type
116 | * Broadcast > 1 result/cycle
117 | * This involves not only having more buses for each result, but every RS has to compare with every bus each cycle
118 | * Commit > 1 inst/cycle
119 | * Must still obey rule of in-order commits
120 |
121 | With all of these, we must consider the "weakest link". If all of these are very large but one is limited to 3 inst/cycle, that will be the bottleneck in the pipeline.
122 |
123 | ## Terminology Confusion
124 |
125 | | Academics | Companies, other papers |
126 | | --- | --- |
127 | | Issue | Issue, Allocate, Dispatch |
128 | | Dispatch | Execute, Issue, Dispatch |
129 | | Commit | Commit, Complete, Retire, Graduate |
130 |
131 | So, it's complicated.
132 |
133 | ## Out of Order
134 |
135 | In an out-of-order processor, not ALL pipeline stages are processing instructions out of order. Some stages must still be in-order to preserve proper dependencies.
136 |
137 | | Stage | Order |
138 | | --- | --- |
139 | | Fetch | In-Order |
140 | | Decode | In-Order |
141 | | Issue | In-Order |
142 | | Execute | Out-of-Order |
143 | | Write/Bcast | Out-of-Order |
144 | | Commit | In-Order |
145 |
146 |
147 | ## Additional Resources
148 |
149 | From TA Nolan, here is an attempt to document how a CPU with ROB works:
150 |
151 | ```
152 | While there is an instruction to issue
153 | If there is an empty ROB entry and an empty appropriate RS
154 | Put opcode into RS.
155 | Put ROB entry number into RS.
156 | For each operand which is a register
157 | If there is a ROB entry number in the RAT for that register
158 | Put the ROB entry number into the RS as an operand.
159 | else
160 | Put the register value into the RS as an operand.
161 | Put opcode into ROB entry.
162 | Put destination register name into ROB entry.
163 | Put ROB entry number into RAT entry for the destination register.
164 | Take the instruction out of the instruction window.
165 |
166 | For each RS
167 | If RS has instruction with actual values for operands
168 | If an appropriate ALU or processing unit is free
169 | Dispatch the instruction, including operands and the ROB entry number.
170 | Free the RS.
171 |
172 | While the next ROB entry to be retired has a result
173 | Write the result to the register.
174 | If the ROB entry number is in the RAT, remove it.
175 | Free the ROB entry.
176 |
177 | For each ALU
178 | If the instruction is complete
179 | Put the result into the ROB entry corresponding to the destination register.
180 | For each RS waiting for it
181 | Put the result into the RS as an operand.
182 | Free the ALU.
183 | ```
184 |
185 |
186 |
187 |
188 | *[ALU]: Arithmetic Logic Unit
189 | *[CPI]: Cycles Per Instruction
190 | *[ILP]: Instruction-Level Parallelism
191 | *[IPC]: Instructions per Cycle
192 | *[IQ]: Instruction Queue
193 | *[LB]: Load Buffer
194 | *[LW]: Load Word
195 | *[OOO]: Out Of Order
196 | *[PC]: Program Counter
197 | *[RAT]: Register Allocation Table (Register Alias Table)
198 | *[RAW]: Read-After-Write
199 | *[ROB]: ReOrder Buffer
200 | *[SB]: Store Buffer
201 | *[SW]: Store Word
202 | *[WAR]: Write-After-Read
203 | *[WAW]: Write-After-Write
204 | *[RAR]: Read-After-Read
--------------------------------------------------------------------------------
/storage.md:
--------------------------------------------------------------------------------
1 | ---
2 | id: storage
3 | title: Storage
4 | sidebar_label: Storage
5 | ---
6 |
7 | [🔗Lecture on Udacity (30 min)](https://classroom.udacity.com/courses/ud007/lessons/872590121/concepts/last-viewed)
8 |
9 | ## Storage
10 | Reasons we need it:
11 | * Files - Programs, data, settings
12 | * Virtual Memory
13 |
14 | We care about:
15 | * Performance
16 | * Throughput - improving (not as quickly as processor speed)
17 | * Latency - improving (very slowly, more so than DRAM)
18 | * Reliability
19 |
20 | Types we can use are very diverse
21 | * Magnetic Disks
22 | * Optical Disks
23 | * Tape
24 | * Flash
25 | * ... etc.
26 |
27 | ## Magnetic Disks
28 | Multiple platters rotate around a spindle. A head assembly sits beside these platters, with a read/write head for each platter surface (both sides of each platter are used). Where a head is positioned on the platter is called a track - this is the circle of data the head can access while in that position as the platters rotate. A cylinder is the collection of the same track across all platters. The head assembly moves position, which allows access to other tracks.
29 |
30 | 
31 |
32 | The data will be organized into position order along a track, and divided into sectors. Each sector contains a preamble, some data, and some kind of checksum or error correction code.
33 |
34 | Disk capacity then can be represented by the multiplication of the following (a worked example follows the list):
35 | - number of platters x 2 (surfaces)
36 | - number of tracks/surface (cylinders)
37 | - number of sectors/track
38 | - number of bytes/sector
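For example, with illustrative numbers (4 platters, 100,000 tracks/surface, 1,000 sectors/track, 512 bytes/sector):

$$ 4 \times 2 \times 100{,}000 \times 1{,}000 \times 512\,\text{B} \approx 410\,\text{GB} $$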
39 |
40 | ### Access time for Magnetic Disks
41 | (If the disk is spinning already)
42 | - Seek Time - move the head assembly to the correct cylinder
43 | - Rotational Latency - Wait for the start of our sector to get under the head
44 | - Data Read - read until end of sector seen by head
45 | - Controller Time - checksum, determine sector is ok
46 | - I/O Bus Time - get the data into main memory
47 |
48 | ...plus a significant Queuing Delay (wait for previous accesses to complete before we can move the head again)
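A back-of-the-envelope example with assumed numbers: 5 ms average seek, 10,000 RPM rotation (6 ms per revolution, so ~3 ms average rotational latency), and ~0.1 ms combined for data read, controller, and bus:

$$ t_{access} \approx 5\,\text{ms} + 3\,\text{ms} + 0.1\,\text{ms} \approx 8\,\text{ms} \quad (\text{plus any queuing delay}) $$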
49 |
50 | ### Trends for Magnetic Disks
51 | - Capacity
52 | - 2x per 1-2 years
53 | - Seek Time
54 | - 5-10ms, very slow improvement (primarily due to shrinking disk diameter)
55 | - Rotation:
56 | - 5,000 RPM \\(\rightarrow\\) 10,000 RPM \\(\rightarrow\\) 15,000 RPM
57 | - Materials and noise improvements
58 | - Improves slowly
59 | - Controller, Bus
60 | - Improves at OK rate
61 | - Currently a small fraction of overall time
62 |
63 | Overall performance is dominated by materials and mechanics, not something like Moore's Law - so it is very difficult to rapidly improve on this technology.
64 |
65 | ## Optical Disks
66 | Similar to hard disk platter, except we use a laser instead of a magnetic head. Additionally, where a hard disk needs enclosing to keep contaminants out, this is not as important on an optical disk. So being open (insert/remove any disk) allows portability, as does standardization of technologies.
67 |
68 | However, improving this technology is not very easy because portability requires standards that must be agreed upon, which takes time. Then products still have to be made to address the new technologies, and will take even more time to be adopted by consumers.
69 |
70 | ## Magnetic Tape
71 | * Backup (secondary storage)
72 | * Large capacity, replaceable
73 | * Sequential access (good for large sequential data, poor for virtual memory)
74 | * Currently dying out
75 | * Low production volume \\(\Rightarrow\\) cost not dropping as rapidly as disks
76 |   * Cheaper to use more disks (or USB hard drives) than to invest in new tape equipment and the tapes to use with it
77 |
78 | ## Using RAM for storage
79 | * Disk about 100x cheaper per GB
80 | * DRAM has about 100,000x better latency
81 |
82 | Solid-State Disk (SSD)
83 | * DRAM + Battery
84 | * Fast!
85 | * Reliable
86 | * Extremely Expensive
87 | * Not good for archiving (must be powered)
88 | * Flash
89 | * Low Power
90 | * Fast (but slower than DRAM)
91 | * Smaller Capacity (GB vs TB)
92 | * Keeps data alive without power
93 |
94 | ## Hybrid Magnetic Flash
95 | Combine magnetic disk with Flash
96 | * Magnetic Disk
97 | * Low $/GB
98 | * Huge Capacity
99 | * Power Hungry
100 | * Slow (mechanical movement)
101 | * Sensitive to impacts while spinning
102 | * Flash
103 | * Fast
104 | * Power Efficient
105 | * No moving parts
106 |
107 | Use both!
108 | * Use Flash as cache for disk
109 | * Can potentially power down the disk when what we need is in cache
110 |
111 | ## Connecting I/O Devices
112 | 
113 | Main point: Things closer to the CPU may need full speed of the bus. Things farther away or naturally slower (e.g. storage, USB) will not utilize as much of the speed, so are more interested in standardization of the bus/controller technology, which grows at slower rates.
--------------------------------------------------------------------------------
/synchronization.md:
--------------------------------------------------------------------------------
1 | ---
2 | id: synchronization
3 | title: Synchronization
4 | sidebar_label: Synchronization
5 | ---
6 |
7 | [🔗Lecture on Udacity (1hr)](https://classroom.udacity.com/courses/ud007/lessons/906999159/concepts/last-viewed)
8 |
9 | ## Synchronization Example
10 | In the following example, each thread is counting letters on its own half of the document. Most of the time, this can work out ok as each thread is operating on different parts of the document and things are being accessed in different orders depending on the letters each thread encounters. But what if both threads simultaneously attempt to update the counter for letter 'A'? Both threads `LW` from the same address for that counter, increment the value, and store it back. In the end, the counter only increments once instead of twice. We need a way to separate lines 2-4 into an Atomic Section (Critical Section) that cannot run simultaneously. With such a construct, one thread completes and stores the incremented value, and cache coherence ensures the other thread obtains the newly updated value before incrementing it.
11 |
12 | 
13 |
14 | ### Synchronization Example - Lock
15 | The type of synchronization we use for atomic/critical sections is called Mutual Exclusion (Lock). When entering a critical section, we request a lock on some variable, and then unlock it once complete. The lock construct ensures that only one thread can lock it at a time, in no particular order - whichever thread is able to lock it first. The code above may now look like
16 |
17 | ```mipsasm
18 | LW L, 0(R1)
19 | lock CountLock[L]
20 | LW R, Count[L]
21 | ADDI R, R, 1
22 | SW R, Count[L]
23 | unlock CountLock[L]
24 | ```
25 |
26 | ## Lock Synchronization
27 | Naive implementation:
28 | ```cpp
29 | 1 typedef int mutex_type;
30 | 2 void lock_init(mutex_type &lockvar) {
31 | 3 lockvar = 0;
32 | 4 }
33 | 5 void lock(mutex_type &lockvar) {
34 | 6 while(lockvar == 1);
35 | 7 lockvar = 1;
36 | 8 }
37 | 9 void unlock(mutex_type &lockvar) {
38 | 10 lockvar = 0;
39 | 11 }
40 | ```
41 | This implementation has an issue in that if both threads attempt to execute line 6 (`while`) simultaneously while `lockvar` is 0, both may wind up in the critical section at the same time. So really, lines 6 and 7 need to be in an atomic section of their own! So it seems there is a paradox, in that we need locks to implement locks.
42 |
43 | ## Implementing `lock()`
44 | ```cpp
45 | 1 void lock(mutex_type &lockvar) {
46 | 2 Wait:
47 | 3 lock_magic();
48 | 4 if(lockvar == 0) {
49 | 5 lockvar = 1;
50 | 6 goto Exit;
51 | 7 }
52 | 8 unlock_magic();
53 | 9 goto Wait;
54 | 10 Exit:
55 | 11 }
56 | ```
57 |
58 | But, we know there is no magic.
59 | Options:
60 | - Lamport's Bakery Algorithm
61 | - Complicated, expensive (makes lock slow)
62 | - Special atomic read/write instructions
63 | - We need an instruction that both reads and writes memory
64 |
65 | ## Atomic Instructions
66 | - Atomic Exchange
67 | - Swaps contents of the two registers
68 | - `EXCH R1, 78(R2)` allows us to implement lock:
69 | ```cpp
70 | R1 = 1;
71 | while(R1 == 1)
72 | EXCH R1, lockvar;
73 | ```
74 |   - If at any point lockvar becomes 0, we swap it with R1 and exit the while loop. If other threads attempt the exchange at the same time, they swap in a 1 and read back the 1 we just wrote, so only one thread ever observes the 0.
75 | - Drawback: writes all the time, even while the lock is busy
76 | - Test-and-Write
77 | - We test a location, and if it meets some sort of condition, we then do the write.
78 | - `TSTSW R1, 78(R2)` works like:
79 | ```cpp
80 | if(Mem[78+R2] == 0) {
81 | Mem[78+R2] = R1;
82 | R1 = 1;
83 | } else {
84 | R1 = 0;
85 | }
86 | ```
87 | - Implement lock by doing:
88 | ```cpp
89 | do {
90 | R1=1;
91 | TSTSW R1, lockvar;
92 | } while(R1==0);
93 | ```
94 | - This is good because we only do the write (and thus cause cache invalidations) when the condition is met. So, there is only communication happening when locks are acquired or freed.
95 | - Load Linked/Store Conditional
96 |
97 | ## Load Linked/Store Conditional (LL/SC)
98 | - Atomic Read/Write in same instruction
99 |   - Bad for pipelining (Fetch, Decode, ALU, Mem, Write) - would require extra stages (Mem2, Mem3) to handle all of the reads/writes needed
100 | - Separate into 2 instructions
101 | - Load Linked
102 | - Like a normal `LW`
103 | - Save address in Link register
104 | - Store Conditional
105 | - Check if address is same as in Link register
106 | - Yes? Normal `SW`, return 1
107 | - No? Return 0!
108 |
109 | ### How is LL/SC Atomic?
110 | The key is that if some other thread manages to write to lockvar in-between the `LL`/`SC`, the link register will be 0 and the `SC` will fail, due to cache coherence. A major benefit here is that simple critical sections no longer need locks, as we can `LL`/`SC` directly on the variable.
111 |
112 | 
113 |
114 | ## Locks and Performance
115 | [🎥 View lecture video (3:20)](https://www.youtube.com/watch?v=JS88digI8iQ)
116 |
117 | Atomic Exchange has poor performance given that each core is constantly exchanging blocks, triggering invalidations on other cores. This level of overhead is very power hungry and slows down useful work that could be done.
118 |
119 | ### Test-and-Atomic-Op Lock
120 | Recall original implementation of atomic exchange:
121 |
122 | ```cpp
123 | R1 = 1;
124 | while(R1 == 1)
125 | EXCH R1, lockvar;
126 | ```
127 |
128 | We can improve this via normal loads and only using `EXCH` if the local read shows the lock is free. Now, the purpose of the exchange is only as a final way to safely obtain the lock.
129 |
130 | ```cpp
131 | R1 = 1;
132 | while(R1 == 1) {
133 | while(lockvar == 1);
134 | EXCH R1, lockvar;
135 | }
136 | ```
137 |
138 | This implementation allows `lockvar` to be cached locally (and simply read there) until it is released by the core inside the critical section. This eliminates nearly all coherence traffic while the lock is held, and leaves the bus free for the core holding the lock to do its work and to handle the coherence requests for `lockvar` when it finally releases it.
139 |
140 | ## Barrier Synchronization
141 | A barrier is a form of synchronization that waits for all threads to complete some work before proceeding further. An example of this is if the program is attempting to operate on data computed by multiple threads - all threads need to have completed their work before the thread(s) responsible for processing that work can continue.
142 |
143 | - All threads must arrive at the barrier before any can leave.
144 | - Two variables:
145 | - Counter (count arrived threads)
146 | - Flag (set when counter == N)
147 |
148 | ### Simple Barrier Implementation
149 | ```cpp
150 | 1 lock(counterlock);
151 | 2 if(count==0) release=0; // re-initalize release
152 | 3 count++; // count arrivals
153 | 4 unlock(counterlock);
154 | 5 if(count==total) {
155 | 6 count=0; // re-initialize barrier
156 | 7 release=1; // set flag to release all threads
157 | 8 } else {
158 | 9 spin(release==1); // wait until release==1
159 | 10 }
160 | ```
161 |
162 | This implementation has one major flaw. Upon the first barrier encounter, this works correctly. However, what if the work then continues for awhile and we try to go back to using these barrier variables for synchronization? We expect that we will start with `count == 0` and `release == 0`, but this may not be true...
163 |
164 | ### Simple Barrier Implementation Doesn't Work
165 | The main issue with using this barrier implementation multiple times is that some threads may not pick up the release right away. Consider that the final thread to hit the barrier sets `release=1`, but maybe Thread 0 is off doing something else like handling an interrupt and doesn't see it. But the other threads have already been released and are not doing much work, and one could come back and hit the barrier again, setting `release=0`. Thread 0 is finally done with its work, and sees that `release==0` and continues to wait at the first barrier! Furthermore, the other threads are now waiting on `release==1` on the second instance of the barrier, and since Thread 0 is now locked up this will never happen, resulting in deadlock.
166 |
167 | ### Reusable Barrier
168 | ```cpp
169 | 1 localSense = !localSense;
170 | 2 lock(counterlock);
171 | 3 count++;
172 | 4 if(count==total) {
173 | 5 count=0;
174 | 6 release=localSense;
175 | 7 }
176 | 8 unlock(counterlock);
177 | 9 spin(release==localSense);
178 | ```
179 |
180 | This implementation works because it functions as a sort of flip-flop on each barrier instance. We are never re-initializing `release`, but only `localSense`. So each time we hit the barrier it simply waits on the release to "flip", at which point it continues. So even if some threads continue work up until the next barrier before another thread is released from the barrier, then they must still wait on that thread to also arrive at that barrier.
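A compilable sketch of the same sense-reversing idea (my own illustration, using C++ atomics and threads; an atomic counter stands in for the `counterlock` mutex, and `localSense` is kept per-thread, which is why it never needs global re-initialization):

```cpp
#include <atomic>
#include <cstdio>
#include <thread>
#include <vector>

const int TOTAL = 4;
std::atomic<int>  count{0};
std::atomic<bool> release{false};

void barrier(bool& localSense) {
    localSense = !localSense;                 // flip this thread's expected sense
    if (count.fetch_add(1) + 1 == TOTAL) {    // last thread to arrive...
        count.store(0);                       // ...re-initializes the counter
        release.store(localSense);            // ...and flips the release flag
    } else {
        while (release.load() != localSense)  // spin until the flag flips
            ;
    }
}

int main() {
    std::vector<std::thread> threads;
    for (int t = 0; t < TOTAL; t++) {
        threads.emplace_back([t] {
            bool localSense = false;          // each thread keeps its own sense
            for (int phase = 0; phase < 3; phase++) {
                printf("thread %d finished phase %d\n", t, phase);
                barrier(localSense);          // no thread starts phase+1 early
            }
        });
    }
    for (auto& th : threads) th.join();
    return 0;
}
```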
--------------------------------------------------------------------------------
/virtual-memory.md:
--------------------------------------------------------------------------------
1 | ---
2 | id: virtual-memory
3 | title: Virtual Memory
4 | sidebar_label: Virtual Memory
5 | ---
6 |
7 | [🔗Lecture on Udacity (1 hr)](https://classroom.udacity.com/courses/ud007/lessons/1032798942/concepts/last-viewed)
8 |
9 | ## Why Virtual Memory?
10 |
11 | Virtual memory is a way of reconciling how a programmer views memory and how the hardware actually uses memory. Every program uses the same address space, even though clearly the data in some of those addresses must be different for each program (e.g. code section).
12 |
13 | 
14 |
15 | ## Processor's View of Memory
16 |
17 | Physical Memory - *Actual* memory modules in the system.
18 | - Sometimes < 4GB
19 | - Almost never 4GB/process
20 | - Never 16 Exabytes/process
21 | - So... usually less than what programs can access
22 |
23 | Addresses
24 | - 1:1 mapping to bytes/words in physical memory (a given address always goes to the same physical location, and the same physical location always has the same address)
25 |
26 | ## Program's View of Memory
27 | Programs each have their own address space, some of which may point to the same physical location as another program's, or not. Likewise, even different addresses in different programs could be pointing at the same shared physical location.
28 |
29 | 
30 |
31 | ## Mapping Virtual -> Physical Memory
32 |
33 | Virtual Memory is divided into pages of some size (e.g. 4kB). Likewise, Physical Memory is divided into Frames of that same size. This is similar to Blocks/Lines in caches. The Operating System decides how to map these pages to frames and handles any shared memory using Page Tables, each of which is unique to the program being mapped.
34 |
35 | 
36 |
37 | ## Where is the missing memory?
38 |
39 | What happens when we have more pages in our programs than can be currently loaded into physical memory? Since virtual memory will almost certainly exceed the size of physical memory, this occurs often. The Operating System will "page" memory to disk that has not been accessed recently, to allow the more used pages to remain in physical memory. Similar to caches, it will then obtain those pages back from disk whenever they are requested.
40 |
41 | ## Virtual to Physical Translation
42 | Similar to a cache, the lower N bits of a virtual address (where 2^N = page size) are used as an offset into the page. The upper bits are then the page number (similar to tags in caches), which is used to index into the Page Table. This table returns a physical Frame Number, which is then combined with the Page Offset to get the Physical Address actually used in hardware.
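For example (assuming a 4KB page size, so N = 12 offset bits): virtual address `0x12345678` splits into virtual page number `0x12345` and page offset `0x678`; if the page table maps page `0x12345` to frame `0x54321`, the resulting physical address is `0x54321678`.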
43 |
44 | ## Size of Flat Page Table
45 | - 1 entry per page in the virtual address space
46 | - even for pages the program never uses
47 | - Entry contains Frame Number + bits that tell us if the page is accessible
48 | - Entry Size \\(\approx\\) physical address
49 | - Overall Size: \\(\frac{\text{virtual memory}}{\text{page size}}*\text{size of entry}\\)
50 | - For very large virtual address space, the page table may need to be reorganized to ensure it properly fits within memory
51 |
52 | ## Multi-Level Page Tables
53 | Multi-Level Page Tables are designed where page table space is proportional to how much memory the program is actually using. As most programs use the "early" addresses (code, data, heap) and "late" addresses (stack), there is a large gap in the middle that usually goes unused.
54 | 
55 |
56 | ### Multi-Level Page Table Structure
57 | The page number now is split into "inner" and "outer" page numbers. The Inner Page Number is used to index into an Inner Page Table to find the frame number, and the Outer Page Number is used to determine which Inner Page Table to use.
58 | 
59 |
60 | ### Two-Level Page Table Example
61 | We save space because we do not create an inner page table if nothing is using the associated index in the outer page table. Only a certain number of outer page numbers will be utilized, and they are likely contiguous, so this results in a lot of savings.
62 | 
63 |
64 | ### Two-Level Page Table Size
65 | To calculate two-level page table size (a worked example follows the steps below):
66 | 1. Determine how many bits are used for the page offsets (N bits where \\(2^N\\)=page size)
67 | 2. Determine how many bits are used for the inner and outer page tables (X and Y bits where \\(2^X\\)=outer page table entries and \\(2^Y\\)=inner page table entries)
68 | 3. Outer Page Table size is number of outer entries * entry size in bytes
69 | 4. Inner Page Table size is number of inner entries * entry size in bytes
70 | 5. Number of Inner Page Tables Used is determined by partitioning the utilized address space by the number of bits associated with the outer page table. Typically only a few outer entries will be used.
71 | 6. Total page table size = (outer page table size) + (number of inner page tables used)*(inner page table size)
72 |
73 | [🎥 Link to example (4:19)](https://www.youtube.com/watch?v=tP7LYbFrk10)
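A worked example with assumed parameters (not the one from the video): 32-bit virtual addresses, 4KB pages (12 offset bits), the remaining 20 bits split 10/10 between outer and inner page numbers, and 4-byte entries:
- Outer page table: \\(2^{10}\\) entries \\(\times\\) 4 B = 4 KB
- Each inner page table: \\(2^{10}\\) entries \\(\times\\) 4 B = 4 KB, and each one covers \\(2^{10} \times 4\,\text{KB} = 4\\) MB of virtual address space
- If the program only touches code/data/heap near the bottom of the address space and a stack near the top, it might use just 2 inner page tables
- Total \\(\approx\\) 4 KB + 2 \\(\times\\) 4 KB = 12 KB, versus \\(2^{20} \times 4\\) B = 4 MB for a flat page table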
74 |
75 | ## Choosing the Page Size
76 |
77 | Smaller Pages -> Large Page Table
78 |
79 | Larger Pages -> Smaller Page Table
80 | - But, internal fragmentation due to most of a page not being used by applications (wasted in physical memory)
81 |
82 | Like with block size of caches, we need to compromise between these. Typically a few KB to a few MB is appropriate for page size.
83 |
84 | ## Memory Access Time with V->P Translation
85 | Example: `LOAD R1=4(R2)` (virtual address of value in `R2`+4)
86 |
87 | For Load/Store
88 | - Compute Virtual Address
89 | - Compute page number (take some bits from address)
90 | - When using virtual->physical translation (for each level of page table)
91 | - Compute physical address of page table entry (adding)
92 | - Read page table entry
93 | - Is it fast? Where is the page table? In memory!
94 | - Compute physical address
95 | - Access Cache (and sometimes memory)
96 |
97 | Since page table is in memory, this could also result in cache misses, and so it may take multiple rounds of memory access just to get the data in one address.
98 |
99 | ## Translation Look-Aside Buffer (TLB)
100 | - TLB is a cache for translations
101 | - Cache is big -> TLB is small
102 |   - 16KB of cache only spans 4 pages (with a 4KB page size), so only ~4 translations are needed to cover it
103 | - So, TLB can be very small and fast since it only holds translations
104 | - Cache accessed for each level of Page Table (4-level = 4 accesses)
105 | - TLB only stores the final translation (Frame Number) (1 access)
106 |
107 | What if we have a TLB miss?
108 | - Perform translation using page table(s)
109 | - Put translation in TLB for later use
110 |
111 | ## TLB Organization
112 | - Associativity? Small, fast => Fully or Highly Associative
113 | - Size? Want hits similar to cache => 64..512 entries
114 | - Need more? Two-level TLB
115 | - L1: small/fast
116 | - L2: hit time of several cycles but larger
117 |
118 |
--------------------------------------------------------------------------------
/vliw.md:
--------------------------------------------------------------------------------
1 | ---
2 | id: vliw
3 | title: VLIW
4 | sidebar_label: VLIW
5 | ---
6 |
7 | [🔗Lecture on Udacity (17 min)](https://classroom.udacity.com/courses/ud007/lessons/961349070/concepts/last-viewed)
8 |
9 | ## Superscalar vs VLIW
10 |
11 | | | OOO Superscalar | In-Order Superscalar | VLIW |
12 | | --- | --- | --- | --- |
13 | | IPC | \\(\leq N\\) | \\(\leq N\\) | 1 large inst/cycle but does work of N "normal" inst |
14 | | How to find independent instructions? | Look at >> N insts | Look at next N insts in program order | Just do next large inst |
15 | | Hardware Cost | $$$ | $$ | $ |
16 | | Help from compiler? | Compiler can help | Needs help | Completely depends on compiler for performance |
17 |
18 | ## The Good and The Bad
19 |
20 | Good:
21 | * Compiler does the hard work
22 | * Plenty of Time
23 | * Simpler HW
24 | * Can be energy efficient
25 | * Works well on loops and "regular" code
26 |
27 | Bad:
28 | * Latencies not always the same
29 | * e.g. Cache Miss
30 | * Irregular Applications
31 | * e.g. Applications with lots of decision making that are hard for compiler to figure out
32 | * Code Bloat
33 | * Can be much larger if there are many no-ops from dependent instructions
34 |
35 | ## VLIW Instructions
36 |
37 | * Instruction set typically has all the "normal" ISA opcodes
38 | * Full predication support (or at least extensive)
39 | * Relies on compiler to expose parallelism via instruction scheduling
40 | * Lots of registers
41 | * A lot of scheduling optimizations require use of additional registers
42 | * Branch hints
43 | * Compiler tells processor what it thinks the branch will do
44 | * VLIW instruction "compaction"
45 | | OP1 | NOP | NOP | NOP |
46 | |---|---|---|---|
47 |
48 | | OP2 | OP3 | NOP | NOP |
49 | |---|---|---|---|
50 | becomes
51 | | OP1 |X| OP2 | | OP3 |X| NOP | |
52 | |---|---|---|---|---|---|---|---|
53 | where the X represents some sort of stop bit. This reduces the number of `NOP` and therefore code bloat.
54 |
55 | ## VLIW Examples
56 | * Itanium
57 | * _Tons_ of ISA Features
58 | * HW very complicated
59 | * Still not great on irregular code
60 | * DSP Processors
61 | * Regular loops, lots of iterations
62 | * Excellent performance
63 | * Very energy-efficient
64 |
65 | The main point is that VLIW can be a very good choice given the right application where compilers can do well.
66 |
67 |
68 | *[ALU]: Arithmetic Logic Unit
69 | *[CPI]: Cycles Per Instruction
70 | *[DSP]: Digital Signal Processing
71 | *[ILP]: Instruction-Level Parallelism
72 | *[IPC]: Instructions per Cycle
73 | *[IQ]: Instruction Queue
74 | *[ISA]: Instruction Set Architecture
75 | *[LB]: Load Buffer
76 | *[LSQ]: Load-Store Queue
77 | *[LW]: Load Word
78 | *[OOO]: Out Of Order
79 | *[PC]: Program Counter
80 | *[RAT]: Register Allocation Table (Register Alias Table)
81 | *[RAR]: Read-After-Read
82 | *[RAW]: Read-After-Write
83 | *[ROB]: ReOrder Buffer
84 | *[SB]: Store Buffer
85 | *[SW]: Store Word
86 | *[VLIW]: Very Long Instruction Word
87 | *[WAR]: Write-After-Read
88 | *[WAW]: Write-After-Write
89 |
--------------------------------------------------------------------------------
/welcome.md:
--------------------------------------------------------------------------------
1 | ---
2 | id: welcome
3 | title: High Performance Computer Architecture (HPCA)
4 | sidebar_label: Welcome to HPCA
5 | ---
6 |
7 | ### Howdy Friends
8 |
9 | Here are my notes from when I am taking HPCA in OMSCS during Fall 2019.
10 |
11 | Each document in "Lecture Notes" corresponds to a lesson in [Udacity](https://classroom.udacity.com/courses/ud007). Within each document, the headings roughly correspond to the videos within that lesson. Usually, I omit the lesson intro videos.
12 |
13 | There is a lot of information in these documents. I hope you find it helpful!
14 |
15 | If you have any questions, comments, concerns, or improvements, don't hesitate to reach out to me. You can find me at:
16 |
17 | * [davidharris@gatech.edu](mailto:davidharris@gatech.edu)
18 | * [David Harris \| Linkedin](https://www.linkedin.com/in/davidrossharris/)
19 | * @drharris \(Slack\)
20 |
--------------------------------------------------------------------------------