├── LICENSE ├── README.md ├── advanced-caches.md ├── assets ├── 12_function-call-inlining.png ├── 12_scheduling-if-conversion.png ├── 15_cache-organization.png ├── 15_direct-mapped-cache.png ├── 15_offset-index-tag-setassociative.png ├── 16_mapping-virtual-physical-memory.png ├── 16_multilevel-page-table-structure.png ├── 16_multilevel-page-tables.png ├── 16_program-view-of-memory.png ├── 16_two-level-page-table-example.png ├── 16_why-virtual-memory.png ├── 17_aliasing-in-virtually-accessed-caches.png ├── 17_loop-interchange.png ├── 17_overlap-misses.png ├── 17_prefetch-instructions.png ├── 17_tlb-cache-hit.png ├── 17_vipt-cache-aliasing.png ├── 17_virtually-indexed-physically-tagged.png ├── 19_connecting-dram.png ├── 19_memory-chip-organization.png ├── 1bh2bc.png ├── 1bit-statemachine.png ├── 20_connecting-io-devices.png ├── 20_magnetic-disks.png ├── 22_centralized-shared-memory.png ├── 22_distributed-memory.png ├── 22_smt-cache-tlb.png ├── 22_smt-hardware-changes.png ├── 24_cache-coherence-problem.png ├── 24_directory-entry.png ├── 24_msi-coherence.png ├── 24_write-update-snooping-coherence.png ├── 25_how-is-llsc-atomic.png ├── 25_synchronization-example.png ├── 26_data-races-and-consistency.png ├── 27_multi-core-power-and-performance.png ├── 2bit-history-predictor.png ├── 2bit-statemachine.png ├── 7_all-inst-same-cycle.png ├── 7_duplicating-register-values.png ├── 7_false-dependencies-after-renaming.png ├── 7_ilp-example.png ├── 7_ilp-ipc-discussion.png ├── 7_ilp-vs-ipc.png ├── 7_rat-example.png ├── 7_waw-dependencies.png ├── 8_dispatch-gt1-ready.png ├── 8_issue-example.png ├── 8_tomasulo-review-1.png ├── 8_tomasulos-algorithm-the-picture.png ├── 8_write-result.png ├── 9_reorder-buffer.png ├── branch-target-buffer.png ├── diminishing-returns.png ├── fabrication-yield.png ├── full-predication-example.png ├── hierarchical-predictor.png ├── history-shared-counters.png ├── historypredictor.png ├── memory-wall.png ├── movz-movn-performance.png ├── static-power.png ├── tournament-predictors.png └── weaktakenflipflop.png ├── branches.md ├── cache-coherence.md ├── cache-review.md ├── cheat-sheet-midterm.md ├── compiler-ilp.md ├── course-information.md ├── fault-tolerance.md ├── ilp.md ├── instruction-scheduling.md ├── introduction.md ├── many-cores.md ├── memory-consistency.md ├── memory-ordering.md ├── memory.md ├── metrics-and-evaluation.md ├── multi-processing.md ├── pipelining.md ├── predication.md ├── reorder-buffer.md ├── storage.md ├── synchronization.md ├── virtual-memory.md ├── vliw.md └── welcome.md /LICENSE: -------------------------------------------------------------------------------- 1 | MIT License 2 | 3 | Copyright (c) 2019 David Harris 4 | 5 | Permission is hereby granted, free of charge, to any person obtaining a copy 6 | of this software and associated documentation files (the "Software"), to deal 7 | in the Software without restriction, including without limitation the rights 8 | to use, copy, modify, merge, publish, distribute, sublicense, and/or sell 9 | copies of the Software, and to permit persons to whom the Software is 10 | furnished to do so, subject to the following conditions: 11 | 12 | The above copyright notice and this permission notice shall be included in all 13 | copies or substantial portions of the Software. 14 | 15 | THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR 16 | IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, 17 | FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. 
IN NO EVENT SHALL THE 18 | AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER 19 | LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, 20 | OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE 21 | SOFTWARE. 22 | -------------------------------------------------------------------------------- /README.md: -------------------------------------------------------------------------------- 1 | # cs6290-notes 2 | Notes for CS6290, High Performance Computer Architecture 3 | -------------------------------------------------------------------------------- /advanced-caches.md: -------------------------------------------------------------------------------- 1 | --- 2 | id: advanced-caches 3 | title: Advanced Caches 4 | sidebar_label: Advanced Caches 5 | --- 6 | 7 | [🔗Lecture on Udacity (1.5 hr)](https://classroom.udacity.com/courses/ud007/lessons/1080508587/concepts/last-viewed) 8 | 9 | ## Improving Cache Performance 10 | Average Memory Access Time: \\(\text{AMAT} = \text{hit time} + \text{miss rate} * \text{miss penalty} \\) 11 | - Reduce hit time 12 | - Reduce miss rate 13 | - Reduce miss penalty 14 | 15 | ## Reduce Hit Time 16 | - Reduce cache size (bad for miss rate) 17 | - Reduce cache associativity (bad for miss rate) 18 | - Overlap cache hit with another hit 19 | - Overlap cache hit with TLB hit 20 | - Optimize lookup for common case 21 | - Maintain replacement state more quickly 22 | 23 | ### Pipelined Caches (overlap with another hit) 24 | - Multiple Cycles to access cache 25 | - Access comes in cycle N (hit) 26 | - Access comes in cycle N+1 (hit) - has to wait 27 | - Hit time = actual hit + wait time 28 | - Pipeline in 3 stages (for example): 29 | - Stage 1: Index into set, read out tags and valid bits in that set to determine hit 30 | - Stage 2: Combining hits to determine did we hit overall and where in our set, begin the read of data 31 | - Stage 3: Finish data read by using offset to choose the right part of the cache block and returning it 32 | 33 | ## TLB and Cache Hit 34 | Hit time = TLB access time + Cache access time 35 | ![TLB and cache hit](https://i.imgur.com/5yCJxdz.png) 36 | This represents the "Physically Indexed Physically Tagged" cache (PIPT), since both the tags and indexes are physical 37 | 38 | ## Virtually Accessed Cache 39 | Also called VIVT (Virtual Indexed Virtually Tagged). The virtual address first accesses the cache to get the data, and only goes to TLB to get physical address on a cache miss. Cache is now tagged by virtual address instead of physical. This means hit time is simply the cache hit time. An additional benefit could be no TLB access on a cache hit, but in reality we still need TLB information like permissions (RWX) so we still need to do a TLB lookup even on a cache hit. 40 | 41 | A bigger issue is that virtual address is specific to each process - so, virtual addresses that are the same will likely be pointing to different physical locations. This can be solved by flushing the cache on a context switch, which causes a burst of cache misses each time we come back to that process. 42 | 43 | ## Virtually Indexed Physically Tagged (VIPT) 44 | In this cache, we use the index of the virtual address to get the correct set in cache, and we look at tags in that set. Meanwhile, we also retrieve frame number using the Tag from the virtual address. This frame number is used to compare against the tags in the indexed cache set. 
The hit time for this cache is the max of cache hit time and TLB lookup time, as these activities occur in parallel. Additionally, we do not have to flush the cache on a context switch, because the cache lines are tagged with physical address. 45 | ![VIPT](https://i.imgur.com/IHCm6sK.png) 46 | 47 | ## Aliasing in Virtually Accessed Caches 48 | Aliasing is a problem with virtually accessed caches because multiple virtual addresses could legitimately point to the same physical location in memory, yet be cached separately due to different indices from the virtual addresses. If one location is modified, a cached version of the other virtual address may not see the change. 49 | 50 | ![Aliasing in virtually accessed caches](https://i.imgur.com/D7Oeama.png) 51 | 52 | ## VIPT Cache Aliasing 53 | Depending on how the cache is constructed, it is possible to avoid aliasing issues with VIPT. In this example, the index and offset bits were both part of the frame offset, so since VIPT caches use a physical tag, it will always retrieve the same block of memory from cache even if it is accessed using multiple virtual addresses. But, this requires the cache to be constructed in a way that takes advantage of this - the number of sets must be limited to ensure the index and offset bits stay within the bits used for frame offset 54 | ![VIPT Cache Aliasing](https://i.imgur.com/Oq5TWi4.png) 55 | 56 | ## Real VIPT Caches 57 | \\(\text{Cache Size} \leq \text{Associativity} * \text{Page Size}\\) 58 | - Pentium 4: 59 | - 4-way SA x 4kB \\(\Rightarrow\\) L1 is 16kB 60 | - Core 2, Nehalem, Sandy Bridge, Haswell: 61 | - 8-way SA x 4kB \\(\Rightarrow\\) L1 is 32kB 62 | - Skylake (rumored): 63 | - 16-way SA x ?? \\(\Rightarrow\\) L1 is 64kB 64 | 65 | ## Associativity and Hit Time 66 | - High Associativity: 67 | - Fewer conflicts \\(\rightarrow\\) Miss Rate \\(\downarrow\\) (good) 68 | - Larger VIPT Caches \\(\rightarrow\\) Miss Rate \\(\downarrow\\) (good) 69 | - Slower hits (bad) 70 | - Direct Mapped: 71 | - Miss Rate \\(\uparrow\\) (bad) 72 | - Hit Time \\(\downarrow\\) (good) 73 | 74 | Can we cheat on associativity to take the good things from high associativity and also get better hit time from direct mapped? 75 | 76 | ## Way Prediction 77 | - Set-Associative Cache (miss rate \\(\downarrow\\)) 78 | - Guess which line in the set is the most likely to hit (hit time \\(\downarrow\\)) 79 | - If no hit there, normal set-associative check 80 | 81 | ### Way Prediction Performance 82 | | | 32kB, 8-way SA | 4kB Direct Mapped | 32kB, 8-way SA with Way Pred | 83 | |--------------|:---:|:---:|:------:| 84 | | Hit Rate | 90% | 70% | 90% | 85 | | Hit Latency | 2 | 1 | 1 or 2 | 86 | | Miss Penalty | 20 | 20 | 20 | 87 | | AMAT | 4
(\\(2+0.1\*20\\)) | 7
(\\(1+0.3\*20\\)) | 3.3 (1.3+2)
(\\(0.7\*1+0.3\*2+0.1\*20\\)) 88 | 89 | ## Replacement Policy and Hit Time 90 | - Random 91 | - Nothing to update on cache hit (good) 92 | - Miss Rate \\(\uparrow\\) (bad) 93 | - LRU 94 | - Miss Rate \\(\downarrow\\) (good) 95 | - Update lots of counters on hit (bad) 96 | - We want benefit of LRU miss rate, but we need less activity on a cache hit 97 | 98 | ## NMRU Replacement 99 | (Not Most Recently Used) 100 | - Track which block in set is MRU (Most Recently Used) 101 | - On Replacement, pick a non-MRU block 102 | 103 | N-Way Set-Associative Tracking of MRU 104 | - 1 MRU pointer/set (instead of N LRU counters) 105 | 106 | This is slightly weaker than LRU since we're not ordering the blocks or have no "future" knowledge - but we get some of the benefit of LRU without all the counter overhead. 107 | 108 | ## PLRU (Pseudo-LRU) 109 | - Keep one bit per line in set, init to 0 110 | - Every time a line is accessed, set the bit to 1 111 | - If we need to replace something, choose a block with bit set to 0 112 | - Eventually all the bits will be set - when this happens, we zero out the other bits. 113 | - This provides the same benefit as NMRU 114 | 115 | Thus, at any point in time, PLRU is at least as good as NMRU, but still not quite as good as true LRU. But, we still get the benefit of not having to perform a lot of activities - just simple bit operations. 116 | 117 | ## Reducing the Miss Rate 118 | What are the causes of misses? 119 | 120 | Three Cs: 121 | - Compulsory Misses 122 | - First time this block is accessed 123 | - Would still be a miss in an infinite cache 124 | - Capacity Misses 125 | - Block evicted because of limited cache size 126 | - Would be a miss even in a fully-associative cache of that size 127 | - Conflict Misses 128 | - Block evicted because of limited associativity 129 | - Would not be a miss in fully-associative cache 130 | 131 | ## Larger Cache Blocks 132 | - More words brought in on a miss 133 | - Miss Rate \\(\downarrow\\) when spatial locality is good (good) 134 | - Miss Rate \\(\uparrow\\) when spatial locality is poor (bad) 135 | - As block size increases, miss rate decreases for awhile, and then begins to increase again. 136 | - The local minima is the optimal block size (64B on a 4kB cache) 137 | - For larger caches, the minima will occur at a much higher block size (e.g. 256B for 256kB) 138 | 139 | ## Prefetching 140 | - Guess which blocks will be accessed soon 141 | - Bring them into cache ahead of time 142 | - This moves the memory access time to before the actual load 143 | - Good guess - eliminate a miss (good) 144 | - Bad guess - "Cache Pollution", and we get a cache miss (bad) 145 | - We might have also used a cache spot that could have been used for something else, causing an additional cache miss. 146 | 147 | ### Prefetch Instructions 148 | We can make an instruction that allows the programmer to make explicit prefetches based on their knowledge of the program. In the below example, it shows that this isn't necessarily easy - choosing a wrong "distance" to prefetch is difficult. If you choose it too small, the prefetch is too late and we get no benefit. If it is too large, we prefetch too early and it could be replaced by the time we need it. An additional complication is that you must know something about the hardware cache in order to set this value appropriately, making it not very portable. 
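Before the lecture example in the figure below, here is a minimal C sketch (my own, not from the lecture) of what a software prefetch looks like in practice, using the GCC/Clang `__builtin_prefetch` intrinsic; the prefetch distance of 8 elements is an arbitrary assumption, and picking it well is exactly the hard part described above.

```c
#include <stddef.h>

/* Prefetch distance in elements - an assumed value for illustration.
 * Too small: the data arrives after we need it. Too large: it may be
 * evicted (or pollute the cache) before the loop reaches it. */
#define PF_DIST 8

double sum(const double *a, size_t n) {
    double total = 0.0;
    for (size_t i = 0; i < n; i++) {
        if (i + PF_DIST < n) {
            /* Hint: bring a[i + PF_DIST] toward the cache.
             * Arguments: address, 0 = prefetch for read, 3 = high temporal locality. */
            __builtin_prefetch(&a[i + PF_DIST], 0, 3);
        }
        total += a[i];
    }
    return total;
}
```

The right distance depends on the miss latency and on how much work each iteration does, which is why this kind of code tends not to be portable across machines.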
149 | ![prefetch instructions](https://i.imgur.com/4a0zHOV.png) 150 | 151 | ### Hardware Prefetching 152 | - No change to program 153 | - HW tries to guess what will be accessed soon 154 | - Examples: 155 | - Stream Buffer (sequential - fetch next block) 156 | - Stride Prefetcher (continue fetching fixed distance from each other) 157 | - Correlating Prefetcher (keeps a history to correlate block access based on previous access) 158 | 159 | ## Loop Interchange 160 | Replaces the order of nested loops to optimize for bad cache accesses. In the example below it was fetching one element per block. A good compiler should reverse these loops to ensure cache accesses are as sequential as possible. This is not always possible - the compiler has to prove the program is still correct once reversing the loops, which it does by showing there are no dependencies between iterations. 161 | ![loop interchange](https://i.imgur.com/JMPw2Fk.png) 162 | 163 | ## Overlap Misses 164 | Reduces Miss Penalty. A good out of order procesor will continue doing whatever work it can after a cache miss, but eventually it runs out of things to do while waiting on the load. Some caches are blocking, meaning a load must wait for all other loads to complete before starting. This creates a lot of inactive time. Other caches are Non-Blocking, meaning the processor will use Memory-Level Parallelism to overlap multiple loads to minimize the inactive time and keep the processor working. 165 | ![overlap misses](https://i.imgur.com/4kJzxCX.png) 166 | 167 | ### Miss-Under-Miss Support in Caches 168 | - Miss Status Handling Registers (MSHRs) 169 | - Info about ongoing misses 170 | - Check MSHRs to see if any match 171 | - No Match (Miss) \\(\Rightarrow\\) Allocate an MSHR, remember which instruction to wake up 172 | - Match (Half-Miss) \\(\Rightarrow\\) Add instruction to MSHR 173 | - When data comes back, wake up all instructions waiting on this data, and release MSHR 174 | - How many MSHRs do we want? 175 | - 2 is good, 4 is even better, and there are still benefits from level larger amount like 16-32. 176 | 177 | ## Cache Hierarchies 178 | Reduces Miss Penalty. Multi-Level Caches: 179 | - Miss in L1 cache goes to L2 cache 180 | - L1 miss penalty \\(\neq\\) memory latency 181 | - L1 miss penalty = L2 Hit Time + (L2 Miss Rate)*(L2 Miss Penalty) 182 | - Can have L3, L4, etc. 183 | 184 | ### AMAT With Cache Hierarchies 185 | \\(\text{AMAT} = \text{L1 hit time} + \text{L1 miss rate} * \text{L1 miss penalty} \\) 186 | 187 | \\(\text{L1 Miss Penalty} = \text{L2 hit time} + \text{L2 miss rate} * \text{L2 miss penalty} \\) 188 | 189 | \\(\text{L2 Miss Penalty} = \text{L3 hit time} + \text{L3 miss rate} * \text{L3 miss penalty} \\) 190 | 191 | ... etc, until: 192 | 193 | \\(\text{LN Miss Penalty} = \text{Main Memory Latency}\\) (Last Level Cache - LLC) 194 | 195 | ### Multilevel Cache Performance 196 | 197 | | | 16kB | 128kB | No cache | L1 = 16kB
L2 = 128kB | 198 | |----------|:----:|:-----:|:--------:|:--------------------------:| 199 | | Hit Time | 2 | 10 | 100 | 2 for L1
12 for L2 | 200 | | Hit Rate | 90% | 97.5% | 100% | 90% for L1
75% for L2 | 201 | | AMAT | 12 | 12.5 | 100 | 5.5 | 202 | | (calc) | \\(2+0.1\*100\\) | \\(10+0.025\*100\\) | \\(100+0\*100\\) | \\(2+0.1\*(10+0.25\*100)\\) | 203 | 204 | Combining caches like this provides much better overall performance than just considering hit time and size. 205 | 206 | ### Hit Rate in L2, L3, etc. 207 | When the L2 cache is used alone it has a 97.5% hit rate, but when it sits behind an L1 it appears to have only a 75% hit rate - so it looks like the L2 has gotten worse, which is misleading. 208 | 209 | 75% is the "local hit rate", which is the hit rate that the cache actually observes. In this case, 90% of accesses were hits in L1, so the L2 cache never observed those (which would have been hits). 210 | 211 | 97.5% is the "global hit rate", which is the overall hit rate for any access to the cache. 212 | 213 | ### Global vs. Local Hit Rate 214 | - Global Hit Rate: 1 - Global Miss Rate 215 | - Global Miss Rate: \\(\frac{\text{# of misses in this cache}}{\text{# of all memory references}}\\) 216 | - Local Hit Rate: \\(\frac{\text{# of hits}}{\text{# of accesses to this cache}}\\) 217 | - Misses per 1000 instructions (MPKI) 218 | - Similar to Miss Rate, but instead of being based on memory references, it normalizes based on number of instructions 219 | 220 | ### Inclusion Property 221 | - Block is in L1 Cache 222 | - May or may not be in L2 223 | - Has to also be in L2 (Inclusion) 224 | - Cannot also be in L2 (Exclusion) 225 | 226 | When L1 is a hit, LRU counters on L2 are not changed - over time, L2 may decide to replace blocks that are still frequently accessed in L1. So Inclusion is not guaranteed. One way to enforce it is an "inclusion bit" that can be set to keep the block from being replaced. 227 | 228 | [🎥 See Lecture Example (4:44)](https://www.youtube.com/watch?v=J8DQG9Pvp3U) 229 | 230 | Helpful explanation in lecture notes about why Inclusion is desirable: 231 | 232 | > Student Question: What is the point of inclusion in a multi-level cache? Why would the effort/cost be spent to try and enforce an inclusion property? 233 | > 234 | > I would have guessed that EXCLUSION would be a better thing to work towards, get more data into some part of the cache. I just don't get why you would want duplicate data taking up valuable cache space, no matter what level. 235 | > 236 | > Instructor Answer: Inclusion makes several things simpler. When doing a write-back from L1, for example, inclusion ensures that the write-back is a L2 hit. Why is this useful? Well, it limits how much buffering we need and how complicated things will be. If the L1 cache is write-through, inclusion ensures that a write that is a L1 hit will actually happen in L2 (not be an L2 miss). And for coherence with private L1 and L2 caches (a la Intel's i3/i5/i7), inclusion allows the L2 cache to filter requests from other processors. With inclusion, if the request from another processor does not match anything in our L2, we know we don't have that block. Without inclusion, even if the block does not match in L2, we still need to probe in the L1 because it might be there.
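Tying together the multi-level AMAT formulas and the local vs. global hit rate definitions above, here is a small worked check in C (my own sketch, not from the lecture) using the numbers from the multilevel cache performance table: L1 with a 2-cycle hit and 90% hit rate, L2 with a 10-cycle hit and 75% local hit rate, and 100-cycle main memory.

```c
#include <stdio.h>

int main(void) {
    double l1_hit = 2.0,  l1_hit_rate = 0.90;
    double l2_hit = 10.0, l2_local_hit_rate = 0.75;
    double mem_latency = 100.0;

    /* The L1 miss penalty is itself an AMAT over the L2 (and so on for L3, ...). */
    double l1_miss_penalty = l2_hit + (1.0 - l2_local_hit_rate) * mem_latency;
    double amat = l1_hit + (1.0 - l1_hit_rate) * l1_miss_penalty;

    /* Global miss rate of L2 = (misses in L2) / (all memory references). */
    double l2_global_miss_rate = (1.0 - l1_hit_rate) * (1.0 - l2_local_hit_rate);
    double l2_global_hit_rate  = 1.0 - l2_global_miss_rate;

    printf("AMAT               = %.2f cycles\n", amat);          /* 5.50 */
    printf("L2 global hit rate = %.3f\n", l2_global_hit_rate);   /* 0.975 */
    return 0;
}
```

This reproduces the 5.5-cycle AMAT from the table and shows why the same L2 can report a 75% local hit rate yet a 97.5% global hit rate.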
237 | 238 | *[AMAT]: Average Memory Access Time 239 | *[LRU]: Least-Recently Used 240 | *[MPKI]: Misses Per 1000 Instructions 241 | *[MSHR]: Miss Status Handling Register 242 | *[MSHRs]: Miss Status Handling Registers 243 | *[NMRU]: Not-Most-Recently Used 244 | *[PIPT]: Physically Indexed Physically Tagged 245 | *[PLRU]: Pseudo-Least Recently Used 246 | *[RWX]: Read, Write, Execute 247 | *[SA]: Set-Associative 248 | *[TLB]: Translation Look-Aside Buffer 249 | *[VIPT]: Virtually Indexed Physically Tagged 250 | *[VIVT]: Virtual Indexed Virtually Tagged -------------------------------------------------------------------------------- /assets/12_function-call-inlining.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/drharris/cs6290-notes/6c7ff7aae95a066ecf9eb626f2adfca28f0cf25a/assets/12_function-call-inlining.png -------------------------------------------------------------------------------- /assets/12_scheduling-if-conversion.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/drharris/cs6290-notes/6c7ff7aae95a066ecf9eb626f2adfca28f0cf25a/assets/12_scheduling-if-conversion.png -------------------------------------------------------------------------------- /assets/15_cache-organization.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/drharris/cs6290-notes/6c7ff7aae95a066ecf9eb626f2adfca28f0cf25a/assets/15_cache-organization.png -------------------------------------------------------------------------------- /assets/15_direct-mapped-cache.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/drharris/cs6290-notes/6c7ff7aae95a066ecf9eb626f2adfca28f0cf25a/assets/15_direct-mapped-cache.png -------------------------------------------------------------------------------- /assets/15_offset-index-tag-setassociative.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/drharris/cs6290-notes/6c7ff7aae95a066ecf9eb626f2adfca28f0cf25a/assets/15_offset-index-tag-setassociative.png -------------------------------------------------------------------------------- /assets/16_mapping-virtual-physical-memory.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/drharris/cs6290-notes/6c7ff7aae95a066ecf9eb626f2adfca28f0cf25a/assets/16_mapping-virtual-physical-memory.png -------------------------------------------------------------------------------- /assets/16_multilevel-page-table-structure.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/drharris/cs6290-notes/6c7ff7aae95a066ecf9eb626f2adfca28f0cf25a/assets/16_multilevel-page-table-structure.png -------------------------------------------------------------------------------- /assets/16_multilevel-page-tables.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/drharris/cs6290-notes/6c7ff7aae95a066ecf9eb626f2adfca28f0cf25a/assets/16_multilevel-page-tables.png -------------------------------------------------------------------------------- /assets/16_program-view-of-memory.png: -------------------------------------------------------------------------------- 
https://raw.githubusercontent.com/drharris/cs6290-notes/6c7ff7aae95a066ecf9eb626f2adfca28f0cf25a/assets/16_program-view-of-memory.png -------------------------------------------------------------------------------- /assets/16_two-level-page-table-example.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/drharris/cs6290-notes/6c7ff7aae95a066ecf9eb626f2adfca28f0cf25a/assets/16_two-level-page-table-example.png -------------------------------------------------------------------------------- /assets/16_why-virtual-memory.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/drharris/cs6290-notes/6c7ff7aae95a066ecf9eb626f2adfca28f0cf25a/assets/16_why-virtual-memory.png -------------------------------------------------------------------------------- /assets/17_aliasing-in-virtually-accessed-caches.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/drharris/cs6290-notes/6c7ff7aae95a066ecf9eb626f2adfca28f0cf25a/assets/17_aliasing-in-virtually-accessed-caches.png -------------------------------------------------------------------------------- /assets/17_loop-interchange.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/drharris/cs6290-notes/6c7ff7aae95a066ecf9eb626f2adfca28f0cf25a/assets/17_loop-interchange.png -------------------------------------------------------------------------------- /assets/17_overlap-misses.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/drharris/cs6290-notes/6c7ff7aae95a066ecf9eb626f2adfca28f0cf25a/assets/17_overlap-misses.png -------------------------------------------------------------------------------- /assets/17_prefetch-instructions.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/drharris/cs6290-notes/6c7ff7aae95a066ecf9eb626f2adfca28f0cf25a/assets/17_prefetch-instructions.png -------------------------------------------------------------------------------- /assets/17_tlb-cache-hit.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/drharris/cs6290-notes/6c7ff7aae95a066ecf9eb626f2adfca28f0cf25a/assets/17_tlb-cache-hit.png -------------------------------------------------------------------------------- /assets/17_vipt-cache-aliasing.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/drharris/cs6290-notes/6c7ff7aae95a066ecf9eb626f2adfca28f0cf25a/assets/17_vipt-cache-aliasing.png -------------------------------------------------------------------------------- /assets/17_virtually-indexed-physically-tagged.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/drharris/cs6290-notes/6c7ff7aae95a066ecf9eb626f2adfca28f0cf25a/assets/17_virtually-indexed-physically-tagged.png -------------------------------------------------------------------------------- /assets/19_connecting-dram.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/drharris/cs6290-notes/6c7ff7aae95a066ecf9eb626f2adfca28f0cf25a/assets/19_connecting-dram.png 
-------------------------------------------------------------------------------- /assets/19_memory-chip-organization.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/drharris/cs6290-notes/6c7ff7aae95a066ecf9eb626f2adfca28f0cf25a/assets/19_memory-chip-organization.png -------------------------------------------------------------------------------- /assets/1bh2bc.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/drharris/cs6290-notes/6c7ff7aae95a066ecf9eb626f2adfca28f0cf25a/assets/1bh2bc.png -------------------------------------------------------------------------------- /assets/1bit-statemachine.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/drharris/cs6290-notes/6c7ff7aae95a066ecf9eb626f2adfca28f0cf25a/assets/1bit-statemachine.png -------------------------------------------------------------------------------- /assets/20_connecting-io-devices.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/drharris/cs6290-notes/6c7ff7aae95a066ecf9eb626f2adfca28f0cf25a/assets/20_connecting-io-devices.png -------------------------------------------------------------------------------- /assets/20_magnetic-disks.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/drharris/cs6290-notes/6c7ff7aae95a066ecf9eb626f2adfca28f0cf25a/assets/20_magnetic-disks.png -------------------------------------------------------------------------------- /assets/22_centralized-shared-memory.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/drharris/cs6290-notes/6c7ff7aae95a066ecf9eb626f2adfca28f0cf25a/assets/22_centralized-shared-memory.png -------------------------------------------------------------------------------- /assets/22_distributed-memory.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/drharris/cs6290-notes/6c7ff7aae95a066ecf9eb626f2adfca28f0cf25a/assets/22_distributed-memory.png -------------------------------------------------------------------------------- /assets/22_smt-cache-tlb.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/drharris/cs6290-notes/6c7ff7aae95a066ecf9eb626f2adfca28f0cf25a/assets/22_smt-cache-tlb.png -------------------------------------------------------------------------------- /assets/22_smt-hardware-changes.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/drharris/cs6290-notes/6c7ff7aae95a066ecf9eb626f2adfca28f0cf25a/assets/22_smt-hardware-changes.png -------------------------------------------------------------------------------- /assets/24_cache-coherence-problem.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/drharris/cs6290-notes/6c7ff7aae95a066ecf9eb626f2adfca28f0cf25a/assets/24_cache-coherence-problem.png -------------------------------------------------------------------------------- /assets/24_directory-entry.png: -------------------------------------------------------------------------------- 
https://raw.githubusercontent.com/drharris/cs6290-notes/6c7ff7aae95a066ecf9eb626f2adfca28f0cf25a/assets/24_directory-entry.png -------------------------------------------------------------------------------- /assets/24_msi-coherence.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/drharris/cs6290-notes/6c7ff7aae95a066ecf9eb626f2adfca28f0cf25a/assets/24_msi-coherence.png -------------------------------------------------------------------------------- /assets/24_write-update-snooping-coherence.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/drharris/cs6290-notes/6c7ff7aae95a066ecf9eb626f2adfca28f0cf25a/assets/24_write-update-snooping-coherence.png -------------------------------------------------------------------------------- /assets/25_how-is-llsc-atomic.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/drharris/cs6290-notes/6c7ff7aae95a066ecf9eb626f2adfca28f0cf25a/assets/25_how-is-llsc-atomic.png -------------------------------------------------------------------------------- /assets/25_synchronization-example.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/drharris/cs6290-notes/6c7ff7aae95a066ecf9eb626f2adfca28f0cf25a/assets/25_synchronization-example.png -------------------------------------------------------------------------------- /assets/26_data-races-and-consistency.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/drharris/cs6290-notes/6c7ff7aae95a066ecf9eb626f2adfca28f0cf25a/assets/26_data-races-and-consistency.png -------------------------------------------------------------------------------- /assets/27_multi-core-power-and-performance.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/drharris/cs6290-notes/6c7ff7aae95a066ecf9eb626f2adfca28f0cf25a/assets/27_multi-core-power-and-performance.png -------------------------------------------------------------------------------- /assets/2bit-history-predictor.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/drharris/cs6290-notes/6c7ff7aae95a066ecf9eb626f2adfca28f0cf25a/assets/2bit-history-predictor.png -------------------------------------------------------------------------------- /assets/2bit-statemachine.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/drharris/cs6290-notes/6c7ff7aae95a066ecf9eb626f2adfca28f0cf25a/assets/2bit-statemachine.png -------------------------------------------------------------------------------- /assets/7_all-inst-same-cycle.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/drharris/cs6290-notes/6c7ff7aae95a066ecf9eb626f2adfca28f0cf25a/assets/7_all-inst-same-cycle.png -------------------------------------------------------------------------------- /assets/7_duplicating-register-values.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/drharris/cs6290-notes/6c7ff7aae95a066ecf9eb626f2adfca28f0cf25a/assets/7_duplicating-register-values.png 
-------------------------------------------------------------------------------- /assets/7_false-dependencies-after-renaming.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/drharris/cs6290-notes/6c7ff7aae95a066ecf9eb626f2adfca28f0cf25a/assets/7_false-dependencies-after-renaming.png -------------------------------------------------------------------------------- /assets/7_ilp-example.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/drharris/cs6290-notes/6c7ff7aae95a066ecf9eb626f2adfca28f0cf25a/assets/7_ilp-example.png -------------------------------------------------------------------------------- /assets/7_ilp-ipc-discussion.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/drharris/cs6290-notes/6c7ff7aae95a066ecf9eb626f2adfca28f0cf25a/assets/7_ilp-ipc-discussion.png -------------------------------------------------------------------------------- /assets/7_ilp-vs-ipc.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/drharris/cs6290-notes/6c7ff7aae95a066ecf9eb626f2adfca28f0cf25a/assets/7_ilp-vs-ipc.png -------------------------------------------------------------------------------- /assets/7_rat-example.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/drharris/cs6290-notes/6c7ff7aae95a066ecf9eb626f2adfca28f0cf25a/assets/7_rat-example.png -------------------------------------------------------------------------------- /assets/7_waw-dependencies.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/drharris/cs6290-notes/6c7ff7aae95a066ecf9eb626f2adfca28f0cf25a/assets/7_waw-dependencies.png -------------------------------------------------------------------------------- /assets/8_dispatch-gt1-ready.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/drharris/cs6290-notes/6c7ff7aae95a066ecf9eb626f2adfca28f0cf25a/assets/8_dispatch-gt1-ready.png -------------------------------------------------------------------------------- /assets/8_issue-example.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/drharris/cs6290-notes/6c7ff7aae95a066ecf9eb626f2adfca28f0cf25a/assets/8_issue-example.png -------------------------------------------------------------------------------- /assets/8_tomasulo-review-1.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/drharris/cs6290-notes/6c7ff7aae95a066ecf9eb626f2adfca28f0cf25a/assets/8_tomasulo-review-1.png -------------------------------------------------------------------------------- /assets/8_tomasulos-algorithm-the-picture.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/drharris/cs6290-notes/6c7ff7aae95a066ecf9eb626f2adfca28f0cf25a/assets/8_tomasulos-algorithm-the-picture.png -------------------------------------------------------------------------------- /assets/8_write-result.png: -------------------------------------------------------------------------------- 
https://raw.githubusercontent.com/drharris/cs6290-notes/6c7ff7aae95a066ecf9eb626f2adfca28f0cf25a/assets/8_write-result.png -------------------------------------------------------------------------------- /assets/9_reorder-buffer.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/drharris/cs6290-notes/6c7ff7aae95a066ecf9eb626f2adfca28f0cf25a/assets/9_reorder-buffer.png -------------------------------------------------------------------------------- /assets/branch-target-buffer.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/drharris/cs6290-notes/6c7ff7aae95a066ecf9eb626f2adfca28f0cf25a/assets/branch-target-buffer.png -------------------------------------------------------------------------------- /assets/diminishing-returns.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/drharris/cs6290-notes/6c7ff7aae95a066ecf9eb626f2adfca28f0cf25a/assets/diminishing-returns.png -------------------------------------------------------------------------------- /assets/fabrication-yield.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/drharris/cs6290-notes/6c7ff7aae95a066ecf9eb626f2adfca28f0cf25a/assets/fabrication-yield.png -------------------------------------------------------------------------------- /assets/full-predication-example.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/drharris/cs6290-notes/6c7ff7aae95a066ecf9eb626f2adfca28f0cf25a/assets/full-predication-example.png -------------------------------------------------------------------------------- /assets/hierarchical-predictor.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/drharris/cs6290-notes/6c7ff7aae95a066ecf9eb626f2adfca28f0cf25a/assets/hierarchical-predictor.png -------------------------------------------------------------------------------- /assets/history-shared-counters.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/drharris/cs6290-notes/6c7ff7aae95a066ecf9eb626f2adfca28f0cf25a/assets/history-shared-counters.png -------------------------------------------------------------------------------- /assets/historypredictor.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/drharris/cs6290-notes/6c7ff7aae95a066ecf9eb626f2adfca28f0cf25a/assets/historypredictor.png -------------------------------------------------------------------------------- /assets/memory-wall.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/drharris/cs6290-notes/6c7ff7aae95a066ecf9eb626f2adfca28f0cf25a/assets/memory-wall.png -------------------------------------------------------------------------------- /assets/movz-movn-performance.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/drharris/cs6290-notes/6c7ff7aae95a066ecf9eb626f2adfca28f0cf25a/assets/movz-movn-performance.png -------------------------------------------------------------------------------- /assets/static-power.png: -------------------------------------------------------------------------------- 
https://raw.githubusercontent.com/drharris/cs6290-notes/6c7ff7aae95a066ecf9eb626f2adfca28f0cf25a/assets/static-power.png -------------------------------------------------------------------------------- /assets/tournament-predictors.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/drharris/cs6290-notes/6c7ff7aae95a066ecf9eb626f2adfca28f0cf25a/assets/tournament-predictors.png -------------------------------------------------------------------------------- /assets/weaktakenflipflop.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/drharris/cs6290-notes/6c7ff7aae95a066ecf9eb626f2adfca28f0cf25a/assets/weaktakenflipflop.png -------------------------------------------------------------------------------- /branches.md: -------------------------------------------------------------------------------- 1 | --- 2 | id: branches 3 | title: Branches 4 | sidebar_label: Branches 5 | --- 6 | 7 | [🔗Lecture on Udacity (2.5 hr)](https://classroom.udacity.com/courses/ud007/lessons/3618489075/concepts/last-viewed) 8 | 9 | ## Branch in a Pipeline 10 | 11 | ```mipsasm 12 | BEQ R1, R2, Label 13 | ``` 14 | 15 | A branch instruction like this compares R1 and R2, and if equal jumps to `Label` (usually by having in the immediate part of the instruction field the difference in PC for the next instruction versus the labeled instruction, such that it can simply add this offset when branched). 16 | 17 | In a 5-stage pipeline, the compare happens in the third (ALU) stage. Meanwhile, two instructions have moved into Fetch/Read stages. If we fetched the correct instructions for the branch (or not), they continue to move correctly with no penalty. Otherwise we must flush and take a 2-instruction penalty. 18 | 19 | Thus, it never pays to simply not fetch something as you always take the penalty, and it is better to take a penalty only sometimes than all the time. Another important note is that when fetching the instruction after `BEQ`, we know nothing about the branch itself yet except its address (we don't even know it's a branch at all yet), but we already must make a prediction of whether it's a taken branch or not. 20 | 21 | ## Branch Prediction Requirements 22 | 23 | - Using only the knowledge of the instruction's PC 24 | - Guess PC of next instruction to fetch 25 | - Must correctly guess: 26 | - is this a branch? 27 | - is it taken? 28 | - if taken, what is the target PC? 29 | 30 | The first two guesses can be combined into "is this a taken branch?" 31 | 32 | ### Branch Prediction Accuracy 33 | 34 | $$ CPI = 1 + \frac{mispred}{inst} * \frac{penalty}{mispred} $$ 35 | 36 | The \\( \frac{mispred}{inst} \\) part is determined by the predictor accuracy. The \\(\frac{penalty}{mispred} \\) part is determined by the pipeline (where in the pipeline we figure out the misprediction). 37 | 38 | Assumption below: 20% of all instructions are branches (common in programs). 39 | 40 | | Accuracy \\(\downarrow\\) | Resolve in 3rd stage | Resolve in 10th stage | 41 | |---|:---:|:---:| 42 | | 50% for BR
100% all other | \\(1 + 0.5\*0.2\*2\\)
\\(= 1.2\\) | \\(1 + 0.5\*0.2\*9\\)
\\(= 1.9\\) | 43 | | 90% for BR
100% all other | \\(1 + 0.1\*0.2\*2\\)
\\(= 1.04\\) | \\(1 + 0.1\*0.2\*9\\)
\\(= 1.18\\) | 44 | | _(Speedup)_ | _1.15_ | _1.61_ | 45 | 46 | Conclusions: A better branch predictor will help regardless of the pipeline, but the _amount_ of help changes with the pipeline depth. 47 | 48 | ### Performance with Not-Taken Prediction 49 | 50 | | | Refuse to Predict
(stall until sure) | Predict Not-Taken
(always increment PC) | 51 | |---|---|---| 52 | | Branch | 3 cycles | 1 or 3 cycles | 53 | | Non-Branch | 2 Cycles | 1 cycle | 54 | 55 | Thus, Predict Not-Taken always wins over refusing to predict 56 | 57 | ### Predict Not-Taken 58 | 59 | Operation: Simply increment PC (no special hardware or memory, since we have to do this anyway) 60 | 61 | Accuracy: 62 | * 20% of instructions are branches 63 | * 60% of branches are taken 64 | * \\(\Rightarrow\\) Correctness: 80% (non-branches) + 8% (non-taken branches) 65 | * \\(\Rightarrow\\) Incorrect 12% of time 66 | * CPI = 1 + 0.12*penalty 67 | 68 | ## Why We Need Better Prediction? 69 | | | Not Taken
88% | Better
99% | Speedup | 70 | |---|:---:|:---:|:---:| 71 | | 5-stages
(3rd stage) | \\(1 + 0.12\*2\\)
\\(CPI = 1.24\\) | \\(1 + 0.01\*2\\)
\\(CPI = 1.02\\) | \\(1.22\\) | 72 | | 14-stages
(11th stage) | \\(1 + 0.12\*10\\)
\\(CPI = 2.2\\) | \\(1 + 0.01\*10\\)
\\(CPI = 1.1\\) | \\(2\\) | 73 | | 11th stage
(4 inst/cycle) | \\(0.25 + 0.12\*10\\)
\\(CPI = 1.45\\) | \\(0.25 + 0.01\*10\\)
\\(CPI = 0.35\\) | \\(4.14\\) | 74 | 75 | If we have a deeper pipeline or are able to execute more instructions per cycle, the better predictor is more important than in simpler processors, because the cost of misprediction is much higher (more instructions lost with a misprediction). 76 | 77 | ### Better Prediction - How? 78 | 79 | Predictor must compute \\(PC_{next}\\) based only on knowledge of \\(PC_{now}\\). This is not much information to decide on. It would help if we knew: 80 | * Is it a branch? 81 | * Will it be taken? 82 | * What is the offset field of the instruction? 83 | 84 | But we don't know any of these because we're still fetching the instruction. We do, however, know the history of how \\(PC_{now}\\) has behaved in the past. So, we can go from: \\(PC_{next} = f(PC_{now})\\) to: 85 | 86 | $$ PC_{next} = f(PC_{now}, history[PC_{now}]) $$ 87 | 88 | ## BTB - Branch Target Buffer 89 | 90 | The predictor can take the \\(PC_{now}\\) and uses it to index into a table called the BTB, with the output of our best guess at next PC. Later, when the branch executes, we know the actual \\(PC_{next}\\) and can compare with the predicted one. If it doesn't match, then it is handled as a misprediction and the BTB can be updated. 91 | 92 | ![Branch Target Buffer](https://i.imgur.com/h6Fwke1.png) 93 | 94 | One problem: How big does the BTB need to be? We want it to have single cycle latency (small). However, it needs to contain an entire 64-bit address and we one entry for each PC we can fetch. Thus, the BTB would need to be as large as the program itself. How do we make it smaller? 95 | 96 | ### Realistic BTB 97 | 98 | First, we don't need an entry for every possible PC. It's enough if we have entries for any PC likely to execute soon. For example, in a loop of 100 instructions, it's enough to have about a 100-length BTB. 99 | 100 | Perhaps we find through testing that a 1024-entry BTB can still be accessed in one cycle. How do we map 64-bit PCs to this 1024 entry table, keeping in mind any delay in calculating the mapping from PC to BTB index would then shorten the BTB further to compensate. 101 | 102 | The way we do this is by simply taking the last 10 bits (in the 1024 case). While this means future instructions will eventually overwrite the BTB entries for the current instructions, this ensures instructions located around each other are all mapping to the BTB during execution. This particularly applies to better predicting branch behavior in loops or smaller programs. 103 | 104 | If instructions are word-aligned, the last two bits will always be `0b00`. Therefore we can ignore those and index using bits 12-2 in the 1024 case. 105 | 106 | ## Direction Predictor 107 | The BHT is used like the BTB, but the entry is a single bit that tells us whether a PC is: 108 | - [0] not a taken branch (PC++) 109 | - [1] a taken branch (use BTB) 110 | 111 | The entries are accessed by least insignificant bits of PC (e.g. 12-2 in 1024 case). Once the branch resolves we can update BHT accordingly. Because this table is very small in terms of data, it can be much larger in terms of entries. 112 | 113 | ### Problems with 1-bit Predictor 114 | It works well if an instruction is always taken or else always not taken. It also predicts well when taken branches >>> not taken, or not taken >>> taken. Every "switch" (anomaly) between an instruction being taken and not taken causes two mispredictions (on the change, and then the next change). 
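As a concrete illustration (my own sketch, not from the lecture - the 1024-entry table and word-aligned indexing are assumptions carried over from the BTB discussion), the 1-bit BHT amounts to nothing more than remembering the last outcome:

```c
#include <stdbool.h>
#include <stdint.h>

#define BHT_ENTRIES 1024                 /* assumed size, indexed by low PC bits */
static uint8_t bht[BHT_ENTRIES];         /* one bit per entry: 0 = not taken, 1 = taken */

/* Prediction uses only the PC, as required at fetch time. */
static bool predict_1bit(uint64_t pc) {
    return bht[(pc >> 2) % BHT_ENTRIES] != 0;
}

/* Once the branch resolves, remember only the most recent outcome.
 * Each "switch" in behavior therefore costs two mispredictions:
 * one when the outcome first changes, and one when it changes back. */
static void update_1bit(uint64_t pc, bool taken) {
    bht[(pc >> 2) % BHT_ENTRIES] = taken ? 1 : 0;
}
```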
115 | 116 | So, the 1-bit predictor does not do so well if the taken to not taken ratio is not very large. It also will not perform well in short because this same anomaly occurs when the loop is executed again (the previous loop exit will cause it to mispredict the new loop). 117 | 118 | ## 2-Bit Predictor (2BP, 2BC) 119 | This predictor fixes the behavior of the 1-bit predictor during the anomaly. The upper bit behaves like the 1-bit predictor, but it adds a hysteresis (or "conviction") bit. 120 | 121 | ![1-bit predictor state machine](https://i.imgur.com/pe7RYc9.png) 122 | 123 | | Prediction Bit | Hysteresis Bit | Description | 124 | |:---:|:---:|---| 125 | | 0 | 0 | Strong Not-Taken | 126 | | 0 | 1 | Weak Not-Taken | 127 | | 1 | 0 | Weak Taken | 128 | | 1 | 1 | Strong Taken | 129 | 130 | Basically, the upper bit controls the final prediction, but the lower bit allows the predictor to flow towards the opposite prediction. Thus a `0b00` state would require 2 taken branches in a row to change its prediction toward a taken branch. This prevents the case of a single anomaly causing multiple mispredictions, in that the behavior itself must be changing for the prediction to change. 131 | 132 | ![2-bit predictor state machine](https://i.imgur.com/TbSV6zv.png) 133 | 134 | It can be called a 2-bit counter because it simply increments on taken branches, or decrements on not-taken. This allow the predictor to be easily implementable. 135 | 136 | ### 2-Bit Predictor Initialization 137 | The question is - where do we start with the predictor? If we start in a strong state and are correct, then we have no mispredictions. If we are wrong, it costs two mispredictions before it corrects itself. If we start in a weak state and are right, we also have perfect prediction. However, if we were wrong, it only costs a single misprediction before it corrects itself. This leads us to think it's always best to start in a weak state. 138 | 139 | However, consider the state where a branch flips between Taken and Not-Taken; by starting in a strong state, we mispredict half the time. Starting in a weak state, we _constantly_ mispredict. 140 | 141 | ![2-bit predictor misprediction on flipped state](https://i.imgur.com/LcNJv41.png) 142 | 143 | So, while it may seem better to start in one state or another, in reality there is no way to predict what state is best for the incoming program, and indeed there is always some sequence of taken/not-taken that can result in constant misprediction. Therefore, it is best to simply initialize in the easiest state to start with, typically `0b00`. 144 | 145 | ### 1-bit to 2-bit is Good... 3-bit, 4-bit? 146 | 147 | If there is benefit in moving to 2-bit, what about just adding more bits? This really serves to increase the cost (larger BHT to serve all PCs), but is only good when anomalous outcomes happen in streaks. How often does this happen? Sometimes. Maybe 3 bits might be worth it, but likely not 4. Best to stick with 2BP. 148 | 149 | ## History-Based Predictors 150 | These predictors function best with a repeatable pattern (e.g. "N-T-N-T-N-T..." or "N-N-T-N-N-T..."). These are 100% predictable, just not with simple n-bit counters. So, how do we learn the pattern? 151 | 152 | ![history-based predictor](https://i.imgur.com/QFxW4dB.png) 153 | 154 | In this case our prediction is done by looking "back" at previous steps. So in the first pattern, we know when the history is N, predict T, and vice-versa. In the second pattern, we look back two steps. 
So when the history is NN, predict T. When it is NT, predict N. And when it is TN, predict N. 155 | 156 | ### 1-bit History with 2-bit Counters 157 | In this predictor, each BHT entry has a single history bit, and two 2-bit counters. The history bit can then be used to index into which counter to use for the prediction. 158 | 159 | ![1-bit history with 2-bit counters](https://i.imgur.com/wn9dPMD.png) 160 | 161 | While this type of predictor works great for patterns like this, it still mispredicts 1/3 of the time in a "NNT-NNT-NNT" type pattern, as the prediction following an N is a 50% chance of being right. 162 | 163 | ### 2-bit History Predictor 164 | This predictor works the same way as the 1-bit history predictor, but now we have 2 bits of history used to index into a 2BC[4] array. This perfectly predicts both the (NT)* and (NNT)* pattern types and is a pretty good predictor for other patterns. However, it wastes one 2BC on the (NNT)* case and two 2BCs on the (NT)* case. 165 | 166 | ![2-bit history predictor](https://i.imgur.com/7WaZWRg.png) 167 | 168 | ### N-bit History Predictor 169 | 170 | We can generalize to state that an N-bit history predictor can successfully predict all taken patterns of \\(length \leq N+1\\), but will cost \\(N+2*2^N\\) bits per entry and waste most 2BCs. So, while increasing N will give us the ability to predict longer patterns, we do so at rapidly increasing cost with more waste. 171 | 172 | ## History-Based Predictors with Shared Counters 173 | 174 | Instead of \\(2^N\\) counters per entry, we want to use \\(\approx N\\) counters. The idea is to share 2BCs between entries instead of each entry having its own counter. 175 | 176 | We can do this with a Pattern History Table (PHT). This table simply keeps some PC-indexed history bits (N bits per entry), combines that with bits of the PC (XOR) to index into the BHT, each entry of which is just a single 2BC. Thus it is very possible to have two entries/history combinations using the same BHT entry 177 | 178 | ![history with shared counters](https://i.imgur.com/7PeYo5u.png) 179 | 180 | Thus, small patterns will only use a few counters, leaving many other counters for longer, more complex patterns to use. This can still result in wasted space, but not nearly as much as the exponential increase of the N-bit history predictor. The downside, of course, is that some branches with particular histories may overlap with other branches/histories. But, if the BHT is large enough (and it can be with each entry being a single 2BC), this should happen rarely. 181 | 182 | ### PShare and GShare Predictors 183 | This shared counters predictor is called PShare -> "P"rivate history, "Share"d counters. This is good for even-odd and 8-iteration loops. 184 | 185 | Another option is GShare, or "G"lobal history, "Share"d counters. The history indexes are shared among all entries, and the PC+History operation (XOR) is used to index into the BHT. This is good for correlated branches - which is very common in programs, as typically operations in one branch are probably somewhat dependent on operations from a previous branch. 186 | 187 | Which to use? Both! Use GShare for correlated branches, and PShare for single branches with shorter history. 188 | 189 | ## Tournament Predictor 190 | We have two predictors, PShare and GShare, each of which is better at predicting certain types of branches. 
A meta-predictor (array of 2BCs) is used not as a branch predictor, but rather as a predictor of which other predictor is more likely to yield a correct prediction for the current branch. At each step, you "train" each individual predictor based on the outcome, and you also train the meta-predictor on how well each predictor is doing. 191 | 192 | ![tournament predictor](https://i.imgur.com/Pa56hP0.png) 193 | 194 | ## Hierarchical Predictor 195 | Like a tournament predictor, but instead of combining two good predictors, it is using one good and one ok predictor. The idea is that good predictors are expensive, and some branches are very easy to predict. So the "ok" predictor can be used for these branches, and the better predictor can be saved for the branches that are more difficult to predict. 196 | 197 | | | Tournament | Hierarchical | 198 | |---|---|---| 199 | |Predictors|2 good predictors|1 good, 1 ok| 200 | |Updates|Update both for every decision|Update OK-pred on every decision
Update Good-pred only if OK-pred not good| 201 | |(Other)|Good predictors are both chosen to be balanced between performance and cost|Can use a combination of predictors of differing quality| 202 | 203 | In this example, the 2BC simple predictor will be used for most branches, but if it is doing a poor job the branch is added to the Local predictor. Similarly it could also go to the Global predictor. Over time the CPU is "trained" on how to handle each branch. 204 | ![Hierarchical Predictor Example](https://i.imgur.com/o8jriH3.png) 205 | 206 | ## Return Address Stack (RAS) 207 | For any branch, we need to know the direction (taken, not taken) and the target. The previous predictors have covered direction (BHT) and target (BTB) find for most types of branches (complex like `BEQ` and `BNE`, simple like `JUMP` and `CALL`, etc.). However, there is a type of branch, the function return, which is always taken (so direction prediction is fine), but the target is more difficult to predict, as it can be called from many different places. The BTB is not good at predicting the target when it could change each time. 208 | 209 | The RAS is a predictor dedicated to predicting function returns. The idea is that upon each function call, we push the return address (PC+4) on the RAS. When we return, we pop from this stack to get the correct target address. 210 | 211 | Why not simply use the stack itself? The prediction should be very close to the other predictors, and using a separate stack allows the predictor to be very small hardware structure and make the prediction very quickly. 212 | 213 | What happens if we exceed the size of the RAS? Two choices: 214 | - Don't push 215 | - Wrap around - this is the best approach (main and top level functions do not consume the entire RAS) 216 | 217 | In the end, remember this is another predictor and we are allowed to have mispredictions - the stack is still there and will work, just with misprediction cost. The goal is to optimize the greatest number of branches and returns, so the wrap-around approach is best. 218 | 219 | ### How do we know it's a `RET`? 220 | This is all still during the IF phase, so how do we even know if the instruction is a `RET`? One simple way is to use a single-bit predictor to whether an instruction is a `RET` or not. This is very accurate (that PC is likely to always be a `RET`). 221 | 222 | Another approach is called "Predecoding". Most of the time instructions are coming from the processor cache and are only loaded from memory when not in the cache. This strategy involves storing part of the decoded instruction along with the instruction in cache. For example, if an instruction is 32-bits, maybe we store 33 bits, where the extra bit tells us if it is a `RET` or not. The alternative is decoding this every time the instruction is fetched, which becomes more expensive. Modern processors store a lot of information during the predecode step such that the pipeline can move quickly during execution. 
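To make the push/pop-with-wrap-around behavior concrete, here is a small C sketch (my own; the 16-entry size is an assumption for illustration - real RAS sizes vary by processor):

```c
#include <stdint.h>

#define RAS_SIZE 16u        /* assumed size; a power of two so the index wraps cleanly */

static uint64_t ras[RAS_SIZE];
static unsigned ras_top;    /* counts pushes minus pops; used modulo RAS_SIZE */

/* On a CALL: push the return address (PC+4). If the RAS is full we simply
 * wrap around and overwrite the oldest entry, so only the outermost
 * returns (e.g. back into main) will mispredict. */
static void ras_push(uint64_t return_pc) {
    ras[ras_top % RAS_SIZE] = return_pc;
    ras_top++;
}

/* On a predicted RET: pop the most recent entry. A wrong value here is
 * just a misprediction - the real call stack in memory still keeps the
 * program correct; we only pay the flush penalty. */
static uint64_t ras_pop(void) {
    ras_top--;
    return ras[ras_top % RAS_SIZE];
}
```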
223 | 224 | *[2BP]: 2-bit Predictor 225 | *[2BC]: 2-Bit Counter 226 | *[2BCs]: 2-Bit Counters 227 | *[ALU]: Arithmetic Logic Unit 228 | *[BHT]: Branch History Table 229 | *[BTB]: Branch Target Buffer 230 | *[CPI]: Cycles Per Instruction 231 | *[IF]: Instruction Fetch 232 | *[PC]: Program Counter 233 | *[PHT]: Pattern History Table 234 | *[RAS]: Return Address Stack 235 | *[XOR]: Exclusive-OR 236 | -------------------------------------------------------------------------------- /cache-coherence.md: -------------------------------------------------------------------------------- 1 | --- 2 | id: cache-coherence 3 | title: Cache Coherence 4 | sidebar_label: Cache Coherence 5 | --- 6 | 7 | [🔗Lecture on Udacity (2hr)](https://classroom.udacity.com/courses/ud007/lessons/907008654/concepts/last-viewed) 8 | 9 | ## Cache Coherence Problem 10 | ![cache coherence problem](https://i.imgur.com/VyEGRMt.png) 11 | As each core reads from memory, it may be using the cache, and value changes on other cores do not properly update all cores. Thus, it may operate on incorrect values with many reads and writes. This is called an Incoherent state. We instead need the idea of Cache Coherence - even when the cache is spread among many cores, it should behave as a single memory. 12 | 13 | ## Coherence Definition 14 | 1) Read R from address X on core C1 returns the value written by the most recent write W to X on C1, if no other core has written to X between W and R. 15 | - Case when only one core is operating on a location, should behave like a uniprocessor. 16 | 2) If C1 writes to X and C2 reads after a sufficient time, and there are no other writes in-between, C2's read returns the the value from C1's write. 17 | - After a "sufficent time", a core must read the "new" values produced by other cores using that location. 18 | - This requires some kind of active design to make this work correctly. 19 | 3) Writes to the same location are serialized: any two writes to X must be seen to occur in the same order on all cores. 20 | - All cores should agree on which values were written in a particular order. 21 | 22 | ## How to Get Coherence 23 | Options: 24 | 1) No caches (really bad performance) 25 | 2) All cores share same L1 cache (bad performance) 26 | 3) ~~Private write-through caches~~ 27 | (not really coherent because it doesn't prevent stale reads) 28 | 4) Force read in one cache to see write made in another. 29 | 1) Pick one of these: 30 | * Broadcast writes to update other caches (Write-Update Coherence) 31 | * Writes prevent hits to other copies (Write-Invalidate Coherence) 32 | 2) And, pick one of these: 33 | * All writes are broadcast on shared bus (Snooping) - bus can be a bottleneck 34 | * Each block assigned an "Ordering Point" (Directory) 35 | 36 | ## Write-Update Snooping Coherence 37 | Two caches with valid bit (V), tag (T) and data, both connected to the same bus with memory. We have the following instructions: 38 | ``` 39 | Core 0 Core 1 40 | 1 RD A WR A=1 41 | 2 WR A=2 WR A=3 42 | ``` 43 | The first instruction executes with the Cache 0, is a miss, and is pulled from memory (A=0). Cache 1 could see this but does not care since it only updates on writes. Cache 1 gets the other instruction, writes to A in its own cache, and puts the request on the bus. It is seen by Cache 0, which then updates itself with the new value. 44 | 45 | Then, we get the second instructions both executing (close to) simultaneously. 
With a single bus, only one of these can send on the bus at a time, so the caches must arbitrate to see which one "wins". So both modify their own cache at the same time (Cache 0: A=2, Cache 1: A=3). Let's say Core 0 wins the arbitration and sends A=2 on the bus. Cache 1 then updates to A=2, and _then_ sends its own A=3 on the bus, updating both caches.
46 |
47 | In this way, at any point in time all processors agree on the historical values of A in the same order, because the bus serializes the writes.
48 |
49 | ![write update snooping coherence](https://i.imgur.com/bhcMm0n.png)
50 |
51 | ### Write-Update Optimization 1: Memory Writes
52 | Memory is on the same bus, yet is very slow, so it becomes a big bottleneck - it cannot keep up with all the writes. We thus want to avoid unnecessary memory writes. The solution is one we have seen before - add a "dirty" bit to each cache block. The cache with the dirty bit set is then responsible for answering any requests for that block (instead of memory). If a cache with the dirty bit set snoops a write on the bus, it unsets the dirty bit and is no longer responsible. Memory is only written when a block whose dirty bit is set is replaced. Thus, the last writer is responsible for writing to memory.
53 |
54 | * Dirty Bit benefits:
55 | * Write to memory only when block replaced
56 | * Read from memory only if no cache has that block in a dirty state.
57 |
58 | ### Write-Update Optimization 2: Bus Writes
59 | With the previous optimization, memory is no longer as busy, but the bus still sees every write and now becomes the bottleneck in the system. Broadcasting each write is inherent to a Write-Update coherence system. But what about writes to blocks that no other cache holds? Those update broadcasts are wasted.
60 |
61 | We can add a "Shared" bit to each block in the cache that tells us whether other caches are using this block. Additionally, there is another line on the bus that other caches pull high when they snoop a read for a block they also hold, so the reading core can easily tell the block is shared and set its Shared bit to 1. A cache also pulls this line high if it sees a write to a block it already has in its cache, to signal to the writing cache that the block is still in use so it can mark its own Shared bit.
62 |
63 | When a write happens on a block with Shared=1, the cache knows to broadcast that write on the bus; otherwise it does not. In this way, we get the Write-Update behavior when there is sharing, but otherwise we do not add extra traffic to the shared bus.
64 |
65 | ## Write-Invalidate Snooping Coherence
66 | This works similarly to Write-Update with both optimizations (dirty/shared) applied. However, instead of broadcasting the entire new value on every write, we only broadcast the fact that the block with that tag has been modified. Other caches containing that tag simply unset their Valid bit. The next time one of those caches reads that block, it misses and must re-read, and the request is answered by the cache that holds the block with Dirty=1 (or by memory if no cache has a dirty copy).
67 |
68 | The disadvantage of Write-Invalidate is that every reader takes a cache miss when reading something previously written by another cache. The advantage is that repeated writes don't force other caches to keep receiving updates for blocks they may not need. Additionally, when a write sends an invalidation, the writer can also unset its own Shared bit (all other copies have just been invalidated, so those caches will have to re-read the block anyway), which reduces bus activity further.
69 |
70 | ## Update vs. Invalidate Coherence
71 | | Application Does...
| Update | Invalidate | 72 | |---|---|---| 73 | | Burst of writes to one address | Each write sends an update (bad) | First write invalidates, other accesses are just hits (good) | 74 | | Write different words in same block | Update sent for each word (bad) | First write invalidates, other accesses are just hits (good) | 75 | | Producer-Consumer WR then RD | Producer sends updates, consumer hits (good) | Producer invalidates, consumer misses and re-reads (bad) | 76 | 77 | And, the winner is... Invalidate! Overall, Invalidate is just slightly better in regard to these activities, but where it really shines is: 78 | 79 | | | | | 80 | |---|---|---| 81 | | Thread moves to another core | Keep updating the old core's cache (horrible) | First write to each block invalidates, then no traffic (good) | 82 | 83 | ## MSI Coherence 84 | A block can be in one of three states: 85 | 1. I(nvalid): (V=0, or not in cache) 86 | * Local Read: Move to Shared state, **put RD on bus** 87 | * Local Write: Move to Modified state, **put WR on bus** 88 | * Snoop RD/WR on bus: (Remain in Invalid state) 89 | 2. S(hared): (V=1, D=0) 90 | * Local Read: (Remain in Shared state) 91 | * Local Write: Move to Modified state, **Put Invalidation on Bus** 92 | * Snoop WR on bus: Move to Invalid state 93 | * Snoop RD on bus: (Remain in Shared state) 94 | 3. M(odified): (V=1, D=1) 95 | * Local Read: (Remain in Modified state) 96 | * Local Write: (Remain in Modified state) 97 | * Snoop WR on bus: Move to Invalid state, **WR back** 98 | * (can delay WR and then proceed again once this WR is done) 99 | * Snoop RD on bus: Move to Shared state, **WR back** 100 | 101 | ![MSI coherence](https://i.imgur.com/L3zVF7o.png) 102 | 103 | ### Cache-to-Cache Transfers 104 | * Core1 has block B in M state 105 | * Core2 puts RD on bus 106 | * Core1 has to provide data... but how? 107 | * Solution 1: Abort/Retry 108 | * Core1 cancels Core2's request ("abort" signal on bus) 109 | * Core1 does normal write-back to memory 110 | * Core2 retries, gets data from memory 111 | * Problem: double the memory latency (write and read) 112 | * Solution 2: Intervention 113 | * Core1 tells memory it will supply the data ("intervention" signal on bus) 114 | * Core1 responds with data 115 | * Memory picks up data (both caches are now in shared state and think the block is not dirty, so memory also needs to update) 116 | * Problem: more complex hardware 117 | * Most use some form of Intervention, to avoid performance hit. 118 | 119 | ### Avoiding Memory Writes on Cache-to-Cache Transfers 120 | * C1 has block in M state 121 | * C2 wants to read, C1 responds with data -> C1: S, C2: S 122 | * C2 writes -> C1: I, C2: M 123 | * Maybe this repeats a few times, but memory write happens every time C1 responds with data 124 | * C1 read, C2 responds with data -> C1: S, C2: S 125 | * C3 read, memory provides data (memory read, even though either cache could respond) 126 | * C4 read, memory provides data (...) 127 | 128 | So, we want to avoid these memory read/writes if another cache already has the data. 129 | 130 | We need to make a non-M version responsible for: 131 | * Giving data to other caches 132 | * Eventually writing block to memory 133 | 134 | We need to know which cache is "responsible" for the data. New State: O(wned). 135 | 136 | ## MOSI Coherence 137 | * Like MSI, except: 138 | * M => snoop A Read => O (not S) 139 | * Memory does not get accessed! 
140 | * O State, like S, except 141 | * Snoop a read => Provide Data 142 | * Write-Back to memory if replaced 143 | * M: Exclusive Read/Write Access, Dirty 144 | * S: Shared Read Access, Clean 145 | * O: Shared Read Access, Dirty (only one block in this state) 146 | 147 | ### M(O)SI Inefficiency 148 | * Thread-Private Data 149 | * All Data in a single-threaded program 150 | * Stack in Multi-threaded programs 151 | * Data Read, then Write 152 | * I -> Miss -> S -> Invalidation -> M 153 | * Uniprocessor: V=0 => Read - Miss => V=1 => Hit => D=1 154 | * We want to avoid this Invalidation step that is unnecessary for thread-private data. For this, we will add another state: (E)xclusive 155 | 156 | ## The E State 157 | * M: Exclusive Access (RD/WR), Dirty 158 | * S: Shared Access (RD), Clean 159 | * O: Shared Access (RD), Dirty 160 | * E: Exclusive Access (RD/WR), Clean 161 | 162 | For the Read/Write loop, here are how each model works (with bus traffic required) 163 | 164 | | | MSI | MOSI | MESI | MOESI | 165 | |--------|:-----------------:|:-----------------:|:-----------------:|:-----------------:| 166 | | `RD A` | I -> S
(miss) | I -> S
(miss) | I -> E
(miss) | I -> E
(miss) | 167 | | `WR A` | S -> M
(inv) | S -> M
(inv) | E -> M | E -> M | 168 | 169 | So, with MESI/MOESI, we get the same behavior as if we had used the same sequence of accesses on a uniprocessor. The E state allows us to achieve a cache hit we're looking for. 170 | 171 | ## Directory-Based Coherence 172 | * Snooping: broadcast requests so others see them, and to establish ordering 173 | * Bus becomes a bottleneck 174 | * Snooping does not work well with > 8-16+ cores 175 | * Non-Broadcast Network 176 | * How do we observe requests we need to see? 177 | * How do we order requests to the same block? 178 | 179 | ### Directory 180 | * Distributed across cores 181 | * Each "slice" serves a set of blocks 182 | * One entry for each block it serves 183 | * Entry tracks which caches have block (in non-I state) 184 | * Order of accesses determined by "home" slice 185 | * Caches still have same states: MOESI 186 | * When we send a request to read or write, it no longer gets broadcast on a bus, it is communicated to a single directory. 187 | 188 | ### Directory Entry 189 | * 1 Dirty Bit (causes us to find out if a cache needs to do a write-back) 190 | * 1 Bit per Cache: present in that cache 191 | 192 | In this example with 8 cores, a read request on Cache 0 of B would be sent to the home slice for block B (instead of being broadcast on a bus). This directory responds with the data (from memory), and that it has Exclusive access. The directory then sets bits for Dirty and Presence[0]. 193 | 194 | Cache 1 then performs a write request of B. The directory forwards that write to Cache 0, which moves to Invalid state and can then either return the data (since in E state), or just ignore and acknowledge the invalidation. The directory unsets bit Presence[0] (because it is in I state), sets bit Presence[1] and Cache 1 moves to the M state. 195 | ![directory entry](https://i.imgur.com/pfSVrbT.png) 196 | 197 | ### Directory Example 198 | [🎥 View lecture video (4:59)](https://www.youtube.com/watch?v=lZZYILcQ68Y) 199 | 200 | ## Cache Misses with Coherence 201 | * 3 Cs: Compulsory, Conflict, Capacity 202 | * Another C: Coherence Miss 203 | * Example: If we read something, somebody else writes it, and we read it again 204 | * So, 4 Cs now! 205 | * Two types of coherence misses: 206 | * True Sharing: Different cores access same data 207 | * False Sharing: Different cores access different data, _but in the same block_ 208 | 209 | -------------------------------------------------------------------------------- /cache-review.md: -------------------------------------------------------------------------------- 1 | --- 2 | id: cache-review 3 | title: Cache Review 4 | sidebar_label: Cache Review 5 | --- 6 | 7 | [🔗Lecture on Udacity (1.5 hr)](https://classroom.udacity.com/courses/ud007/lessons/1025869122/concepts/last-viewed) 8 | 9 | ## Locality Principle 10 | Things that will happen soon are likely to be close to things that just happened. 11 | 12 | ### Memory References 13 | Accessed Address `x` Recently 14 | - Likely to access `x` again soon (Temporal Locality) 15 | - Likely to access addresses close to `x` too (Spatial Locality) 16 | 17 | ### Locality and Data Accesses 18 | Consider the example of a library (large information but slow to access). Accessing a book (e.g. Computer Architecture) has Temporal Locality (likely to look up the same information again) and also Spatial Locality (likely to look up other books on Computer Architecture.) In this example, consider options a student can do: 19 | 1. 
Go to the library every time to find the info he needs, go back home 20 | 2. Borrow the book and bring it home 21 | 3. Take all the books and build a library at home 22 | 23 | Option 1 does not benefit from locality (inconvenient). Option 3 is still slow to access and requires a lot of space and does not benefit from locality. Option 2 is a good tradeoff for being able to find information quickly without being overwhelmed with data. This is like a Cache, where we bring only certain information to the processor for faster access. 24 | 25 | ## Cache Lookups 26 | We need to it be fast, so it must be small. Not everything will fit. 27 | Access 28 | - Cache Hit - found it in the cache (fast access) 29 | - Cache Miss - not in the cache (access slow memory) 30 | - Copy this location to the cache 31 | 32 | ## Cache Performance 33 | Properties of a good cache: 34 | - Average Memory Access Time (AMAT) 35 | - \\(\text{AMAT} = \text{hit time} + \text{miss rate} * \text{miss penalty} \\) 36 | - hit time \\(\Rightarrow\\) small and fast cache 37 | - miss rate \\(\Rightarrow\\) large and/or smart cache 38 | - miss penalty \\(\Rightarrow\\) main memory access time 39 | - "miss time" = hit time + miss penalty 40 | - Alternate way to see it: \\(AMAT = (1-rate_{miss}) * t_{hit} + rate_{miss}*t_{miss} \\) 41 | 42 | ## Cache Size in Real Processors 43 | Complication: several caches 44 | 45 | L1 Cache - Directly service read/write requests from the processor 46 | - 16KB - 64KB 47 | - Large enough to get \\(\approx\\) 90% hit rate 48 | - Small enough to hit in 1-3 cycles 49 | 50 | ## Cache Organization 51 | ![Cache Organization](https://i.imgur.com/PIWX0aG.png) 52 | 53 | Cache is a table indexed by some bits correlating to the address. It contains data and a flag whether it's a hit or not. The size of data per entry (block size) is selected as a balance between having enough data to maximize hit rate for locality, without using up too much data that will not be accesses. Typically 32-128 bytes is ideal. 54 | 55 | ### Cache Block Start Address 56 | - Anywhere? 64B block => 0..63, 1..64, 2..65. (cannot effectively index into cache, contains overlap) 57 | - Aligned! 64B block => 0..63, 64..127, ... (easy to index using bits of the address, no overlap) 58 | 59 | ### Blocks in Cache and Memory 60 | Memory is divided into blocks. Cache is divided into lines (of block size). This line is a space/slot in which a block can be placed. 61 | 62 | ### Block Offset and Block Number 63 | Block Offset is the lower N bits (where 2^N = block size) that tell us where in the block to index. Block Number is the upper M bits (where M+N = address length) that tell us how to index the blocks for selection. 64 | 65 | ### Cache Tags 66 | In addition to the data, the cache keeps a list of "tags", one for each cache line. When accessing a block, we compare block number to each tag in the cache to determine which line that block is in, if it is in cache. We then know the line and offset to access the memory. 67 | 68 | ### Valid Bit 69 | What happens if the tags are initialized to some value that matches the tag of a valid address? We also need a "valid bit" for each cache line to tell us that specific cache line is valid and can be properly read. An added benefit is that we don't need to worry about clearing tag and data all the time - we can just clear the valid bit. 
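As a small illustration of the tag and valid bit just described, here is a C++ sketch of a single cache line and its hit check. The 64-byte block size and the field names are assumptions for illustration, not a prescribed layout.

```cpp
#include <cstdint>

// One cache line: data plus the tag and valid bit described above.
struct CacheLine {
    bool     valid = false;     // does the line contain a real block?
    uint64_t tag   = 0;         // which block is stored here
    uint8_t  data[64] = {};     // the cached block itself (64 B assumed)
};

// A lookup hits only if the line is valid AND the tags match; without the
// valid bit, a freshly initialized tag could accidentally fake a hit.
bool isHit(const CacheLine& line, uint64_t block_number) {
    return line.valid && line.tag == block_number;
}
```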
70 |
71 | ## Types of Caches
72 | - Fully Associative: Any block can be in any line
73 | - Set-Associative: N lines where a block can be
74 | - Direct-Mapped: A particular block can go into 1 line (Set-associative with N==1)
75 |
76 | ### Direct-Mapped Cache
77 | Each block number maps to a single potential cache line where it can be. Typically some lowermost bits of the block number are used to index into the cache line. The tag is still needed to tell us which block is actually in the cache line, but we now only need the bits of the tag that were not used for the index.
78 | ![Direct-Mapped Cache](https://i.imgur.com/GlCyPW2.png)
79 |
80 | #### Upside and Downside of Direct-Mapped Cache
81 | - Look in only one place (fast, cheap, energy-efficient)
82 | - Block must go in one place (inflexible)
83 | - Conflicts when two blocks are used that map to the same cache line - increased cache miss rate
84 |
85 | ### Set-Associative Caches
86 | N-Way Set-Associative: a block can be in one of N lines (each set has N lines, and a particular block can be in any line of that set)
87 |
88 | #### Offset, Index, Tag for Set-Associative
89 | Similar to before - the lower offset bits of the address still determine where in the cache line to obtain the data. The index bits (determined by how many sets there are) are now used to determine which set the line may be in. The tag is the remaining upper bits.
90 | ![](https://i.imgur.com/3ehF4Ey.png)
91 |
92 | ### Fully-Associative Cache
93 | Still have lower offset bits, but there is now no index. All remaining bits are the tag. Any block can be in any cache line, but this means that to find something in the cache you have to look at every line to see if it is there.
94 |
95 | ### Direct-Mapped and Fully Associative
96 | Direct-Mapped = 1-way set associative
97 |
98 | Fully Associative = N-way set associative (where N = the number of lines in the cache)
99 |
100 | Always start with offset bits, based on block size. Then determine index bits. The rest of the bits are the tag. (A short code sketch of this split appears below, after the Write Policy section.)
101 |
102 | - Offset bits = \\(log_2(\text{block size})\\)
103 | - Index bits = \\(log_2(\text{# of sets})\\)
104 |
105 | ## Cache Replacement
106 | - Set is full
107 | - Miss -> Need to put new block in set
108 | - Which block do we kick out?
109 | - Random
110 | - FIFO
111 | - LRU (Least Recently Used)
112 | - Hard to implement but exploits locality
113 | - Could implement via NMRU (Not Most Recently Used)
114 |
115 | ### Implementing LRU
116 | LRU has a separate set of counters (one per line). The counter is set to max when the line is accessed, and all other counters are decremented. A counter value of 0 represents the least recently used line, which is the one that could be replaced.
117 |
118 | For an N-way SA cache:
119 | - Cost: N log2(N)-bit counters
120 | - Energy: Change N counters on every access
121 | - (Expensive in both hardware and energy)
122 |
123 | [🎥 Link to lecture (5:11)](https://www.youtube.com/watch?v=bq6N7Ym81iI)
124 |
125 | ## Write Policy
126 | Do we insert blocks we write into the cache?
127 | - Write-Allocate
128 | - Most caches are write-allocate due to locality - if we write something we are likely to access it again
129 | - No-Write-Allocate
130 |
131 | Do we write just to the cache or also to memory?
132 | - Write-Through (update mem immediately)
133 | - Very unpopular
134 | - Write-Back (write to cache, but write to mem when block is replaced)
135 | - Takes advantage of locality, more desirable
136 |
137 | If you have a write-back cache you also want write-allocate (with a write miss, we want future writes to go to the cache).
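Putting the offset/index/tag formulas above into code: a short C++ sketch that splits an address, assuming (purely as an example, not a lecture-mandated configuration) 64-byte blocks and a 4-way set-associative cache with 64 sets.

```cpp
#include <cstdint>
#include <cstdio>

// Split an address into offset, index, and tag.
// Example parameters (assumptions): 64 B blocks, 64 sets, 4 ways => 16 KB total.
// Offset bits = log2(64) = 6, Index bits = log2(64) = 6, Tag = remaining bits.
int main() {
    const uint64_t kBlockSize = 64;
    const uint64_t kNumSets   = 64;
    const int offsetBits = 6;   // log2(block size)
    const int indexBits  = 6;   // log2(# of sets)

    uint64_t addr   = 0x12345678;                     // arbitrary example address
    uint64_t offset = addr & (kBlockSize - 1);
    uint64_t index  = (addr >> offsetBits) & (kNumSets - 1);
    uint64_t tag    = addr >> (offsetBits + indexBits);

    // The tag is compared against all 4 ways in set `index`; the offset then
    // selects the byte(s) within the 64 B block.
    std::printf("offset=%llu index=%llu tag=0x%llx\n",
                (unsigned long long)offset, (unsigned long long)index,
                (unsigned long long)tag);
    return 0;
}
```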
138 |
139 | ### Write-Back Cache
140 | If we're replacing a block in cache, do we know we need to write it to memory?
141 | - If we did write to the block, write to memory
142 | - If we did not write to the block, no need to write
143 | - Use a "Dirty Bit" to determine if the block has been written to.
144 | - 0 = "clean" (not written since last brought from memory)
145 | - 1 = "dirty" (need to write back on replacement)
146 |
147 | [🎥 Link to example (3:19)](https://www.youtube.com/watch?v=xU0ICkgTLTo)
148 |
149 | ## Cache Summary
150 | [🎥 First Part (2:32)](https://www.youtube.com/watch?v=MWpy5bBxl5A)
151 |
152 | [🎥 Second Part (2:32)](https://www.youtube.com/watch?v=DhxAIKaCEBY)
--------------------------------------------------------------------------------
/cheat-sheet-midterm.md:
--------------------------------------------------------------------------------
1 | ---
2 | id: cheat-sheet-midterm
3 | title: Midterm Cheat Sheet
4 | sidebar_label: Cheat Sheet - Midterm
5 | ---
6 |
7 | A denser collection of important pieces of information covering lectures from Introduction to VLIW
8 |
9 | ## Power Consumption Types
10 |
11 | Two kinds of power a processor consumes.
12 | 1. Dynamic Power - consumed by activity in a circuit
13 | Computed by \\(P = \tfrac 12 \*C\*V^2\*f\*\alpha\\), where
14 | \\(C = \text{capacitance}\\), \\(V = \text{power supply voltage}\\), \\(f = \text{frequency}\\), \\(\alpha = \text{activity factor (% of transistors active each clock cycle)}\\)
15 | 2. Static Power - consumed when powered on, but idle
16 | - The power it takes to maintain the circuits when not in use.
17 | - V \\(\downarrow\\), leakage \\(\uparrow\\)
18 |
19 | ## Performance
20 |
21 | * Latency (time start \\( \rightarrow \\) done)
22 | * Throughput (#/second) (not necessarily 1/latency due to pipelining)
23 | * Speedup - "X is N times faster than Y" (X new, Y old)
24 | * Speedup = speed(X)/speed(Y)
25 | * Speedup = throughput(X)/throughput(Y) = IPC(X)/IPC(Y)
26 | * Speedup = latency(Y)/latency(X) = \\(\frac{CPI(Y)\*CTime(Y)}{CPI(X)\*CTime(X)}\\) (notice Y/X reversal)
27 | * Can also multiply by nInst(Y)/nInst(X) factor
28 | * Performance ~ Throughput ~ 1/Latency
29 | * Ratios (e.g. speedup) can only be calculated via geometric mean
30 | * \\(\text{geometric mean} = \sqrt[n]{a_1\*a_2\*...a_n}\\)
31 | * Iron Law of Performance:
32 | * **CPU Time** = (# instructions in the program) * (cycles per instruction) * (clock cycle time)
33 | * clock cycle time = 1/freq
34 | * For unequal instruction times: \\(\sum_i (IC_i\* CPI_i) * \frac{\text{time}}{\text{cycle}}\\)
35 | * Amdahl's Law - overall effect due to partial change
36 | * \\(speedup = [(1-frac_{enh}) + \frac{frac_{enh}}{speedup_{enh}}]^{-1}\\)
37 | * \\( frac_{enh} \\) represents the fraction of the execution **TIME**
38 | * Consider diminishing returns by improving the same area of code.
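The Iron Law and Amdahl's Law above are easy to sanity-check with a few lines of code; the numbers below are made-up example values, not figures from the lectures.

```cpp
#include <cstdio>

// Iron Law: CPU time = #instructions * CPI * clock cycle time.
// Amdahl: speedup = 1 / ((1 - f) + f / s), where f is the fraction of
// execution TIME that is enhanced and s is the speedup of that fraction.
int main() {
    // Iron Law with example values (assumptions only): 2 GHz clock.
    double insts = 1e9, cpi = 1.5, cycle_time = 0.5e-9;
    double cpu_time = insts * cpi * cycle_time;
    std::printf("CPU time = %.3f s\n", cpu_time);     // 0.750 s

    // Amdahl: enhance 40% of the execution time by 4x.
    double f = 0.4, s = 4.0;
    double speedup = 1.0 / ((1.0 - f) + f / s);
    std::printf("Amdahl speedup = %.3f\n", speedup);  // ~1.429, not 4x
    return 0;
}
```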
39 | 40 | ## Pipelining 41 | * CPI = (base CPI) + (stall/mispred rate %)*(mispred/stall penalty in cycles) 42 | * Dependence - property of the program alone 43 | * Control dependence - one instruction executes based on result of previous instruction 44 | * Data Dependence 45 | * RAW (Read-After-Write) - "Flow" or "True" Dependence 46 | * WAW (Write-After-Write) - "Output", "False", or "Name" dependence 47 | * WAR (Write-After-Read) - "Anti-", "False", or "Name" dependence 48 | * Hazard - when a dependence results in incorrect execution 49 | * Handled by Flush, Stall, and Fix values 50 | * More Stages \\( \rightarrow \\) more hazards (CPI \\( \uparrow \\)), but less work per stage ( cycle time \\( \downarrow \\)) 51 | * 5-stage pipeline: Fetch-Decode-Execute-Memory-Write 52 | 53 | ## Branch Prediction 54 | * Predictor must compute \\(PC_{next}\\) based only on knowledge of \\(PC_{now}\\) 55 | 1. Guess "is this a branch?" 56 | 2. Guess "is it taken?" 57 | 3. "and if so, what is the target PC" 58 | * Accuracy: \\( CPI = 1 + \frac{mispred}{inst} * \frac{penalty}{mispred} \\) 59 | * \\( \frac{mispred}{inst} \\) is determined by the predictor accuracy. 60 | * \\(\frac{penalty}{mispred} \\) is determined by the pipeline depth at misprediction. 61 | * Types of Predictors and components 62 | * Not-Taken: Always assume branch is not taken 63 | * Historical: \\(PC_{next} = f(PC_{now}, history[PC_{now}])\\) 64 | * Branch Target Buffer (BTB): simple table of next best guessed PC based on current PC 65 | * Use last N bits of PC (not counting final 2-4 alignment bits) to index into this table 66 | * Branch History Table (BHT): like BTB, but entry is a single bit that tells us Taken/Not-Taken 67 | * 1-bit history does not handle switches in behavior well 68 | * 2-bit history adds hysteresis (strong and weak taken or not-taken) 69 | * 3-bit, 4-bit? Depends on pattern of anomalies, but may not be worth it. 70 | * History-Based Predictor: Looks at last N states of taken/not-taken. 71 | * Example: in sequence NNTNNT, if history is NN we know to predict T. 72 | * 1-bit history with 2-bit counters (1 historical state, 2-bit prediction) 73 | * 2-bit history with 2-bit counters (2 historical states, 2-bit prediction) 74 | * Shared counters - Pattern History Table (PC-indexed N bits per entry, XOR to index into BHT with 2-bit counter entries) 75 | * PShare - "private history, shared counters" - inner loops, smaller patterns 76 | * GShare - "global history, shared counters" - correlated branches across program 77 | * Tournament Predictor 78 | * Meta-Predictor feeds into a PShare and GShare, and is trained based on the results of each. 79 | * Hierarchical Predictor 80 | * Uses one good and one ok predictor (use "ok" predictor for easy branches, and "good" predictor for difficult ones) 81 | * Return Address Stack (RAS) - dedicated to predicting function returns 82 | * Should be very small structure, quick, and accurate (fairly deterministic) 83 | * Can still mispredict - wrap around stack due to limited space. 84 | * Can know an instruction is a `RET` via 1BC or pre-decoding instructions 85 | 86 | ## Predication 87 | Attempts to do the work of both directions of a branch and simply waste the work if wrong (to avoid control hazards) 88 | * If-Conversion (takes normal code and makes it perform both paths) 89 | ```cpp 90 | if(cond) { |>| x1 = arr[i]; 91 | x = arr[i]; |>| x2 = arr[j]; 92 | y = y+1; |>| y1 = y+1; 93 | } else { |>| y2 = y-1; 94 | x = arr[j]; |>| x = cond ? x1 : x2; 95 | y = y-1; } |>| y = cond ? 
y1 : y2; 96 | ``` 97 | * MIPS operands to help with this (and what `x = cond ? x1 : x2;` looks like) 98 | |inst | operands | does | 99 | |---|---|---| 100 | | `MOVZ` | Rd, Rs, Rt | `if(Rt == 0) Rd=Rs;` | 101 | | `MOVN` | Rd, Rs, Rt | `if(Rt != 0) Rd=Rs;` | 102 | ```mipsasm 103 | R3 = cond 104 | R1 = ... x1 ... 105 | R2 = ... x2 ... 106 | MOVN X, R1, R3 107 | MOVZ X, R2, R3 108 | ``` 109 | * If-Conversion takes more instructions to do the work, but avoids any penalty, so is typically more performant. 110 | 111 | ## Instruction Level Parallelism (ILP) 112 | ILP is the IPC when the processor does the entire instruction in 1 cycle, and can do any number of instructions in the same cycle (while obeying true dependencies) 113 | * Register Allocation Table (RAT) is used for renaming registers 114 | * Steps to get ILP value: 115 | 1. Rename Registers - use RAT 116 | 2. "Execute" - ensure no false dependencies, determine when instructions are executed 117 | 3. Calculate ILP = (\# instructions)/(\# cycles) 118 | 1. Pay attention to true dependencies, trust renaming to handle false dependencies. 119 | 2. Be mindful to count how many cycles being computed over 120 | 3. Assume ideal hardware - all instructions that can compute, will. 121 | 4. Assume perfect same-cycle branch prediction 122 | * IPC should never assume "perfect processor", so ILP \\(\geq\\) IPC. 123 | 124 | ## Instruction Scheduling (Tomasulo) 125 | ![Tomasulo's Algorithm](https://i.imgur.com/MuCQEgr.png) 126 | All of these things happen every cycle: 127 | 1. Issue 128 | Take next from IQ, determine inputs, get free RS and enqueue, tag destination reg of instruction 129 | 2. Dispatch 130 | As RAT values become available on result bus, move from RS to Execution 131 | 3. Write Result (Broadcast) 132 | When execution is complete, put tag and result on bus, write to RF, update RAT, free RS 133 | 134 | ## ReOrder Buffer (ROB) 135 | Used to prevent issues with exceptions and mispredictions to ensure results are not committed to the actual register before previous instructions have completed. 136 | 137 | Correct Out-Of-Order Execution 138 | * Execute Out-Of-Order 139 | * Broadcast Out-Of-Order 140 | * Write values to registers In-Order! 141 | 142 | ROB is a structure (Register | Value | Done) that sits between the RAT and RS. RS now only dispatches instructions and does not have to wait for result to be broadcast before freeing a spot in the RS). RF is only written to once that instruction is complete and all previous instructions have been written. ROB ensures no wrong registers have been committed; upon exception it can flush and move to exception handler. 143 | 144 | ## Memory Ordering 145 | Uses Load-Store Queue (LSQ) to handle read/write memory dependencies. This allows forwarding from stores to loads without having to access cache/mem. Instructions may be out of order, but all memory accesses are in-order. LSQ considers the followiung when being used: 146 | 1. `LOAD`: Which earlier `STORE` can I get a value from? 147 | 2. `STORE`: Which later `LOAD`s do I need to give my value to? 148 | 149 | - Issue: Need a ROB entry and an LSQ entry 150 | - Execute Load/Store: compute the address, produce the value (simultaneously) 151 | - (`LOAD` only) Write Result and Broadcast it 152 | - Commit Load/Store: Free ROB & LSQ entries 153 | - (`STORE` only) Send write to memory 154 | 155 | ## Compiler ILP 156 | * Tree Height Reduction uses associativity in operations to avoid chaining dependencies (e.g. x+y+z+a becomes (x+y)+(z+a)). 
157 | * Instruction Scheduling attempts to reorder instructions to fill in any natural "stalls" 158 | * Loop Unrolling performs multiple iterations of the loop during one actual branching. (unroll once means 2 iterations before branch). 159 | * Function Call Inlining takes a function and copies the work to the main program, providing opportunities for scheduling or eliminating call/ret. 160 | 161 | ## VLIW 162 | Combines multiple instructions into one large one - requires extensive compiler support, but lowers hardware cost. -------------------------------------------------------------------------------- /compiler-ilp.md: -------------------------------------------------------------------------------- 1 | --- 2 | id: compiler-ilp 3 | title: Compiler ILP 4 | sidebar_label: Compiler ILP 5 | --- 6 | 7 | [🔗Lecture on Udacity (1 hr)](https://classroom.udacity.com/courses/ud007/lessons/972428795/concepts/last-viewed) 8 | 9 | 10 | ## Can compilers help improve IPC? 11 | 12 | * Limited ILP 13 | * Due to dependence chains (each instruction depends on the one before it) 14 | * Limited "Window" into program 15 | * Independent instructions are far apart 16 | 17 | ## Tree Height Reduction 18 | Consider a program that performs the operation `R8 = R2 + R3 + R4 + R5`: 19 | ```mipsasm 20 | ADD R8, R2, R3 21 | ADD R8, R8, R4 22 | ADD R8, R8, R5 23 | ``` 24 | Obviously this creates a depdendence chain. Instead we could group the instructions like `R8 = (R2 + R3) + (R4 + R5)`, or: 25 | ```mipsasm 26 | ADD R8, R2, R3 27 | ADD R7, R4, R5 28 | ADD R8, R8, R7 29 | ``` 30 | This allows the first two instructions to execute in parallel. Tree Height Reduction in a compiler uses associativity to accomplish this. But, it should be considered that not all operations are associative. 31 | 32 | ## Make Indepdendent Instructions Easier to Find 33 | Can use various techniques (to follow) 34 | 1. Instruction Scheduling (branch-free sequences of instructions) 35 | 2. Loop Unrolling (and how it interacts with Instruction Scheduling) 36 | 3. Trace Scheduling 37 | 38 | ## Instruction Scheduling 39 | Different from Tomasulo's algorithm that takes place in the processor, but attempts to accomplish a similar thing. Take this sequence of instructions: 40 | 41 | ```mipsasm 42 | Loop: 43 | LW R2, 0(R1) 44 | ADD R2, R2, R0 45 | SW R2, 0(R1) 46 | ADDI R1, R1, 4 47 | BNE R1, R3, Loop 48 | ``` 49 | On each cycle, it may look more like this: 50 | ``` 51 | 1. LW R2, 0(R1) 52 | 2. (stall) 53 | 3. ADD R2, R2, R0 54 | 4. (stall) 55 | 5. (stall) 56 | 6. SW R2, 0(R1) 57 | 7. ADDI R1, R1, 4 58 | 8. (stall) 59 | 9. (stall) 60 | 10. BNE R1, R3, Loop 61 | ``` 62 | Can we move something into that first stall to help ILP? Cycles 3 and 6 cannot move because those are dependent. Is it possible to move the ADDI from cycle 7 to cycle 2, since it does not depend on anything else? 63 | ```mipsasm 64 | Loop: 65 | LW R2, 0(R1) 66 | ADDI R1, R1, 4 67 | ADD R2, R2, R0 68 | SW R2, -4(R1) 69 | BNE R1, R3, Loop 70 | ``` 71 | The `ADDI` can move as-is, but we need to correct the offset in the `SW` instruction to compensate for this. From a cycle analysis, we have eliminated cycles 7 (moved to 2), and the stall from 8-9 (the existing stalls in 4-5 will also compensate for the `ADDI` delay). So instead of 10 cycles, the loop now runs in 7 cycles. 
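The same scheduling idea can be seen at the source level. The sketch below is only an analogy (the compiler actually reorders machine instructions, not C++ statements, and the function names are illustrative): the independent pointer increment is hoisted into the slot right after the load, and the store's offset is adjusted to compensate, mirroring the `ADDI`/`SW -4(R1)` adjustment above.

```cpp
// Unscheduled shape: the add waits on the load and the store waits on the
// add, while the independent pointer bump sits at the end of the body.
void add_to_each_unscheduled(int* a, int* end, int x) {
    while (a != end) {
        int v = *a;   // load
        v = v + x;    // depends on the load (stall on an in-order pipeline)
        *a = v;       // depends on the add
        a = a + 1;    // independent of the chain above
    }
}

// Scheduled shape: the pointer bump fills the load's delay, and the store
// is rewritten as a[-1] so it still writes the element just loaded.
void add_to_each_scheduled(int* a, int* end, int x) {
    while (a != end) {
        int v = *a;
        a = a + 1;    // fills the stall slot after the load
        v = v + x;
        a[-1] = v;    // offset adjusted, like SW R2, -4(R1) above
    }
}
```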
72 | 73 | ## Scheduling and If-Conversion 74 | 75 | In the following example, orange and green are two different branches, and using predication we attempt to execute both and throw away the wrong results later when the branch executes. So using If-Conversion, the program executes these in order. 76 | 77 | ![Scheduling and If Conversion](https://i.imgur.com/eaW4xq6.png) 78 | 79 | For the purposes of scheduling, we can always perform scheduling optimizations within each functional block (e.g. instructions inside the orange section), and also between consecutive blocks (white-orange). We can even reschedule instructions on either side of the branches, and maintain correctness. 80 | 81 | ## If-Convert a Loop 82 | We know how to if-convert branches, but what about a loop? 83 | 84 | ```mipsasm 85 | Loop: 86 | LW R2, 0(R1) #(stall cycle after) 87 | ADD R2, R2, R3 #(stall cycle after) 88 | SW R2, 0(R1) 89 | ADDI R1, R1, 4 #(stall cycle after) 90 | BNE R1, R5, Loop 91 | ``` 92 | With scheduling, we get something like this (notice decrease in wasted cycles) 93 | ```mipsasm 94 | Loop: 95 | LW R2, 0(R1) 96 | ADDI R1, R1, 4 97 | ADD R2, R2, R3 #(stall cycle after) 98 | SW R2, -4(R1) 99 | BNE R1, R5, Loop 100 | ``` 101 | Instead of the BNE, we could use something like If-Conversion to load in the next instructions and fill that last stall cycle. But each iteration would require a new predicate to be created, and there's a point at which most things only get done if a predicate is true, and we don't see the performance boost for the overhead that will take. So, we cannot really If-Convert, but we can do something called Loop Unrolling 102 | 103 | ## Loop Unrolling 104 | ```cpp 105 | for(i=1000; i !=0;l i--) { 106 | a[i] = a[i] + s; 107 | } 108 | ``` 109 | compiles to... 110 | ```mipsasm 111 | LW R2, 0(R1) 112 | ADD R2, R2, R3 113 | SW R2, 0(R1) 114 | ADDI R1, R1, -4 115 | BNE R1, R5, Loop 116 | ``` 117 | 118 | With loop unrolling, we try to do a few iterations of the loop during one iteration, maybe: 119 | ```cpp 120 | for(i=1000; i !=0;l i=i-2) { 121 | a[i] = a[i] + s; 122 | a[i-1] = a[i-1] + s; // we could try unrolling a few more times 123 | } 124 | ``` 125 | which now compiles to: 126 | ```mipsasm 127 | LW R2, 0(R1) 128 | ADD R2, R2, R3 129 | SW R2, 0(R1) 130 | LW R2, -4(R1) 131 | ADD R2, R2, R3 132 | SW R2, -4(R1) 133 | ADDI R1, R1, -8 134 | BNE R1, R5, Loop 135 | ``` 136 | So the process is to take the work, copy it twice, adjust offsets, then adjust the final loop counter and branch instructions as needed. This example was an "unroll once". Unrolling twice would perform 3 iterations before branching. 137 | 138 | ### Loop Unrolling Benefits: # Instructions \\(\downarrow \\) 139 | 140 | In the example above, we start from 5 instructions * 1000 loops = 5000. After loop unrolling we have 8 instructions * 500 loops = 4000. From the Iron Law, we know Execution Time = (# inst)(CPI)(Cycle Time). So just by decreasing instructions by 20% there is a significant effect on overall performance. 141 | 142 | ### Loop Unrolling Benefits: CPI \\(\downarrow \\) 143 | Assume a processor with 4-Issue, In-Order, with perfect branch prediction. We can view how the iterations span over cycles: 144 | |`Loop:` | 1 | 2 | 3 | 4 | 5 | 6 | 145 | |--- |---|---|---|---|---|---| 146 | |`LW R2, 0(R1)` | x | | | x | | | 147 | |`ADD R2, R2, R3` | | x | | | x | | 148 | |`SW R2, 0(R1)` | | | x | | | x | 149 | |`ADDI R1, R1, -4` | | | x | | | x | 150 | |`BNE R1, R5, Loop`| | | | x | | ... | 151 | For an overall CPI of 3/5. 
With scheduling, we can do the following: 152 | |`Loop:` | 1 | 2 | 3 | 4 | 5 | 6 | 153 | |--- |---|---|---|---|---|---| 154 | |`LW R2, 0(R1)` | x | | x | | x | | 155 | |`ADDI R1, R1, -4` | x | | x | | x | | 156 | |`ADD R2, R2, R3` | | x | | x | | x | 157 | |`SW R2, 4(R1)` | | | x | | x | | 158 | |`BNE R1, R5, Loop`| | | x | | x | ... | 159 | For an overall CPI of 2/5 with scheduling. This is a significant boost. Now, what about loop unrolling (unrolled once)? 160 | |`Loop:` | 1 | 2 | 3 | 4 | 5 | 6 | 161 | |--- |---|---|---|---|---|---| 162 | |`LW R2, 0(R1)` | x | | | | | x | 163 | |`ADD R2, R2, R3` | | x | | | | | 164 | |`SW R2, 0(R1)` | | | x | | | | 165 | |`LW R2, -4(R1)` | | | x | | | | 166 | |`ADD R2, R2, R3` | | | | x | | | 167 | |`SW R2, -4(R1)` | | | | | x | | 168 | |`ADDI R1, R1, -8` | | | | | x | | 169 | |`BNE R1, R5, Loop`| | | | | | x | 170 | So it takes 5 cycles to do 8 instructions, for a CPI of 5/8. This is slightly worse when only looking at a few iterations, but overall it will perform much better (since we need half the iterations). Finally... with unrolling once and scheduling: 171 | |`Loop:` | 1 | 2 | 3 | 4 | 172 | |--- |---|---|---|---| 173 | |`LW R2, 0(R1)` | x | | x | | 174 | |`LW R10, -4(R1)` | x | | | x | 175 | |`ADD R2, R2, R3` | | x | | | 176 | |`ADD R10, R10, R3`| | x | | | 177 | |`ADDI R1, R1, -8` | | x | | | 178 | |`SW R2, 8(R1)` | | | x | | 179 | |`SW R10, 4(R1)` | | | x | | 180 | |`BNE R1, R5, Loop`| | | x | | 181 | So it takes 3 cycles to perform 8 instructions, for CPI of 3/8. This is slightly better than with scheduling alone, and when the benefit of loop unrolling over time (half the iterations) are considered, it is a significant improvement. Unrolling provides more opportunities for scheduling to reorder things to optimize for parallelism, in addition to reducing the overall number of instructions. 182 | 183 | ### Unrolling Downside? 184 | A few reasons we may not always unroll loops. 185 | 186 | 1. Code Bloat (in terms of lines of code after compilation) 187 | 2. What if number of iterations is unknown (e.g. while loop)? 188 | 3. What if number of iterations is not a multiple of N? (N = number of unrolls) 189 | 190 | Solutions to 2 and 3 do exist, but are beyond the scope of this course (may be in a compilers course). 191 | 192 | ## Function Call Inlining 193 | 194 | Function Call Inlining is an optimization that takes a function and copies the work inside the main program. 195 | 196 | ![Function Call Inlining](https://i.imgur.com/Vl3YXY0.png) 197 | 198 | This has the benefits of: 199 | * Eliminating call/return overheads (reduces # instructions) 200 | * Better scheduling (reduces CPI) 201 | * Without inlining, the compiler can only schedule instructions around and inside the function block, but by inlining it can schedule all instructions together. 202 | * With fewer instructions and reduced CPI, execution time improves dramatically. 203 | 204 | ### Function Call Inlining Downside 205 | 206 | The main downside to inlining, like in Loop Unrolling, is code bloat. Instead of abstracting the code into its own space and using call/return, it replicates the function body each time it is called. Each time it is inlined, it increases the total program size. Therefore we must be judicious about the usage of inlining. Ideally we select smaller functions to inline, and primarily those that result in fewer instructions when inlined compared to the overhead of setting up parameters, call, and return. 
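A minimal C++ illustration of the same idea (the `scale` helper and the loops are made-up examples, not code from the lecture):

```cpp
// Before inlining: every iteration pays call/return overhead, and the
// compiler can only schedule within the caller or within the callee.
static int scale(int x, int s) { return x * s; }

int sum_scaled_calls(const int* a, int n, int s) {
    int total = 0;
    for (int i = 0; i < n; ++i)
        total += scale(a[i], s);   // call + return each iteration
    return total;
}

// After inlining: the body is copied into the caller, eliminating the
// call/return instructions and exposing the whole loop body to scheduling.
int sum_scaled_inlined(const int* a, int n, int s) {
    int total = 0;
    for (int i = 0; i < n; ++i)
        total += a[i] * s;         // body of scale() copied in place
    return total;
}
```

The trade-off discussed above still applies: every call site that receives a copy of the body grows the program, so small, frequently called functions are the best candidates.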
207 | 208 | ## Other IPC-Enhancing Compiler Stuff 209 | 210 | * Software Pipelining 211 | * Treat a loop as a pipeline where you interleave the instructions among the cycles such that correctness is maintained but there are no dependencies) 212 | * Trace Scheduling ("If-Conversion on steroids") 213 | * Combine the common path into one block of scheduled code, provide a way to branch and "fix" the wrong path execution if the uncommon path is determined to occur. 214 | 215 | *[ALU]: Arithmetic Logic Unit 216 | *[CPI]: Cycles Per Instruction 217 | *[ILP]: Instruction-Level Parallelism 218 | *[IPC]: Instructions per Cycle 219 | *[IQ]: Instruction Queue 220 | *[LB]: Load Buffer 221 | *[LSQ]: Load-Store Queue 222 | *[LW]: Load Word 223 | *[OOO]: Out Of Order 224 | *[PC]: Program Counter 225 | *[RAT]: Register Allocation Table (Register Alias Table) 226 | *[RAW]: Read-After-Write 227 | *[ROB]: ReOrder Buffer 228 | *[SB]: Store Buffer 229 | *[SW]: Store Word 230 | *[WAR]: Write-After-Read 231 | *[WAW]: Write-After-Write 232 | *[RAR]: Read-After-Read -------------------------------------------------------------------------------- /course-information.md: -------------------------------------------------------------------------------- 1 | --- 2 | id: course-information 3 | title: Course Information 4 | sidebar_label: Course Information 5 | --- 6 | 7 | ## Are you ready for this course? 8 | 9 | Complete the prerequisite check at [🔗HPC0](https://classroom.udacity.com/courses/ud219/) 10 | 11 | ## Textbook 12 | There are no required readings. For personal knowledge, a recommended textbook is **Computer Architecture** by Hennessy and Patterson. [🔗This textbook is available electronically via the GT Library](https://ebookcentral-proquest-com.prx.library.gatech.edu/lib/gatech/detail.action?docID=787253). 
13 | 14 | ## Sample Tests 15 | ### Sample Midterms 16 | * [Midterm 1 (without solutions)](https://www.udacity.com/wiki/hpca/SampleMidterms/Midterm1) 17 | * [Midterm 2 (with solutions)](https://www.udacity.com/wiki/hpca/SampleMidterms/Midterm2) 18 | 19 | ### Sample Finals 20 | * [Final 1 (without solutions)](https://www.udacity.com/wiki/hpca/sample-final/samplefinal1) 21 | * [Final 2 (with solutions)](https://www.udacity.com/wiki/hpca/sample-final/samplefinal2) 22 | 23 | ## Problem Set Solutions 24 | * [Metrics and Evaluation Problem Set Solutions](https://www.udacity.com/wiki/hpca/Problem_Set_Solutions/MetricsAndEval) 25 | * [Pipelining Problem Set Solutions](https://www.udacity.com/wiki/hpca/Problem_Set_Solutions/Pipelining) 26 | * [Branches Problem Set Solutions](https://www.udacity.com/wiki/hpca/Problem_Set_Solutions/Branches) 27 | * [Predication Problem Set Solutions](https://www.udacity.com/wiki/hpca/Problem_Set_Solutions/Predication) (links broken) 28 | * [Problem 1](https://www.udacity.com/wiki/hpca/problem-set-solutions/predication/problem-1) 29 | * [Problems 2-6](https://www.udacity.com/wiki/hpca/problem-set-solutions/predication/problem-2) 30 | * [Problems 7-9](https://www.udacity.com/wiki/hpca/problem-set-solutions/predication/problem-3) 31 | * [Problems 10-14](https://www.udacity.com/wiki/hpca/problem-set-solutions/predication/problem-4) 32 | * [Problems 15-19](https://www.udacity.com/wiki/hpca/problem-set-solutions/predication/problem-5) 33 | * [ILP Problem Set Solutions](https://www.udacity.com/wiki/hpca/Problem_Set_Solutions/ILP) 34 | * [Instruction Scheduling Problem Set Solutions](https://www.udacity.com/wiki/hpca/Problem_Set_Solutions/Instruction_Scheduling) 35 | * [ReOrder Buffer Problem Set Solutions](https://www.udacity.com/wiki/hpca/Problem_Set_Solutions/ROB) 36 | * [Interrupts & Exceptions Problem Set Solutions](https://www.udacity.com/wiki/hpca/Problem_Set_Solutions/Interrupts_Exceptions) 37 | * [Virtual Memory Problem Set Solutions](https://www.udacity.com/wiki/hpca/Problem_Set_Solutions/VirtualMemory) 38 | * [Advanced Caches Problem Set Solutions](https://www.udacity.com/wiki/hpca/Problem_Set_Solutions/Advanced_Caches) 39 | * [Memory Problem Set Solutions](https://www.udacity.com/wiki/hpca/Problem_Set_Solutions/Memory) 40 | * [Multiprocessing Problem Set Solutions](https://www.udacity.com/wiki/hpca/Problem_Set_Solutions/Mulitprocessing) 41 | * [Memory Consistency Problem Set Solutions](https://www.udacity.com/wiki/hpca/problem-set-solutions/memory-consistency) 42 | 43 | ## Additional Resources 44 | 45 | * [Assembly Language Programming](https://www.udacity.com/wiki/hpca/assemblyLanguageProgramming) 46 | * [Linux/Unix Commands](https://www.udacity.com/wiki/hpca/reviewLinuxCommands) 47 | * [C++](https://www.udacity.com/wiki/hpca/reviewC++) 48 | * [Glossary](https://www.udacity.com/wiki/hpca/glossary/Glossary) 49 | 50 | ## External Lectures and Materials 51 | 52 | These are some links I personally found helpful to explain some of the course concepts. 53 | 54 | * Branch Handling and Branch Prediction - CMU Computer Architecture (Prof. 
Onur Mutlu) 55 | * [2014 Video](https://www.youtube.com/watch?v=06OAhsPL-1k), [2013 Video](https://www.youtube.com/watch?v=XkerLktFtJg), [Slides](http://course.ece.cmu.edu/~ece447/s13/lib/exe/fetch.php?media=onur-447-spring13-lecture11-branch-prediction-afterlecture.pdf) 56 | 57 | 58 | 59 | -------------------------------------------------------------------------------- /fault-tolerance.md: -------------------------------------------------------------------------------- 1 | --- 2 | id: fault-tolerance 3 | title: Fault Tolerance 4 | sidebar_label: Fault Tolerance 5 | --- 6 | 7 | [🔗Lecture on Udacity (1.5hr)](https://classroom.udacity.com/courses/ud007/lessons/872590122/concepts/last-viewed) 8 | 9 | ## Dependability 10 | Quality of delivered service that justifies relying on the system to provide that service. 11 | - Specified Service = what behavior should be 12 | - Delivered Service = actual behavior 13 | 14 | System has components (modules) 15 | - Each module has an ideal specified behavior 16 | 17 | ## Faults, Errors, and Failures (+ Example) 18 | * **Fault** - module deviates from specified behavior 19 | * Example: Programming mistake 20 | * Add function that works fine, except 5+3 = 7 21 | * Latent Error (only a matter of time until activated) 22 | * **Error** - actual behavior within system differs from specified behavior 23 | * Example: (Activated Fault, Effective Error) 24 | * We call add() with 5 and 3, get 7, and put it in some variable 25 | * **Failure** - System Deviates from specified behavior 26 | * Example: 5+3=7 -> Schedule a meeting for 7am instead of 8am 27 | 28 | An error starts with a fault, but a fault may not necessarily become an error. A failure starts with an error, but an error may not necessarily result in a failure. Example: `if(add(5, 3))>0)` causes an error since the fault is activated, but not a failure since the end result is not affected. 29 | 30 | ## Reliability, Availability 31 | System is in one of two states: 32 | * Service Accomplishment 33 | * Service Interruption 34 | 35 | Reliability: 36 | * Measure continuous Service Accomplishment 37 | * Mean Time to Failure (MTTF) 38 | 39 | Availability: 40 | * Service Accomplishment as a fraction of overall time 41 | * Need to know: Mean Time to Repair (MTTR) 42 | * Availability = \\(\frac{MTTF}{MTTF+MTTR}\\) 43 | 44 | ## Kinds of Faults 45 | By Cause: 46 | * HW Faults - HW fails to perform as designed 47 | * Design Faults - SW bugs, HW design mistakes (FDIV bug) 48 | * Operation Faults - Operator, user mistakes 49 | * Environmental Faults - Fire, power failures, sabotage, etc. 50 | 51 | By Duration: 52 | * Permanent - Once we have it, it doesn't get corrected (wanted to see what's inside processor, and now it's in 4 pieces) 53 | * Intermittent - Last for a while, but recurring (overclock - temporary crashes) 54 | * Transient - Something causes things to stop working correctly, but it fixes itself eventually 55 | 56 | ## Improving Reliability and Availability 57 | * Fault Avoidance 58 | * Prevent Faults from occurring at all 59 | * Example: No coffee in server room 60 | * Fault Tolerance 61 | * Prevent Faults from becoming Failures 62 | * Example: Redundancy, e.g. 
ECC for memory 63 | * Speed up Repair Process (availability only) 64 | * Example: Keep a spare hard disk in drawer 65 | 66 | ## Fault Tolerance Techniques 67 | * Checkpointing (Recover from error) 68 | * Save state periodically 69 | * If we detect errors, restore the saved state 70 | * Works well for many transient and intermittent failures 71 | * If this takes too long, it has to be treated like a service interruption 72 | * 2-Way Redundancy (Detect error) 73 | * Two modules do the same work and compare results 74 | * Roll back if results are different 75 | * 3-Way Redundancy (Detect and recover from error) 76 | * 3 modules (or more) do the same work and vote on correctness 77 | * Fault in one module does not become an error overall. 78 | * Expensive - 3x the hardware required, but can tolerate *any* fault in one module 79 | 80 | ## N-Module Redundancy 81 | * N=2 - Dual-Module Redundancy 82 | * Detect but not correct one faulty module 83 | * N=3 - Triple-Module Redundancy 84 | * Correct one faulty module 85 | * N=5 - (example: space shuttle) 86 | * 5 computers perform operation and vote 87 | * 1 Wrong Result \\(\Rightarrow\\) normal operation 88 | * 2 Wrong Results \\(\Rightarrow\\) abort mission 89 | * Still no failure from this: 3 outvote the 2 90 | * 3 Wrong Results: failure can be catastrophic (too many broken modules) 91 | * Abort with 2 failures so that this state should never be reached 92 | 93 | ## Fault Tolerance for Memory and Storage 94 | * Dual/Triple Module Redundancy - Overkill (typically better for computation) 95 | * Error Detection, Correction Codes 96 | * Parity: One extra bit (XOR of all data bits) 97 | * Fault flips one bit \\(\Rightarrow\\) Parity does not match data 98 | * ECC: example - SECDED (Single Error Correction, Double Error Detection) 99 | * Can detect and fix any single bit flip, or can only detect any dual bit flip 100 | * Example: ECC DRAM modules 101 | * Disks can use even fancier codes (e.g. Reed-Solomon) 102 | * Detect and correct multiple-bit errors (especially streaks of flipped bits) 103 | * RAID (for hard disks) 104 | 105 | ## RAID - Redundant Array of Independent Disks 106 | * Several disks playing the role of one disk (can be larger and/or more reliable than the single disk) 107 | * Each disk detects errors using codes 108 | * We know which disk has the error 109 | * RAID should have: 110 | * Better performance 111 | * Normal Read/Write accomplishment even when: 112 | * It has a bad sector 113 | * Entire disk fails 114 | * RAID 0, 1, etc... 115 | 116 | ### RAID 0: Striping (to improve performance) 117 | Disks can only read one track at a time, since the head can be in only one position. RAID 0 takes two disks and "stripes" the data across each disk such that consecutive tracks can be accessed simultaneously with the head in a single position. This results in up to 2x the data throughput and reduced queuing delay. 
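A minimal sketch of the striping math, assuming simple block-level striping across `num_disks` disks (the struct and names are illustrative, not from the lecture):

```cpp
#include <cstdint>

// RAID 0: logical block i lives on disk (i % N) at offset (i / N).
// Consecutive logical blocks land on different disks, which is why a
// sequential access stream can be serviced by the disks in parallel.
struct Raid0Mapping {
    unsigned num_disks;

    struct Location { unsigned disk; uint64_t block_on_disk; };

    Location locate(uint64_t logical_block) const {
        return { static_cast<unsigned>(logical_block % num_disks),
                 logical_block / num_disks };
    }
};

// Example: with 2 disks, logical blocks 0,2,4,... sit on disk 0 and
// blocks 1,3,5,... on disk 1, so a streaming read keeps both disks busy.
```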
118 |
119 | However, reliability is worse than a single disk:
120 | * \\(f\\) = failure rate for a single disk
121 | * failures/disk/second
122 | * Single-Disk MTTF = \\(\frac{1}{f}\\) (MTTDL: Mean time to data loss = \\(MTTF_1\\))
123 | * N disks in RAID0
124 | * \\(f_N = N*f_1 \Rightarrow MTTF_N = MTTDL_N = \frac{MTTF_1}{N}\\)
125 | * 2 Disks \\( \Rightarrow MTTF_2 = \frac{MTTF_1}{2} \\)
126 |
127 | ### RAID 1: Mirroring (to improve reliability)
128 | Same data on both disks
129 | * Write: Write to each disk
130 | * Same performance as 1 disk alone
131 | * Read: Read any one disk
132 | * 2x throughput of one disk alone
133 | * Can tolerate any faults that affect one disk
134 | * Two copies -> can we only detect (not correct) an error?
135 | * No - the ECC on each sector tells us which disk has the fault, so the good copy can be used
136 |
137 | Reliability:
138 | * \\(f\\) = failure rate for a single disk
139 | * failures/disk/second
140 | * Single-Disk MTTF = \\(\frac{1}{f}\\) (MTTDL: Mean time to data loss = \\(MTTF_1\\))
141 | * 2 disks in RAID1
142 | * \\(f_N = N*f_1 \Rightarrow\\)
143 | * both disks OK until \\(\frac{MTTF_1}{2}\\)
144 | * remaining disk lives on for \\(MTTF_1\\) time
145 | * \\(MTTDL_{RAID1-2} = \frac{MTTF_1}{2}+MTTF_1\\) (Assumes no disk replaced)
146 | * But we do replace failed disks!
147 | * Both disks ok until \\(\frac{MTTF_1}{2}\\)
148 | * Disk fails, have one OK disk for \\(MTTR_1\\)
149 | * Both disks ok again until \\(\frac{MTTF_1}{2}\\)
150 | * So, overall MTTDL :
151 | * (when \\(MTTR_1 \ll MTTF_1\\), probability of second disk failing during MTTR = \\(\frac{MTTR_1}{MTTF_1}\\))
152 | * \\(MTTDL_{RAID1-2} = \frac{MTTF_1}{2} * \frac{MTTF_1}{MTTR_1}\\) (second factor is 1/probability)
153 |
154 | ### RAID 4: Block-Interleaved Parity
155 | * N disks
156 | * N-1 contain data, striped like RAID 0
157 | * 1 disk has parity blocks
158 |
159 | | Disk 0 | | Disk 1 | | Disk 2 | | Disk 3 |
160 | |----------|---|-----------|---|-----------|---|-------------|
161 | | Stripe 0 | ⊕ | Stripe 1 | ⊕ | Stripe 2 | = | Parity 0,1,2 |
162 | | Stripe 3 | ⊕ | Stripe 4 | ⊕ | Stripe 5 | = | Parity 3,4,5 |
163 |
164 | Data from each stripe is XOR'd together to result in parity information on disk 3. If any one of the disks fails, its data can then be reconstructed by XOR-ing all the other disks, including parity.
165 |
166 | * Write: Write 1 data disk and parity disk read/write
167 | * Read: Read 1 disk
168 |
169 | Performance and Reliability:
170 | * Reads: throughput of N-1 disks
171 | * Writes: 1/2 throughput of single disk (primary reason for RAID 5)
172 | * MTTF:
173 | * All disks ok for \\(\frac{MTTF_1}{N}\\)
174 | * If no repair, we are left with an N-1 disk array: + \\(\frac{MTTF_1}{N-1} \rightarrow\\) Bad idea
175 | * Repair \\(\Rightarrow\\) chance of a second disk failing during the repair is \\(\frac{(N-1)\*MTTR_1}{MTTF_1}\\); multiply by the inverse of this probability
176 | * \\(MTTF_{RAID4} = \frac{MTTF_1 \* MTTF_1}{N\*(N-1)\*MTTR_1}\\)
177 |
178 | Writes: [🎥 View Lecture Video (2:19)](https://www.youtube.com/watch?v=3QXaSzM2fE8)
179 | - If we compute the XOR of the old vs new data we are writing, we get the bit flips we're making to the data. If we then XOR this against the parity block, we perform those same bit flips on the parity data and get the new parity information.
180 | - Thus, the parity disk is a bottleneck for writes (multiple disks may be updating it) \\(\Rightarrow\\) RAID5 181 | 182 | ### RAID 5: Distributed Block-Interleaved Parity 183 | * Like RAID 4, but parity is spread among all disks: 184 | | Disk 0 | | Disk 1 | | Disk 2 | | Disk 3 | 185 | |----------|---|-----------|---|-----------|---|-------------| 186 | | Stripe 0 | ⊕ | Stripe 1 | ⊕ | Stripe 2 | = | Parity 0,1,2 | 187 | | Parity 3,4,5 | = | Stripe 3 | ⊕ | Stripe 4 | ⊕ | Stripe 5 | 188 | | Stripe 6 | | Parity 6,7,8 | | Stripe 7 | | Stripe 8 | 189 | | Stripe 9 | | Stripe 10 | | Parity 9,10,11 | | Stripe 11| 190 | | Stripe 12 | ⊕ | Stripe 13 | ⊕ | Stripe 14 | = | Parity 12,13,14 | 191 | * Read Performance: N * throughput of 1 disk 192 | * Write Performance: 4 accesses/write, but distributed: N/4 * throughput of 1 193 | * Reliability: same as RAID4 - if we lose any one disk, we have a problem 194 | * Capacity: same as RAID4 (still sacrifice one disk worth of parity over the array) 195 | 196 | Key takeaway: We should always choose RAID5 over RAID4, since we lose nothing with reliability or capacity, and gain in throughput, all without additional hardware requirements. 197 | 198 | ### RAID 6? 199 | * Two "parity" blocks per group 200 | * Can work when 2 failed stripes/group 201 | * One parity block 202 | * Second is a different type of check-block 203 | * When 1 disk fails, use parity 204 | * When 2 disks fail, solve equations to retrieve data 205 | * RAID 5 vs. RAID 6 206 | * 2x overhead 207 | * More write overhead (6/WR vs 4/WR) 208 | * Only helps reliability when disk fails, then another fails before we replace the first one (low probability, thousands of years MTTDL) 209 | 210 | RAID6 is an overkill? 211 | * RAID5: disk fails, 3 days to replace 212 | * Very low probability of another failing in those 3 days (assuming independent failure) 213 | * Failures can be related! 214 | * RAID5, 5 disks, 1 disk fails (#2) 215 | * System says "Replace disk #2" 216 | * Operator gets replacement disk and inserts it into spot 2 217 | * But... numbering was 0,1,2,3,4! Operator pulled the wrong disk. 218 | * Now we have two failed disks (one hard failure, one operator error) 219 | * RAID6 prevents both a single or double disk failure. 220 | 221 | The above scenario sounds silly, but with very long MTTF values, replacing a RAID disk becomes a rare activity, and is prone to operator error. A real-life personal scenario I (author of these notes) encountered was not properly grounding myself when replacing a disk in a hot-swappable RAID5. When inserting it into the housing, my theory is that I discharged static electricity into the disk below the one I was replacing, and the array failed to rebuild due to an error on that disk. Thankfully we were able to move the platters from that drive into a new one and the array rebuilt ok, avoiding the need to use week-old backups. So, RAID6 is indeed a bit overkill, but could have prevented this scenario, which resulted in lots of system downtime. 
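To tie the RAID 4/5 write path described above to code, here is a small C++ sketch of the parity update (new parity = old parity XOR old data XOR new data); the function name and signature are illustrative assumptions. This is exactly why a small write costs four disk accesses: read old data, read old parity, write new data, write new parity.

```cpp
#include <cstdint>
#include <vector>

// Update one data block in a RAID 4/5 stripe without touching the other
// data disks: new_parity = old_parity ^ old_data ^ new_data.
// Assumes all three blocks have the same size.
std::vector<uint8_t> updateParity(const std::vector<uint8_t>& old_parity,
                                  const std::vector<uint8_t>& old_data,
                                  const std::vector<uint8_t>& new_data) {
    std::vector<uint8_t> new_parity(old_parity.size());
    for (size_t i = 0; i < new_parity.size(); ++i) {
        // (old_data ^ new_data) is exactly the set of bit flips being made;
        // applying the same flips to the parity keeps the stripe consistent.
        new_parity[i] = old_parity[i] ^ old_data[i] ^ new_data[i];
    }
    return new_parity;
}

// If one disk is lost, any missing block can be rebuilt by XOR-ing the
// corresponding blocks of all surviving disks, including the parity disk.
```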
222 | 223 | 224 | *[MTTDL]: Mean Time to Data Loss 225 | *[MTTF]: Mean Time to Failure 226 | *[MTTR]: Mean Time to Repair -------------------------------------------------------------------------------- /ilp.md: -------------------------------------------------------------------------------- 1 | --- 2 | id: ilp 3 | title: ILP 4 | sidebar_label: ILP 5 | --- 6 | 7 | [🔗Lecture on Udacity (54 min)](https://classroom.udacity.com/courses/ud007/lessons/3615429333/concepts/last-viewed) 8 | 9 | ## ILP All in the Same Cycle 10 | Ideal scenario is all instructions executing during the same cycle. This may work sometimes for some instructions, but typically there will be dependencies that prevent this, as below. 11 | 12 | ![All instructions in same cycle](https://i.imgur.com/tQk5TyF.png) 13 | 14 | ## The Execute Stage 15 | 16 | Can forwarding help? In the previous example, Inst 1 would be able to forward the result to Instruction 2 in the next cycle, but not during the same cycle. Instruction 2 would need to be stalled until the next cycle. But, if Instructions 3-5 do not have dependencies, they can continue executing during the current cycle. 17 | 18 | ## RAW Dependencies 19 | Even the ideal processor that can execute any number of instructions per cycle still has to obey RAW dependencies, which creates delays and affects overall CPI. So, ideal CPI can never be 0. For example, for instructions 1, 2, 3, 4, and 5, where there is a dependency between 1-2, 3-4, and 4-5, it would take 3 cycles (cycle 1 executes 1 and 3, cycle 2 executes 2 and 4, cycle 3 executes 5), for a total CPI of 3/5 = 0.60. If every instruction had a dependency with the next, you can't do anything in parallel and the minimum CPI is 1. 20 | 21 | ## WAW Dependencies 22 | 23 | In the below example, the second instructions has a data dependency on the first, and gets delayed one cycle. Meanwhile, all other cycles do not have any dependencies and can also be executed in the first cycle. However, we see in the last instruction that R4 could be written, but then due to the delay in instruction 2, could be overwritten. This out-of-order instruction could result in the final value of R4 not being what is expected based on the program. Thus the processor needs a way to find this dependency and delay the last instruction enough cycles to avoid the write issue. 24 | 25 | ![WAW Dependencies](https://i.imgur.com/q9akgJp.png) 26 | 27 | ## Removing False Dependencies 28 | 29 | RAW dependencies are "true" dependencies - one instruction truly depends on data from another and the dependency must be obeyed. WAR and WAW dependencies are "false" (name) dependencies. They are named this because there is nothing fundamental about them - they are the result of using the same register to store results. If the second instruction used a different register value, there would be no dependency. 30 | 31 | ## Duplicating Register Values 32 | 33 | In the case of a false dependency, you could simply duplicate registers by using separate versions of them. In the below example, you can store two versions of R4 - one on the 2nd instruction, and another on the 4th, and we remember both. The dependency can be resolved when the future instruction that uses R4 can "search" through those past versions and select the most recent. 34 | 35 | ![Duplicating Register Values](https://i.imgur.com/WtpuH48.png) 36 | 37 | Likewise, instruction 3 must also search among "versions" of R4 in instructions 2 and 4 and determine the version it needs is from instruction 2. 
This is possible and correct, but keeping multiple version is very complicated. 38 | 39 | ## Register Renaming 40 | 41 | Register renaming separates registers into two types: 42 | - Architectural = registers that programmer/compiler use 43 | - Physical = all places value can actually go within the processor 44 | 45 | As the processor fetches and decodes instructions, it "rewrites" the program to use physical registers. This requires a table called the Register Allocation Table (RAT). This table says which physical register has a value for which architectural register. 46 | 47 | ### RAT Example 48 | ![RAT Example](https://i.imgur.com/foUlLDD.png) 49 | 50 | ## False Dependencies After Renaming? 51 | 52 | In the below example, you can see the true dependencies in purple, and the output/anti dependencies in green. In our renamed program, only the true dependencies remain. This also results in a much lower CPI. 53 | 54 | ![False Dependencies After Renaming](https://i.imgur.com/LcbgTrG.png) 55 | 56 | ## Instruction Level Parallelism (ILP) 57 | 58 | ILP is the IPC when: 59 | - Processor does entire instruction in 1 cycle 60 | - Processor can do any number of instructions in the same cycle 61 | - Has to obey True Dependencies 62 | 63 | So, ILP is really what an ideal processor can do with a program, subject only to obeying true dependencies. ILP is a property of a ***program***, not of a processor. 64 | 65 | Steps to get ILP: 66 | 1. Rename Registers - use RAT 67 | 2. "Execute" - ensure no false dependencies, determine when instructions are executed 68 | 3. Calculate ILP = (\# instructions)/(\# cycles) 69 | 70 | ### ILP Example 71 | 72 | Tips: 73 | 1. You don't have to first do renaming, just pay attention to true dependencies, and trust renaming to handle false dependencies. 74 | 2. Be mindful to count how many cycles you're computing over 75 | 3. Make sure you're dividing the right direction (instructions/cycles) 76 | 77 | ![ILP Example](https://i.imgur.com/y1RdLrg.png) 78 | 79 | ## ILP with Structural and Control Dependencies 80 | 81 | When computing ILP we only consider true dependencies, not false dependencies. But what about structural and control dependencies? 82 | 83 | When considering ILP, there are no structural dependencies. Those are caused by lack of hardware parallelism. ILP assumes ideal hardware - any instructions that can possibly compute in one cycle will do so. 84 | 85 | For control dependencies, we assume perfect same-cycle branch prediction (even predicted before it is executed). For example, below we see that the branch is predicted at the point of program load, such that the label instruction will execute in the first cycle. 86 | 87 | | | 1 | 2 | 3 | 88 | |------------------|:---:|:---:|:---:| 89 | | `ADD R1, R2, R3` | x | | | 90 | | `MUL R1, R1, R1` | | x | | 91 | | `BNE R5, R1, Label` | | | x | 92 | | ... | | | | 93 | | `Label:`
`MUL R5, R7, R8` | x | | | 94 | 95 | ## ILP vs IPC 96 | 97 | ILP is not equal to IPC except on a perfect/ideal out-of-order processor. So IPC should be computed based on the properties of the processor that it is run on, as seen below (note: consider the IPC was calculated ignoring the "issue" property). 98 | 99 | ![ILP vs IPC](https://i.imgur.com/m8vSTGJ.png) 100 | 101 | Therefore, we can state ILP \\(\geq\\) IPC, as ILP is calculated using no processor limitations. 102 | 103 | ### ILP and IPC Discussion 104 | 105 | The following are considerations when thinking about effect of processor issue width and order to maximize IPC. 106 | 107 | ![ILP and IPC discussion](https://i.imgur.com/F5utKaZ.png) 108 | 109 | 110 | *[ALU]: Arithmetic Logic Unit 111 | *[CPI]: Cycles Per Instruction 112 | *[ILP]: Instruction-Level Parallelism 113 | *[IPC]: Instructions per Cycle 114 | *[PC]: Program Counter 115 | *[RAT]: Register Allocation Table 116 | *[RAW]: Read-After-Write 117 | *[WAR]: Write-After-Read 118 | *[WAW]: Write-After-Write 119 | *[RAR]: Read-After-Read -------------------------------------------------------------------------------- /instruction-scheduling.md: -------------------------------------------------------------------------------- 1 | --- 2 | id: instruction-scheduling 3 | title: Instruction Scheduling 4 | sidebar_label: Instruction Scheduling 5 | --- 6 | 7 | [🔗Lecture on Udacity (1.5 hr)](https://classroom.udacity.com/courses/ud007/lessons/3643658790/concepts/last-viewed) 8 | 9 | ## Improving IPC 10 | * ILP can be good, >> 1. 11 | 12 | * To get IPC \\(\approx\\) ILP, need to handle control dependencies (=> Branch Prediction). 13 | 14 | * Need to eliminate WAR and WAW data dependencies (=> Register Renaming). 15 | 16 | * Handle RAW data dependencies (=> Out of Order Execution) 17 | 18 | * Structural Dependencies (=> Invest in Wider Issue) 19 | 20 | ## Tomasulo's Algorithm 21 | 22 | Originall used in IBM 360, determines which instructions have inputs ready or not. Includes a form of Register Renaming, and is very similar to what is used today. 23 | 24 | | Original Tomasulo | What is used today | 25 | | --- | --- | 26 | | Floating Point instructions | All instruction types | 27 | | Fewer instructions in "window" | Hundreds of instructions considered | 28 | | No exception handling | Explicit exception handling support | 29 | 30 | ### The Picture 31 | 32 | ![Tomasulo's Algorithm](https://i.imgur.com/MuCQEgr.png) 33 | 34 | Instructions come from the Fetch unit in order and put into Instruction Queue (IQ). The next available instruction from the IQ is put into a Reservation Station (RS), where they sit until parameters become ready. There is a floating point register file (REGS), and these registers are put into the RS when they are ready. 35 | 36 | When an instruction is ready, it goes into the execution unit. Once the result is available it is broadcast onto a bus. This bus goes to the register file and also the RS (so they will be available when needed). Notice multiple inputs from this bus on the RS - this is because an instruction may need the value to be latched into both operands. 37 | 38 | Finally, if the instruction deals with memory (load/store), it will go to the address generation unit (ADDR) for the address to be computed. It will then go to the Load Buffer (LB, provides just the address A) or Store Buffer (SB, provides both the data D and the address A). 
When it comes back from memory (MEM), its value is also broadcast on the bus, which also feeds into the SB (so stores can get their values when available). 39 | 40 | Finally, the part where we send the instruction from the fetch to the memory or execution unit is called the "Issue". The place where the instruction is finally sent from the RS to execution is called the "Dispatch". And when the instruction is ready to broadcast its result, it's called a "Write Result" or "Broadcast". 41 | 42 | ## Issue 43 | 44 | 1. Take next (in program order) instruction from IQ 45 | 2. Determine where inputs come from (probably a RAT here) 46 | - register file 47 | - another instruction 48 | 3. Get free/available RS (of the correct kind) 49 | - If all RS are busy, we simply don't issue anything this cycle. 50 | 4. Put instruction in RS 51 | 5. Tag destination register of the instruction 52 | 53 | ### Issue Example 54 | 55 | [🎥 Link to Video](https://www.youtube.com/watch?v=I2qMY0XvYHA) (best seen and not read) 56 | 57 | ![Issue Example](https://i.imgur.com/JdepPAx.png) 58 | 59 | ## Dispatch 60 | 61 | [🎥 Link to Video](https://www.youtube.com/watch?v=bEB7sZTP8zc) 62 | 63 | As RAT values from other instructions become available on the result broadcast bus, instructions are moved from the RS to the actual execution unit (e.g. ADD, MUL) to be executed. 64 | 65 | ### Dispatch - >1 ready 66 | 67 | What happens if more than one instruction in the RS is ready to execute at the same time? How should Dispatch choose which to execute? 68 | - Oldest first (easy to do, medium performance) 69 | - Most dependencies first (hard to do, best performance) 70 | - Random (easy to do, worst performance) 71 | 72 | ![Dispatch: >1 ready](https://i.imgur.com/uS0PRFt.png) 73 | 74 | All are fine with respect to correctness because of RAT, but if we do not know the future we cannot predict most dependencies, even if that would yield the best overall performance due to possibly freeing up more other instructions to execute. Oldest first is the best compromise with performance and ease of implementation. 75 | 76 | ## Write Result (Broadcast) 77 | 78 | Once the result is ready, it is put on the bus (with its RS tag). It is then used to update the register file and RAT, and then the RS is freed for the next instruction to use. 79 | 80 | ![Write Result](https://i.imgur.com/2lrRoFf.png) 81 | 82 | ### More than 1 Broadcast 83 | 84 | What if in the previous example, the ADD and MUL complete simultaneously - which one is broadcast first? 85 | 86 | Possibility 1: a separate bus for each unit. This allows both to be broadcast, but requires twice the comparators (every output from the bus now needs to consider both to select the correct tag). 87 | 88 | Possibility 2: Select a higher priority unit based on some heuristic. For example, if one unit is slower, give it higher priority on the WR bus since it's likely the instruction has been waiting longer. 89 | 90 | ### Broadcast "Stale" Result 91 | 92 | Consider a situation in which an instruction is ready to broadcast from RS4, but RS4 is nowhere in the RAT (perhaps overwritten by a new instruction that was both using and writing to the same register, now the RAT will reflect the new RS tag). What will happen during broadcast? 93 | 94 | When updating the RS, it will work normally. RS4 will be replaced by the broadcasted value. In the RAT and RF, we do nothing. It is clear that it will never be used by future instructions, only ones currently in the RS. 
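As a rough illustration of the Write Result rules above, including the "stale result" case, here is a small C++ sketch. The structure names, the table sizes, and the convention that a broadcast tag equals the reservation station's index are assumptions made only for this sketch, not details from the lecture.

```cpp
#include <array>
#include <optional>

struct Operand { std::optional<int> tag; double value = 0; };  // waiting on a tag, or holding a value
struct RS      { bool busy = false; Operand src1, src2; };

constexpr int NUM_RS = 6, NUM_REGS = 8;
std::array<RS, NUM_RS> rs;                        // reservation stations
std::array<std::optional<int>, NUM_REGS> rat;     // arch reg -> RS tag (empty = value is in regs[])
std::array<double, NUM_REGS> regs{};              // architectural register file

// Write Result: put (tag, value) on the broadcast bus.
void writeResult(int tag, double value) {
    // 1. Every RS captures the value for any operand still waiting on this tag.
    for (RS& e : rs) {
        if (e.busy && e.src1.tag == tag) { e.src1.value = value; e.src1.tag.reset(); }
        if (e.busy && e.src2.tag == tag) { e.src2.value = value; e.src2.tag.reset(); }
    }
    // 2. The RAT/register file update happens only if the RAT still points at this tag.
    //    A "stale" result (RAT already overwritten by a newer instruction) matches nothing here.
    for (int r = 0; r < NUM_REGS; r++) {
        if (rat[r] == tag) { regs[r] = value; rat[r].reset(); }
    }
    // 3. Free the reservation station for a future instruction.
    rs[tag].busy = false;
}
```

The asymmetry is the key point: waiting reservation stations always capture the broadcast, while the RAT and register file simply ignore a tag they no longer reference.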
95 | 96 | ## Tomasulo's Algorithm - Review 97 | 98 | ![Tomasulo's Algorithm - Review](https://i.imgur.com/tsOgTRq.png) 99 | 100 | The key is that for each instruction it will follow the steps (issue, capture, dispatch, WR). Also each cycle, some instruction will be issued, another instruction will be captured, another dispatched, another broadcasting. 101 | 102 | Because all of these things happen every cycle, we need to consider some things: 103 | 1. Can we do same-cycle issue->dispatch? No - on an issue we are writing to the RS, and that instruction is not yet recognizable as a dispatchable instruction. 104 | 2. Can we do same-cycle capture->dispatch? Typically No - on a capture the RS updates its status from "operands missing" to "operands available". But this is technically possible with more hardware. 105 | 3. Can we update RAT entry for both Issue and WR on same cycle? Yes - we just need to ensure that the one being issued is the end result of the RAT. The WR is only trying to point other instructions to the right register, which will also be updated this cycle. 106 | 107 | ## Load and Store Instructions 108 | 109 | As we had dependencies with registers, we have dependencies through memory that must be obeyed or eliminated. 110 | - RAW: SW to address, then LW from address 111 | - WAR: LW then SW 112 | - WAW: SW, SW to same address 113 | 114 | What do we do in Tomasulo's Algorithm? 115 | - Do loads and stores in-order 116 | - Identify dependencies, reorder, etc. - more complicated than with registers, so Tomasulo chose not to do this. 117 | 118 | ## Tomasulo's Algorithm - Long Example 119 | _These examples are best viewed as videos, so links are below..._ 120 | 121 | 1. [🎥 Introduction](https://www.youtube.com/watch?v=2M5NQFAaILk) 122 | 2. [🎥 Cycles 1-2](https://www.youtube.com/watch?v=GC8Cp-M0o6Q) 123 | 3. [🎥 Cycles 3-4](https://www.youtube.com/watch?v=G0Kap6eq_Ys) 124 | 4. [🎥 Cycles 5-6](https://www.youtube.com/watch?v=I1VOoFhrnio) 125 | 5. [🎥 Cycles 7-9](https://www.youtube.com/watch?v=wmrPTJpUnV4) 126 | 6. [🎥 Cycles 10-end](https://www.youtube.com/watch?v=ZQ6Tdrs16_U) 127 | 7. [🎥 Timing Example](https://www.youtube.com/watch?v=ZqbhHjFSBoI) 128 | 129 | ## Additional Resources 130 | 131 | From TA Nolan, here is a list of things that can prevent a CPU from moving forward with an instruction. 132 | ``` 133 | Issue: 134 | Instructions must be issued in order. 135 | Only a certain number of instructions can be issued in one cycle. 136 | An RS entry of the right type must be available. 137 | An ROB entry must be available. 138 | 139 | Dispatch: 140 | The RS must have actual values for each operand. 141 | An ALU or processing unit must be available. 142 | Only a certain number of instructions from each RS can be dispatched in one cycle. 143 | 144 | Execution: 145 | No limitations. 146 | 147 | Broadcast: 148 | Only a certain number of instructions may broadcast in the same cycle. 149 | 150 | Commit: 151 | Instructions must be committed in order. 152 | Only a certain number of instructions can be committed in one cycle. 153 | ``` 154 | 155 | Also from TA Nolan, an attempt to document how Tomasulo works 156 | 157 | ``` 158 | While there is an instruction to issue 159 | If there is an empty appropriate RS entry 160 | Put opcode into RS entry. 161 | For each operand 162 | If there is an RS number in the RAT 163 | Put the RS number into the RS as an operand. 164 | else 165 | Put the register value into the RS as an operand. 166 | Put RS number into RAT entry for the destination register. 
167 | Take the instruction out of the instruction window. 168 | 169 | For each RS 170 | If RS has instruction with actual values for operands 171 | If the appropriate ALU or processing unit is free 172 | Dispatch the instruction, including operands and the RS number. 173 | 174 | For each ALU 175 | If the instruction is complete 176 | If a RAT entry has # 177 | Put value in corresponding register. 178 | Erase RAT entry. 179 | For each RS waiting for it 180 | Put the result into the RS. 181 | Free the ALU. 182 | Free the RS. 183 | ``` 184 | 185 | Finally from TA Nolan, a worksheet to keep track of necessary information when approaching a problem with IS: 186 | ``` 187 | Same cycle: 188 | free RS & use RS (?) 189 | issue & dispatch (no) 190 | capture & dispatch (no) 191 | execute & broadcast (no) 192 | reuse ROB (no) 193 | 194 | # / cycle: 195 | issue (1) 196 | broadcast (1) 197 | 198 | Dispatch priority (oldest, random) 199 | 200 | # of universal RS's: 201 | 202 | ALU's: 203 | operation # RS's # ALU's Pipelined? 204 | 205 | Exe time: 206 | operation cycles 207 | 208 | broadcast priority (slower ALU) 209 | 210 | # of ROB's: 211 | ``` 212 | 213 | *[ALU]: Arithmetic Logic Unit 214 | *[CPI]: Cycles Per Instruction 215 | *[ILP]: Instruction-Level Parallelism 216 | *[IPC]: Instructions per Cycle 217 | *[IQ]: Instruction Queue 218 | *[LB]: Load Buffer 219 | *[LW]: Load Word 220 | *[PC]: Program Counter 221 | *[RAT]: Register Allocation Table (Register Alias Table) 222 | *[RAW]: Read-After-Write 223 | *[SB]: Store Buffer 224 | *[SW]: Store Word 225 | *[WAR]: Write-After-Read 226 | *[WAW]: Write-After-Write 227 | *[RAR]: Read-After-Read -------------------------------------------------------------------------------- /introduction.md: -------------------------------------------------------------------------------- 1 | --- 2 | id: introduction 3 | title: Introduction 4 | sidebar_label: Introduction 5 | --- 6 | 7 | [🔗Lecture on Udacity (39 min)](https://classroom.udacity.com/courses/ud007/lessons/3627649022/concepts/last-viewed) 8 | 9 | ## Computer Architecture 10 | 11 | ### What is it? 12 | 13 | Architecture - design a *building* that is well-suited for its purpose 14 | Computer Architecture - design a *computer* that is well-suited for its purpose. 15 | 16 | ### Why do we need it? 17 | 18 | 1. Improve Performance (e.g. speed, battery life, size, weight, energy effficiency, ...) 19 | 2. Improve Abilities (3D graphics, debugging support, security, ...) 20 | 21 | Takeaway: Computer Architecture takes available hardware (fabrication, circuit designs) to create faster, lighter, cheaper, etc. computers. 22 | 23 | ## Technology Trends 24 | 25 | If you design given current technology/parts, you get an obsolete computer by the time the design is complete. Must take into account technological trends and anticipate future technology 26 | 27 | ### Moore's Law 28 | 29 | Used to predict technology trends. Every 18-24 months, you get 2x transistors for same chip area: 30 | \\( \Rightarrow \\) Processor speed doubles 31 | \\( \Rightarrow \\) Energy/Operation cut in half 32 | \\( \Rightarrow \\) Memory capacity doubles 33 | 34 | ### Memory Wall 35 | 36 | Consequence of Moore's Law. IPS and Capacity double every 2 years. If Latency only improves 1.1x every two years, there is a gap between latency and speed/capacity, called the Memory Wall. Caches are used to fill in this gap. 
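As a rough back-of-the-envelope illustration (using the approximate growth rates quoted above), over 10 years speed and capacity double five times while latency improves by 1.1x five times, so the processor-memory gap widens by about

$$ \frac{2^{5}}{1.1^{5}} \approx \frac{32}{1.6} \approx 20\times $$

per decade, which is why ever larger and deeper cache hierarchies are needed to hide it.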
37 | 
38 | ![Memory Wall Graph](https://i.imgur.com/RMSndOW.png) 
39 | 
40 | ## Processor Speed, Cost, Power 
41 | 
42 | | | Speed | Cost | Power | 
43 | | ---| ----- | ---- | ---- | 
44 | | | 2x | 1x | 1x | 
45 | | OR | 1.1x | .5x | .5x | 
46 | 
47 | Improvements may differ - it's not always about speed. Which one of the above do you really want? It depends. 
48 | 
49 | ## Power Consumption 
50 | 
51 | Two kinds of power a processor consumes: 
52 | 1. Dynamic Power - consumed by activity in a circuit 
53 | 2. Static Power - consumed when powered on, but idle 
54 | 
55 | ### Active Power 
56 | Can be computed by 
57 | 
58 | $$P = \tfrac 12 \*C\*V^2\*f\*\alpha$$ 
59 | Where 
60 | $$\\\\C = \text{capacitance}\\\\V = \text{power supply voltage}\\\\f = \text{frequency} \\\\ \alpha = \text{activity factor (the fraction of transistors active each clock cycle)}$$ 
61 | 
62 | For example, if we cut size/capacitance in half but put two cores on a chip, the active power is relatively unchanged. To get real power improvements, you need to lower the voltage. 
63 | 
64 | ### Static Power 
65 | 
66 | The power it takes to maintain the circuits even when they are idle. 
67 | ![Static Power](https://i.imgur.com/Db7NwSj.png) 
68 | Decreasing the voltage reduces the electrical "pressure" on the transistors, which results in more static power (leakage). However, increasing the voltage means you consume more active power. There is a voltage at which total power (static + dynamic) is minimized. 
69 | 
70 | ## Fabrication Cost and Yield 
71 | 
72 | Circuits are printed onto a silicon wafer, which is divided up into individual chips that are then packaged. Each packaged chip is tested and either discarded or shipped/sold. Larger chips have more fallout because a defect anywhere in their larger wafer area ruins the chip, and thus they cost more. 
73 | 
74 | $$ \text{fabrication yield} = \tfrac{\text{working chips}}{\text{chips on wafer}} $$ 
75 | 
76 | ![Fabrication Yield](https://i.imgur.com/vIIzt0I.png) 
77 | 
78 | Fabrication cost example: If a wafer costs $5000 and has approximately 10 defects, how much does each chip cost in the following example?
Remember \\( \text{chip cost} = \tfrac{\text{wafer cost}}{\text{fabrication yield} } \\) 79 | 80 | | Size | Chips/Wafer | Yield | Cost | 81 | |------|-------------|-------|--------| 82 | |Small | 400 | 390 | $12.80 | 83 | |Large | 96 | 86 | $58.20 | 84 | | Huge | 20 | 11 | $454.55| 85 | 86 | Benefit from Moore's Law in two ways: 87 | * Smaller \\( \Rightarrow \\) much lower cost 88 | * Same Area \\( \Rightarrow \\) faster for same cost 89 | 90 | *[IPS]: Instructions per Second -------------------------------------------------------------------------------- /many-cores.md: -------------------------------------------------------------------------------- 1 | --- 2 | id: many-cores 3 | title: Many Cores 4 | sidebar_label: Many Cores 5 | --- 6 | 7 | [🔗Lecture on Udacity (45 min)](https://classroom.udacity.com/courses/ud007/lessons/913459012/concepts/last-viewed) 8 | 9 | ## Many-Core Challenges 10 | - Cores \\(\uparrow\\) \\(\Rightarrow\\) Coherence Traffic \\(\uparrow\\) 11 | - Writes to shared location \\(\Rightarrow\\) invalidations and misses 12 | - Cores \\(\uparrow\\) \\(\Rightarrow\\) more writes/sec \\(\Rightarrow\\) bus throughput \\(\uparrow\\) 13 | - Bus: one request at a time \\(\Rightarrow\\) bottleneck 14 | - We need: 15 | - Scalable on-chip network that allows traffic to grow with number of cores 16 | - Directory coherence (so we don't rely on bus to serialize everything) 17 | 18 | ## Network On Chip 19 | Consider a linear bus - as we add more cores, it increases the length of the bus (and thus may be slower), and there is also more traffic. It gets very bad with more cores. 20 | 21 | Now consider if the cores were arranged in a mesh network, where each core has some adjacent cores it communicates to. Any core can still communicate to any other core through this network, but potentially there can be many such communications going on at once, so the total throughput is much higher, several times what an individual link's throughput. 22 | 23 | As we continue to add cores, it simply increases the size of the mesh. While the amount of total traffic increases, so does the total number of links and thus the total throughput of the network. 24 | - Cores \\(\uparrow\\) \\(\Rightarrow\\) Links \\(\uparrow\\) \\(\Rightarrow\\) Available Throughput \\(\uparrow\\) 25 | - This is also good for chip building because the links don't intersect each other - it is somewhat flat and easily built in silicon 26 | 27 | Can build various network topologies: 28 | - Bus - single link connected to all cores 29 | - Mesh - each core is connected to adjacent cores 30 | - Torus - A mesh where "ends" are also linked to each other in both horizontal/vertical directions 31 | - More advanced or less "flat": Flattened Butterfly, Hypercube, etc. 32 | 33 | ## Many-Core Challenges 2 34 | - Cores \\(\uparrow\\) \\(\Rightarrow\\) Coherence Traffic \\(\uparrow\\) 35 | - Scalable on-chip network (e.g. a mesh) 36 | - Directory coherence 37 | - Cores \\(\uparrow\\) \\(\Rightarrow\\) Off-Chip Traffic \\(\uparrow\\) 38 | - \# cores \\(\uparrow\\) \\(\Rightarrow\\) \# of on-chip caches \\(\uparrow\\) 39 | - Misses/core same \\(\Rightarrow\\) cores \\(\uparrow\\) \\(\Rightarrow\\) mem requests \\(\uparrow\\) 40 | - \# of pins \\(\uparrow\\), but not \\(\approx\\) to \# of cores \\(\Rightarrow\\) bottleneck 41 | - Need to reduce # of memory requests/core 42 | - Last level cache (L3) shared, size \\(\uparrow \approx \\) # cores \\(\uparrow\\) 43 | - One big LLC \\(\Rightarrow\\) SLOW ... 
one "entry point" \\(\Rightarrow\\) Bottleneck (Bad) 44 | - Distributed LLC 45 | 46 | ## Distributed LLC 47 | - Logically a single cache (block is not replicated on each cache) 48 | - But sliced up so each tile (core + local caches) gets part of it 49 | - L3 size = # cores * L3 slice size 50 | - On an L2 miss, we must request block from correct L3 slice. How do we know which slice to ask? 51 | - Round-robin by cache index (last few bits of set number) 52 | - May not be good for locality 53 | - Round-robin by page # 54 | - OS can map pages to make accesses more local 55 | 56 | ## Many-Core Challenges 2 (again) 57 | - Cores \\(\uparrow\\) \\(\Rightarrow\\) Coherence Traffic \\(\uparrow\\) 58 | - Scalable on-chip network (e.g. a mesh) 59 | - Directory coherence 60 | - Cores \\(\uparrow\\) \\(\Rightarrow\\) Off-Chip Traffic \\(\uparrow\\) 61 | - Large Shared Distributed LLC 62 | - Coherence Directory too large (to fit on chip) 63 | - Entry for each memory block 64 | - Memory many GB \\(\Rightarrow\\) billions of entries? \\(\Rightarrow\\) can't fit 65 | 66 | ## On-Chip Directory 67 | - Home node? Same as LLC slice! (we'll be looking at that node anyway) 68 | - Entry for every memory block? No. 69 | - Partial directory 70 | - Directory has limited # of entries 71 | - Allocate entry only for blocks that have at least 1 presence bit set (only blocks that might be in at least one of the private caches) 72 | - If it's only in the LLC or memory, we don't need a directory entry (it would be all zeroes anyway) 73 | - Run out of directory entries? 74 | - Pick an entry to replace (LRU), say entry E 75 | - Invalidation to all tiles with Presence bit set 76 | - Remove entry E and put new entry there 77 | - This is a new type of cache miss, caused by directory replacement 78 | 79 | ## Many-Core Challenges 3 80 | - Cores \\(\uparrow\\) \\(\Rightarrow\\) Coherence Traffic \\(\uparrow\\) 81 | - Scalable on-chip network (e.g. a mesh) 82 | - Directory coherence 83 | - Cores \\(\uparrow\\) \\(\Rightarrow\\) Off-Chip Traffic \\(\uparrow\\) 84 | - Large Shared Distributed LLC 85 | - Coherence Directory too large (to fit on chip) 86 | - Distributed Partial Directory 87 | - Power budget split among cores 88 | - Cores \\(\uparrow\\) \\(\Rightarrow\\) W/core \\(\downarrow\\) \\(\Rightarrow\\) f and V \\(\downarrow\\) \\(\Rightarrow\\) 1 thread program is slower with more cores! 89 | 90 | ## Multi-Core Power and Performance 91 | ![multi core power and performance](https://i.imgur.com/cfb3TXo.png) 92 | 93 | Due to need to compensate for less power, each core is noticeably slower. 94 | 95 | ## No Parallelism \\(\Rightarrow\\) boost frequency 96 | - "Turbo" clocks when running on one core 97 | - Example: Intel's Core I7-4702MQ (Q2 2013) 98 | - Design Power: 37W 99 | - 4 cores, "Normal" clock 2.2GHz 100 | - "Turbo" clock 3.2GHz (1.45x normal \\(\Rightarrow\\) 3x power) 101 | - Why not 4x? It spreads more heat to other cores - 3x keeps it distributed to match normal 2.2GHz 102 | - Example: Intel's Core I7-4771 (Q3 2013) 103 | - Design Power: 84W 104 | - 4 cores, "Normal" clock 3.5GHz 105 | - "Turbo" clock 3.9GHz (1.11x normal \\(\Rightarrow\\) 1.38x power) 106 | - Meant for desktop, so can cool the chip more effectively at high power 107 | - But this means the chip already runs almost as hot as it can get, so we don't have much more room to increase power further 108 | 109 | ## Many-Core Challenges 4 110 | - Cores \\(\uparrow\\) \\(\Rightarrow\\) Coherence Traffic \\(\uparrow\\) 111 | - Scalable on-chip network (e.g. 
a mesh) 112 | - Directory coherence 113 | - Cores \\(\uparrow\\) \\(\Rightarrow\\) Off-Chip Traffic \\(\uparrow\\) 114 | - Large Shared Distributed LLC 115 | - Coherence Directory too large (to fit on chip) 116 | - Distributed Partial Directory 117 | - Power budget split among cores 118 | - "Turbo" when using only one core 119 | - OS Confusion 120 | - Multi-threading, cores, chips - all at same time! 121 | 122 | ## SMT, Cores, Chips, ... 123 | All combined: 124 | - Dual socket motherboard (two chips) 125 | - 4 cores on each chip 126 | - Each core 2-way SMT 127 | - 16 threads can run 128 | 129 | What if we run 3 threads? 130 | - Assume OS assigns them to the first 3 spots, but maybe two of those are actually the same core (because SMT), and the first half of those spots are actually on the same chip. 131 | - A smarter policy would be to put them on separate chips if possible, and then separate cores if possible, to maximize all benefits. 132 | - So, the OS needs to be very aware of what hardware is available, and smart enough to use it effectively. 133 | 134 | 135 | *[LLC]: Last-Level Cache 136 | *[SMT]: Simultaneous Multi-Threading -------------------------------------------------------------------------------- /memory-consistency.md: -------------------------------------------------------------------------------- 1 | --- 2 | id: memory-consistency 3 | title: Memory Consistency 4 | sidebar_label: Memory Consistency 5 | --- 6 | 7 | [🔗Lecture on Udacity (32 min)](https://classroom.udacity.com/courses/ud007/lessons/914198580/concepts/last-viewed) 8 | 9 | ## Memory Consistency 10 | - Coherence: Defines order of accesses to **the same address** 11 | - We need this to share data among threads 12 | - Does not say anything about accesses to different locations 13 | - Consistency: Defines order of accesses to **different addresses** 14 | - Does this even matter? 15 | 16 | ### Consistency Matters 17 | Because of coherence, stores must always occur in-order. But, since out of order processors could change the order of loads, you might get results out of order that wouldn't be possible from the program logic alone. 18 | 19 | Examples: [🎥 View lecture video (3:59)](https://www.youtube.com/watch?v=uh8gF64345I), [🎥 View quiz from lecture (3:33)](https://www.youtube.com/watch?v=4X-DciIfFcc) 20 | 21 | ### Why We Need Consistency 22 | - Data-Ready Flag Synchronization 23 | - See quiz from previous section. Effects of branch prediction can cause loads to happen before the data is ready (according to some flag) 24 | - Thread Termination 25 | - Thread A creates Thread B 26 | - Thread B works, updates data 27 | - Thread A waits until B exits (system call) 28 | - Branch prediction may cause this to execute before the following: 29 | - Thread B done, OS marks B done 30 | 31 | We need an additional set of ordering restrictions, beyond coherence alone! 32 | 33 | ## Sequential Consistency 34 | The result of any execution should be as if accesses executed by each processor were executed in-order, and accesses among different processors were arbitrarily interleaved. 35 | 36 | - Simplest Implementation: remove the "as if" 37 | - A core performs next access only when ALL previous accesses are complete 38 | - This works fine, but it is really, really bad for performance (MLP = 1) 39 | 40 | ### Better Implementation of Sequential Consistency 41 | - Core can reorder loads 42 | - Detect when SC may be violated \\(\Rightarrow\\) Fix 43 | - How? 
In the ROB we have all instructions in program order, regardless of execution order. 44 | - If instructions actually executed in order, we're ok 45 | - If instructions are out of order but no other stores were done to any of the reordered addresses in the meantime, we're also ok. 46 | - Example: `LW A` and `LW B` are program order, but they get executed out of order. There is nothing such as a `SW B` after the `LW B`, so by the time we hit `LW A` it is still consistent with what it would have been in-order. 47 | - If we see a store after a load that is out of order (e.g. execution order of `LW B` then `SW B` then `LW A`), we then know that load has to be replayed, because it would have been a different value in program order. We know it is replayable because it is still beyond the commit point in the ROB, so we are capable of replaying it. 48 | - In order to detect such a violation, for anything we load out of order we need to monitor coherence traffic until we get back in order. 49 | 50 | ## Relaxed Consistency 51 | An alternative approach to SC is to not set the expectation to the programmers for SC, but something slightly less strict. 52 | - Four types of ordering 53 | 1. `WR A` \\(\rightarrow\\) `WR B` 54 | 2. `WR A` \\(\rightarrow\\) `RD B` 55 | 3. `RD A` \\(\rightarrow\\) `WR B` 56 | 4. `RD A` \\(\rightarrow\\) `RD B` 57 | - Sequential Consistency: All must be obeyed! 58 | - Relaxed Consistency: some of these types need not be obeyed at all the times 59 | - Read/Read (#4) best example 60 | - How do we write correct programs if we cannot expect this ordering to be consistent? 61 | 62 | ### Relaxed Consistency implementation 63 | - Allow reordering of "Normal" accesses 64 | - Add special non-reorderable accesses (separate instructions than normal `LW`/`SW`) 65 | - Must use these when ordering in the program matters 66 | - Example: x86 `MSYNC` instruction 67 | - Normally, all accesses may be reordered 68 | - But, no reordering across the `MSYNC` 69 | - [`LW`, `SW`, `LW`, `SW`] \\(\rightarrow\\) `MSYNC`, \\(\rightarrow\\) [`LW`, `SW`, ...] 70 | - Acts as a barrier between operations that expect things to be in order 71 | ```cpp 72 | while(!flag); 73 | MSYNC 74 | //use data -- ensures the data is correctly synchronized to the flag state regardless of reordering 75 | ``` 76 | - When to `MSYNC` for synchronization? After a lock/"acquire", but before an unlock/"release" 77 | 78 | ## Data Races and Consistency 79 | ![data races and consistency](https://i.imgur.com/owEQKj4.png) 80 | 81 | One key point is that when creating a data-race-free program, we may want to debug in a SC environment until proper synchronization ensures that we are free of any potential data races. Once we are confident of that, we can move to a more relaxed consistency model and rely on synchronization to ensure consistency is maintained. If the program is not data race free, then anything can happen! 82 | 83 | ## Consistency Models 84 | - Sequential Consistency 85 | - Relaxed Consistency 86 | - Weak (distinguishes between synchronization and non-synchronization accesses, and prevents any synchronization-related accesses from being reordered, but all others may be freely reordered) 87 | - Processor 88 | - Release (distinguishes between lock and unlock (acquire/release) accesses, and does not allow non-synchronization accesses to be reordered across this lock/unlock boundary) 89 | - Lazy Release 90 | - Scope 91 | - ... 
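To connect the `MSYNC`-style flag example above to something that actually compiles, here is a hedged C++11 sketch using acquire/release atomics. The atomics API is not from the lecture; it simply plays the same role as the barrier, forbidding ordinary accesses from being reordered across the flag operations.

```cpp
#include <atomic>
#include <thread>

int data = 0;                       // ordinary, non-synchronization data
std::atomic<bool> flag{false};      // the synchronization flag

void producer() {
    data = 42;                                    // ordinary store
    flag.store(true, std::memory_order_release);  // no earlier access may move below this
}

void consumer() {
    while (!flag.load(std::memory_order_acquire)) {}  // no later access may move above this
    int x = data;   // guaranteed to see 42: the acquire/release pair acts like the barrier
    (void)x;
}

int main() {
    std::thread t1(producer), t2(consumer);
    t1.join();
    t2.join();
    return 0;
}
```

If `data` and `flag` were both plain variables, this program would have a data race, and (as the lecture notes say) anything could happen under a relaxed model.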
92 | 93 | 94 | *[MLP]: Memory Level Parallelism 95 | *[SC]: Sequential Consistency 96 | *[ROB]: Re-Order Buffer -------------------------------------------------------------------------------- /memory-ordering.md: -------------------------------------------------------------------------------- 1 | --- 2 | id: memory-ordering 3 | title: Memory Ordering 4 | sidebar_label: Memory Ordering 5 | --- 6 | 7 | [🔗Lecture on Udacity (30 min)](https://classroom.udacity.com/courses/ud007/lessons/937498641/concepts/last-viewed) 8 | 9 | ## Memory Access Ordering 10 | 11 | With ROB and Tomasulo, we can enforce order of dependencies on registers between instructions. Is there an equivalent for memory acces (load/store)? Can they be reordered? 12 | 13 | So far, we have: 14 | - Eliminated Control Dependencies 15 | - Branch Prediction 16 | - Eliminated False Dependencies on registers 17 | - Register Renaming 18 | - Obey RAW register dependencies 19 | - Tomasulo-like scheduling 20 | 21 | What about memory dependencies? 22 | ```mipsasm 23 | SW R1, 0(R3) 24 | LW R2, 0(R4) 25 | ``` 26 | In this case, if 0(R3) and 0(R4) point to different addresses, then order does not matter. But if they point to the same address, then we need a way to handle this dependency and keep them in order. 27 | 28 | ## When does Memory Write Happen? 29 | 30 | At commit! Any instruction may execute but not be committed (due to exceptions, branch misprediction), so memory should not be written until the instruction is ready to be committed. 31 | 32 | But if we write at the commit, where does a `LOAD` get its data? 33 | 34 | ## Load Store Queue (LSQ) 35 | 36 | We need an LSQ to provide values from stores to loads, since the actual memory will not be written until a commit. We use the following structure: 37 | 38 | | L/S | Addr | Val | C | 39 | | --- | --- | --- |---| 40 | | `` | | | | 41 | 42 | L/S: bit for whether it is load or store 43 | Addr: address being accessed 44 | Val: value being stored or loaded 45 | C: instruction has been completed 46 | 47 | An example of using the LSQ: 48 | 1. [🎥 Part 1](https://www.youtube.com/watch?v=utRgthVxAYk) 49 | 2. [🎥 Part 2](https://www.youtube.com/watch?v=mbwf-CoA5Zg) 50 | 51 | For each Load instruction, it looks in the LSQ to see if the value was already stored to that address by a previous instruction. If so, it uses "Store-to-Load Forwarding" to obtain the value for the Load. If not, it goes to memory. In some cases, a Store may not yet know its address (needs to be computed). In this case, a Load may have not known there was a previous store on that address and would have obtained an incorrect value. There are some options to handle this: 52 | 1. All mem instructions in-order (very low performance) 53 | 2. Wait for all previous store addresses to complete (still bad performance) 54 | 3. Go anyway - the problem will occur, but additional logic must be used to recover from this situation. 55 | 56 | ### Out-Of-Order Load Store Execution 57 | 58 | | # | Inst | Notes | 59 | |---| --- | --- | 60 | | 1 | `LOAD R3 = 0(R6)` | Dispatched immediately, but cache miss | 61 | | 2 | `ADD R7 = R3 + R9` | Depends on #1, waiting... | 62 | | 3 | `STORE R4 -> 0(R7)` | Depends on #2, waiting... | 63 | | 4 | `SUB R1 = R1 - R2` | Dispatched immediately, R1 good | 64 | | 5 | `LOAD R8 = 0(R1)` | R1 ready, so runs and cache hit | 65 | 66 | In this example, instructions 1-3 are delayed due to a cache miss, but instructions 4 and 5 are able to execute. `R8` now contains the value from address `0(R1)`. 
Later, instructions 1-3 will execute, and address `0(R7)` contains the value of `R4`. If the addresses in `R1` and `R7` do not match, then this is ok. However, if they are the same number (given that `R7` was calculated in instruction 2 and we did not know this yet), then we have a problem, because R8 contains the previous value from that address, not the value it should have had with the ordering. 67 | 68 | ### In-Order Load Store Execution 69 | 70 | Similar to previous example, we may execute some instructions out of order (#4 may execute right away), but all loads and stores are in order. A store instruction is considered "done" (for purposes of ordering) when the address is known, at which point it can check to ensure any other instructions do not conflict with it. Of course this is not very high-performance, as the final `LOAD` instruction is delayed considerably - and if it is a cache miss, then the performance takes an even greater hit. 71 | 72 | ## Store-to-Load Forwarding 73 | 74 | Load: 75 | - Which earlier `STORE` do we get value from? 76 | 77 | Store: 78 | - Which later `LOAD`s do I give my value to? 79 | 80 | The LSQ is responsible for deciding these things. 81 | 82 | ## LSQ Example 83 | 84 | [🎥 Example Video (5:29)](https://www.youtube.com/watch?v=eHVLMgfy-Jc) 85 | 86 | A key point from this is the connection to Commits as we learned in the ROB lesson. The stores are only written to data cache when the associated instruction is committed - thereby ensuring that exceptions or mispredicted branches do not affect the memory itself. At any point execution could stop and the state of everything will be ok. 87 | 88 | ## LSQ, ROB, and RS 89 | 90 | Load/Store: 91 | - Issue 92 | - Need a ROB entry 93 | - Need an LSQ entry (similar to needing an RS in a non-Load/Store issue) 94 | - Execute Load/Store: 95 | - Compute the Address 96 | - Produce the Value 97 | - _(Can happen simultaneously)_ 98 | - (LOAD only) Write-Result 99 | - And Broadcast it 100 | - Commit Load/Store 101 | - Free ROB & LSQ entries 102 | - (STORE only) Send write to memory 103 | 104 | 105 | *[ALU]: Arithmetic Logic Unit 106 | *[CPI]: Cycles Per Instruction 107 | *[ILP]: Instruction-Level Parallelism 108 | *[IPC]: Instructions per Cycle 109 | *[IQ]: Instruction Queue 110 | *[LB]: Load Buffer 111 | *[LSQ]: Load-Store Queue 112 | *[LW]: Load Word 113 | *[OOO]: Out Of Order 114 | *[PC]: Program Counter 115 | *[RAT]: Register Allocation Table (Register Alias Table) 116 | *[RAW]: Read-After-Write 117 | *[ROB]: ReOrder Buffer 118 | *[SB]: Store Buffer 119 | *[SW]: Store Word 120 | *[WAR]: Write-After-Read 121 | *[WAW]: Write-After-Write 122 | *[RAR]: Read-After-Read -------------------------------------------------------------------------------- /memory.md: -------------------------------------------------------------------------------- 1 | --- 2 | id: memory 3 | title: Memory 4 | sidebar_label: Memory 5 | --- 6 | 7 | [🔗Lecture on Udacity (23 min)](https://classroom.udacity.com/courses/ud007/lessons/872590120/concepts/last-viewed) 8 | 9 | ## How Memory Works 10 | * Memory Technology: SRAM and DRAM 11 | * Why is memory slow? 12 | * Why don't we use cache-like memory? 13 | * What happens when we access main memory? 14 | * How do we make it faster? 
15 | 16 | ## Memory Technology SRAM and DRAM 17 | SRAM - Static Random Access Memory - Faster 18 | - Retains data while power supplied (good) 19 | - Several transistors per bit (bad) 20 | 21 | DRAM - Dynamic Random Access Memory - Slower 22 | - Will lose data if we don't refresh it (bad) 23 | - Only one transistor per bit (good) 24 | 25 | Both of these lose data when power is not supplied 26 | 27 | ## One Memory Bit 28 | ### SRAM 29 | * [🎥 See SRAM (4:13)](https://www.youtube.com/watch?v=mwNqzc1o5zM) 30 | 31 | SRAM is a matrix of Wordlines and Bitlines. The wordline controls which word is selected, which is the "on" switch for transistors for that word. These transistors connect the bitlines to a inverter feedback loop that contains the data. Another transistor connects the other side of the inverter to another bitline that represents the inverted value of that same bit. In this way memory writes can more easily happen by driving the inverter loop in the opposite direction at the same times, and we can also be more sure of the bit value by looking at the difference between the bit and not-bit lines. 32 | 33 | ### DRAM 34 | * [🎥 See DRAM (4:03)](https://www.youtube.com/watch?v=3s7zsLU83bY) 35 | 36 | In DRAM we similarly have a transistor controlled by the wordline, but the main difference is that our memory is no longer stored in a feedback loop, but in a simple capacitor that is charged on a write to memory. The problem is that a transistor is not a perfect switch - it's a tiny bit leaky, which means the capacitor will lose its charge over time. So, we need to write that bit again from time to time. Additionally, reads are destructive, in that they also drain the capacitor. 37 | 38 | ### SRAM/DRAM Thoughts 39 | 40 | We also see that SRAM can be called a 6T cell (2 control transistors + two transistors each inverter). DRAM is a 1T cell (single transistor controlling the capacitor). We might think the physical footprint of that is a single transistor plus a capacitor (and we want a large capacitor to retain the value longer). However, DRAM uses a technology called "trench cell", which buries a capacitor underneath transistor into a single unit as far as footprint. Additionally, we don't need the second bitline as in SRAM, so space is conserved and the cost is much lower overall. 41 | 42 | ## Memory Chip Organization 43 | A Row Address is passed into a Row Decoder, which decides which Wordline to activate. There is a memory cell on each intersection between a wordline and a bitline. Once a row is activated, the bits in that word follows down the bitlines at a small voltage where it is collected in a Sense Amplifier that produces full 0 and 1 bits. These signals go into a Row Buffer, a storage element for these values. The value is then fed into a Column Decoder, which takes a Column Address (which bit to read), and outputs a single bit. This can then be replicated to provide more bits. 44 | ![Memory Chip Organization](https://i.imgur.com/ABBEPWh.png) 45 | 46 | After this happens, the destructive read means the value needs to be refreshed. So the sense amplifier now drives the bitlines to ensure the value has been refreshed. Thus, every read is really a Read-Then-Write. It also must handle refreshing all words periodically to counteract leakage. We can't rely on the processor to do this for us, because things that are accessed often are probably in cache and won't be accessed in memory. So, we have a Refresh Row Counter that loops through words within the required refresh period. 
So for a Refresh Period T and N rows, then every T/N seconds we need to refresh a row. In DRAM there are many rows and T is less than a second, so this is a lot of required refresh activity. 47 | 48 | To write to memory, we still read a row, and once it is latched into the row buffer we provide the new value to the column decoder, which then will write it into the row buffer, and the sense amplifier uses the value to write back on the bitlines. Thus, a write is also a Read-Then-Write, similar to the read. 49 | 50 | ## Fast Page Mode 51 | "Page" is the series of bitlines that can be read by selecting the proper row. It can be many bits long. Fast Page mode is a technique in which once a row is selected, the value is latched into the row buffer and successive bit reads will happen faster. 52 | - Fast Page Mode 53 | - Open a page 54 | - Row Address 55 | - Select Row 56 | - Sense Amplifier 57 | - Latch into Row Buffer 58 | - Read/Write Cycle (can be many read and write operations on that page) 59 | - Close the page 60 | - Write data from row buffer back to memory row 61 | 62 | ## Connecting DRAM to the Processor 63 | Once a memory request has missed all levels of cache, it is communicated externally through the Front Side Bus into the Memory Controller. The Memory Controller channels these requests into one of several DRAM channels. Thus, the total memory latency includes all the overhead of the communication over the FSB, the activities of the Memory Controller, in addition to the memory access itself. 64 | ![Connecting DRAM](https://i.imgur.com/MNIeFRR.png) 65 | 66 | Some recent processors integrate the memory controller so we can eliminate communication over the FSB and communicate directly from it. This can be a significant overall savings. It also forced standardzation of the protocols connecting DRAM to the processor to allow for this without a complete chip redesign. 67 | 68 | 69 | 70 | *[DRAM]: Dynamic Random Access Memory 71 | *[SRAM]: Static Random Access Memory -------------------------------------------------------------------------------- /metrics-and-evaluation.md: -------------------------------------------------------------------------------- 1 | --- 2 | id: metrics-and-evaluation 3 | title: Metrics and Evaluation 4 | sidebar_label: Metrics and Evaluation 5 | --- 6 | 7 | [🔗Lecture on Udacity (45 min)](https://classroom.udacity.com/courses/ud007/lessons/3650739106/concepts/last-viewed) 8 | 9 | ## Performance 10 | 11 | * Latency (time start \\( \rightarrow \\) done) 12 | * Throughput (#/second) (not necessarily 1/latency due to pipelining) 13 | 14 | ### Comparing Performance 15 | 16 | "X is N times faster than Y" 17 | 18 | Speedup = N = speed(X)/speed(Y) 19 | \\( \Rightarrow \\) N = Throughput(X)/Throughput(Y) 20 | \\( \Rightarrow \\) N = Latency(Y)/Latency(X) 21 | (notice Y and X for each) 22 | 23 | ### Speedup 24 | 25 | Speedup > 1 \\( \Rightarrow \\) Improved Performance 26 | * Shorter execution time 27 | * Higher throughput 28 | 29 | Speedup < 1 \\( \Rightarrow \\) Worse Performance 30 | 31 | Performance ~ Throughput 32 | 33 | Performance ~ 1/Latency 34 | 35 | ## Measuring Performance 36 | 37 | How do we measure performance? 38 | 39 | Actual User Workload: 40 | * Many programs 41 | * Not representative of other users 42 | * How do we get workload data? 43 | 44 | Instead, use Benchmarks 45 | 46 | ### Benchmarks 47 | 48 | Programs and input data agreed upon for performance measurements. 
Typically there is a benchmark suite, comprised of multiple programs, each representative of some kind of application. 49 | 50 | #### Benchmark Types 51 | * Real Applications 52 | * Most Representative 53 | * Most difficult to set up 54 | * Kernels 55 | * Find most time-consuming part of application 56 | * May not be easy to run depending on development phase 57 | * Best to run once a prototype machine is available 58 | * Synthetic Benchmarks 59 | * Behave similarly to kernels but simpler to compile 60 | * Typically good for design studies to choose from design 61 | * Peak Performance 62 | * In theory, how many IPS 63 | * Typically only good for marketing 64 | 65 | #### Benchmark Standards 66 | 67 | A benchmarking organization takes input from academia, user groups, and manufacturers, and curates a standard benchmark suite. Examples: TPC (databases, web), EEMBC (embedded), SPEC (engineering workstations, raw processors). For example, SPEC includes GCC, Perl, BZIP, and more. 68 | 69 | ### Summarizing Performance 70 | 71 | #### Average Execution Time 72 | 73 | | | Comp X | Comp Y | Speedup | 74 | |---|--------|--------|---------| 75 | | App A | 9s | 18s | 2.0 | 76 | | App B | 10s | 7s | 0.7 | 77 | | App C | 5s | 11s | 2.2 | 78 | | AVG | 8s | 12s | 1.5 | 79 | 80 | If you simply average speedups, you get 1.63 instead. Speedup of average execution times is not the same as simply averaging speedups on individual applications. 81 | 82 | #### Geometric Mean 83 | 84 | If we want to be able to average speedups, we need to use geometric means for both average times and speedups. This results in the same value whether you are taking the geometric mean of individual speedups, or speedup of geometric mean of execution times. 85 | 86 | For example, in the table above we would obtain: 87 | | | Comp X | Comp Y | Speedup | 88 | |---|--------|--------|---------| 89 | | Geo Mean | 7.66s | 11.15s | 1.456 | 90 | 91 | Geometric mean of speedup values (2.0, 0.7, 2.2) also result in 1.456. 92 | 93 | $$ \text{geometric mean} = \sqrt[n]{a_1\*a_2\*...a_n} $$ 94 | 95 | As a general rule, if you are trying to average things that are ratios (speedups are ratios), you cannot simply average them. Use the geometric mean instead. 96 | 97 | ## Iron Law of Performance 98 | 99 | **CPU Time** = (# instructions in the program) * (cycles per instruction) * (clock cycle time) 100 | 101 | Each component allows us to think about the computer architecture and how it can be changed: 102 | * Instructions in the Program 103 | * Algorithm 104 | * Compiler 105 | * Instruction Set 106 | * Cycles Per Instruction 107 | * Instruction Set 108 | * Processor Design 109 | * Clock Cycle Time 110 | * Processor Design 111 | * Circuit Design 112 | * Transistor Physics 113 | 114 | ### Iron Law for Unequal Instruction Times 115 | 116 | When instructions have different number of cycles, sum them individually: 117 | $$ \sum_i (IC_i\* CPI_i) * \frac{\text{time}}{\text{cycle}} $$ 118 | 119 | where \\( IC_i \\) is the instruction count for instruction \\( i \\), and \\(CPI_i\\) is the cycles for instruction \\( i \\). 120 | 121 | ## Amdahl's Law 122 | 123 | Used when only part of the program or certain instructions. What is the overall affect on speedup? 124 | 125 | $$ speedup = \frac{1}{(1-frac_{enh}) + \frac{frac_{enh}}{speedup_{enh}}} $$ 126 | 127 | where \\( frac_{enh} \\) represents the fraction of the execution **TIME** enhanced by the changes, and \\( speedup_{enh} \\) represents the amount that change was sped up. 
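As a small sanity-check helper (a sketch, not from the course materials), the formula above is easy to script; the two calls reproduce the comparison in the Implications section below.

```cpp
#include <cstdio>

// Direct transcription of Amdahl's Law: fracEnhanced is the fraction of
// execution TIME affected, speedupEnhanced is how much that part is sped up.
double amdahlSpeedup(double fracEnhanced, double speedupEnhanced) {
    return 1.0 / ((1.0 - fracEnhanced) + fracEnhanced / speedupEnhanced);
}

int main() {
    std::printf("%.3f\n", amdahlSpeedup(0.10, 20.0)); // ~1.105 (20x on 10% of time)
    std::printf("%.3f\n", amdahlSpeedup(0.80, 1.6));  // ~1.429 (1.6x on 80% of time)
    return 0;
}
```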
128 | 129 | NOTE: Always ensure the fraction represents TIME, not any other quantity (cycles, etc.). First, convert changes into execution time before the change, and execution time after the change. 130 | 131 | ### Implications 132 | 133 | Compare these enhancements: 134 | * Enhancement 1 135 | * Speedup of 20x on 10% of time 136 | * \\( \Rightarrow \\) speedup = 1.105 137 | * Enhancement 2 138 | * Speedup of 1.6x on 80% of time 139 | * \\( \Rightarrow \\) speedup = 1.43 140 | 141 | Even an infinite speedup in enhancement 1 only yields 1.111 overall speedup. 142 | 143 | Takeaway: Make the common case fast 144 | 145 | ### Lhadma's Law 146 | 147 | * Amdahl: Make common case fast 148 | * Lhadma: Do not mess up the uncommon case too badly 149 | 150 | Example: 151 | * Improvement of 2x on 90% 152 | * Slow down rest by 10x 153 | * Speedup = \\( \frac{1}{\frac{0.1}{0.1} + \frac{0.9}{2}} = \frac{1}{1+0.45} = 0.7 \\) 154 | * \\( \Rightarrow \\) Net slowdown, not speedup. 155 | 156 | ### Diminishing Returns 157 | 158 | Consequence of Amdahl's law. If you keep trying to improve the same area, you get diminishing returns on the effort. Always reconsider what is now the dominant part of the execution time. 159 | 160 | ![Diminishing Returns](https://i.imgur.com/SzjXnRS.png) 161 | 162 | *[IPS]: Instructions per Second 163 | *[CPI]: Cycles per Instruction -------------------------------------------------------------------------------- /multi-processing.md: -------------------------------------------------------------------------------- 1 | --- 2 | id: multi-processing 3 | title: Multi-Processing 4 | sidebar_label: Multi-Processing 5 | --- 6 | 7 | [🔗Lecture on Udacity (1.25hr)](https://classroom.udacity.com/courses/ud007/lessons/1097109180/concepts/last-viewed) 8 | 9 | ## Flynn's Taxonomy of Parallel Machines 10 | * How many instruction streams 11 | * How many data streams 12 | 13 | | | Instruction Streams | Data Streams | | 14 | |:---:|:---:|:---:|---| 15 | | SISD
(uniprocessor) | 1 | 1 | Single Instruction Single Data | 16 | | SIMD
(vector, SSE/MMX) | 1 | > 1 | Single Instruction Multiple Data | 17 | | MISD
(stream processor?) | > 1 | 1 | Multiple Instruction Single Data
(rare) | 18 | | MIMD
(multiprocessor) | > 1 | > 1 | Multiple Instruction Multiple Data
(most today, multi-core)| 19 | 20 | ## Why Multiprocessors? 21 | * Uniprocessors already \\(\approx\\) 4-wide 22 | * Diminishing returns from making it even wider 23 | * Faster?? \\(frequency \uparrow\ \Rightarrow voltage \uparrow\ \\Rightarrow power \uparrow^3\ \Rightarrow (fire)\\) 24 | * But Moore's Law continues! 25 | * 2x transistors every 18 months 26 | * \\(\Rightarrow\ \approx\\)2x performance every 18 months (assuming we can use all the cores) 27 | 28 | ## Multiprocessor Needs Parallel Programs! 29 | * Sequential (single-threaded) code is _a lot_ easier to develop 30 | * Debugging parallel code is _much_ more difficult 31 | * Performance scaling _very_ hard to get 32 | * (as we scale # cores, performance scales equally, linearly) 33 | * In reality, performance will improve to a point, but then levels off. We can keep improving the program, but it will still max out at some level of performance no matter how many cores. 34 | 35 | ## Centralized Shared Memory 36 | Each core has its own cache, but connected to the same bus that shares main memory and I/O. Cores can send data to each other by writing to a location in main memory. Similarly they can share I/O. This architecture is called a Symmetric Multiprocessor (SMP), another concept is Uniform Memory Access (UMA), in terms of time. 37 | ![centralized shared memory](https://i.imgur.com/27aePNH.png) 38 | 39 | ### Centralized Main Memory Problems 40 | * Memory Size 41 | * Need large memory \\(\Rightarrow\\) slow memory 42 | * Memory Bandwidth 43 | * Misses from all cores \\(\Rightarrow\\) memory bandwidth contention 44 | * As we add more cores, it starts to serialize against memory accesses, so we lose performance benefit 45 | 46 | Works well only for smaller machines (2, 4, 8, 16 cores) 47 | 48 | ## Distributed ~~Shared~~ Memory 49 | This uses message passing for a core to access another core's memory. If accesses are not in the local cache or the memory slice local to the core, it actually sends a request for this memory through the network to the core that has it. This system is also called a cluster or a multi-computer, since each core is somewhat like an individual computer. 50 | * ![distributed memory](https://i.imgur.com/jIRNFwL.png) 51 | 52 | These computers tend to scale better - not because network communication is faster, but rather because the programmer is forced to explicitly think about distributed memory access and will tend to minimize communication. 53 | 54 | ## A Message-Passing Program 55 | ```cpp 56 | #define ASIZE 1024 57 | #define NUMPROC 4 58 | double myArray[ASIZE/NUMPROC]; // each processor allocates only 1/4 of entire array (assumes elements have been distributed somehow) 59 | double mySum = 0; 60 | for(int i=0; i < ASIZE/NUMPROC; i++) 61 | mySum += myArray[i]; 62 | if(myPID==0) { 63 | for(int p=1; p < NUMPROC; p++) { 64 | int pSum; 65 | recv(p, pSum); // receives the sums from the other processors 66 | mySum += pSum; 67 | } 68 | printf("Sum: %lf\n", mySum); 69 | } else { 70 | send(0, mySum); // sends the local processor's sum to p0 71 | } 72 | ``` 73 | 74 | ## A Shared Memory Program 75 | Benefit: No need to distribute the array! 
76 | 77 | ```cpp 78 | #define ASIZE 1024 79 | #define NUMPROC 4 80 | shared double array[ASIZE]; // each processor needs the whole array, but it is shared 81 | shared double allSum = 0; // total sum in shared memory 82 | shared mutex sumLock; // shared mutex for locking 83 | double mySum = 0; 84 | for(int i=myPID*ASIZE/NUMPROC; i < (myPID+1)*ASIZE/NUMPROC; i++) { 85 | mySum += array[i]; // sums up this processor's own part of the array 86 | } 87 | lock(sumLock); // lock mutex for shared memory access 88 | allSum += mySum; 89 | unlock(sumLock); // unlock mutex 90 | // <<<<< Need a barrier here to prevent p0 from displaying result until other processes finish 91 | if(myPID==0) { 92 | printf("Sum: %lf\n", allSum); 93 | } 94 | ``` 95 | 96 | ## Message Passing vs Shared Memory 97 | 98 | | | Message Passing | Shared Memory | 99 | |-------------------|:---------------:|:-------------:| 100 | | Communication | Programmer | Automatic | 101 | | Data Distribution | Manual | Automatic | 102 | | Hardware Support | Simple | Extensive | 103 | | Programming | 104 | | - Correctness | Difficult | Less Difficult | 105 | | - Performance | Difficult | Very Difficult | 106 | 107 | Key difference: Shared memory is easier to get a correct solution, but there may be a lot of issues making it perform well. With Message Passing, typically once you've gotten a correct solution you also have solved a large part of the performance problem. 108 | 109 | ## Shared Memory Hardware 110 | * Multiple cores share physical address space 111 | * (all can issue addresses to any memory address) 112 | * Examples: UMA, NUMA 113 | * Multi-threading by time-sharing a core 114 | * Same core \\(\rightarrow\\) same physical memory 115 | * (not really benefiting from multi-threading) 116 | * Hardware multithreading in a core 117 | * Coarse-Grain: change thread every few cycles 118 | * Fine-Grain: change thread every cycle (more HW support) 119 | * Simultaneous Multi-Threading (SMT): any given cycle we can be doing multiple instructions from different threads 120 | * Also called Hyperthreading 121 | 122 | ## Multi-Threading Performance 123 | [🎥 View lecture video (8:25)](https://www.youtube.com/watch?v=ZpqeeHFWxes) 124 | 125 | Without multi-threading, processor activity in terms of width is very limited - there will be many waits in which no work can be done. As it switches from thread to thread (time sharing), the activity of each thread is somewhat sparse. 126 | 127 | In a Chip Multiprocessor (CMP), each core runs different threads. While the activity happens simultaneously in time, each core still has to deal with the sparse workload created by the threads it is running. (Total cost 2x) 128 | 129 | With Fine-Grain Multithreading, we have one core, but with separate sets of registers for different threads. Each thread's work switches from cycle to cycle, taking advantage of the periods of time in which a thread is waiting on something to be ready to do the next work. This keeps the core busier overall, at the slight expense of extra hardware requirements. (Total cost \\(\approx\\) 1.05x) 130 | 131 | With SMT, we can go one step further by mixing instructions from different threads into the same cycle. This dramatically reduces the total idle time of the processor. This requires much more hardware to support. 
(Total cost \\(\approx\\) 1.1x) 132 | 133 | ## SMT Hardware Changes 134 | * Fetch 135 | * Need additional PC to keep track of another thread 136 | * Decode 137 | * No changes needed 138 | * Rename 139 | * Renamer does the same thing as per one thread 140 | * Need another RAT for the other thread 141 | * Dispatch/Execution 142 | * No changes needed 143 | * ROB 144 | * Need a separate ROB with its own commit point 145 | * Alternatively: interleave in single ROB to save HW expense (more often in practice) 146 | * Commit 147 | * Need a separate ARF for the other thread. 148 | ![smt hardware changes](https://i.imgur.com/4fC0ULd.png) 149 | 150 | Overall: cost is not that much higher - the most expensive parts do not need duplication to support SMT. 151 | 152 | ## SMT, D$, TLB 153 | With a VIVT cache, we simply send the virtual address into the cache and use the data that comes out. However, each thread may have different address spaces, so the cache has no clue which of the two it is getting. So with VIVT, we risk getting the wrong data! 154 | 155 | With a VIPT cache (w/o aliasing), the same virtual address will be combined with the physical address and things will work correctly as long as the TLB provides the right information. So for VIPT and PIPT both, SMT addressing will work fine as long as the TLB is thread-aware. One way to ensure this is with a thread bit on each TLB entry, and to pass the thread ID into the TLB lookup. 156 | ![SMT, cache, TLB](https://i.imgur.com/6zKXp5i.png) 157 | 158 | ## SMT and Cache Performance 159 | * Cache on the core is shared by all SMT threads 160 | * Fast data sharing (good) 161 | ``` 162 | Thread 0: SW A 163 | Thread 1: LW A <--- really good performance 164 | ``` 165 | * Cache capacity (and associativity) shared (bad) 166 | * If WS(TH0) + WS(TH1) - WS(TH0, TH1) > D$ size: 167 | \\(\Rightarrow\\) Cache Thrashing 168 | * If WS(TH0) < D$ size 169 | \\(\Rightarrow\\) SMT performance can be worse than one-at-a-time 170 | * If the threads share a lot of data and fit into cache, maybe SMT performance doesn't have as many issues. 171 | 172 | 173 | 174 | *[CMP]: Chip Multiprocessor 175 | *[ARF]: Architectural Register File 176 | *[SMT]: Simultaneous Multi-Threading 177 | *[SMP]: Symmetric Multiprocessor 178 | *[WS]: Working Set 179 | *[D$]: Data Cache 180 | *[PIPT]: Physically Indexed Physically Tagged 181 | *[TLB]: Translation Look-Aside Buffer 182 | *[VIPT]: Virtually Indexed Physically Tagged 183 | *[VIVT]: Virtual Indexed Virtually Tagged 184 | *[ALU]: Arithmetic Logic Unit 185 | *[IQ]: Instruction Queue 186 | *[LB]: Load Buffer 187 | *[LW]: Load Word 188 | *[PC]: Program Counter 189 | *[RAT]: Register Allocation Table (Register Alias Table) 190 | *[ROB]: ReOrder Buffer 191 | *[SB]: Store Buffer 192 | *[SW]: Store Word 193 | *[UMA]: Uniform Memory Access 194 | *[NUMA]: Non-Uniform Memory Access -------------------------------------------------------------------------------- /pipelining.md: -------------------------------------------------------------------------------- 1 | --- 2 | id: pipelining 3 | title: Pipelining 4 | sidebar_label: Pipelining 5 | --- 6 | 7 | [🔗Lecture on Udacity (54 min)](https://classroom.udacity.com/courses/ud007/lessons/3650589023/concepts/last-viewed) 8 | 9 | Given a certain latency, the idea is that the first unit of work is achieved in the same period of time, but successive units continue to be completed thereafter. (Oil pipeline vs. 
traveling with a bucket) 10 | 11 | ## Pipelining in a Processor 12 | Each logical phase of operation in the processor (fetch, decode, ALU, memory, write) operates as a pipeline in which successive instructions move through each phase immediately behind the current instruction, instead of waiting for the entire instruction to complete. 13 | 14 | ![Pipelining in a Processor](https://i.imgur.com/0m5vXEf.png) 15 | 16 | ## Pipeline CPI 17 | Is Pipeline CPI = always 1? 18 | 19 | On initial fill, CPI \\( \rightarrow \\) 1 when # instruction \\( \rightarrow \infty \\). If pipeline stalls (a single phase breaks down), then on the next cycle the successive phase will be doing no work. Therefore, CPI is always > 1 if any stalls are possible. For example, if every 5 cycles one phase stalls, your CPI is 6 cycles/5 units = 1.2 CPI. 20 | 21 | ## Processor Pipeline Stalls 22 | 23 | In the example below, we see one example of a pipeline stall in a processor in a program like this: 24 | 25 | ```mipsasm 26 | LW R1, ... 27 | ADD R2, R1, 1 28 | ADD R3, R2, 1 29 | ``` 30 | 31 | ![Processor Pipeline Stalls](https://i.imgur.com/E21sE3o.png) 32 | 33 | The second instruction depends on the result of the first instruction, and must wait for it to complete the ALU and MEM phases before it can proceed. Thus, the CPI is actually much greater than 1. 34 | 35 | ## Processor Pipeline Stalls and Flushes 36 | 37 | ```mipsasm 38 | LW R1, ... 39 | ADD R2, R1, 1 40 | ADD R3, R2, 1 41 | JUMP ... 42 | SUB ... ... 43 | ADD ... ... 44 | SHIFT 45 | ``` 46 | 47 | In this case, we have a jump, but we don't know where yet (maybe the address is being manipulated by previous instructions). Instructions following the jump instruction are loaded into the pipeline. When we get to the ALU phase, the `JUMP` will be processed, but this means the next instruction is not the `SUB ... ...` or `ADD ... ...`. These instructions are then flushed from the pipeline and the correct instruction (`SHIFT`) is fetched. 48 | 49 | ![Processor Pipeline Stalls and Flushes](https://i.imgur.com/5S9f2kB.png) 50 | 51 | ## Control Dependencies 52 | 53 | For the following program, the `ADD`/`SUB` instructions have a control dependence on `BEQ`. Similarly, the instructions after `label:` also have a control dependence on `BEQ`. 54 | 55 | ```mipsasm 56 | ADD R1, R1, R2 57 | BEQ R1, R3, label 58 | ADD R2, R3, R4 59 | SUB R5, R6, R8 60 | label: 61 | MUL R5, R6, R8 62 | ``` 63 | 64 | We estimate that: 65 | - 20% of instructions are branch/jump 66 | - slightly more than 50% of all branch/jump instructions are taken 67 | 68 | On a 5-stage pipeline, CPI = 1 + 0.1*2 = 1.2 (10% of the time (50% * 20%) an instruction spends two extra cycles) 69 | 70 | With a deeper pipeline (more stages), the number of wasted instructions increases. 71 | 72 | ## Data Dependencies 73 | 74 | ```mipsasm 75 | ADD R1, R2, R3 76 | SUB R7, R1, R8 77 | MUL R1, R5, R6 78 | ``` 79 | 80 | This program has 3 dependencies: 81 | 82 | 1. Lines 1 and 2 have a RAW (Read-After-Write) dependence on R1 (also called Flow dependence, or TRUE dependence). The `ADD` instruction must be completed before the `SUB` instruction. 83 | 84 | 2. Lines 1 and 3 have a WAW (Write-After-Write) dependence on R1 with the `MUL` instruction in which the `ADD` must complete first, else R1 is overwritten with an incorrect value. This is also called an Output dependence. 85 | 86 | 3. 
Lines 2 and 3 have a WAR (Write-After-Read) dependence on R1 in that the `SUB` instruction must use the value of R1 before the `MUL` instruction overwrites it. This is also called an Anti-dependence because it reverses the order of the flow dependence. 87 | 88 | WAW and WAR dependencies are also called "False" or "Name" dependencies. RAR (Read-After-Read) dependencies do not matter since the value could not have changed in-between and is thus safe to read. 89 | 90 | ### Data Dependencies Quiz 91 | 92 | (Included for additional study help). For the following program, select which dependencies exist: 93 | 94 | ```mipsasm 95 | MUL R1, R2, R3 96 | ADD R4, R4, R1 97 | MUL R1, R5, R6 98 | SUB R4, R4, R1 99 | ``` 100 | 101 | | | RAW | WAR | WAW | 102 | |---|---|---|---| 103 | | \\( I1 \rightarrow I2 \\) | x | - | - | 104 | | \\( I1 \rightarrow I3 \\) | - | - | x | 105 | | \\( I1 \rightarrow I4 \\) | - | - | - | 106 | | \\( I2 \rightarrow I3 \\) | - | x | - | 107 | 108 | ## Dependencies and Hazards 109 | 110 | Dependence - property of the program alone. 111 | 112 | Hazard - when a dependence results in incorrect execution. 113 | 114 | For example, in a 5-stage pipeline, a dependency that is 3 instructions apart may not cause a hazard, since the result will be written before the dependent instruction reads it. 115 | 116 | ## Handling of Hazards 117 | 118 | First, detect hazard situations. Then, address it by: 119 | 1. Flush dependent instructions 120 | 2. Stall dependent instruction 121 | 3. Fix values read by dependent instructions 122 | 123 | Must use flushes for control dependencies, because the instructions that come after the hazard are the wrong instructions. 124 | 125 | For data dependence, we can stall the next instruction, or fix the instruction by forwarding the value to the correct stage of the pipeline (e.g. "keep" the value inside the ALU stage for the next instruction to use). Forwarding does not always work, because the value we need is produced at a later point in time. In this cases we must stall. 126 | 127 | ## How Many Stages? 128 | 129 | For an ideal CPI = 1, we consider the following: 130 | 131 | More Stages \\( \rightarrow \\) more hazards (CPI \\( \uparrow \\)), but less work per stage ( cycle time \\( \downarrow \\)) 132 | 133 | From iron law, Execution Time = # Instructions * CPI * Cycle Time 134 | 135 | \# Stages should be chosen to balance CPI and Cycle time (some local minima for execution time where cycle time has decreased without causing additional hazards). Additionally consider more stages consumes more power (work being done in less cycle time with more latches per stage). 
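As a rough, purely illustrative calculation (the numbers here are assumed, not from the lecture): suppose a 10-stage pipeline has a 1.0ns cycle and CPI = 1.1, so time per instruction is \\(1.1 \times 1.0\text{ns} = 1.1\text{ns}\\). Doubling to 20 stages might cut the cycle to 0.6ns (latch overhead keeps it from halving) while extra hazards push CPI to 1.5, giving \\(1.5 \times 0.6\text{ns} = 0.9\text{ns}\\) - still faster. Pushing on to 40 stages might reach a 0.35ns cycle with CPI around 2.4, giving \\(2.4 \times 0.35\text{ns} \approx 0.84\text{ns}\\) - only a small further gain, while power grows substantially. This is why the performance-optimal depth and the power-sensible depth below are so different.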
136 | 137 | * Performance (execution time) \\( \Rightarrow \\) 30-40 stages 138 | * Manageable Power Consumption \\( \Rightarrow \\) 10-15 stages 139 | 140 | *[ALU]: Arithmetic Logic Unit 141 | *[CPI]: Cycles Per Instruction 142 | *[PC]: Program Counter 143 | *[RAW]: Read-After-Write 144 | *[WAR]: Write-After-Read 145 | *[WAW]: Write-After-Write 146 | *[RAR]: Read-After-Read -------------------------------------------------------------------------------- /predication.md: -------------------------------------------------------------------------------- 1 | --- 2 | id: predication 3 | title: Predication 4 | sidebar_label: Predication 5 | --- 6 | 7 | [🔗Lecture on Udacity (30 min)](https://classroom.udacity.com/courses/ud007/lessons/3617709440/concepts/last-viewed) 8 | 9 | ## Predication 10 | Predication is another way we can try to deal with control hazards and dependencies by dividing the work into both directions. 11 | 12 | Branch prediction 13 | - Guess where it is going 14 | - No penalty if correct 15 | - Huge penalty if wrong (misprediction cost ~50 instructions) 16 | 17 | Predication 18 | - Do work of both directions 19 | - Waste up to 50% 20 | - Throw away work from wrong path 21 | 22 | | Construct/Type | Branch Prediction | Predication | Winner | 23 | | --- | --- | --- | --- | 24 | | Loop | Very good, tends to take same path more often | Extra work is outside the loop and often wasted | Predict | 25 | | Call/Ret | Can be perfect, always takes this path | Any other branch is always wasted | Predict| 26 | | Large If/Then/Else | May waste large number of instructions | Either way we lose a large number of instructions | Predict | 27 | | Small If/Then/Else | May waste a large number of instructions | Only waste a few instructions | Predication, depending on BPred accuracy | 28 | 29 | ## If-Conversion 30 | Compiler takes code like this: 31 | ```cpp 32 | if(cond) { 33 | x = arr[i]; 34 | y = y+1; 35 | } else { 36 | x = arr[j]; 37 | y = y-1; 38 | } 39 | ``` 40 | And changes it to this (work of both paths): 41 | ```cpp 42 | x1 = arr[i]; 43 | x2 = arr[j]; 44 | y1 = y+1; 45 | y2 = y-1; 46 | x = cond ? x1 : x2; 47 | y = cond ? y1 : y2; 48 | ``` 49 | 50 | There is still a question of how to do the conditional assignment. For example if we do something like this: 51 | 52 | ```mipsasm 53 | BEQ ... 54 | MOV X, X2 55 | B Done 56 | MOV X, X1 57 | ... etc. 58 | ``` 59 | then we haven't done much, because we just converted one branch into another branch and possibly have two mispredictions. We need a move instruction that is conditional on some flag being set. 60 | 61 | ## Conditional Move 62 | In MIPS instruction set there are the following instructions 63 | 64 | |inst | operands | does | 65 | |---|---|---| 66 | | `MOVZ` | Rd, Rs, Rt | `if(Rt == 0) Rd=Rs;` | 67 | | `MOVN` | Rd, Rs, Rt | `if(Rt != 0) Rd=Rs;` | 68 | 69 | So we can implement our conditional assignment `x = cond ? x1 : x2;` as follows: 70 | 71 | ```mipsasm 72 | R3 = cond 73 | R1 = ... x1 ... 74 | R2 = ... x2 ... 75 | MOVN X, R1, R3 76 | MOVZ X, R2, R3 77 | ``` 78 | 79 | Similarly in x86 there are many instructions (`CMOVZ`, `CMOVNZ`, `CMOVGT`, etc.) that operate based on the flags, where the move operation completes based on the condition codes provided. 80 | 81 | ## MOVZ/MOVN Performance 82 | Is the If-Conversion really faster? Consider the following example. The branched loop can average 10.5 instructions given a large penalty. 
The If-Converted form uses more instructions to do the work, but never receives a penalty, for an average of 4 instructions. 83 | 84 | ![MOVZ/MOVN Performance](https://i.imgur.com/GXWFIvL.png) 85 | 86 | ## Full Predication HW Support 87 | 88 | For MOV_cc we need a separate opcode. For Full Predication, we add condition bits to _every_ instruction. For example, the Itanium instruction has 41 bits, where the least significant 6 bits specify a "qualifying predicate". The predicates are actually small 1-bit registers, and the 6-bit code tells the processor which of the 64 one-bit predicate registers to use. 89 | 90 | ## Full Predication Example 91 | In this example, the same code as before is now using a predication construct. So, `MP.EQZ` sets up predicates `p1` and `p2`. The `ADDI` instructions then use the proper predicate to determine whether they actually take effect. This avoids needing extra registers, special instruction codes, etc. 92 | 93 | ![Full Predication Example](https://i.imgur.com/o1uPWqN.png) 94 | 95 | 96 | 97 | *[2BP]: 2-bit Predictor 98 | *[2BC]: 2-Bit Counter 99 | *[2BCs]: 2-Bit Counters 100 | *[ALU]: Arithmetic Logic Unit 101 | *[BHT]: Branch History Table 102 | *[BTB]: Branch Target Buffer 103 | *[CPI]: Cycles Per Instruction 104 | *[IF]: Instruction Fetch 105 | *[PC]: Program Counter 106 | *[PHT]: Pattern History Table 107 | *[RAS]: Return Address Stack 108 | *[XOR]: Exclusive-OR 109 | -------------------------------------------------------------------------------- /reorder-buffer.md: -------------------------------------------------------------------------------- 1 | --- 2 | id: reorder-buffer 3 | title: ReOrder Buffer 4 | sidebar_label: ReOrder Buffer 5 | --- 6 | 7 | [🔗Lecture on Udacity (2 hr)](https://classroom.udacity.com/courses/ud007/lessons/945398787/concepts/last-viewed) 8 | 9 | ## Exceptions in Out Of Order Execution 10 | 11 | Consider this set of instructions as executed by Tomasulo's algorithm: 12 | 13 | |   | Instruction | Issue | Disp | WR | 14 | |---|---|---|---|---| 15 | | 1 | `DIV F10, F0, F6` | 1 | 2 | 42 | 16 | | 2 | `L.D F2, 45(R3)` | 2 | 3 | 13 | 17 | | 3 | `MUL F0, F2, F4` | 3 | 14 | 19 | 18 | | 4 | `SUB F8, F2, F6` | 4 | 14 | 15 | 19 | 20 | Let's say that in instruction 1, F6 == 0, causing a Divide-By-Zero exception in cycle 40. Normally an exception would cause the PC to be saved, control to pass to an exception handler, and operations to resume afterward. However, in this case, by the time the exception occurs, F0 has already been overwritten by instruction 3, so re-executing instruction 1 after the handler returns would read the wrong value and produce an incorrect result. The same problem can occur if a load takes a page fault, among various other causes. 21 | 22 | ## Branch Misprediction in OOO Execution 23 | 24 | Similarly, what happens if instructions from a mispredicted path overwrite a register before the branch is resolved? 25 | 26 | ```mipsasm 27 | DIV R1, R3, R4 28 | BEQ R1, R2, Label 29 | ADD R3, R4, R5 30 | ... 31 | DIV ... 32 | Label: 33 | ... 34 | ``` 35 | 36 | If `DIV` takes many cycles, it will take a long time before we realize the branch was mispredicted. In the meantime, the third instruction may have completed and updated `R3`. When we realize the misprediction, we should behave as if the wrong branch had never executed, but this is now impossible. 37 | 38 | Another issue is with Phantom Exceptions.
Let's say the branch was mispredicted, and the program went on and the final DIV statement caused an exception - it could be in the exception handler before we even know the branch was mispredicted and the statement never should have run. 39 | 40 | ## Correct OOO Execution 41 | - Execute OOO 42 | - Broadcast OOO 43 | - But deposit values to registers In-Order! 44 | 45 | We need a structure called a ReOrder Buffer: 46 | - Remembers program order 47 | - Keeps results until safe to write to registers 48 | 49 | ## ReOrder Buffer (ROB) 50 | 51 | This structure contains the register that will be updated, the value it should be updated to, and whether the corresponding instruction is really done and ready to be written. The instructions are in program order. 52 | 53 | ![ReOrder Buffer](https://i.imgur.com/vFBCy96.png) 54 | 55 | There are also pointers to the next issuable instruction, and the next value to be committed (creating a standard FIFO buffer structure via head/tail pointers). 56 | 57 | [🎥 How ROB fits into Tomasulo](https://www.youtube.com/watch?v=0w6lXz71eJ8) | 58 | [🎥 Committing a WR to registers](https://www.youtube.com/watch?v=p6clkAsUV7E) 59 | 60 | Basically, it takes the place of the link between the RAT and RS, and now both point to this intermediary ROB structure. So RS now only is concerned with dispatching instructions, and doesn't have to wait for the result to be broadcast before freeing a spot in the RS (because the RS itself is not a register alias to be resolved) 61 | 62 | ## Hardware Organization with ROB 63 | 64 | Things are much the same as before. We still have an IQ that feeds into one or more RS, and we have a RAT that may have entries pointing to an RF. The primary difference is where before the RAT may have pointed to the RS, it will now point to the ROB instead. On each cycle, the ROB may commit its result to the RF and RS if the associated instruction is considered "done". 65 | 66 | ## Branch Misprediction Recovery 67 | 68 | Describing how ROB works on a branch misprediction: 69 | - [🎥 Part 1](https://www.youtube.com/watch?v=GG09xSZ32MU) | 70 | - [🎥 Part 2](https://www.youtube.com/watch?v=ozc8ceuw-GA) 71 | 72 | The summary is that once the commit pointer reaches a mispredicted branch, we simply empty the ROB (issue pointer = commit pointer). The register file contains the correct registers. The RAT is modified to point to the associated registers. And finally, the next fetch will fetch from the correct PC. Therefore, the ROB makes recovery from a mispredicted branch fairly straightforward. 73 | 74 | ## ROB and Exceptions 75 | 76 | Reminder: two scenarios: 77 | 1. Exception in long-running instruction (or page fault) 78 | - Simply flush the ROB and move to exception handler 79 | - ROB ensures that no wrong registers have been committed and the program is in a good state for the exception handler 80 | 2. Phantom exceptions for mispredicted branch 81 | - The instruction with an exception is tagged as an exception in the ROB 82 | - When we figure out the branch is mispredicted it is handled like any other misprediction and thus the exception does not get handled (as it doesn't reach the commit point for the instruction with the exception) 83 | 84 | ## Commit == Outside view of "Executed" 85 | 86 | [🎥 Video explanation](https://www.youtube.com/watch?v=vpPjDW48v90). 
The main point is that the ROB lets the processor itself execute instructions in any order, even down mispredicted paths, while the programmer's view of the program's execution is that all instructions are committed in order and no wrong branches were ever taken. 87 | 88 | ## RAT Updates on Commit 89 | 90 | [🎥 Video explanation](https://www.youtube.com/watch?v=PYFg7QOfcvI). This example walks through how the ROB, RAT, and REGS work together during the commit phase. At a basic level, as instructions in the ROB reach the commit point, their results are written to the registers. However, the commit only updates the RAT (changing its pointer from a ROB entry back to a register) if the ROB entry just committed is the one that RAT entry was pointing to. For example, if the RAT entry for `R3` was pointing to `ROB2`, then when `ROB2` commits it writes the value to the architectural register and also updates the RAT entry for `R3` to point to the register `R3` itself. In this way, the registers and the RAT are both updated only when the instruction is finally committed. 91 | 92 | ## ROB Example 93 | _These examples are best viewed as videos, so links are below..._ 94 | 95 | 1. [🎥 Cycles 1-2](https://www.youtube.com/watch?v=39AFF5Qq5DI) 96 | 2. [🎥 Cycles 3-4](https://www.youtube.com/watch?v=c3hFm_DOUA0) 97 | 3. [🎥 Cycles 5-6](https://www.youtube.com/watch?v=4nZN_mLcCJo) 98 | 4. _(cycles 7-12 are "fast forwarded")_ 99 | 5. [🎥 Cycles 13-24](https://www.youtube.com/watch?v=bE3IFvoChyw) 100 | 6. [🎥 Cycles 25-43](https://www.youtube.com/watch?v=HmURweRTsU4) 101 | 7. [🎥 Cycles 44-48](https://www.youtube.com/watch?v=V0nywwV0lKU) 102 | 8. [🎥 Timing Example](https://www.youtube.com/watch?v=f9IcEtKTz8k) 103 | 104 | ## Unified Reservation Stations 105 | 106 | With separate RS for separate units (e.g. ADD, MUL), running out of RS entries for one unit often prevents instructions for the other unit from being issued too (because instructions must be issued in order). The structures themselves are functionally the same, so all RS can be unified into one larger array, allowing more total instructions to be issued. However, this requires additional logic in the dispatch unit to target the correct execution unit. Reservation stations are expensive, so it may be better to add that logic rather than have stations go unused. 107 | 108 | ## Superscalar 109 | 110 | Previous examples were limited to one instruction per cycle. For superscalar, we need to consider the following: 111 | * Fetch > 1 inst/cycle 112 | * Decode > 1 inst/cycle 113 | * Issue > 1 inst/cycle (still must be in order) 114 | * Dispatch > 1 inst/cycle 115 | * May require multiple units of each functional type 116 | * Broadcast > 1 result/cycle 117 | * This involves not only having more buses for each result, but every RS has to compare with every bus each cycle 118 | * Commit > 1 inst/cycle 119 | * Must still obey rule of in-order commits 120 | 121 | With all of these, we must consider the "weakest link". If every stage is very wide but one is limited to 3 inst/cycle, that stage becomes the bottleneck for the whole pipeline. 122 | 123 | ## Terminology Confusion 124 | 125 | | Academics | Companies, other papers | 126 | | --- | --- | 127 | | Issue | Issue, Allocate, Dispatch | 128 | | Dispatch | Execute, Issue, Dispatch | 129 | | Commit | Commit, Complete, Retire, Graduate | 130 | 131 | So, it's complicated. 132 | 133 | ## Out of Order 134 | 135 | In an out-of-order processor, not ALL pipeline stages are processing instructions out of order. Some stages must still be in-order to preserve proper dependencies.
136 | 137 | | Stage | Order | 138 | | --- | --- | 139 | | Fetch | In-Order | 140 | | Decode | In-Order | 141 | | Issue | In-Order | 142 | | Execute | Out-of-Order | 143 | | Write/Bcast | Out-of-Order | 144 | | Commit | In-Order | 145 | 146 | 147 | ## Additional Resources 148 | 149 | From TA Nolan, here is an attempt to document how a CPU with ROB works: 150 | 151 | ``` 152 | While there is an instruction to issue 153 | If there is an empty ROB entry and an empty appropriate RS 154 | Put opcode into RS. 155 | Put ROB entry number into RS. 156 | For each operand which is a register 157 | If there is a ROB entry number in the RAT for that register 158 | Put the ROB entry number into the RS as an operand. 159 | else 160 | Put the register value into the RS as an operand. 161 | Put opcode into ROB entry. 162 | Put destination register name into ROB entry. 163 | Put ROB entry number into RAT entry for the destination register. 164 | Take the instruction out of the instruction window. 165 | 166 | For each RS 167 | If RS has instruction with actual values for operands 168 | If an appropriate ALU or processing unit is free 169 | Dispatch the instruction, including operands and the ROB entry number. 170 | Free the RS. 171 | 172 | While the next ROB entry to be retired has a result 173 | Write the result to the register. 174 | If the ROB entry number is in the RAT, remove it. 175 | Free the ROB entry. 176 | 177 | For each ALU 178 | If the instruction is complete 179 | Put the result into the ROB entry corresponding to the destination register. 180 | For each RS waiting for it 181 | Put the result into the RS as an operand. 182 | Free the ALU. 183 | ``` 184 | 185 | 186 | 187 | 188 | *[ALU]: Arithmetic Logic Unit 189 | *[CPI]: Cycles Per Instruction 190 | *[ILP]: Instruction-Level Parallelism 191 | *[IPC]: Instructions per Cycle 192 | *[IQ]: Instruction Queue 193 | *[LB]: Load Buffer 194 | *[LW]: Load Word 195 | *[OOO]: Out Of Order 196 | *[PC]: Program Counter 197 | *[RAT]: Register Allocation Table (Register Alias Table) 198 | *[RAW]: Read-After-Write 199 | *[ROB]: ReOrder Buffer 200 | *[SB]: Store Buffer 201 | *[SW]: Store Word 202 | *[WAR]: Write-After-Read 203 | *[WAW]: Write-After-Write 204 | *[RAR]: Read-After-Read -------------------------------------------------------------------------------- /storage.md: -------------------------------------------------------------------------------- 1 | --- 2 | id: storage 3 | title: Storage 4 | sidebar_label: Storage 5 | --- 6 | 7 | [🔗Lecture on Udacity (30 min)](https://classroom.udacity.com/courses/ud007/lessons/872590121/concepts/last-viewed) 8 | 9 | ## Storage 10 | Reasons we need it: 11 | * Files - Programs, data, settings 12 | * Virtual Memory 13 | 14 | We care about: 15 | * Performance 16 | * Throughput - improving (not as quickly as processor speed) 17 | * Latency - improving (very slowly, more so than DRAM) 18 | * Reliability 19 | 20 | Types we can use are very diverse 21 | * Magnetic Disks 22 | * Optical Disks 23 | * Tape 24 | * Flash 25 | * ... etc. 26 | 27 | ## Magnetic Disks 28 | Multiple platters rotate around a spindle. A head assembly sits beside these platters, and has a head that is used to read both sides of the magnetic platter surface. Where the head is positioned on the platter is called a track - this is the circle of data the head can access while in that position as the platters rotate. A cylinder is the collation of the same track across multiple platters. The head will move position which allows access to other tracks. 
29 | 30 | ![magnetic disks](https://i.imgur.com/CB5bwaK.png) 31 | 32 | The data will be organized into position order along a track, and divided into sectors. Each sector contains a preamble, some data, and some kind of checksum or error correction code. 33 | 34 | Disk capacity then can be represented by the multiplication of: 35 | - number of platters x 2 (surfaces) 36 | - number of tracks/surface (cylinders) 37 | - number of sectors/track 38 | - number of bytes/sector 39 | 40 | ### Access time for Magnetic Disks 41 | (If the disk is spinning already) 42 | - Seek Time - move the head assembly to the correct cylinder 43 | - Rotational Latency - Wait for the start of our sector to get under the head 44 | - Data Read - read until end of sector seen by head 45 | - Controller Time - checksum, determine sector is ok 46 | - I/O Bus Time - get the data into main memory 47 | 48 | ...plus a significant Queuing Delay (wait for previous accesses to complete before we can move the head again) 49 | 50 | ### Trends for Magnetic Disks 51 | - Capacity 52 | - 2x per 1-2 years 53 | - Seek Time 54 | - 5-10ms, very slow improvement (primarily due to shrinking disk diameter) 55 | - Rotation: 56 | - 5,000 RPM \\(\rightarrow\\) 10,000 RPM \\(\rightarrow\\) 15,000 RPM 57 | - Materials and noise improvements 58 | - Improves slowly 59 | - Controller, Bus 60 | - Improves at OK rate 61 | - Currently a small fraction of overall time 62 | 63 | Overall performance is dominated by materials and mechanics, not something like Moore's Law - so it is very difficult to rapidly improve on this technology. 64 | 65 | ## Optical Disks 66 | Similar to hard disk platter, except we use a laser instead of a magnetic head. Additionally, where a hard disk needs enclosing to keep contaminants out, this is not as important on an optical disk. So being open (insert/remove any disk) allows portability, as does standardization of technologies. 67 | 68 | However, improving this technology is not very easy because portability requires standards that must be agreed upon, which takes time. Then products still have to be made to address the new technologies, and will take even more time to be adopted by consumers. 69 | 70 | ## Magnetic Tape 71 | * Backup (secondary storage) 72 | * Large capacity, replaceable 73 | * Sequential access (good for large sequential data, poor for virtual memory) 74 | * Currently dying out 75 | * Low production volume \\(\Rightarrow\\) cost not dropping as rapidly as disks 76 | * Cheaper to use more disks rather than invest in new equipment and the tapes to use them (and USB hard drives) 77 | 78 | ## Using RAM for storage 79 | * Disk about 100x cheaper per GB 80 | * DRAM has about 100,000x better latency 81 | 82 | Solid-State Disk (SSD) 83 | * DRAM + Battery 84 | * Fast! 85 | * Reliable 86 | * Extremely Expensive 87 | * Not good for archiving (must be powered) 88 | * Flash 89 | * Low Power 90 | * Fast (but slower than DRAM) 91 | * Smaller Capacity (GB vs TB) 92 | * Keeps data alive without power 93 | 94 | ## Hybrid Magnetic Flash 95 | Combine magnetic disk with Flash 96 | * Magnetic Disk 97 | * Low $/GB 98 | * Huge Capacity 99 | * Power Hungry 100 | * Slow (mechanical movement) 101 | * Sensitive to impacts while spinning 102 | * Flash 103 | * Fast 104 | * Power Efficient 105 | * No moving parts 106 | 107 | Use both! 
108 | * Use Flash as cache for disk 109 | * Can potentially power down the disk when what we need is in cache 110 | 111 | ## Connecting I/O Devices 112 | ![connecting I/O devices](https://i.imgur.com/Jt1d7Xm.png) 113 | Main point: Things closer to the CPU may need full speed of the bus. Things farther away or naturally slower (e.g. storage, USB) will not utilize as much of the speed, so are more interested in standardization of the bus/controller technology, which grows at slower rates. -------------------------------------------------------------------------------- /synchronization.md: -------------------------------------------------------------------------------- 1 | --- 2 | id: synchronization 3 | title: Synchronization 4 | sidebar_label: Synchronization 5 | --- 6 | 7 | [🔗Lecture on Udacity (1hr)](https://classroom.udacity.com/courses/ud007/lessons/906999159/concepts/last-viewed) 8 | 9 | ## Synchronization Example 10 | In the following example, each thread is counting letters on its own half of the document. Most of the time, this can work out ok as each thread is operating on different parts of the document and things are being accessed in different orders depending on the letters each thread encounters. But what if both threads simultaneously attempt to update the counter for letter 'A'? Both `LW` for the same address for the proper counter, increments that value, and stores it. In the end, the counter only increments once instead of twice. We need a way to separate lines 2-4 into an Atomic Section (Critical Section) that cannot run simultaneously. With such a construct, one thread completes and stores the incremented value, and cache coherence ensures the other thread obtains the newly updated value before incrementing it. 11 | 12 | ![synchronization example](https://i.imgur.com/5viRJgS.png) 13 | 14 | ### Synchronization Example - Lock 15 | The type of synchronization we use for atomic/critical sections is called Mutual Exclusion (Lock). When entering a critical section, we request a lock on some variable, and then unlock it once complete. The lock construct ensures that only one thread can lock it at a time, in no particular order - whichever thread is able to lock it first. The code above may now look like 16 | 17 | ```mipsasm 18 | LW L, 0(R1) 19 | lock CountLock[L] 20 | LW R, Count[L] 21 | ADDI R, R, 1 22 | SW R, Count[L] 23 | unlock CountLock[L] 24 | ``` 25 | 26 | ## Lock Synchronization 27 | Naive implementation: 28 | ```cpp 29 | 1 typedef int mutex_type; 30 | 2 void lock_init(mutex_type &lockvar) { 31 | 3 lockvar = 0; 32 | 4 } 33 | 5 void lock(mutex_type &lockvar) { 34 | 6 while(lockvar == 1); 35 | 7 lockvar = 1; 36 | 8 } 37 | 9 void unlock(mutex_type &lockvar) { 38 | 10 lockvar = 0; 39 | 11 } 40 | ``` 41 | This implementation has an issue in that if both threads attempt to access line 6 (`while`) simulatanously while `lockvar` is 0, both may wind up in the critical section at the same time. So really, lines 6 and 7 need to be in an atomic section of its own! So it seems there is a paradox, in that we need locks to implement locks. 42 | 43 | ## Implementing `lock()` 44 | ```cpp 45 | 1 void lock(mutex_type &lockvar) { 46 | 2 Wait: 47 | 3 lock_magic(); 48 | 4 if(lockvar == 0) { 49 | 5 lockvar = 1; 50 | 6 goto Exit; 51 | 7 } 52 | 8 unlock_magic(); 53 | 9 goto Wait; 54 | 10 Exit: 55 | 11 } 56 | ``` 57 | 58 | But, we know there is no magic. 
59 | Options: 60 | - Lamport's Bakery Algorithm 61 | - Complicated, expensive (makes lock slow) 62 | - Special atomic read/write instructions 63 | - We need an instruction that both reads and writes memory 64 | 65 | ## Atomic Instructions 66 | - Atomic Exchange 67 | - Swaps contents of the two registers 68 | - `EXCH R1, 78(R2)` allows us to implement lock: 69 | ```cpp 70 | R1 = 1; 71 | while(R1 == 1) 72 | EXCH R1, lockvar; 73 | ``` 74 | - If at any point lockvar becomes 0, we swap it with R1 and exit the while loop. If other threads are attempting the lock at the same time, they will get our thread's R1. 75 | - Drawback: writes all the time, even while the lock is busy 76 | - Test-and-Write 77 | - We test a location, and if it meets some sort of condition, we then do the write. 78 | - `TSTSW R1, 78(R2)` works like: 79 | ```cpp 80 | if(Mem[78+R2] == 0) { 81 | Mem[78+R2] = R1; 82 | R1 = 1; 83 | } else { 84 | R1 = 0; 85 | } 86 | ``` 87 | - Implement lock by doing: 88 | ```cpp 89 | do { 90 | R1=1; 91 | TSTSW R1, lockvar; 92 | } while(R1==0); 93 | ``` 94 | - This is good because we only do the write (and thus cause cache invalidations) when the condition is met. So, there is only communication happening when locks are acquired or freed. 95 | - Load Linked/Store Conditional 96 | 97 | ## Load Linked/Store Conditional (LL/SC) 98 | - Atomic Read/Write in same instruction 99 | - Bad for pipelining (Fetch, Decode, ALU, Mem, Write) - add a Mem2, Mem3 stage to handle all reads/writes needed 100 | - Separate into 2 instructions 101 | - Load Linked 102 | - Like a normal `LW` 103 | - Save address in Link register 104 | - Store Conditional 105 | - Check if address is same as in Link register 106 | - Yes? Normal `SW`, return 1 107 | - No? Return 0! 108 | 109 | ### How is LL/SC Atomic? 110 | The key is that if some other thread manages to write to lockvar in-between the `LL`/`SC`, the link register will be 0 and the `SC` will fail, due to cache coherence. A major benefit here is that simple critical sections no longer need locks, as we can `LL`/`SC` directly on the variable. 111 | 112 | ![how is ll/sc atomic](https://i.imgur.com/Q5NjAIk.png) 113 | 114 | ## Locks and Performance 115 | [🎥 View lecture video (3:20)](https://www.youtube.com/watch?v=JS88digI8iQ) 116 | 117 | Atomic Exchange has poor performance given that each core is constantly exchanging blocks, triggering invalidations on other cores. This level of overhead is very power hungry and slows down useful work that could be done. 118 | 119 | ### Test-and-Atomic-Op Lock 120 | Recall original implementation of atomic exchange: 121 | 122 | ```cpp 123 | R1 = 1; 124 | while(R1 = 1) 125 | EXCH R1, lockvar; 126 | ``` 127 | 128 | We can improve this via normal loads and only using `EXCH` if the local read shows the lock is free. Now, the purpose of the exchange is only as a final way to safely obtain the lock. 129 | 130 | ```cpp 131 | R1 = 1; 132 | while(R1 == 1) { 133 | while(lockvar == 1); 134 | EXCH R1, lockvar; 135 | } 136 | ``` 137 | 138 | This implementation allows `lockvar` to be cached locally until it is released by the core inside the critical section. This eliminates nearly all coherence traffic and leaves the bus free for the locked core to handle any coherence requests for `lockvar`. 139 | 140 | ## Barrier Synchronization 141 | A barrier is a form of synchronization that waits for all threads to complete some work before proceeding further. 
An example of this is if the program is attempting to operate on data computed by multiple threads - all threads need to have completed their work before the thread(s) responsible for processing that work can continue. 142 | 143 | - All threads must arrive at the barrier before any can leave. 144 | - Two variables: 145 | - Counter (count arrived threads) 146 | - Flag (set when counter == N) 147 | 148 | ### Simple Barrier Implementation 149 | ```cpp 150 | 1 lock(counterlock); 151 | 2 if(count==0) release=0; // re-initalize release 152 | 3 count++; // count arrivals 153 | 4 unlock(counterlock); 154 | 5 if(count==total) { 155 | 6 count=0; // re-initialize barrier 156 | 7 release=1; // set flag to release all threads 157 | 8 } else { 158 | 9 spin(release==1); // wait until release==1 159 | 10 } 160 | ``` 161 | 162 | This implementation has one major flaw. Upon the first barrier encounter, this works correctly. However, what if the work then continues for awhile and we try to go back to using these barrier variables for synchronization? We expect that we will start with `count == 0` and `release == 0`, but this may not be true... 163 | 164 | ### Simple Barrier Implementation Doesn't Work 165 | The main issue with using this barrier implementation multiple times is that some threads may not pick up the release right away. Consider that the final thread to hit the barrier sets `release=1`, but maybe Thread 0 is off doing something else like handling an interrupt and doesn't see it. But the other threads have already been released and are not doing much work, and one could come back and hit the barrier again, setting `release=0`. Thread 0 is finally done with its work, and sees that `release==0` and continues to wait at the first barrier! Furthermore, the other threads are now waiting on `release==1` on the second instance of the barrier, and since Thread 0 is now locked up this will never happen, resulting in deadlock. 166 | 167 | ### Reusable Barrier 168 | ```cpp 169 | 1 localSense = !localSense; 170 | 2 lock(counterlock); 171 | 3 count++; 172 | 4 if(count==total) { 173 | 5 count=0; 174 | 6 release=localSense; 175 | 7 } 176 | 8 unlock(counterlock); 177 | 9 spin(release==localSense); 178 | ``` 179 | 180 | This implementation works because it functions as a sort of flip-flop on each barrier instance. We are never re-initializing `release`, but only `localSense`. So each time we hit the barrier it simply waits on the release to "flip", at which point it continues. So even if some threads continue work up until the next barrier before another thread is released from the barrier, then they must still wait on that thread to also arrive at that barrier. -------------------------------------------------------------------------------- /virtual-memory.md: -------------------------------------------------------------------------------- 1 | --- 2 | id: virtual-memory 3 | title: Virtual Memory 4 | sidebar_label: Virtual Memory 5 | --- 6 | 7 | [🔗Lecture on Udacity (1 hr)](https://classroom.udacity.com/courses/ud007/lessons/1032798942/concepts/last-viewed) 8 | 9 | ## Why Virtual Memory? 10 | 11 | Virtual memory is a way of reconciling how a programmer views memory and how the hardware actually uses memory. Every program uses the same address space, even though clearly the data in some of those addresses must be different for each program (e.g. code section). 
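To make the "same addresses, different data" point concrete, here is a small sketch (my own example, not from the lecture; it assumes a POSIX system) in which a parent and child process print the same virtual address for a global variable yet see different values, because that one virtual address maps to a different physical frame in each process:

```cpp
#include <cstdio>
#include <sys/wait.h>
#include <unistd.h>

int x = 42;  // one global variable, at some fixed virtual address in each process

int main() {
    if (fork() == 0) {        // child starts with a copy of the parent's address space
        x = 1000;             // changes only the child's copy
        printf("child : &x = %p, x = %d\n", (void*)&x, x);
    } else {
        wait(nullptr);        // let the child print first
        printf("parent: &x = %p, x = %d\n", (void*)&x, x);
    }
    return 0;
}
```

Both lines typically print the same `&x` but different values of `x` - exactly the situation illustrated in the figure below.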
12 | 13 | ![why virtual memory](https://i.imgur.com/CoxI4GQ.png) 14 | 15 | ## Processor's View of Memory 16 | 17 | Physical Memory - *Actual* memory modules in the system. 18 | - Sometimes < 4GB 19 | - Almost never 4GB/process 20 | - Never 16 Exabytes/process 21 | - So... usually less than what programs can access 22 | 23 | Addresses 24 | - 1:1 mapping to bytes/words in physical memory (a given address always goes to the same physical location, and the same physical location always has the same address) 25 | 26 | ## Program's View of Memory 27 | Programs each have their own address space, some of which may point to the same physical location as another program's, or not. Likewise, even different addresses in different programs could be pointing at the same shared physical location. 28 | 29 | ![program's view of memory](https://i.imgur.com/GvJRe0z.png) 30 | 31 | ## Mapping Virtual -> Physical Memory 32 | 33 | Virtual Memory is divided into pages of some size (e.g. 4kB). Likewise, Physical Memory is divided into Frames of that same size. This is similar to Blocks/Lines in caches. The Operating System decides how to map these pages to frames and handles any shared memory using Page Tables, each of which is unique to the program being mapped. 34 | 35 | ![mapping virtual to physical memory](https://i.imgur.com/0yOtU4H.png) 36 | 37 | ## Where is the missing memory? 38 | 39 | What happens when we have more pages in our programs than can be currently loaded into physical memory? Since virtual memory will almost certainly exceed the size of physical memory, this occurs often. The Operating System will "page" memory to disk that has not been accessed recently, to allow the more used pages to remain in physical memory. Similar to caches, it will then obtain those pages back from disk whenever they are requested. 40 | 41 | ## Virtual to Physical Translation 42 | Similar to a cache, the lower N bits of a virtual address (where 2^N = page size) are used as an offset into the page. The upper bits are then the page number (similar to tags in caches), which is used to index into the Page Table. This table returns a physical Frame Number, which is then combined with the Page Offset to get the Physical Address actually used in hardware. 43 | 44 | ## Size of Flat Page Table 45 | - 1 entry per page in the virtual address space 46 | - even for pages the program never uses 47 | - Entry contains Frame Number + bits that tell us if the page is accessible 48 | - Entry Size \\(\approx\\) physical address 49 | - Overall Size: \\(\frac{\text{virtual memory}}{\text{page size}}*\text{size of entry}\\) 50 | - For very large virtual address space, the page table may need to be reorganized to ensure it properly fits within memory 51 | 52 | ## Multi-Level Page Tables 53 | Multi-Level Page Tables are designed where page table space is proportional to how much memory the program is actually using. As most programs use the "early" addresses (code, data, heap) and "late" addresses (stack), there is a large gap in the middle that usually goes unused. 54 | ![multi-level page tables](https://i.imgur.com/gQSVDjS.png) 55 | 56 | ### Multi-Level Page Table Structure 57 | The page number now is split into "inner" and "outer" page numbers. The Inner Page Number is used to index into an Inner Page Table to find the frame number, and the Outer Page Number is used to determine which Inner Page Table to use. 
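As a concrete example of this split (assumed parameters, not necessarily the lecture's): take a 32-bit virtual address with 4KB pages, using 10 bits for the outer page number and 10 bits for the inner page number.

```cpp
#include <cstdint>
#include <cstdio>

int main() {
    // Assumed layout: | 10-bit outer | 10-bit inner | 12-bit page offset |
    uint32_t va     = 0x1234ABCD;            // some virtual address
    uint32_t offset =  va        & 0xFFF;    // low 12 bits: byte within the page
    uint32_t inner  = (va >> 12) & 0x3FF;    // next 10 bits: entry within one inner page table
    uint32_t outer  =  va >> 22;             // top 10 bits: which inner page table to use
    printf("outer=%u inner=%u offset=0x%X\n",
           (unsigned)outer, (unsigned)inner, (unsigned)offset);
    // Conceptual translation: innerTable = outerTable[outer];
    //                         frame      = innerTable[inner];
    //                         physical   = (frame << 12) | offset;
    return 0;
}
```

With 4-byte entries (another assumption), the outer table is \\(2^{10} \times 4\text{B} = 4\text{KB}\\) and each inner table is also 4KB, but inner tables are only allocated for the outer entries a program actually uses - which is where the savings described in the example below come from.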
58 | ![multi-level page table structure](https://i.imgur.com/XJ0lmoO.png) 59 | 60 | ### Two-Level Page Table Example 61 | We save space because we do not create an inner page table if nothing is using the associated index in the outer page table. Only a certain number of outer page numbers will be utilized, and they are likely contiguous, so this results in a lot of savings. 62 | ![two-level page table example](https://i.imgur.com/weeBBdr.png) 63 | 64 | ### Two-Level Page Table Size 65 | To calculate two-level page table size: 66 | 1. Determine how many bits are used for the page offsets (N bits where \\(2^N\\)=page size) 67 | 2. Determine how many bits are used for the inner and outer page tables (X and Y bits where \\(2^X\\)=outer page table entries and \\(2^Y\\)=inner page table entries) 68 | 3. Outer Page Table size is number of outer entries * entry size in bytes 69 | 4. Inner Page Table size is number of inner entries * entry size in bytes 70 | 5. Number of Inner Page Tables Used is determined by partitioning the utilized address space by the number of bits associated with the outer page table. Typically only a few outer entries will be used. 71 | 6. Total page table size = (outer page table size) + (number of inner page tables used)*(inner page table size) 72 | 73 | [🎥 Link to example (4:19)](https://www.youtube.com/watch?v=tP7LYbFrk10) 74 | 75 | ## Choosing the Page Size 76 | 77 | Smaller Pages -> Large Page Table 78 | 79 | Larger Pages -> Smaller Page Table 80 | - But, internal fragmentation due to most of a page not being used by applications (wasted in physical memory) 81 | 82 | Like with block size of caches, we need to compromise between these. Typically a few KB to a few MB is appropriate for page size. 83 | 84 | ## Memory Access Time with V->P Translation 85 | Example: `LOAD R1=4(R2)` (virtual address of value in `R2`+4) 86 | 87 | For Load/Store 88 | - Compute Virtual Address 89 | - Compute page number (take some bits from address) 90 | - When using virtual->physical translation (for each level of page table) 91 | - Compute physical address of page table entry (adding) 92 | - Read page table entry 93 | - Is it fast? Where is the page table? In memory! 94 | - Compute physical address 95 | - Access Cache (and sometimes memory) 96 | 97 | Since page table is in memory, this could also result in cache misses, and so it may take multiple rounds of memory access just to get the data in one address. 98 | 99 | ## Translation Look-Aside Buffer (TLB) 100 | - TLB is a cache for translations 101 | - Cache is big -> TLB is small 102 | - 16KB in cache is only 4 entries (page size of 4KB) 103 | - So, TLB can be very small and fast since it only holds translations 104 | - Cache accessed for each level of Page Table (4-level = 4 accesses) 105 | - TLB only stores the final translation (Frame Number) (1 access) 106 | 107 | What if we have a TLB miss? 108 | - Perform translation using page table(s) 109 | - Put translation in TLB for later use 110 | 111 | ## TLB Organization 112 | - Associativity? Small, fast => Fully or Highly Associative 113 | - Size? Want hits similar to cache => 64..512 entries 114 | - Need more? 
Two-level TLB 115 | - L1: small/fast 116 | - L2: hit time of several cycles but larger 117 | 118 | -------------------------------------------------------------------------------- /vliw.md: -------------------------------------------------------------------------------- 1 | --- 2 | id: vliw 3 | title: VLIW 4 | sidebar_label: VLIW 5 | --- 6 | 7 | [🔗Lecture on Udacity (17 min)](https://classroom.udacity.com/courses/ud007/lessons/961349070/concepts/last-viewed) 8 | 9 | ## Superscalar vs VLIW 10 | 11 | | | OOO Superscalar | In-Order Superscalar | VLIW | 12 | | --- | --- | --- | --- | 13 | | IPC | \\(\leq N\\) | \\(\leq N\\) | 1 large inst/cycle but does work of N "normal" inst | 14 | | How to find independent instructions? | Look at >> N insts | Look at next N insts in program order | Just do next large inst | 15 | | Hardware Cost | $$$ | $$ | $ | 16 | | Help from compiler? | Compiler can help | Needs help | Completely depends on compiler for performance | 17 | 18 | ## The Good and The Bad 19 | 20 | Good: 21 | * Compiler does the hard work 22 | * Plenty of Time 23 | * Simpler HW 24 | * Can be energy efficient 25 | * Works well on loops and "regular" code 26 | 27 | Bad: 28 | * Latencies not always the same 29 | * e.g. Cache Miss 30 | * Irregular Applications 31 | * e.g. Applications with lots of decision making that are hard for compiler to figure out 32 | * Code Bloat 33 | * Can be much larger if there are many no-ops from dependent instructions 34 | 35 | ## VLIW Instructions 36 | 37 | * Instruction set typically has all the "normal" ISA opcodes 38 | * Full predication support (or at least extensive) 39 | * Replies on compiler to expose parallelism via instruction scheduling 40 | * Lots of registers 41 | * A lot of scheduling optimizations require use of additional registers 42 | * Branch hints 43 | * Compiler tells processor what it thinks the branch will do 44 | * VLIW instruction "compaction" 45 | | OP1 | NOP | NOP | NOP | 46 | |---|---|---|---| 47 | 48 | | OP2 | OP3 | NOP | NOP | 49 | |---|---|---|---| 50 | becomes 51 | | OP1 |X| OP2 | | OP3 |X| NOP | | 52 | |---|---|---|---|---|---|---|---| 53 | where the X represents some sort of stop bit. This reduces the number of `NOP` and therefore code bloat. 54 | 55 | ## VLIW Examples 56 | * Itanium 57 | * _Tons_ of ISA Features 58 | * HW very complicated 59 | * Still not great on irregular code 60 | * DSP Processors 61 | * Regular loops, lots of iterations 62 | * Excellent performance 63 | * Very energy-efficient 64 | 65 | The main point is that VLIW can be a very good choice given the right application where compilers can do well. 
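To make the "regular vs. irregular code" distinction concrete, here is a small sketch (my own example, not from the lecture) of the two kinds of code:

```cpp
// VLIW-friendly: a regular loop with independent iterations. A VLIW compiler
// can unroll this and pack loads, multiplies, adds, and stores from different
// iterations into the same wide instruction with few wasted slots.
void saxpy(float a, const float* x, float* y, int n) {
    for (int i = 0; i < n; i++)
        y[i] = a * x[i] + y[i];
}

// VLIW-hostile: pointer chasing. Each load depends on the previous one and may
// miss in the cache, so the compiler cannot statically fill the wide slots and
// many of them end up as NOPs.
struct Node { int val; Node* next; };
int sumList(const Node* p) {
    int sum = 0;
    while (p != nullptr) {
        sum += p->val;
        p = p->next;
    }
    return sum;
}
```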
66 | 67 | 68 | *[ALU]: Arithmetic Logic Unit 69 | *[CPI]: Cycles Per Instruction 70 | *[DSP]: Digital Signal Processing 71 | *[ILP]: Instruction-Level Parallelism 72 | *[IPC]: Instructions per Cycle 73 | *[IQ]: Instruction Queue 74 | *[ISA]: Instruction Set Architecture 75 | *[LB]: Load Buffer 76 | *[LSQ]: Load-Store Queue 77 | *[LW]: Load Word 78 | *[OOO]: Out Of Order 79 | *[PC]: Program Counter 80 | *[RAT]: Register Allocation Table (Register Alias Table) 81 | *[RAR]: Read-After-Read 82 | *[RAW]: Read-After-Write 83 | *[ROB]: ReOrder Buffer 84 | *[SB]: Store Buffer 85 | *[SW]: Store Word 86 | *[VLIW]: Very Long Instruction Word 87 | *[WAR]: Write-After-Read 88 | *[WAW]: Write-After-Write 89 | -------------------------------------------------------------------------------- /welcome.md: -------------------------------------------------------------------------------- 1 | --- 2 | id: welcome 3 | title: High Performance Computer Architecture (HPCA) 4 | sidebar_label: Welcome to HPCA 5 | --- 6 | 7 | ### Howdy Friends 8 | 9 | Here are my notes from when I am taking HPCA in OMSCS during Fall 2019. 10 | 11 | Each document in "Lecture Notes" corresponds to a lesson in [Udacity](https://classroom.udacity.com/courses/ud007). Within each document, the headings roughly correspond to the videos within that lesson. Usually, I omit the lesson intro videos. 12 | 13 | There is a lot of information in these documents. I hope you find it helpful! 14 | 15 | If you have any questions, comments, concerns, or improvements, don't hesitate to reach out to me. You can find me at: 16 | 17 | * [davidharris@gatech.edu](mailto:davidharris@gatech.edu) 18 | * [David Harris \| Linkedin](https://www.linkedin.com/in/davidrossharris/) 19 | * @drharris \(Slack\) 20 | --------------------------------------------------------------------------------