├── .gitignore
├── .vscode
│   └── settings.json
├── README.md
├── lessons
│   ├── 00-overview.md
│   ├── 01-local-opt.md
│   ├── 02-dataflow.md
│   ├── 03-ssa.md
│   ├── 04-loops.md
│   ├── 05-memory.md
│   ├── 06-interprocedural.md
│   └── README.md
├── project.md
├── reading
│   ├── beyond-induction-variables.md
│   ├── braun-ssa.md
│   ├── buildit.md
│   ├── copy-and-patch.md
│   ├── doop.md
│   ├── ir-survey.md
│   ├── lazy-code-motion.md
│   ├── linear-scan-ssa.md
│   └── sparse-conditional-constant-prop.md
└── syllabus.md
/.gitignore:
--------------------------------------------------------------------------------
1 | notes.md
2 |
--------------------------------------------------------------------------------
/.vscode/settings.json:
--------------------------------------------------------------------------------
1 | {
2 | "markdown.validate.enabled": true
3 | }
--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
1 | # CS 265: Compiler Optimization
2 |
3 | Welcome to CS 265, Fall 2024 edition!
4 |
5 | - Instructor: [Max Willsey](https://mwillsey.com)
6 | - Time: Tuesdays and Thursdays, 2:00pm-3:30pm
7 | - Location: Soda 405
8 | - Office hours: After class (TTh 3:30-4:30pm), 725 Soda
9 |
10 | This course uses my fork
11 | of the [bril compiler infrastructure](https://github.com/mwillsey/bril/).
12 |
13 | Some students have opted to make their projects public!
14 | [Check them out here!](./project.md#public-projects)
15 |
16 | ## Other Course Pages
17 |
18 | - [bCourses/Canvas](https://bcourses.berkeley.edu/courses/1538171) (enrollment required)
19 | - [Syllabus](./syllabus.md)
20 | - [Project Information](./project.md)
21 | - [Bril](https://github.com/mwillsey/bril/) (Max's fork)
22 |
23 | ## Schedule
24 |
25 | Schedule is under construction and subject to change.
26 |
27 | If the topic says "_OH, no class_",
28 | that means I will be available during the normal class time for office hours,
29 | in addition to the normal office hours after class.
30 |
31 | | Week | Date | Topic | Due |
32 | |-----:|--------|--------------------------------------------------------------------------------------------------------------|---------------------------------------------------------|
33 | | 1 | Aug 29 | [Overview](lessons/00-overview.md) | |
34 | | 2 | Sep 3 | [Local Optimizations](lessons/01-local-opt.md) | Course Survey ([bCourses][]) |
35 | | | Sep 5 | [Local Optimizations](lessons/01-local-opt.md) | |
36 | | 3 | Sep 10 | [Dataflow](lessons/02-dataflow.md) | |
37 | | | Sep 12 | _no class_ | |
38 | | 4 | Sep 17 | [Dataflow](lessons/02-dataflow.md) | [Task 1](lessons/01-local-opt.md#task) due |
39 | | | Sep 19 | ["Constant Propagation with Conditional Branches"](./reading/sparse-conditional-constant-prop.md) | Reading reflection ([bCourses][]) |
40 | | 5 | Sep 24 | [SSA](lessons/03-ssa.md) | |
41 | | | Sep 26 | ["Simple and Efficient Construction of Static Single Assignment Form"](./reading/braun-ssa.md) | Reading reflection ([bCourses][]) |
42 | | 6 | Oct 1 | [Loops](lessons/04-loops.md) | [Task 2](lessons/02-dataflow.md#task) due |
43 | | | Oct 3 | ["Lazy Code Motion"](./reading/lazy-code-motion.md) | Reading reflection ([bCourses][]) |
44 | | 7 | Oct 8 | [Loops](./lessons/04-loops.md#induction-variables) | |
45 | | | Oct 10 | ["Beyond Induction Variables"](./reading/beyond-induction-variables.md) | Reading reflection ([bCourses][]) |
46 | | 8 | Oct 15 | [Memory](./lessons/05-memory.md) | [Task 3](lessons/04-loops.md#task) due |
47 | | | Oct 17 | ["Strictly Declarative Specification of Sophisticated Points-to Analyses"](./reading/doop.md) | Reading reflection ([bCourses][]) |
48 | | 9 | Oct 22 | [Interprocedural Optimization](./lessons/06-interprocedural.md) | |
49 | | | Oct 24 | ["Understanding and exploiting optimal function inlining"](./reading/optimal-inlining.md) | Reading reflection ([bCourses][]) |
50 | | 10 | Oct 29 | _OH, no class_ | [Task 4](lessons/05-memory.md#task) due |
51 | | | Oct 31 | ["Intermediate Representations in Imperative Compilers: A Survey"](./reading/ir-survey.md) | Reading reflection ([bCourses][]) |
52 | | 11 | Nov 5 | _Election day, no class_ | [Project Proposals](./project.md#project-proposals) |
53 | | | Nov 7 | ["BuildIt: A Type-Based Multi-stage Programming Framework for Code Generation in C++"](./reading/buildit.md) | Reading reflection ([bCourses][]) |
54 | | 12 | Nov 12 | _no class_ | |
55 | | | Nov 14 | ["Copy-and-patch compilation"](./reading/copy-and-patch.md) | Reading reflection ([bCourses][]) |
56 | | 13 | Nov 19 | _no class, extra OH_ | |
57 | | | Nov 21 | ["Linear Scan Register Allocation on SSA Form"](./reading/linear-scan-ssa.md) | Reading reflection ([bCourses][]) |
58 | | 14 | Nov 26 | _no class, no OH_ | [Project Check-ins due](./project.md#project-check-ins) |
59 | | | Nov 28 | _no class, no OH_ | |
60 | | 15 | Dec 3 | _no class_ | |
61 | | | Dec 5 | " | |
62 | | 16 | Dec 10 | _RRR, no class_ | |
63 | | | Dec 12 | " | |
64 | | 17 | Dec 17 | _Finals Week_ | [Project Reports](./project.md#project-report) |
65 |
66 | ## Resources
67 |
68 | These notes are by no means meant to be a comprehensive resource for the course.
69 | Here are some other resources
70 | (all freely available online)
71 | that you may find helpful.
72 | I have looked at many of these resources in preparing this class,
73 | and I recommend them for further reading.
74 |
75 | - Other courses
76 | - CMU's
77 | [15-411](https://www.cs.cmu.edu/~fp/courses/15411-f14/schedule.html) by Frank Pfenning (notes);
78 | [15-745](http://www.cs.cmu.edu/afs/cs/academic/class/15745-s19/www/syllabus.html) by Phil Gibbons (slides)
79 | - Cornell's [CS 6120](https://www.cs.cornell.edu/courses/cs6120/)
80 | by Adrian Sampson (videos)
81 | - Stanford's [CS 243](https://suif.stanford.edu/~courses/cs243/)
82 | by Monica Lam (slides)
83 | - The book _[Static Program Analysis](https://cs.au.dk/~amoeller/spa/)_ by Anders Møller and Michael I. Schwartzbach
84 | (available online for free).
85 | - The survey paper "[Compiler transformations for high-performance computing](https://dl.acm.org/doi/10.1145/197405.197406)" (1994)
86 | for a comprehensive look at many parts of an optimizing compiler, especially loop transformations.
87 |
88 | [bCourses]: https://bcourses.berkeley.edu/courses/1538171
--------------------------------------------------------------------------------
/lessons/00-overview.md:
--------------------------------------------------------------------------------
1 | # Overview
2 |
3 | Welcome to CS 265!
4 |
5 | See the [syllabus](../syllabus.md) for logistics, policies, and course schedule.
6 |
7 | ## Course Goals
8 |
9 | This is an advanced compilers class targeted at students who have experience building some level of compilers or interpreters.
10 | We will focus on the "mid-end" of the compiler, which operates over an intermediate representation (IR) of the program.
The lines between the front-, mid-, and back-end aren't always sharp.
12 | A compiler may have more than one IR (sometimes many, see the [Nanopass Compiler Framework](https://nanopass.org/)).
13 | For the purpose of this class, let's define the "ends" of a compiler as follows.
14 |
15 | - Front-end
16 | - Responsible for ingesting surface syntax, processing it, and translating it into an intermediate representation suitable for analysis and optimization
17 | - Examples: lexing, parsing, type checking, macro/template processing, elaborating language features into a reduced set of features.
18 | - Mid-end
19 | - Many modern compilers include a "mid-end" portion of the compiler where analysis and optimization is focused.
  - The goal of the mid-end is to present a reduced surface area that makes analysis and optimization easier.
21 | - Mid-ends are typically agnostic to details about the target machine.
22 | - Back-end
23 | - The back-end is responsible for lowering the compiler's intermediate representation into something the target machine can understand.
24 | - This layer is typically per target machine, so it will refer to specifics about the target machine architecture.
25 | - Examples: instruction selection, legalization, register allocation, instruction scheduling.
26 |
27 | We will focus on the mid-end,
28 | and we'll touch on some aspects of the backend.
29 | Compiler front-ends are not really going to be covered in this course.
30 | You are free to incorporate any part of the compiler in your course [project](../project.md).
31 |
32 | ### Quick note: what's an optimization?
33 |
34 | As the focus of this class is implementing compiler optimizations,
35 | it's worth defining what an optimization is.
36 | In most other contexts, optimization is a process of finding the best solution to a problem.
37 | In compilers,
38 | an optimization is more like an embetterment;
39 | it makes things "better" in some way
40 | but we often don't make any guarantees about how much better it is (or if it's better at all!)
41 | or how far from the best solution we are.
42 | In other words,
43 | compiler optimizations are typically thought of as "best effort".
44 | (Though there is the topic of superoptimization, which is a search for the best solution.)
45 |
46 | Okay, so what's better?
47 | For our purposes in this class, we'll define "better" as "faster".
48 | And we'll define "faster" as "fewer instructions executed".
49 | But of course, this definition of "faster" misses
50 | many details that will affect the actual runtime of a program
51 | on real hardware
52 | (e.g., cache locality, branch prediction, etc.).
53 | We will touch on those topics,
54 | but they are not the focus of this class
55 | as we are running our code on a simple virtual machine.
Faster may not be the only thing you care about, either.
57 | You may care about code size,
58 | or memory usage,
59 | or power consumption,
60 | or any number of other things.
61 | We will touch on some of these topics as well,
62 | but the course infrastructure is designed
63 | to focus on the "fewer instructions executed" metric,
64 | as well as code size.
65 |
Also important: optimizations had better preserve... something!
67 | But what?
68 | The virtual machine in this class runs your program
69 | and not only returns the result of the program,
70 | but also returns some statistics about the program and its execution.
It's fair to say those are part of the interpreter's semantics.
72 | Do you want to preserve the result of the program?
73 | So what's worth preserving?
74 | Typically this discussion is framed in terms of "observable behavior".
75 | This is useful,
76 | since it captures the idea of preserving important side effects of the program.
77 | For example,
78 | if you are optimizing an x86 program,
79 | you probably want to preserve not only its result but its effect on the status registers,
80 | memory, and other parts of the machine state that you consider "observable".
81 | But you might not care about the cache state,
82 | branch prediction state (or maybe you do care about this in a security context!),
83 | etc.
84 |
85 | Even the notion of preservation is not sufficient in many contexts.
86 | Instead of a symmetric notion of equivalence,
we may need a _directional_ notion of equivalence.
88 | Consider the following example:
89 | how would you relate `x / x` to `1`?
90 | Are they equivalent? No!
91 | Well, what are they then?
92 | It goes back to your notion of "observable behavior".
93 | If you consider division by zero to be an observable behavior,
then `x / x` is not equivalent to `1`,
95 | unless you can prove that `x` is not zero via some analysis.
96 | If you consider division by zero to be undefined behavior (like LLVM),
97 | then `x / x` is _still_ not equivalent to `1`.
98 | Why not?
99 | Well, you certainly would never want to replace
100 | the latter with the former,
101 | since that's clearly making the program "less defined" in some sense.
102 | In these cases,
103 | we could say that `1` _refines_ `x / x`.
104 | Refinement is a transitive, reflexive, but _not_ symmetric relation.
105 | See this [blog post on Alive2](https://blog.regehr.org/archives/1722)
106 | for more info.
107 |
108 | For this class,
109 | we will mostly punt on these (important!) issues
to help you get started as soon as possible.
111 | We will mostly focus on executing fewer instructions,
112 | and we will look at code size as well.
113 | In terms of preserving behavior,
114 | we will mostly care about preserving the result of the program.
115 | Our programs "return" their result via printing,
so we will have to care about effects like the order of print statements.
117 | Along the way,
118 | we will have to care about preserving effects like the state of memory.
119 | On other effects like crashing,
120 | you may decide to preserve them or not.
121 |
122 |
123 | ## Topics Overview
124 |
125 | Here are some of the planned topics for the course.
126 |
127 | ### Representing Programs
128 |
129 | #### ASTs
130 |
131 | There are many ways to represent a program!
132 | You are probably familiar with the abstract syntax tree (AST),
as it is a common representation used by many tools that interact with programs.
134 |
135 | There are many design decisions that go into choosing an AST representation.
These might affect memory layout, the ease of adding annotations to the tree,
or the ability to reconstruct the concrete syntax from the tree.
138 |
139 | Designing and working with ASTs is common and important!
140 | But it is not the focus of this class.
141 | I'll do that part for you,
142 | so you can focus on working with the IR.
143 |
144 | #### IRs
145 |
146 | There's no formal definition of what an intermediate representation (IR) is;
147 | it's whatever is after the front-end and before the back-end.
148 | Typically, IRs follow a few principles:
149 | - They typically make explicit things like typing, sizing of values, and other
150 | details that might be implicit in the source language.
151 | - They may reduce the number of language constructs to a smaller set.
152 | - For example, the front-end may support `if`s, `while`s, and `for`s, but the IR may only support `goto`s.
153 | - They may also normalize the program into some form that is easier to analyze.
154 | - We will study Static Single Assignment (SSA) form later in class.
155 |
156 | Very roughly,
157 | IRs could be classified as either **linear** or **graph-based**.
158 |
159 | A linear IR is one where the program is represented as a sequence of instructions.
160 | These instructions
161 | use and define values,
162 | and they mostly "run" one after the other,
163 | with some control flow instructions to jump around.
164 | These IRs include some notion of a virtual machine that they run on:
165 | the Python and WebAssembly virtual machines are stack-based,
while the machine models for other IRs, including LLVM and Cranelift, are register-based.
167 | Some, like the JVM, are a mix of both.
168 | The virtual machine gives the IR an operational semantics.
169 |
170 | A graph-based IR is one where the program is represented as a graph, of course.
171 | The nodes in the graph represent values (or instructions),
172 | and the edges represent how values flow from one to another.
173 | A simple dataflow graph is of course a graph-based IR,
174 | but it is limited to only representing computation of values.
175 | More complex graph-based IRs can represent control flow as well,
176 | sometimes by including a machine model.
177 | Sometimes they do not require a machine model and can be denotationally defined.
A canonical graph-based IR
is the [Program Dependence Graph](https://dl.acm.org/doi/10.1145/24039.24041),
which inspired many other IRs,
including
[Sea of Nodes](https://www.oracle.com/technetwork/java/javase/tech/c2-ir95-150110.pdf),
[Program Expression Graphs](https://dl.acm.org/doi/10.1145/1480881.1480915),
and
[RVSDGs](https://dl.acm.org/doi/abs/10.1145/3391902).
186 |
187 |
188 | Many IRs will mix these paradigms by grouping sets of instructions into blocks,
which are then organized into a _control flow graph_.
In this model,
the point is to group instructions into a block that can be reasoned about
192 | in a restricted way.
193 | The blocks are then organized into the outer graph that captures how the blocks are connected.
194 |
195 | In this class, we will focus on a "typical" IR that groups instructions into _basic blocks_.
196 | The SSA form interacts with blocks in a particular way,
197 | so we will spend some time understanding how to convert a program into SSA form.
198 | If you are familiar with SSA form,
199 | you may know the $\phi$-function, which is a way to merge values from different control flow paths.
200 | Some IRs instead use _block arguments_ as a way to pass values between blocks.
201 | This approach is taken in some modern compilers
202 | like [Cranelift](https://github.com/bytecodealliance/wasmtime/blob/main/cranelift/docs/ir.md),
203 | [MLIR](https://mlir.llvm.org/docs/LangRef/#blocks),
204 | and [Swift](https://github.com/swiftlang/swift/blob/main/docs/SIL.rst#basic-blocks).
205 | If you aren't familiar with SSA form,
206 | $\phi$-functions, or block arguments,
207 | don't worry!
208 | We will cover them in coming lessons.
209 |
210 | ### Local Optimizations
211 |
212 | We'll start by looking at local optimizations.
213 | These are local in the sense that they operate on a single basic block at a time,
214 | i.e., they don't reason about control flow.
215 |
216 | These include optimizations like constant folding, copy propagation, and dead code elimination.
217 | We'll learn about value numbering and how it can subsume these optimizations.
218 |
219 | ### Transformation
220 |
221 | As part of local optimizations,
222 | you'll probably add some _peephole optimizations_,
223 | a classic family of optimizations that look at a small set of instructions
224 | and replace them with something "better".
225 |
226 | Examples of peephole optimizations include:
227 | ```
228 | x * 2 -> x << 1
229 | x + 0 -> x
230 | ```
231 |
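As a tiny illustration, here is what such rules might look like as a rewrite over a hypothetical tuple-based expression form (this is only a sketch, not the course IR's actual API):

```py
def peephole(expr):
    match expr:
        case ("mul", x, 2):
            return ("shl", x, 1)  # x * 2 -> x << 1
        case ("add", x, 0):
            return x              # x + 0 -> x
        case _:
            return expr
```
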
This is a big part of many compilers,
233 | so we will also discuss various frameworks to implement and reason about
234 | these optimizations.
235 | You may choose to implement your optimization in such a framework,
236 | or you may choose to implement transformations directly.
237 |
238 | ### Static Analysis
239 |
240 | We will quickly learn that many optimizations depend on knowing something about the program.
241 | Much of the infrastructure in modern compilers is dedicated to proving facts about the program
242 | rather than transforming it.
243 |
244 | We will learn about dataflow analysis, pointer analysis, and other static analysis techniques
245 | that can be used to prove facts about the program.
246 | Like transformations,
247 | these techniques can also be encompassed by some frameworks.
248 | We will read about some of these frameworks,
249 | and you may choose to use them in your work or implement the analyses directly.
250 |
251 | ### Loop Optimizations
252 |
253 | Many optimizations in compilers focus on loops,
254 | as they execute many times and are a common source of performance bottlenecks.
255 | We will study some loop optimizations,
256 | but some will not be implementable in the course framework.
257 | Many loop optimizations are focused on improving cache locality,
258 | such as loop interchange and loop tiling.
259 | Our virtual machine does not model the cache,
260 | so these optimizations will not be measurable in the course framework.
261 |
262 | ### Other Stuff
263 |
264 | Depending on time and interest,
265 | we may cover some other topics.
266 | I would like to do something in the realm of automatic parallelization,
267 | either automatic vectorization or via a GPU-like SIMT model.
268 |
269 |
270 |
271 |
272 |
273 |
274 |
275 |
276 |
--------------------------------------------------------------------------------
/lessons/01-local-opt.md:
--------------------------------------------------------------------------------
1 | # Local Optimization
2 |
3 | Further reading:
4 | - Rice's [COMP512](https://www.clear.rice.edu/comp512/Lectures/02Overview1.pdf)
5 | - Cornell's [CS 6120](https://www.cs.cornell.edu/courses/cs6120/2023fa/lesson/3/)
6 | - CMU's [15-745](https://www.cs.cmu.edu/afs/cs/academic/class/15745-s19/www/lectures/L3-Local-Opts.pdf)
7 | - ["Value Numbering"](https://www.cs.tufts.edu/~nr/cs257/archive/keith-cooper/value-numbering.pdf), Briggs, Cooper, and Simpson
8 |
This assignment has an associated [task](#task) you should begin after reading this lesson.
10 |
11 | ## Dead code elimination
12 |
13 | Let's start with (trivial, global) dead code elimination as a motivating example.
14 |
15 | "Dead code" is code that doesn't affect the "output" of a program.
16 | Recall our discussion from [lesson 0](00-overview.md)
about the "observability" of the "output" of a program
18 | (sorry about all the scare quotes).
19 | In fact,
20 | we'll see today that the output of a (part of a) program
21 | is a tricky concept to pin down,
22 | and we'll constrain ourselves to local optimizations
23 | for now.
24 |
25 | Let's start with a simple example:
26 |
27 | ```c
28 | int foo(int a) {
29 | int x = 1;
30 | int y = 2;
31 | x = x + a;
32 | return x;
33 | }
34 | ```
35 |
36 | Typically,
we won't be working with the surface syntax of a program;
38 | we'll be working with a lower-level intermediate representation (IR).
Most IRs (including the one we'll use)
40 | are oriented around a sequence of instructions.
41 |
42 | Here's the same code in instruction form:
43 | ```
44 | x = 1
45 | y = 2
46 | x = x + a
47 | return x
48 | ```
49 |
50 | In this example, the assignment to variable `y` is dead code.
51 | It doesn't affect the output of the program,
which is easy to pin down here: this is a straight-line function
that outputs a single value `x`.
54 |
55 | Let's make things a bit more interesting:
56 |
57 | ```c
58 | int foo(int a) {
59 | int x = 1;
60 | int y = 2;
61 | if (x > 0) {
62 | y = x;
63 | }
64 | x = x + a;
65 | return x;
66 | }
67 | ```
68 |
69 | And in instruction form:
70 | ```
71 | x = 1
72 | y = 2
73 | c = x > 0
74 | branch c L1 L2
75 | L1:
76 | y = x
77 | L2:
78 | x = x + a
79 | return x
80 | ```
81 |
82 | Can you think of a way to eliminate the dead code at label `L1`?
83 |
84 | Let's start by defining what we mean by "dead code"
85 | a bit more precisely so as to avoid the observability issue.
86 | For now, an instruction is dead if both
87 | - it is pure (has no side effects other than its assignment)
88 | - its result is never used
89 |
90 | For today,
91 | we'll only look at instructions that are pure;
92 | so no calls, no stores, etc.
93 | Handling those will pose a challenge to the algorithms
94 | we'll see today and we'll defer them to later.
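
To make "pure" concrete, here is one rough sketch; the exact set of side-effecting operations depends on the IR, so treat this list as an assumption:

```py
SIDE_EFFECTING = {"call", "store", "free", "print", "ret", "jmp", "br"}

def is_pure(instr):
    # pure = no side effects beyond writing its own destination
    return instr.op not in SIDE_EFFECTING
```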
95 |
96 | It remains to show that a variable is never used.
A conservative approach (overapproximation is a theme we'll see a lot)
is to say that a variable is unused
only if no instruction anywhere in the function uses it.
100 |
101 | Here's a simple algorithm to eliminate dead code
102 | based on this definition:
```py
used = set()
for instr in func:
    used.update(instr.args)

# keep an instruction only if it has side effects or its result is used
func[:] = [instr for instr in func
           if not is_pure(instr) or instr.dest in used]
```
112 |
113 | Great! We've eliminated the dead code in the above example.
114 |
115 | Can you think of some examples where this algorithm would fail
116 | to eliminate all the dead code?
117 |
118 | Here are some different ways our code might be incomplete:
119 | 1. There's more dead code! This can be resolved by iterating until convergence.
120 | ```
121 | a = 1
122 | b = 2
123 | c = a + a
124 | d = b + b
125 | return c
126 | ```
2. There's a "dead store" (a variable is assigned a value that is overwritten before it is ever used).
128 | ```
129 | a = 1
130 | a = 2
131 | return a
132 | ```
133 | 3. A variable is used _somewhere_, but in a part of the code that won't actually run.
134 | ```c
135 | int a = 1;
136 | int b = ...;
137 | if (a == 0) {
138 | use(b);
139 | }
140 | ```
141 |
We can handle the first case by simply iterating our algorithm; see the sketch below.
143 | The third case is a bit more challenging, and we'll see how to handle it later in the course.
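
Here is a minimal sketch of that iteration, treating `func` as a list of instructions and reusing `is_pure` and the instruction fields from above:

```py
def trivial_dce_pass(func) -> bool:
    # one round of the trivial pass; returns True if it removed anything
    used = set()
    for instr in func:
        used.update(instr.args)
    before = len(func)
    func[:] = [instr for instr in func
               if not is_pure(instr) or instr.dest in used]
    return len(func) != before

def trivial_dce(func):
    # removing one dead instruction can make another one dead,
    # so repeat until nothing changes
    while trivial_dce_pass(func):
        pass
```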
144 |
145 | How about the second case?
146 | The third case hints that reasoning about control flow is important.
So perhaps we can make our lives easier by not reasoning about control flow at all!
148 | It's clear that the example in 2 is "easy" in some sense
149 | because we are looking at straight-line code.
150 | So let's try to make our lives easier by looking _only_ at straight-line code,
151 | even in the context of a larger program.
152 |
153 | ## Control flow graphs
154 |
155 | We'd like to reason about straight-line code whenever possible.
156 | But even simple programs have instructions that break up this straight-line flow.
157 |
158 | A control flow graph (CFG) is a way to represent the flow of control in a program.
159 | It's a directed graph where the nodes
160 | are "atomic" blocks of code
161 | the edges represent control flow between them.
162 |
163 | One way to construct a CFG is to place every instruction in its own block.
164 | Blocks are connected by edges
165 | to the instructions that might follow them.
166 | A simple assignment instruction
167 | will have one _successor_,
168 | one out-going edge to the next instruction.
169 | A jump will also have one successor, to the target of the jump.
170 | A branch will have two successors, one for each branch.
171 |
172 | Blocks in a control flow graph are defined by a
173 | "single-entry, single-exit" property.
174 | That is,
175 | there is only one point at which to enter the block (the top)
176 | and one point at which to exit the block (the bottom).
177 | In other words,
178 | if any part of a block is executed,
179 | the whole block is executed.
180 |
181 | When we talk about basic blocks,
182 | in a CFG,
183 | we typically mean _maximal_ basic blocks.
184 | These are blocks that are as large as possible
185 | while still maintaining the single-entry, single-exit property.
186 | In our instruction-oriented IR,
187 | this means that a basic block is a sequence of instructions
188 | that may end with a _terminator_ instruction: a jump, branch, or return instruction (the exit point).
189 | Terminators may _not_ appear in the middle of a basic block.
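
As a rough sketch (using bril's JSON-style encoding, where labels are objects with a `"label"` key and instructions have an `"op"` key), forming basic blocks looks something like this:

```py
TERMINATORS = {"jmp", "br", "ret"}

def form_blocks(instrs):
    blocks, cur = [], []
    for instr in instrs:
        if "label" in instr:
            # a label starts a fresh block, since it may be a jump target
            if cur:
                blocks.append(cur)
            cur = [instr]
        else:
            cur.append(instr)
            if instr.get("op") in TERMINATORS:
                # a terminator ends the current block
                blocks.append(cur)
                cur = []
    if cur:
        blocks.append(cur)
    return blocks
```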
190 |
191 | Here's our example from before again:
192 | ```
193 | x = 1
194 | y = 2
195 | c = x > 0
196 | branch c L1 L2
197 | L1:
198 | y = x
199 | L2:
200 | x = x + a
201 | return x
202 | ```
203 |
204 | In this example, the basic blocks are the entry block, `L1`, and `L2`.
205 | Why can't `L1` be combined with `L2`?
After all, `L1` only has one successor, so can't it be folded into `L2`?
207 | No, that would violate the single-entry property on the resulting block.
208 | We want to guarantee that all of the instructions in a block are executed together or not at all.
209 |
210 | ## Local Dead Code Elimination
211 |
212 | Now that we are equipped with control flow graphs
213 | with maximal basic blocks,
we can focus on a single block as our unit of analysis.
215 |
216 | Recall our problem case for dead code elimination:
217 | ```
218 | a = 1
219 | a = 2
220 | return a
221 | ```
222 |
223 | Our global algorithm failed to eliminate the dead code (the first assignment to `a`) in this case.

Here's a local version that walks a single block in order, tracking definitions that have not yet been used:

```py
unused = {}  # maps a variable to its most recent, not-yet-used defining instruction
for inst in block:
    # process uses first: anything used is no longer unused
    for use in inst.uses:
        unused.pop(use, None)
    if inst.dest is not None:
        # this inst defines dest; if the previous definition of dest
        # was never used, that definition is dead and can be removed
        if inst.dest in unused:
            block.remove(unused[inst.dest])
        # mark this new definition as unused for now
        unused[inst.dest] = inst
```
240 |
241 | Be careful to process uses before defs (in a forward analysis like this one).
Otherwise, instructions like `a = a + 1` will cause problems.
243 |
244 | Note that we still have to iterate until convergence.
245 |
246 | Is this as good as our global algorithm above?
247 | Certainly it's better in some use cases (like the "dead store" example).
248 |
No! Consider this case with a clearly dead instruction, but one that isn't "clobbered" by a later instruction:
250 |
251 | ```c
252 | int foo(int a) {
253 | if (a > 0) {
254 | int x = 1;
255 | }
  return 0;
257 | }
258 | ```
259 |
260 | So you need both for now.
261 |
262 | ## Local Value Numbering
263 |
264 | Value numbering is a very general optimization
that can be extended to implement a number of other optimizations, including:
266 | - Copy propagation
267 | - Common subexpression elimination
268 | - Constant propagation
269 | - Constant folding
270 | - Dead code elimination
271 | - This part is a little tricky, we'll see why later.
272 |
273 | How does LVN accomplish all of this?
These optimizations can be seen as reasoning about the "final value"
275 | of the block of code,
276 | and figuring out a way to compute that value with fewer instructions.
277 | The problem with reasoning about the "final value" is that
278 | the instruction-based IR doesn't make it easy to see what the "final value" is.
279 |
- Problem 1: variable names obscure the values being used
281 | - graphs can help with this, we will see this later in the course.
282 | - "clobbered" variables will make things tricky
283 | - we are stuck here for now, but we will look at SSA form later in the course
284 | - LVN's approach, "run" the program and see what values are used!
285 | - build a sort of a symbolic state that keeps track of the values of variables
286 | - Problem 2: we don't know what variables will be used later in the program
287 | - later lecture for non-local value numbering
288 |
289 | Here is some pseudocode for a local value numbering pass.
290 | Note that this is avoiding important edge cases related to clobbered variables!
291 | Think about how you'd handle those cases.
292 |
```py
# sometimes all of this is represented as one big table;
# here the 4 maps force you to think about the uniqueness requirements of the keys
val2num: dict[tuple, int] = {}
num2val: dict[int, tuple] = {}
var2num: dict[str, int] = {}
num2var: dict[int, str] = {}

def add(value) -> int:
    # assign a fresh value number to a value we haven't seen before
    num = len(num2val)
    val2num[value] = num
    num2val[num] = value
    return num

for inst in block:
    # canonicalize the instruction's arguments by getting the
    # value numbers currently held by those vars
    args = tuple(var2num[arg] for arg in inst.args)

    # create a new value by packaging this instruction's op with the value
    # numbers of its arguments (a tuple, so it can be used as a dict key;
    # note that for `const` instructions the literal should be part of the value too)
    value = (inst.op,) + args

    # look up the value number of this value
    num = val2num.get(value)

    if num is None:
        # we've never seen this value before
        num = add(value)
    else:
        # we've seen this value before:
        # replace this instruction with an id (copy) of the variable holding it
        inst.op = "id"
        inst.args = [num2var[num]]

    if inst.dest in var2num:
        # be careful here: dest is already in var2num, i.e., it's clobbered.
        # one option is to introduce temp variables;
        # another, more conservative option is to remove the overwritten value
        # from all of the maps and point dest at the new value
        ...
    else:
        var2num[inst.dest] = num
        num2var[num] = inst.dest
```
334 |
335 | In this approach,
we are reconstructing each instruction based on the state at that time,
not waiting until the end and reconstructing a whole program fragment.
One consequence of this is that we can't do DCE (we don't know what vars are used later);
in fact, we will introduce a bunch of temp vars that might be dead.
340 | So this instance of LVN should be followed by a DCE pass.
341 |
342 | Example:
343 | ```
344 | a = 1
345 | b = 2
346 | c = 3
347 | sum1 = a + b
348 | sum2 = a + b
349 | prod = sum1 * sum2
350 | return prod
351 | ```
352 |
353 | ### Extensions of LVN
354 |
355 | #### Stronger lookup
356 |
357 | The important part of value numbering is when you "lookup" to see if you've seen a value before.
358 | A typical approach is to use a hash table to see if you've done the same computation before.
359 | But determining if two computations are the same can be done in a much less conservative way!
360 |
361 | You could add things like commutativity, or even other properties of the operations (say `mul x 2` is the same as `add x x`).
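
For example, one cheap trick is to canonicalize commutative operations before the lookup, so that `a + b` and `b + a` map to the same value (the op names here are bril-style; `make_value` is a hypothetical helper):

```py
COMMUTATIVE = {"add", "mul", "eq", "and", "or"}

def make_value(op, arg_nums):
    if op in COMMUTATIVE:
        # sort the argument value numbers so argument order doesn't matter
        arg_nums = sorted(arg_nums)
    return (op,) + tuple(arg_nums)
```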
362 |
363 | #### Seeing through computations
364 |
365 | Still, we aren't interpreting the instructions in a super useful way.
366 | For many computations, we can actually evaluate the result!
- `id` can be "evaluated" regardless of whether its argument is a constant; it just takes on its argument's value
368 | - arithmetic operations can be evaluated if their arguments are constants
369 | - some arith ops can be evaluated if _some_ of their arguments are constants
370 | - e.g., `add x 0`, `mul x 1`, `mul x 0`, `div x 1`, `sub x 0`
371 | - what other ops can be evaluated in certain cases?
372 | - `eq x x`?
373 |
374 | #### Clobbered variables
375 |
Our pass above hints at a conservative approach to clobbered variables,
in which an overwritten variable's old value is simply removed from the state, so it can't be reused later!
378 | Here's a simple example:
379 | ```
380 | x = a + b
381 | x = 0
382 | y = a + b
383 | return y
384 | ```
385 |
386 | You can take another approach that introduces temporary variables to handle clobbered variables.
387 | Give it a try if you feel up to it!
We will see later that SSA (static single assignment) form
389 | is a way to ensure that we can deal with this issue once and for all.
390 |
391 | #### Dead code elimination
392 |
393 | Currently our LVN pass introduces a bunch of dead code and relies on a DCE pass to clean it up.
394 | One way to view value numbering is that it's building up a graph of computations
395 | done in a block,
and re-emitting instructions corresponding to that graph as it goes.
Could we wait until the end of a block and try to emit the whole block at once,
including only the necessary instructions?
399 | Not without knowing what variables are used later in the program...
400 |
401 | #### Extended Basic Blocks
402 |
403 | What if a basic block `A` goes straight into `B` (and `B` has no other predecessors)?
404 | Surely we can still do _something_ that looks like value numbering?
405 | In fact, many local analyses can be extended beyond just basic blocks by looking at _extended basic blocks_.
406 |
An extended basic block is a set of basic blocks
with a sort of single-entry, _multiple-exit_ property.
409 | In other words,
410 | it's a set of basic blocks
411 | with a distinguished entry point such that:
412 | - the entry basic block may have multiple predecessors
413 | - but all others must have only one predecessor, which must be in the set
414 |
415 |
416 | Here's an example EBB:
417 | ```mermaid
418 | flowchart TD
419 | A --> B
420 | B --> C
421 | B --> D
422 | A --> E
423 | ```
424 |
425 | It is essentially a tree rooted at the entry block.
426 | Any block in an EBB _dominates_ all its children.
427 | We'll return to a more formal definition of dominance later in the course,
428 | but essentially it means that any path to a node in the EBB must go through all its ancestors.
429 | This allows us to essentially reason about each path in the EBB as a single unit.
430 | In the context of value numbering,
431 | we could pass the state of A's value numbering to B and then to D.
432 | But we couldn't then pass it to C, because D doesn't run before C!
433 | So we could optimize this EBB by looking at the path from the root to each node:
434 | - Optimize A
435 | - pass the state into lvn(B)
436 | - pass the state into lvn(C)
437 | - throw away the effects of C (or re-run lvn(A, B)), then do lvn(D)
438 | - throw away the effects of B (or re-run lvn(A)), then do lvn(E)
439 |
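One way to picture this as code: a rough sketch that drives LVN over an EBB's tree of blocks depth-first, copying the value-numbering state at each branch so siblings never see each other's effects. Here `lvn(block, state)` is assumed to run LVN on one block and return the updated state, and `ebb_children` maps each block to its children in the EBB tree:

```py
import copy

def lvn_ebb(block, state, ebb_children):
    state = lvn(block, state)
    for child in ebb_children[block]:
        # each child starts from its own copy of the parent's state
        lvn_ebb(child, copy.deepcopy(state), ebb_children)
```
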
This technique can be employed to eke out a little more from your local optimizations,
and hints a bit at the more general, global versions of these techniques that we'll see later in the course.
442 |
443 | # Task
444 |
445 | This course uses my fork
446 | of the [bril compiler infrastructure](https://github.com/mwillsey/bril/).
447 |
448 | Your task from this lesson is to get familiar with the bril ecosystem and implement the
449 | 3 optimizations we discussed:
1. Trivial global dead code elimination
451 | 2. Local dead code elimination
452 | - See [here](https://github.com/mwillsey/bril/tree/main/examples/test/tdce) for some test cases relevant to dead code optimization.
453 | 3. Local value numbering
454 | - You must implement LVN that performs common subexpression elimination.
455 | - You may (but don't have to) implement some of the extensions discussed above (folding, commutativity, etc.).
456 | - You may choose how to handle "clobbered" variables. Just be sure to document your choice.
457 | - See [here](https://github.com/mwillsey/bril/tree/main/examples/test/lvn) for some test cases relevant to value numbering.
458 |
459 | For this and other assignments, **you need not handle all of the cases identically to the examples given
460 | in the bril repo**.
461 | Your optimizer should be correct (and it's up to you to argue how you know it is),
462 | but it may optimize less or more or differently
463 | than the example code.
464 | You should, however,
465 | be running your optimized code through the bril interpreter
466 | to ensure it outputs the same result as the original code.
467 | The [remove_nops example](https://github.com/mwillsey/bril/tree/main/examples/remove_nops)
468 | shows how to set up `brench` to run the bril interpreter on the original and optimized code and compare the results.
469 |
470 | If you are ahead of the game (e.g., you already know about dataflow analysis),
471 | you are encouraged to implement more aggressive optimizations or a more general pass that subsumes the ones above.
472 | Just be sure to include why you made the choices you did in your written reflection.
473 |
474 | Submit your written reflection on bCourses. It should include some kind of output from the bril tools! A plot, a csv, a table, etc.
475 |
476 | Include two short bril programs in your reflection:
477 | 1. A program that you can optimize very well with your passes. Show the unoptimized and optimized versions.
478 | 2. A program that you can't optimize with your passes, but you can in your head. What's the issue? What would you need to do to optimize it?
479 |
This task (and the others) is meant to be open-ended and exploratory.
The requirements above are the minimum to get a 1-star grade, but you are encouraged to go above and beyond!
482 |
--------------------------------------------------------------------------------
/lessons/02-dataflow.md:
--------------------------------------------------------------------------------
1 | # Dataflow
2 |
3 | Resources:
4 | - There are many, many resources on dataflow analysis, and the terminology is pretty consistent.
5 | The [Wikipedia page](https://en.wikipedia.org/wiki/Data-flow_analysis) is a good starting point.
6 | - The excellent book [_Static Program Analysis_](https://cs.au.dk/~amoeller/spa/spa.pdf) (free online)
7 | is a detailed and authoritative reference. Chapter 5 deals with dataflow analysis.
8 | - Monica Lam's CS 243 slides
9 | ([part 1](https://suif.stanford.edu/~courses/cs243/lectures/L2.pdf),
10 | [part 2](https://suif.stanford.edu/~courses/cs243/lectures/L3.pdf))
11 | from Stanford.
12 | - Susan Horwitz's [CS704 notes](https://pages.cs.wisc.edu/~horwitz/CS704-NOTES/2.DATAFLOW.html).
13 | - Martin Rinard's [6.035 slides](https://ocw.mit.edu/courses/6-035-computer-language-engineering-sma-5502-fall-2005/80d42de10044c47032e7149d0eefff66_11dataflowanlys.pdf) from MIT OCW.
14 | - Lecture [notes](https://www.cs.tau.ac.il//~msagiv/courses/pa07/lecture2-notes-update.pdf) from Mooly Sagiv's course at Tel Aviv University,
15 | including coverage of the theoretical underpinnings of dataflow analysis like partial orderings, lattices, and fixed points.
16 |
17 | ## Control Flow Graphs
18 |
19 | We introduced control flow graphs (CFGs) in the [previous lesson](./01-local-opt.md#control-flow-graphs),
20 | but for the purpose of limiting our reasoning to a single block.
21 | Now we'll begin to consider the graph as a whole,
22 | and the relationships between blocks.
23 |
24 | ```mermaid
25 | flowchart TD
26 | A --> B & C
27 | B --> C
28 | C
29 | ```
30 |
31 | Recall the definitions of predecessors and successors of a basic block;
32 | we'll need those later.
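
If you need to compute them, here is one possible sketch, assuming bril-style terminators (`jmp`, `br`, `ret`) and that a terminator's targets are available as `.labels` (the field names are assumptions):

```python
def successors(block, next_block, label_to_block):
    # the last instruction of a block determines its successors
    last = block[-1]
    if last.op in ("jmp", "br"):
        return [label_to_block[l] for l in last.labels]
    if last.op == "ret":
        return []
    # no terminator: fall through to the next block in order, if any
    return [next_block] if next_block is not None else []
```

Predecessors are just these edges reversed.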
33 |
34 | ## Constant Propagation
35 |
36 | Our optimizations from the previous lesson were local, in that they only considered a single block.
37 | If we zoom in on constant propagation,
38 | we could only propagate constants within a single block.
39 | Consider the following example:
40 |
41 | ```c
42 | y = 1;
43 | if (...) { ... }
44 | use(y)
45 | ```
46 |
47 | In this case, even if nothing happens to `y` in the `if` block,
48 | we can't propagate the constant `1` to the `use(y)` statement.
49 |
50 | ```c
51 | y = 1;
52 | if (...) { y = y + 2; }
53 | use(y)
54 | ```
55 |
56 | ```c
57 | y = 1;
58 | if (...) { y = 1; }
59 | use(y)
60 | ```
61 |
62 | ### As An Analysis
63 |
64 | In value numbering,
65 | we were statically executing the program
66 | and building up a state that represented the values of variables.
67 | We were also simultaneously
modifying the program based on that state as we walked the straight-line code.
69 |
70 | For now,
71 | let's set aside the idea of modifying the program,
72 | and just focus on computing the state that represents what's happening in the program.
73 |
74 | What is the state of constant propagation?
75 | Something about the values of variables at each point in the program,
76 | and whether they are constants or not.
77 | Perhaps a map from variables to constants.
78 | If a variable is not (or cannot be proven to be) a constant,
79 | we can just leave it out of the map.
80 |
Let's try this analysis on some straight-line code.
82 |
83 | | instruction | `x` | `y` | `z` |
84 | |-------------|-----|-----|-----|
85 | | `y = 1` | | 1 | |
86 | | `z = x + 1` | | 1 | |
87 | | `x = 2` | 2 | 1 | |
88 | | `x = z` | | 1 | |
89 | | `y = y + 1` | | 2 | |
90 |
91 | Note how we left the initial values of `x` and `z` as blank.
92 | Similar to our value numbering analysis,
93 | we don't know anything about the values coming into this segment of the program.
94 |
95 | ### Extended Basic Blocks
96 |
97 | How can we make this analysis more global?
98 |
99 | Recall from last time that we can extend a local analysis to
100 | one over [extended basic blocks](./01-local-opt.md#extended-basic-blocks)
101 | by passing the state from the end of one block to the beginning of the next.
102 | The same holds for our constant propagation analysis.
103 | For a linear path of basic blocks
104 | (every path in a tree that forms an EBB),
105 | we can pass the state from the end of one block to the beginning of the next.
106 |
107 | ```mermaid
108 | flowchart TD
109 | A --> B
110 | A --> C
111 | A["**A**
112 | x = 2
113 | y = 1
114 | z = a
115 | "]
116 | B["**B**
117 | z = x + 1
118 | "]
119 | C["**C**
120 | y = x + y
121 | "]
122 | ```
123 |
124 | Try writing out the "output" state for our constant propagation analysis
125 | at the end of each block.
126 |
127 | ### Joins/Meets
128 |
129 | The key property of EBBs that allows this extension
130 | is that every non-rooted node has a single predecessor.
That means there's no ambiguity about which state to use when starting the analysis at that node;
132 | you use the state from the parent in the tree (the only predecessor in the CFG).
133 |
134 | What if we have a node with multiple predecessors?
135 |
136 | ```mermaid
137 | flowchart TD
138 | A --> B
139 | A --> C
140 | B --> C
141 |
142 | A["**A**
143 | x = 1
144 | y = 1
145 | branch ...
146 | "]
147 | B["**B**
148 | y = y + 2;
149 | "]
150 | C["**C**
151 | use(y)
152 | "]
153 | ```
154 |
155 | Blocks `A` and `B` form an EBB, and we can compute their states:
156 | - `A` says `x = 1, y = 1`
157 | - `B` says `x = 1, y = 3` (using constant folding as well)
158 |
159 | When we get to `C`, what state are we allowed to "start with"?
160 | Surely we can do the same as our local analysis and start with the empty state
161 | where no variables are known to be constants.
162 | But that seems to be a waste of information,
163 | since all of `C`'s predecessors agree on at least one value (`x = 1`).
The key is to have some way to combine the information when a block has multiple predecessors.
165 | For our constant propagation analysis,
166 | we can take the intersection of the states from the predecessors.
167 |
168 | ## Dataflow Analysis
169 |
Dataflow analysis is a well-studied and well-understood
framework for analyzing programs.
172 | It can express a wide variety of analyses,
173 | including constant propagation and folding
174 | like we saw above.
Let's try to fit our constant propagation analysis into the dataflow framework.
176 |
177 | Here are the ingredients of a dataflow analysis:
178 | 1. A fact (or set of facts) you want to know at every point in the program.
179 | - Point in the program here means beginning or end of a block in the CFG,
180 | but you could also consider every instruction.
181 | 2. An initial fact for the beginning of the program.
182 | 3. A way to compute the fact at the end of a block from the facts at the beginning of the block.
183 | - This is sometimes called the transfer function.
   - Sometimes this is notated as $f$, so $out(b) = f(in(b))$.
185 | - For some analyses, you compute the beginning fact from the end fact (or both ways!).
186 | - For constant propagation, the transfer function is the same as the one we used for BBs/EBBs;
187 | a limited version of our value numbering from before.
4. A way to relate the input/output facts of a block to the inputs/outputs of its neighbors.
189 | - Typically this is phrased as a "join" or "meet" function that combines the facts from multiple predecessors.
190 | - $in(b) = meet_{p \in \text{pred}(b)} out(p)$
191 | - For constant propagation, the meet function is the intersection of the maps.
192 | 5. An algorithm to compute the facts at every program point such that the above equations hold.
193 |
194 | Let's set aside how to solve the dataflow problem for now,
195 | and just focus on the framework.
196 |
197 | How does constant propagation fit into this framework?
198 | 1. The fact we want to know is the mapping of variables to constants.
199 | 2. The initial fact is the empty map.
200 | 3. The transfer function is the same as our local analysis.
201 | - This is one of the cool parts of dataflow analysis: you are still only reasoning about a single block at a time!
202 | 4. The meet function is the intersection of the maps.
203 | - If the maps disagree, we just say the variable is not a constant (delete it from the map).
204 |
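To make this concrete, here is a rough sketch of the two ingredients in Python. The instruction fields (`op`, `dest`, `args`, `value`) and the handling of only `const` and `add` are simplifying assumptions, not the full instruction set:

```python
def transfer(block, in_fact):
    # forward transfer function: walk the block, updating a map from
    # variable names to known constants
    out = dict(in_fact)
    for inst in block:
        if inst.op == "const":
            out[inst.dest] = inst.value
        elif inst.op == "add" and all(a in out for a in inst.args):
            # constant folding: both arguments are known constants
            out[inst.dest] = out[inst.args[0]] + out[inst.args[1]]
        elif inst.dest is not None:
            # anything we can't evaluate is no longer a known constant
            out.pop(inst.dest, None)
    return out

def meet(facts):
    # intersection: keep a variable only if every predecessor agrees on its value
    facts = list(facts)
    if not facts:
        return {}
    result = dict(facts[0])
    for f in facts[1:]:
        result = {v: c for v, c in result.items() if f.get(v) == c}
    return result
```
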
205 | Once you apply the dataflow framework,
206 | you can define a simple, local analysis for something like constant propagation/folding,
207 | and extend it to a whole CFG.
The solvers that we'll discuss later can be generic across instances of dataflow problems,
209 | so one solver implementation can be used for many different analyses!
210 |
211 | Try writing out the dataflow equations for the constant propagation analysis
212 | we did above on the CFG with blocks `A`, `B`, and `C`.
213 |
214 | ## Live Variables
215 |
216 | Let's now switch gears to another dataflow analysis problem: liveness.
217 | This analysis tells us which variables are "live" at each point in the program;
218 | that is, which variables might be needed in the future.
219 |
220 | Consider the following code:
221 | ```c
222 | x = 1;
223 | y = 2;
224 | z = 3;
225 | if (...) {
226 | y = 1;
227 | use(x);
228 | }
return y;
230 | ```
231 |
232 | And the corresponding CFG:
233 | ```mermaid
234 | flowchart TD
235 | A["**A**
236 | x = 1
237 | y = 2
238 | z = 3
239 | "]
240 | B["**B**
241 | y = 1
242 | use(x)
243 | "]
244 | C["**C**
245 | return y;
246 | "]
247 | A --> B
248 | A --> C
249 | B --> C
250 | ```
251 |
252 | Live variables are those that might be used in the future.
Liveness is a backwards analysis,
so we will compute the live variables at the
beginning of each block from those live at the end of the block.
256 | To get started,
257 | no variables are live at the end of the function: `out(C) = {}`.
258 | Now we can compute `in(C)` by seeing that `y` is used in the return statement,
259 | so `in(C) = {y}`.
260 |
261 | Now for blocks `A` and `B`.
262 | This is a backwards analysis,
263 | so we compute the `in` from the `out` of the block,
264 | and the out is derived from the `in`s of the successors.
265 | So we can't compute `out(A)` just yet.
266 |
267 | But `B` is ready to go, it only has one successor `C`, so `out(B) = in(C) = {y}`.
268 | We can compute `in(B)` from `out(B) = {y}` and the block itself:
269 | `B` uses `x`, but it re-defines `y`, so `in(B) = out(B) + {x} - {y} = {x}`.
270 |
271 | Now we can compute `out(A)` by combining `in(B) = {x}` and `in(C) = {y}`.
272 | Liveness is said to be a "may" analysis,
because we're interested in the variables that _might_ be used in the future.
274 | In these kinds of analyses,
275 | the meet operation is typically set union.
276 | Thus, we can compute `out(A) = in(B) U in(C) = {x} U {y} = {x, y}`.
277 | Finally `in(A)` is `out(A) - {x, y, z} = {}` since `A` defines `x`, `y`, and `z`.
278 |
279 | So to fit liveness into the dataflow framework:
280 | 1. The fact we want to know is the set of live variables at the beginning/end of each block.
281 | 2. The initial fact is the empty set.
282 | - This is a backwards analysis, so the initial fact is the set of live variables at the end of the program.
283 | 3. The transfer function is as above.
284 | - This is a backwards analysis, we compute the `in` from the `out`.
285 | - Live variables is one of a class of dataflow problems called "gen/kill" or [bit-vector problems](https://en.wikipedia.org/wiki/Data-flow_analysis#Bit_vector_problems).
286 | - In these problems, the transfer function can be broken down further into two parts:
287 | - `gen` is the set of variables that are used in the block.
288 | - `kill` is the set of variables that are defined in the block.
289 | - special care need to be taken for variables that are both used and defined in the block.
290 | - `in(b) = gen(b) U (out(b) - kill(b))`
291 | 4. The meet function is set union.
292 | - This is a backwards analysis, so the meet function combines the successors' `in` sets to form the `out` set for a block.
293 | - This is a "may" analysis, so we use union to represent the fact a variable might be live in any of the successors.
294 |
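As a concrete (assumed) sketch of this transfer function in gen/kill style, walking the block backwards handles variables that are both used and defined without any special cases:

```python
def live_in(block, live_out):
    # backwards transfer function for liveness; instructions are assumed
    # to have .args (uses) and .dest (definition, or None)
    live = set(live_out)
    for inst in reversed(block):
        if inst.dest is not None:
            live.discard(inst.dest)   # killed by this definition
        live.update(inst.args)        # generated by these uses
    return live
```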
295 |
296 | ## Solving dataflow
297 |
298 | So far we haven't seen how to solve a dataflow problem.
299 | We've sort of done it by hand, running the transfer function when its input was "ready".
This works for acyclic CFGs: you can just topologically sort the blocks and run the transfer functions in that order.
In this way, you never have to worry about the inputs to the transfer function changing;
302 | you run every block's transfer function exactly once.
303 |
304 | But what about CFGs with loops?
305 |
306 | The key to solving dataflow problems is to recognize that they are fixed-point problems.
307 | For a forward analysis on block `b`,
we construct `in(b)` from the `out` sets of the predecessors,
then we construct `out(b) = f(in(b))`.
310 | This in turn changes the input to the successors,
311 | so we repeat the process until the `in` and `out` sets stabilize.
312 | Another way to think about it is solving a system of equations.
313 | Each block `b` induces 2 equations:
314 | - $in(b) = meet_{p \in \text{pred}(b)} out(p)$
315 | - $out(b) = f(in(b))$
316 |
317 | Perhaps surprisingly,
318 | a very simple technique can be used to solve these equations
319 | (given some conditions on the transfer function and the domain; we'll get to those later).
320 | We do, however, have to run some transfer functions multiple times,
321 | since the inputs to the transfer function can change.
322 | A naive (but correct) solution is to simply iterate over the blocks,
323 | updating the `in` and `out` sets until they stabilize:
```python
in_, out = {}, {}  # facts at the start and end of each block (`in` is a Python keyword)
for b in blocks:
    in_[b] = initial

changed = True
while changed:
    changed = False
    for b in blocks:
        out[b] = f(b, in_[b])
    for b in blocks:
        new_in = meet(out[p] for p in pred[b])
        if new_in != in_[b]:
            in_[b] = new_in
            changed = True
```
333 |
334 | This iterative approach does require some properties of the transfer function and the domain.
335 | If you take as a given that this algorithm is correct,
336 | perhaps you can deduce what those properties are!
337 |
338 | 1. The algorithm specifies neither the order in which the blocks are processed,
339 | nor the order of the predecessors for each block.
340 | - This suggests the meet function doesn't care about the order of its inputs.
341 | - The meet function is associative and commutative.
342 | 2. The stopping condition above is very simple: keep going until the `in`s and `out`s stabilize.
343 | - The dataflow framework uses a _partial order_ on the facts to ensure that this algorithm terminates.
344 | - Termination requires that the transfer function is _monotonic_, that is,
345 | a "better" input to the transfer function produces a "better" output:
346 | if $a \leq b$ then $f(a) \leq f(b)$.
347 | - We saw above that the meet function is associative and commutative.
348 | Typically, we constrain the domain to form a [lattice](https://en.wikipedia.org/wiki/Lattice_(order)),
349 | where the meet operation is indeed the meet of the lattice.
350 | - See the above link for more details on lattices,
351 | but the key point is that `meet(a, b)` returns the greatest lower bound of `a` and `b`; the biggest element that is less than both `a` and `b`.
352 | - In the context of dataflow analysis, where lower is typically "less information",
353 | the meet operation gives you the best (biggest) set of facts that is less than its inputs.
354 | - For now, termination also requires that the lattice has _finite height_,
355 | so you cannot have an infinite chain of elements that are all less than each other.
356 | We can lift this restriction later with so-called _widening_ operators.
357 | - Interval analysis is an example of an analysis that requires widening.
358 |
359 | We can exploit these properties to make the algorithm more efficient.
360 | In particular, we can use a worklist algorithm to avoid recomputing the `out` sets of blocks that haven't changed.
361 |
362 | Worklist algorithm (for forward analysis):
```python
in_, out = {}, {}
for b in blocks:
    in_[b] = initial
    out[b] = initial

worklist = list(blocks)
while worklist:
    b = worklist.pop()
    in_[b] = meet(out[p] for p in b.predecessors)
    new_out = f(b, in_[b])
    if new_out != out[b]:
        out[b] = new_out
        worklist.extend(b.successors)
```
375 |
376 | The worklist algorithm is more-or-less the typical way to solve dataflow problems.
377 | Note that it still does not specify the order in which blocks are processed!
378 | While of course that doesn't matter for correctness,
379 | it can affect the efficiency of the algorithm.
Consider a CFG that's a DAG:
if you process the blocks in topological order,
the algorithm finishes in a single pass,
but if you proceed in the reverse order, it does far more work than necessary.
384 | The ordering typically used in practice is
385 | the [reverse postorder](https://en.wikipedia.org/wiki/Data-flow_analysis#Ordering) of the blocks.
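
For reference, here is a small sketch of computing that order, assuming each block has a `.successors` list and `entry` is the entry block (these names are assumptions):

```python
def reverse_postorder(entry):
    order, visited = [], set()
    def dfs(b):
        visited.add(b)
        for s in b.successors:
            if s not in visited:
                dfs(s)
        order.append(b)  # postorder: a block is emitted after all its successors
    dfs(entry)
    return list(reversed(order))
```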
386 |
387 |
392 |
393 | ### Solutions to Dataflow
394 |
395 | The algorithms above find a fixed point, but what does that mean?
Essentially, the question is which execution paths are accounted for in the analysis.
397 |
398 | The best one is the _ideal_ solution:
the meet of all of the program paths that actually execute.
400 | This is not computable in general,
401 | because you don't know which paths will be taken.
402 | So the best you can do statically is called the Meet Over all Paths (MOP) solution,
403 | which is the meet of all paths in the CFG.
404 | Note there are infinitely many paths in a CFG with loops,
405 | so you cannot compute this directly!
406 |
407 | The algorithms above compute a fixed point to the dataflow equations.
408 | This is typically referred to as the _least fixed point_,
409 | because the solution is the smallest set of facts that satisfies the dataflow equations.
410 | When the dataflow problem is distributive (see below),
411 | the least fixed point is the MOP solution.
412 |
413 | ## Liveness With Loops
414 |
415 | Let's walk through an example using the naive algorithm to do live variable analysis on the following program with a loop:
416 |
417 | ```c
418 | int foo(int y, int a) {
419 | int x = 0;
420 | int z = y;
421 | while (x < a) {
422 | x = x + 1;
423 | y = y + 1;
424 | }
425 | return x;
426 | }
427 | ```
428 |
429 | And in the IR:
430 | ```
431 | .entry:
432 | x = 0
433 | z = y
434 | jmp .header
435 | .header:
436 | c = x < a
437 | br c .loop .exit
438 | .loop:
439 | x = x + 1
440 | y = y + 1
441 | jmp .header
442 | .exit:
443 | return x
444 | ```
445 |
446 | Since live variables is a backwards analysis,
447 | we'll start with the empty set at the end of the program
448 | and use the naive algorithm over the blocks in reverse order (a `"` in the table below means the set is unchanged from the previous pass).
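
Concretely, each pass applies the usual liveness equations to every block; here is a small sketch, where `use[b]` is the set of variables read in `b` before being written and `defs[b]` is the set of variables written in `b` (names are just illustrative):

```python
def liveness_pass(blocks_in_reverse, succs, use, defs, live_in, live_out):
    """One pass of the naive algorithm; repeat until nothing changes."""
    for b in blocks_in_reverse:
        # a variable is live out of b if it is live into any successor
        live_out[b] = set().union(*(live_in[s] for s in succs[b]))
        # a variable is live into b if b reads it, or if it is live out
        # of b and b doesn't overwrite it
        live_in[b] = use[b] | (live_out[b] - defs[b])
```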
449 |
450 | | Block | Live Out 1 | Live In 1 | Live Out 2 | Live In 2 | Live Out 3 | Live In 3 |
451 | |-----------|------------|-----------|------------|-----------|------------|-----------|
452 | | `.exit` | [] | x | " | " | " | " |
453 | | `.loop` | x | x, y | x, y, a | x, y, a | " | " |
454 | | `.header` | x, y | x, y, a | x, y, a | x, y, a | " | " |
455 | | `.entry` | x, y | y, a | x, y, a | y, a | " | " |
456 |
457 | There are some interesting things to note about this execution.
458 |
459 | First, note how the worklist algorithm could have saved
460 | revisiting some blocks once they stabilized, e.g., `.exit`.
461 | Since `.exit` has no successors, its `out` set will never change,
462 | so nobody would ever add it back to the worklist.
463 |
464 | ### Optimistic vs Pessimistic
465 |
466 | Second, observe the intermediate values of the `in` and `out` sets,
467 | especially for the `.loop` and `.header` blocks.
468 | They are "wrong" at some point in the analysis!
469 |
470 | In analyses like constant propagation,
471 | we are constantly trying to "improve" the facts at each point in the program.
472 | We start out knowing that nothing is a constant,
473 | and we try to prove that some things are constants.
474 | We call such an analysis _pessimistic_.
475 | It assumes the worst case scenario, and takes conservative steps along the way.
476 | At any point, you could stop the analysis and the facts would be true,
477 | just maybe not as precise as they could be if you kept going.
478 |
479 | Liveness is an _optimistic_ analysis.
480 | It starts with a (very) optimistic view of the facts:
481 | that nothing is live!
482 | This is of course unsound if you stopped the analysis at that point;
483 | it would allow you to say that everything is dead!
484 | In an optimistic analysis,
485 | reaching a fixed point is necessary for the correctness of the analysis.
486 |
487 | The dataflow framework computes optimistic and pessimistic analyses in the same way,
488 | but it's good to remember which one you're working with,
489 | especially if you attempt to use the partial results of the analysis for optimization.
490 |
491 | ### Strong Liveness
492 |
493 | Another thing to note about the above example is
494 | that "live" code is not necessarily "needed" code.
495 | The variable `y` is never used for anything "important",
496 | but it is used in the loop (to increment itself),
497 | so it's marked as live.
498 | The loop itself becomes a sort of "sink" which keeps
499 | any self-referential assignment alive.
500 |
501 | An alternative to the standard liveness analysis is _strong liveness_.
502 | Where liveness computes if a variable is used in the future,
503 | strong liveness computes if a variable is used in a "meaningful" way.
504 | In the above example, `y` would not be considered strongly live.
505 | As we discussed in the [first lecture](./00-overview.md),
506 | "meaningful" is a bit of a loaded term,
507 | but here it's clear that `y` is not used in a way that affects the return value of the function.
508 | Strong liveness starts from a set of instructions that are the source
509 | of the "meaningful" uses of variables:
510 | these typically include return statements,
511 | (effectful) function calls,
512 | branches,
513 | and other effects like input/output operations.
514 | Using a variable in a standard operation like addition or multiplication
515 | is not inherently "meaningful",
516 | but it can be if the result is used in a "meaningful" way.
517 | So in strong liveness, a use of a variable `x` only makes `x` live
518 | if the using operation is inherently "meaningful" or if it defines a variable that is itself strongly live.
519 |
520 | In our above example,
521 | `y` is live but not strongly live.
522 | Consider implementing strong liveness as an exercise; it will make your dead code elimination more effective!
523 |
524 | ## Properties of dataflow problems
525 |
526 | In addition to optimistic/pessimistic,
527 | dataflow problems can be classified in a few other ways.
528 | One common taxonomy is based on the direction of the analysis
529 | and the meet function.
530 |
531 | | | Forwards | Backwards |
532 | |----------|----------------------------------------------------------------------|-----------------------|
533 | | **May** | Reaching definitions | Live variables |
534 | | **Must** | Available expressions <br> Constant prop/fold <br> Interval analysis | Very busy expressions |
535 |
536 | Another property commonly discussed is the distributivity of the transfer function over the meet operation:
537 | `f(a meet b) = f(a) meet f(b)`.
538 | If this property holds, the analysis is said to be _distributive_,
539 | and can be computed in a more efficient way in some cases.
540 | It also guarantees that the solution found by solving the dataflow equations is the Meet Over all Paths (MOP) solution.
541 | Dataflow problems framed in the gen/kill setting are typically distributive.
542 | Constant propagation is a good example of a _non_-distributive analysis.
543 | Consider the following simple blocks:
544 | - `X: a = 1; b = 2`
545 | - `Y: a = 2; b = 1`
546 | - `Z: c = a + b`, where `Z`'s predecessors are `X` and `Y`.
547 | In this case:
548 | $$f_Z(out_X \cap out_Y) = \emptyset$$
549 | But:
550 | $$f_Z(out_X) \cap f_Z(out_Y) =
551 | \{a \mapsto 1, b \mapsto 2, c \mapsto 3 \} \cap
552 | \{a \mapsto 2, b \mapsto 1, c \mapsto 3 \} =
553 | \{c \mapsto 3 \}
554 | $$
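
To see the gap concretely, here is a tiny sketch of that example, representing constant-propagation facts as dictionaries (the names `meet` and `transfer_Z` are just illustrative):

```python
def meet(m1, m2):
    """Keep only the variable -> constant facts the two maps agree on."""
    return {var: c for var, c in m1.items() if m2.get(var) == c}

def transfer_Z(env):
    """Transfer function for Z: c = a + b (folds only if a and b are known)."""
    out = dict(env)
    if "a" in env and "b" in env:
        out["c"] = env["a"] + env["b"]
    return out

out_X = {"a": 1, "b": 2}
out_Y = {"a": 2, "b": 1}

print(transfer_Z(meet(out_X, out_Y)))              # {}         -- the fixed-point answer
print(meet(transfer_Z(out_X), transfer_Z(out_Y)))  # {'c': 3}   -- the MOP answer
```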
555 |
556 | # Task
557 |
558 | Turn in on bCourses a written reflection on the task below.
559 |
560 | Do not feel obligated to explain exactly how each analysis works (I already know!),
561 | instead focus on the design decisions you made in implementing the analysis,
562 | and presenting evidence of its effectiveness and correctness.
563 |
564 | - Implement a global constant propagation/folding analysis (and optimization!).
565 | - This probably won't entirely subsume the LVN you did in the previous tasks, since you did common subexpression elimination in that.
566 | - But you should test and see!
567 | - Implement a global liveness analysis and use it to implement global dead code elimination.
568 | - This will probably subsume the trivial dead code elimination you did in the previous task.
569 | - But you should test and see!
570 | - As before, include some measurement of the effectiveness and correctness of your optimization/analysis.
571 | - Run your optimizations on not only the tests in the `examples/` directory,
572 | but also on the benchmarks in the `benchmarks/` directory.
573 | - Optionally:
574 | - Implement another dataflow analysis like strong liveness or reaching definitions.
575 | - What properties does that analysis have? Distributive? May/must? Forward/backward? Optimistic/pessimistic?
576 | - Implement a generic dataflow solver that can be used for any dataflow problem.
577 |
578 | As before, the repo includes some [example analyses and test cases](https://github.com/mwillsey/bril/tree/main/examples/test/df) you may use.
579 | Do not copy the provided analysis; please attempt your own first based on the course material.
580 | You may use the provided analysis as a reference after you have attempted your own.
581 | You may always use the provided test cases, but note that your results may differ and still be correct;
582 | the provided tests check the *exact* output that the provided analysis produces.
583 | Note how that folder includes some `.out` files which are not programs,
584 | but just lists of facts.
585 | This is not a standard format;
586 | it's just a text dump to stdout from the analysis for `turnt` to check.
587 |
--------------------------------------------------------------------------------
/lessons/03-ssa.md:
--------------------------------------------------------------------------------
1 | # Static Single Assignment (SSA) Form
2 |
3 | As you've undoubtedly noticed,
4 | re-assignment of variables
5 | has been a bit painful.
6 |
7 | We also saw in last week's paper
8 | a peek at SSA
9 | and how it can enable and even power-up optimizations
10 | by allowing a "sparse" computation of dataflow.
11 |
12 | In this lesson, we'll dive into SSA
13 | and see how it works
14 | and how to convert a program into SSA form.
15 |
16 | Resources:
17 | - Lots has been written about SSA, the [Wikipedia page](https://en.wikipedia.org/wiki/Static_single_assignment_form) is a good start.
18 | - The [_SSA Book_](https://pfalcon.github.io/ssabook/latest/book-full.pdf) (free online) is an extensive resource.
19 | - Jeremy Singer's [SSA Bibliography](https://www.dcs.gla.ac.uk/~jsinger/ssa.html) is a great resource for papers on SSA,
20 | and Kenneth Zadeck's [slides](https://compilers.cs.uni-saarland.de/ssasem/talks/Kenneth.Zadeck.pdf) on the history of SSA are a good read as well.
21 | - This presentation of SSA follows the seminal work by Cytron et al, [Efficiently computing static single assignment form and the control dependence graph](https://dl.acm.org/doi/pdf/10.1145/115372.115320).
22 | - The Cytron algorithm is still used by many compilers today.
23 | - A more recent paper is "Simple and Efficient Construction of Static Single Assignment Form", by Braun et al., which is [this week's reading](../reading/braun-ssa.md).
24 |
25 | ## SSA
26 |
27 | SSA stands for "Static Single Assignment".
28 | The "static" part means that each variable is assigned to at precisely one place in the program text.
29 | Of course dynamically, in the execution of the program, a variable can be assigned to many times (in a loop).
30 | The "precisely one" part is a big deal!
31 | This means that at any use of a variable,
32 | you can easily look at its definition,
33 | no analysis necessary.
34 |
35 | This creates a nice correspondence between the program text and the dataflow in the program.
36 | - definition = variable
37 | - instruction = value
38 | - argument = data flow edges
39 |
40 | This is a step towards a graph-based IR!
41 | In fact, LLVM and other compilers use pointers to represent arguments in the IR.
42 | So instead of a variable being a string, it's a pointer to the instruction that defines it (because it's unique!).
43 |
44 | ### Straight-Line Code
45 |
46 | Let's look at SSA conversion starting with simple, straight-line code.
47 |
48 | ```c
49 | x = 1;
50 | x = x + 1;
51 | y = x + 2;
52 | z = x + y;
53 | x = z + 1;
54 | ```
55 |
56 | For straight-line code like this,
57 | SSA is easy.
58 | You can essentially walk through the code
59 | and rename each definition to a new variable
60 | (typically like `x_1`, `x_2`, etc).
61 | When you see a use,
62 | you update that use to the most recent definition.
63 |
64 | ```c
65 | x_1 = 1;
66 | x_2 = x_1 + 1;
67 | y_1 = x_2 + 2;
68 | z_1 = x_2 + y_1;
69 | x_3 = z_1 + 1;
70 | ```
71 |
72 | ### Branches
73 |
74 | Easy peasy! But what about control flow?
75 | Here we encounter the $\phi$ ("phi") function,
76 | which we saw defined in last week's paper.
77 | The $\phi$ function arises
78 | from situations where control flow _merges_,
79 | and a variable may have been assigned to in different ways.
80 | An `if` statement is a classic example.
81 |
82 | ```c
83 | int x;
84 | if (cond) {
85 | x = 1;
86 | } else {
87 | x = 2;
88 | }
89 | use(x);
90 | ```
91 |
92 | And in the IR:
93 |
94 | ```
95 | ....
96 | br cond .then .else
97 | .then:
98 | x = 1;
99 | jump .end
100 | .else:
101 | x = 2;
102 | jump .end
103 | .end:
104 | use x
105 | ```
106 |
107 | SSA is "static" single assignment,
108 | so even though only one of the branches will be taken,
109 | we still need to rename `x` differently in each branch.
110 | Say `.then` defines `x_1` and `.else` defines `x_2`.
111 | What do we do at `.end`, which `x` do we use?
112 |
113 | We need a $\phi$ function!
114 |
115 | ```
116 | ....
117 | br cond .then .else
118 | .then:
119 | x_1 = 1;
120 | jump .end
121 | .else:
122 | x_2 = 2;
123 | jump .end
124 | .end:
125 | x_3 = phi .then x_1 .else x_2
126 | use x_3
127 | ```
128 |
129 | In Bril,
130 | the `phi` instruction takes 2 variables and 2 labels
131 | (they can be mixed up, since the JSON representation stores the variables and labels separately).
132 | In some literature, they write it without the labels: $\phi(x_1, x_2)$.
133 | That's just the same,
134 | since in SSA a variable is defined in exactly one place.
135 | In Bril it's just a bit more explicit.
136 |
137 | The `phi` instruction gives us a way to merge dataflow from different branches.
138 | Intuitively, it's like saying "if we came from `.then`, use `x_1`, if we came from `.else` use `x_2`".
139 | It's called "phi" because it's a "phony" instruction:
140 | not a real computation,
141 | it only exists to enable the use of SSA form.
142 | The Bril interpreter is actually capable of executing `phi` instructions,
143 | but in a real setting, they are eliminated (see below).
144 |
145 | ### Loops
146 |
147 | Loops in SSA form are similar to branches.
148 | The interesting points are basic blocks with multiple predecessors.
149 |
150 | Here's a simple do-while loop:
151 | ```c
152 | int x = 0;
153 | do {
154 | x = x + 1;
155 | } while (cond);
156 | use(x);
157 | ```
158 |
159 | In the IR:
160 |
161 | ```
162 | ...
163 | x = 0;
164 | jump .loop
165 | .loop:
166 | x = x + 1;
167 | cond = ...;
168 | br cond .loop .end
169 | .end:
170 | use x
171 | ```
172 |
173 | Here, `.loop` is the point where control flow merges, much like the block after an `if`.
174 | And in SSA form:
175 |
176 | ```
177 | .entry:
178 | x_1 = 0;
179 | jump .loop
180 | .loop:
181 | x_3 = phi .entry x_1 .loop x_2
182 | x_2 = x_3 + 1;
183 | cond = ...;
184 | br cond .loop .end
185 | .end:
186 | use x_2
187 | ```
188 |
189 | Note the ordering of the `x` vars.
190 | It's not necessary to have them in any order,
191 | but it suggests the order in which we "discovered" that we need a $\phi$ there.
192 | First we did the renaming,
193 | and then we went around the loop and saw that we needed to merge the
194 | definition of `x` from before the loop (we've labeled that block `.entry` so the phi can name it) with the definition from the loop body.
195 |
196 | ### Where to Put $\phi$ Functions
197 |
198 | Where do we need to put $\phi$ functions?
199 | We have approached this informally for now.
200 | Our current algorithm is to do a forward-ish pass,
201 | renaming and introducing $\phi$ functions "where they are necessary".
202 |
203 | A simple way to think about it is in terms of live variables and reaching definitions
204 | (recall these from the dataflow lesson).
205 | Certainly where phis are necessary is related to control flow merges.
206 | A form of SSA called "trivial SSA"
207 | can be constructed by inserting $\phi$ functions at every join point for all live variables.
208 |
209 | But we can do better than that.
210 | Even at a join point, we don't need a $\phi$ function for every variable.
211 | Consider a CFG with a diamond shape: uses after the branches merge
212 | should be able to refer to a definition from before the branches diverged without needing a $\phi$ function.
213 | This leads us to "minimal SSA",
214 | where we only insert $\phi$ at join points
215 | for variables that have multiple reaching definitions.
216 |
217 | An alternative way
218 | (and one used more frequently since it avoids having to compute liveness)
219 | to place $\phi$ functions is to consider _dominators_.
220 |
221 | ## Dominators
222 |
223 | The key concept we need is that of dominators.
224 | Node $A$ in a CFG _dominates_ node $B$ if every path from the entry to $B$ goes through $A$.
225 |
226 | Recall an extended basic block is a tree of basic blocks.
227 | In an EBB, the entry node dominates all the other nodes.
228 |
229 | What about for slightly more complex CFGs?
230 |
231 | ```mermaid
232 | graph TD
233 | A --> B & C
234 | B & C --> D
235 | ```
236 |
237 | Above is a simple CFG with a diamond shape, perhaps from an `if` statement.
238 | Let's figure out which nodes dominate which.
239 | Hopefully it's intuitive that `A` dominates `B` and `C`, since it has to run directly before them.
240 | `A` also dominates `D`, since every path to `D` goes through `A`.
241 | You can also view this inductively: `D`'s predecessors are `B` and `C`, and `A` dominates both of them, and so `A` dominates `D` as well.
242 | Domination is a reflexive relation, so `A` dominates itself.
243 | Therefore, `A` dominates all the nodes in this CFG.
244 |
245 | What about `B` and `C`? Do they dominate `D`?
246 | No!
247 | For example, the path `A -> C -> D` does not go through `B`, so `B` does not dominate `D`.
248 | Similarly, `C` does not dominate `D`.
249 | And finally, `D` does not dominate `B` or `C`.
250 |
251 | So the complete domination relation is:
252 | - `A` dominates `A`, `B`, `C`, and `D`
253 | - `B` dominates `B`
254 | - `C` dominates `C`
255 | - `D` dominates `D`
256 |
257 | How does domination help us with $\phi$ functions?
258 | So far, it tells us where we _don't_ need $\phi$ functions!
259 | For example, any definition in `A` does not need a $\phi$ function for any use in `A`, `B`, `C`, or `D`, since `A` dominates all of them.
260 | Intuitively, if a definition is guaranteed to run before a use, we don't need a $\phi$ function.
261 |
262 | ### Computing Dominators
263 |
264 | How do we compute dominators?
265 | The naive algorithm is to compute the set of dominators for each node.
266 | The formula for the dominators of a node $b$ is:
267 | $$ \text{dom}(b) = \{b\} \cup \left(\bigcap_{p \to b} \text{dom}(p)\right) $$
268 |
269 | Note that the above is the set of blocks that dominate $b$, not the blocks that $b$ dominates!
270 |
271 | This formula captures exactly the intuition above!
272 | A block $b$ dominates itself, and it dominates a block $c$ iff every predecessor of $c$ is dominated by $b$.
273 |
274 | This looks a bit like a dataflow equation, and in fact you can compute dominators using a fixed-point algorithm
275 | in the same fashion!
276 | Just iterate the above equation until you reach a fixed point.
277 |
278 | Computing dominators in this way is $O(n^2)$ in the worst case,
279 | but if you visit the nodes in reverse postorder, it converges in only a few passes (effectively $O(n)$) for typical CFGs.
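
Here is a minimal sketch of that iteration, assuming `blocks` is given in reverse postorder with the entry first, `preds` maps each block to its predecessors, and every non-entry block is reachable (so it has at least one predecessor):

```python
def dominators(blocks, preds, entry):
    """dom[b] = the set of blocks that dominate b (including b itself)."""
    dom = {b: set(blocks) for b in blocks}  # start from "everything dominates everything"
    dom[entry] = {entry}
    changed = True
    while changed:
        changed = False
        for b in blocks:  # reverse postorder makes this converge quickly
            if b == entry:
                continue
            new = {b} | set.intersection(*(dom[p] for p in preds[b]))
            if new != dom[b]:
                dom[b], changed = new, True
    return dom
```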
280 |
281 | ### Dominator Tree
282 |
283 | A convenient and compact way to represent the dominator relation is with a dominator tree.
284 | Some algorithms in compilers will want to traverse the dominator tree directly.
285 | To build one,
286 | we need a couple more definitions:
287 | - $A$ _strictly dominates_ $B$ if $A$ dominates $B$ and $A \neq B$.
288 | - $A$ _immediately dominates_ $B$ if $A$ strictly dominates $B$ _and_ $A$ does not strictly dominate any other node that strictly dominates $B$.
289 |
290 | The _dominator tree_ is a tree where each node is a basic block,
291 | and the parent of a node is the immediate dominator of that node.
292 | This also means that for any subtree of the dominator tree,
293 | the root of the subtree dominates all the nodes in the subtree.
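
If you have the `dom` sets from the previous section, one simple (not particularly efficient) way to recover immediate dominators, and hence the tree, is a sketch like this:

```python
def dominator_tree(dom, entry):
    """Compute idom[b] and the children of each node in the dominator tree."""
    idom, children = {}, {b: [] for b in dom}
    for b in dom:
        if b == entry:
            continue
        strict = dom[b] - {b}
        # the immediate dominator is the strict dominator that all the other
        # strict dominators of b also dominate (the "closest" one to b)
        idom[b] = next(d for d in strict if strict <= dom[d])
        children[idom[b]].append(b)
    return idom, children
```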
294 |
295 | Example CFG:
296 |
297 | ```mermaid
298 | graph TD
299 | A --> B & C
300 | B & C --> D
301 | D --> E
302 | ```
303 |
304 | And the dominator tree:
305 |
306 | ```mermaid
307 | graph TD
308 | A --> B & C & D
309 | D --> E
310 | ```
311 |
312 | ### Dominance Frontier
313 |
314 | The final concept we need is the _dominance frontier_.
315 | Essentially, the dominance frontier of a node $A$ is the set of nodes "just outside" of the ones that $A$ dominates.
316 |
317 | More formally, $B$ is in the dominance frontier of a node $A$ if both of the following hold:
318 | - $A$ does _not_ strictly dominate $B$
319 | - but $A$ _does_ dominate a predecessor of $B$
320 |
321 | We need the _strictly_ because a node can be in its own dominance frontier in the presence of a loop.
322 |
323 | For example, in this CFG, $B$ dominates $\{B, C\}$, and the dominance frontier of $B$ is $\{B, D\}$.
324 |
325 | ```mermaid
326 | graph TD
327 | A --> B
328 | B --> C --> B
329 | A --> D
330 | B --> D
331 | ```
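
One well-known way to compute dominance frontiers (from Cooper, Harvey, and Kennedy's "A Simple, Fast Dominance Algorithm") walks up the dominator tree from each join point; here is a sketch, reusing the `preds` and `idom` maps assumed above:

```python
def dominance_frontiers(blocks, preds, idom):
    df = {b: set() for b in blocks}
    for b in blocks:
        if len(preds[b]) < 2:
            continue  # only join points (multiple predecessors) matter here
        for p in preds[b]:
            runner = p
            # every block from p up to (but not including) b's immediate
            # dominator has b "just outside" the region it dominates
            while runner != idom[b]:
                df[runner].add(b)
                runner = idom[runner]
    return df
```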
332 |
333 | ## Converting Into SSA
334 |
335 | To convert a program into SSA form,
336 | we are going to use the dominance frontier to guide where to place $\phi$ functions.
337 | First, we place these "trivial" $\phi$ functions, without doing any renaming.
338 | Then, we rename all the variables.
339 |
340 | ### Place the $\phi$ Functions
341 |
342 | The dominance relation tells us where we _don't_ need $\phi$ functions for a definition:
343 | if a definition dominates a use, we don't need a $\phi$ function.
344 | But just outside of that region, we do need $\phi$ functions.
345 |
346 | (convince yourself of this! if $A$'s dominance frontier contains $B$, then $B$ is a join point for the control flow from $A$)
347 |
348 | We begin by placing trivial $\phi$ functions that look like `x = phi x .source1 x .source2` at these points.
349 | In general, a phi function will take a number of arguments equal to the number of predecessors of the block it's in.
350 | So make sure your initial, trivial $\phi$ functions reflect that.
351 |
352 |
353 |
354 | ```py
355 | DF[b]: dominance frontier of b;  defs[v]: set of blocks that define variable v
356 | for var in variables:
357 |     worklist = list(defs[var])
358 |     while worklist:  # a newly inserted phi is itself a new definition of var,
359 |         for block in DF[worklist.pop()]:  # so its dominance frontier needs phis too
360 |             if not has_phi(block, var):
361 |                 insert_phi(var, block); defs[var].add(block); worklist.append(block)
362 | ```
363 |
364 | Note that this is "minimal SSA", but some of these $\phi$ instructions may still be dead.
365 | Minimal SSA refers to having no "unnecessary" $\phi$ functions
366 | (i.e., no $\phi$ functions that are dominated by a single definition),
367 | but it's still possible to have some that are dead.
368 | There are other variants of SSA like pruned SSA that try to eliminate these dead $\phi$ functions.
369 | Your dead code elimination pass should be able to remove these just fine.
370 |
371 | ### Rename Variables
372 |
373 | Now that we have $\phi$ functions at the right places,
374 | we can rename variables to ensure that each definition is unique,
375 | and that each use refers to the correct definition.
376 |
377 | ```py
378 | stack = defaultdict(list)  # stack of SSA names for each original variable
379 | dom_tree[b] = children of block b in the dominator tree,
380 |               i.e., blocks that are *immediately* dominated by b
381 | def rename(block):
382 |   pushed = []  # remember what we push so we can pop it on the way out
383 |   for inst in block:
384 |     if inst is not a phi:  # phi args are filled in from the predecessors below
385 |       inst.args = [stack[arg][-1] for arg in inst.args]
386 |     if inst has a dest:
387 |       fresh = fresh_name(inst.dest)
388 |       stack[inst.dest].append(fresh); pushed.append(inst.dest); inst.dest = fresh
389 |   for succ in block.successors:
390 |     for phi in succ.phis:
391 |       v = the original variable this phi merges (record it when inserting the phi)
392 |       set phi's argument for predecessor `block` to stack[v][-1]
393 |       (or to an "undefined" name if stack[v] is empty, see below)
394 |   for child in dom_tree[block]:
395 |     rename(child)
396 |   for v in reversed(pushed): stack[v].pop()  # restore the stack
397 | ```
398 |
399 | If you're in a functional language, you can use a more functional style and pass the stack around as an argument.
400 |
401 | Note that when you update a phi argument, if there is no name on the stack for that variable, you can set the argument to a special "undefined" name to indicate that the variable is not defined when coming from that predecessor.
402 |
403 | ## Converting Out of SSA
404 |
405 | Converting out of SSA is much simpler.
406 | An easy way to remove all the $\phi$ functions
407 | is to simply insert a copy instruction at each of the arguments of the $\phi$ function.
408 |
409 | ```
410 | .a:
411 | x_1 = 1;
412 | jump .c
413 | .b:
414 | x_2 = 2;
415 | jump .c
416 | .c:
417 | x_3 = phi .a x_1 .b x_2
418 | use x_3
419 | ```
420 |
421 | Becomes:
422 |
423 | ```
424 | .a:
425 | x_1 = 1;
426 | x_3 = x_1;
427 | jump .c
428 | .b:
429 | x_2 = 2;
430 | x_3 = x_2;
431 | jump .c
432 | .c:
433 | # phi can just be removed now
434 | use x_3
435 | ```
436 |
437 | This is quite simple, but you can see that it introduces some silly-looking copies.
438 |
439 | There are many techniques for removing these copies (some of your passes might do this automatically).
440 | You can also use a more sophisticated algorithm to come out of SSA form to prevent introducing these copies in the first place.
441 | Chapter 3.2 of the [SSA Book](https://pfalcon.github.io/ssabook/latest/book-full.pdf) has a good overview of these techniques.
442 |
443 |
444 |
445 | # Task
446 |
447 | There is no task for this lesson.
448 | Later lectures and tasks will sometimes assume that you can convert to SSA form.
449 |
450 | If you implement SSA conversion in your compiler
451 | (I highly recommend it!),
452 | tell me about it in whichever task you first use it.
453 |
454 |
455 |
456 |
457 |
--------------------------------------------------------------------------------
/lessons/04-loops.md:
--------------------------------------------------------------------------------
1 | # Loop Optimization
2 |
3 | Loop optimizations are a key part of optimizing compilers.
4 | Most of a program's execution time will be spent on loops,
5 | so optimizing them can have a big impact on performance!
6 |
7 | Resources:
8 | - [LICM](http://www.cs.cmu.edu/afs/cs/academic/class/15745-s19/www/lectures/L9-LICM.pdf)
9 | and [induction variable](https://www.cs.cmu.edu/afs/cs/academic/class/15745-s19/www/lectures/L8-Induction-Variables.pdf)
10 | slides from CMU's 15-745
11 | - [Compiler transformations for high-performance computing](https://dl.acm.org/doi/10.1145/197405.197406), 1994
12 | - Sections 6.1-6.4 deal with loop optimizations
13 | - Course notes from Cornell's CS 4120 on [induction variables](https://www.cs.cornell.edu/courses/cs4120/2019sp/lectures/27indvars/lec27-sp19.pdf)
14 |
15 | ## Natural Loops
16 |
17 | Before we can optimize loops, we need to find them in the program!
18 |
19 | Many optimizations focus on so-called _natural loops_, loops
20 | with a single entry point.
21 |
22 | In this CFG, there is a natural loop formed by nodes B-E (A is the loop's preheader).
23 |
24 | ```mermaid
25 | graph TD
26 | A[A: preheader] --> B[B: loop header]
27 | B --> C & D --> E --> B
28 | D & B --> F[F: exit]
29 | ```
30 |
31 | Some things to observe:
32 | - A natural loop has a _single entry point_, which we call the _loop header_
33 | - the loop header dominates all nodes in the loop!
34 | - this makes the loop header essential for loop optimizations
35 | - Frequently, a loop will have a _preheader_ node
36 | - this is a node that is the _only_ predecessor of the loop header
37 | - useful for loop initialization code
38 | - if there is no preheader, we can create one by inserting a new node
39 | - The loop may have many exits
40 |
41 | The following is not a natural loop, because there is no single entry point.
42 |
43 | ```mermaid
44 | graph TD
45 | A --> B & C
46 | B --> C
47 | C --> B
48 | ```
49 |
50 | Natural loops also allow us to classify certain edges in the CFG as _back edges_.
51 | A back edge is an edge in the CFG from node A to node B, where B dominates A.
52 | Every back edge corresponds to a natural loop in the CFG!
53 | In the natural loop example above, the edge from E to B is a back edge.
54 | Note how the non-natural loop example has no back edges!
55 |
56 | ### Reducibility
57 |
58 | An alternative and more general definition of a
59 | loop in a CFG would be to look for strongly-connected components.
60 | However, this is typically less useful for optimization.
61 | In this class, we will focus on natural loops.
62 |
63 | A CFG with only natural loops is called _reducible_.
64 | If you're building a compiler from a language with structured control flow,
65 | (if statements, while loops, etc),
66 | the CFG will be reducible.
67 | If you have arbitrary control flow (like goto), you may have irreducible CFGs.
68 |
69 | There are techniques to reduce an irreducible CFG to a reducible one,
70 | but we won't cover them in this class.
71 | The vast majority of programs you will encounter will have reducible CFGs.
72 | We'll only talk about optimizing natural loops, and you may just ignore the non-natural loops for now.
73 |
74 | ### Nested and Shared Loops
75 |
76 | Natural loops may share a loop header:
77 |
78 | ```mermaid
79 | graph TD
80 | A[A: header] --> B --> A
81 | A --> C --> A
82 | ```
83 |
84 | And they may be nested:
85 |
86 | ```mermaid
87 | graph TD
88 | A[A: header 1] --> B[B: header 2] --> C --> B
89 | B --> D --> A
90 | ```
91 |
92 | In the nested loop example,
93 | node C is in the inner loop defined by the back edge from C to B,
94 | but is it in the outer loop?
95 | The outer loop is defined by the back edge from D to A.
96 |
97 | We define a natural loop in terms of the back edge that defines it.
98 | It's the smallest set of nodes that:
99 | - contains the back edge that points to the loop header
100 | - has no predecessors outside the set
101 | - except for the predecessors of the loop header
102 |
103 | By this definition, node C _is_ in the outer loop!
104 | It is a predecessor of B,
105 | which is not the loop header of the outer loop.
106 |
107 | ### Finding Natural Loops
108 |
109 | Armed with that definition, we can find natural loops in a CFG.
110 | We will go about this by first finding back edges, then finding the loops they define.
111 |
112 | 1. Compute the dominance relation
113 | 2. Find back edges (A -> B, where B dominates A)
114 | 3. Find the natural loops defined by the back edges
115 |    - if the back edge goes from N -> H (the header)
116 |    - there are many ways to find the loop nodes; one way:
117 |    - find those nodes that can reach N without going through H
118 |    - those nodes plus H form the loop (see the sketch below)
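
Here's a small sketch of that last step for a single back edge `N -> H`, assuming `preds` maps each block to its predecessors:

```python
def natural_loop(n, header, preds):
    """The natural loop of the back edge n -> header."""
    loop = {header}
    worklist = [n]
    while worklist:
        block = worklist.pop()
        if block not in loop:
            # anything that can reach n without passing through the header
            # is part of the loop
            loop.add(block)
            worklist.extend(preds[block])
    return loop
```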
119 |
120 | ### Loop Normalization
121 |
122 | It may be convenient to normalize the structure of loops
123 | to make them easier to optimize.
124 | For example, LLVM normalizes loops to have the following structure:
125 | - Pre-header: The sole predecessor of the loop header
126 | - Header: The entry point of the loop
127 | - Latch: A single node executed for looping; the source of the back edge
128 | - Exit: An exit node that is guaranteed to be dominated by the header
129 |
130 | You might find it useful to do some normalization.
131 | In particular, you may want to add a preheader if it doesn't exist,
132 | as it's a convenient place to put loop initialization code.
133 | You may also want loop headers (and pre-headers) to be unique to a loop,
134 | that way a loop is easily identified by its header.
135 |
136 | Consider the following CFG where two natural loops share a header:
137 |
138 | ```mermaid
139 | graph TD
140 | A[A: header] --> B --> A
141 | A --> C --> A
142 | ```
143 |
144 | You could normalize by combining the two loops into one:
145 | ```mermaid
146 | graph TD
147 | A[A: header] --> B --> D[latch]
148 | A --> C --> D
149 | D --> A
150 | ```
151 |
152 | ## Loop Invariant Code Motion (LICM)
153 |
154 | A classic loop optimization is _loop invariant code motion_ (LICM),
155 | which is a great motivation for finding and normalizing loops with a pre-header.
156 |
157 | The goal is to move code out of the loop that doesn't depend on the loop iteration.
158 | Consider the following loop:
159 |
160 | ```c
161 | i = 0;
162 | do {
163 | i++;
164 | c = a + b;
165 | use(c);
166 | } while (cond);
167 | ```
168 |
169 | The expression `a + b` is loop invariant, as it doesn't change in the loop.
170 | So we could get some savings by moving it out of the loop:
171 |
172 | ```c
173 | i = 0;
174 | c = a + b;
175 | do {
176 | i++;
177 | use(c);
178 | } while (cond);
179 | ```
180 |
181 | ### Requirements for LICM
182 |
183 | Why does the above example work? And why does it use a do-while loop instead of a while loop?
184 | 1. Moving the code doesn't change the semantics of the program
185 | - watch out for redefinitions of the same variable in a loop!
186 | - if you're SSA, it's a bit easier
187 | 2. The moved instruction dominates all loop exits
188 | - in other words, it's guaranteed to execute!
189 |
190 | Requirement 2 is why we use a do-while loop.
191 | In a while loop, none of the code in the body is guaranteed to execute,
192 | so none of it dominates all loop exits.
193 | Many loops won't look like this,
194 | so what can you do?
195 |
196 | The first approach is to ignore the second requirement.
197 | On one hand, this will allow you to directly implement LICM on while loops.
198 | On the other hand,
199 | you now have some new performance _and_ correctness issues to worry about.
200 | Suppose the example above was a while loop.
201 | Even though `c = a + b` is loop invariant,
202 | it's not guaranteed to execute.
203 | So if `cond` is false on the first iteration,
204 | we will have made our program worse by moving the code out of the loop!
205 |
206 | Even worse, what if the operation was something more complex than addition?
207 | If it has side effects, we have changed the meaning of the program!
208 | Still, this can be a useful optimization in practice
209 | if you're careful to limit it to pure, side-effect-free code.
210 |
211 | ### Loop Splitting/Peeling
212 |
213 | Another approach to enabling LICM (and some other loop optimizations)
214 | is to perform _loop splitting_, sometimes called _loop peeling_.
215 | The effect is to convert a while loop into a do-while loop:
216 |
217 | ```c
218 | while (cond) {
219 | body;
220 | }
221 | ```
222 |
223 | becomes
224 |
225 | ```c
226 | if (cond) {
227 | do {
228 | body;
229 | } while (cond)
230 | }
231 | ```
232 |
233 | In CFG form, the while loop:
234 |
235 | ```mermaid
236 | graph TD
237 | P[preheader] --> A
238 | A[header] --> B[body]
239 | B --> A
240 | A --> C[exit]
241 | ```
242 |
243 | becomes the do-while loop:
244 |
245 | ```mermaid
246 | graph TD
247 | X[header copy] --> B1 & C
248 | B1[preheader] --> B[body]
249 | A[header] --> B[body]
250 | B --> A
251 | A --> C[exit]
252 | ```
253 |
254 | Note that in the converted do-while loop, the block labeled body is actually the natural loop header for the loop.
255 | Once in this form, code from the block labeled `body` (actually the header) dominates all loop exits,
256 | so you can apply LICM to it.
257 |
258 | In loops like do-while loops,
259 | you can even perform LICM on some code that is side-effecting,
260 | as long as those side effects are idempotent.
261 | A good example would be division.
262 | In general,
263 | it's not safe to move division out of a loop
264 | if you consider crashing the program to be a side effect
265 | (you could also make it undefined behavior).
266 | But in a loop of this structure,
267 | you can move loop invariant divisions into the preheader.
268 |
269 | Another way to phrase this transformation is that
270 | code in the loop header dominates all loop exits automatically.
271 | So it can be easier to move loop-invariant code out of the header of a loop than out of other nodes.
272 | Loop splitting can be seen as a way to make more loop code dominate loop exits;
273 | in simple cases it essentially turns the loop body into the loop header.
274 |
275 | ### Finding Loop Invariant Code
276 |
277 | Now it just remains to identify which code is loop invariant.
278 | This too is a fixed point, but specific to the code in a loop.
279 |
280 | We will limit ourselves to discussing the SSA form of the code.
281 | If you're not in SSA form, you'll have to do some extra work
282 | to reason about reaching definitions and avoiding redefinitions.
283 |
284 | A value is loop invariant
285 | if it will always have the same value throughout the execution of the loop.
286 | Loop invariance is a property of a value/instruction with respect to a particular loop.
287 | In nested loops, some code may be loop invariant with respect to one loop,
288 | but not another.
289 |
290 | A value (we are in SSA!) is loop invariant if either:
291 | - It is defined outside the loop
292 | - It is defined inside the loop, and:
293 | - All arguments to the instruction are loop invariant
294 | - The instruction is deterministic
295 |
296 | The last point is important.
297 | Consider the instruction `load 5` which loads from memory location 5.
298 | Is that loop invariant?
299 | It depends on if the loop changes memory location 5!
300 | We can maybe figure that out with some kind of analysis,
301 | but for now, we will consider that loads and stores cannot be loop invariant.
302 |
303 | This definition should give you enough to iterate to a fixed point!
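
A rough sketch of that fixed point for one loop, in SSA form; in this sketch, `loop_blocks` is the set of blocks in the loop, `defs` maps each SSA value to its defining instruction, each instruction is assumed to know its enclosing block (`inst.block`), and `is_pure` rules out loads, stores, calls, and anything else with side effects:

```python
def loop_invariant_values(loop_blocks, defs, is_pure):
    def is_inv(arg, invariant):
        # values with no defining instruction (e.g. function arguments)
        # are defined outside the loop, so they count as invariant
        return arg in invariant or arg not in defs

    invariant = set()
    changed = True
    while changed:
        changed = False
        for var, inst in defs.items():
            if var in invariant:
                continue
            if inst.block not in loop_blocks or (
                is_pure(inst) and all(is_inv(a, invariant) for a in inst.args)
            ):
                invariant.add(var)
                changed = True
    return invariant
```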
304 |
305 | ## Induction Variables
306 |
307 | Another quintessential element of a loop is an _induction variable_,
308 | typically defined as a variable that is incremented or decremented by some constant amount each iteration.
309 |
310 | Consider the following C code:
311 | ```c
312 | // a is an array of 32-bit ints
313 | for (int i = 0; i < N; i++) {
314 | a[i] = 42;
315 | }
316 | ```
317 |
318 | And in a (non-bril) IR
319 | ```
320 | i = 0
321 | .loop.header:
322 | c = i < N
323 | branch c .loop.body .loop.exit
324 | .loop.body:
325 | offset = i * 4
326 | addr = a + offset
327 | store 42 addr
328 | i = i + 1
329 | jump .loop.header
330 | .loop.exit:
331 | ...
332 | ```
333 |
334 | In the above code,
335 | `i` is the loop counter,
336 | and in this case it is also a so-called _basic induction variable_
337 | since it is incremented by a constant amount each iteration.
338 | The variables `offset` and `addr` are _derived induction variables_,
339 | since they are a function of the basic induction variable.
340 | Typically we restrict these deriving functions to be linear with respect to the basic induction variables:
341 | so of the form `j = c * i + d`, where:
342 | - `j` is the derived induction variable
343 | - `i` is the basic induction variable
344 | - `c` and `d` are loop invariant with respect to `i`'s loop (typically constants)
345 |
346 | Loop induction variables are part of a couple classic loop optimizations,
347 | namely induction variable elimination and strength reduction[^1]
348 | which we will discuss in the next section.
349 | But they are also important for unrolling, loop parallelization and interchange, and many other loop optimizations.
350 |
351 | [^1]: Sometimes people use "strength reduction" to refer to the more general optimization of replacing expensive operations with cheaper ones, even outside of loops. Sometimes it's used specifically to refer to this optimization with respect to induction variables.
352 |
353 |
354 | In the above example,
355 | the key observation is that `addr = a + i * 4`, and `a` is loop invariant.
356 | This fits one of a set of common patterns that allow us to perform a "strength reduction",
357 | replacing the relatively expensive multiplication with a cheaper addition.
358 | Instead of multiplying by 4 each iteration, we can just add 4 each iteration, and initialize `addr` to `a`.
359 |
360 | ```
361 | i = 0
362 | addr = a
363 | .loop.header:
364 | c = i < N
365 | branch c .loop.body .loop.exit
366 | .loop.body:
367 | store 42 addr
368 | addr = addr + 4
369 | i = i + 1
370 | jump .loop.header
371 | ```
372 |
373 | Above I've also removed some dead code for the old calculation of `addr`.
374 | Now we can observe that `addr` is now a basic induction variable
375 | instead of a derived induction variable.
376 |
377 | Great, we've optimized the loop by removing a multiplication!
378 | But we can go further by observing that `i` is now only used to compute the loop bound.
379 | We can instead compute the loop bound in terms of `addr`, which allows us to eliminate `i` from the loop entirely (the leftover `i = 0` below is now dead and easy to clean up).
380 |
381 | ```
382 | i = 0
383 | addr = a
384 | bound = a + N * 4
385 | .loop.header:
386 | c = addr < bound
387 | branch c .loop.body .loop.exit
388 | .loop.body:
389 | store 42 addr
390 | addr = addr + 4
391 | jump .loop.header
392 | ```
393 |
394 | Loop optimized!
395 |
396 |
397 | ### Finding Basic Induction Variables
398 |
399 | Of course to do any optimization with induction variables,
400 | you need to be able to find them.
401 | There are many approaches,
402 | and we will read about one
403 | in the paper "[Beyond Induction Variables](../reading/beyond-induction-variables.md)".
404 |
405 | I will discuss an approach based on SSA, you can also take a dataflow approach,
406 | which the Dragon Book and these [notes from Cornell](https://www.cs.cornell.edu/courses/cs4120/2019sp/lectures/27indvars/lec27-sp19.pdf) cover quite well.
407 | The dataflow approach is quite elegant and worth looking at!
408 | It operates with a lattice based on maps from variables to triples: `var -> (var2, mult, add)`.
409 | These dataflow approaches are totally compatible with SSA form
410 | (and SSA makes them easier to implement since you don't have to worry about redefinitions).
411 | I will instead discuss a simple approach that directly analyzes the SSA graph,
412 | as SSA makes basic induction variable finding quite easy.
413 |
414 | We begin by looking for basic induction variables of the form `i = i + e` where `e` is some loop invariant expression.
415 | If you're just getting started, you can limit yourself to constant `e`s.
416 |
417 | The essence of finding induction variables in SSA form is to look for cycles in the SSA graph.
418 | For a basic induction variable, you're looking for a cycle of the form `i = phi(base, i + incr)`.
419 | Graphically:
420 |
421 | ```mermaid
422 | graph TD
423 | A["i: phi(base, i + incr)"]
424 | B[base]
425 | C[i + incr]
426 | D[incr]
427 | A --> B & C
428 | C --> A & D
429 | ```
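
A sketch of that pattern match, directly over a simple in-memory SSA representation (the field names `phi.args`, `phi.labels`, `inst.op`, and `inst.dest` are assumptions of this sketch, not Bril's exact JSON layout); `header_phis` are the phis in the loop header, `defs` maps each SSA value to its defining instruction, `loop_labels` is the set of labels of blocks in the loop, and `invariant` is the loop-invariant set from before:

```python
def basic_induction_variables(header_phis, defs, loop_labels, invariant):
    """Find i = phi(base, i + incr) cycles; returns {i: (base, incr)}."""
    ivs = {}
    for phi in header_phis:
        # one incoming value from outside the loop (base), one along the back edge
        outside = [a for a, l in zip(phi.args, phi.labels) if l not in loop_labels]
        inside = [a for a, l in zip(phi.args, phi.labels) if l in loop_labels]
        if len(outside) != 1 or len(inside) != 1:
            continue
        base, update = outside[0], defs.get(inside[0])
        # the back-edge value must be the phi's own value plus a loop-invariant increment
        if update and update.op == "add":
            others = [a for a in update.args if a != phi.dest]
            if len(others) == 1 and others[0] in invariant:
                ivs[phi.dest] = (base, others[0])
    return ivs
```

Handling decrements (subtraction) and derived induction variables `j = c * i + d` follows the same shape, just with more patterns to match.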
430 |
431 | ### Finding Derived Induction Variables
432 |
433 | For derived induction variables `j`, you're looking for a pattern of the form `j = c * i + d` where `c` and `d` are loop invariant and `i` is a basic induction variable.
434 | Induction variables derived from the same basic induction variable are said to be in the same _family_ as the basic induction variable.
435 | Again, you can limit yourself to constant `c` and `d` to start.
436 | We can do this with a dataflow analysis as well, but we will stick to directly inspecting the SSA graph.
437 |
438 | Note that you will need to consider separate patterns for the commutative cases, and cases where one of `c` or `d` is 0.
439 |
440 | ```mermaid
441 | graph TD
442 | A["j = c * i + d"] --> B & C
443 | B["c * i"] --> B1["c"] & B2["i"]
444 | C["d"]
445 | ```
446 |
447 | ### Replacing Induction Variables
448 |
449 | The purpose of limiting ourselves to simple linear functions is that we
450 | can easily replace derived induction variables with basic induction variables.
451 |
452 | For a derived induction variable `j = c * i + d`:
453 | - `i` is a basic induction variable, with base `base` and increment `incr`
454 | - Initialize `j0` to `c * base + d` in the preheader.
455 | - In the loop, redefine `j` to `j = phi(j0, j + c * incr)`
456 | - `c * incr` is definitely loop invariant, probably constant!
457 |
458 | ### Replacing Comparisons
459 |
460 | In many cases, once you replace the derived induction variables,
461 | the initial basic induction variable is only needed to compute the loop bounds.
462 | If that's the case,
463 | you can replace the comparison to operate on another induction variable from the same family.
464 |
465 | Consider `i < N` as a computation of a loop bound.
466 | And say we have a derived induction variable `j = c * i + d`.
467 | Some simple algebra gives us the transformation (assuming `c > 0`; a negative `c` would flip the comparison):
468 | ```
469 | i < N
470 | c * i + d < c * N + d
471 | j < c * N + d
472 | ```
473 |
474 | So we can replace the comparison `i < N` with `j < c * N + d`.
475 | With that, a dead code analysis will be able to eliminate `i` entirely (if it's not used elsewhere).
476 |
477 | ### Induction Variable Families
478 |
479 | In general, induction variables in the same family can be expressed in terms of each other.
480 | As we saw above, a derived induction variable can be "lowered" into its own basic induction variable
481 | via strength reduction,
482 | so basic induction variables can be expressed in terms of each other sometimes as well.
483 | This opens up a wider question/opportunity for optimization:
484 | for a family of induction variables,
485 | what do we want to do?
486 | We could try to eliminate all but one of the basic induction variables,
487 | or we could try to replace all derived induction variables with basic induction variables.
488 |
489 | These have different trade-offs:
490 | more basic induction variables may mean better strength reduction,
491 | but it can also increase register pressure, as you have more variables to keep live across the loop.
492 | On the other hand, you can imagine trying to "de-optimize" code in the reverse process,
493 | trying to express as many derived induction variables as possible in terms of a single basic induction variable.
494 | LLVM [does such a thing](https://llvm.org/doxygen/classllvm_1_1Loop.html#ae731e6e33c2f2a9a6ebd1d51886ce534);
495 | in fact, it tries to massage the loop to have a single basic induction variable that starts at 0 and increments by 1.
496 |
497 | The classic strength reduction optimization as we discussed above is a case where the trade-off is very favorable.
498 | We begin with 1 basic induction variable (`i`) and 2 derived induction variables (`offset` and `addr`),
499 | and we end up with only 1 basic induction variable (`addr`).
500 |
501 | # Task
502 |
503 | This task will be to implement _some kind_ of loop optimization.
504 | It's up to you!
505 | If you aren't sure,
506 | consider doing a basic LICM implementation,
507 | as it has a good payoff for the effort.
508 |
509 | As always, run your optimizer on the entire benchmark suite in `benchmarks/`.
510 | Make sure to include some kind of summary statistics (average, min/max... or a plot!) in your writeup.
511 |
--------------------------------------------------------------------------------
/lessons/05-memory.md:
--------------------------------------------------------------------------------
1 | # Memory
2 |
3 | You have probably encountered memory instructions like `load` and `store` in Bril programs,
4 | and so far we have mostly ignored them or, worse, said that they prevent other optimizations.
5 |
6 | Resources:
7 | - [15-745 slides](https://www.cs.cmu.edu/afs/cs/academic/class/15745-s19/www/lectures/L13-Pointer-Analysis.pdf)
8 |
9 |
10 | ## Simple Memory Optimizations
11 |
12 | Our motivating goal for this lesson is to be able to perform
13 | on memory accesses
14 | some of the same optimizations that we did on variables.
15 | In the first couple lessons, we performed the following optimizations:
16 | - **Dead code elimination** removed instructions that wrote to variables that were never read from (before being written to again).
17 | - **Copy propagation** replaced uses of variables with their definitions.
18 | - **Common subexpression elimination** prevented recomputing the same expression multiple times.
19 |
20 | We can imagine similar optimizations for memory accesses.
21 | - **Dead store elimination** removes stores to memory that are never read from.
22 | Like our initial trivial dead code elimination, this is simplest to recognize
23 | when you have two stores close to each other.
24 | - Just like `x = 5; x = 6;` can be replaced with `x = 6;`...
25 | - We can replace `store x 5; store x 6;` with `store x 6;`.
26 | - **Store-to-load forwarding** replaces a load with the most recent store to the same location.
27 | This is similar to copy propagation, but for memory.
28 | - Just like `x = a; y = x;` can be replaced with `x = a; y = a`...
29 | - We can replace `store x 5; y = load x;` with `store x 5; y = 5;`.
30 | - **Redundant load elimination** works on a similar redundancy-elimination principle to common subexpression elimination.
31 | - Just like `x = a + b; y = a + b;` can be replaced with `x = a + b; y = x;`...
32 | - We can replace `x = load p; y = load p;` with `x = load p; y = x;`.
33 |
34 |
35 | As written,
36 | these optimizations are not very useful
37 | since they only apply when the memory accesses are right next to each other.
38 | How can we extend them?
39 |
40 | To extend them,
41 | we first need to see
42 | how we could go wrong.
43 | In each of these cases,
44 | an intervening memory instruction would make the optimization invalid.
45 | For example, say we are trying to do dead store elimination on the following code:
46 | ```
47 | store x 5;
48 | ...
49 | store x 6;
50 | ```
51 |
52 | Whether or not we can eliminate the first store depends on what happens between the two stores.
53 | In particular,
54 | a problematic instruction would be a load from `x`,
55 | since it would observe the value `5` that we are trying to eliminate.
56 |
57 | So a sound, very conservative approach would be to say that we can only perform these optimizations
58 | if we can prove that there are **no** intervening memory instructions at all.
59 |
60 | What about a load from some other pointer `y`?
61 | That might break the optimization, but it might not!
62 | It could be the case that `y` is actually an *alias* for `x`,
63 | and so the load from `y` would observe the value `5` and break the optimization.
64 | Or, `y` could be completely unrelated to `x`,
65 | in which case we could still eliminate the first store!
66 |
67 | ## Alias Analysis
68 |
69 | This is the essence of a huge field of compilers/program analysis research called *alias analysis*.
70 | The purpose of such an analysis is to answer one question:
71 | can these two pointers (`x` and `y` in the above) in the program refer to the same memory location?
72 |
73 | If we can prove they cannot (they don't alias),
74 | then we can perform the above optimization and many others,
75 | since we don't have to worry about interactions between the two pointers.
76 | If we can't prove they don't alias,
77 | then we have to be conservative and assume they do alias,
78 | which will prevent us from performing many optimizations.
79 |
80 | There are many, many algorithms for alias analysis
81 | that vary in their precision, efficiency, and assumptions about the memory model.
82 | Exploring these would make for a good project!
83 |
84 | We will present a simple alias analysis powered by, what else, dataflow analysis!
85 | The idea is to track the memory locations that each pointer can point to,
86 | and then use that information to determine if two pointers can point to the same location.
87 |
88 | ### Memory Locations
89 |
90 | What's a memory location?
91 | In some languages, like C with its address-of operator `&`,
92 | you can get a pointer to a memory location in many ways.
93 | But in Bril,
94 | there is only one way to create a memory location,
95 | calling the `alloc` instruction.
96 | The `alloc` instruction _dynamically_ returns a fresh pointer
97 | to a memory location that is not aliased with any other memory location.
98 |
99 | How can we model that in a static analysis?
100 | We can't!
101 | So we will be conservative
102 | and "name" the memory locations according to the (static) program location where they were allocated.
103 | This is a standard approach in alias analysis,
104 | since it allows you to work with a finite set of memory locations.
105 |
106 | So for example, if we have the following program:
107 | ```c
108 | while (...) {
109 | x = alloc 10; // line 1
110 | y = alloc 10; // line 2
111 | }
112 | ```
113 |
114 | We now can statically say what `x` points to: the memory location allocated at line 1.
115 | This is a necessary approximation to avoid having to name the potentially infinite number of memory locations that could be allocated in a loop.
116 | Note how we can still reason about the non-aliasing of `x` and `y` in this case.
117 |
118 | ### A Simple Dataflow Analysis
119 |
120 | While `alloc` is the only way to create a memory location from nothing,
121 | there are other ways to create memory locations, and these are the sources of aliasing.
122 |
123 | - `p1 = id p2`: the move or copy instruction is the most obvious way to create an alias.
124 | - `p1 = ptradd p2 offset`: pointer arithmetic is an interesting and challenging source of aliasing.
125 | To figure out if these two pointers can alias, we'd need to figure out if `offset` can be zero.
126 | To simplify things, we will assume that `offset` could always be zero,
127 | and so we will conservatively say that `p1` and `p2` can alias.
128 |   This also means we do not need to include indexing information in our representation of memory locations.
129 | - `p1 = load p2`: you can have pointers to other pointers, and so loading a pointer effectively copies a pointer, creating an alias.
130 |
131 | Our dataflow analysis
132 | will center around building the _points-to_ graph,
133 | a structure that maps each variable to the set of memory locations it can point to.
134 |
135 | Here is a first, very conservative stab at it:
136 |
137 | - **Direction**: forward
138 | - **State**: map of `var -> set[memory location]`
139 | - If two vars have a non-empty intersection, they might alias!
140 | - **Meet**: Union for each variable's set of memory locations
141 | - **Transfer function**:
142 |   - `x = alloc n`: `x` points to this allocation site
143 |   - `x = id y`: `x` points to the same locations as `y` did
144 |   - `x = ptradd p offset`: same as `id` (conservative)
145 |   - `x = load p`: we aren't tracking what's stored in memory, so `x` can point to any memory location
146 |   - `store p x`: no change
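
Here's a minimal sketch of that transfer function and the aliasing question it answers; `pts` is the points-to map, allocation sites are named by program location, and `ALL` is the set of every allocation site plus a catch-all for unknown memory (the field names on `inst` are assumptions of the sketch):

```python
def transfer(inst, pts, ALL):
    """pts: dict var -> set of allocation-site names; returns the updated map."""
    out = dict(pts)
    if inst.op == "alloc":
        out[inst.dest] = {f"alloc@{inst.location}"}  # name the site statically
    elif inst.op in ("id", "ptradd"):
        out[inst.dest] = set(pts.get(inst.args[0], ALL))  # copies (and ptradd) alias
    elif inst.op == "load":
        out[inst.dest] = set(ALL)  # we don't track what memory contains
    # stores don't change what any *variable* points to in this analysis
    return out

def may_alias(x, y, pts):
    return bool(pts.get(x, set()) & pts.get(y, set()))
```

The meet is just a per-variable union of these sets, exactly as in the list above.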
147 |
148 | This is a very conservative but still useful analysis!
149 | If your program doesn't have any pointers-to-pointers,
150 | then this big approximation of not modeling what `p` might point to isn't so bad.
151 |
152 | This is, however, still an _intra_-procedural analysis.
153 | So we don't know anything about the incoming function arguments.
154 | Ensure that you initialize your analysis with something conservative,
155 | like all function arguments pointing to all memory locations.
156 |
157 | # Task
158 |
159 | Implement an alias analysis for Bril programs
160 | and use it to perform at least one of the memory optimizations we discussed above.
161 |
162 | You may implement any alias analysis you like,
163 | including the simple one described above.
164 | If you are interested in a more advanced analysis,
165 | you could look into the _Andersen_ analysis,
166 | or the more efficient but less precise _Steensgaard_ analysis.
167 |
168 | As always, run your optimizations on the provided test cases
169 | and make sure they pass.
170 | Show some data analysis of the benchmarking results.
171 |
172 | In addition,
173 | create at least one contrived example that shows your optimization in action.
174 | It can be a simple code snippet,
175 | but show the before and after of the optimization.
176 |
--------------------------------------------------------------------------------
/lessons/06-interprocedural.md:
--------------------------------------------------------------------------------
1 | # Interprocedural Optimization and Analysis
2 |
3 | Most of the techniques we have discussed so far have been limited to a single function.
4 | But we have, however, made the jump from reasoning about a single basic block to reasoning about a whole function.
5 | Let's recap how we did that.
6 |
7 | ## Intraprocedural: Local to Global
8 |
9 | A local analysis or optimization doesn't consider any kind of control flow.
10 | To extend beyond a single basic block,
11 | we needed to figure out how to carry information from one block to another.
12 |
13 | Consider the following CFG:
14 |
15 | ```mermaid
16 | graph TD
17 | A --> B & C --> D
18 | ```
19 |
20 | As we saw in the first lesson about [extended basic blocks](./01-local-opt.md#extended-basic-blocks),
21 | you can "for free" reuse the information from a block that dominates you (for a forward analysis).
22 | In the above example, blocks B and C can reuse information from block A since they are both dominated by A.
23 | In fact, the same goes for block D, but you'd have to be a little careful about variable redefinitions.
24 | But how do we get information from blocks B and C into D?
25 |
26 | ### The Problem with Paths
27 |
28 | First of all, what's even the problem here? Why is it hard to reason about block D?
29 | The issue is that our analysis attempts to summarize all possible executions of a program up to a point.
30 | And when there are joins in the control flow,
31 | there are multiple paths that could have led to that point.
32 | When there are cycles, there are infinitely many paths!
33 | So the issue with global (and as we'll see later, interprocedural)
34 | analysis is that we need to handle multiple or potentially infinite paths of execution in the state of our analysis.
35 |
36 | ### Summarize
37 |
38 | The main approach that we took in this class was to summarize the information from multiple paths
39 | using the monotone framework of dataflow analysis.
40 | When control converges in a block (i.e. there are multiple ways to get to a block),
41 | we combine the analysis results from each path using a "meet" operation.
42 | This meet operation has nice algebraic properties
43 | that ensure we can always do this without spinning out of control in, for example, a loop.
44 | However, it loses information!
45 | Recall that some analyses, like constant propagation,
46 | are not distributive over the meet operation.
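47 | 
48 | As a tiny illustration of that information loss (the values here are made up):
49 | suppose one predecessor ends with `x = 2; y = 3` and the other with `x = 3; y = 2`.
50 | Along every path `x + y` is 5, but the meet happens first:
51 | 
52 | ```python
53 | state_then = {"x": 2, "y": 3}
54 | state_else = {"x": 3, "y": 2}
55 | # Meet for constant propagation: keep a value only if both paths agree.
56 | merged = {v: state_then[v] if state_then[v] == state_else[v] else None
57 |           for v in state_then}
58 | # merged == {"x": None, "y": None}, so a later `z = x + y` looks unknown,
59 | # even though it is 5 along both incoming paths.
60 | ```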
47 |
48 | In our dataflow analyses,
49 | we saw both optimistic and pessimistic approaches.
50 | An optimistic analysis might result in more precision,
51 | but you have to wait until it's converged (until all paths have been considered)
52 | before you can trust the results.
53 | A pessimistic analysis might be less precise,
54 | but you can trust the results at any point in the analysis.
55 | We will see this tradeoff again in interprocedural analysis and the open-world assumption.
56 |
57 | ### Duplicate
58 |
59 | The other approach (that we briefly discussed) is to copy!
60 | If you simply duplicate block D:
61 | ```mermaid
62 | graph TD
63 | A --> B --> D1
64 | A --> C --> D2
65 | ```
66 | Then you can analyze D1 and D2 separately (each now has a single predecessor), and then merge the results back into D.
67 | In this case, you could actually merge blocks D1 and D2 back into their predecessor blocks.
68 | In an acyclic CFG, you can always do this, eliminating any convergence at potentially exponential cost.
69 | Here's the pessimal CFG to duplicate that witnesses the exponential explosion:
70 | ```mermaid
71 | graph TD
72 | A1 & A2 --> B1 & B2 --> C1 & C2 --> D1 & D2
73 | ```
74 |
75 | And of course, CFGs with cycles cannot be fully duplicated in this way, since there are infinitely many paths.
76 |
77 | But! You can always partially apply this technique and combine it with summarization.
78 | Loop unrolling is a classic example of this technique.
79 | Practical applications of basic-block duplication (in acyclic settings) also
80 | only do this sometimes, leaving the rest to the summarization technique.
81 |
82 | ### Contextualize
83 |
84 | This is very similar to duplication,
85 | but instead of exploding the graph,
86 | you keep track of some context
87 | (in the case of a CFG, the path that led to the block)
88 | in the analysis state.
89 | Of course, there are infinitely many paths in a cycle,
90 | so you have to be careful about how you do this.
91 | A common approach is to finitize the context,
92 | for example by only keeping track of the last few blocks that led to the current block.
93 | For a global dataflow analysis, this approach is called _path-sensitive analysis_.
94 | We saw call-site sensitivity in the context of interprocedural analysis
95 | in last week's reading on the [Doop framework](../reading/doop.md).
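96 | 
97 | A minimal sketch of the finitization idea (the `k` cutoff and the tuple
98 | representation are just one common choice, in the spirit of k-limited call strings):
99 | 
100 | ```python
101 | def extend_context(context, site, k=2):
102 |     """Append the newest call site (or predecessor block), keeping only the last k."""
103 |     return (context + (site,))[-k:] if k else ()
104 | 
105 | # The analysis state is then keyed by (node, context) rather than just node,
106 | # so the same block or function gets separate facts for different recent histories.
107 | ```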
96 |
97 | ## Interprocedural Analysis
98 |
99 | Of course, most programs are not just a single function.
100 | The interaction between functions is modeled by the _call graph_,
101 | much like the control flow graph models the interaction between basic blocks.
102 |
103 | Let's say I have a library that defines some public functions `lib_N` with private helpers `helper_N`:
104 |
105 | ```mermaid
106 | graph TD
107 | lib_1 --> helper_1 --> helper_2 --> helper_3
108 | lib_2 --> helper_1
109 | lib_2 --> lib_1
110 | lib_3 --> helper_5 --> helper_4
111 |
112 | helper_2 --> helper_4
113 | ```
114 |
115 | In a CFG, one of the basic optimizations is dead code elimination (on the block level).
116 | If a block is not reachable from the entry block, then it is dead code.
117 | But if you're compiling a library, you don't know what the entry block is, since you don't know what the user's program will look like.
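118 | 
119 | The interprocedural analog is dead-*function* elimination: keep only the functions
120 | reachable in the call graph from some set of roots. Here is a small sketch (the
121 | `call_graph` dict and the example roots are illustrative): for a whole program the
122 | root is just `main`, but for a library every public function must be treated as a root.
123 | 
124 | ```python
125 | def reachable_functions(call_graph, roots):
126 |     """`call_graph` maps each function name to the set of functions it calls."""
127 |     live, worklist = set(roots), list(roots)
128 |     while worklist:
129 |         f = worklist.pop()
130 |         for callee in call_graph.get(f, ()):
131 |             if callee not in live:
132 |                 live.add(callee)
133 |                 worklist.append(callee)
134 |     return live  # anything not in here can be dropped
135 | 
136 | # e.g. with roots {"lib_1", "lib_2", "lib_3"}, every helper above stays live;
137 | # with root {"main"} for a whole program, unused library paths disappear.
138 | ```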
118 |
119 | ### Open World
120 |
121 | This is the main challenge of interprocedural analysis: **the open-world assumption**.
122 | In most compilation settings,
123 | you don't know what the whole program looks like.
124 | There are a couple main reasons for this, which roughly fall into two categories:
125 | - separate compilation
126 | - where parts of the program are compiled separately and then linked together
127 | - this enables you to compile and distribute libraries
128 | - also useful for faster, incremental compilation
129 | - dynamic code
130 | - even worse, you might not know what the whole program looks like at runtime
131 | - in many programming languages, you can load code at runtime
132 | - so you might not know what the whole program looks like even once the program is running!
133 |
134 | Thankfully,
135 | many of the techniques we've discussed so far can be extended to interprocedural analysis
136 | by running them over the call graph, even in the open-world setting.
137 | But summarization must be pessimistic, since you don't know what the whole program looks like.
138 |
139 | ### Closed World
140 |
141 | Some analyses, like the dead-code analysis above, don't really work in the open-world setting since they need to be optimistic.
142 | There are some settings where you can in fact do whole-program analysis
143 | where you can make the **closed-world assumption**.
144 | The big ones are:
145 | - [LTO](https://en.wikipedia.org/wiki/Interprocedural_optimization#WPO_and_LTO) or link-time optimization. Even when the program is separately compiled,
146 | there is typically a final linking step where you can see the whole program. Modern compilers like GCC and LLVM will do some optimizations at this stage.
147 | - JIT (just-in-time) compilation. In this case, the compiler is present at runtime!
148 | This gives the compiler the superpower of *making assumptions*. The compiler can basically assume anything it wants about the program as long as it can check it at runtime. If the assumption is true, great, you can use the code optimized for that assumption. If not, you can fall back to the unoptimized code.
149 |
150 | ### Inlining
151 |
152 | Summarization and contextualization are weakened in the open-world setting, as you can't do anything optimistically.
153 | However, duplication still works!
154 |
155 | In the interprocedural setting, duplication manifests as **inlining**.
156 | Intuitively, inlining is the process of replacing a function call with the body of the function.
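157 | 
158 | Mechanically, on Bril's JSON form, that looks roughly like the sketch below.
159 | The instruction fields are real Bril, but the helper is a simplification:
160 | it assumes the call has a destination, that the callee always returns a value,
161 | and it leaves out splitting the caller's block at the call site.
162 | 
163 | ```python
164 | def inline_body(call, callee, prefix):
165 |     """Produce the instructions to splice in place of `call` to `callee`.
166 | 
167 |     `prefix` must be unique per call site so the copied names are fresh.
168 |     """
169 |     rename = lambda name: f"{prefix}.{name}"
170 |     done = f"{prefix}.done"  # continuation label after the inlined body
171 | 
172 |     # Bind each formal parameter to the corresponding actual argument.
173 |     out = [{"op": "id", "dest": rename(p["name"]), "type": p["type"], "args": [a]}
174 |            for p, a in zip(callee.get("args", []), call.get("args", []))]
175 | 
176 |     for instr in callee["instrs"]:
177 |         instr = dict(instr)  # copy; don't mutate the original callee
178 |         if "label" in instr:
179 |             instr["label"] = rename(instr["label"])
180 |         elif instr.get("op") == "ret":
181 |             # a return becomes: copy the value into the call's destination,
182 |             # then jump to the continuation point in the caller
183 |             out.append({"op": "id", "dest": call["dest"], "type": call["type"],
184 |                         "args": [rename(instr["args"][0])]})
185 |             out.append({"op": "jmp", "labels": [done]})
186 |             continue
187 |         else:
188 |             if "dest" in instr:
189 |                 instr["dest"] = rename(instr["dest"])
190 |             if "args" in instr:
191 |                 instr["args"] = [rename(a) for a in instr["args"]]
192 |             if "labels" in instr:
193 |                 instr["labels"] = [rename(l) for l in instr["labels"]]
194 |         out.append(instr)
195 |     out.append({"label": done})
196 |     return out
197 | ```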
157 |
158 | Inlining is frequently referred to as the most important optimization in a compiler
159 | for the following reasons:
160 | - Primarily, inlining can enable many other optimizations
161 | - It also removes the overhead of the function call itself
162 | - the call/return instruction
163 | - stack manipulation, argument passing, etc.
164 |
165 | The downsides to inlining are essentially the same as those of basic-block duplication:
166 | - code size increase, which can decrease instruction cache locality
167 | - increased compile time
168 | - recursive functions cannot be fully inlined
169 |
170 | To illustrate the power of inlining,
171 | here are two (contrived) examples that work a little differently.
172 |
173 | The first shows how inlining can let precise information flow "into" a particular call site.
174 |
175 | ```c
176 | int add_one(int x) {
177 | return x + 1;
178 | }
179 |
180 | int caller() {
181 | return add_one(1);
182 | }
183 |
184 | int caller_inlined() {
185 | return 1 + 1;
186 | }
187 | ```
188 |
189 | Here we can see that the constant argument can easily be propagated (and then folded) into the inlined call site.
190 | Inliners in a compiler may also consider the call site itself in addition to the called function.
191 | If the callee is given constant arguments, then the inliner may be more likely to inline the function.
192 |
193 | The second example shows information flowing in the other direction.
194 |
195 | ```c
196 | int square(int x) {
197 | return x * x;
198 | }
199 |
200 | int caller(int x) {
201 | while (1) {
202 | int y = square(x);
203 | }
204 | }
205 |
206 | int caller_inlined(int x) {
207 | while (1) {
208 | int y = x * x;
209 | }
210 | }
211 | ```
212 |
213 | Yes, many properties like loop invariance or purity can also be derived for functions.
214 | But inlining can save you from having to do so!
215 |
216 |
217 | ### When to Inline
218 |
219 | Whether or not to inline a function is very difficult to determine in general.
220 | It depends on what optimizations the inlining would enable,
221 | which might depend on other inlining decisions,
222 | and so on.
223 | Our [reading this week](../reading/optimal-inlining.md) discusses how to find an
224 | optimal point in this tradeoff space by an extremely expensive search.
225 | In practice, it's very difficult to know for sure if inlining a particular function is a good idea,
226 | so you need some heuristics to guide you.
227 | In many cases, the heuristics are simple and based on the size of the function.
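228 | 
229 | A caricature of such a heuristic (the threshold, the constant-argument bonus,
230 | and the function shape are all made up for illustration):
231 | 
232 | ```python
233 | def should_inline(call, callee, constant_args, threshold=25):
234 |     """Inline small callees; be more eager when arguments are known constants."""
235 |     size = len(callee["instrs"])
236 |     if any(a in constant_args for a in call.get("args", [])):
237 |         size //= 2  # constant arguments usually let the inlined body fold away
238 |     return size <= threshold
239 | ```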
228 |
229 |
230 |
231 |
232 |
233 |
234 |
--------------------------------------------------------------------------------
/lessons/README.md:
--------------------------------------------------------------------------------
1 | # Lessons
2 |
3 | These are the lecture notes for the course.
4 |
5 | See the parent [README](../README.md) for more info.
--------------------------------------------------------------------------------
/project.md:
--------------------------------------------------------------------------------
1 | # CS 265: Final Project
2 |
3 | ## Public Projects!
4 |
5 | Some students have opted to make their projects public!
6 | Check them out, and feel free to reach out to students directly!
7 | I know some are graduating soon and looking for jobs,
8 | so if you're looking for compilers folks, look here!
9 |
10 |
11 | - [Redesigning the Vortex GPU ISA: 64-Bit Instruction Words and Conflict-Aware Register Allocation](https://github.com/richardyrh/cyclotron-cs265)
12 | - Ruohan Richard Yan, Shashank Anand
13 | - [Efficient Register Allocation Algorithms](https://github.com/JacobBolano/cs265_final_project/blob/master/CS_265_Final_Project_Report.pdf)
14 | - Jacob Bolano, Shankar Kailas
15 | - [Compiler Front-end for Translating ChocoPy into Bril](https://github.com/gabe-raulet/chocopy2bril/blob/master/report.pdf)
16 | - Gabe Raulet
17 | - [Bril to RISC-V](https://github.com/ElShroomster/bril_to_riscv)
18 | - Sriram Srivatsan
19 | - [Going to the gym with MLIR: Writing a recompiler for DEX instructions](https://badumbatish.github.io/posts/going_to_mlir_gym_1)
20 | - Jasmine Tang
21 | - [TGO: Trace Guided Optimization](https://github.com/iansseijelly/ltrace_chipyard/tree/CS265-final-project-report)
22 | - Chengyi Lux Zhang
23 | - [Multi-backend support in the cartokit compiler](https://observablehq.com/@parkerziegler/multi-backend-support-in-the-cartokit-compiler)
24 | - Parker Ziegler
25 |
26 | # Project Info
27 |
28 | This course features a project component.
29 |
30 | You may do the project individually or in groups of 2-3 people.
31 | Unlike the reflection assignments,
32 | you should submit a single project report for the group.
33 |
34 | You cannot use late days on the project.
35 | If you think you will need an extension, please talk to me as soon as possible.
36 |
37 | See the [schedule](./README.md#schedule) for due dates.
38 |
39 | **If you are in a group**,
40 | have one person submit the proposal/check-in/report and list the group members in the proposal.
41 | The others should just submit a text entry saying that they are in a group with that person.
42 |
43 | ## Project Proposals
44 |
45 | The first part of the project is to write a proposal.
46 |
47 | The purpose of the proposal is to help you scope out a project
48 | that is both interesting and feasible
49 | to complete in the time allotted.
50 |
51 | The proposal should be at least 2-3 pages long and should be submitted as a PDF on bCourses.
52 |
53 | The proposal should include the following sections:
54 |
55 | - Intro
56 | - What are you doing?
57 | - What is your goal?
58 | - Why?
59 | - Background
60 | - What do you already know or need to learn to do this project?
61 | - What pieces of infrastructure will you need? Bril? LLVM? Doop?
62 | - Are you already familiar with these? Or will you need to learn them?
63 | - What parts are already done?
64 | - Approach
65 | - How will you accomplish the things that need to be done?
66 | - What software will you need to build? Algorithms to implement? Papers to read?
67 | - Evaluation plan
68 | - How will you know if you've succeeded?
69 | - What will you measure?
70 |
71 | I will broadly accept projects that are related to any part of compilation, not just the middle-end that we focused on in class.
72 |
73 | Here are some (non-exhaustive) categories that I expect to see projects in:
74 | 1. Expanding your Bril compiler
75 | - Pick a new optimization or class of optimizations to implement and measure
76 | - I expect this will be the most common project, and that's great!
77 | - Expand the Bril infrastructure in some way
78 | - Generate Bril from a new source language
79 | - Add new IR features (parallelism, virtual functions, etc.)
80 | - Implement a backend to a real or virtual architecture
81 | 2. Any of the above in some other compiler infrastructure
82 | - probably only do this if you already have experience with the infrastructure
83 | 3. Connecting up with your current research project / hobby project
84 | - If you're already working on a project that involves compilation
85 | - Still follow the project guidelines above re: goal setting and evaluation
86 | 4. Survey paper
87 | - If you're not interested in implementing something, you can write a survey paper on a topic in compilation
88 | - Still follow the project guidelines above re: goal setting
89 | - "Evaluation" will be a more nuanced reflection on how your report compares with the state of the art. What doesn't it cover?
90 |
91 | Looking for ideas? Come chat with me!
92 | For more inspiration,
93 | see what students did in the similar [CS 6120](https://www.cs.cornell.edu/courses/cs6120/2023fa/blog/) course at Cornell, in that offering or others.
94 |
95 | ## Project Check-ins
96 |
97 | This is a ~1 page report submitted to bCourses to update me on your progress. It should answer the following questions:
98 | 1. What have you done so far to make progress towards your stated project goals?
99 | 2. Do you need to modify your project goals? If so, how?
100 | 3. What do you plan to do next?
101 | - If you're feeling on track, then let me know what's still to be done.
102 | - If you need to change your project goals, write how you plan to accomplish the new goals.
103 |
104 | ## Project Report
105 |
106 | The project report should be 4-6(ish) pages in length, and should be submitted on bCourses.
107 | Ultimately, the content of the report is flexible, but it should be self-contained
108 | (not relying on the reader to have read your proposal or check-in).
109 | The report should focus on evaluation, to the extent that it makes sense for your project.
110 | Include graphs, tables, and other visualizations as needed.
111 | If you plan to continue work on the project (not required, but some projects are part of a larger research agenda),
112 | include a section on future work.
113 |
114 | ### Making Projects Public
115 |
116 | You may **optionally** choose to make your project public.
117 | If you do this, I will link to your project from the course website, and post to social media saying "look at these cool projects!".
118 | To do this:
119 | 1. Submit only a URL to your project report on bCourses. The URL should point to a public website (github, personal website, etc.) where your project report is hosted.
120 | 2. Include a comment in the submission that says "I would like my project to be public."
--------------------------------------------------------------------------------
/reading/beyond-induction-variables.md:
--------------------------------------------------------------------------------
1 | # Beyond Induction Variables
2 |
3 | - Michael Wolfe
4 | - PLDI 1992
5 | - ACM DL [PDF](https://dl.acm.org/doi/pdf/10.1145/143095.143131)
6 |
7 | This paper should be mostly accessible given the lecture material.
8 |
9 | Section 4 is the payoff,
10 | showing off the cool things you can find beyond
11 | standard linear induction variables.
12 | Don't get too bogged down in the details here.
13 |
14 | Section 6 will discuss dependence analysis,
15 | which we have not covered in class.
16 | You should still skim this part,
17 | as it's a good primer for when we do cover it.
--------------------------------------------------------------------------------
/reading/braun-ssa.md:
--------------------------------------------------------------------------------
1 | # Simple and Efficient Construction of Static Single Assignment Form
2 |
3 | - Matthias Braun, Sebastian Buchwald, Sebastian Hack, Roland Leißa, Christoph Mallon, and Andreas Zwinkau
4 | - Compiler Construction (CC), 2013
5 | - [DOI link](https://dl.acm.org/doi/10.1007/978-3-642-37051-9_6), and [free PDF](https://c9x.me/compile/bib/braun13cc.pdf)
6 |
7 | Read Sections 1, 2, 3.1, (skip the rest of 3, 4, 5), and read 6-8.
8 |
9 | The skipped sections discuss reducible vs irreducible control flow graphs,
10 | which we will cover in more detail later in the course.
11 |
12 | One thing to look out for in this paper is the focus on efficiency (of compilation time).
13 | In fact, that was also the original motivation for SSA form
14 | as we saw in last week's reading.
15 | Another thing (also present in last week's reading) is the combination of optimizations,
16 | running them together in a single pass.
17 |
18 | Submit your reading reflection on bCourses before noon on the day of discussion.
--------------------------------------------------------------------------------
/reading/buildit.md:
--------------------------------------------------------------------------------
1 | # BuildIt: A Type-Based Multi-stage Programming Framework for Code Generation in C++
2 |
3 | - Ajay Brahmakshatriya and Saman Amarasinghe
4 | - CGO 2021
5 | - [Project Page](https://build-it.intimeand.space/)
6 | - [PDF](https://intimeand.space/docs/buildit.pdf)
7 |
8 | See also:
9 | - This [great intro to multi-stage programming](https://citeseerx.ist.psu.edu/document?repid=rep1&type=pdf&doi=1bec1c6c01075b518469434815c11b3f23c0093a) (cited in the above paper)
10 | - [Lightweight Modular Staging](https://scala-lms.github.io/) (LMS), a similar framework for Scala
--------------------------------------------------------------------------------
/reading/copy-and-patch.md:
--------------------------------------------------------------------------------
1 | # Copy-and-patch compilation
2 |
3 | - Haoran Xu, Fredrik Kjolstad
4 | - OOPLSA 2021
5 | - [ACM DL](https://dl.acm.org/doi/10.1145/3485513)
--------------------------------------------------------------------------------
/reading/doop.md:
--------------------------------------------------------------------------------
1 | # Doop: Strictly Declarative Specification of Sophisticated Points-to Analyses
2 |
3 | - [ACM DL PDF](https://dl.acm.org/doi/pdf/10.1145/1639949.1640108)
4 | - [Youtube talk](https://www.youtube.com/watch?v=FQRLB2xJC50)
5 |
6 | This paper introduces the Doop framework for specifying points-to analyses in Datalog programs.
7 | Datalog is a declarative logic programming language (and also a database query language)
8 | that is well-suited for expressing pointer analyses.
9 |
10 | The paper provides a pretty good introduction to Datalog if you're unfamiliar with it.
11 | For more of a Datalog primer, you can see [these lecture notes from my previous course](https://inst.eecs.berkeley.edu/~cs294-260/sp24/2024-02-05-datalog).
12 |
13 | Skip section 4 if you aren't already familiar with Datalog;
14 | it describes the optimizations needed to make Datalog efficient enough for this purpose.
15 | It's quite interesting, but not necessary to understand from a points-to analysis perspective.
16 |
17 |
--------------------------------------------------------------------------------
/reading/ir-survey.md:
--------------------------------------------------------------------------------
1 | # Intermediate Representations in Imperative Compilers: A Survey
2 |
3 | - James Stanier and Des Watson
4 | - 2013
5 | - [ACM DL](https://dl.acm.org/doi/10.1145/2480741.2480743)
6 | - [alternate link](https://citeseerx.ist.psu.edu/document?repid=rep1&type=pdf&doi=74db5beee87ef7abdb77eb4dd23e13fdaf12c77a)
7 |
8 | This paper provides a survey of intermediate representations (IRs) used in imperative compilers.
9 | It's a longer paper, but with little formalism, so it should be a quicker read.
10 | Don't get stuck on the details of any one IR;
11 | the goal is to understand the commonalities and differences between them.
12 |
--------------------------------------------------------------------------------
/reading/lazy-code-motion.md:
--------------------------------------------------------------------------------
1 | # Lazy Code Motion
2 |
3 | - [ACM DL PDF](https://dl.acm.org/doi/pdf/10.1145/143103.143136)
4 |
5 | This work introduces lazy code motion,
6 | a technique that sits in a larger family of optimizations called _partial redundancy elimination_.
7 | We've seen common subexpression elimination,
8 | which is a form of redundancy elimination that looks for expressions that are computed more than once.
9 | In fact,
10 | loop invariant code motion is another form of (temporal) redundancy elimination.
11 | Partial redundancy elimination is a more general form of this optimization,
12 | and it can be seen to subsume most forms of LICM.
13 |
14 | This work introduces a new form of partial redundancy elimination called _lazy code motion_,
15 | which is typically seen as simpler than the original PRE work by Morel and Renvoise.
16 |
17 | Note that many practical LICM implementations are not technically
18 | safe; they may hoist (effect-free) loop-invariant code out of a while loop whose body might never execute.
19 | This makes the assumption that the loop will actually run,
20 | which is not always the case.
21 | In practice, since you cannot always prove statically that a loop will run,
22 | you have to choose between being conservative and not hoisting anything,
23 | or being more aggressive and hoisting non-effectful loop-invariant code.
24 | Partial redundancy elimination
25 | techniques are typically more conservative than the sometimes-aggressive LICM
26 | implementations, which is why LICM isn't totally subsumed by PRE.
27 |
28 | This paper is quite short but a little dense.
29 | If you are feeling stuck,
30 | focus your efforts on the transformation and the examples, and gloss over the proofs.
31 | There are also many presentations of this work online that may help you understand the material;
32 | for example here are some [slides from CMU's 15-745 course](http://www.cs.cmu.edu/afs/cs/academic/class/15745-s19/www/lectures/L10-Lazy-Code-Motion.pdf).
33 |
34 | There is also a pretty good [Youtube video](https://www.youtube.com/watch?v=zPLbAOdIqRw)
35 | that provides some examples.
36 |
37 | Some terminology:
38 | - down-safety: a.k.a. anticipated, needed. An expression is anticipated at point p if it's guaranteed to be computed in all outgoing paths.
39 | - up-safety: available. An expression is available at point p if it's guaranteed to be computed in all incoming paths.
40 |
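41 | In the standard dataflow formulation (this is common textbook notation, not
42 | necessarily the paper's), those two definitions become, for a block $b$:
43 | 
44 | $$
45 | \begin{aligned}
46 | \text{Anticipated.out}[b] &= \bigcap_{s \in \mathrm{succ}(b)} \text{Anticipated.in}[s] \\
47 | \text{Anticipated.in}[b] &= \mathrm{used}[b] \cup \big(\text{Anticipated.out}[b] \setminus \mathrm{killed}[b]\big) \\
48 | \text{Available.in}[b] &= \bigcap_{p \in \mathrm{pred}(b)} \text{Available.out}[p] \\
49 | \text{Available.out}[b] &= \mathrm{computed}[b] \cup \big(\text{Available.in}[b] \setminus \mathrm{killed}[b]\big)
50 | \end{aligned}
51 | $$
52 | 
53 | where $\mathrm{used}[b]$ and $\mathrm{computed}[b]$ are the expressions $b$ evaluates
54 | (before, respectively after, redefining any of their operands),
55 | and $\mathrm{killed}[b]$ are the expressions whose operands $b$ redefines.
56 | 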
--------------------------------------------------------------------------------
/reading/linear-scan-ssa.md:
--------------------------------------------------------------------------------
1 | # Linear Scan Register Allocation on SSA Form
2 | - Christian Wimmer, Michael Franz
3 | - CGO 2010
4 | - [PDF](https://citeseerx.ist.psu.edu/document?repid=rep1&type=pdf&doi=3523f1d9f14a31a4fd2abc8f4a3ab06e26f84f52), [alternate](http://www.christianwimmer.at/Publications/Wimmer10a/Wimmer10a.pdf)
--------------------------------------------------------------------------------
/reading/optimal-inlining.md:
--------------------------------------------------------------------------------
1 | # Understanding and exploiting optimal function inlining
2 |
3 | - Theodoros Theodoridis, Tobias Grosser, Zhendong Su
4 | - 2022
5 | - [ACM DL](https://dl.acm.org/doi/10.1145/3503222.3507744), [PDF](https://ethz.ch/content/dam/ethz/special-interest/infk/ast-dam/documents/Theodoridis-ASPLOS22-Inlining-Paper.pdf)
--------------------------------------------------------------------------------
/reading/sparse-conditional-constant-prop.md:
--------------------------------------------------------------------------------
1 | # Constant Propagation with Conditional Branches
2 |
3 | - Mark N. Wegman and F. Kenneth Zadeck
4 | - Transactions on Programming Languages and Systems (TOPLAS), 1991
5 | - [PDF](https://dl.acm.org/doi/pdf/10.1145/103135.103136) from ACM Digital Library
6 |
7 | Read Sections 1-5.
8 |
9 | A quote from the paper sums up why this is an important topic:
10 |
11 | > Many optimizing compilers repeatedly execute constant propagation and
12 | unreachable code elimination since each provides information that improves
13 | the other. CC solves this problem in an elegant way by combining the two
14 | optimization. Additionally, the algorithm gets better results than are possible
15 | by repeated applications of the separate algorithms, as described in
16 | Section 5.1.
17 |
18 | In other words, this paper is a classic example of how
19 | two dataflow analyses can be combined to get better results
20 | than simply running them one after the other (even in a loop)!
21 |
22 | ## SSA Primer
23 |
24 | This paper mentions the use of Static Single Assignment (SSA) form,
25 | which we will cover in more detail later in the course.
26 | The paper actually provides a pretty accessible introduction to SSA form,
27 | but you may find it helpful to read more about it
28 | on the [Wikipedia page](https://en.wikipedia.org/wiki/Static_single-assignment_form)
29 | or any of the many online resources on this topic.
30 |
31 | In short,
32 | you've probably noticed how annoying it is to deal with redefinitions in your value numbering implementation.
33 | SSA form is a way to make this easier by ensuring that each variable is only assigned once.
34 | In simple code,
35 | this can be done with a simple renaming:
36 | ```
37 | x = y;
38 | x = x + 1;
39 | use(x)
40 | ```
41 | becomes:
42 | ```
43 | x1 = y;
44 | x2 = x1 + 1;
45 | use(x2)
46 | ```
47 |
48 | What happens when a block has multiple predecessors?
49 | `if`s are a common example of this:
50 | ```c
51 | x = 1;
52 | if (cond) x = x + 1;
53 | use(x);
54 | ```
55 |
56 | SSA's solution is to use the φ (phi) function
57 | to allow a special definition at the join point of the control flow
58 | that magically selects the correct value from the predecessors.
59 | So here's the before IR:
60 | ```
61 | entry:
62 | x = 1
63 | br cond .then .end
64 | .then:
65 | x = x + 1
66 | .end:
67 | use x
68 | ```
69 |
70 | And here it is in SSA form:
71 | ```
72 | entry:
73 | x1 = 1
74 | br cond .then .end
75 | .then:
76 | x2 = x1 + 1
77 | .end:
78 | x3 = φ(x1, x2)
79 | use x3
80 | ```
81 |
82 | Sometimes (and in Bril, as we'll see later),
83 | the φ function also mentions the labels that define the values:
84 | ```
85 | x3 = φ(entry: x1, .then: x2)
86 | ```
87 |
88 | This very brief introduction to SSA form should be enough to get you through the paper!
89 |
90 |
91 |
92 |
--------------------------------------------------------------------------------
/syllabus.md:
--------------------------------------------------------------------------------
1 | # Syllabus
2 |
3 | See the [README](README.md) for general course information, including the schedule.
4 |
5 | If you need support outside the context of the course,
6 | please feel free to contact me
7 | or use the [UC Berkeley Supportal](https://supportal.berkeley.edu/home)
8 | to access resources for mental health, basic needs, and more.
9 |
10 | **Caveat lector**:
11 | This course is under construction.
12 | That means that these documents are subject to change!
13 | That also means you are welcome to provide feedback to me about the course.
14 |
15 | ## Communication
16 |
17 | Course announcements, assignments, grading, and discussion will be on bCourses.
18 |
19 | ## Course materials
20 |
21 | The course material will be made available here in the form of lecture notes.
22 | As these materials are open-source,
23 | you may provide direct edits to course materials (either here or in other course repos)
24 | via pull requests if you feel so inclined, but it is not at all required.
25 | Issues and PRs to this repo are for typos, broken links, and other technical issues.
26 | Questions and course discussion should be directed to the bCourses forum.
27 |
28 | ## Class
29 |
30 | Some classes will be a lecture,
31 | some will be a discussion of a paper.
32 | Some classes (or portions of classes)
33 | will be dedicated to working on or discussing the implementation tasks.
34 | These portions are not required (you may leave class early if you wish),
35 | but I will be available to help you with the tasks.
36 |
37 | The lecture notes are not designed to be a replacement for attending class.
38 | There will not be recordings of class.
39 |
40 | ## Assignments
41 |
42 | Assignments mostly take the form of written reflections,
43 | which will be turned in on bCourses.
44 | I will read them!
45 |
46 | **NOTE**: The point of the reflections is to tell me **your thoughts** and **your decisions**.
47 | Yes, they are also there to ensure you actually did the assignment.
48 | But keep in mind that **I know the course material already**, and **I have read the paper**.
49 | You **do not** need to re-explain the course material or the paper to me;
50 | you may use terms and assume I know what they mean.
51 | You **do** need to tell me what you're thinking:
52 | things you found interesting,
53 | things you found challenging or got stuck on,
54 | decisions you made and how you made them,
55 | insights you had beyond what's in the course material,
56 | and so on.
57 |
58 | The course will feature the following assignments:
59 | - **Reading reflections**:
60 | - These are short reflections on the papers that we read in the course.
61 | - These should be done individually.
62 | - A reflection should be roughly a paragraph at minimum.
63 | - Do not summarize the paper.
64 | - Assume that I have read the paper.
65 | - Include questions you may have about the paper,
66 | whether or not you liked it,
67 | and what you learned from it.
68 | - These are due **at noon** the day of the class where the paper is discussed.
69 | - Discussion participation will be holistically included in this grade.
70 | - **Implementation tasks**:
71 | - May be done individually or groups of 2-3 people.
72 | - Your work will not be turned in or automatically graded,
73 | but you will have to discuss your work in a reflection.
74 | - I reserve the right to ask for a demonstration of your work.
75 | - Reflection should still be written and submitted _individually_.
76 | - Include your group members in the reflection.
77 | - Include a brief description of what you personally contributed.
78 | - Include a link to your work.
79 | - I encourage you to use a public git repository,
80 | but you may use a private one if you wish.
81 | Just invite me to it.
82 | - Include discussion of the major design decisions and challenges in the implementation.
83 | - Show off! Includes some code snippets, figures, or other artifacts that you think are particularly interesting.
84 | - You may share some text/figures/code between group members' reflection.
85 | - Do the work!
86 | - Pretty much everything in this class has been done before.
87 | - You can easily find solutions online. I will even provide some in the course infrastructure.
88 | - You may look at any of these resources,
89 | but you must acknowledge them (not including resources from this course) in your reflection.
90 | I encourage you to do the work without looking at these resources at first!
91 | - **Final project**: see [project page](project.md) for more details.
92 |
93 | ## Grading
94 |
95 | The assignments from this course will be graded according to
96 | a "[Michelin star](https://en.wikipedia.org/wiki/Michelin_Guide#Stars)" system
97 | (borrowed from [Adrian Sampson's](https://www.cs.cornell.edu/courses/cs6120/2023fa/syllabus/#grading) policy):
98 | - **One star**: The work is high-quality.
99 | - **Two stars**: The work is excellent.
100 | - **Three stars**: The work is exceptional.
101 |
102 | On bCourses, this will appear as an assignment score out of 1 point.
103 | Work below the "high-quality" threshold will receive zero stars (points).
104 | Earning one star is the goal for all assignments, and consistently doing so will earn an A in the course.
105 | Consistently earning multiple stars will earn an A+.
106 |
107 | ## Late Policy
108 |
109 | The assignment deadlines are designed to help you pace yourself through the course.
110 | That said, you do have late days to use throughout the semester.
111 |
112 | 1. You have 15 "late days" to use throughout the semester.
113 | 2. Late days are counted in 24-hour increments; being 1 minute late uses 1 late day.
114 | 3. Reading reflections **cannot be turned in late**.
--------------------------------------------------------------------------------