├── LICENSE ├── README.md ├── compiler-correctness.rst ├── covid-notes.rst ├── defining-capture.rst ├── defining-escape.rst ├── deref+nofree.rst ├── element-wise-atomic-proposal ├── escape-by-example.rst ├── falcon-compiler.rst ├── ismm2020.rst ├── linux-perf-jit.rst ├── llvm-board-2020.rst ├── llvm-debugging-basics.rst ├── llvm-gc-retrospective-2019 ├── llvm-loop-opt-ideas.rst ├── llvm-norms.rst ├── llvm-riscv-fuzzing.rst ├── llvm-riscv-status.rst ├── llvm-riscv ├── README.rst ├── ScalarCodeGen.rst ├── VectorCodeGen.rst ├── Vectorization.rst ├── memset.ll ├── scalar-branch-opt.ll ├── tsvc │ ├── 16.x-march=rv64gcv_zvl128b │ │ ├── diff │ │ ├── note │ │ └── stdout │ ├── 17.x-march=rv64gcv_zvl128b │ │ └── stdout │ ├── 18.x-march=rv64gcv_zvl128b │ │ └── stdout │ └── note └── vector-codegen.ll ├── llvm-shuffles.rst ├── multiple-exit-vectorization.rst ├── observable-allocations.rst ├── optimization-tricks.txt ├── pointer-provenance.rst ├── project-ideas.rst ├── recurrences.rst ├── riscv-attribute-validation.rst ├── riscv-ecosystem.rst ├── riscv-microarch.rst ├── riscv-notes.rst ├── riscv-spec-minutia.rst ├── riscv-tso-mappings.rst ├── riscv-vector-user-guide.rst ├── riscv ├── bp3-setup.rst ├── isa-detection.rst ├── vector-idioms.rst └── whole-register-move-abi.rst ├── scev-exponential.rst ├── talks-and-publications ├── unintended-instructions.rst ├── vectorization-reference.rst └── virtual-machines.rst /LICENSE: -------------------------------------------------------------------------------- 1 | BSD 2-Clause License 2 | 3 | Copyright (c) 2019, Philip Reames 4 | All rights reserved. 5 | 6 | Redistribution and use in source and binary forms, with or without 7 | modification, are permitted provided that the following conditions are met: 8 | 9 | 1. Redistributions of source code must retain the above copyright notice, this 10 | list of conditions and the following disclaimer. 11 | 12 | 2. 
Redistributions in binary form must reproduce the above copyright notice, 13 | this list of conditions and the following disclaimer in the documentation 14 | and/or other materials provided with the distribution. 15 | 16 | THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS" 17 | AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE 18 | IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE 19 | DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT HOLDER OR CONTRIBUTORS BE LIABLE 20 | FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL 21 | DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR 22 | SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER 23 | CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, 24 | OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE 25 | OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. 26 | -------------------------------------------------------------------------------- /README.md: -------------------------------------------------------------------------------- 1 | # public-notes 2 | A collection of (public) notes on assorted topics. If there's enough on a single topic, I've moved them into individual pages; otherwise, individual notes are listed below. 3 | 4 | # A Line Crossed 5 | Jan 7, 2021 - I write this today having witnessed yesterday's armed insurrection in the nation's capital. Armed rioters, at the direct instigation of our sitting President, invaded the capital of our country, threatened the safety of our elected officials, and attempted to overthrow the results of a valid election. 6 | 7 | This **can not** be allowed to stand. 8 | 9 | Every single person who participated in yesterday's shame upon our country must be identified, arrested, and placed on trial. To do otherwise is to admit that we are no longer a democracy.
Some may be convicted of minor crimes. Others should be charged with armed insurrection against the state, convicted, and spend their remaining years in prison. 10 | 11 | I join many others in calling for the immediate removal of President Trump from office. We must not wait until his term expires. Congress should impeach immediately. The Cabinet should exercise the 25th amendment. This is not a question of which, it's a question of which can be concluded sooner. He literally just attempted a coup. Leaving him in office for the remainder of his term is dangerous in the extreme. 12 | 13 | # 2020 Highlights 14 | Dec 19, 2020 - As overall terrible as this year has been, I wanted to highlight a few things which give me hope for the future. 15 | 16 | SpaceX Starlink has entered widespread beta and appears to be delivering very reasonable latency (30 ms) and bandwidth (100 Mbps down, tens of Mbps up). This is a massive technical accomplishment (antenna design, satellite launch and management, etc.) and could be world changing. Between this and the widespread availability of solar power + battery backup + cheap generators, a lot of previously remote locations are about to enter the internet age. That's wonderful for equity, but also has profound implications for society which are hard to predict. 17 | 18 | Between solar power and natural gas, the US is rapidly transitioning away from coal power. We've reached the point that this is not a policy decision; it's an economic one. Solar power and natural gas are just cheaper than coal. Solar is now cheaper on a per MWh basis than any other generation source - by a noticeable margin. Even with the cost of natural gas plants to provide power in off times, or balance the grid, coal still can't compete. Given how absolutely terrible coal is for air pollution (and thus public health), the environment, and the climate, this transition is a very good thing.
Better yet, because it's being driven by economics, not policy, there's every reason to believe this will continue and, if anything, accelerate over the next few years. 19 | 20 | Despite the headlines, the US election system worked. Election day went smoothly, votes were counted and certified, and there was no actual disruption or ambiguity about the outcome. We did have an outgoing president who tried to create drama and far too many Republican politicians who supported him for far too long, but the system itself worked. Leading into election day itself, it was not at all clear this was going to be the case (regardless of outcome). 21 | 22 | The development of vaccines for COVID has been an amazing scientific success. Our previous record for vaccine development was 4 years; this time we did it in 10 months. It's hard to overstate how positive that is, both for the current pandemic, and for the future. This won't be our last pandemic, and the technology developed now (and over the last few decades) is going to be world changing going forward. 23 | 24 | # As Posted to Facebook and Twitter, Election Day 2020 25 | Nov 3rd, 2020 -- Today is election day. I have two requests of everyone reading this. 26 | 27 | First, if you are eligible to and haven't already, vote. Voting is one of our most important civic duties as citizens. Make the time, and go vote. Personally, I strongly prefer you voted to remove Trump and all of his supporters from office as I believe he has demonstrated himself grossly unfit to hold the office, but even if you disagree with me, please vote. 28 | 29 | Second, expect and demand respect for the electoral process itself over the next few days. Our system has its problems, but it's also been one of the longest running democracies in this world. An absolutely critical component of that is the graceful transition of power, and the ability of the losing party to accept the results.
Up until recently, the US has been blessed with the ability to simply presume such, but this year, things look concerning. Regardless of who you vote for, please be ready to sit back, wait for the votes to be counted, demand that all votes be counted, and accept the winner of the election. You don't have to like the winners, but it is critical that we all commit to accept them. 30 | 31 | Those are my two key asks. Now, let me explain why I personally feel Trump is unfit for office. It doesn't come down to policy - though I disagree with many. Instead, it comes down to the fact that he has repeatedly endangered the very bedrock of our democratic system. 32 | 33 | He has publicly said he may not accept the election results; that is dangerous, extremely so. Refusal to accept election results and commit to peaceful transitions of power is how democracies fall. And no, I'm really not exaggerating here. 34 | 35 | Earlier this year, he signaled an intention to send in the military to suppress political protests. It was only the very public rejections of military involvement in civilian affairs by senior leaders which restrained him. If you haven't read the letters from the Joint Chiefs of Staff, General Mattis, or watched the press conference given by the Secretary of Defense that week, please do. Each was a historic document, and one I never want to see again. 36 | 37 | These are the two single largest incidents, but the last four years have been filled with smaller ones. I sincerely believe that electing Trump for another term is dangerous for the very existence of our country. 38 | 39 | Having said that, I also commit to respecting the results of the election if he does win. As dangerous as I believe his election to be, refusing to accept the valid result of the electoral process would be even more so. 
40 | -------------------------------------------------------------------------------- /compiler-correctness.rst: -------------------------------------------------------------------------------- 1 | 2 | Interesting Partial Correctness Oracles - Primarily geared at using fuzzers to 3 | find compiler bugs, but also useful for "compile the world" style testing and 4 | general metric monitoring across a large corpus of applications. 5 | 6 | Assertion Builds - Duh. 7 | 8 | Sanitizer Builds (ASAN, UBSAN, etc...) - Also duh, these days. 9 | 10 | Hash Intermediate IRs to Detect Canonicalization Problems -- If the compiler has 11 | passes which don't agree on the canonical form, this can result in alternation between 12 | two IR states as the optimizer runs. For any compiler with a serializable IR, 13 | this can be detected by logging the IR after each pass and looking for duplicates 14 | via hashing. (You could also use a more expensive comparison function such as llvm-diff, 15 | but the extra time is probably not worth it.) 16 | 17 | Does the resulting program run? -- A useful way to detect many miscompiles even when 18 | the exact semantics of the program aren't known. Does require the ability to generate fault free 19 | (by construction) programs. 20 | 21 | Does the program terminate? -- The obvious problem is not knowing if the program is 22 | supposed to terminate. In practice, this can be really useful in finding algorithmic 23 | corner cases in the compiler. Filtering non-terminating (before timeout) examples 24 | by whether there's a dominant stack trace for the whole execution and whether the 25 | behavior is a recent regression surfaces mostly real problems. 26 | 27 | Does the program produce the same result on two compilers or *two versions* of 28 | the same compiler? -- Very *very* effective at finding miscompile regressions 29 | and subtle long-standing miscompiles.
Requires either a) the ability to generate 30 | well defined source programs, b) a sanitizer-like tool to make any program 31 | well defined in practice, or c) a very good scoring/filtering mechanism. 32 | (a) or (b) are the strongly preferred approaches. 33 | 34 | Maximum code expansion - A desirable property in a compiler is to not produce 35 | output sizes which are exponential in the input size. A fuzzer can be a useful 36 | way of finding such corner cases, though I'm not aware of anyone who has done 37 | so to date. 38 | 39 | 40 | Performance fuzzing 41 | 42 | One area which I think is very under-explored is using fuzzers to find regressions 43 | in optimization effectiveness or missing optimizations in compilers. If you have 44 | a means of generating well defined programs and two compilers (or two compiler 45 | *versions*) you can simply run the same program compiled with both, and compare 46 | the resulting execution times. 47 | 48 | A couple of important points on making this work in practice. 49 | 50 | 1) Detecting a performance difference on a fuzzer test case will require *a lot* 51 | of care about statistics. This problem is absolutely begging for accidental 52 | "p-hacking" and that needs to be a first class part of the design of any 53 | practical system. 54 | 55 | 2) I'd expect such a system to have no trouble finding missed optimizations 56 | and regressions. (Compilers are full of them.) I suspect the hard problem 57 | would be prioritizing which ones matter, grouping them, and tying them to 58 | other performance reports. This problem is approachable for regressions, but 59 | doing it for generic missed optimizations is not a problem I know how to approach. 60 | 61 | 62 | 63 | Other random notes around fuzzing 64 | 65 | Empirically, finding regressions requires about a million unique test executions. 66 | No idea why, but most regressions fall out somewhere between a million and two 67 | million unique test executions.
(With a naive fuzzer, nothing fancy. May differ 68 | with better fuzzer technology.) Assuming an average of 0.5 seconds per test (i.e. 69 | a fairly slow fuzzer), that's only ~280 CPU hours at the high end. (i.e. one 32 70 | core machine running for about 10 hours - or $5 at recent AWS spot prices.) At this 71 | point, there is no economic excuse not to fuzz heavily. 72 | 73 | Empirically, finding long-standing subtle miscompiles takes quite a bit longer. 74 | One "interesting" (i.e. nasty, been latent for years, "how did we never see this?" 75 | bug) seems to fall out about once per 150 million unique test inputs. 76 | 77 | The empirical statements above apply to fuzzing LLVM indirectly through Azul's 78 | Falcon JIT using https://github.com/AzulSystems/JavaFuzzer. Note that this 79 | fuzzer is not coverage based and doesn't play any other "fun tricks". 80 | 81 | 82 | -------------------------------------------------------------------------------- /defining-capture.rst: -------------------------------------------------------------------------------- 1 | 2 | .. header:: This is a DRAFT of an RFC I'm considering sending to llvm-dev. The current status is more an attempt to organize thoughts than an actual proposal. 3 | 4 | ------------------------------------------------- 5 | Defining Capture 6 | ------------------------------------------------- 7 | 8 | TLDR: ... 9 | 10 | .. contents:: 11 | 12 | Terminology 13 | ------------ 14 | As with any property, there are both may and must variations. Unless explicitly stated otherwise, we assume henceforth that "captured" means "may be captured" or "potentially captured", and that "nocapture" means "definitely not captured."
15 | 16 | Candidate Definition 17 | --------------------- 18 | 19 | A captured object is one whose contents or address can be observed by an external party which controls the implementation of externally defined functions, and can call back into the module through any externally exposed entry point (potentially concurrently). The external party is restricted from "guessing" the addresses of uncaptured objects. Once captured, an object remains captured indefinitely. 20 | 21 | Some specific examples of captured objects: 22 | 23 | * A global variable with a linkage other than private is captured. 24 | * An object passed to an external function as an argument is captured. 25 | * An object returned by a function with non-private linkage is captured. 26 | * A memory object reachable from another captured object is captured. 27 | * A memory object which was *previously* reachable (even if transiently so) from a captured object remains captured. 28 | * A memory object whose address could be propagated to a captured location is captured if there exists a non-private function which when invoked could perform said propagation. (Remember, our external party is both adversarial and running concurrently, so it can arrange perfect timing attacks as needed.) 29 | 30 | Corollaries 31 | ----------- 32 | 33 | An object which is eventually captured (i.e. visible to our external observer) may not yet have been captured at a particular program point. We say that such objects are "nocapture before Ctx" where "Ctx" is the program point being discussed. For instance, all static allocas start as uncaptured, and while the allocation may be eventually captured, that doesn't change the fact that the object was nocapture before that point. 34 | 35 | A capturing operation is one which exposes a previously uncaptured object to our external observer.
36 | 37 | An object is uncaptured in a particular scope if the object was not previously captured before that scope, and no action performed within the scope is a capturing operation. In particular, note that there's nothing preventing an enclosing scope from capturing the object provided that the capturing operation occurs strictly after the end of our inner scope. 38 | 39 | FOR DISCUSSION: The last point differs from our current implementation. We'd consider an object captured in the current scope if returned. We could phrase this as simple conservatism, but is there something deeper we're missing? 40 | 41 | Exploratory Examples 42 | -------------------- 43 | 44 | Let's start with a trivial example: 45 | 46 | .. code:: c++ 47 | 48 | void foo() { 49 | X* o = new X(); 50 | o->f = 5; 51 | delete o; 52 | } 53 | 54 | Object o is nocapture both globally and in the scope of foo. 55 | 56 | Leaking the object doesn't change that. 57 | 58 | .. code:: c++ 59 | 60 | void foo() { 61 | X* o = new X(); 62 | o->f = 5; 63 | } 64 | 65 | Adding a self-referential cycle doesn't change that. 66 | 67 | .. code:: c++ 68 | 69 | void foo() { 70 | X* o = new X(); 71 | o->f = o; 72 | delete o; 73 | } 74 | 75 | Scopes 76 | ======= 77 | 78 | If we return an object, that object is captured if the function is not private. So 79 | 80 | .. code:: c++ 81 | 82 | private_linkage X* wrap_alloc() { 83 | return new X(); 84 | } 85 | 86 | doesn't capture X, but 87 | 88 | .. code:: c++ 89 | 90 | X* wrap_alloc() { 91 | return new X(); 92 | } 93 | 94 | does. Note that in both cases, the allocation is nocapture within the scope of wrap_alloc. 95 | 96 | .. code:: c++ 97 | 98 | private_linkage X* wrap_alloc() { 99 | return new X(); 100 | } 101 | void foo() { 102 | X* o = wrap_alloc(); 103 | o->f = 5; 104 | delete o; 105 | } 106 | 107 | In this example, the allocation is uncaptured globally, and in both functions.
108 | 109 | Object Graphs 110 | ============= 111 | 112 | Moving on, let's consider connected object graphs. 113 | 114 | .. code:: c++ 115 | 116 | void foo() { 117 | X* o1 = new X(); 118 | X* o2 = new X(); 119 | o1->f = o2; 120 | o2->f = o1; 121 | } 122 | 123 | In this example, both o1 and o2 are nocapture. 124 | 125 | If any object is observable, then all objects reachable through that object are captured. 126 | 127 | .. code:: c++ 128 | 129 | X* foo() { 130 | X* o1 = new X(); 131 | X* o2 = new X(); 132 | o1->f = o2; 133 | o2->f = o1; 134 | return o1; 135 | } 136 | 137 | 138 | 139 | Transient Captures 140 | ================== 141 | 142 | .. code:: c++ 143 | 144 | private_linkage int X; 145 | int* Y; 146 | 147 | void oops() { 148 | Y = &X; 149 | Y = nullptr; 150 | } 151 | 152 | In this example, both X and Y are captured. Our external observer can arrange for oops to execute (since it's an external function) and read the address of X between the two writes. 153 | 154 | This does nicely highlight that the optimizer can refine this program from one which captures X into one which doesn't by running dead store elimination. As such, it's important to note that capture statements apply to the program at a moment in time. 155 | 156 | Capture vs Lifetime 157 | =================== 158 | 159 | .. code:: c++ 160 | 161 | int* Y; 162 | 163 | void foo() { 164 | Y = new X(); 165 | free(Y); 166 | Y = nullptr; 167 | } 168 | 169 | In this example, Y has been captured. Critically, the memory object associated with the particular instance of X remains captured even once deallocated. While the contents of said object are no longer defined, the address thereof continues to exist and may be validly used. 170 | 171 | It's worth highlighting one counterintuitive implication. If our adversarial observer calls this routine twice, a reasonable memory allocator may reuse the same physical memory for both instances of X.
This does not change the fact that conceptually these are two distinct memory objects. Immediately before the store to Y on the second invocation, the first object may be captured (and deallocated) while the second one is not yet captured. Even though they share the same address. 172 | 173 | FOR DISCUSSION - I think this implies we need to tweak the definition slightly. In particular, I think we need to incorporate something which references the based-on rules to make access through the first copy UB, or we seem to have captured both (since per the proposed definition the address captures.) 174 | 175 | (This discussion is not meant to be authoritative on explaining the semantics of deallocation; for details, see the relevant section of langref.) 176 | 177 | 178 | Draft LangRef Text 179 | ------------------ 180 | 181 | nocapture argument attribute 182 | ============================ 183 | 184 | If we have a pointer to an object which has not yet been captured passed to a nocapture argument of a function, we know that the callee will not perform a capturing operation on this argument. Note that this only restricts operations by the callee performed on this argument. If a separate copy of the pointer is passed through another argument or through memory, the callee may capture that copy or store it aside in an unknown location. 185 | 186 | In addition to the capture fact just stated, a nocapture argument attribute also provides an additional "trackability" fact. If before the call, the caller is aware of all copies of a pointer, and all copies of the pointer passed to the callee are passed through nocapture arguments, then after the call, the caller can assume that no new copies of the pointer have been created. (Even if those copies are in uncaptured locations.) 187 | 188 | Note that this definition says nothing about what the callee might do if the object was already captured before the call.
189 | 190 | nofree function attribute 191 | ========================= 192 | 193 | TODO: Wording here is incompatible with the global capture definition. Need something finer grained - maybe escape? 194 | 195 | From langref: "As a result, uncaptured pointers that are known to be dereferenceable prior to a call to a function with the nofree attribute are still known to be dereferenceable after the call (the capturing condition is necessary in environments where the function might communicate the pointer to another thread which then deallocates the memory)." 196 | 197 | The problem with this is that an uncaptured copy in a private global variable still allows another thread to free it. 198 | -------------------------------------------------------------------------------- /defining-escape.rst: -------------------------------------------------------------------------------- 1 | 2 | .. header:: This is a snapshot of a previous version of the defining-capture.rst file. As noted at the bottom, this is an *incorrect* definition of capture, but the text seemed potentially useful, so I saved it for later. 3 | 4 | ------------------------------------------------- 5 | Defining Capture 6 | ------------------------------------------------- 7 | 8 | TLDR: ... 9 | 10 | .. contents:: 11 | 12 | Terminology 13 | ------------ 14 | As with any property, there are both may and must variations. Unless explicitly stated otherwise, we assume henceforth that "captured" means "may be captured" or "potentially captured", and that "nocapture" means "definitely not captured." 15 | 16 | Candidate Definition 17 | --------------------- 18 | An object is said to be captured in all scopes in which its contents are observable. It is said to be nocapture in all outer scopes in which its contents cannot be observed, whether in that scope or any outer scope thereof.
19 | 20 | There's a couple of important points to this definition: 21 | 22 | * **Scope** -- The only scopes currently defined in LLVM IR are function scopes. As such, all capture statements are implicitly attached to some function. 23 | * **Observation** -- This aspect allows stores to locations which are never read, and other uses which would appear to capture the pointer, so long as the result is unobserved. This prevents otherwise well defined transforms such as DSE from refining a nocapture object into a captured one. 24 | * **Refinement** -- As with many other properties in LLVM IR, well defined transforms can refine a program into one with fewer legal behaviors. The intention of the definition is to allow refinement from captured to nocapture, but not the other way around. 25 | 26 | Exploratory Examples 27 | -------------------- 28 | 29 | Let's start with a trivial example: 30 | 31 | .. code:: c++ 32 | 33 | void foo() { 34 | X* o = new X(); 35 | o->f = 5; 36 | delete o; 37 | } 38 | 39 | Object o is nocapture in the scope of foo. 40 | 41 | Leaking the object doesn't change that. 42 | 43 | .. code:: c++ 44 | 45 | void foo() { 46 | X* o = new X(); 47 | o->f = 5; 48 | } 49 | 50 | Adding a self-referential cycle doesn't change that. 51 | 52 | .. code:: c++ 53 | 54 | void foo() { 55 | X* o = new X(); 56 | o->f = o; 57 | delete o; 58 | } 59 | 60 | As a notational convenience, further examples are listed without an explicit deletion to emphasize that the scope is tied to the last observation, not allocation or deletion. It is worth noting that it follows from the definition of deletion in most languages that there can be no (defined) observations past deletion. 61 | 62 | Scopes 63 | ======= 64 | 65 | Next, let's consider an example which introduces multiple scopes: 66 | 67 | ..
code:: c++ 68 | 69 | X* wrap_alloc() { 70 | return new X(); 71 | } 72 | void foo() { 73 | X* o = wrap_alloc(); 74 | o->f = 5; 75 | delete o; 76 | } 77 | 78 | In this example, the allocation is captured in both foo and wrap_alloc, but for different reasons. For wrap_alloc, the pointer is returned and potentially observable outside its scope. For foo, we don't have the knowledge that the return value of wrap_alloc hasn't been captured inside wrap_alloc in a way observable outside of it. The optimizer would in practice infer that fact, leading to our first instance of refinement. 79 | 80 | .. code:: c++ 81 | 82 | X* noalias wrap_alloc() { 83 | return new X(); 84 | } 85 | void foo() { 86 | X* o = wrap_alloc(); 87 | o->f = 5; 88 | delete o; 89 | } 90 | 91 | With the additional fact, we can now infer that the allocation is nocapture in foo, but not in wrap_alloc. 92 | 93 | Object Graphs 94 | ============= 95 | 96 | Moving on, let's consider connected object graphs. 97 | 98 | .. code:: c++ 99 | 100 | void foo() { 101 | X* o1 = new X(); 102 | X* o2 = new X(); 103 | o1->f = o2; 104 | o2->f = o1; 105 | } 106 | 107 | In this example, both o1 and o2 are nocapture in the scope of foo. 108 | 109 | If any object is observable in a parent scope, then all objects reachable through that object are observable in that scope. 110 | 111 | .. code:: c++ 112 | 113 | X* foo() { 114 | X* o1 = new X(); 115 | X* o2 = new X(); 116 | o1->f = o2; 117 | o2->f = o1; 118 | return o1; 119 | } 120 | 121 | void bar() { 122 | X* o = foo(); 123 | } 124 | 125 | In this case, we see that both allocations are captured in foo, but nocapture in bar. In the following example, o1 is nocapture in both foo and bar, while o2 is only nocapture in bar. 126 | 127 | ..
code:: c++ 128 | 129 | X* foo() { 130 | X* o1 = new X(); 131 | X* o2 = new X(); 132 | o1->f = o2; 133 | return o2; 134 | } 135 | 136 | void bar() { 137 | X* o = foo(); 138 | } 139 | 140 | 141 | Defects 142 | -------- 143 | 144 | As currently written, the definition makes allocas trivially nocapture. Thus, it's clearly missing something. Perhaps what we've defined here is escape instead? 145 | -------------------------------------------------------------------------------- /element-wise-atomic-proposal: -------------------------------------------------------------------------------- 1 | [DRAFT] Support element wise atomic vectors and FCAs 2 | 3 | WARNING: This is a draft. It is still under development, and should not be cited until shared on llvm-dev. 4 | 5 | TLDR: We need to be able to model atomicity of individual elements within vectors and structs to support vectorization and load combining of atomic loads and stores. 6 | 7 | Background 8 | 9 | LLVM IR currently only supports atomic loads and stores of integer, floating point, and pointer types. Attempting to use an atomic vector or FCA type is a verifier error. LLVM supports both ordered and unordered atomics. 10 | 11 | For ease of discussion, I'm going to ignore alignment. Assume that everything which follows refers to a properly aligned memory access. The unaligned case is much harder, and is already fairly ill-defined today even for existing atomics. 12 | 13 | On modern X86, there are no formal guarantees of atomicity for loads wider than 64 bits. The practical behavior observed seems to be that loads and stores wider than 64 bits are not atomic on at least some architectures. (This is exactly what you'd expect as the width of the load/store ports is often smaller than the max vector register size.) However, all of the architectures I'm aware of seem to provide atomicity of the individual 64 bit chunks.
14 | 15 | In practice, every Java virtual machine that I'm aware of appears to assume that vector loads and stores are atomic at the 64-bit granularity. (Java requires atomicity of all - well, most - memory accesses and thus any VM that vectorizes using x86 vector registers is implicitly making this assumption.) 16 | 17 | This notion of "atomic in 64 bit chunks" is what I want to formalize in the IR. Doing so solves two major problems. 18 | 19 | First, it allows vectorization of loops with atomics. Today, upstream LLVM does not optimize any loop with an atomic load or store, and because of the semantic gap, we can't. If we converted an atomic i32 load sequence into a non-atomic load of a vector, we'd be miscompiling. And we can't simply convert to an atomic iM (M = N * 32) load as we likely can't guarantee atomicity for the full load width. 20 | 21 | Second, it allows load combining at the IR (or MI) level to be a reversible transformation. Today, if we merge a pair of atomic i8 loads, our only option is to mark the resulting i16 load as atomic. This is problematic, as if we later find a value available for one of the two original loads, we can't split the i16 apart again and perform load forwarding of the available value. 22 | 23 | As a simple example, consider: 24 | a[0] = 5; 25 | .... (something which later gets optimized away) 26 | v1 = a[0]; 27 | v2 = a[1]; 28 | 29 | If we combine the two loads, we then can't perform the load forwarding from the visible store. 30 | 31 | Proposal 32 | 33 | There are really two potential proposals. The choice between them comes down to the question of whether we wish to support full width atomicity for vectors and FCAs as well. Personally, I lean towards the first just due to there being less work, but won't object if the community as a whole thinks the second is worthwhile.
34 | 35 | Proposal 1 - No full width atomicity 36 | 37 | Interpret the existing atomic keyword on load or store as meaning either "full width atomic" or "element wise atomic" based on the type being loaded or stored. This would have the unfortunate implication that we can't canonicalize from vector to integer (or vice versa). It also involves the potential for some confusing code, but I think that can mostly be abstracted behind some carefully chosen helper routines on the instruction classes themselves. 38 | 39 | The advantage of this proposal is that a) it requires no change to bitcode or IR, b) it's straightforward to implement, and c) it could be extended into the second at a later time if needed. 40 | 41 | Proposal 2 - Both full width and element wise atomic vectors 42 | 43 | This would require both a bitcode format change, and an IR change. 44 | 45 | For the IR change, I'd tentatively suggest something like: 46 | load element_atomic <4 x i32>, <4 x i32>* %p unordered, align X (element wise atomic) 47 | load atomic <4 x i32>, <4 x i32>* %p unordered, align X (full width atomic) 48 | 49 | I haven't really investigated the bitcode change. I have little experience in this area, so if someone with experience wants to make a suggestion on best practice, feel free. 50 | 51 | The advantage of this proposal is in the generality. The disadvantage is the additional implementation complexity, and the conceptual complexity of supporting both full width and element wise atomics. 52 | 53 | 54 | Backend Implications 55 | 56 | TBD - representation in MMO 57 | TBD - initial simple lowering 58 | TBD - testing 59 | -------------------------------------------------------------------------------- /falcon-compiler.rst: -------------------------------------------------------------------------------- 1 | Zing Falcon 2 | ----------- 3 | 4 | Over the last few years, I helped to develop the Falcon compiler for the Azul Zing VM. Somewhat unfortunately, there's fairly little public information on the project.
(Marketing has never exactly been Azul's strength.) This page exists to consolidate known public information along with some basic factual overview. 5 | 6 | What is it? 7 | ------------ 8 | Falcon is an LLVM-based in memory compiler for Java bytecode to x86-64 assembly. It is the top tier compiler for the Azul Zing VM. It is a heavily optimizing just in time compiler which is fed with high quality profiling information collected in lower tiers of execution. 9 | 10 | Falcon's strength is the ability to emit high quality code, resulting in runtime performance for workloads on Zing which absolutely trounces the competition. Almost more interestingly, because it uses the same underlying compiler technology as Clang, Falcon can generate high performance code for just about anything Clang can. With Falcon, it is routine to see common loop idioms vectorized natively for the platform of execution. 11 | 12 | Falcon's largest weakness is compile time. It's very much not a classic just in time compiler design; there's a reason I called it an "in memory compiler" above. While the team has done a lot to mitigate this (which I can't talk about here), there do remain cases where this is observable. 13 | 14 | Public Talks 15 | ------------- 16 | 17 | The best intro talk is probably the keynote that I gave at the 2017 LLVM Developers Meeting. Beyond that, there have been a number of talks presented by members of the team over the development life cycle. I'm listing the ones I know of in reverse chronological order. 18 | 19 | * An Update on Optimizing Multiple Exit Loops. Philip Reames. Tech Talk, 2020 LLVM Virtual Developers Meeting. 20 | * Control-flow sensitive escape analysis in Falcon JIT. Artur Pilipenko. Tech Talk, 2020 LLVM Virtual Developers Meeting. 21 | * Truly Final Optimization in Zing VM. Anna Thomas. Tech Talk, JVM Language Summit 2018. 22 | * Falcon: An optimizing Java JIT.
Philip Reames, Keynote, 2017 LLVM Developers Meeting 23 | * A Quick Intro to Falcon. Philip Reames. Lightning Talk, JVM Language Summit 2017. 24 | * Expressing high level optimizations within LLVM. Artur Pilipenko. Tech Talk, 2017 European LLVM Developers Meeting. 25 | * LLVM for a managed language: what we’ve learned. Sanjoy Das and Philip Reames. Tech Talk, 2015 LLVM Developers Meeting 26 | * Supporting Precise Relocating Garbage Collection in LLVM. Sanjoy Das and Philip Reames. Tech Talk, 2014 LLVM Developers Meeting 27 | 28 | Other Citable Stuff 29 | ------------------- 30 | `Introducing the Falcon JIT Compiler `_ 31 | 32 | See also the blog posts linked to the launch at the bottom. 33 | 34 | `Using ZVM with the Falcon Compiler `_ 35 | 36 | These are the public docs for the Zing product. Here you can find the command line options to see the underlying LLVM IR - mostly interesting for compiler folks. This is also the only public place I can find which gives the original release date (Dec 2016) and on-by-default date (March 2017). Be cautious of the docs in general, though; they're frequently hilariously out of date and sometimes just wrong. 37 | 38 | `Truly Final Optimization in Zing VM `_ 39 | 40 | A nice written version of the 2018 JVMLS talk mentioned above. 41 | -------------------------------------------------------------------------------- /ismm2020.rst: -------------------------------------------------------------------------------- 1 | ISMM 2020 Notes 2 | =============== 3 | 4 | Trying something new. As ISMM 2020 is a virtual conference this year, I'm listening in to all the talks and making notes as I go. `YouTube Live Stream `_. `Proceedings `_. 5 | 6 | Verified Sequential Malloc/Free 7 | ------------------------------- 8 | 9 | Separation logic based proofs, described in a DSL for a proof tool I am unfamiliar with.
Basic strategy is to do a source level proof w/proof annotations stored separately and rely on CompCert (a verified C compiler) to produce a verified binary. The library verified is a malloc library I'm unfamiliar with; unclear how "real" this code is. Verification work was done manually by the author. The first couple of slides do nicely describe the strength of separation logic for the domain and some of the key intuitions. 10 | 11 | Alligator Collector: A Latency-Optimized Garbage Collector for Functional Programming Languages 12 | ----------------------------------------------------------------------------------------------- 13 | 14 | Partially concurrent collector for GHC. Presentation is somewhat weak for an audience familiar with garbage collection fundamentals. Collector sounds fairly basic by modern Java standards, but it makes for an interesting experience report. The design used isn't a bad starting point for languages without a mature collector. 15 | 16 | The question at the end about implications of precompiled binaries is interesting. In particular, the acknowledged advantage of a load barrier and partially incremental collection. The answer mentioned "customer was strict" about that requirement, which provides some context on why the design evolved in the way it did. 17 | 18 | Understanding and Optimizing Persistent Memory Allocation 19 | ---------------------------------------------------------- 20 | 21 | Focus is on writing crash atomic (i.e. exception safe) allocation code for persistent memory. The approach taken is to persist only heap metadata sufficient to implement a conservative (ambiguous) GC. Previous approaches referenced appear to be analogous to standard persistence techniques for databases (i.e. commit logs). Positioning is described most clearly in the conclusion slide. Questions at the end are the clearest part of the talk. 22 | 23 | The contribution of this paper appears fairly minor conceptually.
There's a passing mention of a first lock-free allocator, but the only part actually discussed is persisting only heap metadata and reconstructing the rest on recovery. Previous work had already used conservative GC as a fallback. The choice to use a handlized heap is understandable, but has unfortunate performance implications. 24 | 25 | An interesting idea to explore in this area would be what a relocating collector looks like in this space. One way to avoid the need for handlization is to support pointer fixup. You could track previously mapped addresses and, as long as those don't overlap, you can interpret all previous pointers with a load barrier. Technically, you can do fixup without mark or relocate, but if you're paying the semantic costs to support fixup you might as well get the fragmentation benefits. 26 | 27 | 28 | 29 | Exploiting Inter- and Intra-Memory Asymmetries for Data Mapping in Hybrid Tiered-Memories 30 | ------------------------------------------------------------------------------------------ 31 | 32 | Starts with a nice overview of the problem domain. Super helpful as I know little about this area! 33 | 34 | Approach appears to basically use profiling of first access to classify memory as "likely read heavy" and "likely write heavy". There's also a bit about "near and far", but that went over my head. There's a profile mechanism in the page fault handler (presumably in hardware) which classifies. On first access to a new page, this *instruction* profile is used to classify the new page. 35 | 36 | Key expectation seems to be that individual instructions are either read heavy or write heavy. This seems reasonable, but it's not clear to me why you need a dynamic profile for this. The instruction encoding seems to tell you this. Maybe the dynamic part is needed for near and far? I didn't follow that part. 37 | 38 | This is far enough out of my area that I'm not following details. If you're interested in the topic, I highly recommend listening to the talk directly.
39 | 40 | 41 | Prefetching in Functional Languages 42 | ------------------------------------ 43 | 44 | Nicely explained problem domain and problem statement. 45 | 46 | I am very skeptical of the value of prefetching on modern hardware; the talk tries to justify the need on OOO hardware, and while not wrong, I think it's overstating the problem. Particularly with a GC which arranges objects in approximate traversal order (singly linked lists are trivial to get right), I don't see a strong value here. (Ah, the later performance comparison uses ARM chips and Knights Landing... I just showed my mainline Intel bias.) 47 | 48 | After listening to the talk, it's really not clear what the contribution was. Most of the discussion is generic prefetch discussion. Did they add a prefetch instruction to OCaml? That's pretty basic. Was this a characterization study? Was there some form of compiler support? 49 | 50 | It is a nice presentation on the basics of prefetching and the performance tradeoffs thereof. 51 | 52 | Garbage Collection Using a Finite Liveness Domain 53 | ------------------------------------------------- 54 | 55 | Basic notion is to use a heap analysis to describe the reachable objects from each reference. Standard GC is reachability based, which simply means we use the conservative trivial analysis (i.e. everything is live if there's an edge to it). The general problem with this approach is scalability; this paper tries to approach that with a restricted set of analysis results. The basic framing appears to be: given a tree, which subtrees are live? The set of subtrees is restricted to first level, left recursive, right recursive, and all. 56 | 57 | A thought: has anyone considered reversing the liveness result? If, when scanning the stack, you used the liveness analysis to break references to dead objects, this *must* be semantics preserving. At that point, a standard reachability GC can produce the refined results.
Actually, given that, isn't the result equivalent to a complicated DSE which inserts code which only runs at the GC point? Maybe a compiler could support this with a DSE and a gc closure which runs at stack scan time? (Clarification: The inversion of the analysis result works for single threaded programs or unescaped objects in concurrent languages, but not for objects accessible by multiple threads.) (Ian asked the same question :) and the author provided a nice answer about AA I'd missed.) 58 | 59 | ThinGC: Complete Isolation With Marginal Overhead 60 | -------------------------------------------------- 61 | 62 | Objects are grouped into hot and cold pages. All access is to hot pages; pages are moved to the hot region if needed to satisfy a memory access. (This seems an odd choice and has obvious parallels to compressed heap ideas which have been explored in previous work.) Implementation builds on ZGC, but that appears to be irrelevant to the main focus of the work. 63 | 64 | Question: Why not have the cold heap be a subset of the old gen? Allows reuse of the card mark instead of a separate remembered set. A: ZGC is single generational. 65 | 66 | Application study looks very promising. This is probably the key contribution from the paper. Reheating percentage is low for most applications, but high for a few. This is a problem for the approach. I was unhappy that this wasn't further discussed. I asked a question on this at the end, and didn't get the sense this had really been explored. 67 | 68 | As a result of the variation in reheating, also observed very wide variations in performance and pattern. Also a verbal mention of run-to-run variation due to memory arrangement, but this wasn't justified or expanded upon. (I'm mildly suspicious of this without evidence.) 69 | 70 | My questions: 71 | * What was the motivation for requiring reheating rather than directly accessing the cold objects? Did you evaluate the implications of this choice?
A: plan is to use very slow memory for cold memory; current work does not. 72 | * Did you explore the patterns which tended to cause reheating? Is there any idiomatic pattern which might influence the design? A: no information 73 | 74 | My takeaway: no strong conclusions; haven't either proved or disproved the idea of hot/cold separation; too tied to the decision around reheating. Maybe the paper has more useful details? 75 | 76 | 77 | Improving Phase Change Memory Performance with Data Content Aware Access 78 | ------------------------------------------------------------------------- 79 | 80 | Skipped this session. 81 | 82 | Keynote: Richard Jones 83 | ----------------------- 84 | 85 | Schedule conflict, will watch later. 86 | 87 | -------------------------------------------------------------------------------- /linux-perf-jit.rst: -------------------------------------------------------------------------------- 1 | This document discusses how a JIT compiler (or other code generator) can generate symbol information for use with the Linux perf utility. This is written from the perspective of a JIT implementor, but might be of some interest to others as well. This is mostly a summary of information available elsewhere, but at the time of writing I couldn't find such a summary. 2 | 3 | There are two major mechanisms supported by perf for getting symbols for dynamically generated code: perf map files, and jitdump files. 4 | 5 | Perf Map 6 | A perfmap is a textual file which maps address ranges to symbol names. It has no other content and does not on its own [1]_ support disassembly or annotation. 7 | 8 | I couldn't find a formal description of the format anywhere, but the format appears to have one entry per line of the form "<startaddr> <size> <symbolname>", with the start address and size in hex. The file must be named /tmp/perf-<pid>.map, where <pid> is the pid of the process containing the generated code. This pid is recorded in the perf.data file, so offline analysis is supported. You can copy these files between machines if needed.
There's no graceful handling of pid collisions, and no mechanism to clean up old perfmap files that I could find. 11 | 12 | The format does not support relocation of code, or recycling of memory for different executable contents. What happens if you have overlapping ranges in a file is unspecified. 13 | 14 | Jitdump 15 | The jitdump format is a binary format which covers a much broader range of use cases. There is a `formal spec `_ for the binary format. In addition to basic symbol resolution, jitdump supports disassembly/annotation and code relocation. 16 | 17 | To work with jitdump files, you have to use "perf inject" to produce a perf.data file which contains both the raw perf.data and the additional jitdump information. The location of the jitdump file on disk is not documented, and I haven't yet tracked it down. Once injected, the combined perf.jit.data file can be moved to other machines for analysis. 18 | 19 | From what I can tell, jitdump provides a strict superset of the functionality of perf map files. Despite this, perf map appears to be much more widely used. It's not clear to me why this is true; the main value I see in perf map files is that they're easy to generate by hand or by simple scripting from other information sources. 20 | 21 | Example Command Sequences 22 | -------------------------- 23 | 24 | perf map:: 25 | 26 | perf record 27 | # remember to copy the /tmp/perf-<pid>.map along with the perf.data file 28 | # if analyzing off box 29 | perf report 30 | 31 | jitdump:: 32 | 33 | perf record 34 | perf inject --jit -i perf.data -o perf.jit.data 35 | # move perf.jit.data around if needed 36 | perf report -i perf.jit.data 37 | 38 | Potential Gotchas 39 | ----------------- 40 | 41 | Support for both mechanisms was added to perf relatively recently. As perf is version locked to the kernel of the system, this implies that currently supported releases of some older distros contain perf versions which don't support either feature.
Unfortunately, there's no graceful error reporting; symbols simply fail to load. I ended up resorting to grepping through strace output to confirm whether the perf binary I was using tried to load the perf map file. This was the only conclusive way I found to distinguish between malformed perf map files and versions of perf lacking support. 42 | 43 | For some reason I've yet to understand, certain types of memory regions cause perf to fail to collect symbolizable traces. This is particularly confusing as inspecting the raw data file with perf script and/or perf report shows addresses which map to valid entries in a perf map file, but for some reason the way memory was obtained during the perf record run affects symbolization. I've only seen this with perf maps; I haven't tried the same experiment with a jitdump setup just yet. 44 | 45 | Useful References 46 | ------------------ 47 | 48 | The wasmtime folks have a `nice description `_ of using a jitdump based mechanism from a user perspective. 49 | 50 | Brendan Gregg has a post on using perf map files to generate `flame graphs for v8 `_. He also has a lot of other generally awesome perf stuff, but most of it is focused on statically compiled code. 51 | 52 | `perf-map-agent `_ and `perf-jitdump-agent `_ are useful examples of how to generate the corresponding file formats. These are JVMTI agents for Java, one for each of the corresponding workflows. 53 | 54 | Footnotes 55 | ---------- 56 | 57 | .. [1] Several of the perf commands allow you to provide an alternate path to the objdump binary. If you have an alternate source of disassembly for some of the methods named in the perf map file, you can write a shim script which wraps the real objdump, intercepts the disassembly request sent to objdump for a particular symbol name, and provides the alternate disassembly.
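As a sketch of the shim idea from footnote [1]: perf invokes objdump with ``--start-address=``/``--stop-address=`` flags bracketing the code it wants disassembled, so a shim can key off the start address. Everything else here (the directory layout, the function packaging) is invented for illustration; a real shim would be a standalone script handed to e.g. ``perf annotate --objdump=``::

```shell
# Illustrative objdump shim (directory layout here is made up).
objdump_shim() {
  alt_dir="${ALT_DISASM_DIR:-/tmp/jit-disasm}"
  for arg in "$@"; do
    case "$arg" in
      --start-address=*)
        start="${arg#--start-address=}"
        # If we have alternate disassembly for this code range, emit it.
        if [ -f "$alt_dir/$start.txt" ]; then
          cat "$alt_dir/$start.txt"
          return 0
        fi
        ;;
    esac
  done
  # No alternate disassembly known: delegate to the real objdump.
  objdump "$@"
}
```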
58 | 59 | -------------------------------------------------------------------------------- /llvm-board-2020.rst: -------------------------------------------------------------------------------- 1 | This is the draft form of my 2020 application to be on the LLVM Foundation board. If successful, the application becomes public, so I've decided simply to publish it from the beginning. My application was not accepted. 2 | 3 | Name (First and Surname) * 4 | -------------------------- 5 | Philip Reames 6 | 7 | Please summarize relevant contributions to the LLVM Project. These can include things such as patches, code ownership, bug reports, mailing list posts, blog posts, volunteering at LLVM events, organizing LLVM social events, contributing to the developer meeting paper committee, etc. 8 | ------------------------------------------------------------------- 9 | 10 | LLVM contributor since 2013. Major areas of (code) contribution: 11 | 12 | * Main author of gc.statepoint infrastructure, and current de facto code owner of garbage collection support in LLVM 13 | * Extensively contributed both directly and indirectly (through coworkers) to SCEV, IndVars, and much of our loop canonicalization infrastructure. Recently became code owner. 14 | * (Incrementally) rewrote most of LazyValueInfo and CorrelatedValuePropagation. 15 | * Led effort to optimize unordered atomic loads and stores throughout the middle end and x86 backend. Contributed ~50% of the code directly. 16 | * Many smaller contributions throughout the middle end optimizer and (to a much lesser degree) the x86 backend. 17 | 18 | Non Technical Contributions 19 | 20 | * Member of the recently established LLVM Security Group. 21 | * Contact point for all Azul contributions to LLVM. Representing LLVM internally (e.g. relicensing, etc.), and organizing organizational contributions upstream, including internally focused developer education around LLVM community norms and processes.
22 | * Presenter at multiple LLVM Developer Conferences including one of the 2017 keynotes on the Falcon project. Recently, most of my focus at developer meetings has been on the hallway track conversations, and the de facto "frontends for dynamic languages" working sessions which happen each year (whether formally organized or not). 23 | * Wrote the "Performance Tips for Frontend Authors" and "LLVM Loop Terminology (and Canonical Forms)" doc pages. Also contributed to a bunch of other docs in smaller one-off changes. 24 | 25 | Indirect Contributions 26 | 27 | * The points that follow are things which Azul does which I have had some role in steering. Most of the work on these has not been my own, and others should get all the credit for making things actually happen. :) 28 | * Fuzzing, regression tracking, and quality improvements - We run one of the only large fuzzer deployments which actually runs generated code. As a result of this, we catch a disproportionate fraction of miscompiles. We deliberately lag ToT by a few days so that our time and energy is spent on the harder subtle issues. In addition to the normal "please revert patch X" cases, we've also found a number of deep and interesting bugs in core passes. My favorite to date was the fuzzer finding incorrect nsw/nuw flag handling in GVN which had been present for almost a decade. 29 | * Falcon (our LLVM based compiler for Java bytecode) demonstrated that it was possible to develop compilers for non-C family languages on LLVM, and achieve performance which beat existing state of the art approaches. In the process of doing so, we fixed a number of issues, documented many of the items we stumbled across, and publicly discussed most of the key design elements of our approach (including our mistakes). I'd like to think this has positively impacted the broader LLVM ecosystem. 30 | 31 | Why do you want to be on the LLVM Foundation board of directors?
32 | ----------------------------------------------------------------- 33 | 34 | I believe that the LLVM project has become a core piece of infrastructure and investment is needed accordingly. I personally greatly appreciate various aspects of the community (e.g. professionalism, creative tension between pragmatism and perfectionism, and a refusal to get lost in bikeshedding), but also see stress points forming as the community scales (e.g. infrastructure, decision making, review fragmentation). I want to ensure the project continues to scale without losing the aspects which have made it such a wonderful ecosystem in which to work these last few years. 35 | 36 | What experience or skills can you bring to the board? Which of the above programs could you help drive forward? 37 | ------------------------- 38 | 39 | Helped to establish, and fundraise for, the initial New Haven Pride Center scholarship fund (https://www.newhavenpridecenter.org/youth/scholarship/). That initial fund has now developed into five distinct scholarship funds with a total of six annual awards. 40 | 41 | The areas I'm most interested in contributing towards are scholarship grants, education opportunities for students getting started in the community (particularly students from non-traditional backgrounds), and support of common project infrastructure. I will also contribute in areas outside those foci, but they're the ones of most personal interest to me. 42 | 43 | 44 | 45 | We value diversity and representation of the various interested groups working on LLVM and using it. Do you consider yourself representative of a minority group, underrepresented geographic region, etc? 46 | ----------------------------------------- 47 | No. 48 | 49 | 50 | Which program are you most interested in supporting?
51 | ----------------------------------------------------- 52 | 53 | Educational Outreach 54 | 55 | Diversity & Inclusion in Compilers and Tools 56 | 57 | **Grants & Scholarships** 58 | 59 | Infrastructure Support 60 | 61 | What is your second choice program to support? 62 | ----------------------------------------------- 63 | 64 | Educational Outreach 65 | 66 | Diversity & Inclusion in Compilers & Tools 67 | 68 | Grants & Scholarships 69 | 70 | **Infrastructure Support** 71 | 72 | 73 | How many hours a week can you dedicate to LLVM Foundation business? 74 | Board members are expected to dedicate time to meetings and to the programs. 75 | ----------------------------------------------------------------------------- 76 | 77 | Time availability will vary widely, but a minimum of 2-3 hours and sometimes much more. 78 | 79 | Are you interested in a specific position on the board? 80 | -------------------------------------------------------- 81 | 82 | No 83 | 84 | 85 | Are you willing and able to help fundraise for the LLVM Foundation? We rely on donations to fund our programs and need board members to help find new sponsors and donors. 86 | -------------------------------------------------------------------- 87 | 88 | Yes, with a particular emphasis on 1) trying to establish periodic giving campaigns and otherwise diversify the foundation's funding, and 2) establishing separate dedicated funding sources for scholarships and student travel grants. 89 | 90 | Is there anything else you would like to add for the board to consider? 91 | ------------------------------------------------------------------ 92 | No. 93 | 94 | New this year, we will accept letters of recommendation to support your application. Please have your references send their letter of recommendation directly to us at boardapp@llvm.org. This is totally optional.
95 | ------------------- 96 | 97 | I will not have any letters of recommendation. 98 | -------------------------------------------------------------------------------- /llvm-debugging-basics.rst: -------------------------------------------------------------------------------- 1 | ------------------------------------------------- 2 | LLVM Debugging Tricks 3 | ------------------------------------------------- 4 | 5 | This page is a collection of basic tactics for debugging a problem with LLVM. This is intended to serve as a reference document for new contributors. At the moment, this is pretty bare bones; I'll expand on demand. 6 | 7 | .. contents:: 8 | 9 | Compiler Explorer (i.e. Godbolt) 10 | -------------------------------- 11 | 12 | ``_ is an incredibly useful tool for seeing how different compilers or compiler versions compile the same piece of code. The ability to link to exactly what you're looking at and share it with collaborators is invaluable for asking and answering highly contextual questions. 13 | 14 | 15 | Assertion Builds 16 | ---------------- 17 | 18 | Before you do literally anything else, make sure that you have assertions enabled on your local build. 19 | 20 | LLVM makes very heavy use of internal assertions, and they are generally excellent at helping to isolate a failure. In particular, many things which appear as miscompiles in a release binary will exhibit as an assertion failure if assertions are enabled. 21 | 22 | **Warning:** Few of the commands mentioned in this document will work without assertions enabled! 23 | 24 | As a practical matter, I do not recommend the debug flavors of the builds, but Release with assertions enabled is very, very worthwhile. For context, an assertion enabled release build is around 8GB; the last time I did a debug build, it was around 60GB.
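For reference, assertions are controlled by the ``LLVM_ENABLE_ASSERTIONS`` CMake variable; a typical configure invocation (generator, build directory, and source path being whatever you normally use) looks something like::

    cmake -G Ninja ../llvm \
        -DCMAKE_BUILD_TYPE=Release \
        -DLLVM_ENABLE_ASSERTIONS=ON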
25 | 26 | Capture a standalone reproducer for a clang issue 27 | ------------------------------------------------- 28 | 29 | Clang will attempt to produce a standalone reproducer for the current clang invocation if a crash is encountered during execution. 30 | 31 | ``-gen-reproducer=always`` will enable this reproducer generation for non-crashing compiles. This can be useful for extracting reproducers for an optimization quality problem from a large build system. 32 | 33 | This doesn't always succeed. If it reports failure, check the /tmp directory for a preprocessed file (depending on where it failed), or fall back to trying to create a preprocessed input file via -E. 34 | 35 | Capture IR before and after optimization 36 | ---------------------------------------- 37 | 38 | ``-S -emit-llvm`` will cause clang to emit an .ll file. This will contain the result of mid-level optimization, immediately before the invocation of the backend. 39 | 40 | ``-S -emit-llvm -disable-llvm-optzns`` will cause clang to emit an .ll file and *skip* optimization. Note that this is often different than the result of ``-S -O0 -emit-llvm`` as the latter embeds ``optnone`` attributes in the IR. 41 | 42 | 43 | Capture IR before or after a pass 44 | --------------------------------- 45 | 46 | ``-mllvm -print-before=loop-vectorize -mllvm -print-module-scope`` will print the IR before each invocation of the pass "loop-vectorize". (As it happens, there's only one of these in the standard pipeline.) The resulting output will be valid IR (well, with a header you need to remove) which can be fed back to "opt" to reproduce a problem. There's also an analogous ``-print-after=`` option. 47 | 48 | If you want to trace through execution, ``-mllvm -print-after-all`` can also be useful, but be warned, this is very, very verbose. Piping it to a file and searching through it with a decent text editor is likely your best bet.
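Putting the above together, a typical round trip through ``opt`` (the pass name and file names here are just examples) looks something like::

    clang -O2 -c foo.c -mllvm -print-before=loop-vectorize \
        -mllvm -print-module-scope 2> dump.ll
    # delete the "*** IR Dump Before ... ***" header line from dump.ll, then:
    opt -passes=loop-vectorize -S dump.ll

Note that the dump goes to stderr, hence the ``2>`` redirect.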
49 | 50 | See what a pass is doing 51 | ------------------------ 52 | 53 | ``-mllvm -debug-only=loop-vectorize`` turns on the internal debug tracing of the pass. This can be very insightful when read in combination with the source code of the pass in question. 54 | 55 | ``-mllvm -pass-remarks=*`` turns on the pass remarks mechanism which is intended to be more user facing. My experience is that these are generally not really useful, and that the filtering mechanism doesn't work well. Mostly relevant when looking for missed optimizations. 56 | 57 | 58 | opt and llc 59 | ------------ 60 | 61 | These tools are your friends. You can pass IR to opt to exercise any mid-level optimization. You can use llc to exercise the backend. 62 | 63 | Many of the commands listed previously can be passed to opt or llc by simply omitting the ``-mllvm`` prefix. 64 | 65 | llvm-reduce and bugpoint 66 | ------------------------ 67 | 68 | These tools provide a fully automated way to reduce an input IR program to the smallest program which triggers a failure. Reducing crashes or assertion failures is pretty straightforward; reducing miscompiles is quite a bit trickier. 69 | 70 | Alive2 71 | ------ 72 | 73 | Alive2 is a tool for formally reasoning about LLVM IR. There is a web instance available at ``_. This is a great tool for quickly checking if an optimization you have in mind is correct. 74 | 75 | You can also download and build alive2 yourself, and it has a lot of useful functionality for translation validation. This can be very useful when tracking down a nasty miscompile, but is very much an advanced topic. 76 | -------------------------------------------------------------------------------- /llvm-gc-retrospective-2019: -------------------------------------------------------------------------------- 1 | [DRAFT] A retrospective on GC support in LLVM, and some proposed tweaks 2 | 3 | WARNING: This is very much in rough draft state. Please don't cite until completed and sent to llvm-dev.
4 | 5 | As some of you may remember, a few years back I led an effort to extend LLVM with support for fully relocating garbage collection in the form of the gc.statepoint family of intrinsics. In the meantime, we've successfully shipped a compiler for a VM with a fully relocating collector using this support, and gained a bunch of practical experience with the design. This has led me to revise my thinking quite a bit, and I want to both summarize my lessons learned, and propose some small changes. 6 | 7 | Background and Retrospective 8 | 9 | As a reminder, the gc.statepoint mechanism involved three major parts: 10 | 1) The first was the introduction of the abstract machine model (enabled by non-integral pointer types) followed by a lowering to the physical machine model. The abstract machine model is relocation independent - at the cost of disallowing a few operations like addrspacecast and inttoptr/ptrtoint, and typing memory. 11 | 2) The second was the gc.statepoint representation itself. This representation was designed to make all relocations explicit in the physical machine model and was designed from the beginning to make integration with the register allocator feasible. 12 | 3) The third was the notion of late safepoint insertion. We ended up having to abandon this ourselves due to a need to support deoptimization at the same set of safepoints, but there's still some (unused?) support for this upstream today. 13 | 14 | I think it's safe to say that the abstract machine model has proven itself. The abstract machine model allows optimization passes to reason about references to objects (as opposed to raw pointers) and makes things like LICM and other loop opts straightforward. Without the abstract machine model, we'd have needed to make invasive changes across the entire optimizer. Instead, the changes which have been made have been pretty isolated in scope and well contained.
Probably the biggest problem with the current implementation (as opposed to the model) is that RewriteStatepointsForGC (the lowering to the physical model) doesn't actually remove the non-integral marker from references; it just happens to work out in practice. 15 | 16 | As I already noted, the late safepoint insertion mechanism turned out not to be practical for our use case. We had to be able to materialize an abstract machine state at each safepoint, and in the end, that required having safepoint locations (but not relocation semantics) explicit in the IR from the very beginning. I still think that late safepoint insertion could be very practical for a statically compiled language w/o deoptimization, but this is really unproven to date. 17 | 18 | Statepoints, well, that's where some of the lessons learned come in. There were a couple of major assumptions which went into the design: 19 | 1) There was a strong assumption that the quality of safepoint lowering *mattered*. Everyone we talked to who'd done this before in other compilers seemed to agree that performance was going to be bottlenecked by the safepoint lowering, and thus it was worth a lot of effort to get right. This drove the whole notion of representing a set of defs explicitly via gc.relocates. 20 | 2) Somewhat following from the previous, we believed that first class support for derived pointers (both interior and exterior to their associated object) was a necessity. (If we didn't have first class support, you can always rematerialize an arbitrary derived pointer after a safepoint via an "offset = derived-base, relocate(base), add offset" idiom. This is exactly what the GC would do internally.) 21 | 22 | To my utter and complete surprise, it's turned out that both assumptions have been at least partially incorrect in practice. 23 | 24 | At least on X86, the ability to fold complex addresses into using instructions really diminishes the value of first class derived pointer support.
I'm not saying that there's no value to a fully regalloc integrated derived pointer aware stackmap, but I've seen few cases where it's obviously profitable - that may be largely due to the next point. 25 | 26 | The key surprise for me has been that while quality of safepoint lowering has mattered somewhat, it hasn't been anywhere near the peak performance limiter expected. Instead, it's been mostly a matter of code size and slowpaths for which the current design is somewhat poor. The key observations are that: 27 | 1) Any safepoint poll involves a fast and slow path, but *only the slowpath needs a stackmap*. As such, the lowering required for the slowpath can be fairly terrible as long as it doesn't perturb the fastpath too much. 28 | 2) After optimization, safepoint polls are dynamically rare. They're not as rare statically, but that's almost entirely due to deoptimization side effects. 29 | 3) Any hot call safepoint is generally an optimization opportunity for either devirtualization or inlining. It's really really rare to see a truly megamorphic (i.e. not N morphic for small N) call on a fastpath, or a large function with many hot callers. 30 | 31 | Putting these observations together, there's been little motivation to work on improving the statepoint lowering. As a result, we've never actually achieved the planned register allocation integration and are still shipping what was expected to be a stop gap implementation which does a poor man's register allocator in SelectionDAG. It's only been recently that we've really started giving improving this state of affairs some serious thought, and the motivation to do so has mostly been driven by compile time and code size, not performance of the slowpath code.
I don't have any hard numbers, but I suspect this to be one of the largest contributors to codegen time. It's not uncommon to see methods with a large fraction of total instructions being gc.relocates. 34 | 35 | So, I think it's fair to ask whether given what we know now, would we reimplement gc.statepoint the same way? Honestly, I think the answer has to be no. 36 | 37 | Now what? 38 | 39 | So, am I about to propose we remove gc.statepoint? No, I'm not. It works well enough, and there's no strong incentive to switch away for anyone who has built a working implementation on top of it. 40 | 41 | However, I do think we should consider reframing our documentation to default to an alternative, and updating the lowering for the abstract machine model to support that alternative. 42 | 43 | What is that alternative you ask? Well, gc.root. (For those of you who were around through the initial rounds of discussion, yes, there's quite some irony here.) I think we do need to make a slight tweak to the gc.root implementation though. 44 | 45 | For those who don't know, the gc.root representation works as follows: You explicitly create a stack slot (alloca), escape it via a special gc.root intrinsic, spill at the def of the variable, and reload after every possible safepoint. The one real subtlety is that for this to be correct for a relocating GC, no function reachable from a function using gc.root can be readonly *or inferred readonly by the optimizer*. Annoyingly, this issue only exists for relocating GCs, but it basically means that today, if you use gc.root you have to be *very* careful in ensuring none of your functions can be inferred readonly. Here's the problematic example: 46 | 47 | %a = alloca i8* 48 | call void @gc.root(i8* %a) 49 | store i8* %myptr, i8** %a 50 | call void @readonly() 51 | %newptr = load i8*, i8** %a 52 | 53 | GC.root relies on the capture of %a to force the optimizer to believe the call might clobber %a (for a relocating collector, it does).
However, since we've inferred readonly, that fact doesn't hold. (I've used readonly for the description, but you can create the same issue with writeonly, or argmemonly.) 54 | 55 | Now, I want to be careful to stop here and emphasize that gc.root can be used entirely correctly. It just imposes a subtle invariant on the module as a whole which is error prone. 56 | 57 | With a small tweak, we can remove this subtlety. Since gc.root was introduced, we've added the notion of operand bundles. One of the key semantics to operand bundles is that they can model memory semantics of the callsite independent of the callee. As such, we can use an operand bundle to avoid the subtlety around readonly calls. Here's what a revised example would look like: 58 | 59 | %a = alloca i8* 60 | call void @gc.root(i8* %a) 61 | store i8* %myptr, i8** %a 62 | call void @readonly() ["gc-root" (%a)] 63 | %newptr = load i8*, i8** %a 64 | 65 | This results in a much cleaner and easier to describe semantic model w/o the global module invariant. 66 | 67 | Summary 68 | 69 | So, what all am I proposing? I'm proposing that we: 70 | 1) add the "gc-root" operand bundle type, and update all the gc.root documentation to use it. 71 | 2) add support for gc.root as a target of RewriteStatepointsForGC (i.e. as a supported physical model when lowering the abstract model). 72 | 3) update the documentation to default to gc.root, and describe statepoint as an alternative. 73 | 4) update the documentation to encourage new frontends to lower directly to gc.root, then come back to the abstract machine model later. 74 | 75 | The last point needs a bit of justification. From talking to a number of folks at conferences, getting something working with existing GC infrastructure is generally the biggest problem, and the indirection through the abstract model seems to really confuse folks. I've witnessed a couple of projects stall here, and I'd like to avoid that going forward.
76 | 77 | 78 | 79 | 80 | -------------------------------------------------------------------------------- /llvm-norms.rst: -------------------------------------------------------------------------------- 1 | ------------------------------------------------- 2 | LLVM Norms, Terminology, and Expectations 3 | ------------------------------------------------- 4 | 5 | 6 | This page is a collection of things I find myself repeatedly needing to explain to new developers. This is not official project documentation; it is my take on each issue. Most of this is likely to agree broadly with what other long term contributors might say, but details in perspective may differ. This is a perpetual WIP - it is extended when I find myself repeating myself, and is in no way a complete guide. 7 | 8 | Let me start by introducing myself in case you're not already familiar with my work in the project. I am a long standing contributor to the LLVM project. I've contributed heavily to the mid level optimizer, and to a lesser extent parts of the X86 backend. I was the technical lead for the Falcon JIT - an LLVM based just in time compiler for Java bytecode. I've managed a team of LLVM contributors, and been responsible for maintaining a long-lived downstream distribution of LLVM. As such, I have a fairly broad perspective on what it takes to participate in the upstream community successfully, while still shipping downstream product. 9 | 10 | .. contents:: 11 | 12 | What does LGTM mean? 13 | -------------------- 14 | 15 | "LGTM" literally means "looks good to me", but there's a bunch more cultural context behind it. LLVM requires pre-commit review for most changes. For new contributors, *all* changes will require precommit review. Having an established contributor LGTM a change is the gate which has to be cleared before a change can land. 16 | 17 | While most reviews these days use phabricator, we're not always good about marking reviews approved through the UI.
A textual LGTM is what matters, not whether the review has been approved in the UI. 18 | 19 | Let me emphasize that LLVM is a single approval culture. This means that once a knowledgeable reviewer has approved a patch, you do not need to wait for further reviewer approval. You do need to use reasonable judgement here though. If another reviewer has raised concerns, you probably want to wait until they've had a chance to reply to any changes before landing. 20 | 21 | One point which confuses a bunch of new contributors is that LLVM reviewers **expect that you have commit rights**. Unless you **explicitly ask** someone to land your change on your behalf, reviewers will assume that you will do so after approval. This comes from the fact that LLVM hands out commit rights much more freely than other open source projects. 22 | 23 | LGTMs w/Conditions 24 | ------------------ 25 | 26 | It's not uncommon to see phrasings such as "LGTM w/comments addressed" or "LGTM w/minor comments". What this means is that once you've addressed the issues identified *as suggested by the reviewer*, you can consider the patch to have received an LGTM without the need for further review. 27 | 28 | This is frequently used by reviewers when the remaining issues with the patch are considered minor and straightforward. If you as an author disagree with how any issue should be handled (e.g. a comment needs discussion), be aware that you don't have an LGTM without further discussion and an explicit re-LGTM by that reviewer (or someone else). 29 | 30 | If the difference in approach is minor, I strongly suggest taking the reviewer's suggestion, landing your patch, and then posting a follow up patch to switch to your preferred approach. This will let all parties make progress, and avoids back and forth on already accepted reviews which has a tendency to get lost.
31 | 32 | Another form of conditional LGTM which comes up regularly is the "LGTM, but wait for @name" or "LGTM, but wait a couple of days in case @name has further comments". These two are interesting precisely because they are *different*, and that subtlety is often lost on non-native speakers. For the first, the reviewer is explicitly asking for a second LGTM. As such, our general "single accept" policy does not apply, and this review is blocked on a second accept by @name. The second is merely instructing you to wait a couple of days before landing so that @name has a chance to chime in if desired. The former blocks commit; the latter does not. 33 | 34 | What are "commit rights"? 35 | -------------------------- 36 | 37 | LLVM grants commit rights much more freely than most other open source projects. However, that's because the implied expectations are very different. In LLVM, having commit rights simply means that you are trusted to take the mechanical action of rebasing and landing an approved patch, and then respond promptly to post commit review. It **does not** change any expectation around precommit review, or imply anything beyond a very basic level of trust. 38 | 39 | Can I commit my change without review? 40 | -------------------------------------- 41 | 42 | As a general rule, unless you have been told otherwise, no. New contributors, in particular, should *never* commit a change without review. 43 | 44 | Beyond that initial state, we have in practice three levels of pre-commit rights. 45 | 46 | First, you'll pretty quickly be asked by reviewers to "pre-commit this test", or "pre-commit this NFC". That means that you can separate out a change which does that (and only that), and submit it without further review. A key point is that this change *was reviewed* in the original review thread. The trust being shown is minor, and mostly mechanical.
47 | 48 | Second, once you've been around for a while and have a sense of normal review flow, you'll reach the point where you have a good sense for what you'll be asked to pre-commit to reduce patch sizes during review. Once you hit that point, checking in tests and NFCs without review (i.e. before posting the using change) is acceptable. Reasonable judgement is expected, lean towards review. 49 | 50 | Third, established contributors will sometimes land "obvious" patches without review. If you're new enough to the community to be reading this guide closely, this is not relevant for you (yet). 51 | 52 | Silence means "No" 53 | ------------------ 54 | As a general rule, silence on a review or RFC means "no". It **does not** mean "no one cares, so go ahead". There is a huge amount of coalition building and discussion which happens offline. If you send out an RFC without talking it through with interested parties first, there is a good chance no one will have the time to read it and respond. 55 | -------------------------------------------------------------------------------- /llvm-riscv-fuzzing.rst: -------------------------------------------------------------------------------- 1 | ----------------------------- 2 | Fuzzing LLVM's RISCV Backend 3 | ----------------------------- 4 | 5 | This document is a collection of notes for myself on attempts at fuzzing LLVM's riscv backend. This is very much a WIP, and is not really intended to be read by anyone else just yet. This is very much a background project, so updates will likely be slow. 6 | 7 | .. contents:: 8 | 9 | Initial Attempt w/libFuzzer 10 | --------------------------- 11 | 12 | I started with libfuzzer because we used to have OSSFuzz isel fuzzing for other targets, and I figured that figuring out the build problem would be easy. Yeah, not so much. 13 | 14 | I was not able to get a working build of libfuzzer with ASAN. 15 | 16 | I found a three stage build approach that "somewhat" worked.
17 | 18 | stage1 - my normal LLVM dev build tree, clang enabled, Release+Asserts, nothing special 19 | 20 | stage2 - "PATH=~/llvm-dev/build/bin/:$PATH CC=clang CXX=clang++ cmake -GNinja -DCMAKE_BUILD_TYPE=Release ../llvm-project/llvm -DLLVM_ENABLE_ASSERTIONS=ON -DLLVM_NO_DEAD_STRIP=ON -DLLVM_USE_SANITIZER=Address -DLLVM_TARGETS_TO_BUILD=X86 -DLLVM_BUILD_RUNTIME=Off -DLLVM_USE_SANITIZE_COVERAGE=On" 21 | 22 | stage3 - "PATH=~/llvm-dev/fuzzer-build-stage1/bin/:$PATH CC=clang CXX=clang++ cmake -GNinja -DCMAKE_BUILD_TYPE=Release ../llvm-project/llvm -DLLVM_ENABLE_ASSERTIONS=ON -DLLVM_NO_DEAD_STRIP=ON -DLLVM_TARGETS_TO_BUILD="X86;RISCV" -DLLVM_BUILD_RUNTIME=Off -DLLVM_USE_SANITIZE_COVERAGE=On -DLLVM_TABLEGEN=/home/preames/llvm-dev/build/bin/llvm-tblgen " 23 | 24 | I could not get the stage3 build to complete if I used the ASANified tblgen. More than that, it seems clang from stage2 doesn't have a functional ASAN. Every attempt at using ASAN from that stage fails. Cause unknown. 25 | 26 | Once I got something which worked, I tried three experiments. 27 | 28 | *Experiment 1 - " ./llvm-isel-fuzzer corpus/ -ignore_remaining_args=1 -mtriple riscv64 -O2"* 29 | 30 | This used an empty corpus with default RISCV64 only (no extensions). Ran for a couple hours, no failures found. 31 | 32 | 33 | *Experiment 2 - "./llvm-isel-fuzzer corpus/ --help -ignore_remaining_args=1 -mtriple riscv64 -O2 -mattr=+m,+d,+f,+c,+v"* 34 | 35 | This used an empty corpus with a number of extensions enabled. Ran for a weekend, no failures found.
36 | 37 | 38 | *Experiment 3 - Ingest test/CodeGen/RISCV as starting corpus* 39 | 40 | ``` 41 | $ cat ingest-one.sh 42 | #set -x 43 | set -e 44 | SOURCE_FILE=$1 45 | 46 | PATH=../build/bin:$PATH 47 | 48 | fgrep "llvm.riscv" $SOURCE_FILE > /dev/null && exit 1 49 | 50 | llvm-as $SOURCE_FILE -o corpus/tmp.bc 51 | llc -O2 -march=riscv64 -mattr=+m,+d,+f,+c,+v -riscv-v-vector-bits-min=128 -o /dev/null < corpus/tmp.bc || exit 1 52 | HASH=$(sha1sum corpus/tmp.bc | cut -f 1 -d " ") 53 | echo "$SOURCE_FILE -> corpus/$HASH.bc" 54 | mv corpus/tmp.bc "corpus/$HASH.bc" 55 | 56 | $ PATH=../build/bin:$PATH find ../llvm-project/llvm/test/CodeGen/RISCV/ -name "*.ll" | xargs -l ./ingest-one.sh 57 | ``` 58 | 59 | This revealed an absolute user interface disaster for libfuzzer. 60 | 61 | Some tests deliberately check things which produce errors. libFuzzer fails if the input corpus contains failures. Since the exact setup is slightly different between the fuzzer binary, and llc, filtering out failures is a basically manual process. I ran out of interest before finding useful results. 62 | 63 | I also found that the public documentation on libfuzzer command line arguments is simply wrong in many cases. Nor does the binary support useful -help output of any kind. 64 | 65 | I consider this effort to have been a failure, and do not currently plan to spend more time on libfuzzer. The fact that I, a longstanding LLVM dev, can't figure out how to actually build the damn thing in a useful way says all too much right there.
66 | 67 | Reference save: 68 | 69 | * https://github.com/google/oss-fuzz/pull/7179#issuecomment-1092802635 70 | * https://github.com/google/oss-fuzz/commit/e0787861af03584754923979e76a243080e7dd96 71 | * https://bugs.chromium.org/p/oss-fuzz/issues/detail?id=27686 72 | * https://oss-fuzz-build-logs.storage.googleapis.com/log-9d455709-b52a-4ee0-8494-4a7e2529d5ff.txt 73 | * https://oss-fuzz-build-logs.storage.googleapis.com/log-f98c7f04-9b4c-43da-9a7c-3559a6a9b3dd.txt 74 | * https://oss-fuzz-build-logs.storage.googleapis.com/log-d902e118-7a63-439a-92f4-31bfaaf374f6.txt 75 | * https://llvm.org/docs/FuzzingLLVM.html (build instructions *do not* work) 76 | 77 | Brute Force via csmith and llvm-stress 78 | -------------------------------------- 79 | 80 | I have written some simple shell scripts to do brute force fuzzing of clang and llc targeting RISCV driven by csmith and llvm-stress respectively. As of this update, I have run ~100k unique csmith tests and ~10m unique llvm-stress tests. So far, I have found no compiler crash bugs. I am not running the resulting code, so I may have missed execution bugs. 81 | 82 | 83 | AFL - Upcoming 84 | ---------------- 85 | 86 | I'm planning to spend some time playing with afl-fuzz driving llc. My hope is that the user interface is more practically approachable. In theory, the fuzz rate will be lower due to the need to fork and instrument, but so be it. 87 | 88 | 89 | 90 | Other ideas to investigate 91 | -------------------------- 92 | 93 | Using JavaFuzzer for C++? Or maybe the approach John Regehr's student is using with great success on AArch64 right now? 94 | 95 | Rather than just fuzzing for crashes, fuzz using alive2 for miscompiles? Harder for backend, but maybe through IR phase? Or just find problems which depend on target hooks?
96 | 97 | 98 | 99 | 100 | 101 | 102 | -------------------------------------------------------------------------------- /llvm-riscv/ScalarCodeGen.rst: -------------------------------------------------------------------------------- 1 | ------------------------------------------------- 2 | Open Items in Scalar Codegen for RISCV 3 | ------------------------------------------------- 4 | 5 | .. contents:: 6 | 7 | 8 | Items in LLVM issue tracker 9 | ============================ 10 | 11 | * [SelectionDAGISel] Mysteriously dropped chain on strict FP node. `#54617 `_. This appears to be a wrong code bug for strictfp which affects RISCV. 12 | * Unaligned read followed by bswap generates suboptimal code `#48314 `_ 13 | 14 | 15 | Code Size 16 | ========= 17 | 18 | A general view that RISCV code size has significant room for improvement has been aired in recent LLVM RISC-V sync-up calls, but no specifics are currently known. 19 | 20 | 2022-07-11 - I spent some time last week glancing at usage of compressed instructions. Main take away is that lack of linker optimization/relaxation support in LLD was really painful code size wise. We should revisit once that support is complete, or evaluate using LD in the meantime. 21 | 22 | 23 | Branch on inequality involving power of 2 24 | ========================================= 25 | 26 | For the compare: 27 | %c = icmp ult i64 %a, 8 28 | br i1 %c, label %taken, label %untaken 29 | 30 | We currently emit: 31 | li a1, 7 32 | bltu a1, a0, .LBB0_2 33 | 34 | We could emit: 35 | srli a0, a0, 3 36 | bnez a0, .LBB1_2 37 | 38 | This lengthens the critical path by one, but reduces register pressure. This is probably worthwhile. 39 | 40 | There are also many variations of this type of pattern if we decide this is worth spending time on.
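The equivalence behind this rewrite is easy to sanity check: an unsigned 64-bit value is below 8 exactly when all bits above bit 2 are clear, i.e. when a logical right shift by 3 yields zero. A quick Python model of the two branch conditions (the function names are mine, purely illustrative):

```python
# Model of both forms of the power-of-2 unsigned compare above.
MASK64 = (1 << 64) - 1  # treat Python ints as 64-bit values

def is_ult8_compare(a: int) -> bool:
    # Models the current sequence: li a1, 7 ; bltu a1, a0 (fall through when a <u 8)
    return (a & MASK64) < 8

def is_ult8_shift(a: int) -> bool:
    # Models the shift form: srli a0, a0, 3 ; beqz (bits above bit 2 all clear)
    return ((a & MASK64) >> 3) == 0

# The two conditions agree across the boundary and at the range extremes.
for a in [0, 1, 7, 8, 9, 1 << 62, (1 << 64) - 1]:
    assert is_ult8_compare(a) == is_ult8_shift(a)
```

The same check works for any power-of-2 threshold 2^k by shifting right by k instead of 3.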
41 | 42 | Optimizations for constant physregs (VLENB, X0) 43 | =============================================== 44 | 45 | Noticed while investigating use of the PseudoReadVLENB intrinsic, and working on them as follow ons to ``_, but these also apply to other constant registers. At the moment, the two I can think of are X0, and VLENB but there might be others. 46 | 47 | Punch list (most have tests in test/CodeGen/RISCV/vlenb.ll but not all): 48 | 49 | * PeepholeOptimizer should eliminate redundant copies from constant physregs. * A VSETVLI whose implicit VL and VTYPE defines are dead essentially just computes a fixed function of VLENB. We could consider replacing the VSETVLI with a CSR read and a shift. (Unclear whether this is profitable on real hardware.) 69 | 70 | 71 | Compressed Expansion for Alignment 72 | ================================== 73 | 74 | If we have a sequence of compressed instructions followed by an align directive, it would be better to uncompress the prior instructions instead of inserting nops for alignment. 75 | 76 | This is analogous to the relaxation support on X86 for using larger instruction encodings for alignment in the integrated assembler. 77 | 78 | This is of questionable value, but might be interesting around e.g. loop alignment. 79 | 80 | Constant Materialization Gaps 81 | ============================= 82 | 83 | For constant floats, we have a couple of opportunities: 84 | 85 | * LUI/SHL-by-32/FMV.D.X - Analogous to the LUI/FMV.W.X pattern recently implemented, but requires an extra shift. This basically reduces to increasing the cost threshold by 1, and may be worth doing for doubles. 86 | * LI/FCVT.S.W - Create a small integer, and convert to half/single/double. Note this is a convert, not a move. For half, LUI/FMV.H.X may be preferable. 87 | * FLI.S/D - Likely to be optimal when Zfa is available. 88 | * FLI + FNEG.S - Can be used to produce some negative floats and doubles.
LUI/FMV.W.X is likely better for floats and halfs, so this mostly applies to doubles. FNEG.S can be used to toggle the sign bit on any float, so may be more broadly applicable as well. 89 | 90 | 91 | Rematerialization of LUI/ADDI sequences 92 | ======================================= 93 | 94 | Given an LUI/ADDI sequence - either from a constant or a relocation - we should be able to rematerialize either or both instructions if required to reduce register pressure during allocation. 95 | 96 | 97 | Register Pressure Reduction 98 | =========================== 99 | 100 | Improvement to switch lowering - if we generate a jump table for the labels, check to see if the result can be turned into a lookup table instead. We're already paying the load cost. 101 | 102 | Investigate simple improvements to ShrinkWrapping. 103 | 104 | Consider firewalling cold call paths. 105 | 106 | Define a fastcc variant where argument-0 and return don't require the same register and internalize aggressively - mostly helps LTO. 107 | 108 | IPRA - Can we reduce need to spill some? 109 | 110 | Prefer bnez (addi a0, a0, C) when doing so avoids the need for an immediate materialization and a0 has no other uses. 111 | 112 | Prefer bnez (lshr a0, a0, XLen-1) for sign check, same logic as previous. Also generalizes to bexti cases for any single bit check. 113 | 114 | Use arithmetic more aggressively for select c, i32 C1, i32 C2 to avoid need for control flow. (Doesn't really impact register pressure, may actually hurt.) 115 | 116 | Aggressively duplicate (addi a0, x0, C) to users before register allocation OR integrate rematerialization into first CSR path. 117 | 118 | Aggressively duplicate (addi a0, a0, C) when the user is a vector load or store to avoid long live ranges. Or combine remat in first CSR + full remat. 119 | 120 | Investigate full rematerialization. 121 | 122 | Investigate negated compound branch thing reported 2024-11-24 on discourse.
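One item in the list above, the sign-check rewrite (bnez on the result of lshr a0, a0, XLen-1), rests on the fact that in two's complement the high bit is set exactly for negative values. A small Python model of the 64-bit case (illustrative only):

```python
# Model of the sign-check idiom: a logical right shift by XLEN-1 (63 here)
# isolates the sign bit, so bnez on the result is a signed "a < 0" test.
MASK64 = (1 << 64) - 1  # 64-bit two's complement representation

def sign_bit(a: int) -> int:
    # Models: srli a0, a0, 63 on the 64-bit encoding of a
    return (a & MASK64) >> 63

for a in [0, 1, 2**63 - 1, -1, -42, -(2**63)]:
    assert (sign_bit(a) != 0) == (a < 0)
```

The generalization to bexti is the same idea: any single-bit test can be reduced to a shift (or bit-extract) feeding bnez/beqz.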
123 | 124 | 125 | Disjoint OR 126 | ----------- 127 | 128 | (From an older list of notes, may be stale.) 129 | 130 | * SelectAddrRegScale 131 | * the addw thing 132 | * SelectAddrRegImm (the large offset cases) 133 | * SelectAdd/RegReg 134 | * SelectShiftMask 135 | * In GEP AddrMatch 136 | 137 | 138 | -------------------------------------------------------------------------------- /llvm-riscv/Vectorization.rst: -------------------------------------------------------------------------------- 1 | ------------------------------------------------- 2 | Open Items in Vectorization for RISCV 3 | ------------------------------------------------- 4 | 5 | .. contents:: 6 | 7 | Loop Vectorizer 8 | ---------------- 9 | 10 | Loop Vectorization is fully implemented for both fixed and scalable vectors. It has been fully enabled in upstream LLVM for several months, and is mostly on par with other targets. The items mentioned below are mostly target specific enhancements - i.e. opportunities that aren't required for breakeven functionality. 11 | 12 | In terms of performance tuning, we're still in the early days. I've been fixing issues as I find them. Concrete bug reports for vector code quality are very welcome. 13 | 14 | Tail Folding 15 | ++++++++++++ 16 | 17 | For code size reasons, it is desirable to be able to fold the remainder loop into the main loop body. At the moment, we have two options for tail folding: mask predication and VL predication. I've been starting to look at the tradeoffs here, but this section is still highly preliminary and subject to change. 18 | 19 | Mask predication appears to work today. We'd need to enable the flag, but at least some loops would start folding immediately. There are some major profitability questions around doing so, particularly for short running loops which today would bypass the vector body entirely.
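For concreteness, the two folding styles can be modeled in scalar Python: mask predication always runs a full vector's worth of lanes and uses a lane-index compare to disable the excess, while VL predication shrinks the number of active lanes on the final iteration. (The VF of 4 and the sum reduction are illustration choices of mine, not anything the vectorizer mandates.)

```python
VF = 4  # illustrative vectorization factor

def masked_sum(a, n):
    # Mask predication: always process VF lanes; a compare masks the tail lanes.
    padded = a + [0] * ((-n) % VF)  # stand-in for "over-read is safe"
    total = 0
    for i in range(0, len(padded), VF):
        mask = [i + lane < n for lane in range(VF)]
        total += sum(v for v, m in zip(padded[i:i + VF], mask) if m)
    return total

def vl_sum(a, n):
    # VL predication: vsetvli-style, shrink the active vector length at the tail.
    total = 0
    i = 0
    while i < n:
        vl = min(VF, n - i)  # models the AVL granted by vsetvli
        total += sum(a[i:i + vl])
        i += vl
    return total

# Both strategies compute the same result for every trip count.
for n in range(1, 9):
    data = list(range(n))
    assert masked_sum(data, n) == vl_sum(data, n) == sum(data)
```

The hardware-cost question discussed next is exactly about which of these two equivalent formulations the machine executes more cheaply.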
20 | 21 | Talking with various hardware players, there appears to be a somewhat significant cost to using mask predication over VL predication. For several teams I've talked to, VSETVLI runs in the scalar domain whereas mask generation via vector compares runs in the vector domain. Particularly for small loops which might be vector bottlenecked, this means VL predication is preferable. 22 | 23 | For VL predication, we have two major options. We can either pattern match mask predication into VL predication in the backend, or we can upstream the work BSC has done on vectorizing using the VP intrinsics. I'm unclear on which approach is likely to work out best long term. 24 | 25 | Work on tail folding is currently being deferred until main loop vectorization is mature. 26 | 27 | Epilogue Tail Folding 28 | ===================== 29 | 30 | At the moment, my guess is that we're going to end up wanting *not* to tail fold the main loop (due to vsetvli resource limit concerns), and instead want to have a tail folded epilogue loop to run the tail in a single iteration. At the moment, there is no support in the vectorizer for tail folded epilogues and a significant amount of rework will be needed. 31 | 32 | One interesting point is that if we're (only) tail folding the epilogue loop, the relative importance of the predicate code quality drops significantly. This may influence the masking vs VL predication decision from a pure engineering investment perspective. 33 | 34 | We may end up with a different strategy for loops which are known short. There, vsetvli being a bottleneck is a lot less of a concern. Maybe we'll tail fold the main loop in that case. 35 | 36 | Tail Folding Gaps (via Masking) 37 | =============================== 38 | 39 | Tail folding appears to have a number of limitations which can be removed. 40 | 41 | * Some cases with predicate-dont-vectorize are vectorizing without predication. Bug.
42 | * Any use outside of loop appears to kill predication. Oddly, on examples I've tried, simply removing the bailout seems to generate correct code? 43 | * Stores appear to be tripping scalarization cost not masking cost which inhibits profitability. 44 | * Uniform Store. Basic issue is we need to implement last active lane extraction. Note active bits are a prefix and thus popcnt can be used to find index. No current plans to support general predication. 45 | 46 | Tail Folding via Speculation 47 | ============================ 48 | 49 | This is mostly just noting an idea. It occurs to me that if instructions in the loop are speculatable, we can "tail fold" via speculation. That is, we can simply run the loop over the extra iterations, and then discard the result of any spurious elements. 50 | 51 | .. code:: 52 | 53 | // a is aligned by 16 54 | for (int i = 0; i < N; i++) 55 | sum += a[i]; 56 | 57 | .. code:: 58 | 59 | // a is aligned by 16 60 | for (int i = 0; i < N+3; i += 4) { 61 | vtmp = a[i:i+3] // speculative load 62 | vtmp = select (splat(i) + step_vector < splat(N)), vtmp, 0 63 | vsum += vtmp 64 | } 65 | sum = reduce(vsum) 66 | 67 | 68 | The above example relies on alignment implying access beyond a can't fault. Note that this concept is *not* otherwise in LLVM's dereferenceable model, and is itself a fairly deep change. 69 | 70 | LoopVectorizer generating duplicate broadcast shuffles 71 | ++++++++++++++++++++++++++++++++++++++++++++++++++++++ 72 | 73 | This is being fixed by the backend, but we should probably tweak LV to avoid it anyways.
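Returning to the Tail Folding via Speculation idea above: the select-based masking can be modeled with plain scalar Python to confirm that the spurious speculative lanes never affect the reduction. The VF of 4 and the zero padding are illustration choices, standing in for the alignment-based "over-read can't fault" argument.

```python
VF = 4  # illustrative vectorization factor

def speculative_sum(a, n):
    # Padding stands in for the alignment argument: reading past n can't fault.
    padded = a + [0] * ((-n) % VF)
    vsum = [0] * VF
    for i in range(0, len(padded), VF):
        vtmp = padded[i:i + VF]                 # speculative load
        vtmp = [v if i + lane < n else 0        # select (splat(i)+step < splat(n))
                for lane, v in enumerate(vtmp)]
        vsum = [s + v for s, v in zip(vsum, vtmp)]
    return sum(vsum)                            # horizontal reduce after the loop

# The discarded lanes are zero, the identity for the add reduction, so the
# result matches the scalar loop for every trip count.
for n in range(1, 10):
    data = list(range(100, 100 + n))
    assert speculative_sum(data, n) == sum(data)
```

Note that zeroing the masked lanes only works because 0 is the identity of the reduction; a product reduction would need the lanes forced to 1 instead.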
74 | 75 | Duplicate IV for index vector 76 | +++++++++++++++++++++++++++++ 77 | 78 | In a test which simply writes “i” to every element of a vector, we’re currently generating: 79 | 80 | %vec.ind = phi <4 x i32> [ <i32 0, i32 1, i32 2, i32 3>, %vector.ph ], [ %vec.ind.next, %vector.body ] 81 | %step.add = add <4 x i32> %vec.ind, <i32 4, i32 4, i32 4, i32 4> 82 | … 83 | %vec.ind.next = add <4 x i32> %vec.ind, <i32 8, i32 8, i32 8, i32 8> 84 | %2 = icmp eq i64 %index.next, %n.vec 85 | br i1 %2, label %middle.block, label %vector.body, !llvm.loop !8 86 | 87 | And assembly: 88 | 89 | vadd.vi v9, v8, 4 90 | addi a5, a3, -16 91 | vse32.v v8, (a5) 92 | vse32.v v9, (a3) 93 | vadd.vi v8, v8, 8 94 | addi a4, a4, -8 95 | addi a3, a3, 32 96 | bnez a4, .LBB0_4 97 | beq a1, a2, .LBB0_8 98 | 99 | We can do better here by exploiting the implicit broadcast of scalar arguments. If we put the constant id vector into a vector register, and add the broadcasted scalar index we get the same result vector. 100 | 101 | 102 | Vectorization 103 | +++++++++++++ 104 | 105 | 106 | * Issues around epilogue vectorization w/VF > 16 (for fixed length vectors, i8 for VLEN >= 128, i16 for VLEN >= 256, etc..) 107 | * Initial target assumes scalar epilogue loop, return to folding/epilogue vectorization in future. 108 | 109 | 110 | Scalable Vectorizer Gaps 111 | ++++++++++++++++++++++++ 112 | 113 | Here is a punch list of known missing cases around scalable vectorization in the LoopVectorizer. These are mostly target independent. 114 | 115 | * Interleaving Groups. This one looks tricky as selects in IR require constants and the required shuffles for scalable can't currently be expressed as constants. This is likely going to need an IR change; details as yet unsettled. Current thinking has shifted towards just adding three more intrinsics and deferring shuffle definition change to some future point. Pending sync with ARM SVE folks. 116 | * General loop scalarization. For scalable vectors, we _can_ scalarize, but not via unrolling. Instead, we must generate a loop.
This can be done in the vectorizer itself (since it's a generic IR transform pass), but is not possible in SelectionDAG (which is not allowed to modify the CFG). Interacts both with div/rem and intrinsic costing. Initial patch for non-predicated scalarization is up as `D131118 `_ 117 | * Unsupported reduction operators. For reduction operations without instructions, we can handle these via a simple scalar reduction loop. This allows e.g. a product reduction to be done via the widening strategy, then reduced into the final result outside the loop. Only useful for out-of-loop reduction. (i.e. both options should be considered by the cost model) 118 | 119 | 120 | SLP Vectorization 121 | ----------------- 122 | 123 | As of 7f26c27e03f1b6b12a3450627934ee26256649cd (June 14, 2023), SLP vectorization is enabled by default for the RISCV target. 124 | 125 | The overall code quality still has a lot of room for improvement. All of the known major issues have been at least partially handled, but we've likely got quite a bit of iterative performance work ahead. In general, codegen tends to be most sensitive for short vectors (VL<4 or so). This is where the benefit of vectorization is small enough that minor deficiencies in vector codegen (or SLP costing) lead to unprofitable results.
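As a concrete illustration (a hypothetical example, not taken from any benchmark), the kind of short-vector code SLP targets is a straight-line run of isomorphic operations on adjacent elements; at VL=4 the vectorization win is small enough that a costing mistake flips it into a loss:

```c
#include <assert.h>

// Four isomorphic adds over adjacent lanes: a classic SLP seed. A
// vectorizing compiler may turn this into a single vector add plus
// vector loads/stores, if the cost model decides the packing overhead
// is worth it.
static void add4(int *d, const int *a, const int *b) {
    d[0] = a[0] + b[0];
    d[1] = a[1] + b[1];
    d[2] = a[2] + b[2];
    d[3] = a[3] + b[3];
}
```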
126 | 127 | 128 | -------------------------------------------------------------------------------- /llvm-riscv/memset.ll: -------------------------------------------------------------------------------- 1 | ; NOTE: Assertions have been autogenerated by utils/update_llc_test_checks.py UTC_ARGS: --version 2 2 | ; RUN: llc -mtriple=riscv64 -mattr=+v,+unaligned-scalar-mem < %s | FileCheck %s 3 | 4 | target datalayout = "e-m:e-p:64:64-i64:64-i128:128-n32:64-S128" 5 | target triple = "riscv64-unknown-linux-gnu" 6 | 7 | ; TODO: Reverse order of stores to assist the prefetcher - this case doesn't matter 8 | ; but imagine a loop zeroing 15 byte structures in an array (with something to 9 | ; prevent the compiler merging it into one memset) 10 | define void @memset_15(ptr %p) { 11 | ; CHECK-LABEL: memset_15: 12 | ; CHECK: # %bb.0: # %entry 13 | ; CHECK-NEXT: sd zero, 7(a0) 14 | ; CHECK-NEXT: sd zero, 0(a0) 15 | ; CHECK-NEXT: ret 16 | entry: 17 | tail call void @llvm.memset.p0.i64(ptr align 8 %p, i8 0, i64 15, i1 false) 18 | ret void 19 | } 20 | 21 | declare void @llvm.memset.p0.i64(ptr nocapture writeonly, i8, i64, i1 immarg) 22 | -------------------------------------------------------------------------------- /llvm-riscv/scalar-branch-opt.ll: -------------------------------------------------------------------------------- 1 | ; NOTE: Assertions have been autogenerated by utils/update_llc_test_checks.py UTC_ARGS: --version 2 2 | ; RUN: llc -mtriple=riscv64 -mattr=+v,+unaligned-scalar-mem < %s | FileCheck %s 3 | 4 | target datalayout = "e-m:e-p:64:64-i64:64-i128:128-n32:64-S128" 5 | target triple = "riscv64-unknown-linux-gnu" 6 | 7 | declare void @foo() 8 | 9 | ; TODO: Two problems here: 10 | ; 1) The blt should be reversed into a bge to avoid the add 11 | define void @test(i64 %a, i64 %b) { 12 | ; CHECK-LABEL: test: 13 | ; CHECK: # %bb.0: # %entry 14 | ; CHECK-NEXT: addi a0, a0, 1 15 | ; CHECK-NEXT: blt a0, a1, .LBB0_2 16 | ; CHECK-NEXT: # %bb.1: # %taken 17 | ; CHECK-NEXT: addi sp,
sp, -16 18 | ; CHECK-NEXT: .cfi_def_cfa_offset 16 19 | ; CHECK-NEXT: sd ra, 8(sp) # 8-byte Folded Spill 20 | ; CHECK-NEXT: .cfi_offset ra, -8 21 | ; CHECK-NEXT: call foo 22 | ; CHECK-NEXT: ld ra, 8(sp) # 8-byte Folded Reload 23 | ; CHECK-NEXT: addi sp, sp, 16 24 | ; CHECK-NEXT: .LBB0_2: # %exit 25 | ; CHECK-NEXT: ret 26 | entry: 27 | %add1 = add i64 %a, 1 28 | %cmp = icmp slt i64 %add1, %b 29 | br i1 %cmp, label %exit, label %taken 30 | taken: 31 | call void @foo() 32 | ret void 33 | exit: 34 | ret void 35 | } 36 | 37 | declare void @llvm.memset.p0.i64(ptr nocapture writeonly, i8, i64, i1 immarg) 38 | -------------------------------------------------------------------------------- /llvm-riscv/tsvc/16.x-march=rv64gcv_zvl128b/diff: -------------------------------------------------------------------------------- 1 | diff --git a/makefiles/Makefile.clang b/makefiles/Makefile.clang 2 | index cefe725..ad5374a 100644 3 | --- a/makefiles/Makefile.clang 4 | +++ b/makefiles/Makefile.clang 5 | @@ -3,7 +3,7 @@ CC=clang 6 | CXX=clang++ 7 | # no FC for clang 8 | FC= 9 | -flags = -O3 -fstrict-aliasing 10 | +flags = -O3 -fstrict-aliasing -march=rv64gcv_zvl128b 11 | vecflags = -fvectorize -fslp-vectorize-aggressive 12 | novecflags = -fno-vectorize 13 | omp_flags=-fopenmp=libomp 14 | -------------------------------------------------------------------------------- /llvm-riscv/tsvc/16.x-march=rv64gcv_zvl128b/note: -------------------------------------------------------------------------------- 1 | This is the installed clang which comes with the image, not built from source. 2 | Probably does not correspond to an official LLVM release. 
3 | 4 | $ clang -v 5 | Ubuntu clang version 16.0.6 (15) 6 | Target: riscv64-unknown-linux-gnu 7 | Thread model: posix 8 | InstalledDir: /usr/bin 9 | Found candidate GCC installation: /usr/bin/../lib/gcc/riscv64-linux-gnu/13 10 | Selected GCC installation: /usr/bin/../lib/gcc/riscv64-linux-gnu/13 11 | 12 | -------------------------------------------------------------------------------- /llvm-riscv/tsvc/16.x-march=rv64gcv_zvl128b/stdout: -------------------------------------------------------------------------------- 1 | $ ./bin/clang/tsvc_novec_relaxed 2 | Loop Time(sec) Checksum 3 | s000 24.868 512080000.000000 4 | s111 14.000 32000.410156 5 | s1111 39.267 16005.822266 6 | s112 37.820 84617.789062 7 | s1112 37.284 32001.644531 8 | s113 48.577 32000.644531 9 | s1113 24.819 32001.644531 10 | s114 28.692 919.861816 11 | s115 73.566 31745.953125 12 | s1115 33.773 0.065533 13 | s116 77.958 32000.000000 14 | s118 26.243 84488.507812 15 | s119 20.833 86339.023438 16 | s1119 21.025 119099.906250 17 | s121 37.476 32009.031250 18 | s122 12.875 196490.937500 19 | s123 20.734 32003.287109 20 | s124 30.327 32001.644531 21 | s125 13.618 131072.000000 22 | s126 13.973 66954.968750 23 | s127 23.699 32003.287109 24 | s128 23.499 80000.000000 25 | s131 62.456 32009.031250 26 | s132 45.264 65538.562500 27 | s141 30.424 32487076.000000 28 | s151 62.442 32009.031250 29 | s152 31.290 152207.218750 30 | s161 19.950 64002.054688 31 | s1161 38.641 64002.464844 32 | s162 17.650 32009.031250 33 | s171 12.963 196491.250000 34 | s172 12.618 196491.250000 35 | s173 64.259 32001.640625 36 | s174 65.075 32001.640625 37 | s175 12.546 32009.031250 38 | s176 13.711 32063.832031 39 | s211 30.445 63983.183594 40 | s212 28.877 132011.359375 41 | s1213 21.885 132022.765625 42 | s221 13.344 1543685120.000000 43 | s1221 13.521 630444416.000000 44 | s222 5.632 32000.000000 45 | s231 185.779 119099.906250 46 | s232 5.082 65536.000000 47 | s1232 132.972 33957.886719 48 | s233 467.290 504912.625000 49 | s2233 
160.342 337652.968750 50 | s235 386.811 160024.000000 51 | s241 65.452 64000.000000 52 | s242 5.374 1535952000.000000 53 | s243 39.094 810624.937500 54 | s244 32.806 70102.437500 55 | s1244 40.872 108496.453125 56 | s2244 21.775 70102.890625 57 | s251 98.395 32004.371094 58 | s1251 113.548 400005.968750 59 | s2251 30.482 2.635709 60 | s3251 42.370 12.595594 61 | s252 20.173 63999.000000 62 | s253 40.831 3200009984.000000 63 | s254 96.568 32000.000000 64 | s255 28.142 31968.515625 65 | s256 18.865 66207.695312 66 | s257 24.094 163072.000000 67 | s258 0.291 14.652789 68 | s261 27.822 260999.140625 69 | s271 122.050 689753.500000 70 | s272 24.412 64000.000000 71 | s273 39.335 99051.656250 72 | s274 40.812 3200195584.000000 73 | s275 37.317 65536.000000 74 | ^[[Bs2275 733.148 65536.000000 75 | s276 85.976 689753.500000 76 | s277 26.051 32000.000000 77 | s278 34.415 64012.593750 78 | s279 18.193 64014.289062 79 | s1279 24.541 64014.289062 80 | s2710 25.618 96003.289062 81 | s2711 121.925 689753.500000 82 | s2712 114.728 289751.625000 83 | s281 30.168 32000.000000 84 | s1281 177.255 inf 85 | s291 40.112 32000.000000 86 | s292 32.289 31968.515625 87 | s293 24.933 32000.000000 88 | s2101 14.601 1704092.000000 89 | s2102 19.100 256.000000 90 | s2111 27.237 34545052.000000 91 | s311 80.968 10.950724 92 | s31111 9.147 10.950724 93 | s312 81.027 1.030518 94 | s313 55.261 1.644725 95 | s314 40.520 1.000000 96 | s315 24.107 54857.000000 97 | s316 40.529 0.000031 98 | s317 22.371 0.000000 99 | s318 16.219 32002.000000 100 | s319 69.465 43.803417 101 | s3110 21.015 514.000000 102 | s13110 21.051 514.000000 103 | s3111 8.078 10.950721 104 | s3112 8.406 1.644725 105 | s3113 64.479 2.000000 106 | s321 12.553 32000.000000 107 | s322 11.734 32000.000000 108 | s323 18.245 146484.968750 109 | s331 24.207 32000.000000 110 | s332 22.224 -1.000000 111 | s341 30.675 10.950724 112 | s342 32.522 10.950724 113 | s343 22.067 1567.833496 114 | s351 90.539 25600002048.000000 115 | s1351 99.513 
32010.253906 116 | s352 88.851 1.644841 117 | s353 18.155 3200002816.000000 118 | s421 50.028 32009.031250 119 | s1421 51.028 16000.000000 120 | s422 97.994 257.701416 121 | s423 87.295 439.690308 122 | s424 51.495 822.360596 123 | s431 129.902 1674247.125000 124 | s441 38.154 196491.250000 125 | s442 12.370 114240.132812 126 | s443 46.044 361015.906250 127 | s451 27.748 32009.281250 128 | s452 57.656 32512.017578 129 | s453 24.233 21.901447 130 | s471 17.627 64004.933594 131 | s481 33.222 196491.250000 132 | s482 35.086 196491.250000 133 | s491 32.998 32001.644531 134 | s4112 19.971 1127134.250000 135 | s4113 24.355 32001.644531 136 | s4114 27.426 32000.000000 137 | s4115 17.584 1.038636 138 | s4116 12.908 0.753265 139 | s4117 22.531 32002.207031 140 | s4121 17.181 196491.250000 141 | va 25.050 1.644884 142 | vag 29.851 1.644884 143 | vas 31.427 1.644885 144 | vif 24.318 1.644884 145 | vpv 126.368 1642411.250000 146 | vtv 126.337 32000.000000 147 | vpvtv 68.677 689753.500000 148 | vpvts 14.258 175255268098048.000000 149 | vpvpv 83.829 1.644884 150 | vtvtv 86.921 32000.000000 151 | vsumr 80.816 10.950721 152 | vdotr 115.374 1.644725 153 | vbor 9.867 31924.050781 154 | $ ./bin/clang/tsvc_vec_relaxed 155 | Loop Time(sec) Checksum 156 | s000 6.339 512080000.000000 157 | s111 14.772 32000.410156 158 | s1111 10.951 16005.822266 159 | s112 38.180 84617.789062 160 | s1112 17.202 32001.644531 161 | s113 11.256 32000.644531 162 | s1113 24.689 32001.644531 163 | s114 29.600 919.861816 164 | s115 73.029 31745.953125 165 | s1115 77.778 0.065533 166 | s116 77.896 32000.000000 167 | s118 38.387 85045.453125 168 | s119 7.929 86339.023438 169 | s1119 5.677 119099.906250 170 | s121 12.375 32009.031250 171 | s122 13.183 196490.937500 172 | s123 20.782 32003.287109 173 | s124 8.908 32001.644531 174 | s125 8.458 131072.000000 175 | s126 19.356 66954.968750 176 | s127 15.559 32003.287109 177 | s128 19.111 80000.000000 178 | s131 20.579 32009.031250 179 | s132 11.579 65538.562500 180 | 
s141 32.791 32487076.000000 181 | s151 20.605 32009.031250 182 | s152 22.385 152207.218750 183 | s161 19.577 64002.054688 184 | s1161 38.721 64002.464844 185 | s162 5.621 32009.031250 186 | s171 3.846 196491.250000 187 | s172 13.175 196491.250000 188 | s173 19.401 32001.640625 189 | s174 20.197 32001.640625 190 | s175 4.055 32009.031250 191 | s176 4.145 32063.832031 192 | s211 29.282 63983.183594 193 | s212 27.724 132011.359375 194 | s1213 21.004 132022.765625 195 | s221 13.451 1543685120.000000 196 | s1221 4.220 630444416.000000 197 | s222 5.994 32000.000000 198 | s231 172.146 119099.906250 199 | s232 5.453 65536.000000 200 | s1232 239.784 33957.886719 201 | s233 761.972 504912.625000 202 | s2233 172.588 337652.968750 203 | s235 348.269 160024.000000 204 | s241 65.333 64000.000000 205 | s242 5.359 1535952000.000000 206 | s243 14.757 810624.937500 207 | s244 33.734 70102.437500 208 | s1244 40.874 108496.453125 209 | s2244 6.938 70102.890625 210 | s251 26.655 32004.371094 211 | s1251 58.867 400005.968750 212 | s2251 28.525 2.635709 213 | s3251 23.103 12.595594 214 | s252 4.980 63999.000000 215 | s253 8.506 3200009984.000000 216 | s254 17.233 32000.000000 217 | s255 6.603 31968.515625 218 | s256 17.153 66207.695312 219 | s257 29.489 163072.000000 220 | s258 0.306 14.652789 221 | s261 28.551 260999.140625 222 | s271 22.389 689753.500000 223 | s272 10.223 64000.000000 224 | s273 14.521 99051.656250 225 | s274 24.234 3200195584.000000 226 | s275 40.462 65536.000000 227 | s2275 950.944 65536.000000 228 | s276 44.881 689753.500000 229 | s277 26.121 32000.000000 230 | s278 14.772 64012.593750 231 | s279 9.305 64014.289062 232 | s1279 9.875 64014.289062 233 | s2710 6.472 96003.289062 234 | s2711 22.334 689753.500000 235 | s2712 22.788 289751.625000 236 | s281 29.930 32000.000000 237 | s1281 59.478 inf 238 | s291 48.661 32000.000000 239 | s292 32.105 31968.515625 240 | s293 24.394 32000.000000 241 | s2101 12.984 1704092.000000 242 | s2102 18.689 256.000000 243 | s2111 27.159 
34545052.000000 244 | s311 16.638 10.950724 245 | s31111 9.136 10.950724 246 | s312 23.829 1.030928 247 | s313 15.736 1.644884 248 | s314 8.317 1.000000 249 | s315 23.913 54857.000000 250 | s316 9.389 0.000031 251 | s317 2.518 0.000000 252 | s318 16.178 32002.000000 253 | s319 17.169 43.802910 254 | s3110 20.935 514.000000 255 | s13110 20.938 514.000000 256 | s3111 0.951 10.950724 257 | s3112 11.422 1.644725 258 | s3113 7.598 2.000000 259 | s321 12.994 32000.000000 260 | s322 12.280 32000.000000 261 | s323 15.301 146484.968750 262 | s331 24.224 32000.000000 263 | s332 22.059 -1.000000 264 | s341 30.575 10.950724 265 | s342 32.405 10.950724 266 | s343 17.231 1567.833496 267 | s351 112.157 25600002048.000000 268 | s1351 37.226 32010.253906 269 | s352 85.448 1.644891 270 | s353 20.354 3200002816.000000 271 | s421 16.477 32009.031250 272 | s1421 15.744 16000.000000 273 | s422 30.633 257.701416 274 | s423 17.927 439.690308 275 | s424 22.340 822.360596 276 | s431 38.771 1674247.125000 277 | s441 11.336 196491.250000 278 | s442 14.889 114240.132812 279 | s443 13.666 361015.906250 280 | s451 27.723 32009.281250 281 | s452 21.574 32512.017578 282 | s453 5.439 21.901447 283 | s471 14.955 64004.933594 284 | s481 32.899 196491.250000 285 | s482 35.081 196491.250000 286 | s491 22.081 32001.644531 287 | s4112 9.426 1127134.250000 288 | s4113 12.692 32001.644531 289 | s4114 22.597 32000.000000 290 | s4115 8.154 1.038800 291 | s4116 5.966 0.753265 292 | s4117 10.469 32002.207031 293 | s4121 5.419 196491.250000 294 | va 25.687 1.644884 295 | vag 15.519 1.644884 296 | vas 19.909 1.644885 297 | vif 2.962 1.644884 298 | vpv 38.498 1642411.250000 299 | vtv 38.491 32000.000000 300 | vpvtv 21.706 689753.500000 301 | vpvts 4.015 175255268098048.000000 302 | vpvpv 21.415 1.644884 303 | vtvtv 21.698 32000.000000 304 | vsumr 16.744 10.950724 305 | vdotr 31.524 1.644884 306 | vbor 1.559 31924.050781 307 | -------------------------------------------------------------------------------- 
/llvm-riscv/tsvc/17.x-march=rv64gcv_zvl128b/stdout: -------------------------------------------------------------------------------- 1 | $ PATH=~/llvm/17.x/bin/:$PATH ./run.sh | tee raw.out 2 | ++ set -e 3 | ++ clang -v 4 | clang version 17.0.6 (https://github.com/llvm/llvm-project.git 6009708b4367171ccdbf4b5905cb6a803753fe18) 5 | Target: riscv64-unknown-linux-gnu 6 | Thread model: posix 7 | InstalledDir: /home/preames/llvm/17.x/bin 8 | Found candidate GCC installation: /usr/lib/gcc/riscv64-linux-gnu/13 9 | Selected GCC installation: /usr/lib/gcc/riscv64-linux-gnu/13 10 | ++ git diff 11 | diff --git a/makefiles/Makefile.clang b/makefiles/Makefile.clang 12 | index cefe725..ad5374a 100644 13 | --- a/makefiles/Makefile.clang 14 | +++ b/makefiles/Makefile.clang 15 | @@ -3,7 +3,7 @@ CC=clang 16 | CXX=clang++ 17 | # no FC for clang 18 | FC= 19 | -flags = -O3 -fstrict-aliasing 20 | +flags = -O3 -fstrict-aliasing -march=rv64gcv_zvl128b 21 | vecflags = -fvectorize -fslp-vectorize-aggressive 22 | novecflags = -fno-vectorize 23 | omp_flags=-fopenmp=libomp 24 | ++ make COMPILER=clang clean 25 | make -C ./src COMPILER=clang clean 26 | make[1]: Entering directory '/home/preames/benchmark/TSVC_2/src' 27 | rm -f *.o *.s 28 | make[1]: Leaving directory '/home/preames/benchmark/TSVC_2/src' 29 | ++ make COMPILER=clang 30 | make[1]: Entering directory '/home/preames/benchmark/TSVC_2/src' 31 | clang -O3 -fstrict-aliasing -march=rv64gcv_zvl128b -fvectorize -fslp-vectorize-aggressive -c -o tsvc_vec.o tsvc.c 32 | clang: warning: the flag '-fslp-vectorize-aggressive' has been deprecated and will be ignored [-Wunused-command-line-argument] 33 | clang -O3 -fstrict-aliasing -march=rv64gcv_zvl128b -fvectorize -fslp-vectorize-aggressive -c -o dummy.o dummy.c 34 | clang: warning: the flag '-fslp-vectorize-aggressive' has been deprecated and will be ignored [-Wunused-command-line-argument] 35 | clang -O3 -fstrict-aliasing -march=rv64gcv_zvl128b -fvectorize -fslp-vectorize-aggressive -c -o 
common.o common.c 36 | clang: warning: the flag '-fslp-vectorize-aggressive' has been deprecated and will be ignored [-Wunused-command-line-argument] 37 | clang tsvc_vec.o dummy.o common.o -lm -o ../bin/clang/tsvc_vec_default 38 | clang -O3 -fstrict-aliasing -march=rv64gcv_zvl128b -fno-vectorize -c -o tsvc_novec.o tsvc.c 39 | clang tsvc_novec.o dummy.o common.o -lm -o ../bin/clang/tsvc_novec_default 40 | rm common.o tsvc_vec.o dummy.o tsvc_novec.o 41 | make[1]: Leaving directory '/home/preames/benchmark/TSVC_2/src' 42 | make[1]: Entering directory '/home/preames/benchmark/TSVC_2/src' 43 | clang -O3 -fstrict-aliasing -march=rv64gcv_zvl128b -ffast-math -fvectorize -fslp-vectorize-aggressive -c -o tsvc_vec.o tsvc.c 44 | clang: warning: the flag '-fslp-vectorize-aggressive' has been deprecated and will be ignored [-Wunused-command-line-argument] 45 | clang -O3 -fstrict-aliasing -march=rv64gcv_zvl128b -ffast-math -fvectorize -fslp-vectorize-aggressive -c -o dummy.o dummy.c 46 | clang: warning: the flag '-fslp-vectorize-aggressive' has been deprecated and will be ignored [-Wunused-command-line-argument] 47 | clang -O3 -fstrict-aliasing -march=rv64gcv_zvl128b -ffast-math -fvectorize -fslp-vectorize-aggressive -c -o common.o common.c 48 | clang: warning: the flag '-fslp-vectorize-aggressive' has been deprecated and will be ignored [-Wunused-command-line-argument] 49 | clang tsvc_vec.o dummy.o common.o -lm -o ../bin/clang/tsvc_vec_relaxed 50 | clang -O3 -fstrict-aliasing -march=rv64gcv_zvl128b -ffast-math -fno-vectorize -c -o tsvc_novec.o tsvc.c 51 | clang tsvc_novec.o dummy.o common.o -lm -o ../bin/clang/tsvc_novec_relaxed 52 | rm common.o tsvc_vec.o dummy.o tsvc_novec.o 53 | make[1]: Leaving directory '/home/preames/benchmark/TSVC_2/src' 54 | make[1]: Entering directory '/home/preames/benchmark/TSVC_2/src' 55 | /home/preames/benchmark/TSVC_2/makefiles/Makefile.clang:21: No 'precise' math flags for clang! 
56 | clang -O3 -fstrict-aliasing -march=rv64gcv_zvl128b -fvectorize -fslp-vectorize-aggressive -c -o tsvc_vec.o tsvc.c 57 | clang: warning: the flag '-fslp-vectorize-aggressive' has been deprecated and will be ignored [-Wunused-command-line-argument] 58 | clang -O3 -fstrict-aliasing -march=rv64gcv_zvl128b -fvectorize -fslp-vectorize-aggressive -c -o dummy.o dummy.c 59 | clang: warning: the flag '-fslp-vectorize-aggressive' has been deprecated and will be ignored [-Wunused-command-line-argument] 60 | clang -O3 -fstrict-aliasing -march=rv64gcv_zvl128b -fvectorize -fslp-vectorize-aggressive -c -o common.o common.c 61 | clang: warning: the flag '-fslp-vectorize-aggressive' has been deprecated and will be ignored [-Wunused-command-line-argument] 62 | clang tsvc_vec.o dummy.o common.o -lm -o ../bin/clang/tsvc_vec_precise 63 | clang -O3 -fstrict-aliasing -march=rv64gcv_zvl128b -fno-vectorize -c -o tsvc_novec.o tsvc.c 64 | clang tsvc_novec.o dummy.o common.o -lm -o ../bin/clang/tsvc_novec_precise 65 | rm common.o tsvc_vec.o dummy.o tsvc_novec.o 66 | make[1]: Leaving directory '/home/preames/benchmark/TSVC_2/src' 67 | ++ ./bin/clang/tsvc_novec_relaxed 68 | (seeming hang here) 69 | ^C 70 | $ ./bin/clang/tsvc_vec_relaxed 71 | Loop Time(sec) Checksum 72 | s000 5.602 512080000.000000 73 | s111 17.220 32000.410156 74 | s1111 9.820 16005.822266 75 | s112 38.008 84617.781250 76 | s1112 16.379 32001.644531 77 | s113 10.646 32000.644531 78 | s1113 24.405 32001.644531 79 | s114 29.867 919.861389 80 | s115 72.877 31745.953125 81 | s1115 84.110 0.065533 82 | s116 109.654 32000.000000 83 | s118 47.019 85045.453125 84 | s119 7.977 86338.992188 85 | s1119 5.458 119099.882812 86 | s121 12.238 32009.031250 87 | s122 12.861 196490.921875 88 | s123 20.708 32003.287109 89 | s124 10.069 32001.644531 90 | s125 7.529 131072.000000 91 | s126 16.304 66954.960938 92 | s127 11.909 32003.287109 93 | s128 28.782 80000.000000 94 | s131 20.485 32009.031250 95 | s132 11.569 65538.562500 96 | s141 34.534 
32487076.000000 97 | s151 20.449 32009.031250 98 | s152 9.227 152207.218750 99 | s161 20.115 64002.054688 100 | s1161 38.637 64002.464844 101 | s162 6.476 32009.031250 102 | s171 4.017 196491.250000 103 | s172 12.634 196491.250000 104 | s173 23.830 32001.640625 105 | s174 24.915 32001.640625 106 | s175 4.079 32009.031250 107 | s176 3.998 32063.832031 108 | s211 32.163 63983.183594 109 | s212 27.900 132011.375000 110 | s1213 21.340 132022.765625 111 | s221 14.190 1543685120.000000 112 | s1221 4.356 630444416.000000 113 | s222 5.587 32000.000000 114 | s231 166.512 119099.882812 115 | s232 4.983 65536.000000 116 | s1232 214.019 33958.011719 117 | s233 655.004 504912.625000 118 | s2233 168.476 337652.906250 119 | s235 343.114 160023.984375 120 | s241 65.050 64000.000000 121 | s242 5.361 1535952000.000000 122 | s243 12.132 810624.937500 123 | s244 32.335 70102.437500 124 | s1244 41.270 108496.453125 125 | s2244 10.868 70102.890625 126 | s251 29.778 32004.371094 127 | s1251 48.582 400005.968750 128 | s2251 30.529 2.635709 129 | s3251 36.627 12.595594 130 | s252 5.307 63999.000000 131 | s253 8.936 3200010240.000000 132 | s254 14.119 32000.000000 133 | s255 5.238 31968.515625 134 | s256 16.912 66207.703125 135 | s257 34.863 163072.000000 136 | s258 0.290 14.652790 137 | s261 27.673 260999.140625 138 | s271 25.851 689753.562500 139 | s272 9.199 64000.000000 140 | s273 12.133 99051.656250 141 | s274 10.936 3200195328.000000 142 | s275 29.940 65536.000000 143 | s2275 673.580 65536.000000 144 | s276 42.638 689753.562500 145 | s277 26.102 32000.000000 146 | s278 12.404 64012.593750 147 | s279 7.634 64014.289062 148 | s1279 9.257 64014.289062 149 | s2710 4.762 96003.289062 150 | s2711 25.793 689753.562500 151 | s2712 25.851 289751.656250 152 | s281 30.143 32000.000000 153 | s1281 48.696 inf 154 | s291 40.253 32000.000000 155 | s292 32.151 31968.515625 156 | s293 24.644 32000.000000 157 | s2101 13.391 1704092.000000 158 | s2102 18.379 256.000000 159 | s2111 27.172 34545052.000000 
160 | s311 15.612 10.950724 161 | s31111 9.142 10.950724 162 | s312 21.593 1.030958 163 | s313 15.516 1.644884 164 | s314 7.816 1.000000 165 | s315 24.030 54857.000000 166 | s316 8.394 0.000031 167 | s317 1.270 0.000000 168 | s318 16.184 32002.000000 169 | s319 18.855 43.802910 170 | s3110 20.982 514.000000 171 | s13110 20.984 514.000000 172 | s3111 0.900 10.950724 173 | s3112 8.425 1.644725 174 | s3113 7.174 2.000000 175 | s321 12.543 32000.000000 176 | s322 11.702 32000.000000 177 | s323 18.941 146484.968750 178 | s331 24.362 32000.000000 179 | s332 22.108 -1.000000 180 | s341 30.289 10.950724 181 | s342 32.482 10.950724 182 | s343 17.078 1567.833496 183 | s351 84.301 25600002048.000000 184 | s1351 42.303 32010.253906 185 | s352 54.830 1.644891 186 | s353 17.033 3200002816.000000 187 | s421 17.432 32009.031250 188 | s1421 21.219 16000.000000 189 | s422 33.683 257.701416 190 | s423 18.757 439.690308 191 | s424 20.245 822.360596 192 | s431 40.825 1674247.125000 193 | s441 9.611 196491.250000 194 | s442 11.936 114240.132812 195 | ^[[B s443 15.799 361015.906250 196 | s451 28.127 32009.281250 197 | s452 21.107 32512.015625 198 | s453 6.317 21.901447 199 | s471 5.761 64004.933594 200 | s481 33.085 196491.250000 201 | s482 35.027 196491.250000 202 | s491 14.245 32001.644531 203 | s4112 9.376 1127134.250000 204 | s4113 12.616 32001.644531 205 | s4114 10.998 32000.000000 206 | s4115 7.811 1.038800 207 | s4116 5.448 0.753265 208 | s4117 8.916 32002.207031 209 | s4121 6.455 196491.250000 210 | va 32.810 1.644884 211 | vag 14.561 1.644884 212 | vas 16.406 1.644885 213 | vif 3.162 1.644884 214 | vpv 40.243 1642411.250000 215 | vtv 40.217 32000.000000 216 | vpvtv 25.801 689753.562500 217 | vpvts 3.928 175255268098048.000000 218 | vpvpv 25.882 1.644884 219 | vtvtv 25.869 32000.000000 220 | vsumr 15.633 10.950724 221 | vdotr 31.068 1.644884 222 | vbor 0.984 31924.050781 223 | -------------------------------------------------------------------------------- /llvm-riscv/tsvc/note: 
-------------------------------------------------------------------------------- 1 | This set of directories contains the results of running TSVC_2 on a BP3 2 | dev board. The particular data sets should be reasonably self-describing 3 | in terms of how they were run. 4 | -------------------------------------------------------------------------------- /llvm-shuffles.rst: -------------------------------------------------------------------------------- 1 | -------------------------- 2 | LLVM Shuffles by Example 3 | -------------------------- 4 | 5 | This document is an overview of the various common shuffle operations in LLVM's backend lowering. I'm currently working on improving the RISC-V backend's handling of fixed length shuffles, and this is being written to help me organize my thoughts. 6 | 7 | .. contents:: 8 | 9 | Broadcast Variants 10 | ------------------ 11 | 12 | A general broadcast takes a single vector element (any vector element), and repeats it across all lanes of the containing vector. Particularly useful forms of broadcast involve broadcasting the element at lane 0, and broadcasting a scalar element to all lanes of the vector type. In code, we often commingle the naming of these, so you sometimes have to pay attention to figure out which variant is being discussed. 13 | 14 | Examples: 15 | 16 | ..
code:: 17 | 18 | ;; Broadcast lane 0 to all lanes 19 | ;; ------------------------------ 20 | 21 | ;; Fixed vector 22 | shufflevector <4 x i32> %vec, <4 x i32> undef, <4 x i32> zeroinitializer 23 | 24 | ;; Scalable vector 25 | shufflevector <vscale x 4 x i32> %vec, <vscale x 4 x i32> undef, <vscale x 4 x i32> zeroinitializer 26 | 27 | ;; Broadcast a scalar to all lanes 28 | ;; ------------------------------- 29 | 30 | ;; Fixed vector 31 | %vec = insertelement <4 x i32> undef, i32 %elem, i64 0 32 | shufflevector <4 x i32> %vec, <4 x i32> undef, <4 x i32> zeroinitializer 33 | 34 | ;; Scalable vector 35 | %vec = insertelement <vscale x 4 x i32> undef, i32 %elem, i64 0 36 | shufflevector <vscale x 4 x i32> %vec, <vscale x 4 x i32> undef, <vscale x 4 x i32> zeroinitializer 37 | 38 | ;; Broadcast the value in lane 1 to all lanes 39 | ;; ------------------------------------------- 40 | 41 | ;; Fixed vector 42 | shufflevector <4 x i32> %vec, <4 x i32> undef, <4 x i32> <i32 1, i32 1, i32 1, i32 1> 43 | 44 | ;; Scalable vector 45 | ;; Not cleanly representable 46 | 47 | 48 | In TargetTransformInfo, `SK_Broadcast` specifically refers to a lane 0 broadcast (possibly of a scalar). The generic any-lane broadcast becomes an `SK_PermuteSingleSrc`. 49 | 50 | 51 | Single Source Permutes 52 | ---------------------- 53 | 54 | A single source permute is a shuffle where all output lanes come from one of the two input vectors. It represents a permutation of exactly one of the input vectors. A permute does not change the length of the vector. 55 | 56 | In TargetTransformInfo, `SK_PermuteSingleSrc` models this case. There are multiple sub-categories within single source permutes where better lowerings are available. See also interleave, deinterleave, select, broadcast, and reverse. 57 | 58 | 59 | Two Source Permutes 60 | ------------------- 61 | 62 | A two source permute is a shuffle where the output length is equal to the length of each of the input vectors. Conceptually, there are two (equivalent) common mental models.
The first is that we perform a single source permute on the result of concatenating the two source vectors and then extract the leading sub-vector. The second is that we perform one permute on each source vector, and then merge the results with a vector select. 63 | 64 | 65 | In TargetTransformInfo, `SK_PermuteTwoSrc` models this case. It is the fallback for when nothing more specific can be identified. 66 | 67 | 68 | Others (Updates Pending) 69 | ------------------------- 70 | 71 | .. code:: 72 | 73 | SK_Reverse, ///< Reverse the order of the vector. 74 | SK_Select, ///< Selects elements from the corresponding lane of 75 | ///< either source operand. This is equivalent to a 76 | ///< vector select with a constant condition operand. 77 | SK_Transpose, ///< Transpose two vectors. 78 | SK_InsertSubvector, ///< InsertSubvector. Index indicates start offset. 79 | SK_ExtractSubvector, ///< ExtractSubvector Index indicates start offset. 80 | SK_Splice ///< Concatenates elements from the first input vector 81 | ///< with elements of the second input vector. Returning 82 | ///< a vector of the same type as the input vectors. 83 | ///< Index indicates start offset in first input vector. 84 | -------------------------------------------------------------------------------- /observable-allocations.rst: -------------------------------------------------------------------------------- 1 | ------------------------------------------------- 2 | Observable Allocations 3 | ------------------------------------------------- 4 | 5 | 6 | .. contents:: 7 | 8 | Examples 9 | ======== 10 | 11 | .. code:: 12 | 13 | free(malloc(8)); 14 | ==> 15 | nop 16 | 17 | .. code:: 18 | 19 | malloc(0); 20 | ==> 21 | nullptr 22 | 23 | .. code:: 24 | 25 | free(realloc(o, N)) 26 | ==> 27 | free(o); 28 | 29 | .. code:: 30 | 31 | o1 = realloc(o, 0) 32 | // (if o1's address is not captured) 33 | ==> 34 | free(o); 35 | o1 = nullptr; 36 | 37 | .. 
code:: 38 | 39 | free(new int()) 40 | ==> 41 | unreachable 42 | 43 | .. code:: 44 | 45 | o = malloc(8); 46 | lifetime_end(o+4, 4) 47 | ==> 48 | o = malloc(4) 49 | 50 | .. code:: 51 | 52 | o = malloc(8); 53 | lifetime_end(o, 4) 54 | ==> 55 | o = malloc(4) - 4 56 | 57 | .. code:: 58 | 59 | if (o != nullptr) free(o) 60 | ==> 61 | free(o) 62 | 63 | Other Allocation Properties 64 | =========================== 65 | 66 | Nullability 67 | ----------- 68 | 69 | .. code:: 70 | 71 | allocate(N) == nullptr? 72 | 73 | Zero Sized Allocations 74 | ---------------------- 75 | 76 | .. code:: 77 | 78 | allocate(0) == allocate(0)? 79 | 80 | Is the allocator required to return distinct addresses? Or guaranteed to fail via exception or error-val on a zero sized allocation? 81 | 82 | Object Crossing Compares 83 | ------------------------ 84 | 85 | .. code:: 86 | 87 | allocate(N) + N == allocate(M)? 88 | 89 | Is this well defined? 90 | 91 | Distinct Heap/Stack/Globals 92 | --------------------------- 93 | 94 | If the prior example is well defined, are there limits on which objects can be compared? 95 | 96 | .. code:: 97 | 98 | allocate(N) == &global_var? 99 | allocate(N) == &stack_var? 100 | 101 | Which of these are well defined? 102 | 103 | Guessed Pointers 104 | ---------------- 105 | 106 | .. code:: 107 | 108 | allocate(N) == cast(0xmyconstant)? 109 | 110 | Is this well defined? 111 | -------------------------------------------------------------------------------- /optimization-tricks.txt: -------------------------------------------------------------------------------- 1 | A collection of cute optimization tricks, mostly things I haven't had time to follow up on...
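Relatedly, the observable-allocations rewrite above that drops the null check before ``free`` is justified by the C standard's guarantee that ``free(NULL)`` is a no-op, so the guarded and unguarded forms are indistinguishable. A quick check (hypothetical helper name):

```c
#include <stdlib.h>

// free(NULL) is defined by the C standard to do nothing, which is what
// makes
//   if (o != nullptr) free(o)  ==>  free(o)
// a valid rewrite: no program can observe the difference.
static int null_check_elided(void) {
    int *o = NULL;
    free(o);        // no-op by definition
    o = malloc(8);  // may be NULL on failure...
    free(o);        // ...and free handles that case too
    return 1;
}
```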
2 | 3 | Loads from constant globals with patterns in data 4 | (Note: LLVM currently handles some cases w/compares, but that's all I can find) 5 | a = constant global [1,3,5,7,9] 6 | a[i] -> 1 + i * 2 7 | (and other assorted idioms) 8 | 9 | Use masked loads w/pass through to eliminate selects of addresses 10 | a = select c, a1, a2 11 | v = load a 12 | -> 13 | v1 = masked.load(c, a1, undef) 14 | v = masked.load(!c, a2, v1) 15 | (can do better if either are safe to speculate) 16 | 17 | trivial specialization (no cost dispatch) 18 | foo(..., bool b) { if (b) C; else D; } 19 | -> 20 | foo(...) { if(b) tail C(); else tail D(); } 21 | (potentially very interesting w/early return idioms) 22 | 23 | 24 | Tricks for encoding long jumps with fewer bytes using implicit fault handlers 25 | LongJCC: # 6 bytes 26 | jle baz 27 | 28 | ShortJCC: # 2 bytes 29 | jle bar 30 | 31 | forward_jump_int3: # 3 bytes 32 | jle continue 33 | int3 # bad to have in instruction stream? 34 | continue: 35 | 36 | forward_jump_ud2: # 4 bytes 37 | jle continue2 38 | ud2 # bad to have in instruction stream? 39 | continue2: 40 | 41 | jcc_payload: # 4 bytes 42 | test %rax, %rcx 43 | .byte 116 #0x74, je 44 | fault: 45 | .byte 204 #0xcc, int3 as payload of je 46 | jle fault #backbranch into previous jcc 47 | 48 | and_payload: # 4 bytes 49 | test %rax, %rcx 50 | .byte 4 #0x04, add AL, 204 51 | fault2: 52 | .byte 204 #0xcc, int3 as payload of add 53 | jle fault2 #backbranch into previous add 54 | 55 | bar: 56 | retq 57 | .section ".other" 58 | baz: 59 | retq 60 | 61 | -------------------------------------------------------------------------------- /pointer-provenance.rst: -------------------------------------------------------------------------------- 1 | ------------------------------------------------- 2 | Pointer Provenance in LLVM 3 | ------------------------------------------------- 4 | 5 | **THIS IS AN IN PROGRESS DOCUMENT. It hasn't yet been "published".
At the moment, this is KNOWN to not work.** 6 | 7 | This write-up is an attempt to wrap my head around the recent byte type discussion on llvm-dev, prompted by `nhaehnle's blog post `_. After working through the problem myself, I stumbled across the same CSE issue he mentions. I haven't yet understood why the CSE problem is restricted to integer types, but maybe I'd get there with time. I'm out of time for the moment. 8 | 9 | .. contents:: 10 | 11 | I'm going to start by stating a few critical tidbits as I think there's been some confusion about this in the discussion thread. 12 | 13 | Memory is not typed (well mostly) 14 | --------------------------------- 15 | 16 | Today, LLVM's memory is not typed. You can store one type, and load back the same bits with another type. You can use mismatched load and store widths. 17 | 18 | However, this is *not* the same as saying memory holds only the raw bits of the value. There are a couple of corner cases worth talking through. 19 | 20 | First, we model uninitialized memory as containing a special ``undef`` value. This value is a special placeholder which allows the compiler to choose any value it wishes independently for each use. 21 | 22 | Second, we allow ``poison`` to propagate through memory. Poison is used to model the result of an undefined operation (e.g. an add which was asserted not to overflow, but did in fact overflow) which has not yet reached a point which triggers undefined behavior. The definition of poison carefully balances allowing operations to be hoisted with the complexity of the semantics. We've only recently gotten ``poison`` into a somewhat solid shape. 23 | 24 | The key bit about both is that the rules for poison and undef propagation and the dynamic semantics of LLVM IR instructions with regards to each are chosen carefully such that "forgetting" memory contains ``undef`` or ``poison`` is entirely correct. 
Or to say it differently, converting (intentionally or unintentionally) either to a specific concrete value is a refining transformation. 25 | 26 | Aside: For garbage collection, we added the concept of a non-integral pointer type. The original intent was *not* to require typed memory, but in practice, we have had to essentially assume this. For a collector which needs to instrument pointer loads (but *not* integer loads), having the optimizer break memory typing would be incorrect. If anything, this provides indirect evidence that memory is not typed, as otherwise non-integral pointers would have never been needed. 27 | 28 | Aliasing vs Pointer Equality 29 | ---------------------------- 30 | 31 | The C/C++ rules for what is legal with an out-of-bounds pointer are complicated. This is relevant because LLVM IR needs to correctly model the semantics of these languages. Note that it does *not* mean that LLVM's semantics must exactly follow the C/C++ semantics, merely that there must be a reasonable translation from one to the other. 32 | 33 | The key detail which is relevant here is that pointer *values* can be legally compared and have well defined semantics while *accessing the memory pointed to* may be fully undefined. 34 | 35 | This comes up in the context of alias analysis because it's possible to have two pointers which are equal, but don't alias. The classic example would be: 36 | 37 | .. code:: 38 | 39 | %v1 = malloc(8) 40 | %v2 = realloc(%v1, 8) 41 | %v1 and %v2 *may* be equal here, but are guaranteed no-alias. 42 | 43 | Aside: If the runtime allows multi-mapping of memory pages, it's also possible to have two pointers which are unequal, but must alias. This isn't well modeled today in LLVM, and is definitely out of scope for this discussion. 44 | 45 | Memory Objects 46 | -------------- 47 | 48 | The LLVM memory model consists of individual memory allocations which are (conceptually) infinitely far apart in memory. 
In an actual execution environment, allocations might be near each other. To reconcile this, it's important to note that comparing or subtracting two pointers from different allocations results in an undefined (i.e. ``poison``) result. 49 | 50 | So what? 51 | --------- 52 | 53 | My understanding of the current pointer provenance rules is the following: 54 | 55 | * A pointer is (conceptually) derived from some particular memory object. 56 | * We may or may not actually be able to determine which object that is. Essentially this means there is an ``any`` state possible for pointer provenance. 57 | * The optimizer is free to infer provenance where it can. BasicAA, for instance, is essentially a complicated provenance inference engine. 58 | 59 | There's a key implication of the second bullet. Provenance is propagated through memory for pointer values. We may not be able to determine it statically (or dynamically), but conceptually it exists. 60 | 61 | This implies there are some corner cases we have to consider around "incorrectly" typed loads, and overlapping memory accesses. At the moment, I don't see a reason why we can't simply define the provenance of any pointer load which doesn't exactly match a previous pointer store as being ``any``. We do need to allow refining transformations which expose provenance information (e.g. converting an integer store to a pointer one if the value being stored is a cast pointer), but I don't see that as being particularly problematic. 62 | 63 | Let me try stating that again, this time a bit more formally. 64 | 65 | We're going to define a dynamic (e.g. execution or small step operational) semantics for pointer provenance. Every byte of a value will be mapped to some memory object or one of two special marker values, ``poison`` or ``undef``. Note that this is *every* value, not just *pointer values*. 66 | 67 | Allocations define a new symbolic memory object. 
GetElementPtrs generally propagate their base pointer's provenance to their result, but see the rule below for mismatched provenance. 68 | 69 | Storing a pointer to memory conceptually creates an entry in a parallel memory which maps those bytes to the corresponding memory object. Every time memory is stored over, that map is updated. Additionally, the side memory remembers the bounds of the last store which touches each byte in memory. 70 | 71 | Loading a pointer reads the last written provenance associated with the address. If the bytes read were last written by two different stores, the resulting provenance is ``poison``. 72 | 73 | Casting a pointer to an integer does *not* strip provenance. As a result, round tripping a pointer through an integer-to-pointer cast and back is a nop. This is critical to keep the correspondence with the memory semantics. 74 | 75 | Integer constants are considered to have the provenance of the memory object which happens to be at that address at runtime, or the special value ``undef`` if there is no such memory object. 76 | 77 | Any operation (gep, or integer math) which consumes operands of two distinct provenances returns a result with the provenance ``poison``, with the caveat that an ``undef`` provenance can take on the value of any memory object chosen by the optimizer. (This is analogous to ``undef`` semantics on concrete values, just extended to the provenance type.) Note that the result is *not* the ``poison`` value, it is a value with ``poison`` provenance. 78 | 79 | Memory operations with a memory operand with ``poison`` provenance are undefined. Comparison instructions with a pointer operand with ``poison`` provenance return the value ``poison``. 80 | 81 | Now, let's extend that to a static semantics. The key thing we have to add is the marker value ``any`` as a possible provenance. ``any`` means simply that we don't (yet) know what the provenance is, and must be conservative in our treatment. 
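As a concrete illustration of the mixed-provenance rule, consider the following hypothetical fragment; the comments track the provenance each value carries under the dynamic semantics sketched above:

.. code::

   %a  = malloc(8)         ; provenance: object A
   %b  = malloc(8)         ; provenance: object B
   %ia = ptrtoint %a       ; still carries provenance A
   %ib = ptrtoint %b       ; still carries provenance B
   %d  = sub i64 %ia, %ib  ; two distinct provenances, so %d carries
                           ; poison provenance (it is *not* the
                           ; poison value itself)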
82 | 83 | As is normal, the optimizer is free to implement refining transformations which make the program less undefined. As a result, memory forwarding, CSE, etc. all remain legal. 84 | 85 | **BUG**: CSE of two integer values with differing provenance seems to not work. 86 | 87 | 88 | -------------------------------------------------------------------------------- /project-ideas.rst: -------------------------------------------------------------------------------- 1 | .. header:: This is a collection of random ideas for small projects I think are interesting, and would in theory love to work on someday. Feel free to steal anything here which inspires you! 2 | 3 | ------------------------------------------------- 4 | Free! Project Ideas 5 | ------------------------------------------------- 6 | 7 | .. contents:: 8 | 9 | Missing Opt Reduction 10 | --------------------- 11 | 12 | With any form of super-optimizer, one tricky bit is turning missing optimizations in large inputs into small self-contained test cases (i.e. making the findings actionable for compiler developers). Automated reduction tools are generally good at reducing failures (i.e. easy interestingness tests). A tool which wrapped a super optimizer and simply returned an error code if the super optimizer could find a further optimization in the input would allow integration with e.g. bugpoint and llvm-reduce. Bonus points if the wrapper passed through options to the underlying tool invocation (e.g. see opt-alive) so that a test can be entirely self-contained. 13 | 14 | opt-alive for NFC proofs 15 | ------------------------ 16 | 17 | Many changes are marked NFC. These are generally candidates for differential proofs, and the opt-alive tool (from alive2) seems to be a good fit here. It could very well be enough to prove many NFCs were in fact NFCs (for at least some build environment). 
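To make the wrapper idea from "Missing Opt Reduction" concrete, a minimal interestingness test might look like the sketch below. The tool name ``super-opt`` and its output format are placeholders for whatever super-optimizer is being wrapped, not a real command:

.. code::

   #!/bin/sh
   # Exit 0 ("interesting") iff the super optimizer still finds a missed
   # optimization in the candidate; llvm-reduce keeps shrinking the input
   # while this property holds.
   #   usage: interesting.sh candidate.ll
   super-opt "$1" 2>&1 | grep -q "missed optimization"

Something like ``llvm-reduce --test=interesting.sh big.ll`` would then shrink a large module down to a minimal example which still exhibits the missed optimization.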
18 | 19 | SCEV Based Array Language 20 | -------------------------- 21 | 22 | Many places in LLVM need to be able to reason about memory accesses spanning multiple iterations of a loop (e.g. loop-idiom, vectorizer, etc.). SCEV gives an existing way to model addresses and values stored (as separate SCEVs), but we don't really have a mechanism to model the memory access as a first-class object. 23 | 24 | Having "store to <%obj,+,8>" as a first-class construct allows generalizations of existing transforms. For example: two accesses of the form "store to <%obj,+, 8>" and "store to <%obj+4,+, 8>" can be merged into a single wider "store to <%obj,+, 8>", enabling generalized memset recognition. 25 | 26 | Looking at adjacent loops, knowing that two stores overlap (i.e. a later loop clobbers the same memory), allows iteration space reductions for the first. 27 | 28 | This may combine in interesting ways with MemorySSA. I have not looked at that closely. 29 | 30 | Focus on optimizing outparams 31 | ----------------------------- 32 | 33 | The example below (originally written in the context of https://reviews.llvm.org/D109917) reveals some interesting missed LLVM optimizations. 34 | 35 | .. code:: 36 | 37 | int foo(); 38 | 39 | extern int test(int *out) __attribute__((noinline)); 40 | int test(int *out) { 41 | *out = foo(); 42 | return foo(); 43 | } 44 | 45 | int wrapper() { 46 | int notdead; 47 | if (test(&notdead)) 48 | return 0; 49 | 50 | int dead; 51 | int tmp = test(&dead); 52 | if (notdead) 53 | return tmp; 54 | return foo(); 55 | } 56 | 57 | Here are the ones I've noticed so far: 58 | 59 | * Failure to infer the writeonly argument attribute. I went ahead and posted https://reviews.llvm.org/D114963, but there's a bunch of follow-up work needed. 60 | * Failure to sink the second call, the one writing to 'dead' (i.e. the motivating review above). 61 | * Failure to color allocas and reuse the same alloca/variable for outparams to both calls. 
62 | * Opportunity to promote outparam to struct return in dso local copy with shim. (Not sure this is generally profitable though.) 63 | * Opportunity to reduce lifetime.start/end scope around 'notdead' alloca. You can think of this as part of coloring, or as a separate canonicalization step. 64 | * Missed opportunity for tailcall optimization in the final call to foo() in wrapper(). Similar for the final call to foo() in test(). 65 | -------------------------------------------------------------------------------- /recurrences.rst: -------------------------------------------------------------------------------- 1 | 2 | ------------------------------------------------- 3 | Fun w/Recurrences in LLVM 4 | ------------------------------------------------- 5 | 6 | This page is a mixture of a writeup on some recent work I've done around recurrences in LLVM, and a punch list for potential follow-ups. 7 | 8 | .. contents:: 9 | 10 | Problem Statement 11 | ================= 12 | 13 | A recurrence is simply a sequence where the value of f(x) is computable from f(x-1) and some constant terms. I find `this definition `_ useful. 14 | 15 | Historically, LLVM has primarily used ScalarEvolution for modeling loop recurrences, but SCEV is rather specialized for add recurrences. Over the years, we've grown a collection of one-off logic in a couple of different places, mostly because of pragmatic concerns about a) the compile time of SCEV, and b) the difficulty of plumbing SCEV through to all the places. 16 | 17 | Notation 18 | ======== 19 | 20 | Skip this until one of my examples doesn't make sense, then come back. 21 | 22 | **<Start,Op,Step>** follows the convention SCEV uses for displaying add recurrences and generalizes it for any Op. <Start,Op,Step> expands to: 23 | 24 | :: 25 | 26 | %phi = phi i64 [%start, %entry], [%next, %backedge] 27 | ... 28 | %next = opcode i64 %phi, %step 29 | 30 | For non-commutative operators, I generally only use this notation for the form with %phi on the left-hand side of the operator. 
Note that Step may not be loop invariant unless explicitly stated otherwise. 31 | 32 | I will also sometimes use the notation f(x) = g(f(x-1)) where that's easier to follow. In particular, I use that for the inverted forms of the non-commutative operators. You should assume that f(0) == Start unless explicitly said otherwise. 33 | 34 | **LIV** stands for "loop invariant value" and is simply a value which is invariant in the loop referred to. 35 | 36 | **CIV** stands for "canonical induction variable". A canonical IV is the sequence <0,+,1>. (Or sometimes <1,+,1>. Yes, I'm being sloppy about my off-by-ones.) 37 | 38 | Fun w/Recurrences 39 | ================= 40 | 41 | Integer Arithmetic 42 | ------------------ 43 | 44 | **ADD recurrences** are generally well covered by Scalar Evolution already. 45 | 46 | **SUB recurrences** are generally canonicalized to add recurrences. One interesting case is: 47 | 48 | :: 49 | 50 | %phi = phi i64 [%start, %entry], [%next, %backedge] 51 | ... 52 | %next = sub i64 %LIV, %phi 53 | 54 | That is, f(x) = LIV - f(x-1). This alternates between Start and LIV - Start. If we unrolled the loop by 2, we could eliminate the IV entirely. 55 | 56 | Status: Unimplemented 57 | 58 | 59 | A **mul recurrence** w/a loop invariant step value is a power sequence of the form Start*Step^CIV when overflow can be disproven. (TODO: Does this hold with wrapping twos complement arithmetic?) See my notes on `exponentiation in SCEV `_ for ideas on what we could do here. It's worth noting that the overflow cases may be identical to the cases where we could canonicalize the shifts. (TBD) 60 | 61 | A **udiv/sdiv recurrence** w/a loop invariant step forms a sequence of the form Start div (Step ^ CIV) when overflow can be disproven. Again, exponentiation? 62 | 63 | Shift Operators 64 | --------------- 65 | 66 | TBD - A bunch of work here on known bits, range info, and SCEV, both in flight, landed, and planned. Needs to be written up. 
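The alternating behaviour of the sub recurrence above is easy to sanity check outside the compiler. This is just a throwaway illustration script, not LLVM code:

```python
# f(x) = LIV - f(x-1) alternates between Start and LIV - Start, which is
# why unrolling by 2 would let us eliminate the IV entirely.
def sub_recurrence(start, liv, n):
    vals = [start]
    for _ in range(n):
        vals.append(liv - vals[-1])
    return vals

print(sub_recurrence(3, 10, 5))  # -> [3, 7, 3, 7, 3, 7]
```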
67 | 68 | 69 | Bitwise Operators 70 | ----------------- 71 | 72 | **AND and OR recurrences** w/ a loop invariant step value stabilize after the first iteration. That is, anding (oring) a value repeatedly has no effect. Thus 73 | 74 | :: 75 | 76 | %phi = phi i64 [%start, %entry], [%next, %backedge] 77 | ... 78 | %next = and i64 %phi, %LIV 79 | 80 | Is equivalent to: 81 | 82 | :: 83 | 84 | %next = and i64 %start, %LIV 85 | ... 86 | %phi = phi i64 [%start, %entry], [%next, %backedge] 87 | ... 88 | 89 | Status: Implemented in Instcombine (via `D97578 `_) + existing LICM transform. 90 | 91 | An **XOR recurrence** w/ a loop invariant step value will alternate between two values. As such, there is a potential to eliminate the recurrence by unrolling the loop by a factor of two. 92 | 93 | Status: Unimplemented. 94 | 95 | 96 | Floating Point Arithmetic 97 | -------------------------- 98 | 99 | In general, floating point is tricky because many operators are not associative. 100 | 101 | Most of the obvious options involve proving floating point IVs can be done in integer math. I have some old patches pending review (`D68954 `_ and `D68844 `_), but there's little active progress here. 102 | 103 | Index 104 | ===== 105 | 106 | This section has the same information as above, but indexed by optimization type. 107 | 108 | Known Bits 109 | ---------- 110 | 111 | Computing known bits for simple recurrences. Currently handled: lshr, ashr, shl, add, sub, and, or, mul. Missing cases of note include: overflow intrinsics, udiv, sdiv, urem, srem. 112 | 113 | isKnownNonZero 114 | -------------- 115 | 116 | Can we tell a recurrence is non-zero through its entire range? Currently handles add, mul, shl, ashr, and lshr. Missing cases of note include: overflow intrinsics, udiv, sdiv. 117 | 118 | isKnownNonEqual 119 | ---------------- 120 | 121 | Can we tell two recurrences are unequal (cheaply)? Used by BasicAA. A patch is out for review which handles add, sub, mul, shl, lshr, and ashr. 
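The stabilizing and alternating behaviours of the bitwise recurrences above can likewise be sanity checked with a throwaway script (illustrative only, not LLVM code):

```python
# An AND (or OR) recurrence with a loop invariant step stabilizes after
# the first iteration; an XOR recurrence alternates with period two.
def iterate(op, start, liv, n):
    vals = [start]
    for _ in range(n):
        vals.append(op(vals[-1], liv))
    return vals

and_seq = iterate(lambda a, b: a & b, 0b1101, 0b1011, 4)
xor_seq = iterate(lambda a, b: a ^ b, 0b1101, 0b1011, 4)
print(and_seq)  # -> [13, 9, 9, 9, 9]   stable after one step
print(xor_seq)  # -> [13, 6, 13, 6, 13] period two
```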
122 | 123 | Constant Range 124 | -------------- 125 | 126 | Entirely TODO - not clear how worthwhile this is, as known bits gets the common cases which constant ranges could handle. 127 | 128 | SCEV Range 129 | ---------- 130 | 131 | Exploiting trip count information to refine constant range results. Currently, only shl is handled. Changes for ashr and lshr are out for review. 132 | 133 | RLEV 134 | ---- 135 | 136 | IndVarSimplify can rewrite loop exit values. For some of the alternating patterns (e.g. see xor above), if we know the trip count, we can select a single exit value and fold uses outside the loop (e.g. selecting between two values based on the primary IV). We can also use knowledge of the trip count multiple (which SCEV computes) to avoid the select in some cases. 137 | 138 | Entirely unimplemented. 139 | 140 | Unroll/Vectorize Costing 141 | ---------------------------- 142 | 143 | For alternating patterns, we can exploit the fact that unrolling the loop by a multiple of the alternation length results in fixed patterns for each lane. This could affect unrolling and vector codegen, but most importantly, should drive costing. 144 | 145 | Entirely unimplemented. We can do this for general SCEV formulas as well. 146 | -------------------------------------------------------------------------------- /riscv-attribute-validation.rst: -------------------------------------------------------------------------------- 1 | --------------------------- 2 | RISCV Attribute Validation 3 | --------------------------- 4 | 5 | At the moment, this page is a collection of notes on the topic of attribute validation by the linker. There's no real coherent message here - yet - it's more of a reference document to help organize my thoughts. 6 | 7 | .. contents:: 8 | 9 | Compatibility 10 | ------------- 11 | 12 | Since `87fdd7ac09b ("RISC-V: Stop reporting warnings for mismatched extension versions") `_ which landed in 2.38, ld stopped trying to enforce any attribute compatibility. 
Previous versions enforced (at least) the following cases: 13 | 14 | * Unrecognized extension name 15 | * Mismatched extension version between object files 16 | 17 | LD 2.38 is still pretty recent. In particular, it has not yet (as of Feb 2023) made it into e.g. Debian stable. 18 | 19 | LLD historically did not enforce attribute consistency. In `8a900f2438b4 ([ELF] Merge SHT_RISCV_ATTRIBUTES sections) `_ this was accidentally changed as part of a refactoring. This is being treated as a regression (see `D144353 `_), and will (hopefully!) not make it into a public release. 20 | 21 | I have not investigated the status of `mold` or other linkers. 22 | 23 | Previous Breakage 24 | ----------------- 25 | 26 | Clang Built Linux (Multiple) 27 | ============================ 28 | 29 | Feb 2023 `breakage `_ due to _zicsr_zifencei with older LD. Older versions of LD did not recognize these extension names and failed to link. This case was a cross-toolchain compatibility one; it was an LLVM toolchain which emitted the new extension. 30 | 31 | Feb 2023 `breakage `_ due to unrecognized I version with TOT LLD. This is the LLD regression mentioned above, but is interesting to call out as it was an extension *version* which was unrecognized, not an extension *name*. 32 | 33 | Feb 2023 breakage due to Zmmul implication. I've been verbally told about this one, but can't find a citation. My understanding is that older LD did not support Zmmul, but that a newer (GNU?) toolchain did. Thus linking within the same toolchain family, but across different versions, broke. Details here may be wrong. 34 | 35 | Sept 2022 `breakage `_ due to unrecognized `zicbom` extension with LD. This is another example of the toolchains evolving independently. LLVM added zicbom and the version of LD used in a mixed environment did not support the extension. 36 | 37 | More generally, the kconfig files are full of workarounds for related issues. 38 | 39 | 40 | Use Cases 41 | --------- 42 | 43 | Mixed Object Case (for e.g. 
environmental check dispatch) 44 | 45 | Linking new object files with an older linker 46 | 47 | Cross toolchain linking (and thus version lock) 48 | 49 | For both of the prior two, there are sub-cases for unrecognized extension and version. 50 | -------------------------------------------------------------------------------- /riscv-ecosystem.rst: -------------------------------------------------------------------------------- 1 | --------------- 2 | RISCV Ecosystem 3 | --------------- 4 | 5 | This is a quick reference guide to find various bits of RISCV-specific ecosystem pieces. At the moment, the primary audience is me; I tend to forget this stuff, and being able to find links easily is useful. 6 | 7 | .. contents:: 8 | 9 | Distros 10 | ------- 11 | 12 | Fedora 13 | ====== 14 | 15 | `Current Disk Images `_ 16 | 17 | `Useful Overview from late 2019 `_ 18 | 19 | Ubuntu 20 | ====== 21 | 22 | `Download and install instructions for latest Dev Previews `_. 23 | 24 | `Wiki, links to older images and qemu boot instructions `_ 25 | 26 | Alpine 27 | ======= 28 | 29 | `Tracking issue for official riscv64 support `_ 30 | 31 | `A good writeup on early bringup and porting (from 2018) `_, `An early 2022 update is here `_. 32 | 33 | 34 | -------------------------------------------------------------------------------- /riscv-microarch.rst: -------------------------------------------------------------------------------- 1 | ------------------------------ 2 | RISC-V Microarchitecture Notes 3 | ------------------------------ 4 | 5 | This document is a collection of links to documentation on various vendors' micro-architectures. 6 | 7 | .. contents:: 8 | 9 | Spacemit 10 | -------- 11 | 12 | `BPi-F3 Datasheet: `_ 13 | `Spacemit-K1 Datasheet: `_ 14 | 15 | To purchase: `AliExpress `_, `Amazon `_ 16 | 17 | `X100 `_ 18 | 19 | SiFive 20 | ------ 21 | 22 | "SiFive Announces First RISC-V OoO CPU Core: The U8-Series Processor IP". AnandTech. ``_ 23 | 24 | "Incredibly Scalable High-Performance RISC-V Core IP". 
``_ -- Minimal information, but the point about being able to issue int ops to the fp-pipe is interesting and not covered in the AnandTech write-up. 25 | 26 | WikiChip. ``_ -- Some information on the various N-series cores, unclear sourcing and quality. 27 | 28 | "SiFive U74 Core Complex Manual". ``_ 29 | 30 | "Inside SiFive's P550 Microarchitecture". https://chipsandcheese.com/p/inside-sifives-p550-microarchitecture 31 | 32 | SemiDynamics 33 | ------------ 34 | 35 | "Avispado: A RISC-V core supporting the RISC-V vector instruction set by Roger Espasa", ``_ 36 | 37 | Tenstorrent 38 | ----------- 39 | 40 | "Building AI Silicon: Ascalon RiscV Processor". ``_ 41 | 42 | Esperanto 43 | --------- 44 | 45 | "A Look At The ET-SoC-1, Esperanto’s Massively Multi-Core RISC-V Approach To AI". ``_ 46 | 47 | "ESPERANTO MAXES OUT RISC-V". Linley Group, Microprocessor Report. December 10, 2018. ``_ 48 | 49 | Alibaba/T-Head 50 | -------------- 51 | 52 | "T-head RVB-ICE Development Board, Dual-core XuanTie C910 RISC-V 64GC, 1.2GHz, Support Android/Debian System" ``_. Sales page, but scroll down to the description section for a reasonably detailed description. 53 | 54 | "T-Head ISA extension specification" ``_. Defines the custom extensions shipped by T-Head. 55 | 56 | Ventana 57 | ------- 58 | 59 | "VTx-family custom instructions: Custom ISA extensions for Ventana Micro Systems RISC-V cores" 60 | ``_ 61 | 62 | Other Sources 63 | ------------- 64 | 65 | Reddit post: ``_. Has some good commentary; it was also the source of two of the better links above. 66 | -------------------------------------------------------------------------------- /riscv-notes.rst: -------------------------------------------------------------------------------- 1 | --------------- 2 | Notes on RISCV 3 | --------------- 4 | 5 | This document is a collection of notes on the RISC-V architecture. This is mostly to serve as a quick reference for me, as finding some of this in the specs is a bit challenging. 6 | 7 | .. 
contents:: 8 | 9 | VLEN >= 32 (always) and VLEN >= 128 (for V extension) 10 | ----------------------------------------------------- 11 | 12 | VLEN is determined by the Zvl32b, Zvl64b, Zvl128b, etc. extensions. V implies Zvl128b. Zve64* implies Zvl64b. Zve32* implies Zvl32b. VLEN can never be less than 32 with the currently defined extensions. 13 | 14 | Additional clarification here: 15 | 16 | "Note: Explicit use of the Zvl32b extension string is not required for any standard vector extension as they all effectively mandate at least this minimum, but the string can be useful when stating hardware capabilities." 17 | 18 | Reviewing 18.2 and 18.3 confirms that none of the proposed vector variants allow VLEN < 32. 19 | 20 | As a result, VLENB >= 4 (always), and VLENB >= 16 (for V extension). 21 | 22 | ELEN <= 64 23 | ---------- 24 | 25 | While room is left for future expansion in the vector spec, current ELEN values encodable in VTYPE max out at 64 bits. 26 | 27 | vsetivli can not always encode VLMAX 28 | ------------------------------------ 29 | 30 | The five-bit immediate field in vsetivli can encode a maximum value of 31. For VLEN > 32, this means that VLMAX can not always be represented as a constant even if the exact VLEN is known at compile time. 31 | 32 | Odd gaps in vector ISA 33 | ---------------------- 34 | 35 | There are a number of odd gaps in the vector extension. By "gap" I mean a case where the ISA appears to force common idioms to generate oddly complex or expensive code. By "odd" I mean either seemingly inconsistent within the extension itself, or significantly worse than alternative vector ISAs (e.g. x86 or AArch64 SVE). I haven't gone actively looking for these; they're simply examples of repeating patterns I've seen when looking at compiler-generated assembly where there doesn't seem to be something "obvious" the compiler should have done instead. 
36 | 37 | No Zbb, Zbs (Basic bitmanip idioms) vector analogy extensions 38 | ============================================================= 39 | 40 | The lack of Zbb and Zbs analogies prevents the vectorization of many common bit manipulation idioms. The code sequences to replicate e.g. bitreverse without a dedicated instruction are rather painfully expensive. Being able to generate fast code for e.g. vredmax(ctlz(v)) has some interesting applications. 41 | 42 | Impact: Major. Practically speaking, this prevents vector usage for many common idioms. 43 | 44 | No Zbc (carryless multiply) analogy extensions 45 | ================================================== 46 | 47 | I haven't yet seen code which would benefit from a Zbc vector analogy, but I also don't spend much time on crypto, and my understanding is that crypto is the motivator for these. Here's a `public proposal for a set of vector crypto operations `_ which do include some restricted forms of carryless multiply. 48 | 49 | Impact: ?? 50 | 51 | No SEW-ignoring math ops 52 | ======================== 53 | 54 | When working with indexed loads and stores, the index width and data width are often different. For instance, say I want to load 8 bits of data from addresses `(8 * i + 256)` off a common base. The code sequence looks roughly like this:: 55 | 56 | vsetvli x0, x0, e64, mf8, ta, ma 57 | vshl v2, v2, 3 58 | vadd v2, v2, 256 59 | vsetvli x0, x0, e8, m1, ta, ma 60 | vluxei64.v vd, (x1), v2, vm 61 | 62 | Note that we're toggling vtype solely for the purpose of performing indexing in i64. 63 | 64 | If we had a version of the basic arithmetic ops which ignored SEW - or even better, a variant of the Zba instructions! - we could rewrite this sequence as:: 65 | 66 | vshl64 v2, v2, 3 67 | vadd64 v2, v2, 256 68 | vluxei64.v vd, (x1), v2, vm 69 | 70 | Or even better:: 71 | 72 | vsh3add64 v2, v2, 256 73 | vluxei64.v vd, (x1), v2, vm 74 | 75 | Note that 64 here comes from the native index width for a 64 bit machine. 
We could either produce two 32/64 variants or a single ELEN parameterized variant. 76 | 77 | Impact: minor, main benefit is reduced code size and fewer vtype changes 78 | 79 | Another idea here might be to instead have an indexed load/store variant which implicitly scaled its index vector by the index type. (That is, implicitly included a multiplication of the index vector by the index width.) That would give us code along the lines of the following:: 80 | 81 | add x2, x1, 256 82 | vluxei64.v.scaled vd, (x2), v2, vm 83 | 84 | No Cheap Mask Extend 85 | ==================== 86 | 87 | There does not appear to be a cheap zero- or sign-extend sequence to take a mask and produce e.g. an i32 vector. 88 | 89 | The best sequence I see is:: 90 | 91 | vmv.v.i vd, 0 92 | vmerge.vim vd, vd, 1, v0 93 | 94 | How to fix: 95 | 96 | * Allow EEW=1 on zext.vfN variants. This covers extend to i8. 97 | * Add zext.vf16, zext.vf32, and zext.vf64 on the prior to get all SEW. 98 | * Alternatively, add a dedicated mask extend op to SEW. 99 | 100 | Impact: fairly minor, mostly some extra vector register pressure due to the need for the zero splat. 101 | 102 | No Product Reduction 103 | ==================== 104 | 105 | There does not appear to be a way to lower an "llvm.vector.reduce.mul" or "llvm.vector.reduce.fmul" into a single reduction instruction. Other reduction types are supported, but for some reason there's no 'vredprod', 'vfredoprod' or 'vfreduprod'. 106 | 107 | Impact: minor, mostly me being a completionist. 108 | 109 | Non-vrgather vector.reverse 110 | =========================== 111 | 112 | Reversing the order of elements in a vector is a common operation. On RISC-V today, this requires the use of a vrgather, and almost more importantly, a several-instruction-long sequence to materialize the index vector. 
E.g. the following sequence reverses an i8 vector:: 113 | 114 | csrr a0, vlenb 115 | srli a0, a0, 2 116 | addi a0, a0, -1 117 | vsetvli a1, zero, e16, mf2, ta, mu 118 | vid.v v9 119 | vrsub.vx v10, v9, a0 120 | vsetvli zero, zero, e8, mf4, ta, mu 121 | vrgatherei16.vv v9, v8, v10 122 | vmv1r.v v8, v9 123 | 124 | Note that AArch64 provides an instruction for this. 125 | 126 | Other ways to improve this sequence might be to add variants of the SEW-independent index arithmetic above, or to provide a cheap way to get the VLMax splat. 127 | 128 | Lack of e1 element type 129 | ======================= 130 | 131 | For working with large bitvectors, having an element type of e1 would be helpful. Today, we have the masked arithmetic ops, but because they're expected to only work on masks, they can't be combined with LMUL to work on more than one vreg of data. 132 | 133 | Impact: minor, mostly a seeming inconsistency 134 | 135 | vlast.m variant 136 | =============== 137 | 138 | The extension has vfirst.m (and its variants), but not vlast.m (and its variants). I've been told the latter is sometimes useful, though I don't have a good motivating example as of yet. 139 | 140 | The other option here would be to support bitreverse on mask vectors. A bitreverse followed by a vfirst.m should be equivalent to a vlast.m - modulo register pressure and latency. 141 | 142 | In the currently available extension, probably the best option is to use CTZ in Zbb to emulate this for any case where we know VLMAX < ELEN. This is likely enough for fixed vectors as ELEN=64, etype=e8, would give VLEN=512 as the maximally supported size for this trick. By using a series of vslidedown, copy to gpr, and CTZs we could probably generate correct - if even slower with every ELEN-sized chunk - code for any fixed vector. 
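Another possible emulation, sketched here untested and distinct from the CTZ approach above: use vid.v plus a masked signed max reduction over an index vector, seeding with -1 so an empty mask mirrors the -1 result of vfirst.m::

    vsetvli x0, a1, e16, m2, ta, ma  # SEW wide enough to hold indices
    vid.v v8                         # v8[i] = i
    li a0, -1
    vmv.s.x v12, a0                  # reduction seed = -1
    vredmax.vs v12, v8, v12, v0.t    # signed max index over active mask bits
    vmv.x.s a0, v12                  # a0 = index of last set bit, or -1

This trades the scalar CTZ loop for a single reduction, at the cost of materializing the index vector.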
143 | 144 | 145 | 146 | 147 | -------------------------------------------------------------------------------- /riscv-spec-minutia.rst: -------------------------------------------------------------------------------- 1 | --------------------------- 2 | RISCV Specification Minutia 3 | --------------------------- 4 | 5 | This is a collection of minutia about the change history of the RISCV specifications which have come up on a couple of occasions. This document is (hopefully!) not relevant for typical users. 6 | 7 | .. contents:: 8 | 9 | Document Version 2.1 to 2.2 10 | --------------------------- 11 | 12 | Between document version `riscv-spec-v2.1.pdf `_ and document version `riscv-spec-v2.2.pdf `_, a backwards incompatible change was made to remove selected instructions and CSRs from the base ISA. These instructions were grouped into a set of new extensions, but were no longer required by the base ISA. 13 | 14 | Note: Document versus specification versioning is particularly confusing in this case as document 2.1 to 2.2 corresponds to base I specification 2.0 to 2.1. 15 | 16 | Zicsr and Zifencei 17 | ================== 18 | 19 | The part of this change relevant for Zicsr and Zifencei is described in “Preface to Document Version 20190608-Base-Ratified” from the specification document. All of the changes appear to have been made at once, and the new extensions were immediately ratified. 20 | 21 | zicntr 22 | ====== 23 | 24 | The wording which defined the RDCYCLE, RDTIME, and RDINSTRET counters was removed. They were re-added to the specification as Zicntr in `commit 77aff0 `_ in March 2022. 25 | 26 | That change includes the following verbiage: 27 | 28 | "NOTE: Counters and timers (now known as Zicntr and Zihpm) were frozen 29 | but not ratified in 2019, as they were removed from the base ISAs 30 | during the ratification process. Due to an oversight they were not
As they are required for the RVA20 and RVA22 32 | profiles, the proposal is to ratify these extensions in 2022 and 33 | retroactively add to the 2020 and 2022 profiles as an exception." 34 | 35 | The ratification status is unclear. Zicntr appears in the `current draft specification `_ without any indication it might be un-ratified, but late last year we a `call for public comment _`. I don't see any formal indication these have been fully ratified. The best summary of status I can find is `this issue on the profiles repo `_, but even that is not conclusive. 36 | 37 | There's an additional problem with the current specification text. No version number has yet been assigned. This is tracked as `an open issue against the specification `_. 38 | 39 | Zihpm 40 | ===== 41 | 42 | At a high level, Zihpm parallels Zicntr in that ratification status is unclear, and no version has been assigned. 43 | 44 | However, hpmcounter3–hpmcounter31 names do not appear to be present in older *unprivledged* specification documents. As such, Zihpm is merely a newly proposed extension as opposed to a backwards incompatible spec change. Note however that this appears to contradict the text added to the specification document quoted above. It was pointed out to me that they are mentioned in some of the *privledged* specification documents, but I have not tracked when they were added or tried to reconcile the history of the two specification documents. 45 | 46 | 47 | Can't we all just hold on a second? 48 | ----------------------------------- 49 | 50 | For one little instruction, Zihintpause (i.e. the PAUSE instruction) has been a real mess process wise. This section is largely based on research reported `here `_. 51 | 52 | The version number of the `zihintpause` extension **moved backwards** from 2.0 to 1.0 very shortly after being merged into the main repository. 
This is easy to write off as a minor issue, except that the `commit which moved the extension number backwards `_ also introduced the sentence "No architectural state is changed.". If you think about it a bit, this is absolutely absurd because the program counter is part of the architectural state. This effectively says the instruction must execute forever. Except, that also contradicts the wording which says the "duration of its effect must be bounded". So basically, 1.0 is (pedantically) unimplementable. 53 | 54 | In Aug 2021, the extension was ratified, and, a few hours later, the version number was increased again to 2.0. The wording discussed above remained. 55 | 56 | In `commit cb3b9d `_ (Dec 2022) the definition of the PAUSE instruction was again revised to remove the "No architectural state is changed." wording. This is great, and long overdue. However, the version number of the extension was *not* increased. So as a result, we have two versions of the extension text - both of which claim to be 2.0 - which are mutually incompatible. Arguably, this was a small enough matter that an erratum should suffice, but well, we don't have one of those either. 57 | 58 | As a practical matter, the consensus seems to be to basically ignore the matter. The prior text was unimplementable, and if you ignore that sentence, all of the known versions are substantially similar. As a result, the discrepancies in version can mostly be ignored, and we pretend that only the most recent 2.0 version ever existed. 59 | 60 | Zmmul vs M 61 | ---------- 62 | 63 | Discussion in the issue `Is Zmmul a subset of M? `_ appears to indicate that in a pedantic sense `Zmmul` is not a strict subset of `M`. Specifically, `M` allows some configurations which don't actually support multiplication at runtime, whereas `Zmmul` does not.
Given toolchains completely ignore this possibility to start with - seriously, don't tell your toolchain you have a multiply instruction if it's disabled at runtime - in any practical sense `Zmmul` does appear to be a subset of `M`. 64 | 65 | Redefinition of Vector Overlap (Nov 2022) 66 | ----------------------------------------- 67 | 68 | `This proposal `_ introduced a wording change which resulted in previously valid encodings becoming invalid. This was raised in the discussion, and actively rejected as being a compatibility concern. This change appears not to have been merged into the `specification repo `_ as of 2023-02-23. 69 | 70 | Whole Vector Register Move and VILL 71 | ----------------------------------- 72 | 73 | The 1.0 version of the vector specification says the following in section 3.4.4. "Vector Type Illegal vill": 74 | 75 | "If the vill bit is set, then any attempt to execute a vector instruction that depends upon vtype will raise an illegal-instruction exception. Note vset{i}vl{i} and whole-register loads, stores, *and moves* do not depend upon vtype." 76 | 77 | By version 20240411 of the combined unprivileged specification, this wording has been changed to: 78 | 79 | "If the vill bit is set, then any attempt to execute a vector instruction that depends upon vtype will raise an illegal-instruction exception. vset{i}vl{i} and whole register loads and stores do not depend upon vtype." 80 | 81 | Note that whole register moves have been *removed* from this note, and thus 82 | are now required to raise an illegal-instruction exception on vill. 83 | 84 | Both versions have the following note in section 31.16.6 "Whole Vector Register Move": 85 | 86 | "These instructions are intended to aid compilers to shuffle vector registers without needing to know or change vl or vtype."
87 | 88 | Given ``vill`` is a bit within the ``vtype`` CSR ("The vill bit is held in bit XLEN-1 of the CSR to support checking for illegal values with a branch on the sign bit."), it really seems like the new version of the specification contradicts itself. 89 | 90 | This change was introduced by: 91 | 92 | commit 856fe5bd1cb135c39258e6ca941bf234ae63e1b1 93 | 94 | Author: Andrew Waterman 95 | 96 | Date: Mon Apr 3 15:40:16 2023 -0700 97 | 98 | Delete non-normative claim that vmvr.v doesn't depend on vtype 99 | 100 | The normative text says that vmvr.v "operates as though EEW=SEW", 101 | meaning that it _does_ depend on vtype. 102 | 103 | The semantic difference becomes visible through vstart, since vstart is 104 | measured in elements; hence, how to set vstart on an interrupt, or 105 | interpret vstart upon resumption, depends on vtype. 106 | 107 | Personally, I think this was the wrong change to resolve the stated issue. 108 | The wording that changed was in a section specific to vill exception behavior, 109 | and adding a carve out that whole vector register move explicitly 110 | did not trap on vill (but otherwise did depend on vtype) would have been 111 | much more reasonable. 112 | -------------------------------------------------------------------------------- /riscv-tso-mappings.rst: -------------------------------------------------------------------------------- 1 | ----------------------------- 2 | TSO Memory Mappings for RISCV 3 | ----------------------------- 4 | 5 | This document lays out the memory mappings being proposed for Ztso. Specifically, these are the mappings from the C/C++ memory models to RISCV w/Ztso. This document is a placeholder until this can be merged into something more official. 6 | 7 | The proposed mapping tables are the work of Andrea Parri. He's an actual memory model expert; I'm just a compiler guy who has gotten a bit too close to these issues, and needs to drive upcoming patches for LLVM.
As always, mistakes are probably mine, and all credit goes to Andrea. 8 | 9 | .. contents:: 10 | 11 | Background 12 | ---------- 13 | 14 | RISCV uses the WMO memory model by default. This is described in chapter 17 ("RVWMO Memory Consistency Model, Version 2.0") of the Unprivileged specification. RISCV also supports the Total Store Order (TSO) model via an optional extension (Ztso). There is also a description of version 0.1 of Ztso in the last ratified Unprivileged specification; I am not aware of any major differences between them. 15 | 16 | Programming languages such as C/C++ and Java define their own memory models. One of the tasks in implementing such a language is choosing a mapping from the language level memory model to the hardware level memory model. For clarity's sake, it's worth emphasizing that many such mappings may be legal (that is, equally correct), but that for ABI compatibility, it is important that we designate exactly one such mapping as part of the ABI and use it across all toolchains whose results need to interoperate. Otherwise, you could end up creating a racy program by linking two object files which both correctly implement synchronization at the source level. Generally, that is considered bad. 17 | 18 | Aside: There is a related ABI issue around defining how atomics and ordering work when involving data types larger or smaller than what the hardware natively supports. These are currently implementation defined (but need to be specified eventually). This is explicitly out of scope for this document, and the mapping here does not apply to such types. 19 | 20 | The ABI designated mapping for WMO is defined in "Table A.6: Mappings from C/C++ primitives to RISC-V primitives" from the Unprivileged spec. Having this specified in the ISA specification is arguably a weird RISCV quirk; it should probably live in something like the psABI specification instead.
To my knowledge, there is not yet a designated mapping for Ztso, and that's what the rest of this document discusses. 21 | 22 | 23 | Proposed Mapping 24 | ---------------- 25 | 26 | The proposed mapping table is: 27 | 28 | .. code:: 29 | 30 | C/C++ Construct | RVTSO Mapping 31 | ------------------------------------------------------------------------------ 32 | Non-atomic load | l{b|h|w|d} 33 | atomic_load(memory_order_relaxed) | l{b|h|w|d} 34 | atomic_load(memory_order_acquire) | l{b|h|w|d} 35 | atomic_load(memory_order_seq_cst) | fence rw,rw ; l{b|h|w|d} 36 | ------------------------------------------------------------------------------ 37 | Non-atomic store | s{b|h|w|d} 38 | atomic_store(memory_order_relaxed) | s{b|h|w|d} 39 | atomic_store(memory_order_release) | s{b|h|w|d} 40 | atomic_store(memory_order_seq_cst) | s{b|h|w|d} 41 | ------------------------------------------------------------------------------ 42 | atomic_thread_fence(memory_order_acquire) | nop 43 | atomic_thread_fence(memory_order_release) | nop 44 | atomic_thread_fence(memory_order_acq_rel) | nop 45 | atomic_thread_fence(memory_order_seq_cst) | fence rw,rw 46 | ------------------------------------------------------------------------------ 47 | C/C++ Construct | RVTSO AMO Mapping 48 | atomic_<op>(memory_order_relaxed) | amo<op>.{w|d} 49 | atomic_<op>(memory_order_acquire) | amo<op>.{w|d} 50 | atomic_<op>(memory_order_release) | amo<op>.{w|d} 51 | atomic_<op>(memory_order_acq_rel) | amo<op>.{w|d} 52 | atomic_<op>(memory_order_seq_cst) | amo<op>.{w|d} 53 | ------------------------------------------------------------------------------ 54 | C/C++ Construct | RVTSO LR/SC Mapping 55 | atomic_<op>(memory_order_relaxed) | loop: lr.{w|d} ; <op> ; 56 | | sc.{w|d} ; bnez loop 57 | atomic_<op>(memory_order_acquire) | loop: lr.{w|d} ; <op> ; 58 | | sc.{w|d} ; bnez loop 59 | atomic_<op>(memory_order_release) | loop: lr.{w|d} ; <op> ; 60 | | sc.{w|d} ; bnez loop 61 | atomic_<op>(memory_order_acq_rel) | loop: lr.{w|d} ; <op> ; 62 | | sc.{w|d} ; bnez loop 63 |
atomic_<op>(memory_order_seq_cst) | loop: lr.{w|d}.aqrl ; <op> ; 64 | | sc.{w|d} ; bnez loop 65 | 66 | The key thing to note here is that we use a fence *before* any seq_cst *load*. There is an alternative mapping (discussed below) which uses a fence *after* an atomic *store*. The mapping shown here is the one I am proposing moving forward with. 67 | 68 | The alternate mapping 69 | --------------------- 70 | 71 | This mapping table is listed here for explanatory value only. This lowering is **incompatible** with the mapping proposed for inclusion in toolchains and the psABI (above). 72 | 73 | .. code:: 74 | 75 | C/C++ Construct | RVTSO Mapping 76 | ------------------------------------------------------------------------------ 77 | Non-atomic load | l{b|h|w|d} 78 | atomic_load(memory_order_relaxed) | l{b|h|w|d} 79 | atomic_load(memory_order_acquire) | l{b|h|w|d} 80 | atomic_load(memory_order_seq_cst) | l{b|h|w|d} 81 | ------------------------------------------------------------------------------ 82 | Non-atomic store | s{b|h|w|d} 83 | atomic_store(memory_order_relaxed) | s{b|h|w|d} 84 | atomic_store(memory_order_release) | s{b|h|w|d} 85 | atomic_store(memory_order_seq_cst) | s{b|h|w|d} ; fence rw,rw 86 | ------------------------------------------------------------------------------ 87 | atomic_thread_fence(memory_order_acquire) | nop 88 | atomic_thread_fence(memory_order_release) | nop 89 | atomic_thread_fence(memory_order_acq_rel) | nop 90 | atomic_thread_fence(memory_order_seq_cst) | fence rw,rw 91 | ------------------------------------------------------------------------------ 92 | C/C++ Construct | RVTSO AMO Mapping 93 | atomic_<op>(memory_order_relaxed) | amo<op>.{w|d} 94 | atomic_<op>(memory_order_acquire) | amo<op>.{w|d} 95 | atomic_<op>(memory_order_release) | amo<op>.{w|d} 96 | atomic_<op>(memory_order_acq_rel) | amo<op>.{w|d} 97 | atomic_<op>(memory_order_seq_cst) | amo<op>.{w|d} 98 | ------------------------------------------------------------------------------ 99 | C/C++ Construct | RVTSO LR/SC Mapping
100 | atomic_<op>(memory_order_relaxed) | loop: lr.{w|d} ; <op> ; 101 | | sc.{w|d} ; bnez loop 102 | atomic_<op>(memory_order_acquire) | loop: lr.{w|d} ; <op> ; 103 | | sc.{w|d} ; bnez loop 104 | atomic_<op>(memory_order_release) | loop: lr.{w|d} ; <op> ; 105 | | sc.{w|d} ; bnez loop 106 | atomic_<op>(memory_order_acq_rel) | loop: lr.{w|d} ; <op> ; 107 | | sc.{w|d} ; bnez loop 108 | atomic_<op>(memory_order_seq_cst) | loop: lr.{w|d} ; <op> ; 109 | | sc.{w|d}.aqrl ; bnez loop 110 | 111 | The key difference to note is that this lowering uses a fence *after* the sequentially consistent stores. 112 | 113 | Discussion 114 | ---------- 115 | 116 | So, why are we proposing the first mapping and not the alternative? This comes down to a benefit analysis. 117 | 118 | The proposed Ztso mapping was constructed to be a strict subset of the WMO mapping. Consider the case where we are running on a Ztso machine, but not all of our object files or libraries were compiled assuming Ztso. If the Ztso mapping is a subset of the WMO mapping, then all parts of this mixed application include the required fences for correctness on Ztso. Some libraries might have a bunch of redundant fences (i.e. all the ones needed by WMO but not needed for Ztso), but the application will behave correctly regardless. This allows libraries targeted for WMO to be reused on a Ztso machine with only selected performance sensitive pieces recompiled explicitly for Ztso. 119 | 120 | The alternative mapping instead parallels the mappings used by X86. Ztso is intended to parallel the X86 memory model, and it is desirable if explicitly fenced code ported from x86 just works with Ztso. Consider a developer who is doing a port of a library which is implemented using normal C intermixed with either inline assembly or intrinsic calls to generate fences. If that code follows the x86 convention, then a naive port will match the alternate mapping.
The key point is that code using the alternate mapping will not properly synchronize with code compiled with the proposed mapping. 121 | 122 | To avoid confusion, let me emphasize that the porting concern just mentioned *does not* apply to code written in terms of either C or C++'s explicit atomic APIs. Instead, it *only* applies to manually ported assembly or code which is already living dangerously by using explicit fencing around relaxed atomics. Such code is rare, and usually written by experts anyways. The slightly broader class of code which may be concerning is that with non-atomic loads and stores mixed with explicit fencing. Such code is already relying on undefined behavior in C/C++, but "probably works" on X86 today and might not after a naive RISCV port if synchronizing with code compiled with the proposed mapping. 123 | 124 | The alternative mapping also has the advantage that stores are generally dynamically rarer than loads. So the alternative mapping *may* result in dynamically fewer fence instructions. I do not have numbers on this. 125 | 126 | The choice between the two mappings essentially comes down to which of these we consider to be more important. I am proposing we move forward with the mapping which gives us WMO compatibility. It is my belief that allowing mixed applications is more important to the ecosystem than ease of porting explicit synchronization. 127 | -------------------------------------------------------------------------------- /riscv-vector-user-guide.rst: -------------------------------------------------------------------------------- 1 | ------------------------------- 2 | Using RISCV Vector Instructions 3 | ------------------------------- 4 | 5 | This is intended to be a user focused guide on how to leverage vector instructions on RISCV. Most of what I can find on the web is geared towards vector before v1.0 was ratified, and there are enough differences that having something to point people to has proven useful. 6 | 7 | ..
contents:: 8 | 9 | 10 | Execution Environments 11 | ---------------------- 12 | 13 | qemu-riscv32/64 support the v1.0 vector extension upstream. Note that the default packages available in most distros are not recent enough. You will need to download qemu and build from source. Thankfully, this is pretty straightforward, and `qemu's build instructions `_ are sufficiently up to date. 14 | 15 | Once you have a sufficiently recent qemu-riscv, you should be able to run binaries containing vector instructions. Note that vector is not enabled by qemu-user by default at this time, so you will need to explicitly enable it. If you get unhelpful error output when doing so, you are most likely using a version of qemu which is too old. 16 | 17 | With qemu-user, you can run and test programs in a cross build environment with one major gotcha. glibc does not have mcontext/ucontext support for vector, so anything which requires them - e.g. longjmp, signals, green threads, etc - will fail in interesting and unexpected ways. 18 | 19 | **WARNING**: At the moment, support for the vector extension has *NOT* landed in the upstream Linux kernel, and I am not aware of any distro which currently applies the required patches. So, unless you are running a custom kernel, there is a *very* good chance you can't run in a native environment. 20 | 21 | If you try, you will most likely get an illegal instruction exception (SIGILL) on the first vector instruction you execute. In many programs - though not all - this will look like a SIGILL on the first access to a vector CSR (e.g. `csrr a2, vlenb`) or a vector configuration instruction (e.g. `vsetvli a1, zero, e32, m1, ta, mu`). 22 | 23 | **Said differently, unless you're running a patched kernel, you can not enable vector code even if your hardware supports it!** 24 | 25 | 26 | Assembler Support 27 | ------------------ 28 | 29 | The LLVM assembler fully supports the v1.0 vector extension specification, and has for a while.
Using the 15.x release branch is definitely safe, and older release branches may also work. 30 | 31 | The binutils assembler used by GNU also supports the v1.0 vector extension since `the 2.38 release `_ (released Feb 2022). 32 | 33 | Compiler Support 34 | ---------------- 35 | 36 | Enabling Vector Codegen 37 | ======================= 38 | 39 | Vector code generation has been supported since (at least) LLVM 15.0. I've been told by gcc developers that upstream GNU does support vector code generation as of (at least) gcc 13. 40 | 41 | You need to make sure to explicitly tell the compiler (via `-mattr=...,+v`) to enable vector code lowering. If you don't, you may get compiler errors (i.e. use of vector types rejected), you may see code being scalarized by the backend, or (hopefully not) you may stumble across internal compiler errors. 42 | 43 | **Warning**: The flag mentioned above also has the effect of enabling auto-vectorization; if this is undesirable, consider `-fno-vectorize` and `-fno-slp-vectorize`. Vectorizer user documentation can be found `here `_. 44 | 45 | Intrinsics and Explicit Vector Code 46 | =================================== 47 | 48 | The `gcc vector extension syntax `_ is fully supported by both GCC and Clang. This is a good way of writing explicitly fixed length vector code in C/C++. 49 | 50 | The RVV C Intrinsic family is fully supported by Clang and GCC. This is currently the main way to write explicit length agnostic vector code. 51 | 52 | For clang, the `#pragma clang loop `_ directives can be used to override the vectorizer's default heuristics. This can be very useful for exploring the performance of various vectorization options. 53 | 54 | .. code:: 55 | 56 | // Let's force LMUL4 with an unroll factor of 2.
57 | #pragma clang loop vectorize(enable) 58 | #pragma clang loop interleave(enable) 59 | #pragma clang loop vectorize_width(8, scalable) 60 | #pragma clang loop interleave_count(2) 61 | for (unsigned i = 0; i < a_len; i++) 62 | a[i] += b; 63 | 64 | 65 | Auto-vectorization 66 | ================== 67 | 68 | I have been actively working to improve the state of auto-vectorization in LLVM for RISC-V. If you're curious about the details, see `my working notes `_. The following is a user focused summary of where we stand right now. This is an area of extremely heavy churn, so please keep in mind that this snapshot is very specific to the current moment in time, and is likely to continue changing. 69 | 70 | The LLVM 16 release branch contains all of the changes required for auto-vectorization via the LoopVectorizer with both scalable and fixed vectors when vector codegen is enabled (see above). 71 | 72 | For the SLPVectorizer, use of a very recent tip of tree is recommended. SLP has recently been enabled by default in trunk, and is on track to be enabled in the 17.x release series, but that's subject to change. If you're interested in this area, it is strongly recommended that you build LLVM from (very recent!) source. If you wish to enable SLP vectorization on the 16.x release branch for experimental purposes, you need to specify `-mllvm -riscv-v-slp-max-vf=0`. Note that this is an internal compiler option, and will not be supported. Any bugs found will only be fixed on tip-of-tree, and will not be backported. 73 | 74 | For gcc, patches to support auto-vectorization have recently started landing. There's very active development going on with multiple contributors, so the exact status is hard to track. Hopefully, the gcc-14 release notes will contain information about what is and is not supported. 75 | 76 | T-Head `has a custom toolchain `_ which may support vectorization as their processors include the v0.7 vector extension.
I have not confirmed this since a) all the documents are in Chinese, b) it requires an account to download, and c) I'm not interested in v0.7 anyways. 77 | 78 | If you wish to disable auto-vectorization for any reason, please consider `-fno-vectorize` and `-fno-slp-vectorize`. Vectorizer user documentation can be found `here `_. 79 | 80 | 81 | 82 | 83 | 84 | 85 | 86 | 87 | 88 | 89 | 90 | -------------------------------------------------------------------------------- /riscv/bp3-setup.rst: -------------------------------------------------------------------------------- 1 | ----------------------- 2 | Banana PI 3 Setup Notes 3 | ----------------------- 4 | 5 | This board was released in Q2 2024, and appears to be the first relatively widely available board with vector 1.0. The ISA is rva22u w/V and VLEN=256. This page contains my notes from trying to set one up as a development board. 6 | 7 | `BPi-F3 Datasheet `_, `Spacemit-K1 Datasheet `_ 8 | 9 | To purchase: `AliExpress `_, `Amazon `_ 10 | 11 | .. contents:: 12 | 13 | 14 | Download the OS Image 15 | --------------------- 16 | 17 | Directly from the vendor `here `_. Make sure you grab the SD card images. The easiest downloads are the google drive ones unless you speak Chinese. 18 | 19 | 20 | Format/Partition the SD Card 21 | ---------------------------- 22 | 23 | First, figure out which device corresponds to your SD card. The rest of this assumes it is `/dev/sda` -- make sure you change it for your environment! 24 | 25 | .. code:: 26 | 27 | lsblk 28 | 29 | Zero the entire SD card. Do not skip this step, or you will get weird hangs at boot time. 30 | 31 | .. code:: 32 | 33 | sudo dd if=/dev/zero of=/dev/sda bs=4096 status=progress 34 | 35 | Format the partition table, etc... (`For a lot more detail on this step `_.) 36 | 37 | ..
code:: 38 | 39 | sudo parted /dev/sda --script -- mklabel msdos 40 | sudo parted /dev/sda --script -- mkpart primary fat32 1MiB 100% 41 | sudo mkfs.vfat -F32 /dev/sda1 42 | sudo parted /dev/sda --script print 43 | 44 | Copy the contents of your img to the device. 45 | 46 | .. code:: 47 | 48 | sudo dd if=/home/preames/DevBoards/BananaPI/bianbu-23.10-desktop-k1-v1.0rc3-release-20240525133016.img of=/dev/sda status=progress bs=4M 49 | 50 | Use `gparted`, or the tool of your choice, to resize the final fat32 partition to fill the available space. If you skip this step, installing new packages will fail due to insufficient space. 51 | 52 | Boot It 53 | ------- 54 | 55 | I plugged in an HDMI monitor, keyboard, and mouse. Then I went through the initial setup to e.g. configure WiFi. After that, I switched to SSH login. 56 | 57 | If this step fails (i.e. hangs), go back and read the warnings in the zeroing and formatting sections again. 58 | 59 | Initial Setup 60 | ------------- 61 | 62 | Do the usual update: 63 | 64 | .. code:: 65 | 66 | sudo apt-get update 67 | sudo apt-get upgrade 68 | 69 | Install the usual packages: 70 | 71 | .. code:: 72 | 73 | sudo apt-get --assume-yes install emacs man-db libc6-dev dpkg-dev make build-essential binutils binutils-dev gcc g++ autoconf python3 git clang cmake patchutils ninja-build flex bison 74 | 75 | Getting `perf` working 76 | ---------------------- 77 | 78 | Do the following: 79 | 80 | .. code:: 81 | 82 | git clone https://github.com/BPI-SINOVOIP/pi-linux 83 | cd pi-linux 84 | uname --all 85 | # Checkout the right branch for your kernel version 86 | git checkout linux-6.1.15-k1 87 | pushd tools/perf/pmu-events 88 | ./jevents.py riscv arch pmu-events.c 89 | popd 90 | sudo apt install libelf-dev libdw-dev flex bison 91 | sudo make -C tools/ NO_LIBBPF=1 WERROR=0 prefix=/usr/local/ perf_install 92 | 93 | These instructions are inspired by `this blog post `_.
Note that I'm running on the `bianbu-23.10-desktop-k1-v1.0rc3-release-20240525133016` image, and that the default counter names appear to work for me in `perf stat`. The synthetic `instructions` and `cycles` events do not work with `perf record`. Reasonable proxies are `inst_issues` and `m_mode_cycles`. 94 | 95 | LLVM Native Build 96 | ----------------- 97 | 98 | You *can* do an LLVM native build on these, but it takes quite a while. The filesystem on the SD card is insanely slow, so just getting a git checkout in place took a while; starting from a zip file probably would have been a better idea. It might also have been a good idea to set up the MMC storage first. Due to memory limitations, you end up needing to build without parallelism (e.g. "ninja -j1"). Doing so takes a bit over 38 hours. I was able to build both the llvm 17 and llvm 18 release branches. 99 | 100 | I strongly recommend configuring to build only RISCV and the projects you actually need. Doing so greatly reduces the number of files built, and is the only thing which makes this vaguely practical. I built RISCV only, and only LLVM core + clang. 101 | 102 | Other References 103 | ---------------- 104 | 105 | https://dev.to/luzero/bringing-up-bpi-f3-part-1-3bm4 106 | -------------------------------------------------------------------------------- /riscv/isa-detection.rst: -------------------------------------------------------------------------------- 1 | ------------------------------- 2 | User Mode ISA Detection (RISCV) 3 | ------------------------------- 4 | 5 | .. contents:: 6 | 7 | 8 | AT_HWCAP 9 | -------- 10 | 11 | .. code:: 12 | 13 | #include <sys/auxv.h> 14 | #include <stdio.h> 15 | 16 | int main() { 17 | unsigned long hw_cap = getauxval(AT_HWCAP); 18 | for (int i = 0; i < 26; i++) { 19 | char Letter = 'A' + i; 20 | printf("%c %s\n", Letter, hw_cap & (1 << i) ?
"detected" : "not found"); 21 | } 22 | return 0; 23 | } 24 | 25 | Problems: 26 | 27 | * Only a few of the bits actually correspond to useful single letter extensions. The rest are wasted bits, and there's none for e.g. Zba. 28 | * Unclear specification. Exactly which *version* from which *specification document* does each bit correspond to? See `minutia `_ for some of the ambiguities. 29 | * V - Is this V1.0 or THeadVector? Custom kernels are known to report the latter as 'V'. 30 | 31 | riscv_hwprobe 32 | ------------- 33 | 34 | syscall 35 | `Documentation `_. See RISE's RISC-V Optimization Guide for `an example `_. As noted there, the syscall was added in 6.4. Attempting to use it on an earlier kernel will return ENOSYS. The syscall is a relatively cheap syscall, but you do have the transition overhead. Cost is probably something in the 1000s of instructions. 36 | 37 | vDSO 38 | Also added in 6.4 (so there is no kernel version with the syscall, but without the vDSO). Caches the key/value pairs for the intersection of the flags for all cpus. Will return results without a syscall if either a) all_cpus is queried (i.e. no cpu set given to the call) or b) the underlying system is homogeneous. The vDSO symbol name is `__vdso_riscv_hwprobe`. 39 | 40 | glibc 41 | The patch for the glibc wrapper has landed, but *is not yet released*. It is likely to be included in glibc 2.40 which is expected in Aug 2024, but should not be considered ABI stable until it is released. `glibc provides `_ two entry points `__riscv_hwprobe` and `__riscv_hwprobe_one` - an inline function specialized for the common single key case. Confusingly, the `__riscv_hwprobe` case inverts the error codes from the vDSO/kernel resulting in >0 being an error instead of <0. Thankfully, ==0 is success in all cases. 42 | 43 | bionic 44 | Appears to be the same status as glibc, with the change landed but not yet released for Android 15.
Note that bionic does not appear to implement the `__riscv_hwprobe_one` wrapper, but does appear to match the semantics of the `__riscv_hwprobe` case. 45 | 46 | qemu-user 47 | It is currently an open question whether the vDSO above is supported by qemu-riscv64. Initial testing seems to indicate no, but user error has not yet been disproven. 48 | -------------------------------------------------------------------------------- /riscv/whole-register-move-abi.rst: -------------------------------------------------------------------------------- 1 | ------------------------------------------------------ 2 | ABI Implication of vill and whole vector register move 3 | ------------------------------------------------------ 4 | 5 | Background 6 | ---------- 7 | 8 | The vector specification supports a whole register move instruction 9 | whose documented purpose is to enable register allocators to move 10 | vector register contents around without needing to track ``VL`` or 11 | ``VTYPE``. 12 | 13 | A `change was made to the specification `_ 14 | which requires hardware to report an illegal instruction exception 15 | if this instruction is executed with `vill` set. 16 | 17 | Existing SW implementations - in particular LLVM and GCC - were 18 | both implemented *before* this change was made, and implicitly 19 | assume the prior behavior. 20 | 21 | IMHO, an architecture which doesn't have an instruction to *just 22 | unconditionally copy a dang register* is broken at a level which 23 | just can't be saved; however, we will probably have to work around 24 | this in software regardless. 25 | 26 | See also discussion here: https://github.com/llvm/llvm-project/issues/114518 and https://gcc.gnu.org/bugzilla/show_bug.cgi?id=117544, and https://github.com/riscv-non-isa/riscv-elf-psabi-doc/pull/454. 27 | 28 | ABI Impact 29 | ---------- 30 | 31 | The current ABI says that ``VTYPE`` is *not* preserved across calls.
32 | Note that in particular, this means that `vill` can be set across 33 | any call boundary. 34 | 35 | Before the above change, that was fine because a whole register 36 | move did the sane thing regardless of vill. After the above change, 37 | this means that any use of a whole register move between a call 38 | boundary and the first dynamic vsetvli will potentially crash. 39 | 40 | In theory, this is a huge problem. However, up until recently 41 | the de facto usage of the ABI was such that a valid `VTYPE` usually 42 | did survive through call boundaries. 43 | 44 | Unfortunately, Linux kernel 6.5 included a `patch `_ which sets 45 | VILL=1 on all syscalls. This is legal per the documented ABI (see below), 46 | but has the effect of exposing the lurking problem. 47 | 48 | The other case where this can be triggered in practice is via a 49 | RISC-V vector intrinsic with a pass-thru operand called early in the 50 | execution of a program. Since we don't explicitly initialize `VTYPE` 51 | on the way to main, if we happen to use a whole register move to place 52 | the pass-thru value in the destination register (fairly common), we 53 | may insert the vsetvli *after* the whole register move, and end up with 54 | an exposed whole register move which can fault. This problem has 55 | been latent the whole time. 56 | 57 | Known Hardware 58 | -------------- 59 | 60 | Old Behavior (No Trap) 61 | 62 | * Spacemit-X60 on K1 63 | * C908 on K230 64 | * Likely (but unconfirmed) all other THead processors 65 | 66 | New Behavior (Trap) 67 | 68 | * Presumably SiFive, but no confirmed specific micro-architectures at this time 69 | 70 | Goals 71 | ----- 72 | 73 | Key Goals: 74 | 75 | * Acknowledge the existence and likely long-term prevalence of both trapping 76 | and non-trapping hardware. 77 | * Ensure that there is a correct-by-construction combination of compiler and 78 | libraries available for trapping hardware. 79 | * Not fork the ecosystem on this point.
Practically speaking, this requires 80 | that any mitigation doesn't impose a noticeable performance penalty when 81 | run on non-trapping hardware. 82 | * Having existing binaries continue to mostly work in practice is highly 83 | desirable, even if they were compiled without knowledge of any changes 84 | required to address this issue. 85 | 86 | Options 87 | ------- 88 | 89 | As of the 2024-11-21 LLVM RISCV Sync Up, the consensus appeared to be that 90 | we're going to pursue Option 2 - the compiler side changes. In offline 91 | discussion, it was revealed that some information was accidentally 92 | misreported, so this is still an open question. 93 | 94 | Option 1 - Change the ABI 95 | ========================= 96 | 97 | Since we're in the realm of making backwards-incompatible specification 98 | changes anyway, we can change the default calling convention in psABI 99 | in a couple of possible ways: 100 | 101 | * Require VTYPE to be non-vill on ABI boundaries. 102 | * Require VTYPE to be equally vill on ABI boundaries; that is, calls 103 | would have to preserve the single-bit state of whether vill was 104 | active. 105 | * Require VTYPE to be no more vill on return than on entry to the 106 | function. That is, a non-vill VTYPE on entry must be non-vill 107 | on exit, but a vill VTYPE on entry can become non-vill on exit. 108 | This would allow callees to unconditionally set VTYPE to any 109 | non-vill value. 110 | 111 | In all of these variants, VTYPE would remain otherwise unspecified and 112 | unpreserved. At the moment, variant three seems like the best option. 113 | 114 | This would require a kernel change, and we'd end up having to tell folks 115 | that vector code essentially didn't fully work on the unpatched kernel 116 | version. 117 | 118 | We'd also have to message and manage an ABI transition. Older binaries 119 | in the wild would not be guaranteed to work until recompiled.
Note that 120 | this is the same state we're in today, and have been for years, so this 121 | is a bigger problem in theory than in practice. 122 | 123 | This is my preferred option, but may be politically unpopular since 124 | it requires publicly admitting the retroactive change was actually 125 | a change. RVI has generally not wanted to do that in the past. 126 | 127 | It was pointed out that this requires VTYPE to be initialized during 128 | program startup, and that this would defeat the kernel's lazy state 129 | preservation optimization which avoids needing to spill vector state 130 | for programs which don't use vector - because vset* is a vector 131 | instruction. (Per Palmer, this may not be an issue depending on how 132 | a kernel change is handled. He's going to expand on a psABI issue 133 | filed in the near future - will link once available.) 134 | 135 | Option 2 - Enforce the ABI as written 136 | ===================================== 137 | 138 | This will require the compiler to insert a VSETVLI along any path 139 | from a call boundary to a whole register move. 140 | 141 | I can't speak for GCC, but for LLVM this is doable, if exceedingly 142 | ugly. It will likely result in otherwise spurious vsetvlis, and 143 | it's hard to say how much this matters performance-wise. 144 | 145 | We can do heroics particularly for LTO builds (internal ABIs with 146 | IPO and adapter frames anyone?), but it's hard to say if we can 147 | address all performance loss cases which arise in practice. 148 | 149 | As with the previous option, we would de facto have broken packages 150 | in the wild, and nothing would be guaranteed to work until all 151 | packages had been recompiled. Unlike the previous option, most 152 | of those packages wouldn't "just work" in practice.
153 | 154 | LLVM POC Patch: https://github.com/llvm/llvm-project/compare/main...preames:llvm-project:wip-vmv1r-depends-on-vtype-vill?expand=1 155 | 156 | Option 3 - Ignore it 157 | ==================== 158 | 159 | This is what we've been doing to date. 160 | 161 | Option 4 - Trap and Emulate 162 | =========================== 163 | 164 | We could have the kernel trap and emulate the instruction. This 165 | is arguably not crazy for a case where the specification changed. 166 | Since vsetvlis should be fairly common in vector code, this 167 | shouldn't be a hot trap case - unless someone is doing something 168 | weird like hot-looping around a syscall. 169 | 170 | This version basically represents treating the changed behavior 171 | as a SiFive erratum. Note that this will likely always disagree 172 | with the specification document. 173 | 174 | Option 4a - Change the Specification 175 | ==================================== 176 | 177 | Several folks have indicated a desire to reverse the change in 178 | the specification. I am sympathetic to this view, but don't 179 | believe such an effort to be politically viable. 180 | 181 | As an alternative, we might be able to propose a specification 182 | change (or maybe an extension?) which allows both the trapping 183 | and non-trapping behaviors. This wouldn't resolve any of the 184 | SW complexity mentioned above, but would at least mean that 185 | the vast majority of vector hardware on the planet wasn't 186 | retroactively considered "non-conformant". 187 | -------------------------------------------------------------------------------- /scev-exponential.rst: -------------------------------------------------------------------------------- 1 | .. header:: This is a collection of notes on a topic that may someday make it into a proposed extension to LLVM's SCEV, but that I have no plans to move forward with in the near future.
2 | 3 | ------------------------------------------------- 4 | SCEV, shifts, and M^N as a primitive 5 | ------------------------------------------------- 6 | 7 | .. contents:: 8 | 9 | Basic Idea 10 | =========== 11 | 12 | SCEV currently does not have a first class representation for bit operators (lshr, ashr, and shl). Particular special cases are represented with SCEVMulExpr and SCEVDivExpr, but in general, we fall back to wrapping the IR expression in a SCEVUnknown. 13 | 14 | It's interesting to note that all shifts can be represented as LHS*2^N or LHS/2^N. Unfortunately, we don't have a way to represent 2^N in SCEV for arbitrary N. 15 | 16 | We've also run into cases before where power functions specialized for fixed values of N arise naturally when canonicalizing SCEVs or handling unrolled loops involving products or shifts. As a result, commit 35b2a18eb96d0 added first class support in SCEVExpander for an optimized power emission. (Leaving the internal representation unchanged.) 17 | 18 | It's also interesting to note that a moderately wide family of non-add recurrences can be represented as an add recurrence multiplied by a power function. For instance, a mul recurrence which starts at y and multiplies by x on every iteration can, in my extended notation, be represented as y*x^<0,+,1>. The same idea can be extended to shl, ashr, and lshr, and some cases of udiv. 19 | 20 | Obvious Alternatives 21 | ==================== 22 | 23 | Don't represent at all 24 | ---------------------- 25 | 26 | This is simply the status quo, possibly with some enhancements to handle obvious cases during construction and range analysis, but without changing the SCEV expression language itself. 27 | 28 | Add explicit SCEVShlExpr, etc... 29 | --------------------------------- 30 | 31 | This improves our ability to reason about the expressions, but doesn't get us the ability to reason about recurrences unless we add explicit SCEVShlRecExpr expressions as well.
32 | 33 | 34 | 35 | 36 | 37 | 38 | 39 | 40 | 41 | 42 | -------------------------------------------------------------------------------- /talks-and-publications: -------------------------------------------------------------------------------- 1 | # Past Talks and Public Events 2 | 3 | An Update on Optimizing Multiple Exit Loops 4 | Philip Reames 5 | 2020 LLVM Virtual Developers Meeting 6 | 7 | Panel: Inter-procedural Optimization (IPO) 8 | Teresa Johnson, Philip Reames, Chandler Carruth, Johannes Doerfert 9 | 2019 LLVM Developers Meeting 10 | 11 | Panel: The Loop Optimization Working Group 12 | Kit Barton, Michael Kruse, Philip Reames, Hal Finkel 13 | 2019 LLVM Developers Meeting 14 | 15 | Falcon: An optimizing Java JIT 16 | Philip Reames 17 | Keynote 2017 LLVM Developers Meeting 18 | 19 | LLVM for a managed language: what we’ve learned 20 | Sanjoy Das and Philip Reames 21 | 2015 LLVM Developers Meeting 22 | 23 | Supporting Precise Relocating Garbage Collection in LLVM 24 | Sanjoy Das and Philip Reames 25 | 2014 LLVM Developers Meeting 26 | 27 | For all of the LLVM talks, search Google for the talk name + YouTube 28 | or navigate the historical conference pages for videos of the talks. 29 | 30 | # Publications 31 | 32 | For a list of publications, see the appropriate section on LinkedIn. 33 | https://www.linkedin.com/in/philip-reames/ 34 | 35 | # Professional Service 36 | 37 | Reviewer, 2020 LLVM Developers Conference 38 | 39 | ERC, ISMM 2020 40 | -------------------------------------------------------------------------------- /vectorization-reference.rst: -------------------------------------------------------------------------------- 1 | ----------------------------------- 2 | Vectorization - Ideas and Reference 3 | ----------------------------------- 4 | 5 | This page is intended as a reference on various vectorization related topics I find myself needing to explain/describe/reference on a somewhat regular basis. 6 | 7 | ..
contents:: 8 | 9 | 10 | Trip Count Required for Profitability 11 | ------------------------------------- 12 | 13 | For a given loop on a given piece of hardware, there is some minimum number of iterations required before a given vector loop is profitable over the equivalent scalar loop. This is highly dependent both on microarchitecture and vectorization strategy. 14 | 15 | On most hardware, the tradeoff point will be in the high single digits to low double digits of iterations. It is common for tradeoff points for floating point loops to be lower than for integer loops. This happens because modern OOO processors frequently have more integer than FP resources, and may even share the vector and FP resources. 16 | 17 | It may also be the case that vectorization is profitable for some trip count TC, but not for TC + 1. This happens due to the need to handle tail iterations, and the overheads involved. *Usually* if TC + 1 is not profitable, it's at least "close to" the scalar loop in performance, but exceptions do exist. 18 | 19 | 20 | Tail Handling 21 | ------------- 22 | 23 | Given a loop with a trip count (TC), and a desirable vectorization factor (VF), it is unlikely that TC % VF == 0. You could artificially limit vectorization to cases where this property holds, but in practice that disallows all loops with runtime trip counts, so we don't. The remainder iterations are often called the "tail iterations". 24 | 25 | We have a couple of major options here: 26 | 27 | * Use a scalar loop to handle the remainder elements. 28 | * Use a predicated straight-line sequence. 29 | * Fold the predication into the main loop body. 30 | * Use a loop with a smaller vector factor, possibly combined with the previous choices. 31 | 32 | The assumption for the remainder of this section is that we run TC - TC % VF iterations of the original loop in our newly inserted vector loop, and are figuring out how to execute the remaining TC % VF iterations.
33 | 34 | Note that the problem we're talking about here is a particular form of iteration set splitting, combined with some profitability heuristics. Remembering this point is helpful when reasoning about correctness. 35 | 36 | **Scalar Epilogue** 37 | 38 | The simplest technique is to simply keep the scalar loop, and branch to it with incoming induction variable values which reflect the iterations completed in the vector body. 39 | 40 | This technique is advantageous when the scalar loop is required for other reasons (e.g. as a runtime aliasing check fallback), or the number of remaining iterations is expected to be short. As such, it tends to be most useful when the chosen VL is relatively small. 41 | 42 | Note that all the usual loop optimization choices apply, but with the additional known fact that the new loop's trip count is < VF. This (for instance) may result in extremely different loop unrolling profitability. 43 | 44 | **Predicated (Non-Loop) Epilogue** 45 | 46 | You can generate a single peeled vector iteration with predication. This can be done with either software or hardware techniques (see below), but is usually only profitable with hardware support. 47 | 48 | This technique works well when the remainder is expected to be large relative to the original VF. 49 | 50 | **Tail Folded (Predicated) Main Loop** 51 | 52 | As with the previous item, but with the extra iteration folded into the main vector body. This involves computing a predicate on every iteration of the main loop, and predicating any required instructions. 53 | 54 | On most hardware I'm aware of, a tail folded loop will run slower than the plain vector loop without predication. So for very long trip counts, this can be non-optimal. The tradeoff is (usually) a massive reduction in code size. 55 | 56 | This choice can also hurt for short-running loops. If TC is significantly lower than VF, then a scalar loop might be significantly better.
57 | 58 | Note that in the case I'm describing here, all but the last iteration of the vector loop run the same number of iterations of the original loop. There's a related idea (called EVL vectorization in LLVM) where this property changes. 59 | 60 | **Epilogue Vectorization** 61 | 62 | After performing the iteration set splitting for the original loop using our chosen vector factor, we can choose some other small vectorization factor - call it VF2 - and vectorize the remainder loop. In principle, you can keep doing this recursively, but in practice, we tend to stop after the second. Since the second vector.body may still have tail iterations, you need to pick one of the other techniques here as a fallback. 63 | 64 | The benefit of this technique is that you have *multiple* vector bodies, each with independent VF, and can dispatch between them at runtime. 65 | 66 | The only case in which LLVM uses this technique is when VF is large compared to the expected TC. Specifically, some cases on AVX512 fall back to AVX2 loops. 67 | 68 | 69 | Forms of Predication 70 | -------------------- 71 | 72 | Predication is a general technique for "masking off" (that is, disabling) individual lanes or *otherwise avoiding faults in inactive lanes*. Note that depending on usage, the result of the operations on the inactive lanes may either be defined (usually preserved, but not always) or undefined. 73 | 74 | I tend to only be interested in the case where the result is undefined as that's the one which arises naturally in compilers. Our goal is basically to avoid faults on inactive lanes, and nothing else. 75 | 76 | There are both hardware and software predication techniques! 77 | 78 | The most common form of hardware predication is "mask predication" where you have a per-lane mask which indicates whether a particular lane is active or not. 79 | 80 | RISCV also supports "VL predication", "VSTART predication", and "LMUL predication". (These are my terms, not standardized.)
Each of them provides a way to deactivate some subset of the lanes. Out of them, only VL predication is likely to be performant in any way; please don't use the others. 81 | 82 | In software, the usual tactic is called "masking" or "masking off". The basic idea is that you conditionally select a known safe (but otherwise useless) value for the inactive lanes. For a divide, this might look like "udiv X, (select mask, Y, 1)". For a load, this might be "LD (select mask, P, SomeDereferenceableGlobal)". There is no masking technique for stores. 83 | 84 | There's also an entire family of related techniques for proving speculation safety (i.e. the absence of faults on inactive lanes *without* masking). This isn't predication per se, but comes up in all the same cases, and is (almost) always profitable over any form of predication. 85 | -------------------------------------------------------------------------------- /virtual-machines.rst: -------------------------------------------------------------------------------- 1 | Safepoints and Checkpoints are Yield Points 2 | -------------------------------------------- 3 | 4 | I recently read an `interesting survey of GC API design alternatives available for Rust `_, and it included a comment about the duality of async and garbage collection safepoints which got me thinking. 5 | 6 | Language runtimes are full of situations where one thread needs to ask another to perform some action on its behalf. Consider the following examples: 7 | 8 | * Assumption invalidation - When a speculative assumption has been violated, a runtime needs to evict all running threads from previously compiled code which makes the about-to-be-invalidated assumption. To do so, it must arrange for all other threads to remove the given piece of compiled code from its execution stack. 9 | * Locking protocols - It is common to optimize the case where only a single thread is interacting with a locked object.
In order for another thread to interact with the lock, it may need the currently owning thread to perform a compensation action on its behalf. 10 | * Garbage collection - The garbage collector needs the running mutator threads to scan their stacks and report a list of garbage collected objects which are rooted by the given thread. 11 | * Debuggers and profilers - Tools frequently need to stop a thread and ask it to report information about its thread context. Depending on the information required, this may be possible without (much) cooperation from the running thread, but the more involved the information, the more likely we need the queried thread to be in a known state or otherwise cooperate with the execution of the query. 12 | 13 | Interestingly, these are all forms of cooperative thread preemption (i.e. cooperative multi-threading). The currently running task is interrupted, another task is scheduled, and then the original task is rescheduled once the interrupting task is complete. (To avoid confusion, let's be explicit about the fact that it's semantically the *abstract machine thread* being interrupted and resumed. The physical execution may look quite different once abstract machine execution resumes for the interrupted thread.) 14 | 15 | Beyond these preemption examples, there are also a number of cases where a single thread needs to transition from one physical execution context to another. Conceptually, these transitions don't change the abstract machine state of the running thread, but we can consider them preemption points as well by modeling the transition code which switches from one physical execution context to another as being another conceptual thread. 16 | 17 | Consider the "on stack replacement" and "uncommon trap" or "side exit" transitions. The former transitions a thread from running in an interpreter to running compiled code; the latter does the inverse.
There's usually a non-trivial amount of VM code which runs between the two pieces of abstract machine execution to do e.g. argument marshalling and frame setup. We can consider there to be two conceptual threads: the abstract machine thread, and the "transition thread" which is trying to transition the code from one mode of execution to another. The abstract machine thread reaches a logical preemption point, transitions control to the transition thread, and then the transition thread returns control to the abstract machine thread (but running in another physical tier of execution.) 18 | 19 | It is worth highlighting that while this is cooperative preemption, it is not *generalized* cooperative preemption. That is, the code being transitioned to at a preemption point is not arbitrary. In fact, there are usually very strong semantic restrictions on what it can do. These restricted semantics allow an optimizing compiler to optimize the generated code around these potential preemption points in interesting ways. 20 | 21 | It is worth noting that (at least in theory) the abstract machine thread may have different sets of preemption points for each class of preempting thread. (Said differently, maybe not all of your lock protocol preemption points allow general deoptimization, or maybe your GC safepoints don't allow lock manipulation.) This is quite difficult to take advantage of in practice - mostly because maintaining any form of timeliness guarantee gets complicated if you have unbounded preempting tasks and don't have the ability to preempt them in turn - but at least in theory the flexibility exists. 22 | 23 | This observation raises some interesting possibilities for implementing safepoints and checkpoints in a compiler. There's a lot of work on compiling continuations and generators; I wonder if anyone has explored what falls out if we view a safepoint as just another form of yield point?
Thinking through how you might leverage CPS-style optimization tricks in such a model would be quite interesting. (This may have already been explored in the academic literature; CPS compilation isn't an area I follow closely.) 24 | 25 | 26 | Tier transitions as tail calls 27 | ------------------------------- 28 | 29 | When transitioning from one tier of execution to another, we need to replace one executing frame with another. For example, when deoptimizing from compiled code to the interpreter you need to replace the frame of the executing function with an interpreter frame and set up the interpreter state so that execution of the abstract machine can resume in the same place. 30 | 31 | (For purposes of this discussion, we're only considering the case where the frame being replaced is the last one on the stack. The general case can be handled in multiple ways, but is beyond the scope of this note.) 32 | 33 | Worth noting is that this frame replacement is simply a form of guaranteed tail call. The source frame is essentially making a tail call to another function with the abstract machine state as arguments. (For clarity, the functions here are *physical* functions, not functions in the abstract machine language.) This observation is mostly useful from a design-for-testability perspective and, potentially, code reuse. If the abstract machine includes tail calls (either guaranteed or optional), the same logic can be used to implement both. 34 | 35 | You could generate a unique runtime stub per call site layout and abstract machine state signature pair. If you're using a compiler toolchain which supports guaranteed tail calls - like LLVM - generating such a stub is fairly trivial. (Note: Historically, many VMs hand-rolled assembly for such cases. Don't do this!) 36 | 37 | If you start down this path, you frequently find you have a tradeoff to make between number of stubs (e.g. cold code size), allowed call site layouts (e.g.
performance of compiled code), and distinct abstract machine layouts (e.g. your interpreter and abstract language designs). A common technique which is used to sidestep this tradeoff is to allow the compiler to describe the callsite layout (e.g. which arguments are where) in a side table which is interpreted at runtime, as opposed to a function signature interpreted at stub generation time. 38 | 39 | Let's explore what that looks like. 40 | 41 | Information about the values making up the abstract machine state is recorded in a side table. The key bit is that minimal constraints are placed on where the values are; as long as the side table can describe it, that's a valid placement. This may sound analogous to debug information, but there's an important semantic distinction. Unlike debug info which often has a "best effort" semantic, this is *required* to be preserved for correctness. 42 | 43 | At runtime, there's a piece of code which copies values from wherever they might have been in the executing code to the desired target location. This often ends up taking the form of a small interpreter for the domain specific language (DSL) the side table can describe. 44 | 45 | In LLVM, the information about the callsite is represented via the "deopt" bundle on the call. During lowering, values in this bundle are allowed to be assigned freely provided the location is expressible in the stackmap section which is generated into the object file. LLVM does not provide a runtime mechanism for interpreting a stackmap section; that's left as an exercise for the caller. Note that the name "stackmap" comes from the fact that garbage collection information is stored in the same table. This is an artifact of the implementation. 46 | 47 | A couple of observations on the scheme above. 48 | 49 | * You'll note that we're replacing a stub containing assembly optimized for a *particular* source/target layout pair with a generic interpreter. This is obviously slower.
That's okay for our use case as deoptimization is assumed to be rare. As a result, smaller code size is worth the potential slowdown on the deoptimization path. 50 | * There's a data size impact for the side table. In general, these tend to be highly compressible, and are rarely accessed. Given that, it's common to use custom binary formats which fit the details of the host virtual machine. There are some interesting possibilities for using generic compression techniques, but to my knowledge, this has not been explored. 51 | * There's a design space worth exploring around the expressibility of the side table. The more general such a table is - and thus the more complicated our runtime interpreter is - the less impact we have on the code layout for our compiled code. In practice, representing constants, registers, and stack locations appears to get most of the low-hanging fruit, but this could in principle be pushed quite far. 52 | * It's worth noting that this is a *generic* mechanism to perform *any* (possibly tail) call. I'm not aware of this being used outside of virtual machine implementations, but in theory, a sufficiently advanced compiler could use this for any sufficiently cold call site. 53 | * This may go without saying, but it's important to have a mechanism for forcing deoptimization from your source language for testing purposes. (e.g. being able to express a "deopt here!" command in a test.) Corner cases in deoptimization mechanisms tend to be hard to debug since they are by definition rare. You really want both a way to write test cases, and (ideally) a way to fuzz. 54 | 55 | --------------------------------------------------------------------------------