├── LICENSE ├── README.md ├── compiler-correctness.rst ├── covid-notes.rst ├── defining-capture.rst ├── defining-escape.rst ├── deref+nofree.rst ├── element-wise-atomic-proposal ├── escape-by-example.rst ├── falcon-compiler.rst ├── ismm2020.rst ├── linux-perf-jit.rst ├── llvm-board-2020.rst ├── llvm-debugging-basics.rst ├── llvm-gc-retrospective-2019 ├── llvm-loop-opt-ideas.rst ├── llvm-norms.rst ├── llvm-riscv-fuzzing.rst ├── llvm-riscv-status.rst ├── llvm-riscv ├── README.rst ├── ScalarCodeGen.rst ├── VectorCodeGen.rst ├── Vectorization.rst ├── memset.ll ├── scalar-branch-opt.ll ├── tsvc │ ├── 16.x-march=rv64gcv_zvl128b │ │ ├── diff │ │ ├── note │ │ └── stdout │ ├── 17.x-march=rv64gcv_zvl128b │ │ └── stdout │ ├── 18.x-march=rv64gcv_zvl128b │ │ └── stdout │ └── note └── vector-codegen.ll ├── llvm-shuffles.rst ├── multiple-exit-vectorization.rst ├── observable-allocations.rst ├── optimization-tricks.txt ├── pointer-provenance.rst ├── project-ideas.rst ├── recurrences.rst ├── riscv-attribute-validation.rst ├── riscv-ecosystem.rst ├── riscv-microarch.rst ├── riscv-notes.rst ├── riscv-spec-minutia.rst ├── riscv-tso-mappings.rst ├── riscv-vector-user-guide.rst ├── riscv ├── bp3-setup.rst ├── isa-detection.rst ├── vector-idioms.rst └── whole-register-move-abi.rst ├── scev-exponential.rst ├── talks-and-publications ├── unintended-instructions.rst ├── vectorization-reference.rst └── virtual-machines.rst /LICENSE: -------------------------------------------------------------------------------- 1 | BSD 2-Clause License 2 | 3 | Copyright (c) 2019, Philip Reames 4 | All rights reserved. 5 | 6 | Redistribution and use in source and binary forms, with or without 7 | modification, are permitted provided that the following conditions are met: 8 | 9 | 1. Redistributions of source code must retain the above copyright notice, this 10 | list of conditions and the following disclaimer. 11 | 12 | 2. 
Redistributions in binary form must reproduce the above copyright notice, 13 | this list of conditions and the following disclaimer in the documentation 14 | and/or other materials provided with the distribution. 15 | 16 | THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS" 17 | AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE 18 | IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE 19 | DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT HOLDER OR CONTRIBUTORS BE LIABLE 20 | FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL 21 | DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR 22 | SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER 23 | CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, 24 | OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE 25 | OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. 26 | -------------------------------------------------------------------------------- /README.md: -------------------------------------------------------------------------------- 1 | # public-notes 2 | A collection of (public) notes on assorted topics. If there's enough on a single topic, I've moved them into individual pages; otherwise, individual notes are listed below. 3 | 4 | # A Line Crossed 5 | Jan 7, 2021 - I write this today having witnessed yesterday's armed insurrection in the nation's capital. Armed rioters, at the direct instigation of our sitting President, invaded the capital of our country, threatened the safety of our elected officials, and attempted to overthrow the results of a valid election. 6 | 7 | This **can not** be allowed to stand. 8 | 9 | Every single person who participated in yesterday's shame upon our country must be identified, arrested, and placed on trial. To do otherwise is to admit that we are no longer a democracy.
Some may be convicted of minor crimes. Others should be charged with armed insurrection against the state, convicted, and spend their remaining years in prison. 10 | 11 | I join many others in calling for the immediate removal of President Trump from office. We must not wait until his term expires. Congress should impeach immediately. The Cabinet should exercise the 25th amendment. This is not a question of which, it's a question of which can be concluded sooner. He literally just attempted a coup. Leaving him in office for the remainder of his term is dangerous in the extreme. 12 | 13 | # 2020 Highlights 14 | Dec 19, 2020 - As overall terrible as this year has been, I wanted to highlight a few things which give me hope for the future. 15 | 16 | SpaceX Starlink has entered widespread beta and appears to be delivering very reasonable latency (30 ms) and bandwidth (100 Mbps down, tens of Mbps up). This is a massive technical accomplishment (antenna design, satellite launch and management, etc.) and could be world changing. Between this and the widespread availability of solar power + battery backup + cheap generators, a lot of previously remote locations are about to enter the internet age. That's wonderful for equity, but also has profound implications for society which are hard to predict. 17 | 18 | Between solar power and natural gas, the US is rapidly transitioning away from coal power. We've reached the point that this is not a policy decision; it's an economic one. Solar power and natural gas are just cheaper than coal. Solar is now cheaper on a per MWh basis than any other generation source - by a noticeable margin. Even with the cost of natural gas plants to provide power in off times, or balance the grid, coal still can't compete. Given how absolutely terrible coal is for air pollution (and thus public health), the environment, and the climate, this transition is a very good thing.
Better yet, because it's being driven by economics, not policy, there's every reason to believe this will continue and, if anything, accelerate over the next few years. 19 | 20 | Despite the headlines, the US election system worked. Election day went smoothly, votes were counted and certified, and there was no actual disruption or ambiguity about the outcome. We did have an outgoing president who tried to create drama and far too many Republican politicians who supported him for far too long, but the system itself worked. Leading into election day itself, it was not at all clear this was going to be the case (regardless of outcome). 21 | 22 | The development of vaccines for COVID has been an amazing scientific success. Our previous record for vaccine development was 4 years; this time we did it in 10 months. It's hard to overstate how positive that is, both for the current pandemic, and for the future. This won't be our last pandemic, and the technology developed now (and over the last few decades) is going to be world changing going forward. 23 | 24 | # As Posted to Facebook and Twitter, Election Day 2020 25 | Nov 3rd, 2020 -- Today is election day. I have two requests of everyone reading this. 26 | 27 | First, if you are eligible to and haven't already, vote. Voting is one of our most important civic duties as citizens. Make the time, and go vote. Personally, I strongly prefer you voted to remove Trump and all of his supporters from office as I believe he has demonstrated himself grossly unfit to hold the office, but even if you disagree with me, please vote. 28 | 29 | Second, expect and demand respect for the electoral process itself over the next few days. Our system has its problems, but it's also been one of the longest running democracies in this world. An absolutely critical component of that is the graceful transition of power, and the ability of the losing party to accept the results.
Up until recently, the US has been blessed with the ability to simply presume such, but this year, things look concerning. Regardless of who you vote for, please be ready to sit back, wait for the votes to be counted, demand that all votes be counted, and accept the winner of the election. You don't have to like the winners, but it is critical that we all commit to accept them. 30 | 31 | Those are my two key asks. Now, let me explain why I personally feel Trump is unfit for office. It doesn't come down to policy - though I disagree with many. Instead, it comes down to the fact that he has repeatedly endangered the very bedrock of our democratic system. 32 | 33 | He has publicly said he may not accept the election results; that is dangerous, extremely so. Refusal to accept election results and commit to peaceful transitions of power is how democracies fall. And no, I'm really not exaggerating here. 34 | 35 | Earlier this year, he signaled an intention to send in the military to suppress political protests. It was only the very public rejections of military involvement in civilian affairs by senior leaders which restrained him. If you haven't read the letters from the Joint Chiefs of Staff, General Mattis, or watched the press conference given by the Secretary of Defense that week, please do. Each was a historic document, and one I never want to see again. 36 | 37 | These are the two single largest incidents, but the last four years have been filled with smaller ones. I sincerely believe that electing Trump for another term is dangerous for the very existence of our country. 38 | 39 | Having said that, I also commit to respecting the results of the election if he does win. As dangerous as I believe his election to be, refusing to accept the valid result of the electoral process would be even more so. 
40 | -------------------------------------------------------------------------------- /compiler-correctness.rst: -------------------------------------------------------------------------------- 1 | 2 | Interesting Partial Correctness Oracles - Primarily geared at using fuzzers to 3 | find compiler bugs, but also useful for "compile the world" style testing and 4 | general metric monitoring across a large corpus of applications. 5 | 6 | Assertion Builds - Duh. 7 | 8 | Sanitizer Builds (ASAN, UBSAN, etc...) - Also duh, these days. 9 | 10 | Hash Intermediate IRs to Detect Canonicalization Problems -- If the compiler has 11 | passes which don't agree on the canonical form, this can result in alternation between 12 | two IR states as the optimizer runs. For any compiler with a serializable IR, 13 | this can be detected by logging the IR after each pass and looking for duplicates 14 | via hashing. (You could also use a more expensive comparison function such as llvm-diff, 15 | but the extra time is probably not worth it.) 16 | 17 | Does the resulting program run? -- A useful way to detect many miscompiles even when 18 | the exact semantics of the program aren't known. Does require the ability to generate fault free 19 | (by construction) programs. 20 | 21 | Does the program terminate? -- The obvious problem is not knowing if the program is 22 | supposed to terminate. In practice, this can be really useful in finding algorithmic 23 | corner cases in the compiler. Filtering non-terminating (before timeout) examples 24 | by whether there's a dominant stack trace for the whole execution and whether the 25 | behavior is a recent regression surfaces mostly real problems. 26 | 27 | Does the program produce the same result on two compilers or *two versions* of 28 | the same compiler? -- Very *very* effective at finding miscompile regressions 29 | and subtle long-standing miscompiles.
Requires either a) the ability to generate 30 | well defined source programs, b) a sanitizer-like tool to make any program 31 | well defined in practice, or c) a very good scoring/filtering mechanism. 32 | (a) or (b) are the strongly preferred approaches. 33 | 34 | Maximum code expansion - A desirable property in a compiler is to not produce 35 | output sizes which are exponential in the input size. A fuzzer can be a useful 36 | way of finding such corner cases, though I'm not aware of anyone who has done 37 | so to date. 38 | 39 | 40 | Performance fuzzing 41 | 42 | One area which I think is very under-explored is using fuzzers to find regressions 43 | in optimization effectiveness or missing optimizations in compilers. If you have 44 | a means of generating well defined programs and two compilers (or two compiler 45 | *versions*) you can simply run the same program compiled with both, and compare 46 | the resulting execution times. 47 | 48 | A couple of important points on making this work in practice. 49 | 50 | 1) Detecting a performance difference on a fuzzer test case will require *a lot* 51 | of care about statistics. This problem is absolutely begging for accidental 52 | "p-hacking" and that needs to be a first class part of the design of any 53 | practical system. 54 | 55 | 2) I'd expect such a system to have no trouble finding missed optimizations 56 | and regressions. (Compilers are full of them.) I suspect the hard problem 57 | would be prioritizing which ones matter, grouping them, and tying them to 58 | other performance reports. This problem is approachable for regressions, but 59 | doing it for generic missed optimizations is not a problem I know how to approach. 60 | 61 | 62 | 63 | Other random notes around fuzzing 64 | 65 | Empirically, finding regressions requires about a million unique test executions. 66 | No idea why, but most regressions fall out somewhere between a million and two 67 | million unique test executions.
(With a naive fuzzer, nothing fancy. May differ 68 | with better fuzzer technology.) Assuming an average of 0.5 seconds per test (i.e. 69 | a fairly slow fuzzer), that's only ~280 CPU hours at the high end. (i.e. one 32 70 | core machine running for about 10 hours - or $5 at recent AWS spot prices.) At this 71 | point, there is no economic excuse not to fuzz heavily. 72 | 73 | Empirically, finding long-standing subtle miscompiles takes quite a bit longer. 74 | One "interesting" (i.e. nasty, been latent for years, "how did we never see this?" 75 | bug) seems to fall out about once per 150 million unique test inputs. 76 | 77 | The empirical statements above apply to fuzzing LLVM indirectly through Azul's 78 | Falcon JIT using https://github.com/AzulSystems/JavaFuzzer. Note that this 79 | fuzzer is not coverage based and doesn't play any other "fun tricks". 80 | 81 | 82 | -------------------------------------------------------------------------------- /defining-capture.rst: -------------------------------------------------------------------------------- 1 | 2 | .. header:: This is a DRAFT of an RFC I'm considering sending to llvm-dev. The current status is more an attempt to organize thoughts than an actual proposal. 3 | 4 | ------------------------------------------------- 5 | Defining Capture 6 | ------------------------------------------------- 7 | 8 | TLDR: ... 9 | 10 | .. contents:: 11 | 12 | Terminology 13 | ------------ 14 | As with any property, there are both may and must variations. Unless explicitly stated otherwise, we assume henceforth that "captured" means "may be captured" or "potentially captured", and that "nocapture" means "definitely not captured."
15 | 16 | Candidate Definition 17 | --------------------- 18 | 19 | A captured object is one whose contents or address can be observed by an external party which controls the implementation of externally defined functions, and can call back into the module through any externally exposed entry point (potentially concurrently). The external party is restricted from "guessing" the addresses of uncaptured objects. Once captured, an object remains captured indefinitely. 20 | 21 | Some specific examples of captured objects: 22 | 23 | * A global variable with a linkage other than private is captured. 24 | * An object passed to an external function as an argument is captured. 25 | * An object returned by a function with non-private linkage is captured. 26 | * A memory object reachable from another captured object is captured. 27 | * A memory object which was *previously* reachable (even if transiently so) from a captured object remains captured. 28 | * A memory object whose address could be propagated to a captured location is captured if there exists a non-private function which when invoked could perform said propagation. (Remember, our external party is both adversarial and running concurrently, so it can arrange perfect timing attacks as needed.) 29 | 30 | Corollaries 31 | ----------- 32 | 33 | An object which is eventually captured (i.e. visible to our external observer) may not yet have been captured at a particular program point. We say that such objects are "nocapture before Ctx" where "Ctx" is the program point being discussed. For instance, all static allocas start as uncaptured, and while the allocation may be eventually captured, that doesn't change the fact that the object was nocapture before that point. 34 | 35 | A capturing operation is one which exposes a previously uncaptured object to our external observer.
36 | 37 | An object is uncaptured in a particular scope if the object was not previously captured before that scope, and no action performed within the scope is a capturing operation. In particular, note that there's nothing preventing an enclosing scope from capturing the object provided that the capturing operation occurs strictly after the end of our inner scope. 38 | 39 | FOR DISCUSSION: The last point differs from our current implementation. We'd consider an object captured in the current scope if returned. We could phrase this as simple conservatism, but is there something deeper we're missing? 40 | 41 | Exploratory Examples 42 | -------------------- 43 | 44 | Let's start with a trivial example: 45 | 46 | .. code:: c++ 47 | 48 | void foo() { 49 | X* o = new X(); 50 | o->f = 5; 51 | delete o; 52 | } 53 | 54 | Object o is nocapture both globally and in the scope of foo. 55 | 56 | Leaking the object doesn't change that. 57 | 58 | .. code:: c++ 59 | 60 | void foo() { 61 | X* o = new X(); 62 | o->f = 5; 63 | } 64 | 65 | Adding a self-referential cycle doesn't change that. 66 | 67 | .. code:: c++ 68 | 69 | void foo() { 70 | X* o = new X(); 71 | o->f = o; 72 | delete o; 73 | } 74 | 75 | Scopes 76 | ======= 77 | 78 | If we return an object, that object is captured if the function is not private. So 79 | 80 | .. code:: c++ 81 | 82 | private_linkage X* wrap_alloc() { 83 | return new X(); 84 | } 85 | 86 | doesn't capture X, but 87 | 88 | .. code:: c++ 89 | 90 | X* wrap_alloc() { 91 | return new X(); 92 | } 93 | 94 | does. Note that in both cases, the allocation is nocapture within the scope of wrap_alloc. 95 | 96 | .. code:: c++ 97 | 98 | private_linkage X* wrap_alloc() { 99 | return new X(); 100 | } 101 | void foo() { 102 | X* o = wrap_alloc(); 103 | o->f = 5; 104 | delete o; 105 | } 106 | 107 | In this example, the allocation is uncaptured globally, and in both functions.
108 | 109 | Object Graphs 110 | ============= 111 | 112 | Moving on, let's consider connected object graphs. 113 | 114 | .. code:: c++ 115 | 116 | void foo() { 117 | X* o1 = new X(); 118 | X* o2 = new X(); 119 | o1->f = o2; 120 | o2->f = o1; 121 | } 122 | 123 | In this example, both o1 and o2 are nocapture. 124 | 125 | If any object is observable, then all objects reachable through that object are captured. 126 | 127 | .. code:: c++ 128 | 129 | X* foo() { 130 | X* o1 = new X(); 131 | X* o2 = new X(); 132 | o1->f = o2; 133 | o2->f = o1; 134 | return o1; 135 | } 136 | 137 | 138 | 139 | Transient Captures 140 | ================== 141 | 142 | .. code:: c++ 143 | 144 | private_linkage int X; 145 | int* Y; 146 | 147 | void oops() { 148 | Y = &X; 149 | Y = nullptr; 150 | } 151 | 152 | In this example, both X and Y are captured. Our external observer can arrange for oops to execute (since it's an external function) and read the address of X between the two writes. 153 | 154 | This does nicely highlight that the optimizer can refine this program from one which captures X into one which doesn't by running dead store elimination. As such, it's important to note that capture statements apply to the program at a moment in time. 155 | 156 | Capture vs Lifetime 157 | =================== 158 | 159 | .. code:: c++ 160 | 161 | int* Y; 162 | 163 | void foo() { 164 | Y = new X(); 165 | free(Y); 166 | Y = nullptr; 167 | } 168 | 169 | In this example, Y has been captured. Critically, the memory object associated with the particular instance of X remains captured even once deallocated. While the contents of said object are no longer defined, the address thereof continues to exist and may be validly used. 170 | 171 | It's worth highlighting one counterintuitive implication. If our adversarial observer calls this routine twice, a reasonable memory allocator may reuse the same physical memory for both instances of X.
This does not change the fact that conceptually these are two distinct memory objects. Immediately before the store to Y on the second invocation, the first object may be captured (and deallocated) while the second one is not yet captured. Even though they share the same address. 172 | 173 | FOR DISCUSSION - I think this implies we need to tweak the definition slightly. In particular, I think we need to incorporate something which references the based-on rules to make access through the first copy UB, or we seem to have captured both (since per the proposed definition the address captures.) 174 | 175 | (This discussion is not meant to be authoritative on explaining the semantics of deallocation; for details, see the relevant section of langref.) 176 | 177 | 178 | Draft LangRef Text 179 | ------------------ 180 | 181 | nocapture argument attribute 182 | ============================ 183 | 184 | If we have a pointer to an object which has not yet been captured passed to a nocapture argument of a function, we know that the callee will not perform a capturing operation on this argument. Note that this only restricts operations by the callee performed on this argument. If a separate copy of the pointer is passed through another argument or through memory, the callee may capture that copy or store it aside in an unknown location. 185 | 186 | In addition to the capture fact just stated, a nocapture argument attribute also provides an additional "trackability" fact. If before the call, the caller is aware of all copies of a pointer, and all copies of the pointer passed to the callee are passed through nocapture arguments, then after the call, the caller can assume that no new copies of the pointer have been created. (Even if those copies are in uncaptured locations.) 187 | 188 | Note that this definition says nothing about what the callee might do if the object was already captured before the call.
189 | 190 | nofree function attribute 191 | ========================= 192 | 193 | TODO: Wording here is incompatible with the global capture definition. Need something finer grained - maybe escape? 194 | 195 | From langref: "As a result, uncaptured pointers that are known to be dereferenceable prior to a call to a function with the nofree attribute are still known to be dereferenceable after the call (the capturing condition is necessary in environments where the function might communicate the pointer to another thread which then deallocates the memory)." 196 | 197 | The problem with this is that an uncaptured copy in a private global variable still allows another thread to free it. 198 | -------------------------------------------------------------------------------- /defining-escape.rst: -------------------------------------------------------------------------------- 1 | 2 | .. header:: This is a snapshot of a previous version of the defining-capture.rst file. As noted at the bottom, this is an *incorrect* definition of capture, but the text seemed potentially useful, so I saved it for later. 3 | 4 | ------------------------------------------------- 5 | Defining Capture 6 | ------------------------------------------------- 7 | 8 | TLDR: ... 9 | 10 | .. contents:: 11 | 12 | Terminology 13 | ------------ 14 | As with any property, there are both may and must variations. Unless explicitly stated otherwise, we assume henceforth that "captured" means "may be captured" or "potentially captured", and that "nocapture" means "definitely not captured." 15 | 16 | Candidate Definition 17 | --------------------- 18 | An object is said to be captured in all scopes in which its contents are observable. It is said to be nocapture in all outer scopes in which its contents cannot be observed, whether in that scope or any outer scope thereof.
19 | 20 | There's a couple of important points to this definition: 21 | 22 | * **Scope** -- The only scopes currently defined in LLVM IR are function scopes. As such, all capture statements are implicitly attached to some function. 23 | * **Observation** -- This aspect allows stores to locations which are never read, and other uses which would appear to capture the pointer, so long as the result is unobserved. This prevents otherwise well defined transforms such as DSE from refining a nocapture object into a captured one. 24 | * **Refinement** -- As with many other properties in LLVM IR, well defined transforms can refine a program into one with fewer legal behaviors. The intention of the definition is to allow refinement from captured to nocapture, but not the other way around. 25 | 26 | Exploratory Examples 27 | -------------------- 28 | 29 | Let's start with a trivial example: 30 | 31 | .. code:: c++ 32 | 33 | void foo() { 34 | X* o = new X(); 35 | o->f = 5; 36 | delete o; 37 | } 38 | 39 | Object o is nocapture in the scope of foo. 40 | 41 | Leaking the object doesn't change that. 42 | 43 | .. code:: c++ 44 | 45 | void foo() { 46 | X* o = new X(); 47 | o->f = 5; 48 | } 49 | 50 | Adding a self-referential cycle doesn't change that. 51 | 52 | .. code:: c++ 53 | 54 | void foo() { 55 | X* o = new X(); 56 | o->f = o; 57 | delete o; 58 | } 59 | 60 | As a notational convenience, further examples are listed without an explicit deletion to emphasize that the scope is tied to the last observation, not allocation or deletion. It is worth noting that it follows from the definition of deletion in most languages that there can be no (defined) observations past deletion. 61 | 62 | Scopes 63 | ======= 64 | 65 | Next, let's consider an example which introduces multiple scopes: 66 | 67 | ..
code:: c++ 68 | 69 | X* wrap_alloc() { 70 | return new X(); 71 | } 72 | void foo() { 73 | X* o = wrap_alloc(); 74 | o->f = 5; 75 | delete o; 76 | } 77 | 78 | In this example, the allocation is captured in both foo and wrap_alloc, but for different reasons. For wrap_alloc, the pointer is returned and potentially observable outside its scope. For foo, we don't have the knowledge that the return value of wrap_alloc hasn't been captured inside wrap_alloc in a way observable outside of it. The optimizer would in practice infer that fact, leading to our first instance of refinement. 79 | 80 | .. code:: c++ 81 | 82 | X* noalias wrap_alloc() { 83 | return new X(); 84 | } 85 | void foo() { 86 | X* o = wrap_alloc(); 87 | o->f = 5; 88 | delete o; 89 | } 90 | 91 | With the additional fact, we can now infer that the allocation is nocapture in foo, but not in wrap_alloc. 92 | 93 | Object Graphs 94 | ============= 95 | 96 | Moving on, let's consider connected object graphs. 97 | 98 | .. code:: c++ 99 | 100 | void foo() { 101 | X* o1 = new X(); 102 | X* o2 = new X(); 103 | o1->f = o2; 104 | o2->f = o1; 105 | } 106 | 107 | In this example, both o1 and o2 are nocapture in the scope of foo. 108 | 109 | If any object is observable in a parent scope, then all objects reachable through that object are observable in that scope. 110 | 111 | .. code:: c++ 112 | 113 | X* foo() { 114 | X* o1 = new X(); 115 | X* o2 = new X(); 116 | o1->f = o2; 117 | o2->f = o1; 118 | return o1; 119 | } 120 | 121 | void bar() { 122 | X* o = foo(); 123 | } 124 | 125 | In this case, we see that both allocations are captured in foo, but nocapture in bar. In the following example, o1 is nocapture in both foo and bar, while o2 is only nocapture in bar. 126 | 127 | ..
code:: c++ 128 | 129 | X* foo() { 130 | X* o1 = new X(); 131 | X* o2 = new X(); 132 | o1->f = o2; 133 | return o2; 134 | } 135 | 136 | void bar() { 137 | X* o = foo(); 138 | } 139 | 140 | 141 | Defects 142 | -------- 143 | 144 | As currently written, the definition makes allocas trivially nocapture. Thus, it's clearly missing something. Perhaps what we've defined here is escape instead? 145 | -------------------------------------------------------------------------------- /element-wise-atomic-proposal: -------------------------------------------------------------------------------- 1 | [DRAFT] Support element wise atomic vectors and FCAs 2 | 3 | WARNING: This is a draft. It is still under development, and should not be cited until shared on llvm-dev. 4 | 5 | TLDR: We need to be able to model atomicity of individual elements within vectors and structs to support vectorization and load combining of atomic loads and stores. 6 | 7 | Background 8 | 9 | LLVM IR currently only supports atomic loads and stores of integer, floating point, and pointer types. Attempting to use an atomic vector or FCA type is a verifier error. LLVM supports both ordered and unordered atomics. 10 | 11 | For ease of discussion, I'm going to ignore alignment. Assume that everything which follows refers to a properly aligned memory access. The unaligned case is much harder, and is already fairly ill-defined today even for existing atomics. 12 | 13 | On modern X86, there are no formal guarantees of atomicity for loads wider than 64 bits. The practical behavior observed seems to be that loads and stores wider than 64 bits are not atomic on at least some architectures. (This is exactly what you'd expect as the width of the load/store ports is often smaller than the max vector register size.) However, all of the architectures I'm aware of seem to provide atomicity of the individual 64 bit chunks.
14 | 15 | In practice, every Java virtual machine that I'm aware of appears to assume that vector loads and stores are atomic at the 64-bit granularity. (Java requires atomicity of all - well, most - memory accesses and thus any VM that vectorizes using x86 vector registers is implicitly making this assumption.) 16 | 17 | This notion of "atomic in 64 bit chunks" is what I want to formalize in the IR. Doing so solves two major problems. 18 | 19 | First, it allows vectorization of loops with atomics. Today, upstream LLVM does not optimize any loop with an atomic load or store, and because of the semantic gap, we can't. If we converted an atomic i32 load sequence into a non-atomic load of a vector, we'd be miscompiling. And we can't simply convert to an atomic iM (M = N * 32) load as we likely can't guarantee atomicity for the full load width. 20 | 21 | Second, it allows load combining at the IR (or MI) level to be a reversible transformation. Today, if we merge a pair of atomic i8 loads, our only option is to mark the resulting i16 load as atomic. This is problematic, as if we later find a value available for one of the two original loads, we can't split the i16 apart again and perform load forwarding of the available value. 22 | 23 | As a simple example, consider: 24 | a[0] = 5; 25 | .... (something which later gets optimized away) 26 | v1 = a[0]; 27 | v2 = a[1]; 28 | 29 | If we combine the two loads, we then can't perform the load forwarding from the visible store. 30 | 31 | Proposal 32 | 33 | There are really two potential proposals. The choice between them comes down to the question of whether we wish to support full width atomicity for vectors and FCAs as well. Personally, I lean towards the first just due to there being less work, but won't object if the community as a whole thinks the second is worthwhile.
34 | 35 | Proposal 1 - No full width atomicity 36 | 37 | Interpret the existing atomic keyword on load or store as meaning either "full width atomic" or "element wise atomic" based on the type being loaded or stored. This would have the unfortunate implication that we can't canonicalize from vector to integer (or vice versa). It also involves the potential for some confusing code, but I think that can mostly be abstracted behind some carefully chosen helper routines on the instruction classes themselves. 38 | 39 | The advantage of this proposal is that a) it requires no change to bitcode or IR, b) it's straightforward to implement, and c) it could be extended into the second at a later time if needed. 40 | 41 | Proposal 2 - Both full width and element wise atomic vectors 42 | 43 | This would require both a bitcode format change, and an IR change. 44 | 45 | For the IR change, I'd tentatively suggest something like: 46 | load element_atomic <4 x i32>, <4 x i32>* %p unordered, align X (element wise atomic) 47 | load atomic <4 x i32>, <4 x i32>* %p unordered, align X (full width atomic) 48 | 49 | I haven't really investigated the bitcode change. I have little experience in this area, so if someone with experience wants to make a suggestion on best practice, feel free. 50 | 51 | The advantage of this proposal is in the generality. The disadvantage is the additional implementation complexity, and the conceptual complexity of supporting both full width and element wise atomics. 52 | 53 | 54 | Backend Implications 55 | 56 | TBD - representation in MMO 57 | TBD - initial simple lowering 58 | TBD - testing 59 | -------------------------------------------------------------------------------- /falcon-compiler.rst: -------------------------------------------------------------------------------- 1 | Zing Falcon 2 | ----------- 3 | 4 | Over the last few years, I helped to develop the Falcon compiler for the Azul Zing VM. Somewhat unfortunately, there's fairly little public information on the project.
(Marketing has never exactly been Azul's strength.) This page exists to consolidate known public information along with some basic factual overview. 5 | 6 | What is it? 7 | ------------ 8 | Falcon is an LLVM-based in memory compiler for Java bytecode to x86-64 assembly. It is the top tier compiler for the Azul Zing VM. It is a heavily optimizing just in time compiler which is fed with high quality profiling information collected in lower tiers of execution. 9 | 10 | Falcon's strength is the ability to emit high quality code, resulting in runtime performance for workloads on Zing which absolutely trounces the competition. Almost more interestingly, because it uses the same underlying compiler technology as Clang, Falcon can generate high performance code for just about anything Clang can. With Falcon, it is routine to see common loop idioms vectorized natively for the platform of execution. 11 | 12 | Falcon's largest weakness is compile time. It's very much not a classic just in time compiler design; there's a reason I called it an "in memory compiler" above. While the team has done a lot to mitigate this (which I can't talk about here), there do remain cases where this is observable. 13 | 14 | Public Talks 15 | ------------- 16 | 17 | The best intro talk is probably the keynote that I gave at the 2017 LLVM Developers Meeting. Beyond that, there have been a number of talks presented by members of the team over the development life cycle. I'm listing the ones I know of in reverse chronological order. 18 | 19 | * An Update on Optimizing Multiple Exit Loops. Philip Reames. Tech Talk, 2020 LLVM Virtual Developers Meeting. 20 | * Control-flow sensitive escape analysis in Falcon JIT. Artur Pilipenko. Tech Talk, 2020 LLVM Virtual Developers Meeting. 21 | * Truly Final Optimization in Zing VM. Anna Thomas. Tech Talk, JVM Language Summit 2018. 22 | * Falcon: An optimizing Java JIT.
Philip Reames, Keynote, 2017 LLVM Developers Meeting 23 | * A Quick Intro to Falcon. Philip Reames. Lightning Talk, JVM Language Summit 2017. 24 | * Expressing high level optimizations within LLVM. Artur Pilipenko. Tech Talk, 2017 European LLVM Developers Meeting. 25 | * LLVM for a managed language: what we’ve learned. Sanjoy Das and Philip Reames. Tech Talk, 2015 LLVM Developers Meeting 26 | * Supporting Precise Relocating Garbage Collection in LLVM. Sanjoy Das and Philip Reames. Tech Talk, 2014 LLVM Developers Meeting 27 | 28 | Other Citable Stuff 29 | ------------------- 30 | `Introducing the Falcon JIT Compiler `_ 31 | 32 | See also the blog posts linked to the launch at the bottom. 33 | 34 | `Using ZVM with the Falcon Compiler `_ 35 | 36 | These are the public docs for the Zing product. Here you can find the command line options to see the underlying LLVM IR - mostly interesting for compiler folks. This is also the only public place I can find which gives the original release date (Dec 2016) and on-by-default date (March 2017). Be cautious of the docs in general, though; they're frequently hilariously out of date and sometimes just wrong. 37 | 38 | `Truly Final Optimization in Zing VM `_ 39 | 40 | A nice written version of the 2018 JVMLS talk mentioned above. 41 | -------------------------------------------------------------------------------- /ismm2020.rst: -------------------------------------------------------------------------------- 1 | ISMM 2020 Notes 2 | =============== 3 | 4 | Trying something new. As ISMM 2020 is a virtual conference this year, I'm listening in to all the talks and making notes as I go. `YouTube Live Stream `_. `Proceedings `_. 5 | 6 | Verified Sequential Malloc/Free 7 | ------------------------------- 8 | 9 | Separation logic based proofs, described in a DSL for a proof tool I am unfamiliar with.
Basic strategy is to do a source level proof w/proof annotations stored separately and rely on CompCert (a verified C compiler) to produce a verified binary. The library verified is a malloc library I'm unfamiliar with; unclear how "real" this code is. Verification work was done manually by the author. The first couple of slides do nicely describe the strength of separation logic for the domain and some of the key intuitions. 10 | 11 | Alligator Collector: A Latency-Optimized Garbage Collector for Functional Programming Languages 12 | ----------------------------------------------------------------------------------------------- 13 | 14 | Partially concurrent collector for GHC. Presentation is somewhat weak for an audience familiar with garbage collection fundamentals. Collector sounds fairly basic by modern Java standards, but it makes for an interesting experience report. The design used isn't a bad starting point for languages without a mature collector. 15 | 16 | The question at the end about implications of precompiled binaries is interesting. In particular, the acknowledged advantage of a load barrier and partially incremental collection. The answer mentioned "customer was strict" about that requirement, which provides some context on why the design evolved in the way it did. 17 | 18 | Understanding and Optimizing Persistent Memory Allocation 19 | ---------------------------------------------------------- 20 | 21 | Focus is on writing crash atomic (i.e. exception safe) allocation code for persistent memory. The approach taken is to persist only heap metadata sufficient to implement a conservative (ambiguous) GC. Previous approaches referenced appear to be analogous to standard persistence techniques for databases (i.e. commit logs). Positioning is described most clearly in the conclusion slide. Questions at the end are the clearest part of the talk. 22 | 23 | The contribution of this paper appears fairly minor conceptually.
There's a passing mention of a first lock-free allocator, but the only part actually discussed is persisting only heap metadata and reconstructing the rest on recovery. Previous work had already used conservative GC as a fallback. The choice to use a handlized heap is understandable, but has unfortunate performance implications. 24 | 25 | An interesting idea to explore in this area would be what a relocating collector looks like in this space. One way to avoid the need for handlization is to support pointer fixup. You could track previously mapped addresses and, as long as those don't overlap, you can interpret all previous pointers with a load barrier. Technically, you can do fixup without mark or relocate, but if you're paying the semantic costs to support fixup you might as well get the fragmentation benefits. 26 | 27 | 28 | 29 | Exploiting Inter- and Intra-Memory Asymmetries for Data Mapping in Hybrid Tiered-Memories 30 | ------------------------------------------------------------------------------------------ 31 | 32 | Starts with a nice overview of the problem domain. Super helpful as I know little about this area! 33 | 34 | Approach appears to basically use profiling of first access to classify memory as "likely read heavy" and "likely write heavy". There's also a bit about "near and far", but that went over my head. There's a profile mechanism in the page fault handler (presumably in hardware) which classifies. On first access to a new page, this *instruction* profile is used to classify the new page. 35 | 36 | Key expectation seems to be that individual instructions are either read heavy or write heavy. This seems reasonable, but it's not clear to me why you need a dynamic profile for this. The instruction encoding seems to tell you this. Maybe the dynamic part is needed for near and far? I didn't follow that part. 37 | 38 | This is far enough out of my area that I'm not following details. If you're interested in the topic, I highly recommend listening to the talk directly.
39 | 40 | 41 | Prefetching in Functional Languages 42 | ------------------------------------ 43 | 44 | Nicely explained problem domain and problem statement. 45 | 46 | I am very skeptical of the value of prefetching on modern hardware; the talk tries to justify the need on OOO hardware, and while not wrong, I think it's overstating the problem. Particularly with a GC which arranges objects in approximate traversal order (singly linked lists are trivial to get right), I don't see a strong value here. (Ah, the later performance comparison uses ARM chips and Knights Landing... I just showed my mainline Intel bias.) 47 | 48 | After listening to the talk, it's really not clear what the contribution was. Most of the discussion is generic prefetch discussion. Did they add a prefetch instruction to OCaml? That's pretty basic. Was this a characterization study? Was there some form of compiler support? 49 | 50 | It is a nice presentation on the basics of prefetching and the performance tradeoffs thereof. 51 | 52 | Garbage Collection Using a Finite Liveness Domain 53 | ------------------------------------------------- 54 | 55 | Basic notion is to use a heap analysis to describe the reachable objects from each reference. Standard GC is reachability based, which simply means we use the conservative trivial analysis (i.e. everything is live if there's an edge to it). The general problem with this approach is scalability; this paper tries to approach that with a restricted set of analysis results. The basic framing appears to be: given a tree, which subtrees are live? The set of subtrees is restricted to first level, left recursive, right recursive, and all. 56 | 57 | A thought: has anyone considered reversing the liveness result? If, when scanning the stack, you used the liveness analysis to break references to dead objects, this *must* be semantics preserving. At that point, a standard reachability GC can produce the refined results.
Actually, given that, isn't the result equivalent to a complicated DSE which inserts code which only runs at the GC point? Maybe a compiler could support this with a DSE and a gc closure which runs at stack scan time? (Clarification: The inversion of the analysis result works for single threaded programs or unescaped objects in concurrent languages, but not for objects accessible by multiple threads.) (Ian asked the same question :) and the author provided a nice answer about AA I'd missed.) 58 | 59 | ThinGC: Complete Isolation With Marginal Overhead 60 | -------------------------------------------------- 61 | 62 | Objects are grouped into hot and cold pages. All access is to hot pages; pages are moved to the hot region if needed to satisfy a memory access. (This seems an odd choice and has obvious parallels to compressed heap ideas which have been explored in previous work.) Implementation builds on ZGC, but that appears to be irrelevant to the main focus of the work. 63 | 64 | Question: Why not have the cold heap be a subset of the old gen? Allows reuse of the card mark instead of a separate remembered set. A: ZGC is single generational. 65 | 66 | Application study looks very promising. This is probably the key contribution from the paper. Reheating percentage is low for most applications, but high for a few. This is a problem for the approach. I was unhappy that this wasn't further discussed. I asked a question on this at the end, and didn't get the sense this had really been explored. 67 | 68 | As a result of the variation in reheating, also observed very wide variations in performance and pattern. Also a verbal mention of run-to-run variation due to memory arrangement, but this wasn't justified or expanded upon. (I'm mildly suspicious of this without evidence.) 69 | 70 | My questions: 71 | * What was the motivation for requiring reheating rather than directly accessing the cold objects? Did you evaluate the implications of this choice?
A: plan is to use very slow memory for cold memory; current work does not. 72 | * Did you explore the patterns which tended to cause reheating? Is there any idiomatic pattern which might influence the design? A: no information 73 | 74 | My takeaway: no strong conclusions; haven't either proved or disproved the idea of hot/cold separation; too tied to the decision around reheating. Maybe the paper has more useful details? 75 | 76 | 77 | Improving Phase Change Memory Performance with Data Content Aware Access 78 | ------------------------------------------------------------------------- 79 | 80 | Skipped this session. 81 | 82 | Keynote: Richard Jones 83 | ----------------------- 84 | 85 | Schedule conflict, will watch later. 86 | 87 | -------------------------------------------------------------------------------- /linux-perf-jit.rst: -------------------------------------------------------------------------------- 1 | This document discusses how a JIT compiler (or other code generator) can generate symbol information for use with the Linux perf utility. This is written from the perspective of a JIT implementor, but might be of some interest to others as well. This is mostly a summary of information available elsewhere, but at the time of writing I couldn't find such a summary. 2 | 3 | There are two major mechanisms supported by perf for getting symbols for dynamically generated code: perf map files, and jitdump files. 4 | 5 | Perf Map 6 | A perfmap is a textual file which maps address ranges to symbol names. It has no other content and does not on its own [1]_ support disassembly or annotation. 7 | 8 | I couldn't find a formal description of the format anywhere, but the format appears to have one entry per line of the form "<startaddr> <size> <symbolname>", with the start address and size in hex. The file must be named /tmp/perf-<pid>.map, where <pid> is the pid of the process containing the generated code. This pid is recorded in the perf.data file, so offline analysis is supported. You can copy these files between machines if needed.
There's no graceful handling of pid collisions, and no mechanism to clean up old perfmap files that I could find. 11 | 12 | The format does not support relocation of code, or recycling of memory for different executable contents. What happens if you have overlapping ranges in a file is unspecified. 13 | 14 | Jitdump 15 | The jitdump format is a binary format which covers a much broader range of use cases. There is a `formal spec `_ for the binary format. In addition to basic symbol resolution, jitdump supports disassembly/annotation and code relocation. 16 | 17 | To work with jitdump files, you have to use "perf inject" to produce a perf.data file which contains both the raw perf.data and the additional jitdump information. The location of the jitdump file on disk is not documented, and I haven't yet tracked it down. Once injected, the combined perf.jit.data file can be moved to other machines for analysis. 18 | 19 | From what I can tell, jitdump provides a strict superset of the functionality of perf map files. Despite this, perf map appears to be much more widely used. It's not clear to me why this is true; the main value I see in perf map files is that they're easy to generate by hand or by simple scripting from other information sources. 20 | 21 | Example Command Sequences 22 | -------------------------- 23 | 24 | perf map:: 25 | 26 | perf record 27 | # remember to copy the /tmp/perf-<pid>.map along with the perf.data file 28 | # if analyzing off box 29 | perf report 30 | 31 | jitdump:: 32 | 33 | perf record 34 | perf inject --jit -i perf.data -o perf.jit.data 35 | # move perf.jit.data around if needed 36 | perf report -i perf.jit.data 37 | 38 | Potential Gotchas 39 | ----------------- 40 | 41 | Support for both mechanisms was added to perf relatively recently. As perf is version locked to the kernel of the system, this implies that currently supported releases of some older distros contain perf versions which don't support either feature.
Unfortunately, there's no graceful error reporting; symbols simply fail to load. I ended up resorting to grepping through strace output to confirm whether the perf binary I was using tried to load the perf map file. This was the only conclusive way I found to distinguish between malformed perf map files and versions of perf lacking support. 42 | 43 | For some reason I've yet to understand, certain types of memory regions cause perf to fail to collect symbolizable traces. This is particularly confusing as inspecting the raw data file with perf script and/or perf report shows addresses which map to valid entries in a perf map file, but for some reason the way memory was obtained during the perf record run affects symbolization. I've only seen this with perf maps; I haven't tried the same experiment with a jitdump setup just yet. 44 | 45 | Useful References 46 | ------------------ 47 | 48 | The wasmtime folks have a `nice description `_ of using a jitdump based mechanism from a user perspective. 49 | 50 | Brendan Gregg has a post on using perf map files to generate `flame graphs for v8 `_. He also has a lot of other generally awesome perf stuff, but most of it is focused on statically compiled code. 51 | 52 | `perf-map-agent `_ and `perf-jitdump-agent `_ are useful examples of how to generate the corresponding file formats. These are JVMTI agents for Java, one for each of the corresponding workflows. 53 | 54 | Footnotes 55 | ---------- 56 | 57 | .. [1] Several of the perf commands allow you to provide an alternate path to the objdump binary. If you have an alternate source of disassembly for some of the methods named in the perf map file, you can write a shim script which wraps the real objdump, intercepts the disassembly request sent to objdump for a particular symbol name, and provides the alternate disassembly.
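As a sketch of the shim idea from footnote [1]: perf invokes objdump with ``--start-address=``/``--stop-address=`` flags bracketing the code it wants disassembled, so a shim can key off the start address. Everything else here (the directory layout, the function packaging) is invented for illustration; a real shim would be a standalone script handed to e.g. ``perf annotate --objdump=``::

```shell
# Illustrative objdump shim (directory layout here is made up).
objdump_shim() {
  alt_dir="${ALT_DISASM_DIR:-/tmp/jit-disasm}"
  for arg in "$@"; do
    case "$arg" in
      --start-address=*)
        start="${arg#--start-address=}"
        # If we have alternate disassembly for this code range, emit it.
        if [ -f "$alt_dir/$start.txt" ]; then
          cat "$alt_dir/$start.txt"
          return 0
        fi
        ;;
    esac
  done
  # No alternate disassembly known: delegate to the real objdump.
  objdump "$@"
}
```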
58 | 59 | -------------------------------------------------------------------------------- /llvm-board-2020.rst: -------------------------------------------------------------------------------- 1 | This is the draft form of my 2020 application to be on the LLVM Foundation board. If successful, the application becomes public, so I've decided simply to publish it from the beginning. My application was not accepted. 2 | 3 | Name (First and Surname) * 4 | -------------------------- 5 | Philip Reames 6 | 7 | Please summarize relevant contributions to the LLVM Project. These can include things such as patches, code ownership, bug reports, mailing list posts, blog posts, volunteering at LLVM events, organizing LLVM social events, contributing to the developer meeting paper committee, etc. 8 | ------------------------------------------------------------------- 9 | 10 | LLVM contributor since 2013. Major areas of (code) contribution: 11 | 12 | * Main author of gc.statepoint infrastructure, and current de facto code owner of garbage collection support in LLVM 13 | * Extensively contributed both directly and indirectly (through coworkers) to SCEV, IndVars, and much of our loop canonicalization infrastructure. Recently became code owner. 14 | * (Incrementally) rewrote most of LazyValueInfo and CorrelatedValuePropagation. 15 | * Led effort to optimize unordered atomic loads and stores throughout the middle end and x86 backend. Contributed ~50% of the code directly. 16 | * Many smaller contributions throughout the middle end optimizer and (to a much lesser degree) the x86 backend. 17 | 18 | Non Technical Contributions 19 | 20 | * Member of the recently established LLVM Security Group. 21 | * Contact point for all Azul contributions to LLVM. Representing LLVM internally (e.g. relicensing, etc.), and organizing organizational contributions upstream, including internally focused developer education around LLVM community norms and processes.
22 | * Presenter at multiple LLVM Developer Conferences including one of the 2017 keynotes on the Falcon project. Recently, most of my focus at developer meetings has been on the hallway track conversations, and the de facto "frontends for dynamic languages" working sessions which happen each year (whether formally organized or not). 23 | * Wrote the "Performance Tips for Frontend Authors" and "LLVM Loop Terminology (and Canonical Forms)" doc pages. Also contributed to a bunch of other docs in smaller one-off changes. 24 | 25 | Indirect Contributions 26 | 27 | * The points that follow are things which Azul does which I have had some role in steering. Most of the work on these has not been my own, and others should get all the credit for making things actually happen. :) 28 | * Fuzzing, regression tracking, and quality improvements - We run one of the only large fuzzer deployments which actually runs generated code. As a result of this, we catch a disproportionate fraction of miscompiles. We deliberately lag ToT by a few days so that our time and energy is spent on the harder subtle issues. In addition to the normal "please revert patch X" cases, we've also found a number of deep and interesting bugs in core passes. My favorite to date was the fuzzer finding incorrect nsw/nuw flag handling in GVN which had been present for almost a decade. 29 | * Falcon (our LLVM based compiler for Java bytecode) demonstrated that it was possible to develop compilers for non-C family languages on LLVM, and achieve performance which beat existing state of the art approaches. In the process of doing so, we fixed a number of issues, documented many of the items we stumbled across, and publicly discussed most of the key design elements of our approach (including our mistakes). I'd like to think this has positively impacted the broader LLVM ecosystem. 30 | 31 | Why do you want to be on the LLVM Foundation board of directors?
32 | ----------------------------------------------------------------- 33 | 34 | I believe that the LLVM project has become a core piece of infrastructure and investment is needed accordingly. I personally greatly appreciate various aspects of the community (e.g. professionalism, creative tension between pragmatism and perfectionism, and a refusal to get lost in bikeshedding), but also see stress points forming as the community scales (e.g. infrastructure, decision making, review fragmentation). I want to ensure the project continues to scale without losing the aspects which have made it such a wonderful ecosystem in which to work these last few years. 35 | 36 | What experience or skills can you bring to the board? Which of the above programs could you help drive forward? 37 | ------------------------- 38 | 39 | Helped to establish, and fundraise for, the initial New Haven Pride Center scholarship fund (https://www.newhavenpridecenter.org/youth/scholarship/). That initial fund has now developed into five distinct scholarship funds with a total of six annual awards. 40 | 41 | The areas I'm most interested in contributing towards are scholarship grants, education opportunities for students getting started in the community (particularly students from non-traditional backgrounds), and support of common project infrastructure. I will also contribute in areas outside those foci, but they're the ones of most personal interest to me. 42 | 43 | 44 | 45 | We value diversity and representation of the various interested groups working on LLVM and using it. Do you consider yourself representative of a minority group, underrepresented geographic region, etc? 46 | ----------------------------------------- 47 | No. 48 | 49 | 50 | Which program are you most interested in supporting?
51 | ----------------------------------------------------- 52 | 53 | Educational Outreach 54 | 55 | Diversity & Inclusion in Compilers and Tools 56 | 57 | **Grants & Scholarships** 58 | 59 | Infrastructure Support 60 | 61 | What is your second choice program to support? 62 | ----------------------------------------------- 63 | 64 | Educational Outreach 65 | 66 | Diversity & Inclusion in Compilers & Tools 67 | 68 | Grants & Scholarships 69 | 70 | **Infrastructure Support** 71 | 72 | 73 | How many hours a week can you dedicate to LLVM Foundation business? 74 | Board members are expected to dedicate time to meetings and to the programs. 75 | ----------------------------------------------------------------------------- 76 | 77 | Time availability will vary widely, but a minimum of 2-3 hours and sometimes much more. 78 | 79 | Are you interested in a specific position on the board? 80 | -------------------------------------------------------- 81 | 82 | No 83 | 84 | 85 | Are you willing and able to help fundraise for the LLVM Foundation? We rely on donations to fund our programs and need board members to help find new sponsors and donors. 86 | -------------------------------------------------------------------- 87 | 88 | Yes, with a particular emphasis on 1) trying to establish periodic giving campaigns and otherwise diversify the foundation's funding, and 2) establishing separate dedicated funding sources for scholarships and student travel grants. 89 | 90 | Is there anything else you would like to add for the board to consider? 91 | ------------------------------------------------------------------ 92 | No. 93 | 94 | New this year, we will accept letters of recommendation to support your application. Please have your references send their letter of recommendation directly to us at boardapp@llvm.org. This is totally optional.
95 | ------------------- 96 | 97 | I will not have any letters of recommendation. 98 | -------------------------------------------------------------------------------- /llvm-debugging-basics.rst: -------------------------------------------------------------------------------- 1 | ------------------------------------------------- 2 | LLVM Debugging Tricks 3 | ------------------------------------------------- 4 | 5 | This page is a collection of basic tactics for debugging a problem with LLVM. This is intended to serve as a reference document for new contributors. At the moment, this is pretty bare bones; I'll expand on demand. 6 | 7 | .. contents:: 8 | 9 | Compiler Explorer (i.e. Godbolt) 10 | -------------------------------- 11 | 12 | ``_ is an incredibly useful tool for seeing how different compilers or compiler versions compile the same piece of code. The ability to link to exactly what you're looking at and share it with collaborators is invaluable for asking and answering highly contextual questions. 13 | 14 | 15 | Assertion Builds 16 | ---------------- 17 | 18 | Before you do literally anything else, make sure that you have assertions enabled on your local build. 19 | 20 | LLVM makes very heavy use of internal assertions, and they are generally excellent at helping to isolate a failure. In particular, many things which appear as miscompiles in a release binary will exhibit as an assertion failure if assertions are enabled. 21 | 22 | **Warning:** Few of the commands mentioned in this document will work without assertions enabled! 23 | 24 | As a practical matter, I do not recommend the debug flavors of the builds, but Release with assertions enabled is very, very worthwhile. For context, an assertion enabled release build is around 8GB; the last time I did a debug build, it was around 60GB.
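For reference, assertions are controlled by the ``LLVM_ENABLE_ASSERTIONS`` CMake variable; a typical configure invocation (generator, build directory, and source path being whatever you normally use) looks something like::

    cmake -G Ninja ../llvm \
        -DCMAKE_BUILD_TYPE=Release \
        -DLLVM_ENABLE_ASSERTIONS=ON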
25 | 26 | Capture a standalone reproducer for a clang issue 27 | ------------------------------------------------- 28 | 29 | Clang will attempt to produce a standalone reproducer for the current clang invocation if a crash is encountered during execution. 30 | 31 | ``-gen-reproducer=always`` will enable this reproducer generation for non-crashing compiles. This can be useful for extracting reproducers for an optimization quality problem from a large build system. 32 | 33 | This doesn't always succeed. If it reports failure, check the /tmp directory for a preprocessed file (depending on where it failed), or fall back to trying to create a preprocessed input file via -E. 34 | 35 | Capture IR before and after optimization 36 | ---------------------------------------- 37 | 38 | ``-S -emit-llvm`` will cause clang to emit an .ll file. This will contain the result of mid-level optimization, immediately before the invocation of the backend. 39 | 40 | ``-S -emit-llvm -disable-llvm-optzns`` will cause clang to emit an .ll file and *skip* optimization. Note that this is often different than the result of ``-S -O0 -emit-llvm`` as the latter embeds ``optnone`` attributes in the IR. 41 | 42 | 43 | Capture IR before or after a pass 44 | --------------------------------- 45 | 46 | ``-mllvm -print-before=loop-vectorize -mllvm -print-module-scope`` will print the IR before each invocation of the pass "loop-vectorize". (As it happens, there's only one of these in the standard pipeline.) The resulting output will be valid IR (well, with a header you need to remove) which can be fed back to "opt" to reproduce a problem. There's also an analogous ``-print-after=`` option. 47 | 48 | If you want to trace through execution, ``-mllvm -print-after-all`` can also be useful, but be warned, this is very, very verbose. Piping it to a file and searching through it with a decent text editor is likely your best bet.
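Putting the above together, a typical round trip through ``opt`` (the pass name and file names here are just examples) looks something like::

    clang -O2 -c foo.c -mllvm -print-before=loop-vectorize \
        -mllvm -print-module-scope 2> dump.ll
    # delete the "*** IR Dump Before ... ***" header line from dump.ll, then:
    opt -passes=loop-vectorize -S dump.ll

Note that the dump goes to stderr, hence the ``2>`` redirect.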
49 | 50 | See what a pass is doing 51 | ------------------------ 52 | 53 | ``-mllvm -debug-only=loop-vectorize`` turns on the internal debug tracing of the pass. This can be very insightful when read in combination with the source code of the pass in question. 54 | 55 | ``-mllvm -pass-remarks=*`` turns on the pass remarks mechanism which is intended to be more user facing. My experience is that these are generally not really useful, and that the filtering mechanism doesn't work well. Mostly relevant when looking for missed optimizations. 56 | 57 | 58 | opt and llc 59 | ------------ 60 | 61 | These tools are your friends. You can pass IR to opt to exercise any mid-level optimization. You can use llc to exercise the backend. 62 | 63 | Many of the commands listed previously can be passed to opt or llc by simply omitting the ``-mllvm`` prefix. 64 | 65 | llvm-reduce and bugpoint 66 | ------------------------ 67 | 68 | These tools provide a fully automated way to reduce an input IR program to the smallest program which triggers a failure. Reducing crashes or assertion failures is pretty straightforward; reducing miscompiles is quite a bit trickier. 69 | 70 | Alive2 71 | ------ 72 | 73 | Alive2 is a tool for formally reasoning about LLVM IR. There is a web instance available at ``_. This is a great tool for quickly checking if an optimization you have in mind is correct. 74 | 75 | You can also download and build alive2 yourself, and it has a lot of useful functionality for translation validation. This can be very useful when tracking down a nasty miscompile, but is very much an advanced topic. 76 | -------------------------------------------------------------------------------- /llvm-gc-retrospective-2019: -------------------------------------------------------------------------------- 1 | [DRAFT] A retrospective on GC support in LLVM, and some proposed tweaks 2 | 3 | WARNING: This is very much in rough draft state. Please don't cite until completed and sent to llvm-dev.
4 | 5 | As some of you may remember, a few years back I led an effort to extend LLVM with support for fully relocating garbage collection in the form of the gc.statepoint family of intrinsics. In the meantime, we've successfully shipped a compiler for a VM with a fully relocating collector using this support, and gained a bunch of practical experience with the design. This has led me to revise my thinking quite a bit, and I want to both summarize my lessons learned, and propose some small changes. 6 | 7 | Background and Retrospective 8 | 9 | As a reminder, the gc.statepoint mechanism involved three major parts: 10 | 1) The first was the introduction of the abstract machine model (enabled by non-integral pointer types) followed by a lowering to the physical machine model. The abstract machine model is relocation independent - at the cost of disallowing a few operations like addrspacecast and inttoptr/ptrtoint, and typing memory. 11 | 2) The second was the gc.statepoint representation itself. This representation was designed to make all relocations explicit in the physical machine model and was designed from the beginning to make integration with the register allocator feasible. 12 | 3) The third was the notion of late safepoint insertion. We ended up having to abandon this ourselves due to a need to support deoptimization at the same set of safepoints, but there's still some (unused?) support for this upstream today. 13 | 14 | I think it's safe to say that the abstract machine model has proven itself. The abstract machine model allows optimization passes to reason about references to objects (as opposed to raw pointers) and makes things like LICM and other loop opts straightforward. Without the abstract machine model, we'd have needed to make invasive changes across the entire optimizer. Instead, the changes which have been made have been pretty isolated in scope and well contained.
Probably the biggest problem with the current implementation (as opposed to the model) is that RewriteStatepointsForGC (the lowering to the physical model) doesn't actually remove the non-integral marker from references; it just happens to work out in practice. 15 | 16 | As I already noted, the late safepoint insertion mechanism turned out not to be practical for our use case. We had to be able to materialize an abstract machine state at each safepoint, and in the end, that required having safepoint locations (but not relocation semantics) explicit in the IR from the very beginning. I still think that late safepoint insertion could be very practical for a statically compiled language w/o deoptimization, but this is really unproven to date. 17 | 18 | Statepoints, well, that's where some of the lessons learned come in. There were a couple of major assumptions which went into the design: 19 | 1) There was a strong assumption that the quality of safepoint lowering *mattered*. Everyone we talked to who'd done this before in other compilers seemed to agree that performance was going to be bottlenecked by the safepoint lowering, and thus it was worth a lot of effort to get right. This drove the whole notion of representing a set of defs explicitly via gc.relocates. 20 | 2) Somewhat following from the previous, we believed that first class support for derived pointers (both interior and exterior to their associated object) was a necessity. (If we didn't have first class support, you can always rematerialize an arbitrary derived pointer after a safepoint via an "offset = derived-base, relocate(base), add offset" idiom. This is exactly what the GC would do internally.) 21 | 22 | To my utter and complete surprise, it's turned out that both assumptions have been at least partially incorrect in practice. 23 | 24 | At least on X86, the ability to fold complex addresses into using instructions really diminishes the value of first class derived pointer support.
I'm not saying that there's no value to a fully regalloc integrated derived pointer aware stackmap, but I've seen few cases where it's obviously profitable - that may be largely due to the next point. 25 | 26 | The key surprise for me has been that while quality of safepoint lowering has mattered somewhat, it hasn't been anywhere near the peak performance limiter expected. Instead, it's been mostly a matter of code size and slowpaths for which the current design is somewhat poor. The key observations are that: 27 | 1) Any safepoint poll involves a fast and slow path, but *only the slowpath needs a stackmap*. As such, the lowering required for the slowpath can be fairly terrible as long as it doesn't perturb the fastpath too much. 28 | 2) After optimization, safepoint polls are dynamically rare. They're not as rare statically, but that's almost entirely due to deoptimization side effects. 29 | 3) Any hot call safepoint is generally an optimization opportunity for either devirtualization or inlining. It's really really rare to see a truly megamorphic (i.e. not N morphic for small N) call on a fastpath, or a large function with many hot callers. 30 | 31 | Putting these observations together, there's been little motivation to work on improving the statepoint lowering. As a result, we've never actually achieved the planned register allocation integration and are still shipping what was expected to be a stop gap implementation which does a poor man's register allocator in SelectionDAG. It's only been recently that we've really started giving improving this state of affairs some serious thought, and the motivation to do so has mostly been driven by compile time and code size, not performance of the slowpath code.
I don't have any hard numbers, but I suspect this to be one of the largest contributors to codegen time. It's not uncommon to see methods with a large fraction of total instructions being gc.relocates. 34 | 35 | So, I think it's fair to ask whether given what we know now, would we reimplement gc.statepoint the same way? Honestly, I think the answer has to be no. 36 | 37 | Now what? 38 | 39 | So, am I about to propose we remove gc.statepoint? No, I'm not. It works well enough, and there's no strong incentive to switch away for anyone who has built a working implementation on top of it. 40 | 41 | However, I do think we should consider reframing our documentation to default to an alternative, and updating the lowering for the abstract machine model to support that alternative. 42 | 43 | What is that alternative you ask? Well, gc.root. (For those of you who were around through the initial rounds of discussion, yes, there's quite some irony here.) I think we do need to make a slight tweak to the gc.root implementation though. 44 | 45 | For those who don't know, the gc.root representation works as follows: You explicitly create a stack slot (alloca), escape it via a special gc.root intrinsic, spill at the def of the variable, and reload after every possible safepoint. The one real subtlety is that for this to be correct for a relocating GC, no function reachable from a function using gc.root can be readonly *or inferred readonly by the optimizer*. Annoyingly, this issue only exists for relocating GCs, but it basically means that today, if you use gc.root you have to be *very* careful in ensuring none of your functions can be inferred readonly. Here's the problematic example: 46 | 47 | %a = alloca i8* 48 | call void @gc.root(i8* %a) 49 | store i8* %myptr, i8** %a 50 | call void @readonly() 51 | %newptr = load i8*, i8** %a 52 | 53 | GC.root relies on the capture of %a to force the optimizer to believe the call might clobber %a (for a relocating collector, it does).
However, since we've inferred readonly, that fact doesn't hold. (I've used readonly for the description, but you can create the same issue with writeonly, or argmemonly.) 54 | 55 | Now, I want to be careful to stop here and emphasize that gc.root can be used entirely correctly. It just imposes a subtle invariant on the module as a whole which is error prone. 56 | 57 | With a small tweak, we can remove this subtlety. Since gc.root was introduced, we've added the notion of operand bundles. One of the key semantics to operand bundles is that they can model memory semantics of the callsite independent of the callee. As such, we can use an operand bundle to avoid the subtlety around readonly calls. Here's what a revised example would look like: 58 | 59 | %a = alloca i8* 60 | call void @gc.root(i8* %a) 61 | store i8* %myptr, i8** %a 62 | call void @readonly() ["gc-root" (%a)] 63 | %newptr = load i8*, i8** %a 64 | 65 | This results in a much cleaner and easier to describe semantic model w/o the global module invariant. 66 | 67 | Summary 68 | 69 | So, what all am I proposing? I'm proposing that we: 70 | 1) add the "gc-root" operand bundle type, and update all the gc.root documentation to use it. 71 | 2) add support for gc.root as a target of RewriteStatepointsForGC (i.e. as a supported physical model when lowering the abstract model). 72 | 3) update the documentation to default to gc.root, and describe statepoint as an alternative. 73 | 4) update the documentation to encourage new frontends to lower directly to gc.root, then come back to the abstract machine model later. 74 | 75 | The last point needs a bit of justification. From talking to a number of folks at conferences, getting something working with existing GC infrastructure is generally the biggest problem, and the indirection through the abstract model seems to really confuse folks. I've witnessed a couple of projects stall here, and I'd like to avoid that going forward.
76 | 77 | 78 | 79 | 80 | -------------------------------------------------------------------------------- /llvm-norms.rst: -------------------------------------------------------------------------------- 1 | ------------------------------------------------- 2 | LLVM Norms, Terminology, and Expectations 3 | ------------------------------------------------- 4 | 5 | 6 | This page is a collection of things I find myself repeatedly needing to explain to new developers. This is not official project documentation; it is my take on each issue. Most of this is likely to agree broadly with what other long term contributors might say, but details in perspective may differ. This is a perpetual WIP - it is extended when I find myself repeating myself, and is in no way a complete guide. 7 | 8 | Let me start by introducing myself in case you're not already familiar with my work in the project. I am a long standing contributor to the LLVM project. I've contributed heavily to the mid level optimizer, and to a lesser extent parts of the X86 backend. I was the technical lead for the Falcon JIT - an LLVM based just in time compiler for Java bytecode. I've managed a team of LLVM contributors, and been responsible for maintaining a long-lived downstream distribution of LLVM. As such, I have a fairly broad perspective on what it takes to participate in the upstream community successfully, while still shipping downstream product. 9 | 10 | .. contents:: 11 | 12 | What does LGTM mean? 13 | -------------------- 14 | 15 | "LGTM" literally means "looks good to me", but there's a bunch more cultural context behind it. LLVM requires pre-commit review for most changes. For new contributors, *all* changes will require precommit review. Having an established contributor LGTM a change is the gate which has to be cleared before a change can land. 16 | 17 | While most reviews these days use phabricator, we're not always good about marking reviews approved through the UI.
A textual LGTM is what matters, not whether the review has been approved in the UI. 18 | 19 | Let me emphasize that LLVM is a single approval culture. This means that once a knowledgeable reviewer has approved a patch, you do not need to wait for further reviewer approval. You do need to use reasonable judgement here though. If another reviewer has raised concerns, you probably want to wait until they've had a chance to reply to any changes before landing. 20 | 21 | One point which confuses a bunch of new contributors is that LLVM reviewers **expect that you have commit rights**. Unless you **explicitly ask** someone to land your change on your behalf, reviewers will assume that you will do so after approval. This comes from the fact that LLVM hands out commit rights much more freely than other open source projects. 22 | 23 | LGTMs w/Conditions 24 | ------------------ 25 | 26 | It's not uncommon to see phrasings such as "LGTM w/comments addressed" or "LGTM w/minor comments". What this means is that once you've addressed the issues identified *as suggested by the reviewer*, you can consider the patch to have received an LGTM without the need for further review. 27 | 28 | This is frequently used by reviewers when the remaining issues with the patch are considered minor and straightforward. If you as an author disagree with how any issue should be handled (e.g. a comment needs discussion), be aware that you don't have an LGTM without further discussion and an explicit re-LGTM by that reviewer (or someone else). 29 | 30 | If the difference in approach is minor, I strongly suggest taking the reviewer's suggestion, landing your patch, and then posting a follow up patch to switch to your preferred approach. This will let all parties make progress, and avoids back and forth on already accepted reviews which has a tendency to get lost.
31 | 32 | Another form of conditional LGTM which comes up regularly is the "LGTM, but wait for @name" or "LGTM, but wait a couple of days in case @name has further comments". These two are interesting precisely because they are *different*, and that subtlety is often lost on non-native speakers. For the first, the reviewer is explicitly asking for a second LGTM. As such, our general "single accept" policy does not apply, and this review is blocked on a second accept by @name. The second is merely instructing you to wait a couple of days before landing so that @name has a chance to chime in if desired. The former blocks commit; the latter does not. 33 | 34 | What are "commit rights"? 35 | -------------------------- 36 | 37 | LLVM grants commit rights much more freely than most other open source projects. However, that's because the implied expectations are very different. In LLVM, having commit rights simply means that you are trusted to take the mechanical action of rebasing and landing an approved patch, and then respond promptly to post commit review. It **does not** change any expectation around precommit review, or imply anything beyond a very basic level of trust. 38 | 39 | Can I commit my change without review? 40 | -------------------------------------- 41 | 42 | As a general rule, unless you have been told otherwise, no. New contributors, in particular, should *never* commit a change without review. 43 | 44 | Beyond that initial state, we have in practice three levels of pre-commit rights. 45 | 46 | First, you'll pretty quickly be asked by reviewers to "pre-commit this test", or "pre-commit this NFC". That means that you can separate out a change which does that (and only that), and submit it without further review. A key point is that this change *was reviewed* in the original review thread. The trust being shown is minor, and mostly mechanical.
47 | 48 | Second, once you've been around for a while and have a sense of normal review flow, you'll reach the point where you have a good sense for what you'll be asked to pre-commit to reduce patch sizes during review. Once you hit that point, checking in tests and NFCs without review (i.e. before posting the using change) is acceptable. Reasonable judgement is expected, lean towards review. 49 | 50 | Third, established contributors will sometimes land "obvious" patches without review. If you're new enough to the community to be reading this guide closely, this is not relevant for you (yet). 51 | 52 | Silence means "No" 53 | ------------------ 54 | As a general rule, silence on a review or RFC means "no". It **does not** mean "no one cares, so go ahead". There is a huge amount of coalition building and discussion which happens offline. If you send out an RFC without talking it through with interested parties first, there is a good chance no one will have the time to read it and respond. 55 | -------------------------------------------------------------------------------- /llvm-riscv-fuzzing.rst: -------------------------------------------------------------------------------- 1 | ----------------------------- 2 | Fuzzing LLVM's RISCV Backend 3 | ----------------------------- 4 | 5 | This document is a collection of notes for myself on attempts at fuzzing LLVM's riscv backend. This is very much a WIP, and is not really intended to be read by anyone else just yet. This is very much a background project, so updates will likely be slow. 6 | 7 | .. contents:: 8 | 9 | Initial Attempt w/libFuzzer 10 | --------------------------- 11 | 12 | I started with libfuzzer because we used to have OSSFuzz isel fuzzing for other targets, and I figured that figuring out the build problem would be easy. Yeah, not so much. 13 | 14 | I was not able to get a working build of libfuzzer with ASAN. 15 | 16 | I found a three stage build approach that "somewhat" worked.
17 | 18 | stage1 - my normal LLVM dev build tree, clang enabled, Release+Asserts, nothing special 19 | 20 | stage2 - "PATH=~/llvm-dev/build/bin/:$PATH CC=clang CXX=clang++ cmake -GNinja -DCMAKE_BUILD_TYPE=Release ../llvm-project/llvm -DLLVM_ENABLE_ASSERTIONS=ON -DLLVM_NO_DEAD_STRIP=ON -DLLVM_USE_SANITIZER=Address -DLLVM_TARGETS_TO_BUILD=X86 -DLLVM_BUILD_RUNTIME=Off -DLLVM_USE_SANITIZE_COVERAGE=On" 21 | 22 | stage3 - "PATH=~/llvm-dev/fuzzer-build-stage1/bin/:$PATH CC=clang CXX=clang++ cmake -GNinja -DCMAKE_BUILD_TYPE=Release ../llvm-project/llvm -DLLVM_ENABLE_ASSERTIONS=ON -DLLVM_NO_DEAD_STRIP=ON -DLLVM_TARGETS_TO_BUILD="X86;RISCV" -DLLVM_BUILD_RUNTIME=Off -DLLVM_USE_SANITIZE_COVERAGE=On -DLLVM_TABLEGEN=/home/preames/llvm-dev/build/bin/llvm-tblgen " 23 | 24 | I could not get the stage3 build to complete if I used the ASANified tblgen. More than that, it seems clang from stage2 doesn't have a functional ASAN. Every attempt at using ASAN from that stage fails. Cause unknown. 25 | 26 | Once I got something which worked, I tried three experiments. 27 | 28 | *Experiment 1 - " ./llvm-isel-fuzzer corpus/ -ignore_remaining_args=1 -mtriple riscv64 -O2"* 29 | 30 | This used an empty corpus with default RISCV64 only (no extensions). Ran for a couple hours, no failures found. 31 | 32 | 33 | *Experiment 2 - "./llvm-isel-fuzzer corpus/ --help -ignore_remaining_args=1 -mtriple riscv64 -O2 -mattr=+m,+d,+f,+c,+v"* 34 | 35 | This used an empty corpus with a number of extensions enabled. Ran for a weekend, no failures found.
36 | 37 | 38 | *Experiment 3 - Ingest test/CodeGen/RISCV as starting corpus* 39 | 40 | ``` 41 | $ cat ingest-one.sh 42 | #set -x 43 | set -e 44 | SOURCE_FILE=$1 45 | 46 | PATH=../build/bin:$PATH 47 | 48 | fgrep "llvm.riscv" $SOURCE_FILE > /dev/null && exit 1 49 | 50 | llvm-as $SOURCE_FILE -o corpus/tmp.bc 51 | llc -O2 -march=riscv64 -mattr=+m,+d,+f,+c,+v -riscv-v-vector-bits-min=128 -o /dev/null < corpus/tmp.bc || exit 1 52 | HASH=$(sha1sum corpus/tmp.bc | cut -f 1 -d " ") 53 | echo "$SOURCE_FILE -> corpus/$HASH.bc" 54 | mv corpus/tmp.bc "corpus/$HASH.bc" 55 | 56 | $ PATH=../build/bin:$PATH find ../llvm-project/llvm/test/CodeGen/RISCV/ -name "*.ll" | xargs -l ./ingest-one.sh 57 | ``` 58 | 59 | This revealed an absolute user interface disaster for libfuzzer. 60 | 61 | Some tests deliberately check things which produce errors. libFuzzer fails if the input corpus contains failures. Since the exact setup is slightly different between the fuzzer binary, and llc, filtering out failures is a basically manual process. I ran out of interest before finding useful results. 62 | 63 | I also found that the public documentation on libfuzzer command line arguments is simply wrong in many cases. Nor does the binary support useful -help output of any kind. 64 | 65 | I consider this effort to have been a failure, and do not currently plan to spend more time on libfuzzer. The fact that I, a longstanding LLVM dev, can't figure out how to actually build the damn thing in a useful way says all too much right there.
66 | 67 | Reference save: 68 | 69 | * https://github.com/google/oss-fuzz/pull/7179#issuecomment-1092802635 70 | * https://github.com/google/oss-fuzz/commit/e0787861af03584754923979e76a243080e7dd96 71 | * https://bugs.chromium.org/p/oss-fuzz/issues/detail?id=27686 72 | * https://oss-fuzz-build-logs.storage.googleapis.com/log-9d455709-b52a-4ee0-8494-4a7e2529d5ff.txt 73 | * https://oss-fuzz-build-logs.storage.googleapis.com/log-f98c7f04-9b4c-43da-9a7c-3559a6a9b3dd.txt 74 | * https://oss-fuzz-build-logs.storage.googleapis.com/log-d902e118-7a63-439a-92f4-31bfaaf374f6.txt 75 | * https://llvm.org/docs/FuzzingLLVM.html (build instructions *do not* work) 76 | 77 | Brute Force via csmith and llvm-stress 78 | -------------------------------------- 79 | 80 | I have written some simple shell scripts to do brute force fuzzing of clang and llc targeting RISCV driven by csmith and llvm-stress respectively. As of this update, I have run ~100k unique csmith tests and ~10m unique llvm-stress tests. So far, I have found no compiler crash bugs. I am not running the resulting code, so I may have missed execution bugs. 81 | 82 | 83 | AFL - Upcoming 84 | ---------------- 85 | 86 | I'm planning to spend some time playing with afl-fuzz driving llc. My hope is that the user interface is more practically approachable. In theory, the fuzz rate will be lower due to the need to fork and instrument, but so be it. 87 | 88 | 89 | 90 | Other ideas to investigate 91 | -------------------------- 92 | 93 | Using JavaFuzzer for C++? Or maybe the approach John Regehr's student is using with great success on AArch64 right now? 94 | 95 | Rather than just fuzzing for crashes, fuzz using alive2 for miscompiles? Harder for backend, but maybe through IR phase? Or just find problems which depend on target hooks?
96 | 97 | 98 | 99 | 100 | 101 | 102 | -------------------------------------------------------------------------------- /llvm-riscv/ScalarCodeGen.rst: -------------------------------------------------------------------------------- 1 | ------------------------------------------------- 2 | Open Items in Scalar Codegen for RISCV 3 | ------------------------------------------------- 4 | 5 | .. contents:: 6 | 7 | 8 | Items in LLVM issue tracker 9 | ============================ 10 | 11 | * [SelectionDAGISel] Mysteriously dropped chain on strict FP node. `#54617 `_. This appears to be a wrong code bug for strictfp which affects RISCV. 12 | * Unaligned read followed by bswap generates suboptimal code `#48314 `_ 13 | 14 | 15 | Code Size 16 | ========= 17 | 18 | A general view that RISCV code size has significant room for improvement has been aired in recent LLVM RISC-V sync-up calls, but no specifics are currently known. 19 | 20 | 2022-07-11 - I spent some time last week glancing at usage of compressed instructions. Main take away is that lack of linker optimization/relaxation support in LLD was really painful code size wise. We should revisit once that support is complete, or evaluate using LD in the meantime. 21 | 22 | 23 | Branch on inequality involving power of 2 24 | ========================================= 25 | 26 | For the compare: 27 | %c = icmp ult i64 %a, 8 28 | br i1 %c, label %taken, label %untaken 29 | 30 | We currently emit: 31 | li a1, 7 32 | bltu a1, a0, .LBB0_2 33 | 34 | We could emit: 35 | srli a0, a0, 3 36 | bnez a0, .LBB1_2 37 | 38 | This lengthens the critical path by one, but reduces register pressure. This is probably worthwhile. 39 | 40 | There are also many variations of this type of pattern if we decide this is worth spending time on.
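The equivalence behind this rewrite is easy to sanity check: an unsigned 64-bit value is below 8 exactly when all bits above bit 2 are clear, i.e. when a logical right shift by 3 yields zero. A quick Python model of the two branch conditions (the function names are mine, purely illustrative):

```python
# Model of both forms of the power-of-2 unsigned compare above.
MASK64 = (1 << 64) - 1  # treat Python ints as 64-bit values

def is_ult8_compare(a: int) -> bool:
    # Models the current sequence: li a1, 7 ; bltu a1, a0 (fall through when a <u 8)
    return (a & MASK64) < 8

def is_ult8_shift(a: int) -> bool:
    # Models the shift form: srli a0, a0, 3 ; beqz (bits above bit 2 all clear)
    return ((a & MASK64) >> 3) == 0

# The two conditions agree across the boundary and at the range extremes.
for a in [0, 1, 7, 8, 9, 1 << 62, (1 << 64) - 1]:
    assert is_ult8_compare(a) == is_ult8_shift(a)
```

The same check works for any power-of-2 threshold 2^k by shifting right by k instead of 3.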
41 | 42 | Optimizations for constant physregs (VLENB, X0) 43 | =============================================== 44 | 45 | Noticed while investigating use of the PseudoReadVLENB intrinsic, and working on them as follow ons to ``_, but these also apply to other constant registers. At the moment, the two I can think of are X0, and VLENB but there might be others. 46 | 47 | Punch list (most have tests in test/CodeGen/RISCV/vlenb.ll but not all): 48 | 49 | * PeepholeOptimizer should eliminate redundant copies from constant physregs. * A VSETVLI whose implicit VL and VTYPE defines are dead essentially just computes a fixed function of VLENB. We could consider replacing the VSETVLI with a CSR read and a shift. (Unclear whether this is profitable on real hardware.) 69 | 70 | 71 | Compressed Expansion for Alignment 72 | ================================== 73 | 74 | If we have a sequence of compressed instructions followed by an align directive, it would be better to uncompress the prior instructions instead of inserting nops for alignment. 75 | 76 | This is analogous to the relaxation support on X86 for using larger instruction encodings for alignment in the integrated assembler. 77 | 78 | This is of questionable value, but might be interesting around e.g. loop alignment. 79 | 80 | Constant Materialization Gaps 81 | ============================= 82 | 83 | For constant floats, we have a couple of opportunities: 84 | 85 | * LUI/SHL-by-32/FMV.D.X - Analogous to the LUI/FMV.W.X pattern recently implemented, but requires an extra shift. This basically reduces to increasing the cost threshold by 1, and may be worth doing for doubles. 86 | * LI/FCVT.S.W - Create a small integer, and convert to half/single/double. Note this is a convert, not a move. For half, LUI/FMV.H.X may be preferable. 87 | * FLI.S/D - Likely to be optimal when Zfa is available. 88 | * FLI + FNEG.S - Can be used to produce some negative floats and doubles.
LUI/FMV.W.X is likely better for floats and halfs, so this mostly applies to doubles. FNEG.S can be used to toggle the sign bit on any float, so may be more broadly applicable as well. 89 | 90 | 91 | Rematerialization of LUI/ADDI sequences 92 | ======================================= 93 | 94 | Given an LUI/ADDI sequence - either from a constant or a relocation - we should be able to rematerialize either or both instructions if required to reduce register pressure during allocation. 95 | 96 | 97 | Register Pressure Reduction 98 | =========================== 99 | 100 | Improvement to switch lowering - if we generate a jump table for the labels, check to see if the result can be turned into a lookup table instead. We're already paying the load cost. 101 | 102 | Investigate simple improvements to ShrinkWrapping. 103 | 104 | Consider firewalling cold call paths. 105 | 106 | Define a fastcc variant where argument-0 and return don't require the same register and internalize aggressively - mostly helps LTO. 107 | 108 | IPRA - Can we reduce need to spill some? 109 | 110 | Prefer bnez (addi a0, a0, C) when doing so avoids the need for an immediate materialization and a0 has no other uses. 111 | 112 | Prefer bnez (lshr a0, a0, XLen-1) for sign check, same logic as previous. Also generalizes to bexti cases for any single bit check. 113 | 114 | Use arithmetic more aggressively for select c, i32 C1, i32 C2 to avoid need for control flow. (Doesn't really impact register pressure, may actually hurt.) 115 | 116 | Aggressively duplicate (addi a0, x0, C) to users before register allocation OR integrate rematerialization into first CSR path. 117 | 118 | Aggressively duplicate (addi a0, a0, C) when the user is a vector load or store to avoid long live ranges. Or combine remat in first CSR + full remat. 119 | 120 | Investigate full rematerialization. 121 | 122 | Investigate negated compound branch thing reported 2024-11-24 on discourse.
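One item in the list above, the sign-check rewrite (bnez on the result of lshr a0, a0, XLen-1), rests on the fact that in two's complement the high bit is set exactly for negative values. A small Python model of the 64-bit case (illustrative only):

```python
# Model of the sign-check idiom: a logical right shift by XLEN-1 (63 here)
# isolates the sign bit, so bnez on the result is a signed "a < 0" test.
MASK64 = (1 << 64) - 1  # 64-bit two's complement representation

def sign_bit(a: int) -> int:
    # Models: srli a0, a0, 63 on the 64-bit encoding of a
    return (a & MASK64) >> 63

for a in [0, 1, 2**63 - 1, -1, -42, -(2**63)]:
    assert (sign_bit(a) != 0) == (a < 0)
```

The generalization to bexti is the same idea: any single-bit test can be reduced to a shift (or bit-extract) feeding bnez/beqz.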
123 | 124 | 125 | Disjoint OR 126 | ----------- 127 | 128 | (From an older list of notes, may be stale.) 129 | 130 | * SelectAddrRegScale 131 | * the addw thing 132 | * SelectAddrRegImm (the large offset cases) 133 | * SelectAdd/RegReg 134 | * SelectShiftMask 135 | * In GEP AddrMatch 136 | 137 | 138 | -------------------------------------------------------------------------------- /llvm-riscv/Vectorization.rst: -------------------------------------------------------------------------------- 1 | ------------------------------------------------- 2 | Open Items in Vectorization for RISCV 3 | ------------------------------------------------- 4 | 5 | .. contents:: 6 | 7 | Loop Vectorizer 8 | ---------------- 9 | 10 | Loop Vectorization is fully implemented for both fixed and scalable vectors. It has been fully enabled in upstream LLVM for several months, and is mostly on par with other targets. The items mentioned below are mostly target specific enhancements - i.e. opportunities that aren't required for breakeven functionality. 11 | 12 | In terms of performance tuning, we're still in the early days. I've been fixing issues as I find them. Concrete bug reports for vector code quality are very welcome. 13 | 14 | Tail Folding 15 | ++++++++++++ 16 | 17 | For code size reasons, it is desirable to be able to fold the remainder loop into the main loop body. At the moment, we have two options for tail folding: mask predication and VL predication. I've been starting to look at the tradeoffs here, but this section is still highly preliminary and subject to change. 18 | 19 | Mask predication appears to work today. We'd need to enable the flag, but at least some loops would start folding immediately. There are some major profitability questions around doing so, particularly for short running loops which today would bypass the vector body entirely.
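For concreteness, the two folding styles can be modeled in scalar Python: mask predication always runs a full vector's worth of lanes and uses a lane-index compare to disable the excess, while VL predication shrinks the number of active lanes on the final iteration. (The VF of 4 and the sum reduction are illustration choices of mine, not anything the vectorizer mandates.)

```python
VF = 4  # illustrative vectorization factor

def masked_sum(a, n):
    # Mask predication: always process VF lanes; a compare masks the tail lanes.
    padded = a + [0] * ((-n) % VF)  # stand-in for "over-read is safe"
    total = 0
    for i in range(0, len(padded), VF):
        mask = [i + lane < n for lane in range(VF)]
        total += sum(v for v, m in zip(padded[i:i + VF], mask) if m)
    return total

def vl_sum(a, n):
    # VL predication: vsetvli-style, shrink the active vector length at the tail.
    total = 0
    i = 0
    while i < n:
        vl = min(VF, n - i)  # models the AVL granted by vsetvli
        total += sum(a[i:i + vl])
        i += vl
    return total

# Both strategies compute the same result for every trip count.
for n in range(1, 9):
    data = list(range(n))
    assert masked_sum(data, n) == vl_sum(data, n) == sum(data)
```

The hardware-cost question discussed next is exactly about which of these two equivalent formulations the machine executes more cheaply.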
20 | 21 | Talking with various hardware players, there appears to be a somewhat significant cost to using mask predication over VL predication. For several teams I've talked to, VSETVLI runs in the scalar domain whereas mask generation via vector compares runs in the vector domain. Particularly for small loops which might be vector bottlenecked, this means VL predication is preferable. 22 | 23 | For VL predication, we have two major options. We can either pattern match mask predication into VL predication in the backend, or we can upstream the work BSC has done on vectorizing using the VP intrinsics. I'm unclear on which approach is likely to work out best long term. 24 | 25 | Work on tail folding is currently being deferred until main loop vectorization is mature. 26 | 27 | Epilogue Tail Folding 28 | ===================== 29 | 30 | At the moment, my guess is that we're going to end up wanting *not* to tail fold the main loop (due to vsetvli resource limit concerns), and instead want to have a tail folded epilogue loop to run the tail in a single iteration. At the moment, there is no support in the vectorizer for tail folded epilogues and a significant amount of rework will be needed. 31 | 32 | One interesting point is that if we're (only) tail folding the epilogue loop, the relative importance of the predicate code quality drops significantly. This may influence the masking vs VL predication decision from a pure engineering investment perspective. 33 | 34 | We may end up with a different strategy for loops which are known short. There, vsetvli being a bottleneck is a lot less of a concern. Maybe we'll tail fold the main loop in that case. 35 | 36 | Tail Folding Gaps (via Masking) 37 | =============================== 38 | 39 | Tail folding appears to have a number of limitations which can be removed. 40 | 41 | * Some cases with predicate-dont-vectorize are vectorizing without predication. Bug.
42 | * Any use outside of loop appears to kill predication. Oddly, on examples I've tried, simply removing the bailout seems to generate correct code? 43 | * Stores appear to be tripping scalarization cost not masking cost which inhibits profitability. 44 | * Uniform Store. Basic issue is we need to implement last active lane extraction. Note active bits are a prefix and thus popcnt can be used to find index. No current plans to support general predication. 45 | 46 | Tail Folding via Speculation 47 | ============================ 48 | 49 | This is mostly just noting an idea. It occurs to me that if instructions in the loop are speculatable, we can "tail fold" via speculation. That is, we can simply run the loop over the extra iterations, and then discard the result of any spurious elements. 50 | 51 | .. code:: 52 | 53 | // a is aligned by 16 54 | for (int i = 0; i < N; i++) 55 | sum += a[i]; 56 | 57 | .. code:: 58 | 59 | // a is aligned by 16 60 | for (int i = 0; i < N+3; i += 4) { 61 | vtmp = a[i:i+3] // speculative load 62 | vtmp = select (splat(i) + step_vector < splat(N)), vtmp, 0 63 | vsum += vtmp 64 | } 65 | sum = reduce(vsum) 66 | 67 | 68 | The above example relies on alignment implying access beyond a can't fault. Note that this concept is *not* otherwise in LLVM's dereferenceable model, and is itself a fairly deep change. 69 | 70 | LoopVectorizer generating duplicate broadcast shuffles 71 | ++++++++++++++++++++++++++++++++++++++++++++++++++++++ 72 | 73 | This is being fixed by the backend, but we should probably tweak LV to avoid it anyways.
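Returning to the Tail Folding via Speculation idea above: the select-based masking can be modeled with plain scalar Python to confirm that the spurious speculative lanes never affect the reduction. The VF of 4 and the zero padding are illustration choices, standing in for the alignment-based "over-read can't fault" argument.

```python
VF = 4  # illustrative vectorization factor

def speculative_sum(a, n):
    # Padding stands in for the alignment argument: reading past n can't fault.
    padded = a + [0] * ((-n) % VF)
    vsum = [0] * VF
    for i in range(0, len(padded), VF):
        vtmp = padded[i:i + VF]                 # speculative load
        vtmp = [v if i + lane < n else 0        # select (splat(i)+step < splat(n))
                for lane, v in enumerate(vtmp)]
        vsum = [s + v for s, v in zip(vsum, vtmp)]
    return sum(vsum)                            # horizontal reduce after the loop

# The discarded lanes are zero, the identity for the add reduction, so the
# result matches the scalar loop for every trip count.
for n in range(1, 10):
    data = list(range(100, 100 + n))
    assert speculative_sum(data, n) == sum(data)
```

Note that zeroing the masked lanes only works because 0 is the identity of the reduction; a product reduction would need the lanes forced to 1 instead.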
74 | 75 | Duplicate IV for index vector 76 | +++++++++++++++++++++++++++++ 77 | 78 | In a test which simply writes “i” to every element of a vector, we’re currently generating: 79 | 80 | %vec.ind = phi <4 x i32> [ <i32 0, i32 1, i32 2, i32 3>, %vector.ph ], [ %vec.ind.next, %vector.body ] 81 | %step.add = add <4 x i32> %vec.ind, <i32 4, i32 4, i32 4, i32 4> 82 | … 83 | %vec.ind.next = add <4 x i32> %vec.ind, <i32 8, i32 8, i32 8, i32 8> 84 | %2 = icmp eq i64 %index.next, %n.vec 85 | br i1 %2, label %middle.block, label %vector.body, !llvm.loop !8 86 | 87 | And assembly: 88 | 89 | vadd.vi v9, v8, 4 90 | addi a5, a3, -16 91 | vse32.v v8, (a5) 92 | vse32.v v9, (a3) 93 | vadd.vi v8, v8, 8 94 | addi a4, a4, -8 95 | addi a3, a3, 32 96 | bnez a4, .LBB0_4 97 | beq a1, a2, .LBB0_8 98 | 99 | We can do better here by exploiting the implicit broadcast of scalar arguments. If we put the constant id vector into a vector register, and add the broadcasted scalar index we get the same result vector. 100 | 101 | 102 | Vectorization 103 | +++++++++++++ 104 | 105 | 106 | * Issues around epilogue vectorization w/VF > 16 (for fixed length vectors, i8 for VLEN >= 128, i16 for VLEN >= 256, etc..) 107 | * Initial target assumes scalar epilogue loop, return to folding/epilogue vectorization in future. 108 | 109 | 110 | Scalable Vectorizer Gaps 111 | ++++++++++++++++++++++++ 112 | 113 | Here is a punch list of known missing cases around scalable vectorization in the LoopVectorizer. These are mostly target independent. 114 | 115 | * Interleaving Groups. This one looks tricky as selects in IR require constants and the required shuffles for scalable can't currently be expressed as constants. This is likely going to need an IR change; details as yet unsettled. Current thinking has shifted towards just adding three more intrinsics and deferring shuffle definition change to some future point. Pending sync with ARM SVE folks. 116 | * General loop scalarization. For scalable vectors, we _can_ scalarize, but not via unrolling. Instead, we must generate a loop.
This can be done in the vectorizer itself (since it's a generic IR transform pass), but is not possible in SelectionDAG (which is not allowed to modify the CFG). Interacts both with div/rem and intrinsic costing. Initial patch for non-predicated scalarization is up as `D131118 `_ 117 | * Unsupported reduction operators. For reduction operations without instructions, we can handle these via a simple scalar reduction loop. This allows e.g. a product reduction to be done via the widening strategy, then reduced into the final result outside the loop. Only useful for out-of-loop reduction. (i.e. both options should be considered by the cost model) 118 | 119 | 120 | SLP Vectorization 121 | ----------------- 122 | 123 | As of 7f26c27e03f1b6b12a3450627934ee26256649cd (June 14, 2023), SLP vectorization is enabled by default for the RISCV target. 124 | 125 | The overall code quality still has a lot of room for improvement. All of the known major issues have been at least partially handled, but we've likely got quite a bit of iterative performance work ahead. In general, codegen tends to be most sensitive for short vectors (VL<4 or so). This is where the benefit of vectorization is small enough that minor deficiencies in vector codegen (or SLP costing) lead to unprofitable results.
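As a concrete illustration (a hypothetical example, not taken from any benchmark), the kind of short-vector code SLP targets is a straight-line run of isomorphic operations on adjacent elements; at VL=4 the vectorization win is small enough that a costing mistake flips it into a loss:

```c
#include <assert.h>

// Four isomorphic adds over adjacent lanes: a classic SLP seed. A
// vectorizing compiler may turn this into a single vector add plus
// vector loads/stores, if the cost model decides the packing overhead
// is worth it.
static void add4(int *d, const int *a, const int *b) {
    d[0] = a[0] + b[0];
    d[1] = a[1] + b[1];
    d[2] = a[2] + b[2];
    d[3] = a[3] + b[3];
}
```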
126 | 127 | 128 | -------------------------------------------------------------------------------- /llvm-riscv/memset.ll: -------------------------------------------------------------------------------- 1 | ; NOTE: Assertions have been autogenerated by utils/update_llc_test_checks.py UTC_ARGS: --version 2 2 | ; RUN: llc -mtriple=riscv64 -mattr=+v,+unaligned-scalar-mem < %s | FileCheck %s 3 | 4 | target datalayout = "e-m:e-p:64:64-i64:64-i128:128-n32:64-S128" 5 | target triple = "riscv64-unknown-linux-gnu" 6 | 7 | ; TODO: Reverse order of stores to assist the prefetcher - this case doesn't matter 8 | ; but imagine a loop zeroing 15 byte structures in an array (with something to 9 | ; prevent the compiler merging it into one memset) 10 | define void @memset_15(ptr %p) { 11 | ; CHECK-LABEL: memset_15: 12 | ; CHECK: # %bb.0: # %entry 13 | ; CHECK-NEXT: sd zero, 7(a0) 14 | ; CHECK-NEXT: sd zero, 0(a0) 15 | ; CHECK-NEXT: ret 16 | entry: 17 | tail call void @llvm.memset.p0.i64(ptr align 8 %p, i8 0, i64 15, i1 false) 18 | ret void 19 | } 20 | 21 | declare void @llvm.memset.p0.i64(ptr nocapture writeonly, i8, i64, i1 immarg) 22 | -------------------------------------------------------------------------------- /llvm-riscv/scalar-branch-opt.ll: -------------------------------------------------------------------------------- 1 | ; NOTE: Assertions have been autogenerated by utils/update_llc_test_checks.py UTC_ARGS: --version 2 2 | ; RUN: llc -mtriple=riscv64 -mattr=+v,+unaligned-scalar-mem < %s | FileCheck %s 3 | 4 | target datalayout = "e-m:e-p:64:64-i64:64-i128:128-n32:64-S128" 5 | target triple = "riscv64-unknown-linux-gnu" 6 | 7 | declare void @foo() 8 | 9 | ; TODO: Two problems here: 10 | ; 1) The blt should be reversed into a bge to avoid the add 11 | define void @test(i64 %a, i64 %b) { 12 | ; CHECK-LABEL: test: 13 | ; CHECK: # %bb.0: # %entry 14 | ; CHECK-NEXT: addi a0, a0, 1 15 | ; CHECK-NEXT: blt a0, a1, .LBB0_2 16 | ; CHECK-NEXT: # %bb.1: # %taken 17 | ; CHECK-NEXT: addi sp,
sp, -16 18 | ; CHECK-NEXT: .cfi_def_cfa_offset 16 19 | ; CHECK-NEXT: sd ra, 8(sp) # 8-byte Folded Spill 20 | ; CHECK-NEXT: .cfi_offset ra, -8 21 | ; CHECK-NEXT: call foo 22 | ; CHECK-NEXT: ld ra, 8(sp) # 8-byte Folded Reload 23 | ; CHECK-NEXT: addi sp, sp, 16 24 | ; CHECK-NEXT: .LBB0_2: # %exit 25 | ; CHECK-NEXT: ret 26 | entry: 27 | %add1 = add i64 %a, 1 28 | %cmp = icmp slt i64 %add1, %b 29 | br i1 %cmp, label %exit, label %taken 30 | taken: 31 | call void @foo() 32 | ret void 33 | exit: 34 | ret void 35 | } 36 | 37 | declare void @llvm.memset.p0.i64(ptr nocapture writeonly, i8, i64, i1 immarg) 38 | -------------------------------------------------------------------------------- /llvm-riscv/tsvc/16.x-march=rv64gcv_zvl128b/diff: -------------------------------------------------------------------------------- 1 | diff --git a/makefiles/Makefile.clang b/makefiles/Makefile.clang 2 | index cefe725..ad5374a 100644 3 | --- a/makefiles/Makefile.clang 4 | +++ b/makefiles/Makefile.clang 5 | @@ -3,7 +3,7 @@ CC=clang 6 | CXX=clang++ 7 | # no FC for clang 8 | FC= 9 | -flags = -O3 -fstrict-aliasing 10 | +flags = -O3 -fstrict-aliasing -march=rv64gcv_zvl128b 11 | vecflags = -fvectorize -fslp-vectorize-aggressive 12 | novecflags = -fno-vectorize 13 | omp_flags=-fopenmp=libomp 14 | -------------------------------------------------------------------------------- /llvm-riscv/tsvc/16.x-march=rv64gcv_zvl128b/note: -------------------------------------------------------------------------------- 1 | This is the installed clang which comes with the image, not built from source. 2 | Probably does not correspond to an official LLVM release. 
3 | 4 | $ clang -v 5 | Ubuntu clang version 16.0.6 (15) 6 | Target: riscv64-unknown-linux-gnu 7 | Thread model: posix 8 | InstalledDir: /usr/bin 9 | Found candidate GCC installation: /usr/bin/../lib/gcc/riscv64-linux-gnu/13 10 | Selected GCC installation: /usr/bin/../lib/gcc/riscv64-linux-gnu/13 11 | 12 | -------------------------------------------------------------------------------- /llvm-riscv/tsvc/16.x-march=rv64gcv_zvl128b/stdout: -------------------------------------------------------------------------------- 1 | $ ./bin/clang/tsvc_novec_relaxed 2 | Loop Time(sec) Checksum 3 | s000 24.868 512080000.000000 4 | s111 14.000 32000.410156 5 | s1111 39.267 16005.822266 6 | s112 37.820 84617.789062 7 | s1112 37.284 32001.644531 8 | s113 48.577 32000.644531 9 | s1113 24.819 32001.644531 10 | s114 28.692 919.861816 11 | s115 73.566 31745.953125 12 | s1115 33.773 0.065533 13 | s116 77.958 32000.000000 14 | s118 26.243 84488.507812 15 | s119 20.833 86339.023438 16 | s1119 21.025 119099.906250 17 | s121 37.476 32009.031250 18 | s122 12.875 196490.937500 19 | s123 20.734 32003.287109 20 | s124 30.327 32001.644531 21 | s125 13.618 131072.000000 22 | s126 13.973 66954.968750 23 | s127 23.699 32003.287109 24 | s128 23.499 80000.000000 25 | s131 62.456 32009.031250 26 | s132 45.264 65538.562500 27 | s141 30.424 32487076.000000 28 | s151 62.442 32009.031250 29 | s152 31.290 152207.218750 30 | s161 19.950 64002.054688 31 | s1161 38.641 64002.464844 32 | s162 17.650 32009.031250 33 | s171 12.963 196491.250000 34 | s172 12.618 196491.250000 35 | s173 64.259 32001.640625 36 | s174 65.075 32001.640625 37 | s175 12.546 32009.031250 38 | s176 13.711 32063.832031 39 | s211 30.445 63983.183594 40 | s212 28.877 132011.359375 41 | s1213 21.885 132022.765625 42 | s221 13.344 1543685120.000000 43 | s1221 13.521 630444416.000000 44 | s222 5.632 32000.000000 45 | s231 185.779 119099.906250 46 | s232 5.082 65536.000000 47 | s1232 132.972 33957.886719 48 | s233 467.290 504912.625000 49 | s2233 
160.342 337652.968750 50 | s235 386.811 160024.000000 51 | s241 65.452 64000.000000 52 | s242 5.374 1535952000.000000 53 | s243 39.094 810624.937500 54 | s244 32.806 70102.437500 55 | s1244 40.872 108496.453125 56 | s2244 21.775 70102.890625 57 | s251 98.395 32004.371094 58 | s1251 113.548 400005.968750 59 | s2251 30.482 2.635709 60 | s3251 42.370 12.595594 61 | s252 20.173 63999.000000 62 | s253 40.831 3200009984.000000 63 | s254 96.568 32000.000000 64 | s255 28.142 31968.515625 65 | s256 18.865 66207.695312 66 | s257 24.094 163072.000000 67 | s258 0.291 14.652789 68 | s261 27.822 260999.140625 69 | s271 122.050 689753.500000 70 | s272 24.412 64000.000000 71 | s273 39.335 99051.656250 72 | s274 40.812 3200195584.000000 73 | s275 37.317 65536.000000 74 | ^[[Bs2275 733.148 65536.000000 75 | s276 85.976 689753.500000 76 | s277 26.051 32000.000000 77 | s278 34.415 64012.593750 78 | s279 18.193 64014.289062 79 | s1279 24.541 64014.289062 80 | s2710 25.618 96003.289062 81 | s2711 121.925 689753.500000 82 | s2712 114.728 289751.625000 83 | s281 30.168 32000.000000 84 | s1281 177.255 inf 85 | s291 40.112 32000.000000 86 | s292 32.289 31968.515625 87 | s293 24.933 32000.000000 88 | s2101 14.601 1704092.000000 89 | s2102 19.100 256.000000 90 | s2111 27.237 34545052.000000 91 | s311 80.968 10.950724 92 | s31111 9.147 10.950724 93 | s312 81.027 1.030518 94 | s313 55.261 1.644725 95 | s314 40.520 1.000000 96 | s315 24.107 54857.000000 97 | s316 40.529 0.000031 98 | s317 22.371 0.000000 99 | s318 16.219 32002.000000 100 | s319 69.465 43.803417 101 | s3110 21.015 514.000000 102 | s13110 21.051 514.000000 103 | s3111 8.078 10.950721 104 | s3112 8.406 1.644725 105 | s3113 64.479 2.000000 106 | s321 12.553 32000.000000 107 | s322 11.734 32000.000000 108 | s323 18.245 146484.968750 109 | s331 24.207 32000.000000 110 | s332 22.224 -1.000000 111 | s341 30.675 10.950724 112 | s342 32.522 10.950724 113 | s343 22.067 1567.833496 114 | s351 90.539 25600002048.000000 115 | s1351 99.513 
32010.253906 116 | s352 88.851 1.644841 117 | s353 18.155 3200002816.000000 118 | s421 50.028 32009.031250 119 | s1421 51.028 16000.000000 120 | s422 97.994 257.701416 121 | s423 87.295 439.690308 122 | s424 51.495 822.360596 123 | s431 129.902 1674247.125000 124 | s441 38.154 196491.250000 125 | s442 12.370 114240.132812 126 | s443 46.044 361015.906250 127 | s451 27.748 32009.281250 128 | s452 57.656 32512.017578 129 | s453 24.233 21.901447 130 | s471 17.627 64004.933594 131 | s481 33.222 196491.250000 132 | s482 35.086 196491.250000 133 | s491 32.998 32001.644531 134 | s4112 19.971 1127134.250000 135 | s4113 24.355 32001.644531 136 | s4114 27.426 32000.000000 137 | s4115 17.584 1.038636 138 | s4116 12.908 0.753265 139 | s4117 22.531 32002.207031 140 | s4121 17.181 196491.250000 141 | va 25.050 1.644884 142 | vag 29.851 1.644884 143 | vas 31.427 1.644885 144 | vif 24.318 1.644884 145 | vpv 126.368 1642411.250000 146 | vtv 126.337 32000.000000 147 | vpvtv 68.677 689753.500000 148 | vpvts 14.258 175255268098048.000000 149 | vpvpv 83.829 1.644884 150 | vtvtv 86.921 32000.000000 151 | vsumr 80.816 10.950721 152 | vdotr 115.374 1.644725 153 | vbor 9.867 31924.050781 154 | $ ./bin/clang/tsvc_vec_relaxed 155 | Loop Time(sec) Checksum 156 | s000 6.339 512080000.000000 157 | s111 14.772 32000.410156 158 | s1111 10.951 16005.822266 159 | s112 38.180 84617.789062 160 | s1112 17.202 32001.644531 161 | s113 11.256 32000.644531 162 | s1113 24.689 32001.644531 163 | s114 29.600 919.861816 164 | s115 73.029 31745.953125 165 | s1115 77.778 0.065533 166 | s116 77.896 32000.000000 167 | s118 38.387 85045.453125 168 | s119 7.929 86339.023438 169 | s1119 5.677 119099.906250 170 | s121 12.375 32009.031250 171 | s122 13.183 196490.937500 172 | s123 20.782 32003.287109 173 | s124 8.908 32001.644531 174 | s125 8.458 131072.000000 175 | s126 19.356 66954.968750 176 | s127 15.559 32003.287109 177 | s128 19.111 80000.000000 178 | s131 20.579 32009.031250 179 | s132 11.579 65538.562500 180 | 
s141 32.791 32487076.000000 181 | s151 20.605 32009.031250 182 | s152 22.385 152207.218750 183 | s161 19.577 64002.054688 184 | s1161 38.721 64002.464844 185 | s162 5.621 32009.031250 186 | s171 3.846 196491.250000 187 | s172 13.175 196491.250000 188 | s173 19.401 32001.640625 189 | s174 20.197 32001.640625 190 | s175 4.055 32009.031250 191 | s176 4.145 32063.832031 192 | s211 29.282 63983.183594 193 | s212 27.724 132011.359375 194 | s1213 21.004 132022.765625 195 | s221 13.451 1543685120.000000 196 | s1221 4.220 630444416.000000 197 | s222 5.994 32000.000000 198 | s231 172.146 119099.906250 199 | s232 5.453 65536.000000 200 | s1232 239.784 33957.886719 201 | s233 761.972 504912.625000 202 | s2233 172.588 337652.968750 203 | s235 348.269 160024.000000 204 | s241 65.333 64000.000000 205 | s242 5.359 1535952000.000000 206 | s243 14.757 810624.937500 207 | s244 33.734 70102.437500 208 | s1244 40.874 108496.453125 209 | s2244 6.938 70102.890625 210 | s251 26.655 32004.371094 211 | s1251 58.867 400005.968750 212 | s2251 28.525 2.635709 213 | s3251 23.103 12.595594 214 | s252 4.980 63999.000000 215 | s253 8.506 3200009984.000000 216 | s254 17.233 32000.000000 217 | s255 6.603 31968.515625 218 | s256 17.153 66207.695312 219 | s257 29.489 163072.000000 220 | s258 0.306 14.652789 221 | s261 28.551 260999.140625 222 | s271 22.389 689753.500000 223 | s272 10.223 64000.000000 224 | s273 14.521 99051.656250 225 | s274 24.234 3200195584.000000 226 | s275 40.462 65536.000000 227 | s2275 950.944 65536.000000 228 | s276 44.881 689753.500000 229 | s277 26.121 32000.000000 230 | s278 14.772 64012.593750 231 | s279 9.305 64014.289062 232 | s1279 9.875 64014.289062 233 | s2710 6.472 96003.289062 234 | s2711 22.334 689753.500000 235 | s2712 22.788 289751.625000 236 | s281 29.930 32000.000000 237 | s1281 59.478 inf 238 | s291 48.661 32000.000000 239 | s292 32.105 31968.515625 240 | s293 24.394 32000.000000 241 | s2101 12.984 1704092.000000 242 | s2102 18.689 256.000000 243 | s2111 27.159 
34545052.000000 244 | s311 16.638 10.950724 245 | s31111 9.136 10.950724 246 | s312 23.829 1.030928 247 | s313 15.736 1.644884 248 | s314 8.317 1.000000 249 | s315 23.913 54857.000000 250 | s316 9.389 0.000031 251 | s317 2.518 0.000000 252 | s318 16.178 32002.000000 253 | s319 17.169 43.802910 254 | s3110 20.935 514.000000 255 | s13110 20.938 514.000000 256 | s3111 0.951 10.950724 257 | s3112 11.422 1.644725 258 | s3113 7.598 2.000000 259 | s321 12.994 32000.000000 260 | s322 12.280 32000.000000 261 | s323 15.301 146484.968750 262 | s331 24.224 32000.000000 263 | s332 22.059 -1.000000 264 | s341 30.575 10.950724 265 | s342 32.405 10.950724 266 | s343 17.231 1567.833496 267 | s351 112.157 25600002048.000000 268 | s1351 37.226 32010.253906 269 | s352 85.448 1.644891 270 | s353 20.354 3200002816.000000 271 | s421 16.477 32009.031250 272 | s1421 15.744 16000.000000 273 | s422 30.633 257.701416 274 | s423 17.927 439.690308 275 | s424 22.340 822.360596 276 | s431 38.771 1674247.125000 277 | s441 11.336 196491.250000 278 | s442 14.889 114240.132812 279 | s443 13.666 361015.906250 280 | s451 27.723 32009.281250 281 | s452 21.574 32512.017578 282 | s453 5.439 21.901447 283 | s471 14.955 64004.933594 284 | s481 32.899 196491.250000 285 | s482 35.081 196491.250000 286 | s491 22.081 32001.644531 287 | s4112 9.426 1127134.250000 288 | s4113 12.692 32001.644531 289 | s4114 22.597 32000.000000 290 | s4115 8.154 1.038800 291 | s4116 5.966 0.753265 292 | s4117 10.469 32002.207031 293 | s4121 5.419 196491.250000 294 | va 25.687 1.644884 295 | vag 15.519 1.644884 296 | vas 19.909 1.644885 297 | vif 2.962 1.644884 298 | vpv 38.498 1642411.250000 299 | vtv 38.491 32000.000000 300 | vpvtv 21.706 689753.500000 301 | vpvts 4.015 175255268098048.000000 302 | vpvpv 21.415 1.644884 303 | vtvtv 21.698 32000.000000 304 | vsumr 16.744 10.950724 305 | vdotr 31.524 1.644884 306 | vbor 1.559 31924.050781 307 | -------------------------------------------------------------------------------- 
/llvm-riscv/tsvc/17.x-march=rv64gcv_zvl128b/stdout: -------------------------------------------------------------------------------- 1 | $ PATH=~/llvm/17.x/bin/:$PATH ./run.sh | tee raw.out 2 | ++ set -e 3 | ++ clang -v 4 | clang version 17.0.6 (https://github.com/llvm/llvm-project.git 6009708b4367171ccdbf4b5905cb6a803753fe18) 5 | Target: riscv64-unknown-linux-gnu 6 | Thread model: posix 7 | InstalledDir: /home/preames/llvm/17.x/bin 8 | Found candidate GCC installation: /usr/lib/gcc/riscv64-linux-gnu/13 9 | Selected GCC installation: /usr/lib/gcc/riscv64-linux-gnu/13 10 | ++ git diff 11 | diff --git a/makefiles/Makefile.clang b/makefiles/Makefile.clang 12 | index cefe725..ad5374a 100644 13 | --- a/makefiles/Makefile.clang 14 | +++ b/makefiles/Makefile.clang 15 | @@ -3,7 +3,7 @@ CC=clang 16 | CXX=clang++ 17 | # no FC for clang 18 | FC= 19 | -flags = -O3 -fstrict-aliasing 20 | +flags = -O3 -fstrict-aliasing -march=rv64gcv_zvl128b 21 | vecflags = -fvectorize -fslp-vectorize-aggressive 22 | novecflags = -fno-vectorize 23 | omp_flags=-fopenmp=libomp 24 | ++ make COMPILER=clang clean 25 | make -C ./src COMPILER=clang clean 26 | make[1]: Entering directory '/home/preames/benchmark/TSVC_2/src' 27 | rm -f *.o *.s 28 | make[1]: Leaving directory '/home/preames/benchmark/TSVC_2/src' 29 | ++ make COMPILER=clang 30 | make[1]: Entering directory '/home/preames/benchmark/TSVC_2/src' 31 | clang -O3 -fstrict-aliasing -march=rv64gcv_zvl128b -fvectorize -fslp-vectorize-aggressive -c -o tsvc_vec.o tsvc.c 32 | clang: warning: the flag '-fslp-vectorize-aggressive' has been deprecated and will be ignored [-Wunused-command-line-argument] 33 | clang -O3 -fstrict-aliasing -march=rv64gcv_zvl128b -fvectorize -fslp-vectorize-aggressive -c -o dummy.o dummy.c 34 | clang: warning: the flag '-fslp-vectorize-aggressive' has been deprecated and will be ignored [-Wunused-command-line-argument] 35 | clang -O3 -fstrict-aliasing -march=rv64gcv_zvl128b -fvectorize -fslp-vectorize-aggressive -c -o 
common.o common.c 36 | clang: warning: the flag '-fslp-vectorize-aggressive' has been deprecated and will be ignored [-Wunused-command-line-argument] 37 | clang tsvc_vec.o dummy.o common.o -lm -o ../bin/clang/tsvc_vec_default 38 | clang -O3 -fstrict-aliasing -march=rv64gcv_zvl128b -fno-vectorize -c -o tsvc_novec.o tsvc.c 39 | clang tsvc_novec.o dummy.o common.o -lm -o ../bin/clang/tsvc_novec_default 40 | rm common.o tsvc_vec.o dummy.o tsvc_novec.o 41 | make[1]: Leaving directory '/home/preames/benchmark/TSVC_2/src' 42 | make[1]: Entering directory '/home/preames/benchmark/TSVC_2/src' 43 | clang -O3 -fstrict-aliasing -march=rv64gcv_zvl128b -ffast-math -fvectorize -fslp-vectorize-aggressive -c -o tsvc_vec.o tsvc.c 44 | clang: warning: the flag '-fslp-vectorize-aggressive' has been deprecated and will be ignored [-Wunused-command-line-argument] 45 | clang -O3 -fstrict-aliasing -march=rv64gcv_zvl128b -ffast-math -fvectorize -fslp-vectorize-aggressive -c -o dummy.o dummy.c 46 | clang: warning: the flag '-fslp-vectorize-aggressive' has been deprecated and will be ignored [-Wunused-command-line-argument] 47 | clang -O3 -fstrict-aliasing -march=rv64gcv_zvl128b -ffast-math -fvectorize -fslp-vectorize-aggressive -c -o common.o common.c 48 | clang: warning: the flag '-fslp-vectorize-aggressive' has been deprecated and will be ignored [-Wunused-command-line-argument] 49 | clang tsvc_vec.o dummy.o common.o -lm -o ../bin/clang/tsvc_vec_relaxed 50 | clang -O3 -fstrict-aliasing -march=rv64gcv_zvl128b -ffast-math -fno-vectorize -c -o tsvc_novec.o tsvc.c 51 | clang tsvc_novec.o dummy.o common.o -lm -o ../bin/clang/tsvc_novec_relaxed 52 | rm common.o tsvc_vec.o dummy.o tsvc_novec.o 53 | make[1]: Leaving directory '/home/preames/benchmark/TSVC_2/src' 54 | make[1]: Entering directory '/home/preames/benchmark/TSVC_2/src' 55 | /home/preames/benchmark/TSVC_2/makefiles/Makefile.clang:21: No 'precise' math flags for clang! 
56 | clang -O3 -fstrict-aliasing -march=rv64gcv_zvl128b -fvectorize -fslp-vectorize-aggressive -c -o tsvc_vec.o tsvc.c 57 | clang: warning: the flag '-fslp-vectorize-aggressive' has been deprecated and will be ignored [-Wunused-command-line-argument] 58 | clang -O3 -fstrict-aliasing -march=rv64gcv_zvl128b -fvectorize -fslp-vectorize-aggressive -c -o dummy.o dummy.c 59 | clang: warning: the flag '-fslp-vectorize-aggressive' has been deprecated and will be ignored [-Wunused-command-line-argument] 60 | clang -O3 -fstrict-aliasing -march=rv64gcv_zvl128b -fvectorize -fslp-vectorize-aggressive -c -o common.o common.c 61 | clang: warning: the flag '-fslp-vectorize-aggressive' has been deprecated and will be ignored [-Wunused-command-line-argument] 62 | clang tsvc_vec.o dummy.o common.o -lm -o ../bin/clang/tsvc_vec_precise 63 | clang -O3 -fstrict-aliasing -march=rv64gcv_zvl128b -fno-vectorize -c -o tsvc_novec.o tsvc.c 64 | clang tsvc_novec.o dummy.o common.o -lm -o ../bin/clang/tsvc_novec_precise 65 | rm common.o tsvc_vec.o dummy.o tsvc_novec.o 66 | make[1]: Leaving directory '/home/preames/benchmark/TSVC_2/src' 67 | ++ ./bin/clang/tsvc_novec_relaxed 68 | (seeming hang here) 69 | ^C 70 | $ ./bin/clang/tsvc_vec_relaxed 71 | Loop Time(sec) Checksum 72 | s000 5.602 512080000.000000 73 | s111 17.220 32000.410156 74 | s1111 9.820 16005.822266 75 | s112 38.008 84617.781250 76 | s1112 16.379 32001.644531 77 | s113 10.646 32000.644531 78 | s1113 24.405 32001.644531 79 | s114 29.867 919.861389 80 | s115 72.877 31745.953125 81 | s1115 84.110 0.065533 82 | s116 109.654 32000.000000 83 | s118 47.019 85045.453125 84 | s119 7.977 86338.992188 85 | s1119 5.458 119099.882812 86 | s121 12.238 32009.031250 87 | s122 12.861 196490.921875 88 | s123 20.708 32003.287109 89 | s124 10.069 32001.644531 90 | s125 7.529 131072.000000 91 | s126 16.304 66954.960938 92 | s127 11.909 32003.287109 93 | s128 28.782 80000.000000 94 | s131 20.485 32009.031250 95 | s132 11.569 65538.562500 96 | s141 34.534 
32487076.000000 97 | s151 20.449 32009.031250 98 | s152 9.227 152207.218750 99 | s161 20.115 64002.054688 100 | s1161 38.637 64002.464844 101 | s162 6.476 32009.031250 102 | s171 4.017 196491.250000 103 | s172 12.634 196491.250000 104 | s173 23.830 32001.640625 105 | s174 24.915 32001.640625 106 | s175 4.079 32009.031250 107 | s176 3.998 32063.832031 108 | s211 32.163 63983.183594 109 | s212 27.900 132011.375000 110 | s1213 21.340 132022.765625 111 | s221 14.190 1543685120.000000 112 | s1221 4.356 630444416.000000 113 | s222 5.587 32000.000000 114 | s231 166.512 119099.882812 115 | s232 4.983 65536.000000 116 | s1232 214.019 33958.011719 117 | s233 655.004 504912.625000 118 | s2233 168.476 337652.906250 119 | s235 343.114 160023.984375 120 | s241 65.050 64000.000000 121 | s242 5.361 1535952000.000000 122 | s243 12.132 810624.937500 123 | s244 32.335 70102.437500 124 | s1244 41.270 108496.453125 125 | s2244 10.868 70102.890625 126 | s251 29.778 32004.371094 127 | s1251 48.582 400005.968750 128 | s2251 30.529 2.635709 129 | s3251 36.627 12.595594 130 | s252 5.307 63999.000000 131 | s253 8.936 3200010240.000000 132 | s254 14.119 32000.000000 133 | s255 5.238 31968.515625 134 | s256 16.912 66207.703125 135 | s257 34.863 163072.000000 136 | s258 0.290 14.652790 137 | s261 27.673 260999.140625 138 | s271 25.851 689753.562500 139 | s272 9.199 64000.000000 140 | s273 12.133 99051.656250 141 | s274 10.936 3200195328.000000 142 | s275 29.940 65536.000000 143 | s2275 673.580 65536.000000 144 | s276 42.638 689753.562500 145 | s277 26.102 32000.000000 146 | s278 12.404 64012.593750 147 | s279 7.634 64014.289062 148 | s1279 9.257 64014.289062 149 | s2710 4.762 96003.289062 150 | s2711 25.793 689753.562500 151 | s2712 25.851 289751.656250 152 | s281 30.143 32000.000000 153 | s1281 48.696 inf 154 | s291 40.253 32000.000000 155 | s292 32.151 31968.515625 156 | s293 24.644 32000.000000 157 | s2101 13.391 1704092.000000 158 | s2102 18.379 256.000000 159 | s2111 27.172 34545052.000000 
160 | s311 15.612 10.950724 161 | s31111 9.142 10.950724 162 | s312 21.593 1.030958 163 | s313 15.516 1.644884 164 | s314 7.816 1.000000 165 | s315 24.030 54857.000000 166 | s316 8.394 0.000031 167 | s317 1.270 0.000000 168 | s318 16.184 32002.000000 169 | s319 18.855 43.802910 170 | s3110 20.982 514.000000 171 | s13110 20.984 514.000000 172 | s3111 0.900 10.950724 173 | s3112 8.425 1.644725 174 | s3113 7.174 2.000000 175 | s321 12.543 32000.000000 176 | s322 11.702 32000.000000 177 | s323 18.941 146484.968750 178 | s331 24.362 32000.000000 179 | s332 22.108 -1.000000 180 | s341 30.289 10.950724 181 | s342 32.482 10.950724 182 | s343 17.078 1567.833496 183 | s351 84.301 25600002048.000000 184 | s1351 42.303 32010.253906 185 | s352 54.830 1.644891 186 | s353 17.033 3200002816.000000 187 | s421 17.432 32009.031250 188 | s1421 21.219 16000.000000 189 | s422 33.683 257.701416 190 | s423 18.757 439.690308 191 | s424 20.245 822.360596 192 | s431 40.825 1674247.125000 193 | s441 9.611 196491.250000 194 | s442 11.936 114240.132812 195 | ^[[B s443 15.799 361015.906250 196 | s451 28.127 32009.281250 197 | s452 21.107 32512.015625 198 | s453 6.317 21.901447 199 | s471 5.761 64004.933594 200 | s481 33.085 196491.250000 201 | s482 35.027 196491.250000 202 | s491 14.245 32001.644531 203 | s4112 9.376 1127134.250000 204 | s4113 12.616 32001.644531 205 | s4114 10.998 32000.000000 206 | s4115 7.811 1.038800 207 | s4116 5.448 0.753265 208 | s4117 8.916 32002.207031 209 | s4121 6.455 196491.250000 210 | va 32.810 1.644884 211 | vag 14.561 1.644884 212 | vas 16.406 1.644885 213 | vif 3.162 1.644884 214 | vpv 40.243 1642411.250000 215 | vtv 40.217 32000.000000 216 | vpvtv 25.801 689753.562500 217 | vpvts 3.928 175255268098048.000000 218 | vpvpv 25.882 1.644884 219 | vtvtv 25.869 32000.000000 220 | vsumr 15.633 10.950724 221 | vdotr 31.068 1.644884 222 | vbor 0.984 31924.050781 223 | -------------------------------------------------------------------------------- /llvm-riscv/tsvc/note: 
-------------------------------------------------------------------------------- 1 | This set of directories contains the results of running TSVC_2 on a BP3 2 | dev board. The particular data sets should be reasonably self-describing 3 | in terms of how they were run. 4 | -------------------------------------------------------------------------------- /llvm-shuffles.rst: -------------------------------------------------------------------------------- 1 | -------------------------- 2 | LLVM Shuffles by Example 3 | -------------------------- 4 | 5 | This document is an overview of the various common shuffle operations in LLVM's backend lowering. I'm currently working on improving the RISC-V backend's handling of fixed length shuffles, and this is being written to help me organize my thoughts. 6 | 7 | .. contents:: 8 | 9 | Broadcast Variants 10 | ------------------ 11 | 12 | A general broadcast takes a single vector element (any vector element), and repeats it across all lanes of the containing vector. Particularly useful forms of broadcast involve broadcasting the element at lane 0, and broadcasting a scalar element to all lanes of the vector type. In code, we often commingle the naming of these, so you sometimes have to pay attention to figure out which variant is being discussed. 13 | 14 | Examples: 15 | 16 | ..
code:: 17 | 18 | ;; Broadcast lane 0 to all lanes 19 | ;; ------------------------------ 20 | 21 | ;; Fixed vector 22 | shufflevector <4 x i32> %vec, <4 x i32> undef, <4 x i32> zeroinitializer 23 | 24 | ;; Scalable vector 25 | shufflevector <vscale x 4 x i32> %vec, <vscale x 4 x i32> undef, <vscale x 4 x i32> zeroinitializer 26 | 27 | ;; Broadcast a scalar to all lanes 28 | ;; ------------------------------- 29 | 30 | ;; Fixed vector 31 | %vec = insertelement <4 x i32> undef, i32 %elem, i64 0 32 | shufflevector <4 x i32> %vec, <4 x i32> undef, <4 x i32> zeroinitializer 33 | 34 | ;; Scalable vector 35 | %vec = insertelement <vscale x 4 x i32> undef, i32 %elem, i64 0 36 | shufflevector <vscale x 4 x i32> %vec, <vscale x 4 x i32> undef, <vscale x 4 x i32> zeroinitializer 37 | 38 | ;; Broadcast the value in lane 1 to all lanes 39 | ;; ------------------------------------------- 40 | 41 | ;; Fixed vector 42 | shufflevector <4 x i32> %vec, <4 x i32> undef, <4 x i32> <i32 1, i32 1, i32 1, i32 1> 43 | 44 | ;; Scalable vector 45 | ;; Not cleanly representable 46 | 47 | 48 | In TargetTransformInfo, `SK_Broadcast` specifically refers to a lane 0 broadcast (possibly of a scalar). The generic any-lane broadcast becomes an `SK_PermuteSingleSrc`. 49 | 50 | 51 | Single Source Permutes 52 | ---------------------- 53 | 54 | A single source permute is a shuffle where all output lanes come from one of the two input vectors. It represents a permutation of exactly one of the input vectors. A permute does not change the length of the vector. 55 | 56 | In TargetTransformInfo, `SK_PermuteSingleSrc` models this case. There are multiple sub-categories within single source permutes where better lowerings are available. See also interleave, deinterleave, select, broadcast, and reverse. 57 | 58 | 59 | Two Source Permutes 60 | ------------------- 61 | 62 | A two source permute is a shuffle where the output length is equal to the length of each of the input vectors. Conceptually, there are two (equivalent) common mental models.
The first is that we perform a single source permute on the result of concatenating the two source vectors and then extract the leading sub-vector. The second is that we perform one permute on each source vector, and then merge the results with a vector select. 63 | 64 | 65 | In TargetTransformInfo, `SK_PermuteTwoSrc` models this case. It is the fallback for when nothing more specific can be identified. 66 | 67 | 68 | Others (Updates Pending) 69 | ------------------------- 70 | 71 | .. code:: 72 | 73 | SK_Reverse, ///< Reverse the order of the vector. 74 | SK_Select, ///< Selects elements from the corresponding lane of 75 | ///< either source operand. This is equivalent to a 76 | ///< vector select with a constant condition operand. 77 | SK_Transpose, ///< Transpose two vectors. 78 | SK_InsertSubvector, ///< InsertSubvector. Index indicates start offset. 79 | SK_ExtractSubvector, ///< ExtractSubvector Index indicates start offset. 80 | SK_Splice ///< Concatenates elements from the first input vector 81 | ///< with elements of the second input vector. Returning 82 | ///< a vector of the same type as the input vectors. 83 | ///< Index indicates start offset in first input vector. 84 | -------------------------------------------------------------------------------- /observable-allocations.rst: -------------------------------------------------------------------------------- 1 | ------------------------------------------------- 2 | Observable Allocations 3 | ------------------------------------------------- 4 | 5 | 6 | .. contents:: 7 | 8 | Examples 9 | ======== 10 | 11 | .. code:: 12 | 13 | free(malloc(8)); 14 | ==> 15 | nop 16 | 17 | .. code:: 18 | 19 | malloc(0); 20 | ==> 21 | nullptr 22 | 23 | .. code:: 24 | 25 | free(realloc(o, N)) 26 | ==> 27 | free(o); 28 | 29 | .. code:: 30 | 31 | o1 = realloc(o, 0) 32 | // (if o1's address is not captured) 33 | ==> 34 | free(o); 35 | o1 = nullptr; 36 | 37 | .. 
code:: 38 | 39 | free(new int()) 40 | ==> 41 | unreachable 42 | 43 | .. code:: 44 | 45 | o = malloc(8); 46 | lifetime_end(o+4, 4) 47 | ==> 48 | o = malloc(4) 49 | 50 | .. code:: 51 | 52 | o = malloc(8); 53 | lifetime_end(o, 4) 54 | ==> 55 | o = malloc(4) - 4 56 | 57 | .. code:: 58 | 59 | if (o != nullptr) free(o) 60 | ==> 61 | free(o) 62 | 63 | Other Allocation Properties 64 | =========================== 65 | 66 | Nullability 67 | ----------- 68 | 69 | .. code:: 70 | 71 | allocate(N) == nullptr? 72 | 73 | Zero Sized Allocations 74 | ---------------------- 75 | 76 | .. code:: 77 | 78 | allocate(0) == allocate(0)? 79 | 80 | Is the allocator required to return distinct addresses? Or guaranteed to fail via exception or error-val on a zero sized allocation? 81 | 82 | Object Crossing Compares 83 | ------------------------ 84 | 85 | .. code:: 86 | 87 | allocate(N) + N == allocate(M)? 88 | 89 | Is this well defined? 90 | 91 | Distinct Heap/Stack/Globals 92 | --------------------------- 93 | 94 | If the prior example is well defined, are there limits on which objects can be compared? 95 | 96 | .. code:: 97 | 98 | allocate(N) == &global_var? 99 | allocate(N) == &stack_var? 100 | 101 | Which of these are well defined? 102 | 103 | Guessed Pointers 104 | ---------------- 105 | 106 | .. code:: 107 | 108 | allocate(N) == cast(0xmyconstant)? 109 | 110 | Is this well defined? 111 | -------------------------------------------------------------------------------- /optimization-tricks.txt: -------------------------------------------------------------------------------- 1 | A collection of cute optimization tricks, mostly things I haven't had time to follow up on...
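Relatedly, the observable-allocations rewrite above that drops the null check before ``free`` is justified by the C standard's guarantee that ``free(NULL)`` is a no-op, so the guarded and unguarded forms are indistinguishable. A quick check (hypothetical helper name):

```c
#include <stdlib.h>

// free(NULL) is defined by the C standard to do nothing, which is what
// makes
//   if (o != nullptr) free(o)  ==>  free(o)
// a valid rewrite: no program can observe the difference.
static int null_check_elided(void) {
    int *o = NULL;
    free(o);        // no-op by definition
    o = malloc(8);  // may be NULL on failure...
    free(o);        // ...and free handles that case too
    return 1;
}
```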
2 | 3 | Loads from constant globals with patterns in data 4 | (Note: LLVM currently handles some cases w/compares, but that's all I can find) 5 | a = constant global [1,3,5,7,9] 6 | a[i] -> 1 + i * 2 7 | (and other assorted idioms) 8 | 9 | Use masked loads w/pass through to eliminate selects of addresses 10 | a = select c, a1, a2 11 | v = load a 12 | -> 13 | v1 = masked.load(c, a1, undef) 14 | v = masked.load(!c, a2, v1) 15 | (can do better if either are safe to speculate) 16 | 17 | trivial specialization (no cost dispatch) 18 | foo(..., bool b) { if (b) C; else D; } 19 | -> 20 | foo(...) { if(b) tail C(); else tail D(); } 21 | (potentially very interesting w/early return idioms) 22 | 23 | 24 | Tricks for encoding long jumps with fewer bytes using implicit fault handlers 25 | LongJCC: # 6 bytes 26 | jle baz 27 | 28 | ShortJCC: # 2 bytes 29 | jle bar 30 | 31 | forward_jump_int3: # 3 bytes 32 | jle continue 33 | int3 # bad to have in instruction stream? 34 | continue: 35 | 36 | forward_jump_ud2: # 4 bytes 37 | jle continue2 38 | ud2 # bad to have in instruction stream? 39 | continue2: 40 | 41 | jcc_payload: # 4 bytes 42 | test %rax, %rcx 43 | .byte 116 #0x74, je 44 | fault: 45 | .byte 204 #0xcc, int3 as payload of je 46 | jle fault #backbranch into previous jcc 47 | 48 | and_payload: # 4 bytes 49 | test %rax, %rcx 50 | .byte 4 #0x04, add AL, 204 51 | fault2: 52 | .byte 204 #0xcc, int3 as payload of add 53 | jle fault2 #backbranch into previous add 54 | 55 | bar: 56 | retq 57 | .section ".other" 58 | baz: 59 | retq 60 | 61 | -------------------------------------------------------------------------------- /pointer-provenance.rst: -------------------------------------------------------------------------------- 1 | ------------------------------------------------- 2 | Pointer Provenance in LLVM 3 | ------------------------------------------------- 4 | 5 | **THIS IS AN IN PROGRESS DOCUMENT. It hasn't yet been "published".
At the moment, this is KNOWN to not work.** 6 | 7 | This write-up is an attempt to wrap my head around the recent byte type discussion on llvm-dev, prompted by `nhaehnle's blog post `_. After working through the problem myself, I stumbled across the same CSE issue he mentions. I haven't yet understood why the CSE problem is restricted to integer types, but maybe I'd get there with time. I'm out of time for the moment. 8 | 9 | .. contents:: 10 | 11 | I'm going to start by stating a few critical tidbits as I think there's been some confusion about this in the discussion thread. 12 | 13 | Memory is not typed (well mostly) 14 | --------------------------------- 15 | 16 | Today, LLVM's memory is not typed. You can store one type, and load back the same bits with another type. You can use mismatched load and store widths. 17 | 18 | However, this is *not* the same as saying memory holds only the raw bits of the value. There are a couple of corner cases worth talking through. 19 | 20 | First, we model uninitialized memory as containing a special ``undef`` value. This value is a special placeholder which allows the compiler to choose any value it wishes independently for each use. 21 | 22 | Second, we allow ``poison`` to propagate through memory. Poison is used to model the result of an undefined operation (e.g. an add which was asserted not to overflow, but did in fact overflow) which has not yet reached a point which triggers undefined behavior. The definition of poison carefully balances allowing operations to be hoisted with the complexity of the semantics. We've only recently gotten ``poison`` into a somewhat solid shape. 23 | 24 | The key bit about both is that the rules for poison and undef propagation and the dynamic semantics of LLVM IR instructions with regards to each are chosen carefully such that "forgetting" memory contains ``undef`` or ``poison`` is entirely correct. 
Or to say it differently, converting (intentionally or unintentionally) either to a specific concrete value is a refining transformation. 25 | 26 | Aside: For garbage collection, we added the concept of a non-integral pointer type. The original intent was *not* to require typed memory, but in practice, we have had to essentially assume this. For a collector which needs to instrument pointer loads (but *not* integer loads), having the optimizer break memory typing would be incorrect. If anything, this provides indirect evidence that memory is not typed, as otherwise non-integral pointers would have never been needed. 27 | 28 | Aliasing vs Pointer Equality 29 | ---------------------------- 30 | 31 | The C/C++ rules for what is legal with an out-of-bounds pointer are complicated. This is relevant because LLVM IR needs to correctly model the semantics of these languages. Note that it does *not* mean that LLVM's semantics must exactly follow the C/C++ semantics, merely that there must be a reasonable translation from one to the other. 32 | 33 | The key detail which is relevant here is that pointer *values* can be legally compared and have well defined semantics while *accessing the memory pointed to* may be fully undefined. 34 | 35 | This comes up in the context of alias analysis because it's possible to have two pointers which are equal, but don't alias. The classic example would be: 36 | 37 | .. code:: 38 | 39 | %v1 = malloc(8) 40 | %v2 = realloc(%v1, 8) 41 | %v1 and %v2 *may* be equal here, but are guaranteed no-alias. 42 | 43 | Aside: If the runtime allows multi-mapping of memory pages, it's also possible to have two pointers which are unequal, but must alias. This isn't well modeled today in LLVM, and is definitely out of scope for this discussion. 44 | 45 | Memory Objects 46 | -------------- 47 | 48 | The LLVM memory model consists of individual memory allocations which are (conceptually) infinitely far apart in memory. 
In an actual execution environment, allocations might be near each other. To reconcile this, it's important to note that comparing or subtracting two pointers from different allocations results in an undefined (i.e. ``poison``) result. 49 | 50 | So what? 51 | --------- 52 | 53 | My understanding of the current pointer provenance rules is the following: 54 | 55 | * A pointer is (conceptually) derived from some particular memory object. 56 | * We may or may not actually be able to determine which object that is. Essentially this means there is an ``any`` state possible for pointer provenance. 57 | * The optimizer is free to infer provenance where it can. BasicAA, for instance, is essentially a complicated provenance inference engine. 58 | 59 | There's a key implication of the second bullet. Provenance is propagated through memory for pointer values. We may not be able to determine it statically (or dynamically), but conceptually it exists. 60 | 61 | This implies there are some corner cases we have to consider around "incorrectly" typed loads, and overlapping memory accesses. At the moment, I don't see a reason why we can't simply define the provenance of any pointer load which doesn't exactly match a previous pointer store as being ``any``. We do need to allow refining transformations which expose provenance information (e.g. converting an integer store to a pointer one if the value being stored is a cast pointer), but I don't see that as being particularly problematic. 62 | 63 | Let me try stating that again, this time a bit more formally. 64 | 65 | We're going to define a dynamic (e.g. execution or small step operational) semantics for pointer provenance. Every byte of a value will be mapped to some memory object or one of two special marker values, ``poison`` or ``undef``. Note that this is *every* value, not just *pointer values*. 66 | 67 | Allocations define a new symbolic memory object. 
GetElementPtrs generally propagate their base pointer's provenance to their result, but see the rule below for mismatched provenance. 68 | 69 | Storing a pointer to memory conceptually creates an entry in a parallel memory which maps those bytes to the corresponding memory object. Every time memory is stored over, that map is updated. Additionally, the side memory remembers the bounds of the last store which touches each byte in memory. 70 | 71 | Loading a pointer reads the last written provenance associated with the address. If the bytes read were last written by two different stores, the resulting provenance is ``poison``. 72 | 73 | Casting a pointer to an integer does *not* strip provenance. As a result, round tripping a pointer through an integer-to-pointer cast and back is a nop. This is critical to keep the correspondence with the memory semantics. 74 | 75 | Integer constants are considered to have the provenance of the memory object which happens to be at that address at runtime, or the special value ``undef`` if there is no such memory object. 76 | 77 | Any operation (gep, or integer math) which consumes operands of two distinct provenances returns a result with the provenance ``poison``, with the caveat that an ``undef`` provenance can take on the value of any memory object chosen by the optimizer. (This is analogous to ``undef`` semantics on concrete values, just extended to the provenance type.) Note that the result is *not* the ``poison`` value, it is a value with ``poison`` provenance. 78 | 79 | Memory operations with a memory operand with ``poison`` provenance are undefined. Comparison instructions with a pointer operand with ``poison`` provenance return the value ``poison``. 80 | 81 | Now, let's extend that to a static semantics. The key thing we have to add is the marker value ``any`` as a possible provenance. ``any`` means simply that we don't (yet) know what the provenance is, and must be conservative in our treatment. 
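As a concrete illustration of the mixed-provenance rule, consider the following hypothetical fragment; the comments track the provenance each value carries under the dynamic semantics sketched above:

.. code::

   %a  = malloc(8)         ; provenance: object A
   %b  = malloc(8)         ; provenance: object B
   %ia = ptrtoint %a       ; still carries provenance A
   %ib = ptrtoint %b       ; still carries provenance B
   %d  = sub i64 %ia, %ib  ; two distinct provenances, so %d carries
                           ; poison provenance (it is *not* the
                           ; poison value itself)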
82 | 83 | As is normal, the optimizer is free to implement refining transformations which make the program less undefined. As a result, memory forwarding, CSE, etc. all remain legal. 84 | 85 | **BUG**: CSE of two integer values with differing provenance seems to not work. 86 | 87 | 88 | -------------------------------------------------------------------------------- /project-ideas.rst: -------------------------------------------------------------------------------- 1 | .. header:: This is a collection of random ideas for small projects I think are interesting, and would in theory love to work on someday. Feel free to steal anything here which inspires you! 2 | 3 | ------------------------------------------------- 4 | Free! Project Ideas 5 | ------------------------------------------------- 6 | 7 | .. contents:: 8 | 9 | Missing Opt Reduction 10 | --------------------- 11 | 12 | With any form of super-optimizer, one tricky bit is turning missing optimizations in large inputs into small self-contained test cases (i.e. making the findings actionable for compiler developers). Automated reduction tools are generally good at reducing failures (i.e. easy interestingness tests). A tool which wrapped a super optimizer and simply returned an error code if the super optimizer could find a further optimization in the input would allow integration with e.g. bugpoint and llvm-reduce. Bonus points if the wrapper passed through options to the underlying tool invocation (e.g. see opt-alive) so that a test can be entirely self-contained. 13 | 14 | opt-alive for NFC proofs 15 | ------------------------ 16 | 17 | Many changes are marked NFC. These are generally candidates for differential proofs, and the opt-alive tool (from alive2) seems to be a good fit here. It could very well be enough to prove many NFCs were in fact NFCs (for at least some build environment). 
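To make the wrapper idea from "Missing Opt Reduction" concrete, a minimal interestingness test might look like the sketch below. The tool name ``super-opt`` and its output format are placeholders for whatever super-optimizer is being wrapped, not a real command:

.. code::

   #!/bin/sh
   # Exit 0 ("interesting") iff the super optimizer still finds a missed
   # optimization in the candidate; llvm-reduce keeps shrinking the input
   # while this property holds.
   #   usage: interesting.sh candidate.ll
   super-opt "$1" 2>&1 | grep -q "missed optimization"

Something like ``llvm-reduce --test=interesting.sh big.ll`` would then shrink a large module down to a minimal example which still exhibits the missed optimization.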
18 | 19 | SCEV Based Array Language 20 | -------------------------- 21 | 22 | Many places in LLVM need to be able to reason about memory accesses spanning multiple iterations of a loop (e.g. loop-idiom, vectorizer, etc.). SCEV gives an existing way to model addresses and values stored (as separate SCEVs), but we don't really have a mechanism to model the memory access as a first-class object. 23 | 24 | Having "store to <%obj,+,8>" as a first-class construct allows generalizations of existing transforms. For example: two accesses of the form "store to <%obj,+, 8>" and "store to <%obj+4,+, 8>" can be merged into a single wider "store to <%obj,+, 8>", enabling generalized memset recognition. 25 | 26 | Looking at adjacent loops, knowing that two stores overlap (i.e. a later loop clobbers the same memory), allows iteration space reductions for the first. 27 | 28 | This may combine in interesting ways with MemorySSA. I have not looked at that closely. 29 | 30 | Focus on optimizing outparams 31 | ----------------------------- 32 | 33 | The example below (originally written in the context of https://reviews.llvm.org/D109917) reveals some interesting missed LLVM optimizations. 34 | 35 | .. code:: 36 | 37 | int foo(); 38 | 39 | extern int test(int *out) __attribute__((noinline)); 40 | int test(int *out) { 41 | *out = foo(); 42 | return foo(); 43 | } 44 | 45 | int wrapper() { 46 | int notdead; 47 | if (test(&notdead)) 48 | return 0; 49 | 50 | int dead; 51 | int tmp = test(&dead); 52 | if (notdead) 53 | return tmp; 54 | return foo(); 55 | } 56 | 57 | Here are the ones I've noticed so far: 58 | 59 | * Failure to infer the writeonly argument attribute. I went ahead and posted https://reviews.llvm.org/D114963, but there's a bunch of follow-up work needed. 60 | * Failure to sink the second call, the one writing to 'dead' (i.e. the motivating review above). 61 | * Failure to color allocas and reuse the same alloca/variable for outparams to both calls. 
62 | * Opportunity to promote outparam to struct return in dso local copy with shim. (Not sure this is generally profitable though.) 63 | * Opportunity to reduce lifetime.start/end scope around 'notdead' alloca. You can think of this as part of coloring, or as a separate canonicalization step. 64 | * Missed opportunity for tailcall optimization in the final call to foo() in wrapper(). Similar for the final call to foo() in test(). 65 | -------------------------------------------------------------------------------- /recurrences.rst: -------------------------------------------------------------------------------- 1 | 2 | ------------------------------------------------- 3 | Fun w/Recurrences in LLVM 4 | ------------------------------------------------- 5 | 6 | This page is a mixture of a writeup on some recent work I've done around recurrences in LLVM, and a punch list for potential follow-ups. 7 | 8 | .. contents:: 9 | 10 | Problem Statement 11 | ================= 12 | 13 | A recurrence is simply a sequence where the value of f(x) is computable from f(x-1) and some constant terms. I find `this definition `_ useful. 14 | 15 | Historically, LLVM has primarily used ScalarEvolution for modeling loop recurrences, but SCEV is rather specialized for add recurrences. Over the years, we've grown a collection of one-off logic in a couple of different places, mostly because of pragmatic concerns about a) the compile time of SCEV, and b) the difficulty of plumbing SCEV through to all the places. 16 | 17 | Notation 18 | ======== 19 | 20 | Skip this until one of my examples doesn't make sense, then come back. 21 | 22 | **<Start,Op,Step>** follows the convention SCEV uses for displaying add recurrences and generalizes it for any Op. <Start,Op,Step> expands to: 23 | 24 | :: 25 | 26 | %phi = phi i64 [%start, %entry], [%next, %backedge] 27 | ... 28 | %next = opcode i64 %phi, %step 29 | 30 | For non-commutative operators, I generally only use this notation for the form with %phi on the left-hand side of the operator. 
Note that Step may not be loop invariant unless explicitly stated otherwise. 31 | 32 | I will also sometimes use the notation f(x) = g(f(x-1)) where that's easier to follow. In particular, I use that for the inverted forms of the non-commutative operators. You should assume that f(0) == Start unless explicitly said otherwise. 33 | 34 | **LIV** stands for "loop invariant value" and is simply a value which is invariant in the loop referred to. 35 | 36 | **CIV** stands for "canonical induction variable". A canonical IV is the sequence <0,+,1>. (Or sometimes <1,+,1>. Yes, I'm being sloppy about my off-by-ones.) 37 | 38 | Fun w/Recurrences 39 | ================= 40 | 41 | Integer Arithmetic 42 | ------------------ 43 | 44 | **ADD recurrences** are generally well covered by Scalar Evolution already. 45 | 46 | **SUB recurrences** are generally canonicalized to add recurrences. One interesting case is: 47 | 48 | :: 49 | 50 | %phi = phi i64 [%start, %entry], [%next, %backedge] 51 | ... 52 | %next = sub i64 %LIV, %phi 53 | 54 | That is, f(x) = LIV - f(x-1). This alternates between Start and LIV - Start. If we unrolled the loop by 2, we could eliminate the IV entirely. 55 | 56 | Status: Unimplemented 57 | 58 | 59 | A **mul recurrence** w/a loop invariant step value is a power sequence of the form Start*Step^CIV when overflow can be disproven. (TODO: Does this hold with wrapping twos complement arithmetic?) See my notes on `exponentiation in SCEV `_ for ideas on what we could do here. It's worth noting that the overflow cases may be identical to the cases where we could canonicalize the shifts. (TBD) 60 | 61 | A **udiv/sdiv recurrence** w/a loop invariant step forms a sequence of the form Start div (Step ^ CIV) when overflow can be disproven. Again, exponentiation? 62 | 63 | Shift Operators 64 | --------------- 65 | 66 | TBD - A bunch of work here on known bits, range info, and SCEV, both in flight, landed, and planned. Needs to be written up. 
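The alternating behaviour of the sub recurrence above is easy to sanity check outside the compiler. This is just a throwaway illustration script, not LLVM code:

```python
# f(x) = LIV - f(x-1) alternates between Start and LIV - Start, which is
# why unrolling by 2 would let us eliminate the IV entirely.
def sub_recurrence(start, liv, n):
    vals = [start]
    for _ in range(n):
        vals.append(liv - vals[-1])
    return vals

print(sub_recurrence(3, 10, 5))  # -> [3, 7, 3, 7, 3, 7]
```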
67 | 68 | 69 | Bitwise Operators 70 | ----------------- 71 | 72 | **AND and OR recurrences** w/ a loop invariant step value stabilize after the first iteration. That is, anding (oring) a value repeatedly has no effect. Thus 73 | 74 | :: 75 | 76 | %phi = phi i64 [%start, %entry], [%next, %backedge] 77 | ... 78 | %next = and i64 %phi, %LIV 79 | 80 | Is equivalent to: 81 | 82 | :: 83 | 84 | %next = and i64 %start, %LIV 85 | ... 86 | %phi = phi i64 [%start, %entry], [%next, %backedge] 87 | ... 88 | 89 | Status: Implemented in Instcombine (via `D97578 `_) + existing LICM transform. 90 | 91 | An **XOR recurrence** w/ a loop invariant step value will alternate between two values. As such, there is a potential to eliminate the recurrence by unrolling the loop by a factor of two. 92 | 93 | Status: Unimplemented. 94 | 95 | 96 | Floating Point Arithmetic 97 | -------------------------- 98 | 99 | In general, floating point is tricky because many operators are not associative. 100 | 101 | Most of the obvious options involve proving floating point IVs can be done in integer math. I have some old patches pending review (`D68954 `_ and `D68844 `_), but there's little active progress here. 102 | 103 | Index 104 | ===== 105 | 106 | This section has the same information as above, but indexed by optimization type. 107 | 108 | Known Bits 109 | ---------- 110 | 111 | Computing known bits for simple recurrences. Currently handled: lshr, ashr, shl, add, sub, and, or, mul. Missing cases of note include: overflow intrinsics, udiv, sdiv, urem, srem. 112 | 113 | isKnownNonZero 114 | -------------- 115 | 116 | Can we tell a recurrence is non-zero through its entire range? Currently handles add, mul, shl, ashr, and lshr. Missing cases of note include: overflow intrinsics, udiv, sdiv. 117 | 118 | isKnownNonEqual 119 | ---------------- 120 | 121 | Can we tell two recurrences are unequal (cheaply)? Used by BasicAA. A patch is out for review which handles add, sub, mul, shl, lshr, and ashr. 
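The stabilizing and alternating behaviours of the bitwise recurrences above can likewise be sanity checked with a throwaway script (illustrative only, not LLVM code):

```python
# An AND (or OR) recurrence with a loop invariant step stabilizes after
# the first iteration; an XOR recurrence alternates with period two.
def iterate(op, start, liv, n):
    vals = [start]
    for _ in range(n):
        vals.append(op(vals[-1], liv))
    return vals

and_seq = iterate(lambda a, b: a & b, 0b1101, 0b1011, 4)
xor_seq = iterate(lambda a, b: a ^ b, 0b1101, 0b1011, 4)
print(and_seq)  # -> [13, 9, 9, 9, 9]   stable after one step
print(xor_seq)  # -> [13, 6, 13, 6, 13] period two
```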
122 | 123 | Constant Range 124 | -------------- 125 | 126 | Entirely TODO - not clear how worthwhile this is, as known bits gets the common cases which constant ranges could handle. 127 | 128 | SCEV Range 129 | ---------- 130 | 131 | Exploiting trip count information to refine constant range results. Currently, only shl is handled. Changes for ashr and lshr are out for review. 132 | 133 | RLEV 134 | ---- 135 | 136 | IndVarSimplify can rewrite loop exit values. For some of the alternating patterns (e.g. see xor above), if we know the trip count, we can select a single exit value and fold uses outside the loop (e.g. selecting between two values based on the primary IV). We can also use knowledge of the trip count multiple (which SCEV computes) to avoid the select in some cases. 137 | 138 | Entirely unimplemented. 139 | 140 | Unroll/Vectorize Costing 141 | ---------------------------- 142 | 143 | For alternating patterns, we can exploit the fact that unrolling the loop by a multiple of the alternation length results in fixed patterns for each lane. This could affect unrolling and vector codegen, but most importantly, should drive costing. 144 | 145 | Entirely unimplemented. We can do this for general SCEV formulas as well. 146 | -------------------------------------------------------------------------------- /riscv-attribute-validation.rst: -------------------------------------------------------------------------------- 1 | --------------------------- 2 | RISCV Attribute Validation 3 | --------------------------- 4 | 5 | At the moment, this page is a collection of notes on the topic of attribute validation by the linker. There's no real coherent message here - yet - it's more of a reference document to help organize my thoughts. 6 | 7 | .. contents:: 8 | 9 | Compatibility 10 | ------------- 11 | 12 | Since `87fdd7ac09b ("RISC-V: Stop reporting warnings for mismatched extension versions") `_ which landed in 2.38, ld stopped trying to enforce any attribute compatibility. 
Previous versions enforced (at least) the following cases: 13 | 14 | * Unrecognized extension name 15 | * Mismatched extension version between object files 16 | 17 | LD 2.38 is still pretty recent. In particular, it has not yet (as of Feb 2023) made it into e.g. Debian stable. 18 | 19 | LLD historically did not enforce attribute consistency. In `8a900f2438b4 ([ELF] Merge SHT_RISCV_ATTRIBUTES sections) `_ this was accidentally changed as part of a refactoring. This is being treated as a regression (see `D144353 `_), and will (hopefully!) not make it into a public release. 20 | 21 | I have not investigated the status of `mold` or other linkers. 22 | 23 | Previous Breakage 24 | ----------------- 25 | 26 | Clang Built Linux (Multiple) 27 | ============================ 28 | 29 | Feb 2023 `breakage `_ due to _zicsr_zifencei with older LD. Older versions of LD did not recognize these extension names and failed to link. This case was a cross-toolchain compatibility one; it was an LLVM toolchain which emitted the new extension. 30 | 31 | Feb 2023 `breakage `_ due to unrecognized I version with TOT LLD. This is the LLD regression mentioned above, but is interesting to call out as it was an extension *version* which was unrecognized, not an extension *name*. 32 | 33 | Feb 2023 breakage due to Zmmul implication. I've been verbally told about this one, but can't find a citation. My understanding is that older LD did not support Zmmul, but that a newer (GNU?) toolchain did. Thus linking within the same toolchain family, but across different versions, broke. Details here may be wrong. 34 | 35 | Sept 2022 `breakage `_ due to unrecognized `zicbom` extension with LD. This is another example of the toolchains evolving independently. LLVM added zicbom and the version of LD used in a mixed environment did not support the extension. 36 | 37 | More generally, the kconfig files are full of workarounds for related issues. 38 | 39 | 40 | Use Cases 41 | --------- 42 | 43 | Mixed Object Case (for e.g. 
environmental check dispatch) 44 | 45 | Linking new object files with an older linker 46 | 47 | Cross toolchain linking (and thus version lock) 48 | 49 | For both of the prior two, there are sub-cases for unrecognized extension and version. 50 | -------------------------------------------------------------------------------- /riscv-ecosystem.rst: -------------------------------------------------------------------------------- 1 | --------------- 2 | RISCV Ecosystem 3 | --------------- 4 | 5 | This is a quick reference guide to find various bits of RISCV-specific ecosystem pieces. At the moment, the primary audience is me; I tend to forget this stuff, and being able to find links easily is useful. 6 | 7 | .. contents:: 8 | 9 | Distros 10 | ------- 11 | 12 | Fedora 13 | ====== 14 | 15 | `Current Disk Images `_ 16 | 17 | `Useful Overview from late 2019 `_ 18 | 19 | Ubuntu 20 | ====== 21 | 22 | `Download and install instructions for latest Dev Previews `_. 23 | 24 | `Wiki, links to older images and qemu boot instructions `_ 25 | 26 | Alpine 27 | ======= 28 | 29 | `Tracking issue for official riscv64 support `_ 30 | 31 | `A good writeup on early bringup and porting (from 2018) `_, `An early 2022 update is here `_. 32 | 33 | 34 | -------------------------------------------------------------------------------- /riscv-microarch.rst: -------------------------------------------------------------------------------- 1 | ------------------------------ 2 | RISC-V Microarchitecture Notes 3 | ------------------------------ 4 | 5 | This document is a collection of links to documentation on various vendors' micro-architectures. 6 | 7 | .. contents:: 8 | 9 | Spacemit 10 | -------- 11 | 12 | `BPi-F3 Datasheet: `_ 13 | `Spacemit-K1 Datasheet: `_ 14 | 15 | To purchase: `AliExpress `_, `Amazon `_ 16 | 17 | `X100 `_ 18 | 19 | SiFive 20 | ------ 21 | 22 | "SiFive Announces First RISC-V OoO CPU Core: The U8-Series Processor IP". AnandTech. ``_ 23 | 24 | "Incredibly Scalable High-Performance RISC-V Core IP". 
``_ -- Minimal information, but the point about being able to issue int ops to the fp-pipe is interesting and not covered in the AnandTech write-up. 25 | 26 | WikiChip. ``_ -- Some information on the various N-series cores, unclear sourcing and quality. 27 | 28 | "SiFive U74 Core Complex Manual". ``_ 29 | 30 | "Inside SiFive's P550 Microarchitecture". https://chipsandcheese.com/p/inside-sifives-p550-microarchitecture 31 | 32 | SemiDynamics 33 | ------------ 34 | 35 | "Avispado: A RISC-V core supporting the RISC-V vector instruction set by Roger Espasa", ``_ 36 | 37 | Tenstorrent 38 | ----------- 39 | 40 | "Building AI Silicon: Ascalon RiscV Processor". ``_ 41 | 42 | Esperanto 43 | --------- 44 | 45 | "A Look At The ET-SoC-1, Esperanto’s Massively Multi-Core RISC-V Approach To AI". ``_ 46 | 47 | "ESPERANTO MAXES OUT RISC-V". Linley Group, Microprocessor Report. December 10, 2018. ``_ 48 | 49 | Alibaba/T-Head 50 | -------------- 51 | 52 | "T-head RVB-ICE Development Board, Dual-core XuanTie C910 RISC-V 64GC, 1.2GHz, Support Android/Debian System" ``_. Sales page, but scroll down to the description section for a reasonably detailed description. 53 | 54 | "T-Head ISA extension specification" ``_. Defines the custom extensions shipped by T-Head. 55 | 56 | Ventana 57 | ------- 58 | 59 | "VTx-family custom instructions: Custom ISA extensions for Ventana Micro Systems RISC-V cores" 60 | ``_ 61 | 62 | Other Sources 63 | ------------- 64 | 65 | Reddit post: ``_. Has some good commentary; it was also the source of two of the better links above. 66 | -------------------------------------------------------------------------------- /riscv-notes.rst: -------------------------------------------------------------------------------- 1 | --------------- 2 | Notes on RISCV 3 | --------------- 4 | 5 | This document is a collection of notes on the RISC-V architecture. This is mostly to serve as a quick reference for me, as finding some of this in the specs is a bit challenging. 6 | 7 | .. 
contents:: 8 | 9 | VLEN >= 32 (always) and VLEN >= 128 (for V extension) 10 | ----------------------------------------------------- 11 | 12 | VLEN is determined by the Zvl32b, Zvl64b, Zvl128b, etc. extensions. V implies Zvl128b. Zve64* implies Zvl64b. Zve32* implies Zvl32b. VLEN can never be less than 32 with the currently defined extensions. 13 | 14 | Additional clarification here: 15 | 16 | "Note: Explicit use of the Zvl32b extension string is not required for any standard vector extension as they all effectively mandate at least this minimum, but the string can be useful when stating hardware capabilities." 17 | 18 | Reviewing 18.2 and 18.3 confirms that none of the proposed vector variants allow VLEN < 32. 19 | 20 | As a result, VLENB >= 4 (always), and VLENB >= 16 (for V extension). 21 | 22 | ELEN <= 64 23 | ---------- 24 | 25 | While room is left for future expansion in the vector spec, current ELEN values encodable in VTYPE max out at 64 bits. 26 | 27 | vsetivli can not always encode VLMAX 28 | ------------------------------------ 29 | 30 | The five-bit immediate field in vsetivli can encode a maximum value of 31. For VLEN > 32, this means that VLMAX can not always be represented as a constant even if the exact VLEN is known at compile time. 31 | 32 | Odd gaps in vector ISA 33 | ---------------------- 34 | 35 | There are a number of odd gaps in the vector extension. By "gap" I mean a case where the ISA appears to force common idioms to generate oddly complex or expensive code. By "odd" I mean either seemingly inconsistent within the extension itself, or significantly worse than alternative vector ISAs (e.g. x86 or AArch64 SVE). I haven't gone actively looking for these; they're simply examples of repeating patterns I've seen when looking at compiler-generated assembly where there doesn't seem to be something "obvious" the compiler should have done instead. 
36 | 37 | No Zbb, Zbs (Basic bitmanip idioms) vector analogy extensions 38 | ============================================================= 39 | 40 | The lack of Zbb and Zbs analogies prevents the vectorization of many common bit manipulation idioms. The code sequences to replicate e.g. bitreverse without a dedicated instruction are rather painfully expensive. Being able to generate fast code for e.g. vredmax(ctlz(v)) has some interesting applications. 41 | 42 | Impact: Major. Practically speaking, this prevents vector usage for many common idioms. 43 | 44 | No Zbc (carryless multiply) analogy extensions 45 | ================================================== 46 | 47 | I haven't yet seen code which would benefit from a Zbc vector analogy, but I also don't spend much time on crypto, and my understanding is that crypto is the motivator for these. Here's a `public proposal for a set of vector crypto operations `_ which do include some restricted forms of carryless multiply. 48 | 49 | Impact: ?? 50 | 51 | No SEW-ignoring math ops 52 | ======================== 53 | 54 | When working with indexed loads and stores, the index width and data width are often different. For instance, say I want to load 8 bits of data from addresses `(8 * i + 256)` off a common base. The code sequence looks roughly like this:: 55 | 56 | vsetvli x0, x0, e64, mf8, ta, ma 57 | vshl v2, v2, 3 58 | vadd v2, v2, 256 59 | vsetvli x0, x0, e8, m1, ta, ma 60 | vluxei64.v vd, (x1), v2, vm 61 | 62 | Note that we're toggling vtype solely for the purpose of performing indexing in i64. 63 | 64 | If we had a version of the basic arithmetic ops which ignored SEW - or even better, a variant of the Zba instructions! - we could rewrite this sequence as:: 65 | 66 | vshl64 v2, v2, 3 67 | vadd64 v2, v2, 256 68 | vluxei64.v vd, (x1), v2, vm 69 | 70 | Or even better:: 71 | 72 | vsh3add64 v2, v2, 256 73 | vluxei64.v vd, (x1), v2, vm 74 | 75 | Note that 64 here comes from the native index width for a 64 bit machine. 
We could either produce two 32/64 variants or a single ELEN parameterized variant. 76 | 77 | Impact: minor, main benefit is reduced code size and fewer vtype changes 78 | 79 | Another idea here might be to instead have an indexed load/store variant which implicitly scaled its index vector by the index type. (That is, implicitly included a multiplication of the index vector by the index width.) That would give us code along the lines of the following:: 80 | 81 | add x2, x1, 256 82 | vluxei64.v.scaled vd, (x2), v2, vm 83 | 84 | No Cheap Mask Extend 85 | ==================== 86 | 87 | There does not appear to be a cheap zero- or sign-extend sequence to take a mask and produce e.g. an i32 vector. 88 | 89 | The best sequence I see is:: 90 | 91 | vmv.v.i vd, 0 92 | vmerge.vim vd, vd, 1, v0 93 | 94 | How to fix: 95 | 96 | * Allow EEW=1 on zext.vfN variants. This covers extend to i8. 97 | * Add zext.vf16, zext.vf32, and zext.vf64 on the prior to get all SEW. 98 | * Alternatively, add a dedicated mask extend op to SEW. 99 | 100 | Impact: fairly minor, mostly some extra vector register pressure due to the need for the zero splat. 101 | 102 | No Product Reduction 103 | ==================== 104 | 105 | There does not appear to be a way to lower an "llvm.vector.reduce.mul" or "llvm.vector.reduce.fmul" into a single reduction instruction. Other reduction types are supported, but for some reason there's no 'vredprod', 'vfredoprod' or 'vfreduprod'. 106 | 107 | Impact: minor, mostly me being a completionist. 108 | 109 | Non-vrgather vector.reverse 110 | =========================== 111 | 112 | Reversing the order of elements in a vector is a common operation. On RISC-V today, this requires the use of a vrgather, and almost more importantly, a several-instruction-long sequence to materialize the index vector. 
E.g. the following sequence reverses an i8 vector:: 113 | 114 | csrr a0, vlenb 115 | srli a0, a0, 2 116 | addi a0, a0, -1 117 | vsetvli a1, zero, e16, mf2, ta, mu 118 | vid.v v9 119 | vrsub.vx v10, v9, a0 120 | vsetvli zero, zero, e8, mf4, ta, mu 121 | vrgatherei16.vv v9, v8, v10 122 | vmv1r.v v8, v9 123 | 124 | Note that AArch64 provides an instruction for this. 125 | 126 | Other ways to improve this sequence might be to add variants of the SEW-independent index arithmetic above, or to provide a cheap way to get the VLMax splat. 127 | 128 | Lack of e1 element type 129 | ======================= 130 | 131 | For working with large bitvectors, having an element type of e1 would be helpful. Today, we have the masked arithmetic ops, but because they're expected to only work on masks, they can't be combined with LMUL to work on more than one vreg of data. 132 | 133 | Impact: minor, mostly a seeming inconsistency 134 | 135 | vlast.m variant 136 | =============== 137 | 138 | The extension has vfirst.m (and its variants), but not vlast.m (and its variants). I've been told the latter is sometimes useful, though I don't have a good motivating example as of yet. 139 | 140 | The other option here would be to support bitreverse on mask vectors. A bitreverse followed by a vfirst.m should be equivalent to a vlast.m - modulo register pressure and latency. 141 | 142 | In the currently available extension, probably the best option is to use CTZ in Zbb to emulate this for any case where we know VLMAX < ELEN. This is likely enough for fixed vectors as ELEN=64, etype=e8, would give VLEN=512 as the maximally supported size for this trick. By using a series of vslidedown, copy to gpr, and CTZs we could probably generate correct - if even slower with every ELEN-sized chunk - code for any fixed vector. 
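Another possible emulation, sketched here untested and distinct from the CTZ approach above: use vid.v plus a masked signed max reduction over an index vector, seeding with -1 so an empty mask mirrors the -1 result of vfirst.m::

    vsetvli x0, a1, e16, m2, ta, ma  # SEW wide enough to hold indices
    vid.v v8                         # v8[i] = i
    li a0, -1
    vmv.s.x v12, a0                  # reduction seed = -1
    vredmax.vs v12, v8, v12, v0.t    # signed max index over active mask bits
    vmv.x.s a0, v12                  # a0 = index of last set bit, or -1

This trades the scalar CTZ loop for a single reduction, at the cost of materializing the index vector.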
143 | 144 | 145 | 146 | 147 | -------------------------------------------------------------------------------- /riscv-spec-minutia.rst: -------------------------------------------------------------------------------- 1 | --------------------------- 2 | RISCV Specification Minutia 3 | --------------------------- 4 | 5 | This is a collection of minutia about the change history of the RISCV specifications which have come up on a couple of occasions. This document is (hopefully!) not relevant for typical users. 6 | 7 | .. contents:: 8 | 9 | Document Version 2.1 to 2.2 10 | --------------------------- 11 | 12 | Between document version `riscv-spec-v2.1.pdf `_ and document version `riscv-spec-v2.2.pdf `_, a backwards incompatible change was made to remove selected instructions and CSRs from the base ISA. These instructions were grouped into a set of new extensions, but were no longer required by the base ISA. 13 | 14 | Note: Document versus specification versioning is particularly confusing in this case as document 2.1 to 2.2 corresponds to base I specification 2.0 to 2.1. 15 | 16 | Zicsr and Zifencei 17 | ================== 18 | 19 | The part of this change relevant for Zicsr and Zifencei is described in “Preface to Document Version 20190608-Base-Ratified” from the specification document. All of the changes appear to have been made at once, and the new extensions were immediately ratified. 20 | 21 | zicntr 22 | ====== 23 | 24 | The wording which defined the RDCYCLE, RDTIME, and RDINSTRET counters was removed. They were re-added to the specification as Zicntr in `commit 77aff0 `_ in March 2022. 25 | 26 | That change includes the following verbiage: 27 | 28 | "NOTE: Counters and timers (now known as Zicntr and Zihpm) were frozen 29 | but not ratified in 2019, as they were removed from the base ISAs 30 | during the ratification process. Due to an oversight they were not
As they are required for the RVA20 and RVA22 32 | profiles, the proposal is to ratify these extensions in 2022 and 33 | retroactively add to the 2020 and 2022 profiles as an exception." 34 | 35 | The ratification status is unclear. Zicntr appears in the `current draft specification `_ without any indication it might be un-ratified, but late last year we a `call for public comment _`. I don't see any formal indication these have been fully ratified. The best summary of status I can find is `this issue on the profiles repo `_, but even that is not conclusive. 36 | 37 | There's an additional problem with the current specification text. No version number has yet been assigned. This is tracked as `an open issue against the specification `_. 38 | 39 | Zihpm 40 | ===== 41 | 42 | At a high level, Zihpm parallels Zicntr in that ratification status is unclear, and no version has been assigned. 43 | 44 | However, hpmcounter3–hpmcounter31 names do not appear to be present in older *unprivledged* specification documents. As such, Zihpm is merely a newly proposed extension as opposed to a backwards incompatible spec change. Note however that this appears to contradict the text added to the specification document quoted above. It was pointed out to me that they are mentioned in some of the *privledged* specification documents, but I have not tracked when they were added or tried to reconcile the history of the two specification documents. 45 | 46 | 47 | Can't we all just hold on a second? 48 | ----------------------------------- 49 | 50 | For one little instruction, Zihintpause (i.e. the PAUSE instruction) has been a real mess process wise. This section is largely based on research reported `here `_. 51 | 52 | The version number of the `zihintpause` extension **moved backwards** from 2.0 to 1.0 very shortly after being merged into the main repository. 
This is easy to write off as a minor issue, except that the `commit which moved the extension number backwards `_ also introduced the sentence "No architectural state is changed.". If you think about it a bit, this is absolutely absurd because the program counter is part of the architectural state. This effectively says the instruction must execute forever. Except, that also contradicts the wording which says the "duration of its effect must be bounded". So basically, 1.0 is (pedantically) unimplementable. 53 | 54 | In Aug 2021, the extension was ratified, and, a few hours later, the version number was increased again to 2.0. The wording discussed above remained. 55 | 56 | In `commit cb3b9d `_ (Dec 2022) the definition of the PAUSE instruction was again revised to remove the "No architectural state is changed." wording. This is great, and long overdue. However, the version number of the extension was *not* increased. So as a result, we have two versions of the extension text - both of which claim to be 2.0 - which are mutually incompatible. Arguably, this was a small enough matter that an erratum should suffice, but well, we don't have one of those either. 57 | 58 | As a practical matter, the consensus seems to be to basically ignore the matter. The prior text was unimplementable, and if you ignore that sentence, all of the known versions are substantially similar. As a result, the discrepancies in version can mostly be ignored, and we pretend that only the most recent 2.0 version ever existed. 59 | 60 | Zmmul vs M 61 | ---------- 62 | 63 | Discussion in the issue `Is Zmmul a subset of M? `_ appears to indicate that in a pedantic sense `Zmmul` is not a strict subset of `M`. Specifically, `M` allows some configurations which don't actually support multiplication at runtime, whereas `Zmmul` does not.
Given toolchains completely ignore this possibility to start with - seriously, don't tell your toolchain you have a multiply instruction if it's disabled at runtime - in any practical sense `Zmmul` does appear to be a subset of `M`. 64 | 65 | Redefinition of Vector Overlap (Nov 2022) 66 | ----------------------------------------- 67 | 68 | `This proposal `_ introduced a wording change which resulted in previously valid encodings becoming invalid. This was raised in the discussion, and actively rejected as being a compatibility concern. This change appears not to have been merged into the `specification repo `_ as of 2023-02-23. 69 | 70 | Whole Vector Register Move and VILL 71 | ----------------------------------- 72 | 73 | The 1.0 version of the vector specification says the following in section 3.4.4. "Vector Type Illegal vill": 74 | 75 | "If the vill bit is set, then any attempt to execute a vector instruction that depends upon vtype will raise an illegal-instruction exception. Note vset{i}vl{i} and whole-register loads, stores, *and moves* do not depend upon vtype." 76 | 77 | By version 20240411 of the combined unprivileged specification, this wording has been changed to: 78 | 79 | "If the vill bit is set, then any attempt to execute a vector instruction that depends upon vtype will raise an illegal-instruction exception. vset{i}vl{i} and whole register loads and stores do not depend upon vtype." 80 | 81 | Note that whole register moves have been *removed* from this note, and thus 82 | are now required to raise an illegal-instruction exception on vill. 83 | 84 | Both versions have the following note in section 31.16.6 "Whole Vector Register Move": 85 | 86 | "These instructions are intended to aid compilers to shuffle vector registers without needing to know or change vl or vtype."
87 | 88 | Given ``vill`` is a bit within the ``vtype`` CSR ("The vill bit is held in bit XLEN-1 of the CSR to support checking for illegal values with a branch on the sign bit."), it really seems like the new version of the specification contradicts itself. 89 | 90 | This change was introduced by: 91 | 92 | commit 856fe5bd1cb135c39258e6ca941bf234ae63e1b1 93 | 94 | Author: Andrew Waterman 95 | 96 | Date: Mon Apr 3 15:40:16 2023 -0700 97 | 98 | Delete non-normative claim that vmvr.v doesn't depend on vtype 99 | 100 | The normative text says that vmvr.v "operates as though EEW=SEW", 101 | meaning that it _does_ depend on vtype. 102 | 103 | The semantic difference becomes visible through vstart, since vstart is 104 | measured in elements; hence, how to set vstart on an interrupt, or 105 | interpret vstart upon resumption, depends on vtype. 106 | 107 | Personally, I think this was the wrong change to resolve the stated issue. 108 | The wording that changed was in a section specific to vill exception behavior, 109 | and adding a carve out that whole vector register move explicitly 110 | did not trap on vill (but otherwise did depend on vtype) would have been 111 | much more reasonable. 112 | -------------------------------------------------------------------------------- /riscv-tso-mappings.rst: -------------------------------------------------------------------------------- 1 | ----------------------------- 2 | TSO Memory Mappings for RISCV 3 | ----------------------------- 4 | 5 | This document lays out the memory mappings being proposed for Ztso. Specifically, these are the mappings from the C/C++ memory models to RISCV w/Ztso. This document is a placeholder until this can be merged into something more official. 6 | 7 | The proposed mapping tables are the work of Andrea Parri. He's an actual memory model expert; I'm just a compiler guy who has gotten a bit too close to these issues, and needs to drive upcoming patches for LLVM.
As always, mistakes are probably mine, and all credit goes to Andrea. 8 | 9 | .. contents:: 10 | 11 | Background 12 | ---------- 13 | 14 | RISCV uses the WMO memory model by default. This is described in chapter 17 ("RVWMO Memory Consistency Model, Version 2.0") of the Unprivileged specification. RISCV also supports the Total Store Order (TSO) model via an optional extension (Ztso). There is also a description of version 0.1 of Ztso in the last ratified Unprivileged specification; I am not aware of any major differences between them. 15 | 16 | Programming languages such as C/C++ and Java define their own memory models. One of the tasks in implementing such a language is choosing a mapping from the language level memory model to the hardware level memory model. For clarity's sake, it's worth emphasizing that many such mappings may be legal (that is, equally correct), but that for ABI compatibility, it is important that we designate exactly one such mapping as part of the ABI and use it across all toolchains whose results need to interoperate. Otherwise, you could end up creating a racy program by linking two object files which both correctly implement synchronization at the source level. Generally, that is considered bad. 17 | 18 | Aside: There is a related ABI issue around defining how atomics and ordering work when involving data types larger or smaller than what the hardware natively supports. These are currently implementation defined (but need to be specified eventually). This is explicitly out of scope for this document, and the mapping here does not apply to such types. 19 | 20 | The ABI designated mapping for WMO is defined in "Table A.6: Mappings from C/C++ primitives to RISC-V primitives" from the Unprivileged spec. Having this specified in the ISA specification is arguably a weird RISCV quirk; it should probably live in something like the psABI specification instead.
To my knowledge, there is not yet a designated mapping for Ztso, and that's what the rest of this document discusses. 21 | 22 | 23 | Proposed Mapping 24 | ---------------- 25 | 26 | The proposed mapping table is: 27 | 28 | .. code:: 29 | 30 | C/C++ Construct | RVTSO Mapping 31 | ------------------------------------------------------------------------------ 32 | Non-atomic load | l{b|h|w|d} 33 | atomic_load(memory_order_relaxed) | l{b|h|w|d} 34 | atomic_load(memory_order_acquire) | l{b|h|w|d} 35 | atomic_load(memory_order_seq_cst) | fence rw,rw ; l{b|h|w|d} 36 | ------------------------------------------------------------------------------ 37 | Non-atomic store | s{b|h|w|d} 38 | atomic_store(memory_order_relaxed) | s{b|h|w|d} 39 | atomic_store(memory_order_release) | s{b|h|w|d} 40 | atomic_store(memory_order_seq_cst) | s{b|h|w|d} 41 | ------------------------------------------------------------------------------ 42 | atomic_thread_fence(memory_order_acquire) | nop 43 | atomic_thread_fence(memory_order_release) | nop 44 | atomic_thread_fence(memory_order_acq_rel) | nop 45 | atomic_thread_fence(memory_order_seq_cst) | fence rw,rw 46 | ------------------------------------------------------------------------------ 47 | C/C++ Construct | RVTSO AMO Mapping 48 | atomic_<op>(memory_order_relaxed) | amo<op>.{w|d} 49 | atomic_<op>(memory_order_acquire) | amo<op>.{w|d} 50 | atomic_<op>(memory_order_release) | amo<op>.{w|d} 51 | atomic_<op>(memory_order_acq_rel) | amo<op>.{w|d} 52 | atomic_<op>(memory_order_seq_cst) | amo<op>.{w|d} 53 | ------------------------------------------------------------------------------ 54 | C/C++ Construct | RVTSO LR/SC Mapping 55 | atomic_<op>(memory_order_relaxed) | loop: lr.{w|d} ; <op> ; 56 | | sc.{w|d} ; bnez loop 57 | atomic_<op>(memory_order_acquire) | loop: lr.{w|d} ; <op> ; 58 | | sc.{w|d} ; bnez loop 59 | atomic_<op>(memory_order_release) | loop: lr.{w|d} ; <op> ; 60 | | sc.{w|d} ; bnez loop 61 | atomic_<op>(memory_order_acq_rel) | loop: lr.{w|d} ; <op> ; 62 | | sc.{w|d} ; bnez loop 63 |
atomic_<op>(memory_order_seq_cst) | loop: lr.{w|d}.aqrl ; <op> ; 64 | | sc.{w|d} ; bnez loop 65 | 66 | The key thing to note here is that we use a fence *before* any seq_cst *load*. There is an alternative mapping (discussed below) which uses a fence *after* an atomic *store*. The mapping shown here is the one I am proposing moving forward with. 67 | 68 | The alternate mapping 69 | --------------------- 70 | 71 | This mapping table is listed here for explanatory value only. This lowering is **incompatible** with the mapping proposed for inclusion in toolchains and the psABI (above). 72 | 73 | .. code:: 74 | 75 | C/C++ Construct | RVTSO Mapping 76 | ------------------------------------------------------------------------------ 77 | Non-atomic load | l{b|h|w|d} 78 | atomic_load(memory_order_relaxed) | l{b|h|w|d} 79 | atomic_load(memory_order_acquire) | l{b|h|w|d} 80 | atomic_load(memory_order_seq_cst) | l{b|h|w|d} 81 | ------------------------------------------------------------------------------ 82 | Non-atomic store | s{b|h|w|d} 83 | atomic_store(memory_order_relaxed) | s{b|h|w|d} 84 | atomic_store(memory_order_release) | s{b|h|w|d} 85 | atomic_store(memory_order_seq_cst) | s{b|h|w|d} ; fence rw,rw 86 | ------------------------------------------------------------------------------ 87 | atomic_thread_fence(memory_order_acquire) | nop 88 | atomic_thread_fence(memory_order_release) | nop 89 | atomic_thread_fence(memory_order_acq_rel) | nop 90 | atomic_thread_fence(memory_order_seq_cst) | fence rw,rw 91 | ------------------------------------------------------------------------------ 92 | C/C++ Construct | RVTSO AMO Mapping 93 | atomic_<op>(memory_order_relaxed) | amo<op>.{w|d} 94 | atomic_<op>(memory_order_acquire) | amo<op>.{w|d} 95 | atomic_<op>(memory_order_release) | amo<op>.{w|d} 96 | atomic_<op>(memory_order_acq_rel) | amo<op>.{w|d} 97 | atomic_<op>(memory_order_seq_cst) | amo<op>.{w|d} 98 | ------------------------------------------------------------------------------ 99 | C/C++ Construct | RVTSO LR/SC Mapping
100 | atomic_<op>(memory_order_relaxed) | loop: lr.{w|d} ; <op> ; 101 | | sc.{w|d} ; bnez loop 102 | atomic_<op>(memory_order_acquire) | loop: lr.{w|d} ; <op> ; 103 | | sc.{w|d} ; bnez loop 104 | atomic_<op>(memory_order_release) | loop: lr.{w|d} ; <op> ; 105 | | sc.{w|d} ; bnez loop 106 | atomic_<op>(memory_order_acq_rel) | loop: lr.{w|d} ; <op> ; 107 | | sc.{w|d} ; bnez loop 108 | atomic_<op>(memory_order_seq_cst) | loop: lr.{w|d} ; <op> ; 109 | | sc.{w|d}.aqrl ; bnez loop 110 | 111 | The key difference to note is that this lowering uses a fence *after* the sequentially consistent stores. 112 | 113 | Discussion 114 | ---------- 115 | 116 | So, why are we proposing the first mapping and not the alternative? This comes down to a benefit analysis. 117 | 118 | The proposed Ztso mapping was constructed to be a strict subset of the WMO mapping. Consider the case where we are running on a Ztso machine, but not all of our object files or libraries were compiled assuming Ztso. If the Ztso mapping is a subset of the WMO mapping, then all parts of this mixed application include the required fences for correctness on Ztso. Some libraries might have a bunch of redundant fences (i.e. all the ones needed by WMO but not needed for Ztso), but the application will behave correctly regardless. This allows libraries targeted for WMO to be reused on a Ztso machine with only selected performance sensitive pieces recompiled explicitly for Ztso. 119 | 120 | The alternative mapping instead parallels the mappings used by X86. Ztso is intended to parallel the X86 memory model, and it is desirable if explicitly fenced code ported from x86 just works with Ztso. Consider a developer who is doing a port of a library which is implemented using normal C intermixed with either inline assembly or intrinsic calls to generate fences. If that code follows the x86 convention, then a naive port will match the alternate mapping.
The key point is that code using the alternate mapping will not properly synchronize with code compiled with the proposed mapping. 121 | 122 | To avoid confusion, let me emphasize that the porting concern just mentioned *does not* apply to code written in terms of either C or C++'s explicit atomic APIs. Instead, it *only* applies to manually ported assembly or code which is already living dangerously by using explicit fencing around relaxed atomics. Such code is rare, and usually written by experts anyways. The slightly broader class of code which may be concerning is that with non-atomic loads and stores mixed with explicit fencing. Such code is already relying on undefined behavior in C/C++, but "probably works" on X86 today and might not after a naive RISCV port if synchronizing with code compiled with the proposed mapping. 123 | 124 | The alternative mapping also has the advantage that stores are generally dynamically rarer than loads. So the alternative mapping *may* result in dynamically fewer fence instructions. I do not have numbers on this. 125 | 126 | The choice between the two mappings essentially comes down to which of these we consider to be more important. I am proposing we move forward with the mapping which gives us WMO compatibility. It is my belief that allowing mixed applications is more important to the ecosystem than ease of porting explicit synchronization. 127 | -------------------------------------------------------------------------------- /riscv-vector-user-guide.rst: -------------------------------------------------------------------------------- 1 | ------------------------------- 2 | Using RISCV Vector Instructions 3 | ------------------------------- 4 | 5 | This is intended to be a user focused guide on how to leverage vector instructions on RISCV. Most of what I can find on the web is geared towards vector before v1.0 was ratified, and there are enough differences that having something to point people to has proven useful. 6 | 7 | ..
contents:: 8 | 9 | 10 | Execution Environments 11 | ---------------------- 12 | 13 | qemu-riscv32/64 support the v1.0 vector extension upstream. Note that the default packages available in most distros are not recent enough. You will need to download qemu and build from source. Thankfully, this is pretty straightforward, and `qemu's build instructions `_ are sufficiently up to date. 14 | 15 | Once you have a sufficiently recent qemu-riscv, you should be able to run binaries containing vector instructions. Note that vector is not enabled by qemu-user by default at this time, so you will need to explicitly enable it. If you get unhelpful error output when doing so, you are most likely using a version of qemu which is too old. 16 | 17 | With qemu-user, you can run and test programs in a cross build environment with one major gotcha. glibc does not have mcontext/ucontext support for vector, so anything which requires them - e.g. longjmp, signals, green threads, etc - will fail in interesting and unexpected ways. 18 | 19 | **WARNING**: At the moment, support for the vector extension has *NOT* landed in the upstream Linux kernel, and I am not aware of any distro which currently applies the required patches. So, unless you are running a custom kernel, there is a *very* good chance you can't run in a native environment. 20 | 21 | If you try, you will most likely get an illegal instruction exception (SIGILL) on the first vector instruction you execute. In many programs - though not all - this will look like a SIGILL on the first access to a vector CSR (e.g. `csrr a2, vlenb`) or a vector configuration instruction (e.g. `vsetvli a1, zero, e32, m1, ta, mu`). 22 | 23 | **Said differently, unless you're running a patched kernel, you can not enable vector code even if your hardware supports it!** 24 | 25 | 26 | Assembler Support 27 | ------------------ 28 | 29 | The LLVM assembler fully supports the v1.0 vector extension specification, and has for a while.
Using the 15.x release branch is definitely safe, and older release branches may also work. 30 | 31 | The binutils assembler used by GNU also supports the v1.0 vector extension since `the 2.38 release `_ (released Feb 2022). 32 | 33 | Compiler Support 34 | ---------------- 35 | 36 | Enabling Vector Codegen 37 | ======================= 38 | 39 | Vector code generation has been supported since (at least) LLVM 15.0. I've been told by gcc developers that upstream GNU does support vector code generation as of (at least) gcc 13. 40 | 41 | You need to make sure to explicitly tell the compiler (via `-mattr=...,+v`) to enable vector code lowering. If you don't, you may get compiler errors (i.e. use of vector types rejected), you may see code being scalarized by the backend, or (hopefully not) you may stumble across internal compiler errors. 42 | 43 | **Warning**: The flag mentioned above also has the effect of enabling auto-vectorization; if this is undesirable, consider `-fno-vectorize` and `-fno-slp-vectorize`. Vectorizer user documentation can be found `here `_. 44 | 45 | Intrinsics and Explicit Vector Code 46 | =================================== 47 | 48 | The `gcc vector extension syntax `_ is fully supported by both GCC and Clang. This is a good way of writing explicitly fixed length vector code in C/C++. 49 | 50 | The RVV C Intrinsic family is fully supported by Clang and GCC. This is currently the main way to write explicit length agnostic vector code. 51 | 52 | For clang, the `#pragma clang loop `_ directives can be used to override the vectorizer's default heuristics. This can be very useful for exploring the performance of various vectorization options. 53 | 54 | .. code:: 55 | 56 | // Let's force LMUL4 with an unroll factor of 2.
57 | #pragma clang loop vectorize(enable) 58 | #pragma clang loop interleave(enable) 59 | #pragma clang loop vectorize_width(8, scalable) 60 | #pragma clang loop interleave_count(2) 61 | for (unsigned i = 0; i < a_len; i++) 62 | a[i] += b; 63 | 64 | 65 | Auto-vectorization 66 | ================== 67 | 68 | I have been actively working to improve the state of auto-vectorization in LLVM for RISC-V. If you're curious about the details, see `my working notes `_. The following is a user focused summary of where we stand right now. This is an area of extremely heavy churn, so please keep in mind that this snapshot is very specific to the current moment in time, and is likely to continue changing. 69 | 70 | The LLVM 16 release branch contains all of the changes required for auto-vectorization via the LoopVectorizer with both scalable and fixed vectors when vector codegen is enabled (see above). 71 | 72 | For the SLPVectorizer, use of a very recent tip of tree is recommended. SLP has recently been enabled by default in trunk, and is on track to be enabled in the 17.x release series, but that's subject to change. If you're interested in this area, it is strongly recommended that you build LLVM from (very recent!) source. If you wish to enable SLP vectorization on the 16.x release branch for experimental purposes, you need to specify `-mllvm -riscv-v-slp-max-vf=0`. Note that this is an internal compiler option, and will not be supported. Any bugs found will only be fixed on tip-of-tree, and will not be backported. 73 | 74 | For gcc, patches to support auto-vectorization have recently started landing. There's very active development going on with multiple contributors, so the exact status is hard to track. Hopefully, the gcc-14 release notes will contain information about what is and is not supported. 75 | 76 | T-Head `has a custom toolchain `_ which may support vectorization as their processors include the v0.7 vector extension.
I have not confirmed this since a) all the documents are in Chinese, b) it requires an account to download, and c) I'm not interested in v0.7 anyways. 77 | 78 | If you wish to disable auto-vectorization for any reason, please consider `-fno-vectorize` and `-fno-slp-vectorize`. Vectorizer user documentation can be found `here `_. 79 | 80 | 81 | 82 | 83 | 84 | 85 | 86 | 87 | 88 | 89 | 90 | -------------------------------------------------------------------------------- /riscv/bp3-setup.rst: -------------------------------------------------------------------------------- 1 | ----------------------- 2 | Banana PI 3 Setup Notes 3 | ----------------------- 4 | 5 | This board was released in Q2 2024, and appears to be the first relatively widely available board with vector 1.0. The ISA is rva22u w/V and VLEN=256. This page contains my notes from trying to set one up as a development board. 6 | 7 | `BPi-F3 Datasheet `_, `Spacemit-K1 Datasheet `_ 8 | 9 | To purchase: `AliExpress `_, `Amazon `_ 10 | 11 | .. contents:: 12 | 13 | 14 | Download the OS Image 15 | --------------------- 16 | 17 | Directly from the vendor `here `_. Make sure you grab the SD card images. The easiest downloads are the google drive ones unless you speak Chinese. 18 | 19 | 20 | Format/Partition the SD Card 21 | ---------------------------- 22 | 23 | First, figure out which device corresponds to your SD card. The rest of this assumes it is `/dev/sda` -- make sure you change it for your environment! 24 | 25 | .. code:: 26 | 27 | lsblk 28 | 29 | Zero the entire SD card. Do not skip this step, or you will get weird hangs at boot time. 30 | 31 | .. code:: 32 | 33 | sudo dd if=/dev/zero of=/dev/sda bs=4096 status=progress 34 | 35 | Format the partition table, etc... (`For a lot more detail on this step `_.) 36 | 37 | ..
code:: 38 | 39 | sudo parted /dev/sda --script -- mklabel msdos 40 | sudo parted /dev/sda --script -- mkpart primary fat32 1MiB 100% 41 | sudo mkfs.vfat -F32 /dev/sda1 42 | sudo parted /dev/sda --script print 43 | 44 | Copy the contents of your img to the device. 45 | 46 | .. code:: 47 | 48 | sudo dd if=/home/preames/DevBoards/BananaPI/bianbu-23.10-desktop-k1-v1.0rc3-release-20240525133016.img of=/dev/sda status=progress bs=4M 49 | 50 | Use `gparted`, or the tool of your choice, to resize the final fat32 partition to fill the available space. If you skip this step, installing new packages will fail due to insufficient space. 51 | 52 | Boot It 53 | ------- 54 | 55 | I plugged in an HDMI monitor, keyboard, and mouse. Then I went through the initial setup to e.g. configure WiFi. After that, I switched to SSH login. 56 | 57 | If this step fails (i.e. hangs), go back and read the warnings in the zeroing and formatting sections again. 58 | 59 | Initial Setup 60 | ------------- 61 | 62 | Do the usual update: 63 | 64 | .. code:: 65 | 66 | sudo apt-get update 67 | sudo apt-get upgrade 68 | 69 | Install the usual packages: 70 | 71 | .. code:: 72 | 73 | sudo apt-get --assume-yes install emacs man-db libc6-dev dpkg-dev make build-essential binutils binutils-dev gcc g++ autoconf python3 git clang cmake patchutils ninja-build flex bison 74 | 75 | Getting `perf` working 76 | ---------------------- 77 | 78 | Do the following: 79 | 80 | .. code:: 81 | 82 | git clone https://github.com/BPI-SINOVOIP/pi-linux 83 | cd pi-linux 84 | uname --all 85 | # Checkout the right branch for your kernel version 86 | git checkout linux-6.1.15-k1 87 | pushd tools/perf/pmu-events 88 | ./jevents.py riscv arch pmu-events.c 89 | popd 90 | sudo apt install libelf-dev libdw-dev flex bison 91 | sudo make -C tools/ NO_LIBBPF=1 WERROR=0 prefix=/usr/local/ perf_install 92 | 93 | These instructions are inspired by `this blog post `_.
Note that I'm running on the `bianbu-23.10-desktop-k1-v1.0rc3-release-20240525133016` image, and that the default counter names appear to work for me in `perf stat`. The synthetic `instructions` and `cycles` events do not work with `perf record`. Reasonable proxies are `inst_issues` and `m_mode_cycles`. 94 | 95 | LLVM Native Build 96 | ----------------- 97 | 98 | You *can* do an LLVM native build on these, but it takes quite a while. The filesystem on the SD card is insanely slow, so just getting a git checkout in place took a while; starting from a zip file probably would have been a better idea. It might also have been a good idea to set up the MMC storage first. Due to memory limitations, you end up needing to build without parallelism (e.g. "ninja -j1"). Doing so takes a bit over 38 hours. I was able to build both the llvm 17 and llvm 18 release branches. 99 | 100 | I strongly recommend configuring to build only RISCV and the projects you actually need. Doing so greatly reduces the number of files built, and is the only thing which makes this vaguely practical. I built RISCV only, and only LLVM core + clang. 101 | 102 | Other References 103 | ---------------- 104 | 105 | https://dev.to/luzero/bringing-up-bpi-f3-part-1-3bm4 106 | -------------------------------------------------------------------------------- /riscv/isa-detection.rst: -------------------------------------------------------------------------------- 1 | ------------------------------- 2 | User Mode ISA Detection (RISCV) 3 | ------------------------------- 4 | 5 | .. contents:: 6 | 7 | 8 | AT_HWCAP 9 | -------- 10 | 11 | .. code:: 12 | 13 | #include <sys/auxv.h> 14 | #include <stdio.h> 15 | 16 | int main() { 17 | unsigned long hw_cap = getauxval(AT_HWCAP); 18 | for (int i = 0; i < 26; i++) { 19 | char Letter = 'A' + i; 20 | printf("%c %s\n", Letter, hw_cap & (1 << i) ?
"detected" : "not found"); 21 | } 22 | return 0; 23 | } 24 | 25 | Problems: 26 | 27 | * Only a few of the bits actually correspond to useful single letter extensions. The rest are wasted bits, and there's none for e.g. Zba. 28 | * Unclear specification. Exactly which *version* from which *specification document* does each bit correspond to? See `minutia `_ for some of the ambiguities. 29 | * V - Is this V1.0 or THeadVector? Custom kernels are known to report the latter as 'V'. 30 | 31 | riscv_hwprobe 32 | ------------- 33 | 34 | syscall 35 | `Documentation `_. See RISE's RISC-V Optimization Guide for `an example `_. As noted there, the syscall was added in 6.4. Attempting to use it on an earlier kernel will return ENOSYS. The syscall is a relatively cheap syscall, but you do have the transition overhead. Cost is probably something in the 1000s of instructions. 36 | 37 | vDSO 38 | Also added in 6.4 (so there is no kernel version with the syscall, but without the vDSO). Caches the key/value pairs for the intersection of the flags for all cpus. Will return results without a syscall if either a) all_cpus is queried (i.e. no cpu set given to the call) or b) the underlying system is homogeneous. The vDSO symbol name is `__vdso_riscv_hwprobe`. 39 | 40 | glibc 41 | The patch for the glibc wrapper has landed, but *is not yet released*. It is likely to be included in glibc 2.40 which is expected in Aug 2024, but should not be considered ABI stable until it is released. `glibc provides `_ two entry points `__riscv_hwprobe` and `__riscv_hwprobe_one` - an inline function specialized for the common single key case. Confusingly, the `__riscv_hwprobe` case inverts the error codes from the vDSO/kernel resulting in >0 being an error instead of <0. Thankfully, ==0 is success in all cases. 42 | 43 | bionic 44 | Appears to be the same status as glibc, with the change landed but not yet released for Android 15.
Note that bionic does not appear to implement the `__riscv_hwprobe_one` wrapper, but does appear to match the semantics of the `__riscv_hwprobe` case. 45 | 46 | qemu-user 47 | It is currently an open question whether the vDSO above is supported by qemu-riscv64. Initial testing seems to indicate no, but user error has not yet been disproven. 48 | -------------------------------------------------------------------------------- /riscv/whole-register-move-abi.rst: -------------------------------------------------------------------------------- 1 | ------------------------------------------------------ 2 | ABI Implication of vill and whole vector register move 3 | ------------------------------------------------------ 4 | 5 | Background 6 | ---------- 7 | 8 | The vector specification supports a whole register move instruction 9 | whose documented purpose is to enable register allocators to move 10 | vector register contents around without needing to track ``VL`` or 11 | ``VTYPE``. 12 | 13 | A `change was made to the specification `_ 14 | which requires hardware to report an illegal instruction exception 15 | if this instruction is executed with `vill` set. 16 | 17 | Existing SW implementations - in particular LLVM and GCC - were 18 | both implemented *before* this change was made, and implicitly 19 | assume the prior behavior. 20 | 21 | IMHO, an architecture which doesn't have an instruction to *just 22 | unconditionally copy a dang register* is broken at a level which 23 | just can't be saved; however, we will probably have to work around 24 | this in software regardless. 25 | 26 | See also discussion here: https://github.com/llvm/llvm-project/issues/114518 and https://gcc.gnu.org/bugzilla/show_bug.cgi?id=117544, and https://github.com/riscv-non-isa/riscv-elf-psabi-doc/pull/454. 27 | 28 | ABI Impact 29 | ---------- 30 | 31 | The current ABI says that ``VTYPE`` is *not* preserved across calls.
32 | Note that in particular, this means that `vill` can be set across 33 | any call boundary. 34 | 35 | Before the above change, that was fine because a whole register 36 | move did the sane thing regardless of vill. After the above change, 37 | this means that any use of a whole register move between a call 38 | boundary and the first dynamic vsetvli will potentially crash. 39 | 40 | In theory, this is a huge problem. However, up until recently 41 | the de facto usage of the ABI was such that a valid `VTYPE` usually 42 | did survive through call boundaries. 43 | 44 | Unfortunately, Linux kernel 6.5 included a `patch `_ which sets 45 | VILL=1 on all syscalls. This is legal per the documented ABI (see below), 46 | but has the effect of exposing the lurking problem. 47 | 48 | The other case where this can be triggered in practice is via a 49 | RISC-V vector intrinsic with a pass-thru operand called early in the 50 | execution of a program. Since we don't explicitly initialize `VTYPE` 51 | on the way to main, if we happen to use a whole register move to place 52 | the pass-thru value in the destination register (fairly common), we 53 | may insert the vsetvli *after* the whole register move, and end up with 54 | an exposed whole register move which can fault. This problem has 55 | been latent the whole time. 56 | 57 | Known Hardware 58 | -------------- 59 | 60 | Old Behavior (No Trap) 61 | 62 | * Spacemit-X60 on K1 63 | * C908 on K230 64 | * Likely (but unconfirmed) all other THead processors 65 | 66 | New Behavior (Trap) 67 | 68 | * Presumably SiFive, but no confirmed specific micro-architectures at this time 69 | 70 | Goals 71 | ----- 72 | 73 | Key Goals: 74 | 75 | * Acknowledge the existence and likely long-term prevalence of both trapping 76 | and non-trapping hardware. 77 | * Ensure that there is a correct-by-construction combination of compiler and 78 | libraries available for trapping hardware. 79 | * Not fork the ecosystem on this point.
Practically speaking, this requires 80 | that any mitigation doesn't impose a noticeable performance penalty when 81 | run on non-trapping hardware. 82 | * Having existing binaries continue to mostly work in practice is highly 83 | desirable, even if they were compiled without knowledge of any changes 84 | required to address this issue. 85 | 86 | Options 87 | ------- 88 | 89 | As of the 2024-11-21 LLVM RISCV Sync Up, the consensus appeared to be that 90 | we're going to pursue Option 2 - the compiler side changes. In offline 91 | discussion, it was revealed that some information was accidentally 92 | misreported, so this is still an open question. 93 | 94 | Option 1 - Change the ABI 95 | ========================= 96 | 97 | Since we're in the realm of making backwards-incompatible specification 98 | changes anyway, we can change the default calling convention in psABI 99 | in a couple of possible ways: 100 | 101 | * Require VTYPE to be non-vill on ABI boundaries. 102 | * Require VTYPE to be equally vill on ABI boundaries; that is, calls 103 | would have to preserve the single-bit state of whether vill was 104 | active. 105 | * Require VTYPE to be no more vill on return than on entry to the 106 | function. That is, a non-vill VTYPE on entry must be non-vill 107 | on exit, but a vill VTYPE on entry can become non-vill on exit. 108 | This would allow callees to unconditionally set VTYPE to any 109 | non-vill value. 110 | 111 | In all of these variants, VTYPE would remain otherwise unspecified and 112 | unpreserved. At the moment, variant three seems like the best option. 113 | 114 | This would require a kernel change, and we'd end up having to tell folks 115 | that vector code essentially didn't fully work on the unpatched kernel 116 | version. 117 | 118 | We'd also have to message and manage an ABI transition. Older binaries 119 | in the wild would not be guaranteed to work until recompiled.
Note that 120 | this is the same state we're in today, and have been for years, so this 121 | is a bigger problem in theory than in practice. 122 | 123 | This is my preferred option, but may be politically unpopular since 124 | it requires publicly admitting the retroactive change was actually 125 | a change. RVI has generally not wanted to do that in the past. 126 | 127 | It was pointed out that this requires VTYPE to be initialized during 128 | program startup, and that this would defeat the kernel's lazy state 129 | preservation optimization which avoids needing to spill vector state 130 | for programs which don't use vector - because vset* is a vector 131 | instruction. (Per Palmer, this may not be an issue depending on how 132 | a kernel change is handled. He's going to expand on a psABI issue 133 | filed in the near future - will link once available.) 134 | 135 | Option 2 - Enforce the ABI as written 136 | ===================================== 137 | 138 | This will require the compiler to insert a VSETVLI along any path 139 | from a call boundary to a whole register move. 140 | 141 | I can't speak for GCC, but for LLVM this is doable, if exceedingly 142 | ugly. It will likely result in otherwise spurious vsetvlis, and 143 | it's hard to say how much this matters performance-wise. 144 | 145 | We can do heroics particularly for LTO builds (internal ABIs with 146 | IPO and adapter frames anyone?), but it's hard to say if we can 147 | address all performance loss cases which arise in practice. 148 | 149 | As with the previous option, we would de facto have broken packages 150 | in the wild, and nothing would be guaranteed to work until all 151 | packages had been recompiled. Unlike the previous option, most 152 | of those packages wouldn't "just work" in practice.
153 | 154 | LLVM POC Patch: https://github.com/llvm/llvm-project/compare/main...preames:llvm-project:wip-vmv1r-depends-on-vtype-vill?expand=1 155 | 156 | Option 3 - Ignore it 157 | ==================== 158 | 159 | This is what we've been doing to date. 160 | 161 | Option 4 - Trap and Emulate 162 | =========================== 163 | 164 | We could have the kernel trap and emulate the instruction. This 165 | is arguably not crazy for a case where the specification changed. 166 | Since vsetvlis should be fairly common in vector code, this 167 | shouldn't be a hot trap case - unless someone is doing something 168 | weird like hot-looping around a syscall. 169 | 170 | This version basically represents treating the changed behavior 171 | as a SiFive erratum. Note that this will likely always disagree 172 | with the specification document. 173 | 174 | Option 4a - Change the Specification 175 | ==================================== 176 | 177 | Several folks have indicated a desire to reverse the change in 178 | the specification. I am sympathetic to this view, but don't 179 | believe such an effort to be politically viable. 180 | 181 | As an alternative, we might be able to propose a specification 182 | change (or maybe an extension?) which allows both the trapping 183 | and non-trapping behaviors. This wouldn't resolve any of the 184 | SW complexity mentioned above, but would at least mean that 185 | the vast majority of vector hardware on the planet wasn't 186 | retroactively considered "non-conformant". 187 | -------------------------------------------------------------------------------- /scev-exponential.rst: -------------------------------------------------------------------------------- 1 | .. header:: This is a collection of notes on a topic that may someday make it into a proposed extension to LLVM's SCEV, but that I have no plans to move forward with in the near future.
2 | 3 | ------------------------------------------------- 4 | SCEV, shifts, and M^N as a primitive 5 | ------------------------------------------------- 6 | 7 | .. contents:: 8 | 9 | Basic Idea 10 | =========== 11 | 12 | SCEV currently does not have a first class representation for bit operators (lshr, ashr, and shl). Particular special cases are represented with SCEVMulExpr and SCEVDivExpr, but in general, we fall back to wrapping the IR expression in a SCEVUnknown. 13 | 14 | It's interesting to note that all shifts can be represented as LHS*2^N or LHS/2^N. Unfortunately, we don't have a way to represent 2^N in SCEV for arbitrary N. 15 | 16 | We've also run into cases before where power functions specialized for fixed values of N arise naturally when canonicalizing SCEVs or handling unrolled loops involving products or shifts. As a result, commit 35b2a18eb96d0 added first class support in SCEVExpander for an optimized power emission. (Leaving the internal representation unchanged.) 17 | 18 | It's also interesting to note that a moderately wide family of non-add recurrences can be represented as an add recurrence multiplied by a power function. For instance, a mul recurrence which starts at y and multiplies by x on every iteration can, in my extended notation, be represented as y*x^<0,+,1>. The same idea can be extended to shl, ashr, and lshr, and some cases of udiv. 19 | 20 | Obvious Alternatives 21 | ==================== 22 | 23 | Don't represent at all 24 | ---------------------- 25 | 26 | This is simply the status quo, possibly with some enhancements to handle obvious cases during construction and range analysis, but without changing the SCEV expression language itself. 27 | 28 | Add explicit SCEVShlExpr, etc... 29 | --------------------------------- 30 | 31 | This improves our ability to reason about the expressions, but doesn't get us the ability to reason about recurrences unless we add explicit SCEVShlRecExpr expressions as well.
32 | 33 | 34 | 35 | 36 | 37 | 38 | 39 | 40 | 41 | 42 | -------------------------------------------------------------------------------- /talks-and-publications: -------------------------------------------------------------------------------- 1 | # Past Talks and Public Events 2 | 3 | An Update on Optimizing Multiple Exit Loops 4 | Philip Reames 5 | 2020 LLVM Virtual Developers Meeting 6 | 7 | Panel: Inter-procedural Optimization (IPO) 8 | Teresa Johnson, Philip Reames, Chandler Carruth, Johannes Doerfert 9 | 2019 LLVM Developers Meeting 10 | 11 | Panel: The Loop Optimization Working Group 12 | Kit Barton, Michael Kruse, Philip Reames, Hal Finkel 13 | 2019 LLVM Developers Meeting 14 | 15 | Falcon: An optimizing Java JIT 16 | Philip Reames 17 | Keynote 2017 LLVM Developers Meeting 18 | 19 | LLVM for a managed language: what we’ve learned 20 | Sanjoy Das and Philip Reames 21 | 2015 LLVM Developers Meeting 22 | 23 | Supporting Precise Relocating Garbage Collection in LLVM 24 | Sanjoy Das and Philip Reames 25 | 2014 LLVM Developers Meeting 26 | 27 | For all of the LLVM talks, search Google for the talk name + YouTube 28 | or navigate the historical conference pages for videos of the talks. 29 | 30 | # Publications 31 | 32 | For a list of publications, see the appropriate section on LinkedIn. 33 | https://www.linkedin.com/in/philip-reames/ 34 | 35 | # Professional Service 36 | 37 | Reviewer, 2020 LLVM Developers Conference 38 | 39 | ERC, ISMM 2020 40 | -------------------------------------------------------------------------------- /vectorization-reference.rst: -------------------------------------------------------------------------------- 1 | ----------------------------------- 2 | Vectorization - Ideas and Reference 3 | ----------------------------------- 4 | 5 | This page is intended as a reference on various vectorization related topics I find myself needing to explain/describe/reference on a somewhat regular basis. 6 | 7 | ..
contents:: 8 | 9 | 10 | Trip Count Required for Profitability 11 | ------------------------------------- 12 | 13 | For a given loop on a given piece of hardware, there is some minimum number of iterations required before a given vector loop is profitable over the equivalent scalar loop. This is highly dependent both on microarchitecture and vectorization strategy. 14 | 15 | On most hardware, the tradeoff point will be in the high single digits to low double digits of iterations. It is common for tradeoff points for floating point loops to be lower than for integer loops. This happens because modern OOO processors frequently have more integer than FP resources, and may even share the vector and FP resources. 16 | 17 | It may also be the case that vectorization is profitable for some trip count TC, but not for TC + 1. This happens due to the need to handle tail iterations, and the overheads involved. *Usually* if TC + 1 is not profitable, it's at least "close to" the scalar loop in performance, but exceptions do exist. 18 | 19 | 20 | Tail Handling 21 | ------------- 22 | 23 | Given a loop with a trip count (TC), and a desirable vectorization factor (VF), it is unlikely that TC % VF == 0. You could artificially limit vectorization to cases where this property holds, but in practice that disallows all loops with runtime trip counts, so we don't. The remainder iterations are often called the "tail iterations". 24 | 25 | We have a couple of major options here: 26 | 27 | * Use a scalar loop to handle the remainder elements. 28 | * Use a predicated straight-line sequence. 29 | * Fold the predication into the main loop body. 30 | * Use a loop with a smaller vector factor, possibly combined with the previous choices. 31 | 32 | The assumption for the remainder of this section is that we run TC - TC % VF iterations of the original loop in our newly inserted vector loop, and are figuring out how to execute the remaining TC % VF iterations.
33 | 34 | Note that the problem we're talking about here is a particular form of iteration set splitting, combined with some profitability heuristics. Remembering this point is helpful when reasoning about correctness. 35 | 36 | **Scalar Epilogue** 37 | 38 | The simplest technique is to simply keep the scalar loop, and branch to it with incoming induction variable values which reflect the iterations completed in the vector body. 39 | 40 | This technique is advantageous when the scalar loop is required for other reasons (e.g. as a runtime aliasing check fallback), or the number of remaining iterations is expected to be short. As such, it tends to be most useful when the chosen VL is relatively small. 41 | 42 | Note that all the usual loop optimization choices apply, but with the additional known fact that the new loop's trip count is < VF. This (for instance) may result in extremely different loop unrolling profitability. 43 | 44 | **Predicated (Non-Loop) Epilogue** 45 | 46 | You can generate a single peeled vector iteration with predication. This can be done with either software or hardware techniques (see below), but is usually only profitable with hardware support. 47 | 48 | This technique works well when the remainder is expected to be large relative to the original VF. 49 | 50 | **Tail Folded (Predicated) Main Loop** 51 | 52 | As with the previous item, but with the extra iteration folded into the main vector body. This involves computing a predicate on every iteration of the main loop, and predicating any required instructions. 53 | 54 | On most hardware I'm aware of, a tail folded loop will run slower than the plain vector loop without predication. So for very long trip counts, this can be non-optimal. The tradeoff is (usually) a massive reduction in code size. 55 | 56 | This choice can also hurt for short-running loops. If TC is significantly lower than VF, then a scalar loop might be significantly better.
57 | 58 | Note that in the case I'm describing here, all but the last iteration of the vector loop run the same number of iterations of the original loop. There's a related idea (called EVL vectorization in LLVM) where this property changes. 59 | 60 | **Epilogue Vectorization** 61 | 62 | After performing the iteration set splitting for the original loop using our chosen vector factor, we can choose some other small vectorization factor - call it VF2 - and vectorize the remainder loop. In principle, you can keep doing this recursively, but in practice, we tend to stop after the second. Since the second vector.body may still have tail iterations, you need to pick one of the other techniques here as a fallback. 63 | 64 | The benefit of this technique is that you have *multiple* vector bodies, each with independent VF, and can dispatch between them at runtime. 65 | 66 | The only case in which LLVM uses this technique is when VF is large compared to the expected TC. Specifically, some cases on AVX512 fall back to AVX2 loops. 67 | 68 | 69 | Forms of Predication 70 | -------------------- 71 | 72 | Predication is a general technique for "masking off" (that is, disabling) individual lanes or *otherwise avoiding faults in inactive lanes*. Note that depending on usage, the result of the operations on the inactive lanes may either be defined (usually preserved, but not always) or undefined. 73 | 74 | I tend to only be interested in the case where the result is undefined as that's the one which arises naturally in compilers. Our goal is basically to avoid faults on inactive lanes, and nothing else. 75 | 76 | There are both hardware and software predication techniques! 77 | 78 | The most common form of hardware predication is "mask predication" where you have a per-lane mask which indicates whether a particular lane is active or not. 79 | 80 | RISCV also supports "VL predication", "VSTART predication", and "LMUL predication". (These are my terms, not standardized.)
Each of them provides a way to deactivate some subset of the lanes. Out of them, only VL predication is likely to be performant in any way; please don't use the others. 81 | 82 | In software, the usual tactic is called "masking" or "masking off". The basic idea is that you conditionally select a known safe (but otherwise useless) value for the inactive lanes. For a divide, this might look like "udiv X, (select mask, Y, 1)". For a load, this might be "LD (select mask, P, SomeDereferenceableGlobal)". There is no masking technique for stores. 83 | 84 | There's also an entire family of related techniques for proving speculation safety (i.e. the absence of faults on inactive lanes *without* masking). This isn't predication per se, but comes up in all the same cases, and is (almost) always profitable over any form of predication. 85 | -------------------------------------------------------------------------------- /virtual-machines.rst: -------------------------------------------------------------------------------- 1 | Safepoints and Checkpoints are Yield Points 2 | -------------------------------------------- 3 | 4 | I recently read an `interesting survey of GC API design alternatives available for Rust `_, and it included a comment about the duality of async and garbage collection safepoints which got me thinking. 5 | 6 | Language runtimes are full of situations where one thread needs to ask another to perform some action on its behalf. Consider the following examples: 7 | 8 | * Assumption invalidation - When a speculative assumption has been violated, a runtime needs to evict all running threads from previously compiled code which makes the about-to-be-invalidated assumption. To do so, it must arrange for all other threads to remove the given piece of compiled code from its execution stack. 9 | * Locking protocols - It is common to optimize the case where only a single thread is interacting with a locked object.
In order for another thread to interact with the lock, it may need the currently owning thread to perform a compensation action on its behalf. 10 | * Garbage collection - The garbage collector needs the running mutator threads to scan their stacks and report a list of garbage collected objects which are rooted by the given thread. 11 | * Debuggers and profilers - Tools frequently need to stop a thread and ask it to report information about its thread context. Depending on the information required, this may be possible without (much) cooperation from the running thread, but the more involved the information, the more likely we need the queried thread to be in a known state or otherwise cooperate with the execution of the query. 12 | 13 | Interestingly, these are all forms of cooperative thread preemption (i.e. cooperative multi-threading). The currently running task is interrupted, another task is scheduled, and then the original task is rescheduled once the interrupting task is complete. (To avoid confusion, let's be explicit about the fact that it's semantically the *abstract machine thread* being interrupted and resumed. The physical execution may look quite different once abstract machine execution resumes for the interrupted thread.) 14 | 15 | Beyond these preemption examples, there are also a number of cases where a single thread needs to transition from one physical execution context to another. Conceptually, these transitions don't change the abstract machine state of the running thread, but we can consider them preemption points as well by modeling the transition code which switches from one physical execution context to another as being another conceptual thread. 16 | 17 | Consider the "on stack replacement" and "uncommon trap" or "side exit" transitions. The former transitions a thread from running in an interpreter to running compiled code; the latter does the inverse.
There's usually a non-trivial amount of VM code which runs between the two pieces of abstract machine execution to do e.g. argument marshalling and frame setup. We can consider there to be two conceptual threads: the abstract machine thread, and the "transition thread" which is trying to transition the code from one mode of execution to another. The abstract machine thread reaches a logical preemption point, transitions control to the transition thread, and then the transition thread returns control to the abstract machine thread (but running in another physical tier of execution.) 18 | 19 | It is worth highlighting that while this is cooperative preemption, it is not *generalized* cooperative preemption. That is, the code being transitioned to at a preemption point is not arbitrary. In fact, there are usually very strong semantic restrictions on what it can do. These restricted semantics allow an optimizing compiler to optimize the generated code around these potential preemption points in interesting ways. 20 | 21 | It is worth noting that (at least in theory) the abstract machine thread may have different sets of preemption points for each class of preempting thread. (Said differently, maybe not all of your lock protocol preemption points allow general deoptimization, or maybe your GC safepoints don't allow lock manipulation.) This is quite difficult to take advantage of in practice - mostly because maintaining any form of timeliness guarantee gets complicated if you have unbounded preempting tasks and don't have the ability to preempt them in turn - but at least in theory the flexibility exists. 22 | 23 | This observation raises some interesting possibilities for implementing safepoints and checkpoints in a compiler. There's a lot of work on compiling continuations and generators; I wonder if anyone has explored what falls out if we view a safepoint as just another form of yield point?
Thinking through how you might leverage CPS-style optimization tricks in such a model would be quite interesting. (This may have already been explored in the academic literature; CPS compilation isn't an area I follow closely.) 24 | 25 | 26 | Tier transitions as tail calls 27 | ------------------------------- 28 | 29 | When transitioning from one tier of execution to another, we need to replace one executing frame with another. For example, when deoptimizing from compiled code to the interpreter you need to replace the frame of the executing function with an interpreter frame and set up the interpreter state so that execution of the abstract machine can resume in the same place. 30 | 31 | (For purposes of this discussion, we're only considering the case where the frame being replaced is the last one on the stack. The general case can be handled in multiple ways, but is beyond the scope of this note.) 32 | 33 | Worth noting is that this frame replacement is simply a form of guaranteed tail call. The source frame is essentially making a tail call to another function with the abstract machine state as arguments. (For clarity, the functions here are *physical* functions, not functions in the abstract machine language.) This observation is mostly useful from a design-for-testability perspective and, potentially, code reuse. If the abstract machine includes tail calls (either guaranteed or optional), the same logic can be used to implement both. 34 | 35 | You could generate a unique runtime stub per call site layout and abstract machine state signature pair. If you're using a compiler toolchain which supports guaranteed tail calls - like LLVM - generating such a stub is fairly trivial. (Note: Historically, many VMs hand-rolled assembly for such cases. Don't do this!) 36 | 37 | If you start down this path, you frequently find you have a tradeoff to make between number of stubs (e.g. cold code size), allowed call site layouts (e.g.
performance of compiled code), and distinct abstract machine layouts (e.g. your interpreter and abstract language designs). A common technique which is used to sidestep this tradeoff is to allow the compiler to describe the callsite layout (e.g. which arguments are where) in a side table which is interpreted at runtime, as opposed to a function signature interpreted at stub generation time. 38 | 39 | Let's explore what that looks like. 40 | 41 | Information about the values making up the abstract machine state is recorded in a side table. The key bit is that minimal constraints are placed on where the values are; as long as the side table can describe it, that's a valid placement. This may sound analogous to debug information, but there's an important semantic distinction. Unlike debug info which often has a "best effort" semantic, this is *required* to be preserved for correctness. 42 | 43 | At runtime, there's a piece of code which copies values from wherever they might have been in the executing code to the desired target location. This often ends up taking the form of a small interpreter for the domain specific language (DSL) the side table can describe. 44 | 45 | In LLVM, the information about the callsite is represented via the "deopt" bundle on the call. During lowering, values in this bundle are allowed to be assigned freely provided the location is expressible in the stackmap section which is generated into the object file. LLVM does not provide a runtime mechanism for interpreting a stackmap section; that's left as an exercise for the caller. Note that the name "stackmap" comes from the fact that garbage collection information is stored in the same table. This is an artifact of the implementation. 46 | 47 | A couple of observations on the scheme above. 48 | 49 | * You'll note that we're replacing a stub containing assembly optimized for a *particular* source/target layout pair with a generic interpreter. This is obviously slower.
That's okay for our use case as deoptimization is assumed to be rare. As a result, smaller code size is worth the potential slowdown on the deoptimization path. 50 | * There's a data size impact for the side table. In general, these tend to be highly compressible, and are rarely accessed. Given that, it's common to use custom binary formats which fit the details of the host virtual machine. There are some interesting possibilities for using generic compression techniques, but to my knowledge, this has not been explored. 51 | * There's a design space worth exploring around the expressibility of the side table. The more general such a table is - and thus the more complicated our runtime interpreter is - the less impact we have on the code layout for our compiled code. In practice, representing constants, registers, and stack locations appears to get most of the low-hanging fruit, but this could in principle be pushed quite far. 52 | * It's worth noting that this is a *generic* mechanism to perform *any* (possibly tail) call. I'm not aware of this being used outside of virtual machine implementations, but in theory, a sufficiently advanced compiler could use this for any sufficiently cold call site. 53 | * This may go without saying, but it's important to have a mechanism for forcing deoptimization from your source language for testing purposes. (e.g. being able to express a "deopt here!" command in a test.) Corner cases in deoptimization mechanisms tend to be hard to debug since they are by definition rare. You really want both a way to write test cases, and (ideally) a way to fuzz. 54 | 55 | --------------------------------------------------------------------------------