├── .gitignore ├── .travis.yml ├── chap-0-intro.md ├── chap-1-compiler-basics.md ├── chap-2-llvm-basics.md ├── chap-3-lexer.md ├── chap-4-parser.md ├── chap-5-code-generator.md ├── chap-6-summary.md ├── chap-7-if-else.md ├── chap-8-function-declarations.md ├── chap-9-while-loops.md ├── demo_file.cr ├── diagrams ├── img │ ├── BDMAS_1.png │ ├── BDMAS_2.png │ ├── BDMAS_3.png │ ├── BDMAS_4.png │ ├── BDMAS_5.png │ ├── BDMAS_6.png │ ├── BDMAS_7.png │ ├── BDMAS_8.png │ ├── Emerald_Architecture.png │ ├── LLVM_Architecture.png │ ├── demo_output.png │ ├── if_else_to_ir.png │ ├── lexer_basic.png │ ├── parser_basic.png │ └── while_to_ir.png └── xml_archive │ ├── Emerald_Architecture.xml │ ├── LLVM_Architecture.xml │ ├── Lexer.xml │ ├── Parser.xml │ └── Parsing BDMAS Walkthrough.xml ├── emeraldc.cr ├── example_clang ├── main.c ├── main2.c ├── main2.ll └── readme.md ├── example_ir ├── example_1.ll ├── example_2.ll ├── example_3.ll ├── example_4.cr ├── example_4.ll └── readme.md ├── license ├── readme.md ├── spec ├── errors_spec.cr ├── floats_spec.cr ├── full_integration_spec.cr ├── functions_2_spec.cr ├── functions_spec.cr ├── generator_spec.cr ├── if_statements_spec.cr ├── int64_spec.cr ├── lexer_spec.cr ├── parser_more_examples_spec.cr ├── parser_order_of_op_spec.cr ├── parser_spec.cr ├── strings_spec.cr ├── value_resolution_spec.cr ├── variables_and_literals_spec.cr └── while_spec.cr ├── src └── emerald │ ├── close_statements.cr │ ├── emerald.cr │ ├── error.cr │ ├── lexer.cr │ ├── nodes │ ├── basic_block_node.cr │ ├── binary_operator_node.cr │ ├── call_expression_node.cr │ ├── declaration_reference_node.cr │ ├── expression_node.cr │ ├── function_declaration_node.cr │ ├── if_expression_node.cr │ ├── literal_nodes.cr │ ├── node.cr │ ├── return_node.cr │ ├── root_node.cr │ ├── variable_declaration_node.cr │ └── while_expression_node.cr │ ├── parser.cr │ ├── state.cr │ ├── token.cr │ ├── types.cr │ └── verifier.cr ├── std-lib-opt.ll ├── std-lib.ll ├── test_inputs ├── floats.cr 
├── full_integration.cr ├── full_integration_2.cr ├── full_integration_3.cr ├── functions.cr ├── functions_2.cr ├── if_statements_1.cr ├── if_statements_2.cr ├── int64.cr ├── strings.cr ├── value_resolution_1.cr ├── value_resolution_2.cr ├── variables_and_literals.cr └── while.cr └── test_outputs ├── floats ├── full_integration ├── full_integration_2 ├── full_integration_3 ├── functions ├── functions_2 ├── if_statements_1 ├── if_statements_2 ├── int64 ├── strings ├── value_resolution_1 ├── value_resolution_2 ├── variables_and_literals └── while /.gitignore: -------------------------------------------------------------------------------- 1 | .DS_Store 2 | note.cr 3 | output.ll 4 | output.s 5 | output 6 | emerald/output.ll 7 | emerald/output.s 8 | emerald/output 9 | spec/output.ll 10 | spec/output.s 11 | spec/output 12 | emeraldc 13 | emeraldc.dwarf 14 | std-lib.s 15 | std-lib-opt.s 16 | diagrams/psd/* -------------------------------------------------------------------------------- /.travis.yml: -------------------------------------------------------------------------------- 1 | language: crystal 2 | 3 | sudo: false 4 | 5 | os: 6 | - osx 7 | 8 | before_install: | 9 | export LLVM_CONFIG=/usr/local/opt/llvm@6/bin/llvm-config 10 | export PATH="/usr/local/opt/llvm@6/bin:$PATH" 11 | before_script: | 12 | ls -l /usr/bin | grep llvm-config 13 | crystal build emeraldc.cr -------------------------------------------------------------------------------- /chap-0-intro.md: -------------------------------------------------------------------------------- 1 | # Chapter 0 Introduction 2 | 3 | In this tutorial we will work together to write a compiler for a simple toy programming language. Disclaimer, we make no claims of performance, safety, or functionality. The main objective will be to better understand how the LLVM api works when building a front-end and to better understand how compilers work in general. 
Maybe you just want to satisfy your curiosity, or maybe you genuinely want to build the next great programming language of the future. In either case, I hope this tutorial will be of great value to you. 4 | 5 | We are going to write the compiler using Crystal, and there are two major reasons for this decision. Primarily, Crystal offers very clean syntax and comes with excellent LLVM bindings by default. I want all functionality in our toy compiler to be explicit, and easy to understand, debug, test, and monitor. With Crystal, I can easily ensure all the code is clear and concise and the bindings will stay out of our way. Secondarily, because Crystal is itself an LLVM front-end, there is a ton of information and examples of using the LLVM bindings directly in the Crystal source code. I highly recommend you spend some time reading the Crystal compiler source code before/during/after reading this tutorial. 6 | 7 | The language will be imperative, statically typed, and able to compile to object code callable from C. It will discourage punctuation usage and strive to be explicit while terse. We will start by parsing everything at the top level, and gradually incorporate control flow and nested expressions, expanding the initially sparse syntax. 8 | 9 | We will call our toy language Emerald to honor both Crystal and Ruby. Further, the syntax will also be a major nod to both languages. Here is a snippet of our initial goal, showing some of the basic syntax elements. 10 | ```ruby 11 | # I am a comment! 12 | four = 2 + 2 13 | puts four 14 | puts 10 < 6 15 | puts 11 != 10 16 | ``` 17 | 18 | While the above example may look simple, it is going to require us to cover some serious ground in our understanding of the LLVM API. Already our simple syntax is going to require variables, a "built-in" puts command, and binary operators. We will need to be able to parse the structure of input files, and understand the order of operations and expression context.
But do not be discouraged. We are going to tackle this in easily digestible pieces. Once we have a solid foundation, we can gradually extend our language with more powerful features. 19 | 20 | ### Lookahead 21 | 22 | Information 23 | 24 | [Chapter 1 - Compiler Basics](https://github.com/Virtual-Machine/llvm-tutorial-book/blob/master/chap-1-compiler-basics.md) -- Partial 25 | 26 | [Chapter 2 - LLVM Basics](https://github.com/Virtual-Machine/llvm-tutorial-book/blob/master/chap-2-llvm-basics.md) -- Partial 27 | 28 | Basic Architecture 29 | 30 | [Chapter 3 - Lexer](https://github.com/Virtual-Machine/llvm-tutorial-book/blob/master/chap-3-lexer.md) -- Partial 31 | 32 | [Chapter 4 - Parser](https://github.com/Virtual-Machine/llvm-tutorial-book/blob/master/chap-4-parser.md) -- Partial 33 | 34 | [Chapter 5 - Code Generator](https://github.com/Virtual-Machine/llvm-tutorial-book/blob/master/chap-5-code-generator.md) -- Incomplete 35 | 36 | [Chapter 6 - Summary](https://github.com/Virtual-Machine/llvm-tutorial-book/blob/master/chap-6-summary.md) -- Incomplete 37 | 38 | Advanced Architecture 39 | 40 | [Chapter 7 - Implementing If/Else](https://github.com/Virtual-Machine/llvm-tutorial-book/blob/master/chap-7-if-else.md) -- Incomplete 41 | 42 | [Chapter 8 - Implementing Function Declarations](https://github.com/Virtual-Machine/llvm-tutorial-book/blob/master/chap-8-function-declarations.md) -- Incomplete 43 | 44 | [Chapter 9 - Implementing Loops](https://github.com/Virtual-Machine/llvm-tutorial-book/blob/master/chap-9-while-loops.md) 45 | -- Incomplete 46 | 47 | ### Diagrams 48 | 49 | Emerald Architecture 50 | 51 | ![Emerald Architecture](https://raw.githubusercontent.com/Virtual-Machine/llvm-tutorial-book/master/diagrams/img/Emerald_Architecture.png) 52 | 53 | LLVM Architecture 54 | 55 | ![LLVM Architecture](https://raw.githubusercontent.com/Virtual-Machine/llvm-tutorial-book/master/diagrams/img/LLVM_Architecture.png) 56 |
-------------------------------------------------------------------------------- /chap-1-compiler-basics.md: -------------------------------------------------------------------------------- 1 | # Chapter 1 Compiler Basics 2 | 3 | This chapter serves as a very basic crash course in compilers. This is going to be very explicit and to the point. Please feel free to fill this information in with lots of other sources when you get the chance! 4 | 5 | Here are the basics of what you need to know. 6 | 7 | ### Steps of Standard Compiler Usage 8 | 9 | 1. User has a file with some source code the compiler understands. 10 | 2. User runs the compiler executable on the source code file. 11 | 3. The compiler "lexes" the file into a token array. 12 | 4. The compiler "parses" the token array into an AST. 13 | 5. The compiler "code generates" the AST into machine code or intermediate language. 14 | 6. The intermediate language is converted to machine code if necessary and the user can then run the native machine code. 15 | 16 | This is a gross simplification; however, it may introduce concepts that are new to you. We will cover the terminology and details here. 17 | 18 | **Lexing/Lexer**: A lexer is code designed to perform lexing. Lexing is the process of reading source code as an array of characters, and producing an array of tokens. The lexer decides where one keyword, identifier, variable, or other syntax component ends and the next one starts. Every language has the ability to decide exactly how characters are grouped and delineated to form tokens. Some languages use symbols like ';' to delineate the end of lines, while some languages use the newline escape sequence, "\n". Most languages use spaces to delineate tokens, but the lexer can be programmed to recognize any grouping and order of characters to generate the final array of tokens.
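The character-grouping idea above can be sketched in a few lines of Crystal. This is a hypothetical toy scanner, not Emerald's actual lexer: it splits on spaces and emits an explicit newline token, but it produces bare strings rather than real Token objects.

```crystal
# Hypothetical sketch: split a line of source characters into raw token
# strings. Spaces end the current token; newlines additionally emit an
# explicit "\n" token so later stages can see line boundaries.
def scan(source : String) : Array(String)
  tokens = [] of String
  current = ""
  source.each_char do |char|
    case char
    when ' '
      tokens << current unless current.empty?
      current = ""
    when '\n'
      tokens << current unless current.empty?
      current = ""
      tokens << "\n"
    else
      current += char
    end
  end
  tokens << current unless current.empty?
  tokens
end

scan("four = 2 + 2\nputs four")
# => ["four", "=", "2", "+", "2", "\n", "puts", "four"]
```

A real lexer would additionally classify each string (keyword, identifier, literal, operator) and record line/column positions, as described above.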
19 | 20 | **Tokens**: A token is simply a symbolic representation of exactly one keyword, identifier, variable, literal, or other component in the language. In its most basic form, a token will hold the value of the given token as well as its semantic function in the language. It is useful for tokens to also carry their positional location in the file for later reference by the compiler. Tokens are the output of the lexer and the input of the parser. 21 | 22 | **Parsing/Parser**: The parser is responsible for taking the array of tokens produced by the lexer and generating what is known as an AST (Abstract Syntax Tree). The parser's job is actually twofold. First, it must make sense of the sequence of tokens it receives to produce valid expressions in the form of AST nodes. Because of this functionality, the parser is going to find mistakes if they exist in the source file. Therefore, the parser's secondary function is to identify syntax errors in the source code of the input file and notify the user. 23 | 24 | **AST (Abstract Syntax Tree)**: The AST is a tree-like representation of the program code, structured in such a way that the code generator can walk through the nodes to eventually produce machine code that is directly executable. Walking an AST is simply the process of moving along the nodes of the tree from top to bottom, parent nodes to child nodes and back. The AST nodes hold references to other nodes, making it possible to 'walk' the syntax of the language. These nodes typically also carry location information to make error messages useful to developers in the event of syntax errors. 25 | 26 | **Code Generator**: The code generator takes the AST as input and produces intermediate or machine code. The code generator "walks" the nodes of the AST, using the references it has to other nodes to generate the necessary instructions in the output code.
Depending on the compiler architecture, you may need to assemble your output intermediate representation to machine code prior to the final execution. 27 | 28 | ### Advanced Compiler Stages 29 | 30 | If we want to get more advanced we can add two more stages to this process. The first advanced stage would be an AST-simplifying step. This would occur between parsing and code generation. The job of the AST simplification stage is to walk the nodes of the AST and look for expressions that can be evaluated at compile time to single nodes. The more nodes that can be collapsed during this stage, the less work the code generator has to perform and the fewer calculations required at run time. This can definitely be viewed as a code optimization. 31 | 32 | The second advanced stage is an actual optimization step. Principally, these optimizations are run during or after code generation and are also sometimes performed at link time. The goal of this step is to look for patterns in the generated machine or assembly code that will allow simplifications to the code without affecting the final result. Depending on your needs, this step can be tuned between compile-time and run-time speed and between performance and safety. 33 | 34 | In our toy example we will be using some AST simplification techniques as required to make use of the builder API, but we will not be spending any time with explicit optimizations. Feel free to use the toy example as a means to experiment with LLVM's optimizations and better understand how they manipulate the code to improve performance. If you compare source code to generated LLVM IR with optimizations enabled, you will notice that the optimizations can be quite effective at turning function bodies into inline values and removing unnecessary calculations from statements.
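To make the AST-simplification idea concrete, here is a hedged sketch in Crystal of folding a constant addition into a single literal node before code generation. The node class names here are invented for illustration; they are not Emerald's actual node types.

```crystal
# Hypothetical node types, for illustration only.
abstract class Node
end

class IntLiteral < Node
  getter value : Int32

  def initialize(@value); end
end

class BinaryOp < Node
  getter op : String
  getter lhs : Node
  getter rhs : Node

  def initialize(@op, @lhs, @rhs); end
end

# Recursively collapse "literal + literal" subtrees into a single
# literal node at compile time.
def simplify(node : Node) : Node
  return node unless node.is_a?(BinaryOp)
  lhs = simplify(node.lhs)
  rhs = simplify(node.rhs)
  if lhs.is_a?(IntLiteral) && rhs.is_a?(IntLiteral) && node.op == "+"
    IntLiteral.new(lhs.value + rhs.value)
  else
    BinaryOp.new(node.op, lhs, rhs)
  end
end

tree = BinaryOp.new("+", IntLiteral.new(2), IntLiteral.new(2))
simplify(tree) # collapses the whole tree to a single IntLiteral of 4
```

The same walk-and-replace pattern extends naturally to other operators and to nested expressions, since inner subtrees are simplified before their parents are inspected.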
35 | 36 | #### Next 37 | [Chapter 2 - LLVM Basics](https://github.com/Virtual-Machine/llvm-tutorial-book/blob/master/chap-2-llvm-basics.md) -------------------------------------------------------------------------------- /chap-2-llvm-basics.md: -------------------------------------------------------------------------------- 1 | # Chapter 2 LLVM Basics 2 | 3 | This chapter is a crash course in LLVM's most basic concepts and terminologies. This is by no means complete, but it will be enough to give you a decent idea of how LLVM works and how you will be using its API. 4 | 5 | First and foremost, we will be putting some blinders on and using LLVM in a very simplistic way. Despite the fact that LLVM exposes a very detailed modular API with lots of control at all stages of code compilation, we are going to take a very lazy, and in some cases perhaps even naive, approach to working with it. The benefits of this approach are: A) it is more rewarding to get to a working stage quickly, and B) LLVM has lots of tooling that will make our naive code run plenty efficiently for the time being. Once we understand the basics, then we can begin to add more advanced techniques to our approach. 6 | 7 | So what do I mean by us taking a lazy approach? What I mean is that we are going to let the LLVM IR builder API do the heavy lifting for us, and we are not going to spend much, if any, time tinkering with the generated IR other than applying the standard optimizations LLVM offers. This means if our lexer and parser can generate an AST of nodes that the LLVM IR builder API understands, we are essentially done. The only remaining work will be to call the builder API with the correct references to each of our nodes. 8 | 9 | So what do I mean by us taking a naive approach? Full disclaimer: I am very much still a student of compilers and LLVM, and I am probably doing many things that a compiler/LLVM expert would consider naive or ignorant.
Also, in the interest of enlightening people without adding unnecessary confusion, I will try to keep things extremely simplistic. This means I will try to avoid using indirection, complex abstractions, inheritance, and implicit behaviour in the compiler, even if the end result is more verbose code. The code may not follow all the best practices, but it will be easy to read and understand. 10 | 11 | ### General LLVM Information 12 | 13 | Here are some gross simplifications that should help you get started with LLVM. Fill your knowledge in with more details as it becomes necessary. 14 | 15 | 1. The main unit of grouping in LLVM is the module. You can have several modules in a program, and each module will contain functions, global variables and an externalized interface. In our simplistic, naive approach we will never use more than one module, but note that it is possible. 16 | 2. Everything inside a module is an LLVM Value descendant. These include but are not limited to functions, blocks, expressions, instructions, etc. A nice way to visually think of this is: **Module** can have **Functions** which have one or more **BasicBlock** which consist of **Instructions**. Values are a way for each component to reference the others and are the basis for compile-time mathematic calculations. 17 | 3. BasicBlocks are an important concept. BasicBlocks are the cornerstone of the IR builder API and for good reason. A BasicBlock is simply a list of instructions that can only be executed from first to last in order, with no control flow. Think of a function body with no logic statements or jumps and you are likely looking at a BasicBlock. 18 | 4. A function is basically a block of code that accepts a given list of typed parameters and returns a typed value. LLVM views it the same way. We will be initially running all our code inside a main function that we will initialize by giving it a C main interface.
A simplified C main function takes no parameters and returns an integer. Therefore in LLVM we say that it will take an empty array of LLVM type values, and return an LLVM 32-bit integer. 19 | 5. LLVM type values are exactly what they sound like: LLVM's internal representation of type as related to an LLVM Value object. This is the system through which your code can be statically typed and compiled to object code callable from C. By giving our main function a C interface using LLVM types, and by flagging our main function with LLVM::Linkage::External, we have allowed linkers to identify our object code's main function as if it were compiled from C. 20 | 6. The LLVM builder API has the notion of position. A given builder has to be directed where it will be appending new instructions. In the case of our initial simplified approach we will be appending all our instructions to the BasicBlock of the main function. As you can imagine, this sets up the primary means of constructing function bodies across multiple functions in a module. To add control flow, loops, and more functions, we need to track the basic blocks of our module and append the instructions into the correct blocks. 21 | 7. Finally, once you are finished compiling the instructions into your module, you will want to dump the output to an LLVM IR .ll file. The resultant [name_of_file].ll can now be treated like any LLVM IR, as if it were just compiled straight from C. This includes all the optimizations and plug-ins available in the LLVM architecture. It is also ready to be compiled to object code and linked with any other object code compiled from other sources. The .ll file can theoretically be compiled to any target architecture so long as the LLVM IR is not doing anything machine specific. 22 | 8.
Because our example is compiling instructions into a main function, if we execute the compiled and linked version of our output (aka the machine binary), it should immediately invoke the main function and we should see the results of our instructions. 23 | 9. LLVM is a low-level machine, therefore it doesn't have high-level types by default. However, there is nothing stopping you from adding your own types as aliases or structured combinations of low-level built-in types. In our toy example we will not make much use of this power, but know that it exists and can be used to create more powerful containers for values, such as Crystal's or C++'s string types. In our example we will simply treat strings like C strings, terminated by a null byte, which LLVM treats as an i8* (8-bit integer pointer). This means that all string values in our program will be global string values, passed by pointer. 24 | 25 | 26 | ### LLVM IR instructions 27 | 28 | To get you started quickly, here is a quick glossary of some LLVM IR instructions and what they do: 29 | 30 | ```llvm 31 | ;alloca - reserve space in memory for typed variable 32 | 33 | %fourp = alloca i32 34 | 35 | ;store - put value into allocated memory 36 | 37 | store i32 4, i32* %fourp 38 | 39 | ;load - get value stored in allocated memory 40 | 41 | %value = load i32, i32* %fourp 42 | 43 | ;getelementptr - get a pointer to a subelement, 44 | ;- useful for converting a char buffer into a const char* 45 | 46 | %buffer = alloca [79 x i8] 47 | %bpointer = getelementptr [79 x i8], [79 x i8]* %buffer, i32 0, i32 0 48 | 49 | ;call - call a named function with return type and params 50 | 51 | call i32 @puts( i8* %bpointer ) 52 | 53 | ;br - jumps to a code block based on the provided value or unconditionally jumps 54 | 55 | br i1 false, label %if_block, label %else_block ;conditional jump 56 | br label %code_block ;unconditional jump 57 | 58 | ;icmp - compare two integer values with a given operator 59 | ;- eq, ne, ult, ugt, uge,
ule, slt, sgt, sge, sle 60 | ;- equals, not equal, signed and unsigned less than, greater than, less than or equal, greater than or equal 61 | ;- returns i1 62 | 63 | %comparison = icmp eq i32 2, 0 64 | 65 | ;ret - return a value from the active function 66 | 67 | ret i32 0 68 | 69 | ;sext & zext - extend an integer to a larger bit size 70 | ;- sext is sign extended, zext is zero extended 71 | 72 | %result = zext i1 %value to i32 73 | %result = sext i1 %value to i32 74 | 75 | ;trunc - reduce an integer size to a smaller bit size 76 | 77 | %result = trunc i32 %value to i1 78 | ``` 79 | 80 | ### Crystal LLVM Builder API Bindings 81 | 82 | We are using Crystal's builder API bindings to LLVM and as such we also need to have an idea of how to use the builder API to assemble modules. Below is a generic program class that demonstrates how to use the builder API in a simplistic way. Once you grasp this, you should be able to see how you can direct the builder API into different blocks and functions throughout your module as needed. 83 | 84 | ```crystal 85 | require "llvm" 86 | 87 | class Program 88 | getter main : LLVM::BasicBlock, mod : LLVM::Module, builder : LLVM::Builder 89 | getter! func : LLVM::Function 90 | 91 | def initialize 92 | # Create the context 93 | context = LLVM::Context.new 94 | 95 | # Create a module 96 | @mod = context.new_module("name") 97 | 98 | # Add a global number variable "number" = 10 99 | mod.globals.add context.int32, "number" 100 | mod.globals["number"].initializer = context.int32.const_int(10) 101 | 102 | # Create a main function 103 | @func = mod.functions.add "main", ([] of LLVM::Type), context.int32 104 | 105 | # Create body for main function - builder appends to basic blocks. 
106 | @main = func.basic_blocks.append "main_body" 107 | 108 | # Make main function externally linkable 109 | func.linkage = LLVM::Linkage::External 110 | 111 | # Declare external function puts 112 | mod.functions.add "puts", [context.void_pointer], context.int32 113 | 114 | # Initialize Crystal's builder API 115 | @builder = context.new_builder 116 | end 117 | 118 | def code_generate 119 | # Before calling builder, you must position it into the active basic block of your program 120 | builder.position_at_end main 121 | 122 | # While walking the AST nodes you can call builder API to generate instructions into the basic block... 123 | str_ptr = builder.global_string_pointer "Johnny", "str" 124 | builder.call mod.functions["puts"], str_ptr, "str_call" 125 | num_val = builder.load mod.globals["number"] 126 | builder.ret num_val 127 | 128 | File.open("output.ll", "w") do |file| 129 | mod.to_s(file) 130 | end 131 | end 132 | end 133 | 134 | program = Program.new 135 | program.code_generate 136 | ``` 137 | 138 | It is through the relationship between your AST nodes and your code-generation functions that the final module gets built. Therefore you should spend time walking the nodes of your AST and thinking about what builder API calls you will need to accomplish the functionality you desire in LLVM IR. 139 | 140 | ### Builder API Usage 141 | 142 | Below is a list of builder methods with short descriptions. A few of the ones you'll find especially useful have demonstration usages provided.
143 | ```crystal 144 | #add(lhs, rhs, name = "") add two values together and return sum value 145 | 146 | value = builder.add(four_val, five_val, "4_plus_5") 147 | 148 | #alloca(type, name = "") allocate space for given variable type 149 | 150 | value = builder.alloca(LLVM::Int32, "number") 151 | 152 | #and(lhs, rhs, name = "") perform bitwise and operation 153 | #array_malloc(type, value, name = "") 154 | #ashr(lhs, rhs, name = "") perform arithmetic (sign-extending) right shift 155 | #atomicrmw(op, ptr, val, ordering, singlethread) atomically modify memory 156 | #bit_cast(value, type, name = "") convert value to the given type without changing bits 157 | #br(block) unconditional branch to block 158 | 159 | builder.br(block_ref) 160 | 161 | #call(func, args : Array(LLVM::Value), name : String = "") call a multi parameter function 162 | #call(func, arg : LLVM::Value, name : String = "") call a single param function 163 | 164 | ret_value = builder.call mod.functions["puts"], str_ptr, "puts_call" 165 | 166 | #call(func, name : String = "") call a no parameter function 167 | #cmpxchg(pointer, cmp, new, success_ordering, failure_ordering) atomically modify memory based on comparison 168 | #cond(cond, then_block, else_block) conditional branch to block 169 | 170 | builder.cond if_value, then_block_ref, else_block_ref 171 | 172 | #exact_sdiv(lhs, rhs, name = "") performs division using exact keyword, result is poison value if rounding would occur 173 | #extract_value(value, index, name = "") extracts value from aggregate object 174 | #fadd(lhs, rhs, name = "") floating point and vector addition 175 | #fcmp(op, lhs, rhs, name = "") floating point comparison 176 | #fdiv(lhs, rhs, name = "") floating point division 177 | #fence(ordering, singlethread, name = "") introduces memory-ordering constraints between operations 178 | #fmul(lhs, rhs, name = "") floating point multiplication 179 | #fp2si(value, type, name = "") floating point to signed int 180 | #fp2ui(value, type, name = "") floating point to unsigned int 181 |
#fpext(value, type, name = "") floating point extension 182 | #fptrunc(value, type, name = "") floating point truncation 183 | #fsub(lhs, rhs, name = "") floating point subtraction 184 | #gep(value, index1 : LLVM::Value, index2 : LLVM::Value, name = "") get element pointer returns a subelement of a container using two indices 185 | #gep(value, index : LLVM::Value, name = "") get element pointer returns a subelement of a container using a single index 186 | #gep(value, indices : Array(LLVM::ValueRef), name = "") get element pointer returns a subelement using an indices array 187 | #global_string_pointer(string, name = "") generate global string constant pointer 188 | 189 | string_ptr = builder.global_string_pointer("Hello World", "example") 190 | 191 | #icmp(op, lhs, rhs, name = "") integer comparison operation 192 | 193 | result = builder.icmp(LLVM::IntPredicate::ULT, ten_val, nine_val, "comparison") 194 | 195 | #inbounds_gep(value, indices : Array(LLVM::ValueRef), name = "") gep with inbounds keyword 196 | #inbounds_gep(value, index1 : LLVM::Value, index2 : LLVM::Value, name = "") gep with inbounds keyword 197 | #inbounds_gep(value, index : LLVM::Value, name = "") gep with inbounds keyword 198 | #int2ptr(value, type, name = "") convert integer to pointer type 199 | #invoke(fn, args : Array(LLVM::Value), a_then, a_catch, name = "") allows exception handling: control returns to the then block unless an exception is detected, in which case it returns to the catch block 200 | #landing_pad(type, personality, clauses, name = "") designates a basic block as the place where an exception is handled inside a catch routine 201 | #load(ptr, name = "") get the value stored in a pointer 202 | 203 | value = builder.load(ptr_to_value, "value_in_ptr") 204 | 205 | #lshr(lhs, rhs, name = "") performs a logical (zero-filling) right shift operation 206 | #mul(lhs, rhs, name = "") perform multiplication 207 | #not(value, name = "") performs bitwise not operation 208 | #or(lhs, rhs, name = "") performs bitwise or operation 209 | 210 | #phi(type,
table : LLVM::PhiTable, name = "") setup phi node based on preexisting table data 211 | 212 | #NOTE a phi node is simply a variable that takes on a value based on the preceding block that passed control to the phi node. 213 | 214 | #phi(type, incoming_blocks : Array(LLVM::BasicBlock), incoming_values : Array(LLVM::Value), name = "") setup phi node based on an array of basic blocks and an array of the values it should take in each case 215 | 216 | #position_at_end(block) position builder at end of a given block 217 | #ptr2int(value, type, name = "") convert pointer to integer type 218 | #ret(value) return a specified value 219 | 220 | builder.ret LLVM.int(LLVM::Int32, 0) 221 | 222 | #ret return void 223 | #sdiv(lhs, rhs, name = "") signed integer division 224 | #select(cond, a_then, a_else, name = "") select a value based on a condition 225 | #sext(value, type, name = "") signed extension 226 | #shl(lhs, rhs, name = "") shift left expression 227 | #si2fp(value, type, name = "") cast signed integer to floating point 228 | #srem(lhs, rhs, name = "") return remainder of signed integer division 229 | #store(value, ptr) store value into pointer 230 | 231 | builder.store four_val, number_ptr 232 | 233 | #sub(lhs, rhs, name = "") integer subtraction 234 | #switch(value, otherwise, cases) allows branching to one of several branches based on value 235 | #trunc(value, type, name = "") truncate integer to a smaller bit size 236 | #udiv(lhs, rhs, name = "") unsigned division 237 | #ui2fp(value, type, name = "") unsigned integer to floating point 238 | #urem(lhs, rhs, name = "") return remainder of unsigned division 239 | #xor(lhs, rhs, name = "") bitwise logical xor operation 240 | #zext(value, type, name = "") zero extension 241 | ``` 242 | 243 | Further Reading and References: 244 | 245 | 1. [LLVM for Grad Students by Adrian Sampson](https://www.cs.cornell.edu/~asampson/blog/llvm.html) 246 | 247 | 2.
[How to get started with the LLVM C API by Paul Smith](https://pauladamsmith.com/blog/2015/01/how-to-get-started-with-llvm-c-api.html) 248 | 249 | 3. [Create a working compiler with the LLVM framework, Part 1 by Arpen Sen](https://www.ibm.com/developerworks/library/os-createcompilerllvm1/) 250 | 251 | 4. [My First LLVM Compiler by Wilfred Hughes](http://www.wilfred.me.uk/blog/2015/02/21/my-first-llvm-compiler/) 252 | 253 | #### Next 254 | [Chapter 3 - Lexer](https://github.com/Virtual-Machine/llvm-tutorial-book/blob/master/chap-3-lexer.md) 255 | -------------------------------------------------------------------------------- /chap-3-lexer.md: -------------------------------------------------------------------------------- 1 | # Chapter 3 Lexer 2 | 3 | It is time to begin the actual design and implementation of our toy compiler. If you made it this far, then you know our first step in building a compiler is to create something known as a Lexer. The Lexer will be a component of our compiler. Its job is to take an array of character data and produce an array of tokens. 4 | 5 | The Lexer will begin by taking the array of characters and looping through them one at a time. As it does so, it will keep track of important information such as the line and column number for positioning, the current character, and the current token being parsed. When the Lexer either reaches a character indicating the end of a token, or the current token being parsed equals a specific keyword or symbol, it knows that it can finalize the current token. When this condition is reached, the Lexer adds a new Token to its Token array reflecting the parsed token information, and then restarts the token-processing algorithm where it left off. 6 | 7 | Our Lexer will also have some additional properties to help it when parsing our language. One thing our Lexer will have is a context property. This allows the Lexer to understand more complicated groupings of characters.
For example, we want our Lexer to recognize that the grouping of characters "Hello World!" is in fact one String token rather than two separate tokens. An easy way to accomplish this is to use the context property to indicate when the Lexer has entered a String section, so it knows to continue adding characters until the second quotation symbol is encountered. This context property can also be used to collate comments into a single Token. 8 | 9 | The only remaining requirement for our Lexer is to inject some whitespace tokens during its work to aid the Parser. In order for the Parser to clearly know when a given expression ends and a new one begins, our Lexer should append a new line token on each new line escape sequence, and to be explicit we will also append an end of file token once Lexing is complete. This way our array of tokens will clearly indicate the linear order of all the semantics of our programming language, including the effects of whitespace on expressions. 10 | 11 | Here is a high level view of what the lexer is doing to generate the token array from a given array of characters. 12 | 13 | ![Lexer Basic](https://raw.githubusercontent.com/Virtual-Machine/llvm-tutorial-book/master/diagrams/img/lexer_basic.png) 14 | 15 | #### Next 16 | [Chapter 4 - Parser](https://github.com/Virtual-Machine/llvm-tutorial-book/blob/master/chap-4-parser.md) -------------------------------------------------------------------------------- /chap-4-parser.md: -------------------------------------------------------------------------------- 1 | # Chapter 4 Parser 2 | 3 | This step is going to be the most complicated, but if you can get through and understand this, then the rest is going to be a piece of cake. We are currently able to use our Lexer to get an array of Tokens. We now wish to parse the tokens into an AST (Abstract Syntax Tree). The AST gets this name from the tree-like structure that results once parsing is complete.
Every literal, expression, block, return, if, while, and def (in short, every syntactic component of the language) is going to have its own node. Each of these nodes will reference other nodes to indicate how they are related to one another. 4 | 5 | For instance, the binary operation 2 + 2 could be looked at as a binary expression node, with an operator node represented by the plus symbol, and left-hand side and right-hand side expressions that are each, in this case, simply a number literal of 2. In this example the binary operator expression node would be considered the root, while the remaining nodes are its children. This tree-like structure is very important, as it makes our code-generation stage much simpler. In order to translate the AST into LLVM IR we will simply walk the AST, inspecting nodes as we go, and calling the LLVM IR Builder API with references to the respective child nodes. 6 | 7 | ``` 8 | BinaryExpressionNode -> 2 + 2 9 | Operator -> + 10 | LHS -> 2 11 | RHS -> 2 12 | ``` 13 | 14 | In order to create our AST we are going to need a class for each node type, so that each node can have instance variables reflecting the required child nodes for each node type that LLVM understands and that we wish to port into our language. Initially we only need a few node types, and all our code will be treated as though it's being called from the main function and therefore appended to the end of the main function's BasicBlock as we go. Once we are ready to add control flow, loops, and functions we will need to keep track of the blocks in our program and append to the correct one during code generation. Finally we will implement the puts command as a language built-in, which will require some special parsing and code generation logic to accommodate its features.
A language built-in means that rather than forcing a user to implement a function and compile it each time with their own code, the language has the function signature and implementation pre-compiled in the standard library, so that every program linked against the standard library gets the same implementation of said function. 15 | 16 | Our parser will work by inspecting the current token in the array. The parser will be aware of each node type and how it relates to other node types in sequence. Each line will be treated as an expression, which may itself consist of multiple other expressions. The parser will determine which tokens should be expected following a given token; if those tokens are not found, an error will be generated to help the user determine where a syntax error is occurring. Otherwise the parser will continue to take the tokens and generate the required node structure to form the final AST. We should be able to easily inspect our AST at the end of this stage to visually debug and ensure our code-generation calls are getting the correct information. 17 | 18 | Let's take our initial example code from chapter 0: 19 | 20 | ```ruby 21 | # I am a comment! 22 | four = 2 + 2 23 | puts four 24 | puts 10 < 6 25 | puts 11 != 10 26 | ``` 27 | 28 | We will be parsing this into an AST that will look something like this.
29 | ``` 30 | Expressions Node : [four = 2 + 2, puts four, puts 10 < 6, puts 11 != 10] 31 | [0] 32 | |-> Variable Declaration Node : four = 2 + 2 33 | |---> Binary Expression Node : 2 + 2 34 | |-----> Number Literal : 2 35 | |-----> Number Literal : 2 36 | [1] 37 | |-> Call Expression Node : puts four 38 | |---> Declaration Reference Expression : four 39 | [2] 40 | |-> Call Expression Node : puts 10 < 6 41 | |---> Binary Expression Node : 10 < 6 42 | |-----> Number Literal : 10 43 | |-----> Number Literal : 6 44 | [3] 45 | |-> Call Expression Node : puts 11 != 10 46 | |---> Binary Expression Node : 11 != 10 47 | |-----> Number Literal : 11 48 | |-----> Number Literal : 10 49 | ``` 50 | 51 | Something that may be of use to you for experimental purposes is to see Clang's AST representation for simple C code. This is useful because Clang's own code generation walks this AST to produce LLVM IR, so it should be informative to browse. Below is a simple C code example. 52 | 53 | ```c 54 | // example_clang/main.c 55 | 56 | int addFour(int x) { 57 | return x + 4; 58 | } 59 | 60 | 61 | int main(){ 62 | int four = addFour(0); 63 | } 64 | 65 | ``` 66 | 67 | ```bash 68 | clang -cc1 -ast-dump name_of_file.c 69 | ``` 70 | 71 | If we scrape out some of the extraneous information we can see Clang's AST for this code: 72 | ``` 73 | |-FunctionDecl 0x7f9247882400 line:1:5 used addFour 'int (int)' 74 | | |-ParmVarDecl 0x7f92478316d8 col:17 used x 'int' 75 | | `-CompoundStmt 0x7f9247882590 76 | | `-ReturnStmt 0x7f9247882578 77 | | `-BinaryOperator 0x7f9247882550 'int' '+' 78 | | |-ImplicitCastExpr 0x7f9247882538 'int' 79 | | | `-DeclRefExpr 0x7f92478824f0 'int' lvalue ParmVar 0x7f92478316d8 'x' 'int' 80 | | `-IntegerLiteral 0x7f9247882518 'int' 4 81 | `-FunctionDecl 0x7f92478825f8 line:6:5 main 'int ()' 82 | `-CompoundStmt 0x7f92478827e8 83 | `-DeclStmt 0x7f92478827d0 84 | `-VarDecl 0x7f92478826b0 col:6 four 'int' cinit 85 | `-CallExpr 0x7f92478827a0 'int' 86 | |-ImplicitCastExpr 0x7f9247882788 'int
(*)(int)' 87 | | `-DeclRefExpr 0x7f9247882710 'int (int)' Function 0x7f9247882400 'addFour' 'int (int)' 88 | `-IntegerLiteral 0x7f9247882738 'int' 0 89 | 90 | ``` 91 | 92 | Our parser is going to parse binary operations by taking a somewhat novel yet simple approach. In basic terms, when our parser reaches an expression, it will append each number literal node and make it the active node in anticipation of an impending binary node. If a binary node is reached, it then seeks the suitable insertion point in the AST and promotes itself to that position, adopting that node's children and itself becoming the new active node in the parse tree. 93 | 94 | If you are like me, the above words might sound pretty confusing, so a picture is worth a thousand words: 95 | 96 | ``` 97 | watch as it parses the expression 2 * 5 + 3 in sequence 98 | 99 | step 1 Expression node is added with first token value - literal value 2 active 100 | Root Node 101 | Expression Node 102 | Literal Node 2 (Active) 103 | 104 | step 2 A binary operator is reached, promoted, inherits literal as its child, and is now the active node 105 | Root Node 106 | Expression Node 107 | Operator Node * (Active) 108 | Literal Node 2 109 | 110 | step 3 Next literal is appended to active node and then itself becomes active 111 | Root Node 112 | Expression Node 113 | Operator Node * 114 | Literal Node 2 115 | Literal Node 5 (Active) 116 | 117 | step 4 New operator is reached, is lower precedence than * operator so is therefore promoted twice, and is new active node 118 | Root Node 119 | Expression Node 120 | Operator Node + (Active) 121 | Operator Node * 122 | Literal Node 2 123 | Literal Node 5 124 | 125 | step 5 Final token is reached and appended to active node, resolving expression AST 126 | Root Node 127 | Expression Node 128 | Operator Node + 129 | Literal Node 3 130 | Operator Node * 131 | Literal Node 2 132 | Literal Node 5 133 | ``` 134 | 135 | ``` 136 | parsing : 2 * 3 + (4 * (5 + 6) * 7) + 8 * 9 137 | Root 138 |
Expression Node 139 | Operator Node + 140 | Operator Node + 141 | Expression Node (4 * (5 + 6) * 7) 142 | Operator Node * 143 | Operator Node * 144 | Expression Node (5 + 6) 145 | Operator Node + 146 | Literal Node 5 147 | Literal Node 6 148 | Literal Node 7 149 | Literal Node 4 150 | Operator Node * 151 | Literal Node 8 152 | Literal Node 9 153 | Operator Node * 154 | Literal Node 2 155 | Literal Node 3 156 | 157 | 158 | ``` 159 | 160 | Here are some sketches of this process to help you visualize. 161 | 162 | Blue means this node is new this step; red means it is both new and currently active. 163 | 164 | In this example we are using the expression: 165 | 166 | ```crystal 167 | 2 * 3 + (4 * (5 + 6 * 7) + 8) * 9 - 1 168 | ``` 169 | 170 | Step 1 - Root node and main expression node generated. 171 | 172 | ![BDMAS Parsing Stage 1](https://raw.githubusercontent.com/Virtual-Machine/llvm-tutorial-book/master/diagrams/img/BDMAS_1.png) 173 | 174 | Step 2 - Begin parsing main expression, append 2 literal node. 175 | 176 | ![BDMAS Parsing Stage 2](https://raw.githubusercontent.com/Virtual-Machine/llvm-tutorial-book/master/diagrams/img/BDMAS_2.png) 177 | 178 | Step 3 - Multiplication operator is promoted, and literal 3 appended to it. 179 | 180 | ![BDMAS Parsing Stage 3](https://raw.githubusercontent.com/Virtual-Machine/llvm-tutorial-book/master/diagrams/img/BDMAS_3.png) 181 | 182 | Step 4 - Addition operator is promoted to top, the multiply operator becomes its child, and the parenthesis expression is also appended as its child. 183 | 184 | ![BDMAS Parsing Stage 4](https://raw.githubusercontent.com/Virtual-Machine/llvm-tutorial-book/master/diagrams/img/BDMAS_4.png) 185 | 186 | Step 5 - First 4 literal is appended to expression node, then multiplication operator is promoted, and then expression node is appended to multiplication node as active node.
187 | 188 | ![BDMAS Parsing Stage 5](https://raw.githubusercontent.com/Virtual-Machine/llvm-tutorial-book/master/diagrams/img/BDMAS_5.png) 189 | 190 | Step 6 - Inner parenthesis expression is parsed, 5 literal to expression node, addition is promoted, 6 literal to addition node, multiplication is promoted, 7 literal to multiplication node. 191 | 192 | ![BDMAS Parsing Stage 6](https://raw.githubusercontent.com/Virtual-Machine/llvm-tutorial-book/master/diagrams/img/BDMAS_6.png) 193 | 194 | Step 7 - First parenthesis closes so active node jumps to closest expression node parent and then immediately promotes addition binary and appends 8 literal as its child. 195 | 196 | ![BDMAS Parsing Stage 7](https://raw.githubusercontent.com/Virtual-Machine/llvm-tutorial-book/master/diagrams/img/BDMAS_7.png) 197 | 198 | Step 8 - Second parenthesis closes so active node jumps to closest expression node parent and then immediately promotes multiplication, appends literal 9 to multiplication, double promotes subtraction, and then finally appends 1 literal to subtraction node. 199 | 200 | ![BDMAS Parsing Stage 8](https://raw.githubusercontent.com/Virtual-Machine/llvm-tutorial-book/master/diagrams/img/BDMAS_8.png) 201 | 202 | Visualize how the active node is changing as it parses the expressions and inner expressions. 203 | 204 | The above approach allows us to parse each token in sequence and handle parenthesis scope because the following is true: 205 | 206 | 1. Expression nodes act as gatekeepers, preventing operators from promotion beyond their borders. 207 | 2. Expression nodes act as beacons, allowing the closing parenthesis to correctly activate the next required node in the parsing process. 208 | 3. Promote and add_child operations relative to the active node will always resolve to the correct place if there are no syntax errors. 209 | 4. We can likely test for these syntax errors and provide user-friendly messages if this situation is detected.
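The promotion mechanics described above can be sketched in a few dozen lines of code. The following illustration is in Ruby rather than Crystal, and its `Node` class, `parse_expression`, and `eval_node` are hypothetical names for this sketch only, not the ones used in src/emerald/parser.cr. It parses a flat token stream using nothing but an active node, promote, and add_child:

```ruby
# Minimal sketch of active-node parsing with operator promotion.
# These names are illustrative only; Emerald's real parser differs.

PRECEDENCE = { "+" => 1, "-" => 1, "*" => 2, "/" => 2 }

class Node
  attr_accessor :value, :children, :parent

  def initialize(value)
    @value = value
    @children = []
    @parent = nil
  end

  def add_child(node)
    node.parent = self
    @children << node
  end

  # Replace this node in the tree with an operator node,
  # adopting self as the operator's child.
  def promote(operator_node)
    @parent.children.delete(self)
    @parent.add_child(operator_node)
    operator_node.add_child(self)
  end
end

def parse_expression(tokens)
  root = Node.new(:expression)
  active = root
  tokens.each do |tok|
    if PRECEDENCE.key?(tok)
      # Climb while the parent is an operator of equal or higher
      # precedence, so a low-precedence operator is promoted past it.
      target = active
      while PRECEDENCE.key?(target.parent.value.to_s) &&
            PRECEDENCE[target.parent.value.to_s] >= PRECEDENCE[tok]
        target = target.parent
      end
      op = Node.new(tok)
      target.promote(op)
      active = op
    else
      literal = Node.new(tok.to_i)
      active.add_child(literal)
      active = literal
    end
  end
  root
end

# Tiny evaluator to show the tree resolves with correct precedence.
def eval_node(node)
  case node.value
  when :expression then eval_node(node.children[0])
  when "+" then eval_node(node.children[0]) + eval_node(node.children[1])
  when "-" then eval_node(node.children[0]) - eval_node(node.children[1])
  when "*" then eval_node(node.children[0]) * eval_node(node.children[1])
  when "/" then eval_node(node.children[0]) / eval_node(node.children[1])
  else node.value
  end
end

p eval_node(parse_expression(%w[2 * 5 + 3]))     # => 13
p eval_node(parse_expression(%w[2 * 3 + 4 * 5])) # => 26
```

Note how the precedence check decides how far an operator climbs before promotion; that climb is exactly the "promoted twice" behaviour shown in the step-by-step walkthrough above.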
210 | 211 | The algorithm can be simply stated as follows: 212 | 213 | ``` 214 | Whenever an opening parenthesis is encountered, 215 | the active node appends an expression node 216 | which then becomes the new active node 217 | 218 | Whenever a closing parenthesis is encountered, 219 | the active node is recursively changed to its own parent node 220 | until the active node is an expression node. 221 | 222 | This process provides the boundary for the operator nodes 223 | and the designation for the active node once resolved. 224 | ``` 225 | 226 | Here is a high level view of what the parser is doing to generate the AST from a given array of tokens. 227 | 228 | ![Parser Basic](https://raw.githubusercontent.com/Virtual-Machine/llvm-tutorial-book/master/diagrams/img/parser_basic.png) 229 | 230 | #### Next 231 | [Chapter 5 - Code Generator](https://github.com/Virtual-Machine/llvm-tutorial-book/blob/master/chap-5-code-generator.md) -------------------------------------------------------------------------------- /chap-5-code-generator.md: -------------------------------------------------------------------------------- 1 | # Chapter 5 Code Generator 2 | 3 | While the parser's job was to convert an array of tokens into a structured AST, the code generator's job is to make the transition from AST to either intermediate or binary code. Code generation is completed by walking along the nodes of the AST and making sense of the structure. This is a recursive process, as the code generator must walk to the terminal nodes before resolving an expression. The final result will be a module containing all the parsed expressions, which will then be dumped as LLVM IR. 4 | 5 | It can be helpful to have an idea of how a given AST will translate into IR code, even if you do not plan to write that IR yourself. The structure of that IR alone will be informative on how to walk the AST and generate the required builder calls via LLVM's IR Builder API.
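As a sketch of that recursive walk, here is a toy Ruby emitter that prints pseudo-IR strings rather than calling the real builder API. The `Num` and `BinOp` node shapes are assumptions made for this illustration and do not match Emerald's actual node classes:

```ruby
# Toy post-order walk of an AST that emits pseudo-IR strings.
# Node shapes and emitted text are illustrative assumptions;
# the real generator calls LLVM's IR builder instead.

Num   = Struct.new(:value)
BinOp = Struct.new(:op, :lhs, :rhs)

class Emitter
  def initialize
    @counter = 0
    @lines = []
  end

  # Returns the name of the value holding this node's result,
  # emitting instructions for the children first (post-order).
  def walk(node)
    case node
    when Num
      node.value.to_s
    when BinOp
      lhs = walk(node.lhs) # resolve terminal nodes first
      rhs = walk(node.rhs)
      reg = "%#{@counter += 1}"
      opcode = { "+" => "add", "*" => "mul" }.fetch(node.op)
      @lines << "#{reg} = #{opcode} i32 #{lhs}, #{rhs}"
      reg
    end
  end

  def emit(node)
    result = walk(node)
    (@lines + ["ret i32 #{result}"]).join("\n")
  end
end

# two + three * four, with the literals 2, 3, and 4
ast = BinOp.new("+", Num.new(2), BinOp.new("*", Num.new(3), Num.new(4)))
puts Emitter.new.emit(ast)
# %1 = mul i32 3, 4
# %2 = add i32 2, %1
# ret i32 %2
```

The key property is that walk resolves both children before emitting the instruction that combines them, so the deepest sub-expressions appear first in the output, just as in real LLVM IR.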
Just as in the last chapter, we can use clang to emit the AST and the LLVM IR for simple C examples to better understand how an AST translates to LLVM IR. Forgive this ugly C example; I am using variables to prevent Clang from outputting optimized LLVM IR that is non-informative. If you know how to output raw LLVM IR without collapsing number literals please let me know. 6 | 7 | ```c 8 | // example_clang/main2.c 9 | 10 | int main(){ 11 | int number; 12 | int two = 2, three = 3, four = 4; 13 | if (two + three * four < three) { 14 | number = two; 15 | } else { 16 | number = 5; 17 | } 18 | return number; 19 | } 20 | ``` 21 | 22 | ```bash 23 | clang -cc1 -ast-dump name_of_file.c 24 | clang -cc1 -emit-llvm name_of_file.c 25 | ``` 26 | 27 | Simplified AST: 28 | ``` 29 | `-FunctionDecl 0x7fc33a831718 line:1:5 main 'int ()' 30 | `-CompoundStmt 0x7fc33a8829b8 31 | |-DeclStmt 0x7fc33a882470 32 | | `-VarDecl 0x7fc33a882410 col:9 used number 'int' 33 | |-DeclStmt 0x7fc33a882658 34 | | |-VarDecl 0x7fc33a882498 col:9 used two 'int' cinit 35 | | | `-IntegerLiteral 0x7fc33a8824f8 'int' 2 36 | | |-VarDecl 0x7fc33a882528 col:18 used three 'int' cinit 37 | | | `-IntegerLiteral 0x7fc33a882588 'int' 3 38 | | `-VarDecl 0x7fc33a8825b8 col:29 used four 'int' cinit 39 | | `-IntegerLiteral 0x7fc33a882618 'int' 4 40 | |-IfStmt 0x7fc33a882928 41 | | |-<<<NULL>>> 42 | | |-<<<NULL>>> 43 | | |-BinaryOperator 0x7fc33a8827c0 'int' '<' 44 | | | |-BinaryOperator 0x7fc33a882758 'int' '+' 45 | | | | |-ImplicitCastExpr 0x7fc33a882740 'int' 46 | | | | | `-DeclRefExpr 0x7fc33a882670 'int' lvalue Var 0x7fc33a882498 'two' 'int' 47 | | | | `-BinaryOperator 0x7fc33a882718 'int' '*' 48 | | | | |-ImplicitCastExpr 0x7fc33a8826e8 'int' 49 | | | | | `-DeclRefExpr 0x7fc33a882698 'int' lvalue Var 0x7fc33a882528 'three' 'int' 50 | | | | `-ImplicitCastExpr 0x7fc33a882700 'int' 51 | | | | `-DeclRefExpr 0x7fc33a8826c0 'int' lvalue Var 0x7fc33a8825b8 'four' 'int' 52 | | | `-ImplicitCastExpr 0x7fc33a8827a8 'int' 53 | | |
`-DeclRefExpr 0x7fc33a882780 'int' lvalue Var 0x7fc33a882528 'three' 'int' 54 | | |-CompoundStmt 0x7fc33a882878 55 | | | `-BinaryOperator 0x7fc33a882850 'int' '=' 56 | | | |-DeclRefExpr 0x7fc33a8827e8 'int' lvalue Var 0x7fc33a882410 'number' 'int' 57 | | | `-ImplicitCastExpr 0x7fc33a882838 'int' 58 | | | `-DeclRefExpr 0x7fc33a882810 'int' lvalue Var 0x7fc33a882498 'two' 'int' 59 | | `-CompoundStmt 0x7fc33a882908 60 | | `-BinaryOperator 0x7fc33a8828e0 'int' '=' 61 | | |-DeclRefExpr 0x7fc33a882898 'int' lvalue Var 0x7fc33a882410 'number' 'int' 62 | | `-IntegerLiteral 0x7fc33a8828c0 'int' 5 63 | `-ReturnStmt 0x7fc33a8829a0 64 | `-ImplicitCastExpr 0x7fc33a882988 'int' 65 | `-DeclRefExpr 0x7fc33a882960 'int' lvalue Var 0x7fc33a882410 'number' 'int' 66 | 67 | ``` 68 | 69 | Simplified LLVM IR: 70 | ``` 71 | ; Function Attrs: nounwind ssp uwtable 72 | define i32 @main() #0 { 73 | %1 = alloca i32, align 4 74 | %2 = alloca i32, align 4 75 | %3 = alloca i32, align 4 76 | %4 = alloca i32, align 4 77 | %5 = alloca i32, align 4 78 | store i32 0, i32* %1, align 4 79 | store i32 2, i32* %3, align 4 80 | store i32 3, i32* %4, align 4 81 | store i32 4, i32* %5, align 4 82 | %6 = load i32, i32* %3, align 4 83 | %7 = load i32, i32* %4, align 4 84 | %8 = load i32, i32* %5, align 4 85 | %9 = mul nsw i32 %7, %8 86 | %10 = add nsw i32 %6, %9 87 | %11 = load i32, i32* %4, align 4 88 | %12 = icmp slt i32 %10, %11 89 | br i1 %12, label %13, label %15 90 | 91 | ;