├── .gitignore
├── LICENSE
├── Makefile
├── README.md
├── TODO.md
├── elf.h
├── nanoc.c
├── od.sh
└── test
    ├── archive.c
    ├── hello.c
    └── test2.c


/.gitignore:
--------------------------------------------------------------------------------
/nanoc
/a.out
*.o
*.a

--------------------------------------------------------------------------------
/LICENSE:
--------------------------------------------------------------------------------
Copyright 2021 Ajay Tatachar

Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files (the "Software"), to deal in the Software without restriction, including without limitation the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, and to permit persons to whom the Software is furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in all copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.
--------------------------------------------------------------------------------
/Makefile:
--------------------------------------------------------------------------------

nanoc: nanoc.c
	$(CC) -o nanoc nanoc.c

archive.a: test/archive.c
	i386-elf-gcc -c test/archive.c -o archive.o
	i386-elf-ar r archive.a archive.o

.PHONY: clean
clean:
	rm -fr nanoc a.out archive.o archive.a

--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------

# nanoc
nanoc is a tiny subset of C and a tiny compiler that targets 32-bit x86 machines.

## Tiny?
The only types are:
- `int` (32-bit signed integer)
- `char` (8-bit signed integer)
- `void` (only to be used as a function return type or pointee type)
- pointers to the above types and to pointer types, which are functionally equivalent to 32-bit unsigned integers

Notably, there are no structs, unions, enums or arrays.

The only keywords are: `int`, `char`, `void`, `if`, `else`, `while`, `continue`, `break`, `return`.

There are many syntactic limitations:
- Declarations and assignments cannot both be made in a single statement. For example, `int a = 12;` is not valid but `int a; a = 12;` is.
- Only a single declaration can be made in a single statement. `int a; int b;` is valid but `int a, b;` is not.
- There is no operator precedence (see "Differences with C")
- The `++` and `--` operators are always prefix operators. `++a` is valid, but `a++` is not.
- There is no array indexing syntax. `*(a + 1)` is valid but `a[1]` is not.
- There is no unary `-` operator. `0 - a` is valid but `-a` is not.
- There is no preprocessor.

As of now, the compiler is just under 1800 lines of code. It only outputs 32-bit x86 machine code formatted as an ELF executable.
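
As a rough illustration of the syntactic limitations listed above, the sketch below tries to stay inside the subset. The function names are invented for this example, and it has not been run through nanoc itself. The comments are there for the reader only: the tokenizer in `nanoc.c` does not appear to skip comments, so they would presumably have to be removed before compiling with nanoc.

```c
/* Hypothetical example functions; not part of the nanoc repository. */

int sum_to(int n)
{
  int total;              /* a declaration cannot carry an initializer */
  int i;                  /* one declaration per statement */
  total = 0;
  i = 0;
  while (i < n) {
    ++i;                  /* only the prefix form of ++ is accepted */
    total = (total + i);  /* no precedence: nested operations need parentheses */
  }
  return total;
}

int negate(int a)
{
  return 0 - a;           /* there is no unary minus */
}

char second_char(char *s)
{
  /* no s[1] syntax; a char is one byte, so s + 1 means the same in C and nanoc */
  return *(s + 1);
}

int scaled(int a, int b, int c)
{
  return a + (b * c);     /* a + b * c would be rejected */
}
```

Because every construct used here is also valid C, the same functions can be checked with an ordinary C compiler before trying them with nanoc.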

## Differences with C
nanoc is not a strict subset of C, as there are some small semantic differences between the two languages:
- Operators do not take precedence over each other; the order in which they are applied must be specified explicitly with parentheses. For example, `1 + 2 * 3` is not a valid expression but `1 + (2 * 3)` is.
- "Pointer arithmetic" does not exist: pointers are simply integers, so adding 1 to an `int*` will increment it by 1 instead of 4 (or whatever `sizeof(int)` is).
- There is no explicit type casting (or any type checking at all); all types are cast implicitly.
- Boolean operators always evaluate both of their operands. For example, `0` will always be evaluated in the expression `1 || 0`, and `1` will always be evaluated in the expression `0 && 1`.

## Purpose, Goals and TODOs
nanoc is meant to be a small, self-contained and easily portable compiler for a usable subset of C. I wrote it because I wanted to write and compile code on my hobby [operating system](https://github.com/AjayMT/mako) without having to port a [big](https://gcc.gnu.org/) toolchain. Since many hobby operating systems run on x86 and read ELF executables, I hope this project proves to be useful for other OSdev enthusiasts as well.

Most existing "little C compiler" projects output some form of assembly. Because writing an x86 assembler is a difficult (and not very interesting) task, I decided to make nanoc output x86 machine code directly rather than assembly text. nanoc is also capable of linking source code with a statically compiled archive (like a [libnanoc.a](https://github.com/AjayMT/mako/tree/master/src/libnanoc)) to produce an ELF executable that can use syscalls and library functions and do useful things.

As of now, nanoc produces very inefficient code. This is not desirable, but I would not sacrifice portability or too much simplicity for code efficiency.
It is also a very rudimentary linker -- in particular, the way nanoc combines `.data`, `.rodata` and `.bss` sections and handles global variables is questionable at best and non-functional at worst.

TODO.md lists TODOs to be addressed in the short term.

## Build and Usage
To compile nanoc:
```
make
```

To use an OS-specific toolchain, simply define the `CC` variable:
```
CC=i686-pc-myos-gcc make
```

nanoc depends on a few libc functions: some simple ones from `string.h`, malloc+realloc, fopen+fread+fwrite, printf and atoi.

If you are having trouble porting nanoc to your operating system, please reach out to me! I am happy to help. Feel free to raise an issue on this repository or send me an [email](mailto:ajaymt2@illinois.edu).

To use nanoc:
```
nanoc <source file> [<archive file>]
```

For example:
```
nanoc program.c
```
or
```
nanoc program.c /usr/lib/libnanoc.a
```

This will compile `program.c` and produce an executable file called `a.out`. If an archive file is specified, nanoc will link it with `program.c` so the program can use functions defined in the archive.

--------------------------------------------------------------------------------
/TODO.md:
--------------------------------------------------------------------------------
- Not all escaped characters inside string or character literals are escaped correctly.
- The multiplication operator does not handle overflow correctly when multiplying `char`s
- Documentation and code quality

--------------------------------------------------------------------------------
/elf.h:
--------------------------------------------------------------------------------

#ifndef _ELF_H_
#define _ELF_H_

#include <stdint.h>

// ELF magic numbers.
#define ELFMAG0 0x7f
#define ELFMAG1 'E'
#define ELFMAG2 'L'
#define ELFMAG3 'F'
#define EI_NIDENT 16

// ELF data types.
typedef uint32_t Elf32_Word;
typedef uint32_t Elf32_Addr;
typedef uint32_t Elf32_Off;
typedef int32_t Elf32_Sword; // signed word
typedef uint16_t Elf32_Half;

// ELF header.
typedef struct {
  uint8_t e_ident[EI_NIDENT];
  Elf32_Half e_type;
  Elf32_Half e_machine;
  Elf32_Word e_version;
  Elf32_Addr e_entry;
  Elf32_Off e_phoff;
  Elf32_Off e_shoff;
  Elf32_Word e_flags;
  Elf32_Half e_ehsize;
  Elf32_Half e_phentsize;
  Elf32_Half e_phnum;
  Elf32_Half e_shentsize;
  Elf32_Half e_shnum;
  Elf32_Half e_shstrndx;
} Elf32_Header;

// e_type values.
#define ET_NONE 0
#define ET_REL 1
#define ET_EXEC 2
#define ET_DYN 3
#define ET_CORE 4
#define ET_LOPROC 0xff00
#define ET_HIPROC 0xffff

// Program header.
typedef struct {
  Elf32_Word p_type;
  Elf32_Off p_offset;
  Elf32_Addr p_vaddr;
  Elf32_Addr p_paddr;
  Elf32_Word p_filesz;
  Elf32_Word p_memsz;
  Elf32_Word p_flags;
  Elf32_Word p_align;
} Elf32_Phdr;

// p_type values.
#define PT_NULL 0
#define PT_LOAD 1
#define PT_DYNAMIC 2
#define PT_INTERP 3
#define PT_NOTE 4
#define PT_SHLIB 5
#define PT_PHDR 6
#define PT_LOPROC 0x70000000
#define PT_HIPROC 0x7FFFFFFF

// p_flags values.
#define PF_X 1
#define PF_W 2
#define PF_R 4

// Section header.
typedef struct {
  Elf32_Word sh_name;
  Elf32_Word sh_type;
  Elf32_Word sh_flags;
  Elf32_Addr sh_addr;
  Elf32_Off sh_offset;
  Elf32_Word sh_size;
  Elf32_Word sh_link;
  Elf32_Word sh_info;
  Elf32_Word sh_addralign;
  Elf32_Word sh_entsize;
} Elf32_Shdr;

// sh_type values
#define SHT_NONE 0
#define SHT_PROGBITS 1
#define SHT_SYMTAB 2
#define SHT_STRTAB 3
#define SHT_RELA 4
#define SHT_NOBITS 8
#define SHT_REL 9

// sh_flags values
#define SHF_WRITE 1
#define SHF_ALLOC 2
#define SHF_EXECINSTR 4
#define SHF_INFO_LINK 0x40

// Symbol
typedef struct {
  Elf32_Word st_name;
  Elf32_Addr st_value;
  Elf32_Word st_size;
  unsigned char st_info;
  unsigned char st_other;
  Elf32_Half st_shndx;
} Elf32_Sym;

// Macros to apply to st_info
#define ELF32_ST_BIND(i) ((i)>>4)
#define ELF32_ST_TYPE(i) ((i)&0xf)
#define ELF32_ST_INFO(b,t) (((b)<<4)+((t)&0xf))

// ST_BIND values
#define STB_LOCAL 0
#define STB_GLOBAL 1
#define STB_WEAK 2

// ST_TYPE values
#define STT_NOTYPE 0
#define STT_OBJECT 1
#define STT_FUNC 2
#define STT_SECTION 3
#define STT_FILE 4

// Relocation entry
typedef struct {
  Elf32_Addr r_offset;
  Elf32_Word r_info;
} Elf32_Rel;

// Macros to apply to r_info
#define ELF32_R_SYM(i) ((i)>>8)
#define ELF32_R_TYPE(i) ((unsigned char)(i))
#define ELF32_R_INFO(s,t) (((s)<<8)+(unsigned char)(t))

// R_TYPE values
#define R_386_32 1
#define R_386_PC32 2

#endif /* _ELF_H_ */

--------------------------------------------------------------------------------
/nanoc.c:
--------------------------------------------------------------------------------

#include <stdio.h>
#include <stdlib.h>
#include <stdint.h>
#include <string.h>
#include <ctype.h>
#include "elf.h"

FILE *input = NULL;

typedef enum {
  INT_LITERAL, SEMICOLON, EOF_, LPAREN, RPAREN, LCURLY, RCURLY,
  COMMA, LT, GT, LTE, GTE, EQ, EQUAL, NEQUAL, NOT, BIT_AND, BIT_OR,
  BIT_NOT, BIT_XOR, AND, OR, PLUS, MINUS, ASTERISK, SLASH, PERCENT,
  PLUS_EQ, MINUS_EQ, ASTERISK_EQ, SLASH_EQ, PERCENT_EQ, PLUS_PLUS, MINUS_MINUS,
  AND_EQ, OR_EQ, XOR_EQ, NOT_EQ,
  CHAR_LITERAL, STR_LITERAL, IDENT,
  INT, VOID, CHAR, IF, ELSE, WHILE, CONTINUE, BREAK, RETURN
} token_type_t;

typedef struct token_s {
  token_type_t type;
  char *str;
  uint32_t pos;
} token_t;

char escape_char(char c)
{
  switch (c) {
  case 'n': return '\n';
  case 't': return '\t';
  }
  return c;
}

uint8_t buffered_token = 0;
token_t tok_buf;

token_t next_token()
{
  if (buffered_token) {
    buffered_token = 0;
    return tok_buf;
  }

  char buf[512]; // max token size is 512
  memset(buf, 0, 512);
  uint32_t buf_idx = 0;
  // int rather than char, so that EOF stays distinct from every valid character
  int c = getc(input);

  while (isspace(c)) c = getc(input); // skip spaces

#define RET_EMPTY_TOKEN(t) \
  return (token_t) { .type = (t), .str = NULL, .pos = ftell(input) }

  switch (c) {
  case EOF: RET_EMPTY_TOKEN(EOF_);
  case ';': RET_EMPTY_TOKEN(SEMICOLON);
  case '(': RET_EMPTY_TOKEN(LPAREN);
  case ')': RET_EMPTY_TOKEN(RPAREN);
  case '{': RET_EMPTY_TOKEN(LCURLY);
  case '}': RET_EMPTY_TOKEN(RCURLY);
  case ',': RET_EMPTY_TOKEN(COMMA);
  }

  if (c == '<') {
    c = getc(input);
    if (c == '=') RET_EMPTY_TOKEN(LTE);
    ungetc(c, input); RET_EMPTY_TOKEN(LT);
  }

  if (c == '>') {
    c = getc(input);
    if (c == '=') RET_EMPTY_TOKEN(GTE);
    ungetc(c, input); RET_EMPTY_TOKEN(GT);
  }

  if (c == '=') {
    c = getc(input);
    if (c == '=') RET_EMPTY_TOKEN(EQUAL);
    ungetc(c, input); RET_EMPTY_TOKEN(EQ);
  }

  if (c == '!') {
    c = getc(input);
    if (c == '=') RET_EMPTY_TOKEN(NEQUAL);
    ungetc(c, input); RET_EMPTY_TOKEN(NOT);
  }

  if (c == '&') {
    c = getc(input);
    if (c == '&') RET_EMPTY_TOKEN(AND);
    if (c == '=') RET_EMPTY_TOKEN(AND_EQ);
    ungetc(c, input); RET_EMPTY_TOKEN(BIT_AND);
  }

  if (c == '|') {
    c = getc(input);
    if (c == '|') RET_EMPTY_TOKEN(OR);
    if (c == '=') RET_EMPTY_TOKEN(OR_EQ);
    ungetc(c, input); RET_EMPTY_TOKEN(BIT_OR);
  }

  if (c == '^') {
    c = getc(input);
    if (c == '=') RET_EMPTY_TOKEN(XOR_EQ);
    ungetc(c, input); RET_EMPTY_TOKEN(BIT_XOR);
  }

  if (c == '~') {
    c = getc(input);
    if (c == '=') RET_EMPTY_TOKEN(NOT_EQ);
    ungetc(c, input); RET_EMPTY_TOKEN(BIT_NOT);
  }

  if (c == '+') {
    c = getc(input);
    if (c == '=') RET_EMPTY_TOKEN(PLUS_EQ);
    if (c == '+') RET_EMPTY_TOKEN(PLUS_PLUS);
    ungetc(c, input); RET_EMPTY_TOKEN(PLUS);
  }

  if (c == '-') {
    c = getc(input);
    if (c == '=') RET_EMPTY_TOKEN(MINUS_EQ);
    if (c == '-') RET_EMPTY_TOKEN(MINUS_MINUS);
    ungetc(c, input); RET_EMPTY_TOKEN(MINUS);
  }

  if (c == '*') {
    c = getc(input);
    if (c == '=') RET_EMPTY_TOKEN(ASTERISK_EQ);
    ungetc(c, input); RET_EMPTY_TOKEN(ASTERISK);
  }

  if (c == '/') {
    c = getc(input);
    if (c == '=') RET_EMPTY_TOKEN(SLASH_EQ);
    ungetc(c, input); RET_EMPTY_TOKEN(SLASH);
  }

  if (c == '%') {
    c = getc(input);
    if (c == '=') RET_EMPTY_TOKEN(PERCENT_EQ);
    ungetc(c, input); RET_EMPTY_TOKEN(PERCENT);
  }

  if (c == '\'') {
    c = getc(input);
    uint8_t escaped = 0;
    if (c == '\\') {
      escaped = 1;
      c = getc(input);
    }
    if (c == '\'' && !escaped) goto fail;
    if (escaped) c = escape_char(c);
    buf[buf_idx] = c; ++buf_idx;
    c = getc(input);
    if (c != '\'') goto fail;
    char *s = strndup(buf, buf_idx);
    return (token_t) { .type = CHAR_LITERAL, .str = s, .pos = ftell(input) };
  }

  if (c == '"') {
    c = getc(input);
    uint8_t escaped = 0;
    while (c != '"' || escaped) {
      // an unterminated literal (EOF or newline) or an over-long one is an error
      if (c == EOF || c == '\n' || c == '\r') goto fail;
      if (buf_idx >= sizeof(buf) - 1) goto fail;
      if (escaped) {
        // the character after a backslash is stored as-is (after translation),
        // so "\\" yields a single backslash instead of starting a new escape
        buf[buf_idx] = escape_char(c); ++buf_idx;
        escaped = 0;
      } else if (c == '\\') escaped = 1;
      else { buf[buf_idx] = c; ++buf_idx; }
      c = getc(input);
    }
    char *s = strndup(buf, buf_idx);
    return (token_t) { .type = STR_LITERAL, .str = s, .pos = ftell(input) };
  }

  if (isdigit(c)) {
    while (isdigit(c)) {
      if (buf_idx >= sizeof(buf) - 1) goto fail; // token too long for buf
      buf[buf_idx] = c; ++buf_idx;
      c = getc(input);
    }
    ungetc(c, input);
    char *s = strndup(buf, buf_idx);
    return (token_t) { .type = INT_LITERAL, .str = s, .pos = ftell(input) };
  }

  if (isalpha(c) || c == '_') {
    while (isalnum(c) || c == '_') {
      if (buf_idx >= sizeof(buf) - 1) goto fail; // token too long for buf
      buf[buf_idx] = c; ++buf_idx;
      c = getc(input);
    }
    ungetc(c, input);

    if (strcmp(buf, "int") == 0) RET_EMPTY_TOKEN(INT);
    if (strcmp(buf, "char") == 0) RET_EMPTY_TOKEN(CHAR);
    if (strcmp(buf, "void") == 0) RET_EMPTY_TOKEN(VOID);
    if (strcmp(buf, "if") == 0) RET_EMPTY_TOKEN(IF);
    if (strcmp(buf, "else") == 0) RET_EMPTY_TOKEN(ELSE);
    if (strcmp(buf, "while") == 0) RET_EMPTY_TOKEN(WHILE);
    if (strcmp(buf, "continue") == 0) RET_EMPTY_TOKEN(CONTINUE);
    if (strcmp(buf, "break") == 0) RET_EMPTY_TOKEN(BREAK);
    if (strcmp(buf, "return") == 0) RET_EMPTY_TOKEN(RETURN);

    char *s = strndup(buf, buf_idx);
    return (token_t) { .type = IDENT, .str = s, .pos = ftell(input) };
  }

fail:
  // ftell returns a long, so print it with %ld
  printf("Malformed token: %s\nAt position %ld\n", buf, ftell(input));
  exit(1);

  return (token_t) { 0 }; // to suppress compiler warning
}

void buffer_token(token_t tok)
{
  buffered_token = 1;
  tok_buf = tok;
}

typedef enum {
  nFUNCTION, nARGUMENT, nSTMT, nEXPR, nTYPE
} ast_node_type_t;

typedef enum {
  // nSTMT variants
  vDECL, vEXPR, vEMPTY, vBLOCK, vIF, vWHILE, vCONTINUE, vBREAK, vRETURN,

  // nEXPR variants
  vIDENT, vINT_LITERAL, vCHAR_LITERAL, vSTRING_LITERAL,
  vDEREF, vADDRESSOF, vINCREMENT, vDECREMENT, vNOT,
  vADD, vSUBTRACT, vMULTIPLY, vDIVIDE, vMODULO,
  vLT, vGT, vEQUAL, vAND, vOR,
  vBIT_AND, vBIT_OR, vBIT_XOR, vBIT_NOT,
  vASSIGN, vCALL,

  // nTYPE variants
  vINT, vCHAR, vVOID, vPTR
} ast_node_variant_t;

typedef struct ast_node_s {
  ast_node_type_t type;
  ast_node_variant_t variant;
  int32_t i;
  char *s;
  struct ast_node_s *children;
  struct ast_node_s *next;
} ast_node_t;

ast_node_t *parse_type()
{
  token_t tok = next_token();
  if (tok.type != INT && tok.type != CHAR && tok.type != VOID) {
    printf("Expected type at position %u\n", tok.pos);
    exit(1);
  }

  ast_node_t *root = malloc(sizeof(ast_node_t));
  memset(root, 0, sizeof(ast_node_t));
  root->type = nTYPE;
  if (tok.type == INT) root->variant = vINT;
  else if (tok.type == CHAR) root->variant = vCHAR;
  else root->variant = vVOID;

  tok = next_token();
  while (tok.type == ASTERISK) {
    ast_node_t *ptr_type = malloc(sizeof(ast_node_t));
    memset(ptr_type, 0, sizeof(ast_node_t));
    ptr_type->type = nTYPE;
    ptr_type->variant = vPTR;
    ptr_type->children = root;
    root = ptr_type;
    tok = next_token();
  }
  buffer_token(tok);

  return root;
}

void check_lval(ast_node_t *root, uint32_t pos)
{
  if (root == NULL) {
    printf("Invalid lvalue at position %u\n", pos);
    exit(1);
  }
  if (
    root->type != nEXPR
    || (root->variant != vIDENT && root->variant != vDEREF)
    ) {
    printf("Invalid lvalue at position %u\n", pos);
    exit(1);
  }
}

ast_node_t *parse_expr();

ast_node_t *parse_operand()
{
  ast_node_t *root;
  token_t tok = next_token();

  if (tok.type == LPAREN) {
    root = parse_expr();
    tok = next_token();
    if (tok.type != RPAREN) {
      printf("Expected ')' at position %u\n", tok.pos);
      exit(1);
    }

    goto parse_call;
  }

  root = malloc(sizeof(ast_node_t));
  memset(root, 0, sizeof(ast_node_t));
  root->type = nEXPR;

  if (tok.type == INT_LITERAL) {
    root->variant = vINT_LITERAL;
    root->i = atoi(tok.str);
    return root;
  }

  if (tok.type == CHAR_LITERAL) {
    root->variant = vCHAR_LITERAL;
    root->i = tok.str[0];
    return root;
  }

  if (tok.type == STR_LITERAL) {
    root->variant = vSTRING_LITERAL;
    root->s = tok.str;
    return root;
  }

  if (tok.type == IDENT) {
    root->variant = vIDENT;
    root->s = tok.str;
    goto parse_call;
  }

  printf("Malformed expression at position %u\n", tok.pos);
  exit(1);
  return root;

parse_call:
  tok = next_token();
  while (tok.type == LPAREN) {
    ast_node_t *callee = root;
    root = malloc(sizeof(ast_node_t));
    memset(root, 0, sizeof(ast_node_t));
    root->type = nEXPR;
    root->variant = vCALL;
    root->children = callee;

    ast_node_t **current_arg = &(callee->next);

    tok = next_token();
    while (tok.type != RPAREN) {
      buffer_token(tok);
      *current_arg = parse_expr();
      tok = next_token();
      if (tok.type == COMMA) tok = next_token();
      current_arg = &((*current_arg)->next);
    }

    tok = next_token();
  }
  buffer_token(tok);

  return root;
}

ast_node_t *parse_expr()
{
  ast_node_t *root;
  token_t tok = next_token();

  uint8_t is_unary_op = 1;
  ast_node_variant_t unary_op_variant;

  switch (tok.type) {
  case BIT_AND: unary_op_variant = vADDRESSOF; break;
  case ASTERISK: unary_op_variant = vDEREF; break;
  case BIT_NOT: unary_op_variant = vBIT_NOT; break;
  case NOT: unary_op_variant = vNOT; break;
  case PLUS_PLUS: unary_op_variant = vINCREMENT; break;
  case MINUS_MINUS: unary_op_variant = vDECREMENT; break;
  default: is_unary_op = 0;
  }

  if (is_unary_op) {
    ast_node_t *child = parse_operand();
    if (unary_op_variant == vINCREMENT || unary_op_variant == vDECREMENT)
      check_lval(child, tok.pos);
    root = malloc(sizeof(ast_node_t));
    memset(root, 0, sizeof(ast_node_t));
    root->type = nEXPR;
    root->variant = unary_op_variant;
    root->children = child;
  } else {
    buffer_token(tok);
    root = parse_operand();
  }

#define PARSE_BINOP(tok_type, variant_type, transform)  \
  if (tok.type == (tok_type)) {                         \
    ast_node_t *left = root;                            \
    ast_node_t *right = parse_operand();                \
    root = malloc(sizeof(ast_node_t));                  \
    memset(root, 0, sizeof(ast_node_t));                \
    root->type = nEXPR;                                 \
    root->variant = (variant_type);                     \
    left->next = right;                                 \
    root->children = left;                              \
    transform;                                          \
    return root;                                        \
  }

  tok = next_token();

  PARSE_BINOP(PLUS, vADD, {});
  PARSE_BINOP(MINUS, vSUBTRACT, {});
  PARSE_BINOP(ASTERISK, vMULTIPLY, {});
  PARSE_BINOP(SLASH, vDIVIDE, {});
  PARSE_BINOP(PERCENT, vMODULO, {});
  PARSE_BINOP(LT, vLT, {});
  PARSE_BINOP(GT, vGT, {});
  PARSE_BINOP(EQUAL, vEQUAL, {});
  PARSE_BINOP(LTE, vGT, {
    ast_node_t *not_node = malloc(sizeof(ast_node_t));
    memset(not_node, 0, sizeof(ast_node_t));
    not_node->type = nEXPR;
    not_node->variant = vNOT;
    not_node->children = root;
    root = not_node;
  });
  PARSE_BINOP(GTE, vLT, {
    ast_node_t *not_node = malloc(sizeof(ast_node_t));
    memset(not_node, 0, sizeof(ast_node_t));
    not_node->type = nEXPR;
    not_node->variant = vNOT;
    not_node->children = root;
    root = not_node;
  });
  PARSE_BINOP(NEQUAL, vEQUAL, {
    ast_node_t *not_node = malloc(sizeof(ast_node_t));
    memset(not_node, 0, sizeof(ast_node_t));
    not_node->type = nEXPR;
    not_node->variant = vNOT;
    not_node->children = root;
    root = not_node;
  });
  PARSE_BINOP(AND, vAND, {});
  PARSE_BINOP(OR, vOR, {});
  PARSE_BINOP(BIT_AND, vBIT_AND, {});
  PARSE_BINOP(BIT_OR, vBIT_OR, {});
  PARSE_BINOP(BIT_XOR, vBIT_XOR, {});
  PARSE_BINOP(EQ, vASSIGN, { check_lval(left, tok.pos); });
  PARSE_BINOP(PLUS_EQ, vADD, {
    check_lval(left, tok.pos);
    ast_node_t *new_left = malloc(sizeof(ast_node_t));
    memcpy(new_left, left, sizeof(ast_node_t));
    new_left->next = root;
    ast_node_t *new_root = malloc(sizeof(ast_node_t));
    memset(new_root, 0, sizeof(ast_node_t));
    new_root->type = nEXPR;
    new_root->variant = vASSIGN;
    new_root->children = new_left;
    root = new_root;
  });
  PARSE_BINOP(MINUS_EQ, vSUBTRACT, {
    check_lval(left, tok.pos);
    ast_node_t *new_left = malloc(sizeof(ast_node_t));
    memcpy(new_left, left, sizeof(ast_node_t));
    new_left->next = root;
    ast_node_t *new_root = malloc(sizeof(ast_node_t));
    memset(new_root, 0, sizeof(ast_node_t));
    new_root->type = nEXPR;
    new_root->variant = vASSIGN;
    new_root->children = new_left;
    root = new_root;
  });
  PARSE_BINOP(ASTERISK_EQ, vMULTIPLY, {
    check_lval(left, tok.pos);
    ast_node_t *new_left = malloc(sizeof(ast_node_t));
    memcpy(new_left, left, sizeof(ast_node_t));
    new_left->next = root;
    ast_node_t *new_root = malloc(sizeof(ast_node_t));
    memset(new_root, 0, sizeof(ast_node_t));
    new_root->type = nEXPR;
    new_root->variant = vASSIGN;
    new_root->children = new_left;
    root = new_root;
  });
  PARSE_BINOP(SLASH_EQ, vDIVIDE, {
    check_lval(left, tok.pos);
    ast_node_t *new_left = malloc(sizeof(ast_node_t));
    memcpy(new_left, left, sizeof(ast_node_t));
    new_left->next = root;
    ast_node_t *new_root = malloc(sizeof(ast_node_t));
    memset(new_root, 0, sizeof(ast_node_t));
    new_root->type = nEXPR;
    new_root->variant = vASSIGN;
    new_root->children = new_left;
    root = new_root;
  });
  PARSE_BINOP(PERCENT_EQ, vMODULO, {
    check_lval(left, tok.pos);
    ast_node_t *new_left = malloc(sizeof(ast_node_t));
    memcpy(new_left, left, sizeof(ast_node_t));
    new_left->next = root;
    ast_node_t *new_root = malloc(sizeof(ast_node_t));
    memset(new_root, 0, sizeof(ast_node_t));
    new_root->type = nEXPR;
    new_root->variant = vASSIGN;
    new_root->children = new_left;
    root = new_root;
  });
  PARSE_BINOP(AND_EQ, vBIT_AND, {
    check_lval(left, tok.pos);
    ast_node_t *new_left = malloc(sizeof(ast_node_t));
    memcpy(new_left, left, sizeof(ast_node_t));
    new_left->next = root;
    ast_node_t *new_root = malloc(sizeof(ast_node_t));
    memset(new_root, 0, sizeof(ast_node_t));
    new_root->type = nEXPR;
    new_root->variant = vASSIGN;
    new_root->children = new_left;
    root = new_root;
  });
  PARSE_BINOP(OR_EQ, vBIT_OR, {
    check_lval(left, tok.pos);
    ast_node_t *new_left = malloc(sizeof(ast_node_t));
    memcpy(new_left, left, sizeof(ast_node_t));
    new_left->next = root;
    ast_node_t *new_root = malloc(sizeof(ast_node_t));
    memset(new_root, 0, sizeof(ast_node_t));
    new_root->type = nEXPR;
    new_root->variant = vASSIGN;
    new_root->children = new_left;
    root = new_root;
  });
  PARSE_BINOP(XOR_EQ, vBIT_XOR, {
    check_lval(left, tok.pos);
    ast_node_t *new_left = malloc(sizeof(ast_node_t));
    memcpy(new_left, left, sizeof(ast_node_t));
    new_left->next = root;
    ast_node_t *new_root = malloc(sizeof(ast_node_t));
    memset(new_root, 0, sizeof(ast_node_t));
    new_root->type = nEXPR;
    new_root->variant = vASSIGN;
    new_root->children = new_left;
    root = new_root;
  });
  PARSE_BINOP(NOT_EQ, vBIT_NOT, {
    check_lval(left, tok.pos);
    ast_node_t *new_left = malloc(sizeof(ast_node_t));
    memcpy(new_left, left, sizeof(ast_node_t));
    new_left->next = root;
    ast_node_t *new_root = malloc(sizeof(ast_node_t));
    memset(new_root, 0, sizeof(ast_node_t));
    new_root->type = nEXPR;
    new_root->variant = vASSIGN;
    new_root->children = new_left;
    root = new_root;
  });

  buffer_token(tok);

  return root;
}

ast_node_t *parse_stmt()
{
  ast_node_t *root = malloc(sizeof(ast_node_t));
  memset(root, 0, sizeof(ast_node_t));
  root->type = nSTMT;
  token_t tok = next_token();

  if (tok.type == SEMICOLON) {
    root->variant = vEMPTY;
    return root;
  }

  if (tok.type == CONTINUE) {
    root->variant = vCONTINUE;
    tok = next_token();
    if (tok.type != SEMICOLON) goto fail;
    return root;
  }
  if (tok.type == BREAK) {
    root->variant = vBREAK;
    tok = next_token();
    if (tok.type != SEMICOLON) goto fail;
    return root;
  }

  if (tok.type == IF) {
    root->variant = vIF;
    tok = next_token();
    if (tok.type != LPAREN) goto fail;
    ast_node_t *cond_expr = parse_expr();
    tok = next_token();
    if (tok.type != RPAREN) goto fail;
    ast_node_t *if_stmt = parse_stmt();

    // if the statement is not a block, wrap it in a block
    // simplifies symtab construction and codegen
    if (if_stmt->variant != vBLOCK) {
      ast_node_t *block = malloc(sizeof(ast_node_t));
      memset(block, 0, sizeof(ast_node_t));
      block->type = nSTMT;
      block->variant = vBLOCK;
      block->children = if_stmt;
      if_stmt = block;
    }

    tok = next_token();
    ast_node_t *else_stmt;
    if (tok.type == ELSE) else_stmt = parse_stmt();
    else {
      buffer_token(tok);
      else_stmt = malloc(sizeof(ast_node_t));
      memset(else_stmt, 0, sizeof(ast_node_t));
      else_stmt->type = nSTMT;
      else_stmt->variant = vBLOCK;
    }

    if (else_stmt->variant != vBLOCK) {
      ast_node_t *block = malloc(sizeof(ast_node_t));
      memset(block, 0, sizeof(ast_node_t));
      block->type = nSTMT;
      block->variant = vBLOCK;
      block->children = else_stmt;
      else_stmt = block;
    }

    cond_expr->next = if_stmt;
    if_stmt->next = else_stmt;
    root->children = cond_expr;
    return root;
  }

  if (tok.type == WHILE) {
    root->variant = vWHILE;
    tok = next_token();
    if (tok.type != LPAREN) goto fail;
    ast_node_t *cond_expr = parse_expr();
    tok = next_token();
    if (tok.type != RPAREN) goto fail;
    ast_node_t *while_stmt = parse_stmt();

    if (while_stmt->variant != vBLOCK) {
      ast_node_t *block = malloc(sizeof(ast_node_t));
      memset(block, 0, sizeof(ast_node_t));
      block->type = nSTMT;
      block->variant = vBLOCK;
      block->children = while_stmt;
      while_stmt = block;
    }

    cond_expr->next = while_stmt;
    root->children = cond_expr;
    return root;
  }

  if (tok.type == RETURN) {
    root->variant = vRETURN;
    tok = next_token();
    if (tok.type == SEMICOLON) return root;
    buffer_token(tok);
    ast_node_t *return_expr = parse_expr();
    root->children = return_expr;
    tok = next_token();
    if (tok.type != SEMICOLON) goto fail;
    return root;
  }

  if (tok.type == LCURLY) {
    root->variant = vBLOCK;
    ast_node_t **child = &(root->children);
    tok = next_token();
    while (tok.type != RCURLY) {
      buffer_token(tok);
      *child = parse_stmt();
      child = &((*child)->next);
      tok = next_token();
    }

    return root;
  }

  if (tok.type == INT || tok.type == CHAR || tok.type == VOID) {
    buffer_token(tok);
    root->variant = vDECL;
    ast_node_t *type_node = parse_type();
    tok = next_token();
    if (tok.type != IDENT) goto fail;
    root->s = tok.str;
    root->children = type_node;
    tok = next_token();
    if (tok.type != SEMICOLON) goto fail;
    return root;
  }

  buffer_token(tok);
  root->variant = vEXPR;
  ast_node_t *expr_node = parse_expr();
  root->children = expr_node;
  tok = next_token();
  if (tok.type != SEMICOLON) {
  fail:
    printf("Malformed statement at position %u\n", tok.pos);
    exit(1);
  }

  return root;
}

ast_node_t *parse()
{
  // start as NULL so that an empty input yields an empty list, not garbage
  ast_node_t *root = NULL;
  ast_node_t **current = &root;
  token_t tok = next_token();

  while (tok.type != EOF_) {
    buffer_token(tok);
    ast_node_t *type_node = parse_type();
    tok = next_token();
    if (tok.type != IDENT) {
    fail:
      printf("Malformed declaration at position %u\n", tok.pos);
      exit(1);
    }
    char *name = tok.str;
    tok = next_token();
    if (tok.type == SEMICOLON) {
      *current = malloc(sizeof(ast_node_t));
      memset(*current, 0, sizeof(ast_node_t));
      (*current)->type = nSTMT;
      (*current)->variant = vDECL;
      (*current)->s = name;
      (*current)->children = type_node;
      current = &((*current)->next);
      tok = next_token();
      continue;
    }

    if (tok.type != LPAREN) goto fail;

    *current = malloc(sizeof(ast_node_t));
    memset(*current, 0, sizeof(ast_node_t));
    (*current)->type = nFUNCTION;
    (*current)->s = name;
    (*current)->children = type_node;

    ast_node_t **current_arg = &(type_node->next);

    tok = next_token();
    while (tok.type != RPAREN) {
      buffer_token(tok);
      ast_node_t *arg_type_node = parse_type();
      tok = next_token();
      if (tok.type != IDENT) goto fail;
      *current_arg = malloc(sizeof(ast_node_t));
      memset(*current_arg, 0, sizeof(ast_node_t));
      (*current_arg)->type = nARGUMENT;
      (*current_arg)->s = tok.str;
      (*current_arg)->children = arg_type_node;
      current_arg = &((*current_arg)->next);

      tok = next_token();
      if (tok.type == COMMA) tok = next_token();
    }

    *current_arg = parse_stmt();
    if ((*current_arg)->variant != vBLOCK && (*current_arg)->variant != vEMPTY)
      goto fail;

    current = &((*current)->next);
    tok = next_token();
  }

  return root;
}

uint32_t hash(char *str)
{
  uint32_t h = 5381;
  int c;

  while ((c = *str++))
    h = (h << 5) + h + c;

  return h;
}

typedef enum {
  tINT, tCHAR, tVOID, tINT_PTR, tCHAR_PTR, tVOID_PTR, tPTR_PTR
} symbol_type_t;

typedef enum {
  lTEXT, lDATA, lSTACK
} loc_type_t;

typedef struct symbol_s {
  char *name;
  symbol_type_t type;
  uint32_t loc;
  loc_type_t loc_type;
  struct symbol_s *child;
  struct symbol_s *parent;
} symbol_t;

// these numbers are completely arbitrary
#define DATA_START 0x3000
#define TEXT_START 0x4000

uint8_t *text = NULL;
uint32_t text_loc = 0;
uint32_t text_cap = 0;

// arbitrary data limit
#define DATA_CAP 0x1000
uint8_t data[DATA_CAP];
uint32_t data_loc = 0;

// this is arbitrary
#define SYMTAB_SIZE 256

symbol_t *root_symtab = NULL;

uint8_t symtab_insert(symbol_t *tab, symbol_t sym) 837 | { 838 | uint32_t idx = hash(sym.name) % SYMTAB_SIZE; 839 | uint32_t i = idx; 840 | do { 841 | if (tab[i].name == NULL) { 842 | tab[i] = sym; 843 | return 0; 844 | } 845 | if (strcmp(tab[i].name, sym.name) == 0) { 846 | tab[i] = sym; 847 | return 1; 848 | } 849 | i = (i + 1) % SYMTAB_SIZE; 850 | } while (i != idx); 851 | printf("Too many (>256) symbols\n"); 852 | exit(1); 853 | } 854 | 855 | symbol_t *symtab_get(symbol_t *tab, char *name) 856 | { 857 | if (tab == NULL) return NULL; 858 | uint32_t idx = hash(name) % SYMTAB_SIZE; 859 | symbol_t *parent = tab[0].parent; 860 | uint32_t i = idx; 861 | do { 862 | if (tab[i].name == NULL) { i = (i + 1) % SYMTAB_SIZE; continue; } 863 | if (strcmp(tab[i].name, name) == 0) return &(tab[i]); 864 | i = (i + 1) % SYMTAB_SIZE; 865 | } while (i != idx); 866 | 867 | return symtab_get(parent, name); 868 | } 869 | 870 | symbol_type_t symbol_type_of_node_type(ast_node_t *v) 871 | { 872 | switch (v->variant) { 873 | case vINT: return tINT; 874 | case vCHAR: return tCHAR; 875 | case vVOID: return tVOID; 876 | case vPTR: 877 | switch (v->children->variant) { 878 | case vINT: return tINT_PTR; 879 | case vCHAR: return tCHAR_PTR; 880 | case vVOID: return tVOID_PTR; 881 | case vPTR: return tPTR_PTR; 882 | default: ; 883 | } 884 | default: ; 885 | } 886 | 887 | return 0; 888 | } 889 | 890 | uint32_t construct_symtab( 891 | ast_node_t *root, symbol_t *out, symbol_t *parent, 892 | uint32_t loc, uint32_t block_id 893 | ) 894 | { 895 | if (root->type != nSTMT && root->type != nFUNCTION) { 896 | printf("Failed to construct symbol table\n"); 897 | exit(1); 898 | } 899 | 900 | if (root->type == nFUNCTION) { 901 | symbol_t sym; 902 | memset(&sym, 0, sizeof(symbol_t)); 903 | sym.parent = parent; 904 | sym.name = root->s; 905 | ast_node_t *type_node = root->children; 906 | sym.type = symbol_type_of_node_type(type_node); 907 | sym.loc = text_loc; 908 | sym.loc_type = lTEXT; 909 | sym.child = 
malloc(sizeof(symbol_t) * SYMTAB_SIZE); 910 | memset(sym.child, 0, sizeof(symbol_t) * SYMTAB_SIZE); 911 | sym.child[0].parent = out; 912 | 913 | ast_node_t *argument_nodes = type_node->next; 914 | ast_node_t *current_arg = argument_nodes; 915 | uint32_t arg_offset = 8; // 4 bytes for return address and 4 for old ebp 916 | while (current_arg->type == nARGUMENT) { 917 | symbol_t arg_sym; 918 | arg_sym.name = current_arg->s; 919 | arg_sym.loc = arg_offset; 920 | arg_sym.loc_type = lSTACK; 921 | arg_sym.type = symbol_type_of_node_type(current_arg->children); 922 | arg_sym.child = NULL; 923 | arg_sym.parent = out; 924 | symtab_insert(sym.child, arg_sym); 925 | arg_offset += 4; 926 | current_arg = current_arg->next; 927 | } 928 | 929 | if (current_arg->type == nSTMT && current_arg->variant == vEMPTY) 930 | sym.loc = -1; 931 | 932 | uint32_t stack_size = construct_symtab(current_arg, sym.child, out, 0, 0); 933 | symtab_insert(out, sym); 934 | return stack_size; 935 | } 936 | 937 | if (root->variant == vDECL) { 938 | symbol_t sym; 939 | sym.name = root->s; 940 | sym.type = symbol_type_of_node_type(root->children); 941 | uint32_t size = 4; 942 | if (sym.type == tCHAR) size = 1; 943 | if (out == root_symtab) { 944 | sym.loc = data_loc; 945 | sym.loc_type = lDATA; 946 | } else { 947 | sym.loc = -(loc + size); 948 | sym.loc_type = lSTACK; 949 | } 950 | sym.child = NULL; 951 | sym.parent = parent; 952 | symtab_insert(out, sym); 953 | return size; 954 | } 955 | 956 | if (root->variant == vBLOCK) { 957 | symbol_t sym; 958 | sym.name = malloc(2); 959 | sym.name[0] = block_id % 256; 960 | sym.name[1] = 0; 961 | sym.child = malloc(sizeof(symbol_t) * SYMTAB_SIZE); 962 | memset(sym.child, 0, sizeof(symbol_t) * SYMTAB_SIZE); 963 | sym.child[0].parent = out; 964 | sym.parent = parent; 965 | sym.loc_type = lSTACK; 966 | 967 | ast_node_t *current_child = root->children; 968 | uint32_t size = 0; 969 | uint32_t bid = 0; 970 | while (current_child != NULL) { 971 | size += 
construct_symtab(current_child, sym.child, out, loc + size, bid); 972 | if (current_child->variant == vBLOCK || current_child->variant == vWHILE) 973 | ++bid; 974 | if (current_child->variant == vIF) bid += 2; 975 | current_child = current_child->next; 976 | } 977 | 978 | sym.loc = -(loc + size); // TODO make this the text offset of the block? 979 | symtab_insert(out, sym); 980 | return size; 981 | } 982 | 983 | if (root->variant == vWHILE) 984 | return construct_symtab(root->children->next, out, parent, loc, block_id); 985 | 986 | if (root->variant == vIF) { 987 | uint32_t s = construct_symtab(root->children->next, out, parent, loc, block_id); 988 | return s + construct_symtab( 989 | root->children->next->next, out, parent, loc + s, block_id + 1 990 | ); 991 | } 992 | 993 | return 0; 994 | } 995 | 996 | void write_text(uint8_t *b, uint32_t n) 997 | { 998 | if (text == NULL || text_loc + n >= text_cap) { 999 | uint32_t size = 2 * (text_loc + n); 1000 | text = realloc(text, size); 1001 | text_cap = size; 1002 | } 1003 | memcpy(text + text_loc, b, n); 1004 | text_loc += n; 1005 | } 1006 | 1007 | void write_data(uint8_t *b, uint32_t n) 1008 | { 1009 | if (data_loc + n > DATA_CAP) { 1010 | printf("Too much data.\n"); 1011 | exit(1); 1012 | } 1013 | memcpy(data + data_loc, b, n); 1014 | data_loc += n; 1015 | } 1016 | 1017 | typedef enum { 1018 | rMOV_EAX, rOFFSET, rIMM 1019 | } relocation_type_t; 1020 | 1021 | typedef struct relocation_s { 1022 | uint32_t addr; 1023 | char *name; 1024 | symbol_t *symtab; 1025 | relocation_type_t type; 1026 | struct relocation_s *next; 1027 | } relocation_t; 1028 | 1029 | relocation_t *relocs = NULL; 1030 | 1031 | void add_relocation(uint32_t addr, char *name, symbol_t *symtab, relocation_type_t t) 1032 | { 1033 | relocation_t *r = malloc(sizeof(relocation_t)); 1034 | *r = (relocation_t) { 1035 | .addr = addr, .name = name, .symtab = symtab, .type = t, .next = relocs 1036 | }; 1037 | relocs = r; 1038 | } 1039 | 1040 | uint32_t 
codegen_argument(ast_node_t *arg, symbol_t *symtab); 1041 | 1042 | symbol_type_t codegen_expr(ast_node_t *expr, symbol_t *symtab) 1043 | { 1044 | if (expr->variant == vINT_LITERAL || expr->variant == vCHAR_LITERAL) { 1045 | // for int literal: 1046 | // movl , %eax 1047 | // for char literal: 1048 | // movb , %al 1049 | 1050 | uint8_t tmp = 0xb0; 1051 | uint32_t size = 1; 1052 | if (expr->variant == vINT_LITERAL) { tmp = 0xb8; size = 4; } 1053 | write_text(&tmp, 1); 1054 | uint32_t imm = expr->i; 1055 | for (uint32_t i = 0; i < size; ++i) { 1056 | uint8_t tmp = imm & 0xff; 1057 | write_text(&tmp, 1); 1058 | imm >>= 8; 1059 | } 1060 | if (expr->variant == vINT_LITERAL) return tINT; 1061 | return tCHAR; 1062 | } 1063 | 1064 | if (expr->variant == vIDENT) { 1065 | symbol_t *sym = symtab_get(symtab, expr->s); 1066 | if (sym == NULL) { 1067 | printf("Undefined symbol %s\n", expr->s); 1068 | exit(1); 1069 | } 1070 | if (sym->loc == (uint32_t) -1) { 1071 | add_relocation(text_loc, expr->s, symtab, rMOV_EAX); 1072 | text_loc += 5; 1073 | goto global_ident; 1074 | } 1075 | 1076 | if (sym->loc_type == lSTACK) { 1077 | // movl x(%ebp), %eax 1078 | uint8_t tmp[2] = { 0x8b, 0x85 }; 1079 | if (sym->type == tCHAR) tmp[0] = 0x8a; 1080 | write_text(tmp, 2); 1081 | uint32_t off = sym->loc; 1082 | for (uint32_t i = 0; i < 4; ++i) { 1083 | uint8_t tmp0 = off; 1084 | write_text(&tmp0, 1); 1085 | off >>= 8; 1086 | } 1087 | return sym->type; 1088 | } 1089 | 1090 | // movl addr, %eax 1091 | uint32_t addr = DATA_START + sym->loc; 1092 | if (sym->loc_type == lTEXT) addr = TEXT_START + sym->loc; 1093 | uint8_t tmp = 0xb8; 1094 | write_text(&tmp, 1); 1095 | for (uint32_t i = 0; i < 4; ++i) { 1096 | uint8_t tmp = addr & 0xff; 1097 | write_text(&tmp, 1); 1098 | addr >>= 8; 1099 | } 1100 | 1101 | global_ident: 1102 | // assume all symbols in .text are function pointers 1103 | // so do not dereference 1104 | if (sym->loc_type == lTEXT) return sym->type; 1105 | 1106 | // for int: 1107 | // movl 
(%eax), %eax 1108 | // for char: 1109 | // movb (%eax), %al 1110 | uint8_t tmp2[2] = { 0x8b, 0 }; 1111 | if (sym->type == tCHAR) tmp2[0] = 0x8a; 1112 | write_text(tmp2, 2); 1113 | return sym->type; 1114 | } 1115 | 1116 | if (expr->variant == vSTRING_LITERAL) { 1117 | // write to data section 1118 | uint32_t addr = DATA_START + data_loc; 1119 | uint32_t len = strlen(expr->s); 1120 | write_data((uint8_t *)(expr->s), len + 1); 1121 | 1122 | // movl addr, %eax 1123 | uint8_t tmp = 0xb8; 1124 | write_text(&tmp, 1); 1125 | for (uint32_t i = 0; i < 4; ++i) { 1126 | uint8_t tmp = addr & 0xff; 1127 | write_text(&tmp, 1); 1128 | addr >>= 8; 1129 | } 1130 | 1131 | return tCHAR_PTR; 1132 | } 1133 | 1134 | if (expr->variant == vDEREF) { 1135 | // for now, we do not bother returning the correct expression type 1136 | // for more than one level of indirection because nanoc treats all pointer 1137 | // arithmetic as being performed on char* / int 1138 | 1139 | // for int: 1140 | // movl (%eax), %eax 1141 | // for char: 1142 | // movb (%eax), %al 1143 | symbol_type_t ptr_type = codegen_expr(expr->children, symtab); 1144 | uint8_t tmp2[2] = { 0x8b, 0 }; 1145 | if (ptr_type == tCHAR_PTR) tmp2[0] = 0x8a; 1146 | write_text(tmp2, 2); 1147 | if (ptr_type == tCHAR_PTR) return tCHAR; 1148 | return tINT; 1149 | } 1150 | 1151 | if (expr->variant == vADDRESSOF) { 1152 | ast_node_t *child = expr->children; 1153 | if (child->variant != vDEREF && child->variant != vIDENT) { 1154 | printf("Invalid operand for 'address of' operator\n"); 1155 | exit(1); 1156 | } 1157 | if (child->variant == vDEREF) 1158 | return codegen_expr(child->children, symtab); 1159 | 1160 | symbol_t *sym = symtab_get(symtab, child->s); 1161 | if (sym == NULL) { 1162 | printf("Undefined symbol %s\n", child->s); 1163 | exit(1); 1164 | } 1165 | if (sym->loc == (uint32_t) -1) { 1166 | add_relocation(text_loc, child->s, symtab, rMOV_EAX); 1167 | text_loc += 5; 1168 | return sym->type; 1169 | } 1170 | 1171 | if (sym->loc_type == 
lSTACK) { 1172 | // movl %ebp, %eax 1173 | uint8_t tmp[2] = { 0x89, 0xe8 }; 1174 | write_text(tmp, 2); 1175 | 1176 | // addl offset, %eax 1177 | tmp[0] = 0x05; 1178 | write_text(tmp, 1); 1179 | uint32_t off = sym->loc; 1180 | for (uint32_t i = 0; i < 4; ++i) { 1181 | uint8_t tmp = off & 0xff; 1182 | write_text(&tmp, 1); 1183 | off >>= 8; 1184 | } 1185 | switch (sym->type) { 1186 | case tINT: return tINT_PTR; 1187 | case tCHAR: return tCHAR_PTR; 1188 | case tVOID: return tVOID_PTR; 1189 | default: return tPTR_PTR; 1190 | } 1191 | } 1192 | 1193 | // movl addr, %eax 1194 | uint32_t addr = DATA_START + sym->loc; 1195 | if (sym->loc_type == lTEXT) addr = TEXT_START + sym->loc; 1196 | uint8_t tmp = 0xb8; 1197 | write_text(&tmp, 1); 1198 | for (uint32_t i = 0; i < 4; ++i) { 1199 | uint8_t tmp = addr & 0xff; 1200 | write_text(&tmp, 1); 1201 | addr >>= 8; 1202 | } 1203 | 1204 | if (sym->loc_type == lTEXT) return sym->type; 1205 | switch (sym->type) { 1206 | case tINT: return tINT_PTR; 1207 | case tCHAR: return tCHAR_PTR; 1208 | case tVOID: return tVOID_PTR; 1209 | default: return tPTR_PTR; 1210 | } 1211 | } 1212 | 1213 | if (expr->variant == vINCREMENT || expr->variant == vDECREMENT) { 1214 | ast_node_t *lval = malloc(sizeof(ast_node_t)); 1215 | memset(lval, 0, sizeof(ast_node_t)); 1216 | lval->type = nEXPR; 1217 | lval->variant = vADDRESSOF; 1218 | lval->children = expr->children; 1219 | codegen_expr(lval, symtab); 1220 | 1221 | // pushl %eax 1222 | uint8_t tmp0 = 0x50; write_text(&tmp0, 1); 1223 | 1224 | symbol_type_t child_type = codegen_expr(expr->children, symtab); 1225 | uint8_t tmp[2] = { 0x40, 0 }; 1226 | if (expr->variant == vINCREMENT && child_type != tCHAR) 1227 | write_text(tmp, 1); // incl %eax 1228 | else if (expr->variant == vDECREMENT && child_type != tCHAR) { 1229 | tmp[0] = 0x48; write_text(tmp, 1); // decl %eax 1230 | } else if (expr->variant == vINCREMENT && child_type == tCHAR) { 1231 | tmp[0] = 0xfe; tmp[1] = 0xc0; write_text(tmp, 2); // incb %al 1232 
| } else { 1233 | tmp[0] = 0xfe; tmp[1] = 0xc8; write_text(tmp, 2); // decb %al 1234 | } 1235 | 1236 | // popl %ecx 1237 | // movl %eax, (%ecx) 1238 | tmp0 = 0x59; write_text(&tmp0, 1); 1239 | tmp[0] = 0x89; tmp[1] = 0x01; 1240 | write_text(tmp, 2); 1241 | return child_type; 1242 | } 1243 | 1244 | if (expr->variant == vNOT) { 1245 | symbol_type_t child_type = codegen_expr(expr->children, symtab); 1246 | 1247 | // xorl %ecx, %ecx 1248 | // test %eax, %eax 1249 | // sete %cl 1250 | // movl %ecx, %eax 1251 | uint8_t tmp[9] = { 0x31, 0xc9, 0x85, 0xc0, 0x0f, 0x94, 0xc1, 0x89, 0xc8 }; 1252 | // testb %al, %al instead of test %eax, %eax 1253 | if (child_type == tCHAR) tmp[2] = 0x84; 1254 | write_text(tmp, 9); 1255 | return tCHAR; 1256 | } 1257 | 1258 | if (expr->variant == vBIT_NOT) { 1259 | symbol_type_t child_type = codegen_expr(expr->children, symtab); 1260 | // notl %eax 1261 | uint8_t tmp[2] = { 0xf7, 0xd0 }; 1262 | write_text(tmp, 2); 1263 | return child_type; 1264 | } 1265 | 1266 | if ( 1267 | expr->variant == vADD || expr->variant == vSUBTRACT 1268 | || expr->variant == vMULTIPLY || expr->variant == vDIVIDE 1269 | || expr->variant == vMODULO || expr->variant == vBIT_AND 1270 | || expr->variant == vBIT_OR || expr->variant == vBIT_XOR 1271 | ) { 1272 | symbol_type_t right_type = codegen_expr(expr->children->next, symtab); 1273 | // pushl %eax 1274 | uint8_t tmp = 0x50; write_text(&tmp, 1); 1275 | symbol_type_t left_type = codegen_expr(expr->children, symtab); 1276 | // popl %ecx 1277 | // %ecx, %eax 1278 | tmp = 0x59; write_text(&tmp, 1); 1279 | uint8_t tmp1[3] = { 0x01, 0xc8, 0 }; 1280 | uint32_t len = 2; 1281 | if (expr->variant == vADD) { 1282 | if (left_type == tCHAR && right_type == tCHAR) 1283 | tmp1[0] = 0; // addb %cl, %al instead of addl %ecx, %eax 1284 | } else if (expr->variant == vSUBTRACT) { 1285 | tmp1[0] = 0x29; // subl/subb %ecx, %eax 1286 | if (left_type == tCHAR && right_type == tCHAR) tmp1[0] = 0x28; 1287 | } else if (expr->variant == vMULTIPLY) { 
1288 | tmp1[0] = 0x0f; tmp1[1] = 0xaf; tmp1[2] = 0xc1; // imull %ecx, %eax 1289 | len = 3; 1290 | // TODO figure out how to multiply bytes 1291 | } else if (expr->variant == vDIVIDE) { 1292 | tmp1[0] = 0xf7; tmp1[1] = 0xf9; // idivl/idivb %ecx 1293 | if (left_type == tCHAR && right_type == tCHAR) tmp1[0] = 0xf6; 1294 | } else if (expr->variant == vMODULO) { 1295 | tmp1[0] = 0xf7; tmp1[1] = 0xf9; // idivl/idivb %ecx 1296 | if (left_type == tCHAR && right_type == tCHAR) tmp1[0] = 0xf6; 1297 | write_text(tmp1, 2); 1298 | tmp1[0] = 0x89; tmp1[1] = 0xd0; // movl/movb %edx, %eax 1299 | if (left_type == tCHAR && right_type == tCHAR) tmp1[0] = 0x88; 1300 | } else if (expr->variant == vBIT_AND) { 1301 | tmp1[0] = 0x21; tmp1[1] = 0xc8; // andl/andb %ecx, %eax 1302 | if (left_type == tCHAR && right_type == tCHAR) tmp1[0] = 0x20; 1303 | } else if (expr->variant == vBIT_OR) { 1304 | tmp1[0] = 0x09; tmp1[1] = 0xc8; // orl/orb %ecx, %eax 1305 | if (left_type == tCHAR && right_type == tCHAR) tmp1[0] = 0x08; 1306 | } else if (expr->variant == vBIT_XOR) { 1307 | tmp1[0] = 0x31; tmp1[1] = 0xc8; // xorl/xorb %ecx, %eax 1308 | if (left_type == tCHAR && right_type == tCHAR) tmp1[0] = 0x30; 1309 | } 1310 | write_text(tmp1, len); 1311 | if (left_type == tCHAR) return right_type; 1312 | return left_type; 1313 | } 1314 | 1315 | if (expr->variant == vLT || expr->variant == vGT || expr->variant == vEQUAL) { 1316 | symbol_type_t right_type = codegen_expr(expr->children->next, symtab); 1317 | // pushl %eax 1318 | uint8_t tmp = 0x50; write_text(&tmp, 1); 1319 | symbol_type_t left_type = codegen_expr(expr->children, symtab); 1320 | // popl %ecx 1321 | // cmpl/cmpb %ecx, %eax 1322 | // setl/setg/sete %al 1323 | // movzbl %al, %eax 1324 | uint8_t tmp1[9] = { 0x59, 0x39, 0xc8, 0x0f, 0x9c, 0xc0, 0x0f, 0xb6, 0xc0 }; 1325 | if (left_type == tCHAR && right_type == tCHAR) tmp1[1] = 0x38; 1326 | if (expr->variant == vGT) tmp1[4] = 0x9f; 1327 | else if (expr->variant == vEQUAL) tmp1[4] = 0x94; 1328 | 
write_text(tmp1, 9); 1329 | return tINT; 1330 | } 1331 | 1332 | if (expr->variant == vAND) { 1333 | symbol_type_t right_type = codegen_expr(expr->children->next, symtab); 1334 | // pushl %eax 1335 | uint8_t tmp = 0x50; write_text(&tmp, 1); 1336 | symbol_type_t left_type = codegen_expr(expr->children, symtab); 1337 | // popl %ecx 1338 | // mull/mulb %ecx 1339 | // orl %edx, %eax 1340 | // xorl %ecx, %ecx 1341 | // cmpl/cmpb %ecx, %eax 1342 | // setne %al 1343 | // movzbl %al, %eax 1344 | uint8_t tmp1[15] = { 1345 | 0x59, 0xf7, 0xe1, 0x09, 0xd0, 0x31, 0xc9, 0x39, 0xc8, 1346 | 0x0f, 0x95, 0xc0, 0x0f, 0xb6, 0xc0 1347 | }; 1348 | if (left_type == tCHAR && right_type == tCHAR) { 1349 | tmp1[1] = 0xf7; tmp1[7] = 0x38; 1350 | } 1351 | write_text(tmp1, 15); 1352 | return tINT; 1353 | } 1354 | 1355 | if (expr->variant == vOR) { 1356 | symbol_type_t right_type = codegen_expr(expr->children->next, symtab); 1357 | // pushl %eax 1358 | uint8_t tmp = 0x50; write_text(&tmp, 1); 1359 | symbol_type_t left_type = codegen_expr(expr->children, symtab); 1360 | // popl %ecx 1361 | // orl/orb %ecx, %eax 1362 | // xorl %ecx, %ecx 1363 | // cmpl/cmpb %ecx, %ax 1364 | // setne %al 1365 | // movzbl %al, %eax 1366 | uint8_t tmp1[13] = { 1367 | 0x59, 0x09, 0xc8, 0x31, 0xc9, 0x39, 0xc8, 1368 | 0x0f, 0x95, 0xc0, 0x0f, 0xb6, 0xc0 1369 | }; 1370 | if (left_type == tCHAR && right_type == tCHAR) { 1371 | tmp1[1] = 0x08; tmp1[5] = 0x38; 1372 | } 1373 | write_text(tmp1, 13); 1374 | return tINT; 1375 | } 1376 | 1377 | if (expr->variant == vASSIGN) { 1378 | ast_node_t *lval = malloc(sizeof(ast_node_t)); 1379 | memset(lval, 0, sizeof(ast_node_t)); 1380 | lval->type = nEXPR; 1381 | lval->variant = vADDRESSOF; 1382 | lval->children = expr->children; 1383 | codegen_expr(lval, symtab); 1384 | 1385 | // pushl %eax 1386 | uint8_t tmp0 = 0x50; write_text(&tmp0, 1); 1387 | 1388 | codegen_expr(expr->children->next, symtab); 1389 | 1390 | // popl %ecx 1391 | // movl %eax, (%ecx) 1392 | uint8_t tmp[3] = { 0x59, 
0x89, 0x01 }; 1393 | write_text(tmp, 3); 1394 | } 1395 | 1396 | if (expr->variant == vCALL) { 1397 | uint32_t offset = codegen_argument(expr->children->next, symtab); 1398 | symbol_type_t callee_type = codegen_expr(expr->children, symtab); 1399 | // calll *%eax 1400 | uint8_t tmp[2] = { 0xff, 0xd0 }; 1401 | write_text(tmp, 2); 1402 | // addl , %esp 1403 | tmp[0] = 0x81; tmp[1] = 0xc4; 1404 | write_text(tmp, 2); 1405 | for (uint32_t i = 0; i < 4; ++i) { 1406 | uint8_t tmp0 = offset & 0xff; 1407 | write_text(&tmp0, 1); 1408 | offset >>= 8; 1409 | } 1410 | return callee_type; 1411 | } 1412 | 1413 | return tINT; 1414 | } 1415 | 1416 | uint32_t codegen_argument(ast_node_t *arg, symbol_t *symtab) 1417 | { 1418 | if (arg == NULL) return 0; 1419 | uint32_t offset = codegen_argument(arg->next, symtab); 1420 | codegen_expr(arg, symtab); 1421 | 1422 | // pushl %eax 1423 | uint8_t tmp = 0x50; 1424 | write_text(&tmp, 1); 1425 | 1426 | return offset + 4; 1427 | } 1428 | 1429 | void codegen_stmt( 1430 | ast_node_t *stmt, symbol_t *symtab, 1431 | uint32_t *block_id, uint32_t *continues, uint32_t *breaks 1432 | ) 1433 | { 1434 | if (stmt->variant == vEMPTY || stmt->variant == vDECL) return; 1435 | if (stmt->variant == vEXPR) codegen_expr(stmt->children, symtab); 1436 | 1437 | if (stmt->variant == vBLOCK) { 1438 | char symtab_key[2] = { (char) *block_id, 0 }; 1439 | ++(*block_id); 1440 | symbol_t *child_symtab = symtab_get(symtab, symtab_key)->child; 1441 | ast_node_t *child = stmt->children; 1442 | uint32_t child_block_id = 0; 1443 | while (child != NULL) { 1444 | codegen_stmt(child, child_symtab, &child_block_id, continues, breaks); 1445 | child = child->next; 1446 | } 1447 | } 1448 | 1449 | if (stmt->variant == vRETURN) { 1450 | if (stmt->children != NULL) codegen_expr(stmt->children, symtab); 1451 | // leave 1452 | // retl 1453 | uint8_t tmp[2] = { 0xc9, 0xc3 }; 1454 | write_text(tmp, 2); 1455 | } 1456 | 1457 | if (stmt->variant == vCONTINUE || stmt->variant == vBREAK) { 1458 | 
if ( 1459 | (stmt->variant == vCONTINUE && continues == NULL) 1460 | || (stmt->variant == vBREAK && breaks == NULL) 1461 | ) { 1462 | printf("Invalid '%s'\n", stmt->variant == vCONTINUE ? "continue" : "break"); 1463 | exit(1); 1464 | } 1465 | 1466 | uint32_t *arr = stmt->variant == vCONTINUE ? continues : breaks; 1467 | uint32_t i = 0; 1468 | while (arr[i]) ++i; 1469 | arr[i] = text_loc; 1470 | text_loc += 5; 1471 | } 1472 | 1473 | if (stmt->variant == vIF) { 1474 | codegen_expr(stmt->children, symtab); 1475 | 1476 | // cmpl $0, %eax 1477 | // je else_start 1478 | // 1479 | // jmp else_end 1480 | // else_start: 1481 | // 1482 | // else_end: 1483 | 1484 | uint8_t tmp[3] = { 0x83, 0xf8, 0x00 }; 1485 | write_text(tmp, 3); 1486 | 1487 | uint32_t je_addr = text_loc; 1488 | text_loc += 6; 1489 | uint32_t if_start = text_loc; 1490 | codegen_stmt(stmt->children->next, symtab, block_id, continues, breaks); 1491 | 1492 | uint32_t jmp_addr = text_loc; 1493 | text_loc += 5; 1494 | 1495 | uint32_t else_start = text_loc; 1496 | codegen_stmt(stmt->children->next->next, symtab, block_id, continues, breaks); 1497 | uint32_t else_end = text_loc; 1498 | 1499 | uint32_t je_offset = else_start - if_start; 1500 | if ((je_offset + 4) < 256) { 1501 | uint8_t tmp1[6] = { 0x74, (je_offset + 4) & 0xff, 0x90, 0x90, 0x90, 0x90 }; 1502 | memcpy(text + je_addr, tmp1, 6); 1503 | } else { 1504 | uint8_t tmp1[6] = { 0x0f, 0x84, 0x00, 0x00, 0x00, 0x00 }; 1505 | for (uint32_t i = 2; i < 6; ++i) { 1506 | tmp1[i] = je_offset & 0xff; 1507 | je_offset >>= 8; 1508 | } 1509 | memcpy(text + je_addr, tmp1, 6); 1510 | } 1511 | 1512 | uint32_t jmp_offset = else_end - else_start; 1513 | if ((jmp_offset + 3) < 256) { 1514 | uint8_t tmp2[5] = { 0xeb, (jmp_offset + 3) & 0xff, 0x90, 0x90, 0x90 }; 1515 | memcpy(text + jmp_addr, tmp2, 5); 1516 | } else { 1517 | uint8_t tmp2[5] = { 0xe9, 0x00, 0x00, 0x00, 0x00 }; 1518 | for (uint32_t i = 1; i < 5; ++i) { 1519 | tmp2[i] = jmp_offset & 0xff; 1520 | jmp_offset >>= 8; 
1521 | } 1522 | memcpy(text + jmp_addr, tmp2, 5); 1523 | } 1524 | } 1525 | 1526 | if (stmt->variant == vWHILE) { 1527 | uint32_t cond_start = text_loc; 1528 | codegen_expr(stmt->children, symtab); 1529 | uint8_t tmp[3] = { 0x83, 0xf8, 0x00 }; 1530 | write_text(tmp, 3); 1531 | 1532 | uint32_t je_addr = text_loc; 1533 | text_loc += 6; 1534 | uint32_t while_start = text_loc; 1535 | 1536 | uint32_t cs[256]; memset(cs, 0, sizeof(cs)); 1537 | uint32_t bs[256]; memset(bs, 0, sizeof(bs)); 1538 | codegen_stmt(stmt->children->next, symtab, block_id, cs, bs); 1539 | 1540 | uint32_t jmp_addr = text_loc; 1541 | text_loc += 5; 1542 | uint32_t while_end = text_loc; 1543 | 1544 | uint32_t je_offset = while_end - while_start; 1545 | if ((je_offset + 4) < 256) { 1546 | uint8_t tmp1[6] = { 0x74, (je_offset + 4) & 0xff, 0x90, 0x90, 0x90, 0x90 }; 1547 | memcpy(text + je_addr, tmp1, 6); 1548 | } else { 1549 | uint8_t tmp1[6] = { 0x0f, 0x84, 0x00, 0x00, 0x00, 0x00 }; 1550 | for (uint32_t i = 2; i < 6; ++i) { 1551 | tmp1[i] = je_offset & 0xff; 1552 | je_offset >>= 8; 1553 | } 1554 | memcpy(text + je_addr, tmp1, 6); 1555 | } 1556 | 1557 | uint32_t jmp_offset = while_end - cond_start; 1558 | if (jmp_offset - 3 < 256) { 1559 | uint8_t tmp2[5] = { 0xeb, (256 - (jmp_offset - 3)) & 0xff, 0x90, 0x90, 0x90 }; 1560 | memcpy(text + jmp_addr, tmp2, 5); 1561 | } else { 1562 | uint8_t tmp2[5] = { 0xe9, 0x00, 0x00, 0x00, 0x00 }; 1563 | jmp_offset = -jmp_offset; 1564 | for (uint32_t i = 1; i < 5; ++i) { 1565 | tmp2[i] = jmp_offset & 0xff; 1566 | jmp_offset >>= 8; 1567 | } 1568 | memcpy(text + jmp_addr, tmp2, 5); 1569 | } 1570 | 1571 | // fill in continue statements 1572 | uint32_t ci = 0; 1573 | while (cs[ci]) { 1574 | uint32_t offset = cs[ci] - cond_start; 1575 | if (offset + 2 < 256) { 1576 | uint8_t tmp2[5] = { 0xeb, (256 - (offset + 2)) & 0xff, 0x90, 0x90, 0x90 }; 1577 | memcpy(text + cs[ci], tmp2, 5); 1578 | } else { 1579 | uint8_t tmp2[5] = { 0xe9, 0x00, 0x00, 0x00, 0x00 }; 1580 | offset = 
-(offset + 5); 1581 | for (uint32_t i = 1; i < 5; ++i) { 1582 | tmp2[i] = offset & 0xff; 1583 | offset >>= 8; 1584 | } 1585 | memcpy(text + cs[ci], tmp2, 5); 1586 | } 1587 | ++ci; 1588 | } 1589 | 1590 | // fill in break statements 1591 | uint32_t bi = 0; 1592 | while (bs[bi]) { 1593 | uint32_t offset = while_end - (bs[bi] + 5); 1594 | if ((offset + 3) < 256) { 1595 | uint8_t tmp2[5] = { 0xeb, (offset + 3) & 0xff, 0x90, 0x90, 0x90 }; 1596 | memcpy(text + bs[bi], tmp2, 5); 1597 | } else { 1598 | uint8_t tmp2[5] = { 0xe9, 0x00, 0x00, 0x00, 0x00 }; 1599 | for (uint32_t i = 1; i < 5; ++i) { 1600 | tmp2[i] = offset & 0xff; 1601 | offset >>= 8; 1602 | } 1603 | memcpy(text + bs[bi], tmp2, 5); 1604 | } 1605 | ++bi; 1606 | } 1607 | } 1608 | } 1609 | 1610 | void codegen(ast_node_t *ast) 1611 | { 1612 | ast_node_t *current = ast; 1613 | while (current != NULL) { 1614 | uint32_t size = construct_symtab(current, root_symtab, NULL, 0, 0); 1615 | 1616 | if (current->type == nSTMT) { 1617 | if (data_loc + size > DATA_CAP) { 1618 | printf("Too much data.\n"); 1619 | exit(1); 1620 | } 1621 | data_loc += size; 1622 | current = current->next; 1623 | continue; 1624 | } 1625 | 1626 | ast_node_t *current_child = current->children; 1627 | while (current_child->type != nSTMT) current_child = current_child->next; 1628 | if (current_child->variant != vBLOCK) { 1629 | current = current->next; continue; 1630 | } 1631 | 1632 | // function preamble: 1633 | // pushl %ebp 1634 | // movl %esp, %ebp 1635 | uint8_t tmp[3] = { 0x55, 0x89, 0xe5 }; 1636 | write_text(tmp, 3); 1637 | 1638 | // subl , %esp 1639 | tmp[0] = 0x81; tmp[1] = 0xec; 1640 | write_text(tmp, 2); 1641 | for (uint32_t i = 0; i < 4; ++i) { 1642 | uint8_t tmp0 = size; 1643 | write_text(&tmp0, 1); 1644 | size >>= 8; 1645 | } 1646 | 1647 | symbol_t *symtab = symtab_get(root_symtab, current->s)->child; 1648 | uint32_t block_id = 0; 1649 | codegen_stmt(current_child, symtab, &block_id, NULL, NULL); 1650 | 1651 | // function epilogue: 1652 | 
// leave 1653 | // retl 1654 | tmp[0] = 0xc9; tmp[1] = 0xc3; 1655 | write_text(tmp, 2); 1656 | 1657 | current = current->next; 1658 | } 1659 | } 1660 | 1661 | void relocate() 1662 | { 1663 | relocation_t *current = relocs; 1664 | while (current != NULL) { 1665 | symbol_t *sym = symtab_get(current->symtab, current->name); 1666 | if (sym == NULL || sym->loc == (uint32_t) -1) { 1667 | printf("Undefined symbol %s\n", current->name); 1668 | exit(1); 1669 | } 1670 | 1671 | uint32_t addr = DATA_START + sym->loc; 1672 | if (sym->loc_type == lTEXT) addr = TEXT_START + sym->loc; 1673 | 1674 | if (current->type == rMOV_EAX) { 1675 | uint8_t tmp[5]; 1676 | tmp[0] = 0xb8; 1677 | for (uint32_t i = 1; i < 5; ++i) { 1678 | tmp[i] = addr & 0xff; 1679 | addr >>= 8; 1680 | } 1681 | memcpy(text + current->addr, tmp, sizeof(tmp)); 1682 | } 1683 | 1684 | if (current->type == rOFFSET || current->type == rIMM) { 1685 | uint32_t offset = addr; 1686 | if (current->type == rOFFSET) 1687 | offset = addr - (current->addr + TEXT_START) - 4; 1688 | uint8_t tmp[4]; 1689 | for (uint32_t i = 0; i < 4; ++i) { 1690 | tmp[i] = offset & 0xff; 1691 | offset >>= 8; 1692 | } 1693 | memcpy(text + current->addr, tmp, sizeof(tmp)); 1694 | } 1695 | 1696 | current = current->next; 1697 | } 1698 | } 1699 | 1700 | void write_elf(FILE *out) 1701 | { 1702 | uint32_t entry = TEXT_START; 1703 | symbol_t *start = symtab_get(root_symtab, "_start"); 1704 | if (start != NULL) entry = TEXT_START + start->loc; 1705 | else printf("Cannot find entry symbol _start; defaulting to %#x\n", entry); 1706 | 1707 | Elf32_Header ehdr; 1708 | memset(&ehdr, 0, sizeof(ehdr)); 1709 | ehdr.e_ident[0] = ELFMAG0; ehdr.e_ident[1] = ELFMAG1; 1710 | ehdr.e_ident[2] = ELFMAG2; ehdr.e_ident[3] = ELFMAG3; 1711 | ehdr.e_ident[4] = 1; ehdr.e_ident[5] = 1; ehdr.e_ident[6] = 1; 1712 | ehdr.e_type = ET_EXEC; 1713 | ehdr.e_machine = 3; 1714 | ehdr.e_version = 1; 1715 | ehdr.e_phnum = 2; 1716 | ehdr.e_phentsize = sizeof(Elf32_Phdr); 1717 | ehdr.e_phoff 
= sizeof(ehdr); 1718 | ehdr.e_ehsize = sizeof(ehdr); 1719 | ehdr.e_entry = entry; 1720 | 1721 | Elf32_Phdr text_hdr; 1722 | memset(&text_hdr, 0, sizeof(text_hdr)); 1723 | text_hdr.p_type = PT_LOAD; 1724 | text_hdr.p_offset = sizeof(ehdr) + (2*sizeof(Elf32_Phdr)) + data_loc; 1725 | text_hdr.p_vaddr = TEXT_START; 1726 | text_hdr.p_memsz = text_loc; 1727 | text_hdr.p_filesz = text_loc; 1728 | text_hdr.p_flags |= PF_X; 1729 | 1730 | Elf32_Phdr data_hdr; 1731 | memset(&data_hdr, 0, sizeof(data_hdr)); 1732 | data_hdr.p_type = PT_LOAD; 1733 | data_hdr.p_offset = sizeof(ehdr) + (2*sizeof(Elf32_Phdr)); 1734 | data_hdr.p_vaddr = DATA_START; 1735 | data_hdr.p_memsz = data_loc; 1736 | data_hdr.p_filesz = data_loc; 1737 | 1738 | fwrite(&ehdr, sizeof(ehdr), 1, out); 1739 | fwrite(&data_hdr, sizeof(data_hdr), 1, out); 1740 | fwrite(&text_hdr, sizeof(text_hdr), 1, out); 1741 | fwrite(data, data_loc, 1, out); 1742 | fwrite(text, text_loc, 1, out); 1743 | 1744 | #ifdef NANOC_DEBUG 1745 | printf("text offset for objdumping: %#x\n", text_hdr.p_offset); 1746 | #endif 1747 | } 1748 | 1749 | struct archive_header_s { 1750 | char ident[16]; 1751 | char mod_time[12]; 1752 | char owner_id[6]; 1753 | char group_id[6]; 1754 | char file_mode[8]; 1755 | char file_len[10]; 1756 | char end[2]; 1757 | } __attribute__((packed)); 1758 | typedef struct archive_header_s archive_header_t; 1759 | 1760 | void read_elf(uint8_t *buffer, uint32_t len) 1761 | { 1762 | // TODO data section stuff 1763 | char elfmag[7] = { ELFMAG0, ELFMAG1, ELFMAG2, ELFMAG3, 1, 1, 1 }; 1764 | if (len < 5 || strncmp((char *)buffer, elfmag, 7) != 0) 1765 | return; 1766 | 1767 | Elf32_Header *hdr = (Elf32_Header *)buffer; 1768 | Elf32_Shdr *shstrtab_hdr = 1769 | (Elf32_Shdr *)(buffer + hdr->e_shoff + (hdr->e_shstrndx * hdr->e_shentsize)); 1770 | char *shstrtab = (char *)(buffer + shstrtab_hdr->sh_offset); 1771 | 1772 | Elf32_Shdr *symtab_hdr = NULL, *rel_text_hdr = NULL, *strtab_hdr = NULL; 1773 | Elf32_Shdr *text_hdr = NULL, 
*data_hdr = NULL, *bss_hdr = NULL, *rodata_hdr = NULL; 1774 | uint32_t text_idx = 0, data_idx = 0, bss_idx = 0, rodata_idx = 0, unused_idx = 0; 1775 | 1776 | for (uint32_t i = 0; i < hdr->e_shnum; ++i) { 1777 | Elf32_Shdr *current = 1778 | (Elf32_Shdr *)(buffer + hdr->e_shoff + (hdr->e_shentsize * i)); 1779 | 1780 | #define FIND_SECTION(name, hdr, idx) \ 1781 | if (strcmp(shstrtab + current->sh_name, (name)) == 0) { \ 1782 | (idx) = i; \ 1783 | (hdr) = current; continue; \ 1784 | } 1785 | 1786 | FIND_SECTION(".symtab", symtab_hdr, unused_idx); 1787 | FIND_SECTION(".rel.text", rel_text_hdr, unused_idx); 1788 | FIND_SECTION(".strtab", strtab_hdr, unused_idx); 1789 | FIND_SECTION(".text", text_hdr, text_idx); 1790 | FIND_SECTION(".data", data_hdr, data_idx); 1791 | FIND_SECTION(".bss", bss_hdr, bss_idx); 1792 | FIND_SECTION(".rodata", rodata_hdr, rodata_idx); 1793 | } 1794 | 1795 | uint32_t text_offset = text_loc; 1796 | write_text(buffer + text_hdr->sh_offset, text_hdr->sh_size); 1797 | 1798 | uint32_t data_offset = data_loc; 1799 | if (data_hdr != NULL) 1800 | write_data(buffer + data_hdr->sh_offset, data_hdr->sh_size); 1801 | uint32_t rodata_offset = data_loc; 1802 | if (rodata_hdr != NULL) 1803 | write_data(buffer + rodata_hdr->sh_offset, rodata_hdr->sh_size); 1804 | uint32_t bss_offset = data_loc; 1805 | if (bss_hdr != NULL) { 1806 | if (data_loc + bss_hdr->sh_size > DATA_CAP) { 1807 | printf("Too much data.\n"); 1808 | exit(1); 1809 | } 1810 | memset(data + data_loc, 0, bss_hdr->sh_size); 1811 | data_loc += bss_hdr->sh_size; 1812 | } 1813 | 1814 | char *strtab = (char *)(buffer + strtab_hdr->sh_offset); 1815 | Elf32_Sym *symtab = (Elf32_Sym *)(buffer + symtab_hdr->sh_offset); 1816 | Elf32_Sym *current = symtab; 1817 | while ((uintptr_t)current - (uintptr_t)symtab < symtab_hdr->sh_size) { 1818 | int32_t offset = -1; 1819 | loc_type_t loc_type = lDATA; 1820 | if (current->st_shndx == text_idx && text_hdr != NULL) { 1821 | offset = text_offset; loc_type = lTEXT; 
        }
        if (current->st_shndx == data_idx && data_hdr != NULL) offset = data_offset;
        if (current->st_shndx == rodata_idx && rodata_hdr != NULL) offset = rodata_offset;
        if (current->st_shndx == bss_idx && bss_hdr != NULL) offset = bss_offset;
        if (offset == -1) { ++current; continue; }
        char *name = strtab + current->st_name;
        symbol_t sym; memset(&sym, 0, sizeof(sym));
        sym.name = name;
        sym.type = tINT;
        sym.loc = offset + current->st_value;
        sym.loc_type = loc_type;
        symtab_insert(root_symtab, sym);
        ++current;
    }

    Elf32_Rel *rel = (Elf32_Rel *)(buffer + rel_text_hdr->sh_offset);
    Elf32_Rel *current_rel = rel;
    while ((uintptr_t)current_rel - (uintptr_t)rel < rel_text_hdr->sh_size) {
        Elf32_Sym *sym = symtab + ELF32_R_SYM(current_rel->r_info);
        relocation_type_t type = rOFFSET;
        if (ELF32_R_TYPE(current_rel->r_info) == R_386_32) type = rIMM;
        add_relocation(
            text_offset + current_rel->r_offset,
            strtab + sym->st_name,
            root_symtab, type
        );
        ++current_rel;
    }
}

void read_archive(char *name)
{
    FILE *archive = fopen(name, "r");
    fseek(archive, 0, SEEK_END);
    uint32_t len = ftell(archive);
    fseek(archive, 0, SEEK_SET);
    uint8_t *buffer = malloc(len);
    fread(buffer, 1, len, archive);

    // Every ar archive starts with the 8-byte global magic "!<arch>\n".
    if (len < 8 || strncmp((char *)buffer, "!<arch>\n", 8) != 0)
        return;

    uint32_t idx = 8;
    while (idx < len) {
        archive_header_t *header = (archive_header_t *)(buffer + idx);
        uint32_t file_len = atoi(header->file_len);
        read_elf(buffer + idx + sizeof(archive_header_t), file_len);
        idx += file_len + sizeof(archive_header_t);
        // Members start on even offsets; skip the pad byte after an odd-sized one.
        idx += idx & 1;
    }
}

int main(int argc, char *argv[])
{
    if (argc < 2) {
        printf("Usage: nanoc []\n");
        return 1;
    }

    char
        *filename = argv[1];
    input = fopen(filename, "r");

    root_symtab = malloc(sizeof(symbol_t) * SYMTAB_SIZE);
    memset(root_symtab, 0, sizeof(symbol_t) * SYMTAB_SIZE);

    ast_node_t *root = parse();
    codegen(root);

    if (argc > 2) read_archive(argv[2]);

    relocate();

    // temporary thing to have a non-empty data section
    if (data_loc == 0) {
        memcpy(data, "asdf", 4);
        data_loc += 4;
    }

    FILE *out = fopen("a.out", "w");
    write_elf(out);

    return 0;
}

--------------------------------------------------------------------------------
/od.sh:
--------------------------------------------------------------------------------
#!/bin/sh
i386-elf-objdump -b binary -m i386 --start-address=$1 -D a.out

--------------------------------------------------------------------------------
/test/archive.c:
--------------------------------------------------------------------------------

extern int foo;

void _start(char *c);

int main(int a, int b, int c)
{
    foo = 1;
    _start("hello");
    return (int)_start;
}

--------------------------------------------------------------------------------
/test/hello.c:
--------------------------------------------------------------------------------

int main(int a, int b, int c);

int foo;

void _start()
{
    main(1, 2, 3);

    return 1;
}

--------------------------------------------------------------------------------
/test/test2.c:
--------------------------------------------------------------------------------

void _start()
{
    char *argv;
    argv = 10;
    *argv = 1;
}

--------------------------------------------------------------------------------