└── README.md


/README.md:
--------------------------------------------------------------------------------
# BUILDING A PASCAL COMPILER WITH C, YACC & LEX

- [ 1. Compilation stages and parts of a compiler](#1-compilation-stages-and-parts-of-a-compiler)
- [ 2. Grammar, production, alphabet, language](#2-grammar-production-alphabet-language)
- [ 3. Classification of grammars](#3-classification-of-grammars)
- [ 4. Equivalence and ambiguity in grammars](#4-equivalence-and-ambiguity-in-grammars)
- [ 5. Attributes and extended grammar](#5-attributes-and-extended-grammar)
- [ 6. Using Lex and YACC](#6-using-lex-and-yacc)
- [ 7. Lex file format](#7-lex-file-format)
- [ 8. Lex regular expressions and operators](#8-lex-regular-expressions-and-operators)
- [ 9. Lex definitions](#9-lex-definitions)
- [10. YACC file format](#10-yacc-file-format)
- [11. YACC definitions](#11-yacc-definitions)
- [12. Calculator example](#12-calculator-example)
- [13. Local variable storage and nested block scope](#13-local-variable-storage-and-nested-block-scope)
- [14. Symbol table](#14-symbol-table)
- [15. P-code](#15-p-code)
- [16. P-code in detail](#16-p-code-in-detail)
- [17. P-code templates for basic Pascal structures](#17-p-code-templates-for-basic-pascal-structures)
- [18. Data structures used in the compiler](#18-data-structures-used-in-the-compiler)
- [19. Types and functions used in the compiler](#19-types-and-functions-used-in-the-compiler)

I don't have much left from the Pascal compiler I built in 1997 when I was a university student.
I remember that I wrote some 10-15 thousand lines of C code at the time, but in the early 2000s the code was lost when its floppy disk died.
Fortunately, I had an article on fortunecity.com from 1998 about this compiler, and I was able to find a copy of my article on archive.org.
I reformatted the article and put it here on GitHub.
The techniques I was using at the time still apply today, as nothing much has changed in the fundamentals of compiler design since then.

## 1. Compilation stages and parts of a compiler:
The compilation of a source program consists of the following stages:

1. Lexer/scanner (lexical analysis, produces the token table)
2. Parser (syntax analysis & semantic analysis, generates the symbol table and uses it while checking for semantic errors or generating code)
3. Code generator (code generation, generates machine code or intermediate code)
4. Code optimizer (code optimization, generates machine code optimized for speed and memory usage)

- The purpose of lexical analysis is to recognize the words (tokens) that make up the language and pass them to the parser for syntax analysis.
- The parser assigns syntactic meaning to the words (tokens) with the help of the reserved words and special symbols of the language.
  This is similar to assigning syntactic roles such as subject, object, and verb to the words in a sentence.
- As words (tokens) are recognized, they are registered in the symbol table together with some additional information (variable types, addresses, sizes, etc.).
  When the same words (tokens) are encountered again, the information stored in the symbol table is used to check for semantic errors.
- The last step is to produce machine code that matches the meaning of these words in the sentence.
  If necessary, the generated code can be optimized to reduce memory usage or increase speed.
- In order to make more efficient optimizations and/or to generate code for different machines,
  a more convenient generalized intermediate code is generated first,
  then this intermediate code is optimized and converted to real machine code for the desired machine.
- Some compilers produce fast interpretable code for a virtual machine (JVM, CLR, etc.), so that the same program can run on different machines by interpreting this code.

## 2. Grammar, production, alphabet, language:
The grammar of a language (G) can be demonstrated with the simple example shown below:
```
sentence -> subject object verb
subject -> I | name
object -> the book | home
verb -> went | took
```
The structure of the language is determined by the production rules.

- Here the "|" symbol means "or".
  Each of these rules, which derives the right-hand symbols from the left-hand symbol, is called a production.
  The set of production rules is denoted by P.
- Symbols that are not found on the left side of any production are called terminal symbols.
  In the above example, they are the symbols I, name, the book, home, went, and took.
- The set of terminal symbols is denoted by T.
  For example, the sentence "Michael took the book" is passed from the scanner to the parser in the following manner: name(Michael), took, the book.
  The value in parentheses, that is "Michael", is the attribute of the terminal symbol "name".
- These attributes, which are passed to the parser together with the terminal symbols,
  are processed by the parser when the productions are applied.
- In this example, the parser registers the information that "Michael" is a name in the symbol table for future use.
  Then it applies productions to the symbols to obtain the sentence symbol.
- Symbols other than terminal symbols are called non-terminal symbols.
  In this example, the non-terminal symbols are "sentence", "subject", "object", and "verb".
  The set of non-terminal symbols is denoted by N.
- The special symbol from which all other symbols are derived is called the sentential symbol (start symbol) and is denoted by S.
  In this example, the sentential symbol is "sentence".
- The quadruple S, P, N, T is called a grammar and is denoted by G = {S, P, N, T}.
- V=N∪T (N union T) is called an alphabet.
- An ordered group of terminal symbols that results from successively applying productions of P, starting from the S symbol, is called a sentence.
- The whole set of sentences that can be derived from a grammar is called a language and is denoted by L(G).
- The special terminal symbol EPSILON corresponds to empty or nothing (not to be confused with the space character).

## 3. Classification of grammars:
According to the Chomsky classification, grammars are grouped into four classes, from the most general to the most specific, as follows:

1. Unrestricted (free) grammars:
   Productions are of the form: u->v. Here u and v are strings of symbols from V, and u≠EPSILON.

2. Context-sensitive grammars:
   Productions are of the form: uxw->uvw. Here u, v, and w are strings of symbols from V, v≠EPSILON and x∈N.
   That is, the non-terminal symbol x can be replaced by the symbol group v only if it is surrounded by the symbol groups u and w.

3. Context-free grammars:
   Productions are of the form: x->v. Here v is a string of symbols from V, and x∈N. Programming languages are usually described by such grammars.
   Sentences of such languages can always be parsed.

4. Finite (regular) grammars:
   Productions are of the form: x->a or x->ay. Here x,y∈N and a∈T.
   Such grammars are used to describe the tokens (words) of the language, that is, the grammar used by the scanner during lexical (token/word) analysis.
   Finite grammars can be parsed efficiently by finite state machines.

The reason why lexical analysis and syntax analysis are done separately is that they belong to different grammar classes.
Lexical analysis can be done using much more efficient algorithms, and sometimes using plain regex libraries.

## 4. Equivalence and ambiguity in grammars:
If the sets L(G) and L(H) are equal, the grammars G and H are called equivalent grammars.

If the same sentence can be derived in more than one way (that is, it has more than one parse tree),
the grammar is called an ambiguous grammar.
The parser must be able to recognize every sentence unambiguously,
otherwise the result depends on which derivation the parser happens to choose.
We frequently encounter ambiguous grammars in arithmetic operations,
in which case the ambiguity can be resolved by defining left or right associativity and precedence rules for the operators.

For example, in the following grammar:
```
s -> aa
a -> x | xx
```
The sentence "xxx" can be parsed in two different ways:

    s              s
   / \     or     / \
  a   a          a   a
 /   / \        / \   \
x   x   x      x   x   x
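
The two trees above can also be found mechanically. The following is a minimal C sketch, not code from the original compiler, that counts the parse trees of a sentence under the grammar `s -> aa`, `a -> x | xx` by trying every split point between the two `a` symbols. The function names `matches_a` and `count_parse_trees` are made up for this illustration.

```c
#include <string.h>

/* Returns 1 if the first n characters of text can be derived from the
   non-terminal a (a -> x | xx), otherwise 0. */
static int matches_a(const char *text, size_t n)
{
    return (n == 1 && text[0] == 'x') ||
           (n == 2 && text[0] == 'x' && text[1] == 'x');
}

/* Counts the distinct parse trees of a sentence under s -> a a by
   trying every position where the first a could end. */
int count_parse_trees(const char *sentence)
{
    size_t len = strlen(sentence);
    int count = 0;

    for (size_t split = 1; split < len; split++) {
        if (matches_a(sentence, split) &&
            matches_a(sentence + split, len - split))
            count++;
    }
    return count;
}
```

A sentence is ambiguous in this grammar when `count_parse_trees()` returns more than 1, which is the case for "xxx" (split after the first or after the second x).
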

One of these parse trees should be preferred by the parser.

## 5. Attributes and extended grammar:
Each symbol has its own attributes.
A grammar whose productions are augmented with rules for computing these attributes is called an extended grammar.
The attribute of the symbol on the left is generated
by processing the attributes of the symbols on the right:

    x -> y1 y2 y3 ... yn and x.a = f(y1.a, y2.a, y3.a, ..., yn.a)

Here, x.a means the attribute of x, and it is synthesized from the attributes of y1, y2, ..., yn.
This is called a synthesized attribute.

## 6. Using Lex and YACC:
Lex and YACC are programs available on UNIX systems that provide great convenience in building compilers.
Lex generates a scanner that can recognize a given finite (regular) grammar.
Similarly, YACC generates a parser that accepts an extended LALR grammar and performs the actions defined in the production rules, using the parsed symbols and their attributes.
Both generate the scanner and the parser programs in the C language.

## 7. Lex file format:
```
%{
C expressions (optional)
%}
lex definitions (optional)
%%
lex regular-expressions and operations
%% (optional)
C statements (user functions)
```

## 8. Lex regular expressions and operators:
The regular expressions used in Lex consist of lex operators and the characters that are to be recognized by the scanner.
```
.      single character other than \n          .a             any character other than \n, followed by a
*      zero or more repetitions                a[a-z]*        lower case words beginning with a (including the word "a")
+      one or more repetitions                 a[a-z]+        lower case words beginning with a (excluding the word "a")
?      optional character                      XY?Z           XZ or XYZ
|      or (either left-hand or right-hand)     a|b            a or b
xy     concatenation (x and y concatenated)    abc            concatenated a, b and c ("abc")
( )    grouping                                (UNIX)|(Unix)  UNIX or Unix
" "    cancels the meaning of operators        c"++"          the word "c++"
\      1. cancels the meaning of an operator   c\+\+          the word "c++"
       2. C escape character                   \n             end of line character
       3. octal representation                 \176           ~ character (octal 176 = decimal 126 in ASCII)
[ ]    character group or range                [A-Z]          an upper case letter (from A to Z)
[^ ]   exclusion of characters                 [^XYZ]         a character other than X, Y or Z
^      beginning of line                       ^(abc)         the word "abc" at the beginning of a line
$      end of line                             a$             "a" at the end of a line
{m,n}  minimum m, maximum n repetitions        a{1,5}         1 to 5 "a" characters (a, aa, aaa, aaaa, aaaaa)
```

## 9. Lex Definitions:
A Lex definition consists of a name and a regular-expression pair. The name on the left can be used in other expressions, written in braces as `{name}`, in place of the corresponding regular expression on the right.

Sample:
```
integer [0-9]+
%%
{integer}              {sscanf(yytext, "%d", &yylval.ival); return INTEGER;}
{integer}\.{integer}   {sscanf(yytext, "%f", &yylval.fval); return REAL;}
```
In this example, the statements inside `{}` after each pattern are C statements.
`yytext` is the generated string variable, which stores the recognized word (an integer such as 15, 6349, or a real number such as 0.359, 1368.4).
`yylex()` is the generated function, which is called by the parser, and returns the recognized token from the scanner to the parser.
In addition, the attributes of the token are passed through the global variable `yylval`.
Since the statements in `{}` form a C block, it is also possible to define local C variables there.
The type of `yylval` is defined in the parser generated by YACC, as YYSTYPE.

## 10. YACC file format:
```
%{
C expressions (optional)
%}
YACC definitions
%%
YACC production rules and operations
%%
C expressions (user functions and the main() function)
```

## 11. YACC definitions:
`%union{}`: defines `yylval` as a C `union` type. The fields that belong to the union type are placed inside `%union{}`. If we don't want a union type, we should place a definition such as `#define YYSTYPE type` inside `%{ %}`.

`%token`: terminal symbols should be defined as `%token symbol`. If `%union{}` is used, the union field that holds the attribute of the token should be specified as `%token <field> symbol`.

`%left`, `%right`: if an operator is left or right associative, we can define it using `%left operator` or `%right operator`.
These are used to eliminate ambiguity in the grammar.

`%nonassoc`: indicates that the operator is neither left nor right associative.

`%type`: this is similar to the typed form of `%token`, but is used for non-terminal symbols (`%type <field> symbol`).

The precedence of an operator depends on the row in which it is declared with `%left`, `%right`, or `%nonassoc`.
Operators declared in lower rows have higher precedence.

## 12. Calculator example:
```
%{
#define YYSTYPE int /* type of yylval and the internal YACC stack */
%}
%token INTEGER
%left '+' '-'
%left '*' '/'
%right '^'
%left NEGATIVE
%%
exprs : exprs '+' exprs           {$$=$1+$3;}
      | exprs '-' exprs           {$$=$1-$3;}
      | exprs '*' exprs           {$$=$1*$3;}
      | exprs '/' exprs           {$$=$1/$3;}
      | exprs '^' exprs           {$$=power($1,$3);}
      | '-' exprs %prec NEGATIVE  {$$=-$2;}
      | INTEGER                   {$$=$1;}
      ;
%%
...
```
In this grammar, the expression `1 + 2 * 3 * 4 ^ 5` is parsed bottom-up in the following order:

     exprs        
     / / \        
    / /  exprs    
   / /   / | \    
  / /exprs | exprs
 / /  /|\  |  /|\ 
1 +  2 * 3 * 4 ^ 5
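
To make the evaluation order of this tree concrete, here is a small C sketch that builds the same tree by hand and evaluates it in post-order. It is only an illustration: the parser generated by YACC does not build such a tree, and the `struct expr` type and the functions `eval()` and `eval_sample()` are hypothetical names introduced here. `power()` is assumed to be a plain integer power function, since the original example does not show its definition.

```c
#include <stddef.h>

/* Hypothetical tree node: a leaf holds an INTEGER value, an inner node
   holds an operator and its two sub-expressions. */
struct expr {
    char op;                        /* '+', '-', '*', '/', '^', or 0 for a leaf */
    int value;                      /* used only when op == 0 */
    const struct expr *left, *right;
};

/* Integer power by repeated multiplication (exponent assumed >= 0). */
static int power(int base, int exponent)
{
    int result = 1;
    while (exponent-- > 0)
        result *= base;
    return result;
}

/* Post-order traversal: both children are evaluated before the node
   itself, like the actions attached to the grammar rules. */
int eval(const struct expr *e)
{
    if (e->op == 0)
        return e->value;

    int l = eval(e->left);
    int r = eval(e->right);

    switch (e->op) {
    case '+': return l + r;
    case '-': return l - r;
    case '*': return l * r;
    case '/': return l / r;         /* no check for division by zero */
    case '^': return power(l, r);
    default:  return 0;             /* unknown operator */
    }
}

/* Builds the tree drawn above for 1 + 2 * 3 * 4 ^ 5 and evaluates it. */
int eval_sample(void)
{
    struct expr n1 = {0, 1, NULL, NULL}, n2 = {0, 2, NULL, NULL};
    struct expr n3 = {0, 3, NULL, NULL}, n4 = {0, 4, NULL, NULL};
    struct expr n5 = {0, 5, NULL, NULL};
    struct expr mul1 = {'*', 0, &n2, &n3};      /* 2 * 3             */
    struct expr pow1 = {'^', 0, &n4, &n5};      /* 4 ^ 5             */
    struct expr mul2 = {'*', 0, &mul1, &pow1};  /* (2 * 3) * (4 ^ 5) */
    struct expr add1 = {'+', 0, &n1, &mul2};    /* 1 + ...           */

    return eval(&add1);
}
```

With these precedence and associativity rules, the expression groups as `1 + ((2 * 3) * (4 ^ 5))`, so `eval_sample()` yields 6145.
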

- Bottom-up derivation/production follows the post-order traversal of this tree structure.
  In other words, for each node, all its sub-branches are traversed from left to right, and finally, the node itself is visited.
- For each production rule that is applied, the C statements (C blocks) inside { } next to the rule are executed.
  Here, the pseudo-variables $n correspond to the attributes of the symbols.
  The attribute of the non-terminal symbol on the left is indicated by $$,
  and the attribute of the k-th symbol on the right is represented by $k.
  Inside {}, the result obtained by processing $1, $2, ..., $n is assigned to $$.
- The attributes move upward together with the symbols on the stack used by the parser; there is no real tree data structure in the parser.
- The symbol on the left of the top rule, or another symbol defined by %start, is the sentential symbol.
  Starting from the terminal symbols, each production is applied from its right-hand side to its left-hand side, working from the bottom of the tree up to the sentential symbol.
- Parsing/compilation ends when the sentential symbol is reached.
- A special symbol, the error symbol, is used in productions to match syntax errors encountered during parsing/compilation. This allows the parser to go on parsing the next statements and helps gather more information about the errors in the source program.

## 13. Local variable storage and nested block scope:
- The Pascal source code is compiled to a form of intermediate code (p-code) and executed on a virtual stack machine called the p-machine.
- The p-machine is a virtual machine that processes all its commands using its runtime stack. The commands are called p-code.
- This virtual machine contains only a stack and two registers that address the stack.
  All data, including pointer variables, is kept on the stack.
  Results of arithmetic and logic operations are also placed on the stack.
- Since Pascal is a block-structured language (unlike C, BASIC, or FORTRAN), nested blocks (nested functions and procedures) can be defined.
- A symbol can be accessed from the block in which it is defined and from all the blocks nested within it, and is inaccessible from the outer blocks.
  The storage duration, or lifetime, of a local symbol is the same as the lifetime of its containing block.
- Functions and procedures can call themselves recursively. Temporary variables can also be defined.
  All of this can be achieved by using the runtime stack of the p-machine.

During the execution of a p-code program, the runtime stack frame is as follows:

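| Stack contents | Register |
| --- | --- |
| variableN | <- T |
| ... | |
| variable2 | |
| variable1 | |
| return address | |
| dynamic link | |
| static link | <- B |
| parameterM | |
| ... | |
| parameter2 | |
| parameter1 | |
| return value | |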
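
As a rough illustration of how this frame layout can be used, the C sketch below resolves an address of the form "level offset L, displacement D" by following static links, and uses it for a load and a store. It assumes what the frame above suggests: the slot at B holds the static link, local variables start at B+3, parameters sit at negative offsets, and the stack grows upward. The `struct pmachine` type and the function names are hypothetical, not taken from the original compiler, and no bounds checking is done.

```c
#define STACK_SIZE 64

/* Hypothetical p-machine state: one data stack and the two registers. */
struct pmachine {
    int stack[STACK_SIZE];
    int b;   /* base of the current frame (the slot holding its static link) */
    int t;   /* top of the stack */
};

/* Follows the static link chain `level` times, starting at the current
   frame, and returns the base address of the enclosing block reached. */
int base_of(const struct pmachine *m, int level)
{
    int base = m->b;
    while (level-- > 0)
        base = m->stack[base];      /* slot B+0 holds the static link */
    return base;
}

/* Like LOD L,D: push the value stored at (base of level L) + D. */
void load(struct pmachine *m, int level, int offset)
{
    int value = m->stack[base_of(m, level) + offset];
    m->stack[++m->t] = value;
}

/* Like STO L,D: pop a value and store it at (base of level L) + D. */
void store(struct pmachine *m, int level, int offset)
{
    int value = m->stack[m->t--];
    m->stack[base_of(m, level) + offset] = value;
}
```

With level 0 the current frame is addressed, with level 1 the lexically enclosing block, and so on, regardless of how many frames were pushed in between by recursive calls.
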

- Here, B and T are the two registers of the p-machine on which the compiled intermediate code (p-code) is executed.
  T is the pointer which always points to the top of the stack, and B is the base pointer we use to address the stack.
- All addresses are specified by adding a positive or negative offset to the base address that B holds.
  parameter1, ..., parameterM are the parameters of the current function/procedure, variable1, ..., variableN are the local variables defined in the current function/procedure, and return value is the area reserved for the return value of the current function.
- The static link is the base address of the lexically enclosing block (function/procedure) that contains the current block, and is used for accessing the variables of that enclosing block.
- The dynamic link is the base address of the calling function/procedure's stack frame, used for switching back to the calling function/procedure's local scope (local variables, parameters, etc.).
- The return address is the address of the command to be executed after the current function/procedure ends or returns.

## 14. Symbol table:
- The symbols defined in a block can only be seen in this block and within the blocks contained in this block. The above runtime stack structure provides this behavior during the execution of the program.
- To ensure that block scoping rules apply, the symbol table used in the compilation phase is also a stack structure. This way, symbols (variables, functions, procedures, user-defined types) that are in different blocks but have the same name are not mixed up with each other.
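
A stack-structured symbol table of this kind can be sketched in C as follows. This is a simplified illustration under my own assumptions, not the original implementation: the names `push_block`, `pop_block`, `declare`, and `lookup` are hypothetical, each symbol stores only a name and an address, and names are kept by pointer, so the strings must outlive the table.

```c
#include <stdlib.h>
#include <string.h>

struct symbol {
    const char *name;
    int address;
    struct symbol *next;
};

struct block {
    struct symbol *symbols;   /* symbols declared in this block */
    struct block *outer;      /* enclosing block, NULL for the outermost */
};

static struct block *top = NULL;   /* top of the symbol table stack */

/* Opens a new innermost block. */
void push_block(void)
{
    struct block *b = malloc(sizeof *b);
    if (b == NULL)
        abort();
    b->symbols = NULL;
    b->outer = top;
    top = b;
}

/* Closes the innermost block and forgets all its symbols. */
void pop_block(void)
{
    struct block *b = top;
    if (b == NULL)
        return;
    while (b->symbols != NULL) {
        struct symbol *s = b->symbols;
        b->symbols = s->next;
        free(s);
    }
    top = b->outer;
    free(b);
}

/* Declares a symbol in the innermost block (a block must be open). */
void declare(const char *name, int address)
{
    struct symbol *s = malloc(sizeof *s);
    if (s == NULL || top == NULL)
        abort();
    s->name = name;
    s->address = address;
    s->next = top->symbols;
    top->symbols = s;
}

/* Searches from the innermost block outward, so that an inner
   declaration hides an outer one with the same name. */
const struct symbol *lookup(const char *name)
{
    for (const struct block *b = top; b != NULL; b = b->outer)
        for (const struct symbol *s = b->symbols; s != NULL; s = s->next)
            if (strcmp(s->name, name) == 0)
                return s;
    return NULL;
}
```

Because `lookup()` only walks the chain from the current top outward, the symbols of a sibling block that has already been popped can never be found.
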

For example, if blocks B and C are located in block A, the state of the symbol stack is:

When compiling block A:
```
Symbols of block A
```

When compiling block B:
```
Symbols of block B
Symbols of block A
```

When compiling block C:
```
Symbols of block C
Symbols of block A
```

## 15. P-code:
```
p-code      opcode   what it does
LIT  0,N    00       push value N to the stack
OPR  0,N    01       execute arithmetic operation N on the stack
LOD  L,D    02       push a variable to the stack
LODX L,D    12       push an array variable to the stack
STO  L,D    03       store the popped value in a variable
STOX L,D    13       store the popped value in an array variable
CAL  L,A    04       call a procedure/function
INT  0,N    05       add a positive or negative constant to the T register
JMP  0,A    06       jump to address
JPC  C,A    07       jump to address conditionally
CSP  0,N    08       call a standard procedure/function
```
Note that the opcodes are in hexadecimal format.

## 16. P-code in detail:
```
p-code      what it does                                    algorithmic expression
LIT 0,N     push value N to the stack                       PUSH N
OPR 0,0     return from procedure/function                  return (exit current block)
OPR 0,1     negate                                          POP A, PUSH (-A)
OPR 0,2     add                                             POP A, POP B, PUSH (B+A)
OPR 0,3     subtract                                        POP A, POP B, PUSH (B-A)
OPR 0,4     multiply                                        POP A, POP B, PUSH (B*A)
OPR 0,5     divide                                          POP A, POP B, PUSH (B/A)
OPR 0,6     lowest bit                                      POP A, PUSH (A AND 1)
OPR 0,7     mod                                             POP A, POP B, PUSH (B MOD A)
OPR 0,8     is equal to                                     POP A, POP B, PUSH (B =? A)
OPR 0,9     is not equal to                                 POP A, POP B, PUSH (B ≠? A)
OPR 0,10    is less than                                    POP A, POP B, PUSH (B <? A)
OPR 0,11    is greater than or equal to                     POP A, POP B, PUSH (B >=? A)
OPR 0,12    is greater than                                 POP A, POP B, PUSH (B >? A)
OPR 0,13    is less than or equal to                        POP A, POP B, PUSH (B <=? A)
OPR 0,14    or                                              POP A, POP B, PUSH (B OR A)
OPR 0,15    and                                             POP A, POP B, PUSH (B AND A)
OPR 0,16    not                                             POP A, PUSH (NOT A)
OPR 0,17    shift left                                      POP A, POP B, PUSH (B shift left (A times))
OPR 0,18    shift right                                     POP A, POP B, PUSH (B shift right (A times))
OPR 0,19    increment                                       POP A, PUSH (A+1)
OPR 0,20    decrement                                       POP A, PUSH (A-1)
OPR 0,21    copy                                            POP A, PUSH A, PUSH A
LOD L,D     push value from address                         load A from (base of level offset L)+D, PUSH A
LOD 255,0   push value from address popped from the stack   POP address, load A with byte from address, PUSH A
LODX L,D    push value from indexed address                 POP index, load A from (base of level offset L)+D+index, PUSH A
STO L,D     store popped value at address                   POP A, store A at (base of level offset L)+D
STO 255,0   store popped value at popped address            POP A, POP address, store low byte of A at address
STOX L,D    store popped value at indexed address           POP index, POP A, store A at (base of level offset L)+D+index
CAL L,A     call procedure/function                         call procedure/function at p-code location A, with base at level offset L
CAL 255,0   call address popped from the stack              POP address, PUSH return address(=PC), jump to address
INT 0,N     add N to T                                      T=T+N
JMP 0,A     jump                                            jump to p-code at location A
JPC 0,A     jump if popped value is false                   POP A, if (A and 1) = 0 then jump to p-code at location A
CSP 0,0     input one character                             INPUT A, PUSH A
CSP 0,1     output one character                            POP A, OUTPUT A
CSP 0,2     input an integer                                INPUT A#, PUSH A
CSP 0,3     output an integer                               POP A, OUTPUT A#
CSP 0,8     output a character string                       POP A, FOR I:=1 to A DO BEGIN POP B; OUTPUT B; END
```
Note that true=1, false=0

POP X means to remove the top element of the stack and load it into X (the
stack size is now decreased by one element). In other words, copy the value pointed to by the T register into X, and decrement T.

PUSH X means to place the value of X onto the top of the stack (the stack size is
now increased by one element). In other words, increment the T register, and copy the value of X into the location pointed to by the T register.

## 17. P-code templates for basic Pascal structures:
```
Pascal expression                  equivalent p-code
x+10*y[5]                                  LOD  x
                                           LIT  10
                                           LIT  5
                                           LODX y
                                           OPR  *
                                           OPR  +

a:=expr;                                   (expr)
                                           STO  a

p^:=expr;                                  (expr)
                                           STO  255 p

if expr then stmt1 else stmt2;             (expr)
                                           JPC  0,label1
                                           (stmt1)
                                           JMP  label2
                                   label1: (stmt2)
                                   label2: ...

for i:=expr1 to expr2 do stmt;             (expr1)
                                           STO  I
                                           (expr2)
                                   label1: OPR  CPY
                                           LOD  I
                                           OPR  >=
                                           JPC  0,label2
                                           (stmt)
                                           LOD  I
                                           OPR  INC
                                           STO  I
                                           JMP  label1
                                   label2: INT  -1

while expr do stmt                 label1: (expr)
                                           JPC  0,label2
                                           (stmt)
                                           JMP  label1
                                   label2: ...

case expr of                               (expr)
  const1,const2: stmt1;                    OPR  CPY
  const3       : stmt2;                    LIT  const1
  else stmt3 end;                          OPR  =
                                           JPC  1,label1
                                           OPR  CPY
                                           LIT  const2
                                           OPR  =
                                           JPC  0,label2
                                   label1: (stmt1)
                                           JMP  label4
                                   label2: OPR  CPY
                                           LIT  const3
                                           OPR  =
                                           JPC  0,label3
                                           (stmt2)
                                           JMP  label4
                                   label3: (stmt3)
                                   label4: INT  -1

repeat stmt until expr;            label1: (stmt)
                                           (expr)
                                           JPC  0,label1

i:=func1(expr1,expr2);                     INT  1
                                           (expr1)
                                           (expr2)
                                           CAL  func1
                                           INT  -2
```
Note that (expr), (stmt1), etc. are compiled p-code blocks, and label1, label2, etc. are addresses of p-code instructions. The compiled p-code blocks and the program itself are stored in a linked list, with a single p-code instruction in each node of the linked list.
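
The linked-list representation described above, where a label is simply the address of a node, can be sketched with a very small interpreter loop in C. This is a simplified illustration under several assumptions of mine, not the original p-machine: only a handful of instructions are covered, everything runs in a single block with its base at address 0 (no CAL, no static or dynamic links), the opcode enum values are arbitrary, and `JPC C,A` is assumed to jump when the lowest bit of the popped value equals C. The names `struct pcode`, `link_codes()`, and `run()` are hypothetical.

```c
#include <stddef.h>

#define STACK_SIZE 256

/* A subset of the p-code instructions; the enum values are illustrative
   and are not the real opcodes. */
enum opcode { OP_LIT, OP_OPR, OP_LOD, OP_STO, OP_INT, OP_JMP, OP_JPC };

struct pcode {
    enum opcode op;
    int l;                  /* level offset; the condition C for JPC */
    int a;                  /* constant, operation number, or offset */
    struct pcode *target;   /* destination node for JMP and JPC */
    struct pcode *next;     /* next instruction in the list */
};

/* Appends the list `second` to the end of the list `first`. */
struct pcode *link_codes(struct pcode *first, struct pcode *second)
{
    struct pcode *tail = first;
    if (first == NULL)
        return second;
    while (tail->next != NULL)
        tail = tail->next;
    tail->next = second;
    return first;
}

/* Binary operations of OPR 0,N that this sketch supports. */
static int operate(int n, int b, int a)
{
    switch (n) {
    case 2:  return b + a;
    case 3:  return b - a;
    case 4:  return b * a;
    case 8:  return b == a;
    case 10: return b < a;
    default: return 0;      /* operations not covered by this sketch */
    }
}

/* Walks the list until it runs off the end; returns the value on top
   of the stack, or 0 if the stack is empty. No bounds checking. */
int run(const struct pcode *program)
{
    int stack[STACK_SIZE] = {0};
    int t = -1;
    const struct pcode *pc = program;

    while (pc != NULL) {
        const struct pcode *next = pc->next;
        switch (pc->op) {
        case OP_LIT: stack[++t] = pc->a; break;
        case OP_OPR: {
            int a = stack[t--];
            int b = stack[t];
            stack[t] = operate(pc->a, b, a);
            break;
        }
        case OP_LOD: stack[++t] = stack[pc->a]; break;
        case OP_STO: stack[pc->a] = stack[t--]; break;
        case OP_INT: t += pc->a; break;
        case OP_JMP: next = pc->target; break;
        case OP_JPC:
            if ((stack[t--] & 1) == pc->l)
                next = pc->target;
            break;
        }
        pc = next;
    }
    return t >= 0 ? stack[t] : 0;
}
```

Storing the jump destination as a node pointer means no separate address table is needed: a jump just replaces the "next node" that the loop would otherwise follow.
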
The p-machine (the virtual machine that executes the generated p-code) is just a tiny program which goes through the p-code instructions in this linked list, and either executes the current p-code instruction, using and maintaining its runtime stack, or jumps to the specified label (the address of a p-code node in the linked list).

## 18. Data structures used in the compiler:
- Linked lists are used to create dynamic arrays.
- For each new node of the list, dynamic memory must be allocated with the malloc() function in C, and then the previous node should be connected to this new node.

- Single linked list:
```
head->d
      n->d
         n->d
            n->d
               n=0
```
Here, d holds the data and n is the pointer to the next node.
The address of the first node is kept in head. The value of n in the last node is NULL, or 0.
It is declared in C like this:
```
struct snode {
    int d;
    struct snode *n; /* next */
} *head;
```
- Double linked list:
```
      p=0
head->d<-p
      n->d<-p
         n->d<-p
            n->d<-tail
               n=0
```
Here, d holds the data, n is the pointer to the next node, and p is the pointer to the previous node.
The address of the first node is kept in head, and the address of the last node is kept in tail.
The value of n in the last node is NULL, or 0. Likewise, the value of p in the first node is NULL, or 0.
It is declared in C like this:
```
struct dnode {
    int d;
    struct dnode *p; /* previous */
    struct dnode *n; /* next */
} *head, *tail;
```
- Dynamic stack:

When the double linked list is used as a stack, the tail pointer, which points to the last node, also points to the top of the stack.
Thus we obtain a stack data structure which can be resized dynamically.

## 19. Types and functions used in the compiler:

- `symblocktop`: Pointer addressing the top block of the symbol table (stack).
- `pushsymblock()`: Adds (pushes) a new block on top of the symbol table (stack).
- `popsymblock()`: Removes (pops) the top block from the symbol table (stack).
- `instconst()`, `instlabel()`, `insttype()`, `instvar()`: Add, respectively, a new constant, label, type, or variable declaration to the top block of the symbol table (stack).
- `makeptrtype()`, `makeenumtype()`, `makerangetype()`, `makeidtype()`, `makerectype()`, `makearraytype()`, `makeuniontype()`, `makesettype()`, `makefiletype()`: Create the various type nodes. The type of a variable is stored in the symbol table in a special variant data structure, which varies for each type.

The type node is declared in C like this:
```
struct stype {
    char metatype; /* TENUM, TID, TREC, TUNION, TFILE, TSET, TRANGE, TARR */
    void *restptr;
};
```
Here, metatype can hold one of the constants `TENUM`, `TID`, `TREC`, `TUNION`, `TARR`, `TFILE`, `TSET` and `TRANGE`.
restptr points to a different data structure for each different metatype.
These data structures are:

| metatype | restptr |
| --- | --- |
| TENUM | --> id0 --> id1 --> ... --> idN<br>(linked list; the ids are strings, which are also registered in the symbol table as constants from 0 to N) |
| TRANGE | --> min, max, type (type is either character or number)<br>Example: `VAR r:(A..Z);` (min=A, max=Z, type=character) |
| TID | --> id (id: name of the equivalent type)<br>Example: `TYPE numerictype = INTEGER;` (type that is equivalent to INTEGER)<br>`VAR i:numerictype;` (id="numerictype") |
| TARRAY | --> type, indextype1 --> ... --> indextypeN<br>(linked list, the types of the indexes of an array with N indexes) |
| TREC and TUNION | --> fields1 --> ... --> fieldsN<br>(linked list; each fields entry is a tuple of a linked list of the names of the fields of the same type, and the field type)<br>Example: `VAR u: UNION`<br>`f1,f2,f3: REAL;` (fields1->(f1->f2->f3,REAL))<br>`i1,i2: INTEGER;` (fields2->(i1->i2,INTEGER))<br>`END;` |
| TPTR | --> type (type of the variable addressed by this pointer type)<br>Example: `VAR p: ^ INTEGER;` (type=INTEGER) |
| TFILE | --> type<br>Example: `VAR f: FILE OF CHAR;` (type=CHAR) |
| TSET | --> type<br>Example: `VAR s: SET OF INTEGER;` (type=INTEGER) |

`instproc()`, `instfunc()`: Adds a new procedure/function to the top block of the symbol table (stack).

`addXnode()`: Adds a new node to the linked list of X's.
Functions of this kind are: `addidnode()`, `addtypenode()`, `addfieldsnode()`, `addfparsec()`, `addexprnode()`, `addclinenode()`, `addconstnode()`, `addvarnode()`

`linkNcodes()`: Concatenates linked lists containing p-code nodes into one large linked list.

`codeXXX()`: Generates a block of p-code.
The generated code is actually a single linked list with p-code as data.
The blocks (linked lists) of p-code produced at different times during parsing are combined with `linkNcodes()` at different stages and finally converted into one large block.
The address of this final linked list is assigned to the variable named `program`, which is later used for the execution of the program.

--------------------------------------------------------------------------------