├── LICENSE.md ├── README.md ├── build ├── boot ├── build.rkt ├── exe.bat ├── make ├── make.rkt ├── unix └── waxeye ├── docs └── book │ ├── book │ ├── scheme.lang │ └── waxeye.lang ├── grammars ├── calc.waxeye ├── json.waxeye ├── modular │ └── mod.rkt ├── num.waxeye ├── regexp.waxeye ├── templ.waxeye └── waxeye.waxeye ├── src ├── example │ └── racket │ │ ├── calculator.rkt │ │ └── example.rkt ├── racket │ └── waxeye │ │ ├── ast.rkt │ │ ├── fa.rkt │ │ ├── parser.rkt │ │ └── set.rkt └── waxeye │ ├── action.rkt │ ├── code.rkt │ ├── debug.rkt │ ├── dfa.rkt │ ├── dot.rkt │ ├── expand.rkt │ ├── file.rkt │ ├── gen.rkt │ ├── grammar-parser.rkt │ ├── header.txt │ ├── interp.rkt │ ├── load.rkt │ ├── main.rkt │ ├── nfa.rkt │ ├── racket.rkt │ ├── set.rkt │ ├── tester.rkt │ ├── transform.rkt │ ├── util.rkt │ ├── version.rkt │ └── waxeye.rkt └── test └── grammars ├── json.rkt ├── templ.rkt └── waxeye.rkt /LICENSE.md: -------------------------------------------------------------------------------- 1 | # PolyForm Noncommercial License 1.0.0 2 | 3 | 4 | 5 | ## Acceptance 6 | 7 | In order to get any license under these terms, you must agree 8 | to them as both strict obligations and conditions to all 9 | your licenses. 10 | 11 | ## Copyright License 12 | 13 | The licensor grants you a copyright license for the 14 | software to do everything you might do with the software 15 | that would otherwise infringe the licensor's copyright 16 | in it for any permitted purpose. However, you may 17 | only distribute the software according to [Distribution 18 | License](#distribution-license) and make changes or new works 19 | based on the software according to [Changes and New Works 20 | License](#changes-and-new-works-license). 21 | 22 | ## Distribution License 23 | 24 | The licensor grants you an additional copyright license 25 | to distribute copies of the software. 
Your license 26 | to distribute covers distributing the software with 27 | changes and new works permitted by [Changes and New Works 28 | License](#changes-and-new-works-license). 29 | 30 | ## Notices 31 | 32 | You must ensure that anyone who gets a copy of any part of 33 | the software from you also gets a copy of these terms or the 34 | URL for them above, as well as copies of any plain-text lines 35 | beginning with `Required Notice:` that the licensor provided 36 | with the software. For example: 37 | 38 | > Required Notice: Copyright Yoyodyne, Inc. (http://example.com) 39 | 40 | ## Changes and New Works License 41 | 42 | The licensor grants you an additional copyright license to 43 | make changes and new works based on the software for any 44 | permitted purpose. 45 | 46 | ## Patent License 47 | 48 | The licensor grants you a patent license for the software that 49 | covers patent claims the licensor can license, or becomes able 50 | to license, that you would infringe by using the software. 51 | 52 | ## Noncommercial Purposes 53 | 54 | Any noncommercial purpose is a permitted purpose. 55 | 56 | ## Personal Uses 57 | 58 | Personal use for research, experiment, and testing for 59 | the benefit of public knowledge, personal study, private 60 | entertainment, hobby projects, amateur pursuits, or religious 61 | observance, without any anticipated commercial application, 62 | is use for a permitted purpose. 63 | 64 | ## Noncommercial Organizations 65 | 66 | Use by any charitable organization, educational institution, 67 | public research organization, public safety or health 68 | organization, environmental protection organization, 69 | or government institution is use for a permitted purpose 70 | regardless of the source of funding or obligations resulting 71 | from the funding. 72 | 73 | ## Fair Use 74 | 75 | You may have "fair use" rights for the software under the 76 | law. These terms do not limit them. 
77 | 78 | ## No Other Rights 79 | 80 | These terms do not allow you to sublicense or transfer any of 81 | your licenses to anyone else, or prevent the licensor from 82 | granting licenses to anyone else. These terms do not imply 83 | any other licenses. 84 | 85 | ## Patent Defense 86 | 87 | If you make any written claim that the software infringes or 88 | contributes to infringement of any patent, your patent license 89 | for the software granted under these terms ends immediately. If 90 | your company makes such a claim, your patent license ends 91 | immediately for work on behalf of your company. 92 | 93 | ## Violations 94 | 95 | The first time you are notified in writing that you have 96 | violated any of these terms, or done anything with the software 97 | not covered by your licenses, your licenses can nonetheless 98 | continue if you come into full compliance with these terms, 99 | and take practical steps to correct past violations, within 100 | 32 days of receiving notice. Otherwise, all your licenses 101 | end immediately. 102 | 103 | ## No Liability 104 | 105 | ***As far as the law allows, the software comes as is, without 106 | any warranty or condition, and the licensor will not be liable 107 | to you for any damages arising out of these terms or the use 108 | or nature of the software, under any kind of legal claim.*** 109 | 110 | ## Definitions 111 | 112 | The **licensor** is the individual or entity offering these 113 | terms, and the **software** is the software the licensor makes 114 | available under these terms. 115 | 116 | **You** refers to the individual or entity agreeing to these 117 | terms. 118 | 119 | **Your company** is any legal entity, sole proprietorship, 120 | or other kind of organization that you work for, plus all 121 | organizations that have control over, are under the control of, 122 | or are under common control with that organization. 
**Control** 123 | means ownership of substantially all the assets of an entity, 124 | or the power to direct its management and policies by vote, 125 | contract, or otherwise. Control can be direct or indirect. 126 | 127 | **Your licenses** are all the licenses granted to you for the 128 | software under these terms. 129 | 130 | **Use** means anything you do with the software requiring one 131 | of your licenses. 132 | -------------------------------------------------------------------------------- /README.md: -------------------------------------------------------------------------------- 1 | Waxeye Parser Generator 2 | ======================= 3 | 4 | Waxeye is a parser generator based on parsing expression grammars (PEGs). 5 | 6 | Currently supported programming languages: 7 | * Racket 8 | 9 | 10 | Features 11 | -------- 12 | 13 | * Language-agnostic, modular, composable grammars 14 | 15 | * Automatic AST generation 16 | 17 | * Command-line grammar interpreter 18 | 19 | * Grammar testing DSL 20 | 21 | 22 | User Manual 23 | ----------- 24 | 25 | Waxeye's user manual is in `docs/manual.html`. The latest version is also 26 | online at http://waxeye.org/manual.html. 27 | 28 | 29 | Installation 30 | ------------ 31 | 32 | ### Unix and OSX 33 | 34 | 1. Extract the files of the distribution. 35 | 36 | 2. Copy the `waxeye` directory to where you wish to install it. 37 | 38 | 3. Add the `bin/waxeye` binary to your search path. e.g. If you have `~/bin` in 39 | your `PATH` and installed waxeye to `/usr/local/waxeye` then you might do 40 | the following. 41 | 42 | `ln -s /usr/local/waxeye/bin/waxeye ~/bin/` 43 | 44 | 45 | ### Windows 46 | 47 | 1. Extract the files of the distribution. 48 | 49 | 2. Copy the `waxeye` directory to where you wish to install it. 50 | 51 | 52 | Running 53 | ------- 54 | 55 | ### Unix and OSX 56 | 57 | Use the `waxeye` command. 58 | 59 | ### Windows 60 | 61 | Use a command prompt to run `waxeye.exe`. 
Note: If using the interpreter under 62 | Windows, you will need to press `Ctrl-z` and then 'Enter' after the input you 63 | want to interpret. 64 | 65 | 66 | Building from Source 67 | -------------------- 68 | 69 | 1. Install [Racket](http://racket-lang.org) 70 | 71 | 2. Install Waxeye's backend for Racket. 72 | * Unix and OSX 73 | 74 | `sudo ln -s /usr/local/waxeye/src/racket/waxeye /usr/local/racket/lib/racket/collects/` 75 | 76 | * Windows 77 | 78 | Copy the directory `src/racket/waxeye` into your Racket `collects` 79 | directory. For example, `C:\Program Files\Racket\collects`. 80 | 81 | 3. Build Waxeye 82 | * Unix and OSX 83 | 84 | `./build/unix` 85 | 86 | * Windows 87 | 88 | - If your Racket installation isn't `C:\Program Files\Racket`, then you 89 | will need to modify `build\exe.bat` to use the correct path. 90 | 91 | - From your Waxeye installation directory, run the `build\exe.bat` script 92 | in a command prompt. 93 | 94 | 95 | License 96 | ------- 97 | 98 | [PolyForm Noncommercial License 1.0.0](https://polyformproject.org/licenses/noncommercial/1.0.0) 99 | -------------------------------------------------------------------------------- /build/boot: -------------------------------------------------------------------------------- 1 | #!/bin/bash 2 | 3 | ./build/waxeye -g racket src/waxeye/ -c src/waxeye/header.txt -p grammar grammars/waxeye.waxeye 4 | -------------------------------------------------------------------------------- /build/build.rkt: -------------------------------------------------------------------------------- 1 | #lang racket/base 2 | 3 | (require "make.rkt" 4 | "../src/waxeye/version.rkt") 5 | 6 | 7 | (define *name* "waxeye") 8 | (define *doc-book* "/usr/local/docbook") 9 | 10 | 11 | (target clean (clean-book clean-dist clean-unix) 12 | (^ rm -rf tmp)) 13 | 14 | 15 | (target book (book-html)) 16 | 17 | 18 | (target book-html () 19 | (^ asciidoc -a toc -n -o docs/manual.html docs/book/book)) 20 | 21 | 22 | (target book-pdf () 23 | (^ 
mkdir -p tmp/book) 24 | (^ asciidoc -a toc -b docbook --doctype=book -o tmp/book/book.xml docs/book/book) 25 | ($ xsltproc '-o 'tmp/book/book.fo (++ *doc-book* "/fo/docbook.xsl") 'tmp/book/book.xml) 26 | (^ fop tmp/book/book.fo docs/manual.pdf)) 27 | 28 | 29 | (target clean-book () 30 | (^ rm -rf tmp/book) 31 | (^ rm -f docs/manual.html docs/manual.pdf)) 32 | 33 | 34 | (target dist (clean dist-src dist-unix)) 35 | 36 | 37 | (define (cp-dist from) 38 | ($ cp '-r from (++ "dist/waxeye-" *version* "/"))) 39 | 40 | 41 | (target dist-base (book) 42 | 43 | ($ mkdir '-p (++ "dist/waxeye-" *version*)) 44 | 45 | (cp-dist "build") 46 | (cp-dist "docs") 47 | (cp-dist "grammars") 48 | (cp-dist "lib") 49 | (cp-dist "LICENSE.md") 50 | (cp-dist "README.md") 51 | (cp-dist "src") 52 | (cp-dist "test") 53 | 54 | ($ chmod '755 (++ "dist/waxeye-" *version* "/build/make")) 55 | ($ chmod '755 (++ "dist/waxeye-" *version* "/build/unix")) 56 | ($ chmod '755 (++ "dist/waxeye-" *version* "/build/waxeye"))) 57 | 58 | 59 | (target dist-src (dist-base) 60 | (cd dist 61 | ($ zip '-r (++ "waxeye-" *version* "-src.zip waxeye-" *version*)) 62 | ($ tar 'cjf (++ "waxeye-" *version* "-src.tar.bz2 waxeye-" *version*)))) 63 | 64 | 65 | (target dist-unix (dist-base) 66 | (cd$ (++ "dist/waxeye-" *version*) 67 | (^ ./build/unix)) 68 | (cd dist 69 | ($ tar 'czf (++ "waxeye-" *version* "-unix.tar.gz waxeye-" *version*)) 70 | ($ tar 'cjf (++ "waxeye-" *version* "-unix.tar.bz2 waxeye-" *version*)))) 71 | 72 | 73 | (target clean-dist () 74 | (^ rm -rf dist)) 75 | 76 | 77 | (target clean-unix () 78 | (^ rm -rf bin lib)) 79 | 80 | 81 | (run-make) 82 | -------------------------------------------------------------------------------- /build/exe.bat: -------------------------------------------------------------------------------- 1 | C:\"Program Files\Racket\raco.exe" exe src\waxeye\waxeye.rkt 2 | C:\"Program Files\Racket\raco.exe" distribute . 
src\waxeye\waxeye.exe 3 | DEL src\waxeye\waxeye.exe 4 | -------------------------------------------------------------------------------- /build/make: -------------------------------------------------------------------------------- 1 | #!/bin/bash 2 | 3 | racket build/build.rkt $* 4 | -------------------------------------------------------------------------------- /build/make.rkt: -------------------------------------------------------------------------------- 1 | #lang racket/base 2 | 3 | (require (only-in racket/system system) 4 | (only-in "../src/waxeye/util.rkt" display-ln)) 5 | 6 | (provide ^ $ ++ cd cd$ run-cmd run-make target) 7 | 8 | 9 | (define *target-table* (make-hash)) 10 | (define *dep-table* (make-hash)) 11 | 12 | (define ++ string-append) 13 | 14 | (define-syntax target 15 | (syntax-rules () 16 | ((_ name (deps ...) code ...) 17 | ;; bind target name to code 18 | (hash-set! 19 | *target-table* 20 | 'name 21 | (lambda () 22 | ;; run dependencies 23 | (for-each run-target '(deps ...)) 24 | ;; run code 25 | code ...))))) 26 | 27 | 28 | (define (run-target t) 29 | (let ((t-code (hash-ref *target-table* t #f))) 30 | (if t-code 31 | (unless (hash-ref *dep-table* t #f) 32 | (hash-set! *dep-table* t #t) 33 | (apply t-code ())) 34 | (error 'make (++ "target doesn't exist - " (symbol->string t)))))) 35 | 36 | 37 | (define (run-make) 38 | (let ((args (map string->symbol (vector->list (current-command-line-arguments))))) 39 | ;; if no make target was specified 40 | (if (null? args) 41 | ;; print all possible targets 42 | (begin 43 | (display-ln "possible targets:") 44 | (for-each display-ln (sort (map symbol->string (hash-map *target-table* (lambda (k v) k))) stringstring s)) 53 | ((char? s) (list->string (list s))) 54 | ((number? 
s) (number->string s)) 55 | (else s))) 56 | (let ((cmd (++ (as-string prog) 57 | (foldr (lambda (a b) 58 | (++ " " (as-string a) b)) 59 | "" 60 | args)))) 61 | (display-ln cmd) 62 | (system cmd))) 63 | 64 | 65 | (define-syntax $ 66 | (syntax-rules () 67 | ((_ prog arg ...) 68 | (run-cmd 'prog (list arg ...))))) 69 | 70 | 71 | (define-syntax ^ 72 | (syntax-rules () 73 | ((_ prog arg ...) 74 | (run-cmd 'prog '(arg ...))))) 75 | 76 | 77 | (define-syntax cd$ 78 | (syntax-rules () 79 | ((_ dir code ...) 80 | (parameterize ((current-directory (let ((d dir)) 81 | (if (symbol? d) 82 | (symbol->string d) 83 | d)))) 84 | code ...)))) 85 | 86 | 87 | (define-syntax cd 88 | (syntax-rules () 89 | ((_ dir code ...) 90 | (cd$ 'dir code ...)))) 91 | -------------------------------------------------------------------------------- /build/unix: -------------------------------------------------------------------------------- 1 | #!/bin/bash 2 | 3 | raco exe -o waxeye src/waxeye/waxeye.rkt 4 | raco distribute . waxeye 5 | rm waxeye 6 | -------------------------------------------------------------------------------- /build/waxeye: -------------------------------------------------------------------------------- 1 | #!/bin/bash 2 | 3 | racket src/waxeye/waxeye.rkt $* 4 | -------------------------------------------------------------------------------- /docs/book/book: -------------------------------------------------------------------------------- 1 | Language Development with Waxeye 2 | ================================ 3 | Orlando Hill 4 | version 0.9.0-dev, January 2021 5 | 6 | 7 | 8 | == Introduction == 9 | 10 | As programmers, we are required to make use of data that is presented in a 11 | variety of formats. In order to extract and manipulate the desired information, 12 | we need the ability to navigate the structure of the language the data is 13 | written in. 
Unless the language is very simple, we must use a parser that 14 | understands the language and gives us the data in a form we can more readily 15 | use. 16 | 17 | Manually creating parsers can be boring and time consuming. It is, therefore, 18 | common to use a parser generator to do the grunt work of constructing the 19 | parser. This is where Waxeye comes in handy. 20 | 21 | 22 | 23 | == Getting Started == 24 | 25 | === Downloading === 26 | 27 | You can download the latest version of Waxeye's source code from 28 | https://github.com/pomanu/waxeye[GitHub]. 29 | 30 | 31 | === Requirements === 32 | 33 | There are no external dependencies needed to run a pre-built version of Waxeye. 34 | If you build from source, you'll need http://racket-lang.org[Racket]. 35 | 36 | To use a generated parser, you need a supported programming language to run it 37 | from. 38 | 39 | 40 | === Installation === 41 | 42 | ==== Unix and MacOSX ==== 43 | 1. Extract the files of the distribution. 44 | 45 | 2. Copy the 'waxeye' directory to where you wish to install it. 46 | 47 | 3. Add the 'bin/waxeye' binary to your search path. e.g. If you have `~/bin` in 48 | your PATH and installed waxeye to '/usr/local/waxeye' then you might do the 49 | following. 50 | 51 | ------------------------------------------------------------------------------- 52 | ln -s /usr/local/waxeye/bin/waxeye ~/bin/ 53 | ------------------------------------------------------------------------------- 54 | 55 | ==== Windows ==== 56 | 57 | 1. Extract the files of the distribution. 58 | 59 | 2. Copy the 'waxeye' directory to where you wish to install it. 60 | 61 | 62 | === Running === 63 | 64 | Currently, Waxeye is used from a command-line interface. You can use it as a 65 | command-line tool or as part of a script or build system. There are plans to 66 | develop a graphical tool at a later stage. 67 | 68 | ==== Unix and MacOSX ==== 69 | 70 | Run Waxeye by executing the `waxeye` binary. 
71 | 72 | ==== Windows ==== 73 | 74 | Use a command prompt to run `waxeye.exe`. 75 | 76 | 77 | 78 | == Basic Concepts == 79 | 80 | === What is a parser? === 81 | 82 | When we want to understand data that has been written in a language of interest 83 | ('L'), we need to break our data into units of the language. This process of 84 | breaking our input into different parts, based on the structure of 'L', is 85 | called 'parsing'. A program used for parsing is called a 'parser'. 86 | 87 | 88 | === What is the result of a parser? === 89 | 90 | Once your input has been parsed, you need the result to be presented in a form 91 | that is easy to understand and manipulate. Since the input was organized based 92 | on the hierarchical structure of the language, it makes sense that the output 93 | of the parser mimics this structure. The most effective form to do this with is 94 | a tree. 95 | 96 | Such a tree is known as an Abstract Syntax Tree (AST). A Waxeye parser will 97 | automatically give you an AST that represents your input. The structure of this 98 | AST is based on the structure of your language's grammar. 99 | 100 | 101 | === What is a parser generator? === 102 | 103 | If 'L' is simple, it is easy for us to use our programming language of choice 104 | to write a parser for 'L' manually. However, as the structural complexity of 105 | 'L' increases, so too does the size and complexity of the parser program. 106 | Writing and maintaining a large parser by hand can quickly become a tedious 107 | and laborious job. Thankfully, we can use a parser generator to automate the 108 | work of creating a parser so we can focus on other problems. 109 | 110 | A parser generator is a tool designed to help software developers automate the 111 | process of creating a parser. Just like compilers and assemblers, a parser 112 | generator takes a description of a program, automatically does the boring work 113 | for you and gives you a transformed program as output. 
Each tool accepts input 114 | in one language ('L1'), performs various transformations and creates output in 115 | another language ('L2'). 116 | 117 | L1 --> Compiler --> L2 118 | L1 --> Assembler --> L2 119 | L1 --> Parser Generator --> L2 120 | 121 | The key difference between the three tools is the level of abstraction held by 122 | the input and output languages. The assembler works at the lowest level by 123 | taking assembly files and producing machine code. The compiler works above the 124 | assembler by taking a more abstract programming language and generating 125 | assembly files or machine code directly. Finally, the parser generator has the 126 | highest level of abstraction and transforms a 'grammar file' into programming 127 | language source code for a compiler to process. 128 | 129 | 130 | === What is a grammar file? === 131 | 132 | We can define a language as the set of strings it contains. While it is 133 | sometimes possible to specify a language simply by enumerating all of its 134 | strings, such an approach has significant drawbacks. Trying to write each 135 | string in our language could be very time consuming and, potentially, take 136 | forever. 137 | 138 | Suppose we need to read time information as part of a larger program. In a 139 | trivial case, the time information may be presented as two digits for the 140 | hours, a colon `:`, and then two digits for the minutes. 141 | 142 | ------------------------------------------------------------------------------- 143 | 00:00, 00:01, 00:02, ... 14:23, 14:24, 14:25, ... 23:57, 23:58, 23:59 144 | ------------------------------------------------------------------------------- 145 | 146 | We could describe our time language this way but, writing all 1,440 possible 147 | hour/minute combinations wouldn't be much fun. Not to mention how bad things 148 | would be if we extended our language to include date information. 
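The figure of 1,440 is simply 24 hours times 60 minutes. As a quick check, here is a minimal Racket sketch that enumerates every string in the time language. It is not part of Waxeye, and the names `two-digits` and `all-times` are made up for illustration.

```racket
#lang racket/base
(require racket/format) ; provides ~r for number formatting

;; Hypothetical helper: pad a number to two digits, e.g. 7 becomes "07".
(define (two-digits n)
  (~r n #:min-width 2 #:pad-string "0"))

;; Every "HH:MM" string from 00:00 to 23:59, in order.
(define all-times
  (for*/list ([h (in-range 24)]
              [m (in-range 60)])
    (string-append (two-digits h) ":" (two-digits m))))

(length all-times) ; => 1440
```

Enumeration like this only works because the language is finite, and the list grows quickly as soon as more fields are added.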
149 | 150 | As another example, consider the language that consists of all strings of one 151 | or more alphabetic characters. 152 | 153 | ------------------------------------------------------------------------------- 154 | a, b, c, ... z, aa, ab, ac, ... az, aaa, aab, aac, ... 155 | ------------------------------------------------------------------------------- 156 | 157 | Even worse than our time example, this language is infinite. It would be 158 | impossible for us to explicitly list every string in the language. 159 | 160 | If we want to describe such languages, we need a notation that is more abstract 161 | than simply writing out strings. We call this notation a 'grammar' and the file 162 | that contains it a 'grammar file'. 163 | 164 | 165 | 166 | == Waxeye Grammars == 167 | 168 | To generate a parser for a language, you must supply the parser generator with 169 | a grammar file that describes the language. Waxeye grammar files are written as 170 | text documents and are, by convention, given the `.waxeye` file extension. 171 | 172 | A Waxeye grammar consists of a set of rule definitions, called 'non-terminals'. 173 | Together, the non-terminals succinctly describe the syntax of the language. By 174 | default, the first non-terminal is considered the starting point of the 175 | language definition. 176 | 177 | 178 | === Non-terminals === 179 | 180 | Non-terminals are defined in three parts: a name, a rule type and one or more 181 | grammar expressions. 182 | 183 | The most common non-terminal type is the tree constructing non-terminal. A tree 184 | constructing non-terminal has the following form: 185 | 186 | ******************************************************************************* 187 | 'Name' `<-` '+expressions' 188 | ******************************************************************************* 189 | 190 | Where 'Name' matches `[a-zA-Z_] *[a-zA-Z0-9_-]`. 
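Because `*` is a prefix operator in Waxeye, the name pattern reads as one leading character followed by zero or more trailing characters. A rough Racket equivalent is sketched here, assuming a conventional regular expression is an acceptable stand-in. The function `valid-name?` is hypothetical and is not part of Waxeye.

```racket
#lang racket/base

;; Hypothetical check mirroring the Waxeye pattern [a-zA-Z_] *[a-zA-Z0-9_-].
;; The prefix closure *e in Waxeye corresponds to the postfix e* in a regexp.
(define (valid-name? s)
  (regexp-match? #px"^[a-zA-Z_][a-zA-Z0-9_-]*$" s))

(valid-name? "Example")   ; => #t
(valid-name? "my-rule_2") ; => #t
(valid-name? "9lives")    ; => #f, a name cannot start with a digit
```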
191 | 192 | [source,waxeye] 193 | .A tree constructing non-terminal 194 | ------------------------------------------------------------------------------- 195 | Example <- A | B 196 | ------------------------------------------------------------------------------- 197 | 198 | 199 | The other common non-terminal type is the void non-terminal. The result of a 200 | void non-terminal is not included in the AST that is constructed by the parser. 201 | To define a void non-terminal, use this form: 202 | 203 | ******************************************************************************* 204 | 'Name' `<:` '+expressions' 205 | ******************************************************************************* 206 | 207 | [source,waxeye] 208 | .A void non-terminal 209 | ------------------------------------------------------------------------------- 210 | Example <: A | B 211 | ------------------------------------------------------------------------------- 212 | 213 | 214 | 215 | === Expressions === 216 | 217 | The most important part of each non-terminal definition is the set of 218 | expressions it contains. Grammar expressions come in different forms and have 219 | their own meanings. Places where an expression can be contained within another 220 | expression are denoted with an 'e'. 221 | 222 | 223 | ==== Atomic Expressions ==== 224 | 225 | ===== Wildcard ===== 226 | `.` 227 | 228 | Matches any character from the input. 229 | 230 | 231 | ===== Literal ===== 232 | `'text'` 233 | 234 | Matches `text` in the input. 235 | 236 | 237 | ===== Case-insensitive Literal ===== 238 | `"text"` 239 | 240 | Matches `text` in the input while ignoring case. This is equivalent to the 241 | expression `[tT][eE][xX][tT]` but is much more readable. 242 | 243 | 244 | ===== Character Class ===== 245 | `[a-z_-]` 246 | 247 | Character-class that matches either a lower-case English character, `_` or 248 | `-`. 
249 | 250 | 251 | ===== Non-terminal ===== 252 | `NT` 253 | 254 | References the non-terminal named `NT`. 255 | 256 | 257 | ===== Parentheses ===== 258 | `(`'e'`)` 259 | 260 | Raises the precedence of the expression 'e'. 261 | 262 | 263 | /////////////////////////////////////////////////////////////////////////////// 264 | ===== Context Actions ===== 265 | `@action` 266 | 267 | References the context-action `action` and gives the action the data held by 268 | the labels `a` and `b`. These are used for context-sensitive parsing. Not fully 269 | implemented yet. 270 | /////////////////////////////////////////////////////////////////////////////// 271 | 272 | 273 | ==== Prefix Expressions ==== 274 | 275 | ===== Void ===== 276 | `:`'e' 277 | 278 | Doesn't include the result of 'e' when building the AST. 279 | 280 | 281 | ===== Closure ===== 282 | `*`'e' 283 | 284 | Puts 'e' within a closure. 285 | 286 | 287 | ===== Plus ===== 288 | `+`'e' 289 | 290 | Puts 'e' within a plus-closure. 291 | 292 | 293 | ===== Optional ===== 294 | `?`'e' 295 | 296 | Puts 'e' within an optional. 297 | 298 | 299 | ===== Negative Check ===== 300 | `!`'e' 301 | 302 | Checks that 'e' fails. 303 | 304 | 305 | ===== Positive Check ===== 306 | `&`'e' 307 | 308 | Checks that 'e' succeeds. 309 | 310 | 311 | /////////////////////////////////////////////////////////////////////////////// 312 | ===== Labels ===== 313 | `a=`'e' 314 | 315 | Labels the expression 'e' with the label `a`. Not fully implemented yet. 316 | /////////////////////////////////////////////////////////////////////////////// 317 | 318 | 319 | ==== Sequence Expressions ==== 320 | 'e1 e2' 321 | 322 | Matches 'e1' and 'e2' in sequence. 323 | 324 | 325 | ==== Alternation Expressions ==== 326 | 'e1'`|`'e2' 327 | 328 | Tries to match 'e1' and, if that fails, tries to match 'e2'. 329 | 330 | 331 | === Precedence === 332 | 333 | In Waxeye grammars, some expressions can have other expressions nested within 334 | them. 
When we use parentheses, we are explicitly denoting the nesting structure 335 | of the expressions. 336 | 337 | [source,waxeye] 338 | ------------------------------------------------------------------------------- 339 | ((?A) B) | C 340 | ------------------------------------------------------------------------------- 341 | 342 | At times, this can seem needlessly verbose. In many cases, we are able to omit 343 | the parentheses in favor of a shorter notation. We do this by exploiting the 344 | precedence of each expression type. 345 | 346 | [source,waxeye] 347 | ------------------------------------------------------------------------------- 348 | ?A B | C 349 | ------------------------------------------------------------------------------- 350 | 351 | The precedence of an expression determines the priority it has when resolving 352 | implicitly nested expressions. Each expression type has a level of precedence 353 | relative to all other types. There are four different precedence levels in 354 | Waxeye grammars. 355 | 356 | 357 | ==== Level 4 ==== 358 | 359 | The highest precedence is held by the atomic expressions. Because these 360 | expressions cannot, themselves, contain expressions, there is no need to 361 | consider which expressions are nested within them. 362 | 363 | 364 | ==== Level 3 ==== 365 | 366 | The prefix expressions hold the next precedence level. Their nesting is 367 | resolved directly after the atomic expressions. 368 | 369 | 370 | ==== Level 2 ==== 371 | 372 | Sequences of expressions are formed once the atomic and prefix expressions have 373 | been resolved. 374 | 375 | 376 | ==== Level 1 ==== 377 | 378 | Finally, once all other expressions have been resolved, the different choices of 379 | the alternation expression are resolved. 380 | 381 | 382 | 383 | === Pruning Non-terminals === 384 | 385 | Sometimes, creating a new AST node will give us more information than we need. 
386 | We might want to create a new AST node, only if doing so will tell us something 387 | interesting about our input. If the additional node gives us nothing of 388 | interest, our tree could be said to contain 'vertical noise'. 389 | 390 | To make it easier to process the AST, we can remove this vertical noise by 391 | using the 'pruning' non-terminal type. This non-terminal type has the following 392 | form: 393 | 394 | ******************************************************************************* 395 | 'Name' `<=` '+expressions' 396 | ******************************************************************************* 397 | 398 | When 'Name' has successfully parsed a string, one of three things will happen, 399 | depending on the number of results to be included from 'Name''s expressions. 400 | 401 | * If there are no expression results to be included, nothing new will be added 402 | to the AST. 403 | 404 | * If there is one expression result to be included, that result will take the 405 | place of the 'Name' AST node. 406 | 407 | * Otherwise, a new 'Name' AST node will be created, just like a tree 408 | constructing non-terminal. 409 | 410 | 411 | To help understand how this works, consider an example from a simple arithmetic 412 | grammar. 413 | 414 | [source,waxeye] 415 | ------------------------------------------------------------------------------- 416 | Product <- Number *([*/] Number) 417 | 418 | Number <- +[0-9] 419 | ------------------------------------------------------------------------------- 420 | 421 | If we use the 'Product' rule to parse the string `3*7`, we get a tree with 422 | 'Product' at the root and, below that, a 'Number', a `*` character and then 423 | another 'Number'. 
424 | 425 | ------------------------------------------------------------------------------- 426 | Product 427 | -> Number 428 | | 3 429 | | * 430 | -> Number 431 | | 7 432 | ------------------------------------------------------------------------------- 433 | 434 | However, if the 'Product' rule parses a string with just one 'Number' in it, we 435 | will get a tree that is slightly bigger than we need. Parsing the string `5` 436 | produces the following tree. 437 | 438 | ------------------------------------------------------------------------------- 439 | Product 440 | -> Number 441 | | 5 442 | ------------------------------------------------------------------------------- 443 | 444 | In this case, having a 'Product' node at the root of the AST isn't necessary. 445 | If we want to, we can rewrite the original grammar to use a pruning 446 | non-terminal. 447 | 448 | [source,waxeye] 449 | ------------------------------------------------------------------------------- 450 | Product <= Number *([*/] Number) 451 | 452 | Number <- +[0-9] 453 | ------------------------------------------------------------------------------- 454 | 455 | Now, when we use 'Product' to parse `3*7`, we will get the same result as 456 | before but, when parsing `5`, we get an AST with 'Number' as the root. 457 | 458 | ------------------------------------------------------------------------------- 459 | Number 460 | | 5 461 | ------------------------------------------------------------------------------- 462 | 463 | 464 | As a second example, let's look at a grammar for nested parentheses. 

[source,waxeye]
-------------------------------------------------------------------------------
A <- :'(' A :')' | B

B <- 'b'
-------------------------------------------------------------------------------

Here are some example inputs and their resulting ASTs:

Input: `b`

-------------------------------------------------------------------------------
A
->  B
    |   b
-------------------------------------------------------------------------------

Input: `(b)`

-------------------------------------------------------------------------------
A
->  A
    ->  B
        |   b
-------------------------------------------------------------------------------

Input: `(((b)))`

-------------------------------------------------------------------------------
A
->  A
    ->  A
        ->  A
            ->  B
                |   b
-------------------------------------------------------------------------------

Unless we want to know the number of parentheses matched, trees like these
contain more information than we need. Again, we are able to solve this by
rewriting the grammar using a 'pruning' non-terminal.

[source,waxeye]
-------------------------------------------------------------------------------
A <= :'(' A :')' | B

B <- 'b'
-------------------------------------------------------------------------------

This time, parsing the input `(((b)))` gives us a much shorter tree.

-------------------------------------------------------------------------------
B
|   b
-------------------------------------------------------------------------------


=== Comments ===

There are two types of comments in Waxeye grammars: single-line and multi-line.

==== Single-line ====

Single-line comments start at the first `#` outside of an atomic expression and
extend until the end of the line.

[source,waxeye]
-------------------------------------------------------------------------------
# This is a single-line comment.
-------------------------------------------------------------------------------


==== Multi-line ====

Multi-line comments are opened at the first `/*` outside of an atomic
expression and closed with a `*/`.


[source,waxeye]
-------------------------------------------------------------------------------
/* This is a multi-line comment. */
-------------------------------------------------------------------------------

[source,waxeye]
-------------------------------------------------------------------------------
/* This is, also,
   a multi-line comment. */
-------------------------------------------------------------------------------


As an added convenience when editing a grammar, multi-line comments can be
nested within each other. This is handy when you want to comment out a section
of the grammar that already contains a comment.


[source,waxeye]
-------------------------------------------------------------------------------
/*

This is the outer comment.

A <- 'a'

/*
 * This is the inner comment.
 */
B <- 'b'

*/
-------------------------------------------------------------------------------




== Using Waxeye ==

This chapter will show you how to set up Waxeye for your programming language.
It covers language-specific installation requirements and presents some basic
boilerplate code to get you started.
You can find copies of this boilerplate
code in `src/example/`. I use `$WAXEYE_HOME` to refer to the location where you
have installed the files of the Waxeye distribution.

The example grammar we'll be using can be found in `grammars/num.waxeye`. You
may wish to copy it to the directory you're working in so you can experiment
with extending and modifying the grammar.

.grammars/num.waxeye
[source,waxeye]
-------------------------------------------------------------------------------
Num <- '0' | [1-9] *[0-9]
-------------------------------------------------------------------------------

Once set up and run, the boilerplate example will use the parser you generated
to parse the string `42` and print the AST it creates.

-------------------------------------------------------------------------------
Num
|   4
|   2
-------------------------------------------------------------------------------



=== Using Waxeye from Racket ===

Waxeye's Racket runtime is compatible with http://racket-lang.org/[Racket].

==== Install ====

Install the waxeye collection where Racket can find it.

-------------------------------------------------------------------------------
# Install the Waxeye collection; change to your install paths as needed
sudo ln -s /usr/local/waxeye/src/racket/waxeye /usr/local/racket/lib/racket/collects/
-------------------------------------------------------------------------------

==== Generate Parser ====

-------------------------------------------------------------------------------
waxeye -g racket . num.waxeye
-------------------------------------------------------------------------------

==== Use Parser ====

.src/example/racket/example.rkt
[source,scheme]
-------------------------------------------------------------------------------
#lang racket

(require "parser.rkt")

;; Parse our input
(let ((ast (parser "42")))
  ;; Print our AST
  (display-ast ast))
-------------------------------------------------------------------------------

==== Run from Racket ====

-------------------------------------------------------------------------------
racket -t example.rkt
-------------------------------------------------------------------------------



== Using ASTs and Parse Errors ==

Since just printing an Abstract Syntax Tree isn't very interesting, let's have
a look at how to access the information the ASTs contain.

When you use a Waxeye parser, the result will be one of two things. If the
parser successfully parsed the input, the result will be an AST. If the input
doesn't match the syntax of the language, the result will be a 'parse error'.


=== ASTs ===

ASTs come in three different forms: 'tree', 'char' and 'empty'.

* A 'tree' AST contains a type, a list of children, and the start and end
  positions in the input.

* A 'char' AST contains a single character and has no children.

* An 'empty' AST simply signifies that parsing was successful. If your starting
  non-terminal is voided, or is pruning and has no children, you will get an
  empty AST.


==== Using an AST node as string ====

If a given AST node will only ever have 'char' children, you may wish to treat
that node as a single string.


===== From Racket =====

[source,scheme]
-------------------------------------------------------------------------------
(display (list->string (ast-c ast)))
(newline)
-------------------------------------------------------------------------------



=== Parse Errors ===

A parse error contains information about where the input is invalid and hints
about what is wrong with it.



=== Determining the result type ===


==== From Racket ====

[source,scheme]
-------------------------------------------------------------------------------
(cond
 ((ast? result) "tree ast")
 ((parse-error? result) "error")
 (else "empty ast"))
-------------------------------------------------------------------------------




== Example: A Calculator ==

Now that we know how to write grammars, generate parsers and manipulate ASTs, we
can put these skills together to build a small language interpreter. In this
chapter, we create a command-line calculator.

Our calculator reads a line of input, parses it as an arithmetic expression and
computes the result. The arithmetic language supports the following constructs.

* floating-point numbers
* binary operators +,-,*,/
* unary negation
* parentheses


.grammars/calc.waxeye
[source,waxeye]
-------------------------------------------------------------------------------
calc <- ws sum

sum <- prod *([+-] ws prod)

prod <- unary *([*/] ws unary)

unary <= '-' ws unary
      | :'(' ws sum :')' ws
      | num

num <- +[0-9] ?('.' +[0-9]) ws

ws <: *[ \t\n\r]
-------------------------------------------------------------------------------


=== Calculator in Racket ===

.src/example/racket/calculator.rkt
[source,scheme]
-------------------------------------------------------------------------------
#lang racket

(require "parser.rkt")

;; A commandline arithmetic calculator.

(define (calc input)
  (let ((ast (parser input)))
    (if (ast? ast)
        (begin (display (sum (car (ast-c ast))))
               (newline))
        (display-parse-error ast))))


(define (bin-op ast fn ch op1 op2)
  (let* ((chil (list->vector (ast-c ast)))
         (val (fn (vector-ref chil 0))))
    (let loop ((i 1))
      (unless (= i (vector-length chil))
        ;; Update val by applying the operator to val and the next operand
        (set! val ((if (equal? (vector-ref chil i) ch) op1 op2)
                   val (fn (vector-ref chil (+ i 1)))))
        (loop (+ i 2))))
    val))


(define (sum ast)
  (bin-op ast prod #\+ + -))


(define (prod ast)
  (bin-op ast unary #\* * /))


(define (unary ast)
  (case (ast-t ast)
    ((unary) (- (unary (cadr (ast-c ast)))))
    ((sum) (sum ast))
    (else (num ast))))


(define (num ast)
  (string->number (list->string (ast-c ast))))


(define (rl)
  (display "calc> ")
  (read-line (current-input-port)))


(let loop ((input (rl)))
  (if (eof-object? input)
      (newline)
      (begin (calc input)
             (loop (rl)))))
-------------------------------------------------------------------------------




///////////////////////////////////////////////////////////////////////////////
== A Short Example ==

This chapter will introduce you to the basic work-flow used with Waxeye.
In the
process, we will iteratively develop the grammar of a simple real-world
language.


== Using the Interpreter ==
todo

== Extended Example ==
todo
///////////////////////////////////////////////////////////////////////////////



== Grammar Testing ==

.test/grammars/waxeye.rkt
[source,scheme]
-------------------------------------------------------------------------------
;; These are tests for the 'Grammar' non-terminal
(Grammar ; <- This is the non-terminal's name

 ;; Following the name are pairs of input string and expected output. The
 ;; output is either the keyword 'pass', the keyword 'fail' or an AST. The AST
 ;; specifies the structure of the expected tree, the names of the nodes and
 ;; the individual characters. If you don't want to specify the whole tree,
 ;; just use the wild-card symbol '*' for the portion of the tree you want to
 ;; skip.

 "" ; <- This is the input
 (Grammar) ; <- This is the expected output

 "A <- 'a'"
 pass ; <- The keyword 'pass'

 "A"
 fail ; <- The keyword 'fail'

 "A <- 'a' B <- 'b'"
 (Grammar (Definition (Identifier #\A) *)  ; <- Here we skip some of
          (Definition (Identifier #\B) *)) ;    Definition's children

 "A <- 'a'"
 (Grammar (*)) ; <- Here we specify a child tree of any type

 "A <- [a-z] *[a-z0-9]"
 (Grammar (Definition (Identifier #\A) (LeftArrow) (Alternation *)))

 "A <- 'a'"
 (Grammar (Definition (Identifier #\A)
          (LeftArrow) (Alternation (Sequence (Unit (Literal (LChar #\a)))))))
 )
-------------------------------------------------------------------------------



== Modular Grammars ==

It is sometimes desirable to have a grammar split across multiple files and to
have a final grammar built from those files. We can do this by using a
modular grammar.

Having our grammar split in this way provides us with the opportunity to
manipulate the definition of the non-terminals and, in the process, create new
languages. Depending on how we compose our final grammar, we can create vastly
different languages from the same base grammars and only need to change the one
modular grammar.

One of the biggest advantages of modular grammars is that they make it very
easy to embed one language within another. Many languages can be thought of in
this way. Prime examples are when a programming language is embedded within XML
or HTML. Or, going the other way, you could embed a data language like SQL
within a programming language.

There are also cases when a language's syntax changes subtly over time. We
want to have parsers for each version of the language but without duplicating
large parts of our grammars.
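
As a rough sketch of the versioning case, suppose we have two hypothetical
files: `lang-v1.waxeye`, holding the original grammar, and
`lang-v2-changes.waxeye`, holding only the definitions that changed. A modular
grammar for the newer version might then reuse everything from the old grammar
except a hypothetical 'Statement' non-terminal, and take the new definition of
'Statement' from the changes file. The file names and the non-terminal name are
assumptions made for illustration; the `all-except` and `only` forms are the
ones listed under Grammar Composition.

[source,scheme]
-------------------------------------------------------------------------------
;; Hypothetical modular grammar for "version 2" of a language.
;; Both file names and the Statement non-terminal are made up for illustration.

;; Take every definition from version 1 except Statement
(all-except "lang-v1.waxeye" Statement)

;; Take the replacement definition of Statement from the changes file
(only "lang-v2-changes.waxeye" Statement)
-------------------------------------------------------------------------------

With this layout, the version 1 grammar stays untouched and the differences
between versions live in one small file. Note that `only` would drop any new
helper non-terminals that the replacement 'Statement' refers to, so those would
need to be listed as well.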


=== Grammar Composition ===

A modular grammar is made up of expressions that pull together non-modular
grammars. Some modular expressions can have other expressions nested within
them. An expression is one of the following:

* '"grammar.waxeye"' +
  A path to a '.waxeye' file. This path should either be relative to the
  modular grammar or be an absolute path.

* '(rename modular-exp (old-name . new-name) ...)' +
  Renames the specified non-terminals with their new names.

* '(only modular-exp non-term ...)' +
  Includes only the listed non-terminals.

* '(all-except modular-exp non-term ...)' +
  Includes all non-terminals except those listed.

* '(prefix prefix modular-exp)' +
  Prefixes the names of non-terminals from 'modular-exp'.

* '(prefix-only prefix modular-exp non-term ...)' +
  Prefixes only the listed non-terminals.

* '(prefix-all-except prefix modular-exp non-term ...)' +
  Prefixes all non-terminals except those listed.

* '(join modular-exp ...)' +
  Combines the results of multiple modular expressions into a single
  expression. Not needed at the top-level.


.grammars/modular/mod.rkt
[source,scheme]
-------------------------------------------------------------------------------
;; A contrived example where we replace the definition of Number in Json with a
;; much simpler one that only supports integers.

(all-except "../json.waxeye" Number)

(rename (only "../num.waxeye" Num) (Num . Number))
-------------------------------------------------------------------------------



== Waxeye Options ==

-------------------------------------------------------------------------------
waxeye [