├── .activate
├── .github
│   └── workflows
│       └── go.yml
├── .gitignore
├── .gitmodules
├── CHANGELOG
├── LICENSE
├── README.md
├── dfa
│   ├── dfa_helpers.go
│   ├── dfa_helpers_test.go
│   ├── gen.go
│   └── gen_test.go
├── doc.go
├── examples
│   ├── sensors-parser
│   │   ├── .gitignore
│   │   ├── README.md
│   │   ├── ast.go
│   │   ├── main.go
│   │   ├── sensors.conf
│   │   ├── sensors.y
│   │   ├── sensors_golex.go
│   │   └── y.go
│   └── sensors
│       └── main.go
├── frontend
│   ├── ast.go
│   ├── ast_equality.go
│   ├── desugar.go
│   ├── desugar_test.go
│   ├── frontend_test.go
│   ├── gen.go
│   └── parser.go
├── go.mod
├── go.sum
├── grammar
├── inst
│   ├── inst.go
│   └── inst_test.go
├── lexc
│   └── main.go
├── lexer.go
├── lexer_test.go
├── machines
│   ├── dfa_machine.go
│   ├── machine.go
│   └── machine_test.go
└── queue
    └── queue.go

--------------------------------------------------------------------------------
/.activate:
--------------------------------------------------------------------------------
export GOPATH=$(readlink -e $(pwd)/../../../..)
export PATH=$GOPATH/bin:$PATH

--------------------------------------------------------------------------------
/.github/workflows/go.yml:
--------------------------------------------------------------------------------
name: Go
on: [push]
jobs:

  build:
    name: Build
    runs-on: ubuntu-latest
    steps:

    - name: Set up Go 1.13
      uses: actions/setup-go@v1
      with:
        go-version: 1.13
      id: go

    - name: Check out code into the Go module directory
      uses: actions/checkout@v1

    - name: Get dependencies
      run: |
        go get -v -t -d ./...

    - name: Test
      run: go test github.com/timtadh/lexmachine/...

--------------------------------------------------------------------------------
/.gitignore:
--------------------------------------------------------------------------------
*.swp
*.swo
bin
pkg
vendor

--------------------------------------------------------------------------------
/.gitmodules:
--------------------------------------------------------------------------------
[submodule "src/github.com/timtadh/getopt"]
	path = src/github.com/timtadh/getopt
	url = git@github.com:timtadh/getopt.git
[submodule "src/github.com/timtadh/data-structures"]
	path = src/github.com/timtadh/data-structures
	url = git@github.com:timtadh/data-structures.git

--------------------------------------------------------------------------------
/CHANGELOG:
--------------------------------------------------------------------------------
## 0.2.1

- Fixed regression bugs in new DFA backend

## 0.2.0

- Added DFA backend
- Improved documentation
- Fixed lint issues

## 0.1.1

- Improved regex parser
- Added documentation

## 0.1.0 initial release

--------------------------------------------------------------------------------
/LICENSE:
--------------------------------------------------------------------------------
lexmachine

This library is wholly written by the authors below. Please respect all
licensing terms.

Copyright (c) 2014-2017 All rights reserved. Portions owned by
* Tim Henderson
* Case Western Reserve University
* Google Inc.

Redistribution and use in source and binary forms, with or without
modification, are permitted provided that the following conditions are met:

* Redistributions of source code must retain the above copyright notice,
  this list of conditions and the following disclaimer.
* Redistributions in binary form must reproduce the above copyright notice,
  this list of conditions and the following disclaimer in the documentation
  and/or other materials provided with the distribution.
* Neither the name of the lexmachine nor the names of its contributors may
  be used to endorse or promote products derived from this software without
  specific prior written permission.

THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS" AND
ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED
WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE
DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT HOLDER OR CONTRIBUTORS BE LIABLE FOR
ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES
(INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES;
LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON
ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT
(INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS
SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.

--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
# `lexmachine` - Lexical Analysis Framework for Golang

By Tim Henderson

Copyright 2014-2017, All Rights Reserved. Made available for public use under
the terms of a BSD 3-Clause license.

[![GoDoc](https://godoc.org/github.com/timtadh/lexmachine?status.svg)](https://godoc.org/github.com/timtadh/lexmachine)
[![ReportCard](https://goreportcard.com/badge/github.com/timtadh/lexmachine)](https://goreportcard.com/report/github.com/timtadh/lexmachine)

## What?

`lexmachine` is a full lexical analysis framework for the Go programming
language. It supports a restricted but usable set of regular expressions
appropriate for writing lexers for complex programming languages. The framework
also supports sub-lexers and non-regular lexing through an "escape hatch" which
allows the user to consume any number of further bytes after a match. So if you
want to support nested C-style comments or other paired structures, you can do
so at the lexical analysis stage.

Subscribe to the [mailing
list](https://groups.google.com/forum/#!forum/lexmachine-users) to get
announcements of major changes, new versions, and important patches.

## Goal

`lexmachine` intends to be the best, fastest, and easiest to use lexical
analysis system for Go.

1. [Documentation Links](#documentation)
1. [Narrative Documentation](#narrative-documentation)
1. [Regular Expressions in `lexmachine`](#regular-expressions)
1. [History](#history)
1. [Complete Example](#complete-example)

## Documentation

- [Tutorial](http://hackthology.com/writing-a-lexer-in-go-with-lexmachine.html)
- [How It Works](http://hackthology.com/faster-tokenization-with-a-dfa-backend-for-lexmachine.html)
- [Narrative Documentation](#narrative-documentation)
- [Using `lexmachine` with `goyacc`](https://github.com/timtadh/lexmachine/tree/master/examples/sensors-parser)
  Required reading if you want to use `lexmachine` with the standard yacc
  implementation for Go (or its derivatives).
- [![GoDoc](https://godoc.org/github.com/timtadh/lexmachine?status.svg)](https://godoc.org/github.com/timtadh/lexmachine)

### What is in the Box

`lexmachine` includes the following components:

1. A parser for a restricted set of regular expressions.
2. An abstract syntax tree (AST) for regular expressions.
3. A backpatching code generator which compiles the AST to (NFA) machine code.
4. Both DFA (Deterministic Finite Automaton) and NFA (Non-deterministic Finite
   Automaton) simulation-based lexical analysis engines. Lexical analysis
   engines work in a slightly different way from a normal regular expression
   engine as they tokenize a stream rather than matching one string.
5. Match objects which include the start and end column and line numbers of the
   lexemes as well as their associated token name.
6. A declarative "DSL" for specifying the lexers.
7. An "escape hatch" which allows one to match non-regular tokens by consuming
   any number of further bytes after the match.

## Narrative Documentation

`lexmachine` splits strings into substrings and categorizes each substring. In
compiler design, the substrings are referred to as *lexemes* and the
categories are referred to as *token types* or just *tokens*. The categories are
defined by *patterns* which are specified using [regular
expressions](#regular-expressions). The process of splitting up a string is
sometimes called *tokenization*, *lexical analysis*, or *lexing*.

### Defining a Lexer

The set of patterns (regular expressions) used to *tokenize* (split up and
categorize) is called a *lexer*. Lexers are first-class objects in
`lexmachine`. They can be defined once and re-used over and over again to
tokenize multiple strings. After the lexer has been defined it will be compiled
(either explicitly or implicitly) into either a Non-deterministic Finite
Automaton (NFA) or Deterministic Finite Automaton (DFA). The automaton is then
used (and re-used) to tokenize strings.

#### Creating a new Lexer

```go
lexer := lexmachine.NewLexer()
```

#### Adding a pattern

Let's pretend we want a lexer which only recognizes one category: strings which
match the word "wild" capitalized or not (eg. Wild, wild, WILD, ...). That
expression is denoted: `[Ww][Ii][Ll][Dd]`. Patterns are added using the `Add`
function:

```go
lexer.Add([]byte(`[Ww][Ii][Ll][Dd]`), func(s *lexmachine.Scanner, m *machines.Match) (interface{}, error) {
    return 0, nil
})
```

Add takes two arguments: the pattern and a callback function called a *lexing
action*. The action allows you, the programmer, to transform the low-level
`machines.Match` object (from `github.com/timtadh/lexmachine/machines`) into an
object meaningful for your program.
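
For instance, a bare-bones action could simply report where the match occurred
and hand back the matched text. This is only a minimal sketch; the `Match`
fields it reads (`Bytes`, `StartLine`, `StartColumn`) are the same ones the
fuller examples below rely on:

```go
lexer.Add([]byte(`[Ww][Ii][Ll][Dd]`), func(s *lexmachine.Scanner, m *machines.Match) (interface{}, error) {
    // m carries the matched bytes and where they were found in the input
    fmt.Printf("matched %q at line %d, column %d\n", m.Bytes, m.StartLine, m.StartColumn)
    return string(m.Bytes), nil
})
```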
As an example, let's define a few token types and a token object; then we will
construct the appropriate action functions.

```go
Tokens := []string{
    "WILD",
    "SPACE",
    "BANG",
}
TokenIds := make(map[string]int)
for i, tok := range Tokens {
    TokenIds[tok] = i
}
```

Now that we have defined a set of three tokens (WILD, SPACE, BANG), let's create
a token object:

```go
type Token struct {
    TokenType int
    Lexeme    string
    Match     *machines.Match
}
```

Now let's make a helper function which takes a `Match` and a token type and
creates a Token:

```go
func NewToken(tokenType string, m *machines.Match) *Token {
    return &Token{
        TokenType: TokenIds[tokenType], // defined above
        Lexeme:    string(m.Bytes),
        Match:     m,
    }
}
```

Now we write an action for the previous pattern:

```go
lexer.Add([]byte(`[Ww][Ii][Ll][Dd]`), func(s *lexmachine.Scanner, m *machines.Match) (interface{}, error) {
    return NewToken("WILD", m), nil
})
```

Writing these action functions can get tedious; a good idea is to create a
helper function which produces them:

```go
func token(tokenType string) func(*lexmachine.Scanner, *machines.Match) (interface{}, error) {
    return func(s *lexmachine.Scanner, m *machines.Match) (interface{}, error) {
        return NewToken(tokenType, m), nil
    }
}
```

Then adding patterns for our three tokens is concise:

```go
lexer.Add([]byte(`[Ww][Ii][Ll][Dd]`), token("WILD"))
lexer.Add([]byte(` `), token("SPACE"))
lexer.Add([]byte(`!`), token("BANG"))
```

#### Built-in Token Type

Many programs use similar representations for tokens. `lexmachine` provides a
completely optional `Token` object you can use in lieu of writing your own.

```go
type Token struct {
    Type        int
    Value       interface{}
    Lexeme      []byte
    TC          int
    StartLine   int
    StartColumn int
    EndLine     int
    EndColumn   int
}
```

Here is an example of constructing a lexer `Action` which turns a
`machines.Match` struct into a token using the scanner's `Token` helper
function:

```go
func token(name string, tokenIds map[string]int) lex.Action {
    return func(s *lex.Scanner, m *machines.Match) (interface{}, error) {
        return s.Token(tokenIds[name], string(m.Bytes), m), nil
    }
}
```
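
Scanning is covered in detail below, but as a quick preview, here is a sketch
of reading the built-in `Token` fields back out of a scanner. It assumes the
`Tokens` slice and `TokenIds` map defined earlier and that the patterns were
registered with this `token` helper, so every action returns a
`*lexmachine.Token`:

```go
scanner, err := lexer.Scanner([]byte("wild WILD!"))
if err != nil {
    log.Fatal(err)
}
for tok, err, eos := scanner.Next(); !eos; tok, err, eos = scanner.Next() {
    if err != nil {
        log.Fatal(err)
    }
    token := tok.(*lexmachine.Token) // the actions above return *lexmachine.Token
    fmt.Printf("%-5s %q %d:%d-%d:%d\n",
        Tokens[token.Type], token.Lexeme,
        token.StartLine, token.StartColumn, token.EndLine, token.EndColumn)
}
```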

#### Adding Multiple Patterns

When constructing a lexer for a complex computer language, tokens often have
patterns which overlap -- multiple patterns could match the same strings. To
address this problem, lexical analysis engines follow two rules when choosing
which pattern to match:

1. Pick the pattern which matches the longest prefix of unmatched text.
2. Break ties by picking the pattern which appears earlier in the user supplied
   list.

For example, let's pretend we are writing a lexer for Python. Python has a bunch
of keywords in it such as `class` and `def`. However, it also has identifiers
which match the pattern `[A-Za-z_][A-Za-z0-9_]*`. That pattern also matches the
keywords, so if we were to define the lexer as:

```go
lexer.Add([]byte(`[A-Za-z_][A-Za-z0-9_]*`), token("ID"))
lexer.Add([]byte(`class`), token("CLASS"))
lexer.Add([]byte(`def`), token("DEF"))
```

then the keywords `class` and `def` would never be found because the ID pattern
would take precedence. The correct way to solve this problem is by putting the
keywords first:

```go
lexer.Add([]byte(`class`), token("CLASS"))
lexer.Add([]byte(`def`), token("DEF"))
lexer.Add([]byte(`[A-Za-z_][A-Za-z0-9_]*`), token("ID"))
```

#### Skipping Patterns

Sometimes it is advantageous to not emit tokens for certain patterns and to
instead skip them. Commonly this occurs for whitespace and comments. To skip a
pattern, simply have the action `return nil, nil`:

```go
lexer.Add(
    []byte("( |\t|\n)"),
    func(scan *Scanner, match *machines.Match) (interface{}, error) {
        // skip white space
        return nil, nil
    },
)
lexer.Add(
    []byte("//[^\n]*\n"),
    func(scan *Scanner, match *machines.Match) (interface{}, error) {
        // skip comments
        return nil, nil
    },
)
```

#### Compiling the Lexer

`lexmachine` uses the theory of [finite state
machines](http://hackthology.com/faster-tokenization-with-a-dfa-backend-for-lexmachine.html)
to efficiently tokenize text. So what is a finite state machine? A finite state
machine is a mathematical construct made up of a set of states, a labeled
starting state, and a set of accepting states, together with a transition
function which moves from one state to another based on an input character. In
lexing, two types of state machine are commonly used: non-deterministic and
deterministic.

Before a lexer (like the ones described above) can be used it must be compiled
into either a Non-deterministic Finite Automaton (NFA) or a [Deterministic
Finite Automaton
(DFA)](http://hackthology.com/faster-tokenization-with-a-dfa-backend-for-lexmachine.html).
The difference between the two (from a practical perspective) is *construction
time* and *match efficiency*.

Construction time is the amount of time it takes to turn a set of regular
expressions into a state machine (also called a finite state automaton). For an
NFA it is O(`r`), where `r` is the length of the regular expression. For a DFA
it could be as bad as O(`2^r`), but in practical terms it is rarely worse than
O(`r^3`). The DFAs in `lexmachine` are also automatically *minimized* (which
takes O(`r*log(log(r))`) steps) to reduce the amount of memory they consume.

However, construction time is an upfront cost. If your program is tokenizing
multiple strings it is less important than match efficiency. Let's say a string
has length `n`. An NFA can tokenize such a string in O(`n*r`) steps while a DFA
can tokenize the string in O(`n`). For larger languages `r` becomes a
significant overhead.
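
If the trade-off matters for your workload, it is straightforward to measure
both backends directly with the standard `time` package. The sketch below is
not part of the library: it assumes a hypothetical `newLexer` helper that
builds your pattern set from scratch each time it is called:

```go
// timeBackend reports how long compilation and one full scan of text take for
// a given backend. Pass (*lexmachine.Lexer).CompileNFA or
// (*lexmachine.Lexer).CompileDFA as the compile argument.
func timeBackend(compile func(*lexmachine.Lexer) error, text []byte) (time.Duration, time.Duration, error) {
    lexer := newLexer() // hypothetical helper: constructs a fresh lexer with your patterns
    start := time.Now()
    if err := compile(lexer); err != nil {
        return 0, 0, err
    }
    build := time.Since(start)

    start = time.Now()
    scanner, err := lexer.Scanner(text)
    if err != nil {
        return 0, 0, err
    }
    for _, err, eos := scanner.Next(); !eos; _, err, eos = scanner.Next() {
        if err != nil {
            return 0, 0, err
        }
    }
    return build, time.Since(start), nil
}
```

Calling `timeBackend((*lexmachine.Lexer).CompileNFA, text)` and
`timeBackend((*lexmachine.Lexer).CompileDFA, text)` on a representative input
gives a rough sense of when the DFA's extra construction cost pays for itself.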

By default, `lexmachine` uses a DFA. To explicitly invoke compilation, call
`Compile`:

```go
err := lexer.Compile()
if err != nil {
    // handle err
}
```

To explicitly compile a DFA (in case the default behavior of `Compile` ever
changes):

```go
err := lexer.CompileDFA()
if err != nil {
    // handle err
}
```

To explicitly compile an NFA:

```go
err := lexer.CompileNFA()
if err != nil {
    // handle err
}
```

### Tokenizing a String

To tokenize (lex) a string, construct a `Scanner` object using the lexer. This
will compile the lexer if it has not already been compiled.

```go
scanner, err := lexer.Scanner([]byte("some text to lex"))
if err != nil {
    // handle err
}
```

The scanner object is an iterator which yields the next token (or error) by
calling the `Next()` method:

```go
for tok, err, eos := scanner.Next(); !eos; tok, err, eos = scanner.Next() {
    if ui, is := err.(*machines.UnconsumedInput); is {
        // skip the error via:
        // scanner.TC = ui.FailTC
        //
        return err
    } else if err != nil {
        return err
    }
    fmt.Println(tok)
}
```

Let's break down that first line:

```go
for tok, err, eos := scanner.Next();
```

The `Next()` method returns three things: the token (`tok`) if there is one, an
error (`err`) if there is one, and `eos`, a boolean which indicates whether the
End Of String (EOS) has been reached.

```go
; !eos;
```

Iteration proceeds until the EOS has been reached.

```go
; tok, err, eos = scanner.Next() {
```

The update block calls `Next()` again to get the next token. In each iteration
of the loop the first thing a client **must** do is check for an error.

```go
if err != nil {
    return err
}
```

This prevents an infinite loop on an unexpected character or other bad token. To
skip bad tokens, check whether `err` is a `*machines.UnconsumedInput` object and
reset the scanner's text counter (`scanner.TC`) to point to the end of the
failed token.

```go
if ui, is := err.(*machines.UnconsumedInput); is {
    scanner.TC = ui.FailTC
    continue
}
```

Finally, a client can make use of the token produced by the scanner (if there
was no error):

```go
fmt.Println(tok)
```
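
Putting those pieces together, a tolerant helper that collects every token and
skips over input the lexer cannot match might look like the following. This is
one possible sketch, not the only way to structure the loop:

```go
func tokenize(lexer *lexmachine.Lexer, text []byte) ([]interface{}, error) {
    scanner, err := lexer.Scanner(text)
    if err != nil {
        return nil, err
    }
    var toks []interface{}
    for tok, err, eos := scanner.Next(); !eos; tok, err, eos = scanner.Next() {
        if ui, is := err.(*machines.UnconsumedInput); is {
            // resume scanning just past the text the machine could not match
            scanner.TC = ui.FailTC
            continue
        } else if err != nil {
            return nil, err
        }
        toks = append(toks, tok)
    }
    return toks, nil
}
```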

### Dealing with Non-regular Tokens

`lexmachine`, like most lexical analysis frameworks, primarily deals with
patterns which are represented by regular expressions. However, sometimes a
language has a token which is "non-regular." A pattern is non-regular if there
is no regular expression (or finite automaton) which can express the pattern.
For instance, if you wanted to define a pattern which matches only consecutive
balanced parentheses: `()`, `()()()`, `((()()))()()`, ... you would quickly find
there is no regular expression which can express this language. The reason is
simple: a finite automaton cannot "count," that is, keep track of how many
opening parentheses it has seen.

This problem arises in many programming languages when dealing with nested
"c-style" comments. Supporting the nesting means solving the "balanced
parenthesis" problem. Luckily, `lexmachine` provides an "escape hatch" to deal
with these situations in the `Action` functions. All actions receive a pointer
to the `Scanner`. The scanner (as discussed above) has a public modifiable field
called `TC`, which stands for text counter. Any action can *modify* the text
counter to point at the position from which it would like the scanner to resume
scanning.

An example of using this feature for tokenizing nested "c-style" comments is
below:

```go
lexer.Add(
    []byte("/\\*"),
    func(scan *Scanner, match *machines.Match) (interface{}, error) {
        for tc := scan.TC; tc < len(scan.Text); tc++ {
            if scan.Text[tc] == '\\' {
                // the next character is skipped
                tc++
            } else if scan.Text[tc] == '*' && tc+1 < len(scan.Text) {
                if scan.Text[tc+1] == '/' {
                    // set the text counter to point to after the
                    // end of the comment. This will cause the
                    // scanner to resume after the comment instead
                    // of picking up in the middle.
                    scan.TC = tc + 2
                    // don't return a token to skip the comment
                    return nil, nil
                }
            }
        }
        return nil,
            fmt.Errorf("unclosed comment starting at %d, (%d, %d)",
                match.TC, match.StartLine, match.StartColumn)
    },
)
```

## Regular Expressions

Lexmachine (like most lexical analysis frameworks) uses [Regular
Expressions](https://en.wikipedia.org/wiki/Regular_expression) to specify the
*patterns* to match when splitting the string up into categorized *tokens*.
For a more advanced introduction to regular expression engines see Russ Cox's
[articles](https://swtch.com/~rsc/regexp/). To learn more about how regular
expressions are used to *tokenize* strings, take a look at Alex Aiken's [video
lectures](https://youtu.be/SRhkfvqeA1M) on the subject. Finally, Aho *et al.*
give a thorough treatment of the subject in the [Dragon
Book](http://www.worldcat.org/oclc/951336275), Chapter 3.

A regular expression is a *pattern* which *matches* a set of strings. It is made
up of *characters* such as `a` or `b`, characters with special meanings (such as
`.`, which matches any character), and operators. The regular expression `abc`
matches exactly one string: `abc`.

### Character Expressions

In `lexmachine` most characters (eg. `a`, `b`, or `#`) represent themselves.
Some have special meanings (as detailed below under operators). However, any
character can be matched literally by prefixing it with a `\`.

#### Any Character

`.` matches any character.

#### Special Characters

1. `\` use `\\` to match
2. newline use `\n` to match
3. carriage return use `\r` to match
4. tab use `\t` to match
5. `.` use `\.` to match
6. operators: {`|`, `+`, `*`, `?`, `(`, `)`, `[`, `]`, `^`} prefix with a `\` to
   match (see the note after this list about Go string escaping).
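
Keep in mind that Go string literals add their own layer of escaping on top of
the regular expression's. Raw (backtick) strings pass backslashes through to the
pattern unchanged, while interpreted (double-quoted) strings require doubling
them, as the nested-comment example above does with `"/\\*"`. Both lines in this
small sketch add the same pattern, a literal `.`:

```go
lexer.Add([]byte(`\.`), token("DOT"))  // raw string: the regex escape is written once
lexer.Add([]byte("\\."), token("DOT")) // interpreted string: the backslash itself must be escaped
```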

#### Character Classes

Sometimes it is advantageous to match a variety of characters. For instance, if
you want to ignore capitalization for the word `Capitol` you could write the
expression `[Cc]apitol`, which would match both `Capitol` and `capitol`. There
are two forms of character classes:

1. `[abcd]` matches all the letters inside the `[]` (eg. that pattern matches
   the strings `a`, `b`, `c`, `d`).
2. `[a-d]` matches the range of characters between the character before the dash
   (`a`) and the character after the dash (`d`) (eg. that pattern matches
   the strings `a`, `b`, `c`, `d`).

These two forms may be combined. For instance, `[a-zA-Z123]` matches the strings
{`a`, `b`, ..., `z`, `A`, `B`, ... `Z`, `1`, `2`, `3`}.

#### Inverted Character Classes

Sometimes it is easier to specify the characters you don't want to match than
the characters you do. For instance, suppose you want to match any character
except a lowercase one. This can be achieved using an inverted class: `[^a-z]`.
An inverted class is specified by putting a `^` just after the opening bracket.

#### Built-in Character Classes

1. `\d` = `[0-9]` (the digit class)
2. `\D` = `[^0-9]` (the not-a-digit class)
3. `\s` = `[ \t\n\r\f]` (the space class), where `\f` is a form feed. (Note:
   `\f` is not a special sequence in lexmachine; to specify the form feed
   character (ASCII 0x0c) in a pattern use `[]byte{12}`.)
4. `\S` = `[^ \t\n\r\f]` (the not-a-space class)
5. `\w` = `[0-9a-zA-Z_]` (the word character class)
6. `\W` = `[^0-9a-zA-Z_]` (the not-a-word-character class)

### Operators

1. The pipe operator `|` indicates alternative choices. For instance the
   expression `a|b` matches either the string `a` or the string `b` but not `ab`
   or `ba` or the empty string.

2. The parenthesis operator `()` groups a subexpression together. For instance
   the expression `a(b|c)d` matches `abd` or `acd` but not `abcd`.

3. The star operator `*` indicates the "starred" subexpression should match zero
   or more times. For instance, `a*` matches the empty string, `a`, `aa`, `aaa`
   and so on.

4. The plus operator `+` indicates the "plussed" subexpression should match one
   or more times. For instance, `a+` matches `a`, `aa`, `aaa` and so on.

5. The maybe operator `?` indicates the "questioned" subexpression should match
   zero or one time. For instance, `a?` matches the empty string and `a`.

### Grammar

The canonical grammar is found in the handwritten recursive descent
[parser](https://github.com/timtadh/lexmachine/blob/master/frontend/parser.go).
This section should be considered documentation, not specification.

Note: e stands for the empty string

```
Regex -> Alternation

Alternation -> AtomicOps Alternation'

Alternation' -> `|` AtomicOps Alternation'
              | e

AtomicOps -> AtomicOp AtomicOps
           | e

AtomicOp -> Atomic
          | Atomic Ops

Ops -> Op Ops
     | e

Op -> `+`
    | `*`
    | `?`

Atomic -> Char
        | Group

Group -> `(` Alternation `)`

Char -> CHAR
      | CharClass

CharClass -> `[` Range `]`
           | `[` `^` Range `]`

Range -> CharClassItem Range'

Range' -> CharClassItem Range'
        | e

CharClassItem -> BYTE
               | BYTE `-` BYTE

CHAR -> matches any character except '|', '+', '*', '?', '(', ')', '[', ']', '^'
        unless escaped. Additionally '.' is returned as the wildcard character
        which matches any character. Built-in character classes are also handled
        here.

BYTE -> matches any byte
```
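
To make the grammar concrete, here is an informal sketch (not taken from the
parser itself) of how the expression `a(b|c)*` can be derived from these
productions; the notes on the right name the production applied at each step:

```
Regex
-> Alternation
-> AtomicOps Alternation'                  Alternation' -> e
-> AtomicOp AtomicOps                      the first AtomicOp will produce `a`
-> `a` AtomicOps                           AtomicOp -> Atomic -> Char -> CHAR `a`
-> `a` AtomicOp AtomicOps                  the trailing AtomicOps -> e
-> `a` Atomic Ops                          AtomicOp -> Atomic Ops
-> `a` Group `*`                           Ops -> Op Ops -> `*`, trailing Ops -> e
-> `a` `(` Alternation `)` `*`             Group -> `(` Alternation `)`
-> `a` `(` AtomicOps Alternation' `)` `*`
-> `a` `(` `b` Alternation' `)` `*`        AtomicOps -> AtomicOp -> ... -> CHAR `b`
-> `a` `(` `b` `|` `c` `)` `*`             Alternation' -> `|` AtomicOps Alternation', then e
```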

## History

This library was started when I was teaching EECS 337 *Compiler Design and
Implementation* at Case Western Reserve University in the Fall of 2014. I wrote
two compilers for the course: one was "hidden" from the students, as the
language it implemented was their project language. The other was
[tcel](https://github.com/timtadh/tcel), which was written initially as an
example of how to do type checking. That compiler was later expanded to explain
AST interpretation, intermediate code generation, and x86 code generation.

## Complete Example

### Using the Lexer

```go
package main

import (
    "fmt"
    "log"
)

import (
    "github.com/timtadh/lexmachine"
    "github.com/timtadh/lexmachine/machines"
)

func main() {
    s, err := Lexer.Scanner([]byte(`digraph {
  rankdir=LR;
  a [label="a" shape=box];
  c [