├── .gitignore ├── LICENSE ├── README.md ├── comparison.md ├── elm.json ├── examples ├── DoubleQuoteString.elm ├── Math.elm ├── README.md └── elm.json ├── semantics.md └── src ├── Elm └── Kernel │ └── Parser.js ├── Parser.elm └── Parser └── Advanced.elm /.gitignore: -------------------------------------------------------------------------------- 1 | elm-stuff -------------------------------------------------------------------------------- /LICENSE: -------------------------------------------------------------------------------- 1 | Copyright (c) 2017-present, Evan Czaplicki 2 | All rights reserved. 3 | 4 | Redistribution and use in source and binary forms, with or without 5 | modification, are permitted provided that the following conditions are met: 6 | 7 | * Redistributions of source code must retain the above copyright notice, this 8 | list of conditions and the following disclaimer. 9 | 10 | * Redistributions in binary form must reproduce the above copyright notice, 11 | this list of conditions and the following disclaimer in the documentation 12 | and/or other materials provided with the distribution. 13 | 14 | * Neither the name of the {organization} nor the names of its 15 | contributors may be used to endorse or promote products derived from 16 | this software without specific prior written permission. 17 | 18 | THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS" 19 | AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE 20 | IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE 21 | DISCLAIMED. 
IN NO EVENT SHALL THE COPYRIGHT HOLDER OR CONTRIBUTORS BE LIABLE 22 | FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL 23 | DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR 24 | SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER 25 | CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, 26 | OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE 27 | OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. 28 | -------------------------------------------------------------------------------- /README.md: -------------------------------------------------------------------------------- 1 | # Parser 2 | 3 | Regular expressions are quite confusing and difficult to use. This library provides a coherent alternative that handles more cases and produces clearer code. 4 | 5 | The particular goals of this library are: 6 | 7 | - Make writing parsers as simple and fun as possible. 8 | - Produce excellent error messages. 9 | - Go pretty fast. 10 | 11 | This is achieved with a couple concepts that I have not seen in any other parser libraries: [parser pipelines](#parser-pipelines), [backtracking](#backtracking), and [tracking context](#tracking-context). 12 | 13 | 14 | ## Parser Pipelines 15 | 16 | To parse a 2D point like `( 3, 4 )`, you might create a `point` parser like this: 17 | 18 | ```elm 19 | import Parser exposing (Parser, (|.), (|=), succeed, symbol, float, spaces) 20 | 21 | type alias Point = 22 | { x : Float 23 | , y : Float 24 | } 25 | 26 | point : Parser Point 27 | point = 28 | succeed Point 29 | |. symbol "(" 30 | |. spaces 31 | |= float 32 | |. spaces 33 | |. symbol "," 34 | |. spaces 35 | |= float 36 | |. spaces 37 | |. symbol ")" 38 | ``` 39 | 40 | All the interesting stuff is happening in `point`. 
It uses two operators: 41 | 42 | - [`(|.)`][ignore] means “parse this, but **ignore** the result” 43 | - [`(|=)`][keep] means “parse this, and **keep** the result” 44 | 45 | So the `Point` function only gets the result of the two `float` parsers. 46 | 47 | [ignore]: https://package.elm-lang.org/packages/elm/parser/latest/Parser#|. 48 | [keep]: https://package.elm-lang.org/packages/elm/parser/latest/Parser#|= 49 | 50 | The theory is that `|=` introduces more “visual noise” than `|.`, making it pretty easy to pick out which lines in the pipeline are important. 51 | 52 | I recommend having one line per operator in your parser pipeline. If you need multiple lines for some reason, use a `let` or make a helper function. 53 | 54 | 55 | 56 | ## Backtracking 57 | 58 | To make fast parsers with precise error messages, all of the parsers in this package do not backtrack by default. Once you start going down a path, you keep going down it. 59 | 60 | This is nice in a string like `[ 1, 23zm5, 3 ]` where you want the error at the `z`. If we had backtracking by default, you might get the error on `[` instead. That is way less specific and harder to fix! 61 | 62 | So the defaults are nice, but sometimes the easiest way to write a parser is to look ahead a bit and see what is going to happen. It is definitely more costly to do this, but it can be handy if there is no other way. This is the role of [`backtrackable`](https://package.elm-lang.org/packages/elm/parser/latest/Parser#backtrackable) parsers. Check out the [semantics](https://github.com/elm/parser/blob/master/semantics.md) page for more details! 63 | 64 | 65 | ## Tracking Context 66 | 67 | Most parsers tell you the row and column of the problem: 68 | 69 | Something went wrong at (4:17) 70 | 71 | That may be true, but it is not how humans think. It is how text editors think! It would be better to say: 72 | 73 | I found a problem with this list: 74 | 75 | [ 1, 23zm5, 3 ] 76 | ^ 77 | I wanted an integer, like 6 or 90219. 
78 | 79 | Notice that the error message says `this list`. That is context! That is the language my brain speaks, not rows and columns. 80 | 81 | Once you get comfortable with the `Parser` module, you can switch over to `Parser.Advanced` and use [`inContext`](https://package.elm-lang.org/packages/elm/parser/latest/Parser-Advanced#inContext) to track exactly what your parser thinks it is doing at the moment. You can let the parser know “I am trying to parse a `"list"` right now” so if an error happens anywhere in that context, you get the handy annotation! 82 | 83 | This technique is used by the parser in the Elm compiler to give more helpful error messages. 84 | 85 | 86 | ## [Comparison with Prior Work](https://github.com/elm/parser/blob/master/comparison.md) 87 | -------------------------------------------------------------------------------- /comparison.md: -------------------------------------------------------------------------------- 1 | ## Comparison with Prior Work 2 | 3 | I have not seen the [parser pipeline][1] or the [context stack][2] ideas in other libraries, but [backtracking][3] relates to prior work. 4 | 5 | [1]: README.md#parser-pipelines 6 | [2]: README.md#tracking-context 7 | [3]: README.md#backtracking 8 | 9 | Most parser combinator libraries I have seen are based on Haskell’s Parsec library, which has primitives named `try` and `lookAhead`. I believe [`backtrackable`][backtrackable] is a better primitive for two reasons. 10 | 11 | [backtrackable]: https://package.elm-lang.org/packages/elm/parser/latest/Parser#backtrackable 12 | 13 | 14 | ### Performance and Composition 15 | 16 | Say we want to create a precise error message for `length [1,,3]`. 
The naive approach with Haskell’s Parsec library produces very bad error messages: 17 | 18 | ```haskell 19 | spaceThenArg :: Parser Expr 20 | spaceThenArg = 21 | try (spaces >> term) 22 | ``` 23 | 24 | This means we get a precise error from `term`, but then throw it away and say something went wrong at the space before the `[`. Very confusing! To improve quality, we must write something like this: 25 | 26 | ```haskell 27 | spaceThenArg :: Parser Expr 28 | spaceThenArg = 29 | choice 30 | [ do lookAhead (spaces >> char '[') 31 | spaces 32 | term 33 | , try (spaces >> term) 34 | ] 35 | ``` 36 | 37 | Notice that we parse `spaces` twice no matter what. 38 | 39 | Notice that we also had to hardcode `[` in the `lookAhead`. What if we update `term` to parse records that start with `{` as well? To get good commits on records, we must remember to update `lookAhead` to look for `oneOf "[{"`. Implementation details are leaking out of `term`! 40 | 41 | With `backtrackable` in this Elm library, you can just say: 42 | 43 | ```elm 44 | spaceThenArg : Parser Expr 45 | spaceThenArg = 46 | succeed identity 47 | |. backtrackable spaces 48 | |= term 49 | ``` 50 | 51 | It does less work, and is more reliable as `term` evolves. I believe the presence of `backtrackable` means that `lookAhead` is no longer needed. 52 | 53 | 54 | ### Expressiveness 55 | 56 | You can define `try` in terms of [`backtrackable`][backtrackable] like this: 57 | 58 | ```elm 59 | try : Parser a -> Parser a 60 | try parser = 61 | succeed identity 62 | |= backtrackable parser 63 | |. commit () 64 | ``` 65 | 66 | No expressiveness is lost! 67 | 68 | So while it is possible to define `try`, I left it out of the public API. In practice, `try` often leads to “bad commits” where your parser fails in a very specific way, but you then backtrack to a less specific error message. 
I considered naming it `allOrNothing` to better explain how it changes commit behavior, but ultimately, I thought it was best to encourage users to express their parsers with `backtrackable` directly. 69 | 70 | 71 | ### Summary 72 | 73 | Compared to previous work, `backtrackable` lets you produce precise error messages **more efficiently**. By thinking about “backtracking behavior” directly, you also end up with **cleaner composition** of parsers. And these benefits come **without any loss of expressiveness**. 74 | -------------------------------------------------------------------------------- /elm.json: -------------------------------------------------------------------------------- 1 | { 2 | "type": "package", 3 | "name": "elm/parser", 4 | "summary": "a parsing library, focused on simplicity and great error messages", 5 | "license": "BSD-3-Clause", 6 | "version": "1.1.0", 7 | "exposed-modules": [ 8 | "Parser", 9 | "Parser.Advanced" 10 | ], 11 | "elm-version": "0.19.0 <= v < 0.20.0", 12 | "dependencies": { 13 | "elm/core": "1.0.0 <= v < 2.0.0" 14 | }, 15 | "test-dependencies": {} 16 | } -------------------------------------------------------------------------------- /examples/DoubleQuoteString.elm: -------------------------------------------------------------------------------- 1 | import Browser 2 | import Char 3 | import Html 4 | import Parser exposing (..) 5 | 6 | 7 | 8 | -- MAIN 9 | 10 | 11 | main = 12 | Html.text <| Debug.toString <| 13 | run string "\"hello\"" 14 | 15 | 16 | 17 | -- STRINGS 18 | 19 | 20 | string : Parser String 21 | string = 22 | succeed identity 23 | |. token "\"" 24 | |= loop [] stringHelp 25 | 26 | 27 | stringHelp : List String -> Parser (Step (List String) String) 28 | stringHelp revChunks = 29 | oneOf 30 | [ succeed (\chunk -> Loop (chunk :: revChunks)) 31 | |. token "\\" 32 | |= oneOf 33 | [ map (\_ -> "\n") (token "n") 34 | , map (\_ -> "\t") (token "t") 35 | , map (\_ -> "\r") (token "r") 36 | , succeed String.fromChar 37 | |. 
token "u{" 38 | |= unicode 39 | |. token "}" 40 | ] 41 | , token "\"" 42 | |> map (\_ -> Done (String.join "" (List.reverse revChunks))) 43 | , chompWhile isUninteresting 44 | |> getChompedString 45 | |> map (\chunk -> Loop (chunk :: revChunks)) 46 | ] 47 | 48 | 49 | isUninteresting : Char -> Bool 50 | isUninteresting char = 51 | char /= '\\' && char /= '"' 52 | 53 | 54 | 55 | -- UNICODE 56 | 57 | 58 | unicode : Parser Char 59 | unicode = 60 | getChompedString (chompWhile Char.isHexDigit) 61 | |> andThen codeToChar 62 | 63 | 64 | codeToChar : String -> Parser Char 65 | codeToChar str = 66 | let 67 | length = String.length str 68 | code = String.foldl addHex 0 str 69 | in 70 | if length < 4 || 6 < length then 71 | problem "code point must have between 4 and 6 digits" 72 | else if 0 <= code && code <= 0x10FFFF then 73 | succeed (Char.fromCode code) 74 | else 75 | problem "code point must be between 0 and 0x10FFFF" 76 | 77 | 78 | addHex : Char -> Int -> Int 79 | addHex char total = 80 | let 81 | code = Char.toCode char 82 | in 83 | if 0x30 <= code && code <= 0x39 then 84 | 16 * total + (code - 0x30) 85 | else if 0x41 <= code && code <= 0x46 then 86 | 16 * total + (10 + code - 0x41) 87 | else 88 | 16 * total + (10 + code - 0x61) 89 | -------------------------------------------------------------------------------- /examples/Math.elm: -------------------------------------------------------------------------------- 1 | module Math exposing 2 | ( Expr 3 | , evaluate 4 | , parse 5 | ) 6 | 7 | 8 | import Html exposing (div, p, text) 9 | import Parser exposing (..) 
10 | 11 | 12 | 13 | -- MAIN 14 | 15 | 16 | main = 17 | case parse "2 * (3 + 4)" of 18 | Err err -> 19 | text (Debug.toString err) 20 | 21 | Ok expr -> 22 | div [] 23 | [ p [] [ text (Debug.toString expr) ] 24 | , p [] [ text (String.fromFloat (evaluate expr)) ] 25 | ] 26 | 27 | 28 | 29 | -- EXPRESSIONS 30 | 31 | 32 | type Expr 33 | = Integer Int 34 | | Floating Float 35 | | Add Expr Expr 36 | | Mul Expr Expr 37 | 38 | 39 | evaluate : Expr -> Float 40 | evaluate expr = 41 | case expr of 42 | Integer n -> 43 | toFloat n 44 | 45 | Floating n -> 46 | n 47 | 48 | Add a b -> 49 | evaluate a + evaluate b 50 | 51 | Mul a b -> 52 | evaluate a * evaluate b 53 | 54 | 55 | parse : String -> Result (List DeadEnd) Expr 56 | parse string = 57 | run expression string 58 | 59 | 60 | 61 | -- PARSER 62 | 63 | 64 | {-| We want to handle integers, hexadecimal numbers, and floats. Octal numbers 65 | like `0o17` and binary numbers like `0b01101100` are not allowed. 66 | -} 67 | digits : Parser Expr 68 | digits = 69 | number 70 | { int = Just Integer 71 | , hex = Just Integer 72 | , octal = Nothing 73 | , binary = Nothing 74 | , float = Just Floating 75 | } 76 | 77 | 78 | {-| A term is a standalone chunk of math, like `4` or `(3 + 4)`. We use it as 79 | a building block in larger expressions. 80 | -} 81 | term : Parser Expr 82 | term = 83 | oneOf 84 | [ digits 85 | , succeed identity 86 | |. symbol "(" 87 | |. spaces 88 | |= lazy (\_ -> expression) 89 | |. spaces 90 | |. symbol ")" 91 | ] 92 | 93 | 94 | {-| Every expression starts with a term. After that, it may be done, or there 95 | may be a `+` or `*` sign and more math. 96 | -} 97 | expression : Parser Expr 98 | expression = 99 | term 100 | |> andThen (expressionHelp []) 101 | 102 | 103 | {-| Once you have parsed a term, you can start looking for `+` and `*` operators. 104 | I am tracking everything as a list, that way I can be sure to follow the order 105 | of operations (PEMDAS) when building the final expression. 
106 | 107 | In one case, I need an operator and another term. If that happens I keep 108 | looking for more. In the other case, I am done parsing, and I finalize the 109 | expression. 110 | -} 111 | expressionHelp : List (Expr, Operator) -> Expr -> Parser Expr 112 | expressionHelp revOps expr = 113 | oneOf 114 | [ succeed Tuple.pair 115 | |. spaces 116 | |= operator 117 | |. spaces 118 | |= term 119 | |> andThen (\(op, newExpr) -> expressionHelp ((expr,op) :: revOps) newExpr) 120 | , lazy (\_ -> succeed (finalize revOps expr)) 121 | ] 122 | 123 | 124 | type Operator = AddOp | MulOp 125 | 126 | 127 | operator : Parser Operator 128 | operator = 129 | oneOf 130 | [ map (\_ -> AddOp) (symbol "+") 131 | , map (\_ -> MulOp) (symbol "*") 132 | ] 133 | 134 | 135 | {-| We only have `+` and `*` in this parser. If we see a `MulOp` we can 136 | immediately group those two expressions. If we see an `AddOp` we wait to group 137 | until all the multiplies have been taken care of. 138 | 139 | This code is kind of tricky, but it is a baseline for what you would need if 140 | you wanted to add `/`, `-`, `==`, `&&`, etc. which bring in more complex 141 | associativity and precedence rules. 
142 | -} 143 | finalize : List (Expr, Operator) -> Expr -> Expr 144 | finalize revOps finalExpr = 145 | case revOps of 146 | [] -> 147 | finalExpr 148 | 149 | (expr, MulOp) :: otherRevOps -> 150 | finalize otherRevOps (Mul expr finalExpr) 151 | 152 | (expr, AddOp) :: otherRevOps -> 153 | Add (finalize otherRevOps expr) finalExpr 154 | -------------------------------------------------------------------------------- /examples/README.md: -------------------------------------------------------------------------------- 1 | # Run the Examples 2 | 3 | To try these examples out locally, you can run the following terminal commands: 4 | 5 | ```bash 6 | git clone https://github.com/elm/parser.git 7 | cd parser/examples 8 | elm reactor 9 | ``` 10 | 11 | After that, go to [`http://localhost:8000`](http://localhost:8000) and click on 12 | the example you want to see. 13 | 14 | 15 | ## Exercises 16 | 17 | - Have a user input feed into the `Math` parser. Show people the results live. 18 | - Expand the `Math` parser to cover `-` and `/` as well. 19 | - Handle more escape characters in `DoubleQuoteString`. Maybe hexadecimal 20 | escapes like `\x41` and `\x0A` that are possible in JavaScript. -------------------------------------------------------------------------------- /examples/elm.json: -------------------------------------------------------------------------------- 1 | { 2 | "type": "application", 3 | "source-directories": [ 4 | "." 
5 | ], 6 | "elm-version": "0.19.0", 7 | "dependencies": { 8 | "direct": { 9 | "elm/browser": "1.0.0", 10 | "elm/core": "1.0.0", 11 | "elm/html": "1.0.0", 12 | "elm/parser": "1.1.0" 13 | }, 14 | "indirect": { 15 | "elm/json": "1.0.0", 16 | "elm/time": "1.0.0", 17 | "elm/url": "1.0.0", 18 | "elm/virtual-dom": "1.0.0" 19 | } 20 | }, 21 | "test-dependencies": { 22 | "direct": {}, 23 | "indirect": {} 24 | } 25 | } -------------------------------------------------------------------------------- /semantics.md: -------------------------------------------------------------------------------- 1 | # Semantics 2 | 3 | The goal of this document is to explain how different parsers fit together. When will it backtrack? When will it not? 4 | 5 |
### `keyword : String -> Parser ()`

Say we have `keyword "import"`:

| String     | Result      |
|------------|-------------|
| `"import"` | `OK{false}` |
| `"imp"`    | `ERR{true}` |
| `"export"` | `ERR{true}` |

In our `OK{false}` notation, we are indicating:

1. Did the parser succeed? `OK` if yes. `ERR` if not.
2. Is it possible to backtrack? `true` if yes. `false` if not. So when `keyword` succeeds, backtracking is not allowed anymore. You must continue along that path.

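
The table above can be illustrated with a small sketch that puts `keyword` inside `oneOf`. The `Decl` type and the `decl` parser are hypothetical names made up for this example; they are not part of the package.

```elm
import Parser exposing (Parser, keyword, map, oneOf, run)


-- Hypothetical type, introduced only for this illustration.
type Decl
    = Import
    | Export


decl : Parser Decl
decl =
    oneOf
        [ map (\_ -> Import) (keyword "import")
        , map (\_ -> Export) (keyword "export")
        ]


-- run decl "import" == Ok Import
-- run decl "export" == Ok Export
```

On `"export"`, the first branch fails with `ERR{true}`, so `oneOf` is still allowed to try the second branch.
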
### `map : (a -> b) -> Parser a -> Parser b`

Say we have `map func parser`:

| `parser` | Result   |
|----------|----------|
| `OK{b}`  | `OK{b}`  |
| `ERR{b}` | `ERR{b}` |

So the result of `map func parser` is always the same as the result of the `parser` itself.

### `map2 : (a -> b -> c) -> Parser a -> Parser b -> Parser c`

Say we have `map2 func parserA parserB`:

| `parserA` | `parserB` | Result         |
|-----------|-----------|----------------|
| `OK{b}`   | `OK{b'}`  | `OK{b && b'}`  |
| `OK{b}`   | `ERR{b'}` | `ERR{b && b'}` |
| `ERR{b}`  |           | `ERR{b}`       |

If `parserA` succeeds, we try `parserB`. If they are both backtrackable, the combined result is backtrackable.

If `parserA` fails, that is our result.

This is used to define our pipeline operators like this:

```elm
(|.) a b = map2 (\keep ignore -> keep) a b
(|=) a b = map2 (\func arg -> func arg) a b
```

### `either : Parser a -> Parser a -> Parser a`

Say we have `either parserA parserB`:

| `parserA`    | `parserB` | Result       |
|--------------|-----------|--------------|
| `OK{b}`      |           | `OK{b}`      |
| `ERR{true}`  | `OK{b}`   | `OK{b}`      |
| `ERR{true}`  | `ERR{b}`  | `ERR{b}`     |
| `ERR{false}` |           | `ERR{false}` |

The 4th case is very important! **If `parserA` is not backtrackable, you do not even try `parserB`.**

The `either` function does not appear in the public API, but I used it here because it makes the rules a bit easier to read. In the public API, we have `oneOf` instead. You can think of `oneOf` as trying `either` the head of the list, or `oneOf` the parsers in the tail of the list.

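
If you want something shaped like `either` in your own code, one plausible way to sketch it with the public API is a two-element `oneOf`. This is only an illustration of the relationship described above, not necessarily how the package is implemented internally.

```elm
import Parser exposing (Parser, oneOf)


-- Sketch only: `either` is not exported by the package, so this is a
-- local helper built on top of `oneOf`.
either : Parser a -> Parser a -> Parser a
either parserA parserB =
    oneOf [ parserA, parserB ]
```
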
### `andThen : (a -> Parser b) -> Parser a -> Parser b`

Say we have `andThen callback parserA` where `callback a` produces `parserB`:

| `parserA` | `parserB` | Result         |
|-----------|-----------|----------------|
| `ERR{b}`  |           | `ERR{b}`       |
| `OK{b}`   | `OK{b'}`  | `OK{b && b'}`  |
| `OK{b}`   | `ERR{b'}` | `ERR{b && b'}` |

If both parts are backtrackable, the overall result is backtrackable.

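
As a rough sketch of the `andThen` rules, consider validating a number after it has been parsed. The `percent` parser and its error message are hypothetical, chosen only for this example.

```elm
import Parser exposing (Parser, andThen, int, problem, run, succeed)


-- Hypothetical parser: an integer that must lie between 0 and 100.
percent : Parser Int
percent =
    int
        |> andThen
            (\n ->
                if 0 <= n && n <= 100 then
                    succeed n

                else
                    problem "expecting a percentage between 0 and 100"
            )


-- run percent "42" == Ok 42
```

On input like `"250"`, `int` succeeds with `OK{false}` and `problem` fails with `ERR{true}`, so by the table the overall result is `ERR{false}`. A surrounding `oneOf` would therefore not try its remaining options.
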
### `backtrackable : Parser a -> Parser a`

Say we have `backtrackable parser`:

| `parser` | Result      |
|----------|-------------|
| `OK{b}`  | `OK{true}`  |
| `ERR{b}` | `ERR{true}` |

No matter how `parser` was defined, it is backtrackable now. This becomes very interesting when paired with `oneOf`. You can have one of the options start with a `backtrackable` segment, so even if you do start down that path, you can still try the next parser if something fails. **This has important yet subtle implications for performance, so definitely read on!**

## Examples

This parser is intended to give you very precise control over backtracking behavior, and I think that is best explained through examples.

### `backtrackable`

Say we have `map2 func (backtrackable spaces) (symbol ",")` which can eat a bunch of spaces followed by a comma. Here is how it would work on different strings:

| String  | Result      |
|---------|-------------|
| `" ,"`  | `OK{false}` |
| `" :"`  | `ERR{true}` |
| `"abc"` | `ERR{true}` |

Remember that `map2` is backtrackable only if both parsers are backtrackable. So in the first case, the overall result is not backtrackable because `symbol ","` succeeded.

This becomes useful when paired with `either`!

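
Since `map2` is not exposed by the `Parser` module, the same idea can be sketched with the pipeline operators, which this document describes as being defined in terms of `map2`. The name `spacesThenComma` is hypothetical.

```elm
import Parser exposing ((|.), Parser, backtrackable, run, spaces, succeed, symbol)


-- Hypothetical name; a pipeline version of
-- `map2 func (backtrackable spaces) (symbol ",")`.
spacesThenComma : Parser ()
spacesThenComma =
    succeed ()
        |. backtrackable spaces
        |. symbol ","


-- run spacesThenComma " ," == Ok ()
```

Here `succeed ()` is itself backtrackable because it chomps nothing, so the backtracking behavior of the whole pipeline is decided by the two parsers that follow it.
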
### `backtrackable` + `oneOf` (inefficient)

Say we have the following `parser` definition:

```elm
parser : Parser (Maybe Int)
parser =
  oneOf
    [ succeed Just
        |. backtrackable spaces
        |. symbol ","
        |. spaces
        |= int
    , succeed Nothing
        |. spaces
        |. symbol "]"
    ]
```

Here is how it would work on different strings:

| String   | Result       |
|----------|--------------|
| `" , 4"` | `OK{false}`  |
| `" ,"`   | `ERR{false}` |
| `" , a"` | `ERR{false}` |
| `" ]"`   | `OK{false}`  |
| `" a"`   | `ERR{false}` |
| `"abc"`  | `ERR{true}`  |

Some of these cases are tricky, so let's look at them in more depth:

- `" , a"`: `backtrackable spaces`, `symbol ","`, and `spaces` all succeed. At that point we have `OK{false}`. The `int` parser then fails on `a`, so we finish with `ERR{false}`. That means `oneOf` will NOT try the second possibility.
- `" ]"`: `backtrackable spaces` succeeds, but `symbol ","` fails. At that point we have `ERR{true}`, so `oneOf` tries the second possibility. After backtracking, `spaces` and `symbol "]"` succeed with `OK{false}`.
- `" a"`: `backtrackable spaces` succeeds, but `symbol ","` fails. At that point we have `ERR{true}`, so `oneOf` tries the second possibility. After backtracking, `spaces` succeeds with `OK{false}` and `symbol "]"` fails, resulting in `ERR{false}`.

170 | 171 | 172 | ### `oneOf` (efficient) 173 | 174 | Notice that in the previous example, we parsed `spaces` twice in some cases. This is inefficient, especially in large files with lots of whitespace. Backtracking is very inefficient in general though, so **if you are interested in performance, it is worthwhile to try to eliminate as many uses of `backtrackable` as possible.** 175 | 176 | So we can rewrite that last example to never backtrack: 177 | 178 | ```elm 179 | parser : Parser (Maybe Int) 180 | parser = 181 | succeed identity 182 | |. spaces 183 | |= oneOf 184 | [ succeed Just 185 | |. symbol "," 186 | |. spaces 187 | |= int 188 | , succeed Nothing 189 | |. symbol "]" 190 | ] 191 | ``` 192 | 193 | Now we are guaranteed to consume the spaces only one time. After that, we decide if we are looking at a `,` or `]`, so we never backtrack and reparse things. 194 | 195 | If you are strategic in shuffling parsers around, you can write parsers that do not need `backtrackable` at all. The resulting parsers are quite fast. They are essentially the same as [LR(k)](https://en.wikipedia.org/wiki/Canonical_LR_parser) parsers, but more pleasant to write. I did this in the Elm compiler for parsing Elm code, and it was very significantly faster. 196 | -------------------------------------------------------------------------------- /src/Elm/Kernel/Parser.js: -------------------------------------------------------------------------------- 1 | /* 2 | 3 | import Elm.Kernel.Utils exposing (chr, Tuple2, Tuple3) 4 | 5 | */ 6 | 7 | 8 | 9 | // STRINGS 10 | 11 | 12 | var _Parser_isSubString = F5(function(smallString, offset, row, col, bigString) 13 | { 14 | var smallLength = smallString.length; 15 | var isGood = offset + smallLength <= bigString.length; 16 | 17 | for (var i = 0; isGood && i < smallLength; ) 18 | { 19 | var code = bigString.charCodeAt(offset); 20 | isGood = 21 | smallString[i++] === bigString[offset++] 22 | && ( 23 | code === 0x000A /* \n */ 24 | ? 
( row++, col=1 ) 25 | : ( col++, (code & 0xF800) === 0xD800 ? smallString[i++] === bigString[offset++] : 1 ) 26 | ) 27 | } 28 | 29 | return __Utils_Tuple3(isGood ? offset : -1, row, col); 30 | }); 31 | 32 | 33 | 34 | // CHARS 35 | 36 | 37 | var _Parser_isSubChar = F3(function(predicate, offset, string) 38 | { 39 | return ( 40 | string.length <= offset 41 | ? -1 42 | : 43 | (string.charCodeAt(offset) & 0xF800) === 0xD800 44 | ? (predicate(__Utils_chr(string.substr(offset, 2))) ? offset + 2 : -1) 45 | : 46 | (predicate(__Utils_chr(string[offset])) 47 | ? ((string[offset] === '\n') ? -2 : (offset + 1)) 48 | : -1 49 | ) 50 | ); 51 | }); 52 | 53 | 54 | var _Parser_isAsciiCode = F3(function(code, offset, string) 55 | { 56 | return string.charCodeAt(offset) === code; 57 | }); 58 | 59 | 60 | 61 | // NUMBERS 62 | 63 | 64 | var _Parser_chompBase10 = F2(function(offset, string) 65 | { 66 | for (; offset < string.length; offset++) 67 | { 68 | var code = string.charCodeAt(offset); 69 | if (code < 0x30 || 0x39 < code) 70 | { 71 | return offset; 72 | } 73 | } 74 | return offset; 75 | }); 76 | 77 | 78 | var _Parser_consumeBase = F3(function(base, offset, string) 79 | { 80 | for (var total = 0; offset < string.length; offset++) 81 | { 82 | var digit = string.charCodeAt(offset) - 0x30; 83 | if (digit < 0 || base <= digit) break; 84 | total = base * total + digit; 85 | } 86 | return __Utils_Tuple2(offset, total); 87 | }); 88 | 89 | 90 | var _Parser_consumeBase16 = F2(function(offset, string) 91 | { 92 | for (var total = 0; offset < string.length; offset++) 93 | { 94 | var code = string.charCodeAt(offset); 95 | if (0x30 <= code && code <= 0x39) 96 | { 97 | total = 16 * total + code - 0x30; 98 | } 99 | else if (0x41 <= code && code <= 0x46) 100 | { 101 | total = 16 * total + code - 55; 102 | } 103 | else if (0x61 <= code && code <= 0x66) 104 | { 105 | total = 16 * total + code - 87; 106 | } 107 | else 108 | { 109 | break; 110 | } 111 | } 112 | return __Utils_Tuple2(offset, total); 113 
| }); 114 | 115 | 116 | 117 | // FIND STRING 118 | 119 | 120 | var _Parser_findSubString = F5(function(smallString, offset, row, col, bigString) 121 | { 122 | var newOffset = bigString.indexOf(smallString, offset); 123 | var target = newOffset < 0 ? bigString.length : newOffset + smallString.length; 124 | 125 | while (offset < target) 126 | { 127 | var code = bigString.charCodeAt(offset++); 128 | code === 0x000A /* \n */ 129 | ? ( col=1, row++ ) 130 | : ( col++, (code & 0xF800) === 0xD800 && offset++ ) 131 | } 132 | 133 | return __Utils_Tuple3(newOffset, row, col); 134 | }); 135 | -------------------------------------------------------------------------------- /src/Parser.elm: -------------------------------------------------------------------------------- 1 | module Parser exposing 2 | ( Parser, run 3 | , int, float, number, symbol, keyword, variable, end 4 | , succeed, (|=), (|.), lazy, andThen, problem 5 | , oneOf, map, backtrackable, commit, token 6 | , sequence, Trailing(..), loop, Step(..) 7 | , spaces, lineComment, multiComment, Nestable(..) 
8 | , getChompedString, chompIf, chompWhile, chompUntil, chompUntilEndOr, mapChompedString 9 | , DeadEnd, Problem(..), deadEndsToString 10 | , withIndent, getIndent 11 | , getPosition, getRow, getCol, getOffset, getSource 12 | ) 13 | 14 | 15 | {-| 16 | 17 | # Parsers 18 | @docs Parser, run 19 | 20 | # Building Blocks 21 | @docs int, float, number, symbol, keyword, variable, end 22 | 23 | # Pipelines 24 | @docs succeed, (|=), (|.), lazy, andThen, problem 25 | 26 | # Branches 27 | @docs oneOf, map, backtrackable, commit, token 28 | 29 | # Loops 30 | @docs sequence, Trailing, loop, Step 31 | 32 | # Whitespace 33 | @docs spaces, lineComment, multiComment, Nestable 34 | 35 | # Chompers 36 | @docs getChompedString, chompIf, chompWhile, chompUntil, chompUntilEndOr, mapChompedString 37 | 38 | # Errors 39 | @docs DeadEnd, Problem, deadEndsToString 40 | 41 | # Indentation 42 | @docs withIndent, getIndent 43 | 44 | # Positions 45 | @docs getPosition, getRow, getCol, getOffset, getSource 46 | -} 47 | 48 | 49 | import Char 50 | import Parser.Advanced as A exposing ((|=), (|.)) 51 | import Set 52 | 53 | 54 | 55 | -- INFIX OPERATORS - see Parser.Advanced for why 5 and 6 were chosen 56 | 57 | 58 | infix left 5 (|=) = keeper 59 | infix left 6 (|.) = ignorer 60 | 61 | 62 | 63 | -- PARSERS 64 | 65 | 66 | {-| A `Parser` helps turn a `String` into nicely structured data. For example, 67 | we can [`run`](#run) the [`int`](#int) parser to turn `String` to `Int`: 68 | 69 | run int "123456" == Ok 123456 70 | run int "3.1415" == Err ... 71 | 72 | The cool thing is that you can combine `Parser` values to handle much more 73 | complex scenarios. 74 | -} 75 | type alias Parser a = 76 | A.Parser Never Problem a 77 | 78 | 79 | 80 | -- RUN 81 | 82 | 83 | {-| Try a parser. Here are some examples using the [`keyword`](#keyword) 84 | parser: 85 | 86 | run (keyword "true") "true" == Ok () 87 | run (keyword "true") "True" == Err ... 88 | run (keyword "true") "false" == Err ... 
89 | run (keyword "true") "true!" == Ok () 90 | 91 | Notice the last case! A `Parser` will chomp as much as possible and not worry 92 | about the rest. Use the [`end`](#end) parser to ensure you made it to the end 93 | of the string! 94 | -} 95 | run : Parser a -> String -> Result (List DeadEnd) a 96 | run parser source = 97 | case A.run parser source of 98 | Ok a -> 99 | Ok a 100 | 101 | Err problems -> 102 | Err (List.map problemToDeadEnd problems) 103 | 104 | 105 | problemToDeadEnd : A.DeadEnd Never Problem -> DeadEnd 106 | problemToDeadEnd p = 107 | DeadEnd p.row p.col p.problem 108 | 109 | 110 | 111 | -- PROBLEMS 112 | 113 | 114 | {-| A parser can run into situations where there is no way to make progress. 115 | When that happens, I record the `row` and `col` where you got stuck and the 116 | particular `problem` you ran into. That is a `DeadEnd`! 117 | 118 | **Note:** I count rows and columns like a text editor. The beginning is `row=1` 119 | and `col=1`. As I chomp characters, the `col` increments. When I reach a `\n` 120 | character, I increment the `row` and set `col=1`. 121 | -} 122 | type alias DeadEnd = 123 | { row : Int 124 | , col : Int 125 | , problem : Problem 126 | } 127 | 128 | 129 | {-| When you run into a `DeadEnd`, I record some information about why you 130 | got stuck. This data is useful for producing helpful error messages. This is 131 | how [`deadEndsToString`](#deadEndsToString) works! 132 | 133 | **Note:** If you feel limited by this type (i.e. having to represent custom 134 | problems as strings) I highly recommend switching to `Parser.Advanced`. It 135 | lets you define your own `Problem` type. It can also track "context" which 136 | can improve error messages a ton! This is how the Elm compiler produces 137 | relatively nice parse errors, and I am excited to see those techniques applied 138 | elsewhere! 
139 | -} 140 | type Problem 141 | = Expecting String 142 | | ExpectingInt 143 | | ExpectingHex 144 | | ExpectingOctal 145 | | ExpectingBinary 146 | | ExpectingFloat 147 | | ExpectingNumber 148 | | ExpectingVariable 149 | | ExpectingSymbol String 150 | | ExpectingKeyword String 151 | | ExpectingEnd 152 | | UnexpectedChar 153 | | Problem String 154 | | BadRepeat 155 | 156 | 157 | {-| Turn all the `DeadEnd` data into a string that is easier for people to 158 | read. 159 | 160 | **Note:** This is just a baseline of quality. It cannot do anything with colors. 161 | It is not interactive. It just turns the raw data into strings. I really hope 162 | folks will check out the source code for some inspiration on how to turn errors 163 | into `Html` with nice colors and interaction! The `Parser.Advanced` module lets 164 | you work with context as well, which really unlocks another level of quality! 165 | The "context" technique is how the Elm compiler can say "I think I am parsing a 166 | list, so I was expecting a closing `]` here." Telling users what the parser 167 | _thinks_ is happening can be really helpful! 168 | -} 169 | deadEndsToString : List DeadEnd -> String 170 | deadEndsToString deadEnds = 171 | "TODO deadEndsToString" 172 | 173 | 174 | 175 | -- PIPELINES 176 | 177 | 178 | {-| A parser that succeeds without chomping any characters. 179 | 180 | run (succeed 90210 ) "mississippi" == Ok 90210 181 | run (succeed 3.141 ) "mississippi" == Ok 3.141 182 | run (succeed () ) "mississippi" == Ok () 183 | run (succeed Nothing) "mississippi" == Ok Nothing 184 | 185 | Seems weird on its own, but it is very useful in combination with other 186 | functions. The docs for [`(|=)`](#|=) and [`andThen`](#andThen) have some neat 187 | examples. 188 | -} 189 | succeed : a -> Parser a 190 | succeed = 191 | A.succeed 192 | 193 | 194 | {-| **Keep** values in a parser pipeline.
For example, we could say: 195 | 196 | type alias Point = { x : Float, y : Float } 197 | 198 | point : Parser Point 199 | point = 200 | succeed Point 201 | |. symbol "(" 202 | |. spaces 203 | |= float 204 | |. spaces 205 | |. symbol "," 206 | |. spaces 207 | |= float 208 | |. spaces 209 | |. symbol ")" 210 | 211 | All the parsers in this pipeline will chomp characters and produce values. So 212 | `symbol "("` will chomp one paren and produce a `()` value. Similarly, `float` 213 | will chomp some digits and produce a `Float` value. The `(|.)` and `(|=)` 214 | operators just decide whether we give the values to the `Point` function. 215 | 216 | So in this case, we skip the `()` from `symbol "("`, we skip the `()` from 217 | `spaces`, we keep the `Float` from `float`, etc. 218 | -} 219 | keeper : Parser (a -> b) -> Parser a -> Parser b 220 | keeper = 221 | (|=) 222 | 223 | 224 | {-| **Skip** values in a parser pipeline. For example, maybe we want to parse 225 | some JavaScript variables: 226 | 227 | var : Parser String 228 | var = 229 | getChompedString <| 230 | succeed () 231 | |. chompIf isStartChar 232 | |. chompWhile isInnerChar 233 | 234 | isStartChar : Char -> Bool 235 | isStartChar char = 236 | Char.isAlpha char || char == '_' || char == '$' 237 | 238 | isInnerChar : Char -> Bool 239 | isInnerChar char = 240 | isStartChar char || Char.isDigit char 241 | 242 | `chompIf isStartChar` can chomp one character and produce a `()` value. 243 | `chompWhile isInnerChar` can chomp zero or more characters and produce a `()` 244 | value. The `(|.)` operators are saying to still chomp all the characters, but 245 | skip the two `()` values that get produced. No one cares about them. 246 | -} 247 | ignorer : Parser keep -> Parser ignore -> Parser keep 248 | ignorer = 249 | (|.) 250 | 251 | 252 | {-| Helper to define recursive parsers. 
Say we want a parser for simple 253 | boolean expressions: 254 | 255 | true 256 | false 257 | (true || false) 258 | (true || (true || false)) 259 | 260 | Notice that a boolean expression might contain *other* boolean expressions. 261 | That means we will want to define our parser in terms of itself: 262 | 263 | type Boolean 264 | = MyTrue 265 | | MyFalse 266 | | MyOr Boolean Boolean 267 | 268 | boolean : Parser Boolean 269 | boolean = 270 | oneOf 271 | [ succeed MyTrue 272 | |. keyword "true" 273 | , succeed MyFalse 274 | |. keyword "false" 275 | , succeed MyOr 276 | |. symbol "(" 277 | |. spaces 278 | |= lazy (\_ -> boolean) 279 | |. spaces 280 | |. symbol "||" 281 | |. spaces 282 | |= lazy (\_ -> boolean) 283 | |. spaces 284 | |. symbol ")" 285 | ] 286 | 287 | **Notice that `boolean` uses `boolean` in its definition!** In Elm, you can 288 | only define a value in terms of itself if it is behind a function call. So 289 | `lazy` helps us define these self-referential parsers. (`andThen` can be used 290 | for this as well!) 291 | -} 292 | lazy : (() -> Parser a) -> Parser a 293 | lazy = 294 | A.lazy 295 | 296 | 297 | {-| Parse one thing `andThen` parse another thing. This is useful when you want 298 | to check on what you just parsed. For example, maybe you want U.S. zip codes 299 | and `int` is not suitable because it does not allow leading zeros. You could 300 | say: 301 | 302 | zipCode : Parser String 303 | zipCode = 304 | getChompedString (chompWhile Char.isDigit) 305 | |> andThen checkZipCode 306 | 307 | checkZipCode : String -> Parser String 308 | checkZipCode code = 309 | if String.length code == 5 then 310 | succeed code 311 | else 312 | problem "a U.S. zip code has exactly 5 digits" 313 | 314 | First we chomp digits `andThen` we check if it is a valid U.S. zip code. We 315 | `succeed` if it has exactly five digits and report a `problem` if not.
316 | 317 | Check out [`examples/DoubleQuoteString.elm`](https://github.com/elm/parser/blob/master/examples/DoubleQuoteString.elm) 318 | for another example, this time using `andThen` to verify unicode code points. 319 | 320 | **Note:** If you are using `andThen` recursively and blowing the stack, check 321 | out the [`loop`](#loop) function to limit stack usage. 322 | -} 323 | andThen : (a -> Parser b) -> Parser a -> Parser b 324 | andThen = 325 | A.andThen 326 | 327 | 328 | {-| Indicate that a parser has reached a dead end. "Everything was going fine 329 | until I ran into this problem." Check out the [`andThen`](#andThen) docs to see 330 | an example usage. 331 | -} 332 | problem : String -> Parser a 333 | problem msg = 334 | A.problem (Problem msg) 335 | 336 | 337 | 338 | -- BACKTRACKING 339 | 340 | 341 | {-| If you are parsing JSON, the values can be strings, floats, booleans, 342 | arrays, objects, or null. You need a way to pick `oneOf` them! Here is a 343 | sample of what that code might look like: 344 | 345 | type Json 346 | = Number Float 347 | | Boolean Bool 348 | | Null 349 | 350 | json : Parser Json 351 | json = 352 | oneOf 353 | [ map Number float 354 | , map (\_ -> Boolean True) (keyword "true") 355 | , map (\_ -> Boolean False) (keyword "false") 356 | , map (\_ -> Null) (keyword "null") 357 | ] 358 | 359 | This parser will keep trying parsers until `oneOf` them starts chomping 360 | characters. Once a path is chosen, it does not come back and try the others. 361 | 362 | **Note:** I highly recommend reading [this document][semantics] to learn how 363 | `oneOf` and `backtrackable` interact. It is subtle and important! 364 | 365 | [semantics]: https://github.com/elm/parser/blob/master/semantics.md 366 | -} 367 | oneOf : List (Parser a) -> Parser a 368 | oneOf = 369 | A.oneOf 370 | 371 | 372 | {-| Transform the result of a parser.
Maybe you have a value that is 373 | an integer or `null`: 374 | 375 | nullOrInt : Parser (Maybe Int) 376 | nullOrInt = 377 | oneOf 378 | [ map Just int 379 | , map (\_ -> Nothing) (keyword "null") 380 | ] 381 | 382 | -- run nullOrInt "0" == Ok (Just 0) 383 | -- run nullOrInt "13" == Ok (Just 13) 384 | -- run nullOrInt "null" == Ok Nothing 385 | -- run nullOrInt "zero" == Err ... 386 | -} 387 | map : (a -> b) -> Parser a -> Parser b 388 | map = 389 | A.map 390 | 391 | 392 | {-| It is quite tricky to use `backtrackable` well! It can be very useful, but 393 | also can degrade performance and error message quality. 394 | 395 | Read [this document](https://github.com/elm/parser/blob/master/semantics.md) 396 | to learn how `oneOf`, `backtrackable`, and `commit` work and interact with 397 | each other. It is subtle and important! 398 | -} 399 | backtrackable : Parser a -> Parser a 400 | backtrackable = 401 | A.backtrackable 402 | 403 | 404 | {-| `commit` is almost always paired with `backtrackable` in some way, and it 405 | is tricky to use well. 406 | 407 | Read [this document](https://github.com/elm/parser/blob/master/semantics.md) 408 | to learn how `oneOf`, `backtrackable`, and `commit` work and interact with 409 | each other. It is subtle and important! 410 | -} 411 | commit : a -> Parser a 412 | commit = 413 | A.commit 414 | 415 | 416 | 417 | -- TOKEN 418 | 419 | 420 | {-| Parse exactly the given string, without any regard to what comes next. 421 | 422 | A potential pitfall when parsing keywords is getting tricked by variables that 423 | start with a keyword, like `let` in `letters` or `import` in `important`. This 424 | is especially likely if you have a whitespace parser that can consume zero 425 | characters. So the [`keyword`](#keyword) parser is defined with `token` and a 426 | trick to peek ahead a bit: 427 | 428 | keyword : String -> Parser () 429 | keyword kwd = 430 | succeed identity 431 | |. 
backtrackable (token kwd) 432 | |= oneOf 433 | [ map (\_ -> True) (backtrackable (chompIf isVarChar)) 434 | , succeed False 435 | ] 436 | |> andThen (checkEnding kwd) 437 | 438 | checkEnding : String -> Bool -> Parser () 439 | checkEnding kwd isBadEnding = 440 | if isBadEnding then 441 | problem ("expecting the `" ++ kwd ++ "` keyword") 442 | else 443 | commit () 444 | 445 | isVarChar : Char -> Bool 446 | isVarChar char = 447 | Char.isAlphaNum char || char == '_' 448 | 449 | This definition is specially designed so that (1) if you really see `let` you 450 | commit to that path and (2) if you see `letters` instead you can backtrack and 451 | try other options. If I had just put a `backtrackable` around the whole thing 452 | you would not get (1) anymore. 453 | -} 454 | token : String -> Parser () 455 | token str = 456 | A.token (toToken str) 457 | 458 | 459 | toToken : String -> A.Token Problem 460 | toToken str = 461 | A.Token str (Expecting str) 462 | 463 | 464 | 465 | -- LOOPS 466 | 467 | 468 | {-| A parser that can loop indefinitely. This can be helpful when parsing 469 | repeated structures, like a bunch of statements: 470 | 471 | statements : Parser (List Stmt) 472 | statements = 473 | loop [] statementsHelp 474 | 475 | statementsHelp : List Stmt -> Parser (Step (List Stmt) (List Stmt)) 476 | statementsHelp revStmts = 477 | oneOf 478 | [ succeed (\stmt -> Loop (stmt :: revStmts)) 479 | |= statement 480 | |. spaces 481 | |. symbol ";" 482 | |. spaces 483 | , succeed () 484 | |> map (\_ -> Done (List.reverse revStmts)) 485 | ] 486 | 487 | -- statement : Parser Stmt 488 | 489 | Notice that the statements are tracked in reverse as we `Loop`, and we reorder 490 | them only once we are `Done`. This is a very common pattern with `loop`! 491 | 492 | Check out [`examples/DoubleQuoteString.elm`](https://github.com/elm/parser/blob/master/examples/DoubleQuoteString.elm) 493 | for another example. 
494 | 495 | **IMPORTANT NOTE:** Parsers like `succeed ()` and `chompWhile Char.isAlpha` can 496 | succeed without consuming any characters. So in some cases you may want to use 497 | [`getOffset`](#getOffset) to ensure that each step actually consumed characters. 498 | Otherwise you could end up in an infinite loop! 499 | 500 | **Note:** Anything you can write with `loop`, you can also write as a parser 501 | that chomps some characters `andThen` calls itself with new arguments. The 502 | problem with calling `andThen` recursively is that it grows the stack, so you 503 | cannot do it indefinitely. So `loop` is important because it enables tail-call 504 | elimination, allowing you to parse however many repeats you want. 505 | -} 506 | loop : state -> (state -> Parser (Step state a)) -> Parser a 507 | loop state callback = 508 | A.loop state (\s -> map toAdvancedStep (callback s)) 509 | 510 | 511 | {-| Decide what steps to take next in your [`loop`](#loop). 512 | 513 | If you are `Done`, you give the result of the whole `loop`. If you decide to 514 | `Loop` around again, you give a new state to work from. Maybe you need to add 515 | an item to a list? Or maybe you need to track some information about what you 516 | just saw? 517 | 518 | **Note:** It may be helpful to learn about [finite-state machines][fsm] to get 519 | a broader intuition about using `state`. For example, you may want to create a `type` 520 | that describes four possible states, and then use `Loop` to transition between 521 | them as you consume characters. 522 | 523 | [fsm]: https://en.wikipedia.org/wiki/Finite-state_machine 524 | -} 525 | type Step state a 526 | = Loop state 527 | | Done a 528 | 529 | 530 | toAdvancedStep : Step s a -> A.Step s a 531 | toAdvancedStep step = 532 | case step of 533 | Loop s -> A.Loop s 534 | Done a -> A.Done a 535 | 536 | 537 | 538 | -- NUMBERS 539 | 540 | 541 | {-| Parse integers.
542 | 543 | run int "1" == Ok 1 544 | run int "1234" == Ok 1234 545 | 546 | run int "-789" == Err ... 547 | run int "0123" == Err ... 548 | run int "1.34" == Err ... 549 | run int "1e31" == Err ... 550 | run int "123a" == Err ... 551 | run int "0x1A" == Err ... 552 | 553 | If you want to handle a leading `+` or `-` you should do it with a custom 554 | parser like this: 555 | 556 | myInt : Parser Int 557 | myInt = 558 | oneOf 559 | [ succeed negate 560 | |. symbol "-" 561 | |= int 562 | , int 563 | ] 564 | 565 | **Note:** If you want a parser for both `Int` and `Float` literals, check out 566 | [`number`](#number) below. It will be faster than using `oneOf` to combine 567 | `int` and `float` yourself. 568 | -} 569 | int : Parser Int 570 | int = 571 | A.int ExpectingInt ExpectingInt 572 | 573 | 574 | {-| Parse floats. 575 | 576 | run float "123" == Ok 123 577 | run float "3.1415" == Ok 3.1415 578 | run float "0.1234" == Ok 0.1234 579 | run float ".1234" == Ok 0.1234 580 | run float "1e-42" == Ok 1e-42 581 | run float "6.022e23" == Ok 6.022e23 582 | run float "6.022E23" == Ok 6.022e23 583 | run float "6.022e+23" == Ok 6.022e23 584 | 585 | If you want to disable literals like `.123` (like in Elm) you could write 586 | something like this: 587 | 588 | elmFloat : Parser Float 589 | elmFloat = 590 | oneOf 591 | [ symbol "." 592 | |> andThen (\_ -> problem "floating point numbers must start with a digit, like 0.25") 593 | , float 594 | ] 595 | 596 | **Note:** If you want a parser for both `Int` and `Float` literals, check out 597 | [`number`](#number) below. It will be faster than using `oneOf` to combine 598 | `int` and `float` yourself. 599 | -} 600 | float : Parser Float 601 | float = 602 | A.float ExpectingFloat ExpectingFloat 603 | 604 | 605 | 606 | -- NUMBER 607 | 608 | 609 | {-| Parse a bunch of different kinds of numbers without backtracking.
A parser 610 | for Elm would need to handle integers, floats, and hexadecimal like this: 611 | 612 | type Expr 613 | = Variable String 614 | | Int Int 615 | | Float Float 616 | | Apply Expr Expr 617 | 618 | elmNumber : Parser Expr 619 | elmNumber = 620 | number 621 | { int = Just Int 622 | , hex = Just Int -- 0x001A is allowed 623 | , octal = Nothing -- 0o0731 is not 624 | , binary = Nothing -- 0b1101 is not 625 | , float = Just Float 626 | } 627 | 628 | If you wanted to implement the [`float`](#float) parser, it would be like this: 629 | 630 | float : Parser Float 631 | float = 632 | number 633 | { int = Just toFloat 634 | , hex = Nothing 635 | , octal = Nothing 636 | , binary = Nothing 637 | , float = Just identity 638 | } 639 | 640 | Notice that it actually is processing `int` results! This is because `123` 641 | looks like an integer to me, but maybe it looks like a float to you. If you had 642 | `int = Nothing`, floats would need a decimal like `1.0` in every case. If you 643 | like explicitness, that may actually be preferable! 644 | 645 | **Note:** This function does not check for weird trailing characters in the 646 | current implementation, so parsing `123abc` can succeed up to `123` and then 647 | move on. This is helpful for people who want to parse things like `40px` or 648 | `3m`, but it requires a bit of extra code to rule out trailing characters in 649 | other cases. 
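
The "bit of extra code" mentioned in the note could look roughly like the sketch below. It assumes `number` behaves as the note describes (it stops after the digits and moves on); `strictNumber` and `rejectTrailingLetter` are hypothetical names, not part of this module.

```elm
import Parser exposing (Parser, andThen, chompIf, number, oneOf, problem, succeed)


-- Hypothetical helper: parse a number, then fail if a letter follows directly.
strictNumber : Parser Float
strictNumber =
    number
        { int = Just toFloat
        , hex = Nothing
        , octal = Nothing
        , binary = Nothing
        , float = Just identity
        }
        |> andThen rejectTrailingLetter


-- First branch: if a letter follows, chomp it and report a problem. Because a
-- character was chomped, `oneOf` does not go on to try the second branch.
-- Second branch: nothing was chomped, so keep the number that was parsed.
rejectTrailingLetter : Float -> Parser Float
rejectTrailingLetter value =
    oneOf
        [ chompIf Char.isAlpha
            |> andThen (\_ -> problem "a number cannot be followed by a letter")
        , succeed value
        ]
```

With a check like this, input such as `40px` should be reported as a problem instead of silently yielding `40`. Whether letters are the right thing to reject depends on the language being parsed.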
650 | -} 651 | number 652 | : { int : Maybe (Int -> a) 653 | , hex : Maybe (Int -> a) 654 | , octal : Maybe (Int -> a) 655 | , binary : Maybe (Int -> a) 656 | , float : Maybe (Float -> a) 657 | } 658 | -> Parser a 659 | number i = 660 | A.number 661 | { int = Result.fromMaybe ExpectingInt i.int 662 | , hex = Result.fromMaybe ExpectingHex i.hex 663 | , octal = Result.fromMaybe ExpectingOctal i.octal 664 | , binary = Result.fromMaybe ExpectingBinary i.binary 665 | , float = Result.fromMaybe ExpectingFloat i.float 666 | , invalid = ExpectingNumber 667 | , expecting = ExpectingNumber 668 | } 669 | 670 | 671 | 672 | -- SYMBOL 673 | 674 | 675 | {-| Parse symbols like `(` and `,`. 676 | 677 | run (symbol "[") "[" == Ok () 678 | run (symbol "[") "4" == Err ... (ExpectingSymbol "[") ... 679 | 680 | **Note:** This is good for stuff like brackets and semicolons, but it probably 681 | should not be used for binary operators like `+` and `-` because you can find 682 | yourself in weird situations. For example, is `3--4` a typo? Or is it `3 - -4`? 683 | I have had better luck with `chompWhile isSymbol` and sorting out which 684 | operator it is afterwards. 685 | -} 686 | symbol : String -> Parser () 687 | symbol str = 688 | A.symbol (A.Token str (ExpectingSymbol str)) 689 | 690 | 691 | 692 | -- KEYWORD 693 | 694 | 695 | {-| Parse keywords like `let`, `case`, and `type`. 696 | 697 | run (keyword "let") "let" == Ok () 698 | run (keyword "let") "var" == Err ... (ExpectingKeyword "let") ... 699 | run (keyword "let") "letters" == Err ... (ExpectingKeyword "let") ... 700 | 701 | **Note:** Notice the third case there! `keyword` actually looks ahead one 702 | character to make sure it is not a letter, number, or underscore. The goal is 703 | to help with parsers like this: 704 | 705 | succeed identity 706 | |. keyword "let" 707 | |. spaces 708 | |= elmVar 709 | |. spaces 710 | |. 
symbol "=" 711 | 712 | The trouble is that `spaces` may chomp zero characters (to handle expressions 713 | like `[1,2]` and `[ 1 , 2 ]`) and in this case, it would mean `letters` could 714 | be parsed as `let ters` and then wonder where the equals sign is! Check out the 715 | [`token`](#token) docs if you need to customize this! 716 | -} 717 | keyword : String -> Parser () 718 | keyword kwd = 719 | A.keyword (A.Token kwd (ExpectingKeyword kwd)) 720 | 721 | 722 | 723 | -- END 724 | 725 | 726 | {-| Check if you have reached the end of the string you are parsing. 727 | 728 | justAnInt : Parser Int 729 | justAnInt = 730 | succeed identity 731 | |= int 732 | |. end 733 | 734 | -- run justAnInt "90210" == Ok 90210 735 | -- run justAnInt "1 + 2" == Err ... 736 | -- run int "1 + 2" == Ok 1 737 | 738 | Parsers can succeed without parsing the whole string. Ending your parser 739 | with `end` guarantees that you have successfully parsed the whole string. 740 | -} 741 | end : Parser () 742 | end = 743 | A.end ExpectingEnd 744 | 745 | 746 | 747 | -- CHOMPED STRINGS 748 | 749 | 750 | {-| Sometimes parsers like `int` or `variable` cannot do exactly what you 751 | need. The "chomping" family of functions is meant for that case! Maybe you 752 | need to parse [valid PHP variables][php] like `$x` and `$txt`: 753 | 754 | php : Parser String 755 | php = 756 | getChompedString <| 757 | succeed () 758 | |. chompIf (\c -> c == '$') 759 | |. chompIf (\c -> Char.isAlpha c || c == '_') 760 | |. chompWhile (\c -> Char.isAlphaNum c || c == '_') 761 | 762 | The idea is that you create a bunch of chompers that validate the underlying 763 | characters. Then `getChompedString` extracts the underlying `String` efficiently. 
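
As a rough illustration, the `php` parser above could be exercised like this. The `examples` name is hypothetical, and the snippet repeats the definition so that it stands on its own.

```elm
import Parser exposing ((|.), Parser, chompIf, chompWhile, getChompedString, run, succeed)


php : Parser String
php =
    getChompedString <|
        succeed ()
            |. chompIf (\c -> c == '$')
            |. chompIf (\c -> Char.isAlpha c || c == '_')
            |. chompWhile (\c -> Char.isAlphaNum c || c == '_')


-- Hypothetical sample runs. The parser stops at the first character that is
-- not part of the variable name, so the rest of the input is left alone.
examples : List (Result (List Parser.DeadEnd) String)
examples =
    [ run php "$txt" -- Ok "$txt"
    , run php "$x = 1" -- Ok "$x"
    , run php "txt" -- Err ... (no leading `$`)
    ]
```

The chompers only validate characters; the `String` itself comes from slicing the source between the start and end offsets.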
764 | 765 | **Note:** Maybe it is helpful to see how you can use [`getOffset`](#getOffset) 766 | and [`getSource`](#getSource) to implement this function: 767 | 768 | getChompedString : Parser a -> Parser String 769 | getChompedString parser = 770 | succeed String.slice 771 | |= getOffset 772 | |. parser 773 | |= getOffset 774 | |= getSource 775 | 776 | [php]: https://www.w3schools.com/php/php_variables.asp 777 | -} 778 | getChompedString : Parser a -> Parser String 779 | getChompedString = 780 | A.getChompedString 781 | 782 | 783 | {-| This works just like [`getChompedString`](#getChompedString) but gives 784 | a bit more flexibility. For example, maybe you want to parse Elm doc comments 785 | and get (1) the full comment and (2) all of the names listed in the docs. 786 | 787 | You could implement `mapChompedString` like this: 788 | 789 | mapChompedString : (String -> a -> b) -> Parser a -> Parser b 790 | mapChompedString func parser = 791 | succeed (\start value end src -> func (String.slice start end src) value) 792 | |= getOffset 793 | |= parser 794 | |= getOffset 795 | |= getSource 796 | 797 | -} 798 | mapChompedString : (String -> a -> b) -> Parser a -> Parser b 799 | mapChompedString = 800 | A.mapChompedString 801 | 802 | 803 | 804 | {-| Chomp one character if it passes the test. 805 | 806 | chompUpper : Parser () 807 | chompUpper = 808 | chompIf Char.isUpper 809 | 810 | So this can chomp a character like `T` and produce a `()` value. 811 | -} 812 | chompIf : (Char -> Bool) -> Parser () 813 | chompIf isGood = 814 | A.chompIf isGood UnexpectedChar 815 | 816 | 817 | 818 | {-| Chomp zero or more characters if they pass the test. This is commonly 819 | useful for chomping whitespace or variable names: 820 | 821 | whitespace : Parser () 822 | whitespace = 823 | chompWhile (\c -> c == ' ' || c == '\t' || c == '\n' || c == '\r') 824 | 825 | elmVar : Parser String 826 | elmVar = 827 | getChompedString <| 828 | succeed () 829 | |. chompIf Char.isLower 830 | |.
chompWhile (\c -> Char.isAlphaNum c || c == '_') 831 | 832 | **Note:** a `chompWhile` parser always succeeds! This can lead to tricky 833 | situations, especially if you define your whitespace with it. In that case, 834 | you could accidentally interpret `letx` as the keyword `let` followed by 835 | "spaces" followed by the variable `x`. This is why the `keyword` and `number` 836 | parsers peek ahead, making sure they are not followed by anything unexpected. 837 | -} 838 | chompWhile : (Char -> Bool) -> Parser () 839 | chompWhile = 840 | A.chompWhile 841 | 842 | 843 | {-| Chomp until you see a certain string. You could define C-style multi-line 844 | comments like this: 845 | 846 | comment : Parser () 847 | comment = 848 | symbol "/*" 849 | |. chompUntil "*/" 850 | 851 | I recommend using [`multiComment`](#multiComment) for this particular scenario 852 | though. It can be trickier than it looks! 853 | -} 854 | chompUntil : String -> Parser () 855 | chompUntil str = 856 | A.chompUntil (toToken str) 857 | 858 | 859 | {-| Chomp until you see a certain string or until you run out of characters to 860 | chomp! You could define single-line comments like this: 861 | 862 | elm : Parser () 863 | elm = 864 | symbol "--" 865 | |. chompUntilEndOr "\n" 866 | 867 | A file may end with a single-line comment, so the file can end before you see 868 | a newline. Tricky! 869 | 870 | I recommend just using [`lineComment`](#lineComment) for this particular 871 | scenario. 872 | -} 873 | chompUntilEndOr : String -> Parser () 874 | chompUntilEndOr = 875 | A.chompUntilEndOr 876 | 877 | 878 | 879 | -- INDENTATION 880 | 881 | 882 | {-| Some languages are indentation sensitive. Python cares about tabs. Elm 883 | cares about spaces sometimes. `withIndent` and `getIndent` allow you to manage 884 | "indentation state" yourself, in whatever way your scenario requires.
885 | -} 886 | withIndent : Int -> Parser a -> Parser a 887 | withIndent = 888 | A.withIndent 889 | 890 | 891 | {-| When someone said `withIndent` earlier, what number did they put in there? 892 | 893 | - `getIndent` results in `0`, the default value 894 | - `withIndent 4 getIndent` results in `4` 895 | 896 | So you are just asking about things you said earlier. These numbers do not leak 897 | out of `withIndent`, so say we have: 898 | 899 | succeed Tuple.pair 900 | |= withIndent 4 getIndent 901 | |= getIndent 902 | 903 | Assuming there are no `withIndent` above this, you would get `(4,0)` from this. 904 | -} 905 | getIndent : Parser Int 906 | getIndent = 907 | A.getIndent 908 | 909 | 910 | 911 | -- POSITION 912 | 913 | 914 | {-| Code editors treat code like a grid, with rows and columns. The start is 915 | `row=1` and `col=1`. As you chomp characters, the `col` increments. When you 916 | run into a `\n` character, the `row` increments and `col` goes back to `1`. 917 | 918 | In the Elm compiler, I track the start and end position of every expression 919 | like this: 920 | 921 | type alias Located a = 922 | { start : (Int, Int) 923 | , value : a 924 | , end : (Int, Int) 925 | } 926 | 927 | located : Parser a -> Parser (Located a) 928 | located parser = 929 | succeed Located 930 | |= getPosition 931 | |= parser 932 | |= getPosition 933 | 934 | So if there is a problem during type inference, I use this saved position 935 | information to underline the exact problem! 936 | 937 | **Note:** Tabs count as one character, so if you are parsing something like 938 | Python, I recommend sorting that out *after* parsing. 
So if I wanted the `^^^^` 939 | underline like in Elm, I would find the `row` in the source code and do 940 | something like this: 941 | 942 | makeUnderline : String -> Int -> Int -> String 943 | makeUnderline row minCol maxCol = 944 | String.toList row 945 | |> List.indexedMap (toUnderlineChar minCol maxCol) 946 | |> String.fromList 947 | 948 | toUnderlineChar : Int -> Int -> Int -> Char -> Char 949 | toUnderlineChar minCol maxCol col char = 950 | if minCol <= col && col <= maxCol then 951 | '^' 952 | else if char == '\t' then 953 | '\t' 954 | else 955 | ' ' 956 | 957 | So it would preserve any tabs from the source line. There are tons of other 958 | ways to do this though. The point is just that you handle the tabs after 959 | parsing but before anyone looks at the numbers in a context where tabs may 960 | equal 2, 4, or 8. 961 | -} 962 | getPosition : Parser (Int, Int) 963 | getPosition = 964 | A.getPosition 965 | 966 | 967 | {-| This is a more efficient version of `map Tuple.first getPosition`. Maybe 968 | you just want to track the line number for some reason? This lets you do that. 969 | 970 | See [`getPosition`](#getPosition) for an explanation of rows and columns. 971 | -} 972 | getRow : Parser Int 973 | getRow = 974 | A.getRow 975 | 976 | 977 | {-| This is a more efficient version of `map Tuple.second getPosition`. This 978 | can be useful in combination with [`withIndent`](#withIndent) and 979 | [`getIndent`](#getIndent), like this: 980 | 981 | checkIndent : Parser () 982 | checkIndent = 983 | succeed (\indent column -> indent <= column) 984 | |= getIndent 985 | |= getCol 986 | |> andThen checkIndentHelp 987 | 988 | checkIndentHelp : Bool -> Parser () 989 | checkIndentHelp isIndented = 990 | if isIndented then 991 | succeed () 992 | else 993 | problem "expecting more spaces" 994 | 995 | So the `checkIndent` parser only succeeds when you are "deeper" than the 996 | current indent level. You could use this to parse Elm-style `let` expressions. 
997 | -} 998 | getCol : Parser Int 999 | getCol = 1000 | A.getCol 1001 | 1002 | 1003 | {-| Editors think of code as a grid, but behind the scenes it is just a flat 1004 | array of UTF-16 characters. `getOffset` tells you your index in that flat 1005 | array. So if you chomp `"\n\n\n\n"` you are on row 5, column 1, and offset 4. 1006 | 1007 | **Note:** JavaScript uses a somewhat odd version of UTF-16 strings, so a single 1008 | character may take two slots. So in JavaScript, `'abc'.length === 3` but 1009 | `'🙈🙉🙊'.length === 6`. Try it out! And since Elm runs in JavaScript, the offset 1010 | moves by those rules. 1011 | -} 1012 | getOffset : Parser Int 1013 | getOffset = 1014 | A.getOffset 1015 | 1016 | 1017 | {-| Get the full string that is being parsed. You could use this to define 1018 | `getChompedString` or `mapChompedString` if you wanted: 1019 | 1020 | getChompedString : Parser a -> Parser String 1021 | getChompedString parser = 1022 | succeed String.slice 1023 | |= getOffset 1024 | |. parser 1025 | |= getOffset 1026 | |= getSource 1027 | -} 1028 | getSource : Parser String 1029 | getSource = 1030 | A.getSource 1031 | 1032 | 1033 | 1034 | -- VARIABLES 1035 | 1036 | 1037 | {-| Create a parser for variables. If we wanted to parse type variables in Elm, 1038 | we could try something like this: 1039 | 1040 | import Char 1041 | import Parser exposing (..) 1042 | import Set 1043 | 1044 | typeVar : Parser String 1045 | typeVar = 1046 | variable 1047 | { start = Char.isLower 1048 | , inner = \c -> Char.isAlphaNum c || c == '_' 1049 | , reserved = Set.fromList [ "let", "in", "case", "of" ] 1050 | } 1051 | 1052 | This is saying it _must_ start with a lower-case character. After that, 1053 | characters can be letters, numbers, or underscores. It is also saying that if 1054 | you run into any of these reserved names, it is definitely not a variable. 
1055 | -} 1056 | variable : 1057 | { start : Char -> Bool 1058 | , inner : Char -> Bool 1059 | , reserved : Set.Set String 1060 | } 1061 | -> Parser String 1062 | variable i = 1063 | A.variable 1064 | { start = i.start 1065 | , inner = i.inner 1066 | , reserved = i.reserved 1067 | , expecting = ExpectingVariable 1068 | } 1069 | 1070 | 1071 | 1072 | -- SEQUENCES 1073 | 1074 | 1075 | {-| Handle things like lists and records, but you can customize the details 1076 | however you need. Say you want to parse C-style code blocks: 1077 | 1078 | import Parser exposing (Parser, Trailing(..)) 1079 | 1080 | block : Parser (List Stmt) 1081 | block = 1082 | Parser.sequence 1083 | { start = "{" 1084 | , separator = ";" 1085 | , end = "}" 1086 | , spaces = spaces 1087 | , item = statement 1088 | , trailing = Mandatory -- demand a trailing semi-colon 1089 | } 1090 | 1091 | -- statement : Parser Stmt 1092 | 1093 | **Note:** If you need something more custom, do not be afraid to check 1094 | out the implementation and customize it for your case. It is better to 1095 | get nice error messages with a lower-level implementation than to try 1096 | to hack high-level parsers to do things they are not made for. 1097 | -} 1098 | sequence 1099 | : { start : String 1100 | , separator : String 1101 | , end : String 1102 | , spaces : Parser () 1103 | , item : Parser a 1104 | , trailing : Trailing 1105 | } 1106 | -> Parser (List a) 1107 | sequence i = 1108 | A.sequence 1109 | { start = toToken i.start 1110 | , separator = toToken i.separator 1111 | , end = toToken i.end 1112 | , spaces = i.spaces 1113 | , item = i.item 1114 | , trailing = toAdvancedTrailing i.trailing 1115 | } 1116 | 1117 | 1118 | {-| What’s the deal with trailing commas? Are they `Forbidden`? 1119 | Are they `Optional`? Are they `Mandatory`? Welcome to [shapes 1120 | club](https://poorlydrawnlines.com/comic/shapes-club/)! 
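
To make the three options concrete, here is a minimal sketch of a `sequence` that uses `Optional`. The `intList` name is hypothetical; the record fields are the ones `sequence` takes in this module.

```elm
import Parser exposing (Parser, Trailing(..), int, sequence, spaces)


-- Hypothetical example: a bracketed list of integers where a trailing
-- comma is accepted but not required.
intList : Parser (List Int)
intList =
    sequence
        { start = "["
        , separator = ","
        , end = "]"
        , spaces = spaces
        , item = int
        , trailing = Optional
        }
```

Switching `trailing` to `Forbidden` or `Mandatory` changes only whether a separator may, or must, appear before the closing `]`.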
1121 | -} 1122 | type Trailing = Forbidden | Optional | Mandatory 1123 | 1124 | 1125 | toAdvancedTrailing : Trailing -> A.Trailing 1126 | toAdvancedTrailing trailing = 1127 | case trailing of 1128 | Forbidden -> A.Forbidden 1129 | Optional -> A.Optional 1130 | Mandatory -> A.Mandatory 1131 | 1132 | 1133 | 1134 | -- WHITESPACE 1135 | 1136 | 1137 | {-| Parse zero or more `' '`, `'\n'`, and `'\r'` characters. 1138 | 1139 | The implementation is pretty simple: 1140 | 1141 | spaces : Parser () 1142 | spaces = 1143 | chompWhile (\c -> c == ' ' || c == '\n' || c == '\r') 1144 | 1145 | So if you need something different (like tabs) just define an alternative with 1146 | the necessary tweaks! Check out [`lineComment`](#lineComment) and 1147 | [`multiComment`](#multiComment) for more complex situations. 1148 | -} 1149 | spaces : Parser () 1150 | spaces = 1151 | A.spaces 1152 | 1153 | 1154 | {-| Parse single-line comments: 1155 | 1156 | elm : Parser () 1157 | elm = 1158 | lineComment "--" 1159 | 1160 | js : Parser () 1161 | js = 1162 | lineComment "//" 1163 | 1164 | python : Parser () 1165 | python = 1166 | lineComment "#" 1167 | 1168 | This parser is defined like this: 1169 | 1170 | lineComment : String -> Parser () 1171 | lineComment str = 1172 | symbol str 1173 | |. chompUntilEndOr "\n" 1174 | 1175 | So it will consume the remainder of the line. If the file ends before you see 1176 | a newline, that is fine too. 1177 | -} 1178 | lineComment : String -> Parser () 1179 | lineComment str = 1180 | A.lineComment (toToken str) 1181 | 1182 | 1183 | {-| Parse multi-line comments. 
So if you wanted to parse Elm whitespace or 1184 | JS whitespace, you could say: 1185 | 1186 | elm : Parser () 1187 | elm = 1188 | loop 0 <| ifProgress <| 1189 | oneOf 1190 | [ lineComment "--" 1191 | , multiComment "{-" "-}" Nestable 1192 | , spaces 1193 | ] 1194 | 1195 | js : Parser () 1196 | js = 1197 | loop 0 <| ifProgress <| 1198 | oneOf 1199 | [ lineComment "//" 1200 | , multiComment "/*" "*/" NotNestable 1201 | , chompWhile (\c -> c == ' ' || c == '\n' || c == '\r' || c == '\t') 1202 | ] 1203 | 1204 | ifProgress : Parser a -> Int -> Parser (Step Int ()) 1205 | ifProgress parser offset = 1206 | succeed identity 1207 | |. parser 1208 | |= getOffset 1209 | |> map (\newOffset -> if offset == newOffset then Done () else Loop newOffset) 1210 | 1211 | **Note:** The fact that `spaces` comes last in the definition of `elm` is very 1212 | important! It can succeed without consuming any characters, so if it were the 1213 | first option, it would always succeed and bypass the others! (Same is true of 1214 | `chompWhile` in `js`.) This possibility of success without consumption is also 1215 | why we need the `ifProgress` helper. It detects if there is no more whitespace 1216 | to consume. 1217 | -} 1218 | multiComment : String -> String -> Nestable -> Parser () 1219 | multiComment open close nestable = 1220 | A.multiComment (toToken open) (toToken close) (toAdvancedNestable nestable) 1221 | 1222 | 1223 | {-| Not all languages handle multi-line comments the same. Multi-line comments 1224 | in C-style syntax are `NotNestable`, meaning they can be implemented like this: 1225 | 1226 | js : Parser () 1227 | js = 1228 | symbol "/*" 1229 | |. chompUntil "*/" 1230 | 1231 | In fact, `multiComment "/*" "*/" NotNestable` *is* implemented like that!
It is 1232 | very simple, but it does not allow you to nest comments like this: 1233 | 1234 | ```javascript 1235 | /* 1236 | line1 1237 | /* line2 */ 1238 | line3 1239 | */ 1240 | ``` 1241 | 1242 | It would stop on the first `*/`, eventually throwing a syntax error on the 1243 | second `*/`. This can be pretty annoying in long files. 1244 | 1245 | Languages like Elm allow you to nest multi-line comments, but your parser needs 1246 | to be a bit fancier to handle this. After you start a comment, you have to 1247 | detect if there is another one inside it! And then you have to make sure all 1248 | the `{-` and `-}` match up properly! Saying `multiComment "{-" "-}" Nestable` 1249 | does all that for you. 1250 | -} 1251 | type Nestable = NotNestable | Nestable 1252 | 1253 | 1254 | toAdvancedNestable : Nestable -> A.Nestable 1255 | toAdvancedNestable nestable = 1256 | case nestable of 1257 | NotNestable -> A.NotNestable 1258 | Nestable -> A.Nestable 1259 | -------------------------------------------------------------------------------- /src/Parser/Advanced.elm: -------------------------------------------------------------------------------- 1 | module Parser.Advanced exposing 2 | ( Parser, run, DeadEnd, inContext, Token(..) 3 | , int, float, number, symbol, keyword, variable, end 4 | , succeed, (|=), (|.), lazy, andThen, problem 5 | , oneOf, map, backtrackable, commit, token 6 | , sequence, Trailing(..), loop, Step(..) 7 | , spaces, lineComment, multiComment, Nestable(..) 
8 | , getChompedString, chompIf, chompWhile, chompUntil, chompUntilEndOr, mapChompedString 9 | , withIndent, getIndent 10 | , getPosition, getRow, getCol, getOffset, getSource 11 | ) 12 | 13 | 14 | {-| 15 | 16 | # Parsers 17 | @docs Parser, run, DeadEnd, inContext, Token 18 | 19 | * * * 20 | **Everything past here works just like in the 21 | [`Parser`](/packages/elm/parser/latest/Parser) module, except that `String` 22 | arguments become `Token` arguments, and you need to provide a `Problem` for 23 | certain scenarios.** 24 | * * * 25 | 26 | # Building Blocks 27 | @docs int, float, number, symbol, keyword, variable, end 28 | 29 | # Pipelines 30 | @docs succeed, (|=), (|.), lazy, andThen, problem 31 | 32 | # Branches 33 | @docs oneOf, map, backtrackable, commit, token 34 | 35 | # Loops 36 | @docs sequence, Trailing, loop, Step 37 | 38 | # Whitespace 39 | @docs spaces, lineComment, multiComment, Nestable 40 | 41 | # Chompers 42 | @docs getChompedString, chompIf, chompWhile, chompUntil, chompUntilEndOr, mapChompedString 43 | 44 | # Indentation 45 | @docs withIndent, getIndent 46 | 47 | # Positions 48 | @docs getPosition, getRow, getCol, getOffset, getSource 49 | -} 50 | 51 | 52 | import Char 53 | import Elm.Kernel.Parser 54 | import Set 55 | 56 | 57 | 58 | -- INFIX OPERATORS 59 | 60 | 61 | infix left 5 (|=) = keeper 62 | infix left 6 (|.) = ignorer 63 | 64 | 65 | {- NOTE: the (|.) operator binds tighter to slightly reduce the amount 66 | of recursion in pipelines. For example: 67 | 68 | func 69 | |. a 70 | |. b 71 | |= c 72 | |. d 73 | |. e 74 | 75 | With the same precedence: 76 | 77 | (ignorer (ignorer (keeper (ignorer (ignorer func a) b) c) d) e) 78 | 79 | With higher precedence: 80 | 81 | keeper (ignorer (ignorer func a) b) (ignorer (ignorer c d) e) 82 | 83 | So the maximum call depth goes from 5 to 3.
84 | -} 85 | 86 | 87 | 88 | -- PARSERS 89 | 90 | 91 | {-| An advanced `Parser` gives two ways to improve your error messages: 92 | 93 | - `problem` — Instead of all errors being a `String`, you can create a 94 | custom type like `type Problem = BadIndent | BadKeyword String` and track 95 | problems much more precisely. 96 | - `context` — Error messages can be further improved when precise 97 | problems are paired with information about where you ran into trouble. By 98 | tracking the context, instead of saying “I found a bad keyword” you can say 99 | “I found a bad keyword when parsing a list” and give folks a better idea of 100 | what the parser thinks it is doing. 101 | 102 | I recommend starting with the simpler [`Parser`][parser] module though, and 103 | when you feel comfortable and want better error messages, you can create a type 104 | alias like this: 105 | 106 | ```elm 107 | import Parser.Advanced 108 | 109 | type alias MyParser a = 110 | Parser.Advanced.Parser Context Problem a 111 | 112 | type Context = Definition String | List | Record 113 | 114 | type Problem = BadIndent | BadKeyword String 115 | ``` 116 | 117 | All of the functions from `Parser` should exist in `Parser.Advanced` in some 118 | form, allowing you to switch over pretty easily. 
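
A minimal sketch of how such an alias might look in use. The `Pair` context and
the `Expecting...` problems below are hypothetical names chosen for this
illustration; they are not part of the library:

```elm
import Parser.Advanced as A exposing ((|.), (|=))


type alias MyParser a =
    A.Parser Context Problem a


-- Hypothetical context: we only track one kind of thing here.
type Context
    = Pair


-- Hypothetical problems: one per thing that can go wrong.
type Problem
    = ExpectingOpenParen
    | ExpectingComma
    | ExpectingCloseParen
    | ExpectingInt
    | InvalidNumber


-- Parse something like "( 3, 4 )" into an Elm tuple.
pair : MyParser ( Int, Int )
pair =
    A.inContext Pair <|
        A.succeed Tuple.pair
            |. A.symbol (A.Token "(" ExpectingOpenParen)
            |. A.spaces
            |= A.int ExpectingInt InvalidNumber
            |. A.spaces
            |. A.symbol (A.Token "," ExpectingComma)
            |. A.spaces
            |= A.int ExpectingInt InvalidNumber
            |. A.spaces
            |. A.symbol (A.Token ")" ExpectingCloseParen)
```

With these definitions, `A.run pair "( 3, 4 )"` should produce `Ok ( 3, 4 )`,
and any dead end produced inside `pair` carries `Pair` in its `contextStack`.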
119 | 120 | [parser]: /packages/elm/parser/latest/Parser 121 | -} 122 | type Parser context problem value = 123 | Parser (State context -> PStep context problem value) 124 | 125 | 126 | type PStep context problem value 127 | = Good Bool value (State context) 128 | | Bad Bool (Bag context problem) 129 | 130 | 131 | type alias State context = 132 | { src : String 133 | , offset : Int 134 | , indent : Int 135 | , context : List (Located context) 136 | , row : Int 137 | , col : Int 138 | } 139 | 140 | 141 | type alias Located context = 142 | { row : Int 143 | , col : Int 144 | , context : context 145 | } 146 | 147 | 148 | 149 | -- RUN 150 | 151 | 152 | {-| This works just like [`Parser.run`](/packages/elm/parser/latest/Parser#run). 153 | The only difference is that when it fails, it has much more precise information 154 | for each dead end. 155 | -} 156 | run : Parser c x a -> String -> Result (List (DeadEnd c x)) a 157 | run (Parser parse) src = 158 | case parse { src = src, offset = 0, indent = 1, context = [], row = 1, col = 1} of 159 | Good _ value _ -> 160 | Ok value 161 | 162 | Bad _ bag -> 163 | Err (bagToList bag []) 164 | 165 | 166 | 167 | -- PROBLEMS 168 | 169 | 170 | {-| Say you are parsing a function named `viewHealthData` that contains a list. 171 | You might get a `DeadEnd` like this: 172 | 173 | ```elm 174 | { row = 18 175 | , col = 22 176 | , problem = UnexpectedComma 177 | , contextStack = 178 | [ { row = 14 179 | , col = 1 180 | , context = Definition "viewHealthData" 181 | } 182 | , { row = 15 183 | , col = 4 184 | , context = List 185 | } 186 | ] 187 | } 188 | ``` 189 | 190 | We have a ton of information here! So in the error message, we can say that “I 191 | ran into an issue when parsing a list in the definition of `viewHealthData`. It 192 | looks like there is an extra comma.” Or maybe something even better! 193 | 194 | Furthermore, many parsers just put a mark where the problem manifested. 
By 195 | tracking the `row` and `col` of the context, we can show a much larger region 196 | as a way of indicating “I thought I was parsing this thing that starts over 197 | here.” Otherwise you can get very confusing error messages on a missing `]` or 198 | `}` or `)` because “I need more indentation” on something unrelated. 199 | 200 | **Note:** Rows and columns are counted like a text editor. The beginning is `row=1` 201 | and `col=1`. The `col` increments as characters are chomped. When a `\n` is chomped, 202 | `row` is incremented and `col` starts over again at `1`. 203 | -} 204 | type alias DeadEnd context problem = 205 | { row : Int 206 | , col : Int 207 | , problem : problem 208 | , contextStack : List { row : Int, col : Int, context : context } 209 | } 210 | 211 | 212 | type Bag c x 213 | = Empty 214 | | AddRight (Bag c x) (DeadEnd c x) 215 | | Append (Bag c x) (Bag c x) 216 | 217 | 218 | fromState : State c -> x -> Bag c x 219 | fromState s x = 220 | AddRight Empty (DeadEnd s.row s.col x s.context) 221 | 222 | 223 | fromInfo : Int -> Int -> x -> List (Located c) -> Bag c x 224 | fromInfo row col x context = 225 | AddRight Empty (DeadEnd row col x context) 226 | 227 | 228 | bagToList : Bag c x -> List (DeadEnd c x) -> List (DeadEnd c x) 229 | bagToList bag list = 230 | case bag of 231 | Empty -> 232 | list 233 | 234 | AddRight bag1 x -> 235 | bagToList bag1 (x :: list) 236 | 237 | Append bag1 bag2 -> 238 | bagToList bag1 (bagToList bag2 list) 239 | 240 | 241 | 242 | -- PRIMITIVES 243 | 244 | 245 | {-| Just like [`Parser.succeed`](Parser#succeed) 246 | -} 247 | succeed : a -> Parser c x a 248 | succeed a = 249 | Parser <| \s -> 250 | Good False a s 251 | 252 | 253 | {-| Just like [`Parser.problem`](Parser#problem) except you provide a custom 254 | type for your problem. 
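
For example, `problem` can reject a value that parsed fine but does not make
sense for your use case. This is only a sketch: the `Problem` type and its
`PercentOutOfRange` constructor are made up for illustration.

```elm
import Parser.Advanced as A


-- Hypothetical problem type for this example.
type Problem
    = ExpectingInt
    | InvalidNumber
    | PercentOutOfRange Int


-- Parse an integer, then insist that it is at most 100.
percent : A.Parser c Problem Int
percent =
    A.int ExpectingInt InvalidNumber
        |> A.andThen checkRange


checkRange : Int -> A.Parser c Problem Int
checkRange n =
    if n <= 100 then
        A.succeed n

    else
        -- The custom problem can carry the offending value along.
        A.problem (PercentOutOfRange n)
```

Because the problem is a custom type rather than a `String`, the code that
renders error messages can pattern match on `PercentOutOfRange` directly.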
255 | -} 256 | problem : x -> Parser c x a 257 | problem x = 258 | Parser <| \s -> 259 | Bad False (fromState s x) 260 | 261 | 262 | 263 | -- MAPPING 264 | 265 | 266 | {-| Just like [`Parser.map`](Parser#map) 267 | -} 268 | map : (a -> b) -> Parser c x a -> Parser c x b 269 | map func (Parser parse) = 270 | Parser <| \s0 -> 271 | case parse s0 of 272 | Good p a s1 -> 273 | Good p (func a) s1 274 | 275 | Bad p x -> 276 | Bad p x 277 | 278 | 279 | map2 : (a -> b -> value) -> Parser c x a -> Parser c x b -> Parser c x value 280 | map2 func (Parser parseA) (Parser parseB) = 281 | Parser <| \s0 -> 282 | case parseA s0 of 283 | Bad p x -> 284 | Bad p x 285 | 286 | Good p1 a s1 -> 287 | case parseB s1 of 288 | Bad p2 x -> 289 | Bad (p1 || p2) x 290 | 291 | Good p2 b s2 -> 292 | Good (p1 || p2) (func a b) s2 293 | 294 | 295 | {-| Just like the [`(|=)`](Parser#|=) from the `Parser` module. 296 | -} 297 | keeper : Parser c x (a -> b) -> Parser c x a -> Parser c x b 298 | keeper parseFunc parseArg = 299 | map2 (<|) parseFunc parseArg 300 | 301 | 302 | {-| Just like the [`(|.)`](Parser#|.) from the `Parser` module. 
303 | -} 304 | ignorer : Parser c x keep -> Parser c x ignore -> Parser c x keep 305 | ignorer keepParser ignoreParser = 306 | map2 always keepParser ignoreParser 307 | 308 | 309 | 310 | -- AND THEN 311 | 312 | 313 | {-| Just like [`Parser.andThen`](Parser#andThen) 314 | -} 315 | andThen : (a -> Parser c x b) -> Parser c x a -> Parser c x b 316 | andThen callback (Parser parseA) = 317 | Parser <| \s0 -> 318 | case parseA s0 of 319 | Bad p x -> 320 | Bad p x 321 | 322 | Good p1 a s1 -> 323 | let 324 | (Parser parseB) = 325 | callback a 326 | in 327 | case parseB s1 of 328 | Bad p2 x -> 329 | Bad (p1 || p2) x 330 | 331 | Good p2 b s2 -> 332 | Good (p1 || p2) b s2 333 | 334 | 335 | 336 | -- LAZY 337 | 338 | 339 | {-| Just like [`Parser.lazy`](Parser#lazy) 340 | -} 341 | lazy : (() -> Parser c x a) -> Parser c x a 342 | lazy thunk = 343 | Parser <| \s -> 344 | let 345 | (Parser parse) = 346 | thunk () 347 | in 348 | parse s 349 | 350 | 351 | 352 | -- ONE OF 353 | 354 | 355 | {-| Just like [`Parser.oneOf`](Parser#oneOf) 356 | -} 357 | oneOf : List (Parser c x a) -> Parser c x a 358 | oneOf parsers = 359 | Parser <| \s -> oneOfHelp s Empty parsers 360 | 361 | 362 | oneOfHelp : State c -> Bag c x -> List (Parser c x a) -> PStep c x a 363 | oneOfHelp s0 bag parsers = 364 | case parsers of 365 | [] -> 366 | Bad False bag 367 | 368 | Parser parse :: remainingParsers -> 369 | case parse s0 of 370 | Good _ _ _ as step -> 371 | step 372 | 373 | Bad p x as step -> 374 | if p then 375 | step 376 | else 377 | oneOfHelp s0 (Append bag x) remainingParsers 378 | 379 | 380 | 381 | -- LOOP 382 | 383 | 384 | {-| Just like [`Parser.Step`](Parser#Step) 385 | -} 386 | type Step state a 387 | = Loop state 388 | | Done a 389 | 390 | 391 | {-| Just like [`Parser.loop`](Parser#loop) 392 | -} 393 | loop : state -> (state -> Parser c x (Step state a)) -> Parser c x a 394 | loop state callback = 395 | Parser <| \s -> 396 | loopHelp False state callback s 397 | 398 | 399 | loopHelp : Bool -> state 
-> (state -> Parser c x (Step state a)) -> State c -> PStep c x a 400 | loopHelp p state callback s0 = 401 | let 402 | (Parser parse) = 403 | callback state 404 | in 405 | case parse s0 of 406 | Good p1 step s1 -> 407 | case step of 408 | Loop newState -> 409 | loopHelp (p || p1) newState callback s1 410 | 411 | Done result -> 412 | Good (p || p1) result s1 413 | 414 | Bad p1 x -> 415 | Bad (p || p1) x 416 | 417 | 418 | 419 | -- BACKTRACKABLE 420 | 421 | 422 | {-| Just like [`Parser.backtrackable`](Parser#backtrackable) 423 | -} 424 | backtrackable : Parser c x a -> Parser c x a 425 | backtrackable (Parser parse) = 426 | Parser <| \s0 -> 427 | case parse s0 of 428 | Bad _ x -> 429 | Bad False x 430 | 431 | Good _ a s1 -> 432 | Good False a s1 433 | 434 | 435 | {-| Just like [`Parser.commit`](Parser#commit) 436 | -} 437 | commit : a -> Parser c x a 438 | commit a = 439 | Parser <| \s -> Good True a s 440 | 441 | 442 | 443 | -- SYMBOL 444 | 445 | 446 | {-| Just like [`Parser.symbol`](Parser#symbol) except you provide a `Token` to 447 | clearly indicate your custom type of problems: 448 | 449 | comma : Parser Context Problem () 450 | comma = 451 | symbol (Token "," ExpectingComma) 452 | 453 | -} 454 | symbol : Token x -> Parser c x () 455 | symbol = 456 | token 457 | 458 | 459 | 460 | -- KEYWORD 461 | 462 | 463 | {-| Just like [`Parser.keyword`](Parser#keyword) except you provide a `Token` 464 | to clearly indicate your custom type of problems: 465 | 466 | let_ : Parser Context Problem () 467 | let_ = 468 | keyword (Token "let" ExpectingLet) 469 | 470 | Note that this would fail to chomp `letter` because of the subsequent 471 | characters. Use `token` if you do not want that last letter check.
472 | -} 473 | keyword : Token x -> Parser c x () 474 | keyword (Token kwd expecting) = 475 | let 476 | progress = 477 | not (String.isEmpty kwd) 478 | in 479 | Parser <| \s -> 480 | let 481 | (newOffset, newRow, newCol) = 482 | isSubString kwd s.offset s.row s.col s.src 483 | in 484 | if newOffset == -1 || 0 <= isSubChar (\c -> Char.isAlphaNum c || c == '_') newOffset s.src then 485 | Bad False (fromState s expecting) 486 | else 487 | Good progress () 488 | { src = s.src 489 | , offset = newOffset 490 | , indent = s.indent 491 | , context = s.context 492 | , row = newRow 493 | , col = newCol 494 | } 495 | 496 | 497 | 498 | -- TOKEN 499 | 500 | 501 | {-| With the simpler `Parser` module, you could just say `symbol ","` and 502 | parse all the commas you wanted. But now that we have a custom type for our 503 | problems, we actually have to specify that as well. So anywhere you just used 504 | a `String` in the simpler module, you now use a `Token Problem` in the advanced 505 | module: 506 | 507 | type Problem 508 | = ExpectingComma 509 | | ExpectingListEnd 510 | 511 | comma : Token Problem 512 | comma = 513 | Token "," ExpectingComma 514 | 515 | listEnd : Token Problem 516 | listEnd = 517 | Token "]" ExpectingListEnd 518 | 519 | You can be creative with your custom type. Maybe you want a lot of detail. 520 | Maybe you want looser categories. It is a custom type. Do what makes sense for 521 | you! 522 | -} 523 | type Token x = Token String x 524 | 525 | 526 | {-| Just like [`Parser.token`](Parser#token) except you provide a `Token` 527 | specifying your custom type of problems. 
528 | -} 529 | token : Token x -> Parser c x () 530 | token (Token str expecting) = 531 | let 532 | progress = 533 | not (String.isEmpty str) 534 | in 535 | Parser <| \s -> 536 | let 537 | (newOffset, newRow, newCol) = 538 | isSubString str s.offset s.row s.col s.src 539 | in 540 | if newOffset == -1 then 541 | Bad False (fromState s expecting) 542 | else 543 | Good progress () 544 | { src = s.src 545 | , offset = newOffset 546 | , indent = s.indent 547 | , context = s.context 548 | , row = newRow 549 | , col = newCol 550 | } 551 | 552 | 553 | 554 | -- INT 555 | 556 | 557 | {-| Just like [`Parser.int`](Parser#int) where you have to handle negation 558 | yourself. The only difference is that you provide two potential problems: 559 | 560 | int : x -> x -> Parser c x Int 561 | int expecting invalid = 562 | number 563 | { int = Ok identity 564 | , hex = Err invalid 565 | , octal = Err invalid 566 | , binary = Err invalid 567 | , float = Err invalid 568 | , invalid = invalid 569 | , expecting = expecting 570 | } 571 | 572 | You can use problems like `ExpectingInt` and `InvalidNumber`. 573 | -} 574 | int : x -> x -> Parser c x Int 575 | int expecting invalid = 576 | number 577 | { int = Ok identity 578 | , hex = Err invalid 579 | , octal = Err invalid 580 | , binary = Err invalid 581 | , float = Err invalid 582 | , invalid = invalid 583 | , expecting = expecting 584 | } 585 | 586 | 587 | 588 | -- FLOAT 589 | 590 | 591 | {-| Just like [`Parser.float`](Parser#float) where you have to handle negation 592 | yourself. The only difference is that you provide two potential problems: 593 | 594 | float : x -> x -> Parser c x Float 595 | float expecting invalid = 596 | number 597 | { int = Ok toFloat 598 | , hex = Err invalid 599 | , octal = Err invalid 600 | , binary = Err invalid 601 | , float = Ok identity 602 | , invalid = invalid 603 | , expecting = expecting 604 | } 605 | 606 | You can use problems like `ExpectingFloat` and `InvalidNumber`.
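
Since negation is left to you, one way to accept an optional leading minus sign
is sketched below. The `Problem` type is hypothetical, and reusing
`ExpectingFloat` for the missing `-` is just one possible choice:

```elm
import Parser.Advanced as A exposing ((|.), (|=))


-- Hypothetical problem type for this example.
type Problem
    = ExpectingFloat
    | InvalidNumber


-- Try "-" followed by a float first, then fall back to a plain float.
signedFloat : A.Parser c Problem Float
signedFloat =
    A.oneOf
        [ A.succeed negate
            |. A.symbol (A.Token "-" ExpectingFloat)
            |= A.float ExpectingFloat InvalidNumber
        , A.float ExpectingFloat InvalidNumber
        ]
```

If there is no `-`, the first branch fails without consuming anything, so
`oneOf` moves on to the second branch.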
607 | -} 608 | float : x -> x -> Parser c x Float 609 | float expecting invalid = 610 | number 611 | { int = Ok toFloat 612 | , hex = Err invalid 613 | , octal = Err invalid 614 | , binary = Err invalid 615 | , float = Ok identity 616 | , invalid = invalid 617 | , expecting = expecting 618 | } 619 | 620 | 621 | 622 | -- NUMBER 623 | 624 | 625 | {-| Just like [`Parser.number`](Parser#number) where you have to handle 626 | negation yourself. The only difference is that you provide all the potential 627 | problems. 628 | -} 629 | number 630 | : { int : Result x (Int -> a) 631 | , hex : Result x (Int -> a) 632 | , octal : Result x (Int -> a) 633 | , binary : Result x (Int -> a) 634 | , float : Result x (Float -> a) 635 | , invalid : x 636 | , expecting : x 637 | } 638 | -> Parser c x a 639 | number c = 640 | Parser <| \s -> 641 | if isAsciiCode 0x30 {- 0 -} s.offset s.src then 642 | let 643 | zeroOffset = s.offset + 1 644 | baseOffset = zeroOffset + 1 645 | in 646 | if isAsciiCode 0x78 {- x -} zeroOffset s.src then 647 | finalizeInt c.invalid c.hex baseOffset (consumeBase16 baseOffset s.src) s 648 | else if isAsciiCode 0x6F {- o -} zeroOffset s.src then 649 | finalizeInt c.invalid c.octal baseOffset (consumeBase 8 baseOffset s.src) s 650 | else if isAsciiCode 0x62 {- b -} zeroOffset s.src then 651 | finalizeInt c.invalid c.binary baseOffset (consumeBase 2 baseOffset s.src) s 652 | else 653 | finalizeFloat c.invalid c.expecting c.int c.float (zeroOffset, 0) s 654 | 655 | else 656 | finalizeFloat c.invalid c.expecting c.int c.float (consumeBase 10 s.offset s.src) s 657 | 658 | 659 | consumeBase : Int -> Int -> String -> (Int, Int) 660 | consumeBase = 661 | Elm.Kernel.Parser.consumeBase 662 | 663 | 664 | consumeBase16 : Int -> String -> (Int, Int) 665 | consumeBase16 = 666 | Elm.Kernel.Parser.consumeBase16 667 | 668 | 669 | finalizeInt : x -> Result x (Int -> a) -> Int -> (Int, Int) -> State c -> PStep c x a 670 | finalizeInt invalid handler startOffset (endOffset, n) s = 
671 | case handler of 672 | Err x -> 673 | Bad True (fromState s x) 674 | 675 | Ok toValue -> 676 | if startOffset == endOffset 677 | then Bad (s.offset < startOffset) (fromState s invalid) 678 | else Good True (toValue n) (bumpOffset endOffset s) 679 | 680 | 681 | bumpOffset : Int -> State c -> State c 682 | bumpOffset newOffset s = 683 | { src = s.src 684 | , offset = newOffset 685 | , indent = s.indent 686 | , context = s.context 687 | , row = s.row 688 | , col = s.col + (newOffset - s.offset) 689 | } 690 | 691 | 692 | finalizeFloat : x -> x -> Result x (Int -> a) -> Result x (Float -> a) -> (Int, Int) -> State c -> PStep c x a 693 | finalizeFloat invalid expecting intSettings floatSettings intPair s = 694 | let 695 | intOffset = Tuple.first intPair 696 | floatOffset = consumeDotAndExp intOffset s.src 697 | in 698 | if floatOffset < 0 then 699 | Bad True (fromInfo s.row (s.col - (floatOffset + s.offset)) invalid s.context) 700 | 701 | else if s.offset == floatOffset then 702 | Bad False (fromState s expecting) 703 | 704 | else if intOffset == floatOffset then 705 | finalizeInt invalid intSettings s.offset intPair s 706 | 707 | else 708 | case floatSettings of 709 | Err x -> 710 | Bad True (fromState s invalid) 711 | 712 | Ok toValue -> 713 | case String.toFloat (String.slice s.offset floatOffset s.src) of 714 | Nothing -> Bad True (fromState s invalid) 715 | Just n -> Good True (toValue n) (bumpOffset floatOffset s) 716 | 717 | 718 | -- 719 | -- On a failure, returns negative index of problem. 720 | -- 721 | consumeDotAndExp : Int -> String -> Int 722 | consumeDotAndExp offset src = 723 | if isAsciiCode 0x2E {- . -} offset src then 724 | consumeExp (chompBase10 (offset + 1) src) src 725 | else 726 | consumeExp offset src 727 | 728 | 729 | -- 730 | -- On a failure, returns negative index of problem. 
731 | -- 732 | consumeExp : Int -> String -> Int 733 | consumeExp offset src = 734 | if isAsciiCode 0x65 {- e -} offset src || isAsciiCode 0x45 {- E -} offset src then 735 | let 736 | eOffset = offset + 1 737 | 738 | expOffset = 739 | if isAsciiCode 0x2B {- + -} eOffset src || isAsciiCode 0x2D {- - -} eOffset src then 740 | eOffset + 1 741 | else 742 | eOffset 743 | 744 | newOffset = chompBase10 expOffset src 745 | in 746 | if expOffset == newOffset then 747 | -newOffset 748 | else 749 | newOffset 750 | 751 | else 752 | offset 753 | 754 | 755 | chompBase10 : Int -> String -> Int 756 | chompBase10 = 757 | Elm.Kernel.Parser.chompBase10 758 | 759 | 760 | 761 | -- END 762 | 763 | 764 | {-| Just like [`Parser.end`](Parser#end) except you provide the problem that 765 | arises when the parser is not at the end of the input. 766 | -} 767 | end : x -> Parser c x () 768 | end x = 769 | Parser <| \s -> 770 | if String.length s.src == s.offset then 771 | Good False () s 772 | else 773 | Bad False (fromState s x) 774 | 775 | 776 | 777 | -- CHOMPED STRINGS 778 | 779 | 780 | {-| Just like [`Parser.getChompedString`](Parser#getChompedString) 781 | -} 782 | getChompedString : Parser c x a -> Parser c x String 783 | getChompedString parser = 784 | mapChompedString always parser 785 | 786 | 787 | {-| Just like [`Parser.mapChompedString`](Parser#mapChompedString) 788 | -} 789 | mapChompedString : (String -> a -> b) -> Parser c x a -> Parser c x b 790 | mapChompedString func (Parser parse) = 791 | Parser <| \s0 -> 792 | case parse s0 of 793 | Bad p x -> 794 | Bad p x 795 | 796 | Good p a s1 -> 797 | Good p (func (String.slice s0.offset s1.offset s0.src) a) s1 798 | 799 | 800 | 801 | -- CHOMP IF 802 | 803 | 804 | {-| Just like [`Parser.chompIf`](Parser#chompIf) except you provide a problem 805 | in case a character cannot be chomped. 
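
As a small sketch, `chompIf` pairs nicely with `chompWhile` and
`getChompedString` when the first character has different rules than the rest.
The `ExpectingUpper` problem is a made-up name for this example:

```elm
import Parser.Advanced as A exposing ((|.))


-- Hypothetical problem type for this example.
type Problem
    = ExpectingUpper


-- One uppercase letter, followed by zero or more lowercase letters.
capitalized : A.Parser c Problem String
capitalized =
    A.getChompedString <|
        A.succeed ()
            |. A.chompIf Char.isUpper ExpectingUpper
            |. A.chompWhile Char.isLower
```

Only `chompIf` needs a problem here, because `chompWhile` is happy to chomp
zero characters and so cannot fail.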
806 | -} 807 | chompIf : (Char -> Bool) -> x -> Parser c x () 808 | chompIf isGood expecting = 809 | Parser <| \s -> 810 | let 811 | newOffset = isSubChar isGood s.offset s.src 812 | in 813 | -- not found 814 | if newOffset == -1 then 815 | Bad False (fromState s expecting) 816 | 817 | -- newline 818 | else if newOffset == -2 then 819 | Good True () 820 | { src = s.src 821 | , offset = s.offset + 1 822 | , indent = s.indent 823 | , context = s.context 824 | , row = s.row + 1 825 | , col = 1 826 | } 827 | 828 | -- found 829 | else 830 | Good True () 831 | { src = s.src 832 | , offset = newOffset 833 | , indent = s.indent 834 | , context = s.context 835 | , row = s.row 836 | , col = s.col + 1 837 | } 838 | 839 | 840 | 841 | -- CHOMP WHILE 842 | 843 | 844 | {-| Just like [`Parser.chompWhile`](Parser#chompWhile) 845 | -} 846 | chompWhile : (Char -> Bool) -> Parser c x () 847 | chompWhile isGood = 848 | Parser <| \s -> 849 | chompWhileHelp isGood s.offset s.row s.col s 850 | 851 | 852 | chompWhileHelp : (Char -> Bool) -> Int -> Int -> Int -> State c -> PStep c x () 853 | chompWhileHelp isGood offset row col s0 = 854 | let 855 | newOffset = isSubChar isGood offset s0.src 856 | in 857 | -- no match 858 | if newOffset == -1 then 859 | Good (s0.offset < offset) () 860 | { src = s0.src 861 | , offset = offset 862 | , indent = s0.indent 863 | , context = s0.context 864 | , row = row 865 | , col = col 866 | } 867 | 868 | -- matched a newline 869 | else if newOffset == -2 then 870 | chompWhileHelp isGood (offset + 1) (row + 1) 1 s0 871 | 872 | -- normal match 873 | else 874 | chompWhileHelp isGood newOffset row (col + 1) s0 875 | 876 | 877 | 878 | -- CHOMP UNTIL 879 | 880 | 881 | {-| Just like [`Parser.chompUntil`](Parser#chompUntil) except you provide a 882 | `Token` in case you chomp all the way to the end of the input without finding 883 | what you need. 
884 | -} 885 | chompUntil : Token x -> Parser c x () 886 | chompUntil (Token str expecting) = 887 | Parser <| \s -> 888 | let 889 | (newOffset, newRow, newCol) = 890 | findSubString str s.offset s.row s.col s.src 891 | in 892 | if newOffset == -1 then 893 | Bad False (fromInfo newRow newCol expecting s.context) 894 | 895 | else 896 | Good (s.offset < newOffset) () 897 | { src = s.src 898 | , offset = newOffset 899 | , indent = s.indent 900 | , context = s.context 901 | , row = newRow 902 | , col = newCol 903 | } 904 | 905 | 906 | {-| Just like [`Parser.chompUntilEndOr`](Parser#chompUntilEndOr) 907 | -} 908 | chompUntilEndOr : String -> Parser c x () 909 | chompUntilEndOr str = 910 | Parser <| \s -> 911 | let 912 | (newOffset, newRow, newCol) = 913 | Elm.Kernel.Parser.findSubString str s.offset s.row s.col s.src 914 | 915 | adjustedOffset = 916 | if newOffset < 0 then String.length s.src else newOffset 917 | in 918 | Good (s.offset < adjustedOffset) () 919 | { src = s.src 920 | , offset = adjustedOffset 921 | , indent = s.indent 922 | , context = s.context 923 | , row = newRow 924 | , col = newCol 925 | } 926 | 927 | 928 | 929 | -- CONTEXT 930 | 931 | 932 | {-| This is how you mark that you are in a certain context. For example, here 933 | is a rough outline of some code that uses `inContext` to mark when you are 934 | parsing a specific definition: 935 | 936 | import Char 937 | import Parser.Advanced exposing (..) 938 | import Set 939 | 940 | type Context 941 | = Definition String 942 | | List 943 | 944 | definition : Parser Context Problem Expr 945 | definition = 946 | functionName 947 | |> andThen definitionBody 948 | 949 | definitionBody : String -> Parser Context Problem Expr 950 | definitionBody name = 951 | inContext (Definition name) <| 952 | succeed (Function name) 953 | |= arguments 954 | |. 
symbol (Token "=" ExpectingEquals) 955 | |= expression 956 | 957 | functionName : Parser c Problem String 958 | functionName = 959 | variable 960 | { start = Char.isLower 961 | , inner = Char.isAlphaNum 962 | , reserved = Set.fromList ["let","in"] 963 | , expecting = ExpectingFunctionName 964 | } 965 | 966 | First we parse the function name, and then we parse the rest of the definition. 967 | Importantly, we call `inContext` so that any dead end that occurs in 968 | `definitionBody` will get this extra context information. That way you can say 969 | things like, “I was expecting an equals sign in the `view` definition.” Context! 970 | -} 971 | inContext : context -> Parser context x a -> Parser context x a 972 | inContext context (Parser parse) = 973 | Parser <| \s0 -> 974 | case parse (changeContext (Located s0.row s0.col context :: s0.context) s0) of 975 | Good p a s1 -> 976 | Good p a (changeContext s0.context s1) 977 | 978 | Bad _ _ as step -> 979 | step 980 | 981 | 982 | changeContext : List (Located c) -> State c -> State c 983 | changeContext newContext s = 984 | { src = s.src 985 | , offset = s.offset 986 | , indent = s.indent 987 | , context = newContext 988 | , row = s.row 989 | , col = s.col 990 | } 991 | 992 | 993 | 994 | -- INDENTATION 995 | 996 | 997 | {-| Just like [`Parser.getIndent`](Parser#getIndent) 998 | -} 999 | getIndent : Parser c x Int 1000 | getIndent = 1001 | Parser <| \s -> Good False s.indent s 1002 | 1003 | 1004 | {-| Just like [`Parser.withIndent`](Parser#withIndent) 1005 | -} 1006 | withIndent : Int -> Parser c x a -> Parser c x a 1007 | withIndent newIndent (Parser parse) = 1008 | Parser <| \s0 -> 1009 | case parse (changeIndent newIndent s0) of 1010 | Good p a s1 -> 1011 | Good p a (changeIndent s0.indent s1) 1012 | 1013 | Bad p x -> 1014 | Bad p x 1015 | 1016 | 1017 | changeIndent : Int -> State c -> State c 1018 | changeIndent newIndent s = 1019 | { src = s.src 1020 | , offset = s.offset 1021 | , indent = newIndent 1022 | , 
context = s.context 1023 | , row = s.row 1024 | , col = s.col 1025 | } 1026 | 1027 | 1028 | 1029 | -- POSITION 1030 | 1031 | 1032 | {-| Just like [`Parser.getPosition`](Parser#getPosition) 1033 | -} 1034 | getPosition : Parser c x (Int, Int) 1035 | getPosition = 1036 | Parser <| \s -> Good False (s.row, s.col) s 1037 | 1038 | 1039 | {-| Just like [`Parser.getRow`](Parser#getRow) 1040 | -} 1041 | getRow : Parser c x Int 1042 | getRow = 1043 | Parser <| \s -> Good False s.row s 1044 | 1045 | 1046 | {-| Just like [`Parser.getCol`](Parser#getCol) 1047 | -} 1048 | getCol : Parser c x Int 1049 | getCol = 1050 | Parser <| \s -> Good False s.col s 1051 | 1052 | 1053 | {-| Just like [`Parser.getOffset`](Parser#getOffset) 1054 | -} 1055 | getOffset : Parser c x Int 1056 | getOffset = 1057 | Parser <| \s -> Good False s.offset s 1058 | 1059 | 1060 | {-| Just like [`Parser.getSource`](Parser#getSource) 1061 | -} 1062 | getSource : Parser c x String 1063 | getSource = 1064 | Parser <| \s -> Good False s.src s 1065 | 1066 | 1067 | 1068 | -- LOW-LEVEL HELPERS 1069 | 1070 | 1071 | {-| When making a fast parser, you want to avoid allocation as much as 1072 | possible. That means you never want to mess with the source string, only 1073 | keep track of an offset into that string. 1074 | 1075 | You use `isSubString` like this: 1076 | 1077 | isSubString "let" offset row col "let x = 4 in x" 1078 | --==> ( newOffset, newRow, newCol ) 1079 | 1080 | You are looking for `"let"` at a given `offset`. On failure, the 1081 | `newOffset` is `-1`. On success, the `newOffset` is the new offset. With 1082 | our `"let"` example, it would be `offset + 3`. 1083 | 1084 | You also provide the current `row` and `col` which do not align with 1085 | `offset` in a clean way. For example, when you see a `\n` you are at 1086 | `row = row + 1` and `col = 1`. Furthermore, some UTF16 characters are 1087 | two words wide, so even if there are no newlines, `offset` and `col` 1088 | may not be equal. 
1089 | -} 1090 | isSubString : String -> Int -> Int -> Int -> String -> (Int, Int, Int) 1091 | isSubString = 1092 | Elm.Kernel.Parser.isSubString 1093 | 1094 | 1095 | {-| Again, when parsing, you want to allocate as little as possible. 1096 | So this function lets you say: 1097 | 1098 | isSubChar isSpace offset "this is the source string" 1099 | --==> newOffset 1100 | 1101 | The `(Char -> Bool)` argument is called a predicate. 1102 | The `newOffset` value can be a few different things: 1103 | 1104 | - `-1` means that the predicate failed 1105 | - `-2` means the predicate succeeded with a `\n` 1106 | - otherwise you will get `offset + 1` or `offset + 2` 1107 | depending on whether the UTF16 character is one or two 1108 | words wide. 1109 | -} 1110 | isSubChar : (Char -> Bool) -> Int -> String -> Int 1111 | isSubChar = 1112 | Elm.Kernel.Parser.isSubChar 1113 | 1114 | 1115 | {-| Check an offset in the string. Is it equal to the given Char? Are they 1116 | both ASCII characters? 1117 | -} 1118 | isAsciiCode : Int -> Int -> String -> Bool 1119 | isAsciiCode = 1120 | Elm.Kernel.Parser.isAsciiCode 1121 | 1122 | 1123 | {-| Find a substring after a given offset. 1124 | 1125 | findSubString "42" offset row col "Is 42 the answer?" 1126 | --==> (newOffset, newRow, newCol) 1127 | 1128 | If `offset = 0` we would get `(3, 1, 4)` 1129 | If `offset = 7` we would get `(-1, 1, 18)` 1130 | -} 1131 | findSubString : String -> Int -> Int -> Int -> String -> (Int, Int, Int) 1132 | findSubString = 1133 | Elm.Kernel.Parser.findSubString 1134 | 1135 | 1136 | 1137 | -- VARIABLES 1138 | 1139 | 1140 | {-| Just like [`Parser.variable`](Parser#variable) except you specify the 1141 | problem yourself. 
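
A minimal sketch of a lowercase identifier parser. The `ExpectingIdentifier`
problem and the particular reserved words are assumptions made for this
example:

```elm
import Parser.Advanced as A
import Set


-- Hypothetical problem type for this example.
type Problem
    = ExpectingIdentifier


identifier : A.Parser c Problem String
identifier =
    A.variable
        { start = Char.isLower
        , inner = \c -> Char.isAlphaNum c || c == '_'
        , reserved = Set.fromList [ "let", "in", "case", "of" ]
        , expecting = ExpectingIdentifier
        }
```

Note that a reserved word is reported with the same `expecting` problem as a
missing variable, so pick a problem that reads well in both situations.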
-}
variable :
  { start : Char -> Bool
  , inner : Char -> Bool
  , reserved : Set.Set String
  , expecting : x
  }
  -> Parser c x String
variable i =
  Parser <| \s ->
    let
      firstOffset =
        isSubChar i.start s.offset s.src
    in
    if firstOffset == -1 then
      Bad False (fromState s i.expecting)
    else
      let
        s1 =
          if firstOffset == -2 then
            varHelp i.inner (s.offset + 1) (s.row + 1) 1 s.src s.indent s.context
          else
            varHelp i.inner firstOffset s.row (s.col + 1) s.src s.indent s.context

        name =
          String.slice s.offset s1.offset s.src
      in
      if Set.member name i.reserved then
        Bad False (fromState s i.expecting)
      else
        Good True name s1


varHelp : (Char -> Bool) -> Int -> Int -> Int -> String -> Int -> List (Located c) -> State c
varHelp isGood offset row col src indent context =
  let
    newOffset = isSubChar isGood offset src
  in
  if newOffset == -1 then
    { src = src
    , offset = offset
    , indent = indent
    , context = context
    , row = row
    , col = col
    }

  else if newOffset == -2 then
    varHelp isGood (offset + 1) (row + 1) 1 src indent context

  else
    varHelp isGood newOffset row (col + 1) src indent context



-- SEQUENCES


{-| Just like [`Parser.sequence`](Parser#sequence) except with a `Token` for
the start, separator, and end. That way you can specify your custom type of
problem for when something is not found.
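
As a rough illustration, a bracketed list of integers could be described as
below. The `Problem` type, its constructors, and the `intList` name are
hypothetical and exist only for this sketch.

```elm
import Parser.Advanced exposing (Parser, Token(..), Trailing(..), int, sequence, spaces)


-- Hypothetical problem type for this sketch.
type Problem
  = ExpectingOpenBracket
  | ExpectingComma
  | ExpectingCloseBracket
  | ExpectingInt
  | InvalidInt


-- Describes lists like `[ 1, 2, 3 ]` where a trailing comma is not allowed.
intList : Parser c Problem (List Int)
intList =
  sequence
    { start = Token "[" ExpectingOpenBracket
    , separator = Token "," ExpectingComma
    , end = Token "]" ExpectingCloseBracket
    , spaces = spaces
    , item = int ExpectingInt InvalidInt
    , trailing = Forbidden
    }
```

Each `Token` pairs the literal string with the problem to report when that
string is not found, so a missing `]` can be reported differently from a
missing `,`.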
-}
sequence
  : { start : Token x
    , separator : Token x
    , end : Token x
    , spaces : Parser c x ()
    , item : Parser c x a
    , trailing : Trailing
    }
  -> Parser c x (List a)
sequence i =
  skip (token i.start) <|
  skip i.spaces <|
    sequenceEnd (token i.end) i.spaces i.item (token i.separator) i.trailing


{-| What’s the deal with trailing commas? Are they `Forbidden`?
Are they `Optional`? Are they `Mandatory`? Welcome to [shapes
club](https://poorlydrawnlines.com/comic/shapes-club/)!
-}
type Trailing = Forbidden | Optional | Mandatory


skip : Parser c x ignore -> Parser c x keep -> Parser c x keep
skip iParser kParser =
  map2 revAlways iParser kParser


revAlways : a -> b -> b
revAlways _ b =
  b


sequenceEnd : Parser c x () -> Parser c x () -> Parser c x a -> Parser c x () -> Trailing -> Parser c x (List a)
sequenceEnd ender ws parseItem sep trailing =
  let
    chompRest item =
      case trailing of
        Forbidden ->
          loop [item] (sequenceEndForbidden ender ws parseItem sep)

        Optional ->
          loop [item] (sequenceEndOptional ender ws parseItem sep)

        Mandatory ->
          ignorer
            ( skip ws <| skip sep <| skip ws <|
                loop [item] (sequenceEndMandatory ws parseItem sep)
            )
            ender
  in
  oneOf
    [ parseItem |> andThen chompRest
    , ender |> map (\_ -> [])
    ]


sequenceEndForbidden : Parser c x () -> Parser c x () -> Parser c x a -> Parser c x () -> List a -> Parser c x (Step (List a) (List a))
sequenceEndForbidden ender ws parseItem sep revItems =
  let
    chompRest item =
      sequenceEndForbidden ender ws parseItem sep (item :: revItems)
  in
  skip ws <|
    oneOf
      [ skip sep <| skip ws <| map
          (\item -> Loop (item :: revItems)) parseItem
      , ender |> map (\_ -> Done (List.reverse revItems))
      ]


sequenceEndOptional : Parser c x () -> Parser c x () -> Parser c x a -> Parser c x () -> List a -> Parser c x (Step (List a) (List a))
sequenceEndOptional ender ws parseItem sep revItems =
  let
    parseEnd =
      map (\_ -> Done (List.reverse revItems)) ender
  in
  skip ws <|
    oneOf
      [ skip sep <| skip ws <|
          oneOf
            [ parseItem |> map (\item -> Loop (item :: revItems))
            , parseEnd
            ]
      , parseEnd
      ]


sequenceEndMandatory : Parser c x () -> Parser c x a -> Parser c x () -> List a -> Parser c x (Step (List a) (List a))
sequenceEndMandatory ws parseItem sep revItems =
  oneOf
    [ map (\item -> Loop (item :: revItems)) <|
        ignorer parseItem (ignorer ws (ignorer sep ws))
    , map (\_ -> Done (List.reverse revItems)) (succeed ())
    ]



-- WHITESPACE


{-| Just like [`Parser.spaces`](Parser#spaces)
-}
spaces : Parser c x ()
spaces =
  chompWhile (\c -> c == ' ' || c == '\n' || c == '\r')


{-| Just like [`Parser.lineComment`](Parser#lineComment) except you provide a
`Token` describing the starting symbol.
-}
lineComment : Token x -> Parser c x ()
lineComment start =
  ignorer (token start) (chompUntilEndOr "\n")


{-| Just like [`Parser.multiComment`](Parser#multiComment) except with a
`Token` for the open and close symbols.
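
A minimal sketch of how the two tokens might be supplied for JS-style block
comments follows. The `Problem` type, its constructors, and the `jsComment`
name are hypothetical and only serve this example.

```elm
import Parser.Advanced exposing (Nestable(..), Parser, Token(..), multiComment)


-- Hypothetical problem type for this sketch.
type Problem
  = ExpectingCommentOpen
  | ExpectingCommentClose


-- JS-style block comments, which do not nest.
jsComment : Parser c Problem ()
jsComment =
  multiComment
    (Token "/*" ExpectingCommentOpen)
    (Token "*/" ExpectingCommentClose)
    NotNestable
```

Passing `Nestable` instead selects the nesting-aware path, which tracks how
many open tokens are still waiting for a matching close token.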
-}
multiComment : Token x -> Token x -> Nestable -> Parser c x ()
multiComment open close nestable =
  case nestable of
    NotNestable ->
      ignorer (token open) (chompUntil close)

    Nestable ->
      nestableComment open close


{-| Works just like [`Parser.Nestable`](Parser#Nestable) to help distinguish
between unnestable `/*` `*/` comments like in JS and nestable `{-` `-}`
comments like in Elm.
-}
type Nestable = NotNestable | Nestable


nestableComment : Token x -> Token x -> Parser c x ()
nestableComment (Token oStr oX as open) (Token cStr cX as close) =
  case String.uncons oStr of
    Nothing ->
      problem oX

    Just (openChar, _) ->
      case String.uncons cStr of
        Nothing ->
          problem cX

        Just (closeChar, _) ->
          let
            isNotRelevant char =
              char /= openChar && char /= closeChar

            chompOpen =
              token open
          in
          ignorer chompOpen (nestableHelp isNotRelevant chompOpen (token close) cX 1)


nestableHelp : (Char -> Bool) -> Parser c x () -> Parser c x () -> x -> Int -> Parser c x ()
nestableHelp isNotRelevant open close expectingClose nestLevel =
  skip (chompWhile isNotRelevant) <|
    oneOf
      [ if nestLevel == 1 then
          close
        else
          close
            |> andThen (\_ -> nestableHelp isNotRelevant open close expectingClose (nestLevel - 1))
      , open
          |> andThen (\_ -> nestableHelp isNotRelevant open close expectingClose (nestLevel + 1))
      , chompIf isChar expectingClose
          |> andThen (\_ -> nestableHelp isNotRelevant open close expectingClose nestLevel)
      ]


isChar : Char -> Bool
isChar char =
  True

--------------------------------------------------------------------------------