├── .gitignore ├── README.md ├── TODO ├── doc └── intro.md ├── project.clj ├── src-java └── com │ └── champbacon │ └── pex │ ├── CharMatcher.java │ ├── PEGMatcher.java │ ├── ParseAction.java │ ├── ParsingExpressionGrammar.java │ ├── ValueStackManip.java │ └── impl │ ├── Actions.java │ ├── Matchers.java │ ├── OpCodes.java │ ├── PEGByteCodeVM.java │ └── StackEntry.java ├── src └── com │ └── champbacon │ ├── pex.clj │ └── pex │ ├── examples │ ├── csv.clj │ └── json.clj │ └── impl │ ├── codegen.clj │ └── tree.clj └── test └── com └── champbacon └── pex_test.clj /.gitignore: -------------------------------------------------------------------------------- 1 | /target 2 | /classes 3 | /checkouts 4 | pom.xml 5 | pom.xml.asc 6 | *.jar 7 | *.class 8 | /.lein-* 9 | /.nrepl-port 10 | .hgignore 11 | .hg/ 12 | -------------------------------------------------------------------------------- /README.md: -------------------------------------------------------------------------------- 1 | # pex, a parsing library 2 | 3 | This is super-alpha. 4 | 5 | ### Rationale 6 | 7 | PEGs (Parsing Expression Grammars) are more powerful than regexes, compose better, and are expressible using PODS (Plain Ol' Data Structures) 8 | 9 | Deterministic parsers are a simpler model than those that produce ambiguity. 10 | 11 | `pex` is implemented with a Virtual Machine, just like its inspiration [LPEG](http://www.inf.puc-rio.br/~roberto/lpeg/). 12 | 13 | ### Fundamentals 14 | 15 | Grammars are input as a quoted datastructure, just like Datomic queries. 16 | 17 | ```clj 18 | (def Number '{number [digits (? fractional) (? exponent)] 19 | fractional ["." digits] 20 | exponent ["e" (? (/ "+" "-")) digits] 21 | digits [(class num) (* (class num))]}) 22 | ``` 23 | 24 | The left hand side of the map is the name of the rule, the right hand side is the definition. 25 | Any bare symbol inside a definition is a call to that rule. Calls are *not* applied with parentheses. 
26 | Parentheses denote some special behavior. 27 | 28 | Grammars are then compiled, like a java.util.regex.Pattern. 29 | The compiled grammar can then be run upon inputs. 30 | 31 | ### Rule Fundamentals 32 | 33 | String and char literals match... literally: 34 | ```clj 35 | "foo" 36 | ``` 37 | 38 | Ordered Choice is the most important operation in a PEG. Rule `B` will be attempted only if `A` fails: 39 | ```clj 40 | (/ A B) 41 | ``` 42 | 43 | Note that `A` cannot be a prefix of `B` if you want `B` to ever match: 44 | ```clj 45 | ;; invalid, foo will always win over foobar 46 | (/ "foo" "foobar") 47 | ``` 48 | 49 | Vectors sequence rules together. If you want `A`, `B`, and `C` to succeed sequentially: 50 | ```clj 51 | [A B C] 52 | ``` 53 | 54 | ### The Value Stack 55 | 56 | TODO 57 | 58 | `capture` places the region matched by the rule on the Value Stack: 59 | 60 | ```clj 61 | (capture integer (? fractional) (? exponent)) 62 | ``` 63 | 64 | ### Char Matchers 65 | 66 | `class` refers symbolically to a matcher for a particular character class. 67 | 68 | ```clj 69 | ["42" (class alpha)] 70 | ``` 71 | 72 | There are several helpers that build up character classes. Each character class must be passed into `pex/compile` as a matcher. `TODO` Elaborate 73 | 74 | ### Optionality or Repetition 75 | 76 | `?` is an optional rule: 77 | ```clj 78 | (? b) 79 | ``` 80 | 81 | `*` is repetition, 0 or more times: 82 | ```clj 83 | (* foo) 84 | ``` 85 | 86 | The typical way to match separator-delimited things: 87 | ```clj 88 | [pattern (* separator pattern)] 89 | ``` 90 | ### Parse Actions 91 | 92 | `action` refers to a parse action, immediately invoking it. 93 | ```clj 94 | (action make-integer) 95 | ``` 96 | Actions can manipulate the Value Stack by reducing over items captured, 97 | updating the last item captured, or pushing a value. 98 | 99 | There are also a few pre-built actions that access an efficient StringBuffer for mutation while building up Strings. 
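A minimal sketch of the three kinds of action, built with the constructor functions in `com.champbacon.pex` (`push`, `update-stack-top`, `fold-cap`). The map keys and the `Long/parseLong` conversion are illustrative assumptions, not part of the library. The sketch also assumes, following the JSON example in this repository, that `(action make-integer)` resolves to the `:make-integer` key of the actions map passed to `pex/compile`, and that `fold-cap` takes a reducing function with init, completion, and step arities.

```clj
(require '[com.champbacon.pex :as pex])

;; Hypothetical actions map; only the pex/* functions come from the library.
(def actions
  {;; update the last item captured
   :make-integer (pex/update-stack-top #(Long/parseLong ^String %))
   ;; push a constant value
   :push-true    (pex/push true)
   ;; reduce over the items captured
   :collect      (pex/fold-cap (fn
                                 ([] (transient []))              ; init
                                 ([acc] (persistent! acc))        ; completion
                                 ([acc item] (conj! acc item))))}) ; step
```

Such a map would then be passed as the fourth argument, as in `(pex/compile grammar entrypoint matchers actions)`.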
100 | 101 | `EOI` means end of input. This only matches when input is exhausted, not when you're done parsing. 102 | 103 | ### Rule Macros 104 | 105 | User supplied macros can expand rules to remove boilerplate. 106 | `TODO` example 107 | 108 | # Examples 109 | 110 | JSON parser 111 | 112 | -------------------------------------------------------------------------------- /TODO: -------------------------------------------------------------------------------- 1 | * TRACE opcode (Logs rule applications with debug info. Rule name and Position). 2 | * cuts 3 | * string capture uses a slow constructor 4 | * input shouldn't be a char[] 5 | * make json/cast-number less dumb 6 | * maybe a case-match statement for tail recursion 7 | * better documentation 8 | * codegen/emit should be a multimethod 9 | * add CSV example 10 | * reset the VM 11 | * LPEG tree optimizations 12 | -------------------------------------------------------------------------------- /doc/intro.md: -------------------------------------------------------------------------------- 1 | # Introduction to pex 2 | 3 | TODO: write [great documentation](http://jacobian.org/writing/what-to-write/) 4 | -------------------------------------------------------------------------------- /project.clj: -------------------------------------------------------------------------------- 1 | (defproject com.champbacon/pex "0.0.1-SNAPSHOT" 2 | :description "a data-driven parsing library" 3 | :url "http://github.com/ghadishayban/pex" 4 | :license {:name "Eclipse Public License" 5 | :url "http://www.eclipse.org/legal/epl-v10.html"} 6 | :dependencies [[org.clojure/clojure "1.7.0"]] 7 | :java-source-paths ["src-java"]) 8 | -------------------------------------------------------------------------------- /src-java/com/champbacon/pex/CharMatcher.java: -------------------------------------------------------------------------------- 1 | 2 | package com.champbacon.pex; 3 | 4 | public interface CharMatcher { 5 | public boolean match(int ch); 
6 | } 7 | -------------------------------------------------------------------------------- /src-java/com/champbacon/pex/PEGMatcher.java: -------------------------------------------------------------------------------- 1 | 2 | package com.champbacon.pex; 3 | 4 | public interface PEGMatcher { 5 | 6 | public int match(); 7 | public int match(int pos); 8 | 9 | public void reset(); 10 | 11 | public Object[] getCaptures(); 12 | } 13 | -------------------------------------------------------------------------------- /src-java/com/champbacon/pex/ParseAction.java: -------------------------------------------------------------------------------- 1 | package com.champbacon.pex; 2 | 3 | public interface ParseAction { 4 | 5 | public void execute(ValueStackManip vm); 6 | // subjectPosition, context, captureList, capturePosition 7 | } 8 | -------------------------------------------------------------------------------- /src-java/com/champbacon/pex/ParsingExpressionGrammar.java: -------------------------------------------------------------------------------- 1 | package com.champbacon.pex; 2 | 3 | import com.champbacon.pex.CharMatcher; 4 | import com.champbacon.pex.PEGMatcher; 5 | import com.champbacon.pex.ParseAction; 6 | import com.champbacon.pex.impl.PEGByteCodeVM; 7 | 8 | /** 9 | * Created by ghadi on 11/13/15. 
10 | */ 11 | public class ParsingExpressionGrammar { 12 | 13 | public final int[] instructions; 14 | public final CharMatcher[] charMatchers; 15 | public final ParseAction[] actions; 16 | 17 | public ParsingExpressionGrammar(int[] instructions, CharMatcher[] charMatchers, ParseAction[] actions) { 18 | this.instructions = instructions; 19 | this.charMatchers = charMatchers; 20 | this.actions = actions; 21 | } 22 | 23 | public PEGMatcher matcher(char[] input, Object context) { 24 | return new PEGByteCodeVM(this, input, context); 25 | } 26 | 27 | } 28 | -------------------------------------------------------------------------------- /src-java/com/champbacon/pex/ValueStackManip.java: -------------------------------------------------------------------------------- 1 | package com.champbacon.pex; 2 | 3 | public interface ValueStackManip { 4 | public Object getUserParseContext(); 5 | public void setUserParseContext(Object ctx); 6 | 7 | public int getInputPosition(); 8 | public char[] getInput(); 9 | public char getLastMatch(); 10 | 11 | public int getCaptureStart(); 12 | public int getCaptureEnd(); 13 | public void setCaptureEnd(int i); 14 | public Object[] getCurrentCaptures(); 15 | public void push(Object v); 16 | } 17 | -------------------------------------------------------------------------------- /src-java/com/champbacon/pex/impl/Actions.java: -------------------------------------------------------------------------------- 1 | package com.champbacon.pex.impl; 2 | 3 | import clojure.lang.IFn; 4 | import com.champbacon.pex.ParseAction; 5 | import com.champbacon.pex.ValueStackManip; 6 | 7 | /** 8 | * Created by ghadi on 11/13/15. 
9 | */ 10 | public class Actions { 11 | public static class PushAction implements ParseAction { 12 | final Object val; 13 | 14 | public PushAction(Object val) { 15 | this.val = val; 16 | } 17 | 18 | public void execute(ValueStackManip vm) { 19 | vm.push(val); 20 | } 21 | } 22 | 23 | public static class UpdateStackTop implements ParseAction { 24 | IFn f; 25 | 26 | public UpdateStackTop(IFn f) { 27 | this.f = f; 28 | } 29 | 30 | public void execute(ValueStackManip vm) { 31 | Object[] captures = vm.getCurrentCaptures(); 32 | 33 | int cur = vm.getCaptureEnd() - 1; 34 | captures[cur] = f.invoke(captures[cur]); 35 | } 36 | } 37 | 38 | public static class FoldCaptures implements ParseAction { 39 | private final IFn f; 40 | 41 | public FoldCaptures(IFn f) { 42 | this.f = f; 43 | } 44 | 45 | public void execute(ValueStackManip vm) { 46 | int low = vm.getCaptureStart(); 47 | int high = vm.getCaptureEnd(); 48 | Object[] caps = vm.getCurrentCaptures(); 49 | 50 | Object ret = f.invoke(); 51 | for(int i = low; i= stack.length) doubleStack(); 60 | StackEntry e = stack[stk]; 61 | if (e == null) { 62 | stack[stk] = e = new StackEntry(); 63 | } else { 64 | e.reset(); 65 | } 66 | return e; 67 | } 68 | 69 | private final void doubleStack() { 70 | StackEntry[] newStack = new StackEntry[stack.length << 1]; 71 | System.arraycopy(stack, 0, newStack, 0, stack.length); 72 | stack = newStack; 73 | } 74 | 75 | private final void doubleCaptures() { 76 | Object[] newCaptures = new Object[captureStack.length << 1]; 77 | System.arraycopy(captureStack, 0, newCaptures, 0, captureStack.length); 78 | captureStack = newCaptures; 79 | } 80 | 81 | private void opCall() { 82 | StackEntry e = ensure1(); 83 | 84 | // do not set subjectPosition 85 | e.setCaptureHeight(captureTop); 86 | e.setReturnAddress(pc + 1); 87 | 88 | stk++; 89 | pc = instructions[pc]; 90 | } 91 | 92 | private void opRet() { 93 | stk--; 94 | StackEntry s = stack[stk]; 95 | // captureTop = s.getCaptureHeight(); 96 | pc = 
s.getReturnAddress(); 97 | } 98 | 99 | private void opChoice() { 100 | StackEntry s = ensure1(); 101 | s.setReturnAddress(instructions[pc]); 102 | s.setCaptureHeight(captureTop); 103 | s.setSubjectPosition(subjectPointer); 104 | 105 | stk++; 106 | pc++; 107 | } 108 | 109 | private void opCommit() { 110 | stk--; 111 | pc = instructions[pc]; 112 | } 113 | 114 | private void opPartialCommit() { 115 | StackEntry s = stack[stk - 1]; 116 | s.setSubjectPosition(subjectPointer); 117 | s.setCaptureHeight(captureTop); 118 | pc = instructions[pc]; 119 | } 120 | 121 | // VALIDATE SEMANTICS 122 | private void opBackCommit() { 123 | stk--; 124 | StackEntry s = stack[stk]; 125 | subjectPointer = s.getSubjectPosition(); 126 | captureTop = s.getCaptureHeight(); pc = instructions[pc]; // jump to the label operand, as opCommit does; otherwise the operand is fetched as an opcode 127 | } 128 | 129 | private void opJump() { 130 | pc = instructions[pc]; 131 | } 132 | 133 | private void opFailTwice() { 134 | stk--; 135 | opFail(); 136 | } 137 | 138 | private void opFail() { 139 | // pop off any plain CALL frames 140 | StackEntry s; 141 | do { 142 | stk--; 143 | s = stack[stk]; 144 | } while (s.isCall() && stk > 0); 145 | 146 | if (stk == 0) { 147 | 148 | matchFailed = true; 149 | pc = instructions.length - 1; // jump to the final instruction, always END 150 | return; 151 | } 152 | 153 | subjectPointer = s.getSubjectPosition(); 154 | captureTop = s.getCaptureHeight(); 155 | pc = s.getReturnAddress(); 156 | } 157 | 158 | private void opMatchChar() { 159 | int ch = instructions[pc]; 160 | if (subjectPointer < input.length && input[subjectPointer] == ch) { 161 | pc++; 162 | subjectPointer++; 163 | } else { 164 | opFail(); 165 | } 166 | } 167 | 168 | /* private void opTestChar() { 169 | int ch = instructions[pc]; 170 | if (subjectPointer < input.length && input[subjectPointer] == ch) { 171 | pc++; 172 | subjectPointer++; 173 | } else { 174 | opFail(); 175 | } 176 | } 177 | */ 178 | 179 | private void opAny() { 180 | if (subjectPointer < input.length) { 181 | subjectPointer++; 182 | } else { 183 | opFail(); 
184 | } 185 | } 186 | 187 | private void opBeginCapture() { 188 | StackEntry s = stack[stk - 1]; 189 | s.setCurrentCaptureBegin(subjectPointer); 190 | } 191 | 192 | private void opEndCapture() { 193 | StackEntry s = stack[stk - 1]; 194 | 195 | int captureBegin = s.getCurrentCaptureBegin(); 196 | s.clearOpenCapture(); 197 | 198 | String cap = new String(input, captureBegin, subjectPointer - captureBegin); 199 | 200 | push(cap); 201 | } 202 | 203 | private void opAction() { 204 | ParseAction a = actions[instructions[pc]]; 205 | a.execute(this); 206 | pc++; 207 | } 208 | 209 | private void opCharset() { 210 | CharMatcher m = charMatchers[instructions[pc]]; 211 | if (subjectPointer < input.length && m.match(input[subjectPointer])) { 212 | pc++; 213 | subjectPointer++; 214 | } else { 215 | opFail(); 216 | } 217 | } 218 | 219 | private void opEndOfInput() { 220 | if (subjectPointer != input.length) 221 | opFail(); 222 | } 223 | 224 | private void debug() { 225 | if (subjectPointer >= input.length) return; 226 | System.out.printf( 227 | "{:pc %3d :op %2d :subj [\"%s\" %5d] :captop %2d :stk %2d}%n", 228 | pc, 229 | instructions[pc], 230 | input[subjectPointer], subjectPointer, 231 | captureTop, 232 | stk); 233 | } 234 | 235 | private void unimplemented() { 236 | throw new UnsupportedOperationException(); 237 | } 238 | 239 | public int match() { 240 | return match(0); 241 | } 242 | 243 | public int match(int pos) { 244 | subjectPointer = pos; 245 | 246 | vm: 247 | while (true) { 248 | final int op = instructions[pc++]; 249 | 250 | switch(op) { 251 | case OpCodes.CALL: opCall(); break; 252 | case OpCodes.RET: opRet(); break; 253 | 254 | case OpCodes.CHOICE: opChoice(); break; 255 | case OpCodes.COMMIT: opCommit(); break; 256 | 257 | case OpCodes.PARTIAL_COMMIT: opPartialCommit(); break; 258 | case OpCodes.BACK_COMMIT: opBackCommit(); break; 259 | 260 | case OpCodes.JUMP: opJump(); break; 261 | 262 | case OpCodes.FAIL_TWICE: opFailTwice(); break; 263 | case OpCodes.FAIL: 
opFail(); break; 264 | case OpCodes.END: break vm; 265 | 266 | case OpCodes.MATCH_CHAR: opMatchChar(); break; 267 | case OpCodes.CHARSET: opCharset(); break; 268 | case OpCodes.ANY: opAny(); break; 269 | case OpCodes.TEST_CHAR: unimplemented(); break; 270 | case OpCodes.TEST_CHARSET: unimplemented(); break; 271 | case OpCodes.TEST_ANY: unimplemented(); break; 272 | case OpCodes.SPAN: unimplemented(); break; 273 | 274 | case OpCodes.BEGIN_CAPTURE: opBeginCapture(); break; 275 | case OpCodes.END_CAPTURE: opEndCapture(); break; 276 | case OpCodes.FULL_CAPTURE: unimplemented(); break; 277 | case OpCodes.BEHIND: unimplemented(); break; 278 | case OpCodes.END_OF_INPUT: opEndOfInput(); break; 279 | 280 | case OpCodes.ACTION: opAction(); break; 281 | default: throw new IllegalStateException("unknown instruction: " + op + " at pc " + pc); 282 | } 283 | 284 | 285 | } 286 | 287 | return getMatchEnd(); 288 | 289 | } 290 | 291 | public void reset() { 292 | unimplemented(); 293 | } 294 | 295 | public int getCaptureStart() { 296 | StackEntry s = stack[stk - 1]; 297 | return s.getCaptureHeight(); 298 | } 299 | 300 | public int getCaptureEnd() { 301 | return captureTop; 302 | } 303 | 304 | public void setCaptureEnd(int i) { 305 | captureTop = i; 306 | } 307 | 308 | public Object[] getCurrentCaptures() { 309 | return captureStack; 310 | } 311 | 312 | public char[] getInput() { 313 | return input; 314 | } 315 | 316 | public int getInputPosition() { 317 | return subjectPointer; 318 | } 319 | 320 | public char getLastMatch() { 321 | return input[subjectPointer - 1]; 322 | } 323 | 324 | public void push(Object v) { 325 | if (captureTop >= captureStack.length) doubleCaptures(); 326 | captureStack[captureTop] = v; 327 | captureTop++; 328 | } 329 | 330 | public Object[] getCaptures() { 331 | if (matchFailed) { 332 | return null; 333 | } 334 | Object[] captures = new Object[captureTop]; 335 | System.arraycopy(captureStack, 0, captures, 0, captureTop); 336 | return captures; 337 | } 338 | 
339 | } 340 | -------------------------------------------------------------------------------- /src-java/com/champbacon/pex/impl/StackEntry.java: -------------------------------------------------------------------------------- 1 | package com.champbacon.pex.impl; 2 | 3 | final class StackEntry { 4 | 5 | static final int NO_OPEN_CAPTURE = -1; 6 | static final int IS_CALL = -1; 7 | private int returnAddress; 8 | private int subjectPosition = IS_CALL; 9 | private int captureHeight; 10 | private int currentCaptureBegin = NO_OPEN_CAPTURE; 11 | 12 | public void reset() { 13 | returnAddress = 0; 14 | subjectPosition = IS_CALL; 15 | captureHeight = 0; 16 | currentCaptureBegin = NO_OPEN_CAPTURE; 17 | } 18 | 19 | public int getCaptureHeight() { 20 | return captureHeight; 21 | } 22 | 23 | public void setCaptureHeight(int captureHeight) { 24 | this.captureHeight = captureHeight; 25 | } 26 | 27 | public int getSubjectPosition() { 28 | return subjectPosition; 29 | } 30 | 31 | public void setSubjectPosition(int subjectPosition) { 32 | this.subjectPosition = subjectPosition; 33 | } 34 | 35 | public int getReturnAddress() { 36 | return returnAddress; 37 | } 38 | 39 | public void setReturnAddress(int returnAddress) { 40 | this.returnAddress = returnAddress; 41 | } 42 | 43 | public boolean isCall() { 44 | return subjectPosition == -1; 45 | }; 46 | 47 | public void setCurrentCaptureBegin(int subjectPosition) { 48 | if (currentCaptureBegin == NO_OPEN_CAPTURE) { 49 | currentCaptureBegin = subjectPosition; 50 | } else throw new IllegalStateException("Nested capture within a single rule."); 51 | } 52 | 53 | public void clearOpenCapture() { 54 | currentCaptureBegin = NO_OPEN_CAPTURE; 55 | } 56 | 57 | public int getCurrentCaptureBegin() { 58 | if (currentCaptureBegin != NO_OPEN_CAPTURE) 59 | return currentCaptureBegin; 60 | else throw new IllegalStateException("No open capture."); 61 | } 62 | } 63 | -------------------------------------------------------------------------------- 
/src/com/champbacon/pex.clj: -------------------------------------------------------------------------------- 1 | (ns com.champbacon.pex 2 | (:refer-clojure :exclude [compile]) 3 | (:require [com.champbacon.pex.impl.tree :as tree] 4 | [com.champbacon.pex.impl.codegen :as codegen]) 5 | (:import (com.champbacon.pex.impl Matchers$SingleRangeMatcher 6 | Matchers$RangeMatcher 7 | Actions 8 | Actions$PushAction 9 | Actions$UpdateStackTop 10 | Actions$FoldCaptures 11 | Actions$ReplaceCaptures) 12 | (com.champbacon.pex ParsingExpressionGrammar))) 13 | 14 | (defn single-range-matcher 15 | [low high] 16 | (Matchers$SingleRangeMatcher. (int low) (int high))) 17 | 18 | (defn range-matcher 19 | [ranges] 20 | (let [nums (int-array (sequence (comp cat (map int)) (sort-by first ranges)))] 21 | (Matchers$RangeMatcher. nums))) 22 | 23 | (defn update-stack-top 24 | [f] 25 | (Actions$UpdateStackTop. f)) 26 | 27 | (defn push 28 | [val] 29 | (Actions$PushAction. val)) 30 | 31 | (defn replace-captures 32 | "f will be passed array, low-idx, high-idx. 33 | The extent of captures will be replaced with the result of f 34 | high-idx is exclusive" 35 | [f] 36 | (Actions$ReplaceCaptures. f)) 37 | 38 | (def clear-sb Actions/CLEAR_STRING_BUFFER) 39 | (def append-sb Actions/APPEND_STRING_BUFFER) 40 | (def push-sb Actions/PUSH_STRING_BUFFER) 41 | 42 | (defn fold-cap 43 | [rf] 44 | (Actions$FoldCaptures. rf)) 45 | 46 | (defn compile 47 | ([grammar entrypoint matchers] 48 | (compile grammar entrypoint matchers {} {})) 49 | ([grammar entrypoint matchers actions] 50 | (compile grammar entrypoint matchers actions {})) 51 | ([grammar entrypoint matchers actions macros] 52 | (when-not (contains? 
grammar entrypoint) 53 | (throw (ex-info "Unknown entrypoint" {:grammar grammar 54 | :entrypoint entrypoint}))) 55 | (let [ast (tree/parse-grammar grammar macros)] 56 | (codegen/compile-grammar ast entrypoint matchers actions)))) 57 | 58 | (defn matcher 59 | ([peg input] 60 | (matcher peg input nil)) 61 | ([^ParsingExpressionGrammar peg input user-parse-context] 62 | (.matcher peg input user-parse-context))) 63 | 64 | (defn print-instructions 65 | [insts] 66 | (println "ADDR INST") 67 | (doseq [[idx inst] (map vector (range) insts)] 68 | (printf "%4d %s%n" idx inst))) 69 | -------------------------------------------------------------------------------- /src/com/champbacon/pex/examples/csv.clj: -------------------------------------------------------------------------------- 1 | (ns com.champbacon.pex.examples.csv 2 | (:require [com.champbacon.pex :as pex])) 3 | 4 | ;; THIS IS INCOMPLETE AND INCORRECT 5 | 6 | (def CSV '{file [OWS record (* NL record) EOI] 7 | 8 | record [field (* field-delimeter field)] 9 | 10 | field-delimeter "," 11 | field (/ quoted unquoted) 12 | 13 | unquoted (capture (* (class nonquotechars))) 14 | ;; (capture (not \") (* ANY)) 15 | quoted [OWS \" 16 | (capture (* (/ (class quotechars) "\"\""))) 17 | ;; (apply unescape-quotes (* (/ (class quotechars) 18 | ;; "\"\""))) 19 | ;; 20 | 21 | \" OWS 22 | (action unescape-quotes)] 23 | 24 | NL [(? 
\r) \n] 25 | OWS (* (class :ws))}) 26 | 27 | (def csv-field '{record [field (* sep field) EOI] 28 | field (/ quoted unquoted) 29 | quoted ["\"" [(not "\"") ANY] ] 30 | sep ","}) 31 | 32 | (def csv-macros {:ws (fn [patt] [patt 'whitespace]) 33 | :join (fn [patt sep] [patt (list '* sep patt)])}) -------------------------------------------------------------------------------- /src/com/champbacon/pex/examples/json.clj: -------------------------------------------------------------------------------- 1 | (ns com.champbacon.pex.examples.json 2 | (:require [com.champbacon.pex :as pex]) 3 | (:import (com.champbacon.pex ParseAction CharMatcher))) 4 | 5 | ;; HOW TO RUN AT THE BOTTOM 6 | 7 | (def JSON '{json [whitespace value EOI] ;; Main Rule 8 | 9 | value (/ string number object array jtrue jfalse jnull) 10 | 11 | object [(:sp "{") ;; sp is a rule macro that chews whitespace 12 | (? (:join [string (:sp ":") value] (:sp ","))) 13 | (:sp "}") 14 | (action capture-object)] 15 | 16 | array [(:sp "[") (? (:join value (:sp ","))) (:sp "]") (action capture-array)] 17 | 18 | number [(:sp (capture integer (? frac) (? exp))) (action cast-number)] 19 | 20 | string [\" (action clear-sb) characters (:sp \") (action push-sb)] 21 | 22 | characters (* (/ [(not (/ \" \\)) any (action append-sb)] 23 | [\\ escaped-character])) 24 | 25 | escaped-character (/ [(class escape) (action append-escape)] 26 | unicode) 27 | unicode ["u" 28 | (capture (class hexdigit) (class hexdigit) (class hexdigit) (class hexdigit)) 29 | (action append-hexdigit)] 30 | 31 | integer [(? "-") (/ [(class digit19) digits] 32 | (class digit))] 33 | digits [(class digit) (* (class digit))] 34 | frac ["." digits] 35 | exp [(/ "e" "E") (? 
(/ "+" "-")) digits] 36 | jtrue ["true" whitespace (action push-true)] 37 | jfalse ["false" whitespace (action push-false)] 38 | jnull ["null" whitespace (action push-nil)] 39 | whitespace (* (class whitespace))}) 40 | 41 | (def json-macros {:sp (fn [patt] [patt 'whitespace]) 42 | :join (fn [patt sep] [patt (list '* sep patt)])}) 43 | 44 | (defn make-json-object 45 | [^objects captures low high] 46 | (loop [i low 47 | m (transient {})] 48 | (if (< i high) 49 | (recur (unchecked-add i 2) 50 | (assoc! m (aget captures i) 51 | (aget captures (unchecked-inc i)))) 52 | (persistent! m)))) 53 | 54 | (defn json-parser 55 | [] 56 | (let [escapes {\b \backspace 57 | \f \formfeed 58 | \n \newline 59 | \r \return 60 | \t \tab 61 | \\ \\ 62 | \/ \/ 63 | \" \"} 64 | matchers {:digit19 (pex/single-range-matcher \1 \:) 65 | :digit (pex/single-range-matcher \0 \:) 66 | :hexdigit (pex/range-matcher [[\a \g] [\0 \:]]) 67 | :escape (reify CharMatcher 68 | (match [_ ch] 69 | (>= (.indexOf "bfnrt\\/\"" ch) 0))) ;; indexOf is 0 for \b, so test >= 0 70 | :whitespace (reify CharMatcher 71 | (match [_ ch] 72 | (Character/isWhitespace ch)))} 73 | actions {:append-hexdigit (reify ParseAction 74 | (execute [_ vsm] 75 | (let [^StringBuffer sb (.getUserParseContext vsm) 76 | captures (.getCurrentCaptures vsm) 77 | top (.getCaptureEnd vsm) 78 | hex (aget captures (dec top))] ;; capture end is exclusive; the last capture sits at top - 1 79 | (.append sb (char (Integer/parseInt hex 16))) 80 | (.setCaptureEnd vsm (dec top))))) 81 | :capture-object (pex/replace-captures make-json-object) 82 | :capture-array (pex/fold-cap (fn 83 | ([] (transient [])) 84 | ([res] (persistent! res)) 85 | ([res input] (conj! 
res input)))) 86 | :append-escape (reify ParseAction 87 | (execute [_ vsm] 88 | (let [^StringBuffer sb (.getUserParseContext vsm) 89 | last-ch (.getLastMatch vsm)] 90 | (.append sb ^char (escapes (char last-ch)))))) 91 | :cast-number (pex/update-stack-top #(Double/valueOf ^String %)) 92 | :push-true (pex/push true) 93 | :push-false (pex/push false) 94 | :push-nil (pex/push nil) 95 | :clear-sb pex/clear-sb 96 | :append-sb pex/append-sb 97 | :push-sb pex/push-sb}] 98 | (pex/compile JSON 'json matchers actions json-macros))) 99 | 100 | (comment 101 | (let [json (json-parser) 102 | ;;input "\"42\"" 103 | input "{\"bar\": [\"this\", 42, {}, [1,2,3], \"foo\"]}" 104 | m (pex/matcher json (.toCharArray input) (StringBuffer.)) 105 | result (.match m 0)] 106 | (first (.getCaptures m)))) -------------------------------------------------------------------------------- /src/com/champbacon/pex/impl/codegen.clj: -------------------------------------------------------------------------------- 1 | (ns com.champbacon.pex.impl.codegen 2 | (:import com.champbacon.pex.impl.OpCodes 3 | (com.champbacon.pex ParsingExpressionGrammar CharMatcher ParseAction))) 4 | 5 | (declare emit) 6 | 7 | (defn label 8 | [env] 9 | ((:next-label env))) 10 | 11 | (defn emit-choice 12 | [env ast] 13 | (let [{:keys [children]} ast 14 | n (count children) 15 | last-pos? (fn [i] (= i (dec n))) 16 | 17 | labels (into [] (take n) (repeatedly #(label env))) 18 | blocks (mapv (partial emit env) children) 19 | 20 | end (label env) 21 | 22 | emit-alternative (fn [idx tree] 23 | (let [header (when-not (zero? idx) 24 | [[:label (labels idx)]]) 25 | choice (when-not (last-pos? idx) 26 | [[:choice (labels (inc idx))]]) 27 | commit (when-not (last-pos? 
idx) 28 | [[:commit end]])] 29 | (into [] cat 30 | [header 31 | choice 32 | (blocks idx) 33 | commit])))] 34 | (-> (into [] (comp (map-indexed emit-alternative) cat) children) 35 | (conj [:label end])))) 36 | 37 | (defn concat-trees 38 | [env ts] 39 | (into [] (mapcat (partial emit env)) ts)) 40 | 41 | (defn emit-cat 42 | [env ast] 43 | (concat-trees env (:children ast))) 44 | 45 | (defn emit-char 46 | [env ast] 47 | (let [{:keys [codepoint]} ast] 48 | [[:char codepoint]])) 49 | 50 | (defn emit-rep 51 | [env ast] 52 | (let [body (concat-trees env (:children ast)) 53 | head (label env) 54 | cont (label env)] 55 | (into [] cat 56 | [[[:choice cont] 57 | [:label head]] 58 | body 59 | [[:partial-commit head] 60 | [:label cont]]]))) 61 | 62 | (defn emit-call 63 | [env ast] 64 | (let [{:keys [target]} ast] 65 | (when-not (contains? (:non-terminals env) target) 66 | (throw (ex-info "Undefined non-terminal" {:target target}))) 67 | [[:call target]])) 68 | 69 | (defn emit-optional 70 | [env ast] 71 | (let [body (concat-trees env (:children ast)) 72 | cont (label env)] 73 | (-> (into [[:choice cont]] body) 74 | (conj [:commit cont] [:label cont])))) 75 | 76 | (defn emit-not-predicate 77 | [env ast] 78 | (let [body (concat-trees env (:children ast)) 79 | L1 (label env)] 80 | (-> (into [[:choice L1]] body) 81 | (conj [:fail-twice] 82 | [:label L1])))) 83 | 84 | (defn emit-and-predicate 85 | [env ast] 86 | (let [body (concat-trees env (:children ast)) 87 | L1 (label env) 88 | L2 (label env)] 89 | (-> (into [[:choice L1]] body) 90 | (into [[:back-commit L2] 91 | [:label L1] 92 | [:fail] 93 | [:label L2]])))) 94 | 95 | (defn emit-capture 96 | [env ast] 97 | ;; optimize 98 | (let [body (concat-trees env (:children ast))] 99 | (-> (into [[:begin-capture]] body) 100 | (conj [:end-capture])))) 101 | 102 | (defn emit-linked-instruction 103 | [k env ast] 104 | (let [linked-constant (-> ast :args first keyword) 105 | n (or (get-in env [:constants k linked-constant]) 106 | (throw 
(ex-info "Linked constant not found" ast)))] 107 | [[k n]])) 108 | 109 | (def dispatch {:choice emit-choice 110 | :char emit-char 111 | :cat emit-cat 112 | :rep emit-rep 113 | :open-call emit-call 114 | :optional emit-optional 115 | :true (constantly []) 116 | :fail (constantly [[:fail]]) 117 | :any (constantly [[:any]]) 118 | :end-of-input (constantly [[:end-of-input]]) 119 | :not emit-not-predicate 120 | :and emit-and-predicate 121 | :capture emit-capture 122 | :action (partial emit-linked-instruction :action) 123 | :charset (partial emit-linked-instruction :charset)}) 124 | 125 | (defn emit 126 | [env ast] 127 | (let [f (dispatch (:op ast))] 128 | (when-not f (throw (ex-info "bad ast" ast))) 129 | (f env ast))) 130 | 131 | (defn initial-jump-block 132 | [instrs entrypoint] 133 | (let [preamble [[:call entrypoint] 134 | [:end]]] 135 | (into preamble instrs))) 136 | 137 | (def branching? 138 | #{:commit 139 | :choice 140 | :jump 141 | :call 142 | :back-commit 143 | :partial-commit}) 144 | 145 | (def op->code 146 | (let [m {:call OpCodes/CALL 147 | :return OpCodes/RET 148 | :choice OpCodes/CHOICE 149 | :commit OpCodes/COMMIT 150 | :partial-commit OpCodes/PARTIAL_COMMIT 151 | :back-commit OpCodes/BACK_COMMIT 152 | :jump OpCodes/JUMP 153 | :fail-twice OpCodes/FAIL_TWICE 154 | :fail OpCodes/FAIL 155 | :end OpCodes/END 156 | 157 | :char OpCodes/MATCH_CHAR 158 | :test-char OpCodes/TEST_CHAR 159 | :charset OpCodes/CHARSET 160 | :test-charset OpCodes/TEST_CHARSET 161 | :any OpCodes/ANY 162 | :test-any OpCodes/TEST_ANY 163 | :span OpCodes/SPAN 164 | 165 | :begin-capture OpCodes/BEGIN_CAPTURE 166 | :end-capture OpCodes/END_CAPTURE 167 | :full-capture OpCodes/FULL_CAPTURE 168 | :behind OpCodes/BEHIND 169 | :end-of-input OpCodes/END_OF_INPUT 170 | :action OpCodes/ACTION}] 171 | (fn [kw] 172 | (or (get m kw) 173 | (throw (IllegalArgumentException. 
(str "No opcode defined " kw))))))) 174 | 175 | (defn link 176 | "Resolves all symbolic jump targets (labels) to absolute instruction addresses" 177 | [instructions] 178 | (let [[insts labels] (reduce (fn [[insts labels] [op arg :as inst]] 179 | (if (= :label op) 180 | [insts (assoc labels arg (count insts))] 181 | [(into insts inst) labels])) 182 | [[] {}] instructions) 183 | 184 | patch-jumps (fn [stream] 185 | (let [n (count stream)] 186 | (loop [i 0 stream stream] 187 | (if (< i n) 188 | (let [op (get stream i)] 189 | (if (and (keyword? op) (branching? op)) 190 | (let [target (inc i)] 191 | (recur (inc target) (update stream target labels))) 192 | (recur (inc i) stream))) 193 | stream))))] 194 | (patch-jumps insts))) 195 | 196 | (defn add-entrypoint 197 | [env code entrypoint] 198 | (let [end (label env)] 199 | (concat [[:call entrypoint] 200 | [:jump end]] 201 | code 202 | [[:label end] 203 | [:end]]))) 204 | 205 | (defn empty-env 206 | [grammar matchers actions] 207 | (let [current-id (atom 0)] 208 | {:non-terminals (set (keys grammar)) 209 | :next-label #(swap! current-id inc) 210 | :matchers (vec (vals matchers)) 211 | :actions (vec (vals actions)) 212 | :constants {:charset (into {} (map vector (keys matchers) (range))) 213 | :action (into {} (map vector (keys actions) (range)))}})) 214 | 215 | (defn transform-instructions 216 | [insts] 217 | (let [->bytecode (fn [i] 218 | (if (keyword? i) 219 | (op->code i) 220 | i))] 221 | (into [] (map ->bytecode) insts))) 222 | 223 | (defn compile-grammar 224 | [grammar entrypoint matchers actions] 225 | (let [env (empty-env grammar matchers actions) 226 | emit-rule (fn [[sym ast]] 227 | (-> (into [[:label sym :call]] 228 | (emit env ast)) 229 | (conj [:return]))) 230 | instructions (into [] (mapcat emit-rule) grammar)] 231 | (ParsingExpressionGrammar. 
232 | (-> (add-entrypoint env instructions entrypoint) 233 | (link) 234 | (transform-instructions) 235 | (int-array)) 236 | (into-array CharMatcher (:matchers env)) 237 | (into-array ParseAction (:actions env))))) 238 | -------------------------------------------------------------------------------- /src/com/champbacon/pex/impl/tree.clj: -------------------------------------------------------------------------------- 1 | (ns com.champbacon.pex.impl.tree 2 | (:refer-clojure :exclude [cat char not and])) 3 | 4 | (def ^:dynamic *macros* {}) 5 | 6 | (defprotocol OpTree 7 | (pattern [_])) 8 | 9 | ;; make this work on pairs and do associativity fix inline? 10 | (defn choice 11 | [alternatives] 12 | (case (count alternatives) 13 | 0 (throw (ex-info "Empty ordered choice operator" {})) 14 | 1 (first alternatives) 15 | {:op :choice 16 | :children alternatives})) 17 | 18 | ;; make this work on pairs and do associativity fix inline? 19 | (defn cat 20 | [ps] 21 | (if (= 1 (count ps)) 22 | (first ps) 23 | {:op :cat 24 | :children ps})) 25 | 26 | (defn char-range 27 | [args] 28 | ;; checkargs 29 | {:op :charset 30 | :args args}) 31 | 32 | (defn char 33 | [codepoint] 34 | {:op :char 35 | :codepoint codepoint}) 36 | 37 | (defn string 38 | [s] 39 | (if (pos? 
(count s)) 40 | (cat (mapv (comp char int) s)) 41 | {:op :true})) 42 | 43 | (def fail 44 | {:op :fail}) 45 | 46 | (defn rep 47 | [patt] 48 | {:op :rep 49 | :children patt}) 50 | 51 | (defn any 52 | ([] {:op :any}) 53 | ([n] (cat (vec (repeat n {:op :any}))))) 54 | 55 | (defn not 56 | [patt] 57 | {:op :not 58 | :children patt}) 59 | 60 | (defn and 61 | [patt] 62 | {:op :and 63 | :children patt}) 64 | 65 | (def end-of-input 66 | {:op :end-of-input}) 67 | 68 | (defn action 69 | [args] 70 | {:op :action 71 | :args args}) 72 | 73 | (defn non-terminal 74 | [s] 75 | {:op :open-call 76 | :target s}) 77 | 78 | (defn push 79 | [obj] 80 | {:op :push 81 | :value obj}) 82 | 83 | (defn capture 84 | [ps] 85 | {:op :capture 86 | :children ps}) 87 | 88 | (defn optional 89 | [ps] 90 | {:op :optional 91 | :children ps}) 92 | 93 | (def special '#{any EOI / * ? capture push and not class action reduce}) 94 | 95 | (extend-protocol OpTree 96 | String 97 | (pattern [s] (string s)) 98 | 99 | clojure.lang.IPersistentVector 100 | (pattern [v] (cat (mapv pattern v))) 101 | 102 | clojure.lang.Symbol 103 | (pattern [s] 104 | (condp = s 105 | 'any (any) 106 | 'EOI end-of-input 107 | (non-terminal s))) 108 | 109 | java.lang.Character 110 | (pattern [ch] (char (int ch))) 111 | 112 | java.lang.Boolean 113 | (pattern [b] 114 | (if b 115 | {:op :true} 116 | {:op :fail})) 117 | 118 | clojure.lang.IPersistentList 119 | ;; TODO Add macroexpansion 120 | (pattern [l] 121 | (when-first [call l] 122 | (if-let [macro (get *macros* call)] 123 | (pattern (apply macro (next l))) 124 | (let [args (next l) 125 | call-with-args #(%1 (mapv pattern args))] 126 | (condp = call 127 | 128 | '/ 129 | (call-with-args choice) 130 | 131 | 'any 132 | (apply any (next l)) ;; the optional count is a literal number, not a pattern 133 | 134 | '* 135 | (call-with-args rep) 136 | 137 | '?
138 | (call-with-args optional) 139 | 140 | 'not 141 | (call-with-args not) 142 | 143 | 'and 144 | (call-with-args and) 145 | 146 | 'capture 147 | (call-with-args capture) 148 | 149 | 'push 150 | (push (fnext l)) 151 | 152 | 'class 153 | (char-range (next l)) 154 | 155 | ;;'reduce 156 | ;;(reduce-value-stack (next l)) 157 | 158 | 'action 159 | (action (next l)) 160 | 161 | ;; default clause: condp takes a bare trailing expression, not :else 162 | (throw (ex-info "Unrecognized call" {:op call :form l})))))))) 163 | 164 | (defn parse-grammar 165 | [grammar macros] 166 | (when-not (every? keyword? (keys macros)) 167 | (throw (ex-info "PEG macros must all be keywords" macros))) 168 | (binding [*macros* macros] 169 | (into {} (map (fn [[kw p]] 170 | [kw (pattern p)])) grammar))) 171 | 172 | ;;; Optimizations 173 | 174 | (defn has-capture? 175 | [op] 176 | ) 177 | 178 | (defn empty-string-behavior 179 | [tree] 180 | ;; A pattern is *nullable* if it can match without consuming any character; 181 | ;; A pattern is *nofail* if it never fails for any string 182 | ;; (including the empty string). 183 | ) 184 | 185 | (defn fixed-length 186 | [tree] 187 | ;; number of characters to match a pattern (or nil if variable) 188 | ;; avoid infinite loop with a MAXRULES traversal count 189 | 190 | ) 191 | 192 | (defn first-set 193 | [tree] 194 | ;; https://github.com/lua/lpeg/blob/2960f1cf68af916a095625dfd3e39263dac5f38c/lpcode.c#L246 195 | ) 196 | 197 | (defn head-fail? 198 | [tree] 199 | ;; https://github.com/lua/lpeg/blob/2960f1cf68af916a095625dfd3e39263dac5f38c/lpcode.c#L341 200 | ) 201 | -------------------------------------------------------------------------------- /test/com/champbacon/pex_test.clj: -------------------------------------------------------------------------------- 1 | (ns com.champbacon.pex-test 2 | (:require [clojure.test :refer :all] 3 | [com.champbacon.pex :refer :all])) 4 | 5 | (deftest a-test 6 | (testing "FIXME, I fail."
7 | (is (= 0 1)))) 8 | --------------------------------------------------------------------------------