├── .gitignore ├── LICENSE ├── README.org ├── src ├── basics.lisp ├── capture.lisp ├── compile.lisp ├── core.lisp ├── uclp.lisp └── util.lisp ├── test ├── basics-test.lisp ├── capture-test.lisp ├── core-test.lisp ├── init.lisp └── util-test.lisp └── uclp.asd /.gitignore: -------------------------------------------------------------------------------- 1 | **/*.fasl -------------------------------------------------------------------------------- /LICENSE: -------------------------------------------------------------------------------- 1 | Copyright (c) 2024 Ravi D'Elia 2 | 3 | Permission is hereby granted, free of charge, to any person obtaining a copy of 4 | this software and associated documentation files (the "Software"), to deal in 5 | the Software without restriction, including without limitation the rights to 6 | use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies 7 | of the Software, and to permit persons to whom the Software is furnished to do 8 | so, subject to the following conditions: 9 | 10 | The above copyright notice and this permission notice shall be included in all 11 | copies or substantial portions of the Software. 12 | 13 | THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR 14 | IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, 15 | FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE 16 | AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER 17 | LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, 18 | OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE 19 | SOFTWARE. -------------------------------------------------------------------------------- /README.org: -------------------------------------------------------------------------------- 1 | * Unnamed Common Lisp Peg 2 | 3 | UCLP is an experimental implementation of [[https://en.wikipedia.org/wiki/Parsing_expression_grammar][PEG parsing]] in Common Lisp which compiles 4 | grammar rules directly to source code at runtime. A parsing expression grammar is a very 5 | elegant way of recognizing, parsing, and transforming text- much more powerful than 6 | regular expressions without the complexity of a custom-built parser. Note that while it is 7 | possible to parse PEGs in guaranteed linear time (at the cost of linear space) with a 8 | packrat parser, UCLP does not. Most patterns unless very poorly written will run in linear 9 | time anyway, and for ease of use some necessary departures from the strict definition are 10 | included. 11 | 12 | UCLP patterns are just made of native data structures, so it's easy to compose patterns 13 | and interact with them in code. Unlike regular expressions, PEG syntax is pretty readable 14 | even if you aren't familiar with the specifics. The below example is largely copied from the 15 | [[https://janet-lang.org/docs/peg.html][Janet language documentation]]: 16 | 17 | #+BEGIN_SRC lisp 18 | (defparameter ip-address 19 | '(grammar 20 | :dig (range "09") 21 | :0-4 (range "04") 22 | :0-5 (range "05") 23 | :byte (choice 24 | (sequence "25" :0-5) 25 | (sequence "2" :0-4 :dig) 26 | (sequence "1" :dig :dig) 27 | (between :dig 1 2)) 28 | :main (sequence :byte "." :byte "." :byte "." :byte))) 29 | 30 | ; uclp:match returns two values, a boolean indicating success/failure and a list of captures 31 | (uclp:match ip-address "0.0.0.0") ; -> t nil 32 | (uclp:match ip-address "elephant") ; -> nil 33 | (uclp:match ip-address "256.0.0.0") ; -> nil 34 | (uclp:match ip-address "0.0.0.0moretext") ; -> t nil 35 | #+END_SRC 36 | 37 | If you follow the link you'll see that the example is almost exactly copied, with 38 | the only differences being those forced by the differences in language syntax. UCLP 39 | is a very close reproduction of the semantics of Janet's PEG module. After experiencing 40 | how pleasant text parsing is in Janet you'll also feel the urge to rewrite it for every 41 | language you use. 42 | 43 | As of now UCLP is usable. Bugs are to be expected, but almost all of Janet's patterns are 44 | supported and there shouldn't be any significant footguns. Unfortunately it has only been 45 | tested on SBCL, but I plan on at least expanding that to some other implementations since 46 | there isn't much implementation-specific code. Except of course that for it to be 47 | performant it needs to be compiled quickly to a quick runtime, ideally machine code. 48 | 49 | ** Usage 50 | The peg syntax used by UCLP is largely a 1:1 reproduction of Janet's. Differences are 51 | noted below, but otherwise you can safely default to using [[https://janet-lang.org/docs/peg.html][Janet's documentation]]. Many 52 | tasks that might require a particular function when using regex can be accomplished with 53 | the correct pattern. However, for convenience, there are entrypoints for a few common 54 | usecases. 55 | 56 | *** ~(match rule str &optional start &rest args)~ 57 | This is the primary entrypoint to UCLP. Matches ~str~ against ~rule~ anchored to ~start~. 58 | Returns two values on success, a boolean indicating success or failure and a list of 59 | captures, and nil otherwise. Additional arguments are accessable while matching using the 60 | ~argument~ pattern. Note that if using additional arguments you must specify a ~start~. 61 | 62 | *** ~(compile-peg rule &key (quiet? t) debug?)~ 63 | Compiles ~rule~ to source ahead of time. Even with SBCL's lightning quick compilation this 64 | can still save a significant amount of time, and for many rules almost all the 65 | consing. The result is just a closure, and it can be used in other patterns by including 66 | it literally. Does not promise to be nice if used outside its wrapper. 67 | 68 | *** ~(captured rule str &optional start &rest args)~ 69 | Takes exactly the same arguments as ~match~ but returns in the opposite order- first the 70 | list of captures, and success as the second value. Useful to get at captures without 71 | ~multiple-value-bind~. 72 | 73 | *** ~(replace-all rule replace str &optional start &rest args)~ 74 | Returns the string resulting from replacing every instance of ~rule~ in ~str~ with 75 | ~replace~, which may be a string to substitute or a function taking as an argument the 76 | substring matched by ~rule~. Returns a boolean indicating if matches were found as a 77 | second value. 78 | 79 | *** ~(replace-one rule replace str &optional start &rest args)~ 80 | Behaves like ~replace-all~, but replaces only the first match. 81 | 82 | *** ~(find-all rule str &optional start &rest args)~ 83 | Gives the position of each match in ~str~ as a list, such that 84 | #+BEGIN_SRC lisp 85 | (defparameter dig '(* "dig:" :d)) 86 | (defparameter str "dig:7, dig:8, dig:9") 87 | 88 | (every (lambda (position) 89 | (uclp:match dig str position)) 90 | (uclp:find-all dig str)) ; => t 91 | #+END_SRC 92 | 93 | 94 | *** ~(find-one rule str &optional start &rest args)~ 95 | Returns the position of the /first/ match, or ~nil~ if there are no matches. 96 | 97 | ** Differences From Janet 98 | Mostly UCLP adheres to the behavior of Janet pegs, enough so that Janet's documentation is 99 | better than anything I'll have put together for a little while. However, there are some 100 | differences which obviously need to be documented somewhere. Some are due to differences 101 | in host language, some are due to taste, and some are just features I haven't implemented 102 | yet. 103 | 104 | Any difference in behavior from Janet not mentioned below is a *bug*, and either the 105 | behavior or the documentation will need to be changed. 106 | 107 | *** Unimplemented 108 | Janet buffers and strings are simply byte strings, and Janet pegs work on arbitrary 109 | strings of bytes. UCLP expects to work with Common Lisp strings, which in general are 110 | /not/ byte strings but vectors of type ~character~. As such the patterns intended to work 111 | on bytes are not implemented. These are: ~uint~, ~uint-be~, ~int~, and ~int-be~. Because 112 | using PEGs to parse binaries is so nice, I plan on at some point implementing some way of 113 | compiling PEGs intended to operate on raw bytes. However, specialization will be at 114 | compile-time and the above patterns will likely be available only in a byte PEG. 115 | 116 | I'm not yet sure if UCLP will ever support a general ~number~ pattern. It's possible 117 | I'll bring in ~parse-float~ to make a ~float~ as well as a general ~number~ pattern. 118 | 119 | *** Implemented 120 | - While UCLP does not have ~number~, it does have ~integer~, which takes identical 121 | arguments and parses an integer using ~parse-integer~. 122 | - The pattern ~(split div-pat fill-pat)~ was contributed to Janet but not yet released. 123 | It will match when ~fill-pat~ matches each segment of the input when divided by 124 | ~div-pat~. However, it is a little more complicated than that. For one, ~split~ always 125 | matches against the entire input. If you don't want that, you need to use a ~sub~ 126 | pattern to contain it. This means that something like ~(match '(split "," 1) "a,b,c,")~ 127 | will not match! Additionally, ~fill-pat~ is implicitly matched in a ~sub~- so if you 128 | want to match everything between your seperators, you can use a pattern like 129 | ~(split "," (any 1))~. Look at the test cases for examples here. 130 | - The pattern ~(split div-pat fill-pat)~ was contributed to Janet but not yet released. 131 | It matches if the input string consists of instances of ~fill-pat~ split apart by 132 | ~div-pat~. Look at the test cases to get a finer idea of the details, but it is 133 | equivalent to the following pattern: 134 | #+BEGIN_SRC lisp 135 | `(grammar 136 | :mid (sub (to (+ ,div-pat -1)) ,fill-pat) ; fill-pat matches up to the next div 137 | :main (* :mid (any (* (drop ,div-pat) :mid)) -1)) 138 | #+END_SRC 139 | 140 | *** Changes 141 | - Anywhere a string literal can go, including those in ~range~ or ~set~, a character or 142 | list of characters and strings can also go. This is because Common Lisp strings do not have 143 | escape codes like Janet strings. So ~(range ("a" #\Newline))~ in UCLP is the same as 144 | ~(range "a\n")~ in Janet. 145 | - ~between~, ~at-least~, ~at-most~, and ~look~ all have the pattern as the first argument, unlike 146 | in Janet where it is the last argument. 147 | - ~backmatch~ requires a tag argument, and will not look up captures on the capture stack 148 | - ~replace~ takes either a string which it captures literally, or a function which it calls. 149 | Taking other datatypes literally will probably be in the next version. But unlike Janet, 150 | it will never look the matches up. 151 | - Grammars, represented in Janet by tables or structs, are written in UCLP with the 152 | ~grammar~ rule, which is followed by alternating keywords naming rules and patterns 153 | implementing them. 154 | - Because the reader doesn't distinguish between ~:s~ and ~:S~, the complement of a built-in 155 | pattern is prefixed with ~!~. So ~:!s~ instead of ~:S~. 156 | - The ~error~ pattern has significant differences from Janet's ~error~. See *Error* below. 157 | - In addition to the ~+~ and ~*~ variants of built-in patterns, UCLP has a maybe variant 158 | marked by a ~?~. So ~:w?~ denotes ~(? :w)~. 159 | - UCLP includes any (~*~), some (~+~), and maybe (~?~) variants of complement patterns. So 160 | ~:!d+~ is ~(some (if-not :d 1))~. See *Aliases* below 161 | 162 | ** Aliases 163 | UCLP offers aliases, keywords that stand in for larger patterns, similar to the Janet's 164 | built-in patterns. And like built-in patterns, aliases are user extensible. However, there 165 | are a number of differences which are important to be aware of. Aliases are not first 166 | class citizens of UCLP- rather than full mutually recursive subpatterns, they are simple 167 | find-and-replace macros, inserted literally. You can reference other aliases from inside 168 | one, but if you create a cycle it'll just blow out the stack. So just be cautious! 169 | 170 | Aliases are stored in the alist ~*aliases*~. You can manipulate ~*aliases*~ directly, or 171 | call the helper functions ~register-alias!~ and ~register-alias-suite!~. Both take the 172 | name of the alias as a keyword, and the body as a peg expression, and push the new alias 173 | to ~*aliases*~. However, ~register-alias-suite!~ will also add the complement, some, 174 | maybe, and any variants, like so: 175 | 176 | #+BEGIN_SRC lisp 177 | (uclp:register-alias-suite! :v '(set "vV")) 178 | (uclp:match '(* :v (<- :!v+)) "v not a V") ; => t (" not a ") 179 | #+END_SRC 180 | 181 | ** Error 182 | Because Common Lisp conditions are so different from Janet signals, the ~error~ pattern 183 | has some subtleties in UCLP. It takes arguments of the form 184 | ~(error &optional pat condition)~. 185 | With 0 arguments, it will raise a ~peg-error~. With one argument, it will 186 | raise an error only if ~pat~ matches. With two arguments, it will raise a ~condition~ so 187 | long as ~pat~ matches. To specify a particular condition a pattern must be given, but 188 | something like ~0~ will always match. 189 | 190 | UCLP special cases conditions inheriting from ~peg-error~, which is exported. A 191 | ~peg-error~ has slots ~pat~, ~matched~, and ~caps~. If a pattern is given, these will 192 | automatically be filled by the pattern itself, the text that pattern matched, and the 193 | captures from the pattern respectively. If no pattern is supplied they will all be 194 | ~nil~. These slots can be accessed with ~error-pat~, ~error-matched~, and ~caps~. By 195 | default ~peg-error~ has a report function giving the pattern and, if nonempty, the 196 | matching substring. Depending on your choice of pattern, this can be tolerably readable. 197 | 198 | # LocalWords: UCLP alist LocalWords subpatterns structs datatypes PEGs packrat footguns 199 | # LocalWords: SBCL performant 200 | -------------------------------------------------------------------------------- /src/basics.lisp: -------------------------------------------------------------------------------- 1 | (in-package :uclp) 2 | 3 | ;;;;;;;;;;;;;;;;;;;;;; 4 | ;; ;;;;;;;;;;;;;;;; ;; 5 | ;; ;; Primitives ;; ;; 6 | ;; ;;;;;;;;;;;;;;;; ;; 7 | ;;;;;;;;;;;;;;;;;;;;;; 8 | 9 | (defun compile-literal (opts literal) 10 | (declare (ignore opts)) 11 | (with-gensyms ($strc $litc) 12 | `(when (<= (+ curr ,(length literal)) strlen) 13 | (loop for ,$litc of-type character across ,literal 14 | for ,$strc of-type character = (char str curr) 15 | do (if (char/= ,$strc ,$litc) (return nil)) 16 | (incf curr) 17 | finally (return t))))) 18 | (defun compile-count (opts count) 19 | (declare (ignore opts)) 20 | (if (< count 0) 21 | `(> (+ curr ,(- count)) strlen) 22 | `(when (<= (+ curr ,count) strlen) 23 | (incf curr ,count) 24 | t))) 25 | 26 | (defpattern (range) 27 | (&rest (ranges (lambda (s) (and (strform? s) (length= s 2))) 28 | "string of length 2")) 29 | (let ((range-strs (mapcar #'from-strform ranges))) 30 | (with-gensyms ($c) 31 | `(when (< curr strlen) 32 | (let* ((,$c (char str curr))) 33 | (when (or ,@(loop for s in range-strs 34 | collect (list 'char<= (char s 0) $c (char s 1)))) 35 | (incf curr) 36 | t)))))) 37 | 38 | (defpattern (set) ((set :string)) 39 | (with-gensyms ($strc $setc) 40 | (let ((set (from-strform set))) 41 | `(if (>= curr strlen) 42 | nil 43 | (let ((,$strc (char str curr))) 44 | ,(if (< (length set) 8) 45 | `(if (or ,@(loop for c across set collect `(eq ,c ,$strc))) (incf curr)) 46 | `(loop for ,$setc of-type character across ,set 47 | if (char= ,$strc ,$setc) do 48 | (incf curr) 49 | (return t) 50 | finally (return nil)))))))) 51 | 52 | ;;;;;;;;;;;;;;;;;;;;;;; 53 | ;; ;;;;;;;;;;;;;;;;; ;; 54 | ;; ;; Combinators ;; ;; 55 | ;; ;;;;;;;;;;;;;;;;; ;; 56 | ;;;;;;;;;;;;;;;;;;;;;;; 57 | 58 | (defpattern (sequence *) (&rest (pats :pat)) 59 | `(and ,@(mapcar (lambda (p) (compile-expr opts p)) pats))) 60 | 61 | (defpattern (choice +) (&rest (pats :pat)) 62 | (let ((tail (last-elt pats)) 63 | (buttail (butlast pats))) 64 | `(with-save (curr caps tags accum) 65 | (or ,@(mapcar 66 | (lambda (pat) 67 | `(if ,(compile-expr opts pat) 68 | t 69 | (restore curr caps tags accum))) 70 | buttail) 71 | ,(compile-expr opts tail))))) 72 | 73 | (defun compile-several (opts pat minimum &optional maximum) 74 | (with-gensyms ($matches $matched?) 75 | `(with-save (curr caps tags accum) 76 | (loop with ,$matches = 0 77 | for ,$matched? = ,(compile-expr opts pat) 78 | while ,$matched? do 79 | (incf ,$matches) 80 | (checkpoint curr caps tags accum) 81 | ,@(list-if 82 | maximum 83 | `(if (= ,$matches ,maximum) 84 | (return t))) 85 | finally 86 | (restore curr caps tags accum) 87 | (return (>= ,$matches ,minimum)))))) 88 | 89 | (defpattern (any) ((pat :pat)) 90 | (compile-several opts pat 0)) 91 | (defpattern (some) ((pat :pat)) 92 | (compile-several opts pat 1)) 93 | (defpattern (between) ((pat :pat) (low :reps) (high :reps)) 94 | (compile-several opts pat low high)) 95 | (defpattern (at-least) ((pat :pat) (n :reps)) 96 | (compile-several opts pat n)) 97 | (defpattern (at-most) ((pat :pat) (n :reps)) 98 | (compile-several opts pat 0 n)) 99 | (defpattern (repeat) ((pat :pat) (n :reps)) 100 | (compile-several opts pat n n)) 101 | 102 | (defpattern (opt ?) ((pat :pat)) 103 | `(with-save (curr caps tags accum) 104 | (if ,(compile-expr opts pat) 105 | t 106 | (progn (restore curr caps tags accum) t)))) 107 | 108 | (defpattern (grammar) (&rest (pats :pat)) 109 | (verify-args! '(&rest (name :tag) (pat :pat)) `(grammar ,@pats)) 110 | (let* ((name-bodies (pairs pats)) 111 | (knames (mapcar #'first name-bodies)) 112 | (bodies (mapcar #'second name-bodies))) 113 | (unless (find :main knames) 114 | (throw-msg! (format nil "grammar error: no main in ~%~s" `(grammar ,@pats)))) 115 | (push knames (compopts-env opts)) 116 | (unless (compopts-prefix opts) (setf (compopts-prefix opts) (gensym))) 117 | (let* ((prefix (compopts-prefix opts)) 118 | (names (mapcar (lambda (n) (prefsym prefix n)) knames)) 119 | (comped (mapcar (lambda (b) (compile-expr opts b)) bodies)) 120 | (out `(labels ,(mapcar 121 | (lambda (name body) `(,name () ,body)) 122 | names comped) 123 | (declare (dynamic-extent 124 | ,@(mapcar (lambda (n) `(function ,n)) names))) 125 | (,(prefsym prefix :main))))) 126 | (pop (compopts-env opts)) 127 | out))) 128 | 129 | (defun compile-look (opts pat n) 130 | (with-gensyms ($matched?) 131 | `(with-save (curr caps tags accum) 132 | (,@(cond ((< n 0) `(when (<= ,n curr) (incf curr ,n))) ;; We need to check for underflow 133 | ((> n 0) `(progn (incf curr ,n))) ;; Every primitive already checks for overflow 134 | (t '(progn))) 135 | (let ((,$matched? ,(compile-expr opts pat))) 136 | (restore curr caps tags accum) 137 | ,$matched?))))) 138 | (defpattern (look >) ((pat :pat) (n :num)) 139 | (compile-look opts pat n)) 140 | (defpattern (not !) ((pat :pat)) 141 | `(not ,(compile-look opts pat 0))) 142 | (defpattern (if) ((cond :pat) (pat :pat)) 143 | `(and ,(compile-look opts cond 0) ,(compile-expr opts pat))) 144 | (defpattern (if-not) ((cond :pat) (pat :pat)) 145 | `(and (not ,(compile-look opts cond 0)) ,(compile-expr opts pat))) 146 | 147 | ;; With tail call elim these should compile to code just as efficient 148 | ;; as handwritten loops- maybe even more efficient 149 | (defpattern (thru) ((pat :pat)) 150 | (compile-expr opts `(grammar :main (+ ,pat (* 1 :main))))) 151 | (defpattern (to) ((pat :pat)) 152 | (compile-expr opts `(thru (look ,pat 0)))) 153 | 154 | (defpattern (sub) ((super-pat :pat) (sub-pat :pat)) 155 | (with-gensyms ($matched $after-curr) 156 | `(with-save (curr) 157 | (let ((,$after-curr 0)) 158 | (declare (fixnum ,$after-curr)) 159 | (when ,(compile-expr opts super-pat) 160 | (with-save (strlen) 161 | (setf strlen curr) 162 | (setf ,$after-curr curr) 163 | (restore curr) 164 | (let ((,$matched ,(compile-expr opts sub-pat))) 165 | (restore strlen) 166 | (setf curr ,$after-curr) 167 | ,$matched))))))) 168 | 169 | (defpattern (split) ((div :pat) (inside :pat)) 170 | (with-gensyms ($init-curr $after-curr $cont? $matched? $ret $fill) 171 | `(let ((,$init-curr 0) 172 | (,$after-curr 0) 173 | (,$ret nil) 174 | (,$cont? nil) 175 | (strlen-bak strlen) 176 | (curr-bak curr)) 177 | (declare (fixnum ,$init-curr ,$after-curr strlen-bak curr-bak) 178 | (boolean ,$ret ,$cont?)) 179 | (tagbody 180 | ,$fill 181 | (setf curr-bak curr) 182 | (setf ,$init-curr curr) 183 | (loop until (setf ,$cont? ,(compile-expr opts `(drop ,div))) 184 | while (< curr strlen) do 185 | (incf curr) 186 | (incf ,$init-curr) 187 | finally 188 | (setf strlen ,$init-curr) 189 | (setf ,$after-curr curr) 190 | (setf curr curr-bak)) 191 | (when-let ((,$matched? ,(compile-expr opts inside))) 192 | (setf curr ,$after-curr) 193 | (setf strlen strlen-bak) 194 | (when ,$cont? 195 | (go ,$fill)) 196 | (setf ,$ret t))) 197 | ,$ret))) 198 | -------------------------------------------------------------------------------- /src/capture.lisp: -------------------------------------------------------------------------------- 1 | (in-package :uclp) 2 | 3 | (defmacro push-item! (item-place &optional tag) 4 | `(progn 5 | ,@(list-if tag `(tpush! tags ,tag ,item-place)) 6 | (if accum? 7 | (if (stringp ,item-place) 8 | (apush! accum ,item-place 0 (length ,item-place)) 9 | (format accum "~a" ,item-place)) 10 | (qpush! caps ,item-place)))) 11 | 12 | (defmacro defcap (names spec &body body) 13 | `(defpattern ,names ,`((pat :pat) ,@spec &optional (tag :tag)) 14 | ,@body)) 15 | 16 | (defcap (capture <- quote) () 17 | `(with-save (curr) 18 | (when ,(compile-expr opts pat) 19 | (if accum? 20 | (apush! accum str curr-bak curr) 21 | (qpush! caps (subseq str curr-bak curr))) 22 | ,(when tag `(tpush! tags ,tag (subseq str curr-bak curr))) 23 | t))) 24 | 25 | (defpattern (drop) ((pat :pat)) 26 | (with-gensyms ($matched) 27 | `(with-save (caps accum tags) 28 | (let ((,$matched ,(compile-expr opts pat))) 29 | (restore caps accum tags) 30 | ,$matched)))) 31 | 32 | (defcap (accum %) () 33 | (with-gensyms ($matched $accumed) 34 | `(if accum? 35 | ,(compile-expr opts pat) 36 | (progn 37 | (setf accum? t) 38 | (with-save (accum) 39 | (let ((,$matched ,(compile-expr opts pat))) 40 | (when ,$matched 41 | (let ((,$accumed (subseq accum ,(to->back 'accum) (length accum)))) 42 | (unless (emptyp ,$accumed) 43 | (qpush! caps ,$accumed) 44 | ,(if tag `(tpush! tags ,tag ,$accumed))))) 45 | (restore accum) 46 | (setf accum? nil) 47 | ,$matched)))))) 48 | 49 | (defcap (group) () 50 | (with-gensyms ($matched $capped) 51 | `(with-save (caps accum?) 52 | (setf accum? nil) 53 | (let ((,$matched ,(compile-expr opts pat)) 54 | (,$capped (cdr ,(to->back 'caps)))) 55 | (restore accum? caps) 56 | (when ,$matched 57 | (push-item! ,$capped ,tag) 58 | t))))) 59 | 60 | (defcap (replace /) ((replacer :any)) 61 | (with-gensyms ($matched? $capped $result) 62 | `(with-save (caps accum?) 63 | (setf accum? nil) 64 | (let ((,$matched? ,(compile-expr opts pat)) 65 | ,@(list-if (functionp replacer) `(,$capped (cdr ,(to->back 'caps))))) 66 | (restore caps accum?) 67 | (when ,$matched? 68 | (let ((,$result ,(if (functionp replacer) 69 | `(apply ,replacer ,$capped) 70 | `',replacer))) 71 | (push-item! ,$result ,tag) 72 | t)))))) 73 | 74 | (defcap (cmt) ((replacer :any)) 75 | (with-gensyms ($matched $capped $result) 76 | `(with-save (caps accum?) 77 | (setf accum? nil) 78 | (let ((,$matched ,(compile-expr opts pat)) 79 | ,@(list-if (functionp replacer) `(,$capped (cdr caps-bak)))) 80 | (restore caps accum?) 81 | (when ,$matched 82 | (let ((,$result ,(if (functionp replacer) 83 | `(apply ,replacer ,$capped) 84 | `',replacer))) 85 | (when ,$result 86 | (push-item! ,$result ,tag) 87 | t))))))) 88 | 89 | (defpattern (backref ->) ((tag :tag) &optional (other-tag :tag)) 90 | (with-gensyms ($bind $val) 91 | `(let ((,$bind (backref tags ,tag))) 92 | (when ,$bind 93 | (let ((,$val (tbind-value ,$bind))) 94 | (push-item! ,$val ,other-tag)) 95 | t)))) 96 | 97 | (defpattern (backmatch) ((tag :tag)) 98 | (with-gensyms ($bind $val $strc $valc) 99 | `(let ((,$bind (backref tags ,tag))) 100 | (when ,$bind 101 | (let ((,$val (tbind-value ,$bind))) 102 | (when (and (stringp ,$val) 103 | (<= (+ (length ,$val) curr) strlen)) 104 | (loop for ,$valc of-type character across ,$val 105 | for ,$strc of-type character = (char str curr) 106 | do (if (char/= ,$valc ,$strc) (return nil)) 107 | (incf curr) 108 | finally (return t)))))))) 109 | 110 | (defpattern (unref) ((pat :pat) &optional (tag :tag)) 111 | (with-gensyms ($result) 112 | `(with-save (tags) 113 | (let ((,$result ,(compile-expr opts pat))) 114 | (when ,$result 115 | ,(if tag 116 | `(tscope-tag! tags tags-bak ,tag) 117 | `(restore tags)) 118 | ,$result))))) 119 | 120 | (defpattern (constant) ((thing :any) &optional (tag :tag)) 121 | (with-gensyms ($thing) 122 | `(let ((,$thing (quote ,thing))) 123 | (push-item! ,$thing ,tag) 124 | t))) 125 | 126 | (defpattern (position $) (&optional (tag :tag)) 127 | `(progn (push-item! curr ,tag) t)) 128 | (defun compile-linecol (line-or-col tag) 129 | (with-gensyms ($line $col) 130 | `(progn 131 | (unless line-map? 132 | (build-linemap! line-map str) 133 | (setf line-map? t)) 134 | (multiple-value-bind (,$line ,$col) (search-line line-map curr) 135 | (declare (ignorable ,$line ,$col)) 136 | (push-item! ,(if (eq line-or-col :line) $line $col) ,tag)) 137 | t))) 138 | (defpattern (line) (&optional (tag :tag)) 139 | (compile-linecol :line tag)) 140 | (defpattern (column) (&optional (tag :tag)) 141 | (compile-linecol :col tag)) 142 | 143 | (add-type! :index (list (lambda (n) (and (integerp n) 144 | (>= n 0))) 145 | "a non-negative integer")) 146 | (defpattern (argument) ((n :index) &optional (tag :tag)) 147 | `(when (< -1 ,n (length args)) 148 | (push-item! (aref args ,n) ,tag) 149 | t)) 150 | 151 | (defun peg-error? (name) 152 | (or (and (symbolp name) 153 | (subtypep name 'condition)) 154 | (stringp name))) 155 | 156 | (add-type! :error (list #'peg-error? "error message or symbol denoting condition")) 157 | 158 | (define-condition peg-error (error) 159 | ((pat :initarg :pat :reader error-pat) 160 | (matched :initarg :matched :reader error-matched) 161 | (caps :initarg :caps :reader error-caps)) 162 | (:report (lambda (e s) 163 | (let ((p (error-pat e)) 164 | (m (error-matched e))) 165 | (when p 166 | (when (/= (length m) 0) 167 | (format s "~s is " m)) 168 | (format s "~s" p)))))) 169 | 170 | (defpattern (error) (&optional (pat :pat) (err :error)) 171 | (with-gensyms ($matched? $capped) 172 | (if (and pat (subtypep (or err 'peg-error) 'peg-error)) 173 | `(with-save (curr caps tags accum accum?) 174 | (setf accum? nil) 175 | (let ((,$matched? ,(compile-expr opts pat)) 176 | (,$capped (cdr caps-bak))) 177 | (when ,$matched? 178 | (error ',(or err 'peg-error) 179 | :pat ',pat 180 | :matched (subseq str curr-bak curr) 181 | :caps ,$capped)) 182 | (restore curr caps tags accum accum?) 183 | t)) 184 | `(error ',(or err 'peg-error))))) 185 | 186 | (defpattern (lenprefix) ((n-pat :pat) (r-pat :pat)) ;;Grouped in with capture because it manipulates the stack 187 | (with-gensyms ($matched $capped $count) 188 | `(with-save (accum? caps tags) 189 | (setf accum? nil) 190 | (let ((,$matched ,(compile-expr opts n-pat)) 191 | (,$capped (cdr caps-bak))) 192 | (restore accum? caps tags) 193 | (when (and ,$matched ,$capped) 194 | (let ((,$count (first ,$capped))) 195 | (when (positive-integer-p ,$count) 196 | (loop repeat ,$count 197 | for ,$matched = ,(compile-expr opts r-pat) 198 | unless ,$matched do (return nil) 199 | finally (return t))))))))) 200 | 201 | (add-type! :radix (list (lambda (n) 202 | (or (null n) 203 | (and (integerp n) (<= 2 n 36))) 204 | "integer between 2 and 36"))) 205 | 206 | (defpattern (integer) ((pat :pat) &optional (radix :radix) (tag :tag)) 207 | (with-gensyms ($region $reg-len $int $int-len) 208 | `(with-save (curr) 209 | (when ,(compile-expr opts pat) 210 | (let ((,$region (subseq str curr-bak curr)) 211 | (,$reg-len (- curr curr-bak))) 212 | (multiple-value-bind (,$int ,$int-len) 213 | (parse-integer ,$region :radix ,(or radix 10) :junk-allowed t) 214 | (when (and ,$int (= ,$int-len ,$reg-len)) 215 | (push-item! ,$int ,tag) 216 | t))))))) 217 | -------------------------------------------------------------------------------- /src/compile.lisp: -------------------------------------------------------------------------------- 1 | (in-package :uclp) 2 | 3 | ;; Facilities for defining new combinators 4 | (defparameter *argtype* 5 | `((:string ,#'strform? "string") 6 | (:tag ,#'keywordp "keyword") 7 | (:num ,#'integerp "integer") 8 | (:reps ,#'positive-integer-p "positive integer") 9 | (:pat ,(lambda (x) (declare (ignore x)) t)) 10 | (:any ,(lambda (x) (declare (ignore x)) t)))) 11 | 12 | (defun add-type! (type typeform) 13 | (if-let ((bind (assoc type *argtype*))) 14 | (setf (cdr bind) typeform) 15 | (push (cons type typeform) *argtype*))) 16 | 17 | (defun check-arg (argspec arg) 18 | (cond 19 | ((symbolp argspec) t) 20 | ((listp argspec) 21 | (destructuring-bind (name fn/type &optional expectstr) argspec 22 | (declare (ignore name)) 23 | (if (keywordp fn/type) 24 | (check-arg (assoc fn/type *argtype*) arg) 25 | (if (funcall fn/type arg) 26 | t 27 | (values nil expectstr))))))) 28 | 29 | (define-condition grammar-error (error) 30 | ((expr :initarg :expr :reader expr) 31 | (expected :initarg :expected :reader expected) 32 | (got :initarg :got :reader got)) 33 | (:report (lambda (c s) 34 | (format 35 | s "grammar error in ~s, expected ~a, got ~s" 36 | (expr c) (expected c) (got c))))) 37 | 38 | (defun grammar-error (expr expected got) 39 | (error 'grammar-error :expr expr :expected expected :got got)) 40 | 41 | (defun verify-positionals! (spec expr &optional (reqarity (length spec) inexact?)) 42 | "Verify that the positional arguments in SPEC line up with EXPR. REQARITY 43 | should be one more than the last position index. On fail, signal appropriate 44 | grammar-error" 45 | (let ((args (rest expr))) 46 | (when (and (not inexact?) (/= (length args) reqarity)) ; all positional 47 | (grammar-error expr (format nil "~s args" reqarity) (length args))) 48 | (unless (>= (length args) reqarity) ; not all positional 49 | (grammar-error expr (format nil "at least ~s args" reqarity) (length args))) 50 | (loop repeat reqarity 51 | for argspec in spec 52 | for arg in args 53 | do (multiple-value-bind (match? errstr) (check-arg argspec arg) 54 | (unless match? (grammar-error expr errstr arg)))) 55 | t)) 56 | 57 | (defun verify-optionals! (spec expr reqarity) 58 | (let* ((args (rest expr)) 59 | (optspec (subseq spec (1+ reqarity))) 60 | (optargs (subseq args reqarity))) 61 | (unless (<= (length optargs) (length optspec)) 62 | (grammar-error expr (format nil "at most ~a args" (1- (length spec))) (length args))) 63 | (loop for argspec in optspec 64 | for arg in optargs 65 | do (multiple-value-bind (match? errstr) (check-arg argspec arg) 66 | (unless match? (grammar-error expr errstr arg)))) 67 | t)) 68 | 69 | (defun circularize! (list) (setf (cdr (last list)) list) list) 70 | (defun verify-rest! (spec expr reqarity) 71 | (let ((argspecs (circularize! (subseq spec (1+ reqarity)))) 72 | (restargs (subseq expr (1+ reqarity)))) 73 | (loop for arg in restargs 74 | for argspec in argspecs 75 | do (multiple-value-bind (match? errstr) (check-arg argspec arg) 76 | (unless match? (grammar-error expr errstr arg))))) 77 | t) 78 | 79 | (defun verify-args! (spec expr) 80 | (if-let ((reqarity (position-if 81 | (lambda (p) 82 | (and (symbolp p) 83 | (case (to-keyword p) 84 | (:&optional t) 85 | (:&rest t) 86 | (t nil)))) 87 | spec))) 88 | (and (verify-positionals! spec expr reqarity) 89 | (case (to-keyword (nth reqarity spec)) 90 | (:&optional (verify-optionals! spec expr reqarity)) 91 | (:&rest (verify-rest! spec expr reqarity)))) 92 | (verify-positionals! spec expr))) 93 | 94 | (defstruct (pattern) name spec compile-fn) 95 | (defvar *patterns* (make-hash-table :test 'eq)) 96 | 97 | (defmacro defpattern ((name &rest aliases) spec &body body) 98 | "Bind pattern with name and aliases whose arguments obey SPEC. 99 | In BODY, OPTS is anaphorically bound" 100 | (let ((destruct (mapcar (lambda (o) (if (listp o) (first o) o)) spec)) 101 | (qspec `(list ,@(loop for s in spec 102 | if (listp s) 103 | collect `(list ',(first s) ,@(rest s)) 104 | else collect `(quote ,s))))) 105 | (with-gensyms ($pat $expr $n) 106 | `(let ((,$pat (make-pattern 107 | :name ',name :spec ,qspec 108 | :compile-fn 109 | (lambda (opts ,$expr) 110 | (declare (ignorable opts)) 111 | (destructuring-bind ,destruct (rest ,$expr) 112 | ,@body))))) 113 | (loop for ,$n in ',(cons name aliases) 114 | do (setf (gethash (to-keyword ,$n) *patterns*) ,$pat)))))) 115 | 116 | (defparameter *aliases* nil) 117 | (defun register-alias! (alias pattern) 118 | "Does not check pattern, but DO NOT put a recursive pattern. Probably 119 | you shouldn't capture either" 120 | (unless (null pattern) 121 | (push (cons alias pattern) *aliases*))) 122 | (defun register-alias-suite! (alias pattern) 123 | "Adds pattern under alias, but also !alias, alias*, alias+, !alias*, !alias+" 124 | (let* ((sn (symbol-name alias)) 125 | (!sn (concatenate 'string "!" sn)) 126 | (sn+ (concatenate 'string sn "+")) 127 | (!sn+ (concatenate 'string !sn "+")) 128 | (sn* (concatenate 'string sn "*")) 129 | (!sn* (concatenate 'string !sn "*")) 130 | (sn? (concatenate 'string sn "?")) 131 | (!sn? (concatenate 'string !sn "?")) 132 | (!pattern `(if-not ,pattern 1))) 133 | (mapcar #'register-alias! 134 | (mapcar #'to-keyword (list sn !sn sn+ !sn+ sn* !sn* sn? !sn?)) 135 | (list pattern !pattern 136 | `(some ,pattern) `(some ,!pattern) 137 | `(any ,pattern) `(any ,!pattern) 138 | `(? ,pattern) `(? ,!pattern))))) 139 | 140 | (defparameter *base-aliases* 141 | '(:d (range "09") 142 | :a (range "az" "AZ") 143 | :w (range "az" "AZ" "09") 144 | :s (set (#\tab #\return #\newline #\null #\page #\vt #\space)) 145 | :h (range "09" "af" "AF"))) 146 | 147 | (mapcar (lambda (pair) (register-alias-suite! (first pair) (second pair))) 148 | (pairs *base-aliases*)) 149 | 150 | (defstruct (compopts) prefix env) 151 | (defun copts (prefix env) 152 | (make-compopts :prefix prefix :env env)) 153 | 154 | (defun print-peg-state (state stream depth) 155 | (declare (ignore depth state)) 156 | (print "peg state" stream)) 157 | (defstruct (peg-state (:print-function print-peg-state)) 158 | (str "" :type simple-string) 159 | (curr 0 :type fixnum) 160 | (args #() :type vector) 161 | (strlen 0 :type fixnum) 162 | (caps nil :type queue) 163 | (accum "" :type accum) 164 | (tags #() :type tstack) 165 | (accum? nil :type boolean) 166 | (line-map? nil :type boolean) 167 | (line-to 0 :type index) 168 | (line-map nil :type line-map)) 169 | (defun initialize-peg-state (str curr args) 170 | (make-peg-state 171 | :str str 172 | :curr curr 173 | :args args 174 | :strlen (length str) 175 | :caps (make-queue) 176 | :accum (make-accum) 177 | :tags (make-tstack) 178 | :accum? nil 179 | :line-map? nil 180 | :line-map (make-array 1 :element-type 'index :adjustable t :fill-pointer t))) 181 | 182 | (defun compile-toplevel (expr &key (quiet? t) debug?) 183 | (let ((muffler #+sbcl '(sb-ext:muffle-conditions sb-ext:compiler-note) #-sbcl nil)) 184 | `(lambda (state) 185 | (declare ,@(list-if quiet? muffler) 186 | (optimize ,@(if debug? 187 | '((speed 0)) 188 | '((debug 0) (speed 3)))) 189 | (peg-state state)) 190 | (with-slots (str curr args strlen caps tags accum accum? line-map? line-map) state 191 | (declare (ignorable str curr args strlen caps tags accum accum? line-map line-map?)) 192 | (if ,(compile-expr (make-compopts) expr) 193 | (values t (qitems caps)) 194 | nil))))) 195 | 196 | (defun env-lookup (env name) 197 | (when env 198 | (or (find name (first env)) 199 | (env-lookup (rest env) name)))) 200 | 201 | (define-condition unknown-pattern (error) 202 | ((name :initarg :name :reader ukpat-name)) 203 | (:report (lambda (c s) 204 | (format s "~s is not a recognized pattern" 205 | (ukpat-name c))))) 206 | 207 | (define-condition undefined-rule (error) 208 | ((rule-name :initarg :name :reader rule-name)) 209 | (:report (lambda (c s) 210 | (format s "~s is not defined in the current environment." 211 | (rule-name c))))) 212 | 213 | (defun compile-expr (opts expr) 214 | (cond 215 | ((functionp expr) `(funcall ,expr state)) 216 | ((strform? expr) (compile-literal opts (from-strform expr))) 217 | ((keywordp expr) (or (if (env-lookup (compopts-env opts) expr) 218 | (list (prefsym (compopts-prefix opts) expr))) 219 | (if-let ((assoced (assoc expr *aliases*))) 220 | (compile-expr opts (cdr assoced))) 221 | (error 'undefined-rule :name expr))) 222 | ((integerp expr) (compile-count opts expr)) 223 | ((listp expr) (if-let ((pattern (gethash (to-keyword (first expr)) *patterns*))) 224 | (and (verify-args! (pattern-spec pattern) expr) 225 | (funcall (pattern-compile-fn pattern) opts expr)) 226 | (error 'unknown-pattern :name (first expr)))))) 227 | -------------------------------------------------------------------------------- /src/core.lisp: -------------------------------------------------------------------------------- 1 | (in-package :uclp) 2 | 3 | (defun match (rule str &optional (start 0) &rest args) 4 | (funcall (if (functionp rule) rule (compile-peg rule)) 5 | (initialize-peg-state 6 | str 7 | start 8 | (loop with out = (make-array 0 :fill-pointer t :adjustable t) 9 | for a in args do (vector-push-extend a out) 10 | finally (return out))))) 11 | (defun captured (rule str &optional (start 0) &rest args) 12 | (multiple-value-bind (matched? caps) (apply #'match rule str start args) 13 | (values caps matched?))) 14 | 15 | (defun replace-all (match replace str &optional (start 0) &rest args) 16 | (multiple-value-bind (matched? caps) 17 | (apply #'match `(% (any (+ (/ '(drop ,match) ,replace) '1))) 18 | str start args) 19 | (values (first caps) matched?))) 20 | 21 | (defun replace-one (match replace str &optional (start 0) &rest args) 22 | (multiple-value-bind (matched? caps) 23 | (apply #'match `(% (* (<- (to ,match)) 24 | (/ '(drop ,match) ,replace) 25 | (<- (any 1)))) 26 | str start args) 27 | (values (first caps) matched?))) 28 | 29 | (defun find-all (match str &optional (start 0) &rest args) 30 | (multiple-value-bind (_ caps) 31 | (apply #'match `(any (* (to ,match) ($) (drop ,match))) ; we'd rather match twice than push and pop 32 | str start args) 33 | (declare (ignore _)) 34 | caps)) 35 | (defun find-one (match str &optional (start 0) &rest args) 36 | (multiple-value-bind (_ caps) 37 | (apply #'match `(* (to ,match) ($) (drop ,match)) 38 | str start args) 39 | (declare (ignore _)) 40 | (first caps))) 41 | 42 | (defun compile-peg (expr &rest opts) 43 | "Compile EXPR to a peg matcher, for use with uclp:match." 44 | (let ((out (compile nil (apply #'compile-toplevel expr opts)))) 45 | out)) 46 | -------------------------------------------------------------------------------- /src/uclp.lisp: -------------------------------------------------------------------------------- 1 | (defpackage :uclp 2 | (:use :cl :alexandria) 3 | (:export 4 | :compile-peg :match :captured :replace-all :replace-one :find-all :find-one 5 | :*aliases* :register-alias! :register-alias-suite! 6 | :peg-error)) 7 | -------------------------------------------------------------------------------- /src/util.lisp: -------------------------------------------------------------------------------- 1 | (in-package :uclp) 2 | 3 | (deftype index () `(integer 0 ,(+ array-dimension-limit 2))) 4 | 5 | ;; Semi-Norvig Queues: We need to implement the capture stack as a queue, but we will 6 | ;; never actually need to dequeue! So we can make optimizations which are otherwise 7 | ;; impossible. 8 | ;; A queue is just a cons where the head points to the last cell and the tail points to 9 | ;; the first, UNLESS its empty, in which cass the head points to the cell itself. 10 | 11 | 12 | (defmacro dec-inlines (&rest fn-decs) 13 | (flet ((to-ftype (typedec) 14 | (destructuring-bind (fname argtypes outtype) typedec 15 | `(ftype (function ,argtypes ,outtype) ,fname)))) 16 | (let ((fn-names (mapcar #'first fn-decs)) 17 | (ftypes (mapcar #'to-ftype fn-decs))) 18 | `(declaim ,@ftypes 19 | ,(cons 'inline fn-names))))) 20 | 21 | (deftype queue () 'cons) 22 | (dec-inlines 23 | (make-queue () queue) 24 | (qempty? (queue) boolean) 25 | (qitems (queue) list) 26 | (qpush! (queue t) null) 27 | (qsave (queue) cons) 28 | (qrestore! (queue cons) null)) 29 | (defun make-queue () 30 | (let ((out (cons nil nil))) 31 | (setf (car out) out) 32 | out)) 33 | (defun qempty? (q) (eq q (car q))) 34 | (defun qitems (q) (cdr q)) 35 | 36 | (defun qpush! (q item) 37 | (setf (car q) (setf (cdar q) (cons item nil))) 38 | nil) 39 | (defun qpush-all! (q items) (loop for i in items do (qpush! q i))) 40 | 41 | (defun qsave (q) (car q)) 42 | (defun qrestore! (q saved) 43 | "Should take a queue and the car of that queue at some earlier point" 44 | (setf (car q) saved) 45 | (setf (cdar q) nil)) 46 | 47 | ;; Accum String: Just an extendable string with save and reload semantics 48 | 49 | (deftype accum () '(vector character *)) 50 | (dec-inlines 51 | (make-accum () accum) 52 | (aempty? (accum) boolean) 53 | (apush! (accum string fixnum fixnum) null) 54 | (asave (accum) fixnum) 55 | (arestore! (accum fixnum) null)) 56 | 57 | (defun make-accum () (make-array 0 :adjustable t :fill-pointer t :element-type 'character)) 58 | (defun aempty? (a) (= (length a) 0)) 59 | 60 | (defun apush! (a s start end) 61 | (loop for i from start below end 62 | do (vector-push-extend (char s i) a))) 63 | 64 | (defun asave (a) (length a)) 65 | (defun arestore! (a saved) (setf (fill-pointer a) saved) nil) 66 | 67 | ;; Tag Stack: Literally just an alist searched for a tag from beginning to end 68 | 69 | (defstruct tbind (tag :scope :type keyword) value) 70 | 71 | (deftype tstack () '(vector tbind *)) 72 | (dec-inlines 73 | (make-tstack () tstack) 74 | (tpush! (tstack keyword t) null) 75 | (tpush-scope! (tstack) null) 76 | (backref (tstack keyword) (or tbind null)) 77 | (tsave (tstack) fixnum) 78 | (trestore! (tstack fixnum) null) 79 | (tscope-tag! (tstack fixnum keyword) null)) 80 | 81 | (defun make-tstack () (make-array 0 :adjustable t :fill-pointer t :element-type 'tbind)) 82 | 83 | (defun tpush! (tstack key item) 84 | (vector-push-extend (make-tbind :tag key :value item) tstack) 85 | nil) 86 | 87 | (defun backref (tstack key) 88 | (loop for bind across tstack 89 | if (eq (tbind-tag bind) key) do 90 | (return bind))) 91 | 92 | (defun tsave (tstack) 93 | (length tstack)) 94 | (defun trestore! (tstack save) 95 | (setf (fill-pointer tstack) save) 96 | nil) 97 | (defun tscope-tag! (tstack save tag) 98 | (loop with goodc = save 99 | for i from save below (length tstack) 100 | for bind = (aref tstack i) 101 | if (not (eq (tbind-tag bind) tag)) do 102 | ;; goodc points to the end of the known good binds. If we encounter a bad bind, 103 | ;; i will advance and goodc won't. Most of the library is ripped from Bakpakin, but 104 | ;; this specific implementation I never would have thought of on my own. Seeing it 105 | ;; in the codebase made me confident it would work. 106 | (rotatef (aref tstack i) (aref tstack goodc)) 107 | (incf goodc) 108 | finally (setf (fill-pointer tstack) goodc)) 109 | nil) 110 | 111 | (defun strform? (strform) 112 | (or (stringp strform) 113 | (characterp strform) 114 | (and (listp strform) (every #'strform? strform)))) 115 | (defun from-strform (strlikes) 116 | "Any place we normally take a string, also accept a list of strings and/or chars" 117 | (if (or (stringp strlikes) (characterp strlikes)) 118 | (string strlikes) 119 | (reduce (lambda (s1 s2) (concatenate 'string s1 s2)) 120 | strlikes 121 | :key #'string))) 122 | 123 | (defun pairs (list) 124 | (cond 125 | ((null list) nil) 126 | ((null (cdr list)) (list list)) 127 | (t (destructuring-bind (a b &rest rest) list 128 | (cons (list a b) (pairs rest)))))) 129 | 130 | ;; Lines 131 | (deftype line-map () '(and (vector index *) (not (vector t 0)))) 132 | (dec-inlines 133 | (search-line (line-map index) (values index index)) 134 | (build-linemap! (line-map string) NULL)) 135 | 136 | (defun search-line (line-map curr) 137 | (labels ((iter (left right) 138 | (if (= left right) 139 | (values (1+ left) (1+ (- curr (aref line-map left)))) 140 | (let* ((mid (ash (+ left right) -1)) 141 | (midval (aref line-map mid))) 142 | (cond 143 | ((= midval curr) (values (1+ mid) 1)) 144 | ((= left mid) (values (1+ mid) (1+ (- curr midval)))) 145 | ((< midval curr) (iter mid right)) 146 | (t (iter left mid))))))) 147 | (declare (ftype (function (index (and index (integer 1))) (values index index)) iter) 148 | (dynamic-extent (function iter))) 149 | (iter 0 (length line-map)))) 150 | 151 | (defun build-linemap! (lmap str) 152 | (loop for c across str 153 | for i from 0 154 | do (when (char= c #\Newline) (vector-push-extend (1+ i) lmap)))) 155 | 156 | (define-condition generic-error (error) 157 | ((msg :initarg :msg :reader error-msg)) 158 | (:report (lambda (c s) (format s "~a" (error-msg c))))) 159 | (defun throw-msg! (msg) (error 'generic-error :msg msg)) 160 | 161 | 162 | ;; str str-end args curr curr-bak caps caps-bak tags tags-bak accum accum? accum-bak 163 | 164 | (defun to-keyword (name) 165 | (intern (if (symbolp name) (symbol-name name) name) 166 | (find-package :keyword))) 167 | 168 | (defun prefsym (prefix symbol) 169 | (intern (concatenate 170 | 'string 171 | (symbol-name prefix) 172 | "/" 173 | (symbol-name symbol)))) 174 | (defun to->back (symbol) 175 | (intern (concatenate 'string (symbol-name symbol) "-BAK") 176 | (symbol-package symbol))) 177 | 178 | (defun list-if (x obj) 179 | (when x (list obj))) 180 | 181 | (define-condition bad-place (error) (place)) 182 | (defun save-for (place) 183 | (case (to-keyword place) 184 | (:curr place) 185 | (:caps `(qsave ,place)) 186 | (:tags `(tsave ,place)) 187 | (:accum? place) 188 | (:strlen place) 189 | (:accum `(asave ,place)) 190 | (t (error 'bad-place :place place)))) 191 | 192 | (defun save (&rest places) 193 | (if (= (length places) 1) 194 | (let ((place (first places))) 195 | (list (to->back place) (save-for place))) 196 | (mapcar #'save places))) 197 | (defmacro with-save (places &body body) 198 | (let* ((save-slots (mapcar #'to->back places)) 199 | (save-methods (mapcar #'save-for places)) 200 | (bindings (mapcar #'list save-slots save-methods))) 201 | `(let ,bindings 202 | ,@body))) 203 | 204 | (defmacro checkpoint (&rest places) 205 | `(setf ,@(mapcan 206 | (lambda (p) 207 | (list (to->back p) (save-for p))) 208 | places))) 209 | 210 | (defun restore-for (place) 211 | (case (to-keyword place) 212 | (:curr `(setf ,place ,(to->back place))) 213 | (:caps `(qrestore! ,place ,(to->back place))) 214 | (:tags `(trestore! ,place ,(to->back place))) 215 | (:accum? `(setf ,place ,(to->back place))) 216 | (:accum `(arestore! ,place ,(to->back place))) 217 | (:strlen `(setf ,place ,(to->back place))) 218 | (t (error 'bad-place :place place)))) 219 | 220 | (defmacro restore (&rest places) 221 | `(progn ,@(mapcar #'restore-for places))) 222 | -------------------------------------------------------------------------------- /test/basics-test.lisp: -------------------------------------------------------------------------------- 1 | (in-package :uclp/test) 2 | 3 | (def-suite basics :in uclp) 4 | (in-suite basics) 5 | 6 | (test literal-matches-from-start 7 | (is-match "a" "a") 8 | (is-match #\a "ab") 9 | (isnt-match "ab" "a") 10 | (isnt-match "ab" "ab" :start 1)) 11 | (test strform-converted 12 | (is-match #\Newline (string #\Newline)) 13 | (is-match '(#\A #\a) "Aa") 14 | (is-match '("ab") "ab")) 15 | (test empty-string-matches 16 | (is-match "" "abc") 17 | (is-match "" "abc" :start 3)) 18 | 19 | (test positive-count 20 | (is-match 1 "a") 21 | (is-match 2 "ab") 22 | (is-match 1 "ab") 23 | (isnt-match 3 "ab") 24 | (isnt-match 1 "") 25 | (isnt-match 1 "a" :start 1)) 26 | (test zero-count 27 | (is-match 0 "") 28 | (is-match 0 "a") 29 | (is-match 0 "a" :start 1) 30 | (isnt-match 0 "a" :start 2)) 31 | (test negative-count 32 | (is-match -1 "") 33 | (is-match -1 "a" :start 1) 34 | (is-match -2 "a") 35 | (isnt-match -2 "ab") 36 | (is-match -2 "a" :start 1)) 37 | 38 | (test range 39 | (is-match '(range "ac") "a") 40 | (is-match '(range (#\a #\c)) "c") 41 | (is-match '(range "ac") "c") 42 | (isnt-match '(range "az") "A") 43 | (isnt-match '(range "az") "7") 44 | (isnt-match '(range "AZ") "a")) 45 | 46 | (test set 47 | (is-match '(set ("a1" #\Z #\_)) "18") 48 | (is-match '(set "a1Z_") "a7") 49 | (is-match '(set "a1Z_") "_1") 50 | (isnt-match '(set "az") "b") 51 | (isnt-match '(set "az") "")) 52 | 53 | ;; (test tail-call-elim 54 | ;; (is-match '(grammar :main (+ -1 (* 1 :main))) 55 | ;; (make-array 50000 :element-type 'character :initial-element #\a))) ; milage may vary 56 | 57 | (test grammar 58 | (check-pat `(grammar 59 | :a (* "a" :b "a") 60 | :b (* "b" (+ :a 0) "b") 61 | :main (* "(" :b ")")) 62 | :match "(bb)" "(babbab)" "(babbab)" "(bababbabab)" 63 | :fail "()" "(aba" "(aa)" "(abbb)" "(bab)")) 64 | 65 | (test sub 66 | (check-pat `(* (sub (<- 2 :b) (* (any "b") -1)) (backmatch :b)) 67 | :match "bbbb" "bbbbc" 68 | :fail "bbb" "" "1" "bbcc") 69 | ; After matching sub-pat, advance to end of super-pat 70 | (is-match '(* (sub 3 '1) "b") "aaab")) 71 | 72 | ;; As of now (2024-03-20) the SPLIT pattern is not in the latest release of Janet, 73 | ;; but is in the current repo. The desired behavior doesn't seem to be documented 74 | ;; anywhere, but my faith in the design goals of Ian Henry far outstrips my faith 75 | ;; in my own. So I am trying to mimic the behavior of Janet's SPLIT as closely as 76 | ;; as possible. These tests are copied verbatim from Janet's test suite. 77 | 78 | (test split 79 | ; basic functionality 80 | (is-match '(split "," '1) "a,b,c" 81 | :result '("a" "b" "c")) 82 | 83 | ; drops captures from separator pattern 84 | (is-match '(split '"," '1) "a,b,c" 85 | :result '("a" "b" "c")) 86 | 87 | ; can match empty subpatterns 88 | (is-match '(split "," ':w*) ",a,,bar,,,c,," 89 | :result '("" "a" "" "bar" "" "" "c" "" "")) 90 | 91 | ; subpattern is limited to only text before the separator 92 | (is-match '(split "," '(to -1)) "a,,bar,c" 93 | :result '("a" "" "bar" "c")) 94 | 95 | ; fails if any subpattern fails 96 | (isnt-match '(split "," '"a") "a,a,b") 97 | 98 | ; separator does not have to match anything 99 | (is-match '(split "x" '(to -1)) "a,a,b" 100 | :result '("a,a,b")) 101 | 102 | ; always consumes entire input 103 | (is-match '(split 1 '"") "abc" 104 | :result '("" "" "" "")) 105 | 106 | ; separator can be an arbitrary PEG 107 | (is-match '(split :s+ '(to -1)) "a b c" 108 | :result '("a" "b" "c")) 109 | 110 | ; does not advance past the end of the input 111 | (is-match '(* (split "," ':w+) 0) "a,b,c" 112 | :result '("a" "b" "c"))) 113 | -------------------------------------------------------------------------------- /test/capture-test.lisp: -------------------------------------------------------------------------------- 1 | (in-package :uclp/test) 2 | 3 | (def-suite capture :in uclp) 4 | (in-suite capture) 5 | 6 | (test nested-accum 7 | (check-pat `(% (* (+ (% (* '"a" '"b")) 8 | (/ (% (* '"c" '"d")) ,#'string-upcase)) 9 | '"ef")) 10 | :match ("abef" :result '("abef")) ("cdef" :result '("CDef")))) 11 | 12 | (test backref 13 | (check-pat `(grammar 14 | :pad (any "=") 15 | :open (* "[" (<- :pad :n) "[") 16 | :close (* "]" (cmt (* (backref :n) (<- :pad)) ,#'string=) "]") 17 | :main (* :open (any (if-not :close 1)) :close -1)) 18 | :match "[[]]" "[==[a]==]" "[[blark]]" "[[bl[ark]]" "[[bl]rk]]" "[===[]==]===]" 19 | :fail "[[bl]rk]] " "[=[bl]]rk]=] " "[=[bl]==]rk]=] " "[==[]===]") 20 | (check-pat `(* (+ (<- "b" :b) 21 | (<- 1 :a)) 22 | (+ (* (backmatch :a) "c") 23 | (backmatch :b)) 24 | -1) 25 | :match "bb" "aac" 26 | :fail "cc" "ab")) 27 | 28 | (test argument 29 | (is-match '(* (argument 0 :a) (backmatch :a) -1) "abc" :args '("abc")) 30 | (isnt-match '(* (argument 1 :a) (backmatch :a) -1) "abc" :args '("abc" "efg")) 31 | (isnt-match '(* (argument 0)) "abc")) 32 | 33 | (test lenprefix 34 | (check-pat `'(* (lenprefix (* (integer 1 nil :b) (integer 1)) "a") (? (backref :b))) 35 | :match "35aaa" "31aaaa" ("30aaaa" :result '("30aaa")) 36 | :fail "33aa" "3aab" "33ab" "v2aaa")) 37 | 38 | (define-condition test-error (peg-error) ()) 39 | (test error 40 | (is-match '(* 1 (error -1) 1) "ab") 41 | (signals uclp:peg-error 42 | (match '(* 1 (error -1) 1) "a")) 43 | (signals test-error 44 | (match '(* 1 (error -1 test-error) 1) "a"))) 45 | -------------------------------------------------------------------------------- /test/core-test.lisp: -------------------------------------------------------------------------------- 1 | (in-package :uclp/test) 2 | (def-suite core :in uclp) 3 | (in-suite core) 4 | 5 | ;; Built ints are in core for now 6 | 7 | (test build-ins 8 | (check-pat '(* :d :h :d? :!d :s :!s :w+ -1) 9 | :match "111a *l23ghS" "1aa d44Gg" 10 | :fail "1aaaaa" "1ad X wef " "aaaaa")) 11 | 12 | (test user-aliases 13 | (let ((*aliases* *aliases*)) 14 | (register-alias! :v "video") 15 | (register-alias! :w " not w!") 16 | (is-match :v "video") 17 | (is-match :w " not w!")) 18 | (signals uclp::undefined-rule 19 | (match :v "video")) 20 | (isnt-match :w " not w!")) 21 | 22 | (test compile-peg 23 | (let ((comp-pat (compile-peg '(grammar 24 | :a (* "a" :b "a") 25 | :b (+ "bb" :a "|") 26 | :main :a)))) 27 | (is-match comp-pat "aaa|aaa") ; Can pass to match 28 | (is-match `(split "," ,comp-pat) "a|a,abba,aa|aa"))) ; Can compose 29 | 30 | (test replace 31 | (is (string= (replace-one '(* (set "Rr") "avi" (? "'s")) "John" "Ravi's name is ravi") 32 | "John name is ravi")) 33 | (is (string= (replace-all '(* (set "Rr") "avi" (? "'s")) "John" "Ravi's name is ravi") 34 | "John name is John"))) 35 | 36 | (test find 37 | (let ((str "This is a secret waldo message with a man waldo hiding in waldo it")) 38 | (is (match "waldo" str (find-one "waldo" str))) ; find-one should return int 39 | (is (every (lambda (pos) (match "waldo" str pos)) ; find-all should return list 40 | (find-all "waldo" str))))) 41 | -------------------------------------------------------------------------------- /test/init.lisp: -------------------------------------------------------------------------------- 1 | (defpackage :uclp/test 2 | (:use :cl :uclp :fiveam :alexandria)) 3 | (in-package :uclp/test) 4 | 5 | (defun matches? (pat str start args &optional (expect-caps nil check-caps?)) 6 | (multiple-value-bind (success? caps) (apply #'uclp:match pat str start args) 7 | (and success? 8 | (if check-caps? 9 | (equalp caps expect-caps) 10 | t)))) 11 | (defmacro is-match (pat str &key (start 0) (args nil) (result nil check-result?)) 12 | `(fiveam:is (matches? ,pat ,str ,start ,args ,@(if check-result? (list result))))) 13 | (defmacro isnt-match (pat str &key (start 0) (args nil)) 14 | `(fiveam:is-false (matches? ,pat ,str ,start ,args))) 15 | 16 | (defmacro check-pat (pat &body inputs) 17 | `(progn 18 | ,@(loop with mode = nil 19 | for input in inputs 20 | if (eq input :match) do (setf mode 'is-match) 21 | else if (eq input :fail) do (setf mode 'isnt-match) 22 | else if (stringp input) collect (list mode pat input) 23 | else if (consp input) collect `(,mode ,pat ,@input)))) 24 | 25 | (fiveam:def-suite uclp) 26 | 27 | (defun run-tests! () (fiveam:run! 'uclp)) 28 | -------------------------------------------------------------------------------- /test/util-test.lisp: -------------------------------------------------------------------------------- 1 | (in-package :uclp/test) 2 | 3 | (def-suite utils :in uclp) 4 | (in-suite utils) 5 | 6 | (defun strform= (strform str) 7 | (string= (uclp::from-strform strform) str)) 8 | 9 | (test strform 10 | (is (strform= "a" "a")) 11 | (is (strform= #\a "a")) 12 | (is (strform= '("a" #\b "c" #\d) "abcd"))) 13 | -------------------------------------------------------------------------------- /uclp.asd: -------------------------------------------------------------------------------- 1 | (defsystem :uclp 2 | :version "0.1.3" 3 | :author "Ravi D'Elia" 4 | :license "MIT" 5 | :depends-on (:alexandria :serapeum :trivia) 6 | :pathname "src" 7 | :components 8 | ((:file "uclp") 9 | (:file "util" :depends-on ("uclp")) 10 | (:file "compile" :depends-on ("util")) 11 | (:file "basics" :depends-on ("compile")) 12 | (:file "capture" :depends-on ("compile")) 13 | (:file "core" :depends-on ("basics" "capture"))) 14 | :in-order-to ((test-op (test-op "uclp/test"))) 15 | :description "An experimental implementation of Janet's PEG module in common lisp") 16 | 17 | (defsystem :uclp/test 18 | :depends-on (:uclp :fiveam) 19 | :pathname "test" 20 | :components 21 | ((:file "init") 22 | (:file "util-test" :depends-on ("init")) 23 | (:file "core-test" :depends-on ("init")) 24 | (:file "basics-test" :depends-on ("init")) 25 | (:file "capture-test" :depends-on ("init"))) 26 | :perform (test-op (o c) (symbol-call :uclp/test :run-tests!))) 27 | --------------------------------------------------------------------------------