├── .gitignore ├── re.asd ├── LICENSE.txt ├── README.md └── re.lisp /.gitignore: -------------------------------------------------------------------------------- 1 | *~ 2 | *.swp 3 | *.*fasl 4 | *.*fsl 5 | -------------------------------------------------------------------------------- /re.asd: -------------------------------------------------------------------------------- 1 | (defpackage :re-asd 2 | (:use :cl :asdf)) 3 | 4 | (in-package :re-asd) 5 | 6 | (defsystem :re 7 | :name "re" 8 | :version "1.0" 9 | :author "Jeffrey Massung" 10 | :license "Apache 2.0" 11 | :description "Lua-style string pattern matching." 12 | :serial t 13 | :components ((:file "re")) 14 | :depends-on ("parse")) 15 | -------------------------------------------------------------------------------- /LICENSE.txt: -------------------------------------------------------------------------------- 1 | This file is provided to you under the Apache License, 2 | Version 2.0 (the "License"); you may not use this file 3 | except in compliance with the License. You may obtain 4 | a copy of the License at 5 | 6 | http://www.apache.org/licenses/LICENSE-2.0 7 | 8 | Unless required by applicable law or agreed to in writing, 9 | software distributed under the License is distributed on an 10 | "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY 11 | KIND, either express or implied. See the License for the 12 | specific language governing permissions and limitations 13 | under the License. -------------------------------------------------------------------------------- /README.md: -------------------------------------------------------------------------------- 1 | # The RE Package 2 | 3 | The `re` package is a small, portable, lightweight, and quick, regular expression library for Common Lisp. It is a non-recursive, backtracing VM. The syntax is similar to [Lua](http://www.lua.org)-style pattern patching (found [here](http://www.lua.org/pil/20.2.html)), but has added support for additional regex features (see below). It's certainly not the fastest, but is very easy to understand and extend. 4 | 5 | It makes heavy use of the monadic [`parse`](http://github.com/massung/parse) combinator library for parsing the regular expressions. If you'd like to understand the parsing and compiling of regular expressions, I recommend reading up on that library as well. 6 | 7 | ## Compiling Patterns 8 | 9 | To create a `re` object, you can either use the `compile-re` function or the `#r` dispatch macro. 10 | 11 | CL-USER > (compile-re "%d+") 12 | # 13 | 14 | CL-USER > #r/%d+/ 15 | # 16 | 17 | Both work equally well, but the dispatch macro will compile the pattern at read-time. The `re` class has a [load form](http://www.lispworks.com/documentation/HyperSpec/Body/f_mk_ld_.htm#make-load-form) and so can be saved to a FASL file. 18 | 19 | *HINT: When using the read macro, use a backslash to escape the `/` and other characters that might mess with syntax coloring.* 20 | 21 | Finally, the `with-re` macro lets you use either strings or `re` objects in a body of code. If a string is passed as the pattern, then it will be compiled before the body is evaluated. 22 | 23 | CL-USER > (with-re (re "%d+") re) 24 | # 25 | 26 | *NOTE: All pattern matching functions use the `with-re` macro, and so the pattern argument can be either a string or a pre-compiled `re` object.* 27 | 28 | ## Basic Pattern Matching 29 | 30 | The heart of all pattern matching is the `match-re` function. 31 | 32 | (match-re pattern string &key start end exact) 33 | 34 | It will match `string` against `pattern` and return a `re-match` object on success or `nil` on failure. The `start` and `end` arguments limit the scope of the match and default to the entire string. If `exact` is `t` then the pattern has to consume the entire string (from start to end). 35 | 36 | CL-USER > (match-re "%d+" "abc 123") 37 | NIL 38 | 39 | CL-USER > (match-re "%a+" "abc 123") 40 | # 41 | 42 | Once you have successfully matched and have a `re-match` object, you can use the following reader functions to inspect it: 43 | 44 | * `match-string` returns the entire match 45 | * `match-groups` returns a list of groups 46 | * `match-pos-start` returns the index where the match began 47 | * `match-pos-end` returns the index where the match ended 48 | 49 | Try peeking into a match... 50 | 51 | CL-USER > (inspect (match-re "(a(b(c)))" "abc 123")) 52 | MATCH "abc" 53 | GROUPS ("abc" "bc" "c") 54 | START-POS 0 55 | END-POS 3 56 | 57 | ## Pattern Scanning 58 | 59 | To find a pattern match anywhere in a string use the `find-re` function. 60 | 61 | (find-re pattern string &key start end all) 62 | 63 | It will scan `string` looking for matches to `pattern`. If `all` is non-`nil` then a list of all matches found is returned, otherwise it will simply be the first match. 64 | 65 | CL-USER > (find-re "%d+" "abc 123") 66 | # 67 | 68 | CL-USER > (find-re "[^%s]+" "abc 123" :all t) 69 | (# 70 | #) 71 | 72 | ## Splitting by Pattern 73 | 74 | Once patterns have been matched, splitting a string from the matches is trivial. 75 | 76 | (split-re pattern string &key start end all coalesce-seps) 77 | 78 | If `all` is true, then a list of all sub-sequences in `string` (delimited by `pattern`) are returned, otherwise just the first and the rest of the string. 79 | 80 | If `coalesce-seps` is true the sub-sequences that are empty will be excluded from the results. This argument is ignored if `all` is `nil`. 81 | 82 | CL-USER > (split-re "," "1,2,3") 83 | "1" 84 | "2,3" 85 | 86 | CL-USER > (split-re "," "1,2,,,abc,3,," :all t :coalesce-seps t) 87 | ("1" "2" "abc" "3") 88 | 89 | ## Replacing by Pattern 90 | 91 | The `replace-re` function scans the string looking for matching sub-sequences that will be replaced with another string. 92 | 93 | (replace-re pattern with string &key start end all) 94 | 95 | If `with` is a function, then the function is called with the `re-match` object, replacing the pattern with the return value. Otherwise the value is used as-is. As with `find-re` and `split-re`, if `all` is true, then the pattern is globally replaced. 96 | 97 | CL-USER > (replace-re "%d+" #\* "1 2 3") 98 | "* 2 3" 99 | 100 | CL-USER > (replace-re "%a+" #'(lambda (m) (length (match-string m))) "a bc def" :all t) 101 | "1 2 3" 102 | 103 | *NOTE: The string returned by `replace-re` is a completely new string. This is true even if `pattern` isn't found in the string.* 104 | 105 | ## Groups 106 | 107 | Using parentheses in a pattern will cause the matching text to be groups in the returned `re-match` object. The `match-groups` function will return a list of all the captured strings in the match. 108 | 109 | CL-USER > (match-groups (match-re #r/(%d+)(%a+)/ "123abc")) 110 | ("123" "abc") 111 | 112 | Captures can be nested, but are always returned in the order they are **opened**. 113 | 114 | CL-USER > (match-groups (match-re #r/(a(b(c)))(d)/ "abcd")) 115 | ("abc" "bc" "c" "d") 116 | 117 | *HINT: You can always use the `match-string` function to get at the full text that was matched and there's no need to capture the entire pattern.* 118 | 119 | ## The `with-re-match` Macro 120 | 121 | Whe `with-re-match` macro can be used to assist in extracting the matched patterns and groups. 122 | 123 | (with-re-match ((var match-expr &key no-match) &body body) 124 | 125 | If the result of `match-expr` is `nil`, then `no-match` is returned and `body` is not executed. 126 | 127 | While in the body of the macro, `$$` will be bound to the `match-string` and the groups will be bound to `$1`, `$2`, ..., `$9`. Any groups beyond the first 9 are bound in a list to `$_`. The symbol `$*` is bound to all the match groups. 128 | 129 | CL-USER > (with-re-match (m (match-re "(%a+)(%s+)(%d+)" "abc 123")) 130 | (string-append $3 $2 $1))) 131 | "123 abc" 132 | 133 | CL-USER > (flet ((initial (m) 134 | (with-re-match (v m) 135 | (format nil "~@(~a~)." $1)))) 136 | (replace-re #r/(%a)%a+%s*/ #'initial "lisp in small pieces" :all t)) 137 | "L.I.S.P." 138 | 139 | ## Additional Features 140 | 141 | In addition to supporting all of what Lua pattern matching has to offer, it also supports branching with `|` and uncaptured groups: `(?..)`. For example... 142 | 143 | CL-USER > (match-re "(?a|b)+" "abbaaabbccc") 144 | # 145 | 146 | Also, to have slightly better support for Windows line endings, there is also the `%r` character set that matches on return characters only (as opposed to `%n` which matches on return and linefeed). This allows for patterns that match a single Windows or Linux/Mac line ending. 147 | 148 | CL-USER > (match-re "%r?%n" (concatenate 'string '(#\return #\linefeed))) 149 | # 151 | 152 | Finally, the `re` package has one special feature: user-defined character set predicates! Using `%:`, you can provide a predicate function for the regexp VM to test characters against. 153 | 154 | CL-USER > (match-re #r"%:digit-char-p:+" "103") 155 | # 156 | 157 | The predicate must take a single character and return non-`nil` if the character matches the predicate function. *Note: this is especially handy when parsing unicode strings!* 158 | 159 | # Thank You! 160 | 161 | If you get some good use out of this package, please let me know; it's nice to know your work is valued by others. 162 | 163 | I'm always improving it; it's the foundation for many of the other packages I've created for XML parsing, HTTP header parsing, etc. 164 | 165 | Should you find/fix a bug or add a nice feature, please feel free to send a pull request or let me know at [massung@gmail.com](mailto:massung@gmail.com). 166 | -------------------------------------------------------------------------------- /re.lisp: -------------------------------------------------------------------------------- 1 | ;;;; Regular Expressions for Common Lisp 2 | ;;;; 3 | ;;;; Copyright (c) Jeffrey Massung 4 | ;;;; 5 | ;;;; This file is provided to you under the Apache License, 6 | ;;;; Version 2.0 (the "License"); you may not use this file 7 | ;;;; except in compliance with the License. You may obtain 8 | ;;;; a copy of the License at 9 | ;;;; 10 | ;;;; http://www.apache.org/licenses/LICENSE-2.0 11 | ;;;; 12 | ;;;; Unless required by applicable law or agreed to in writing, 13 | ;;;; software distributed under the License is distributed on an 14 | ;;;; "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY 15 | ;;;; KIND, either express or implied. See the License for the 16 | ;;;; specific language governing permissions and limitations 17 | ;;;; under the License. 18 | ;;;; 19 | 20 | (defpackage :re 21 | (:use :cl :parse) 22 | (:export 23 | #:with-re 24 | #:with-re-match 25 | 26 | ;; interface 27 | #:compile-re 28 | #:match-re 29 | #:find-re 30 | #:split-re 31 | #:replace-re 32 | 33 | ;; match readers 34 | #:match-string 35 | #:match-groups 36 | #:match-pos-start 37 | #:match-pos-end)) 38 | 39 | (in-package :re) 40 | 41 | ;;; ---------------------------------------------------- 42 | 43 | (defclass re () 44 | ((pattern :initarg :pattern :reader re-pattern) 45 | (expr :initarg :expression :reader re-expression)) 46 | (:documentation "Regular expression.")) 47 | 48 | ;;; ---------------------------------------------------- 49 | 50 | (defclass re-match () 51 | ((match :initarg :match :reader match-string) 52 | (groups :initarg :groups :reader match-groups) 53 | (start-pos :initarg :start-pos :reader match-pos-start) 54 | (end-pos :initarg :end-pos :reader match-pos-end)) 55 | (:documentation "Matched pattern.")) 56 | 57 | ;;; ---------------------------------------------------- 58 | 59 | (defmethod print-object ((re re) s) 60 | "Output a regular expression to a stream." 61 | (print-unreadable-object (re s :type t) 62 | (format s "~s" (re-pattern re)))) 63 | 64 | ;;; ---------------------------------------------------- 65 | 66 | (defmethod print-object ((match re-match) s) 67 | "Output a regular expression match to a stream." 68 | (print-unreadable-object (match s :type t) 69 | (format s "~s" (match-string match)))) 70 | 71 | ;;; ---------------------------------------------------- 72 | 73 | (defun tab-p (c) 74 | "T if c is a tab character." 75 | (char= c #\tab)) 76 | 77 | ;;; ---------------------------------------------------- 78 | 79 | (defun space-p (c) 80 | "T if c is a whitespace character." 81 | (or (char= c #\tab) 82 | (char= c #\space))) 83 | 84 | ;;; ---------------------------------------------------- 85 | 86 | (defun return-p (c) 87 | "T if c is a return character." 88 | (char= c #\return)) 89 | 90 | ;;; ---------------------------------------------------- 91 | 92 | (defun newline-p (c) 93 | "T if c is a newline character." 94 | (or (char= c #\return) 95 | (char= c #\linefeed))) 96 | 97 | ;;; ---------------------------------------------------- 98 | 99 | (defun word-char-p (c) 100 | "T if is alphanumeric or an underscore." 101 | (or (alphanumericp c) (char= c #\_))) 102 | 103 | ;;; ---------------------------------------------------- 104 | 105 | (defun punctuation-p (c) 106 | "T if c is a punctuation character." 107 | (find c "`~!@#$%^&*()-+=[]{}\|;:',./<>?\"" :test #'char=)) 108 | 109 | ;;; ---------------------------------------------------- 110 | 111 | (defun hex-char-p (c) 112 | "T if c is a hexadecimal character." 113 | (digit-char-p c 16)) 114 | 115 | ;;; ---------------------------------------------------- 116 | 117 | (define-parser re-parser 118 | "A regular expression is one or more expressions." 119 | (.let (ex (.many 're-expr)) 120 | (.opt ex (.let (otherwise (.do (.is :or) 're-parser)) 121 | (.ret `((:or ,ex ,otherwise))))))) 122 | 123 | ;;; ---------------------------------------------------- 124 | 125 | (define-parser re-expr 126 | "A single character, set, or loop of expressions." 127 | (.let (e (.or 're-boundary 128 | 're-bounds 129 | 're-char 130 | 're-set 131 | 're-group)) 132 | 133 | ;; check to see if there is a following iteration token 134 | (.opt e (.or (.do (.is :*) (.ret (list :* e))) 135 | (.do (.is :-) (.ret (list :- e))) 136 | (.do (.is :+) (.ret (list :+ e))) 137 | (.do (.is :?) (.ret (list :? e))))))) 138 | 139 | ;;; ---------------------------------------------------- 140 | 141 | (define-parser re-boundary 142 | "The start or end of a string." 143 | (.or (.do (.is :start) (.ret (list :start))) 144 | (.do (.is :end) (.ret (list :end))))) 145 | 146 | ;;; ---------------------------------------------------- 147 | 148 | (define-parser re-bounds 149 | "Lua-style %b bounds." 150 | (.let (bs (.is :bounds)) 151 | (.ret (cons :bounds bs)))) 152 | 153 | ;;; ---------------------------------------------------- 154 | 155 | (define-parser re-char 156 | "Match any character, exact character, or predicate function." 157 | (.or (.do (.is :any) (.ret '(:any))) 158 | 159 | ;; predicates and exact characters 160 | (.let (p (.is :is)) (.ret (list :is p))) 161 | (.let (c (.is :char)) (.ret (list :char c))))) 162 | 163 | ;;; ---------------------------------------------------- 164 | 165 | (define-parser re-set 166 | "Match from a set of characters." 167 | (.let* ((exclusive (.is :set)) 168 | (predicates 're-set-chars)) 169 | (flet ((any (c) 170 | (some #'(lambda (p) (funcall p c)) predicates))) 171 | (.ret (list (if exclusive :is-not :is) #'any))))) 172 | 173 | ;;; ---------------------------------------------------- 174 | 175 | (define-parser re-set-chars 176 | "Characters, character ranges, and named character sets." 177 | (.let (ps (.many1 (.or (.is :is) 178 | 179 | ;; exact character 180 | (.let (a 're-set-char) 181 | 182 | ;; range of characters? 183 | (.or (.let (z (.do (.is :-) 're-set-char)) 184 | (.ret #'(lambda (c) (char<= a c z)))) 185 | (.ret #'(lambda (c) (char= c a)))))))) 186 | 187 | ;; match the end of the set and return the predicates 188 | (.do (.is :end-set) (.ret ps)))) 189 | 190 | ;;; ---------------------------------------------------- 191 | 192 | (define-parser re-set-char 193 | "Valid characters in a character set." 194 | (.or (.is :char) 195 | 196 | ;; special characters are aren't special in a set 197 | (.do (.is :any) (.ret #\.)) 198 | (.do (.is :or) (.ret #\|)) 199 | (.do (.is :*) (.ret #\*)) 200 | (.do (.is :-) (.ret #\-)) 201 | (.do (.is :+) (.ret #\+)) 202 | (.do (.is :?) (.ret #\?)) 203 | (.do (.is :group) (.ret #\()) 204 | (.do (.is :end-group) (.ret #\))))) 205 | 206 | ;;; ---------------------------------------------------- 207 | 208 | (define-parser re-group 209 | "Match an optionally captured group." 210 | (.let* ((ignorep (.is :group)) 211 | (xs 're-parser)) 212 | (.do (.is :end-group) 213 | (.ret (list (if ignorep :ignore :capture) xs))))) 214 | 215 | ;;; ---------------------------------------------------- 216 | 217 | (defun is-not (pred) 218 | "Create a predicate that tests the inverse." 219 | #'(lambda (c) (not (funcall pred c)))) 220 | 221 | ;;; ---------------------------------------------------- 222 | 223 | (defun escape (stream) 224 | "Return the test and predicate for an escaped character." 225 | (let ((c (read-char stream))) 226 | (case c 227 | 228 | ;; user-defined predicate 229 | (#\: (let ((sym (with-output-to-string (s) 230 | (do ((c (read-char stream) 231 | (read-char stream))) 232 | ((eql c #\:)) 233 | (write-char c s))))) 234 | (values :is (read-from-string sym)))) 235 | 236 | ;; boundary test 237 | (#\b (let ((b1 (read-char stream)) 238 | (b2 (read-char stream))) 239 | (values :bounds (list b1 b2)))) 240 | 241 | ;; named inclusive sets 242 | (#\s (values :is #'space-p)) 243 | (#\t (values :is #'tab-p)) 244 | (#\r (values :is #'return-p)) 245 | (#\n (values :is #'newline-p)) 246 | (#\a (values :is #'alpha-char-p)) 247 | (#\l (values :is #'lower-case-p)) 248 | (#\u (values :is #'upper-case-p)) 249 | (#\d (values :is #'digit-char-p)) 250 | (#\w (values :is #'word-char-p)) 251 | (#\x (values :is #'hex-char-p)) 252 | (#\p (values :is #'punctuation-p)) 253 | 254 | ;; named exclusive sets 255 | (#\S (values :is (is-not #'space-p))) 256 | (#\T (values :is (is-not #'tab-p))) 257 | (#\R (values :is (is-not #'return-p))) 258 | (#\N (values :is (is-not #'newline-p))) 259 | (#\A (values :is (is-not #'alpha-char-p))) 260 | (#\L (values :is (is-not #'lower-case-p))) 261 | (#\U (values :is (is-not #'upper-case-p))) 262 | (#\D (values :is (is-not #'digit-char-p))) 263 | (#\W (values :is (is-not #'word-char-p))) 264 | (#\X (values :is (is-not #'hex-char-p))) 265 | (#\P (values :is (is-not #'punctuation-p))) 266 | 267 | ;; just a character 268 | (otherwise (values :char c))))) 269 | 270 | ;;; ---------------------------------------------------- 271 | 272 | (defun parse-re (pattern) 273 | "Parse a regular expression pattern." 274 | (with-input-from-string (stream pattern) 275 | (flet ((token-reader () 276 | (let ((c (read-char stream nil nil))) 277 | (when c 278 | (case c 279 | 280 | ;; any character 281 | (#\. :any) 282 | 283 | ;; escaped characters 284 | (#\% (escape stream)) 285 | 286 | ;; iterators 287 | (#\* :*) 288 | (#\+ :+) 289 | (#\? :?) 290 | 291 | ;; lazy iterator, or end of set 292 | (#\- (if (eql (peek-char nil stream nil nil) #\]) 293 | (values :char #\-) 294 | (values :-))) 295 | 296 | ;; conditional 297 | (#\| :or) 298 | 299 | ;; groups 300 | (#\( (if (eql (peek-char nil stream nil nil) #\?) 301 | (values :group (read-char stream)) 302 | (values :group ()))) 303 | 304 | ;; sets 305 | (#\[ (if (eql (peek-char nil stream nil nil) #\^) 306 | (values :set (read-char stream)) 307 | (values :set ()))) 308 | 309 | ;; group and set terminals 310 | (#\) :end-group) 311 | (#\] :end-set) 312 | 313 | ;; start/end boundary 314 | (#\^ :start) 315 | (#\$ :end) 316 | 317 | ;; default to just an exact character match 318 | (otherwise (values :char c))))))) 319 | 320 | ;; parse all the tokens in the regular expression 321 | (parse 're-parser #'token-reader)))) 322 | 323 | ;;; ---------------------------------------------------- 324 | 325 | (defun compile-re (pattern) 326 | "Create a regular expression from a pattern string." 327 | (let ((re (make-array 8 :adjustable t :fill-pointer 0)) 328 | (bs (make-array 4 :adjustable t :fill-pointer 0))) 329 | (labels ((compile-op (op &rest args) 330 | (vector-push-extend (cons op args) re)) 331 | 332 | ;; branch labels 333 | (make-label () 334 | (vector-push-extend nil bs)) 335 | (resolve-label (label) 336 | (setf (aref bs label) (fill-pointer re))) 337 | 338 | ;; compile a list of tokens recursively 339 | (compile-tokens (xs) 340 | (loop 341 | for (op x y) in xs 342 | 343 | ;; compile each token 344 | do (case op 345 | 346 | ;; if not x then y 347 | (:or (let ((this (make-label)) 348 | (else (make-label)) 349 | (done (make-label))) 350 | (compile-op :split this else) 351 | (resolve-label this) 352 | (compile-tokens x) 353 | (compile-op :jump done) 354 | (resolve-label else) 355 | (compile-tokens y) 356 | (resolve-label done))) 357 | 358 | ;; zero or more (greedy) 359 | (:* (let ((try (make-label)) 360 | (done (make-label)) 361 | (again (make-label))) 362 | (resolve-label again) 363 | (compile-op :split try done) 364 | (resolve-label try) 365 | (compile-tokens (list x)) 366 | (compile-op :jump again) 367 | (resolve-label done))) 368 | 369 | ;; zero or more (lazy) 370 | (:- (let ((try (make-label)) 371 | (done (make-label)) 372 | (again (make-label))) 373 | (resolve-label again) 374 | (compile-op :split done try) 375 | (resolve-label try) 376 | (compile-tokens (list x)) 377 | (compile-op :jump again) 378 | (resolve-label done))) 379 | 380 | ;; one or more matches 381 | (:+ (let ((rep (make-label)) 382 | (done (make-label))) 383 | (resolve-label rep) 384 | (compile-tokens (list x)) 385 | (compile-op :split rep done) 386 | (resolve-label done))) 387 | 388 | ;; maybe match 389 | (:? (let ((this (make-label)) 390 | (else (make-label))) 391 | (compile-op :split this else) 392 | (resolve-label this) 393 | (compile-tokens (list x)) 394 | (resolve-label else))) 395 | 396 | ;; Lua boundary 397 | (:bounds (let ((try (make-label)) 398 | (done (make-label)) 399 | (again (make-label))) 400 | (compile-op :char x) 401 | (resolve-label again) 402 | (compile-op :split done try) 403 | (resolve-label try) 404 | (compile-op :any) 405 | (compile-op :jump again) 406 | (resolve-label done) 407 | (compile-op :char y))) 408 | 409 | ;; ignore groups just match tokens 410 | (:ignore (compile-tokens x)) 411 | 412 | ;; capture groups push, match, and pop 413 | (:capture (progn 414 | (compile-op :push) 415 | (compile-tokens x) 416 | (compile-op :pop))) 417 | 418 | ;; all other tokens compile to themselves 419 | (otherwise (compile-op op x)))))) 420 | 421 | ;; compile the parsed tokens 422 | (compile-tokens (parse-re pattern)) 423 | 424 | ;; resolve all labels in split and jump instuctions 425 | (dotimes (i (length re)) 426 | (symbol-macrolet ((b1 (second (aref re i))) 427 | (b2 (third (aref re i)))) 428 | (case (first (aref re i)) 429 | (:jump (setf b1 (aref bs b1))) 430 | (:split (setf b1 (aref bs b1) 431 | b2 (aref bs b2)))))) 432 | 433 | ;; finally, append the match instruction 434 | (compile-op :match) 435 | 436 | ;; return the regular expression 437 | (make-instance 're :pattern pattern :expression re)))) 438 | 439 | ;;; ---------------------------------------------------- 440 | 441 | (defstruct (re-thread (:constructor make-re-thread (pc sp groups stack))) 442 | pc ; program counter in compiled re instruction vector 443 | sp ; string pointer 444 | groups ; pushed capture groups (subseq) 445 | stack) ; pushed capture groups (sp) 446 | 447 | ;;; ---------------------------------------------------- 448 | 449 | (defun match (s thread start offset) 450 | "Create a re-match from a thread that matched." 451 | (with-slots (sp groups) 452 | thread 453 | (let ((cs (let (cs) 454 | (do ((g (pop groups) 455 | (pop groups))) 456 | ((null g) cs) 457 | (push (subseq s (first g) (second g)) cs))))) 458 | (make-instance 're-match 459 | :start-pos (+ start offset) 460 | :end-pos sp 461 | :groups cs 462 | :match (subseq s (+ start offset) sp))))) 463 | 464 | ;;; ---------------------------------------------------- 465 | 466 | (defun run (re s start end &aux (pc 0) (offset 0)) 467 | "Execute a regular expression program." 468 | (loop 469 | with threads = (list (make-re-thread pc (+ start offset) nil nil)) 470 | 471 | ;; get the next thread off the list 472 | for thread = (pop threads) 473 | 474 | ;; once all the threads have been exhausted, the match fails 475 | until (null thread) 476 | 477 | ;; evaluate the next thread, checking for a match 478 | do (with-slots (pc sp groups stack) 479 | thread 480 | (loop 481 | for (op x y) = (aref re pc) 482 | 483 | ;; get the current character 484 | for c = (when (< sp end) (char s sp)) 485 | 486 | ;; advance the program counter 487 | do (incf pc) 488 | 489 | ;; loop until the thread fails an instruction 490 | while (case op 491 | 492 | ;; start and end boundaries 493 | (:start (= sp 0)) 494 | (:end (= sp (length s))) 495 | 496 | ;; match any character 497 | (:any (when (< sp end) 498 | (incf sp))) 499 | 500 | ;; match an exact character 501 | (:char (when (eql c x) 502 | (incf sp))) 503 | 504 | ;; match a predicate function 505 | (:is (when (and c (funcall x c)) 506 | (incf sp))) 507 | 508 | ;; fail to match a predicate function 509 | (:is-not (when (and c (not (funcall x c))) 510 | (incf sp))) 511 | 512 | ;; push a capture group 513 | (:push (let ((capture (list sp))) 514 | (push capture stack) 515 | (push capture groups))) 516 | 517 | ;; pop a capture group 518 | (:pop (rplacd (pop stack) (list sp))) 519 | 520 | ;; jump to an instruction 521 | (:jump (setf pc x)) 522 | 523 | ;; fork a thread 524 | (:split (let ((b (make-re-thread y sp groups stack))) 525 | (push b threads) 526 | (setf pc x))) 527 | 528 | ;; successfully matched, create and return 529 | (:match (return-from run 530 | (match s thread start offset)))))))) 531 | 532 | ;;; ---------------------------------------------------- 533 | 534 | (defmacro with-re ((re pattern) &body body) 535 | "Compile pattern if it's not a RE object and execute body." 536 | (let ((p (gensym))) 537 | `(let ((,p ,pattern)) 538 | (let ((,re (if (subtypep (type-of ,p) 're) 539 | ,p 540 | (compile-re ,p)))) 541 | (progn ,@body))))) 542 | 543 | ;;; ---------------------------------------------------- 544 | 545 | (defmacro with-re-match ((match match-expr &key no-match) &body body) 546 | "Intern match symbols to execute a body." 547 | (let (($$ (intern "$$" *package*)) 548 | ($1 (intern "$1" *package*)) 549 | ($2 (intern "$2" *package*)) 550 | ($3 (intern "$3" *package*)) 551 | ($4 (intern "$4" *package*)) 552 | ($5 (intern "$5" *package*)) 553 | ($6 (intern "$6" *package*)) 554 | ($7 (intern "$7" *package*)) 555 | ($8 (intern "$8" *package*)) 556 | ($9 (intern "$9" *package*)) 557 | ($_ (intern "$_" *package*)) 558 | ($* (intern "$*" *package*))) 559 | `(let ((,match ,match-expr)) 560 | (if (null ,match) 561 | ,no-match 562 | (let ((,$$ (match-string ,match)) 563 | (,$* (match-groups ,match))) 564 | (declare (ignorable ,$$ ,$*)) 565 | (symbol-macrolet ((,$1 (first ,$*)) 566 | (,$2 (second ,$*)) 567 | (,$3 (third ,$*)) 568 | (,$4 (fourth ,$*)) 569 | (,$5 (fifth ,$*)) 570 | (,$6 (sixth ,$*)) 571 | (,$7 (seventh ,$*)) 572 | (,$8 (eighth ,$*)) 573 | (,$9 (ninth ,$*)) 574 | (,$_ (nthcdr 9 ,$*))) 575 | (progn ,@body))))))) 576 | 577 | ;;; ---------------------------------------------------- 578 | 579 | (defun match-re (pattern s &key exact (start 0) (end (length s))) 580 | "Test a pattern re against a string." 581 | (with-re (re pattern) 582 | (let ((m (run (re-expression re) s start end))) 583 | (if (not exact) 584 | m 585 | (when m 586 | (and (= (match-pos-end m) end) m)))))) 587 | 588 | ;;; ---------------------------------------------------- 589 | 590 | (defun find-re (pattern s &key all (start 0) (end (length s))) 591 | "Find a regexp pattern match somewhere in a string." 592 | (with-re (re pattern) 593 | (let ((i start)) 594 | (flet ((next-match () 595 | (loop 596 | until (>= i end) 597 | 598 | ;; is there a match at this offset? 599 | for m = (run (re-expression re) s i end) 600 | 601 | ;; return the found match or advance the offset 602 | do (if m 603 | (return (prog1 m 604 | (setf i (match-pos-end m)))) 605 | (incf i))))) 606 | (if all 607 | (loop for m = (next-match) while m collect m) 608 | (next-match)))))) 609 | 610 | ;;; ---------------------------------------------------- 611 | 612 | (defun split-re (pattern s &key all coalesce-seps (start 0) (end (length s))) 613 | "Split a string into one or more strings by pattern match." 614 | (with-re (re pattern) 615 | (let* ((seqs (list nil)) (tail seqs)) 616 | (do ((m (find-re re s :start start :end end) 617 | (find-re re s :start start :end end))) 618 | ((null m)) 619 | 620 | ;; only split if not all, coalescing, or there's something there 621 | (when (or (not coalesce-seps) (> (match-pos-start m) start)) 622 | (let ((split (subseq s start (match-pos-start m)))) 623 | (setf tail (cdr (rplacd tail (list split)))))) 624 | 625 | ;; update the search position after the split 626 | (setf start (match-pos-end m)) 627 | 628 | ;; stop after a single match? 629 | (unless (or all (eq tail seqs)) 630 | (return))) 631 | 632 | ;; add everything that's left 633 | (when (< start end) 634 | (rplacd tail (list (subseq s start end)))) 635 | 636 | ;; return the list or two values 637 | (if all 638 | (rest seqs) 639 | (values-list (rest seqs)))))) 640 | 641 | ;;; ---------------------------------------------------- 642 | 643 | (defun replace-re (pattern with s &key all (start 0) (end (length s))) 644 | "Replace patterns found within a string with a new value." 645 | (with-re (re pattern) 646 | (with-output-to-string (rep nil :element-type 'character) 647 | (do ((m (find-re re s :start start :end end) 648 | (find-re re s :start start :end end))) 649 | ((null m)) 650 | 651 | ;; write out everything up to the match 652 | (when (< start (match-pos-start m)) 653 | (write-string s rep :start start :end (match-pos-start m))) 654 | 655 | ;; replace the match with a value 656 | (princ (if (functionp with) (funcall with m) with) rep) 657 | 658 | ;; update the search position after the match 659 | (setf start (match-pos-end m)) 660 | 661 | ;; stop after a single replace? 662 | (unless all (return))) 663 | 664 | ;; add everything that's left 665 | (when (< start end) 666 | (write-string s rep :start start :end end))))) 667 | 668 | ;;; ---------------------------------------------------- 669 | 670 | (defmethod make-load-form ((re re) &optional env) 671 | "Tell the system how to save and load a regular expression to a FASL." 672 | (declare (ignore env)) 673 | `(compile-re ,(re-pattern re))) 674 | 675 | ;;; ---------------------------------------------------- 676 | 677 | (eval-when (:compile-toplevel :load-toplevel :execute) 678 | (flet ((dispatch-re (s c n) 679 | (declare (ignorable c n)) 680 | (let ((delim (read-char s))) 681 | (compile-re (with-output-to-string (re) 682 | (do ((c (read-char s t nil t) 683 | (read-char s t nil t))) 684 | ((char= c delim)) 685 | (if (char= c #\\) 686 | (princ (read-char s t nil t) re) 687 | (princ c re)))))))) 688 | (set-dispatch-macro-character #\# #\r #'dispatch-re))) 689 | --------------------------------------------------------------------------------