├── .gitignore ├── rexxparse-test.asd ├── rexxparse.asd ├── LICENSE ├── package.lisp ├── test.lisp ├── README.md └── rexxparse.lisp /.gitignore: -------------------------------------------------------------------------------- 1 | *.FASL 2 | *.fasl 3 | *.lisp-temp 4 | *.dfsl 5 | *.pfsl 6 | *.d64fsl 7 | *.p64fsl 8 | *.lx64fsl 9 | *.lx32fsl 10 | *.dx64fsl 11 | *.dx32fsl 12 | *.fx64fsl 13 | *.fx32fsl 14 | *.sx64fsl 15 | *.sx32fsl 16 | *.wx64fsl 17 | *.wx32fsl 18 | -------------------------------------------------------------------------------- /rexxparse-test.asd: -------------------------------------------------------------------------------- 1 | (in-package :cl-user) 2 | 3 | (defpackage :rexxparse-test-asd 4 | (:use :cl :asdf)) 5 | 6 | (in-package :rexxparse-test-asd) 7 | 8 | (defsystem :rexxparse-test 9 | :version "0.1.1" 10 | :license "MIT" 11 | :author "Dave Tenny" 12 | :description "Tests for the :rexxparse package." 13 | :depends-on (:rexxparse :fiveam) 14 | :components ((:file "test"))) 15 | -------------------------------------------------------------------------------- /rexxparse.asd: -------------------------------------------------------------------------------- 1 | (in-package :cl-user) 2 | 3 | (defpackage :rexxparse-asd 4 | (:use :cl :asdf)) 5 | 6 | (in-package :rexxparse-asd) 7 | 8 | (defsystem :rexxparse 9 | :version "0.1.0" 10 | :license "MIT" 11 | :author "Dave Tenny" 12 | :description "A trivial parsing tool inspired by the REXX PARSE construct." 13 | ;;:bug-tracker "https://github.com/dtenny/rexxparse/issues" 14 | ;;:source-control (:git "https://github.com/dtenny/rexxparse") 15 | :depends-on (:alexandria :parse-float) 16 | :serial t 17 | :components ((:file "package") 18 | (:file "rexxparse"))) 19 | -------------------------------------------------------------------------------- /LICENSE: -------------------------------------------------------------------------------- 1 | MIT License 2 | 3 | Copyright (c) 2024 Jeffrey D. 
Tenny 4 | 5 | Permission is hereby granted, free of charge, to any person obtaining a copy 6 | of this software and associated documentation files (the "Software"), to deal 7 | in the Software without restriction, including without limitation the rights 8 | to use, copy, modify, merge, publish, distribute, sublicense, and/or sell 9 | copies of the Software, and to permit persons to whom the Software is 10 | furnished to do so, subject to the following conditions: 11 | 12 | The above copyright notice and this permission notice shall be included in all 13 | copies or substantial portions of the Software. 14 | 15 | THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR 16 | IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, 17 | FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE 18 | AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER 19 | LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, 20 | OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE 21 | SOFTWARE. 22 | -------------------------------------------------------------------------------- /package.lisp: -------------------------------------------------------------------------------- 1 | (in-package :cl-user) 2 | 3 | (defpackage :rexxparse 4 | (:use :cl) 5 | 6 | (:export 7 | 8 | ;; The only thing you need most of the time. 9 | #:parse 10 | ;; To change the default values of unmatched PARSE variables. 11 | ;; Defaults to the empty string. 12 | #:*unmatched-binding-value* 13 | ;; To specify what kind of scanner you'd like to use for pattern literals 14 | ;; if REXXPARSE doesn't give you the behavior you want. 15 | #:*pattern->scanner* 16 | ;; In case you want to call this from your custom scanner for pattern 17 | ;; literals you don't support from your custom scanner. Note that defining 18 | ;; new pattern->scanner methods is not supported. 
19 | #:pattern->scanner 20 | ;; If you want to specify an alternative extraction behavior for matched 21 | ;; vars. 22 | #:*extractor* 23 | ;; To change the default "space only" behavior of LTRIM, RTRIM, and TRIM transforms. 24 | #:*trim-character-bag* 25 | ;; If you want to specify alternative transforms for matched vars. 26 | #:*options*) 27 | 28 | (:documentation "Provides a PARSE macro emulating the REXX programming language 29 | namesake with lexical bindings and extensible 'template' (REXX nomenclature) 30 | capabilities.")) 31 | 32 | -------------------------------------------------------------------------------- /test.lisp: -------------------------------------------------------------------------------- 1 | (in-package :cl-user) 2 | 3 | (defpackage :rexxparse-test 4 | (:use :cl :rexxparse :fiveam) 5 | (:export #:run-tests) 6 | (:documentation "Tests for the :rexxparse package.")) 7 | 8 | (in-package :rexxparse-test) 9 | 10 | (def-suite test-suite :description ":rexxparse tests") 11 | (in-suite test-suite) 12 | 13 | ;; Spaces in these text values are significant. 14 | (defvar *text* 15 | "2024/02/23 17:35:42.022 - unable to locate '/usr/local/examples/' directory") 16 | (defvar *text2* "This is the text which, I think, is scanned.") 17 | 18 | (test string-patterns-and-basics 19 | "Test PARSE with string patterns and basic REXX PARSE edge cases." 20 | 21 | ;; An empty string is never found, it always matches the end of the source string. 
22 | (is (equalp '(" abc ") (parse " abc " (a "")))) 23 | (is (equalp '(" abc " "") (parse " abc " (a "" b)))) 24 | 25 | (is (null (parse "abc" ()))) 26 | (is (equalp '("abc" "") (parse "abc" (a b)))) 27 | (let ((rexxparse:*UNMATCHED-BINDING-VALUE* nil)) 28 | (is (equalp '("abc" "" nil) (parse "abc" (a b c))))) 29 | (is (equalp '("a" "b c") (parse "a b c" (a b)))) 30 | (is (equalp '("a" "b") (parse "a b c" (a b " ")))) 31 | (is (equalp '(" a b c ") (parse " a b c " (a)))) 32 | (is (equalp '("a" "b c") (parse " a b c" (a b)))) 33 | (is (equalp '("a" "b c") (parse " a b c" (" " a b)))) 34 | (is (equalp '() (parse "a b c" (_)))) 35 | (is (equalp '("a" "c") (parse "a b c" (a _ c)))) 36 | (is (equalp '("a" "b c " " g") (parse "a b c x g" (a b "x" g)))) 37 | 38 | ;; with one var, always matches whole string 39 | (is (equalp '(" ") (parse " " (stuff)))) 40 | ;; with two vars, both bound to empty strings due to word splitting? 41 | (is (equalp '("" "") (parse " " (a b)))) 42 | (is (equalp '("" "" "") (parse " " (a " " b c)))) 43 | ;; This is pretty central to understanding splits, because the 44 | ;; un-trimmed last binding principle applies _to each split_ 45 | ;; Thus there's a split at ".", and "Q" is not trimmed. However there's 46 | ;; still the word-splitting blank elimination after "John". 47 | (is (equalp '("John" " Q" " Public") 48 | (parse " John Q. Public" (fn init "." ln)))) 49 | ;; I believe that for the next example, the blank is omitted from the 'is' 50 | ;; variable because of the leading space elimination because one of the 51 | ;; pattern goalposts is the implicit word-splitting pattern. 
52 | (is (equalp '("Now" "is" "the time") 53 | (parse "Now is the time" (now " " is the-time)))) 54 | 55 | ;; OOREXX manual 42.2.2 Parsing strings into words 56 | ;; The word splitting is not the same as searching for " " in the pattern 57 | ;; which is why this test has this expected value 58 | ;; In this case the leading space for ' I think' on w2, is preserved 59 | ;; because the implicit non-word-splitting binding context applies. 60 | ;; Still guessing though. 61 | (is (equalp '("This is the text which" " I think" " is scanned.") 62 | (parse *text2* (w1 "," w2 "," w3)))) 63 | (is (equalp '("This is the text which" " I think" " is scanned." "") 64 | (parse *text2* (w1 "," w2 "," w3 "," w4)))) 65 | (is (equalp '("This" "is" "the" "text which") (parse *text2* (w1 w2 w3 w4 ",")))) 66 | (is (equalp '("This" "is" "" "the text which") 67 | (parse *text2* (w1 " " w2 " " w3 " " w4 ",")))) 68 | 69 | ;; Unmatched content, we specify a colon before the millis when a period 70 | ;; was in the input. 71 | ;; When a match for a pattern cannot be found, it matches the end of the string. 72 | (is (equalp '("2024" "02" "23" "17" "35" 73 | "42.022 - unable to locate '/usr/local/examples/' directory" 74 | "" "") 75 | (parse *text* (year "/" month "/" day hours ":" minutes ":" seconds 76 | ":" millis "-" rest)))) 77 | ;; Note the blank following 022. 78 | (is (equalp '("2024" "02" "23" "17" "35" "42" "022 " 79 | " unable to locate '/usr/local/examples/' directory") 80 | (parse *text* (year "/" month "/" day hours ":" minutes ":" seconds 81 | "." millis "-" rest)))) 82 | 83 | ;; Sexp source that is not a string but produces a string 84 | (flet ((foo () "bar")) 85 | (is (equalp '("b" "r") (parse (foo) (b "a" r))))) 86 | 87 | ;; Source that is not a string. 88 | (signals type-error (parse #\c (a))) 89 | 90 | ;; Inappropriate use of keywords as variables, which are symbolp. 91 | ;; Trying to test this condition may be more trouble than it's worth... 
92 | (signals error (macroexpand '(parse "a b" (:a)))) 93 | 94 | ;; Non-default body and return value 95 | (is (eq 'xyz 96 | (let (a b c) 97 | (parse "a b c" (x y z) 98 | ;; just a couple of statements to mess around 99 | (setq a x b y c z) 100 | (is (equalp '("a" "b" "c") (list a b c))) 101 | 'xyz)))) 102 | ) 103 | 104 | (test standard-options 105 | ;; Equal, at least on SBCL, respects case, while EQUALP does not. 106 | ;; And we want these tests to respect case 107 | (is (not (equal "a" "A"))) 108 | (is (not (equal '("a") '("A")))) 109 | 110 | ;; Test conditions raised during macro expansion 111 | (signals error (macroexpand '(parse :upper :lower "A" (w)))) 112 | 113 | (is (equal '("a b " " d") 114 | (parse :lower "A b C d" (w "c" r)))) 115 | (is (equal '("a b c d" "") 116 | (parse :lower "A b C d" (w "C" r)))) 117 | (is (equal '("A B " " D") 118 | (parse :upper "A b C d" (w "C" r)))) 119 | (is (equal '("A B C D" "") 120 | (parse :upper "A b C d" (w "c" r)))) 121 | (is (equal '("A b " " d") 122 | (parse :caseless "A b C d" (w "c" r)))) 123 | ) 124 | 125 | (test positions 126 | ;; Absolute positions 127 | 128 | ;; 1 2 3 129 | ;; 1234567890123456789012345678901234 130 | (is (equalp '("Brimfield " "Massachusetts " "10101") 131 | (parse "Brimfield Massachusetts 10101" (city 14 state 30 zip)))) 132 | 133 | (is (equalp '(" ab" " a") (parse " abc " (ab "c" 1 c 3)))) 134 | 135 | ;; The position 1 is similar to an empty string, it is never "found" 136 | ;; (in the sense of a pattern being found). 
137 | (is (equalp '("abc" "abc") (parse "abc" (a 1 bc)))) 138 | (is (equalp '(" abc " " abc ") (parse " abc " (a 1 bc)))) 139 | (is (equalp '("a" "bc") (parse "abc" (a 2 bc)))) 140 | (let ((x 1)) 141 | (is (equalp '("abc" "abc") (parse "abc" (a (= x) bc)))) 142 | (is (equalp '(" abc " " abc ") (parse " abc " (a (= x) bc)))) 143 | (is (equalp '("a" "bc") (parse "abc" (a (= (+ x 1)) bc))))) 144 | 145 | (is (equalp '("st" "a" "r" "s") 146 | (parse "astronomers" (2 st 4 1 a 2 4 r 5 11 s)))) 147 | 148 | ;; Invalid/edge-case absolute positions 149 | ;; Basically too far to the left treated like "1". 150 | ;; Too far to the right treated as the whole (or remaining) string matches. 151 | (is (equalp '(" abc ") (parse " abc " (-0 x -0)))) 152 | (is (equalp '(" abc ") (parse " abc " (-1 x -1)))) 153 | (is (equalp '(" abc ") (parse " abc " (0 x 0)))) 154 | (is (equalp '(" abc ") (parse " abc " (1 x 1)))) 155 | (is (equalp '(" abc") (parse " abc " (x 5)))) 156 | (is (equalp '(" abc" " ") (parse " abc " (x 5 x2)))) 157 | (is (equalp '(" abc ") (parse " abc " (x 6)))) 158 | (is (equalp '(" abc ") (parse " abc " (x 7)))) 159 | (is (equalp '(" abc " "") (parse " abc " (x 7 x2)))) 160 | 161 | ;; Relative positions 162 | (is (equalp '(" a" "abc ") (parse " abc " (x "b" (- 1) y)))) 163 | (is (equalp '(" a" " abc ") (parse " abc " (w1 "b" (- 5) w2)))); no such "match" for offset -5 164 | 165 | ;; template with "string" var is a special case, 166 | ;; var _includes_ string pattern which would normally be skipped 167 | ;; (after some very similar non-relative positional matches for context/validation) 168 | (is (equalp '(" c ") (parse " a b c " ("b" b)))) ; "b" not included, no relative positional 169 | ;; '5' is effectively equal to the source scanning start position when its pseudo-pattern 170 | ;; is matched, which means an empty string match, which is a break/tail-position behavior. 
171 | (is (equalp '(" c ") (parse " a b c " ("b" b 5)))) ; from end of "b" to 5 is empty, full break 172 | (is (equalp '(" c") (parse " a b c " ("b" b 7)))) ; space past "b" to 7, 2 chars 173 | (is (equalp '("b") (parse " a b c " ("b" b (+ 1))))) ; from position of b to position+1 174 | (is (equalp '("b c ") (parse " a b c " ("b" b (- 1))))) ; from position of b to end of string 175 | (is (equalp '(" a" "bc " "abc ") (parse " abc " (x "b" y (- 1) z)))) 176 | 177 | ;; This is the relative version of the absolute 'stars' position match above 178 | (is (equalp '("st" "a" "r" "s") 179 | (parse "astronomers" (2 st (+ 2) (- 3) a (+ 1) (+ 2) r (+ 1) (+ 6) s)))) 180 | 181 | (is (equalp '("RE" "X" "X") (parse "REstructured eXtended eXecutor" 182 | (v1 3 _ "X" v2 (+ 1) _ "X" v3 (+ 1) _)))) 183 | 184 | ;; various bounds cases 185 | (is (equalp '("a" "b" "c") (parse "abc" (0 v1 2 v2 3 v3 4)))) 186 | (is (equalp '("a" "b" "c") (parse "abc" ((- 2) v1 (+ 1) v2 (+ 1) v3 (+ 3))))) 187 | (is (equalp '("abc") (parse "abc" ((- 2) v1)))) 188 | (is (equalp '("") (parse "abc" ((+ 12) v1)))) 189 | (is (equalp '("") (parse "abc" (12 v1)))) 190 | (is (equalp '("abc") (parse "abc" (v1 (- 2))))) 191 | (is (equalp '("abc") (parse "abc" (v1 (+ 12))))) 192 | (is (equalp '("abc") (parse "abc" (v1 12)))) 193 | (is (equalp '("ab") (parse "abc" ("a" (+ 0) b 3)))) 194 | (is (equalp '("ab") (parse "abc" ("a" (- 0) b 3)))) 195 | ) 196 | 197 | (test length-positions 198 | ;; Testing several things here. 199 | ;; 1. Length positions vs relative positions, which are mostly the same but not always, 200 | ;; particularly for zero. 201 | ;; 2. References to previous bindings later in the template 202 | ;; 3. Implicit conversion of strings to integers for positional patterns. 203 | 204 | ;; Parsing with relative patterns only. 'middle' being the tricky bit. 
(is (equalp '("Mark" "05Twain" "Twain" "05") 206 | (parse "04Mark0005Twain" 207 | (len (+ 2) first (+ len) len (+ 2) middle (+ len) len (+ 2) last (+ len)) 208 | (list first middle last len)))) 209 | 210 | ;; Parsing with positional length patterns. 211 | (is (equalp '("Mark" "" "Twain" "05") 212 | (parse "04Mark0005Twain" 213 | (len (+ 2) first (> len) len (+ 2) middle (> len) len (+ 2) last (> len)) 214 | (list first middle last len)))) 215 | 216 | ;; Parsing with length patterns 217 | (is (equalp '("5" "5.6789") 218 | (parse "12345.6789" ("." digit (< 1) rest)))) 219 | ;; Parsing with relative patterns 220 | (is (equalp '("5" ".6789") 221 | (parse "12345.6789" ("." (- 1) digit (+ 1) rest)))) 222 | 223 | ;; For no particular reason other than it messed with my head at one point 224 | (is (equalp '("a b" "a b") (parse "a b" (1 w1 (+ 0) w2)))) 225 | (is (equalp '("" "a b") (parse "a b" (1 w1 (> 0) w2)))) 226 | 227 | ;; under/over flows 228 | (is (equalp '("" "12345.6789") 229 | (parse "12345.6789" (1 digit (< 1) rest)))) 230 | (is (equalp '("12345" "12345.6789") 231 | (parse "12345.6789" ("." digit (< 5) rest)))) 232 | (is (equalp '("12345" "12345.6789") 233 | (parse "12345.6789" ("." digit (< 6) rest)))) 234 | (is (equalp '(".67" "89") 235 | (parse "12345.6789" ("." digit (> 3) rest)))) 236 | (is (equalp '(".678" "9") 237 | (parse "12345.6789" ("." digit (> 4) rest)))) 238 | (is (equalp '(".6789" "") 239 | (parse "12345.6789" ("." digit (> 5) rest)))) 240 | (is (equalp '(".6789" "") 241 | (parse "12345.6789" ("." 
digit (> 6) rest)))) 242 | ) 243 | 244 | (test var-reuse 245 | (is (equalp '("b") (parse "abc" (w 2 w 3)))) 246 | (is (equalp '("") (parse "abc" (w w)))) 247 | (is (equalp '("") (parse "abc def" (w w w)))) 248 | ) 249 | 250 | (test variable-string-patterns 251 | (is (equalp '("the quick " " fox") 252 | (let ((x "brown")) 253 | (parse "the quick brown fox" (start ($ x) end))))) 254 | ) 255 | 256 | ;; As I intend to use this elsewhere (in some flavor for a clojure-like 257 | ;; `with-redefs`), I've made a few notes for future documentation. 258 | 259 | (defmacro with-redef ((f-sym function) &body body) 260 | "Redefine the global function definition of symbol F-SYM 261 | with the function FUNCTION with a lexical scope wrapping BODY. 262 | 263 | FUNCTION must be an object of type FUNCTION, not some other function designator, 264 | compatible with (setf (fdefinition f-sym) ). 265 | 266 | Execute BODY with the rebound function, restoring the original function (or lack thereof) 267 | on exit. F-SYM need not be FBOUNDP to start with. 268 | 269 | Note that this macro may have unsafe effects in a multi-threaded use of the 270 | symbol unless the caller arranges additional critical-section logic. 271 | Also note compiler transformations or inlining may also result in surprises 272 | when it comes to redefining a function, as well as use of compiled symbol-function 273 | references that have previously nabbed the function and which won't see changes made 274 | after the fact. 275 | 276 | Warning: If the new function attempts to call the old function, make sure it isn't via 277 | the function being redefined. E.g. 278 | 279 | ;; This will be an infinite loop or stack overflow 280 | (with-redef (my-fun (lambda () ... stuff ... (funcall 'my-fun)))) 281 | 282 | ;; This will work 283 | (let ((old-fun #'my-fun)) 284 | (with-redef (my-fun (lambda () ... stuff ... (funcall old-fun))))) 285 | 286 | Returns the value(s) returned by BODY." 
287 | ;; Don't really like this implementation, it'll do for the limited test here. 288 | ;; Would prefer to unwind-protect the setting of the symbol function as well as the 289 | ;; restoration, among other things. Also, we should probably use FDEFINITION 290 | ;; instead of SYMBOL-FUNCTION, so we can do SETF functions as well as plain symbols. 291 | ;; Or something like that. 292 | ;; Note effect on generic functions/methods? 293 | (let ((old (gensym)) 294 | (fun (gensym))) 295 | (declare (ignorable old)) 296 | `(let ((,fun ,function)) 297 | (assert (typep ,fun 'cl:function)) 298 | (cond 299 | ((fboundp ',f-sym) 300 | (let ((,old (symbol-function ',f-sym))) 301 | (setf (symbol-function ',f-sym) ,fun) 302 | (unwind-protect (progn ,@body) 303 | (setf (symbol-function ',f-sym) ,old)))) 304 | (t 305 | (setf (symbol-function ',f-sym) ,fun) 306 | (unwind-protect (progn ,@body) 307 | (fmakunbound ',f-sym))))))) 308 | 309 | ;; The redef logic used by this test doesn't work with ECL. It works with SBCL, ACL, 310 | ;; ABCL, CCL, and Lispworks. Basically the flet 'extractor' function used in the redef 311 | ;; is never called. I haven't been able to make a simpler reproducible case yet. 
312 | #-ECL 313 | (test no-useless-extractions 314 | (let ((counter 0) 315 | (old-extractor #'rexxparse::extract)) 316 | (flet ((extractor (&rest args) 317 | (incf counter) 318 | (apply old-extractor args))) 319 | (with-redef (rexxparse::extract #'extractor) 320 | (is (null (parse "abc" (_)))) 321 | (is (zerop counter)) 322 | (is (equalp '("abc") (parse "abc" (a)))) 323 | (is (= counter 1)) 324 | (is (equalp '("b") (parse "abc" (2 3 (- 1) c (+ 1))))) 325 | (is (= counter 2)) 326 | )))) 327 | 328 | (test transforms 329 | ;; Upper lower, first & last positions, tail positions 330 | (is (equalp '("A" "b" "C") (parse "a b c" ((upper a) b (upper c))))) 331 | (is (equalp '("a" "B" "c") (parse "a b c" (a (upper b) c)))) 332 | (is (equalp '("C D") (parse "a b c d" (_ _ (upper x))))) 333 | (is (equalp '("a" "B" "c") (parse "A B C" ((lower a) b (lower c))))) 334 | (is (equalp '("A" "b" "C") (parse "A B C" (a (lower b) c)))) 335 | (is (equalp '("c d") (parse "A B C D" (_ _ (lower x))))) 336 | 337 | ;; SNAKE, KEBAB, LTRIM, RTRIM, TRIM, *TRIM-CHARACTER-BAG* 338 | (is (equalp '("kebab_case" "to_snake_case") 339 | (parse "kebab-case to-snake-case" ((snake a) (snake b))))) 340 | (is (equalp '("snake-case" "to-kebab-case") 341 | (parse "snake_case to_kebab_case" ((kebab a) (kebab b))))) 342 | (is (equalp '(" ab" " def " "hi ") (parse " abc def ghi " (a "c" d "g" g)))) 343 | (is (equalp '("ab" "def" "hi") (parse " abc def ghi " ((ltrim a) "c" (trim d) "g" (rtrim g))))) 344 | 345 | (let ((s (coerce (list #\space #\newline #\a #\b #\c #\space #\newline) 'string)) 346 | (s-with-newline (coerce (list #\newline #\a #\b #\c #\space #\newline) 'string)) 347 | (s-without-newline "abc")) 348 | (is (equalp (list s-with-newline) (parse s ((trim x))))) 349 | (let ((*trim-character-bag* (list #\space #\newline))) 350 | (is (equalp (list s-without-newline) (parse s ((trim x))))))) 351 | 352 | ;; INTEGER, FLOAT, DOUBLE, KEYWORD 353 | (parse "1" ((integer one)) 354 | (is (integerp one)) 355 | (is 
(= 1 one))) 356 | (parse "1.0 2.0" ((float one) (double two)) 357 | (is (typep one 'single-float)) 358 | (is (typep two 'double-float)) 359 | (is (= 1 (round one))) 360 | (is (= 2 (round two)))) 361 | (signals error (parse "fred" ((integer x)))) 362 | (signals error (parse "fred" ((float x)))) 363 | (signals error (parse "fred" ((double x)))) 364 | 365 | ;; Assumes upper case symbols on the lisp. 366 | (parse "abc" ((keyword x)) 367 | (is (keywordp x)) 368 | (is (not (eq :abc x))) 369 | (is (eq :|abc| x))) 370 | (parse :upper "abc" ((keyword x)) 371 | (is (keywordp x)) 372 | (is (eq :abc x))) 373 | 374 | ;; The TRANSFORM transform and various function designators 375 | (let (y) 376 | (flet ((saver (val) (setq y val))) 377 | (is (equalp '("a b c") (parse "a b c" ((transform a #'saver))))) 378 | (is (equalp "a b c" y)))) 379 | 380 | ;; #'(lambda ...) 381 | (let (y) 382 | (is (equalp '("A B C") 383 | (parse "a b c" ((transform a #'(lambda (x) (setq y (string-upcase x)))))))) 384 | (is (equalp "A B C" y))) 385 | 386 | ;; (lambda ...) 
387 | (let (y) 388 | (is (equalp '("A B C") 389 | (parse "a b c" ((transform a (lambda (x) (setq y (string-upcase x)))))))) 390 | (is (equalp "A B C" y))) 391 | 392 | (is (equalp '("A" "b") (parse "a b" ((transform a 'string-upcase) b)))) 393 | (is (equalp '("A" "b") (parse "a b" ((transform a #'string-upcase) b)))) 394 | 395 | 396 | ;; Bogus transforms 397 | (signals error (macroexpand '(parse "a b c" ((foo a))))) 398 | (signals error (macroexpand '(parse "a b c" ((foo))))) 399 | ) 400 | 401 | (test declarations 402 | (is (equalp '("a" 2) 403 | (parse "a 1" (a (integer one)) 404 | (declare (string a) (integer one)) 405 | (list a (1+ one)))))) 406 | 407 | (test using 408 | (let ((a 0)) 409 | (is (equalp '("a" "b") (parse :using (a) "a b" (a b)))) 410 | (is (equalp "a" a)) 411 | (is (equalp '("a" "b") (parse (:using a) "a b" (b a) 412 | (list b a)))) 413 | (is (equalp "b" a))) 414 | 415 | (let ((v (make-array 4 :fill-pointer 0 :adjustable nil))) 416 | (is (equalp '("a" "b") (parse :using-vector (v) "a b" (v b)))) 417 | (is (equalp #("a") v)) 418 | ;; Explicit body reference to V is going to return the vector, not what was matched 419 | ;; and stuffed into the vector. This is by design. User supplied body has full control 420 | ;; (and responsibility for any mess that arises). 421 | (is (equalp (list "a" v) (parse (:using-vector v) "a b" (a v) 422 | (list a v)))) 423 | (is (equalp #("a" "b") v)) ;accumulated across two parse invocations 424 | (is (null (parse (:using-vector v) "c d" (v v) nil))) 425 | (is (equalp #("a" "b" "c" "d") v)) 426 | ;; Vector-push silently does nothing if vector is full. 427 | (is (equalp '("e") (parse (:using-vector v) "e" (v)))) 428 | (is (equalp #("a" "b" "c" "d") v))) 429 | 430 | ;; Vector and positional patterns 431 | (let ((v (make-array 4 :fill-pointer 0))) 432 | (is (equalp #("04" "Mark") 433 | (parse (:using-vector v) "04Mark" (1 v (+ 2) v (> (aref v 0))) v)))) 434 | 435 | ;; Mix it up! 
436 | (let (a b (v1 (make-array 4 :fill-pointer 0)) (v2 (make-array 4 :fill-pointer 0))) 437 | (is (equalp '("a" "b" "c" "d" "e") 438 | (parse (:using a b) (:using-vector v1 v2) "a b c d e" (a b v1 v2 e)))) 439 | (is (equalp "a" a)) 440 | (is (equalp "b" b)) 441 | (is (equalp #("c") v1)) 442 | (is (equalp #("d") v2)) 443 | (is (not (boundp 'e)))) 444 | 445 | ;; Negative tests. 446 | 447 | ;; Symbols may only be in :USING or :USING-VECTOR, not both 448 | (signals error (macroexpand '(parse :using (a b e) :using-vector (b c d a) "abc" (a b c d e)))) 449 | 450 | ;; _ may not be in :USING or :USING-VECTOR 451 | (signals error (macroexpand '(parse :using (a _ c) "abc" (a c)))) 452 | (signals error (macroexpand '(parse :using-vector (a _ c) "abc" (a c)))) 453 | 454 | ;; :USING-[VECTOR] - lists may be empty, but must contain (non-keyword) symbols if non-empty 455 | (is (equalp '("a" "b") (parse :using () "a b" (a b)))) 456 | (is (equalp '("a" "b") (parse :using-vector () "a b" (a b)))) 457 | (signals error (macroexpand '(parse :using (:fred) "a b" (a b)))) 458 | (signals error (macroexpand '(parse :using-vector (1) (a b) "a b" (a b)))) 459 | ) 460 | 461 | (defun run-tests () 462 | "Run all :rexxparse tests" 463 | (run 'test-suite) 464 | nil) 465 | 466 | ;;(trace rexxparse::match-and-extract rexxparse::scan-string rexxparse::scan-word-split rexxparse::extract rexxparse::extract-after-left-trim rexxparse::scan-absolute-position rexxparse::pattern->scanner rexxparse::scan-leftward-relative-position rexxparse::scan-rightward-relative-position REXXPARSE::SCAN-LEFTWARD-LENGTH-POSITION REXXPARSE::SCAN-RIGHTWARD-LENGTH-POSITION) 467 | 468 | ;; Just a tool so I don't have to edit a 'foo.rexx' file and run rexx on it to compare 469 | ;; to what rexx does. 
Don't use it in tests, it would add a very undesirable dependency (OORexx) 470 | (defun trexx (source template &key debug) 471 | "Try parsing SOURCE with TEMPLATE, both strings, in REXX, and return a list of 472 | bound variables the way PARSE would. Period placeholders will not be in the results. 473 | Template should be a string containing rexx-syntax template data. 474 | 475 | E.g. (trexx \"abc\" \"v1 2 v2 3 . 4\") => (\"a\" \"b\") 476 | 477 | Assumes REXXPARSE:PARSE is working enough to parse the REXX output :-) 478 | Assumes newlines for line separators, sorry Winblows and Fruit OSes. 479 | 480 | Only works if 'rexx' is in your PATH. Author tests with Open Object Rexx." 481 | (let ((rexx-output 482 | (uiop/stream:call-with-temporary-file 483 | (lambda (pathname) 484 | (with-open-file (s pathname :direction :output :if-exists :supersede) 485 | (format s "trace i~%parse value '~a' with ~a~%" source template)) 486 | (with-output-to-string (stream) 487 | #+NIL(uiop:run-program (list "cat" (namestring pathname)) :output stream) 488 | (uiop:run-program (list "rexx" (namestring pathname)) 489 | :error-output stream 490 | :output stream))) 491 | :want-pathname-p t :want-stream-p nil))) 492 | ;; For (trexx "abc" "v1 2 v2") 493 | ;; Looking for output lines like >=> V2 <= \"bc\" 494 | ;; Report last binding if var bound multiple times. 
495 | (let ((rexxparse:*unmatched-binding-value* nil)) 496 | (with-input-from-string (stream rexx-output) 497 | (loop with result = nil 498 | for line = (if debug (print (read-line stream nil nil)) (read-line stream nil nil)) 499 | while line 500 | do (parse line (">=> " name " <= " val) 501 | (when val 502 | (parse val ("\"" v "\"") 503 | (setf result (acons name v result))))) 504 | finally (return (nreverse 505 | (mapcar #'cdr 506 | (delete-duplicates result :key #'car :test #'equalp 507 | :from-end t))))))))) 508 | -------------------------------------------------------------------------------- /README.md: -------------------------------------------------------------------------------- 1 | # TL;DR Purpose 2 | 3 | A DSL to concisely scan/tokenize, extract, and transform semi-structured string data, and 4 | bind the results to variables. Inspired by the REXX PARSE command. 5 | 6 | Some simple if not particularly inspired examples: 7 | 8 | (parse "The quick brown fox" (_ _ color animal) 9 | (format t "The color of the ~a is ~a~%" animal color)) 10 | 11 | => The color of the fox is brown 12 | 13 | (defvar *log-line* "2024-Aug-12: [ERROR] Some stupid log error") 14 | (parse *log-line* (year "-" "[" severity "] " rest) 15 | (when (string= severity "ERROR") 16 | (format t "WARNING WILL ROBINSON! ~s happened in ~a~%" rest year))) 17 | 18 | => WARNING WILL ROBINSON! "Some stupid log error" happened in 2024 19 | 20 | (parse "Meal total: $23.12" ("$" (float dollars)) 21 | (format t "Amount with 15 & 20 percent tips: $~,2f, $~,2f~%" 22 | (* dollars 1.15) (* dollars 1.20))) 23 | 24 | => Amount with 15 & 20 percent tips: $26.59, $27.74 25 | 26 | # About 27 | 28 | A long time ago there was a novel scripting language that ran on the IBM 29 | VM/CMS operating system known as [REXX](https://www.rexxla.org/). 30 | 31 | One of the things I always liked about REXX in its era of pre-regexp 32 | scripting languages was the `PARSE` statement. 
In its simplest form PARSE 33 | is a very nice way to parse strings with delimited or positional data and 34 | then bind the matching substrings to variables. 35 | 36 | This package attempts to reproduce REXX's `PARSE` statement as a Common Lisp DSL. 37 | 38 | There's a bit of a zen thing to `PARSE`. Pattern matching is almost 39 | the opposite of a regexp. Instead of specifying patterns for what you want 40 | to match, you specify patterns for the bits that are not of interest, and 41 | what gets bound as a match is the stuff in-between those uninteresting bits. 42 | 43 | For example `(parse "12:00:00" (hh ":" mm ":" ss)) => ("12" "00" "00")` 44 | is matching for the colons; the other tokens are just variable names to 45 | receive the text matched _around_ the colons (though this example doesn't 46 | show the variables in use, `PARSE` defaults to returning a list of the matched 47 | variables if there is no body). 48 | 49 | `PARSE` is about scanning strings and binding desired subsequences 50 | to variables in the style of REXX. It is not intended for lisp syntax 51 | parsing, nor will it replace regexps where you need to express complex 52 | patterns. It is better suited for word-splitting and fixed-format data. 53 | 54 | That said, `PARSE` can sometimes express some parsing tasks in a clearer and 55 | shorter way, and has other useful capabilities such as its ability to 56 | act like a programmable tape reader (e.g. reading a length descriptor 57 | from the input and then extracting a substring for the specified length). 58 | 59 | Realistically this package probably adds little to what you can piece 60 | together with other lisp parsing and/or pattern-matching packages, but if 61 | you liked REXX, then perhaps you'll like this macro with its style of 62 | binding and pattern specifications. This package is also regexp-free by 63 | design. Some overlapping regexp capabilities are mentioned below. 
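As a rough sketch of the two ideas above (matching the uninteresting delimiters, and the "tape reader" style of reading a length and then that many characters), here are two small wrappers. The package name `rexxparse-sketch` and the function names `split-clock` and `read-length-prefixed-names` are hypothetical, made up for illustration, and the sketch assumes ASDF can locate the `:rexxparse` system. The templates and expected values are copied from this repository's README and `test.lisp` rather than derived independently.

```lisp
;; Assumption: the :rexxparse system is somewhere ASDF can find it.
(asdf:load-system :rexxparse)

(defpackage :rexxparse-sketch          ; hypothetical package name
  (:use :cl :rexxparse))
(in-package :rexxparse-sketch)

(defun split-clock (text)
  "Match the colons in TEXT and bind whatever lies between them."
  (parse text (hh ":" mm ":" ss)
    (list hh mm ss)))

(defun read-length-prefixed-names (text)
  "Read TEXT as repeated <2-digit length><payload> records.
LEN is bound from the first two characters, then (> len) takes that many
characters for the following variable, and the process repeats."
  (parse text
         (len (+ 2) first (> len)
          len (+ 2) middle (> len)
          len (+ 2) last (> len))
    ;; LEN has been rebound three times; only its final value remains.
    (list first middle last len)))

;; Per the README example:
;;   (split-clock "12:00:00")
;; Per the LENGTH-POSITIONS test in test.lisp, which expects
;; ("Mark" "" "Twain" "05"):
;;   (read-length-prefixed-names "04Mark0005Twain")
```

The middle field comes back empty because its length descriptor in the input is `00`; `test.lisp` contrasts this with the `(+ len)` relative form, which treats a zero offset differently.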
64 | 65 | # Tested platforms 66 | 67 | Tests ran on the following without issues except as otherwise noted. 68 | 69 | * SBCL 70 | * ECL 71 | * CCL 72 | * ACL 73 | * LISPWORKS 74 | * ABCL - with the following mild warning about the `parse-float` packages 75 | when loading, but otherwise okay: 76 | 77 | ; Loading "rexxparse-test" 78 | ; Caught BAD-SYSTEM-NAME: 79 | ; System definition file #P"/home/dave/quicklisp/dists/quicklisp/software/parse-float-20200218-git/parse-float.asd" contains definition for system "parse-float-tests". Please only define "parse-float" and secondary systems with a name starting with "parse-float/" (e.g. "parse-float/test") in that file. 80 | ; Compilation unit finished 81 | ; Caught 1 WARNING condition 82 | 83 | 84 | # REXX Compatibility 85 | 86 | I have endeavored to make the basic string and position parsing compatible 87 | with REXX semantics. So on the very slim chance you're a former REXX 88 | programmer using Lisp, hopefully you will feel at home. 89 | 90 | This was also the hardest part of the project because the REXX semantics 91 | are sometimes subtle. I was never particularly knowledgeable about REXX to 92 | begin with, and the REXX documentation is a bit hit-or-miss on some 93 | details. At times I was guessing at black box behavior. I've tried to boil 94 | down the main rules in a section labeled "Parse Rules 101" below. 95 | 96 | Among the REXX compatibility features of REXXPARSE is tolerance of edge 97 | cases like rebinding the same variable multiple times, position patterns 98 | which are out of bounds of the string, and so on. About the only 99 | restriction is that relative position fixnums must not be negative, which 100 | is in keeping with REXX semantics. Otherwise it tries not to complain about 101 | mundane things in your templates. 102 | 103 | If you think you've found a bug, try your PARSE with Open Object REXX and 104 | see what it does. 
My goal is to match its semantics for any functionality 105 | shared between the two, however note that the Lisp version has additional 106 | capabilities which can't be compared. 107 | 108 | # Alternative text parsing packages 109 | 110 | Lisp has plenty of great tools that already do parsing, here's a couple for 111 | consideration. 112 | 113 | ## cl-ppcre 114 | 115 | If regexps are your thing you could also use the 116 | [cl-ppcre](https://edicl.github.io/cl-ppcre/#register-groups-bind) 117 | `register-groups-bind` construct. It probably performs just 118 | as well (or better with its years of fine tuning, I have no idea). It even has its own 119 | flavor of transforms that can be applied to the match before binding. 120 | 121 | ## scanfcl 122 | 123 | There's also the Common Lisp [scanf](https://github.com/splittist/scanfcl) 124 | tool, which provides a lisp equivalent to the C `scanf` family of functions 125 | and has the ability to parse numbers for you, but does not provide bindings 126 | and suffers from broader limitations of `scanf`'s parsing capabilities. 127 | 128 | # Example comparison of regexp/scanf/PARSE 129 | 130 | Here is an example of parsing a simple text string with regexps and/or 131 | `scanf`, followed by the way parsing is done with `PARSE`. 132 | 133 | Let's use this text string that we want to parse, where we want to 134 | tease out the year/month/day and error message components: 135 | 136 | (defvar *text* 137 | "2024/02/23 17:35:42.022 - unable to locate '/usr/local/examples/' directory") 138 | 139 | Note the additional blank space after the hyphen as well. 140 | 141 | ## Using cl-ppcre `register-groups-bind` 142 | 143 | (cl-ppcre:register-groups-bind (year month day error-msg) 144 | ("(\\d+)/(\\d+)/(\\d+).* - (.*)" *text*) 145 | (list year month day error-msg)) 146 | 147 | => ("2024" "02" "23" "unable to locate '/usr/local/examples/' directory") 148 | 149 | Nice enough, with the usual cross-eyed issues of writing regexps. 
150 | 151 | ## Using `scanf` 152 | 153 | (scanfcl:sscanf *text* "%d/%d/%d %*s - %s") 154 | 155 | => (2024 2 23 "unable") 156 | 157 | Scanf is nice because it will convert matched text to numeric types. 158 | `REXXPARSE:PARSE` can do that as well via transforms, a REXXPARSE extension 159 | to basic REXX capabilities. 160 | 161 | Note that the scanf example is able to suppress scanning of some text with 162 | the '*' modifier, but fails to parse the desired message because it contains 163 | whitespace. You could use fixed width %s or %c if you could make 164 | assumptions about the width, but that is not generally compatible with most service 165 | log content. If your scanf supports character sets, you could use that 166 | too. Still, it isn't super friendly for reading delimited substrings the 167 | way we do with PARSE. 168 | 169 | ## Using `PARSE` 170 | 171 | ### Pure REXX PARSE 172 | 173 | The original (NOT LISP!) REXX syntax would be: 174 | 175 | PARSE *text* year "/" month "/" day . "-" error 176 | 177 | In the above statement, `*text*` is known as the source (to be matched), 178 | and the remainder of the statement is known as the "template". In REXX, 179 | the period is a placeholder; in lisp we use '_' (underscore) because periods 180 | have different behavior with the Lisp reader. In the above example, the 181 | period would match the timestamp text. 182 | 183 | The template contains symbols naming variables to be bound, and strings to 184 | be matched in the source text such that they delimit the text of interest 185 | to be bound. 186 | 187 | ### Lisp-styled REXX PARSE 188 | 189 | The general syntax of PARSE is 190 | 191 | (parse <source> (<template>) <body>) 192 | 193 | The body allows for optional declarations of template variable symbols via 194 | an implicit enclosing `locally`. Normally they will be strings unless you 195 | are using transforms, but no such implicit declarations are made. 196 | 197 | Here is a simple text parse without a body.
The underscore is as mentioned above: 198 | 199 | (parse *text* (year "/" month "/" day _ "-" error)) 200 | 201 | If all you want to do is return a list of values bound, you can omit all forms 202 | after the template and a list of bound values will be returned, so the above 203 | would return 204 | 205 | => ("2024" "02" "23" "unable to locate '/usr/local/examples/' directory") 206 | 207 | Values are returned in order of the variables specified in the template. 208 | Text conceptually (but not physically) bound to the placeholder `_` 209 | is not included in the result. 210 | 211 | One of the main points of PARSE is to lexically bind variables for you 212 | so you don't have to go and fetch them from a list with `destructuring-bind` 213 | or other tools. For example: 214 | 215 | ;; mock snippet dealing with some error noted in *text* 216 | (parse *text* ("unable to locate '" path "' directory") 217 | (cerror "Create the directory ~s and continue" 218 | "The directory ~s did not exist" 219 | path)) 220 | 221 | => 222 | 223 | The directory "/usr/local/examples/" did not exist 224 | [Condition of type SIMPLE-ERROR] 225 | 226 | Restarts: 227 | 0: [CONTINUE] Create the directory "/usr/local/examples/" and continue 228 | 1: [RETRY] Retry SLIME REPL evaluation request. 229 | 2: [*ABORT] Return to SLIME's top level. 230 | 3: [ABORT] abort thread (#) 231 | 232 | #### Template variables, bindings vs. assignment 233 | 234 | Symbols acting as variables in the template, except for '_', are _bindings_ 235 | introduced by `LET` and initialized with 236 | `REXXPARSE:*UNMATCHED-BINDING-VALUE*`. 237 | 238 | However, depending on the use of the symbols in the template, they may 239 | undergo multiple assignments, either to text matched by the parse, or to 240 | the result of transformations on the parsed text. 241 | 242 | The '_' does not result in a binding: no `_` symbol is bound on the stack, 243 | and any template matches for this symbol will not be extracted or saved to any variable.
244 | 245 | #### REXX variables vs. Lisp s-expressions 246 | 247 | If you're reading REXX documentation (or otherwise familiar with it), such 248 | as [Open Object REXX Reference](https://rexxinfo.org/reference/articles/oorexxref.pdf), 249 | note that the use of parenthesized forms is different between REXX and 250 | REXXPARSE:PARSE. Where REXX would use a parenthesized expression to do a 251 | language variable reference, REXXPARSE uses parenthesized forms in templates for their 252 | syntactic value beyond that, e.g. `(+ x)` is a positional pattern to 253 | move rightward `x` columns. I imagine the confusion will only occur to 254 | people who have been writing a lot of REXX recently. 255 | 256 | ### Word-oriented tokenization 257 | 258 | The basic behavior of PARSE favors matching tokens delimited by 259 | spaces. Absent specific patterns from you, the spaces around tokens bound 260 | to variables are discarded. Thus 261 | 262 | (parse "Now  is the time" (now is the-time)) 263 | => ("Now" "is" "the time") 264 | 265 | Note the multiple spaces between "Now" and "is", all used to divide tokens 266 | matched and discarded. This is different from a pattern indicating a space, 267 | e.g. 268 | 269 | (parse "Now  is the time" (now " " is " " the-time)) 270 | => ("Now" "" "is the time") 271 | 272 | Here 'is' is matched to the text between the point matched by the pattern 273 | on the left and the point matched by the pattern on the right. The two 274 | patterns match consecutive spaces and produce a zero length binding. 275 | Don't let it mess with your head too much, this is a fairly contrived example. 276 | 277 | ### More text than bindings 278 | 279 | The last binding variable will be assigned any unmatched tail of the source 280 | string. E.g. 281 | 282 | (parse "a b c" (a b)) 283 | => ("a" "b c") 284 | 285 | In this situation, the text bound to the tail variable will not have spaces trimmed.
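To make the word-splitting and tail-match behavior described above a little more concrete, here is a minimal sketch in plain Common Lisp. It is not how REXXPARSE is implemented; `split-words-with-tail` is a hypothetical helper invented for this illustration, and it only handles the pattern-free case where the template is a list of `n` variables.

```lisp
;; Hypothetical illustration only, NOT part of REXXPARSE.
;; Splits SOURCE into N strings the way a pattern-free template of
;; N variables would: the first N-1 pieces are space-delimited words,
;; the last piece is the untrimmed remainder (the "tail").
(defun split-words-with-tail (source n)
  (let ((pos 0)
        (len (length source))
        (result '()))
    (dotimes (i (1- n))
      ;; Word splitting eats any spaces before the word...
      (loop while (and (< pos len) (char= (char source pos) #\Space))
            do (incf pos))
      (let ((end (or (position #\Space source :start pos) len)))
        (push (subseq source pos end) result)
        ;; ...and exactly one space after it.
        (setf pos (min len (1+ end)))))
    ;; The last variable gets whatever is left, spaces included.
    (push (subseq source pos) result)
    (nreverse result)))

(split-words-with-tail "a b c" 2)   ; => ("a" "b c")
(split-words-with-tail "a  b c" 2)  ; => ("a" " b c"), tail keeps its leading space
```

With a single variable (`n` = 1) the sketch returns the whole source string untouched, which is the degenerate case of the same tail behavior.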
286 | 287 | ### More bindings than text 288 | 289 | If there are unused variables because there are fewer words in the 290 | source than there are variables in the template, unused variables will 291 | be bound to `REXXPARSE:*UNMATCHED-BINDING-VALUE*`, which defaults to an 292 | empty string (in keeping with REXX semantics). You can change this 293 | behavior by rebinding the variable. 294 | 295 | (parse "a b" (a b c)) 296 | => ("a" "b" "") 297 | 298 | ### Consecutive bindings and/or patterns 299 | 300 | Your template may have binding sequences without interleaved patterns, in which case 301 | the implicit word splitting pattern applies. It may also have pattern 302 | sequences without interleaved binding variables, which may be useful if, 303 | for example, you're looking to advance across like tokens, e.g. 304 | 305 | (parse "I want the text following the second occurrence of 'text', this text." 306 | ("text" "text" the-rest)) 307 | => ("', this text.") 308 | 309 | ### Parse Rules 101 310 | 311 | The simplest form of parsing template consists of a list of variable names. 312 | The string being parsed is split up into words (characters delimited by 313 | blanks), and each word from the string is assigned to a variable in 314 | sequence from left to right. Leading blanks are removed from each word in 315 | the string before it is assigned to a variable, as is the blank that 316 | delimits the end of the word. 317 | 318 | Beyond the simple case there are some rules to remember for the myriad 319 | edge cases and features related to PARSE: 320 | 321 | 1. If there is one variable and no pattern, the variable matches the whole 322 | source string (no whitespace characters are removed). 323 | 324 | 2. If there are more variables than words, excess variables are bound to 325 | `*UNMATCHED-BINDING-VALUE*`. 326 | 327 | 3. If there is more text than variables would match, the last variable is 328 | bound to all remaining text. Sometimes called the "tail match" rule.
329 | Tail matches never eat spaces, they preserve the remainder of the source 330 | string to be matched. 331 | 332 | 4. [SUBTLE, CRUCIAL] Any explicit pattern (with a match in the source 333 | string) creates a logical break in the 334 | source string such that the variable to the left of the pattern is treated as a 335 | "tail match" situation on the substring terminated by the pattern. 336 | 337 | Moreover, variables to the left apply to the substring to the left 338 | of the pattern. I.e. 339 | 340 | `(parse "a b c x g" (a b "x" g)) => ("a" " b c" " g")` 341 | 342 | 5. Where no pattern is given between two variables or between a variable 343 | and the beginning or end of the source string, an implicit "word 344 | splitting" takes place. Word splitting eats spaces before a token to be 345 | matched, and one space after the token. 346 | 347 | 6. An empty string is never found, it always matches the end of the source 348 | string. Specifying an absolute position of 1 as the pattern following a 349 | variable has a similar effect to an empty string pattern, but it leaves 350 | the cursor positioned such that you can match the source string again. 351 | 352 | `(parse " a b c " (a "" b)) => (" a b c " "")` 353 | `(parse " a b c " (a 1 b)) => (" a b c " " a b c ")` 354 | 355 | 7. Absolute positions less than one are treated as one. 356 | Absolute positions greater than the source string length are treated 357 | as being the string length. 358 | 359 | 8. Relative position expressions, e.g. `(- <n>)` require the `<n>` 360 | to be a non-negative fixnum or a string that can be converted to a 361 | non-negative fixnum. Absolute positions expressed with `(= <n>)` have 362 | the same rules. 363 | 364 | 9. Relative and absolute positional patterns are interchangeable _with one exception_. 365 | 366 | Normally template parsing of string matches skips the text matched by 367 | the string pattern.
However, a template sequence of the form 368 | `"string" variable <relative-position>` does NOT skip the string 369 | pattern data when assigning to the variable, and so the pattern text 370 | will appear in the variable. For example (with non-relative examples too): 371 | 372 | `(parse " a b c " ("b" b)) => (" c ")` ; "b" not included, no relative positional 373 | 374 | ;; '5' is effectively equal to the source scanning start position when its pseudo-pattern 375 | ;; is matched, which means an empty string match, which is a break/tail-position behavior. 376 | `(parse " a b c " ("b" b 5)) => (" c ")` ; from end of "b" to 5 is empty, full break 377 | `(parse " a b c " ("b" b 7)) => (" c")` ; space past "b" to 7, 2 chars 378 | `(parse " a b c " ("b" b (+ 1))) => ("b")` ; from position of b to position+1 379 | `(parse " a b c " ("b" b (- 1))) => ("b c ")` ; from position of b to end of string 380 | 381 | 10. Template expressions may reference variables bound by preceding 382 | template matches. See the section on `Length Positional Patterns` for an example. 383 | 384 | ### Positional template directives 385 | 386 | Patterns may also be positional directives, where integers specify absolute 387 | or relative positions in the source string, relative positions being 388 | relative to the start of the last pattern matched. Positions are generally 389 | used for fixed length subfields in strings, but can also be used to re-scan 390 | the source. 391 | 392 | Like string patterns, positions identify points at which the source string 393 | is split, only the length of the match is zero. Also like string patterns, 394 | variables bracketed by patterns will not be string trimmed. 395 | 396 | ; 1 2 3 397 | ; 1234567890123456789012345678901234 398 | (parse "Brimfield Massachusetts 10101" 399 | (city 14 state 30 zip)) 400 | => ("Brimfield " "Massachusetts " "10101") 401 | 402 | _Absolute_ positions may be specified as positive integer literals.
403 | The above example specifies position matches for columns 14 and 30. 404 | Absolute positions are all 1-based integer values, i.e. a column ordinal. Subtract 405 | one mentally for Lisp array indices. (This choice is for REXX compatibility). 406 | 407 | For positions involving integer-valued variables 408 | instead of integer literals, you must supply an s-expression whose car is 409 | one of `+`, `-`, `=`, followed by an s-expression that is evaluated at 410 | runtime (not macroexpansion time) to produce an integer to be interpreted 411 | as the relative or absolute position. The `+` and `-` expressions indicate 412 | _relative_ positions, while `=` indicates an absolute position. 413 | 414 | Examples: 415 | 416 | ;; City occupies columns 1-13 inclusive. 417 | ;; State occupies columns 14-29 inclusive 418 | ;; '+' indicates position relative to the prior pattern match position. 419 | (parse "Brimfield Massachusetts 10101" 420 | (city (+ 13) state (+ 16) zip)) 421 | => ("Brimfield " "Massachusetts " "10101") 422 | 423 | ;; Mixing absolute positions 30 and 31 with relative offsets. 424 | ;; reparsing the first '1' twice 425 | (parse "Brimfield Massachusetts 10101" 426 | (30 one-a 31 (- 1) one-b (+ 1))) 427 | => ("1" "1") 428 | 429 | ;; use of variables must be through the parenthesized expression 430 | ;; otherwise they would be indistinguishable from variables to be bound. 431 | ;; '=' indicates absolute positions 432 | (defvar *state-column* 14) 433 | (defvar *zip-column* 30) 434 | (parse "Brimfield Massachusetts 10101" 435 | (city (= *state-column*) state (= *zip-column*) zip)) 436 | => ("Brimfield " "Massachusetts " "10101") 437 | 438 | 439 | Any positional directive that would precede the first source column 440 | (i.e. is < 1) is treated as 1. 441 | 442 | Any positional directive that would exceed the length of the 443 | source string is treated as the string length, matching the 444 | remainder of the string.
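The 1-based column convention and the out-of-range clamping described above can be sketched in plain Common Lisp. This is only an illustration of the arithmetic, not REXXPARSE's code: `column->index` and `split-at-columns` are hypothetical names, and the sketch assumes the columns are given in ascending order (it does not model re-scanning leftward).

```lisp
;; Hypothetical illustration only, NOT part of REXXPARSE.
;; Convert a 1-based column into a 0-based index, clamped so that
;; columns before the start or past the end of SOURCE stay in bounds.
(defun column->index (column source)
  (max 0 (min (length source) (1- column))))

;; Split SOURCE at each 1-based column in COLUMNS (assumed ascending),
;; returning the pieces between the split points plus the tail.
(defun split-at-columns (source &rest columns)
  (let ((start 0)
        (pieces '()))
    (dolist (column columns)
      (let ((end (column->index column source)))
        (push (subseq source start end) pieces)
        (setf start end)))
    (push (subseq source start) pieces)
    (nreverse pieces)))

(split-at-columns "abcdefghij" 4 8)  ; => ("abc" "defg" "hij")
(split-at-columns "abc" 0 99)        ; => ("" "abc" ""), both columns clamped
```

Note that, as with the real positional patterns, nothing is trimmed: each piece is exactly the text between two split points.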
445 | 446 | It is an error for any net position value to exceed the range of a fixnum. 447 | 448 | ### Positional pattern data types 449 | 450 | All positional expressions must be integers in the range of non-negative 451 | fixnums, or s-expressions that resolve to those values. This constraint is 452 | relaxed for `+`, `-`, and `=` patterns, as well as `>` and `<` (described 453 | below) so that strings matched while parsing may be used later in the 454 | template as numeric positional directives. Note that such uses 455 | of the value bound at one step of the parse act as input controlling later 456 | steps of the parse. 457 | 458 | Allowing strings as positional values is a shortcut to avoid 459 | littering your template with `parse-integer` calls on previously 460 | matched text. String to fixnum conversions in positional templates that do 461 | not resolve to non-negative fixnums will result in a continuable error 462 | being signalled. Conversions are performed with `cl:parse-integer` and may 463 | generate a `cl:parse-error` condition if the text is not parseable as 464 | an integer; note that `parse-error` is not a continuable condition. 465 | 466 | See the next section with examples matching integer data in the source 467 | string and using those integers for subsequent match activity. 468 | 469 | ### Length Positional Patterns 470 | 471 | A `length positional pattern` is a number in a `<` or `>` pattern sexp 472 | similar to the `+`, `-`, and `=` pattern forms. I'm not sure why REXX 473 | distinguishes this from `-` and `+` positional patterns, they are identical 474 | in behavior except for one situation noted below. 475 | 476 | As with `-` and `+` the number specifies the length at which the source 477 | string is to be split relative to the current position. `>` and `<` 478 | indicate movement right or left, respectively, from the start of the string 479 | or from the position of the last match.
480 | 481 | The `>` length pattern and the `+` relative positional pattern are 482 | interchangeable except in the special case of a zero value. A `(> 0)` pattern 483 | will split the string into a null (empty) string and leave the match position 484 | unchanged, whereas a `(+ 0)` pattern also leaves the match position 485 | unchanged, but doesn't split the string. In essence `(> 0)` says "match 486 | empty string" whereas `(+ 0)` says "advance the scan zero characters", matching 487 | whatever follows. 488 | 489 | This string splitting behavior is useful for parsing string subfields 490 | whose lengths are also encoded in the string. 491 | 492 | The following example shows the difference between `(> 0)` and `(+ 0)`, 493 | note the different matches for `middle`: 494 | 495 | ;; Parsing with length patterns 496 | (parse "04Mark0005Twain" 497 | (len (+ 2) first (> len) len (+ 2) middle (> len) len (+ 2) last (> len)) 498 | (list first middle last len)) 499 | => ("Mark" "" "Twain" "05") 500 | 501 | ;; Parsing with relative patterns only 502 | (parse "04Mark0005Twain" 503 | (len (+ 2) first (+ len) len (+ 2) middle (+ len) len (+ 2) last (+ len)) 504 | (list first middle last len)) 505 | => ("Mark" "05Twain" "Twain" "05") 506 | 507 | While `<` is similar to `-`, application of the match/extract process 508 | differs. To achieve the effect of `<` on a region of text with `-` you 509 | must use a `-`/`+` pair, and the position in source differs as in the 510 | following example: 511 | 512 | ;; Parsing with length patterns 513 | (parse "12345.6789" ("." digit (< 1) rest)) => ("5" "5.6789") 514 | ;; Parsing with relative patterns 515 | (parse "12345.6789" ("." (- 1) digit (+ 1) rest)) => ("5" ".6789") 516 | 517 | `<` is similar to matching a string literal, _without_ advancing the next 518 | position to be scanned after binding.
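For comparison, here is roughly what the length-prefixed "04Mark0005Twain" parse looks like when written by hand in plain Common Lisp. This is a sketch for illustration, not REXXPARSE's implementation; `read-length-prefixed-fields` is a hypothetical function, and it assumes every field is preceded by a fixed-width decimal length (two digits by default) and that the input is well formed.

```lisp
;; Hypothetical illustration only, NOT part of REXXPARSE.
;; Reads SOURCE as a sequence of <length><field> records, where the
;; length is a PREFIX-WIDTH digit decimal number, and returns the fields.
(defun read-length-prefixed-fields (source &key (prefix-width 2))
  (loop with pos = 0
        while (< pos (length source))
        collect (let* ((len (parse-integer source
                                           :start pos
                                           :end (+ pos prefix-width)))
                       (start (+ pos prefix-width))
                       (end (+ start len)))
                  ;; Advance the "tape" past this field before the next read.
                  (setf pos end)
                  (subseq source start end))))

(read-length-prefixed-fields "04Mark0005Twain")
;; => ("Mark" "" "Twain")
```

The zero-length middle field comes back as an empty string, which is the same result the `(> len)` version of the template produces for `middle`.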
519 | 520 | ### Transformations (REXXPARSE extension) 521 | 522 | `REXXPARSE:PARSE` supports transformations on matched strings before they 523 | are assigned to variables. Transforms are a REXXPARSE lisp extension and 524 | not part of the basic REXX PARSE capability. 525 | 526 | The syntax for an assignment based on a predefined transformation is: 527 | 528 | (<transform> variable) 529 | 530 | where you would otherwise just have a variable to be bound that wasn't in a 531 | list. Note that the above uses `<transform>` as a non-terminal BNF token 532 | representing many possible pre-defined transformations. There is also a `TRANSFORM` 533 | terminal symbol with specific user-defined transformation semantics. 534 | 535 | Transforms have the same syntax as list-form patterns but are in 536 | fact binding forms. PARSE distinguishes patterns from transform-augmented 537 | bindings by the symbol name of the CAR of the list being known as a 538 | transform symbol. 539 | 540 | Transforms are a convenience for common parse situations. You 541 | could always do the transformations in the `&BODY` of the parse if you need 542 | different transformation semantics than those pre-defined by REXXPARSE or 543 | simply don't like the confusing transform syntax that resembles pattern syntax. 544 | 545 | (parse "some text with numbers: 1.0 2" (_ ": " (float x) (integer n)) 546 | (format t "~f is a ~s, ~d is a ~s~%" x (type-of x) n (type-of n))) 547 | 548 | => 549 | 1.0 is a SINGLE-FLOAT, 2 is a (INTEGER 0 4611686018427387903) 550 | NIL 551 | 552 | The `(float x)` and `(integer n)` expressions are DSL syntax to invoke transformations 553 | on the text corresponding to variables `x` and `n`, and assigning the 554 | transformation result to those variables. The set of supported 555 | transformations is described below. 556 | 557 | Transformations do not currently nest, i.e. you _cannot_ do `(LOWER (KEBAB x))`. 558 | If you need to apply more complicated transformations, see 'user defined transforms' below.
559 | 560 | Transform expressions using the `_` symbol are effectively NO-OPs. No 561 | text is extracted, and no transform function is run. 562 | 563 | #### Pre-defined transforms 564 | 565 | * UPPER - uppercases the extracted text. 566 | * LOWER - lowercases the extracted text. 567 | * SNAKE - convert hyphens to underscores. 568 | * KEBAB - convert underscores to hyphens. 569 | * LTRIM - remove leading spaces. 570 | * RTRIM - remove trailing spaces. 571 | * TRIM - remove leading and trailing spaces. 572 | * INTEGER - convert extracted text to an integer. 573 | * FLOAT - convert extracted text to a single-float. 574 | * DOUBLE - convert extracted text to a double-float. 575 | * KEYWORD - convert extracted text to a keyword. 576 | 577 | The floating point conversions are done using the `:parse-float` package, 578 | they do _not_ perform unsafe `READ`s. INTEGER conversion is done using `PARSE-INTEGER`. 579 | 580 | `SHORT-FLOAT` and `LONG-FLOAT` conversions are not supported, the 581 | `:parse-float` package doesn't seem to support them on SBCL at least. If 582 | you need these representations you'll probably want to use 583 | `*READ-DEFAULT-FLOAT-FORMAT*` bindings with a user-defined transform that 584 | observes it and manages the conversion. 585 | 586 | Of course you could also just supply a BODY to the `PARSE` form and do the 587 | conversions in the body, there's no need whatsoever to use user-defined 588 | transforms except perhaps to abbreviate code if the transformation is used a lot. 589 | 590 | The `LTRIM`, `RTRIM`, and `TRIM` transforms use the Common Lisp `STRING-LEFT-TRIM`, 591 | `STRING-RIGHT-TRIM`, and `STRING-TRIM` functions respectively, supplying 592 | `REXXPARSE:*TRIM-CHARACTER-BAG*` as the character bag argument. This is 593 | exported so that you may bind it to other characters (outside of the 594 | `PARSE` form), but note that it will 595 | affect all trim transforms in the scope of the binding.
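If the effect of rebinding a character bag isn't obvious, the following plain Common Lisp sketch shows the mechanism using only `STRING-TRIM` and a special variable. The names `*example-trim-bag*` and `example-trim` are hypothetical stand-ins made up for this illustration; they are not REXXPARSE symbols, and the default bag shown here is an assumption rather than the library's actual default.

```lisp
;; Hypothetical illustration only, NOT part of REXXPARSE.
;; A special variable holding the characters to trim (assumed default: space).
(defvar *example-trim-bag* '(#\Space))

;; Trims TEXT using whatever bag is dynamically bound at call time.
(defun example-trim (text)
  (string-trim *example-trim-bag* text))

(example-trim "  --abc--  ")
;; => "--abc--"

;; Rebinding the bag changes every trim performed within the LET's
;; dynamic extent, which is why the binding affects all trim transforms.
(let ((*example-trim-bag* '(#\Space #\-)))
  (example-trim "  --abc--  "))
;; => "abc"
```

Because the variable is special, the wider bag is only in effect inside the `let`; trims performed afterwards see the original value again.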
596 | 597 | The KEYWORD transform does no case conversion, so it's easy to make 598 | symbols in unexpected cases if you aren't careful. If you want to 599 | upper/lower case the text before the transform makes a keyword of it, you 600 | could use `:UPPER` or `:LOWER` options to `PARSE` (though that will change 601 | the case of the whole source string). Or you can just do what you want in 602 | the `PARSE` body. 603 | 604 | #### User defined transforms 605 | 606 | There is a special transform operator, `TRANSFORM`, which exists to invoke 607 | user-supplied transformation functions. 608 | 609 | (transform <symbol> <function>) 610 | 611 | This will invoke `function` on the text extracted by the parse, and assign the 612 | result of the transformation function to `symbol`. `function` must be a 613 | [function designator](https://www.lispworks.com/documentation/HyperSpec/Body/26_glo_f.htm#function_designator) 614 | for a function of one argument which will always be a string. 615 | 616 | ;; User defined transform example 617 | (defun stupid (str) (declare (ignore str)) "Stupid!") 618 | (parse "Don't call me dull." (_ _ _ (transform s 'stupid) ".")) 619 | => ("Stupid!") 620 | 621 | ## Full PARSE syntax 622 | 623 | The general form of a `PARSE` invocation (pardon the weak BNF) is below. 624 | Only the source expression is required, and it must yield a string. 625 | 626 | PARSE