Parse-EZ: Clojure Parser Library

API Documentation

Parse-EZ is a parser library for Clojure programmers. It allows easy mixing of
declarative and imperative styles and does not require any special constructs,
macros, monads, etc. to write custom parsers. All the parsing is implemented
using regular Clojure functions.

The library provides a number of parse functions and combinators and comes with
a built-in customizable infix expression parser and evaluator. It lets the
programmer concisely specify the structure of input text using Clojure functions
and easily build parse trees without having to step out of Clojure. Whether you
are writing a parser for well-structured data, scraping data, or prototyping a
new language, you can use this library to quickly create a parser.

Features

- Parse functions and combinators
- Automatic handling of whitespace and comments
- Marking positions and backtracking
- Seek, read, and skip string/regex patterns
- Built-in customizable expression parser and evaluator
- Exception-based error handling
- Custom error messages


Usage

Installation

Just add Parse-EZ as a dependency to your lein project:
[protoflex/parse-ez "0.4.2"]

and run lein deps.

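For reference, a minimal project.clj using this dependency might look like the
following (the project name, version, and Clojure version are placeholders, not
prescribed by Parse-EZ):

(defproject my-app "0.1.0"                         ; hypothetical project name/version
  :description "Example project using Parse-EZ"
  :dependencies [[org.clojure/clojure "1.3.0"]     ; any compatible Clojure version
                 [protoflex/parse-ez "0.4.2"]])
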
A Taste of Parse-EZ

Here are a couple of sample parsers to give you a taste of the library.

CSV Parser

A CSV file contains multiple records, one record per line, with field values
separated by a delimiter such as a comma or a tab. The field values may
optionally be quoted using either single or double quotes. When field values are
quoted, they may contain the field-delimiter characters, and in such cases those
characters are not treated as field separators.

First, let us define a parse function for parsing one line of a CSV file:
(defn csv-1 [sep]
  (sep-by #(any-string sep) #(chr sep)))

In the above function definition, we make use of the parse combinator sep-by,
which takes two arguments: the first one reads a field value and the second one
reads the separator. Here, we have used Clojure's anonymous-function shortcuts to
specify the desired behavior succinctly. The any-string function matches a
single-quoted string, a double-quoted string, or a plain string followed by the
specified separator sep. This is exactly the function we need to read a field
value. The second argument provided to sep-by above uses the primitive parse
function chr, which succeeds only when the next character in the input matches
its argument (the sep parameter in this case). The csv-1 function returns the
field values as a vector.

The sep-by function actually takes an optional third argument, a record-separator
function, which defaults to a function that matches a newline. We didn't pass the
third argument above because the default behavior suits our purpose. Had the
default behavior of sep-by been different, we would have written the above
function as:
(defn csv-1 [sep]
  (sep-by #(any-string sep) #(chr sep) #(regex #"\r?\n")))

Now that we have created a parse function to parse a single line of a CSV file,
let us write another parse function that parses the entire CSV file content and
returns the result as a vector of vectors of field values (one vector per
record/line). All we need to do is repeatedly apply the csv-1 function defined
above, and the multi* parse combinator does just that.

Just one small but important detail: by default, Parse-EZ automatically trims
whitespace after successfully applying a parse function. This means that the
newline at the end of a line would be consumed after reading the last field
value, and sep-by would be unable to match the end-of-line, which is the record
separator in this case. So we disable the newline-trimming behavior using the
no-trim combinator.
(defn csv [sep]
  (multi* (fn [] (no-trim #(csv-1 sep)))))

Alternatively, you can express the above function a bit more easily using the
macro versions of the combinators introduced in version 0.3.0:
(defn csv [sep]
  (multi* (no-trim_ (csv-1 sep))))

Now, let us try out our CSV parser. First, let us define two test strings, each
containing two records (lines). Note that the second string contains a comma
inside the first cell (a quoted string).
user> (def s1 "1abc,def,ghi\n2jkl,mno,pqr\n")
#'user/s1
user> (def s2 "'1a,bc',def,ghi\n2jkl,mno,pqr\n")
#'user/s2
user> (parse #(csv \,) s1)
[["1abc" "def" "ghi"] ["2jkl" "mno" "pqr"]]
user> (parse #(csv \,) s2)
[["1a,bc" "def" "ghi"] ["2jkl" "mno" "pqr"]]
user>

Well, all we had to do was write two lines of Clojure code to implement the CSV
parser. Let's add a bit more functionality: CSV files may use either a comma or
a tab character to separate the field values. Let's say we don't know ahead of
time which character a file uses as a separator and we want to detect the
separator automatically. Note that both characters may occur in a data file, but
only one acts as a field separator, and only when it's not inside a quoted
string.

Here is our strategy to detect the separator:

- if the first field value is quoted (single or double), read the quoted string
- else, read until a comma or tab occurs
- the next character is our delimiter

Here is the code:
(defn detect-sep []
  (let [m (mark-pos)
        s (attempt #(any dq-str sq-str))
        s (if s s (no-trim #(read-to-re #",|\t")))
        sep (read-ch)]
    (back-to-mark m)
    sep))

Note how we used the mark-pos and back-to-mark Parse-EZ functions to 'unconsume'
the consumed input.

The complete code for the sample CSV parser with the separator-detection
functionality is listed below (you can find it in the csv_parse.clj file under
the examples directory):
(ns protoflex.examples.csv_parse
  (:use [protoflex.parse]))

(declare detect-sep csv-1)

(defn csv
  "Reads and returns one or more records as a vector of vector of field-values"
  ([] (csv (no-trim #(detect-sep))))
  ([sep] (multi* (fn [] (no-trim-nl #(csv-1 sep))))))

(defn csv-1
  "Reads and returns the fields of one record (line)"
  [sep] (sep-by #(any-string sep) #(chr sep)))

(defn detect-sep
  "Detects the separator used in a csv file (a comma or a tab)"
  [] (let [m (mark-pos)
           s (attempt #(any dq-str sq-str))
           s (if s s (no-trim #(read-to-re #",|\t")))
           sep (read-ch)]
       (back-to-mark m)
       sep))

Let's try out the new auto-detect functionality. We define two new test strings,
s3 and s4, that use the tab character as the field separator.
user> (use 'protoflex.examples.csv_parse)
nil
user> (def s3 "1abc\tdef\tghi\n2jkl\tmno\tpqr\n")
#'user/s3
user> (def s4 "'1a\tbc'\tdef\tghi\n2jkl\tmno\tpqr\n")
#'user/s4
user> (parse csv s3)
[["1abc" "def" "ghi"] ["2jkl" "mno" "pqr"]]
user> (parse csv s4)
[["1a\tbc" "def" "ghi"] ["2jkl" "mno" "pqr"]]
user> (parse csv s1)
[["1abc" "def" "ghi"] ["2jkl" "mno" "pqr"]]
user>

As you can see, this time we didn't specify which field separator to use: the
parser itself detected the field-separator character and used it, returning the
desired results.

XML Parser

Here is the listing of a sample XML parser implemented using Parse-EZ. You can
find the source file in the examples directory. The parser returns a map
containing the keys :tag, :attributes, and :children for the root element. The
value for the :attributes key is itself a map of attribute names to their
values. The value for the :children key is a (possibly empty) vector containing
string content and/or maps for child elements.
(ns protoflex.examples.xml_parse
  (:use [protoflex.parse]))

(declare pi prolog element attributes children-and-close cdata elem-or-text close-tag)

(defn parse-xml [xml-str]
  (parse #(between prolog element pi) xml-str :blk-cmt-delim ["<!--" "-->"] :line-cmt-start nil))

(defn- pi [] (while (starts-with? "<?") (skip-over "?>")))

(defn- prolog [] (pi) (attempt #(regex #"(?s)<!DOCTYPE([^<]+?>)|(.*?\]\s*>)")) (pi))

The function parse-xml is the entry point that kicks off parsing of the input
XML string xml-str. It passes the between combinator to Parse-EZ's parse
function. Here, the call to between returns the value returned by the element
parse function, ignoring the content surrounding it (matched by the prolog and
pi functions). The block-comment delimiters are set to match XML comments, and
the line-comment delimiter is cleared (by default these match Java comments).

The parse function pi skips consecutive processing instructions by using the
delimiters <? and ?>.

The parse function prolog skips the DTD declaration (if any) and also any
surrounding processing instructions. Note that the regex used to match the DTD
declaration is only meant for illustration purposes; it isn't complete, but it
will work in most cases.
(def name-start ":A-Z_a-z\\xC0-\\xD6\\xD8-\\xF6\\xF8-\\u02FF\\u0370-\\u037D\\u037F-\\u1FFF\\u200C-\\u200D\\u2070-\\u218F\\u2C00-\\u2FEF\\u3001-\\uD7FF\\uF900-\\uFDCF\\uFDF0-\\uFFFD")

(def name-char (str name-start "\\-.0-9\\xB7\\u0300-\\u036F\\u203F-\\u2040"))

(def name-re (-> (format "[%s][%s]*" name-start name-char) re-pattern))

name-re is a regular expression that matches XML element and attribute names.
(defn element []
  (let [tag (do (chr \<) (regex name-re))
        attrs (attributes)
        children (look-ahead* [
                   ">" #(children-and-close tag)
                   "/>" (fn [] [])])]
    {:tag tag, :attributes attrs, :children children}))

The element parse function matches an XML element and returns the tag, attribute
map, and children in a hash map. Note the use of the look-ahead* combinator to
handle both cases -- with children and without children. If it sees a ">" after
reading the attributes, the look-ahead* function calls the children-and-close
parse function to read the children and the element close tag. On the other
hand, if it sees "/>" after the attributes, it calls the (almost) empty parse
function that simply returns an empty vector.
(defn attr []
  (let [n (regex name-re) _ (chr \=)
        v (any sq-str dq-str)]
    [n v]))

(defn attributes [] (apply hash-map (flatten (multi* attr))))

The attr parse function matches a single attribute. The attribute value may be a
single-quoted or double-quoted string; note the use of the any parse combinator
for this purpose.

The attributes parse function matches multiple attribute specifications by
passing the attr parse function to the multi* parse combinator.
(defn- children-and-close [tag]
  (let [children (multi* #(between pi elem-or-text pi))]
    (close-tag tag)
    children))

Each child item is read using the elem-or-text parse function while ignoring any
surrounding processing instructions using the between combinator; the combinator
multi* is used to read all the child items.
(defn- elem-or-text []
  (look-ahead [
    "<![CDATA[" cdata
    "</" (fn [] nil)
    "<" element
    "" #(read-to "<")]))

The look-ahead parse combinator is used to call different parse functions based
on different lookahead strings. Note that, unlike the look-ahead* function used
earlier (in the definition of the element parse function), look-ahead does not
consume the lookahead string.
(defn- cdata []
  (string "<![CDATA[")
  (let [txt (read-to "]]>")] (string "]]>") txt))

(defn- close-tag [tag]
  (string (str "</" tag))
  (chr \>))

By now, it should be obvious what the above two functions do.

Well, an XML parser in under 50 lines. Let's try it with a few sample inputs:
user> (use 'protoflex.examples.xml_parse)
nil
user> (parse-xml "<abc>text</abc>")
{:tag "abc", :attributes {}, :children ["text"]}
user> (parse-xml "<abc a1=\"1\" a2=\"attr 2\">sample text</abc>")
{:tag "abc", :attributes {"a1" "1", "a2" "attr 2"}, :children ["sample text"]}
user> (parse-xml "<abc a1=\"1\" a2=\"attr 2\"><def d1=\"99\">xxx</def></abc>")
{:tag "abc", :attributes {"a1" "1", "a2" "attr 2"}, :children [{:tag "def", :attributes {"d1" "99"}, :children ["xxx"]}]}
user>

Comments and Whitespace

By default, Parse-EZ automatically handles comments and whitespace. This
behavior can be turned on or off temporarily using the macros with-trim-on and
with-trim-off respectively. The parser option :auto-trim can be used to enable
or disable the automatic handling of whitespace and comments. Use the parser
option :blk-cmt-delim to specify the begin and end delimiters for block
comments. The parser option :line-cmt-start can be used to specify the
line-comment marker. By default, these options are set to the Java/C++ block and
line comment markers respectively. You can alter the whitespace recognizer by
setting the :ws-regex parser option; by default it is set to #"\s+".
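
These options are passed as keyword arguments to parse, the same way the XML
example above sets :blk-cmt-delim and :line-cmt-start. A minimal sketch
(my-parser and the delimiter values below are placeholders, not part of the
library):

(parse my-parser input
       :auto-trim true                 ; keep automatic trimming on
       :blk-cmt-delim ["(*" "*)"]      ; hypothetical Pascal-style block comments
       :line-cmt-start "--"            ; hypothetical line-comment marker
       :ws-regex #"[ \t\r\n]+")        ; custom whitespace recognizer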

Alternatively, you can turn off auto-handling of whitespace and comments and use
the lexeme function, which trims whitespace/comments after applying the parse
function passed as its argument.

Also see the no-trim and no-trim-nl functions.

Primitive Parse Functions

Parse-EZ provides a number of primitive parse functions such as: chr, chr-in,
string, string-in, word, word-in, sq-str, dq-str, any-string, regex, read-to,
skip-over, read-re, read-to-re, skip-over-re, read-n, read-ch, read-ch-in-set,
etc. See API Documentation.

Let us try some of the built-in primitive parse functions:
user> (use 'protoflex.parse)
nil
user> (parse integer "12")
12
user> (parse decimal "12.5")
12.5
user> (parse #(chr \a) "a")
\a
user> (parse #(chr-in "abc") "b")
\b
user> (parse #(string-in ["abc" "def"]) "abc")
"abc"
user> (parse #(string-in ["abc" "def"]) "abcx")
Parse Error: Extraneous text at line 1, col 4
[Thrown class java.lang.Exception]

Note the parse error for the last parse call. By default, the parse function
parses to the end of the input text. Even though the first three characters of
the input text are recognized as valid input, a parse error is generated because
the input cursor would not be at the end of the input text after recognizing
"abc".

The parser option :eof can be set to false to allow recognition of partial input:
user> (parse #(string-in ["abc" "def"]) "abcx" :eof false)
"abc"
user>

You can start parsing by looking for some marker patterns using the read-to,
read-to-re, skip-over, and skip-over-re functions.
user> (parse #(do (skip-over ">>") (number)) "ignore upto this>> 456.7")
456.7

Parse Combinators

Parse combinators in Parse-EZ are higher-order functions that take other parse
functions as input arguments and combine/apply them in different ways to
implement new parse functionality. Parse-EZ provides parse combinators such as:
opt, attempt, any, series, multi*, multi+, between, look-ahead, lexeme, expect,
etc. See API Documentation.

Let us try some of the built-in parse combinators:
user> (parse #(opt integer) "abc" :eof false)
nil
user> (parse #(opt integer) "12")
12
user> (parse #(any integer decimal) "12")
12
user> (parse #(any integer decimal) "12.3")
12.3
user> (parse #(series integer decimal integer) "3 4.2 6")
[3 4.2 6]
user> (parse #(multi* integer) "1 2 3 4")
[1 2 3 4]
user> (parse #(multi* (fn [] (string-in ["abc" "def"]))) "abcabcdefabc abcdef")
["abc" "abc" "def" "abc" "abc" "def"]
user>

You can create your own parse functions on top of the primitive parse functions
and/or parse combinators provided by Parse-EZ.
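
For example, here is a small hypothetical parse function (not part of the
library or its examples) that combines the primitives and combinators shown
above to read a name=value setting, where the value may be a quoted string or
an integer:

(defn key-val
  "Reads a setting of the form name=value, where value is a
   single-quoted string, a double-quoted string, or an integer."
  []
  (let [k (regex #"[A-Za-z_][A-Za-z0-9_]*")   ; the setting name
        _ (chr \=)                            ; the '=' separator
        v (any sq-str dq-str integer)]        ; quoted string or integer value
    [k v]))

;; (parse key-val "width = 80") should yield ["width" 80], since the default
;; auto-trimming takes care of the whitespace around '='.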

Committing to a Particular Parse Branch

Version 0.4.0 added support for committing to a particular parse branch via the
new parse combinators commit and commit-on. These functions make the parser
commit to the current parse branch, so that subsequent parse failures in that
branch are reported as parse errors rather than causing the parser to try other
alternatives at higher levels.

Nesting Parse Combinators Using Macros

Version 0.3.0 of Parse-EZ adds macro versions of the parse combinator functions
to make it easy to nest calls to parse combinators without having to write
nested anonymous functions using the "(fn [] ...)" syntax (note that Clojure
does not allow nesting of anonymous functions of the "#(...)" form). Whereas the
existing parse combinators take parse functions as arguments, actually perform
the parsing, and return the parse results, the newly added macros take parse
expressions as arguments and return parse functions (to be passed to other parse
combinators). These macros are named the same as the corresponding parse
combinators but with an underscore ("_") suffix; for example, the macro version
of "any" is named "any_".
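
As a sketch of the difference (based on the naming convention above and the csv
example earlier; the exact combinator mix is illustrative):

;; With the function combinators, nesting requires explicit (fn [] ...) wrappers:
(multi* (fn [] (any integer sq-str)))

;; With the macro versions, the inner combinator call itself yields a parse
;; function, so the wrapper disappears:
(multi* (any_ (integer) (sq-str)))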

Error Handling

Parse errors are handled in Parse-EZ using exceptions. The default error
messages generated by Parse-EZ include line and column number information and,
in some cases, what is expected at that location. However, you can provide your
own custom error messages by using the expect parse combinator.
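
Since parse errors are thrown as exceptions (java.lang.Exception, as seen in the
string-in example above), a caller can handle them with an ordinary try/catch; a
minimal sketch:

(try
  (parse #(string-in ["abc" "def"]) "abcx")
  (catch Exception e
    ;; e.g. "Parse Error: Extraneous text at line 1, col 4"
    (println (.getMessage e))))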

Expressions

Parse-EZ includes a customizable expression parser, expr, for parsing
expressions in infix notation, and an expression evaluator function, eval-expr,
to evaluate infix expressions. You can customize the operators, their
precedences, and their associativity using the :operators option to the parse
function. For evaluating expressions, you can optionally specify the functions
to invoke for each operator using the :op-fn-map option.
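
A minimal sketch, assuming expr and eval-expr are used like the other parse
functions in this document (i.e., passed to parse); check the API documentation
for the exact return shape of expr:

(parse expr "1 + 2 * 3")        ; parses the infix expression into a parse tree
(parse eval-expr "1 + 2 * 3")   ; evaluates it; with the usual precedence this is 7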

Parser State

The parser state consists of the input cursor and various parser options
(specified or derived), such as those affecting whitespace and comment parsing,
word recognition, expression parsing, etc. The parser options can be changed at
any time in your own parse functions using set-opt.
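
For instance, a hypothetical parse function might adjust the whitespace
recognizer mid-parse (this sketch assumes set-opt takes an option keyword and a
value; see the API documentation for the exact signature):

(defn comma-separated-ints
  "Hypothetical: treat commas as whitespace so that auto-trimming
   skips them, then read all the integers."
  []
  (set-opt :ws-regex #"[\s,]+")
  (multi* integer))

;; (parse comma-separated-ints "1, 2, 3") would then yield [1 2 3].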

Note that most of the parse functions affect the parser state (e.g., the input
cursor) and hence are not pure functions. The side effects could be avoided by
making the parser state an explicit parameter to all the parse functions and
returning the changed parser state along with the parse value from each of them.
However, the result would be a significantly less programmer-friendly API. We
made a design decision to keep the parse functions simple and easy to use rather
than to fanatically keep them "pure".

Relation to Parsec

Parsec is a popular parser combinator library written in Haskell. While Parse-EZ
makes use of some of its ideas, it is not a port of Parsec to Clojure.

License

Copyright (C) 2012 Protoflex Software

Distributed under the Eclipse Public License, the same as Clojure.
