├── .gitignore
├── LICENSE
├── README.md
├── aerie
│   ├── __init__.py
│   ├── nwa.py
│   ├── plexer.py
│   └── sregex.py
├── samples
│   └── clang.py
└── setup.py
--------------------------------------------------------------------------------
/.gitignore:
--------------------------------------------------------------------------------
.#*
*~
*#
*.pyc
*.pyo
build/
--------------------------------------------------------------------------------
/LICENSE:
--------------------------------------------------------------------------------
Copyright (c) 2016, Paul Khuong.
All rights reserved.

Redistribution and use in source and binary forms, with or without
modification, are permitted provided that the following conditions are
met:

1. Redistributions of source code must retain the above copyright
notice, this list of conditions and the following disclaimer.

2. Redistributions in binary form must reproduce the above copyright
notice, this list of conditions and the following disclaimer in the
documentation and/or other materials provided with the distribution.

THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS
"AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT
LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR
A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT
HOLDER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL,
SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT
LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE,
DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY
THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT
(INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
Aerie: a regex-like engine for nested word grammars
===================================================

Aerie is a library to match against semi-structured input with nested
word automata (NWAs). NWAs are a generalisation of finite state
automata that operate on streams of values *and* nested streams.
Unlike full context-free languages, NWAs assume that the input is
pre-tokenised as values and open/close brackets (nested stream
markers). In return for this (often irrelevant) restriction, we get
simple operations comparable to those on regular languages. In
particular, we can match a given NWA against a stream *without*
backtracking, in time linear in the size of the stream.

Logically, Aerie is composed of two independent components: the paired
lexer, `aerie.plexer`, and the structured streaming regex matcher,
`aerie.sregex`.

The plexer takes an input string and returns a nest: an array of
(non-array) values and child nest arrays. In theory, we could do this
with any iterable, but repeated matches against a stream are
annoying. The default dispatch table for `aerie.plex` matches
parentheses, and single- and double-quoted strings. More on that
later.
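
For a taste of the representation, here's roughly what the default
table produces (a sketch: the exact output depends on the dispatch
table, and delimiter characters are smooshed into adjacent string
values):

    import aerie

    aerie.plex('foo (bar baz)')   # => ['foo', ['(bar', 'baz)']]
    aerie.plex('say "hi there"')  # => ['say', ['"hi there"']]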

The structured regex (sregex) matcher works by taking structured
regular expressions, converting them to (nested) NFAs, and executing
the NFAs without backtracking, with a set-of-states representation.
Sregexen are structured because they are first-class Python objects,
not just line noise in a string, and because they match structured
(nested) streams of values. Any sregex `Pattern` object can be
matched directly against an iterable.

How to write a plexer
---------------------

Custom plexers are defined by their dispatch function. Dispatch
functions accept the current state and a ViewString (a copy-free
string slice), and return an action.

An action is a pair of:
- a value;
- a stack action.

The value is appended to the topmost stream of items. A value of
`None` is not added, and a list represents a sequence of values to
adjoin to the end of the topmost stream. As a special case for
incremental lexing of strings, consecutive strings are smooshed
together before entering the stream of items.

The stack action describes what should happen to the *stack* of plexer
states. We need to capture nesting and paired delimiters, but wish to
restrict complexity, so plexers are simple deterministic pushdown
automata. Typically, each state in the stack describes the closing
delimiter we're expecting. A falsish action leaves the state stack
alone. A negative integer action pops off the corresponding number of
states from the stack (e.g., `-2` means "pop off the two topmost
states"). Any other action is pushed onto the state stack.

When both a value and a stack action are returned, the value is
applied before popping and after pushing.

The hardest part about writing a lexer in Python is probably the
immutable strings. Aerie avoids quadratic runtime with the ViewString
class: ViewStrings convert strings to byte buffers and traverse the
buffers with memoryview wrappers. Plexers will mostly interact with
ViewStrings via the `match` and `match_groups` methods.
`ViewString.match` accepts a regex and checks whether the regex
matches a prefix of the remaining input, starting at the ViewString's
current position. If it does, the method consumes that prefix and
returns it; otherwise, the method returns `None`. The `match_groups`
method is the same, except that it also returns the dictionary of
captures as a second value (or `None` on match failure).

There's a lot of redundancy in plexer dispatch functions. We'll
probably add more support tools once we have a better grasp of common
patterns.

Also, there's no reason we can't plex arbitrary iterable streams...
I've just been too lazy to write a buffering iterator wrapper and
port matchers to use sregex instead of re.

Building sregex
---------------

`aerie.sregex` describes patterns with actual Python object trees.
`Seq` objects represent a sequence of patterns: `Seq(p1, p2)` builds
an sregex pattern that must first match `p1` and then `p2`. `Alt` is
for alternation: `Alt(p1, p2, ...)` must match either `p1` or `p2`
(or `p3`, ...). `Plus(p)` matches `p` at least once. `Star(p)`
matches `p` any number of times (including none). `Maybe(p)` matches
`p` zero times or once.

Everything else matches individual items in the input (nested) stream
and builds on top of `Function`: `Function(fun)` accepts an item `x`
if `fun(x)` returns a dictionary of groups (otherwise, we expect
`None` to denote mismatches).
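
For instance, a hypothetical matcher for integer items (the name and
the `Number` group are invented for illustration) is a one-liner:

    from aerie.sregex import Function

    # Capture any integer item under 'Number'; anything else mismatches.
    match_int = Function(lambda x: {'Number': x} if isinstance(x, int) else None)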

The built-in `Any` pattern, for example, simply matches any single
item, and is built as

    class Any(Function):
        def __init__(self):
            super().__init__(lambda item: ())

(an empty tuple counts as "matched, no capture groups"). The
`Literal` class, which tests items for equality, is equally simple:

    class Literal(Function):
        def __init__(self, literal):
            super().__init__(lambda item: () if item == literal else None)

The `Nest` class is the one thing that separates sregexen from normal
regular expressions. `Nest(p1, p2, ...)` matches an item if that item
is an iterable that matches `Seq(p1, p2, ...)`.

The last built-in atomic matcher is `Regex`. `Regex(re)` accepts an
item `x` if it is a string that matches `re`. If so, it returns the
regular expression match's `groupdict()`.

The object structure is nice at scale, but annoying for small
matchers. Classes that accept patterns (`Seq`, `Alt`, `Plus`, `Star`,
`Maybe`, and `Nest`) also accept a shorthand that describes sequences
as lists. Each element in such a list describes a pattern in a `Seq`
object:
- `Pattern` objects stand for themselves;
- lists are themselves `Seq` objects (sequences do not nest
  semantically, i.e., `[a, [b, c], d]` is equivalent to
  `[a, b, c, d]`);
- strings are `Regex` patterns;
- callables are `Function` patterns;
- anything else is a `Literal` pattern.

For `Alt`, each argument can itself be desugared.

At any moment, a `Pattern` object can be matched with
`pattern.match(iterable)`. The pattern will be converted to a
(nondeterministic) nested word automaton on demand, and the NWA
executed automatically. `aerie.sregex.convert(*patterns)` will build
a pattern from the shorthand (the argument tuple is converted to a
list), and that pattern can then be `match`ed. `aerie.sregex.compile`
will `convert` an sregex, convert the result to an NWA, and wrap the
NWA in an `sregex.Matcher` object. The only method on that object is
`Matcher.match(iterable)`. Finally, for convenience,
`aerie.sregex.match(pattern, values)` simply calls
`pattern.match(values)`.
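
Putting the pieces together, a quick (hypothetical) end-to-end match,
with the group names and input invented for illustration:

    import aerie

    # '(x' and 'y)' reflect how the default plexer smooshes delimiters
    # into adjacent strings.
    matcher = aerie.compile('let', aerie.Nest(r'[(](?P<Head>\w+)', r'(?P<Tail>\w+)[)]'))
    result = matcher.match(aerie.plex('let (x y)'))
    if result is not None:
        print(result['Head'], result['Tail'])  # should print: x y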

But I thought CFLs were hard!
-----------------------------

In the general case, context-free languages are hard (as hard as
matrix multiplication). However, not all context-free languages are
hard; as a trivial example, regular languages are also context-free.
Classical parser theory explored restrictions like `LL`, `LR` or
`LALR`; such parsers are honestly not that great to work with,
especially when the generator throws a dreaded shift/reduce conflict
error.

Nested words seem like a much better fit for the kind of
(semi-)structured languages that programmers often encounter, and
certainly jibe better with my intuition of what should be hard. I
don't have trouble matching parentheses in linear time (and
worst-case linear space, granted).

The difference between nested word automata and finite state automata
is that certain input symbols are marked as (mandatory) push stack /
pop stack points. This gives us a form of recursion that we need for
a lot of formal languages. However, since the recursion structure is
fixed ahead of time, we don't hit the same problem as full-blown
context-free parsing.

This marking of push / pop points seems like a cop-out at first: we're
just pushing the non-regularity somewhere else. The trick is to
notice that, if we *only* want to lex these special symbols, we should
be ok with a deterministic pushdown automaton (equivalently, an
`LL(1)` parser). That's certainly true for common use cases like
quotes, brackets, or XML tags.

Now that we have these points, we can treat the input stream as a
stream of values that are either symbols or nested streams. The
nesting is guaranteed to be well behaved and never spill out of its
one slot in the stream. In other words, the way we accept a nested
stream can never affect the rest of the match, because we can only
accept exactly one nested value at a time. This simplification lets
us apply the classic set-of-states approach for executing NFAs (it
also means we can compare NWAs for equality, etc., with Brzozowski
derivatives without too much memoisation cleverness).

The first step is to convert sregex Patterns to NFAs where some of
the states have arbitrary logic *that only looks at the current input
value*; that logic may include recursively matching that value as a
nested stream of values. The nice part is that whatever a state does
to determine whether it accepts or rejects the current value is
irrelevant to the NFA machinery itself. We can reuse textbook
conversion algorithms.

The conversion from structured regex to NFA is only slightly
complicated by the fact that we don't want epsilon transitions:
epsilon transitions must be propagated to the predecessor state(s).
The result of converting an sregex pattern to NFA is thus a set of
successor states, not just one state with potential epsilon
transitions. The reason will become clear when we get to executing
nondeterministic nested word automata (NWAs).

The only non-obvious trick is that we must close a loop for `Plus`
patterns: the continuation for the repeated subpattern is the union
of the `Plus` pattern itself (to loop back) and of the `Plus`'s own
continuation. We achieve that by adding a proxy object to the
continuation and backpatching it in a later pass.

Finally, when it comes to execution, we avoid backtracking and
memoisation in favour of simply keeping track of all active states at
once and advancing them in lockstep. An NWA has all its states
numbered in an array, and each state lists its successors (on
success) by their index. We can associate a dict of match groups with
each active state, and execute active states one after the other. If
a state fails, there is nothing to do. If it succeeds, we mark its
successors as active in the *next* iteration (and bring along an
updated dict of match groups); we know that our NFAs don't contain
epsilon transitions, so we always want to execute successors in the
next iteration.

Matching is now trivially linear-time, without any backtracking,
memoisation or memory allocation (past the match group
dictionaries)... and it can even work on arbitrary iterable objects
exactly as well as on lists and strings.
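
To make the execution strategy concrete, here is a minimal sketch of
that lockstep loop, ignoring match groups (the layout mirrors, but
simplifies, `aerie/nwa.py`: states map to `(match_fn, successors)`
pairs, and index 0 is the accept state, whose `match_fn` always
returns `None`):

    def run(states, initial, values):
        active = set(initial)
        for value in values:
            next_active = set()
            for index in active:
                match_fn, successors = states[index]
                # An empty group dict/tuple still counts as a match;
                # only None means "reject".
                if match_fn(value) is not None:
                    next_active.update(successors)
            active = next_active
        return 0 in active  # did we end on the accept state?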

--------------------------------------------------------------------------------
/aerie/__init__.py:
--------------------------------------------------------------------------------
from .sregex import compile, convert, match, Matcher, Pattern, Empty, Any, Nest, Literal, Regex, Function, Maybe, Alt, Star, Plus, Seq
from .plexer import Dispatch, plex
from .nwa import NWA

__all__ = []
--------------------------------------------------------------------------------
/aerie/nwa.py:
--------------------------------------------------------------------------------
# Nested word automaton matcher based on the set-of-states trick for NFAs.


# Global registry for the states of the NWA currently being built
# (see _nwaify).
_states = None


def _flat_tuples(values):
    flat = []
    seen = set()

    def push(x):
        if id(x) in seen:
            return
        seen.add(id(x))
        flat.append(x)

    for k in values:
        assert not isinstance(k, tuple)
        if isinstance(k, NWAProxy):
            for x in k.flatten():
                push(x)
        else:
            push(k)
    return tuple(flat)


class NWAProxy(object):
    def __init__(self):
        self.actual = ()
        self.flat = None

    def flatten(self):
        if self.flat is not None:
            return self.flat

        self.flat = _flat_tuples(self.actual)
        return self.flat


class NWAState(object):
    def __init__(self, cont):
        assert isinstance(cont, tuple)
        self.index = len(_states)
        self.cont = cont
        _states.append(self)

    def __repr__(self):
        return "NWAState{%s %d -> %s}" % (type(self), self.index, self.flat_cont)

    def flatten(self):
        self.flat_cont = [x.index for x in _flat_tuples(self.cont)]
        self.cont = None


class NWAStart(NWAState):
    def __init__(self, cont):
        super().__init__(cont)


class NWAAccept(NWAState):
    def __init__(self):
        super().__init__(())

    def match(self, item):
        return None


class NWAFunction(NWAState):
    def __init__(self, function, cont):
        super().__init__(cont)
        self.function = function

    def match(self, item):
        return self.function(item)


def _nwaify(pattern):
    global _states

    old_states = _states
    _states = []

    accept = (NWAAccept(),)
    # The start state registers itself in _states last; NWA.match
    # relies on that (and on the accept state sitting at index 0).
    NWAStart(pattern.nwaify(accept))
    for x in _states:
        x.flatten()
    ret = _states

    _states = old_states
    return ret


def _reverse_pairs_to_list(pair):
    stack = []
    while pair != ():
        assert len(pair) == 2
        stack.append(pair[0])
        pair = pair[1]
    return stack


def _flatten(pair):
    if pair is None:
        return None

    stack = _reverse_pairs_to_list(pair)
    ret = dict()

    def force(v):
        if isinstance(v, Result):
            return v.force()
        return v

    def push(k, v):
        if isinstance(v, list):
            v = [force(x) for x in v]
            if ret.get(k) is None:
                ret[k] = v
            else:
                ret[k].extend(v)
            return

        v = force(v)
        if isinstance(k, str) and k.endswith('List'):
            if ret.get(k) is None:
                ret[k] = [v]
            else:
                ret[k].append(v)
        elif v is not None:
            ret[k] = v

    for tos in reversed(stack):
        tos = force(tos)
        if hasattr(tos, 'groupdict'):
            for k, v in tos.groupdict().items():
                push(k, v)
        elif isinstance(tos, dict):
            for k, v in tos.items():
                push(k, v)
        elif isinstance(tos, tuple):
            assert len(tos) in (0, 2)
            if tos:
                push(tos[0], tos[1])
        else:
            for k, v in tos:
                push(k, v)
    return ret


def _splat(dst, indices, groups, active):
    # Activate each successor state that isn't already active, and
    # associate it with the given match groups.
    for index in indices:
        if dst[index] is None:
            dst[index] = groups
            active.append(index)


class Result(object):
    def __init__(self, values):
        assert values is not None
        self.values = values
        self.evaluated = None

    def force(self):
        if self.evaluated is not None:
            return self.evaluated
        self.evaluated = _flatten(self.values)
        return self.evaluated

    def __getitem__(self, key):
        return self.force()[key]


def _wrap(x):
    return None if x is None else Result(x)


class NWA(object):
    def __init__(self, sregex):
        self.states = _nwaify(sregex)

    def __repr__(self):
        return "NWA(%s)" % self.states

    def match(self, values, anchored=True):
        # NB: the anchored flag is accepted for future use but not
        # consulted yet; we always track the last acceptance point.
        if not hasattr(values, '__iter__'):
            return None

        last_accept = None
        # active is the list of non-None slots in groups.
        old_groups = [None] * len(self.states)
        old_active = []
        groups = [None] * len(self.states)
        active = []
        # The start state is last in the array; the accept state is
        # at index 0.
        _splat(groups, self.states[-1].flat_cont, (), active)

        for value in values:
            if groups[0] is not None:
                last_accept = groups[0]

            if len(active) == 0:
                return _wrap(last_accept)

            new_groups = old_groups
            new_active = old_active

            new_active.clear()
            for index in active:
                state = self.states[index]
                group = groups[index]
                groups[index] = None

                assert group is not None
                ret = state.match(value)
                if ret is None:
                    continue
                _splat(new_groups, state.flat_cont, (ret, group), new_active)

            old_groups = groups
            old_active = active
            groups = new_groups
            active = new_active

        if groups[0] is None:
            return _wrap(last_accept)
        return _wrap(groups[0])
--------------------------------------------------------------------------------
/aerie/plexer.py:
--------------------------------------------------------------------------------
# Paired lexer. A lexer that only looks for pairs of delimiters.
# Input is a string and, optionally, a lexer and an initial state for
# the lexer.
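#
# For example, with the default dispatch table below (output shape is
# approximate; delimiters are smooshed into adjacent strings):
#
#   plex('foo (bar baz)')  =>  ['foo', ['(bar', 'baz)']]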


import re


class View(object):
    def __init__(self, string):
        self.string = string
        self.index = 0
        self.length = len(string)

    def has_data(self):
        return self.index < self.length


def _build_re(mappings):
    # Merge all the dispatch regexes into a single alternation, with a
    # synthetic group name per entry to map matches back to handlers.
    entries = []
    for i, (regex, handler) in enumerate(mappings.items()):
        if handler:
            entries.append(("aeriePlexerDispatchHandler%i" % i, regex, handler))
    regex = re.compile("|".join(["(?P<%s>%s)" % (name, regex) for name, regex, _ in entries]))
    handlers = [None] * (regex.groups + 1)
    for name, _, handler in entries:
        handlers[regex.groupindex[name]] = handler

    return regex, tuple(handlers)


class Dispatch(object):
    def __init__(self, name, mappings):
        self.name = name
        self.re, self.handlers = _build_re(mappings)
        self.mappings = mappings

    def __repr__(self):
        return "plexer.Dispatch(%s, %s)" % (self.name, self.mappings)

    def extend(self, name, mappings):
        copy = self.mappings.copy()
        copy.update(mappings)
        return Dispatch(name, copy)

    def dispatch(self, view):
        match = self.re.search(view.string, view.index)
        if match is None:
            prefix = view.string[view.index:]
            view.index = len(view.string)
            return (prefix, None, None)

        prefix_end = match.start()  # view.string[view.index:prefix_end] has normal characters
        surrounding_group_id = match.lastindex
        handler = self.handlers[surrounding_group_id]
        assert handler
        insert, action = handler(match)

        prefix = None
        if prefix_end != view.index:
            prefix = view.string[view.index : prefix_end]
        view.index = match.end()
        return (prefix, insert, action)


# Take an array of strings/characters and other values. Smoosh consecutive strings.
def _flatten(values):
    ret = []
    string_acc = []

    def flush():
        if string_acc:
            if len(string_acc) > 1:
                ret.append("".join(string_acc))
            else:
                ret.append(string_acc[0])
            string_acc.clear()

    for x in values:
        if isinstance(x, str):
            string_acc.append(x)
        else:
            flush()
            if isinstance(x, list):
                ret.extend(x)
            else:
                ret.append(x)
    flush()
    return ret


default_dispatch = Dispatch("default, no string",
                            { r'[(]' : lambda _: ('(', paren_dispatch),
                              r'["]' : lambda _: ('"', double_quote_dispatch),
                              r"[']" : lambda _: ("'", single_quote_dispatch),
                              r'\s+' : lambda _: ([], None) })

paren_dispatch = Dispatch("default, in parenthesis",
                          { r'[)]' : lambda _: (')', -1),
                            r'[(]' : lambda _: ('(', paren_dispatch),
                            r'["]' : lambda _: ('"', double_quote_dispatch),
                            r"[']" : lambda _: ("'", single_quote_dispatch),
                            r'\s+' : lambda _: ([], None) })

double_quote_dispatch = Dispatch("default, double_quote",
                                 { r'\\(?P<escaped>.)' : lambda m: (m.groupdict()['escaped'], None),
                                   r'"' : lambda _: ('"', -1) })

single_quote_dispatch = Dispatch("default, single_quote", { r"'" : lambda _: ("'", -1) })


def plex(string, state = default_dispatch, robust = False):
    view = View(string)
    states = [ state ]
    accumulators = [[]]

    while view.has_data():
        prefix, value, delta = states[-1].dispatch(view)

        def push(x = value):
            if x is not None:
                accumulators[-1].append(x)

        push(prefix)
        if isinstance(delta, int) and delta < 0:
            push()
            for i in range(-delta):
                states.pop()
                tos = _flatten(accumulators.pop())
                accumulators[-1].append([tos])
            continue

        if isinstance(delta, list) or isinstance(delta, tuple):
            # Push one fresh accumulator per new state: distinct lists,
            # one for each state, so the two stacks stay in sync.
            states.extend(delta)
            accumulators.extend([] for _ in delta)
        elif delta:
            states.append(delta)
            accumulators.append([])
        push()

    assert len(states) == len(accumulators)
    while robust and len(states) > 1:
        states.pop()
        tos = _flatten(accumulators.pop())
        accumulators[-1].append([tos])

    if len(states) != 1:
        print('states: %i for %s' % (len(states), string))
    assert len(states) == 1
    return _flatten(accumulators.pop())
--------------------------------------------------------------------------------
/aerie/sregex.py:
--------------------------------------------------------------------------------
# Structured regex-like notation for nested word automata


import re
from .nwa import *


class Pattern(object):
    def __init__(self):
        self.nwa = None

    def match(self, values, **kwargs):
        return self.build().match(values, **kwargs)

    def build(self):
        if self.nwa is None:
            self.nwa = NWA(self)

        return self.nwa

    def __add__(self, other):
        return Seq(self, other)

    def __or__(self, other):
        return Alt(self, other)


class Empty(Pattern):
    def nwaify(self, cont):
        return cont


class Function(Pattern):
    def __init__(self, function):
        super().__init__()
        self.function = function

    def nwaify(self, cont):
        return (NWAFunction(self.function, cont),)


# I don't trust Python implementations to have safe-for-space
# closures.
def _make_regex_matcher(regex):
    def match(item):
        if not isinstance(item, str):
            return None
        return re.fullmatch(regex, item)

    def match_no_group(item):
        if not isinstance(item, str):
            return None
        return () if re.fullmatch(regex, item) else None

    return match_no_group if regex.groups == 0 else match


class Regex(Function):
    def __init__(self, regex):
        if isinstance(regex, str):
            regex = re.compile(regex)
        super().__init__(_make_regex_matcher(regex))


class Any(Function):
    def __init__(self):
        super().__init__(lambda item: ())


class Literal(Function):
    def __init__(self, literal):
        super().__init__(lambda item: () if item == literal else None)


def _make_nwa_matcher(nwa):
    def match(item):
        return nwa.match(item)
    return match


class Nest(Function):
    def __init__(self, *nest):
        super().__init__(_make_nwa_matcher(compile(*nest)))


class Plus(Pattern):
    def __init__(self, *subpattern):
        super().__init__()
        self.subpattern = convert(*subpattern)

    def nwaify(self, cont):
        # Close the loop: the subpattern's continuation is the Plus
        # pattern itself (via a proxy backpatched below) plus the
        # Plus's own continuation.
        proxy = NWAProxy()
        ret = self.subpattern.nwaify((proxy,) + cont)
        proxy.actual = ret
        return ret


class Alt(Pattern):
    def __init__(self, *options):
        super().__init__()
        self.options = [convert(x) for x in options]

    def nwaify(self, cont):
        ret = []
        for x in self.options:
            ret += list(x.nwaify(cont))
        return tuple(ret)


class Maybe(Alt):
    def __init__(self, *subpattern):
        super().__init__(convert(*subpattern), Empty())


class Star(Maybe):
    def __init__(self, *subpattern):
        super().__init__(Plus(*subpattern))


class Seq(Pattern):
    def __init__(self, *patterns):
        super().__init__()
        self.patterns = _desugar(list(patterns))

    def nwaify(self, cont):
        for pattern in reversed(self.patterns):
            cont = pattern.nwaify(cont)
        return cont


def _desugar(pattern):
    if not isinstance(pattern, list):
        return pattern

    ret = []
    for x in pattern:
        if isinstance(x, Pattern):
            ret.append(x)
        elif isinstance(x, list):
            ret.append(Seq(*x))
        elif isinstance(x, str):
            ret.append(Regex(x))
        elif callable(x):
            ret.append(Function(x))
        else:
            ret.append(Literal(x))
    return ret


class Matcher(object):
    def __init__(self, nwa):
        self.nwa = nwa

    def match(self, values, **kwargs):
        return self.nwa.match(values, **kwargs)


def convert(*patterns):
    if len(patterns) == 1 and isinstance(patterns[0], Pattern):
        return patterns[0]
    return Seq(*patterns)


def compile(*patterns):
    return Matcher(convert(*patterns).build())


def match(pattern, values, **kwargs):
    return pattern.match(values, **kwargs)
--------------------------------------------------------------------------------
/samples/clang.py:
--------------------------------------------------------------------------------
#!/usr/bin/env python3


import re

import aerie
from aerie.sregex import *


black = (0, 30)
red = (0, 31)
green = (0, 32)
yellow = (0, 33)
blue = (0, 34)
magenta = (0, 35)
cyan = (0, 36)
white = (0, 37)

bright_black = (0, 1, 30)
bright_red = (0, 1, 31)
bright_green = (0, 1, 32)
bright_yellow = (0, 1, 33)
bright_blue = (0, 1, 34)
bright_magenta = (0, 1, 35)
bright_cyan = (0, 1, 36)
bright_white = (0, 1, 37)


def colorized(state, view):
    if state == 'esc':
        if view.match(br'\x1b\[0m'):
            return None, -1
    if state == "'":
        if view.match(b"'"):
            return "'", -1
    elif state == '"':
        r = view.match(br'\\.')
        if r:
            return r[1], None
        if view.match(br'"'):
            return '"', -1
    else:
        if view.match(b"'"):
            return "'", "'"
        if view.match(br'"'):
            return '"', '"'

    # Always be on the lookout for ANSI escape codes.  The A/B/C group
    # names match the loop below that reads them.
    r, matches = view.match_groups(br'\x1b\[(?P<A>\d+)?(;(?P<B>\d+))?(;(?P<C>\d+))?(;\d+)*m')
    if r:
        ret = tuple()
        for k in ['A', 'B', 'C']:
            if matches.get(k):
                ret += (int(matches.get(k)),)
        return ret, 'esc'
    return view.advance(1)[0], None


def plex_clang(string):
    return aerie.plex(string, colorized)


def parse_attributes(string):
    if not isinstance(string, str):
        return None
    # Validate the whole string: a sequence of space-prefixed words.
    if not re.fullmatch(r'( [A-Za-z0-9]+)*', string):
        return None
    return {'attributes' : string.split()}


def address(name):
    return Nest(yellow, r'^ 0x(?P<%s>[0-9a-fA-F]+)$' % name)


def sloc(name):
    string = r'^(?P<%s>.*)$' % name
    # todo: split the fields up
    return Nest(yellow, string)


def _type(name):
    return Nest(green, Nest(r"^'(?P<%s>.*)'$" % name))


def identifier(name):
    return Nest(bright_cyan, ' ', Nest(r"'(?P<%s>(\w|\s)+)'" % name))


source_range = Seq('^ <$', sloc('RangeBegin'), Maybe(', ', sloc('RangeEnd')), '^> $')
source_loc = sloc('SourceLoc')
# NB: the 'NodeType' and 'Kind' group names below are best-guess
# reconstructions; nothing else in this sample reads them by name.
node_type = Nest(bright_magenta, r'^(?P<NodeType>[A-Za-z_0-9]+)$')
attributes = Nest(cyan, parse_attributes)
kind = Nest(bright_green, r'^(?P<Kind>(\w|\d)*)')


pattern = aerie.compile(node_type, address("Address"), source_range, _type("Type"), attributes, Nest(cyan), ' ',
                        kind, address("TargetAddress"), identifier('Name'), ' ', _type('TargetType'),
                        Maybe('\n'))
--------------------------------------------------------------------------------
/setup.py:
--------------------------------------------------------------------------------
#!/usr/bin/env python

from distutils.core import setup

setup(name='aerie',
      version='0.0.1',
      description='Matching library for nested word regular expressions',
      author='Paul Khuong',
      author_email='pvk@pvk.ca',
      url='https://www.pvk.ca',
      packages=['aerie'],
      )
--------------------------------------------------------------------------------