├── .gitignore
├── LICENSE
├── README.md
├── aerie
│   ├── __init__.py
│   ├── nwa.py
│   ├── plexer.py
│   └── sregex.py
├── samples
│   └── clang.py
└── setup.py
--------------------------------------------------------------------------------
/.gitignore:
--------------------------------------------------------------------------------
.#*
*~
*#
*.pyc
*.pyo
build/
--------------------------------------------------------------------------------
/LICENSE:
--------------------------------------------------------------------------------
Copyright (c) 2016, Paul Khuong.
All rights reserved.

Redistribution and use in source and binary forms, with or without
modification, are permitted provided that the following conditions are
met:

1. Redistributions of source code must retain the above copyright
notice, this list of conditions and the following disclaimer.

2. Redistributions in binary form must reproduce the above copyright
notice, this list of conditions and the following disclaimer in the
documentation and/or other materials provided with the distribution.

THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS
"AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT
LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR
A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT
HOLDER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL,
SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT
LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE,
DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY
THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT
(INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
Aerie: a regex-like engine for nested word grammars
===================================================

Aerie is a library to match against semi-structured input with nested
word automata (NWAs). NWAs are a generalisation of finite state
automata that operate on streams of values *and* nested streams.
Unlike full context-free languages, NWAs assume that the input is
pre-tokenised as values and open/close brackets (nested stream
markers). In return for this (often irrelevant) restriction, we get
simple operations comparable to those on regular languages. In
particular, we can match a given NWA against a stream *without*
backtracking, in time linear in the size of the stream.

Logically, Aerie is composed of two independent components: the paired
lexer, `aerie.plexer`, and the structured streaming regex matcher,
`aerie.sregex`.

The plexer takes an input string and returns a nest: an array of
(non-array) values and child nest arrays. In theory, we could do this
with any iterable, but repeated matches against a stream are
annoying. The default dispatch table for `aerie.plex` matches
parentheses, and single- and double-quoted strings. More on that
later.
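
For a taste of the representation, here's roughly what the default
table produces (a sketch: the exact output depends on the dispatch
table, and delimiter characters are smooshed into adjacent string
values):

    import aerie

    aerie.plex('foo (bar baz)')   # => ['foo', ['(bar', 'baz)']]
    aerie.plex('say "hi there"')  # => ['say', ['"hi there"']]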

The structured regex (sregex) matcher works by taking structured
regular expressions, converting them to (nested) NFAs, and executing
the NFAs without backtracking, with a set-of-states representation.
Sregexen are structured because they are first-class Python objects,
not just line noise in a string, and because they match structured
(nested) streams of values. Any sregex `Pattern` object can be
matched directly against an iterable.

How to write a plexer
---------------------

Custom plexers are defined by their dispatch function. Dispatch
functions accept the current state and a ViewString (a copy-free
string slice), and return an action.

An action is a pair of:
- a value;
- a stack action.

The value is appended to the topmost stream of items. A value of
`None` is not added, and a list represents a sequence of values to
adjoin to the end of the topmost stream. As a special case for
incremental lexing of strings, consecutive strings are smooshed
together before entering the stream of items.

The stack action describes what should happen to the *stack* of plexer
states. We need to capture nesting and paired delimiters, but wish to
restrict complexity, so plexers are simple deterministic pushdown
automata. Typically, each state in the stack describes the closing
delimiter we're expecting. A falsish action leaves the state stack
alone. A negative integer action pops off the corresponding number of
states from the stack (e.g., `-2` means "pop off the two topmost
states"). Any other action is pushed onto the state stack.

When both a value and a stack action are returned, the value is
applied before popping and after pushing.

The hardest part about writing a lexer in Python is probably the
immutable strings. Aerie avoids quadratic runtime with the ViewString
class: ViewStrings convert strings to byte buffers and traverse the
buffers with memoryview wrappers. Plexers will mostly interact with
ViewStrings via the `match` and `match_groups` methods.
`ViewString.match` accepts a regex and checks whether the regex
matches a prefix of the remaining input, starting at the ViewString's
current position. If it does, the method consumes that prefix and
returns it; otherwise, the method returns `None`. The `match_groups`
method is the same, except that it also returns the dictionary of
captures as a second value (or `None` on match failure).

There's a lot of redundancy in plexer dispatch functions. We'll
probably add more support tools once we have a better grasp of common
patterns.

Also, there's no reason we can't plex arbitrary iterable streams...
I've just been too lazy to write a buffering iterator wrapper and
port matchers to use sregex instead of re.

Building sregex
---------------

`aerie.sregex` describes patterns with actual Python object trees.
`Seq` objects represent a sequence of patterns: `Seq(p1, p2)` builds
an sregex pattern that must first match `p1` and then `p2`. `Alt` is
for alternation: `Alt(p1, p2, ...)` must match either `p1` or `p2`
(or `p3`, ...). `Plus(p)` matches `p` at least once. `Star(p)`
matches `p` any number of times (including none). `Maybe(p)` matches
`p` zero times or once.

Everything else matches individual items in the input (nested) stream
and builds on top of `Function`: `Function(fun)` accepts an item `x`
if `fun(x)` returns a dictionary of groups (otherwise, we expect
`None` to denote mismatches).
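
For instance, a hypothetical matcher for integer items (the name and
the `Number` group are invented for illustration) is a one-liner:

    from aerie.sregex import Function

    # Capture any integer item under 'Number'; anything else mismatches.
    match_int = Function(lambda x: {'Number': x} if isinstance(x, int) else None)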

The built-in `Any` pattern, for example, simply matches any single
item, and is built as

    class Any(Function):
        def __init__(self):
            super().__init__(lambda item: ())

(an empty tuple counts as "matched, no capture groups"). The
`Literal` class, which tests items for equality, is equally simple:

    class Literal(Function):
        def __init__(self, literal):
            super().__init__(lambda item: () if item == literal else None)

The `Nest` class is the one thing that separates sregexen from normal
regular expressions. `Nest(p1, p2, ...)` matches an item if that item
is an iterable that matches `Seq(p1, p2, ...)`.

The last built-in atomic matcher is `Regex`. `Regex(re)` accepts an
item `x` if it is a string that matches `re`. If so, it returns the
regular expression match's `groupdict()`.

The object structure is nice at scale, but annoying for small
matchers. Classes that accept patterns (`Seq`, `Alt`, `Plus`, `Star`,
`Maybe`, and `Nest`) also accept a shorthand that describes sequences
as lists. Each element in such a list describes a pattern in a `Seq`
object:
- `Pattern` objects stand for themselves;
- lists are themselves `Seq` objects (sequences do not nest
  semantically, i.e., `[a, [b, c], d]` is equivalent to
  `[a, b, c, d]`);
- strings are `Regex` patterns;
- callables are `Function` patterns;
- anything else is a `Literal` pattern.

For `Alt`, each argument can itself be desugared.

At any moment, a `Pattern` object can be matched with
`pattern.match(iterable)`. The pattern will be converted to a
(nondeterministic) nested word automaton on demand, and the NWA
executed automatically. `aerie.sregex.convert(*patterns)` will build
a pattern from the shorthand (the argument tuple is converted to a
list), and that pattern can then be `match`ed. `aerie.sregex.compile`
will `convert` an sregex, convert the result to an NWA, and wrap the
NWA in an `sregex.Matcher` object. The only method on that object is
`Matcher.match(iterable)`. Finally, for convenience,
`aerie.sregex.match(pattern, values)` simply calls
`pattern.match(values)`.
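
Putting the pieces together, a quick (hypothetical) end-to-end match,
with the group names and input invented for illustration:

    import aerie

    # '(x' and 'y)' reflect how the default plexer smooshes delimiters
    # into adjacent strings.
    matcher = aerie.compile('let', aerie.Nest(r'[(](?P<Head>\w+)', r'(?P<Tail>\w+)[)]'))
    result = matcher.match(aerie.plex('let (x y)'))
    if result is not None:
        print(result['Head'], result['Tail'])  # should print: x y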

But I thought CFLs were hard!
-----------------------------

In the general case, context-free languages are hard (as hard as
matrix multiplication). However, not all context-free languages are
hard; as a trivial example, regular languages are also context-free.
Classical parser theory explored restrictions like `LL`, `LR` or
`LALR`; such parsers are honestly not that great to work with,
especially when the generator throws a dreaded shift/reduce conflict
error.

Nested words seem like a much better fit for the kind of
(semi-)structured languages that programmers often encounter, and
certainly jibe better with my intuition of what should be hard. I
don't have trouble matching parentheses in linear time (and
worst-case linear space, granted).

The difference between nested word automata and finite state automata
is that certain input symbols are marked as (mandatory) push stack /
pop stack points. This gives us a form of recursion that we need for
a lot of formal languages. However, since the recursion structure is
fixed ahead of time, we don't hit the same problem as full-blown
context-free parsing.

This marking of push / pop points seems like a cop-out at first: we're
just pushing the non-regularity somewhere else. The trick is to
notice that, if we *only* want to lex these special symbols, we should
be ok with a deterministic pushdown automaton (equivalently, an
`LL(1)` parser). That's certainly true for common use cases like
quotes, brackets, or XML tags.

Now that we have these points, we can treat the input stream as a
stream of values that are either symbols or nested streams. The
nesting is guaranteed to be well behaved and never spill out of its
one slot in the stream. In other words, the way we accept a nested
stream can never affect the rest of the match, because we can only
accept exactly one nested value at a time. This simplification lets
us apply the classic set-of-states approach for executing NFAs (it
also means we can compare NWAs for equality, etc., with Brzozowski
derivatives without too much memoisation cleverness).

The first step is to convert sregex Patterns to NFAs where some of
the states have arbitrary logic *that only looks at the current input
value*; that logic may include recursively matching that value as a
nested stream of values. The nice part is that whatever a state does
to determine whether it accepts or rejects the current value is
irrelevant to the NFA machinery itself. We can reuse textbook
conversion algorithms.

The conversion from structured regex to NFA is only slightly
complicated by the fact that we don't want epsilon transitions:
epsilon transitions must be propagated to the predecessor state(s).
The result of converting an sregex pattern to NFA is thus a set of
successor states, not just one state with potential epsilon
transitions. The reason will become clear when we get to executing
nondeterministic nested word automata (NWAs).

The only non-obvious trick is that we must close a loop for `Plus`
patterns: the continuation for the repeated subpattern is the union
of the `Plus` pattern itself (to loop back) and of the `Plus`'s own
continuation. We achieve that by adding a proxy object to the
continuation and backpatching it in a later pass.

Finally, when it comes to execution, we avoid backtracking and
memoisation in favour of simply keeping track of all active states at
once and advancing them in lockstep. An NWA has all its states
numbered in an array, and each state lists its successors (on
success) by their index. We can associate a dict of match groups with
each active state, and execute active states one after the other. If
a state fails, there is nothing to do. If it succeeds, we mark its
successors as active in the *next* iteration (and bring along an
updated dict of match groups); we know that our NFAs don't contain
epsilon transitions, so we always want to execute successors in the
next iteration.

Matching is now trivially linear-time, without any backtracking,
memoisation or memory allocation (past the match group
dictionaries)... and it can even work on arbitrary iterable objects
exactly as well as on lists and strings.
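
To make the execution strategy concrete, here is a minimal sketch of
that lockstep loop, ignoring match groups (the layout mirrors, but
simplifies, `aerie/nwa.py`: states map to `(match_fn, successors)`
pairs, and index 0 is the accept state, whose `match_fn` always
returns `None`):

    def run(states, initial, values):
        active = set(initial)
        for value in values:
            next_active = set()
            for index in active:
                match_fn, successors = states[index]
                # An empty group dict/tuple still counts as a match;
                # only None means "reject".
                if match_fn(value) is not None:
                    next_active.update(successors)
            active = next_active
        return 0 in active  # did we end on the accept state?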

--------------------------------------------------------------------------------
/aerie/__init__.py:
--------------------------------------------------------------------------------
from .sregex import compile, convert, match, Matcher, Pattern, Empty, Any, Nest, Literal, Regex, Function, Maybe, Alt, Star, Plus, Seq
from .plexer import Dispatch, plex
from .nwa import NWA

__all__ = []
--------------------------------------------------------------------------------
/aerie/nwa.py:
--------------------------------------------------------------------------------
# Nested word automaton matcher based on the set-of-states trick for NFAs.


# Global registry for the states of the NWA currently being built
# (see _nwaify).
_states = None


def _flat_tuples(values):
    flat = []
    seen = set()

    def push(x):
        if id(x) in seen:
            return
        seen.add(id(x))
        flat.append(x)

    for k in values:
        assert not isinstance(k, tuple)
        if isinstance(k, NWAProxy):
            for x in k.flatten():
                push(x)
        else:
            push(k)
    return tuple(flat)


class NWAProxy(object):
    def __init__(self):
        self.actual = ()
        self.flat = None

    def flatten(self):
        if self.flat is not None:
            return self.flat

        self.flat = _flat_tuples(self.actual)
        return self.flat


class NWAState(object):
    def __init__(self, cont):
        assert isinstance(cont, tuple)
        self.index = len(_states)
        self.cont = cont
        _states.append(self)

    def __repr__(self):
        return "NWAState{%s %d -> %s}" % (type(self), self.index, self.flat_cont)

    def flatten(self):
        self.flat_cont = [x.index for x in _flat_tuples(self.cont)]
        self.cont = None


class NWAStart(NWAState):
    def __init__(self, cont):
        super().__init__(cont)


class NWAAccept(NWAState):
    def __init__(self):
        super().__init__(())

    def match(self, item):
        return None


class NWAFunction(NWAState):
    def __init__(self, function, cont):
        super().__init__(cont)
        self.function = function

    def match(self, item):
        return self.function(item)


def _nwaify(pattern):
    global _states

    old_states = _states
    _states = []

    accept = (NWAAccept(),)
    # The start state registers itself in _states last; NWA.match
    # relies on that (and on the accept state sitting at index 0).
    NWAStart(pattern.nwaify(accept))
    for x in _states:
        x.flatten()
    ret = _states

    _states = old_states
    return ret


def _reverse_pairs_to_list(pair):
    stack = []
    while pair != ():
        assert len(pair) == 2
        stack.append(pair[0])
        pair = pair[1]
    return stack


def _flatten(pair):
    if pair is None:
        return None

    stack = _reverse_pairs_to_list(pair)
    ret = dict()

    def force(v):
        if isinstance(v, Result):
            return v.force()
        return v

    def push(k, v):
        if isinstance(v, list):
            v = [force(x) for x in v]
            if ret.get(k) is None:
                ret[k] = v
            else:
                ret[k].extend(v)
            return

        v = force(v)
        if isinstance(k, str) and k.endswith('List'):
            if ret.get(k) is None:
                ret[k] = [v]
            else:
                ret[k].append(v)
        elif v is not None:
            ret[k] = v

    for tos in reversed(stack):
        tos = force(tos)
        if hasattr(tos, 'groupdict'):
            for k, v in tos.groupdict().items():
                push(k, v)
        elif isinstance(tos, dict):
            for k, v in tos.items():
                push(k, v)
        elif isinstance(tos, tuple):
            assert len(tos) in (0, 2)
            if tos:
                push(tos[0], tos[1])
        else:
            for k, v in tos:
                push(k, v)
    return ret


def _splat(dst, indices, groups, active):
    # Activate each successor state that isn't already active, and
    # associate it with the given match groups.
    for index in indices:
        if dst[index] is None:
            dst[index] = groups
            active.append(index)


class Result(object):
    def __init__(self, values):
        assert values is not None
        self.values = values
        self.evaluated = None

    def force(self):
        if self.evaluated is not None:
            return self.evaluated
        self.evaluated = _flatten(self.values)
        return self.evaluated

    def __getitem__(self, key):
        return self.force()[key]


def _wrap(x):
    return None if x is None else Result(x)


class NWA(object):
    def __init__(self, sregex):
        self.states = _nwaify(sregex)

    def __repr__(self):
        return "NWA(%s)" % self.states

    def match(self, values, anchored=True):
        # NB: the anchored flag is accepted for future use but not
        # consulted yet; we always track the last acceptance point.
        if not hasattr(values, '__iter__'):
            return None

        last_accept = None
        # active is the list of non-None slots in groups.
        old_groups = [None] * len(self.states)
        old_active = []
        groups = [None] * len(self.states)
        active = []
        # The start state is last in the array; the accept state is
        # at index 0.
        _splat(groups, self.states[-1].flat_cont, (), active)

        for value in values:
            if groups[0] is not None:
                last_accept = groups[0]

            if len(active) == 0:
                return _wrap(last_accept)

            new_groups = old_groups
            new_active = old_active

            new_active.clear()
            for index in active:
                state = self.states[index]
                group = groups[index]
                groups[index] = None

                assert group is not None
                ret = state.match(value)
                if ret is None:
                    continue
                _splat(new_groups, state.flat_cont, (ret, group), new_active)

            old_groups = groups
            old_active = active
            groups = new_groups
            active = new_active

        if groups[0] is None:
            return _wrap(last_accept)
        return _wrap(groups[0])
--------------------------------------------------------------------------------
/aerie/plexer.py:
--------------------------------------------------------------------------------
# Paired lexer. A lexer that only looks for pairs of delimiters.
# Input is a string and, optionally, a lexer and an initial state for
# the lexer.
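#
# For example, with the default dispatch table below (output shape is
# approximate; delimiters are smooshed into adjacent strings):
#
#   plex('foo (bar baz)')  =>  ['foo', ['(bar', 'baz)']]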


import re


class View(object):
    def __init__(self, string):
        self.string = string
        self.index = 0
        self.length = len(string)

    def has_data(self):
        return self.index < self.length


def _build_re(mappings):
    # Merge all the dispatch regexes into a single alternation, with a
    # synthetic group name per entry to map matches back to handlers.
    entries = []
    for i, (regex, handler) in enumerate(mappings.items()):
        if handler:
            entries.append(("aeriePlexerDispatchHandler%i" % i, regex, handler))
    regex = re.compile("|".join(["(?P<%s>%s)" % (name, regex) for name, regex, _ in entries]))
    handlers = [None] * (regex.groups + 1)
    for name, _, handler in entries:
        handlers[regex.groupindex[name]] = handler

    return regex, tuple(handlers)


class Dispatch(object):
    def __init__(self, name, mappings):
        self.name = name
        self.re, self.handlers = _build_re(mappings)
        self.mappings = mappings

    def __repr__(self):
        return "plexer.Dispatch(%s, %s)" % (self.name, self.mappings)

    def extend(self, name, mappings):
        copy = self.mappings.copy()
        copy.update(mappings)
        return Dispatch(name, copy)

    def dispatch(self, view):
        match = self.re.search(view.string, view.index)
        if match is None:
            prefix = view.string[view.index:]
            view.index = len(view.string)
            return (prefix, None, None)

        prefix_end = match.start()  # view.string[view.index:prefix_end] has normal characters
        surrounding_group_id = match.lastindex
        handler = self.handlers[surrounding_group_id]
        assert handler
        insert, action = handler(match)

        prefix = None
        if prefix_end != view.index:
            prefix = view.string[view.index : prefix_end]
        view.index = match.end()
        return (prefix, insert, action)


# Take an array of strings/characters and other values. Smoosh consecutive strings.
def _flatten(values):
    ret = []
    string_acc = []

    def flush():
        if string_acc:
            if len(string_acc) > 1:
                ret.append("".join(string_acc))
            else:
                ret.append(string_acc[0])
            string_acc.clear()

    for x in values:
        if isinstance(x, str):
            string_acc.append(x)
        else:
            flush()
            if isinstance(x, list):
                ret.extend(x)
            else:
                ret.append(x)
    flush()
    return ret


default_dispatch = Dispatch("default, no string",
                            { r'[(]' : lambda _: ('(', paren_dispatch),
                              r'["]' : lambda _: ('"', double_quote_dispatch),
                              r"[']" : lambda _: ("'", single_quote_dispatch),
                              r'\s+' : lambda _: ([], None) })

paren_dispatch = Dispatch("default, in parenthesis",
                          { r'[)]' : lambda _: (')', -1),
                            r'[(]' : lambda _: ('(', paren_dispatch),
                            r'["]' : lambda _: ('"', double_quote_dispatch),
                            r"[']" : lambda _: ("'", single_quote_dispatch),
                            r'\s+' : lambda _: ([], None) })

double_quote_dispatch = Dispatch("default, double_quote",
                                 { r'\\(?P<escaped>.)' : lambda m: (m.groupdict()['escaped'], None),
                                   r'"' : lambda _: ('"', -1) })

single_quote_dispatch = Dispatch("default, single_quote", { r"'" : lambda _: ("'", -1) })


def plex(string, state = default_dispatch, robust = False):
    view = View(string)
    states = [ state ]
    accumulators = [[]]

    while view.has_data():
        prefix, value, delta = states[-1].dispatch(view)

        def push(x = value):
            if x is not None:
                accumulators[-1].append(x)

        push(prefix)
        if isinstance(delta, int) and delta < 0:
            push()
            for i in range(-delta):
                states.pop()
                tos = _flatten(accumulators.pop())
                accumulators[-1].append([tos])
            continue

        if isinstance(delta, list) or isinstance(delta, tuple):
            # Push one fresh accumulator per new state: distinct lists,
            # one for each state, so the two stacks stay in sync.
            states.extend(delta)
            accumulators.extend([] for _ in delta)
        elif delta:
            states.append(delta)
            accumulators.append([])
        push()

    assert len(states) == len(accumulators)
    while robust and len(states) > 1:
        states.pop()
        tos = _flatten(accumulators.pop())
        accumulators[-1].append([tos])

    if len(states) != 1:
        print('states: %i for %s' % (len(states), string))
    assert len(states) == 1
    return _flatten(accumulators.pop())
--------------------------------------------------------------------------------
/aerie/sregex.py:
--------------------------------------------------------------------------------
# Structured regex-like notation for nested word automata


import re
from .nwa import *


class Pattern(object):
    def __init__(self):
        self.nwa = None

    def match(self, values, **kwargs):
        return self.build().match(values, **kwargs)

    def build(self):
        if self.nwa is None:
            self.nwa = NWA(self)

        return self.nwa

    def __add__(self, other):
        return Seq(self, other)

    def __or__(self, other):
        return Alt(self, other)


class Empty(Pattern):
    def nwaify(self, cont):
        return cont


class Function(Pattern):
    def __init__(self, function):
        super().__init__()
        self.function = function

    def nwaify(self, cont):
        return (NWAFunction(self.function, cont),)


# I don't trust Python implementations to have safe-for-space
# closures.
def _make_regex_matcher(regex):
    def match(item):
        if not isinstance(item, str):
            return None
        return re.fullmatch(regex, item)

    def match_no_group(item):
        if not isinstance(item, str):
            return None
        return () if re.fullmatch(regex, item) else None

    return match_no_group if regex.groups == 0 else match


class Regex(Function):
    def __init__(self, regex):
        if isinstance(regex, str):
            regex = re.compile(regex)
        super().__init__(_make_regex_matcher(regex))


class Any(Function):
    def __init__(self):
        super().__init__(lambda item: ())


class Literal(Function):
    def __init__(self, literal):
        super().__init__(lambda item: () if item == literal else None)


def _make_nwa_matcher(nwa):
    def match(item):
        return nwa.match(item)
    return match


class Nest(Function):
    def __init__(self, *nest):
        super().__init__(_make_nwa_matcher(compile(*nest)))


class Plus(Pattern):
    def __init__(self, *subpattern):
        super().__init__()
        self.subpattern = convert(*subpattern)

    def nwaify(self, cont):
        # Close the loop: the subpattern's continuation is the Plus
        # pattern itself (via a proxy backpatched below) plus the
        # Plus's own continuation.
        proxy = NWAProxy()
        ret = self.subpattern.nwaify((proxy,) + cont)
        proxy.actual = ret
        return ret


class Alt(Pattern):
    def __init__(self, *options):
        super().__init__()
        self.options = [convert(x) for x in options]

    def nwaify(self, cont):
        ret = []
        for x in self.options:
            ret += list(x.nwaify(cont))
        return tuple(ret)


class Maybe(Alt):
    def __init__(self, *subpattern):
        super().__init__(convert(*subpattern), Empty())


class Star(Maybe):
    def __init__(self, *subpattern):
        super().__init__(Plus(*subpattern))


class Seq(Pattern):
    def __init__(self, *patterns):
        super().__init__()
        self.patterns = _desugar(list(patterns))

    def nwaify(self, cont):
        for pattern in reversed(self.patterns):
            cont = pattern.nwaify(cont)
        return cont


def _desugar(pattern):
    if not isinstance(pattern, list):
        return pattern

    ret = []
    for x in pattern:
        if isinstance(x, Pattern):
            ret.append(x)
        elif isinstance(x, list):
            ret.append(Seq(*x))
        elif isinstance(x, str):
            ret.append(Regex(x))
        elif callable(x):
            ret.append(Function(x))
        else:
            ret.append(Literal(x))
    return ret


class Matcher(object):
    def __init__(self, nwa):
        self.nwa = nwa

    def match(self, values, **kwargs):
        return self.nwa.match(values, **kwargs)


def convert(*patterns):
    if len(patterns) == 1 and isinstance(patterns[0], Pattern):
        return patterns[0]
    return Seq(*patterns)


def compile(*patterns):
    return Matcher(convert(*patterns).build())


def match(pattern, values, **kwargs):
    return pattern.match(values, **kwargs)
--------------------------------------------------------------------------------
/samples/clang.py:
--------------------------------------------------------------------------------
#!/usr/bin/env python3


import re

import aerie
from aerie.sregex import *


black = (0, 30)
red = (0, 31)
green = (0, 32)
yellow = (0, 33)
blue = (0, 34)
magenta = (0, 35)
cyan = (0, 36)
white = (0, 37)

bright_black = (0, 1, 30)
bright_red = (0, 1, 31)
bright_green = (0, 1, 32)
bright_yellow = (0, 1, 33)
bright_blue = (0, 1, 34)
bright_magenta = (0, 1, 35)
bright_cyan = (0, 1, 36)
bright_white = (0, 1, 37)


def colorized(state, view):
    if state == 'esc':
        if view.match(br'\x1b\[0m'):
            return None, -1
    if state == "'":
        if view.match(b"'"):
            return "'", -1
    elif state == '"':
        r = view.match(br'\\.')
        if r:
            return r[1], None
        if view.match(br'"'):
            return '"', -1
    else:
        if view.match(b"'"):
            return "'", "'"
        if view.match(br'"'):
            return '"', '"'

    # Always be on the lookout for ANSI escape codes.  The A/B/C group
    # names match the loop below that reads them.
    r, matches = view.match_groups(br'\x1b\[(?P<A>\d+)?(;(?P<B>\d+))?(;(?P<C>\d+))?(;\d+)*m')
    if r:
        ret = tuple()
        for k in ['A', 'B', 'C']:
            if matches.get(k):
                ret += (int(matches.get(k)),)
        return ret, 'esc'
    return view.advance(1)[0], None


def plex_clang(string):
    return aerie.plex(string, colorized)


def parse_attributes(string):
    if not isinstance(string, str):
        return None
    # Validate the whole string: a sequence of space-prefixed words.
    if not re.fullmatch(r'( [A-Za-z0-9]+)*', string):
        return None
    return {'attributes' : string.split()}


def address(name):
    return Nest(yellow, r'^ 0x(?P<%s>[0-9a-fA-F]+)$' % name)


def sloc(name):
    string = r'^(?P<%s>.*)$' % name
    # todo: split the fields up
    return Nest(yellow, string)


def _type(name):
    return Nest(green, Nest(r"^'(?P<%s>.*)'$" % name))


def identifier(name):
    return Nest(bright_cyan, ' ', Nest(r"'(?P<%s>(\w|\s)+)'" % name))


source_range = Seq('^ <$', sloc('RangeBegin'), Maybe(', ', sloc('RangeEnd')), '^> $')
source_loc = sloc('SourceLoc')
# NB: the 'NodeType' and 'Kind' group names below are best-guess
# reconstructions; nothing else in this sample reads them by name.
node_type = Nest(bright_magenta, r'^(?P<NodeType>[A-Za-z_0-9]+)$')
attributes = Nest(cyan, parse_attributes)
kind = Nest(bright_green, r'^(?P<Kind>(\w|\d)*)')


pattern = aerie.compile(node_type, address("Address"), source_range, _type("Type"), attributes, Nest(cyan), ' ',
                        kind, address("TargetAddress"), identifier('Name'), ' ', _type('TargetType'),
                        Maybe('\n'))
--------------------------------------------------------------------------------
/setup.py:
--------------------------------------------------------------------------------
#!/usr/bin/env python

from distutils.core import setup

setup(name='aerie',
      version='0.0.1',
      description='Matching library for nested word regular expressions',
      author='Paul Khuong',
      author_email='pvk@pvk.ca',
      url='https://www.pvk.ca',
      packages=['aerie'],
      )
--------------------------------------------------------------------------------