├── LICENSE
├── Makefile
├── README.md
├── docs
│   ├── automata.png
│   ├── determinization.png
│   ├── theory.pdf
│   └── theory.typ
├── test.sh
├── trre.1
├── trre_dft.c
└── trre_nft.c


/LICENSE:
--------------------------------------------------------------------------------
 1 | MIT License
 2 | 
 3 | Copyright (c) 2024 Konstantin Selivanov
 4 | 
 5 | Permission is hereby granted, free of charge, to any person obtaining a copy
 6 | of this software and associated documentation files (the "Software"), to deal
 7 | in the Software without restriction, including without limitation the rights
 8 | to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
 9 | copies of the Software, and to permit persons to whom the Software is
10 | furnished to do so, subject to the following conditions:
11 | 
12 | The above copyright notice and this permission notice shall be included in all
13 | copies or substantial portions of the Software.
14 | 
15 | THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
16 | IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
17 | FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
18 | AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
19 | LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
20 | OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
21 | SOFTWARE.
22 | 


--------------------------------------------------------------------------------
/Makefile:
--------------------------------------------------------------------------------
 1 | CC=cc
 2 | CFLAGS = -std=c99 -Wall -Wextra -Wpedantic -O3
 3 | #CFLAGS_DEBUG = -Wall -Wextra -Wpedantic -O0 -g
 4 | 
 5 | all: nft dft
 6 | 
 7 | nft: trre_nft.c
 8 | 	$(CC) $(CFLAGS) trre_nft.c -o trre
 9 | 
10 | dft: trre_dft.c
11 | 	$(CC) $(CFLAGS) trre_dft.c -o trre_dft
12 | 
13 | clean:
14 | 	rm -f trre trre_dft
15 | 


--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
  1 | # TRRE: transductive regular expressions
  2 | 
  3 | #### TLDR: It is an extension of the regular expressions for text editing and a `grep`-like command line tool.
  4 | 
  5 | -   [Examples](#examples)
  6 | -   [Language specification](#language-specification)
  7 | -   [Why it works](#why-it-works)
  8 | -   [Precedence](#precedence)
  9 | -   [Modes and greediness](#modes-and-greediness)
 10 | -   [Determinization](#determinization)
 11 | -   [Performance](#performance)
 12 | -   [Installation](#installation)
 13 | -   [TODO](#todo)
 14 | -   [References](#references)
 15 | 
 16 | 
 17 | ## Intro
 18 | 
 19 | ![automata](docs/automata.png)
 20 | 
 21 | Regular expressions are a great tool for searching for patterns in text, but I have always found them unnatural for text editing. The *group* logic works as a post-processor and can get complicated. Here I propose an extension to the regular expression language for pattern matching and text modification. I call it **transductive regular expressions** or simply **`trre`**.
 22 | 
 23 | It introduces the `:` symbol to define transformations. The simplest form is `a:b`, which replaces `a` with `b`. I call this a `transductive pair` or `transduction`.
 24 | 
 25 | To demonstrate the concept, I have created a command line tool **`trre`**. It feels similar to the `grep -E` command.
 26 | 
 27 | ## Examples
 28 | 
 29 | To run it, first build the program:
 30 | 
 31 | ```bash
 32 | make
 33 | ```
 34 | 
 35 | ### Basics
 36 | 
 37 | To change `cat` to `dog` we use the following expression:
 38 | 
 39 | ```bash
 40 | echo 'cat' | ./trre 'cat:dog'
 41 | ```
 42 | ```
 43 | dog
 44 | ```
 45 | 
 46 | A more cryptic token-by-token version:
 47 | 
 48 | ```bash
 49 | echo 'cat' | ./trre '(c:d)(a:o)(t:g)'
 50 | ```
 51 | ```
 52 | dog
 53 | ```
 54 | 
 55 | It can be used like `sed` to replace all matches in a string:
 56 | 
 57 | ```bash
 58 | echo 'Mary had a little lamb.' | ./trre 'lamb:cat'
 59 | ```
 60 | ```
 61 | Mary had a little cat.
 62 | ```
 63 | 
 64 | **Deletion:**
 65 | 
 66 | To delete a string we can use the pattern `string_to_delete:`
 67 | 
 68 | ```bash
 69 | echo 'xor' | ./trre '(x:)or'
 70 | ```
 71 | ```
 72 | or
 73 | ```
 74 | 
 75 | The expression `(x:)` can be interpreted as a translation of `x` to an empty string.
 76 | 
 77 | This expression can be used to delete all occurrences in the default `scan` mode. E.g. the expression below removes all occurrences of 'a' from a string:
 78 | 
 79 | ```bash
 80 | echo 'Mary had a little lamb.' | ./trre 'a:'
 81 | ```
 82 | ```
 83 | Mry hd  little lmb.
 84 | ```
 85 | 
 86 | Technically, here we replace the character `a` with the empty symbol.
 87 | 
 88 | To remove several characters from this string we can use a square-bracket construction similar to normal regex:
 89 | 
 90 | ```bash
 91 | echo 'Mary had a little lamb.' | ./trre '[aie]:'
 92 | ```
 93 | ```
 94 | Mry hd  lttl lmb.
 95 | ```
 96 | 
 97 | **Insertion:**
 98 | 
 99 | To insert a string we can use the pattern `:string_to_insert`
100 | 
101 | ```bash
102 | echo 'or' | ./trre '(:x)or'
103 | ```
104 | ```
105 | xor
106 | ```
107 | 
108 | We can think of the expression `(:x)` as a translation of an empty string into `x`.
109 | 
110 | We can insert a word in a context:
111 | 
112 | ```bash
113 | echo 'Mary had a lamb.' | ./trre 'had a (:little )lamb'
114 | ```
115 | ```
116 | Mary had a little lamb.
117 | ```
118 | 
119 | ### Regex over transductions
120 | 
121 | As in normal regular expressions, we can use **alternation** with the `|` symbol:
122 | 
123 | ```bash
124 | echo 'cat dog' | ./trre '(c:b)at|(d:h)og'
125 | ```
126 | ```
127 | bat hog
128 | ```
129 | 
130 | Or use the **star** over `trre` to repeat the transformation:
131 | 
132 | ```bash
133 | echo 'catcatcat' | ./trre '(cat:dog)*'
134 | ```
135 | ```
136 | dogdogdog
137 | ```
138 | 
139 | In the default `scan` mode, **star** can be omitted:
140 | 
141 | ```bash
142 | echo 'catcatcat' | ./trre 'cat:dog'
143 | ```
144 | ```
145 | dogdogdog
146 | ```
147 | 
148 | You can also use the star in the left part to "consume" any number of repetitions of a pattern:
149 | 
150 | ```bash
151 | echo 'catcatcat' | ./trre '(cat)*:dog'
152 | ```
153 | ```
154 | dog
155 | ```
156 | 
157 | #### Danger zone
158 | 
159 | Avoid using `*` or `+` in the right part, as it can cause infinite loops:
160 | 
161 | ```bash
162 | echo '' | ./trre ':a*'      # <- do NOT do this
163 | ```
164 | ```
165 | ...
166 | ```
167 | 
168 | Instead, use finite iterations:
169 | 
170 | ```bash
171 | echo '' | ./trre ':(repeat-10-times){10}'
172 | 
173 | repeat-10-timesrepeat-10-timesrepeat-10-timesrepeat-10-timesrepeat-10-timesrepeat-10-timesrepeat-10-timesrepeat-10-timesrepeat-10-timesrepeat-10-times
174 | ```
175 | 
176 | ### Range transformations
177 | 
178 | Transform ranges of characters:
179 | 
180 | ```bash
181 | echo "regular expressions" | ./trre  "[a:A-z:Z]"
182 | ```
183 | ```
184 | REGULAR EXPRESSIONS
185 | ```
186 | 
187 | As a more complex example, let's create a toy cipher. Below is a Caesar cipher (shift of 1) implementation:
188 | 
189 | ```bash
190 | echo 'caesar cipher' | ./trre '[a:b-y:zz:a]'
191 | ```
192 | ```
193 | dbftbs djqifs
194 | ```
195 | 
196 | And decrypt it back:
197 | 
198 | ```bash
199 | echo 'dbftbs djqifs' | ./trre '[a:zb:a-y:x]'
200 | ```
201 | ```
202 | caesar cipher
203 | ```
204 | 
205 | ### Generators
206 | 
207 | **`trre`** can generate multiple output strings for a single input. By default, it uses the first possible match. You can also generate all possible outputs.
208 | 
209 | **Binary sequences:**
210 | ```bash
211 | echo '' | ./trre -ma ':(0|1){3}'
212 | ```
213 | ```
214 | 
215 | 000
216 | 001
217 | 010
218 | 011
219 | 100
220 | 101
221 | 110
222 | 111
223 | ```
224 | 
225 | **Subsets:**
226 | ```bash
227 | echo '' | ./trre -ma ':(0|1){,3}?'
228 | ```
229 | ```
230 | 
231 | 0
232 | 00
233 | 000
234 | 001
235 | 01
236 | 010
237 | 011
238 | 1
239 | 10
240 | 100
241 | 101
242 | 11
243 | 110
244 | 111
245 | ```
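
The order of the listing above matches a preorder walk over all words of length at most 3. The C sketch below reproduces that order under the assumption that the non-greedy `{,3}?` tries "stop" before "repeat" and `0` before `1`. The function `enumerate` and its buffer-based interface are made up for this illustration and are not part of `trre`.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Append every word over {0,1} of length <= max_len to `out`, one per line.
 * The current word is emitted before any of its extensions (preorder). */
static void enumerate(char *word, size_t len, size_t max_len,
                      char *out, size_t cap) {
    assert(len <= max_len);
    word[len] = '\0';
    /* "stop here" is tried first, so shorter words come first */
    strncat(out, word, cap - strlen(out) - 1);
    strncat(out, "\n", cap - strlen(out) - 1);
    if (len == max_len)
        return;
    /* then "repeat once more", trying 0 before 1 */
    for (char c = '0'; c <= '1'; c++) {
        word[len] = c;
        enumerate(word, len + 1, max_len, out, cap);
    }
}
```

Starting from an empty `out` buffer, `enumerate(word, 0, 3, out, sizeof out)` produces the fifteen lines shown above, beginning with the empty line.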
246 | 
247 | ## Language specification
248 | 
249 | Informally, we define a **`trre`** as a pair `pattern-to-match`:`pattern-to-generate`. The `pattern-to-match` can be a string or a regex. The `pattern-to-generate` is normally a string, but it can be a regex as well. Moreover, we can apply the normal regular expression operators to these pairs.
250 | 
251 | We can specify a simplified grammar of **`trre`** as follows:
252 | 
253 | ```
254 | TRRE    <- TRRE* TRRE|TRRE TRRE.TRRE
255 | TRRE    <- REGEX REGEX:REGEX
256 | ```
257 | 
258 | Where `REGEX` is a usual regular expression, and `.` stands for concatenation.
259 | 
260 | For now I consider the operator `:` non-associative and the expression `TRRE:TRRE` syntactically invalid. There is a natural meaning for this expression as a composition of the relations defined by the **`trre`**s, but it can make things too complex. I may change this later.
261 | 
262 | 
263 | ## Why it works
264 | 
265 | Under the hood, **`trre`** constructs a special automaton called a **Finite State Transducer (FST)**, which is similar to a **Finite State Automaton (FSA)** used in normal regular expressions but handles input-output pairs instead of plain strings.
266 | 
267 | Key differences:
268 | 
269 | * **`trre`** defines a binary relation between two regular languages.
270 | 
271 | * It uses **FST**s instead of **FSA**s for inference.
272 | 
273 | * It supports on-the-fly determinization for performance (experimental and under construction!).
274 | 
275 | To justify the language of transductive regular expressions we need to prove the correspondence between **`trre`** expressions and the corresponding **FST**s. Here is my sketch of the proof: [docs/theory.pdf](docs/theory.pdf).
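
To make the input-output pairs concrete, here is a minimal C sketch of a hand-built transducer for `(c:d)(a:o)(t:g)`. It is an illustration only and not how `trre` builds or runs its automata: the `arc` struct, the state numbering and the `transduce` function are assumptions of this example, and the machine is deterministic and has no epsilon transitions.

```c
#include <assert.h>
#include <stddef.h>

/* One transition of a toy transducer: read `in`, write `out`, move to `to`. */
struct arc {
    int from;
    char in;    /* input label  */
    char out;   /* output label */
    int to;
};

/* Hypothetical hand-built machine for (c:d)(a:o)(t:g). */
static const struct arc arcs[] = {
    {0, 'c', 'd', 1},
    {1, 'a', 'o', 2},
    {2, 't', 'g', 3},
};
enum { START = 0, FINAL = 3 };

/* Returns 1 and fills `out` if the whole input is accepted, otherwise 0. */
static int transduce(const char *in, char *out, size_t cap) {
    assert(in != NULL && out != NULL && cap > 0);
    int state = START;
    size_t n = 0;
    for (; *in != '\0'; in++) {
        int moved = 0;
        for (size_t i = 0; i < sizeof arcs / sizeof arcs[0]; i++) {
            if (arcs[i].from == state && arcs[i].in == *in) {
                if (n + 1 >= cap)
                    return 0;           /* output buffer too small */
                out[n++] = arcs[i].out; /* emit the output label */
                state = arcs[i].to;
                moved = 1;
                break;
            }
        }
        if (!moved)
            return 0;                   /* no arc for this input character */
    }
    out[n] = '\0';
    return state == FINAL;
}
```

With this table, `transduce("cat", out, sizeof out)` accepts and writes `dog`, while an input such as `cab` is rejected. A plain acceptor would have the same states and arcs but no `out` field.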
276 | 
277 | ## Precedence
278 | 
279 | Below is the table of precedence (priority) of the `trre` operators from high to low:
280 | 
281 | | Operator                          | Expression           |
282 | | --------------------------------- | -------------------- |
283 | | Escaped characters                | \\                   |
284 | | Bracket expression                | []                   |
285 | | Grouping                          | ()                   |
286 | | Iteration                         | * + ? {m,n}          |
287 | | Concatenation                     |                      |
288 | | **Transduction**                  | :                    |
289 | | Alternation                       | \|                   |
290 | 
291 | 
292 | ## Modes and greediness
293 | 
294 | **`trre`** supports two modes:
295 | 
296 | * **Scan Mode (default)**: Applies transformations sequentially.
297 | 
298 | * **Match Mode**: Checks the entire string against the expression (use `-m` flag).
299 | 
300 | Use `-a` to generate all possible outputs.
301 | 
302 | The `?` modifier makes `*`, `+`, and `{,}` operators non-greedy:
303 | 
304 | ```bash
305 | echo '<cat><dog>' | ./trre '<(.:)*>'
306 | ```
307 | ```
308 | <>
309 | ```
310 | 
311 | ```bash
312 | echo '<cat><dog>' | ./trre '<(.:)*?>'
313 | ```
314 | ```
315 | <><>
316 | ```
317 | 
318 | Another common example is to change something within tags or brackets. E.g. to change everything within "<>" we can use the following expression:
319 | 
320 | ```bash
321 | echo '<dog> <mouse>' | ./trre '<(.*?:cat)>'
322 | ```
323 | ```
324 | <cat> <cat>
325 | ```
326 | 
327 | 
328 | ## Determinization
329 | 
330 | <img src="docs/determinization.png" width="80%"/>
331 | 
332 | An important part of modern regex engines is determinization. This routine converts a non-deterministic automaton into a deterministic one. Once converted, inference takes time linear in the length of the input string. It is handy, but the conversion is exponential in the number of states in the worst case.
333 | 
334 | For **`trre`** a similar approach is possible. The bad news is that not every non-deterministic transducer (**NFT**) can be converted to a deterministic one (**DFT**). In the case of two "bad" cycles with the same input labels, the algorithm gets trapped in an infinite loop of state creation. There is a way to detect such loops, but it is expensive (see [Allauzen, Mohri, Efficient Algorithms for testing the twins property](https://cs.nyu.edu/~mohri/pub/twins.pdf)).
335 | 
336 | ## Performance
337 | 
338 | The default non-deterministic version is a bit slower than `sed`:
339 | 
340 | ```bash
341 | wget https://www.gutenberg.org/cache/epub/57333/pg57333.txt -O chekhov.txt
342 | 
343 | time cat chekhov.txt | ./trre '(vodka):(VODKA)' > /dev/null
344 | ```
345 | ```
346 | real	0m0.046s
347 | user	0m0.043s
348 | sys	0m0.007s
349 | ```
350 | 
351 | ```bash
352 | time cat chekhov.txt | sed  's/vodka/VODKA/' > /dev/null
353 | ```
354 | ```
355 | real	0m0.024s
356 | user	0m0.020s
357 | sys	0m0.010s
358 | ```
359 | 
360 | For complex tasks, **`trre_dft`** (deterministic version) can outperform sed:
361 | 
362 | ```bash
363 | time cat chekhov.txt | sed -e 's/\(.*\)/\U\1/' > /dev/null
364 | ```
365 | ```
366 | real	0m0.508s
367 | user	0m0.504s
368 | sys	0m0.015s
369 | ```
370 | 
371 | ```bash
372 | time cat chekhov.txt | ./trre_dft '[a:A-z:Z]' > /dev/null
373 | ```
374 | ```
375 | real	0m0.131s
376 | user	0m0.127s
377 | sys	0m0.009s
378 | ```
379 | 
380 | ## Installation
381 | 
382 | No pre-built binaries are available yet. Clone the repository and compile:
383 | 
384 | ```bash
385 | git clone git@github.com:c0stya/trre.git trre
386 | cd trre
387 | make && sh test.sh
388 | ```
389 | 
390 | Then move the binary to a directory in your `$PATH`.
391 | 
392 | ## TODO
393 | 
394 | * Stable *DFT* version
395 | * Full unicode support
396 | * Complete the ERE feature set:
397 |     - negation `^` within `[]`
398 |     - character classes
399 |     - '$^' anchoring symbols
400 | * Efficient range processing
401 | 
402 | ## References
403 | 
404 | * The approach is strongly inspired by Russ Cox's article: [Regular Expression Matching Can Be Simple And Fast](https://swtch.com/~rsc/regexp/regexp1.html)
405 | * The idea of transducer determinization comes from this paper: [Cyril Allauzen, Mehryar Mohri, "Finitely Subsequential Transducers"](https://research.google/pubs/finitely-subsequential-transducers-2/)
406 | * Parsing approach comes from [Double-E algorithm](https://github.com/erikeidt/erikeidt.github.io/blob/master/The-Double-E-Method.md) by Erik Eidt which is close to the classical [Shunting Yard algorithm](https://en.wikipedia.org/wiki/Shunting_yard_algorithm).
407 | 


--------------------------------------------------------------------------------
/docs/automata.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/c0stya/trre/c474a6849153762530810d91667c192c5e53341a/docs/automata.png


--------------------------------------------------------------------------------
/docs/determinization.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/c0stya/trre/c474a6849153762530810d91667c192c5e53341a/docs/determinization.png


--------------------------------------------------------------------------------
/docs/theory.pdf:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/c0stya/trre/c474a6849153762530810d91667c192c5e53341a/docs/theory.pdf


--------------------------------------------------------------------------------
/docs/theory.typ:
--------------------------------------------------------------------------------
  1 | = Transductive regular expression
  2 | 
  3 | == Intro
  4 | 
  5 | The idea behind *trre* is very similar to regular expressions (*re*). But where a *re* defines a regular language, a *trre* defines a binary relation between two regular languages. To check whether a string matches a given regular expression, we normally construct an abstract machine called a finite state automaton or finite state acceptor (FSA). There is an established formal correspondence between the class of *re* and the class of FSA: for each expression in *re* we can construct an automaton, and for each automaton we can compose an expression in *re*.
  6 | 
  7 | For *trre* the class of FSA is not enough. Instead we use a finite state transducer (FST) as the underlying matching engine. It is similar to the acceptor, but each transition carries two labels: one defines an input character and the other defines an output character. This article establishes a correspondence between *trre* and FST similar to the one between *re* and FSA.
  8 | 
  9 | To do this we need to define the *trre* language formally and then prove that for every *trre* expression there is a corresponding FST and vice versa. But first, let's define the *re* language and then extend it to *trre*.
 10 | 
 11 | == RE language specification
 12 | 
 13 | *Definition*. Let $cal(A)$ be a finite set augmented with symbols $emptyset$ and $epsilon$. We call it an alphabet. Then the *RE* language is the set of words over the alphabet $cal(A)$ and the symbols $+, dot.op, *$, defined inductively:
 14 | 
 15 | 1. symbols $epsilon, emptyset, "and" a in cal(A)$ are regular expressions
 16 | 2. for any regular expressions $r$ and $s$ the words $r + s$, $r dot.op s$ and $r\*$ are regular expressions.
 17 | 
 18 | *Definition*. For a given regular expression $r$, the language of $r$, denoted $L(r)$, is a (possibly infinite) set of finite words defined inductively:
 19 | 
 20 | 1. for regular expressions $epsilon, emptyset, a in cal(A)$
 21 |     - $L(epsilon) = {epsilon}$
 22 |     - $L(emptyset) = emptyset$
 23 |     - $L(a) = {a}$
 24 | 2. for regular expressions $r, s$
 25 |     - $L(r + s) = L(r) union L(s)$
 26 |     - $L(r dot.op s) = L(r) dot.op L(s) = {a dot.op b | a in L(r), b in L(s)} $
 27 |     - $L(r\*) = L(r)\* = union.big_(n >= 0) L(r)^n = {epsilon} union L(r) union L(r) dot.op L(r) union L(r) dot.op L(r) dot.op L(r) union ... $
 28 | 
 29 | The operation $r\*$ we call Kleene star or Kleene closure.
 30 | 
 31 | *Examples*.
 32 |     - $(a+b)c$
 33 |     - $a*b*$
 34 | 
 35 | == TRRE language specification
 36 | 
 37 | *Definition*. Let $R$ be a set of regular expressions over alphabet $cal(A)$ and $S$ be a set of regular expressions over alphabet $cal(B)$. We refer to $cal(A)$ as the input alphabet and $cal(B)$ as the output alphabet. A transductive regular expression (`trre`) is built from pairs of regular expressions joined by the delimiter symbol ':':
 38 | 1. for regular expressions $r in R, s in S$, the string $r:s$ is a transductive regular expression
 39 | 2. for transductive regular expressions $t, u$ expressions $t dot.op u$, $t + u$, $t\*$ are transductive regular expressions
 40 | 
 41 | *Definition*. For a given transductive regular expression $t$ the language of $t$ denoted as $L(t)$ is a set of *pairs* of words defined inductively:
 42 | 
 43 | 1. for a transductive regular expression $r:s$, where $r, s$ are regular expressions, the language is $L(r:s) = L(r) times L(s) = { (a,b) | a in L(r), b in L(s) }$, i.e. the direct product of the languages $L(r)$ and $L(s)$.
 44 | 2. for transductive regular expressions $t, u$:
 45 |     - $L(t dot.op u) = L(t) dot.op L(u) = { (a dot.op b, x dot.op y ) | (a,x) in L(t), (b,y) in L(u) } $
 46 |     - $L(t + u) = L(t) union  L(u) $
 47 |     - $L(t\*) = L(t)\* = union.big_(n >= 0) L(t)^n$.
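
*Example*. As a small illustration of concatenation (the expressions are chosen for this example only), let $t = a:x$ and $u = b:y + c:epsilon$. Then $L(t) = {(a, x)}$ and $L(u) = {(b, y), (c, epsilon)}$, so
    $ L(t dot.op u) & = {(a dot.op b, x dot.op y), (a dot.op c, x dot.op epsilon)} \
    & = {(a b, x y), (a c, x)} $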
 48 | 
 49 | == Some properties of TRRE language
 50 | 
 51 | 1. $a:b = a:epsilon dot.op epsilon:b = epsilon : b dot.op a : epsilon$
 52 | 
 53 | Proof:
 54 |     $ L(a:b) & = {(u,v) | u in L(a), v in L(b)}  & "by definition of" a:b "language" \
 55 |     & = {(u dot.op epsilon, epsilon dot.op v) | u in L(a), v in L(b)}  \
 56 |     & = {(u dot.op epsilon, epsilon dot.op v) | (u, epsilon) in L(a:epsilon), (epsilon, v) in L(epsilon:b)} \
 57 |     & = L(a:epsilon) dot.op L(epsilon:b) & "by definition of concatenation" \
 58 |     & = L(a:epsilon dot.op epsilon:b) $
 59 | 
 60 | 2. $(a + b):c = a:c + b:c $
 61 | 
 62 | Proof:
 63 |     $ L((a + b):c) & = L(a + b) times L(c) & "by definition of" a:b "language" \
 64 |     & = (L(a) union L(b)) times L(c) & "by definition of" a + b "language" \
 65 |     & = L(a) times L(c) union L(b) times L(c) & "cartesian product of sets is distributive over union" \
 66 |     & = L(a:c) union L(b:c) \
 67 |     & = L(a:c + b:c) $
 68 | 
 69 | 3. $a:(b + c) = a:b + a:c $
 70 | 
 71 | Proof: same as in 2
 72 | 
 73 | 4. $a\*:epsilon = (a : epsilon)\* $
 74 | 
 75 | Proof:
 76 |     $ a\*:epsilon & = ( sum_(n>=0) a^n ) : epsilon  	& "by definition of" \* \
 77 |     & = sum_(n>=0) (a^n: epsilon ) & "by Property 2" \
 78 |     & = (a:epsilon)\* & "by definition of" \* $
 79 | 
 80 | 5. $(a : epsilon)\* = a\*:epsilon $
 81 | 
 82 | Proof: same as in 4
 83 | 
 84 | == trre normal form
 85 | 
 86 | Let's introduce a simplified version of TRRE language. We will call it normal form of TRRE.
 87 | 
 88 | *Definition*. Let $cal(A), cal(B)$ be alphabets. Then a normal form is any string defined recursively:
 89 | 
 90 | 1. any string of the form $a:b$ is a normal form, where $a in cal(A) union {epsilon}, b in cal(B) union {epsilon}$
 91 | 2. any string of the form $A dot.op B$, $A+B$ or $A\*$ is a normal form, where $A, B$ are normal forms
 92 | 
 93 | It is clear that any normal form is a transductive regular expression. Let's show that the converse is true as well.
 94 | 
 95 | *Lemma 1.* Every trre expression can be converted to a normal form.
 96 | 
 97 | *Proof:* First, let's show that any trre expressions of the form $a:epsilon$ and $epsilon:b$, where $a, b$ are valid regular expressions, can be converted to a normal form. Let's prove it for $a:epsilon$ using structural induction on the *RE* construction.
 98 | 
 99 | _Base_: $a:epsilon$, $epsilon:epsilon$ are in the normal form.
100 | 
101 | _Step_: Let's assume that all the subformulas of the formula $a:epsilon$ can be converted to normal form. Then let's prove that $(a + b) : epsilon, a dot.op b : epsilon, a\*:epsilon$ can be converted to normal form:
102 |     1. $(a + b):epsilon = a:epsilon + b:epsilon$ by Property 2.
103 |     2. $a dot.op b : epsilon = a:epsilon dot.op b:epsilon$ by Property 1.
104 |     3. $(a \*:epsilon) = (a:epsilon)\*$ by Property 4.
105 | 
106 | For the case $epsilon:b$ the proof is similar.
107 | So, knowing that any formula $A:B$ can be represented as $A:epsilon dot.op epsilon:B$, we can represent any TRRE of the form $A:B$ in the normal form. The last step is to prove that any TRRE formula can be represented as a normal form. Using this fact as the base of induction, we can prove it by structural induction on the recursive definition of TRRE.
108 | 
109 | == trre and automata equivalence
110 | 
111 | The final step is to show that for any TRRE there is a corresponding FST which defines the same language, and that for any FST there is a TRRE which defines the same language.
112 | 
113 | *Theorem*.  $ forall t in "TRRE" exists "FST" a "such that" cal(L)(t) = cal(L)(a) $  and
114 | 	$ forall "FST" a exists "TRRE" t "such that" cal(L)(a) = cal(L)(t) $
115 | 
116 | *Proof $arrow.r$*
117 | 
118 | The proof follows the classical Thompson's construction algorithm. The only difference is that we represent the transducer as an automaton over the alphabet $(cal(A) union {epsilon}) times (cal(B) union {epsilon})$.
119 | 
120 | *Proof $arrow.l$*
121 | 
122 | The proof is the classical Kleene's algorithm for converting an automaton to a regular expression. The difference here is that we represent a transductive regular expression as a regular expression by first expressing the TRRE in normal form (by Lemma 1) and then treating each transductive pair of the form $a:b$ as a symbol in the new alphabet. The rest of the proof is the same.
123 | 
124 | == Inference
125 | 
126 | As in many modern *re* engines, we construct a deterministic automaton on the fly. The difference here is that we construct a deterministic FST whenever it is possible.
127 | 


--------------------------------------------------------------------------------
/test.sh:
--------------------------------------------------------------------------------
  1 | #!/bin/bash
  2 | 
  3 | cmd_scan="./trre"
  4 | cmd_match="./trre -ma"
  5 | 
  6 | test_cmd() {
  7 |     local inp=$1
  8 |     local tre=$2
  9 |     local exp=$3
 10 |     local cmd=$4
 11 | 
 12 |     res=$(echo "$inp" | $cmd "$tre")
 13 | 
 14 |     if ! diff <(echo -e "$exp") <(echo -e "$res") > /dev/null; then
 15 |         echo -e "FAIL $cmd:" "$inp" "->" "$tre"
 16 |         diff <(echo -e "$exp") <(echo -e "$res")
 17 |     fi
 18 | }
 19 | 
 20 | M() {
 21 |     test_cmd "$1" "$2" "$3" "$cmd_match"
 22 | }
 23 | 
 24 | S() {
 25 |     test_cmd "$1" "$2" "$3" "$cmd_scan"
 26 | }
 27 | 
 28 | 	# input		# trre			# expected
 29 | # basics
 30 | M 	"a"		"a:x" 			"x"
 31 | M	"ab"		"ab:xy" 		"xy"
 32 | M	"ab"		"(a:x)(b:y)" 		"xy"
 33 | M 	"cat"		"cat:dog"		"dog"
 34 | M 	"cat"		"(cat):(dog)"		"dog"
 35 | M	"cat"		"(c:d)(a:o)(t:g)"	"dog"
 36 | M	"mat"		"c:da:ot:g"		""
 37 | 
 38 | # basics deletion
 39 | M 	 "xor" 		"(x:)or"		"or"
 40 | S 	 "xor" 		"x:"			"or"
 41 | S  	 "Mary had a little lamb"	"a:"	"Mry hd  little lmb"
 42 | 
 43 | # basics insertion
 44 | M 	 'or' 		'(:x)or'		"xor"
 45 | S 	 'or' 		':='			"=o=r="
 46 | 
 47 | # alternation
 48 | M 	"a"		"a|b|c"			"a"
 49 | M 	"b"		"a|b|c"			"b"
 50 | M 	"c"		"a|b|c"			"c"
 51 | 
 52 | # star
 53 | S	"a"		"a*"			"a"
 54 | S	"aaa"		"a*"			"aaa"
 55 | M	"b"		"a*"			""
 56 | M	"bbb"		"a*"			""
 57 | M	"abab"		"(ab)*"			"abab"
 58 | M	"ababa"		"(ab)*"			""
 59 | 
 60 | # basic ranges
 61 | M	"a"		"[a-c]"			"a"
 62 | M	"b"		"[a-c]"			"b"
 63 | M	"c"		"[a-c]"			"c"
 64 | M	"d"		"[a-c]"			""
 65 | 
 66 | # transform ranges
 67 | M	"a"		"[a:x]"			"x"
 68 | M	"a"		"[a:xb:y]"		"x"
 69 | M	"b"		"[a:xb:y]"		"y"
 70 | M	"c"		"[a:xb:y]"		""
 71 | 
 72 | # range-transform ranges
 73 | M	"a"		"[a:x-c:z]"		"x"
 74 | M	"b"		"[a:x-c:z]"		"y"
 75 | M	"c"		"[a:x-c:z]"		"z"
 76 | M	"d"		"[a:x-c:z]"		""
 77 | 
 78 | # any char
 79 | M	"a"		"."			"a"
 80 | M	"b"		"."			"b"
 81 | M	"abc"		"..."			"abc"
 82 | 
 83 | # iteration, single
 84 | M	"aa"		"a{2}"			"aa"
 85 | M	"a"		"a{2}"			""
 86 | M	"aaa"		"a{2}"			""
 87 | 
 88 | # iteration, left
 89 | M	"" 		"a{,2}"			""
 90 | M	"a" 		"a{,2}"			"a"
 91 | M	"aa" 		"a{,2}"			"aa"
 92 | M	"aaa" 		"a{,2}"			""
 93 | 
 94 | # iteration, center
 95 | M	"" 		"a{1,2}"		""
 96 | M	"a" 		"a{1,2}"		"a"
 97 | M	"aa" 		"a{1,2}"		"aa"
 98 | M	"aaa" 		"a{1,2}"		""
 99 | 
100 | # iteration, right
101 | M	""		"a{2,}"			""
102 | M	"a"		"a{2,}"			""
103 | M	"aa"		"a{2,}"			"aa"
104 | M	"aaa"		"a{2,}"			"aaa"
105 | 
106 | # generators
107 | S	""		":a{,3}"		"aaa"
108 | M	""		":a{,3}"		"aaa\naa\na"
109 | 
110 | S	""		":a{,3}?"		""
111 | M	""		":a{,3}?"		"\na\naa\naaa"
112 | 
113 | # greed modifiers
114 | S	"aaa"		"(.:x)*.*"		"xxx"
115 | S	"aaa"		"(.:x)*?.*"		"aaa"
116 | 
117 | M	"aaa"		"(.:x)*.*"		"xxx\nxxa\nxaa\naaa"
118 | M	"aaa"		"(.:x)*?.*"		"aaa\nxaa\nxxa\nxxx"
119 | 
120 | # escapes
121 | S	"."		"\."			"."
122 | S	"+"		"\+"			"+"
123 | S	"?"		"\?"			"?"
124 | S	":"		"\:"			":"
125 | S	".a"		"\.a"			".a"
126 | M	".c"		"[.]c"			".c"
127 | 
128 | # greed modifiers priorities
129 | S 	"<cat><dog>" 	"<(.:)*>"		"<>"
130 | S 	"<cat><dog>" 	"<(.:)*?>"		"<><>"
131 | S 	"<cat><dog>" 	"<(.:)+>"		"<>"
132 | S 	"<cat><dog>" 	"<(.:)+?>"		"<><>"
133 | 
134 | 
135 | 
136 | # epsilon
137 | # generators
138 | 
139 | ## iteration, range
140 | #printf "aaa\n"	| $CMD "((a:x){,}.*)"
141 | 


--------------------------------------------------------------------------------
/trre.1:
--------------------------------------------------------------------------------
 1 | .\" Process this file with
 2 | .\" groff -man -Tascii trre.1
 3 | .\"
 4 | .TH TRRE 1 "User Commands"
 5 | .SH NAME
 6 | trre \- stream text editor based on transductive regular expressions
 7 | .SH SYNOPSIS
 8 | .B trre
 9 | [\fB\-mad\fR]
10 | .I PATTERN
11 | [\fIFILE\fR]
12 | .SH DESCRIPTION
13 | .B trre
14 | is a stream editor that performs string transformations. Similar to
15 | .BR sed (1),
16 | it processes input in a single pass. The transformation is entirely defined by the
17 | .IR PATTERN .
18 | 
19 | If no input file is specified,
20 | .B trre
21 | reads from standard input.
22 | .SH OPTIONS
23 | .IP \fB\-m\fR
24 | Enable matching mode. The expression must match the entire input string.
25 | .IP \fB\-a\fR
26 | Print all possible output strings. Useful in matching mode.
27 | .IP \fB\-d\fR
28 | Enable debug mode. Prints the parsing tree and automaton to stderr.
29 | .SH EXAMPLES
30 | .IP "\fBecho cat | trre 'cat:dog'\fR"
31 | Transforms the word "cat" to "dog".
32 | .IP "\fBecho cat | trre '(c:d)(a:o)(t:g)'\fR"
33 | Transforms the word "cat" to "dog".
34 | .SH BUGS
35 | This is an experimental tool. Use with caution.
36 | .SH AUTHOR
37 | Konstantin Selivanov <konstantin dot selivanov at g mail com>
38 | .SH "SEE ALSO"
39 | .BR sed (1),
40 | .BR tr (1)
41 | 


--------------------------------------------------------------------------------
/trre_dft.c:
--------------------------------------------------------------------------------
   1 | // Experimental. Do not use in production.
   2 | 
   3 | #define _GNU_SOURCE
   4 | #include <stdio.h>
   5 | #include <string.h>
   6 | #include <stdlib.h>
   7 | #include <assert.h>
   8 | #include <stdint.h>
   9 | #include <unistd.h>
  10 | 
  11 | 
  12 | /* precedence table */
  13 | int prec(char c) {
  14 |     switch (c) {
  15 |     	case '|': 		return 1;
  16 | 	case '-':		return 2;
  17 |     	case ':':		return 3;
  18 |     	case '.': 		return 4;
  19 | 	case '?': case '*':
  20 | 	case '+': case 'I': 	return 5;
  21 | 	case '\\':		return 6;
  22 |     }
  23 |     return -1;
  24 | }
  25 | 
  26 | 
  27 | struct node {
  28 |     unsigned char type;
  29 |     unsigned char val;
  30 |     struct node * l;
  31 |     struct node * r;
  32 | };
  33 | 
  34 | #define push(stack, item) 	*stack++ = item
  35 | #define pop(stack) 		*--stack
  36 | #define top(stack) 		*(stack-1)
  37 | 
  38 | #define STACK_MAX_CAPACITY	1000000
  39 | 
  40 | static unsigned char operators[1024];
  41 | static struct node *operands[1024];
  42 | 
  43 | static unsigned char *opr = operators;
  44 | static struct node **opd = operands;
  45 | 
  46 | static char* output;
  47 | static size_t output_capacity=32;
  48 | 
  49 | 
  50 | struct node * create_node(unsigned char type, struct node *l, struct node *r) {
  51 |     struct node *node = malloc(sizeof(struct node));
  52 |     if (!node) {
  53 | 	fprintf(stderr, "error: node memory allocation failed\n");
  54 | 	exit(EXIT_FAILURE);
  55 |     }
  56 |     node->type = type;
  57 |     node->val = 0;
  58 |     node->l = l;
  59 |     node->r = r;
  60 |     return node;
  61 | }
  62 | 
  63 | // alias-helper
  64 | struct node * create_nodev(unsigned char type, unsigned char val) {
  65 |     struct node *node = create_node(type, NULL, NULL);
  66 |     node->val = val;
  67 |     return node;
  68 | }
  69 | 
  70 | enum infer_mode {
  71 |     SCAN,
  72 |     MATCH,
  73 | };
  74 | 
  75 | 
  76 | // reduce immediately
  77 | void reduce_postfix(char op, int ng) {
  78 |     struct node *l, *r;
  79 | 
  80 |     switch(op) {
  81 | 	case '*': case '+': case '?':
  82 | 	    l = pop(opd);
  83 | 	    r = create_node(op, l, NULL);
  84 | 	    r->val = ng;
  85 | 	    push(opd, r);
  86 | 	    break;
  87 | 	default:
  88 | 	    fprintf(stderr, "error: unexpected postfix operator\n");
  89 | 	    exit(EXIT_FAILURE);
  90 |     }
  91 | }
  92 | 
  93 | 
  94 | void reduce() {
  95 |     char op;
  96 |     struct node *l, *r;
  97 | 
  98 |     op = pop(opr);
  99 |     switch(op) {
 100 | 	case '|': case '.': case ':': case '-':
 101 | 	    r = pop(opd);
 102 | 	    l = pop(opd);
 103 | 	    push(opd, create_node(op, l, r));
 104 | 	    break;
 105 | 	case '(':
 106 | 	    fprintf(stderr, "error: unmatched parenthesis\n");
 107 | 	    exit(EXIT_FAILURE);
 108 |     }
 109 | }
 110 | 
 111 | 
 112 | void reduce_op(char op) {
 113 |     while(opr != operators && prec(top(opr)) >= prec(op))
 114 |         reduce();
 115 |     push(opr, op);
 116 | }
 117 | 
 118 | char* parse_curly_brackets(char *expr) {
 119 |     int state = 0, ng = 0;
 120 |     int count = 0, lv = 0;
 121 |     struct node *l, *r;
 122 |     char c;
 123 | 
 124 |     while ((c = *expr) != '\0') {
 125 |         if (c >= '0' && c <= '9') {
 126 |             count = count*10 + c - '0';
 127 | 	} else if (c == ',') {
 128 |             lv = count;
 129 |             count = 0;
 130 |             state += 1;
 131 | 	} else if (c == '}') {
 132 | 	    if(*(expr+1) == '?') {			// safe to do, we have the '\0' sentinel
 133 | 		ng = 1;
 134 | 		expr++;
 135 | 	    }
 136 |             if (state == 0)
 137 |             	lv = count;
 138 | 	    else if (state > 1) {
 139 | 		fprintf(stderr, "error: more than one comma in curly brackets\n");
 140 | 		exit(EXIT_FAILURE);
 141 | 	    }
 142 | 
 143 | 	    r = create_nodev(lv, count);
 144 | 	    l = create_node('I', pop(opd), r);
 145 | 	    l->val = ng;
 146 | 	    push(opd, l);
 147 | 
 148 |             return expr;
 149 |         } else {
 150 | 	    fprintf(stderr, "error: unexpected symbol in curly brackets: %c\n", c);
 151 | 	    exit(EXIT_FAILURE);
 152 |         }
 153 |         expr++;
 154 |     }
 155 |     fprintf(stderr, "error: unmatched curly brackets\n");
 156 |     exit(EXIT_FAILURE);
 157 | }
 158 | 
 159 | char* parse_square_brackets(char *expr) {
 160 |     char c;
 161 |     int state = 0;
 162 | 
 163 |     while ((c = *expr) != '\0') {
 164 |         if (state == 0) {              			   // expect operand
 165 |             switch(c) {
 166 | 		case ':': case '-': case '[': case ']':
 167 | 		    fprintf(stderr, "error: unexpected symbol in square brackets: %c\n", c);
 168 | 		    exit(EXIT_FAILURE);
 169 | 		default:
 170 | 		    push(opd, create_nodev('c', c));       // push operand
 171 | 		    state = 1;
 172 | 	    }
 173 | 	} else {                       		   	   // expect operator
 174 |             switch(c) {
 175 | 		case ':': case '-':
 176 | 		    reduce_op(c);
 177 | 		    state = 0;
 178 | 		    break;
 179 | 		case ']':
 180 | 		    while (opr != operators && top(opr) != '[')
 181 | 			reduce();
 182 | 		    --opr;              // remove [ from the stack
 183 | 		    return expr;
 184 | 		default:                // implicit alternation
 185 | 		    reduce_op('|');
 186 | 		    state = 0;
 187 | 		    continue;		// skip expr increment
 188 | 	    }
 189 | 	}
 190 |         expr++;
 191 |     }
 192 |     fprintf(stderr, "error: unmatched square brackets\n");
 193 |     exit(EXIT_FAILURE);
 194 | }
 195 | 
 196 | 
 197 | struct node * parse(char *expr) {
 198 |     unsigned char c;
 199 |     int state = 0;
 200 | 
 201 |     while ((c = *expr) != '\0') {
 202 |         if (state == 0) {                     	// expect operand
 203 |             switch(c) {
 204 | 		case '(':
 205 | 		    push(opr, c);
 206 | 		    break;
 207 | 		case '[':
 208 | 		    push(opr, c);
 209 | 		    expr = parse_square_brackets(expr+1);
 210 | 		    state = 1;
 211 | 		    break;
 212 | 		case '\\':
 213 | 		    push(opd, create_nodev('c', *++expr));
 214 | 		    state = 1;
 215 | 		    break;
 216 | 		case '.':
 217 | 		    push(opd, create_node('-',
 218 | 		    		create_nodev('c', 0),
 219 | 		    		create_nodev('c', 255)));
 220 | 		    state = 1;
 221 | 		    break;
 222 | 		case ':':					// epsilon as an implicit left operand
 223 | 		    push(opd, create_nodev('e', c));
 224 | 		    state = 1;
 225 | 		    continue;					// stay in the same position in expr
 226 | 		case '|': case '*': case '+': case '?':
 227 | 		case ')': case '{': case '}':
 228 | 		    if (opr != operators && top(opr) == ':') { 	// epsilon as an implicit right operand
 229 | 			push(opd, create_nodev('e', c));
 230 | 			state = 1;
 231 | 			continue;				// stay in the same position in expr
 232 | 		    } else {
 233 | 			fprintf(stderr, "error: unexpected symbol %c\n", c);
 234 | 			exit(EXIT_FAILURE);
 235 | 		    }
 236 | 		default:
 237 | 		    push(opd, create_nodev('c', c));
 238 | 		    state = 1;
 239 |             }
 240 | 	} else {               					// expect postfix or binary operator
 241 | 	    switch (c) {
 242 | 	    case '*': case '+': case '?':
 243 |                 if(*(expr+1) == '?') {				// safe to do, we have the '\0' sentinel
 244 | 		    reduce_postfix(c, 1);			// found non-greedy modifier
 245 | 		    expr++;
 246 |                 }
 247 | 		else
 248 | 		    reduce_postfix(c, 0);
 249 |                 break;
 250 |             case '|':
 251 |                 reduce_op(c);
 252 |                 state = 0;
 253 |                 break;
 254 |             case ':':
 255 |                 if (*(expr+1) == '\0') {		// implicit epsilon as a right operand
 256 |                     push(opd, create_nodev('e', c));
 257 |                 }
 258 | 		reduce_op(c);
 259 | 		state = 0;
 260 | 		break;
 261 |             case '{':
 262 |                 expr = parse_curly_brackets(expr+1);
 263 |                 break;
 264 |             case ')':
 265 |                 while (opr != operators && top(opr) != '(')
 266 |                     reduce();
 267 |                 if (opr == operators || top(opr) != '(') {
 268 | 		    fprintf(stderr, "error: unmatched parenthesis\n");
 269 | 		    exit(EXIT_FAILURE);
 270 | 		}
 271 |                 --opr;                       	// remove ( from the stack
 272 |                 break;
 273 |             default:                            // implicit cat
 274 |                 reduce_op('.');
 275 |                 state = 0;
 276 |                 continue;
 277 |             }
 278 | 	}
 279 |         expr++;
 280 |     }
 281 | 
 282 |     while (opr != operators) {
 283 |         reduce();
 284 |     }
 285 |     assert (operands != opd);
 286 | 
 287 |     return pop(opd);
 288 | }
 289 | 
 290 | 
 291 | void plot_ast(struct node *n) {
 292 |     struct node *stack[1024];
 293 |     struct node **sp = stack;
 294 | 
 295 |     printf("digraph G {\n\tsplines=true; rankdir=TD;\n");
 296 | 
 297 |     push(sp, n);
 298 |     while (sp != stack) {
 299 |         n = pop(sp);
 300 | 
 301 |         if (n->type == 'c') {
 302 |             //printf("ntype: %c", n->type);
 303 |             printf("\t\"%p\" [peripheries=2, label=\"%u\"];\n", (void*)n, n->val);
 304 |         } else {
 305 |             printf("\t\"%p\" [label=\"%c\"];\n", (void*)n, n->type);
 306 |         }
 307 | 
 308 |         if (n->l) {
 309 |             printf("\t\"%p\" -> \"%p\";\n", (void*)n, (void*)n->l);
 310 |             push(sp, n->l);
 311 |         }
 312 | 
 313 |         if (n->r) {
 314 |             printf("\t\"%p\" -> \"%p\" [label=\"%c\"];\n", (void*)n, (void*)n->r, 'r');
 315 |             push(sp, n->r);
 316 |         }
 317 |     }
 318 |     printf("}\n");
 319 | }
 320 | 
 321 | 
 322 | enum nstate_type {
 323 |     PROD,
 324 |     CONS,
 325 |     SPLIT,
 326 |     SPLITNG,
 327 |     JOIN,
 328 |     FINAL
 329 | };
 330 | 
 331 | 
 332 | // nft state
 333 | struct nstate {
 334 |     enum nstate_type type;
 335 |     unsigned char val;
 336 |     unsigned char mode;
 337 |     struct nstate *nexta;
 338 |     struct nstate *nextb;
 339 |     uint8_t visited;
 340 | };
 341 | 
 342 | 
 343 | struct nstate* create_nstate(enum nstate_type type, struct nstate *nexta, struct nstate *nextb) {
 344 |     struct nstate *state;
 345 | 
 346 |     state = malloc(sizeof(struct nstate));
 347 |     if (state == NULL) {
 348 |         fprintf(stderr, "error: nft state memory allocation failed\n");
 349 |         exit(EXIT_FAILURE);
 350 |     }
 351 | 
 352 |     state->type = type;
 353 |     state->nexta = nexta;
 354 |     state->nextb = nextb;
 355 |     state->val = 0;
 356 |     state->visited = 0;
 357 | 
 358 |     return state;
 359 | }
 360 | 
 361 | 
 362 | struct nchunk {
 363 |     struct nstate * head;
 364 |     struct nstate * tail;
 365 | };
 366 | 
 367 | 
 368 | struct nchunk chunk(struct nstate *head, struct nstate *tail) {
 369 |     struct nchunk chunk;
 370 |     chunk.head = head;
 371 |     chunk.tail = tail;
 372 |     return chunk;
 373 | }
 374 | 
 375 | 
 376 | struct nchunk nft(struct node *n, char mode) {
 377 |     struct nstate *split, *psplit, *join;
 378 |     struct nstate *cstate, *pstate, *state, *head, *tail, *final;
 379 |     struct nchunk l, r;
 380 |     int llv, lrv, rlv;
 381 |     int lb, rb;
 382 | 
 383 |     if (n == NULL)
 384 |     	return chunk(NULL, NULL);
 385 | 
 386 |     switch(n->type) {
 387 | 	case '.':
 388 | 	    l = nft(n->l, mode);
 389 | 	    r = nft(n->r, mode);
 390 | 	    l.tail->nexta = r.head;
 391 | 	    return chunk(l.head, r.tail);
 392 | 	case '|':
 393 | 	    l = nft(n->l, mode);
 394 | 	    r = nft(n->r, mode);
 395 | 	    split = create_nstate(SPLITNG, l.head, r.head);
 396 | 	    join = create_nstate(JOIN, NULL, NULL);
 397 | 	    l.tail->nexta = join;
 398 | 	    r.tail->nexta = join;
 399 | 	    return chunk(split, join);
 400 | 	case '*':
 401 | 	    l = nft(n->l, mode);
 402 | 	    split = create_nstate(n->val ? SPLITNG : SPLIT, NULL, l.head);
 403 | 	    l.tail->nexta = split;
 404 | 	    return chunk(split, split);
 405 | 	case '?':
 406 | 	    l = nft(n->l, mode);
 407 | 	    join = create_nstate(JOIN, NULL, NULL);
 408 | 	    split = create_nstate(n->val ? SPLITNG : SPLIT, join, l.head);
 409 | 	    l.tail->nexta = join;
 410 | 	    return chunk(split, join);
 411 | 	case '+':
 412 | 	    l = nft(n->l, mode);
 413 | 	    split = create_nstate(n->val ? SPLITNG : SPLIT, NULL, l.head);
 414 | 	    l.tail->nexta = split;
 415 | 	    return chunk(l.head, split);
 416 | 	case ':':
 417 | 	    if (n->l->type == 'e') // implicit eps left operand
 418 | 	    	return nft(n->r, 2);
 419 | 	    if (n->r->type == 'e') // implicit eps right operand
 420 | 		return nft(n->l, 1);
 421 | 	    // both operands present: consume the left, then produce the right
 422 | 	    l = nft(n->l, 1);
 423 | 	    r = nft(n->r, 2);
 424 | 	    l.tail->nexta = r.head;
 425 | 	    return chunk(l.head, r.tail);
 426 | 	case '-':
 427 | 	    if (n->l->type == 'c' && n->r->type == 'c') {
 428 | 		join = create_nstate(JOIN, NULL, NULL);
 429 | 		psplit = NULL;
 430 | 		for(int c=n->r->val; c >= n->l->val; c--){
 431 | 		    l = nft(create_nodev('c', c), mode);
 432 | 		    split = create_nstate(SPLITNG, l.head, psplit);
 433 | 		    l.tail->nexta = join;
 434 | 		    psplit = split;
 435 | 		}
 436 | 		return chunk(psplit, join);
 437 | 	    } else if (n->l->type == ':' && n->r->type == ':') {
 438 | 		join = create_nstate(JOIN, NULL, NULL);
 439 | 		psplit = NULL;
 440 | 
 441 | 		llv = n->l->l->val;
 442 | 		lrv = n->l->r->val;
 443 | 		rlv = n->r->l->val;
 444 | 		for(int c=rlv-llv; c >= 0; c--) {
 445 | 		    l = nft(create_node(':',
 446 | 		    	    create_nodev('c', llv+c),
 447 | 		    	    create_nodev('c', lrv+c)
 448 | 		    	    ), mode);
 449 | 		    split = create_nstate(SPLITNG, l.head, psplit);
 450 | 		    l.tail->nexta = join;
 451 | 		    psplit = split;
 452 | 		}
 453 | 		return chunk(psplit, join);
 454 | 	    }
 455 | 	    else {
 456 | 		fprintf(stderr, "error: unexpected range syntax\n");
 457 | 		exit(EXIT_FAILURE);
 458 | 	    }
 459 | 	case 'I':
 460 | 	    lb = n->r->type;		/* lower repetition bound, stored in the type field */
 461 | 	    rb = n->r->val;		/* upper repetition bound; 0 means unbounded */
 462 | 	    head = tail = create_nstate(JOIN, NULL, NULL);
 463 | 
 464 | 	    for(int i=0; i <lb; i++) {
 465 | 		l = nft(n->l, mode);
 466 | 		tail->nexta = l.head;
 467 | 		tail = l.tail;
 468 | 	    }
 469 | 
 470 | 	    if(rb == 0) {
 471 | 		l = nft(create_node('*', n->l, NULL), mode);
 472 | 		tail->nexta = l.head;
 473 | 		tail = l.tail;
 474 | 	    } else {
 475 | 		final = create_nstate(JOIN, NULL, NULL);
 476 | 		for(int i=lb; i < rb; i++) {
 477 | 		    l = nft(n->l, mode);
 478 | 		    // TODO: add non-greed modifier
 479 | 		    tail->nexta = create_nstate(n->val ? SPLITNG : SPLIT, final, l.head);
 480 | 		    tail = l.tail;
 481 | 		}
 482 | 		tail->nexta = final;
 483 | 		tail = final;
 484 | 	    }
 485 | 
 486 | 	    return chunk(head, tail);
 487 | 	default: 	 	//character
 488 | 	    if (mode == 0) {
 489 | 		cstate = create_nstate(CONS, NULL, NULL);
 490 | 		pstate = create_nstate(PROD, NULL, NULL);
 491 | 		cstate->val = n->val;
 492 | 		pstate->val = n->val;
 493 | 		cstate->nexta = pstate;
 494 | 		return chunk(cstate, pstate);
 495 | 	    }
 496 | 	    else if (mode == 1) {
 497 | 		state = create_nstate(CONS, NULL, NULL);
 498 | 		state->val=n->val;
 499 | 	    } else {
 500 | 		state = create_nstate(PROD, NULL, NULL);
 501 | 		state->val=n->val;
 502 | 	    }
 503 | 	    return chunk(state, state);
 504 |     }
 505 | }
 506 | 
 507 | /*
 508 | struct nstate* create_nft(struct node *root) {
 509 |     struct nstate *final = create_nstate(FINAL, NULL, NULL);
 510 |     struct nchunk ch = nft(root, 0);
 511 |     ch.tail->nexta = final;
 512 |     return ch.head;
 513 | }
 514 | */
 515 | 
 516 | struct nstate* create_nft(struct node *root) {
 517 |     struct nstate *final = create_nstate(FINAL, NULL, NULL);
 518 |     struct nstate *init = create_nstate(JOIN, NULL, NULL);
 519 |     struct nchunk ch = nft(root, 0);
 520 |     ch.tail->nexta = final;
 521 |     init->nexta = ch.head;
 522 |     return init;
 523 | }
 524 | 
 525 | struct sitem {
 526 |     struct nstate *s;
 527 |     size_t i;
 528 |     size_t o;
 529 | };
 530 | 
 531 | struct sstack {
 532 |     struct sitem *items;
 533 |     size_t n_items;
 534 |     size_t capacity;
 535 | };
 536 | 
 537 | struct sstack * screate(size_t capacity) {
 538 |     struct sstack *stack;
 539 | 
 540 |     stack = malloc(sizeof(struct sstack));
 541 |     /* allocate the items only once the stack itself is known to be valid */
 542 |     if (stack == NULL || (stack->items = malloc(capacity * sizeof(struct sitem))) == NULL) {
 543 | 	fprintf(stderr, "error: stack memory allocation failed\n");
 544 | 	exit(EXIT_FAILURE);
 545 |     }
 546 |     stack->n_items = 0;
 547 |     stack->capacity = capacity;
 548 |     return stack;
 549 | }
 550 | 
 551 | void sresize(struct sstack *stack, size_t new_capacity) {
 552 |     stack->items = realloc(stack->items, new_capacity * sizeof(struct sitem));
 553 |     if (stack->items == NULL) {
 554 | 	fprintf(stderr, "error: stack memory re-allocation failed\n");
 555 | 	exit(EXIT_FAILURE);
 556 |     }
 557 |     stack->capacity = new_capacity;
 558 | }
 559 | 
 560 | void spush(struct sstack *stack, struct nstate *s, size_t i, size_t o) {
 561 |     struct sitem *it;
 562 |     if (stack->n_items == stack->capacity) {
 563 |         if (stack->capacity * 2 > STACK_MAX_CAPACITY) {
 564 | 	    fprintf(stderr, "error: stack max capacity reached\n");
 565 | 	    exit(EXIT_FAILURE);
 566 | 	}
 567 | 	sresize(stack, stack->capacity * 2);
 568 |     }
 569 |     it = &stack->items[stack->n_items];
 570 |     it->s = s;
 571 |     it->i = i;
 572 |     it->o = o;
 573 | 
 574 |     stack->n_items++;
 575 | }
 576 | 
 577 | size_t spop(struct sstack *stack, struct nstate **s, size_t *i, size_t *o) {
 578 |     struct sitem *it;
 579 |     if (stack->n_items == 0) {
 580 | 	fprintf(stderr, "error: stack underflow\n");
 581 |         exit(EXIT_FAILURE);
 582 |     }
 583 |     stack->n_items--;
 584 |     it = &stack->items[stack->n_items];
 585 |     *s = it->s;
 586 |     *i = it->i;
 587 |     *o = it->o;
 588 | 
 589 |     return stack->n_items;
 590 | }
 591 | 
 592 | 
 593 | // Function to dynamically resize the output array
 594 | char* resize_output(char *output, size_t *capacity) {
 595 |     *capacity *= 2; // Double the capacity
 596 |     output = realloc(output, *capacity * sizeof(char));
 597 |     if (!output) {
 598 |         fprintf(stderr, "error: memory reallocation failed for output\n");
 599 |         exit(EXIT_FAILURE);
 600 |     }
 601 |     return output;
 602 | }
 603 | 
 604 | // Main NFT traversal function (depth-first)
 605 | ssize_t infer_backtrack(struct nstate *start, char *input, struct sstack *stack, enum infer_mode mode) {
 606 |     size_t i = 0, o = 0;
 607 |     struct nstate *s = start;
 608 | 
 609 |     while (stack->n_items || s) {
 610 |         if (!s) {
 611 | 	    spop(stack, &s, &i, &o);
 612 |             if (!s) {
 613 |                 continue;
 614 |             }
 615 |         }
 616 |         // Resize output array if necessary
 617 |         if (o >= output_capacity - 1) {
 618 |             output = resize_output(output, &output_capacity);
 619 |         }
 620 | 
 621 |         switch (s->type) {
 622 |             case CONS:
 623 |                 if (input[i] != '\0' && s->val == input[i]) {
 624 |                     i++;
 625 |                     s = s->nexta;
 626 |                 } else {
 627 |                     s = NULL;
 628 |                 }
 629 |                 break;
 630 |             case PROD:
 631 |                 output[o++] = s->val;
 632 |                 s = s->nexta;
 633 |                 break;
 634 |             case SPLIT:
 635 |                 spush(stack, s->nexta, i, o);
 636 |                 s = s->nextb;
 637 |                 break;
 638 |             case SPLITNG:
 639 |                 spush(stack, s->nextb, i, o);
 640 |                 s = s->nexta;
 641 |                 break;
 642 |             case JOIN:
 643 |                 s = s->nexta;
 644 |                 break;
 645 |             case FINAL:
 646 |             	if (mode == MATCH) {
 647 | 		    if (input[i] == '\0') {
 648 | 			output[o] = '\0'; // Null-terminate the output string
 649 | 			fputs(output, stdout);
 650 | 			fputs("\n", stdout);
 651 | 		    }
 652 | 		    s = NULL;
 653 | 		} else {
 654 | 		    output[o] = '\0'; // Null-terminate the output string
 655 | 		    fputs(output, stdout);
 656 | 		    return i;
 657 | 		}
 658 |                 break;
 659 |             default:
 660 |                 fprintf(stderr, "error: unknown state type\n");
 661 |                 exit(EXIT_FAILURE);
 662 |         }
 663 |     }
 664 |     return -1;
 665 | }
 666 | 
 667 | int stack_lookup(struct nstate **b, struct nstate **e, struct nstate *v) {
 668 |     while(b != e)
 669 | 	if (v == *(--e)) return 1;
 670 |     return 0;
 671 | }
 672 | 
 673 | 
 674 | 
 675 | void plot_nft(struct nstate *start) {
 676 |     struct nstate *stack[1024];
 677 |     struct nstate *visited[1024];
 678 | 
 679 |     struct nstate **sp = stack;
 680 |     struct nstate **vp = visited;
 681 |     struct nstate *s = start;
 682 |     char l,m;
 683 | 
 684 |     printf("digraph G {\n\tsplines=true; rankdir=LR;\n");
 685 | 
 686 |     push(sp, s);
 687 |     while (sp != stack) {
 688 |         s = pop(sp);
 689 |         push(vp, s);
 690 | 
 691 |         if (s->type == FINAL)
 692 |             printf("\t\"%p\" [peripheries=2, label=\"\"];\n", (void*)s);
 693 |         else {
 694 |             switch(s->type) {
 695 | 		case PROD: 	l=s->val; m='+'; break;
 696 | 		case CONS: 	l=s->val; m='-'; break;
 697 | 		case SPLITNG: 	l='S'; m='n'; break;
 698 | 		case SPLIT: 	l='S'; m=' '; break;
 699 | 		case JOIN: 	l='J'; m=' '; break;
 700 | 		default:	l=' '; m=' ';break;
 701 | 	    }
 702 |             printf("\t\"%p\" [label=\"%c%c\"];\n", (void*)s, l, m);
 703 |         }
 704 | 
 705 |         if (s->nexta) {
 706 |             printf("\t\"%p\" -> \"%p\";\n", (void*)s, (void*)s->nexta);
 707 |             if (!stack_lookup(visited, vp, s->nexta))
 708 |                 push(sp, s->nexta);
 709 |         }
 710 | 
 711 |         if (s->nextb) {
 712 |             printf("\t\"%p\" -> \"%p\" [label=\"%c\"];\n", (void*)s, (void*)s->nextb, '*');
 713 |             if (!stack_lookup(visited, vp, s->nextb))
 714 |                 push(sp, s->nextb);
 715 |         }
 716 |     }
 717 |     printf("}\n");
 718 | }
 719 | 
 720 | struct str_item {
 721 |     unsigned char c;
 722 |     struct str_item *next;
 723 | };
 724 | 
 725 | struct str {
 726 |     struct str_item *head;
 727 |     struct str_item *tail;
 728 | };
 729 | 
 730 | struct str * str_create() {
 731 |     struct str *st;
 732 |     st = malloc(sizeof(struct str));
 733 |     st->head = st->tail = NULL;
 734 |     return st;
 735 | }
 736 | 
 737 | struct str * str_append(struct str *s, unsigned char c) {
 738 |     struct str_item *si;
 739 |     si = malloc(sizeof(struct str_item));
 740 |     si->c = c;
 741 |     si->next = NULL;
 742 | 
 743 |     if (s->tail) {
 744 |     	s->tail->next = si;
 745 |     	s->tail = si;
 746 |     } else {
 747 |     	s->tail = s->head = si;
 748 |     }
 749 |     return s;
 750 | }
 751 | 
 752 | char str_popleft(struct str *s) {
 753 |     unsigned char c;
 754 |     struct str_item *tmp;
 755 | 
 756 |     if (s->head == NULL) {
 757 | 	fprintf(stderr, "error: string underflow\n");
 758 |     	exit(EXIT_FAILURE);
 759 |     }
 760 | 
 761 |     // a single-item list (head == tail) is handled below
 762 |     c = s->head->c;
 763 | 
 764 |     if (s->head == s->tail) {
 765 |     	free(s->head);
 766 |     	s->head = s->tail = NULL;
 767 |     } else {
 768 | 	tmp = s->head;
 769 | 	s->head = s->head->next;
 770 | 	free(tmp);
 771 |     }
 772 | 
 773 |     return c;
 774 | }
 775 | 
 776 | struct str * str_append_str(struct str *dst, struct str *src) {
 777 |     for(struct str_item *si = src->head; si != NULL; si = si->next) {
 778 |     	str_append(dst, si->c);
 779 |     }
 780 |     return dst;
 781 | }
 782 | 
 783 | struct str * str_copy(struct str* src) {
 784 |     struct str *dst = str_create();
 785 |     assert(src);
 786 | 
 787 |     for(struct str_item *si = src->head; si != NULL; si = si->next) {
 788 |     	str_append(dst, si->c);
 789 |     }
 790 |     return dst;
 791 | }
 792 | 
 793 | void str_free(struct str* s) {
 794 |     struct str_item *si, *next;
 795 |     if (!s) return;
 796 | 
 797 |     for(si = s->head; si != NULL;) {
 798 |     	next = si->next;
 799 |     	free(si);
 800 |     	si = next;
 801 |     }
 802 |     free(s);
 803 | }
 804 | 
 805 | void str_print(struct str *s) {
 806 |     assert(s);
 807 |     for(struct str_item *si=s->head; si != NULL; si=si->next)
 808 |     	fputc(si->c, stdout);
 809 | }
 810 | 
 811 | void str_to_char(struct str *s, unsigned char *ch) {
 812 |     assert(s);
 813 |     int i = 0;
 814 |     for(struct str_item *si=s->head; si != NULL; si=si->next)
 815 |     	ch[i++] = si->c;
 816 |     ch[i] = '\0';
 817 | }
 818 | 
 819 | 
 820 | 
 821 | struct slist {
 822 |     struct slitem *head;
 823 |     struct slitem *tail;
 824 |     size_t n;
 825 | };
 826 | 
 827 | struct slitem {
 828 |     struct nstate *state;
 829 |     struct str *suffix;
 830 |     struct slitem *next;
 831 | };
 832 | 
 833 | struct slist * slist_create() {
 834 |     struct slist *sl;
 835 |     sl = malloc(sizeof(struct slist));
 836 |     sl->head = sl->tail = NULL;
 837 |     sl->n = 0;
 838 | 
 839 |     return sl;
 840 | }
 841 | 
 842 | struct slist * slist_append(struct slist *sl, struct nstate *state, struct str *suffix) {
 843 |     struct slitem *li;
 844 |     li = malloc(sizeof(struct slitem));
 845 |     li->state = state;
 846 |     li->suffix = suffix;
 847 |     li->next = NULL;
 848 | 
 849 |     if (sl->tail) {
 850 |     	sl->tail->next = li;
 851 |     	sl->tail = li;
 852 |     } else {
 853 |     	sl->tail = sl->head = li;
 854 |     }
 855 |     sl->n += 1;
 856 |     return sl;
 857 | }
 858 | 
 859 | 
 860 | void slist_free(struct slist* sl) {
 861 |     struct slitem *li, *next;
 862 |     if (!sl) return;
 863 | 
 864 |     for(li = sl->head; li != NULL;) {
 865 |     	next = li->next;
 866 |     	str_free(li->suffix);
 867 |     	free(li);
 868 |     	li = next;
 869 |     }
 870 |     free(sl);
 871 | }
 872 | 
 873 | 
 874 | void nft_step_(struct nstate *s, struct str *o, unsigned char c, struct slist *sl) {
 875 |     if (!s) return;
 876 | 
 877 | 
 878 |     switch(s->type) {
 879 | 	case SPLIT:
 880 | 	    nft_step_(s->nextb, o, c, sl);
 881 | 	    nft_step_(s->nexta, str_copy(o), c, sl);
 882 | 	    break;
 883 | 	case SPLITNG:
 884 | 	    nft_step_(s->nexta, o, c, sl);
 885 | 	    nft_step_(s->nextb, str_copy(o), c, sl);
 886 | 	    break;
 887 | 	case JOIN:
 888 | 	    nft_step_(s->nexta, o, c, sl);
 889 | 	    break;
 890 | 	case PROD:
 891 | 	    nft_step_(s->nexta, str_append(o, s->val), c, sl);
 892 | 	    break;
 893 | 	case CONS:	// found CONS state marked with 'c'
 894 | 	    if (c == s->val && s->visited == 0) {
 895 | 	    	s->visited = 1;
 896 | 		slist_append(sl, s, o);
 897 | 	    }
 898 | 	    break;
 899 | 	case FINAL:
 900 | 	    if(c == '\0' && s->visited == 0) {	/* final states closure */
 901 | 		slist_append(sl, s, o);
 902 | 	    	s->visited = 1;
 903 | 	    }
 904 | 	    return;
 905 |     }
 906 |     return;
 907 | }
 908 | 
 909 | 
 910 | struct slist * nft_step(struct slist *states, unsigned char c) {
 911 |     struct slist *sl = slist_create();
 912 |     struct slitem *li;
 913 | 
 914 |     for(li=states->head; li; li=li->next)
 915 | 	nft_step_(li->state->nexta, str_copy(li->suffix), c, sl);
 916 | 
 917 |     /* reset the visited flag; this pass is linear,
 918 |      * but the list is expected to be short */
 919 |     for(li=sl->head; li; li=li->next)
 920 |     	li->state->visited = 0;
 921 | 
 922 |     return sl;
 923 | }
 924 | 
 925 | 
 926 | struct dstate {
 927 |     struct slist *states;
 928 |     struct str *final_out;
 929 |     int8_t final;
 930 |     struct str *out[256];
 931 |     struct dstate *next[256];
 932 | };
 933 | 
 934 | struct dstate * dstate_create(struct slist *states) {
 935 |     struct dstate *ds;
 936 |     ds = malloc(sizeof(struct dstate));
 937 |     ds->states = states;
 938 |     ds->final = -1;
 939 |     memset(ds->next, 0, sizeof ds->next);
 940 |     memset(ds->out, 0, sizeof ds->out);
 941 |     return ds;
 942 | }
 943 | 
 944 | int dstack_lookup(struct dstate **b, struct dstate **e, struct dstate *v) {
 945 |     while(b != e)
 946 | 	if (v == *(--e)) return 1;
 947 |     return 0;
 948 | }
 949 | 
 950 | 
 951 | void plot_dft(struct dstate *dstart) {
 952 |     struct dstate *stack[1024];
 953 |     struct dstate *visited[1024];
 954 | 
 955 |     struct dstate **sp = stack;
 956 |     struct dstate **vp = visited;
 957 |     struct dstate *s = dstart, *s_next = NULL;
 958 |     unsigned char out[32], label[32];
 959 | 
 960 |     printf("digraph G {\n\tsplines=true; rankdir=LR;\n");
 961 | 
 962 |     push(sp, s);
 963 |     while (sp != stack) {
 964 |         s = pop(sp);
 965 |         push(vp, s);
 966 | 
 967 |         if (s->final == 1) {
 968 |             str_to_char(s->final_out, out);
 969 | 	    printf("\t\"%p\" [peripheries=2, label=\"%s\"];\n", (void*)s, out);
 970 | 	}
 971 |         else
 972 |             printf("\t\"%p\" [label=\"\"];\n", (void*)s);
 973 | 
 974 | 	for(int c=0; c < 256; c++) {
 975 | 	    if ((s_next = s->next[c]) != NULL) {
 976 | 	    	str_to_char(s->out[c], label);
 977 | 		printf("\t\"%p\" -> \"%p\" [label=\"%c:%s\"];\n", (void*)s, (void*)s_next, c, label);
 978 | 		if (!dstack_lookup(visited, vp, s_next))
 979 | 		    push(sp, s_next);
 980 | 	    }
 981 |         }
 982 |     }
 983 |     printf("}\n");
 984 | }
 985 | 
 986 | 
 987 | 
 988 | struct str * truncate_lcp(struct slist *sl, struct str *prefix) {
 989 |     struct slitem *li;
 990 |     unsigned char ch;
 991 | 
 992 |     for(;;) {
 993 |     	if(!sl->head->suffix->head) 			/* end of lcp */
 994 |     	    return prefix;
 995 | 
 996 |     	ch = sl->head->suffix->head->c;			/* first char of the first suffix */
 997 | 	for(li=sl->head; li; li=li->next) {
 998 | 	    if (!li->suffix || !li->suffix->head || li->suffix->head->c != ch)
 999 | 		return prefix;
1000 | 	}
1001 | 
1002 | 	/* we found the common char,
1003 | 	 * - add the char to the prefix,
1004 | 	 * - truncate the suffixes */
1005 | 	str_append(prefix, ch);
1006 | 	for(li=sl->head; li; li=li->next)
1007 | 	    str_popleft(li->suffix);
1008 |     }
1009 |     return prefix;
1010 | }
1011 | 
1012 | /* lexicographic comparison */
1013 | int str_cmp(struct str *a, struct str *b) {
1014 |     struct str_item *ai, *bi;
1015 |     for(ai=a->head, bi=b->head; ai && bi; ai=ai->next, bi=bi->next)
1016 |     	if (ai->c < bi->c)
1017 |     	    return -1;
1018 | 	else if (ai->c > bi->c)
1019 |     	    return 1;
1020 |     if (ai)
1021 | 	return 1;
1022 |     if (bi)
1023 | 	return -1;
1024 | 
1025 |     return 0;
1026 | }
1027 | 
1028 | int list_cmp(struct slist *a, struct slist *b) {
1029 |     struct slitem *ai, *bi;
1030 |     int sign;
1031 | 
1032 |     if(a->n < b->n)
1033 |     	return -1;
1034 |     if(a->n > b->n)
1035 |     	return 1;
1036 | 
1037 |     for(ai=a->head, bi=b->head; ai && bi; ai=ai->next, bi=bi->next)
1038 | 	if(ai->state < bi->state)
1039 | 	    return -1;
1040 | 	else if(ai->state > bi->state)
1041 | 	    return 1;
1042 | 	else {
1043 | 	    sign = str_cmp(ai->suffix, bi->suffix);
1044 | 	    if (sign != 0)
1045 | 		return sign;
1046 | 	}
1047 | 
1048 |     return 0;
1049 | }
1050 | 
1051 | 
1052 | // Define the structure for the tree node
1053 | struct btnode {
1054 |     struct dstate *ds;
1055 |     struct btnode *l;
1056 |     struct btnode *r;
1057 | };
1058 | 
1059 | // Function to create a new node with given data
1060 | struct btnode* bt_create(struct dstate *ds) {
1061 |     struct btnode* node;
1062 |     node = malloc(sizeof(struct btnode));
1063 |     if (node == NULL) {
1064 |         fprintf(stderr, "error: binary tree node allocation failed\n");
1065 | 	exit(EXIT_FAILURE);
1066 |     }
1067 |     node->ds = ds;
1068 |     node->l = NULL;
1069 |     node->r = NULL;
1070 |     return node;
1071 | }
1072 | 
1073 | // Lookup function to search for a value in the binary tree
1074 | struct dstate* bt_lookup(struct btnode *n, struct slist *sl) {
1075 |     int sign = 0;
1076 | 
1077 |     while (n != NULL) {
1078 |     	sign = list_cmp(sl, n->ds->states);
1079 |         if (sign < 0)		/* less */
1080 |             n = n->l;
1081 |         else if (sign > 0)	/* more */
1082 |             n = n->r;
1083 |         else
1084 |             return n->ds; 	/* found */
1085 |     }
1086 |     return NULL;
1087 | }
1088 | 
1089 | // Function to insert nodes to form a binary search tree
1090 | struct btnode* bt_insert(struct btnode* n, struct dstate *ds) {
1091 |     int sign;
1092 |     if (n == NULL)
1093 |         return bt_create(ds);
1094 |     sign = list_cmp(ds->states, n->ds->states);
1095 |     if (sign < 0)
1096 |         n->l = bt_insert(n->l, ds);
1097 |     else if (sign > 0)
1098 | 	n->r = bt_insert(n->r, ds);
1099 |     return n;
1100 | }
1101 | 
1102 | void bt_free(struct btnode* root) {
1103 |     if (root == NULL) return;
1104 |     bt_free(root->l);
1105 |     bt_free(root->r);
1106 |     free(root);
1107 | }
1108 | 
1109 | 
1110 | int infer_dft(struct dstate *dstart, unsigned char *inp, struct btnode *dcache, enum infer_mode mode) {
1111 |     struct dstate *ds_next, *ds=dstart;
1112 |     struct str *prefix, *out = str_create();
1113 |     struct slist *sl;
1114 | 
1115 |     unsigned char *c;
1116 |     int i = 0;
1117 | 
1118 |     for(c=inp; *c != '\0'; c++, i++) {
1119 | 
1120 | 	if (mode == SCAN && ds->final == 1) {
1121 | 	    str_print(out);
1122 | 	    str_print(ds->final_out);
1123 | 	    str_free(out);
1124 | 	    return i;
1125 | 	}
1126 | 
1127 | 	// todo: double check this logic
1128 | 	if (ds->next[*c] != NULL) {
1129 | 	    str_append_str(out, ds->out[*c]);
1130 | 	    ds = ds->next[*c];
1131 | 	}
1132 | 	else if (ds->out[*c] != NULL) {				/* already explored but found nothing */
1133 | 	    break;
1134 | 	}
1135 | 	else {							/* not explored, explore */
1136 | 	    sl = nft_step(ds->states, *c);
1137 | 
1138 | 	    /* expand each state and accumulate CONS states labeled with c */
1139 | 	    if (!sl->head) {					/* got empty list; mark as explored and exit */
1140 | 	    	ds->out[*c] = (struct str*)1;			/* fixme: can we create a better indicator? */
1141 | 	    	slist_free(sl);
1142 | 	    	break;
1143 | 	    }
1144 | 
1145 | 	    prefix = str_create();
1146 | 	    truncate_lcp(sl, prefix);
1147 | 
1148 | 	    if ((ds_next = bt_lookup(dcache, sl)) != NULL) {
1149 | 	    	ds->next[*c] = ds_next;
1150 | 	    	ds->out[*c] = prefix;
1151 | 	    	slist_free(sl);					/* no need for the list */
1152 | 	    } else {
1153 | 	    	ds_next = dstate_create(sl);
1154 | 	    	ds->next[*c] = ds_next;
1155 | 	    	ds->out[*c] = prefix;
1156 | 	    	bt_insert(dcache, ds_next);
1157 | 	    }
1158 | 
1159 | 	    str_append_str(out, ds->out[*c]);
1160 | 	    ds = ds_next;
1161 | 	    //str_free(prefix);
1162 | 
1163 | 	    // refactor this: do not make closure several times
1164 | 	    if (ds->final < 0) {					/* unexplored */
1165 | 		sl = nft_step(ds->states, '\0');
1166 | 
1167 | 		if (sl->head) {						/* got final states; take the first one */
1168 | 		    ds->final = 1;
1169 | 		    ds->final_out = str_copy(sl->head->suffix);
1170 | 		    // we do not need the 'sl' list anymore
1171 | 		} else {
1172 | 		    ds->final = 0;
1173 | 		}
1174 | 	    }
1175 | 	}
1176 |     }
1177 | 
1178 |     if (mode == SCAN && ds->final == 1) {
1179 | 	str_print(out);
1180 | 	str_print(ds->final_out);
1181 | 	str_free(out); return i;	/* free the accumulator, as the in-loop return does */
1182 |     }
1183 | 
1184 | 
1185 |     /*
1186 |     //if (mode == MATCH && ds->final == 1 && *c == '\0') {
1187 |     if (ds->final == 1 && *c == '\0') {
1188 | 	str_print(out);
1189 | 	str_print(ds->final_out);
1190 |     }
1191 |     */
1192 | 
1193 |     str_free(out);
1194 | 
1195 |     return -1;
1196 | }
1197 | 
1198 | 
1199 | int main(int argc, char **argv)
1200 | {
1201 |     FILE *fp;
1202 |     char *expr;
1203 |     ssize_t read, ioffset;
1204 |     size_t input_len;
1205 |     //unsigned char *line = NULL, *input_fn, *ch;
1206 |     char *line = NULL, *input_fn, *ch;
1207 |     struct node *root;
1208 |     struct nstate *start;
1209 |     //struct sstack *stack = screate(32);
1210 |     struct dstate *dstart;
1211 |     struct btnode *dcache;
1212 |     enum infer_mode mode = SCAN;
1213 | 
1214 | 
1215 |     int opt, debug=0;
1216 | 
1217 |     while ((opt = getopt(argc, argv, "dma")) != -1) {
1218 | 	switch (opt) {
1219 | 	    case 'd':
1220 | 		debug = 1;
1221 | 		break;
1222 | 	    case 'm':
1223 | 		mode = MATCH;
1224 | 		break;
1225 | 	    case 'a':
1226 | 		fprintf(stderr, "Not supported yet\n");
1227 | 		exit(EXIT_FAILURE);
1228 | 	    default: /* '?' */
1229 | 		fprintf(stderr, "Usage: %s [-dma] expr [file]\n",
1230 | 		       argv[0]);
1231 | 		exit(EXIT_FAILURE);
1232 | 	}
1233 |     }
1234 | 
1235 |     if (optind >= argc) {
1236 |        fprintf(stderr, "error: missing trre expression\n");
1237 |        exit(EXIT_FAILURE);
1238 |     }
1239 | 
1240 |     expr = argv[optind];
1241 |     root = parse(expr);
1242 | 
1243 |     start = create_nft(root);
1244 | 
1245 |     if (debug) {
1246 | 	//plot_ast(root);
1247 | 	plot_nft(start);
1248 |     }
1249 | 
1250 |     //printf("%c\n", start->type);
1251 | 
1252 |     // todo: can we do better?
1253 |     output = malloc(output_capacity*sizeof(char));
1254 | 
1255 |     struct slist *sl_init = slist_create();
1256 |     slist_append(sl_init, start, str_create());
1257 | 
1258 |     dstart = dstate_create(sl_init);
1259 |     dcache = bt_create(dstart);
1260 | 
1261 |     if (optind == argc - 2) {		// filename provided
1262 | 	input_fn = argv[optind + 1];
1263 | 
1264 | 	fp = fopen(input_fn, "r");
1265 | 	if (fp == NULL) {
1266 | 	    fprintf(stderr, "error: can not open file %s\n", input_fn);
1267 | 	    exit(EXIT_FAILURE);
1268 | 	}
1269 |     } else
1270 |     	fp = stdin;
1271 | 
1272 |     if (mode == SCAN) {
1273 | 	while ((read = getline(&line, &input_len, fp)) != -1) {
1274 | 	    if (line[read-1] == '\n') line[read-1] = '\0';	/* strip the newline only if present */
1275 | 	    ch = line;
1276 | 
1277 | 	    while (*ch != '\0') {
1278 | 		ioffset = infer_dft(dstart, (unsigned char*)ch, dcache, mode);
1279 | 		if (ioffset > 0)
1280 | 		    ch += ioffset;
1281 | 		else
1282 | 		    fputc(*ch++, stdout);
1283 | 	    }
1284 | 	    infer_dft(dstart, (unsigned char*)ch, dcache, mode);
1285 | 	    fputc('\n', stdout);
1286 | 	}
1287 |     } else {	/* MATCH mode and generator */
1288 | 	while ((read = getline(&line, &input_len, fp)) != -1) {
1289 | 	    if (line[read-1] == '\n') line[read-1] = '\0';	/* strip the newline only if present */
1290 | 	    ioffset = infer_dft(dstart, (unsigned char*)line, dcache, mode);
1291 | 	    fputc('\n', stdout);
1292 | 	}
1293 |     }
1294 |     if (debug) {
1295 |     	plot_dft(dstart);
1296 |     }
1297 | 
1298 |     fclose(fp);
1299 |     if (line)
1300 |         free(line);
1301 |     return 0;
1302 | }
1303 | 


--------------------------------------------------------------------------------
/trre_nft.c:
--------------------------------------------------------------------------------
  1 | #define _GNU_SOURCE
  2 | #include <stdio.h>
  3 | #include <string.h>
  4 | #include <stdlib.h>
  5 | #include <assert.h>
  6 | #include <stdint.h>
  7 | #include <unistd.h>
  8 | 
  9 | 
 10 | /* precedence table */
 11 | int prec(char c) {
 12 |     switch (c) {
 13 |     	case '|': 		return 1;
 14 | 	case '-':		return 2;
 15 |     	case ':':		return 3;
 16 |     	case '.': 		return 4;
 17 | 	case '?': case '*':
 18 | 	case '+': case 'I': 	return 5;
 19 | 	case '\\':		return 6;
 20 |     }
 21 |     return -1;
 22 | }
 23 | 
 24 | struct node {
 25 |     unsigned char type;
 26 |     unsigned char val;
 27 |     struct node * l;
 28 |     struct node * r;
 29 | };
 30 | 
 31 | #define push(stack, item) 	*stack++ = item
 32 | #define pop(stack) 		*--stack
 33 | #define top(stack) 		*(stack-1)
 34 | 
 35 | #define STACK_INIT_CAPACITY	32
 36 | #define STACK_MAX_CAPACITY	100000
 37 | 
 38 | static unsigned char operators[1024];
 39 | static struct node *operands[1024];
 40 | 
 41 | static unsigned char *opr = operators;
 42 | static struct node **opd = operands;
 43 | 
 44 | static char* output;
 45 | static size_t output_capacity=32;
 46 | 
 47 | 
 48 | struct node * create_node(char type, struct node *l, struct node *r) {
 49 |     struct node *node = malloc(sizeof(struct node));
 50 |     if (!node) {
 51 | 	fprintf(stderr, "error: node memory allocation failed\n");
 52 | 	exit(EXIT_FAILURE);
 53 |     }
 54 |     node->type = type;
 55 |     node->val = 0;
 56 |     node->l = l;
 57 |     node->r = r;
 58 |     return node;
 59 | }
 60 | 
 61 | // helper: create a leaf node that carries a value
 62 | struct node * create_nodev(unsigned char type, unsigned char val) {
 63 |     struct node *node = create_node(type, NULL, NULL);
 64 |     node->val = val;
 65 |     return node;
 66 | }
 67 | 
 68 | enum infer_mode {
 69 |     MODE_SCAN,
 70 |     MODE_MATCH,
 71 |     MODE_GEN,
 72 | };
 73 | 
 74 | 
 75 | // postfix operators are reduced immediately: wrap the operand on top of the stack (ng = non-greedy flag)
 76 | void reduce_postfix(char op, int ng) {
 77 |     struct node *l, *r;
 78 | 
 79 |     switch(op) {
 80 | 	case '*': case '+': case '?':
 81 | 	    l = pop(opd);
 82 | 	    r = create_node(op, l, NULL);
 83 | 	    r->val = ng;
 84 | 	    push(opd, r);
 85 | 	    break;
 86 | 	default:
 87 | 	    fprintf(stderr, "error: unexpected postfix operator\n");
 88 | 	    exit(EXIT_FAILURE);
 89 |     }
 90 | }
 91 | 
 92 | 
 93 | void reduce() {
 94 |     char op;
 95 |     struct node *l, *r;
 96 | 
 97 |     op = pop(opr);
 98 |     switch(op) {
 99 | 	case '|': case '.': case ':': case '-':
100 | 	    r = pop(opd);
101 | 	    l = pop(opd);
102 | 	    push(opd, create_node(op, l, r));
103 | 	    break;
104 | 	case '(':
105 | 	    fprintf(stderr, "error: unmatched parenthesis\n");
106 | 	    exit(EXIT_FAILURE);
107 |     }
108 | }
109 | 
110 | 
111 | void reduce_op(char op) {
112 |     while(opr != operators && prec(top(opr)) >= prec(op))
113 |         reduce();
114 |     push(opr, op);
115 | }
116 | 
117 | char* parse_curly_brackets(char *expr) {
118 |     int state = 0, ng = 0;
119 |     int count = 0, lv = 0;
120 |     struct node *l, *r;
121 |     char c;
122 | 
123 |     while ((c = *expr) != '\0') {
124 |         if (c >= '0' && c <= '9') {
125 |             count = count*10 + c - '0';
126 | 	} else if (c == ',') {
127 |             lv = count;
128 |             count = 0;
129 |             state += 1;
130 | 	} else if (c == '}') {
131 | 	    if(*(expr+1) == '?') {			// safe to do, we have the '\0' sentinel
132 | 		ng = 1;
133 | 		expr++;
134 | 	    }
135 |             if (state == 0)
136 |             	lv = count;
137 | 	    else if (state > 1) {
138 | 		fprintf(stderr, "error: more than one comma in curly brackets\n");
139 | 		exit(EXIT_FAILURE);
140 | 	    }
141 | 
142 | 	    r = create_nodev(lv, count);
143 | 	    l = create_node('I', pop(opd), r);
144 | 	    l->val = ng;
145 | 	    push(opd, l);
146 | 
147 |             return expr;
148 |         } else {
149 | 	    fprintf(stderr, "error: unexpected symbol in curly brackets: %c\n", c);
150 | 	    exit(EXIT_FAILURE);
151 |         }
152 |         expr++;
153 |     }
154 |     fprintf(stderr, "error: unmatched curly brackets\n");
155 |     exit(EXIT_FAILURE);
156 | }
157 | 
158 | char* parse_square_brackets(char *expr) {
159 |     char c;
160 |     int state = 0;
161 | 
162 |     while ((c = *expr) != '\0') {
163 |         if (state == 0) {              			   // expect operand
164 |             switch(c) {
165 | 		case ':': case '-': case '[': case ']':
166 | 		    fprintf(stderr, "error: unexpected symbol in square brackets: %c\n", c);
167 | 		    exit(EXIT_FAILURE);
168 | 		default:
169 | 		    push(opd, create_nodev('c', c));       // push operand
170 | 		    state = 1;
171 | 	    }
172 | 	} else {                       		   	   // expect operator
173 |             switch(c) {
174 | 		case ':': case '-':
175 | 		    reduce_op(c);
176 | 		    state = 0;
177 | 		    break;
178 | 		case ']':
179 | 		    while (opr != operators && top(opr) != '[')
180 | 			reduce();
181 | 		    --opr;              // remove [ from the stack
182 | 		    return expr;
183 | 		default:                // implicit alternation
184 | 		    reduce_op('|');
185 | 		    state = 0;
186 | 		    continue;		// skip expr increment
187 | 	    }
188 | 	}
189 |         expr++;
190 |     }
191 |     fprintf(stderr, "error: unmatched square brackets\n");
192 |     exit(EXIT_FAILURE);
193 | }
194 | 
195 | 
196 | struct node * parse(char *expr) {
197 |     unsigned char c;
198 |     int state = 0;
199 | 
200 |     while ((c = *expr) != '\0') {
201 |         if (state == 0) {                     	// expect operand
202 |             switch(c) {
203 | 		case '(':
204 | 		    push(opr, c);
205 | 		    break;
206 | 		case '[':
207 | 		    push(opr, c);
208 | 		    expr = parse_square_brackets(expr+1);
209 | 		    state = 1;
210 | 		    break;
211 | 		case '\\':
212 | 		    push(opd, create_nodev('c', *++expr));
213 | 		    state = 1;
214 | 		    break;
215 | 		case '.':
216 | 		    //push(opd, create_nodev('a', 0));
217 | 		    push(opd, create_node('-',
218 | 		    		create_nodev('c', 0),
219 | 		    		create_nodev('c', 255)));
220 | 		    state = 1;
221 | 		    break;
222 | 		case ':':					// epsilon as an implicit left operand
223 | 		    push(opd, create_nodev('e', c));
224 | 		    state = 1;
225 | 		    continue;					// stay in the same position in expr
226 | 		case '|': case '*': case '+': case '?':
227 | 		case ')': case '{': case '}':
228 | 		    if (opr != operators && top(opr) == ':') { 	// epsilon as an implicit right operand
229 | 			push(opd, create_nodev('e', c));
230 | 			state = 1;
231 | 			continue;				// stay in the same position in expr
232 | 		    } else {
233 | 			fprintf(stderr, "error: unexpected symbol %c\n", c);
234 | 			exit(EXIT_FAILURE);
235 | 		    }
236 | 		default:
237 | 		    push(opd, create_nodev('c', c));
238 | 		    state = 1;
239 |             }
240 | 	} else {               					// expect postfix or binary operator
241 | 	    switch (c) {
242 | 	    case '*': case '+': case '?':
243 |                 if(*(expr+1) == '?') {				// safe to do, we have the '\0' sentinel
244 | 		    reduce_postfix(c, 1);			// found non-greedy modifier
245 | 		    expr++;
246 |                 }
247 | 		else
248 | 		    reduce_postfix(c, 0);
249 |                 break;
250 |             case '|':
251 |                 reduce_op(c);
252 |                 state = 0;
253 |                 break;
254 |             case ':':
255 |                 if (*(expr+1) == '\0') {		// implicit epsilon as a right operand
256 |                     push(opd, create_nodev('e', c));
257 |                 }
258 | 		reduce_op(c);
259 | 		state = 0;
260 | 		break;
261 |             case '{':
262 |                 expr = parse_curly_brackets(expr+1);
263 |                 break;
264 |             case ')':
265 |                 while (opr != operators && top(opr) != '(')
266 |                     reduce();
267 |                 if (opr == operators || top(opr) != '(') {
268 | 		    fprintf(stderr, "error: unmatched parenthesis\n");
269 | 		    exit(EXIT_FAILURE);
270 | 		}
271 |                 --opr;                       	// remove ( from the stack
272 |                 break;
273 |             default:                            // implicit cat
274 |                 reduce_op('.');
275 |                 state = 0;
276 |                 continue;
277 |             }
278 | 	}
279 |         expr++;
280 |     }
281 | 
282 |     while (opr != operators) {
283 |         reduce();
284 |     }
285 |     assert (operands != opd);
286 | 
287 |     return pop(opd);
288 | }
289 | 
290 | 
291 | void plot_ast(struct node *n) {
292 |     struct node *stack[1024];
293 |     struct node **sp = stack;
294 | 
295 |     printf("digraph G {\n\tsplines=true; rankdir=TD;\n");
296 | 
297 |     push(sp, n);
298 |     while (sp != stack) {
299 |         n = pop(sp);
300 | 
301 |         if (n->type == 'c') {
302 |             //printf("ntype: %c", n->type);
303 |             printf("\t\"%p\" [peripheries=2, label=\"%u\"];\n", (void*)n, n->val);
304 |         } else {
305 |             printf("\t\"%p\" [label=\"%c\"];\n", (void*)n, n->type);
306 |         }
307 | 
308 |         if (n->l) {
309 |             printf("\t\"%p\" -> \"%p\";\n", (void*)n, (void*)n->l);
310 |             push(sp, n->l);
311 |         }
312 | 
313 |         if (n->r) {
314 |             printf("\t\"%p\" -> \"%p\" [label=\"%c\"];\n", (void*)n, (void*)n->r, 'r');
315 |             push(sp, n->r);
316 |         }
317 |     }
318 |     printf("}\n");
319 | }
320 | 
321 | 
322 | 
323 | enum nstate_type {
324 |     PROD,
325 |     CONS,
326 |     SPLIT,
327 |     SPLITNG,
328 |     JOIN,
329 |     FINAL
330 | };
331 | 
332 | 
333 | // nft state
334 | struct nstate {
335 |     enum nstate_type type;
336 |     char val;
337 |     char mode;
338 |     struct nstate *nexta;
339 |     struct nstate *nextb;
340 | };
341 | 
342 | 
343 | struct nstate* create_nstate(enum nstate_type type, struct nstate *nexta, struct nstate *nextb) {
344 |     struct nstate *state;
345 | 
346 |     state = malloc(sizeof(struct nstate));
347 |     if (state == NULL) {
348 |         fprintf(stderr, "error: nft state memory allocation failed\n");
349 |         exit(EXIT_FAILURE);
350 |     }
351 | 
352 |     state->type = type;
353 |     state->nexta = nexta;
354 |     state->nextb = nextb;
355 |     state->val = 0;
356 |     return state;
357 | }
358 | 
359 | 
360 | 
361 | struct nchunk {
362 |     struct nstate * head;
363 |     struct nstate * tail;
364 | };
365 | 
366 | 
367 | struct nchunk chunk(struct nstate *head, struct nstate *tail) {
368 |     struct nchunk chunk;
369 |     chunk.head = head;
370 |     chunk.tail = tail;
371 |     return chunk;
372 | }
373 | 
374 | 
375 | struct nchunk nft(struct node *n, char mode) {
376 |     struct nstate *split, *psplit, *join;
377 |     struct nstate *cstate, *pstate, *state, *head, *tail, *final;
378 |     struct nchunk l, r;
379 |     int llv, lrv, rlv;
380 |     int lb, rb;
381 | 
382 |     if (n == NULL)
383 |     	return chunk(NULL, NULL);
384 | 
385 |     switch(n->type) {
386 | 	case '.':
387 | 	    l = nft(n->l, mode);
388 | 	    r = nft(n->r, mode);
389 | 	    l.tail->nexta = r.head;
390 | 	    return chunk(l.head, r.tail);
391 | 	case '|':
392 | 	    l = nft(n->l, mode);
393 | 	    r = nft(n->r, mode);
394 | 	    split = create_nstate(SPLITNG, l.head, r.head);
395 | 	    join = create_nstate(JOIN, NULL, NULL);
396 | 	    l.tail->nexta = join;
397 | 	    r.tail->nexta = join;
398 | 	    return chunk(split, join);
399 | 	case '*':
400 | 	    l = nft(n->l, mode);
401 | 	    split = create_nstate(n->val ? SPLITNG : SPLIT, NULL, l.head);
402 | 	    l.tail->nexta = split;
403 | 	    return chunk(split, split);
404 | 	case '?':
405 | 	    l = nft(n->l, mode);
406 | 	    join = create_nstate(JOIN, NULL, NULL);
407 | 	    split = create_nstate(n->val ? SPLITNG : SPLIT, join, l.head);
408 | 	    l.tail->nexta = join;
409 | 	    return chunk(split, join);
410 | 	case '+':
411 | 	    l = nft(n->l, mode);
412 | 	    split = create_nstate(n->val ? SPLITNG : SPLIT, NULL, l.head);
413 | 	    l.tail->nexta = split;
414 | 	    return chunk(l.head, split);
415 | 	case ':':
416 | 	    if (n->l->type == 'e') // implicit eps left operand
417 | 	    	return nft(n->r, 2);
418 | 	    if (n->r->type == 'e') // implicit eps right operand
419 | 		return nft(n->l, 1);
420 | 	    // both operands present: consume the left side, then produce the right side
421 | 	    l = nft(n->l, 1);
422 | 	    r = nft(n->r, 2);
423 | 	    l.tail->nexta = r.head;
424 | 	    return chunk(l.head, r.tail);
425 | 	case '-':
426 | 	    if (n->l->type == 'c' && n->r->type == 'c') {
427 | 		join = create_nstate(JOIN, NULL, NULL);
428 | 		psplit = NULL;
429 | 		for(int c=n->r->val; c >= n->l->val; c--){
430 | 		    l = nft(create_nodev('c', c), mode);
431 | 		    split = create_nstate(SPLITNG, l.head, psplit);
432 | 		    l.tail->nexta = join;
433 | 		    psplit = split;
434 | 		}
435 | 		return chunk(psplit, join);
436 | 	    } else if (n->l->type == ':' && n->r->type == ':') {
437 | 		join = create_nstate(JOIN, NULL, NULL);
438 | 		psplit = NULL;
439 | 
440 | 		llv = n->l->l->val;
441 | 		lrv = n->l->r->val;
442 | 		rlv = n->r->l->val;
443 | 		for(int c=rlv-llv; c >= 0; c--) {
444 | 		    l = nft(create_node(':',
445 | 		    	    create_nodev('c', llv+c),
446 | 		    	    create_nodev('c', lrv+c)
447 | 		    	    ), mode);
448 | 		    split = create_nstate(SPLITNG, l.head, psplit);
449 | 		    l.tail->nexta = join;
450 | 		    psplit = split;
451 | 		}
452 | 		return chunk(psplit, join);
453 | 	    }
454 | 	    else {
455 | 		fprintf(stderr, "error: unexpected range syntax\n");
456 | 		exit(EXIT_FAILURE);
457 | 	    }
458 | 	case 'I':
459 | 	    lb = n->r->type;		/* lower bound, stored in the type field */
460 | 	    rb = n->r->val;		/* upper bound; 0 is treated as unbounded */
461 | 	    head = tail = create_nstate(JOIN, NULL, NULL);
462 | 
463 | 	    for(int i=0; i <lb; i++) {
464 | 		l = nft(n->l, mode);
465 | 		tail->nexta = l.head;
466 | 		tail = l.tail;
467 | 	    }
468 | 
469 | 	    if(rb == 0) {
470 | 		l = nft(create_node('*', n->l, NULL), mode);
471 | 		tail->nexta = l.head;
472 | 		tail = l.tail;
473 | 	    } else {
474 | 		final = create_nstate(JOIN, NULL, NULL);
475 | 		for(int i=lb; i < rb; i++) {
476 | 		    l = nft(n->l, mode);
477 | 		    // TODO: add non-greed modifier
478 | 		    tail->nexta = create_nstate(n->val ? SPLITNG : SPLIT, final, l.head);
479 | 		    tail = l.tail;
480 | 		}
481 | 		tail->nexta = final;
482 | 		tail = final;
483 | 	    }
484 | 
485 | 	    return chunk(head, tail);
486 | 	default: 	 	//character
487 | 	    if (mode == 0) {
488 | 		cstate = create_nstate(CONS, NULL, NULL);
489 | 		pstate = create_nstate(PROD, NULL, NULL);
490 | 		cstate->val = n->val;
491 | 		pstate->val = n->val;
492 | 		cstate->nexta = pstate;
493 | 		return chunk(cstate, pstate);
494 | 	    }
495 | 	    else if (mode == 1) {
496 | 		state = create_nstate(CONS, NULL, NULL);
497 | 		state->val=n->val;
498 | 	    } else {
499 | 		state = create_nstate(PROD, NULL, NULL);
500 | 		state->val=n->val;
501 | 	    }
502 | 	    return chunk(state, state);
503 |     }
504 | }
505 | 
506 | struct nstate* create_nft(struct node *root) {
507 |     struct nstate *final = create_nstate(FINAL, NULL, NULL);
508 |     struct nchunk ch = nft(root, 0);
509 |     ch.tail->nexta = final;
510 |     return ch.head;
511 | }
512 | 
513 | struct sitem {
514 |     struct nstate *s;
515 |     size_t i;
516 |     size_t o;
517 | };
518 | 
519 | struct sstack {
520 |     struct sitem *items;
521 |     size_t n_items;
522 |     size_t capacity;
523 | };
524 | 
525 | struct sstack * screate(size_t capacity) {
526 |     struct sstack *stack;
527 | 
528 |     stack = (struct sstack*)malloc(sizeof(struct sstack));
529 |     if (stack != NULL) stack->items = malloc(capacity * sizeof(struct sitem));	/* do not dereference a failed allocation */
530 |     if (stack == NULL || stack->items == NULL) {
531 | 	fprintf(stderr, "error: stack memory allocation failed\n");
532 | 	exit(EXIT_FAILURE);
533 |     }
534 |     stack->n_items = 0;
535 |     stack->capacity = capacity;
536 |     return stack;
537 | }
538 | 
539 | void sresize(struct sstack *stack, size_t new_capacity) {
540 |     stack->items = realloc(stack->items, new_capacity * sizeof(struct sitem));
541 |     if (stack->items == NULL) {
542 | 	fprintf(stderr, "error: stack memory re-allocation failed\n");
543 | 	exit(EXIT_FAILURE);
544 |     }
545 |     stack->capacity = new_capacity;
546 | }
547 | 
548 | void spush(struct sstack *stack, struct nstate *s, size_t i, size_t o) {
549 |     struct sitem *it;
550 |     if (stack->n_items == stack->capacity) {
551 |         if (stack->capacity * 2 > STACK_MAX_CAPACITY) {
552 | 	    fprintf(stderr, "error: stack max capacity reached\n");
553 | 	    exit(EXIT_FAILURE);
554 | 	}
555 | 	sresize(stack, stack->capacity * 2);
556 |     }
557 |     it = &stack->items[stack->n_items];
558 |     it->s = s;
559 |     it->i = i;
560 |     it->o = o;
561 | 
562 |     stack->n_items++;
563 | }
564 | 
565 | size_t spop(struct sstack *stack, struct nstate **s, size_t *i, size_t *o) {
566 |     struct sitem *it;
567 |     if (stack->n_items == 0) {
568 | 	fprintf(stderr, "error: stack underflow\n");
569 |         exit(EXIT_FAILURE);
570 |     }
571 |     stack->n_items--;
572 |     it = &stack->items[stack->n_items];
573 |     *s = it->s;
574 |     *i = it->i;
575 |     *o = it->o;
576 | 
577 |     return stack->n_items;
578 | }
579 | 
580 | 
581 | // Function to dynamically resize the output array
582 | char* resize_output(char *output, size_t *capacity) {
583 |     *capacity *= 2; // Double the capacity
584 |     output = realloc(output, *capacity * sizeof(char));
585 |     if (!output) {
586 |         fprintf(stderr, "error: memory reallocation failed for output\n");
587 |         exit(EXIT_FAILURE);
588 |     }
589 |     return output;
590 | }
591 | 
592 | // Backtracking DFS over the NFT. Returns the input offset reached at FINAL, or -1 if no path is accepted (or if `all` is set).
593 | ssize_t infer_backtrack(struct nstate *start, char *input, struct sstack *stack, enum infer_mode mode, int all) {
594 |     size_t i = 0, o = 0;
595 |     struct nstate *s = start;
596 |     stack->n_items = 0;		/* reset stack; do not shrink */
597 | 
598 |     while (stack->n_items || s) {
599 |         if (!s) {
600 | 	    spop(stack, &s, &i, &o);
601 |             if (!s) {
602 |                 continue;
603 |             }
604 |         }
605 |         // Resize output array if necessary
606 |         if (o >= output_capacity - 1) {
607 |             output = resize_output(output, &output_capacity);
608 |         }
609 | 
610 |         switch (s->type) {
611 |             case CONS:
612 |                 if (input[i] != '\0' && s->val == input[i]) {
613 |                     i++;
614 |                     s = s->nexta;
615 |                 } else {
616 |                     s = NULL;
617 |                 }
618 |                 break;
619 |             case PROD:
620 |                 output[o++] = s->val;
621 |                 s = s->nexta;
622 |                 break;
623 |             case SPLIT:
624 |                 spush(stack, s->nexta, i, o);
625 |                 s = s->nextb;
626 |                 break;
627 |             case SPLITNG:
628 |                 spush(stack, s->nextb, i, o);
629 |                 s = s->nexta;
630 |                 break;
631 |             case JOIN:
632 |                 s = s->nexta;
633 |                 break;
634 |             case FINAL:
635 | 		if (mode == MODE_MATCH) {
636 | 		    if (input[i] == '\0') {
637 | 			output[o] = '\0'; // Null-terminate the output string
638 | 			fputs(output, stdout);
639 | 			fputc('\n', stdout);
640 | 			if (!all)
641 | 			    return i;
642 | 		    }
643 | 		} else {
644 | 		    output[o] = '\0'; // Null-terminate the output string
645 | 		    fputs(output, stdout);
646 | 		    if (!all)
647 | 			return i;
648 | 		}
649 | 		s = NULL;
650 |                 break;
651 |             default:
652 |                 fprintf(stderr, "error: unknown state type\n");
653 |                 exit(1);
654 |         }
655 |     }
656 |     return -1;
657 | }
658 | 
659 | 
660 | int stack_lookup(struct nstate **b, struct nstate **e, struct nstate *v) {
661 |     while(b != e)
662 | 	if (v == *(--e)) return 1;
663 |     return 0;
664 | }
665 | 
666 | 
667 | void plot_nft(struct nstate *start) {
668 |     struct nstate *stack[1024];
669 |     struct nstate *visited[1024];
670 | 
671 |     struct nstate **sp = stack;
672 |     struct nstate **vp = visited;
673 |     struct nstate *s = start;
674 |     char l,m;
675 | 
676 |     printf("digraph G {\n\tsplines=true; rankdir=LR;\n");
677 | 
678 |     push(sp, s);
679 |     while (sp != stack) {
680 |         s = pop(sp);
681 |         push(vp, s);
682 | 
683 |         if (s->type == FINAL)
684 |             printf("\t\"%p\" [peripheries=2, label=\"\"];\n", (void*)s);
685 |         else {
686 |             switch(s->type) {
687 | 		case PROD: 	l=s->val; m='+'; break;
688 | 		case CONS: 	l=s->val; m='-'; break;
689 | 		case SPLITNG: 	l='S'; m='n'; break;
690 | 		case SPLIT: 	l='S'; m=' '; break;
691 | 		case JOIN: 	l='J'; m=' '; break;
692 | 		default:	l=' '; m=' '; break;
693 | 	    }
694 |             printf("\t\"%p\" [label=\"%c%c\"];\n", (void*)s, l, m);
695 |         }
696 | 
697 |         if (s->nexta) {
698 |             printf("\t\"%p\" -> \"%p\";\n", (void*)s, (void*)s->nexta);
699 |             if (!stack_lookup(visited, vp, s->nexta))
700 |                 push(sp, s->nexta);
701 |         }
702 | 
703 |         if (s->nextb) {
704 |             printf("\t\"%p\" -> \"%p\" [label=\"%c\"];\n", (void*)s, (void*)s->nextb, '*');
705 |             if (!stack_lookup(visited, vp, s->nextb))
706 |                 push(sp, s->nextb);
707 |         }
708 |     }
709 |     printf("}\n");
710 | }
711 | 
712 | 
713 | int main(int argc, char **argv)
714 | {
715 |     FILE *fp;
716 |     char *expr;
717 |     ssize_t read, ioffset;
718 |     size_t input_len;
719 |     char *line = NULL, *input_fn, *ch;
720 |     struct node *root;
721 |     struct nstate *start;
722 |     struct sstack *stack = screate(STACK_INIT_CAPACITY);
723 |     enum infer_mode mode = MODE_SCAN;
724 |     int all = 0;	// 1 = emit every output (-a), not just the first
725 | 
726 |     int opt, debug=0;
727 | 
728 |     while ((opt = getopt(argc, argv, "dma")) != -1) {
729 | 	switch (opt) {
730 | 	    case 'd':
731 | 		debug = 1;
732 | 		break;
733 | 	    case 'm':
734 | 		mode = MODE_MATCH;
735 | 		break;
736 | 	    case 'a':
737 | 		all = 1;
738 | 		break;
739 | 	    default: /* '?' */
740 | 		fprintf(stderr, "Usage: %s [-dma] expr [file]\n",
741 | 		       argv[0]);
742 | 		exit(EXIT_FAILURE);
743 | 	}
744 |     }
745 | 
746 |     if (optind >= argc) {
747 |        fprintf(stderr, "error: missing trre expression\n");
748 |        exit(EXIT_FAILURE);
749 |     }
750 | 
751 |     expr = argv[optind];
752 |     root = parse(expr);
753 | 
754 |     start = create_nft(root);
755 | 
756 |     if (debug) {
757 | 	//plot_ast(root);
758 | 	plot_nft(start);
759 |     }
760 | 
761 | 
762 |     output = malloc(output_capacity*sizeof(char));
763 | 
764 |     if (optind == argc - 2) {		// filename provided
765 | 	input_fn = argv[optind + 1];
766 | 
767 | 	fp = fopen(input_fn, "r");
768 | 	if (fp == NULL) {
769 | 	    fprintf(stderr, "error: can not open file %s\n", input_fn);
770 | 	    exit(EXIT_FAILURE);
771 | 	}
772 |     } else
773 |     	fp = stdin;
774 | 
775 |     if (mode == MODE_SCAN) {
776 | 	while ((read = getline(&line, &input_len, fp)) != -1) {
777 | 	    if (line[read-1] == '\n') line[read-1] = '\0';	/* strip the newline only if present */
778 | 	    ch = line;
779 | 
780 | 	    while (*ch != '\0') {
781 | 		ioffset = infer_backtrack(start, ch, stack, mode, all);
782 | 		if (ioffset > 0)
783 | 		    ch += ioffset;
784 | 		else
785 | 		    fputc(*ch++, stdout);
786 | 	    }
787 | 	    // even if we have empty string we still need to run the inference
788 | 	    infer_backtrack(start, ch, stack, mode, all);
789 | 	    fputc('\n', stdout);
790 | 	}
791 |     } else {	/* MATCH mode */
792 | 	while ((read = getline(&line, &input_len, fp)) != -1) {
793 | 	    if (line[read-1] == '\n') line[read-1] = '\0';	/* strip the newline only if present */
794 | 	    infer_backtrack(start, line, stack, mode, all);
795 | 	    //fputc('\n', stdout);
796 | 	}
797 |     }
798 | 
799 |     fclose(fp);
800 |     if (line)
801 |         free(line);
802 |     return 0;
803 | }
804 | 


--------------------------------------------------------------------------------