├── LICENSE ├── Makefile ├── README.md ├── docs ├── automata.png ├── determinization.png ├── theory.pdf └── theory.typ ├── test.sh ├── trre.1 ├── trre_dft.c └── trre_nft.c /LICENSE: -------------------------------------------------------------------------------- 1 | MIT License 2 | 3 | Copyright (c) 2024 Konstantin Selivanov 4 | 5 | Permission is hereby granted, free of charge, to any person obtaining a copy 6 | of this software and associated documentation files (the "Software"), to deal 7 | in the Software without restriction, including without limitation the rights 8 | to use, copy, modify, merge, publish, distribute, sublicense, and/or sell 9 | copies of the Software, and to permit persons to whom the Software is 10 | furnished to do so, subject to the following conditions: 11 | 12 | The above copyright notice and this permission notice shall be included in all 13 | copies or substantial portions of the Software. 14 | 15 | THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR 16 | IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, 17 | FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE 18 | AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER 19 | LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, 20 | OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE 21 | SOFTWARE. 
22 | -------------------------------------------------------------------------------- /Makefile: -------------------------------------------------------------------------------- 1 | CC=cc 2 | CFLAGS = -std=c99 -Wall -Wextra -Wpedantic -O3 3 | #CFLAGS_DEBUG = -Wall -Wextra -Wpedantic -O0 -g 4 | 5 | all: nft dft 6 | 7 | nft: trre_nft.c 8 | $(CC) $(CFLAGS) trre_nft.c -o trre 9 | 10 | dft: trre_dft.c 11 | $(CC) $(CFLAGS) trre_dft.c -o trre_dft 12 | 13 | clean: 14 | rm -f trre trre_dft 15 | -------------------------------------------------------------------------------- /README.md: -------------------------------------------------------------------------------- 1 | # TRRE: transductive regular expressions 2 | 3 | #### TLDR: It is an extension of regular expressions for text editing and a `grep`-like command line tool. 4 | 5 | - [Examples](#examples) 6 | - [Language specification](#language-specification) 7 | - [Why it works](#why-it-works) 8 | - [Precedence](#precedence) 9 | - [Modes and greediness](#modes-and-greediness) 10 | - [Determinization](#determinization) 11 | - [Performance](#performance) 12 | - [Installation](#installation) 13 | - [TODO](#todo) 14 | - [References](#references) 15 | 16 | 17 | ## Intro 18 | 19 | ![automata](docs/automata.png) 20 | 21 | Regular expressions are a great tool for searching patterns in text, but I have always found them unnatural for text editing. The *group* logic works as a post-processor and can be complicated. Here I propose an extension to the regular expression language for pattern matching and text modification. I call it **transductive regular expressions** or simply **`trre`**. 22 | 23 | It introduces the `:` symbol to define transformations. The simplest form is `a:b`, which replaces `a` with `b`. I call this a `transductive pair` or `transduction`. 24 | 25 | To demonstrate the concept, I have created a command line tool **`trre`**. It feels similar to the `grep -E` command. 
26 | 27 | ## Examples 28 | 29 | To run the examples, first build the program: 30 | 31 | ```bash 32 | make 33 | ``` 34 | 35 | ### Basics 36 | 37 | To change `cat` to `dog` we use the following expression: 38 | 39 | ```bash 40 | echo 'cat' | ./trre 'cat:dog' 41 | ``` 42 | ``` 43 | dog 44 | ``` 45 | 46 | A more cryptic token-by-token version: 47 | 48 | ```bash 49 | echo 'cat' | ./trre '(c:d)(a:o)(t:g)' 50 | ``` 51 | ``` 52 | dog 53 | ``` 54 | 55 | It can be used like `sed` to replace all matches in a string: 56 | 57 | ```bash 58 | echo 'Mary had a little lamb.' | ./trre 'lamb:cat' 59 | ``` 60 | ``` 61 | Mary had a little cat. 62 | ``` 63 | 64 | **Deletion:** 65 | 66 | To delete a string we can use the pattern `string_to_delete:` 67 | 68 | ```bash 69 | echo 'xor' | ./trre '(x:)or' 70 | ``` 71 | ``` 72 | or 73 | ``` 74 | 75 | The expression `(x:)` can be read as a translation of `x` to an empty string. 76 | 77 | In the default `scan` mode this pattern deletes all occurrences. E.g. the expression below removes all occurrences of 'a' from a string: 78 | 79 | ```bash 80 | echo 'Mary had a little lamb.' | ./trre 'a:' 81 | ``` 82 | ``` 83 | Mry hd little lmb. 84 | ``` 85 | 86 | Technically, here we replace the character `a` with the empty symbol. 87 | 88 | To remove several characters from this string we can use a square-bracket construction, as in normal regex: 89 | 90 | ```bash 91 | echo 'Mary had a little lamb.' | ./trre '[aie]:' 92 | ``` 93 | ``` 94 | Mry hd lttl lmb. 95 | ``` 96 | 97 | **Insertion:** 98 | 99 | To insert a string we can use the pattern `:string_to_insert` 100 | 101 | ```bash 102 | echo 'or' | ./trre '(:x)or' 103 | ``` 104 | ``` 105 | xor 106 | ``` 107 | 108 | We can think of the expression `(:x)` as a translation of an empty string into `x`. 109 | 110 | We can insert a word in a context: 111 | 112 | ```bash 113 | echo 'Mary had a lamb.' | ./trre 'had a (:little )lamb' 114 | ``` 115 | ``` 116 | Mary had a little lamb. 
117 | ``` 118 | 119 | ### Regex over transductions 120 | 121 | As with normal regular expressions, we can use **alternation** with the `|` symbol: 122 | 123 | ```bash 124 | echo 'cat dog' | ./trre '(c:b)at|(d:h)og' 125 | ``` 126 | ``` 127 | bat hog 128 | ``` 129 | 130 | Or use the **star** over `trre` to repeat the transformation: 131 | 132 | ```bash 133 | echo 'catcatcat' | ./trre '(cat:dog)*' 134 | ``` 135 | ``` 136 | dogdogdog 137 | ``` 138 | 139 | In the default `scan` mode, **star** can be omitted: 140 | 141 | ```bash 142 | echo 'catcatcat' | ./trre 'cat:dog' 143 | ``` 144 | ``` 145 | dogdogdog 146 | ``` 147 | 148 | You can also use the star in the left part to "consume" any number of repetitions of a pattern: 149 | 150 | ```bash 151 | echo 'catcatcat' | ./trre '(cat)*:dog' 152 | ``` 153 | ``` 154 | dog 155 | ``` 156 | 157 | #### Danger zone 158 | 159 | Avoid using `*` or `+` in the right part, as it can cause infinite loops: 160 | 161 | ```bash 162 | echo '' | ./trre ':a*' # <- do NOT do this 163 | ``` 164 | ``` 165 | ... 166 | ``` 167 | 168 | Instead, use finite iterations: 169 | 170 | ```bash 171 | echo '' | ./trre ':(repeat-10-times){10}' 172 | 173 | repeat-10-timesrepeat-10-timesrepeat-10-timesrepeat-10-timesrepeat-10-timesrepeat-10-timesrepeat-10-timesrepeat-10-timesrepeat-10-timesrepeat-10-times 174 | ``` 175 | 176 | ### Range transformations 177 | 178 | Transform ranges of characters: 179 | 180 | ```bash 181 | echo "regular expressions" | ./trre "[a:A-z:Z]" 182 | ``` 183 | ``` 184 | REGULAR EXPRESSIONS 185 | ``` 186 | 187 | As a more complex example, let's create a toy cipher. 
Below is the Caesar cipher (shift 1) implementation: 188 | 189 | ```bash 190 | echo 'caesar cipher' | ./trre '[a:b-y:zz:a]' 191 | ``` 192 | ``` 193 | dbftbs djqifs 194 | ``` 195 | 196 | And decrypt it back: 197 | 198 | ```bash 199 | echo 'dbftbs djqifs' | ./trre '[a:zb:a-z:y]' 200 | ``` 201 | ``` 202 | caesar cipher 203 | ``` 204 | 205 | ### Generators 206 | 207 | **`trre`** can generate multiple output strings for a single input. By default, it uses the first possible match. You can also generate all possible outputs with the `-a` flag. 208 | 209 | **Binary sequences:** 210 | ```bash 211 | echo '' | ./trre -ma ':(0|1){3}' 212 | ``` 213 | ``` 214 | 215 | 000 216 | 001 217 | 010 218 | 011 219 | 100 220 | 101 221 | 110 222 | 111 223 | ``` 224 | 225 | **Subsets:** 226 | ```bash 227 | echo '' | ./trre -ma ':(0|1){,3}?' 228 | ``` 229 | ``` 230 | 231 | 0 232 | 00 233 | 000 234 | 001 235 | 01 236 | 010 237 | 011 238 | 1 239 | 10 240 | 100 241 | 101 242 | 11 243 | 110 244 | 111 245 | ``` 246 | 247 | ## Language specification 248 | 249 | Informally, we define a **`trre`** as a pair `pattern-to-match`:`pattern-to-generate`. The `pattern-to-match` can be a string or regexp. The `pattern-to-generate` is normally a string, but it can be a `regex` as well. Moreover, we can apply the normal regular expression operators over these pairs. 250 | 251 | We can specify a simplified grammar of **`trre`** as follows: 252 | 253 | ``` 254 | TRRE <- TRRE* TRRE|TRRE TRRE.TRRE 255 | TRRE <- REGEX REGEX:REGEX 256 | ``` 257 | 258 | Where `REGEX` is a usual regular expression, and `.` stands for concatenation. 259 | 260 | For now I treat the operator `:` as non-associative and the expression `TRRE:TRRE` as syntactically invalid. There is a natural meaning for this expression as a composition of the relations defined by the **`trre`**s, but it would make things too complex. I may change this later. 
261 | 262 | 263 | ## Why it works 264 | 265 | Under the hood, **`trre`** constructs a special automaton called a **Finite State Transducer (FST)**, which is similar to a **Finite State Automaton (FSA)** used in normal regular expressions but handles input-output pairs instead of plain strings. 266 | 267 | Key differences: 268 | 269 | * **`trre`** defines a binary relation between two regular languages. 270 | 271 | * It uses **FST**s instead of **FSA**s for inference. 272 | 273 | * It supports on-the-fly determinization for performance (experimental and under construction!). 274 | 275 | To justify the language of transductive regular expressions we need to prove the correspondence between **`trre`** expressions and **FST**s. Here is my sketch of the proof: [docs/theory.pdf](docs/theory.pdf). 276 | 277 | ## Precedence 278 | 279 | Below is the table of precedence (priority) of the `trre` operators from high to low: 280 | 281 | | Operator | Expression | 282 | | --------------------------------- | -------------------- | 283 | | Escaped characters | \\ | 284 | | Bracket expression | [] | 285 | | Grouping | () | 286 | | Iteration | * + ? {m,n} | 287 | | Concatenation | | 288 | | **Transduction** | : | 289 | | Alternation | \| | 290 | 291 | 292 | ## Modes and greediness 293 | 294 | **`trre`** supports two modes: 295 | 296 | * **Scan Mode (default)**: Scans the input and applies the transformation to every match it finds. 297 | 298 | * **Match Mode**: Checks the entire string against the expression (use the `-m` flag). 299 | 300 | Use `-a` to generate all possible outputs. 301 | 302 | The `?` modifier makes the `*`, `+`, and `{,}` operators non-greedy: 303 | 304 | ```bash 305 | echo '' | ./trre '<(.:)*>' 306 | ``` 307 | ``` 308 | <> 309 | ``` 310 | 311 | ```bash 312 | echo '' | ./trre '<(.:)*?>' 313 | ``` 314 | ``` 315 | <><> 316 | ``` 317 | 318 | Another common example is to change something within tags or brackets. E.g. 
to change everything within "<>" we can use the following expression: 319 | 320 | ```bash 321 | echo ' ' | ./trre '<(.*?:cat)>' 322 | ``` 323 | ``` 324 | 325 | ``` 326 | 327 | 328 | ## Determinization 329 | 330 | 331 | 332 | An important part of modern regex engines is determinization. This routine converts a non-deterministic automaton into a deterministic one. Once converted, inference takes time linear in the length of the input string. This is handy, but the conversion is exponential in the worst case. 333 | 334 | For **`trre`** a similar approach is possible. The bad news is that not every non-deterministic transducer (**NFT**) can be converted to a deterministic one (**DFT**). In the case of two "bad" cycles with the same input labels, the algorithm gets trapped in an infinite loop of state creation. There is a way to detect such loops, but it is expensive (see more in [Allauzen, Mohri, Efficient Algorithms for testing the twins property](https://cs.nyu.edu/~mohri/pub/twins.pdf)). 335 | 336 | ## Performance 337 | 338 | The default non-deterministic version is a bit slower than `sed`: 339 | 340 | ```bash 341 | wget https://www.gutenberg.org/cache/epub/57333/pg57333.txt -O chekhov.txt 342 | 343 | time cat chekhov.txt | ./trre '(vodka):(VODKA)' > /dev/null 344 | ``` 345 | ``` 346 | real 0m0.046s 347 | user 0m0.043s 348 | sys 0m0.007s 349 | ``` 350 | 351 | ```bash 352 | time cat chekhov.txt | sed 's/vodka/VODKA/' > /dev/null 353 | ``` 354 | ``` 355 | real 0m0.024s 356 | user 0m0.020s 357 | sys 0m0.010s 358 | ``` 359 | 360 | For complex tasks, **`trre_dft`** (deterministic version) can outperform `sed`: 361 | 362 | ```bash 363 | time cat chekhov.txt | sed -e 's/\(.*\)/\U\1/' > /dev/null 364 | ``` 365 | ``` 366 | real 0m0.508s 367 | user 0m0.504s 368 | sys 0m0.015s 369 | ``` 370 | 371 | ```bash 372 | time cat chekhov.txt | ./trre_dft '[a:A-z:Z]' > /dev/null 373 | ``` 374 | ``` 375 | real 0m0.131s 376 | user 0m0.127s 377 | sys 0m0.009s 378 | ``` 379 | 380 | ## Installation 381 | 
382 | No pre-built binaries are available yet. Clone the repository and compile: 383 | 384 | ```bash 385 | git clone git@github.com:c0stya/trre.git trre 386 | cd trre 387 | make && bash test.sh 388 | ``` 389 | 390 | Then move the binary to a directory in your `$PATH`. 391 | 392 | ## TODO 393 | 394 | * Stable *DFT* version 395 | * Full Unicode support 396 | * Complete the ERE feature set: 397 | - negation `^` within `[]` 398 | - character classes 399 | - '$^' anchoring symbols 400 | * Efficient range processing 401 | 402 | ## References 403 | 404 | * The approach is strongly inspired by Russ Cox's article: [Regular Expression Matching Can Be Simple And Fast](https://swtch.com/~rsc/regexp/regexp1.html) 405 | * The idea of transducer determinization comes from this paper: [Cyril Allauzen, Mehryar Mohri, "Finitely Subsequential Transducers"](https://research.google/pubs/finitely-subsequential-transducers-2/) 406 | * The parsing approach comes from the [Double-E algorithm](https://github.com/erikeidt/erikeidt.github.io/blob/master/The-Double-E-Method.md) by Erik Eidt, which is close to the classical [Shunting Yard algorithm](https://en.wikipedia.org/wiki/Shunting_yard_algorithm). 
407 | -------------------------------------------------------------------------------- /docs/automata.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/c0stya/trre/c474a6849153762530810d91667c192c5e53341a/docs/automata.png -------------------------------------------------------------------------------- /docs/determinization.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/c0stya/trre/c474a6849153762530810d91667c192c5e53341a/docs/determinization.png -------------------------------------------------------------------------------- /docs/theory.pdf: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/c0stya/trre/c474a6849153762530810d91667c192c5e53341a/docs/theory.pdf -------------------------------------------------------------------------------- /docs/theory.typ: -------------------------------------------------------------------------------- 1 | = Transductive regular expression 2 | 3 | == Intro 4 | 5 | The idea behind *trre* is very similar to regular expressions (*re*). But where *re* defines a regular language, *trre* defines a binary relation between two regular languages. To check whether a string matches a given regular expression, we normally construct an abstract machine called a finite state automaton or finite state acceptor (FSA). There is an established formal correspondence between the class of *re* and the class of FSA: for each expression in *re* we can construct an automaton, and for each automaton we can compose an expression in *re*. 6 | 7 | For *trre* the class of FSA is not enough. Instead we use a finite state transducer (FST) as the underlying matching engine. It is similar to the acceptor, but each transition has two labels: one defines an input character and the other defines the output. 
This article establishes a correspondence between *trre* and FST similar to the one between *re* and FSA. 8 | 9 | To do this we need to define the trre language formally and then prove that for every *trre* expression there is a corresponding FST and vice versa. But first, let's define the *re* language and then extend it to *trre*. 10 | 11 | == RE language specification 12 | 13 | *Definition*. Let $cal(A)$ be a finite set augmented with the symbols $emptyset$ and $epsilon$. We call it an alphabet. Then the *RE* language is the set of words over the alphabet $cal(A)$ and the symbols $+, dot.op, *$ defined inductively: 14 | 15 | 1. the symbols $epsilon, emptyset, "and" a in cal(A)$ are regular expressions 16 | 2. for any regular expressions $r$ and $s$ the words $r + s$, $r dot.op s$ and $r\*$ are regular expressions. 17 | 18 | *Definition*. For a given regular expression $r$ the language of $r$, denoted as $L(r)$, is a (possibly infinite) set of finite words defined inductively: 19 | 20 | 1. for the atomic expressions $epsilon, emptyset, a in cal(A)$ 21 | - $L(epsilon) = {epsilon}$ 22 | - $L(emptyset) = emptyset$ 23 | - $L(a) = {a}$ 24 | 2. for regular expressions $r, s$ 25 | - $L(r + s) = L(r) union L(s)$ 26 | - $L(r dot.op s) = L(r) dot.op L(s) = {a dot.op b | a in L(r), b in L(s)} $ 27 | - $L(r \*) = L(r)\* = sum_(n >= 0) r^n = epsilon + r + r dot.op r + r dot.op r dot.op r + ... $ 28 | 29 | We call the operation $r\*$ the Kleene star or Kleene closure. 30 | 31 | *Examples*. 32 | - $(a+b)c$ 33 | - $a*b*$ 34 | 35 | == TRRE language specification 36 | 37 | *Definition*. Let $R$ be a set of regular expressions over the alphabet $cal(A)$ and $S$ be a set of regular expressions over the alphabet $cal(B)$. We refer to $cal(A)$ as the input alphabet and $cal(B)$ as the output alphabet. A transductive regular expression (`trre`) is built from pairs of regular expressions joined by the delimiter symbol ':': 38 | 1. for regular expressions $r in R, s in S$, the string $r:s$ is a transductive regular expression 39 | 2. 
for transductive regular expressions $t, u$, the expressions $t dot.op u$, $t + u$, $t\*$ are transductive regular expressions 40 | 41 | *Definition*. For a given transductive regular expression $t$ the language of $t$, denoted as $L(t)$, is a set of *pairs* of words defined inductively: 42 | 43 | 1. for a transductive regular expression $r:s$, where $r, s$ are regular expressions, the language is $L(r:s) = L(r) times L(s) = { (a,b) | a in L(r), b in L(s) }$, i.e. the direct product of the languages $L(r)$ and $L(s)$. 44 | 2. for transductive regular expressions $t, u$: 45 | - $L(t dot.op u) = L(t) dot.op L(u) = { (a dot.op b, x dot.op y ) | (a,x) in L(t), (b,y) in L(u) } $ 46 | - $L(t + u) = L(t) union L(u) $ 47 | - $L(t\*) = L(t)\* = sum_(n >= 0) t^n$. 48 | 49 | == Some properties of TRRE language 50 | 51 | 1. $a:b = a:epsilon dot.op epsilon:b = epsilon : b dot.op a : epsilon$ 52 | 53 | Proof: 54 | $ L(a:b) & = {(u,v) | u in L(a), v in L(b)} & "by definition of" a:b "language" \ 55 | & = {(u dot.op epsilon, epsilon dot.op v) | u in L(a), v in L(b)} \ 56 | & = {(u dot.op epsilon, epsilon dot.op v) | (u, epsilon) in L(a:epsilon), (epsilon, v) in L(epsilon:b)} \ 57 | & = L(a:epsilon) dot.op L(epsilon:b) & "by definition of concatenation" \ 58 | & = L(a:epsilon dot.op epsilon:b) $ 59 | 60 | 2. $(a + b):c = a:c + b:c $ 61 | 62 | Proof: 63 | $ L((a + b):c) & = L(a + b) times L(c) & "by definition of" a:b "language" \ 64 | & = (L(a) union L(b)) times L(c) & "by definition of" a + b "language" \ 65 | & = L(a) times L(c) union L(b) times L(c) & "cartesian product of sets is distributive over union" \ 66 | & = L(a:c) union L(b:c) \ 67 | & = L(a:c + b:c) $ 68 | 69 | 3. $a:(b + c) = a:b + a:c $ 70 | 71 | Proof: same as in 2 72 | 73 | 4. 
$a\*:epsilon = (a : epsilon)\* $ 74 | 75 | Proof: 76 | $ a\*:epsilon & = ( sum_(n>=0) a^n ) : epsilon & "by definition of" \* \ 77 | & = sum_(n>=0) (a^n: epsilon ) & "by Property 2" \ 78 | & = sum_(n>=0) (a:epsilon)^n & "since" a^n:epsilon = (a:epsilon)^n \ & = (a:epsilon)\* & "by definition of" \* $ 79 | 80 | 5. $(a : epsilon)\* = a\*:epsilon $ 81 | 82 | Proof: same as in 4 83 | 84 | == trre normal form 85 | 86 | Let's introduce a simplified version of the TRRE language. We will call it the normal form of TRRE. 87 | 88 | *Definition*. Let $cal(A), cal(B)$ be the alphabets. Then a normal form is any string defined recursively: 89 | 90 | 1. any string of the form $a:b$ is a normal form, where $a in cal(A) union {epsilon}, b in cal(B) union {epsilon}$ 91 | 2. any strings of the form $A dot.op B, A+B, A\* $ are normal forms, where $A,B$ are normal forms 92 | 93 | It is clear that any normal form is a transductive regular expression. Let's show that the converse holds as well. 94 | 95 | *Lemma 1.* Every trre expression can be converted to a normal form. 96 | 97 | *Proof:* First, let's show that any trre expression of the form $a:epsilon$ or $epsilon:b$, where $a, b$ are valid regular expressions, can be converted to a normal form. Let's prove it for $a:epsilon$ using structural induction on the construction of the regular expression $a$. 98 | 99 | _Base_: $a:epsilon$, $epsilon:epsilon$ are in the normal form. 100 | 101 | _Step_: Let's assume that all the subformulas of the formula $a:epsilon$ can be converted to normal form. Then let's prove that $(a + b) : epsilon, a dot.op b : epsilon, a\*:epsilon$ can be converted to normal form: 102 | 1. $(a + b):epsilon = a:epsilon + b:epsilon$ by Property 2. 103 | 2. $a dot.op b : epsilon = a:epsilon dot.op b:epsilon$ by Property 1. 104 | 3. $(a \*:epsilon) = (a:epsilon)\*$ by Property 4. 105 | 106 | For the case $epsilon:b$ the proof is similar. 107 | So, knowing that any formula $A:B$ can be represented as $A:epsilon dot.op epsilon:B$, we can represent any TRRE of the form "A:B" in the normal form. 
The last step is to prove that any TRRE formula can be represented as a normal form. Using the fact above as the base of induction, we can prove it by structural induction on the recursive definition of TRRE. 108 | 109 | == trre and automata equivalence 110 | 111 | The final step is to show that for any TRRE there is a corresponding FST which defines the same language, and that for any FST there is a TRRE which defines the same language. 112 | 113 | *Theorem*. $ forall t in "TRRE" exists "FST" a "such that" cal(L)(t) = cal(L)(a) $ and 114 | $ forall "FST" a exists "TRRE" t "such that" cal(L)(a) = cal(L)(t) $ 115 | 116 | *Proof $arrow.r$* 117 | 118 | The proof follows the classical Thompson's construction algorithm. The only difference is that we represent the transducer as an automaton over the alphabet $(cal(A) union {epsilon}) times (cal(B) union {epsilon})$. 119 | 120 | *Proof $arrow.l$* 121 | 122 | The proof is the classical Kleene's algorithm for converting an automaton to a regular expression. The difference here is that we represent a transductive regular expression as a regular expression by first expressing the TRRE in normal form (by Lemma 1) and then treating each transductive pair of the form $a:b$ as a symbol in the new alphabet. The rest of the proof is the same. 123 | 124 | == Inference 125 | 126 | As in many modern re engines, we construct a deterministic automaton on the fly. The difference here is that we construct a deterministic FST whenever it is possible. 127 | -------------------------------------------------------------------------------- /test.sh: -------------------------------------------------------------------------------- 1 | #!/bin/bash 2 | 3 | cmd_scan="./trre" 4 | cmd_match="./trre -ma" 5 | 6 | test_cmd() { 7 | local inp=$1 8 | local tre=$2 9 | local exp=$3 10 | local cmd=$4 11 | 12 | res=$(echo "$inp" | $cmd "$tre") 13 | 14 | if ! 
diff <(echo -e "$exp") <(echo -e "$res") > /dev/null; then 15 | echo -e "FAIL $cmd:" "$inp" "->" "$tre" 16 | diff <(echo -e "$exp") <(echo -e "$res") 17 | fi 18 | } 19 | 20 | M() { 21 | test_cmd "$1" "$2" "$3" "$cmd_match" 22 | } 23 | 24 | S() { 25 | test_cmd "$1" "$2" "$3" "$cmd_scan" 26 | } 27 | 28 | # input # trre # expected 29 | # basics 30 | M "a" "a:x" "x" 31 | M "ab" "ab:xy" "xy" 32 | M "ab" "(a:x)(b:y)" "xy" 33 | M "cat" "cat:dog" "dog" 34 | M "cat" "(cat):(dog)" "dog" 35 | M "cat" "(c:d)(a:o)(t:g)" "dog" 36 | M "mat" "c:da:ot:g" "" 37 | 38 | # basics deletion 39 | M "xor" "(x:)or" "or" 40 | S "xor" "x:" "or" 41 | S "Mary had a little lamb" "a:" "Mry hd little lmb" 42 | 43 | # basics insertion 44 | M 'or' '(:x)or' "xor" 45 | S 'or' ':=' "=o=r=" 46 | 47 | # alternation 48 | M "a" "a|b|c" "a" 49 | M "b" "a|b|c" "b" 50 | M "c" "a|b|c" "c" 51 | 52 | # star 53 | S "a" "a*" "a" 54 | S "aaa" "a*" "aaa" 55 | M "b" "a*" "" 56 | M "bbb" "a*" "" 57 | M "abab" "(ab)*" "abab" 58 | M "ababa" "(ab)*" "" 59 | 60 | # basic ranges 61 | M "a" "[a-c]" "a" 62 | M "b" "[a-c]" "b" 63 | M "c" "[a-c]" "c" 64 | M "d" "[a-c]" "" 65 | 66 | # transform ranges 67 | M "a" "[a:x]" "x" 68 | M "a" "[a:xb:y]" "x" 69 | M "b" "[a:xb:y]" "y" 70 | M "c" "[a:xb:y]" "" 71 | 72 | # range-transform ranges 73 | M "a" "[a:x-c:z]" "x" 74 | M "b" "[a:x-c:z]" "y" 75 | M "c" "[a:x-c:z]" "z" 76 | M "d" "[a:x-c:z]" "" 77 | 78 | # any char 79 | M "a" "." "a" 80 | M "b" "." "b" 81 | M "abc" "..." 
"abc" 82 | 83 | # iteration, single 84 | M "aa" "a{2}" "aa" 85 | M "a" "a{2}" "" 86 | M "aaa" "a{2}" "" 87 | 88 | # iteration, left 89 | M "" "a{,2}" "" 90 | M "a" "a{,2}" "a" 91 | M "aa" "a{,2}" "aa" 92 | M "aaa" "a{,2}" "" 93 | 94 | # iteration, center 95 | M "" "a{1,2}" "" 96 | M "a" "a{1,2}" "a" 97 | M "aa" "a{1,2}" "aa" 98 | M "aaa" "a{1,2}" "" 99 | 100 | # iteration, right 101 | M "" "a{2,}" "" 102 | M "a" "a{2,}" "" 103 | M "aa" "a{2,}" "aa" 104 | M "aaa" "a{2,}" "aaa" 105 | 106 | # generators 107 | S "" ":a{,3}" "aaa" 108 | M "" ":a{,3}" "aaa\naa\na" 109 | 110 | S "" ":a{,3}?" "" 111 | M "" ":a{,3}?" "\na\naa\naaa" 112 | 113 | # greed modifiers 114 | S "aaa" "(.:x)*.*" "xxx" 115 | S "aaa" "(.:x)*?.*" "aaa" 116 | 117 | M "aaa" "(.:x)*.*" "xxx\nxxa\nxaa\naaa" 118 | M "aaa" "(.:x)*?.*" "aaa\nxaa\nxxa\nxxx" 119 | 120 | # escapes 121 | S "." "\." "." 122 | S "+" "\+" "+" 123 | S "?" "\?" "?" 124 | S ":" "\:" ":" 125 | S ".a" "\.a" ".a" 126 | M ".c" "[.]c" ".c" 127 | 128 | # greed modifiers priorities 129 | S "" "<(.:)*>" "<>" 130 | S "" "<(.:)*?>" "<><>" 131 | S "" "<(.:)+>" "<>" 132 | S "" "<(.:)+?>" "<><>" 133 | 134 | 135 | 136 | # epsilon 137 | # generators 138 | 139 | ## iteration, range 140 | #printf "aaa\n" | $CMD "((a:x){,}.*)" 141 | -------------------------------------------------------------------------------- /trre.1: -------------------------------------------------------------------------------- 1 | .\" Process this file with 2 | .\" groff -man -Tascii trre.1 3 | .\" 4 | .TH TRRE 1 "User Commands" 5 | .SH NAME 6 | trre \- stream text editor based on transductive regular expressions 7 | .SH SYNOPSIS 8 | .B trre 9 | [\fB\-mad\fR] 10 | .I PATTERN 11 | [\fIFILE\fR] 12 | .SH DESCRIPTION 13 | .B trre 14 | is a stream editor that performs string transformations. Similar to 15 | .BR sed (1), 16 | it processes input in a single pass. The transformation is entirely defined by the 17 | .IR PATTERN . 
18 | 19 | If no input file is specified, 20 | .B trre 21 | reads from standard input. 22 | .SH OPTIONS 23 | .IP \fB\-m\fR 24 | Enable matching mode. The expression must match the entire input string. 25 | .IP \fB\-a\fR 26 | Print all possible output strings. Useful in matching mode. 27 | .IP \fB\-d\fR 28 | Enable debug mode. Prints the parsing tree and automaton to stderr. 29 | .SH EXAMPLES 30 | .IP "\fBecho cat | trre 'cat:dog'\fR" 31 | Transforms the word "cat" to "dog". 32 | .IP "\fBecho cat | trre '(c:d)(a:o)(t:g)'\fR" 33 | Transforms the word "cat" to "dog". 34 | .SH BUGS 35 | This is an experimental tool. Use with caution. 36 | .SH AUTHOR 37 | Konstantin Selivanov 38 | .SH "SEE ALSO" 39 | .BR sed (1), 40 | .BR tr (1) 41 | -------------------------------------------------------------------------------- /trre_dft.c: -------------------------------------------------------------------------------- 1 | // Experimental. Do not use in production. 2 | 3 | #define _GNU_SOURCE 4 | #include 5 | #include 6 | #include 7 | #include 8 | #include 9 | #include 10 | 11 | 12 | /* precendence table */ 13 | int prec(char c) { 14 | switch (c) { 15 | case '|': return 1; 16 | case '-': return 2; 17 | case ':': return 3; 18 | case '.': return 4; 19 | case '?': case '*': 20 | case '+': case 'I': return 5; 21 | case '\\': return 6; 22 | } 23 | return -1; 24 | } 25 | 26 | 27 | struct node { 28 | unsigned char type; 29 | unsigned char val; 30 | struct node * l; 31 | struct node * r; 32 | }; 33 | 34 | #define push(stack, item) *stack++ = item 35 | #define pop(stack) *--stack 36 | #define top(stack) *(stack-1) 37 | 38 | #define STACK_MAX_CAPACITY 1000000 39 | 40 | static unsigned char operators[1024]; 41 | static struct node *operands[1024]; 42 | 43 | static unsigned char *opr = operators; 44 | static struct node **opd = operands; 45 | 46 | static char* output; 47 | static size_t output_capacity=32; 48 | 49 | 50 | struct node * create_node(unsigned char type, struct node *l, struct node 
*r) { 51 | struct node *node = malloc(sizeof(struct node)); 52 | if (!node) { 53 | fprintf(stderr, "error: node memory allocation failed\n"); 54 | exit(EXIT_FAILURE); 55 | } 56 | node->type = type; 57 | node->val = 0; 58 | node->l = l; 59 | node->r = r; 60 | return node; 61 | } 62 | 63 | // alias-helper 64 | struct node * create_nodev(unsigned char type, unsigned char val) { 65 | struct node *node = create_node(type, NULL, NULL); 66 | node->val = val; 67 | return node; 68 | } 69 | 70 | enum infer_mode { 71 | SCAN, 72 | MATCH, 73 | }; 74 | 75 | 76 | // reduce immediately 77 | void reduce_postfix(char op, int ng) { 78 | struct node *l, *r; 79 | 80 | switch(op) { 81 | case '*': case '+': case '?': 82 | l = pop(opd); 83 | r = create_node(op, l, NULL); 84 | r->val = ng; 85 | push(opd, r); 86 | break; 87 | default: 88 | fprintf(stderr, "error: unexpected postfix operator\n"); 89 | exit(EXIT_FAILURE); 90 | } 91 | } 92 | 93 | 94 | void reduce() { 95 | char op; 96 | struct node *l, *r; 97 | 98 | op = pop(opr); 99 | switch(op) { 100 | case '|': case '.': case ':': case '-': 101 | r = pop(opd); 102 | l = pop(opd); 103 | push(opd, create_node(op, l, r)); 104 | break; 105 | case '(': 106 | fprintf(stderr, "error: unmached parenthesis\n"); 107 | exit(EXIT_FAILURE); 108 | } 109 | } 110 | 111 | 112 | void reduce_op(char op) { 113 | while(opr != operators && prec(top(opr)) >= prec(op)) 114 | reduce(); 115 | push(opr, op); 116 | } 117 | 118 | char* parse_curly_brackets(char *expr) { 119 | int state = 0, ng = 0; 120 | int count = 0, lv = 0; 121 | struct node *l, *r; 122 | char c; 123 | 124 | while ((c = *expr) != '\0') { 125 | if (c >= '0' && c <= '9') { 126 | count = count*10 + c - '0'; 127 | } else if (c == ',') { 128 | lv = count; 129 | count = 0; 130 | state += 1; 131 | } else if (c == '}') { 132 | if(*(expr+1) == '?') { // safe to do, we have the '/0' sentinel 133 | ng = 1; 134 | expr++; 135 | } 136 | if (state == 0) 137 | lv = count; 138 | else if (state > 1) { 139 | 
fprintf(stderr, "error: more then one comma in curly brackets\n"); 140 | exit(EXIT_FAILURE); 141 | } 142 | 143 | r = create_nodev(lv, count); 144 | l = create_node('I', pop(opd), r); 145 | l->val = ng; 146 | push(opd, l); 147 | 148 | return expr; 149 | } else { 150 | fprintf(stderr, "error: unexpected symbol in curly brackets: %c\n", c); 151 | exit(EXIT_FAILURE); 152 | } 153 | expr++; 154 | } 155 | fprintf(stderr, "error: unmached curly brackets"); 156 | exit(EXIT_FAILURE); 157 | } 158 | 159 | char* parse_square_brackets(char *expr) { 160 | char c; 161 | int state = 0; 162 | 163 | while ((c = *expr) != '\0') { 164 | if (state == 0) { // expect operand 165 | switch(c) { 166 | case ':': case '-': case '[': case ']': 167 | fprintf(stderr, "error: unexpected symbol in square brackets: %c", c); 168 | exit(EXIT_FAILURE); 169 | default: 170 | push(opd, create_nodev('c', c)); // push operand 171 | state = 1; 172 | } 173 | } else { // expect operator 174 | switch(c) { 175 | case ':': case '-': 176 | reduce_op(c); 177 | state = 0; 178 | break; 179 | case ']': 180 | while (opr != operators && top(opr) != '[') 181 | reduce(); 182 | --opr; // remove [ from the stack 183 | return expr; 184 | default: // implicit alternation 185 | reduce_op('|'); 186 | state = 0; 187 | continue; // skip expr increment 188 | } 189 | } 190 | expr++; 191 | } 192 | fprintf(stderr, "error: unmached square brackets"); 193 | exit(EXIT_FAILURE); 194 | } 195 | 196 | 197 | struct node * parse(char *expr) { 198 | unsigned char c; 199 | int state = 0; 200 | 201 | while ((c = *expr) != '\0') { 202 | if (state == 0) { // expect operand 203 | switch(c) { 204 | case '(': 205 | push(opr, c); 206 | break; 207 | case '[': 208 | push(opr, c); 209 | expr = parse_square_brackets(expr+1); 210 | state = 1; 211 | break; 212 | case '\\': 213 | push(opd, create_nodev('c', *++expr)); 214 | state = 1; 215 | break; 216 | case '.': 217 | push(opd, create_node('-', 218 | create_nodev('c', 0), 219 | create_nodev('c', 255))); 220 
| state = 1; 221 | break; 222 | case ':': // epsilon as an implicit left operand 223 | push(opd, create_nodev('e', c)); 224 | state = 1; 225 | continue; // stay in the same position in expr 226 | case '|': case '*': case '+': case '?': 227 | case ')': case '{': case '}': 228 | if (opr != operators && top(opr) == ':') { // epsilon as an implicit right operand 229 | push(opd, create_nodev('e', c)); 230 | state = 1; 231 | continue; // stay in the same position in expr 232 | } else { 233 | fprintf(stderr, "error: unexpected symbol %c\n", c); 234 | exit(EXIT_FAILURE); 235 | } 236 | default: 237 | push(opd, create_nodev('c', c)); 238 | state = 1; 239 | } 240 | } else { // expect postfix or binary operator 241 | switch (c) { 242 | case '*': case '+': case '?': 243 | if(*(expr+1) == '?') { // safe to do, we have the '\0' sentinel 244 | reduce_postfix(c, 1); // found non-greedy modifier 245 | expr++; 246 | } 247 | else 248 | reduce_postfix(c, 0); 249 | break; 250 | case '|': 251 | reduce_op(c); 252 | state = 0; 253 | break; 254 | case ':': 255 | if (*(expr+1) == '\0') { // implicit epsilon as a right operand 256 | push(opd, create_nodev('e', c)); 257 | } 258 | reduce_op(c); 259 | state = 0; 260 | break; 261 | case '{': 262 | expr = parse_curly_brackets(expr+1); 263 | break; 264 | case ')': 265 | while (opr != operators && top(opr) != '(') 266 | reduce(); 267 | if (opr == operators || top(opr) != '(') { 268 | fprintf(stderr, "error: unmatched parenthesis\n"); 269 | exit(EXIT_FAILURE); 270 | } 271 | --opr; // remove ( from the stack 272 | break; 273 | default: // implicit cat 274 | reduce_op('.'); 275 | state = 0; 276 | continue; 277 | } 278 | } 279 | expr++; 280 | } 281 | 282 | while (opr != operators) { 283 | reduce(); 284 | } 285 | assert (operands != opd); 286 | 287 | return pop(opd); 288 | } 289 | 290 | 291 | void plot_ast(struct node *n) { 292 | struct node *stack[1024]; 293 | struct node **sp = stack; 294 | 295 | printf("digraph G {\n\tsplines=true; rankdir=TD;\n"); 296 |
297 | push(sp, n); 298 | while (sp != stack) { 299 | n = pop(sp); 300 | 301 | if (n->type == 'c') { 302 | //printf("ntype: %c", n->type); 303 | printf("\t\"%p\" [peripheries=2, label=\"%u\"];\n", (void*)n, n->val); 304 | } else { 305 | printf("\t\"%p\" [label=\"%c\"];\n", (void*)n, n->type); 306 | } 307 | 308 | if (n->l) { 309 | printf("\t\"%p\" -> \"%p\";\n", (void*)n, (void*)n->l); 310 | push(sp, n->l); 311 | } 312 | 313 | if (n->r) { 314 | printf("\t\"%p\" -> \"%p\" [label=\"%c\"];\n", (void*)n, (void*)n->r, 'r'); 315 | push(sp, n->r); 316 | } 317 | } 318 | printf("}\n"); 319 | } 320 | 321 | 322 | enum nstate_type { 323 | PROD, 324 | CONS, 325 | SPLIT, 326 | SPLITNG, 327 | JOIN, 328 | FINAL 329 | }; 330 | 331 | 332 | // nft state 333 | struct nstate { 334 | enum nstate_type type; 335 | unsigned char val; 336 | unsigned char mode; 337 | struct nstate *nexta; 338 | struct nstate *nextb; 339 | uint8_t visited; 340 | }; 341 | 342 | 343 | struct nstate* create_nstate(enum nstate_type type, struct nstate *nexta, struct nstate *nextb) { 344 | struct nstate *state; 345 | 346 | state = malloc(sizeof(struct nstate)); 347 | if (state == NULL) { 348 | fprintf(stderr, "error: nft state memory allocation failed\n"); 349 | exit(EXIT_FAILURE); 350 | } 351 | 352 | state->type = type; 353 | state->nexta = nexta; 354 | state->nextb = nextb; 355 | state->val = 0; 356 | state->visited = 0; 357 | 358 | return state; 359 | } 360 | 361 | 362 | struct nchunk { 363 | struct nstate * head; 364 | struct nstate * tail; 365 | }; 366 | 367 | 368 | struct nchunk chunk(struct nstate *head, struct nstate *tail) { 369 | struct nchunk chunk; 370 | chunk.head = head; 371 | chunk.tail = tail; 372 | return chunk; 373 | } 374 | 375 | 376 | struct nchunk nft(struct node *n, char mode) { 377 | struct nstate *split, *psplit, *join; 378 | struct nstate *cstate, *pstate, *state, *head, *tail, *final; 379 | struct nchunk l, r; 380 | int llv, lrv, rlv; 381 | int lb, rb; 382 | 383 | if (n == NULL) 384 | 
return chunk(NULL, NULL); 385 | 386 | switch(n->type) { 387 | case '.': 388 | l = nft(n->l, mode); 389 | r = nft(n->r, mode); 390 | l.tail->nexta = r.head; 391 | return chunk(l.head, r.tail); 392 | case '|': 393 | l = nft(n->l, mode); 394 | r = nft(n->r, mode); 395 | split = create_nstate(SPLITNG, l.head, r.head); 396 | join = create_nstate(JOIN, NULL, NULL); 397 | l.tail->nexta = join; 398 | r.tail->nexta = join; 399 | return chunk(split, join); 400 | case '*': 401 | l = nft(n->l, mode); 402 | split = create_nstate(n->val ? SPLITNG : SPLIT, NULL, l.head); 403 | l.tail->nexta = split; 404 | return chunk(split, split); 405 | case '?': 406 | l = nft(n->l, mode); 407 | join = create_nstate(JOIN, NULL, NULL); 408 | split = create_nstate(n->val ? SPLITNG : SPLIT, join, l.head); 409 | l.tail->nexta = join; 410 | return chunk(split, join); 411 | case '+': 412 | l = nft(n->l, mode); 413 | split = create_nstate(n->val ? SPLITNG : SPLIT, NULL, l.head); 414 | l.tail->nexta = split; 415 | return chunk(l.head, split); 416 | case ':': 417 | if (n->l->type == 'e') // implicit eps left operand 418 | return nft(n->r, 2); 419 | if (n->r->type == 'e') 420 | return nft(n->l, 1); 421 | // implicit eps right operand 422 | l = nft(n->l, 1); 423 | r = nft(n->r, 2); 424 | l.tail->nexta = r.head; 425 | return chunk(l.head, r.tail); 426 | case '-': 427 | if (n->l->type == 'c' && n->r->type == 'c') { 428 | join = create_nstate(JOIN, NULL, NULL); 429 | psplit = NULL; 430 | for(int c=n->r->val; c >= n->l->val; c--){ 431 | l = nft(create_nodev('c', c), mode); 432 | split = create_nstate(SPLITNG, l.head, psplit); 433 | l.tail->nexta = join; 434 | psplit = split; 435 | } 436 | return chunk(psplit, join); 437 | } else if (n->l->type == ':' && n->r->type == ':') { 438 | join = create_nstate(JOIN, NULL, NULL); 439 | psplit = NULL; 440 | 441 | llv = n->l->l->val; 442 | lrv = n->l->r->val; 443 | rlv = n->r->l->val; 444 | for(int c=rlv-llv; c >= 0; c--) { 445 | l = nft(create_node(':', 446 | 
create_nodev('c', llv+c), 447 | create_nodev('c', lrv+c) 448 | ), mode); 449 | split = create_nstate(SPLITNG, l.head, psplit); 450 | l.tail->nexta = join; 451 | psplit = split; 452 | } 453 | return chunk(psplit, join); 454 | } 455 | else { 456 | fprintf(stderr, "error: unexpected range syntax\n"); 457 | exit(EXIT_FAILURE); 458 | } 459 | case 'I': 460 | lb = n->r->type; /* placeholder for the left range */ 461 | rb = n->r->val; /* placeholder for the right range */ 462 | head = tail = create_nstate(JOIN, NULL, NULL); 463 | 464 | for(int i=0; i < lb; i++) { 465 | l = nft(n->l, mode); 466 | tail->nexta = l.head; 467 | tail = l.tail; 468 | } 469 | 470 | if(rb == 0) { 471 | l = nft(create_node('*', n->l, NULL), mode); 472 | tail->nexta = l.head; 473 | tail = l.tail; 474 | } else { 475 | final = create_nstate(JOIN, NULL, NULL); 476 | for(int i=lb; i < rb; i++) { 477 | l = nft(n->l, mode); 478 | // TODO: add non-greedy modifier 479 | tail->nexta = create_nstate(n->val ? SPLITNG : SPLIT, final, l.head); 480 | tail = l.tail; 481 | } 482 | tail->nexta = final; 483 | tail = final; 484 | } 485 | 486 | return chunk(head, tail); 487 | default: //character 488 | if (mode == 0) { 489 | cstate = create_nstate(CONS, NULL, NULL); 490 | pstate = create_nstate(PROD, NULL, NULL); 491 | cstate->val = n->val; 492 | pstate->val = n->val; 493 | cstate->nexta = pstate; 494 | return chunk(cstate, pstate); 495 | } 496 | else if (mode == 1) { 497 | state = create_nstate(CONS, NULL, NULL); 498 | state->val=n->val; 499 | } else { 500 | state = create_nstate(PROD, NULL, NULL); 501 | state->val=n->val; 502 | } 503 | return chunk(state, state); 504 | } 505 | } 506 | 507 | /* 508 | struct nstate* create_nft(struct node *root) { 509 | struct nstate *final = create_nstate(FINAL, NULL, NULL); 510 | struct nchunk ch = nft(root, 0); 511 | ch.tail->nexta = final; 512 | return ch.head; 513 | } 514 | */ 515 | 516 | struct nstate* create_nft(struct node *root) { 517 | struct nstate *final = create_nstate(FINAL, NULL, NULL); 518 | struct
nstate *init = create_nstate(JOIN, NULL, NULL); 519 | struct nchunk ch = nft(root, 0); 520 | ch.tail->nexta = final; 521 | init->nexta = ch.head; 522 | return init; 523 | } 524 | 525 | struct sitem { 526 | struct nstate *s; 527 | size_t i; 528 | size_t o; 529 | }; 530 | 531 | struct sstack { 532 | struct sitem *items; 533 | size_t n_items; 534 | size_t capacity; 535 | }; 536 | 537 | struct sstack * screate(size_t capacity) { 538 | struct sstack *stack; 539 | 540 | stack = (struct sstack*)malloc(sizeof(struct sstack)); 541 | if (stack != NULL) stack->items = malloc(capacity * sizeof(struct sitem)); // do not dereference stack before it is known to be valid 542 | if (stack == NULL || stack->items == NULL) { 543 | fprintf(stderr, "error: stack memory allocation failed\n"); 544 | exit(EXIT_FAILURE); 545 | } 546 | stack->n_items = 0; 547 | stack->capacity = capacity; 548 | return stack; 549 | } 550 | 551 | void sresize(struct sstack *stack, size_t new_capacity) { 552 | stack->items = realloc(stack->items, new_capacity * sizeof(struct sitem)); 553 | if (stack->items == NULL) { 554 | fprintf(stderr, "error: stack memory re-allocation failed\n"); 555 | exit(EXIT_FAILURE); 556 | } 557 | stack->capacity = new_capacity; 558 | } 559 | 560 | void spush(struct sstack *stack, struct nstate *s, size_t i, size_t o) { 561 | struct sitem *it; 562 | if (stack->n_items == stack->capacity) { 563 | if (stack->capacity * 2 > STACK_MAX_CAPACITY) { 564 | fprintf(stderr, "error: stack max capacity reached\n"); 565 | exit(EXIT_FAILURE); 566 | } 567 | sresize(stack, stack->capacity * 2); 568 | } 569 | it = &stack->items[stack->n_items]; 570 | it->s = s; 571 | it->i = i; 572 | it->o = o; 573 | 574 | stack->n_items++; 575 | } 576 | 577 | size_t spop(struct sstack *stack, struct nstate **s, size_t *i, size_t *o) { 578 | struct sitem *it; 579 | if (stack->n_items == 0) { 580 | fprintf(stderr, "error: stack underflow\n"); 581 | exit(EXIT_FAILURE); 582 | } 583 | stack->n_items--; 584 | it = &stack->items[stack->n_items]; 585 | *s = it->s; 586 | *i = it->i; 587 | *o = it->o;
588 | 589 | return stack->n_items; 590 | } 591 | 592 | 593 | // Function to dynamically resize the output array 594 | char* resize_output(char *output, size_t *capacity) { 595 | *capacity *= 2; // Double the capacity 596 | output = realloc(output, *capacity * sizeof(char)); 597 | if (!output) { 598 | fprintf(stderr, "error: memory reallocation failed for output\n"); 599 | exit(EXIT_FAILURE); 600 | } 601 | return output; 602 | } 603 | 604 | // Main NFT traversal function (depth-first) 605 | ssize_t infer_backtrack(struct nstate *start, char *input, struct sstack *stack, enum infer_mode mode) { 606 | size_t i = 0, o = 0; 607 | struct nstate *s = start; 608 | 609 | while (stack->n_items || s) { 610 | if (!s) { 611 | spop(stack, &s, &i, &o); 612 | if (!s) { 613 | continue; 614 | } 615 | } 616 | // Resize output array if necessary 617 | if (o >= output_capacity - 1) { 618 | output = resize_output(output, &output_capacity); 619 | } 620 | 621 | switch (s->type) { 622 | case CONS: 623 | if (input[i] != '\0' && s->val == input[i]) { 624 | i++; 625 | s = s->nexta; 626 | } else { 627 | s = NULL; 628 | } 629 | break; 630 | case PROD: 631 | output[o++] = s->val; 632 | s = s->nexta; 633 | break; 634 | case SPLIT: 635 | spush(stack, s->nexta, i, o); 636 | s = s->nextb; 637 | break; 638 | case SPLITNG: 639 | spush(stack, s->nextb, i, o); 640 | s = s->nexta; 641 | break; 642 | case JOIN: 643 | s = s->nexta; 644 | break; 645 | case FINAL: 646 | if (mode == MATCH) { 647 | if (input[i] == '\0') { 648 | output[o] = '\0'; // Null-terminate the output string 649 | fputs(output, stdout); 650 | fputs("\n", stdout); 651 | } 652 | s = NULL; 653 | } else { 654 | output[o] = '\0'; // Null-terminate the output string 655 | fputs(output, stdout); 656 | return i; 657 | } 658 | break; 659 | default: 660 | fprintf(stderr, "error: unknown state type\n"); 661 | exit(1); 662 | } 663 | } 664 | return -1; 665 | } 666 | 667 | int stack_lookup(struct nstate **b, struct nstate **e, struct nstate *v) { 668 
| while(b != e) 669 | if (v == *(--e)) return 1; 670 | return 0; 671 | } 672 | 673 | 674 | 675 | void plot_nft(struct nstate *start) { 676 | struct nstate *stack[1024]; 677 | struct nstate *visited[1024]; 678 | 679 | struct nstate **sp = stack; 680 | struct nstate **vp = visited; 681 | struct nstate *s = start; 682 | char l,m; 683 | 684 | printf("digraph G {\n\tsplines=true; rankdir=LR;\n"); 685 | 686 | push(sp, s); 687 | while (sp != stack) { 688 | s = pop(sp); 689 | push(vp, s); 690 | 691 | if (s->type == FINAL) 692 | printf("\t\"%p\" [peripheries=2, label=\"\"];\n", (void*)s); 693 | else { 694 | switch(s->type) { 695 | case PROD: l=s->val; m='+'; break; 696 | case CONS: l=s->val; m='-'; break; 697 | case SPLITNG: l='S'; m='n'; break; 698 | case SPLIT: l='S'; m=' '; break; 699 | case JOIN: l='J'; m=' '; break; 700 | default: l=' '; m=' ';break; 701 | } 702 | printf("\t\"%p\" [label=\"%c%c\"];\n", (void*)s, l, m); 703 | } 704 | 705 | if (s->nexta) { 706 | printf("\t\"%p\" -> \"%p\";\n", (void*)s, (void*)s->nexta); 707 | if (!stack_lookup(visited, vp, s->nexta)) 708 | push(sp, s->nexta); 709 | } 710 | 711 | if (s->nextb) { 712 | printf("\t\"%p\" -> \"%p\" [label=\"%c\"];\n", (void*)s, (void*)s->nextb, '*'); 713 | if (!stack_lookup(visited, vp, s->nextb)) 714 | push(sp, s->nextb); 715 | } 716 | } 717 | printf("}\n"); 718 | } 719 | 720 | struct str_item { 721 | unsigned char c; 722 | struct str_item *next; 723 | }; 724 | 725 | struct str { 726 | struct str_item *head; 727 | struct str_item *tail; 728 | }; 729 | 730 | struct str * str_create() { 731 | struct str *st; 732 | st = malloc(sizeof(struct str)); 733 | st->head = st->tail = NULL; 734 | return st; 735 | } 736 | 737 | struct str * str_append(struct str *s, unsigned char c) { 738 | struct str_item *si; 739 | si = malloc(sizeof(struct str_item)); 740 | si->c = c; 741 | si->next = NULL; 742 | 743 | if (s->tail) { 744 | s->tail->next = si; 745 | s->tail = si; 746 | } else { 747 | s->tail = s->head = si; 748 | } 749 
| return s; 750 | } 751 | 752 | char str_popleft(struct str *s) { 753 | unsigned char c; 754 | struct str_item *tmp; 755 | 756 | if (s->head == NULL) { 757 | fprintf(stderr, "error: string underflow\n"); 758 | exit(EXIT_FAILURE); 759 | } 760 | 761 | // question: what if s->head == s->tail ?? 762 | c = s->head->c; 763 | 764 | if (s->head == s->tail) { 765 | free(s->head); 766 | s->head = s->tail = NULL; 767 | } else { 768 | tmp = s->head; 769 | s->head = s->head->next; 770 | free(tmp); 771 | } 772 | 773 | return c; 774 | } 775 | 776 | struct str * str_append_str(struct str *dst, struct str *src) { 777 | for(struct str_item *si = src->head; si != NULL; si = si->next) { 778 | str_append(dst, si->c); 779 | } 780 | return dst; 781 | } 782 | 783 | struct str * str_copy(struct str* src) { 784 | struct str *dst = str_create(); 785 | assert(src); 786 | 787 | for(struct str_item *si = src->head; si != NULL; si = si->next) { 788 | str_append(dst, si->c); 789 | } 790 | return dst; 791 | } 792 | 793 | void str_free(struct str* s) { 794 | struct str_item *si, *next; 795 | if (!s) return; 796 | 797 | for(si = s->head; si != NULL;) { 798 | next = si->next; 799 | free(si); 800 | si = next; 801 | } 802 | free(s); 803 | } 804 | 805 | void str_print(struct str *s) { 806 | assert(s); 807 | for(struct str_item *si=s->head; si != NULL; si=si->next) 808 | fputc(si->c, stdout); 809 | } 810 | 811 | void str_to_char(struct str *s, unsigned char *ch) { 812 | assert(s); 813 | int i = 0; 814 | for(struct str_item *si=s->head; si != NULL; si=si->next) 815 | ch[i++] = si->c; 816 | ch[i] = '\0'; 817 | } 818 | 819 | 820 | 821 | struct slist { 822 | struct slitem *head; 823 | struct slitem *tail; 824 | size_t n; 825 | }; 826 | 827 | struct slitem { 828 | struct nstate *state; 829 | struct str *suffix; 830 | struct slitem *next; 831 | }; 832 | 833 | struct slist * slist_create() { 834 | struct slist *sl; 835 | sl = malloc(sizeof(struct slist)); 836 | sl->head = sl->tail = NULL; 837 | sl->n = 0; 838 | 
839 | return sl; 840 | } 841 | 842 | struct slist * slist_append(struct slist *sl, struct nstate *state, struct str *suffix) { 843 | struct slitem *li; 844 | li = malloc(sizeof(struct slitem)); 845 | li->state = state; 846 | li->suffix = suffix; 847 | li->next = NULL; 848 | 849 | if (sl->tail) { 850 | sl->tail->next = li; 851 | sl->tail = li; 852 | } else { 853 | sl->tail = sl->head = li; 854 | } 855 | sl->n += 1; 856 | return sl; 857 | } 858 | 859 | 860 | void slist_free(struct slist* sl) { 861 | struct slitem *li, *next; 862 | if (!sl) return; 863 | 864 | for(li = sl->head; li != NULL;) { 865 | next = li->next; 866 | str_free(li->suffix); 867 | free(li); 868 | li = next; 869 | } 870 | free(sl); 871 | } 872 | 873 | 874 | void nft_step_(struct nstate *s, struct str *o, unsigned char c, struct slist *sl) { 875 | if (!s) return; 876 | 877 | 878 | switch(s->type) { 879 | case SPLIT: 880 | nft_step_(s->nextb, o, c, sl); 881 | nft_step_(s->nexta, str_copy(o), c, sl); 882 | break; 883 | case SPLITNG: 884 | nft_step_(s->nexta, o, c, sl); 885 | nft_step_(s->nextb, str_copy(o), c, sl); 886 | break; 887 | case JOIN: 888 | nft_step_(s->nexta, o, c, sl); 889 | break; 890 | case PROD: 891 | nft_step_(s->nexta, str_append(o, s->val), c, sl); 892 | break; 893 | case CONS: // found CONS state marked with 'c' 894 | if (c == s->val && s->visited == 0) { 895 | s->visited = 1; 896 | slist_append(sl, s, o); 897 | } 898 | break; 899 | case FINAL: 900 | if(c == '\0' && s->visited == 0) { /* final states closure */ 901 | slist_append(sl, s, o); 902 | s->visited = 1; 903 | } 904 | return; 905 | } 906 | return; 907 | } 908 | 909 | 910 | struct slist * nft_step(struct slist *states, unsigned char c) { 911 | struct slist *sl = slist_create(); 912 | struct slitem *li; 913 | 914 | for(li=states->head; li; li=li->next) 915 | nft_step_(li->state->nexta, str_copy(li->suffix), c, sl); 916 | 917 | /* reset the visited flag; yes it is linear 918 | * but the list have to be short */ 919 | 
for(li=sl->head; li; li=li->next) 920 | li->state->visited = 0; 921 | 922 | return sl; 923 | } 924 | 925 | 926 | struct dstate { 927 | struct slist *states; 928 | struct str *final_out; 929 | int8_t final; 930 | struct str *out[256]; 931 | struct dstate *next[256]; 932 | }; 933 | 934 | struct dstate * dstate_create(struct slist *states) { 935 | struct dstate *ds; 936 | ds = malloc(sizeof(struct dstate)); 937 | ds->states = states; 938 | ds->final = -1; 939 | memset(ds->next, 0, sizeof ds->next); 940 | memset(ds->out, 0, sizeof ds->out); 941 | return ds; 942 | } 943 | 944 | int dstack_lookup(struct dstate **b, struct dstate **e, struct dstate *v) { 945 | while(b != e) 946 | if (v == *(--e)) return 1; 947 | return 0; 948 | } 949 | 950 | 951 | void plot_dft(struct dstate *dstart) { 952 | struct dstate *stack[1024]; 953 | struct dstate *visited[1024]; 954 | 955 | struct dstate **sp = stack; 956 | struct dstate **vp = visited; 957 | struct dstate *s = dstart, *s_next = NULL; 958 | unsigned char out[32], label[32]; 959 | 960 | printf("digraph G {\n\tsplines=true; rankdir=LR;\n"); 961 | 962 | push(sp, s); 963 | while (sp != stack) { 964 | s = pop(sp); 965 | push(vp, s); 966 | 967 | if (s->final == 1) { 968 | str_to_char(s->final_out, out); 969 | printf("\t\"%p\" [peripheries=2, label=\"%s\"];\n", (void*)s, out); 970 | } 971 | else 972 | printf("\t\"%p\" [label=\"\"];\n", (void*)s); 973 | 974 | for(int c=0; c < 256; c++) { 975 | if ((s_next = s->next[c]) != NULL) { 976 | str_to_char(s->out[c], label); 977 | printf("\t\"%p\" -> \"%p\" [label=\"%c:%s\"];\n", (void*)s, (void*)s_next, c, label); 978 | if (!dstack_lookup(visited, vp, s_next)) 979 | push(sp, s_next); 980 | } 981 | } 982 | } 983 | printf("}\n"); 984 | } 985 | 986 | 987 | 988 | struct str * truncate_lcp(struct slist *sl, struct str *prefix) { 989 | struct slitem *li; 990 | unsigned char ch; 991 | 992 | for(;;) { 993 | if(!sl->head->suffix->head) /* end of lcp */ 994 | return prefix; 995 | 996 | ch = 
sl->head->suffix->head->c; /* omg */ 997 | for(li=sl->head; li; li=li->next) { 998 | if (!li->suffix || !li->suffix->head || li->suffix->head->c != ch) 999 | return prefix; 1000 | } 1001 | 1002 | /* we found the common char, 1003 | * - add the char to the prefix, 1004 | * - truncate the suffixes */ 1005 | str_append(prefix, ch); 1006 | for(li=sl->head; li; li=li->next) 1007 | str_popleft(li->suffix); 1008 | } 1009 | return prefix; 1010 | } 1011 | 1012 | /* lexicographic comparison */ 1013 | int str_cmp(struct str *a, struct str *b) { 1014 | struct str_item *ai, *bi; 1015 | for(ai=a->head, bi=b->head; ai && bi; ai=ai->next, bi=bi->next) 1016 | if (ai->c < bi->c) 1017 | return -1; 1018 | else if (ai->c > bi->c) 1019 | return 1; 1020 | if (ai) 1021 | return 1; 1022 | if (bi) 1023 | return -1; 1024 | 1025 | return 0; 1026 | } 1027 | 1028 | int list_cmp(struct slist *a, struct slist *b) { 1029 | struct slitem *ai, *bi; 1030 | int sign; 1031 | 1032 | if(a->n < b->n) 1033 | return -1; 1034 | if(a->n > b->n) 1035 | return 1; 1036 | 1037 | for(ai=a->head, bi=b->head; ai && bi; ai=ai->next, bi=bi->next) 1038 | if(ai->state < bi->state) 1039 | return -1; 1040 | else if(ai->state > bi->state) 1041 | return 1; 1042 | else { 1043 | sign = str_cmp(ai->suffix, bi->suffix); 1044 | if (sign != 0) 1045 | return sign; 1046 | } 1047 | 1048 | return 0; 1049 | } 1050 | 1051 | 1052 | // Define the structure for the tree node 1053 | struct btnode { 1054 | struct dstate *ds; 1055 | struct btnode *l; 1056 | struct btnode *r; 1057 | }; 1058 | 1059 | // Function to create a new node with given data 1060 | struct btnode* bt_create(struct dstate *ds) { 1061 | struct btnode* node; 1062 | node = malloc(sizeof(struct btnode)); 1063 | if (node == NULL) { 1064 | printf("error: binary tree node allocation failed.\n"); 1065 | exit(EXIT_FAILURE); 1066 | } 1067 | node->ds = ds; 1068 | node->l = NULL; 1069 | node->r = NULL; 1070 | return node; 1071 | } 1072 | 1073 | // Lookup function to search for a 
value in the binary tree 1074 | struct dstate* bt_lookup(struct btnode *n, struct slist *sl) { 1075 | int sign = 0; 1076 | 1077 | while (n != NULL) { 1078 | sign = list_cmp(sl, n->ds->states); 1079 | if (sign < 0) /* less */ 1080 | n = n->l; 1081 | else if (sign > 0) /* more */ 1082 | n = n->r; 1083 | else 1084 | return n->ds; /* found */ 1085 | } 1086 | return NULL; 1087 | } 1088 | 1089 | // Function to insert nodes to form a binary search tree 1090 | struct btnode* bt_insert(struct btnode* n, struct dstate *ds) { 1091 | int sign; 1092 | if (n == NULL) 1093 | return bt_create(ds); 1094 | sign = list_cmp(ds->states, n->ds->states); 1095 | if (sign < 0) 1096 | n->l = bt_insert(n->l, ds); 1097 | else if (sign > 0) 1098 | n->r = bt_insert(n->r, ds); 1099 | return n; 1100 | } 1101 | 1102 | void bt_free(struct btnode* root) { 1103 | if (root == NULL) return; 1104 | bt_free(root->l); 1105 | bt_free(root->r); 1106 | free(root); 1107 | } 1108 | 1109 | 1110 | int infer_dft(struct dstate *dstart, unsigned char *inp, struct btnode *dcache, enum infer_mode mode) { 1111 | struct dstate *ds_next, *ds=dstart; 1112 | struct str *prefix, *out = str_create(); 1113 | struct slist *sl; 1114 | 1115 | unsigned char *c; 1116 | int i = 0; 1117 | 1118 | for(c=inp; *c != '\0'; c++, i++) { 1119 | 1120 | if (mode == SCAN && ds->final == 1) { 1121 | str_print(out); 1122 | str_print(ds->final_out); 1123 | str_free(out); 1124 | return i; 1125 | } 1126 | 1127 | // todo: double check this logic 1128 | if (ds->next[*c] != NULL) { 1129 | str_append_str(out, ds->out[*c]); 1130 | ds = ds->next[*c]; 1131 | } 1132 | else if (ds->out[*c] != NULL) { /* already explored but found nothing */ 1133 | break; 1134 | } 1135 | else { /* not explored, explore */ 1136 | sl = nft_step(ds->states, *c); 1137 | 1138 | /* expand each state and accumulate CONS states labeled with c */ 1139 | if (!sl->head) { /* got empty list; mark as explored and exit */ 1140 | ds->out[*c] = (struct str*)1; /* fixme: can we create a 
better indicator? */ 1141 | slist_free(sl); 1142 | break; 1143 | } 1144 | 1145 | prefix = str_create(); 1146 | truncate_lcp(sl, prefix); 1147 | 1148 | if ((ds_next = bt_lookup(dcache, sl)) != NULL) { 1149 | ds->next[*c] = ds_next; 1150 | ds->out[*c] = prefix; 1151 | slist_free(sl); /* no need for the list */ 1152 | } else { 1153 | ds_next = dstate_create(sl); 1154 | ds->next[*c] = ds_next; 1155 | ds->out[*c] = prefix; 1156 | bt_insert(dcache, ds_next); 1157 | } 1158 | 1159 | str_append_str(out, ds->out[*c]); 1160 | ds = ds_next; 1161 | //str_free(prefix); 1162 | 1163 | // refactor this: do not make closure several times 1164 | if (ds->final < 0) { /* unexplored */ 1165 | sl = nft_step(ds->states, '\0'); 1166 | 1167 | if (sl->head) { /* got final states; take the first one */ 1168 | ds->final = 1; 1169 | ds->final_out = str_copy(sl->head->suffix); 1170 | // we do not need the 'sl' list anymore 1171 | } else { 1172 | ds->final = 0; 1173 | } 1174 | } 1175 | } 1176 | } 1177 | 1178 | if (mode == SCAN && ds->final == 1) { 1179 | str_print(out); 1180 | str_print(ds->final_out); 1181 | return i; 1182 | } 1183 | 1184 | 1185 | /* 1186 | //if (mode == MATCH && ds->final == 1 && *c == '\0') { 1187 | if (ds->final == 1 && *c == '\0') { 1188 | str_print(out); 1189 | str_print(ds->final_out); 1190 | } 1191 | */ 1192 | 1193 | str_free(out); 1194 | 1195 | return -1; 1196 | } 1197 | 1198 | 1199 | int main(int argc, char **argv) 1200 | { 1201 | FILE *fp; 1202 | char *expr; 1203 | ssize_t read, ioffset; 1204 | size_t input_len; 1205 | //unsigned char *line = NULL, *input_fn, *ch; 1206 | char *line = NULL, *input_fn, *ch; 1207 | struct node *root; 1208 | struct nstate *start; 1209 | //struct sstack *stack = screate(32); 1210 | struct dstate *dstart; 1211 | struct btnode *dcache; 1212 | enum infer_mode mode = SCAN; 1213 | 1214 | 1215 | int opt, debug=0; 1216 | 1217 | while ((opt = getopt(argc, argv, "dma")) != -1) { 1218 | switch (opt) { 1219 | case 'd': 1220 | debug = 1; 1221 | break; 
1222 | case 'm': 1223 | mode = MATCH; 1224 | break; 1225 | case 'a': 1226 | fprintf(stderr, "Not supported yet\n"); 1227 | exit(EXIT_FAILURE); 1228 | default: /* '?' */ 1229 | fprintf(stderr, "Usage: %s [-dma] expr [file]\n", 1230 | argv[0]); 1231 | exit(EXIT_FAILURE); 1232 | } 1233 | } 1234 | 1235 | if (optind >= argc) { 1236 | fprintf(stderr, "error: missing trre expression\n"); 1237 | exit(EXIT_FAILURE); 1238 | } 1239 | 1240 | expr = argv[optind]; 1241 | root = parse(expr); 1242 | 1243 | start = create_nft(root); 1244 | 1245 | if (debug) { 1246 | //plot_ast(root); 1247 | plot_nft(start); 1248 | } 1249 | 1250 | //printf("%c\n", start->type); 1251 | 1252 | // todo: can we do better? 1253 | output = malloc(output_capacity*sizeof(char)); 1254 | 1255 | struct slist *sl_init = slist_create(); 1256 | slist_append(sl_init, start, str_create()); 1257 | 1258 | dstart = dstate_create(sl_init); 1259 | dcache = bt_create(dstart); 1260 | 1261 | if (optind == argc - 2) { // filename provided 1262 | input_fn = argv[optind + 1]; 1263 | 1264 | fp = fopen(input_fn, "r"); 1265 | if (fp == NULL) { 1266 | fprintf(stderr, "error: cannot open file %s\n", input_fn); 1267 | exit(EXIT_FAILURE); 1268 | } 1269 | } else 1270 | fp = stdin; 1271 | 1272 | if (mode == SCAN) { 1273 | while ((read = getline(&line, &input_len, fp)) != -1) { 1274 | if (read > 0 && line[read-1] == '\n') line[read-1] = '\0'; // strip the trailing newline only when present 1275 | ch = line; 1276 | 1277 | while (*ch != '\0') { 1278 | ioffset = infer_dft(dstart, (unsigned char*)ch, dcache, mode); 1279 | if (ioffset > 0) 1280 | ch += ioffset; 1281 | else 1282 | fputc(*ch++, stdout); 1283 | } 1284 | infer_dft(dstart, (unsigned char*)ch, dcache, mode); 1285 | fputc('\n', stdout); 1286 | } 1287 | } else { /* MATCH mode and generator */ 1288 | while ((read = getline(&line, &input_len, fp)) != -1) { 1289 | if (read > 0 && line[read-1] == '\n') line[read-1] = '\0'; // strip the trailing newline only when present 1290 | ioffset = infer_dft(dstart, (unsigned char*)line, dcache, mode); 1291 | fputc('\n', stdout); 1292 | } 1293 | } 1294 | if (debug) { 1295 | plot_dft(dstart); 1296 | } 1297 | 1298 |
fclose(fp); 1299 | if (line) 1300 | free(line); 1301 | return 0; 1302 | } 1303 | -------------------------------------------------------------------------------- /trre_nft.c: -------------------------------------------------------------------------------- 1 | #define _GNU_SOURCE 2 | #include 3 | #include 4 | #include 5 | #include 6 | #include 7 | #include 8 | 9 | 10 | /* precedence table */ 11 | int prec(char c) { 12 | switch (c) { 13 | case '|': return 1; 14 | case '-': return 2; 15 | case ':': return 3; 16 | case '.': return 4; 17 | case '?': case '*': 18 | case '+': case 'I': return 5; 19 | case '\\': return 6; 20 | } 21 | return -1; 22 | } 23 | 24 | struct node { 25 | unsigned char type; 26 | unsigned char val; 27 | struct node * l; 28 | struct node * r; 29 | }; 30 | 31 | #define push(stack, item) *stack++ = item 32 | #define pop(stack) *--stack 33 | #define top(stack) *(stack-1) 34 | 35 | #define STACK_INIT_CAPACITY 32 36 | #define STACK_MAX_CAPACITY 100000 37 | 38 | static unsigned char operators[1024]; 39 | static struct node *operands[1024]; 40 | 41 | static unsigned char *opr = operators; 42 | static struct node **opd = operands; 43 | 44 | static char* output; 45 | static size_t output_capacity=32; 46 | 47 | 48 | struct node * create_node(char type, struct node *l, struct node *r) { 49 | struct node *node = malloc(sizeof(struct node)); 50 | if (!node) { 51 | fprintf(stderr, "error: node memory allocation failed\n"); 52 | exit(EXIT_FAILURE); 53 | } 54 | node->type = type; 55 | node->val = 0; 56 | node->l = l; 57 | node->r = r; 58 | return node; 59 | } 60 | 61 | // alias-helper 62 | struct node * create_nodev(unsigned char type, unsigned char val) { 63 | struct node *node = create_node(type, NULL, NULL); 64 | node->val = val; 65 | return node; 66 | } 67 | 68 | enum infer_mode { 69 | MODE_SCAN, 70 | MODE_MATCH, 71 | MODE_GEN, 72 | }; 73 | 74 | 75 | // reduce immediately 76 | void reduce_postfix(char op, int ng) { 77 | struct node *l, *r; 78 | 79 | switch(op)
{ 80 | case '*': case '+': case '?': 81 | l = pop(opd); 82 | r = create_node(op, l, NULL); 83 | r->val = ng; 84 | push(opd, r); 85 | break; 86 | default: 87 | fprintf(stderr, "error: unexpected postfix operator\n"); 88 | exit(EXIT_FAILURE); 89 | } 90 | } 91 | 92 | 93 | void reduce() { 94 | char op; 95 | struct node *l, *r; 96 | 97 | op = pop(opr); 98 | switch(op) { 99 | case '|': case '.': case ':': case '-': 100 | r = pop(opd); 101 | l = pop(opd); 102 | push(opd, create_node(op, l, r)); 103 | break; 104 | case '(': 105 | fprintf(stderr, "error: unmatched parenthesis\n"); 106 | exit(EXIT_FAILURE); 107 | } 108 | } 109 | 110 | 111 | void reduce_op(char op) { 112 | while(opr != operators && prec(top(opr)) >= prec(op)) 113 | reduce(); 114 | push(opr, op); 115 | } 116 | 117 | char* parse_curly_brackets(char *expr) { 118 | int state = 0, ng = 0; 119 | int count = 0, lv = 0; 120 | struct node *l, *r; 121 | char c; 122 | 123 | while ((c = *expr) != '\0') { 124 | if (c >= '0' && c <= '9') { 125 | count = count*10 + c - '0'; 126 | } else if (c == ',') { 127 | lv = count; 128 | count = 0; 129 | state += 1; 130 | } else if (c == '}') { 131 | if(*(expr+1) == '?') { // safe to do, we have the '\0' sentinel 132 | ng = 1; 133 | expr++; 134 | } 135 | if (state == 0) 136 | lv = count; 137 | else if (state > 1) { 138 | fprintf(stderr, "error: more than one comma in curly brackets\n"); 139 | exit(EXIT_FAILURE); 140 | } 141 | 142 | r = create_nodev(lv, count); 143 | l = create_node('I', pop(opd), r); 144 | l->val = ng; 145 | push(opd, l); 146 | 147 | return expr; 148 | } else { 149 | fprintf(stderr, "error: unexpected symbol in curly brackets: %c\n", c); 150 | exit(EXIT_FAILURE); 151 | } 152 | expr++; 153 | } 154 | fprintf(stderr, "error: unmatched curly brackets\n"); 155 | exit(EXIT_FAILURE); 156 | } 157 | 158 | char* parse_square_brackets(char *expr) { 159 | char c; 160 | int state = 0; 161 | 162 | while ((c = *expr) != '\0') { 163 | if (state == 0) { // expect operand 164 | switch(c) {
165 | case ':': case '-': case '[': case ']': 166 | fprintf(stderr, "error: unexpected symbol in square brackets: %c\n", c); 167 | exit(EXIT_FAILURE); 168 | default: 169 | push(opd, create_nodev('c', c)); // push operand 170 | state = 1; 171 | } 172 | } else { // expect operator 173 | switch(c) { 174 | case ':': case '-': 175 | reduce_op(c); 176 | state = 0; 177 | break; 178 | case ']': 179 | while (opr != operators && top(opr) != '[') 180 | reduce(); 181 | --opr; // remove [ from the stack 182 | return expr; 183 | default: // implicit alternation 184 | reduce_op('|'); 185 | state = 0; 186 | continue; // skip expr increment 187 | } 188 | } 189 | expr++; 190 | } 191 | fprintf(stderr, "error: unmatched square brackets\n"); 192 | exit(EXIT_FAILURE); 193 | } 194 | 195 | 196 | struct node * parse(char *expr) { 197 | unsigned char c; 198 | int state = 0; 199 | 200 | while ((c = *expr) != '\0') { 201 | if (state == 0) { // expect operand 202 | switch(c) { 203 | case '(': 204 | push(opr, c); 205 | break; 206 | case '[': 207 | push(opr, c); 208 | expr = parse_square_brackets(expr+1); 209 | state = 1; 210 | break; 211 | case '\\': 212 | push(opd, create_nodev('c', *++expr)); 213 | state = 1; 214 | break; 215 | case '.': 216 | //push(opd, create_nodev('a', 0)); 217 | push(opd, create_node('-', 218 | create_nodev('c', 0), 219 | create_nodev('c', 255))); 220 | state = 1; 221 | break; 222 | case ':': // epsilon as an implicit left operand 223 | push(opd, create_nodev('e', c)); 224 | state = 1; 225 | continue; // stay in the same position in expr 226 | case '|': case '*': case '+': case '?': 227 | case ')': case '{': case '}': 228 | if (opr != operators && top(opr) == ':') { // epsilon as an implicit right operand 229 | push(opd, create_nodev('e', c)); 230 | state = 1; 231 | continue; // stay in the same position in expr 232 | } else { 233 | fprintf(stderr, "error: unexpected symbol %c\n", c); 234 | exit(EXIT_FAILURE); 235 | } 236 | default: 237 | push(opd, create_nodev('c', c)); 238
| state = 1; 239 | } 240 | } else { // expect postfix or binary operator 241 | switch (c) { 242 | case '*': case '+': case '?': 243 | if(*(expr+1) == '?') { // safe to do, we have the '\0' sentinel 244 | reduce_postfix(c, 1); // found non-greedy modifier 245 | expr++; 246 | } 247 | else 248 | reduce_postfix(c, 0); 249 | break; 250 | case '|': 251 | reduce_op(c); 252 | state = 0; 253 | break; 254 | case ':': 255 | if (*(expr+1) == '\0') { // implicit epsilon as a right operand 256 | push(opd, create_nodev('e', c)); 257 | } 258 | reduce_op(c); 259 | state = 0; 260 | break; 261 | case '{': 262 | expr = parse_curly_brackets(expr+1); 263 | break; 264 | case ')': 265 | while (opr != operators && top(opr) != '(') 266 | reduce(); 267 | if (opr == operators || top(opr) != '(') { 268 | fprintf(stderr, "error: unmatched parenthesis\n"); 269 | exit(EXIT_FAILURE); 270 | } 271 | --opr; // remove ( from the stack 272 | break; 273 | default: // implicit cat 274 | reduce_op('.'); 275 | state = 0; 276 | continue; 277 | } 278 | } 279 | expr++; 280 | } 281 | 282 | while (opr != operators) { 283 | reduce(); 284 | } 285 | assert (operands != opd); 286 | 287 | return pop(opd); 288 | } 289 | 290 | 291 | void plot_ast(struct node *n) { 292 | struct node *stack[1024]; 293 | struct node **sp = stack; 294 | 295 | printf("digraph G {\n\tsplines=true; rankdir=TD;\n"); 296 | 297 | push(sp, n); 298 | while (sp != stack) { 299 | n = pop(sp); 300 | 301 | if (n->type == 'c') { 302 | //printf("ntype: %c", n->type); 303 | printf("\t\"%p\" [peripheries=2, label=\"%u\"];\n", (void*)n, n->val); 304 | } else { 305 | printf("\t\"%p\" [label=\"%c\"];\n", (void*)n, n->type); 306 | } 307 | 308 | if (n->l) { 309 | printf("\t\"%p\" -> \"%p\";\n", (void*)n, (void*)n->l); 310 | push(sp, n->l); 311 | } 312 | 313 | if (n->r) { 314 | printf("\t\"%p\" -> \"%p\" [label=\"%c\"];\n", (void*)n, (void*)n->r, 'r'); 315 | push(sp, n->r); 316 | } 317 | } 318 | printf("}\n"); 319 | } 320 | 321 | 322 | 323 | enum nstate_type {
324 | PROD, 325 | CONS, 326 | SPLIT, 327 | SPLITNG, 328 | JOIN, 329 | FINAL 330 | }; 331 | 332 | 333 | // nft state 334 | struct nstate { 335 | enum nstate_type type; 336 | char val; 337 | char mode; 338 | struct nstate *nexta; 339 | struct nstate *nextb; 340 | }; 341 | 342 | 343 | struct nstate* create_nstate(enum nstate_type type, struct nstate *nexta, struct nstate *nextb) { 344 | struct nstate *state; 345 | 346 | state = malloc(sizeof(struct nstate)); 347 | if (state == NULL) { 348 | fprintf(stderr, "error: nft state memory allocation failed\n"); 349 | exit(EXIT_FAILURE); 350 | } 351 | 352 | state->type = type; 353 | state->nexta = nexta; 354 | state->nextb = nextb; 355 | state->val = 0; 356 | return state; 357 | } 358 | 359 | 360 | 361 | struct nchunk { 362 | struct nstate * head; 363 | struct nstate * tail; 364 | }; 365 | 366 | 367 | struct nchunk chunk(struct nstate *head, struct nstate *tail) { 368 | struct nchunk chunk; 369 | chunk.head = head; 370 | chunk.tail = tail; 371 | return chunk; 372 | } 373 | 374 | 375 | struct nchunk nft(struct node *n, char mode) { 376 | struct nstate *split, *psplit, *join; 377 | struct nstate *cstate, *pstate, *state, *head, *tail, *final; 378 | struct nchunk l, r; 379 | int llv, lrv, rlv; 380 | int lb, rb; 381 | 382 | if (n == NULL) 383 | return chunk(NULL, NULL); 384 | 385 | switch(n->type) { 386 | case '.': 387 | l = nft(n->l, mode); 388 | r = nft(n->r, mode); 389 | l.tail->nexta = r.head; 390 | return chunk(l.head, r.tail); 391 | case '|': 392 | l = nft(n->l, mode); 393 | r = nft(n->r, mode); 394 | split = create_nstate(SPLITNG, l.head, r.head); 395 | join = create_nstate(JOIN, NULL, NULL); 396 | l.tail->nexta = join; 397 | r.tail->nexta = join; 398 | return chunk(split, join); 399 | case '*': 400 | l = nft(n->l, mode); 401 | split = create_nstate(n->val ? 
SPLITNG : SPLIT, NULL, l.head); 402 | l.tail->nexta = split; 403 | return chunk(split, split); 404 | case '?': 405 | l = nft(n->l, mode); 406 | join = create_nstate(JOIN, NULL, NULL); 407 | split = create_nstate(n->val ? SPLITNG : SPLIT, join, l.head); 408 | l.tail->nexta = join; 409 | return chunk(split, join); 410 | case '+': 411 | l = nft(n->l, mode); 412 | split = create_nstate(n->val ? SPLITNG : SPLIT, NULL, l.head); 413 | l.tail->nexta = split; 414 | return chunk(l.head, split); 415 | case ':': 416 | if (n->l->type == 'e') // implicit eps left operand 417 | return nft(n->r, 2); 418 | if (n->r->type == 'e') 419 | return nft(n->l, 1); 420 | // implicit eps right operand 421 | l = nft(n->l, 1); 422 | r = nft(n->r, 2); 423 | l.tail->nexta = r.head; 424 | return chunk(l.head, r.tail); 425 | case '-': 426 | if (n->l->type == 'c' && n->r->type == 'c') { 427 | join = create_nstate(JOIN, NULL, NULL); 428 | psplit = NULL; 429 | for(int c=n->r->val; c >= n->l->val; c--){ 430 | l = nft(create_nodev('c', c), mode); 431 | split = create_nstate(SPLITNG, l.head, psplit); 432 | l.tail->nexta = join; 433 | psplit = split; 434 | } 435 | return chunk(psplit, join); 436 | } else if (n->l->type == ':' && n->r->type == ':') { 437 | join = create_nstate(JOIN, NULL, NULL); 438 | psplit = NULL; 439 | 440 | llv = n->l->l->val; 441 | lrv = n->l->r->val; 442 | rlv = n->r->l->val; 443 | for(int c=rlv-llv; c >= 0; c--) { 444 | l = nft(create_node(':', 445 | create_nodev('c', llv+c), 446 | create_nodev('c', lrv+c) 447 | ), mode); 448 | split = create_nstate(SPLITNG, l.head, psplit); 449 | l.tail->nexta = join; 450 | psplit = split; 451 | } 452 | return chunk(psplit, join); 453 | } 454 | else { 455 | fprintf(stderr, "error: unexpected range syntax\n"); 456 | exit(EXIT_FAILURE); 457 | } 458 | case 'I': 459 | lb = n->r->type; /* placeholder for the left range */ 460 | rb = n->r->val; /* placeholder for the right range */ 461 | head = tail = create_nstate(JOIN, NULL, NULL); 462 | 463 | for(int 
i=0; i < lb; i++) { 464 | l = nft(n->l, mode); 465 | tail->nexta = l.head; 466 | tail = l.tail; 467 | } 468 | 469 | if(rb == 0) { 470 | l = nft(create_node('*', n->l, NULL), mode); 471 | tail->nexta = l.head; 472 | tail = l.tail; 473 | } else { 474 | final = create_nstate(JOIN, NULL, NULL); 475 | for(int i=lb; i < rb; i++) { 476 | l = nft(n->l, mode); 477 | // TODO: add non-greed modifier 478 | tail->nexta = create_nstate(n->val ? SPLITNG : SPLIT, final, l.head); 479 | tail = l.tail; 480 | } 481 | tail->nexta = final; 482 | tail = final; 483 | } 484 | 485 | return chunk(head, tail); 486 | default: //character 487 | if (mode == 0) { 488 | cstate = create_nstate(CONS, NULL, NULL); 489 | pstate = create_nstate(PROD, NULL, NULL); 490 | cstate->val = n->val; 491 | pstate->val = n->val; 492 | cstate->nexta = pstate; 493 | return chunk(cstate, pstate); 494 | } 495 | else if (mode == 1) { 496 | state = create_nstate(CONS, NULL, NULL); 497 | state->val=n->val; 498 | } else { 499 | state = create_nstate(PROD, NULL, NULL); 500 | state->val=n->val; 501 | } 502 | return chunk(state, state); 503 | } 504 | } 505 | 506 | struct nstate* create_nft(struct node *root) { 507 | struct nstate *final = create_nstate(FINAL, NULL, NULL); 508 | struct nchunk ch = nft(root, 0); 509 | ch.tail->nexta = final; 510 | return ch.head; 511 | } 512 | 513 | struct sitem { 514 | struct nstate *s; 515 | size_t i; 516 | size_t o; 517 | }; 518 | 519 | struct sstack { 520 | struct sitem *items; 521 | size_t n_items; 522 | size_t capacity; 523 | }; 524 | 525 | struct sstack * screate(size_t capacity) { 526 | struct sstack *stack; 527 | 528 | stack = (struct sstack*)malloc(sizeof(struct sstack)); 529 | if (stack != NULL) stack->items = malloc(capacity * sizeof(struct sitem)); // do not dereference a failed allocation 530 | if (stack == NULL || stack->items == NULL) { 531 | fprintf(stderr, "error: stack memory allocation failed\n"); 532 | exit(EXIT_FAILURE); 533 | } 534 | stack->n_items = 0; 535 | stack->capacity = capacity; 536 | return stack; 537 | } 538 | 539 | void sresize(struct sstack
*stack, size_t new_capacity) { 540 | stack->items = realloc(stack->items, new_capacity * sizeof(struct sitem)); 541 | if (stack->items == NULL) { 542 | fprintf(stderr, "error: stack memory re-allocation failed\n"); 543 | exit(EXIT_FAILURE); 544 | } 545 | stack->capacity = new_capacity; 546 | } 547 | 548 | void spush(struct sstack *stack, struct nstate *s, size_t i, size_t o) { 549 | struct sitem *it; 550 | if (stack->n_items == stack->capacity) { 551 | if (stack->capacity * 2 > STACK_MAX_CAPACITY) { 552 | fprintf(stderr, "error: stack max capacity reached\n"); 553 | exit(EXIT_FAILURE); 554 | } 555 | sresize(stack, stack->capacity * 2); 556 | } 557 | it = &stack->items[stack->n_items]; 558 | it->s = s; 559 | it->i = i; 560 | it->o = o; 561 | 562 | stack->n_items++; 563 | } 564 | 565 | size_t spop(struct sstack *stack, struct nstate **s, size_t *i, size_t *o) { 566 | struct sitem *it; 567 | if (stack->n_items == 0) { 568 | fprintf(stderr, "error: stack underflow\n"); 569 | exit(EXIT_FAILURE); 570 | } 571 | stack->n_items--; 572 | it = &stack->items[stack->n_items]; 573 | *s = it->s; 574 | *i = it->i; 575 | *o = it->o; 576 | 577 | return stack->n_items; 578 | } 579 | 580 | 581 | // Function to dynamically resize the output array 582 | char* resize_output(char *output, size_t *capacity) { 583 | *capacity *= 2; // Double the capacity 584 | output = realloc(output, *capacity * sizeof(char)); 585 | if (!output) { 586 | fprintf(stderr, "error: memory reallocation failed for output\n"); 587 | exit(EXIT_FAILURE); 588 | } 589 | return output; 590 | } 591 | 592 | // Main DFS traversal function 593 | ssize_t infer_backtrack(struct nstate *start, char *input, struct sstack *stack, enum infer_mode mode, int all) { 594 | size_t i = 0, o = 0; 595 | struct nstate *s = start; 596 | stack->n_items = 0; /* reset stack; do not shrink */ 597 | 598 | while (stack->n_items || s) { 599 | if (!s) { 600 | spop(stack, &s, &i, &o); 601 | if (!s) { 602 | continue; 603 | } 604 | } 605 | // Resize 
output array if necessary 606 | if (o >= output_capacity - 1) { 607 | output = resize_output(output, &output_capacity); 608 | } 609 | 610 | switch (s->type) { 611 | case CONS: 612 | if (input[i] != '\0' && s->val == input[i]) { 613 | i++; 614 | s = s->nexta; 615 | } else { 616 | s = NULL; 617 | } 618 | break; 619 | case PROD: 620 | output[o++] = s->val; 621 | s = s->nexta; 622 | break; 623 | case SPLIT: 624 | spush(stack, s->nexta, i, o); 625 | s = s->nextb; 626 | break; 627 | case SPLITNG: 628 | spush(stack, s->nextb, i, o); 629 | s = s->nexta; 630 | break; 631 | case JOIN: 632 | s = s->nexta; 633 | break; 634 | case FINAL: 635 | if (mode == MODE_MATCH) { 636 | if (input[i] == '\0') { 637 | output[o] = '\0'; // Null-terminate the output string 638 | fputs(output, stdout); 639 | fputc('\n', stdout); 640 | if (!all) 641 | return i; 642 | } 643 | } else { 644 | output[o] = '\0'; // Null-terminate the output string 645 | fputs(output, stdout); 646 | if (!all) 647 | return i; 648 | } 649 | s = NULL; 650 | break; 651 | default: 652 | fprintf(stderr, "error: unknown state type\n"); 653 | exit(1); 654 | } 655 | } 656 | return -1; 657 | } 658 | 659 | 660 | int stack_lookup(struct nstate **b, struct nstate **e, struct nstate *v) { 661 | while(b != e) 662 | if (v == *(--e)) return 1; 663 | return 0; 664 | } 665 | 666 | 667 | void plot_nft(struct nstate *start) { 668 | struct nstate *stack[1024]; 669 | struct nstate *visited[1024]; 670 | 671 | struct nstate **sp = stack; 672 | struct nstate **vp = visited; 673 | struct nstate *s = start; 674 | char l,m; 675 | 676 | printf("digraph G {\n\tsplines=true; rankdir=LR;\n"); 677 | 678 | push(sp, s); 679 | while (sp != stack) { 680 | s = pop(sp); 681 | push(vp, s); 682 | 683 | if (s->type == FINAL) 684 | printf("\t\"%p\" [peripheries=2, label=\"\"];\n", (void*)s); 685 | else { 686 | switch(s->type) { 687 | case PROD: l=s->val; m='+'; break; 688 | case CONS: l=s->val; m='-'; break; 689 | case SPLITNG: l='S'; m='n'; break; 690 | case 
SPLIT: l='S'; m=' '; break; 691 | case JOIN: l='J'; m=' '; break; 692 | default: l=' '; m=' '; break; 693 | } 694 | printf("\t\"%p\" [label=\"%c%c\"];\n", (void*)s, l, m); 695 | } 696 | 697 | if (s->nexta) { 698 | printf("\t\"%p\" -> \"%p\";\n", (void*)s, (void*)s->nexta); 699 | if (!stack_lookup(visited, vp, s->nexta)) 700 | push(sp, s->nexta); 701 | } 702 | 703 | if (s->nextb) { 704 | printf("\t\"%p\" -> \"%p\" [label=\"%c\"];\n", (void*)s, (void*)s->nextb, '*'); 705 | if (!stack_lookup(visited, vp, s->nextb)) 706 | push(sp, s->nextb); 707 | } 708 | } 709 | printf("}\n"); 710 | } 711 | 712 | 713 | int main(int argc, char **argv) 714 | { 715 | FILE *fp; 716 | char *expr; 717 | ssize_t read, ioffset; 718 | size_t input_len; 719 | char *line = NULL, *input_fn, *ch; 720 | struct node *root; 721 | struct nstate *start; 722 | struct sstack *stack = screate(STACK_INIT_CAPACITY); 723 | enum infer_mode mode = MODE_SCAN; 724 | int all = 0; // 1 = generate all the 725 | 726 | int opt, debug=0; 727 | 728 | while ((opt = getopt(argc, argv, "dma")) != -1) { 729 | switch (opt) { 730 | case 'd': 731 | debug = 1; 732 | break; 733 | case 'm': 734 | mode = MODE_MATCH; 735 | break; 736 | case 'a': 737 | all = 1; 738 | break; 739 | default: /* '?' 
*/ 740 | fprintf(stderr, "Usage: %s [-d] [-m] [-a] expr [file]\n", 741 | argv[0]); 742 | exit(EXIT_FAILURE); 743 | } 744 | } 745 | 746 | if (optind >= argc) { 747 | fprintf(stderr, "error: missing trre expression\n"); 748 | exit(EXIT_FAILURE); 749 | } 750 | 751 | expr = argv[optind]; 752 | root = parse(expr); 753 | 754 | start = create_nft(root); 755 | 756 | if (debug) { 757 | //plot_ast(root); 758 | plot_nft(start); 759 | } 760 | 761 | 762 | output = malloc(output_capacity*sizeof(char)); 763 | 764 | if (optind == argc - 2) { // filename provided 765 | input_fn = argv[optind + 1]; 766 | 767 | fp = fopen(input_fn, "r"); 768 | if (fp == NULL) { 769 | fprintf(stderr, "error: cannot open file %s\n", input_fn); 770 | exit(EXIT_FAILURE); 771 | } 772 | } else 773 | fp = stdin; 774 | 775 | if (mode == MODE_SCAN) { 776 | while ((read = getline(&line, &input_len, fp)) != -1) { 777 | if (line[read-1] == '\n') line[read-1] = '\0'; // strip the trailing newline, if any 778 | ch = line; 779 | 780 | while (*ch != '\0') { 781 | ioffset = infer_backtrack(start, ch, stack, mode, all); 782 | if (ioffset > 0) 783 | ch += ioffset; 784 | else 785 | fputc(*ch++, stdout); 786 | } 787 | // even if we have empty string we still need to run the inference 788 | infer_backtrack(start, ch, stack, mode, all); 789 | fputc('\n', stdout); 790 | } 791 | } else { /* MATCH mode */ 792 | while ((read = getline(&line, &input_len, fp)) != -1) { 793 | if (line[read-1] == '\n') line[read-1] = '\0'; // strip the trailing newline, if any 794 | infer_backtrack(start, line, stack, mode, all); 795 | //fputc('\n', stdout); 796 | } 797 | } 798 | 799 | fclose(fp); 800 | if (line) 801 | free(line); 802 | return 0; 803 | } 804 | --------------------------------------------------------------------------------
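The backtracking idea behind `infer_backtrack` in `trre_nft.c` can be illustrated with a much smaller sketch. It is not the author's code: the names `K_CONS`, `K_PROD`, `K_SPLIT`, `K_FINAL`, `struct st`, `run`, `build_demo` and `transduce` are hypothetical. It also uses recursion instead of the explicit `sstack`, and always requires the whole input to be consumed (roughly what `MODE_MATCH` does). The hand-built automaton corresponds to the expression `(a:x)*b`.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical, simplified state kinds; they loosely mirror CONS, PROD,
 * SPLIT and FINAL from trre_nft.c but are not the author's definitions. */
enum kind { K_CONS, K_PROD, K_SPLIT, K_FINAL };

struct st {
    enum kind kind;
    char val;            /* symbol to consume (K_CONS) or to emit (K_PROD) */
    const struct st *a;  /* successor; for K_SPLIT the branch tried first */
    const struct st *b;  /* K_SPLIT only: branch tried after `a` fails */
};

/* Depth-first run with backtracking. Returns the number of consumed input
 * characters on success, -1 on failure. Assumption of this sketch: every
 * loop body consumes input, so the recursion always terminates. */
static long run(const struct st *s, const char *in, size_t i,
                char *out, size_t o, size_t cap)
{
    while (s != NULL) {
        switch (s->kind) {
        case K_CONS:                       /* consume one input symbol */
            if (in[i] == '\0' || in[i] != s->val)
                return -1;
            i++;
            s = s->a;
            break;
        case K_PROD:                       /* emit one output symbol */
            if (o + 1 >= cap)
                return -1;                 /* keep room for the terminator */
            out[o++] = s->val;
            s = s->a;
            break;
        case K_SPLIT: {                    /* try `a`; on failure resume at `b` */
            long r = run(s->a, in, i, out, o, cap);
            if (r >= 0)
                return r;
            s = s->b;                      /* i and o are still the saved values */
            break;
        }
        case K_FINAL:                      /* accept only at end of input */
            if (in[i] != '\0')
                return -1;
            out[o] = '\0';
            return (long)i;
        }
    }
    return -1;
}

/* Builds the automaton for (a:x)*b inside the caller's array of 6 states. */
static const struct st *build_demo(struct st s[6])
{
    s[0] = (struct st){K_SPLIT, 0,   &s[1], &s[3]}; /* loop: one more a:x, else b */
    s[1] = (struct st){K_CONS,  'a', &s[2], NULL};
    s[2] = (struct st){K_PROD,  'x', &s[0], NULL};
    s[3] = (struct st){K_CONS,  'b', &s[4], NULL};  /* plain 'b' = consume + emit */
    s[4] = (struct st){K_PROD,  'b', &s[5], NULL};
    s[5] = (struct st){K_FINAL, 0,   NULL,  NULL};
    return &s[0];
}

/* Returns 1 and fills `out` if the whole input is accepted, 0 otherwise. */
static int transduce(const char *in, char *out, size_t cap)
{
    struct st s[6];
    assert(cap > 0);
    return run(build_demo(s), in, 0, out, 0, cap) >= 0;
}
```

With these definitions, `transduce("aab", out, sizeof out)` returns 1 and leaves `xxb` in `out`, while `transduce("aa", out, sizeof out)` returns 0 because no final `b` is found. Saving `i` and `o` by value at each split plays the role that `spush`/`spop` play in the original: output written by a failed branch is simply overwritten by the next one.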