├── .gitignore ├── LICENSE ├── README.md ├── Setup.hs ├── docs ├── _build-docs.sh ├── _template.html5 ├── index.html └── style.css ├── parsing-with-haskell-parser-combinators.cabal ├── src ├── srt-file-parser.hs └── version-number-parser.hs ├── stack.yaml ├── stack.yaml.lock └── test-input ├── gifcurry-version-output.txt ├── imagemagick-version-output.txt └── subtitles.srt /.gitignore: -------------------------------------------------------------------------------- 1 | *stack-work* 2 | *cabal*sandbox* 3 | *dist* 4 | *tmp* 5 | *blend1 6 | *blend2 7 | *blend3 8 | *blend4 9 | *blend5 10 | -------------------------------------------------------------------------------- /LICENSE: -------------------------------------------------------------------------------- 1 | BSD 3-Clause License 2 | 3 | Copyright (c) 2019, David Lettier 4 | All rights reserved. 5 | 6 | Redistribution and use in source and binary forms, with or without 7 | modification, are permitted provided that the following conditions are met: 8 | 9 | * Redistributions of source code must retain the above copyright notice, this 10 | list of conditions and the following disclaimer. 11 | 12 | * Redistributions in binary form must reproduce the above copyright notice, 13 | this list of conditions and the following disclaimer in the documentation 14 | and/or other materials provided with the distribution. 15 | 16 | * Neither the name of the copyright holder nor the names of its 17 | contributors may be used to endorse or promote products derived from 18 | this software without specific prior written permission. 19 | 20 | THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS" 21 | AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE 22 | IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE 23 | DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT HOLDER OR CONTRIBUTORS BE LIABLE 24 | FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL 25 | DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR 26 | SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER 27 | CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, 28 | OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE 29 | OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. 30 | -------------------------------------------------------------------------------- /README.md: -------------------------------------------------------------------------------- 1 | 2 |

3 | Parsing With Haskell Parser Combinators 4 |
5 | 6 |

7 |
8 | 9 | # Parsing With Haskell Parser Combinators 10 | 11 | Need to parse something? 12 | Never heard of a "parser combinator"? 13 | Looking to learn some Haskell? 14 | Awesome! 15 | Below is everything you'll need to get up and parsing with Haskell parser combinators. 16 | From here you can try tackling esoteric data serialization formats, 17 | compiler front ends, 18 | domain specific languages—you name it! 19 | 20 | - [Building The Demos](#building-the-demos) 21 | - [Running The Demos](#running-the-demos) 22 | - [Parser Combinator](#parser-combinator) 23 | - [Version Number](#version-number) 24 | - [SRT](#srt) 25 | - [Exercises](#exercises) 26 | 27 | ## Building The Demos 28 | 29 | Included with this guide are two demo programs. 30 | 31 | `version-number-parser` parses a file for a version number. 32 | `srt-file-parser` parses a file for SRT subtitles. 33 | Feel free to try them out with the files found in `test-input/`. 34 | 35 | ### Stack 36 | 37 | Download the Haskell tool [Stack](https://docs.haskellstack.org/en/stable/README/) 38 | and then run the following. 39 | 40 | ```bash 41 | git clone https://github.com/lettier/parsing-with-haskell-parser-combinators 42 | cd parsing-with-haskell-parser-combinators 43 | stack build 44 | ``` 45 | 46 | ### Cabal 47 | 48 | If using Cabal, you can run the following. 49 | 50 | ```bash 51 | git clone https://github.com/lettier/parsing-with-haskell-parser-combinators 52 | cd parsing-with-haskell-parser-combinators 53 | cabal sandbox init 54 | cabal --require-sandbox build 55 | cabal --require-sandbox install 56 | ``` 57 | 58 | ## Running The Demos 59 | 60 | After building the two demo programs, you can run them like so. 61 | 62 | ### Stack 63 | 64 | To try the version number parser, run the following. 65 | 66 | ```bash 67 | cd parsing-with-haskell-parser-combinators 68 | stack exec -- version-number-parser 69 | What is the version output file path? 70 | test-input/gifcurry-version-output.txt 71 | ``` 72 | 73 | To try the SRT file parser, run the following. 74 | 75 | ```bash 76 | cd parsing-with-haskell-parser-combinators 77 | stack exec -- srt-file-parser 78 | What is the SRT file path? 79 | test-input/subtitles.srt 80 | ``` 81 | 82 | ### Cabal 83 | 84 | To try the version number parser, run the following. 85 | 86 | ```bash 87 | cd parsing-with-haskell-parser-combinators 88 | .cabal-sandbox/bin/version-number-parser 89 | What is the version output file path? 90 | test-input/gifcurry-version-output.txt 91 | ``` 92 | 93 | To try the SRT file parser, run the following. 94 | 95 | ```bash 96 | cd parsing-with-haskell-parser-combinators 97 | .cabal-sandbox/bin/srt-file-parser 98 | What is the SRT file path? 99 | test-input/subtitles.srt 100 | ``` 101 | 102 | ## Parser Combinator 103 | 104 | 105 |

106 | Parser Combinators 107 |
108 | 109 |

110 |
111 | 112 | One of the better ways to learn about the parsing strategy, 113 | [parser combinator](https://en.wikipedia.org/wiki/Parser_combinator), 114 | is to look at an implementation of one. 115 | 116 |
117 |

118 | Parsers built using combinators are straightforward to construct, readable, modular, well-structured, and easily maintainable. 119 |

120 | 121 | —Parser combinator - Wikipedia 122 | 123 |

124 |
125 | 126 | ### ReadP 127 | 128 | Let's take a look under the hood of [ReadP](https://hackage.haskell.org/package/base-4.12.0.0/docs/Text-ParserCombinators-ReadP.html), 129 | a parser combinator library found in base. 130 | Since it is in base, you should already have it. 131 | 132 | :bulb: Note, you may want to try out [Parsec](https://hackage.haskell.org/package/parsec) after getting familiar with ReadP. 133 | It too is a parser combinator library that others prefer to ReadP. 134 | As an added bonus, it is included in 135 | [GHC's boot libraries](https://gitlab.haskell.org/ghc/ghc/wikis/commentary/libraries/version-history) 136 | as of GHC version 8.4.1. 137 | 138 | #### P Data Type 139 | 140 | ```haskell 141 | -- (c) The University of Glasgow 2002 142 | 143 | data P a 144 | = Get (Char -> P a) 145 | | Look (String -> P a) 146 | | Fail 147 | | Result a (P a) 148 | | Final [(a,String)] 149 | deriving Functor 150 | ``` 151 | 152 | We'll start with the `P` data type. 153 | The `a` in `P a` is up to you (the library user) and can be whatever you'd like. 154 | The compiler creates a functor instance automatically and there are hand-written instances for 155 | applicative, 156 | monad, 157 | `MonadFail`, 158 | and alternative. 159 | 160 | :bulb: Note, for more on functors, applicatives, and monads, checkout 161 | [Your easy guide to Monads, Applicatives, & Functors](https://medium.com/@lettier/your-easy-guide-to-monads-applicatives-functors-862048d61610). 162 | 163 | `P` is a [sum type](https://en.wikipedia.org/wiki/Tagged_union) with five cases. 164 | 165 | - `Get` consumes a single character from the input string and returns a new `P`. 166 | - `Look` accepts a duplicate of the input string and returns a new `P`. 167 | - `Fail` indicates the parser finished without a result. 168 | - `Result` holds a possible parsing and another `P` case. 169 | - `Final` is a list of two-tuples. The first tuple element is a possible parsing of the input 170 | and the second tuple element is the rest of the input string that wasn't consumed by `Get`. 171 | 172 | #### Run 173 | 174 | ```haskell 175 | -- (c) The University of Glasgow 2002 176 | 177 | run :: P a -> ReadS a 178 | run (Get f) (c:s) = run (f c) s 179 | run (Look f) s = run (f s) s 180 | run (Result x p) s = (x,s) : run p s 181 | run (Final r) _ = r 182 | run _ _ = [] 183 | ``` 184 | 185 | `run` is the heart of the ReadP parser. 186 | It does all of the heavy lifting as it recursively runs through all of the parser states that we saw up above. 187 | You can see that it takes a `P` and returns a `ReadS`. 188 | 189 | ```haskell 190 | -- (c) The University of Glasgow 2002 191 | 192 | type ReadS a = String -> [(a,String)] 193 | ``` 194 | 195 | `ReadS a` is a type alias for `String -> [(a,String)]`. 196 | So whenever you see `ReadS a`, think `String -> [(a,String)]`. 197 | 198 | ```haskell 199 | -- (c) The University of Glasgow 2002 200 | 201 | run :: P a -> String -> [(a,String)] 202 | run (Get f) (c:s) = run (f c) s 203 | run (Look f) s = run (f s) s 204 | run (Result x p) s = (x,s) : run p s 205 | run (Final r) _ = r 206 | run _ _ = [] 207 | ``` 208 | 209 | `run` pattern matches the different cases of `P`. 210 | 211 | - If it's `Get`, 212 | it calls itself with a new `P` (returned by passing the function `f`, in `Get f`, the next character `c` in the input string) 213 | and the rest of the input string `s`. 214 | - If it's `Look`, 215 | it calls itself with a new `P` (returned by passing the function `f`, in `Look f`, the input string `s`) 216 | and the input string. 217 | Notice how `Look` doesn't consume any characters from the input string like `Get` does. 218 | - If it's `Result`, 219 | it assembles a two-tuple—containing the parsed result and what's left of the input string—and 220 | prepends this to the result of a recursive call that runs with another `P` case and the input string. 221 | - If it's `Final`, `run` returns a list of two-tuples containing parsed results and input string leftovers. 222 | - For anything else, `run` returns an empty list. 223 | For example, if the case is `Fail`, `run` will return an empty list. 224 | 225 | ```haskell 226 | > run (Get (\ a -> Get (\ b -> Result [a,b] Fail))) "12345" 227 | [("12","345")] 228 | ``` 229 | 230 | ReadP doesn't expose `run` but if it did, you could call it like this. 231 | The two `Get`s consume the `'1'` and `'2'`, leaving the `"345"` behind. 232 | 233 | ```haskell 234 | > run (Get (\ a -> Get (\ b -> Result [a,b] Fail))) "12345" 235 | > run (Get (\ b -> Result ['1',b] Fail)) "2345" 236 | > run (Result ['1','2'] Fail) "345" 237 | > (['1', '2'], "345") : run (Fail) "345" 238 | > (['1', '2'], "345") : [] 239 | [("12","345")] 240 | ``` 241 | 242 | Running through each recursive call, you can see how we arrived at the final result. 243 | 244 | ```haskell 245 | > run (Get (\ a -> Get (\ b -> Result [a,b] (Final [(['a','b'],"c")])))) "12345" 246 | [("12","345"),("ab","c")] 247 | ``` 248 | 249 | Using `Final`, you can include a parsed result in the final list of two-tuples. 250 | 251 | #### readP_to_S 252 | 253 | ```haskell 254 | -- (c) The University of Glasgow 2002 255 | 256 | readP_to_S :: ReadP a -> ReadS a 257 | -- readP_to_S :: ReadP a -> String -> [(a,String)] 258 | readP_to_S (R f) = run (f return) 259 | ``` 260 | 261 | While ReadP doesn't expose `run` directly, it does expose it via `readP_to_S`. 262 | `readP_to_S` introduces a `newtype` called `ReadP`. 263 | `readP_to_S` accepts a `ReadP a`, a string, and returns a list of two-tuples. 264 | 265 | #### ReadP Newtype 266 | 267 | 268 |

269 | ReadP Newtype 270 |
271 | 272 |

273 |
274 | 275 | ```haskell 276 | -- (c) The University of Glasgow 2002 277 | 278 | newtype ReadP a = R (forall b . (a -> P b) -> P b) 279 | ``` 280 | 281 | Here's the definition of `ReadP a`. 282 | There are instances for functor, applicative, monad, `MonadFail`, alternative, and `MonadPlus`. 283 | The `R` constructor takes a function that takes another function and returns a `P`. 284 | The accepted function takes whatever you chose for `a` and returns a `P`. 285 | 286 | ```haskell 287 | -- (c) The University of Glasgow 2002 288 | 289 | readP_to_S (R f) = run (f return) 290 | ``` 291 | 292 | Recall that `P` is a monad and `return`'s type is `a -> m a`. 293 | So `f` is the `(a -> P b) -> Pb` function and `return` is the `(a -> P b)` function. 294 | Ultimately, `run` gets the `P b` it expects. 295 | 296 | ```haskell 297 | -- (c) The University of Glasgow 2002 298 | 299 | readP_to_S (R f) inputString = run (f return) inputString 300 | -- ^^^^^^^^^^^ ^^^^^^^^^^^ 301 | ``` 302 | 303 | It's left off in the source code but remember that `readP_to_S` and `run` expects an input string. 304 | 305 | ```haskell 306 | -- (c) The University of Glasgow 2002 307 | 308 | instance Functor ReadP where 309 | fmap h (R f) = R (\k -> f (k . h)) 310 | ``` 311 | 312 | Here's the functor instance definition for `ReadP`. 313 | 314 | ```haskell 315 | > readP_to_S (fmap toLower get) "ABC" 316 | [('a',"BC")] 317 | 318 | > readP_to_S (toLower <$> get) "ABC" 319 | [('a',"BC")] 320 | ``` 321 | 322 | This allows us to do something like this. 323 | `fmap` functor maps `toLower` over the functor `get` which equals `R Get`. 324 | Recall that the type of `Get` is `(Char -> P a) -> P a` which the `ReadP` constructor (`R`) accepts. 325 | 326 | ```haskell 327 | -- (c) The University of Glasgow 2002 328 | 329 | fmap h (R f ) = R (\ k -> f (k . h )) 330 | fmap toLower (R Get) = R (\ k -> Get (k . toLower)) 331 | ``` 332 | 333 | Here you see the functor definition rewritten for the `fmap toLower get` example. 334 | 335 | #### Applicative P Instance 336 | 337 | Looking up above, how did `readP_to_S` return `[('a',"BC")]` when we only used `Get` which doesn't terminate `run`? 338 | The answer lies in the applicative definition for `P`. 339 | 340 | ```haskell 341 | -- (c) The University of Glasgow 2002 342 | 343 | instance Applicative P where 344 | pure x = Result x Fail 345 | (<*>) = ap 346 | ``` 347 | 348 | `return` equals `pure` so we could rewrite `readP_to_S (R f) = run (f return)` to be `readP_to_S (R f) = run (f pure)`. 349 | By using `return` or rather `pure`, `readP_to_S` sets `Result x Fail` as the final case `run` will encounter. 350 | If reached, 351 | `run` will terminate and we'll get our list of parsings. 352 | 353 | ```haskell 354 | > readP_to_S (fmap toLower get) "ABC" 355 | 356 | -- Use the functor instance to transform fmap toLower get. 357 | > readP_to_S (R (\ k -> Get (k . toLower))) "ABC" 358 | 359 | -- Call run which removes R. 360 | > run ((\ k -> Get (k . toLower)) pure) "ABC" 361 | 362 | -- Call function with pure to get rid of k. 363 | > run (Get (pure . toLower)) "ABC" 364 | 365 | -- Call run for Get case to get rid of Get. 366 | > run ((pure . toLower) 'A') "BC" 367 | 368 | -- Call toLower with 'A' to get rid of toLower. 369 | > run (pure 'a') "BC" 370 | 371 | -- Use the applicative instance to transform pure 'a'. 372 | > run (Result 'a' Fail) "BC" 373 | 374 | -- Call run for the Result case to get rid of Result. 375 | > ('a', "BC") : run (Fail) "BC" 376 | 377 | -- Call run for the Fail case to get rid of Fail. 378 | > ('a', "BC") : [] 379 | 380 | -- Prepend. 381 | [('a',"BC")] 382 | ``` 383 | 384 | Here you see the flow from `readP_to_S` to the parsed result. 385 | 386 | #### Alternative P Instance 387 | 388 | ```haskell 389 | -- (c) The University of Glasgow 2002 390 | 391 | instance Alternative P where 392 | -- ... 393 | 394 | -- most common case: two gets are combined 395 | Get f1 <|> Get f2 = Get (\c -> f1 c <|> f2 c) 396 | 397 | -- results are delivered as soon as possible 398 | Result x p <|> q = Result x (p <|> q) 399 | p <|> Result x q = Result x (p <|> q) 400 | 401 | -- ... 402 | ``` 403 | 404 | The `Alternative` instance for `P` allows us to split the flow of the parser into a left and right path. 405 | This comes in handy when the input can go none, one, or (more rarely) two of two ways. 406 | 407 | ```haskell 408 | > readP_to_S ((get >>= \ a -> return a) <|> (get >> get >>= \ b -> return b)) "ABC" 409 | [('A',"BC"),('B',"C")] 410 | ``` 411 | 412 | The `<|>` operator or function introduces a fork in the parser's flow. 413 | The parser will travel through both the left and right paths. 414 | The end result will contain all of the possible parsings that went left 415 | and all of the possible parsings that went right. 416 | If both paths fail, then the whole parser fails. 417 | 418 | :bulb: Note, in other parser combinator implementations, 419 | when using the `<|>` operator, 420 | the parser will go left or right but not both. 421 | If the left succeeds, the right is ignored. 422 | The right is only processed if the left side fails. 423 | 424 | ```haskell 425 | > readP_to_S ((get >>= \ a -> return [a]) <|> look <|> (get >> get >>= \a -> return [a])) "ABC" 426 | [("ABC","ABC"),("A","BC"),("B","C")] 427 | ``` 428 | 429 | You can chain the `<|>` operator for however many options or alternatives there are. 430 | The parser will return a possible parsing involving each. 431 | 432 | #### ReadP Failure 433 | 434 | ```haskell 435 | -- (c) The University of Glasgow 2002 436 | 437 | instance Monad ReadP where 438 | fail _ = R (\_ -> Fail) 439 | R m >>= f = R (\k -> m (\a -> let R m' = f a in m' k)) 440 | ``` 441 | 442 | Here is the `ReadP` monad instance. 443 | Notice the definition for `fail`. 444 | 445 | ```haskell 446 | > readP_to_S ((\ a b c -> [a,b,c]) <$> get <*> get <*> get) "ABC" 447 | [("ABC","")] 448 | 449 | > readP_to_S ((\ a b c -> [a,b,c]) <$> get <*> fail "" <*> get) "ABC" 450 | [] 451 | 452 | > readP_to_S (get >>= \ a -> get >>= \ b -> get >>= \ c -> return [a,b,c]) "ABC" 453 | [("ABC","")] 454 | 455 | > readP_to_S (get >>= \ a -> get >>= \ b -> fail "" >>= \ c -> return [a,b,c]) "ABC" 456 | [] 457 | ``` 458 | 459 | You can cause an entire parser path to abort by calling `fail`. 460 | Since ReadP doesn't provide a direct way to generate a `Result` or `Final` case, 461 | the return value will be an empty list. 462 | If the failed path is the only path, then the entire result will be an empty list. 463 | Recall that when `run` matches `Fail`, it returns an empty list. 464 | 465 | ```haskell 466 | -- (c) The University of Glasgow 2002 467 | 468 | instance Alternative P where 469 | -- ... 470 | 471 | -- fail disappears 472 | Fail <|> p = p 473 | p <|> Fail = p 474 | 475 | -- ... 476 | ``` 477 | 478 | Going back to the alternative `P` instance, 479 | you can see how a failure on either side (but not both) will not fail the whole parser. 480 | 481 | ```haskell 482 | > readP_to_S (get >>= \ a -> get >>= \ b -> pfail >>= \ c -> return [a,b,c]) "ABC" 483 | [] 484 | ``` 485 | 486 | Instead of using `fail`, ReadP provides `pfail` which allows you to generate a `Fail` case directly. 487 | 488 | ## Version Number 489 | 490 | 491 |

492 | Version Number 493 |
494 | 495 |

496 |
497 | 498 | [Gifcurry](https://github.com/lettier/gifcurry), 499 | the Haskell-built video editor for GIF makers, shells out to various different programs. 500 | To ensure compatibility, it needs the version number for each of the programs it shells out to. 501 | One of those programs is ImageMagick. 502 | 503 | ```bash 504 | Version: ImageMagick 6.9.10-14 Q16 x86_64 2018-10-24 https://imagemagick.org 505 | Copyright: © 1999-2018 ImageMagick Studio LLC 506 | License: https://imagemagick.org/script/license.php 507 | Features: Cipher DPC HDRI Modules OpenCL OpenMP 508 | ``` 509 | 510 | Here you see the output of `convert --version`. 511 | How could you parse this to capture the 6, 9, 10, and 14? 512 | 513 | Looking at the output, 514 | we know the version number is a collection of numbers separated by either a period or a dash. 515 | This definition covers the dates as well so we'll make sure that the first two numbers are separated by a period. 516 | That way, if they put a date before the version number, we won't get the wrong result. 517 | 518 | 519 |

520 | Version Number Parser 521 |
522 | 523 |

524 |
525 | 526 | ```txt 527 | 1. Consume zero or more characters that are not 0 through 9 and go to 2. 528 | 2. Consume zero or more characters that are 0 through 9, save this number, and go to 3. 529 | 3. Look at the rest of the input and go to 4. 530 | 4. If the input 531 | - is empty, go to 6. 532 | - starts with a period, go to 1. 533 | - starts with a dash 534 | - and you have exactly one number, go to 5. 535 | - and you have more than one number, go to 1. 536 | - doesn't start with a period or dash 537 | - and you have exactly one number, go to 5. 538 | - you have more than one number, go to 6. 539 | 5. Delete any saved numbers and go to 1. 540 | 6. Return the numbers found. 541 | ``` 542 | 543 | Before we dive into the code, here's the algorithm we'll be following. 544 | 545 | ### Building The Version Number Parser 546 | 547 | ```haskell 548 | parseVersionNumber 549 | :: [String] 550 | -> ReadP [String] 551 | parseVersionNumber 552 | nums 553 | = do 554 | _ <- parseNotNumber 555 | num <- parseNumber 556 | let nums' = nums ++ [num] 557 | parseSeparator nums' parseVersionNumber 558 | ``` 559 | 560 | `parseVersionNumber` is the main parser combinator that parses an input string for a version number. 561 | It accepts a list of strings and returns a list of strings in the context of the `ReadP` data type. 562 | The accepted list of strings is not the input that gets parsed but rather the list of numbers found so far. 563 | For the first function call, the list is empty since it hasn't parsed anything yet. 564 | 565 | ```haskell 566 | parseVersionNumber 567 | nums 568 | ``` 569 | 570 | Starting from the top, 571 | `parseVersionNumber` takes a list of strings which are the current list of numbers found so far. 572 | 573 | ```haskell 574 | _ <- parseNotNumber 575 | ``` 576 | 577 | `parseNotNumber` consumes everything that isn't a number from the input string. 578 | Since we are not interested in the result, we discard it (`_ <-`). 579 | 580 | ```haskell 581 | num <- parseNumber 582 | let nums' = nums ++ [num] 583 | ``` 584 | 585 | Next we consume everything that is a number and then add that to the list of numbers found so far. 586 | 587 | ```haskell 588 | parseSeparator nums' parseVersionNumber 589 | ``` 590 | 591 | After `parseVersionNumber` has processed the next number, it passes the list of numbers found and itself to `parseSeparator`. 592 | 593 | #### Parsing The Separator 594 | 595 | ```haskell 596 | parseSeparator 597 | :: [String] 598 | -> ([String] -> ReadP [String]) 599 | -> ReadP [String] 600 | parseSeparator 601 | nums 602 | f 603 | = do 604 | next <- look 605 | case next of 606 | "" -> return nums 607 | (c:_) -> 608 | case c of 609 | '.' -> f nums 610 | '-' -> if length nums == 1 then f [] else f nums 611 | _ -> if length nums == 1 then f [] else return nums 612 | ``` 613 | 614 | Here you see `parseSeparator`. 615 | 616 | ```haskell 617 | next <- look 618 | case next of 619 | "" -> return nums 620 | (c:_) -> 621 | ``` 622 | 623 | `look` allows us to get what's left of the input string without consuming it. 624 | If there's nothing left, it returns the numbers found. 625 | However, if there is something left, it analyzes the first character. 626 | 627 | ```haskell 628 | case c of 629 | '.' -> f nums 630 | '-' -> if length nums == 1 then f [] else f nums 631 | _ -> if length nums == 1 then f [] else return nums 632 | ``` 633 | 634 | If the next character is a period, call `parseVersionNumber` again with the current list of numbers found. 635 | If it's a dash and we have exactly one number, call `parseVersionNumber` with an empty list of numbers since it's a date. 636 | If it's a dash and we don't have exactly one number, call `parseVersionNumber` with the list of numbers found so far. 637 | Otherwise, 638 | call `parseVersionNumber` with an empty list if we have exactly one number 639 | or return the numbers found if we don't have exactly one number. 640 | 641 | #### Parsing Non-numbers 642 | 643 | ```haskell 644 | parseNotNumber 645 | :: ReadP String 646 | parseNotNumber 647 | = 648 | munch (not . isNumber) 649 | ``` 650 | 651 | `parseNotNumber` uses `munch` which `ReadP` provides. 652 | `munch` is given the predicate `(not . isNumber)` which returns true for any character that isn't 0 through 9. 653 | 654 | ```haskell 655 | munch :: (Char -> Bool) -> ReadP String 656 | ``` 657 | 658 | `munch` continuously calls `get` if the next character in the input string satisfies the predicate. 659 | If it doesn't, `munch` returns the characters that did, if any. 660 | Since it only uses `get`, munch always succeeds. 661 | 662 | :bulb: Note, `parseNumber` is similar to `parseNotNumber`. 663 | Instead of `not . isNumber`, the predicate is just `isNumber`. 664 | 665 | #### Munch Versus Many 666 | 667 | ```haskell 668 | parseNotNumber' 669 | :: ReadP String 670 | parseNotNumber' 671 | = 672 | many (satisfy (not . isNumber)) 673 | ``` 674 | 675 | Instead of using `munch`, 676 | you could write `parseNotNumber` like this, 677 | using `many` and `satisfy`—both of which ReadP provides. 678 | Looking at the type signature for `many`, it accepts a single parser combinator (`ReadP a`). 679 | In this instance, it's being given the parser combinator `satisfy`. 680 | 681 | ```haskell 682 | > readP_to_S (satisfy (not . isNumber)) "a" 683 | [('a',"")] 684 | 685 | > readP_to_S (satisfy (not . isNumber)) "1" 686 | [] 687 | ``` 688 | 689 | `satisfy` takes a predicate and uses `get` to consume the next character. 690 | If the accepted predicate returns true, `satisfy` returns the character. 691 | Otherwise, `satisfy` calls `pfail` and fails. 692 | 693 | ```haskell 694 | > readP_to_S (munch (not . isNumber)) "abc123" 695 | [("abc","123")] 696 | 697 | > readP_to_S (many (satisfy (not . isNumber))) "abc123" 698 | [("","abc123"),("a","bc123"),("ab","c123"),("abc","123")] 699 | ``` 700 | 701 | Using `many` can give you unwanted results. 702 | Ultimately, `many` introduces one or more `Result` cases. 703 | Because of this, `many` always succeeds. 704 | 705 | ```haskell 706 | > readP_to_S (many look) "abc123" 707 | -- Runs forever. 708 | ``` 709 | 710 | `many` will run your parser until it fails or runs out of input. 711 | If your parser never fails or never runs out of input, `many` will never return. 712 | 713 | ```haskell 714 | > readP_to_S (many (get >>= \ a -> return (read (a : "") :: Int))) "12345" 715 | [([],"12345"),([1],"2345"),([1,2],"345"),([1,2,3],"45"),([1,2,3,4],"5"),([1,2,3,4,5],"")] 716 | ``` 717 | 718 | For every index in the result, 719 | the parsed result will be the outcome of having ran the parser index times on the entire input. 720 | 721 | ```haskell 722 | > let parser = get >>= \ a -> return (read (a : "") :: Int) 723 | > let many' results = return results <|> (parser >>= \ result -> many' (results ++ [result])) 724 | > readP_to_S (many' []) "12345" 725 | [([],"12345"),([1],"2345"),([1,2],"345"),([1,2,3],"45"),([1,2,3,4],"5"),([1,2,3,4,5],"")] 726 | ``` 727 | 728 | Here's an alternate definition for `many`. 729 | On the left side of `<|>`, 730 | it returns the current parser results. 731 | On the right side of `<|>`, 732 | it runs the parser, 733 | adds that result to the current parser results, 734 | and calls itself with the updated results. 735 | This has a cumulative sum type effect where index `i` is the parser result appended to the parser result at 736 | `i - 1`, 737 | `i - 2`, 738 | ..., 739 | and `1`. 740 | 741 | ### Running The Version Number Parser 742 | 743 | Now that we built the parser, let's run it. 744 | 745 | ```haskell 746 | > let inputString = 747 | > "Some Program (C) 1234-56-78 All rights reserved.\n\ 748 | > \Version: 12.345.6-7\n\ 749 | > \License: Some open source license." 750 | > readP_to_S (parseVersionNumber []) inputString 751 | [(["12","345","6","7"],"\nLicense: Some open source license.")] 752 | ``` 753 | 754 | You can see it extracted the version number correctly even with the date coming before it. 755 | 756 | ## SRT 757 | 758 | 759 |

760 | SRT 761 |
762 | 763 |

764 |
765 | 766 | Now let's parse something more complicated—SRT files. 767 | 768 | For the release of 769 | [Gifcurry](https://lettier.github.io/gifcurry) 770 | six, I needed to parse 771 | [SRT (SubRip Text) files](http://www.visualsubsync.org/help/srt). 772 | SRT files contain subtitles that video processing programs use to display text on top of a video. 773 | Typically this text is the dialog of a movie translated into various different languages. 774 | By keeping the text separate from the video, 775 | there only needs to be one video which saves time, storage space, and bandwidth. 776 | The video software can swap out the text without having to swap out the video. 777 | Contrast this with burning-in or hard-coding the subtitles where the text becomes a part of the image data that makes up the video. 778 | In this case, you would need a video for each collection of subtitles. 779 | 780 | 781 |

782 | Gifcurry 783 |
784 | Inner Video © Blender Foundation | www.sintel.org 785 |

786 |
787 | 788 | Gifcurry can take a SRT file and burn-in the subtitles for the video slice your select. 789 | 790 | ```txt 791 | 7 792 | 00:02:09,400 --> 00:02:13,800 793 | What brings you to 794 | the land of the gatekeepers? 795 | 796 | 8 797 | 00:02:15,000 --> 00:02:17,500 798 | I'm searching for someone. 799 | 800 | 9 801 | 00:02:18,000 --> 00:02:22,200 802 | Someone very dear? 803 | A kindred spirit? 804 | ``` 805 | 806 | Here you see the English subtitles for 807 | [Sintel](https://durian.blender.org/) (© Blender Foundation | www.sintel.org). 808 | 809 | ### SRT Format 810 | 811 |
812 |

813 | SRT is perhaps the most basic of all subtitle formats. 814 |

815 | 816 | —SRT Subtitle | Matrosk 817 | 818 |

819 |
820 | 821 | The SRT file format consists of blocks, one for each subtitle, separated by an empty line. 822 | 823 | ```txt 824 | 2 825 | ``` 826 | 827 | At the top of the block is the index. 828 | This determines the order of the subtitles. 829 | Hopefully the subtitles are already in order and all of them have unique indexes but this may not be the case. 830 | 831 | ```txt 832 | 01:04:13,000 --> 02:01:01,640 X1:167 X2:267 Y1:33 Y2:63 833 | ``` 834 | 835 | After the index is the start time, end time, and an optional set of points specifying the rectangle the 836 | subtitle text should go in. 837 | 838 | ```txt 839 | 01:04:13,000 840 | ``` 841 | 842 | The timestamp format is `hours:minutes:seconds,milliseconds`. 843 | 844 | :bulb: Note the comma instead of the period separating the seconds from the milliseconds. 845 | 846 | ```txt 847 | This is the actual subtitle 848 | text. It can span multiple lines. 849 | It may include formating 850 | like bold, italic, 851 | underline, 852 | and font color. 853 | ``` 854 | 855 | The third and last part of a block is the subtitle text. 856 | It can span multiple lines and ends when there is an empty line. 857 | The text can include formatting tags reminiscent of HTML. 858 | 859 | ### Building The SRT Parser 860 | 861 | 862 |

863 | Parsing SRT 864 |
865 | 866 |

867 |
868 | 869 | ```haskell 870 | parseSrt 871 | :: ReadP [SrtSubtitle] 872 | parseSrt 873 | = 874 | manyTill parseBlock (skipSpaces >> eof) 875 | ``` 876 | 877 | `parseSrt` is the main parser combinator that handles everything. 878 | It parses each block until it reaches the end of the file (`eof`) or input. 879 | To be on the safe side, 880 | there could be trailing whitespace between the last block and the end of the file. 881 | To handle this, it parses zero or more characters of whitespace (`skipSpaces`) before parsing 882 | the end of the file (`skipSpaces >> eof`). 883 | If there is still input left by the time `eof` is reached, `eof` will fail and this will return nothing. 884 | Therefore, it's important that `parseBlock` doesn't leave any thing but whitespace behind. 885 | 886 | #### Building The SRT Block Parser 887 | 888 | ```haskell 889 | parseBlock 890 | :: ReadP SrtSubtitle 891 | parseBlock 892 | = do 893 | i <- parseIndex 894 | (s, e) <- parseTimestamps 895 | c <- parseCoordinates 896 | t <- parseTextLines 897 | return 898 | SrtSubtitle 899 | { index = i 900 | , start = s 901 | , end = e 902 | , coordinates = c 903 | , taggedText = t 904 | } 905 | ``` 906 | 907 | As we went over earlier, a block consists of an index, timestamps, possibly some coordinates, and some lines of text. 908 | In this version of `parseBlock`, you see the more imperative do notation style with the record syntax. 909 | 910 | ```haskell 911 | parseBlock' 912 | :: ReadP SrtSubtitle 913 | parseBlock' 914 | = 915 | SrtSubtitle 916 | <$> parseIndex 917 | <*> parseStartTimestamp 918 | <*> parseEndTimestamp 919 | <*> parseCoordinates 920 | <*> parseTextLines 921 | ``` 922 | 923 | Here's another way you could write `parseBlock`. 924 | This is the applicative style. 925 | Just be sure to get the order right. 926 | For example, I could've accidentally mixed up the start and end timestamps. 927 | 928 | #### Building The SRT Index Parser 929 | 930 | 931 |

932 | Parsing The Index 933 |
934 | 935 |

936 |
937 | 938 | ```haskell 939 | parseIndex 940 | :: ReadP Int 941 | parseIndex 942 | = 943 | skipSpaces 944 | >> readInt <$> parseNumber 945 | ``` 946 | 947 | At the top of the block is the index. 948 | Here you see `skipSpaces` again. 949 | After skipping over whitespace, 950 | it parses the input for numbers and converts it to an actual integer. 951 | 952 | ```haskell 953 | readInt 954 | :: String 955 | -> Int 956 | readInt 957 | = 958 | read 959 | ``` 960 | 961 | `readInt` looks like this. 962 | 963 | ```haskell 964 | > read "123" :: Int 965 | 123 966 | > read "1abc" :: Int 967 | *** Exception: Prelude.read: no parse 968 | ``` 969 | 970 | Normally using `read` directly can be dangerous. 971 | `read` may not be able to convert the input to the specified type. 972 | However, `parseNumber` will only return the 10 numerical digit characters (`['0'..'9']`) 973 | so using `read` directly becomes safe. 974 | 975 | #### Building The SRT Timestamps Parser 976 | 977 | 978 |

979 | Parsing The Timestamps 980 |
981 | 982 |

983 |
984 | 985 | Parsing the timestamps are a little more involved than parsing the index. 986 | 987 | ```haskell 988 | parseTimestamps 989 | :: ReadP (Timestamp, Timestamp) 990 | parseTimestamps 991 | = do 992 | _ <- char '\n' 993 | s <- parseTimestamp 994 | _ <- skipSpaces 995 | _ <- string "-->" 996 | _ <- skipSpaces 997 | e <- parseTimestamp 998 | return (s, e) 999 | ``` 1000 | 1001 | This is the main combinator for parsing the timestamps. 1002 | 1003 | `char` parses the character you give it or it fails. 1004 | If it fails then `parseTimestamps` fails, ultimately causing `parseSrt` to fail 1005 | so there must be a newline character after the index. 1006 | 1007 | `string` is like `char` except instead of just one character, it 1008 | parses the string of characters you give it or it fails. 1009 | 1010 | ```haskell 1011 | parseStartTimestamp 1012 | :: ReadP Timestamp 1013 | parseStartTimestamp 1014 | = 1015 | char '\n' 1016 | >> parseTimestamp 1017 | ``` 1018 | 1019 | `parseTimestamps` parses both timestamps, 1020 | but for the applicative style (`parseSrt'`), 1021 | we need a parser just for the start timestamp. 1022 | 1023 | ```haskell 1024 | parseEndTimestamp 1025 | :: ReadP Timestamp 1026 | parseEndTimestamp 1027 | = 1028 | skipSpaces 1029 | >> string "-->" 1030 | >> skipSpaces 1031 | >> parseTimestamp 1032 | ``` 1033 | 1034 | This parses everything between the timestamps and returns the end timestamp. 1035 | 1036 | ```haskell 1037 | parseTimestamp 1038 | :: ReadP Timestamp 1039 | parseTimestamp 1040 | = do 1041 | h <- parseNumber 1042 | _ <- char ':' 1043 | m <- parseNumber 1044 | _ <- char ':' 1045 | s <- parseNumber 1046 | _ <- char ',' <|> char '.' 1047 | m' <- parseNumber 1048 | return 1049 | Timestamp 1050 | { hours = readInt h 1051 | , minutes = readInt m 1052 | , seconds = readInt s 1053 | , milliseconds = readInt m' 1054 | } 1055 | ``` 1056 | 1057 | This parses the four numbers that make up the timestamp. 1058 | The first three numbers are separated by a colon and the last one is separated by a comma. 1059 | To be more forgiving, however, we allow the possibility of there being a period instead of a comma. 1060 | 1061 | ```haskell 1062 | > readP_to_S (char '.' <|> char ',') "..." 1063 | [('.',"..")] 1064 | 1065 | > readP_to_S (char '.' <|> char ',') ",.." 1066 | [(',',"..")] 1067 | ``` 1068 | 1069 | :bulb: Note, when using `char` with `<|>`, 1070 | only one side can succeed (two `char` enter, one `char` leave) 1071 | since `char` consumes a single character and two characters cannot occupy the same space. 1072 | 1073 | #### Building The SRT Coordinates Parser 1074 | 1075 | 1076 |

1077 | Parsing The Coordinates 1078 |
1079 | 1080 |

1081 |
1082 | 1083 | The coordinates are an optional part of the block but if included, will be on the same line as the timestamps. 1084 | 1085 | ```haskell 1086 | parseCoordinates 1087 | :: ReadP (Maybe SrtSubtitleCoordinates) 1088 | parseCoordinates 1089 | = 1090 | option Nothing $ do 1091 | _ <- skipSpaces1 1092 | x1 <- parseCoordinate 'x' 1 1093 | _ <- skipSpaces1 1094 | x2 <- parseCoordinate 'x' 2 1095 | _ <- skipSpaces1 1096 | y1 <- parseCoordinate 'y' 1 1097 | _ <- skipSpaces1 1098 | y2 <- parseCoordinate 'y' 2 1099 | return 1100 | $ Just 1101 | SrtSubtitleCoordinates 1102 | { x1 = readInt x1 1103 | , x2 = readInt x2 1104 | , y1 = readInt y1 1105 | , y2 = readInt y2 1106 | } 1107 | ``` 1108 | 1109 | `option` takes two arguments. 1110 | The first argument is returned if the second argument, a parser, fails. 1111 | So if the coordinates parser fails, `parseCoordinates` will return `Nothing`. 1112 | Put another way, the coordinates parser failing does not cause the whole parser to fail. 1113 | This block will just have `Nothing` for its `coordinates` "field". 1114 | 1115 | ```haskell 1116 | parseCoordinate 1117 | :: Char 1118 | -> Int 1119 | -> ReadP String 1120 | parseCoordinate 1121 | c 1122 | n 1123 | = do 1124 | _ <- char (Data.Char.toUpper c) <|> char (Data.Char.toLower c) 1125 | _ <- string $ show n ++ ":" 1126 | parseNumber 1127 | ``` 1128 | 1129 | This parser allows the coordinate labels to be in either uppercase or lowercase. 1130 | For example, `x1:1 X2:2 Y1:3 y2:4` would succeed. 1131 | 1132 | #### Building The SRT Text Parser 1133 | 1134 | 1135 |

1136 | Parsing The Text 1137 |
1138 | 1139 |

1140 |
1141 | 1142 | Parsing the text is the most involved portion due to the HTML-like tag formatting. 1143 | 1144 | Tag parsing can be challenging—just ask anyone who parses them with a regular expression. 1145 | To make this easier on us—and for the user—we'll use a 1146 | [tag soup](https://en.wikipedia.org/wiki/Tag_soup) 1147 | kind of approach. 1148 | The parser will allow unclosed and/or wrongly nested tags. 1149 | It will also allow any tag and not just `b`, `u`, `i`, and `font`. 1150 | 1151 | ```haskell 1152 | parseTextLines 1153 | :: ReadP [TaggedText] 1154 | parseTextLines 1155 | = 1156 | char '\n' 1157 | >> (getTaggedText <$> manyTill parseAny parseEndOfTextLines) 1158 | ``` 1159 | 1160 | We start out by matching on a newline character. 1161 | After that, we functor map or fmap (`<$>`) `getTaggedText` over the subtitle text characters until we reach the end of the text lines. 1162 | 1163 | ```haskell 1164 | parseEndOfTextLines 1165 | :: ReadP () 1166 | parseEndOfTextLines 1167 | = 1168 | void (string "\n\n") <|> eof 1169 | ``` 1170 | 1171 | We stop collecting characters (`parseAny`) when we reach two newline characters or the end of the file. 1172 | This signals the end of the block. 1173 | 1174 | ```haskell 1175 | getTaggedText 1176 | :: String 1177 | -> [TaggedText] 1178 | getTaggedText 1179 | s 1180 | = 1181 | fst 1182 | $ foldl 1183 | folder 1184 | ([], []) 1185 | parsed 1186 | where 1187 | ``` 1188 | 1189 | `getTaggedText` folds through the parsed text from left to right, returning the accumulated tagged text. 1190 | 1191 | ```haskell 1192 | parsed 1193 | :: [String] 1194 | parsed 1195 | = 1196 | case readP_to_S (parseTaggedText []) s of 1197 | [] -> [s] 1198 | r@(_:_) -> (fst . last) r 1199 | ``` 1200 | 1201 | `parsed` returns a list of one or more strings. 1202 | It attempts to parse the input text for tags. 1203 | If that fails, `parsed` returns the input string inside a list. 1204 | Otherwise, if `parseTaggedText` succeeds, `parse` returns the last possible parsing (`(fst . last) r`). 1205 | 1206 | ```haskell 1207 | folder 1208 | :: ([TaggedText], [Tag]) 1209 | -> String 1210 | -> ([TaggedText], [Tag]) 1211 | folder 1212 | (tt, t) 1213 | x 1214 | | isTag x = (tt, updateTags t x) 1215 | | otherwise = (tt ++ [TaggedText { text = x, tags = t}], t) 1216 | ``` 1217 | 1218 | As `folder` moves from left to right, over the parsed strings, it checks if the current string is a tag. 1219 | If it is a tag, it updates the current set of active tags (`t`). 1220 | Otherwise, it appends another tagged piece of text associated with the set of active tags. 1221 | 1222 | ```haskell 1223 | updateTags 1224 | :: [Tag] 1225 | -> String 1226 | -> [Tag] 1227 | updateTags 1228 | tags 1229 | x 1230 | | isClosingTag x = remove compare' tags (makeTag x) 1231 | | isOpeningTag x = add compare' tags (makeTag x) 1232 | | otherwise = tags 1233 | where 1234 | compare' 1235 | :: Tag 1236 | -> Tag 1237 | -> Bool 1238 | compare' 1239 | a 1240 | b 1241 | = 1242 | name a /= name b 1243 | ``` 1244 | 1245 | `updateTags` updates the `tags` given by either removing or adding the given tag (`x`) depending on if it is a closing or opening tag. 1246 | If it is neither, it just returns the passed set of tags. 1247 | `add` will overwrite an existing tag if `tags` already has a tag by the same name. 1248 | You can see this in the `compare'` function given. 1249 | 1250 | To keep the parser simple, if an opening tag `T` is found, `T` gets added to the list of tags 1251 | or overwrites an exiting `T` if already present. 1252 | If a corresponding closing `/T` is found, then `T` is removed from the list of tags, if present. 1253 | It doesn't matter if there is two or more `T`s in a row, 1254 | one or more `T`s without a closing `/T`, 1255 | and/or there's a closing `/T` without an opening `T`. 1256 | 1257 | ```haskell 1258 | makeTag 1259 | :: String 1260 | -> Tag 1261 | makeTag 1262 | s 1263 | = 1264 | Tag 1265 | { name = getTagName s 1266 | , attributes = getTagAttributes s 1267 | } 1268 | ``` 1269 | 1270 | `makeTag` assembles a tag from the given string (`s`). 1271 | Each `Tag` has a name and zero or more attributes. 1272 | 1273 | ```haskell 1274 | parseTaggedText 1275 | :: [String] 1276 | -> ReadP [String] 1277 | parseTaggedText 1278 | strings 1279 | = do 1280 | s <- look 1281 | case s of 1282 | "" -> return strings 1283 | _ -> do 1284 | r <- munch1 (/= '<') <++ parseClosingTag <++ parseOpeningTag 1285 | parseTaggedText $ strings ++ [r] 1286 | ``` 1287 | 1288 | `parseTaggedText` returns the input string broken up into pieces. 1289 | Each piece is either the text enclosed by tags, a closing tag, or an opening tag. 1290 | After it splits off a piece, it adds it to the other pieces and calls itself again. 1291 | If the remaining input string is empty, it returns the list of strings found. 1292 | 1293 | ```haskell 1294 | > readP_to_S (string "ab" <++ string "abc") "abcd" 1295 | [("ab","cd")] 1296 | 1297 | > readP_to_S (string "ab" +++ string "abc") "abcd" 1298 | [("ab","cd"),("abc","d")] 1299 | 1300 | > readP_to_S (string "ab" <|> string "abc") "abcd" 1301 | [("ab","cd"),("abc","d")] 1302 | ``` 1303 | 1304 | The `<++` operator is left biased meaning that if the left side succeeds, it won't even bother with the right. 1305 | Recall that when we run the parser, we get a list of all the possible parsings. 1306 | All of these possible parsings are the result of the parser having traveled through all of the possible paths. 1307 | By using `<++`, 1308 | we receive the possible parsings from the left path and from the right path if and only if the left side failed. 1309 | If you'd like all of the possible parsings through the left and right side, 1310 | you can use the `+++` operator provided by `ReadP`. 1311 | `+++` is just `<|>` which we saw up above. 1312 | 1313 | ```haskell 1314 | parseOpeningTag 1315 | :: ReadP String 1316 | parseOpeningTag 1317 | = do 1318 | _ <- char '<' 1319 | t <- munch1 (\ c -> c /= '/' && c /= '>') 1320 | _ <- char '>' 1321 | return $ "<" ++ t ++ ">" 1322 | ``` 1323 | 1324 | An opening tag is an opening angle bracket, some text that doesn't include a forward slash, and the next immediate closing angle bracket. 1325 | 1326 | ```haskell 1327 | parseClosingTag 1328 | :: ReadP String 1329 | parseClosingTag 1330 | = do 1331 | _ <- char '<' 1332 | _ <- char '/' 1333 | t <- munch1 (/= '>') 1334 | _ <- char '>' 1335 | return $ "" 1336 | ``` 1337 | 1338 | A closing tag is an opening angle bracket, a forward slash, some text, and the next immediate closing angle bracket. 1339 | 1340 | 1341 |

1342 | Parsing Tags 1343 |
1344 | 1345 |

1346 |
1347 | 1348 | ```haskell 1349 | getTagAttributes 1350 | :: String 1351 | -> [TagAttribute] 1352 | getTagAttributes 1353 | s 1354 | = 1355 | if isOpeningTag s 1356 | then 1357 | case readP_to_S (parseTagAttributes []) s of 1358 | [] -> [] 1359 | (x:_) -> fst x 1360 | else 1361 | [] 1362 | ``` 1363 | 1364 | Opening tags can have attributes. 1365 | For example, ``. 1366 | Each attribute is a two-tuple, key-value pair. 1367 | In the above example, `color` would be the key and `#101010` would be the value. 1368 | 1369 | ```haskell 1370 | getTagName 1371 | :: String 1372 | -> String 1373 | getTagName 1374 | s 1375 | = 1376 | case readP_to_S parseTagName s of 1377 | [] -> "" 1378 | (x:_) -> toLower' $ fst x 1379 | ``` 1380 | 1381 | This returns the tag name in lowercase. 1382 | 1383 | ```haskell 1384 | parseTagName 1385 | :: ReadP String 1386 | parseTagName 1387 | = do 1388 | _ <- char '<' 1389 | _ <- munch (== '/') 1390 | _ <- skipSpaces 1391 | n <- munch1 (\ c -> c /= ' ' && c /= '>') 1392 | _ <- munch (/= '>') 1393 | _ <- char '>' 1394 | return n 1395 | ``` 1396 | 1397 | The tag name is the first string of non-whitespace characters 1398 | after the opening angle bracket, 1399 | a possible forward slash, 1400 | and some possible whitespace 1401 | and before some more whitespace 1402 | and/or the closing angle bracket. 1403 | 1404 | ```haskell 1405 | parseTagAttributes 1406 | :: [TagAttribute] 1407 | -> ReadP [TagAttribute] 1408 | parseTagAttributes 1409 | tagAttributes 1410 | = do 1411 | s <- look 1412 | case s of 1413 | "" -> return tagAttributes 1414 | _ -> do 1415 | let h = head s 1416 | case h of 1417 | '>' -> return tagAttributes 1418 | '<' -> trimTagname >> parseTagAttributes' 1419 | _ -> parseTagAttributes' 1420 | where 1421 | parseTagAttributes' 1422 | :: ReadP [TagAttribute] 1423 | parseTagAttributes' 1424 | = do 1425 | tagAttribute <- parseTagAttribute 1426 | parseTagAttributes 1427 | ( add 1428 | (\ a b -> fst a /= fst b) 1429 | tagAttributes 1430 | tagAttribute 1431 | ) 1432 | ``` 1433 | 1434 | `parseTagAttributes` recursively goes through the input string, collecting up the key-value pairs. 1435 | At the start of the tag (`<`), it first trims the tag name before tackling the attributes. 1436 | It stops parsing for attributes when it reaches the closing angle bracket (`>`). 1437 | If a tag happens to have duplicate attributes (based on the key), 1438 | `add` will ensure only the latest one remains in the list. 1439 | 1440 | ```haskell 1441 | trimTagname 1442 | :: ReadP () 1443 | trimTagname 1444 | = 1445 | char '<' 1446 | >> skipSpaces 1447 | >> munch1 (\ c -> c /= ' ' && c /= '>') 1448 | >> return () 1449 | ``` 1450 | 1451 | This trims or discards the tag name. 1452 | 1453 | ```haskell 1454 | parseTagAttribute 1455 | :: ReadP TagAttribute 1456 | parseTagAttribute 1457 | = do 1458 | _ <- skipSpaces 1459 | k <- munch1 (/= '=') 1460 | _ <- string "=\"" 1461 | v <- munch1 (/= '\"') 1462 | _ <- char '\"' 1463 | _ <- skipSpaces 1464 | return (toLower' k, v) 1465 | ``` 1466 | 1467 | The attribute key is any string of non-whitespace characters before the equal sign. 1468 | The attribute value is any characters after the equal sign and double quote and before the next immediate double quote. 1469 | 1470 | ```haskell 1471 | isTag 1472 | :: String 1473 | -> Bool 1474 | isTag 1475 | s 1476 | = 1477 | isOpeningTag s || isClosingTag s 1478 | ``` 1479 | 1480 | A string is a tag if it is either an opening tag or a closing tag. 1481 | 1482 | ```haskell 1483 | isOpeningTag 1484 | :: String 1485 | -> Bool 1486 | isOpeningTag 1487 | s 1488 | = 1489 | isPresent $ readP_to_S parseOpeningTag s 1490 | ``` 1491 | 1492 | A string is an opening tag if the opening tag parser succeeds. 1493 | 1494 | ```haskell 1495 | isClosingTag 1496 | :: String 1497 | -> Bool 1498 | isClosingTag 1499 | s 1500 | = 1501 | isPresent $ readP_to_S parseClosingTag s 1502 | ``` 1503 | 1504 | A string is a closing tag if the closing tag parser succeeds. 1505 | 1506 | ### Running The SRT Parser 1507 | 1508 | 1509 |

1510 | Parsed SRT Results 1511 |
1512 | 1513 |

1514 |
1515 | 1516 | Now that we've assembled the parser, let's try it out. 1517 | 1518 | ```haskell 1519 | > let srt = 1520 | > " 1\n\ 1521 | > \0:0:0,1 --> 0:1:0.2 x1:1 X2:3 y1:4 y2:10\n\ 1522 | > \This is some \n \ 1523 | > \subtitle \n\ 1524 | > \text. " 1525 | > readP_to_S parseSrt srt 1526 | [([ SrtSubtitle 1527 | { index = 1 1528 | , start = Timestamp {hours = 0, minutes = 0, seconds = 0, milliseconds = 1} 1529 | , end = Timestamp {hours = 0, minutes = 1, seconds = 0, milliseconds = 2} 1530 | , coordinates = Just (SrtSubtitleCoordinates {x1 = 1, x2 = 3, y1 = 4, y2 = 10}) 1531 | , taggedText = [ TaggedText 1532 | { text = "This is some " 1533 | , tags = [ Tag {name = "font", attributes = [("color","blue")]} 1534 | ] 1535 | } 1536 | , TaggedText 1537 | { text = "\n subtitle \n" 1538 | , tags = [ Tag {name = "font", attributes = [("color","blue")]} 1539 | , Tag {name = "b", attributes = []} 1540 | , Tag {name = "u", attributes = []} 1541 | , Tag {name = "i", attributes = []} 1542 | ] 1543 | } 1544 | , TaggedText 1545 | { text = "text." 1546 | , tags = [ Tag {name = "font", attributes = [("color","blue")]} 1547 | , Tag {name = "b", attributes = []} 1548 | , Tag {name = "i", attributes = []} 1549 | ] 1550 | } 1551 | , TaggedText 1552 | { text = " " 1553 | , tags = [ Tag {name = "font", attributes = [("color","blue")]} 1554 | , Tag {name = "i", attributes = []} 1555 | ] 1556 | } 1557 | ] 1558 | } 1559 | ] 1560 | , "" 1561 | )] 1562 | ``` 1563 | 1564 | Here you see the result of parsing a test string. 1565 | Notice the errors in the test string like the use of a period instead of a comma or the duplicate tag attribute. 1566 | 1567 | ## Exercises 1568 | 1569 | - Write a program that can convert an SRT file to a JSON file. 1570 | - Rewrite the version number parser using Parsec instead of ReadP. 1571 | - Rewrite the SRT parser using Parsec instead of ReadP. 1572 | 1573 | ## Copyright 1574 | 1575 | (C) 2019 David Lettier 1576 |
1577 | [lettier.com](https://www.lettier.com/) 1578 | -------------------------------------------------------------------------------- /Setup.hs: -------------------------------------------------------------------------------- 1 | import Distribution.Simple 2 | main = defaultMain 3 | -------------------------------------------------------------------------------- /docs/_build-docs.sh: -------------------------------------------------------------------------------- 1 | #!/usr/bin/env bash 2 | 3 | SCRIPT_PATH="$(cd "$(dirname "$0")"; pwd -P)" 4 | MAIN_TITLE="Parsing With Haskell Parser Combinators" 5 | DESCRIPTION="Need to parse something? Never heard of a parser combinator? \ 6 | Looking to learn some Haskell? \ 7 | Awesome! This is everything you'll need to get up and parsing with Haskell parser combinators. \ 8 | From here you can try tackling esoteric data serialization formats, compiler front ends, domain specific languages—you name it!" 9 | REPO_URL="https://github.com/lettier/parsing-with-haskell-parser-combinators" 10 | AUTHOR="David Lettier" 11 | CSS="style.css" 12 | 13 | $PANDOC \ 14 | -f gfm \ 15 | -t html5 \ 16 | --highlight-style=breezedark \ 17 | --template=$SCRIPT_PATH/_template.html5 \ 18 | $SCRIPT_PATH/../README.md \ 19 | --metadata pagetitle="$MAIN_TITLE" \ 20 | --metadata author-meta="$AUTHOR" \ 21 | --metadata description="$DESCRIPTION" \ 22 | --metadata css=$CSS \ 23 | -o "$SCRIPT_PATH/index.html" 24 | -------------------------------------------------------------------------------- /docs/_template.html5: -------------------------------------------------------------------------------- 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 | 12 | 13 | 14 | $for(author-meta)$ 15 | 16 | $endfor$ 17 | $if(date-meta)$ 18 | 19 | $endif$ 20 | $if(keywords)$ 21 | 22 | $endif$ 23 | $if(title-prefix)$$title-prefix$ – $endif$$pagetitle$ 24 | 33 | $if(highlighting-css)$ 34 | 37 | $endif$ 38 | $if(math)$ 39 | $math$ 40 | $endif$ 41 | 44 | $for(header-includes)$ 45 | $header-includes$ 46 | $endfor$ 47 | $for(css)$ 48 | 49 | $endfor$ 50 | 51 | 52 | $for(include-before)$ 53 | $include-before$ 54 | $endfor$ 55 | $if(title)$ 56 |
57 |

$title$

58 | $if(subtitle)$ 59 |

$subtitle$

60 | $endif$ 61 | $for(author)$ 62 |

$author$

63 | $endfor$ 64 | $if(date)$ 65 |

$date$

66 | $endif$ 67 |
68 | $endif$ 69 | $if(toc)$ 70 | 73 | $endif$ 74 | $body$ 75 | $for(include-after)$ 76 | $include-after$ 77 | $endfor$ 78 | 79 | 80 | -------------------------------------------------------------------------------- /docs/index.html: -------------------------------------------------------------------------------- 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 | 12 | 13 | 14 | 15 | Parsing With Haskell Parser Combinators 16 | 22 | 90 | 93 | 94 | 95 | 96 | 97 |

98 | Parsing With Haskell Parser Combinators 99 |
100 | 101 |

102 |
103 | 104 |

Parsing With Haskell Parser Combinators

105 |

Need to parse something? Never heard of a "parser combinator"? Looking to learn some Haskell? Awesome! Below is everything you'll need to get up and parsing with Haskell parser combinators. From here you can try tackling esoteric data serialization formats, compiler front ends, domain specific languages—you name it!

106 | 114 |

Building The Demos

115 |

Included with this guide are two demo programs.

116 |

version-number-parser parses a file for a version number. srt-file-parser parses a file for SRT subtitles. Feel free to try them out with the files found in test-input/.

117 |

Stack

118 |

Download the Haskell tool Stack and then run the following.

119 |
git clone https://github.com/lettier/parsing-with-haskell-parser-combinators
 120 | cd parsing-with-haskell-parser-combinators
 121 | stack build
122 |

Cabal

123 |

If using Cabal, you can run the following.

124 |
git clone https://github.com/lettier/parsing-with-haskell-parser-combinators
 125 | cd parsing-with-haskell-parser-combinators
 126 | cabal sandbox init
 127 | cabal --require-sandbox build
 128 | cabal --require-sandbox install
129 |

Running The Demos

130 |

After building the two demo programs, you can run them like so.

131 |

Stack

132 |

To try the version number parser, run the following.

133 |
cd parsing-with-haskell-parser-combinators
 134 | stack exec -- version-number-parser
 135 | What is the version output file path?
 136 | test-input/gifcurry-version-output.txt
137 |

To try the SRT file parser, run the following.

138 |
cd parsing-with-haskell-parser-combinators
 139 | stack exec -- srt-file-parser
 140 | What is the SRT file path?
 141 | test-input/subtitles.srt
142 |

Cabal

143 |

To try the version number parser, run the following.

144 |
cd parsing-with-haskell-parser-combinators
 145 | .cabal-sandbox/bin/version-number-parser
 146 | What is the version output file path?
 147 | test-input/gifcurry-version-output.txt
148 |

To try the SRT file parser, run the following.

149 |
cd parsing-with-haskell-parser-combinators
 150 | .cabal-sandbox/bin/srt-file-parser
 151 | What is the SRT file path?
 152 | test-input/subtitles.srt
153 |

Parser Combinator

154 | 155 |

156 | Parser Combinators 157 |
158 | 159 |

160 |
161 | 162 |

One of the better ways to learn about the parsing strategy, parser combinator, is to look at an implementation of one.

163 |
164 |

165 | Parsers built using combinators are straightforward to construct, readable, modular, well-structured, and easily maintainable. 166 |

167 | 168 | —Parser combinator - Wikipedia 169 | 170 |

171 |
172 | 173 |

ReadP

174 |

Let's take a look under the hood of ReadP, a parser combinator library found in base. Since it is in base, you should already have it.

175 |

💡 Note, you may want to try out Parsec after getting familiar with ReadP. It too is a parser combinator library that others prefer to ReadP. As an added bonus, it is included in GHC's boot libraries as of GHC version 8.4.1.

176 |

P Data Type

177 |
-- (c) The University of Glasgow 2002
 178 | 
 179 | data P a
 180 |   = Get (Char -> P a)
 181 |   | Look (String -> P a)
 182 |   | Fail
 183 |   | Result a (P a)
 184 |   | Final [(a,String)]
 185 |   deriving Functor
186 |

We'll start with the P data type. The a in P a is up to you (the library user) and can be whatever you'd like. The compiler creates a functor instance automatically and there are hand-written instances for applicative, monad, MonadFail, and alternative.

187 |

💡 Note, for more on functors, applicatives, and monads, checkout Your easy guide to Monads, Applicatives, & Functors.

188 |

P is a sum type with five cases.

189 |
    190 |
  • Get consumes a single character from the input string and returns a new P.
  • 191 |
  • Look accepts a duplicate of the input string and returns a new P.
  • 192 |
  • Fail indicates the parser finished without a result.
  • 193 |
  • Result holds a possible parsing and another P case.
  • 194 |
  • Final is a list of two-tuples. The first tuple element is a possible parsing of the input and the second tuple element is the rest of the input string that wasn't consumed by Get.
  • 195 |
196 |

Run

197 |
-- (c) The University of Glasgow 2002
 198 | 
 199 | run :: P a -> ReadS a
 200 | run (Get f)      (c:s) = run (f c) s
 201 | run (Look f)     s     = run (f s) s
 202 | run (Result x p) s     = (x,s) : run p s
 203 | run (Final r)    _     = r
 204 | run _            _     = []
205 |

run is the heart of the ReadP parser. It does all of the heavy lifting as it recursively runs through all of the parser states that we saw up above. You can see that it takes a P and returns a ReadS.

206 |
-- (c) The University of Glasgow 2002
 207 | 
 208 | type ReadS a = String -> [(a,String)]
209 |

ReadS a is a type alias for String -> [(a,String)]. So whenever you see ReadS a, think String -> [(a,String)].

210 |
-- (c) The University of Glasgow 2002
 211 | 
 212 | run :: P a -> String -> [(a,String)]
 213 | run (Get f)      (c:s) = run (f c) s
 214 | run (Look f)     s     = run (f s) s
 215 | run (Result x p) s     = (x,s) : run p s
 216 | run (Final r)    _     = r
 217 | run _            _     = []
218 |

run pattern matches the different cases of P.

219 |
    220 |
  • If it's Get, it calls itself with a new P (returned by passing the function f, in Get f, the next character c in the input string) and the rest of the input string s.
  • 221 |
  • If it's Look, it calls itself with a new P (returned by passing the function f, in Look f, the input string s) and the input string. Notice how Look doesn't consume any characters from the input string like Get does.
  • 222 |
  • If it's Result, it assembles a two-tuple—containing the parsed result and what's left of the input string—and prepends this to the result of a recursive call that runs with another P case and the input string.
  • 223 |
  • If it's Final, run returns a list of two-tuples containing parsed results and input string leftovers.
  • 224 |
  • For anything else, run returns an empty list. For example, if the case is Fail, run will return an empty list.
  • 225 |
226 |
> run (Get (\ a -> Get (\ b -> Result [a,b] Fail))) "12345"
 227 | [("12","345")]
228 |

ReadP doesn't expose run but if it did, you could call it like this. The two Gets consume the '1' and '2', leaving the "345" behind.

229 |
> run (Get (\ a -> Get (\ b -> Result [a,b] Fail))) "12345"
 230 | > run (Get (\ b -> Result ['1',b] Fail)) "2345"
 231 | > run (Result ['1','2'] Fail) "345"
 232 | > (['1', '2'], "345") : run (Fail) "345"
 233 | > (['1', '2'], "345") : []
 234 | [("12","345")]
235 |

Running through each recursive call, you can see how we arrived at the final result.

236 |
> run (Get (\ a -> Get (\ b -> Result [a,b] (Final [(['a','b'],"c")])))) "12345"
 237 | [("12","345"),("ab","c")]
238 |

Using Final, you can include a parsed result in the final list of two-tuples.

239 |

readP_to_S

240 |
-- (c) The University of Glasgow 2002
 241 | 
 242 |    readP_to_S :: ReadP a -> ReadS a
 243 | -- readP_to_S :: ReadP a -> String -> [(a,String)]
 244 |    readP_to_S (R f) = run (f return)
245 |

While ReadP doesn't expose run directly, it does expose it via readP_to_S. readP_to_S introduces a newtype called ReadP. readP_to_S accepts a ReadP a, a string, and returns a list of two-tuples.

246 |

ReadP Newtype

247 | 248 |

249 | ReadP Newtype 250 |
251 | 252 |

253 |
254 | 255 |
-- (c) The University of Glasgow 2002
 256 | 
 257 | newtype ReadP a = R (forall b . (a -> P b) -> P b)
258 |

Here's the definition of ReadP a. There are instances for functor, applicative, monad, MonadFail, alternative, and MonadPlus. The R constructor takes a function that takes another function and returns a P. The accepted function takes whatever you chose for a and returns a P.

259 |
-- (c) The University of Glasgow 2002
 260 | 
 261 | readP_to_S (R f) = run (f return)
262 |

Recall that P is a monad and return's type is a -> m a. So f is the (a -> P b) -> Pb function and return is the (a -> P b) function. Ultimately, run gets the P b it expects.

263 |
-- (c) The University of Glasgow 2002
 264 | 
 265 | readP_to_S (R f) inputString = run (f return) inputString
 266 | --               ^^^^^^^^^^^                  ^^^^^^^^^^^
267 |

It's left off in the source code but remember that readP_to_S and run expects an input string.

268 |
-- (c) The University of Glasgow 2002
 269 | 
 270 | instance Functor ReadP where
 271 |   fmap h (R f) = R (\k -> f (k . h))
272 |

Here's the functor instance definition for ReadP.

273 |
> readP_to_S (fmap toLower get) "ABC"
 274 | [('a',"BC")]
 275 | 
 276 | > readP_to_S (toLower <$> get) "ABC"
 277 | [('a',"BC")]
278 |

This allows us to do something like this. fmap functor maps toLower over the functor get which equals R Get. Recall that the type of Get is (Char -> P a) -> P a which the ReadP constructor (R) accepts.

279 |
-- (c) The University of Glasgow 2002
 280 | 
 281 | fmap h       (R f  ) = R (\ k -> f   (k . h      ))
 282 | fmap toLower (R Get) = R (\ k -> Get (k . toLower))
283 |

Here you see the functor definition rewritten for the fmap toLower get example.

284 |

Applicative P Instance

285 |

Looking up above, how did readP_to_S return [('a',"BC")] when we only used Get which doesn't terminate run? The answer lies in the applicative definition for P.

286 |
-- (c) The University of Glasgow 2002
 287 | 
 288 | instance Applicative P where
 289 |   pure x = Result x Fail
 290 |   (<*>) = ap
291 |

return equals pure so we could rewrite readP_to_S (R f) = run (f return) to be readP_to_S (R f) = run (f pure). By using return or rather pure, readP_to_S sets Result x Fail as the final case run will encounter. If reached, run will terminate and we'll get our list of parsings.

292 |
> readP_to_S (fmap toLower get) "ABC"
 293 | 
 294 | -- Use the functor instance to transform fmap toLower get.
 295 | > readP_to_S (R (\ k -> Get (k . toLower))) "ABC"
 296 | 
 297 | -- Call run which removes R.
 298 | > run ((\ k -> Get (k . toLower)) pure) "ABC"
 299 | 
 300 | -- Call function with pure to get rid of k.
 301 | > run (Get (pure . toLower)) "ABC"
 302 | 
 303 | -- Call run for Get case to get rid of Get.
 304 | > run ((pure . toLower) 'A') "BC"
 305 | 
 306 | -- Call toLower with 'A' to get rid of toLower.
 307 | > run (pure 'a') "BC"
 308 | 
 309 | -- Use the applicative instance to transform pure 'a'.
 310 | > run (Result 'a' Fail) "BC"
 311 | 
 312 | -- Call run for the Result case to get rid of Result.
 313 | > ('a', "BC") : run (Fail) "BC"
 314 | 
 315 | -- Call run for the Fail case to get rid of Fail.
 316 | > ('a', "BC") : []
 317 | 
 318 | -- Prepend.
 319 | [('a',"BC")]
320 |

Here you see the flow from readP_to_S to the parsed result.

321 |

Alternative P Instance

322 |
-- (c) The University of Glasgow 2002
 323 | 
 324 | instance Alternative P where
 325 |   -- ...
 326 | 
 327 |   -- most common case: two gets are combined
 328 |   Get f1     <|> Get f2     = Get (\c -> f1 c <|> f2 c)
 329 | 
 330 |   -- results are delivered as soon as possible
 331 |   Result x p <|> q          = Result x (p <|> q)
 332 |   p          <|> Result x q = Result x (p <|> q)
 333 | 
 334 |   -- ...
335 |

The Alternative instance for P allows us to split the flow of the parser into a left and right path. This comes in handy when the input can go none, one, or (more rarely) two of two ways.

336 |
> readP_to_S ((get >>= \ a -> return a) <|> (get >> get >>= \ b -> return b)) "ABC"
 337 | [('A',"BC"),('B',"C")]
338 |

The <|> operator or function introduces a fork in the parser's flow. The parser will travel through both the left and right paths. The end result will contain all of the possible parsings that went left and all of the possible parsings that went right. If both paths fail, then the whole parser fails.

339 |

💡 Note, in other parser combinator implementations, when using the <|> operator, the parser will go left or right but not both. If the left succeeds, the right is ignored. The right is only processed if the left side fails.

340 |
> readP_to_S ((get >>= \ a -> return [a]) <|> look <|> (get >> get >>= \a -> return [a])) "ABC"
 341 | [("ABC","ABC"),("A","BC"),("B","C")]
342 |

You can chain the <|> operator for however many options or alternatives there are. The parser will return a possible parsing involving each.

343 |

ReadP Failure

344 |
-- (c) The University of Glasgow 2002
 345 | 
 346 | instance Monad ReadP where
 347 |   fail _    = R (\_ -> Fail)
 348 |   R m >>= f = R (\k -> m (\a -> let R m' = f a in m' k))
349 |

Here is the ReadP monad instance. Notice the definition for fail.

350 |
> readP_to_S ((\ a b c -> [a,b,c]) <$> get <*> get <*> get) "ABC"
 351 | [("ABC","")]
 352 | 
 353 | > readP_to_S ((\ a b c -> [a,b,c]) <$> get <*> fail "" <*> get) "ABC"
 354 | []
 355 | 
 356 | > readP_to_S (get >>= \ a -> get >>= \ b -> get >>= \ c -> return [a,b,c]) "ABC"
 357 | [("ABC","")]
 358 | 
 359 | > readP_to_S (get >>= \ a -> get >>= \ b -> fail "" >>= \ c -> return [a,b,c]) "ABC"
 360 | []
361 |

You can cause an entire parser path to abort by calling fail. Since ReadP doesn't provide a direct way to generate a Result or Final case, the return value will be an empty list. If the failed path is the only path, then the entire result will be an empty list. Recall that when run matches Fail, it returns an empty list.

362 |
-- (c) The University of Glasgow 2002
 363 | 
 364 | instance Alternative P where
 365 |   -- ...
 366 | 
 367 |   -- fail disappears
 368 |   Fail       <|> p          = p
 369 |   p          <|> Fail       = p
 370 | 
 371 |   -- ...
372 |

Going back to the alternative P instance, you can see how a failure on either side (but not both) will not fail the whole parser.

373 |
> readP_to_S (get >>= \ a -> get >>= \ b -> pfail >>= \ c -> return [a,b,c]) "ABC"
 374 | []
375 |

Instead of using fail, ReadP provides pfail which allows you to generate a Fail case directly.

376 |

Version Number

377 | 378 |

379 | Version Number 380 |
381 | 382 |

383 |
384 | 385 |

Gifcurry, the Haskell-built video editor for GIF makers, shells out to various different programs. To ensure compatibility, it needs the version number for each of the programs it shells out to. One of those programs is ImageMagick.

386 |
Version: ImageMagick 6.9.10-14 Q16 x86_64 2018-10-24 https://imagemagick.org
 387 | Copyright: © 1999-2018 ImageMagick Studio LLC
 388 | License: https://imagemagick.org/script/license.php
 389 | Features: Cipher DPC HDRI Modules OpenCL OpenMP
390 |

Here you see the output of convert --version. How could you parse this to capture the 6, 9, 10, and 14?

391 |

Looking at the output, we know the version number is a collection of numbers separated by either a period or a dash. This definition covers the dates as well so we'll make sure that the first two numbers are separated by a period. That way, if they put a date before the version number, we won't get the wrong result.

392 | 393 |

394 | Version Number Parser 395 |
396 | 397 |

398 |
399 | 400 |
1. Consume zero or more characters that are not 0 through 9 and go to 2.
 401 | 2. Consume zero or more characters that are 0 through 9, save this number, and go to 3.
 402 | 3. Look at the rest of the input and go to 4.
 403 | 4. If the input
 404 |     - is empty, go to 6.
 405 |     - starts with a period, go to 1.
 406 |     - starts with a dash
 407 |         - and you have exactly one number, go to 5.
 408 |         - and you have more than one number, go to 1.
 409 |     - doesn't start with a period or dash
 410 |         - and you have exactly one number, go to 5.
 411 |         - you have more than one number, go to 6.
 412 | 5. Delete any saved numbers and go to 1.
 413 | 6. Return the numbers found.
414 |

Before we dive into the code, here's the algorithm we'll be following.

415 |

Building The Version Number Parser

416 |
parseVersionNumber
 417 |   ::  [String]
 418 |   ->  ReadP [String]
 419 | parseVersionNumber
 420 |   nums
 421 |   = do
 422 |   _         <- parseNotNumber
 423 |   num       <- parseNumber
 424 |   let nums' = nums ++ [num]
 425 |   parseSeparator nums' parseVersionNumber
426 |

parseVersionNumber is the main parser combinator that parses an input string for a version number. It accepts a list of strings and returns a list of strings in the context of the ReadP data type. The accepted list of strings is not the input that gets parsed but rather the list of numbers found so far. For the first function call, the list is empty since it hasn't parsed anything yet.

427 |
parseVersionNumber
 428 |   nums
429 |

Starting from the top, parseVersionNumber takes a list of strings which are the current list of numbers found so far.

430 |
  _         <- parseNotNumber
431 |

parseNotNumber consumes everything that isn't a number from the input string. Since we are not interested in the result, we discard it (_ <-).

432 |
  num       <- parseNumber
 433 |   let nums' = nums ++ [num]
434 |

Next we consume everything that is a number and then add that to the list of numbers found so far.

435 |
  parseSeparator nums' parseVersionNumber
436 |

After parseVersionNumber has processed the next number, it passes the list of numbers found and itself to parseSeparator.

437 |

Parsing The Separator

438 |
parseSeparator
 439 |   ::  [String]
 440 |   ->  ([String] -> ReadP [String])
 441 |   ->  ReadP [String]
 442 | parseSeparator
 443 |   nums
 444 |   f
 445 |   = do
 446 |   next <- look
 447 |   case next of
 448 |     ""    -> return nums
 449 |     (c:_) ->
 450 |       case c of
 451 |         '.' -> f nums
 452 |         '-' -> if length nums == 1 then f [] else f nums
 453 |         _   -> if length nums == 1 then f [] else return nums
454 |

Here you see parseSeparator.

455 |
  next <- look
 456 |   case next of
 457 |     ""    -> return nums
 458 |     (c:_) ->
459 |

look allows us to get what's left of the input string without consuming it. If there's nothing left, it returns the numbers found. However, if there is something left, it analyzes the first character.

460 |
      case c of
 461 |         '.' -> f nums
 462 |         '-' -> if length nums == 1 then f [] else f nums
 463 |         _   -> if length nums == 1 then f [] else return nums
464 |

If the next character is a period, call parseVersionNumber again with the current list of numbers found. If it's a dash and we have exactly one number, call parseVersionNumber with an empty list of numbers since it's a date. If it's a dash and we don't have exactly one number, call parseVersionNumber with the list of numbers found so far. Otherwise, call parseVersionNumber with an empty list if we have exactly one number or return the numbers found if we don't have exactly one number.

465 |

Parsing Non-numbers

466 |
parseNotNumber
 467 |   ::  ReadP String
 468 | parseNotNumber
 469 |   =
 470 |   munch (not . isNumber)
471 |

parseNotNumber uses munch which ReadP provides. munch is given the predicate (not . isNumber) which returns true for any character that isn't 0 through 9.

472 |
munch :: (Char -> Bool) -> ReadP String
473 |

munch continuously calls get if the next character in the input string satisfies the predicate. If it doesn't, munch returns the characters that did, if any. Since it only uses get, munch always succeeds.

474 |

💡 Note, parseNumber is similar to parseNotNumber. Instead of not . isNumber, the predicate is just isNumber.

475 |

Munch Versus Many

476 |
parseNotNumber'
 477 |   ::  ReadP String
 478 | parseNotNumber'
 479 |   =
 480 |   many (satisfy (not . isNumber))
481 |

Instead of using munch, you could write parseNotNumber like this, using many and satisfy—both of which ReadP provides. Looking at the type signature for many, it accepts a single parser combinator (ReadP a). In this instance, it's being given the parser combinator satisfy.

482 |
> readP_to_S (satisfy (not . isNumber)) "a"
 483 | [('a',"")]
 484 | 
 485 | > readP_to_S (satisfy (not . isNumber)) "1"
 486 | []
487 |

satisfy takes a predicate and uses get to consume the next character. If the accepted predicate returns true, satisfy returns the character. Otherwise, satisfy calls pfail and fails.

488 |
> readP_to_S (munch (not . isNumber)) "abc123"
 489 | [("abc","123")]
 490 | 
 491 | > readP_to_S (many (satisfy (not . isNumber))) "abc123"
 492 | [("","abc123"),("a","bc123"),("ab","c123"),("abc","123")]
493 |

Using many can give you unwanted results. Ultimately, many introduces one or more Result cases. Because of this, many always succeeds.

494 |
> readP_to_S (many look) "abc123"
 495 | -- Runs forever.
496 |

many will run your parser until it fails or runs out of input. If your parser never fails or never runs out of input, many will never return.

497 |
> readP_to_S (many (get >>= \ a -> return (read (a : "") :: Int))) "12345"
 498 | [([],"12345"),([1],"2345"),([1,2],"345"),([1,2,3],"45"),([1,2,3,4],"5"),([1,2,3,4,5],"")]
499 |

For every index in the result, the parsed result will be the outcome of having ran the parser index times on the entire input.

500 |
> let parser        = get >>= \ a -> return (read (a : "") :: Int)
 501 | > let many' results = return results <|> (parser >>= \ result -> many' (results ++ [result]))
 502 | > readP_to_S (many' []) "12345"
 503 | [([],"12345"),([1],"2345"),([1,2],"345"),([1,2,3],"45"),([1,2,3,4],"5"),([1,2,3,4,5],"")]
504 |

Here's an alternate definition for many. On the left side of <|>, it returns the current parser results. On the right side of <|>, it runs the parser, adds that result to the current parser results, and calls itself with the updated results. This has a cumulative sum type effect where index i is the parser result appended to the parser result at i - 1, i - 2, ..., and 1.

505 |

Running The Version Number Parser

506 |

Now that we built the parser, let's run it.

507 |
> let inputString =
 508 | >     "Some Program (C) 1234-56-78 All rights reserved.\n\
 509 | >     \Version: 12.345.6-7\n\
 510 | >     \License: Some open source license."
 511 | > readP_to_S (parseVersionNumber []) inputString
 512 | [(["12","345","6","7"],"\nLicense: Some open source license.")]
513 |

You can see it extracted the version number correctly even with the date coming before it.

514 |

SRT

515 | 516 |

517 | SRT 518 |
519 | 520 |

521 |
522 | 523 |

Now let's parse something more complicated—SRT files.

524 |

For the release of Gifcurry six, I needed to parse SRT (SubRip Text) files. SRT files contain subtitles that video processing programs use to display text on top of a video. Typically this text is the dialog of a movie translated into various different languages. By keeping the text separate from the video, there only needs to be one video which saves time, storage space, and bandwidth. The video software can swap out the text without having to swap out the video. Contrast this with burning-in or hard-coding the subtitles where the text becomes a part of the image data that makes up the video. In this case, you would need a video for each collection of subtitles.

525 | 526 |

527 | Gifcurry 528 |
529 | Inner Video © Blender Foundation | www.sintel.org 530 |

531 |
532 | 533 |

Gifcurry can take a SRT file and burn-in the subtitles for the video slice your select.

534 |
7
 535 | 00:02:09,400 --> 00:02:13,800
 536 | What brings you to
 537 | the land of the gatekeepers?
 538 | 
 539 | 8
 540 | 00:02:15,000 --> 00:02:17,500
 541 | I'm searching for someone.
 542 | 
 543 | 9
 544 | 00:02:18,000 --> 00:02:22,200
 545 | Someone very dear?
 546 | A kindred spirit?
547 |

Here you see the English subtitles for Sintel (© Blender Foundation | www.sintel.org).

548 |

SRT Format

549 |
550 |

551 | SRT is perhaps the most basic of all subtitle formats. 552 |

553 | 554 | —SRT Subtitle | Matrosk 555 | 556 |

557 |
558 | 559 |

The SRT file format consists of blocks, one for each subtitle, separated by an empty line.

560 |
2
561 |

At the top of the block is the index. This determines the order of the subtitles. Hopefully the subtitles are already in order and all of them have unique indexes but this may not be the case.

562 |
01:04:13,000 --> 02:01:01,640 X1:167 X2:267 Y1:33 Y2:63
563 |

After the index is the start time, end time, and an optional set of points specifying the rectangle the subtitle text should go in.

564 |
01:04:13,000
565 |

The timestamp format is hours:minutes:seconds,milliseconds.

566 |

💡 Note the comma instead of the period separating the seconds from the milliseconds.

567 |
This is the actual subtitle
 568 | text. It can span multiple lines.
 569 | It may include formating
 570 | like <b>bold</b>, <i>italic</i>,
 571 | <u>underline</u>,
 572 | and <font color="#010101">font color</font>.
573 |

The third and last part of a block is the subtitle text. It can span multiple lines and ends when there is an empty line. The text can include formatting tags reminiscent of HTML.

574 |

Building The SRT Parser

575 | 576 |

577 | Parsing SRT 578 |
579 | 580 |

581 |
582 | 583 |
parseSrt
 584 |   ::  ReadP [SrtSubtitle]
 585 | parseSrt
 586 |   =
 587 |   manyTill parseBlock (skipSpaces >> eof)
588 |

parseSrt is the main parser combinator that handles everything. It parses each block until it reaches the end of the file (eof) or input. To be on the safe side, there could be trailing whitespace between the last block and the end of the file. To handle this, it parses zero or more characters of whitespace (skipSpaces) before parsing the end of the file (skipSpaces >> eof). If there is still input left by the time eof is reached, eof will fail and this will return nothing. Therefore, it's important that parseBlock doesn't leave any thing but whitespace behind.

589 |

Building The SRT Block Parser

590 |
parseBlock
 591 |   ::  ReadP SrtSubtitle
 592 | parseBlock
 593 |   = do
 594 |   i      <- parseIndex
 595 |   (s, e) <- parseTimestamps
 596 |   c      <- parseCoordinates
 597 |   t      <- parseTextLines
 598 |   return
 599 |     SrtSubtitle
 600 |       { index       = i
 601 |       , start       = s
 602 |       , end         = e
 603 |       , coordinates = c
 604 |       , taggedText  = t
 605 |       }
606 |

As we went over earlier, a block consists of an index, timestamps, possibly some coordinates, and some lines of text. In this version of parseBlock, you see the more imperative do notation style with the record syntax.

607 |
parseBlock'
 608 |   ::  ReadP SrtSubtitle
 609 | parseBlock'
 610 |   =
 611 |       SrtSubtitle
 612 |   <$> parseIndex
 613 |   <*> parseStartTimestamp
 614 |   <*> parseEndTimestamp
 615 |   <*> parseCoordinates
 616 |   <*> parseTextLines
617 |

Here's another way you could write parseBlock. This is the applicative style. Just be sure to get the order right. For example, I could've accidentally mixed up the start and end timestamps.

618 |

Building The SRT Index Parser

619 | 620 |

621 | Parsing The Index 622 |
623 | 624 |

625 |
626 | 627 |
parseIndex
 628 |   ::  ReadP Int
 629 | parseIndex
 630 |   =
 631 |       skipSpaces
 632 |   >>  readInt <$> parseNumber
633 |

At the top of the block is the index. Here you see skipSpaces again. After skipping over whitespace, it parses the input for numbers and converts it to an actual integer.

634 |
readInt
 635 |   ::  String
 636 |   ->  Int
 637 | readInt
 638 |   =
 639 |   read
640 |

readInt looks like this.

641 |
> read "123" :: Int
 642 | 123
 643 | > read "1abc" :: Int
 644 | *** Exception: Prelude.read: no parse
645 |

Normally using read directly can be dangerous. read may not be able to convert the input to the specified type. However, parseNumber will only return the 10 numerical digit characters (['0'..'9']) so using read directly becomes safe.

646 |

Building The SRT Timestamps Parser

647 | 648 |

649 | Parsing The Timestamps 650 |
651 | 652 |

653 |
654 | 655 |

Parsing the timestamps are a little more involved than parsing the index.

656 |
parseTimestamps
 657 |   ::  ReadP (Timestamp, Timestamp)
 658 | parseTimestamps
 659 |   = do
 660 |   _   <- char '\n'
 661 |   s   <- parseTimestamp
 662 |   _   <- skipSpaces
 663 |   _   <- string "-->"
 664 |   _   <- skipSpaces
 665 |   e   <- parseTimestamp
 666 |   return (s, e)
667 |

This is the main combinator for parsing the timestamps.

668 |

char parses the character you give it or it fails. If it fails then parseTimestamps fails, ultimately causing parseSrt to fail so there must be a newline character after the index.

669 |

string is like char except instead of just one character, it parses the string of characters you give it or it fails.

670 |
parseStartTimestamp
 671 |   ::  ReadP Timestamp
 672 | parseStartTimestamp
 673 |   =
 674 |       char '\n'
 675 |   >>  parseTimestamp
676 |

parseTimestamps parses both timestamps, but for the applicative style (parseSrt'), we need a parser just for the start timestamp.

677 |
parseEndTimestamp
 678 |   ::  ReadP Timestamp
 679 | parseEndTimestamp
 680 |   =
 681 |       skipSpaces
 682 |   >>  string "-->"
 683 |   >>  skipSpaces
 684 |   >>  parseTimestamp
685 |

This parses everything between the timestamps and returns the end timestamp.

686 |
parseTimestamp
 687 |   ::  ReadP Timestamp
 688 | parseTimestamp
 689 |   = do
 690 |   h  <- parseNumber
 691 |   _  <- char ':'
 692 |   m  <- parseNumber
 693 |   _  <- char ':'
 694 |   s  <- parseNumber
 695 |   _  <- char ',' <|> char '.'
 696 |   m' <- parseNumber
 697 |   return
 698 |     Timestamp
 699 |       { hours        = readInt h
 700 |       , minutes      = readInt m
 701 |       , seconds      = readInt s
 702 |       , milliseconds = readInt m'
 703 |       }
704 |

This parses the four numbers that make up the timestamp. The first three numbers are separated by a colon and the last one is separated by a comma. To be more forgiving, however, we allow the possibility of there being a period instead of a comma.

705 |
> readP_to_S (char '.' <|> char ',') "..."
 706 | [('.',"..")]
 707 | 
 708 | > readP_to_S (char '.' <|> char ',') ",.."
 709 | [(',',"..")]
710 |

💡 Note, when using char with <|>, only one side can succeed (two char enter, one char leave) since char consumes a single character and two characters cannot occupy the same space.

711 |

Building The SRT Coordinates Parser

712 | 713 |

714 | Parsing The Coordinates 715 |
716 | 717 |

718 |
719 | 720 |

The coordinates are an optional part of the block but if included, will be on the same line as the timestamps.

721 |
parseCoordinates
 722 |   ::  ReadP (Maybe SrtSubtitleCoordinates)
 723 | parseCoordinates
 724 |   =
 725 |   option Nothing $ do
 726 |     _  <- skipSpaces1
 727 |     x1 <- parseCoordinate 'x' 1
 728 |     _  <- skipSpaces1
 729 |     x2 <- parseCoordinate 'x' 2
 730 |     _  <- skipSpaces1
 731 |     y1 <- parseCoordinate 'y' 1
 732 |     _  <- skipSpaces1
 733 |     y2 <- parseCoordinate 'y' 2
 734 |     return
 735 |       $ Just
 736 |         SrtSubtitleCoordinates
 737 |           { x1 = readInt x1
 738 |           , x2 = readInt x2
 739 |           , y1 = readInt y1
 740 |           , y2 = readInt y2
 741 |           }
742 |

option takes two arguments. The first argument is returned if the second argument, a parser, fails. So if the coordinates parser fails, parseCoordinates will return Nothing. Put another way, the coordinates parser failing does not cause the whole parser to fail. This block will just have Nothing for its coordinates "field".

743 |
parseCoordinate
 744 |   ::  Char
 745 |   ->  Int
 746 |   ->  ReadP String
 747 | parseCoordinate
 748 |   c
 749 |   n
 750 |   = do
 751 |   _  <- char (Data.Char.toUpper c) <|> char (Data.Char.toLower c)
 752 |   _  <- string $ show n ++ ":"
 753 |   parseNumber
754 |

This parser allows the coordinate labels to be in either uppercase or lowercase. For example, x1:1 X2:2 Y1:3 y2:4 would succeed.

755 |

Building The SRT Text Parser

756 | 757 |

758 | Parsing The Text 759 |
760 | 761 |

762 |
763 | 764 |

Parsing the text is the most involved portion due to the HTML-like tag formatting.

765 |

Tag parsing can be challenging—just ask anyone who parses them with a regular expression. To make this easier on us—and for the user—we'll use a tag soup kind of approach. The parser will allow unclosed and/or wrongly nested tags. It will also allow any tag and not just b, u, i, and font.

766 |
parseTextLines
 767 |   ::  ReadP [TaggedText]
 768 | parseTextLines
 769 |   =
 770 |       char '\n'
 771 |   >>  (getTaggedText <$> manyTill parseAny parseEndOfTextLines)
772 |

We start out by matching on a newline character. After that, we functor map or fmap (<$>) getTaggedText over the subtitle text characters until we reach the end of the text lines.

773 |
parseEndOfTextLines
 774 |   ::  ReadP ()
 775 | parseEndOfTextLines
 776 |   =
 777 |   void (string "\n\n") <|> eof
778 |

We stop collecting characters (parseAny) when we reach two newline characters or the end of the file. This signals the end of the block.

779 |
getTaggedText
 780 |   ::  String
 781 |   ->  [TaggedText]
 782 | getTaggedText
 783 |   s
 784 |   =
 785 |   fst
 786 |     $ foldl
 787 |       folder
 788 |       ([], [])
 789 |       parsed
 790 |   where
791 |

getTaggedText folds through the parsed text from left to right, returning the accumulated tagged text.

792 |
    parsed
 793 |       ::  [String]
 794 |     parsed
 795 |       =
 796 |       case readP_to_S (parseTaggedText []) s of
 797 |         []      -> [s]
 798 |         r@(_:_) -> (fst . last) r
799 |

parsed returns a list of one or more strings. It attempts to parse the input text for tags. If that fails, parsed returns the input string inside a list. Otherwise, if parseTaggedText succeeds, parse returns the last possible parsing ((fst . last) r).

800 |
    folder
 801 |       ::  ([TaggedText], [Tag])
 802 |       ->  String
 803 |       ->  ([TaggedText], [Tag])
 804 |     folder
 805 |       (tt, t)
 806 |       x
 807 |       | isTag x   = (tt, updateTags t x)
 808 |       | otherwise = (tt ++ [TaggedText { text = x, tags = t}], t)
809 |

As folder moves from left to right, over the parsed strings, it checks if the current string is a tag. If it is a tag, it updates the current set of active tags (t). Otherwise, it appends another tagged piece of text associated with the set of active tags.

810 |
updateTags
 811 |   ::  [Tag]
 812 |   ->  String
 813 |   ->  [Tag]
 814 | updateTags
 815 |   tags
 816 |   x
 817 |   | isClosingTag x = remove compare' tags (makeTag x)
 818 |   | isOpeningTag x = add    compare' tags (makeTag x)
 819 |   | otherwise      = tags
 820 |   where
 821 |     compare'
 822 |       ::  Tag
 823 |       ->  Tag
 824 |       ->  Bool
 825 |     compare'
 826 |       a
 827 |       b
 828 |       =
 829 |       name a /= name b
830 |

updateTags updates the tags given by either removing or adding the given tag (x) depending on if it is a closing or opening tag. If it is neither, it just returns the passed set of tags. add will overwrite an existing tag if tags already has a tag by the same name. You can see this in the compare' function given.

831 |

To keep the parser simple, if an opening tag T is found, T gets added to the list of tags or overwrites an exiting T if already present. If a corresponding closing /T is found, then T is removed from the list of tags, if present. It doesn't matter if there is two or more Ts in a row, one or more Ts without a closing /T, and/or there's a closing /T without an opening T.

832 |
makeTag
 833 |   ::  String
 834 |   ->  Tag
 835 | makeTag
 836 |   s
 837 |   =
 838 |   Tag
 839 |     { name       = getTagName       s
 840 |     , attributes = getTagAttributes s
 841 |     }
842 |

makeTag assembles a tag from the given string (s). Each Tag has a name and zero or more attributes.

843 |
parseTaggedText
 844 |   ::  [String]
 845 |   ->  ReadP [String]
 846 | parseTaggedText
 847 |   strings
 848 |   = do
 849 |   s <- look
 850 |   case s of
 851 |     "" -> return strings
 852 |     _  -> do
 853 |       r <- munch1 (/= '<') <++ parseClosingTag <++ parseOpeningTag
 854 |       parseTaggedText $ strings ++ [r]
855 |

parseTaggedText returns the input string broken up into pieces. Each piece is either the text enclosed by tags, a closing tag, or an opening tag. After it splits off a piece, it adds it to the other pieces and calls itself again. If the remaining input string is empty, it returns the list of strings found.

856 |
> readP_to_S (string "ab" <++ string "abc") "abcd"
 857 | [("ab","cd")]
 858 | 
 859 | > readP_to_S (string "ab" +++ string "abc") "abcd"
 860 | [("ab","cd"),("abc","d")]
 861 | 
 862 | > readP_to_S (string "ab" <|> string "abc") "abcd"
 863 | [("ab","cd"),("abc","d")]
864 |

The <++ operator is left biased meaning that if the left side succeeds, it won't even bother with the right. Recall that when we run the parser, we get a list of all the possible parsings. All of these possible parsings are the result of the parser having traveled through all of the possible paths. By using <++, we receive the possible parsings from the left path and from the right path if and only if the left side failed. If you'd like all of the possible parsings through the left and right side, you can use the +++ operator provided by ReadP. +++ is just <|> which we saw up above.

865 |
parseOpeningTag
 866 |   ::  ReadP String
 867 | parseOpeningTag
 868 |   = do
 869 |   _ <- char '<'
 870 |   t <- munch1 (\ c -> c /= '/' && c /= '>')
 871 |   _ <- char '>'
 872 |   return $ "<" ++ t ++ ">"
873 |

An opening tag is an opening angle bracket, some text that doesn't include a forward slash, and the next immediate closing angle bracket.

874 |
parseClosingTag
 875 |   ::  ReadP String
 876 | parseClosingTag
 877 |   = do
 878 |   _ <- char '<'
 879 |   _ <- char '/'
 880 |   t <- munch1 (/= '>')
 881 |   _ <- char '>'
 882 |   return $ "</" ++ t ++ ">"
883 |

A closing tag is an opening angle bracket, a forward slash, some text, and the next immediate closing angle bracket.

884 | 885 |

886 | Parsing Tags 887 |
888 | 889 |

890 |
891 | 892 |
getTagAttributes
 893 |   ::  String
 894 |   ->  [TagAttribute]
 895 | getTagAttributes
 896 |   s
 897 |   =
 898 |   if isOpeningTag s
 899 |     then
 900 |       case readP_to_S (parseTagAttributes []) s of
 901 |         []    -> []
 902 |         (x:_) -> fst x
 903 |     else
 904 |       []
905 |

Opening tags can have attributes. For example, <font color="#101010">. Each attribute is a two-tuple, key-value pair. In the above example, color would be the key and #101010 would be the value.

906 |
getTagName
 907 |   ::  String
 908 |   ->  String
 909 | getTagName
 910 |   s
 911 |   =
 912 |   case readP_to_S parseTagName s of
 913 |     []    -> ""
 914 |     (x:_) -> toLower' $ fst x
915 |

This returns the tag name in lowercase.

916 |
parseTagName
 917 |   ::  ReadP String
 918 | parseTagName
 919 |   = do
 920 |   _ <- char '<'
 921 |   _ <- munch (== '/')
 922 |   _ <- skipSpaces
 923 |   n <- munch1 (\ c -> c /= ' ' && c /= '>')
 924 |   _ <- munch  (/= '>')
 925 |   _ <- char '>'
 926 |   return n
927 |

The tag name is the first string of non-whitespace characters after the opening angle bracket, a possible forward slash, and some possible whitespace and before some more whitespace and/or the closing angle bracket.

928 |
parseTagAttributes
 929 |   ::  [TagAttribute]
 930 |   ->  ReadP [TagAttribute]
 931 | parseTagAttributes
 932 |   tagAttributes
 933 |   = do
 934 |   s <- look
 935 |   case s of
 936 |     "" -> return tagAttributes
 937 |     _  -> do
 938 |       let h = head s
 939 |       case h of
 940 |         '>' -> return tagAttributes
 941 |         '<' -> trimTagname >> parseTagAttributes'
 942 |         _   -> parseTagAttributes'
 943 |   where
 944 |     parseTagAttributes'
 945 |       ::  ReadP [TagAttribute]
 946 |     parseTagAttributes'
 947 |       = do
 948 |       tagAttribute <- parseTagAttribute
 949 |       parseTagAttributes
 950 |         ( add
 951 |             (\ a b -> fst a /= fst b)
 952 |             tagAttributes
 953 |             tagAttribute
 954 |         )
955 |

parseTagAttributes recursively goes through the input string, collecting up the key-value pairs. At the start of the tag (<), it first trims the tag name before tackling the attributes. It stops parsing for attributes when it reaches the closing angle bracket (>). If a tag happens to have duplicate attributes (based on the key), add will ensure only the latest one remains in the list.

956 |
trimTagname
 957 |   :: ReadP ()
 958 | trimTagname
 959 |   =
 960 |       char '<'
 961 |   >> skipSpaces
 962 |   >> munch1 (\ c -> c /= ' ' && c /= '>')
 963 |   >> return ()
964 |

This trims or discards the tag name.

965 |
parseTagAttribute
 966 |   ::  ReadP TagAttribute
 967 | parseTagAttribute
 968 |   = do
 969 |   _ <- skipSpaces
 970 |   k <- munch1 (/= '=')
 971 |   _ <- string "=\""
 972 |   v <- munch1 (/= '\"')
 973 |   _ <- char '\"'
 974 |   _ <- skipSpaces
 975 |   return (toLower' k, v)
976 |

The attribute key is any string of non-whitespace characters before the equal sign. The attribute value is any characters after the equal sign and double quote and before the next immediate double quote.

977 |
isTag
 978 |   ::  String
 979 |   ->  Bool
 980 | isTag
 981 |   s
 982 |   =
 983 |   isOpeningTag s || isClosingTag s
984 |

A string is a tag if it is either an opening tag or a closing tag.

985 |
isOpeningTag
 986 |   ::  String
 987 |   ->  Bool
 988 | isOpeningTag
 989 |   s
 990 |   =
 991 |   isPresent $ readP_to_S parseOpeningTag s
992 |

A string is an opening tag if the opening tag parser succeeds.

993 |
isClosingTag
 994 |   ::  String
 995 |   ->  Bool
 996 | isClosingTag
 997 |   s
 998 |   =
 999 |   isPresent $ readP_to_S parseClosingTag s
1000 |

A string is a closing tag if the closing tag parser succeeds.

1001 |

Running The SRT Parser

1002 | 1003 |

1004 | Parsed SRT Results 1005 |
1006 | 1007 |

1008 |
1009 | 1010 |

Now that we've assembled the parser, let's try it out.

1011 |
> let srt =
1012 | >       " 1\n\
1013 | >       \0:0:0,1 --> 0:1:0.2  x1:1 X2:3  y1:4 y2:10\n\
1014 | >       \<font color=\"red\" color=\"blue\">This is some <b><u><i>\n \
1015 | >       \subtitle \n\
1016 | >       \</u>text.</b>  "
1017 | > readP_to_S parseSrt srt
1018 | [([ SrtSubtitle
1019 |       { index = 1
1020 |       , start = Timestamp {hours = 0, minutes = 0, seconds = 0, milliseconds = 1}
1021 |       , end   = Timestamp {hours = 0, minutes = 1, seconds = 0, milliseconds = 2}
1022 |       , coordinates = Just (SrtSubtitleCoordinates {x1 = 1, x2 = 3, y1 = 4, y2 = 10})
1023 |       , taggedText =  [ TaggedText
1024 |                         { text = "This is some "
1025 |                         , tags = [ Tag {name = "font", attributes = [("color","blue")]}
1026 |                                  ]
1027 |                         }
1028 |                       , TaggedText
1029 |                           { text = "\n subtitle \n"
1030 |                           , tags = [ Tag {name = "font", attributes = [("color","blue")]}
1031 |                                    , Tag {name = "b",    attributes = []}
1032 |                                    , Tag {name = "u",    attributes = []}
1033 |                                    , Tag {name = "i",    attributes = []}
1034 |                                    ]
1035 |                           }
1036 |                       , TaggedText
1037 |                           { text = "text."
1038 |                           , tags = [ Tag {name = "font", attributes = [("color","blue")]}
1039 |                                    , Tag {name = "b",    attributes = []}
1040 |                                    , Tag {name = "i",    attributes = []}
1041 |                                    ]
1042 |                           }
1043 |                       , TaggedText
1044 |                           { text = "  "
1045 |                           , tags = [ Tag {name = "font", attributes = [("color","blue")]}
1046 |                                    , Tag {name = "i",    attributes = []}
1047 |                                    ]
1048 |                           }
1049 |                       ]
1050 |       }
1051 |   ]
1052 | , ""
1053 | )]
1054 |

Here you see the result of parsing a test string. Notice the errors in the test string like the use of a period instead of a comma or the duplicate tag attribute.

1055 |

Exercises

1056 |
    1057 |
  • Write a program that can convert an SRT file to a JSON file.
  • 1058 |
  • Rewrite the version number parser using Parsec instead of ReadP.
  • 1059 |
  • Rewrite the SRT parser using Parsec instead of ReadP.
  • 1060 |
1061 | 1062 |

(C) 2019 David Lettier
lettier.com

1063 | 1064 | 1065 | -------------------------------------------------------------------------------- /docs/style.css: -------------------------------------------------------------------------------- 1 | html { 2 | font-size: 100%; 3 | overflow-y: scroll; 4 | -webkit-text-size-adjust: 100%; 5 | -ms-text-size-adjust: 100%; 6 | } 7 | 8 | body { 9 | color: #444; 10 | font-family: Helvetica, Arial, sans-serif; 11 | font-size: 20px; 12 | line-height: 2; 13 | padding: 1em; 14 | margin: auto; 15 | max-width: 887px; 16 | background: #fefefe; 17 | } 18 | 19 | a { 20 | color: #059; 21 | text-decoration: none; 22 | } 23 | 24 | a:visited { 25 | color: #048; 26 | } 27 | 28 | a:hover { 29 | color: #06a; 30 | } 31 | 32 | a:active { 33 | color: #06a; 34 | } 35 | 36 | a:focus { 37 | outline: thin dotted; 38 | } 39 | 40 | *::-moz-selection { 41 | background: rgba(0, 200, 255, 0.3); 42 | color: #111; 43 | } 44 | 45 | *::selection { 46 | background: rgba(0, 200, 255, 0.3); 47 | color: #111; 48 | } 49 | 50 | a::-moz-selection { 51 | background: rgba(0, 200, 255, 0.3); 52 | color: #048; 53 | } 54 | 55 | a::selection { 56 | background: rgba(0, 200, 255, 0.3); 57 | color: #048; 58 | } 59 | 60 | a > span.emoji { 61 | font-size: 30px; 62 | margin-left: 5px; 63 | } 64 | 65 | p { 66 | margin: 1em 0; 67 | } 68 | 69 | img { 70 | max-width: 100%; 71 | } 72 | 73 | h1, h2, h3, h4, h5, h6 { 74 | color: #111; 75 | line-height: 125%; 76 | margin-top: 1em; 77 | font-weight: lighter; 78 | font-family: 'Roboto Condensed', Helvetica, Arial, sans-serif; 79 | } 80 | 81 | h4, h5, h6 { 82 | font-weight: bold; 83 | } 84 | 85 | h1 { 86 | font-size: 2.5em; 87 | } 88 | 89 | h2 { 90 | font-size: 2em; 91 | } 92 | 93 | h3 { 94 | font-size: 1.5em; 95 | } 96 | 97 | h4 { 98 | font-size: 1.2em; 99 | } 100 | 101 | h5 { 102 | font-size: 1em; 103 | } 104 | 105 | h6 { 106 | font-size: 0.9em; 107 | } 108 | 109 | blockquote { 110 | color: #666666; 111 | margin: 0; 112 | padding-left: 3em; 113 | border-left: 0.5em #EEE solid; 114 | } 115 | 116 | hr { 117 | display: block; 118 | height: 2px; 119 | border: 0; 120 | border-top: 1px solid #aaa; 121 | border-bottom: 1px solid #eee; 122 | margin: 1em 0; 123 | padding: 0; 124 | } 125 | 126 | pre, code, kbd, samp { 127 | font-family: monospace; 128 | font-size: 14px; 129 | } 130 | 131 | pre { 132 | white-space: pre; 133 | white-space: pre-wrap; 134 | word-wrap: break-word; 135 | padding: 15px; 136 | } 137 | 138 | b, strong { 139 | font-weight: bold; 140 | } 141 | 142 | p > code { 143 | font-weight: bold; 144 | } 145 | 146 | dfn { 147 | font-style: italic; 148 | } 149 | 150 | ins { 151 | background: #ff9; 152 | color: #000; 153 | text-decoration: none; 154 | } 155 | 156 | mark { 157 | background: #ff0; 158 | color: #000; 159 | font-style: italic; 160 | font-weight: bold; 161 | } 162 | 163 | sub, sup { 164 | font-size: 75%; 165 | line-height: 0; 166 | position: relative; 167 | vertical-align: baseline; 168 | } 169 | 170 | sup { 171 | top: -0.5em; 172 | } 173 | 174 | sub { 175 | bottom: -0.25em; 176 | } 177 | 178 | ul, ol { 179 | margin: 1em 0; 180 | padding: 0 0 0 2em; 181 | } 182 | 183 | li p:last-child { 184 | margin-bottom: 0; 185 | } 186 | 187 | ul ul, ol ol { 188 | margin: .3em 0; 189 | } 190 | 191 | dl { 192 | margin-bottom: 1em; 193 | } 194 | 195 | dt { 196 | font-weight: bold; 197 | margin-bottom: .8em; 198 | } 199 | 200 | dd { 201 | margin: 0 0 .8em 2em; 202 | } 203 | 204 | dd:last-child { 205 | margin-bottom: 0; 206 | } 207 | 208 | img { 209 | border: 0; 210 | -ms-interpolation-mode: bicubic; 211 | vertical-align: middle; 212 | } 213 | 214 | figure { 215 | display: block; 216 | text-align: center; 217 | margin: 1em 0; 218 | } 219 | 220 | figure img { 221 | border: none; 222 | margin: 0 auto; 223 | } 224 | 225 | figcaption { 226 | font-size: 0.8em; 227 | font-style: italic; 228 | margin: 0 0 .8em; 229 | } 230 | 231 | table { 232 | margin-bottom: 2em; 233 | border-bottom: 1px solid #ddd; 234 | border-right: 1px solid #ddd; 235 | border-spacing: 0; 236 | border-collapse: collapse; 237 | } 238 | 239 | table th { 240 | padding: .2em 1em; 241 | background-color: #eee; 242 | border-top: 1px solid #ddd; 243 | border-left: 1px solid #ddd; 244 | } 245 | 246 | table td { 247 | padding: .2em 1em; 248 | border-top: 1px solid #ddd; 249 | border-left: 1px solid #ddd; 250 | vertical-align: top; 251 | } 252 | 253 | kbd { 254 | border: 1px solid #999; 255 | padding: 5px; 256 | border-radius: 2px; 257 | background-color: #555; 258 | color: #eee; 259 | white-space: nowrap; 260 | } 261 | 262 | .author { 263 | font-size: 1.2em; 264 | text-align: center; 265 | } 266 | 267 | @media print { 268 | * { 269 | background: transparent !important; 270 | color: black !important; 271 | filter: none !important; 272 | -ms-filter: none !important; 273 | } 274 | 275 | body { 276 | font-size: 12pt; 277 | max-width: 100%; 278 | } 279 | 280 | a, a:visited { 281 | text-decoration: underline; 282 | } 283 | 284 | hr { 285 | height: 1px; 286 | border: 0; 287 | border-bottom: 1px solid black; 288 | } 289 | 290 | a[href]:after { 291 | content: " (" attr(href) ")"; 292 | } 293 | 294 | abbr[title]:after { 295 | content: " (" attr(title) ")"; 296 | } 297 | 298 | .ir a:after, a[href^="javascript:"]:after, a[href^="#"]:after { 299 | content: ""; 300 | } 301 | 302 | pre, blockquote { 303 | border: 1px solid #999; 304 | padding-right: 1em; 305 | page-break-inside: avoid; 306 | } 307 | 308 | tr, img { 309 | page-break-inside: avoid; 310 | } 311 | 312 | img { 313 | max-width: 100% !important; 314 | } 315 | 316 | @page :left { 317 | margin: 15mm 20mm 15mm 10mm; 318 | } 319 | 320 | @page :right { 321 | margin: 15mm 10mm 15mm 20mm; 322 | } 323 | 324 | p, h2, h3 { 325 | orphans: 3; 326 | widows: 3; 327 | } 328 | 329 | h2, h3 { 330 | page-break-after: avoid; 331 | } 332 | } 333 | -------------------------------------------------------------------------------- /parsing-with-haskell-parser-combinators.cabal: -------------------------------------------------------------------------------- 1 | name: parsing-with-haskell-parser-combinators 2 | version: 0.0.0.0 3 | homepage: https://github.com/lettier/parsing-with-haskell-parser-combinators 4 | author: David Lettier 5 | copyright: 2019 David Lettier 6 | license: BSD3 7 | build-type: Simple 8 | cabal-version: >= 1.10 9 | extra-source-files: README.md 10 | 11 | source-repository head 12 | type: git 13 | location: https://github.com/lettier/parsing-with-haskell-parser-combinators 14 | 15 | executable version-number-parser 16 | main-is: src/version-number-parser.hs 17 | build-depends: 18 | base >=4.7 && <5 19 | ghc-options: -threaded -rtsopts -with-rtsopts=-N 20 | default-language: Haskell2010 21 | 22 | executable srt-file-parser 23 | main-is: src/srt-file-parser.hs 24 | build-depends: 25 | base >=4.7 && <5 26 | ghc-options: -threaded -rtsopts -with-rtsopts=-N 27 | default-language: Haskell2010 28 | -------------------------------------------------------------------------------- /src/srt-file-parser.hs: -------------------------------------------------------------------------------- 1 | {- 2 | SRT File Parser 3 | (C) 2019 David Lettier 4 | lettier.com 5 | -} 6 | 7 | {-# LANGUAGE 8 | NamedFieldPuns 9 | #-} 10 | 11 | import Control.Applicative ((<|>)) 12 | import Control.Monad 13 | import Text.ParserCombinators.ReadP 14 | import Data.Char 15 | import Data.Maybe 16 | 17 | type TagAttribute = (String, String) 18 | 19 | data Tag = 20 | Tag 21 | { name :: String 22 | , attributes :: [TagAttribute] 23 | } 24 | deriving (Show, Read) 25 | 26 | data TaggedText = 27 | TaggedText 28 | { text :: String 29 | , tags :: [Tag] 30 | } 31 | deriving (Show, Read) 32 | 33 | data Timestamp = 34 | Timestamp 35 | { hours :: Int 36 | , minutes :: Int 37 | , seconds :: Int 38 | , milliseconds :: Int 39 | } 40 | deriving (Show, Read) 41 | 42 | data SrtSubtitleCoordinates = 43 | SrtSubtitleCoordinates 44 | { x1 :: Int 45 | , x2 :: Int 46 | , y1 :: Int 47 | , y2 :: Int 48 | } 49 | deriving (Show, Read) 50 | 51 | data SrtSubtitle = 52 | SrtSubtitle 53 | { index :: Int 54 | , start :: Timestamp 55 | , end :: Timestamp 56 | , coordinates :: Maybe SrtSubtitleCoordinates 57 | , taggedText :: [TaggedText] 58 | } 59 | deriving (Show, Read) 60 | 61 | main 62 | :: IO () 63 | main 64 | = do 65 | putStrLn "What is the SRT file path?" 66 | filePath <- getLine 67 | text <- readFile filePath 68 | let result = 69 | case readP_to_S parseSrt text of 70 | [] -> [] 71 | r@(_:_) -> fst $ last r 72 | putStrLn "" 73 | print result 74 | 75 | parseSrt 76 | :: ReadP [SrtSubtitle] 77 | parseSrt 78 | = 79 | manyTill parseBlock (skipSpaces >> eof) 80 | 81 | parseBlock 82 | :: ReadP SrtSubtitle 83 | parseBlock 84 | = do 85 | i <- parseIndex 86 | (s, e) <- parseTimestamps 87 | c <- parseCoordinates 88 | t <- parseTextLines 89 | return 90 | SrtSubtitle 91 | { index = i 92 | , start = s 93 | , end = e 94 | , coordinates = c 95 | , taggedText = t 96 | } 97 | 98 | parseBlock' 99 | :: ReadP SrtSubtitle 100 | parseBlock' 101 | = 102 | SrtSubtitle 103 | <$> parseIndex 104 | <*> parseStartTimestamp 105 | <*> parseEndTimestamp 106 | <*> parseCoordinates 107 | <*> parseTextLines 108 | 109 | parseIndex 110 | :: ReadP Int 111 | parseIndex 112 | = 113 | skipSpaces 114 | >> readInt <$> parseNumber 115 | 116 | parseTimestamps 117 | :: ReadP (Timestamp, Timestamp) 118 | parseTimestamps 119 | = do 120 | _ <- char '\n' 121 | s <- parseTimestamp 122 | _ <- skipSpaces 123 | _ <- string "-->" 124 | _ <- skipSpaces 125 | e <- parseTimestamp 126 | return (s, e) 127 | 128 | parseStartTimestamp 129 | :: ReadP Timestamp 130 | parseStartTimestamp 131 | = 132 | char '\n' 133 | >> parseTimestamp 134 | 135 | parseEndTimestamp 136 | :: ReadP Timestamp 137 | parseEndTimestamp 138 | = 139 | skipSpaces 140 | >> string "-->" 141 | >> skipSpaces 142 | >> parseTimestamp 143 | 144 | parseTimestamp 145 | :: ReadP Timestamp 146 | parseTimestamp 147 | = do 148 | h <- parseNumber 149 | _ <- char ':' 150 | m <- parseNumber 151 | _ <- char ':' 152 | s <- parseNumber 153 | _ <- char ',' <|> char '.' 154 | m' <- parseNumber 155 | return 156 | Timestamp 157 | { hours = readInt h 158 | , minutes = readInt m 159 | , seconds = readInt s 160 | , milliseconds = readInt m' 161 | } 162 | 163 | parseCoordinates 164 | :: ReadP (Maybe SrtSubtitleCoordinates) 165 | parseCoordinates 166 | = 167 | option Nothing $ do 168 | _ <- skipSpaces1 169 | x1 <- parseCoordinate 'x' 1 170 | _ <- skipSpaces1 171 | x2 <- parseCoordinate 'x' 2 172 | _ <- skipSpaces1 173 | y1 <- parseCoordinate 'y' 1 174 | _ <- skipSpaces1 175 | y2 <- parseCoordinate 'y' 2 176 | return 177 | $ Just 178 | SrtSubtitleCoordinates 179 | { x1 = readInt x1 180 | , x2 = readInt x2 181 | , y1 = readInt y1 182 | , y2 = readInt y2 183 | } 184 | 185 | parseCoordinate 186 | :: Char 187 | -> Int 188 | -> ReadP String 189 | parseCoordinate 190 | c 191 | n 192 | = do 193 | _ <- char (Data.Char.toUpper c) <|> char (Data.Char.toLower c) 194 | _ <- string $ show n ++ ":" 195 | parseNumber 196 | 197 | parseTextLines 198 | :: ReadP [TaggedText] 199 | parseTextLines 200 | = 201 | char '\n' 202 | >> (getTaggedText <$> manyTill parseAny parseEndOfTextLines) 203 | 204 | getTaggedText 205 | :: String 206 | -> [TaggedText] 207 | getTaggedText 208 | s 209 | = 210 | fst 211 | $ foldl 212 | folder 213 | ([], []) 214 | parsed 215 | where 216 | parsed 217 | :: [String] 218 | parsed 219 | = 220 | case readP_to_S (parseTaggedText []) s of 221 | [] -> [s] 222 | r@(_:_) -> (fst . last) r 223 | folder 224 | :: ([TaggedText], [Tag]) 225 | -> String 226 | -> ([TaggedText], [Tag]) 227 | folder 228 | (tt, t) 229 | x 230 | | isTag x = (tt, updateTags t x) 231 | | otherwise = (tt ++ [TaggedText { text = x, tags = t}], t) 232 | 233 | updateTags 234 | :: [Tag] 235 | -> String 236 | -> [Tag] 237 | updateTags 238 | tags 239 | x 240 | | isClosingTag x = remove compare' tags (makeTag x) 241 | | isOpeningTag x = add compare' tags (makeTag x) 242 | | otherwise = tags 243 | where 244 | compare' 245 | :: Tag 246 | -> Tag 247 | -> Bool 248 | compare' 249 | a 250 | b 251 | = 252 | name a /= name b 253 | 254 | makeTag 255 | :: String 256 | -> Tag 257 | makeTag 258 | s 259 | = 260 | Tag 261 | { name = getTagName s 262 | , attributes = getTagAttributes s 263 | } 264 | 265 | parseEndOfTextLines 266 | :: ReadP () 267 | parseEndOfTextLines 268 | = 269 | void (string "\n\n") <|> eof 270 | 271 | parseTaggedText 272 | :: [String] 273 | -> ReadP [String] 274 | parseTaggedText 275 | strings 276 | = do 277 | s <- look 278 | case s of 279 | "" -> return strings 280 | _ -> do 281 | r <- munch1 (/= '<') <++ parseClosingTag <++ parseOpeningTag 282 | parseTaggedText $ strings ++ [r] 283 | 284 | parseOpeningTag 285 | :: ReadP String 286 | parseOpeningTag 287 | = do 288 | _ <- char '<' 289 | t <- munch1 (\ c -> c /= '/' && c /= '>') 290 | _ <- char '>' 291 | return $ "<" ++ t ++ ">" 292 | 293 | parseClosingTag 294 | :: ReadP String 295 | parseClosingTag 296 | = do 297 | _ <- char '<' 298 | _ <- char '/' 299 | t <- munch1 (/= '>') 300 | _ <- char '>' 301 | return $ "" 302 | 303 | getTagAttributes 304 | :: String 305 | -> [TagAttribute] 306 | getTagAttributes 307 | s 308 | = 309 | if isOpeningTag s 310 | then 311 | case readP_to_S (parseTagAttributes []) s of 312 | [] -> [] 313 | (x:_) -> fst x 314 | else 315 | [] 316 | 317 | getTagName 318 | :: String 319 | -> String 320 | getTagName 321 | s 322 | = 323 | case readP_to_S parseTagName s of 324 | [] -> "" 325 | (x:_) -> toLower' $ fst x 326 | 327 | parseTagName 328 | :: ReadP String 329 | parseTagName 330 | = do 331 | _ <- char '<' 332 | _ <- munch (== '/') 333 | _ <- skipSpaces 334 | n <- munch1 (\ c -> c /= ' ' && c /= '>') 335 | _ <- munch (/= '>') 336 | _ <- char '>' 337 | return n 338 | 339 | parseTagAttributes 340 | :: [TagAttribute] 341 | -> ReadP [TagAttribute] 342 | parseTagAttributes 343 | tagAttributes 344 | = do 345 | s <- look 346 | case s of 347 | "" -> return tagAttributes 348 | _ -> do 349 | let h = head s 350 | case h of 351 | '>' -> return tagAttributes 352 | '<' -> trimTagname >> parseTagAttributes' 353 | _ -> parseTagAttributes' 354 | where 355 | parseTagAttributes' 356 | :: ReadP [TagAttribute] 357 | parseTagAttributes' 358 | = do 359 | tagAttribute <- parseTagAttribute 360 | parseTagAttributes 361 | ( add 362 | (\ a b -> fst a /= fst b) 363 | tagAttributes 364 | tagAttribute 365 | ) 366 | 367 | trimTagname 368 | :: ReadP () 369 | trimTagname 370 | = 371 | char '<' 372 | >> skipSpaces 373 | >> munch1 (\ c -> c /= ' ' && c /= '>') 374 | >> return () 375 | 376 | parseTagAttribute 377 | :: ReadP TagAttribute 378 | parseTagAttribute 379 | = do 380 | _ <- skipSpaces 381 | k <- munch1 (/= '=') 382 | _ <- string "=\"" 383 | v <- munch1 (/= '\"') 384 | _ <- char '\"' 385 | _ <- skipSpaces 386 | return (toLower' k, v) 387 | 388 | parseAny 389 | :: ReadP Char 390 | parseAny 391 | = 392 | satisfy (const True) 393 | 394 | parseNumber 395 | :: ReadP String 396 | parseNumber 397 | = 398 | munch1 isNumber 399 | 400 | skipSpaces1 401 | :: ReadP () 402 | skipSpaces1 403 | = 404 | void $ skipMany1 (char ' ') 405 | 406 | isTag 407 | :: String 408 | -> Bool 409 | isTag 410 | s 411 | = 412 | isOpeningTag s || isClosingTag s 413 | 414 | isOpeningTag 415 | :: String 416 | -> Bool 417 | isOpeningTag 418 | s 419 | = 420 | isPresent $ readP_to_S parseOpeningTag s 421 | 422 | isClosingTag 423 | :: String 424 | -> Bool 425 | isClosingTag 426 | s 427 | = 428 | isPresent $ readP_to_S parseClosingTag s 429 | 430 | readInt 431 | :: String 432 | -> Int 433 | readInt 434 | = 435 | read 436 | 437 | toLower' 438 | :: String 439 | -> String 440 | toLower' 441 | = 442 | map toLower 443 | 444 | remove 445 | :: (a -> a -> Bool) 446 | -> [a] 447 | -> a 448 | -> [a] 449 | remove 450 | f 451 | xs 452 | x 453 | = 454 | filter 455 | (f x) 456 | xs 457 | 458 | add 459 | :: (a -> a -> Bool) 460 | -> [a] 461 | -> a 462 | -> [a] 463 | add 464 | f 465 | xs 466 | x 467 | | isPresent xs = remove f xs x ++ [x] 468 | | otherwise = [x] 469 | 470 | isPresent 471 | :: Foldable t 472 | => t a 473 | -> Bool 474 | isPresent 475 | = 476 | not . null 477 | -------------------------------------------------------------------------------- /src/version-number-parser.hs: -------------------------------------------------------------------------------- 1 | {- 2 | Version Number Parser 3 | (C) 2019 David Lettier 4 | lettier.com 5 | -} 6 | 7 | import Control.Monad 8 | import Text.ParserCombinators.ReadP 9 | import Data.Char 10 | import Data.Maybe 11 | 12 | main 13 | :: IO () 14 | main 15 | = do 16 | putStrLn "What is the version output file path?" 17 | filePath <- getLine 18 | text <- readFile filePath 19 | let result = 20 | case readP_to_S (parseVersionNumber []) text of 21 | [] -> [] 22 | r@(_:_) -> map readInt $ fst $ last r 23 | putStrLn "" 24 | print result 25 | 26 | parseVersionNumber 27 | :: [String] 28 | -> ReadP [String] 29 | parseVersionNumber 30 | nums 31 | = do 32 | _ <- parseNotNumber 33 | num <- parseNumber 34 | let nums' = nums ++ [num] 35 | parseSeparator nums' parseVersionNumber 36 | 37 | parseSeparator 38 | :: [String] 39 | -> ([String] -> ReadP [String]) 40 | -> ReadP [String] 41 | parseSeparator 42 | nums 43 | f 44 | = do 45 | next <- look 46 | case next of 47 | "" -> return nums 48 | (c:_) -> 49 | case c of 50 | '.' -> f nums 51 | '-' -> if length nums == 1 then f [] else f nums 52 | _ -> if length nums == 1 then f [] else return nums 53 | 54 | parseNotNumber 55 | :: ReadP String 56 | parseNotNumber 57 | = 58 | munch (not . isNumber) 59 | 60 | parseNumber 61 | :: ReadP String 62 | parseNumber 63 | = 64 | munch1 isNumber 65 | 66 | readInt 67 | :: String 68 | -> Int 69 | readInt 70 | = 71 | read 72 | -------------------------------------------------------------------------------- /stack.yaml: -------------------------------------------------------------------------------- 1 | resolver: lts-13.27 2 | packages: 3 | - . 4 | -------------------------------------------------------------------------------- /stack.yaml.lock: -------------------------------------------------------------------------------- 1 | # This file was autogenerated by Stack. 2 | # You should not edit this file by hand. 3 | # For more information, please see the documentation at: 4 | # https://docs.haskellstack.org/en/stable/lock_files 5 | 6 | packages: [] 7 | snapshots: 8 | - completed: 9 | size: 500539 10 | url: https://raw.githubusercontent.com/commercialhaskell/stackage-snapshots/master/lts/13/27.yaml 11 | sha256: 690db832392afe55733b4c7023fd29b1b1c660ee42f1fb505b86b07394ca994e 12 | original: lts-13.27 13 | -------------------------------------------------------------------------------- /test-input/gifcurry-version-output.txt: -------------------------------------------------------------------------------- 1 | 2 |          ▄▄▄▄▄▄▄▄                                                                              3 |     ▄▄████   ▀▀███▄                                                                          4 |       ████▀   ▄    ▀███           ▄    ▐██▌   ▄███▄                                           5 |   ▄   ▐███   ████   ▀███      ▄███▀▀██       ███                                              6 |  ▐█▌   ██   ▐███     ████    ███        ▐██ █████▌ ▄█████ ▐██▌  ██▌  ██▄██▌ ██▄██▌ ██▌   ███  7 |  ███   ▐▌   ███      ▐███▌   ███  ████▌ ▐██   ██▌  ███     ▐██▌  ██▌  ███▀   ███▀   ▐██  ███   8 |  ████      ███▀  ▐█   ███▌   ███    ██▌ ▐██   ██▌  ███     ▐██▌  ██▌  ██▌    ██▌     ██▌▐██    9 |  ▐███▄    ▐██▌   ██    ██     ███▄▄▄██▌ ▐██   ██▌   ███▄▄█ ███▄███▌  ██▌    ██▌      ████▌    10 |   ▀███   ▀███   ▐███   ▀        ▀▀▀▀▀    ▀▀   ▀▀      ▀▀▀    ▀▀▀   ▀▀     ▀▀        ███     11 |     ███▄   ▀    ████▌                                                                ███▀      12 |       ▀███▄▄   █████▀                                                                          13 |           ▀▀▀▀▀▀▀                                                                              14 | 15 | 16 | Gifcurry 6.0.0.0 17 | (C) 2016 David Lettier 18 | https://lettier.com 19 | 20 | Wanna help out Gifcurry? Star it on GitHub! ☺ Thanks for helping out—you rock! 21 | https://github.com/lettier/gifcurry/stargazers 22 | 23 | -------------------------------------------------------------------------------- /test-input/imagemagick-version-output.txt: -------------------------------------------------------------------------------- 1 | Version: ImageMagick 2018-10-24 6.9.10-14 Q16 x86_64 https://imagemagick.org 2 | Copyright: © 1999-2018 ImageMagick Studio LLC 3 | License: https://imagemagick.org/script/license.php 4 | Features: Cipher DPC HDRI Modules OpenCL OpenMP 5 | -------------------------------------------------------------------------------- /test-input/subtitles.srt: -------------------------------------------------------------------------------- 1 | 2 | 3 | 4 | 1 5 | 01:00:12,000 --> 01:00:15,000 6 | This is a subtitle. 7 | 8 | 9 | 10 | 11 | 2 12 | 1:02:18,010 --> 01:05:10,010 13 | This is some subtitle 14 | text that spans multiple lines. 15 | It includes formatting 16 | like bold, italic, 17 | underline, < font color="#010101" color="#333" > 18 | font > color
, and << even 19 | nested tags over multiple 20 | lines. 21 | 22 | 3 23 | 03:23:11,010 --> 03:55:17.110 X1:123 X2:223 Y1:50 Y2:101 24 | This subtitle specifies a text box 25 | using X1, X2, Y1, and Y2. 26 | 27 | 28 | --------------------------------------------------------------------------------