├── Building_Recursive_Descent_Parsers_with_Python_en.md
├── Building_Recursive_Descent_Parsers_with_Python_zh.md
├── Oreilly_Getting_Started_with_Pyparsing.pdf
├── image
│   └── icon.png
└── readme.md

/Building_Recursive_Descent_Parsers_with_Python_en.md:
--------------------------------------------------------------------------------
# Building Recursive Descent Parsers with Python

by Paul McGuire

01/26/2006

What is "parsing"? Parsing is processing a series of symbols to extract their meaning.
Typically, this means reading the words of a sentence and drawing information from them.
When application programs need to process data that is provided as text, they must use some form of parsing logic.
This logic scans the text characters and character groups (words)
and recognizes patterns of groups to extract the underlying commands or information.

Software parsers are usually special-purpose programs, built to process a specific form of text.
This text could be a set of encoded notations on insurance or medical forms; function declarations in a C header file;
node-edge descriptions showing interconnections of a graph; HTML tags in a web page;
or interactive commands to configure a network, modify or rotate a 3D image, or navigate through an adventure game.
In each case, the parsers process a specific set of character groups and patterns. This set of patterns is the parser's grammar.

For instance, when parsing the string Hello, World! you might wish to parse any greeting that follows its general pattern.
Hello, World! begins with a salutation: the word Hello. There are many salutations--Howdy,
Greetings, Aloha, G'day, and so on--so you could define a very narrow greeting grammar to begin with a single-word salutation.
A comma character follows, and then comes the object of the greeting, the greetee, also a single word.
Lastly, some form of terminating punctuation ends the greeting, such as an exclamation point.
This greeting grammar looks roughly like this (reading `::` as "is composed of"):

```
word           :: group of alphabetic characters
salutation     :: word
comma          :: ","
greetee        :: word
endPunctuation :: "!"
greeting       :: salutation comma greetee endPunctuation
```

This notation is Backus-Naur form (BNF). There are various dialects for symbols representing optional/mandatory constructs,
repetition, alternation, and so on.

Once you have specified the target grammar in a BNF, you must then convert it to an executable form.
A common technique is to develop a recursive-descent parser,
one that defines functions that read individual terminal constructs of the grammar,
and then higher-level functions that call the lower-level functions.
Functions can return success or failure depending on whether they match at the current parsing position
(or return matching tokens for success and raise exceptions for failure).

## What Is Pyparsing?

Pyparsing is a Python class library that helps you to quickly and easily create recursive-descent parsers.
Here is the pyparsing implementation of the Hello, World! example:

```python
from pyparsing import Word, Literal, alphas

salutation     = Word( alphas + "'" )
comma          = Literal(",")
greetee        = Word( alphas )
endPunctuation = Literal("!")

greeting = salutation + comma + greetee + endPunctuation
```

Several pyparsing features help developers create their text-parsing functions quickly:

* Grammars are native Python, so no separate grammar definition files are required
* No special syntax is required, outside `+` for And, `^` for Or (longest, or "greedy" match), `|`
  for MatchFirst (first match), and `~` for Not
* No separate code-generation step is required
* It implicitly skips white space (and any comment expressions you register with `ignore`) between parse elements;
  there is no need to clutter your grammar with markers for ignorable text

The pyparsing grammar shown will parse not only Hello, World! but also any of:

* Hey, Jude!
* Hi, Mom!
* G'day, Mate!
* Yo, Adrian!
* Howdy, Pardner!
* Whattup, Dude!

Listing 1 contains a full Hello, World! parser, including output of the parsed results.

**Listing 1**

```python
from pyparsing import Word, Literal, alphas

salutation     = Word( alphas + "'" )
comma          = Literal(",")
greetee        = Word( alphas )
endPunctuation = Literal("!")

greeting = salutation + comma + greetee + endPunctuation

tests = ("Hello, World!",
"Hey, Jude!",
"Hi, Mom!",
"G'day, Mate!",
"Yo, Adrian!",
"Howdy, Pardner!",
"Whattup, Dude!" )

for t in tests:
    print(t, "->", greeting.parseString(t))
```

```
Hello, World! -> ['Hello', ',', 'World', '!']
Hey, Jude! -> ['Hey', ',', 'Jude', '!']
Hi, Mom! -> ['Hi', ',', 'Mom', '!']
G'day, Mate! -> ["G'day", ',', 'Mate', '!']
Yo, Adrian! -> ['Yo', ',', 'Adrian', '!']
Howdy, Pardner! -> ['Howdy', ',', 'Pardner', '!']
Whattup, Dude! -> ['Whattup', ',', 'Dude', '!']
```

## Pyparsing Is a "Combinator"

With the pyparsing module, you first define the basic pieces of your grammar.
Then you combine them into more complex parse expressions for the various branches of the overall grammar syntax.
Combine them by defining relationships such as:

* Which expressions should follow each other in the grammar, such as
  "the keyword if is followed by a Boolean expression enclosed in parentheses"
* Which expressions are valid alternatives at a certain point in the grammar, such as
  "a SQL command may start with SELECT, INSERT, UPDATE, or DELETE"
* Which expressions are optional, as in "a phone number may optionally be preceded by an area code, enclosed in parentheses"
* Which expressions are repeatable, such as "an opening XML tag may contain zero or more attributes"

Although some complex grammars can involve dozens or even hundreds of grammar combinations,
many parsing tasks are easily performed with only a handful of definitions.
Capturing the grammar in BNF helps organize your thoughts and parser design.
It also helps you keep track of your progress in implementing the grammar with pyparsing's functions and classes.

## Defining a Simple Grammar

The smallest building blocks of most grammars are typically exact strings of characters.
For example, here is a simple BNF for parsing a phone number:

```
number      :: '0'..'9'+
phoneNumber :: [ '(' number ')' ] number '-' number
```

Because this looks for dashes and parentheses within a phone number string,
you can also define simple literal tokens for these punctuation marks:

```python
dash   = Literal( "-" )
lparen = Literal( "(" )
rparen = Literal( ")" )
```

To define the groups of numbers in the phone number, you need to handle groups of characters of varying lengths.
For this, use the Word token:

```python
digits = "0123456789"
number = Word( digits )
```

The number token will match contiguous sequences made up of characters listed in the string digits; that is,
it is a "word" composed of digits (as opposed to a traditional word, which is composed of letters of the alphabet).
Now you have enough individual pieces of the phone number, so you can string them together using the And class.

```python
phoneNumber = And( [ lparen, number, rparen, number, dash, number ] )
```

This is fairly ugly, and unnatural to read. Fortunately, the pyparsing module defines operator methods to combine separate parse elements more easily.
A more legible definition uses `+` for And:

```python
phoneNumber = lparen + number + rparen + number + dash + number
```

For an even cleaner version, the `+` operator will join strings to parse elements, implicitly converting the strings to Literals.
This gives the very easy-to-read:

```python
phoneNumber = "(" + number + ")" + number + "-" + number
```

Finally, to designate that the area code at the beginning of the phone number is optional, use pyparsing's Optional class:

```python
phoneNumber = Optional( "(" + number + ")" ) + number + "-" + number
```

## Using the Grammar

Once you've defined your grammar, the next step is to apply it to the source text.
Pyparsing expressions support three methods for processing input text with a given grammar:

* The parseString method uses a grammar that completely specifies the contents of an input string, parses the string,
  and returns a collection of strings and substrings for each grammar construct.
* The scanString method uses a grammar that may match only parts of an input string, scans the string looking for matches,
  and yields, for each match, a tuple that contains the matched tokens and their starting and ending locations within the input string.
* The transformString method is a variation on scanString.
  It applies any changes to the matched tokens and returns a single string representing the original input text,
  as modified by the individual matches.

The initial Hello, World! parser calls parseString and returns straightforward token results:

```
Hello, World! -> ['Hello', ',', 'World', '!']
```

Although this looks like a simple list of token strings, pyparsing returns data using a ParseResults object.
If you assign the value returned by parseString to a variable named results, it behaves like a simple Python list.
In fact, you can index into the results just like a list:

```python
print(results[0])
print(results[-2])
```

will print:

```
Hello
World
```

ParseResults also lets you define names for individual syntax elements,
making it easier to retrieve bits and pieces of the parsed text.
This is especially helpful when a grammar includes optional elements,
which can change the length and offsets of the returned token list.
By attaching results names to the salutation and greetee expressions (the salutation is renamed `salute` here):

```python
salute  = Word( alphas+"'" ).setResultsName("salute")
greetee = Word( alphas ).setResultsName("greetee")
greeting = salute + comma + greetee + endPunctuation
```

you can reference the corresponding tokens as if they were attributes of the returned results object:

```python
hello = "G'day, Mate!"
results = greeting.parseString( hello )
print(hello, "->", results)
print(results.salute)
print(results.greetee)
```

Now the program will print:

```
G'day, Mate! -> ["G'day", ',', 'Mate', '!']
G'day
Mate
```

Results names can greatly help to improve the readability and maintainability of your parsing programs.

In the case of the phone number grammar, you can parse an input string containing a list of phone numbers, one after the next, as:

```python
phoneNumberList = OneOrMore( phoneNumber )
data = phoneNumberList.parseString( inputString )
```

This will return data as a pyparsing ParseResults object, containing a list of all of the input phone numbers.
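
For comparison, here is a minimal hand-written recursive-descent sketch of the phone-number grammar above, using only the standard library. It is not how pyparsing is implemented; the function names, the `ParseError` class, and the sample input are all assumptions made for illustration. Each function matches one grammar construct at a given position and either returns the tokens plus the new position or raises an exception, which is the bookkeeping that `Word`, `Optional`, and `OneOrMore` take over.

```python
class ParseError(Exception):
    """Raised when the input does not match the grammar at a position."""


def skip_ws(text, pos):
    # Advance past whitespace, mirroring pyparsing's implicit skipping.
    while pos < len(text) and text[pos].isspace():
        pos += 1
    return pos


def parse_literal(text, pos, literal):
    # Terminal: match an exact string at the current position.
    pos = skip_ws(text, pos)
    if text.startswith(literal, pos):
        return [literal], pos + len(literal)
    raise ParseError("Expected %r (at char %d)" % (literal, pos))


def parse_number(text, pos):
    # Terminal: number :: '0'..'9'+
    pos = skip_ws(text, pos)
    end = pos
    while end < len(text) and text[end] in "0123456789":
        end += 1
    if end == pos:
        raise ParseError("Expected number (at char %d)" % pos)
    return [text[pos:end]], end


def parse_phone_number(text, pos):
    # phoneNumber :: [ '(' number ')' ] number '-' number
    tokens = []
    try:
        # Optional area code: keep it only if all three parts match.
        t1, p = parse_literal(text, pos, "(")
        t2, p = parse_number(text, p)
        t3, p = parse_literal(text, p, ")")
        tokens, pos = t1 + t2 + t3, p
    except ParseError:
        pass  # no area code; continue from the original position
    toks, pos = parse_number(text, pos)
    tokens += toks
    toks, pos = parse_literal(text, pos, "-")
    tokens += toks
    toks, pos = parse_number(text, pos)
    tokens += toks
    return tokens, pos


def parse_phone_number_list(text):
    # phoneNumberList :: phoneNumber+
    tokens, pos = parse_phone_number(text, 0)
    while True:
        try:
            more, pos = parse_phone_number(text, pos)
        except ParseError:
            return tokens
        tokens += more


print(parse_phone_number_list("(510) 555-1212 555-0199"))
```

Even for a two-rule grammar, most of the code is position tracking and error handling rather than grammar, which is the overhead a combinator library removes.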

Pyparsing includes some helper expressions, such as `delimitedList`,
so that if your input were a comma-separated list of phone numbers,
you could simply change phoneNumberList to:

```python
phoneNumberList = delimitedList( phoneNumber )
```

This will return the same list of phone numbers that you had before.
(`delimitedList` supports any custom string or expression as a delimiter,
but comma delimiters are the most common, and so they are the default.)

If, instead of having a string containing only phone numbers, you had a complete mailing list of names, addresses,
zip codes, and phone numbers, you could extract the phone numbers using scanString. scanString is a Python generator function,
so you must use it in a for loop, list comprehension, or generator expression.

```python
for data,dataStart,dataEnd in phoneNumber.scanString( mailingListText ):
    # do something with the phone number tokens,
    # returned in the 'data' variable
    ...
```

Lastly, if you had the same mailing list but wished to hide the numbers from, say, a potential telemarketer,
you could transform the string by attaching a parse action that just changes all phone numbers to the string (000)000-0000.
Replacing the input tokens with a fixed string is a common parse action,
so pyparsing provides a built-in function replaceWith to make this very simple:

```python
phoneNumber.setParseAction( replaceWith("(000)000-0000") )
sanitizedList = phoneNumber.transformString( originalMailingListText )
```

## When Good Input Goes Bad

Pyparsing will process input text until it runs out of matching text for its given parser elements.
If it finds an unexpected token or character and there is no matching parsing element,
then pyparsing will raise a ParseException. ParseExceptions print out a diagnostic message by default;
they also have attributes to help you locate the line number, column, text line, and annotated line of text.

If you provide the input string Hello, World? to your parser, you will receive the exception:

```
pyparsing.ParseException: Expected "!" (at char 12), (line:1, col:13)
```

At this point, you can choose to fix the input text or make the grammar more tolerant of other syntax
(in this case, supporting question marks as valid sentence terminators).

## A Complete Application

Consider an application where you need to process chemical formulas, such as NaCl, H2O, or C6H5OH. For this application,
the chemical formula grammar will be one or more element symbols, each followed by an optional integer.
In BNF-style notation, this is:

```
integer       :: '0'..'9'+
cap           :: 'A'..'Z'
lower         :: 'a'..'z'
elementSymbol :: cap lower*
elementRef    :: elementSymbol [ integer ]
formula       :: elementRef+
```

The pyparsing module handles these concepts with the classes Optional and OneOrMore.
The definition of the elementSymbol will use the two-argument form of the Word constructor:
the first argument lists the set of valid leading characters,
and the second argument gives the set of valid body characters. Using the pyparsing module, a simple version of the grammar is:

```python
caps       = "ABCDEFGHIJKLMNOPQRSTUVWXYZ"
lowers     = caps.lower()
digits     = "0123456789"

element    = Word( caps, lowers )
elementRef = element + Optional( Word( digits ) )
formula    = OneOrMore( elementRef )

elements   = formula.parseString( testString )
```

So far, this program is an adequate tokenizer, processing the following formulas into their appropriate tokens.
The default behavior for pyparsing is to return all of the parsed tokens within a single list of matching substrings:

```
H2O -> ['H', '2', 'O']
C6H5OH -> ['C', '6', 'H', '5', 'O', 'H']
NaCl -> ['Na', 'Cl']
```

Of course, you want to do some processing with these returned results, beyond simply printing them out as a list.
Assume that you want to compute the molecular weight for each given chemical formula.
The program somewhere defines a dictionary of chemical symbols and their corresponding atomic weight:

```
atomicWeight = {
    "O"  : 15.9994,
    "H"  : 1.00794,
    "Na" : 22.9897,
    "Cl" : 35.4527,
    "C"  : 12.0107,
    ...
    }
```

Next it would be good to establish a more logical grouping in the parsed chemical symbols and associated quantities,
to return a structured set of results. Fortunately, the pyparsing module provides the Group class for just this purpose.
By changing the elementRef declaration from:

```python
elementRef = element + Optional( Word( digits ) )
```

to:

```python
elementRef = Group( element + Optional( Word( digits ) ) )
```

you will now get the results grouped by chemical symbol:

```
H2O -> [['H', '2'], ['O']]
C6H5OH -> [['C', '6'], ['H', '5'], ['O'], ['H']]
NaCl -> [['Na'], ['Cl']]
```

The last simplification is to include a default value for the quantity part of elementRef,
using the default argument for the constructor of the Optional class:

```python
elementRef = Group( element + Optional( Word( digits ),
                                         default="1" ) )
```

Now every elementRef will return a pair of values: the element's chemical symbol and the number of atoms of that element,
with "1" implied if no quantity is given. Now the test formulas return a very clean list of ordered pairs of element symbols and their respective quantities:

```
H2O -> [['H', '2'], ['O', '1']]
C6H5OH -> [['C', '6'], ['H', '5'], ['O', '1'], ['H', '1']]
NaCl -> [['Na', '1'], ['Cl', '1']]
```

The final step is to compute the molecular weight for each formula. Add a single line of Python code after the call to parseString:

```python
wt = sum( [ atomicWeight[elem] * int(qty)
            for elem,qty in elements ] )
```

giving the results:

```
H2O -> [['H', '2'], ['O', '1']] (18.01528)
C6H5OH -> [['C', '6'], ['H', '5'], ['O', '1'], ['H', '1']]
    (94.11124)
NaCl -> [['Na', '1'], ['Cl', '1']] (58.4424)
```

Listing 2 contains the entire pyparsing program.

**Listing 2**

```python
from pyparsing import Word, Optional, OneOrMore, Group, ParseException

atomicWeight = {
    "O"  : 15.9994,
    "H"  : 1.00794,
    "Na" : 22.9897,
    "Cl" : 35.4527,
    "C"  : 12.0107
    }

caps = "ABCDEFGHIJKLMNOPQRSTUVWXYZ"
lowers = caps.lower()
digits = "0123456789"

element = Word( caps, lowers )
elementRef = Group( element + Optional( Word( digits ), default="1" ) )
formula = OneOrMore( elementRef )

tests = [ "H2O", "C6H5OH", "NaCl" ]
for t in tests:
    try:
        results = formula.parseString( t )
        print(t, "->", results, end=" ")
    except ParseException as pe:
        print(pe)
    else:
        wt = sum( [atomicWeight[elem]*int(qty) for elem,qty in results] )
        print("(%.3f)" % wt)
```

```
H2O -> [['H', '2'], ['O', '1']] (18.015)
C6H5OH -> [['C', '6'], ['H', '5'], ['O', '1'], ['H', '1']] (94.111)
NaCl -> [['Na', '1'], ['Cl', '1']] (58.442)
```

One of the nice by-products of using a parser is the inherent validation it performs on the input text.
Note that in the calculation of the wt variable, there was no need to test that the qty string was all numeric,
or to catch the ValueError that int() raises for an invalid argument. If qty were not all numeric,
and therefore not a valid argument to int(), it would not have passed the parser.

## An HTML Scraper

As a final example, consider the development of a simple HTML "scraper." It is not a comprehensive HTML parser,
as such a parser would require scores of parse expressions. Fortunately, it is not usually necessary to have a complete HTML grammar
definition to be able to extract significant pieces of data from most web pages,
especially those autogenerated by CGI or other application programs.

This example will extract data with a minimal parser, targeted to work with a specific web page--in this case,
the page kept by NIST listing publicly available network time protocol (NTP) servers.
This routine could be part of a larger NTP client application that, during its initialization,
would look up what NTP servers are currently available.

To begin developing an HTML scraper, you must first see what sort of HTML text you will need to process.
By visiting the web site and viewing the returned HTML source,
you can see that the page lists the names and IP addresses of the NTP servers in an HTML table:

```
Name             IP Address   Location
time-a.nist.gov  129.6.15.28  NIST, Gaithersburg, Maryland
time-b.nist.gov  129.6.15.29  NIST, Gaithersburg, Maryland
```

The underlying HTML source for this table uses `<table>`, `<tr>`, and `<td>` tags to structure the NTP server data:

```html
<table>
  <tr>
    <td>Name</td>
    <td>IP Address</td>
    <td>Location</td>
  </tr>
  <tr>
    <td>time-a.nist.gov</td>
    <td>129.6.15.28</td>
    <td>NIST, Gaithersburg, Maryland</td>
  </tr>
  <tr>
    <td>time-b.nist.gov</td>
    <td>129.6.15.29</td>
    <td>NIST, Gaithersburg, Maryland</td>
  </tr>
</table>
```

This table is part of a much larger body of HTML, but pyparsing allows you to define a parse expression that matches only
a subset of the total input text and to scan for text that matches the given parse expression.
So you need only define the minimum amount of grammar required to match the desired HTML source.

The program should extract the IP addresses and locations of those servers,
so you can focus your grammar on just those columns of the table.
Informally, you want to extract the values that match the pattern

```html
<td>IP address</td> <td>location</td>
```

You do want to be a bit more specific than just matching on something as generic as
`<td>any text</td>` `<td>any other text</td>`,
because so general an expression would match the first two columns of the table instead of the second two
(as well as the first two columns of any table on the page!). Instead,
use the specific format of the IP address to help narrow your search pattern by
eliminating any false matches from other table data on the page.

To build up the elements of an IP address, start by defining an integer, then combining four integers with intervening periods:

```python
integer   = Word("0123456789")
ipAddress = integer + "." + integer + "." + integer + "." + integer
```

You will also need to match the HTML tags `<td>` and `</td>`, so define parse elements for each:

```python
tdStart = Literal("<td>")
tdEnd   = Literal("</td>")
```

In general, the location can be any text, so the grammar needs an element that matches everything up to the closing `</td>` tag.
Pyparsing includes a class named SkipTo for this kind of grammar element.

You now have all the pieces you need to define the time server text pattern:

```python
timeServer = tdStart + ipAddress + tdEnd + tdStart + SkipTo(tdEnd) + tdEnd
```

To extract the data, invoke `timeServer.scanString`,
which is a generator function that yields the matched tokens and the start and end string positions for each matching set of text.
This application uses only the matched tokens.

**Listing 3**

```python
from pyparsing import *
from urllib.request import urlopen

# define basic text pattern for NTP server
integer = Word("0123456789")
ipAddress = integer + "." + integer + "." + integer + "." + integer
tdStart = Literal("<td>")
tdEnd = Literal("</td>")
timeServer = tdStart + ipAddress + tdEnd + tdStart + SkipTo(tdEnd) + tdEnd

# get list of time servers
nistTimeServerURL = "http://tf.nist.gov/service/time-servers.html"
serverListPage = urlopen( nistTimeServerURL )
# decode the downloaded bytes to text (UTF-8 assumed; undecodable bytes are replaced)
serverListHTML = serverListPage.read().decode("utf-8", "replace")
serverListPage.close()

for srvrtokens,startloc,endloc in timeServer.scanString( serverListHTML ):
    print(srvrtokens)
```

Running the program in Listing 3 gives the token data:

```
['<td>', '129', '.', '6', '.', '15', '.', '28', '</td>', '<td>', 'NIST, Gaithersburg, Maryland', '</td>']
['<td>', '129', '.', '6', '.', '15', '.', '29', '</td>', '<td>', 'NIST, Gaithersburg, Maryland', '</td>']
['<td>', '132', '.', '163', '.', '4', '.', '101', '</td>', '<td>', 'NIST, Boulder, Colorado', '</td>']
['<td>', '132', '.', '163', '.', '4', '.', '102', '</td>', '<td>', 'NIST, Boulder, Colorado', '</td>']
:
```

Looking at these results, a couple of things immediately jump out.
One is that the parser records each IP address as a series of separate tokens,
one for each subfield and delimiting period. It would be nice if pyparsing were to do a bit of work during the parsing process
to combine these fields into a single-string token. Pyparsing's Combine class will do just this.
Modify the ipAddress definition to read:

```python
ipAddress = Combine( integer + "." + integer + "." + integer + "." + integer )
```

to get a single-string token returned for the IP address.
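
The effect of `Combine` can be checked in isolation, without downloading anything. The following is a small sketch, assuming a recent pyparsing release in which `nums` and `asList` are available; the variable names `dotted`, `separate`, and `combined` and the sample address string are illustrative choices, not part of the original listings.

```python
from pyparsing import Combine, Word, nums

integer = Word(nums)
dotted = integer + "." + integer + "." + integer + "." + integer

# Without Combine, every subfield and period is its own token.
separate = dotted.parseString("129.6.15.28")

# With Combine, the matched pieces are joined into a single string token.
combined = Combine(dotted).parseString("129.6.15.28")

print(separate.asList())
print(combined.asList())
```

The first result holds seven tokens and the second holds one, which is the difference between the Listing 3 and Listing 4 output for the address column.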

The second observation is that the results include the opening and closing HTML tags that mark the table columns.
While the presence of these tags is important during the parsing process,
the tags themselves are not interesting in the extracted data.
To have them suppressed from the returned token data, construct the tag literals with the suppress method.

```python
tdStart = Literal("<td>").suppress()
tdEnd   = Literal("</td>").suppress()
```

**Listing 4**

```python
from pyparsing import *
from urllib.request import urlopen

# define basic text pattern for NTP server
integer = Word("0123456789")
ipAddress = Combine( integer + "." + integer + "." + integer + "." + integer )
tdStart = Literal("<td>").suppress()
tdEnd = Literal("</td>").suppress()
timeServer = tdStart + ipAddress + tdEnd + tdStart + SkipTo(tdEnd) + tdEnd

# get list of time servers
nistTimeServerURL = "http://tf.nist.gov/service/time-servers.html"
serverListPage = urlopen( nistTimeServerURL )
# decode the downloaded bytes to text (UTF-8 assumed; undecodable bytes are replaced)
serverListHTML = serverListPage.read().decode("utf-8", "replace")
serverListPage.close()

for srvrtokens,startloc,endloc in timeServer.scanString( serverListHTML ):
    print(srvrtokens)
```

Now run the program in Listing 4. Your returned token data has substantially improved:

```
['129.6.15.28', 'NIST, Gaithersburg, Maryland']
['129.6.15.29', 'NIST, Gaithersburg, Maryland']
['132.163.4.101', 'NIST, Boulder, Colorado']
['132.163.4.102', 'NIST, Boulder, Colorado']
```

Finally, add result names to these tokens, so that you can access them by attribute name.
The easiest way to do this is in the definition of `timeServer`:

```python
timeServer = tdStart + ipAddress.setResultsName("ipAddress") + tdEnd + \
        tdStart + SkipTo(tdEnd).setResultsName("locn") + tdEnd
```

Now you can neaten up the body of the for loop and access these tokens just like members in a dictionary:

```python
servers = {}

for srvrtokens,startloc,endloc in timeServer.scanString( serverListHTML ):
    print("%(ipAddress)-15s : %(locn)s" % srvrtokens)
    servers[srvrtokens.ipAddress] = srvrtokens.locn
```

Listing 5 contains the finished running program.

**Listing 5**

```python
from pyparsing import *
from urllib.request import urlopen

# define basic text pattern for NTP server
integer = Word("0123456789")
ipAddress = Combine( integer + "." + integer + "." + integer + "." + integer )
tdStart = Literal("<td>").suppress()
tdEnd = Literal("</td>").suppress()
timeServer = tdStart + ipAddress.setResultsName("ipAddress") + tdEnd + \
        tdStart + SkipTo(tdEnd).setResultsName("locn") + tdEnd

# get list of time servers
nistTimeServerURL = "http://tf.nist.gov/service/time-servers.html"
serverListPage = urlopen( nistTimeServerURL )
# decode the downloaded bytes to text (UTF-8 assumed; undecodable bytes are replaced)
serverListHTML = serverListPage.read().decode("utf-8", "replace")
serverListPage.close()

servers = {}
for srvrtokens,startloc,endloc in timeServer.scanString( serverListHTML ):
    print("%(ipAddress)-15s : %(locn)s" % srvrtokens)
    servers[srvrtokens.ipAddress] = srvrtokens.locn

print(servers)
```

At this point, you've successfully extracted the NTP servers and their IP addresses and populated a program variable
so that your NTP-client application can make use of the parsed results.
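
The listings above depend on fetching a live page, which makes them awkward to test. Below is a minimal sketch of the same grammar run against an inline HTML fragment instead. The `sampleHTML` string is a hypothetical stand-in built from the table rows shown earlier, not the real NIST markup, and `Suppress` and `nums` are used here as shorthand for `Literal(...).suppress()` and the digit string; the sketch assumes a pyparsing release that still accepts the camelCase method names.

```python
from pyparsing import Combine, SkipTo, Suppress, Word, nums

# Hypothetical stand-in for the downloaded page (not the real NIST HTML).
sampleHTML = """
<table>
<tr><td>time-a.nist.gov</td><td>129.6.15.28</td><td>NIST, Gaithersburg, Maryland</td></tr>
<tr><td>time-b.nist.gov</td><td>129.6.15.29</td><td>NIST, Gaithersburg, Maryland</td></tr>
</table>
"""

integer = Word(nums)
ipAddress = Combine(integer + "." + integer + "." + integer + "." + integer)
tdStart = Suppress("<td>")
tdEnd = Suppress("</td>")
timeServer = (tdStart + ipAddress.setResultsName("ipAddress") + tdEnd
              + tdStart + SkipTo(tdEnd).setResultsName("locn") + tdEnd)

# Collect address -> location for every cell pair that matches the pattern.
servers = {}
for srvrtokens, startloc, endloc in timeServer.scanString(sampleHTML):
    servers[srvrtokens.ipAddress] = srvrtokens.locn

print(servers)
```

The name column is skipped because its text does not match the `ipAddress` pattern, so the scan only reports the address and location cells. Keeping a fixture like this next to the scraper makes it easier to notice when a change to the grammar breaks the extraction.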
703 | 704 | ## In Conclusion 705 | 706 | Pyparsing provides a basic framework for creating recursive-descent parsers, 707 | taking care of the overhead functions of scanning the input string, 708 | handling expression mismatches, selecting the longest of matching alternatives, 709 | invoking callback functions, and returning the parsed results. 710 | This leaves developers free to focus on their grammar design and the design and implementation of corresponding token processing. 711 | Pyparsing's nature as a combinator allows developers to 712 | scale their applications from simple tokenizers up to complex grammar processors. 713 | It is a great way to get started with your next parsing project! 714 | 715 | [Download pyparsing from SourceForge](http://pyparsing.sourceforge.net/). 716 | 717 | [Paul McGuire](http://www.onlamp.com/pub/au/2557) is a senior manufacturing systems consultant at Alan Weber & Associates. 718 | In his spare time, he administers the pyparsing project on SourceForge. 719 | -------------------------------------------------------------------------------- /Building_Recursive_Descent_Parsers_with_Python_zh.md: -------------------------------------------------------------------------------- 1 | # 用Python构建递归下降解析器 2 | 3 | by Paul McGuire 4 | 5 | 01/26/2006 6 | 7 | 什么是"解析"?解析是一处理一系列符号的过程,提取它们的意义。比较其日常,就是阅读句子中的单词并且理解其中的意思。 8 | 当应用程序需要处理以文本构成的数据,它们必须使用某种解析逻辑。这个逻辑扫描文本中的字符,字符组(单词),字符组 9 | 的模式以提取背后的命令或信息。 10 | 11 | 解析器程序经常是针对特定类型的文本的。这种文本可以是保险与医疗的表格;C语言头文件的函数声明;一个图的点边的连接关系描述;网页上的HTML 12 | 标记或者配置一个网络,修改或旋转一个3D图像,或描述冒险游戏中的游戏流程的脚本。在每个情况中,解析器处理一个特定的字符组/模式。这个 13 | 模式的集合就是这个解析器的语法。 14 | 15 | 作为例子,当解析一个字符串Hello, World! 
你可能想要解析遵循这样的模式任意问候句式。Hello, World!以一个问候词开始:单词Hello。 16 | 存在很多问候次--Howdy,Greetings,Aloha,G'day之类的--所以你可以定义一个简单的语法,它以一个单独的问候词开始。后面跟了一个 17 | 逗号,再后面是问候的对象,也是一个单词。最后一些终结符(terminating punctuation)终结了这个问候,像感叹号等。 18 | 这样一个问候的语法粗略来看大概可以这样表示(`::` 读作 "由...组成") 19 | 20 | ``` 21 | word :: group of alphabetic characters 22 | salutation :: word 23 | comma :: "," 24 | greetee :: word 25 | endPunctuation :: "!" 26 | greeting :: salutation comma greetee endPunctuation 27 | ``` 28 | 29 | 这是巴克斯范式(BNF)。存在很多语法变体表示诸如可选/必须(optional/manatory) 组成,重复,候选之类的语法概念。 30 | 31 | 一旦你已经以BNF确定了一个特定语法,你必须将其转化为可执行形式。一个常用的方式是以一个递归下降解析器 32 | 的形式来进行开发。一个这样的解析器定义了一个函数,其每次读入一个字符,高级函数调用底层函数,它们会 33 | 返回success如果匹配成功,否则抛出一个异常。 34 | 35 | ## Pyparsing是什么? 36 | 37 | Pyparsing 是一个Python模块帮助你便捷创建递归下降解析器。这里一个用pyparsing实现的Hello, World!实例: 38 | 39 | ```python 40 | from pyparsing import Word, Literal, alphas 41 | 42 | salutation = Word( alphas + "'" ) 43 | comma = Literal(",") 44 | greetee = Word( alphas ) 45 | endPunctuation = Literal("!") 46 | 47 | greeting = salutation + comma + greetee + endPunctuation 48 | ``` 49 | 50 | 下面是pyparsing可以帮助开发者创建文本解析函数的一些特点: 51 | 52 | 53 | * 语法是纯Python的,所以么有单独的语法定义文件被要求 54 | * 不需要使用特殊的语法,除了 `+`表示且,`^`表示或(长模式,或者说"贪婪"匹配。),`|`表示第一次匹配,以及`~`表示非。 55 | * 不需要单独的代码生成阶段 56 | * 它隐式的跳过解析元素之间的空格或注释;避免使你的代码因为应该忽略的文本而变得乱糟糟的。 57 | 58 | 上面的pyparsing语法不只可以解析Hello, World! 还能解析这些: 59 | 60 | * Hey, Jude! 61 | * Hi, Mom! 62 | * G'day, Mate! 63 | * Yo, Adrian! 64 | * Howdy, Pardner! 65 | * Whattup, Dude! 66 | 67 | Listing 1包含了完整的 Hello, World! 解析器代码,也有解析结果的输出。 68 | 69 | **Listing 1** 70 | 71 | ```python 72 | from pyparsing import Word, Literal, alphas 73 | 74 | salutation = Word( alphas + "'" ) 75 | comma = Literal(",") 76 | greetee = Word( alphas ) 77 | endPunctuation = Literal("!") 78 | 79 | greeting = salutation + comma + greetee + endPunctuation 80 | 81 | tests = ("Hello, World!", 82 | "Hey, Jude!", 83 | "Hi, Mom!", 84 | "G'day, Mate!", 85 | "Yo, Adrian!", 86 | "Howdy, Pardner!", 87 | "Whattup, Dude!" 
         )

for t in tests:
    print t, "->", greeting.parseString(t)
```
```
Hello, World! -> ['Hello', ',', 'World', '!']
Hey, Jude! -> ['Hey', ',', 'Jude', '!']
Hi, Mom! -> ['Hi', ',', 'Mom', '!']
G'day, Mate! -> ["G'day", ',', 'Mate', '!']
Yo, Adrian! -> ['Yo', ',', 'Adrian', '!']
Howdy, Pardner! -> ['Howdy', ',', 'Pardner', '!']
Whattup, Dude! -> ['Whattup', ',', 'Dude', '!']
```

## Pyparsing Is a "Combinator"

With the pyparsing module, you first define the basic pieces of your grammar.
Then you combine them into more complex parser expressions for the various branches of the overall grammar.
Combine them by defining relationships such as:

* Which expressions should follow one another, as in "the keyword `if` is followed by a Boolean expression enclosed in parentheses."
* Which expressions are valid alternatives at a certain point, as in "a SQL command can begin with SELECT, INSERT, UPDATE, or DELETE."
* Which expressions are optional, as in "a phone number can optionally be preceded by an area code enclosed in parentheses."
* Which expressions are repeatable, as in "an opening XML tag can contain zero or more attributes."

Although some complex grammars can involve hundreds of grammar combinations, most parsing tasks can be expressed with only a handful of definitions.
Capturing the grammar in BNF helps you organize your thoughts and your parser design.
It also helps you keep track of your progress in implementing the grammar with pyparsing's functions and classes.

## Defining a Simple Grammar

The smallest building blocks of most grammars are strings made up of characters from a given set.
For example, here is a simple BNF for parsing a phone number:

```
number      :: '0'..'9'+
phoneNumber :: [ '(' number ')' ] number '-' number
```

Because this looks for dashes and parentheses within a phone number string,
you can also define simple literal tokens for these punctuation marks:

```python
dash   = Literal( "-" )
lparen = Literal( "(" )
rparen = Literal( ")" )
```

To define the groups of numbers in the phone number, you need to handle groups of characters of varying lengths.
For this, use the Word token:

```python
digits = "0123456789"
number = Word( digits )
```

The number token will match contiguous sequences made up of characters listed in the string digits; that is,
it is a "word" composed of digits (as opposed to a traditional word, which is composed of letters of the alphabet).
Now you have enough individual pieces of the phone number, so you can string them together using the And class.

```python
phoneNumber = And( [ lparen, number, rparen, number, dash, number ] )
```

This is fairly ugly, and unnatural to read. Fortunately, the pyparsing module defines operator methods to combine separate parse elements more easily.
A more legible definition uses `+` for And:

```python
phoneNumber = lparen + number + rparen + number + dash + number
```

For an even cleaner version, the `+` operator will join strings to parse elements, implicitly converting the strings to Literals.
This gives the very easy-to-read:

```python
phoneNumber = "(" + number + ")" + number + "-" + number
```

Finally, to designate that the area code at the beginning of the phone number is optional, use pyparsing's Optional class:

```python
phoneNumber = Optional( "(" + number + ")" ) + number + "-" + number
```

## Using the Grammar

Once you've defined your grammar, the next step is to apply it to the source text.
Pyparsing expressions support three methods for processing input text with a given grammar:

* The parseString method uses a grammar that completely specifies the contents of an input string, parses the string,
  and returns a collection of strings and substrings for each grammar construct.
* The scanString method uses a grammar that may match only parts of an input string, scans the string looking for matches,
  and yields a tuple for each match, containing the matched tokens and their starting and ending locations within the input string.
* The transformString method is a variation on scanString.
  It applies any changes to the matched tokens and returns a single string representing the original input text,
  as modified by the individual matches.
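
The three methods listed above can be compared side by side on the phone number grammar. This is only a minimal sketch: it assumes Python 3 syntax and a pyparsing release that still accepts the camelCase names used in this article, and the sample text and the names `text`, `matches`, and `masked` are made up for illustration.

```python
from pyparsing import Word, Optional, nums, replaceWith

# Phone number grammar from this section; the area code is optional.
number = Word(nums)
phoneNumber = Optional("(" + number + ")") + number + "-" + number

# parseString: the grammar is applied at the start of the input string.
tokens = phoneNumber.parseString("(510) 555-1212")
print(tokens.asList())

# scanString: a generator yielding (tokens, start, end) for every match
# found anywhere in the text (hypothetical sample text).
text = "Call 555-1212 or (510) 555-9999 after 5pm."
matches = [(t.asList(), start, end)
           for t, start, end in phoneNumber.scanString(text)]
print(matches)

# transformString: like scanString, but returns the text with each match
# replaced by whatever the parse action returns. A copy is used so the
# original phoneNumber expression keeps no parse action.
masked = phoneNumber.copy().setParseAction(replaceWith("(000)000-0000"))
print(masked.transformString(text))
```

Note that the text around each match ("Call", "or", "after 5pm.") is simply skipped by scanString and passed through unchanged by transformString.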
182 | 183 | The initial Hello, World! parser calls parseString and returns straightforward token results: 184 | 185 | ``` 186 | Hello, World! -> ['Hello', ',', 'World', '!'] 187 | ``` 188 | 189 | Although this looks like a simple list of token strings, pyparsing returns data using a ParseResults object. 190 | In the example above, the results variable behaves like a simple Python list. 191 | In fact, you can index into the results just like a list: 192 | 193 | 194 | ```python 195 | print results[0] 196 | print results[-2] 197 | ``` 198 | 199 | will print: 200 | 201 | ``` 202 | Hello 203 | World 204 | ``` 205 | 206 | ParseResults also lets you define names for individual syntax elements, 207 | making it easier to retrieve bits and pieces of the parsed text. 208 | This is especially helpful when a grammar includes optional elements, 209 | which can change the length and offsets of the returned token list. 210 | By modifying the definitions of `salute` and `greetee`: 211 | 212 | ```python 213 | salute = Word( alphas+"'" ).setResultsName("salute") 214 | greetee = Word( alphas ).setResultsName("greetee") 215 | ``` 216 | 217 | you can reference the corresponding tokens as if they were attributes of the returned results object: 218 | 219 | ```python 220 | print hello, "->", results 221 | print results.salute 222 | print results.greetee 223 | ``` 224 | 225 | Now the program will print: 226 | 227 | ``` 228 | G'day, Mate! -> ["G'day", ',', 'Mate', '!'] 229 | G'day 230 | Mate 231 | ``` 232 | 233 | Results names can greatly help to improve the readability and maintainability of your parsing programs. 
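
Putting the pieces of this passage together, a complete version of the named-results example might look like the following. It is a sketch in Python 3 syntax; the grammar is the one from Listing 1 with the two `setResultsName` calls added.

```python
from pyparsing import Word, Literal, alphas

# Same greeting grammar as before, with names attached to two elements.
salute = Word(alphas + "'").setResultsName("salute")
greetee = Word(alphas).setResultsName("greetee")
greeting = salute + Literal(",") + greetee + Literal("!")

results = greeting.parseString("G'day, Mate!")

print(results.asList())    # list-style view of all tokens
print(results[0])          # positional access still works
print(results.salute)      # attribute-style access by results name
print(results["greetee"])  # dictionary-style access by results name
```

Access by name keeps working even if the grammar later gains optional elements that shift the positions of the tokens in the list.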
In the case of the phone number grammar, you can parse an input string containing a list of phone numbers, one after the next, as:

```python
phoneNumberList = OneOrMore( phoneNumber )
data = phoneNumberList.parseString( inputString )
```

This will return data as a pyparsing ParseResults object, containing a list of all of the input phone numbers.

Pyparsing includes some helper expressions, such as `delimitedList`,
so that if your input were a comma-separated list of phone numbers,
you could simply change phoneNumberList to:

```python
phoneNumberList = delimitedList( phoneNumber )
```

This will return the same list of phone numbers that you had before.
(`delimitedList` supports any custom string or expression as a delimiter,
but comma delimiters are the most common, and so they are the default.)

If, instead of having a string containing only phone numbers, you had a complete mailing list of names, addresses,
zip codes, and phone numbers, you could extract the phone numbers using scanString. scanString is a Python generator function,
so you must use it in a for loop, list comprehension, or generator expression.

```python
for data, dataStart, dataEnd in phoneNumber.scanString( mailingListText ):
    # do something with the phone number tokens,
    # returned in the 'data' variable
    pass
```

Lastly, if you had the same mailing list but wished to hide the numbers from, say, a potential telemarketer,
you could transform the string by attaching a parse action that just changes all phone numbers to the string (000)000-0000.
Replacing the input tokens with a fixed string is a common parse action,
so pyparsing provides a built-in function replaceWith to make this very simple:

```python
phoneNumber.setParseAction( replaceWith("(000)000-0000") )
sanitizedList = phoneNumber.transformString( originalMailingListText )
```

## When Good Input Goes Bad

Pyparsing will process input text until it runs out of matching text for its given parser elements.
If it finds an unexpected token or character and there is no matching parsing element,
then pyparsing will raise a ParseException. ParseExceptions print out a diagnostic message by default;
they also have attributes to help you locate the line number, column, text line, and annotated line of text.

If you provide the input string Hello, World? to your parser, you will receive the exception:

```
pyparsing.ParseException: Expected "!" (at char 12), (line:1, col:13)
```

At this point, you can choose to fix the input text or make the grammar more tolerant of other syntax
(in this case, supporting question marks as valid sentence terminators).

## A Complete Application

Consider an application where you need to process chemical formulas, such as NaCl, H2O, or C6H5OH. For this application,
the chemical formula grammar will be one or more element symbols, each followed by an optional integer.
In BNF-style notation, this is:

```
integer       :: '0'..'9'+
cap           :: 'A'..'Z'
lower         :: 'a'..'z'
elementSymbol :: cap lower*
elementRef    :: elementSymbol [ integer ]
formula       :: elementRef+
```

The pyparsing module handles these concepts with the classes Optional and OneOrMore.
314 | The definition of the elementSymbol will use the two-argument constructor Word: 315 | the first argument lists the set of valid leading characters, 316 | and the second argument gives the set of valid body characters. Using the pyparsing module, a simple version of the grammar is: 317 | 318 | ```python 319 | caps = "ABCDEFGHIJKLMNOPQRSTUVWXYZ" 320 | lowers = caps.lower() 321 | digits = "0123456789" 322 | 323 | element = Word( caps, lowers ) 324 | elementRef = element + Optional( Word( digits ) ) 325 | formula = OneOrMore( elementRef ) 326 | 327 | elements = formula.parseString( testString ) 328 | ``` 329 | 330 | So far, this program is an adequate tokenizer, processing the following formulas into their appropriate tokens. 331 | The default behavior for pyparsing is to return all of the parsed tokens within a single list of matching substrings: 332 | 333 | ``` 334 | H2O -> ['H', '2', 'O'] 335 | C6H5OH -> ['C', '6', 'H', '5', 'O', 'H'] 336 | NaCl -> ['Na', 'Cl'] 337 | ``` 338 | 339 | Of course, you want to do some processing with these returned results, beyond simply printing them out as a list. 340 | Assume that you want to compute the molecular weight for each given chemical formula. 341 | The program somewhere defines a dictionary of chemical symbols and their corresponding atomic weight: 342 | 343 | ``` 344 | atomicWeight = { 345 | "O" : 15.9994, 346 | "H" : 1.00794, 347 | "Na" : 22.9897, 348 | "Cl" : 35.4527, 349 | "C" : 12.0107, 350 | ... 351 | } 352 | ``` 353 | 354 | Next it would be good to establish a more logical grouping in the parsed chemical symbols and associated quantities, to return a structured set of results. Fortunately, the pyparsing module provides the Group class for just this purpose. 
By changing the elementRef declaration from: 355 | 356 | ```python 357 | elementRef = element + Optional( Word( digits ) ) 358 | ``` 359 | 360 | to: 361 | 362 | ```python 363 | elementRef = Group( element + Optional( Word( digits ) ) ) 364 | ``` 365 | 366 | you will now get the results grouped by chemical symbol: 367 | 368 | ``` 369 | H2O -> [['H', '2'], ['O']] 370 | C6H5OH -> [['C', '6'], ['H', '5'], ['O'], ['H']] 371 | NaCl -> [['Na'], ['Cl']] 372 | ``` 373 | 374 | The last simplification is to include a default value for the quantity part of elementRef, 375 | using the default argument for the constructor of the Optional class: 376 | 377 | ```python 378 | elementRef = Group( element + Optional( Word( digits ), 379 | default="1" ) ) 380 | ``` 381 | 382 | Now every elementRef will return a pair of values: the element's chemical symbol and the number of atoms of that element, 383 | with "1" implied if no quantity is given. Now the test formulas return a very clean list of ordered pairs of element symbols and their respective quantities: 384 | 385 | ``` 386 | H2O -> [['H', '2'], ['O', '1']] 387 | C6H5OH -> [['C', '6'], ['H', '5'], ['O', '1'], ['H', '1']] 388 | NaCl -> [['Na', '1'], ['Cl', '1']] 389 | ``` 390 | 391 | The final step is to compute the atomic weight for each. Add a single line of Python code after the call to parseString: 392 | 393 | ```python 394 | wt = sum( [ atomicWeight[elem] * int(qty) 395 | for elem,qty in elements ] ) 396 | ``` 397 | 398 | giving the results: 399 | 400 | ``` 401 | H2O -> [['H', '2'], ['O', '1']] (18.01528) 402 | C6H5OH -> [['C', '6'], ['H', '5'], ['O', '1'], ['H', '1']] 403 | (94.11124) 404 | NaCl -> [['Na', '1'], ['Cl', '1']] (58.4424) 405 | ``` 406 | 407 | Listing 2 contains the entire pyparsing program. 

**Listing 2**

```python
from pyparsing import Word, Optional, OneOrMore, Group, ParseException

atomicWeight = {
    "O"  : 15.9994,
    "H"  : 1.00794,
    "Na" : 22.9897,
    "Cl" : 35.4527,
    "C"  : 12.0107
    }

caps = "ABCDEFGHIJKLMNOPQRSTUVWXYZ"
lowers = caps.lower()
digits = "0123456789"

element = Word( caps, lowers )
elementRef = Group( element + Optional( Word( digits ), default="1" ) )
formula = OneOrMore( elementRef )

tests = [ "H2O", "C6H5OH", "NaCl" ]
for t in tests:
    try:
        results = formula.parseString( t )
        print t,"->", results,
    except ParseException, pe:
        print pe
    else:
        wt = sum( [atomicWeight[elem]*int(qty) for elem,qty in results] )
        print "(%.3f)" % wt
```

```
H2O -> [['H', '2'], ['O', '1']] (18.015)
C6H5OH -> [['C', '6'], ['H', '5'], ['O', '1'], ['H', '1']] (94.111)
NaCl -> [['Na', '1'], ['Cl', '1']] (58.442)
```

One of the nice by-products of using a parser is the inherent validation it performs on the input text.
Note that in the calculation of the wt variable, there was no need to test that the qty string was all numeric,
or to catch the ValueError that int() raises for an invalid argument. If qty were not all numeric, and therefore not a valid argument to int(),
it would not have passed the parser.

## An HTML Scraper

As a final example, consider the development of a simple HTML "scraper." It is not a comprehensive HTML parser,
as such a parser would require scores of parse expressions. Fortunately, it is not usually necessary to have a complete HTML grammar
definition to be able to extract significant pieces of data from most web pages,
especially those autogenerated by CGI or other application programs.

This example will extract data with a minimal parser, targeted to work with a specific web page--in this case,
the page kept by NIST listing publicly available network time protocol (NTP) servers.
This routine could be part of a larger NTP client application that, during its initialization,
would look up what NTP servers are currently available.

To begin developing an HTML scraper, you must first see what sort of HTML text you will need to process.
By visiting the web site and viewing the returned HTML source,
you can see that the page lists the names and IP addresses of the NTP servers in an HTML table:

```
Name              IP Address     Location
time-a.nist.gov   129.6.15.28    NIST, Gaithersburg, Maryland
time-b.nist.gov   129.6.15.29    NIST, Gaithersburg, Maryland
```

The underlying HTML source for this table uses `<table>`, `<tr>`, and `<td>` tags to structure the NTP server data:

```html
<table>
<tr>
    <td>Name</td>
    <td>IP Address</td>
    <td>Location</td>
</tr>
<tr>
    <td>time-a.nist.gov</td>
    <td>129.6.15.28</td>
    <td>NIST, Gaithersburg, Maryland</td>
</tr>
<tr>
    <td>time-b.nist.gov</td>
    <td>129.6.15.29</td>
    <td>NIST, Gaithersburg, Maryland</td>
</tr>
```

This table is part of a much larger body of HTML, but pyparsing allows you to define a parse expression that matches only
a subset of the total input text and to scan for text that matches the given parse expression.
So you need only define the minimum amount of grammar required to match the desired HTML source.

The program should extract the IP addresses and locations of those servers,
so you can focus your grammar on just those columns of the table.
Informally, you want to extract the values that match the pattern

```html
<td> IP address </td> <td> location name </td>
```

You do want to be a bit more specific than just matching on something as generic as
`<td> any text </td>` `<td>more any text </td>`,
because so general an expression would match the first two columns of the table instead of the second two
(as well as the first two columns of any table on the page!). Instead,
use the specific format of the IP address to help narrow your search pattern by
eliminating any false matches from other table data on the page.

To build up the elements of an IP address, start by defining an integer, then combining four integers with intervening periods:

```python
integer = Word("0123456789")
ipAddress = integer + "." + integer + "." + integer + "." + integer
```

You will also need to match the HTML tags `<td>` and `</td>`, so define parse elements for each:

```python
tdStart = Literal("<td>")
tdEnd   = Literal("</td>")
```

In general, `<td>` tags can also contain attribute specifiers for alignment, color, and so on.
However, this is not a general-purpose parser, only one written specifically for this web page, which fortunately does not use complicated
`<td>` tags. (The latest version of pyparsing includes a helper method for constructing HTML tags, which supports attribute specifiers in opening tags.)

Finally, you need some sort of expression to match the server's location description.
This is actually a rather freely formatted bit of text--there's no knowing whether it will include alphabetic data, commas, periods,
or numbers--so the simplest choice is to just accept everything up to the terminating `</td>` tag.
Pyparsing includes a class named SkipTo for this kind of grammar element.
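
To see what SkipTo contributes on its own, here is a small sketch in Python 3 syntax. The `cell` expression and the sample string are illustrative only and are not part of the article's listings.

```python
from pyparsing import Literal, SkipTo

tdStart = Literal("<td>")
tdEnd = Literal("</td>")

# SkipTo(tdEnd) consumes everything up to, but not including, the
# closing tag and returns the skipped text as a single token.
cell = tdStart + SkipTo(tdEnd) + tdEnd

tokens = cell.parseString("<td>NIST, Gaithersburg, Maryland</td>")
print(tokens.asList())
```

The commas and spaces inside the location need no grammar of their own; whatever lies before the closing tag comes back as one string.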

You now have all the pieces you need to define the time server text pattern:

```python
timeServer = tdStart + ipAddress + tdEnd + tdStart + SkipTo(tdEnd) + tdEnd
```

To extract the data, invoke `timeServer.scanString`,
which is a generator function that yields the matched tokens and the start and end string positions for each matching set of text.
This application uses only the matched tokens.

**Listing 3**

```python
from pyparsing import *
import urllib

# define basic text pattern for NTP server
integer = Word("0123456789")
ipAddress = integer + "." + integer + "." + integer + "." + integer
tdStart = Literal("<td>")
tdEnd = Literal("</td>")
timeServer = tdStart + ipAddress + tdEnd + tdStart + SkipTo(tdEnd) + tdEnd

# get list of time servers
nistTimeServerURL = "http://tf.nist.gov/service/time-servers.html"
serverListPage = urllib.urlopen( nistTimeServerURL )
serverListHTML = serverListPage.read()
serverListPage.close()

for srvrtokens,startloc,endloc in timeServer.scanString( serverListHTML ):
    print srvrtokens
```

Running the program in Listing 3 gives the token data:

```
['<td>', '129', '.', '6', '.', '15', '.', '28', '</td>', '<td>', 'NIST, Gaithersburg, Maryland', '</td>']
['<td>', '129', '.', '6', '.', '15', '.', '29', '</td>', '<td>', 'NIST, Gaithersburg, Maryland', '</td>']
['<td>', '132', '.', '163', '.', '4', '.', '101', '</td>', '<td>', 'NIST, Boulder, Colorado', '</td>']
['<td>', '132', '.', '163', '.', '4', '.', '102', '</td>', '<td>', 'NIST, Boulder, Colorado', '</td>']
:
```

Looking at these results, a couple of things immediately jump out.
One is that the parser records each IP address as a series of separate tokens,
one for each subfield and delimiting period. It would be nice if pyparsing were to do a bit of work during the parsing process
to combine these fields into a single-string token. Pyparsing's Combine class will do just this.
Modify the ipAddress definition to read:

```python
ipAddress = Combine( integer + "." + integer + "." + integer + "." + integer )
```

to get a single-string token returned for the IP address.
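
The difference Combine makes can be checked in isolation. The sketch below uses Python 3 syntax; the names `separate` and `combined` are introduced here only for the comparison.

```python
from pyparsing import Word, Combine, nums

integer = Word(nums)

# Without Combine: one token per subfield and per period.
separate = integer + "." + integer + "." + integer + "." + integer

# With Combine: the matched pieces are joined into a single string.
combined = Combine(separate)

print(separate.parseString("129.6.15.28").asList())
print(combined.parseString("129.6.15.28").asList())
```

By default Combine also expects the pieces to be adjacent in the input, which suits an IP address, where embedded whitespace would be an error.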

The second observation is that the results include the opening and closing HTML tags that mark the table columns.
While the presence of these tags is important during the parsing process,
the tags themselves are not interesting in the extracted data.
To have them suppressed from the returned token data, construct the tag literals with the suppress method.

```python
tdStart = Literal("<td>").suppress()
tdEnd   = Literal("</td>").suppress()
```

**Listing 4**

```python
from pyparsing import *
import urllib

# define basic text pattern for NTP server
integer = Word("0123456789")
ipAddress = Combine( integer + "." + integer + "." + integer + "." + integer )
tdStart = Literal("<td>").suppress()
tdEnd = Literal("</td>").suppress()
timeServer = tdStart + ipAddress + tdEnd + tdStart + SkipTo(tdEnd) + tdEnd

# get list of time servers
nistTimeServerURL = "http://tf.nist.gov/service/time-servers.html"
serverListPage = urllib.urlopen( nistTimeServerURL )
serverListHTML = serverListPage.read()
serverListPage.close()

for srvrtokens,startloc,endloc in timeServer.scanString( serverListHTML ):
    print srvrtokens
```

Now run the program in Listing 4. Your returned token data has substantially improved:

```
['129.6.15.28', 'NIST, Gaithersburg, Maryland']
['129.6.15.29', 'NIST, Gaithersburg, Maryland']
['132.163.4.101', 'NIST, Boulder, Colorado']
['132.163.4.102', 'NIST, Boulder, Colorado']
```

Finally, add result names to these tokens, so that you can access them by attribute name.
The easiest way to do this is in the definition of `timeServer`:

```python
timeServer = tdStart + ipAddress.setResultsName("ipAddress") + tdEnd + \
             tdStart + SkipTo(tdEnd).setResultsName("locn") + tdEnd
```

Now you can neaten up the body of the for loop and access these tokens just like members in a dictionary:

```python
servers = {}

for srvrtokens,startloc,endloc in timeServer.scanString( serverListHTML ):
    print "%(ipAddress)-15s : %(locn)s" % srvrtokens
    servers[srvrtokens.ipAddress] = srvrtokens.locn
```

Listing 5 contains the finished running program.

**Listing 5**

```python
from pyparsing import *
import urllib

# define basic text pattern for NTP server
integer = Word("0123456789")
ipAddress = Combine( integer + "." + integer + "." + integer + "." + integer )
tdStart = Literal("<td>").suppress()
tdEnd = Literal("</td>").suppress()
timeServer = tdStart + ipAddress.setResultsName("ipAddress") + tdEnd + \
             tdStart + SkipTo(tdEnd).setResultsName("locn") + tdEnd

# get list of time servers
nistTimeServerURL = "http://tf.nist.gov/service/time-servers.html"
serverListPage = urllib.urlopen( nistTimeServerURL )
serverListHTML = serverListPage.read()
serverListPage.close()

servers = {}
for srvrtokens,startloc,endloc in timeServer.scanString( serverListHTML ):
    print "%(ipAddress)-15s : %(locn)s" % srvrtokens
    servers[srvrtokens.ipAddress] = srvrtokens.locn

print servers
```

At this point, you've successfully extracted the NTP servers and their IP addresses and populated a program variable
so that your NTP-client application can make use of the parsed results.
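
The listings in this article use Python 2 syntax and download a live page. For experimenting offline, the same grammar can be run against an inline string. The following is a sketch in Python 3 syntax; `sampleHTML` is a hypothetical fragment modeled on the table shown earlier, not the actual NIST page.

```python
from pyparsing import Word, Literal, Combine, SkipTo, nums

# Hypothetical stand-in for the downloaded page.
sampleHTML = """
<table>
<tr><td>time-a.nist.gov</td><td>129.6.15.28</td><td>NIST, Gaithersburg, Maryland</td></tr>
<tr><td>time-b.nist.gov</td><td>129.6.15.29</td><td>NIST, Gaithersburg, Maryland</td></tr>
</table>
"""

# Same grammar as the finished program: IP address cell, then location cell.
integer = Word(nums)
ipAddress = Combine(integer + "." + integer + "." + integer + "." + integer)
tdStart = Literal("<td>").suppress()
tdEnd = Literal("</td>").suppress()
timeServer = (tdStart + ipAddress.setResultsName("ipAddress") + tdEnd
              + tdStart + SkipTo(tdEnd).setResultsName("locn") + tdEnd)

servers = {}
for srvrtokens, startloc, endloc in timeServer.scanString(sampleHTML):
    servers[srvrtokens.ipAddress] = srvrtokens.locn
    print("%-15s : %s" % (srvrtokens.ipAddress, srvrtokens.locn))
```

The server-name cells are skipped automatically, because the text inside them does not match the IP address pattern that the grammar requires in the first cell.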

## In Conclusion

Pyparsing provides a basic framework for creating recursive-descent parsers,
taking care of the overhead functions of scanning the input string,
handling expression mismatches, selecting the longest of matching alternatives,
invoking callback functions, and returning the parsed results.
This leaves developers free to focus on their grammar design and on the design and implementation of the corresponding token processing.
Pyparsing's nature as a combinator allows developers to
scale their applications from simple tokenizers up to complex grammar processors.
It is a great way to get started with your next parsing project!

[Download pyparsing from SourceForge](http://pyparsing.sourceforge.net/).

[Paul McGuire](http://www.onlamp.com/pub/au/2557) is a senior manufacturing systems consultant at Alan Weber & Associates.
In his spare time, he administers the pyparsing project on SourceForge.

--------------------------------------------------------------------------------
/Oreilly_Getting_Started_with_Pyparsing.pdf:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/yiyuezhuo/pyparsing-doc-zh/86bc996b1d782cc48f0e39f5b35a405de5d068da/Oreilly_Getting_Started_with_Pyparsing.pdf

--------------------------------------------------------------------------------
/image/icon.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/yiyuezhuo/pyparsing-doc-zh/86bc996b1d782cc48f0e39f5b35a405de5d068da/image/icon.png

--------------------------------------------------------------------------------
/readme.md:
--------------------------------------------------------------------------------
# Getting Started with Pyparsing

by Paul McGuire

Translated by yiyuezhuo

---

Copyright © 2008 O'Reilly Media, Inc.
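
ISBN: 9780596514235

Released: October 4, 2007

---

Contents

* [What Is Pyparsing?](#what)
* [Basic Form of a Pyparsing Program](#basic)
* ["Hello World"](#hello)
* [What Makes Pyparsing Different?](#diff)
* [Parsing Data from a Table: Using Parse Actions and ParseResults](#table)
* [Parsing Data from a Web Page](#page)
* [A Simple S-Expression Parser](#SS)
* [A Complete S-Expression Parser](#CS)
* [Parsing a Search String](#search)
* [A Search Engine in Under 100 Lines of Code](#engine)
* [Conclusion](#conclusion)
* [Index](#index)

> "I need to parse this log file..."
>
> "Just have to extract the data from this web page..."
>
> "We need a simple command-line interpreter..."
>
> "Our source code needs to be migrated to the new API set..."

Everyday requests like these make developers groan reflexively: "Oh no, not another parser to write!"

The task of parsing data that is not in a strictly regular format comes up often for developers. Sometimes it is a one-off job, such as an API-upgrade utility used internally.
Other times, the parsing routine is a built-in function of a command-driven application.

If you are programming in Python, you can handle many of these jobs with Python's built-in string methods, such as split(), index(), and startswith().

What makes the job unpleasant again is that we often need more than string splitting and indexing, because the text follows a more complex grammar. For instance:

```
y = 2 * x + 10
```

is easy to parse when every symbol is separated by spaces. Unfortunately, few users are this careful with their spacing, and arithmetic expressions are often written like this:

```
y = 2*x + 10
y = 2*x+10
y=2*x+10
```

Calling `str.split` directly on the last string just returns the original string (as the only item in a list), rather than separating out the individual elements `y`, `=`, `2`, and so on.

The usual tools for parsing tasks that go beyond str.split are regular expressions and lex/yacc. A regular expression uses a string to describe the text pattern to match.
That string uses special characters (such as `|`, `+`, `.`, `*`, and `?`) to denote parsing concepts such as alternation, repetition, and wildcards.
Lex/yacc tools first extract tokens and then apply processing code to the extracted tokens.
They use a separate token-definition file and then generate intermediate lexing files and processing-code templates for the programmer to extend with application-specific behavior.

>*Historical note*
>
>These text-processing techniques were first implemented in C in the 1970s, and they are still in wide use today. Python, with its batteries included, has the standard library module
>`re` for regular expression support. You can also download a number of free lex/yacc-style parser modules that provide an interface for Python.

The main problem with these traditional tools is that each has its own notation, which must be mapped precisely onto the Python code. Lex/yacc-style tools, for example, usually require a separate code-generation step.

In practice, parser writing tends to fall into a cycle: write code, parse sample text, find more special cases, and so on.
Combined with regular expression notation and extra code-generation steps, this cycle can become a constant source of frustration.

## What Is Pyparsing?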
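
Pyparsing is pure Python and easy to use. It provides a set of classes for building a parser up from individual expression elements.
Expressions are combined with intuitive operators: `+` appends one expression after another, while `|` and `^` express alternatives
(meaning "match first" and "match longest", respectively). Repetition is expressed with classes such as `OneOrMore`, `ZeroOrMore`, and `Optional`.

For example, a regular expression that handles an IP address followed by a US-style phone number looks like this:

```
(\d{1,3}(?:\.\d{1,3}){3})\s+(\(\d{3}\)\d{3}-\d{4})
```

By comparison, a similar expression written with pyparsing looks like this:

```python
ipField = Word(nums, max=3)
ipAddr = Combine( ipField + "." + ipField + "." + ipField + "." + ipField )
phoneNum = Combine( "(" + Word(nums, exact=3) + ")" + Word(nums, exact=3) + "-" + Word(nums, exact=4) )
userdata = ipAddr + phoneNum
```

Although it is longer, the pyparsing version is more readable, and it is easier to come back to and update later, for example to handle the phone number formats of other countries.

>New to Python?
>
>I have received many emails from people who told me that using pyparsing was also their first experience of programming in Python. They found pyparsing easy to learn,
>and the bundled examples easy to adapt to their own applications. If you are just getting started with Python, you may find the examples a little hard to read.
>Using pyparsing does not require any advanced Python knowledge. For the part it does require, there are online tutorial resources, such as those on the official Python site
>(www.python.org).
>
>To make the best use of pyparsing, you should be familiar with Python language features such as indented syntax, data types,
>and the `for item in itemSequence` loop syntax.
>Pyparsing uses object.attribute notation, as well as Python's built-in container classes: tuples, lists, and dictionaries.
>
>The examples in this book use Python lambda expressions, which are essentially one-line functions; lambdas are especially useful for defining simple parse actions.
>
>Familiarity with list comprehensions and generator expressions is helpful, since they can be used when post-processing the parsed tokens, but it is not required.

Pyparsing is:

* 100% pure Python. No dynamic link libraries (DLLs) or shared libraries are involved, so you can use it on any platform that runs Python 2.3.
* Driven by parse expressions written with standard Python class notation and operators. There is no separate code-generation step and no special symbols or notation, which makes your application easier to develop, understand, and maintain.
* Equipped with helper methods for common patterns:
    * C, C++, Java, Python, and HTML comments
    * Quoted strings (using single or double quotes, with `\'` and `\"` escapes)
    * HTML and XML tags (including upper/lowercase and attribute handling)
    * Comma-separated and delimited list expressions
* Lightweight to package. Pyparsing's code is contained in a single Python file, which is easy to drop into a site-packages directory or include directly with your application.
* Liberally licensed. The MIT license lets you use it freely in both non-commercial and commercial applications.

## Basic Form of a Pyparsing Program

A typical pyparsing program has the following structure:

* Import names from the pyparsing module
* Define the grammar using pyparsing classes and helper methods
* Use the grammar to parse the input text
* Process the results of parsing the text

### Importing Names from Pyparsing

In general, using `from pyparsing import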
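*` is discouraged by Python style experts, because it pollutes the local namespace
with an unknown number of names imported from the module. During pyparsing grammar development, however, it is hard to anticipate which of the names defined by pyparsing will be needed,
and this form of import simplifies early grammar development. Once the grammar is finished, you can go back to the conventional style of import, or import just the names you need with `from`.

### Defining the Grammar

The grammar is your definition of the text pattern that is applied to the input text to extract information. In pyparsing, the grammar consists of one or more Python statements, and patterns are combined using pyparsing classes and helper objects that specify the individual elements. Pyparsing lets you use operators such as `+`, `|`, and `^` to simplify the code. For example, suppose I use pyparsing's `Word` class to define a typical program variable name, made up of a leading alphabetic character followed by any number of alphanumeric or underscore characters. I would write this Python statement:

```python
identifier = Word(alphas,alphanums+'_')
```

I also want to parse constants, such as integers and floating-point numbers. A simplistic definition uses another Word object, made up of digits and possibly a decimal point:

```python
number= Word(nums+'.')
```

From here, I can define a simple assignment statement like this:

```python
assignmentExpr = identifier + "=" +(identifier|number)
```

Now we can parse content such as:

```
a = 10
a_2=100
pi=3.14159
goldenRatio = 1.61803
E =mc2
```

In this part of the program you can also attach any parse-time callbacks (called parse actions), or define names for parts of the grammar to make them easier to refer to later.
Parse actions are a very powerful feature of pyparsing, and we will cover them in detail later.

> Best practice: start with a BNF
>
> Before writing Python code to implement a grammar, it helps to write it down on paper first. Doing so:
> * Helps clarify your thoughts
> * Guides your parser design
> * Lets you walk through it by hand, as if you were executing your parser
> * Helps you know the limits of your design
>
> Fortunately, there is a simple notation for describing parsers,
> called Backus-Naur Form (BNF). You can find [good examples](http://en.wikipedia.org/wiki/backus-naur_form) of BNF here. You do not need to follow it rigorously, as long as it captures your ideas about the grammar.
>
> This book uses the following BNF constructs:
> * `::=` means "is defined as"
> * `+` means "one or more"
> * `*` means "zero or more"
> * Items enclosed in [] are optional
> * A succession of items means that the matching tokens must occur in sequence
> * `|` means that either of two items may be matched

### Using the Grammar to Parse the Input Text

In early versions of pyparsing, this step was limited to the `parseString` method, like this:

```python
assignmentTokens = assignmentExpr.parseString("pi=3.14159")
```

to retrieve the matched tokens.

More methods are available now. The full list is:

* `parseString` applies the grammar to the given input text (by definition, if the grammar could match the text several times, only the first match is returned).
* `scanString` is a generator function that, given the input text, yields the matched tokens together with the start and end locations of each match.
* `searchString` is a simple wrapper around scanString that returns all of the matches in the given text in a single list.
* `transformString` is another wrapper around scanString, which also applies replacements to the matched text.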
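
To make the difference between `parseString` and `searchString` concrete, here is a small sketch in Python 3 syntax using the assignment grammar defined above. The `source` string and the variable names `first` and `everything` are made up for illustration.

```python
from pyparsing import Word, alphas, alphanums, nums

# Assignment grammar from this section.
identifier = Word(alphas, alphanums + "_")
number = Word(nums + ".")
assignmentExpr = identifier + "=" + (identifier | number)

# Hypothetical input holding several assignments.
source = "a = 10; a_2=100; pi=3.14159"

# parseString matches once, at the start of the text, and then stops.
first = assignmentExpr.parseString(source)
print(first.asList())

# searchString collects every match found anywhere in the text.
everything = assignmentExpr.searchString(source)
print(everything.asList())
```

`scanString` would find the same three matches, but would yield them one at a time together with their start and end positions.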

For now, let's stay with `parseString`; I will show you the other choices in more detail later.

### Processing the Parsed Text

Of course, what you do with the results returned from parsing is the most important part. Most parsing tools simply return a list of the matched tokens for further interpretation.
Pyparsing returns a richer object, called ParseResults. In its simplest form, ParseResults can be printed and accessed just like a Python list.
For example, continuing our assignment expression example, the following code:

```python
assignmentTokens = assignmentExpr.parseString("pi=3.14159")
print assignmentTokens
```

prints

```
['pi', '=', '3.14159']
```

But ParseResults also supports access to individual fields in the parsed text, if the grammar assigns names to certain parts of the returned results.

Here we enhance the expression by giving names to its elements (the left-hand side is called lhs and the right-hand side rhs), so that we can access these fields in ParseResults
as if they were attributes of the returned object.

```python
assignmentExpr = identifier.setResultsName("lhs") + "=" + \
    (identifier | number).setResultsName("rhs")
assignmentTokens = assignmentExpr.parseString( "pi=3.14159" )
print assignmentTokens.rhs, "is assigned to", assignmentTokens.lhs
```

prints

```
3.14159 is assigned to pi
```

Now that the introduction is out of the way, let's move on to the details and look at some examples.

## Hello, World

Pyparsing comes with a number of examples, including a simple "Hello World" parser. This simple example is also used in the
[O'Reilly ONLamp.com](http://onlamp.com) article
[Building Recursive Descent Parsers with Python](Building_Recursive_Descent_Parsers_with_Python_en.md).
In this section, I use a similar example to introduce the simple pyparsing parsing tools.

The current "Hello, World!" parse pattern is limited to:

```
word, word !
```

This is too restrictive, so let's extend the grammar to handle more cases. Say it should be able to parse the following:

```
Hello, World!
Hi, Mom!
Good morning, Miss Crabtree!
Yo, Adrian!
Whattup, G?
How's it goin', Dude?
Hey, Jude!
Goodbye, Mr. Chips!
```

The first step in writing a parser for these strings is to identify the abstract pattern they share. As recommended earlier, let's express it in BNF.
In plain language, a greeting is made up of one or more words (the salutation), followed by a comma,
followed by one or more additional words (the subject of the greeting, or greetee), and ending with either an exclamation point or a question mark. In BNF:

```
greeting ::= salutation comma greetee endpunc
salutation ::= word+
comma ::= ,
greetee ::= word+
word ::= a collection of one or more characters, which are any alpha or ' or .
endpunc ::= ! | ?
```

This BNF translates almost directly into pyparsing, using the classes `Word`, `Literal`, and `OneOrMore` and the helper method `oneOf`.
(One difference is that BNF is traditionally written top-down, while a pyparsing grammar is written bottom-up,
because every Python name must be defined before it is used.)

```python
word = Word(alphas+"'.")
salutation = OneOrMore(word)
comma = Literal(",")
greetee = OneOrMore(word)
endpunc = oneOf("! ?")
greeting = salutation + comma + greetee + endpunc
```

`oneOf` makes the definition easier. Compare these two equivalent forms:

```python
endpunc = oneOf("! ?")
endpunc = Literal("!") | Literal("?")
```

`oneOf` also accepts a list of strings directly; a single string argument is first split on whitespace into such a list.

Running our parser on the sample strings gives these results:

```python
['Hello', ',', 'World', '!']
['Hi', ',', 'Mom', '!']
['Good', 'morning', ',', 'Miss', 'Crabtree', '!']
['Yo', ',', 'Adrian', '!']
['Whattup', ',', 'G', '?']
["How's", 'it', "goin'", ',', 'Dude', '?']
['Hey', ',', 'Jude', '!']
['Goodbye', ',', 'Mr.', 'Chips', '!']
```

Everything parses cleanly, but the results lack structure. With this parser, if we want to extract the left part of the sentence, the salutation,
we still have some work to do, iterating over the results until we reach the comma:

```python
for t in tests:
    results = greeting.parseString(t)
    salutation = []
    for token in results:
        if token == ",":
            break
        salutation.append(token)
    print(salutation)
```

Yuck! We might as well have written a character-by-character scanner in the first place. Fortunately, we can make the parser a bit smarter and avoid this drudgery.

Since we know that the salutation and the greetee are distinct logical parts, we can use pyparsing's `Group` class to give the returned results more structure.
We change the definitions of `salutation` and `greetee` to

```python
salutation = Group( OneOrMore(word) )
greetee = Group( OneOrMore(word) )
```

and our results look more organized:

```
[['Hello'], ',', ['World'], '!']
[['Hi'], ',', ['Mom'], '!']
[['Good', 'morning'], ',', ['Miss', 'Crabtree'], '!']
[['Yo'], ',', ['Adrian'], '!']
[['Whattup'], ',', ['G'], '?']
[["How's", 'it', "goin'"], ',', ['Dude'], '?']
[['Hey'], ',', ['Jude'], '!']
[['Goodbye'], ',', ['Mr.', 'Chips'], '!']
```

We can then use simple list unpacking to assign the different parts:

```python
for t in tests:
    salutation, dummy, greetee, endpunc = greeting.parseString(t)
    print(salutation, greetee, endpunc)
```

This prints:

```
['Hello'] ['World'] !
['Hi'] ['Mom'] !
['Good', 'morning'] ['Miss', 'Crabtree'] !
['Yo'] ['Adrian'] !
['Whattup'] ['G'] ?
["How's", 'it', "goin'"] ['Dude'] ?
['Hey'] ['Jude'] !
['Goodbye'] ['Mr.', 'Chips'] !
```

Note that we used the variable `dummy` to hold the parsed comma. The comma was important during parsing, since it separates the salutation from the greetee,
but we have no interest in it in the results, so it should disappear from them. You can suppress it by wrapping the comma definition in a `Suppress` object:

```python
comma = Suppress( Literal(",") )
```

There are several equivalent ways to write this statement:

```python
comma = Suppress( Literal(",") )
comma = Literal(",").suppress()
comma = Suppress(",")
```

With any of these forms, our parsed results become:

```
[['Hello'], ['World'], '!']
[['Hi'], ['Mom'], '!']
[['Good', 'morning'], ['Miss', 'Crabtree'], '!']
[['Yo'], ['Adrian'], '!']
[['Whattup'], ['G'], '?']
[["How's", 'it', "goin'"], ['Dude'], '?']
[['Hey'], ['Jude'], '!']
[['Goodbye'], ['Mr.', 'Chips'], '!']
```

So the result-handling code can drop the `dummy` variable and use just:

```python
for t in tests:
    salutation, greetee, endpunc = greeting.parseString(t)
```

Now that we have a decent parser and a good way to handle its results, let's start working with the test data. First, let's accumulate the salutations and greetees into lists of their own:

```python
salutes = []
greetees = []
for t in tests:
    salutation, greetee, endpunc = greeting.parseString(t)
    salutes.append( ( " ".join(salutation), endpunc) )
    greetees.append( " ".join(greetee) )
```

There are a few other small changes here:

* `" ".join(list)` converts the grouped token lists back into simple strings
* the end punctuation is saved along with each salutation, to tell a question such as "How are you?" from an exclamation such as "Hello!"

Now that we have collected some names and salutations, we can use them to generate new sentences:

```python
import random

for i in range(50):
    salute = random.choice( salutes )
    greetee = random.choice( greetees )
    print("%s, %s%s" % ( salute[0], greetee, salute[1] ))
```

Now we see all-new greetings:

```
Hello, Miss Crabtree!
How's it goin', G?
Yo, Mr. Chips!
Whattup, World?
Good morning, Mr. Chips!
Goodbye, Jude!
Good morning, Miss Crabtree!
Hello, G!
Hey, Dude!
How's it goin', World?
Good morning, Mom!
How's it goin', Adrian?
Yo, G!
Hey, Adrian!
Hi, Mom!
Hello, Mr. Chips!
Hey, G!
Whattup, Mr. Chips?
Whattup, Miss Crabtree?
...
```

We can also simulate some introductions with the following code:

```python
for i in range(50):
    print('%s, say "%s" to %s.' % ( random.choice( greetees ),
                                    "".join( random.choice( salutes ) ),
                                    random.choice( greetees ) ))
```

And it looks like this!

```
Jude, say "Good morning!" to Mom.
G, say "Yo!" to Miss Crabtree.
Jude, say "Goodbye!" to World.
Adrian, say "Whattup?" to World.
Mom, say "Hello!" to Dude.
Mr. Chips, say "Good morning!" to Miss Crabtree.
Miss Crabtree, say "Hi!" to Adrian.
Adrian, say "Hey!" to Mr. Chips.
Mr. Chips, say "How's it goin'?" to Mom.
G, say "Whattup?" to Mom.
Dude, say "Hello!" to World.
Miss Crabtree, say "Goodbye!" to Miss Crabtree.
Dude, say "Hi!" to Mr. Chips.
G, say "Yo!" to Mr. Chips.
World, say "Hey!" to Mr. Chips.
G, say "Hey!" to Adrian.
Adrian, say "Good morning!" to G.
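Adrian, say "Hello!" to Mom.
World, say "Good morning!" to Miss Crabtree.
Miss Crabtree, say "Yo!" to G.
...
```

That is our first look at the pyparsing module. Using just a few of its simplest classes and methods, we already have a good deal of expressive power.

## What makes pyparsing different?

Pyparsing was designed with some specific goals in mind: grammars must be easy to write and understand, and it must be easy to modify a parser to meet new requirements.
The aim is to simplify the parser design task as much as possible, so that pyparsing users can focus on parsing rather than wrestling with a parsing library or a meta-syntax. What follows is the Zen of pyparsing.

### The grammar specification should be a natural-looking, readable part of the Python program, in a form familiar to Python programmers

Pyparsing accomplishes this in the following ways:

* Using operators to combine parser elements. Python supports operator overloading, which lets us go beyond plain object-construction syntax
and makes our parser expressions easier to read.

For instance:
```python
streetAddress = And( [streetNumber, name, Or( [Literal("Rd."), Literal("St.")] ) ] )
```
can be written as
```python
streetAddress = streetNumber + name + ( Literal("Rd.") | Literal("St.") )
```
* Many attribute-setting methods in pyparsing return the calling object itself, so several calls can be chained. For example,
the following parser defines `integer`, gives it a name, and attaches a parse action that converts the matched string to a Python integer. Using this feature, instead of
```python
integer = Word(nums)
integer.Name = "integer"
integer.ParseAction = lambda t: int(t[0])
```
you can write
```python
integer = Word(nums).setName("integer").setParseAction(lambda t:int(t[0]))
```

### Class names are easier to read and understand than specialized symbols

This is probably the most obvious difference between pyparsing and regular expressions or similar tools. The IP address and phone number example in the introduction gives a hint of this.
Regular expressions become truly hard to read when their own control characters also appear in the text to be matched.
The result is a hodgepodge of backslash escapes. Here is a regular expression that tries to match a C-style function call with zero or more arguments, each of which is a word or an integer:

```
(\w+)\((((\d+|\w+)(,(\d+|\w+))*)?)\)
```

It is not obvious at a glance which parentheses are control characters and which are characters to be matched, and things get worse if the input text contains `\`, `.`, `*`, or `?`. The pyparsing version of the same expression is:

```python
Word(alphas) + "(" + Group( Optional((Word(nums)|Word(alphas)) + ZeroOrMore("," + (Word(nums)|Word(alphas)))) ) + ")"
```

This is certainly easier to read. Because the form `x + ZeroOrMore(","+x)` is so common, pyparsing provides a helper method, `delimitedList`, that is equivalent to this expression. Using `delimitedList`, our pyparsing version becomes even simpler:

```python
Word(alphas) + "(" + Group( Optional(delimitedList(Word(nums)|Word(alphas))) ) + ")"
```
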
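To see how the spelled-out form and the `delimitedList` form behave on real input, here is a small sketch. The names `arg`, `longForm`, and `shortForm` are introduced only for this illustration. Note the parentheses around the alternation: `+` binds more tightly than `|` in Python, so `"," + Word(nums)|Word(alphas)` would not group the way it reads.

```python
from pyparsing import (Word, Group, Optional, ZeroOrMore,
                       delimitedList, alphas, nums)

# one argument is either an integer or a word (illustrative name)
arg = Word(nums) | Word(alphas)

# spelled-out form: the comma literals are kept in the results
longForm = Word(alphas) + "(" + Group(Optional(arg + ZeroOrMore("," + arg))) + ")"

# delimitedList form: the delimiters are suppressed by default
shortForm = Word(alphas) + "(" + Group(Optional(delimitedList(arg))) + ")"

long_result = longForm.parseString("abc(1,2,def,5)").asList()
short_result = shortForm.parseString("abc(1, 2, def, 5)").asList()

print(long_result)   # ['abc', '(', ['1', ',', '2', ',', 'def', ',', '5'], ')']
print(short_result)  # ['abc', '(', ['1', '2', 'def', '5'], ')']
```

Both forms accept the same input; the one visible difference is that `delimitedList` leaves the commas out of the results, which is usually what you want.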
### Whitespace markers clutter and distract from the grammar definition

On top of the "special characters aren't really special" problem, a regular expression must state explicitly where whitespace may appear in the input text. In the C function example, the regular expression matches

```
abc(1,2,def,5)
```

but not

```
abc(1, 2, def, 5)
```

Unfortunately, it is not easy to say where optional whitespace may or may not appear, so `\s*` ends up sprinkled throughout the expression, further obscuring what we are really trying to match:

```
(\w+)\s*\(\s*(((\d+|\w+)(\s*,\s*(\d+|\w+))*)?)\s*\)
```

By contrast, pyparsing skips whitespace between parser elements by default, so the equivalent expression is still:

```python
Word(alphas) + "(" + Group( Optional(delimitedList(Word(nums)|Word(alphas))) ) + ")"
```

with no extra mention of whitespace.

The same idea applies to comments, which can appear anywhere in source code (not only at the end of a line). Imagine what the regular expression would look like if comments could appear between arguments the way whitespace can. In pyparsing, this code is all it takes:

```python
cFunction = Word(alphas) + "(" + Group( Optional(delimitedList(Word(nums)|Word(alphas))) ) + ")"
cFunction.ignore( cStyleComment )
```

### The results of parsing should be more than just a list

Pyparsing returns parse results as an instance of the `ParseResults` class. It can be handled like a list (using `[]`, `len`, `iter`, slicing, and so on),
but it can also represent nested results, dictionary-style access, and object-style access with dotted attribute names. The parse results for the C function example normally look like this:

```
['abc', '(', ['1', '2', 'def', '5'], ')']
```

You can see that the arguments were placed in a sublist, which makes further processing easier. And if the grammar assigns names to structures within the results,
we can access them by name instead of by index, which is less error-prone.

These higher-level access techniques are crucial for handling more complex grammars.

### Parse time is a good time for additional text processing

While parsing, the parser runs many checks on the input text: testing for particular strings, or matching patterns such as a string between two quotation marks.
Once a string has matched, post-parsing code can convert it right away, for instance into a Python integer or string object.

Pyparsing supports parse-time callbacks (called parse actions) that you can attach to individual expressions within the grammar.
The parser calls these functions whenever it matches the corresponding expression. For example, to extract a double-quoted string from text, a simple parse action can strip the quotation marks, like this:

```python
quotedString.setParseAction( lambda t: t[0][1:-1] )
```

That is all that is needed. There is no need to check whether the first and last characters are quotation marks; the function is called only when such a match has been found.

Parse actions can also perform additional validation, such as testing whether a matched word is in a given list, and raising a `ParseException` if it is not.
A parse action can also return a list or an object, for instance to compile the input text into a series of executable or callable objects. Parse actions are a powerful tool in pyparsing.

### Grammars must tolerate change and adapt gracefully

The most common death spiral for parsers is hard to avoid when you write one. A simple pattern-matching utility can grow more complex and unwieldy over time.
When input text turns up with data that does not match but should, the parser gets a patch to handle the new variation, or its language rules are changed.
After this has happened a few times, the patches start to interfere with earlier grammar definitions, and each further patch makes the problem harder.
When a change is needed several months after your last one, getting back up to speed takes longer than you expect, which adds to the difficulty.

Pyparsing does not cure this problem either, but its style of defining the grammar as separate, structured pieces eases it: individual elements are easy to find, easy to read,
and easy to modify and extend. Here is a quote sent to me by a pyparsing user: "I could just write a custom method,
but my past experience was that once I'd created a pyparsing grammar, it was automatically organized and easy to maintain and extend."

## Parsing data from a table: using parse actions and ParseResults

For our first example, let's look at a program that processes a file of American football game results.
Each line of the file records one game: the date, the two teams, and their scores.

```
09/04/2004 Virginia 44 Temple 14
09/04/2004 LSU 22 Oregon State 21
09/09/2004 Troy State 24 Missouri 14
01/02/2003 Florida State 103 University of Miami 2
```

The BNF for this data is simple and straightforward:

```
digit ::= '0'..'9'
alpha ::= 'A'..'Z' 'a'..'z'
date ::= digit+ '/' digit+ '/' digit+
schoolName ::= ( alpha+ )+
score ::= digit+
schoolAndScore ::= schoolName score
gameResult ::= date schoolAndScore schoolAndScore
```

We begin building our parser by converting the BNF definitions into pyparsing expressions. As with the extended "Hello, World!" program, we first define the basic building blocks
and then combine them into more complex expressions.

```python
# nums and alphas are already defined by pyparsing
num = Word(nums)
date = num + "/" + num + "/" + num
schoolName = OneOrMore( Word(alphas) )
```

Note that we can use the `+` operator to combine pyparsing expressions with string literals. By combining these simple elements into larger expressions, we can complete the grammar definition.

```python
score = Word(nums)
schoolAndScore = schoolName + score
gameResult = date + schoolAndScore + schoolAndScore
```

We use the `gameResult` object to parse each line of the input text:

```python
tests = """\
09/04/2004 Virginia 44 Temple 14
09/04/2004 LSU 22 Oregon State 21
09/09/2004 Troy State 24 Missouri 14
01/02/2003 Florida State 103 University of Miami 2""".splitlines()
for test in tests:
    stats = gameResult.parseString(test)
    print(stats.asList())
```

As we saw with the "Hello, World" parser, this grammar gives us an unstructured list of tokens.

```
['09', '/', '04', '/', '2004', 'Virginia', '44', 'Temple', '14']
['09', '/', '04', '/', '2004', 'LSU', '22', 'Oregon', 'State', '21']
['09', '/', '09', '/', '2004', 'Troy', 'State', '24', 'Missouri', '14']
['01', '/', '02', '/', '2003', 'Florida', 'State', '103', 'University', 'of',
'Miami', '2']
```

The first improvement is to combine the separate date tokens into a single MM/DD/YYYY string. We just wrap the expression in the `Combine` class.

```python
date = Combine( num + "/" + num + "/" + num )
```

The parsed results become

```
['09/04/2004', 'Virginia', '44', 'Temple', '14']
['09/04/2004', 'LSU', '22', 'Oregon', 'State', '21']
['09/09/2004', 'Troy', 'State', '24', 'Missouri', '14']
['01/02/2003', 'Florida', 'State', '103', 'University', 'of', 'Miami', '2']
```

`Combine` actually does two things for us. Besides merging the matched tokens into a single string, it also requires the tokens to be adjacent in the input text.

The next improvement is to combine the school names. Because the default behavior of `Combine` requires adjacent tokens, we will not use it here.
Instead, we define a routine that runs at parse time to join the tokens and return them as a single string. As mentioned earlier,
such routines are called parse actions, and they run functions during the parsing process.

For this example, we define a parse action that takes the parsed tokens, uses the string `join` method, and returns the joined string.
This parse action is simple enough to write as a Python lambda. It is attached to the expression by calling `setParseAction`, like this:

```python
schoolName.setParseAction( lambda tokens: " ".join(tokens) )
```

Another use of this technique is to perform additional semantic validation, beyond the syntactic matching of the expression. For example,
the `date` expression above accepts a string such as 03023/808098/29921 as a valid date, which is clearly not what we want.
A parse action can validate the input date by using `time.strptime` to parse the date string.

```python
time.strptime(tokens[0],"%m/%d/%Y")
```

If `strptime` fails, it raises a `ValueError`. Pyparsing uses its own exception class, `ParseException`,
to signal whether an expression matched. A parse action can raise this exception to indicate that, although the syntax matched, a higher-level validation failed.
Our parse action looks like this:

```python
def validateDateString(tokens):
    try:
        time.strptime(tokens[0], "%m/%d/%Y")
    except ValueError:
        raise ParseException("Invalid date string (%s)" % tokens[0])
date.setParseAction(validateDateString)
```

If we change the date in the first line of our input to 19/04/2004, we get an exception:

```
pyparsing.ParseException: Invalid date string (19/04/2004) (at char 0), (line:1, col:1)
```

Another way to improve the parsed results is pyparsing's `Group` class. `Group` does not change the tokens themselves; it nests them in a sublist.
`Group` is useful for giving structure to the parsed results.

```python
score = Word(nums)
schoolAndScore = Group( schoolName + score )
```

With grouping and combining, the parsed results now look well structured.

```
['09/04/2004', ['Virginia', '44'], ['Temple', '14']]
['09/04/2004', ['LSU', '22'], ['Oregon State', '21']]
['09/09/2004', ['Troy State', '24'], ['Missouri', '14']]
['01/02/2003', ['Florida State', '103'], ['University of Miami', '2']]
```

Finally, we add one more parse action to convert the numeric strings into actual integers.

This is a very common use of parse actions. It also shows that pyparsing can return structured data, not just a list of parsed strings.
This parse action is also simple enough to write as a lambda:

```python
score = Word(nums).setParseAction( lambda tokens : int(tokens[0]) )
```

Once again, we can write the parse action to do the conversion without any error handling for strings that cannot be converted to an integer. Because the tokens were matched by the expression `Word(nums)`, they are guaranteed to be valid when they reach the parse action.

Our results are starting to look like real object-style data records.

```
['09/04/2004', ['Virginia', 44], ['Temple', 14]]
['09/04/2004', ['LSU', 22], ['Oregon State', 21]]
['09/09/2004', ['Troy State', 24], ['Missouri', 14]]
['01/02/2003', ['Florida State', 103], ['University of Miami', 2]]
```

At this point the data is both structured and correctly typed, so we can do some real processing on it, such as listing the game results by date
and marking the winning team. The `ParseResults` object returned by `parseString` lets us index into the data with nested subscripts.

But if we use that approach, things quickly get ugly.

```python
for test in tests:
    stats = gameResult.parseString(test)
    if stats[1][1] != stats[2][1]:
        if stats[1][1] > stats[2][1]:
            result = "won by " + stats[1][0]
        else:
            result = "won by " + stats[2][0]
    else:
        result = "tied"
    print("%s %s(%d) %s(%d), %s" % (stats[0], stats[1][0], stats[1][1],
                                    stats[2][0], stats[2][1], result))
```

Indexing not only makes the code hard to follow, it also depends heavily on the order of the results. If our grammar included optional fields,
we would need other tests for those fields and would have to adjust the indexes accordingly. This makes for a very fragile parser.

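The fragility is easy to reproduce. In the sketch below, the optional `ranking` field (a `#` followed by digits) is a hypothetical addition that is not part of the football grammar; it is there only to show how an optional element shifts every index after it.

```python
from pyparsing import Word, Optional, Group, alphas, nums

score = Word(nums).setParseAction(lambda t: int(t[0]))

# hypothetical optional field: a ranking such as "#3" before the school name
ranking = Optional("#" + Word(nums))
team = Group(ranking + Word(alphas) + score)

unranked = team.parseString("Temple 14")[0].asList()
ranked = team.parseString("#3 Temple 14")[0].asList()

print(unranked)     # ['Temple', 14]
print(ranked)       # ['#', '3', 'Temple', 14]

# The same subscript now means two different things:
print(unranked[1])  # 14   (the score)
print(ranked[1])    # 3    (part of the ranking, and still a string)
```

Code that reads the score with `[1]` silently picks up the wrong token as soon as the optional field is present.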
We could use multiple variable assignment to reduce the indexing, as we did in "Hello, World!".

```python
for test in tests:
    stats = gameResult.parseString(test)
    gamedate,team1,team2 = stats # <- assign parsed bits to individual variable names
    if team1[1] != team2[1]:
        if team1[1] > team2[1]:
            result = "won by " + team1[0]
        else:
            result = "won by " + team2[0]
    else:
        result = "tied"
    print("%s %s(%d) %s(%d), %s" % (gamedate, team1[0], team1[1], team2[0], team2[1], result))
```

> *Best practice: use results names*
>
> But this still leaves us sensitive to the order of the parsed data.
> Instead, we can define names in the grammar. To do this, we insert calls to `setResultsName` into the grammar, so that the expressions label their tokens as they are parsed.

```python
schoolAndScore = Group(
    schoolName.setResultsName("school") + score.setResultsName("score") )
gameResult = date.setResultsName("date") + schoolAndScore.setResultsName("team1") + schoolAndScore.setResultsName("team2")
```

The code that processes the results is now more readable.

```python
    if stats.team1.score != stats.team2.score:
        if stats.team1.score > stats.team2.score:
            result = "won by " + stats.team1.school
        else:
            result = "won by " + stats.team2.school
    else:
        result = "tied"
    print("%s %s(%d) %s(%d), %s" % (stats.date, stats.team1.school, stats.team1.score,
                                    stats.team2.school, stats.team2.score, result))
```

This code is improved by referring to individual tokens by name rather than by index, which makes the processing code immune to changes in token order and to the presence or absence of optional fields.

Creating `ParseResults` with results names also lets you refer to tokens dictionary-style. For example, you can use the `ParseResults` object to supply data to a format string, which simplifies the output code:

```python
print("%(date)s %(team1)s %(team2)s" % stats)
```

It gives the following:

```
09/04/2004 ['Virginia', 44] ['Temple', 14]
09/04/2004 ['LSU', 22] ['Oregon State', 21]
09/09/2004 ['Troy State', 24] ['Missouri', 14]
01/02/2003 ['Florida State', 103] ['University of Miami', 2]
```

`ParseResults` also provides the `keys()`, `items()`, and `values()` methods, and supports key testing with Python's `in` keyword.

>*Coming attractions!*
>
>The latest version of pyparsing (1.4.7) includes a simpler notation for attaching results names to expressions. It shortens the code in this example to:
> ```python
>schoolAndScore = Group( schoolName("school") + score("score") )
>gameResult = date("date") + schoolAndScore("team1") + schoolAndScore("team2")
> ```
> Now there is no excuse for not naming your parsed results!

For debugging, you can call `dump()` to return a string showing the organized hierarchy of names and values. Here is the result of calling `stats.dump()` for the first line of input:

```
print(stats.dump())
['09/04/2004', ['Virginia', 44], ['Temple', 14]]
- date: 09/04/2004
- team1: ['Virginia', 44]
  - school: Virginia
  - score: 44
- team2: ['Temple', 14]
  - school: Temple
  - score: 14
```

Finally, you can generate XML that represents the same hierarchy. You need to specify a root element name as well:

```
print(stats.asXML("GAME"))

<GAME>
  <date>09/04/2004</date>
  <team1>
    <school>Virginia</school>
    <score>44</score>
  </team1>
  <team2>
    <school>Temple</school>
    <score>14</score>
  </team2>
</GAME>
```

There is one last issue to deal with: our parser always treats the input text as valid. It applies the grammar as far as it can,
then returns the matched results, even if much of the input text remains unparsed. For example, this statement:

```python
word = Word("A")
data = "AAA AA AAA BA AAA"
print(OneOrMore(word).parseString(data))
```

does not raise an exception, but simply returns:

```
['AAA', 'AA', 'AAA']
```

You might have intended to parse more of the AAA groups, with the B being just a mistake. Even if not, unparsed trailing text can be a warning sign.

To guard against this, use the `StringEnd` class, and add it to the end of your parser expression.

If the whole text is not parsed, a `ParseException` is then raised at the point where parsing stopped. Trailing whitespace does not matter;
pyparsing skips it automatically.

In our current application, adding `stringEnd` to the parsing expression protects us from accidentally matching

```
09/04/2004 LSU 2x2 Oregon State 21
```

as:

```
09/04/2004 ['LSU', 2] ['x', 2]
```

This would look like LSU had tied with college X. Instead, with `stringEnd` in place, a `ParseException` is raised that looks like this:

```
pyparsing.ParseException: Expected stringEnd (at char 44), (line:1, col:45)
```

Here is the complete parser code:

```python
from pyparsing import Word, Group, Combine, Suppress, OneOrMore, alphas, nums, alphanums, stringEnd, ParseException
import time
num = Word(nums)
date = Combine(num + "/" + num + "/" + num)
def validateDateString(tokens):
    try:
        time.strptime(tokens[0], "%m/%d/%Y")
    except ValueError:
        raise ParseException("Invalid date string (%s)" % tokens[0])
date.setParseAction(validateDateString)
schoolName = OneOrMore( Word(alphas) )
schoolName.setParseAction( lambda tokens: " ".join(tokens) )
score = Word(nums).setParseAction(lambda tokens: int(tokens[0]))
schoolAndScore = Group( schoolName.setResultsName("school") + score.setResultsName("score") )
gameResult = date.setResultsName("date") + schoolAndScore.setResultsName("team1") + schoolAndScore.setResultsName("team2")
tests = """\
09/04/2004 Virginia 44 Temple 14
09/04/2004 LSU 22 Oregon State 21
09/09/2004 Troy State 24 Missouri 14
01/02/2003 Florida State 103 University of Miami 2""".splitlines()
for test in tests:
    stats = (gameResult + stringEnd).parseString(test)
    if stats.team1.score != stats.team2.score:
        if stats.team1.score > stats.team2.score:
            result = "won by " + stats.team1.school
        else:
            result = "won by " + stats.team2.school
    else:
        result = "tied"
    print("%s %s(%d) %s(%d), %s" % (stats.date, stats.team1.school, stats.team1.score,
                                    stats.team2.school, stats.team2.score, result))
    # or print one of these alternative formats
    #print("%(date)s %(team1)s %(team2)s" % stats)
    #print(stats.asXML("GAME"))
```

## Extracting data from a web page

The Internet has become a vast, freely available data source, accessible from the browser window on your home computer. While some resources on the web
are encoded in a form that computer programs can use easily, most are HTML markup intended for a browser that presents them
to a human reader.

Sometimes you need to extract data in a specific format from a web page. If the data has not already been converted to comma-separated values or another easily processed format,
you will need to write a parser that works around the HTML tags surrounding the data you actually want.

It is common to see people attempt this with regular expressions. For example, someone trying to extract image reference tags from a web page
might try a pattern that matches one literal spelling of the `<img>` tag. Unfortunately, HTML and XML tags can contain many optional attributes, and browsers
are very forgiving of irregular HTML, so such pages can cause a scraper a lot of trouble. Here are some typical
pitfalls when handling HTML tags:

*Tags with extra whitespace or mixed upper- and lowercase*

The same `<img>` tag written in upper- or lowercase, or with extra spaces around its name, attributes, and `=` signs, looks different
but means the same thing.

*Tags with unexpected attributes*

The `IMG` tag often contains optional attributes, such as `align`, `alt`, `id`, `vspace`, `hspace`, `height`, `width`, and so on.

*Tag attributes in arbitrary order*

If the matching pattern is extended to handle the `src`, `align`, and `alt` attributes, for example for an image whose
alternate text is `The Great Sphinx`, those attributes can still appear in any order.

*Tag attributes may or may not be enclosed in quotes*

An attribute value can be written in double quotes, in single quotes, or with no quotes at all.

Pyparsing includes the helper method `makeHTMLTags` to simplify the definition of such standard tags. To use it, you call
`makeHTMLTags` with the name of the tag to parse, and it returns pyparsing expressions for the matching that follows. Of course,
`makeHTMLTags(X)` does much more than just `Literal("<X>")` and `Literal("</X>")`. It also handles:

* tags in upper- or lowercase
* "extra" whitespace
* any number of attributes, in any order
* attribute values in single quotes, double quotes, or no quotes
* tag and attribute names that include namespace references

But perhaps the most powerful feature of the expressions returned by `makeHTMLTags` is that the parsed results include the tag's HTML attributes
as named results, with the results names created dynamically during parsing.

Here is a short script that searches a web page for image references and prints a list of the images with any alternate text.

>*Note*
>
>The standard Python library includes the modules `HTMLParser` and `htmllib` for processing HTML source, but they cannot
>cope with irregular HTML. A popular third-party module for HTML processing is `BeautifulSoup`. Here is a
>`BeautifulSoup` rendition of the `<img>` tag extractor:
>```
>from bs4 import BeautifulSoup
>
>soup = BeautifulSoup(html, "html.parser")
>imgs = soup.find_all("img")
>for img in imgs:
>    print("'%s' : %s" % (img.get("alt"), img.get("src")))
>```
>`BeautifulSoup` processes the whole HTML page and produces a Pythonic hybrid of DOM and XPath structures,
>with access to the parsed HTML tags, attributes, and text fields.

```python
from pyparsing import makeHTMLTags
from urllib.request import urlopen

# read data from web page
url = "https://www.cia.gov/library/"\
      "publications/the-world-"\
      "factbook/docs/refmaps.html"
# decoding as UTF-8 is an assumption about the page's encoding
html = urlopen(url).read().decode("utf-8", errors="replace")
# define expression for <img> tag
imgTag,endImgTag = makeHTMLTags("img")

# search for matching tags, and
# print key attributes
for img in imgTag.searchString(html):
    print("'%(alt)s' : %(src)s" % img)
```

Instead of `parseString`, this script uses `searchString` to find matching text. For each match returned by `searchString`,
the script prints the values of the tag's `alt` and `src` attributes, accessed by name from the tokens matched by the `img` expression.

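The same technique can be tried without any network access. The sketch below runs `makeHTMLTags` over a small HTML fragment that was made up for this illustration; the file names and attribute values in it are hypothetical. It mixes letter case, quoting styles, spacing, and attribute order on purpose.

```python
from pyparsing import makeHTMLTags

# hypothetical HTML fragment, deliberately irregular
html = """
<IMG SRC='maps/africa.jpg' alt="Africa Map">
<img alt='Asia Map' align=left src = "maps/asia.jpg" />
"""

imgTag, endImgTag = makeHTMLTags("img")

# attribute names are lowercased and become results names
found = [(img.alt, img.src) for img in imgTag.searchString(html)]

for alt, src in found:
    print("'%s' : %s" % (alt, src))
# 'Africa Map' : maps/africa.jpg
# 'Asia Map' : maps/asia.jpg
```

Both tags match the same expression, and the attribute values are available by name regardless of how the tag was written.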
This script just lists the images of the maps on the initial page of the online CIA Factbook. The output contains information about each map image reference,
like this:

```
'Africa Map' : ../reference_maps/thumbnails/africa.jpg
'Antarctic Region Map' : ../reference_maps/thumbnails/antarctic.jpg
'Arctic Region Map' : ../reference_maps/thumbnails/arctic.jpg
'Asia Map' : ../reference_maps/thumbnails/asia.jpg
'Central America and Caribbean Map' : ../reference_maps/thumbnails/central_america.jpg
'Europe Map' : ../reference_maps/thumbnails/europe.jpg
...
```

The CIA Factbook web site also includes a more complicated page, which lists conversion factors for many common units of measure used around the world. Here are some
data rows from that table:

From unit | To unit | Factor |
----------|------|--------|
ares | square meters | 100
ares | square yards | 119.599
barrels, US beer | gallons | 31
barrels, US beer | liters | 117.347 77
barrels, US petroleum | gallons (British) | 34.97
barrels, US petroleum | gallons (US) | 42
barrels, US petroleum | liters | 158.987 29
barrels, US proof spirits | gallons | 40
barrels, US proof spirits | liters | 151.416 47
bushels (US) | bushels (British) | 0.968 9
bushels (US) | cubic feet | 1.244 456
bushels (US) | cubic inches | 2,150.42

The HTML source for this table looks like this:

```
<tr>
<td width="33%" valign="top" class="Normal">ares</td>
<td width="33%" valign="top" class="Normal">square meters</td>
<td width="33%" valign="top" class="Normal">100</td>
</tr>
<tr>
<td width="33%" valign="top" class="Normal">ares</td>
<td width="33%" valign="top" class="Normal">square yards</td>
<td width="33%" valign="top" class="Normal">119.599</td>
</tr>
...
```

Now that we have sample HTML as a template for parsing, we can write a simple BNF, which `makeHTMLTags` will help us implement.

```
entry ::= <tr> conversionLabel conversionLabel conversionValue </tr>
conversionLabel ::= <td> text </td>
conversionValue ::= <td> number </td>
```

Note that the conversion factors are formatted for easy reading (by humans, that is):

* the integer part is grouped in threes with commas
* the fractional part is grouped in threes with spaces

We can plan to undo this formatting in a parse action before calling `float()` to convert the text to a floating-point number. We will also
need to post-process the conversion labels, which may contain embedded `<br>` tags for explicit line breaks.

From a purely mechanical point of view, our script must begin by downloading the HTML from the URL of the page we want. I find Python's
`urllib` module sufficient for this task:

```python
from urllib.request import urlopen
url = "https://www.cia.gov/library/publications/" \
      "the-world-factbook/appendix/appendix-g.html"
# decoding as UTF-8 is an assumption about the page's encoding
html = urlopen(url).read().decode("utf-8", errors="replace")
```

At this point we have retrieved the HTML source of the web page into the Python variable `html`, as a single string.
We will use this string later to extract the information.

But we are getting a little ahead of ourselves: we need to set up our grammar first! Let's start with the numbers. Scanning through the web page,
we find numbers in forms like these:

```
200
0.032 808 40
1,728
0.028 316 846 592
3,785.411 784
```

Here is an expression that matches these numbers:

```python
decimalNumber = Word(nums, nums+",") + Optional("." + OneOrMore(Word(nums)))
```

Notice that we are using a new form of the `Word` constructor, with two arguments instead of one. In this form, `Word` uses
the first argument as the set of valid starting characters and the second argument as the set of valid body characters, that is, the characters allowed in the first
position of the match and the characters allowed after it. The expression above matches `1,000` but not `,456`. This
form of `Word` is useful for parsing what programming languages call identifiers. For instance, the following matches Python variable names:

```python
Word(alphas+"_", alphanums+"_")
```

Now `decimalNumber` is self-contained, so we can test it on its own before combining it into larger, more complicated expressions.

>*Best practice: incremental testing*
>
>Test individual grammar elements on their own to avoid surprises when merging them into the larger grammar.

Using the list of samples, we get these results:

```
['200']
['0', '.', '032', '808', '40']
['1,728']
['0', '.', '028', '316', '846', '592']
['3,785', '.', '411', '784']
```

To convert these into floating-point values, we need to:

* join the individual tokens together
* strip the commas from the integer part

Although both steps could be done in a single expression, I want to create two parse actions to show how parse actions can be chained.

The first parse action, `joinTokens`, can be written as a lambda:

```python
joinTokens = lambda tokens : "".join(tokens)
```

The next parse action is `stripCommas`. As the next link in the chain, `stripCommas` receives just a single string
(the output of `joinTokens`), so we only need to work with element 0 of `tokens`:

```python
stripCommas = lambda tokens : tokens[0].replace(",", "")
```

And of course, we need a final parse action to do the conversion to float:

```python
1227 | convertToFloat = lambda tokens : float(tokens[0]) 1228 | ``` 1229 | 1230 | 现在,将这些解析行为装备到一个表达式上,我们使用`setParseAction`与`addParseAction` 1231 | 方法做到这一点: 1232 | 1233 | ```python 1234 | decimalNumber.setParseAction( joinTokens ) 1235 | decimalNumber.addParseAction( stripCommas ) 1236 | decimalNumber.addParseAction( convertToFloat ) 1237 | ``` 1238 | 1239 | 后者,我们可以仅调用`setParseAction`,然后以逗号分隔参数列表形式将解析行为穿进去, 1240 | 这将定义出一个以相同方式工作的链: 1241 | 1242 | ```python 1243 | decimalNumber.setParseAction( joinTokens, stripCommas, convertToFloat ) 1244 | ``` 1245 | 1246 | 接下来,让我们做更充足的测试,通过使用`decimalNumber`来扫描整个HTML源代码 1247 | 1248 | ```python 1249 | tdStart,tdEnd = makeHTMLTags("td") 1250 | conversionValue = tdStart + decimalNumber + tdEnd 1251 | for tokens,start,end in conversionValue.scanString(html): 1252 | print tokens 1253 | ``` 1254 | 1255 | `scanString`是另一个解析方法,其对于测试语法脆弱性特别有用。`parseString`会对文本进行 1256 | 整体匹配。而`scanString`在输入文本中扫描,观察是否有文本中的部分与语法相匹配。而且 1257 | `scanString`是一个生成器函数,这意味着它返回的只是它那时候匹配到的标签,而不是先解析完整个 1258 | 文本。在示例代码中,你可以看到`scanString`返回的标记,以及它们的初始与结束位置对于每个匹配。 1259 | 1260 | 这是使用`scanString`测试`conversionValue`表达式的原始结果: 1261 | 1262 | ``` 1263 | ['td', ['width', '33%'], ['valign', 'top'], ['class', 'Normal'], False, 1264 | 40.468564223999998, ''] 1265 | ['td', ['width', '33%'], ['valign', 'top'], ['class', 'Normal'], False, 1266 | 0.40468564223999998, ''] 1267 | ['td', ['width', '33%'], ['valign', 'top'], ['class', 'Normal'], False, 43560.0, 1268 | ''] 1269 | ['td', ['width', '33%'], ['valign', 'top'], ['class', 'Normal'], False, 1270 | 0.0040468564224000001, ''] 1271 | ['td', ['width', '33%'], ['valign', 'top'], ['class', 'Normal'], False, 1272 | 4046.8564224000002, ''] 1273 | ['td', ['width', '33%'], ['valign', 'top'], ['class', 'Normal'], False, 1274 | 0.0015625000000000001, ''] 1275 | ['td', ['width', '33%'], ['valign', 'top'], ['class', 'Normal'], False, 4840.0, 1276 | ''] 1277 | ... 1278 | ``` 1279 | 1280 | 是的,解析出的东西包含
The `units` text always ends at the closing `</td>` tag. Knowing this, we can avoid defining a pattern for `units` directly and instead use the pyparsing helper class `SkipTo`. `SkipTo` collects all the intervening text, from the current parsing position up to the target expression, into a single string. With `SkipTo`, defining `units` is this simple:

```python
units = SkipTo( tdEnd )
```

We have not yet done any processing on the collected text, such as stripping the whitespace at either end, but at least we will not miss any valid unit.

We are now ready to complete the expression that extracts the unit conversion information, by adding the unit being converted from and the unit being converted to:

```python
conversion = trStart + fromUnit + toUnit + conversionValue + trEnd
```

Repeating the scanning process, we get the following:

```python
for tokens,start,end in conversion.scanString(html):
    print("%(fromUnit)s : %(toUnit)s : %(factor)f" % tokens)
```
```
acres : ares : 40.468564
acres : hectares : 0.404686
acres : square feet : 43560.000000
acres : square kilometers : 0.004047
acres : square meters : 4046.856422
...
```

This does not look bad, but farther down the list some formatting problems show up:

```
barrels, US petroleum : liters : 158.987290
barrels, US proof
spirits : gallons : 40.000000
barrels, US proof
spirits : liters : 151.416470
bushels (US) : bushels (British) : 0.968900
```

Farther down still, we see entries like these:

```
tons, net register : cubic feet of permanently enclosed space<br>
for cargo and passengers : 100.000000
tons, net register : cubic meters of permanently enclosed space<br>
for cargo and passengers : 2.831685
```

>*A note on web scraping*
>
>Data published on the web like this is usually provided under terms of service (TOS), which do not
>always permit HTML-processing scripts to collect it. Often these terms are meant to block access by
>anyone intending to copy the site and compete with it. In other cases, the terms are there to ensure
>that the site is used by human users, because the server cannot bear the extra load imposed by
>crawlers, and that the site owner is duly compensated for the site's content. Most TOS permit data
>extraction for personal use and reference. Always check a site's terms of service before helping
>yourself to its data.
>
>For this example, the content of the CIA Factbook website is in the public domain (see
>https://www.cia.gov/about-cia/site-policies/index.html#link1).

To clean up the units of measure, we need to remove the newlines and extra spaces, and strip out the embedded `<br>` tags. As you might guess, we will use a parse action to do this.

Our parse action has two tasks:

* Remove the `<br>` tags.
* Collapse whitespace and newlines.

The simplest way to collapse repeated whitespace in Python is to call the `str` method `split` and then `join`. To remove the `<br>` tags, we will just use `str.replace("<br>"," ")`. A single lambda expression that handles both tasks would be hard to read, so this time we create a real Python function and attach it to the `units` expression:

```python
def htmlCleanup(t):
    unitText = t[0]
    # Turn each <br> into a space so the words on either side stay separated.
    unitText = unitText.replace("<br>"," ")
    # split() with no arguments drops runs of spaces and newlines.
    unitText = " ".join(unitText.split())
    return unitText

units.setParseAction(htmlCleanup)
```

With these changes, our conversion-factor extractor can collect the unit conversion information, and we can load it into a Python dictionary or a local database so that our later programs can use it conveniently.

Here is the complete code for the conversion-factor extraction program:

```python
# Python 3 import; the original listing used Python 2's urllib.urlopen.
from urllib.request import urlopen
from pyparsing import *

url = "https://www.cia.gov/library/" \
      "publications/the-world-factbook/" \
      "appendix/appendix-g.html"
page = urlopen(url)
# scanString needs text rather than bytes; UTF-8 is assumed here.
html = page.read().decode("utf-8")
page.close()

tdStart,tdEnd = makeHTMLTags("td")
trStart,trEnd = makeHTMLTags("tr")
decimalNumber = Word(nums+",") + Optional("." + OneOrMore(Word(nums)))
joinTokens = lambda tokens : "".join(tokens)
stripCommas = lambda tokens: tokens[0].replace(",","")
convertToFloat = lambda tokens: float(tokens[0])
decimalNumber.setParseAction( joinTokens, stripCommas, convertToFloat )

conversionValue = tdStart + decimalNumber.setResultsName("factor") + tdEnd

units = SkipTo(tdEnd)
def htmlCleanup(t):
    unitText = t[0]
    # Same order as the earlier version: replace <br> with a space, then collapse.
    unitText = unitText.replace("<br>"," ")
    unitText = " ".join(unitText.split())
    return unitText

units.setParseAction(htmlCleanup)
fromUnit = tdStart + units.setResultsName("fromUnit") + tdEnd
toUnit = tdStart + units.setResultsName("toUnit") + tdEnd
conversion = trStart + fromUnit + toUnit + conversionValue + trEnd

for tokens,start,end in conversion.scanString(html):
    print("%(fromUnit)s : %(toUnit)s : %(factor)s" % tokens)
```

## A Simple S-Expression Parser
Translation TODO...

## A Complete S-Expression Parser
Translation TODO...

## Parsing a Search String
Translation TODO...

## A Search Engine in Under 100 Lines of Code
Omitted; translation TODO...

## Conclusion

When I presented an introduction to pyparsing at PyCon '06, I recommended it as an alternative to regular expressions, or as a way to tackle unusual problems. Afterward I was asked, "What can't pyparsing be used for?" I hesitated for a moment, and then pointed out that pyparsing is not the best tool for every situation: for highly structured data, the best approach is to go straight to `str.split()`. I also do not recommend using pyparsing to process XML, which has its own dedicated parsing libraries that can usually handle more than pyparsing can.

But I do think pyparsing is clearly a good tool for command-line programs, web scrapers, and text files (such as test files or analysis output files). Pyparsing has been embedded in many Python add-on modules; see the pyparsing [wiki](http://pyparsing.wikispaces.com/whosusingpyparsing) for the latest developments.

I have written some documentation for pyparsing, but I have probably spent more time developing code examples that demonstrate different pyparsing techniques. My experience is that many developers want sample source code that they can apply directly to the job at hand. Recently I have started receiving email asking for more formal documentation, so I hope this guide helps those who want to get results with pyparsing.

### Getting More Help

Here are some online resources available to pyparsing users, and their number keeps growing:

* The pyparsing wiki (http://pyparsing.wikispaces.com): The wiki is the first place to look for pyparsing news. It contains installation information, an FAQ, and a list of projects that use pyparsing. An Examples page gives various "how to" examples, including parsers for arithmetic expressions, chess notation, JSON data, and SQL statements (these are also included in the pyparsing distribution). Pyparsing updates, descriptions, and other events are posted on the News page. The discussion tab on the home page is a good place for users to exchange ideas and ask questions.

* The pyparsing mailing list (pyparsing-users@lists.sourceforge.net): Another commonly used resource for asking pyparsing questions. Past messages can be found on the pyparsing SourceForge project page, http://sourceforge.net/projects/pyparsing.

* comp.lang.python: This worldwide discussion group covers Python in general, but pyparsing-related threads do come up there. It is a good place to ask questions about Python usage, or about modules specialized for a particular application or field of study. If you Google "pyparsing" in this group, you will find many discussions.

## Index

Translation TODO...
--------------------------------------------------------------------------------
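As a quick check of the cleanup logic used by `htmlCleanup` in the conversion-factor example, the same string handling can be run without pyparsing or network access. This is only a sketch: the function name `clean_unit_text`, the sample rows, and the layout of the `conversions` dictionary are illustrative assumptions, not part of the original program.

```python
def clean_unit_text(unit_text):
    """Mirror the body of htmlCleanup: drop <br> tags, then collapse whitespace."""
    # Replace each <br> with a space so neighbouring words are not glued together.
    unit_text = unit_text.replace("<br>", " ")
    # split() with no arguments discards runs of spaces, tabs, and newlines.
    return " ".join(unit_text.split())

# Hypothetical raw rows, modeled on the untidy output in the example:
# (fromUnit text, toUnit text, factor)
raw_rows = [
    ("barrels, US proof\n    spirits", "gallons", 40.0),
    ("tons, net register",
     "cubic feet of permanently enclosed space<br>\n  for cargo and passengers",
     100.0),
]

# One possible dictionary layout (an assumption): key on the cleaned unit pair.
conversions = {}
for from_unit, to_unit, factor in raw_rows:
    key = (clean_unit_text(from_unit), clean_unit_text(to_unit))
    conversions[key] = factor

for (from_unit, to_unit), factor in conversions.items():
    print("%s : %s : %f" % (from_unit, to_unit, factor))
```

Replacing `<br>` with a space before collapsing whitespace keeps words apart even when the tag has no whitespace around it.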
Printing everything matched inside the `<td>` tags is distracting. We should give the `decimalNumber` expression a results name, and then print only the part that it matched:

```python
conversionValue = tdStart + decimalNumber.setResultsName("factor") + tdEnd
for tokens,start,end in conversionValue.scanString(html):
    print(tokens.factor)
```

Now our output is much cleaner:

```
40.468564224
0.40468564224
43560.0
0.0040468564224
4046.8564224
0.0015625
4840.0
100.0
...
```

We have built an expression that extracts the conversion factors themselves, but we still lack the information about which unit each factor converts from and which unit it converts to. To parse these, we will use expressions very similar to the one for the conversion factor:

```python
fromUnit = tdStart + units.setResultsName("fromUnit") + tdEnd
toUnit = tdStart + units.setResultsName("toUnit") + tdEnd
```

But how do we define `units` itself? Looking at the page, there does not seem to be any clearly recognizable pattern. We could try `OneOrMore(Word(alphas))`, but this fails to match entries such as "barrels, US petroleum" or "gallons (British)". And a strategy of adding special punctuation characters to cope with such failures will break as soon as we meet a case we have not seen before.

The one thing we do know about `units` is that it always ends at the closing `</td>` tag.
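To make the limitation of a letters-only pattern concrete without requiring pyparsing, here is a rough regular-expression stand-in for `OneOrMore(Word(alphas))`. The regex and the helper name `words_only_matches` are assumptions made for illustration, and the check requires the pattern to consume the whole cell text; the point is only that such a pattern cannot cover unit names containing commas or parentheses.

```python
import re

# Rough stand-in for OneOrMore(Word(alphas)): runs of letters separated by whitespace.
WORDS_ONLY = re.compile(r"[A-Za-z]+(?:\s+[A-Za-z]+)*")

def words_only_matches(cell_text):
    """Return True if the letters-only pattern consumes the entire cell text."""
    return WORDS_ONLY.fullmatch(cell_text) is not None

samples = ["square meters", "barrels, US petroleum", "gallons (British)"]
results = {text: words_only_matches(text) for text in samples}

for text, matched in results.items():
    print("%-25s %s" % (text, "matched" if matched else "NOT matched"))
```

Only the first sample is matched in full; the comma and the parentheses stop the other two, which is why the text turns to a rule based on where the cell ends instead of what it contains.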