├── .gitignore ├── ChangeLog ├── LICENSE ├── README.md ├── WikiExtractor.py ├── cirrus-extract.py ├── extractPage.py ├── setup.py ├── tests.py └── tox.ini /.gitignore: -------------------------------------------------------------------------------- 1 | ### https://raw.github.com/github/gitignore/c699a4f4684e9e294c9c550f820ca330f019b6f9/python.gitignore 2 | 3 | # Byte-compiled / optimized / DLL files 4 | __pycache__/ 5 | *.py[cod] 6 | *$py.class 7 | 8 | # C extensions 9 | *.so 10 | 11 | # Distribution / packaging 12 | .Python 13 | env/ 14 | build/ 15 | develop-eggs/ 16 | dist/ 17 | downloads/ 18 | eggs/ 19 | .eggs/ 20 | lib/ 21 | lib64/ 22 | parts/ 23 | sdist/ 24 | var/ 25 | *.egg-info/ 26 | .installed.cfg 27 | *.egg 28 | 29 | # PyInstaller 30 | # Usually these files are written by a python script from a template 31 | # before PyInstaller builds the exe, so as to inject date/other infos into it. 32 | *.manifest 33 | *.spec 34 | 35 | # Installer logs 36 | pip-log.txt 37 | pip-delete-this-directory.txt 38 | 39 | # Unit test / coverage reports 40 | htmlcov/ 41 | .tox/ 42 | .coverage 43 | .coverage.* 44 | .cache 45 | nosetests.xml 46 | coverage.xml 47 | *,cover 48 | .hypothesis/ 49 | 50 | # Translations 51 | *.mo 52 | *.pot 53 | 54 | # Django stuff: 55 | *.log 56 | local_settings.py 57 | 58 | # Flask instance folder 59 | instance/ 60 | 61 | # Scrapy stuff: 62 | .scrapy 63 | 64 | # Sphinx documentation 65 | docs/_build/ 66 | 67 | # PyBuilder 68 | target/ 69 | 70 | # IPython Notebook 71 | .ipynb_checkpoints 72 | 73 | # pyenv 74 | .python-version 75 | 76 | # celery beat schedule file 77 | celerybeat-schedule 78 | 79 | # dotenv 80 | .env 81 | 82 | # virtualenv 83 | venv/ 84 | ENV/ 85 | 86 | # Spyder project settings 87 | .spyderproject 88 | -------------------------------------------------------------------------------- /ChangeLog: -------------------------------------------------------------------------------- 1 | 2017-01-15 Giuseppe Attardi 2 | 3 | * WikiExtractor.py 
(process_dump): use text_type to decide whether 4 | to use decode('utf-8'). 5 | 6 | 2016-10-29 Giuseppe Attardi 7 | 8 | * setup.py: use scripts instead of console_scripts. 9 | 10 | * WikiExtractor.py (pages_from): handle self-closing tags. 11 | 12 | 2016-08-31 Giuseppe Attardi 13 | 14 | * WikiExtractor.py (findMatchingBraces): fixed ambiguous cases 15 | like {{{!}} {{!}}} 16 | 17 | 2016-08-30 Giuseppe Attardi 18 | 19 | * WikiExtractor.py (Extractor.wiki2text): drop templates before 20 | dropping tables. 21 | (Extractor.templateParams): fixed RE for separating var from value. 22 | 23 | (main): removed option xml_namespaces, since we extract only 24 | articles, i.e. those with namespace 0. 25 | 26 | (Extractor.clean): removed option --escapedoc, use --html instead. 27 | 28 | * WikiExtractor.py (Extractor.transform): Keep <nowiki> or else 29 | further expansion will be triggered. 30 | 31 | 2016-08-29 Giuseppe Attardi 32 | 33 | * WikiExtractor.py (callParserFunction): reworked order of 34 | evaluation of template parameters, according to 35 | https://www.mediawiki.org/wiki/Help:Templates#Order_of_evaluation 36 | Template parameters are fully evaluated before they are passed to the template. 37 | The branching parser functions are an exception, since evaluation 38 | depends on the values tested. 39 | 40 | * WikiExtractor.py (callParserFunction): reworked handling of 41 | parser functions. 42 | Handle expansion of: 43 | {{numero romano|1999}} 44 | where Template:Numero romano = {{{{{|safesubst:}}}#invoke:Numero romano|main}} 45 | and the parameter body has no arguments, by getting them from the frame. 46 | Instead, expansion of: 47 | {{Str endswith|1999|a.C.}} 48 | where Template:Str endswith = {{#ifeq:{{#invoke:String|sub|s={{{1|}}}| -{{#invoke:String|len|s={{{2}}}}} |ignore_errors=true}}|{{{2|}}}|sì}} 49 | uses parameters in the template. 50 | 51 | * WikiExtractor.py (Extractor.extract): Adds an h1 tag for the title if --html is 52 | passed. Patch contributed by okb1100 on GitHub.
53 | 54 | * WikiExtractor.py (modules): added partial emulation of modules String and Roman. 55 | 56 | 2016-08-26 Giuseppe Attardi 57 | 58 | * WikiExtractor.py (nowiki): handle <nowiki>...</nowiki>. 59 | 60 | 2016-08-22 Giuseppe Attardi 61 | 62 | * WikiExtractor.py (Extractor.extract): fixed UTF-8 encoding with 63 | option -a. 64 | 65 | 2016-08-19 Giuseppe Attardi 66 | 67 | * WikiExtractor.py (Extractor): fixed misspelled keepLists. 68 | 69 | 2016-08-11 Giuseppe Attardi 70 | 71 | * WikiExtractor.py (Extractor): made escape_doc a class variable. 72 | 73 | 2016-06-20 Seth Cleveland 74 | 75 | * WikiExtractor.py: Fix string encoding error for python 2 and log 76 | extract exceptions. 77 | 78 | 2016-06-19 Giuseppe Attardi 79 | 80 | * WikiExtractor.py: support for Python 3 by orangain@gmail.com 81 | 82 | 2016-03-23 Seth Cleveland 83 | 84 | * WikiExtractor.py: add filtering options -- min text length, xml 85 | namespace, and disambig pages. Added option to print the revision 86 | id with each document. 87 | 88 | 2016-03-23 Giuseppe Attardi 89 | 90 | * WikiExtractor.py (compact): properly emit section headers when Extractor.keepSections. 91 | 92 | 2016-03-19 Giuseppe Attardi 93 | 94 | * WikiExtractor.py (main): restored option --sections. 95 | 96 | 2016-03-12 Giuseppe Attardi 97 | 98 | * WikiExtractor.py (findBalanced): fix to bad single tuple 99 | parameter to findBalanced. 100 | 101 | 2016-03-06 Giuseppe Attardi 102 | 103 | * WikiExtractor.py (process_dump): handle templates with stdin. 104 | 105 | 2016-02-20 Giuseppe Attardi 106 | 107 | * WikiExtractor.py (reduce_process): tell mapper to wait when 108 | spool gets too long. 109 | 110 | 2016-02-19 Giuseppe Attardi 111 | 112 | * WikiExtractor.py (extract_process): use out.truncate() instead 113 | of out.close() 114 | 115 | 2016-02-15 Giuseppe Attardi 116 | 117 | * WikiExtractor.py (Extractor.clean): turned into method. 118 | (process_dump): control spool length to avoid filling up memory.
119 | 120 | 2016-02-14 Giuseppe Attardi 121 | 122 | * WikiExtractor.py (load_templates): save also in templates. 123 | (pages_from): common iterator for extracting pages from dump, used 124 | both for analyzing pages, templates and single articles. 125 | 126 | 2016-02-13 Giuseppe Attardi 127 | 128 | * WikiExtractor.py (reduce_process): close file here. 129 | 130 | * extractPage.py (process_data): allow range of ids. 131 | 132 | 2016-02-12 Giuseppe Attardi 133 | 134 | * WikiExtractor.py (reduce_process): moved here creation of OutputSplitter. 135 | (compact): Extractor.keepLists allows preserving lists in output. 136 | (main): added new option --lists for preserving lists in output. 137 | (compact): reset listLevel when entering a new section. 138 | (ignoredTags): removed 'div' from here, since it is in discardedTags. 139 | (ignoredTags): moved 'sub' and 'sup' to discardedTags. 140 | 141 | 2016-02-11 Giuseppe Attardi 142 | 143 | * WikiExtractor.py (if_empty): added implementation of Lua module 144 | for "Template:If empty". 145 | (compact): preserve lists also in text mode. 146 | (process_dump): log ID and title of pages. 147 | 148 | 2016-02-10 Giuseppe Attardi 149 | 150 | * WikiExtractor.py (process_dump): detect templates by namespace element 151 | 10 rather than by colon in title. 152 | (Extractor.templateParams): restore space in RE [^= ]: 153 | m = re.match(' *([^= ]*?) *=(.*)', param, re.DOTALL) 154 | 155 | 2016-02-04 Giuseppe Attardi 156 | 157 | * cirrus-extract.py: added. 158 | * README.md: added mention of Cirrus extract. 159 | 160 | 2015-11-20 Giuseppe Attardi 161 | 162 | * WikiExtractor.py (makeExternalLink): fixed. 163 | (main): dropped confusing option --section. 164 | 165 | 2015-10-29 Giuseppe Attardi 166 | 167 | * WikiExtractor.py: /usr/bin/env 168 | 169 | 2015-09-29 Giuseppe Attardi 170 | 171 | * WikiExtractor.py: fixed argparse default. 172 | (load_templates): save also modules, for a future release capable 173 | of applying them.
174 | 175 | 2015-09-14 Giuseppe Attardi 176 | 177 | * WikiExtractor.py (Extractor.templateParams): allow space in key 178 | (process_dump): terminate by joining processes rather than queues. 179 | 180 | * WikiExtractor.py (process_dump): eliminated ordering_queue in 181 | favor of termination signals to jobs_queue. 182 | 183 | 2015-09-13 Giuseppe Attardi 184 | 185 | * WikiExtractor.py (process_dump): queue.put() dropped second arg 186 | True, since it is the default. 187 | (process_dump): queue renamed to jobs_queue. 188 | (main): restored default bytes to 1M. 189 | Dropped confusing option to write to a single file: use -o - for that. 190 | 191 | 2015-08-30 Giuseppe Attardi 192 | 193 | * WikiExtractor.py (main): check presence of title element in 194 | single article. 195 | (load_templates): reconstruct the template namespace from the 196 | title of the first template in the saved templates. 197 | (define_template): match also #redirect as used in French. 198 | 199 | 2015-06-02 Giuseppe Attardi 200 | 201 | * WikiExtractor.py (Extractor.expandTemplate): extend frame before 202 | subst(), since there may be recursion in default parameter value, 203 | e.g. {{OTRS|celebrative|date=April 2015}} in article 21637542 in 204 | enwiki. 205 | 206 | 2015-05-06 Giuseppe Attardi 207 | 208 | * WikiExtractor.py (main): fixed arg.namespaces. 209 | (compact): use fillvalue=' ' in izip_longest. 210 | 211 | 2015-04-26 Giuseppe Attardi 212 | 213 | * WikiExtractor.py (clean): use re.U when matching \W or Chinese 214 | characters will be lost. 215 | 216 | 2015-04-25 Giuseppe Attardi 217 | 218 | * WikiExtractor.py (sharp_expr): use unicode() instead of str() 219 | or else Chinese article 596814 fails. 220 |
(splitParameters): use findMatchingBraces instead of findBalanced, 221 | to properly handle {{{4|{{{{{subst|}}}CURRENTYEAR}}}} 222 | (Extractor.extract): handle currentyear, currentmonth and currentday 223 | 224 | 2015-04-23 Giuseppe Attardi 225 | 226 | * WikiExtractor.py: make UTF-8 the default encoding 227 | 228 | 2015-04-22 Giuseppe Attardi 229 | 230 | * WikiExtractor.py (replaceInternalLinks): function for replacing 231 | internal links, modeled after MediaWiki original. 232 | (replaceExternalLinks): revised taking into account the former. 233 | (replaceInternalLinks): fix to nested iterator. 234 | 235 | 2015-04-20 Giuseppe Attardi 236 | 237 | * WikiExtractor.py (findBalanced): dropped optional arguments. 238 | (make_anchor_tag): don't use splitParameters() since we must 239 | consider also single [, which do not count in parameters. 240 | (sharp_switch): use split() to split at first = sign. 241 | (main): grouped command options. 242 | (main): removed option -B. 243 | (Template): class used to represent templates. It provides a method 244 | for parameter substitution. 245 | Templates are parsed on demand when used and stored in a cache. 246 | 247 | 2015-04-19 Giuseppe Attardi 248 | 249 | * WikiExtractor.py (clean): use dropNext to drop discardElements. 250 | (discardElements): added div. 251 | (compact): avoid duplicated header line with option --sections. 252 | (compact): skip empty list items. 253 | (main): changed logger format. 254 | (ignoredTags): added abbr. 255 | (clean): handle extension SyntaxHighlight. 256 | (main): use % parameters in logging. 257 | (main): added option --html, that subsumes --links and --sections 258 | for producing HTML output instead of plain text. 259 | (Extractor): moved here variables keepLinks, keepSections, toHTML. 260 | (compact): modified to produce HTML lists. 261 | 262 | 2015-04-18 Giuseppe Attardi 263 | 264 | * WikiExtractor.py (compact): strip lists of initial characters.
265 | 266 | 2015-04-17 Giuseppe Attardi 267 | 268 | * WikiExtractor.py (sharp_switch): fixed handling of default. 269 | (uc, lc, ucfirst, lcfirst): parser functions. 270 | 271 | 2015-04-16 Giuseppe Attardi 272 | 273 | * WikiExtractor.py (MagicWords): dealing with MagicWords. 274 | This required rewriting code as methods of class Extractor, since 275 | some MagicWords are document related and are being handled in 276 | different threads, hence they cannot be globals. 277 | (findMatchingBraces): fix to ambiguities resolution. 278 | (clean): fixed dealing with trail for make_anchor_tag() 279 | (substParameter): expand defaultValue only if used. 280 | (selfClosing_tag_patterns): avoid matching besides tag end. 281 | 282 | 2015-04-15 Giuseppe Attardi 283 | 284 | * WikiExtractor.py (expandTemplates): increase depth only when 285 | calling expandTemplate() 286 | (define_template): removed \n in onlyincludeAccumulator, drop 287 | always. 288 | (sharp_invoke): restored support for #invoke, by adding parameter 289 | frame to expandTemplate. 290 | (main): allow specifying G in --bytes. 291 | (make_anchor_tag): urlencode link. 292 | (wikiLink): properly match anchor. 293 | (make_anchor_tag): use splitParameters to separate parts of link. 294 | (splitParameters): include pair [], since arguments may contain 295 | wikilinks. 296 | 297 | 2015-04-14 Giuseppe Attardi 298 | 299 | * WikiExtractor.py (clean): dropped removal of preformatted lines, 300 | since it is hard to distinguish them, as templates may 301 | introduce lines with starting blanks. 302 | (discardElements): added 'small'. 303 | (ignoredTags): removed 'small'. 304 | (make_anchor_tag): fixed RE for wikiLink. 305 | (sharp_expr): added infix operators. 306 | (Infix): support for infix operators. 307 | (Extractor.extract): moved here logging of document being processed. 308 | (clean): rewritten handling of wikilinks since using RE is too slow. 309 | (maxTemplateRecursionLevels): increased to 30.
310 | 311 | 2015-04-13 Giuseppe Attardi 312 | 313 | * WikiExtractor.py (findMatchingBraces): rewritten to handle 314 | ambiguities. 315 | (substParameter): only evaluate name and default. 316 | (main): fixed option --article. 317 | 318 | 2015-04-12 Giuseppe Attardi 319 | 320 | * WikiExtractor.py (ExtractorThread): enabled multithread version. 321 | (findMatchingBraces): handle isolated braces. 322 | (expandTemplates): recurse on result from expandTemplate(). 323 | 324 | 2015-04-11 Giuseppe Attardi 325 | 326 | * WikiExtractor.py (sharp_switch): deal properly with #default. 327 | (OutputSplitter): update to new-style classes. 328 | 329 | * WikiExtractor.py (selfClosingTags): added nowiki. 330 | 331 | * WikiExtractor.py (bold_italic, bold): allow single quote inside, 332 | e.g. '''[[Chinese New Year|New Year's Eve]]'''. 333 | 334 | * WikiExtractor.py (templateParams): fix pattern to match 335 | parameter name. 336 | 337 | * WikiExtractor.py (substParameter): use splitParameters() 338 | 339 | * WikiExtractor.py (main): added --no-templates option. 340 | 341 | * WikiExtractor.py (substParameter): added parameter param_depth 342 | to control depth of parameter expansion, separately from depth, 343 | used for template expansion. 344 | 345 | 2015-04-10 Giuseppe Attardi 346 | 347 | * WikiExtractor.py (callParserFunction): return '' also in case of 348 | failure. 349 | 350 | 2015-04-09 Giuseppe Attardi 351 | 352 | * WikiExtractor.py (expandTemplates): replaced frame parameter with 353 | depth, used to limit max template recursion. 354 | 355 | 2015-04-07 Giuseppe Attardi 356 | 357 | * WikiExtractor.py (main): added --debug option. 358 | 359 | 2015-01-24 Giuseppe Attardi 360 | 361 | * WikiExtractor.py (splitParameters): rewritten template 362 | processing by performing proper parsing of all balanced 363 | expressions in template invocation and expansion, using iterator 364 | findBalanced().
365 | 366 | 2015-01-18 Giuseppe Attardi 367 | 368 | * WikiExtractor.py (expandTemplates): template expansion now working. 369 | 370 | 2015-01-11 Giuseppe Attardi 371 | 372 | * WikiExtractor.py (externalLink): replaced .* with appropriate 373 | [^x]* here and elsewhere. 374 | 375 | 2015-01-10 Giuseppe Attardi 376 | 377 | * WikiExtractor.py (main): added option --article for processing a 378 | single article. 379 | (main): get dump from file rather than from stdin, so that 380 | preprocessing does not need to save data to a temp file. 381 | 382 | 2014-02-25 Giuseppe Attardi 383 | 384 | * WikiExtractor.py (ignoreTag): make / optional. 385 | 386 | 2013-12-15 Giuseppe Attardi 387 | 388 | * WikiExtractor.py (clean): added template expansion 389 | 390 | 2013-10-14 Giuseppe Attardi 391 | 392 | * WikiExtractor.py: added wiktionary and wikt to the namespaces 393 | (used e.g. in http://en.wikipedia.org/wiki?curid=12) 394 | 395 | 2013-05-09 Giuseppe Attardi 396 | 397 | * WikiExtractor.py (main): handle properly keepLinks option. 398 | 399 | 2013-04-05 Giuseppe Attardi 400 | 401 | * WikiExtractor.py (compact): keep lines ending with ':'. 402 | 403 | 2013-04-02 Giuseppe Attardi 404 | 405 | * WikiExtractor.py: obtain prefix from dump. 406 | 407 | 2013-01-27 Giuseppe Attardi 408 | 409 | * WikiExtractor.py (WikiDocument): add newline after </doc>. 410 | Release version 2.3. 411 | 412 | 2012-12-30 Giuseppe Attardi 413 | 414 | * WikiExtractor.py (process_data): added patch by Humberto 415 | Pereira, who claims a 10x improvement in speed. 416 | (main): added option to set acceptedNamespaces 417 | 418 | 2012-11-01 Giuseppe Attardi 419 | 420 | * WikiExtractor.py (get_url): create URL from Id instead of from title. 421 | 422 | 2012-06-28 Giuseppe Attardi 423 | 424 | * WikiExtractor.py (OutputSplitter.reserve): added method to 425 | invoke before writing. 426 | (WikiDocument): use reserve() before writing whole page. 427 | (main): added version number and option -v.
428 | 429 | 2012-05-17 Giuseppe Attardi 430 | 431 | * WikiExtractor.py (main): added option to preserve sections as 432 | HTML headers and lists as HTML list items. 433 | 434 | 2012-05-08 Giuseppe Attardi 435 | 436 | * WikiExtractor.py: Released version 2.0. 437 | 438 | * test/test.xml: added sample to test hard cases for extractor. 439 | 440 | * WikiExtractor.py (dropNested): Completely rewritten to be more 441 | compliant with the MediaWiki markup language. 442 | Use proper parsing functions to handle nested structures. 443 | Improved performance by reducing creation of lists and strings. 444 | Use htmlentitydefs instead of hand crafted list. 445 | Added parameter -b to set URL for site. 446 | Extensive use of regular expressions instead of specific string tests. 447 | Deal with preformatted text. 448 | Added parameter acceptedNamespaces to select namespaces to retain 449 | in page titles or wiki links. 450 | TODO: 451 | 1. handle Template expansion. See WikiPrep 452 | (http://www.cs.technion.ac.il/~gabr/resources/code/wikiprep/) 453 | 2. Use full parser in order to better deal with nested and 454 | unbalanced expressions. 455 | 456 | 2012-02-15 Stefano Dei Rossi 457 | 458 | * WikiExtractor.py (WikiExtractor): &nbsp; replaced with a simple space 459 | instead of u'\u00A0'. 460 | 461 | 2012-02-10 Giuseppe Attardi 462 | 463 | * WikiExtractor.py: added Copyright. 464 | 465 | 2011-11-03 Antonio Fuschetto 466 | 467 | * WikiExtractor.py: updated version to 1.6 (Oct 17). 468 | 469 | 2011-10-17 Giuseppe Attardi 470 | 471 | * WikiExtractor.py: turned prefix into a parameter. 472 | 473 | 2011-07-29 Antonio Fuschetto 474 | 475 | * WikiExtractor.py (init): fixed bugs in apostrophe_bold_pattern and 476 | apostrophe_italic_pattern. 477 | 478 | 2011-07-28 Antonio Fuschetto 479 | 480 | * WikiExtractor.py (__garbage_namespaces): added "file" namespace to 481 | remove list. 482 | 483 | 2011-07-10 Antonio Fuschetto 484 | 485 | * WikiExtractor.py (get_wiki_document_url): changed the handling of 486 | URL prefix (anchors don't use prefix but relative URLs).
487 | 488 | 2011-06-26 Antonio Fuschetto 489 | 490 | * WikiExtractor.py (extract_document): changed the handling of 491 | wikilinks, adding an anchor tag for each link with a reference to the 492 | Wikipedia document. 493 | 494 | * WikiExtractor.py (WikiExtractor): changed the handling of 495 | placeholders: from "[Formula 12]" to "formula_12". 496 | 497 | 2011-04-06 Antonio Fuschetto 498 | 499 | * WikiExtractor.py (init): fixed bugs in apostrophe_bold_pattern and 500 | apostrophe_italic_pattern. 501 | 502 | * WikiExtractor.py (compact): drop lines ending with ':' 503 | (these are sentences preceding list items); fixed some bugs. 504 | 505 | * WikiExtractor.py: released version 1.1. 506 | 507 | 2011-03-12 Antonio Fuschetto 508 | 509 | * wiki-extractor.py (main): removed the sentence splitting option. 510 | 511 | * wiki-extractor.py: fixed some bugs; released version 1.0; changed 512 | filename to "Wiki-Extractor.py" according to Tanl module names. 513 | 514 | 2011-03-01 Antonio Fuschetto 515 | 516 | * wiki-extractor.py (main): added cross platform path management. 517 | 518 | -------------------------------------------------------------------------------- /LICENSE: -------------------------------------------------------------------------------- 1 | GNU GENERAL PUBLIC LICENSE 2 | Version 3, 29 June 2007 3 | 4 | Copyright (C) 2007 Free Software Foundation, Inc. 5 | Everyone is permitted to copy and distribute verbatim copies 6 | of this license document, but changing it is not allowed. 7 | 8 | Preamble 9 | 10 | The GNU General Public License is a free, copyleft license for 11 | software and other kinds of works. 12 | 13 | The licenses for most software and other practical works are designed 14 | to take away your freedom to share and change the works. By contrast, 15 | the GNU General Public License is intended to guarantee your freedom to 16 | share and change all versions of a program--to make sure it remains free 17 | software for all its users. 
We, the Free Software Foundation, use the 18 | GNU General Public License for most of our software; it applies also to 19 | any other work released this way by its authors. You can apply it to 20 | your programs, too. 21 | 22 | When we speak of free software, we are referring to freedom, not 23 | price. Our General Public Licenses are designed to make sure that you 24 | have the freedom to distribute copies of free software (and charge for 25 | them if you wish), that you receive source code or can get it if you 26 | want it, that you can change the software or use pieces of it in new 27 | free programs, and that you know you can do these things. 28 | 29 | To protect your rights, we need to prevent others from denying you 30 | these rights or asking you to surrender the rights. Therefore, you have 31 | certain responsibilities if you distribute copies of the software, or if 32 | you modify it: responsibilities to respect the freedom of others. 33 | 34 | For example, if you distribute copies of such a program, whether 35 | gratis or for a fee, you must pass on to the recipients the same 36 | freedoms that you received. You must make sure that they, too, receive 37 | or can get the source code. And you must show them these terms so they 38 | know their rights. 39 | 40 | Developers that use the GNU GPL protect your rights with two steps: 41 | (1) assert copyright on the software, and (2) offer you this License 42 | giving you legal permission to copy, distribute and/or modify it. 43 | 44 | For the developers' and authors' protection, the GPL clearly explains 45 | that there is no warranty for this free software. For both users' and 46 | authors' sake, the GPL requires that modified versions be marked as 47 | changed, so that their problems will not be attributed erroneously to 48 | authors of previous versions. 
49 | 50 | Some devices are designed to deny users access to install or run 51 | modified versions of the software inside them, although the manufacturer 52 | can do so. This is fundamentally incompatible with the aim of 53 | protecting users' freedom to change the software. The systematic 54 | pattern of such abuse occurs in the area of products for individuals to 55 | use, which is precisely where it is most unacceptable. Therefore, we 56 | have designed this version of the GPL to prohibit the practice for those 57 | products. If such problems arise substantially in other domains, we 58 | stand ready to extend this provision to those domains in future versions 59 | of the GPL, as needed to protect the freedom of users. 60 | 61 | Finally, every program is threatened constantly by software patents. 62 | States should not allow patents to restrict development and use of 63 | software on general-purpose computers, but in those that do, we wish to 64 | avoid the special danger that patents applied to a free program could 65 | make it effectively proprietary. To prevent this, the GPL assures that 66 | patents cannot be used to render the program non-free. 67 | 68 | The precise terms and conditions for copying, distribution and 69 | modification follow. 70 | 71 | TERMS AND CONDITIONS 72 | 73 | 0. Definitions. 74 | 75 | "This License" refers to version 3 of the GNU General Public License. 76 | 77 | "Copyright" also means copyright-like laws that apply to other kinds of 78 | works, such as semiconductor masks. 79 | 80 | "The Program" refers to any copyrightable work licensed under this 81 | License. Each licensee is addressed as "you". "Licensees" and 82 | "recipients" may be individuals or organizations. 83 | 84 | To "modify" a work means to copy from or adapt all or part of the work 85 | in a fashion requiring copyright permission, other than the making of an 86 | exact copy. 
The resulting work is called a "modified version" of the 87 | earlier work or a work "based on" the earlier work. 88 | 89 | A "covered work" means either the unmodified Program or a work based 90 | on the Program. 91 | 92 | To "propagate" a work means to do anything with it that, without 93 | permission, would make you directly or secondarily liable for 94 | infringement under applicable copyright law, except executing it on a 95 | computer or modifying a private copy. Propagation includes copying, 96 | distribution (with or without modification), making available to the 97 | public, and in some countries other activities as well. 98 | 99 | To "convey" a work means any kind of propagation that enables other 100 | parties to make or receive copies. Mere interaction with a user through 101 | a computer network, with no transfer of a copy, is not conveying. 102 | 103 | An interactive user interface displays "Appropriate Legal Notices" 104 | to the extent that it includes a convenient and prominently visible 105 | feature that (1) displays an appropriate copyright notice, and (2) 106 | tells the user that there is no warranty for the work (except to the 107 | extent that warranties are provided), that licensees may convey the 108 | work under this License, and how to view a copy of this License. If 109 | the interface presents a list of user commands or options, such as a 110 | menu, a prominent item in the list meets this criterion. 111 | 112 | 1. Source Code. 113 | 114 | The "source code" for a work means the preferred form of the work 115 | for making modifications to it. "Object code" means any non-source 116 | form of a work. 117 | 118 | A "Standard Interface" means an interface that either is an official 119 | standard defined by a recognized standards body, or, in the case of 120 | interfaces specified for a particular programming language, one that 121 | is widely used among developers working in that language. 
122 | 123 | The "System Libraries" of an executable work include anything, other 124 | than the work as a whole, that (a) is included in the normal form of 125 | packaging a Major Component, but which is not part of that Major 126 | Component, and (b) serves only to enable use of the work with that 127 | Major Component, or to implement a Standard Interface for which an 128 | implementation is available to the public in source code form. A 129 | "Major Component", in this context, means a major essential component 130 | (kernel, window system, and so on) of the specific operating system 131 | (if any) on which the executable work runs, or a compiler used to 132 | produce the work, or an object code interpreter used to run it. 133 | 134 | The "Corresponding Source" for a work in object code form means all 135 | the source code needed to generate, install, and (for an executable 136 | work) run the object code and to modify the work, including scripts to 137 | control those activities. However, it does not include the work's 138 | System Libraries, or general-purpose tools or generally available free 139 | programs which are used unmodified in performing those activities but 140 | which are not part of the work. For example, Corresponding Source 141 | includes interface definition files associated with source files for 142 | the work, and the source code for shared libraries and dynamically 143 | linked subprograms that the work is specifically designed to require, 144 | such as by intimate data communication or control flow between those 145 | subprograms and other parts of the work. 146 | 147 | The Corresponding Source need not include anything that users 148 | can regenerate automatically from other parts of the Corresponding 149 | Source. 150 | 151 | The Corresponding Source for a work in source code form is that 152 | same work. 153 | 154 | 2. Basic Permissions. 
155 | 156 | All rights granted under this License are granted for the term of 157 | copyright on the Program, and are irrevocable provided the stated 158 | conditions are met. This License explicitly affirms your unlimited 159 | permission to run the unmodified Program. The output from running a 160 | covered work is covered by this License only if the output, given its 161 | content, constitutes a covered work. This License acknowledges your 162 | rights of fair use or other equivalent, as provided by copyright law. 163 | 164 | You may make, run and propagate covered works that you do not 165 | convey, without conditions so long as your license otherwise remains 166 | in force. You may convey covered works to others for the sole purpose 167 | of having them make modifications exclusively for you, or provide you 168 | with facilities for running those works, provided that you comply with 169 | the terms of this License in conveying all material for which you do 170 | not control copyright. Those thus making or running the covered works 171 | for you must do so exclusively on your behalf, under your direction 172 | and control, on terms that prohibit them from making any copies of 173 | your copyrighted material outside their relationship with you. 174 | 175 | Conveying under any other circumstances is permitted solely under 176 | the conditions stated below. Sublicensing is not allowed; section 10 177 | makes it unnecessary. 178 | 179 | 3. Protecting Users' Legal Rights From Anti-Circumvention Law. 180 | 181 | No covered work shall be deemed part of an effective technological 182 | measure under any applicable law fulfilling obligations under article 183 | 11 of the WIPO copyright treaty adopted on 20 December 1996, or 184 | similar laws prohibiting or restricting circumvention of such 185 | measures. 
186 | 187 | When you convey a covered work, you waive any legal power to forbid 188 | circumvention of technological measures to the extent such circumvention 189 | is effected by exercising rights under this License with respect to 190 | the covered work, and you disclaim any intention to limit operation or 191 | modification of the work as a means of enforcing, against the work's 192 | users, your or third parties' legal rights to forbid circumvention of 193 | technological measures. 194 | 195 | 4. Conveying Verbatim Copies. 196 | 197 | You may convey verbatim copies of the Program's source code as you 198 | receive it, in any medium, provided that you conspicuously and 199 | appropriately publish on each copy an appropriate copyright notice; 200 | keep intact all notices stating that this License and any 201 | non-permissive terms added in accord with section 7 apply to the code; 202 | keep intact all notices of the absence of any warranty; and give all 203 | recipients a copy of this License along with the Program. 204 | 205 | You may charge any price or no price for each copy that you convey, 206 | and you may offer support or warranty protection for a fee. 207 | 208 | 5. Conveying Modified Source Versions. 209 | 210 | You may convey a work based on the Program, or the modifications to 211 | produce it from the Program, in the form of source code under the 212 | terms of section 4, provided that you also meet all of these conditions: 213 | 214 | a) The work must carry prominent notices stating that you modified 215 | it, and giving a relevant date. 216 | 217 | b) The work must carry prominent notices stating that it is 218 | released under this License and any conditions added under section 219 | 7. This requirement modifies the requirement in section 4 to 220 | "keep intact all notices". 221 | 222 | c) You must license the entire work, as a whole, under this 223 | License to anyone who comes into possession of a copy. 
This 224 | License will therefore apply, along with any applicable section 7 225 | additional terms, to the whole of the work, and all its parts, 226 | regardless of how they are packaged. This License gives no 227 | permission to license the work in any other way, but it does not 228 | invalidate such permission if you have separately received it. 229 | 230 | d) If the work has interactive user interfaces, each must display 231 | Appropriate Legal Notices; however, if the Program has interactive 232 | interfaces that do not display Appropriate Legal Notices, your 233 | work need not make them do so. 234 | 235 | A compilation of a covered work with other separate and independent 236 | works, which are not by their nature extensions of the covered work, 237 | and which are not combined with it such as to form a larger program, 238 | in or on a volume of a storage or distribution medium, is called an 239 | "aggregate" if the compilation and its resulting copyright are not 240 | used to limit the access or legal rights of the compilation's users 241 | beyond what the individual works permit. Inclusion of a covered work 242 | in an aggregate does not cause this License to apply to the other 243 | parts of the aggregate. 244 | 245 | 6. Conveying Non-Source Forms. 246 | 247 | You may convey a covered work in object code form under the terms 248 | of sections 4 and 5, provided that you also convey the 249 | machine-readable Corresponding Source under the terms of this License, 250 | in one of these ways: 251 | 252 | a) Convey the object code in, or embodied in, a physical product 253 | (including a physical distribution medium), accompanied by the 254 | Corresponding Source fixed on a durable physical medium 255 | customarily used for software interchange. 
256 | 257 | b) Convey the object code in, or embodied in, a physical product 258 | (including a physical distribution medium), accompanied by a 259 | written offer, valid for at least three years and valid for as 260 | long as you offer spare parts or customer support for that product 261 | model, to give anyone who possesses the object code either (1) a 262 | copy of the Corresponding Source for all the software in the 263 | product that is covered by this License, on a durable physical 264 | medium customarily used for software interchange, for a price no 265 | more than your reasonable cost of physically performing this 266 | conveying of source, or (2) access to copy the 267 | Corresponding Source from a network server at no charge. 268 | 269 | c) Convey individual copies of the object code with a copy of the 270 | written offer to provide the Corresponding Source. This 271 | alternative is allowed only occasionally and noncommercially, and 272 | only if you received the object code with such an offer, in accord 273 | with subsection 6b. 274 | 275 | d) Convey the object code by offering access from a designated 276 | place (gratis or for a charge), and offer equivalent access to the 277 | Corresponding Source in the same way through the same place at no 278 | further charge. You need not require recipients to copy the 279 | Corresponding Source along with the object code. If the place to 280 | copy the object code is a network server, the Corresponding Source 281 | may be on a different server (operated by you or a third party) 282 | that supports equivalent copying facilities, provided you maintain 283 | clear directions next to the object code saying where to find the 284 | Corresponding Source. Regardless of what server hosts the 285 | Corresponding Source, you remain obligated to ensure that it is 286 | available for as long as needed to satisfy these requirements. 
287 | 288 | e) Convey the object code using peer-to-peer transmission, provided 289 | you inform other peers where the object code and Corresponding 290 | Source of the work are being offered to the general public at no 291 | charge under subsection 6d. 292 | 293 | A separable portion of the object code, whose source code is excluded 294 | from the Corresponding Source as a System Library, need not be 295 | included in conveying the object code work. 296 | 297 | A "User Product" is either (1) a "consumer product", which means any 298 | tangible personal property which is normally used for personal, family, 299 | or household purposes, or (2) anything designed or sold for incorporation 300 | into a dwelling. In determining whether a product is a consumer product, 301 | doubtful cases shall be resolved in favor of coverage. For a particular 302 | product received by a particular user, "normally used" refers to a 303 | typical or common use of that class of product, regardless of the status 304 | of the particular user or of the way in which the particular user 305 | actually uses, or expects or is expected to use, the product. A product 306 | is a consumer product regardless of whether the product has substantial 307 | commercial, industrial or non-consumer uses, unless such uses represent 308 | the only significant mode of use of the product. 309 | 310 | "Installation Information" for a User Product means any methods, 311 | procedures, authorization keys, or other information required to install 312 | and execute modified versions of a covered work in that User Product from 313 | a modified version of its Corresponding Source. The information must 314 | suffice to ensure that the continued functioning of the modified object 315 | code is in no case prevented or interfered with solely because 316 | modification has been made. 
317 | 318 | If you convey an object code work under this section in, or with, or 319 | specifically for use in, a User Product, and the conveying occurs as 320 | part of a transaction in which the right of possession and use of the 321 | User Product is transferred to the recipient in perpetuity or for a 322 | fixed term (regardless of how the transaction is characterized), the 323 | Corresponding Source conveyed under this section must be accompanied 324 | by the Installation Information. But this requirement does not apply 325 | if neither you nor any third party retains the ability to install 326 | modified object code on the User Product (for example, the work has 327 | been installed in ROM). 328 | 329 | The requirement to provide Installation Information does not include a 330 | requirement to continue to provide support service, warranty, or updates 331 | for a work that has been modified or installed by the recipient, or for 332 | the User Product in which it has been modified or installed. Access to a 333 | network may be denied when the modification itself materially and 334 | adversely affects the operation of the network or violates the rules and 335 | protocols for communication across the network. 336 | 337 | Corresponding Source conveyed, and Installation Information provided, 338 | in accord with this section must be in a format that is publicly 339 | documented (and with an implementation available to the public in 340 | source code form), and must require no special password or key for 341 | unpacking, reading or copying. 342 | 343 | 7. Additional Terms. 344 | 345 | "Additional permissions" are terms that supplement the terms of this 346 | License by making exceptions from one or more of its conditions. 347 | Additional permissions that are applicable to the entire Program shall 348 | be treated as though they were included in this License, to the extent 349 | that they are valid under applicable law. 
If additional permissions 350 | apply only to part of the Program, that part may be used separately 351 | under those permissions, but the entire Program remains governed by 352 | this License without regard to the additional permissions. 353 | 354 | When you convey a copy of a covered work, you may at your option 355 | remove any additional permissions from that copy, or from any part of 356 | it. (Additional permissions may be written to require their own 357 | removal in certain cases when you modify the work.) You may place 358 | additional permissions on material, added by you to a covered work, 359 | for which you have or can give appropriate copyright permission. 360 | 361 | Notwithstanding any other provision of this License, for material you 362 | add to a covered work, you may (if authorized by the copyright holders of 363 | that material) supplement the terms of this License with terms: 364 | 365 | a) Disclaiming warranty or limiting liability differently from the 366 | terms of sections 15 and 16 of this License; or 367 | 368 | b) Requiring preservation of specified reasonable legal notices or 369 | author attributions in that material or in the Appropriate Legal 370 | Notices displayed by works containing it; or 371 | 372 | c) Prohibiting misrepresentation of the origin of that material, or 373 | requiring that modified versions of such material be marked in 374 | reasonable ways as different from the original version; or 375 | 376 | d) Limiting the use for publicity purposes of names of licensors or 377 | authors of the material; or 378 | 379 | e) Declining to grant rights under trademark law for use of some 380 | trade names, trademarks, or service marks; or 381 | 382 | f) Requiring indemnification of licensors and authors of that 383 | material by anyone who conveys the material (or modified versions of 384 | it) with contractual assumptions of liability to the recipient, for 385 | any liability that these contractual assumptions directly impose on 
386 | those licensors and authors. 387 | 388 | All other non-permissive additional terms are considered "further 389 | restrictions" within the meaning of section 10. If the Program as you 390 | received it, or any part of it, contains a notice stating that it is 391 | governed by this License along with a term that is a further 392 | restriction, you may remove that term. If a license document contains 393 | a further restriction but permits relicensing or conveying under this 394 | License, you may add to a covered work material governed by the terms 395 | of that license document, provided that the further restriction does 396 | not survive such relicensing or conveying. 397 | 398 | If you add terms to a covered work in accord with this section, you 399 | must place, in the relevant source files, a statement of the 400 | additional terms that apply to those files, or a notice indicating 401 | where to find the applicable terms. 402 | 403 | Additional terms, permissive or non-permissive, may be stated in the 404 | form of a separately written license, or stated as exceptions; 405 | the above requirements apply either way. 406 | 407 | 8. Termination. 408 | 409 | You may not propagate or modify a covered work except as expressly 410 | provided under this License. Any attempt otherwise to propagate or 411 | modify it is void, and will automatically terminate your rights under 412 | this License (including any patent licenses granted under the third 413 | paragraph of section 11). 414 | 415 | However, if you cease all violation of this License, then your 416 | license from a particular copyright holder is reinstated (a) 417 | provisionally, unless and until the copyright holder explicitly and 418 | finally terminates your license, and (b) permanently, if the copyright 419 | holder fails to notify you of the violation by some reasonable means 420 | prior to 60 days after the cessation. 
421 | 422 | Moreover, your license from a particular copyright holder is 423 | reinstated permanently if the copyright holder notifies you of the 424 | violation by some reasonable means, this is the first time you have 425 | received notice of violation of this License (for any work) from that 426 | copyright holder, and you cure the violation prior to 30 days after 427 | your receipt of the notice. 428 | 429 | Termination of your rights under this section does not terminate the 430 | licenses of parties who have received copies or rights from you under 431 | this License. If your rights have been terminated and not permanently 432 | reinstated, you do not qualify to receive new licenses for the same 433 | material under section 10. 434 | 435 | 9. Acceptance Not Required for Having Copies. 436 | 437 | You are not required to accept this License in order to receive or 438 | run a copy of the Program. Ancillary propagation of a covered work 439 | occurring solely as a consequence of using peer-to-peer transmission 440 | to receive a copy likewise does not require acceptance. However, 441 | nothing other than this License grants you permission to propagate or 442 | modify any covered work. These actions infringe copyright if you do 443 | not accept this License. Therefore, by modifying or propagating a 444 | covered work, you indicate your acceptance of this License to do so. 445 | 446 | 10. Automatic Licensing of Downstream Recipients. 447 | 448 | Each time you convey a covered work, the recipient automatically 449 | receives a license from the original licensors, to run, modify and 450 | propagate that work, subject to this License. You are not responsible 451 | for enforcing compliance by third parties with this License. 452 | 453 | An "entity transaction" is a transaction transferring control of an 454 | organization, or substantially all assets of one, or subdividing an 455 | organization, or merging organizations. 
If propagation of a covered 456 | work results from an entity transaction, each party to that 457 | transaction who receives a copy of the work also receives whatever 458 | licenses to the work the party's predecessor in interest had or could 459 | give under the previous paragraph, plus a right to possession of the 460 | Corresponding Source of the work from the predecessor in interest, if 461 | the predecessor has it or can get it with reasonable efforts. 462 | 463 | You may not impose any further restrictions on the exercise of the 464 | rights granted or affirmed under this License. For example, you may 465 | not impose a license fee, royalty, or other charge for exercise of 466 | rights granted under this License, and you may not initiate litigation 467 | (including a cross-claim or counterclaim in a lawsuit) alleging that 468 | any patent claim is infringed by making, using, selling, offering for 469 | sale, or importing the Program or any portion of it. 470 | 471 | 11. Patents. 472 | 473 | A "contributor" is a copyright holder who authorizes use under this 474 | License of the Program or a work on which the Program is based. The 475 | work thus licensed is called the contributor's "contributor version". 476 | 477 | A contributor's "essential patent claims" are all patent claims 478 | owned or controlled by the contributor, whether already acquired or 479 | hereafter acquired, that would be infringed by some manner, permitted 480 | by this License, of making, using, or selling its contributor version, 481 | but do not include claims that would be infringed only as a 482 | consequence of further modification of the contributor version. For 483 | purposes of this definition, "control" includes the right to grant 484 | patent sublicenses in a manner consistent with the requirements of 485 | this License. 
486 | 487 | Each contributor grants you a non-exclusive, worldwide, royalty-free 488 | patent license under the contributor's essential patent claims, to 489 | make, use, sell, offer for sale, import and otherwise run, modify and 490 | propagate the contents of its contributor version. 491 | 492 | In the following three paragraphs, a "patent license" is any express 493 | agreement or commitment, however denominated, not to enforce a patent 494 | (such as an express permission to practice a patent or covenant not to 495 | sue for patent infringement). To "grant" such a patent license to a 496 | party means to make such an agreement or commitment not to enforce a 497 | patent against the party. 498 | 499 | If you convey a covered work, knowingly relying on a patent license, 500 | and the Corresponding Source of the work is not available for anyone 501 | to copy, free of charge and under the terms of this License, through a 502 | publicly available network server or other readily accessible means, 503 | then you must either (1) cause the Corresponding Source to be so 504 | available, or (2) arrange to deprive yourself of the benefit of the 505 | patent license for this particular work, or (3) arrange, in a manner 506 | consistent with the requirements of this License, to extend the patent 507 | license to downstream recipients. "Knowingly relying" means you have 508 | actual knowledge that, but for the patent license, your conveying the 509 | covered work in a country, or your recipient's use of the covered work 510 | in a country, would infringe one or more identifiable patents in that 511 | country that you have reason to believe are valid. 
512 | 513 | If, pursuant to or in connection with a single transaction or 514 | arrangement, you convey, or propagate by procuring conveyance of, a 515 | covered work, and grant a patent license to some of the parties 516 | receiving the covered work authorizing them to use, propagate, modify 517 | or convey a specific copy of the covered work, then the patent license 518 | you grant is automatically extended to all recipients of the covered 519 | work and works based on it. 520 | 521 | A patent license is "discriminatory" if it does not include within 522 | the scope of its coverage, prohibits the exercise of, or is 523 | conditioned on the non-exercise of one or more of the rights that are 524 | specifically granted under this License. You may not convey a covered 525 | work if you are a party to an arrangement with a third party that is 526 | in the business of distributing software, under which you make payment 527 | to the third party based on the extent of your activity of conveying 528 | the work, and under which the third party grants, to any of the 529 | parties who would receive the covered work from you, a discriminatory 530 | patent license (a) in connection with copies of the covered work 531 | conveyed by you (or copies made from those copies), or (b) primarily 532 | for and in connection with specific products or compilations that 533 | contain the covered work, unless you entered into that arrangement, 534 | or that patent license was granted, prior to 28 March 2007. 535 | 536 | Nothing in this License shall be construed as excluding or limiting 537 | any implied license or other defenses to infringement that may 538 | otherwise be available to you under applicable patent law. 539 | 540 | 12. No Surrender of Others' Freedom. 541 | 542 | If conditions are imposed on you (whether by court order, agreement or 543 | otherwise) that contradict the conditions of this License, they do not 544 | excuse you from the conditions of this License. 
If you cannot convey a 545 | covered work so as to satisfy simultaneously your obligations under this 546 | License and any other pertinent obligations, then as a consequence you may 547 | not convey it at all. For example, if you agree to terms that obligate you 548 | to collect a royalty for further conveying from those to whom you convey 549 | the Program, the only way you could satisfy both those terms and this 550 | License would be to refrain entirely from conveying the Program. 551 | 552 | 13. Use with the GNU Affero General Public License. 553 | 554 | Notwithstanding any other provision of this License, you have 555 | permission to link or combine any covered work with a work licensed 556 | under version 3 of the GNU Affero General Public License into a single 557 | combined work, and to convey the resulting work. The terms of this 558 | License will continue to apply to the part which is the covered work, 559 | but the special requirements of the GNU Affero General Public License, 560 | section 13, concerning interaction through a network will apply to the 561 | combination as such. 562 | 563 | 14. Revised Versions of this License. 564 | 565 | The Free Software Foundation may publish revised and/or new versions of 566 | the GNU General Public License from time to time. Such new versions will 567 | be similar in spirit to the present version, but may differ in detail to 568 | address new problems or concerns. 569 | 570 | Each version is given a distinguishing version number. If the 571 | Program specifies that a certain numbered version of the GNU General 572 | Public License "or any later version" applies to it, you have the 573 | option of following the terms and conditions either of that numbered 574 | version or of any later version published by the Free Software 575 | Foundation. If the Program does not specify a version number of the 576 | GNU General Public License, you may choose any version ever published 577 | by the Free Software Foundation. 
578 | 579 | If the Program specifies that a proxy can decide which future 580 | versions of the GNU General Public License can be used, that proxy's 581 | public statement of acceptance of a version permanently authorizes you 582 | to choose that version for the Program. 583 | 584 | Later license versions may give you additional or different 585 | permissions. However, no additional obligations are imposed on any 586 | author or copyright holder as a result of your choosing to follow a 587 | later version. 588 | 589 | 15. Disclaimer of Warranty. 590 | 591 | THERE IS NO WARRANTY FOR THE PROGRAM, TO THE EXTENT PERMITTED BY 592 | APPLICABLE LAW. EXCEPT WHEN OTHERWISE STATED IN WRITING THE COPYRIGHT 593 | HOLDERS AND/OR OTHER PARTIES PROVIDE THE PROGRAM "AS IS" WITHOUT WARRANTY 594 | OF ANY KIND, EITHER EXPRESSED OR IMPLIED, INCLUDING, BUT NOT LIMITED TO, 595 | THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR 596 | PURPOSE. THE ENTIRE RISK AS TO THE QUALITY AND PERFORMANCE OF THE PROGRAM 597 | IS WITH YOU. SHOULD THE PROGRAM PROVE DEFECTIVE, YOU ASSUME THE COST OF 598 | ALL NECESSARY SERVICING, REPAIR OR CORRECTION. 599 | 600 | 16. Limitation of Liability. 601 | 602 | IN NO EVENT UNLESS REQUIRED BY APPLICABLE LAW OR AGREED TO IN WRITING 603 | WILL ANY COPYRIGHT HOLDER, OR ANY OTHER PARTY WHO MODIFIES AND/OR CONVEYS 604 | THE PROGRAM AS PERMITTED ABOVE, BE LIABLE TO YOU FOR DAMAGES, INCLUDING ANY 605 | GENERAL, SPECIAL, INCIDENTAL OR CONSEQUENTIAL DAMAGES ARISING OUT OF THE 606 | USE OR INABILITY TO USE THE PROGRAM (INCLUDING BUT NOT LIMITED TO LOSS OF 607 | DATA OR DATA BEING RENDERED INACCURATE OR LOSSES SUSTAINED BY YOU OR THIRD 608 | PARTIES OR A FAILURE OF THE PROGRAM TO OPERATE WITH ANY OTHER PROGRAMS), 609 | EVEN IF SUCH HOLDER OR OTHER PARTY HAS BEEN ADVISED OF THE POSSIBILITY OF 610 | SUCH DAMAGES. 611 | 612 | 17. Interpretation of Sections 15 and 16. 
613 | 614 | If the disclaimer of warranty and limitation of liability provided 615 | above cannot be given local legal effect according to their terms, 616 | reviewing courts shall apply local law that most closely approximates 617 | an absolute waiver of all civil liability in connection with the 618 | Program, unless a warranty or assumption of liability accompanies a 619 | copy of the Program in return for a fee. 620 | 621 | END OF TERMS AND CONDITIONS 622 | 623 | How to Apply These Terms to Your New Programs 624 | 625 | If you develop a new program, and you want it to be of the greatest 626 | possible use to the public, the best way to achieve this is to make it 627 | free software which everyone can redistribute and change under these terms. 628 | 629 | To do so, attach the following notices to the program. It is safest 630 | to attach them to the start of each source file to most effectively 631 | state the exclusion of warranty; and each file should have at least 632 | the "copyright" line and a pointer to where the full notice is found. 633 | 634 | {one line to give the program's name and a brief idea of what it does.} 635 | Copyright (C) {year} {name of author} 636 | 637 | This program is free software: you can redistribute it and/or modify 638 | it under the terms of the GNU General Public License as published by 639 | the Free Software Foundation, either version 3 of the License, or 640 | (at your option) any later version. 641 | 642 | This program is distributed in the hope that it will be useful, 643 | but WITHOUT ANY WARRANTY; without even the implied warranty of 644 | MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the 645 | GNU General Public License for more details. 646 | 647 | You should have received a copy of the GNU General Public License 648 | along with this program. If not, see <http://www.gnu.org/licenses/>. 649 | 650 | Also add information on how to contact you by electronic and paper mail.
651 | 652 | If the program does terminal interaction, make it output a short 653 | notice like this when it starts in an interactive mode: 654 | 655 | {project} Copyright (C) {year} {fullname} 656 | This program comes with ABSOLUTELY NO WARRANTY; for details type `show w'. 657 | This is free software, and you are welcome to redistribute it 658 | under certain conditions; type `show c' for details. 659 | 660 | The hypothetical commands `show w' and `show c' should show the appropriate 661 | parts of the General Public License. Of course, your program's commands 662 | might be different; for a GUI interface, you would use an "about box". 663 | 664 | You should also get your employer (if you work as a programmer) or school, 665 | if any, to sign a "copyright disclaimer" for the program, if necessary. 666 | For more information on this, and how to apply and follow the GNU GPL, see 667 | <http://www.gnu.org/licenses/>. 668 | 669 | The GNU General Public License does not permit incorporating your program 670 | into proprietary programs. If your program is a subroutine library, you 671 | may consider it more useful to permit linking proprietary applications with 672 | the library. If this is what you want to do, use the GNU Lesser General 673 | Public License instead of this License. But first, please read 674 | <http://www.gnu.org/philosophy/why-not-lgpl.html>. 675 | 676 | -------------------------------------------------------------------------------- /README.md: -------------------------------------------------------------------------------- 1 | # WikiExtractor 2 | [WikiExtractor.py](http://medialab.di.unipi.it/wiki/Wikipedia_Extractor) is a Python script that extracts and cleans text from a [Wikipedia database dump](http://download.wikimedia.org/). 3 | 4 | The tool is written in Python and requires Python 2.7 or Python 3.3+; no additional libraries are needed. 5 | 6 | For further information, see the [project Home Page](http://medialab.di.unipi.it/wiki/Wikipedia_Extractor) or the [Wiki](https://github.com/attardi/wikiextractor/wiki).
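To give a concrete, purely illustrative sense of the "cleaning" the script performs, the sketch below strips wiki link markup with a single regular expression. The function name and pattern are this example's own and do not appear in WikiExtractor.py, which handles far more markup (templates, tables, references, and so on).

```python
import re

# Illustrative sketch only: WikiExtractor's real cleaning covers templates,
# tables, references, and more. This handles just [[target|anchor]] links.
def clean_links(text):
    # [[target|anchor]] -> anchor, [[target]] -> target
    return re.sub(r"\[\[(?:[^|\]]*\|)?([^\]]+)\]\]", r"\1", text)

print(clean_links("Born in [[Pisa]], he worked on [[Natural language processing|NLP]]."))
# prints: Born in Pisa, he worked on NLP.
```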
7 | 8 | # Wikipedia Cirrus Extractor 9 | 10 | `cirrus-extract.py` is a version of the script that performs extraction from a Wikipedia Cirrus dump. 11 | Cirrus dumps contain text with already expanded templates. 12 | 13 | Cirrus dumps are available at: 14 | [cirrussearch](http://dumps.wikimedia.org/other/cirrussearch/). 15 | 16 | # Details 17 | 18 | WikiExtractor performs template expansion by preprocessing the whole dump and extracting template definitions. 19 | 20 | To speed up processing: 21 | 22 | - multiprocessing is used to deal with articles in parallel 23 | - a cache of parsed templates is kept (only useful for repeated extractions). 24 | 25 | ## Installation 26 | 27 | The script may be invoked directly; however, it can also be installed with: 28 | 29 | (sudo) python setup.py install 30 | 31 | ## Usage 32 | The script is invoked with a Wikipedia dump file as an argument. 33 | The output is stored in several files of similar size in a given directory. 34 | Each file will contain several documents in this [document format](http://medialab.di.unipi.it/wiki/Document_Format). 35 | 36 | usage: WikiExtractor.py [-h] [-o OUTPUT] [-b n[KMG]] [-c] [--json] [--html] 37 | [-l] [-s] [--lists] [-ns ns1,ns2] 38 | [--templates TEMPLATES] [--no-templates] [-r] 39 | [--min_text_length MIN_TEXT_LENGTH] 40 | [--filter_disambig_pages] [-it abbr,b,big] 41 | [-de gallery,timeline,noinclude] [--keep_tables] 42 | [--processes PROCESSES] [-q] [--debug] [-a] [-v] 43 | input 44 | 45 | Wikipedia Extractor: 46 | Extracts and cleans text from a Wikipedia database dump and stores output in a 47 | number of files of similar size in a given directory. 48 | Each file will contain several documents in the format: 49 | 50 | <doc id="" url="" title="">...</doc> 51 |
52 | 53 | 54 | If the program is invoked with the --json flag, then each file will 55 | contain several documents formatted as JSON objects, one per line, with 56 | the following structure 57 | 58 | {"id": "", "revid": "", "url":"", "title": "", "text": "..."} 59 | 60 | Template expansion requires first preprocessing the whole dump and 61 | collecting template definitions. 62 | 63 | positional arguments: 64 | input XML wiki dump file 65 | 66 | optional arguments: 67 | -h, --help show this help message and exit 68 | --processes PROCESSES 69 | Number of processes to use (default 1) 70 | 71 | Output: 72 | -o OUTPUT, --output OUTPUT 73 | directory for extracted files (or '-' for dumping to 74 | stdout) 75 | -b n[KMG], --bytes n[KMG] 76 | maximum bytes per output file (default 1M) 77 | -c, --compress compress output files using bzip 78 | --json write output in json format instead of the default one 79 | 80 | Processing: 81 | --html produce HTML output, subsumes --links 82 | -l, --links preserve links 83 | -s, --sections preserve sections 84 | --lists preserve lists 85 | -ns ns1,ns2, --namespaces ns1,ns2 86 | accepted namespaces in links 87 | --templates TEMPLATES 88 | use or create file containing templates 89 | --no-templates Do not expand templates 90 | -r, --revision Include the document revision id (default=False) 91 | --min_text_length MIN_TEXT_LENGTH 92 | Minimum expanded text length required to write 93 | document (default=0) 94 | --filter_disambig_pages 95 | Remove pages from output that contain disambiguation 96 | markup (default=False) 97 | -it abbr,b,big, --ignored_tags abbr,b,big 98 | comma separated list of tags that will be dropped, 99 | keeping their content 100 | -de gallery,timeline,noinclude, --discard_elements gallery,timeline,noinclude 101 | comma separated list of elements that will be removed 102 | from the article text 103 | --keep_tables Preserve tables in the output article text 104 | (default=False) 105 | 106 | Special: 107 | -q, --quiet
suppress reporting progress info 108 | --debug print debug info 109 | -a, --article analyze a file containing a single article (debug 110 | option) 111 | -v, --version print program version 112 | 113 | 114 | Saving templates to a file speeds up extraction on subsequent runs, 115 | provided the template definitions have not changed. 116 | 117 | Option --no-templates significantly speeds up the extractor, avoiding the cost 118 | of expanding [MediaWiki templates](https://www.mediawiki.org/wiki/Help:Templates). 119 | 120 | For further information, visit [the documentation](http://attardi.github.io/wikiextractor). 121 | -------------------------------------------------------------------------------- /WikiExtractor.py: -------------------------------------------------------------------------------- 1 | #!/usr/bin/env python 2 | # -*- coding: utf-8 -*- 3 | 4 | # ============================================================================= 5 | # Version: 2.75 (March 4, 2017) 6 | # Author: Giuseppe Attardi (attardi@di.unipi.it), University of Pisa 7 | # 8 | # Contributors: 9 | # Antonio Fuschetto (fuschett@aol.com) 10 | # Leonardo Souza (lsouza@amtera.com.br) 11 | # Juan Manuel Caicedo (juan@cavorite.com) 12 | # Humberto Pereira (begini@gmail.com) 13 | # Siegfried-A. Gevatter (siegfried@gevatter.com) 14 | # Pedro Assis (pedroh2306@gmail.com) 15 | # Wim Muskee (wimmuskee@gmail.com) 16 | # Radics Geza (radicsge@gmail.com) 17 | # orangain (orangain@gmail.com) 18 | # Seth Cleveland (scleveland@turnitin.com) 19 | # Bren Barn 20 | # 21 | # ============================================================================= 22 | # Copyright (c) 2011-2017. Giuseppe Attardi (attardi@di.unipi.it). 23 | # ============================================================================= 24 | # This file is part of Tanl.
25 | #
26 | # Tanl is free software; you can redistribute it and/or modify it
27 | # under the terms of the GNU General Public License, version 3,
28 | # as published by the Free Software Foundation.
29 | #
30 | # Tanl is distributed in the hope that it will be useful,
31 | # but WITHOUT ANY WARRANTY; without even the implied warranty of
32 | # MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
33 | # GNU General Public License at <http://www.gnu.org/licenses/> for more details.
34 | #
35 | # =============================================================================
36 | 
37 | """Wikipedia Extractor:
38 | Extracts and cleans text from a Wikipedia database dump and stores output in a
39 | number of files of similar size in a given directory.
40 | Each file will contain several documents in the format:
41 | 
42 | 	<doc id="" revid="" url="" title="">
43 | 	...
44 | 	</doc>
45 | 
46 | If the program is invoked with the --json flag, then each file will
47 | contain several documents formatted as JSON objects, one per line, with
48 | the following structure
49 | 
50 | {"id": "", "revid": "", "url": "", "title": "", "text": "..."}
51 | 
52 | Template expansion requires first preprocessing the whole dump and
53 | collecting template definitions.
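The --json output is one JSON object per line (JSON Lines), so extracted files can be read back line by line. A minimal sketch, using an illustrative record (field values here are made up):

```python
import json

# One line of a file produced with --json (values are illustrative).
line = '{"id": "12", "revid": "34", "url": "http://en.wikipedia.org/?curid=12", "title": "Anarchism", "text": "Anarchism is ..."}'
doc = json.loads(line)
print(doc["title"])
```

In a real run you would iterate over every line of each extracted file and call `json.loads` per line.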
54 | 55 | """ 56 | 57 | from __future__ import unicode_literals, division 58 | 59 | import sys 60 | import argparse 61 | import bz2 62 | import codecs 63 | import cgi 64 | import fileinput 65 | import logging 66 | import os.path 67 | import re # TODO use regex when it will be standard 68 | import time 69 | import json 70 | from io import StringIO 71 | from multiprocessing import Queue, Process, Value, cpu_count 72 | from timeit import default_timer 73 | 74 | 75 | PY2 = sys.version_info[0] == 2 76 | # Python 2.7 compatibiity 77 | if PY2: 78 | from urllib import quote 79 | from htmlentitydefs import name2codepoint 80 | from itertools import izip as zip, izip_longest as zip_longest 81 | range = xrange # Use Python 3 equivalent 82 | chr = unichr # Use Python 3 equivalent 83 | text_type = unicode 84 | 85 | class SimpleNamespace(object): 86 | def __init__ (self, **kwargs): 87 | self.__dict__.update(kwargs) 88 | def __repr__ (self): 89 | keys = sorted(self.__dict__) 90 | items = ("{}={!r}".format(k, self.__dict__[k]) for k in keys) 91 | return "{}({})".format(type(self).__name__, ", ".join(items)) 92 | def __eq__ (self, other): 93 | return self.__dict__ == other.__dict__ 94 | else: 95 | from urllib.parse import quote 96 | from html.entities import name2codepoint 97 | from itertools import zip_longest 98 | from types import SimpleNamespace 99 | text_type = str 100 | 101 | 102 | # =========================================================================== 103 | 104 | # Program version 105 | version = '2.75' 106 | 107 | ## PARAMS #################################################################### 108 | 109 | options = SimpleNamespace( 110 | 111 | ## 112 | # Defined in 113 | # We include as default Template, when loading external template file. 114 | knownNamespaces = {'Template': 10}, 115 | 116 | ## 117 | # The namespace used for template definitions 118 | # It is the name associated with namespace key=10 in the siteinfo header. 
119 | templateNamespace = '', 120 | templatePrefix = '', 121 | 122 | ## 123 | # The namespace used for module definitions 124 | # It is the name associated with namespace key=828 in the siteinfo header. 125 | moduleNamespace = '', 126 | 127 | ## 128 | # Recognize only these namespaces in links 129 | # w: Internal links to the Wikipedia 130 | # wiktionary: Wiki dictionary 131 | # wikt: shortcut for Wiktionary 132 | # 133 | acceptedNamespaces = ['w', 'wiktionary', 'wikt'], 134 | 135 | # This is obtained from 136 | urlbase = '', 137 | 138 | ## 139 | # Filter disambiguation pages 140 | filter_disambig_pages = False, 141 | 142 | ## 143 | # Drop tables from the article 144 | keep_tables = False, 145 | 146 | ## 147 | # Whether to preserve links in output 148 | keepLinks = False, 149 | 150 | ## 151 | # Whether to preserve section titles 152 | keepSections = True, 153 | 154 | ## 155 | # Whether to preserve lists 156 | keepLists = False, 157 | 158 | ## 159 | # Whether to output HTML instead of text 160 | toHTML = False, 161 | 162 | ## 163 | # Whether to write json instead of the xml-like default output format 164 | write_json = False, 165 | 166 | ## 167 | # Whether to expand templates 168 | expand_templates = True, 169 | 170 | ## 171 | ## Whether to escape doc content 172 | escape_doc = False, 173 | 174 | ## 175 | # Print the wikipedia article revision 176 | print_revision = False, 177 | 178 | ## 179 | # Minimum expanded text length required to print document 180 | min_text_length = 0, 181 | 182 | # Shared objects holding templates, redirects and cache 183 | templates = {}, 184 | redirects = {}, 185 | # cache of parser templates 186 | # FIXME: sharing this with a Manager slows down. 
187 | templateCache = {}, 188 | 189 | # Elements to ignore/discard 190 | 191 | ignored_tag_patterns = [], 192 | 193 | discardElements = [ 194 | 'gallery', 'timeline', 'noinclude', 'pre', 195 | 'table', 'tr', 'td', 'th', 'caption', 'div', 196 | 'form', 'input', 'select', 'option', 'textarea', 197 | 'ul', 'li', 'ol', 'dl', 'dt', 'dd', 'menu', 'dir', 198 | 'ref', 'references', 'img', 'imagemap', 'source', 'small', 199 | 'sub', 'sup', 'indicator' 200 | ], 201 | ) 202 | 203 | ## 204 | # Keys for Template and Module namespaces 205 | templateKeys = set(['10', '828']) 206 | 207 | ## 208 | # Regex for identifying disambig pages 209 | filter_disambig_page_pattern = re.compile("{{disambig(uation)?(\|[^}]*)?}}") 210 | 211 | ## 212 | # page filtering logic -- remove templates, undesired xml namespaces, and disambiguation pages 213 | def keepPage(ns, page): 214 | if ns != '0': # Aritcle 215 | return False 216 | # remove disambig pages if desired 217 | if options.filter_disambig_pages: 218 | for line in page: 219 | if filter_disambig_page_pattern.match(line): 220 | return False 221 | return True 222 | 223 | 224 | def get_url(uid): 225 | return "%s?curid=%s" % (options.urlbase, uid) 226 | 227 | 228 | # ========================================================================= 229 | # 230 | # MediaWiki Markup Grammar 231 | # https://www.mediawiki.org/wiki/Preprocessor_ABNF 232 | 233 | # xml-char = %x9 / %xA / %xD / %x20-D7FF / %xE000-FFFD / %x10000-10FFFF 234 | # sptab = SP / HTAB 235 | 236 | # ; everything except ">" (%x3E) 237 | # attr-char = %x9 / %xA / %xD / %x20-3D / %x3F-D7FF / %xE000-FFFD / %x10000-10FFFF 238 | 239 | # literal = *xml-char 240 | # title = wikitext-L3 241 | # part-name = wikitext-L3 242 | # part-value = wikitext-L3 243 | # part = ( part-name "=" part-value ) / ( part-value ) 244 | # parts = [ title *( "|" part ) ] 245 | # tplarg = "{{{" parts "}}}" 246 | # template = "{{" parts "}}" 247 | # link = "[[" wikitext-L3 "]]" 248 | 249 | # comment = "" 250 | # 
unclosed-comment = "', re.DOTALL) 335 | 336 | 337 | # Match ... 338 | nowiki = re.compile(r'.*?') 339 | 340 | 341 | def ignoreTag(tag): 342 | left = re.compile(r'<%s\b.*?>' % tag, re.IGNORECASE | re.DOTALL) # both and 343 | right = re.compile(r'' % tag, re.IGNORECASE) 344 | options.ignored_tag_patterns.append((left, right)) 345 | 346 | # Match selfClosing HTML tags 347 | selfClosing_tag_patterns = [ 348 | re.compile(r'<\s*%s\b[^>]*/\s*>' % tag, re.DOTALL | re.IGNORECASE) for tag in selfClosingTags 349 | ] 350 | 351 | # Match HTML placeholder tags 352 | placeholder_tag_patterns = [ 353 | (re.compile(r'<\s*%s(\s*| [^>]+?)>.*?<\s*/\s*%s\s*>' % (tag, tag), re.DOTALL | re.IGNORECASE), 354 | repl) for tag, repl in placeholder_tags.items() 355 | ] 356 | 357 | # Match preformatted lines 358 | preformatted = re.compile(r'^ .*?$') 359 | 360 | # Match external links (space separates second optional parameter) 361 | externalLink = re.compile(r'\[\w+[^ ]*? (.*?)]') 362 | externalLinkNoAnchor = re.compile(r'\[\w+[&\]]*\]') 363 | 364 | # Matches bold/italic 365 | bold_italic = re.compile(r"'''''(.*?)'''''") 366 | bold = re.compile(r"'''(.*?)'''") 367 | italic_quote = re.compile(r"''\"([^\"]*?)\"''") 368 | italic = re.compile(r"''(.*?)''") 369 | quote_quote = re.compile(r'""([^"]*?)""') 370 | 371 | # Matches space 372 | spaces = re.compile(r' {2,}') 373 | 374 | # Matches dots 375 | dots = re.compile(r'\.{4,}') 376 | 377 | 378 | # ====================================================================== 379 | 380 | 381 | class Template(list): 382 | """ 383 | A Template is a list of TemplateText or TemplateArgs 384 | """ 385 | 386 | @classmethod 387 | def parse(cls, body): 388 | tpl = Template() 389 | # we must handle nesting, s.a. 
390 | # {{{1|{{PAGENAME}}} 391 | # {{{italics|{{{italic|}}} 392 | # {{#if:{{{{{#if:{{{nominee|}}}|nominee|candidate}}|}}}| 393 | # 394 | start = 0 395 | for s, e in findMatchingBraces(body, 3): 396 | tpl.append(TemplateText(body[start:s])) 397 | tpl.append(TemplateArg(body[s + 3:e - 3])) 398 | start = e 399 | tpl.append(TemplateText(body[start:])) # leftover 400 | return tpl 401 | 402 | 403 | def subst(self, params, extractor, depth=0): 404 | # We perform parameter substitutions recursively. 405 | # We also limit the maximum number of iterations to avoid too long or 406 | # even endless loops (in case of malformed input). 407 | 408 | # :see: http://meta.wikimedia.org/wiki/Help:Expansion#Distinction_between_variables.2C_parser_functions.2C_and_templates 409 | # 410 | # Parameter values are assigned to parameters in two (?) passes. 411 | # Therefore a parameter name in a template can depend on the value of 412 | # another parameter of the same template, regardless of the order in 413 | # which they are specified in the template call, for example, using 414 | # Template:ppp containing "{{{{{{p}}}}}}", {{ppp|p=q|q=r}} and even 415 | # {{ppp|q=r|p=q}} gives r, but using Template:tvvv containing 416 | # "{{{{{{{{{p}}}}}}}}}", {{tvvv|p=q|q=r|r=s}} gives s. 417 | 418 | # logging.debug('&*ssubst tpl %d %s', extractor.frame.length, '', depth, self) 419 | 420 | if depth > extractor.maxParameterRecursionLevels: 421 | extractor.recursion_exceeded_3_errs += 1 422 | return '' 423 | 424 | return ''.join([tpl.subst(params, extractor, depth) for tpl in self]) 425 | 426 | def __str__(self): 427 | return ''.join([text_type(x) for x in self]) 428 | 429 | 430 | class TemplateText(text_type): 431 | """Fixed text of template""" 432 | 433 | 434 | def subst(self, params, extractor, depth): 435 | return self 436 | 437 | 438 | class TemplateArg(object): 439 | """ 440 | parameter to a template. 441 | Has a name and a default value, both of which are Templates. 
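Template.parse above interleaves fixed text with `{{{...}}}` argument slots. A simplified, self-contained sketch of that split for the non-nested case only (the real code uses findMatchingBraces to handle the nested examples listed above; `parse_body` is an illustrative name):

```python
import re

# Non-nested tplarg matcher: {{{...}}} (nesting is NOT handled here).
tplarg = re.compile(r'{{{(.*?)}}}')

def parse_body(body):
    """Split a template body into ('text', ...) and ('arg', ...) pieces."""
    parts, last = [], 0
    for m in tplarg.finditer(body):
        parts.append(('text', body[last:m.start()]))  # fixed text before the slot
        parts.append(('arg', m.group(1)))             # the tplarg contents
        last = m.end()
    parts.append(('text', body[last:]))               # leftover text
    return parts
```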
442 | """ 443 | 444 | def __init__(self, parameter): 445 | """ 446 | :param parameter: the parts of a tplarg. 447 | """ 448 | # the parameter name itself might contain templates, e.g.: 449 | # appointe{{#if:{{{appointer14|}}}|r|d}}14| 450 | # 4|{{{{{subst|}}}CURRENTYEAR}} 451 | 452 | # any parts in a tplarg after the first (the parameter default) are 453 | # ignored, and an equals sign in the first part is treated as plain text. 454 | # logging.debug('TemplateArg %s', parameter) 455 | 456 | parts = splitParts(parameter) 457 | self.name = Template.parse(parts[0]) 458 | if len(parts) > 1: 459 | # This parameter has a default value 460 | self.default = Template.parse(parts[1]) 461 | else: 462 | self.default = None 463 | 464 | def __str__(self): 465 | if self.default: 466 | return '{{{%s|%s}}}' % (self.name, self.default) 467 | else: 468 | return '{{{%s}}}' % self.name 469 | 470 | 471 | def subst(self, params, extractor, depth): 472 | """ 473 | Substitute value for this argument from dict :param params: 474 | Use :param extractor: to evaluate expressions for name and default. 475 | Limit substitution to the maximun :param depth:. 
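A TemplateArg resolves to the caller-supplied value if present, else its declared default, else the empty string. A minimal standalone sketch of that rule (`subst_arg` is an illustrative helper, not part of the extractor; the real code also expands templates inside the name and default first):

```python
def subst_arg(name, params, default=None):
    """Resolve {{{name|default}}} against the invocation's params dict."""
    if name in params:
        return params[name]   # value specified in the template invocation
    if default is not None:
        return default        # fall back to the declared default
    return ''                 # no value and no default
```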
476 | """ 477 | # the parameter name itself might contain templates, e.g.: 478 | # appointe{{#if:{{{appointer14|}}}|r|d}}14| 479 | paramName = self.name.subst(params, extractor, depth + 1) 480 | paramName = extractor.transform(paramName) 481 | res = '' 482 | if paramName in params: 483 | res = params[paramName] # use parameter value specified in template invocation 484 | elif self.default: # use the default value 485 | defaultValue = self.default.subst(params, extractor, depth + 1) 486 | res = extractor.transform(defaultValue) 487 | # logging.debug('subst arg %d %s -> %s' % (depth, paramName, res)) 488 | return res 489 | 490 | 491 | class Frame(object): 492 | 493 | def __init__(self, title='', args=[], prev=None): 494 | self.title = title 495 | self.args = args 496 | self.prev = prev 497 | self.depth = prev.depth + 1 if prev else 0 498 | 499 | 500 | def push(self, title, args): 501 | return Frame(title, args, self) 502 | 503 | 504 | def pop(self): 505 | return self.prev 506 | 507 | 508 | def __str__(self): 509 | res = '' 510 | prev = self.prev 511 | while prev: 512 | if res: res += ', ' 513 | res += '(%s, %s)' % (prev.title, prev.args) 514 | prev = prev.prev 515 | return '' 516 | 517 | # ====================================================================== 518 | 519 | substWords = 'subst:|safesubst:' 520 | 521 | class Extractor(object): 522 | """ 523 | An extraction task on a article. 524 | """ 525 | def __init__(self, id, revid, title, lines): 526 | """ 527 | :param id: id of page. 528 | :param title: tutle of page. 529 | :param lines: a list of lines. 
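The Frame class above is a linked-list call stack: each template expansion pushes a frame, and depth tracks nesting for the recursion limits. A trimmed, runnable copy:

```python
class Frame(object):
    """Linked-list call stack for template expansion (condensed from above)."""
    def __init__(self, title='', args=None, prev=None):
        self.title = title
        self.args = args if args is not None else []
        self.prev = prev
        self.depth = prev.depth + 1 if prev else 0

    def push(self, title, args):
        # Entering a template expansion: new frame on top.
        return Frame(title, args, self)

    def pop(self):
        # Leaving the expansion: back to the caller's frame.
        return self.prev
```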
530 | """ 531 | self.id = id 532 | self.revid = revid 533 | self.title = title 534 | self.text = ''.join(lines) 535 | self.magicWords = MagicWords() 536 | self.frame = Frame() 537 | self.recursion_exceeded_1_errs = 0 # template recursion within expand() 538 | self.recursion_exceeded_2_errs = 0 # template recursion within expandTemplate() 539 | self.recursion_exceeded_3_errs = 0 # parameter recursion 540 | self.template_title_errs = 0 541 | 542 | def write_output(self, out, text): 543 | """ 544 | :param out: a memory file 545 | :param text: the text of the page 546 | """ 547 | url = get_url(self.id) 548 | if options.write_json: 549 | json_data = { 550 | 'id': self.id, 551 | 'url': url, 552 | 'title': self.title, 553 | 'text': "\n".join(text) 554 | } 555 | if options.print_revision: 556 | json_data['revid'] = self.revid 557 | # We don't use json.dump(data, out) because we want to be 558 | # able to encode the string if the output is sys.stdout 559 | out_str = json.dumps(json_data, ensure_ascii=False) 560 | if out == sys.stdout: # option -a or -o - 561 | out_str = out_str.encode('utf-8') 562 | out.write(out_str) 563 | out.write('\n') 564 | else: 565 | if options.print_revision: 566 | header = '\n' % (self.id, self.revid, url, self.title) 567 | else: 568 | header = '\n' % (self.id, url, self.title) 569 | footer = "\n\n" 570 | if out == sys.stdout: # option -a or -o - 571 | header = header.encode('utf-8') 572 | out.write(header) 573 | for line in text: 574 | if out == sys.stdout: # option -a or -o - 575 | line = line.encode('utf-8') 576 | out.write(line) 577 | out.write('\n') 578 | out.write(footer) 579 | 580 | def extract(self, out): 581 | """ 582 | :param out: a memory file. 583 | """ 584 | logging.info('%s\t%s', self.id, self.title) 585 | 586 | # Separate header from text with a newline. 587 | if options.toHTML: 588 | title_str = '

<h1>' + self.title + '</h1>
    ' 589 | else: 590 | title_str = self.title + '\n' 591 | # https://www.mediawiki.org/wiki/Help:Magic_words 592 | colon = self.title.find(':') 593 | if colon != -1: 594 | ns = self.title[:colon] 595 | pagename = self.title[colon+1:] 596 | else: 597 | ns = '' # Main 598 | pagename = self.title 599 | self.magicWords['NAMESPACE'] = ns 600 | self.magicWords['NAMESPACENUMBER'] = options.knownNamespaces.get(ns, '0') 601 | self.magicWords['PAGENAME'] = pagename 602 | self.magicWords['FULLPAGENAME'] = self.title 603 | slash = pagename.rfind('/') 604 | if slash != -1: 605 | self.magicWords['BASEPAGENAME'] = pagename[:slash] 606 | self.magicWords['SUBPAGENAME'] = pagename[slash+1:] 607 | else: 608 | self.magicWords['BASEPAGENAME'] = pagename 609 | self.magicWords['SUBPAGENAME'] = '' 610 | slash = pagename.find('/') 611 | if slash != -1: 612 | self.magicWords['ROOTPAGENAME'] = pagename[:slash] 613 | else: 614 | self.magicWords['ROOTPAGENAME'] = pagename 615 | self.magicWords['CURRENTYEAR'] = time.strftime('%Y') 616 | self.magicWords['CURRENTMONTH'] = time.strftime('%m') 617 | self.magicWords['CURRENTDAY'] = time.strftime('%d') 618 | self.magicWords['CURRENTHOUR'] = time.strftime('%H') 619 | self.magicWords['CURRENTTIME'] = time.strftime('%H:%M:%S') 620 | text = self.text 621 | self.text = '' # save memory 622 | # 623 | # @see https://doc.wikimedia.org/mediawiki-core/master/php/classParser.html 624 | # This does the equivalent of internalParse(): 625 | # 626 | # $dom = $this->preprocessToDom( $text, $flag ); 627 | # $text = $frame->expand( $dom ); 628 | # 629 | text = self.transform(text) 630 | text = self.wiki2text(text) 631 | text = compact(self.clean(text)) 632 | text = [title_str] + text 633 | 634 | if sum(len(line) for line in text) < options.min_text_length: 635 | return 636 | 637 | self.write_output(out, text) 638 | 639 | errs = (self.template_title_errs, 640 | self.recursion_exceeded_1_errs, 641 | self.recursion_exceeded_2_errs, 642 | self.recursion_exceeded_3_errs) 
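The BASEPAGENAME/SUBPAGENAME derivation above splits the page name on its last slash. In isolation (helper name is illustrative):

```python
def base_and_sub(pagename):
    """Return (BASEPAGENAME, SUBPAGENAME) as the magic-word code above does."""
    slash = pagename.rfind('/')   # split on the LAST '/'
    if slash != -1:
        return pagename[:slash], pagename[slash + 1:]
    return pagename, ''           # no subpage: SUBPAGENAME is empty
```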
643 | if any(errs): 644 | logging.warn("Template errors in article '%s' (%s): title(%d) recursion(%d, %d, %d)", 645 | self.title, self.id, *errs) 646 | 647 | 648 | def transform(self, wikitext): 649 | """ 650 | Transforms wiki markup. 651 | @see https://www.mediawiki.org/wiki/Help:Formatting 652 | """ 653 | # look for matching ... 654 | res = '' 655 | cur = 0 656 | nowiki_chunks = {} 657 | nowiki_id = 0 658 | for m in nowiki.finditer(wikitext, cur): 659 | nowiki_key = "".format(nowiki_id) 660 | res += wikitext[cur:m.start()] + nowiki_key 661 | nowiki_chunks[nowiki_key] = wikitext[m.start()+len(""):m.end()-len("")] 662 | nowiki_id += 1 663 | cur = m.end() 664 | # leftover 665 | res += wikitext[cur:] 666 | 667 | res = self.transform1(res) 668 | 669 | # replace wiki chunks with actual nowiki text 670 | for key in nowiki_chunks: 671 | res = res.replace(key, nowiki_chunks[key]) 672 | 673 | return res 674 | 675 | 676 | def transform1(self, text): 677 | """Transform text not containing """ 678 | if options.expand_templates: 679 | # expand templates 680 | # See: http://www.mediawiki.org/wiki/Help:Templates 681 | return self.expand(text) 682 | else: 683 | # Drop transclusions (template, parser functions) 684 | return dropNested(text, r'{{', r'}}') 685 | 686 | 687 | def wiki2text(self, text): 688 | # 689 | # final part of internalParse().) 690 | # 691 | # $text = $this->doTableStuff( $text ); 692 | # $text = preg_replace( '/(^|\n)-----*/', '\\1
    ', $text ); 693 | # $text = $this->doDoubleUnderscore( $text ); 694 | # $text = $this->doHeadings( $text ); 695 | # $text = $this->replaceInternalLinks( $text ); 696 | # $text = $this->doAllQuotes( $text ); 697 | # $text = $this->replaceExternalLinks( $text ); 698 | # $text = str_replace( self::MARKER_PREFIX . 'NOPARSE', '', $text ); 699 | # $text = $this->doMagicLinks( $text ); 700 | # $text = $this->formatHeadings( $text, $origText, $isMain ); 701 | 702 | # Drop tables 703 | # first drop residual templates, or else empty parameter |} might look like end of table. 704 | if not options.keep_tables: 705 | text = dropNested(text, r'{{', r'}}') 706 | text = dropNested(text, r'{\|', r'\|}') 707 | 708 | # Handle bold/italic/quote 709 | if options.toHTML: 710 | text = bold_italic.sub(r'\1', text) 711 | text = bold.sub(r'\1', text) 712 | text = italic.sub(r'\1', text) 713 | else: 714 | text = bold_italic.sub(r'\1', text) 715 | text = bold.sub(r'\1', text) 716 | text = italic_quote.sub(r'"\1"', text) 717 | text = italic.sub(r'"\1"', text) 718 | text = quote_quote.sub(r'"\1"', text) 719 | # residuals of unbalanced quotes 720 | text = text.replace("'''", '').replace("''", '"') 721 | 722 | # replace internal links 723 | text = replaceInternalLinks(text) 724 | 725 | # replace external links 726 | text = replaceExternalLinks(text) 727 | 728 | # drop MagicWords behavioral switches 729 | text = magicWordsRE.sub('', text) 730 | 731 | # ############### Process HTML ############### 732 | 733 | # turn into HTML, except for the content of 734 | res = '' 735 | cur = 0 736 | for m in syntaxhighlight.finditer(text): 737 | res += unescape(text[cur:m.start()]) + m.group(1) 738 | cur = m.end() 739 | text = res + unescape(text[cur:]) 740 | return text 741 | 742 | 743 | def clean(self, text): 744 | """ 745 | Removes irrelevant parts from :param: text. 
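The bold/italic reductions used in wiki2text above, as a standalone sketch of the text-mode branch (bold markup is dropped, italics become plain double quotes):

```python
import re

# Same quote patterns as above, text-mode substitutions only.
bold_italic = re.compile(r"'''''(.*?)'''''")
bold = re.compile(r"'''(.*?)'''")
italic = re.compile(r"''(.*?)''")

def drop_wiki_quotes(text):
    text = bold_italic.sub(r'\1', text)   # '''''x''''' -> x
    text = bold.sub(r'\1', text)          # '''x''' -> x
    text = italic.sub(r'"\1"', text)      # ''x'' -> "x"
    return text
```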
746 | """ 747 | 748 | # Collect spans 749 | spans = [] 750 | # Drop HTML comments 751 | for m in comment.finditer(text): 752 | spans.append((m.start(), m.end())) 753 | 754 | # Drop self-closing tags 755 | for pattern in selfClosing_tag_patterns: 756 | for m in pattern.finditer(text): 757 | spans.append((m.start(), m.end())) 758 | 759 | # Drop ignored tags 760 | for left, right in options.ignored_tag_patterns: 761 | for m in left.finditer(text): 762 | spans.append((m.start(), m.end())) 763 | for m in right.finditer(text): 764 | spans.append((m.start(), m.end())) 765 | 766 | # Bulk remove all spans 767 | text = dropSpans(spans, text) 768 | 769 | # Drop discarded elements 770 | for tag in options.discardElements: 771 | text = dropNested(text, r'<\s*%s\b[^>/]*>' % tag, r'<\s*/\s*%s>' % tag) 772 | 773 | if not options.toHTML: 774 | # Turn into text what is left (&nbsp;) and 775 | text = unescape(text) 776 | 777 | # Expand placeholders 778 | for pattern, placeholder in placeholder_tag_patterns: 779 | index = 1 780 | for match in pattern.finditer(text): 781 | text = text.replace(match.group(), '%s_%d' % (placeholder, index)) 782 | index += 1 783 | 784 | text = text.replace('<<', '«').replace('>>', '»') 785 | 786 | ############################################# 787 | 788 | # Cleanup text 789 | text = text.replace('\t', ' ') 790 | text = spaces.sub(' ', text) 791 | text = dots.sub('...', text) 792 | text = re.sub(' (,:\.\)\]»)', r'\1', text) 793 | text = re.sub('(\[\(«) ', r'\1', text) 794 | text = re.sub(r'\n\W+?\n', '\n', text, flags=re.U) # lines with only punctuations 795 | text = text.replace(',,', ',').replace(',.', '.') 796 | if options.keep_tables: 797 | # the following regular expressions are used to remove the wikiml chartacters around table strucutures 798 | # yet keep the content. The order here is imporant so we remove certain markup like {| and then 799 | # then the future html attributes such as 'style'. Finally we drop the remaining '|-' that delimits cells. 
800 | text = re.sub(r'!(?:\s)?style=\"[a-z]+:(?:\d+)%;\"', r'', text) 801 | text = re.sub(r'!(?:\s)?style="[a-z]+:(?:\d+)%;[a-z]+:(?:#)?(?:[0-9a-z]+)?"', r'', text) 802 | text = text.replace('|-', '') 803 | text = text.replace('|', '') 804 | if options.toHTML: 805 | text = cgi.escape(text) 806 | return text 807 | 808 | 809 | # ---------------------------------------------------------------------- 810 | # Expand templates 811 | 812 | maxTemplateRecursionLevels = 30 813 | maxParameterRecursionLevels = 10 814 | 815 | # check for template beginning 816 | reOpen = re.compile('(?= self.maxTemplateRecursionLevels: 844 | self.recursion_exceeded_1_errs += 1 845 | return res 846 | 847 | # logging.debug('%*s %s', self.frame.depth, '', res) 857 | return res 858 | 859 | 860 | def templateParams(self, parameters): 861 | """ 862 | Build a dictionary with positional or name key to expanded parameters. 863 | :param parameters: the parts[1:] of a template, i.e. all except the title. 864 | """ 865 | templateParams = {} 866 | 867 | if not parameters: 868 | return templateParams 869 | # logging.debug('%*s 898 | # Parameters may span several lines, like: 899 | # {{Reflist|colwidth=30em|refs= 900 | # <ref name="Goode">Title</ref> 901 | 902 | # The '=' might occurr within an HTML attribute: 903 | # "<ref name=value" 904 | # but we stop at first. 905 | m = re.match(' *([^=]*?) *?=(.*)', param, re.DOTALL) 906 | if m: 907 | # This is a named parameter. This case also handles parameter 908 | # assignments like "2=xxx", where the number of an unnamed 909 | # parameter ("2") is specified explicitly - this is handled 910 | # transparently. 
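The named-parameter detection in templateParams uses the RE shown above: split "name=value" at the first equals sign, treating anything without one as positional. A simplified sketch (`parse_param` is an illustrative name; it trims whitespace unconditionally, whereas the real code keeps it when the value contains a link):

```python
import re

def parse_param(param):
    """Return (name, value) for a named part, or (None, value) for a positional one."""
    m = re.match(' *([^=]*?) *?=(.*)', param, re.DOTALL)
    if m:
        return m.group(1).strip(), m.group(2).strip()
    return None, param.strip()
```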
911 | 912 | parameterName = m.group(1).strip() 913 | parameterValue = m.group(2) 914 | 915 | if ']]' not in parameterValue: # if the value does not contain a link, trim whitespace 916 | parameterValue = parameterValue.strip() 917 | templateParams[parameterName] = parameterValue 918 | else: 919 | # this is an unnamed parameter 920 | unnamedParameterCounter += 1 921 | 922 | if ']]' not in param: # if the value does not contain a link, trim whitespace 923 | param = param.strip() 924 | templateParams[str(unnamedParameterCounter)] = param 925 | # logging.debug('%*stemplateParams> %s', self.frame.length, '', '|'.join(templateParams.values())) 926 | return templateParams 927 | 928 | 929 | def expandTemplate(self, body): 930 | """Expands template invocation. 931 | :param body: the parts of a template. 932 | 933 | :see http://meta.wikimedia.org/wiki/Help:Expansion for an explanation 934 | of the process. 935 | 936 | See in particular: Expansion of names and values 937 | http://meta.wikimedia.org/wiki/Help:Expansion#Expansion_of_names_and_values 938 | 939 | For most parser functions all names and values are expanded, 940 | regardless of what is relevant for the result. The branching functions 941 | (#if, #ifeq, #iferror, #ifexist, #ifexpr, #switch) are exceptions. 942 | 943 | All names in a template call are expanded, and the titles of the 944 | tplargs in the template body, after which it is determined which 945 | values must be expanded, and for which tplargs in the template body 946 | the first part (default) [sic in the original doc page]. 947 | 948 | In the case of a tplarg, any parts beyond the first are never 949 | expanded. The possible name and the value of the first part is 950 | expanded if the title does not match a name in the template call. 
951 | 952 | :see code for braceSubstitution at 953 | https://doc.wikimedia.org/mediawiki-core/master/php/html/Parser_8php_source.html#3397: 954 | 955 | """ 956 | 957 | # template = "{{" parts "}}" 958 | 959 | # Templates and tplargs are decomposed in the same way, with pipes as 960 | # separator, even though eventually any parts in a tplarg after the first 961 | # (the parameter default) are ignored, and an equals sign in the first 962 | # part is treated as plain text. 963 | # Pipes inside inner templates and tplargs, or inside double rectangular 964 | # brackets within the template or tplargs are not taken into account in 965 | # this decomposition. 966 | # The first part is called title, the other parts are simply called parts. 967 | 968 | # If a part has one or more equals signs in it, the first equals sign 969 | # determines the division into name = value. Equals signs inside inner 970 | # templates and tplargs, or inside double rectangular brackets within the 971 | # part are not taken into account in this decomposition. Parts without 972 | # equals sign are indexed 1, 2, .., given as attribute in the tag. 973 | 974 | if self.frame.depth >= self.maxTemplateRecursionLevels: 975 | self.recursion_exceeded_2_errs += 1 976 | # logging.debug('%*sEXPAND> %s', self.frame.depth, '', body) 977 | return '' 978 | 979 | logging.debug('%*sEXPAND %s', self.frame.depth, '', body) 980 | parts = splitParts(body) 981 | # title is the portion before the first | 982 | title = parts[0].strip() 983 | title = self.expand(title) 984 | 985 | # SUBST 986 | # Apply the template tag to parameters without 987 | # substituting into them, e.g. 
988 | # {{subst:t|a{{{p|q}}}b}} gives the wikitext start-a{{{p|q}}}b-end 989 | # @see https://www.mediawiki.org/wiki/Manual:Substitution#Partial_substitution 990 | subst = False 991 | if re.match(substWords, title, re.IGNORECASE): 992 | title = re.sub(substWords, '', title, 1, re.IGNORECASE) 993 | subst = True 994 | 995 | if title in self.magicWords.values: 996 | ret = self.magicWords[title] 997 | logging.debug('%*s 1: 1015 | funct = title[:colon] 1016 | parts[0] = title[colon + 1:].strip() # side-effect (parts[0] not used later) 1017 | if subst and self.frame.depth > 0 and funct == '#invoke' and len(parts) <= 2: 1018 | # if substitution happened, the actual arguments are probably on the previous frame 1019 | self.frame.pop() 1020 | for k in self.frame.args: 1021 | if re.match(r'^\d+$', k): 1022 | # unnamed parameter 1023 | parts += [self.frame.args[k]] 1024 | else: 1025 | parts += ["{}={}".format(k, self.frame.args[k])] 1026 | 1027 | # arguments after first are not evaluated 1028 | ret = callParserFunction(funct, parts, self) 1029 | logging.debug('%*s 1: 1162 | # rest are new parameters 1163 | parameters.extend(par[1:]) 1164 | else: 1165 | parameters = par 1166 | elif not parameters: 1167 | parameters = [''] # create first param 1168 | # add span to last previous parameter 1169 | parameters[-1] += paramsList[s:e] 1170 | cur = e 1171 | # leftover 1172 | par = paramsList[cur:].split(sep) 1173 | if par: 1174 | if parameters: 1175 | # portion before | belongs to previous parameter 1176 | parameters[-1] += par[0] 1177 | if len(par) > 1: 1178 | # rest are new parameters 1179 | parameters.extend(par[1:]) 1180 | else: 1181 | parameters = par 1182 | 1183 | # logging.debug('splitParts %s %s\nparams: %s', sep, paramsList, text_type(parameters)) 1184 | return parameters 1185 | 1186 | 1187 | def findMatchingBraces(text, ldelim=0): 1188 | """ 1189 | :param ldelim: number of braces to match. 0 means match [[]], {{}} and {{{}}}. 
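A drastically simplified sketch of what findMatchingBraces yields for templates only: scan `{{`/`}}` tokens with a depth counter and emit (start, end) spans of top-level matches. This ignores tplargs, links, and all the ambiguous brace runs the real generator resolves; `find_templates` is an illustrative name:

```python
import re

def find_templates(text):
    """Yield (start, end) spans of balanced top-level {{ ... }} regions."""
    depth = 0
    start = 0
    for m in re.finditer(r'{{|}}', text):
        if m.group() == '{{':
            if depth == 0:
                start = m.start()   # opening of a top-level template
            depth += 1
        elif depth:                 # '}}' with an open template
            depth -= 1
            if depth == 0:
                yield start, m.end()
```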
1190 | """ 1191 | # Parsing is done with respect to pairs of double braces {{..}} delimiting 1192 | # a template, and pairs of triple braces {{{..}}} delimiting a tplarg. 1193 | # If double opening braces are followed by triple closing braces or 1194 | # conversely, this is taken as delimiting a template, with one left-over 1195 | # brace outside it, taken as plain text. For any pattern of braces this 1196 | # defines a set of templates and tplargs such that any two are either 1197 | # separate or nested (not overlapping). 1198 | 1199 | # Unmatched double rectangular closing brackets can be in a template or 1200 | # tplarg, but unmatched double rectangular opening brackets cannot. 1201 | # Unmatched double or triple closing braces inside a pair of 1202 | # double rectangular brackets are treated as plain text. 1203 | # Other formulation: in ambiguity between template or tplarg on one hand, 1204 | # and a link on the other hand, the structure with the rightmost opening 1205 | # takes precedence, even if this is the opening of a link without any 1206 | # closing, so not producing an actual link. 1207 | 1208 | # In the case of more than three opening braces the last three are assumed 1209 | # to belong to a tplarg, unless there is no matching triple of closing 1210 | # braces, in which case the last two opening braces are are assumed to 1211 | # belong to a template. 
1212 | 1213 | # We must skip individual { like in: 1214 | # {{#ifeq: {{padleft:|1|}} | { | |  }} 1215 | # We must resolve ambiguities like this: 1216 | # {{{{ }}}} -> { {{{ }}} } 1217 | # {{{{{ }}}}} -> {{ {{{ }}} }} 1218 | # {{#if:{{{{{#if:{{{nominee|}}}|nominee|candidate}}|}}}|...}} 1219 | # {{{!}} {{!}}} 1220 | 1221 | # Handle: 1222 | # {{{{{|safesubst:}}}#Invoke:String|replace|{{{1|{{{{{|safesubst:}}}PAGENAME}}}}}|%s+%([^%(]-%)$||plain=false}} 1223 | # as well as expressions with stray }: 1224 | # {{{link|{{ucfirst:{{{1}}}}}} interchange}}} 1225 | 1226 | if ldelim: # 2-3 1227 | reOpen = re.compile('[{]{%d,}' % ldelim) # at least ldelim 1228 | reNext = re.compile('[{]{2,}|}{2,}') # at least 2 1229 | else: 1230 | reOpen = re.compile('{{2,}|\[{2,}') 1231 | reNext = re.compile('{{2,}|}{2,}|\[{2,}|]{2,}') # at least 2 1232 | 1233 | cur = 0 1234 | while True: 1235 | m1 = reOpen.search(text, cur) 1236 | if not m1: 1237 | return 1238 | lmatch = m1.end() - m1.start() 1239 | if m1.group()[0] == '{': 1240 | stack = [lmatch] # stack of opening braces lengths 1241 | else: 1242 | stack = [-lmatch] # negative means [ 1243 | end = m1.end() 1244 | while True: 1245 | m2 = reNext.search(text, end) 1246 | if not m2: 1247 | return # unbalanced 1248 | end = m2.end() 1249 | brac = m2.group()[0] 1250 | lmatch = m2.end() - m2.start() 1251 | 1252 | if brac == '{': 1253 | stack.append(lmatch) 1254 | elif brac == '}': 1255 | while stack: 1256 | openCount = stack.pop() # opening span 1257 | if openCount == 0: # illegal unmatched [[ 1258 | continue 1259 | if lmatch >= openCount: 1260 | lmatch -= openCount 1261 | if lmatch <= 1: # either close or stray } 1262 | break 1263 | else: 1264 | # put back unmatched 1265 | stack.append(openCount - lmatch) 1266 | break 1267 | if not stack: 1268 | yield m1.start(), end - lmatch 1269 | cur = end 1270 | break 1271 | elif len(stack) == 1 and 0 < stack[0] < ldelim: 1272 | # ambiguous {{{{{ }}} }} 1273 | yield m1.start() + stack[0], end 1274 | cur = end 
1275 | break 1276 | elif brac == '[': # [[ 1277 | stack.append(-lmatch) 1278 | else: # ]] 1279 | while stack and stack[-1] < 0: # matching [[ 1280 | openCount = -stack.pop() 1281 | if lmatch >= openCount: 1282 | lmatch -= openCount 1283 | if lmatch <= 1: # either close or stray ] 1284 | break 1285 | else: 1286 | # put back unmatched (negative) 1287 | stack.append(lmatch - openCount) 1288 | break 1289 | if not stack: 1290 | yield m1.start(), end - lmatch 1291 | cur = end 1292 | break 1293 | # unmatched ]] are discarded 1294 | cur = end 1295 | 1296 | 1297 | def findBalanced(text, openDelim=['[['], closeDelim=[']]']): 1298 | """ 1299 | Assuming that text contains a properly balanced expression using 1300 | :param openDelim: as opening delimiters and 1301 | :param closeDelim: as closing delimiters. 1302 | :return: an iterator producing pairs (start, end) of start and end 1303 | positions in text containing a balanced expression. 1304 | """ 1305 | openPat = '|'.join([re.escape(x) for x in openDelim]) 1306 | # pattern for delimiters expected after each opening delimiter 1307 | afterPat = {o: re.compile(openPat + '|' + c, re.DOTALL) for o, c in zip(openDelim, closeDelim)} 1308 | stack = [] 1309 | start = 0 1310 | cur = 0 1311 | # end = len(text) 1312 | startSet = False 1313 | startPat = re.compile(openPat) 1314 | nextPat = startPat 1315 | while True: 1316 | next = nextPat.search(text, cur) 1317 | if not next: 1318 | return 1319 | if not startSet: 1320 | start = next.start() 1321 | startSet = True 1322 | delim = next.group(0) 1323 | if delim in openDelim: 1324 | stack.append(delim) 1325 | nextPat = afterPat[delim] 1326 | else: 1327 | opening = stack.pop() 1328 | # assert opening == openDelim[closeDelim.index(next.group(0))] 1329 | if stack: 1330 | nextPat = afterPat[stack[-1]] 1331 | else: 1332 | yield start, next.end() 1333 | nextPat = startPat 1334 | start = next.end() 1335 | startSet = False 1336 | cur = next.end() 1337 | 1338 | 1339 | # 
---------------------------------------------------------------------- 1340 | # Modules 1341 | 1342 | # Only minimal support 1343 | # FIXME: import Lua modules. 1344 | 1345 | def if_empty(*rest): 1346 | """ 1347 | This implements If_empty from English Wikipedia module: 1348 | 1349 | Module:If empty 1350 | 828 1351 | local p = {} 1352 | 1353 | function p.main(frame) 1354 | local args = require('Module:Arguments').getArgs(frame, {wrappers = 'Template:If empty', removeBlanks = false}) 1355 | 1356 | -- For backwards compatibility reasons, the first 8 parameters can be unset instead of being blank, 1357 | -- even though there's really no legitimate use case for this. At some point, this will be removed. 1358 | local lowestNil = math.huge 1359 | for i = 8,1,-1 do 1360 | if args[i] == nil then 1361 | args[i] = '' 1362 | lowestNil = i 1363 | end 1364 | end 1365 | 1366 | for k,v in ipairs(args) do 1367 | if v ~= '' then 1368 | if lowestNil < k then 1369 | -- If any uses of this template depend on the behavior above, add them to a tracking category. 1370 | -- This is a rather fragile, convoluted, hacky way to do it, but it ensures that this module's output won't be modified 1371 | -- by it. 1372 | frame:extensionTag('ref', '[[Category:Instances of Template:If_empty missing arguments]]', {group = 'TrackingCategory'}) 1373 | frame:extensionTag('references', '', {group = 'TrackingCategory'}) 1374 | end 1375 | return v 1376 | end 1377 | end 1378 | end 1379 | 1380 | return p 1381 | """ 1382 | for arg in rest: 1383 | if arg: 1384 | return arg 1385 | return '' 1386 | 1387 | 1388 | # ---------------------------------------------------------------------- 1389 | # String module emulation 1390 | # https://en.wikipedia.org/wiki/Module:String 1391 | 1392 | def functionParams(args, vars): 1393 | """ 1394 | Build a dictionary of var/value from :param: args. 1395 | Parameters can be either named or unnamed. In the latter case, their 1396 | name is taken from :param: vars. 
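For example (illustrative, not part of the original docstring):
functionParams({'1': 'abc', 'j': '2'}, ('s', 'i', 'j'))
returns {'s': 'abc', 'i': '', 'j': '2'}: the unnamed argument '1' fills the
first var 's', 'i' falls back to the empty string, and the named argument
'j' is used as is.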
1397 | """ 1398 | params = {} 1399 | index = 1 1400 | for var in vars: 1401 | value = args.get(var) 1402 | if value is None: 1403 | value = args.get(str(index)) # positional argument 1404 | if value is None: 1405 | value = '' 1406 | else: 1407 | index += 1 1408 | params[var] = value 1409 | return params 1410 | 1411 | 1412 | def string_sub(args): 1413 | params = functionParams(args, ('s', 'i', 'j')) 1414 | s = params.get('s', '') 1415 | i = int(params.get('i', 1) or 1) # or handles case of '' value 1416 | j = int(params.get('j', -1) or -1) 1417 | if i > 0: i -= 1 # lua is 1-based 1418 | if j < 0: j += 1 1419 | if j == 0: j = len(s) 1420 | return s[i:j] 1421 | 1422 | 1423 | def string_sublength(args): 1424 | params = functionParams(args, ('s', 'i', 'len')) 1425 | s = params.get('s', '') 1426 | i = int(params.get('i', 1) or 1) - 1 # lua is 1-based 1427 | length = int(params.get('len', 1) or 1) # avoid shadowing builtin len 1428 | return s[i:i+length] 1429 | 1430 | 1431 | def string_len(args): 1432 | params = functionParams(args, ('s',)) 1433 | s = params.get('s', '') 1434 | return len(s) 1435 | 1436 | 1437 | def string_find(args): 1438 | params = functionParams(args, ('source', 'target', 'start', 'plain')) 1439 | source = params.get('source', '') 1440 | pattern = params.get('target', '') 1441 | start = int('0'+params.get('start', 1)) - 1 # lua is 1-based 1442 | plain = int('0'+params.get('plain', 1)) 1443 | if source == '' or pattern == '': 1444 | return 0 1445 | if plain: 1446 | return source.find(pattern, start) + 1 # lua is 1-based 1447 | else: 1448 | m = re.compile(pattern).search(source, start) 1449 | return m.start() + 1 if m else 0 # a Match object cannot be added to an int 1450 | 1451 | def string_pos(args): 1452 | params = functionParams(args, ('target', 'pos')) 1453 | target = params.get('target', '') 1454 | pos = int(params.get('pos', 1) or 1) 1455 | if pos > 0: 1456 | pos -= 1 # The first character has an index value of 1 1457 | return target[pos] 1458 | 1459 | 1460 | def string_replace(args): 1461 | params = functionParams(args, ('source', 'pattern', 
'replace', 'count', 'plain')) 1462 | source = params.get('source', '') 1463 | pattern = params.get('pattern', '') 1464 | replace = params.get('replace', '') 1465 | count = int(params.get('count', 0) or 0) 1466 | plain = int(params.get('plain', 1) or 1) 1467 | if plain: 1468 | if count: 1469 | return source.replace(pattern, replace, count) 1470 | else: 1471 | return source.replace(pattern, replace) 1472 | else: 1473 | return re.compile(pattern).sub(replace, source, count) 1474 | 1475 | 1476 | def string_rep(args): 1477 | params = functionParams(args, ('s', 'count')) # was ('s'), which never exposed 'count' 1478 | source = params.get('s', '') 1479 | count = int(params.get('count', '1') or 1) 1480 | return source * count 1481 | 1482 | 1483 | # ---------------------------------------------------------------------- 1484 | # Module:Roman 1485 | # http://en.wikipedia.org/w/index.php?title=Module:Roman 1486 | # Modulo:Numero_romano 1487 | # https://it.wikipedia.org/wiki/Modulo:Numero_romano 1488 | 1489 | def roman_main(args): 1490 | """Convert first arg to a Roman numeral if < 5000, else :return: second arg.""" 1491 | num = int(float(args.get('1'))) 1492 | 1493 | # Return a message for numbers too big to be expressed in Roman numerals. 1494 | if num < 0 or num >= 5000: 1495 | return args.get('2', 'N/A') 1496 | 1497 | def toRoman(n, romanNumeralMap): 1498 | """convert integer to Roman numeral""" 1499 | result = "" 1500 | for integer, numeral in romanNumeralMap: 1501 | while n >= integer: 1502 | result += numeral 1503 | n -= integer 1504 | return result 1505 | 1506 | # Find the Roman numerals for numbers 4999 or less. 
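# Illustrative example (not in the original source): with the table below,
# toRoman(1999, smallRomans) yields 'MCMXCIX', i.e. the expansion of the
# ChangeLog's {{numero romano|1999}} example.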
1507 | smallRomans = ( 1508 | (1000, "M"), 1509 | (900, "CM"), (500, "D"), (400, "CD"), (100, "C"), 1510 | (90, "XC"), (50, "L"), (40, "XL"), (10, "X"), 1511 | (9, "IX"), (5, "V"), (4, "IV"), (1, "I") 1512 | ) 1513 | return toRoman(num, smallRomans) 1514 | 1515 | # ---------------------------------------------------------------------- 1516 | 1517 | modules = { 1518 | 'convert': { 1519 | 'convert': lambda params: params['1'] + ' ' + params['2'], # no conversion 1520 | }, 1521 | 1522 | 'If empty': { 1523 | 'main': if_empty 1524 | }, 1525 | 1526 | 'String': { 1527 | 'len': string_len, 1528 | 'sub': string_sub, 1529 | 'sublength': string_sublength, 1530 | 'pos': string_pos, 1531 | 'find': string_find, 1532 | 'replace': string_replace, 1533 | 'rep': string_rep, 1534 | }, 1535 | 1536 | 'Roman': { 1537 | 'main': roman_main 1538 | }, 1539 | 1540 | 'Numero romano': { 1541 | 'main': roman_main 1542 | } 1543 | } 1544 | 1545 | # ---------------------------------------------------------------------- 1546 | # variables 1547 | 1548 | 1549 | class MagicWords(object): 1550 | """ 1551 | One copy in each Extractor. 
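For example (illustrative): MagicWords()['!'] returns '|', because __init__
presets the '!' magic word; any other name defaults to None until assigned.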
1552 | 1553 | @see https://doc.wikimedia.org/mediawiki-core/master/php/MagicWord_8php_source.html 1554 | """ 1555 | names = [ 1556 | '!', 1557 | 'currentmonth', 1558 | 'currentmonth1', 1559 | 'currentmonthname', 1560 | 'currentmonthnamegen', 1561 | 'currentmonthabbrev', 1562 | 'currentday', 1563 | 'currentday2', 1564 | 'currentdayname', 1565 | 'currentyear', 1566 | 'currenttime', 1567 | 'currenthour', 1568 | 'localmonth', 1569 | 'localmonth1', 1570 | 'localmonthname', 1571 | 'localmonthnamegen', 1572 | 'localmonthabbrev', 1573 | 'localday', 1574 | 'localday2', 1575 | 'localdayname', 1576 | 'localyear', 1577 | 'localtime', 1578 | 'localhour', 1579 | 'numberofarticles', 1580 | 'numberoffiles', 1581 | 'numberofedits', 1582 | 'articlepath', 1583 | 'pageid', 1584 | 'sitename', 1585 | 'server', 1586 | 'servername', 1587 | 'scriptpath', 1588 | 'stylepath', 1589 | 'pagename', 1590 | 'pagenamee', 1591 | 'fullpagename', 1592 | 'fullpagenamee', 1593 | 'namespace', 1594 | 'namespacee', 1595 | 'namespacenumber', 1596 | 'currentweek', 1597 | 'currentdow', 1598 | 'localweek', 1599 | 'localdow', 1600 | 'revisionid', 1601 | 'revisionday', 1602 | 'revisionday2', 1603 | 'revisionmonth', 1604 | 'revisionmonth1', 1605 | 'revisionyear', 1606 | 'revisiontimestamp', 1607 | 'revisionuser', 1608 | 'revisionsize', 1609 | 'subpagename', 1610 | 'subpagenamee', 1611 | 'talkspace', 1612 | 'talkspacee', 1613 | 'subjectspace', 1614 | 'subjectspacee', 1615 | 'talkpagename', 1616 | 'talkpagenamee', 1617 | 'subjectpagename', 1618 | 'subjectpagenamee', 1619 | 'numberofusers', 1620 | 'numberofactiveusers', 1621 | 'numberofpages', 1622 | 'currentversion', 1623 | 'rootpagename', 1624 | 'rootpagenamee', 1625 | 'basepagename', 1626 | 'basepagenamee', 1627 | 'currenttimestamp', 1628 | 'localtimestamp', 1629 | 'directionmark', 1630 | 'contentlanguage', 1631 | 'numberofadmins', 1632 | 'cascadingsources', 1633 | ] 1634 | 1635 | def __init__(self): 1636 | self.values = {'!': '|'} 1637 | 1638 | def 
__getitem__(self, name): 1639 | return self.values.get(name) 1640 | 1641 | def __setitem__(self, name, value): 1642 | self.values[name] = value 1643 | 1644 | switches = ( 1645 | '__NOTOC__', 1646 | '__FORCETOC__', 1647 | '__TOC__', 1649 | '__NEWSECTIONLINK__', 1650 | '__NONEWSECTIONLINK__', 1651 | '__NOGALLERY__', 1652 | '__HIDDENCAT__', 1653 | '__NOCONTENTCONVERT__', 1654 | '__NOCC__', 1655 | '__NOTITLECONVERT__', 1656 | '__NOTC__', 1657 | '__START__', 1658 | '__END__', 1659 | '__INDEX__', 1660 | '__NOINDEX__', 1661 | '__STATICREDIRECT__', 1662 | '__DISAMBIG__' 1663 | ) 1664 | 1665 | 1666 | magicWordsRE = re.compile('|'.join(MagicWords.switches)) 1667 | 1668 | 1669 | # ---------------------------------------------------------------------- 1670 | # parser functions utilities 1671 | 1672 | 1673 | def ucfirst(string): 1674 | """:return: a string with just its first character uppercase 1675 | We can't use title() since it converts all words. 1676 | """ 1677 | if string: 1678 | return string[0].upper() + string[1:] 1679 | else: 1680 | return '' 1681 | 1682 | 1683 | def lcfirst(string): 1684 | """:return: a string with its first character lowercase""" 1685 | if string: 1686 | if len(string) > 1: 1687 | return string[0].lower() + string[1:] 1688 | else: 1689 | return string.lower() 1690 | else: 1691 | return '' 1692 | 1693 | 1694 | def fullyQualifiedTemplateTitle(templateTitle): 1695 | """ 1696 | Determine the namespace of the page being included through the template 1697 | mechanism 1698 | """ 1699 | if templateTitle.startswith(':'): 1700 | # Leading colon by itself implies main namespace, so strip this colon 1701 | return ucfirst(templateTitle[1:]) 1702 | else: 1703 | m = re.match('([^:]*)(:.*)', templateTitle) 1704 | if m: 1705 | # colon found but not in the first position - check if it 1706 | # designates a known namespace 1707 | prefix = normalizeNamespace(m.group(1)) 1708 | if prefix in options.knownNamespaces: 1709 | return prefix + 
ucfirst(m.group(2)) 1710 | # The title of the page being included is NOT in the main namespace and 1711 | # lacks any other explicit designation of the namespace - therefore, it 1712 | # is resolved to the Template namespace (that's the default for the 1713 | # template inclusion mechanism). 1714 | 1715 | # This is a defense against pages whose title only contains UTF-8 chars 1716 | # that are reduced to an empty string. Right now I can think of one such 1717 | # case - <C2><A0> which represents the non-breaking space. 1718 | # In this particular case, this page is a redirect to [[Non-breaking 1719 | # space]], but having in the system a redirect page with an empty title 1720 | # causes numerous problems, so we'll live happier without it. 1721 | if templateTitle: 1722 | return options.templatePrefix + ucfirst(templateTitle) 1723 | else: 1724 | return '' # caller may log as error 1725 | 1726 | 1727 | def normalizeNamespace(ns): 1728 | return ucfirst(ns) 1729 | 1730 | 1731 | # ---------------------------------------------------------------------- 1732 | # Parser functions 1733 | # see http://www.mediawiki.org/wiki/Help:Extension:ParserFunctions 1734 | # https://github.com/Wikia/app/blob/dev/extensions/ParserFunctions/ParserFunctions_body.php 1735 | 1736 | 1737 | class Infix: 1738 | """Infix operators. 
1739 | The calling sequence for the infix is: 1740 | x |op| y 1741 | """ 1742 | 1743 | def __init__(self, function): 1744 | self.function = function 1745 | 1746 | def __ror__(self, other): 1747 | return Infix(lambda x, self=self, other=other: self.function(other, x)) 1748 | 1749 | def __or__(self, other): 1750 | return self.function(other) 1751 | 1752 | def __rlshift__(self, other): 1753 | return Infix(lambda x, self=self, other=other: self.function(other, x)) 1754 | 1755 | def __rshift__(self, other): 1756 | return self.function(other) 1757 | 1758 | def __call__(self, value1, value2): 1759 | return self.function(value1, value2) 1760 | 1761 | 1762 | ROUND = Infix(lambda x, y: round(x, y)) 1763 | 1764 | 1765 | from math import floor, ceil, pi, e, trunc, exp, log as ln, sin, cos, tan, asin, acos, atan 1766 | 1767 | 1768 | def sharp_expr(extr, expr): 1769 | """Tries converting a lua expr into a Python expr.""" 1770 | try: 1771 | expr = extr.expand(expr) 1772 | expr = re.sub('(?<![!<>])=', '==', expr) # negative lookbehind 1773 | expr = re.sub('mod', '%', expr) # no \b here 1774 | expr = re.sub(r'\bdiv\b', '/', expr) # raw string, so \b is a word boundary 1775 | expr = re.sub(r'\bround\b', '|ROUND|', expr) 1776 | return text_type(eval(expr)) 1777 | except: 1778 | return '%s' % expr 1779 | 1780 | 1781 | def sharp_if(extr, testValue, valueIfTrue, valueIfFalse=None, *args): 1782 | # In theory, we should evaluate the first argument here, 1783 | # but it was evaluated while evaluating part[0] in expandTemplate(). 1784 | if testValue.strip(): 1785 | # The {{#if:}} function is an if-then-else construct. 1786 | # The applied condition is: "The condition string is non-empty". 
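# For instance (illustrative, assuming expand() returns its argument here):
# {{#if: x | yes | no}} yields 'yes', while {{#if: | yes | no}} yields 'no'.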
1787 | valueIfTrue = extr.expand(valueIfTrue.strip()) # eval 1788 | if valueIfTrue: 1789 | return valueIfTrue 1790 | elif valueIfFalse: 1791 | return extr.expand(valueIfFalse.strip()) # eval 1792 | return "" 1793 | 1794 | 1795 | def sharp_ifeq(extr, lvalue, rvalue, valueIfTrue, valueIfFalse=None, *args): 1796 | rvalue = rvalue.strip() 1797 | if rvalue: 1798 | # lvalue is always evaluated 1799 | if lvalue.strip() == rvalue: 1800 | # The {{#ifeq:}} function is an if-then-else construct. The 1801 | # applied condition is "is rvalue equal to lvalue". Note that this 1802 | # only does string comparison, while the MediaWiki implementation 1803 | # also supports numerical comparisons. 1804 | 1805 | if valueIfTrue: 1806 | return extr.expand(valueIfTrue.strip()) 1807 | else: 1808 | if valueIfFalse: 1809 | return extr.expand(valueIfFalse.strip()) 1810 | return "" 1811 | 1812 | 1813 | def sharp_iferror(extr, test, then='', Else=None, *args): 1814 | if re.match('<(?:strong|span|p|div)\s(?:[^\s>]*\s+)*?class="(?:[^"\s>]*\s+)*?error(?:\s[^">]*)?"', test): 1815 | return extr.expand(then.strip()) 1816 | elif Else is None: 1817 | return test.strip() 1818 | else: 1819 | return extr.expand(Else.strip()) 1820 | 1821 | 1822 | def sharp_switch(extr, primary, *params): 1823 | # FIXME: we don't support numeric expressions in primary 1824 | 1825 | # {{#switch: comparison string 1826 | # | case1 = result1 1827 | # | case2 1828 | # | case4 = result2 1829 | # | 1 | case5 = result3 1830 | # | #default = result4 1831 | # }} 1832 | 1833 | primary = primary.strip() 1834 | found = False # for fall through cases 1835 | default = None 1836 | rvalue = None 1837 | lvalue = '' 1838 | for param in params: 1839 | # handle cases like: 1840 | # #default = [http://www.perseus.tufts.edu/hopper/text?doc=Perseus...] 
1841 | pair = param.split('=', 1) 1842 | lvalue = extr.expand(pair[0].strip()) 1843 | rvalue = None 1844 | if len(pair) > 1: 1845 | # got "=" 1846 | rvalue = extr.expand(pair[1].strip()) 1847 | # check for any of multiple values pipe separated 1848 | if found or primary in [v.strip() for v in lvalue.split('|')]: 1849 | # Found a match, return now 1850 | return rvalue 1851 | elif lvalue == '#default': 1852 | default = rvalue 1853 | rvalue = None # avoid defaulting to last case 1854 | elif lvalue == primary: 1855 | # If the value matches, set a flag and continue 1856 | found = True 1857 | # Default case 1858 | # Check if the last item had no = sign, thus specifying the default case 1859 | if rvalue is not None: 1860 | return lvalue 1861 | elif default is not None: 1862 | return default 1863 | return '' 1864 | 1865 | 1866 | # Extension Scribunto: https://www.mediawiki.org/wiki/Extension:Scribunto 1867 | def sharp_invoke(module, function, args): 1868 | functions = modules.get(module) 1869 | if functions: 1870 | funct = functions.get(function) 1871 | if funct: 1872 | return text_type(funct(args)) 1873 | return '' 1874 | 1875 | 1876 | parserFunctions = { 1877 | 1878 | '#expr': sharp_expr, 1879 | 1880 | '#if': sharp_if, 1881 | 1882 | '#ifeq': sharp_ifeq, 1883 | 1884 | '#iferror': sharp_iferror, 1885 | 1886 | '#ifexpr': lambda *args: '', # not supported 1887 | 1888 | '#ifexist': lambda extr, title, ifex, ifnex: extr.expand(ifnex), # assuming title is not present 1889 | 1890 | '#rel2abs': lambda *args: '', # not supported 1891 | 1892 | '#switch': sharp_switch, 1893 | 1894 | '#language': lambda *args: '', # not supported 1895 | 1896 | '#time': lambda *args: '', # not supported 1897 | 1898 | '#timel': lambda *args: '', # not supported 1899 | 1900 | '#titleparts': lambda *args: '', # not supported 1901 | 1902 | # This function is used in some pages to construct links 1903 | # http://meta.wikimedia.org/wiki/Help:URL 1904 | 'urlencode': lambda extr, string, *rest: 
quote(string.encode('utf-8')), 1905 | 1906 | 'lc': lambda extr, string, *rest: string.lower() if string else '', 1907 | 1908 | 'lcfirst': lambda extr, string, *rest: lcfirst(string), 1909 | 1910 | 'uc': lambda extr, string, *rest: string.upper() if string else '', 1911 | 1912 | 'ucfirst': lambda extr, string, *rest: ucfirst(string), 1913 | 1914 | 'int': lambda extr, string, *rest: text_type(int(string)), 1915 | 1916 | } 1917 | 1918 | 1919 | def callParserFunction(functionName, args, extractor): 1920 | """ 1921 | Parser functions have similar syntax as templates, except that 1922 | the first argument is everything after the first colon. 1923 | :return: the result of the invocation, None in case of failure. 1924 | 1925 | :param: args not yet expanded (see branching functions). 1926 | https://www.mediawiki.org/wiki/Help:Extension:ParserFunctions 1927 | """ 1928 | 1929 | try: 1930 | # https://it.wikipedia.org/wiki/Template:Str_endswith has #Invoke 1931 | functionName = functionName.lower() 1932 | if functionName == '#invoke': 1933 | module, fun = args[0].strip(), args[1].strip() 1934 | logging.debug('%*s#invoke %s %s %s', extractor.frame.depth, '', module, fun, args[2:]) 1935 | # special handling of frame 1936 | if len(args) == 2: 1937 | # find parameters in frame whose title is the one of the original 1938 | # template invocation 1939 | templateTitle = fullyQualifiedTemplateTitle(module) 1940 | if not templateTitle: 1941 | logging.warn("Template with empty title") 1942 | params = None 1943 | frame = extractor.frame 1944 | while frame: 1945 | if frame.title == templateTitle: 1946 | params = frame.args 1947 | break 1948 | frame = frame.prev 1949 | else: 1950 | params = [extractor.transform(p) for p in args[2:]] # evaluates them 1951 | params = extractor.templateParams(params) 1952 | ret = sharp_invoke(module, fun, params) 1953 | logging.debug('%*s<#invoke %s %s %s', extractor.frame.depth, '', module, fun, ret) 1954 | return ret 1955 | if functionName in parserFunctions: 
1956 | # branching functions use the extractor to selectively evaluate args 1957 | return parserFunctions[functionName](extractor, *args) 1958 | except: 1959 | return "" # FIXME: fix errors 1960 | return "" 1961 | 1962 | 1963 | # ---------------------------------------------------------------------- 1964 | # Expand using WikiMedia API 1965 | # import json 1966 | 1967 | # def expand(text): 1968 | # """Expand templates invoking MediaWiki API""" 1969 | # text = urllib.urlencode(text.encode('utf-8')) 1970 | # base = urlbase[:urlbase.rfind('/')] 1971 | # url = base + "/w/api.php?action=expandtemplates&format=json&text=" + text 1972 | # exp = json.loads(urllib.urlopen(url)) 1973 | # return exp['expandtemplates']['*'] 1974 | 1975 | # ---------------------------------------------------------------------- 1976 | # Extract Template definition 1977 | 1978 | reNoinclude = re.compile(r'<noinclude>(?:.*?)</noinclude>', re.DOTALL) 1979 | reIncludeonly = re.compile(r'<includeonly>|</includeonly>', re.DOTALL) 1980 | 1981 | def define_template(title, page): 1982 | """ 1983 | Adds a template defined in the :param page:. 1984 | @see https://en.wikipedia.org/wiki/Help:Template#Noinclude.2C_includeonly.2C_and_onlyinclude 1985 | """ 1986 | # title = normalizeTitle(title) 1987 | 1988 | # sanity check (empty template, e.g. Template:Crude Oil Prices) 1989 | if not page: return 1990 | 1991 | # check for redirects 1992 | m = re.match('#REDIRECT.*?\[\[([^\]]*)]]', page[0], re.IGNORECASE) 1993 | if m: 1994 | options.redirects[title] = m.group(1) # normalizeTitle(m.group(1)) 1995 | return 1996 | 1997 | text = unescape(''.join(page)) 1998 | 1999 | # We're storing template text for future inclusion, therefore, 2000 | # remove all <noinclude> text and keep all <includeonly> text 2001 | # (but eliminate <includeonly> tags per se). 2002 | # However, if <onlyinclude> ... </onlyinclude> parts are present, 2003 | # then only keep them and discard the rest of the template body. 2004 | # This is because using <onlyinclude> on a text fragment is 2005 | # equivalent to enclosing it in <includeonly> tags **AND** 2006 | # enclosing all the rest of the template body in <noinclude> tags. 2007 | 2008 | # remove comments 2009 | text = comment.sub('', text) 2010 | 2011 | # eliminate <noinclude> fragments 2012 | text = reNoinclude.sub('', text) 2013 | # eliminate unterminated <noinclude> elements 2014 | text = re.sub(r'<noinclude\s.*?>.*$', '', text, flags=re.DOTALL) 2015 | text = re.sub(r'<noinclude/>', '', text) 2016 | 2017 | onlyincludeAccumulator = '' 2018 | for m in re.finditer('<onlyinclude>(.*?)</onlyinclude>', text, re.DOTALL): 2019 | onlyincludeAccumulator += m.group(1) 2020 | if onlyincludeAccumulator: 2021 | text = onlyincludeAccumulator 2022 | else: 2023 | text = reIncludeonly.sub('', text) 2024 | 2025 | if text: 2026 | if title in options.templates: 2027 | logging.warn('Redefining: %s', title) 2028 | options.templates[title] = text 2029 | 2030 | 2031 | # ---------------------------------------------------------------------- 2032 | 2033 | def dropNested(text, openDelim, closeDelim): 2034 | """ 2035 | A matching function for nested expressions, e.g. namespaces and tables. 
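For example (illustrative): dropNested('a{{b{{c}}}}d', r'{{', r'}}')
returns 'ad', dropping the whole nested span.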
2036 | """ 2037 | openRE = re.compile(openDelim, re.IGNORECASE) 2038 | closeRE = re.compile(closeDelim, re.IGNORECASE) 2039 | # partition text in separate blocks { } { } 2040 | spans = [] # pairs (s, e) for each partition 2041 | nest = 0 # nesting level 2042 | start = openRE.search(text, 0) 2043 | if not start: 2044 | return text 2045 | end = closeRE.search(text, start.end()) 2046 | next = start 2047 | while end: 2048 | next = openRE.search(text, next.end()) 2049 | if not next: # termination 2050 | while nest: # close all pending 2051 | nest -= 1 2052 | end0 = closeRE.search(text, end.end()) 2053 | if end0: 2054 | end = end0 2055 | else: 2056 | break 2057 | spans.append((start.start(), end.end())) 2058 | break 2059 | while end.end() < next.start(): 2060 | # { } { 2061 | if nest: 2062 | nest -= 1 2063 | # try closing more 2064 | last = end.end() 2065 | end = closeRE.search(text, end.end()) 2066 | if not end: # unbalanced 2067 | if spans: 2068 | span = (spans[0][0], last) 2069 | else: 2070 | span = (start.start(), last) 2071 | spans = [span] 2072 | break 2073 | else: 2074 | spans.append((start.start(), end.end())) 2075 | # advance start, find next close 2076 | start = next 2077 | end = closeRE.search(text, next.end()) 2078 | break # { } 2079 | if next != start: 2080 | # { { } 2081 | nest += 1 2082 | # collect text outside partitions 2083 | return dropSpans(spans, text) 2084 | 2085 | 2086 | def dropSpans(spans, text): 2087 | """ 2088 | Drop from text the blocks identified in :param spans:, possibly nested. 2089 | """ 2090 | spans.sort() 2091 | res = '' 2092 | offset = 0 2093 | for s, e in spans: 2094 | if offset <= s: # handle nesting 2095 | if offset < s: 2096 | res += text[offset:s] 2097 | offset = e 2098 | res += text[offset:] 2099 | return res 2100 | 2101 | 2102 | # ---------------------------------------------------------------------- 2103 | # WikiLinks 2104 | 2105 | # May be nested [[File:..|..[[..]]..|..]], [[Category:...]], etc. 
2106 | # Also: [[Help:IPA for Catalan|[andora]]] 2107 | 2108 | 2109 | def replaceInternalLinks(text): 2110 | """ 2111 | Replaces internal links of the form: 2112 | [[title |...|label]]trail 2113 | 2114 | with title concatenated with trail, when present, e.g. 's' for plural. 2115 | 2116 | See https://www.mediawiki.org/wiki/Help:Links#Internal_links 2117 | """ 2118 | # call this after removal of external links, so we need not worry about 2119 | # triple closing ]]]. 2120 | cur = 0 2121 | res = '' 2122 | for s, e in findBalanced(text): 2123 | m = tailRE.match(text, e) 2124 | if m: 2125 | trail = m.group(0) 2126 | end = m.end() 2127 | else: 2128 | trail = '' 2129 | end = e 2130 | inner = text[s + 2:e - 2] 2131 | # find first | 2132 | pipe = inner.find('|') 2133 | if pipe < 0: 2134 | title = inner 2135 | label = title 2136 | else: 2137 | title = inner[:pipe].rstrip() 2138 | # find last | 2139 | curp = pipe + 1 2140 | for s1, e1 in findBalanced(inner): 2141 | last = inner.rfind('|', curp, s1) 2142 | if last >= 0: 2143 | pipe = last # advance 2144 | curp = e1 2145 | label = inner[pipe + 1:].strip() 2146 | res += text[cur:s] + makeInternalLink(title, label) + trail 2147 | cur = end 2148 | return res + text[cur:] 2149 | 2150 | 2151 | # the official version is a method in class Parser, similar to this: 2152 | # def replaceInternalLinks2(text): 2153 | # global wgExtraInterlanguageLinkPrefixes 2154 | 2155 | # # the % is needed to support urlencoded titles as well 2156 | # tc = Title::legalChars() + '#%' 2157 | # # Match a link having the form [[namespace:link|alternate]]trail 2158 | # e1 = re.compile("([%s]+)(?:\\|(.+?))?]](.*)" % tc, re.S | re.D) 2159 | # # Match cases where there is no "]]", which might still be images 2160 | # e1_img = re.compile("([%s]+)\\|(.*)" % tc, re.S | re.D) 2161 | 2162 | # holders = LinkHolderArray(self) 2163 | 2164 | # # split the entire text string on occurrences of [[ 2165 | # iterBrackets = re.compile('[[').finditer(text) 2166 | 2167 | # m in 
iterBrackets.next() 2168 | # # get the first element (all text up to first [[) 2169 | # s = text[:m.start()] 2170 | # cur = m.end() 2171 | 2172 | # line = s 2173 | 2174 | # useLinkPrefixExtension = self.getTargetLanguage().linkPrefixExtension() 2175 | # e2 = None 2176 | # if useLinkPrefixExtension: 2177 | # # Match the end of a line for a word that is not followed by whitespace, 2178 | # # e.g. in the case of "The Arab al[[Razi]]", "al" will be matched 2179 | # global wgContLang 2180 | # charset = wgContLang.linkPrefixCharset() 2181 | # e2 = re.compile("((?>.*[^charset]|))(.+)", re.S | re.D | re.U) 2182 | 2183 | # if self.mTitle is None: 2184 | # raise MWException(__METHOD__ + ": \self.mTitle is null\n") 2185 | 2186 | # nottalk = not self.mTitle.isTalkPage() 2187 | 2188 | # if useLinkPrefixExtension: 2189 | # m = e2.match(s) 2190 | # if m: 2191 | # first_prefix = m.group(2) 2192 | # else: 2193 | # first_prefix = false 2194 | # else: 2195 | # prefix = '' 2196 | 2197 | # useSubpages = self.areSubpagesAllowed() 2198 | 2199 | # for m in iterBrackets: 2200 | # line = text[cur:m.start()] 2201 | # cur = m.end() 2202 | 2203 | # # TODO: Check for excessive memory usage 2204 | 2205 | # if useLinkPrefixExtension: 2206 | # m = e2.match(e2) 2207 | # if m: 2208 | # prefix = m.group(2) 2209 | # s = m.group(1) 2210 | # else: 2211 | # prefix = '' 2212 | # # first link 2213 | # if first_prefix: 2214 | # prefix = first_prefix 2215 | # first_prefix = False 2216 | 2217 | # might_be_img = False 2218 | 2219 | # m = e1.match(line) 2220 | # if m: # page with normal label or alt 2221 | # label = m.group(2) 2222 | # # If we get a ] at the beginning of m.group(3) that means we have a link that is something like: 2223 | # # [[Image:Foo.jpg|[http://example.com desc]]] <- having three ] in a row fucks up, 2224 | # # the real problem is with the e1 regex 2225 | # # See bug 1300. 
2226 | # # 2227 | # # Still some problems for cases where the ] is meant to be outside punctuation, 2228 | # # and no image is in sight. See bug 2095. 2229 | # # 2230 | # if label and m.group(3)[0] == ']' and '[' in label: 2231 | # label += ']' # so that replaceExternalLinks(label) works later 2232 | # m.group(3) = m.group(3)[1:] 2233 | # # fix up urlencoded title texts 2234 | # if '%' in m.group(1): 2235 | # # Should anchors '#' also be rejected? 2236 | # m.group(1) = str_replace(array('<', '>'), array('<', '>'), rawurldecode(m.group(1))) 2237 | # trail = m.group(3) 2238 | # else: 2239 | # m = e1_img.match(line): 2240 | # if m: 2241 | # # Invalid, but might be an image with a link in its caption 2242 | # might_be_img = true 2243 | # label = m.group(2) 2244 | # if '%' in m.group(1): 2245 | # m.group(1) = rawurldecode(m.group(1)) 2246 | # trail = "" 2247 | # else: # Invalid form; output directly 2248 | # s += prefix + '[[' + line 2249 | # continue 2250 | 2251 | # origLink = m.group(1) 2252 | 2253 | # # Dont allow internal links to pages containing 2254 | # # PROTO: where PROTO is a valid URL protocol these 2255 | # # should be external links. 
2256 | # if (preg_match('/^(?i:' + self.mUrlProtocols + ')/', origLink)) { 2257 | # s += prefix + '[[' + line 2258 | # continue 2259 | # } 2260 | 2261 | # # Make subpage if necessary 2262 | # if useSubpages: 2263 | # link = self.maybeDoSubpageLink(origLink, label) 2264 | # else: 2265 | # link = origLink 2266 | 2267 | # noforce = origLink[0] != ':' 2268 | # if not noforce: 2269 | # # Strip off leading ':' 2270 | # link = link[1:] 2271 | 2272 | # nt = Title::newFromText(self.mStripState.unstripNoWiki(link)) 2273 | # if nt is None: 2274 | # s += prefix + '[[' + line 2275 | # continue 2276 | 2277 | # ns = nt.getNamespace() 2278 | # iw = nt.getInterwiki() 2279 | 2280 | # if might_be_img { # if this is actually an invalid link 2281 | # if (ns == NS_FILE and noforce) { # but might be an image 2282 | # found = False 2283 | # while True: 2284 | # # look at the next 'line' to see if we can close it there 2285 | # next_line = iterBrakets.next() 2286 | # if not next_line: 2287 | # break 2288 | # m = explode(']]', next_line, 3) 2289 | # if m.lastindex == 3: 2290 | # # the first ]] closes the inner link, the second the image 2291 | # found = True 2292 | # label += "[[%s]]%s" % (m.group(0), m.group(1)) 2293 | # trail = m.group(2) 2294 | # break 2295 | # elif m.lastindex == 2: 2296 | # # if there is exactly one ]] that is fine, we will keep looking 2297 | # label += "[[{m[0]}]]{m.group(1)}" 2298 | # else: 2299 | # # if next_line is invalid too, we need look no further 2300 | # label += '[[' + next_line 2301 | # break 2302 | # if not found: 2303 | # # we couldnt find the end of this imageLink, so output it raw 2304 | # # but dont ignore what might be perfectly normal links in the text we ve examined 2305 | # holders.merge(self.replaceInternalLinks2(label)) 2306 | # s += "{prefix}[[%s|%s" % (link, text) 2307 | # # note: no trail, because without an end, there *is* no trail 2308 | # continue 2309 | # } else: # it is not an image, so output it raw 2310 | # s += "{prefix}[[%s|%s" % 
(link, text) 2311 | # # note: no trail, because without an end, there *is* no trail 2312 | # continue 2313 | # } 2314 | 2315 | # wasblank = (text == '') 2316 | # if wasblank: 2317 | # text = link 2318 | # else: 2319 | # # Bug 4598 madness. Handle the quotes only if they come from the alternate part 2320 | # # [[Lista d''e paise d''o munno]] . Lista d''e paise d''o munno 2321 | # # [[Criticism of Harry Potter|Criticism of ''Harry Potter'']] 2322 | # # . Criticism of Harry Potter 2323 | # text = self.doQuotes(text) 2324 | 2325 | # # Link not escaped by : , create the various objects 2326 | # if noforce and not nt.wasLocalInterwiki(): 2327 | # # Interwikis 2328 | # if iw and mOptions.getInterwikiMagic() and nottalk and ( 2329 | # Language::fetchLanguageName(iw, None, 'mw') or 2330 | # in_array(iw, wgExtraInterlanguageLinkPrefixes)): 2331 | # # Bug 24502: filter duplicates 2332 | # if iw not in mLangLinkLanguages: 2333 | # self.mLangLinkLanguages[iw] = True 2334 | # self.mOutput.addLanguageLink(nt.getFullText()) 2335 | 2336 | # s = rstrip(s + prefix) 2337 | # s += strip(trail, "\n") == '' ? 
'': prefix + trail 2338 | # continue 2339 | 2340 | # if ns == NS_FILE: 2341 | # if not wfIsBadImage(nt.getDBkey(), self.mTitle): 2342 | # if wasblank: 2343 | # # if no parameters were passed, text 2344 | # # becomes something like "File:Foo.png", 2345 | # # which we dont want to pass on to the 2346 | # # image generator 2347 | # text = '' 2348 | # else: 2349 | # # recursively parse links inside the image caption 2350 | # # actually, this will parse them in any other parameters, too, 2351 | # # but it might be hard to fix that, and it doesnt matter ATM 2352 | # text = self.replaceExternalLinks(text) 2353 | # holders.merge(self.replaceInternalLinks2(text)) 2354 | # # cloak any absolute URLs inside the image markup, so replaceExternalLinks() wont touch them 2355 | # s += prefix + self.armorLinks( 2356 | # self.makeImage(nt, text, holders)) + trail 2357 | # else: 2358 | # s += prefix + trail 2359 | # continue 2360 | 2361 | # if ns == NS_CATEGORY: 2362 | # s = rstrip(s + "\n") # bug 87 2363 | 2364 | # if wasblank: 2365 | # sortkey = self.getDefaultSort() 2366 | # else: 2367 | # sortkey = text 2368 | # sortkey = Sanitizer::decodeCharReferences(sortkey) 2369 | # sortkey = str_replace("\n", '', sortkey) 2370 | # sortkey = self.getConverterLanguage().convertCategoryKey(sortkey) 2371 | # self.mOutput.addCategory(nt.getDBkey(), sortkey) 2372 | 2373 | # s += strip(prefix + trail, "\n") == '' ? '' : prefix + trail 2374 | 2375 | # continue 2376 | # } 2377 | # } 2378 | 2379 | # # Self-link checking. For some languages, variants of the title are checked in 2380 | # # LinkHolderArray::doVariants() to allow batching the existence checks necessary 2381 | # # for linking to a different variant. 
2382 | # if ns != NS_SPECIAL and nt.equals(self.mTitle) and !nt.hasFragment(): 2383 | # s += prefix + Linker::makeSelfLinkObj(nt, text, '', trail) 2384 | # continue 2385 | 2386 | # # NS_MEDIA is a pseudo-namespace for linking directly to a file 2387 | # # @todo FIXME: Should do batch file existence checks, see comment below 2388 | # if ns == NS_MEDIA: 2389 | # # Give extensions a chance to select the file revision for us 2390 | # options = [] 2391 | # descQuery = False 2392 | # Hooks::run('BeforeParserFetchFileAndTitle', 2393 | # [this, nt, &options, &descQuery]) 2394 | # # Fetch and register the file (file title may be different via hooks) 2395 | # file, nt = self.fetchFileAndTitle(nt, options) 2396 | # # Cloak with NOPARSE to avoid replacement in replaceExternalLinks 2397 | # s += prefix + self.armorLinks( 2398 | # Linker::makeMediaLinkFile(nt, file, text)) + trail 2399 | # continue 2400 | 2401 | # # Some titles, such as valid special pages or files in foreign repos, should 2402 | # # be shown as bluelinks even though they are not included in the page table 2403 | # # 2404 | # # @todo FIXME: isAlwaysKnown() can be expensive for file links; we should really do 2405 | # # batch file existence checks for NS_FILE and NS_MEDIA 2406 | # if iw == '' and nt.isAlwaysKnown(): 2407 | # self.mOutput.addLink(nt) 2408 | # s += self.makeKnownLinkHolder(nt, text, array(), trail, prefix) 2409 | # else: 2410 | # # Links will be added to the output link list after checking 2411 | # s += holders.makeHolder(nt, text, array(), trail, prefix) 2412 | # } 2413 | # return holders 2414 | 2415 | 2416 | def makeInternalLink(title, label): 2417 | colon = title.find(':') 2418 | if colon > 0 and title[:colon] not in options.acceptedNamespaces: 2419 | return '' 2420 | if colon == 0: 2421 | # drop also :File: 2422 | colon2 = title.find(':', colon + 1) 2423 | if colon2 > 1 and title[colon + 1:colon2] not in options.acceptedNamespaces: 2424 | return '' 2425 | if options.keepLinks: 2426 | return 
'<a href="%s">%s</a>' % (quote(title.encode('utf-8')), label) 2427 | else: 2428 | return label 2429 | 2430 | 2431 | # ---------------------------------------------------------------------- 2432 | # External links 2433 | 2434 | # from: https://doc.wikimedia.org/mediawiki-core/master/php/DefaultSettings_8php_source.html 2435 | 2436 | wgUrlProtocols = [ 2437 | 'bitcoin:', 'ftp://', 'ftps://', 'geo:', 'git://', 'gopher://', 'http://', 2438 | 'https://', 'irc://', 'ircs://', 'magnet:', 'mailto:', 'mms://', 'news:', 2439 | 'nntp://', 'redis://', 'sftp://', 'sip:', 'sips:', 'sms:', 'ssh://', 2440 | 'svn://', 'tel:', 'telnet://', 'urn:', 'worldwind://', 'xmpp:', '//' 2441 | ] 2442 | 2443 | # from: https://doc.wikimedia.org/mediawiki-core/master/php/Parser_8php_source.html 2444 | 2445 | # Constants needed for external link processing 2446 | # Everything except bracket, space, or control characters 2447 | # \p{Zs} is unicode 'separator, space' category. It covers the space 0x20 2448 | # as well as U+3000, IDEOGRAPHIC SPACE, for bug 19052 2449 | EXT_LINK_URL_CLASS = r'[^][<>"\x00-\x20\x7F\s]' 2450 | ANCHOR_CLASS = r'[^][\x00-\x08\x0a-\x1F]' 2451 | ExtLinkBracketedRegex = re.compile( 2452 | r'\[(((?i)' + '|'.join(wgUrlProtocols) + ')' + EXT_LINK_URL_CLASS + r'+)' + 2453 | r'\s*((?:' + ANCHOR_CLASS + r'|\[\[' + ANCHOR_CLASS + r'+\]\])' + r'*?)\]', 2454 | re.S | re.U) 2455 | # A simpler alternative: 2456 | # ExtLinkBracketedRegex = re.compile(r'\[(.*?)\](?!])') 2457 | 2458 | EXT_IMAGE_REGEX = re.compile( 2459 | r"""^(http://|https://)([^][<>"\x00-\x20\x7F\s]+) 2460 | /([A-Za-z0-9_.,~%\-+&;#*?!=()@\x80-\xFF]+)\.((?i)gif|png|jpg|jpeg)$""", 2461 | re.X | re.S | re.U) 2462 | 2463 | 2464 | def replaceExternalLinks(text): 2465 | """ 2466 | https://www.mediawiki.org/wiki/Help:Links#External_links 2467 | [URL anchor text] 2468 | """ 2469 | s = '' 2470 | cur = 0 2471 | for m in ExtLinkBracketedRegex.finditer(text): 2472 | s += text[cur:m.start()] 2473 | cur = m.end() 2474 | 2475 | url = m.group(1) 2476 | 
label = m.group(3) 2477 | 2478 | # # The characters '<' and '>' (which were escaped by 2479 | # # removeHTMLtags()) should not be included in 2480 | # # URLs, per RFC 2396. 2481 | # m2 = re.search('&(lt|gt);', url) 2482 | # if m2: 2483 | # link = url[m2.end():] + ' ' + link 2484 | # url = url[0:m2.end()] 2485 | 2486 | # If the link text is an image URL, replace it with an <img> tag 2487 | # This happened by accident in the original parser, but some people used it extensively 2488 | m = EXT_IMAGE_REGEX.match(label) 2489 | if m: 2490 | label = makeExternalImage(label) 2491 | 2492 | # Use the encoded URL 2493 | # This means that users can paste URLs directly into the text 2494 | # Funny characters like ö aren't valid in URLs anyway 2495 | # This was changed in August 2004 2496 | s += makeExternalLink(url, label) # + trail 2497 | 2498 | return s + text[cur:] 2499 | 2500 | 2501 | def makeExternalLink(url, anchor): 2502 | """Function applied to wikiLinks""" 2503 | if options.keepLinks: 2504 | return '<a href="%s">%s</a>' % (quote(url.encode('utf-8')), anchor) 2505 | else: 2506 | return anchor 2507 | 2508 | 2509 | def makeExternalImage(url, alt=''): 2510 | if options.keepLinks: 2511 | return '<img src="%s" alt="%s">' % (url, alt) 2512 | else: 2513 | return alt 2514 | 2515 | 2516 | # ---------------------------------------------------------------------- 2517 | 2518 | # match tail after wikilink 2519 | tailRE = re.compile(r'\w+') 2520 | 2521 | syntaxhighlight = re.compile('<syntaxhighlight .*?>(.*?)</syntaxhighlight>', re.DOTALL) 2522 | 2523 | # skip level 1, it is page name level 2524 | section = re.compile(r'(==+)\s*(.*?)\s*\1') 2525 | 2526 | listOpen = {'*': '
<ul>', '#': '<ol>', ';': '<dl>', ':': '<dl>'} 2527 | listClose = {'*': '</ul>', '#': '</ol>', ';': '</dl>', ':': '</dl>'} 2528 | listItem = {'*': '<li>%s</li>', '#': '<li>%s</li>', ';': '<dt>%s</dt>', 2529 | ':': '<dd>%s</dd>
'} 2530 | 2531 | 2532 | def compact(text): 2533 | """Deal with headers, lists, empty sections, residuals of tables. 2534 | :param text: convert to HTML. 2535 | """ 2536 | 2537 | page = [] # list of paragraph 2538 | headers = {} # Headers for unfilled sections 2539 | emptySection = False # empty sections are discarded 2540 | listLevel = [] # nesting of lists 2541 | listCount = [] # count of each list (it should be always in the same length of listLevel) 2542 | for line in text.split('\n'): 2543 | if not line: # collapse empty lines 2544 | # if there is an opening list, close it if we see an empty line 2545 | if len(listLevel): 2546 | page.append(line) 2547 | if options.toHTML: 2548 | for c in reversed(listLevel): 2549 | page.append(listClose[c]) 2550 | listLevel = [] 2551 | listCount = [] 2552 | emptySection = False 2553 | elif page and page[-1]: 2554 | page.append('') 2555 | continue 2556 | # Handle section titles 2557 | m = section.match(line) 2558 | if m: 2559 | title = m.group(2) 2560 | lev = len(m.group(1)) # header level 2561 | if options.toHTML: 2562 | page.append("<h%d>%s</h%d>" % (lev, title, lev)) 2563 | if title and title[-1] not in '!?': 2564 | title += '.' # terminate sentence. 2565 | headers[lev] = title 2566 | # drop previous headers 2567 | for i in list(headers.keys()): 2568 | if i > lev: 2569 | del headers[i] 2570 | emptySection = True 2571 | listLevel = [] 2572 | listCount = [] 2573 | continue 2574 | # Handle page title 2575 | elif line.startswith('++'): 2576 | title = line[2:-2] 2577 | if title: 2578 | if title[-1] not in '!?': 2579 | title += '.' 
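As a minimal standalone illustration of the heading pattern used by compact() (the module-level `section` regex), here is a sketch; `heading_re` is a hypothetical name that just duplicates the pattern for the demo:

```python
import re

# Mirrors the module-level pattern: section = re.compile(r'(==+)\s*(.*?)\s*\1')
# Level-1 headings (a single '=') are skipped because they repeat the page title.
heading_re = re.compile(r'(==+)\s*(.*?)\s*\1')

m = heading_re.match('=== Early life ===')
assert m.group(2) == 'Early life'  # heading text with surrounding whitespace stripped
assert len(m.group(1)) == 3        # number of '=' signs gives the header level
```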
2580 | page.append(title) 2581 | # handle indents 2582 | elif line[0] == ':': 2583 | # page.append(line.lstrip(':*#;')) 2584 | continue 2585 | # handle lists 2586 | elif line[0] in '*#;:': 2587 | i = 0 2588 | # c: current level char 2589 | # n: next level char 2590 | for c, n in zip_longest(listLevel, line, fillvalue=''): 2591 | if not n or n not in '*#;:': # shorter or different 2592 | if c: 2593 | if options.toHTML: 2594 | page.append(listClose[c]) 2595 | listLevel = listLevel[:-1] 2596 | listCount = listCount[:-1] 2597 | continue 2598 | else: 2599 | break 2600 | # n != '' 2601 | if c != n and (not c or (c not in ';:' and n not in ';:')): 2602 | if c: 2603 | # close level 2604 | if options.toHTML: 2605 | page.append(listClose[c]) 2606 | listLevel = listLevel[:-1] 2607 | listCount = listCount[:-1] 2608 | listLevel += n 2609 | listCount.append(0) 2610 | if options.toHTML: 2611 | page.append(listOpen[n]) 2612 | i += 1 2613 | n = line[i - 1] # last list char 2614 | line = line[i:].strip() 2615 | if line: # FIXME: n is '"' 2616 | if options.keepLists: 2617 | if options.keepSections: 2618 | # emit open sections 2619 | items = sorted(headers.items()) 2620 | for _, v in items: 2621 | page.append(v) 2622 | headers.clear() 2623 | # use item count for #-lines 2624 | listCount[i - 1] += 1 2625 | bullet = '%d. 
' % listCount[i - 1] if n == '#' else '- ' 2626 | page.append('{0:{1}s}'.format(bullet, len(listLevel)) + line) 2627 | elif options.toHTML: 2628 | page.append(listItem[n] % line) 2629 | elif len(listLevel): 2630 | if options.toHTML: 2631 | for c in reversed(listLevel): 2632 | page.append(listClose[c]) 2633 | listLevel = [] 2634 | listCount = [] 2635 | page.append(line) 2636 | 2637 | # Drop residuals of lists 2638 | elif line[0] in '{|' or line[-1] == '}': 2639 | continue 2640 | # Drop irrelevant lines 2641 | elif (line[0] == '(' and line[-1] == ')') or line.strip('.-') == '': 2642 | continue 2643 | elif len(headers): 2644 | if options.keepSections: 2645 | items = sorted(headers.items()) 2646 | for i, v in items: 2647 | page.append(v) 2648 | headers.clear() 2649 | page.append(line) # first line 2650 | emptySection = False 2651 | elif not emptySection: 2652 | # Drop preformatted 2653 | if line[0] != ' ': # dangerous 2654 | page.append(line) 2655 | return page 2656 | 2657 | 2658 | def handle_unicode(entity): 2659 | numeric_code = int(entity[2:-1]) 2660 | if numeric_code >= 0x10000: return '' 2661 | return chr(numeric_code) 2662 | 2663 | 2664 | # ------------------------------------------------------------------------------ 2665 | # Output 2666 | 2667 | 2668 | class NextFile(object): 2669 | """ 2670 | Synchronous generation of next available file name. 
2671 | """ 2672 | 2673 | filesPerDir = 100 2674 | 2675 | def __init__(self, path_name): 2676 | self.path_name = path_name 2677 | self.dir_index = -1 2678 | self.file_index = -1 2679 | 2680 | def __next__(self): 2681 | self.file_index = (self.file_index + 1) % NextFile.filesPerDir 2682 | if self.file_index == 0: 2683 | self.dir_index += 1 2684 | dirname = self._dirname() 2685 | if not os.path.isdir(dirname): 2686 | os.makedirs(dirname) 2687 | return self._filepath() 2688 | 2689 | next = __next__ 2690 | 2691 | def _dirname(self): 2692 | char1 = self.dir_index % 26 2693 | char2 = self.dir_index // 26 % 26 2694 | return os.path.join(self.path_name, '%c%c' % (ord('A') + char2, ord('A') + char1)) 2695 | 2696 | def _filepath(self): 2697 | return '%s/wiki_%02d' % (self._dirname(), self.file_index) 2698 | 2699 | 2700 | class OutputSplitter(object): 2701 | """ 2702 | File-like object that splits output to multiple files of a given max size. 2703 | """ 2704 | 2705 | def __init__(self, nextFile, max_file_size=0, compress=True): 2706 | """ 2707 | :param nextFile: a NextFile object from which to obtain filenames 2708 | to use. 2709 | :param max_file_size: the maximum size of each file. 2710 | :param compress: whether to write data with bzip compression. 
2711 | """ 2712 | self.nextFile = nextFile 2713 | self.compress = compress 2714 | self.max_file_size = max_file_size 2715 | self.file = self.open(next(self.nextFile)) 2716 | 2717 | def reserve(self, size): 2718 | if self.file.tell() + size > self.max_file_size: 2719 | self.close() 2720 | self.file = self.open(next(self.nextFile)) 2721 | 2722 | def write(self, data): 2723 | self.reserve(len(data)) 2724 | self.file.write(data) 2725 | 2726 | def close(self): 2727 | self.file.close() 2728 | 2729 | def open(self, filename): 2730 | if self.compress: 2731 | return bz2.BZ2File(filename + '.bz2', 'w') 2732 | else: 2733 | return open(filename, 'wb') 2734 | 2735 | 2736 | # ---------------------------------------------------------------------- 2737 | # READER 2738 | 2739 | tagRE = re.compile(r'(.*?)<(/?\w+)[^>]*?>(?:([^<]*)(<.*?>)?)?') 2740 | # 1 2 3 4 2741 | keyRE = re.compile(r'key="(\d*)"') 2742 | 2743 | def load_templates(file, output_file=None): 2744 | """ 2745 | Load templates from :param file:. 2746 | :param output_file: file where to save templates and modules. 
2747 | """ 2748 | options.templatePrefix = options.templateNamespace + ':' 2749 | options.modulePrefix = options.moduleNamespace + ':' 2750 | 2751 | if output_file: 2752 | output = codecs.open(output_file, 'wb', 'utf-8') 2753 | for page_count, page_data in enumerate(pages_from(file)): 2754 | id, revid, title, ns, page = page_data 2755 | if not output_file and (not options.templateNamespace or 2756 | not options.moduleNamespace): # do not know it yet 2757 | # reconstruct templateNamespace and moduleNamespace from the first title 2758 | if ns in templateKeys: 2759 | colon = title.find(':') 2760 | if colon > 1: 2761 | if ns == '10': 2762 | options.templateNamespace = title[:colon] 2763 | options.templatePrefix = title[:colon + 1] 2764 | elif ns == '828': 2765 | options.moduleNamespace = title[:colon] 2766 | options.modulePrefix = title[:colon + 1] 2767 | if ns in templateKeys: 2768 | text = ''.join(page) 2769 | define_template(title, text) 2770 | # save templates and modules to file 2771 | if output_file: 2772 | output.write('<page>\n') 2773 | output.write('   <title>%s</title>\n' % title) 2774 | output.write('   <ns>%s</ns>\n' % ns) 2775 | output.write('   <id>%s</id>\n' % id) 2776 | output.write('   <text>') 2777 | for line in page: 2778 | output.write(line) 2779 | output.write('   </text>\n') 2780 | output.write('</page>\n') 2781 | if page_count and page_count % 100000 == 0: 2782 | logging.info("Preprocessed %d pages", page_count) 2783 | if output_file: 2784 | output.close() 2785 | logging.info("Saved %d templates to '%s'", len(options.templates), output_file) 2786 | 2787 | 2788 | def pages_from(input): 2789 | """ 2790 | Scans input extracting pages. 2791 | :return: (id, revid, title, namespace key, page), page is a list of lines. 
2792 | """ 2793 | # we collect individual lines, since str.join() is significantly faster 2794 | # than concatenation 2795 | page = [] 2796 | id = None 2797 | ns = '0' 2798 | last_id = None 2799 | revid = None 2800 | inText = False 2801 | redirect = False 2802 | title = None 2803 | for line in input: 2804 | if not isinstance(line, text_type): line = line.decode('utf-8') 2805 | if '<' not in line: # faster than doing re.search() 2806 | if inText: 2807 | page.append(line) 2808 | continue 2809 | m = tagRE.search(line) 2810 | if not m: 2811 | continue 2812 | tag = m.group(2) 2813 | if tag == 'page': 2814 | page = [] 2815 | redirect = False 2816 | elif tag == 'id' and not id: 2817 | id = m.group(3) 2818 | elif tag == 'id' and id: 2819 | revid = m.group(3) 2820 | elif tag == 'title': 2821 | title = m.group(3) 2822 | elif tag == 'ns': 2823 | ns = m.group(3) 2824 | elif tag == 'redirect': 2825 | redirect = True 2826 | elif tag == 'text': 2827 | if m.lastindex == 3 and line[m.start(3)-2] == '/': # self closing 2828 | # <text xml:space="preserve" /> 2829 | continue 2830 | inText = True 2831 | line = line[m.start(3):m.end(3)] 2832 | page.append(line) 2833 | if m.lastindex == 4: # open-close 2834 | inText = False 2835 | elif tag == '/text': 2836 | if m.group(1): 2837 | page.append(m.group(1)) 2838 | inText = False 2839 | elif inText: 2840 | page.append(line) 2841 | elif tag == '/page': 2842 | if id != last_id and not redirect: 2843 | yield (id, revid, title, ns, page) 2844 | last_id = id 2845 | ns = '0' 2846 | id = None 2847 | revid = None 2848 | title = None 2849 | page = [] 2850 | 2851 | 2852 | def process_dump(input_file, template_file, out_file, file_size, file_compress, 2853 | process_count): 2854 | """ 2855 | :param input_file: name of the wikipedia dump file; '-' to read from stdin 2856 | :param template_file: optional file with template definitions. 
2857 | :param out_file: directory where to store extracted data, or '-' for stdout 2858 | :param file_size: max size of each extracted file, or None for no max (one file) 2859 | :param file_compress: whether to compress files with bzip. 2860 | :param process_count: number of extraction processes to spawn. 2861 | """ 2862 | 2863 | if input_file == '-': 2864 | input = sys.stdin 2865 | else: 2866 | input = fileinput.FileInput(input_file, openhook=fileinput.hook_compressed) 2867 | 2868 | # collect siteinfo 2869 | for line in input: 2870 | # When an input file is .bz2 or .gz, line can be a bytes even in Python 3. 2871 | if not isinstance(line, text_type): line = line.decode('utf-8') 2872 | m = tagRE.search(line) 2873 | if not m: 2874 | continue 2875 | tag = m.group(2) 2876 | if tag == 'base': 2877 | # discover urlbase from the xml dump file 2878 | # /mediawiki/siteinfo/base 2879 | base = m.group(3) 2880 | options.urlbase = base[:base.rfind("/")] 2881 | elif tag == 'namespace': 2882 | mk = keyRE.search(line) 2883 | if mk: 2884 | nsid = mk.group(1) 2885 | else: 2886 | nsid = '' 2887 | options.knownNamespaces[m.group(3)] = nsid 2888 | if re.search('key="10"', line): 2889 | options.templateNamespace = m.group(3) 2890 | options.templatePrefix = options.templateNamespace + ':' 2891 | elif re.search('key="828"', line): 2892 | options.moduleNamespace = m.group(3) 2893 | options.modulePrefix = options.moduleNamespace + ':' 2894 | elif tag == '/siteinfo': 2895 | break 2896 | 2897 | if options.expand_templates: 2898 | # preprocess 2899 | template_load_start = default_timer() 2900 | if template_file: 2901 | if os.path.exists(template_file): 2902 | logging.info("Loading template definitions from: %s", template_file) 2903 | # can't use with here: 2904 | file = fileinput.FileInput(template_file, 2905 | openhook=fileinput.hook_compressed) 2906 | load_templates(file) 2907 | file.close() 2908 | else: 2909 | if input_file == '-': 2910 | # can't scan then reset stdin; must error w/ 
suggestion to specify template_file 2911 | raise ValueError("to use templates with stdin dump, must supply explicit template-file") 2912 | logging.info("Preprocessing '%s' to collect template definitions: this may take some time.", input_file) 2913 | load_templates(input, template_file) 2914 | input.close() 2915 | input = fileinput.FileInput(input_file, openhook=fileinput.hook_compressed) 2916 | template_load_elapsed = default_timer() - template_load_start 2917 | logging.info("Loaded %d templates in %.1fs", len(options.templates), template_load_elapsed) 2918 | 2919 | # process pages 2920 | logging.info("Starting page extraction from %s.", input_file) 2921 | extract_start = default_timer() 2922 | 2923 | # Parallel Map/Reduce: 2924 | # - pages to be processed are dispatched to workers 2925 | # - a reduce process collects the results, sort them and print them. 2926 | 2927 | process_count = max(1, process_count) 2928 | maxsize = 10 * process_count 2929 | # output queue 2930 | output_queue = Queue(maxsize=maxsize) 2931 | 2932 | if out_file == '-': 2933 | out_file = None 2934 | 2935 | worker_count = process_count 2936 | 2937 | # load balancing 2938 | max_spool_length = 10000 2939 | spool_length = Value('i', 0, lock=False) 2940 | 2941 | # reduce job that sorts and prints output 2942 | reduce = Process(target=reduce_process, 2943 | args=(options, output_queue, spool_length, 2944 | out_file, file_size, file_compress)) 2945 | reduce.start() 2946 | 2947 | # initialize jobs queue 2948 | jobs_queue = Queue(maxsize=maxsize) 2949 | 2950 | # start worker processes 2951 | logging.info("Using %d extract processes.", worker_count) 2952 | workers = [] 2953 | for i in range(worker_count): 2954 | extractor = Process(target=extract_process, 2955 | args=(options, i, jobs_queue, output_queue)) 2956 | extractor.daemon = True # only live while parent process lives 2957 | extractor.start() 2958 | workers.append(extractor) 2959 | 2960 | # Mapper process 2961 | page_num = 0 2962 | for page_data 
in pages_from(input): 2963 | id, revid, title, ns, page = page_data 2964 | if keepPage(ns, page): 2965 | # slow down 2966 | delay = 0 2967 | if spool_length.value > max_spool_length: 2968 | # reduce to 10% 2969 | while spool_length.value > max_spool_length/10: 2970 | time.sleep(10) 2971 | delay += 10 2972 | if delay: 2973 | logging.info('Delay %ds', delay) 2974 | job = (id, revid, title, page, page_num) 2975 | jobs_queue.put(job) # goes to any available extract_process 2976 | page_num += 1 2977 | page = None # free memory 2978 | 2979 | input.close() 2980 | 2981 | # signal termination 2982 | for _ in workers: 2983 | jobs_queue.put(None) 2984 | # wait for workers to terminate 2985 | for w in workers: 2986 | w.join() 2987 | 2988 | # signal end of work to reduce process 2989 | output_queue.put(None) 2990 | # wait for it to finish 2991 | reduce.join() 2992 | 2993 | extract_duration = default_timer() - extract_start 2994 | extract_rate = page_num / extract_duration 2995 | logging.info("Finished %d-process extraction of %d articles in %.1fs (%.1f art/s)", 2996 | process_count, page_num, extract_duration, extract_rate) 2997 | 2998 | 2999 | # ---------------------------------------------------------------------- 3000 | # Multiprocess support 3001 | 3002 | 3003 | def extract_process(opts, i, jobs_queue, output_queue): 3004 | """Pull tuples of raw page content, do CPU/regex-heavy fixup, push finished text 3005 | :param i: process id. 3006 | :param jobs_queue: where to get jobs. 3007 | :param output_queue: where to queue extracted text for output. 
3008 | """ 3009 | 3010 | global options 3011 | options = opts 3012 | 3013 | createLogger(options.quiet, options.debug) 3014 | 3015 | out = StringIO() # memory buffer 3016 | 3017 | 3018 | while True: 3019 | job = jobs_queue.get() # job is (id, revid, title, page, page_num) 3020 | if job: 3021 | id, revid, title, page, page_num = job 3022 | try: 3023 | e = Extractor(*job[:4]) # (id, revid, title, page) 3024 | page = None # free memory 3025 | e.extract(out) 3026 | text = out.getvalue() 3027 | except: 3028 | text = '' 3029 | logging.exception('Processing page: %s %s', id, title) 3030 | 3031 | output_queue.put((page_num, text)) 3032 | out.truncate(0) 3033 | out.seek(0) 3034 | else: 3035 | logging.debug('Quit extractor') 3036 | break 3037 | out.close() 
3050 | """ 3051 | 3052 | global options 3053 | options = opts 3054 | 3055 | createLogger(options.quiet, options.debug) 3056 | 3057 | if out_file: 3058 | nextFile = NextFile(out_file) 3059 | output = OutputSplitter(nextFile, file_size, file_compress) 3060 | else: 3061 | output = sys.stdout if PY2 else sys.stdout.buffer 3062 | if file_compress: 3063 | logging.warning("writing to stdout, so no output compression (use an external tool)") 3064 | 3065 | interval_start = default_timer() 3066 | # FIXME: use a heap 3067 | spool = {} # collected pages 3068 | next_page = 0 # sequence numbering of page 3069 | while True: 3070 | if next_page in spool: 3071 | output.write(spool.pop(next_page).encode('utf-8')) 3072 | next_page += 1 3073 | # tell mapper our load: 3074 | spool_length.value = len(spool) 3075 | # progress report 3076 | if next_page % report_period == 0: 3077 | interval_rate = report_period / (default_timer() - interval_start) 3078 | logging.info("Extracted %d articles (%.1f art/s)", 3079 | next_page, interval_rate) 3080 | interval_start = default_timer() 3081 | else: 3082 | # mapper puts None to signal finish 3083 | pair = output_queue.get() 3084 | if not pair: 3085 | break 3086 | page_num, text = pair 3087 | spool[page_num] = text 3088 | # tell mapper our load: 3089 | spool_length.value = len(spool) 3090 | # FIXME: if an extractor dies, process stalls; the other processes 3091 | # continue to produce pairs, filling up memory. 
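The spool mechanism above can be summarized in a small self-contained sketch (`emit_in_order` is a hypothetical helper, not part of the script): results arrive tagged with a sequence number, are parked in a dict, and are drained strictly in order, which is what keeps output deterministic across workers:

```python
# Standalone sketch of the in-order reduce used above: (page_num, text)
# pairs arriving out of order are spooled in a dict and emitted in sequence.
def emit_in_order(tagged_results):
    spool = {}      # page_num -> text, like the 'spool' dict above
    next_page = 0
    ordered = []
    for page_num, text in tagged_results:
        spool[page_num] = text
        while next_page in spool:   # drain every run that is now complete
            ordered.append(spool.pop(next_page))
            next_page += 1
    return ordered

assert emit_in_order([(1, 'b'), (0, 'a'), (2, 'c')]) == ['a', 'b', 'c']
```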
3092 | if len(spool) > 200: 3093 | logging.debug('Collected %d, waiting: %d, %d', len(spool), 3094 | next_page, next_page == page_num) 3095 | if output != sys.stdout: 3096 | output.close() 3097 | 3098 | 3099 | # ---------------------------------------------------------------------- 3100 | 3101 | # Minimum size of output files 3102 | minFileSize = 200 * 1024 3103 | 3104 | def main(): 3105 | 3106 | parser = argparse.ArgumentParser(prog=os.path.basename(sys.argv[0]), 3107 | formatter_class=argparse.RawDescriptionHelpFormatter, 3108 | description=__doc__) 3109 | parser.add_argument("input", 3110 | help="XML wiki dump file") 3111 | groupO = parser.add_argument_group('Output') 3112 | groupO.add_argument("-o", "--output", default="text", 3113 | help="directory for extracted files (or '-' for dumping to stdout)") 3114 | groupO.add_argument("-b", "--bytes", default="1M", 3115 | help="maximum bytes per output file (default %(default)s)", 3116 | metavar="n[KMG]") 3117 | groupO.add_argument("-c", "--compress", action="store_true", 3118 | help="compress output files using bzip") 3119 | groupO.add_argument("--json", action="store_true", 3120 | help="write output in json format instead of the default one") 3121 | 3122 | 3123 | groupP = parser.add_argument_group('Processing') 3124 | groupP.add_argument("--html", action="store_true", 3125 | help="produce HTML output, subsumes --links") 3126 | groupP.add_argument("-l", "--links", action="store_true", 3127 | help="preserve links") 3128 | groupP.add_argument("-s", "--sections", action="store_true", 3129 | help="preserve sections") 3130 | groupP.add_argument("--lists", action="store_true", 3131 | help="preserve lists") 3132 | groupP.add_argument("-ns", "--namespaces", default="", metavar="ns1,ns2", 3133 | help="accepted namespaces in links") 3134 | groupP.add_argument("--templates", 3135 | help="use or create file containing templates") 3136 | groupP.add_argument("--no-templates", action="store_false", 3137 | help="Do not expand 
templates") 3138 | groupP.add_argument("-r", "--revision", action="store_true", default=options.print_revision, 3139 | help="Include the document revision id (default=%(default)s)") 3140 | groupP.add_argument("--min_text_length", type=int, default=options.min_text_length, 3141 | help="Minimum expanded text length required to write document (default=%(default)s)") 3142 | groupP.add_argument("--filter_disambig_pages", action="store_true", default=options.filter_disambig_pages, 3143 | help="Remove pages from output that contain disambiguation markup (default=%(default)s)") 3144 | groupP.add_argument("-it", "--ignored_tags", default="", metavar="abbr,b,big", 3145 | help="comma separated list of tags that will be dropped, keeping their content") 3146 | groupP.add_argument("-de", "--discard_elements", default="", metavar="gallery,timeline,noinclude", 3147 | help="comma separated list of elements that will be removed from the article text") 3148 | groupP.add_argument("--keep_tables", action="store_true", default=options.keep_tables, 3149 | help="Preserve tables in the output article text (default=%(default)s)") 3150 | default_process_count = max(1, cpu_count() - 1) 3151 | parser.add_argument("--processes", type=int, default=default_process_count, 3152 | help="Number of processes to use (default %(default)s)") 3153 | 3154 | groupS = parser.add_argument_group('Special') 3155 | groupS.add_argument("-q", "--quiet", action="store_true", 3156 | help="suppress reporting progress info") 3157 | groupS.add_argument("--debug", action="store_true", 3158 | help="print debug info") 3159 | groupS.add_argument("-a", "--article", action="store_true", 3160 | help="analyze a file containing a single article (debug option)") 3161 | groupS.add_argument("-v", "--version", action="version", 3162 | version='%(prog)s ' + version, 3163 | help="print program version") 3164 | 3165 | args = parser.parse_args() 3166 | 3167 | options.keepLinks = args.links 3168 | options.keepSections = args.sections 
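The -b/--bytes value accepted above follows the n[KMG] convention and is parsed inline further down in main(); a standalone sketch of that parsing (`parse_size` is a hypothetical name introduced only for the demo):

```python
# Sketch of the inline n[KMG] parsing applied to args.bytes in main():
# a trailing k/m/g (case-insensitive) selects the matching power of 1024.
def parse_size(spec):
    power = 'kmg'.find(spec[-1].lower()) + 1   # 'k' -> 1, 'm' -> 2, 'g' -> 3
    return int(spec[:-1]) * 1024 ** power

assert parse_size('1M') == 1024 ** 2
assert parse_size('500K') == 500 * 1024
```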
3169 | options.keepLists = args.lists 3170 | options.toHTML = args.html 3171 | options.write_json = args.json 3172 | options.print_revision = args.revision 3173 | options.min_text_length = args.min_text_length 3174 | if args.html: 3175 | options.keepLinks = True 3176 | 3177 | options.expand_templates = args.no_templates 3178 | options.filter_disambig_pages = args.filter_disambig_pages 3179 | options.keep_tables = args.keep_tables 3180 | 3181 | try: 3182 | power = 'kmg'.find(args.bytes[-1].lower()) + 1 3183 | file_size = int(args.bytes[:-1]) * 1024 ** power 3184 | if file_size < minFileSize: 3185 | raise ValueError() 3186 | except ValueError: 3187 | logging.error('Insufficient or invalid size: %s', args.bytes) 3188 | return 3189 | 3190 | if args.namespaces: 3191 | options.acceptedNamespaces = set(args.namespaces.split(',')) 3192 | 3193 | # ignoredTags and discardElements have default values already supplied; if passed in, the defaults are overwritten 3194 | if args.ignored_tags: 3195 | ignoredTags = set(args.ignored_tags.split(',')) 3196 | else: 3197 | ignoredTags = [ 3198 | 'abbr', 'b', 'big', 'blockquote', 'center', 'cite', 'em', 3199 | 'font', 'h1', 'h2', 'h3', 'h4', 'hiero', 'i', 'kbd', 3200 | 'p', 'plaintext', 's', 'span', 'strike', 'strong', 3201 | 'tt', 'u', 'var' 3202 | ] 3203 | 3204 | # 'a' tag is handled separately 3205 | for tag in ignoredTags: 3206 | ignoreTag(tag) 3207 | 3208 | if args.discard_elements: 3209 | options.discardElements = set(args.discard_elements.split(',')) 3210 | 3211 | FORMAT = '%(levelname)s: %(message)s' 3212 | logging.basicConfig(format=FORMAT) 3213 | 3214 | options.quiet = args.quiet 3215 | options.debug = args.debug 3216 | 3217 | createLogger(options.quiet, options.debug) 3218 | 3219 | input_file = args.input 3220 | 3221 | if not options.keepLinks: 3222 | ignoreTag('a') 3223 | 3224 | # sharing cache of parser templates is too slow: 3225 | # manager = Manager() 3226 | # templateCache = manager.dict() 3227 | 3228 | if args.article: 
3229 | if args.templates: 3230 | if os.path.exists(args.templates): 3231 | with open(args.templates) as file: 3232 | load_templates(file) 3233 | 3234 | file = fileinput.FileInput(input_file, openhook=fileinput.hook_compressed) 3235 | for page_data in pages_from(file): 3236 | id, revid, title, ns, page = page_data 3237 | Extractor(id, revid, title, page).extract(sys.stdout) 3238 | file.close() 3239 | return 3240 | 3241 | output_path = args.output 3242 | if output_path != '-' and not os.path.isdir(output_path): 3243 | try: 3244 | os.makedirs(output_path) 3245 | except: 3246 | logging.error('Could not create: %s', output_path) 3247 | return 3248 | 3249 | process_dump(input_file, args.templates, output_path, file_size, 3250 | args.compress, args.processes) 3251 | 3252 | def createLogger(quiet, debug): 3253 | logger = logging.getLogger() 3254 | if not quiet: 3255 | logger.setLevel(logging.INFO) 3256 | if debug: 3257 | logger.setLevel(logging.DEBUG) 3258 | 3259 | if __name__ == '__main__': 3260 | main() 3261 | -------------------------------------------------------------------------------- /cirrus-extract.py: -------------------------------------------------------------------------------- 1 | #!/usr/bin/env python 2 | # -*- coding: utf-8 -*- 3 | # 4 | # ============================================================================= 5 | # Version: 1.00 (December 15, 2015) 6 | # Author: Giuseppe Attardi (attardi@di.unipi.it), University of Pisa 7 | # 8 | # ============================================================================= 9 | # Copyright (c) 2015. Giuseppe Attardi (attardi@di.unipi.it). 10 | # ============================================================================= 11 | # This file is part of Tanl. 12 | # 13 | # Tanl is free software; you can redistribute it and/or modify it 14 | # under the terms of the GNU General Public License, version 3, 15 | # as published by the Free Software Foundation. 
16 | # 17 | # Tanl is distributed in the hope that it will be useful, 18 | # but WITHOUT ANY WARRANTY; without even the implied warranty of 19 | # MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the 20 | # GNU General Public License for more details. 21 | # 22 | # You should have received a copy of the GNU General Public License 23 | # along with this program. If not, see <http://www.gnu.org/licenses/>. 24 | # ============================================================================= 25 | 26 | """Wikipedia Cirrus Extractor: 27 | Extracts and cleans text from a Wikipedia Cirrus dump and stores output in a 28 | number of files of similar size in a given directory. 29 | Each file will contain several documents in the format: 30 | 31 | <doc id="" url="" title=""> 32 | ... 33 | </doc> 34 | 35 | """ 36 | 37 | import sys, os.path, time 38 | import re 39 | import json 40 | import argparse 41 | import bz2 42 | import gzip 43 | import logging 44 | 45 | # Program version 46 | version = '1.00' 47 | 48 | urlbase = 'http://it.wikipedia.org/' 49 | 50 | # ---------------------------------------------------------------------- 51 | 52 | class NextFile(object): 53 | """ 54 | Synchronous generation of next available file name.
55 | """ 56 | 57 | filesPerDir = 100 58 | 59 | def __init__(self, path_name): 60 | self.path_name = path_name 61 | self.dir_index = -1 62 | self.file_index = -1 63 | 64 | def next(self): 65 | self.file_index = (self.file_index + 1) % NextFile.filesPerDir 66 | if self.file_index == 0: 67 | self.dir_index += 1 68 | dirname = self._dirname() 69 | if not os.path.isdir(dirname): 70 | os.makedirs(dirname) 71 | return self._filepath() 72 | 73 | def _dirname(self): 74 | char1 = self.dir_index % 26 75 | char2 = self.dir_index / 26 % 26 76 | return os.path.join(self.path_name, '%c%c' % (ord('A') + char2, ord('A') + char1)) 77 | 78 | def _filepath(self): 79 | return '%s/wiki_%02d' % (self._dirname(), self.file_index) 80 | 81 | class OutputSplitter(object): 82 | """ 83 | File-like object that splits output to multiple files of a given max size. 84 | """ 85 | 86 | def __init__(self, nextFile, max_file_size=0, compress=True): 87 | """ 88 | :param nextFile: a NextFile object from which to obtain filenames 89 | to use. 90 | :param max_file_size: the maximum size of each file. 91 | :param compress: whether to write data with bzip compression. 92 | """ 93 | self.nextFile = nextFile 94 | self.compress = compress 95 | self.max_file_size = max_file_size 96 | self.file = self.open(self.nextFile.next()) 97 | 98 | def reserve(self, size): 99 | if self.file.tell() + size > self.max_file_size: 100 | self.close() 101 | self.file = self.open(self.nextFile.next()) 102 | 103 | def write(self, data): 104 | self.reserve(len(data)) 105 | self.file.write(data) 106 | 107 | def close(self): 108 | self.file.close() 109 | 110 | def open(self, filename): 111 | if self.compress: 112 | return bz2.BZ2File(filename + '.bz2', 'w') 113 | else: 114 | return open(filename, 'w') 115 | 116 | # ---------------------------------------------------------------------- 117 | 118 | class Extractor(object): 119 | 120 | def extract(self, out): 121 | """ 122 | :param out: output file.
123 | """ 124 | logging.debug("%s\t%s", self.id, self.title) 125 | text = ''.join(self.page) 126 | url = get_url(self.id) 127 | header = '<doc id="%s" url="%s" title="%s">\n' % (self.id, url, self.title) 128 | # Separate header from text with a newline. 129 | header += self.title + '\n\n' 130 | header = header.encode('utf-8') 131 | footer = "\n</doc>\n" 132 | out.write(header) 133 | text = clean(self, text) 134 | for line in compact(text): 135 | out.write(line.encode('utf-8')) 136 | out.write('\n') 137 | out.write(footer) 138 | 139 | def process_dump(input_file, out_file, file_size, file_compress): 140 | """ 141 | :param input_file: name of the wikipedia dump file; '-' to read from stdin 142 | :param out_file: directory where to store extracted data, or '-' for stdout 143 | :param file_size: max size of each extracted file, or None for no max (one file) 144 | :param file_compress: whether to compress files with bzip. 145 | """ 146 | 147 | if input_file == '-': 148 | input = sys.stdin 149 | else: 150 | input = gzip.open(input_file) 151 | 152 | if out_file == '-': 153 | output = sys.stdout 154 | if file_compress: 155 | logging.warn("writing to stdout, so no output compression (use external tool)") 156 | else: 157 | nextFile = NextFile(out_file) 158 | output = OutputSplitter(nextFile, file_size, file_compress) 159 | 160 | # process dump 161 | # format 162 | # {"index":{"_type":"page","_id":"3825914"}} 163 | # {"namespace":0,"title":TITLE,"timestamp":"2014-06-29T15:51:09Z","text":TEXT,...} 164 | while True: 165 | line = input.readline() 166 | if not line: 167 | break 168 | index = json.loads(line) 169 | content = json.loads(input.readline()) 170 | type = index['index']['_type'] 171 | id = index['index']['_id'] 172 | if type == 'page' and content['namespace'] == 0: 173 | title = content['title'] 174 | text = content['text'] 175 | # drop references: 176 | # ^ The Penguin Dictionary 177 | text = re.sub(r' \^ .*', '', text) 178 | url = urlbase + 'wiki?curid=' + id 179 | header = '<doc id="%s" url="%s" title="%s">\n' % (id, url, title) 180 |
page = header + title + '\n\n' + text + '\n</doc>\n' 181 | output.write(page.encode('utf-8')) 182 | 183 | # ---------------------------------------------------------------------- 184 | 185 | # Minimum size of output files 186 | minFileSize = 200 * 1024 187 | 188 | def main(): 189 | parser = argparse.ArgumentParser(prog=os.path.basename(sys.argv[0]), 190 | formatter_class=argparse.RawDescriptionHelpFormatter, 191 | description=__doc__) 192 | parser.add_argument("input", 193 | help="Cirrus JSON wiki dump file") 194 | groupO = parser.add_argument_group('Output') 195 | groupO.add_argument("-o", "--output", default="text", 196 | help="directory for extracted files (or '-' for dumping to stdout)") 197 | groupO.add_argument("-b", "--bytes", default="1M", 198 | help="maximum bytes per output file (default %(default)s)", 199 | metavar="n[KMG]") 200 | groupO.add_argument("-c", "--compress", action="store_true", 201 | help="compress output files using bzip") 202 | 203 | groupP = parser.add_argument_group('Processing') 204 | groupP.add_argument("-ns", "--namespaces", default="", metavar="ns1,ns2", 205 | help="accepted namespaces") 206 | 207 | groupS = parser.add_argument_group('Special') 208 | groupS.add_argument("-q", "--quiet", action="store_true", 209 | help="suppress reporting progress info") 210 | groupS.add_argument("-v", "--version", action="version", 211 | version='%(prog)s ' + version, 212 | help="print program version") 213 | 214 | args = parser.parse_args() 215 | 216 | try: 217 | power = 'kmg'.find(args.bytes[-1].lower()) + 1 218 | file_size = int(args.bytes[:-1]) * 1024 ** power 219 | if file_size < minFileSize: 220 | raise ValueError() 221 | except ValueError: 222 | logging.error('Insufficient or invalid size: %s', args.bytes) 223 | return 224 | 225 | FORMAT = '%(levelname)s: %(message)s' 226 | logging.basicConfig(format=FORMAT) 227 | 228 | logger = logging.getLogger() 229 | if not args.quiet: 230 | logger.setLevel(logging.INFO) 231 | 232 | input_file = args.input 233 |
234 | output_path = args.output 235 | if output_path != '-' and not os.path.isdir(output_path): 236 | try: 237 | os.makedirs(output_path) 238 | except: 239 | logging.error('Could not create: %s', output_path) 240 | return 241 | 242 | process_dump(input_file, output_path, file_size, args.compress) 243 | 244 | 245 | if __name__ == '__main__': 246 | main() 247 | -------------------------------------------------------------------------------- /extractPage.py: -------------------------------------------------------------------------------- 1 | #!/usr/bin/python 2 | # -*- coding: utf-8 -*- 3 | # 4 | # ============================================================================= 5 | # Version: 2.9 (Feb 13, 2016) 6 | # Author: Giuseppe Attardi (attardi@di.unipi.it), University of Pisa 7 | 8 | # ============================================================================= 9 | # Copyright (c) 2009. Giuseppe Attardi (attardi@di.unipi.it). 10 | # ============================================================================= 11 | # This file is part of Tanl. 12 | # 13 | # Tanl is free software; you can redistribute it and/or modify it 14 | # under the terms of the GNU General Public License, version 3, 15 | # as published by the Free Software Foundation. 16 | # 17 | # Tanl is distributed in the hope that it will be useful, 18 | # but WITHOUT ANY WARRANTY; without even the implied warranty of 19 | # MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the 20 | # GNU General Public License for more details. 21 | # 22 | # You should have received a copy of the GNU General Public License 23 | # along with this program. If not, see <http://www.gnu.org/licenses/>. 24 | # ============================================================================= 25 | 26 | """Wikipedia Page Extractor: 27 | Extracts a single page from a Wikipedia dump file.
28 | """ 29 | 30 | import sys, os.path 31 | import re, random 32 | import argparse 33 | from itertools import izip 34 | import logging, traceback 35 | import urllib 36 | import bz2, gzip 37 | from htmlentitydefs import name2codepoint 38 | import Queue, threading, multiprocessing 39 | 40 | 41 | # Program version 42 | version = '2.9' 43 | 44 | # ---------------------------------------------------------------------- 45 | # READER 46 | 47 | tagRE = re.compile(r'(.*?)<(/?\w+)[^>]*>(?:([^<]*)(<.*?>)?)?') 48 | #tagRE = re.compile(r'(.*?)<(/?\w+)[^>]*>([^<]*)') 49 | # 1 2 3 50 | 51 | def process_data(input_file, ids, templates=False): 52 | """ 53 | :param input_file: name of the wikipedia dump file. 54 | :param ids: article ids (single or range first-last). 55 | :param templates: whether to also collect templates 56 | """ 57 | 58 | if input_file.lower().endswith("bz2"): 59 | opener = bz2.BZ2File 60 | else: 61 | opener = open 62 | 63 | input = opener(input_file) 64 | print '<mediawiki>' 65 | 66 | rang = ids.split('-') 67 | first = int(rang[0]) 68 | if len(rang) == 1: 69 | last = first 70 | else: 71 | last = int(rang[1]) 72 | page = [] 73 | curid = 0 74 | for line in input: 75 | line = line.decode('utf-8') 76 | if '<' not in line: # faster than doing re.search() 77 | if page: 78 | page.append(line) 79 | continue 80 | m = tagRE.search(line) 81 | if not m: 82 | continue 83 | tag = m.group(2) 84 | if tag == 'page': 85 | page = [] 86 | page.append(line) 87 | inArticle = False 88 | elif tag == 'id' and not curid: # other <id> are present 89 | curid = int(m.group(3)) 90 | if first <= curid <= last: 91 | page.append(line) 92 | inArticle = True 93 | elif curid > last and not templates: 94 | break 95 | elif not inArticle and not templates: 96 | page = [] 97 | elif tag == 'title': 98 | if templates: 99 | if m.group(3).startswith('Template:'): 100 | page.append(line) 101 | else: 102 | page = [] 103 | else: 104 | page.append(line) 105 | elif tag == '/page': 106 | if page: 107 | page.append(line) 108 | print
''.join(page).encode('utf-8') 109 | if not templates and curid == last: 110 | break 111 | curid = 0 112 | page = [] 113 | elif page: 114 | page.append(line) 115 | 116 | print '</mediawiki>' 117 | input.close() 118 | 119 | def main(): 120 | parser = argparse.ArgumentParser(prog=os.path.basename(sys.argv[0]), 121 | formatter_class=argparse.RawDescriptionHelpFormatter, 122 | description=__doc__) 123 | parser.add_argument("input", 124 | help="XML wiki dump file") 125 | parser.add_argument("--id", default="", 126 | help="article number, or range first-last") 127 | parser.add_argument("--template", action="store_true", 128 | help="also extract all templates") 129 | parser.add_argument("-v", "--version", action="version", 130 | version='%(prog)s ' + version, 131 | help="print program version") 132 | 133 | args = parser.parse_args() 134 | 135 | process_data(args.input, args.id, args.template) 136 | 137 | if __name__ == '__main__': 138 | main() 139 | -------------------------------------------------------------------------------- /setup.py: -------------------------------------------------------------------------------- 1 | from setuptools import setup 2 | 3 | def readme(): 4 | with open('README.md') as f: 5 | return f.read() 6 | 7 | setup( 8 | name='wikiextractor', 9 | 10 | description='A script that extracts and cleans text from a Wikipedia ' 11 | 'database dump', 12 | author='Giuseppe Attardi', 13 | author_email='attardi@di.unipi.it', 14 | version='2.69', 15 | 16 | url='https://github.com/attardi/wikiextractor', 17 | 18 | license="GPL 3.0", 19 | keywords=['text', 'nlp'], 20 | scripts=['WikiExtractor.py'] 21 | ) 22 | -------------------------------------------------------------------------------- /tests.py: -------------------------------------------------------------------------------- 1 | # coding: utf-8 2 | # This test must pass both in Python 2 and 3.
3 | from __future__ import unicode_literals 4 | 5 | import sys 6 | import os.path 7 | import unittest 8 | 9 | from WikiExtractor import ( 10 | normalizeTitle, unescape, ucfirst, lcfirst, splitParts, 11 | fullyQualifiedTemplateTitle, NextFile 12 | ) 13 | 14 | 15 | class TestNormalizeTitle(unittest.TestCase): 16 | 17 | def test_known_namespace(self): 18 | self.assertEqual(normalizeTitle("Template: Births"), "Template:Births") 19 | self.assertEqual(normalizeTitle(" template: births_"), "Template:Births") 20 | 21 | def test_not_known_namespace(self): 22 | self.assertEqual(normalizeTitle("Category: Births"), "Category: Births") 23 | self.assertEqual(normalizeTitle("_category: births___"), "Category: Births") 24 | 25 | def test_no_namespace(self): 26 | self.assertEqual(normalizeTitle("python"), "Python") 27 | self.assertEqual(normalizeTitle("python 3"), "Python 3") 28 | self.assertEqual(normalizeTitle("python__3"), "Python 3") 29 | 30 | 31 | class TestStringUtils(unittest.TestCase): 32 | 33 | def test_unescape(self): 34 | self.assertEqual(unescape('"'), '"') 35 | self.assertEqual(unescape('&'), '&') 36 | self.assertEqual(unescape('あ'), '\u3042') 37 | if sys.maxunicode > 0xFFFF: 38 | # Python 3 or UCS-4 build of Python 2 39 | self.assertEqual(unescape('𝕆'), '\U0001D546') 40 | self.assertEqual(unescape('𝓁'), '\U0001d4c1') 41 | else: 42 | # UCS-2 build of Python 2 43 | self.assertEqual(unescape('𝕆'), '𝕆') 44 | self.assertEqual(unescape('𝓁'), '𝓁') 45 | 46 | def test_ucfirst(self): 47 | self.assertEqual(ucfirst('python'), 'Python') 48 | 49 | def test_lcfirst(self): 50 | self.assertEqual(lcfirst('Python'), 'python') 51 | 52 | 53 | class TestSplitParts(unittest.TestCase): 54 | 55 | def test_simple(self): 56 | self.assertEqual(splitParts("p=q|q=r|r=s"), ['p=q', 'q=r', 'r=s']) 57 | 58 | def test_complex(self): 59 | self.assertEqual(splitParts('{{#if: {{{1}}} | {{lc:{{{1}}} | "parameter missing"}}'), 60 | ['{{#if: {{{1}}} ', ' {{lc:{{{1}}} ', ' "parameter missing"}}']) 61 | 62 | 
self.assertEqual(splitParts('''{{if:| 63 | |{{#if:the president| 64 | |{{#if:| 65 | [[Category:Hatnote templates|A{{PAGENAME}}]] 66 | }} 67 | }} 68 | }}'''), ['''{{if:| 69 | |{{#if:the president| 70 | |{{#if:| 71 | [[Category:Hatnote templates|A{{PAGENAME}}]] 72 | }} 73 | }} 74 | }}''']) 75 | 76 | 77 | class TestFullyQualifiedTemplateTitle(unittest.TestCase): 78 | 79 | def test_main_namespace(self): 80 | self.assertEqual(fullyQualifiedTemplateTitle(':Python'), 'Python') 81 | self.assertEqual(fullyQualifiedTemplateTitle(':python'), 'Python') 82 | 83 | def test_other_namespace(self): 84 | self.assertEqual(fullyQualifiedTemplateTitle('User:Orange'), 'User:Orange') 85 | 86 | 87 | class TestNextFile(unittest.TestCase): 88 | 89 | def test_next(self): 90 | f = NextFile('out') 91 | self.assertEqual(next(f), 'out{}AA/wiki_00'.format(os.path.sep)) 92 | self.assertEqual(next(f), 'out{}AA/wiki_01'.format(os.path.sep)) 93 | for _ in range(97): 94 | next(f) 95 | self.assertEqual(next(f), 'out{}AA/wiki_99'.format(os.path.sep)) 96 | self.assertEqual(next(f), 'out{}AB/wiki_00'.format(os.path.sep)) 97 | 98 | 99 | if __name__ == '__main__': 100 | unittest.main() 101 | -------------------------------------------------------------------------------- /tox.ini: -------------------------------------------------------------------------------- 1 | [tox] 2 | envlist = py27, py33, py34, py35 3 | [testenv] 4 | commands=python tests.py 5 | --------------------------------------------------------------------------------
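Both WikiExtractor.py and cirrus-extract.py parse the `-b/--bytes` option with the same idiom: `'kmg'.find(...) + 1` turns a trailing K/M/G suffix into a power of 1024, and the result is rejected below `minFileSize = 200 * 1024`. A minimal Python 3 sketch of that logic (the function name `parse_size` is ours, not the scripts'):

```python
def parse_size(spec, min_size=200 * 1024):
    """Parse a size spec like '1M' or '500K' into bytes, mirroring the
    'kmg'.find(...) idiom used by both extractor scripts."""
    # str.find returns -1 when the suffix is absent, so +1 maps
    # 'k' -> 1024**1, 'm' -> 1024**2, 'g' -> 1024**3.
    # Note: like the originals, this assumes a suffix is present;
    # a bare number would lose its last digit to spec[:-1].
    power = 'kmg'.find(spec[-1].lower()) + 1
    size = int(spec[:-1]) * 1024 ** power
    if size < min_size:
        raise ValueError('insufficient or invalid size: %s' % spec)
    return size
```

So `parse_size('1M')` yields 1048576 bytes, while `'100K'` (102400 bytes) falls below the 200 KB floor and raises, matching the "Insufficient or invalid size" error in both `main()` functions.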
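`NextFile` spreads output across directories `AA`, `AB`, ..., 100 files per directory, exactly as `TestNextFile` exercises; the two-letter directory name is a base-26 encoding of `dir_index`. A standalone sketch of the naming scheme in Python 3 (the helper name `wiki_filepath` is ours; note integer division is `//`, unlike the Python 2 sources):

```python
import os


def wiki_filepath(path_name, dir_index, file_index, files_per_dir=100):
    """Recreate the NextFile naming scheme: dir_index 0 -> 'AA',
    1 -> 'AB', 26 -> 'BA'; file_index -> 'wiki_00'..'wiki_99'."""
    char1 = dir_index % 26          # low base-26 digit
    char2 = dir_index // 26 % 26    # high base-26 digit
    dirname = os.path.join(path_name, '%c%c' % (ord('A') + char2, ord('A') + char1))
    return '%s/wiki_%02d' % (dirname, file_index % files_per_dir)
```

For example, the 100th call rolls `file_index` back to 0 and advances the directory, so file paths go `out/AA/wiki_99` then `out/AB/wiki_00`, which is the transition `TestNextFile.test_next` asserts.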