├── .gitignore ├── .travis.yml ├── LICENCE ├── README.rst ├── dev_requirements.txt ├── docs ├── .gitignore ├── Makefile ├── make.bat └── source │ ├── _theme │ └── armstrong │ │ ├── LICENSE │ │ ├── README.rst │ │ ├── layout.html │ │ ├── rtd-themes.conf │ │ ├── static │ │ └── rtd.css_t │ │ └── theme.conf │ ├── conf.py │ └── index.rst ├── download_test_data.sh ├── pdftables ├── __init__.py ├── boxes.py ├── config_parameters.py ├── counter.py ├── diagnostics.py ├── display.py ├── line_segments.py ├── numpy_subset.py ├── patched_poppler.py ├── pdf_document.py ├── pdf_document_pdfminer.py ├── pdf_document_poppler.py ├── pdftables.py └── scripts │ ├── __init__.py │ └── render.py ├── render_all.sh ├── requirements.txt ├── setup.py └── test ├── fixtures.py ├── test_Table_class.py ├── test_all_sample_data.py ├── test_box.py ├── test_contains_tables.py ├── test_finds_tables.py ├── test_get_tables.py ├── test_ground.py ├── test_linesegments.py └── test_render_script.py /.gitignore: -------------------------------------------------------------------------------- 1 | *.pyc 2 | fixtures/ 3 | png/ 4 | svg/ 5 | .*.swp 6 | /*.egg-info 7 | -------------------------------------------------------------------------------- /.travis.yml: -------------------------------------------------------------------------------- 1 | language: python 2 | python: 3 | - "2.7" 4 | virtualenv: 5 | system_site_packages: true 6 | before_install: 7 | - export PIP_USE_MIRRORS=true 8 | - sudo apt-get update 9 | - sudo apt-get install -qq python-poppler 10 | install: 11 | - pip install -e . 
12 | - pip install -r requirements.txt 13 | - pip install coveralls 14 | - ./download_test_data.sh 15 | script: nosetests --with-coverage --cover-package=pdftables 16 | after_success: 17 | - coveralls 18 | -------------------------------------------------------------------------------- /LICENCE: -------------------------------------------------------------------------------- 1 | Copyright (c) 2013, ScraperWiki Limited 2 | All rights reserved. 3 | 4 | Redistribution and use in source and binary forms, with or without modification, are permitted provided that the following conditions are met: 5 | 6 | Redistributions of source code must retain the above copyright notice, this list of conditions and the following disclaimer. 7 | Redistributions in binary form must reproduce the above copyright notice, this list of conditions and the following disclaimer in the documentation and/or other materials provided with the distribution. 8 | THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT HOLDER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. 9 | -------------------------------------------------------------------------------- /README.rst: -------------------------------------------------------------------------------- 1 | .. 
-*- mode: rst -*- 2 | 3 | pdftables - a library for extracting tables from PDF files 4 | ========================================================== 5 | 6 | .. image:: https://travis-ci.org/scraperwiki/pdftables.png 7 | :target: https://travis-ci.org/scraperwiki/pdftables 8 | .. image:: https://pypip.in/v/pdftables/badge.png 9 | :target: https://pypi.python.org/pypi/pdftables 10 | 11 | **pdftables is no longer maintained**. The development continued commercially at `pdftables.com `_. 12 | 13 | .. 14 | 15 | `This Readme, and more, is available on ReadTheDocs. `_ 16 | 17 | `This post `_ 18 | on the ScraperWiki blog describes the algorithms used in pdftables, and 19 | something of its genesis. This README gives more technical information. 20 | 21 | pdftables uses `pdfminer `_ to get information on the locations of text 22 | elements in a PDF document. pdfminer was chosen as a base because it provides 23 | information on the full range of page elements in PDF files, including 24 | graphical elements such as lines. Although the algorithms currently used do not 25 | use these elements, doing so is planned for future work. As a purely Python library, 26 | pdfminer is very portable. The downside of pdfminer is that it is slow, perhaps 27 | an order of magnitude slower than alternative C-based libraries. 28 | 29 | Installation 30 | ============ 31 | 32 | You need poppler and Cairo. On Ubuntu and similar systems, run: 33 | 34 | .. code:: bash 35 | 36 | sudo apt-get -y install python-poppler python-cairo 37 | 38 | Then we can install the ``pip``-installable requirements from the ``requirements.txt`` file: 39 | 40 | .. code:: bash 41 | 42 | pip install -r requirements.txt 43 | 44 | Usage 45 | ===== 46 | 47 | First we open a file object for a PDF: 48 | 49 | .. code:: python 50 | 51 | filepath = 'example.pdf' 52 | fileobj = open(filepath, 'rb') 53 | 54 | Then we create a PDFDocument from the file object: 55 | 56 | ..
code:: python 57 | 58 | from pdftables.pdf_document import PDFDocument 59 | doc = PDFDocument.from_fileobj(fileobj) 60 | 61 | Then we use the ``get_page()`` method to select a single page from the document, and extract its tables with ``page_to_tables()``: 62 | 63 | .. code:: python 64 | 65 | from pdftables.pdftables import page_to_tables 66 | page = doc.get_page(pagenumber) 67 | tables = page_to_tables(page) 68 | 69 | You can also loop over all pages in the PDF using ``get_pages()``: 70 | 71 | .. code:: python 72 | 73 | from pdftables.pdftables import page_to_tables 74 | for page_number, page in enumerate(doc.get_pages()): 75 | tables = page_to_tables(page) 76 | 77 | Now that you have a TableContainer object, you can convert it to ASCII for quick previewing: 78 | 79 | .. code:: python 80 | 81 | from pdftables.display import to_string 82 | for table in tables: 83 | print to_string(table.data) 84 | 85 | ``table.data`` is a table that has been found, in the form of a list of lists of strings 86 | (i.e. a list of rows, each containing the same number of cells). 87 | 88 | Command line tool 89 | ================= 90 | 91 | pdftables includes a command line tool for diagnostic rendering of pages and tables, called ``pdftables-render``. 92 | This is installed if you ``pip install`` pdftables, or if you manually run ``python setup.py install``. 93 | 94 | .. code:: bash 95 | 96 | $ pdftables-render example.pdf 97 | 98 | This creates separate PNG and SVG files for each page of the specified PDF, in ``png/`` and ``svg/``, with three diagnostic displays per page. 99 | 100 | Developing pdftables 101 | ==================== 102 | 103 | Files and folders:: 104 | 105 | . 106 | |-fixtures 107 | | |-sample_data 108 | |-pdftables 109 | |-test 110 | 111 | *fixtures* contains test fixtures; in particular, the sample_data directory 112 | contains PDF files which are installed from a separate repository by running 113 | the ``download_test_data.sh`` script.
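Since each ``table.data`` described in the Usage section above is plain Python data (a list of rows of cell strings), extracted tables can be post-processed with nothing beyond the standard library. A minimal sketch, using a made-up sample table rather than real pdftables output, that writes one table out as CSV:

```python
import csv
import io

# Made-up stand-in for a table.data value as returned by
# page_to_tables(): a list of rows, each a list of cell strings.
data = [
    ["name", "value"],
    ["alpha", "1"],
    ["beta", "2"],
]

# Write the rows into an in-memory CSV buffer.
buf = io.StringIO()
csv.writer(buf).writerows(data)
print(buf.getvalue())
```

The same approach works with any CSV or spreadsheet tooling, since nothing about the row-of-strings shape is specific to pdftables.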
114 | 115 | We also use data from http://www.tamirhassan.com/competition/dataset-tools.html, which is likewise installed by the download script. 116 | 117 | *pdftables* contains the core code files 118 | 119 | *test* contains tests 120 | 121 | **pdftables.py** - this is the core of the pdftables library 122 | 123 | **counter.py** - implements collections.Counter for the benefit of Python 2.6 124 | 125 | **display.py** - prettily prints a table by implementing the ``to_string`` function 126 | 127 | **numpy_subset.py** - partially implements ``numpy.diff``, ``numpy.arange`` and ``numpy.average`` to avoid a large dependency on numpy. 128 | 129 | **pdf_document.py** - implements PDFDocument to abstract away the underlying PDF class, and ease any conversion to a different underlying PDF library to replace pdfminer 130 | 131 | 132 | 133 | -------------------------------------------------------------------------------- /dev_requirements.txt: -------------------------------------------------------------------------------- 1 | Pygments 2 | -------------------------------------------------------------------------------- /docs/.gitignore: -------------------------------------------------------------------------------- 1 | /build/ 2 | -------------------------------------------------------------------------------- /docs/Makefile: -------------------------------------------------------------------------------- 1 | # Makefile for Sphinx documentation 2 | # 3 | 4 | # You can set these variables from the command line. 5 | SPHINXOPTS = 6 | SPHINXBUILD = sphinx-build 7 | PAPER = 8 | BUILDDIR = build 9 | 10 | # Internal variables.
11 | PAPEROPT_a4 = -D latex_paper_size=a4 12 | PAPEROPT_letter = -D latex_paper_size=letter 13 | ALLSPHINXOPTS = -d $(BUILDDIR)/doctrees $(PAPEROPT_$(PAPER)) $(SPHINXOPTS) source 14 | # the i18n builder cannot share the environment and doctrees with the others 15 | I18NSPHINXOPTS = $(PAPEROPT_$(PAPER)) $(SPHINXOPTS) source 16 | 17 | .PHONY: help clean html dirhtml singlehtml pickle json htmlhelp qthelp devhelp epub latex latexpdf text man changes linkcheck doctest gettext 18 | 19 | help: 20 | @echo "Please use \`make <target>' where <target> is one of" 21 | @echo " html to make standalone HTML files" 22 | @echo " dirhtml to make HTML files named index.html in directories" 23 | @echo " singlehtml to make a single large HTML file" 24 | @echo " pickle to make pickle files" 25 | @echo " json to make JSON files" 26 | @echo " htmlhelp to make HTML files and a HTML help project" 27 | @echo " qthelp to make HTML files and a qthelp project" 28 | @echo " devhelp to make HTML files and a Devhelp project" 29 | @echo " epub to make an epub" 30 | @echo " latex to make LaTeX files, you can set PAPER=a4 or PAPER=letter" 31 | @echo " latexpdf to make LaTeX files and run them through pdflatex" 32 | @echo " text to make text files" 33 | @echo " man to make manual pages" 34 | @echo " texinfo to make Texinfo files" 35 | @echo " info to make Texinfo files and run them through makeinfo" 36 | @echo " gettext to make PO message catalogs" 37 | @echo " changes to make an overview of all changed/added/deprecated items" 38 | @echo " linkcheck to check all external links for integrity" 39 | @echo " doctest to run all doctests embedded in the documentation (if enabled)" 40 | 41 | clean: 42 | -rm -rf $(BUILDDIR)/* 43 | 44 | html: 45 | $(SPHINXBUILD) -b html $(ALLSPHINXOPTS) $(BUILDDIR)/html 46 | @echo 47 | @echo "Build finished. The HTML pages are in $(BUILDDIR)/html." 48 | 49 | dirhtml: 50 | $(SPHINXBUILD) -b dirhtml $(ALLSPHINXOPTS) $(BUILDDIR)/dirhtml 51 | @echo 52 | @echo "Build finished.
The HTML pages are in $(BUILDDIR)/dirhtml." 53 | 54 | singlehtml: 55 | $(SPHINXBUILD) -b singlehtml $(ALLSPHINXOPTS) $(BUILDDIR)/singlehtml 56 | @echo 57 | @echo "Build finished. The HTML page is in $(BUILDDIR)/singlehtml." 58 | 59 | pickle: 60 | $(SPHINXBUILD) -b pickle $(ALLSPHINXOPTS) $(BUILDDIR)/pickle 61 | @echo 62 | @echo "Build finished; now you can process the pickle files." 63 | 64 | json: 65 | $(SPHINXBUILD) -b json $(ALLSPHINXOPTS) $(BUILDDIR)/json 66 | @echo 67 | @echo "Build finished; now you can process the JSON files." 68 | 69 | htmlhelp: 70 | $(SPHINXBUILD) -b htmlhelp $(ALLSPHINXOPTS) $(BUILDDIR)/htmlhelp 71 | @echo 72 | @echo "Build finished; now you can run HTML Help Workshop with the" \ 73 | ".hhp project file in $(BUILDDIR)/htmlhelp." 74 | 75 | qthelp: 76 | $(SPHINXBUILD) -b qthelp $(ALLSPHINXOPTS) $(BUILDDIR)/qthelp 77 | @echo 78 | @echo "Build finished; now you can run "qcollectiongenerator" with the" \ 79 | ".qhcp project file in $(BUILDDIR)/qthelp, like this:" 80 | @echo "# qcollectiongenerator $(BUILDDIR)/qthelp/pdftables.qhcp" 81 | @echo "To view the help file:" 82 | @echo "# assistant -collectionFile $(BUILDDIR)/qthelp/pdftables.qhc" 83 | 84 | devhelp: 85 | $(SPHINXBUILD) -b devhelp $(ALLSPHINXOPTS) $(BUILDDIR)/devhelp 86 | @echo 87 | @echo "Build finished." 88 | @echo "To view the help file:" 89 | @echo "# mkdir -p $$HOME/.local/share/devhelp/pdftables" 90 | @echo "# ln -s $(BUILDDIR)/devhelp $$HOME/.local/share/devhelp/pdftables" 91 | @echo "# devhelp" 92 | 93 | epub: 94 | $(SPHINXBUILD) -b epub $(ALLSPHINXOPTS) $(BUILDDIR)/epub 95 | @echo 96 | @echo "Build finished. The epub file is in $(BUILDDIR)/epub." 97 | 98 | latex: 99 | $(SPHINXBUILD) -b latex $(ALLSPHINXOPTS) $(BUILDDIR)/latex 100 | @echo 101 | @echo "Build finished; the LaTeX files are in $(BUILDDIR)/latex." 102 | @echo "Run \`make' in that directory to run these through (pdf)latex" \ 103 | "(use \`make latexpdf' here to do that automatically)." 
104 | 105 | latexpdf: 106 | $(SPHINXBUILD) -b latex $(ALLSPHINXOPTS) $(BUILDDIR)/latex 107 | @echo "Running LaTeX files through pdflatex..." 108 | $(MAKE) -C $(BUILDDIR)/latex all-pdf 109 | @echo "pdflatex finished; the PDF files are in $(BUILDDIR)/latex." 110 | 111 | text: 112 | $(SPHINXBUILD) -b text $(ALLSPHINXOPTS) $(BUILDDIR)/text 113 | @echo 114 | @echo "Build finished. The text files are in $(BUILDDIR)/text." 115 | 116 | man: 117 | $(SPHINXBUILD) -b man $(ALLSPHINXOPTS) $(BUILDDIR)/man 118 | @echo 119 | @echo "Build finished. The manual pages are in $(BUILDDIR)/man." 120 | 121 | texinfo: 122 | $(SPHINXBUILD) -b texinfo $(ALLSPHINXOPTS) $(BUILDDIR)/texinfo 123 | @echo 124 | @echo "Build finished. The Texinfo files are in $(BUILDDIR)/texinfo." 125 | @echo "Run \`make' in that directory to run these through makeinfo" \ 126 | "(use \`make info' here to do that automatically)." 127 | 128 | info: 129 | $(SPHINXBUILD) -b texinfo $(ALLSPHINXOPTS) $(BUILDDIR)/texinfo 130 | @echo "Running Texinfo files through makeinfo..." 131 | make -C $(BUILDDIR)/texinfo info 132 | @echo "makeinfo finished; the Info files are in $(BUILDDIR)/texinfo." 133 | 134 | gettext: 135 | $(SPHINXBUILD) -b gettext $(I18NSPHINXOPTS) $(BUILDDIR)/locale 136 | @echo 137 | @echo "Build finished. The message catalogs are in $(BUILDDIR)/locale." 138 | 139 | changes: 140 | $(SPHINXBUILD) -b changes $(ALLSPHINXOPTS) $(BUILDDIR)/changes 141 | @echo 142 | @echo "The overview file is in $(BUILDDIR)/changes." 143 | 144 | linkcheck: 145 | $(SPHINXBUILD) -b linkcheck $(ALLSPHINXOPTS) $(BUILDDIR)/linkcheck 146 | @echo 147 | @echo "Link check complete; look for any errors in the above output " \ 148 | "or in $(BUILDDIR)/linkcheck/output.txt." 149 | 150 | doctest: 151 | $(SPHINXBUILD) -b doctest $(ALLSPHINXOPTS) $(BUILDDIR)/doctest 152 | @echo "Testing of doctests in the sources finished, look at the " \ 153 | "results in $(BUILDDIR)/doctest/output.txt." 
154 | -------------------------------------------------------------------------------- /docs/make.bat: -------------------------------------------------------------------------------- 1 | @ECHO OFF 2 | 3 | REM Command file for Sphinx documentation 4 | 5 | if "%SPHINXBUILD%" == "" ( 6 | set SPHINXBUILD=sphinx-build 7 | ) 8 | set BUILDDIR=build 9 | set ALLSPHINXOPTS=-d %BUILDDIR%/doctrees %SPHINXOPTS% source 10 | set I18NSPHINXOPTS=%SPHINXOPTS% source 11 | if NOT "%PAPER%" == "" ( 12 | set ALLSPHINXOPTS=-D latex_paper_size=%PAPER% %ALLSPHINXOPTS% 13 | set I18NSPHINXOPTS=-D latex_paper_size=%PAPER% %I18NSPHINXOPTS% 14 | ) 15 | 16 | if "%1" == "" goto help 17 | 18 | if "%1" == "help" ( 19 | :help 20 | echo.Please use `make ^<target^>` where ^<target^> is one of 21 | echo. html to make standalone HTML files 22 | echo. dirhtml to make HTML files named index.html in directories 23 | echo. singlehtml to make a single large HTML file 24 | echo. pickle to make pickle files 25 | echo. json to make JSON files 26 | echo. htmlhelp to make HTML files and a HTML help project 27 | echo. qthelp to make HTML files and a qthelp project 28 | echo. devhelp to make HTML files and a Devhelp project 29 | echo. epub to make an epub 30 | echo. latex to make LaTeX files, you can set PAPER=a4 or PAPER=letter 31 | echo. text to make text files 32 | echo. man to make manual pages 33 | echo. texinfo to make Texinfo files 34 | echo. gettext to make PO message catalogs 35 | echo. changes to make an overview over all changed/added/deprecated items 36 | echo. linkcheck to check all external links for integrity 37 | echo. doctest to run all doctests embedded in the documentation if enabled 38 | goto end 39 | ) 40 | 41 | if "%1" == "clean" ( 42 | for /d %%i in (%BUILDDIR%\*) do rmdir /q /s %%i 43 | del /q /s %BUILDDIR%\* 44 | goto end 45 | ) 46 | 47 | if "%1" == "html" ( 48 | %SPHINXBUILD% -b html %ALLSPHINXOPTS% %BUILDDIR%/html 49 | if errorlevel 1 exit /b 1 50 | echo. 51 | echo.Build finished.
The HTML pages are in %BUILDDIR%/html. 52 | goto end 53 | ) 54 | 55 | if "%1" == "dirhtml" ( 56 | %SPHINXBUILD% -b dirhtml %ALLSPHINXOPTS% %BUILDDIR%/dirhtml 57 | if errorlevel 1 exit /b 1 58 | echo. 59 | echo.Build finished. The HTML pages are in %BUILDDIR%/dirhtml. 60 | goto end 61 | ) 62 | 63 | if "%1" == "singlehtml" ( 64 | %SPHINXBUILD% -b singlehtml %ALLSPHINXOPTS% %BUILDDIR%/singlehtml 65 | if errorlevel 1 exit /b 1 66 | echo. 67 | echo.Build finished. The HTML pages are in %BUILDDIR%/singlehtml. 68 | goto end 69 | ) 70 | 71 | if "%1" == "pickle" ( 72 | %SPHINXBUILD% -b pickle %ALLSPHINXOPTS% %BUILDDIR%/pickle 73 | if errorlevel 1 exit /b 1 74 | echo. 75 | echo.Build finished; now you can process the pickle files. 76 | goto end 77 | ) 78 | 79 | if "%1" == "json" ( 80 | %SPHINXBUILD% -b json %ALLSPHINXOPTS% %BUILDDIR%/json 81 | if errorlevel 1 exit /b 1 82 | echo. 83 | echo.Build finished; now you can process the JSON files. 84 | goto end 85 | ) 86 | 87 | if "%1" == "htmlhelp" ( 88 | %SPHINXBUILD% -b htmlhelp %ALLSPHINXOPTS% %BUILDDIR%/htmlhelp 89 | if errorlevel 1 exit /b 1 90 | echo. 91 | echo.Build finished; now you can run HTML Help Workshop with the ^ 92 | .hhp project file in %BUILDDIR%/htmlhelp. 93 | goto end 94 | ) 95 | 96 | if "%1" == "qthelp" ( 97 | %SPHINXBUILD% -b qthelp %ALLSPHINXOPTS% %BUILDDIR%/qthelp 98 | if errorlevel 1 exit /b 1 99 | echo. 100 | echo.Build finished; now you can run "qcollectiongenerator" with the ^ 101 | .qhcp project file in %BUILDDIR%/qthelp, like this: 102 | echo.^> qcollectiongenerator %BUILDDIR%\qthelp\pdftables.qhcp 103 | echo.To view the help file: 104 | echo.^> assistant -collectionFile %BUILDDIR%\qthelp\pdftables.qhc 105 | goto end 106 | ) 107 | 108 | if "%1" == "devhelp" ( 109 | %SPHINXBUILD% -b devhelp %ALLSPHINXOPTS% %BUILDDIR%/devhelp 110 | if errorlevel 1 exit /b 1 111 | echo. 112 | echo.Build finished.
113 | goto end 114 | ) 115 | 116 | if "%1" == "epub" ( 117 | %SPHINXBUILD% -b epub %ALLSPHINXOPTS% %BUILDDIR%/epub 118 | if errorlevel 1 exit /b 1 119 | echo. 120 | echo.Build finished. The epub file is in %BUILDDIR%/epub. 121 | goto end 122 | ) 123 | 124 | if "%1" == "latex" ( 125 | %SPHINXBUILD% -b latex %ALLSPHINXOPTS% %BUILDDIR%/latex 126 | if errorlevel 1 exit /b 1 127 | echo. 128 | echo.Build finished; the LaTeX files are in %BUILDDIR%/latex. 129 | goto end 130 | ) 131 | 132 | if "%1" == "text" ( 133 | %SPHINXBUILD% -b text %ALLSPHINXOPTS% %BUILDDIR%/text 134 | if errorlevel 1 exit /b 1 135 | echo. 136 | echo.Build finished. The text files are in %BUILDDIR%/text. 137 | goto end 138 | ) 139 | 140 | if "%1" == "man" ( 141 | %SPHINXBUILD% -b man %ALLSPHINXOPTS% %BUILDDIR%/man 142 | if errorlevel 1 exit /b 1 143 | echo. 144 | echo.Build finished. The manual pages are in %BUILDDIR%/man. 145 | goto end 146 | ) 147 | 148 | if "%1" == "texinfo" ( 149 | %SPHINXBUILD% -b texinfo %ALLSPHINXOPTS% %BUILDDIR%/texinfo 150 | if errorlevel 1 exit /b 1 151 | echo. 152 | echo.Build finished. The Texinfo files are in %BUILDDIR%/texinfo. 153 | goto end 154 | ) 155 | 156 | if "%1" == "gettext" ( 157 | %SPHINXBUILD% -b gettext %I18NSPHINXOPTS% %BUILDDIR%/locale 158 | if errorlevel 1 exit /b 1 159 | echo. 160 | echo.Build finished. The message catalogs are in %BUILDDIR%/locale. 161 | goto end 162 | ) 163 | 164 | if "%1" == "changes" ( 165 | %SPHINXBUILD% -b changes %ALLSPHINXOPTS% %BUILDDIR%/changes 166 | if errorlevel 1 exit /b 1 167 | echo. 168 | echo.The overview file is in %BUILDDIR%/changes. 169 | goto end 170 | ) 171 | 172 | if "%1" == "linkcheck" ( 173 | %SPHINXBUILD% -b linkcheck %ALLSPHINXOPTS% %BUILDDIR%/linkcheck 174 | if errorlevel 1 exit /b 1 175 | echo. 176 | echo.Link check complete; look for any errors in the above output ^ 177 | or in %BUILDDIR%/linkcheck/output.txt. 
178 | goto end 179 | ) 180 | 181 | if "%1" == "doctest" ( 182 | %SPHINXBUILD% -b doctest %ALLSPHINXOPTS% %BUILDDIR%/doctest 183 | if errorlevel 1 exit /b 1 184 | echo. 185 | echo.Testing of doctests in the sources finished, look at the ^ 186 | results in %BUILDDIR%/doctest/output.txt. 187 | goto end 188 | ) 189 | 190 | :end 191 | -------------------------------------------------------------------------------- /docs/source/_theme/armstrong/LICENSE: -------------------------------------------------------------------------------- 1 | Copyright (c) 2011 Bay Citizen & Texas Tribune 2 | 3 | Original ReadTheDocs.org code 4 | Copyright (c) 2010 Charles Leifer, Eric Holscher, Bobby Grace 5 | 6 | Permission is hereby granted, free of charge, to any person 7 | obtaining a copy of this software and associated documentation 8 | files (the "Software"), to deal in the Software without 9 | restriction, including without limitation the rights to use, 10 | copy, modify, merge, publish, distribute, sublicense, and/or sell 11 | copies of the Software, and to permit persons to whom the 12 | Software is furnished to do so, subject to the following 13 | conditions: 14 | 15 | The above copyright notice and this permission notice shall be 16 | included in all copies or substantial portions of the Software. 17 | 18 | THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, 19 | EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES 20 | OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND 21 | NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT 22 | HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, 23 | WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING 24 | FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR 25 | OTHER DEALINGS IN THE SOFTWARE. 
26 | 27 | -------------------------------------------------------------------------------- /docs/source/_theme/armstrong/README.rst: -------------------------------------------------------------------------------- 1 | Armstrong Sphinx Theme 2 | ====================== 3 | Sphinx theme for Armstrong documentation 4 | 5 | 6 | Usage 7 | ----- 8 | Symlink this repository into your documentation at ``docs/_themes/armstrong`` 9 | then add the following two settings to your Sphinx ``conf.py`` file:: 10 | 11 | html_theme = "armstrong" 12 | html_theme_path = ["_themes", ] 13 | 14 | You can also change colors and such by adjusting the ``html_theme_options`` 15 | dictionary. For a list of all settings, see ``theme.conf``. 16 | 17 | 18 | Defaults 19 | -------- 20 | This repository has been customized for Armstrong documentation, but you can 21 | use the original default color scheme on your project by copying the 22 | ``rtd-themes.conf`` over the existing ``theme.conf``. 23 | 24 | 25 | Contributing 26 | ------------ 27 | 28 | * Create something awesome -- make the code better, add some functionality, 29 | whatever (this is the hardest part). 30 | * `Fork it`_ 31 | * Create a topic branch to house your changes 32 | * Get all of your commits in the new topic branch 33 | * Submit a `pull request`_ 34 | 35 | .. _Fork it: http://help.github.com/forking/ 36 | .. _pull request: http://help.github.com/pull-requests/ 37 | 38 | 39 | State of Project 40 | ---------------- 41 | Armstrong is an open-source news platform that is freely available to any 42 | organization. It is the result of a collaboration between the `Texas Tribune`_ 43 | and `Bay Citizen`_, and a grant from the `John S. and James L. Knight 44 | Foundation`_. The first stable release is scheduled for September, 2011. 45 | 46 | To follow development, be sure to join the `Google Group`_. 47 | 48 | ``armstrong_sphinx`` is part of the `Armstrong`_ project.
Unless you're 49 | looking for a Sphinx theme, you're probably looking for the main project. 50 | 51 | .. _Armstrong: http://www.armstrongcms.org/ 52 | .. _Bay Citizen: http://www.baycitizen.org/ 53 | .. _John S. and James L. Knight Foundation: http://www.knightfoundation.org/ 54 | .. _Texas Tribune: http://www.texastribune.org/ 55 | .. _Google Group: http://groups.google.com/group/armstrongcms 56 | 57 | 58 | Credit 59 | ------ 60 | This theme is based on the excellent `Read the Docs`_ theme. The original 61 | can be found in the `readthedocs.org`_ repository on GitHub. 62 | 63 | .. _Read the Docs: http://readthedocs.org/ 64 | .. _readthedocs.org: https://github.com/rtfd/readthedocs.org 65 | 66 | 67 | License 68 | ------- 69 | Like the original RTD code, this code is licensed under a BSD license. See the 70 | associated ``LICENSE`` file for more information. 71 | -------------------------------------------------------------------------------- /docs/source/_theme/armstrong/layout.html: -------------------------------------------------------------------------------- 1 | {% extends "basic/layout.html" %} 2 | 3 | {% set script_files = script_files + [pathto("_static/searchtools.js", 1)] %} 4 | 5 | {% block htmltitle %} 6 | {{ super() }} 7 | 8 | 9 | 10 | {% endblock %} 11 | 12 | {% block footer %} 13 | 31 | 32 | 33 | {% if theme_analytics_code %} 34 | 35 | 46 | {% endif %} 47 | 48 | {% endblock %} 49 | -------------------------------------------------------------------------------- /docs/source/_theme/armstrong/rtd-themes.conf: -------------------------------------------------------------------------------- 1 | [theme] 2 | inherit = default 3 | stylesheet = rtd.css 4 | pygment_style = default 5 | show_sphinx = False 6 | 7 | [options] 8 | show_rtd = True 9 | 10 | white = #ffffff 11 | almost_white = #f8f8f8 12 | barely_white = #f2f2f2 13 | dirty_white = #eeeeee 14 | almost_dirty_white = #e6e6e6 15 | dirtier_white = #dddddd 16 | lighter_gray = #cccccc 17 | gray_a = #aaaaaa 18
| gray_9 = #999999 19 | light_gray = #888888 20 | gray_7 = #777777 21 | gray = #666666 22 | dark_gray = #444444 23 | gray_2 = #222222 24 | black = #111111 25 | light_color = #e8ecef 26 | light_medium_color = #DDEAF0 27 | medium_color = #8ca1af 28 | medium_color_link = #86989b 29 | medium_color_link_hover = #a6b8bb 30 | dark_color = #465158 31 | 32 | h1 = #000000 33 | h2 = #465158 34 | h3 = #6c818f 35 | 36 | link_color = #444444 37 | link_color_decoration = #CCCCCC 38 | 39 | medium_color_hover = #697983 40 | green_highlight = #8ecc4c 41 | 42 | 43 | positive_dark = #609060 44 | positive_medium = #70a070 45 | positive_light = #e9ffe9 46 | 47 | negative_dark = #900000 48 | negative_medium = #b04040 49 | negative_light = #ffe9e9 50 | negative_text = #c60f0f 51 | 52 | ruler = #abc 53 | 54 | viewcode_bg = #f4debf 55 | viewcode_border = #ac9 56 | 57 | highlight = #ffe080 58 | 59 | code_background = #eeeeee 60 | 61 | background = #465158 62 | background_link = #ffffff 63 | background_link_half = #ffffff 64 | background_text = #eeeeee 65 | background_text_link = #86989b 66 | -------------------------------------------------------------------------------- /docs/source/_theme/armstrong/static/rtd.css_t: -------------------------------------------------------------------------------- 1 | /* 2 | * rtd.css 3 | * ~~~~~~~~~~~~~~~ 4 | * 5 | * Sphinx stylesheet -- sphinxdoc theme. Originally created by 6 | * Armin Ronacher for Werkzeug. 7 | * 8 | * Customized for ReadTheDocs by Eric Pierce & Eric Holscher 9 | * 10 | * :copyright: Copyright 2007-2010 by the Sphinx team, see AUTHORS. 11 | * :license: BSD, see LICENSE for details. 
12 | * 13 | */ 14 | 15 | /* RTD colors 16 | * light blue: {{ theme_light_color }} 17 | * medium blue: {{ theme_medium_color }} 18 | * dark blue: {{ theme_dark_color }} 19 | * dark grey: {{ theme_grey_color }} 20 | * 21 | * medium blue hover: {{ theme_medium_color_hover }}; 22 | * green highlight: {{ theme_green_highlight }} 23 | * light blue (project bar): {{ theme_light_color }} 24 | */ 25 | 26 | @import url("basic.css"); 27 | 28 | /* PAGE LAYOUT -------------------------------------------------------------- */ 29 | 30 | body { 31 | font: 100%/1.5 "ff-meta-web-pro-1","ff-meta-web-pro-2",Arial,"Helvetica Neue",sans-serif; 32 | text-align: center; 33 | color: black; 34 | background-color: {{ theme_background }}; 35 | padding: 0; 36 | margin: 0; 37 | } 38 | 39 | div.document { 40 | text-align: left; 41 | background-color: {{ theme_light_color }}; 42 | } 43 | 44 | div.bodywrapper { 45 | background-color: {{ theme_white }}; 46 | border-left: 1px solid {{ theme_lighter_gray }}; 47 | border-bottom: 1px solid {{ theme_lighter_gray }}; 48 | margin: 0 0 0 16em; 49 | } 50 | 51 | div.body { 52 | margin: 0; 53 | padding: 0.5em 1.3em; 54 | max-width: 55em; 55 | min-width: 20em; 56 | } 57 | 58 | div.related { 59 | font-size: 1em; 60 | background-color: {{ theme_background }}; 61 | } 62 | 63 | div.documentwrapper { 64 | float: left; 65 | width: 100%; 66 | background-color: {{ theme_light_color }}; 67 | } 68 | 69 | 70 | /* HEADINGS --------------------------------------------------------------- */ 71 | 72 | h1 { 73 | margin: 0; 74 | padding: 0.7em 0 0.3em 0; 75 | font-size: 1.5em; 76 | line-height: 1.15; 77 | color: {{ theme_h1 }}; 78 | clear: both; 79 | } 80 | 81 | h2 { 82 | margin: 2em 0 0.2em 0; 83 | font-size: 1.35em; 84 | padding: 0; 85 | color: {{ theme_h2 }}; 86 | } 87 | 88 | h3 { 89 | margin: 1em 0 -0.3em 0; 90 | font-size: 1.2em; 91 | color: {{ theme_h3 }}; 92 | } 93 | 94 | div.body h1 a, div.body h2 a, div.body h3 a, div.body h4 a, div.body h5 a, div.body h6 a { 95 | 
color: black; 96 | } 97 | 98 | h1 a.anchor, h2 a.anchor, h3 a.anchor, h4 a.anchor, h5 a.anchor, h6 a.anchor { 99 | display: none; 100 | margin: 0 0 0 0.3em; 101 | padding: 0 0.2em 0 0.2em; 102 | color: {{ theme_gray_a }} !important; 103 | } 104 | 105 | h1:hover a.anchor, h2:hover a.anchor, h3:hover a.anchor, h4:hover a.anchor, 106 | h5:hover a.anchor, h6:hover a.anchor { 107 | display: inline; 108 | } 109 | 110 | h1 a.anchor:hover, h2 a.anchor:hover, h3 a.anchor:hover, h4 a.anchor:hover, 111 | h5 a.anchor:hover, h6 a.anchor:hover { 112 | color: {{ theme_gray_7 }}; 113 | background-color: {{ theme_dirty_white }}; 114 | } 115 | 116 | 117 | /* LINKS ------------------------------------------------------------------ */ 118 | 119 | /* Normal links get a pseudo-underline */ 120 | a { 121 | color: {{ theme_link_color }}; 122 | text-decoration: none; 123 | border-bottom: 1px solid {{ theme_link_color_decoration }}; 124 | } 125 | 126 | /* Links in sidebar, TOC, index trees and tables have no underline */ 127 | .sphinxsidebar a, 128 | .toctree-wrapper a, 129 | .indextable a, 130 | #indices-and-tables a { 131 | color: {{ theme_dark_gray }}; 132 | text-decoration: none; 133 | border-bottom: none; 134 | } 135 | 136 | /* Most links get an underline-effect when hovered */ 137 | a:hover, 138 | div.toctree-wrapper a:hover, 139 | .indextable a:hover, 140 | #indices-and-tables a:hover { 141 | color: {{ theme_black }}; 142 | text-decoration: none; 143 | border-bottom: 1px solid {{ theme_black }}; 144 | } 145 | 146 | /* Footer links */ 147 | div.footer a { 148 | color: {{ theme_background_text_link }}; 149 | text-decoration: none; 150 | border: none; 151 | } 152 | div.footer a:hover { 153 | color: {{ theme_medium_color_link_hover }}; 154 | text-decoration: underline; 155 | border: none; 156 | } 157 | 158 | /* Permalink anchor (subtle grey with a red hover) */ 159 | div.body a.headerlink { 160 | color: {{ theme_lighter_gray }}; 161 | font-size: 1em; 162 | margin-left: 6px; 163 | 
padding: 0 4px 0 4px; 164 | text-decoration: none; 165 | border: none; 166 | } 167 | div.body a.headerlink:hover { 168 | color: {{ theme_negative_text }}; 169 | border: none; 170 | } 171 | 172 | 173 | /* NAVIGATION BAR --------------------------------------------------------- */ 174 | 175 | div.related ul { 176 | height: 2.5em; 177 | } 178 | 179 | div.related ul li { 180 | margin: 0; 181 | padding: 0.65em 0; 182 | float: left; 183 | display: block; 184 | color: {{ theme_background_link_half }}; /* For the >> separators */ 185 | font-size: 0.8em; 186 | } 187 | 188 | div.related ul li.right { 189 | float: right; 190 | margin-right: 5px; 191 | color: transparent; /* Hide the | separators */ 192 | } 193 | 194 | /* "Breadcrumb" links in nav bar */ 195 | div.related ul li a { 196 | order: none; 197 | background-color: inherit; 198 | font-weight: bold; 199 | margin: 6px 0 6px 4px; 200 | line-height: 1.75em; 201 | color: {{ theme_background_link }}; 202 | text-shadow: 0 1px rgba(0, 0, 0, 0.5); 203 | padding: 0.4em 0.8em; 204 | border: none; 205 | border-radius: 3px; 206 | } 207 | /* previous / next / modules / index links look more like buttons */ 208 | div.related ul li.right a { 209 | margin: 0.375em 0; 210 | background-color: {{ theme_medium_color_hover }}; 211 | text-shadow: 0 1px rgba(0, 0, 0, 0.5); 212 | border-radius: 3px; 213 | -webkit-border-radius: 3px; 214 | -moz-border-radius: 3px; 215 | } 216 | /* All navbar links light up as buttons when hovered */ 217 | div.related ul li a:hover { 218 | background-color: {{ theme_medium_color }}; 219 | color: {{ theme_white }}; 220 | text-decoration: none; 221 | border-radius: 3px; 222 | -webkit-border-radius: 3px; 223 | -moz-border-radius: 3px; 224 | } 225 | /* Take extra precautions for tt within links */ 226 | a tt, 227 | div.related ul li a tt { 228 | background: inherit !important; 229 | color: inherit !important; 230 | } 231 | 232 | 233 | /* SIDEBAR ---------------------------------------------------------------- */ 
234 | 235 | div.sphinxsidebarwrapper { 236 | padding: 0; 237 | } 238 | 239 | div.sphinxsidebar { 240 | margin: 0; 241 | margin-left: -100%; 242 | float: left; 243 | top: 3em; 244 | left: 0; 245 | padding: 0 1em; 246 | width: 14em; 247 | font-size: 1em; 248 | text-align: left; 249 | background-color: {{ theme_light_color }}; 250 | } 251 | 252 | div.sphinxsidebar img { 253 | max-width: 12em; 254 | } 255 | 256 | div.sphinxsidebar h3, div.sphinxsidebar h4 { 257 | margin: 1.2em 0 0.3em 0; 258 | font-size: 1em; 259 | padding: 0; 260 | color: {{ theme_gray_2 }}; 261 | font-family: "ff-meta-web-pro-1", "ff-meta-web-pro-2", "Arial", "Helvetica Neue", sans-serif; 262 | } 263 | 264 | div.sphinxsidebar h3 a { 265 | color: {{ theme_grey_color }}; 266 | } 267 | 268 | div.sphinxsidebar ul, 269 | div.sphinxsidebar p { 270 | margin-top: 0; 271 | padding-left: 0; 272 | line-height: 130%; 273 | background-color: {{ theme_light_color }}; 274 | } 275 | 276 | /* No bullets for nested lists, but a little extra indentation */ 277 | div.sphinxsidebar ul ul { 278 | list-style-type: none; 279 | margin-left: 1.5em; 280 | padding: 0; 281 | } 282 | 283 | /* A little top/bottom padding to prevent adjacent links' borders 284 | * from overlapping each other */ 285 | div.sphinxsidebar ul li { 286 | padding: 1px 0; 287 | } 288 | 289 | /* A little left-padding to make these align with the ULs */ 290 | div.sphinxsidebar p.topless { 291 | padding: 0 0 0 1em; 292 | } 293 | 294 | /* Make these into hidden one-liners */ 295 | div.sphinxsidebar ul li, 296 | div.sphinxsidebar p.topless { 297 | white-space: nowrap; 298 | overflow: hidden; 299 | } 300 | /* ...which become visible when hovered */ 301 | div.sphinxsidebar ul li:hover, 302 | div.sphinxsidebar p.topless:hover { 303 | overflow: visible; 304 | } 305 | 306 | /* Search text box and "Go" button */ 307 | #searchbox { 308 | margin-top: 2em; 309 | margin-bottom: 1em; 310 | background: {{ theme_dirtier_white }}; 311 | padding: 0.5em; 312 | 
border-radius: 6px; 313 | -moz-border-radius: 6px; 314 | -webkit-border-radius: 6px; 315 | } 316 | #searchbox h3 { 317 | margin-top: 0; 318 | } 319 | 320 | /* Make search box and button abut and have a border */ 321 | input, 322 | div.sphinxsidebar input { 323 | border: 1px solid {{ theme_gray_9 }}; 324 | float: left; 325 | } 326 | 327 | /* Search textbox */ 328 | input[type="text"] { 329 | margin: 0; 330 | padding: 0 3px; 331 | height: 20px; 332 | width: 144px; 333 | border-top-left-radius: 3px; 334 | border-bottom-left-radius: 3px; 335 | -moz-border-radius-topleft: 3px; 336 | -moz-border-radius-bottomleft: 3px; 337 | -webkit-border-top-left-radius: 3px; 338 | -webkit-border-bottom-left-radius: 3px; 339 | } 340 | /* Search button */ 341 | input[type="submit"] { 342 | margin: 0 0 0 -1px; /* -1px prevents a double-border with textbox */ 343 | height: 22px; 344 | color: {{ theme_dark_gray }}; 345 | background-color: {{ theme_light_color }}; 346 | padding: 1px 4px; 347 | font-weight: bold; 348 | border-top-right-radius: 3px; 349 | border-bottom-right-radius: 3px; 350 | -moz-border-radius-topright: 3px; 351 | -moz-border-radius-bottomright: 3px; 352 | -webkit-border-top-right-radius: 3px; 353 | -webkit-border-bottom-right-radius: 3px; 354 | } 355 | input[type="submit"]:hover { 356 | color: {{ theme_white }}; 357 | background-color: {{ theme_green_highlight }}; 358 | } 359 | 360 | div.sphinxsidebar p.searchtip { 361 | clear: both; 362 | padding: 0.5em 0 0 0; 363 | background: {{ theme_dirtier_white }}; 364 | color: {{ theme_gray }}; 365 | font-size: 0.9em; 366 | } 367 | 368 | /* Sidebar links are unusual */ 369 | div.sphinxsidebar li a, 370 | div.sphinxsidebar p a { 371 | background: {{ theme_light_color }}; /* In case links overlap main content */ 372 | border-radius: 3px; 373 | -moz-border-radius: 3px; 374 | -webkit-border-radius: 3px; 375 | border: 1px solid transparent; /* To prevent things jumping around on hover */ 376 | padding: 0 5px 0 5px; 377 | } 378 | 
div.sphinxsidebar li a:hover, 379 | div.sphinxsidebar p a:hover { 380 | color: {{ theme_black }}; 381 | text-decoration: none; 382 | border: 1px solid {{ theme_light_gray }}; 383 | } 384 | 385 | /* Tweak any link appearing in a heading */ 386 | div.sphinxsidebar h3 a { 387 | } 388 | 389 | 390 | 391 | 392 | /* OTHER STUFF ------------------------------------------------------------ */ 393 | 394 | cite, code, tt { 395 | font-family: 'Consolas', 'Deja Vu Sans Mono', 396 | 'Bitstream Vera Sans Mono', monospace; 397 | font-size: 0.95em; 398 | letter-spacing: 0.01em; 399 | } 400 | 401 | tt { 402 | background-color: {{ theme_code_background }}; 403 | color: {{ theme_dark_gray }}; 404 | } 405 | 406 | tt.descname, tt.descclassname, tt.xref { 407 | border: 0; 408 | } 409 | 410 | hr { 411 | border: 1px solid {{ theme_ruler }}; 412 | margin: 2em; 413 | } 414 | 415 | pre, #_fontwidthtest { 416 | font-family: 'Consolas', 'Deja Vu Sans Mono', 417 | 'Bitstream Vera Sans Mono', monospace; 418 | margin: 1em 2em; 419 | font-size: 0.95em; 420 | letter-spacing: 0.015em; 421 | line-height: 120%; 422 | padding: 0.5em; 423 | border: 1px solid {{ theme_lighter_gray }}; 424 | background-color: {{ theme_code_background }}; 425 | border-radius: 6px; 426 | -moz-border-radius: 6px; 427 | -webkit-border-radius: 6px; 428 | } 429 | 430 | pre a { 431 | color: inherit; 432 | text-decoration: underline; 433 | } 434 | 435 | td.linenos pre { 436 | padding: 0.5em 0; 437 | } 438 | 439 | div.quotebar { 440 | background-color: {{ theme_almost_white }}; 441 | max-width: 250px; 442 | float: right; 443 | padding: 2px 7px; 444 | border: 1px solid {{ theme_lighter_gray }}; 445 | } 446 | 447 | div.topic { 448 | background-color: {{ theme_almost_white }}; 449 | } 450 | 451 | table { 452 | border-collapse: collapse; 453 | margin: 0 -0.5em 0 -0.5em; 454 | } 455 | 456 | table td, table th { 457 | padding: 0.2em 0.5em 0.2em 0.5em; 458 | } 459 | 460 | 461 | /* ADMONITIONS AND WARNINGS 
------------------------------------------------- */ 462 | 463 | /* Shared by admonitions, warnings and sidebars */ 464 | div.admonition, 465 | div.warning, 466 | div.sidebar { 467 | font-size: 0.9em; 468 | margin: 2em; 469 | padding: 0; 470 | /* 471 | border-radius: 6px; 472 | -moz-border-radius: 6px; 473 | -webkit-border-radius: 6px; 474 | */ 475 | } 476 | div.admonition p, 477 | div.warning p, 478 | div.sidebar p { 479 | margin: 0.5em 1em 0.5em 1em; 480 | padding: 0; 481 | } 482 | div.admonition pre, 483 | div.warning pre, 484 | div.sidebar pre { 485 | margin: 0.4em 1em 0.4em 1em; 486 | } 487 | div.admonition p.admonition-title, 488 | div.warning p.admonition-title, 489 | div.sidebar p.sidebar-title { 490 | margin: 0; 491 | padding: 0.1em 0 0.1em 0.5em; 492 | color: white; 493 | font-weight: bold; 494 | font-size: 1.1em; 495 | text-shadow: 0 1px rgba(0, 0, 0, 0.5); 496 | } 497 | div.admonition ul, div.admonition ol, 498 | div.warning ul, div.warning ol, 499 | div.sidebar ul, div.sidebar ol { 500 | margin: 0.1em 0.5em 0.5em 3em; 501 | padding: 0; 502 | } 503 | 504 | 505 | /* Admonitions and sidebars only */ 506 | div.admonition, div.sidebar { 507 | border: 1px solid {{ theme_positive_dark }}; 508 | background-color: {{ theme_positive_light }}; 509 | } 510 | div.admonition p.admonition-title, 511 | div.sidebar p.sidebar-title { 512 | background-color: {{ theme_positive_medium }}; 513 | border-bottom: 1px solid {{ theme_positive_dark }}; 514 | } 515 | 516 | 517 | /* Warnings only */ 518 | div.warning { 519 | border: 1px solid {{ theme_negative_dark }}; 520 | background-color: {{ theme_negative_light }}; 521 | } 522 | div.warning p.admonition-title { 523 | background-color: {{ theme_negative_medium }}; 524 | border-bottom: 1px solid {{ theme_negative_dark }}; 525 | } 526 | 527 | 528 | /* Sidebars only */ 529 | div.sidebar { 530 | max-width: 200px; 531 | } 532 | 533 | 534 | 535 | div.versioninfo { 536 | margin: 1em 0 0 0; 537 | border: 1px solid {{ theme_lighter_gray 
}}; 538 | background-color: {{ theme_light_medium_color }}; 539 | padding: 8px; 540 | line-height: 1.3em; 541 | font-size: 0.9em; 542 | } 543 | 544 | .viewcode-back { 545 | font-family: 'Lucida Grande', 'Lucida Sans Unicode', 'Geneva', 546 | 'Verdana', sans-serif; 547 | } 548 | 549 | div.viewcode-block:target { 550 | background-color: {{ theme_viewcode_bg }}; 551 | border-top: 1px solid {{ theme_viewcode_border }}; 552 | border-bottom: 1px solid {{ theme_viewcode_border }}; 553 | } 554 | 555 | dl { 556 | margin: 1em 0 2.5em 0; 557 | } 558 | 559 | /* Highlight target when you click an internal link */ 560 | dt:target { 561 | background: {{ theme_highlight }}; 562 | } 563 | /* Don't highlight whole divs */ 564 | div.highlight { 565 | background: transparent; 566 | } 567 | /* But do highlight spans (so search results can be highlighted) */ 568 | span.highlight { 569 | background: {{ theme_highlight }}; 570 | } 571 | 572 | div.footer { 573 | background-color: {{ theme_background }}; 574 | color: {{ theme_background_text }}; 575 | padding: 0 2em 2em 2em; 576 | clear: both; 577 | font-size: 0.8em; 578 | text-align: center; 579 | } 580 | 581 | p { 582 | margin: 0.8em 0 0.5em 0; 583 | } 584 | 585 | .section p img { 586 | margin: 1em 2em; 587 | } 588 | 589 | 590 | /* MOBILE LAYOUT -------------------------------------------------------------- */ 591 | 592 | @media screen and (max-width: 600px) { 593 | 594 | h1, h2, h3, h4, h5 { 595 | position: relative; 596 | } 597 | 598 | ul { 599 | padding-left: 1.75em; 600 | } 601 | 602 | div.bodywrapper a.headerlink, #indices-and-tables h1 a { 603 | color: {{ theme_almost_dirty_white }}; 604 | font-size: 80%; 605 | float: right; 606 | line-height: 1.8; 607 | position: absolute; 608 | right: -0.7em; 609 | visibility: inherit; 610 | } 611 | 612 | div.bodywrapper h1 a.headerlink, #indices-and-tables h1 a { 613 | line-height: 1.5; 614 | } 615 | 616 | pre { 617 | font-size: 0.7em; 618 | overflow: auto; 619 | word-wrap: break-word; 620 | 
white-space: pre-wrap; 621 | } 622 | 623 | div.related ul { 624 | height: 2.5em; 625 | padding: 0; 626 | text-align: left; 627 | } 628 | 629 | div.related ul li { 630 | clear: both; 631 | color: {{ theme_dark_color }}; 632 | padding: 0.2em 0; 633 | } 634 | 635 | div.related ul li:last-child { 636 | border-bottom: 1px dotted {{ theme_medium_color }}; 637 | padding-bottom: 0.4em; 638 | margin-bottom: 1em; 639 | width: 100%; 640 | } 641 | 642 | div.related ul li a { 643 | color: {{ theme_dark_color }}; 644 | padding-right: 0; 645 | } 646 | 647 | div.related ul li a:hover { 648 | background: inherit; 649 | color: inherit; 650 | } 651 | 652 | div.related ul li.right { 653 | clear: none; 654 | padding: 0.65em 0; 655 | margin-bottom: 0.5em; 656 | } 657 | 658 | div.related ul li.right a { 659 | color: {{ theme_white }}; 660 | padding-right: 0.8em; 661 | } 662 | 663 | div.related ul li.right a:hover { 664 | background-color: {{ theme_medium_color }}; 665 | } 666 | 667 | div.body { 668 | clear: both; 669 | min-width: 0; 670 | word-wrap: break-word; 671 | } 672 | 673 | div.bodywrapper { 674 | margin: 0 0 0 0; 675 | } 676 | 677 | div.sphinxsidebar { 678 | float: none; 679 | margin: 0; 680 | width: auto; 681 | } 682 | 683 | div.sphinxsidebar input[type="text"] { 684 | height: 2em; 685 | line-height: 2em; 686 | width: 70%; 687 | } 688 | 689 | div.sphinxsidebar input[type="submit"] { 690 | height: 2em; 691 | margin-left: 0.5em; 692 | width: 20%; 693 | } 694 | 695 | div.sphinxsidebar p.searchtip { 696 | background: inherit; 697 | margin-bottom: 1em; 698 | } 699 | 700 | div.sphinxsidebar ul li, div.sphinxsidebar p.topless { 701 | white-space: normal; 702 | } 703 | 704 | .bodywrapper img { 705 | display: block; 706 | margin-left: auto; 707 | margin-right: auto; 708 | max-width: 100%; 709 | } 710 | 711 | div.documentwrapper { 712 | float: none; 713 | } 714 | 715 | div.admonition, div.warning, pre, blockquote { 716 | margin-left: 0em; 717 | margin-right: 0em; 718 | } 719 | 720 | .body 
p img { 721 | margin: 0; 722 | } 723 | 724 | #searchbox { 725 | background: transparent; 726 | } 727 | 728 | .related:not(:first-child) li { 729 | display: none; 730 | } 731 | 732 | .related:not(:first-child) li.right { 733 | display: block; 734 | } 735 | 736 | div.footer { 737 | padding: 1em; 738 | } 739 | 740 | .rtd_doc_footer .badge { 741 | float: none; 742 | margin: 1em auto; 743 | position: static; 744 | } 745 | 746 | .rtd_doc_footer .badge.revsys-inline { 747 | margin-right: auto; 748 | margin-bottom: 2em; 749 | } 750 | 751 | table.indextable { 752 | display: block; 753 | width: auto; 754 | } 755 | 756 | .indextable tr { 757 | display: block; 758 | } 759 | 760 | .indextable td { 761 | display: block; 762 | padding: 0; 763 | width: auto !important; 764 | } 765 | 766 | .indextable td dt { 767 | margin: 1em 0; 768 | } 769 | 770 | ul.search { 771 | margin-left: 0.25em; 772 | } 773 | 774 | ul.search li div.context { 775 | font-size: 90%; 776 | line-height: 1.1; 777 | margin-bottom: 1; 778 | margin-left: 0; 779 | } 780 | 781 | } 782 | -------------------------------------------------------------------------------- /docs/source/_theme/armstrong/theme.conf: -------------------------------------------------------------------------------- 1 | [theme] 2 | inherit = default 3 | stylesheet = rtd.css 4 | pygment_style = default 5 | show_sphinx = False 6 | 7 | [options] 8 | show_rtd = True 9 | 10 | white = #ffffff 11 | almost_white = #f8f8f8 12 | barely_white = #f2f2f2 13 | dirty_white = #eeeeee 14 | almost_dirty_white = #e6e6e6 15 | dirtier_white = #DAC6AF 16 | lighter_gray = #cccccc 17 | gray_a = #aaaaaa 18 | gray_9 = #999999 19 | light_gray = #888888 20 | gray_7 = #777777 21 | gray = #666666 22 | dark_gray = #444444 23 | gray_2 = #222222 24 | black = #111111 25 | light_color = #EDE4D8 26 | light_medium_color = #DDEAF0 27 | medium_color = #8ca1af 28 | medium_color_link = #634320 29 | medium_color_link_hover = #261a0c 30 | dark_color = rgba(160, 109, 52, 1.0) 31 | 32 | h1 
= #1f3744 33 | h2 = #335C72 34 | h3 = #638fa6 35 | 36 | link_color = #335C72 37 | link_color_decoration = #99AEB9 38 | 39 | medium_color_hover = rgba(255, 255, 255, 0.25) 40 | medium_color = rgba(255, 255, 255, 0.5) 41 | green_highlight = #8ecc4c 42 | 43 | 44 | positive_dark = rgba(51, 77, 0, 1.0) 45 | positive_medium = rgba(102, 153, 0, 1.0) 46 | positive_light = rgba(102, 153, 0, 0.1) 47 | 48 | negative_dark = rgba(51, 13, 0, 1.0) 49 | negative_medium = rgba(204, 51, 0, 1.0) 50 | negative_light = rgba(204, 51, 0, 0.1) 51 | negative_text = #c60f0f 52 | 53 | ruler = #abc 54 | 55 | viewcode_bg = #f4debf 56 | viewcode_border = #ac9 57 | 58 | highlight = #ffe080 59 | 60 | code_background = rgba(0, 0, 0, 0.075) 61 | 62 | background = rgba(135, 57, 34, 1.0) 63 | background_link = rgba(212, 195, 172, 1.0) 64 | background_link_half = rgba(212, 195, 172, 0.5) 65 | background_text = rgba(212, 195, 172, 1.0) 66 | background_text_link = rgba(171, 138, 93, 1.0) 67 | -------------------------------------------------------------------------------- /docs/source/conf.py: -------------------------------------------------------------------------------- 1 | # -*- coding: utf-8 -*- 2 | # 3 | # pdftables documentation build configuration file, created by 4 | # sphinx-quickstart on Thu Sep 12 16:20:18 2013. 5 | # 6 | # This file is execfile()d with the current directory set to its containing dir. 7 | # 8 | # Note that not all possible configuration values are present in this 9 | # autogenerated file. 10 | # 11 | # All configuration values have a default; values that are commented out 12 | # serve to show the default. 13 | 14 | import sys, os 15 | 16 | # If extensions (or modules to document with autodoc) are in another directory, 17 | # add these directories to sys.path here. If the directory is relative to the 18 | # documentation root, use os.path.abspath to make it absolute, like shown here. 
19 | 20 | # The order here is sensitive: they have to be this way round in order to make 21 | # `import pdftables` find the right file. 22 | sys.path = [ os.path.abspath('../../pdftables'), os.path.abspath('../../') ] + sys.path 23 | 24 | import setup as pdftables_setup 25 | 26 | # -- General configuration ----------------------------------------------------- 27 | 28 | # If your documentation needs a minimal Sphinx version, state it here. 29 | #needs_sphinx = '1.0' 30 | 31 | # Add any Sphinx extension module names here, as strings. They can be extensions 32 | # coming with Sphinx (named 'sphinx.ext.*') or your custom ones. 33 | extensions = ['sphinx.ext.autodoc', 'sphinx.ext.doctest', 'sphinx.ext.intersphinx', 'sphinx.ext.todo', 'sphinx.ext.coverage', 'sphinx.ext.viewcode'] 34 | 35 | # Add any paths that contain templates here, relative to this directory. 36 | templates_path = ['_templates'] 37 | 38 | # The suffix of source filenames. 39 | source_suffix = '.rst' 40 | 41 | # The encoding of source files. 42 | #source_encoding = 'utf-8-sig' 43 | 44 | # The master toctree document. 45 | master_doc = 'index' 46 | 47 | # General information about the project. 48 | project = u'pdftables' 49 | copyright = u'2013, ScraperWiki' 50 | 51 | # The version info for the project you're documenting, acts as replacement for 52 | # |version| and |release|, also used in various other places throughout the 53 | # built documents. 54 | # 55 | # The short X.Y version. 56 | version = pdftables_setup.conf['version'] 57 | # The full version, including alpha/beta/rc tags. 58 | release = pdftables_setup.conf['version'] 59 | 60 | # The language for content autogenerated by Sphinx. Refer to documentation 61 | # for a list of supported languages. 62 | #language = None 63 | 64 | # There are two options for replacing |today|: either, you set today to some 65 | # non-false value, then it is used: 66 | #today = '' 67 | # Else, today_fmt is used as the format for a strftime call. 
68 | #today_fmt = '%B %d, %Y' 69 | 70 | # List of patterns, relative to source directory, that match files and 71 | # directories to ignore when looking for source files. 72 | exclude_patterns = [] 73 | 74 | # The reST default role (used for this markup: `text`) to use for all documents. 75 | #default_role = None 76 | 77 | # If true, '()' will be appended to :func: etc. cross-reference text. 78 | #add_function_parentheses = True 79 | 80 | # If true, the current module name will be prepended to all description 81 | # unit titles (such as .. function::). 82 | #add_module_names = True 83 | 84 | # If true, sectionauthor and moduleauthor directives will be shown in the 85 | # output. They are ignored by default. 86 | #show_authors = False 87 | 88 | # The name of the Pygments (syntax highlighting) style to use. 89 | pygments_style = 'sphinx' 90 | 91 | # A list of ignored prefixes for module index sorting. 92 | #modindex_common_prefix = [] 93 | 94 | 95 | # -- Options for HTML output --------------------------------------------------- 96 | 97 | # The theme to use for HTML and HTML Help pages. See the documentation for 98 | # a list of builtin themes. 99 | html_theme = 'armstrong' 100 | 101 | # Theme options are theme-specific and customize the look and feel of a theme 102 | # further. For a list of options available for each theme, see the 103 | # documentation. 104 | #html_theme_options = {} 105 | 106 | # Add any paths that contain custom themes here, relative to this directory. 107 | html_theme_path = ['_theme'] 108 | 109 | # The name for this set of Sphinx documents. If None, it defaults to 110 | # "<project> v<release> documentation". 111 | #html_title = None 112 | 113 | # A shorter title for the navigation bar. Default is the same as html_title. 114 | #html_short_title = None 115 | 116 | # The name of an image file (relative to this directory) to place at the top 117 | # of the sidebar. 
118 | #html_logo = None 119 | 120 | # The name of an image file (within the static path) to use as favicon of the 121 | # docs. This file should be a Windows icon file (.ico) being 16x16 or 32x32 122 | # pixels large. 123 | #html_favicon = None 124 | 125 | # Add any paths that contain custom static files (such as style sheets) here, 126 | # relative to this directory. They are copied after the builtin static files, 127 | # so a file named "default.css" will overwrite the builtin "default.css". 128 | html_static_path = ['_static'] 129 | 130 | # If not '', a 'Last updated on:' timestamp is inserted at every page bottom, 131 | # using the given strftime format. 132 | #html_last_updated_fmt = '%b %d, %Y' 133 | 134 | # If true, SmartyPants will be used to convert quotes and dashes to 135 | # typographically correct entities. 136 | #html_use_smartypants = True 137 | 138 | # Custom sidebar templates, maps document names to template names. 139 | #html_sidebars = {} 140 | 141 | # Additional templates that should be rendered to pages, maps page names to 142 | # template names. 143 | #html_additional_pages = {} 144 | 145 | # If false, no module index is generated. 146 | #html_domain_indices = True 147 | 148 | # If false, no index is generated. 149 | #html_use_index = True 150 | 151 | # If true, the index is split into individual pages for each letter. 152 | #html_split_index = False 153 | 154 | # If true, links to the reST sources are added to the pages. 155 | #html_show_sourcelink = True 156 | 157 | # If true, "Created using Sphinx" is shown in the HTML footer. Default is True. 158 | #html_show_sphinx = True 159 | 160 | # If true, "(C) Copyright ..." is shown in the HTML footer. Default is True. 161 | #html_show_copyright = True 162 | 163 | # If true, an OpenSearch description file will be output, and all pages will 164 | # contain a <link> tag referring to it. The value of this option must be the 165 | # base URL from which the finished HTML is served. 
166 | #html_use_opensearch = '' 167 | 168 | # This is the file name suffix for HTML files (e.g. ".xhtml"). 169 | #html_file_suffix = None 170 | 171 | # Output file base name for HTML help builder. 172 | htmlhelp_basename = 'pdftablesdoc' 173 | 174 | 175 | # -- Options for LaTeX output -------------------------------------------------- 176 | 177 | latex_elements = { 178 | # The paper size ('letterpaper' or 'a4paper'). 179 | #'papersize': 'letterpaper', 180 | 181 | # The font size ('10pt', '11pt' or '12pt'). 182 | #'pointsize': '10pt', 183 | 184 | # Additional stuff for the LaTeX preamble. 185 | #'preamble': '', 186 | } 187 | 188 | # Grouping the document tree into LaTeX files. List of tuples 189 | # (source start file, target name, title, author, documentclass [howto/manual]). 190 | latex_documents = [ 191 | ('index', 'pdftables.tex', u'pdftables Documentation', 192 | u'ScraperWiki', 'manual'), 193 | ] 194 | 195 | # The name of an image file (relative to this directory) to place at the top of 196 | # the title page. 197 | #latex_logo = None 198 | 199 | # For "manual" documents, if this is true, then toplevel headings are parts, 200 | # not chapters. 201 | #latex_use_parts = False 202 | 203 | # If true, show page references after internal links. 204 | #latex_show_pagerefs = False 205 | 206 | # If true, show URL addresses after external links. 207 | #latex_show_urls = False 208 | 209 | # Documents to append as an appendix to all manuals. 210 | #latex_appendices = [] 211 | 212 | # If false, no module index is generated. 213 | #latex_domain_indices = True 214 | 215 | 216 | # -- Options for manual page output -------------------------------------------- 217 | 218 | # One entry per manual page. List of tuples 219 | # (source start file, name, description, authors, manual section). 220 | man_pages = [ 221 | ('index', 'pdftables', u'pdftables Documentation', 222 | [u'ScraperWiki'], 1) 223 | ] 224 | 225 | # If true, show URL addresses after external links. 
226 | #man_show_urls = False 227 | 228 | 229 | # -- Options for Texinfo output ------------------------------------------------ 230 | 231 | # Grouping the document tree into Texinfo files. List of tuples 232 | # (source start file, target name, title, author, 233 | # dir menu entry, description, category) 234 | texinfo_documents = [ 235 | ('index', 'pdftables', u'pdftables Documentation', 236 | u'ScraperWiki', 'pdftables', 'One line description of project.', 237 | 'Miscellaneous'), 238 | ] 239 | 240 | # Documents to append as an appendix to all manuals. 241 | #texinfo_appendices = [] 242 | 243 | # If false, no module index is generated. 244 | #texinfo_domain_indices = True 245 | 246 | # How to display URL addresses: 'footnote', 'no', or 'inline'. 247 | #texinfo_show_urls = 'footnote' 248 | 249 | 250 | # Example configuration for intersphinx: refer to the Python standard library. 251 | intersphinx_mapping = {'http://docs.python.org/': None} 252 | -------------------------------------------------------------------------------- /docs/source/index.rst: -------------------------------------------------------------------------------- 1 | .. pdftables documentation master file, created by 2 | sphinx-quickstart on Thu Sep 12 16:20:18 2013. 3 | You can adapt this file completely to your liking, but it should at least 4 | contain the root `toctree` directive. 5 | 6 | pdftables 7 | ========= 8 | 9 | .. include:: ../../README.rst 10 | :start-line: 13 11 | .. :end-line: 67 12 | 13 | 14 | .. Contents: 15 | ========= 16 | 17 | .. .. toctree:: 18 | :numbered: 19 | :maxdepth: 2 20 | 21 | API reference 22 | ============= 23 | 24 | .. 
automodule:: pdftables 25 | :members: 26 | 27 | 28 | Indices and tables 29 | ================== 30 | 31 | * :ref:`genindex` 32 | * :ref:`modindex` 33 | * :ref:`search` 34 | 35 | -------------------------------------------------------------------------------- /download_test_data.sh: -------------------------------------------------------------------------------- 1 | #!/bin/sh 2 | 3 | git clone https://bitbucket.org/scraperwikids/pdftables-test-data fixtures/ 4 | 5 | # get EU ground truth dataset 6 | wget http://www.tamirhassan.com/files/eu-dataset-20130324.zip -O fixtures/eu.zip 7 | unzip fixtures/eu.zip -d fixtures 8 | rm fixtures/eu.zip 9 | 10 | -------------------------------------------------------------------------------- /pdftables/__init__.py: -------------------------------------------------------------------------------- 1 | from pdftables import * 2 | from .config_parameters import ConfigParameters 3 | -------------------------------------------------------------------------------- /pdftables/boxes.py: -------------------------------------------------------------------------------- 1 | """ 2 | Describe box-like data (such as glyphs and rects) in a PDF and helper functions 3 | """ 4 | # ScraperWiki Limited 5 | # Ian Hopkinson, 2013-06-19 6 | # -*- coding: utf-8 -*- 7 | 8 | from __future__ import unicode_literals 9 | 10 | from collections import namedtuple 11 | from counter import Counter 12 | 13 | from .line_segments import LineSegment 14 | 15 | 16 | def _rounder(val, tol): 17 | """ 18 | Utility function to round numbers to arbitrary tolerance 19 | """ 20 | return round((1.0 * val) / tol) * tol 21 | 22 | 23 | class Histogram(Counter): 24 | 25 | def rounder(self, tol): 26 | c = Histogram() 27 | for item in self: 28 | c = c + Histogram({_rounder(item, tol): self[item]}) 29 | return c 30 | 31 | 32 | class Rectangle(namedtuple("Rectangle", "x1 y1 x2 y2")): 33 | 34 | def __repr__(self): 35 | return ( 36 | "Rectangle(x1={0:6.02f} y1={1:6.02f} x2={2:6.02f} 
y2={3:6.02f})" 37 | .format(self.x1, self.y1, self.x2, self.y2)) 38 | 39 | 40 | class Box(object): 41 | 42 | def __init__(self, rect, text=None, barycenter=None, barycenter_y=None): 43 | 44 | if not isinstance(rect, Rectangle): 45 | raise RuntimeError("Box(x) expects isinstance(x, Rectangle)") 46 | 47 | self.rect = rect 48 | self.text = text 49 | self.barycenter = barycenter 50 | self.barycenter_y = barycenter_y 51 | 52 | def __repr__(self): 53 | if self is Box.empty_box: 54 | return "<Box empty>" 55 | return "<Box {0} {1!r}>".format(self.rect, self.text) 56 | 57 | @classmethod 58 | def copy(cls, o): 59 | return cls( 60 | rect=o.rect, 61 | text=o.text, 62 | barycenter=o.barycenter, 63 | barycenter_y=o.barycenter_y, 64 | ) 65 | 66 | def is_connected_to(self, next): 67 | if self.text.strip() == "" or next.text.strip() == "": 68 | # Whitespace can't be connected into a word. 69 | return False 70 | 71 | def equal(left, right): 72 | # Distance in pixels 73 | TOLERANCE = 0.5 74 | # The almond board paradox 75 | if self.text.endswith("("): 76 | TOLERANCE = 10 77 | return abs(left - right) < TOLERANCE 78 | 79 | shared_barycenter = self.barycenter_y == next.barycenter_y 80 | shared_boundary = equal(self.right, next.left) 81 | 82 | return shared_barycenter and shared_boundary 83 | 84 | def extend(self, next): 85 | self.text += next.text 86 | self.rect = self.rect._replace(x2=next.right) 87 | 88 | def clip(self, *rectangles): 89 | """ 90 | Return the rectangle representing the subset of this Box and all of 91 | rectangles. If there is no rectangle left, ``Box.empty_box`` is 92 | returned which always clips to the empty box. 
93 | """ 94 | 95 | x1, y1, x2, y2 = self.rect 96 | for rectangle in rectangles: 97 | x1 = max(x1, rectangle.left) 98 | x2 = min(x2, rectangle.right) 99 | y1 = max(y1, rectangle.top) 100 | y2 = min(y2, rectangle.bottom) 101 | 102 | if x1 > x2 or y1 > y2: 103 | # There is no rect left, so return the "empty set" 104 | return Box.empty_box 105 | 106 | return type(self)(Rectangle(x1=x1, y1=y1, x2=x2, y2=y2)) 107 | 108 | @property 109 | def left(self): 110 | return self.rect[0] 111 | 112 | @property 113 | def top(self): 114 | return self.rect[1] 115 | 116 | @property 117 | def right(self): 118 | return self.rect[2] 119 | 120 | @property 121 | def bottom(self): 122 | return self.rect[3] 123 | 124 | @property 125 | def center_x(self): 126 | return (self.left + self.right) / 2. 127 | 128 | @property 129 | def center_y(self): 130 | return (self.bottom + self.top) / 2. 131 | 132 | @property 133 | def width(self): 134 | return self.right - self.left 135 | 136 | @property 137 | def height(self): 138 | return self.bottom - self.top 139 | 140 | """ 141 | The empty box. This is necessary because we get one 142 | when we clip two boxes that do not overlap (and 143 | possibly in other situations). 144 | 145 | By convention it has left at +Inf, right at -Inf, top 146 | at +Inf, bottom at -Inf. 147 | 148 | It is defined this way so that it is invariant under clipping. 149 | """ 150 | Box.empty_box = Box(Rectangle(x1=float("+inf"), y1=float("+inf"), 151 | x2=float("-inf"), y2=float("-inf"))) 152 | 153 | 154 | class BoxList(list): 155 | 156 | def line_segments(self): 157 | """ 158 | Return line (start, end) corresponding to horizontal and vertical 159 | box edges 160 | """ 161 | horizontal = [LineSegment(b.left, b.right, b) 162 | for b in self] 163 | vertical = [LineSegment(b.top, b.bottom, b) 164 | for b in self] 165 | 166 | return horizontal, vertical 167 | 168 | def inside(self, rect): 169 | """ 170 | Return a fresh instance that is the subset that is (strictly) 171 | inside `rect`. 
172 | """ 173 | 174 | def is_in_rect(box): 175 | return (rect.left <= box.left <= box.right <= rect.right and 176 | rect.top <= box.top <= box.bottom <= rect.bottom) 177 | 178 | return type(self)(box for box in self if is_in_rect(box)) 179 | 180 | def bounds(self): 181 | """Return the (strictest) bounding box of all elements.""" 182 | return Box(Rectangle( 183 | x1=min(box.left for box in self), 184 | y1=min(box.top for box in self), 185 | x2=max(box.right for box in self), 186 | y2=max(box.bottom for box in self), 187 | )) 188 | 189 | def __repr__(self): 190 | return "BoxList(len={0})".format(len(self)) 191 | 192 | def purge_empty_text(self): 193 | # TODO: BUG: we remove characters without adjusting the width / coords 194 | # which is kind of invalid. 195 | 196 | return BoxList(box for box in self if box.text.strip() 197 | or box.classname != 'LTTextLineHorizontal') 198 | 199 | def filterByType(self, flt=None): 200 | if not flt: 201 | return self 202 | return BoxList(box for box in self if box.classname in flt) 203 | 204 | def histogram(self, dir_fun): 205 | # index 0 = left, 1 = top, 2 = right, 3 = bottom 206 | for item in self: 207 | assert type(item) == Box, item 208 | return Histogram(dir_fun(box) for box in self) 209 | 210 | def count(self): 211 | return Counter(x.classname for x in self) 212 | -------------------------------------------------------------------------------- /pdftables/config_parameters.py: -------------------------------------------------------------------------------- 1 | #!/usr/bin/env python 2 | 3 | import warnings 4 | 5 | 6 | class ConfigParameters(object): 7 | """ 8 | Controls how tables are detected, extracted etc. 9 | Be Careful! If you add a new parameter: 10 | 11 | 1) The default value should be equivalent to the previous behaviour 12 | 2) You're committing to retaining its default value forever(ish)! People 13 | will write code which relies on the default value today, so changing 14 | that will give them unexpected behaviour. 
15 | """ 16 | 17 | def __init__( 18 | self, 19 | 20 | table_top_hint=None, 21 | table_bottom_hint=None, 22 | 23 | n_glyph_column_threshold=3, 24 | n_glyph_row_threshold=5 25 | ): 26 | 27 | self.table_top_hint = table_top_hint 28 | self.table_bottom_hint = table_bottom_hint 29 | 30 | self.n_glyph_column_threshold = n_glyph_column_threshold 31 | self.n_glyph_row_threshold = n_glyph_row_threshold 32 | 33 | 34 | -------------------------------------------------------------------------------- /pdftables/counter.py: -------------------------------------------------------------------------------- 1 | """ 2 | Implement collections.Counter for the benefit of Python 2.6 3 | """ 4 | 5 | 6 | from operator import itemgetter 7 | from heapq import nlargest 8 | from itertools import repeat, ifilter 9 | 10 | 11 | class Counter(dict): 12 | 13 | ''' 14 | Dict subclass for counting hashable objects. Sometimes called a bag 15 | or multiset. Elements are stored as dictionary keys and their counts 16 | are stored as dictionary values. 17 | 18 | >>> Counter('zyzygy') 19 | Counter({'y': 3, 'z': 2, 'g': 1}) 20 | ''' 21 | 22 | def __init__(self, iterable=None, **kwds): 23 | '''Create a new, empty Counter object. And if given, count elements 24 | from an input iterable. Or, initialize the count from another mapping 25 | of elements to their counts. 26 | 27 | >>> c = Counter() # a new, empty counter 28 | >>> c = Counter('gallahad') # a new counter from an iterable 29 | >>> c = Counter({'a': 4, 'b': 2}) # a new counter from a mapping 30 | >>> c = Counter(a=4, b=2) # a new counter from keyword args 31 | 32 | ''' 33 | self.update(iterable, **kwds) 34 | 35 | def __missing__(self, key): 36 | return 0 37 | 38 | def most_common(self, n=None): 39 | '''List the n most common elements and their counts from the most 40 | common to the least. If n is None, then list all element counts. 
41 | 42 | >>> Counter('abracadabra').most_common(3) 43 | [('a', 5), ('r', 2), ('b', 2)] 44 | 45 | ''' 46 | if n is None: 47 | return sorted(self.iteritems(), key=itemgetter(1), reverse=True) 48 | return nlargest(n, self.iteritems(), key=itemgetter(1)) 49 | 50 | def elements(self): 51 | '''Iterator over elements repeating each as many times as its count. 52 | 53 | >>> c = Counter('ABCABC') 54 | >>> sorted(c.elements()) 55 | ['A', 'A', 'B', 'B', 'C', 'C'] 56 | 57 | If an element's count has been set to zero or is a negative number, 58 | elements() will ignore it. 59 | 60 | ''' 61 | for elem, count in self.iteritems(): 62 | for _ in repeat(None, count): 63 | yield elem 64 | 65 | # Override dict methods where the meaning changes for Counter objects. 66 | 67 | @classmethod 68 | def fromkeys(cls, iterable, v=None): 69 | raise NotImplementedError( 70 | 'Counter.fromkeys() is undefined. Use Counter(iterable) instead.') 71 | 72 | def update(self, iterable=None, **kwds): 73 | '''Like dict.update() but add counts instead of replacing them. 74 | 75 | Source can be an iterable, a dictionary, or another Counter instance. 76 | 77 | >>> c = Counter('which') 78 | >>> c.update('witch') # add elements from another iterable 79 | >>> d = Counter('watch') 80 | >>> c.update(d) # add elements from another counter 81 | >>> c['h'] # four 'h' in which, witch, and watch 82 | 4 83 | 84 | ''' 85 | if iterable is not None: 86 | if hasattr(iterable, 'iteritems'): 87 | if self: 88 | self_get = self.get 89 | for elem, count in iterable.iteritems(): 90 | self[elem] = self_get(elem, 0) + count 91 | else: 92 | # fast path when counter is empty 93 | dict.update(self, iterable) 94 | else: 95 | self_get = self.get 96 | for elem in iterable: 97 | self[elem] = self_get(elem, 0) + 1 98 | if kwds: 99 | self.update(kwds) 100 | 101 | def copy(self): 102 | """ 103 | Like dict.copy() but returns a Counter instance instead of a dict. 
104 | """ 105 | return Counter(self) 106 | 107 | def __delitem__(self, elem): 108 | """ 109 | Like dict.__delitem__() but does not raise KeyError for missing values. 110 | """ 111 | if elem in self: 112 | dict.__delitem__(self, elem) 113 | 114 | def __repr__(self): 115 | if not self: 116 | return '%s()' % self.__class__.__name__ 117 | items = ', '.join(map('%r: %r'.__mod__, self.most_common())) 118 | return '%s({%s})' % (self.__class__.__name__, items) 119 | 120 | # Multiset-style mathematical operations discussed in: 121 | # Knuth TAOCP Volume II section 4.6.3 exercise 19 122 | # and at http://en.wikipedia.org/wiki/Multiset 123 | # 124 | # Outputs guaranteed to only include positive counts. 125 | # 126 | # To strip negative and zero counts, add-in an empty counter: 127 | # c += Counter() 128 | 129 | def __add__(self, other): 130 | '''Add counts from two counters. 131 | 132 | >>> Counter('abbb') + Counter('bcc') 133 | Counter({'b': 4, 'c': 2, 'a': 1}) 134 | 135 | 136 | ''' 137 | if not isinstance(other, Counter): 138 | return NotImplemented 139 | result = Counter() 140 | for elem in set(self) | set(other): 141 | newcount = self[elem] + other[elem] 142 | if newcount > 0: 143 | result[elem] = newcount 144 | return result 145 | 146 | def __sub__(self, other): 147 | ''' Subtract count, but keep only results with positive counts. 148 | 149 | >>> Counter('abbbc') - Counter('bccd') 150 | Counter({'b': 2, 'a': 1}) 151 | 152 | ''' 153 | if not isinstance(other, Counter): 154 | return NotImplemented 155 | result = Counter() 156 | for elem in set(self) | set(other): 157 | newcount = self[elem] - other[elem] 158 | if newcount > 0: 159 | result[elem] = newcount 160 | return result 161 | 162 | def __or__(self, other): 163 | '''Union is the maximum of value in either of the input counters. 
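This backport matches the semantics of the stdlib `collections.Counter`, available from Python 2.7 onwards (the file exists only for Python 2.6 support). The multiset operations implemented here can be sanity-checked against the stdlib directly:

```python
# Check the backported multiset operations against the stdlib Counter.
from collections import Counter

a = Counter('abbb')  # a=1, b=3
b = Counter('bcc')   # b=1, c=2

print(a + b)                               # add counts: a=1, b=4, c=2
print(Counter('abbbc') - Counter('bccd'))  # keep only positives: a=1, b=2
print(a | b)                               # union = per-element max: a=1, b=3, c=2
print(a & b)                               # intersection = per-element min: b=1
```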
164 | 165 | >>> Counter('abbb') | Counter('bcc') 166 | Counter({'b': 3, 'c': 2, 'a': 1}) 167 | 168 | ''' 169 | if not isinstance(other, Counter): 170 | return NotImplemented 171 | _max = max 172 | result = Counter() 173 | for elem in set(self) | set(other): 174 | newcount = _max(self[elem], other[elem]) 175 | if newcount > 0: 176 | result[elem] = newcount 177 | return result 178 | 179 | def __and__(self, other): 180 | ''' Intersection is the minimum of corresponding counts. 181 | 182 | >>> Counter('abbb') & Counter('bcc') 183 | Counter({'b': 1}) 184 | 185 | ''' 186 | if not isinstance(other, Counter): 187 | return NotImplemented 188 | _min = min 189 | result = Counter() 190 | if len(self) < len(other): 191 | self, other = other, self 192 | for elem in ifilter(self.__contains__, other): 193 | newcount = _min(self[elem], other[elem]) 194 | if newcount > 0: 195 | result[elem] = newcount 196 | return result 197 | 198 | 199 | if __name__ == '__main__': 200 | import doctest 201 | print doctest.testmod() 202 | -------------------------------------------------------------------------------- /pdftables/diagnostics.py: -------------------------------------------------------------------------------- 1 | #!/usr/bin/env python 2 | 3 | import sys 4 | from collections import namedtuple 5 | import poppler 6 | import cairo 7 | 8 | from os.path import abspath 9 | 10 | Point = namedtuple('Point', ['x', 'y']) 11 | Line = namedtuple('Line', ['start', 'end']) 12 | Polygon = namedtuple('Polygon', 'points') 13 | Rectangle = namedtuple('Rectangle', ['top_left', 'bottom_right']) 14 | AnnotationGroup = namedtuple('AnnotationGroup', ['name', 'color', 'shapes']) 15 | Color = namedtuple('Color', ['red', 'green', 'blue']) 16 | 17 | __all__ = [ 18 | 'render_page', 19 | 'make_annotations', 20 | ] 21 | 22 | 23 | def draw_line(context, line): 24 | context.move_to(line.start.x, line.start.y) 25 | context.line_to(line.end.x, line.end.y) 26 | context.stroke() 27 | 28 | 29 | def draw_polygon(context, 
polygon): 30 | if len(polygon.points) == 0: 31 | return 32 | 33 | first_point = polygon.points[0] 34 | 35 | context.move_to(first_point.x, first_point.y) 36 | for line in polygon.points[1:]: 37 | context.line_to(line.x, line.y) 38 | 39 | context.stroke() 40 | 41 | 42 | def draw_rectangle(context, rectangle): 43 | width = abs(rectangle.bottom_right.x - rectangle.top_left.x) 44 | height = abs(rectangle.bottom_right.y - rectangle.top_left.y) 45 | 46 | context.rectangle(rectangle.top_left.x, 47 | rectangle.top_left.y, 48 | width, 49 | height) 50 | context.stroke() 51 | 52 | 53 | RENDERERS = {} 54 | RENDERERS[Line] = draw_line 55 | RENDERERS[Rectangle] = draw_rectangle 56 | RENDERERS[Polygon] = draw_polygon 57 | 58 | 59 | class CairoPdfPageRenderer(object): 60 | 61 | def __init__(self, pdf_page, svg_filename, png_filename): 62 | self._svg_filename = abspath(svg_filename) 63 | self._png_filename = abspath(png_filename) if png_filename else None 64 | self._context, self._surface = self._get_context( 65 | svg_filename, *pdf_page.get_size()) 66 | 67 | white = poppler.Color() 68 | white.red = white.green = white.blue = 65535 69 | black = poppler.Color() 70 | black.red = black.green = black.blue = 0 71 | # red = poppler.Color() 72 | # red.red = red.green = red.blue = 0 73 | # red.red = 65535 74 | 75 | width = pdf_page.get_size()[0] 76 | 77 | # We render everything 3 times, moving 78 | # one page-width to the right each time. 79 | self._offset_colors = [ 80 | (0, white, white, True), 81 | (width, black, white, True), 82 | (2 * width, black, black, False) 83 | ] 84 | 85 | for offset, fg_color, bg_color, render_graphics in self._offset_colors: 86 | # Render into context, with a different offset 87 | # each time. 
88 | self._context.save() 89 | self._context.translate(offset, 0) 90 | 91 | sel = poppler.Rectangle() 92 | sel.x1, sel.y1 = (0, 0) 93 | sel.x2, sel.y2 = pdf_page.get_size() 94 | 95 | if render_graphics: 96 | pdf_page.render(self._context) 97 | 98 | pdf_page.render_selection( 99 | self._context, sel, sel, poppler.SELECTION_GLYPH, 100 | fg_color, bg_color) 101 | 102 | self._context.restore() 103 | 104 | @staticmethod 105 | def _get_context(filename, width, height): 106 | SCALE = 1 107 | # left, middle, right 108 | N_RENDERINGS = 3 109 | 110 | surface = cairo.SVGSurface( 111 | filename, N_RENDERINGS * width * SCALE, height * SCALE) 112 | # srf = cairo.ImageSurface( 113 | # cairo.FORMAT_RGB24, int(w*SCALE), int(h*SCALE)) 114 | 115 | context = cairo.Context(surface) 116 | context.scale(SCALE, SCALE) 117 | 118 | # Set background color to white 119 | context.set_source_rgb(1, 1, 1) 120 | context.paint() 121 | 122 | return context, surface 123 | 124 | def draw(self, shape, color): 125 | self._context.save() 126 | self._context.set_line_width(1) 127 | self._context.set_source_rgba(color.red, 128 | color.green, 129 | color.blue, 130 | 0.5) 131 | self._context.translate(self._offset_colors[1][0], 0) 132 | RENDERERS[type(shape)](self._context, shape) 133 | self._context.restore() 134 | 135 | def flush(self): 136 | if self._png_filename is not None: 137 | self._surface.write_to_png(self._png_filename) 138 | 139 | # NOTE! The flush is rather expensive, since it writes out the svg 140 | # data. The profile will show a large amount of time spent inside it. 141 | # Removing it won't help the execution time at all, it will just move 142 | # it somewhere that the profiler can't see it 143 | # (at garbage collection time) 144 | self._surface.flush() 145 | self._surface.finish() 146 | 147 | 148 | def render_page(pdf_filename, page_number, annotations, svg_file=None, 149 | png_file=None): 150 | """ 151 | Render a single page of a pdf with graphical annotations added. 
152 | """ 153 | 154 | page = extract_pdf_page(pdf_filename, page_number) 155 | 156 | renderer = CairoPdfPageRenderer(page, svg_file, png_file) 157 | for annotation in annotations: 158 | assert isinstance(annotation, AnnotationGroup), ( 159 | "annotations: {0}, annotation: {1}".format( 160 | annotations, annotation)) 161 | for shape in annotation.shapes: 162 | renderer.draw(shape, annotation.color) 163 | 164 | renderer.flush() 165 | 166 | 167 | def extract_pdf_page(filename, page_number): 168 | file_uri = "file://{0}".format(abspath(filename)) 169 | doc = poppler.document_new_from_file(file_uri, "") 170 | 171 | page = doc.get_page(page_number) 172 | 173 | return page 174 | 175 | 176 | def make_annotations(table_container): 177 | """ 178 | Take the output of the table-finding algorithm (TableFinder) and create 179 | AnnotationGroups. These can be drawn on top of the original PDF page to 180 | visualise how the algorithm arrived at its output. 181 | """ 182 | 183 | annotations = [] 184 | 185 | annotations.append( 186 | AnnotationGroup( 187 | name='all_glyphs', 188 | color=Color(0, 1, 0), 189 | shapes=convert_rectangles(table_container.all_glyphs))) 190 | 191 | annotations.append( 192 | AnnotationGroup( 193 | name='all_words', 194 | color=Color(0, 0, 1), 195 | shapes=convert_rectangles(table_container.all_words))) 196 | 197 | annotations.append( 198 | AnnotationGroup( 199 | name='text_barycenters', 200 | color=Color(0, 0, 1), 201 | shapes=convert_barycenters(table_container.all_glyphs))) 202 | 203 | annotations.append( 204 | AnnotationGroup( 205 | name='hat_graph_vertical', 206 | color=Color(0, 1, 0), 207 | shapes=make_hat_graph( 208 | table_container._y_point_values, 209 | table_container._center_lines, 210 | direction="vertical"))) 211 | 212 | for table in table_container: 213 | annotations.append( 214 | AnnotationGroup( 215 | name='row_edges', 216 | color=Color(1, 0, 0), 217 | shapes=convert_horizontal_lines( 218 | table.row_edges, table.bounding_box))) 219 | 220 | 
annotations.append( 221 | AnnotationGroup( 222 | name='column_edges', 223 | color=Color(1, 0, 0), 224 | shapes=convert_vertical_lines( 225 | table.column_edges, table.bounding_box))) 226 | 227 | annotations.append( 228 | AnnotationGroup( 229 | name='glyph_histogram_horizontal', 230 | color=Color(1, 0, 0), 231 | shapes=make_glyph_histogram( 232 | table._x_glyph_histogram, table.bounding_box, 233 | direction="horizontal"))) 234 | 235 | annotations.append( 236 | AnnotationGroup( 237 | name='glyph_histogram_vertical', 238 | color=Color(1, 0, 0), 239 | shapes=make_glyph_histogram( 240 | table._y_glyph_histogram, table.bounding_box, 241 | direction="vertical"))) 242 | 243 | annotations.append( 244 | AnnotationGroup( 245 | name='horizontal_glyph_above_threshold', 246 | color=Color(0, 0, 0), 247 | shapes=make_thresholds( 248 | table._x_threshold_segs, table.bounding_box, 249 | direction="horizontal"))) 250 | 251 | annotations.append( 252 | AnnotationGroup( 253 | name='vertical_glyph_above_threshold', 254 | color=Color(0, 0, 0), 255 | shapes=make_thresholds( 256 | table._y_threshold_segs, table.bounding_box, 257 | direction="vertical"))) 258 | 259 | # Draw bounding boxes last so that they appear on top 260 | annotations.append( 261 | AnnotationGroup( 262 | name='table_bounding_boxes', 263 | color=Color(0, 0, 1), 264 | shapes=convert_rectangles(table_container.bounding_boxes))) 265 | 266 | return annotations 267 | 268 | 269 | def make_thresholds(segments, box, direction): 270 | lines = [] 271 | 272 | for segment in segments: 273 | 274 | if direction == "horizontal": 275 | lines.append(Line(Point(segment.start, box.bottom + 10), 276 | Point(segment.end, box.bottom + 10))) 277 | else: 278 | lines.append(Line(Point(10, segment.start), 279 | Point(10, segment.end))) 280 | 281 | return lines 282 | 283 | 284 | def make_hat_graph(hats, center_lines, direction): 285 | """ 286 | Draw estimated text barycenter 287 | """ 288 | 289 | max_value = max(v for _, v in hats) 290 | 
DISPLAY_WIDTH = 25 291 | 292 | points = [] 293 | polygon = Polygon(points) 294 | 295 | def point(x, y): 296 | points.append(Point(x, y)) 297 | 298 | for position, value in hats: 299 | point(((value / max_value - 1) * DISPLAY_WIDTH), position) 300 | 301 | lines = [] 302 | for position in center_lines: 303 | lines.append(Line(Point(-DISPLAY_WIDTH, position), 304 | Point(0, position))) 305 | 306 | return [polygon] + lines 307 | 308 | 309 | def make_glyph_histogram(histogram, box, direction): 310 | 311 | # if direction == "vertical": 312 | # return [] 313 | 314 | bin_edges, bin_values = histogram 315 | 316 | if not bin_edges: 317 | # There are no glyphs, and nothing to render! 318 | return [] 319 | 320 | points = [] 321 | polygon = Polygon(points) 322 | 323 | def point(x, y): 324 | points.append(Point(x, y)) 325 | 326 | # def line(*args): 327 | # lines.append(Line(*args)) 328 | previous_value = 0 if direction == "horizontal" else box.bottom 329 | 330 | x = zip(bin_edges, bin_values) 331 | for edge, value in x: 332 | 333 | if direction == "horizontal": 334 | value *= 0.75 335 | value = box.bottom - value 336 | 337 | point(edge, previous_value) 338 | point(edge, value) 339 | 340 | else: 341 | value *= 0.25 342 | value += 7 # shift pixels to the right 343 | 344 | point(previous_value, edge) 345 | point(value, edge) 346 | 347 | previous_value = value 348 | 349 | # Final point is at 0 350 | if direction == "horizontal": 351 | point(edge, 0) 352 | else: 353 | point(box.bottom, edge) 354 | 355 | # Draw edge density plot (not terribly interesting, should probably be 356 | # deleted) 357 | # lines = [] 358 | # if direction == "horizontal": 359 | # for edge in bin_edges: 360 | # lines.append(Line(Point(edge, box.bottom), 361 | # Point(edge, box.bottom + 5))) 362 | # else: 363 | # for edge in bin_edges: 364 | # lines.append(Line(Point(0, edge), Point(5, edge))) 365 | return [polygon] # + lines 366 | 367 | 368 | def convert_rectangles(boxes): 369 | return [Rectangle(Point(b.left, 
b.top), Point(b.right, b.bottom))
370 |             for b in boxes]
371 | 
372 | 
373 | def convert_barycenters(boxes):
374 |     return [Line(Point(b.left, b.barycenter.midpoint),
375 |                  Point(b.right, b.barycenter.midpoint))
376 |             for b in boxes if b.barycenter is not None]
377 | 
378 | 
379 | def convert_horizontal_lines(y_edges, bbox):
380 |     return [Line(Point(bbox.left, y), Point(bbox.right, y))
381 |             for y in y_edges]
382 | 
383 | 
384 | def convert_vertical_lines(x_edges, bbox):
385 |     return [Line(Point(x, bbox.top), Point(x, bbox.bottom))
386 |             for x in x_edges]
387 | 
388 | if __name__ == '__main__':
389 |     annotations = [
390 |         AnnotationGroup(
391 |             name='',
392 |             color=Color(1, 0, 0),
393 |             shapes=[Rectangle(Point(100, 100), Point(200, 200))])
394 |     ]
395 |     render_page(sys.argv[1], 0, annotations)
396 | 
--------------------------------------------------------------------------------
/pdftables/display.py:
--------------------------------------------------------------------------------
1 | #!/usr/bin/env python
2 | from __future__ import unicode_literals
3 | from collections import defaultdict
4 | from StringIO import StringIO
5 | 
6 | 
7 | def to_string(table):
8 |     """
9 |     Render `table` (a list of rows of strings) as a human-readable string.
10 |     >>> type(to_string([['foo', 'goodbye'], ['llama', 'bar']]))
11 |     <type 'unicode'>
12 |     """
13 |     result = StringIO()
14 | 
15 |     (columns, rows) = get_dimensions(table)
16 | 
17 |     result.write(" {} columns, {} rows\n".format(columns, rows))
18 |     col_widths = find_column_widths(table)
19 |     table_width = sum(col_widths) + len(col_widths) + 2
20 |     hbar = ' {}\n'.format('-' * table_width)
21 | 
22 |     result.write(" {}\n".format(' '.join(
23 |         [unicode(col_index).rjust(width, ' ') for (col_index, width)
24 |          in enumerate(col_widths)])))
25 | 
26 |     result.write(hbar)
27 |     for row_index, row in enumerate(table):
28 |         cells = [cell.rjust(width, ' ') for (cell, width)
29 |                  in zip(row, col_widths)]
30 |         result.write("{:>3} | {}|\n".format(row_index, '|'.join(cells)))
31 | 
result.write(hbar) 32 | result.seek(0) 33 | return unicode(result.read()) 34 | 35 | 36 | def get_dimensions(table): 37 | """ 38 | Returns columns, rows for a table. 39 | >>> get_dimensions([['row1', 'apple', 'llama'], ['row2', 'plum', 'goat']]) 40 | (3, 2) 41 | >>> get_dimensions([['row1', 'apple', 'llama'], ['row2', 'banana']]) 42 | (3, 2) 43 | """ 44 | rows = len(table) 45 | try: 46 | cols = max(len(row) for row in table) 47 | except ValueError: 48 | cols = 0 49 | return (cols, rows) 50 | 51 | 52 | def find_column_widths(table): 53 | """ 54 | Returns a list of the maximum width for each column across all rows 55 | >>> find_column_widths([['foo', 'goodbye'], ['llama', 'bar']]) 56 | [5, 7] 57 | """ 58 | col_widths = defaultdict(lambda: 0) 59 | for row_index, row in enumerate(table): 60 | for column_index, cell in enumerate(row): 61 | col_widths[column_index] = max(col_widths[column_index], len(cell)) 62 | return [col_widths[col] for col in sorted(col_widths)] 63 | 64 | if __name__ == '__main__': 65 | print(to_string([['foo', 'goodbye'], ['llama', 'bar']])) 66 | -------------------------------------------------------------------------------- /pdftables/line_segments.py: -------------------------------------------------------------------------------- 1 | """ 2 | Algorithms for processing line segments 3 | 4 | segments_generator 5 | 6 | Yield segments in order of their start/end. 
7 | 
8 |     [(1, 4), (2, 3)] => [(1, (1, 4)), (2, (2, 3)), (3, (2, 3)), (4, (1, 4))]
9 | 
10 | histogram_segments
11 | 
12 |     Number of line segments present in each given range
13 | 
14 |     [(1, 4), (2, 3)] => [((1, 2), 1), ((2, 3), 2), ((3, 4), 1)]
15 | 
16 | segment_histogram
17 | 
18 |     Binning for a histogram and the number of counts in each bin
19 | 
20 |     [(1, 4), (2, 3)] => [(1, 2, 3, 4), (1, 2, 1)]
21 | """
22 | 
23 | from __future__ import division
24 | 
25 | from collections import defaultdict, namedtuple
26 | from heapq import heappush, heapreplace, heappop
27 | 
28 | 
29 | class LineSegment(namedtuple("LineSegment", ["start", "end", "object"])):
30 | 
31 |     @classmethod
32 |     def make(cls, start, end, obj=None):
33 |         return cls(start, end, obj)
34 | 
35 |     def __repr__(self):
36 |         return '{0}(start={1:6.04f} end={2:6.04f} object={3})'.format(
37 |             type(self).__name__, self.start, self.end, self.object)
38 | 
39 |     @property
40 |     def length(self):
41 |         return self.end - self.start
42 | 
43 |     @property
44 |     def midpoint(self):
45 |         return (self.start + self.end) / 2
46 | 
47 | 
48 | def midpoint(segment):
49 |     yield segment.midpoint
50 | 
51 | 
52 | def start_end(segment):
53 |     yield segment.start
54 |     yield segment.end
55 | 
56 | 
57 | def start_midpoint_end(segment):
58 |     yield segment.start
59 |     yield segment.midpoint
60 |     yield segment.end
61 | 
62 | 
63 | def segments_generator(line_segments, to_visit=start_end):
64 |     """
65 |     Given a list of segment ranges [(start, stop)...], yield the list of
66 |     coordinates where any line segment starts or ends along with the line
67 |     segment which starts/ends at that coordinate. The third element of the
68 |     yielded tuple is True if the given segment is finishing at that point.
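The ordering behaviour documented for `segments_generator` can be reproduced with a much smaller sweep. This sketch handles only plain `(start, end)` tuples visited at their endpoints, whereas the real implementation also supports midpoint visiting and per-type visit functions:

```python
# Illustrative sweep: visit every start/end coordinate in increasing order,
# tagging each with its segment and whether the segment ends there.
import heapq


def sweep(segments):
    queue = []
    for seg in segments:
        start, end = seg
        # False sorts before True, so a start at position p is
        # visited before an end at the same position p.
        heapq.heappush(queue, (start, False, seg))
        heapq.heappush(queue, (end, True, seg))
    while queue:
        position, is_end, seg = heapq.heappop(queue)
        yield position, seg, is_end


print(list(sweep([(1, 4), (2, 3)])))
# [(1, (1, 4), False), (2, (2, 3), False), (3, (2, 3), True), (4, (1, 4), True)]
```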
69 | 70 | In [1]: list(linesegments.segments_generator([(1, 4), (2, 3)])) 71 | Out[1]: [(1, (1, 4), False), 72 | (2, (2, 3), False), 73 | (3, (2, 3), True), 74 | (4, (1, 4), True)] 75 | 76 | The function ``to_visit`` specifies which positions will be visited for 77 | each segment and may be ``start_end`` or ``start_midpoint_end``. 78 | 79 | If ``to_visit`` is a dictionary instance, it is a mapping from 80 | ``type(segment)`` onto a visit function. This allows passing in both 81 | line segments and points which should be visited simultaneously, for 82 | example, to classify points according to which line segments are currently 83 | overlapped. 84 | """ 85 | 86 | # Queue contains a list of outstanding segments to process. It is a list 87 | # of tuples, [(position, (start, end)], where (start, end) represents one 88 | # line-segment. `position` is the coordinate at which the line-segment) is 89 | # to be considered again. 90 | queue = [] 91 | 92 | _to_visit = to_visit 93 | 94 | # (Note, this has the effect of sorting line_segments in start order) 95 | for segment in line_segments: 96 | if isinstance(to_visit, dict): 97 | _to_visit = to_visit[type(segment)] 98 | 99 | # an iterator representing the points to visit for this segment 100 | points_to_visit = _to_visit(segment) 101 | try: 102 | start = points_to_visit.next() 103 | except StopIteration: 104 | continue 105 | 106 | heappush(queue, (start, points_to_visit, segment)) 107 | 108 | # Process outstanding segments until there are none left 109 | while queue: 110 | 111 | # Get the next segment to consider and the position at which we're 112 | # considering it 113 | position, points_to_visit, segment = heappop(queue) 114 | 115 | try: 116 | point_next_position = points_to_visit.next() 117 | if point_next_position < position: 118 | raise RuntimeError("Malformed input: next={0} < pos={1}" 119 | .format(point_next_position, position)) 120 | except StopIteration: 121 | # No more points for this segment 122 | disappearing = 
True 123 | else: 124 | disappearing = False 125 | heappush(queue, (point_next_position, points_to_visit, segment)) 126 | 127 | yield position, segment, disappearing 128 | 129 | 130 | def histogram_segments(segments): 131 | """ 132 | Given a list of histogram segments returns ((start, end), n_segments) 133 | which represents a histogram projection of the number of 134 | segments onto a line. 135 | 136 | In [1]: list(linesegments.histogram_segments([(1, 4), (2, 3)])) 137 | Out[1]: [((1, 2), 1), ((2, 3), 2), ((3, 4), 1)] 138 | """ 139 | 140 | # sum(.values()) is the number of segments within (start, end) 141 | active_segments = defaultdict(int) 142 | 143 | consider_segments = list(segments_generator(segments)) 144 | 145 | # TODO(pwaller): This function doesn't need to consider the active segments 146 | # It should just maintain a counter. (expect a ~O(10%) speedup) 147 | 148 | # Look ahead to the next start, and that's the end of the interesting range 149 | for this, next in zip(consider_segments, consider_segments[1:]): 150 | 151 | # (start, end) is the range until the next segment 152 | (start, seg, disappearing), (end, _, _) = this, next 153 | 154 | # Did the segment appear or disappear? Key on the segment coordinates 155 | if not disappearing: 156 | active_segments[seg] += 1 157 | 158 | else: 159 | active_segments[seg] -= 1 160 | 161 | if start == end: 162 | # This happens if a segment appears more than once. 163 | # Then we don't care about considering this zero-length range. 164 | continue 165 | 166 | yield (start, end), sum(active_segments.values()) 167 | 168 | 169 | def hat_point_generator(line_segments): 170 | """ 171 | This is a hat for one line segment: 172 | 173 | /\ 174 | ____/ \____ 175 | 176 | | | 177 | | \_ End 178 | \_ Start 179 | 180 | position ---> 181 | 182 | This generator yields at every `position` where the value of the hat 183 | function could change. 
184 | """ 185 | 186 | # Invariants: 187 | # * Position should be always increasing 188 | # * First and last yielded points should always be the empty set. 189 | # * All yielded positions should lie within all LineSegments in the 190 | # `active_segments` at the point it is yielded 191 | # * Each yielded set has its own id() 192 | 193 | # Set of segments active yielded for unique values of `position` 194 | active_segments = set() 195 | # Set of segments which have appeared so far at this `position` 196 | new_segments = set() 197 | 198 | segments_by_position = segments_generator( 199 | line_segments, start_midpoint_end) 200 | last_position = None 201 | 202 | for position, segment, disappearing in segments_by_position: 203 | 204 | if segment.start == segment.end: 205 | # Zero-length segments are uninteresting and get skipped 206 | continue 207 | 208 | if last_position is not None and last_position != position: 209 | 210 | # Sanity check. 211 | assert all(s.start <= last_position < s.end 212 | for s in active_segments) 213 | 214 | # Copy the `active_segments` set so that the caller doesn't 215 | # accidentally end up with the same set repeatedly and can't 216 | # modify the state inside this function 217 | yield last_position, set(active_segments) 218 | 219 | # `new_segments` are now `active segments. 220 | active_segments |= new_segments 221 | new_segments.clear() 222 | 223 | if disappearing: 224 | # This is the end of the segment, remove it from the active set 225 | active_segments.remove(segment) 226 | else: 227 | # Record the segment in the seen list. It might be the start or 228 | # midpoint. If it's the start, it won't be `active` until the 229 | # next iteration (unless that iteration removes it). 230 | new_segments.add(segment) 231 | 232 | last_position = position 233 | 234 | # For completeness, yield empty set at final position. 
235 | yield last_position, set() 236 | 237 | 238 | def hat(segment, position): 239 | """ 240 | This function returns 0 when ``position` is the start or end of ``segment`` 241 | and 1 when ``position`` is in the middle of the segment. 242 | 243 | /\ 244 | __/ \__ 245 | """ 246 | h = abs((segment.midpoint - position) / segment.length) 247 | return max(0, 1 - h) 248 | 249 | 250 | def normal_hat(position, active_segments): 251 | """ 252 | The ``normal_hat`` is the sum of the hat function for all active segments 253 | at this position 254 | """ 255 | return sum(hat(s, position) for s in active_segments) 256 | 257 | 258 | def max_length(position, active_segments): 259 | """ 260 | Returns the maximum length of any segment overlapping ``position`` 261 | """ 262 | if not active_segments: 263 | return None 264 | return max(s.length for s in active_segments) 265 | 266 | 267 | def normal_hat_with_max_length(position, active_segments): 268 | """ 269 | Obtain both the hat value and the length of the largest line segment 270 | overlapping each "hat position". 271 | """ 272 | 273 | return (normal_hat(position, active_segments), 274 | max_length(position, active_segments)) 275 | 276 | 277 | def hat_generator(line_segments, value_function=normal_hat): 278 | """ 279 | The purpose of this function is to determine where it might be effective 280 | to clamp text to for the purposes of text visitation order. 281 | 282 | The hat generator returns the sum of ``hat`` at each position where any 283 | line segment's ``start``, ``midpoint`` or ``end`` is. 284 | 285 | ``value_function`` can be used to obtain different kinds of information 286 | from the ``hat_point_generator``'s points. 287 | """ 288 | 289 | for position, active_segments in hat_point_generator(line_segments): 290 | yield position, value_function(position, active_segments) 291 | 292 | 293 | def segment_histogram(line_segments): 294 | """ 295 | Binning for a histogram and the number of counts in each bin. 
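The binning that `segment_histogram` produces can be sketched independently of the sweep machinery. This is an illustrative implementation for plain `(start, end)` tuples, not the library function, which is driven by `histogram_segments`:

```python
# Illustrative sketch: bin edges are all segment endpoints; each bin's count
# is the number of segments that fully cover it.
def segment_histogram_sketch(segments):
    edges = sorted({p for seg in segments for p in seg})
    bins = list(zip(edges, edges[1:]))
    counts = tuple(sum(1 for s, e in segments if s <= lo and hi <= e)
                   for lo, hi in bins)
    return tuple(edges), counts


print(segment_histogram_sketch([(1, 4), (2, 3)]))  # ((1, 2, 3, 4), (1, 2, 1))
```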
296 | Can be used to make a numpy.histogram. 297 | 298 | In [1]: list(linesegments.histogram_segments([(1, 4), (2, 3)])) 299 | Out[1]: [(1, 2, 3, 4), (1, 2, 1)] 300 | """ 301 | data = list(histogram_segments(line_segments)) 302 | 303 | if not data: 304 | return [(), ()] 305 | 306 | x, counts = zip(*data) 307 | starts, ends = zip(*x) 308 | 309 | return starts + (ends[-1],), counts 310 | 311 | 312 | def above_threshold(histogram, threshold): 313 | """ 314 | Returns a list of line segments from histogram which are above threshold 315 | """ 316 | 317 | bin_edges, bin_values = histogram 318 | edges = zip(bin_edges, bin_edges[1:]) 319 | 320 | above_threshold = [] 321 | 322 | for (first, second), value in zip(edges, bin_values): 323 | if value < threshold: 324 | continue 325 | 326 | if above_threshold and above_threshold[-1].end == first: 327 | # There's a previous one we can extend 328 | above_threshold[-1] = above_threshold[-1]._replace( 329 | end=second) 330 | else: 331 | # Insert a new one 332 | above_threshold.append(LineSegment(first, second, None)) 333 | 334 | return above_threshold 335 | 336 | 337 | def find_peaks(position_values): 338 | """ 339 | Find all points in a peaky graph which are local maxima. 340 | 341 | This function assumes that the very first and last points can't be peaks. 342 | """ 343 | 344 | # Initial value is zero, up is the only direction from here! 345 | increasing = True 346 | 347 | # The loop has two states. Either we're going up, in which case when we see 348 | # a next value less than the current one, we must be at the top. At which 349 | # point it's down hill and all next values will be less than the current 350 | # one. Until the bottom is reached, at which point we're increasing again. 351 | 352 | # Note that the last `position` can never be yielded. 
353 | 
354 |     successive_pairs = zip(position_values, position_values[1:])
355 |     for (position, value), (_, next_value) in successive_pairs:
356 |         if increasing:
357 |             if next_value < value:
358 |                 # position is a peak
359 |                 increasing = False
360 |                 yield position
361 | 
362 |         else:
363 |             if next_value > value:
364 |                 # position is a trough
365 |                 increasing = True
366 | 
--------------------------------------------------------------------------------
/pdftables/numpy_subset.py:
--------------------------------------------------------------------------------
1 | #!/usr/bin/env python
2 | # ScraperWiki Limited
3 | # Ian Hopkinson, 2013-08-02
4 | # -*- coding: utf-8 -*-
5 | 
6 | """
7 | Implement numpy.diff, numpy.arange and numpy.average to remove numpy dependency
8 | """
9 | 
10 | import math
11 | 
12 | 
13 | def diff(input_array):
14 |     '''
15 |     First order differences for an input array
16 | 
17 |     >>> diff([1,2,3,4])
18 |     [1, 1, 1]
19 |     '''
20 |     result = []
21 |     for i in range(0, len(input_array) - 1):
22 |         result.append(input_array[i + 1] - input_array[i])
23 |     return result
24 | 
25 | 
26 | def arange(start, stop, stepv):
27 |     '''
28 |     Generate a list of float values given start, stop and step values
29 | 
30 |     >>> [round(x, 1) for x in arange(0, 2, 0.1)]  # doctest: +NORMALIZE_WHITESPACE
31 |     [0.0, 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1.0,
32 |      1.1, 1.2, 1.3, 1.4, 1.5, 1.6, 1.7, 1.8, 1.9]
33 |     '''
34 |     count = int(math.ceil(float(stop - start) / float(stepv)))
35 |     result = [None, ] * count
36 |     result[0] = start
37 |     for i in xrange(1, count):
38 |         result[i] = result[i - 1] + stepv
39 | 
40 |     return result
41 | 
42 | 
43 | def average(input_array):
44 |     '''
45 |     Average a 1D array of values
46 | 
47 |     >>> average([1,2,3,4])
48 |     2.5
49 |     '''
50 |     return float(sum(input_array)) / float(len(input_array))
51 | 
--------------------------------------------------------------------------------
/pdftables/patched_poppler.py:
--------------------------------------------------------------------------------
1 | #!
/usr/bin/env python 2 | 3 | import ctypes 4 | import poppler 5 | 6 | from ctypes import CDLL, POINTER, c_voidp, c_double, c_uint, c_bool 7 | from ctypes import Structure, addressof 8 | 9 | from .boxes import Box, Rectangle 10 | 11 | 12 | class CRectangle(Structure): 13 | _fields_ = [ 14 | ("x1", c_double), 15 | ("y1", c_double), 16 | ("x2", c_double), 17 | ("y2", c_double), 18 | ] 19 | CRectangle.ptr = POINTER(CRectangle) 20 | 21 | glib = CDLL("libpoppler-glib.so.8") 22 | 23 | g_free = glib.g_free 24 | g_free.argtypes = (c_voidp,) 25 | 26 | 27 | _c_text_layout = glib.poppler_page_get_text_layout 28 | _c_text_layout.argtypes = (c_voidp, POINTER(CRectangle.ptr), POINTER(c_uint)) 29 | _c_text_layout.restype = c_bool 30 | 31 | GLYPH = poppler.SELECTION_GLYPH 32 | 33 | 34 | def poppler_page_get_text_layout(page): 35 | """ 36 | Wrapper of an underlying c-api function not yet exposed by the 37 | python-poppler API. 38 | 39 | Returns a list of text rectangles on the pdf `page` 40 | """ 41 | 42 | n = c_uint(0) 43 | rects = CRectangle.ptr() 44 | 45 | # From python-poppler internals it is known that hash(page) returns the 46 | # c-pointer to the underlying glib object. See also the repr(page). 
47 | page_ptr = hash(page) 48 | 49 | _c_text_layout(page_ptr, rects, n) 50 | 51 | # Obtain pointer to array of rectangles of the correct length 52 | rectangles = POINTER(CRectangle * n.value).from_address(addressof(rects)) 53 | 54 | get_text = page.get_selected_text 55 | 56 | poppler_rect = poppler.Rectangle() 57 | 58 | result = [] 59 | for crect in rectangles.contents: 60 | # result.append(Rectangle( 61 | # x1=crect.x1, y1=crect.y1, x2=crect.x2, y2=crect.y2)) 62 | 63 | _ = (crect.x1, crect.y1, crect.x2, crect.y2) 64 | poppler_rect.x1, poppler_rect.y1, poppler_rect.x2, poppler_rect.y2 = _ 65 | 66 | text = get_text(GLYPH, poppler_rect).decode("utf8") 67 | 68 | if text.endswith(" \n"): 69 | text = text[:-2] 70 | elif text.endswith(" ") and len(text) > 1: 71 | text = text[:-1] 72 | elif text.endswith("\n"): 73 | text = text[:-1] 74 | 75 | rect = Box( 76 | rect=Rectangle(x1=crect.x1, y1=crect.y1, x2=crect.x2, y2=crect.y2), 77 | text=text, 78 | ) 79 | result.append(rect) 80 | 81 | # TODO(pwaller): check that this free is correct 82 | g_free(rectangles) 83 | 84 | return result 85 | -------------------------------------------------------------------------------- /pdftables/pdf_document.py: -------------------------------------------------------------------------------- 1 | """ 2 | Backend abstraction for PDFDocuments 3 | """ 4 | 5 | import abc 6 | import os 7 | 8 | DEFAULT_BACKEND = "poppler" 9 | BACKEND = os.environ.get("PDFTABLES_BACKEND", DEFAULT_BACKEND).lower() 10 | 11 | # TODO(pwaller): Use abstract base class? 12 | # What does it buy us? Can we enforce that only methods specified in an ABC 13 | # are used by client code? 
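The backend selection combines the `PDFTABLES_BACKEND` environment variable with an optional explicit argument (see `get_backend` below). A standalone sketch of that precedence; `choose_backend` is a hypothetical helper written for illustration, not part of the module:

```python
import os

DEFAULT_BACKEND = "poppler"

def choose_backend(backend=None):
    # Explicit argument wins; otherwise consult the environment,
    # falling back to the module default. Names are lower-cased so
    # PDFTABLES_BACKEND=PdfMiner also works.
    if backend is None:
        backend = os.environ.get("PDFTABLES_BACKEND", DEFAULT_BACKEND)
    backend = backend.lower()
    if backend not in ("pdfminer", "poppler"):
        raise NotImplementedError("Unknown backend '{0}'".format(backend))
    return backend

assert choose_backend("PdfMiner") == "pdfminer"
```

Keeping the lookup in one function means client code never needs to know which concrete class will be instantiated.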
14 | 15 | 16 | class PDFDocument(object): 17 | __metaclass__ = abc.ABCMeta 18 | 19 | @classmethod 20 | def get_backend(cls, backend=None): 21 | """ 22 | Returns the PDFDocument class to use based on configuration from 23 | environment or pdf_document.BACKEND 24 | """ 25 | # If `cls` is already a concrete backend (not the abstract base), use it 26 | if cls is not PDFDocument: 27 | return cls 28 | 29 | if backend is None: 30 | backend = BACKEND 31 | 32 | # Imports have to go inline to avoid circular imports with the backends 33 | if backend == "pdfminer": 34 | from pdf_document_pdfminer import PDFDocument as PDFDoc 35 | return PDFDoc 36 | 37 | elif backend == "poppler": 38 | from pdf_document_poppler import PDFDocument as PDFDoc 39 | return PDFDoc 40 | 41 | raise NotImplementedError("Unknown backend '{0}'".format(backend)) 42 | 43 | @classmethod 44 | def from_path(cls, path): 45 | Class = cls.get_backend() 46 | return Class(path) 47 | 48 | @classmethod 49 | def from_fileobj(cls, fh): 50 | # TODO(pwaller): For now, put fh into a temporary file and call 51 | # .from_path. Future: when we have a working stream input function for 52 | # poppler, use that. 53 | raise NotImplementedError 54 | Class = cls.get_backend() 55 | # return Class(fh) # This is wrong since constructor now takes a path. 56 | 57 | def __init__(self, *args, **kwargs): 58 | raise RuntimeError( 59 | "Don't use this constructor, use a {0}.from_* method instead!" 60 | .format(self.__class__.__name__)) 61 | 62 | @abc.abstractmethod 63 | def __len__(self): 64 | """ 65 | Return the number of pages in the PDF 66 | """ 67 | 68 | @abc.abstractmethod 69 | def get_page(self, number): 70 | """ 71 | Return a PDFPage for page `number` (0 indexed!)
72 | """ 73 | 74 | @abc.abstractmethod 75 | def get_pages(self): 76 | """ 77 | Return all pages in the document: TODO(pwaller) move implementation here 78 | """ 79 | 80 | 81 | class PDFPage(object): 82 | __metaclass__ = abc.ABCMeta 83 | 84 | @abc.abstractmethod 85 | def get_glyphs(self): 86 | """ 87 | Obtain a list of bounding boxes (Box instances) for all glyphs 88 | on the page. 89 | """ 90 | 91 | @abc.abstractproperty 92 | def size(self): 93 | """ 94 | (width, height) of page 95 | """ 96 | -------------------------------------------------------------------------------- /pdftables/pdf_document_pdfminer.py: -------------------------------------------------------------------------------- 1 | #!/usr/bin/env python 2 | """ 3 | PDFDocument backend based on pdfminer 4 | """ 5 | 6 | import collections 7 | 8 | import pdfminer.pdfparser 9 | import pdfminer.pdfinterp 10 | import pdfminer.pdfdevice 11 | import pdfminer.layout 12 | import pdfminer.converter 13 | 14 | from .pdf_document import ( 15 | PDFDocument as BasePDFDocument, 16 | PDFPage as BasePDFPage, 17 | ) 18 | 19 | from .boxes import Box, BoxList, Rectangle 20 | 21 | 22 | class PDFDocument(BasePDFDocument): 23 | 24 | """ 25 | pdfminer implementation of PDFDocument 26 | """ 27 | 28 | @staticmethod 29 | def _initialise(file_handle): 30 | 31 | (doc, parser) = (pdfminer.pdfparser.PDFDocument(), 32 | pdfminer.pdfparser.PDFParser(file_handle)) 33 | 34 | parser.set_document(doc) 35 | doc.set_parser(parser) 36 | 37 | doc.initialize('') 38 | if not doc.is_extractable: 39 | raise ValueError( 40 | "pdfminer.pdfparser.PDFDocument is_extractable != True") 41 | la_params = pdfminer.layout.LAParams() 42 | la_params.word_margin = 0.0 43 | 44 | resource_manager = pdfminer.pdfinterp.PDFResourceManager() 45 | aggregator = pdfminer.converter.PDFPageAggregator( 46 | resource_manager, laparams=la_params) 47 | 48 | interpreter = pdfminer.pdfinterp.PDFPageInterpreter( 49 | resource_manager, aggregator) 50 | 51 | return doc, interpreter, 
aggregator 52 | 53 | def __init__(self, file_path): 54 | self._pages = None 55 | 56 | self._file_handle = open(file_path, "rb") 57 | 58 | result = self._initialise(self._file_handle) 59 | (self._doc, self._interpreter, self._device) = result 60 | 61 | def __len__(self): 62 | return len(self.get_pages()) 63 | 64 | def get_creator(self): 65 | return self._doc.info[0]['Creator'] # TODO: what's doc.info ? 66 | 67 | def get_pages(self): 68 | """ 69 | Returns a list of lazy pages (parsed on demand) 70 | """ 71 | if not self._pages: 72 | self._construct_pages() 73 | 74 | return self._pages 75 | 76 | def _construct_pages(self): 77 | self._pages = [PDFPage(self, page) for page in self._doc.get_pages()] 78 | 79 | def get_page(self, page_number): 80 | """ 81 | 0-based page getter 82 | """ 83 | pages = self.get_pages() 84 | if 0 <= page_number < len(pages): 85 | return pages[page_number] 86 | raise IndexError("Invalid page. Reminder: get_page() is 0-indexed " 87 | "(there are {0} pages)!".format(len(pages))) 88 | 89 | 90 | def children(obj): 91 | """ 92 | get all descendants of nested iterables 93 | """ 94 | if isinstance(obj, collections.Iterable): 95 | for child in obj: 96 | for node in children(child): 97 | yield node 98 | yield obj 99 | 100 | 101 | class PDFPage(BasePDFPage): 102 | 103 | """ 104 | pdfminer implementation of PDFPage 105 | """ 106 | 107 | def __init__(self, parent_pdf_document, page): 108 | assert isinstance(page, pdfminer.pdfparser.PDFPage), page.__class__ 109 | 110 | self.pdf_document = parent_pdf_document 111 | self._page = page 112 | self._cached_lt_page = None 113 | 114 | @property 115 | def size(self): 116 | x1, y1, x2, y2 = self._page.mediabox 117 | return x2 - x1, y2 - y1 118 | 119 | def get_glyphs(self): 120 | """ 121 | Return a BoxList of the glyphs on this page. 
122 | """ 123 | 124 | items = children(self._lt_page()) 125 | 126 | def keep(o): 127 | return isinstance(o, pdfminer.layout.LTChar) 128 | 129 | _, page_height = self.size 130 | 131 | def make_box(obj): 132 | # TODO(pwaller): Note: is `self._page.rotate` taken into account? 133 | 134 | # pdfminer gives coordinates such that y=0 is the bottom of the 135 | # page. Our algorithms expect y=0 is the top of the page, so.. 136 | left, bottom, right, top = obj.bbox 137 | return Box( 138 | rect=Rectangle( 139 | x1=left, x2=right, 140 | y1=page_height - top, 141 | y2=page_height - bottom, 142 | ), 143 | text=obj.get_text() 144 | ) 145 | 146 | return BoxList(make_box(obj) for obj in items if keep(obj)) 147 | 148 | def _lt_page(self): 149 | if not self._cached_lt_page: 150 | self._parse_page() 151 | return self._cached_lt_page 152 | 153 | def _parse_page(self): 154 | self.pdf_document._interpreter.process_page(self._page) 155 | self._cached_lt_page = self.pdf_document._device.get_result() 156 | assert isinstance(self._cached_lt_page, pdfminer.layout.LTPage), ( 157 | self._cached_lt_page.__class__) 158 | -------------------------------------------------------------------------------- /pdftables/pdf_document_poppler.py: -------------------------------------------------------------------------------- 1 | from ctypes import CDLL, POINTER, c_voidp, c_double, c_uint, c_bool 2 | from ctypes import Structure, addressof, pointer 3 | from os.path import abspath 4 | 5 | import gobject 6 | try: 7 | import poppler 8 | except ImportError: 9 | print "Poppler unavailable! Please install it." 
10 | print " sudo apt-get install python-poppler" 11 | raise 12 | 13 | import patched_poppler 14 | 15 | from .boxes import Box, BoxList 16 | from .pdf_document import ( 17 | PDFDocument as BasePDFDocument, 18 | PDFPage as BasePDFPage, 19 | ) 20 | 21 | 22 | class PDFDocument(BasePDFDocument): 23 | 24 | def __init__(self, file_path, password=""): 25 | uri = "file://{0}".format(abspath(file_path)) 26 | self._poppler = poppler.document_new_from_file(uri, password) 27 | 28 | def __len__(self): 29 | return self._poppler.get_n_pages() 30 | 31 | def get_page(self, n): 32 | return PDFPage(self, n) 33 | 34 | def get_pages(self): 35 | return [self.get_page(i) for i in xrange(len(self))] 36 | 37 | 38 | class PDFPage(BasePDFPage): 39 | 40 | def __init__(self, doc, n): 41 | self._poppler = doc._poppler.get_page(n) 42 | 43 | @property 44 | def size(self): 45 | return self._poppler.get_size() 46 | 47 | def get_glyphs(self): 48 | # TODO(pwaller): Result of this should be memoized onto the PDFPage 49 | # instance. 50 | 51 | gtl = patched_poppler.poppler_page_get_text_layout 52 | rectangles = gtl(self._poppler) 53 | 54 | return BoxList(rectangles) 55 | 56 | # TODO(pwaller): Salvage this. 57 | # 58 | # Poppler seems to lie to us because the assertion below fails. 59 | # It should return the same number of rectangles as there are 60 | # characters in the text, but it does not. 
61 | # See: 62 | # 63 | # http://www.mail-archive.com/poppler 64 | # @lists.freedesktop.org/msg06245.html 65 | # https://github.com/scraperwiki/pdftables/issues/89 66 | # https://bugs.freedesktop.org/show_bug.cgi?id=69608 67 | 68 | text = self._poppler.get_text().decode("utf8") 69 | 70 | # assert len(text) == len(rectangles), ( 71 | # "t={0}, r={1}".format(len(text), len(rectangles))) 72 | 73 | # assert False 74 | 75 | return BoxList(Box(rect=rect, text=character) 76 | for rect, character in zip(rectangles, text)) 77 | -------------------------------------------------------------------------------- /pdftables/pdftables.py: -------------------------------------------------------------------------------- 1 | """ 2 | pdftables public interface 3 | """ 4 | 5 | from __future__ import unicode_literals 6 | """ 7 | Some experiments with pdfminer 8 | http://www.unixuser.org/~euske/python/pdfminer/programming.html 9 | Some help here: 10 | http://denis.papathanasiou.org/2010/08/04/extracting-text-images-from-pdf-files 11 | """ 12 | 13 | # TODO(IanHopkinson) Identify multi-column text, for multicolumn text detect 14 | # per column 15 | # TODO(IanHopkinson) Dynamic / smarter thresholding 16 | # TODO(IanHopkinson) Handle argentina_diputados_voting_record.pdf automatically 17 | # TODO(IanHopkinson) Handle multiple tables on one page 18 | 19 | __all__ = ["get_tables", "page_to_tables", "page_contains_tables"] 20 | 21 | import codecs 22 | import collections 23 | import math 24 | import sys 25 | 26 | import numpy_subset 27 | 28 | from bisect import bisect_left 29 | from collections import defaultdict 30 | from counter import Counter 31 | from cStringIO import StringIO 32 | from operator import attrgetter 33 | 34 | from .boxes import Box, BoxList, Rectangle 35 | from .config_parameters import ConfigParameters 36 | from .line_segments import (segment_histogram, above_threshold, hat_generator, 37 | find_peaks, normal_hat_with_max_length, 38 | midpoint, start_end, LineSegment, 39 | 
segments_generator) 40 | from .pdf_document import PDFDocument, PDFPage 41 | 42 | IS_TABLE_COLUMN_COUNT_THRESHOLD = 3 43 | IS_TABLE_ROW_COUNT_THRESHOLD = 3 44 | 45 | LEFT = 0 46 | TOP = 3 47 | RIGHT = 2 48 | BOTTOM = 1 49 | 50 | 51 | class Table(object): 52 | 53 | """ 54 | Represents a single table on a PDF page. 55 | """ 56 | 57 | def __init__(self): 58 | # TODO(pwaller): populate this from pdf_page.number 59 | self.page_number = None 60 | self.bounding_box = None 61 | self.glyphs = None 62 | self.edges = None 63 | self.row_edges = None 64 | self.column_edges = None 65 | self.data = None 66 | 67 | def __repr__(self): 68 | d = self.data 69 | if d is not None: 70 | # TODO(pwaller): Compute this in a better way. 71 | h = len(d) 72 | w = len(d[0]) 73 | return "<Table {0}x{1}>".format(w, h) 74 | else: 75 | return "<Table (empty)>
" 76 | 77 | 78 | class TableContainer(object): 79 | 80 | """ 81 | Represents a collection of tables on a PDF page. 82 | """ 83 | 84 | def __init__(self): 85 | self.tables = [] 86 | 87 | self.original_page = None 88 | self.page_size = None 89 | self.bounding_boxes = None 90 | self.all_glyphs = None 91 | 92 | def add(self, table): 93 | self.tables.append(table) 94 | 95 | def __repr__(self): 96 | return "TableContainer(" + repr(self.__dict__) + ")" 97 | 98 | def __iter__(self): 99 | return iter(self.tables) 100 | 101 | 102 | def get_tables(fh): 103 | """ 104 | Return a list of 'tables' from the given file handle, where a table is a 105 | list of rows, and a row is a list of strings. 106 | """ 107 | pdf = PDFDocument.from_fileobj(fh) 108 | return get_tables_from_document(pdf) 109 | 110 | 111 | def get_tables_from_document(pdf_document): 112 | """ 113 | Return a list of 'tables' from the given PDFDocument, where a table is a 114 | list of rows, and a row is a list of strings. 115 | """ 116 | raise NotImplementedError("This interface hasn't been fixed yet, sorry!") 117 | 118 | result = [] 119 | 120 | config = ConfigParameters() 121 | 122 | # TODO(pwaller): Return one table container with all tables on it? 123 | 124 | for i, pdf_page in enumerate(pdf_document.get_pages()): 125 | if not page_contains_tables(pdf_page): 126 | continue 127 | 128 | tables = page_to_tables(pdf_page, config) 129 | 130 | # crop_table(table) 131 | #result.append(Table(table, i + 1, len(pdf_document), 1, 1)) 132 | 133 | return result 134 | 135 | 136 | def crop_table(table): 137 | """ 138 | Remove empty rows from the top and bottom of the table. 139 | 140 | TODO(pwaller): We may need functionality similar to this, or not? 
141 | """ 142 | for row in list(table): # top -> bottom 143 | if not any(cell.strip() for cell in row): 144 | table.remove(row) 145 | else: 146 | break 147 | 148 | for row in list(reversed(table)): # bottom -> top 149 | if not any(cell.strip() for cell in row): 150 | table.remove(row) 151 | else: 152 | break 153 | 154 | 155 | def page_contains_tables(pdf_page): 156 | if not isinstance(pdf_page, PDFPage): 157 | raise TypeError("Page must be PDFPage, not {}".format( 158 | pdf_page.__class__)) 159 | 160 | # TODO(pwaller): 161 | # 162 | # I would prefer if this function was defined in terms of `page_to_tables` 163 | # so that the logic cannot diverge. 164 | # 165 | # What should the test be? 166 | # len(page_to_tables(page)) > 0? 167 | # Number of tables left after filtering ones that have no data > 0? 168 | 169 | box_list = pdf_page.get_glyphs() 170 | 171 | boxtop = attrgetter("top") 172 | yhist = box_list.histogram(boxtop).rounder(1) 173 | test = [k for k, v in yhist.items() if v > IS_TABLE_COLUMN_COUNT_THRESHOLD] 174 | return len(test) > IS_TABLE_ROW_COUNT_THRESHOLD 175 | 176 | 177 | def make_words(glyphs): 178 | """ 179 | A word is a series of glyphs which are visually connected to each other. 180 | """ 181 | 182 | def ordering(box): 183 | assert box.barycenter_y is not None, ( 184 | "Box belongs to no barycenter. Has assign_barycenters been run?") 185 | return (box.barycenter_y, box.center_x) 186 | 187 | words = [] 188 | glyphs = [g for g in glyphs if g.barycenter_y is not None] 189 | 190 | for glyph in sorted(glyphs, key=ordering): 191 | 192 | if len(words) > 0 and words[-1].is_connected_to(glyph): 193 | words[-1].extend(glyph) 194 | 195 | else: 196 | words.append(Box.copy(glyph)) 197 | 198 | return words 199 | 200 | 201 | def page_to_tables(pdf_page, config=None): 202 | """ 203 | The central algorithm of pdftables: find all the tables on ``pdf_page`` and 204 | return them in a ``TableContainer``.
205 | 206 | The algorithm is steered with ``config`` which is of type 207 | ``ConfigParameters`` 208 | """ 209 | 210 | if config is None: 211 | config = ConfigParameters() 212 | 213 | # Avoid local variables; instead use properties of the 214 | # `tables` object, so that they are exposed for debugging and 215 | # visualisation. 216 | 217 | tables = TableContainer() 218 | 219 | tables.original_page = pdf_page 220 | tables.page_size = pdf_page.size 221 | tables.all_glyphs = pdf_page.get_glyphs() 222 | 223 | tables._x_segments, tables._y_segments = tables.all_glyphs.line_segments() 224 | # Find candidate text centerlines and compute some properties of them. 225 | (tables._y_point_values, 226 | tables._center_lines, 227 | tables._barycenter_maxheights) = ( 228 | determine_text_centerlines(tables._y_segments) 229 | ) 230 | 231 | assign_barycenters(tables._y_segments, 232 | tables._center_lines, 233 | tables._barycenter_maxheights) 234 | 235 | # Note: word computation must come after barycenter computation 236 | tables.all_words = make_words(tables.all_glyphs) 237 | 238 | tables.bounding_boxes = find_bounding_boxes(tables.all_glyphs, config) 239 | 240 | for box in tables.bounding_boxes: 241 | table = Table() 242 | table.bounding_box = box 243 | 244 | table.glyphs = tables.all_glyphs.inside(box) 245 | 246 | if len(table.glyphs) == 0: 247 | # If this happens, then find_bounding_boxes returned somewhere with 248 | # no glyphs inside it. Wat. 249 | raise RuntimeError("This is an empty table bounding box. " 250 | "That shouldn't happen.") 251 | 252 | # Fetch line-segments 253 | 254 | # h is lines with fixed y, multiple x values 255 | # v is lines with fixed x, multiple y values 256 | # TODO(pwaller): compute for whole page, then get subset belonging to 257 | # this table. 
258 | table._x_segments, table._y_segments = table.glyphs.line_segments() 259 | 260 | # Histogram them 261 | xs = table._x_glyph_histogram = segment_histogram(table._x_segments) 262 | ys = table._y_glyph_histogram = segment_histogram(table._y_segments) 263 | 264 | thres_nc = config.n_glyph_column_threshold 265 | thres_nr = config.n_glyph_row_threshold 266 | # Threshold them 267 | xs = table._x_threshold_segs = above_threshold(xs, thres_nc) 268 | ys = table._y_threshold_segs = above_threshold(ys, thres_nr) 269 | 270 | # Compute edges (the set of edges used to be called a 'comb') 271 | edges = compute_cell_edges(box, xs, ys, config) 272 | table.column_edges, table.row_edges = edges 273 | 274 | if table.column_edges and table.row_edges: 275 | table.data = compute_table_data(table) 276 | else: 277 | table.data = None 278 | 279 | tables.add(table) 280 | 281 | return tables 282 | 283 | 284 | def determine_text_centerlines(v_segments): 285 | """ 286 | Find candidate centerlines to snap glyphs to. 287 | """ 288 | 289 | _ = hat_generator(v_segments, value_function=normal_hat_with_max_length) 290 | y_hat_points = list(_) 291 | 292 | if not y_hat_points: 293 | # No text on the page? 
294 | return [], [], [] 295 | 296 | points, values_maxlengths = zip(*y_hat_points) 297 | values, max_lengths = zip(*values_maxlengths) 298 | 299 | point_values = zip(points, values) 300 | 301 | # y-positions of "good" center lines vertically 302 | # ("good" is determined using the /\ ("hat") function) 303 | center_lines = list(find_peaks(point_values)) 304 | 305 | # mapping of y-position (at each hat-point) to maximum glyph 306 | # height over that point 307 | barycenter_maxheights = dict( 308 | (barycenter, maxheight) 309 | for barycenter, maxheight in zip(points, max_lengths) 310 | if maxheight is not None) 311 | 312 | return point_values, center_lines, barycenter_maxheights 313 | 314 | 315 | def find_bounding_boxes(glyphs, config): 316 | """ 317 | Returns a list of bounding boxes, one per table. 318 | """ 319 | 320 | # TODO(pwaller): One day, this function will find more than one table. 321 | 322 | th, bh = config.table_top_hint, config.table_bottom_hint 323 | assert(glyphs is not None) 324 | bbox = find_table_bounding_box(glyphs, th, bh) 325 | 326 | if bbox is Box.empty_box: 327 | return [] 328 | 329 | # Return the one table's bounding box. 330 | return [bbox] 331 | 332 | 333 | def compute_cell_edges(box, h_segments, v_segments, config): 334 | """ 335 | Determines edges of cell content horizontally and vertically. It 336 | works by binning and thresholding the resulting histogram for 337 | each of the two axes (x and y). 338 | """ 339 | 340 | # TODO(pwaller): shove this on a config? 341 | # these need better names before being part of a public API. 342 | # They specify the minimum amount of space between "threshold-segments" 343 | # in the histogram of line segments and the minimum length, otherwise 344 | # they are not considered a gap. 345 | # units= pdf "points" 346 | minimum_segment_size = 0.5 347 | minimum_gap_size = 0.5 348 | 349 | def gap_midpoints(segments): 350 | return [(b.start + a.end) / 2. 
351 | for a, b in zip(segments, segments[1:]) 352 | if b.start - a.end > minimum_gap_size 353 | and b.length > minimum_segment_size 354 | ] 355 | 356 | column_edges = [box.left] + gap_midpoints(h_segments) + [box.right] 357 | row_edges = [box.top] + gap_midpoints(v_segments) + [box.bottom] 358 | 359 | return column_edges, row_edges 360 | 361 | 362 | def compute_table_data(table): 363 | """ 364 | Compute the final table data and return a list of lists. 365 | `table` should have been prepared with a list of glyphs, and a 366 | list of row_edges and column_edges (see the calling sequence in 367 | `page_to_tables`). 368 | """ 369 | 370 | ncolumns = len(table.column_edges) 371 | nrows = len(table.row_edges) 372 | 373 | # This contains a list of `boxes` at each table cell 374 | box_table = [[list() for i in range(ncolumns)] for j in range(nrows)] 375 | 376 | for box in table.glyphs: 377 | if box.text is None: 378 | # Glyph has no text, ignore it. 379 | continue 380 | 381 | x, y = box.center_x, box.center_y 382 | 383 | # Compute index of "gap" between two combs, rather than the comb itself 384 | col = bisect_left(table.column_edges, x) 385 | row = bisect_left(table.row_edges, y) 386 | 387 | # If this blows up, please check what "box" is when it blows up. 388 | # Is it a "\n" ? 
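The comb/edge lookup above relies on `bisect_left`: a glyph's center coordinate maps to a cell index by finding where it would be inserted into the sorted list of edges. A minimal sketch with made-up edge values:

```python
from bisect import bisect_left

# Column edges at x = 0, 100, 200, 300 delimit three columns.
column_edges = [0.0, 100.0, 200.0, 300.0]

def column_of(x):
    # bisect_left returns the insertion index, i.e. the index of the
    # "gap" between the two edges that bracket x.
    return bisect_left(column_edges, x)

assert column_of(50.0) == 1    # between edges 0 and 100
assert column_of(150.0) == 2   # between edges 100 and 200
assert column_of(250.0) == 3   # between edges 200 and 300
```

Because the edge list includes the table's own bounding edges, every glyph strictly inside the table lands in a well-defined gap.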
389 | box_table[row][col].append(box) 390 | 391 | def compute_text(boxes): 392 | 393 | def ordering(box): 394 | return (box.barycenter_y, box.center_x) 395 | sorted_boxes = sorted(boxes, key=ordering) 396 | 397 | result = [] 398 | 399 | for this, next in zip(sorted_boxes, sorted_boxes[1:] + [None]): 400 | result.append(this.text) 401 | if next is None: 402 | continue 403 | centerline_distance = next.center_y - this.center_y 404 | # Maximum separation is when the two barycenters are far enough 405 | # away that the two characters don't overlap anymore 406 | max_separation = this.height / 2 + next.height / 2 407 | 408 | if centerline_distance >= max_separation: 409 | result.append('\n') 410 | 411 | return ''.join(result) 412 | 413 | table_array = [] 414 | for row in box_table: 415 | table_array.append([compute_text(boxes) for boxes in row]) 416 | 417 | return table_array 418 | 419 | 420 | def find_table_bounding_box(box_list, table_top_hint, table_bottom_hint): 421 | """ 422 | Returns one bounding box (minx, maxx, miny, maxy) for tables based 423 | on a boxlist 424 | """ 425 | 426 | # TODO(pwaller): These closures are here to make it clear how these things 427 | # belong together. At some point it may get broken apart 428 | # again, or simplified. 429 | 430 | def threshold_above(hist, threshold_value): 431 | """ 432 | >>> threshold_above(Counter({518: 10, 520: 20, 530: 20, \ 433 | 525: 17}), 15) 434 | [520, 530, 525] 435 | """ 436 | if not isinstance(hist, Counter): 437 | raise ValueError("requires Counter") # TypeError then? 438 | 439 | above = [k for k, v in hist.items() if v > threshold_value] 440 | return above 441 | 442 | def threshold_y(): 443 | """ 444 | Try to reduce the y range with a threshold. 
445 | """ 446 | 447 | return box_list.bounds() 448 | 449 | # TODO(pwaller): Reconcile the below code 450 | 451 | boxtop = attrgetter("top") 452 | boxbottom = attrgetter("bottom") 453 | 454 | # Note: (pwaller) this rounding excludes stuff I'm not sure we should 455 | # be excluding. (e.g, in the unica- dataset) 456 | 457 | yhisttop = box_list.histogram(boxtop).rounder(2) 458 | yhistbottom = box_list.histogram(boxbottom).rounder(2) 459 | 460 | try: 461 | # TODO(pwaller): fix this, remove except block 462 | threshold = IS_TABLE_COLUMN_COUNT_THRESHOLD 463 | miny = min(threshold_above(yhisttop, threshold)) 464 | # and the top of the top cell 465 | maxy = max(threshold_above(yhistbottom, threshold)) 466 | except ValueError: 467 | # Value errors raised when min and/or max fed empty lists 468 | miny = None 469 | maxy = None 470 | #raise ValueError("table_threshold caught nothing") 471 | 472 | return Box(Rectangle( 473 | x1=float("-inf"), 474 | y1=miny, 475 | x2=float("+inf"), 476 | y2=maxy, 477 | )) 478 | 479 | def hints_y(): 480 | miny = float("-inf") 481 | maxy = float("+inf") 482 | 483 | glyphs = [glyph for glyph in box_list if glyph.text is not None] 484 | 485 | if table_top_hint: 486 | top_box = [box for box in glyphs if table_top_hint in box.text] 487 | if top_box: 488 | miny = top_box[0].top 489 | 490 | if table_bottom_hint: 491 | bottomBox = [box for box in glyphs 492 | if table_bottom_hint in box.text] 493 | if bottomBox: 494 | maxy = bottomBox[0].bottom 495 | 496 | return Box(Rectangle( 497 | x1=float("-inf"), 498 | y1=miny, 499 | x2=float("+inf"), 500 | y2=maxy, 501 | )) 502 | 503 | bounds = box_list.bounds() 504 | threshold_bounds = threshold_y() 505 | hinted_bounds = hints_y() 506 | 507 | return bounds.clip(threshold_bounds, hinted_bounds) 508 | 509 | 510 | class Baseline(LineSegment): 511 | pass 512 | 513 | 514 | def assign_barycenters(y_segments, barycenters, barycenter_heightmap): 515 | """ 516 | Assign the glyph.barycenter and .barycenter_y to their 
"preferred" 517 | barycenter. "Preferred" is currently defined as closest. 518 | 519 | Here we use the term "barycenter" because it is the center of the glyph 520 | weighted according to nearby glyphs in Y. It is used to determine which 521 | word glyphs belong to, and which order text should be inserted into a cell. 522 | """ 523 | result = list() 524 | # Compute a list of barycenter line segments, which are long enough to 525 | # overlap all glyphs which are long enough overlap it. 526 | for barycenter in barycenters: 527 | maxheight = barycenter_heightmap[barycenter] 528 | result.append(Baseline.make(barycenter - maxheight / 2, 529 | barycenter + maxheight / 2)) 530 | 531 | to_visit = {LineSegment: midpoint, Baseline: start_end} 532 | 533 | active_barycenters = set() 534 | 535 | segments = segments_generator(result + y_segments, to_visit) 536 | for position, glyph, disappearing in segments: 537 | # Maintain a list of barycenters vs position 538 | if isinstance(glyph, Baseline): 539 | if not disappearing: 540 | active_barycenters.add(glyph) 541 | else: 542 | active_barycenters.remove(glyph) 543 | continue 544 | 545 | if len(active_barycenters) == 0: 546 | # There is no barycenter this glyph might belong to. 547 | # TODO(pwaller): huh? This should surely never happen. 548 | # Investigate why by turning this assert on. 
549 | # assert False 550 | continue 551 | 552 | # Pick the barycenter closest to our position (== glyph.center_y) 553 | barycenter = min(active_barycenters, 554 | key=lambda b: abs(b.midpoint - position)) 555 | 556 | glyph.object.barycenter = barycenter 557 | glyph.object.barycenter_y = barycenter.midpoint 558 | -------------------------------------------------------------------------------- /pdftables/scripts/__init__.py: -------------------------------------------------------------------------------- 1 | 2 | -------------------------------------------------------------------------------- /pdftables/scripts/render.py: -------------------------------------------------------------------------------- 1 | #!/usr/bin/env python 2 | 3 | """pdftables-render: obtain pdftables debugging information from pdfs 4 | 5 | Usage: 6 | pdftables-render [options] <pdf-path>... 7 | pdftables-render (-h | --help) 8 | pdftables-render --version 9 | 10 | Example page number lists: 11 | 12 | <pdf-path> may contain a [:page-number-list]. 13 | 14 | pdftables-render my.pdf:1 15 | pdftables-render my.pdf:2,5-10,15- 16 | 17 | Example JSON config options: 18 | 19 | '{ "n_glyph_column_threshold": 3, "n_glyph_row_threshold": 5 }' 20 | 21 | Options: 22 | -h --help Show this screen. 23 | --version Show version. 24 | -D --debug Additional debug information 25 | -O --output-dir=<path> Path to write debug data to 26 | -a --ascii Show ascii table 27 | -p --pprint pprint.pprint() the table 28 | -i --interactive jump into an interactive debugger (ipython) 29 | -c --config=<json> JSON object of config parameters 30 | 31 | """ 32 | 33 | # Use $ pip install --user --editable pdftables 34 | # to install this util in your path.
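The page-number-list syntax documented above ("2,5-10,15-") expands to a set of 0-based page indices. A simplified sketch of the expansion (`expand_page_ranges` is an illustrative name; unlike the script's own `parse_page_ranges`, it does no error handling for malformed ranges):

```python
def expand_page_ranges(range_string, npages):
    # "2" -> a single page; "5-10" -> an inclusive run;
    # "15-" -> from page 15 to the last page of the document.
    result = []
    for r in range_string.split(','):
        if '-' not in r:
            result.append(int(r))
        else:
            start, end = r.split('-')
            end = npages if end == "" else int(end)
            result.extend(range(int(start), end + 1))
    # The user writes 1-based page numbers; indices are 0-based.
    return [page - 1 for page in result]

assert expand_page_ranges("2,5-10,15-", 17) == \
    [1, 4, 5, 6, 7, 8, 9, 14, 15, 16]
```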
35 | 36 | import sys 37 | import os 38 | import json 39 | 40 | import pdftables 41 | 42 | from os.path import basename 43 | from pprint import pprint 44 | 45 | from docopt import docopt 46 | 47 | from pdftables.pdf_document import PDFDocument 48 | from pdftables.diagnostics import render_page, make_annotations 49 | from pdftables.display import to_string 50 | from pdftables.pdftables import page_to_tables 51 | from pdftables.config_parameters import ConfigParameters 52 | 53 | 54 | def main(args=None): 55 | 56 | if args is not None: 57 | argv = args 58 | else: 59 | argv = sys.argv[1:] 60 | 61 | arguments = docopt( 62 | __doc__, 63 | argv=argv, 64 | version='pdftables-render experimental') 65 | 66 | if arguments["--debug"]: 67 | print(arguments) 68 | 69 | if arguments["--config"]: 70 | kwargs = json.loads(arguments["--config"]) 71 | else: 72 | kwargs = {} 73 | config = ConfigParameters(**kwargs) 74 | 75 | for pdf_filename in arguments["<pdf-path>"]: 76 | render_pdf(arguments, pdf_filename, config) 77 | 78 | 79 | def ensure_dirs(): 80 | try: 81 | os.mkdir('png') 82 | except OSError: 83 | pass 84 | 85 | try: 86 | os.mkdir('svg') 87 | except OSError: 88 | pass 89 | 90 | 91 | def parse_page_ranges(range_string, npages): 92 | ranges = range_string.split(',') 93 | result = [] 94 | 95 | def string_to_pagenumber(s): 96 | if s == "": 97 | return npages 98 | return int(s) 99 | 100 | for r in ranges: 101 | if '-' not in r: 102 | result.append(int(r)) 103 | else: 104 | # Convert 1-based indices to 0-based and make integer.
105 | points = [string_to_pagenumber(x) for x in r.split('-')] 106 | 107 | if len(points) == 2: 108 | start, end = points 109 | else: 110 | raise RuntimeError( 111 | "Malformed range string: {0}" 112 | .format(range_string)) 113 | 114 | # Plus one because it's (start, end) inclusive 115 | result.extend(xrange(start, end + 1)) 116 | 117 | # Convert from one based to zero based indices 118 | return [x - 1 for x in result] 119 | 120 | 121 | def render_pdf(arguments, pdf_filename, config): 122 | ensure_dirs() 123 | 124 | page_range_string = '' 125 | page_set = [] 126 | if ':' in pdf_filename: 127 | pdf_filename, page_range_string = pdf_filename.split(':') 128 | 129 | doc = PDFDocument.from_path(pdf_filename) 130 | 131 | if page_range_string: 132 | page_set = parse_page_ranges(page_range_string, len(doc)) 133 | 134 | for page_number, page in enumerate(doc.get_pages()): 135 | if page_set and page_number not in page_set: 136 | # Page ranges have been specified by user, and this page is not in them. 137 | continue 138 | 139 | svg_file = 'svg/{0}_{1:02d}.svg'.format( 140 | basename(pdf_filename), page_number) 141 | png_file = 'png/{0}_{1:02d}.png'.format( 142 | basename(pdf_filename), page_number) 143 | 144 | table_container = page_to_tables(page, config) 145 | annotations = make_annotations(table_container) 146 | 147 | render_page( 148 | pdf_filename, page_number, annotations, svg_file, png_file) 149 | 150 | print "Rendered", svg_file, png_file 151 | 152 | if arguments["--interactive"]: 153 | from ipdb import set_trace 154 | set_trace() 155 | 156 | for table in table_container: 157 | 158 | if arguments["--ascii"]: 159 | print to_string(table.data) 160 | 161 | if arguments["--pprint"]: 162 | pprint(table.data) 163 | 164 | 165 | -------------------------------------------------------------------------------- /render_all.sh: -------------------------------------------------------------------------------- 1 | #!/bin/sh 2 | 3 | for pdf in fixtures/sample_data/*.pdf 4 | do 5 | printf -- 
"---** %s **---\n" "$pdf" 6 | pdftables-render "$pdf" 7 | done 8 | -------------------------------------------------------------------------------- /requirements.txt: -------------------------------------------------------------------------------- 1 | pdfminer>=20110515 2 | requests>=1.1.0 3 | matplotlib>=1.1.1 4 | -------------------------------------------------------------------------------- /setup.py: -------------------------------------------------------------------------------- 1 | from setuptools import setup, find_packages 2 | 3 | long_desc = """ 4 | PDFTables helps with extracting tables from PDF files. 5 | """ 6 | # See https://pypi.python.org/pypi?%3Aaction=list_classifiers for classifiers 7 | 8 | conf = dict(name='pdftables', 9 | version='0.0.3', 10 | description="Parses PDFs and extracts what it believes to be tables.", 11 | long_description=long_desc, 12 | classifiers=[ 13 | "Development Status :: 7 - Inactive", 14 | "Intended Audience :: Developers", 15 | "License :: OSI Approved :: BSD License", 16 | "Operating System :: POSIX :: Linux", 17 | "Programming Language :: Python", 18 | ], 19 | keywords='', 20 | author='ScraperWiki Ltd', 21 | author_email='feedback@scraperwiki.com', 22 | url='http://scraperwiki.com', 23 | license='BSD', 24 | packages=find_packages(exclude=['ez_setup', 'examples', 'tests']), 25 | namespace_packages=[], 26 | include_package_data=False, 27 | zip_safe=False, 28 | install_requires=[ 29 | 'pdfminer>=20110515', 30 | 'docopt>=0.6', 31 | ], 32 | tests_require=[], 33 | entry_points={ 34 | 'console_scripts': [ 35 | 'pdftables-render = pdftables.scripts.render:main', 36 | ] 37 | }) 38 | 39 | if __name__ == '__main__': 40 | setup(**conf) 41 | -------------------------------------------------------------------------------- /test/fixtures.py: -------------------------------------------------------------------------------- 1 | 2 | from os.path import abspath, dirname, join as pjoin 3 | 4 | from pdftables.pdf_document import PDFDocument 
5 | 6 | memoized = {} 7 | 8 | 9 | def fixture(filename): 10 | """ 11 | Obtain a PDFDocument for fixtures/sample_data/<filename>, memoizing the 12 | return result. 13 | """ 14 | global memoized 15 | 16 | if filename in memoized: 17 | return memoized.get(filename) 18 | here = abspath(dirname(__file__)) 19 | fn = pjoin(here, "..", "fixtures", "sample_data", filename) 20 | memoized[filename] = PDFDocument.from_path(fn) 21 | return memoized[filename] 22 | -------------------------------------------------------------------------------- /test/test_Table_class.py: -------------------------------------------------------------------------------- 1 | #!/usr/bin/env python 2 | # ScraperWiki Limited 3 | # Ian Hopkinson, 2013-07-30 4 | # -*- coding: utf-8 -*- 5 | 6 | """ 7 | Tests the Table class which contains metadata 8 | """ 9 | import sys 10 | 11 | from pdftables.pdftables import get_tables_from_document 12 | 13 | from fixtures import fixture 14 | 15 | from nose.tools import * 16 | import nose 17 | 18 | 19 | def test_it_includes_page_numbers(): 20 | """ 21 | page_number is 1-indexed, as defined in the PDF format 22 | table_number is 1-indexed 23 | """ 24 | doc = fixture('AnimalExampleTables.pdf') 25 | try: 26 | result = get_tables_from_document(doc) 27 | except NotImplementedError, e: 28 | raise nose.SkipTest(e) 29 | assert_equals(result[0].total_pages, 4) 30 | assert_equals(result[0].page_number, 2) 31 | assert_equals(result[1].total_pages, 4) 32 | assert_equals(result[1].page_number, 3) 33 | assert_equals(result[2].total_pages, 4) 34 | assert_equals(result[2].page_number, 4) 35 | 36 | 37 | def test_it_includes_table_numbers(): 38 | doc = fixture('AnimalExampleTables.pdf') 39 | try: 40 | result = get_tables_from_document(doc) 41 | except NotImplementedError, e: 42 | raise nose.SkipTest(e) 43 | result = get_tables_from_document(doc) 44 | assert_equals(result[0].table_number_on_page, 1) 45 | assert_equals(result[0].total_tables_on_page, 1) 46 | 
-------------------------------------------------------------------------------- /test/test_all_sample_data.py: -------------------------------------------------------------------------------- 1 | #!/usr/bin/env python 2 | # coding: utf-8 3 | 4 | from __future__ import unicode_literals 5 | from nose.tools import assert_equal 6 | from os.path import join, dirname 7 | import os 8 | 9 | from pdftables.pdftables import page_to_tables 10 | from pdftables.display import to_string 11 | from pdftables.diagnostics import render_page, make_annotations 12 | 13 | from fixtures import fixture 14 | 15 | SAMPLE_DIR = join(dirname(__file__), '..', 'fixtures', 'sample_data') 16 | RENDERED_DIR = join(dirname(__file__), '..', 'fixtures', 'rendered') 17 | EXPECTED_DIR = join(dirname(__file__), '..', 'fixtures', 'expected_output') 18 | ACTUAL_DIR = join(dirname(__file__), '..', 'fixtures', 'actual_output') 19 | 20 | 21 | def _test_sample_data(): 22 | for filename in os.listdir(SAMPLE_DIR): 23 | yield _test_sample_pdf, filename 24 | 25 | 26 | def _test_sample_pdf(short_filename): 27 | doc = fixture(short_filename) 28 | for page_number, page in enumerate(doc.get_pages()): 29 | 30 | tables = page_to_tables(page) 31 | annotations = make_annotations(tables) 32 | 33 | basename = '{0}_{1}'.format(short_filename, page_number) 34 | 35 | render_page( 36 | join(SAMPLE_DIR, short_filename), 37 | page_number, 38 | annotations, 39 | svg_file=join(RENDERED_DIR, 'svgs', basename + '.svg'), 40 | png_file=join(RENDERED_DIR, 'pngs', basename + '.png') 41 | ) 42 | 43 | assert_equal(get_expected_number_of_tables(short_filename), len(tables)) 44 | for table_num, table in enumerate(tables): 45 | table_filename = "{}_{}.txt".format(short_filename, table_num) 46 | expected_filename = join(EXPECTED_DIR, table_filename) 47 | actual_filename = join(ACTUAL_DIR, table_filename) 48 | 49 | with open(actual_filename, 'w') as f: 50 | f.write(to_string(table).encode('utf-8')) 51 | 52 | 
diff_table_files(expected_filename, actual_filename) 53 | 54 | 55 | def get_expected_number_of_tables(short_filename): 56 | result = len([fn for fn in os.listdir(EXPECTED_DIR) 57 | if fn.startswith(short_filename)]) 58 | if result == 0: 59 | print("NOTE: there is no 'expected' data for {0} in {1}: you probably " 60 | "want to review then copy files from {2}".format( 61 | short_filename, EXPECTED_DIR, ACTUAL_DIR)) 62 | return result 63 | 64 | 65 | def diff_table_files(expected, result): 66 | with open(expected) as f: 67 | with open(result) as g: 68 | for line, (expected_line, actual_line) in enumerate(zip(f, g)): 69 | try: 70 | assert_equal(expected_line, actual_line) 71 | except AssertionError: 72 | print("{} and {} differ @ line {}".format( 73 | expected, result, line + 1)) 74 | raise 75 | -------------------------------------------------------------------------------- /test/test_box.py: -------------------------------------------------------------------------------- 1 | #!/usr/bin/env python 2 | # -*- coding: utf-8 -*- 3 | 4 | #import sys 5 | #sys.path.append('pdftables') 6 | 7 | import pdftables 8 | import pdftables.boxes as boxes 9 | 10 | 11 | from pdftables.boxes import Box, BoxList, Rectangle 12 | 13 | from nose.tools import assert_equals, assert_is_not 14 | 15 | inf = float("inf") 16 | 17 | 18 | def test_basic_boxes(): 19 | box = Box(Rectangle(11, 22, 33, 44), "text") 20 | assert_equals(box.rect, (11, 22, 33, 44)) 21 | assert_equals((11, 22, 33, 44), box.rect) 22 | assert_equals(22, box.top) 23 | assert_equals('text', box.text) 24 | 25 | 26 | def test_clip_identity(): 27 | """ 28 | Test clipping a box with itself results in the same box 29 | """ 30 | box1 = Box(Rectangle(-inf, -inf, inf, inf)) 31 | box2 = Box(Rectangle(-inf, -inf, inf, inf)) 32 | 33 | clipped = box1.clip(box2) 34 | assert_is_not(clipped, Box.empty_box) 35 | assert_equals(clipped.rect, box1.rect) 36 | 37 | 38 | def test_clip_x_infinite(): 39 | """ 40 | Test correctly clipping (-inf, 0, inf, 
10) with (-inf, -inf, inf, inf) 41 | """ 42 | box1 = Box(Rectangle(-inf, -inf, inf, inf)) 43 | box2 = Box(Rectangle(-inf, 0, inf, 10)) 44 | 45 | clipped = box1.clip(box2) 46 | assert_is_not(clipped, Box.empty_box) 47 | assert_equals(clipped.rect, (-inf, 0, inf, 10)) 48 | 49 | 50 | def test_boxlist_inside(): 51 | b = BoxList() 52 | b.append(Box(Rectangle(0, 0, 10, 10))) 53 | 54 | infrect = Box(Rectangle(-inf, -inf, inf, inf)) 55 | 56 | assert_equals(1, len(b.inside(infrect))) 57 | assert_equals(0, len(b.inside(Box.empty_box))) 58 | 59 | 60 | def test_boxlist_inside_not_inside(): 61 | b = BoxList() 62 | b.append(Box(Rectangle(0, 0, 10, 10))) 63 | 64 | otherbox = Box(Rectangle(-100, -100, -90, -90)) 65 | assert_equals(0, len(b.inside(otherbox))) 66 | -------------------------------------------------------------------------------- /test/test_contains_tables.py: -------------------------------------------------------------------------------- 1 | #!/usr/bin/env python 2 | # ScraperWiki Limited 3 | # Ian Hopkinson, 2013-06-17 4 | # -*- coding: utf-8 -*- 5 | 6 | """ 7 | ContainsTables tests 8 | """ 9 | 10 | import sys 11 | 12 | import pdftables 13 | 14 | from fixtures import fixture 15 | 16 | from nose.tools import assert_equals 17 | 18 | 19 | def contains_tables(pdf): 20 | """ 21 | contains_tables takes a pdf document and returns a boolean array of the 22 | length of the document which is true for pages which contains tables 23 | """ 24 | return [pdftables.page_contains_tables(page) for page in pdf.get_pages()] 25 | 26 | 27 | def test_it_finds_no_tables_in_a_pdf_with_no_tables(): 28 | pdf = fixture('m27-dexpeg2-polymer.pdf') 29 | assert_equals( 30 | [False, False, False, False, False, False, False, False], 31 | contains_tables(pdf)) 32 | 33 | 34 | def test_it_finds_tables_on_all_pages_AlmondBoard(): 35 | pdf = fixture('2012.01.PosRpt.pdf') 36 | assert_equals( 37 | [True, True, True, True, True, True, True], 38 | contains_tables(pdf)) 39 | 40 | 41 | def 
test_it_finds_tables_on_some_pages_CONAB(): 42 | pdf = fixture('13_06_12_10_36_58_boletim_ingles_junho_2013.pdf') 43 | TestList = [False] * 32 44 | TestList[5:8] = [True] * 3 45 | TestList[9:11] = [True] * 2 46 | TestList[12] = True 47 | TestList[14] = True 48 | TestList[16:18] = [True] * 2 49 | TestList[19:24] = [True] * 5 50 | TestList[25:30] = [True] * 5 51 | 52 | assert_equals(contains_tables(pdf), TestList) 53 | -------------------------------------------------------------------------------- /test/test_finds_tables.py: -------------------------------------------------------------------------------- 1 | #!/usr/bin/env python 2 | # ScraperWiki Limited 3 | # Ian Hopkinson, 2013-06-17 4 | # -*- coding: utf-8 -*- 5 | 6 | """ 7 | Finds tables tests 8 | """ 9 | 10 | import sys 11 | 12 | import pdftables 13 | from pdftables.config_parameters import ConfigParameters 14 | 15 | from fixtures import fixture 16 | 17 | from nose.tools import assert_equals 18 | 19 | 20 | def test_atomise_does_not_disrupt_table_finding(): 21 | pdf_page = fixture( 22 | "13_06_12_10_36_58_boletim_ingles_junho_2013.pdf").get_page(3) 23 | tables1 = pdftables.page_to_tables( 24 | pdf_page, 25 | ConfigParameters( 26 | atomise=True, 27 | extend_y=False)) 28 | tables2 = pdftables.page_to_tables( 29 | pdf_page, 30 | ConfigParameters( 31 | atomise=False, 32 | extend_y=False)) 33 | 34 | assert_equals(tables1.tables[0].data, tables2.tables[0].data) 35 | -------------------------------------------------------------------------------- /test/test_get_tables.py: -------------------------------------------------------------------------------- 1 | #!/usr/bin/env python 2 | # ScraperWiki Limited 3 | # Ian Hopkinson, 2013-06-17 4 | # -*- coding: utf-8 -*- 5 | 6 | """ 7 | getTablesTests 8 | """ 9 | 10 | import sys 11 | sys.path.append('code') 12 | 13 | from pdftables import page_to_tables # , TableDiagnosticData 14 | from pdftables.config_parameters import ConfigParameters 15 | 16 | from fixtures import fixture 
17 | 18 | from nose.tools import * 19 | 20 | 21 | def test_it_doesnt_find_tables_when_there_arent_any(): 22 | pdf_page = fixture( 23 | "13_06_12_10_36_58_boletim_ingles_junho_2013.pdf").get_page(4) 24 | tables = page_to_tables(pdf_page) 25 | 26 | assert_equals([], tables.tables[0].data) 27 | 28 | 29 | def test_it_copes_with_CONAB_p8(): 30 | pdf_page = fixture( 31 | "13_06_12_10_36_58_boletim_ingles_junho_2013.pdf").get_page(7) 32 | page_to_tables(pdf_page, ConfigParameters(atomise=True)) 33 | 34 | 35 | def test_it_can_use_hints_AlmondBoard_p1(): 36 | pdf_page = fixture("2012.01.PosRpt.pdf").get_page(0) 37 | tables = page_to_tables( 38 | pdf_page, 39 | ConfigParameters( 40 | atomise=False, 41 | table_top_hint=u"% Change", 42 | table_bottom_hint=u"Uncommited")) 43 | assert_equals( 44 | [[u'Salable', u'Million Lbs.', u'Kernel Wt.', u'Kernel Wt.', u'% Change'], 45 | [u'1. Carryin August 1, 2011', u'254.0', 46 | u'253,959,411', u'321,255,129', u'-20.95%'], 47 | [u'2. Crop Receipts to Date', u'1,950.0', 48 | u'1,914,471,575', u'1,548,685,417', u'23.62%'], 49 | [u'3. [3% Loss and Exempt]', u'58.5', 50 | u'57,434,147)(', u'46,460,563(', u')'], 51 | [u'4. New Crop Marketable (2-3)', u'1,891.5', 52 | u'1,857,037,428', u'1,502,224,854', u'23.62%'], 53 | [u'5. [Reserve]', u'n/a', u'0', u'0', u''], 54 | [u'6. Total Supply (1+4-5)Shipments by Handlers', 55 | u'2,145.5', u'2,110,996,839', u'1,823,479,983', u'15.77%'], 56 | [u'7. Domestic', u'555.0', 57 | u'265,796,698', u'255,785,794', u'3.91%'], 58 | [u'8. Export', u'1,295.0', 59 | u'755,447,255', u'664,175,807', u'13.74%'], 60 | [u'9. Total Shipments', u'1,850.0', 61 | u'1,021,243,953', u'919,961,601', u'11.01%'], 62 | [u'10. Forecasted Carryout', u'295.5', u'', u'', u''], 63 | [u'11. Computed Inventory (6-9)Commitments (sold, not delivered)**', 64 | u'', u'1,089,752,886', u'903,518,382', u'20.61%'], 65 | [u'12. Domestic', u'', u'214,522,238', u'187,492,263', u'14.42%'], [ 66 | u'13. 
Export', u'', u'226,349,446', u'155,042,764', u'45.99%'], 67 | [u'14. Total Commited Shipments', u'', 68 | u'440,871,684', u'342,535,027', u'28.71%'], 69 | [u'15. Uncommited Inventory (11-14)', u'', u'648,881,202', u'560,983,355', u'15.67%']], tables.tables[0].data) 70 | 71 | 72 | def test_it_can_use_one_hint_argentina_by_size(): 73 | pdf_page = fixture("argentina_diputados_voting_record.pdf").get_page(0) 74 | tables = page_to_tables( 75 | pdf_page, 76 | ConfigParameters( 77 | atomise=False, 78 | table_top_hint='Apellido')) 79 | #table1,_ = getTable(fh, 2) 80 | table1 = list(tables)[0].data 81 | assert_equals(32, len(table1)) 82 | assert_equals(4, len(table1[0])) 83 | 84 | 85 | def test_it_returns_the_AlmondBoard_p2_table_by_size(): 86 | pdf_page = fixture("2012.01.PosRpt.pdf").get_page(1) 87 | tables = page_to_tables(pdf_page, ConfigParameters(atomise=False)) 88 | table1 = list(tables)[0].data 89 | assert_equals(78, len(table1)) 90 | assert_equals(9, len(table1[0])) 91 | 92 | 93 | def test_the_atomise_option_works_on_coceral_p1_by_size(): 94 | pdf_page = fixture( 95 | "1359397366Final_Coceral grain estimate_2012_December.pdf").get_page(0) 96 | tables = page_to_tables(pdf_page, 97 | ConfigParameters( 98 | atomise=True)) 99 | table = list(tables)[0].data 100 | #table1, _ = getTable(fh, 2) 101 | assert_equals(43, len(table)) 102 | assert_equals(31, len(table[0])) 103 | 104 | 105 | def test_it_does_not_crash_on_m30_p5(): 106 | pdf_page = fixture("m30-JDent36s15-20.pdf").get_page(4) 107 | tables = page_to_tables(pdf_page) 108 | table = list(tables)[0].data 109 | assert len(table) > 0 110 | """Put this in for more aggressive test""" 111 | # assert_equals([u'5\n', u'0.75\n', u'0.84\n', u'0.92\n', u'0.94\n', u'evaluation of a novel liquid whitening gel containing 18%\n'], 112 | # table[4]) 113 | 114 | 115 | def test_it_returns_the_AlmondBoard_p4_table(): 116 | pdf_page = fixture("2012.01.PosRpt.pdf").get_page(3) 117 | tables = page_to_tables( 118 | pdf_page, 119 | 
ConfigParameters( 120 | atomise=False, 121 | extend_y=False)) 122 | assert_equals( 123 | [[u'Variety Name', u'Total Receipts', u'Total Receipts', u'Total Inedibles', u'Receipts', u'% Rejects'], 124 | [u'Aldrich', u'48,455,454', u'49,181,261', 125 | u'405,555', u'2.53%', u'0.82%'], 126 | [u'Avalon', u'7,920,179', u'8,032,382', 127 | u'91,733', u'0.41%', u'1.14%'], 128 | [u'Butte', u'151,830,761', u'150,799,510', 129 | u'1,054,567', u'7.93%', u'0.70%'], 130 | [u'Butte/Padre', u'215,114,812', u'218,784,885', 131 | u'1,145,000', u'11.24%', u'0.52%'], 132 | [u'Carmel', u'179,525,234', u'178,912,935', 133 | u'1,213,790', u'9.38%', u'0.68%'], 134 | [u'Carrion', u'507,833', u'358,580', 135 | u'2,693', u'0.03%', u'0.75%'], 136 | [u'Fritz', u'105,479,433', u'106,650,571', 137 | u'1,209,192', u'5.51%', u'1.13%'], 138 | [u'Harvey', u'58,755', u'58,755', 139 | u'1,416', u'0.00%', u'2.41%'], 140 | [u'Hashem', u'430,319', u'430,014', 141 | u'1,887', u'0.02%', u'0.44%'], 142 | [u'Le Grand', u'0', u'0', u'0', u'0.00%', u'0.00%'], 143 | [u'Livingston', u'7,985,535', u'7,926,910', 144 | u'186,238', u'0.42%', u'2.35%'], 145 | [u'Marchini', u'363,887', u'391,965', 146 | u'3,675', u'0.02%', u'0.94%'], 147 | [u'Merced', u'65,422', u'66,882', 148 | u'1,167', u'0.00%', u'1.74%'], 149 | [u'Mission', u'19,097,034', u'18,851,071', 150 | u'110,323', u'1.00%', u'0.59%'], 151 | [u'Mixed', u'36,358,011', u'36,926,337', 152 | u'952,264', u'1.90%', u'2.58%'], 153 | [u'Mono', u'757,637', u'689,552', 154 | u'6,785', u'0.04%', u'0.98%'], 155 | [u'Monterey', u'220,713,436', u'212,746,409', 156 | u'2,293,892', u'11.53%', u'1.08%'], 157 | [u'Morley', u'822,529', u'825,738', 158 | u'6,264', u'0.04%', u'0.76%'], 159 | [u'N43', u'156,488', u'85,832', u'340', u'0.01%', u'0.40%'], 160 | [u'Neplus', u'1,279,599', u'1,237,532', 161 | u'17,388', u'0.07%', u'1.41%'], 162 | [u'Nonpareil', u'741,809,844', u'727,286,104', 163 | u'5,121,465', u'38.75%', u'0.70%'], 164 | [u'Padre', u'62,905,358', u'62,417,565', 165 | 
u'193,168', u'3.29%', u'0.31%'], 166 | [u'Peerless', u'5,113,472', u'5,101,245', 167 | u'20,792', u'0.27%', u'0.41%'], 168 | [u'Price', u'25,312,529', u'25,124,463', 169 | u'143,983', u'1.32%', u'0.57%'], 170 | [u'Ruby', u'4,163,237', u'4,057,470', 171 | u'35,718', u'0.22%', u'0.88%'], 172 | [u'Sauret', u'55,864', u'55,864', u'517', u'0.00%', u'0.93%'], 173 | [u'Savana', u'389,317', u'390,585', 174 | u'2,049', u'0.02%', u'0.52%'], 175 | [u'Sonora', u'31,832,025', u'33,184,703', 176 | u'387,848', u'1.66%', u'1.17%'], 177 | [u'Thompson', u'491,026', u'487,926', 178 | u'8,382', u'0.03%', u'1.72%'], 179 | [u'Tokyo', u'783,494', u'794,699', 180 | u'4,511', u'0.04%', u'0.57%'], 181 | [u'Winters', u'5,780,183', u'5,756,167', 182 | u'46,211', u'0.30%', u'0.80%'], 183 | [u'Wood Colony', u'37,458,735', u'36,331,907', 184 | u'189,967', u'1.96%', u'0.52%'], 185 | [u'Major Varieties Sub Total:', u'1,913,017,442', 186 | u'1,893,945,819', u'14,858,780', u'99.92%', u'0.78%'], 187 | [u'Minor Varieties Total:', u'1,454,133', 188 | u'1,480,800', u'34,997', u'0.08%', u'2.36%'], 189 | [u'Grand Total All Varieties', u'1,914,471,575', u'1,895,426,619', u'14,893,777', u'100.00%', u'0.79%']], tables.tables[0].data 190 | ) 191 | -------------------------------------------------------------------------------- /test/test_ground.py: -------------------------------------------------------------------------------- 1 | from pdftables.pdf_document import PDFDocument 2 | from pdftables.pdftables import page_to_tables 3 | import lxml.etree 4 | from collections import Counter 5 | from nose.tools import assert_equals 6 | 7 | 8 | class ResultTable(object): 9 | def __sub__(self, other): 10 | r = ResultTable() 11 | r.cells = Counter(self.cells) # copy, so subtraction doesn't mutate self 12 | r.cells.subtract(other.cells) 13 | r.number_of_rows = self.number_of_rows - other.number_of_rows 14 | r.number_of_cols = self.number_of_cols - other.number_of_cols 15 | return r 16 | 17 | def __repr__(self): 18 | assert self.cells is not None 19 | response = "<ResultTable cols={col} rows={row} +{plus} -{minus}>" 20 | 
return response.format(col=self.number_of_cols, 21 | row=self.number_of_rows, 22 | plus=sum(self.cells[x] for x in self.cells if self.cells[x] >= 1), 23 | minus=abs(sum(self.cells[x] for x in self.cells if self.cells[x] <= -1))) 24 | 25 | 26 | def pdf_results(filename): 27 | def get_cells(table): 28 | cells = Counter() 29 | for row in table.data: 30 | for cell in row: 31 | cells.update([cell]) 32 | return cells 33 | 34 | #doc = PDFDocument.from_fileobj(open(filename, "rb")) 35 | doc = PDFDocument.from_path(filename) 36 | for page in doc.get_pages(): 37 | table_container = page_to_tables(page) 38 | builder = [] 39 | for table in table_container: 40 | r = ResultTable() 41 | r.cells = get_cells(table) 42 | r.number_of_rows = len(table.data) 43 | r.number_of_cols = max(len(row) for row in table.data) 44 | builder.append(r) 45 | return builder 46 | 47 | 48 | def xml_results(filename): 49 | def max_of_strs(strs): 50 | return max(map(int, strs)) 51 | root = lxml.etree.fromstring(open(filename, "rb").read()) 52 | builder = [] 53 | for table in root.xpath("//table"): 54 | r = ResultTable() 55 | r.cells = Counter(table.xpath("//content/text()")) 56 | cols = table.xpath("//@end-col") 57 | cols.extend(table.xpath("//@start-col")) 58 | rows = table.xpath("//@end-row") 59 | rows.extend(table.xpath("//@start-row")) 60 | r.number_of_cols = max_of_strs(cols) + 1 # starts at zero 61 | r.number_of_rows = max_of_strs(rows) + 1 # starts at zero 62 | builder.append(r) 63 | return builder 64 | 65 | 66 | 67 | def _test_ground(filebase, number): 68 | """tests whether we successfully parse ground truth data: 69 | see fixtures/eu-dataset""" 70 | pdf_tables = pdf_results(filebase % (number, ".pdf")) 71 | xml_tables = xml_results(filebase % (number, "-str.xml")) 72 | assert_equals(len(pdf_tables), len(xml_tables)) 73 | for i in range(0, len(pdf_tables)): 74 | pdf_table = pdf_tables[i] 75 | xml_table = xml_tables[i] 76 | diff = pdf_table - xml_table 77 | clean_diff_list = {x:diff.cells[x] for x 
in diff.cells if diff.cells[x] != 0} 78 | assert_equals(pdf_table.number_of_cols, xml_table.number_of_cols) 79 | assert_equals(pdf_table.number_of_rows, xml_table.number_of_rows) 80 | assert_equals(clean_diff_list, {}) 81 | 82 | def test_all_eu(): 83 | filebase = "fixtures/eu-dataset/eu-%03d%s" 84 | for i in range(1,35): # 1..34 85 | yield _test_ground, filebase, i 86 | 87 | -------------------------------------------------------------------------------- /test/test_linesegments.py: -------------------------------------------------------------------------------- 1 | import pdftables.line_segments as line_segments 2 | 3 | from nose.tools import assert_equals, raises 4 | 5 | from pdftables.line_segments import LineSegment 6 | 7 | 8 | def segments(segments): 9 | return [line_segments.LineSegment.make(a, b) for a, b in segments] 10 | 11 | 12 | def test_segments_generator(): 13 | seg1, seg2 = segs = segments([(1, 4), (2, 3)]) 14 | values = list(line_segments.segments_generator(segs)) 15 | assert_equals( 16 | [(1, seg1, False), 17 | (2, seg2, False), 18 | (3, seg2, True), 19 | (4, seg1, True)], 20 | values 21 | ) 22 | 23 | 24 | def test_histogram_segments(): 25 | segs = segments([(1, 4), (2, 3)]) 26 | values = list(line_segments.histogram_segments(segs)) 27 | assert_equals([((1, 2), 1), ((2, 3), 2), ((3, 4), 1)], values) 28 | 29 | 30 | def test_segment_histogram(): 31 | segs = segments([(1, 4), (2, 3)]) 32 | values = list(line_segments.segment_histogram(segs)) 33 | assert_equals([(1, 2, 3, 4), (1, 2, 1)], values) 34 | 35 | 36 | @raises(RuntimeError) 37 | def test_malformed_input_segments_generator(): 38 | segs = segments([(1, -1)]) 39 | list(line_segments.segments_generator(segs)) 40 | 41 | 42 | def test_hat_point_generator(): 43 | segs = segments([(1, 4), (2, 3)]) 44 | result = list(line_segments.hat_point_generator(segs)) 45 | 46 | x = 2.5 47 | expected = [(1, set()), 48 | (2, set([LineSegment(start=1, end=4, object=None)])), 49 | (x, set([LineSegment(start=1, end=4, 
object=None), 50 | LineSegment(start=2, end=3, object=None)])), 51 | (3, set([LineSegment(start=1, end=4, object=None)])), 52 | (4, set())] 53 | 54 | assert_equals(expected, result) 55 | 56 | 57 | def test_hat_generator(): 58 | segs = segments([(0, 4), (1, 3)]) 59 | result = list(line_segments.hat_generator(segs)) 60 | 61 | expected = [(0, 0), (1, 0.75), (2.0, 2.0), (3, 0.75), (4, 0)] 62 | 63 | assert_equals(expected, result) 64 | -------------------------------------------------------------------------------- /test/test_render_script.py: -------------------------------------------------------------------------------- 1 | from pdftables.scripts.render import main 2 | import sys 3 | import os 4 | import shutil 5 | import glob 6 | from nose.tools import with_setup 7 | 8 | # TODO(pwaller): Don't write test data to png/svg. 9 | 10 | PDF_FILE = 'fixtures/sample_data/unica_preco_recebido.pdf' 11 | 12 | def clean_output_directories(): 13 | shutil.rmtree('png', ignore_errors=True) 14 | shutil.rmtree('svg', ignore_errors=True) 15 | 16 | @with_setup(clean_output_directories) 17 | def test_png_output_directory_is_created(): 18 | assert not os.path.isdir('png') 19 | main([PDF_FILE]) 20 | assert os.path.isdir('png') 21 | 22 | @with_setup(clean_output_directories) 23 | def test_svg_output_directory_is_created(): 24 | assert not os.path.isdir('svg') 25 | main([PDF_FILE]) 26 | assert os.path.isdir('svg') 27 | 28 | @with_setup(clean_output_directories) 29 | def test_expected_png_output(): 30 | main([PDF_FILE]) 31 | assert os.path.isfile('png/unica_preco_recebido.pdf_00.png') 32 | 33 | @with_setup(clean_output_directories) 34 | def test_expected_svg_output(): 35 | main([PDF_FILE]) 36 | assert os.path.isfile('svg/unica_preco_recebido.pdf_00.svg') 37 | 38 | # Some of the fixture pdfs 39 | PDF_FILES = [ 40 | ('fixtures/sample_data/unica_preco_recebido.pdf', 1), 41 | ('fixtures/sample_data/m29-JDent36s2-7.pdf', 6), 42 | ('fixtures/sample_data/COPAMONTHLYMay2013.pdf', 1), 43 | 
('fixtures/sample_data/COPAWEEKLYJUNE52013.pdf', 1), 44 | ('fixtures/sample_data/tabla_subsidios.pdf', 1), 45 | ('fixtures/sample_data/AnimalExampleTables.pdf', 4), 46 | ('fixtures/sample_data/m30-JDent36s15-20.pdf', 6), 47 | ('fixtures/sample_data/commodity-prices_en.pdf', 1), 48 | ] 49 | 50 | @with_setup(clean_output_directories) 51 | def test_expected_number_of_pages(): 52 | for infile, expected_pages in PDF_FILES: 53 | main([infile]) 54 | 55 | actual_pages = len(glob.glob('svg/%s_*.svg' % os.path.basename(infile))) 56 | assert expected_pages == actual_pages 57 | 58 | actual_pages = len(glob.glob('png/%s_*.png' % os.path.basename(infile))) 59 | assert expected_pages == actual_pages 60 | --------------------------------------------------------------------------------
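
The page-range syntax accepted by pdftables-render (e.g. ``my.pdf:2,5-10,15-``, documented in pdftables/scripts/render.py) can be sketched standalone as follows. This is a simplified re-implementation of ``parse_page_ranges`` for illustration, not the module's own code: it keeps the same semantics (1-based input, open-ended ranges running to the last page, 0-based output) but omits the malformed-range error handling.

```python
def parse_page_ranges(range_string, npages):
    """Parse a 1-based page-range list such as "2,5-10,15-" into
    0-based page indices; an open end ("15-") runs to page npages."""
    pages = []
    for part in range_string.split(','):
        if '-' not in part:
            pages.append(int(part))  # single page, e.g. "2"
        else:
            start, end = part.split('-')
            # An empty right endpoint means "through the last page".
            # Ranges are inclusive, hence the +1.
            pages.extend(range(int(start), (int(end) if end else npages) + 1))
    return [p - 1 for p in pages]  # convert to 0-based indices
```

For ``"2,5-10,15-"`` on a 20-page document this yields the zero-based indices 1, 4–9 and 14–19, matching what the script passes to its page filter.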