├── .gitignore
├── .travis.yml
├── LICENCE
├── README.rst
├── dev_requirements.txt
├── docs
│   ├── .gitignore
│   ├── Makefile
│   ├── make.bat
│   └── source
│       ├── _theme
│       │   └── armstrong
│       │       ├── LICENSE
│       │       ├── README.rst
│       │       ├── layout.html
│       │       ├── rtd-themes.conf
│       │       ├── static
│       │       │   └── rtd.css_t
│       │       └── theme.conf
│       ├── conf.py
│       └── index.rst
├── download_test_data.sh
├── pdftables
│   ├── __init__.py
│   ├── boxes.py
│   ├── config_parameters.py
│   ├── counter.py
│   ├── diagnostics.py
│   ├── display.py
│   ├── line_segments.py
│   ├── numpy_subset.py
│   ├── patched_poppler.py
│   ├── pdf_document.py
│   ├── pdf_document_pdfminer.py
│   ├── pdf_document_poppler.py
│   ├── pdftables.py
│   └── scripts
│       ├── __init__.py
│       └── render.py
├── render_all.sh
├── requirements.txt
├── setup.py
└── test
    ├── fixtures.py
    ├── test_Table_class.py
    ├── test_all_sample_data.py
    ├── test_box.py
    ├── test_contains_tables.py
    ├── test_finds_tables.py
    ├── test_get_tables.py
    ├── test_ground.py
    ├── test_linesegments.py
    └── test_render_script.py
/.gitignore:
--------------------------------------------------------------------------------
1 | *.pyc
2 | fixtures/
3 | png/
4 | svg/
5 | .*.swp
6 | /*.egg-info
7 |
--------------------------------------------------------------------------------
/.travis.yml:
--------------------------------------------------------------------------------
1 | language: python
2 | python:
3 | - "2.7"
4 | virtualenv:
5 |   system_site_packages: true
6 | before_install:
7 | - export PIP_USE_MIRRORS=true
8 | - sudo apt-get update
9 | - sudo apt-get install -qq python-poppler
10 | install:
11 | - pip install -e .
12 | - pip install -r requirements.txt
13 | - pip install coveralls
14 | - ./download_test_data.sh
15 | script: nosetests --with-coverage --cover-package=pdftables
16 | after_success:
17 | - coveralls
18 |
--------------------------------------------------------------------------------
/LICENCE:
--------------------------------------------------------------------------------
1 | Copyright (c) 2013, ScraperWiki Limited
2 | All rights reserved.
3 |
4 | Redistribution and use in source and binary forms, with or without modification, are permitted provided that the following conditions are met:
5 |
6 | Redistributions of source code must retain the above copyright notice, this list of conditions and the following disclaimer.
7 | Redistributions in binary form must reproduce the above copyright notice, this list of conditions and the following disclaimer in the documentation and/or other materials provided with the distribution.
8 | THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT HOLDER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
9 |
--------------------------------------------------------------------------------
/README.rst:
--------------------------------------------------------------------------------
1 | .. -*- mode: rst -*-
2 |
3 | pdftables - a library for extracting tables from PDF files
4 | ==========================================================
5 |
6 | .. image:: https://travis-ci.org/scraperwiki/pdftables.png
7 | :target: https://travis-ci.org/scraperwiki/pdftables
8 | .. image:: https://pypip.in/v/pdftables/badge.png
9 | :target: https://pypi.python.org/pypi/pdftables
10 |
11 | **pdftables is no longer maintained**. Development continued commercially at `pdftables.com `_.
12 |
13 | ..
14 |
15 | `This Readme, and more, is available on ReadTheDocs. `_
16 |
17 | `This post `_
18 | on the ScraperWiki blog describes the algorithms used in pdftables, and
19 | something of its genesis. This README gives more technical information.
20 |
21 | pdftables uses `pdfminer `_ to get information on the locations of text
22 | elements in a PDF document. pdfminer was chosen as a base because it provides
23 | information on the full range of page elements in PDF files, including
24 | graphical elements such as lines. Although the algorithms currently used do not
25 | use these elements they are planned for future work. As a purely Python library,
26 | pdfminer is very portable. The downside of pdfminer is that it is slow, perhaps
27 | an order of magnitude slower than alternative C based libraries.
28 |
29 | Installation
30 | ============
31 |
32 | You need poppler and Cairo. On Ubuntu and friends you can run:
33 |
34 | .. code:: bash
35 |
36 | sudo apt-get -y install python-poppler python-cairo
37 |
38 | Then we can install the ``pip``-able requirements from the ``requirements.txt`` file:
39 |
40 | .. code:: bash
41 |
42 | pip install -r requirements.txt
43 |
44 | Usage
45 | =====
46 |
47 | First we get a file object to a PDF:
48 |
49 | .. code:: python
50 |
51 | filepath = 'example.pdf'
52 | fileobj = open(filepath,'rb')
53 |
54 | Then we create a PDF element from the file object:
55 |
56 | .. code:: python
57 |
58 | from pdftables.pdf_document import PDFDocument
59 | doc = PDFDocument.from_fileobj(fileobj)
60 |
61 | Then we use the ``get_page()`` method to select a single page from the document:
62 |
63 | .. code:: python
64 |
65 | from pdftables.pdftables import page_to_tables
66 | page = doc.get_page(pagenumber)
67 | tables = page_to_tables(page)
68 |
69 | You can also loop over all pages in the PDF using ``get_pages()``:
70 |
71 | .. code:: python
72 |
73 | from pdftables.pdftables import page_to_tables
74 | for page_number, page in enumerate(doc.get_pages()):
75 |     tables = page_to_tables(page)
76 |
77 | Now that you have a TableContainer object, you can convert it to ASCII for quick previewing:
78 |
79 | .. code:: python
80 |
81 | from pdftables.display import to_string
82 | for table in tables:
83 |     print to_string(table.data)
84 |
85 | ``table.data`` is a table that has been found, in the form of a list of lists of strings
86 | (i.e. a list of rows, each containing the same number of cells).
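
Because ``table.data`` is a plain list of rows, it can be handed to standard tooling without further conversion. The following is a minimal sketch, not part of pdftables: ``table_to_csv`` and ``sample_data`` are hypothetical names used only for illustration, and the sketch assumes Python 3 rather than the Python 2.7 that the library itself targets.

```python
import csv
import io


def table_to_csv(rows):
    """Serialise a list of rows (each a list of strings) to CSV text."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerows(rows)
    return buffer.getvalue()


# Hypothetical stand-in for ``table.data``; real values would come from
# page_to_tables(), which is not called here.
sample_data = [
    ["Name", "Count"],
    ["apples", "3"],
    ["pears, ripe", "12"],
]

csv_text = table_to_csv(sample_data)
print(csv_text)
```

Cells containing commas are quoted by the ``csv`` module, so the row structure survives a round trip.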
87 |
88 | Command line tool
89 | =================
90 |
91 | pdftables includes a command line tool for diagnostic rendering of pages and tables, called ``pdftables-render``.
92 | This is installed if you ``pip install`` pdftables, or if you manually run ``python setup.py install``.
93 |
94 | .. code:: bash
95 |
96 | $ pdftables-render example.pdf
97 |
98 | This creates separate PNG and SVG files for each page of the specified PDF, in ``png/`` and ``svg/``, with three diagnostic displays per page.
99 |
100 | Developing pdftables
101 | ====================
102 |
103 | Files and folders::
104 |
105 | .
106 | |-fixtures
107 | | |-sample_data
108 | |-pdftables
109 | |-test
110 |
111 | *fixtures* contains test fixtures, in particular the sample_data directory
112 | contains PDF files which are installed from a different repository by running
113 | the ``download_test_data.sh`` script.
114 |
115 | We're also using data from http://www.tamirhassan.com/competition/dataset-tools.html which is also installed by the download script.
116 |
117 | *pdftables* contains the core code files
118 |
119 | *test* contains tests
120 |
121 | **pdftables.py** - this is the core of the pdftables library
122 |
123 | **counter.py** - implements collections.Counter for the benefit of Python 2.6
124 |
125 | **display.py** - prettily prints a table by implementing the ``to_string`` function
126 |
127 | **numpy_subset.py** - partially implements ``numpy.diff``, ``numpy.arange`` and ``numpy.average`` to avoid a large dependency on numpy.
128 |
129 | **pdf_document.py** - implements PDFDocument to abstract away the underlying PDF class, and ease any conversion to a different underlying PDF library to replace PDFminer
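
To illustrate the idea behind ``numpy_subset.py``, here is a rough sketch of how ``diff``, ``arange`` and ``average`` might be written in pure Python for one-dimensional input. It is an assumption about the general approach, not a copy of the module's actual contents, and it does not reproduce numpy's array types, ``axis`` handling or weights.

```python
import math


def diff(values):
    """Differences between consecutive elements, like numpy.diff on a 1-D list."""
    return [b - a for a, b in zip(values, values[1:])]


def arange(start, stop, step=1):
    """Evenly spaced values in the half-open interval [start, stop)."""
    if step == 0:
        raise ValueError("step must not be zero")
    count = max(0, int(math.ceil((stop - start) / float(step))))
    return [start + i * step for i in range(count)]


def average(values):
    """Unweighted arithmetic mean of a non-empty sequence."""
    values = list(values)
    if not values:
        raise ValueError("average of an empty sequence is undefined")
    return sum(values) / float(len(values))


print(diff([1, 4, 9, 16]))
print(arange(0, 5, 2))
print(average([1, 2, 3, 4]))
```

Returning plain lists keeps the helpers dependency-free, at the cost of numpy's vectorised speed.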
130 |
131 |
132 |
133 |
--------------------------------------------------------------------------------
/dev_requirements.txt:
--------------------------------------------------------------------------------
1 | Pygments
2 |
--------------------------------------------------------------------------------
/docs/.gitignore:
--------------------------------------------------------------------------------
1 | /build/
2 |
--------------------------------------------------------------------------------
/docs/Makefile:
--------------------------------------------------------------------------------
1 | # Makefile for Sphinx documentation
2 | #
3 |
4 | # You can set these variables from the command line.
5 | SPHINXOPTS =
6 | SPHINXBUILD = sphinx-build
7 | PAPER =
8 | BUILDDIR = build
9 |
10 | # Internal variables.
11 | PAPEROPT_a4 = -D latex_paper_size=a4
12 | PAPEROPT_letter = -D latex_paper_size=letter
13 | ALLSPHINXOPTS = -d $(BUILDDIR)/doctrees $(PAPEROPT_$(PAPER)) $(SPHINXOPTS) source
14 | # the i18n builder cannot share the environment and doctrees with the others
15 | I18NSPHINXOPTS = $(PAPEROPT_$(PAPER)) $(SPHINXOPTS) source
16 |
17 | .PHONY: help clean html dirhtml singlehtml pickle json htmlhelp qthelp devhelp epub latex latexpdf text man changes linkcheck doctest gettext
18 |
19 | help:
20 | @echo "Please use \`make <target>' where <target> is one of"
21 | @echo " html to make standalone HTML files"
22 | @echo " dirhtml to make HTML files named index.html in directories"
23 | @echo " singlehtml to make a single large HTML file"
24 | @echo " pickle to make pickle files"
25 | @echo " json to make JSON files"
26 | @echo " htmlhelp to make HTML files and a HTML help project"
27 | @echo " qthelp to make HTML files and a qthelp project"
28 | @echo " devhelp to make HTML files and a Devhelp project"
29 | @echo " epub to make an epub"
30 | @echo " latex to make LaTeX files, you can set PAPER=a4 or PAPER=letter"
31 | @echo " latexpdf to make LaTeX files and run them through pdflatex"
32 | @echo " text to make text files"
33 | @echo " man to make manual pages"
34 | @echo " texinfo to make Texinfo files"
35 | @echo " info to make Texinfo files and run them through makeinfo"
36 | @echo " gettext to make PO message catalogs"
37 | @echo " changes to make an overview of all changed/added/deprecated items"
38 | @echo " linkcheck to check all external links for integrity"
39 | @echo " doctest to run all doctests embedded in the documentation (if enabled)"
40 |
41 | clean:
42 | -rm -rf $(BUILDDIR)/*
43 |
44 | html:
45 | $(SPHINXBUILD) -b html $(ALLSPHINXOPTS) $(BUILDDIR)/html
46 | @echo
47 | @echo "Build finished. The HTML pages are in $(BUILDDIR)/html."
48 |
49 | dirhtml:
50 | $(SPHINXBUILD) -b dirhtml $(ALLSPHINXOPTS) $(BUILDDIR)/dirhtml
51 | @echo
52 | @echo "Build finished. The HTML pages are in $(BUILDDIR)/dirhtml."
53 |
54 | singlehtml:
55 | $(SPHINXBUILD) -b singlehtml $(ALLSPHINXOPTS) $(BUILDDIR)/singlehtml
56 | @echo
57 | @echo "Build finished. The HTML page is in $(BUILDDIR)/singlehtml."
58 |
59 | pickle:
60 | $(SPHINXBUILD) -b pickle $(ALLSPHINXOPTS) $(BUILDDIR)/pickle
61 | @echo
62 | @echo "Build finished; now you can process the pickle files."
63 |
64 | json:
65 | $(SPHINXBUILD) -b json $(ALLSPHINXOPTS) $(BUILDDIR)/json
66 | @echo
67 | @echo "Build finished; now you can process the JSON files."
68 |
69 | htmlhelp:
70 | $(SPHINXBUILD) -b htmlhelp $(ALLSPHINXOPTS) $(BUILDDIR)/htmlhelp
71 | @echo
72 | @echo "Build finished; now you can run HTML Help Workshop with the" \
73 | ".hhp project file in $(BUILDDIR)/htmlhelp."
74 |
75 | qthelp:
76 | $(SPHINXBUILD) -b qthelp $(ALLSPHINXOPTS) $(BUILDDIR)/qthelp
77 | @echo
78 | @echo "Build finished; now you can run "qcollectiongenerator" with the" \
79 | ".qhcp project file in $(BUILDDIR)/qthelp, like this:"
80 | @echo "# qcollectiongenerator $(BUILDDIR)/qthelp/pdftables.qhcp"
81 | @echo "To view the help file:"
82 | @echo "# assistant -collectionFile $(BUILDDIR)/qthelp/pdftables.qhc"
83 |
84 | devhelp:
85 | $(SPHINXBUILD) -b devhelp $(ALLSPHINXOPTS) $(BUILDDIR)/devhelp
86 | @echo
87 | @echo "Build finished."
88 | @echo "To view the help file:"
89 | @echo "# mkdir -p $$HOME/.local/share/devhelp/pdftables"
90 | @echo "# ln -s $(BUILDDIR)/devhelp $$HOME/.local/share/devhelp/pdftables"
91 | @echo "# devhelp"
92 |
93 | epub:
94 | $(SPHINXBUILD) -b epub $(ALLSPHINXOPTS) $(BUILDDIR)/epub
95 | @echo
96 | @echo "Build finished. The epub file is in $(BUILDDIR)/epub."
97 |
98 | latex:
99 | $(SPHINXBUILD) -b latex $(ALLSPHINXOPTS) $(BUILDDIR)/latex
100 | @echo
101 | @echo "Build finished; the LaTeX files are in $(BUILDDIR)/latex."
102 | @echo "Run \`make' in that directory to run these through (pdf)latex" \
103 | "(use \`make latexpdf' here to do that automatically)."
104 |
105 | latexpdf:
106 | $(SPHINXBUILD) -b latex $(ALLSPHINXOPTS) $(BUILDDIR)/latex
107 | @echo "Running LaTeX files through pdflatex..."
108 | $(MAKE) -C $(BUILDDIR)/latex all-pdf
109 | @echo "pdflatex finished; the PDF files are in $(BUILDDIR)/latex."
110 |
111 | text:
112 | $(SPHINXBUILD) -b text $(ALLSPHINXOPTS) $(BUILDDIR)/text
113 | @echo
114 | @echo "Build finished. The text files are in $(BUILDDIR)/text."
115 |
116 | man:
117 | $(SPHINXBUILD) -b man $(ALLSPHINXOPTS) $(BUILDDIR)/man
118 | @echo
119 | @echo "Build finished. The manual pages are in $(BUILDDIR)/man."
120 |
121 | texinfo:
122 | $(SPHINXBUILD) -b texinfo $(ALLSPHINXOPTS) $(BUILDDIR)/texinfo
123 | @echo
124 | @echo "Build finished. The Texinfo files are in $(BUILDDIR)/texinfo."
125 | @echo "Run \`make' in that directory to run these through makeinfo" \
126 | "(use \`make info' here to do that automatically)."
127 |
128 | info:
129 | $(SPHINXBUILD) -b texinfo $(ALLSPHINXOPTS) $(BUILDDIR)/texinfo
130 | @echo "Running Texinfo files through makeinfo..."
131 | make -C $(BUILDDIR)/texinfo info
132 | @echo "makeinfo finished; the Info files are in $(BUILDDIR)/texinfo."
133 |
134 | gettext:
135 | $(SPHINXBUILD) -b gettext $(I18NSPHINXOPTS) $(BUILDDIR)/locale
136 | @echo
137 | @echo "Build finished. The message catalogs are in $(BUILDDIR)/locale."
138 |
139 | changes:
140 | $(SPHINXBUILD) -b changes $(ALLSPHINXOPTS) $(BUILDDIR)/changes
141 | @echo
142 | @echo "The overview file is in $(BUILDDIR)/changes."
143 |
144 | linkcheck:
145 | $(SPHINXBUILD) -b linkcheck $(ALLSPHINXOPTS) $(BUILDDIR)/linkcheck
146 | @echo
147 | @echo "Link check complete; look for any errors in the above output " \
148 | "or in $(BUILDDIR)/linkcheck/output.txt."
149 |
150 | doctest:
151 | $(SPHINXBUILD) -b doctest $(ALLSPHINXOPTS) $(BUILDDIR)/doctest
152 | @echo "Testing of doctests in the sources finished, look at the " \
153 | "results in $(BUILDDIR)/doctest/output.txt."
154 |
--------------------------------------------------------------------------------
/docs/make.bat:
--------------------------------------------------------------------------------
1 | @ECHO OFF
2 |
3 | REM Command file for Sphinx documentation
4 |
5 | if "%SPHINXBUILD%" == "" (
6 | set SPHINXBUILD=sphinx-build
7 | )
8 | set BUILDDIR=build
9 | set ALLSPHINXOPTS=-d %BUILDDIR%/doctrees %SPHINXOPTS% source
10 | set I18NSPHINXOPTS=%SPHINXOPTS% source
11 | if NOT "%PAPER%" == "" (
12 | set ALLSPHINXOPTS=-D latex_paper_size=%PAPER% %ALLSPHINXOPTS%
13 | set I18NSPHINXOPTS=-D latex_paper_size=%PAPER% %I18NSPHINXOPTS%
14 | )
15 |
16 | if "%1" == "" goto help
17 |
18 | if "%1" == "help" (
19 | :help
20 | echo.Please use `make ^<target^>` where ^<target^> is one of
21 | echo. html to make standalone HTML files
22 | echo. dirhtml to make HTML files named index.html in directories
23 | echo. singlehtml to make a single large HTML file
24 | echo. pickle to make pickle files
25 | echo. json to make JSON files
26 | echo. htmlhelp to make HTML files and a HTML help project
27 | echo. qthelp to make HTML files and a qthelp project
28 | echo. devhelp to make HTML files and a Devhelp project
29 | echo. epub to make an epub
30 | echo. latex to make LaTeX files, you can set PAPER=a4 or PAPER=letter
31 | echo. text to make text files
32 | echo. man to make manual pages
33 | echo. texinfo to make Texinfo files
34 | echo. gettext to make PO message catalogs
35 | echo. changes to make an overview over all changed/added/deprecated items
36 | echo. linkcheck to check all external links for integrity
37 | echo. doctest to run all doctests embedded in the documentation if enabled
38 | goto end
39 | )
40 |
41 | if "%1" == "clean" (
42 | for /d %%i in (%BUILDDIR%\*) do rmdir /q /s %%i
43 | del /q /s %BUILDDIR%\*
44 | goto end
45 | )
46 |
47 | if "%1" == "html" (
48 | %SPHINXBUILD% -b html %ALLSPHINXOPTS% %BUILDDIR%/html
49 | if errorlevel 1 exit /b 1
50 | echo.
51 | echo.Build finished. The HTML pages are in %BUILDDIR%/html.
52 | goto end
53 | )
54 |
55 | if "%1" == "dirhtml" (
56 | %SPHINXBUILD% -b dirhtml %ALLSPHINXOPTS% %BUILDDIR%/dirhtml
57 | if errorlevel 1 exit /b 1
58 | echo.
59 | echo.Build finished. The HTML pages are in %BUILDDIR%/dirhtml.
60 | goto end
61 | )
62 |
63 | if "%1" == "singlehtml" (
64 | %SPHINXBUILD% -b singlehtml %ALLSPHINXOPTS% %BUILDDIR%/singlehtml
65 | if errorlevel 1 exit /b 1
66 | echo.
67 | echo.Build finished. The HTML pages are in %BUILDDIR%/singlehtml.
68 | goto end
69 | )
70 |
71 | if "%1" == "pickle" (
72 | %SPHINXBUILD% -b pickle %ALLSPHINXOPTS% %BUILDDIR%/pickle
73 | if errorlevel 1 exit /b 1
74 | echo.
75 | echo.Build finished; now you can process the pickle files.
76 | goto end
77 | )
78 |
79 | if "%1" == "json" (
80 | %SPHINXBUILD% -b json %ALLSPHINXOPTS% %BUILDDIR%/json
81 | if errorlevel 1 exit /b 1
82 | echo.
83 | echo.Build finished; now you can process the JSON files.
84 | goto end
85 | )
86 |
87 | if "%1" == "htmlhelp" (
88 | %SPHINXBUILD% -b htmlhelp %ALLSPHINXOPTS% %BUILDDIR%/htmlhelp
89 | if errorlevel 1 exit /b 1
90 | echo.
91 | echo.Build finished; now you can run HTML Help Workshop with the ^
92 | .hhp project file in %BUILDDIR%/htmlhelp.
93 | goto end
94 | )
95 |
96 | if "%1" == "qthelp" (
97 | %SPHINXBUILD% -b qthelp %ALLSPHINXOPTS% %BUILDDIR%/qthelp
98 | if errorlevel 1 exit /b 1
99 | echo.
100 | echo.Build finished; now you can run "qcollectiongenerator" with the ^
101 | .qhcp project file in %BUILDDIR%/qthelp, like this:
102 | echo.^> qcollectiongenerator %BUILDDIR%\qthelp\pdftables.qhcp
103 | echo.To view the help file:
104 | echo.^> assistant -collectionFile %BUILDDIR%\qthelp\pdftables.qhc
105 | goto end
106 | )
107 |
108 | if "%1" == "devhelp" (
109 | %SPHINXBUILD% -b devhelp %ALLSPHINXOPTS% %BUILDDIR%/devhelp
110 | if errorlevel 1 exit /b 1
111 | echo.
112 | echo.Build finished.
113 | goto end
114 | )
115 |
116 | if "%1" == "epub" (
117 | %SPHINXBUILD% -b epub %ALLSPHINXOPTS% %BUILDDIR%/epub
118 | if errorlevel 1 exit /b 1
119 | echo.
120 | echo.Build finished. The epub file is in %BUILDDIR%/epub.
121 | goto end
122 | )
123 |
124 | if "%1" == "latex" (
125 | %SPHINXBUILD% -b latex %ALLSPHINXOPTS% %BUILDDIR%/latex
126 | if errorlevel 1 exit /b 1
127 | echo.
128 | echo.Build finished; the LaTeX files are in %BUILDDIR%/latex.
129 | goto end
130 | )
131 |
132 | if "%1" == "text" (
133 | %SPHINXBUILD% -b text %ALLSPHINXOPTS% %BUILDDIR%/text
134 | if errorlevel 1 exit /b 1
135 | echo.
136 | echo.Build finished. The text files are in %BUILDDIR%/text.
137 | goto end
138 | )
139 |
140 | if "%1" == "man" (
141 | %SPHINXBUILD% -b man %ALLSPHINXOPTS% %BUILDDIR%/man
142 | if errorlevel 1 exit /b 1
143 | echo.
144 | echo.Build finished. The manual pages are in %BUILDDIR%/man.
145 | goto end
146 | )
147 |
148 | if "%1" == "texinfo" (
149 | %SPHINXBUILD% -b texinfo %ALLSPHINXOPTS% %BUILDDIR%/texinfo
150 | if errorlevel 1 exit /b 1
151 | echo.
152 | echo.Build finished. The Texinfo files are in %BUILDDIR%/texinfo.
153 | goto end
154 | )
155 |
156 | if "%1" == "gettext" (
157 | %SPHINXBUILD% -b gettext %I18NSPHINXOPTS% %BUILDDIR%/locale
158 | if errorlevel 1 exit /b 1
159 | echo.
160 | echo.Build finished. The message catalogs are in %BUILDDIR%/locale.
161 | goto end
162 | )
163 |
164 | if "%1" == "changes" (
165 | %SPHINXBUILD% -b changes %ALLSPHINXOPTS% %BUILDDIR%/changes
166 | if errorlevel 1 exit /b 1
167 | echo.
168 | echo.The overview file is in %BUILDDIR%/changes.
169 | goto end
170 | )
171 |
172 | if "%1" == "linkcheck" (
173 | %SPHINXBUILD% -b linkcheck %ALLSPHINXOPTS% %BUILDDIR%/linkcheck
174 | if errorlevel 1 exit /b 1
175 | echo.
176 | echo.Link check complete; look for any errors in the above output ^
177 | or in %BUILDDIR%/linkcheck/output.txt.
178 | goto end
179 | )
180 |
181 | if "%1" == "doctest" (
182 | %SPHINXBUILD% -b doctest %ALLSPHINXOPTS% %BUILDDIR%/doctest
183 | if errorlevel 1 exit /b 1
184 | echo.
185 | echo.Testing of doctests in the sources finished, look at the ^
186 | results in %BUILDDIR%/doctest/output.txt.
187 | goto end
188 | )
189 |
190 | :end
191 |
--------------------------------------------------------------------------------
/docs/source/_theme/armstrong/LICENSE:
--------------------------------------------------------------------------------
1 | Copyright (c) 2011 Bay Citizen & Texas Tribune
2 |
3 | Original ReadTheDocs.org code
4 | Copyright (c) 2010 Charles Leifer, Eric Holscher, Bobby Grace
5 |
6 | Permission is hereby granted, free of charge, to any person
7 | obtaining a copy of this software and associated documentation
8 | files (the "Software"), to deal in the Software without
9 | restriction, including without limitation the rights to use,
10 | copy, modify, merge, publish, distribute, sublicense, and/or sell
11 | copies of the Software, and to permit persons to whom the
12 | Software is furnished to do so, subject to the following
13 | conditions:
14 |
15 | The above copyright notice and this permission notice shall be
16 | included in all copies or substantial portions of the Software.
17 |
18 | THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND,
19 | EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES
20 | OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND
21 | NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT
22 | HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY,
23 | WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING
24 | FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR
25 | OTHER DEALINGS IN THE SOFTWARE.
26 |
27 |
--------------------------------------------------------------------------------
/docs/source/_theme/armstrong/README.rst:
--------------------------------------------------------------------------------
1 | Armstrong Sphinx Theme
2 | ======================
3 | Sphinx theme for Armstrong documentation
4 |
5 |
6 | Usage
7 | -----
8 | Symlink this repository into your documentation at ``docs/_themes/armstrong``
9 | then add the following two settings to your Sphinx ``conf.py`` file::
10 |
11 | html_theme = "armstrong"
12 | html_theme_path = ["_themes", ]
13 |
14 | You can also change colors and such by adjusting the ``html_theme_options``
15 | dictionary. For a list of all settings, see ``theme.conf``.
16 |
17 |
18 | Defaults
19 | --------
20 | This repository has been customized for Armstrong documentation, but you can
21 | use the original default color scheme on your project by copying the
22 | ``rtd-theme.conf`` over the existing ``theme.conf``.
23 |
24 |
25 | Contributing
26 | ------------
27 |
28 | * Create something awesome -- make the code better, add some functionality,
29 | whatever (this is the hardest part).
30 | * `Fork it`_
31 | * Create a topic branch to house your changes
32 | * Get all of your commits in the new topic branch
33 | * Submit a `pull request`_
34 |
35 | .. _Fork it: http://help.github.com/forking/
36 | .. _pull request: http://help.github.com/pull-requests/
37 |
38 |
39 | State of Project
40 | ----------------
41 | Armstrong is an open-source news platform that is freely available to any
42 | organization. It is the result of a collaboration between the `Texas Tribune`_
43 | and `Bay Citizen`_, and a grant from the `John S. and James L. Knight
44 | Foundation`_. The first stable release is scheduled for September, 2011.
45 |
46 | To follow development, be sure to join the `Google Group`_.
47 |
48 | ``armstrong_sphinx`` is part of the `Armstrong`_ project. Unless you're
49 | looking for a Sphinx theme, you're probably looking for the main project.
50 |
51 | .. _Armstrong: http://www.armstrongcms.org/
52 | .. _Bay Citizen: http://www.baycitizen.org/
53 | .. _John S. and James L. Knight Foundation: http://www.knightfoundation.org/
54 | .. _Texas Tribune: http://www.texastribune.org/
55 | .. _Google Group: http://groups.google.com/group/armstrongcms
56 |
57 |
58 | Credit
59 | ------
60 | This theme is based on the excellent `Read the Docs`_ theme. The original
61 | can be found in the `readthedocs.org`_ repository on GitHub.
62 |
63 | .. _Read the Docs: http://readthedocs.org/
64 | .. _readthedocs.org: https://github.com/rtfd/readthedocs.org
65 |
66 |
67 | License
68 | -------
69 | Like the original RTD code, this code is licensed under a BSD license. See the
70 | associated ``LICENSE`` file for more information.
71 |
--------------------------------------------------------------------------------
/docs/source/_theme/armstrong/layout.html:
--------------------------------------------------------------------------------
1 | {% extends "basic/layout.html" %}
2 |
3 | {% set script_files = script_files + [pathto("_static/searchtools.js", 1)] %}
4 |
5 | {% block htmltitle %}
6 | {{ super() }}
7 |
8 |
9 |
10 | {% endblock %}
11 |
12 | {% block footer %}
13 |
31 |
32 |
33 | {% if theme_analytics_code %}
34 |
35 |
46 | {% endif %}
47 |
48 | {% endblock %}
49 |
--------------------------------------------------------------------------------
/docs/source/_theme/armstrong/rtd-themes.conf:
--------------------------------------------------------------------------------
1 | [theme]
2 | inherit = default
3 | stylesheet = rtd.css
4 | pygment_style = default
5 | show_sphinx = False
6 |
7 | [options]
8 | show_rtd = True
9 |
10 | white = #ffffff
11 | almost_white = #f8f8f8
12 | barely_white = #f2f2f2
13 | dirty_white = #eeeeee
14 | almost_dirty_white = #e6e6e6
15 | dirtier_white = #dddddd
16 | lighter_gray = #cccccc
17 | gray_a = #aaaaaa
18 | gray_9 = #999999
19 | light_gray = #888888
20 | gray_7 = #777777
21 | gray = #666666
22 | dark_gray = #444444
23 | gray_2 = #222222
24 | black = #111111
25 | light_color = #e8ecef
26 | light_medium_color = #DDEAF0
27 | medium_color = #8ca1af
28 | medium_color_link = #86989b
29 | medium_color_link_hover = #a6b8bb
30 | dark_color = #465158
31 |
32 | h1 = #000000
33 | h2 = #465158
34 | h3 = #6c818f
35 |
36 | link_color = #444444
37 | link_color_decoration = #CCCCCC
38 |
39 | medium_color_hover = #697983
40 | green_highlight = #8ecc4c
41 |
42 |
43 | positive_dark = #609060
44 | positive_medium = #70a070
45 | positive_light = #e9ffe9
46 |
47 | negative_dark = #900000
48 | negative_medium = #b04040
49 | negative_light = #ffe9e9
50 | negative_text = #c60f0f
51 |
52 | ruler = #abc
53 |
54 | viewcode_bg = #f4debf
55 | viewcode_border = #ac9
56 |
57 | highlight = #ffe080
58 |
59 | code_background = #eeeeee
60 |
61 | background = #465158
62 | background_link = #ffffff
63 | background_link_half = #ffffff
64 | background_text = #eeeeee
65 | background_text_link = #86989b
66 |
--------------------------------------------------------------------------------
/docs/source/_theme/armstrong/static/rtd.css_t:
--------------------------------------------------------------------------------
1 | /*
2 | * rtd.css
3 | * ~~~~~~~~~~~~~~~
4 | *
5 | * Sphinx stylesheet -- sphinxdoc theme. Originally created by
6 | * Armin Ronacher for Werkzeug.
7 | *
8 | * Customized for ReadTheDocs by Eric Pierce & Eric Holscher
9 | *
10 | * :copyright: Copyright 2007-2010 by the Sphinx team, see AUTHORS.
11 | * :license: BSD, see LICENSE for details.
12 | *
13 | */
14 |
15 | /* RTD colors
16 | * light blue: {{ theme_light_color }}
17 | * medium blue: {{ theme_medium_color }}
18 | * dark blue: {{ theme_dark_color }}
19 | * dark grey: {{ theme_grey_color }}
20 | *
21 | * medium blue hover: {{ theme_medium_color_hover }};
22 | * green highlight: {{ theme_green_highlight }}
23 | * light blue (project bar): {{ theme_light_color }}
24 | */
25 |
26 | @import url("basic.css");
27 |
28 | /* PAGE LAYOUT -------------------------------------------------------------- */
29 |
30 | body {
31 | font: 100%/1.5 "ff-meta-web-pro-1","ff-meta-web-pro-2",Arial,"Helvetica Neue",sans-serif;
32 | text-align: center;
33 | color: black;
34 | background-color: {{ theme_background }};
35 | padding: 0;
36 | margin: 0;
37 | }
38 |
39 | div.document {
40 | text-align: left;
41 | background-color: {{ theme_light_color }};
42 | }
43 |
44 | div.bodywrapper {
45 | background-color: {{ theme_white }};
46 | border-left: 1px solid {{ theme_lighter_gray }};
47 | border-bottom: 1px solid {{ theme_lighter_gray }};
48 | margin: 0 0 0 16em;
49 | }
50 |
51 | div.body {
52 | margin: 0;
53 | padding: 0.5em 1.3em;
54 | max-width: 55em;
55 | min-width: 20em;
56 | }
57 |
58 | div.related {
59 | font-size: 1em;
60 | background-color: {{ theme_background }};
61 | }
62 |
63 | div.documentwrapper {
64 | float: left;
65 | width: 100%;
66 | background-color: {{ theme_light_color }};
67 | }
68 |
69 |
70 | /* HEADINGS --------------------------------------------------------------- */
71 |
72 | h1 {
73 | margin: 0;
74 | padding: 0.7em 0 0.3em 0;
75 | font-size: 1.5em;
76 | line-height: 1.15;
77 | color: {{ theme_h1 }};
78 | clear: both;
79 | }
80 |
81 | h2 {
82 | margin: 2em 0 0.2em 0;
83 | font-size: 1.35em;
84 | padding: 0;
85 | color: {{ theme_h2 }};
86 | }
87 |
88 | h3 {
89 | margin: 1em 0 -0.3em 0;
90 | font-size: 1.2em;
91 | color: {{ theme_h3 }};
92 | }
93 |
94 | div.body h1 a, div.body h2 a, div.body h3 a, div.body h4 a, div.body h5 a, div.body h6 a {
95 | color: black;
96 | }
97 |
98 | h1 a.anchor, h2 a.anchor, h3 a.anchor, h4 a.anchor, h5 a.anchor, h6 a.anchor {
99 | display: none;
100 | margin: 0 0 0 0.3em;
101 | padding: 0 0.2em 0 0.2em;
102 | color: {{ theme_gray_a }} !important;
103 | }
104 |
105 | h1:hover a.anchor, h2:hover a.anchor, h3:hover a.anchor, h4:hover a.anchor,
106 | h5:hover a.anchor, h6:hover a.anchor {
107 | display: inline;
108 | }
109 |
110 | h1 a.anchor:hover, h2 a.anchor:hover, h3 a.anchor:hover, h4 a.anchor:hover,
111 | h5 a.anchor:hover, h6 a.anchor:hover {
112 | color: {{ theme_gray_7 }};
113 | background-color: {{ theme_dirty_white }};
114 | }
115 |
116 |
117 | /* LINKS ------------------------------------------------------------------ */
118 |
119 | /* Normal links get a pseudo-underline */
120 | a {
121 | color: {{ theme_link_color }};
122 | text-decoration: none;
123 | border-bottom: 1px solid {{ theme_link_color_decoration }};
124 | }
125 |
126 | /* Links in sidebar, TOC, index trees and tables have no underline */
127 | .sphinxsidebar a,
128 | .toctree-wrapper a,
129 | .indextable a,
130 | #indices-and-tables a {
131 | color: {{ theme_dark_gray }};
132 | text-decoration: none;
133 | border-bottom: none;
134 | }
135 |
136 | /* Most links get an underline-effect when hovered */
137 | a:hover,
138 | div.toctree-wrapper a:hover,
139 | .indextable a:hover,
140 | #indices-and-tables a:hover {
141 | color: {{ theme_black }};
142 | text-decoration: none;
143 | border-bottom: 1px solid {{ theme_black }};
144 | }
145 |
146 | /* Footer links */
147 | div.footer a {
148 | color: {{ theme_background_text_link }};
149 | text-decoration: none;
150 | border: none;
151 | }
152 | div.footer a:hover {
153 | color: {{ theme_medium_color_link_hover }};
154 | text-decoration: underline;
155 | border: none;
156 | }
157 |
158 | /* Permalink anchor (subtle grey with a red hover) */
159 | div.body a.headerlink {
160 | color: {{ theme_lighter_gray }};
161 | font-size: 1em;
162 | margin-left: 6px;
163 | padding: 0 4px 0 4px;
164 | text-decoration: none;
165 | border: none;
166 | }
167 | div.body a.headerlink:hover {
168 | color: {{ theme_negative_text }};
169 | border: none;
170 | }
171 |
172 |
173 | /* NAVIGATION BAR --------------------------------------------------------- */
174 |
175 | div.related ul {
176 | height: 2.5em;
177 | }
178 |
179 | div.related ul li {
180 | margin: 0;
181 | padding: 0.65em 0;
182 | float: left;
183 | display: block;
184 | color: {{ theme_background_link_half }}; /* For the >> separators */
185 | font-size: 0.8em;
186 | }
187 |
188 | div.related ul li.right {
189 | float: right;
190 | margin-right: 5px;
191 | color: transparent; /* Hide the | separators */
192 | }
193 |
194 | /* "Breadcrumb" links in nav bar */
195 | div.related ul li a {
196 | border: none;
197 | background-color: inherit;
198 | font-weight: bold;
199 | margin: 6px 0 6px 4px;
200 | line-height: 1.75em;
201 | color: {{ theme_background_link }};
202 | text-shadow: 0 1px rgba(0, 0, 0, 0.5);
203 | padding: 0.4em 0.8em;
204 | border: none;
205 | border-radius: 3px;
206 | }
207 | /* previous / next / modules / index links look more like buttons */
208 | div.related ul li.right a {
209 | margin: 0.375em 0;
210 | background-color: {{ theme_medium_color_hover }};
211 | text-shadow: 0 1px rgba(0, 0, 0, 0.5);
212 | border-radius: 3px;
213 | -webkit-border-radius: 3px;
214 | -moz-border-radius: 3px;
215 | }
216 | /* All navbar links light up as buttons when hovered */
217 | div.related ul li a:hover {
218 | background-color: {{ theme_medium_color }};
219 | color: {{ theme_white }};
220 | text-decoration: none;
221 | border-radius: 3px;
222 | -webkit-border-radius: 3px;
223 | -moz-border-radius: 3px;
224 | }
225 | /* Take extra precautions for tt within links */
226 | a tt,
227 | div.related ul li a tt {
228 | background: inherit !important;
229 | color: inherit !important;
230 | }
231 |
232 |
233 | /* SIDEBAR ---------------------------------------------------------------- */
234 |
235 | div.sphinxsidebarwrapper {
236 | padding: 0;
237 | }
238 |
239 | div.sphinxsidebar {
240 | margin: 0;
241 | margin-left: -100%;
242 | float: left;
243 | top: 3em;
244 | left: 0;
245 | padding: 0 1em;
246 | width: 14em;
247 | font-size: 1em;
248 | text-align: left;
249 | background-color: {{ theme_light_color }};
250 | }
251 |
252 | div.sphinxsidebar img {
253 | max-width: 12em;
254 | }
255 |
256 | div.sphinxsidebar h3, div.sphinxsidebar h4 {
257 | margin: 1.2em 0 0.3em 0;
258 | font-size: 1em;
259 | padding: 0;
260 | color: {{ theme_gray_2 }};
261 | font-family: "ff-meta-web-pro-1", "ff-meta-web-pro-2", "Arial", "Helvetica Neue", sans-serif;
262 | }
263 |
264 | div.sphinxsidebar h3 a {
265 | color: {{ theme_grey_color }};
266 | }
267 |
268 | div.sphinxsidebar ul,
269 | div.sphinxsidebar p {
270 | margin-top: 0;
271 | padding-left: 0;
272 | line-height: 130%;
273 | background-color: {{ theme_light_color }};
274 | }
275 |
276 | /* No bullets for nested lists, but a little extra indentation */
277 | div.sphinxsidebar ul ul {
278 | list-style-type: none;
279 | margin-left: 1.5em;
280 | padding: 0;
281 | }
282 |
283 | /* A little top/bottom padding to prevent adjacent links' borders
284 | * from overlapping each other */
285 | div.sphinxsidebar ul li {
286 | padding: 1px 0;
287 | }
288 |
289 | /* A little left-padding to make these align with the ULs */
290 | div.sphinxsidebar p.topless {
291 | padding-left: 1em;
292 | }
293 |
294 | /* Make these into hidden one-liners */
295 | div.sphinxsidebar ul li,
296 | div.sphinxsidebar p.topless {
297 | white-space: nowrap;
298 | overflow: hidden;
299 | }
300 | /* ...which become visible when hovered */
301 | div.sphinxsidebar ul li:hover,
302 | div.sphinxsidebar p.topless:hover {
303 | overflow: visible;
304 | }
305 |
306 | /* Search text box and "Go" button */
307 | #searchbox {
308 | margin-top: 2em;
309 | margin-bottom: 1em;
310 | background: {{ theme_dirtier_white }};
311 | padding: 0.5em;
312 | border-radius: 6px;
313 | -moz-border-radius: 6px;
314 | -webkit-border-radius: 6px;
315 | }
316 | #searchbox h3 {
317 | margin-top: 0;
318 | }
319 |
320 | /* Make search box and button abut and have a border */
321 | input,
322 | div.sphinxsidebar input {
323 | border: 1px solid {{ theme_gray_9 }};
324 | float: left;
325 | }
326 |
327 | /* Search textbox */
328 | input[type="text"] {
329 | margin: 0;
330 | padding: 0 3px;
331 | height: 20px;
332 | width: 144px;
333 | border-top-left-radius: 3px;
334 | border-bottom-left-radius: 3px;
335 | -moz-border-radius-topleft: 3px;
336 | -moz-border-radius-bottomleft: 3px;
337 | -webkit-border-top-left-radius: 3px;
338 | -webkit-border-bottom-left-radius: 3px;
339 | }
340 | /* Search button */
341 | input[type="submit"] {
342 | margin: 0 0 0 -1px; /* -1px prevents a double-border with textbox */
343 | height: 22px;
344 | color: {{ theme_dark_gray }};
345 | background-color: {{ theme_light_color }};
346 | padding: 1px 4px;
347 | font-weight: bold;
348 | border-top-right-radius: 3px;
349 | border-bottom-right-radius: 3px;
350 | -moz-border-radius-topright: 3px;
351 | -moz-border-radius-bottomright: 3px;
352 | -webkit-border-top-right-radius: 3px;
353 | -webkit-border-bottom-right-radius: 3px;
354 | }
355 | input[type="submit"]:hover {
356 | color: {{ theme_white }};
357 | background-color: {{ theme_green_highlight }};
358 | }
359 |
360 | div.sphinxsidebar p.searchtip {
361 | clear: both;
362 | padding: 0.5em 0 0 0;
363 | background: {{ theme_dirtier_white }};
364 | color: {{ theme_gray }};
365 | font-size: 0.9em;
366 | }
367 |
368 | /* Sidebar links are unusual */
369 | div.sphinxsidebar li a,
370 | div.sphinxsidebar p a {
371 | background: {{ theme_light_color }}; /* In case links overlap main content */
372 | border-radius: 3px;
373 | -moz-border-radius: 3px;
374 | -webkit-border-radius: 3px;
375 | border: 1px solid transparent; /* To prevent things jumping around on hover */
376 | padding: 0 5px 0 5px;
377 | }
378 | div.sphinxsidebar li a:hover,
379 | div.sphinxsidebar p a:hover {
380 | color: {{ theme_black }};
381 | text-decoration: none;
382 | border: 1px solid {{ theme_light_gray }};
383 | }
384 |
385 | /* Tweak any link appearing in a heading */
386 | div.sphinxsidebar h3 a {
387 | }
388 |
389 |
390 |
391 |
392 | /* OTHER STUFF ------------------------------------------------------------ */
393 |
394 | cite, code, tt {
395 | font-family: 'Consolas', 'Deja Vu Sans Mono',
396 | 'Bitstream Vera Sans Mono', monospace;
397 | font-size: 0.95em;
398 | letter-spacing: 0.01em;
399 | }
400 |
401 | tt {
402 | background-color: {{ theme_code_background }};
403 | color: {{ theme_dark_gray }};
404 | }
405 |
406 | tt.descname, tt.descclassname, tt.xref {
407 | border: 0;
408 | }
409 |
410 | hr {
411 | border: 1px solid {{ theme_ruler }};
412 | margin: 2em;
413 | }
414 |
415 | pre, #_fontwidthtest {
416 | font-family: 'Consolas', 'Deja Vu Sans Mono',
417 | 'Bitstream Vera Sans Mono', monospace;
418 | margin: 1em 2em;
419 | font-size: 0.95em;
420 | letter-spacing: 0.015em;
421 | line-height: 120%;
422 | padding: 0.5em;
423 | border: 1px solid {{ theme_lighter_gray }};
424 | background-color: {{ theme_code_background }};
425 | border-radius: 6px;
426 | -moz-border-radius: 6px;
427 | -webkit-border-radius: 6px;
428 | }
429 |
430 | pre a {
431 | color: inherit;
432 | text-decoration: underline;
433 | }
434 |
435 | td.linenos pre {
436 | padding: 0.5em 0;
437 | }
438 |
439 | div.quotebar {
440 | background-color: {{ theme_almost_white }};
441 | max-width: 250px;
442 | float: right;
443 | padding: 2px 7px;
444 | border: 1px solid {{ theme_lighter_gray }};
445 | }
446 |
447 | div.topic {
448 | background-color: {{ theme_almost_white }};
449 | }
450 |
451 | table {
452 | border-collapse: collapse;
453 | margin: 0 -0.5em 0 -0.5em;
454 | }
455 |
456 | table td, table th {
457 | padding: 0.2em 0.5em 0.2em 0.5em;
458 | }
459 |
460 |
461 | /* ADMONITIONS AND WARNINGS ------------------------------------------------- */
462 |
463 | /* Shared by admonitions, warnings and sidebars */
464 | div.admonition,
465 | div.warning,
466 | div.sidebar {
467 | font-size: 0.9em;
468 | margin: 2em;
469 | padding: 0;
470 | /*
471 | border-radius: 6px;
472 | -moz-border-radius: 6px;
473 | -webkit-border-radius: 6px;
474 | */
475 | }
476 | div.admonition p,
477 | div.warning p,
478 | div.sidebar p {
479 | margin: 0.5em 1em 0.5em 1em;
480 | padding: 0;
481 | }
482 | div.admonition pre,
483 | div.warning pre,
484 | div.sidebar pre {
485 | margin: 0.4em 1em 0.4em 1em;
486 | }
487 | div.admonition p.admonition-title,
488 | div.warning p.admonition-title,
489 | div.sidebar p.sidebar-title {
490 | margin: 0;
491 | padding: 0.1em 0 0.1em 0.5em;
492 | color: white;
493 | font-weight: bold;
494 | font-size: 1.1em;
495 | text-shadow: 0 1px rgba(0, 0, 0, 0.5);
496 | }
497 | div.admonition ul, div.admonition ol,
498 | div.warning ul, div.warning ol,
499 | div.sidebar ul, div.sidebar ol {
500 | margin: 0.1em 0.5em 0.5em 3em;
501 | padding: 0;
502 | }
503 |
504 |
505 | /* Admonitions and sidebars only */
506 | div.admonition, div.sidebar {
507 | border: 1px solid {{ theme_positive_dark }};
508 | background-color: {{ theme_positive_light }};
509 | }
510 | div.admonition p.admonition-title,
511 | div.sidebar p.sidebar-title {
512 | background-color: {{ theme_positive_medium }};
513 | border-bottom: 1px solid {{ theme_positive_dark }};
514 | }
515 |
516 |
517 | /* Warnings only */
518 | div.warning {
519 | border: 1px solid {{ theme_negative_dark }};
520 | background-color: {{ theme_negative_light }};
521 | }
522 | div.warning p.admonition-title {
523 | background-color: {{ theme_negative_medium }};
524 | border-bottom: 1px solid {{ theme_negative_dark }};
525 | }
526 |
527 |
528 | /* Sidebars only */
529 | div.sidebar {
530 | max-width: 200px;
531 | }
532 |
533 |
534 |
535 | div.versioninfo {
536 | margin: 1em 0 0 0;
537 | border: 1px solid {{ theme_lighter_gray }};
538 | background-color: {{ theme_light_medium_color }};
539 | padding: 8px;
540 | line-height: 1.3em;
541 | font-size: 0.9em;
542 | }
543 |
544 | .viewcode-back {
545 | font-family: 'Lucida Grande', 'Lucida Sans Unicode', 'Geneva',
546 | 'Verdana', sans-serif;
547 | }
548 |
549 | div.viewcode-block:target {
550 | background-color: {{ theme_viewcode_bg }};
551 | border-top: 1px solid {{ theme_viewcode_border }};
552 | border-bottom: 1px solid {{ theme_viewcode_border }};
553 | }
554 |
555 | dl {
556 | margin: 1em 0 2.5em 0;
557 | }
558 |
559 | /* Highlight target when you click an internal link */
560 | dt:target {
561 | background: {{ theme_highlight }};
562 | }
563 | /* Don't highlight whole divs */
564 | div.highlight {
565 | background: transparent;
566 | }
567 | /* But do highlight spans (so search results can be highlighted) */
568 | span.highlight {
569 | background: {{ theme_highlight }};
570 | }
571 |
572 | div.footer {
573 | background-color: {{ theme_background }};
574 | color: {{ theme_background_text }};
575 | padding: 0 2em 2em 2em;
576 | clear: both;
577 | font-size: 0.8em;
578 | text-align: center;
579 | }
580 |
581 | p {
582 | margin: 0.8em 0 0.5em 0;
583 | }
584 |
585 | .section p img {
586 | margin: 1em 2em;
587 | }
588 |
589 |
590 | /* MOBILE LAYOUT -------------------------------------------------------------- */
591 |
592 | @media screen and (max-width: 600px) {
593 |
594 | h1, h2, h3, h4, h5 {
595 | position: relative;
596 | }
597 |
598 | ul {
599 | padding-left: 1.75em;
600 | }
601 |
602 | div.bodywrapper a.headerlink, #indices-and-tables h1 a {
603 | color: {{ theme_almost_dirty_white }};
604 | font-size: 80%;
605 | float: right;
606 | line-height: 1.8;
607 | position: absolute;
608 | right: -0.7em;
609 | visibility: inherit;
610 | }
611 |
612 | div.bodywrapper h1 a.headerlink, #indices-and-tables h1 a {
613 | line-height: 1.5;
614 | }
615 |
616 | pre {
617 | font-size: 0.7em;
618 | overflow: auto;
619 | word-wrap: break-word;
620 | white-space: pre-wrap;
621 | }
622 |
623 | div.related ul {
624 | height: 2.5em;
625 | padding: 0;
626 | text-align: left;
627 | }
628 |
629 | div.related ul li {
630 | clear: both;
631 | color: {{ theme_dark_color }};
632 | padding: 0.2em 0;
633 | }
634 |
635 | div.related ul li:last-child {
636 | border-bottom: 1px dotted {{ theme_medium_color }};
637 | padding-bottom: 0.4em;
638 | margin-bottom: 1em;
639 | width: 100%;
640 | }
641 |
642 | div.related ul li a {
643 | color: {{ theme_dark_color }};
644 | padding-right: 0;
645 | }
646 |
647 | div.related ul li a:hover {
648 | background: inherit;
649 | color: inherit;
650 | }
651 |
652 | div.related ul li.right {
653 | clear: none;
654 | padding: 0.65em 0;
655 | margin-bottom: 0.5em;
656 | }
657 |
658 | div.related ul li.right a {
659 | color: {{ theme_white }};
660 | padding-right: 0.8em;
661 | }
662 |
663 | div.related ul li.right a:hover {
664 | background-color: {{ theme_medium_color }};
665 | }
666 |
667 | div.body {
668 | clear: both;
669 | min-width: 0;
670 | word-wrap: break-word;
671 | }
672 |
673 | div.bodywrapper {
674 | margin: 0 0 0 0;
675 | }
676 |
677 | div.sphinxsidebar {
678 | float: none;
679 | margin: 0;
680 | width: auto;
681 | }
682 |
683 | div.sphinxsidebar input[type="text"] {
684 | height: 2em;
685 | line-height: 2em;
686 | width: 70%;
687 | }
688 |
689 | div.sphinxsidebar input[type="submit"] {
690 | height: 2em;
691 | margin-left: 0.5em;
692 | width: 20%;
693 | }
694 |
695 | div.sphinxsidebar p.searchtip {
696 | background: inherit;
697 | margin-bottom: 1em;
698 | }
699 |
700 | div.sphinxsidebar ul li, div.sphinxsidebar p.topless {
701 | white-space: normal;
702 | }
703 |
704 | .bodywrapper img {
705 | display: block;
706 | margin-left: auto;
707 | margin-right: auto;
708 | max-width: 100%;
709 | }
710 |
711 | div.documentwrapper {
712 | float: none;
713 | }
714 |
715 | div.admonition, div.warning, pre, blockquote {
716 | margin-left: 0em;
717 | margin-right: 0em;
718 | }
719 |
720 | .body p img {
721 | margin: 0;
722 | }
723 |
724 | #searchbox {
725 | background: transparent;
726 | }
727 |
728 | .related:not(:first-child) li {
729 | display: none;
730 | }
731 |
732 | .related:not(:first-child) li.right {
733 | display: block;
734 | }
735 |
736 | div.footer {
737 | padding: 1em;
738 | }
739 |
740 | .rtd_doc_footer .badge {
741 | float: none;
742 | margin: 1em auto;
743 | position: static;
744 | }
745 |
746 | .rtd_doc_footer .badge.revsys-inline {
747 | margin-right: auto;
748 | margin-bottom: 2em;
749 | }
750 |
751 | table.indextable {
752 | display: block;
753 | width: auto;
754 | }
755 |
756 | .indextable tr {
757 | display: block;
758 | }
759 |
760 | .indextable td {
761 | display: block;
762 | padding: 0;
763 | width: auto !important;
764 | }
765 |
766 | .indextable td dt {
767 | margin: 1em 0;
768 | }
769 |
770 | ul.search {
771 | margin-left: 0.25em;
772 | }
773 |
774 | ul.search li div.context {
775 | font-size: 90%;
776 | line-height: 1.1;
777 | margin-bottom: 1em;
778 | margin-left: 0;
779 | }
780 |
781 | }
782 |
--------------------------------------------------------------------------------
/docs/source/_theme/armstrong/theme.conf:
--------------------------------------------------------------------------------
1 | [theme]
2 | inherit = default
3 | stylesheet = rtd.css
4 | pygments_style = default
5 | show_sphinx = False
6 |
7 | [options]
8 | show_rtd = True
9 |
10 | white = #ffffff
11 | almost_white = #f8f8f8
12 | barely_white = #f2f2f2
13 | dirty_white = #eeeeee
14 | almost_dirty_white = #e6e6e6
15 | dirtier_white = #DAC6AF
16 | lighter_gray = #cccccc
17 | gray_a = #aaaaaa
18 | gray_9 = #999999
19 | light_gray = #888888
20 | gray_7 = #777777
21 | gray = #666666
22 | dark_gray = #444444
23 | gray_2 = #222222
24 | black = #111111
25 | light_color = #EDE4D8
26 | light_medium_color = #DDEAF0
27 | medium_color = #8ca1af
28 | medium_color_link = #634320
29 | medium_color_link_hover = #261a0c
30 | dark_color = rgba(160, 109, 52, 1.0)
31 |
32 | h1 = #1f3744
33 | h2 = #335C72
34 | h3 = #638fa6
35 |
36 | link_color = #335C72
37 | link_color_decoration = #99AEB9
38 |
39 | medium_color_hover = rgba(255, 255, 255, 0.25)
40 | medium_color = rgba(255, 255, 255, 0.5)
41 | green_highlight = #8ecc4c
42 |
43 |
44 | positive_dark = rgba(51, 77, 0, 1.0)
45 | positive_medium = rgba(102, 153, 0, 1.0)
46 | positive_light = rgba(102, 153, 0, 0.1)
47 |
48 | negative_dark = rgba(51, 13, 0, 1.0)
49 | negative_medium = rgba(204, 51, 0, 1.0)
50 | negative_light = rgba(204, 51, 0, 0.1)
51 | negative_text = #c60f0f
52 |
53 | ruler = #abc
54 |
55 | viewcode_bg = #f4debf
56 | viewcode_border = #ac9
57 |
58 | highlight = #ffe080
59 |
60 | code_background = rgba(0, 0, 0, 0.075)
61 |
62 | background = rgba(135, 57, 34, 1.0)
63 | background_link = rgba(212, 195, 172, 1.0)
64 | background_link_half = rgba(212, 195, 172, 0.5)
65 | background_text = rgba(212, 195, 172, 1.0)
66 | background_text_link = rgba(171, 138, 93, 1.0)
67 |
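Sphinx exposes each key in the `[options]` section to templates under a `theme_` prefix, which is where names such as `theme_link_color` in `rtd.css_t` come from. The sketch below imitates that mapping with the standard-library `configparser` on a trimmed copy of the options; it is an illustration only, and Sphinx's own parsing may differ between versions. Note that `medium_color` is defined twice in this file: a non-strict parser keeps the last definition, while a strict one rejects the file.

```python
import configparser

# A trimmed copy of the [options] section; `medium_color` appears twice,
# just as it does in theme.conf.
THEME_CONF = """
[options]
white = #ffffff
medium_color = #8ca1af
link_color = #335C72
medium_color = rgba(255, 255, 255, 0.5)
"""


def theme_variables(text):
    """Map each [options] key to the theme_<key> name used in rtd.css_t."""
    # strict=False tolerates the duplicate key; the last definition wins.
    parser = configparser.RawConfigParser(strict=False)
    parser.read_string(text)
    return {"theme_" + key: value for key, value in parser.items("options")}


variables = theme_variables(THEME_CONF)
print(variables["theme_medium_color"])
```

With the duplicate resolved this way, the earlier `#8ca1af` value is never used, so removing one of the two definitions would make the intent explicit.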
--------------------------------------------------------------------------------
/docs/source/conf.py:
--------------------------------------------------------------------------------
1 | # -*- coding: utf-8 -*-
2 | #
3 | # pdftables documentation build configuration file, created by
4 | # sphinx-quickstart on Thu Sep 12 16:20:18 2013.
5 | #
6 | # This file is execfile()d with the current directory set to its containing dir.
7 | #
8 | # Note that not all possible configuration values are present in this
9 | # autogenerated file.
10 | #
11 | # All configuration values have a default; values that are commented out
12 | # serve to show the default.
13 |
14 | import sys, os
15 |
16 | # If extensions (or modules to document with autodoc) are in another directory,
17 | # add these directories to sys.path here. If the directory is relative to the
18 | # documentation root, use os.path.abspath to make it absolute, like shown here.
19 |
20 | # The order here is sensitive: they have to be this way round in order to make
21 | # `import pdftables` find the right file.
22 | sys.path = [ os.path.abspath('../../pdftables'), os.path.abspath('../../') ] + sys.path
23 |
24 | import setup as pdftables_setup
25 |
26 | # -- General configuration -----------------------------------------------------
27 |
28 | # If your documentation needs a minimal Sphinx version, state it here.
29 | #needs_sphinx = '1.0'
30 |
31 | # Add any Sphinx extension module names here, as strings. They can be extensions
32 | # coming with Sphinx (named 'sphinx.ext.*') or your custom ones.
33 | extensions = ['sphinx.ext.autodoc', 'sphinx.ext.doctest', 'sphinx.ext.intersphinx', 'sphinx.ext.todo', 'sphinx.ext.coverage', 'sphinx.ext.viewcode']
34 |
35 | # Add any paths that contain templates here, relative to this directory.
36 | templates_path = ['_templates']
37 |
38 | # The suffix of source filenames.
39 | source_suffix = '.rst'
40 |
41 | # The encoding of source files.
42 | #source_encoding = 'utf-8-sig'
43 |
44 | # The master toctree document.
45 | master_doc = 'index'
46 |
47 | # General information about the project.
48 | project = u'pdftables'
49 | copyright = u'2013, ScraperWiki'
50 |
51 | # The version info for the project you're documenting, acts as replacement for
52 | # |version| and |release|, also used in various other places throughout the
53 | # built documents.
54 | #
55 | # The short X.Y version.
56 | version = pdftables_setup.conf['version']
57 | # The full version, including alpha/beta/rc tags.
58 | release = pdftables_setup.conf['version']
59 |
60 | # The language for content autogenerated by Sphinx. Refer to documentation
61 | # for a list of supported languages.
62 | #language = None
63 |
64 | # There are two options for replacing |today|: either, you set today to some
65 | # non-false value, then it is used:
66 | #today = ''
67 | # Else, today_fmt is used as the format for a strftime call.
68 | #today_fmt = '%B %d, %Y'
69 |
70 | # List of patterns, relative to source directory, that match files and
71 | # directories to ignore when looking for source files.
72 | exclude_patterns = []
73 |
74 | # The reST default role (used for this markup: `text`) to use for all documents.
75 | #default_role = None
76 |
77 | # If true, '()' will be appended to :func: etc. cross-reference text.
78 | #add_function_parentheses = True
79 |
80 | # If true, the current module name will be prepended to all description
81 | # unit titles (such as .. function::).
82 | #add_module_names = True
83 |
84 | # If true, sectionauthor and moduleauthor directives will be shown in the
85 | # output. They are ignored by default.
86 | #show_authors = False
87 |
88 | # The name of the Pygments (syntax highlighting) style to use.
89 | pygments_style = 'sphinx'
90 |
91 | # A list of ignored prefixes for module index sorting.
92 | #modindex_common_prefix = []
93 |
94 |
95 | # -- Options for HTML output ---------------------------------------------------
96 |
97 | # The theme to use for HTML and HTML Help pages. See the documentation for
98 | # a list of builtin themes.
99 | html_theme = 'armstrong'
100 |
101 | # Theme options are theme-specific and customize the look and feel of a theme
102 | # further. For a list of options available for each theme, see the
103 | # documentation.
104 | #html_theme_options = {}
105 |
106 | # Add any paths that contain custom themes here, relative to this directory.
107 | html_theme_path = ['_theme']
108 |
109 | # The name for this set of Sphinx documents. If None, it defaults to
110 | # " v documentation".
111 | #html_title = None
112 |
113 | # A shorter title for the navigation bar. Default is the same as html_title.
114 | #html_short_title = None
115 |
116 | # The name of an image file (relative to this directory) to place at the top
117 | # of the sidebar.
118 | #html_logo = None
119 |
120 | # The name of an image file (within the static path) to use as favicon of the
121 | # docs. This file should be a Windows icon file (.ico) being 16x16 or 32x32
122 | # pixels large.
123 | #html_favicon = None
124 |
125 | # Add any paths that contain custom static files (such as style sheets) here,
126 | # relative to this directory. They are copied after the builtin static files,
127 | # so a file named "default.css" will overwrite the builtin "default.css".
128 | html_static_path = ['_static']
129 |
130 | # If not '', a 'Last updated on:' timestamp is inserted at every page bottom,
131 | # using the given strftime format.
132 | #html_last_updated_fmt = '%b %d, %Y'
133 |
134 | # If true, SmartyPants will be used to convert quotes and dashes to
135 | # typographically correct entities.
136 | #html_use_smartypants = True
137 |
138 | # Custom sidebar templates, maps document names to template names.
139 | #html_sidebars = {}
140 |
141 | # Additional templates that should be rendered to pages, maps page names to
142 | # template names.
143 | #html_additional_pages = {}
144 |
145 | # If false, no module index is generated.
146 | #html_domain_indices = True
147 |
148 | # If false, no index is generated.
149 | #html_use_index = True
150 |
151 | # If true, the index is split into individual pages for each letter.
152 | #html_split_index = False
153 |
154 | # If true, links to the reST sources are added to the pages.
155 | #html_show_sourcelink = True
156 |
157 | # If true, "Created using Sphinx" is shown in the HTML footer. Default is True.
158 | #html_show_sphinx = True
159 |
160 | # If true, "(C) Copyright ..." is shown in the HTML footer. Default is True.
161 | #html_show_copyright = True
162 |
163 | # If true, an OpenSearch description file will be output, and all pages will
164 | # contain a tag referring to it. The value of this option must be the
165 | # base URL from which the finished HTML is served.
166 | #html_use_opensearch = ''
167 |
168 | # This is the file name suffix for HTML files (e.g. ".xhtml").
169 | #html_file_suffix = None
170 |
171 | # Output file base name for HTML help builder.
172 | htmlhelp_basename = 'pdftablesdoc'
173 |
174 |
175 | # -- Options for LaTeX output --------------------------------------------------
176 |
177 | latex_elements = {
178 | # The paper size ('letterpaper' or 'a4paper').
179 | #'papersize': 'letterpaper',
180 |
181 | # The font size ('10pt', '11pt' or '12pt').
182 | #'pointsize': '10pt',
183 |
184 | # Additional stuff for the LaTeX preamble.
185 | #'preamble': '',
186 | }
187 |
188 | # Grouping the document tree into LaTeX files. List of tuples
189 | # (source start file, target name, title, author, documentclass [howto/manual]).
190 | latex_documents = [
191 | ('index', 'pdftables.tex', u'pdftables Documentation',
192 | u'ScraperWiki', 'manual'),
193 | ]
194 |
195 | # The name of an image file (relative to this directory) to place at the top of
196 | # the title page.
197 | #latex_logo = None
198 |
199 | # For "manual" documents, if this is true, then toplevel headings are parts,
200 | # not chapters.
201 | #latex_use_parts = False
202 |
203 | # If true, show page references after internal links.
204 | #latex_show_pagerefs = False
205 |
206 | # If true, show URL addresses after external links.
207 | #latex_show_urls = False
208 |
209 | # Documents to append as an appendix to all manuals.
210 | #latex_appendices = []
211 |
212 | # If false, no module index is generated.
213 | #latex_domain_indices = True
214 |
215 |
216 | # -- Options for manual page output --------------------------------------------
217 |
218 | # One entry per manual page. List of tuples
219 | # (source start file, name, description, authors, manual section).
220 | man_pages = [
221 | ('index', 'pdftables', u'pdftables Documentation',
222 | [u'ScraperWiki'], 1)
223 | ]
224 |
225 | # If true, show URL addresses after external links.
226 | #man_show_urls = False
227 |
228 |
229 | # -- Options for Texinfo output ------------------------------------------------
230 |
231 | # Grouping the document tree into Texinfo files. List of tuples
232 | # (source start file, target name, title, author,
233 | # dir menu entry, description, category)
234 | texinfo_documents = [
235 | ('index', 'pdftables', u'pdftables Documentation',
236 | u'ScraperWiki', 'pdftables', 'One line description of project.',
237 | 'Miscellaneous'),
238 | ]
239 |
240 | # Documents to append as an appendix to all manuals.
241 | #texinfo_appendices = []
242 |
243 | # If false, no module index is generated.
244 | #texinfo_domain_indices = True
245 |
246 | # How to display URL addresses: 'footnote', 'no', or 'inline'.
247 | #texinfo_show_urls = 'footnote'
248 |
249 |
250 | # Example configuration for intersphinx: refer to the Python standard library.
251 | intersphinx_mapping = {'http://docs.python.org/': None}
252 |
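The comments in this file distinguish the short X.Y `version` from the full `release`, yet both are assigned the same string. If the two should differ, the short form can be derived from the full one. The following is a minimal sketch, not part of the project: `short_version` is a hypothetical helper, and the `release` value is a placeholder standing in for `pdftables_setup.conf['version']`.

```python
import re


def short_version(release):
    """Return the leading X.Y part of a release string such as '1.2.3rc1'."""
    match = re.match(r"(\d+)\.(\d+)", release)
    if match is None:
        # Fall back to the full string when it does not start with X.Y.
        return release
    return "{0}.{1}".format(match.group(1), match.group(2))


release = "0.0.4"  # placeholder, not the project's actual version
version = short_version(release)
print(version, release)
```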
--------------------------------------------------------------------------------
/docs/source/index.rst:
--------------------------------------------------------------------------------
1 | .. pdftables documentation master file, created by
2 | sphinx-quickstart on Thu Sep 12 16:20:18 2013.
3 | You can adapt this file completely to your liking, but it should at least
4 | contain the root `toctree` directive.
5 |
6 | pdftables
7 | =========
8 |
9 | .. include:: ../../README.rst
10 | :start-line: 13
11 | .. :end-line: 67
12 |
13 |
14 | .. Contents:
15 | =========
16 |
17 | .. .. toctree::
18 | :numbered:
19 | :maxdepth: 2
20 |
21 | API reference
22 | =============
23 |
24 | .. automodule:: pdftables
25 | :members:
26 |
27 |
28 | Indices and tables
29 | ==================
30 |
31 | * :ref:`genindex`
32 | * :ref:`modindex`
33 | * :ref:`search`
34 |
35 |
--------------------------------------------------------------------------------
/download_test_data.sh:
--------------------------------------------------------------------------------
1 | #!/bin/sh
2 |
3 | git clone https://bitbucket.org/scraperwikids/pdftables-test-data fixtures/
4 |
5 | # get EU ground truth dataset
6 | wget http://www.tamirhassan.com/files/eu-dataset-20130324.zip -O fixtures/eu.zip
7 | unzip fixtures/eu.zip -d fixtures
8 | rm fixtures/eu.zip
9 |
10 |
--------------------------------------------------------------------------------
/pdftables/__init__.py:
--------------------------------------------------------------------------------
1 | from .pdftables import *
2 | from .config_parameters import ConfigParameters
3 |
--------------------------------------------------------------------------------
/pdftables/boxes.py:
--------------------------------------------------------------------------------
1 | """
2 | Describe box-like data (such as glyphs and rects) in a PDF and helper functions
3 | """
4 | # ScraperWiki Limited
5 | # Ian Hopkinson, 2013-06-19
6 | # -*- coding: utf-8 -*-
7 |
8 | from __future__ import unicode_literals
9 |
10 | from collections import namedtuple
11 | from .counter import Counter
12 |
13 | from .line_segments import LineSegment
14 |
15 |
16 | def _rounder(val, tol):
17 | """
18 | Utility function to round numbers to arbitrary tolerance
19 | """
20 | return round((1.0 * val) / tol) * tol
21 |
22 |
23 | class Histogram(Counter):
24 |
25 | def rounder(self, tol):
26 | c = Histogram()
27 | for item in self:
28 | c = c + Histogram({_rounder(item, tol): self[item]})
29 | return c
30 |
31 |
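As an illustration of what `_rounder` and `Histogram.rounder` do, the sketch below rounds keys to the nearest multiple of a tolerance and merges the counts that land in the same bin. It uses the standard-library `collections.Counter` rather than the package's own `counter` module, and `round_to_tolerance` and `bin_counts` are hypothetical names chosen for this example.

```python
from collections import Counter


def round_to_tolerance(val, tol):
    """Round val to the nearest multiple of tol (same formula as _rounder)."""
    return round((1.0 * val) / tol) * tol


def bin_counts(counts, tol):
    """Merge counts whose keys round to the same multiple of tol."""
    binned = Counter()
    for key, count in counts.items():
        binned[round_to_tolerance(key, tol)] += count
    return binned


# Edge coordinates that differ only by small jitter fall into one bin.
edges = Counter({10.1: 2, 9.8: 1, 20.2: 3})
print(bin_counts(edges, 5))
```

Values lying exactly halfway between two multiples are rounded differently by Python 2 and Python 3, so the example avoids such inputs.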
32 | class Rectangle(namedtuple("Rectangle", "x1 y1 x2 y2")):
33 |
34 | def __repr__(self):
35 | return (
36 | "Rectangle(x1={0:6.02f} y1={1:6.02f} x2={2:6.02f} y2={3:6.02f})"
37 | .format(self.x1, self.y1, self.x2, self.y2))
38 |
39 |
40 | class Box(object):
41 |
42 | def __init__(self, rect, text=None, barycenter=None, barycenter_y=None):
43 |
44 | if not isinstance(rect, Rectangle):
45 | raise RuntimeError("Box(x) expects isinstance(x, Rectangle)")
46 |
47 | self.rect = rect
48 | self.text = text
49 | self.barycenter = barycenter
50 | self.barycenter_y = barycenter_y
51 |
52 | def __repr__(self):
53 | if self is Box.empty_box:
54 | return "<Box empty>"
55 | return "<Box {0} {1!r}>".format(self.rect, self.text)
56 |
57 | @classmethod
58 | def copy(cls, o):
59 | return cls(
60 | rect=o.rect,
61 | text=o.text,
62 | barycenter=o.barycenter,
63 | barycenter_y=o.barycenter_y,
64 | )
65 |
66 | def is_connected_to(self, next):
67 | if self.text.strip() == "" or next.text.strip() == "":
68 | # Whitespace can't be connected into a word.
69 | return False
70 |
71 | def equal(left, right):
72 | # Distance in pixels
73 | TOLERANCE = 0.5
74 | # The almond board paradox
75 | if self.text.endswith("("):
76 | TOLERANCE = 10
77 | return abs(left - right) < TOLERANCE
78 |
79 | shared_barycenter = self.barycenter_y == next.barycenter_y
80 | shared_boundary = equal(self.right, next.left)
81 |
82 | return shared_barycenter and shared_boundary
83 |
84 | def extend(self, next):
85 | self.text += next.text
86 | self.rect = self.rect._replace(x2=next.right)
87 |
88 | def clip(self, *rectangles):
89 | """
90 | Return the rectangle representing the subset of this Box and all of
91 | rectangles. If there is no rectangle left, ``Box.empty_box`` is
92 | returned which always clips to the empty box.
93 | """
94 |
95 | x1, y1, x2, y2 = self.rect
96 | for rectangle in rectangles:
97 | x1 = max(x1, rectangle.left)
98 | x2 = min(x2, rectangle.right)
99 | y1 = max(y1, rectangle.top)
100 | y2 = min(y2, rectangle.bottom)
101 |
102 | if x1 > x2 or y1 > y2:
103 | # There is no rect left, so return the "empty set"
104 | return Box.empty_box
105 |
106 | return type(self)(Rectangle(x1=x1, y1=y1, x2=x2, y2=y2))
107 |
108 | @property
109 | def left(self):
110 | return self.rect[0]
111 |
112 | @property
113 | def top(self):
114 | return self.rect[1]
115 |
116 | @property
117 | def right(self):
118 | return self.rect[2]
119 |
120 | @property
121 | def bottom(self):
122 | return self.rect[3]
123 |
124 | @property
125 | def center_x(self):
126 | return (self.left + self.right) / 2.
127 |
128 | @property
129 | def center_y(self):
130 | return (self.bottom + self.top) / 2.
131 |
132 | @property
133 | def width(self):
134 | return self.right - self.left
135 |
136 | @property
137 | def height(self):
138 | return self.bottom - self.top
139 |
140 | """
141 | The empty box. This is necessary because we get one
142 | when we clip two boxes that do not overlap (and
143 | possibly in other situations).
144 |
145 | By convention it has left at +Inf, right at -Inf, top
146 | at +Inf, bottom at -Inf.
147 |
148 | It is defined this way so that it is invariant under clipping.
149 | """
150 | Box.empty_box = Box(Rectangle(x1=float("+inf"), y1=float("+inf"),
151 | x2=float("-inf"), y2=float("-inf")))
152 |
153 |
154 | class BoxList(list):
155 |
156 | def line_segments(self):
157 | """
158 | Return line (start, end) corresponding to horizontal and vertical
159 | box edges
160 | """
161 | horizontal = [LineSegment(b.left, b.right, b)
162 | for b in self]
163 | vertical = [LineSegment(b.top, b.bottom, b)
164 | for b in self]
165 |
166 | return horizontal, vertical
167 |
168 | def inside(self, rect):
169 | """
170 |         Return a fresh instance that is the subset that lies inside
171 |         `rect` (boxes touching the edges of `rect` are included).
172 | """
173 |
174 | def is_in_rect(box):
175 | return (rect.left <= box.left <= box.right <= rect.right and
176 | rect.top <= box.top <= box.bottom <= rect.bottom)
177 |
178 | return type(self)(box for box in self if is_in_rect(box))
179 |
180 | def bounds(self):
181 | """Return the (strictest) bounding box of all elements."""
182 | return Box(Rectangle(
183 | x1=min(box.left for box in self),
184 | y1=min(box.top for box in self),
185 | x2=max(box.right for box in self),
186 | y2=max(box.bottom for box in self),
187 | ))
188 |
189 | def __repr__(self):
190 | return "BoxList(len={0})".format(len(self))
191 |
192 | def purge_empty_text(self):
193 | # TODO: BUG: we remove characters without adjusting the width / coords
194 | # which is kind of invalid.
195 |
196 | return BoxList(box for box in self if box.text.strip()
197 | or box.classname != 'LTTextLineHorizontal')
198 |
199 | def filterByType(self, flt=None):
200 | if not flt:
201 | return self
202 | return BoxList(box for box in self if box.classname in flt)
203 |
204 | def histogram(self, dir_fun):
205 | # index 0 = left, 1 = top, 2 = right, 3 = bottom
206 | for item in self:
207 | assert type(item) == Box, item
208 | return Histogram(dir_fun(box) for box in self)
209 |
210 | def count(self):
211 | return Counter(x.classname for x in self)
212 |
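The clipping rule in ``Box.clip`` and the choice of infinities for ``Box.empty_box`` can be illustrated in isolation. The following is a minimal Python 3 sketch, not the library's code: the ``clip`` function and the ``EMPTY`` constant are hypothetical stand-ins that work on a bare namedtuple rather than on ``Box``.

```python
from collections import namedtuple

# Hypothetical stand-in for the Rectangle used by Box (same field names).
Rectangle = namedtuple("Rectangle", ["x1", "y1", "x2", "y2"])

INF = float("inf")
# Same convention as Box.empty_box: left/top at +inf, right/bottom at -inf.
EMPTY = Rectangle(x1=INF, y1=INF, x2=-INF, y2=-INF)


def clip(rect, *others):
    """Intersect rect with every rectangle in others."""
    x1, y1, x2, y2 = rect
    for other in others:
        x1 = max(x1, other.x1)
        x2 = min(x2, other.x2)
        y1 = max(y1, other.y1)
        y2 = min(y2, other.y2)
    if x1 > x2 or y1 > y2:
        # Nothing is left, so return the "empty set".
        return EMPTY
    return Rectangle(x1, y1, x2, y2)


a = Rectangle(0, 0, 10, 10)
b = Rectangle(5, 5, 20, 20)
c = Rectangle(30, 30, 40, 40)

overlap = clip(a, b)          # the shared corner region
disjoint = clip(a, c)         # no overlap at all
still_empty = clip(EMPTY, a)  # clipping the empty rectangle stays empty
```

Because ``max`` against +inf and ``min`` against -inf never move the empty rectangle's edges, clipping it against anything yields the empty rectangle again, which is the invariance the comment on ``Box.empty_box`` describes.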
--------------------------------------------------------------------------------
/pdftables/config_parameters.py:
--------------------------------------------------------------------------------
1 | #!/usr/bin/env python
2 |
3 | import warnings
4 |
5 |
6 | class ConfigParameters(object):
7 | """
8 | Controls how tables are detected, extracted etc.
9 | Be Careful! If you add a new parameter:
10 |
11 | 1) The default value should be equivalent to the previous behaviour
12 | 2) You're committing to retaining its default value forever(ish)! People
13 | will write code which relies on the default value today, so changing
14 | that will give them unexpected behaviour.
15 | """
16 |
17 | def __init__(
18 | self,
19 |
20 | table_top_hint=None,
21 | table_bottom_hint=None,
22 |
23 | n_glyph_column_threshold=3,
24 | n_glyph_row_threshold=5
25 | ):
26 |
27 | self.table_top_hint = table_top_hint
28 | self.table_bottom_hint = table_bottom_hint
29 |
30 | self.n_glyph_column_threshold = n_glyph_column_threshold
31 | self.n_glyph_row_threshold = n_glyph_row_threshold
32 |
33 |
34 |
--------------------------------------------------------------------------------
/pdftables/counter.py:
--------------------------------------------------------------------------------
1 | """
2 | Implement collections.Counter for the benefit of Python 2.6
3 | """
4 |
5 |
6 | from operator import itemgetter
7 | from heapq import nlargest
8 | from itertools import repeat, ifilter
9 |
10 |
11 | class Counter(dict):
12 |
13 | '''
14 | Dict subclass for counting hashable objects. Sometimes called a bag
15 | or multiset. Elements are stored as dictionary keys and their counts
16 | are stored as dictionary values.
17 |
18 | >>> Counter('zyzygy')
19 | Counter({'y': 3, 'z': 2, 'g': 1})
20 | '''
21 |
22 | def __init__(self, iterable=None, **kwds):
23 | '''Create a new, empty Counter object. And if given, count elements
24 | from an input iterable. Or, initialize the count from another mapping
25 | of elements to their counts.
26 |
27 | >>> c = Counter() # a new, empty counter
28 | >>> c = Counter('gallahad') # a new counter from an iterable
29 | >>> c = Counter({'a': 4, 'b': 2}) # a new counter from a mapping
30 | >>> c = Counter(a=4, b=2) # a new counter from keyword args
31 |
32 | '''
33 | self.update(iterable, **kwds)
34 |
35 | def __missing__(self, key):
36 | return 0
37 |
38 | def most_common(self, n=None):
39 | '''List the n most common elements and their counts from the most
40 | common to the least. If n is None, then list all element counts.
41 |
42 | >>> Counter('abracadabra').most_common(3)
43 | [('a', 5), ('r', 2), ('b', 2)]
44 |
45 | '''
46 | if n is None:
47 | return sorted(self.iteritems(), key=itemgetter(1), reverse=True)
48 | return nlargest(n, self.iteritems(), key=itemgetter(1))
49 |
50 | def elements(self):
51 | '''Iterator over elements repeating each as many times as its count.
52 |
53 | >>> c = Counter('ABCABC')
54 | >>> sorted(c.elements())
55 | ['A', 'A', 'B', 'B', 'C', 'C']
56 |
57 | If an element's count has been set to zero or is a negative number,
58 | elements() will ignore it.
59 |
60 | '''
61 | for elem, count in self.iteritems():
62 | for _ in repeat(None, count):
63 | yield elem
64 |
65 | # Override dict methods where the meaning changes for Counter objects.
66 |
67 | @classmethod
68 | def fromkeys(cls, iterable, v=None):
69 | raise NotImplementedError(
70 | 'Counter.fromkeys() is undefined. Use Counter(iterable) instead.')
71 |
72 | def update(self, iterable=None, **kwds):
73 | '''Like dict.update() but add counts instead of replacing them.
74 |
75 | Source can be an iterable, a dictionary, or another Counter instance.
76 |
77 | >>> c = Counter('which')
78 | >>> c.update('witch') # add elements from another iterable
79 | >>> d = Counter('watch')
80 | >>> c.update(d) # add elements from another counter
81 | >>> c['h'] # four 'h' in which, witch, and watch
82 | 4
83 |
84 | '''
85 | if iterable is not None:
86 | if hasattr(iterable, 'iteritems'):
87 | if self:
88 | self_get = self.get
89 | for elem, count in iterable.iteritems():
90 | self[elem] = self_get(elem, 0) + count
91 | else:
92 | # fast path when counter is empty
93 | dict.update(self, iterable)
94 | else:
95 | self_get = self.get
96 | for elem in iterable:
97 | self[elem] = self_get(elem, 0) + 1
98 | if kwds:
99 | self.update(kwds)
100 |
101 | def copy(self):
102 | """
103 | Like dict.copy() but returns a Counter instance instead of a dict.
104 | """
105 | return Counter(self)
106 |
107 | def __delitem__(self, elem):
108 | """
109 | Like dict.__delitem__() but does not raise KeyError for missing values.
110 | """
111 | if elem in self:
112 | dict.__delitem__(self, elem)
113 |
114 | def __repr__(self):
115 | if not self:
116 | return '%s()' % self.__class__.__name__
117 | items = ', '.join(map('%r: %r'.__mod__, self.most_common()))
118 | return '%s({%s})' % (self.__class__.__name__, items)
119 |
120 | # Multiset-style mathematical operations discussed in:
121 | # Knuth TAOCP Volume II section 4.6.3 exercise 19
122 | # and at http://en.wikipedia.org/wiki/Multiset
123 | #
124 | # Outputs guaranteed to only include positive counts.
125 | #
126 | # To strip negative and zero counts, add-in an empty counter:
127 | # c += Counter()
128 |
129 | def __add__(self, other):
130 | '''Add counts from two counters.
131 |
132 | >>> Counter('abbb') + Counter('bcc')
133 | Counter({'b': 4, 'c': 2, 'a': 1})
134 |
135 |
136 | '''
137 | if not isinstance(other, Counter):
138 | return NotImplemented
139 | result = Counter()
140 | for elem in set(self) | set(other):
141 | newcount = self[elem] + other[elem]
142 | if newcount > 0:
143 | result[elem] = newcount
144 | return result
145 |
146 | def __sub__(self, other):
147 | ''' Subtract count, but keep only results with positive counts.
148 |
149 | >>> Counter('abbbc') - Counter('bccd')
150 | Counter({'b': 2, 'a': 1})
151 |
152 | '''
153 | if not isinstance(other, Counter):
154 | return NotImplemented
155 | result = Counter()
156 | for elem in set(self) | set(other):
157 | newcount = self[elem] - other[elem]
158 | if newcount > 0:
159 | result[elem] = newcount
160 | return result
161 |
162 | def __or__(self, other):
163 | '''Union is the maximum of value in either of the input counters.
164 |
165 | >>> Counter('abbb') | Counter('bcc')
166 | Counter({'b': 3, 'c': 2, 'a': 1})
167 |
168 | '''
169 | if not isinstance(other, Counter):
170 | return NotImplemented
171 | _max = max
172 | result = Counter()
173 | for elem in set(self) | set(other):
174 | newcount = _max(self[elem], other[elem])
175 | if newcount > 0:
176 | result[elem] = newcount
177 | return result
178 |
179 | def __and__(self, other):
180 | ''' Intersection is the minimum of corresponding counts.
181 |
182 | >>> Counter('abbb') & Counter('bcc')
183 | Counter({'b': 1})
184 |
185 | '''
186 | if not isinstance(other, Counter):
187 | return NotImplemented
188 | _min = min
189 | result = Counter()
190 | if len(self) < len(other):
191 | self, other = other, self
192 | for elem in ifilter(self.__contains__, other):
193 | newcount = _min(self[elem], other[elem])
194 | if newcount > 0:
195 | result[elem] = newcount
196 | return result
197 |
198 |
199 | if __name__ == '__main__':
200 | import doctest
201 | print doctest.testmod()
202 |
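On Python 2.7 and later the standard library's ``collections.Counter`` provides the same multiset operations, so the doctests in this module can be reproduced without the backport. A small sketch, assuming Python 3 and using only the standard library (the variable names are illustrative):

```python
from collections import Counter

left = Counter('abbb')
right = Counter('bcc')

added = left + right        # counts are summed per element
subtracted = Counter('abbbc') - Counter('bccd')  # only positive counts kept
union = left | right        # per-element maximum
intersection = left & right  # per-element minimum

# Missing keys count as zero instead of raising KeyError.
missing = left['z']
```

Comparing against plain dictionaries avoids depending on the ordering used by ``repr``, which is why the checks below use equality rather than printed output.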
--------------------------------------------------------------------------------
/pdftables/diagnostics.py:
--------------------------------------------------------------------------------
1 | #!/usr/bin/env python
2 |
3 | import sys
4 | from collections import namedtuple
5 | import poppler
6 | import cairo
7 |
8 | from os.path import abspath
9 |
10 | Point = namedtuple('Point', ['x', 'y'])
11 | Line = namedtuple('Line', ['start', 'end'])
12 | Polygon = namedtuple('Polygon', 'points')
13 | Rectangle = namedtuple('Rectangle', ['top_left', 'bottom_right'])
14 | AnnotationGroup = namedtuple('AnnotationGroup', ['name', 'color', 'shapes'])
15 | Color = namedtuple('Color', ['red', 'green', 'blue'])
16 |
17 | __all__ = [
18 | 'render_page',
19 | 'make_annotations',
20 | ]
21 |
22 |
23 | def draw_line(context, line):
24 | context.move_to(line.start.x, line.start.y)
25 | context.line_to(line.end.x, line.end.y)
26 | context.stroke()
27 |
28 |
29 | def draw_polygon(context, polygon):
30 | if len(polygon.points) == 0:
31 | return
32 |
33 | first_point = polygon.points[0]
34 |
35 | context.move_to(first_point.x, first_point.y)
36 |     for point in polygon.points[1:]:
37 |         context.line_to(point.x, point.y)
38 |
39 | context.stroke()
40 |
41 |
42 | def draw_rectangle(context, rectangle):
43 | width = abs(rectangle.bottom_right.x - rectangle.top_left.x)
44 | height = abs(rectangle.bottom_right.y - rectangle.top_left.y)
45 |
46 | context.rectangle(rectangle.top_left.x,
47 | rectangle.top_left.y,
48 | width,
49 | height)
50 | context.stroke()
51 |
52 |
53 | RENDERERS = {}
54 | RENDERERS[Line] = draw_line
55 | RENDERERS[Rectangle] = draw_rectangle
56 | RENDERERS[Polygon] = draw_polygon
57 |
58 |
59 | class CairoPdfPageRenderer(object):
60 |
61 | def __init__(self, pdf_page, svg_filename, png_filename):
62 |         self._svg_filename = abspath(svg_filename) if svg_filename else None
63 | self._png_filename = abspath(png_filename) if png_filename else None
64 | self._context, self._surface = self._get_context(
65 | svg_filename, *pdf_page.get_size())
66 |
67 | white = poppler.Color()
68 | white.red = white.green = white.blue = 65535
69 | black = poppler.Color()
70 | black.red = black.green = black.blue = 0
71 | # red = poppler.Color()
72 | # red.red = red.green = red.blue = 0
73 | # red.red = 65535
74 |
75 | width = pdf_page.get_size()[0]
76 |
77 | # We render everything 3 times, moving
78 | # one page-width to the right each time.
79 | self._offset_colors = [
80 | (0, white, white, True),
81 | (width, black, white, True),
82 | (2 * width, black, black, False)
83 | ]
84 |
85 | for offset, fg_color, bg_color, render_graphics in self._offset_colors:
86 | # Render into context, with a different offset
87 | # each time.
88 | self._context.save()
89 | self._context.translate(offset, 0)
90 |
91 | sel = poppler.Rectangle()
92 | sel.x1, sel.y1 = (0, 0)
93 | sel.x2, sel.y2 = pdf_page.get_size()
94 |
95 | if render_graphics:
96 | pdf_page.render(self._context)
97 |
98 | pdf_page.render_selection(
99 | self._context, sel, sel, poppler.SELECTION_GLYPH,
100 | fg_color, bg_color)
101 |
102 | self._context.restore()
103 |
104 | @staticmethod
105 | def _get_context(filename, width, height):
106 | SCALE = 1
107 | # left, middle, right
108 | N_RENDERINGS = 3
109 |
110 | surface = cairo.SVGSurface(
111 | filename, N_RENDERINGS * width * SCALE, height * SCALE)
112 | # srf = cairo.ImageSurface(
113 | # cairo.FORMAT_RGB24, int(w*SCALE), int(h*SCALE))
114 |
115 | context = cairo.Context(surface)
116 | context.scale(SCALE, SCALE)
117 |
118 | # Set background color to white
119 | context.set_source_rgb(1, 1, 1)
120 | context.paint()
121 |
122 | return context, surface
123 |
124 | def draw(self, shape, color):
125 | self._context.save()
126 | self._context.set_line_width(1)
127 | self._context.set_source_rgba(color.red,
128 | color.green,
129 | color.blue,
130 | 0.5)
131 | self._context.translate(self._offset_colors[1][0], 0)
132 | RENDERERS[type(shape)](self._context, shape)
133 | self._context.restore()
134 |
135 | def flush(self):
136 | if self._png_filename is not None:
137 | self._surface.write_to_png(self._png_filename)
138 |
139 | # NOTE! The flush is rather expensive, since it writes out the svg
140 | # data. The profile will show a large amount of time spent inside it.
141 | # Removing it won't help the execution time at all, it will just move
142 | # it somewhere that the profiler can't see it
143 | # (at garbage collection time)
144 | self._surface.flush()
145 | self._surface.finish()
146 |
147 |
148 | def render_page(pdf_filename, page_number, annotations, svg_file=None,
149 | png_file=None):
150 | """
151 | Render a single page of a pdf with graphical annotations added.
152 | """
153 |
154 | page = extract_pdf_page(pdf_filename, page_number)
155 |
156 | renderer = CairoPdfPageRenderer(page, svg_file, png_file)
157 | for annotation in annotations:
158 | assert isinstance(annotation, AnnotationGroup), (
159 | "annotations: {0}, annotation: {1}".format(
160 | annotations, annotation))
161 | for shape in annotation.shapes:
162 | renderer.draw(shape, annotation.color)
163 |
164 | renderer.flush()
165 |
166 |
167 | def extract_pdf_page(filename, page_number):
168 | file_uri = "file://{0}".format(abspath(filename))
169 | doc = poppler.document_new_from_file(file_uri, "")
170 |
171 | page = doc.get_page(page_number)
172 |
173 | return page
174 |
175 |
176 | def make_annotations(table_container):
177 | """
178 | Take the output of the table-finding algorithm (TableFinder) and create
179 | AnnotationGroups. These can be drawn on top of the original PDF page to
180 | visualise how the algorithm arrived at its output.
181 | """
182 |
183 | annotations = []
184 |
185 | annotations.append(
186 | AnnotationGroup(
187 | name='all_glyphs',
188 | color=Color(0, 1, 0),
189 | shapes=convert_rectangles(table_container.all_glyphs)))
190 |
191 | annotations.append(
192 | AnnotationGroup(
193 | name='all_words',
194 | color=Color(0, 0, 1),
195 | shapes=convert_rectangles(table_container.all_words)))
196 |
197 | annotations.append(
198 | AnnotationGroup(
199 | name='text_barycenters',
200 | color=Color(0, 0, 1),
201 | shapes=convert_barycenters(table_container.all_glyphs)))
202 |
203 | annotations.append(
204 | AnnotationGroup(
205 | name='hat_graph_vertical',
206 | color=Color(0, 1, 0),
207 | shapes=make_hat_graph(
208 | table_container._y_point_values,
209 | table_container._center_lines,
210 | direction="vertical")))
211 |
212 | for table in table_container:
213 | annotations.append(
214 | AnnotationGroup(
215 | name='row_edges',
216 | color=Color(1, 0, 0),
217 | shapes=convert_horizontal_lines(
218 | table.row_edges, table.bounding_box)))
219 |
220 | annotations.append(
221 | AnnotationGroup(
222 | name='column_edges',
223 | color=Color(1, 0, 0),
224 | shapes=convert_vertical_lines(
225 | table.column_edges, table.bounding_box)))
226 |
227 | annotations.append(
228 | AnnotationGroup(
229 | name='glyph_histogram_horizontal',
230 | color=Color(1, 0, 0),
231 | shapes=make_glyph_histogram(
232 | table._x_glyph_histogram, table.bounding_box,
233 | direction="horizontal")))
234 |
235 | annotations.append(
236 | AnnotationGroup(
237 | name='glyph_histogram_vertical',
238 | color=Color(1, 0, 0),
239 | shapes=make_glyph_histogram(
240 | table._y_glyph_histogram, table.bounding_box,
241 | direction="vertical")))
242 |
243 | annotations.append(
244 | AnnotationGroup(
245 | name='horizontal_glyph_above_threshold',
246 | color=Color(0, 0, 0),
247 | shapes=make_thresholds(
248 | table._x_threshold_segs, table.bounding_box,
249 | direction="horizontal")))
250 |
251 | annotations.append(
252 | AnnotationGroup(
253 | name='vertical_glyph_above_threshold',
254 | color=Color(0, 0, 0),
255 | shapes=make_thresholds(
256 | table._y_threshold_segs, table.bounding_box,
257 | direction="vertical")))
258 |
259 | # Draw bounding boxes last so that they appear on top
260 | annotations.append(
261 | AnnotationGroup(
262 | name='table_bounding_boxes',
263 | color=Color(0, 0, 1),
264 | shapes=convert_rectangles(table_container.bounding_boxes)))
265 |
266 | return annotations
267 |
268 |
269 | def make_thresholds(segments, box, direction):
270 | lines = []
271 |
272 | for segment in segments:
273 |
274 | if direction == "horizontal":
275 | lines.append(Line(Point(segment.start, box.bottom + 10),
276 | Point(segment.end, box.bottom + 10)))
277 | else:
278 | lines.append(Line(Point(10, segment.start),
279 | Point(10, segment.end)))
280 |
281 | return lines
282 |
283 |
284 | def make_hat_graph(hats, center_lines, direction):
285 | """
286 | Draw estimated text barycenter
287 | """
288 |
289 | max_value = max(v for _, v in hats)
290 | DISPLAY_WIDTH = 25
291 |
292 | points = []
293 | polygon = Polygon(points)
294 |
295 | def point(x, y):
296 | points.append(Point(x, y))
297 |
298 | for position, value in hats:
299 | point(((value / max_value - 1) * DISPLAY_WIDTH), position)
300 |
301 | lines = []
302 | for position in center_lines:
303 | lines.append(Line(Point(-DISPLAY_WIDTH, position),
304 | Point(0, position)))
305 |
306 | return [polygon] + lines
307 |
308 |
309 | def make_glyph_histogram(histogram, box, direction):
310 |
311 | # if direction == "vertical":
312 | # return []
313 |
314 | bin_edges, bin_values = histogram
315 |
316 | if not bin_edges:
317 | # There are no glyphs, and nothing to render!
318 | return []
319 |
320 | points = []
321 | polygon = Polygon(points)
322 |
323 | def point(x, y):
324 | points.append(Point(x, y))
325 |
326 | # def line(*args):
327 | # lines.append(Line(*args))
328 | previous_value = 0 if direction == "horizontal" else box.bottom
329 |
330 | x = zip(bin_edges, bin_values)
331 | for edge, value in x:
332 |
333 | if direction == "horizontal":
334 | value *= 0.75
335 | value = box.bottom - value
336 |
337 | point(edge, previous_value)
338 | point(edge, value)
339 |
340 | else:
341 | value *= 0.25
342 | value += 7 # shift pixels to the right
343 |
344 | point(previous_value, edge)
345 | point(value, edge)
346 |
347 | previous_value = value
348 |
349 | # Final point is at 0
350 | if direction == "horizontal":
351 | point(edge, 0)
352 | else:
353 | point(box.bottom, edge)
354 |
355 | # Draw edge density plot (not terribly interesting, should probably be
356 | # deleted)
357 | # lines = []
358 | # if direction == "horizontal":
359 | # for edge in bin_edges:
360 | # lines.append(Line(Point(edge, box.bottom),
361 | # Point(edge, box.bottom + 5)))
362 | # else:
363 | # for edge in bin_edges:
364 | # lines.append(Line(Point(0, edge), Point(5, edge)))
365 | return [polygon] # + lines
366 |
367 |
368 | def convert_rectangles(boxes):
369 | return [Rectangle(Point(b.left, b.top), Point(b.right, b.bottom))
370 | for b in boxes]
371 |
372 |
373 | def convert_barycenters(boxes):
374 | return [Line(Point(b.left, b.barycenter.midpoint),
375 | Point(b.right, b.barycenter.midpoint))
376 | for b in boxes if b.barycenter is not None]
377 |
378 |
379 | def convert_horizontal_lines(y_edges, bbox):
380 | return [Line(Point(bbox.left, y), Point(bbox.right, y))
381 | for y in y_edges]
382 |
383 |
384 | def convert_vertical_lines(x_edges, bbox):
385 | return [Line(Point(x, bbox.top), Point(x, bbox.bottom))
386 | for x in x_edges]
387 |
388 | if __name__ == '__main__':
389 | annotations = [
390 | AnnotationGroup(
391 | name='',
392 | color=Color(1, 0, 0),
393 | shapes=[Rectangle(Point(100, 100), Point(200, 200))])
394 | ]
395 | render_page(sys.argv[1], 0, annotations)
396 |
--------------------------------------------------------------------------------
/pdftables/display.py:
--------------------------------------------------------------------------------
1 | #!/usr/bin/env python
2 | from __future__ import unicode_literals
3 | from collections import defaultdict
4 | from StringIO import StringIO
5 |
6 |
7 | def to_string(table):
8 |     """
9 |     Returns the table rendered as a printable unicode string
10 |     >>> type(to_string([['foo', 'goodbye'], ['llama', 'bar']]))
11 |     <type 'unicode'>
12 |     """
13 | result = StringIO()
14 |
15 | (columns, rows) = get_dimensions(table)
16 |
17 | result.write(" {} columns, {} rows\n".format(columns, rows))
18 | col_widths = find_column_widths(table)
19 | table_width = sum(col_widths) + len(col_widths) + 2
20 | hbar = ' {}\n'.format('-' * table_width)
21 |
22 | result.write(" {}\n".format(' '.join(
23 | [unicode(col_index).rjust(width, ' ') for (col_index, width)
24 | in enumerate(col_widths)])))
25 |
26 | result.write(hbar)
27 | for row_index, row in enumerate(table):
28 | cells = [cell.rjust(width, ' ') for (cell, width)
29 | in zip(row, col_widths)]
30 | result.write("{:>3} | {}|\n".format(row_index, '|'.join(cells)))
31 | result.write(hbar)
32 | result.seek(0)
33 | return unicode(result.read())
34 |
35 |
36 | def get_dimensions(table):
37 | """
38 | Returns columns, rows for a table.
39 | >>> get_dimensions([['row1', 'apple', 'llama'], ['row2', 'plum', 'goat']])
40 | (3, 2)
41 | >>> get_dimensions([['row1', 'apple', 'llama'], ['row2', 'banana']])
42 | (3, 2)
43 | """
44 | rows = len(table)
45 | try:
46 | cols = max(len(row) for row in table)
47 | except ValueError:
48 | cols = 0
49 | return (cols, rows)
50 |
51 |
52 | def find_column_widths(table):
53 | """
54 | Returns a list of the maximum width for each column across all rows
55 | >>> find_column_widths([['foo', 'goodbye'], ['llama', 'bar']])
56 | [5, 7]
57 | """
58 | col_widths = defaultdict(lambda: 0)
59 | for row_index, row in enumerate(table):
60 | for column_index, cell in enumerate(row):
61 | col_widths[column_index] = max(col_widths[column_index], len(cell))
62 | return [col_widths[col] for col in sorted(col_widths)]
63 |
64 | if __name__ == '__main__':
65 | print(to_string([['foo', 'goodbye'], ['llama', 'bar']]))
66 |
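The layout logic in ``to_string`` boils down to two steps: find the widest cell in each column, then right-justify every cell to that width. The following is a minimal Python 3 sketch of those two steps only. The names ``column_widths`` and ``render_rows`` are hypothetical, and the header and separator lines written by ``to_string`` are left out.

```python
def column_widths(table):
    """Return the maximum cell width of each column across all rows."""
    widths = {}
    for row in table:
        for index, cell in enumerate(row):
            widths[index] = max(widths.get(index, 0), len(cell))
    return [widths[index] for index in sorted(widths)]


def render_rows(table):
    """Right-justify each cell to its column width, one line per row."""
    widths = column_widths(table)
    lines = []
    for row_index, row in enumerate(table):
        cells = [cell.rjust(width) for cell, width in zip(row, widths)]
        lines.append("{:>3} | {}|".format(row_index, "|".join(cells)))
    return "\n".join(lines)


table = [['foo', 'goodbye'], ['llama', 'bar']]
widths = column_widths(table)
text = render_rows(table)
```

As in the original, rows shorter than the widest row are simply rendered with fewer cells, because ``zip`` stops at the shorter sequence.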
--------------------------------------------------------------------------------
/pdftables/line_segments.py:
--------------------------------------------------------------------------------
1 | """
2 | Algorithms for processing line segments
3 |
4 | segments_generator
5 |
6 | Yield segments in order of their start/end.
7 |
8 | [(1, 4), (2, 3)] => [(1, (1, 4)), (2, (2, 3)), (3, (2, 3)), (4, (1, 4))]
9 |
10 | histogram_segments
11 |
12 | Number of line segments present in each given range
13 |
14 | [(1, 4), (2, 3)] => [((1, 2), 1), ((2, 3), 2), ((3, 4), 1)]
15 |
16 | segment_histogram
17 |
18 | Binning for a histogram and the number of counts in each bin
19 |
20 | [(1, 4), (2, 3)] => [(1, 2, 3, 4), (1, 2, 1)]
21 | """
22 |
23 | from __future__ import division
24 |
25 | from collections import defaultdict, namedtuple
26 | from heapq import heappush, heapreplace, heappop
27 |
28 |
29 | class LineSegment(namedtuple("LineSegment", ["start", "end", "object"])):
30 |
31 | @classmethod
32 | def make(cls, start, end, obj=None):
33 | return cls(start, end, obj)
34 |
35 | def __repr__(self):
36 | return '{0}(start={1:6.04f} end={2:6.04f} object={3})'.format(
37 | type(self).__name__, self.start, self.end, self.object)
38 |
39 | @property
40 | def length(self):
41 | return self.end - self.start
42 |
43 | @property
44 | def midpoint(self):
45 | return (self.start + self.end) / 2
46 |
47 |
48 | def midpoint(segment):
49 | yield segment.midpoint
50 |
51 |
52 | def start_end(segment):
53 | yield segment.start
54 | yield segment.end
55 |
56 |
57 | def start_midpoint_end(segment):
58 | yield segment.start
59 | yield segment.midpoint
60 | yield segment.end
61 |
62 |
63 | def segments_generator(line_segments, to_visit=start_end):
64 | """
65 |     Given a list of segment ranges [(start, stop)...], yield the list of
66 | coordinates where any line segment starts or ends along with the line
67 | segment which starts/ends at that coordinate. The third element of the
68 | yielded tuple is True if the given segment is finishing at that point.
69 |
70 | In [1]: list(linesegments.segments_generator([(1, 4), (2, 3)]))
71 | Out[1]: [(1, (1, 4), False),
72 | (2, (2, 3), False),
73 | (3, (2, 3), True),
74 | (4, (1, 4), True)]
75 |
76 | The function ``to_visit`` specifies which positions will be visited for
77 | each segment and may be ``start_end`` or ``start_midpoint_end``.
78 |
79 | If ``to_visit`` is a dictionary instance, it is a mapping from
80 | ``type(segment)`` onto a visit function. This allows passing in both
81 | line segments and points which should be visited simultaneously, for
82 | example, to classify points according to which line segments are currently
83 | overlapped.
84 | """
85 |
86 |     # Queue contains a list of outstanding segments to process. It is a heap
87 |     # of tuples, [(position, points_to_visit, segment)], where `segment` is
88 |     # one line-segment. `position` is the coordinate at which the line-segment
89 |     # is to be considered again.
90 | queue = []
91 |
92 | _to_visit = to_visit
93 |
94 | # (Note, this has the effect of sorting line_segments in start order)
95 | for segment in line_segments:
96 | if isinstance(to_visit, dict):
97 | _to_visit = to_visit[type(segment)]
98 |
99 | # an iterator representing the points to visit for this segment
100 | points_to_visit = _to_visit(segment)
101 | try:
102 | start = points_to_visit.next()
103 | except StopIteration:
104 | continue
105 |
106 | heappush(queue, (start, points_to_visit, segment))
107 |
108 | # Process outstanding segments until there are none left
109 | while queue:
110 |
111 | # Get the next segment to consider and the position at which we're
112 | # considering it
113 | position, points_to_visit, segment = heappop(queue)
114 |
115 | try:
116 | point_next_position = points_to_visit.next()
117 | if point_next_position < position:
118 | raise RuntimeError("Malformed input: next={0} < pos={1}"
119 | .format(point_next_position, position))
120 | except StopIteration:
121 | # No more points for this segment
122 | disappearing = True
123 | else:
124 | disappearing = False
125 | heappush(queue, (point_next_position, points_to_visit, segment))
126 |
127 | yield position, segment, disappearing
128 |
129 |
130 | def histogram_segments(segments):
131 | """
132 |     Given a list of line segments, yields ((start, end), n_segments) pairs
133 |     which represent a histogram projection of the number of
134 |     segments onto a line.
135 |
136 | In [1]: list(linesegments.histogram_segments([(1, 4), (2, 3)]))
137 | Out[1]: [((1, 2), 1), ((2, 3), 2), ((3, 4), 1)]
138 | """
139 |
140 | # sum(.values()) is the number of segments within (start, end)
141 | active_segments = defaultdict(int)
142 |
143 | consider_segments = list(segments_generator(segments))
144 |
145 | # TODO(pwaller): This function doesn't need to consider the active segments
146 | # It should just maintain a counter. (expect a ~O(10%) speedup)
147 |
148 | # Look ahead to the next start, and that's the end of the interesting range
149 | for this, next in zip(consider_segments, consider_segments[1:]):
150 |
151 | # (start, end) is the range until the next segment
152 | (start, seg, disappearing), (end, _, _) = this, next
153 |
154 | # Did the segment appear or disappear? Key on the segment coordinates
155 | if not disappearing:
156 | active_segments[seg] += 1
157 |
158 | else:
159 | active_segments[seg] -= 1
160 |
161 | if start == end:
162 | # This happens if a segment appears more than once.
163 | # Then we don't care about considering this zero-length range.
164 | continue
165 |
166 | yield (start, end), sum(active_segments.values())
167 |
168 |
169 | def hat_point_generator(line_segments):
170 | """
171 | This is a hat for one line segment:
172 |
173 | /\
174 | ____/ \____
175 |
176 | | |
177 | | \_ End
178 | \_ Start
179 |
180 | position --->
181 |
182 | This generator yields at every `position` where the value of the hat
183 | function could change.
184 | """
185 |
186 | # Invariants:
187 | # * Position should be always increasing
188 | # * First and last yielded points should always be the empty set.
189 | # * All yielded positions should lie within all LineSegments in the
190 | # `active_segments` at the point it is yielded
191 | # * Each yielded set has its own id()
192 |
193 |     # Set of active segments, yielded for each unique value of `position`
194 | active_segments = set()
195 | # Set of segments which have appeared so far at this `position`
196 | new_segments = set()
197 |
198 | segments_by_position = segments_generator(
199 | line_segments, start_midpoint_end)
200 | last_position = None
201 |
202 | for position, segment, disappearing in segments_by_position:
203 |
204 | if segment.start == segment.end:
205 | # Zero-length segments are uninteresting and get skipped
206 | continue
207 |
208 | if last_position is not None and last_position != position:
209 |
210 | # Sanity check.
211 | assert all(s.start <= last_position < s.end
212 | for s in active_segments)
213 |
214 | # Copy the `active_segments` set so that the caller doesn't
215 | # accidentally end up with the same set repeatedly and can't
216 | # modify the state inside this function
217 | yield last_position, set(active_segments)
218 |
219 |             # `new_segments` are now `active_segments`.
220 | active_segments |= new_segments
221 | new_segments.clear()
222 |
223 | if disappearing:
224 | # This is the end of the segment, remove it from the active set
225 | active_segments.remove(segment)
226 | else:
227 | # Record the segment in the seen list. It might be the start or
228 | # midpoint. If it's the start, it won't be `active` until the
229 | # next iteration (unless that iteration removes it).
230 | new_segments.add(segment)
231 |
232 | last_position = position
233 |
234 | # For completeness, yield empty set at final position.
235 | yield last_position, set()
236 |
237 |
238 | def hat(segment, position):
239 | """
240 |     This function returns 1 when ``position`` is the midpoint of ``segment``
241 |     and falls off linearly to 0.5 at the start or end of the segment.
242 |
243 | /\
244 | __/ \__
245 | """
246 | h = abs((segment.midpoint - position) / segment.length)
247 | return max(0, 1 - h)
248 |
249 |
250 | def normal_hat(position, active_segments):
251 | """
252 | The ``normal_hat`` is the sum of the hat function for all active segments
253 | at this position
254 | """
255 | return sum(hat(s, position) for s in active_segments)
256 |
257 |
258 | def max_length(position, active_segments):
259 | """
260 | Returns the maximum length of any segment overlapping ``position``
261 | """
262 | if not active_segments:
263 | return None
264 | return max(s.length for s in active_segments)
265 |
266 |
267 | def normal_hat_with_max_length(position, active_segments):
268 | """
269 | Obtain both the hat value and the length of the largest line segment
270 | overlapping each "hat position".
271 | """
272 |
273 | return (normal_hat(position, active_segments),
274 | max_length(position, active_segments))
275 |
276 |
277 | def hat_generator(line_segments, value_function=normal_hat):
278 | """
279 | The purpose of this function is to find positions which text can usefully
280 | be clamped to when deciding the order in which text is visited.
281 |
282 | The hat generator returns the sum of ``hat`` at each position where any
283 | line segment's ``start``, ``midpoint`` or ``end`` is.
284 |
285 | ``value_function`` can be used to obtain different kinds of information
286 | from the ``hat_point_generator``'s points.
287 | """
288 |
289 | for position, active_segments in hat_point_generator(line_segments):
290 | yield position, value_function(position, active_segments)
291 |
292 |
293 | def segment_histogram(line_segments):
294 | """
295 | Binning for a histogram and the number of counts in each bin.
296 | Can be used to make a numpy.histogram.
297 |
298 | In [1]: list(line_segments.segment_histogram([(1, 4), (2, 3)]))
299 | Out[1]: [(1, 2, 3, 4), (1, 2, 1)]
300 | """
301 | data = list(histogram_segments(line_segments))
302 |
303 | if not data:
304 | return [(), ()]
305 |
306 | x, counts = zip(*data)
307 | starts, ends = zip(*x)
308 |
309 | return starts + (ends[-1],), counts
310 |
311 |
312 | def above_threshold(histogram, threshold):
313 | """
314 | Returns line segments covering histogram bins at or above threshold
315 | """
316 |
317 | bin_edges, bin_values = histogram
318 | edges = zip(bin_edges, bin_edges[1:])
319 |
320 | above_threshold = []
321 |
322 | for (first, second), value in zip(edges, bin_values):
323 | if value < threshold:
324 | continue
325 |
326 | if above_threshold and above_threshold[-1].end == first:
327 | # There's a previous one we can extend
328 | above_threshold[-1] = above_threshold[-1]._replace(
329 | end=second)
330 | else:
331 | # Insert a new one
332 | above_threshold.append(LineSegment(first, second, None))
333 |
334 | return above_threshold
335 |
336 |
337 | def find_peaks(position_values):
338 | """
339 | Find all points in a peaky graph which are local maxima.
340 |
341 | This function assumes that the very first and last points can't be peaks.
342 | """
343 |
344 | # Initial value is zero, up is the only direction from here!
345 | increasing = True
346 |
347 | # The loop has two states. Either we're going up, in which case when we see
348 | # a next value less than the current one, we must be at the top. At which
349 | # point it's down hill and all next values will be less than the current
350 | # one. Until the bottom is reached, at which point we're increasing again.
351 |
352 | # Note that the last `position` can never be yielded.
353 |
354 | successive_pairs = zip(position_values, position_values[1:])
355 | for (position, value), (_, next_value) in successive_pairs:
356 | if increasing:
357 | if next_value < value:
358 | # position is a peak
359 | increasing = False
360 | yield position
361 |
362 | else:
363 | if next_value > value:
364 | # position is a trough
365 | increasing = True
366 |
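The hat and peak-finding helpers above can be tried out on hand-made data. The following is a minimal sketch, not the module's own code: `Seg` is a hypothetical stand-in for `LineSegment` (assumed to expose `start`, `end`, `length` and `midpoint`), and the hat values are summed over every segment rather than only the active ones, which skips the bookkeeping done by `hat_point_generator`.

```python
from collections import namedtuple


# Hypothetical stand-in for LineSegment; the real class lives in
# line_segments.py and may differ.
class Seg(namedtuple("Seg", "start end")):
    @property
    def length(self):
        return self.end - self.start

    @property
    def midpoint(self):
        return (self.start + self.end) / 2.0


def hat(segment, position):
    # Same formula as hat() above: 1 at the midpoint, falling off linearly.
    h = abs((segment.midpoint - position) / segment.length)
    return max(0, 1 - h)


def find_peaks(position_values):
    # Same two-state scan as find_peaks() above: report a position when the
    # values stop rising, then wait for them to rise again.
    increasing = True
    pairs = zip(position_values, position_values[1:])
    for (position, value), (_, next_value) in pairs:
        if increasing and next_value < value:
            increasing = False
            yield position
        elif not increasing and next_value > value:
            increasing = True


segments = [Seg(0.0, 4.0), Seg(1.0, 3.0), Seg(10.0, 12.0)]
positions = [0.0, 1.0, 2.0, 3.0, 4.0, 10.0, 11.0, 12.0]

# Simplification: sum hat() over all segments at each position.
point_values = [(p, sum(hat(s, p) for s in segments)) for p in positions]
peaks = list(find_peaks(point_values))
print(peaks)
```

Two overlapping segments centred on 2.0 reinforce each other, so their shared midpoint is reported as one peak, and the isolated segment contributes a second peak at its own midpoint.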
--------------------------------------------------------------------------------
/pdftables/numpy_subset.py:
--------------------------------------------------------------------------------
1 | #!/usr/bin/env python
2 | # ScraperWiki Limited
3 | # Ian Hopkinson, 2013-08-02
4 | # -*- coding: utf-8 -*-
5 |
6 | """
7 | Implement numpy.diff, numpy.arange and numpy.average to remove numpy dependency
8 | """
9 |
10 | import math
11 |
12 |
13 | def diff(input_array):
14 | '''
15 | First order differences for an input array
16 |
17 | >>> diff([1,2,3,4])
18 | [1, 1, 1]
19 | '''
20 | result = []
21 | for i in range(0, len(input_array) - 1):
22 | result.append(input_array[i + 1] - input_array[i])
23 | return result
24 |
25 |
26 | def arange(start, stop, stepv):
27 | '''
28 | Generate a list of float values given start, stop and step values
29 |
30 | >>> arange(0, 2, 0.1)
31 | [ 0. , 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1. ,
32 | 1.1, 1.2, 1.3, 1.4, 1.5, 1.6, 1.7, 1.8, 1.9]
33 | '''
34 | count = int(math.ceil(float(stop - start) / float(stepv)))
35 | result = []
36 | for i in xrange(max(count, 0)):
37 | # An empty or negative range yields [] instead of raising IndexError
38 | result.append(start if i == 0 else result[-1] + stepv)
39 |
40 | return result
41 |
42 |
43 | def average(input_array):
44 | '''
45 | Average a 1D array of values
46 |
47 | >>> average([1,2,3,4])
48 | 2.5
49 | '''
50 | return float(sum(input_array)) / float(len(input_array))
51 |
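As a quick check of what these three helpers are meant to compute, here is a minimal sketch written so that it also runs on Python 3. It is an assumption-laden variant rather than the module's code: `arange` computes each value as `start + i * step` instead of accumulating steps, and it returns an empty list when the range is empty.

```python
import math


def diff(values):
    # First-order differences between neighbouring items.
    return [b - a for a, b in zip(values, values[1:])]


def arange(start, stop, step):
    # Number of steps that fit in [start, stop); never negative.
    count = max(0, int(math.ceil(float(stop - start) / float(step))))
    # Assumption: multiply rather than accumulate, to limit rounding drift.
    return [start + i * step for i in range(count)]


def average(values):
    # Arithmetic mean of a non-empty sequence.
    return float(sum(values)) / float(len(values))


print(diff([1, 2, 3, 4]))
print(len(arange(0, 2, 0.1)))
print(average([1, 2, 3, 4]))
```

Because of floating-point rounding, the values from `arange(0, 2, 0.1)` are only approximately the tidy decimals shown in the docstring above, so comparisons against them should use a tolerance.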
--------------------------------------------------------------------------------
/pdftables/patched_poppler.py:
--------------------------------------------------------------------------------
1 | #! /usr/bin/env python
2 |
3 | import ctypes
4 | import poppler
5 |
6 | from ctypes import CDLL, POINTER, c_voidp, c_double, c_uint, c_bool
7 | from ctypes import Structure, addressof
8 |
9 | from .boxes import Box, Rectangle
10 |
11 |
12 | class CRectangle(Structure):
13 | _fields_ = [
14 | ("x1", c_double),
15 | ("y1", c_double),
16 | ("x2", c_double),
17 | ("y2", c_double),
18 | ]
19 | CRectangle.ptr = POINTER(CRectangle)
20 |
21 | glib = CDLL("libpoppler-glib.so.8")
22 |
23 | g_free = glib.g_free
24 | g_free.argtypes = (c_voidp,)
25 |
26 |
27 | _c_text_layout = glib.poppler_page_get_text_layout
28 | _c_text_layout.argtypes = (c_voidp, POINTER(CRectangle.ptr), POINTER(c_uint))
29 | _c_text_layout.restype = c_bool
30 |
31 | GLYPH = poppler.SELECTION_GLYPH
32 |
33 |
34 | def poppler_page_get_text_layout(page):
35 | """
36 | Wrapper of an underlying c-api function not yet exposed by the
37 | python-poppler API.
38 |
39 | Returns a list of text rectangles on the pdf `page`
40 | """
41 |
42 | n = c_uint(0)
43 | rects = CRectangle.ptr()
44 |
45 | # From python-poppler internals it is known that hash(page) returns the
46 | # c-pointer to the underlying glib object. See also the repr(page).
47 | page_ptr = hash(page)
48 |
49 | _c_text_layout(page_ptr, rects, n)
50 |
51 | # Obtain pointer to array of rectangles of the correct length
52 | rectangles = POINTER(CRectangle * n.value).from_address(addressof(rects))
53 |
54 | get_text = page.get_selected_text
55 |
56 | poppler_rect = poppler.Rectangle()
57 |
58 | result = []
59 | for crect in rectangles.contents:
60 | # result.append(Rectangle(
61 | # x1=crect.x1, y1=crect.y1, x2=crect.x2, y2=crect.y2))
62 |
63 | _ = (crect.x1, crect.y1, crect.x2, crect.y2)
64 | poppler_rect.x1, poppler_rect.y1, poppler_rect.x2, poppler_rect.y2 = _
65 |
66 | text = get_text(GLYPH, poppler_rect).decode("utf8")
67 |
68 | if text.endswith(" \n"):
69 | text = text[:-2]
70 | elif text.endswith(" ") and len(text) > 1:
71 | text = text[:-1]
72 | elif text.endswith("\n"):
73 | text = text[:-1]
74 |
75 | rect = Box(
76 | rect=Rectangle(x1=crect.x1, y1=crect.y1, x2=crect.x2, y2=crect.y2),
77 | text=text,
78 | )
79 | result.append(rect)
80 |
81 | # TODO(pwaller): check that this free is correct
82 | g_free(rectangles)
83 |
84 | return result
85 |
--------------------------------------------------------------------------------
/pdftables/pdf_document.py:
--------------------------------------------------------------------------------
1 | """
2 | Backend abstraction for PDFDocuments
3 | """
4 |
5 | import abc
6 | import os
7 |
8 | DEFAULT_BACKEND = "poppler"
9 | BACKEND = os.environ.get("PDFTABLES_BACKEND", DEFAULT_BACKEND).lower()
10 |
11 | # TODO(pwaller): Use abstract base class?
12 | # What does it buy us? Can we enforce that only methods specified in an ABC
13 | # are used by client code?
14 |
15 |
16 | class PDFDocument(object):
17 | __metaclass__ = abc.ABCMeta
18 |
19 | @classmethod
20 | def get_backend(cls, backend=None):
21 | """
22 | Returns the PDFDocument class to use based on configuration from
23 | environment or pdf_document.BACKEND
24 | """
25 | # If `cls` is already a concrete backend subclass, use it as-is
26 | if cls is not PDFDocument:
27 | return cls
28 |
29 | if backend is None:
30 | backend = BACKEND
31 |
32 | # Imports have to go inline to avoid circular imports with the backends
33 | if backend == "pdfminer":
34 | from .pdf_document_pdfminer import PDFDocument as PDFDoc
35 | return PDFDoc
36 |
37 | elif backend == "poppler":
38 | from .pdf_document_poppler import PDFDocument as PDFDoc
39 | return PDFDoc
40 |
41 | raise NotImplementedError("Unknown backend '{0}'".format(backend))
42 |
43 | @classmethod
44 | def from_path(cls, path):
45 | Class = cls.get_backend()
46 | return Class(path)
47 |
48 | @classmethod
49 | def from_fileobj(cls, fh):
50 | # TODO(pwaller): For now, put fh into a temporary file and call
51 | # .from_path. Future: when we have a working stream input function for
52 | # poppler, use that.
53 | raise NotImplementedError(
54 | "from_fileobj is not implemented yet; use from_path instead")
55 | # Previously: Class(fh), which is wrong since constructors now take a path.
56 |
57 | def __init__(self, *args, **kwargs):
58 | raise RuntimeError(
59 | "Don't use this constructor, use a {0}.from_* method instead!"
60 | .format(self.__class__.__name__))
61 |
62 | @abc.abstractmethod
63 | def __len__(self):
64 | """
65 | Return the number of pages in the PDF
66 | """
67 |
68 | @abc.abstractmethod
69 | def get_page(self, number):
70 | """
71 | Return a PDFPage for page `number` (0 indexed!)
72 | """
73 |
74 | @abc.abstractmethod
75 | def get_pages(self):
76 | """
77 | Return all pages in the document: TODO(pwaller) move implementation here
78 | """
79 |
80 |
81 | class PDFPage(object):
82 | __metaclass__ = abc.ABCMeta
83 |
84 | @abc.abstractmethod
85 | def get_glyphs(self):
86 | """
87 | Obtain a list of bounding boxes (Box instances) for all glyphs
88 | on the page.
89 | """
90 |
91 | @abc.abstractproperty
92 | def size(self):
93 | """
94 | (width, height) of page
95 | """
96 |
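The backend selection in `get_backend` can be illustrated in isolation. The sketch below is hypothetical throughout: `FakePopplerDocument`, `FakeMinerDocument` and the `BACKENDS` dictionary stand in for the real backend modules, the toy constructor simply stores the path, and it assumes the intent is that a concrete subclass resolves to itself while the base class dispatches on a name or on `PDFTABLES_BACKEND`.

```python
import os


class PDFDocument(object):
    """Toy base class; only the dispatch logic mirrors the module above."""

    def __init__(self, path):
        self.path = path

    @classmethod
    def get_backend(cls, backend=None):
        # Assumption: a concrete subclass should resolve to itself.
        if cls is not PDFDocument:
            return cls
        if backend is None:
            backend = os.environ.get("PDFTABLES_BACKEND", "poppler").lower()
        try:
            return BACKENDS[backend]
        except KeyError:
            raise NotImplementedError(
                "Unknown backend '{0}'".format(backend))

    @classmethod
    def from_path(cls, path):
        return cls.get_backend()(path)


class FakePopplerDocument(PDFDocument):
    pass


class FakeMinerDocument(PDFDocument):
    pass


# Hypothetical registry replacing the inline imports used above.
BACKENDS = {"poppler": FakePopplerDocument, "pdfminer": FakeMinerDocument}

doc = FakeMinerDocument.from_path("example.pdf")
print(type(doc).__name__)
```

A dictionary keeps the example self-contained; the inline imports in the real module serve a different purpose, namely avoiding circular imports between the base module and its backends.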
--------------------------------------------------------------------------------
/pdftables/pdf_document_pdfminer.py:
--------------------------------------------------------------------------------
1 | #!/usr/bin/env python
2 | """
3 | PDFDocument backend based on pdfminer
4 | """
5 |
6 | import collections
7 |
8 | import pdfminer.pdfparser
9 | import pdfminer.pdfinterp
10 | import pdfminer.pdfdevice
11 | import pdfminer.layout
12 | import pdfminer.converter
13 |
14 | from .pdf_document import (
15 | PDFDocument as BasePDFDocument,
16 | PDFPage as BasePDFPage,
17 | )
18 |
19 | from .boxes import Box, BoxList, Rectangle
20 |
21 |
22 | class PDFDocument(BasePDFDocument):
23 |
24 | """
25 | pdfminer implementation of PDFDocument
26 | """
27 |
28 | @staticmethod
29 | def _initialise(file_handle):
30 |
31 | (doc, parser) = (pdfminer.pdfparser.PDFDocument(),
32 | pdfminer.pdfparser.PDFParser(file_handle))
33 |
34 | parser.set_document(doc)
35 | doc.set_parser(parser)
36 |
37 | doc.initialize('')
38 | if not doc.is_extractable:
39 | raise ValueError(
40 | "pdfminer.pdfparser.PDFDocument is_extractable != True")
41 | la_params = pdfminer.layout.LAParams()
42 | la_params.word_margin = 0.0
43 |
44 | resource_manager = pdfminer.pdfinterp.PDFResourceManager()
45 | aggregator = pdfminer.converter.PDFPageAggregator(
46 | resource_manager, laparams=la_params)
47 |
48 | interpreter = pdfminer.pdfinterp.PDFPageInterpreter(
49 | resource_manager, aggregator)
50 |
51 | return doc, interpreter, aggregator
52 |
53 | def __init__(self, file_path):
54 | self._pages = None
55 |
56 | self._file_handle = open(file_path, "rb")
57 |
58 | result = self._initialise(self._file_handle)
59 | (self._doc, self._interpreter, self._device) = result
60 |
61 | def __len__(self):
62 | return len(self.get_pages())
63 |
64 | def get_creator(self):
65 | return self._doc.info[0]['Creator'] # TODO: what's doc.info ?
66 |
67 | def get_pages(self):
68 | """
69 | Returns a list of lazy pages (parsed on demand)
70 | """
71 | if not self._pages:
72 | self._construct_pages()
73 |
74 | return self._pages
75 |
76 | def _construct_pages(self):
77 | self._pages = [PDFPage(self, page) for page in self._doc.get_pages()]
78 |
79 | def get_page(self, page_number):
80 | """
81 | 0-based page getter
82 | """
83 | pages = self.get_pages()
84 | if 0 <= page_number < len(pages):
85 | return pages[page_number]
86 | raise IndexError("Invalid page. Reminder: get_page() is 0-indexed "
87 | "(there are {0} pages)!".format(len(pages)))
88 |
89 |
90 | def children(obj):
91 | """
92 | get all descendants of nested iterables
93 | """
94 | if isinstance(obj, collections.Iterable):
95 | for child in obj:
96 | for node in children(child):
97 | yield node
98 | yield obj
99 |
100 |
101 | class PDFPage(BasePDFPage):
102 |
103 | """
104 | pdfminer implementation of PDFPage
105 | """
106 |
107 | def __init__(self, parent_pdf_document, page):
108 | assert isinstance(page, pdfminer.pdfparser.PDFPage), page.__class__
109 |
110 | self.pdf_document = parent_pdf_document
111 | self._page = page
112 | self._cached_lt_page = None
113 |
114 | @property
115 | def size(self):
116 | x1, y1, x2, y2 = self._page.mediabox
117 | return x2 - x1, y2 - y1
118 |
119 | def get_glyphs(self):
120 | """
121 | Return a BoxList of the glyphs on this page.
122 | """
123 |
124 | items = children(self._lt_page())
125 |
126 | def keep(o):
127 | return isinstance(o, pdfminer.layout.LTChar)
128 |
129 | _, page_height = self.size
130 |
131 | def make_box(obj):
132 | # TODO(pwaller): Note: is `self._page.rotate` taken into account?
133 |
134 | # pdfminer gives coordinates such that y=0 is the bottom of the
135 | # page. Our algorithms expect y=0 is the top of the page, so..
136 | left, bottom, right, top = obj.bbox
137 | return Box(
138 | rect=Rectangle(
139 | x1=left, x2=right,
140 | y1=page_height - top,
141 | y2=page_height - bottom,
142 | ),
143 | text=obj.get_text()
144 | )
145 |
146 | return BoxList(make_box(obj) for obj in items if keep(obj))
147 |
148 | def _lt_page(self):
149 | if not self._cached_lt_page:
150 | self._parse_page()
151 | return self._cached_lt_page
152 |
153 | def _parse_page(self):
154 | self.pdf_document._interpreter.process_page(self._page)
155 | self._cached_lt_page = self.pdf_document._device.get_result()
156 | assert isinstance(self._cached_lt_page, pdfminer.layout.LTPage), (
157 | self._cached_lt_page.__class__)
158 |
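The coordinate flip performed in `make_box` is easy to get backwards, so a small worked example may help. This is a sketch under stated assumptions: `Rect` is a hypothetical stand-in for `boxes.Rectangle`, and the page height of 792 points is just an illustrative US Letter value.

```python
from collections import namedtuple

# Hypothetical stand-in for boxes.Rectangle.
Rect = namedtuple("Rect", "x1 y1 x2 y2")


def flip_bbox(bbox, page_height):
    # pdfminer-style bbox: (left, bottom, right, top) with y=0 at the page
    # bottom. The result has y=0 at the page top, so y1 (top) < y2 (bottom).
    left, bottom, right, top = bbox
    return Rect(x1=left, y1=page_height - top,
                x2=right, y2=page_height - bottom)


flipped = flip_bbox((72.0, 700.0, 80.0, 712.0), page_height=792.0)
print(flipped)
```

A glyph near the top of the page in bottom-origin coordinates ends up with small `y1` and `y2` values after the flip, and its height (`y2 - y1`) is unchanged.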
--------------------------------------------------------------------------------
/pdftables/pdf_document_poppler.py:
--------------------------------------------------------------------------------
1 | from ctypes import CDLL, POINTER, c_voidp, c_double, c_uint, c_bool
2 | from ctypes import Structure, addressof, pointer
3 | from os.path import abspath
4 |
5 | import gobject
6 | try:
7 | import poppler
8 | except ImportError:
9 | print "Poppler unavailable! Please install it."
10 | print " sudo apt-get install python-poppler"
11 | raise
12 |
13 | from . import patched_poppler
14 |
15 | from .boxes import Box, BoxList
16 | from .pdf_document import (
17 | PDFDocument as BasePDFDocument,
18 | PDFPage as BasePDFPage,
19 | )
20 |
21 |
22 | class PDFDocument(BasePDFDocument):
23 |
24 | def __init__(self, file_path, password=""):
25 | uri = "file://{0}".format(abspath(file_path))
26 | self._poppler = poppler.document_new_from_file(uri, password)
27 |
28 | def __len__(self):
29 | return self._poppler.get_n_pages()
30 |
31 | def get_page(self, n):
32 | return PDFPage(self, n)
33 |
34 | def get_pages(self):
35 | return [self.get_page(i) for i in xrange(len(self))]
36 |
37 |
38 | class PDFPage(BasePDFPage):
39 |
40 | def __init__(self, doc, n):
41 | self._poppler = doc._poppler.get_page(n)
42 |
43 | @property
44 | def size(self):
45 | return self._poppler.get_size()
46 |
47 | def get_glyphs(self):
48 | # TODO(pwaller): Result of this should be memoized onto the PDFPage
49 | # instance.
50 |
51 | gtl = patched_poppler.poppler_page_get_text_layout
52 | rectangles = gtl(self._poppler)
53 |
54 | return BoxList(rectangles)
55 |
56 | # TODO(pwaller): Salvage this.
57 | #
58 | # Poppler seems to lie to us because the assertion below fails.
59 | # It should return the same number of rectangles as there are
60 | # characters in the text, but it does not.
61 | # See:
62 | #
63 | # http://www.mail-archive.com/poppler
64 | # @lists.freedesktop.org/msg06245.html
65 | # https://github.com/scraperwiki/pdftables/issues/89
66 | # https://bugs.freedesktop.org/show_bug.cgi?id=69608
67 |
68 | text = self._poppler.get_text().decode("utf8")
69 |
70 | # assert len(text) == len(rectangles), (
71 | # "t={0}, r={1}".format(len(text), len(rectangles)))
72 |
73 | # assert False
74 |
75 | return BoxList(Box(rect=rect, text=character)
76 | for rect, character in zip(rectangles, text))
77 |
--------------------------------------------------------------------------------
/pdftables/pdftables.py:
--------------------------------------------------------------------------------
1 | """
2 | pdftables public interface
3 | """
4 |
5 | from __future__ import unicode_literals
6 | """
7 | Some experiments with pdfminer
8 | http://www.unixuser.org/~euske/python/pdfminer/programming.html
9 | Some help here:
10 | http://denis.papathanasiou.org/2010/08/04/extracting-text-images-from-pdf-files
11 | """
12 |
13 | # TODO(IanHopkinson) Identify multi-column text, for multicolumn text detect
14 | # per column
15 | # TODO(IanHopkinson) Dynamic / smarter thresholding
16 | # TODO(IanHopkinson) Handle argentina_diputados_voting_record.pdf automatically
17 | # TODO(IanHopkinson) Handle multiple tables on one page
18 |
19 | __all__ = ["get_tables", "page_to_tables", "page_contains_tables"]
20 |
21 | import codecs
22 | import collections
23 | import math
24 | import sys
25 |
26 | import numpy_subset
27 |
28 | from bisect import bisect_left
29 | from collections import defaultdict
30 | from counter import Counter
31 | from cStringIO import StringIO
32 | from operator import attrgetter
33 |
34 | from .boxes import Box, BoxList, Rectangle
35 | from .config_parameters import ConfigParameters
36 | from .line_segments import (segment_histogram, above_threshold, hat_generator,
37 | find_peaks, normal_hat_with_max_length,
38 | midpoint, start_end, LineSegment,
39 | segments_generator)
40 | from .pdf_document import PDFDocument, PDFPage
41 |
42 | IS_TABLE_COLUMN_COUNT_THRESHOLD = 3
43 | IS_TABLE_ROW_COUNT_THRESHOLD = 3
44 |
45 | LEFT = 0
46 | TOP = 3
47 | RIGHT = 2
48 | BOTTOM = 1
49 |
50 |
51 | class Table(object):
52 |
53 | """
54 | Represents a single table on a PDF page.
55 | """
56 |
57 | def __init__(self):
58 | # TODO(pwaller): populate this from pdf_page.number
59 | self.page_number = None
60 | self.bounding_box = None
61 | self.glyphs = None
62 | self.edges = None
63 | self.row_edges = None
64 | self.column_edges = None
65 | self.data = None
66 |
67 | def __repr__(self):
68 | d = self.data
69 | if d is not None:
70 | # TODO(pwaller): Compute this in a better way.
71 | h = len(d)
72 | w = len(d[0])
73 | return "<Table (w, h)=({0}, {1})>".format(w, h)
74 | else:
75 | return "<Table [empty]>"
76 |
77 |
78 | class TableContainer(object):
79 |
80 | """
81 | Represents a collection of tables on a PDF page.
82 | """
83 |
84 | def __init__(self):
85 | self.tables = []
86 |
87 | self.original_page = None
88 | self.page_size = None
89 | self.bounding_boxes = None
90 | self.all_glyphs = None
91 |
92 | def add(self, table):
93 | self.tables.append(table)
94 |
95 | def __repr__(self):
96 | return "TableContainer(" + repr(self.__dict__) + ")"
97 |
98 | def __iter__(self):
99 | return iter(self.tables)
100 |
101 |
102 | def get_tables(fh):
103 | """
104 | Return a list of 'tables' from the given file handle, where a table is a
105 | list of rows, and a row is a list of strings.
106 | """
107 | pdf = PDFDocument.from_fileobj(fh)
108 | return get_tables_from_document(pdf)
109 |
110 |
111 | def get_tables_from_document(pdf_document):
112 | """
113 | Return a list of 'tables' from the given PDFDocument, where a table is a
114 | list of rows, and a row is a list of strings.
115 | """
116 | raise NotImplementedError("This interface hasn't been fixed yet, sorry!")
117 |
118 | result = []
119 |
120 | config = ConfigParameters()
121 |
122 | # TODO(pwaller): Return one table container with all tables on it?
123 |
124 | for i, pdf_page in enumerate(pdf_document.get_pages()):
125 | if not page_contains_tables(pdf_page):
126 | continue
127 |
128 | tables = page_to_tables(pdf_page, config)
129 |
130 | # crop_table(table)
131 | #result.append(Table(table, i + 1, len(pdf_document), 1, 1))
132 |
133 | return result
134 |
135 |
136 | def crop_table(table):
137 | """
138 | Remove empty rows from the top and bottom of the table.
139 |
140 | TODO(pwaller): We may need functionality similar to this, or not?
141 | """
142 | for row in list(table): # top -> bottom
143 | if not any(cell.strip() for cell in row):
144 | table.remove(row)
145 | else:
146 | break
147 |
148 | for row in list(reversed(table)): # bottom -> top
149 | if not any(cell.strip() for cell in row):
150 | table.remove(row)
151 | else:
152 | break
153 |
154 |
155 | def page_contains_tables(pdf_page):
156 | if not isinstance(pdf_page, PDFPage):
157 | raise TypeError("Page must be PDFPage, not {}".format(
158 | pdf_page.__class__))
159 |
160 | # TODO(pwaller):
161 | #
162 | # I would prefer if this function was defined in terms of `page_to_tables`
163 | # so that the logic cannot diverge.
164 | #
165 | # What should the test be?
166 | # len(page_to_tables(page)) > 0?
167 | # Number of tables left after filtering ones that have no data > 0?
168 |
169 | box_list = pdf_page.get_glyphs()
170 |
171 | boxtop = attrgetter("top")
172 | yhist = box_list.histogram(boxtop).rounder(1)
173 | test = [k for k, v in yhist.items() if v > IS_TABLE_COLUMN_COUNT_THRESHOLD]
174 | return len(test) > IS_TABLE_ROW_COUNT_THRESHOLD
175 |
176 |
177 | def make_words(glyphs):
178 | """
179 | A word is a series of glyphs which are visually connected to each other.
180 | """
181 |
182 | def ordering(box):
183 | assert box.barycenter_y is not None, (
184 | "Box belongs to no barycenter. Has assign_barycenters been run?")
185 | return (box.barycenter_y, box.center_x)
186 |
187 | words = []
188 | glyphs = [g for g in glyphs if g.barycenter_y is not None]
189 |
190 | for glyph in sorted(glyphs, key=ordering):
191 |
192 | if len(words) > 0 and words[-1].is_connected_to(glyph):
193 | words[-1].extend(glyph)
194 |
195 | else:
196 | words.append(Box.copy(glyph))
197 |
198 | return words
199 |
200 |
201 | def page_to_tables(pdf_page, config=None):
202 | """
203 | The central algorithm of pdftables: find all the tables on ``pdf_page`` and
204 | return them in a ``TableContainer``.
205 |
206 | The algorithm is steered with ``config`` which is of type
207 | ``ConfigParameters``
208 | """
209 |
210 | if config is None:
211 | config = ConfigParameters()
212 |
213 | # Avoid local variables; instead use properties of the
214 | # `tables` object, so that they are exposed for debugging and
215 | # visualisation.
216 |
217 | tables = TableContainer()
218 |
219 | tables.original_page = pdf_page
220 | tables.page_size = pdf_page.size
221 | tables.all_glyphs = pdf_page.get_glyphs()
222 |
223 | tables._x_segments, tables._y_segments = tables.all_glyphs.line_segments()
224 | # Find candidate text centerlines and compute some properties of them.
225 | (tables._y_point_values,
226 | tables._center_lines,
227 | tables._barycenter_maxheights) = (
228 | determine_text_centerlines(tables._y_segments)
229 | )
230 |
231 | assign_barycenters(tables._y_segments,
232 | tables._center_lines,
233 | tables._barycenter_maxheights)
234 |
235 | # Note: word computation must come after barycenter computation
236 | tables.all_words = make_words(tables.all_glyphs)
237 |
238 | tables.bounding_boxes = find_bounding_boxes(tables.all_glyphs, config)
239 |
240 | for box in tables.bounding_boxes:
241 | table = Table()
242 | table.bounding_box = box
243 |
244 | table.glyphs = tables.all_glyphs.inside(box)
245 |
246 | if len(table.glyphs) == 0:
247 | # If this happens, then find_bounding_boxes returned somewhere with
248 | # no glyphs inside it. Wat.
249 | raise RuntimeError("This is an empty table bounding box. "
250 | "That shouldn't happen.")
251 |
252 | # Fetch line-segments
253 |
254 | # h is lines with fixed y, multiple x values
255 | # v is lines with fixed x, multiple y values
256 | # TODO(pwaller): compute for whole page, then get subset belonging to
257 | # this table.
258 | table._x_segments, table._y_segments = table.glyphs.line_segments()
259 |
260 | # Histogram them
261 | xs = table._x_glyph_histogram = segment_histogram(table._x_segments)
262 | ys = table._y_glyph_histogram = segment_histogram(table._y_segments)
263 |
264 | thres_nc = config.n_glyph_column_threshold
265 | thres_nr = config.n_glyph_row_threshold
266 | # Threshold them
267 | xs = table._x_threshold_segs = above_threshold(xs, thres_nc)
268 | ys = table._y_threshold_segs = above_threshold(ys, thres_nr)
269 |
270 | # Compute edges (the set of edges used to be called a 'comb')
271 | edges = compute_cell_edges(box, xs, ys, config)
272 | table.column_edges, table.row_edges = edges
273 |
274 | if table.column_edges and table.row_edges:
275 | table.data = compute_table_data(table)
276 | else:
277 | table.data = None
278 |
279 | tables.add(table)
280 |
281 | return tables
282 |
283 |
284 | def determine_text_centerlines(v_segments):
285 | """
286 | Find candidate centerlines to snap glyphs to.
287 | """
288 |
289 | _ = hat_generator(v_segments, value_function=normal_hat_with_max_length)
290 | y_hat_points = list(_)
291 |
292 | if not y_hat_points:
293 | # No text on the page?
294 | return [], [], []
295 |
296 | points, values_maxlengths = zip(*y_hat_points)
297 | values, max_lengths = zip(*values_maxlengths)
298 |
299 | point_values = zip(points, values)
300 |
301 | # y-positions of "good" center lines vertically
302 | # ("good" is determined using the /\ ("hat") function)
303 | center_lines = list(find_peaks(point_values))
304 |
305 | # mapping of y-position (at each hat-point) to maximum glyph
306 | # height over that point
307 | barycenter_maxheights = dict(
308 | (barycenter, maxheight)
309 | for barycenter, maxheight in zip(points, max_lengths)
310 | if maxheight is not None)
311 |
312 | return point_values, center_lines, barycenter_maxheights
313 |
314 |
315 | def find_bounding_boxes(glyphs, config):
316 | """
317 | Returns a list of bounding boxes, one per table.
318 | """
319 |
320 | # TODO(pwaller): One day, this function will find more than one table.
321 |
322 | th, bh = config.table_top_hint, config.table_bottom_hint
323 | assert(glyphs is not None)
324 | bbox = find_table_bounding_box(glyphs, th, bh)
325 |
326 | if bbox is Box.empty_box:
327 | return []
328 |
329 | # Return the one table's bounding box.
330 | return [bbox]
331 |
332 |
333 | def compute_cell_edges(box, h_segments, v_segments, config):
334 | """
335 | Determines edges of cell content horizontally and vertically. It
336 | works by binning and thresholding the resulting histogram for
337 | each of the two axes (x and y).
338 | """
339 |
340 | # TODO(pwaller): shove this on a config?
341 | # these need better names before being part of a public API.
342 | # They specify the minimum amount of space between "threshold-segments"
343 | # in the histogram of line segments and the minimum length, otherwise
344 | # they are not considered a gap.
345 | # units= pdf "points"
346 | minimum_segment_size = 0.5
347 | minimum_gap_size = 0.5
348 |
349 | def gap_midpoints(segments):
350 | return [(b.start + a.end) / 2.
351 | for a, b in zip(segments, segments[1:])
352 | if b.start - a.end > minimum_gap_size
353 | and b.length > minimum_segment_size
354 | ]
355 |
356 | column_edges = [box.left] + gap_midpoints(h_segments) + [box.right]
357 | row_edges = [box.top] + gap_midpoints(v_segments) + [box.bottom]
358 |
359 | return column_edges, row_edges
360 |
361 |
362 | def compute_table_data(table):
363 | """
364 | Compute the final table data and return a list of lists.
365 | `table` should have been prepared with a list of glyphs, and a
366 | list of row_edges and column_edges (see the calling sequence in
367 | `page_to_tables`).
368 | """
369 |
370 | ncolumns = len(table.column_edges)
371 | nrows = len(table.row_edges)
372 |
373 | # This contains a list of `boxes` at each table cell
374 | box_table = [[list() for i in range(ncolumns)] for j in range(nrows)]
375 |
376 | for box in table.glyphs:
377 | if box.text is None:
378 | # Glyph has no text, ignore it.
379 | continue
380 |
381 | x, y = box.center_x, box.center_y
382 |
383 | # Compute index of "gap" between two combs, rather than the comb itself
384 | col = bisect_left(table.column_edges, x)
385 | row = bisect_left(table.row_edges, y)
386 |
387 | # If this blows up, please check what "box" is when it blows up.
388 | # Is it a "\n" ?
389 | box_table[row][col].append(box)
390 |
391 | def compute_text(boxes):
392 |
393 | def ordering(box):
394 | return (box.barycenter_y, box.center_x)
395 | sorted_boxes = sorted(boxes, key=ordering)
396 |
397 | result = []
398 |
399 | for this, next in zip(sorted_boxes, sorted_boxes[1:] + [None]):
400 | result.append(this.text)
401 | if next is None:
402 | continue
403 | centerline_distance = next.center_y - this.center_y
404 | # Maximum separation is when the two barycenters are far enough
405 | # away that the two characters don't overlap anymore
406 | max_separation = this.height / 2 + next.height / 2
407 |
408 | if centerline_distance >= max_separation:
409 | result.append('\n')
410 |
411 | return ''.join(result)
412 |
413 | table_array = []
414 | for row in box_table:
415 | table_array.append([compute_text(boxes) for boxes in row])
416 |
417 | return table_array
418 |
419 |
420 | def find_table_bounding_box(box_list, table_top_hint, table_bottom_hint):
421 | """
422 | Returns one bounding box (minx, maxx, miny, maxy) for tables based
423 | on a boxlist
424 | """
425 |
426 | # TODO(pwaller): These closures are here to make it clear how these things
427 | # belong together. At some point it may get broken apart
428 | # again, or simplified.
429 |
430 | def threshold_above(hist, threshold_value):
431 | """
432 | >>> threshold_above(Counter({518: 10, 520: 20, 530: 20, \
433 | 525: 17}), 15)
434 | [520, 530, 525]
435 | """
436 | if not isinstance(hist, Counter):
437 | raise ValueError("requires Counter") # TypeError then?
438 |
439 | above = [k for k, v in hist.items() if v > threshold_value]
440 | return above
441 |
442 | def threshold_y():
443 | """
444 | Try to reduce the y range with a threshold.
445 | """
446 |
447 | return box_list.bounds()
448 |
449 | # TODO(pwaller): Reconcile the below code
450 |
451 | boxtop = attrgetter("top")
452 | boxbottom = attrgetter("bottom")
453 |
454 | # Note: (pwaller) this rounding excludes stuff I'm not sure we should
455 | # be excluding. (e.g, in the unica- dataset)
456 |
457 | yhisttop = box_list.histogram(boxtop).rounder(2)
458 | yhistbottom = box_list.histogram(boxbottom).rounder(2)
459 |
460 | try:
461 | # TODO(pwaller): fix this, remove except block
462 | threshold = IS_TABLE_COLUMN_COUNT_THRESHOLD
463 | miny = min(threshold_above(yhisttop, threshold))
464 | # and the top of the top cell
465 | maxy = max(threshold_above(yhistbottom, threshold))
466 | except ValueError:
467 | # Value errors raised when min and/or max fed empty lists
468 | miny = None
469 | maxy = None
470 | #raise ValueError("table_threshold caught nothing")
471 |
472 | return Box(Rectangle(
473 | x1=float("-inf"),
474 | y1=miny,
475 | x2=float("+inf"),
476 | y2=maxy,
477 | ))
478 |
479 | def hints_y():
480 | miny = float("-inf")
481 | maxy = float("+inf")
482 |
483 | glyphs = [glyph for glyph in box_list if glyph.text is not None]
484 |
485 | if table_top_hint:
486 | top_box = [box for box in glyphs if table_top_hint in box.text]
487 | if top_box:
488 | miny = top_box[0].top
489 |
490 | if table_bottom_hint:
491 | bottomBox = [box for box in glyphs
492 | if table_bottom_hint in box.text]
493 | if bottomBox:
494 | maxy = bottomBox[0].bottom
495 |
496 | return Box(Rectangle(
497 | x1=float("-inf"),
498 | y1=miny,
499 | x2=float("+inf"),
500 | y2=maxy,
501 | ))
502 |
503 | bounds = box_list.bounds()
504 | threshold_bounds = threshold_y()
505 | hinted_bounds = hints_y()
506 |
507 | return bounds.clip(threshold_bounds, hinted_bounds)
508 |
509 |
510 | class Baseline(LineSegment):
511 | pass
512 |
513 |
514 | def assign_barycenters(y_segments, barycenters, barycenter_heightmap):
515 | """
516 | Assign each glyph's .barycenter and .barycenter_y to its "preferred"
517 | barycenter. "Preferred" is currently defined as closest.
518 |
519 | Here we use the term "barycenter" because it is the center of the glyph
520 | weighted according to nearby glyphs in Y. It is used to determine which
521 | word each glyph belongs to, and in which order text is inserted into a cell.
522 | """
523 | result = list()
524 | # Compute a list of barycenter line segments, each long enough to
525 | # overlap all glyphs which are long enough to overlap it.
526 | for barycenter in barycenters:
527 | maxheight = barycenter_heightmap[barycenter]
528 | result.append(Baseline.make(barycenter - maxheight / 2,
529 | barycenter + maxheight / 2))
530 |
531 | to_visit = {LineSegment: midpoint, Baseline: start_end}
532 |
533 | active_barycenters = set()
534 |
535 | segments = segments_generator(result + y_segments, to_visit)
536 | for position, glyph, disappearing in segments:
537 | # Maintain a list of barycenters vs position
538 | if isinstance(glyph, Baseline):
539 | if not disappearing:
540 | active_barycenters.add(glyph)
541 | else:
542 | active_barycenters.remove(glyph)
543 | continue
544 |
545 | if len(active_barycenters) == 0:
546 | # There is no barycenter this glyph might belong to.
547 | # TODO(pwaller): huh? This should surely never happen.
548 | # Investigate why by turning this assert on.
549 | # assert False
550 | continue
551 |
552 | # Pick the barycenter closest to our position (== glyph.center_y)
553 | barycenter = min(active_barycenters,
554 | key=lambda b: abs(b.midpoint - position))
555 |
556 | glyph.object.barycenter = barycenter
557 | glyph.object.barycenter_y = barycenter.midpoint
558 |
--------------------------------------------------------------------------------
/pdftables/scripts/__init__.py:
--------------------------------------------------------------------------------
1 |
2 |
--------------------------------------------------------------------------------
/pdftables/scripts/render.py:
--------------------------------------------------------------------------------
1 | #!/usr/bin/env python
2 |
3 | """pdftables-render: obtain pdftables debugging information from pdfs
4 |
5 | Usage:
6 |     pdftables-render [options] <pdfpath>...
7 |     pdftables-render (-h | --help)
8 |     pdftables-render --version
9 |
10 | Example page number lists:
11 |
12 |     <pdfpath> may contain a [:page-number-list].
13 |
14 |     pdftables-render my.pdf:1
15 |     pdftables-render my.pdf:2,5-10,15-
16 |
17 | Example JSON config options:
18 |
19 |     '{ "n_glyph_column_threshold": 3, "n_glyph_row_threshold": 5 }'
20 |
21 | Options:
22 |     -h --help                   Show this screen.
23 |     --version                   Show version.
24 |     -D --debug                  Additional debug information
25 |     -O --output-dir=<path>      Path to write debug data to
26 |     -a --ascii                  Show ascii table
27 |     -p --pprint                 pprint.pprint() the table
28 |     -i --interactive            jump into an interactive debugger (ipython)
29 |     -c --config=<json>          JSON object of config parameters
30 |
31 | """
32 |
33 | # Use $ pip install --user --editable pdftables
34 | # to install this util in your path.
35 |
36 | import sys
37 | import os
38 | import json
39 |
40 | import pdftables
41 |
42 | from os.path import basename
43 | from pprint import pprint
44 |
45 | from docopt import docopt
46 |
47 | from pdftables.pdf_document import PDFDocument
48 | from pdftables.diagnostics import render_page, make_annotations
49 | from pdftables.display import to_string
50 | from pdftables.pdftables import page_to_tables
51 | from pdftables.config_parameters import ConfigParameters
52 |
53 |
54 | def main(args=None):
55 |
56 |     if args is not None:
57 |         argv = args
58 |     else:
59 |         argv = sys.argv[1:]
60 |
61 |     arguments = docopt(
62 |         __doc__,
63 |         argv=argv,
64 |         version='pdftables-render experimental')
65 |
66 |     if arguments["--debug"]:
67 |         print(arguments)
68 |
69 |     if arguments["--config"]:
70 |         kwargs = json.loads(arguments["--config"])
71 |     else:
72 |         kwargs = {}
73 |     config = ConfigParameters(**kwargs)
74 |
75 |     for pdf_filename in arguments["<pdfpath>"]:
76 |         render_pdf(arguments, pdf_filename, config)
77 |
78 |
79 | def ensure_dirs():
80 | try:
81 | os.mkdir('png')
82 | except OSError:
83 | pass
84 |
85 | try:
86 | os.mkdir('svg')
87 | except OSError:
88 | pass
89 |
90 |
91 | def parse_page_ranges(range_string, npages):
92 | ranges = range_string.split(',')
93 | result = []
94 |
95 | def string_to_pagenumber(s):
96 | if s == "":
97 | return npages
98 | return int(s)
99 |
100 | for r in ranges:
101 | if '-' not in r:
102 | result.append(int(r))
103 | else:
104 | # Parse both end points; an empty one (e.g. "15-") is read as npages.
105 | points = [string_to_pagenumber(x) for x in r.split('-')]
106 |
107 | if len(points) == 2:
108 | start, end = points
109 | else:
110 | raise RuntimeError(
111 | "Malformed range string: {0}"
112 | .format(range_string))
113 |
114 | # Plus one because it's (start, end) inclusive
115 | result.extend(xrange(start, end + 1))
116 |
117 | # Convert from one based to zero based indices
118 | return [x - 1 for x in result]
119 |
120 |
121 | def render_pdf(arguments, pdf_filename, config):
122 | ensure_dirs()
123 |
124 | page_range_string = ''
125 | page_set = []
126 | if ':' in pdf_filename:
127 | pdf_filename, page_range_string = pdf_filename.split(':')
128 |
129 | doc = PDFDocument.from_path(pdf_filename)
130 |
131 | if page_range_string:
132 | page_set = parse_page_ranges(page_range_string, len(doc))
133 |
134 | for page_number, page in enumerate(doc.get_pages()):
135 | if page_set and page_number not in page_set:
136 | # Page ranges have been specified by the user, and this page is not in them
137 | continue
138 |
139 | svg_file = 'svg/{0}_{1:02d}.svg'.format(
140 | basename(pdf_filename), page_number)
141 | png_file = 'png/{0}_{1:02d}.png'.format(
142 | basename(pdf_filename), page_number)
143 |
144 | table_container = page_to_tables(page, config)
145 | annotations = make_annotations(table_container)
146 |
147 | render_page(
148 | pdf_filename, page_number, annotations, svg_file, png_file)
149 |
150 | print "Rendered", svg_file, png_file
151 |
152 | if arguments["--interactive"]:
153 | from ipdb import set_trace
154 | set_trace()
155 |
156 | for table in table_container:
157 |
158 | if arguments["--ascii"]:
159 | print to_string(table.data)
160 |
161 | if arguments["--pprint"]:
162 | pprint(table.data)
163 |
164 |
165 |
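# Illustrative sketch only, not part of the original script: a self-contained
# restatement of the page-list syntax documented in the module docstring
# ("my.pdf:2,5-10,15-"), written so that it runs on Python 2 and Python 3.
# The name page_indices_sketch is hypothetical. Unlike parse_page_ranges()
# above, this sketch does not accept an empty start point.


def page_indices_sketch(range_string, npages):
    """Return zero-based page indices for a string such as "2,5-10,15-"."""
    indices = []
    for part in range_string.split(','):
        if '-' not in part:
            # A single one-based page number.
            indices.append(int(part) - 1)
            continue
        endpoints = part.split('-')
        if len(endpoints) != 2:
            raise ValueError(
                "Malformed range string: {0}".format(range_string))
        start = int(endpoints[0])
        # An empty end point ("15-") means "through the last page".
        end = int(endpoints[1]) if endpoints[1] else npages
        # Inclusive one-based range, converted to zero-based indices.
        indices.extend(range(start - 1, end))
    return indices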
--------------------------------------------------------------------------------
/render_all.sh:
--------------------------------------------------------------------------------
1 | #!/bin/sh
2 |
3 | for pdf in fixtures/sample_data/*.pdf
4 | do
5 | printf -- "---** %s **---\n" "$pdf"
6 | pdftables-render "$pdf"
7 | done
8 |
--------------------------------------------------------------------------------
/requirements.txt:
--------------------------------------------------------------------------------
1 | pdfminer>=20110515
2 | requests>=1.1.0
3 | matplotlib>=1.1.1
4 |
--------------------------------------------------------------------------------
/setup.py:
--------------------------------------------------------------------------------
1 | from setuptools import setup, find_packages
2 |
3 | long_desc = """
4 | PDFTables helps with extracting tables from PDF files.
5 | """
6 | # See https://pypi.python.org/pypi?%3Aaction=list_classifiers for classifiers
7 |
8 | conf = dict(name='pdftables',
9 | version='0.0.3',
10 | description="Parses PDFs and extracts what it believes to be tables.",
11 | long_description=long_desc,
12 | classifiers=[
13 | "Development Status :: 7 - Inactive",
14 | "Intended Audience :: Developers",
15 | "License :: OSI Approved :: BSD License",
16 | "Operating System :: POSIX :: Linux",
17 | "Programming Language :: Python",
18 | ],
19 | keywords='',
20 | author='ScraperWiki Ltd',
21 | author_email='feedback@scraperwiki.com',
22 | url='http://scraperwiki.com',
23 | license='BSD',
24 | packages=find_packages(exclude=['ez_setup', 'examples', 'tests']),
25 | namespace_packages=[],
26 | include_package_data=False,
27 | zip_safe=False,
28 | install_requires=[
29 | 'pdfminer>=20110515',
30 | 'docopt>=0.6',
31 | ],
32 | tests_require=[],
33 | entry_points={
34 | 'console_scripts': [
35 | 'pdftables-render = pdftables.scripts.render:main',
36 | ]
37 | })
38 |
39 | if __name__ == '__main__':
40 | setup(**conf)
41 |
--------------------------------------------------------------------------------
/test/fixtures.py:
--------------------------------------------------------------------------------
1 |
2 | from os.path import abspath, dirname, join as pjoin
3 |
4 | from pdftables.pdf_document import PDFDocument
5 |
6 | memoized = {}
7 |
8 |
9 | def fixture(filename):
10 | """
11 | Obtain a PDFDocument for fixtures/sample_data/{filename}, memoizing the
12 | return result.
13 | """
14 | global memoized
15 |
16 | if filename in memoized:
17 | return memoized.get(filename)
18 | here = abspath(dirname(__file__))
19 | fn = pjoin(here, "..", "fixtures", "sample_data", filename)
20 | memoized[filename] = PDFDocument.from_path(fn)
21 | return memoized[filename]
22 |
--------------------------------------------------------------------------------
/test/test_Table_class.py:
--------------------------------------------------------------------------------
1 | #!/usr/bin/env python
2 | # ScraperWiki Limited
3 | # Ian Hopkinson, 2013-07-30
4 | # -*- coding: utf-8 -*-
5 |
6 | """
7 | Tests the Table class which contains metadata
8 | """
9 | import sys
10 |
11 | from pdftables.pdftables import get_tables_from_document
12 |
13 | from fixtures import fixture
14 |
15 | from nose.tools import *
16 | import nose
17 |
18 |
19 | def test_it_includes_page_numbers():
20 | """
21 | page_number is 1-indexed, as defined in the PDF format
22 | table_number is 1-indexed
23 | """
24 | doc = fixture('AnimalExampleTables.pdf')
25 | try:
26 | result = get_tables_from_document(doc)
27 | except NotImplementedError, e:
28 | raise nose.SkipTest(e)
29 | assert_equals(result[0].total_pages, 4)
30 | assert_equals(result[0].page_number, 2)
31 | assert_equals(result[1].total_pages, 4)
32 | assert_equals(result[1].page_number, 3)
33 | assert_equals(result[2].total_pages, 4)
34 | assert_equals(result[2].page_number, 4)
35 |
36 |
37 | def test_it_includes_table_numbers():
38 | doc = fixture('AnimalExampleTables.pdf')
39 | try:
40 | result = get_tables_from_document(doc)
41 | except NotImplementedError, e:
42 | raise nose.SkipTest(e)
43 | result = get_tables_from_document(doc)
44 | assert_equals(result[0].table_number_on_page, 1)
45 | assert_equals(result[0].total_tables_on_page, 1)
46 |
--------------------------------------------------------------------------------
/test/test_all_sample_data.py:
--------------------------------------------------------------------------------
1 | #!/usr/bin/env python
2 | # coding: utf-8
3 |
4 | from __future__ import unicode_literals
5 | from nose.tools import assert_equal
6 | from os.path import join, dirname
7 | import os
8 |
9 | from pdftables.pdftables import page_to_tables
10 | from pdftables.display import to_string
11 | from pdftables.diagnostics import render_page, make_annotations
12 |
13 | from fixtures import fixture
14 |
15 | SAMPLE_DIR = join(dirname(__file__), '..', 'fixtures', 'sample_data')
16 | RENDERED_DIR = join(dirname(__file__), '..', 'fixtures', 'rendered')
17 | EXPECTED_DIR = join(dirname(__file__), '..', 'fixtures', 'expected_output')
18 | ACTUAL_DIR = join(dirname(__file__), '..', 'fixtures', 'actual_output')
19 |
20 |
21 | def _test_sample_data():
22 | for filename in os.listdir(SAMPLE_DIR):
23 | yield _test_sample_pdf, filename
24 |
25 |
26 | def _test_sample_pdf(short_filename):
27 | doc = fixture(short_filename)
28 | for page_number, page in enumerate(doc.get_pages()):
29 |
30 | tables = page_to_tables(page)
31 | annotations = make_annotations(tables)
32 |
33 | basename = '{0}_{1}'.format(short_filename, page_number)
34 |
35 | render_page(
36 | join(SAMPLE_DIR, short_filename),
37 | page_number,
38 | annotations,
39 | svg_file=join(RENDERED_DIR, 'svgs', basename + '.svg'),
40 | png_file=join(RENDERED_DIR, 'pngs', basename + '.png')
41 | )
42 |
43 | assert_equal(get_expected_number_of_tables(short_filename), len(tables))
44 | for table_num, table in enumerate(tables):
45 | table_filename = "{}_{}.txt".format(short_filename, table_num)
46 | expected_filename = join(EXPECTED_DIR, table_filename)
47 | actual_filename = join(ACTUAL_DIR, table_filename)
48 |
49 | with open(actual_filename, 'w') as f:
50 | f.write(to_string(table).encode('utf-8'))
51 |
52 | diff_table_files(expected_filename, actual_filename)
53 |
54 |
55 | def get_expected_number_of_tables(short_filename):
56 | result = len([fn for fn in os.listdir(EXPECTED_DIR)
57 | if fn.startswith(short_filename)])
58 | if result == 0:
59 | print("NOTE: there is no 'expected' data for {0} in {1}: you probably "
60 | "want to review then copy files from {2}".format(
61 | short_filename, EXPECTED_DIR, ACTUAL_DIR))
62 | return result
63 |
64 |
65 | def diff_table_files(expected, result):
66 | with open(expected) as f:
67 | with open(result) as g:
68 | for line, (expected_line, actual_line) in enumerate(zip(f, g)):
69 | try:
70 | assert_equal(expected_line, actual_line)
71 | except AssertionError:
72 | print("{} and {} differ @ line {}".format(
73 | expected, result, line + 1))
74 | raise
75 |
--------------------------------------------------------------------------------
/test/test_box.py:
--------------------------------------------------------------------------------
1 | #!/usr/bin/env python
2 | # -*- coding: utf-8 -*-
3 |
4 | #import sys
5 | #sys.path.append('pdftables')
6 |
7 | import pdftables
8 | import pdftables.boxes as boxes
9 |
10 |
11 | from pdftables.boxes import Box, BoxList, Rectangle
12 |
13 | from nose.tools import assert_equals, assert_is_not
14 |
15 | inf = float("inf")
16 |
17 |
18 | def test_basic_boxes():
19 | box = Box(Rectangle(11, 22, 33, 44), "text")
20 | assert_equals(box.rect, (11, 22, 33, 44))
21 | assert_equals((11, 22, 33, 44), box.rect)
22 | assert_equals(22, box.top)
23 | assert_equals('text', box.text)
24 |
25 |
26 | def test_clip_identity():
27 | """
28 | Test clipping a box with itself results in the same box
29 | """
30 | box1 = Box(Rectangle(-inf, -inf, inf, inf))
31 | box2 = Box(Rectangle(-inf, -inf, inf, inf))
32 |
33 | clipped = box1.clip(box2)
34 | assert_is_not(clipped, Box.empty_box)
35 | assert_equals(clipped.rect, box1.rect)
36 |
37 |
38 | def test_clip_x_infinite():
39 | """
40 | Test correctly clipping (-inf, 0, inf, 10) with (-inf, -inf, inf, inf)
41 | """
42 | box1 = Box(Rectangle(-inf, -inf, inf, inf))
43 | box2 = Box(Rectangle(-inf, 0, inf, 10))
44 |
45 | clipped = box1.clip(box2)
46 | assert_is_not(clipped, Box.empty_box)
47 | assert_equals(clipped.rect, (-inf, 0, inf, 10))
48 |
49 |
50 | def test_boxlist_inside():
51 | b = BoxList()
52 | b.append(Box(Rectangle(0, 0, 10, 10)))
53 |
54 | infrect = Box(Rectangle(-inf, -inf, inf, inf))
55 |
56 | assert_equals(1, len(b.inside(infrect)))
57 | assert_equals(0, len(b.inside(Box.empty_box)))
58 |
59 |
60 | def test_boxlist_inside_not_inside():
61 | b = BoxList()
62 | b.append(Box(Rectangle(0, 0, 10, 10)))
63 |
64 | otherbox = Box(Rectangle(-100, -100, -90, -90))
65 | assert_equals(0, len(b.inside(otherbox)))
66 |
--------------------------------------------------------------------------------
/test/test_contains_tables.py:
--------------------------------------------------------------------------------
1 | #!/usr/bin/env python
2 | # ScraperWiki Limited
3 | # Ian Hopkinson, 2013-06-17
4 | # -*- coding: utf-8 -*-
5 |
6 | """
7 | ContainsTables tests
8 | """
9 |
10 | import sys
11 |
12 | import pdftables
13 |
14 | from fixtures import fixture
15 |
16 | from nose.tools import assert_equals
17 |
18 |
19 | def contains_tables(pdf):
20 | """
21 | contains_tables takes a pdf document and returns a list of booleans, one
22 | per page, which is True for each page that contains tables
23 | """
24 | return [pdftables.page_contains_tables(page) for page in pdf.get_pages()]
25 |
26 |
27 | def test_it_finds_no_tables_in_a_pdf_with_no_tables():
28 | pdf = fixture('m27-dexpeg2-polymer.pdf')
29 | assert_equals(
30 | [False, False, False, False, False, False, False, False],
31 | contains_tables(pdf))
32 |
33 |
34 | def test_it_finds_tables_on_all_pages_AlmondBoard():
35 | pdf = fixture('2012.01.PosRpt.pdf')
36 | assert_equals(
37 | [True, True, True, True, True, True, True],
38 | contains_tables(pdf))
39 |
40 |
41 | def test_it_finds_tables_on_some_pages_CONAB():
42 | pdf = fixture('13_06_12_10_36_58_boletim_ingles_junho_2013.pdf')
43 | TestList = [False] * 32
44 | TestList[5:8] = [True] * 3
45 | TestList[9:11] = [True] * 2
46 | TestList[12] = True
47 | TestList[14] = True
48 | TestList[16:18] = [True] * 2
49 | TestList[19:24] = [True] * 5
50 | TestList[25:30] = [True] * 5
51 |
52 | assert_equals(contains_tables(pdf), TestList)
53 |
--------------------------------------------------------------------------------
/test/test_finds_tables.py:
--------------------------------------------------------------------------------
1 | #!/usr/bin/env python
2 | # ScraperWiki Limited
3 | # Ian Hopkinson, 2013-06-17
4 | # -*- coding: utf-8 -*-
5 |
6 | """
7 | Finds tables tests
8 | """
9 |
10 | import sys
11 |
12 | import pdftables
13 | from pdftables.config_parameters import ConfigParameters
14 |
15 | from fixtures import fixture
16 |
17 | from nose.tools import assert_equals
18 |
19 |
20 | def test_atomise_does_not_disrupt_table_finding():
21 | pdf_page = fixture(
22 | "13_06_12_10_36_58_boletim_ingles_junho_2013.pdf").get_page(3)
23 | tables1 = pdftables.page_to_tables(
24 | pdf_page,
25 | ConfigParameters(
26 | atomise=True,
27 | extend_y=False))
28 | tables2 = pdftables.page_to_tables(
29 | pdf_page,
30 | ConfigParameters(
31 | atomise=False,
32 | extend_y=False))
33 |
34 | assert_equals(tables1.tables[0].data, tables2.tables[0].data)
35 |
--------------------------------------------------------------------------------
/test/test_get_tables.py:
--------------------------------------------------------------------------------
1 | #!/usr/bin/env python
2 | # ScraperWiki Limited
3 | # Ian Hopkinson, 2013-06-17
4 | # -*- coding: utf-8 -*-
5 |
6 | """
7 | getTablesTests
8 | """
9 |
10 | import sys
11 | sys.path.append('code')
12 |
13 | from pdftables import page_to_tables # , TableDiagnosticData
14 | from pdftables.config_parameters import ConfigParameters
15 |
16 | from fixtures import fixture
17 |
18 | from nose.tools import *
19 |
20 |
21 | def test_it_doesnt_find_tables_when_there_arent_any():
22 | pdf_page = fixture(
23 | "13_06_12_10_36_58_boletim_ingles_junho_2013.pdf").get_page(4)
24 | tables = page_to_tables(pdf_page)
25 |
26 | assert_equals([], tables.tables[0].data)
27 |
28 |
29 | def test_it_copes_with_CONAB_p8():
30 | pdf_page = fixture(
31 | "13_06_12_10_36_58_boletim_ingles_junho_2013.pdf").get_page(7)
32 | page_to_tables(pdf_page, ConfigParameters(atomise=True))
33 |
34 |
35 | def test_it_can_use_hints_AlmondBoard_p1():
36 | pdf_page = fixture("2012.01.PosRpt.pdf").get_page(0)
37 | tables = page_to_tables(
38 | pdf_page,
39 | ConfigParameters(
40 | atomise=False,
41 | table_top_hint=u"% Change",
42 | table_bottom_hint=u"Uncommited"))
43 | assert_equals(
44 | [[u'Salable', u'Million Lbs.', u'Kernel Wt.', u'Kernel Wt.', u'% Change'],
45 | [u'1. Carryin August 1, 2011', u'254.0',
46 | u'253,959,411', u'321,255,129', u'-20.95%'],
47 | [u'2. Crop Receipts to Date', u'1,950.0',
48 | u'1,914,471,575', u'1,548,685,417', u'23.62%'],
49 | [u'3. [3% Loss and Exempt]', u'58.5',
50 | u'57,434,147)(', u'46,460,563(', u')'],
51 | [u'4. New Crop Marketable (2-3)', u'1,891.5',
52 | u'1,857,037,428', u'1,502,224,854', u'23.62%'],
53 | [u'5. [Reserve]', u'n/a', u'0', u'0', u''],
54 | [u'6. Total Supply (1+4-5)Shipments by Handlers',
55 | u'2,145.5', u'2,110,996,839', u'1,823,479,983', u'15.77%'],
56 | [u'7. Domestic', u'555.0',
57 | u'265,796,698', u'255,785,794', u'3.91%'],
58 | [u'8. Export', u'1,295.0',
59 | u'755,447,255', u'664,175,807', u'13.74%'],
60 | [u'9. Total Shipments', u'1,850.0',
61 | u'1,021,243,953', u'919,961,601', u'11.01%'],
62 | [u'10. Forecasted Carryout', u'295.5', u'', u'', u''],
63 | [u'11. Computed Inventory (6-9)Commitments (sold, not delivered)**',
64 | u'', u'1,089,752,886', u'903,518,382', u'20.61%'],
65 | [u'12. Domestic', u'', u'214,522,238', u'187,492,263', u'14.42%'], [
66 | u'13. Export', u'', u'226,349,446', u'155,042,764', u'45.99%'],
67 | [u'14. Total Commited Shipments', u'',
68 | u'440,871,684', u'342,535,027', u'28.71%'],
69 | [u'15. Uncommited Inventory (11-14)', u'', u'648,881,202', u'560,983,355', u'15.67%']], tables.tables[0].data)
70 |
71 |
72 | def test_it_can_use_one_hint_argentina_by_size():
73 | pdf_page = fixture("argentina_diputados_voting_record.pdf").get_page(0)
74 | tables = page_to_tables(
75 | pdf_page,
76 | ConfigParameters(
77 | atomise=False,
78 | table_top_hint='Apellido'))
79 | #table1,_ = getTable(fh, 2)
80 | table1 = list(tables)[0].data
81 | assert_equals(32, len(table1))
82 | assert_equals(4, len(table1[0]))
83 |
84 |
85 | def test_it_returns_the_AlmondBoard_p2_table_by_size():
86 | pdf_page = fixture("2012.01.PosRpt.pdf").get_page(1)
87 | tables = page_to_tables(pdf_page, ConfigParameters(atomise=False))
88 | table1 = list(tables)[0].data
89 | assert_equals(78, len(table1))
90 | assert_equals(9, len(table1[0]))
91 |
92 |
93 | def test_the_atomise_option_works_on_coceral_p1_by_size():
94 | pdf_page = fixture(
95 | "1359397366Final_Coceral grain estimate_2012_December.pdf").get_page(0)
96 | tables = page_to_tables(pdf_page,
97 | ConfigParameters(
98 | atomise=True))
99 | table = list(tables)[0].data
100 | #table1, _ = getTable(fh, 2)
101 | assert_equals(43, len(table))
102 | assert_equals(31, len(table[0]))
103 |
104 |
105 | def test_it_does_not_crash_on_m30_p5():
106 | pdf_page = fixture("m30-JDent36s15-20.pdf").get_page(4)
107 | tables = page_to_tables(pdf_page)
108 | table = list(tables)[0].data
109 | assert len(table) > 0
110 | """Put this in for more aggressive test"""
111 | # assert_equals([u'5\n', u'0.75\n', u'0.84\n', u'0.92\n', u'0.94\n', u'evaluation of a novel liquid whitening gel containing 18%\n'],
112 | # table[4])
113 |
114 |
115 | def test_it_returns_the_AlmondBoard_p4_table():
116 | pdf_page = fixture("2012.01.PosRpt.pdf").get_page(3)
117 | tables = page_to_tables(
118 | pdf_page,
119 | ConfigParameters(
120 | atomise=False,
121 | extend_y=False))
122 | assert_equals(
123 | [[u'Variety Name', u'Total Receipts', u'Total Receipts', u'Total Inedibles', u'Receipts', u'% Rejects'],
124 | [u'Aldrich', u'48,455,454', u'49,181,261',
125 | u'405,555', u'2.53%', u'0.82%'],
126 | [u'Avalon', u'7,920,179', u'8,032,382',
127 | u'91,733', u'0.41%', u'1.14%'],
128 | [u'Butte', u'151,830,761', u'150,799,510',
129 | u'1,054,567', u'7.93%', u'0.70%'],
130 | [u'Butte/Padre', u'215,114,812', u'218,784,885',
131 | u'1,145,000', u'11.24%', u'0.52%'],
132 | [u'Carmel', u'179,525,234', u'178,912,935',
133 | u'1,213,790', u'9.38%', u'0.68%'],
134 | [u'Carrion', u'507,833', u'358,580',
135 | u'2,693', u'0.03%', u'0.75%'],
136 | [u'Fritz', u'105,479,433', u'106,650,571',
137 | u'1,209,192', u'5.51%', u'1.13%'],
138 | [u'Harvey', u'58,755', u'58,755',
139 | u'1,416', u'0.00%', u'2.41%'],
140 | [u'Hashem', u'430,319', u'430,014',
141 | u'1,887', u'0.02%', u'0.44%'],
142 | [u'Le Grand', u'0', u'0', u'0', u'0.00%', u'0.00%'],
143 | [u'Livingston', u'7,985,535', u'7,926,910',
144 | u'186,238', u'0.42%', u'2.35%'],
145 | [u'Marchini', u'363,887', u'391,965',
146 | u'3,675', u'0.02%', u'0.94%'],
147 | [u'Merced', u'65,422', u'66,882',
148 | u'1,167', u'0.00%', u'1.74%'],
149 | [u'Mission', u'19,097,034', u'18,851,071',
150 | u'110,323', u'1.00%', u'0.59%'],
151 | [u'Mixed', u'36,358,011', u'36,926,337',
152 | u'952,264', u'1.90%', u'2.58%'],
153 | [u'Mono', u'757,637', u'689,552',
154 | u'6,785', u'0.04%', u'0.98%'],
155 | [u'Monterey', u'220,713,436', u'212,746,409',
156 | u'2,293,892', u'11.53%', u'1.08%'],
157 | [u'Morley', u'822,529', u'825,738',
158 | u'6,264', u'0.04%', u'0.76%'],
159 | [u'N43', u'156,488', u'85,832', u'340', u'0.01%', u'0.40%'],
160 | [u'Neplus', u'1,279,599', u'1,237,532',
161 | u'17,388', u'0.07%', u'1.41%'],
162 | [u'Nonpareil', u'741,809,844', u'727,286,104',
163 | u'5,121,465', u'38.75%', u'0.70%'],
164 | [u'Padre', u'62,905,358', u'62,417,565',
165 | u'193,168', u'3.29%', u'0.31%'],
166 | [u'Peerless', u'5,113,472', u'5,101,245',
167 | u'20,792', u'0.27%', u'0.41%'],
168 | [u'Price', u'25,312,529', u'25,124,463',
169 | u'143,983', u'1.32%', u'0.57%'],
170 | [u'Ruby', u'4,163,237', u'4,057,470',
171 | u'35,718', u'0.22%', u'0.88%'],
172 | [u'Sauret', u'55,864', u'55,864', u'517', u'0.00%', u'0.93%'],
173 | [u'Savana', u'389,317', u'390,585',
174 | u'2,049', u'0.02%', u'0.52%'],
175 | [u'Sonora', u'31,832,025', u'33,184,703',
176 | u'387,848', u'1.66%', u'1.17%'],
177 | [u'Thompson', u'491,026', u'487,926',
178 | u'8,382', u'0.03%', u'1.72%'],
179 | [u'Tokyo', u'783,494', u'794,699',
180 | u'4,511', u'0.04%', u'0.57%'],
181 | [u'Winters', u'5,780,183', u'5,756,167',
182 | u'46,211', u'0.30%', u'0.80%'],
183 | [u'Wood Colony', u'37,458,735', u'36,331,907',
184 | u'189,967', u'1.96%', u'0.52%'],
185 | [u'Major Varieties Sub Total:', u'1,913,017,442',
186 | u'1,893,945,819', u'14,858,780', u'99.92%', u'0.78%'],
187 | [u'Minor Varieties Total:', u'1,454,133',
188 | u'1,480,800', u'34,997', u'0.08%', u'2.36%'],
189 | [u'Grand Total All Varieties', u'1,914,471,575', u'1,895,426,619', u'14,893,777', u'100.00%', u'0.79%']], tables.tables[0].data
190 | )
191 |
--------------------------------------------------------------------------------
/test/test_ground.py:
--------------------------------------------------------------------------------
1 | from pdftables.pdf_document import PDFDocument
2 | from pdftables.pdftables import page_to_tables
3 | import lxml.etree
4 | from collections import Counter
5 | from nose.tools import assert_equals
6 |
7 |
8 | class ResultTable(object):
9 | def __sub__(self, other):
10 | r = ResultTable()
11 | r.cells = Counter(self.cells)  # copy, so that self.cells is not mutated
12 | r.cells.subtract(other.cells)
13 | r.number_of_rows = self.number_of_rows - other.number_of_rows
14 | r.number_of_cols = self.number_of_cols - other.number_of_cols
15 | return r
16 |
17 | def __repr__(self):
18 | assert self.cells is not None
19 | response = ""
20 | return response.format(col=self.number_of_cols,
21 | row=self.number_of_rows,
22 | plus=sum(self.cells[x] for x in self.cells if self.cells[x] >= 1),
23 | minus=abs(sum(self.cells[x] for x in self.cells if self.cells[x] <= -1)))
24 |
25 |
26 | def pdf_results(filename):
27 | def get_cells(table):
28 | cells = Counter()
29 | for row in table.data:
30 | for cell in row:
31 | cells.update([cell])
32 | return cells
33 |
34 | #doc = PDFDocument.from_fileobj(open(filename, "rb"))
35 | doc = PDFDocument.from_path(filename)
36 | for page in doc.get_pages():
37 | table_container = page_to_tables(page)
38 | builder = []
39 | for table in table_container:
40 | r = ResultTable()
41 | r.cells = get_cells(table)
42 | r.number_of_rows = len(table.data)
43 | r.number_of_cols = max(len(row) for row in table.data)
44 | builder.append(r)
45 | return builder
46 |
47 |
48 | def xml_results(filename):
49 | def max_of_strs(strs):
50 | return max(map(int, strs))
51 | root = lxml.etree.fromstring(open(filename, "rb").read())
52 | builder = []
53 | for table in root.xpath("//table"):
54 | r = ResultTable()
55 | r.cells = Counter(table.xpath(".//content/text()"))
56 | cols = table.xpath(".//@end-col")
57 | cols.extend(table.xpath(".//@start-col"))
58 | rows = table.xpath(".//@end-row")
59 | rows.extend(table.xpath(".//@start-row"))
60 | r.number_of_cols = max_of_strs(cols) + 1 # starts at zero
61 | r.number_of_rows = max_of_strs(rows) + 1 # starts at zero
62 | builder.append(r)
63 | return builder
64 |
65 |
66 |
67 | def _test_ground(filebase, number):
68 | """tests whether we successfully parse ground truth data:
69 | see fixtures/eu-dataset"""
70 | pdf_tables = pdf_results(filebase % (number, ".pdf"))
71 | xml_tables = xml_results(filebase % (number, "-str.xml"))
72 | assert_equals(len(pdf_tables), len(xml_tables))
73 | for i in range(0, len(pdf_tables)):
74 | pdf_table = pdf_tables[i]
75 | xml_table = xml_tables[i]
76 | diff = pdf_table - xml_table
77 | clean_diff_list = {x:diff.cells[x] for x in diff.cells if diff.cells[x] != 0}
78 | assert_equals(pdf_table.number_of_cols, xml_table.number_of_cols)
79 | assert_equals(pdf_table.number_of_rows, xml_table.number_of_rows)
80 | assert_equals(clean_diff_list, {})
81 |
82 | def test_all_eu():
83 | filebase = "fixtures/eu-dataset/eu-%03d%s"
84 | for i in range(1,35): # 1..34
85 | yield _test_ground, filebase, i
86 |
87 |
--------------------------------------------------------------------------------
/test/test_linesegments.py:
--------------------------------------------------------------------------------
1 | import pdftables.line_segments as line_segments
2 |
3 | from nose.tools import assert_equals, raises
4 |
5 | from pdftables.line_segments import LineSegment
6 |
7 |
8 | def segments(segments):
9 | return [line_segments.LineSegment.make(a, b) for a, b in segments]
10 |
11 |
12 | def test_segments_generator():
13 | seg1, seg2 = segs = segments([(1, 4), (2, 3)])
14 | values = list(line_segments.segments_generator(segs))
15 | assert_equals(
16 | [(1, seg1, False),
17 | (2, seg2, False),
18 | (3, seg2, True),
19 | (4, seg1, True)],
20 | values
21 | )
22 |
23 |
24 | def test_histogram_segments():
25 | segs = segments([(1, 4), (2, 3)])
26 | values = list(line_segments.histogram_segments(segs))
27 | assert_equals([((1, 2), 1), ((2, 3), 2), ((3, 4), 1)], values)
28 |
29 |
30 | def test_segment_histogram():
31 | segs = segments([(1, 4), (2, 3)])
32 | values = list(line_segments.segment_histogram(segs))
33 | assert_equals([(1, 2, 3, 4), (1, 2, 1)], values)
34 |
35 |
36 | @raises(RuntimeError)
37 | def test_malformed_input_segments_generator():
38 | segs = segments([(1, -1)])
39 | list(line_segments.segments_generator(segs))
40 |
41 |
42 | def test_hat_point_generator():
43 | segs = segments([(1, 4), (2, 3)])
44 | result = list(line_segments.hat_point_generator(segs))
45 |
46 | x = 2.5
47 | expected = [(1, set()),
48 | (2, set([LineSegment(start=1, end=4, object=None)])),
49 | (x, set([LineSegment(start=1, end=4, object=None),
50 | LineSegment(start=2, end=3, object=None)])),
51 | (3, set([LineSegment(start=1, end=4, object=None)])),
52 | (4, set())]
53 |
54 | assert_equals(expected, result)
55 |
56 |
57 | def test_hat_generator():
58 | segs = segments([(0, 4), (1, 3)])
59 | result = list(line_segments.hat_generator(segs))
60 |
61 | expected = [(0, 0), (1, 0.75), (2.0, 2.0), (3, 0.75), (4, 0)]
62 |
63 | assert_equals(expected, result)
64 |
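# Illustrative sketch only: test_segments_generator above expects one event
# per segment end point, ordered by position, with the flag set to True when
# the segment ends. The code below is a minimal self-contained version of that
# idea. It is NOT the implementation in pdftables.line_segments, and the names
# SketchSegment and sketch_segment_events are hypothetical.

from collections import namedtuple

SketchSegment = namedtuple("SketchSegment", "start end")


def sketch_segment_events(segments):
    """Return (position, segment, disappearing) tuples sorted by position."""
    events = []
    for segment in segments:
        if segment.end < segment.start:
            # Assumption: malformed input is rejected, as the
            # malformed-input test above suggests for the real generator.
            raise RuntimeError("Malformed segment: {0!r}".format(segment))
        events.append((segment.start, segment, False))
        events.append((segment.end, segment, True))
    # Sort on position only, so the segments themselves are never compared.
    events.sort(key=lambda event: event[0])
    return events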
--------------------------------------------------------------------------------
/test/test_render_script.py:
--------------------------------------------------------------------------------
1 | from pdftables.scripts.render import main
2 | import sys
3 | import os
4 | import shutil
5 | import glob
6 | from nose.tools import with_setup
7 |
8 | # TODO(pwaller): Don't write test data to png/svg.
9 |
10 | PDF_FILE = 'fixtures/sample_data/unica_preco_recebido.pdf'
11 |
12 | def clean_output_directories():
13 | shutil.rmtree('png', ignore_errors=True)
14 | shutil.rmtree('svg', ignore_errors=True)
15 |
16 | @with_setup(clean_output_directories)
17 | def test_png_output_directory_is_created():
18 | assert not os.path.isdir('png')
19 | main([PDF_FILE])
20 | assert os.path.isdir('png')
21 |
22 | @with_setup(clean_output_directories)
23 | def test_svg_output_directory_is_created():
24 | assert not os.path.isdir('svg')
25 | main([PDF_FILE])
26 | assert os.path.isdir('svg')
27 |
28 | @with_setup(clean_output_directories)
29 | def test_expected_png_output():
30 | main([PDF_FILE])
31 | assert os.path.isfile('png/unica_preco_recebido.pdf_00.png')
32 |
33 | @with_setup(clean_output_directories)
34 | def test_expected_svg_output():
35 | main([PDF_FILE])
36 | assert os.path.isfile('svg/unica_preco_recebido.pdf_00.svg')
37 |
38 | # Some of the fixture pdfs
39 | PDF_FILES = [
40 | ('fixtures/sample_data/unica_preco_recebido.pdf', 1),
41 | ('fixtures/sample_data/m29-JDent36s2-7.pdf', 6),
42 | ('fixtures/sample_data/COPAMONTHLYMay2013.pdf', 1),
43 | ('fixtures/sample_data/COPAWEEKLYJUNE52013.pdf', 1),
44 | ('fixtures/sample_data/tabla_subsidios.pdf', 1),
45 | ('fixtures/sample_data/AnimalExampleTables.pdf', 4),
46 | ('fixtures/sample_data/m30-JDent36s15-20.pdf', 6),
47 | ('fixtures/sample_data/commodity-prices_en.pdf', 1),
48 | ]
49 |
50 | @with_setup(clean_output_directories)
51 | def test_expected_number_of_pages():
52 | for infile, expected_pages in PDF_FILES:
53 | main([infile])
54 |
55 | actual_pages = len(glob.glob('svg/%s_*.svg' % os.path.basename(infile)))
56 | assert expected_pages == actual_pages
57 |
58 | actual_pages = len(glob.glob('png/%s_*.png' % os.path.basename(infile)))
59 | assert expected_pages == actual_pages
60 |
--------------------------------------------------------------------------------