├── .gitignore
├── ASHIMA.md
├── LICENSE
├── LICENSE.txt
├── README.md
├── TODO.md
├── example
    ├── example.pdf
    └── test_to_pandas.py
├── setup.cfg
├── setup.py
└── src
    └── pdftableextract
        ├── __init__.py
        ├── core.py
        ├── extracttab.py
        ├── pnm.py
        └── scripts.py


/.gitignore:
--------------------------------------------------------------------------------
 1 | *.pyc
 2 | 
 3 | .installed.cfg
 4 | bin
 5 | develop-eggs
 6 | 
 7 | *.egg-info
 8 | 
 9 | tmp
10 | build
11 | dist
12 | 


--------------------------------------------------------------------------------
/ASHIMA.md:
--------------------------------------------------------------------------------
 1 | *PDF Table Extraction Utility.* Analyses a page in a PDF looking
 2 | for well delineated table cells, and extracts the text in each cell.
 3 | Outputs include JSON, XML, and CSV lists of cell locations, shapes,
 4 | and contents, and CSV and HTML versions of the tables. This utility
 5 | is intended to be the first step in automatically processing data
 6 | in tables from a PDF file, and was originally designed to read the
 7 | tables in ST Micro’s datasheets. The script requires numpy and poppler
  8 | (pdftoppm and pdftotext).
 9 | 
 10 | ### License
11 | [MIT Expat](http://ashimagroup.net/os/license/mit-expat)
12 | 
 13 | ### Tags
14 | [Utilities](http://ashimagroup.net/os/tag/utilities)
15 | 
16 | 
17 | 


--------------------------------------------------------------------------------
/LICENSE:
--------------------------------------------------------------------------------
 1 | Copyright (C) 2012 Ashima Research
 2 | 
 3 | Permission is hereby granted, free of charge, to any person obtaining a copy
 4 | of this software and associated documentation files (the "Software"), to deal
 5 | in the Software without restriction, including without limitation the rights
 6 | to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
 7 | copies of the Software, and to permit persons to whom the Software is
 8 | furnished to do so, subject to the following conditions:
 9 | 
10 | The above copyright notice and this permission notice shall be included in
11 | all copies or substantial portions of the Software.
12 | 
13 | THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
14 | IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
15 | FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
16 | AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
17 | LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
18 | OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN
19 | THE SOFTWARE.
20 | 


--------------------------------------------------------------------------------
/LICENSE.txt:
--------------------------------------------------------------------------------
1 | LICENSE


--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
 1 | *PDF Table Extraction Utility.* Analyses a page in a PDF looking
 2 | for well delineated table cells, and extracts the text in each cell.
 3 | Outputs include JSON, XML, and CSV lists of cell locations, shapes,
 4 | and contents, and CSV and HTML versions of the tables. This utility
 5 | is intended to be the first step in automatically processing data
 6 | in tables from a PDF file, and was originally designed to read the
 7 | tables in ST Micro’s datasheets. The script requires numpy and poppler
  8 | (pdftoppm and pdftotext).
 9 | 
 10 | ### License
11 | [MIT Expat](http://ashimagroup.net/os/license/mit-expat)
12 | 
 13 | ### Tags
14 | [Utilities](http://ashimagroup.net/os/tag/utilities)
15 | 
16 | 
17 | 


--------------------------------------------------------------------------------
/TODO.md:
--------------------------------------------------------------------------------
 1 | TODO
 2 | ====
 3 | This list is in no particular order, and things will get done
 4 | when/if I need them or I have spare time :) 
 5 | 
 6 | --ijm.
 7 | 
 8 | 
 9 | Line finding
10 | ============
11 | 
12 | The line finding algorithm is robust but cannot tell large blocks
13 | of black from lines of black. So I need to add something that finds
14 | and removes solid blocks.
15 | 
16 | Many tables use whitespace to delineate columns or rows but simply
17 | running the same scan for white as is done for black returns a lot
18 | of junk. For example, if the font used is a fixed space font, the
19 | scanner returns grid with cells that are one character wide, and
20 | one character high across the whole page.
21 | 
22 | Horizontal white rows are probably easiest to add, and in fact the
23 | histogram data is already computed. However this puts a row boundary
24 | between EVERY row of text, and makes automatic cropping much harder
25 | and multi-cell consolidation impractical.
26 | 
27 | A significant number of tuning options will be needed in order
 28 | to control the width of whitespace recognised, whether or not to
29 | remove already detected black delimiters etc.
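
One possible way to tell ruled lines from solid blocks, sketched in plain Python rather than the numpy used in core.py: flag every column that contains a long black run, group adjacent flagged columns into spans, and discard spans that are too wide to be a line. The function names, the tiny bitmap, and the thresholds below are illustrative assumptions and are not part of the package. The width cut-off plays a role similar to `maxdiv` in core.py.

```python
def longest_black_run(pixels):
    """Length of the longest run of black pixels (0 = black, 1 = white)."""
    best = run = 0
    for px in pixels:
        run = 0 if px else run + 1
        best = max(best, run)
    return best


def line_columns(bmp, min_run):
    """Flag each column whose longest black run is at least min_run pixels."""
    height = len(bmp)
    width = len(bmp[0]) if bmp else 0
    return [
        longest_black_run([bmp[y][x] for y in range(height)]) >= min_run
        for x in range(width)
    ]


def group_flags(flags):
    """Collapse a list of booleans into half-open (start, end) spans of True."""
    spans = []
    start = None
    for k, flag in enumerate(flags):
        if flag and start is None:
            start = k
        elif not flag and start is not None:
            spans.append((start, k))
            start = None
    if start is not None:
        spans.append((start, len(flags)))
    return spans


def drop_wide_spans(spans, max_width):
    """Keep only spans narrow enough to be a ruled line, not a solid block."""
    return [(s, e) for (s, e) in spans if e - s <= max_width]


# Hypothetical 6x8 bitmap: column 1 is a ruled line, columns 4-6 a solid block.
row = [1, 0, 1, 1, 0, 0, 0, 1]
bmp = [list(row) for _ in range(6)]
bmp[2][2] = 0  # a stray "text" pixel, too short to count as a line

flags = line_columns(bmp, min_run=5)
spans = group_flags(flags)
lines = drop_wide_spans(spans, max_width=2)
print(lines)
```

Here the three-pixel-wide block is rejected while the one-pixel line survives; a real page would need the thresholds scaled with the bitmap resolution.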
30 | 
31 | 
32 | Cell finding
33 | ============
34 | 
35 | The cell finding algorithm is very simplistic. It starts at the top
 36 | left and finds all connected cells to its right, then descends,
37 | stopping if any cell divider is seen, and remembers which cells
38 | have been visited. This makes it 'width greedy'. It attempts to
39 | start a search for every cell so it will find all sizes of rectangular
40 | cells, but it will fail to find, and so split up, 2 of the 4 possible
41 | L shapes, only 1 of the 4 C or U shapes, or any O shape (where a
42 | cell surrounds another cell).
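
The width-greedy behaviour described above can be sketched on an abstract grid. This is a simplified stand-in for the loop in core.py, not the code itself: `vwall[j][i]` and `hwall[j][i]` are hypothetical boolean tables that say whether a divider sits to the right of, or below, unit cell `(i, j)`.

```python
def find_cells(ncols, nrows, vwall, hwall):
    """Width-greedy rectangle search.

    vwall[j][i] is True if a divider separates cell (i, j) from (i + 1, j).
    hwall[j][i] is True if a divider separates cell (i, j) from (i, j + 1).
    Returns (i, j, u, v) tuples: left column, top row, width, height.
    """
    touched = [[False] * ncols for _ in range(nrows)]
    cells = []
    for j in range(nrows):
        for i in range(ncols):
            if touched[j][i]:
                continue
            # Grow to the right along the top row until a divider is hit.
            u = 1
            while i + u < ncols and not vwall[j][i + u - 1]:
                u += 1
            # Grow downwards, stopping as soon as ANY column sees a divider.
            v = 1
            while j + v < nrows and not any(
                hwall[j + v - 1][i + k] for k in range(u)
            ):
                v += 1
            cells.append((i, j, u, v))
            for jj in range(j, j + v):
                for ii in range(i, i + u):
                    touched[jj][ii] = True
    return cells


# A header cell spanning two columns above two separate cells.
header = find_cells(2, 2, vwall=[[False], [True]], hwall=[[True, True]])

# A tall cell on the left next to two stacked cells.
tall = find_cells(2, 2, vwall=[[True], [True]], hwall=[[False, True]])

# An L shape: (0,0), (1,0) and (0,1) are connected, (1,1) is walled off.
l_shape = find_cells(2, 2, vwall=[[False], [True]], hwall=[[False, True]])

print(header, tall, l_shape)
```

In this sketch the L shape comes back as the same three rectangles as the header case: the open boundary between `(0, 0)` and `(0, 1)` is ignored because the downward scan stops at the divider under `(1, 0)`.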
43 | 
44 | An option is needed to select width greedy, height greedy, square greedy.
45 | 
 46 | A flood fill algorithm would make a single cell for text around
 47 | a table, rather than splitting it into rectangles as happens now, but
48 | this would also require a graph view of cell relationships.
49 | 
 50 | Poppler wrapper
 51 | ===============
 52 | A short piece of code that wraps the poppler library to give the
 53 | same functionality as pdftotext but over a socket or file descriptor,
 54 | and able to process sequential requests. At the moment pdf-extract
 55 | executes pdftotext once for every cell it finds! This would be much
56 | faster if a wrapper didn't need to be spun up repeatedly.
57 | 
58 | A wrapper is needed to comply with the MIT Expat vs GPL incompatibility.
59 | 
60 | Blank row or column removal
61 | ===========================
 62 | It shouldn't be too hard to notice when a complete row or column is
 63 | empty, and remove it from the result. However a number of tuning
 64 | options would be needed, including not removing empty rows/columns,
 65 | not removing empty rows/columns in the middle of the table, ignoring
 66 | white space, ignoring punctuation, etc.
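
A minimal sketch of how such a filter might look when applied to the list-of-lists form of one table. All names and options here (`is_blank`, `drop_blank_rows`, `drop_blank_columns`, `keep_interior`, `ignore_whitespace`, `ignore_punctuation`) are hypothetical and only illustrate the tuning options listed above.

```python
import string


def is_blank(value, ignore_whitespace=True, ignore_punctuation=False):
    """Decide whether a cell's text counts as empty under the given options."""
    if ignore_whitespace:
        value = "".join(value.split())
    if ignore_punctuation:
        value = "".join(ch for ch in value if ch not in string.punctuation)
    return value == ""


def drop_blank_rows(table, keep_interior=True, **opts):
    """Remove blank rows; with keep_interior, only trim the top and bottom."""
    blank = [all(is_blank(cell, **opts) for cell in row) for row in table]
    if keep_interior:
        first = next((k for k, b in enumerate(blank) if not b), len(table))
        last = next(
            (k for k in range(len(table) - 1, -1, -1) if not blank[k]), -1
        )
        return table[first:last + 1]
    return [row for row, b in zip(table, blank) if not b]


def drop_blank_columns(table, **kwargs):
    """Same as drop_blank_rows, applied to the transposed table."""
    if not table:
        return []
    transposed = [list(col) for col in zip(*table)]
    kept = drop_blank_rows(transposed, **kwargs)
    return [list(row) for row in zip(*kept)]


table = [
    ["", " ", ""],
    ["", "a", ""],
    ["", "", ""],
    ["", "b", "-"],
    ["", "", ""],
]

trimmed = drop_blank_rows(table)                      # keeps the interior blank row
compact = drop_blank_rows(table, keep_interior=False)
narrow = drop_blank_columns(table, keep_interior=False, ignore_punctuation=True)
print(trimmed, compact, narrow)
```

Transposing to reuse the row logic keeps the options identical for both axes; a table in which every column is blank collapses to an empty list in this sketch.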
67 | 
68 | Delimiter thickness hints
69 | =========================
70 | I should be able to record the relative thicknesses of the delimiters
71 | around a cell, so that later on it would be possible to extract
72 | table and heading boundaries for tables that use them in a detectable
73 | way.
74 | 
75 | Better Hierarchical information
76 | ==============================
77 | I want to keep the cell location data structure flat (because
78 | ultimately the page is always flat), but I could include more
79 | information about cell relationships, and facilitate rebuilding a
 80 | representative document object model downstream. I'd particularly
81 | like to be able to automatically separate two tables on the same
82 | page, and to auto-join a multi-page table.
83 | 
84 | Miscellaneous
85 | =============
86 | 
87 | * An option to change the program called to extract text in each
88 | cell: currently it calls pdftotext, but it could easily be ocrad
89 | or any other pdf tool that can take a cropping rectangle.
90 | 
91 | 
92 | 


--------------------------------------------------------------------------------
/example/example.pdf:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/ashima/pdf-table-extract/7a04fc5a99b74aebecda208bf9680bb2cad2cc72/example/example.pdf


--------------------------------------------------------------------------------
/example/test_to_pandas.py:
--------------------------------------------------------------------------------
 1 | import pandas as pd
 2 | import pdftableextract as pdf
 3 | 
 4 | pages = ["1"]
 5 | cells = [pdf.process_page("example.pdf",p) for p in pages]
 6 | 
 7 | #flatten the cells structure
 8 | cells = [item for sublist in cells for item in sublist ]
 9 | 
 10 | #table_to_list returns one grid per page number, indexed from 0, so for
 11 | #page 1 index 0 is an all-blank placeholder grid; choose table '1'
12 | li = pdf.table_to_list(cells, pages)[1]
13 | 
14 | #li is a list of lists, the first line is the header, last is the footer (for this table only!)
15 | #column '0' contains store names
16 | #row '1' contains column headings
17 | #data is row '2' through '-1'
18 | 
 19 | data = pd.DataFrame(li[2:-1], columns=li[1], index=[l[0] for l in li[2:-1]])
 20 | print(data)
21 | 
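
To see why index `1` is the right choice in the example above, here is a small stand-in that mirrors the indexing done by `table_to_list` in core.py (`tab[pg][j][i] = value`). The function name `table_to_list_sketch` and the sample cells are made up for illustration; real cells come from `process_page`.

```python
def table_to_list_sketch(cells):
    """Build tab[page][row][col] from (i, j, u, v, pg, value) cell tuples."""
    max_i = max_j = max_pg = 0
    for (i, j, u, v, pg, value) in cells:
        max_i = max(max_i, i)
        max_j = max(max_j, j)
        max_pg = max(max_pg, pg)

    # One grid per page number from 0 up to the highest page seen.
    tab = [
        [["" for _ in range(max_i + 1)] for _ in range(max_j + 1)]
        for _ in range(max_pg + 1)
    ]
    for (i, j, u, v, pg, value) in cells:
        tab[pg][j][i] = value
    return tab


# Hypothetical cells, all on page 1: (col, row, width, height, page, text).
cells = [
    (0, 0, 1, 1, 1, "Store"),
    (1, 0, 1, 1, 1, "Sales"),
    (0, 1, 1, 1, 1, "North"),
    (1, 1, 1, 1, 1, "12"),
]

tab = table_to_list_sketch(cells)
print(tab[0])  # blank placeholder grid for the non-existent page 0
print(tab[1])  # the actual contents of page 1
```

Because pages are numbered from 1 but the outer list is indexed from 0, the first grid is always empty when only real pages are processed.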


--------------------------------------------------------------------------------
/setup.cfg:
--------------------------------------------------------------------------------
1 | [metadata]
2 | description-file = README.md
3 | 


--------------------------------------------------------------------------------
/setup.py:
--------------------------------------------------------------------------------
 1 | from setuptools import setup, find_packages
 2 | import os
 3 | 
 4 | here = os.path.abspath(os.path.dirname(__file__))
 5 | README = open(os.path.join(here, 'README.md')).read()
 6 | #NEWS = open(os.path.join(here, 'NEWS.txt')).read()
 7 | 
 8 | 
 9 | version = '0.1'
10 | 
11 | install_requires = [ "numpy" ]
12 | 
13 | 
14 | setup(name='pdf-table-extract',
15 |     version=version,
16 |     description="Extract Tables from PDF files",
17 |     long_description=README + '\n\n',# + NEWS,
18 |     classifiers=[
19 |       # Get strings from http://pypi.python.org/pypi?%3Aaction=list_classifiers
20 |     ],
21 |     keywords='PDF, tables',
22 |     author='Ian McEwan',
23 |     author_email='ijm@ashimaresearch.com',
24 |     url='ashimaresearch.com',
25 |     license='MIT-Expat',
26 |     packages=find_packages('src'),
27 |     package_dir = {'': 'src'},include_package_data=True,
28 |     zip_safe=False,
29 |     install_requires=install_requires,
30 |     entry_points={
31 |         'console_scripts':
32 |             ['pdf-table-extract=pdftableextract.scripts:main']
33 |     }
34 | )
35 | 


--------------------------------------------------------------------------------
/src/pdftableextract/__init__.py:
--------------------------------------------------------------------------------
1 | # Example package with a console entry point
2 | from .core import process_page, output, table_to_list


--------------------------------------------------------------------------------
/src/pdftableextract/core.py:
--------------------------------------------------------------------------------
  1 | import sys
  2 | import os
  3 | from numpy import array, fromstring, ones, zeros, uint8, diff, where, sum, delete
  4 | import subprocess
  5 | from pipes import quote
  6 | from .pnm import readPNM, dumpImage
  7 | import re
  8 | from pipes import quote
  9 | from xml.dom.minidom import getDOMImplementation
 10 | import json
 11 | import csv
 12 | 
 13 | #-----------------------------------------------------------------------
 14 | def check_for_required_executable(name,command):
 15 |     """Checks for an executable called 'name' by running 'command' and suppressing
 16 |     output. If the return code is non-zero or an OS error occurs, an Exception is raised""" 
 17 |     try:
 18 |         with open(os.devnull, "w") as fnull:
 19 |             result=subprocess.check_call(command,stdout=fnull, stderr=fnull)
 20 |     except OSError as e:
 21 |         message = """Error running {0}.
 22 | Command failed: {1}
 23 | {2}""".format(name, " ".join(command), e)
 24 |         raise OSError(message)
 25 |     except subprocess.CalledProcessError as e:
 26 |         raise
 27 |     except Exception as e:
 28 |         raise
 29 | 
 30 | #-----------------------------------------------------------------------
 31 | def popen(name,command, *args, **kwargs):
 32 |     try:
 33 |         result=subprocess.Popen(command,*args, **kwargs)
 34 |         return result
 35 |     except OSError as e:
 36 |         message="""Error running {0}. Is it installed correctly?
 37 | Error: {1}""".format(name, e)
 38 |         raise OSError(message)
 39 |     except Exception as e:
 40 |         raise
 41 | 
 42 | def colinterp(a,x) :
 43 |     """Interpolates colors"""
 44 |     l = len(a)-1
 45 |     i = min(l, max(0, int (x * l)))
 46 |     (u,v) = a[i:i+2,:]
 47 |     return u - (u-v) * ((x * l) % 1.0)
 48 | 
 49 | colarr = array([ [255,0,0],[255,255,0],[0,255,0],[0,255,255],[0,0,255] ])
 50 | 
 51 | def col(x, colmult=1.0) :
 52 |     """colors"""
 53 |     return colinterp(colarr,(colmult * x)% 1.0) / 2
 54 | 
 55 | 
 56 | def process_page(infile, pgs, 
 57 |     outfilename=None,
 58 |     greyscale_threshold=25,
 59 |     page=None,
 60 |     crop=None,
 61 |     line_length=0.17,
 62 |     bitmap_resolution=300,
 63 |     name=None,
 64 |     pad=2,
 65 |     white=None,
 66 |     black=None,
 67 |     bitmap=False, 
 68 |     checkcrop=False, 
 69 |     checklines=False, 
 70 |     checkdivs=False,
 71 |     checkcells=False,
 72 |     whitespace="normalize",
 73 |     boxes=False) :
 74 |     
 75 |   outfile = open(outfilename,'w') if outfilename else sys.stdout
 76 |   page=page or []
 77 |   (pg,frow,lrow) = (map(int,(pgs.split(":")))+[None,None])[0:3]
 78 |   #check that pdftoppm exists by running a simple command
 79 |   check_for_required_executable("pdftoppm",["pdftoppm","-h"])
 80 |   #end check
 81 | 
 82 |   p = popen("pdftoppm", ("pdftoppm -gray -r %d -f %d -l %d %s " %
 83 |       (bitmap_resolution,pg,pg,quote(infile))),
 84 |       stdin=subprocess.PIPE, stdout=subprocess.PIPE, shell=True )
 85 | 
 86 | #-----------------------------------------------------------------------
 87 | # image load section.
 88 | 
 89 |   (maxval, width, height, data) = readPNM(p.stdout)
 90 | 
 91 |   pad = int(pad)
 92 |   height+=pad*2
 93 |   width+=pad*2
 94 |   
 95 | # re-embed image with white padding.
 96 |   bmp = ones( (height,width) , dtype=bool )
 97 |   bmp[pad:height-pad,pad:width-pad] = ( data[:,:] > int(255.0*greyscale_threshold/100.0) )
 98 | 
 99 | # Set up debugging image.
100 |   img = zeros( (height,width,3) , dtype=uint8 )
101 |   img[:,:,0] = bmp*255
102 |   img[:,:,1] = bmp*255
103 |   img[:,:,2] = bmp*255
104 | 
105 | #-----------------------------------------------------------------------
106 | # Find bounding box.
107 |   t=0
108 |   while t < height and sum(bmp[t,:]==0) == 0 :
109 |     t=t+1
110 |   if t > 0 :
111 |     t=t-1
112 |   
113 |   b=height-1
114 |   while b > t and sum(bmp[b,:]==0) == 0 :
115 |     b=b-1
116 |   if b < height-1:
117 |     b = b+1
118 |   
119 |   l=0
120 |   while l < width and sum(bmp[:,l]==0) == 0 :
121 |     l=l+1
122 |   if l > 0 :
123 |     l=l-1
124 |   
125 |   r=width-1
126 |   while r > l and sum(bmp[:,r]==0) == 0 :
127 |     r=r-1
128 |   if r < width-1 :
129 |     r=r+1
130 |   
131 | # Mark bounding box.
132 |   bmp[t,:] = 0
133 |   bmp[b,:] = 0
134 |   bmp[:,l] = 0
135 |   bmp[:,r] = 0
136 | 
137 |   def boxOfString(x,p) :
138 |     s = x.split(":")
139 |     if len(s) < 4 :
140 |       raise ValueError("boxes have format left:top:right:bottom[:page]")
141 |     return ([int(bitmap_resolution * float(x)) + pad for x in s[0:4] ]
142 |                 + [ p if len(s)<5 else int(s[4]) ] ) 
143 | 
144 | 
145 | # translate crop to paint white.
146 |   whites = []
147 |   if crop :
148 |     (l,t,r,b,p) = boxOfString(crop,pg) 
149 |     whites.extend( [ (0,0,l,height,p), (0,0,width,t,p),
150 |                      (r,0,width,height,p), (0,b,width,height,p) ] )
151 | 
152 | # paint white ...
153 |   if white :
154 |     whites.extend( [ boxOfString(b, pg) for b in white ] )
155 | 
156 |   for (l,t,r,b,p) in whites :
157 |     if p == pg :
158 |       bmp[ t:b+1,l:r+1 ] = 1
159 |       img[ t:b+1,l:r+1 ] = [255,255,255]
160 |   
161 | # paint black ...
162 |   if black :
163 |     for b in black :
164 |       (l,t,r,b) = [int(bitmap_resolution * float(x)) + pad for x in b.split(":") ]
165 |       bmp[ t:b+1,l:r+1 ] = 0
166 |       img[ t:b+1,l:r+1 ] = [0,0,0]
167 | 
168 |   if checkcrop :
169 |     dumpImage(outfile,bmp,img, bitmap, pad)
170 |     return True
171 |     
172 | #-----------------------------------------------------------------------
173 | # Line finding section.
174 | #
175 | # Find all vertical or horizontal lines that are more than lthresh 
176 | # long, these are considered lines on the table grid.
177 | 
178 |   lthresh = int(line_length * bitmap_resolution)
179 |   vs = zeros(width, dtype=int)
180 |   for i in range(width) :
181 |     dd = diff( where(bmp[:,i])[0] ) 
182 |     if len(dd)>0:
183 |       v = max ( dd )
184 |       if v > lthresh :
185 |         vs[i] = 1
186 |     else:
187 | # it was a solid black line.
188 |       if bmp[0,i] == 0 :
189 |         vs[i] = 1
190 |   vd= ( where(diff(vs[:]))[0] +1 )
191 | 
192 |   hs = zeros(height, dtype=int)
193 |   for j in range(height) :
194 |     dd = diff( where(bmp[j,:]==1)[0] )
195 |     if len(dd) > 0 :
196 |       h = max ( dd )
197 |       if h > lthresh :
198 |         hs[j] = 1
199 |     else:
200 | # it was a solid black line.
201 |       if bmp[j,0] == 0 :
202 |         hs[j] = 1
203 |   hd=(  where(diff(hs[:]==1))[0] +1 )
204 | 
205 | #-----------------------------------------------------------------------
206 | # Look for dividers that are too large.
207 |   maxdiv=10
208 |   i=0
209 | 
210 |   while i < len(vd) :
211 |     if vd[i+1]-vd[i] > maxdiv :
212 |       vd = delete(vd,i)
213 |       vd = delete(vd,i)
214 |     else:
215 |       i=i+2
216 |   
217 |   j = 0 
218 |   while j < len(hd):
219 |     if hd[j+1]-hd[j] > maxdiv :
220 |       hd = delete(hd,j)
221 |       hd = delete(hd,j)
222 |     else:
223 |       j=j+2
224 |   
225 |   if checklines :
226 |     for i in vd :
227 |       img[:,i] = [255,0,0] # red
228 |   
229 |     for j in hd :
230 |       img[j,:] = [0,0,255] # blue
231 |     dumpImage(outfile,bmp,img)
232 |     return True
233 | #-----------------------------------------------------------------------
234 | # divider checking.
235 | #
236 | # at this point vd holds the x coordinate of vertical  and 
237 | # hd holds the y coordinate of horizontal divider transitions for each 
238 | # vertical and horizontal lines in the table grid.
239 | 
240 |   def isDiv(a, l,r,t,b) :
241 |           # if any col or row (in axis) is all zeros ...
242 |     return sum( sum(bmp[t:b, l:r], axis=a)==0 ) >0 
243 | 
244 |   if checkdivs :
245 |     img = img / 2
246 |     for j in range(0,len(hd),2):
247 |       for i in range(0,len(vd),2):
248 |         if i>0 :
249 |           (l,r,t,b) = (vd[i-1], vd[i],   hd[j],   hd[j+1]) 
250 |           img[ t:b, l:r, 1 ] = 192
251 |           if isDiv(1, l,r,t,b) :
252 |             img[ t:b, l:r, 0 ] = 0
253 |             img[ t:b, l:r, 2 ] = 255
254 |           
255 |         if j>0 :
256 |           (l,r,t,b) = (vd[i],   vd[i+1], hd[j-1], hd[j] )
257 |           img[ t:b, l:r, 1 ] = 128
258 |           if isDiv(0, l,r,t,b) :
259 |             img[ t:b, l:r, 0 ] = 255
260 |             img[ t:b, l:r, 2 ] = 0
261 |     dumpImage(outfile,bmp,img)
262 |     return True
263 | #-----------------------------------------------------------------------
264 | # Cell finding section.
265 | # This algorithm is width hungry, and always generates rectangular
266 | # boxes.
267 | 
268 |   cells =[] 
269 |   touched = zeros( (len(hd), len(vd)),dtype=bool )
270 |   j = 0
271 |   while j*2+2 < len (hd) :
272 |     i = 0
273 |     while i*2+2 < len(vd) :
274 |       u = 1
275 |       v = 1
276 |       if not touched[j,i] :
277 |         while 2+(i+u)*2 < len(vd) and \
278 |             not isDiv( 0, vd[ 2*(i+u) ], vd[ 2*(i+u)+1],
279 |                hd[ 2*(j+v)-1 ], hd[ 2*(j+v) ] ):
280 |           u=u+1
281 |         bot = False
282 |         while 2+(j+v)*2 < len(hd) and not bot :
283 |           bot = False
284 |           for k in range(1,u+1) :
285 |             bot |= isDiv( 1, vd[ 2*(i+k)-1 ], vd[ 2*(i+k)],
286 |                hd[ 2*(j+v) ], hd[ 2*(j+v)+1 ] )
287 |           if not bot :
288 |             v=v+1
289 |         cells.append( (i,j,u,v) )
290 |         touched[ j:j+v, i:i+u] = True
291 |       i = i+1
292 |     j=j+1
293 |   
294 |   
295 |   if checkcells :
296 |     nc = len(cells)+0.
297 |     img = img / 2
298 |     for k in range(len(cells)):
299 |       (i,j,u,v) = cells[k]
300 |       (l,r,t,b) = ( vd[2*i+1] , vd[ 2*(i+u) ], hd[2*j+1], hd[2*(j+v)] )
301 |       img[ t:b, l:r ] += col( k/nc )
302 |     dumpImage(outfile,bmp,img)
303 |     return True
304 |   
305 | #-----------------------------------------------------------------------
306 | # fork out to extract text for each cell.
307 | 
308 |   whitespace_re = re.compile( r'\s+')
309 |   # keep the compiled regex separate from the 'whitespace' option it implements
310 |   def getCell( (i,j,u,v) ):
311 |     (l,r,t,b) = ( vd[2*i+1] , vd[ 2*(i+u) ], hd[2*j+1], hd[2*(j+v)] )
312 |     p = popen("pdftotext", 
313 |               "pdftotext -r %d -x %d -y %d -W %d -H %d -layout -nopgbrk -f %d -l %d %s -" % (bitmap_resolution, l-pad, t-pad, r-l, b-t, pg, pg, quote(infile)),
314 |               stdout=subprocess.PIPE, 
315 |               shell=True )
316 |     
317 |     ret = p.communicate()[0]
318 |     if whitespace != 'raw' :
319 |       ret = whitespace_re.sub( "" if whitespace == "none" else " ", ret )
320 |       if len(ret) > 0 :
321 |         ret = ret[ (1 if ret[0]==' ' else 0) : 
322 |                    len(ret) - (1 if ret[-1]==' ' else 0) ]
323 |     return (i,j,u,v,pg,ret)
324 |       
325 |   if boxes :
326 |     cells = [ x + (pg,"",) for x in cells if 
327 |               ( frow == None or (x[1] >= frow and x[1] <= lrow)) ]
328 |   else :
329 |     #check that pdftotext exists by running a simple command
330 |     check_for_required_executable("pdftotext",["pdftotext","-h"])
331 |     #end check
332 |     cells = [ getCell(x)   for x in cells if 
333 |               ( frow == None or (x[1] >= frow and x[1] <= lrow)) ]
334 |   return cells
335 | 
336 | #-----------------------------------------------------------------------
337 | #output section.
338 | 
339 | def output(cells, pgs, 
340 |                 cells_csv_filename=None, 
341 |                 cells_json_filename=None, 
342 |                 cells_xml_filename=None, 
343 |                 table_csv_filename=None,
344 |                 table_html_filename=None,
345 |                 table_list_filename=None,
346 |                 infile=None, name=None, output_type=None
347 |                 ):
348 |                 
349 |     output_types = [
350 |              dict(filename=cells_csv_filename, function=o_cells_csv),  
351 |              dict(filename=cells_json_filename, function=o_cells_json), 
352 |              dict(filename=cells_xml_filename, function=o_cells_xml), 
353 |              dict(filename=table_csv_filename, function=o_table_csv),
354 |              dict(filename=table_html_filename, function=o_table_html),
355 |              dict(filename=table_list_filename, function=o_table_list)
356 |              ]
357 |              
358 |     for entry in output_types:
359 |         if entry["filename"]:
360 |             if entry["filename"] != sys.stdout:
361 |                 outfile = open(entry["filename"],'w')
362 |             else:
363 |                 outfile = sys.stdout
364 |             
365 |             entry["function"](cells, pgs, 
366 |                                 outfile=outfile, 
367 |                                 name=name, 
368 |                                 infile=infile, 
369 |                                 output_type=output_type)
370 | 
371 |             if entry["filename"] != sys.stdout:
372 |                 outfile.close()
373 |         
374 | def o_cells_csv(cells,pgs, outfile=None, name=None, infile=None, output_type=None) :
375 |   outfile = outfile or sys.stdout
376 |   csv.writer( outfile , dialect='excel' ).writerows(cells)
377 | 
378 | def o_cells_json(cells,pgs, outfile=None, infile=None, name=None, output_type=None) :
379 |   """Output JSON formatted cell data"""
380 |   outfile = outfile or sys.stdout
381 |   #defaults
382 |   infile=infile or ""
383 |   name=name or ""
384 |   
385 |   json.dump({ 
386 |     "src": infile,
387 |     "name": name,
388 |     "colnames": ( "x","y","width","height","page","contents" ),
389 |     "cells":cells
390 |     }, outfile)
391 | 
392 | def o_cells_xml(cells,pgs, outfile=None,infile=None, name=None, output_type=None) : 
393 |   """Output XML formatted cell data"""
394 |   outfile = outfile or sys.stdout
395 |   #defaults
396 |   infile=infile or ""
397 |   name=name or ""
398 | 
399 |   doc = getDOMImplementation().createDocument(None,"table", None)
400 |   root = doc.documentElement;
401 |   if infile :
402 |     root.setAttribute("src",infile)
403 |   if name :
404 |     root.setAttribute("name",name)
405 |   for cl in cells :
406 |     x = doc.createElement("cell")
407 |     map(lambda(a): x.setAttribute(*a), zip("xywhp",map(str,cl)))
408 |     if cl[5] != "" :
409 |       x.appendChild( doc.createTextNode(cl[5]) )
410 |     root.appendChild(x)
411 |   outfile.write( doc.toprettyxml() )
412 |   
413 | def table_to_list(cells,pgs) : 
414 |   """Output list of lists"""
415 |   l=[0,0,0]
416 |   for (i,j,u,v,pg,value) in cells :
417 |       r=[i,j,pg]
418 |       l = [max(x) for x in zip(l,r)]
419 |   
420 |   tab = [ [ [ "" for x in range(l[0]+1)
421 |             ] for x in range(l[1]+1)
422 |           ] for x in range(l[2]+1)
423 |         ]
424 |   for (i,j,u,v,pg,value) in cells :
425 |     tab[pg][j][i] = value
426 | 
427 |   return tab
428 | 
429 | def o_table_csv(cells,pgs, outfile=None, name=None, infile=None, output_type=None) :
430 |   """Output CSV formatted table"""
431 |   outfile = outfile or sys.stdout
432 |   tab=table_to_list(cells, pgs)
433 |   for t in tab:
434 |     csv.writer( outfile , dialect='excel' ).writerows(t)
435 |   
436 | 
437 | def o_table_list(cells,pgs, outfile=None, name=None, infile=None, output_type=None) :
438 |   """Output list of lists"""
439 |   outfile = outfile or sys.stdout
440 |   tab = table_to_list(cells, pgs)
441 |   outfile.write( str(tab) + "\n" )
442 |     
443 | def o_table_html(cells,pgs, outfile=None, output_type=None, name=None, infile=None) : 
444 |   """Output HTML formatted table"""
445 |   outfile = outfile or sys.stdout
446 |   oj = 0 
447 |   opg = 0
448 |   doc = getDOMImplementation().createDocument(None,"table", None)
449 |   root = doc.documentElement;
450 |   if (output_type == "table_chtml" ):
451 |     root.setAttribute("border","1")
452 |     root.setAttribute("cellspacing","0")
453 |     root.setAttribute("style","border-spacing:0")
454 |   nc = len(cells)
455 |   tr = None
456 |   for k in range(nc):
457 |     (i,j,u,v,pg,value) = cells[k]
458 |     if j > oj or pg > opg:
459 |       if pg > opg:
460 |         s = "Name: " + name + ", " if name else ""
461 |         root.appendChild( doc.createComment( s + 
462 |           ("Source: %s page %d." % (infile, pg) )));
463 |       if tr :
464 |         root.appendChild(tr)
465 |       tr = doc.createElement("tr")
466 |       oj = j
467 |       opg = pg
468 |     td = doc.createElement("td")
469 |     if value != "" :
470 |       td.appendChild( doc.createTextNode(value) )
471 |     if u>1 :
472 |       td.setAttribute("colspan",str(u))
473 |     if v>1 :
474 |       td.setAttribute("rowspan",str(v))
475 |     if output_type == "table_chtml" :
476 |       td.setAttribute("style", "background-color: #%02x%02x%02x" %
477 |             tuple(128+col(k/(nc+0.))))
478 |     tr.appendChild(td)
479 |   root.appendChild(tr)
480 |   outfile.write( doc.toprettyxml() )
481 |   
482 | 


--------------------------------------------------------------------------------
/src/pdftableextract/extracttab.py:
--------------------------------------------------------------------------------
  1 | # Description : PDF Table Extraction Utility
  2 | #      Author : Ian McEwan, Ashima Research.
  3 | #  Maintainer : ijm
  4 | #     Lastmod : 20130402 (ijm)
  5 | #     License : Copyright (C) 2011 Ashima Research. All rights reserved.
  6 | #               Distributed under the MIT Expat License. See LICENSE file.
  7 | #               https://github.com/ashima/pdf-table-extract
  8 | 
  9 | import sys, argparse, subprocess, re, csv, json
 10 | from numpy import *
 11 | from pipes import quote
 12 | from xml.dom.minidom import getDOMImplementation
 13 | 
 14 | # Processing function.
 15 | 
 16 | def process_page(pgs) :
 17 |   (pg,frow,lrow) = (map(int,(pgs.split(":")))+[None,None])[0:3]
 18 | 
 19 |   p = subprocess.Popen( ("pdftoppm -gray -r %d -f %d -l %d %s " %
 20 |       (args.r,pg,pg,quote(args.infile))),
 21 |       stdin=subprocess.PIPE, stdout=subprocess.PIPE, shell=True )
 22 | 
 23 | #-----------------------------------------------------------------------
 24 | # image load section.
 25 | 
 26 |   (maxval, width, height, data) = readPNM(p.stdout)
 27 | 
 28 |   pad = int(args.pad)
 29 |   height+=pad*2
 30 |   width+=pad*2
 31 |   
 32 | # re-embed image with white padding.
 33 |   bmp = ones( (height,width) , dtype=bool )
 34 |   bmp[pad:height-pad,pad:width-pad] = ( data[:,:] > int(255.0*args.g/100.0) )
 35 | 
 36 | # Set up debugging image.
 37 |   img = zeros( (height,width,3) , dtype=uint8 )
 38 |   img[:,:,0] = bmp*255
 39 |   img[:,:,1] = bmp*255
 40 |   img[:,:,2] = bmp*255
 41 | 
 42 | #-----------------------------------------------------------------------
 43 | # Find bounding box.
 44 | 
 45 |   t=0
 46 |   while t < height and sum(bmp[t,:]==0) == 0 :
 47 |     t=t+1
 48 |   if t > 0 :
 49 |     t=t-1
 50 |   
 51 |   b=height-1
 52 |   while b > t and sum(bmp[b,:]==0) == 0 :
 53 |     b=b-1
 54 |   if b < height-1:
 55 |     b = b+1
 56 |   
 57 |   l=0
 58 |   while l < width and sum(bmp[:,l]==0) == 0 :
 59 |     l=l+1
 60 |   if l > 0 :
 61 |     l=l-1
 62 |   
 63 |   r=width-1
 64 |   while r > l and sum(bmp[:,r]==0) == 0 :
 65 |     r=r-1
 66 |   if r < width-1 :
 67 |     r=r+1
 68 |   
 69 | # Mark bounding box.
 70 |   bmp[t,:] = 0
 71 |   bmp[b,:] = 0
 72 |   bmp[:,l] = 0
 73 |   bmp[:,r] = 0
 74 | 
 75 |   def boxOfString(x,p) :
 76 |     s = x.split(":")
 77 |     if len(s) < 4 :
 78 |       raise Exception("boxes have format left:top:right:bottom[:page]")
 79 |     return ([args.r * float(x) + args.pad for x in s[0:4] ]
 80 |                 + [ p if len(s)<5 else int(s[4]) ] ) 
 81 | 
 82 | 
 83 | # translate crop to paint white.
 84 |   whites = []
 85 |   if args.crop :
 86 |     (l,t,r,b,p) = boxOfString(args.crop,pg) 
 87 |     whites.extend( [ (0,0,l,height,p), (0,0,width,t,p),
 88 |                      (r,0,width,height,p), (0,b,width,height,p) ] )
 89 | 
 90 | # paint white ...
 91 |   if args.white :
 92 |     whites.extend( [ boxOfString(b, pg) for b in args.white ] )
 93 | 
 94 |   for (l,t,r,b,p) in whites :
 95 |     if p == pg :
 96 |       bmp[ t:b+1,l:r+1 ] = 1
 97 |       img[ t:b+1,l:r+1 ] = [255,255,255]
 98 |   
 99 | # paint black ...
100 |   if args.black :
101 |     for b in args.black :
102 |       (l,t,r,b) = [args.r * float(x) + args.pad for x in b.split(":") ]
103 |       bmp[ t:b+1,l:r+1 ] = 0
104 |       img[ t:b+1,l:r+1 ] = [0,0,0]
105 | 
106 |   if args.checkcrop :
107 |     dumpImage(args,bmp,img)
108 |     sys.exit(0)
109 |     
110 |   
111 | #-----------------------------------------------------------------------
112 | # Line finding section.
113 | #
 114 | # Find all vertical or horizontal lines that are more than lthresh
 115 | # pixels long; these are considered lines on the table grid.
116 | 
117 |   lthresh = int(args.l * args.r)
118 |   vs = zeros(width, dtype=int)
119 |   for i in range(width) :
120 |     dd = diff( where(bmp[:,i])[0] ) 
121 |     if len(dd)>0:
122 |       v = max ( dd )
123 |       if v > lthresh :
124 |         vs[i] = 1
125 |     else:
126 | # it was a solid black line.
127 |       if bmp[0,i] == 0 :
128 |         vs[i] = 1
129 |   vd= ( where(diff(vs[:]))[0] +1 )
130 | 
131 |   hs = zeros(height, dtype=int)
132 |   for j in range(height) :
133 |     dd = diff( where(bmp[j,:]==1)[0] )
134 |     if len(dd) > 0 :
135 |       h = max ( dd )
136 |       if h > lthresh :
137 |         hs[j] = 1
138 |     else:
139 | # it was a solid black line.
140 |       if bmp[j,0] == 0 :
141 |         hs[j] = 1
142 |   hd=(  where(diff(hs[:]==1))[0] +1 )
143 | 
144 | #-----------------------------------------------------------------------
 145 | # Look for dividers that are too large.
146 | 
147 |   maxdiv=10
148 |   i=0
149 | 
 150 |   while i+1 < len(vd) :
151 |     if vd[i+1]-vd[i] > maxdiv :
152 |       vd = delete(vd,i)
153 |       vd = delete(vd,i)
154 |     else:
155 |       i=i+2
156 |   
157 |   j = 0 
 158 |   while j+1 < len(hd):
159 |     if hd[j+1]-hd[j] > maxdiv :
160 |       hd = delete(hd,j)
161 |       hd = delete(hd,j)
162 |     else:
163 |       j=j+2
164 |   
165 |   if args.checklines :
166 |     for i in vd :
167 |       img[:,i] = [255,0,0] # red
168 |   
169 |     for j in hd :
170 |       img[j,:] = [0,0,255] # blue
171 |     dumpImage(args,bmp,img)
172 |     sys.exit(0)
173 |   
174 | #-----------------------------------------------------------------------
 175 | # Divider checking.
 176 | #
 177 | # At this point vd holds the x coordinates of the vertical divider
 178 | # transitions and hd holds the y coordinates of the horizontal divider
 179 | # transitions, one (start, end) pair for each line in the table grid.
180 | 
181 |   def isDiv(a, l,r,t,b) :
182 |           # if any col or row (in axis) is all zeros ...
183 |     return sum( sum(bmp[t:b, l:r], axis=a)==0 ) >0 
184 | 
185 |   if args.checkdivs :
186 |     img = img / 2
187 |     for j in range(0,len(hd),2):
188 |       for i in range(0,len(vd),2):
189 |         if i>0 :
190 |           (l,r,t,b) = (vd[i-1], vd[i],   hd[j],   hd[j+1]) 
191 |           img[ t:b, l:r, 1 ] = 192
192 |           if isDiv(1, l,r,t,b) :
193 |             img[ t:b, l:r, 0 ] = 0
194 |             img[ t:b, l:r, 2 ] = 255
195 |           
196 |         if j>0 :
197 |           (l,r,t,b) = (vd[i],   vd[i+1], hd[j-1], hd[j] )
198 |           img[ t:b, l:r, 1 ] = 128
199 |           if isDiv(0, l,r,t,b) :
200 |             img[ t:b, l:r, 0 ] = 255
201 |             img[ t:b, l:r, 2 ] = 0
202 |   
203 |     dumpImage(args,bmp,img)
204 |     sys.exit(0)
205 |   
206 | #-----------------------------------------------------------------------
207 | # Cell finding section.
 208 | # This algorithm is width hungry, and always generates rectangular
209 | # boxes.
210 | 
211 |   cells =[] 
212 |   touched = zeros( (len(hd), len(vd)),dtype=bool )
213 |   j = 0
214 |   while j*2+2 < len (hd) :
215 |     i = 0
216 |     while i*2+2 < len(vd) :
217 |       u = 1
218 |       v = 1
219 |       if not touched[j,i] :
220 |         while 2+(i+u)*2 < len(vd) and \
221 |             not isDiv( 0, vd[ 2*(i+u) ], vd[ 2*(i+u)+1],
222 |                hd[ 2*(j+v)-1 ], hd[ 2*(j+v) ] ):
223 |           u=u+1
224 |         bot = False
225 |         while 2+(j+v)*2 < len(hd) and not bot :
226 |           bot = False
227 |           for k in range(1,u+1) :
228 |             bot |= isDiv( 1, vd[ 2*(i+k)-1 ], vd[ 2*(i+k)],
229 |                hd[ 2*(j+v) ], hd[ 2*(j+v)+1 ] )
230 |           if not bot :
231 |             v=v+1
232 |         cells.append( (i,j,u,v) )
233 |         touched[ j:j+v, i:i+u] = True
234 |       i = i+1
235 |     j=j+1
236 |   
237 |   
238 |   if args.checkcells :
239 |     nc = len(cells)+0.
240 |     img = img / 2
241 |     for k in range(len(cells)):
242 |       (i,j,u,v) = cells[k]
243 |       (l,r,t,b) = ( vd[2*i+1] , vd[ 2*(i+u) ], hd[2*j+1], hd[2*(j+v)] )
244 |       img[ t:b, l:r ] += col( k/nc )
245 |     dumpImage(args,bmp,img)
246 |     sys.exit(0)
247 |   
248 |   
249 | #-----------------------------------------------------------------------
250 | # fork out to extract text for each cell.
251 | 
252 |   whitespace = re.compile( r'\s+')
253 |    
254 |   def getCell( (i,j,u,v) ):
255 |     (l,r,t,b) = ( vd[2*i+1] , vd[ 2*(i+u) ], hd[2*j+1], hd[2*(j+v)] )
256 |     p = subprocess.Popen(
257 |     ("pdftotext -r %d -x %d -y %d -W %d -H %d -layout -nopgbrk -f %d -l %d %s -"
258 |          % (args.r, l-pad, t-pad, r-l, b-t, pg, pg, quote(args.infile) ) ),
259 |         stdout=subprocess.PIPE, shell=True )
260 |     
261 |     ret = p.communicate()[0]
262 |     if args.w != 'raw' :
263 |       ret = whitespace.sub( "" if args.w == "none" else " ", ret )
264 |       if len(ret) > 0 :
265 |         ret = ret[ (1 if ret[0]==' ' else 0) : 
266 |                    len(ret) - (1 if ret[-1]==' ' else 0) ]
267 |     return (i,j,u,v,pg,ret)
268 | 
269 |   #if args.boxes :
270 |   #  cells = [ x + (pg,"",) for x in cells ]
271 |   #else :
272 |   #  cells = map(getCell, cells)
273 |   
274 |   if args.boxes :
275 |     cells = [ x + (pg,"",) for x in cells if 
 276 |               ( frow is None or (x[1] >= frow and x[1] <= lrow)) ]
 277 |   else :
 278 |     cells = [ getCell(x)   for x in cells if 
 279 |               ( frow is None or (x[1] >= frow and x[1] <= lrow)) ]
280 |   return cells
281 | 
282 | 
283 | #-----------------------------------------------------------------------
284 | # main
285 | 
286 | def main_script():
287 |     args = procargs()
288 | 
289 |     cells = []
290 |     for pgs in args.page :
291 |       cells.extend(process_page(pgs))
292 | 
293 |     { "cells_csv" : o_cells_csv,   "cells_json" : o_cells_json,
294 |       "cells_xml" : o_cells_xml,   "table_csv"  : o_table_csv,
295 |       "table_html": o_table_html,  "table_chtml": o_table_html,
296 |       } [ args.t ](cells,args.page)
297 | 
298 | 
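
The line-finding section above flags a column when it contains a long enough black run, then treats the positions where that flag flips as divider edges. The following is a minimal pure-Python sketch of that idea, not the original implementation: it counts runs directly instead of using `diff(where(...))`, and the names `longest_black_run` and `divider_edges` are hypothetical. It assumes the bitmap is a list of rows with 0 for black and 1 for white.

```python
def longest_black_run(pixels):
    """Length of the longest run of 0 (black) values in a sequence."""
    best = run = 0
    for p in pixels:
        run = run + 1 if p == 0 else 0
        best = max(best, run)
    return best


def divider_edges(bmp, lthresh):
    """Return the x positions where the 'column holds a long line' flag flips."""
    width = len(bmp[0])
    flags = []
    for x in range(width):
        column = [row[x] for row in bmp]
        flags.append(1 if longest_black_run(column) > lthresh else 0)
    # An edge sits at x when flags[x] differs from flags[x - 1],
    # which mirrors where(diff(vs))[0] + 1 in the listing above.
    return [x for x in range(1, width) if flags[x] != flags[x - 1]]


# 5 rows by 6 columns: column 2 is a solid black line,
# column 4 holds a single black dot that is too short to count.
bitmap = [
    [1, 1, 0, 1, 1, 1],
    [1, 1, 0, 1, 0, 1],
    [1, 1, 0, 1, 1, 1],
    [1, 1, 0, 1, 1, 1],
    [1, 1, 0, 1, 1, 1],
]
edges = divider_edges(bitmap, lthresh=3)
print(edges)
```

Edges come in (start, end) pairs, which is why the cell-finding loops step through `vd` and `hd` two entries at a time.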


--------------------------------------------------------------------------------
/src/pdftableextract/pnm.py:
--------------------------------------------------------------------------------
 1 | from numpy import array, fromstring, uint8, reshape, ones
 2 | #-----------------------------------------------------------------------
 3 | # PNM stuff.
 4 | 
 5 | def noncomment(fd):
 6 |   """Read lines from the filehandle until a non-comment line is found. 
 7 |   Comments start with #"""
 8 |   while True:
 9 |     x = fd.readline() 
10 |     if x.startswith('#') :
11 |       continue
12 |     else:
13 |       return x
14 | 
15 | def readPNM(fd):
16 |   """Reads the PNM file from the filehandle"""
17 |   t = noncomment(fd)
18 |   s = noncomment(fd)
19 |   m = noncomment(fd) if not (t.startswith('P1') or t.startswith('P4')) else '1'
20 |   data = fd.read()
21 |   ls = len(s.split())
22 |   if ls != 2 :
23 |     name = "<pipe>" if fd.name=="<fdopen>" else "Filename = {0}".format(fd.name)
24 |     raise IOError("Expected 2 elements from parsing PNM file, got {0}: {1}".format(ls, name))
25 |   xs, ys = s.split()
26 |   width = int(xs)
27 |   height = int(ys)
28 |   m = int(m)
29 | 
30 |   if m != 255 :
 31 |     print("Just want 8 bit pgms for now!")
32 |   
33 |   d = fromstring(data,dtype=uint8)
34 |   d = reshape(d, (height,width) )
35 |   return (m,width,height, d)
36 | 
37 | def writePNM(fd,img):
38 |   """Writes a PNM file to a filehandle given the img data as a numpy array"""
39 |   s = img.shape
40 |   m = 255
41 |   if img.dtype == bool :
42 |     img = img + uint8(0) 
43 |     t = "P5"
44 |     m = 1
45 |   elif len(s) == 2 :
46 |     t = "P5"
47 |   else:
48 |     t = "P6"
49 |     
50 |   fd.write( "%s\n%d %d\n%d\n" % (t, s[1],s[0],m) )
51 |   fd.write( uint8(img).tostring() )
52 | 
53 | 
54 | def dumpImage(outfile,bmp,img,bitmap=False, pad=2) :
 55 |     """Writes img (or bmp when bitmap is True), trimmed by pad pixels on every side, to outfile and closes it"""
56 |     oi = bmp if bitmap else img
57 |     (height,width) = bmp.shape
58 |     writePNM(outfile, oi[pad:height-pad, pad:width-pad])
59 |     outfile.close()
60 | 
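
For readers who want to see the header layout that `readPNM` relies on (magic number, then `width height`, then the maximum value, with `#` comment lines allowed in between), here is a small standalone sketch. It is an illustration under stated assumptions rather than a replacement: the name `read_pgm` is hypothetical, it handles only binary greyscale (`P5`) data with a maximum value of at most 255, it works on a byte stream, and it returns plain lists instead of a numpy array.

```python
import io


def read_pgm(fd):
    """Parse a binary PGM (P5) byte stream into (maxval, width, height, rows)."""
    def next_header_line():
        # Skip comment lines, which start with '#'.
        while True:
            line = fd.readline()
            if not line:
                raise IOError("unexpected end of PNM header")
            if not line.startswith(b"#"):
                return line

    magic = next_header_line().strip()
    if magic != b"P5":
        raise IOError("only binary PGM (P5) is handled in this sketch")
    dims = next_header_line().split()
    if len(dims) != 2:
        raise IOError("expected 'width height', got {0!r}".format(dims))
    width, height = int(dims[0]), int(dims[1])
    maxval = int(next_header_line())
    data = bytearray(fd.read())
    if len(data) != width * height:
        raise IOError("pixel data does not match the header dimensions")
    # One list of integer grey levels per image row.
    rows = [list(data[y * width:(y + 1) * width]) for y in range(height)]
    return maxval, width, height, rows


pixels = bytes(bytearray([0, 128, 255, 255, 128, 0]))
sample = io.BytesIO(b"P5\n# a comment\n3 2\n255\n" + pixels)
maxval, width, height, rows = read_pgm(sample)
print(maxval, width, height, rows)
```

Checking the payload length against the header catches truncated pipes early, instead of failing later in a reshape.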


--------------------------------------------------------------------------------
/src/pdftableextract/scripts.py:
--------------------------------------------------------------------------------
  1 | import argparse
  2 | import sys
  3 | import logging
  4 | import subprocess
  5 | from .core import process_page, output
  6 | import core
  7 | 
  8 | #-----------------------------------------------------------------------
  9 | 
 10 | def procargs() :
 11 |   p = argparse.ArgumentParser( description="Finds tables in a PDF page.")
 12 |   p.add_argument("-i", dest='infile',  help="input file" )
 13 |   p.add_argument("-o", dest='outfile', help="output file", default=None,
 14 |      type=str)
 15 |   p.add_argument("--greyscale_threshold","-g", help="grayscale threshold (%%)", type=int, default=25 )
 16 |   p.add_argument("-p", type=str, dest='page', required=True, action="append",
 17 |      help="a page in the PDF to process, as page[:firstrow:lastrow]." )
 18 |   p.add_argument("-c", type=str, dest='crop',
 19 |      help="crop to left:top:right:bottom. Paints white outside this "
 20 |           "rectangle."  )
 21 |   p.add_argument("--line_length", "-l", type=float, default=0.17 ,
 22 |      help="line length threshold (length)" )
 23 |   p.add_argument("--bitmap_resolution", "-r", type=int, default=300,
 24 |      help="resolution of internal bitmap (dots per length unit)" )
 25 |   p.add_argument("-name", help="name to add to XML tag, or HTML comments")
  26 |   p.add_argument("-pad", help="initial image padding (pixels)", type=int,
 27 |      default=2 )
 28 |   p.add_argument("-white",action="append", 
  29 |     help="paint white to the bitmap as left:top:right:bottom in length units. "
  30 |          "Done before painting black" )
  31 |   p.add_argument("-black",action="append", 
  32 |     help="paint black to the bitmap as left:top:right:bottom in length units. "
  33 |          "Done after painting white" )
 34 |   p.add_argument("-bitmap", action="store_true",
  35 |      help = "Dump the working bitmap instead of the debugging image." )
  36 |   p.add_argument("-checkcrop",  action="store_true",
  37 |      help = "Stop after finding cropping rectangle, and output debugging "
  38 |             "image (use -bitmap).")
  39 |   p.add_argument("-checklines", action="store_true",
  40 |      help = "Stop after finding lines, and output debugging image." )
  41 |   p.add_argument("-checkdivs",  action="store_true",
  42 |      help = "Stop after finding dividers, and output debugging image." )
  43 |   p.add_argument("-checkcells", action="store_true",
  44 |      help = "Stop after finding cells, and output debugging image." )
  45 |   p.add_argument("-colmult", type=float, default=1.0,
  46 |      help = "color cycling multiplier for checkcells and chtml" )
 47 |   p.add_argument("-boxes", action="store_true",
 48 |      help = "Just output cell corners, don't send cells to pdftotext." )
 49 |   p.add_argument("-t", choices=['cells_csv','cells_json','cells_xml',
 50 |      'table_csv','table_html','table_chtml','table_list'],
 51 |      default="cells_xml",
 52 |      help = "output type (table_chtml is colorized like '-checkcells') "
 53 |             "(default cells_xml)" )
 54 |   p.add_argument("--whitespace","-w", choices=['none','normalize','raw'], default="normalize",
 55 |      help = "What to do with whitespace in cells. none = remove it all, "
 56 |             "normalize (default) = any whitespace (including CRLF) replaced "
 57 |             "with a single space, raw = do nothing." )
 58 |   p.add_argument("--traceback","--backtrace","-tb","-bt",action="store_true")
 59 |   return p.parse_args()
 60 | 
 61 | def main():
 62 |   try:
 63 |     args = procargs()
 64 |     imain(args)
 65 |   except IOError as e:
 66 |     if args.traceback:
 67 |         raise
 68 |     sys.exit("I/O Error running pdf-table-extract: {0}".format(e))
 69 |   except OSError as e:
 70 |     print("An OS Error occurred running pdf-table-extract: Is `pdftoppm` installed and available?")
 71 |     if args.traceback:
 72 |         raise
 73 |     sys.exit("OS Error: {0}".format(e))
 74 |   except subprocess.CalledProcessError as e:
 75 |     if args.traceback:
 76 |         raise
 77 |     sys.exit("Error while checking a subprocess call: {0}".format(e))
 78 |   except Exception as e:
 79 |     if args.traceback:
 80 |         raise
 81 |     sys.exit(e)
 82 | 
 83 | def imain(args):
 84 |     cells = []
 85 |     if args.checkcrop or args.checklines or args.checkdivs or args.checkcells:
 86 |         for pgs in args.page :
 87 |             success = process_page(args.infile, pgs,
 88 |                 bitmap=args.bitmap, 
 89 |                 checkcrop=args.checkcrop, 
 90 |                 checklines=args.checklines, 
 91 |                 checkdivs=args.checkdivs,
 92 |                 checkcells=args.checkcells,
 93 |                 whitespace=args.whitespace,
 94 |                 boxes=args.boxes,
 95 |                 greyscale_threshold=args.greyscale_threshold,
 96 |                 page=args.page,
 97 |                 crop=args.crop,
 98 |                 line_length=args.line_length,
 99 |                 bitmap_resolution=args.bitmap_resolution,
100 |                 name=args.name,
101 |                 pad=args.pad,
102 |                 white=args.white,
103 |                 black=args.black, outfilename=args.outfile)
104 | 
105 |     else:
106 |         for pgs in args.page :
107 |             cells.extend(process_page(args.infile, pgs,
108 |                 bitmap=args.bitmap, 
109 |                 checkcrop=args.checkcrop, 
110 |                 checklines=args.checklines, 
111 |                 checkdivs=args.checkdivs,
112 |                 checkcells=args.checkcells,
113 |                 whitespace=args.whitespace,
114 |                 boxes=args.boxes,
115 |                 greyscale_threshold=args.greyscale_threshold,
116 |                 page=args.page,
117 |                 crop=args.crop,
118 |                 line_length=args.line_length,
119 |                 bitmap_resolution=args.bitmap_resolution,
120 |                 name=args.name,
121 |                 pad=args.pad,
122 |                 white=args.white,
123 |                 black=args.black))
124 | 
 125 |         # Write the output once, after every requested page has been processed.
 126 |         filenames = dict()
 127 |         if args.outfile is None:
 128 |             args.outfile = sys.stdout
 129 |         filenames["{0}_filename".format(args.t)] = args.outfile
 130 |         output(cells, args.page, name=args.name, infile=args.infile, output_type=args.t, **filenames)
130 | 
131 | 
132 | 
133 | 
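
The `-p`, `-c`, `-white` and `-black` options above take colon-separated strings: `page[:firstrow:lastrow]` for pages and `left:top:right:bottom[:page]` in length units for boxes. The sketch below shows one way such strings could be turned into numbers. Both helper names (`parse_page_spec`, `parse_box_spec`) are hypothetical and not part of the package. The box conversion follows `boxOfString` in `extracttab.py` (resolution times length plus padding), with the added assumption that the result is rounded to whole pixels.

```python
def parse_page_spec(spec):
    """Split 'page[:firstrow:lastrow]' into (page, firstrow, lastrow)."""
    parts = [int(x) for x in spec.split(":")]
    if len(parts) == 1:
        # No row range given: None means "all rows".
        return (parts[0], None, None)
    if len(parts) == 3:
        return tuple(parts)
    raise ValueError("pages have format page[:firstrow:lastrow]")


def parse_box_spec(spec, resolution, pad, default_page):
    """Convert 'left:top:right:bottom[:page]' to pixel coordinates plus page."""
    fields = spec.split(":")
    if len(fields) < 4:
        raise ValueError("boxes have format left:top:right:bottom[:page]")
    # Length units -> pixels at the bitmap resolution, shifted by the padding.
    pixels = [int(round(resolution * float(x) + pad)) for x in fields[0:4]]
    page = int(fields[4]) if len(fields) > 4 else default_page
    return tuple(pixels) + (page,)


print(parse_page_spec("3"))
print(parse_page_spec("3:1:10"))
print(parse_box_spec("1:0.5:2:1.5", resolution=300, pad=2, default_page=7))
```

With the defaults from `procargs` (`-r 300`, `-pad 2`), a crop of `1:0.5:2:1.5` maps to pixel columns 302 to 602 and rows 152 to 452.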


--------------------------------------------------------------------------------