├── .gitignore ├── ASHIMA.md ├── LICENSE ├── LICENSE.txt ├── README.md ├── TODO.md ├── example ├── example.pdf └── test_to_pandas.py ├── setup.cfg ├── setup.py └── src └── pdftableextract ├── __init__.py ├── core.py ├── extracttab.py ├── pnm.py └── scripts.py /.gitignore: -------------------------------------------------------------------------------- 1 | *.pyc 2 | 3 | .installed.cfg 4 | bin 5 | develop-eggs 6 | 7 | *.egg-info 8 | 9 | tmp 10 | build 11 | dist 12 | -------------------------------------------------------------------------------- /ASHIMA.md: -------------------------------------------------------------------------------- 1 | *PDF Table Extraction Utility.* Analyses a page in a PDF looking 2 | for well delineated table cells, and extracts the text in each cell. 3 | Outputs include JSON, XML, and CSV lists of cell locations, shapes, 4 | and contents, and CSV and HTML versions of the tables. This utility 5 | is intended to be the first step in automatically processing data 6 | in tables from a PDF file, and was originally designed to read the 7 | tables in ST Micro’s datasheets. 
The script requires numpy and poppler 8 | (pdftoppm and pdftotext) 9 | 10 | ###License 11 | [MIT Expat](http://ashimagroup.net/os/license/mit-expat) 12 | 13 | ###Tags 14 | [Utilities](http://ashimagroup.net/os/tag/utilities) 15 | 16 | 17 | -------------------------------------------------------------------------------- /LICENSE: -------------------------------------------------------------------------------- 1 | Copyright (C) 2012 Ashima Research 2 | 3 | Permission is hereby granted, free of charge, to any person obtaining a copy 4 | of this software and associated documentation files (the "Software"), to deal 5 | in the Software without restriction, including without limitation the rights 6 | to use, copy, modify, merge, publish, distribute, sublicense, and/or sell 7 | copies of the Software, and to permit persons to whom the Software is 8 | furnished to do so, subject to the following conditions: 9 | 10 | The above copyright notice and this permission notice shall be included in 11 | all copies or substantial portions of the Software. 12 | 13 | THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR 14 | IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, 15 | FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE 16 | AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER 17 | LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, 18 | OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN 19 | THE SOFTWARE. 
20 | -------------------------------------------------------------------------------- /LICENSE.txt: -------------------------------------------------------------------------------- 1 | LICENSE -------------------------------------------------------------------------------- /README.md: -------------------------------------------------------------------------------- 1 | *PDF Table Extraction Utility.* Analyses a page in a PDF looking 2 | for well delineated table cells, and extracts the text in each cell. 3 | Outputs include JSON, XML, and CSV lists of cell locations, shapes, 4 | and contents, and CSV and HTML versions of the tables. This utility 5 | is intended to be the first step in automatically processing data 6 | in tables from a PDF file, and was originally designed to read the 7 | tables in ST Micro’s datasheets. The script requires numpy and poppler 8 | (pdftoppm and pdftotext) 9 | 10 | ###License 11 | [MIT Expat](http://ashimagroup.net/os/license/mit-expat) 12 | 13 | ###Tags 14 | [Utilities](http://ashimagroup.net/os/tag/utilities) 15 | 16 | 17 | -------------------------------------------------------------------------------- /TODO.md: -------------------------------------------------------------------------------- 1 | TODO 2 | ==== 3 | This list is in no particular order, and things will get done 4 | when/if I need them or I have spare time :) 5 | 6 | --ijm. 7 | 8 | 9 | Line finding 10 | ============ 11 | 12 | The line finding algorithm is robust but cannot tell large blocks 13 | of black from lines of black. So I need to add something that finds 14 | and removes solid blocks. 15 | 16 | Many tables use whitespace to delineate columns or rows but simply 17 | running the same scan for white as is done for black returns a lot 18 | of junk. For example, if the font used is a fixed-width font, the 19 | scanner returns a grid with cells that are one character wide, and 20 | one character high across the whole page.
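
The one-character grid problem above can be illustrated with a small sketch. It assumes a per-column count of dark pixels is available (the name `ink_per_column` is hypothetical, as is `white_gaps`; neither exists in this package), and it only reports runs of empty columns that are at least `min_width` pixels wide. This is a sketch of one possible filter under those assumptions, not the scan that `core.py` performs.

```python
def white_gaps(ink_per_column, min_width):
    """Return (start, end) pairs for runs of empty columns, end exclusive.

    ink_per_column: number of dark pixels in each pixel column
                    (hypothetical input, standing in for a column histogram).
    min_width:      shortest empty run that is treated as a column divider.
    """
    gaps = []
    start = None
    for x, ink in enumerate(ink_per_column):
        if ink == 0:
            if start is None:
                start = x  # a new empty run begins here
        elif start is not None:
            if x - start >= min_width:
                gaps.append((start, x))
            start = None
    # An empty run may extend to the right-hand edge of the page.
    if start is not None and len(ink_per_column) - start >= min_width:
        gaps.append((start, len(ink_per_column)))
    return gaps


# Two glyphs separated by a 1-pixel gap, then a 4-pixel gap between columns.
profile = [3, 3, 0, 2, 2, 0, 0, 0, 0, 5, 5]
print(white_gaps(profile, min_width=1))  # [(2, 3), (5, 9)]
print(white_gaps(profile, min_width=3))  # [(5, 9)]
```

With `min_width=1` every inter-character gap is reported, which is the junk described above; raising the threshold keeps only the wider gap. Choosing that threshold per document is exactly the kind of tuning option this list goes on to mention.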
21 | 22 | Horizontal white rows are probably easiest to add, and in fact the 23 | histogram data is already computed. However this puts a row boundary 24 | between EVERY row of text, and makes automatic cropping much harder 25 | and multi-cell consolidation impractical. 26 | 27 | A significant number of tuning options will be needed in order 28 | to control the width of whitespace recognised, whether or not to 29 | remove already detected black delimiters etc. 30 | 31 | 32 | Cell finding 33 | ============ 34 | 35 | The cell finding algorithm is very simplistic. It starts at the top 36 | left and finds all connected cells to its right, then descends, 37 | stopping if any cell divider is seen, and remembers which cells 38 | have been visited. This makes it 'width greedy'. It attempts to 39 | start a search for every cell so it will find all sizes of rectangular 40 | cells, but it will fail to find, and so split up, 2 of the 4 possible 41 | L shapes, only 1 of the 4 C or U shapes, or any O shape (where a 42 | cell surrounds another cell). 43 | 44 | An option is needed to select width greedy, height greedy, square greedy. 45 | 46 | A flood fill algorithm would make a single cell for text around 47 | a table, rather than splitting it into rectangles as is done now, but 48 | this would also require a graph view of cell relationships. 49 | 50 | Poppler wrapper 51 | =============== 52 | A short piece of code that wraps the poppler library to give the 53 | same functionality as pdftotext but over a socket or file descriptor, 54 | and able to process sequential requests. At the moment pdf-extract 55 | executes pdftotext once for every cell it finds! This would be much 56 | faster if a wrapper didn't need to be spun up repeatedly. 57 | 58 | A wrapper is needed to comply with the MIT Expat vs GPL incompatibility.
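
A different way to avoid one process per cell, shown here only as a sketch and not as the wrapper described above, is to obtain the bounding box of every word on the page in a single pass and then sort the words into cells in Python. Where the word boxes come from is left open: one candidate is the `-bbox` option of pdftotext, assuming the installed poppler build provides it. The function name `assign_words_to_cells` and the tuple layouts are hypothetical.

```python
def assign_words_to_cells(words, cells):
    """Group words by the cell rectangle that contains their centre point.

    words: iterable of (x_min, y_min, x_max, y_max, text) tuples.
    cells: dict mapping a cell key to (left, top, right, bottom).
    All coordinates are assumed to share the same unit and origin.
    """
    contents = {key: [] for key in cells}
    for (x0, y0, x1, y1, text) in words:
        cx = (x0 + x1) / 2.0
        cy = (y0 + y1) / 2.0
        for key, (left, top, right, bottom) in cells.items():
            if left <= cx < right and top <= cy < bottom:
                contents[key].append(text)
                break  # a word belongs to at most one cell
    return {key: " ".join(parts) for key, parts in contents.items()}


cells = {(0, 0): (0, 0, 100, 20), (1, 0): (100, 0, 200, 20)}
words = [(5, 4, 40, 16, "Store"),
         (45, 4, 90, 16, "name"),
         (110, 4, 150, 16, "Sales")]
print(assign_words_to_cells(words, cells))
```

Words keep their input order inside each cell, so the caller would need to supply them in reading order. Words whose centre falls on a divider or outside every cell are dropped, which may or may not be the desired behaviour.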
59 | 60 | Blank row or column removal 61 | =========================== 62 | It shouldn't be too hard to notice when a complete row or column is 63 | empty, and remove it from the result. However a number of tuning 64 | options would be needed, including not removing empty rows/columns, 65 | not removing empty rows/columns in the middle of the table, ignoring white 66 | space, ignoring punctuation, etc. 67 | 68 | Delimiter thickness hints 69 | ========================= 70 | I should be able to record the relative thicknesses of the delimiters 71 | around a cell, so that later on it would be possible to extract 72 | table and heading boundaries for tables that use them in a detectable 73 | way. 74 | 75 | Better Hierarchical information 76 | =============================== 77 | I want to keep the cell location data structure flat (because 78 | ultimately the page is always flat), but I could include more 79 | information about cell relationships, and facilitate rebuilding a 80 | representative document object model downstream. I'd particularly 81 | like to be able to automatically separate two tables on the same 82 | page, and to auto-join a multi-page table. 83 | 84 | Miscellaneous 85 | ============= 86 | 87 | * An option to change the program called to extract text in each 88 | cell: currently it calls pdftotext, but it could easily be ocrad 89 | or any other pdf tool that can take a cropping rectangle.
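
One way such an option might look is a command template whose placeholders are filled in per cell. The sketch below is an assumption about how this could be structured, not existing behaviour: `build_extract_command` and the `PDFTOTEXT` template are hypothetical names, and only the pdftotext flags themselves are taken from the call already made in `core.py`.

```python
def build_extract_command(template, **values):
    """Fill every argument of a command template with the cropping values."""
    return [part.format(**values) for part in template]


# Same flags as the pdftotext call in core.py, expressed as an argument list.
PDFTOTEXT = ["pdftotext", "-r", "{res}", "-x", "{x}", "-y", "{y}",
             "-W", "{w}", "-H", "{h}", "-layout", "-nopgbrk",
             "-f", "{page}", "-l", "{page}", "{infile}", "-"]

cmd = build_extract_command(PDFTOTEXT, res=300, x=10, y=20, w=200, h=50,
                            page=1, infile="example.pdf")
print(cmd)
```

Because the result is an argument list, it can be handed to `subprocess.Popen` without `shell=True`, so the file name needs no shell quoting. Another tool would be supported by supplying a different template with the same placeholder names.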
90 | 91 | 92 | -------------------------------------------------------------------------------- /example/example.pdf: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/ashima/pdf-table-extract/7a04fc5a99b74aebecda208bf9680bb2cad2cc72/example/example.pdf -------------------------------------------------------------------------------- /example/test_to_pandas.py: -------------------------------------------------------------------------------- 1 | import pandas as pd 2 | import pdftableextract as pdf 3 | 4 | pages = ["1"] 5 | cells = [pdf.process_page("example.pdf",p) for p in pages] 6 | 7 | #flatten the cells structure 8 | cells = [item for sublist in cells for item in sublist ] 9 | 10 | #without any options, process_page picks up a blank table at the top of the page. 11 | #so choose table '1' 12 | li = pdf.table_to_list(cells, pages)[1] 13 | 14 | #li is a list of lists, the first line is the header, last is the footer (for this table only!) 
15 | #column '0' contains store names 16 | #row '1' contains column headings 17 | #data is row '2' through '-1' 18 | 19 | data = pd.DataFrame(li[2:-1], columns=li[1], index=[l[0] for l in li[2:-1]]) 20 | print(data) 21 | -------------------------------------------------------------------------------- /setup.cfg: -------------------------------------------------------------------------------- 1 | [metadata] 2 | description-file = README.md 3 | -------------------------------------------------------------------------------- /setup.py: -------------------------------------------------------------------------------- 1 | from setuptools import setup, find_packages 2 | import os 3 | 4 | here = os.path.abspath(os.path.dirname(__file__)) 5 | README = open(os.path.join(here, 'README.md')).read() 6 | #NEWS = open(os.path.join(here, 'NEWS.txt')).read() 7 | 8 | 9 | version = '0.1' 10 | 11 | install_requires = [ "numpy" ] 12 | 13 | 14 | setup(name='pdf-table-extract', 15 | version=version, 16 | description="Extract Tables from PDF files", 17 | long_description=README + '\n\n',# + NEWS, 18 | classifiers=[ 19 | # Get strings from http://pypi.python.org/pypi?%3Aaction=list_classifiers 20 | ], 21 | keywords='PDF, tables', 22 | author='Ian McEwan', 23 | author_email='ijm@ashimaresearch.com', 24 | url='ashimaresearch.com', 25 | license='MIT-Expat', 26 | packages=find_packages('src'), 27 | package_dir = {'': 'src'},include_package_data=True, 28 | zip_safe=False, 29 | install_requires=install_requires, 30 | entry_points={ 31 | 'console_scripts': 32 | ['pdf-table-extract=pdftableextract.scripts:main'] 33 | } 34 | ) 35 | -------------------------------------------------------------------------------- /src/pdftableextract/__init__.py: -------------------------------------------------------------------------------- 1 | # Example package with a console entry point 2 | from .core import process_page, output, table_to_list
-------------------------------------------------------------------------------- /src/pdftableextract/core.py: -------------------------------------------------------------------------------- 1 | import sys 2 | import os 3 | from numpy import array, fromstring, ones, zeros, uint8, diff, where, sum, delete 4 | import subprocess 5 | from pipes import quote 6 | from .pnm import readPNM, dumpImage 7 | import re 8 | from pipes import quote 9 | from xml.dom.minidom import getDOMImplementation 10 | import json 11 | import csv 12 | 13 | #----------------------------------------------------------------------- 14 | def check_for_required_executable(name,command): 15 | """Checks for an executable called 'name' by running 'command' and suppressing 16 | output. If the return code is non-zero or an OS error occurs, an Exception is raised""" 17 | try: 18 | with open(os.devnull, "w") as fnull: 19 | result=subprocess.check_call(command,stdout=fnull, stderr=fnull) 20 | except OSError as e: 21 | message = """Error running {0}. 22 | Command failed: {1} 23 | {2}""".format(name, " ".join(command), e) 24 | raise OSError(message) 25 | except subprocess.CalledProcessError as e: 26 | raise 27 | except Exception as e: 28 | raise 29 | 30 | #----------------------------------------------------------------------- 31 | def popen(name,command, *args, **kwargs): 32 | try: 33 | result=subprocess.Popen(command,*args, **kwargs) 34 | return result 35 | except OSError as e: 36 | message="""Error running {0}. Is it installed correctly?
37 | Error: {1}""".format(name, e) 38 | raise OSError(message) 39 | except Exception, e: 40 | raise 41 | 42 | def colinterp(a,x) : 43 | """Interpolates colors""" 44 | l = len(a)-1 45 | i = min(l, max(0, int (x * l))) 46 | (u,v) = a[i:i+2,:] 47 | return u - (u-v) * ((x * l) % 1.0) 48 | 49 | colarr = array([ [255,0,0],[255,255,0],[0,255,0],[0,255,255],[0,0,255] ]) 50 | 51 | def col(x, colmult=1.0) : 52 | """colors""" 53 | return colinterp(colarr,(colmult * x)% 1.0) / 2 54 | 55 | 56 | def process_page(infile, pgs, 57 | outfilename=None, 58 | greyscale_threshold=25, 59 | page=None, 60 | crop=None, 61 | line_length=0.17, 62 | bitmap_resolution=300, 63 | name=None, 64 | pad=2, 65 | white=None, 66 | black=None, 67 | bitmap=False, 68 | checkcrop=False, 69 | checklines=False, 70 | checkdivs=False, 71 | checkcells=False, 72 | whitespace="normalize", 73 | boxes=False) : 74 | 75 | outfile = open(outfilename,'w') if outfilename else sys.stdout 76 | page=page or [] 77 | (pg,frow,lrow) = (map(int,(pgs.split(":")))+[None,None])[0:3] 78 | #check that pdftoppdm exists by running a simple command 79 | check_for_required_executable("pdftoppm",["pdftoppm","-h"]) 80 | #end check 81 | 82 | p = popen("pdftoppm", ("pdftoppm -gray -r %d -f %d -l %d %s " % 83 | (bitmap_resolution,pg,pg,quote(infile))), 84 | stdin=subprocess.PIPE, stdout=subprocess.PIPE, shell=True ) 85 | 86 | #----------------------------------------------------------------------- 87 | # image load secion. 88 | 89 | (maxval, width, height, data) = readPNM(p.stdout) 90 | 91 | pad = int(pad) 92 | height+=pad*2 93 | width+=pad*2 94 | 95 | # reimbed image with a white padd. 96 | bmp = ones( (height,width) , dtype=bool ) 97 | bmp[pad:height-pad,pad:width-pad] = ( data[:,:] > int(255.0*greyscale_threshold/100.0) ) 98 | 99 | # Set up Debuging image. 
100 | img = zeros( (height,width,3) , dtype=uint8 ) 101 | img[:,:,0] = bmp*255 102 | img[:,:,1] = bmp*255 103 | img[:,:,2] = bmp*255 104 | 105 | #----------------------------------------------------------------------- 106 | # Find bounding box. 107 | t=0 108 | while t < height and sum(bmp[t,:]==0) == 0 : 109 | t=t+1 110 | if t > 0 : 111 | t=t-1 112 | 113 | b=height-1 114 | while b > t and sum(bmp[b,:]==0) == 0 : 115 | b=b-1 116 | if b < height-1: 117 | b = b+1 118 | 119 | l=0 120 | while l < width and sum(bmp[:,l]==0) == 0 : 121 | l=l+1 122 | if l > 0 : 123 | l=l-1 124 | 125 | r=width-1 126 | while r > l and sum(bmp[:,r]==0) == 0 : 127 | r=r-1 128 | if r < width-1 : 129 | r=r+1 130 | 131 | # Mark bounding box. 132 | bmp[t,:] = 0 133 | bmp[b,:] = 0 134 | bmp[:,l] = 0 135 | bmp[:,r] = 0 136 | 137 | def boxOfString(x,p) : 138 | s = x.split(":") 139 | if len(s) < 4 : 140 | raise ValueError("boxes have format left:top:right:bottom[:page]") 141 | return ([bitmap_resolution * float(x) + pad for x in s[0:4] ] 142 | + [ p if len(s)<5 else int(s[4]) ] ) 143 | 144 | 145 | # translate crop to paint white. 146 | whites = [] 147 | if crop : 148 | (l,t,r,b,p) = boxOfString(crop,pg) 149 | whites.extend( [ (0,0,l,height,p), (0,0,width,t,p), 150 | (r,0,width,height,p), (0,b,width,height,p) ] ) 151 | 152 | # paint white ... 153 | if white : 154 | whites.extend( [ boxOfString(b, pg) for b in white ] ) 155 | 156 | for (l,t,r,b,p) in whites : 157 | if p == pg : 158 | bmp[ t:b+1,l:r+1 ] = 1 159 | img[ t:b+1,l:r+1 ] = [255,255,255] 160 | 161 | # paint black ... 162 | if black : 163 | for b in black : 164 | (l,t,r,b) = [bitmap_resolution * float(x) + pad for x in b.split(":") ] 165 | bmp[ t:b+1,l:r+1 ] = 0 166 | img[ t:b+1,l:r+1 ] = [0,0,0] 167 | 168 | if checkcrop : 169 | dumpImage(outfile,bmp,img, bitmap, pad) 170 | return True 171 | 172 | #----------------------------------------------------------------------- 173 | # Line finding section. 
174 | # 175 | # Find all vertical or horizontal lines that are more than rlthresh 176 | # long, these are considered lines on the table grid. 177 | 178 | lthresh = int(line_length * bitmap_resolution) 179 | vs = zeros(width, dtype=int) 180 | for i in range(width) : 181 | dd = diff( where(bmp[:,i])[0] ) 182 | if len(dd)>0: 183 | v = max ( dd ) 184 | if v > lthresh : 185 | vs[i] = 1 186 | else: 187 | # it was a solid black line. 188 | if bmp[0,i] == 0 : 189 | vs[i] = 1 190 | vd= ( where(diff(vs[:]))[0] +1 ) 191 | 192 | hs = zeros(height, dtype=int) 193 | for j in range(height) : 194 | dd = diff( where(bmp[j,:]==1)[0] ) 195 | if len(dd) > 0 : 196 | h = max ( dd ) 197 | if h > lthresh : 198 | hs[j] = 1 199 | else: 200 | # it was a solid black line. 201 | if bmp[j,0] == 0 : 202 | hs[j] = 1 203 | hd=( where(diff(hs[:]==1))[0] +1 ) 204 | 205 | #----------------------------------------------------------------------- 206 | # Look for dividors that are too large. 207 | maxdiv=10 208 | i=0 209 | 210 | while i < len(vd) : 211 | if vd[i+1]-vd[i] > maxdiv : 212 | vd = delete(vd,i) 213 | vd = delete(vd,i) 214 | else: 215 | i=i+2 216 | 217 | j = 0 218 | while j < len(hd): 219 | if hd[j+1]-hd[j] > maxdiv : 220 | hd = delete(hd,j) 221 | hd = delete(hd,j) 222 | else: 223 | j=j+2 224 | 225 | if checklines : 226 | for i in vd : 227 | img[:,i] = [255,0,0] # red 228 | 229 | for j in hd : 230 | img[j,:] = [0,0,255] # blue 231 | dumpImage(outfile,bmp,img) 232 | return True 233 | #----------------------------------------------------------------------- 234 | # divider checking. 235 | # 236 | # at this point vd holds the x coordinate of vertical and 237 | # hd holds the y coordinate of horizontal divider tansitions for each 238 | # vertical and horizontal lines in the table grid. 239 | 240 | def isDiv(a, l,r,t,b) : 241 | # if any col or row (in axis) is all zeros ... 
242 | return sum( sum(bmp[t:b, l:r], axis=a)==0 ) >0 243 | 244 | if checkdivs : 245 | img = img / 2 246 | for j in range(0,len(hd),2): 247 | for i in range(0,len(vd),2): 248 | if i>0 : 249 | (l,r,t,b) = (vd[i-1], vd[i], hd[j], hd[j+1]) 250 | img[ t:b, l:r, 1 ] = 192 251 | if isDiv(1, l,r,t,b) : 252 | img[ t:b, l:r, 0 ] = 0 253 | img[ t:b, l:r, 2 ] = 255 254 | 255 | if j>0 : 256 | (l,r,t,b) = (vd[i], vd[i+1], hd[j-1], hd[j] ) 257 | img[ t:b, l:r, 1 ] = 128 258 | if isDiv(0, l,r,t,b) : 259 | img[ t:b, l:r, 0 ] = 255 260 | img[ t:b, l:r, 2 ] = 0 261 | dumpImage(outfile,bmp,img) 262 | return True 263 | #----------------------------------------------------------------------- 264 | # Cell finding section. 265 | # This algorithum is width hungry, and always generates rectangular 266 | # boxes. 267 | 268 | cells =[] 269 | touched = zeros( (len(hd), len(vd)),dtype=bool ) 270 | j = 0 271 | while j*2+2 < len (hd) : 272 | i = 0 273 | while i*2+2 < len(vd) : 274 | u = 1 275 | v = 1 276 | if not touched[j,i] : 277 | while 2+(i+u)*2 < len(vd) and \ 278 | not isDiv( 0, vd[ 2*(i+u) ], vd[ 2*(i+u)+1], 279 | hd[ 2*(j+v)-1 ], hd[ 2*(j+v) ] ): 280 | u=u+1 281 | bot = False 282 | while 2+(j+v)*2 < len(hd) and not bot : 283 | bot = False 284 | for k in range(1,u+1) : 285 | bot |= isDiv( 1, vd[ 2*(i+k)-1 ], vd[ 2*(i+k)], 286 | hd[ 2*(j+v) ], hd[ 2*(j+v)+1 ] ) 287 | if not bot : 288 | v=v+1 289 | cells.append( (i,j,u,v) ) 290 | touched[ j:j+v, i:i+u] = True 291 | i = i+1 292 | j=j+1 293 | 294 | 295 | if checkcells : 296 | nc = len(cells)+0. 297 | img = img / 2 298 | for k in range(len(cells)): 299 | (i,j,u,v) = cells[k] 300 | (l,r,t,b) = ( vd[2*i+1] , vd[ 2*(i+u) ], hd[2*j+1], hd[2*(j+v)] ) 301 | img[ t:b, l:r ] += col( k/nc ) 302 | dumpImage(outfile,bmp,img) 303 | return True 304 | 305 | #----------------------------------------------------------------------- 306 | # fork out to extract text for each cell. 
307 | 308 | whitespace_re = re.compile( r'\s+') # separate name so the 'whitespace' option is not shadowed 309 | 310 | def getCell( (i,j,u,v) ): 311 | (l,r,t,b) = ( vd[2*i+1] , vd[ 2*(i+u) ], hd[2*j+1], hd[2*(j+v)] ) 312 | p = popen("pdftotext", 313 | "pdftotext -r %d -x %d -y %d -W %d -H %d -layout -nopgbrk -f %d -l %d %s -" % (bitmap_resolution, l-pad, t-pad, r-l, b-t, pg, pg, quote(infile)), 314 | stdout=subprocess.PIPE, 315 | shell=True ) 316 | 317 | ret = p.communicate()[0] 318 | if whitespace != 'raw' : 319 | ret = whitespace_re.sub( "" if whitespace == "none" else " ", ret ) 320 | if len(ret) > 0 : 321 | ret = ret[ (1 if ret[0]==' ' else 0) : 322 | len(ret) - (1 if ret[-1]==' ' else 0) ] 323 | return (i,j,u,v,pg,ret) 324 | 325 | if boxes : 326 | cells = [ x + (pg,"",) for x in cells if 327 | ( frow is None or (x[1] >= frow and x[1] <= lrow)) ] 328 | else : 329 | #check that pdftotext exists by running a simple command 330 | check_for_required_executable("pdftotext",["pdftotext","-h"]) 331 | #end check 332 | cells = [ getCell(x) for x in cells if 333 | ( frow is None or (x[1] >= frow and x[1] <= lrow)) ] 334 | return cells 335 | 336 | #----------------------------------------------------------------------- 337 | #output section.
338 | 339 | def output(cells, pgs, 340 | cells_csv_filename=None, 341 | cells_json_filename=None, 342 | cells_xml_filename=None, 343 | table_csv_filename=None, 344 | table_html_filename=None, 345 | table_list_filename=None, 346 | infile=None, name=None, output_type=None 347 | ): 348 | 349 | output_types = [ 350 | dict(filename=cells_csv_filename, function=o_cells_csv), 351 | dict(filename=cells_json_filename, function=o_cells_json), 352 | dict(filename=cells_xml_filename, function=o_cells_xml), 353 | dict(filename=table_csv_filename, function=o_table_csv), 354 | dict(filename=table_html_filename, function=o_table_html), 355 | dict(filename=table_list_filename, function=o_table_list) 356 | ] 357 | 358 | for entry in output_types: 359 | if entry["filename"]: 360 | if entry["filename"] != sys.stdout: 361 | outfile = open(entry["filename"],'w') 362 | else: 363 | outfile = sys.stdout 364 | 365 | entry["function"](cells, pgs, 366 | outfile=outfile, 367 | name=name, 368 | infile=infile, 369 | output_type=output_type) 370 | 371 | if entry["filename"] != sys.stdout: 372 | outfile.close() 373 | 374 | def o_cells_csv(cells,pgs, outfile=None, name=None, infile=None, output_type=None) : 375 | outfile = outfile or sys.stdout 376 | csv.writer( outfile , dialect='excel' ).writerows(cells) 377 | 378 | def o_cells_json(cells,pgs, outfile=None, infile=None, name=None, output_type=None) : 379 | """Output JSON formatted cell data""" 380 | outfile = outfile or sys.stdout 381 | #defaults 382 | infile=infile or "" 383 | name=name or "" 384 | 385 | json.dump({ 386 | "src": infile, 387 | "name": name, 388 | "colnames": ( "x","y","width","height","page","contents" ), 389 | "cells":cells 390 | }, outfile) 391 | 392 | def o_cells_xml(cells,pgs, outfile=None,infile=None, name=None, output_type=None) : 393 | """Output XML formatted cell data""" 394 | outfile = outfile or sys.stdout 395 | #defaults 396 | infile=infile or "" 397 | name=name or "" 398 | 399 | doc = 
getDOMImplementation().createDocument(None,"table", None) 400 | root = doc.documentElement; 401 | if infile : 402 | root.setAttribute("src",infile) 403 | if name : 404 | root.setAttribute("name",name) 405 | for cl in cells : 406 | x = doc.createElement("cell") 407 | map(lambda(a): x.setAttribute(*a), zip("xywhp",map(str,cl))) 408 | if cl[5] != "" : 409 | x.appendChild( doc.createTextNode(cl[5]) ) 410 | root.appendChild(x) 411 | outfile.write( doc.toprettyxml() ) 412 | 413 | def table_to_list(cells,pgs) : 414 | """Return the cells as nested lists indexed as tab[page][row][column]""" 415 | l=[0,0,0] 416 | for (i,j,u,v,pg,value) in cells : 417 | r=[i,j,pg] 418 | l = [max(x) for x in zip(l,r)] 419 | 420 | tab = [ [ [ "" for x in range(l[0]+1) 421 | ] for x in range(l[1]+1) 422 | ] for x in range(l[2]+1) 423 | ] 424 | for (i,j,u,v,pg,value) in cells : 425 | tab[pg][j][i] = value 426 | 427 | return tab 428 | 429 | def o_table_csv(cells,pgs, outfile=None, name=None, infile=None, output_type=None) : 430 | """Output CSV formatted table""" 431 | outfile = outfile or sys.stdout 432 | tab=table_to_list(cells, pgs) 433 | for t in tab: 434 | csv.writer( outfile , dialect='excel' ).writerows(t) 435 | 436 | 437 | def o_table_list(cells,pgs, outfile=None, name=None, infile=None, output_type=None) : 438 | """Output list of lists""" 439 | outfile = outfile or sys.stdout 440 | tab = table_to_list(cells, pgs) 441 | outfile.write( repr(tab) + "\n" ) # write to the requested file, not always stdout 442 | 443 | def o_table_html(cells,pgs, outfile=None, output_type=None, name=None, infile=None) : 444 | """Output HTML formatted table""" 445 | outfile = outfile or sys.stdout 446 | oj = 0 447 | opg = 0 448 | doc = getDOMImplementation().createDocument(None,"table", None) 449 | root = doc.documentElement; 450 | if (output_type == "table_chtml" ): 451 | root.setAttribute("border","1") 452 | root.setAttribute("cellspacing","0") 453 | root.setAttribute("style","border-spacing:0") 454 | nc = len(cells) 455 | tr = None 456 | for k in range(nc): 457 | (i,j,u,v,pg,value) = cells[k] 458 | if j > oj or pg > opg: 459 | if pg > opg: 460 | s = "Name: 
" + name + ", " if name else "" 461 | root.appendChild( doc.createComment( s + 462 | ("Source: %s page %d." % (infile, pg) ))); 463 | if tr : 464 | root.appendChild(tr) 465 | tr = doc.createElement("tr") 466 | oj = j 467 | opg = pg 468 | td = doc.createElement("td") 469 | if value != "" : 470 | td.appendChild( doc.createTextNode(value) ) 471 | if u>1 : 472 | td.setAttribute("colspan",str(u)) 473 | if v>1 : 474 | td.setAttribute("rowspan",str(v)) 475 | if output_type == "table_chtml" : 476 | td.setAttribute("style", "background-color: #%02x%02x%02x" % 477 | tuple(128+col(k/(nc+0.)))) 478 | tr.appendChild(td) 479 | root.appendChild(tr) 480 | outfile.write( doc.toprettyxml() ) 481 | 482 | -------------------------------------------------------------------------------- /src/pdftableextract/extracttab.py: -------------------------------------------------------------------------------- 1 | # Description : PDF Table Extraction Utility 2 | # Author : Ian McEwan, Ashima Research. 3 | # Maintainer : ijm 4 | # Lastmod : 20130402 (ijm) 5 | # License : Copyright (C) 2011 Ashima Research. All rights reserved. 6 | # Distributed under the MIT Expat License. See LICENSE file. 7 | # https://github.com/ashima/pdf-table-extract 8 | 9 | import sys, argparse, subprocess, re, csv, json 10 | from numpy import * 11 | from pipes import quote 12 | from xml.dom.minidom import getDOMImplementation 13 | 14 | # Proccessing function. 15 | 16 | def process_page(pgs) : 17 | (pg,frow,lrow) = (map(int,(pgs.split(":")))+[None,None])[0:3] 18 | 19 | p = subprocess.Popen( ("pdftoppm -gray -r %d -f %d -l %d %s " % 20 | (args.r,pg,pg,quote(args.infile))), 21 | stdin=subprocess.PIPE, stdout=subprocess.PIPE, shell=True ) 22 | 23 | #----------------------------------------------------------------------- 24 | # image load secion. 25 | 26 | (maxval, width, height, data) = readPNM(p.stdout) 27 | 28 | pad = int(args.pad) 29 | height+=pad*2 30 | width+=pad*2 31 | 32 | # reimbed image with a white padd. 
33 | bmp = ones( (height,width) , dtype=bool ) 34 | bmp[pad:height-pad,pad:width-pad] = ( data[:,:] > int(255.0*args.g/100.0) ) 35 | 36 | # Set up Debuging image. 37 | img = zeros( (height,width,3) , dtype=uint8 ) 38 | img[:,:,0] = bmp*255 39 | img[:,:,1] = bmp*255 40 | img[:,:,2] = bmp*255 41 | 42 | #----------------------------------------------------------------------- 43 | # Find bounding box. 44 | 45 | t=0 46 | while t < height and sum(bmp[t,:]==0) == 0 : 47 | t=t+1 48 | if t > 0 : 49 | t=t-1 50 | 51 | b=height-1 52 | while b > t and sum(bmp[b,:]==0) == 0 : 53 | b=b-1 54 | if b < height-1: 55 | b = b+1 56 | 57 | l=0 58 | while l < width and sum(bmp[:,l]==0) == 0 : 59 | l=l+1 60 | if l > 0 : 61 | l=l-1 62 | 63 | r=width-1 64 | while r > l and sum(bmp[:,r]==0) == 0 : 65 | r=r-1 66 | if r < width-1 : 67 | r=r+1 68 | 69 | # Mark bounding box. 70 | bmp[t,:] = 0 71 | bmp[b,:] = 0 72 | bmp[:,l] = 0 73 | bmp[:,r] = 0 74 | 75 | def boxOfString(x,p) : 76 | s = x.split(":") 77 | if len(s) < 4 : 78 | raise Exception("boxes have format left:top:right:bottom[:page]") 79 | return ([args.r * float(x) + args.pad for x in s[0:4] ] 80 | + [ p if len(s)<5 else int(s[4]) ] ) 81 | 82 | 83 | # translate crop to paint white. 84 | whites = [] 85 | if args.crop : 86 | (l,t,r,b,p) = boxOfString(args.crop,pg) 87 | whites.extend( [ (0,0,l,height,p), (0,0,width,t,p), 88 | (r,0,width,height,p), (0,b,width,height,p) ] ) 89 | 90 | # paint white ... 91 | if args.white : 92 | whites.extend( [ boxOfString(b, pg) for b in args.white ] ) 93 | 94 | for (l,t,r,b,p) in whites : 95 | if p == pg : 96 | bmp[ t:b+1,l:r+1 ] = 1 97 | img[ t:b+1,l:r+1 ] = [255,255,255] 98 | 99 | # paint black ... 
100 | if args.black : 101 | for b in args.black : 102 | (l,t,r,b) = [args.r * float(x) + args.pad for x in b.split(":") ] 103 | bmp[ t:b+1,l:r+1 ] = 0 104 | img[ t:b+1,l:r+1 ] = [0,0,0] 105 | 106 | if args.checkcrop : 107 | dumpImage(args,bmp,img) 108 | sys.exit(0) 109 | 110 | 111 | #----------------------------------------------------------------------- 112 | # Line finding section. 113 | # 114 | # Find all verticle or horizontal lines that are more than rlthresh 115 | # long, these are considered lines on the table grid. 116 | 117 | lthresh = int(args.l * args.r) 118 | vs = zeros(width, dtype=int) 119 | for i in range(width) : 120 | dd = diff( where(bmp[:,i])[0] ) 121 | if len(dd)>0: 122 | v = max ( dd ) 123 | if v > lthresh : 124 | vs[i] = 1 125 | else: 126 | # it was a solid black line. 127 | if bmp[0,i] == 0 : 128 | vs[i] = 1 129 | vd= ( where(diff(vs[:]))[0] +1 ) 130 | 131 | hs = zeros(height, dtype=int) 132 | for j in range(height) : 133 | dd = diff( where(bmp[j,:]==1)[0] ) 134 | if len(dd) > 0 : 135 | h = max ( dd ) 136 | if h > lthresh : 137 | hs[j] = 1 138 | else: 139 | # it was a solid black line. 140 | if bmp[j,0] == 0 : 141 | hs[j] = 1 142 | hd=( where(diff(hs[:]==1))[0] +1 ) 143 | 144 | #----------------------------------------------------------------------- 145 | # Look for dividors that are too large. 146 | 147 | maxdiv=10 148 | i=0 149 | 150 | while i < len(vd) : 151 | if vd[i+1]-vd[i] > maxdiv : 152 | vd = delete(vd,i) 153 | vd = delete(vd,i) 154 | else: 155 | i=i+2 156 | 157 | j = 0 158 | while j < len(hd): 159 | if hd[j+1]-hd[j] > maxdiv : 160 | hd = delete(hd,j) 161 | hd = delete(hd,j) 162 | else: 163 | j=j+2 164 | 165 | if args.checklines : 166 | for i in vd : 167 | img[:,i] = [255,0,0] # red 168 | 169 | for j in hd : 170 | img[j,:] = [0,0,255] # blue 171 | dumpImage(args,bmp,img) 172 | sys.exit(0) 173 | 174 | #----------------------------------------------------------------------- 175 | # divider checking. 
176 | # 177 | # at this point vd holds the x coordinate of vertical and 178 | # hd holds the y coordinate of horizontal divider tansitions for each 179 | # vertical and horizontal lines in the table grid. 180 | 181 | def isDiv(a, l,r,t,b) : 182 | # if any col or row (in axis) is all zeros ... 183 | return sum( sum(bmp[t:b, l:r], axis=a)==0 ) >0 184 | 185 | if args.checkdivs : 186 | img = img / 2 187 | for j in range(0,len(hd),2): 188 | for i in range(0,len(vd),2): 189 | if i>0 : 190 | (l,r,t,b) = (vd[i-1], vd[i], hd[j], hd[j+1]) 191 | img[ t:b, l:r, 1 ] = 192 192 | if isDiv(1, l,r,t,b) : 193 | img[ t:b, l:r, 0 ] = 0 194 | img[ t:b, l:r, 2 ] = 255 195 | 196 | if j>0 : 197 | (l,r,t,b) = (vd[i], vd[i+1], hd[j-1], hd[j] ) 198 | img[ t:b, l:r, 1 ] = 128 199 | if isDiv(0, l,r,t,b) : 200 | img[ t:b, l:r, 0 ] = 255 201 | img[ t:b, l:r, 2 ] = 0 202 | 203 | dumpImage(args,bmp,img) 204 | sys.exit(0) 205 | 206 | #----------------------------------------------------------------------- 207 | # Cell finding section. 208 | # This algorithum is width hungry, and always generates rectangular 209 | # boxes. 210 | 211 | cells =[] 212 | touched = zeros( (len(hd), len(vd)),dtype=bool ) 213 | j = 0 214 | while j*2+2 < len (hd) : 215 | i = 0 216 | while i*2+2 < len(vd) : 217 | u = 1 218 | v = 1 219 | if not touched[j,i] : 220 | while 2+(i+u)*2 < len(vd) and \ 221 | not isDiv( 0, vd[ 2*(i+u) ], vd[ 2*(i+u)+1], 222 | hd[ 2*(j+v)-1 ], hd[ 2*(j+v) ] ): 223 | u=u+1 224 | bot = False 225 | while 2+(j+v)*2 < len(hd) and not bot : 226 | bot = False 227 | for k in range(1,u+1) : 228 | bot |= isDiv( 1, vd[ 2*(i+k)-1 ], vd[ 2*(i+k)], 229 | hd[ 2*(j+v) ], hd[ 2*(j+v)+1 ] ) 230 | if not bot : 231 | v=v+1 232 | cells.append( (i,j,u,v) ) 233 | touched[ j:j+v, i:i+u] = True 234 | i = i+1 235 | j=j+1 236 | 237 | 238 | if args.checkcells : 239 | nc = len(cells)+0. 
240 |     img = img / 2
241 |     for k in range(len(cells)):
242 |       (i,j,u,v) = cells[k]
243 |       (l,r,t,b) = ( vd[2*i+1] , vd[ 2*(i+u) ], hd[2*j+1], hd[2*(j+v)] )
244 |       img[ t:b, l:r ] += col( k/nc )
245 |     dumpImage(args,bmp,img)
246 |     sys.exit(0)
247 |
248 |
249 | #-----------------------------------------------------------------------
250 | # fork out to extract text for each cell.
251 |
252 |   whitespace = re.compile( r'\s+')
253 |
254 |   def getCell( (i,j,u,v) ):
255 |     (l,r,t,b) = ( vd[2*i+1] , vd[ 2*(i+u) ], hd[2*j+1], hd[2*(j+v)] )
256 |     p = subprocess.Popen(
257 |       ("pdftotext -r %d -x %d -y %d -W %d -H %d -layout -nopgbrk -f %d -l %d %s -"
258 |            % (args.r, l-pad, t-pad, r-l, b-t, pg, pg, quote(args.infile) ) ),
259 |       stdout=subprocess.PIPE, shell=True )
260 |
261 |     ret = p.communicate()[0]
262 |     if args.w != 'raw' :
263 |       ret = whitespace.sub( "" if args.w == "none" else " ", ret )
264 |       if len(ret) > 0 :
265 |         ret = ret[ (1 if ret[0]==' ' else 0) :
266 |                    len(ret) - (1 if ret[-1]==' ' else 0) ]
267 |     return (i,j,u,v,pg,ret)
268 |
269 |   #if args.boxes :
270 |   #  cells = [ x + (pg,"",) for x in cells ]
271 |   #else :
272 |   #  cells = map(getCell, cells)
273 |
274 |   if args.boxes :
275 |     cells = [ x + (pg,"",) for x in cells if
276 |               ( frow == None or (x[1] >= frow and x[1] <= lrow)) ]
277 |   else :
278 |     cells = [ getCell(x) for x in cells if
279 |               ( frow == None or (x[1] >= frow and x[1] <= lrow)) ]
280 |   return cells
281 |
282 |
283 | #-----------------------------------------------------------------------
284 | # main
285 |
286 | def main_script():
287 |   args = procargs()
288 |
289 |   cells = []
290 |   for pgs in args.page :
291 |     cells.extend(process_page(pgs))
292 |
293 |   { "cells_csv" : o_cells_csv, "cells_json" : o_cells_json,
294 |     "cells_xml" : o_cells_xml, "table_csv" : o_table_csv,
295 |     "table_html": o_table_html, "table_chtml": o_table_html,
296 |   } [ args.t ](cells,args.page)
297 |
298 |
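The line-finding and divider-pruning steps in the listing above are easy to lose in the NumPy indexing. What follows is a minimal pure-Python sketch of the same idea, not the module's own code: `transitions` and `drop_wide_dividers` are hypothetical helper names introduced here for illustration. `transitions` mirrors `where(diff(vs))[0] + 1`, and `drop_wide_dividers` mirrors the `maxdiv` loop. The original pairwise loop assumes an even number of transitions; this sketch simply ignores an unpaired trailing edge, which is an assumption on my part rather than the script's behaviour.

```python
def transitions(flags):
    """Return every index i where flags[i] differs from flags[i - 1]."""
    return [i for i in range(1, len(flags)) if flags[i] != flags[i - 1]]


def drop_wide_dividers(edges, maxdiv=10):
    """Keep (start, end) edge pairs that are at most maxdiv pixels wide."""
    kept = []
    # Walk the edges two at a time; an unpaired trailing edge is skipped.
    for k in range(0, len(edges) - 1, 2):
        start, end = edges[k], edges[k + 1]
        if end - start <= maxdiv:
            kept.extend((start, end))
    return kept


# 1 marks a pixel column that was classified as part of a grid line.
flags = [0, 0, 1, 1, 0, 0, 0] + [1] * 15 + [0, 0]
edges = transitions(flags)
print(edges)
print(drop_wide_dividers(edges))
```

Here the two-pixel run survives as a divider, while the fifteen-pixel run is wider than `maxdiv` and is discarded, which is the effect of the paired `delete` calls in the script.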
--------------------------------------------------------------------------------
/src/pdftableextract/pnm.py:
--------------------------------------------------------------------------------
1 | from numpy import array, fromstring, uint8, reshape, ones
2 | #-----------------------------------------------------------------------
3 | # PNM stuff.
4 |
5 | def noncomment(fd):
6 |   """Read lines from the filehandle until a non-comment line is found.
7 |   Comments start with #"""
8 |   while True:
9 |     x = fd.readline()
10 |     if x.startswith('#') :
11 |       continue
12 |     else:
13 |       return x
14 |
15 | def readPNM(fd):
16 |   """Reads the PNM file from the filehandle"""
17 |   t = noncomment(fd)
18 |   s = noncomment(fd)
19 |   m = noncomment(fd) if not (t.startswith('P1') or t.startswith('P4')) else '1'
20 |   data = fd.read()
21 |   ls = len(s.split())
22 |   if ls != 2 :
23 |     name = "" if fd.name=="" else "Filename = {0}".format(fd.name)
24 |     raise IOError("Expected 2 elements from parsing PNM file, got {0}: {1}".format(ls, name))
25 |   xs, ys = s.split()
26 |   width = int(xs)
27 |   height = int(ys)
28 |   m = int(m)
29 |
30 |   if m != 255 :
31 |     print "Just want 8 bit pgms for now!"
32 |
33 |   d = fromstring(data,dtype=uint8)
34 |   d = reshape(d, (height,width) )
35 |   return (m,width,height, d)
36 |
37 | def writePNM(fd,img):
38 |   """Writes a PNM file to a filehandle given the img data as a numpy array"""
39 |   s = img.shape
40 |   m = 255
41 |   if img.dtype == bool :
42 |     img = img + uint8(0)
43 |     t = "P5"
44 |     m = 1
45 |   elif len(s) == 2 :
46 |     t = "P5"
47 |   else:
48 |     t = "P6"
49 |
50 |   fd.write( "%s\n%d %d\n%d\n" % (t, s[1],s[0],m) )
51 |   fd.write( uint8(img).tostring() )
52 |
53 |
54 | def dumpImage(outfile,bmp,img,bitmap=False, pad=2) :
55 |   """Dumps the numpy array in image into the filename and closes the outfile"""
56 |   oi = bmp if bitmap else img
57 |   (height,width) = bmp.shape
58 |   writePNM(outfile, oi[pad:height-pad, pad:width-pad])
59 |   outfile.close()
60 |
--------------------------------------------------------------------------------
/src/pdftableextract/scripts.py:
--------------------------------------------------------------------------------
1 | import argparse
2 | import sys
3 | import logging
4 | import subprocess
5 | from .core import process_page, output
6 | import core
7 |
8 | #-----------------------------------------------------------------------
9 |
10 | def procargs() :
11 |   p = argparse.ArgumentParser( description="Finds tables in a PDF page.")
12 |   p.add_argument("-i", dest='infile', help="input file" )
13 |   p.add_argument("-o", dest='outfile', help="output file", default=None,
14 |     type=str)
15 |   p.add_argument("--greyscale_threshold","-g", help="grayscale threshold (%%)", type=int, default=25 )
16 |   p.add_argument("-p", type=str, dest='page', required=True, action="append",
17 |     help="a page in the PDF to process, as page[:firstrow:lastrow]." )
18 |   p.add_argument("-c", type=str, dest='crop',
19 |     help="crop to left:top:right:bottom. Paints white outside this "
20 |          "rectangle." )
21 |   p.add_argument("--line_length", "-l", type=float, default=0.17 ,
22 |     help="line length threshold (length)" )
23 |   p.add_argument("--bitmap_resolution", "-r", type=int, default=300,
24 |     help="resolution of internal bitmap (dots per length unit)" )
25 |   p.add_argument("-name", help="name to add to XML tag, or HTML comments")
26 |   p.add_argument("-pad", help="initial image padding (pixels)", type=int,
27 |     default=2 )
28 |   p.add_argument("-white",action="append",
29 |     help="paint white to the bitmap as left:top:right:bottom in length units. "
30 |          "Done before painting black" )
31 |   p.add_argument("-black",action="append",
32 |     help="paint black to the bitmap as left:top:right:bottom in length units. "
33 |          "Done after painting white" )
34 |   p.add_argument("-bitmap", action="store_true",
35 |     help = "Dump working bitmap not debugging image." )
36 |   p.add_argument("-checkcrop", action="store_true",
37 |     help = "Stop after finding cropping rectangle, and output debugging "
38 |            "image (use -bitmap).")
39 |   p.add_argument("-checklines", action="store_true",
40 |     help = "Stop after finding lines, and output debugging image." )
41 |   p.add_argument("-checkdivs", action="store_true",
42 |     help = "Stop after finding dividers, and output debugging image." )
43 |   p.add_argument("-checkcells", action="store_true",
44 |     help = "Stop after finding cells, and output debugging image." )
45 |   p.add_argument("-colmult", type=float, default=1.0,
46 |     help = "color cycling multiplier for checkcells and chtml" )
47 |   p.add_argument("-boxes", action="store_true",
48 |     help = "Just output cell corners, don't send cells to pdftotext." )
49 |   p.add_argument("-t", choices=['cells_csv','cells_json','cells_xml',
50 |     'table_csv','table_html','table_chtml','table_list'],
51 |     default="cells_xml",
52 |     help = "output type (table_chtml is colorized like '-checkcells') "
53 |            "(default cells_xml)" )
54 |   p.add_argument("--whitespace","-w", choices=['none','normalize','raw'], default="normalize",
55 |     help = "What to do with whitespace in cells. none = remove it all, "
56 |            "normalize (default) = any whitespace (including CRLF) replaced "
57 |            "with a single space, raw = do nothing." )
58 |   p.add_argument("--traceback","--backtrace","-tb","-bt",action="store_true")
59 |   return p.parse_args()
60 |
61 | def main():
62 |     try:
63 |         args = procargs()
64 |         imain(args)
65 |     except IOError as e:
66 |         if args.traceback:
67 |             raise
68 |         sys.exit("I/O Error running pdf-table-extract: {0}".format(e))
69 |     except OSError as e:
70 |         print("An OS Error occurred running pdf-table-extract: Is `pdftoppm` installed and available?")
71 |         if args.traceback:
72 |             raise
73 |         sys.exit("OS Error: {0}".format(e))
74 |     except subprocess.CalledProcessError as e:
75 |         if args.traceback:
76 |             raise
77 |         sys.exit("Error while checking a subprocess call: {0}".format(e))
78 |     except Exception as e:
79 |         if args.traceback:
80 |             raise
81 |         sys.exit(e)
82 |
83 | def imain(args):
84 |     cells = []
85 |     if args.checkcrop or args.checklines or args.checkdivs or args.checkcells:
86 |         for pgs in args.page :
87 |             success = process_page(args.infile, pgs,
88 |                 bitmap=args.bitmap,
89 |                 checkcrop=args.checkcrop,
90 |                 checklines=args.checklines,
91 |                 checkdivs=args.checkdivs,
92 |                 checkcells=args.checkcells,
93 |                 whitespace=args.whitespace,
94 |                 boxes=args.boxes,
95 |                 greyscale_threshold=args.greyscale_threshold,
96 |                 page=args.page,
97 |                 crop=args.crop,
98 |                 line_length=args.line_length,
99 |                 bitmap_resolution=args.bitmap_resolution,
100 |                 name=args.name,
101 |                 pad=args.pad,
102 |                 white=args.white,
103 |                 black=args.black, outfilename=args.outfile)
104 |
105 |     else:
106 |         for pgs in args.page :
107 |             cells.extend(process_page(args.infile, pgs,
108 |                 bitmap=args.bitmap,
109 |                 checkcrop=args.checkcrop,
110 |                 checklines=args.checklines,
111 |                 checkdivs=args.checkdivs,
112 |                 checkcells=args.checkcells,
113 |                 whitespace=args.whitespace,
114 |                 boxes=args.boxes,
115 |                 greyscale_threshold=args.greyscale_threshold,
116 |                 page=args.page,
117 |                 crop=args.crop,
118 |                 line_length=args.line_length,
119 |                 bitmap_resolution=args.bitmap_resolution,
120 |                 name=args.name,
121 |                 pad=args.pad,
122 |                 white=args.white,
123 |                 black=args.black))
124 |
125 |         filenames = dict()
126 |         if args.outfile is None:
127 |             args.outfile = sys.stdout
128 |         filenames["{0}_filename".format(args.t)] = args.outfile
129 |         output(cells, args.page, name=args.name, infile=args.infile, output_type=args.t, **filenames)
130 |
131 |
132 |
133 |
--------------------------------------------------------------------------------
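The `-p` option above documents its argument as `page[:firstrow:lastrow]`, and the cell filter in `extracttab.py` keeps a cell when `frow == None or (x[1] >= frow and x[1] <= lrow)`. As a rough illustration of how such a spec could be split and applied, here is a small sketch. `parse_page_spec` and `row_selected` are hypothetical names invented for this example; they are not functions of this package, and rejecting a two-part spec such as `3:1` is my assumption, not documented behaviour.

```python
def parse_page_spec(spec):
    """Split 'page' or 'page:firstrow:lastrow' into (page, first, last)."""
    parts = [int(p) for p in spec.split(":")]
    if len(parts) == 1:
        # No row range given: every row on the page is selected.
        return parts[0], None, None
    if len(parts) == 3:
        return parts[0], parts[1], parts[2]
    raise ValueError("expected page or page:firstrow:lastrow, got %r" % spec)


def row_selected(row, first_row, last_row):
    """Inclusive row filter, equivalent to the frow/lrow test in the listing."""
    return first_row is None or first_row <= row <= last_row


page, first_row, last_row = parse_page_spec("3:1:5")
rows = [r for r in range(8) if row_selected(r, first_row, last_row)]
print(page, rows)
```

Because `-p` uses `action="append"`, the option can be repeated, and each value would be parsed independently in this way.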