├── Dockerfile
├── README.md
├── app
    ├── __pycache__
    │   └── pdfapi.cpython-38.pyc
    ├── data
    │   ├── sample_doc_1.pdf
    │   ├── sample_doc_2.pdf
    │   └── sample_doc_3.pdf
    ├── format_converter.py
    ├── output.pdf
    └── pdfapi.py
├── docker-compose.yml
├── requirements.txt
└── screenshots
    ├── doc_3.JPG
    ├── get_doc_list.JPG
    ├── parse_doc_2.JPG
    └── parse_doc_3.JPG

/Dockerfile:
--------------------------------------------------------------------------------
FROM python:3.8

# Copy the dependency list into the image before installing from it.
COPY requirements.txt ./
RUN pip install -r requirements.txt

# Keep the package directory so that "app.pdfapi:app" resolves from the working directory.
COPY ./app ./app

CMD ["uvicorn", "app.pdfapi:app", "--host", "0.0.0.0", "--port", "8000", "--reload"]

--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
# fastapi_pdfextractor ![GitHub top language](https://img.shields.io/github/languages/top/soham-1/fastapi_pdfextractor?color=%2300ff00)
A simple API built with [fastapi](https://pypi.org/project/fastapi/) that extracts the text content of PDFs using [pdfminer](https://pypi.org/project/pdfminer/).
Several PDF parsers were tried, including pypdf2 and pdfminer; pdfminer gave the better results. For OCR support, first install [tesseract](https://github.com/UB-Mannheim/tesseract/wiki) and [ghostscript](https://www.ghostscript.com/download/gsdnld.html), as both are required dependencies for the code to work.
Try out and compare the output of pdfminer and tika through the API endpoints. The results are available in the API response and in the app/results directory.

Note: if tesseract is installed somewhere other than the default location, change the path accordingly in the pdfapi.py file.

## Clone project
```
git clone https://github.com/soham-1/fastapi_pdfextractor.git
```

## Run locally
### Install dependencies
```
pip install -r requirements.txt
```

### Run Server
Run the server from the project root, because pdfapi.py uses a package-relative import and paths relative to the root (such as app/data):
```
uvicorn app.pdfapi:app --host 0.0.0.0 --port 8000 --reload
```

## Run on Docker
```
docker-compose up -d --build
```

### Stop the container using
```
docker-compose stop fast_api
```

### Restart it using
```
docker-compose up -d
```

## Documentation
This API has the following endpoints. Wherever a `page_no` is accepted, it is zero-based.
* #### ```/get_doc_list``` - returns a list of all the available PDFs
* #### ```/parse/{doc_name}``` - returns the metadata and text content of a PDF. Available PDFs are sample_doc_1, sample_doc_2 and sample_doc_3
* #### ```/pdfminer_text/{doc}``` - returns text output of a PDF using the pdfminer library
* #### ```/pdfminer_text/{doc}/{page_no}``` - returns text output of the specified page_no
* #### ```/tika_text/{doc}``` - returns text output of a PDF using the py-tika library
* #### ```/pdfminer_xml/{doc}``` - returns xml output
* #### ```/pdfminer_xml/{doc}/{page_no}``` - returns xml output of the specified page_no
* #### ```/pdfminer_html/{doc}``` - returns html output
* #### ```/pdfminer_html/{doc}/{page_no}``` - returns html output of the specified page_no
* #### ```/pdfminer_html_char/{doc}``` - returns character level html output
* #### ```/pdfminer_html_char/{doc}/{page_no}``` - returns character level html output of the specified page_no

### text pdf
![get_doc_list](/screenshots/get_doc_list.JPG)

### output
parse doc

### pdf with scanned image
parse doc

### output
![get_doc_list](/screenshots/parse_doc_3.JPG)

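The endpoints listed above can be exercised from a small client. The following is a minimal sketch using only the Python standard library; it assumes the server is reachable at `http://localhost:8000` (the port used in the run commands above), and the helper names `endpoint_url`, `fetch_json` and `compare_extractors` are hypothetical, not part of this project.

```python
import json
from typing import Optional
from urllib.parse import quote
from urllib.request import urlopen

# Assumption: the server was started as described above and listens on port 8000.
BASE_URL = "http://localhost:8000"


def endpoint_url(endpoint: str, doc: Optional[str] = None,
                 page_no: Optional[int] = None, base_url: str = BASE_URL) -> str:
    """Build the URL for an endpoint, e.g. /pdfminer_text/{doc}/{page_no}."""
    parts = [endpoint.strip("/")]
    if doc is not None:
        # Percent-encode the document name so spaces or slashes cannot break the path.
        parts.append(quote(doc, safe=""))
    if page_no is not None:
        # page_no is zero-based, so 0 is a valid value and must not be skipped.
        parts.append(str(page_no))
    return base_url.rstrip("/") + "/" + "/".join(parts)


def fetch_json(url: str, timeout: float = 60.0):
    """Send a GET request and decode the JSON body that the endpoints return."""
    with urlopen(url, timeout=timeout) as resp:
        return json.loads(resp.read().decode("utf-8"))


def compare_extractors(doc: str) -> dict:
    """Fetch the pdfminer and tika text for one document and report their lengths."""
    pdfminer_text = fetch_json(endpoint_url("pdfminer_text", doc))
    tika_text = fetch_json(endpoint_url("tika_text", doc))
    return {"pdfminer_chars": len(pdfminer_text), "tika_chars": len(tika_text)}
```

With the server running, `fetch_json(endpoint_url("get_doc_list"))` should return the document list, and `compare_extractors("sample_doc_1")` gives a rough size comparison of the two extractors for that document.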
--------------------------------------------------------------------------------
/app/__pycache__/pdfapi.cpython-38.pyc:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/soham-1/fastapi_pdfextractor/14588dca8ca18a0e7bbdf6be5b6843399700eda1/app/__pycache__/pdfapi.cpython-38.pyc
--------------------------------------------------------------------------------
/app/data/sample_doc_1.pdf:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/soham-1/fastapi_pdfextractor/14588dca8ca18a0e7bbdf6be5b6843399700eda1/app/data/sample_doc_1.pdf
--------------------------------------------------------------------------------
/app/data/sample_doc_2.pdf:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/soham-1/fastapi_pdfextractor/14588dca8ca18a0e7bbdf6be5b6843399700eda1/app/data/sample_doc_2.pdf
--------------------------------------------------------------------------------
/app/data/sample_doc_3.pdf:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/soham-1/fastapi_pdfextractor/14588dca8ca18a0e7bbdf6be5b6843399700eda1/app/data/sample_doc_3.pdf
--------------------------------------------------------------------------------
/app/format_converter.py:
--------------------------------------------------------------------------------
import os
from io import BytesIO

from bs4 import BeautifulSoup
from pdfminer.converter import TextConverter, XMLConverter, HTMLConverter
from pdfminer.layout import LAParams
from pdfminer.pdfinterp import PDFResourceManager, PDFPageInterpreter
from pdfminer.pdfpage import PDFPage
from tika import parser

DATA_DIR = "app/data"
RESULTS_DIR = "app/results"
# Every converter below writes its output here, so make sure the directory exists.
os.makedirs(RESULTS_DIR, exist_ok=True)


def pdfminer_to_text(input_file: str, page_no: int = None):
    """Extract plain text from a PDF, or from one zero-based page, using pdfminer."""
    file_loc = f"{DATA_DIR}/{input_file}.pdf"
    outfp = BytesIO()
    rsrcmgr = PDFResourceManager()
    laparams = LAParams(line_overlap=.5, char_margin=60, line_margin=.5, word_margin=1, boxes_flow=.6,
                        detect_vertical=True, all_texts=False)
    # laparams = LAParams()  # default configurations
    device = TextConverter(rsrcmgr, outfp, 'utf-8', laparams=laparams)
    interpreter = PDFPageInterpreter(rsrcmgr, device)
    text = ""
    with open(file_loc, 'rb') as fp:
        for page_index, page in enumerate(PDFPage.get_pages(fp)):
            if page_no is None or page_index == page_no:
                interpreter.process_page(page)
                text += outfp.getvalue().decode()
                # Reset the buffer so earlier pages are not appended a second time.
                outfp.seek(0)
                outfp.truncate(0)
            # Compare against None so that page 0 also stops the loop early.
            if page_no is not None and page_index == page_no:
                break
    if page_no is None:
        output_file = f'{RESULTS_DIR}/pdfminer_{input_file}.txt'
    else:
        output_file = f'{RESULTS_DIR}/pdfminer_{input_file}_page{page_no}.txt'
    with open(output_file, 'w', encoding='utf-8', errors='ignore') as f:
        f.write(text)
    return text


def tika_to_text(input_file):
    """Extract plain text from a PDF using tika."""
    parsed_pdf = parser.from_file(f"{DATA_DIR}/{input_file}.pdf")
    # tika returns None as content when it finds no text.
    text = parsed_pdf["content"] or ""
    output_file = f'{RESULTS_DIR}/tika_{input_file}.txt'
    with open(output_file, 'w', encoding='utf-8', errors='ignore') as f:
        f.write(text)
    return text


def pdfminer_to_xml(input_file: str, page_no: int = None):
    """Convert a PDF, or one zero-based page, to pdfminer's XML layout format."""
    file_loc = f"{DATA_DIR}/{input_file}.pdf"
    outfp = BytesIO()
    rsrcmgr = PDFResourceManager()
    laparams = LAParams()
    device = XMLConverter(rsrcmgr, outfp, 'utf-8', laparams=laparams)
    interpreter = PDFPageInterpreter(rsrcmgr, device)
    text = ""
    with open(file_loc, 'rb') as fp:
        for page_index, page in enumerate(PDFPage.get_pages(fp)):
            if page_no is None or page_index == page_no:
                interpreter.process_page(page)
                text += outfp.getvalue().decode()
                outfp.seek(0)
                outfp.truncate(0)
            if page_no is not None and page_index == page_no:
                break
    if page_no is None:
        output_file = f'{RESULTS_DIR}/pdfminer_{input_file}.xml'
    else:
        output_file = f'{RESULTS_DIR}/pdfminer_{input_file}_page{page_no}.xml'
    with open(output_file, 'w', encoding='utf-8', errors='ignore') as f:
        f.write(text)
    return text


def pdfminer_to_html(input_file: str, page_no: int = None):
    """Convert a PDF, or one zero-based page, to HTML and write one file per page."""
    file_loc = f"{DATA_DIR}/{input_file}.pdf"
    outfp = BytesIO()
    rsrcmgr = PDFResourceManager()
    laparams = LAParams(line_overlap=.5, char_margin=60, line_margin=.5, word_margin=1, boxes_flow=.6,
                        detect_vertical=True, all_texts=False)
    # laparams = LAParams()
    device = HTMLConverter(rsrcmgr, outfp, 'utf-8', laparams=laparams)
    interpreter = PDFPageInterpreter(rsrcmgr, device)
    text = ""
    with open(file_loc, 'rb') as fp:
        for page_index, page in enumerate(PDFPage.get_pages(fp)):
            if page_no is None or page_index == page_no:
                interpreter.process_page(page)
                retstr = outfp.getvalue().decode()
                text += retstr
                output_file = f'{RESULTS_DIR}/pdfminer_{input_file}_page{page_index}.html'
                with open(output_file, 'w', encoding='utf-8', errors='ignore') as f:
                    f.write(retstr)
                outfp.seek(0)
                outfp.truncate(0)
            if page_no is not None and page_index == page_no:
                break
    return text


def pdfminer_to_html_char_level(input_file: str, page_no: int = None):
    """Build absolutely positioned HTML from the bounding boxes in the XML output."""
    xml = pdfminer_to_xml(input_file, page_no)
    # Assumption: the markup of this template was lost in extraction and only the
    # title text "Document" survived. This is a minimal reconstructed skeleton;
    # the original <style> rules are unknown and therefore omitted.
    body_start = """<!DOCTYPE html>
<html lang="en">
<head>
    <meta charset="UTF-8">
    <title>Document</title>
</head>
<body>\n"""

    body_end = """\n</body>
</html>"""

    source = BeautifulSoup(xml, 'xml')
    pages = source.find_all("page")
    text_val = ""
    for page_index, page in enumerate(pages):
        output_file = f'{RESULTS_DIR}/pdfminer_char_{input_file}_page{page_index}.html'
        with open(output_file, 'w', encoding='utf-8', errors='ignore') as f:
            f.write(body_start)
            page_height = page["bbox"].split(',')[3]
            textbox = page.find_all("text")
            for text in textbox:
                if "bbox" in text.attrs:
                    left, bottom, right, top = text.attrs['bbox'].split(',')
                    # Subset fonts are named "PREFIX+Name"; take the last part so
                    # fonts without a prefix do not raise an IndexError.
                    font = text.attrs['font'].split('+')[-1]
                    style = (f"display:inline; position: absolute; left:{left}px; "
                             f"top:{round(float(page_height)-float(top),3)}px; "
                             f"font-size: {round(float(top)-float(bottom),3)}px; font-family: {font};")
                    # Assumption: the wrapping tag was stripped in extraction; a div
                    # carrying the computed style is reconstructed here.
                    div = f'\n<div style="{style}">{text.text}</div>'
                    text_val += div
                    f.write(div)
            f.write(body_end)
    return body_start + text_val + body_end

--------------------------------------------------------------------------------
/app/output.pdf:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/soham-1/fastapi_pdfextractor/14588dca8ca18a0e7bbdf6be5b6843399700eda1/app/output.pdf
--------------------------------------------------------------------------------
/app/pdfapi.py:
--------------------------------------------------------------------------------
import os
from datetime import datetime


import uvicorn
from fastapi import FastAPI
from fastapi.responses import JSONResponse
from fastapi.encoders import jsonable_encoder

from pdfminer.pdfinterp import PDFResourceManager, PDFPageInterpreter
from pdfminer.converter import TextConverter
from pdfminer.layout import LAParams
from pdfminer.pdfpage import PDFPage
from pdfminer.pdfparser import PDFParser
from pdfminer.pdfdocument import PDFDocument
from io import StringIO

import ocrmypdf
import pdfplumber
from PIL import Image
import pytesseract
# Change this path if tesseract is installed somewhere else.
pytesseract.pytesseract.tesseract_cmd = r'C:\Program Files\Tesseract-OCR\tesseract.exe'

from .format_converter import pdfminer_to_text, tika_to_text, pdfminer_to_xml, pdfminer_to_html, pdfminer_to_html_char_level

app = FastAPI()

@app.get("/")
def index():
    return {"requests": "1. call '/get_doc_list' to get list of all documents available \n 2. call '/parse/{doc_name}' to get text content of document"}

@app.get("/get_doc_list")
def get_doc_list():
    ls = os.listdir("app/data")
    return {"doc_list": ls}

@app.get("/parse/{sample_doc}")
def get_text(sample_doc: str):
    # Use the same root-relative data directory as the other endpoints.
    path = "app/data/" + sample_doc + ".pdf"
    response = jsonable_encoder(pdf_to_text(path))
    return JSONResponse(content=response)

@app.get("/pdfminer_text/{doc}")
async def pdfminer_text(doc: str):
    response = jsonable_encoder(pdfminer_to_text(doc))
    return JSONResponse(content=response)

@app.get("/pdfminer_text/{doc}/{page_no}")
async def pdfminer_text_page(doc: str, page_no: int):
    response = jsonable_encoder(pdfminer_to_text(doc, page_no))
    return JSONResponse(content=response)

@app.get("/tika_text/{doc}")
async def tika_text(doc: str):
    response = jsonable_encoder(tika_to_text(doc))
    return JSONResponse(content=response)

@app.get("/pdfminer_xml/{doc}")
async def pdfminer_xml(doc: str):
    response = jsonable_encoder(pdfminer_to_xml(doc))
    return JSONResponse(content=response)

@app.get("/pdfminer_xml/{doc}/{page_no}")
async def pdfminer_xml_page(doc: str, page_no: int):
    response = jsonable_encoder(pdfminer_to_xml(doc, page_no))
    return JSONResponse(content=response)

@app.get("/pdfminer_html/{doc}")
async def pdfminer_html(doc: str):
    response = jsonable_encoder(pdfminer_to_html(doc))
    return JSONResponse(content=response)

@app.get("/pdfminer_html/{doc}/{page_no}")
async def pdfminer_html_page(doc: str, page_no: int):
    response = jsonable_encoder(pdfminer_to_html(doc, page_no))
    return JSONResponse(content=response)

@app.get("/pdfminer_html_char/{doc}")
async def pdfminer_html_char_level(doc: str):
    response = jsonable_encoder(pdfminer_to_html_char_level(doc))
    return JSONResponse(content=response)

@app.get("/pdfminer_html_char/{doc}/{page_no}")
async def pdfminer_html_char_level_page(doc: str, page_no: int):
    response = jsonable_encoder(pdfminer_to_html_char_level(doc, page_no))
    return JSONResponse(content=response)

def pdf_to_text(path):
    """
    returns json form of text extracted from pdf specified in the path
    response contains number of pages and text in each page
    """

    response = {}
    rsrcmgr = PDFResourceManager()
    retstr = StringIO()
    laparams = LAParams()
    device = TextConverter(rsrcmgr, retstr, laparams=laparams)
    interpreter = PDFPageInterpreter(rsrcmgr, device)

    fp = open(path, 'rb')
    parser = PDFParser(fp)
    doc = PDFDocument(parser)
    info = doc.info[0].keys()
    dtformat = "%Y%m%d%H%M%S"

    response['Author'] = doc.info[0]["Author"].decode("utf-8") if "Author" in info else None
    response['Creator'] = doc.info[0]["Creator"].decode("utf-8") if "Creator" in info else None

    if "CreationDate" in info:
        # PDF dates look like "D:20210101120000+05'30'"; keep only the timestamp digits.
        clean_creation = doc.info[0]["CreationDate"].decode("utf-8").replace("D:", "").split('+')[0]
        response['CreationDate'] = datetime.strptime(clean_creation, dtformat)

    if "ModDate" in info:
        # LastModified is read from the ModDate entry.
        clean_modified = doc.info[0]["ModDate"].decode("utf-8").replace("D:", "").split('+')[0]
        response['LastModified'] = datetime.strptime(clean_modified, dtformat)

    num_pages = 0

    for pageNumber, page in enumerate(PDFPage.get_pages(fp)):
        interpreter.process_page(page)
        key = 'page_' + str(pageNumber) + "_text"
        # Join words that were hyphenated across a line break.
        response[key] = retstr.getvalue().replace("-\n", "")
        # Almost no extractable text usually means a scanned page, so fall back to OCR.
        if len(response[key]) < 5:
            response[key] = image_to_text(path, pageNumber)
        retstr.truncate(0)
        retstr.seek(0)
        num_pages += 1

    response['num_pages'] = num_pages
    fp.close()
    device.close()
    retstr.close()
    return response

def image_to_text(path, pageNumber):
    """Run OCR on the PDF with the ocrmypdf CLI and return the text of one page."""
    # Imported locally so this helper does not depend on the module-level imports.
    import subprocess
    # Pass the arguments as a list so the document name is never interpreted by a shell.
    subprocess.run(["ocrmypdf", path, "output.pdf"])
    with pdfplumber.open('output.pdf') as pdf:
        page = pdf.pages[pageNumber]
        text = page.extract_text(x_tolerance=2)
    return text

--------------------------------------------------------------------------------
/docker-compose.yml:
--------------------------------------------------------------------------------
version: "3"

services:
  fast_api:
    build: .
    container_name: fastapi_container
    volumes:
      - ./app/:/app
    ports:
      - "8000:8000"

--------------------------------------------------------------------------------
/requirements.txt:
--------------------------------------------------------------------------------
uvicorn==0.13.4
pdfminer==20191125
fastapi>=0.65.2
pytesseract==0.3.7
ocrmypdf==12.0.3
tika==1.24
bs4==0.0.1

--------------------------------------------------------------------------------
/screenshots/doc_3.JPG:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/soham-1/fastapi_pdfextractor/14588dca8ca18a0e7bbdf6be5b6843399700eda1/screenshots/doc_3.JPG
--------------------------------------------------------------------------------
/screenshots/get_doc_list.JPG:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/soham-1/fastapi_pdfextractor/14588dca8ca18a0e7bbdf6be5b6843399700eda1/screenshots/get_doc_list.JPG
--------------------------------------------------------------------------------
/screenshots/parse_doc_2.JPG:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/soham-1/fastapi_pdfextractor/14588dca8ca18a0e7bbdf6be5b6843399700eda1/screenshots/parse_doc_2.JPG
--------------------------------------------------------------------------------
/screenshots/parse_doc_3.JPG:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/soham-1/fastapi_pdfextractor/14588dca8ca18a0e7bbdf6be5b6843399700eda1/screenshots/parse_doc_3.JPG
--------------------------------------------------------------------------------