├── Dockerfile
├── README.md
├── app
│   ├── __pycache__
│   │   └── pdfapi.cpython-38.pyc
│   ├── data
│   │   ├── sample_doc_1.pdf
│   │   ├── sample_doc_2.pdf
│   │   └── sample_doc_3.pdf
│   ├── format_converter.py
│   ├── output.pdf
│   └── pdfapi.py
├── docker-compose.yml
├── requirements.txt
└── screenshots
    ├── doc_3.JPG
    ├── get_doc_list.JPG
    ├── parse_doc_2.JPG
    └── parse_doc_3.JPG
/Dockerfile:
--------------------------------------------------------------------------------
1 | FROM python:3.8
2 |
3 | # requirements.txt must be copied into the image before it can be installed
4 | COPY requirements.txt ./
5 | RUN pip install -r requirements.txt
6 |
7 | # keep app/ as a package directory so "app.pdfapi" and the app/data paths resolve
8 | COPY ./app ./app
9 |
10 | CMD ["uvicorn", "app.pdfapi:app", "--host", "0.0.0.0", "--port", "8000", "--reload"]
--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
1 | # fastapi_pdfextractor 
2 | A simple API built with [fastapi](https://pypi.org/project/fastapi/) for extracting the text content of PDFs using [pdfminer](https://pypi.org/project/pdfminer/).
3 | Several PDF parsers (PyPDF2, pdfminer, ...) were tried; pdfminer gave the best results. For OCR support, first install [tesseract](https://github.com/UB-Mannheim/tesseract/wiki) and [Ghostscript](https://www.ghostscript.com/download/gsdnld.html), as both are required dependencies for the OCR code to work.
4 | Try out and compare the output of pdfminer and tika through the API endpoints. Results are returned in the API response and also written to the app/results directory.
5 |
6 | Note: if tesseract is installed somewhere other than the default location, update the path accordingly in pdfapi.py.
7 |
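The OCR configuration lives in pdfapi.py, which is not shown in this excerpt. Assuming it drives OCR through pytesseract (an assumption, not confirmed by the files above), pointing it at a non-default tesseract install typically looks like this:

```
# Hypothetical sketch: assumes pdfapi.py uses pytesseract for OCR.
# The path below is only an example of a non-default Windows install location.
import pytesseract

pytesseract.pytesseract.tesseract_cmd = r"C:\Program Files\Tesseract-OCR\tesseract.exe"
```
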
8 | ## Clone project
9 | ```
10 | git clone https://github.com/soham-1/fastapi_pdfextractor.git
11 | ```
12 |
13 | ## Run locally
14 | ### Install dependencies
15 | ```
16 | pip install -r requirements.txt
17 | ```
18 |
19 | ### Run Server
20 | ```
21 | cd app
22 | uvicorn pdfapi:app --host 0.0.0.0 --port 8000 --reload
23 | ```
24 |
25 | ## Run on Docker
26 | ```
27 | docker-compose up -d --build
28 | ```
29 |
30 | ### Stop the container
31 | ```
32 | docker-compose stop fast_api
33 | ```
34 |
35 | ### Restart the container
36 | ```
37 | docker-compose up -d
38 | ```
39 |
40 | ## Documentation
41 | This API has the following endpoints (a minimal usage example follows the list):
42 | * #### ```/get_doc_list``` - returns a list of all the available PDFs
43 | * #### ```/parse/{doc_name}``` - returns the metadata and text content of a PDF. Available PDFs are sample_doc_1, sample_doc_2 and sample_doc_3
44 | * #### ```/pdfminer_text/{doc}``` - returns the text output of a PDF using the pdfminer library
45 | * #### ```/pdfminer_text/{doc}/{page_no}``` - returns the text output of the specified page_no of a PDF
46 | * #### ```/tika_text/{doc}``` - returns the text output of a PDF using the tika library
47 | * #### ```/pdfminer_xml/{doc}``` - returns the XML output
48 | * #### ```/pdfminer_xml/{doc}/{page_no}``` - returns the XML output of the specified page_no of a PDF
49 | * #### ```/pdfminer_html/{doc}``` - returns the HTML output
50 | * #### ```/pdfminer_html/{doc}/{page_no}``` - returns the HTML output of the specified page_no of a PDF
51 | * #### ```/pdfminer_html_char/{doc}``` - returns character-level HTML output
52 | * #### ```/pdfminer_html_char/{doc}/{page_no}``` - returns character-level HTML output of the specified page_no of a PDF
53 |
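A minimal way to exercise the API from Python (assuming the server is running locally on port 8000 as described above; the exact JSON shape of each response depends on pdfapi.py, which is not shown in this excerpt):

```
import json
from urllib.request import urlopen

BASE_URL = "http://localhost:8000"

# list the available PDFs
with urlopen(f"{BASE_URL}/get_doc_list") as resp:
    print(json.loads(resp.read()))

# metadata and text content of sample_doc_1
with urlopen(f"{BASE_URL}/parse/sample_doc_1") as resp:
    print(resp.read().decode("utf-8"))
```
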
54 | ### text pdf
55 | 
56 |
57 | ### output
58 |
59 |
60 | ### pdf with scanned image
61 |
62 |
63 | ### output
64 | 
65 |
--------------------------------------------------------------------------------
/app/__pycache__/pdfapi.cpython-38.pyc:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/soham-1/fastapi_pdfextractor/14588dca8ca18a0e7bbdf6be5b6843399700eda1/app/__pycache__/pdfapi.cpython-38.pyc
--------------------------------------------------------------------------------
/app/data/sample_doc_1.pdf:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/soham-1/fastapi_pdfextractor/14588dca8ca18a0e7bbdf6be5b6843399700eda1/app/data/sample_doc_1.pdf
--------------------------------------------------------------------------------
/app/data/sample_doc_2.pdf:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/soham-1/fastapi_pdfextractor/14588dca8ca18a0e7bbdf6be5b6843399700eda1/app/data/sample_doc_2.pdf
--------------------------------------------------------------------------------
/app/data/sample_doc_3.pdf:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/soham-1/fastapi_pdfextractor/14588dca8ca18a0e7bbdf6be5b6843399700eda1/app/data/sample_doc_3.pdf
--------------------------------------------------------------------------------
/app/format_converter.py:
--------------------------------------------------------------------------------
1 | import pdfminer
2 | from pdfminer.pdfinterp import PDFResourceManager, PDFPageInterpreter
3 | from pdfminer.converter import TextConverter, PDFPageAggregator, XMLConverter, HTMLConverter
4 | from pdfminer.layout import LAParams
5 | from pdfminer.pdfpage import PDFPage, PDFTextExtractionNotAllowed
6 | from pdfminer.pdfparser import PDFParser
7 | from io import StringIO, BytesIO
8 |
9 | from tika import parser
10 |
11 | from bs4 import BeautifulSoup
12 |
13 |
14 | def pdfminer_to_text(input_file: str, page_no: int = None):
15 |     file_loc = "app/data/" + input_file + ".pdf"
16 |     fp = open(file_loc, 'rb')
17 |     outfp = BytesIO()
18 |     rsrcmgr = PDFResourceManager()
19 |     laparams = LAParams(line_overlap=.5, char_margin=60, line_margin=.5, word_margin=1, boxes_flow=.6,
20 |                         detect_vertical=True, all_texts=False)
21 |     # laparams = LAParams() # default configurations
22 |     device = TextConverter(rsrcmgr, outfp, 'utf-8', laparams=laparams)
23 |     interpreter = PDFPageInterpreter(rsrcmgr, device)
24 |     # process every page, or only the requested (0-based) page_no
25 |     for page_index, page in enumerate(PDFPage.get_pages(fp)):
26 |         if page_no is None or page_index == page_no:
27 |             interpreter.process_page(page)
28 |         if page_no is not None and page_index == page_no: break
29 |     text = outfp.getvalue().decode()  # read the buffer once so pages are not duplicated
30 |     output_file = f'app/results/pdfminer_{input_file}.txt' if page_no is None else f'app/results/pdfminer_{input_file}_page{page_no}.txt'
31 |     with open(output_file, 'w', encoding='utf-8', errors='ignore') as f:
32 |         f.write(text)
33 |     return text
34 |
35 |
36 | def tika_to_text(input_file):
37 |     parsed_pdf = parser.from_file("app/data/" + input_file + ".pdf")
38 |     text = parsed_pdf["content"] or ""  # tika returns None when no text could be extracted
39 |     output_file = f'app/results/tika_{input_file}.txt'
40 |     with open(output_file, 'w', encoding='utf-8', errors='ignore') as f:
41 |         f.write(text)
42 |     return text
43 |
44 |
45 | def pdfminer_to_xml(input_file: str, page_no: int = None):
46 |     file_loc = "app/data/" + input_file + ".pdf"
47 |     fp = open(file_loc, 'rb')
48 |     outfp = BytesIO()
49 |     rsrcmgr = PDFResourceManager()
50 |     laparams = LAParams()
51 |     device = XMLConverter(rsrcmgr, outfp, 'utf-8', laparams=laparams)
52 |     interpreter = PDFPageInterpreter(rsrcmgr, device)
53 |     text = ""
54 |     for page_index, page in enumerate(PDFPage.get_pages(fp)):
55 |         if page_no is None or page_index == page_no:
56 |             interpreter.process_page(page)
57 |             text += outfp.getvalue().decode()
58 |             outfp.seek(0)      # clear the buffer so each page is appended only once
59 |             outfp.truncate(0)
60 |         if page_no is not None and page_index == page_no: break
61 |     output_file = f'app/results/pdfminer_{input_file}.xml' if page_no is None else f'app/results/pdfminer_{input_file}_page{page_no}.xml'
62 |     with open(output_file, 'w', encoding='utf-8', errors='ignore') as f:
63 |         f.write(text)
64 |     return text
65 |
66 |
67 | def pdfminer_to_html(input_file: str, page_no: int = None):
68 |     file_loc = "app/data/" + input_file + ".pdf"
69 |     fp = open(file_loc, 'rb')
70 |     outfp = BytesIO()
71 |     rsrcmgr = PDFResourceManager()
72 |     laparams = LAParams(line_overlap=.5, char_margin=60, line_margin=.5, word_margin=1, boxes_flow=.6,
73 |                         detect_vertical=True, all_texts=False)
74 |     # laparams = LAParams()
75 |     device = HTMLConverter(rsrcmgr, outfp, 'utf-8', laparams=laparams)
76 |     interpreter = PDFPageInterpreter(rsrcmgr, device)
77 |     text = ""
78 |     for page_index, page in enumerate(PDFPage.get_pages(fp)):
79 |         if page_no is None or page_index == page_no:
80 |             interpreter.process_page(page)
81 |             retstr = outfp.getvalue().decode()
82 |             text += retstr
83 |             output_file = f'app/results/pdfminer_{input_file}_page{page_index}.html'
84 |             with open(output_file, 'w', encoding='utf-8', errors='ignore') as f:
85 |                 f.write(retstr)   # each processed page is also written to its own html file
86 |         if page_no is not None and page_index == page_no: break
87 |         outfp.seek(0)      # clear the buffer before the next page
88 |         outfp.truncate(0)
89 |     return text
90 |
91 |
92 | def pdfminer_to_html_char_level(input_file: str, page_no: int = None):
93 |     xml = pdfminer_to_xml(input_file, page_no)
94 |     body_start = """
95 |
96 |