├── LICENSE
├── README.rst
└── tests
    ├── .DS_Store
    ├── .ipynb_checkpoints
    │   ├── alltests-checkpoint.ipynb
    │   └── test-checkpoint.png
    ├── alltests.ipynb
    ├── convert_to_boxes.py
    ├── convert_to_csv.py
    ├── convert_to_prediction.py
    ├── convert_to_searchable_pdf.py
    ├── convert_to_string.py
    ├── convert_to_tables.py
    ├── convert_to_txt.py
    ├── output_csv_screenshot.png
    ├── output_pdf_screenshot.png
    ├── table.png
    ├── table2.png
    ├── test.png
    └── test2.png
/LICENSE:
--------------------------------------------------------------------------------
1 | MIT License
2 |
3 | Copyright (c) 2022 Nanonets
4 |
5 | Permission is hereby granted, free of charge, to any person obtaining a copy
6 | of this software and associated documentation files (the "Software"), to deal
7 | in the Software without restriction, including without limitation the rights
8 | to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
9 | copies of the Software, and to permit persons to whom the Software is
10 | furnished to do so, subject to the following conditions:
11 |
12 | The above copyright notice and this permission notice shall be included in all
13 | copies or substantial portions of the Software.
14 |
15 | THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
16 | IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
17 | FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
18 | AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
19 | LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
20 | OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
21 | SOFTWARE.
22 |
--------------------------------------------------------------------------------
/README.rst:
--------------------------------------------------------------------------------
1 | Python OCR
2 | ================
3 |
4 | .. image:: https://img.shields.io/pypi/v/ocr-nanonets-wrapper.svg?color=green
5 |    :target: https://pypi.org/project/ocr-nanonets-wrapper/
6 |
7 | This Python package is an OCR library that reads all text and tables from image and PDF files using an OCR engine, and provides intelligent post-processing options to save the OCR results in the format you want.
8 |
9 | |
10 |
11 | .. image:: https://uploads-ssl.webflow.com/60545d366101292fb9c8e98e/60545d36610129d15ec8e9c6_logo-white.svg
12 |    :target: https://nanonets.com/?&utm_source=wrapper
13 |
14 | |
15 | Installation
16 | ------------
17 |
18 | The package requires `Python 3 `_ to run.
19 |
20 | You can use `pip `_ to install:
21 |
22 | .. code-block:: bash
23 |
24 |     pip install ocr-nanonets-wrapper
25 |
26 | Authentication
27 | --------------
28 |
29 | This software is perpetually free :)
30 |
31 | You can get your free API key (with unlimited requests) by creating a free account on `https://app.nanonets.com/#/keys `_.
32 |
33 | .. code-block:: python
34 |
35 |     from nanonets import NANONETSOCR
36 |     model = NANONETSOCR()
37 |     model.set_token('REPLACE_API_KEY')
38 |
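Hard-coding the key as in the snippet above is fine for a quick trial. For shared code it may be safer to read the key from the environment. The following is a minimal sketch, assuming a hypothetical environment variable named ``NANONETS_API_KEY`` (the package does not define this name) and relying only on the ``set_token`` call shown above:

```python
import os


def load_api_key(env=None, var_name="NANONETS_API_KEY"):
    """Return the API key stored in an environment mapping.

    NANONETS_API_KEY is a hypothetical variable name chosen for this sketch;
    the package itself does not define it.
    """
    env = os.environ if env is None else env
    key = env.get(var_name, "").strip()
    # Reject an empty value and the placeholder used in the README examples.
    if not key or key == "REPLACE_API_KEY":
        raise RuntimeError(f"Set the {var_name} environment variable to your API key")
    return key


def authenticate(model, env=None):
    """Pass the key from the environment to model.set_token() and return the model."""
    model.set_token(load_api_key(env))
    return model
```

With the package installed, ``authenticate(NANONETSOCR())`` would then return a model ready for the calls described under Usage.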
39 |
40 | Usage
41 | -----
42 |
43 | You can refer to the code shared below or `directly use code from here `_.
44 |
45 | .. code-block:: python
46 |
47 |     # Initialise
48 |     from nanonets import NANONETSOCR
49 |     model = NANONETSOCR()
50 |
51 |     # Authenticate
52 |     # This software is perpetually free :)
53 |     # You can get your free API key (with unlimited requests) by creating a free account on https://app.nanonets.com/#/keys?utm_source=wrapper.
54 |     model.set_token('REPLACE_API_KEY')
55 |
56 |     # PDF / Image to Raw OCR Engine Output
57 |     import json
58 |     pred_json = model.convert_to_prediction('INPUT_FILE')
59 |     print(json.dumps(pred_json, indent=2))
60 |
61 |     # PDF / Image to String
62 |     string = model.convert_to_string('INPUT_FILE')
63 |     print(string)
64 |
65 |     # PDF / Image to TXT File
66 |     model.convert_to_txt('INPUT_FILE', output_file_name = 'OUTPUTNAME.txt')
67 |
68 |     # PDF / Image to Boxes
69 |     # each element contains predicted word and bounding box information
70 |     # bounding box information denotes the spatial position of each word in the file
71 |     boxes = model.convert_to_boxes('test.png')
72 |     for box in boxes:
73 |         print(box)
74 |
75 |     # PDF / Image to CSV
76 |     # This method extracts tables from your file and prints them in a .csv file.
77 |     # NOTE : This particular function is a trial offering 1000 pages of use.
78 |     # To use this at scale, please create your own model at app.nanonets.com --> New Model --> Tables.
79 |     model.convert_to_csv('INPUT_FILE', output_file_name = 'OUTPUTNAME.csv')
80 |
81 |     # PDF / Image to Tables
82 |     # This method extracts tables from your file and returns a json object.
83 |     # NOTE : This particular function is a trial offering 1000 pages of use.
84 |     # To use this at scale, please create your own model at app.nanonets.com --> New Model --> Tables.
85 |     import json
86 |     tables_json = model.convert_to_tables('INPUT_FILE')
87 |     print(json.dumps(tables_json, indent=2))
88 |
89 |     # PDF / Image to Searchable PDF
90 |     model.convert_to_searchable_pdf('INPUT_FILE', output_file_name = 'OUTPUTNAME.pdf')
91 |
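The calls above each handle a single file. To process a whole folder, one option is a small loop around ``convert_to_txt``. This is only a sketch: ``batch_convert_to_txt`` is a hypothetical helper, not part of the package, and the suffix list is an assumption rather than a statement of which file types the service accepts.

```python
from pathlib import Path


def batch_convert_to_txt(model, input_dir, output_dir,
                         suffixes=(".png", ".jpg", ".jpeg", ".pdf")):
    """Run model.convert_to_txt() on every matching file in input_dir.

    `model` is expected to be an authenticated NANONETSOCR instance.
    The default suffixes are an assumption made for this sketch.
    """
    input_dir, output_dir = Path(input_dir), Path(output_dir)
    output_dir.mkdir(parents=True, exist_ok=True)
    written = []
    for source in sorted(input_dir.iterdir()):
        if not source.is_file() or source.suffix.lower() not in suffixes:
            continue
        # Keep the full source name so test.png and test.pdf do not collide.
        target = output_dir / (source.name + ".txt")
        # Same call as in the usage example, one file at a time.
        model.convert_to_txt(str(source), output_file_name=str(target))
        written.append(target)
    return written
```

Passing the model in as an argument keeps the helper independent of how authentication is done and makes it easy to try out with a stand-in object before spending API calls.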
92 | Testing
93 | -------
94 |
95 | To make getting started easier for you, there is a bunch of sample code along with sample input files.
96 |
97 | - Clone or download the repo and open the /tests folder.
98 | - `alltests.ipynb `_ is a Python notebook containing tests for all methods in the package.
99 | - Each convert_to_{METHOD}.py file is a Python script that exercises one method of the package on its own.
100 |
101 | **Note**
102 |
103 | The convert_to_string() and convert_to_txt() methods accept two optional parameters:
104 |
105 | 1. **formatting =**
106 |
107 |    - ``lines and spaces`` : default, all formatting enabled
108 |
109 |    - ``none`` : space-separated text with formatting removed
110 |
111 |    - ``lines`` : space-separated text with lines separated by a newline character
112 |
113 |    - ``pages`` : list of page-wise space-separated text
114 |
115 | 2. **line_threshold =**
116 |
117 |    - ``low`` : default
118 |    - ``high`` : pass ``line_threshold='high'`` when calling the method; in a few cases this can improve the reading of flowcharts and diagrams.
119 |
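To make the difference between the formatting options concrete, here is an illustrative sketch in plain Python. It does not call the package and is not its implementation. It only mimics the output shapes described above on a hand-made, hypothetical structure (a list of pages, each a list of lines, each a list of words). The ``lines and spaces`` default is left out, since the text above does not specify how its spacing is produced.

```python
def render(pages, formatting):
    """Mimic the documented output shape of three formatting options.

    `pages` is a hypothetical structure used only for this illustration:
    a list of pages, each a list of lines, each a list of words.
    """
    if formatting == "none":
        # One space-separated string, no line or page breaks.
        return " ".join(word for page in pages for line in page for word in line)
    if formatting == "lines":
        # Space-separated words, one text line per row.
        return "\n".join(" ".join(line) for page in pages for line in page)
    if formatting == "pages":
        # One space-separated string per page.
        return [" ".join(word for line in page for word in line) for page in pages]
    raise ValueError(f"unsupported formatting in this sketch: {formatting!r}")


sample = [
    [["Invoice", "No.", "42"], ["Total:", "10.00"]],  # page 1
    [["Thank", "you"]],                               # page 2
]
print(render(sample, "none"))   # Invoice No. 42 Total: 10.00 Thank you
print(render(sample, "lines"))  # three lines of text
print(render(sample, "pages"))  # ['Invoice No. 42 Total: 10.00', 'Thank you']
```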
120 |
121 | Advanced Functions
122 | ------------------
123 | If your use case is extracting flat fields, tables and line items from PDFs and images, I strongly advise you to create your own model by signing up on `app.nanonets.com `_ and using our advanced API. This will significantly improve functionality, accuracy and response times. Once you have created your account and model, you can use the `API documentation present here `_ to extract flat fields, tables and line items from any PDF or image.
124 |
125 | Nanonets
126 | ------------
127 | We help businesses automate manual data entry using AI and reduce the turnaround times and manual effort required. More than 1000 enterprises use Nanonets for Intelligent Document Processing. We have generated incredible ROIs for our clients.
128 |
129 | We provide OCR and IDP solutions customised for various use cases - invoice automation, Receipt OCR, purchase order automation, accounts payable automation, ID Card OCR and many more.
130 |
131 | - Visit `nanonets.com `_ for enterprise OCR and IDP solutions.
132 | - Sign up on `app.nanonets.com/#/signup `_ to start a free trial.
133 |
134 |
135 | License
136 | -------
137 |
138 | **MIT**
139 |
140 | **This software is perpetually free :)**
141 |
--------------------------------------------------------------------------------
/tests/.DS_Store:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/NanoNets/ocr-python/39bee49a3b426d36ccbe484ccffac59073d14e87/tests/.DS_Store
--------------------------------------------------------------------------------
/tests/.ipynb_checkpoints/test-checkpoint.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/NanoNets/ocr-python/39bee49a3b426d36ccbe484ccffac59073d14e87/tests/.ipynb_checkpoints/test-checkpoint.png
--------------------------------------------------------------------------------
/tests/convert_to_boxes.py:
--------------------------------------------------------------------------------
1 | from nanonets import NANONETSOCR
2 | model = NANONETSOCR()
3 |
4 | model.set_token('REPLACE_API_KEY')
5 |
6 | boxes = model.convert_to_boxes('test.png')
7 | for box in boxes:
8 |     print(box)
--------------------------------------------------------------------------------
/tests/convert_to_csv.py:
--------------------------------------------------------------------------------
1 | from nanonets import NANONETSOCR
2 | model = NANONETSOCR()
3 |
4 | model.set_token('REPLACE_API_KEY')
5 |
6 | model.convert_to_csv('table.png', output_file_name='output.csv')
--------------------------------------------------------------------------------
/tests/convert_to_prediction.py:
--------------------------------------------------------------------------------
1 | from nanonets import NANONETSOCR
2 | model = NANONETSOCR()
3 |
4 | model.set_token('REPLACE_API_KEY')
5 |
6 | pred_json = model.convert_to_prediction('test.png')
7 |
8 | # print the json
9 | import json
10 | print(json.dumps(pred_json, indent=2))
--------------------------------------------------------------------------------
/tests/convert_to_searchable_pdf.py:
--------------------------------------------------------------------------------
1 | from nanonets import NANONETSOCR
2 | model = NANONETSOCR()
3 |
4 | model.set_token('REPLACE_API_KEY')
5 |
6 | model.convert_to_searchable_pdf('test2.png', output_file_name='output.pdf')
--------------------------------------------------------------------------------
/tests/convert_to_string.py:
--------------------------------------------------------------------------------
1 | from nanonets import NANONETSOCR
2 | model = NANONETSOCR()
3 |
4 | model.set_token('REPLACE_API_KEY')
5 |
6 | string1 = model.convert_to_string('test.png')
7 | print(string1)
8 |
9 | print('\n\n\n')
10 |
11 | string2 = model.convert_to_string('test.png', formatting='lines')
12 | print(string2)
13 |
14 | print('\n\n\n')
15 |
16 | string3 = model.convert_to_string('test.png', formatting='none')
17 | print(string3)
--------------------------------------------------------------------------------
/tests/convert_to_tables.py:
--------------------------------------------------------------------------------
1 | from nanonets import NANONETSOCR
2 | model = NANONETSOCR()
3 |
4 | model.set_token('REPLACE_API_KEY')
5 |
6 | tables_json = model.convert_to_tables('table2.png')
7 |
8 | # print the json
9 | import json
10 | print(json.dumps(tables_json, indent=2))
--------------------------------------------------------------------------------
/tests/convert_to_txt.py:
--------------------------------------------------------------------------------
1 | from nanonets import NANONETSOCR
2 | model = NANONETSOCR()
3 |
4 | model.set_token('REPLACE_API_KEY')
5 |
6 | # formatting enabled
7 | model.convert_to_txt('test.png', output_file_name='output.txt')
8 |
9 | # formatting = lines
10 | # model.convert_to_txt('test.png', formatting='lines', output_file_name='output.txt')
11 |
12 | # formatting = none
13 | # model.convert_to_txt('test.png', formatting='none', output_file_name='output.txt')
--------------------------------------------------------------------------------
/tests/output_csv_screenshot.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/NanoNets/ocr-python/39bee49a3b426d36ccbe484ccffac59073d14e87/tests/output_csv_screenshot.png
--------------------------------------------------------------------------------
/tests/output_pdf_screenshot.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/NanoNets/ocr-python/39bee49a3b426d36ccbe484ccffac59073d14e87/tests/output_pdf_screenshot.png
--------------------------------------------------------------------------------
/tests/table.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/NanoNets/ocr-python/39bee49a3b426d36ccbe484ccffac59073d14e87/tests/table.png
--------------------------------------------------------------------------------
/tests/table2.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/NanoNets/ocr-python/39bee49a3b426d36ccbe484ccffac59073d14e87/tests/table2.png
--------------------------------------------------------------------------------
/tests/test.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/NanoNets/ocr-python/39bee49a3b426d36ccbe484ccffac59073d14e87/tests/test.png
--------------------------------------------------------------------------------
/tests/test2.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/NanoNets/ocr-python/39bee49a3b426d36ccbe484ccffac59073d14e87/tests/test2.png
--------------------------------------------------------------------------------