├── LICENSE
├── README.rst
└── tests
    ├── .DS_Store
    ├── .ipynb_checkpoints
        ├── alltests-checkpoint.ipynb
        └── test-checkpoint.png
    ├── alltests.ipynb
    ├── convert_to_boxes.py
    ├── convert_to_csv.py
    ├── convert_to_prediction.py
    ├── convert_to_searchable_pdf.py
    ├── convert_to_string.py
    ├── convert_to_tables.py
    ├── convert_to_txt.py
    ├── output_csv_screenshot.png
    ├── output_pdf_screenshot.png
    ├── table.png
    ├── table2.png
    ├── test.png
    └── test2.png


/LICENSE:
--------------------------------------------------------------------------------
 1 | MIT License
 2 | 
 3 | Copyright (c) 2022 Nanonets
 4 | 
 5 | Permission is hereby granted, free of charge, to any person obtaining a copy
 6 | of this software and associated documentation files (the "Software"), to deal
 7 | in the Software without restriction, including without limitation the rights
 8 | to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
 9 | copies of the Software, and to permit persons to whom the Software is
10 | furnished to do so, subject to the following conditions:
11 | 
12 | The above copyright notice and this permission notice shall be included in all
13 | copies or substantial portions of the Software.
14 | 
15 | THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
16 | IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
17 | FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
18 | AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
19 | LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
20 | OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
21 | SOFTWARE.
22 | 


--------------------------------------------------------------------------------
/README.rst:
--------------------------------------------------------------------------------
  1 | Python OCR
  2 | ================
  3 | 
  4 | .. image:: https://img.shields.io/pypi/v/ocr-nanonets-wrapper.svg?color=green
  5 |    :target: https://pypi.org/project/ocr-nanonets-wrapper/
  6 | 
  7 | This Python package is an OCR library that reads text and tables from image and PDF files using an OCR engine, and provides post-processing options to save the OCR results in the format you want.
  8 | 
  9 | |
 10 | 
 11 | .. image:: https://uploads-ssl.webflow.com/60545d366101292fb9c8e98e/60545d36610129d15ec8e9c6_logo-white.svg
 12 |    :target: https://nanonets.com/?&utm_source=wrapper
 13 |    
 14 | |
 15 | Installation
 16 | ------------
 17 | 
 18 | The package requires `Python 3 <https://www.python.org/downloads/>`_ to run.
 19 | 
 20 | You can use `pip <https://pip.pypa.io/en/stable/installation/>`_ to install:
 21 | 
 22 | .. code-block:: bash
 23 | 
 24 |     pip install ocr-nanonets-wrapper
 25 | 
 26 | Authentication
 27 | --------------
 28 | 
 29 | This software is perpetually free :)
 30 | 
 31 | You can get your free API key (with unlimited requests) by creating a free account on `https://app.nanonets.com/#/keys <https://app.nanonets.com/#/keys?utm_source=wrapper>`_.
 32 | 
 33 | .. code-block:: python
 34 | 
 35 |     from nanonets import NANONETSOCR
 36 |     model = NANONETSOCR()
 37 |     model.set_token('REPLACE_API_KEY')
 38 | 
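Rather than hard-coding the key in a script, you may prefer to read it from the environment. The snippet below is a minimal sketch, not part of the package: ``load_api_key`` is a hypothetical helper, and ``NANONETS_API_KEY`` is an assumed variable name that the package itself does not read.

```python
import os

PLACEHOLDER = "REPLACE_API_KEY"  # placeholder string used in the examples


def load_api_key(env=None, var_name="NANONETS_API_KEY"):
    """Return the API key stored in `var_name`, or raise if it is unusable.

    `var_name` is a hypothetical variable name chosen for this sketch.
    """
    env = os.environ if env is None else env
    key = env.get(var_name, "").strip()
    # Reject a missing value and the unedited placeholder alike.
    if not key or key == PLACEHOLDER:
        raise RuntimeError(f"Set {var_name} to your Nanonets API key first.")
    return key


# Usage (requires the package and a real key):
# from nanonets import NANONETSOCR
# model = NANONETSOCR()
# model.set_token(load_api_key())
```

Failing early on the placeholder value avoids sending requests that would be rejected anyway.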
 39 | 
 40 | Usage
 41 | -----
 42 | 
 43 | You can refer to the code shared below or `use the code from this notebook directly <https://github.com/NanoNets/ocr-python-nanonets/blob/main/tests/alltests.ipynb>`_.
 44 | 
 45 | .. code-block:: python
 46 | 
 47 |     # Initialise
 48 |     from nanonets import NANONETSOCR
 49 |     model = NANONETSOCR()
 50 |     
 51 |     # Authenticate
 52 |     # This software is perpetually free :)
 53 |     # You can get your free API key (with unlimited requests) by creating a free account on https://app.nanonets.com/#/keys?utm_source=wrapper.
 54 |     model.set_token('REPLACE_API_KEY')
 55 |     
 56 |     # PDF / Image to Raw OCR Engine Output
 57 |     import json
 58 |     pred_json = model.convert_to_prediction('INPUT_FILE')
 59 |     print(json.dumps(pred_json, indent=2))
 60 |     
 61 |     # PDF / Image to String
 62 |     string = model.convert_to_string('INPUT_FILE')
 63 |     print(string)
 64 |     
 65 |     # PDF / Image to TXT File
 66 |     model.convert_to_txt('INPUT_FILE', output_file_name = 'OUTPUTNAME.txt')
 67 | 
 68 |     # PDF / Image to Boxes 
 69 |     # each element contains predicted word and bounding box information
 70 |     # bounding box information denotes the spatial position of each word in the file
 71 |     boxes = model.convert_to_boxes('INPUT_FILE')
 72 |     for box in boxes:
 73 |         print(box)
 74 | 
 75 |     # PDF / Image to CSV
 76 |     # This method extracts tables from your file and writes them to a .csv file.
 77 |     # NOTE: This function is a trial offering limited to 1000 pages of use.
 78 |     # To use this at scale, please create your own model at app.nanonets.com --> New Model --> Tables.
 79 |     model.convert_to_csv('INPUT_FILE', output_file_name = 'OUTPUTNAME.csv')
 80 | 
 81 |     # PDF / Image to Tables
 82 |     # This method extracts tables from your file and returns a JSON object.
 83 |     # NOTE: This function is a trial offering limited to 1000 pages of use.
 84 |     # To use this at scale, please create your own model at app.nanonets.com --> New Model --> Tables.
 85 |     import json
 86 |     tables_json = model.convert_to_tables('INPUT_FILE')
 87 |     print(json.dumps(tables_json, indent=2))
 88 | 
 89 |     # PDF / Image to Searchable PDF
 90 |     model.convert_to_searchable_pdf('INPUT_FILE', output_file_name = 'OUTPUTNAME.pdf')  
 91 | 
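Each call above handles a single file. To process a whole folder, the calls can be wrapped in a loop. The following is a sketch under assumptions: ``convert_folder_to_txt`` is a hypothetical helper, the extension filter is only an illustrative choice, and the only package call used is ``convert_to_txt`` with the ``output_file_name`` argument shown above.

```python
from pathlib import Path


def convert_folder_to_txt(model, input_dir, output_dir, extensions=(".png", ".pdf")):
    """Call model.convert_to_txt for every matching file in input_dir.

    Returns the output paths that were requested, in sorted input order.
    """
    input_dir, output_dir = Path(input_dir), Path(output_dir)
    output_dir.mkdir(parents=True, exist_ok=True)
    outputs = []
    for source in sorted(input_dir.iterdir()):
        # Skip sub-folders and files whose extension is not in the filter.
        if not source.is_file() or source.suffix.lower() not in extensions:
            continue
        target = output_dir / (source.stem + ".txt")
        model.convert_to_txt(str(source), output_file_name=str(target))
        outputs.append(target)
    return outputs


# Usage (requires the package and a real key):
# from nanonets import NANONETSOCR
# model = NANONETSOCR()
# model.set_token('REPLACE_API_KEY')
# convert_folder_to_txt(model, 'tests', 'tests_output')
```

Passing the model in as an argument keeps the helper independent of how the token is set.
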
 92 | Testing
 93 | -------
 94 | 
 95 | To help you get started, the repo includes sample code along with sample input files.
 96 | 
 97 | - Clone or download the repo and open the /tests folder.
 98 | - `alltests.ipynb <https://github.com/NanoNets/ocr-python-nanonets/blob/main/tests/alltests.ipynb>`_ is a Python notebook containing tests for all methods in the package.
 99 | - The ``convert_to_{METHOD}.py`` files are Python scripts, one for each method in the package.
100 | 
101 | **Note**
102 | 
103 | The convert_to_string() and convert_to_txt() methods accept two optional parameters:
104 | 
105 | 1. **formatting =**
106 | 
107 | - ``lines and spaces`` : default, all formatting enabled
108 | 
109 | - ``none`` : space-separated text with formatting removed
110 | 
111 | - ``lines`` : space-separated text, with lines separated by a newline character
112 | 
113 | - ``pages`` : list of page-wise space-separated text
114 | 
115 | 2. **line_threshold =**
116 | 
117 | - ``low`` : default
118 | - ``high`` : pass ``line_threshold='high'`` when calling the method; in a few cases this can improve the reading of flowcharts and diagrams.
119 | 
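To make the ``formatting`` values concrete, the sketch below mimics the described behaviour of ``none``, ``lines`` and ``pages`` on text that has already been extracted. It is an illustration only, not the package's implementation: ``apply_formatting`` and its nested-list input (pages, then lines, then words) are assumptions made for this example. The default ``lines and spaces`` is left out because it depends on word positions that this toy input does not carry.

```python
def apply_formatting(pages, formatting):
    """Illustrate the formatting options on nested lists of words.

    pages: list of pages; each page is a list of lines; each line is a
    list of words. This input shape is an assumption for the sketch.
    """
    if formatting == "none":
        # One space-separated string, no line or page structure.
        return " ".join(word for page in pages for line in page for word in line)
    if formatting == "lines":
        # Words joined by spaces, lines joined by newline characters.
        return "\n".join(" ".join(line) for page in pages for line in page)
    if formatting == "pages":
        # One space-separated string per page.
        return [" ".join(word for line in page for word in line) for page in pages]
    raise ValueError(f"formatting not covered by this sketch: {formatting!r}")


sample = [[["Invoice", "42"], ["Total", "10"]], [["Page", "two"]]]
print(apply_formatting(sample, "lines"))
print(apply_formatting(sample, "pages"))
```

With the real package, the same choice is made by passing ``formatting='lines'`` (or another value) to ``convert_to_string()`` or ``convert_to_txt()``.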
120 | 
121 | Advanced Functions
122 | ------------------
123 | If your use case is extracting flat fields, tables and line items from PDFs and images, we strongly advise you to create your own model by signing up on `app.nanonets.com <https://app.nanonets.com/#/signup?utm_source=wrapper>`_ and using our advanced API. This significantly improves functionality, accuracy and response times. Once you have created your account and model, you can follow the `API documentation <https://app.nanonets.com/documentation#operation/OCRModelLabelFileByModelIdPost>`_ to extract flat fields, tables and line items from any PDF or image.
124 | 
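For orientation, the request described in the linked API documentation can be prepared roughly as follows. This is a sketch under assumptions, not an official client: the endpoint path and the use of HTTP Basic authentication (API key as username, empty password) reflect a reading of that documentation and should be checked against it. ``label_file_request`` is a hypothetical helper that only builds the URL and header and sends nothing.

```python
import base64
from urllib.parse import quote

# Assumed base URL; confirm it in the API documentation linked above.
API_ROOT = "https://app.nanonets.com/api/v2"


def label_file_request(model_id, api_key):
    """Return (url, headers) for a prediction request on one model."""
    url = f"{API_ROOT}/OCR/Model/{quote(model_id, safe='')}/LabelFile/"
    # HTTP Basic auth encodes "username:password"; here the API key is the
    # username and the password is empty.
    credentials = base64.b64encode(f"{api_key}:".encode("utf-8")).decode("ascii")
    return url, {"Authorization": f"Basic {credentials}"}


url, headers = label_file_request("REPLACE_MODEL_ID", "REPLACE_API_KEY")
print(url)
```

The file itself would then be sent as a multipart upload in a POST request to that URL, using an HTTP client of your choice.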
125 | Nanonets
126 | ------------
127 | We help businesses automate manual data entry using AI, reducing turnaround times and the manual effort required. More than 1000 enterprises use Nanonets for Intelligent Document Processing. We have generated incredible ROIs for our clients.
128 | 
129 | We provide OCR and IDP solutions customised for various use cases: invoice automation, receipt OCR, purchase order automation, accounts payable automation, ID card OCR and many more.
130 | 
131 | - Visit `nanonets.com <https://nanonets.com/?&utm_source=wrapper>`_ for enterprise OCR and IDP solutions.
132 | - Sign up on `app.nanonets.com/#/signup <https://app.nanonets.com/#/signup?&utm_source=wrapper>`_ to start a free trial.
133 | 
134 | 
135 | License
136 | -------
137 | 
138 | **MIT**
139 | 
140 | **This software is perpetually free :)**
141 | 


--------------------------------------------------------------------------------
/tests/.DS_Store:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/NanoNets/ocr-python/39bee49a3b426d36ccbe484ccffac59073d14e87/tests/.DS_Store


--------------------------------------------------------------------------------
/tests/.ipynb_checkpoints/test-checkpoint.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/NanoNets/ocr-python/39bee49a3b426d36ccbe484ccffac59073d14e87/tests/.ipynb_checkpoints/test-checkpoint.png


--------------------------------------------------------------------------------
/tests/convert_to_boxes.py:
--------------------------------------------------------------------------------
1 | from nanonets import NANONETSOCR
2 | model = NANONETSOCR()
3 | 
4 | model.set_token('REPLACE_API_KEY')
5 | 
6 | boxes = model.convert_to_boxes('test.png')
7 | for box in boxes:
8 |     print(box)


--------------------------------------------------------------------------------
/tests/convert_to_csv.py:
--------------------------------------------------------------------------------
1 | from nanonets import NANONETSOCR
2 | model = NANONETSOCR()
3 | 
4 | model.set_token('REPLACE_API_KEY')
5 | 
6 | model.convert_to_csv('table.png', output_file_name='output.csv')


--------------------------------------------------------------------------------
/tests/convert_to_prediction.py:
--------------------------------------------------------------------------------
 1 | import json
 2 | from nanonets import NANONETSOCR
 3 | 
 4 | model = NANONETSOCR()
 5 | model.set_token('REPLACE_API_KEY')
 6 | 
 7 | pred_json = model.convert_to_prediction('test.png')
 8 | 
 9 | # Pretty-print the raw OCR engine output as JSON
10 | print(json.dumps(pred_json, indent=2))


--------------------------------------------------------------------------------
/tests/convert_to_searchable_pdf.py:
--------------------------------------------------------------------------------
1 | from nanonets import NANONETSOCR
2 | model = NANONETSOCR()
3 | 
4 | model.set_token('REPLACE_API_KEY')
5 | 
6 | model.convert_to_searchable_pdf('test2.png', output_file_name='output.pdf')


--------------------------------------------------------------------------------
/tests/convert_to_string.py:
--------------------------------------------------------------------------------
 1 | from nanonets import NANONETSOCR
 2 | model = NANONETSOCR()
 3 | 
 4 | model.set_token('REPLACE_API_KEY')
 5 | 
 6 | string1 = model.convert_to_string('test.png')
 7 | print(string1)
 8 | 
 9 | print('\n\n\n')
10 | 
11 | string2 = model.convert_to_string('test.png', formatting='lines')
12 | print(string2)
13 | 
14 | print('\n\n\n')
15 | 
16 | string3 = model.convert_to_string('test.png', formatting='none')
17 | print(string3)


--------------------------------------------------------------------------------
/tests/convert_to_tables.py:
--------------------------------------------------------------------------------
 1 | import json
 2 | from nanonets import NANONETSOCR
 3 | 
 4 | model = NANONETSOCR()
 5 | model.set_token('REPLACE_API_KEY')
 6 | 
 7 | tables_json = model.convert_to_tables('table2.png')
 8 | 
 9 | # Pretty-print the extracted tables as JSON
10 | print(json.dumps(tables_json, indent=2))


--------------------------------------------------------------------------------
/tests/convert_to_txt.py:
--------------------------------------------------------------------------------
 1 | from nanonets import NANONETSOCR
 2 | model = NANONETSOCR()
 3 | 
 4 | model.set_token('REPLACE_API_KEY')
 5 | 
 6 | # formatting enabled
 7 | model.convert_to_txt('test.png', output_file_name='output.txt')
 8 | 
 9 | # formatting = lines
10 | # model.convert_to_txt('test.png', formatting='lines', output_file_name='output.txt')
11 | 
12 | # formatting = none
13 | # model.convert_to_txt('test.png', formatting='none', output_file_name='output.txt')


--------------------------------------------------------------------------------
/tests/output_csv_screenshot.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/NanoNets/ocr-python/39bee49a3b426d36ccbe484ccffac59073d14e87/tests/output_csv_screenshot.png


--------------------------------------------------------------------------------
/tests/output_pdf_screenshot.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/NanoNets/ocr-python/39bee49a3b426d36ccbe484ccffac59073d14e87/tests/output_pdf_screenshot.png


--------------------------------------------------------------------------------
/tests/table.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/NanoNets/ocr-python/39bee49a3b426d36ccbe484ccffac59073d14e87/tests/table.png


--------------------------------------------------------------------------------
/tests/table2.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/NanoNets/ocr-python/39bee49a3b426d36ccbe484ccffac59073d14e87/tests/table2.png


--------------------------------------------------------------------------------
/tests/test.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/NanoNets/ocr-python/39bee49a3b426d36ccbe484ccffac59073d14e87/tests/test.png


--------------------------------------------------------------------------------
/tests/test2.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/NanoNets/ocr-python/39bee49a3b426d36ccbe484ccffac59073d14e87/tests/test2.png


--------------------------------------------------------------------------------