├── LICENSE └── README.md /LICENSE: -------------------------------------------------------------------------------- 1 | ## creative commons 2 | 3 | # CC0 1.0 Universal 4 | 5 | ``` 6 | CREATIVE COMMONS CORPORATION IS NOT A LAW FIRM AND DOES NOT PROVIDE LEGAL 7 | SERVICES. DISTRIBUTION OF THIS DOCUMENT DOES NOT CREATE AN ATTORNEY-CLIENT 8 | RELATIONSHIP. CREATIVE COMMONS PROVIDES THIS INFORMATION ON AN "AS-IS" BASIS. 9 | CREATIVE COMMONS MAKES NO WARRANTIES REGARDING THE USE OF THIS DOCUMENT OR THE 10 | INFORMATION OR WORKS PROVIDED HEREUNDER, AND DISCLAIMS LIABILITY FOR DAMAGES 11 | RESULTING FROM THE USE OF THIS DOCUMENT OR THE INFORMATION OR WORKS PROVIDED 12 | HEREUNDER. 13 | ``` 14 | 15 | ### Statement of Purpose 16 | 17 | The laws of most jurisdictions throughout the world automatically confer 18 | exclusive Copyright and Related Rights (defined below) upon the creator and 19 | subsequent owner(s) (each and all, an "owner") of an original work of 20 | authorship and/or a database (each, a "Work"). 21 | 22 | Certain owners wish to permanently relinquish those rights to a Work for the 23 | purpose of contributing to a commons of creative, cultural and scientific works 24 | ("Commons") that the public can reliably and without fear of later claims of 25 | infringement build upon, modify, incorporate in other works, reuse and 26 | redistribute as freely as possible in any form whatsoever and for any purposes, 27 | including without limitation commercial purposes. These owners may contribute 28 | to the Commons to promote the ideal of a free culture and the further 29 | production of creative, cultural and scientific works, or to gain reputation or 30 | greater distribution for their Work in part through the use and efforts of 31 | others. 
32 | 33 | For these and/or other purposes and motivations, and without any expectation of 34 | additional consideration or compensation, the person associating CC0 with a 35 | Work (the "Affirmer"), to the extent that he or she is an owner of Copyright 36 | and Related Rights in the Work, voluntarily elects to apply CC0 to the Work and 37 | publicly distribute the Work under its terms, with knowledge of his or her 38 | Copyright and Related Rights in the Work and the meaning and intended legal 39 | effect of CC0 on those rights. 40 | 41 | 1. __Copyright and Related Rights.__ A Work made available under CC0 may be 42 | protected by copyright and related or neighboring rights ("Copyright and 43 | Related Rights"). Copyright and Related Rights include, but are not limited to, 44 | the following: 45 | 46 | i. the right to reproduce, adapt, distribute, perform, display, 47 | communicate, and translate a Work; 48 | 49 | ii. moral rights retained by the original author(s) and/or performer(s); 50 | 51 | iii. publicity and privacy rights pertaining to a person's image or 52 | likeness depicted in a Work; 53 | 54 | iv. rights protecting against unfair competition in regards to a Work, 55 | subject to the limitations in paragraph 4(a), below; 56 | 57 | v. rights protecting the extraction, dissemination, use and reuse of data 58 | in a Work; 59 | 60 | vi. database rights (such as those arising under Directive 96/9/EC of the 61 | European Parliament and of the Council of 11 March 1996 on the legal 62 | protection of databases, and under any national implementation thereof, 63 | including any amended or successor version of such directive); and 64 | 65 | vii. other similar, equivalent or corresponding rights throughout the world 66 | based on applicable law or treaty, and any national implementations 67 | thereof. 68 | 69 | 2. 
__Waiver.__ To the greatest extent permitted by, but not in contravention 70 | of, applicable law, Affirmer hereby overtly, fully, permanently, irrevocably 71 | and unconditionally waives, abandons, and surrenders all of Affirmer's 72 | Copyright and Related Rights and associated claims and causes of action, 73 | whether now known or unknown (including existing as well as future claims and 74 | causes of action), in the Work (i) in all territories worldwide, (ii) for the 75 | maximum duration provided by applicable law or treaty (including future time 76 | extensions), (iii) in any current or future medium and for any number of 77 | copies, and (iv) for any purpose whatsoever, including without limitation 78 | commercial, advertising or promotional purposes (the "Waiver"). Affirmer makes 79 | the Waiver for the benefit of each member of the public at large and to the 80 | detriment of Affirmer's heirs and successors, fully intending that such Waiver 81 | shall not be subject to revocation, rescission, cancellation, termination, or 82 | any other legal or equitable action to disrupt the quiet enjoyment of the Work 83 | by the public as contemplated by Affirmer's express Statement of Purpose. 84 | 85 | 3. __Public License Fallback.__ Should any part of the Waiver for any reason be 86 | judged legally invalid or ineffective under applicable law, then the Waiver 87 | shall be preserved to the maximum extent permitted taking into account 88 | Affirmer's express Statement of Purpose. 
In addition, to the extent the Waiver 89 | is so judged Affirmer hereby grants to each affected person a royalty-free, non 90 | transferable, non sublicensable, non exclusive, irrevocable and unconditional 91 | license to exercise Affirmer's Copyright and Related Rights in the Work (i) in 92 | all territories worldwide, (ii) for the maximum duration provided by applicable 93 | law or treaty (including future time extensions), (iii) in any current or 94 | future medium and for any number of copies, and (iv) for any purpose 95 | whatsoever, including without limitation commercial, advertising or promotional 96 | purposes (the "License"). The License shall be deemed effective as of the date 97 | CC0 was applied by Affirmer to the Work. Should any part of the License for any 98 | reason be judged legally invalid or ineffective under applicable law, such 99 | partial invalidity or ineffectiveness shall not invalidate the remainder of the 100 | License, and in such case Affirmer hereby affirms that he or she will not (i) 101 | exercise any of his or her remaining Copyright and Related Rights in the Work 102 | or (ii) assert any associated claims and causes of action with respect to the 103 | Work, in either case contrary to Affirmer's express Statement of Purpose. 104 | 105 | 4. __Limitations and Disclaimers.__ 106 | 107 | a. No trademark or patent rights held by Affirmer are waived, abandoned, 108 | surrendered, licensed or otherwise affected by this document. 109 | 110 | b. Affirmer offers the Work as-is and makes no representations or 111 | warranties of any kind concerning the Work, express, implied, statutory or 112 | otherwise, including without limitation warranties of title, 113 | merchantability, fitness for a particular purpose, non infringement, or the 114 | absence of latent or other defects, accuracy, or the present or absence of 115 | errors, whether or not discoverable, all to the greatest extent permissible 116 | under applicable law. 117 | 118 | c. 
Affirmer disclaims responsibility for clearing rights of other persons 119 | that may apply to the Work or any use thereof, including without limitation 120 | any person's Copyright and Related Rights in the Work. Further, Affirmer 121 | disclaims responsibility for obtaining any necessary consents, permissions 122 | or other rights required for any use of the Work. 123 | 124 | d. Affirmer understands and acknowledges that Creative Commons is not a 125 | party to this document and has no duty or obligation with respect to this 126 | CC0 or use of the Work. 127 | -------------------------------------------------------------------------------- /README.md: -------------------------------------------------------------------------------- 1 | Awesome OCR 2 | =========== 3 | 4 | [![Awesome](https://cdn.rawgit.com/sindresorhus/awesome/d7305f38d29fed78fa85652e3a63e154dd8e8829/media/badge.svg)](https://github.com/sindresorhus/awesome) 5 | 6 | This list contains links to great software tools and libraries and literature 7 | related to [Optical Character Recognition 8 | (OCR)](http://en.wikipedia.org/wiki/Optical_Character_Recognition). 9 | 10 | Contributions are welcome, as is feedback. 
11 | 12 | 13 | * [Software](#software) 14 | * [OCR engines](#ocr-engines) 15 | * [Older and possibly abandoned OCR engines](#older-and-possibly-abandoned-ocr-engines) 16 | * [OCR file formats](#ocr-file-formats) 17 | * [hOCR](#hocr) 18 | * [ALTO XML](#alto-xml) 19 | * [TEI](#tei) 20 | * [PAGE XML](#page-xml) 21 | * [OCR CLI](#ocr-cli) 22 | * [OCR GUI](#ocr-gui) 23 | * [OCR Preprocessing](#ocr-preprocessing) 24 | * [OCR as a Service](#ocr-as-a-service) 25 | * [OCR evaluation](#ocr-evaluation) 26 | * [OCR libraries by programming language](#ocr-libraries-by-programming-language) 27 | * [Crystal](#crystal) 28 | * [Elixir](#elixir) 29 | * [Go](#go) 30 | * [Java](#java) 31 | * [.Net](#net) 32 | * [Object Pascal](#object-pascal) 33 | * [PHP](#php) 34 | * [Python](#python) 35 | * [Javascript](#javascript) 36 | * [Ruby](#ruby) 37 | * [Swift](#swift) 38 | * [Rust](#rust) 39 | * [R](#r) 40 | * [OCR training tools](#ocr-training-tools) 41 | * [Datasets](#datasets) 42 | * [Ground Truth](#ground-truth) 43 | * [Literature](#literature) 44 | * [OCR-related publication and link lists](#ocr-related-publication-and-link-lists) 45 | * [Blog Posts and Tutorials](#blog-posts-and-tutorials) 46 | * [OCR Showcases](#ocr-showcases) 47 | * [Academic articles](#academic-articles) 48 | * [2011 and before](#2011-and-before) 49 | * [2012](#2012) 50 | * [2013](#2013) 51 | * [2014](#2014) 52 | * [2015](#2015) 53 | * [2016](#2016) 54 | * [2017](#2017) 55 | * [2018](#2018) 56 | 57 | 58 | 59 | ## Software 60 | 61 | ### OCR engines 62 | 63 | * [tesseract](https://github.com/tesseract-ocr/tesseract) - The definitive Open Source OCR engine `Apache 2.0` 64 | * [EasyOCR](https://github.com/JaidedAI/EasyOCR) - OCR engine built on PyTorch by JaidedAI, `Apache 2.0` 65 | * [ocropus](https://github.com/tmbdev/ocropy) - OCR engine based on LSTM, `Apache 2.0` 66 | * [ocropus 0.4](https://github.com/jkrall/ocropus) - Older v0.4 state of Ocropus, with tesseract 2.04 and iulib, C++ 67 | * 
[kraken](https://github.com/mittagessen/kraken) - Ocropus fork with sane defaults 68 | * [gocr](https://www-e.ovgu.de/jschulen/ocr/) - OCR engine under the GNU Public License led by Joerg Schulenburg. 69 | * [Ocrad](http://www.gnu.org/software/ocrad/) - The GNU OCR. `GPL` 70 | * [ocular](https://github.com/tberg12/ocular) - Machine-learning OCR for historic documents 71 | * [SwiftOCR](https://github.com/garnele007/SwiftOCR) - fast and simple OCR library written in Swift 72 | * [attention-ocr](https://github.com/emedvedev/attention-ocr) - OCR engine using visual attention mechanisms 73 | * [RWTH-OCR](https://www-i6.informatik.rwth-aachen.de/rwth-ocr/) - The RWTH Aachen University Optical Character Recognition System 74 | * [simple-ocr-opencv](https://github.com/goncalopp/simple-ocr-opencv) and its [fork](https://github.com/RedFantom/simple-ocr-opencv) - A simple pythonic OCR engine using opencv and numpy 75 | * [Calamari](https://github.com/Calamari-OCR/calamari) - OCR Engine based on OCRopy and Kraken 76 | * [doctr](https://github.com/mindee/doctr) - A seamless & high-performing OCR library powered by Deep Learning 77 | 78 | ### Older and possibly abandoned OCR engines 79 | 80 | * [Clara OCR](http://freecode.com/projects/claraocr/) - Open source OCR in C `GPL` 81 | * [Cuneiform](https://en.wikipedia.org/wiki/CuneiForm_(software)) - CuneiForm OCR was developed by Cognitive Technologies 82 | * [Eye](https://sourceforge.net/projects/eyeocr/) - an experimental Java OCR (image-to-text) application 83 | * [kognition](https://sourceforge.net/projects/kognition/) - An omnifont OCR software for KDE 84 | * [OCRchie](https://people.eecs.berkeley.edu/~fateman/kathey/ocrchie.html) - Modular Optical Character Recognition Software 85 | * [ocre](http://lem.eui.upm.es/ocre.html) - o.c.r. 
easy 86 | * [xplab](http://www.pattern-lab.de/) - A GTK 2 tool for pattern matching 87 | * [hebOCR](https://github.com/yaacov/hebocr) - Hebrew character recognition library (previously named hocr, see [Wikipedia article](https://de.wikipedia.org/wiki/HebOCR)) `GPL` 88 | 89 | ### OCR file formats 90 | 91 | * [abby2hocr.xslt XSLT script](https://gist.github.com/tfmorris/5977784) 92 | * [ocr-conversion-scripts](https://github.com/cneud/ocr-conversion-scripts) 93 | 94 | #### hOCR 95 | 96 | * [hocr-tools](https://github.com/tmbdev/hocr-tools) - Tools for doing various useful things with hOCR files, `Apache 2.0` 97 | * [hocr-spec](https://github.com/kba/hocr-spec) - hOCR 1.2 specification 98 | * [ocr-transform](https://github.com/UB-Mannheim/ocr-transform) - CLI tool to convert between hOCR and ALTO, `MIT` 99 | * [hocr-parser](https://github.com/athento/hocr-parser) - hOCR Specification Python Parser 100 | * [hOCRTools](https://github.com/ONB-RD/hOCRTools) - hOCR to ALTO conversion XSLT 101 | 102 | #### ALTO XML 103 | 104 | * [ALTO XML Schema](https://github.com/altoxml/schema) - XML Schema and development of the ALTO XML format 105 | * [ALTO XML Documentation](https://github.com/altoxml/documentation) - Documentation and use cases for ALTO 106 | * [alto-tools](https://github.com/cneud/alto-tools) - Various tools to work with ALTO files, Python 107 | * [AbbyyToAlto](https://github.com/ironymark/AbbyyToAlto) - PHP script converting from Abbyy 6 to ALTO XML 108 | 109 | #### TEI 110 | 111 | * [TEI-OCR](https://github.com/OpenPhilology/tei-ocr) - TEI customization for OCR generated layout and content information 112 | * [TEI SIG on Libraries](http://www.tei-c.org/SIG/Libraries/teiinlibraries/main-driver.html) - Best Practices for TEI in Libraries 113 | * [GDZ](http://gdz.sub.uni-goettingen.de/uploads/media/GDZ_document_format_2005_12_08.pdf) - METS/TEI-based GDZ document format 114 | 115 | #### PAGE XML 116 | 117 | * [PAGE-XML 
Schema](https://github.com/PRImA-Research-Lab/PAGE-XML/tree/master/pagecontent) - XML schema of the PAGE XML format along with documentation and examples 118 | * [omni:us Pages Format (OPF)](https://omni-us.github.io/pageformat/pagecontent_omnius.html) - XML schema very similar to PAGE XML that has some additional features. 119 | * [py-pagexml](https://github.com/omni-us/pagexml) - Python library for handling PAGE XML and OPF files. 120 | 121 | ### OCR CLI 122 | 123 | * [OCRmyPDF](https://github.com/jbarlow83/OCRmyPDF) - OCRmyPDF adds an OCR text layer to scanned PDF files, allowing them to be searched 124 | * [Pdf2PdfOCR](https://github.com/LeoFCardoso/pdf2pdfocr) - A tool to OCR a PDF (or supported images) and add a text "layer" (a "pdf sandwich") in the original file making it a searchable PDF. GUI included. Tesseract and cuneiform supported. 125 | * [Ocrocis](https://github.com/kaumanns/ocrocis) - Project manager interface for Ocropy, see also [external project homepage](http://cistern.cis.lmu.de/ocrocis/) 126 | * [tesseract-recognize](https://github.com/mauvilsa/tesseract-recognize) - Tesseract-based tool that outputs result in Page XML format ([docker image](https://hub.docker.com/r/mauvilsa/tesseract-recognize)). 127 | 128 | 129 | ### OCR GUI 130 | 131 | * [moz-hocr-editor](https://github.com/garrison/moz-hocr-edit) - Firefox Addon for editing hOCR files **Discontinued** 132 | * [qt-box-editor](https://github.com/zdenop/qt-box-editor) - QT4 editor of tesseract-ocr box files. 133 | * [ocr-gt-tools](https://github.com/UB-Mannheim/ocr-gt-tools) - Client-Server application for editing OCR ground truth. 134 | * [Paperwork](https://github.com/openpaperwork/paperwork) - Using scanners and OCR to grep paper documents the easy way. 135 | * [Paperless](https://github.com/danielquinn/paperless) - Scan, index, and archive all of your paper documents. 
136 | * [gImageReader](https://github.com/manisandro/gImageReader) - gImageReader is a simple Gtk/Qt front-end to tesseract-ocr. 137 | * [VietOCR](http://vietocr.sourceforge.net/) - A Java/.NET GUI frontend for Tesseract OCR engine, including [jTessBoxEditor](http://vietocr.sourceforge.net/training.html) a graphical Tesseract [box data](https://github.com/tesseract-ocr/tesseract/wiki/Make-Box-Files) editor 138 | * [PoCoTo](https://github.com/cisocrgroup/PoCoTo) - Fast interactive batch corrections of complete OCR error series in OCR'ed historical documents. 139 | * [OCRFeeder](https://wiki.gnome.org/Apps/OCRFeeder) - GTK graphical user interface that allows the users to correct characters or bounding boxes, ODT export and more. 140 | * [PRImA PAGE Viewer](https://github.com/PRImA-Research-Lab/prima-page-viewer) - Java based viewer for PAGE XML files (layout + text content). Also supports ALTO XML, FineReader XML, and HOCR. 141 | * [LAREX](https://github.com/chreul/larex) - A semi-automatic open-source tool for Layout Analysis and Region EXtraction on early printed books. 142 | * [archiscribe](https://github.com/jbaiter/archiscribe) - Web application for transcribing OCR ground truth from Archive.org. Deployed instance available at https://archiscribe.jbaiter.de/, results are available in [@jbaiter/archiscribe-corpus](https://github.com/jbaiter/archiscribe-corpus). 143 | * [nw-page-editor](https://github.com/mauvilsa/nw-page-editor) - Simple app for visual editing of Page XML files. Provides desktop and [server docker-based](https://hub.docker.com/r/mauvilsa/nw-page-editor-web) versions. 144 | 145 | ### OCR Preprocessing 146 | 147 | * [NoiseRemove.java in MathOCR](https://github.com/chungkwong/MathOCR/blob/master/src/main/java/com/github/chungkwong/mathocr/preprocess/NoiseRemove.java) - Java implementation of Adaptive degraded document image binarization by B. Gatos , I. Pratikakis, S.J. 
Perantonis 148 | * [binarize.c in ZBar](https://github.com/ZBar/ZBar/blob/master/zbar/qrcode/binarize.c) - C implementations of two binarization algorithms, based on Sauvola 149 | * [typeface-corpus](https://github.com/jbest/typeface-corpus) - A repository for typefaces to train Tesseract and OCRopus for natural history collections and digital humanities. 150 | * [binarizewolfjolion](https://github.com/zp-j/binarizewolfjolion) - Comparison of binarization algorithms. [Blog post](http://zp-j.github.io/2013/10/04/document-binarization/) 151 | * [`crop_morphology.py` in oldnyc](https://github.com/danvk/oldnyc) - Cropping a page to just the text block 152 | * [Whiteboard Picture Cleaner](https://gist.github.com/lelandbatey/8677901) - Shell one-liner/script to clean up and beautify photos of whiteboards 153 | * Fred's ImageMagick script [textcleaner](http://www.fmwconcepts.com/imagemagick/textcleaner/index.php) - Processes a scanned document of text to clean the text background 154 | * [localcontrast](https://sourceforge.net/projects/localcontrast/) - Fast O(1) local contrast optimization 155 | 156 | 157 | ### OCR as a Service 158 | 159 | * [Open OCR](https://github.com/tleyden/open-ocr) - Run Tesseract in Docker containers 160 | * [tesseract-web-service](https://github.com/guitarmind/tesseract-web-service) - An implementation of a RESTful web service for tesseract-OCR using Tornado. 161 | * [docker-ocropy](https://github.com/kba/docker-ocropy) - A Docker container for running the [ocropy OCR system](https://github.com/tmbdev/ocropy). 162 | * [ABBYY Cloud OCR SDK Code samples](https://github.com/abbyysdk/ocrsdk.com) - Code samples for using the proprietary commercial ABBYY OCR API. 163 | * [nidaba](https://github.com/OpenPhilology/nidaba) - An expandable and scalable OCR pipeline 164 | * [gamera](https://github.com/hsnr-gamera/gamera) - A meta-framework for building document processing applications, e.g. 
OCR 165 | * [ocr-tools](https://github.com/subugoe/ocr-tools) - Project to provide CLI and web service interfaces to common OCR engines 166 | * [ocrad-docker](https://github.com/kba/ocrad-docker) - Run the [ocrad](http://www.gnu.org/software/ocrad/) OCR engine in a docker container 167 | * [kraken-docker](https://github.com/kba/kraken-docker) - Run the [kraken](https://github.com/mittagessen/kraken) OCR engine in a docker container 168 | * [Konfuzio](https://www.konfuzio.com) - Free online OCR for up to 2,000 pages per month, plus an OCR API, by [@atraining], see https://youtu.be/NZKUrKyFVA8 (code is not open) 169 | * [ocr.space](https://ocr.space/) - Free online OCR and OCR API by [@a9t9](https://github.com/A9T9) based on Tesseract (code is not open) 170 | * [OCR4all](https://github.com/OCR4all/OCR4all) - Provides OCR services through web applications. Included projects: [LAREX](https://github.com/chreul/LAREX), [OCRopus](https://github.com/tmbdev/ocropy), [calamari](https://github.com/ChWick/calamari) and [nashi](https://github.com/andbue/nashi). 
171 | 172 | ### OCR evaluation 173 | 174 | * [ISRI OCR Evaluation Tools](https://code.google.com/archive/p/isri-ocr-evaluation-tools/) with a [User Guide from 1996 :!:](https://github.com/eddieantonio/isri-ocr-evaluation-tools/blob/HEAD/user-guide.pdf) 175 | * [isri-ocr-evaluation-tools](https://github.com/eddieantonio/isri-ocr-evaluation-tools) - further development by [@eddieantonio](https://github.com/eddieantonio) (2015, 2016) 176 | * [ancientgreekocr-evaluation-tools](https://github.com/ryanfb/ancientgreekocr-ocr-evaluation-tools) - further development by [@nickjwhite](https://github.com/nickjwhite) (2013, 2014) 177 | * [ocrevalUAtion](https://github.com/impactcentre/ocrevalUAtion) - Cross-format evaluation, CLI and GUI 178 | * [ngram-ocr-eval](https://github.com/impactcentre/hackathon2014/tree/master/ngram-ocr-eval) - Brute and simple OCR evaluation using ngrams 179 | * [quack](https://github.com/tokee/quack) - Quality-Assurance-tool for scans with corresponding ALTO-files 180 | 181 | ### OCR libraries by programming language 182 | 183 | #### Crystal 184 | 185 | * [tesseract-ocr](https://github.com/dannnylo/tesseract-ocr-crystal) - A Crystal wrapper for tesseract-ocr. 186 | 187 | #### Elixir 188 | 189 | * [tesseract_ocr](https://github.com/dannnylo/tesseract-ocr-elixir) - Elixir library wrapping the tesseract executable. 190 | 191 | #### Go 192 | 193 | * [gosseract](https://github.com/otiai10/gosseract) - Golang OCR library, wrapping Tesseract-ocr. 194 | 195 | #### Java 196 | 197 | * [Tess4J](https://github.com/nguyenq/tess4j) - Java Native Access bindings to Tesseract. 198 | * [tess-two](https://github.com/rmtheis/tess-two) - Tools for compiling Tesseract on Android and Java API. 199 | 200 | #### .Net 201 | 202 | * [tesseract for .net](https://github.com/charlesw/tesseract) - A .Net wrapper for tesseract-ocr. 203 | 204 | #### Object Pascal 205 | 206 | * [TTesseractOCR4](https://github.com/r1me/TTesseractOCR4) - Object Pascal binding for tesseract-ocr 4.x. 
207 | 208 | #### PHP 209 | 210 | * [Tesseract OCR for PHP](https://github.com/thiagoalessio/tesseract-ocr-for-php) - Tesseract PHP bindings. 211 | 212 | #### Python 213 | 214 | * [pytesseract](https://github.com/madmaze/pytesseract) - A Python wrapper for Google Tesseract. 215 | * [pyocr](https://github.com/jflesch/pyocr) - A Python wrapper for Tesseract and Cuneiform. 216 | * [ocrodjvu](https://github.com/jwilk/ocrodjvu) - A library and standalone tool for doing OCR on DjVu documents, wrapping Cuneiform, gocr, ocrad, ocropus and tesseract 217 | * [tesserocr](https://github.com/sirfz/tesserocr) - A Python wrapper for the tesseract-ocr API 218 | 219 | #### Javascript 220 | 221 | * [ocracy](https://github.com/naptha/ocracy) - pure javascript lstm rnn implementation based on ocropus 222 | * [gocr.js](https://github.com/antimatter15/gocr.js) - Javascript port (emscripten) of gocr 223 | * [ocrad.js](https://github.com/antimatter15/ocrad.js) - Javascript port (emscripten) of ocrad 224 | * [tesseract.js](https://github.com/naptha/tesseract.js) - Javascript port (emscripten) of Tesseract 225 | * [node-tesseract-ocr](https://github.com/zapolnoch/node-tesseract-ocr) - A simple wrapper for the Tesseract OCR package. 226 | * [node-tesseract-native](https://github.com/mdelete/node-tesseract-native) - C++ module for node providing OCR with tesseract and leptonica. 227 | 228 | #### Ruby 229 | 230 | * [rtesseract](https://github.com/dannnylo/rtesseract) - Ruby library wrapping the tesseract and imagemagick executables. 231 | * [ruby-tesseract](https://github.com/meh/ruby-tesseract-ocr) - Native Tesseract bindings for Ruby MRI and JRuby 232 | * [ocr_space](https://github.com/suyesh/ocr_space) - API wrapper for free ocr service ocr.space. Includes CLI 233 | 234 | #### Rust 235 | 236 | * [tesseract.rs](https://github.com/antimatter15/tesseract-rs) - Rust bindings for tesseract OCR. 
237 | * [leptess](https://crates.io/crates/leptess) - Productive and safe Rust bindings/wrappers for tesseract and leptonica. 238 | 239 | #### R 240 | 241 | * [tesseract](https://github.com/ropensci/tesseract) - R bindings for tesseract OCR. 242 | 243 | #### Swift 244 | 245 | * [Tesseract OCR iOS](https://github.com/gali8/Tesseract-OCR-iOS) - Swift and Objective-C wrapper for Tesseract OCR. 246 | * [SwiftOCR](https://github.com/garnele007/SwiftOCR) - Fast and simple OCR library written in Swift. Optimized for recognizing short, one-line alphanumeric codes. 247 | 248 | ### OCR training tools 249 | 250 | * [glyph-miner](https://github.com/benedikt-budig/glyph-miner) - A system for extracting glyphs from early typeset prints 251 | * [ocrodeg](https://github.com/NVlabs/ocrodeg) - Document image degradation for OCR data augmentation 252 | 253 | ## Datasets 254 | 255 | ### Ground Truth 256 | 257 | * [archiscribe-corpus](https://github.com/jbaiter/archiscribe-corpus) - >4,200 lines transcribed from 19th Century German prints via [archiscribe](https://archiscribe.jbaiter.de/) `CC-BY 4.0` 258 | * [CIS OCR Test Set](https://github.com/cisocrgroup/Resources/tree/master/ocrtestset) - 2 example documents each in German/Latin/Greek with ground truth for [PoCoTo](https://github.com/cisocrgroup/PoCoTo) 259 | * [Rescribe](https://github.com/rescribe/carolineminuscule-groundtruth) - Transcriptions of Caroline Minuscule Manuscripts `PDM 1.0` 260 | * [CLTK](https://github.com/cltk) - Corpora from [Classical Language Toolkit](http://cltk.org/) `PDM 1.0` 261 | * [DIVA-HisDB](https://diuf.unifr.ch/main/hisdoc/diva-hisdb) - 150 pages[PAGE-XML](https://github.com/PRImA-Research-Lab/PAGE-XML) of three medieval manuscripts `CC-BY-NC 3.0` 262 | * [EarlyPrintedBooks](https://github.com/chreul/OCR_Testdata_EarlyPrintedBooks) - ~8,800 lines from several early printed books `CC-BY-NC-SA 4.0` 263 | * [EEBO-TCP](https://github.com/Anterotesis/historical-texts/tree/master/eebo-tcp) - 25,363 
EEBO documents transcribed by [TCP](http://www.textcreationpartnership.org/tcp-eebo/) `PDM 1.0` 264 | * [ECCO-TCP](https://github.com/Anterotesis/historical-texts/tree/master/ecco-tcp) - 2,188 ECCO documents transcribed by [TCP](http://www.textcreationpartnership.org/tcp-ecco/) `PDM 1.0` 265 | * [eMOP-TCP](https://github.com/Early-Modern-OCR/TCP-ECCO-texts) - 2,188 ECCO-TCP documents, cleaned up by [eMOP](http://emop.tamu.edu/) `PDM 1.0` 266 | * [Evans-TCP](https://github.com/Anterotesis/historical-texts/tree/master/evans-tcp) - 4,977 Evans documents transcribed by [TCP](http://www.textcreationpartnership.org/tcp-evans/) 267 | * [FDHN](https://digi.kansalliskirjasto.fi/opendata/submit?set_language=en) - Finnish Digitised Historical Newspapers, [Paper](http://doi.org/10.1045/july2016-paakkonen), (free) [registration](https://digi.kansalliskirjasto.fi/opendata/submit?set_language=en) required, [Terms of Use](https://digi.kansalliskirjasto.fi/terms) 268 | * [FROC-MSS](https://github.com/Jean-Baptiste-Camps/FROC-MSS) - 4 Old French Medieval Manuscripts `CC-BY 4.0` 269 | * [GERMANA](https://www.prhlt.upv.es/wp/resource/the-germana-corpus) - 764 Spanish manuscript pages, (free) [registration](https://www.prhlt.upv.es/wp/resource/the-germana-corpus) required `non-commercial use only` 270 | * [GT4HistOCR](https://doi.org/10.5281/zenodo.1344132) - Ground Truth for German Fraktur and Early Modern Latin `CC-BY 4.0` 271 | * [imagessan](https://github.com/Shreeshrii/imagessan/) - Sanskrit images & ground truth (Devanagari script) 272 | * [IMPACT-BHL](http://www.bhle.eu/en/results-of-the-collaboration-of-bhl-europe-and-impact) - 2,418 pages[PAGE-XML](https://github.com/PRImA-Research-Lab/PAGE-XML) from the Biodiversity Heritage Library, [XML@GitHub](https://github.com/impactcentre/groundtruth-bhl) `CC-BY 3.0` 273 | * 
[IMPACT-BL](https://www.digitisation.eu/tools-resources/image-and-ground-truth-resources/impact-dataset-browser/?query=&search-filter-institution=BL&search-filter-language=&search-filter-script=&search-filter-year=) - 294 pages[PAGE-XML](https://github.com/PRImA-Research-Lab/PAGE-XML) from the British Library, (free) [registration](https://www.digitisation.eu/wp-login.php?action=register) required `PDM 1.0` 274 | * [IMPACT-BNE](https://www.digitisation.eu/tools-resources/image-and-ground-truth-resources/impact-dataset-browser/?query=&search-filter-institution=BNE&search-filter-language=&search-filter-script=&search-filter-year=) - 215 pages[PAGE-XML](https://github.com/PRImA-Research-Lab/PAGE-XML) from the National Library of Spain, (free) [registration](https://www.digitisation.eu/wp-login.php?action=register) required, [XML@GitHub](https://github.com/impactcentre/groundtruth-spa) `CC-BY-NC-SA 4.0` 275 | * [IMPACT-BNF](https://www.digitisation.eu/tools-resources/image-and-ground-truth-resources/impact-dataset-browser/?query=&search-filter-institution=BNE&search-filter-language=&search-filter-script=&search-filter-year=) - 151 pages[PAGE-XML](https://github.com/PRImA-Research-Lab/PAGE-XML) from the National Library of France, (free) [registration](https://www.digitisation.eu/wp-login.php?action=register) required `CC-BY-NC-SA 4.0` 276 | * [IMPACT-KB](http://lab.kb.nl/dataset/ground-truth-impact-project#access) - 142 pages[PAGE-XML](https://github.com/PRImA-Research-Lab/PAGE-XML) from the National Library of the Netherlands `CC-BY 4.0` 277 | * [IMPACT-NKC](https://www.digitisation.eu/tools-resources/image-and-ground-truth-resources/impact-dataset-browser/?query=&search-filter-institution=NKC&search-filter-language=&search-filter-script=&search-filter-year=) - 187 pages[PAGE-XML](https://github.com/PRImA-Research-Lab/PAGE-XML) from the Czech National Library, (free) [registration](https://www.digitisation.eu/wp-login.php?action=register) required `CC-BY-NC-SA 4.0` 
278 | * [IMPACT-NLB](https://www.digitisation.eu/tools-resources/image-and-ground-truth-resources/impact-dataset-browser/?query=&search-filter-institution=NLB&search-filter-language=&search-filter-script=&search-filter-year=) - 19 pages[PAGE-XML](https://github.com/PRImA-Research-Lab/PAGE-XML) from the National Library of Bulgaria, (free) [registration](https://www.digitisation.eu/wp-login.php?action=register) required `CC-BY-NC-ND 4.0` 279 | * [IMPACT-NUK](https://www.digitisation.eu/tools-resources/image-and-ground-truth-resources/impact-dataset-browser/?query=&search-filter-institution=NUK&search-filter-language=&search-filter-script=&search-filter-year=) - 209 pages[PAGE-XML](https://github.com/PRImA-Research-Lab/PAGE-XML) from the National Library of Slovenia, (free) [registration](https://www.digitisation.eu/wp-login.php?action=register) required `CC-BY-NC-SA 4.0` 280 | * [IMPACT-PSNC](http://dl.psnc.pl/activities/projekty/impact/results/) - 478 pages[PAGE-XML](https://github.com/PRImA-Research-Lab/PAGE-XML) from four Polish digital libraries, [XML@GitHub](https://github.com/impactcentre/groundtruth-pol) `CC-BY 3.0` 281 | * [LascivaRoma/lexical](https://github.com/lascivaroma/lexical) - Transcription of 19th century lexical resources for Latin learning 282 | * [MJSynth](http://www.robots.ox.ac.uk/~vgg/data/text/) - 9m synthetic images covering 90k English words 283 | * [OCR19thSAC](https://files.ifi.uzh.ch/cl/OCR19thSAC/) - 19,000 pages Swiss Alpine Club yearbooks transcribed via [Text+Berg digital](http://textberg.ch/site/en/welcome/) `CC-BY 4.0` 284 | * [OCR-D](http://ocr-d.de/daten) - 180 pages[PAGE-XML](https://github.com/PRImA-Research-Lab/PAGE-XML) of German historical prints from [OCR-D](http://ocr-d.de/) `CC-BY-SA 4.0` 285 | * [OCR_GS_Data](https://github.com/OpenITI/OCR_GS_Data) - Double-checked Arabic Gold Standard from [OpenITI](https://github.com/OpenITI) 286 | * [old-books](https://github.com/PedroBarcha/old-books-dataset) - 322 old books from 
[Project Gutenberg](https://www.gutenberg.org/) `GPL 3.0` 287 | * [PRImA-ENP](http://www.primaresearch.org/datasets/ENP) - 528 pages [PAGE-XML](https://github.com/PRImA-Research-Lab/PAGE-XML) of historic newspapers from [Europeana Newspapers](http://www.europeana-newspapers.eu/), (free) [registration](http://www.primaresearch.org/register) required `PDM 1.0` 288 | * [RODRIGO](https://www.prhlt.upv.es/wp/resource/the-rodrigo-corpus) - 853 Spanish manuscript pages, (free) [registration](https://www.prhlt.upv.es/wp/resource/the-rodrigo-corpus) required `non-commercial use only` 289 | * [Toebler-OCR](https://github.com/PonteIneptique/toebler-ocr) - (Kraken) Ground Truth transcription of a few pages of the Tobler-Lommatzsch: Altfranzösisches Wörterbuch 290 | 291 | ## Literature 292 | 293 | ### OCR-related publication and link lists 294 | 295 | * [IMPACT: Tools for text digitisation](http://www.digitisation.eu/tools-resources/tools-for-text-digitisation/) - List of tools and software projects, some related to OCR 296 | * [OCR-D](https://www.zotero.org/groups/ocr-d) - List of OCR-related academic articles in the context of the [OCR-D](http://www.ocr-d.de/) project.
:de: 297 | * [Mendeley Group "OCR - Optical Character Recognition"](https://www.mendeley.com/groups/752871/ocr-optical-character-recognition/) - Collection of 34 papers on OCR 298 | * [eadh.org projects](http://eadh.org/projects) - List of Digital Humanities-related projects in Europe, some related to OCR 299 | * [Wikipedia: Comparison of optical character recognition software](https://en.wikipedia.org/wiki/Comparison_of_optical_character_recognition_software) 300 | * [OCR [and Deep Learning]](http://handong1587.github.io/deep_learning/2015/10/09/ocr.html) by [@handong1587](https://github.com/handong1587/) 301 | * [Ocropus Wiki: Publications](https://github.com/tmbdev/ocropy/wiki/Publications) 302 | 303 | ### Blog Posts and Tutorials 304 | 305 | * [Tesseract Blends Old and New OCR Technology](https://github.com/tesseract-ocr/docs/tree/master/das_tutorial2016) (2016) [@theraysmith](https://github.com/theraysmith) 306 | * Tutorial@DAS2016, updated "What You Always Wanted to Know" slides 307 | * [What You Always Wanted To Know About Tesseract](https://drive.google.com/folderview?id=0B7l10Bj_LprhQnpSRkpGMGV2eE0&usp#list) (2014) [@theraysmith](https://github.com/theraysmith) 308 | * Tutorial@DAS2014, includes demos 309 | * [Extracting text from an image using Ocropus](http://www.danvk.org/2015/01/09/extracting-text-from-an-image-using-ocropus.html) (2015) 310 | * [Training an Ocropus OCR model](http://www.danvk.org/2015/01/11/training-an-ocropus-ocr-model.html) (2015) [@danvk](https://github.com/danvk) 311 | * [Ocropus Wiki: Compute errors and confusions](https://github.com/tmbdev/ocropy/wiki/Compute-errors-and-confusions) (2016) [@zuphilip](https://github.com/zuphilip) 312 | * [Ocropus Wiki: Working with Ground Truth](https://github.com/tmbdev/ocropy/wiki/Working-with-Ground-Truth) (2016) [@zuphilip](https://github.com/zuphilip) 313 | * [OCRopus](https://comsys.informatik.uni-kiel.de/lang/de/res/ocropus/)
(2016) [@jze](https://github.com/jze) 314 | * mostly on column separation in ocropus 315 | * [10 Tips for making your OCR project succeed](http://blog.kbresearch.nl/2013/12/12/10-tips-for-making-your-ocr-project-succeed/) (2013) [@cneud](https://github.com/cneud) 316 | * general things to consider for OCR projects 317 | * [Overview of LEADTOOLS Image Cleanup and Pre-processing SDK Technology](https://www.leadtools.com/sdk/image-processing/document) 318 | * feature list for a commercial image pre-processing library; has nice before-after samples for pre-processing steps related to OCR 319 | * [Extracting Text from PDFs; Doing OCR; all within R](https://electricarchaeology.ca/2014/07/15/doing-ocr-within-r/) [@shawngraham](https://github.com/shawngraham) 320 | * How to work with OCR from PDFs in the [R programming environment](https://www.r-project.org/) 321 | * [Tutorial: Command-line OCR on a Mac](http://benschmidt.org/dighist13/?page_id=129) [@bmschmidt](https://github.com/bmschmidt) 322 | * Tutorial on how to run Tesseract on Mac OS X 323 | * [Practical Expercience with OCRopus Model Training](https://comsys.informatik.uni-kiel.de/lang/de/res/practical-expercience-with-ocropus-model-training/) (2016) [@jze](https://github.com/jze) 324 | * [Homemade Manuscript OCR (1): OCRopy](http://graal.hypotheses.org/786) (2017) [@Jean-Baptiste-Camps](https://github.com/Jean-Baptiste-Camps) 325 | * Tutorial on applying OCR to medieval manuscripts with OCRopy 326 | * [Optimizing Binarization for OCRopus](https://comsys.informatik.uni-kiel.de/lang/de/res/optimizing-binarization-for-ocropus/) (2017) [@jze](https://github.com/jze) 327 | * [Prototype demo for OCR postfix in Danish Newspapers](https://sbdevel.wordpress.com/2016/11/15/prototype-demo-for-ocr-postfix-in-danish-newspapers/) (2016) [@thomasegense](https://github.com/thomasegense) 328 | * [How Can I OCR My Dictionary?](https://digilex.hypotheses.org/153) (2016) [@JessedeDoes](https://github.com/JessedeDoes) 329 | *
["Needlessly complex" blog](https://mzucker.github.io/) (2016) [@mzucker](https://github.com/mzucker). Several image processing how-tos (Python based), particularly: 330 | * [Page dewarping](https://mzucker.github.io/2016/08/15/page-dewarping.html) ([code](https://github.com/mzucker/page_dewarp)) 331 | * [Compressing and enhancing hand-written notes](https://mzucker.github.io/2016/09/20/noteshrink.html) ([code](https://github.com/mzucker/noteshrink)) 332 | * [Unprojecting text with ellipses](https://mzucker.github.io/2016/10/11/unprojecting-text-with-ellipses.html) ([code](https://github.com/mzucker/unproject_text)) 333 | * [(Open-Source-)OCR-Workflows](https://edoc.bbaw.de/frontdoor/index/index/docId/2786) (2017) [@wrznr](https://github.com/wrznr) :de: overview of the state of the art in open source OCR and related technologies (binarisation, deskewing, layout recognition, etc.), lots of example images and information on the [@OCR-D](https://github.com/OCR-D) project. 334 | * [A gentle introduction to OCR](https://towardsdatascience.com/a-gentle-introduction-to-ocr-ee1469a201aa) (2018) [@shgidi](https://github.com/shgidi) 335 | * [Worauf kann ich mich verlassen? Arbeiten mit digitalisierten Quellen, Teil 1: OCR](https://blog.ub.unibas.ch/2019/06/04/worauf-kann-ich-mich-verlassen-arbeiten-mit-digitalisierten-quellen-teil-1-ocr/) (2019) [@eliaskreyenbuehl](https://github.com/eliaskreyenbuehl) :de: A reflection/criticism on OCR quality, OCR pitfalls in Fraktur fonts. 336 | 337 | ### OCR Showcases 338 | 339 | * [abbyy-finereader-ocr-senate](https://github.com/dannguyen/abbyy-finereader-ocr-senate) - Using OCR to parse scanned Senate Financial Disclosure forms. 
340 | * [cvOCR](https://github.com/Halfish/cvOCR) - An OCR system for recognizing résumé or CV text, implemented in Python and C and based on Tesseract 341 | * [MathOCR](https://github.com/chungkwong/MathOCR) - A printed scientific document recognition system, **pre-alpha** 342 | 343 | ### Academic articles 344 | 345 | #### 2011 and before 346 | * [High performance document layout analysis](http://www.dfki.de/web/research/publications/renameFileForDownload?filename=HighPerfDocLayoutAna.pdf&file_id=uploads_552) (2003) Breuel 347 | * [Adaptive degraded document image binarization](http://doai.io/10.1016/j.patcog.2005.09.010) (2006) Gatos, Pratikakis, Perantonis 348 | * [[Internship Report]](http://www.madm.eu/_media/theses/ocropusgupta.pdf) (2007) Gupta 349 | * [OCRopus Addons (Internship Report)](http://madm.dfki.de/_media/theses/ocropusdantrey.pdf) (2007) Dantrey 350 | 351 | #### 2012 352 | * [Local Logistic Classifiers for Large Scale Learning](http://www.academia.edu/2959462/Local_Logistic_Classifiers_for_Large_Scale_Learning) (2012) Yousefi, Breuel 353 | 354 | #### 2013 355 | * [High Performance OCR for Printed English and Fraktur using LSTM Networks](http://staffhome.ecm.uwa.edu.au/~00082689/papers/Breuel-LSTM-OCR-ICDAR13.pdf) (2013) Breuel, Ul-Hasan, Mayce Al Azawi,
Shafait 356 | * [Can we build language-independent OCR using LSTM networks?](https://www.researchgate.net/publication/260341307_Can_we_build_language-independent_OCR_using_LSTM_networks) (2013) Ul-Hasan, Breuel 357 | * [Offline Printed Urdu Nastaleeq Script Recognition with Bidirectional LSTM Networks](http://staffhome.ecm.uwa.edu.au/~00082689/papers/Adnan-Urdu-OCR-ICDAR13.pdf) (2013) Ul-Hasan, Ahmed, Rashid, Shafait, Breuel 358 | 359 | #### 2014 360 | * [OCR of historical printings of Latin texts: Problems, Prospects, Progress.](http://www.springmann.net/papers/2014-04-07-DATeCH2014-springmann.pdf) (2014) Springmann, Najock, Morgenroth, Schmid, Gotscharek, Fink 361 | * [Correcting Noisy OCR: Context beats Confusion](http://dx.doi.org/10.1145/2595188.2595200) (2014) Evershed, Fitch 362 | 363 | #### 2015 364 | * [TypeWright: An Experiment in Participatory Curation](http://www.digitalhumanities.org/dhq/vol/9/4/000220/000220.html) (2015) Bilansky 365 | * On crowd-sourcing OCR postcorrection 366 | * [Benchmarking of LSTM Networks](http://arxiv.org/abs/1508.02774) (2015) Breuel 367 | * [Recognition of Historical Greek Polytonic Scripts Using LSTM](http://users.iit.demokritos.gr/~bgat/OldDocPro/05_paper_305.pdf) (2015) Simistira, Ul-Hasan, Papavassiliou, Basilis Gatos, Katsouros, Liwicki 368 | * [A Segmentation-Free Approach for Printed Devanagari Script Recognition](https://www.researchgate.net/publication/280777081_A_Segmentation-Free_Approach_for_Printed_Devanagari_Script_Recognition) (2015) Karayil, Ul-Hasan, Breuel 369 | * [A Sequence Learning Approach for Multiple Script Identification](https://www.researchgate.net/publication/280777013_A_Sequence_Learning_Approach_for_Multiple_Script_Identification) (2015) Ul-Hasan, Afzal, Shafait, Liwicki, Breuel 370 | 371 | #### 2016 372 | * [Important New Developments in Arabographic Optical Character Recognition (OCR)](https://arxiv.org/abs/1703.09550) (2016) Romanov, Miller, Savant, Kiessling 373 | * on [kraken](#ocr-engines)
374 | * using [OpenArabic/OCR_GS_Data](https://github.com/OpenArabic/OCR_GS_Data) for ground truth data 375 | * [OCR of historical printings with an application to building diachronic corpora: A case study using the RIDGES herbal corpus](https://arxiv.org/abs/1608.02153) (2016) Springmann, Lüdeling 376 | * [Automatic quality evaluation and (semi-) automatic improvement of mixed models for OCR on historical documents](http://arxiv.org/abs/1606.05157) (2016) Springmann, Fink, Schulz 377 | * [Generic Text Recognition using Long Short-Term Memory Networks](https://kluedo.ub.uni-kl.de/frontdoor/index/index/docId/4353) (2016) Ul-Hasan -- Ph.D Thesis 378 | * [OCRoRACT: A Sequence Learning OCR System Trained on Isolated Characters](https://www.researchgate.net/publication/294575734_OCRoRACT_A_Sequence_Learning_OCR_System_Trained_on_Isolated_Characters) (2016) Dengel, Ul-Hasan, Bukhari 379 | * [Recursive Recurrent Nets with Attention Modeling for OCR in the Wild](https://arxiv.org/abs/1603.03101) (2016) Lee, Osindero 380 | 381 | #### 2017 382 | 383 | * [Telugu OCR Framework using Deep Learning](https://arxiv.org/abs/1509.05962) (2015/2017) [Achanta](https://github.com/rakeshvar), Hastie 384 | * see also [TeluguOCR](https://github.com/TeluguOCR), [banti_telugu_ocr](https://github.com/TeluguOCR/banti_telugu_ocr), [chamanti_ocr](https://github.com/rakeshvar/chamanti_ocr), [#49](https://github.com/kba/awesome-ocr/issues/49) 385 | 386 | 387 | #### 2018 388 | 389 | * [A Two-Stage Method for Text Line Detection in Historical Documents](https://arxiv.org/abs/1802.03345) (2018) [Grüning](https://github.com/TobiasGruening), Leifert, Strauß, Labahn. Code available at https://github.com/TobiasGruening/ARU-Net 390 | --------------------------------------------------------------------------------