├── README.md
├── images
    ├── aurebesh.jpg
    ├── digits-task.jpg
    ├── greek-thai.png
    ├── hitchhikers-rotated.png
    └── invoice-sample.jpg
└── tesseract-tutorial.ipynb


/README.md:
--------------------------------------------------------------------------------
# OCR in Python with OpenCV, Tesseract and Pytesseract

The companion tutorial blog post can be found [here](https://nanonets.com/blog/ocr-with-tesseract/).

### Preprocessing of images using OpenCV

Basic functions for different preprocessing methods:
- grayscaling
- thresholding
- dilating
- eroding
- opening
- canny edge detection
- noise removal
- deskewing
- template matching

Different methods can come in handy with different kinds of images.

### Bounding box information using Pytesseract

While running an image through the Tesseract OCR engine, pytesseract allows you to get bounding box information
- on a character level
- on a word level
- based on a regex template

We will see how to obtain all of them.

### Page Segmentation Modes

There are several ways a page of text can be analysed. The Tesseract API provides several page segmentation modes for cases such as running OCR on only a small region or on text in a different orientation.

Tesseract supports the following page segmentation modes:

     0    Orientation and script detection (OSD) only.
     1    Automatic page segmentation with OSD.
     2    Automatic page segmentation, but no OSD, or OCR.
     3    Fully automatic page segmentation, but no OSD. (Default)
     4    Assume a single column of text of variable sizes.
     5    Assume a single uniform block of vertically aligned text.
     6    Assume a single uniform block of text.
     7    Treat the image as a single text line.
     8    Treat the image as a single word.
     9    Treat the image as a single word in a circle.
    10    Treat the image as a single character.
    11    Sparse text. Find as much text as possible in no particular order.
    12    Sparse text with OSD.
    13    Raw line. Treat the image as a single text line, bypassing hacks that are Tesseract-specific.

To change the page segmentation mode, set the `--psm` argument in your custom config string to any of the mode codes listed above.

### Playing around with the config

By making minor changes to the config, you can
- specify the language
- detect only digits
- whitelist characters
- blacklist characters
- work with multiple languages

--------------------------------------------------------------------------------
/images/aurebesh.jpg:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/NanoNets/ocr-with-tesseract/4e005bbd8408068e72c499fa2109815b32f5c156/images/aurebesh.jpg
--------------------------------------------------------------------------------
/images/digits-task.jpg:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/NanoNets/ocr-with-tesseract/4e005bbd8408068e72c499fa2109815b32f5c156/images/digits-task.jpg
--------------------------------------------------------------------------------
/images/greek-thai.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/NanoNets/ocr-with-tesseract/4e005bbd8408068e72c499fa2109815b32f5c156/images/greek-thai.png
--------------------------------------------------------------------------------
/images/hitchhikers-rotated.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/NanoNets/ocr-with-tesseract/4e005bbd8408068e72c499fa2109815b32f5c156/images/hitchhikers-rotated.png
--------------------------------------------------------------------------------
/images/invoice-sample.jpg:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/NanoNets/ocr-with-tesseract/4e005bbd8408068e72c499fa2109815b32f5c156/images/invoice-sample.jpg
--------------------------------------------------------------------------------