├── README.md
├── image_example.py
├── output.txt
├── pdf_example.py
├── sample1.jpg
└── sample2.pdf


/README.md:
--------------------------------------------------------------------------------
# tesseract-python
Examples of OCR (Optical Character Recognition) with Tesseract in Python

## Installation:
- Install tesseract-ocr using this command:
  - On Ubuntu
    ```
    sudo apt-get install tesseract-ocr
    ```
  - On Mac
    ```
    brew install tesseract
    ```
  - On Windows, download the installer from [here](https://github.com/UB-Mannheim/tesseract/wiki)


- Install the Python binding for Tesseract, pytesseract, using this pip command:
  ```
  pip install pytesseract
  ```

- Install the Python image processing library, pillow, using this pip command:
  ```
  pip install pillow
  ```

**For working with PDF files:**
- Install imagemagick using this command:
  - On Ubuntu
    ```
    sudo apt-get install imagemagick
    ```
  - For other platforms, download the installer from [here](https://imagemagick.org/script/download.php)


- Install the Python binding for imagemagick, wand, using this pip command:
  ```
  pip install wand
  ```

--------------------------------------------------------------------------------
/image_example.py:
--------------------------------------------------------------------------------
from PIL import Image
import pytesseract

# Load the sample image with Pillow
im = Image.open("sample1.jpg")

# Run Tesseract on the image using the English language data
text = pytesseract.image_to_string(im, lang='eng')

print(text)

--------------------------------------------------------------------------------
/output.txt:
--------------------------------------------------------------------------------
The quick brown fox
jumped over the 5
lazy dogs!

--------------------------------------------------------------------------------
/pdf_example.py:
--------------------------------------------------------------------------------
import io
from PIL import Image
import pytesseract
from wand.image import Image as wi

# Render the PDF at 300 DPI and convert its pages to JPEG
pdf = wi(filename="sample2.pdf", resolution=300)
pdfImage = pdf.convert('jpeg')

# Collect one JPEG blob per page
imageBlobs = []

for img in pdfImage.sequence:
    imgPage = wi(image=img)
    imageBlobs.append(imgPage.make_blob('jpeg'))

# Run OCR on each page image
recognized_text = []

for imgBlob in imageBlobs:
    im = Image.open(io.BytesIO(imgBlob))
    text = pytesseract.image_to_string(im, lang='eng')
    recognized_text.append(text)

# Prints a list with one string per page
print(recognized_text)

--------------------------------------------------------------------------------
/sample1.jpg:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/nikhilkumarsingh/tesseract-python/9112570be06fe61a06f889b75793bcf6f2334a78/sample1.jpg

--------------------------------------------------------------------------------
/sample2.pdf:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/nikhilkumarsingh/tesseract-python/9112570be06fe61a06f889b75793bcf6f2334a78/sample2.pdf

--------------------------------------------------------------------------------
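
pdf_example.py prints the recognized text as a Python list with one string per page. If a single text file similar to output.txt is wanted instead, the pages could be merged first. The following is a minimal sketch, not part of the repository: the helper names `join_pages` and `save_text`, the page label format, and the default file name `pdf_output.txt` are all assumptions made for illustration.

```python
def join_pages(pages):
    """Merge per-page OCR strings into one document (hypothetical helper)."""
    sections = []
    for number, text in enumerate(pages, start=1):
        # strip() drops surrounding whitespace, including a trailing
        # newline or form feed that OCR output may end with
        sections.append(f"--- page {number} ---\n{text.strip()}")
    return "\n\n".join(sections) + "\n"


def save_text(pages, path="pdf_output.txt"):
    """Write the merged pages to a UTF-8 text file (hypothetical helper)."""
    with open(path, "w", encoding="utf-8") as handle:
        handle.write(join_pages(pages))
```

With these helpers defined, the final `print(recognized_text)` in pdf_example.py could be swapped for `save_text(recognized_text)`. The page labels are only one possible layout; dropping them leaves a plain concatenation of the pages.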