├── README.md
├── image_example.py
├── output.txt
├── pdf_example.py
├── sample1.jpg
└── sample2.pdf


/README.md:
--------------------------------------------------------------------------------
# tesseract-python
Examples of OCR (Optical Character Recognition) in Python using Tesseract.

## Installation:
- Install tesseract-ocr:
    - On Ubuntu
      ```
      sudo apt-get install tesseract-ocr
      ```
    - On Mac
      ```
      brew install tesseract
      ```
    - On Windows, download the installer from [here](https://github.com/UB-Mannheim/tesseract/wiki)

- Install pytesseract, the Python wrapper for Tesseract, with pip:
  ```
  pip install pytesseract
  ```

- Install Pillow, the Python image processing library, with pip:
  ```
  pip install pillow
  ```

**For working with PDF files:**
- Install ImageMagick:
    - On Ubuntu
      ```
      sudo apt-get install imagemagick
      ```
    - For other platforms, download the installer from [here](https://imagemagick.org/script/download.php)

- Install Wand, the Python binding for ImageMagick, with pip:
  ```
  pip install wand
  ```
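Once the installation steps are done, a quick check can confirm that the pieces are visible to Python. The following is a minimal sketch and not part of the original repository: the function name `check_environment` is hypothetical, and it only uses the standard library to look for a `tesseract` executable on `PATH` and for the `pytesseract`, `PIL`, and `wand` modules. It does not verify that ImageMagick's shared library can actually be loaded by Wand.

```python
import importlib.util
import shutil


def check_environment():
    """Report which OCR prerequisites appear to be available (hypothetical helper)."""
    status = {
        # Assumes the Tesseract executable is named "tesseract" and is on PATH.
        "tesseract binary": shutil.which("tesseract") is not None,
    }
    # Import names of the pip packages listed above (pillow is imported as PIL).
    for module in ("pytesseract", "PIL", "wand"):
        status[module] = importlib.util.find_spec(module) is not None
    return status


if __name__ == "__main__":
    for name, found in check_environment().items():
        print(f"{name}: {'found' if found else 'missing'}")
```

If the Tesseract executable is installed but not on `PATH` (common on Windows), pytesseract can be pointed at it by assigning the full path to `pytesseract.pytesseract.tesseract_cmd` before calling `image_to_string`.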


--------------------------------------------------------------------------------
/image_example.py:
--------------------------------------------------------------------------------
from PIL import Image
import pytesseract

# Load the sample image shipped with this repository.
im = Image.open("sample1.jpg")

# Run Tesseract OCR on the image using the English language model.
text = pytesseract.image_to_string(im, lang="eng")

print(text)
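The string returned by `image_to_string` often ends with blank lines and, with recent Tesseract versions, a form-feed page separator. The sketch below shows one way to tidy such text before storing or comparing it. `normalize_ocr_text` is a hypothetical helper introduced here for illustration; it is not part of pytesseract or of this repository, and the cleanup rules are an assumption about what is useful.

```python
def normalize_ocr_text(text):
    """Tidy raw OCR text (hypothetical helper, not part of pytesseract)."""
    # Drop form-feed page separators, then trailing whitespace on each line.
    lines = [line.rstrip() for line in text.replace("\x0c", "").splitlines()]
    # Remove blank lines at the start and end, keeping interior ones.
    while lines and not lines[0]:
        lines.pop(0)
    while lines and not lines[-1]:
        lines.pop()
    return "\n".join(lines)


if __name__ == "__main__":
    # Sample input modelled on the text in output.txt, with trailing debris added.
    raw = "The quick brown fox\njumped over the 5\nlazy dogs!\n\n\x0c"
    print(normalize_ocr_text(raw))
```

In the script above, this would be applied as `print(normalize_ocr_text(text))`.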


--------------------------------------------------------------------------------
/output.txt:
--------------------------------------------------------------------------------
The quick brown fox
jumped over the 5
lazy dogs!


--------------------------------------------------------------------------------
/pdf_example.py:
--------------------------------------------------------------------------------
import io

from PIL import Image
import pytesseract
from wand.image import Image as WandImage

# Rasterize the PDF at 300 DPI and convert it to JPEG.
pdf = WandImage(filename="sample2.pdf", resolution=300)
pdf_image = pdf.convert("jpeg")

# Collect each page as an in-memory JPEG blob.
image_blobs = []
for page in pdf_image.sequence:
    page_image = WandImage(image=page)
    image_blobs.append(page_image.make_blob("jpeg"))

# Run OCR on every page using the English language model.
recognized_text = []
for blob in image_blobs:
    im = Image.open(io.BytesIO(blob))
    text = pytesseract.image_to_string(im, lang="eng")
    recognized_text.append(text)

# Prints a list with one string per page, in page order.
print(recognized_text)
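The PDF script prints `recognized_text` as a raw Python list, one string per page. For reading or saving the result, it may be more convenient to join the pages into a single labelled document. The sketch below is one possible way to do that. `join_pages` and the `--- Page N ---` marker format are hypothetical choices made for illustration, and the strings in the demo are placeholders, not actual output from `sample2.pdf`.

```python
def join_pages(pages):
    """Combine per-page OCR strings into one labelled text (hypothetical helper)."""
    sections = []
    for number, text in enumerate(pages, start=1):
        # Strip surrounding whitespace so page markers stay evenly spaced.
        sections.append(f"--- Page {number} ---\n{text.strip()}")
    return "\n\n".join(sections)


if __name__ == "__main__":
    # Placeholder strings standing in for the recognized_text list.
    print(join_pages(["first page\n", "second page\n"]))
```

With the script above, `print(join_pages(recognized_text))` would replace the final `print` call.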


--------------------------------------------------------------------------------
/sample1.jpg:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/nikhilkumarsingh/tesseract-python/9112570be06fe61a06f889b75793bcf6f2334a78/sample1.jpg


--------------------------------------------------------------------------------
/sample2.pdf:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/nikhilkumarsingh/tesseract-python/9112570be06fe61a06f889b75793bcf6f2334a78/sample2.pdf


--------------------------------------------------------------------------------