├── tesseract-config
├── .gitignore
├── Makefile
└── README.md

/tesseract-config:
--------------------------------------------------------------------------------
tessedit_char_blacklist ﬁﬂ

--------------------------------------------------------------------------------
/.gitignore:
--------------------------------------------------------------------------------
*.pdf
temp/*
ocr/*
ocr-pages/*

--------------------------------------------------------------------------------
/Makefile:
--------------------------------------------------------------------------------
# One combined text file in ocr/ for every PDF in this directory.
OCR_OUTPUTS := $(patsubst %.pdf, ocr/%.txt, $(wildcard *.pdf))
# Resolution (dots per inch) used when rendering each page to PNG.
PNG_DPI := 600
# Language passed to tesseract with -l.
LANGUAGE := eng

all : $(OCR_OUTPUTS)

# $^ is the input PDF; $* is its name without the .pdf extension.
ocr/%.txt : %.pdf
	@mkdir -p temp
	@mkdir -p ocr
	@mkdir -p ocr-pages
	@echo "$^: bursting into a PDF for each page ..."
	@pdftk $^ burst output temp/$*.page-%08d.pdf
	@echo "$^: converting pages into images ..."
	@for pdf in temp/$*.page-*.pdf ; do \
		convert -density $(PNG_DPI) -depth 8 $$pdf $$pdf.png ; \
	done
	@echo "$^: running OCR on each page ..."
	@for png in temp/$*.page-*.png ; do \
		tesseract $$png $$png tesseract-config -l $(LANGUAGE) > /dev/null 2>&1; \
	done
	@cp temp/$*.page-*.png.txt ocr-pages/
	@cat temp/$*.page-*.png.txt > ocr/$*.txt
	@echo "$^: Finished running OCR."

# Remove the intermediate per-page PDFs, PNGs, and text files.
.PHONY : clean
clean :
	rm -rf temp/*

# Remove the OCR output as well.
.PHONY : clobber
clobber : clean
	rm -rf ocr/*
	rm -rf ocr-pages/*

--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
# Makefile for running OCR on PDFs

A common problem is to have a collection of PDFs for which one wishes to get plain text files with OCR text. This [Makefile](https://www.gnu.org/software/make/) offers a simple recipe for doing just that.
It bursts each PDF into separate files for each page (with [pdftk](https://www.pdflabs.com/tools/pdftk-server/)), creates a PNG for each page (with [ImageMagick](http://www.imagemagick.org/)), then runs [tesseract](https://github.com/tesseract-ocr/tesseract) OCR on each page, producing a single text file corresponding to the input PDF file, as well as OCR files for each page in the PDF.

This process can be slow for any individual PDF. However, it is very memory efficient, GNU Make provides parallelization for free, and it requires no point-and-click nonsense, so when running this on a batch of PDF files it can be quite efficient. And it is portable, so for very big batches I have thrown this up on a DigitalOcean droplet or Amazon EC2 instance and just let it run.

## Install dependencies

On a Mac with [Homebrew](http://brew.sh/):

```
brew install imagemagick
brew install tesseract
```

Getting PDFtk on a recent version of Mac OS X is a bit trickier. Hopefully [PDFtk server](https://www.pdflabs.com/tools/pdftk-server/) will be updated on the main website and in Homebrew soon. In the meantime, install the [version in this Stack Overflow thread](http://stackoverflow.com/questions/32505951/pdftk-server-on-os-x-10-11) if you are using El Capitan.

On Ubuntu 14.04:

```
sudo apt-get install pdftk
sudo apt-get install imagemagick
sudo apt-get install tesseract-ocr
```

## Use

Put all the PDFs that you want OCRed in the same directory as this Makefile. Run `make`. If you want to run the recipe in parallel, run `make -j 4`, where the number is how many processes you want to run at once. To get rid of the intermediate products, run `make clean`; to get rid of the OCR text and start over, run `make clobber`. Your OCR text files will be in the directory `ocr/`, and the OCR text files for each page will be in `ocr-pages/`.
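
The output file names follow from the patterns in the `Makefile`. As a rough sketch, assuming a hypothetical input called `report.pdf` (the name is only for illustration), the paths can be worked out like this:

```shell
# Hypothetical input file name, used only for illustration.
pdf="report.pdf"
stem="${pdf%.pdf}"

# Combined text for the whole PDF, as written by the `cat` step of the recipe.
combined="ocr/${stem}.txt"

# Text for page 1. The %08d pattern given to pdftk zero-pads the page number
# to 8 digits, and the .pdf.png suffixes carry over from the files in temp/.
page_one="ocr-pages/$(printf '%s.page-%08d.pdf.png.txt' "$stem" 1)"

echo "$combined"
echo "$page_one"
```

The zero-padded page numbers are also what keep the pages in order when the shell glob in the recipe concatenates them into the combined file.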

You can control various options for tesseract by editing the `tesseract-config` file. And you can control the DPI at which the page images are rendered and the OCR language by editing the `PNG_DPI` and `LANGUAGE` variables at the top of the `Makefile`.

## License

Copyright 2016 [Lincoln Mullen](http://lincolnmullen.com). [Licensed MIT](https://opensource.org/licenses/MIT).

--------------------------------------------------------------------------------