├── .gitignore ├── .travis.yml ├── LICENSE ├── README.md ├── example-image.js ├── example-text.js ├── install-mac.sh ├── lib ├── convert.js ├── electronic.js ├── ocr.js ├── raw.js ├── searchable.js └── split.js ├── main.js ├── makefile ├── package-lock.json ├── package.json ├── share ├── configs │ └── alphanumeric ├── dia.traineddata └── eng.traineddata └── test ├── 01_command-test.js ├── 02_split-test.js ├── 03_searchable-test.js ├── 04_electronic-test.js ├── 05_convert-test.js ├── 06_ocr-test.js ├── 07_raw-test.js ├── npm-debug.log └── test_data ├── multipage_raw.pdf ├── multipage_searchable.pdf ├── single_page_raw.pdf ├── single_page_raw.tif ├── single_page_searchable.pdf └── single_page_searchable.txt /.gitignore: -------------------------------------------------------------------------------- 1 | node_modules 2 | npm-debug.log 3 | -------------------------------------------------------------------------------- /.travis.yml: -------------------------------------------------------------------------------- 1 | language: node_js 2 | 3 | before_script: 4 | - sudo apt-get install -qq ghostscript 5 | - sudo apt-get install -qq pdftk 6 | - sudo apt-get install -qq tesseract-ocr 7 | - sudo apt-get install -qq poppler-utils 8 | - sudo cp ./share/eng.traineddata /usr/share/tesseract-ocr/tessdata/ 9 | - sudo cp ./share/dia.traineddata /usr/share/tesseract-ocr/tessdata/ 10 | - sudo cp ./share/configs/alphanumeric /usr/share/tesseract-ocr/tessdata/configs/alphanumeric 11 | 12 | node_js: 13 | - "0.8" 14 | 15 | notifications: 16 | email: false 17 | -------------------------------------------------------------------------------- /LICENSE: -------------------------------------------------------------------------------- 1 | MIT License 2 | 3 | Copyright (c) 2018 Noah Isaacson 4 | 5 | Permission is hereby granted, free of charge, to any person obtaining a copy 6 | of this software and associated documentation files (the "Software"), to deal 7 | in the Software without restriction, including without limitation the rights 8 | to use, copy, modify, merge, publish, distribute, sublicense, and/or sell 9 | copies of the Software, and to permit persons to whom the Software is 10 | furnished to do so, subject to the following conditions: 11 | 12 | The above copyright notice and this permission notice shall be included in all 13 | copies or substantial portions of the Software. 14 | 15 | THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR 16 | IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, 17 | FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE 18 | AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER 19 | LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, 20 | OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE 21 | SOFTWARE. 22 | -------------------------------------------------------------------------------- /README.md: -------------------------------------------------------------------------------- 1 | # Node PDF 2 | Node PDF is a set of tools that takes in PDF files and converts them to usable formats for data processing. The library supports both extracting text from searchable pdf files as well as performing OCR on pdfs which are just scanned images of text 3 | 4 | [![Build Status](https://travis-ci.org/nisaacson/pdf-extract.png)](https://travis-ci.org/nisaacson/pdf-extract) 5 | 6 | ## Installation 7 | 8 | To begin install the module. 
9 | 10 | `npm install pdf-extract` 11 | 12 | After the library is installed, you will need the following binaries accessible on your path to process pdfs. 13 | 14 | - pdftk 15 | - pdftk splits a multi-page pdf into single pages. 16 | - pdftotext 17 | - pdftotext is used to extract text out of searchable pdf documents 18 | - ghostscript 19 | - ghostscript is an ocr preprocessor which converts pdfs to tif files for input into tesseract 20 | - tesseract 21 | - tesseract performs the actual ocr on your scanned images 22 | 23 | 24 | ### OSX 25 | To begin on OSX, first make sure you have the homebrew package manager installed. 26 | 27 | **pdftk** is not available in Homebrew. However, a GUI installer is available here: 28 | [http://www.pdflabs.com/docs/install-pdftk/](http://www.pdflabs.com/docs/install-pdftk/) 29 | 30 | **pdftotext** is included as part of the **poppler** utilities library. **poppler** can be installed via homebrew 31 | 32 | ``` bash 33 | brew install poppler 34 | ``` 35 | 36 | **ghostscript** can be installed via homebrew 37 | ``` bash 38 | brew install gs 39 | ``` 40 | 41 | **tesseract** can be installed via homebrew as well 42 | 43 | ``` bash 44 | brew install tesseract 45 | ``` 46 | 47 | After tesseract is installed, you need to install the alphanumeric config and an updated trained data file 48 | ``` bash 49 | cd <path to node_modules/pdf-extract> 50 | cp "./share/eng.traineddata" "/usr/local/Cellar/tesseract/3.02.02_3/share/tessdata/eng.traineddata" 51 | cp "./share/dia.traineddata" "/usr/local/Cellar/tesseract/3.02.02_3/share/tessdata/dia.traineddata" 52 | cp "./share/configs/alphanumeric" "/usr/local/Cellar/tesseract/3.02.02_3/share/tessdata/configs/alphanumeric" 53 | ``` 54 | 55 | ### Ubuntu 56 | **pdftk** can be installed directly via apt-get 57 | ```bash 58 | apt-get install pdftk 59 | ``` 60 | 61 | **pdftotext** is included in the **poppler-utils** library. To install poppler-utils, execute 62 | ``` bash 63 | apt-get install poppler-utils 64 | ``` 65 | 66 | **ghostscript** can be installed via apt-get 67 | ``` bash 68 | apt-get install ghostscript 69 | ``` 70 | 71 | **tesseract** can be installed via apt-get. Note that unlike the OSX install, the package is called **tesseract-ocr** on Ubuntu, not **tesseract** 72 | ``` bash 73 | apt-get install tesseract-ocr 74 | ``` 75 | 76 | For the OCR to work, you need to have the tesseract-ocr binaries available on your path. If you only need to handle ASCII characters, the accuracy of the OCR process can be increased by limiting the tesseract output. To do this, copy the *alphanumeric* config file included with this pdf-extract module into the *tessdata* folder on your system. Also, the eng.traineddata file included with the standard tesseract-ocr package is out of date. This pdf-extract module provides an up-to-date version which you should copy into the appropriate location on your system 77 | ``` bash 78 | cd <path to node_modules/pdf-extract> 79 | cp "./share/eng.traineddata" "/usr/share/tesseract-ocr/tessdata/eng.traineddata" 80 | cp "./share/configs/alphanumeric" "/usr/share/tesseract-ocr/tessdata/configs/alphanumeric" 81 | ``` 82 | 83 | 84 | ### SmartOS 85 | **pdftk** can be installed directly via apt-get 86 | ``` bash 87 | apt-get install pdftk 88 | ``` 89 | 90 | **pdftotext** is included in the **poppler-utils** library. To install poppler-utils, execute 91 | ``` bash 92 | apt-get install poppler-utils 93 | ``` 94 | 95 | **ghostscript** can be installed via pkgin. Note you may need to update the pkgin repo to include the additional sources provided by Joyent.
Check [http://www.perkin.org.uk/posts/9000-packages-for-smartos-and-illumos.html](http://www.perkin.org.uk/posts/9000-packages-for-smartos-and-illumos.html) for details 96 | ``` bash 97 | pkgin install ghostscript 98 | ``` 99 | 100 | **tesseract** must be manually downloaded and compiled. You must also install leptonica before installing tesseract. At the time of this writing, leptonica is available from [http://www.leptonica.com/download.html](http://www.leptonica.com/download.html), with the latest version tarball available from [http://www.leptonica.com/source/leptonica-1.69.tar.gz](http://www.leptonica.com/source/leptonica-1.69.tar.gz) 101 | ``` bash 102 | pkgin install autoconf 103 | wget http://www.leptonica.com/source/leptonica-1.69.tar.gz 104 | tar -xvzf leptonica-1.69.tar.gz 105 | cd leptonica-1.69 106 | ./configure 107 | make 108 | [sudo] make install 109 | ``` 110 | After installing leptonica, move on to tesseract. Tesseract is available from [https://code.google.com/p/tesseract-ocr/downloads/list](https://code.google.com/p/tesseract-ocr/downloads/list) with the latest version available from [https://code.google.com/p/tesseract-ocr/downloads/detail?name=tesseract-ocr-3.02.02.tar.gz&can=2&q=](https://code.google.com/p/tesseract-ocr/downloads/detail?name=tesseract-ocr-3.02.02.tar.gz&can=2&q=) 111 | ``` bash 112 | wget "https://code.google.com/p/tesseract-ocr/downloads/detail?name=tesseract-ocr-3.02.02.tar.gz&can=2&q=" 113 | tar -xvzf tesseract-ocr-3.02.02.tar.gz 114 | cd tesseract-ocr 115 | ./configure 116 | make 117 | [sudo] make install 118 | ``` 119 | 120 | ### Windows 121 | Important! You will have to add some variables to the PATH of your machine. You do this by right-clicking your computer in File Explorer, then selecting Properties, Advanced System Settings, Environment Variables. You can then add **the folder that contains the executables** to the path variable. 122 | 123 | **pdftk** can be installed using the PDFtk Server installer found here: https://www.pdflabs.com/tools/pdftk-server/ 124 | It should automatically add itself to the PATH; if not, the default install location is *"C:\Program Files (x86)\PDFtk Server\bin\"* 125 | 126 | **pdftotext** can be installed using the recompiled poppler utils for Windows, which have been collected and bundled here: http://manifestwebdesign.com/2013/01/09/xpdf-and-poppler-utils-on-windows/ 127 | Unpack these in a folder (for example *"C:\poppler-utils"*) and add this folder to the PATH. 128 | 129 | **ghostscript** for Windows can be found at: http://www.ghostscript.com/download/gsdnld.html 130 | Make sure you download the General Public License release and the correct version (32/64 bit). 131 | Install it, go to the installation folder (default: *"C:\Program Files\gs\gs9.19"*) and go into the **bin** folder. 132 | Rename *gswin64c* to *gs*, and add the bin folder to your PATH. 133 | 134 | **tesseract** can be built from source, but you can also download an older prebuilt version which seems to work fine. Downloads at: https://sourceforge.net/projects/tesseract-ocr-alt/files/ 135 | The version tested is *tesseract-ocr-setup-3.02.02.exe*; the default install location is *"C:\Program Files (x86)\Tesseract-OCR"* and it is also added to the PATH. 136 | Note that this only happens when you've checked the option to install for everyone on the machine. 137 | 138 | Everything should work after all this! If not, try restarting to make sure the updated PATH variables are picked up. A quick way to verify the binaries from Node is sketched below.
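The following is a minimal sanity-check script, offered here only as an illustration (the file name and structure are hypothetical; it is not part of pdf-extract). It uses the Windows `where` command to confirm that each required binary can be resolved on the PATH:

``` javascript
// check-binaries.js -- hypothetical helper, not shipped with pdf-extract
// Uses the Windows "where" command to report where each required binary resolves on the PATH.
const exec = require('child_process').exec

const binaries = ['pdftk', 'pdftotext', 'gs', 'tesseract']
binaries.forEach(function (name) {
  exec('where ' + name, function (err, stdout) {
    if (err) {
      // "where" exits non-zero when the binary is not found
      console.error(name + ' was not found on the PATH')
      return
    }
    console.log(name + ' -> ' + stdout.trim())
  })
})
```

On OSX or Linux the same check works with `which` in place of `where`, which is essentially what the module's own `npm test` command checks first.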
139 | **This setup was tested on a Windows 10 Pro N 64bit machine.** 140 | 141 | 142 | ## Usage 143 | 144 | ### OCR Extract from scanned image 145 | Extract from a pdf file which contains a scanned image and no searchable text 146 | ``` javascript 147 | const path = require("path") 148 | const pdf_extract = require('pdf-extract') 149 | 150 | console.log("Usage: node thisfile.js the/path/tothe.pdf") 151 | const absolute_path_to_pdf = path.resolve(process.argv[2]) 152 | if (absolute_path_to_pdf.includes(" ")) throw new Error("will fail for paths w spaces like "+absolute_path_to_pdf) 153 | 154 | const options = { 155 | type: 'ocr', // perform ocr to get the text within the scanned image 156 | ocr_flags: ['--psm 1'], // automatically detect page orientation 157 | } 158 | const processor = pdf_extract(absolute_path_to_pdf, options, ()=>console.log("Starting…")) 159 | processor.on('complete', data => callback(null, data)) 160 | processor.on('error', callback) 161 | function callback (error, data) { error ? console.error(error) : console.log(data.text_pages[0]) } 162 | ``` 163 | 164 | 165 | 166 | ### Text extract from searchable pdf 167 | Extract from a pdf file which contains actual searchable text 168 | ``` javascript 169 | const path = require("path") 170 | const pdf_extract = require('pdf-extract') 171 | 172 | console.log("Usage: node thisfile.js the/path/tothe.pdf") 173 | const absolute_path_to_pdf = path.resolve(process.argv[2]) 174 | if (absolute_path_to_pdf.includes(" ")) throw new Error("will fail for paths w spaces like "+absolute_path_to_pdf) 175 | 176 | const options = { 177 | type: 'text', // extract searchable text from PDF 178 | ocr_flags: ['--psm 1'], // automatically detect page orientation 179 | enc: 'UTF-8', // optional, encoding to use for the text output 180 | mode: 'layout' // optional, mode to use when reading the pdf 181 | } 182 | const processor = pdf_extract(absolute_path_to_pdf, options, ()=>console.log("Starting…")) 183 | processor.on('complete', data => callback(null, data)) 184 | processor.on('error', callback) 185 | function callback (error, data) { error ? console.error(error) : console.log(data.text_pages[0]) } 186 | ``` 187 | #### Options 188 | At a minimum you must specify the type of extraction you wish to perform 189 | 190 | **clean** 191 | When the system extracts text from a multi-page pdf, it first splits the pdf into single pages. These are written to disk before the ocr occurs. For some applications these single page files can be useful. If you need to work with the single page pdf files after the ocr is complete, set the **clean** option to **false** as shown below. Note that the single page pdf files are written to the system-appropriate temp directory, so you must copy the files to a more permanent location yourself after the ocr process completes 192 | ``` javascript 193 | var options = { 194 | type: 'ocr', // (required) perform ocr to get the text within the scanned image 195 | enc: 'UTF-8', // optional, only applies to 'text' type 196 | mode: 'layout', // optional, only applies to 'text' type. Available modes are 'layout', 'simple', 'table' or 'lineprinter'. Default is 'layout' 197 | clean: false, // keep the single page pdfs created during the ocr process 198 | ocr_flags: [ 199 | '-psm 1', // automatically detect page orientation 200 | '-l dia', // use a custom language file 201 | 'alphanumeric' // only output ascii characters 202 | ] 203 | } 204 | ``` 205 | 206 | 207 | ### Events 208 | When processing, the module will emit various events as they occur. A short example of wiring up all of the listeners is sketched after this section. 209 | 210 | **page** 211 | Emitted when a page has completed processing. The data passed with this event looks like 212 | ``` javascript 213 | var data = { 214 | hash: '<sha1 hash of the input pdf file>', 215 | text: '<extracted text for this page>', 216 | index: 2, 217 | num_pages: 4, 218 | pdf_path: "~/Downloads/input_pdf_file.pdf", 219 | single_page_pdf_path: "/tmp/temp_pdf_file2.pdf" 220 | } 221 | ``` 222 | 223 | **error** 224 | Emitted when an error occurs during processing. After this event is emitted, processing will stop. 225 | The data passed with this event looks like 226 | ``` 227 | var data = { 228 | error: 'no file exists at the path you specified', 229 | pdf_path: "~/Downloads/input_pdf_file.pdf", 230 | } 231 | ``` 232 | 233 | **complete** 234 | Emitted when all pages have completed processing and the pdf extraction is complete 235 | ``` 236 | var data = { 237 | hash: '<sha1 hash of the input pdf file>', 238 | text_pages: ['<extracted text, one string per page>'], 239 | pdf_path: "~/Downloads/input_pdf_file.pdf", 240 | single_page_pdf_file_paths: [ 241 | "/tmp/temp_pdf_file1.pdf", 242 | "/tmp/temp_pdf_file2.pdf", 243 | "/tmp/temp_pdf_file3.pdf", 244 | "/tmp/temp_pdf_file4.pdf", 245 | ] 246 | } 247 | ``` 248 | 249 | **log** 250 | To avoid spamming process.stdout, log events are emitted instead.
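Here is a hedged sketch (not taken verbatim from the repository) showing one way to attach listeners for all four events documented above; the pdf path is a placeholder and the option values mirror the examples earlier in this README:

``` javascript
const pdf_extract = require('pdf-extract')

const options = {
  type: 'ocr', // or 'text' for a searchable pdf
  clean: false // keep the single page pdfs so they can be inspected afterwards
}
// '/path/to/some.pdf' is a placeholder -- pass an absolute path to a real file
const processor = pdf_extract('/path/to/some.pdf', options, err => {
  if (err) console.error('failed to start processing', err)
})

processor.on('log', message => console.log('log:', message))
processor.on('page', data => console.log('page event, index', data.index, 'num_pages', data.num_pages))
processor.on('error', data => console.error('extraction failed:', data.error))
processor.on('complete', data => console.log('extracted', data.text_pages.length, 'pages'))
```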
251 | 252 | ## Tests 253 | To test that your system satisfies the needed dependencies and that the module is functioning correctly, execute the following commands in the pdf-extract module folder 254 | ``` 255 | cd <path to your project>/node_modules/pdf-extract 256 | npm test 257 | ``` 258 | -------------------------------------------------------------------------------- /example-image.js: -------------------------------------------------------------------------------- 1 | const path = require("path") 2 | const pdf_extract = require('./main.js') 3 | 4 | console.log("Usage: node thisfile.js the/path/tothe.pdf") 5 | const absolute_path_to_pdf = path.resolve(process.argv[2]) 6 | if (absolute_path_to_pdf.includes(" ")) throw new Error("will fail for paths w spaces like "+absolute_path_to_pdf) 7 | 8 | const options = { 9 | type: 'ocr', // perform ocr to get the text within the scanned image 10 | ocr_flags: ['--psm 1'] // automatically detect page orientation 11 | } 12 | const processor = pdf_extract(absolute_path_to_pdf, options, ()=>console.log("Starting…")) 13 | processor.on('complete', data => callback(null, data)) 14 | processor.on('error', callback) 15 | function callback (error, data) { error ?
console.error(error) : console.log(data.text_pages[0]) } 16 | -------------------------------------------------------------------------------- /example-text.js: -------------------------------------------------------------------------------- 1 | const path = require("path") 2 | const pdf_extract = require('./main.js') 3 | 4 | console.log("Usage: node thisfile.js the/path/tothe.pdf") 5 | const absolute_path_to_pdf = path.resolve(process.argv[2]) 6 | if (absolute_path_to_pdf.includes(" ")) throw new Error("will fail for paths w spaces like "+absolute_path_to_pdf) 7 | 8 | const options = { 9 | type: 'text', // extract searchable text from PDF 10 | ocr_flags: ['--psm 1'] // automatically detect page orientation 11 | } 12 | const processor = pdf_extract(absolute_path_to_pdf, options, ()=>console.log("Starting…")) 13 | processor.on('complete', data => callback(null, data)) 14 | processor.on('error', callback) 15 | function callback (error, data) { error ? console.error(error) : console.log(data.text_pages[0]) } 16 | -------------------------------------------------------------------------------- /install-mac.sh: -------------------------------------------------------------------------------- 1 | #!/bin/bash 2 | # can skip pdftk install step per https://github.com/nisaacson/pdf-extract/pull/26 3 | brew install poppler 4 | brew install gs 5 | brew install tesseract 6 | cd /usr/local/Cellar/tesseract/*/share/tessdata/ 7 | cp "$OLDPWD/node_modules/pdf-extract/share/eng.traineddata" eng.traineddata 8 | cp "$OLDPWD/node_modules/pdf-extract/share/dia.traineddata" dia.traineddata 9 | cp "$OLDPWD/node_modules/pdf-extract/share/configs/alphanumeric" configs/alphanumeric 10 | cd - 11 | -------------------------------------------------------------------------------- /lib/convert.js: -------------------------------------------------------------------------------- 1 | /** 2 | * Converts a pdf file at a given path to a tiff file with 3 | * the GraphicsMagick command "convert" 4 | */ 5 | var temp = require('temp'); 6 | var path = require('path'); 7 | var exec = require('child_process').exec 8 | var spawn = require('child_process').spawn; 9 | var fs = require('fs'); 10 | var pdf_convert_quality = 400; // default to density 400 for the convert command 11 | 12 | 13 | /** 14 | * @param input_path the path to a pdf file on disk. Since GhostScript requires random file access, we need a path 15 | * to an actual file rather than accepting a stream 16 | * @param {String} quality is an optional flag that controls the quality of the pdf to tiff conversion. 
17 | * @return {String} output_path the path to the converted tif file 18 | * @return callback(, output_path) 19 | */ 20 | exports = module.exports = function convert(input_path, quality, callback) { 21 | // options is an optional parameter 22 | if (!callback || typeof callback != "function") { 23 | callback = quality; // callback must be the second parameter 24 | quality = undefined; // no option passed 25 | } 26 | 27 | fs.exists(input_path, function (exists) { 28 | if (!exists) { return callback('error, no file exists at the path you specified: ' + input_path); } 29 | // get a temp output path 30 | 31 | var output_path = temp.path({prefix: 'tif_output', suffix:'.tif'}); 32 | // var output_path = path.join(__dirname,'test/test_data/single_page_raw.tif'); 33 | var params = [ 34 | 35 | // '-depth 8', 36 | // '-background white', 37 | // '-flatten +matte', 38 | // '-density '+pdf_convert_quality, 39 | input_path, 40 | output_path 41 | ]; 42 | if (quality) { 43 | if (typeof(quality) !== 'string' && typeof(quality) !== 'number') { 44 | return callback('error, pdf quality option must be a string, you passed a ' + typeof(quality)); 45 | } 46 | pdf_convert_quality = quality; 47 | } 48 | var cmd = 'gs -sDEVICE=tiffgray -r720x720 -g6120x7920 -sCompression=lzw -o "' + output_path + '" "'+input_path+'"'; 49 | // var cmd = 'convert -depth 8 -background white -flatten +matte -density '+pdf_convert_quality+' "'+ input_path +'" "' + output_path+'"'; 50 | var child = exec(cmd, function (err, stderr, stdout) { 51 | if (err) { 52 | return callback(err); 53 | } 54 | return callback(null, output_path); 55 | }); 56 | }); 57 | } 58 | -------------------------------------------------------------------------------- /lib/electronic.js: -------------------------------------------------------------------------------- 1 | /** 2 | * Module which extracts the text out of an electronic pdf file 3 | * This module can handle multi-page pdf files 4 | */ 5 | var fs = require('fs'); 6 | var async = require('async'); 7 | 8 | var rimraf = require('rimraf'); 9 | var util = require('util'); 10 | var events = require('events'); 11 | 12 | var split = require('./split.js'); 13 | var searchable = require('./searchable.js'); 14 | var pathhash = require('pathhash'); 15 | 16 | 17 | function Electronic(){ 18 | if(false === (this instanceof Electronic)) { 19 | return new Electronic(); 20 | } 21 | } 22 | util.inherits(Electronic, events.EventEmitter); 23 | module.exports = Electronic; 24 | 25 | 26 | /** 27 | * @param pdf_path path to the pdf file on disk 28 | * 29 | * @return {Array} text_pages an array of the extracted text where 30 | * each entry is the text for the page at the given index 31 | * @return callback(, text_pages) 32 | */ 33 | Electronic.prototype.process = function(pdf_path, options) { 34 | var self = this; 35 | var text_pages = []; 36 | var split_output; 37 | var single_page_pdf_file_paths = []; 38 | fs.exists(pdf_path, function (exists) { 39 | var err; 40 | if (!exists) { 41 | err = 'no file exists at the path you specified: ' + pdf_path 42 | self.emit('error', { error: err, pdf_path: pdf_path}); 43 | return 44 | } 45 | pathhash(pdf_path, function (err, hash) { 46 | if (err) { 47 | err = 'error hashing file at the path you specified: ' + pdf_path + '. 
' + err; 48 | self.emit('error', { error: err, pdf_path: pdf_path}); 49 | return 50 | } 51 | // split the pdf into single page pdf files 52 | split(pdf_path, options.pdf_password, function (err, output) { 53 | if (err) { 54 | self.emit('error', { error: err, pdf_path: pdf_path}); 55 | return 56 | } 57 | 58 | 59 | if (!output) { 60 | err = 'failed to split pdf file into distinct pages'; 61 | self.emit('error', { error: err, pdf_path: pdf_path}); 62 | return 63 | } 64 | split_output = output; 65 | if (!split_output.hasOwnProperty('files') || split_output.files.length == 0) { 66 | err = 'no pages where found in your pdf document'; 67 | self.emit('error', { error: err, pdf_path: pdf_path}); 68 | return 69 | } 70 | self.emit('log', 'finished splitting pages for file at path ' + pdf_path); 71 | var files = split_output.files; 72 | var index = 0; 73 | async.forEachSeries( 74 | files, 75 | // extract the text for each page 76 | function (file, cb) { 77 | index++; 78 | searchable(file.file_path, options, function (err, extract) { 79 | if(err){ 80 | self.emit('error', { error: err, pdf_path: pdf_path}); 81 | return; 82 | } 83 | text_pages.push(extract); 84 | var file_path = file.file_path 85 | single_page_pdf_file_paths.push(file.file_path); 86 | self.emit('page', { hash: hash, text: extract, index: index, pdf_path: pdf_path}); 87 | cb(); 88 | }); 89 | }, 90 | function (err) { 91 | if (!err) { 92 | self.emit('complete', { hash: hash, text_pages: text_pages, pdf_path: pdf_path, single_page_pdf_file_paths: single_page_pdf_file_paths}); 93 | return; 94 | } 95 | self.emit('error', { error: err, pdf_path: pdf_path}); 96 | if (!split_output || ! split_output.folder) { return } 97 | fs.exists(split_output.folder, function (exists) { 98 | if (!exists) { return } 99 | var remove_cb = function() {} 100 | rimraf(split_output.folder, remove_cb); 101 | }); 102 | } 103 | ); 104 | }); 105 | }); 106 | }); 107 | } 108 | -------------------------------------------------------------------------------- /lib/ocr.js: -------------------------------------------------------------------------------- 1 | /** 2 | * Module which extracts text from electronic searchable pdf files. 
3 | * Requires the "pdftotext" binary be installed on the system and accessible in the 4 | * current path 5 | */ 6 | var temp = require('temp'); 7 | var path = require('path'); 8 | var exec = require('child_process').exec; 9 | var fs = require('fs'); 10 | 11 | /** 12 | * @param tif_path path to the single page file on disk containing a scanned image of text 13 | * @param {Array} options is an optional list of flags to pass to the tesseract command 14 | * @return {String} extract the extracted ocr text output 15 | * @return callback(, stdout) 16 | */ 17 | module.exports = function(input_path, options, callback) { 18 | // options is an optional parameter 19 | if (!callback || typeof callback != "function") { 20 | // callback must be the second parameter 21 | callback = options; 22 | options = []; 23 | } 24 | fs.exists(input_path, function (exists) { 25 | if (!exists) { return callback('error, no file exists at the path you specified: ' + input_path); } 26 | // get a temp output path 27 | var output_path = temp.path({prefix: 'ocr_output'}); 28 | // output_path = path.join(__dirname,'test/test_data/single_page_raw'); 29 | var procoptions = { maxBuffer: 4096 * 4096 }; 30 | var cmd = 'tesseract "'+input_path+'" "'+output_path+'" '+options.join(' '); 31 | var child = exec(cmd, procoptions, function (err, stdout, stderr) { 32 | if (err) { return callback(err); } 33 | // tesseract automatically appends ".txt" to the output file name 34 | var text_output_path = output_path+'.txt'; 35 | // inspect(text_output_path, 'text output path'); 36 | fs.readFile(text_output_path, 'utf8', function(err, output) { 37 | // inspect(output, 'ocr output'); 38 | if (err) { return callback(err); } 39 | // cleanup after ourselves 40 | fs.unlink(text_output_path, function (err) { 41 | if (err) { return callback(err); } 42 | callback(null, output); 43 | }); 44 | }); 45 | }); 46 | }); 47 | } 48 | -------------------------------------------------------------------------------- /lib/raw.js: -------------------------------------------------------------------------------- 1 | /** 2 | * Module which extracts the text out of an electronic pdf file 3 | * This module can handle multi-page pdf files 4 | 5 | */ 6 | var util = require('util'); 7 | var events = require('events'); 8 | var fs = require('fs'); 9 | var async = require('async'); 10 | var split = require('./split.js'); 11 | var convert = require('./convert.js'); 12 | var pathHash = require('pathhash'); 13 | var ocr = require('./ocr.js'); 14 | var rimraf = require('rimraf'); 15 | 16 | 17 | function Raw(){ 18 | if(false === (this instanceof Raw)) { 19 | return new Raw(); 20 | } 21 | } 22 | util.inherits(Raw, events.EventEmitter); 23 | module.exports = Raw; 24 | 25 | 26 | /** 27 | * @param {String} pdf_path path to the pdf file on disk 28 | * @param {Boolean} params.clean true to remove the temporary single-page pdf 29 | * files from disk. Sometimes however you might want to be able to use those 30 | * single page pdfs after the ocr completes. 
In this case pass clean = false 31 | * 32 | * @return {Array} text_pages an array of the extracted text where 33 | * each entry is the text for the page at the given index 34 | * @return callback(, text_pages) 35 | */ 36 | Raw.prototype.process = function(pdf_path, options) { 37 | var self = this; 38 | var text_pages = []; 39 | var split_output; 40 | if (!options) { 41 | options = {}; 42 | } 43 | // default to removing the single page pdfs after ocr completes 44 | if (!options.hasOwnProperty('clean')) { 45 | options.clean = true; 46 | } 47 | fs.exists(pdf_path, function (exists) { 48 | if (!exists) { 49 | var err = 'no file exists at the path you specified: ' + pdf_path 50 | self.emit('error', { error: err, pdf_path: pdf_path}); 51 | return 52 | } 53 | pathHash(pdf_path, function (err, hash) { 54 | if (err) { 55 | err = 'error hashing file at the path you specified: ' + pdf_path + '. ' + err; 56 | self.emit('error', { error: err, pdf_path: pdf_path}); 57 | return; 58 | } 59 | split(pdf_path, function (err, output) { 60 | if (err) { 61 | self.emit('error', { error: err, pdf_path: pdf_path}); 62 | return 63 | } 64 | if (!output) { 65 | err = 'no files returned from split'; 66 | self.emit('error', { error: err, pdf_path: pdf_path}); 67 | return; 68 | } 69 | self.emit('log', 'finished splitting pages for file at path ' + pdf_path); 70 | split_output = output; 71 | var pdf_files = output.files; 72 | if (!pdf_files || pdf_files.length == 0) { 73 | err = 'error, no pages where found in your pdf document'; 74 | self.emit('error', { error: err, pdf_path: pdf_path}); 75 | return; 76 | } 77 | var index = 0; 78 | var num_pages = pdf_files.length 79 | var single_page_pdf_file_paths = []; 80 | async.forEachSeries( 81 | pdf_files, 82 | // extract the text for each page via ocr 83 | function (pdf_file, cb) { 84 | var quality = 300; 85 | if (options.hasOwnProperty('quality') && options.quality) { 86 | quality = options.quality; 87 | } 88 | convert(pdf_file.file_path, quality, function (err, tif_path) { 89 | var zeroBasedNumPages = num_pages-1; 90 | self.emit('log', 'converted page to intermediate tiff file, page '+ index+ ' (0-based indexing) of '+ zeroBasedNumPages); 91 | if (err) { return cb(err); } 92 | var ocr_flags = [ 93 | '-psm 6' 94 | ]; 95 | if (options.ocr_flags) { 96 | ocr_flags = options.ocr_flags; 97 | } 98 | ocr(tif_path, ocr_flags, function (err, extract) { 99 | fs.unlink(tif_path, function (tif_cleanup_err, reply) { 100 | if (tif_cleanup_err) { 101 | err += ', error removing temporary tif file: "'+tif_cleanup_err+'"'; 102 | } 103 | if (err) { return cb(err); } 104 | var page_number = index+1 105 | self.emit('log', 'raw ocr: page ' + index + ' (0-based indexing) of ' +zeroBasedNumPages + ' complete'); 106 | single_page_pdf_file_paths.push(pdf_file.file_path); 107 | self.emit('page', { hash: hash, text: extract, index: index, num_pages: num_pages, pdf_path: pdf_path, single_page_pdf_path: pdf_file.file_path}); 108 | text_pages.push(extract); 109 | index++; 110 | cb(); 111 | }); 112 | }); 113 | }); 114 | }, function (err) { 115 | if (err) { 116 | self.emit('error', err); 117 | return; 118 | } 119 | self.emit('complete', { hash: hash, text_pages: text_pages, pdf_path: pdf_path, single_page_pdf_file_paths: single_page_pdf_file_paths}); 120 | }); 121 | }); 122 | }); 123 | }); 124 | } 125 | -------------------------------------------------------------------------------- /lib/searchable.js: -------------------------------------------------------------------------------- 1 | /** 2 | * Module which 
extracts text from electronic searchable pdf files. 3 | * Requires the "pdftotext" binary be installed on the system and accessible in the 4 | * current path 5 | */ 6 | var path = require('path'); 7 | var temp = require('temp'); 8 | var exec = require('child_process').exec; 9 | var spawn = require('child_process').spawn; 10 | var fs = require('fs'); 11 | var walk = require('walk'); 12 | var async = require('async'); 13 | var rimraf = require('rimraf'); 14 | 15 | /** 16 | * @param pdf_path path to the single page searchable pdf file on disk 17 | * This function buffers all the output from stdout and sends it back as a string. 18 | * Since we only handle single pages of pdf text here the amount of text is small 19 | * and therefore we don't need to use a stream 20 | * 21 | * @return {ReadStream} the entire output from stdout 22 | * @return callback(, stdout) 23 | */ 24 | module.exports = function(pdf_path, options, callback) { 25 | if(options===undefined)options={}; 26 | if(options.mode===undefined)options.mode='layout'; 27 | confirm_file_exists(pdf_path, function (err) { 28 | if (err) { return callback(err); } 29 | var child = spawn('pdftotext', (spawnOptions(options)).concat([pdf_path, '-'])); 30 | var stdout = child.stdout; 31 | var stderr = child.stderr; 32 | var output = ''; 33 | stdout.setEncoding('utf8'); 34 | stderr.setEncoding('utf8'); 35 | stderr.on('data', function(data) { 36 | return callback(data, null); 37 | }); 38 | // buffer the stdout output 39 | stdout.on('data', function(data) { 40 | output += data; 41 | }); 42 | stdout.on('close', function(data) { 43 | return callback(null, output); 44 | }); 45 | }); 46 | } 47 | 48 | function spawnOptions(options) { 49 | var result = []; 50 | result.push('-' + options.mode); 51 | if (options.enc) { 52 | result.push('-enc'); 53 | result.push(options.enc); 54 | } 55 | return result; 56 | } 57 | 58 | /** 59 | * Non-recursive find of all the files in a given directory that end with *.pdf 60 | * @return {Array} files is an array of the absolute paths to the single 61 | * page pdf files. 
Each entry in this array is an object with fields 62 | * and set 63 | * @return callback(, files) 64 | */ 65 | function get_pdfs_in_directory(directory_path, callback) { 66 | var file_paths = []; 67 | var files = null; 68 | var walker = walk.walk(directory_path, { followLinks: false}); 69 | walker.on('file', function(root, stat, next) { 70 | if (stat.name.match(/\.pdf$/i)) { 71 | var file_path = path.join(directory_path, stat.name); 72 | file_paths.push({file_path: file_path, file_name: stat.name}); 73 | next(); 74 | } 75 | }); 76 | walker.on('end', function() { 77 | return callback(null, file_paths); 78 | }); 79 | } 80 | 81 | /** 82 | * Cleanup any single page pdfs on error 83 | */ 84 | function cleanup_directory(directory_path, callback) { 85 | // only remove the folder at directory_path if it exists 86 | fs.exists(directory_path, function (exists) { 87 | if (!exists) { 88 | return callback(); 89 | } 90 | rimraf(directory_path, callback); 91 | }); 92 | } 93 | 94 | /** 95 | * @param {String} file_path absolute path to file on disk 96 | * @return {Function} callback() if file does exist 97 | * callback() if file does not exists 98 | */ 99 | function confirm_file_exists(file_path, callback) { 100 | fs.exists(file_path, function (exists) { 101 | if (!exists) { 102 | return callback('no file at path: ' + file_path); 103 | } 104 | return callback(); 105 | }); 106 | }; -------------------------------------------------------------------------------- /lib/split.js: -------------------------------------------------------------------------------- 1 | /** 2 | * Module which splits multi-pag pdfs into single pages 3 | * Requires the pdftk binary be installed on the system and accessible in the 4 | * current path 5 | */ 6 | var path = require('path'); 7 | var temp = require('temp'); 8 | var exec = require('child_process').exec; 9 | var fs = require('fs'); 10 | var walk = require('walk'); 11 | var async = require('async'); 12 | var rimraf = require('rimraf'); 13 | 14 | /** 15 | * @param pdf_path path to the pdf file on disk 16 | * 17 | * @see get_pdfs_in_directory 18 | * @return {Object} an object with the fields "folder" and "files" set 19 | * files is an array of the absolute paths to the single page pdf files. 
20 | * 21 | * Each entry in this array is an object with fields 22 | * and set 23 | * 24 | * @return callback(, output_paths) 25 | */ 26 | module.exports = function(pdf_path, pdf_password, callback) { 27 | if(!callback || typeof pdf_password === 'function'){ 28 | callback = pdf_password; 29 | } 30 | confirm_file_exists(pdf_path, function (err) { 31 | if (err) { return callback(err); } 32 | 33 | var output_dir = temp.path({},'pdf_pages'); 34 | fs.mkdir(output_dir, function(err) { 35 | if (err) { return callback(err, null); } 36 | // name the files with the upload id and a digit string 37 | // example: "507c3e55c786e2aa6f000005-page00001.pdf" 38 | var output_name = 'page%05d.pdf'; 39 | var output_path = path.join(output_dir, output_name); 40 | var cmd; 41 | if (typeof pdf_password === 'string') { 42 | cmd = 'pdftk "' + pdf_path + '" input_pw "' + pdf_password + '" burst output "' + output_path; 43 | } 44 | else { 45 | cmd = 'pdfseparate ' + pdf_path + ' ' + output_path; 46 | } 47 | var child = exec(cmd, function (err, stdout, stderr) { 48 | if (err) { 49 | var output_err = { 50 | message: 'an error occurred while splitting pdf into single pages with the pdftk burst command', 51 | error: err 52 | } 53 | callback(output_err, null); 54 | return; 55 | } 56 | remove_doc_data(function (err, reply) { 57 | if (err) { return callback(err); } 58 | return get_pdfs_in_directory(output_dir, callback); 59 | }); 60 | }); 61 | }); 62 | }); 63 | } 64 | 65 | 66 | 67 | /** 68 | * Non-recursive find of all the files in a given directory that end with *.pdf 69 | * @return {Object} output an object with the fields "folder" and "files" set 70 | * files is an array of the absolute paths to the single page pdf files. 71 | * 72 | * Each entry in this array is an object with fields 73 | * and set 74 | * 75 | * @return callback(, output) 76 | */ 77 | function get_pdfs_in_directory(directory_path, callback) { 78 | var file_paths = []; 79 | var files = null; 80 | var walker = walk.walk(directory_path, { followLinks: false}); 81 | walker.on('file', function(root, stat, next) { 82 | if (stat.name.match(/\.pdf$/i)) { 83 | var file_path = path.join(directory_path, stat.name); 84 | file_paths.push({file_path: file_path, file_name: stat.name}); 85 | } 86 | next(); 87 | }); 88 | 89 | 90 | walker.on('end', function() { 91 | file_paths.sort(function (a,b) { 92 | if (a.file_name < b.file_name) { 93 | return -1; 94 | } 95 | if (a.file_name == b.file_name) { 96 | return 0; 97 | } 98 | return 1; 99 | }); 100 | var output = { 101 | folder: directory_path, 102 | files: file_paths 103 | } 104 | return callback(null, output); 105 | }); 106 | } 107 | 108 | 109 | /** 110 | * @param {String} file_path absolute path to file on disk 111 | * @return {Function} callback() if file does exist 112 | * callback() if file does not exists 113 | */ 114 | function confirm_file_exists(file_path, callback) { 115 | fs.exists(file_path, function (exists) { 116 | if (!exists) { 117 | return callback('no file at path: ' + file_path); 118 | } 119 | return callback(); 120 | }); 121 | }; 122 | 123 | /** 124 | * pdftk creates a file called doc_data.txt during the burst split process. 
125 | * This file is not needed so remove it now 126 | */ 127 | function remove_doc_data(callback) { 128 | var folder = path.join(__dirname, '..'); 129 | var doc_data_path = path.join(folder, 'doc_data.txt'); 130 | fs.exists(doc_data_path, function (exists) { 131 | if (!exists) { 132 | return callback(); 133 | } 134 | fs.unlink(doc_data_path, callback); 135 | }); 136 | } 137 | -------------------------------------------------------------------------------- /main.js: -------------------------------------------------------------------------------- 1 | /** 2 | * @title Node PDF main.js 3 | * Node PDF allows you to convert pdf files into raw text. The library supports 4 | * text extraction from electronic searchable pdfs. 5 | * 6 | * In addition, the library supports OCR text extract from pdfs which just 7 | * contain scanned images via the tesseract-ocr engine 8 | * 9 | * Multi-page pdfs are supported for both searchable and image pdfs. 10 | * The library returns an array of strings where the string at a given 11 | * index in the output array cooresponds the page in the input pdf document 12 | * 13 | * @author Noah Isaacson 14 | * @date 2012-10-26 15 | */ 16 | var path = require('path'); 17 | var temp = require('temp'); 18 | var exec = require('child_process').exec; 19 | var fs = require('fs'); 20 | var walk = require('walk'); 21 | var async = require('async'); 22 | var rimraf = require('rimraf'); 23 | 24 | var Raw = require('./lib/raw'); 25 | var Electronic = require('./lib/electronic'); 26 | 27 | /** 28 | * To process a pdf, pass in the absolute path to the pdf file on disk 29 | 30 | * @param {Object} params should have the following fields set 31 | * @param {String} params.pdf_path the absolute path to the pdf file on disk 32 | * @param {Boolean} params.clean true if you want the temporary single page pdfs 33 | * @param {Boolean} options.type must be either "ocr" or "text" 34 | * 35 | * @return {Array} text_pages is an array of strings, where each string is the 36 | * extracted text for the matching page index in the pdf document 37 | * @return {Processor} a processor object which will emit events as they occur 38 | */ 39 | module.exports = function(pdf_path, options, cb) { 40 | var err; 41 | var processor = new Raw(); 42 | if (!'pdf_path') { 43 | err = 'you must supply a pdf path as the first parameter' 44 | return cb(err); 45 | } 46 | if (!options) { 47 | err = 'no options supplied. You must supply an options object with the "type" field set' 48 | return cb(err); 49 | } 50 | if (!options.hasOwnProperty('type') || ! options.type) { 51 | err ='error, you must specify the type of extraction you wish to perform in the options object. Allowed values are "ocr" or "text"'; 52 | return cb(err); 53 | } 54 | if (options.type === 'ocr') { 55 | processor = new Raw(); 56 | } 57 | else if (options.type === 'text') { 58 | processor = new Electronic(); 59 | } 60 | else { 61 | err ='error, you must specify the type of extraction you wish to perform in the options object. 
Allowed values are "ocr" or "text"'; 62 | return cb(err);; 63 | } 64 | fs.exists(pdf_path, function (exists) { 65 | if (!exists) { 66 | err = 'no file exists at the path you specified'; 67 | return cb(err); 68 | } 69 | processor.process(pdf_path, options); 70 | cb(); 71 | }); 72 | return processor; 73 | } 74 | -------------------------------------------------------------------------------- /makefile: -------------------------------------------------------------------------------- 1 | test: 2 | mocha --reporter spec 3 | electronic: 4 | mocha $(shell find test -name "*electronic-test.js") --test --reporter spec 5 | searchable: 6 | mocha $(shell find test -name "*searchable-test.js") --test --reporter spec 7 | 8 | ocr: 9 | mocha $(shell find test -name "*ocr-test.js") --test --reporter spec 10 | 11 | .PHONY: test 12 | -------------------------------------------------------------------------------- /package-lock.json: -------------------------------------------------------------------------------- 1 | { 2 | "name": "pdf-extract", 3 | "version": "1.0.12", 4 | "lockfileVersion": 1, 5 | "requires": true, 6 | "dependencies": { 7 | "async": { 8 | "version": "0.1.22", 9 | "resolved": "https://registry.npmjs.org/async/-/async-0.1.22.tgz", 10 | "integrity": "sha1-D8GqoIig4+8Ovi2IMbqw3PiEUGE=" 11 | }, 12 | "commander": { 13 | "version": "0.6.1", 14 | "resolved": "https://registry.npmjs.org/commander/-/commander-0.6.1.tgz", 15 | "integrity": "sha1-+mihT2qUXVTbvlDYzbMyDp47GgY=", 16 | "dev": true 17 | }, 18 | "debug": { 19 | "version": "4.1.1", 20 | "resolved": "https://registry.npmjs.org/debug/-/debug-4.1.1.tgz", 21 | "integrity": "sha512-pYAIzeRo8J6KPEaJ0VWOh5Pzkbw/RetuzehGM7QRRX5he4fPHx2rdKMB256ehJCkX+XRQm16eZLqLNS8RSZXZw==", 22 | "dev": true, 23 | "requires": { 24 | "ms": "^2.1.1" 25 | }, 26 | "dependencies": { 27 | "ms": { 28 | "version": "2.1.2", 29 | "resolved": "https://registry.npmjs.org/ms/-/ms-2.1.2.tgz", 30 | "integrity": "sha512-sGkPx+VjMtmA6MX27oA4FBFELFCZZ4S4XqeGOXCv68tT+jb3vk/RyaKWP0PTKyWtmLSM0b+adUTEvbs1PEaH2w==", 31 | "dev": true 32 | } 33 | } 34 | }, 35 | "diff": { 36 | "version": "1.0.2", 37 | "resolved": "https://registry.npmjs.org/diff/-/diff-1.0.2.tgz", 38 | "integrity": "sha1-Suc/Gu6Nb89ITxoc53zmUdm38Mk=", 39 | "dev": true 40 | }, 41 | "eyespect": { 42 | "version": "0.1.10", 43 | "resolved": "https://registry.npmjs.org/eyespect/-/eyespect-0.1.10.tgz", 44 | "integrity": "sha1-ma6EajAvzK95Dj7LRPR7XzsXaqA=" 45 | }, 46 | "forEachAsync": { 47 | "version": "2.2.1", 48 | "resolved": "https://registry.npmjs.org/forEachAsync/-/forEachAsync-2.2.1.tgz", 49 | "integrity": "sha1-43I/AJA5EOHrSx2zrVG1xkoxn+w=", 50 | "requires": { 51 | "sequence": "2.x" 52 | } 53 | }, 54 | "graceful-fs": { 55 | "version": "1.1.14", 56 | "resolved": "https://registry.npmjs.org/graceful-fs/-/graceful-fs-1.1.14.tgz", 57 | "integrity": "sha1-BweNtfY3f2Mh/Oqu30l94STclGU=", 58 | "optional": true 59 | }, 60 | "growl": { 61 | "version": "1.7.0", 62 | "resolved": "https://registry.npmjs.org/growl/-/growl-1.7.0.tgz", 63 | "integrity": "sha1-3i1mE20ALhErpw8/EMMc98NQsto=", 64 | "dev": true 65 | }, 66 | "jade": { 67 | "version": "0.26.3", 68 | "resolved": "https://registry.npmjs.org/jade/-/jade-0.26.3.tgz", 69 | "integrity": "sha1-jxDXl32NefL2/4YqgbBRPMslaGw=", 70 | "dev": true, 71 | "requires": { 72 | "commander": "0.6.1", 73 | "mkdirp": "0.3.0" 74 | }, 75 | "dependencies": { 76 | "mkdirp": { 77 | "version": "0.3.0", 78 | "resolved": "https://registry.npmjs.org/mkdirp/-/mkdirp-0.3.0.tgz", 79 | "integrity": 
"sha1-G79asbqCevI1dRQ0kEJkVfSB/h4=", 80 | "dev": true 81 | } 82 | } 83 | }, 84 | "mkdirp": { 85 | "version": "0.3.3", 86 | "resolved": "https://registry.npmjs.org/mkdirp/-/mkdirp-0.3.3.tgz", 87 | "integrity": "sha1-WV4lHBNww6aLqyE20ONIuBBa3xM=", 88 | "dev": true 89 | }, 90 | "mocha": { 91 | "version": "1.8.2", 92 | "resolved": "https://registry.npmjs.org/mocha/-/mocha-1.8.2.tgz", 93 | "integrity": "sha1-+2sdB9mPLrpBhUaETDvgcx9145A=", 94 | "dev": true, 95 | "requires": { 96 | "commander": "0.6.1", 97 | "debug": "*", 98 | "diff": "1.0.2", 99 | "growl": "1.7.x", 100 | "jade": "0.26.3", 101 | "mkdirp": "0.3.3", 102 | "ms": "0.3.0" 103 | } 104 | }, 105 | "ms": { 106 | "version": "0.3.0", 107 | "resolved": "https://registry.npmjs.org/ms/-/ms-0.3.0.tgz", 108 | "integrity": "sha1-A+3DSNYT5mpWSGz9rFO8vomcvWE=", 109 | "dev": true 110 | }, 111 | "os-tmpdir": { 112 | "version": "1.0.2", 113 | "resolved": "https://registry.npmjs.org/os-tmpdir/-/os-tmpdir-1.0.2.tgz", 114 | "integrity": "sha1-u+Z0BseaqFxc/sdm/lc0VV36EnQ=" 115 | }, 116 | "pathhash": { 117 | "version": "1.0.0", 118 | "resolved": "https://registry.npmjs.org/pathhash/-/pathhash-1.0.0.tgz", 119 | "integrity": "sha1-904ZwRtXJNLOOQa/G8lnIJ7Ytyc=" 120 | }, 121 | "rimraf": { 122 | "version": "2.0.3", 123 | "resolved": "https://registry.npmjs.org/rimraf/-/rimraf-2.0.3.tgz", 124 | "integrity": "sha1-9QopZecUTpr9mYmC8V33BnMPVqk=", 125 | "requires": { 126 | "graceful-fs": "~1.1" 127 | } 128 | }, 129 | "sequence": { 130 | "version": "2.2.1", 131 | "resolved": "https://registry.npmjs.org/sequence/-/sequence-2.2.1.tgz", 132 | "integrity": "sha1-f1YXiV1ENRwKBH52RGdpBJChawM=" 133 | }, 134 | "should": { 135 | "version": "1.2.2", 136 | "resolved": "https://registry.npmjs.org/should/-/should-1.2.2.tgz", 137 | "integrity": "sha1-DwP3dQZtnqJjJpDJF7EoJPzB1YI=", 138 | "dev": true 139 | }, 140 | "temp": { 141 | "version": "0.8.3", 142 | "resolved": "https://registry.npmjs.org/temp/-/temp-0.8.3.tgz", 143 | "integrity": "sha1-4Ma8TSa5AxJEEOT+2BEDAU38H1k=", 144 | "requires": { 145 | "os-tmpdir": "^1.0.0", 146 | "rimraf": "~2.2.6" 147 | }, 148 | "dependencies": { 149 | "rimraf": { 150 | "version": "2.2.8", 151 | "resolved": "https://registry.npmjs.org/rimraf/-/rimraf-2.2.8.tgz", 152 | "integrity": "sha1-5Dm+Kq7jJzIZUnMPmaiSnk/FBYI=" 153 | } 154 | } 155 | }, 156 | "walk": { 157 | "version": "2.2.1", 158 | "resolved": "https://registry.npmjs.org/walk/-/walk-2.2.1.tgz", 159 | "integrity": "sha1-WtofjknkfUt0Rdi+ei4eYxq0MBY=", 160 | "requires": { 161 | "forEachAsync": "~2.2" 162 | } 163 | } 164 | } 165 | } 166 | -------------------------------------------------------------------------------- /package.json: -------------------------------------------------------------------------------- 1 | { 2 | "name": "pdf-extract", 3 | "engines": "node", 4 | "version": "1.0.12", 5 | "private": false, 6 | "scripts": { 7 | "test": "node_modules/.bin/mocha --reporter spec" 8 | }, 9 | "repository": { 10 | "type": "git", 11 | "url": "https://github.com/nisaacson/pdf-extract.git" 12 | }, 13 | "main": "main.js", 14 | "folders": "lib", 15 | "dependencies": { 16 | "eyespect": "~0.1.8", 17 | "async": "~0.1.22", 18 | "temp": "~0.8.3", 19 | "walk": "~2.2.1", 20 | "rimraf": "~2.0.2", 21 | "pathhash": "~1.0.0" 22 | }, 23 | "devDependencies": { 24 | "should": "~1.2.1", 25 | "mocha": "~1.8.1" 26 | } 27 | } 28 | -------------------------------------------------------------------------------- /share/configs/alphanumeric: -------------------------------------------------------------------------------- 1 
| tessedit_char_whitelist !"#$%&'()*+,-./0123456789:;<=>?@ABCDEFGHIJKLMNOPQRSTUVWXYZ[\]^_`abcdefghijklmnopqrstuvwxyz{|}~ 2 | -------------------------------------------------------------------------------- /share/dia.traineddata: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/nisaacson/pdf-extract/d334b3a497de830ca825f1fdd94e9588e890563f/share/dia.traineddata -------------------------------------------------------------------------------- /share/eng.traineddata: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/nisaacson/pdf-extract/d334b3a497de830ca825f1fdd94e9588e890563f/share/eng.traineddata -------------------------------------------------------------------------------- /test/01_command-test.js: -------------------------------------------------------------------------------- 1 | var inspect = require('eyespect').inspector({maxLength:20000}); 2 | var should = require('should'); 3 | var async = require('async'); 4 | var exec = require('child_process').exec; 5 | 6 | describe('01 Command Test', function() { 7 | it('should have ghostscript (gs) binary on path', function(done) { 8 | var cmd = 'which gs'; 9 | var child = exec(cmd, function (err, stdout, stderr) { 10 | should.not.exist(err, 'ghostscript not available. You will not be able to perform ocr and extract text from pdfs with scanned image. To get convert install GhostScript on your system'); 11 | stderr.length.should.equal(0); 12 | should.exist(stdout); 13 | stdout.length.should.be.above(8); 14 | done(); 15 | }); 16 | }); 17 | it('should have pdftotext binary on path', function(done) { 18 | var cmd = 'which pdftotext'; 19 | var child = exec(cmd, function (err, stdout, stderr) { 20 | should.not.exist(err, 'pdftotext not available. You will not be able to extract text from electronic searchable pdf files without the pdftotext library installed on your system'); 21 | stderr.length.should.equal(0); 22 | should.exist(stdout); 23 | stdout.length.should.be.above(8); 24 | done(); 25 | }); 26 | }); 27 | 28 | it('should have tesseract binary on path', function(done) { 29 | var cmd = 'which tesseract'; 30 | var child = exec(cmd, function (err, stdout, stderr) { 31 | should.not.exist(err, 'tesseract not available. 
You will not be able to perform ocr and extract from pdfs with scanned images.'); 32 | stderr.length.should.equal(0); 33 | should.exist(stdout); 34 | stdout.length.should.be.above(8); 35 | done(); 36 | }); 37 | }); 38 | 39 | }); -------------------------------------------------------------------------------- /test/02_split-test.js: -------------------------------------------------------------------------------- 1 | var inspect = require('eyespect').inspector({maxLength:20000}); 2 | var path = require('path'); 3 | var should = require('should'); 4 | var assert = require('assert') 5 | var fs = require('fs'); 6 | var async = require('async'); 7 | var split = require('../lib/split.js'); 8 | 9 | describe('02 Split Test', function() { 10 | it('should split multi-page pdf in single page pdf files', function(done) { 11 | this.timeout(10*1000); 12 | this.slow(2*1000); 13 | var file_name = 'multipage_searchable.pdf'; 14 | var relative_path = path.join('test_data',file_name); 15 | var pdf_path = path.join(__dirname, relative_path); 16 | split(pdf_path, function (err, output) { 17 | should.not.exist(err); 18 | should.exist(output); 19 | output.should.have.property('folder'); 20 | output.should.have.property('files'); 21 | var files = output.files; 22 | files.length.should.equal(8, 'wrong number of pages after splitting searchable pdf with name: ' + file_name); 23 | // make sure each file entry in files exists 24 | async.forEach( 25 | files, 26 | function (file, cb) { 27 | file.should.have.property('file_name'); 28 | file.should.have.property('file_path'); 29 | fs.exists(file.file_path, function (exists) { 30 | assert.ok(exists, 'file does not exist like it should at path: ' + file.file_path); 31 | cb(); 32 | }); 33 | }, 34 | function (err) { 35 | should.not.exist(err); 36 | done(); 37 | } 38 | ); 39 | }); 40 | }); 41 | 42 | it('should split single page pdf into a new single page pdf files', function(done) { 43 | this.timeout(10*1000); 44 | this.slow(2*1000); 45 | var file_name = 'single_page_searchable.pdf'; 46 | var relative_path = path.join('test_data',file_name); 47 | var pdf_path = path.join(__dirname, relative_path); 48 | split(pdf_path, function (err, output) { 49 | should.not.exist(err); 50 | should.exist(output); 51 | output.should.have.property('folder'); 52 | output.should.have.property('files'); 53 | var files = output.files; 54 | files.length.should.equal(1, 'wrong number of pages after splitting searchable pdf with name: ' + file_name); 55 | // make sure each file entry in files exists 56 | async.forEach( 57 | files, 58 | function (file, cb) { 59 | file.should.have.property('file_name'); 60 | file.should.have.property('file_path'); 61 | fs.exists(file.file_path, function (exists) { 62 | assert.ok(exists, 'file does not exist like it should at path: ' + file.file_path); 63 | cb(); 64 | }); 65 | }, 66 | function (err) { 67 | should.not.exist(err); 68 | done(); 69 | } 70 | ); 71 | }); 72 | }); 73 | }); -------------------------------------------------------------------------------- /test/03_searchable-test.js: -------------------------------------------------------------------------------- 1 | var inspect = require('eyespect').inspector({maxLength:20000}); 2 | var path = require('path'); 3 | var should = require('should'); 4 | var fs = require('fs'); 5 | var assert = require('assert'); 6 | var async = require('async'); 7 | var pathhash = require('pathhash'); 8 | var pdf = require('../main'); 9 | 10 | describe('03 Searchable Test', function() { 11 | var file_name = 
'single_page_searchable.pdf'; 12 | var relative_path = path.join('test_data',file_name); 13 | var pdf_path = path.join(__dirname, relative_path); 14 | var hash; 15 | before(function(done) { 16 | pathhash(pdf_path, function (err, reply) { 17 | should.not.exist(err, 'error getting sha1 hash of pdf file at path: ' + pdf_path + '. ' + err); 18 | should.exist(reply, 'error getting sha1 hash of pdf file at path: ' + pdf_path + '. No hash returned from hashDataAtPath'); 19 | hash = reply; 20 | done(); 21 | }); 22 | }); 23 | it('should return an error when not passing a type for a searchable pdf', function(done) { 24 | this.timeout(10*1000); 25 | this.slow(5*1000); 26 | var options = { 27 | } 28 | pdf(pdf_path, options, function (err, extract) { 29 | should.exist(err,'error should be returned'); 30 | should.not.exist(extract); 31 | done(); 32 | }); 33 | }); 34 | 35 | it('should extract text from electronic searchable pdf', function(done) { 36 | this.timeout(10*1000); 37 | this.slow(5*1000); 38 | var options = { 39 | type: 'text' 40 | } 41 | var processor = pdf(pdf_path, options, function (err) { 42 | should.not.exist(err); 43 | }); 44 | 45 | processor.on('error', function (err) { 46 | should.not.exist(err); 47 | assert.ok(false, 'error during processing'); 48 | 49 | }); 50 | processor.on('complete', function (data) { 51 | data.should.have.property('text_pages'); 52 | data.should.have.property('hash'); 53 | if (data.hash !== hash) { 54 | return; 55 | } 56 | data.text_pages.length.should.eql(1); 57 | data.text_pages[0].length.should.be.above(20); 58 | done(); 59 | }); 60 | }); 61 | }); 62 | -------------------------------------------------------------------------------- /test/04_electronic-test.js: -------------------------------------------------------------------------------- 1 | /** 2 | * Tests extraction for a multi-page searchable pdf file 3 | */ 4 | var inspect = require('eyespect').inspector({maxLength:20000}); 5 | var path = require('path'); 6 | var should = require('should'); 7 | var assert = require('assert'); 8 | var fs = require('fs'); 9 | var async = require('async'); 10 | 11 | var pdf = require('../main.js'); 12 | 13 | var get_desired_text = function(text_file_name, callback) { 14 | var relative_path = path.join('test_data',text_file_name); 15 | var text_file_path = path.join(__dirname, relative_path); 16 | fs.readFile(text_file_path, 'utf8', function (err, reply) { 17 | should.not.exist(err); 18 | should.exist(reply); 19 | return callback(err, reply); 20 | }); 21 | } 22 | describe('04 Multipage searchable test', function() { 23 | it('should extract array of text pages from multipage searchable pdf', function(done) { 24 | this.timeout(10*1000); 25 | this.slow(2*1000); 26 | var file_name = 'multipage_searchable.pdf'; 27 | var relative_path = path.join('test_data',file_name); 28 | var pdf_path = path.join(__dirname, relative_path); 29 | var options = { 30 | type: 'text', 31 | }; 32 | var processor = pdf(pdf_path, options, function (err) { 33 | should.not.exist(err); 34 | }); 35 | 36 | processor.on('complete', function(data) { 37 | data.should.have.property('text_pages'); 38 | data.should.have.property('single_page_pdf_file_paths'); 39 | data.single_page_pdf_file_paths.length.should.equal(8,'wrong number of single_page_pdf_file_paths returned'); 40 | data.should.have.property('pdf_path'); 41 | data.text_pages.length.should.equal(8, 'wrong number of pages after extracting from mulitpage searchable pdf with name: ' + file_name); 42 | assert.ok(page_event_fired, 'never received a "page" 
event like we should have'); 43 | for (var index in data.text_pages) { 44 | var page = data.text_pages[index]; 45 | page.length.should.be.above(0, 'no text on page at index: ' + index); 46 | } 47 | 48 | done(); 49 | }); 50 | var page_event_fired = false; 51 | processor.on('error', function(data) { 52 | assert.ok(false, 'error occurred during processing'); 53 | }); 54 | processor.on('page', function(data) { 55 | page_event_fired = true; 56 | data.should.have.property('index'); 57 | data.should.have.property('pdf_path'); 58 | data.should.have.property('text'); 59 | data.pdf_path.should.eql(pdf_path); 60 | data.text.length.should.above(0); 61 | }); 62 | }); 63 | }); 64 | -------------------------------------------------------------------------------- /test/05_convert-test.js: -------------------------------------------------------------------------------- 1 | var inspect = require('eyespect').inspector({maxLength:20000}); 2 | var path = require('path'); 3 | var should = require('should'); 4 | var assert = require('assert'); 5 | var fs = require('fs'); 6 | var async = require('async'); 7 | 8 | var convert = require('../lib/convert.js'); 9 | 10 | var get_desired_text = function(text_file_name, callback) { 11 | var relative_path = path.join('test_data',text_file_name); 12 | var text_file_path = path.join(__dirname, relative_path); 13 | fs.readFile(text_file_path, 'utf8', function (err, reply) { 14 | should.not.exist(err); 15 | should.exist(reply); 16 | return callback(err, reply); 17 | }); 18 | } 19 | describe('05 Convert Test', function() { 20 | it('should convert raw single page pdf to tif file', function(done) { 21 | return done(); // test is short-circuited by this early return and never runs; remove it to exercise the ghostscript conversion 22 | this.timeout(10*1000); 23 | var file_name = 'single_page_raw.pdf'; 24 | var relative_path = path.join('test_data',file_name); 25 | var pdf_path = path.join(__dirname, relative_path); 26 | fs.exists(pdf_path, function (exists) { 27 | assert.ok(exists, 'file does not exist like it should at path: ' + pdf_path); 28 | convert(pdf_path, function (err, tif_path) { 29 | should.not.exist(err); 30 | should.exist(tif_path); 31 | fs.exists(tif_path, function (exists) { 32 | assert.ok(exists, 'tif file does not exist like it should at path: ' + tif_path); 33 | done(); 34 | }); 35 | }); 36 | }); 37 | }); 38 | }); 39 | -------------------------------------------------------------------------------- /test/06_ocr-test.js: -------------------------------------------------------------------------------- 1 | var inspect = require('eyespect').inspector({maxLength:20000}); 2 | var path = require('path'); 3 | var should = require('should'); 4 | var assert = require('assert'); 5 | var fs = require('fs'); 6 | var ocr = require('../lib/ocr.js'); 7 | 8 | describe('06 OCR Test', function() { 9 | it('should extract text from tif file via tesseract ocr', function(done) { 10 | this.timeout(100*1000); 11 | this.slow(20*1000); 12 | var file_name = 'single_page_raw.tif'; 13 | var relative_path = path.join('test_data',file_name); 14 | var tif_path = path.join(__dirname, relative_path); 15 | fs.exists(tif_path, function (exists) { 16 | assert.ok(exists, 'tif file does not exist like it should at path: ' + tif_path); 17 | ocr(tif_path, function (err, extract) { 18 | should.not.exist(err); 19 | should.exist(extract); 20 | extract.length.should.be.above(20, 'wrong ocr output'); 21 | done(); 22 | }); 23 | }); 24 | }); 25 | 26 | it('should ocr tif file using custom language file', function(done) { 27 | this.timeout(100*1000); 28 | this.slow(20*1000); 29 | var file_name = 'single_page_raw.tif';
30 | var relative_path = path.join('test_data',file_name); 31 | var tif_path = path.join(__dirname, relative_path); 32 | fs.exists(tif_path, function (exists) { 33 | assert.ok(exists, 'tif file does not exist like it should at path: ' + tif_path); 34 | var options = [ 35 | '-psm 1', 36 | '-l dia', 37 | 'alphanumeric' 38 | ] 39 | ocr(tif_path, options, function (err, extract) { 40 | should.not.exist(err); 41 | should.exist(extract); 42 | extract.length.should.be.above(20, 'wrong ocr output'); 43 | done(); 44 | }); 45 | }); 46 | }); 47 | 48 | }); 49 | -------------------------------------------------------------------------------- /test/07_raw-test.js: -------------------------------------------------------------------------------- 1 | /** 2 | * Tests ocr extraction for a multi-page raw scan pdf file 3 | */ 4 | var assert = require('assert'); 5 | var inspect = require('eyespect').inspector({maxLength:20000}); 6 | var path = require('path'); 7 | var should = require('should'); 8 | var fs = require('fs'); 9 | var async = require('async'); 10 | 11 | var pdf = require('../main.js'); 12 | var pathHash = require('pathhash'); 13 | 14 | var get_desired_text = function(text_file_name, callback) { 15 | var relative_path = path.join('test_data',text_file_name); 16 | var text_file_path = path.join(__dirname, relative_path); 17 | fs.readFile(text_file_path, 'utf8', function (err, reply) { 18 | should.not.exist(err); 19 | should.exist(reply); 20 | return callback(err, reply); 21 | }); 22 | } 23 | describe('07 Multipage raw test', function() { 24 | var file_name = 'multipage_raw.pdf'; 25 | var relative_path = path.join('test_data',file_name); 26 | var pdf_path = path.join(__dirname, relative_path); 27 | var options = { 28 | type: 'ocr', 29 | clean: false // keep the temporary single page pdf files 30 | }; 31 | 32 | var hash; 33 | before(function(done) { 34 | pathHash(pdf_path, function (err, reply) { 35 | should.not.exist(err, 'error getting sha1 hash of pdf file at path: ' + pdf_path + '. ' + err); 36 | should.exist(reply, 'error getting sha1 hash of pdf file at path: ' + pdf_path + '. 
No hash returned from hashDataAtPath'); 37 | hash = reply; 38 | done(); 39 | }); 40 | }); 41 | 42 | it('should extract array of text pages from multipage raw scan pdf', function(done) { 43 | console.log('\nPlease be patient, this test may take a minute or more to complete'); 44 | this.timeout(240*1000); 45 | this.slow(120*1000); 46 | var processor = pdf(pdf_path, options, function (err) { 47 | should.not.exist(err); 48 | }); 49 | processor.on('complete', function(data) { 50 | data.should.have.property('text_pages'); 51 | data.should.have.property('pdf_path'); 52 | data.should.have.property('single_page_pdf_file_paths'); 53 | data.text_pages.length.should.eql(2, 'wrong number of pages after extracting from multipage raw pdf with name: ' + file_name); 54 | 55 | assert.ok(page_event_fired, 'no "page" event fired'); 56 | async.forEach( 57 | data.single_page_pdf_file_paths, 58 | function (file_path, cb) { 59 | fs.exists(file_path, function (exists) { 60 | assert.ok(exists,'no single page pdf file exists at the path: ' + file_path); 61 | cb(); 62 | }); 63 | }, 64 | function (err) { 65 | should.not.exist(err, 'error in raw processing: ' + err); 66 | done(); 67 | } 68 | ); 69 | }); 70 | processor.on('log', function(data) { 71 | inspect(data, 'log data'); 72 | }); 73 | var page_event_fired = false; 74 | processor.on('page', function(data) { 75 | page_event_fired = true; 76 | data.should.have.property('index'); 77 | data.should.have.property('pdf_path'); 78 | data.should.have.property('text'); 79 | data.pdf_path.should.eql(pdf_path); 80 | data.text.length.should.above(0); 81 | }); 82 | }); 83 | 84 | it('should ocr raw scan using custom language in ocr_flags', function (done) { 85 | this.timeout(240*1000); 86 | this.slow(120*1000); 87 | var ocr_flags = [ 88 | '-psm 1', 89 | '-l dia', 90 | 'alphanumeric' 91 | ]; 92 | 93 | inspect('Please be patient, this test may take a minute or more to complete'); 94 | 95 | options.ocr_flags = ocr_flags; 96 | var processor = pdf(pdf_path, options, function (err) { 97 | should.not.exist(err); 98 | }); 99 | processor.on('error', function (err){ 100 | should.not.exist(err); 101 | assert.ok(false, 'error during raw processing'); 102 | }); 103 | processor.on('log', function(data) { 104 | inspect(data, 'log event'); 105 | }); 106 | 107 | processor.on('complete', function (data) { 108 | data.should.have.property('text_pages'); 109 | data.should.have.property('hash'); 110 | data.should.have.property('pdf_path'); 111 | data.should.have.property('single_page_pdf_file_paths'); 112 | if (hash !== data.hash) { 113 | return; 114 | } 115 | data.text_pages.length.should.eql(2, 'wrong number of pages after extracting from multipage raw pdf with name: ' + file_name); 116 | async.forEach( 117 | data.single_page_pdf_file_paths, 118 | function (file_path, cb) { 119 | fs.exists(file_path, function (exists) { 120 | assert.ok(exists, 'single page pdf file not found at path: ' + file_path); 121 | cb(); 122 | }); 123 | }, 124 | function (err) { 125 | should.not.exist(err, 'error in raw processing: ' + err); 126 | done(); 127 | } 128 | ); 129 | }); 130 | }); 131 | }); 132 | -------------------------------------------------------------------------------- /test/npm-debug.log: -------------------------------------------------------------------------------- 1 | 0 info it worked if it ends with ok 2 | 1 verbose cli [ 'node', '/usr/local/bin/npm', 'test' ] 3 | 2 info using npm@1.1.62 4 | 3 info using node@v0.8.11 5 | 4 verbose read json
/Users/noah/Sites/node/DocParse/lib/pdf/test/package.json 6 | 5 error Error: ENOENT, open '/Users/noah/Sites/node/DocParse/lib/pdf/test/package.json' 7 | 6 error If you need help, you may report this log at: 8 | 6 error 9 | 6 error or email it to: 10 | 6 error 11 | 7 error System Darwin 11.4.0 12 | 8 error command "node" "/usr/local/bin/npm" "test" 13 | 9 error cwd /Users/noah/Sites/node/DocParse/lib/pdf/test 14 | 10 error node -v v0.8.11 15 | 11 error npm -v 1.1.62 16 | 12 error path /Users/noah/Sites/node/DocParse/lib/pdf/test/package.json 17 | 13 error code ENOENT 18 | 14 error errno 34 19 | 15 verbose exit [ 34, true ] 20 | -------------------------------------------------------------------------------- /test/test_data/multipage_raw.pdf: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/nisaacson/pdf-extract/d334b3a497de830ca825f1fdd94e9588e890563f/test/test_data/multipage_raw.pdf -------------------------------------------------------------------------------- /test/test_data/multipage_searchable.pdf: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/nisaacson/pdf-extract/d334b3a497de830ca825f1fdd94e9588e890563f/test/test_data/multipage_searchable.pdf -------------------------------------------------------------------------------- /test/test_data/single_page_raw.pdf: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/nisaacson/pdf-extract/d334b3a497de830ca825f1fdd94e9588e890563f/test/test_data/single_page_raw.pdf -------------------------------------------------------------------------------- /test/test_data/single_page_raw.tif: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/nisaacson/pdf-extract/d334b3a497de830ca825f1fdd94e9588e890563f/test/test_data/single_page_raw.tif -------------------------------------------------------------------------------- /test/test_data/single_page_searchable.pdf: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/nisaacson/pdf-extract/d334b3a497de830ca825f1fdd94e9588e890563f/test/test_data/single_page_searchable.pdf -------------------------------------------------------------------------------- /test/test_data/single_page_searchable.txt: -------------------------------------------------------------------------------- 1 | Sampling 2 | by David A. Freedman 3 | Department of Statistics 4 | University of California 5 | Berkeley, CA 94720 6 | 7 | The basic idea in sampling is extrapolation from the part to the 8 | whole--from "the sample" to "the population." (The population is some- 9 | times rather mysteriously called "the universe.") There is an immediate 10 | corollary: the sample must be chosen to fairly represent the population. 11 | Methods for choosing samples are called "designs." Good designs in- 12 | volve the use of probability methods, minimizing subjective judgment in 13 | the choice of units to survey. Samples drawn using probability methods 14 | are called "probability samples." 15 | Bias is a serious problem in applied work; probability samples min- 16 | imize bias. As it turns out, however, methods used to extrapolate from a 17 | probability sample to the population should take into account the method 18 | used to draw the sample; otherwise, bias may come in through the back 19 | door. 
The ideas will be illustrated for sampling people or business records, 20 | but apply more broadly. There are sample surveys of buildings, farms, law 21 | cases, schools, trees, trade union locals, and many other populations. 22 | 23 | SAMPLE DESIGN 24 | Probability samples should be distinguished from "samples of con- 25 | venience" (also called "grab samples"). A typical sample of convenience 26 | comprises the investigator's students in an introductory course. A "mall 27 | sample" consists of the people willing to be interviewed on certain days 28 | at certain shopping centers. This too is a convenience sample. The reason 29 | for the nomenclature is apparent, and so is the downside: the sample may 30 | not represent any definable population larger than itself. 31 | To draw a probability sample, we begin by identifying the population 32 | of interest. The next step is to create the "sampling frame," a list of 33 | units to be sampled. One easy design is "simple random sampling." For 34 | instance, to draw a simple random sample of 100 units, choose one unit 35 | at random from the frame; put this unit into the sample; choose another 36 | unit at random from the remaining ones in the frame; and so forth. Keep 37 | going until 100 units have been chosen. At each step along the way, all 38 | units in the pool have the same chance of being chosen. 39 | --------------------------------------------------------------------------------