├── .gitignore ├── .travis.yml ├── LICENSE ├── README.md ├── example-image.js ├── example-text.js ├── install-mac.sh ├── lib ├── convert.js ├── electronic.js ├── ocr.js ├── raw.js ├── searchable.js └── split.js ├── main.js ├── makefile ├── package-lock.json ├── package.json ├── share ├── configs │ └── alphanumeric ├── dia.traineddata └── eng.traineddata └── test ├── 01_command-test.js ├── 02_split-test.js ├── 03_searchable-test.js ├── 04_electronic-test.js ├── 05_convert-test.js ├── 06_ocr-test.js ├── 07_raw-test.js ├── npm-debug.log └── test_data ├── multipage_raw.pdf ├── multipage_searchable.pdf ├── single_page_raw.pdf ├── single_page_raw.tif ├── single_page_searchable.pdf └── single_page_searchable.txt /.gitignore: -------------------------------------------------------------------------------- 1 | node_modules 2 | npm-debug.log 3 | -------------------------------------------------------------------------------- /.travis.yml: -------------------------------------------------------------------------------- 1 | language: node_js 2 | 3 | before_script: 4 | - sudo apt-get install -qq ghostscript 5 | - sudo apt-get install -qq pdftk 6 | - sudo apt-get install -qq tesseract-ocr 7 | - sudo apt-get install -qq poppler-utils 8 | - sudo cp ./share/eng.traineddata /usr/share/tesseract-ocr/tessdata/ 9 | - sudo cp ./share/dia.traineddata /usr/share/tesseract-ocr/tessdata/ 10 | - sudo cp ./share/configs/alphanumeric /usr/share/tesseract-ocr/tessdata/configs/alphanumeric 11 | 12 | node_js: 13 | - "0.8" 14 | 15 | notifications: 16 | email: false 17 | -------------------------------------------------------------------------------- /LICENSE: -------------------------------------------------------------------------------- 1 | MIT License 2 | 3 | Copyright (c) 2018 Noah Isaacson 4 | 5 | Permission is hereby granted, free of charge, to any person obtaining a copy 6 | of this software and associated documentation files (the "Software"), to deal 7 | in the Software without restriction, including without limitation the rights 8 | to use, copy, modify, merge, publish, distribute, sublicense, and/or sell 9 | copies of the Software, and to permit persons to whom the Software is 10 | furnished to do so, subject to the following conditions: 11 | 12 | The above copyright notice and this permission notice shall be included in all 13 | copies or substantial portions of the Software. 14 | 15 | THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR 16 | IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, 17 | FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE 18 | AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER 19 | LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, 20 | OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE 21 | SOFTWARE. 22 | -------------------------------------------------------------------------------- /README.md: -------------------------------------------------------------------------------- 1 | # Node PDF 2 | Node PDF is a set of tools that takes in PDF files and converts them to usable formats for data processing. The library supports both extracting text from searchable pdf files as well as performing OCR on pdfs which are just scanned images of text 3 | 4 | [![Build Status](https://travis-ci.org/nisaacson/pdf-extract.png)](https://travis-ci.org/nisaacson/pdf-extract) 5 | 6 | ## Installation 7 | 8 | To begin install the module. 
9 | 10 | `npm install pdf-extract` 11 | 12 | After the library is installed, you will need the following binaries accessible on your path to process pdfs. 13 | 14 | - pdftk 15 | - pdftk splits a multi-page pdf into single pages. 16 | - pdftotext 17 | - pdftotext is used to extract text out of searchable pdf documents 18 | - ghostscript 19 | - ghostscript is an ocr preprocessor which converts pdfs to tif files for input into tesseract 20 | - tesseract 21 | - tesseract performs the actual ocr on your scanned images 22 | 23 | 24 | ### OSX 25 | To begin on OSX, first make sure you have the homebrew package manager installed. 26 | 27 | **pdftk** is not available in Homebrew. However, a GUI installer is available here: 28 | [http://www.pdflabs.com/docs/install-pdftk/](http://www.pdflabs.com/docs/install-pdftk/) 29 | 30 | **pdftotext** is included as part of the **poppler** utilities library. **poppler** can be installed via homebrew 31 | 32 | ``` bash 33 | brew install poppler 34 | ``` 35 | 36 | **ghostscript** can be installed via homebrew 37 | ``` bash 38 | brew install gs 39 | ``` 40 | 41 | **tesseract** can be installed via homebrew as well 42 | 43 | ``` bash 44 | brew install tesseract 45 | ``` 46 | 47 | After tesseract is installed, you need to install the alphanumeric config and an updated trained data file 48 | ``` bash 49 | cd <path to node_modules/pdf-extract> 50 | cp "./share/eng.traineddata" "/usr/local/Cellar/tesseract/3.02.02_3/share/tessdata/eng.traineddata" 51 | cp "./share/dia.traineddata" "/usr/local/Cellar/tesseract/3.02.02_3/share/tessdata/dia.traineddata" 52 | cp "./share/configs/alphanumeric" "/usr/local/Cellar/tesseract/3.02.02_3/share/tessdata/configs/alphanumeric" 53 | ``` 54 | 55 | ### Ubuntu 56 | **pdftk** can be installed directly via apt-get 57 | ```bash 58 | apt-get install pdftk 59 | ``` 60 | 61 | **pdftotext** is included in the **poppler-utils** library. To install poppler-utils, execute 62 | ``` bash 63 | apt-get install poppler-utils 64 | ``` 65 | 66 | **ghostscript** can be installed via apt-get 67 | ``` bash 68 | apt-get install ghostscript 69 | ``` 70 | 71 | **tesseract** can be installed via apt-get. Note that unlike the OSX install, the package is called **tesseract-ocr** on Ubuntu, not **tesseract** 72 | ``` bash 73 | apt-get install tesseract-ocr 74 | ``` 75 | 76 | For the OCR to work, you need to have the tesseract-ocr binaries available on your path. If you only need to handle ASCII characters, the accuracy of the OCR process can be increased by limiting the tesseract output. To do this, copy the *alphanumeric* config file included with this pdf-extract module into the *tessdata* folder on your system. Also, the eng.traineddata file included with the standard tesseract-ocr package is out of date. This pdf-extract module provides an up-to-date version which you should copy into the appropriate location on your system 77 | ``` bash 78 | cd <path to node_modules/pdf-extract> 79 | cp "./share/eng.traineddata" "/usr/share/tesseract-ocr/tessdata/eng.traineddata" 80 | cp "./share/configs/alphanumeric" "/usr/share/tesseract-ocr/tessdata/configs/alphanumeric" 81 | ``` 82 | 83 | 84 | ### SmartOS 85 | **pdftk** can be installed directly via apt-get 86 | ``` bash 87 | apt-get install pdftk 88 | ``` 89 | 90 | **pdftotext** is included in the **poppler-utils** library. To install poppler-utils, execute 91 | ``` bash 92 | apt-get install poppler-utils 93 | ``` 94 | 95 | **ghostscript** can be installed via pkgin. Note you may need to update the pkgin repo to include the additional sources provided by Joyent.
Check [http://www.perkin.org.uk/posts/9000-packages-for-smartos-and-illumos.html](http://www.perkin.org.uk/posts/9000-packages-for-smartos-and-illumos.html) for details 96 | ``` bash 97 | pkgin install ghostscript 98 | ``` 99 | 100 | **tesseract** must be manually downloaded and compiled. You must also install leptonica before installing tesseract. At the time of this writing, leptonica is available from [http://www.leptonica.com/download.html](http://www.leptonica.com/download.html), with the latest version tarball available from [http://www.leptonica.com/source/leptonica-1.69.tar.gz](http://www.leptonica.com/source/leptonica-1.69.tar.gz) 101 | ``` bash 102 | pkgin install autoconf 103 | wget http://www.leptonica.com/source/leptonica-1.69.tar.gz 104 | tar -xvzf leptonica-1.69.tar.gz 105 | cd leptonica-1.69 106 | ./configure 107 | make 108 | [sudo] make install 109 | ``` 110 | After installing leptonica, move on to tesseract. Tesseract is available from [https://code.google.com/p/tesseract-ocr/downloads/list](https://code.google.com/p/tesseract-ocr/downloads/list) with the latest version available from [https://code.google.com/p/tesseract-ocr/downloads/detail?name=tesseract-ocr-3.02.02.tar.gz&can=2&q=](https://code.google.com/p/tesseract-ocr/downloads/detail?name=tesseract-ocr-3.02.02.tar.gz&can=2&q=) 111 | ``` bash 112 | wget "https://code.google.com/p/tesseract-ocr/downloads/detail?name=tesseract-ocr-3.02.02.tar.gz&can=2&q=" 113 | tar -xvzf tesseract-ocr-3.02.02.tar.gz 114 | cd tesseract-ocr 115 | ./configure 116 | make 117 | [sudo] make install 118 | ``` 119 | 120 | ### Windows 121 | Important! You will have to add some variables to the PATH of your machine. You do this by right-clicking your computer in File Explorer, then selecting Properties, Advanced System Settings, Environment Variables. You can then add **the folder that contains the executables** to the path variable. 122 | 123 | **pdftk** can be installed using the PDFtk Server installer found here: https://www.pdflabs.com/tools/pdftk-server/ 124 | It should automatically add itself to the PATH; if not, the default install location is *"C:\Program Files (x86)\PDFtk Server\bin\"* 125 | 126 | **pdftotext** can be installed using the recompiled poppler utils for Windows, which have been collected and bundled here: http://manifestwebdesign.com/2013/01/09/xpdf-and-poppler-utils-on-windows/ 127 | Unpack these in a folder (for example *"C:\poppler-utils"*) and add this folder to the PATH. 128 | 129 | **ghostscript** for Windows can be found at: http://www.ghostscript.com/download/gsdnld.html 130 | Make sure you download the General Public License release and the correct version (32/64 bit). 131 | Install it, go to the installation folder (default: *"C:\Program Files\gs\gs9.19"*) and go into the **bin** folder. 132 | Rename *gswin64c* to *gs*, and add the bin folder to your PATH. 133 | 134 | **tesseract** can be built from source, but you can also download an older prebuilt version which seems to work fine. Downloads at: https://sourceforge.net/projects/tesseract-ocr-alt/files/ 135 | The version tested is *tesseract-ocr-setup-3.02.02.exe*; the default install location is *"C:\Program Files (x86)\Tesseract-OCR"* and it is also added to the PATH. 136 | Note that this only happens when you've checked the option to install for everyone on the machine. 137 | 138 | Everything should work after all this! If not, try restarting to make sure the updated PATH variables are picked up. A quick way to verify the binaries from Node is sketched below.
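The following is a minimal sanity-check script, offered here only as an illustration (the file name and structure are hypothetical; it is not part of pdf-extract). It uses the Windows `where` command to confirm that each required binary can be resolved on the PATH:

``` javascript
// check-binaries.js -- hypothetical helper, not shipped with pdf-extract
// Uses the Windows "where" command to report where each required binary resolves on the PATH.
const exec = require('child_process').exec

const binaries = ['pdftk', 'pdftotext', 'gs', 'tesseract']
binaries.forEach(function (name) {
  exec('where ' + name, function (err, stdout) {
    if (err) {
      // "where" exits non-zero when the binary is not found
      console.error(name + ' was not found on the PATH')
      return
    }
    console.log(name + ' -> ' + stdout.trim())
  })
})
```

On OSX or Linux the same check works with `which` in place of `where`, which is essentially what the module's own `npm test` command checks first.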
139 | **This setup was tested on a Windows 10 Pro N 64bit machine.** 140 | 141 | 142 | ## Usage 143 | 144 | ### OCR Extract from scanned image 145 | Extract from a pdf file which contains a scanned image and no searchable text 146 | ``` javascript 147 | const path = require("path") 148 | const pdf_extract = require('pdf-extract') 149 | 150 | console.log("Usage: node thisfile.js the/path/tothe.pdf") 151 | const absolute_path_to_pdf = path.resolve(process.argv[2]) 152 | if (absolute_path_to_pdf.includes(" ")) throw new Error("will fail for paths w spaces like "+absolute_path_to_pdf) 153 | 154 | const options = { 155 | type: 'ocr', // perform ocr to get the text within the scanned image 156 | ocr_flags: ['--psm 1'], // automatically detect page orientation 157 | } 158 | const processor = pdf_extract(absolute_path_to_pdf, options, ()=>console.log("Starting…")) 159 | processor.on('complete', data => callback(null, data)) 160 | processor.on('error', callback) 161 | function callback (error, data) { error ? console.error(error) : console.log(data.text_pages[0]) } 162 | ``` 163 | 164 | 165 | 166 | ### Text extract from searchable pdf 167 | Extract from a pdf file which contains actual searchable text 168 | ``` javascript 169 | const path = require("path") 170 | const pdf_extract = require('pdf-extract') 171 | 172 | console.log("Usage: node thisfile.js the/path/tothe.pdf") 173 | const absolute_path_to_pdf = path.resolve(process.argv[2]) 174 | if (absolute_path_to_pdf.includes(" ")) throw new Error("will fail for paths w spaces like "+absolute_path_to_pdf) 175 | 176 | const options = { 177 | type: 'text', // extract searchable text from PDF 178 | ocr_flags: ['--psm 1'], // automatically detect page orientation 179 | enc: 'UTF-8', // optional, encoding to use for the text output 180 | mode: 'layout' // optional, mode to use when reading the pdf 181 | } 182 | const processor = pdf_extract(absolute_path_to_pdf, options, ()=>console.log("Starting…")) 183 | processor.on('complete', data => callback(null, data)) 184 | processor.on('error', callback) 185 | function callback (error, data) { error ? console.error(error) : console.log(data.text_pages[0]) } 186 | ``` 187 | #### Options 188 | At a minimum you must specify the type of extraction you wish to perform 189 | 190 | **clean** 191 | When the system extracts text from a multi-page pdf, it first splits the pdf into single pages. These are written to disk before the ocr occurs. For some applications these single page files can be useful. If you need to work with the single page pdf files after the ocr is complete, set the **clean** option to **false** as shown below. Note that the single page pdf files are written to the system-appropriate temp directory, so you must copy the files to a more permanent location yourself after the ocr process completes 192 | ``` javascript 193 | var options = { 194 | type: 'ocr', // (required) perform ocr to get the text within the scanned image 195 | enc: 'UTF-8', // optional, only applies to 'text' type 196 | mode: 'layout', // optional, only applies to 'text' type. Available modes are 'layout', 'simple', 'table' or 'lineprinter'. Default is 'layout' 197 | clean: false, // keep the single page pdfs created during the ocr process 198 | ocr_flags: [ 199 | '-psm 1', // automatically detect page orientation 200 | '-l dia', // use a custom language file 201 | 'alphanumeric' // only output ascii characters 202 | ] 203 | } 204 | ``` 205 | 206 | 207 | ### Events 208 | When processing, the module will emit various events as they occur. A short example of wiring up all of the listeners is sketched after this section. 209 | 210 | **page** 211 | Emitted when a page has completed processing. The data passed with this event looks like 212 | ``` javascript 213 | var data = { 214 | hash: '<sha1 hash of the input pdf file>', 215 | text: '<extracted text for this page>', 216 | index: 2, 217 | num_pages: 4, 218 | pdf_path: "~/Downloads/input_pdf_file.pdf", 219 | single_page_pdf_path: "/tmp/temp_pdf_file2.pdf" 220 | } 221 | ``` 222 | 223 | **error** 224 | Emitted when an error occurs during processing. After this event is emitted, processing will stop. 225 | The data passed with this event looks like 226 | ``` 227 | var data = { 228 | error: 'no file exists at the path you specified', 229 | pdf_path: "~/Downloads/input_pdf_file.pdf", 230 | } 231 | ``` 232 | 233 | **complete** 234 | Emitted when all pages have completed processing and the pdf extraction is complete 235 | ``` 236 | var data = { 237 | hash: '<sha1 hash of the input pdf file>', 238 | text_pages: ['<extracted text, one string per page>'], 239 | pdf_path: "~/Downloads/input_pdf_file.pdf", 240 | single_page_pdf_file_paths: [ 241 | "/tmp/temp_pdf_file1.pdf", 242 | "/tmp/temp_pdf_file2.pdf", 243 | "/tmp/temp_pdf_file3.pdf", 244 | "/tmp/temp_pdf_file4.pdf", 245 | ] 246 | } 247 | ``` 248 | 249 | **log** 250 | To avoid spamming process.stdout, log events are emitted instead.
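Here is a hedged sketch (not taken verbatim from the repository) showing one way to attach listeners for all four events documented above; the pdf path is a placeholder and the option values mirror the examples earlier in this README:

``` javascript
const pdf_extract = require('pdf-extract')

const options = {
  type: 'ocr', // or 'text' for a searchable pdf
  clean: false // keep the single page pdfs so they can be inspected afterwards
}
// '/path/to/some.pdf' is a placeholder -- pass an absolute path to a real file
const processor = pdf_extract('/path/to/some.pdf', options, err => {
  if (err) console.error('failed to start processing', err)
})

processor.on('log', message => console.log('log:', message))
processor.on('page', data => console.log('page event, index', data.index, 'num_pages', data.num_pages))
processor.on('error', data => console.error('extraction failed:', data.error))
processor.on('complete', data => console.log('extracted', data.text_pages.length, 'pages'))
```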
251 | 252 | ## Tests 253 | To test that your system satisfies the needed dependencies and that the module is functioning correctly, execute the following commands in the pdf-extract module folder 254 | ``` 255 | cd <path to your project>/node_modules/pdf-extract 256 | npm test 257 | ``` 258 | -------------------------------------------------------------------------------- /example-image.js: -------------------------------------------------------------------------------- 1 | const path = require("path") 2 | const pdf_extract = require('./main.js') 3 | 4 | console.log("Usage: node thisfile.js the/path/tothe.pdf") 5 | const absolute_path_to_pdf = path.resolve(process.argv[2]) 6 | if (absolute_path_to_pdf.includes(" ")) throw new Error("will fail for paths w spaces like "+absolute_path_to_pdf) 7 | 8 | const options = { 9 | type: 'ocr', // perform ocr to get the text within the scanned image 10 | ocr_flags: ['--psm 1'] // automatically detect page orientation 11 | } 12 | const processor = pdf_extract(absolute_path_to_pdf, options, ()=>console.log("Starting…")) 13 | processor.on('complete', data => callback(null, data)) 14 | processor.on('error', callback) 15 | function callback (error, data) { error ?
console.error(error) : console.log(data.text_pages[0]) } 16 | -------------------------------------------------------------------------------- /example-text.js: -------------------------------------------------------------------------------- 1 | const path = require("path") 2 | const pdf_extract = require('./main.js') 3 | 4 | console.log("Usage: node thisfile.js the/path/tothe.pdf") 5 | const absolute_path_to_pdf = path.resolve(process.argv[2]) 6 | if (absolute_path_to_pdf.includes(" ")) throw new Error("will fail for paths w spaces like "+absolute_path_to_pdf) 7 | 8 | const options = { 9 | type: 'text', // extract searchable text from PDF 10 | ocr_flags: ['--psm 1'] // automatically detect page orientation 11 | } 12 | const processor = pdf_extract(absolute_path_to_pdf, options, ()=>console.log("Starting…")) 13 | processor.on('complete', data => callback(null, data)) 14 | processor.on('error', callback) 15 | function callback (error, data) { error ? console.error(error) : console.log(data.text_pages[0]) } 16 | -------------------------------------------------------------------------------- /install-mac.sh: -------------------------------------------------------------------------------- 1 | #!/bin/bash 2 | # can skip pdftk install step per https://github.com/nisaacson/pdf-extract/pull/26 3 | brew install poppler 4 | brew install gs 5 | brew install tesseract 6 | cd /usr/local/Cellar/tesseract/*/share/tessdata/ 7 | cp "$OLDPWD/node_modules/pdf-extract/share/eng.traineddata" eng.traineddata 8 | cp "$OLDPWD/node_modules/pdf-extract/share/dia.traineddata" dia.traineddata 9 | cp "$OLDPWD/node_modules/pdf-extract/share/configs/alphanumeric" configs/alphanumeric 10 | cd - 11 | -------------------------------------------------------------------------------- /lib/convert.js: -------------------------------------------------------------------------------- 1 | /** 2 | * Converts a pdf file at a given path to a tiff file with 3 | * the GraphicsMagick command "convert" 4 | */ 5 | var temp = require('temp'); 6 | var path = require('path'); 7 | var exec = require('child_process').exec 8 | var spawn = require('child_process').spawn; 9 | var fs = require('fs'); 10 | var pdf_convert_quality = 400; // default to density 400 for the convert command 11 | 12 | 13 | /** 14 | * @param input_path the path to a pdf file on disk. Since GhostScript requires random file access, we need a path 15 | * to an actual file rather than accepting a stream 16 | * @param {String} quality is an optional flag that controls the quality of the pdf to tiff conversion. 
17 | * @return {String} output_path the path to the converted tif file 18 | * @return callback(, output_path) 19 | */ 20 | exports = module.exports = function convert(input_path, quality, callback) { 21 | // options is an optional parameter 22 | if (!callback || typeof callback != "function") { 23 | callback = quality; // callback must be the second parameter 24 | quality = undefined; // no option passed 25 | } 26 | 27 | fs.exists(input_path, function (exists) { 28 | if (!exists) { return callback('error, no file exists at the path you specified: ' + input_path); } 29 | // get a temp output path 30 | 31 | var output_path = temp.path({prefix: 'tif_output', suffix:'.tif'}); 32 | // var output_path = path.join(__dirname,'test/test_data/single_page_raw.tif'); 33 | var params = [ 34 | 35 | // '-depth 8', 36 | // '-background white', 37 | // '-flatten +matte', 38 | // '-density '+pdf_convert_quality, 39 | input_path, 40 | output_path 41 | ]; 42 | if (quality) { 43 | if (typeof(quality) !== 'string' && typeof(quality) !== 'number') { 44 | return callback('error, pdf quality option must be a string, you passed a ' + typeof(quality)); 45 | } 46 | pdf_convert_quality = quality; 47 | } 48 | var cmd = 'gs -sDEVICE=tiffgray -r720x720 -g6120x7920 -sCompression=lzw -o "' + output_path + '" "'+input_path+'"'; 49 | // var cmd = 'convert -depth 8 -background white -flatten +matte -density '+pdf_convert_quality+' "'+ input_path +'" "' + output_path+'"'; 50 | var child = exec(cmd, function (err, stderr, stdout) { 51 | if (err) { 52 | return callback(err); 53 | } 54 | return callback(null, output_path); 55 | }); 56 | }); 57 | } 58 | -------------------------------------------------------------------------------- /lib/electronic.js: -------------------------------------------------------------------------------- 1 | /** 2 | * Module which extracts the text out of an electronic pdf file 3 | * This module can handle multi-page pdf files 4 | */ 5 | var fs = require('fs'); 6 | var async = require('async'); 7 | 8 | var rimraf = require('rimraf'); 9 | var util = require('util'); 10 | var events = require('events'); 11 | 12 | var split = require('./split.js'); 13 | var searchable = require('./searchable.js'); 14 | var pathhash = require('pathhash'); 15 | 16 | 17 | function Electronic(){ 18 | if(false === (this instanceof Electronic)) { 19 | return new Electronic(); 20 | } 21 | } 22 | util.inherits(Electronic, events.EventEmitter); 23 | module.exports = Electronic; 24 | 25 | 26 | /** 27 | * @param pdf_path path to the pdf file on disk 28 | * 29 | * @return {Array} text_pages an array of the extracted text where 30 | * each entry is the text for the page at the given index 31 | * @return callback(, text_pages) 32 | */ 33 | Electronic.prototype.process = function(pdf_path, options) { 34 | var self = this; 35 | var text_pages = []; 36 | var split_output; 37 | var single_page_pdf_file_paths = []; 38 | fs.exists(pdf_path, function (exists) { 39 | var err; 40 | if (!exists) { 41 | err = 'no file exists at the path you specified: ' + pdf_path 42 | self.emit('error', { error: err, pdf_path: pdf_path}); 43 | return 44 | } 45 | pathhash(pdf_path, function (err, hash) { 46 | if (err) { 47 | err = 'error hashing file at the path you specified: ' + pdf_path + '. 
' + err; 48 | self.emit('error', { error: err, pdf_path: pdf_path}); 49 | return 50 | } 51 | // split the pdf into single page pdf files 52 | split(pdf_path, options.pdf_password, function (err, output) { 53 | if (err) { 54 | self.emit('error', { error: err, pdf_path: pdf_path}); 55 | return 56 | } 57 | 58 | 59 | if (!output) { 60 | err = 'failed to split pdf file into distinct pages'; 61 | self.emit('error', { error: err, pdf_path: pdf_path}); 62 | return 63 | } 64 | split_output = output; 65 | if (!split_output.hasOwnProperty('files') || split_output.files.length == 0) { 66 | err = 'no pages where found in your pdf document'; 67 | self.emit('error', { error: err, pdf_path: pdf_path}); 68 | return 69 | } 70 | self.emit('log', 'finished splitting pages for file at path ' + pdf_path); 71 | var files = split_output.files; 72 | var index = 0; 73 | async.forEachSeries( 74 | files, 75 | // extract the text for each page 76 | function (file, cb) { 77 | index++; 78 | searchable(file.file_path, options, function (err, extract) { 79 | if(err){ 80 | self.emit('error', { error: err, pdf_path: pdf_path}); 81 | return; 82 | } 83 | text_pages.push(extract); 84 | var file_path = file.file_path 85 | single_page_pdf_file_paths.push(file.file_path); 86 | self.emit('page', { hash: hash, text: extract, index: index, pdf_path: pdf_path}); 87 | cb(); 88 | }); 89 | }, 90 | function (err) { 91 | if (!err) { 92 | self.emit('complete', { hash: hash, text_pages: text_pages, pdf_path: pdf_path, single_page_pdf_file_paths: single_page_pdf_file_paths}); 93 | return; 94 | } 95 | self.emit('error', { error: err, pdf_path: pdf_path}); 96 | if (!split_output || ! split_output.folder) { return } 97 | fs.exists(split_output.folder, function (exists) { 98 | if (!exists) { return } 99 | var remove_cb = function() {} 100 | rimraf(split_output.folder, remove_cb); 101 | }); 102 | } 103 | ); 104 | }); 105 | }); 106 | }); 107 | } 108 | -------------------------------------------------------------------------------- /lib/ocr.js: -------------------------------------------------------------------------------- 1 | /** 2 | * Module which extracts text from electronic searchable pdf files. 
3 | * Requires the "pdftotext" binary be installed on the system and accessible in the 4 | * current path 5 | */ 6 | var temp = require('temp'); 7 | var path = require('path'); 8 | var exec = require('child_process').exec; 9 | var fs = require('fs'); 10 | 11 | /** 12 | * @param tif_path path to the single page file on disk containing a scanned image of text 13 | * @param {Array} options is an optional list of flags to pass to the tesseract command 14 | * @return {String} extract the extracted ocr text output 15 | * @return callback(, stdout) 16 | */ 17 | module.exports = function(input_path, options, callback) { 18 | // options is an optional parameter 19 | if (!callback || typeof callback != "function") { 20 | // callback must be the second parameter 21 | callback = options; 22 | options = []; 23 | } 24 | fs.exists(input_path, function (exists) { 25 | if (!exists) { return callback('error, no file exists at the path you specified: ' + input_path); } 26 | // get a temp output path 27 | var output_path = temp.path({prefix: 'ocr_output'}); 28 | // output_path = path.join(__dirname,'test/test_data/single_page_raw'); 29 | var procoptions = { maxBuffer: 4096 * 4096 }; 30 | var cmd = 'tesseract "'+input_path+'" "'+output_path+'" '+options.join(' '); 31 | var child = exec(cmd, procoptions, function (err, stdout, stderr) { 32 | if (err) { return callback(err); } 33 | // tesseract automatically appends ".txt" to the output file name 34 | var text_output_path = output_path+'.txt'; 35 | // inspect(text_output_path, 'text output path'); 36 | fs.readFile(text_output_path, 'utf8', function(err, output) { 37 | // inspect(output, 'ocr output'); 38 | if (err) { return callback(err); } 39 | // cleanup after ourselves 40 | fs.unlink(text_output_path, function (err) { 41 | if (err) { return callback(err); } 42 | callback(null, output); 43 | }); 44 | }); 45 | }); 46 | }); 47 | } 48 | -------------------------------------------------------------------------------- /lib/raw.js: -------------------------------------------------------------------------------- 1 | /** 2 | * Module which extracts the text out of an electronic pdf file 3 | * This module can handle multi-page pdf files 4 | 5 | */ 6 | var util = require('util'); 7 | var events = require('events'); 8 | var fs = require('fs'); 9 | var async = require('async'); 10 | var split = require('./split.js'); 11 | var convert = require('./convert.js'); 12 | var pathHash = require('pathhash'); 13 | var ocr = require('./ocr.js'); 14 | var rimraf = require('rimraf'); 15 | 16 | 17 | function Raw(){ 18 | if(false === (this instanceof Raw)) { 19 | return new Raw(); 20 | } 21 | } 22 | util.inherits(Raw, events.EventEmitter); 23 | module.exports = Raw; 24 | 25 | 26 | /** 27 | * @param {String} pdf_path path to the pdf file on disk 28 | * @param {Boolean} params.clean true to remove the temporary single-page pdf 29 | * files from disk. Sometimes however you might want to be able to use those 30 | * single page pdfs after the ocr completes. 
In this case pass clean = false 31 | * 32 | * @return {Array} text_pages an array of the extracted text where 33 | * each entry is the text for the page at the given index 34 | * @return callback(, text_pages) 35 | */ 36 | Raw.prototype.process = function(pdf_path, options) { 37 | var self = this; 38 | var text_pages = []; 39 | var split_output; 40 | if (!options) { 41 | options = {}; 42 | } 43 | // default to removing the single page pdfs after ocr completes 44 | if (!options.hasOwnProperty('clean')) { 45 | options.clean = true; 46 | } 47 | fs.exists(pdf_path, function (exists) { 48 | if (!exists) { 49 | var err = 'no file exists at the path you specified: ' + pdf_path 50 | self.emit('error', { error: err, pdf_path: pdf_path}); 51 | return 52 | } 53 | pathHash(pdf_path, function (err, hash) { 54 | if (err) { 55 | err = 'error hashing file at the path you specified: ' + pdf_path + '. ' + err; 56 | self.emit('error', { error: err, pdf_path: pdf_path}); 57 | return; 58 | } 59 | split(pdf_path, function (err, output) { 60 | if (err) { 61 | self.emit('error', { error: err, pdf_path: pdf_path}); 62 | return 63 | } 64 | if (!output) { 65 | err = 'no files returned from split'; 66 | self.emit('error', { error: err, pdf_path: pdf_path}); 67 | return; 68 | } 69 | self.emit('log', 'finished splitting pages for file at path ' + pdf_path); 70 | split_output = output; 71 | var pdf_files = output.files; 72 | if (!pdf_files || pdf_files.length == 0) { 73 | err = 'error, no pages where found in your pdf document'; 74 | self.emit('error', { error: err, pdf_path: pdf_path}); 75 | return; 76 | } 77 | var index = 0; 78 | var num_pages = pdf_files.length 79 | var single_page_pdf_file_paths = []; 80 | async.forEachSeries( 81 | pdf_files, 82 | // extract the text for each page via ocr 83 | function (pdf_file, cb) { 84 | var quality = 300; 85 | if (options.hasOwnProperty('quality') && options.quality) { 86 | quality = options.quality; 87 | } 88 | convert(pdf_file.file_path, quality, function (err, tif_path) { 89 | var zeroBasedNumPages = num_pages-1; 90 | self.emit('log', 'converted page to intermediate tiff file, page '+ index+ ' (0-based indexing) of '+ zeroBasedNumPages); 91 | if (err) { return cb(err); } 92 | var ocr_flags = [ 93 | '-psm 6' 94 | ]; 95 | if (options.ocr_flags) { 96 | ocr_flags = options.ocr_flags; 97 | } 98 | ocr(tif_path, ocr_flags, function (err, extract) { 99 | fs.unlink(tif_path, function (tif_cleanup_err, reply) { 100 | if (tif_cleanup_err) { 101 | err += ', error removing temporary tif file: "'+tif_cleanup_err+'"'; 102 | } 103 | if (err) { return cb(err); } 104 | var page_number = index+1 105 | self.emit('log', 'raw ocr: page ' + index + ' (0-based indexing) of ' +zeroBasedNumPages + ' complete'); 106 | single_page_pdf_file_paths.push(pdf_file.file_path); 107 | self.emit('page', { hash: hash, text: extract, index: index, num_pages: num_pages, pdf_path: pdf_path, single_page_pdf_path: pdf_file.file_path}); 108 | text_pages.push(extract); 109 | index++; 110 | cb(); 111 | }); 112 | }); 113 | }); 114 | }, function (err) { 115 | if (err) { 116 | self.emit('error', err); 117 | return; 118 | } 119 | self.emit('complete', { hash: hash, text_pages: text_pages, pdf_path: pdf_path, single_page_pdf_file_paths: single_page_pdf_file_paths}); 120 | }); 121 | }); 122 | }); 123 | }); 124 | } 125 | -------------------------------------------------------------------------------- /lib/searchable.js: -------------------------------------------------------------------------------- 1 | /** 2 | * Module which 
extracts text from electronic searchable pdf files. 3 | * Requires the "pdftotext" binary be installed on the system and accessible in the 4 | * current path 5 | */ 6 | var path = require('path'); 7 | var temp = require('temp'); 8 | var exec = require('child_process').exec; 9 | var spawn = require('child_process').spawn; 10 | var fs = require('fs'); 11 | var walk = require('walk'); 12 | var async = require('async'); 13 | var rimraf = require('rimraf'); 14 | 15 | /** 16 | * @param pdf_path path to the single page searchable pdf file on disk 17 | * This function buffers all the output from stdout and sends it back as a string. 18 | * Since we only handle single pages of pdf text here the amount of text is small 19 | * and therefore we don't need to use a stream 20 | * 21 | * @return {ReadStream} the entire output from stdout 22 | * @return callback(, stdout) 23 | */ 24 | module.exports = function(pdf_path, options, callback) { 25 | if(options===undefined)options={}; 26 | if(options.mode===undefined)options.mode='layout'; 27 | confirm_file_exists(pdf_path, function (err) { 28 | if (err) { return callback(err); } 29 | var child = spawn('pdftotext', (spawnOptions(options)).concat([pdf_path, '-'])); 30 | var stdout = child.stdout; 31 | var stderr = child.stderr; 32 | var output = ''; 33 | stdout.setEncoding('utf8'); 34 | stderr.setEncoding('utf8'); 35 | stderr.on('data', function(data) { 36 | return callback(data, null); 37 | }); 38 | // buffer the stdout output 39 | stdout.on('data', function(data) { 40 | output += data; 41 | }); 42 | stdout.on('close', function(data) { 43 | return callback(null, output); 44 | }); 45 | }); 46 | } 47 | 48 | function spawnOptions(options) { 49 | var result = []; 50 | result.push('-' + options.mode); 51 | if (options.enc) { 52 | result.push('-enc'); 53 | result.push(options.enc); 54 | } 55 | return result; 56 | } 57 | 58 | /** 59 | * Non-recursive find of all the files in a given directory that end with *.pdf 60 | * @return {Array} files is an array of the absolute paths to the single 61 | * page pdf files. 
Each entry in this array is an object with fields 62 | * and set 63 | * @return callback(, files) 64 | */ 65 | function get_pdfs_in_directory(directory_path, callback) { 66 | var file_paths = []; 67 | var files = null; 68 | var walker = walk.walk(directory_path, { followLinks: false}); 69 | walker.on('file', function(root, stat, next) { 70 | if (stat.name.match(/\.pdf$/i)) { 71 | var file_path = path.join(directory_path, stat.name); 72 | file_paths.push({file_path: file_path, file_name: stat.name}); 73 | next(); 74 | } 75 | }); 76 | walker.on('end', function() { 77 | return callback(null, file_paths); 78 | }); 79 | } 80 | 81 | /** 82 | * Cleanup any single page pdfs on error 83 | */ 84 | function cleanup_directory(directory_path, callback) { 85 | // only remove the folder at directory_path if it exists 86 | fs.exists(directory_path, function (exists) { 87 | if (!exists) { 88 | return callback(); 89 | } 90 | rimraf(directory_path, callback); 91 | }); 92 | } 93 | 94 | /** 95 | * @param {String} file_path absolute path to file on disk 96 | * @return {Function} callback() if file does exist 97 | * callback() if file does not exists 98 | */ 99 | function confirm_file_exists(file_path, callback) { 100 | fs.exists(file_path, function (exists) { 101 | if (!exists) { 102 | return callback('no file at path: ' + file_path); 103 | } 104 | return callback(); 105 | }); 106 | }; -------------------------------------------------------------------------------- /lib/split.js: -------------------------------------------------------------------------------- 1 | /** 2 | * Module which splits multi-pag pdfs into single pages 3 | * Requires the pdftk binary be installed on the system and accessible in the 4 | * current path 5 | */ 6 | var path = require('path'); 7 | var temp = require('temp'); 8 | var exec = require('child_process').exec; 9 | var fs = require('fs'); 10 | var walk = require('walk'); 11 | var async = require('async'); 12 | var rimraf = require('rimraf'); 13 | 14 | /** 15 | * @param pdf_path path to the pdf file on disk 16 | * 17 | * @see get_pdfs_in_directory 18 | * @return {Object} an object with the fields "folder" and "files" set 19 | * files is an array of the absolute paths to the single page pdf files. 
20 | * 21 | * Each entry in this array is an object with fields 22 | * and set 23 | * 24 | * @return callback(, output_paths) 25 | */ 26 | module.exports = function(pdf_path, pdf_password, callback) { 27 | if(!callback || typeof pdf_password === 'function'){ 28 | callback = pdf_password; 29 | } 30 | confirm_file_exists(pdf_path, function (err) { 31 | if (err) { return callback(err); } 32 | 33 | var output_dir = temp.path({},'pdf_pages'); 34 | fs.mkdir(output_dir, function(err) { 35 | if (err) { return callback(err, null); } 36 | // name the files with the upload id and a digit string 37 | // example: "507c3e55c786e2aa6f000005-page00001.pdf" 38 | var output_name = 'page%05d.pdf'; 39 | var output_path = path.join(output_dir, output_name); 40 | var cmd; 41 | if (typeof pdf_password === 'string') { 42 | cmd = 'pdftk "' + pdf_path + '" input_pw "' + pdf_password + '" burst output "' + output_path; 43 | } 44 | else { 45 | cmd = 'pdfseparate ' + pdf_path + ' ' + output_path; 46 | } 47 | var child = exec(cmd, function (err, stdout, stderr) { 48 | if (err) { 49 | var output_err = { 50 | message: 'an error occurred while splitting pdf into single pages with the pdftk burst command', 51 | error: err 52 | } 53 | callback(output_err, null); 54 | return; 55 | } 56 | remove_doc_data(function (err, reply) { 57 | if (err) { return callback(err); } 58 | return get_pdfs_in_directory(output_dir, callback); 59 | }); 60 | }); 61 | }); 62 | }); 63 | } 64 | 65 | 66 | 67 | /** 68 | * Non-recursive find of all the files in a given directory that end with *.pdf 69 | * @return {Object} output an object with the fields "folder" and "files" set 70 | * files is an array of the absolute paths to the single page pdf files. 71 | * 72 | * Each entry in this array is an object with fields 73 | * and set 74 | * 75 | * @return callback(, output) 76 | */ 77 | function get_pdfs_in_directory(directory_path, callback) { 78 | var file_paths = []; 79 | var files = null; 80 | var walker = walk.walk(directory_path, { followLinks: false}); 81 | walker.on('file', function(root, stat, next) { 82 | if (stat.name.match(/\.pdf$/i)) { 83 | var file_path = path.join(directory_path, stat.name); 84 | file_paths.push({file_path: file_path, file_name: stat.name}); 85 | } 86 | next(); 87 | }); 88 | 89 | 90 | walker.on('end', function() { 91 | file_paths.sort(function (a,b) { 92 | if (a.file_name < b.file_name) { 93 | return -1; 94 | } 95 | if (a.file_name == b.file_name) { 96 | return 0; 97 | } 98 | return 1; 99 | }); 100 | var output = { 101 | folder: directory_path, 102 | files: file_paths 103 | } 104 | return callback(null, output); 105 | }); 106 | } 107 | 108 | 109 | /** 110 | * @param {String} file_path absolute path to file on disk 111 | * @return {Function} callback() if file does exist 112 | * callback() if file does not exists 113 | */ 114 | function confirm_file_exists(file_path, callback) { 115 | fs.exists(file_path, function (exists) { 116 | if (!exists) { 117 | return callback('no file at path: ' + file_path); 118 | } 119 | return callback(); 120 | }); 121 | }; 122 | 123 | /** 124 | * pdftk creates a file called doc_data.txt during the burst split process. 
125 | * This file is not needed so remove it now 126 | */ 127 | function remove_doc_data(callback) { 128 | var folder = path.join(__dirname, '..'); 129 | var doc_data_path = path.join(folder, 'doc_data.txt'); 130 | fs.exists(doc_data_path, function (exists) { 131 | if (!exists) { 132 | return callback(); 133 | } 134 | fs.unlink(doc_data_path, callback); 135 | }); 136 | } 137 | -------------------------------------------------------------------------------- /main.js: -------------------------------------------------------------------------------- 1 | /** 2 | * @title Node PDF main.js 3 | * Node PDF allows you to convert pdf files into raw text. The library supports 4 | * text extraction from electronic searchable pdfs. 5 | * 6 | * In addition, the library supports OCR text extract from pdfs which just 7 | * contain scanned images via the tesseract-ocr engine 8 | * 9 | * Multi-page pdfs are supported for both searchable and image pdfs. 10 | * The library returns an array of strings where the string at a given 11 | * index in the output array cooresponds the page in the input pdf document 12 | * 13 | * @author Noah Isaacson 14 | * @date 2012-10-26 15 | */ 16 | var path = require('path'); 17 | var temp = require('temp'); 18 | var exec = require('child_process').exec; 19 | var fs = require('fs'); 20 | var walk = require('walk'); 21 | var async = require('async'); 22 | var rimraf = require('rimraf'); 23 | 24 | var Raw = require('./lib/raw'); 25 | var Electronic = require('./lib/electronic'); 26 | 27 | /** 28 | * To process a pdf, pass in the absolute path to the pdf file on disk 29 | 30 | * @param {Object} params should have the following fields set 31 | * @param {String} params.pdf_path the absolute path to the pdf file on disk 32 | * @param {Boolean} params.clean true if you want the temporary single page pdfs 33 | * @param {Boolean} options.type must be either "ocr" or "text" 34 | * 35 | * @return {Array} text_pages is an array of strings, where each string is the 36 | * extracted text for the matching page index in the pdf document 37 | * @return {Processor} a processor object which will emit events as they occur 38 | */ 39 | module.exports = function(pdf_path, options, cb) { 40 | var err; 41 | var processor = new Raw(); 42 | if (!'pdf_path') { 43 | err = 'you must supply a pdf path as the first parameter' 44 | return cb(err); 45 | } 46 | if (!options) { 47 | err = 'no options supplied. You must supply an options object with the "type" field set' 48 | return cb(err); 49 | } 50 | if (!options.hasOwnProperty('type') || ! options.type) { 51 | err ='error, you must specify the type of extraction you wish to perform in the options object. Allowed values are "ocr" or "text"'; 52 | return cb(err); 53 | } 54 | if (options.type === 'ocr') { 55 | processor = new Raw(); 56 | } 57 | else if (options.type === 'text') { 58 | processor = new Electronic(); 59 | } 60 | else { 61 | err ='error, you must specify the type of extraction you wish to perform in the options object. 
Allowed values are "ocr" or "text"'; 62 | return cb(err);; 63 | } 64 | fs.exists(pdf_path, function (exists) { 65 | if (!exists) { 66 | err = 'no file exists at the path you specified'; 67 | return cb(err); 68 | } 69 | processor.process(pdf_path, options); 70 | cb(); 71 | }); 72 | return processor; 73 | } 74 | -------------------------------------------------------------------------------- /makefile: -------------------------------------------------------------------------------- 1 | test: 2 | mocha --reporter spec 3 | electronic: 4 | mocha $(shell find test -name "*electronic-test.js") --test --reporter spec 5 | searchable: 6 | mocha $(shell find test -name "*searchable-test.js") --test --reporter spec 7 | 8 | ocr: 9 | mocha $(shell find test -name "*ocr-test.js") --test --reporter spec 10 | 11 | .PHONY: test 12 | -------------------------------------------------------------------------------- /package-lock.json: -------------------------------------------------------------------------------- 1 | { 2 | "name": "pdf-extract", 3 | "version": "1.0.12", 4 | "lockfileVersion": 1, 5 | "requires": true, 6 | "dependencies": { 7 | "async": { 8 | "version": "0.1.22", 9 | "resolved": "https://registry.npmjs.org/async/-/async-0.1.22.tgz", 10 | "integrity": "sha1-D8GqoIig4+8Ovi2IMbqw3PiEUGE=" 11 | }, 12 | "commander": { 13 | "version": "0.6.1", 14 | "resolved": "https://registry.npmjs.org/commander/-/commander-0.6.1.tgz", 15 | "integrity": "sha1-+mihT2qUXVTbvlDYzbMyDp47GgY=", 16 | "dev": true 17 | }, 18 | "debug": { 19 | "version": "4.1.1", 20 | "resolved": "https://registry.npmjs.org/debug/-/debug-4.1.1.tgz", 21 | "integrity": "sha512-pYAIzeRo8J6KPEaJ0VWOh5Pzkbw/RetuzehGM7QRRX5he4fPHx2rdKMB256ehJCkX+XRQm16eZLqLNS8RSZXZw==", 22 | "dev": true, 23 | "requires": { 24 | "ms": "^2.1.1" 25 | }, 26 | "dependencies": { 27 | "ms": { 28 | "version": "2.1.2", 29 | "resolved": "https://registry.npmjs.org/ms/-/ms-2.1.2.tgz", 30 | "integrity": "sha512-sGkPx+VjMtmA6MX27oA4FBFELFCZZ4S4XqeGOXCv68tT+jb3vk/RyaKWP0PTKyWtmLSM0b+adUTEvbs1PEaH2w==", 31 | "dev": true 32 | } 33 | } 34 | }, 35 | "diff": { 36 | "version": "1.0.2", 37 | "resolved": "https://registry.npmjs.org/diff/-/diff-1.0.2.tgz", 38 | "integrity": "sha1-Suc/Gu6Nb89ITxoc53zmUdm38Mk=", 39 | "dev": true 40 | }, 41 | "eyespect": { 42 | "version": "0.1.10", 43 | "resolved": "https://registry.npmjs.org/eyespect/-/eyespect-0.1.10.tgz", 44 | "integrity": "sha1-ma6EajAvzK95Dj7LRPR7XzsXaqA=" 45 | }, 46 | "forEachAsync": { 47 | "version": "2.2.1", 48 | "resolved": "https://registry.npmjs.org/forEachAsync/-/forEachAsync-2.2.1.tgz", 49 | "integrity": "sha1-43I/AJA5EOHrSx2zrVG1xkoxn+w=", 50 | "requires": { 51 | "sequence": "2.x" 52 | } 53 | }, 54 | "graceful-fs": { 55 | "version": "1.1.14", 56 | "resolved": "https://registry.npmjs.org/graceful-fs/-/graceful-fs-1.1.14.tgz", 57 | "integrity": "sha1-BweNtfY3f2Mh/Oqu30l94STclGU=", 58 | "optional": true 59 | }, 60 | "growl": { 61 | "version": "1.7.0", 62 | "resolved": "https://registry.npmjs.org/growl/-/growl-1.7.0.tgz", 63 | "integrity": "sha1-3i1mE20ALhErpw8/EMMc98NQsto=", 64 | "dev": true 65 | }, 66 | "jade": { 67 | "version": "0.26.3", 68 | "resolved": "https://registry.npmjs.org/jade/-/jade-0.26.3.tgz", 69 | "integrity": "sha1-jxDXl32NefL2/4YqgbBRPMslaGw=", 70 | "dev": true, 71 | "requires": { 72 | "commander": "0.6.1", 73 | "mkdirp": "0.3.0" 74 | }, 75 | "dependencies": { 76 | "mkdirp": { 77 | "version": "0.3.0", 78 | "resolved": "https://registry.npmjs.org/mkdirp/-/mkdirp-0.3.0.tgz", 79 | "integrity": 
"sha1-G79asbqCevI1dRQ0kEJkVfSB/h4=", 80 | "dev": true 81 | } 82 | } 83 | }, 84 | "mkdirp": { 85 | "version": "0.3.3", 86 | "resolved": "https://registry.npmjs.org/mkdirp/-/mkdirp-0.3.3.tgz", 87 | "integrity": "sha1-WV4lHBNww6aLqyE20ONIuBBa3xM=", 88 | "dev": true 89 | }, 90 | "mocha": { 91 | "version": "1.8.2", 92 | "resolved": "https://registry.npmjs.org/mocha/-/mocha-1.8.2.tgz", 93 | "integrity": "sha1-+2sdB9mPLrpBhUaETDvgcx9145A=", 94 | "dev": true, 95 | "requires": { 96 | "commander": "0.6.1", 97 | "debug": "*", 98 | "diff": "1.0.2", 99 | "growl": "1.7.x", 100 | "jade": "0.26.3", 101 | "mkdirp": "0.3.3", 102 | "ms": "0.3.0" 103 | } 104 | }, 105 | "ms": { 106 | "version": "0.3.0", 107 | "resolved": "https://registry.npmjs.org/ms/-/ms-0.3.0.tgz", 108 | "integrity": "sha1-A+3DSNYT5mpWSGz9rFO8vomcvWE=", 109 | "dev": true 110 | }, 111 | "os-tmpdir": { 112 | "version": "1.0.2", 113 | "resolved": "https://registry.npmjs.org/os-tmpdir/-/os-tmpdir-1.0.2.tgz", 114 | "integrity": "sha1-u+Z0BseaqFxc/sdm/lc0VV36EnQ=" 115 | }, 116 | "pathhash": { 117 | "version": "1.0.0", 118 | "resolved": "https://registry.npmjs.org/pathhash/-/pathhash-1.0.0.tgz", 119 | "integrity": "sha1-904ZwRtXJNLOOQa/G8lnIJ7Ytyc=" 120 | }, 121 | "rimraf": { 122 | "version": "2.0.3", 123 | "resolved": "https://registry.npmjs.org/rimraf/-/rimraf-2.0.3.tgz", 124 | "integrity": "sha1-9QopZecUTpr9mYmC8V33BnMPVqk=", 125 | "requires": { 126 | "graceful-fs": "~1.1" 127 | } 128 | }, 129 | "sequence": { 130 | "version": "2.2.1", 131 | "resolved": "https://registry.npmjs.org/sequence/-/sequence-2.2.1.tgz", 132 | "integrity": "sha1-f1YXiV1ENRwKBH52RGdpBJChawM=" 133 | }, 134 | "should": { 135 | "version": "1.2.2", 136 | "resolved": "https://registry.npmjs.org/should/-/should-1.2.2.tgz", 137 | "integrity": "sha1-DwP3dQZtnqJjJpDJF7EoJPzB1YI=", 138 | "dev": true 139 | }, 140 | "temp": { 141 | "version": "0.8.3", 142 | "resolved": "https://registry.npmjs.org/temp/-/temp-0.8.3.tgz", 143 | "integrity": "sha1-4Ma8TSa5AxJEEOT+2BEDAU38H1k=", 144 | "requires": { 145 | "os-tmpdir": "^1.0.0", 146 | "rimraf": "~2.2.6" 147 | }, 148 | "dependencies": { 149 | "rimraf": { 150 | "version": "2.2.8", 151 | "resolved": "https://registry.npmjs.org/rimraf/-/rimraf-2.2.8.tgz", 152 | "integrity": "sha1-5Dm+Kq7jJzIZUnMPmaiSnk/FBYI=" 153 | } 154 | } 155 | }, 156 | "walk": { 157 | "version": "2.2.1", 158 | "resolved": "https://registry.npmjs.org/walk/-/walk-2.2.1.tgz", 159 | "integrity": "sha1-WtofjknkfUt0Rdi+ei4eYxq0MBY=", 160 | "requires": { 161 | "forEachAsync": "~2.2" 162 | } 163 | } 164 | } 165 | } 166 | -------------------------------------------------------------------------------- /package.json: -------------------------------------------------------------------------------- 1 | { 2 | "name": "pdf-extract", 3 | "engines": "node", 4 | "version": "1.0.12", 5 | "private": false, 6 | "scripts": { 7 | "test": "node_modules/.bin/mocha --reporter spec" 8 | }, 9 | "repository": { 10 | "type": "git", 11 | "url": "https://github.com/nisaacson/pdf-extract.git" 12 | }, 13 | "main": "main.js", 14 | "folders": "lib", 15 | "dependencies": { 16 | "eyespect": "~0.1.8", 17 | "async": "~0.1.22", 18 | "temp": "~0.8.3", 19 | "walk": "~2.2.1", 20 | "rimraf": "~2.0.2", 21 | "pathhash": "~1.0.0" 22 | }, 23 | "devDependencies": { 24 | "should": "~1.2.1", 25 | "mocha": "~1.8.1" 26 | } 27 | } 28 | -------------------------------------------------------------------------------- /share/configs/alphanumeric: -------------------------------------------------------------------------------- 1 
| tessedit_char_whitelist !"#$%&'()*+,-./0123456789:;<=>?@ABCDEFGHIJKLMNOPQRSTUVWXYZ[\]^_`abcdefghijklmnopqrstuvwxyz{|}~ 2 | -------------------------------------------------------------------------------- /share/dia.traineddata: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/nisaacson/pdf-extract/d334b3a497de830ca825f1fdd94e9588e890563f/share/dia.traineddata -------------------------------------------------------------------------------- /share/eng.traineddata: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/nisaacson/pdf-extract/d334b3a497de830ca825f1fdd94e9588e890563f/share/eng.traineddata -------------------------------------------------------------------------------- /test/01_command-test.js: -------------------------------------------------------------------------------- 1 | var inspect = require('eyespect').inspector({maxLength:20000}); 2 | var should = require('should'); 3 | var async = require('async'); 4 | var exec = require('child_process').exec; 5 | 6 | describe('01 Command Test', function() { 7 | it('should have ghostscript (gs) binary on path', function(done) { 8 | var cmd = 'which gs'; 9 | var child = exec(cmd, function (err, stdout, stderr) { 10 | should.not.exist(err, 'ghostscript not available. You will not be able to perform ocr and extract text from pdfs with scanned image. To get convert install GhostScript on your system'); 11 | stderr.length.should.equal(0); 12 | should.exist(stdout); 13 | stdout.length.should.be.above(8); 14 | done(); 15 | }); 16 | }); 17 | it('should have pdftotext binary on path', function(done) { 18 | var cmd = 'which pdftotext'; 19 | var child = exec(cmd, function (err, stdout, stderr) { 20 | should.not.exist(err, 'pdftotext not available. You will not be able to extract text from electronic searchable pdf files without the pdftotext library installed on your system'); 21 | stderr.length.should.equal(0); 22 | should.exist(stdout); 23 | stdout.length.should.be.above(8); 24 | done(); 25 | }); 26 | }); 27 | 28 | it('should have tesseract binary on path', function(done) { 29 | var cmd = 'which tesseract'; 30 | var child = exec(cmd, function (err, stdout, stderr) { 31 | should.not.exist(err, 'tesseract not available. 
You will not be able to perform ocr and extract from pdfs with scanned images.'); 32 | stderr.length.should.equal(0); 33 | should.exist(stdout); 34 | stdout.length.should.be.above(8); 35 | done(); 36 | }); 37 | }); 38 | 39 | }); -------------------------------------------------------------------------------- /test/02_split-test.js: -------------------------------------------------------------------------------- 1 | var inspect = require('eyespect').inspector({maxLength:20000}); 2 | var path = require('path'); 3 | var should = require('should'); 4 | var assert = require('assert') 5 | var fs = require('fs'); 6 | var async = require('async'); 7 | var split = require('../lib/split.js'); 8 | 9 | describe('02 Split Test', function() { 10 | it('should split multi-page pdf in single page pdf files', function(done) { 11 | this.timeout(10*1000); 12 | this.slow(2*1000); 13 | var file_name = 'multipage_searchable.pdf'; 14 | var relative_path = path.join('test_data',file_name); 15 | var pdf_path = path.join(__dirname, relative_path); 16 | split(pdf_path, function (err, output) { 17 | should.not.exist(err); 18 | should.exist(output); 19 | output.should.have.property('folder'); 20 | output.should.have.property('files'); 21 | var files = output.files; 22 | files.length.should.equal(8, 'wrong number of pages after splitting searchable pdf with name: ' + file_name); 23 | // make sure each file entry in files exists 24 | async.forEach( 25 | files, 26 | function (file, cb) { 27 | file.should.have.property('file_name'); 28 | file.should.have.property('file_path'); 29 | fs.exists(file.file_path, function (exists) { 30 | assert.ok(exists, 'file does not exist like it should at path: ' + file.file_path); 31 | cb(); 32 | }); 33 | }, 34 | function (err) { 35 | should.not.exist(err); 36 | done(); 37 | } 38 | ); 39 | }); 40 | }); 41 | 42 | it('should split single page pdf into a new single page pdf files', function(done) { 43 | this.timeout(10*1000); 44 | this.slow(2*1000); 45 | var file_name = 'single_page_searchable.pdf'; 46 | var relative_path = path.join('test_data',file_name); 47 | var pdf_path = path.join(__dirname, relative_path); 48 | split(pdf_path, function (err, output) { 49 | should.not.exist(err); 50 | should.exist(output); 51 | output.should.have.property('folder'); 52 | output.should.have.property('files'); 53 | var files = output.files; 54 | files.length.should.equal(1, 'wrong number of pages after splitting searchable pdf with name: ' + file_name); 55 | // make sure each file entry in files exists 56 | async.forEach( 57 | files, 58 | function (file, cb) { 59 | file.should.have.property('file_name'); 60 | file.should.have.property('file_path'); 61 | fs.exists(file.file_path, function (exists) { 62 | assert.ok(exists, 'file does not exist like it should at path: ' + file.file_path); 63 | cb(); 64 | }); 65 | }, 66 | function (err) { 67 | should.not.exist(err); 68 | done(); 69 | } 70 | ); 71 | }); 72 | }); 73 | }); -------------------------------------------------------------------------------- /test/03_searchable-test.js: -------------------------------------------------------------------------------- 1 | var inspect = require('eyespect').inspector({maxLength:20000}); 2 | var path = require('path'); 3 | var should = require('should'); 4 | var fs = require('fs'); 5 | var assert = require('assert'); 6 | var async = require('async'); 7 | var pathhash = require('pathhash'); 8 | var pdf = require('../main'); 9 | 10 | describe('03 Searchable Test', function() { 11 | var file_name = 
'single_page_searchable.pdf'; 12 | var relative_path = path.join('test_data',file_name); 13 | var pdf_path = path.join(__dirname, relative_path); 14 | var hash; 15 | before(function(done) { 16 | pathhash(pdf_path, function (err, reply) { 17 | should.not.exist(err, 'error getting sha1 hash of pdf file at path: ' + pdf_path + '. ' + err); 18 | should.exist(reply, 'error getting sha1 hash of pdf file at path: ' + pdf_path + '. No hash returned from hashDataAtPath'); 19 | hash = reply; 20 | done(); 21 | }); 22 | }); 23 | it('should return an error when not passing a type for a searchable pdf', function(done) { 24 | this.timeout(10*1000); 25 | this.slow(5*1000); 26 | var options = { 27 | } 28 | pdf(pdf_path, options, function (err, extract) { 29 | should.exist(err,'error should be returned'); 30 | should.not.exist(extract); 31 | done(); 32 | }); 33 | }); 34 | 35 | it('should extract text from electronic searchable pdf', function(done) { 36 | this.timeout(10*1000); 37 | this.slow(5*1000); 38 | var options = { 39 | type: 'text' 40 | } 41 | var processor = pdf(pdf_path, options, function (err) { 42 | should.not.exist(err); 43 | }); 44 | 45 | processor.on('error', function (err) { 46 | should.not.exist(err); 47 | assert.ok(false, 'error during processing'); 48 | 49 | }); 50 | processor.on('complete', function (data) { 51 | data.should.have.property('text_pages'); 52 | data.should.have.property('hash'); 53 | if (data.hash !== hash) { 54 | return; 55 | } 56 | data.text_pages.length.should.eql(1); 57 | data.text_pages[0].length.should.be.above(20); 58 | done(); 59 | }); 60 | }); 61 | }); 62 | -------------------------------------------------------------------------------- /test/04_electronic-test.js: -------------------------------------------------------------------------------- 1 | /** 2 | * Tests extraction for a multi-page searchable pdf file 3 | */ 4 | var inspect = require('eyespect').inspector({maxLength:20000}); 5 | var path = require('path'); 6 | var should = require('should'); 7 | var assert = require('assert'); 8 | var fs = require('fs'); 9 | var async = require('async'); 10 | 11 | var pdf = require('../main.js'); 12 | 13 | var get_desired_text = function(text_file_name, callback) { 14 | var relative_path = path.join('test_data',text_file_name); 15 | var text_file_path = path.join(__dirname, relative_path); 16 | fs.readFile(text_file_path, 'utf8', function (err, reply) { 17 | should.not.exist(err); 18 | should.exist(reply); 19 | return callback(err, reply); 20 | }); 21 | } 22 | describe('04 Multipage searchable test', function() { 23 | it('should extract array of text pages from multipage searchable pdf', function(done) { 24 | this.timeout(10*1000); 25 | this.slow(2*1000); 26 | var file_name = 'multipage_searchable.pdf'; 27 | var relative_path = path.join('test_data',file_name); 28 | var pdf_path = path.join(__dirname, relative_path); 29 | var options = { 30 | type: 'text', 31 | }; 32 | var processor = pdf(pdf_path, options, function (err) { 33 | should.not.exist(err); 34 | }); 35 | 36 | processor.on('complete', function(data) { 37 | data.should.have.property('text_pages'); 38 | data.should.have.property('single_page_pdf_file_paths'); 39 | data.single_page_pdf_file_paths.length.should.equal(8,'wrong number of single_page_pdf_file_paths returned'); 40 | data.should.have.property('pdf_path'); 41 | data.text_pages.length.should.equal(8, 'wrong number of pages after extracting from mulitpage searchable pdf with name: ' + file_name); 42 | assert.ok(page_event_fired, 'never received a "page" 
event like we should have'); 43 | for (var index in data.text_pages) { 44 | var page = data.text_pages[index]; 45 | page.length.should.be.above(0, 'no text on page at index: ' + index); 46 | } 47 | 48 | done(); 49 | }); 50 | var page_event_fired = false; 51 | processor.on('error', function(data) { 52 | assert.ok(false, 'error occurred during processing'); 53 | }); 54 | processor.on('page', function(data) { 55 | page_event_fired = true; 56 | data.should.have.property('index'); 57 | data.should.have.property('pdf_path'); 58 | data.should.have.property('text'); 59 | data.pdf_path.should.eql(pdf_path); 60 | data.text.length.should.above(0); 61 | }); 62 | }); 63 | }); 64 | -------------------------------------------------------------------------------- /test/05_convert-test.js: -------------------------------------------------------------------------------- 1 | var inspect = require('eyespect').inspector({maxLength:20000}); 2 | var path = require('path'); 3 | var should = require('should'); 4 | var assert = require('assert'); 5 | var fs = require('fs'); 6 | var async = require('async'); 7 | 8 | var convert = require('../lib/convert.js'); 9 | 10 | var get_desired_text = function(text_file_name, callback) { 11 | var relative_path = path.join('test_data',text_file_name); 12 | var text_file_path = path.join(__dirname, relative_path); 13 | fs.readFile(text_file_path, 'utf8', function (err, reply) { 14 | should.not.exist(err); 15 | should.exist(reply); 16 | return callback(err, reply); 17 | }); 18 | } 19 | describe('05 Convert Test', function() { 20 | it('should convert raw single page pdf to tif file', function(done) { 21 | return done(); // test is short-circuited by this early return and never runs; remove it to exercise the ghostscript conversion 22 | this.timeout(10*1000); 23 | var file_name = 'single_page_raw.pdf'; 24 | var relative_path = path.join('test_data',file_name); 25 | var pdf_path = path.join(__dirname, relative_path); 26 | fs.exists(pdf_path, function (exists) { 27 | assert.ok(exists, 'file does not exist like it should at path: ' + pdf_path); 28 | convert(pdf_path, function (err, tif_path) { 29 | should.not.exist(err); 30 | should.exist(tif_path); 31 | fs.exists(tif_path, function (exists) { 32 | assert.ok(exists, 'tif file does not exist like it should at path: ' + tif_path); 33 | done(); 34 | }); 35 | }); 36 | }); 37 | }); 38 | }); 39 | -------------------------------------------------------------------------------- /test/06_ocr-test.js: -------------------------------------------------------------------------------- 1 | var inspect = require('eyespect').inspector({maxLength:20000}); 2 | var path = require('path'); 3 | var should = require('should'); 4 | var assert = require('assert'); 5 | var fs = require('fs'); 6 | var ocr = require('../lib/ocr.js'); 7 | 8 | describe('06 OCR Test', function() { 9 | it('should extract text from tif file via tesseract ocr', function(done) { 10 | this.timeout(100*1000); 11 | this.slow(20*1000); 12 | var file_name = 'single_page_raw.tif'; 13 | var relative_path = path.join('test_data',file_name); 14 | var tif_path = path.join(__dirname, relative_path); 15 | fs.exists(tif_path, function (exists) { 16 | assert.ok(exists, 'tif file does not exist like it should at path: ' + tif_path); 17 | ocr(tif_path, function (err, extract) { 18 | should.not.exist(err); 19 | should.exist(extract); 20 | extract.length.should.be.above(20, 'wrong ocr output'); 21 | done(); 22 | }); 23 | }); 24 | }); 25 | 26 | it('should ocr tif file using custom language file', function(done) { 27 | this.timeout(100*1000); 28 | this.slow(20*1000); 29 | var file_name = 'single_page_raw.tif';
30 | var relative_path = path.join('test_data',file_name); 31 | var tif_path = path.join(__dirname, relative_path); 32 | fs.exists(tif_path, function (exists) { 33 | assert.ok(exists, 'tif file does not exist like it should at path: ' + tif_path); 34 | var options = [ 35 | '-psm 1', 36 | '-l dia', 37 | 'alphanumeric' 38 | ] 39 | ocr(tif_path, options, function (err, extract) { 40 | should.not.exist(err); 41 | should.exist(extract); 42 | extract.length.should.be.above(20, 'wrong ocr output'); 43 | done(); 44 | }); 45 | }); 46 | }); 47 | 48 | }); 49 | -------------------------------------------------------------------------------- /test/07_raw-test.js: -------------------------------------------------------------------------------- 1 | /** 2 | * Tests ocr extraction for a multi-page raw scan pdf file 3 | */ 4 | var assert = require('assert'); 5 | var inspect = require('eyespect').inspector({maxLength:20000}); 6 | var path = require('path'); 7 | var should = require('should'); 8 | var fs = require('fs'); 9 | var async = require('async'); 10 | 11 | var pdf = require('../main.js'); 12 | var pathHash = require('pathhash'); 13 | 14 | var get_desired_text = function(text_file_name, callback) { 15 | var relative_path = path.join('test_data',text_file_name); 16 | var text_file_path = path.join(__dirname, relative_path); 17 | fs.readFile(text_file_path, 'utf8', function (err, reply) { 18 | should.not.exist(err); 19 | should.exist(reply); 20 | return callback(err, reply); 21 | }); 22 | } 23 | describe('07 Multipage raw test', function() { 24 | var file_name = 'multipage_raw.pdf'; 25 | var relative_path = path.join('test_data',file_name); 26 | var pdf_path = path.join(__dirname, relative_path); 27 | var options = { 28 | type: 'ocr', 29 | clean: false // keep the temporary single page pdf files 30 | }; 31 | 32 | var hash; 33 | before(function(done) { 34 | pathHash(pdf_path, function (err, reply) { 35 | should.not.exist(err, 'error getting sha1 hash of pdf file at path: ' + pdf_path + '. ' + err); 36 | should.exist(reply, 'error getting sha1 hash of pdf file at path: ' + pdf_path + '. 
No hash returned from hashDataAtPath'); 37 | hash = reply; 38 | done(); 39 | }); 40 | }); 41 | 42 | it('should extract array of text pages from multipage raw scan pdf', function(done) { 43 | console.log('\nPlease be patient, this test may take a minute or more to complete'); 44 | this.timeout(240*1000); 45 | this.slow(120*1000); 46 | var processor = pdf(pdf_path, options, function (err) { 47 | should.not.exist(err); 48 | }); 49 | processor.on('complete', function(data) { 50 | data.should.have.property('text_pages'); 51 | data.should.have.property('pdf_path'); 52 | data.should.have.property('single_page_pdf_file_paths'); 53 | data.text_pages.length.should.eql(2, 'wrong number of pages after extracting from multipage raw pdf with name: ' + file_name); 54 | 55 | assert.ok(page_event_fired, 'no "page" event fired'); 56 | async.forEach( 57 | data.single_page_pdf_file_paths, 58 | function (file_path, cb) { 59 | fs.exists(file_path, function (exists) { 60 | assert.ok(exists,'no single page pdf file exists at the path: ' + file_path); 61 | cb(); 62 | }); 63 | }, 64 | function (err) { 65 | should.not.exist(err, 'error in raw processing: ' + err); 66 | done(); 67 | } 68 | ); 69 | }); 70 | processor.on('log', function(data) { 71 | inspect(data, 'log data'); 72 | }); 73 | var page_event_fired = false; 74 | processor.on('page', function(data) { 75 | page_event_fired = true; 76 | data.should.have.property('index'); 77 | data.should.have.property('pdf_path'); 78 | data.should.have.property('text'); 79 | data.pdf_path.should.eql(pdf_path); 80 | data.text.length.should.above(0); 81 | }); 82 | }); 83 | 84 | it('should ocr raw scan using custom language in ocr_flags', function (done) { 85 | this.timeout(240*1000); 86 | this.slow(120*1000); 87 | var ocr_flags = [ 88 | '-psm 1', 89 | '-l dia', 90 | 'alphanumeric' 91 | ]; 92 | 93 | inspect('Please be patient, this test may take a minute or more to complete'); 94 | 95 | options.ocr_flags = ocr_flags; 96 | var processor = pdf(pdf_path, options, function (err) { 97 | should.not.exist(err); 98 | }); 99 | processor.on('error', function (err){ 100 | should.not.exist(err); 101 | assert.ok(false, 'error during raw processing'); 102 | }); 103 | processor.on('log', function(data) { 104 | inspect(data, 'log event'); 105 | }); 106 | 107 | processor.on('complete', function (data) { 108 | data.should.have.property('text_pages'); 109 | data.should.have.property('hash'); 110 | data.should.have.property('pdf_path'); 111 | data.should.have.property('single_page_pdf_file_paths'); 112 | if (hash !== data.hash) { 113 | return; 114 | } 115 | data.text_pages.length.should.eql(2, 'wrong number of pages after extracting from multipage raw pdf with name: ' + file_name); 116 | async.forEach( 117 | data.single_page_pdf_file_paths, 118 | function (file_path, cb) { 119 | fs.exists(file_path, function (exists) { 120 | assert.ok(exists, 'single page pdf file not found at path: ' + file_path); 121 | cb(); 122 | }); 123 | }, 124 | function (err) { 125 | should.not.exist(err, 'error in raw processing: ' + err); 126 | done(); 127 | } 128 | ); 129 | }); 130 | }); 131 | }); 132 | -------------------------------------------------------------------------------- /test/npm-debug.log: -------------------------------------------------------------------------------- 1 | 0 info it worked if it ends with ok 2 | 1 verbose cli [ 'node', '/usr/local/bin/npm', 'test' ] 3 | 2 info using npm@1.1.62 4 | 3 info using node@v0.8.11 5 | 4 verbose read json
/Users/noah/Sites/node/DocParse/lib/pdf/test/package.json 6 | 5 error Error: ENOENT, open '/Users/noah/Sites/node/DocParse/lib/pdf/test/package.json' 7 | 6 error If you need help, you may report this log at: 8 | 6 error 9 | 6 error or email it to: 10 | 6 error 11 | 7 error System Darwin 11.4.0 12 | 8 error command "node" "/usr/local/bin/npm" "test" 13 | 9 error cwd /Users/noah/Sites/node/DocParse/lib/pdf/test 14 | 10 error node -v v0.8.11 15 | 11 error npm -v 1.1.62 16 | 12 error path /Users/noah/Sites/node/DocParse/lib/pdf/test/package.json 17 | 13 error code ENOENT 18 | 14 error errno 34 19 | 15 verbose exit [ 34, true ] 20 | -------------------------------------------------------------------------------- /test/test_data/multipage_raw.pdf: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/nisaacson/pdf-extract/d334b3a497de830ca825f1fdd94e9588e890563f/test/test_data/multipage_raw.pdf -------------------------------------------------------------------------------- /test/test_data/multipage_searchable.pdf: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/nisaacson/pdf-extract/d334b3a497de830ca825f1fdd94e9588e890563f/test/test_data/multipage_searchable.pdf -------------------------------------------------------------------------------- /test/test_data/single_page_raw.pdf: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/nisaacson/pdf-extract/d334b3a497de830ca825f1fdd94e9588e890563f/test/test_data/single_page_raw.pdf -------------------------------------------------------------------------------- /test/test_data/single_page_raw.tif: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/nisaacson/pdf-extract/d334b3a497de830ca825f1fdd94e9588e890563f/test/test_data/single_page_raw.tif -------------------------------------------------------------------------------- /test/test_data/single_page_searchable.pdf: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/nisaacson/pdf-extract/d334b3a497de830ca825f1fdd94e9588e890563f/test/test_data/single_page_searchable.pdf -------------------------------------------------------------------------------- /test/test_data/single_page_searchable.txt: -------------------------------------------------------------------------------- 1 | Sampling 2 | by David A. Freedman 3 | Department of Statistics 4 | University of California 5 | Berkeley, CA 94720 6 | 7 | The basic idea in sampling is extrapolation from the part to the 8 | whole--from "the sample" to "the population." (The population is some- 9 | times rather mysteriously called "the universe.") There is an immediate 10 | corollary: the sample must be chosen to fairly represent the population. 11 | Methods for choosing samples are called "designs." Good designs in- 12 | volve the use of probability methods, minimizing subjective judgment in 13 | the choice of units to survey. Samples drawn using probability methods 14 | are called "probability samples." 15 | Bias is a serious problem in applied work; probability samples min- 16 | imize bias. As it turns out, however, methods used to extrapolate from a 17 | probability sample to the population should take into account the method 18 | used to draw the sample; otherwise, bias may come in through the back 19 | door. 
The ideas will be illustrated for sampling people or business records, 20 | but apply more broadly. There are sample surveys of buildings, farms, law 21 | cases, schools, trees, trade union locals, and many other populations. 22 | 23 | SAMPLE DESIGN 24 | Probability samples should be distinguished from "samples of con- 25 | venience" (also called "grab samples"). A typical sample of convenience 26 | comprises the investigator's students in an introductory course. A "mall 27 | sample" consists of the people willing to be interviewed on certain days 28 | at certain shopping centers. This too is a convenience sample. The reason 29 | for the nomenclature is apparent, and so is the downside: the sample may 30 | not represent any definable population larger than itself. 31 | To draw a probability sample, we begin by identifying the population 32 | of interest. The next step is to create the "sampling frame," a list of 33 | units to be sampled. One easy design is "simple random sampling." For 34 | instance, to draw a simple random sample of 100 units, choose one unit 35 | at random from the frame; put this unit into the sample; choose another 36 | unit at random from the remaining ones in the frame; and so forth. Keep 37 | going until 100 units have been chosen. At each step along the way, all 38 | units in the pool have the same chance of being chosen. 39 | --------------------------------------------------------------------------------