├── .coveralls.yaml ├── .gitignore ├── .jshintrc ├── .travis.yml ├── BUGS.md ├── CONTRIBUTING.md ├── Gruntfile.js ├── LICENSE ├── README.md ├── bin └── quickscrape.js ├── docs ├── screenshot_log_multiple_url.png └── screenshot_log_single_url.png ├── lib ├── eventparse.js ├── loglevels.js └── outformat.js ├── package.json ├── test ├── eventParseSpec.js └── test.js └── test_vagrants └── one ├── .vagrant └── machines │ └── default │ └── virtualbox │ ├── action_provision │ ├── action_set_name │ ├── id │ ├── index_uuid │ └── synced_folders └── Vagrantfile /.coveralls.yaml: -------------------------------------------------------------------------------- 1 | service_name: travis-ci 2 | repo_token: k1iKWD40ZFhVwzyqd31XHSmVBdimLmh54 3 | -------------------------------------------------------------------------------- /.gitignore: -------------------------------------------------------------------------------- 1 | node_modules 2 | npm-debug.log 3 | coverage 4 | # test output files 5 | dryrun 6 | output 7 | .DS_Store 8 | -------------------------------------------------------------------------------- /.jshintrc: -------------------------------------------------------------------------------- 1 | { 2 | "curly": true, 3 | "eqeqeq": true, 4 | "immed": true, 5 | "latedef": "nofunc", 6 | "newcap": true, 7 | "noarg": true, 8 | "sub": true, 9 | "undef": true, 10 | "unused": true, 11 | "boss": true, 12 | "eqnull": true, 13 | "node": true 14 | } 15 | -------------------------------------------------------------------------------- /.travis.yml: -------------------------------------------------------------------------------- 1 | language: node_js 2 | node_js: 3 | - '0.10' 4 | before_script: 5 | - npm install -g grunt-cli 6 | after_script: 7 | - npm run coveralls 8 | -------------------------------------------------------------------------------- /BUGS.md: -------------------------------------------------------------------------------- 1 | # bugs 2 | 3 | Most bugs should be reported to the issue tracker. However, some are created by other packages such as Spooky and require 4 | per-installation workarounds. 5 | 6 | ### quickscrape/tiny-jsonrpc bug 7 | 8 | The details will differ according to where `node` is installed. Here's PMR's: 9 | ``` 10 | Error: Cannot find module '/usr/local/n/versions/node/6.2.1/lib/node_modules/quickscrape/node_modules/spooky/lib/../node_modules/tiny-jsonrpc/lib/tiny-jsonrpc' so moving on to next url in list 11 | Unsafe JavaScript attempt to access frame with URL about:blank from frame with URL file:///usr/local/n/versions/node/6.2.1/lib/node_modules/quickscrape/node_modules/casperjs/bin/bootstrap.js. Domains, protocols and ports must match. 12 | /usr/local/n/versions/node/6.2.1/lib/node_modules/quickscrape/node_modules/eventemitter2/lib/eventemitter2.js:290 13 | throw arguments[1]; // Unhandled 'error' event 14 | ^ 15 | 16 | Error: Child terminated with non-zero exit code 1 17 | at Spooky.<anonymous> 
(/usr/local/n/versions/node/6.2.1/lib/node_modules/quickscrape/node_modules/spooky/lib/spooky.js:210:17) 18 | at emitTwo (events.js:106:13) 19 | at ChildProcess.emit (events.js:191:7) 20 | at Process.ChildProcess._handle.onexit (internal/child_process.js:204:12) 21 | ``` 22 | Find where your quickscrape is installed: 23 | ``` 24 | which quickscrape 25 | # gives: 26 | /usr/local/n/versions/node/6.2.1/bin/quickscrape 27 | # so the top-level dir is 28 | /usr/local/n/versions/node/6.2.1/ 29 | # others might have 30 | /home/$USER/.nvm/versions/node/v6.3.1 31 | 32 | ``` 33 | Then copy the files from the `lib` directory (adjusting the path to your installation): 34 | ``` 35 | cd /usr/local/n/versions/node/6.2.1/lib/node_modules/quickscrape/ 36 | cp -r node_modules/tiny-jsonrpc node_modules/spooky/node_modules 37 | ``` 38 | -------------------------------------------------------------------------------- /CONTRIBUTING.md: -------------------------------------------------------------------------------- 1 | # Contributing to quickscrape 2 | 3 | Thank you for taking the time to contribute! :+1: 4 | 5 | This is a set of guidelines for contributing to quickscrape. You don't need to follow them as strict rules; use your best judgement, and feel free to propose changes to this document as well via a pull request. 6 | 7 | #### Table of Contents 8 | 9 | [Basics](#basics) 10 | 11 | [How can I contribute?](#how-can-i-contribute) 12 | 13 | [Local testing](#local-testing) 14 | 15 | ## Basics 16 | 17 | quickscrape is based on Node.js. If you want an introduction on how to work on a project like this, you can find a comprehensive tutorial [here](http://www.nodebeginner.org/). 18 | 19 | ## How can I contribute? 20 | 21 | ### Report bugs 22 | 23 | If you encounter a bug, please let us know. You can raise a new issue [here](https://github.com/ContentMine/quickscrape/issues). Please include as much information in your report as possible, to help maintainers reproduce the problem. 24 | 25 | * A clear and descriptive title 26 | * Describe the exact steps which reproduce the problem, e.g. the query you entered. 27 | * Describe the behaviour following those steps, and where the problem occurred. 28 | * Explain how it differed from what you expected to happen. 29 | * Attach additional information to the report, such as error messages, or corrupted files. 30 | * Add a `bug` label to the issue. 31 | 32 | Before submitting a bug, please check the [list of existing bugs](https://github.com/ContentMine/quickscrape/issues?q=is%3Aopen+is%3Aissue+label%3Abug) to see whether a similar issue is already open. You can then help by adding your information to an existing report. 33 | 34 | ### Fixing bugs or implementing new features 35 | 36 | If you're not sure where to start, have a look at issues that have a `help wanted` label - here is a [list](https://github.com/ContentMine/quickscrape/issues?utf8=%E2%9C%93&q=is%3Aopen+is%3Aissue+label%3A%22help+wanted%22+). 37 | 38 | ### Suggesting features or changes 39 | 40 | There is always room for improvement and we'd like to hear your perspective on it. 41 | 42 | Before creating a pull request, please raise an issue to discuss the proposed changes first. We can then make sure to make the best use of your efforts. 43 | 44 | ## Local testing 45 | 46 | In order to set up your development environment for quickscrape, you need to install [Node.js](https://nodejs.org/en/). 47 | 48 | 1. Create a fork on [github](https://help.github.com/articles/fork-a-repo/). 49 | 50 | 1. 
Create a [new branch](https://www.atlassian.com/git/tutorials/using-branches/git-checkout) with a descriptive name. 51 | 52 | 1. Work on your changes, and make regular commits to save them. 53 | 54 | 1. Test your changes by running `npm install` within the repository and running quickscrape locally with `node bin/quickscrape.js`. 55 | 56 | 1. When your changes work as intended, push them to your repository and [create a pull request](https://www.atlassian.com/git/tutorials/making-a-pull-request). 57 | 58 | 1. We will then review the pull request and merge it as soon as possible. If problems arise, they will be discussed within the pull request. 59 | -------------------------------------------------------------------------------- /Gruntfile.js: -------------------------------------------------------------------------------- 1 | 'use strict'; 2 | 3 | module.exports = function(grunt) { 4 | 5 | // Project configuration. 6 | grunt.initConfig({ 7 | nodeunit: { 8 | files: ['test/**/*_test.js'], 9 | }, 10 | watch: { 11 | lib: { 12 | tasks: ['nodeunit'] 13 | }, 14 | test: { 15 | tasks: ['nodeunit'] 16 | } 17 | } 18 | }); 19 | 20 | // These plugins provide necessary tasks. 21 | grunt.loadNpmTasks('grunt-contrib-nodeunit'); 22 | grunt.loadNpmTasks('grunt-contrib-watch'); 23 | 24 | // Default task. 25 | grunt.registerTask('default', ['nodeunit']); 26 | 27 | }; 28 | -------------------------------------------------------------------------------- /LICENSE: -------------------------------------------------------------------------------- 1 | Copyright (c) 2014 Richard Smith-Unna 2 | 3 | Permission is hereby granted, free of charge, to any person 4 | obtaining a copy of this software and associated documentation 5 | files (the "Software"), to deal in the Software without 6 | restriction, including without limitation the rights to use, 7 | copy, modify, merge, publish, distribute, sublicense, and/or sell 8 | copies of the Software, and to permit persons to whom the 9 | Software is furnished to do so, subject to the following 10 | conditions: 11 | 12 | The above copyright notice and this permission notice shall be 13 | included in all copies or substantial portions of the Software. 14 | 15 | THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, 16 | EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES 17 | OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND 18 | NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT 19 | HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, 20 | WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING 21 | FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR 22 | OTHER DEALINGS IN THE SOFTWARE. 
23 | -------------------------------------------------------------------------------- /README.md: -------------------------------------------------------------------------------- 1 | # quickscrape [![NPM version](https://badge.fury.io/js/quickscrape.svg)][npm] [![license MIT](http://b.repl.ca/v1/license-MIT-brightgreen.png)][license] [![Downloads](http://img.shields.io/npm/dm/quickscrape.svg)][downloads] [![Build Status](https://secure.travis-ci.org/ContentMine/quickscrape.png?branch=master)][travis] 2 | 3 | [npm]: http://badge.fury.io/js/quickscrape 4 | [travis]: http://travis-ci.org/ContentMine/quickscrape 5 | [coveralls]: https://coveralls.io/r/ContentMine/quickscrape 6 | [gemnasium]: https://gemnasium.com/ContentMine/quickscrape 7 | [license]: https://github.com/ContentMine/quickscrape/blob/master/LICENSE-MIT 8 | [downloads]: https://nodei.co/npm/quickscrape 9 | 10 | `quickscrape` is a simple command-line tool for powerful, modern website scraping. 11 | 12 | ### Table of Contents 13 | 14 | - [Description](#description) 15 | - [Installation](#installation) 16 | - [Documentation](#documentation) 17 | - [Examples](#examples) 18 | - [1. Extract data from a single URL with a predefined scraper](#1-extract-data-from-a-single-url-with-a-predefined-scraper) 19 | - [Contributing](#contributing) 20 | - [Release History](#release-history) 21 | - [License](#license) 22 | 23 | ### Description 24 | 25 | `quickscrape` is not like other scraping tools. It is designed to enable large-scale content mining. Here's what makes it different: 26 | 27 | Websites can be rendered in a GUI-less browser ([PhantomJS](http://phantomjs.org) via [CasperJS](http://casperjs.org)). This has some important benefits: 28 | 29 | - Many modern websites carry only a skeleton of content in their HTML and are rendered with Javascript after the page is loaded. Headless browsing ensures the version of the HTML you scrape is the same one human visitors would see on their screen. 30 | - User interactions can be simulated. This is useful whenever content is only loaded after interaction, for example when article content is gradually loaded by AJAX during scrolling. 31 | - The full DOM specification is supported (because the backend is WebKit). This means pages with complex Javascript that uses rare parts of the DOM (for example, Facebook) can be rendered, which is not possible in most existing tools. 32 | 33 | Scrapers are defined in separate JSON files that follow a defined structure ([scraperJSON](https://github.com/ContentMine/scraperJSON)); a minimal example definition is sketched at the end of this section. This too has important benefits: 34 | 35 | - No programming required! Non-programmers can make scrapers using a text editor and a web browser with an element inspector (e.g. Chrome). 36 | - Large collections of scrapers can be maintained to retrieve similar sets of information from different pages. For example: newspapers or academic journals. 37 | - Any other software supporting the same format could use the same scraper definitions. 38 | 39 | `quickscrape` is being developed to allow the community early access to the technology that will drive [ContentMine](http://contentmine.org), such as [ScraperJSON](https://github.com/ContentMine/journal-scrapers) and our Node.js scraping library [thresher](https://github.com/ContentMine/thresher). 40 | 41 | The software is under rapid development, so please be aware there may be bugs. If you find one, please report it on the [issue tracker](https://github.com/ContentMine/quickscrape/issues). 
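As a rough illustration of the scraperJSON idea described above, here is a minimal, hypothetical definition: the URL pattern, element names and selectors below are invented for this sketch, and the real, maintained definitions live in the [journal-scrapers](https://github.com/ContentMine/journal-scrapers) repository. A scraper declares which URLs it applies to and an XPath selector for each named element to capture.

```json
{
  "url": "example\\.com",
  "elements": {
    "title": {
      "selector": "//meta[@name='citation_title']",
      "attribute": "content"
    },
    "fulltext_pdf": {
      "selector": "//meta[@name='citation_pdf_url']",
      "attribute": "content",
      "download": true
    }
  }
}
```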
42 | 43 | ### Installation 44 | 45 | #### Prerequisites 46 | 47 | You'll need [Node.js](http://nodejs.org) (`node`), a platform which enables standalone JavaScript apps. You'll also need the Node package manager (`npm`), which usually comes with Node.js. Installing Node.js via the operating system's package manager often leads to issues: if your existing installation requires `sudo` to install node packages, it was set up the wrong way. The easiest way to do it right on Unix systems (e.g. Linux, OSX) is to use NVM, the Node version manager. 48 | 49 | First, install NVM: 50 | 51 | ```bash 52 | curl https://raw.githubusercontent.com/creationix/nvm/v0.24.1/install.sh | bash 53 | ``` 54 | 55 | or, if you don't have `curl`: 56 | 57 | ```bash 58 | wget -qO- https://raw.githubusercontent.com/creationix/nvm/v0.24.1/install.sh | bash 59 | ``` 60 | 61 | **NB: on OSX, you will need to have the developer tools installed (e.g. by installing XCode).** 62 | 63 | Then, install the latest Node.js, which will automatically install the latest `npm` as well, and set that version as the default: 64 | 65 | ```bash 66 | source ~/.nvm/nvm.sh 67 | nvm install 0.10 68 | nvm alias default 0.10 69 | nvm use default 70 | ``` 71 | 72 | Now you should have `node` and `npm` available. Check by running: 73 | 74 | ``` 75 | node -v 76 | npm -v 77 | ``` 78 | 79 | If both of those printed version numbers, you're ready to move on to installing `quickscrape`. 80 | 81 | #### Quickscrape 82 | 83 | `quickscrape` is very easy to install. Simply: 84 | 85 | ```bash 86 | npm install --global quickscrape 87 | ``` 88 | 89 | ### Documentation 90 | 91 | Run `quickscrape --help` from the command line to get help: 92 | 93 | ``` 94 | Usage: quickscrape [options] 95 | 96 | Options: 97 | 98 | -h, --help output usage information 99 | -V, --version output the version number 100 | -u, --url <url> URL to scrape 101 | -r, --urllist <path> path to file with list of URLs to scrape (one per line) 102 | -s, --scraper <path> path to scraper definition (in JSON format) 103 | -d, --scraperdir <path> path to directory containing scraper definitions (in JSON format) 104 | -o, --output <path> where to output results (directory will be created if it doesn't exist) 105 | -i, --ratelimit <int> maximum number of scrapes per minute (default 3) 106 | -h, --headless render all pages in a headless browser 107 | -l, --loglevel <level> amount of information to log (silent, verbose, info*, data, warn, error, or debug) 108 | -f, --outformat <name> JSON format to transform results into (currently only bibjson) 109 | ``` 110 | 111 | You must provide scraper definitions in ScraperJSON format as used in the [ContentMine journal-scrapers](https://github.com/ContentMine/journal-scrapers). 112 | 113 | ### Examples 114 | 115 | #### 1. 
Extract data from a single URL with a predefined scraper 116 | 117 | First, you'll want to grab some pre-cooked definitions: 118 | 119 | ```bash 120 | git clone https://github.com/ContentMine/journal-scrapers.git 121 | ``` 122 | 123 | Now just run `quickscrape`: 124 | 125 | ```bash 126 | quickscrape \ 127 | --url https://peerj.com/articles/384 \ 128 | --scraper journal-scrapers/scrapers/peerj.json \ 129 | --output peerj-384 \ 130 | --outformat bibjson 131 | ``` 132 | 133 | You'll see log messages informing you how the scraping proceeds: 134 | 135 | ![Single URL log output](docs/screenshot_log_single_url.png) 136 | 137 | Then in the `peerj-384` directory there are several files: 138 | 139 | ``` 140 | $ tree peerj-384 141 | peerj-384/ 142 | └── https_peerj.com_articles_384 143 | ├── bib.json 144 | ├── fig-1-full.png 145 | ├── fulltext.html 146 | ├── fulltext.pdf 147 | ├── fulltext.xml 148 | └── results.json 149 | ``` 150 | 151 | - `fulltext.html` is the fulltext HTML (duh!) 152 | - `results.json` is a JSON file containing all the captured data 153 | - `bib.json` is a JSON file containing the results in bibJSON format 154 | - `fig-1-full.png` is the downloaded image from the only figure in the paper 155 | 156 | `results.json` looks like this (truncated): 157 | 158 | ```json 159 | { 160 | "publisher": { 161 | "value": [ 162 | "PeerJ Inc." 163 | ] 164 | }, 165 | "journal_name": { 166 | "value": [ 167 | "PeerJ" 168 | ] 169 | }, 170 | "journal_issn": { 171 | "value": [ 172 | "2167-8359" 173 | ] 174 | }, 175 | "title": { 176 | "value": [ 177 | "Mutation analysis of the SLC26A4, FOXI1 and KCNJ10 genes in individuals with congenital hearing loss" 178 | ] 179 | }, 180 | "keywords": { 181 | "value": [ 182 | "Pendred; MLPA; DFNB4; \n SLC26A4\n ; FOXI1 and KCNJ10; Genotyping; Genetics; SNHL" 183 | ] 184 | }, 185 | "author_name": { 186 | "value": [ 187 | "Lynn M. Pique", 188 | "Marie-Luise Brennan", 189 | "Colin J. Davidson", 190 | "Frederick Schaefer", 191 | "John Greinwald Jr", 192 | "Iris Schrijver" 193 | ] 194 | } 195 | } 196 | ``` 197 | 198 | `bib.json` looks like this (truncated): 199 | 200 | ```json 201 | { 202 | "title": "Mutation analysis of the SLC26A4, FOXI1 and KCNJ10 genes in individuals with congenital hearing loss", 203 | "link": [ 204 | { 205 | "type": "fulltext_html", 206 | "url": "https://peerj.com/articles/384" 207 | }, 208 | { 209 | "type": "fulltext_pdf", 210 | "url": "https://peerj.com/articles/384.pdf" 211 | }, 212 | { 213 | "type": "fulltext_xml", 214 | "url": "/articles/384.xml" 215 | } 216 | ], 217 | "author": [ 218 | { 219 | "name": "Lynn M. Pique", 220 | "institution": "Department of Pathology, Stanford University Medical Center, Stanford, CA, USA" 221 | }, 222 | { 223 | "name": "Marie-Luise Brennan", 224 | "institution": "Department of Pediatrics, Stanford University Medical Center, Stanford, CA, USA" 225 | } 226 | ] 227 | } 228 | ``` 229 | 230 | ### Contributing 231 | 232 | We are not yet accepting contributions. If you'd like to help, please drop me an email (richard@contentmine.org) and I'll let you know when we're ready for that. 
233 | 234 | ### Release History 235 | 236 | - ***0.1.0*** - initial version with simple one-element scraping 237 | - ***0.1.1*** - multiple-member elements; clean exiting; massive speedup 238 | - ***0.1.2*** - ability to grab text or HTML content of a selected node via special attributes `text` and `html` 239 | - ***0.1.3*** - refactor into modules, full logging suite, much more robust downloading 240 | - ***0.1.4*** - multiple URL processing, bug fixes, reduce dependency list 241 | - ***0.1.5*** - fix bug in bubbling logs up from PhantomJS 242 | - ***0.1.6*** - add dependency checking option 243 | - ***0.1.7*** - fix bug where jsdom rendered external resources (#10) 244 | - ***0.2.0*** - core moved out to separate library: [thresher](https://github.com/ContentMine/thresher). PhantomJS and CasperJS binaries now managed through npm to simplify installation. 245 | - ***0.2.1*** - fix messy metadata 246 | - ***0.2.3*** - automatic scraper selection 247 | - ***0.2.4-5*** - bump thresher dependency for bug fixes 248 | - ***0.2.6-7*** - new Thresher API 249 | - ***0.2.8*** - fix Thresher API use 250 | - ***0.3.0*** - use Thresher 0.1.0 and scraperJSON 0.1.0 251 | - ***0.3.1*** - update the reported version number left out of last release 252 | - ***0.3.2*** - fix dependencies 253 | - ***0.3.3-6*** - bug fixes 254 | - ***0.3.7*** - fix bug in bibJSON dates. Bump to thresher 0.1.3. 255 | - ***0.4.0*** - fix various bugs (with urllists, tokenized urls), print help when run with no args, update all dependencies. 256 | - ***0.4.1*** - fix version number reporting. 257 | 258 | ### License 259 | 260 | Copyright (c) 2014 Shuttleworth Foundation 261 | Licensed under the MIT license. 262 | -------------------------------------------------------------------------------- /bin/quickscrape.js: -------------------------------------------------------------------------------- 1 | #!/usr/bin/env node 2 | 3 | var program = require('commander') 4 | , fs = require('fs') 5 | , winston = require('winston') 6 | , which = require('which').sync 7 | , path = require('path') 8 | , thresher = require('thresher') 9 | , Thresher = thresher.Thresher 10 | , ScraperBox = thresher.ScraperBox 11 | , Scraper = thresher.Scraper 12 | , ep = require('../lib/eventparse.js') 13 | , loglevels = require('../lib/loglevels.js') 14 | , outformat = require('../lib/outformat.js') 15 | , sanitize = require('sanitize-filename'); 16 | 17 | 18 | var pjson = require('../package.json'); 19 | QSVERSION = pjson.version; 20 | 21 | program 22 | .version(pjson.version) 23 | .option('-u, --url <url>', 24 | 'URL to scrape') 25 | .option('-r, --urllist <path>', 26 | 'path to file with list of URLs to scrape (one per line)') 27 | .option('-s, --scraper <path>', 28 | 'path to scraper definition (in JSON format)') 29 | .option('-d, --scraperdir <path>', 30 | 'path to directory containing scraper definitions (in JSON format)') 31 | .option('-o, --output <path>', 32 | 'where to output results ' + 33 | '(directory will be created if it doesn\'t exist)', 34 | 'output') 35 | .option('-n, --numberdirs', 36 | 'use a number instead of the URL to name output subdirectories') 37 | .option('-i, --ratelimit <int>', 38 | 'maximum number of scrapes per minute (default 3)', 3) 39 | .option('-h, --headless', 40 | 'render all pages in a headless browser') 41 | .option('-l, --loglevel <level>', 42 | 'amount of information to log ' + 43 | '(silent, verbose, info*, data, warn, error, or debug)', 44 | 'info') 45 | .option('-f, --outformat <name>', 46 | 'JSON format to transform results into (currently only bibjson)') 
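// (note: the <angle-bracket> placeholders in the option definitions tell commander that a
// flag expects a required argument; flags without one, such as --headless, act as booleans)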
.option('--logfile <filename>', 48 | 'save log to specified file in output directory as well as printing to terminal') // long-only flag: -f is already used by --outformat 49 | .parse(process.argv); 50 | 51 | if (!process.argv.slice(2).length) { 52 | program.help(); 53 | } 54 | 55 | // set up logging 56 | var allowedlevels = Object.keys(loglevels.levels); 57 | if (allowedlevels.indexOf(program.loglevel) == -1) { 58 | winston.error('Loglevel must be one of: ', 59 | 'silly, input, verbose, prompt, debug, data, info, help, warn, error'); 60 | process.exit(1); 61 | } 62 | 63 | log = new (winston.Logger)({ 64 | transports: [new winston.transports.Console({ 65 | level: program.loglevel, 66 | levels: loglevels.levels, 67 | colorize: true 68 | })], 69 | level: program.loglevel, 70 | levels: loglevels.levels, 71 | colorize: true 72 | }); 73 | winston.addColors(loglevels.colors); 74 | 75 | // have to do this before changing directory 76 | if (program.scraper) program.scraper = path.resolve(program.scraper) 77 | if (program.scraperdir) program.scraperdir = path.resolve(program.scraperdir) 78 | 79 | // create output directory 80 | if (!fs.existsSync(program.output)) { 81 | log.debug('creating output directory: ' + program.output); 82 | fs.mkdirSync(program.output); 83 | } 84 | process.chdir(program.output); 85 | tld = process.cwd(); 86 | 87 | if (program.hasOwnProperty('logfile')) { 88 | log.add(winston.transports.File, { 89 | filename: program.logfile, 90 | level: 'debug' 91 | }); 92 | log.info('Saving logs to ./' + program.output + '/' + program.logfile); 93 | } 94 | 95 | // verify arguments 96 | if (program.scraper && program.scraperdir) { 97 | log.error('Please use either --scraper or --scraperdir, not both'); 98 | process.exit(1); 99 | } 100 | 101 | if (!(program.url || program.urllist)) { 102 | log.error('You must provide a URL or list of URLs to scrape'); 103 | process.exit(1); 104 | } 105 | 106 | if (!(program.scraper || program.scraperdir)) { 107 | log.error('You must provide a scraper definition'); 108 | process.exit(1); 109 | } 110 | 111 | if (program.outformat) { 112 | if (program.outformat.toLowerCase() !== 'bibjson') { 113 | log.error('Outformat ' + program.outformat + ' is not valid.'); 114 | } 115 | } 116 | 117 | // log options 118 | log.info('quickscrape ' + QSVERSION + ' launched with...'); 119 | if (program.url) { 120 | log.info('- URL: ' + program.url); 121 | } else { 122 | log.info('- URLs from file: ' + program.urllist); 123 | } 124 | if (program.scraper) { 125 | log.info('- Scraper:', program.scraper); 126 | } 127 | if (program.scraperdir) { 128 | log.info('- Scraperdir:', program.scraperdir); 129 | } 130 | log.info('- Rate limit:', program.ratelimit, 'per minute'); 131 | log.info('- Log level:', program.loglevel); 132 | 133 | // check scrapers 134 | if (program.scraperdir) { 135 | var scrapers = new ScraperBox(program.scraperdir); 136 | if (scrapers.scrapers.length == 0) { 137 | log.error('the scraper directory provided did not contain any ' + 138 | 'valid scrapers'); 139 | process.exit(1); 140 | } 141 | } 142 | if (program.scraper) { 143 | var definition = fs.readFileSync(program.scraper); 144 | var scraper = new Scraper(JSON.parse(definition)); 145 | if (!scraper.valid) { 146 | scraper.on('definitionError', function(problems) { 147 | log.error('the scraper provided was not valid for the following reason(s):'); 148 | problems.forEach(function(p) { 149 | log.error('\t- ' + p); 150 | }); 151 | process.exit(1); 152 | }); 153 | scraper.validate(definition); 154 | } 155 | } 156 | 157 | // load list of URLs from a file 158 | var loadUrls = function(path) { 159 | var list = 
fs.readFileSync(path, { 160 | encoding: 'utf8' 161 | }); 162 | return list.split('\n').map(function(cv) { 163 | return cv.trim(); 164 | }).filter(function(x) { 165 | return x.length > 0; 166 | }); 167 | } 168 | 169 | urllist = program.url ? [program.url] : loadUrls(program.urllist); 170 | log.info('urls to scrape:', urllist.length); 171 | 172 | // this is the callback we pass to the scraper, so the program 173 | // can exit when all asynchronous file and download tasks have finished 174 | var finish = function() { 175 | log.info('all tasks completed'); 176 | process.exit(0); 177 | } 178 | 179 | // set up crude rate-limiting 180 | mintime = 60000 / program.ratelimit; 181 | lasttime = new Date().getTime(); 182 | 183 | done = false; 184 | next = 0; 185 | 186 | var checkForNext = function() { 187 | var now = new Date().getTime(); 188 | var diff = now - lasttime; 189 | var timeleft = Math.max(mintime - diff, 0); 190 | if (timeleft == 0 && done) { 191 | next ++; 192 | if (next < urllist.length) { 193 | lasttime = new Date().getTime(); 194 | processUrl(urllist[next]); 195 | if (next == urllist.length - 1) { 196 | finish(); 197 | } 198 | } else { 199 | finish(); 200 | } 201 | } else if (done) { 202 | if (next == urllist.length - 1) { 203 | finish(); 204 | } 205 | } 206 | } 207 | 208 | // process a URL 209 | var i = 0 210 | var processUrl = function(url) { 211 | i += 1 212 | done = false; 213 | log.info('processing URL:', url); 214 | 215 | // load the scraper definition(s) 216 | var scrapers = new ScraperBox(program.scraperdir); 217 | if (program.scraper) { 218 | scrapers.addScraper(program.scraper); 219 | } 220 | if (scrapers.scrapers.length == 0) { 221 | log.warn('no scrapers ') 222 | return; 223 | } 224 | 225 | // url-specific output dir 226 | var dir = program.numberdirs ? 
('' + i) : url.replace(/\/+/g, '_').replace(/:/g, ''); 227 | dir = sanitize(path.join(tld, dir)); 228 | if (!fs.existsSync(dir)) { 229 | log.debug('creating output directory: ' + dir); 230 | fs.mkdirSync(dir); 231 | } 232 | process.chdir(dir); 233 | 234 | // run scraper 235 | var capturesFailed = 0; 236 | var t = new Thresher(scrapers); 237 | 238 | t.on('scraper.*', function(var1, var2) { 239 | log.log(ep.getlevel(this.event), 240 | ep.compose(this.event, var1, var2)); 241 | }); 242 | 243 | t.on('scraper.elementCaptureFailed', function() { 244 | capturesFailed += 1; 245 | }) 246 | 247 | t.on('scraper.renderer.*', function(var1, var2) { 248 | log.info(this.event, var1, var2) 249 | }); 250 | 251 | t.once('result', function(result, structured) { 252 | var nresults = Object.keys(result).length 253 | log.info('URL processed: captured ' + (nresults - capturesFailed) + '/' + 254 | nresults + ' elements (' + capturesFailed + ' captures failed)'); 255 | outfile = 'results.json' 256 | log.debug('writing results to file:', outfile) 257 | fs.writeFileSync(outfile, JSON.stringify(structured, undefined, 2)); 258 | // write out any extra formats 259 | if (program.outformat) { 260 | outformat.format(program.outformat, structured); 261 | } 262 | log.debug('changing back to top-level directory'); 263 | process.chdir(tld); 264 | 265 | // if we don't remove all the listeners, processing more URLs 266 | // will post messages to all the listeners from previous URLs 267 | t.removeAllListeners(); 268 | t = null; 269 | 270 | done = true; 271 | }); 272 | 273 | t.scrape(url, program.headless); 274 | } 275 | 276 | setInterval(checkForNext, 100); 277 | processUrl(urllist[0]); 278 | -------------------------------------------------------------------------------- /docs/screenshot_log_multiple_url.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/ContentMine/quickscrape/27dce16c1319c7e6884784f31dafcd7405adfd37/docs/screenshot_log_multiple_url.png -------------------------------------------------------------------------------- /docs/screenshot_log_single_url.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/ContentMine/quickscrape/27dce16c1319c7e6884784f31dafcd7405adfd37/docs/screenshot_log_single_url.png -------------------------------------------------------------------------------- /lib/eventparse.js: -------------------------------------------------------------------------------- 1 | var chalk = require('chalk'); 2 | 3 | function bracket(x) { 4 | return chalk.yellow('[') + 5 | x + 6 | chalk.yellow(']'); 7 | } 8 | 9 | var scraper = bracket(chalk.cyan('scraper')) 10 | , scraperBox = bracket(chalk.magenta('scraperBox')); 11 | 12 | var mapping = { 13 | 'error': [], 14 | 'info': [], 15 | 'result': [], 16 | 'end': [], 17 | 'scraper.downloadStarted': 18 | [scraper, 'download started'], 19 | 'scraper.downloadProgress': 20 | [scraper, 'download process'], 21 | 'scraper.downloadError': 22 | [scraper, 'downloading failed'], 23 | 'scraper.fileSaveError': 24 | [scraper, 'file save failed'], 25 | 'scraper.downloadSaved': 26 | [scraper, 'download started'], 27 | 'scraper.urlRendered': 28 | [scraper, 'URL rendered'], 29 | 'scraper.elementCaptured': 30 | [scraper, 'element captured'], 31 | 'scraper.elementCaptureFailed': 32 | [scraper, 'element capture failed'], 33 | 'scraper.elementResults': 34 | [scraper, 'element results'], 35 | 'scraper.selectorFailed': 36 | [scraper, 'selector had no 
results'], 37 | 'scraper.attributeFailed': 38 | [scraper, 'attribute was not valid'], 39 | 'scrapersLoaded': 40 | [scraperBox, 'scrapers loaded'], 41 | 'gettingScraper': 42 | [scraperBox, 'getting scraper'], 43 | 'scraperNotFound': 44 | [scraperBox, 'scraper not found'], 45 | 'scraperFound': 46 | [scraperBox, 'scraper found'], 47 | 'scrapeStart': 48 | [scraperBox, 'scraping started'] 49 | } 50 | 51 | module.exports.getlevel = function(event) { 52 | if (/\\.error/.test(event)) { 53 | return 'error'; 54 | } else if (/Error/.test(event)) { 55 | return 'warning'; 56 | } else if (/elementCapture/.test(event)) { 57 | return 'data'; 58 | } else if (/elementResults/.test(event)) { 59 | return 'debug'; 60 | } else if (/Failed/.test(event)) { 61 | return 'debug'; 62 | } 63 | return 'info'; 64 | } 65 | 66 | module.exports.compose = function(event, var1, var2) { 67 | msg = mapping[event] || [event]; 68 | if (var1) 69 | msg = msg.concat([var1]) 70 | if (var2) 71 | msg = msg.concat([var2]) 72 | strmsg = msg.join('. ') + '.'; 73 | return strmsg; 74 | } 75 | -------------------------------------------------------------------------------- /lib/loglevels.js: -------------------------------------------------------------------------------- 1 | var log = module.exports; 2 | 3 | log.levels = { 4 | silly: 0, 5 | input: 1, 6 | verbose: 2, 7 | prompt: 3, 8 | debug: 4, 9 | data: 5, 10 | info: 6, 11 | help: 7, 12 | warn: 8, 13 | error: 9 14 | }; 15 | 16 | log.colors = { 17 | silly: 'magenta', 18 | input: 'grey', 19 | verbose: 'cyan', 20 | prompt: 'grey', 21 | debug: 'blue', 22 | info: 'green', 23 | data: 'grey', 24 | help: 'cyan', 25 | warn: 'yellow', 26 | error: 'red' 27 | }; 28 | -------------------------------------------------------------------------------- /lib/outformat.js: -------------------------------------------------------------------------------- 1 | var moment = require('moment') 2 | , fs = require('fs'); 3 | var outformat = module.exports; 4 | 5 | outformat.format = function(fmt, res) { 6 | if (fmt.toLowerCase() === 'bibjson') { 7 | var outfile = 'bib.json'; 8 | var bib = outformat.bibJSON(res); 9 | var pretty = JSON.stringify(bib, undefined, 2); 10 | return fs.writeFileSync(outfile, pretty); 11 | } else { 12 | return false; 13 | } 14 | } 15 | 16 | outformat.bibJSON = function(t) { 17 | var x = {}; 18 | 19 | // single value metadata 20 | ['title'].forEach(function(key) { 21 | if (t[key] && t[key].value && t[key].value.length > 0) { 22 | x[key] = t[key].value[0]; 23 | } 24 | }); 25 | 26 | // links 27 | x.link = []; 28 | ['fulltext_html', 29 | 'fulltext_pdf', 30 | 'fulltext_xml', 31 | 'supplementary_file'].forEach(function(type) { 32 | if (t[type] && t[type].value && t[type].value.length > 0) { 33 | t[type].value.forEach(function(url) { 34 | x.link.push({ type: type, url: url }); 35 | }); 36 | } 37 | }); 38 | 39 | // people 40 | ['author', 'editor', 'reviewer'].forEach(function(key) { 41 | var people = []; 42 | ['name', 'givenName', 43 | 'familyName', 'institution'].forEach(function(type) { 44 | var endkey = key + '_' + type; 45 | if (t[endkey] && t[endkey].value) { 46 | var i = 0; 47 | t[endkey].value.forEach(function(y) { 48 | if (people.length < i + 1) { 49 | people.push({}); 50 | } 51 | people[i][type] = y; 52 | i += 1; 53 | }); 54 | } 55 | }); 56 | if (people.length > 0) { 57 | x[key] = people; 58 | } 59 | }); 60 | 61 | // publisher 62 | if (t.publisher && t.publisher.value && t.publisher.value.length > 0) { 63 | x.publisher = { name: t.publisher.value[0] }; 64 | } 65 | 66 | // journal 67 | 
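// (each scraped field arrives in the results object as { value: [ ... ] };
// the blocks below copy the first entry of each field into a flat bibJSON-style journal object)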
x.journal = {}; 68 | ['volume', 'issue', 'firstpage', 69 | 'lastpage', 'pages'].forEach(function(key) { 70 | if (t[key] && t[key].value && t[key].value.length > 0) { 71 | x.journal[key] = t[key].value[0]; 72 | } 73 | }); 74 | if (t.journal_name && 75 | t.journal_name.value && 76 | t.journal_name.value.length > 0) { 77 | x.journal.name = t.journal_name.value[0]; 78 | } 79 | if (t.journal_issn && 80 | t.journal_issn.value && 81 | t.journal_issn.value.length > 0) { 82 | x.journal.issn = t.journal_issn.value[0]; 83 | } 84 | 85 | // sections 86 | x.sections = {}; 87 | // single-entry fields 88 | ['abstract', 'description', 'introduction', 'methods', 'results', 89 | 'discussion', 'conclusion', 'case_report', 'acknowledgement', 90 | 'author_contrib', 'competing_interest'].forEach(function(key) { 91 | var record = {}; 92 | var htmlkey = key + '_html'; 93 | var textkey = key + '_text'; 94 | [key, textkey].forEach(function(endkey) { 95 | if (t[endkey] && t[endkey].value && t[endkey].value.length > 0) { 96 | record.text = t[endkey].value[0] 97 | } 98 | }); 99 | if (t[htmlkey] && t[htmlkey].value && t[htmlkey].value.length > 0) { 100 | record.html = t[htmlkey].value[0] 101 | } 102 | if (Object.keys(record).length > 0) { 103 | x.sections[key] = record; 104 | } 105 | }); 106 | // multiple-entry fields 107 | ['references_html', 'tables_html', 'figures_html'].forEach(function(key) { 108 | if (t[key] && t[key].value && t[key].value.length > 0) { 109 | var outkey = key.replace(/_html$/, '') 110 | x.sections[outkey] = t[key].value.map(function(y) { 111 | return { 112 | html: y 113 | }; 114 | }); 115 | } 116 | }); 117 | 118 | // date 119 | x.date = {}; 120 | ['date_published', 'date_submitted', 121 | 'date_accepted'].forEach(function(key) { 122 | if (t[key] && t[key].value && t[key].value.length > 0) { 123 | var date = t[key].value[0]; 124 | if (date.constructor === Array) { 125 | date = date[0]; 126 | } 127 | key = key.replace(/^date_/, ''); 128 | x.date[key] = moment(new Date(date.trim())).format(); 129 | } 130 | }); 131 | 132 | // identifier 133 | x.identifier = []; 134 | ['doi', 'pmid'].forEach(function(key) { 135 | if (t[key] && t[key].value && t[key].value.length > 0) { 136 | x.identifier.push({ 137 | type: key, 138 | id: t[key].value[0] 139 | }); 140 | } 141 | }); 142 | 143 | // license 144 | if (t.license && t.license.value && t.license.value.length > 0) { 145 | x.license = t.license.value.map(function(y) { 146 | return { raw: y }; 147 | }); 148 | } 149 | 150 | // copyright 151 | if (t.copyright && t.copyright.value) { 152 | x.copyright = t.copyright.value; 153 | } 154 | 155 | x['log'] = [ 156 | { 157 | date: moment().format(), 158 | 'event': 'scraped by quickscrape v' + QSVERSION 159 | } 160 | ] 161 | 162 | return x; 163 | } 164 | -------------------------------------------------------------------------------- /package.json: -------------------------------------------------------------------------------- 1 | { 2 | "name": "quickscrape", 3 | "description": "A command line scraping tool for the modern web", 4 | "version": "0.4.7", 5 | "homepage": "https://github.com/ContentMine/quickscrape", 6 | "author": { 7 | "name": "Richard Smith-Unna", 8 | "email": "rds45@cam.ac.uk", 9 | "url": "http://contentmine.org" 10 | }, 11 | "repository": { 12 | "type": "git", 13 | "url": "https://github.com/ContentMine/quickscrape.git" 14 | }, 15 | "bugs": { 16 | "url": "https://github.com/ContentMine/quickscrape/issues" 17 | }, 18 | "license": "MIT", 19 | "main": "lib/quickscrape", 20 | "engines": { 21 | "node": ">= 
0.8.14" 22 | }, 23 | "scripts": { 24 | "test": "mocha", 25 | "coverage": "istanbul cover ./node_modules/mocha/bin/_mocha --report lcovonly -- -R spec", 26 | "coveralls": "istanbul cover ./node_modules/mocha/bin/_mocha --report lcovonly -- -R spec && cat ./coverage/lcov.info | ./node_modules/coveralls/bin/coveralls.js && rm -rf ./coverage" 27 | }, 28 | "dependencies": { 29 | "chalk": "~1.0.0", 30 | "commander": "~2.7.1", 31 | "moment": "~2.10.2", 32 | "thresher": "^0.1.11", 33 | "which": "~1.0.5", 34 | "winston": "~1.0.0", 35 | "sanitize-filename": "1.6.0" 36 | }, 37 | "bin": { 38 | "quickscrape": "bin/quickscrape.js" 39 | }, 40 | "devDependencies": { 41 | "grunt": "~0.4.5", 42 | "coveralls": "~2.11.2", 43 | "mocha-lcov-reporter": "0.0.2", 44 | "should": "~4.0.0", 45 | "istanbul": "~0.3.13", 46 | "mocha": "~2.2.4" 47 | }, 48 | "keywords": [ 49 | "scraping", 50 | "datamining", 51 | "contentmining" 52 | ] 53 | } 54 | -------------------------------------------------------------------------------- /test/eventParseSpec.js: -------------------------------------------------------------------------------- 1 | var ep = require('../lib/eventparse.js') 2 | , should = require('should'); 3 | 4 | describe("eventparse", function() { 5 | 6 | describe("compose()", function() { 7 | 8 | it("should compose a message with all relevant info", function() { 9 | var event = 'scraper.downloadStarted' 10 | , var1 = 'file.txt' 11 | , var2 = 'http://place.com'; 12 | msg = ep.compose(event, var1, var2); 13 | msg.should.match(/scraper/); 14 | msg.should.match(/download started/); 15 | msg.should.match(RegExp(var1)); 16 | msg.should.match(RegExp(var2)); 17 | }); 18 | 19 | }); 20 | 21 | }); 22 | -------------------------------------------------------------------------------- /test/test.js: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/ContentMine/quickscrape/27dce16c1319c7e6884784f31dafcd7405adfd37/test/test.js -------------------------------------------------------------------------------- /test_vagrants/one/.vagrant/machines/default/virtualbox/action_provision: -------------------------------------------------------------------------------- 1 | 1.5:432ab8e5-9b70-4a8b-9853-fec6c4239c69 -------------------------------------------------------------------------------- /test_vagrants/one/.vagrant/machines/default/virtualbox/action_set_name: -------------------------------------------------------------------------------- 1 | 1412628413 -------------------------------------------------------------------------------- /test_vagrants/one/.vagrant/machines/default/virtualbox/id: -------------------------------------------------------------------------------- 1 | 432ab8e5-9b70-4a8b-9853-fec6c4239c69 -------------------------------------------------------------------------------- /test_vagrants/one/.vagrant/machines/default/virtualbox/index_uuid: -------------------------------------------------------------------------------- 1 | 3ca7a5daf4154dc2b90b75bfd9a5a620 -------------------------------------------------------------------------------- /test_vagrants/one/.vagrant/machines/default/virtualbox/synced_folders: -------------------------------------------------------------------------------- 1 | {"virtualbox":{"/vagrant":{"guestpath":"/vagrant","hostpath":"/Users/rds45/code/quickscrape/test_vagrants/one","disabled":false}}} -------------------------------------------------------------------------------- /test_vagrants/one/Vagrantfile: 
-------------------------------------------------------------------------------- 1 | # -*- mode: ruby -*- 2 | # vi: set ft=ruby : 3 | 4 | # Vagrantfile API/syntax version. Don't touch unless you know what you're doing! 5 | VAGRANTFILE_API_VERSION = "2" 6 | 7 | Vagrant.configure(VAGRANTFILE_API_VERSION) do |config| 8 | # All Vagrant configuration is done here. The most common configuration 9 | # options are documented and commented below. For a complete reference, 10 | # please see the online documentation at vagrantup.com. 11 | 12 | # Every Vagrant virtual environment requires a box to build off of. 13 | config.vm.box = "hashicorp/precise32" 14 | 15 | # Disable automatic box update checking. If you disable this, then 16 | # boxes will only be checked for updates when the user runs 17 | # `vagrant box outdated`. This is not recommended. 18 | # config.vm.box_check_update = false 19 | 20 | # Create a forwarded port mapping which allows access to a specific port 21 | # within the machine from a port on the host machine. In the example below, 22 | # accessing "localhost:8080" will access port 80 on the guest machine. 23 | # config.vm.network "forwarded_port", guest: 80, host: 8080 24 | 25 | # Create a private network, which allows host-only access to the machine 26 | # using a specific IP. 27 | # config.vm.network "private_network", ip: "192.168.33.10" 28 | 29 | # Create a public network, which generally matched to bridged network. 30 | # Bridged networks make the machine appear as another physical device on 31 | # your network. 32 | # config.vm.network "public_network" 33 | 34 | # If true, then any SSH connections made will enable agent forwarding. 35 | # Default value: false 36 | # config.ssh.forward_agent = true 37 | 38 | # Share an additional folder to the guest VM. The first argument is 39 | # the path on the host to the actual folder. The second argument is 40 | # the path on the guest to mount the folder. And the optional third 41 | # argument is a set of non-required options. 42 | # config.vm.synced_folder "../data", "/vagrant_data" 43 | 44 | # Provider-specific configuration so you can fine-tune various 45 | # backing providers for Vagrant. These expose provider-specific options. 46 | # Example for VirtualBox: 47 | # 48 | # config.vm.provider "virtualbox" do |vb| 49 | # # Don't boot with headless mode 50 | # vb.gui = true 51 | # 52 | # # Use VBoxManage to customize the VM. For example to change memory: 53 | # vb.customize ["modifyvm", :id, "--memory", "1024"] 54 | # end 55 | # 56 | # View the documentation for the provider you're using for more 57 | # information on available options. 58 | 59 | # Enable provisioning with CFEngine. CFEngine Community packages are 60 | # automatically installed. For example, configure the host as a 61 | # policy server and optionally a policy file to run: 62 | # 63 | # config.vm.provision "cfengine" do |cf| 64 | # cf.am_policy_hub = true 65 | # # cf.run_file = "motd.cf" 66 | # end 67 | # 68 | # You can also configure and bootstrap a client to an existing 69 | # policy server: 70 | # 71 | # config.vm.provision "cfengine" do |cf| 72 | # cf.policy_server_address = "10.0.2.15" 73 | # end 74 | 75 | # Enable provisioning with Puppet stand alone. Puppet manifests 76 | # are contained in a directory path relative to this Vagrantfile. 77 | # You will need to create the manifests directory and a manifest in 78 | # the file default.pp in the manifests_path directory. 
79 | # 80 | # config.vm.provision "puppet" do |puppet| 81 | # puppet.manifests_path = "manifests" 82 | # puppet.manifest_file = "site.pp" 83 | # end 84 | 85 | # Enable provisioning with chef solo, specifying a cookbooks path, roles 86 | # path, and data_bags path (all relative to this Vagrantfile), and adding 87 | # some recipes and/or roles. 88 | # 89 | # config.vm.provision "chef_solo" do |chef| 90 | # chef.cookbooks_path = "../my-recipes/cookbooks" 91 | # chef.roles_path = "../my-recipes/roles" 92 | # chef.data_bags_path = "../my-recipes/data_bags" 93 | # chef.add_recipe "mysql" 94 | # chef.add_role "web" 95 | # 96 | # # You may also specify custom JSON attributes: 97 | # chef.json = { mysql_password: "foo" } 98 | # end 99 | 100 | # Enable provisioning with chef server, specifying the chef server URL, 101 | # and the path to the validation key (relative to this Vagrantfile). 102 | # 103 | # The Opscode Platform uses HTTPS. Substitute your organization for 104 | # ORGNAME in the URL and validation key. 105 | # 106 | # If you have your own Chef Server, use the appropriate URL, which may be 107 | # HTTP instead of HTTPS depending on your configuration. Also change the 108 | # validation key to validation.pem. 109 | # 110 | # config.vm.provision "chef_client" do |chef| 111 | # chef.chef_server_url = "https://api.opscode.com/organizations/ORGNAME" 112 | # chef.validation_key_path = "ORGNAME-validator.pem" 113 | # end 114 | # 115 | # If you're using the Opscode platform, your validator client is 116 | # ORGNAME-validator, replacing ORGNAME with your organization name. 117 | # 118 | # If you have your own Chef Server, the default validation client name is 119 | # chef-validator, unless you changed the configuration. 120 | # 121 | # chef.validation_client_name = "ORGNAME-validator" 122 | end 123 | --------------------------------------------------------------------------------