├── LICENSE └── README.md /LICENSE: -------------------------------------------------------------------------------- 1 | CC0 1.0 Universal 2 | 3 | Statement of Purpose 4 | 5 | The laws of most jurisdictions throughout the world automatically confer 6 | exclusive Copyright and Related Rights (defined below) upon the creator and 7 | subsequent owner(s) (each and all, an "owner") of an original work of 8 | authorship and/or a database (each, a "Work"). 9 | 10 | Certain owners wish to permanently relinquish those rights to a Work for the 11 | purpose of contributing to a commons of creative, cultural and scientific 12 | works ("Commons") that the public can reliably and without fear of later 13 | claims of infringement build upon, modify, incorporate in other works, reuse 14 | and redistribute as freely as possible in any form whatsoever and for any 15 | purposes, including without limitation commercial purposes. These owners may 16 | contribute to the Commons to promote the ideal of a free culture and the 17 | further production of creative, cultural and scientific works, or to gain 18 | reputation or greater distribution for their Work in part through the use and 19 | efforts of others. 20 | 21 | For these and/or other purposes and motivations, and without any expectation 22 | of additional consideration or compensation, the person associating CC0 with a 23 | Work (the "Affirmer"), to the extent that he or she is an owner of Copyright 24 | and Related Rights in the Work, voluntarily elects to apply CC0 to the Work 25 | and publicly distribute the Work under its terms, with knowledge of his or her 26 | Copyright and Related Rights in the Work and the meaning and intended legal 27 | effect of CC0 on those rights. 28 | 29 | 1. Copyright and Related Rights. A Work made available under CC0 may be 30 | protected by copyright and related or neighboring rights ("Copyright and 31 | Related Rights"). Copyright and Related Rights include, but are not limited 32 | to, the following: 33 | 34 | i. the right to reproduce, adapt, distribute, perform, display, communicate, 35 | and translate a Work; 36 | 37 | ii. moral rights retained by the original author(s) and/or performer(s); 38 | 39 | iii. publicity and privacy rights pertaining to a person's image or likeness 40 | depicted in a Work; 41 | 42 | iv. rights protecting against unfair competition in regards to a Work, 43 | subject to the limitations in paragraph 4(a), below; 44 | 45 | v. rights protecting the extraction, dissemination, use and reuse of data in 46 | a Work; 47 | 48 | vi. database rights (such as those arising under Directive 96/9/EC of the 49 | European Parliament and of the Council of 11 March 1996 on the legal 50 | protection of databases, and under any national implementation thereof, 51 | including any amended or successor version of such directive); and 52 | 53 | vii. other similar, equivalent or corresponding rights throughout the world 54 | based on applicable law or treaty, and any national implementations thereof. 55 | 56 | 2. Waiver. To the greatest extent permitted by, but not in contravention of, 57 | applicable law, Affirmer hereby overtly, fully, permanently, irrevocably and 58 | unconditionally waives, abandons, and surrenders all of Affirmer's Copyright 59 | and Related Rights and associated claims and causes of action, whether now 60 | known or unknown (including existing as well as future claims and causes of 61 | action), in the Work (i) in all territories worldwide, (ii) for the maximum 62 | duration provided by applicable law or treaty (including future time 63 | extensions), (iii) in any current or future medium and for any number of 64 | copies, and (iv) for any purpose whatsoever, including without limitation 65 | commercial, advertising or promotional purposes (the "Waiver"). Affirmer makes 66 | the Waiver for the benefit of each member of the public at large and to the 67 | detriment of Affirmer's heirs and successors, fully intending that such Waiver 68 | shall not be subject to revocation, rescission, cancellation, termination, or 69 | any other legal or equitable action to disrupt the quiet enjoyment of the Work 70 | by the public as contemplated by Affirmer's express Statement of Purpose. 71 | 72 | 3. Public License Fallback. Should any part of the Waiver for any reason be 73 | judged legally invalid or ineffective under applicable law, then the Waiver 74 | shall be preserved to the maximum extent permitted taking into account 75 | Affirmer's express Statement of Purpose. In addition, to the extent the Waiver 76 | is so judged Affirmer hereby grants to each affected person a royalty-free, 77 | non transferable, non sublicensable, non exclusive, irrevocable and 78 | unconditional license to exercise Affirmer's Copyright and Related Rights in 79 | the Work (i) in all territories worldwide, (ii) for the maximum duration 80 | provided by applicable law or treaty (including future time extensions), (iii) 81 | in any current or future medium and for any number of copies, and (iv) for any 82 | purpose whatsoever, including without limitation commercial, advertising or 83 | promotional purposes (the "License"). The License shall be deemed effective as 84 | of the date CC0 was applied by Affirmer to the Work. Should any part of the 85 | License for any reason be judged legally invalid or ineffective under 86 | applicable law, such partial invalidity or ineffectiveness shall not 87 | invalidate the remainder of the License, and in such case Affirmer hereby 88 | affirms that he or she will not (i) exercise any of his or her remaining 89 | Copyright and Related Rights in the Work or (ii) assert any associated claims 90 | and causes of action with respect to the Work, in either case contrary to 91 | Affirmer's express Statement of Purpose. 92 | 93 | 4. Limitations and Disclaimers. 94 | 95 | a. No trademark or patent rights held by Affirmer are waived, abandoned, 96 | surrendered, licensed or otherwise affected by this document. 97 | 98 | b. Affirmer offers the Work as-is and makes no representations or warranties 99 | of any kind concerning the Work, express, implied, statutory or otherwise, 100 | including without limitation warranties of title, merchantability, fitness 101 | for a particular purpose, non infringement, or the absence of latent or 102 | other defects, accuracy, or the present or absence of errors, whether or not 103 | discoverable, all to the greatest extent permissible under applicable law. 104 | 105 | c. Affirmer disclaims responsibility for clearing rights of other persons 106 | that may apply to the Work or any use thereof, including without limitation 107 | any person's Copyright and Related Rights in the Work. Further, Affirmer 108 | disclaims responsibility for obtaining any necessary consents, permissions 109 | or other rights required for any use of the Work. 110 | 111 | d. Affirmer understands and acknowledges that Creative Commons is not a 112 | party to this document and has no duty or obligation with respect to this 113 | CC0 or use of the Work. 114 | 115 | For more information, please see 116 | -------------------------------------------------------------------------------- /README.md: -------------------------------------------------------------------------------- 1 | scraperJSON 2 | =========== 3 | 4 | The scraperJSON standard for defining web scrapers as JSON objects. 5 | 6 | ### Purpose 7 | 8 | scraperJSON is a JSON schema for defining web scrapers in a standardised way. Defining web scrapers in such a way enables mass-scale scraping and mining of similar data from many different sources, for example: 9 | 10 | - academic articles (see [journal-scrapers](https://github.com/ContentMine/journal-scrapers)) 11 | - news sites 12 | - weather sources 13 | - financial/sports/educational sites 14 | 15 | ### Status 16 | 17 | The specification is still in early drafting and is currently evolving very fast as our understanding of the potential needs of the system in real use develops. 18 | 19 | Because of this, the standard is simple described in text here, with a reference implementation in a Node.js library, [thresher](https://github.com/ContentMine/thresher), and a command-line app [quickscrape](https://github.com/ContentMine/quickscrape). 20 | 21 | The schema will be formally defined once we reach a stable set of features. 22 | 23 | ### Specification 24 | 25 | The current schema is described below. 26 | 27 | There can be two keys in the root object: 28 | 29 | - ***url*** - a string-form regular expression specifying which URL(s) this scraper targets 30 | - ***elements*** - a dictionary of elements to scrape 31 | 32 | Elements are defined as key-value pairs, where the key is a description of the element, and the value is a dictionary of specifiers defining the element and its processing. Allowed keys in the specifier dictionary are: 33 | 34 | - ***selector*** - an XPath selector targeting the element to be selected. 35 | - ***attribute*** - a string specifying the attribute to extract from the selected element. **Optional** (omitting this key is equivalent to giving it a value of `text`). In addition to html attributes there are two special attributes allowed: 36 | - `text` - extracts any plaintext inside the selected element 37 | - `html` - extracts the inner HTML of the selected element 38 | - ***download*** - if present and has a truthey value (`true` or an Object) the element is treated as a URL to a resource and is downloaded. **Optional** (omitting this key is equivalent to giving it a value of `false`). If the value is an object, the following keys are allowed: 39 | - `rename` - a string specifying the filename to which the downloaded file will be renamed. 40 | - ***regex*** - an Object specifying a regular expression whose groups should be captured as the results. The results will be an array of the captured groups. If the global flag (`g`) is specified, the result will be an array of arrays of captured groups. There are two keys allowed: 41 | - `source` - a string specifying the regular expression to be executed. **Required** 42 | - `flags` - an array specifying the regex flags to be used (`g`, `m`, `i`, etc.). **Optional** (omitting this key will cause the regex to be executed with no flags). 43 | 44 | Example: 45 | ```json 46 | { 47 | "url": "plos.*\\.org", 48 | "elements": { 49 | "fulltext_pdf": { 50 | "selector": "//meta[@name='citation_pdf_url']", 51 | "attribute": "content", 52 | "regex": { 53 | "flags": ["g", "m"], 54 | "source": "(\\w+)" 55 | }, 56 | "download": { 57 | "rename": "fulltext.pdf" 58 | } 59 | }, 60 | "title": { 61 | "selector": "//meta[@name='citation_title']" 62 | } 63 | } 64 | } 65 | ``` 66 | 67 | ### Changelog 68 | 69 | ***0.0.1*** - add download renaming 70 | ***0.0.2*** - add regex 71 | --------------------------------------------------------------------------------