├── extension ├── devtools │ ├── views │ │ ├── SitemapSelectorGraph.html │ │ ├── SitemapExport.html │ │ ├── SitemapBrowseData.html │ │ ├── SitemapList.html │ │ ├── SelectorEditTableColumn.html │ │ ├── SitemapListItem.html │ │ ├── SelectorListItem.html │ │ ├── DataPreview.html │ │ ├── SelectorList.html │ │ ├── SitemapExportDataCSV.html │ │ ├── SitemapImport.html │ │ ├── SitemapCreate.html │ │ ├── Viewport.html │ │ ├── SitemapScrapeConfig.html │ │ └── SitemapEditMetadata.html │ ├── devtools_init_page.html │ ├── devtools_init_page.js │ ├── devtools_scraper_panel.css │ └── devtools_scraper_panel.html ├── assets │ ├── images │ │ ├── icon128.png │ │ ├── icon16.png │ │ ├── icon19.png │ │ ├── icon38.png │ │ ├── icon48.png │ │ └── LICENSE │ ├── bootstrap-3.0.0 │ │ └── fonts │ │ │ ├── glyphicons-halflings-regular.eot │ │ │ ├── glyphicons-halflings-regular.ttf │ │ │ └── glyphicons-halflings-regular.woff │ ├── jquery.bootstrapvalidator │ │ └── bootstrapValidator.css │ ├── base64.js │ ├── LICENSE-sugar-js │ ├── LICENSE-jquery-js │ ├── jquery.whencallsequentially.js │ ├── LICENSE-d3-js │ └── LICENSE-icanhaz-js ├── scripts │ ├── App.js │ ├── Selector │ │ ├── SelectorElement.js │ │ ├── SelectorValue.js │ │ ├── SelectorElementStyle.js │ │ ├── SelectorElementAttribute.js │ │ ├── SelectorHTML.js │ │ ├── SelectorText.js │ │ ├── SelectorGroup.js │ │ ├── SelectorElementScroll.js │ │ ├── SelectorLink.js │ │ ├── SelectorImage.js │ │ └── SelectorPopupLink.js │ ├── DateUtils │ │ ├── DateRoller.js │ │ ├── DatePatternSupport.js │ │ └── SimpleDateFormatter.js │ ├── Queue.js │ ├── Config.js │ ├── ElementQuery.js │ ├── StoreDevtools.js │ ├── ChromePopupBrowser.js │ ├── Job.js │ ├── BackgroundScript.js │ ├── UniqueElementList.js │ ├── ContentScript.js │ ├── SelectorGraph.js │ └── Store.js ├── popup.html ├── content_script │ └── content_script.js ├── options_page │ ├── options_page.js │ └── options.html ├── background_page │ └── background_script.js └── manifest.json ├── docs ├── images │ ├── 
store-logo-sources.txt │ ├── sitemap-tree.png │ ├── chrome-store-logo.png │ ├── chrome-store-logo.xcf │ ├── selectors │ │ ├── table │ │ │ ├── table.png │ │ │ └── selectors.png │ │ ├── element-click │ │ │ ├── click-more.png │ │ │ └── click-once.png │ │ ├── link │ │ │ ├── pagination-link-selectors.png │ │ │ ├── pagination-selector-graph.png │ │ │ └── multiple-level-link-selectors.png │ │ └── text │ │ │ ├── text-selector-multiple-per-page.png │ │ │ ├── text-selector-multiple-elements-with-text-selectors.png │ │ │ └── text-selector-multiple-single-text-selectors-in-one-page.png │ ├── chrome-store-logo-920x680.png │ ├── chrome-store-logo-920x680.xcf │ ├── scraping-a-site │ │ ├── news-site.png │ │ ├── news-site-sitemap.png │ │ └── news-site-selector-graph.png │ └── open-web-scraper │ │ └── open-web-scraper.png ├── Installation.md ├── Open Web Scraper.md ├── Selectors │ ├── HTML selector.md │ ├── Element style selector.md │ ├── Element attribute selector.md │ ├── Link popup selector.md │ ├── Image selector.md │ ├── Grouped selector.md │ ├── Element scroll down selector.md │ ├── Table selector.md │ ├── Element selector.md │ ├── Link selector.md │ ├── Text selector.md │ └── Element click selector.md ├── Storage backends.md ├── CSS selector.md ├── Development.md ├── Selectors.md └── Scraping a site.md ├── .gitmodules ├── .gitignore ├── playgrounds ├── sitemap-tree │ ├── style.css │ ├── sitemap.json │ └── index.html └── extension │ └── webpage.css ├── tests ├── FakeStore.js ├── spec │ ├── QueueSpec.js │ ├── Selector │ │ ├── SelectorValueSpec.js │ │ ├── SelectorElementSpec.js │ │ ├── SelectorGroupSpec.js │ │ ├── SelectorLinkSpec.js │ │ ├── SelectorElementAttributeSpec.js │ │ ├── SelectorElementScrollSpec.js │ │ ├── SelectorElementStyleSpec.js │ │ ├── SelectorHTMLSpec.js │ │ ├── SelectorImageSpec.js │ │ └── SelectorPopupLinkSpec.js │ ├── BackgroundScriptSpec.js │ ├── SelectorSpec.js │ ├── ChromePopupBrowserSpec.js │ ├── ElementQuerySpec.js │ ├── UniqueElementListSpec.js │ ├── 
JobSpec.js │ ├── ContentScriptSpec.js │ ├── ContentSelectorSpec.js │ └── jquery.whencallsequentiallySpec.js ├── Matchers.js └── ChromeAPI.js ├── jasmine-standalone └── lib │ └── jasmine-1.3.1 │ └── MIT.LICENSE └── README.md /extension/devtools/views/SitemapSelectorGraph.html: -------------------------------------------------------------------------------- 1 |
-------------------------------------------------------------------------------- /docs/images/store-logo-sources.txt: -------------------------------------------------------------------------------- 1 | http://jsfiddle.net/t8Sgq/ 2 | http://jsfiddle.net/qpVkY/ 3 | -------------------------------------------------------------------------------- /docs/images/sitemap-tree.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/jwillmer/web-scraper-chrome-extension/HEAD/docs/images/sitemap-tree.png -------------------------------------------------------------------------------- /docs/images/chrome-store-logo.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/jwillmer/web-scraper-chrome-extension/HEAD/docs/images/chrome-store-logo.png -------------------------------------------------------------------------------- /docs/images/chrome-store-logo.xcf: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/jwillmer/web-scraper-chrome-extension/HEAD/docs/images/chrome-store-logo.xcf -------------------------------------------------------------------------------- /extension/assets/images/icon128.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/jwillmer/web-scraper-chrome-extension/HEAD/extension/assets/images/icon128.png -------------------------------------------------------------------------------- /extension/assets/images/icon16.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/jwillmer/web-scraper-chrome-extension/HEAD/extension/assets/images/icon16.png -------------------------------------------------------------------------------- /extension/assets/images/icon19.png: 
-------------------------------------------------------------------------------- https://raw.githubusercontent.com/jwillmer/web-scraper-chrome-extension/HEAD/extension/assets/images/icon19.png -------------------------------------------------------------------------------- /extension/assets/images/icon38.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/jwillmer/web-scraper-chrome-extension/HEAD/extension/assets/images/icon38.png -------------------------------------------------------------------------------- /extension/assets/images/icon48.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/jwillmer/web-scraper-chrome-extension/HEAD/extension/assets/images/icon48.png -------------------------------------------------------------------------------- /docs/images/selectors/table/table.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/jwillmer/web-scraper-chrome-extension/HEAD/docs/images/selectors/table/table.png -------------------------------------------------------------------------------- /docs/images/chrome-store-logo-920x680.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/jwillmer/web-scraper-chrome-extension/HEAD/docs/images/chrome-store-logo-920x680.png -------------------------------------------------------------------------------- /docs/images/chrome-store-logo-920x680.xcf: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/jwillmer/web-scraper-chrome-extension/HEAD/docs/images/chrome-store-logo-920x680.xcf -------------------------------------------------------------------------------- /docs/images/scraping-a-site/news-site.png: -------------------------------------------------------------------------------- 
https://raw.githubusercontent.com/jwillmer/web-scraper-chrome-extension/HEAD/docs/images/scraping-a-site/news-site.png -------------------------------------------------------------------------------- /docs/images/selectors/table/selectors.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/jwillmer/web-scraper-chrome-extension/HEAD/docs/images/selectors/table/selectors.png -------------------------------------------------------------------------------- /extension/devtools/devtools_init_page.html: -------------------------------------------------------------------------------- 1 | 2 | 3 | 4 | 5 | -------------------------------------------------------------------------------- /extension/devtools/devtools_init_page.js: -------------------------------------------------------------------------------- 1 | chrome.devtools.panels.create("Web Scraper", "../assets/images/icon48.png", "devtools/devtools_scraper_panel.html"); -------------------------------------------------------------------------------- /.gitmodules: -------------------------------------------------------------------------------- 1 | [submodule "extension/assets/css-selector"] 2 | path = extension/assets/css-selector 3 | url = https://github.com/martinsbalodis/css-selector.git 4 | -------------------------------------------------------------------------------- /docs/images/open-web-scraper/open-web-scraper.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/jwillmer/web-scraper-chrome-extension/HEAD/docs/images/open-web-scraper/open-web-scraper.png -------------------------------------------------------------------------------- /docs/images/scraping-a-site/news-site-sitemap.png: -------------------------------------------------------------------------------- 
https://raw.githubusercontent.com/jwillmer/web-scraper-chrome-extension/HEAD/docs/images/scraping-a-site/news-site-sitemap.png -------------------------------------------------------------------------------- /docs/images/selectors/element-click/click-more.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/jwillmer/web-scraper-chrome-extension/HEAD/docs/images/selectors/element-click/click-more.png -------------------------------------------------------------------------------- /docs/images/selectors/element-click/click-once.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/jwillmer/web-scraper-chrome-extension/HEAD/docs/images/selectors/element-click/click-once.png -------------------------------------------------------------------------------- /docs/images/scraping-a-site/news-site-selector-graph.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/jwillmer/web-scraper-chrome-extension/HEAD/docs/images/scraping-a-site/news-site-selector-graph.png -------------------------------------------------------------------------------- /docs/images/selectors/link/pagination-link-selectors.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/jwillmer/web-scraper-chrome-extension/HEAD/docs/images/selectors/link/pagination-link-selectors.png -------------------------------------------------------------------------------- /docs/images/selectors/link/pagination-selector-graph.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/jwillmer/web-scraper-chrome-extension/HEAD/docs/images/selectors/link/pagination-selector-graph.png -------------------------------------------------------------------------------- 
/docs/images/selectors/link/multiple-level-link-selectors.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/jwillmer/web-scraper-chrome-extension/HEAD/docs/images/selectors/link/multiple-level-link-selectors.png -------------------------------------------------------------------------------- /docs/images/selectors/text/text-selector-multiple-per-page.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/jwillmer/web-scraper-chrome-extension/HEAD/docs/images/selectors/text/text-selector-multiple-per-page.png -------------------------------------------------------------------------------- /extension/assets/bootstrap-3.0.0/fonts/glyphicons-halflings-regular.eot: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/jwillmer/web-scraper-chrome-extension/HEAD/extension/assets/bootstrap-3.0.0/fonts/glyphicons-halflings-regular.eot -------------------------------------------------------------------------------- /extension/assets/bootstrap-3.0.0/fonts/glyphicons-halflings-regular.ttf: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/jwillmer/web-scraper-chrome-extension/HEAD/extension/assets/bootstrap-3.0.0/fonts/glyphicons-halflings-regular.ttf -------------------------------------------------------------------------------- /extension/assets/bootstrap-3.0.0/fonts/glyphicons-halflings-regular.woff: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/jwillmer/web-scraper-chrome-extension/HEAD/extension/assets/bootstrap-3.0.0/fonts/glyphicons-halflings-regular.woff -------------------------------------------------------------------------------- /docs/images/selectors/text/text-selector-multiple-elements-with-text-selectors.png: 
-------------------------------------------------------------------------------- https://raw.githubusercontent.com/jwillmer/web-scraper-chrome-extension/HEAD/docs/images/selectors/text/text-selector-multiple-elements-with-text-selectors.png -------------------------------------------------------------------------------- /docs/images/selectors/text/text-selector-multiple-single-text-selectors-in-one-page.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/jwillmer/web-scraper-chrome-extension/HEAD/docs/images/selectors/text/text-selector-multiple-single-text-selectors-in-one-page.png -------------------------------------------------------------------------------- /extension/scripts/App.js: -------------------------------------------------------------------------------- 1 | $(function () { 2 | 3 | // init bootstrap alerts 4 | $(".alert").alert(); 5 | 6 | var store = new StoreDevtools(); 7 | new SitemapController({ 8 | store: store, 9 | templateDir: 'views/' 10 | }); 11 | }); -------------------------------------------------------------------------------- /extension/assets/images/LICENSE: -------------------------------------------------------------------------------- 1 | icons source: 2 | https://www.iconfinder.com/iconsets/free-grey-cloud-icons#readme 3 | https://www.iconfinder.com/icons/129397/spider_web_icon#size=96 4 | 5 | license: 6 | http://creativecommons.org/licenses/by/3.0/legalcode 7 | -------------------------------------------------------------------------------- /extension/devtools/views/SitemapExport.html: -------------------------------------------------------------------------------- 1 |
2 |
3 |
4 | 5 |
6 |
7 |
-------------------------------------------------------------------------------- /.gitignore: -------------------------------------------------------------------------------- 1 | .idea 2 | projectFilesBackup 3 | extension.zip 4 | 5 | /.vs/web-scraper-chrome-extension/v15/.suo 6 | /.vs/web-scraper-chrome-extension/v15 7 | /.vs/VSWorkspaceState.json 8 | /.vs/slnx.sqlite 9 | /.vs/ProjectSettings.json 10 | /.vs/config/applicationhost.config 11 | -------------------------------------------------------------------------------- /playgrounds/sitemap-tree/style.css: -------------------------------------------------------------------------------- 1 | .node circle { 2 | cursor: pointer; 3 | fill: #fff; 4 | stroke: steelblue; 5 | stroke-width: 1.5px; 6 | } 7 | 8 | .node text { 9 | font-size: 11px; 10 | } 11 | 12 | path.link { 13 | fill: none; 14 | stroke: #ccc; 15 | stroke-width: 1.5px; 16 | } -------------------------------------------------------------------------------- /extension/devtools/views/SitemapBrowseData.html: -------------------------------------------------------------------------------- 1 |
2 | 3 | 4 | 5 | {{#columns}} 6 | 7 | {{/columns}} 8 | 9 | 10 | 11 | 12 |
{{.}}
13 |
-------------------------------------------------------------------------------- /extension/devtools/views/SitemapList.html: -------------------------------------------------------------------------------- 1 |
2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 | 12 |
IDStart URLactions
13 |
-------------------------------------------------------------------------------- /extension/devtools/views/SelectorEditTableColumn.html: -------------------------------------------------------------------------------- 1 | 2 | {{header}} 3 | 4 | 5 | -------------------------------------------------------------------------------- /playgrounds/extension/webpage.css: -------------------------------------------------------------------------------- 1 | #webpage { 2 | height:400px; 3 | border-radius: 5px; 4 | border:3px #ccc solid; 5 | margin:10px; 6 | overflow-y:scroll; 7 | } 8 | 9 | #webpage { 10 | font-size: 14px; 11 | } 12 | 13 | #webpage .navbar-nav > li > a { 14 | padding-top: 15px; 15 | padding-bottom: 15px; 16 | } 17 | 18 | .productImage { 19 | background-image: url("../../docs/images/chrome-store-logo.png"); 20 | height: 375px; 21 | } -------------------------------------------------------------------------------- /tests/FakeStore.js: -------------------------------------------------------------------------------- 1 | var FakeStore = function () { 2 | this.data = []; 3 | }; 4 | 5 | FakeStore.prototype = { 6 | 7 | writeDocs: function (data, callback) { 8 | data.forEach(function (data) { 9 | this.data.push(data); 10 | }.bind(this)); 11 | callback(); 12 | }, 13 | 14 | initSitemapDataDb: function (sitemapId, callback) { 15 | callback(this); 16 | }, 17 | 18 | saveSitemap: function (sitemap, callback) { 19 | callback(this); 20 | } 21 | }; -------------------------------------------------------------------------------- /playgrounds/sitemap-tree/sitemap.json: -------------------------------------------------------------------------------- 1 | { 2 | "selectors":[ 3 | { 4 | "id": "a", 5 | "type": "SelectorElement", 6 | "parentSelectors": ["_root", "d"] 7 | }, 8 | { 9 | "id": "b", 10 | "type": "SelectorElement", 11 | "parentSelectors": ["a"] 12 | }, 13 | { 14 | "id": "c", 15 | "type": "SelectorElement", 16 | "parentSelectors": ["a"] 17 | }, 18 | { 19 | "id": "d", 20 | 
"type": "SelectorElement", 21 | "parentSelectors": ["a"] 22 | } 23 | ] 24 | } -------------------------------------------------------------------------------- /extension/devtools/views/SitemapListItem.html: -------------------------------------------------------------------------------- 1 | 2 | {{_id}} 3 | 4 | {{#startUrls}} 5 | {{.}}, 6 | {{/startUrls}} 7 | 8 | 9 | 10 | 11 | 12 | 13 | -------------------------------------------------------------------------------- /docs/Installation.md: -------------------------------------------------------------------------------- 1 | # Installation 2 | 3 | You can install the extension from [Chrome store] [1]. After installing it you 4 | should restart Chrome to make sure the extension is fully loaded. If you don't 5 | want to restart Chrome then use the extension only in tabs that are 6 | created after installing it. 7 | 8 | ## Requirements 9 | 10 | The extension requires Chrome 31+. There are no OS limitations. 11 | 12 | [1]: https://chrome.google.com/webstore/detail/web-scraper/jnhgnonknehpejjnehehllkliplmbmhn "Install web scraper from Chrome store" -------------------------------------------------------------------------------- /docs/Open Web Scraper.md: -------------------------------------------------------------------------------- 1 | # Open Web Scraper 2 | 3 | Web Scraper is integrated into Chrome Developer tools. Figure 1 shows how you 4 | can open it. You can also use these shortcuts to open Developer tools. After 5 | opening Developer tools open the *Web Scraper* tab. 6 | 7 | Shortcuts: 8 | 9 | * Windows, Linux: `Ctrl+Shift+I`, `F12`, open `Tools / Developer tools` 10 | * Mac: `Cmd+Opt+I`, open `Tools / Developer tools` 11 | 12 | ![Fig. 
1: Open Web Scraper][open-web-scraper] 13 | 14 | [open-web-scraper]: images/open-web-scraper/open-web-scraper.png?raw=true -------------------------------------------------------------------------------- /extension/popup.html: -------------------------------------------------------------------------------- 1 | 2 | 3 | 4 | 10 | 11 | 12 |

13 | Open Developer tools where you will find Web Scraper tab: 14 |

25 |

26 |

27 | Documentation is available on webscraper.io 28 |

29 | 30 | -------------------------------------------------------------------------------- /extension/assets/jquery.bootstrapvalidator/bootstrapValidator.css: -------------------------------------------------------------------------------- 1 | /** 2 | * BootstrapValidator (http://bootstrapvalidator.com) 3 | * The best jQuery plugin to validate form fields. Designed to use with Bootstrap 3 4 | * 5 | * @author http://twitter.com/nghuuphuoc 6 | * @copyright (c) 2013 - 2014 Nguyen Huu Phuoc 7 | * @license MIT 8 | */ 9 | 10 | .bv-form .help-block { 11 | margin-bottom: 0; 12 | } 13 | .bv-form .tooltip-inner { 14 | text-align: left; 15 | } 16 | .nav-tabs li.bv-tab-success > a { 17 | color: #3c763d; 18 | } 19 | .nav-tabs li.bv-tab-error > a { 20 | color: #a94442; 21 | } 22 | -------------------------------------------------------------------------------- /extension/devtools/views/SelectorListItem.html: -------------------------------------------------------------------------------- 1 | 2 | {{id}} 3 | {{selector}} 4 | {{type}} 5 | {{multiple}} 6 | {{parentSelectors}} 7 | 8 | 9 | 10 | 11 | 12 | 13 | -------------------------------------------------------------------------------- /extension/devtools/views/DataPreview.html: -------------------------------------------------------------------------------- 1 | -------------------------------------------------------------------------------- /extension/devtools/views/SelectorList.html: -------------------------------------------------------------------------------- 1 |
2 | 7 | 8 | 9 | 10 | 11 | 12 | 13 | 14 | 15 | 16 | 17 | 18 | 19 |
IDSelectortypeMultipleParent selectorsActions
20 | 21 |
-------------------------------------------------------------------------------- /docs/Selectors/HTML selector.md: -------------------------------------------------------------------------------- 1 | # HTML selector 2 | HTML selector can extract HTML and text within the selected element. Only the 3 | inner HTML of the element will be extracted. 4 | 5 | ## Configuration options 6 | * selector - [CSS selector] [css-selector] for the element whose inner HTML 7 | will be extracted. 8 | * multiple - multiple records are being extracted. 9 | * remove HTML 10 | * trim text 11 | * replace text - a regular expression can be used in the replace field 12 | * text prefix/suffix 13 | * delay - delay the extraction 14 | 15 | ## Use cases 16 | See [Text selector] [text-selector] use cases. 17 | 18 | [text-selector]: Text%20selector.md 19 | [css-selector]: ../CSS%20selector.md -------------------------------------------------------------------------------- /extension/devtools/views/SitemapExportDataCSV.html: -------------------------------------------------------------------------------- 1 |
Console
2 | 3 |

Export {{_id}} data as CSV

4 |
5 | 6 | 7 | 8 |
9 | 10 | 11 | 12 |
13 | 14 | 15 | 16 |
17 | 18 | 19 |
20 |
21 | 22 |

23 | Waiting for process to finish. > Download now! 24 |

25 | -------------------------------------------------------------------------------- /extension/scripts/Selector/SelectorElement.js: -------------------------------------------------------------------------------- 1 | var SelectorElement = { 2 | 3 | canReturnMultipleRecords: function () { 4 | return true; 5 | }, 6 | 7 | canHaveChildSelectors: function () { 8 | return true; 9 | }, 10 | 11 | canHaveLocalChildSelectors: function () { 12 | return true; 13 | }, 14 | 15 | canCreateNewJobs: function () { 16 | return false; 17 | }, 18 | willReturnElements: function () { 19 | return true; 20 | }, 21 | 22 | _getData: function (parentElement) { 23 | 24 | var dfd = $.Deferred(); 25 | 26 | var elements = this.getDataElements(parentElement); 27 | dfd.resolve(jQuery.makeArray(elements)); 28 | 29 | return dfd.promise(); 30 | }, 31 | 32 | getDataColumns: function () { 33 | return []; 34 | }, 35 | 36 | getFeatures: function () { 37 | return ['multiple', 'delay']; 38 | } 39 | }; 40 | -------------------------------------------------------------------------------- /docs/Selectors/Element style selector.md: -------------------------------------------------------------------------------- 1 | # Element style selector 2 | Element style selector can extract a style value of an HTML element. 3 | For example you could use this selector to extract the `width` style from 4 | this div: `
`. 5 | 6 | ## Configuration options 7 | * selector - [CSS selector] [css-selector] for the element. 8 | * multiple - multiple records are being extracted. 9 | * style name - the attribute that is going to be extracted. For example 10 | `width`, `background-image`. 11 | * remove HTML 12 | * trim text 13 | * replace text - regular expression in the replace field possible 14 | * text prefix/suffix 15 | * delay - delay the extraction 16 | 17 | ## Use cases 18 | See [Text selector] [text-selector] use cases. 19 | 20 | [text-selector]: Text%20selector.md 21 | [css-selector]: ../CSS%20selector.md -------------------------------------------------------------------------------- /playgrounds/sitemap-tree/index.html: -------------------------------------------------------------------------------- 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 | 12 | 13 |
14 | 21 | 22 | 23 | -------------------------------------------------------------------------------- /extension/devtools/views/SitemapImport.html: -------------------------------------------------------------------------------- 1 |
2 |
3 | 4 | 5 |
6 | 7 |
8 |
9 |
10 | 11 | 12 |
13 | 14 |
15 |
16 |
17 |
18 | 19 |
20 |
21 |
-------------------------------------------------------------------------------- /docs/Selectors/Element attribute selector.md: -------------------------------------------------------------------------------- 1 | # Element attribute selector 2 | Element attribute selector can extract an attribute's value of an HTML element. 3 | For example you could use this selector to extract the title attribute from 4 | this link: `link`. 5 | 6 | ## Configuration options 7 | * selector - [CSS selector] [css-selector] for the element. 8 | * multiple - multiple records are being extracted. 9 | * attribute name - the attribute that is going to be extracted. For example 10 | `title`, `data-id`. 11 | * remove HTML 12 | * trim text 13 | * replace text - a regular expression can be used in the replace field 14 | * text prefix/suffix 15 | * delay - delay the extraction 16 | 17 | ## Use cases 18 | See [Text selector] [text-selector] use cases. 19 | 20 | [text-selector]: Text%20selector.md 21 | [css-selector]: ../CSS%20selector.md -------------------------------------------------------------------------------- /tests/spec/QueueSpec.js: -------------------------------------------------------------------------------- 1 | describe("Queue", function () { 2 | 3 | var q; 4 | var job; 5 | 6 | beforeEach(function () { 7 | q = new Queue(); 8 | job = new Job("http://test.lv/", {}); 9 | }); 10 | 11 | it("should be able to add items to queue", function () { 12 | q.add(job); 13 | expect(q.getQueueSize()).toBe(1); 14 | expect(q.jobs[0].url).toBe("http://test.lv/"); 15 | }); 16 | 17 | it("should be able to mark urls as scraped", function () { 18 | 19 | q.add(job); 20 | var j = q.getNextJob(); 21 | expect(q.getQueueSize()).toBe(0); 22 | 23 | // try to add this job again 24 | q.add(job); 25 | expect(q.getQueueSize()).toBe(0); 26 | }); 27 | 28 | it("should be able to reject documents", function () { 29 | 30 | job = new Job("http://test.lv/test.doc"); 31 | 32 | var accepted = q.add(job); 33 | 
expect(accepted).toBe(false); 34 | }); 35 | 36 | }); -------------------------------------------------------------------------------- /docs/Selectors/Link popup selector.md: -------------------------------------------------------------------------------- 1 | # Link popup selector 2 | 3 | *Link popup selector* works similarly to [Link selector] [link-selector]. It can 4 | be used for URL extraction and site navigation. The only difference is that 5 | *Link popup selector* should be used when clicking a link makes the site open a new 6 | window (popup) instead of loading the URL in the same tab or opening it in a 7 | new tab. This selector will catch the popup creation event and extract the URL. 8 | If the site creates a visual popup but not a real window then you should try 9 | [Element click selector] [element-click-selector]. 10 | 11 | Note! When selecting these link elements, you can move the mouse over the 12 | element and press "S" to select it without opening a popup. 13 | 14 | ## Use cases 15 | See [Link selector] [link-selector] use cases. 16 | 17 | [link-selector]: Link%20selector.md 18 | [element-click-selector]: Element%20click%20selector.md -------------------------------------------------------------------------------- /extension/devtools/views/SitemapCreate.html: -------------------------------------------------------------------------------- 1 |
2 |
3 | 4 | 5 |
6 | 7 |
8 |
9 |
10 | 11 |
12 |
13 | 14 |
15 |
16 |
17 |
18 |
19 | 20 |
21 |
22 |
-------------------------------------------------------------------------------- /tests/spec/Selector/SelectorValueSpec.js: -------------------------------------------------------------------------------- 1 | describe("Value Selector", function () { 2 | 3 | var $el; 4 | 5 | beforeEach(function () { 6 | }); 7 | 8 | it("should place value in input element", function () { 9 | 10 | var selector = new Selector({ 11 | id: 'a', 12 | type: 'SelectorValue', 13 | multiple: false, 14 | selector: "#selector-value-input", 15 | insertValue: "test" 16 | }); 17 | 18 | var dataDeferred = selector.getData($("#selector-value")); 19 | 20 | waitsFor(function () { 21 | return dataDeferred.state() === 'resolved'; 22 | }, "wait for data input", 5000); 23 | 24 | runs(function () { 25 | dataDeferred.done(function (data) { 26 | var input = $("#selector-value-input").val(); 27 | expect(data[0].a).toEqual(input); 28 | }); 29 | }); 30 | }); 31 | }); 32 | -------------------------------------------------------------------------------- /extension/scripts/DateUtils/DateRoller.js: -------------------------------------------------------------------------------- 1 | /* 2 | * Iterator from first day to second 3 | * 4 | * @author © Denis Bakhtenkov denis.bakhtenkov@gmail.com 5 | * @version 2016 6 | */ 7 | 8 | var DateRoller = { 9 | 10 | /** 11 | * 12 | * @param {Date} from 13 | * @param {Date} to 14 | * @returns {Array} all days between From and To 15 | */ 16 | days: function (from, to) { 17 | 18 | /** 19 | * 20 | * @param {Date} first 21 | * @param {Date} second 22 | * @returns {Number} 23 | */ 24 | function compareDays(first, second) { 25 | var day = 24 * 60 * 60 * 1000; 26 | return Math.floor(first / day) - Math.floor(second / day); 27 | } 28 | 29 | var res = []; 30 | var curDate = new Date(from); 31 | var step = from <= to ? 
1 : -1; 32 | 33 | do { 34 | res.push(new Date(curDate)); 35 | curDate.setDate(curDate.getDate() + step); 36 | } while (compareDays(curDate, to) * step <= 0); 37 | 38 | return res; 39 | } 40 | 41 | }; -------------------------------------------------------------------------------- /tests/spec/BackgroundScriptSpec.js: -------------------------------------------------------------------------------- 1 | describe("BackgroundScript", function () { 2 | 3 | var backgroundScript = getBackgroundScript("BackgroundScript"); 4 | var contentScript = getContentScript("BackgroundScript"); 5 | var $el; 6 | 7 | beforeEach(function () { 8 | 9 | this.addMatchers(selectorMatchers); 10 | 11 | $el = jQuery("#tests").html(""); 12 | if($el.length === 0) { 13 | $el = $("").appendTo("body"); 14 | } 15 | }); 16 | 17 | it("should be able to call BackgroundScript functions from background script", function () { 18 | 19 | var deferredResponse = backgroundScript.dummy(); 20 | 21 | expect(deferredResponse).deferredToEqual('dummy'); 22 | }); 23 | 24 | it("should be able to call BackgroundScript from Devtools", function() { 25 | 26 | var backgroundScript = getBackgroundScript("DevTools"); 27 | var deferredResponse = backgroundScript.dummy(); 28 | expect(deferredResponse).deferredToEqual('dummy'); 29 | }); 30 | }); -------------------------------------------------------------------------------- /docs/Selectors/Image selector.md: -------------------------------------------------------------------------------- 1 | # Image selector 2 | Image selector can extract `src` attribute (URL) of an image. 3 | Optionally you can also store the images. The images will be stored in your 4 | downloads directory: 5 | 6 | `Downloads///` 7 | 8 | Note! When selecting CSS selector for image selector all the images within the 9 | site are moved to the top. If this feature somehow breaks sites layout please 10 | report it as a bug. 
11 | 12 | ## Configuration options 13 | * selector - [CSS selector] [css-selector] for the image element. 14 | * multiple - multiple records are being extracted. Usually should not be 15 | checked for Image selector. 16 | * download image - downloads and store images on local drive. When CouchDB 17 | storage back end is used the image is also stored locally. 18 | * delay - delay the extraction 19 | 20 | ## Use cases 21 | See [Text selector] [text-selector] use cases. 22 | 23 | [text-selector]: Text%20selector.md 24 | [css-selector]: ../CSS%20selector.md -------------------------------------------------------------------------------- /extension/assets/base64.js: -------------------------------------------------------------------------------- 1 | /** 2 | * @url http://jsperf.com/blob-base64-conversion 3 | * @type {{blobToBase64: blobToBase64, base64ToBlob: base64ToBlob}} 4 | */ 5 | var Base64 = { 6 | 7 | blobToBase64: function(blob) { 8 | 9 | var deferredResponse = $.Deferred(); 10 | var reader = new FileReader(); 11 | reader.onload = function() { 12 | var dataUrl = reader.result; 13 | var base64 = dataUrl.split(',')[1]; 14 | deferredResponse.resolve(base64); 15 | }; 16 | reader.readAsDataURL(blob); 17 | 18 | return deferredResponse.promise(); 19 | }, 20 | 21 | base64ToBlob: function(base64, mimeType) { 22 | 23 | var deferredResponse = $.Deferred(); 24 | var binary = atob(base64); 25 | var len = binary.length; 26 | var buffer = new ArrayBuffer(len); 27 | var view = new Uint8Array(buffer); 28 | for (var i = 0; i < len; i++) { 29 | view[i] = binary.charCodeAt(i); 30 | } 31 | var blob = new Blob([view], {type: mimeType}); 32 | deferredResponse.resolve(blob); 33 | 34 | return deferredResponse.promise(); 35 | } 36 | }; 37 | -------------------------------------------------------------------------------- /docs/Storage backends.md: -------------------------------------------------------------------------------- 1 | # Storage backends 2 | 3 | Web scraper can be configured to 
use either local storage or CouchDB. By 4 | default all data is stored in local storage. 5 | 6 | ## Local storage 7 | 8 | The local storage backend uses the browser's built-in database to store data. This data 9 | is not replicated from one Chrome instance to another. 10 | 11 | ## CouchDB 12 | 13 | [CouchDB] [couchdb] is a RESTful NoSQL JavaScript database. You can configure 14 | the extension to store sitemaps and scraped data in this database. The data 15 | can then be accessed from all your Chrome instances. To do that 16 | you need to configure it in the options page. You can open it by right-clicking 17 | the extension's icon and selecting options. There you can switch between storage 18 | backends. For CouchDB you need to configure the database where sitemaps 19 | will be stored and the CouchDB server where scraped data will be stored. 20 | For example, you can configure it like this: 21 | 22 | * sitemap db - http://localhost:5984/scraper-sitemaps 23 | * data db - http://localhost:5984/ 24 | 25 | [couchdb]: http://couchdb.apache.org/ 26 | -------------------------------------------------------------------------------- /extension/assets/LICENSE-sugar-js: -------------------------------------------------------------------------------- 1 | Copyright © 2011 Andrew Plummer 2 | 3 | Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files (the “Software”), to deal in the Software without restriction, including without limitation the rights to use, copy, modify, merge, publish, distribute, sub-license, and/or sell copies of the Software, and to permit persons to whom the Software is furnished to do so, subject to the following conditions: 4 | The above copyright notice, and every other copyright notice found in this software, and all the attributions in every file, and this permission notice shall be included in all copies or substantial portions of the Software.
5 | THE SOFTWARE IS PROVIDED “AS IS”, WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NON-INFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE. -------------------------------------------------------------------------------- /jasmine-standalone/lib/jasmine-1.3.1/MIT.LICENSE: -------------------------------------------------------------------------------- 1 | Copyright (c) 2008-2011 Pivotal Labs 2 | 3 | Permission is hereby granted, free of charge, to any person obtaining 4 | a copy of this software and associated documentation files (the 5 | "Software"), to deal in the Software without restriction, including 6 | without limitation the rights to use, copy, modify, merge, publish, 7 | distribute, sublicense, and/or sell copies of the Software, and to 8 | permit persons to whom the Software is furnished to do so, subject to 9 | the following conditions: 10 | 11 | The above copyright notice and this permission notice shall be 12 | included in all copies or substantial portions of the Software. 13 | 14 | THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, 15 | EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF 16 | MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND 17 | NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE 18 | LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION 19 | OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION 20 | WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE. 
21 | -------------------------------------------------------------------------------- /extension/scripts/Queue.js: -------------------------------------------------------------------------------- 1 | var Queue = function () { 2 | this.jobs = []; 3 | this.scrapedUrls = {}; 4 | }; 5 | 6 | Queue.prototype = { 7 | 8 | /** 9 | * Returns false if page is already scraped 10 | * @param job 11 | * @returns {boolean} 12 | */ 13 | add: function (job) { 14 | 15 | if (this.canBeAdded(job)) { 16 | this.jobs.push(job); 17 | this._setUrlScraped(job.url); 18 | return true; 19 | } 20 | return false; 21 | }, 22 | 23 | canBeAdded: function (job) { 24 | if (this.isScraped(job.url)) { 25 | return false; 26 | } 27 | 28 | // reject documents 29 | if (job.url.match(/\.(doc|docx|pdf|ppt|pptx|odt)$/i) !== null) { 30 | return false; 31 | } 32 | return true; 33 | }, 34 | 35 | getQueueSize: function () { 36 | return this.jobs.length; 37 | }, 38 | 39 | isScraped: function (url) { 40 | return (this.scrapedUrls[url] !== undefined); 41 | }, 42 | 43 | _setUrlScraped: function (url) { 44 | this.scrapedUrls[url] = true; 45 | }, 46 | 47 | getNextJob: function () { 48 | 49 | // @TODO test this 50 | if (this.getQueueSize() > 0) { 51 | return this.jobs.pop(); 52 | } 53 | else { 54 | return false; 55 | } 56 | } 57 | }; -------------------------------------------------------------------------------- /extension/assets/LICENSE-jquery-js: -------------------------------------------------------------------------------- 1 | Copyright 2013 jQuery Foundation and other contributors 2 | http://jquery.com/ 3 | 4 | Permission is hereby granted, free of charge, to any person obtaining 5 | a copy of this software and associated documentation files (the 6 | "Software"), to deal in the Software without restriction, including 7 | without limitation the rights to use, copy, modify, merge, publish, 8 | distribute, sublicense, and/or sell copies of the Software, and to 9 | permit persons to whom the Software is furnished to 
do so, subject to 10 | the following conditions: 11 | 12 | The above copyright notice and this permission notice shall be 13 | included in all copies or substantial portions of the Software. 14 | 15 | THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, 16 | EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF 17 | MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND 18 | NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE 19 | LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION 20 | OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION 21 | WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE. 22 | -------------------------------------------------------------------------------- /extension/scripts/Selector/SelectorValue.js: -------------------------------------------------------------------------------- 1 | var SelectorValue = { 2 | 3 | canReturnMultipleRecords: function () { 4 | return false; 5 | }, 6 | 7 | canHaveChildSelectors: function () { 8 | return true; 9 | }, 10 | 11 | canHaveLocalChildSelectors: function () { 12 | return true; 13 | }, 14 | 15 | canCreateNewJobs: function () { 16 | return false; 17 | }, 18 | 19 | willReturnElements: function () { 20 | return false; 21 | }, 22 | 23 | _getData: function (parentElement) { 24 | 25 | var dfd = $.Deferred(); 26 | 27 | var elements = this.getDataElements(parentElement); 28 | 29 | var result = []; 30 | $(elements).each(function (k, element) { 31 | $(element).val(this.insertValue); 32 | }.bind(this)); 33 | 34 | 35 | var data = {}; 36 | data[this.id] = this.insertValue; 37 | result.push(data); 38 | 39 | 40 | dfd.resolve(result); 41 | return dfd.promise(); 42 | }, 43 | 44 | getDataColumns: function () { 45 | return []; 46 | }, 47 | 48 | getFeatures: function () { 49 | return ['insertValue'] 50 | } 51 | }; 52 | -------------------------------------------------------------------------------- 
/extension/scripts/Selector/SelectorElementStyle.js: -------------------------------------------------------------------------------- 1 | var SelectorElementStyle = { 2 | canReturnMultipleRecords: function () { 3 | return true; 4 | }, 5 | 6 | canHaveChildSelectors: function () { 7 | return false; 8 | }, 9 | 10 | canHaveLocalChildSelectors: function () { 11 | return false; 12 | }, 13 | 14 | canCreateNewJobs: function () { 15 | return false; 16 | }, 17 | willReturnElements: function () { 18 | return false; 19 | }, 20 | _getData: function (parentElement) { 21 | 22 | var dfd = $.Deferred(); 23 | var elements = this.getDataElements(parentElement); 24 | 25 | var result = []; 26 | $(elements).each(function (k, element) { 27 | var data = {}; 28 | data[this.id] = $(element).css(this.extractStyle); 29 | result.push(data); 30 | }.bind(this)); 31 | 32 | if (this.multiple === false && elements.length === 0) { 33 | var data = {}; 34 | data[this.id] = null; // key must match getDataColumns() 35 | result.push(data); 36 | } 37 | dfd.resolve(result); 38 | 39 | return dfd.promise(); 40 | }, 41 | 42 | getDataColumns: function () { 43 | return [this.id]; 44 | }, 45 | 46 | getFeatures: function () { 47 | return ['multiple', 'extractStyle', 'delay', 'textmanipulation'] 48 | } 49 | }; -------------------------------------------------------------------------------- /extension/scripts/Selector/SelectorElementAttribute.js: -------------------------------------------------------------------------------- 1 | var SelectorElementAttribute = { 2 | canReturnMultipleRecords: function () { 3 | return true; 4 | }, 5 | 6 | canHaveChildSelectors: function () { 7 | return false; 8 | }, 9 | 10 | canHaveLocalChildSelectors: function () { 11 | return false; 12 | }, 13 | 14 | canCreateNewJobs: function () { 15 | return false; 16 | }, 17 | willReturnElements: function () { 18 | return false; 19 | }, 20 | _getData: function (parentElement) { 21 | 22 | var dfd = $.Deferred(); 23 | 24 | var elements =
this.getDataElements(parentElement); 25 | 26 | var result = []; 27 | $(elements).each(function (k, element) { 28 | var data = {}; 29 | 30 | data[this.id] = $(element).attr(this.extractAttribute); 31 | result.push(data); 32 | }.bind(this)); 33 | 34 | if (this.multiple === false && elements.length === 0) { 35 | var data = {}; 36 | data[this.id] = null; // key must match getDataColumns() 37 | result.push(data); 38 | } 39 | dfd.resolve(result); 40 | 41 | return dfd.promise(); 42 | }, 43 | 44 | getDataColumns: function () { 45 | return [this.id]; 46 | }, 47 | 48 | getFeatures: function () { 49 | return ['multiple', 'extractAttribute', 'delay', 'textmanipulation'] 50 | } 51 | }; -------------------------------------------------------------------------------- /extension/scripts/Selector/SelectorHTML.js: -------------------------------------------------------------------------------- 1 | var SelectorHTML = { 2 | 3 | canReturnMultipleRecords: function () { 4 | return true; 5 | }, 6 | 7 | canHaveChildSelectors: function () { 8 | return false; 9 | }, 10 | 11 | canHaveLocalChildSelectors: function () { 12 | return false; 13 | }, 14 | 15 | canCreateNewJobs: function () { 16 | return false; 17 | }, 18 | willReturnElements: function () { 19 | return false; 20 | }, 21 | _getData: function (parentElement) { 22 | 23 | var dfd = $.Deferred(); 24 | 25 | var elements = this.getDataElements(parentElement); 26 | 27 | var result = []; 28 | $(elements).each(function (k, element) { 29 | var data = {}; 30 | var html = $(element).html(); 31 | 32 | // do something 33 | 34 | data[this.id] = html; 35 | 36 | result.push(data); 37 | }.bind(this)); 38 | 39 | if (this.multiple === false && elements.length === 0) { 40 | var data = {}; 41 | data[this.id] = null; 42 | result.push(data); 43 | } 44 | 45 | dfd.resolve(result); 46 | return dfd.promise(); 47 | }, 48 | 49 | getDataColumns: function () { 50 | return [this.id]; 51 | }, 52 | 53 | getFeatures: function () { 54 | return ['multiple', 'textmanipulation', 'delay'] 55 | } 56
| }; 57 | -------------------------------------------------------------------------------- /docs/Selectors/Grouped selector.md: -------------------------------------------------------------------------------- 1 | # Grouped selector 2 | 3 | Grouped selector can group text data from multiple elements into one record. 4 | The extracted data will be stored as JSON. 5 | 6 | ## Configuration options 7 | * selector - [CSS selector] [css-selector] for the elements whose text will be 8 | extracted and stored in JSON format. 9 | * attribute name - optionally this selector can extract an attribute of the 10 | selected element. If specified the extractor will also add this attribute to 11 | the resulting JSON. 12 | * remove HTML 13 | * trim text 14 | * replace text - regular expression in the replace field possible 15 | * text prefix/suffix 16 | * delay - delay the extraction 17 | 18 | ## Use cases 19 | 20 | #### Extract article references 21 | 22 | For example you are extracting a news article that might have multiple 23 | reference links. If you are selecting these links with link selector with 24 | multiple checked you would get duplicate articles in the result set where each 25 | record would contain one reference link. Using grouped selector you could 26 | serialize all these reference links into one record. To do that select all 27 | reference links and set attribute name to `href` to also extract links to these 28 | sites. 29 | 30 | [css-selector]: ../CSS%20selector.md -------------------------------------------------------------------------------- /docs/Selectors/Element scroll down selector.md: -------------------------------------------------------------------------------- 1 | # Element scroll down selector 2 | 3 | This is another Element selector that works similarly to Element selector but 4 | additionally it scrolls down the page multiple times to find those elements 5 | which are added when page is scrolled down to the bottom. 
Use the delay 6 | attribute to configure the waiting interval between scrolling and element search. 7 | Scrolling stops once no new elements are found. If the page can scroll 8 | infinitely then this selector will be stuck in an infinite loop. 9 | 10 | ## Configuration options 11 | 12 | * selector - [CSS selector] [css-selector] for the element. 13 | * multiple - multiple records are being extracted (almost always should be 14 | checked). The multiple option for child selectors usually should not be checked. 15 | * delay - delay before element selection and delay between scrolling. This 16 | should usually be specified because the data won't be loaded immediately from 17 | the server after scrolling down. More than 2000 ms might be a good choice if 18 | you don't want to lose data because the server didn't respond fast enough. 19 | * pagination limit - the number of clicks you want the selector to perform. 20 | 21 | ## Use cases 22 | See [Element selector] [element-selector] use cases. 23 | 24 | [element-selector]: Element%20selector.md 25 | [css-selector]: ../CSS%20selector.md -------------------------------------------------------------------------------- /extension/scripts/Config.js: -------------------------------------------------------------------------------- 1 | var Config = function () { 2 | 3 | }; 4 | 5 | Config.prototype = { 6 | 7 | sitemapDb: '', 8 | dataDb: '', 9 | 10 | defaults: { 11 | storageType: "local", 12 | // this is where sitemap documents are stored 13 | sitemapDb: "scraper-sitemaps", 14 | // this is where scraped data is stored.
15 | // empty for local storage 16 | dataDb: "" 17 | }, 18 | 19 | /** 20 | * Loads configuration from chrome extension sync storage 21 | */ 22 | loadConfiguration: function (callback) { 23 | 24 | chrome.storage.sync.get(['sitemapDb', 'dataDb', 'storageType'], function (items) { 25 | 26 | this.storageType = items.storageType || this.defaults.storageType; 27 | if (this.storageType === 'local') { 28 | this.sitemapDb = this.defaults.sitemapDb; 29 | this.dataDb = this.defaults.dataDb; 30 | } 31 | else { 32 | this.sitemapDb = items.sitemapDb || this.defaults.sitemapDb; 33 | this.dataDb = items.dataDb || this.defaults.dataDb; 34 | } 35 | 36 | callback(); 37 | }.bind(this)); 38 | }, 39 | 40 | /** 41 | * Saves configuration to chrome extension sync storage 42 | * @param {type} items 43 | * @param {type} callback 44 | * @returns {undefined} 45 | */ 46 | updateConfiguration: function (items, callback) { 47 | chrome.storage.sync.set(items, callback); 48 | } 49 | }; -------------------------------------------------------------------------------- /docs/Selectors/Table selector.md: -------------------------------------------------------------------------------- 1 | # Table selector 2 | 3 | Table selector can extract data from tables. *Table selector* has 3 4 | configurable CSS selectors. The selector is for table selection. After you have 5 | selected the selector the *Table selector* will try to guess selectors 6 | for header row and data rows. You can click Element preview on those selectors 7 | to see whether the *Table selector* found table header and data rows correctly. 8 | The header row selector is used to identify table columns when data is 9 | extracted from multiple pages. Also you can rename table columns. Figure 1 10 | shows what you should select when extracting data from a table. 11 | 12 | ![Fig. 1: Selectors for table selector] [table-selector-selectors] 13 | 14 | ## Configuration options 15 | * selector - [CSS selector] [css-selector] for the table element. 
16 | * header row selector - [CSS selector] [css-selector] for table header row. 17 | * data rows selector - [CSS selector] [css-selector] for table data rows. 18 | * multiple - multiple records are being extracted. Usually should be 19 | checked for Table selector because you are extracting multiple rows. 20 | * delay - delay the extraction 21 | 22 | ## Use cases 23 | See [Text selector] [text-selector] use cases. 24 | 25 | [table-selector-selectors]: ../images/selectors/table/selectors.png?raw=true 26 | [text-selector]: Text%20selector.md 27 | [css-selector]: ../CSS%20selector.md -------------------------------------------------------------------------------- /extension/scripts/Selector/SelectorText.js: -------------------------------------------------------------------------------- 1 | var SelectorText = { 2 | 3 | canReturnMultipleRecords: function () { 4 | return true; 5 | }, 6 | 7 | canHaveChildSelectors: function () { 8 | return false; 9 | }, 10 | 11 | canHaveLocalChildSelectors: function () { 12 | return false; 13 | }, 14 | 15 | canCreateNewJobs: function () { 16 | return false; 17 | }, 18 | willReturnElements: function () { 19 | return false; 20 | }, 21 | _getData: function (parentElement) { 22 | 23 | var dfd = $.Deferred(); 24 | 25 | var elements = this.getDataElements(parentElement); 26 | 27 | var result = []; 28 | $(elements).each(function (k, element) { 29 | var data = {}; 30 | 31 | // remove script, style tag contents from text results 32 | var $element_clone = $(element).clone(); 33 | $element_clone.find("script, style").remove(); 34 | //
replace br tags with newlines 35 | $element_clone.find("br").after("\n"); 36 | data[this.id] = $element_clone.text(); 37 | 38 | result.push(data); 39 | }.bind(this)); 40 | 41 | if (this.multiple === false && elements.length === 0) { 42 | var data = {}; 43 | data[this.id] = null; 44 | result.push(data); 45 | } 46 | 47 | dfd.resolve(result); 48 | return dfd.promise(); 49 | }, 50 | 51 | getDataColumns: function () { 52 | return [this.id]; 53 | }, 54 | 55 | getFeatures: function () { 56 | return ['multiple', 'delay', 'textmanipulation'] 57 | } 58 | }; 59 | -------------------------------------------------------------------------------- /extension/assets/jquery.whencallsequentially.js: -------------------------------------------------------------------------------- 1 | /** 2 | * @author Martins Balodis 3 | * 4 | * An alternative version of $.when which can be used to execute asynchronous 5 | * calls sequentially one after another. 6 | * 7 | * @returns $.Deferred().promise() 8 | */ 9 | $.whenCallSequentially = function (functionCalls) { 10 | 11 | var deferredResonse = $.Deferred(); 12 | var resultData = new Array(); 13 | 14 | // nothing to do 15 | if (functionCalls.length === 0) { 16 | return deferredResonse.resolve(resultData).promise(); 17 | } 18 | 19 | var currentDeferred = functionCalls.shift()(); 20 | // execute synchronous calls synchronously 21 | while (currentDeferred.state() === 'resolved') { 22 | currentDeferred.done(function (data) { 23 | resultData.push(data); 24 | }); 25 | if (functionCalls.length === 0) { 26 | return deferredResonse.resolve(resultData).promise(); 27 | } 28 | currentDeferred = functionCalls.shift()(); 29 | } 30 | 31 | // handle async calls 32 | var interval = setInterval(function () { 33 | // handle mixed sync calls 34 | while (currentDeferred.state() === 'resolved') { 35 | currentDeferred.done(function (data) { 36 | resultData.push(data); 37 | }); 38 | if (functionCalls.length === 0) { 39 | clearInterval(interval); 40 | 
deferredResonse.resolve(resultData); 41 | break; 42 | } 43 | currentDeferred = functionCalls.shift()(); 44 | } 45 | }, 10); 46 | 47 | return deferredResonse.promise(); 48 | }; 49 | -------------------------------------------------------------------------------- /docs/CSS selector.md: -------------------------------------------------------------------------------- 1 | # CSS selector 2 | 3 | Web Scraper uses CSS selectors to find HTML elements in web pages and to extract 4 | data from them. When selecting an element, Web Scraper will try to make its 5 | best guess at what the CSS selector might be for the selected elements. But you 6 | can also write it yourself and test it by clicking "Element preview". You 7 | can use CSS selectors that are available in CSS versions 1-3 and also pseudo 8 | selectors that are additionally available in jQuery. Here are some 9 | documentation links that might help you: 10 | 11 | * [CSS Selectors] [css-selectors-wikipedia] 12 | * [jQuery CSS selectors] [css-selectors-jquery] 13 | * [w3schools CSS selector reference] [w3schools-css-selector-reference] 14 | 15 | ## Additional Web Scraper selectors 16 | It is possible to add new pseudo CSS selectors to Web Scraper. Right now 17 | only one such selector has been added. 18 | 19 | #### Parent selector 20 | 21 | CSS Selector `_parent_` allows a child selector of an 22 | *Element selector* to select the element that was returned by the *Element selector*. For 23 | example, this CSS selector could be used in a case where you need to extract an 24 | attribute from the element that the *Element selector* returned.
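
As a rough illustration of the parent selector, the sketch below resolves `_parent_` to the parent element itself and hands every other selector to a normal query. The names `resolveElements` and `queryChildren` are hypothetical, made up for this example (`queryChildren` stands in for a DOM query such as jQuery's `$(selector, parent)`), and this is not necessarily how the extension implements it.

```javascript
// Hypothetical sketch, not the extension's actual code.
// queryChildren(parent, cssSelector) is an assumed stand-in for a real DOM
// query such as jQuery's $(cssSelector, parent).get().
function resolveElements(parentElement, cssSelector, queryChildren) {
    // `_parent_` skips the query and yields the parent element itself
    if (cssSelector === "_parent_") {
        return [parentElement];
    }
    // any other selector is looked up among the parent's descendants
    return queryChildren(parentElement, cssSelector);
}

// Tiny fake "DOM" so the example runs without a browser
var link = {tag: "a", href: "http://example.com/"};
var row = {tag: "div", children: [link]};
var byTag = function (parent, tag) {
    return parent.children.filter(function (child) {
        return child.tag === tag;
    });
};

var parentResult = resolveElements(row, "_parent_", byTag); // [row]
var childResult = resolveElements(row, "a", byTag);         // [link]
```

Under this reading, an *Element attribute selector* configured with `_parent_` below an *Element selector* would read the attribute from the element the parent selector matched, rather than from one of its descendants.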
25 | 26 | [css-selectors-wikipedia]: http://en.wikipedia.org/wiki/Cascading_Style_Sheets#Selector 27 | [css-selectors-jquery]: http://api.jquery.com/category/selectors/ 28 | [w3schools-css-selector-reference]: http://www.w3schools.com/cssref/css_selectors.asp -------------------------------------------------------------------------------- /extension/assets/LICENSE-d3-js: -------------------------------------------------------------------------------- 1 | Copyright (c) 2013, Michael Bostock 2 | All rights reserved. 3 | 4 | Redistribution and use in source and binary forms, with or without 5 | modification, are permitted provided that the following conditions are met: 6 | 7 | * Redistributions of source code must retain the above copyright notice, this 8 | list of conditions and the following disclaimer. 9 | 10 | * Redistributions in binary form must reproduce the above copyright notice, 11 | this list of conditions and the following disclaimer in the documentation 12 | and/or other materials provided with the distribution. 13 | 14 | * The name Michael Bostock may not be used to endorse or promote products 15 | derived from this software without specific prior written permission. 16 | 17 | THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS" 18 | AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE 19 | IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE 20 | DISCLAIMED. IN NO EVENT SHALL MICHAEL BOSTOCK BE LIABLE FOR ANY DIRECT, 21 | INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, 22 | BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, 23 | DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY 24 | OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING 25 | NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, 26 | EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. 
-------------------------------------------------------------------------------- /extension/scripts/Selector/SelectorGroup.js: -------------------------------------------------------------------------------- 1 | var SelectorGroup = { 2 | 3 | canReturnMultipleRecords: function () { 4 | return false; 5 | }, 6 | 7 | canHaveChildSelectors: function () { 8 | return false; 9 | }, 10 | 11 | canHaveLocalChildSelectors: function () { 12 | return false; 13 | }, 14 | 15 | canCreateNewJobs: function () { 16 | return false; 17 | }, 18 | willReturnElements: function () { 19 | return false; 20 | }, 21 | _getData: function (parentElement) { 22 | 23 | var dfd = $.Deferred(); 24 | 25 | // cannot reuse this.getDataElements because it depends on *multiple* property 26 | var elements = $(this.selector, parentElement); 27 | 28 | var records = []; 29 | $(elements).each(function (k, element) { 30 | var data = {}; 31 | 32 | data[this.id] = $(element).text(); 33 | 34 | if (this.extractAttribute) { 35 | data[this.id + '-' + this.extractAttribute] = $(element).attr(this.extractAttribute); 36 | } 37 | 38 | if (this.extractStyle) { 39 | data[this.id + '-' + this.extractStyle] = $(element).css(this.extractStyle); 40 | } 41 | 42 | records.push(data); 43 | }.bind(this)); 44 | 45 | var result = {}; 46 | result[this.id] = records; 47 | 48 | dfd.resolve([result]); 49 | return dfd.promise(); 50 | }, 51 | 52 | getDataColumns: function () { 53 | return [this.id]; 54 | }, 55 | 56 | getFeatures: function () { 57 | return ['delay', 'extractAttribute', 'textmanipulation', 'extractStyle'] 58 | } 59 | }; -------------------------------------------------------------------------------- /extension/assets/LICENSE-icanhaz-js: -------------------------------------------------------------------------------- 1 | ICanHaz.js is Copyright (c) 2010 Henrik Joreteg and is MIT licensed. 
2 | 3 | In my best attempt to comply with instructions I'm including the following license notice from Mustache and Mustache.js: 4 | --------------------------------------------------------------------- 5 | Copyright (c) 2009 Chris Wanstrath (Ruby) 6 | Copyright (c) 2010 Jan Lehnardt (JavaScript) 7 | 8 | Permission is hereby granted, free of charge, to any person obtaining 9 | a copy of this software and associated documentation files (the 10 | "Software"), to deal in the Software without restriction, including 11 | without limitation the rights to use, copy, modify, merge, publish, 12 | distribute, sublicense, and/or sell copies of the Software, and to 13 | permit persons to whom the Software is furnished to do so, subject to 14 | the following conditions: 15 | 16 | The above copyright notice and this permission notice shall be 17 | included in all copies or substantial portions of the Software. 18 | 19 | THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, 20 | EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF 21 | MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND 22 | NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE 23 | LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION 24 | OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION 25 | WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE. 26 | --------------------------------------------------------------------- -------------------------------------------------------------------------------- /tests/spec/SelectorSpec.js: -------------------------------------------------------------------------------- 1 | describe("Selector", function () { 2 | var $el; 3 | 4 | beforeEach(function () { 5 | 6 | $el = jQuery("#tests").html(""); 7 | if($el.length === 0) { 8 | $el = $("").appendTo("body"); 9 | } 10 | }); 11 | 12 | it("should be able to select elements", function () { 13 | 14 | $el.html("
"); 15 | 16 | var selector = new Selector({ 17 | selector: "a", 18 | type: 'SelectorLink' 19 | }); 20 | var elements = selector.getDataElements($el); 21 | 22 | expect(elements).toEqual($el.find("a").get()); 23 | }); 24 | 25 | it("should be able to select parent", function () { 26 | 27 | $el.html(""); 28 | 29 | var selector = new Selector({ 30 | selector: "_parent_", 31 | type: 'SelectorLink' 32 | }); 33 | var elements = selector.getDataElements($el); 34 | 35 | expect(elements).toEqual($el.get()); 36 | }); 37 | 38 | it("should be able to select elements with delay", function() { 39 | 40 | var selector = new Selector({ 41 | id: 'a', 42 | selector: "a", 43 | type: 'SelectorText', 44 | delay:100 45 | }); 46 | var dataDeferred = selector.getData($el); 47 | 48 | // add data after data extraction called 49 | $el.html("a"); 50 | 51 | waitsFor(function() { 52 | return dataDeferred.state() === 'resolved'; 53 | }, "wait for data extraction", 5000); 54 | 55 | runs(function () { 56 | dataDeferred.done(function(data) { 57 | expect(data).toEqual([ 58 | { 59 | 'a': "a" 60 | } 61 | ]); 62 | }); 63 | }); 64 | }); 65 | }); -------------------------------------------------------------------------------- /extension/devtools/devtools_scraper_panel.css: -------------------------------------------------------------------------------- 1 | /*body > form, body > div {*/ 2 | /*display:none;*/ 3 | /*}*/ 4 | 5 | a, tbody tr { 6 | cursor: pointer; 7 | } 8 | 9 | 10 | .selector-list-tpl, .sitemap-list-tpl { 11 | display:none 12 | } 13 | 14 | /** 15 | * Compact elements 16 | */ 17 | .navbar-nav>li>a { 18 | padding-top: 3px; 19 | padding-bottom: 3px; 20 | } 21 | 22 | .navbar-text { 23 | margin-top:4px; 24 | margin-bottom:4px; 25 | padding-right:3px; 26 | } 27 | 28 | .navbar { 29 | min-height:26px; 30 | margin-bottom: 6px; 31 | } 32 | .table-condensed tbody>tr>td { 33 | padding:1px 5px; 34 | } 35 | 36 | body { 37 | font-size: 12px; 38 | } 39 | 40 | form .form-control { 41 | font-size: 12px; 42 | 
padding: 3px 12px; 43 | height: 25px; 44 | } 45 | 46 | textarea.form-control { 47 | height: auto; 48 | } 49 | 50 | form .btn { 51 | font-size: 12px; 52 | padding: 3px 12px; 53 | } 54 | 55 | form .form-group { 56 | margin-bottom:5px; 57 | } 58 | 59 | form select[multiple], select[size] { 60 | height: auto; 61 | } 62 | 63 | #selector-graph .node circle { 64 | cursor: pointer; 65 | fill: #fff; 66 | stroke: steelblue; 67 | stroke-width: 1px; 68 | } 69 | 70 | #selector-graph .node text { 71 | font-size: 11px; 72 | } 73 | 74 | #selector-graph path.link { 75 | fill: none; 76 | stroke: #ccc; 77 | stroke-width: 1px; 78 | } 79 | 80 | .data-preview-modal .modal-dialog { 81 | width:auto; 82 | } 83 | 84 | .data-preview-modal .modal-body { 85 | overflow-y:scroll; 86 | } 87 | 88 | .data-preview-modal tbody tr { 89 | cursor: initial; 90 | } -------------------------------------------------------------------------------- /extension/scripts/ElementQuery.js: -------------------------------------------------------------------------------- 1 | /** 2 | * Element selector. 
Uses jQuery as base and adds some more features 3 | * @param parentElement 4 | * @param selector 5 | */ 6 | ElementQuery = function(CSSSelector, parentElement) { 7 | 8 | CSSSelector = CSSSelector || ""; 9 | 10 | var selectedElements = []; 11 | 12 | var addElement = function(element) { 13 | if(selectedElements.indexOf(element) === -1) { 14 | selectedElements.push(element); 15 | } 16 | }; 17 | 18 | var selectorParts = ElementQuery.getSelectorParts(CSSSelector); 19 | selectorParts.forEach(function(selector) { 20 | 21 | // handle special case when parent is selected 22 | if(selector === "_parent_") { 23 | $(parentElement).each(function(i, element){ 24 | addElement(element); 25 | }); 26 | } 27 | else { 28 | var elements = $(selector, parentElement); 29 | elements.each(function(i, element) { 30 | addElement(element); 31 | }); 32 | } 33 | }); 34 | 35 | return selectedElements; 36 | }; 37 | 38 | ElementQuery.getSelectorParts = function(CSSSelector) { 39 | 40 | var selectors = CSSSelector.split(/(,|".*?"|'.*?'|\(.*?\))/); 41 | 42 | var resultSelectors = []; 43 | var currentSelector = ""; 44 | selectors.forEach(function(selector) { 45 | if(selector === ',') { 46 | if(currentSelector.trim().length) { 47 | resultSelectors.push(currentSelector.trim()); 48 | } 49 | currentSelector = ""; 50 | } 51 | else { 52 | currentSelector += selector; 53 | } 54 | }); 55 | if(currentSelector.trim().length) { 56 | resultSelectors.push(currentSelector.trim()); 57 | } 58 | 59 | return resultSelectors; 60 | }; 61 | -------------------------------------------------------------------------------- /extension/content_script/content_script.js: -------------------------------------------------------------------------------- 1 | chrome.runtime.onMessage.addListener( 2 | function (request, sender, sendResponse) { 3 | 4 | console.log("chrome.runtime.onMessage", request); 5 | 6 | if (request.extractData) { 7 | console.log("received data extraction request", request); 8 | var extractor = new 
DataExtractor(request); 9 | var deferredData = extractor.getData(); 10 | deferredData.done(function(data){ 11 | console.log("dataextractor data", data); 12 | var selectors = extractor.sitemap.selectors; 13 | sendResponse(data, selectors); 14 | }); 15 | return true; 16 | } 17 | else if(request.previewSelectorData) { 18 | console.log("received data-preview extraction request", request); 19 | var extractor = new DataExtractor(request); 20 | var deferredData = extractor.getSingleSelectorData(request.parentSelectorIds, request.selectorId); 21 | deferredData.done(function(data){ 22 | console.log("dataextractor data", data); 23 | var selectors = extractor.sitemap.selectors; 24 | sendResponse(data, selectors); 25 | }); 26 | return true; 27 | } 28 | // Universal ContentScript communication handler 29 | else if(request.contentScriptCall) { 30 | 31 | var contentScript = getContentScript("ContentScript"); 32 | 33 | console.log("received ContentScript request", request); 34 | 35 | var deferredResponse = contentScript[request.fn](request.request); 36 | deferredResponse.done(function (response) { 37 | sendResponse(response, null); 38 | }); 39 | 40 | return true; 41 | } 42 | } 43 | ); -------------------------------------------------------------------------------- /extension/devtools/views/Viewport.html: -------------------------------------------------------------------------------- 1 | 2 | 3 | 31 | 32 |
33 |
-------------------------------------------------------------------------------- /tests/spec/Selector/SelectorElementSpec.js: -------------------------------------------------------------------------------- 1 | describe("Element Selector", function () { 2 | 3 | beforeEach(function () { 4 | 5 | }); 6 | 7 | it("should return one element", function () { 8 | 9 | var selector = new Selector({ 10 | id: 'a', 11 | type: 'SelectorElement', 12 | multiple: false, 13 | selector: "div" 14 | }); 15 | 16 | var dataDeferred = selector.getData($("#selector-element-nodata")); 17 | 18 | waitsFor(function() { 19 | return dataDeferred.state() === 'resolved'; 20 | }, "wait for data extraction", 5000); 21 | 22 | runs(function () { 23 | dataDeferred.done(function(data) { 24 | expect(data).toEqual([$("#selector-element-nodata div")[0]]); 25 | }); 26 | }); 27 | }); 28 | 29 | it("should return multiple elements", function () { 30 | 31 | var selector = new Selector({ 32 | id: 'a', 33 | type: 'SelectorElement', 34 | multiple: true, 35 | selector: "div" 36 | }); 37 | 38 | var dataDeferred = selector.getData($("#selector-element-nodata")); 39 | 40 | waitsFor(function() { 41 | return dataDeferred.state() === 'resolved'; 42 | }, "wait for data extraction", 5000); 43 | 44 | runs(function () { 45 | dataDeferred.done(function(data) { 46 | expect(data).toEqual([$("#selector-element-nodata div")[0], $("#selector-element-nodata div")[1]]); 47 | }); 48 | }); 49 | }); 50 | 51 | it("should return no data columns", function () { 52 | var selector = new Selector({ 53 | id: 'a', 54 | type: 'SelectorElement', 55 | multiple: true, 56 | selector: "div" 57 | }); 58 | 59 | var columns = selector.getDataColumns(); 60 | expect(columns).toEqual([]); 61 | }); 62 | }); -------------------------------------------------------------------------------- /extension/devtools/views/SitemapScrapeConfig.html: -------------------------------------------------------------------------------- 1 |
2 |
3 |
4 | 5 |
6 | 7 |
8 |
9 |
10 | 11 |
12 | 13 |
14 |
15 |
16 | 17 |
18 | 19 |
20 |
21 | 25 | 26 |
27 |
28 | 29 |
30 |
31 |
32 |
33 | -------------------------------------------------------------------------------- /extension/scripts/DateUtils/DatePatternSupport.js: -------------------------------------------------------------------------------- 1 | /* 2 | * Support for "[date<01.01.2016>]" pattern 3 | * 4 | * @author © Denis Bakhtenkov denis.bakhtenkov@gmail.com 5 | * @version 2016 6 | */ 7 | 8 | /* global DateRoller */ 9 | 10 | var DatePatternSupport = { 11 | /** 12 | * 13 | * @param {String} startUrl 14 | * @returns {Array} 15 | */ 16 | expandUrl: function (startUrl) { 17 | 18 | function nowSupport(d) { 19 | switch (d) { 20 | case "now": 21 | return df.format(new Date()); 22 | case "yesterday": 23 | var date = new Date(); 24 | date.setDate(date.getDate() - 1); 25 | return df.format(new Date(date)); 26 | case "tomorrow": 27 | var date = new Date(); 28 | date.setDate(date.getDate() + 1); 29 | return df.format(new Date(date)); 30 | default: 31 | return d; 32 | } 33 | } 34 | 35 | var startUrls = startUrl; 36 | // single start url 37 | if (startUrl.push === undefined) { 38 | startUrls = [startUrls]; 39 | } 40 | 41 | var df; 42 | var urls = []; 43 | startUrls.forEach(function (startUrl) { 44 | var re = /^(.*?)\[date<(.*)><(.*)><(.*)>\](.*)$/; 45 | var matches = startUrl.match(re); 46 | if (matches) { 47 | df = new SimpleDateFormatter(matches[2]); 48 | var startDate = df.parse(nowSupport(matches[3])); 49 | var endDate = df.parse(nowSupport(matches[4])); 50 | 51 | var roller = DateRoller.days(startDate, endDate); 52 | roller.forEach(function (date) { 53 | urls.push(matches[1] + df.format(date) + matches[5]); 54 | }); 55 | 56 | } else { 57 | urls.push(startUrl); 58 | } 59 | }); 60 | 61 | return urls; 62 | } 63 | 64 | }; -------------------------------------------------------------------------------- /tests/spec/ChromePopupBrowserSpec.js: -------------------------------------------------------------------------------- 1 | describe("Chrome popup browser", function () { 2 | 3 | beforeEach(function () 
{ 4 | window.chromeAPI.reset(); 5 | }); 6 | 7 | it("should init a popup window", function () { 8 | 9 | var browser = new ChromePopupBrowser({ 10 | pageLoadDelay: 500 11 | }); 12 | browser._initPopupWindow(function () { 13 | }); 14 | expect(browser.tab).toEqual({id: 0}); 15 | 16 | }); 17 | 18 | it("should load a page", function () { 19 | 20 | var browser = new ChromePopupBrowser({ 21 | pageLoadDelay: 500 22 | }); 23 | browser._initPopupWindow(function () { 24 | }); 25 | var tabLoadSuccess = false; 26 | browser.loadUrl("http://example,com/", function () { 27 | tabLoadSuccess = true; 28 | }); 29 | waitsFor(function () { 30 | return tabLoadSuccess; 31 | }, 1000); 32 | 33 | runs(function () { 34 | expect(tabLoadSuccess).toEqual(true); 35 | }); 36 | }); 37 | 38 | it("should sendMessage to popup contentscript when data extraction is needed", function () { 39 | 40 | var sitemap = new Sitemap({ 41 | selectors: [ 42 | { 43 | id: 'a', 44 | selector: '#browserTest', 45 | type: 'SelectorText', 46 | multiple: false, 47 | parentSelectors: ['_root'] 48 | } 49 | ] 50 | }); 51 | 52 | var browser = new ChromePopupBrowser({ 53 | pageLoadDelay: 500 54 | }); 55 | browser._initPopupWindow(function () { 56 | }); 57 | var fetched = false; 58 | var dataFetched = {}; 59 | browser.fetchData("http://example,com/", sitemap, '_root', function (data) { 60 | fetched = true; 61 | dataFetched = data; 62 | }); 63 | 64 | waitsFor(function (data) { 65 | return fetched; 66 | }, 1000); 67 | 68 | runs(function () { 69 | expect(fetched).toEqual(true); 70 | expect(dataFetched).toEqual([ 71 | { 72 | 'a': 'a' 73 | } 74 | ]); 75 | }); 76 | 77 | }); 78 | }); -------------------------------------------------------------------------------- /extension/options_page/options_page.js: -------------------------------------------------------------------------------- 1 | $(function () { 2 | 3 | // popups for Storage setting input fields 4 | $("#sitemapDb") 5 | .popover({ 6 | title: 'Database for sitemap storage', 7 | 
html: true, 8 | content: "CouchDB database url
http://example.com/scraper-sitemaps/", 9 | placement: 'bottom' 10 | }) 11 | .blur(function () { 12 | $(this).popover('hide'); 13 | }); 14 | 15 | $("#dataDb") 16 | .popover({ 17 | title: 'Database for scraped data', 18 | html: true, 19 | content: "CouchDB database url. For each sitemap a new DB will be created.
http://example.com/", 20 | placement: 'bottom' 21 | }) 22 | .blur(function () { 23 | $(this).popover('hide'); 24 | }); 25 | 26 | // switch between configuration types 27 | $("select[name=storageType]").change(function () { 28 | var type = $(this).val(); 29 | 30 | if (type === 'couchdb') { 31 | $(".form-group.couchdb").show(); 32 | } 33 | else { 34 | $(".form-group.couchdb").hide(); 35 | } 36 | }); 37 | 38 | // Extension configuration 39 | var config = new Config(); 40 | 41 | // load previously synced data 42 | config.loadConfiguration(function () { 43 | 44 | $("#storageType").val(config.storageType); 45 | $("#sitemapDb").val(config.sitemapDb); 46 | $("#dataDb").val(config.dataDb); 47 | 48 | $("select[name=storageType]").change(); 49 | }); 50 | 51 | // Sync storage settings 52 | $("form#storage_configuration").submit(function () { 53 | 54 | var sitemapDb = $("#sitemapDb").val(); 55 | var dataDb = $("#dataDb").val(); 56 | var storageType = $("#storageType").val(); 57 | 58 | var newConfig; 59 | 60 | if (storageType === 'local') { 61 | newConfig = { 62 | storageType: storageType, 63 | sitemapDb: ' ', 64 | dataDb: ' ' 65 | } 66 | } 67 | else { 68 | newConfig = { 69 | storageType: storageType, 70 | sitemapDb: sitemapDb, 71 | dataDb: dataDb 72 | } 73 | } 74 | 75 | config.updateConfiguration(newConfig); 76 | return false; 77 | }); 78 | }); -------------------------------------------------------------------------------- /extension/scripts/StoreDevtools.js: -------------------------------------------------------------------------------- 1 | /** 2 | * From devtools panel there is no possibility to execute XHR requests. So all requests to a remote CouchDb must be 3 | * handled through Background page. 
StoreDevtools is simply a proxy store 4 | * @constructor 5 | */ 6 | var StoreDevtools = function () { 7 | 8 | }; 9 | 10 | StoreDevtools.prototype = { 11 | createSitemap: function (sitemap, callback) { 12 | 13 | var request = { 14 | createSitemap: true, 15 | sitemap: JSON.parse(JSON.stringify(sitemap)) 16 | }; 17 | 18 | chrome.runtime.sendMessage(request, function (callbackFn, originalSitemap, newSitemap) { 19 | originalSitemap._rev = newSitemap._rev; 20 | callbackFn(originalSitemap); 21 | }.bind(this, callback, sitemap)); 22 | }, 23 | saveSitemap: function (sitemap, callback) { 24 | this.createSitemap(sitemap, callback); 25 | }, 26 | deleteSitemap: function (sitemap, callback) { 27 | 28 | var request = { 29 | deleteSitemap: true, 30 | sitemap: JSON.parse(JSON.stringify(sitemap)) 31 | }; 32 | chrome.runtime.sendMessage(request, function (response) { 33 | callback(); 34 | }); 35 | }, 36 | getAllSitemaps: function (callback) { 37 | 38 | var request = { 39 | getAllSitemaps: true 40 | }; 41 | 42 | chrome.runtime.sendMessage(request, function (response) { 43 | 44 | var sitemaps = []; 45 | 46 | for (var i in response) { 47 | sitemaps.push(new Sitemap(response[i])); 48 | } 49 | callback(sitemaps); 50 | }); 51 | }, 52 | getSitemapData: function (sitemap, callback) { 53 | var request = { 54 | getSitemapData: true, 55 | sitemap: JSON.parse(JSON.stringify(sitemap)) 56 | }; 57 | 58 | chrome.runtime.sendMessage(request, function (response) { 59 | callback(response); 60 | }); 61 | }, 62 | sitemapExists: function (sitemapId, callback) { 63 | 64 | var request = { 65 | sitemapExists: true, 66 | sitemapId: sitemapId 67 | }; 68 | 69 | chrome.runtime.sendMessage(request, function (response) { 70 | callback(response); 71 | }); 72 | } 73 | }; -------------------------------------------------------------------------------- /extension/scripts/Selector/SelectorElementScroll.js: -------------------------------------------------------------------------------- 1 | var SelectorElementScroll = 
{ 2 | 3 | canReturnMultipleRecords: function () { 4 | return true; 5 | }, 6 | 7 | canHaveChildSelectors: function () { 8 | return true; 9 | }, 10 | 11 | canHaveLocalChildSelectors: function () { 12 | return true; 13 | }, 14 | 15 | canCreateNewJobs: function () { 16 | return false; 17 | }, 18 | willReturnElements: function () { 19 | return true; 20 | }, 21 | scrollToBottom: function() { 22 | window.scrollTo(0,document.body.scrollHeight); 23 | }, 24 | _getData: function (parentElement) { 25 | 26 | var paginationLimit = parseInt(this.paginationLimit); 27 | var paginationCount = 1; 28 | var delay = parseInt(this.delay) || 0; 29 | var deferredResponse = $.Deferred(); 30 | var foundElements = []; 31 | 32 | // initially scroll down and wait 33 | this.scrollToBottom(); 34 | var nextElementSelection = (new Date()).getTime()+delay; 35 | 36 | // infinitely scroll down and find all items 37 | var interval = setInterval(function() { 38 | 39 | var now = (new Date()).getTime(); 40 | // sleep. wait when to extract next elements 41 | if(now < nextElementSelection) { 42 | return; 43 | } 44 | 45 | var elements = this.getDataElements(parentElement); 46 | // no new elements found or pagination limit 47 | if(elements.length === foundElements.length || paginationCount >= paginationLimit) { 48 | clearInterval(interval); 49 | deferredResponse.resolve(jQuery.makeArray(elements)); 50 | } 51 | else { 52 | paginationCount++; 53 | // continue scrolling and add delay 54 | foundElements = elements; 55 | this.scrollToBottom(); 56 | nextElementSelection = now+delay; 57 | } 58 | 59 | }.bind(this), 50); 60 | 61 | return deferredResponse.promise(); 62 | }, 63 | 64 | getDataColumns: function () { 65 | return []; 66 | }, 67 | 68 | getFeatures: function () { 69 | return ['multiple', 'delay', 'paginationLimit']; 70 | } 71 | }; 72 | -------------------------------------------------------------------------------- /tests/spec/Selector/SelectorGroupSpec.js: 
-------------------------------------------------------------------------------- 1 | describe("Group Selector", function () { 2 | 3 | beforeEach(function () { 4 | 5 | }); 6 | 7 | it("should extract text data", function () { 8 | 9 | var selector = new Selector({ 10 | id: 'a', 11 | type: 'SelectorGroup', 12 | multiple: false, 13 | selector: "div", 14 | textmanipulation: {} 15 | }); 16 | 17 | var dataDeferred = selector.getData($("#selector-group-text")); 18 | 19 | waitsFor(function() { 20 | return dataDeferred.state() === 'resolved'; 21 | }, "wait for data extraction", 5000); 22 | 23 | // extract as JSON.stringify since we allow to use regex to modify the content in the GUI 24 | runs(function () { 25 | dataDeferred.done(function(data) { 26 | expect(data).toEqual([ 27 | { 28 | a: '[{"a":"a"},{"a":"b"}]' 29 | } 30 | ]); 31 | }); 32 | }); 33 | }); 34 | 35 | it("should extract link urls", function () { 36 | 37 | var selector = new Selector({ 38 | id: 'a', 39 | type: 'SelectorGroup', 40 | multiple: false, 41 | selector: "a", 42 | extractAttribute: 'href', 43 | textmanipulation: {} 44 | }); 45 | 46 | var dataDeferred = selector.getData($("#selector-group-url")); 47 | 48 | waitsFor(function() { 49 | return dataDeferred.state() === 'resolved'; 50 | }, "wait for data extraction", 5000); 51 | 52 | // extract as JSON.stringify since we allow to use regex to modify the content in the GUI 53 | runs(function () { 54 | dataDeferred.done(function(data) { 55 | expect(data).toEqual([ 56 | { 57 | a: '[{"a":"a","a-href":"http://aa/"},{"a":"b","a-href":"http://bb/"}]' 58 | } 59 | ]); 60 | }); 61 | }); 62 | }); 63 | 64 | it("should return only one data column", function () { 65 | var selector = new Selector({ 66 | id: 'id', 67 | type: 'SelectorGroup', 68 | multiple: true, 69 | selector: "div" 70 | }); 71 | 72 | var columns = selector.getDataColumns(); 73 | expect(columns).toEqual(['id']); 74 | }); 75 | }); -------------------------------------------------------------------------------- 
/extension/options_page/options.html: -------------------------------------------------------------------------------- 1 | 2 | 3 | 4 | Web Scraper 5 | 6 | 7 | 8 | 9 | 10 | 11 | 12 | 13 |
14 |

Web Scraper

15 |

Options page

16 |
17 | 18 | 19 |
20 |
21 | Storage settings 22 |
23 | 24 |
25 | 29 |
30 |
31 | 32 |
33 | 34 | 35 |
36 | 37 |
38 |
39 | 40 |
41 | 42 | 43 |
44 | 45 |
46 |
47 |
48 |
49 | 50 |
51 |
52 |
53 |
54 | 55 |
56 | 57 | 58 | 59 | -------------------------------------------------------------------------------- /docs/Selectors/Element selector.md: -------------------------------------------------------------------------------- 1 | # Element selector 2 | 3 | The Element selector selects elements that contain multiple data elements. 4 | For example, an element selector might be used to select a list of items on an 5 | e-commerce site. The selector returns each selected element as a parent 6 | element to its child selectors. The element selector's child selectors 7 | extract data only within the element that the element selector gave them. 8 | 9 | Note! If the page dynamically loads new items after scrolling down or clicking 10 | on a button, try these selectors instead: 11 | 12 | * [Element scroll down selector] [element-scroll-selector] 13 | * [Element click selector] [element-click-selector] 14 | 15 | ## Configuration options 16 | * selector - [CSS selector] [css-selector] for the wrapper elements that will 17 | be used as parent elements for child selectors. 18 | * multiple - multiple records are being extracted (should almost always be 19 | checked). The multiple option for child selectors usually should not be checked. 20 | * delay - delay the extraction 21 | 22 | ## Use cases 23 | 24 | #### Select multiple e-commerce items from a page 25 | 26 | For example, an e-commerce site has a page with a list of items. With an element 27 | selector you can select the elements that wrap these items and then add 28 | multiple child selectors to it to extract data within each item's wrapper 29 | element. Figure 1 shows how an element selector could be used in this 30 | situation. 31 | 32 | ![Fig. 1: Multiple items selected with element selector] [multiple-elements-with-text-selectors] 33 | 34 | #### Extract data from tables 35 | 36 | Similarly to e-commerce item selection, you can also select table rows and add 37 | child selectors for data extraction from table cells. 
38 | Though the [Table selector] [table-selector] might be a much better solution. 39 | 40 | [css-selector]: ../CSS%20selector.md 41 | [element-scroll-selector]: Element%20scroll%20down%20selector.md 42 | [element-click-selector]: Element%20click%20selector.md 43 | [table-selector]: Table%20selector.md 44 | [multiple-elements-with-text-selectors]: ../images/selectors/text/text-selector-multiple-elements-with-text-selectors.png?raw=true -------------------------------------------------------------------------------- /tests/spec/ElementQuerySpec.js: -------------------------------------------------------------------------------- 1 | describe("ElementQuery", function () { 2 | 3 | var $el; 4 | 5 | beforeEach(function () { 6 | 7 | $el = jQuery("#tests").html(""); 8 | if ($el.length === 0) { 9 | $el = $("").appendTo("body"); 10 | } 11 | }); 12 | 13 | it("should be able to select elements", function () { 14 | 15 | $el.append(''); 16 | 17 | var selectedElements = ElementQuery("a, span", $el); 18 | var expectedElements = $("a, span", $el); 19 | 20 | expect(selectedElements.sort()).toEqual(expectedElements.get().sort()); 21 | }); 22 | 23 | it("should be able to select parent", function () { 24 | 25 | $el.append(''); 26 | 27 | var selectedElements = ElementQuery("a, span, _parent_", $el); 28 | var expectedElements = $("a, span", $el); 29 | expectedElements = expectedElements.add($el); 30 | 31 | expect(selectedElements.sort()).toEqual(expectedElements.get().sort()); 32 | }); 33 | 34 | it("should not return duplicates", function () { 35 | 36 | $el.append(''); 37 | 38 | var selectedElements = ElementQuery("*, a, span, _parent_", $el); 39 | var expectedElements = $("a, span", $el); 40 | expectedElements = expectedElements.add($el); 41 | 42 | expect(selectedElements.length).toEqual(3); 43 | expect(selectedElements.sort()).toEqual(expectedElements.get().sort()); 44 | }); 45 | 46 | it("should be able to select parent when there are multiple parents", function(){ 47 | 48 | 
$el.append(''); 49 | 50 | var selectedElements = ElementQuery("_parent_", $("span", $el)); 51 | var expectedElements = $("span", $el); 52 | 53 | expect(selectedElements.length).toEqual(2); 54 | expect(selectedElements.sort()).toEqual(expectedElements.get().sort()); 55 | }); 56 | 57 | it("should be able to select element with a comma ,", function(){ 58 | 59 | $el.append(','); 60 | 61 | var selectedElements = ElementQuery(":contains(',')", $el); 62 | var expectedElements = $("span", $el); 63 | 64 | expect(selectedElements.length).toEqual(1); 65 | expect(selectedElements.sort()).toEqual(expectedElements.get().sort()); 66 | }); 67 | 68 | it("should preserve spaces", function(){ 69 | 70 | var parts = ElementQuery.getSelectorParts('div.well li:nth-of-type(2) a'); 71 | expect(parts).toEqual(['div.well li:nth-of-type(2) a']); 72 | }); 73 | }); -------------------------------------------------------------------------------- /tests/spec/UniqueElementListSpec.js: -------------------------------------------------------------------------------- 1 | describe("UniqueElementList", function () { 2 | var $el; 3 | 4 | beforeEach(function () { 5 | 6 | $el = jQuery("#tests").html(""); 7 | if($el.length === 0) { 8 | $el = $("").appendTo("body"); 9 | } 10 | }); 11 | 12 | it("it should add only unique elements", function () { 13 | 14 | $el.html("12"); 15 | 16 | var list = new UniqueElementList('uniqueText'); 17 | expect(list.length).toEqual(0); 18 | 19 | var $a = $el.find("a"); 20 | list.push($a[0]); 21 | expect(list.length).toEqual(1); 22 | list.push($a[0]); 23 | expect(list.length).toEqual(1); 24 | list.push($a[1]); 25 | expect(list.length).toEqual(2); 26 | list.push($a[1]); 27 | expect(list.length).toEqual(2); 28 | }); 29 | 30 | it("it should add only unique elements when using uniqueHTMLText type", function () { 31 | 32 | $el.html("aa"); 33 | 34 | var list = new UniqueElementList('uniqueHTMLText'); 35 | expect(list.length).toEqual(0); 36 | 37 | var $a = $el.find("a"); 38 | 
list.push($a[0]); 39 | expect(list.length).toEqual(1); 40 | list.push($a[0]); 41 | expect(list.length).toEqual(1); 42 | list.push($a[1]); 43 | expect(list.length).toEqual(2); 44 | list.push($a[1]); 45 | expect(list.length).toEqual(2); 46 | }); 47 | 48 | it("it should add only unique elements when using uniqueHTML type", function () { 49 | 50 | $el.html("aaabcc"); 51 | 52 | var list = new UniqueElementList('uniqueHTML'); 53 | expect(list.length).toEqual(0); 54 | 55 | var $a = $el.find("a"); 56 | list.push($a[0]); 57 | expect(list.length).toEqual(1); 58 | list.push($a[0]); 59 | expect(list.length).toEqual(1); 60 | list.push($a[1]); 61 | expect(list.length).toEqual(2); 62 | list.push($a[1]); 63 | expect(list.length).toEqual(2); 64 | list.push($a[2]); 65 | expect(list.length).toEqual(2); 66 | }); 67 | 68 | it("it should add only unique elements when using uniqueCSSSelector type", function () { 69 | 70 | $el.html(""); 71 | 72 | var list = new UniqueElementList('uniqueCSSSelector'); 73 | expect(list.length).toEqual(0); 74 | 75 | var $a = $el.find("a"); 76 | list.push($a[0]); 77 | expect(list.length).toEqual(1); 78 | list.push($a[0]); 79 | expect(list.length).toEqual(1); 80 | list.push($a[1]); 81 | expect(list.length).toEqual(2); 82 | list.push($a[1]); 83 | expect(list.length).toEqual(2); 84 | }); 85 | }); -------------------------------------------------------------------------------- /docs/Development.md: -------------------------------------------------------------------------------- 1 | # Development Instructions 2 | 3 | ## Selector Development 4 | 5 | This section walks through the steps needed to create or extend a selector for the web scraper. In this example we create a "Select All" selector. 6 | 7 | ### Create Selector Logic 8 | You can skip the file creation steps if you only intend to extend existing selectors with new functionality. 
9 | 10 | - Duplicate the file `SelectorElementStyle.js` in `scripts/Selector/` 11 | - Rename the duplicated file to `SelectorAll.js` 12 | - Modify the `_getData` method to return all content 13 | - Specify which features you would like to have enabled in the `getFeatures` function 14 | - Implement the logic for the enabled features (the `textmanipulation` feature will work out of the box) 15 | 16 | ### Create Selector Controls 17 | 18 | - Add a section into the `SelectorEdit.html` file in `devtools/views/` 19 | - Add section class `form-group feature feature-AllSelector` 20 | - You can use `{{#selectorName}}` and `{{/selectorName}}` to prevent content from displaying (used for checkbox controls) 21 | - Use `{{selector.selectorAll}}` to define a variable 22 | 23 | 24 | ### Set references to your selector 25 | 26 | #### Controler 27 | 28 | - Open the `Controler.js` in `scripts/` 29 | - Add a variable in the function `getCurrentlyEditedSelector` to select your HTML section value 30 | - Add the variable to the `newSelector` object (every selector in `scripts/Selector/` that references this feature can access the value) 31 | - Add validation rules to your variable in the function `initSelectorValidation` 32 | 33 | 34 | #### File reference 35 | 36 | - Add a reference in `extension/manifest.json` in the sections `content_scripts` and `scripts` 37 | - Add a reference to `extension/devtools/devtools_scraper_panel.html` 38 | - Add a reference to `playgrounds/extension/index.html` 39 | - Add a reference to `tests/SpecRunner.html` 40 | 41 | 42 | ### Testing 43 | 44 | For testing you need to run a web server. Personally I use [Web Server for Chrome](https://chrome.google.com/webstore/detail/web-server-for-chrome/ofhbbkphhbklhfoeikjpcbhemlocgigb) and point it at the working directory of the project. 
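Before writing a spec it can help to pin down the shape of the selector object under test. The following is only a minimal sketch of what the "Select All" selector from the steps above might look like. It mirrors the method names used by the existing selector objects (for example `SelectorGroup`), but `SelectorAll` itself, the array-of-elements argument to `_getData`, and the use of a native `Promise` in place of `$.Deferred` are assumptions made so that the sketch runs without jQuery or a browser.

```javascript
// Hypothetical "Select All" selector. The method names follow the existing
// selector objects in scripts/Selector/; the body is illustrative only.
var SelectorAll = {
    canReturnMultipleRecords: function () { return false; },
    canHaveChildSelectors: function () { return false; },
    canHaveLocalChildSelectors: function () { return false; },
    canCreateNewJobs: function () { return false; },
    willReturnElements: function () { return false; },

    // Assumption: instead of querying the DOM with jQuery, this sketch
    // receives an array of objects that expose `textContent`.
    _getData: function (elements) {
        var data = {};
        data[this.id] = elements.map(function (element) {
            return element.textContent;
        }).join("\n");
        // Assumption: a native Promise stands in for $.Deferred here.
        return Promise.resolve([data]);
    },

    getDataColumns: function () { return [this.id]; },

    getFeatures: function () { return ['delay', 'textmanipulation']; }
};

// Usage: create an instance with an id and extract from two fake elements.
var selector = Object.create(SelectorAll);
selector.id = 'all';
selector._getData([{textContent: 'a'}, {textContent: 'b'}]).then(function (records) {
    console.log(records);
});
```

In the extension itself the selector would keep the `_getData(parentElement)` signature and resolve a jQuery deferred, as the other selectors do; the simplifications here only keep the example self-contained.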
45 | 46 | - Duplicate a test file in `tests/spec/Selector` and rename it 47 | - Write your tests for your selector 48 | - Run the tests by opening `tests/SpecRunner.html` 49 | - Try your implementation by opening `playgrounds/extension/index.html` 50 | - Extend the playground if it does not cover your scenario 51 | 52 | ### Documentation 53 | 54 | - Create an `.md` file in `docs/Selectors` 55 | - Describe the usage, options, etc. 56 | 57 | -------------------------------------------------------------------------------- /extension/scripts/Selector/SelectorLink.js: -------------------------------------------------------------------------------- 1 | var SelectorLink = { 2 | canReturnMultipleRecords: function () { 3 | return true; 4 | }, 5 | 6 | canHaveChildSelectors: function () { 7 | return true; 8 | }, 9 | 10 | canHaveLocalChildSelectors: function () { 11 | return false; 12 | }, 13 | 14 | canCreateNewJobs: function () { 15 | return true; 16 | }, 17 | willReturnElements: function () { 18 | return false; 19 | }, 20 | _getData: function (parentElement) { 21 | var elements = this.getDataElements(parentElement); 22 | 23 | var dfd = $.Deferred(); 24 | 25 | // return empty record if not multiple type and no elements found 26 | if (this.multiple === false && elements.length === 0) { 27 | var data = {}; 28 | data[this.id] = null; 29 | dfd.resolve([data]); 30 | return dfd; 31 | } 32 | 33 | // extract links one by one 34 | var deferredDataExtractionCalls = []; 35 | $(elements).each(function (k, element) { 36 | 37 | deferredDataExtractionCalls.push(function(element) { 38 | 39 | var href = element.href; 40 | if (this.stringReplacement && this.stringReplacement.replaceString) { 41 | var replace; 42 | var replacement = this.stringReplacement.replacementString || ""; 43 | try { 44 | var regex = new RegExp(this.stringReplacement.replaceString, 'gm'); 45 | replace = regex.test(href) ? 
regex : this.stringReplacement.replaceString; 46 | } catch (e) { replace = this.stringReplacement.replaceString; } 47 | 48 | href = href.replace(replace, replacement); 49 | } 50 | 51 | var deferredData = $.Deferred(); 52 | var data = {}; 53 | 54 | data[this.id] = $(element).text(); 55 | data._followSelectorId = this.id; 56 | data[this.id + '-href'] = href; 57 | data._follow = href; 58 | deferredData.resolve(data); 59 | 60 | return deferredData; 61 | }.bind(this, element)); 62 | }.bind(this)); 63 | 64 | $.whenCallSequentially(deferredDataExtractionCalls).done(function(responses) { 65 | var result = []; 66 | responses.forEach(function(dataResult) { 67 | result.push(dataResult); 68 | }); 69 | dfd.resolve(result); 70 | }); 71 | 72 | return dfd.promise(); 73 | }, 74 | 75 | getDataColumns: function () { 76 | return [this.id, this.id + '-href']; 77 | }, 78 | 79 | getFeatures: function () { 80 | return ['multiple', 'delay', 'stringReplacement'] 81 | }, 82 | 83 | getItemCSSSelector: function() { 84 | return "a"; 85 | } 86 | }; -------------------------------------------------------------------------------- /tests/Matchers.js: -------------------------------------------------------------------------------- 1 | var getSelectorIds = function (selectors) { 2 | 3 | var ids = []; 4 | selectors.forEach(function (selector) { 5 | ids.push(selector.id); 6 | }); 7 | return ids; 8 | }; 9 | 10 | var selectorListSorter = function (a, b) { 11 | if (a.id === b.id) { 12 | return 0; 13 | } 14 | else if (a.id > b.id) { 15 | return 1; 16 | } 17 | else { 18 | return -1; 19 | } 20 | }; 21 | 22 | var selectorMatchers = { 23 | matchSelectors: function (expectedIds) { 24 | 25 | expectedIds = expectedIds.sort(); 26 | var actualIds = getSelectorIds(this.actual).sort(); 27 | 28 | expect(actualIds).toEqual(expectedIds); 29 | return true; 30 | }, 31 | matchSelectorList: function (expectedSelectors) { 32 | 33 | var actualSelectors = this.actual 34 | if (expectedSelectors.length !== 
actualSelectors.length) { 35 | return false; 36 | } 37 | expectedSelectors.sort(selectorListSorter); 38 | actualSelectors.sort(selectorListSorter); 39 | 40 | for (var i in expectedSelectors) { 41 | if (expectedSelectors[i].id !== actualSelectors[i].id) { 42 | return false; 43 | } 44 | } 45 | return true; 46 | }, 47 | // @REFACTOR use match selector list 48 | matchSelectorTrees: function (expectedSelectorTrees) { 49 | var actualSelectorTrees = this.actual; 50 | 51 | if (actualSelectorTrees.length !== expectedSelectorTrees.length) { 52 | return false; 53 | } 54 | 55 | for (var i in expectedSelectorTrees) { 56 | expect(actualSelectorTrees[i]).matchSelectors(expectedSelectorTrees[i]); 57 | } 58 | return true; 59 | }, 60 | deferredToEqual: function(expectedData) { 61 | 62 | var deferredData = this.actual; 63 | var data; 64 | 65 | waitsFor(function() { 66 | var state = deferredData.state(); 67 | if(state === "resolved") return true; 68 | if(state === "rejected") { 69 | expect(state).toEqual("resolved"); 70 | return true; 71 | } 72 | 73 | return false; 74 | }, "wait for data extraction", 5000); 75 | 76 | runs(function () { 77 | deferredData.done(function(d) { 78 | data = d; 79 | }); 80 | expect(data).toEqual(expectedData); 81 | }); 82 | return true; 83 | }, 84 | deferredToFail: function() { 85 | 86 | var deferredData = this.actual; 87 | 88 | waitsFor(function() { 89 | var state = deferredData.state(); 90 | if(state === "rejected") return true; 91 | if(state === "resolved") { 92 | expect(state).toEqual("rejected"); 93 | return true; 94 | } 95 | 96 | return false; 97 | }, "wait for data extraction", 5000); 98 | 99 | return true; 100 | } 101 | }; -------------------------------------------------------------------------------- /extension/devtools/devtools_scraper_panel.html: -------------------------------------------------------------------------------- 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 | 12 | 13 | 14 | 15 | 16 | 17 | 18 | 19 | 20 | 21 | 22 | 23 | 24 | 25 | 26 | 27 
| 28 | 29 | 30 | 31 | 32 | 33 | 34 | 35 | 36 | 37 | 38 | 39 | 40 | 41 | 42 | 43 | 44 | 45 | 46 | 47 | -------------------------------------------------------------------------------- /extension/scripts/ChromePopupBrowser.js: -------------------------------------------------------------------------------- 1 | var ChromePopupBrowser = function (options) { 2 | 3 | this.pageLoadDelay = options.pageLoadDelay; 4 | 5 | // @TODO somehow handle the closed window 6 | }; 7 | 8 | ChromePopupBrowser.prototype = { 9 | 10 | _initPopupWindow: function (callback, scope) { 11 | 12 | var browser = this; 13 | if (this.window !== undefined) { 14 | console.log(JSON.stringify(this.window)); 15 | // check if tab exists 16 | chrome.tabs.get(this.tab.id, function (tab) { 17 | if (!tab) { 18 | throw "Scraping window closed"; 19 | } 20 | }); 21 | 22 | 23 | callback.call(scope); 24 | return; 25 | } 26 | 27 | chrome.windows.create({'type': 'popup', width: 1042, height: 768, focused: true, url: 'chrome://newtab'}, function (window) { 28 | browser.window = window; 29 | browser.tab = window.tabs[0]; 30 | 31 | 32 | callback.call(scope); 33 | }); 34 | }, 35 | 36 | loadUrl: function (url, callback) { 37 | 38 | var tab = this.tab; 39 | 40 | var tabLoadListener = function (tabId, changeInfo, tab) { 41 | if(tabId === this.tab.id) { 42 | if (changeInfo.status === 'complete') { 43 | 44 | // @TODO check url ? 
maybe it would be bad because some sites might use redirects 45 | 46 | // remove event listener 47 | chrome.tabs.onUpdated.removeListener(tabLoadListener); 48 | 49 | // callback tab is loaded after page load delay 50 | setTimeout(callback, this.pageLoadDelay); 51 | } 52 | } 53 | }.bind(this); 54 | chrome.tabs.onUpdated.addListener(tabLoadListener); 55 | 56 | chrome.tabs.update(tab.id, {url: url}); 57 | }, 58 | 59 | close: function () { 60 | chrome.windows.remove(this.window.id); 61 | }, 62 | 63 | fetchData: function (url, sitemap, parentSelectorId, callback, scope) { 64 | 65 | var browser = this; 66 | 67 | this._initPopupWindow(function () { 68 | var tab = browser.tab; 69 | 70 | browser.loadUrl(url, function () { 71 | 72 | var message = { 73 | extractData: true, 74 | sitemap: JSON.parse(JSON.stringify(sitemap)), 75 | parentSelectorId: parentSelectorId 76 | }; 77 | 78 | chrome.tabs.sendMessage(tab.id, message, function (data, selectors) { 79 | console.log("extracted data from web page", data); 80 | 81 | if (selectors && scope) { 82 | // table selector can dynamically add columns (addMissingColumns Feature) 83 | scope.scraper.sitemap.selectors = selectors; 84 | } 85 | 86 | callback.call(scope, data); 87 | }); 88 | }.bind(this)); 89 | }, this); 90 | } 91 | }; -------------------------------------------------------------------------------- /extension/scripts/Job.js: -------------------------------------------------------------------------------- 1 | var Job = function (url, parentSelector, scraper, parentJob, baseData) { 2 | 3 | if (parentJob !== undefined) { 4 | this.url = this.combineUrls(parentJob.url, url); 5 | } 6 | else { 7 | this.url = url; 8 | } 9 | this.parentSelector = parentSelector; 10 | this.scraper = scraper; 11 | this.dataItems = []; 12 | this.baseData = baseData || {}; 13 | }; 14 | 15 | Job.prototype = { 16 | 17 | combineUrls: function (parentUrl, childUrl) { 18 | 19 | var urlMatcher = new 
RegExp("(https?://)?([a-z0-9\\-\\.]+\\.[a-z0-9\\-]+(:\\d+)?|\\d{1,3}\\.\\d{1,3}\\.\\d{1,3}\\.\\d{1,3}(:\\d+)?)?(\\/[^\\?]*\\/|\\/)?([^\\?]*)?(\\?.*)?", "i"); 20 | 21 | var parentMatches = parentUrl.match(urlMatcher); 22 | var childMatches = childUrl.match(urlMatcher); 23 | 24 | // special case for urls like this: ?a=1 or like-this/ 25 | if (childMatches[1] === undefined && childMatches[2] === undefined && childMatches[5] === undefined && childMatches[6] === undefined) { 26 | 27 | var url = parentMatches[1] + parentMatches[2] + parentMatches[5] + parentMatches[6] + childMatches[7]; 28 | return url; 29 | } 30 | 31 | if (childMatches[1] === undefined) { 32 | childMatches[1] = parentMatches[1]; 33 | } 34 | if (childMatches[2] === undefined) { 35 | childMatches[2] = parentMatches[2]; 36 | } 37 | if (childMatches[5] === undefined) { 38 | if(parentMatches[5] === undefined) { 39 | childMatches[5] = '/'; 40 | } 41 | else { 42 | childMatches[5] = parentMatches[5]; 43 | } 44 | } 45 | 46 | if (childMatches[6] === undefined) { 47 | childMatches[6] = ""; 48 | } 49 | if (childMatches[7] === undefined) { 50 | childMatches[7] = ""; 51 | } 52 | 53 | return childMatches[1] + childMatches[2] + childMatches[5] + childMatches[6] + childMatches[7]; 54 | }, 55 | 56 | execute: function (browser, callback, scope) { 57 | 58 | var sitemap = this.scraper.sitemap; 59 | var job = this; 60 | browser.fetchData(this.url, sitemap, this.parentSelector, function (results) { 61 | // merge data with data from initialization 62 | for (var i in results) { 63 | var result = results[i]; 64 | for (var key in this.baseData) { 65 | if(!(key in result)) { 66 | result[key] = this.baseData[key]; 67 | } 68 | } 69 | this.dataItems.push(result); 70 | } 71 | 72 | if (sitemap) { 73 | // table selector can dynamically add columns (addMissingColumns Feature) 74 | sitemap.selectors = this.scraper.sitemap.selectors; 75 | } 76 | 77 | console.log(job); 78 | callback(job); 79 | }.bind(this), this); 80 | }, 81 | getResults: 
function () { 82 | return this.dataItems; 83 | } 84 | }; 85 | -------------------------------------------------------------------------------- /extension/devtools/views/SitemapEditMetadata.html: -------------------------------------------------------------------------------- 1 |
Supported URL patterns:
1. Numeric with optional step and zero padding – [START-END:STEP] – [001-010:10]
2. Date interval – [date<PATTERN><START><END>] – [date<dd.MM.yyyy><01.01.2017><now>]
    - date placeholder may be yesterday / now / tomorrow
    - other template components (in Java style):

| Component | Meaning |
| --- | --- |
| yyyy | full year (4 digits) |
| yy | last 2 digits of year |
| MMM | Jan, Feb, Mar, Apr, May, Jun, Jul, Aug, Sep, Oct, Nov, Dec |
| MM | month number (01-12) |
| dd | day of month |
81 |
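The numeric pattern described in the help text of this view can be illustrated with a small sketch. `expandNumericPattern` is a hypothetical helper, not the extension's own code; it assumes that the width of `START` decides the zero padding and that `STEP` defaults to 1 when omitted.

```javascript
// Hypothetical helper (not part of the extension): expands the first
// [START-END:STEP] range found in a start URL into a list of URLs.
function expandNumericPattern(url) {
    var match = url.match(/\[(\d+)-(\d+)(?::(\d+))?\]/);
    if (!match) {
        // No range pattern present: the URL is used as-is.
        return [url];
    }

    var startText = match[1];
    var start = parseInt(startText, 10);
    var end = parseInt(match[2], 10);
    var step = match[3] ? parseInt(match[3], 10) : 1;
    if (step < 1) {
        step = 1; // guard against an endless loop for a step of 0
    }

    // Assumption: the number of digits in START sets the zero padding.
    var width = startText.length;

    var urls = [];
    for (var n = start; n <= end; n += step) {
        var text = String(n);
        while (text.length < width) {
            text = "0" + text;
        }
        urls.push(url.replace(match[0], text));
    }
    return urls;
}

console.log(expandNumericPattern("http://example.com/page/[001-010:3]"));
```

Under these assumptions the help text's own example, `[001-010:10]`, expands to a single URL ending in `001`, because the next step would overshoot `010`.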
-------------------------------------------------------------------------------- /tests/spec/Selector/SelectorLinkSpec.js: -------------------------------------------------------------------------------- 1 | describe("Link Selector", function () { 2 | 3 | var $el; 4 | 5 | beforeEach(function () { 6 | $el = jQuery("#tests").html(""); 7 | if($el.length === 0) { 8 | $el = $("").appendTo("body"); 9 | } 10 | }); 11 | 12 | it("should extract single link", function () { 13 | 14 | var selector = new Selector({ 15 | id: 'a', 16 | type: 'SelectorLink', 17 | multiple: false, 18 | selector: "a" 19 | }); 20 | 21 | var dataDeferred = selector.getData($("#selector-follow")); 22 | 23 | waitsFor(function() { 24 | return dataDeferred.state() === 'resolved'; 25 | }, "wait for data extraction", 5000); 26 | 27 | runs(function () { 28 | dataDeferred.done(function(data) { 29 | expect(data).toEqual([ 30 | { 31 | a: "a", 32 | 'a-href': "http://example.com/a", 33 | _follow: "http://example.com/a", 34 | _followSelectorId: "a" 35 | } 36 | ]); 37 | }); 38 | }); 39 | }); 40 | 41 | it("should extract multiple links", function () { 42 | 43 | var selector = new Selector({ 44 | id: 'a', 45 | type: 'SelectorLink', 46 | multiple: true, 47 | selector: "a" 48 | }); 49 | 50 | var dataDeferred = selector.getData($("#selector-follow")); 51 | 52 | waitsFor(function() { 53 | return dataDeferred.state() === 'resolved'; 54 | }, "wait for data extraction", 5000); 55 | 56 | runs(function () { 57 | dataDeferred.done(function(data) { 58 | expect(data).toEqual([ 59 | { 60 | a: "a", 61 | 'a-href': "http://example.com/a", 62 | _follow: "http://example.com/a", 63 | _followSelectorId: "a" 64 | }, 65 | { 66 | a: "b", 67 | 'a-href': "http://example.com/b", 68 | _follow: "http://example.com/b", 69 | _followSelectorId: "a" 70 | } 71 | ]); 72 | }); 73 | }); 74 | }); 75 | 76 | it("should return data and url columns", function () { 77 | var selector = new Selector({ 78 | id: 'id', 79 | type: 'SelectorLink', 80 | multiple: 
true, 81 | selector: "div" 82 | }); 83 | 84 | var columns = selector.getDataColumns(); 85 | expect(columns).toEqual(['id', 'id-href']); 86 | }); 87 | 88 | it("should return empty array when no links are found", function () { 89 | var selector = new Selector({ 90 | id: 'a', 91 | type: 'SelectorLink', 92 | multiple: true, 93 | selector: "a" 94 | }); 95 | 96 | var dataDeferred = selector.getData($("#not-exist")); 97 | 98 | waitsFor(function() { 99 | return dataDeferred.state() === 'resolved'; 100 | }, "wait for data extraction", 5000); 101 | 102 | runs(function () { 103 | dataDeferred.done(function(data) { 104 | expect(data).toEqual([]); 105 | }); 106 | }); 107 | }); 108 | }); 109 | -------------------------------------------------------------------------------- /extension/scripts/BackgroundScript.js: -------------------------------------------------------------------------------- 1 | /** 2 | * ContentScript that can be called from anywhere within the extension 3 | */ 4 | var BackgroundScript = { 5 | 6 | dummy: function() { 7 | 8 | return $.Deferred().resolve("dummy").promise(); 9 | }, 10 | 11 | /** 12 | * Returns the id of the tab that is visible to user 13 | * @returns $.Deferred() integer 14 | */ 15 | getActiveTabId: function() { 16 | 17 | var deferredResponse = $.Deferred(); 18 | 19 | chrome.tabs.query({ 20 | active: true, 21 | currentWindow: true 22 | }, function (tabs) { 23 | 24 | if (tabs.length < 1) { 25 | // @TODO must be running within popup. maybe find another active window? 
26 | deferredResponse.reject("couldn't find the active tab"); 27 | } 28 | else { 29 | var tabId = tabs[0].id; 30 | deferredResponse.resolve(tabId); 31 | } 32 | }); 33 | return deferredResponse.promise(); 34 | }, 35 | 36 | /** 37 | * Execute a function within the active tab within content script 38 | * @param request.fn function to call 39 | * @param request.request request that will be passed to the function 40 | */ 41 | executeContentScript: function(request) { 42 | 43 | var reqToContentScript = { 44 | contentScriptCall: true, 45 | fn: request.fn, 46 | request: request.request 47 | }; 48 | var deferredResponse = $.Deferred(); 49 | var deferredActiveTabId = this.getActiveTabId(); 50 | deferredActiveTabId.done(function(tabId) { 51 | chrome.tabs.sendMessage(tabId, reqToContentScript, function(response) { 52 | deferredResponse.resolve(response); 53 | }); 54 | }); 55 | 56 | return deferredResponse; 57 | } 58 | }; 59 | 60 | /** 61 | * @param location configure from where the content script is being accessed (ContentScript, BackgroundPage, DevTools) 62 | * @returns BackgroundScript 63 | */ 64 | var getBackgroundScript = function(location) { 65 | 66 | // Handle calls from different places 67 | if(location === "BackgroundScript") { 68 | return BackgroundScript; 69 | } 70 | else if(location === "DevTools" || location === "ContentScript") { 71 | 72 | // if called within background script proxy calls to content script 73 | var backgroundScript = {}; 74 | 75 | Object.keys(BackgroundScript).forEach(function(attr) { 76 | if(typeof BackgroundScript[attr] === 'function') { 77 | backgroundScript[attr] = function(request) { 78 | 79 | var reqToBackgroundScript = { 80 | backgroundScriptCall: true, 81 | fn: attr, 82 | request: request 83 | }; 84 | 85 | var deferredResponse = $.Deferred(); 86 | 87 | chrome.runtime.sendMessage(reqToBackgroundScript, function(response) { 88 | deferredResponse.resolve(response); 89 | }); 90 | 91 | return deferredResponse; 92 | }; 93 | } 94 | else { 95 | 
backgroundScript[attr] = BackgroundScript[attr]; 96 | } 97 | }); 98 | 99 | return backgroundScript; 100 | } 101 | else { 102 | throw "Invalid BackgroundScript initialization - " + location; 103 | } 104 | }; -------------------------------------------------------------------------------- /tests/spec/JobSpec.js: -------------------------------------------------------------------------------- 1 | describe("Job", function () { 2 | 3 | beforeEach(function () { 4 | window.chromeAPI.reset(); 5 | }); 6 | 7 | it("should be able to create correct url from parent job", function () { 8 | 9 | var parent = new Job("http://example.com/"); 10 | var child = new Job("/test/", null, null, parent); 11 | expect(child.url).toBe("http://example.com/test/"); 12 | 13 | var parent = new Job("http://example.com"); 14 | var child = new Job("test/", null, null, parent); 15 | expect(child.url).toBe("http://example.com/test/"); 16 | 17 | var parent = new Job("http://example.com/asdasdad"); 18 | var child = new Job("tvnet.lv", null, null, parent); 19 | expect(child.url).toBe("http://tvnet.lv/"); 20 | 21 | var parent = new Job("http://example.com/asdasdad"); 22 | var child = new Job("?test", null, null, parent); 23 | expect(child.url).toBe("http://example.com/asdasdad?test"); 24 | 25 | var parent = new Job("http://example.com/1/"); 26 | var child = new Job("2/", null, null, parent); 27 | expect(child.url).toBe("http://example.com/1/2/"); 28 | 29 | var parent = new Job("http://127.0.0.1/1/"); 30 | var child = new Job("2/", null, null, parent); 31 | expect(child.url).toBe("http://127.0.0.1/1/2/"); 32 | 33 | var parent = new Job("http://xn--80aaxitdbjk.xn--p1ai/"); 34 | var child = new Job("2/", null, null, parent); 35 | 36 | expect(child.url).toBe("http://xn--80aaxitdbjk.xn--p1ai/2/"); 37 | }); 38 | 39 | it("should be able to create correct url from parent job with slashes after question mark", function () { 40 | 41 | var parent = new 
Job("http://www.sportstoto.com.my/results_past.asp?date=5/1/1992"); 42 | var child = new Job("popup_past_results.asp?drawNo=418/92", null, null, parent); 43 | expect(child.url).toBe("http://www.sportstoto.com.my/popup_past_results.asp?drawNo=418/92"); 44 | }); 45 | 46 | it("should be able to create correct url with a port number", function () { 47 | 48 | var parent = new Job("http://nukrobi2.nuk.uni-lj.si:8080/wayback/20101021090940/http://volitve.gov.si/lv2010/kandidati/seznam_obcin.html"); 49 | var child = new Job("http://nukrobi2.nuk.uni-lj.si:8080/wayback/20101021091250/http://volitve.gov.si/lv2010/kandidati/zupani_os_celje.html", null, null, parent); 50 | expect(child.url).toBe("http://nukrobi2.nuk.uni-lj.si:8080/wayback/20101021091250/http://volitve.gov.si/lv2010/kandidati/zupani_os_celje.html"); 51 | 52 | var parent = new Job("http://nukrobi2.nuk.uni-lj.si:8080"); 53 | var child = new Job("zupani_os_celje.html", null, null, parent); 54 | expect(child.url).toBe("http://nukrobi2.nuk.uni-lj.si:8080/zupani_os_celje.html"); 55 | }); 56 | 57 | it("should not override data with base data if it already exists", function() { 58 | 59 | var browser = { 60 | fetchData:function(url, sitemap, parentSelector, callback) { 61 | callback([{a:1,b:2}]); 62 | } 63 | }; 64 | 65 | var job = new Job(undefined, undefined, {sitemap:undefined}, undefined, {a:'do not override', c:3}); 66 | job.execute(browser, function(){}); 67 | var results = job.getResults(); 68 | expect(results).toEqual([{a:1,b:2,c:3}]); 69 | }); 70 | }); -------------------------------------------------------------------------------- /extension/scripts/DateUtils/SimpleDateFormatter.js: -------------------------------------------------------------------------------- 1 | /** 2 | * Formatter for Date, parse and format with pattern 3 | * 4 | * @author © Denis Bakhtenkov denis.bakhtenkov@gmail.com 5 | * @version 2016 6 | * @param {String} pattern 7 | * default is dd.MM.yyyy 8 | * @returns {SimpleDateFormatter} 9 | */ 10 
| var SimpleDateFormatter = function (pattern) { 11 | this.pattern = pattern || "dd.MM.yyyy"; 12 | this.months = ["Jan", "Feb", "Mar", "Apr", "May", "Jun", "Jul", "Aug", "Sep", "Oct", "Nov", "Dec"]; 13 | }; 14 | 15 | /** 16 | * Return pattern 17 | * @returns {String} 18 | */ 19 | SimpleDateFormatter.prototype.getPattern = function () { 20 | return this.pattern; 21 | }; 22 | 23 | /** 24 | * 'dd.MM.yyyy hh:mm:ss' 25 | * @param {Date} date 26 | * @returns {String} 27 | */ 28 | SimpleDateFormatter.prototype.format = function (date) { 29 | 30 | /** 31 | * Adding left 'zero' if value's length less than digits 32 | * @param {Number} value 33 | * @param {Number} digits 34 | * @returns {String} 35 | */ 36 | function lzero(value, digits) { 37 | digits = digits || 2; 38 | var result = value.toString(); 39 | while (result.length < digits) { 40 | result = "0" + result; 41 | } 42 | return result; 43 | } 44 | 45 | var variants = { 46 | yyyy: date.getFullYear(), 47 | yy: lzero(date.getFullYear() % 100), 48 | MMM: this.months[date.getMonth()], 49 | MM: lzero(date.getMonth() + 1), 50 | dd: lzero(date.getDate()), 51 | hh: lzero(date.getHours()), 52 | mm: lzero(date.getMinutes()), 53 | sss: lzero(date.getMilliseconds(), 3), 54 | ss: lzero(date.getSeconds()) 55 | }; 56 | 57 | var format = this.pattern; 58 | 59 | for (var i in variants) { 60 | format = format.replace(i, variants[i]); 61 | } 62 | 63 | return format; 64 | }; 65 | 66 | /** 67 | * 16.06.2016 68 | * dd.MM.yyyy 69 | * 70 | * @param {String} string 71 | * @returns {Date} 72 | */ 73 | SimpleDateFormatter.prototype.parse = function (string) { 74 | 75 | var date = new Date(0); 76 | var pat = this.pattern; 77 | var input = string; 78 | var variants = { 79 | yyyy: "date.setFullYear(parseInt(value));", 80 | yy: "date.setYear(parseInt(value) + 2000);", 81 | MMM: "date.setMonth(parseInt(value));", 82 | MM: "date.setMonth(parseInt(value) - 1);", 83 | dd: "date.setDate(parseInt(value));", 84 | hh: "date.setHours(parseInt(value));", 85 | 
mm: "date.setMinutes(parseInt(value));", 86 | sss: "date.setMilliseconds(parseInt(value));", 87 | ss: "date.setSeconds(parseInt(value));" 88 | }; 89 | 90 | for (var i in variants) { 91 | var pos = pat.search(i); 92 | if (pos !== -1) { 93 | var value = input.substr(pos, i.length); 94 | input = input.substring(0, pos) + input.substring(pos + i.length); 95 | pat = pat.substring(0, pos) + pat.substring(pos + i.length); 96 | if (i === "MMM") { 97 | for (var j in this.months) { 98 | if (value === this.months[j]) { 99 | value = j; 100 | eval(variants[i]); 101 | break; 102 | } 103 | } 104 | } else { 105 | eval(variants[i]); 106 | } 107 | } 108 | } 109 | 110 | return date; 111 | }; -------------------------------------------------------------------------------- /tests/spec/Selector/SelectorElementAttributeSpec.js: -------------------------------------------------------------------------------- 1 | describe("Element Attribute Selector", function () { 2 | 3 | var $el; 4 | 5 | beforeEach(function () { 6 | 7 | this.addMatchers(selectorMatchers); 8 | 9 | $el = jQuery("#tests").html(""); 10 | if($el.length === 0) { 11 | $el = $("").appendTo("body"); 12 | } 13 | }); 14 | 15 | it("should extract image src tag", function () { 16 | 17 | var selector = new Selector({ 18 | id: 'img', 19 | type: 'SelectorElementAttribute', 20 | multiple: false, 21 | extractAttribute: "src", 22 | selector: "img" 23 | }); 24 | 25 | var dataDeferred = selector.getData($("#selector-image-one-image")); 26 | 27 | waitsFor(function() { 28 | return dataDeferred.state() === 'resolved'; 29 | }, "wait for data extraction", 5000); 30 | 31 | runs(function () { 32 | dataDeferred.done(function(data) { 33 | expect(data).toEqual([ 34 | { 35 | 'img': "http://aa/" 36 | } 37 | ]); 38 | }); 39 | }); 40 | }); 41 | 42 | it("should extract multiple src tags", function () { 43 | 44 | var selector = new Selector({ 45 | id: 'img', 46 | type: 'SelectorElementAttribute', 47 | multiple: true, 48 | extractAttribute: "src", 49 | 
selector: "img" 50 | }); 51 | 52 | var dataDeferred = selector.getData($("#selector-image-multiple-images")); 53 | 54 | waitsFor(function() { 55 | return dataDeferred.state() === 'resolved'; 56 | }, "wait for data extraction", 5000); 57 | 58 | runs(function () { 59 | dataDeferred.done(function(data) { 60 | expect(data).toEqual([ 61 | { 62 | 'img': "http://aa/" 63 | }, 64 | { 65 | 'img': "http://bb/" 66 | } 67 | ]); 68 | }); 69 | }); 70 | }); 71 | 72 | it("should return only one data column", function () { 73 | var selector = new Selector({ 74 | id: 'id', 75 | type: 'SelectorElementAttribute', 76 | multiple: true, 77 | selector: "img" 78 | }); 79 | 80 | var columns = selector.getDataColumns(); 81 | expect(columns).toEqual(['id']); 82 | }); 83 | 84 | it("should return empty array when no images are found", function () { 85 | var selector = new Selector({ 86 | id: 'img', 87 | type: 'SelectorElementAttribute', 88 | multiple: true, 89 | selector: "img.not-exist", 90 | extractAttribute: "src" 91 | }); 92 | 93 | var dataDeferred = selector.getData($("#not-exist")); 94 | 95 | waitsFor(function() { 96 | return dataDeferred.state() === 'resolved'; 97 | }, "wait for data extraction", 5000); 98 | 99 | runs(function () { 100 | dataDeferred.done(function(data) { 101 | expect(data).toEqual([]); 102 | }); 103 | }); 104 | }); 105 | 106 | it("should be able to select data- attributes", function () { 107 | 108 | var html = '
'; 109 | $el.append(html); 110 | 111 | var selector = new Selector({ 112 | id: 'type', 113 | type: 'SelectorElementAttribute', 114 | multiple: true, 115 | selector: "li", 116 | extractAttribute: "data-type" 117 | }); 118 | 119 | var dataDeferred = selector.getData($el); 120 | 121 | expect(dataDeferred).deferredToEqual([{ 122 | 'type': 'dog' 123 | }]); 124 | }); 125 | }); 126 | -------------------------------------------------------------------------------- /tests/spec/Selector/SelectorElementScrollSpec.js: -------------------------------------------------------------------------------- 1 | describe("Scroll Element Selector", function () { 2 | 3 | var $el; 4 | 5 | beforeEach(function () { 6 | $el = jQuery("#tests").html(""); 7 | if($el.length === 0) { 8 | $el = $("").appendTo("body"); 9 | } 10 | }); 11 | 12 | it("should return one element", function () { 13 | 14 | $el.append("
a
b
"); 15 | var selector = new Selector({ 16 | id: 'a', 17 | type: 'SelectorElementScroll', 18 | multiple: false, 19 | selector: "div" 20 | }); 21 | 22 | var dataDeferred = selector.getData($el[0]); 23 | 24 | waitsFor(function() { 25 | return dataDeferred.state() === 'resolved'; 26 | }, "wait for data extraction", 5000); 27 | 28 | runs(function () { 29 | dataDeferred.done(function(data) { 30 | expect(data).toEqual([$el.find("div")[0]]); 31 | }); 32 | }); 33 | }); 34 | 35 | it("should return multiple elements", function () { 36 | 37 | $el.append("
a
b
"); 38 | var selector = new Selector({ 39 | id: 'a', 40 | type: 'SelectorElementScroll', 41 | multiple: true, 42 | selector: "div" 43 | }); 44 | 45 | var dataDeferred = selector.getData($el[0]); 46 | 47 | waitsFor(function() { 48 | return dataDeferred.state() === 'resolved'; 49 | }, "wait for data extraction", 5000); 50 | 51 | runs(function () { 52 | dataDeferred.done(function(data) { 53 | expect(data).toEqual($el.find("div").get()); 54 | }); 55 | }); 56 | }); 57 | 58 | it("should get elements when scrolling is not needed", function() { 59 | 60 | $el.append($("a")); 61 | var selector = new Selector({ 62 | id: 'a', 63 | type: 'SelectorElementScroll', 64 | multiple: true, 65 | selector: "a", 66 | delay: 100 67 | }); 68 | 69 | var dataDeferred = selector.getData($el[0]); 70 | 71 | waitsFor(function() { 72 | return dataDeferred.state() === 'resolved'; 73 | }, "wait for data extraction", 5000); 74 | 75 | runs(function () { 76 | dataDeferred.done(function(data) { 77 | expect(data).toEqual($el.find("a").get()); 78 | }); 79 | }); 80 | }); 81 | 82 | it("should get elements which are added a delay", function() { 83 | 84 | $el.append($("a")); 85 | // add extra element after a little delay 86 | setTimeout(function() { 87 | $el.append($("a")); 88 | }, 100); 89 | 90 | var selector = new Selector({ 91 | id: 'a', 92 | type: 'SelectorElementScroll', 93 | multiple: true, 94 | selector: "a", 95 | delay: 200 96 | }); 97 | 98 | var dataDeferred = selector.getData($el[0]); 99 | 100 | waitsFor(function() { 101 | return dataDeferred.state() === 'resolved'; 102 | }, "wait for data extraction", 5000); 103 | 104 | runs(function () { 105 | dataDeferred.done(function(data) { 106 | expect($el.find("a").length).toEqual(2); 107 | expect(data).toEqual($el.find("a").get()); 108 | }); 109 | }); 110 | }); 111 | 112 | it("should return no data columns", function () { 113 | var selector = new Selector({ 114 | id: 'a', 115 | type: 'SelectorElementScroll', 116 | multiple: true, 117 | selector: "div" 118 
| }); 119 | 120 | var columns = selector.getDataColumns(); 121 | expect(columns).toEqual([]); 122 | }); 123 | }); 124 | -------------------------------------------------------------------------------- /README.md: --------------------------------------------------------------------------------
# Web Scraper
Web Scraper is a Chrome browser extension built for extracting data from web
pages. With this extension you create a plan (a sitemap) that describes how a
web site should be traversed and what should be extracted. Web Scraper then
navigates the site according to the sitemap and extracts the data. Scraped
data can later be exported as CSV.

#### Latest Version
To run the latest version you need to [download the project][latest-releases] to your system and [follow the description on Google][get-started-chrome] (select the `extension` folder).

## Changelog

### v0.3
* Enabled pasting of multiple start URLs (by [@jwillmer](https://github.com/jwillmer))
* Added scraping of dynamic table columns (by [@jwillmer](https://github.com/jwillmer))
* Added style extraction type (by [@jwillmer](https://github.com/jwillmer))
* Added text manipulation (trim, replace, prefix, suffix, remove HTML) (by [@jwillmer](https://github.com/jwillmer))
* Added image improvements to find images in div background (by [@jwillmer](https://github.com/jwillmer))
* Added support for vertical tables (by [@jwillmer](https://github.com/jwillmer))
* Added random delay function between requests (by [@Euphorbium](https://github.com/Euphorbium))
* Start URL can now also be a local URL (by [@3flex](https://github.com/3flex))
* Added CSV export options (by [@mohamnag](https://github.com/mohamnag))
* Added Regex group for select (by [@RuneHL](https://github.com/RuneHL))
* JSON export/import of settings (by [@haisi](https://github.com/haisi))
* Added date and number pattern in URL (by [@codoff](https://github.com/codoff))
* Added pagination selector limit (by [@codoff](https://github.com/codoff))
* Improved CSV export (by [@haisi](https://github.com/haisi))
* Added click limit option (by [@panna-ahmed](https://github.com/panna-ahmed))

### v0.2
* Added Element click selector
* Added Element scroll down selector
* Added Link popup selector
* Improved table selector to work with any HTML markup
* Added Image download
* Added keyboard shortcuts when selecting elements
* Added configurable delay before using selector
* Added configurable delay between page visits
* Added multiple start URL configuration
* Added form field validation
* Fixed a lot of bugs

### v0.1.3
* Added Table selector
* Added HTML selector
* Added HTML attribute selector
* Added data preview
* Added ranged start URLs
* Fixed a bug that prevented the selector tree from showing on some operating systems

#### Bugs
When submitting a bug please attach an exported sitemap if possible.

#### Development
Read the [Development Instructions](/docs/Development.md) before you start.
56 | 57 | ## License 58 | LGPLv3 59 | 60 | [chrome-store]: https://chrome.google.com/webstore/detail/web-scraper/jnhgnonknehpejjnehehllkliplmbmhn 61 | [webscraper.io]: http://webscraper.io/ 62 | [google-groups]: https://groups.google.com/forum/#!forum/web-scraper 63 | [github-issues]: https://github.com/martinsbalodis/web-scraper-chrome-extension/issues 64 | [get-started-chrome]: https://developer.chrome.com/extensions/getstarted#unpacked 65 | [issue-14]: https://github.com/jwillmer/web-scraper-chrome-extension/issues/14 66 | [latest-releases]: https://github.com/jwillmer/web-scraper-chrome-extension/releases 67 | -------------------------------------------------------------------------------- /tests/spec/Selector/SelectorElementStyleSpec.js: -------------------------------------------------------------------------------- 1 | describe("Element Style Selector", function () { 2 | 3 | var $el; 4 | 5 | beforeEach(function () { 6 | 7 | this.addMatchers(selectorMatchers); 8 | 9 | $el = jQuery("#tests").html(""); 10 | if ($el.length === 0) { 11 | $el = $("").appendTo("body"); 12 | } 13 | }); 14 | 15 | it("should extract width style", function () { 16 | 17 | var selector = new Selector({ 18 | id: 'pixel', 19 | type: 'SelectorElementStyle', 20 | multiple: false, 21 | extractStyle: "width", 22 | selector: "div.productShotThumbnail" 23 | }); 24 | 25 | var dataDeferred = selector.getData($("#style-extraction-test")); 26 | 27 | waitsFor(function() { 28 | return dataDeferred.state() === 'resolved'; 29 | }, "wait for data extraction", 5000); 30 | 31 | runs(function () { 32 | dataDeferred.done(function(data) { 33 | expect(data).toEqual([ 34 | { 35 | 'pixel': "20px" 36 | } 37 | ]); 38 | }); 39 | }); 40 | }); 41 | 42 | it("should extract multiple widths", function () { 43 | 44 | var selector = new Selector({ 45 | id: 'pixel', 46 | type: 'SelectorElementStyle', 47 | multiple: true, 48 | extractStyle: "width", 49 | selector: "div.productShotThumbnail" 50 | }); 51 | 52 | var 
dataDeferred = selector.getData($("#style-extraction-test")); 53 | 54 | waitsFor(function() { 55 | return dataDeferred.state() === 'resolved'; 56 | }, "wait for data extraction", 5000); 57 | 58 | runs(function () { 59 | dataDeferred.done(function(data) { 60 | expect(data).toEqual([ 61 | { 62 | 'pixel': "20px" 63 | }, 64 | { 65 | 'pixel': "20px" 66 | }, 67 | { 68 | 'pixel': "20px" 69 | } 70 | ]); 71 | }); 72 | }); 73 | }); 74 | 75 | it("should return only one data column", function () { 76 | var selector = new Selector({ 77 | id: 'pixel', 78 | type: 'SelectorElementStyle', 79 | multiple: true, 80 | selector: "div.productShotThumbnail" 81 | }); 82 | 83 | var columns = selector.getDataColumns(); 84 | expect(columns).toEqual(['pixel']); 85 | }); 86 | 87 | it("should return empty array when no width is found", function () { 88 | var selector = new Selector({ 89 | id: 'pixel', 90 | type: 'SelectorElementStyle', 91 | multiple: true, 92 | selector: "img.not-exist", 93 | extractStyle: "width" 94 | }); 95 | 96 | var dataDeferred = selector.getData($("#not-exist")); 97 | 98 | waitsFor(function() { 99 | return dataDeferred.state() === 'resolved'; 100 | }, "wait for data extraction", 5000); 101 | 102 | runs(function () { 103 | dataDeferred.done(function(data) { 104 | expect(data).toEqual([]); 105 | }); 106 | }); 107 | }); 108 | 109 | it("should be able to extract color green", function () { 110 | 111 | var html = '
'; 112 | $el.append(html); 113 | 114 | var selector = new Selector({ 115 | id: 'color', 116 | type: 'SelectorElementStyle', 117 | multiple: true, 118 | selector: "li", 119 | extractStyle: "color" 120 | }); 121 | 122 | var dataDeferred = selector.getData($el); 123 | 124 | expect(dataDeferred).deferredToEqual([{ 125 | 'color': 'rgb(0, 128, 0)' 126 | }]); 127 | }); 128 | }); 129 | -------------------------------------------------------------------------------- /extension/scripts/UniqueElementList.js: -------------------------------------------------------------------------------- 1 | /** 2 | * Only Elements unique will be added to this array 3 | * @constructor 4 | */ 5 | UniqueElementList = function(clickElementUniquenessType) { 6 | this.clickElementUniquenessType = clickElementUniquenessType; 7 | this.addedElements = {}; 8 | }; 9 | 10 | UniqueElementList.prototype = new Array; 11 | 12 | UniqueElementList.prototype.push = function(element) { 13 | 14 | var getStyles = function(_elem, _style) { 15 | var computedStyle; 16 | if ( typeof _elem.currentStyle != 'undefined' ) { 17 | computedStyle = _elem.currentStyle; 18 | } else { 19 | computedStyle = document.defaultView.getComputedStyle(_elem, null); 20 | } 21 | return _style ? computedStyle[_style] : computedStyle; 22 | }; 23 | 24 | var copyComputedStyle = function(src, dest) { 25 | var styles = getStyles(src); 26 | for ( var i in styles ) { 27 | // Do not use `hasOwnProperty`, nothing will get copied 28 | if ( typeof i == "string" && i != "cssText" && !/\d/.test(i) ) { 29 | // The try is for setter only properties 30 | try { 31 | dest.style[i] = styles[i]; 32 | // `fontSize` comes before `font` If `font` is empty, `fontSize` gets 33 | // overwritten. So make sure to reset this property. 
(hackyhackhack) 34 | // Other properties may need similar treatment 35 | if ( i == "font" ) { 36 | dest.style.fontSize = styles.fontSize; 37 | } 38 | } catch (e) {} 39 | } 40 | } 41 | }; 42 | 43 | if(this.isAdded(element)) { 44 | return false; 45 | } 46 | else { 47 | var elementUniqueId = this.getElementUniqueId(element); 48 | this.addedElements[elementUniqueId] = true; 49 | var clone = $(element).clone(true)[0]; 50 | 51 | // clone computed styles (to extract images from background) 52 | var items = element.getElementsByTagName("*"); 53 | var itemsCloned = clone.getElementsByTagName("*"); 54 | $(items).each(function(i, item) { 55 | copyComputedStyle(item, itemsCloned[i]); 56 | }); 57 | 58 | Array.prototype.push.call(this, clone); 59 | return true; 60 | } 61 | }; 62 | 63 | UniqueElementList.prototype.getElementUniqueId = function(element) { 64 | 65 | if(this.clickElementUniquenessType === 'uniqueText') { 66 | var elementText = $(element).text().trim(); 67 | return elementText; 68 | } 69 | else if(this.clickElementUniquenessType === 'uniqueHTMLText') { 70 | 71 | var elementHTML = $("
").append($(element).eq(0).clone()).html(); 72 | return elementHTML; 73 | } 74 | else if(this.clickElementUniquenessType === 'uniqueHTML') { 75 | 76 | // get element without text 77 | var $element = $(element).eq(0).clone(); 78 | 79 | var removeText = function($element) { 80 | $element.contents() 81 | .filter(function() { 82 | if(this.nodeType !== 3) { 83 | removeText($(this)); 84 | } 85 | return this.nodeType == 3; //Node.TEXT_NODE 86 | }).remove(); 87 | }; 88 | removeText($element); 89 | 90 | var elementHTML = $("
").append($element).html(); 91 | return elementHTML; 92 | } 93 | else if(this.clickElementUniquenessType === 'uniqueCSSSelector') { 94 | var cs = new CssSelector({ 95 | enableSmartTableSelector: false, 96 | parent: $("body")[0], 97 | enableResultStripping:false 98 | }); 99 | var CSSSelector = cs.getCssSelector([element]); 100 | return CSSSelector; 101 | } 102 | else { 103 | throw "Invalid clickElementUniquenessType "+this.clickElementUniquenessType; 104 | } 105 | }; 106 | 107 | UniqueElementList.prototype.isAdded = function(element) { 108 | 109 | var elementUniqueId = this.getElementUniqueId(element); 110 | var isAdded = elementUniqueId in this.addedElements; 111 | return isAdded; 112 | }; 113 | -------------------------------------------------------------------------------- /extension/background_page/background_script.js: -------------------------------------------------------------------------------- 1 | var config = new Config(); 2 | var store; 3 | config.loadConfiguration(function () { 4 | console.log("initial configuration", config); 5 | store = new Store(config); 6 | }); 7 | 8 | chrome.storage.onChanged.addListener(function () { 9 | config.loadConfiguration(function () { 10 | console.log("configuration changed", config); 11 | store = new Store(config); 12 | }); 13 | }); 14 | 15 | var sendToActiveTab = function(request, callback) { 16 | chrome.tabs.query({ 17 | active: true, 18 | currentWindow: true 19 | }, function (tabs) { 20 | if (tabs.length < 1) { 21 | this.console.log("couldn't find active tab"); 22 | } 23 | else { 24 | var tab = tabs[0]; 25 | chrome.tabs.sendMessage(tab.id, request, callback); 26 | } 27 | }); 28 | }; 29 | 30 | chrome.runtime.onMessage.addListener( 31 | function (request, sender, sendResponse) { 32 | 33 | console.log("chrome.runtime.onMessage", request); 34 | 35 | if (request.createSitemap) { 36 | store.createSitemap(request.sitemap, sendResponse); 37 | return true; 38 | } 39 | else if (request.saveSitemap) { 40 | 
store.saveSitemap(request.sitemap, sendResponse); 41 | return true; 42 | } 43 | else if (request.deleteSitemap) { 44 | store.deleteSitemap(request.sitemap, sendResponse); 45 | return true; 46 | } 47 | else if (request.getAllSitemaps) { 48 | store.getAllSitemaps(sendResponse); 49 | return true; 50 | } 51 | else if (request.sitemapExists) { 52 | store.sitemapExists(request.sitemapId, sendResponse); 53 | return true; 54 | } 55 | else if (request.getSitemapData) { 56 | store.getSitemapData(new Sitemap(request.sitemap), sendResponse); 57 | return true; 58 | } 59 | else if (request.scrapeSitemap) { 60 | var sitemap = new Sitemap(request.sitemap); 61 | var queue = new Queue(); 62 | var browser = new ChromePopupBrowser({ 63 | pageLoadDelay: request.pageLoadDelay 64 | }); 65 | 66 | var scraper = new Scraper({ 67 | queue: queue, 68 | sitemap: sitemap, 69 | browser: browser, 70 | store: store, 71 | requestInterval: request.requestInterval, 72 | requestIntervalRandomness: request.requestIntervalRandomness 73 | }); 74 | 75 | try { 76 | scraper.run(function () { 77 | browser.close(); 78 | var notification = chrome.notifications.create("scraping-finished", { 79 | type: 'basic', 80 | iconUrl: 'assets/images/icon128.png', 81 | title: 'Scraping finished!', 82 | message: 'Finished scraping ' + sitemap._id 83 | }, function(id) { 84 | // notification showed 85 | }); 86 | // table selector can dynamically add columns (addMissingColumns Feature) 87 | var selectors = sitemap.selectors; 88 | sendResponse(selectors); 89 | }); 90 | } 91 | catch (e) { 92 | console.log("Scraper execution cancelled", e); 93 | } 94 | 95 | return true; 96 | } 97 | else if(request.previewSelectorData) { 98 | chrome.tabs.query({ 99 | active: true, 100 | currentWindow: true 101 | }, function (tabs) { 102 | if (tabs.length < 1) { 103 | this.console.log("couldn't find active tab"); 104 | } 105 | else { 106 | var tab = tabs[0]; 107 | chrome.tabs.sendMessage(tab.id, request, sendResponse); 108 | } 109 | }); 110 | return 
true; 111 | } 112 | else if(request.backgroundScriptCall) { 113 | 114 | var backgroundScript = getBackgroundScript("BackgroundScript"); 115 | var deferredResponse = backgroundScript[request.fn](request.request) 116 | deferredResponse.done(function(response){ 117 | sendResponse(response); 118 | }); 119 | 120 | return true; 121 | } 122 | } 123 | ); 124 | -------------------------------------------------------------------------------- /tests/spec/Selector/SelectorHTMLSpec.js: -------------------------------------------------------------------------------- 1 | describe("HTML Selector", function () { 2 | 3 | beforeEach(function () { 4 | 5 | }); 6 | 7 | it("should extract single html element", function () { 8 | 9 | var selector = new Selector({ 10 | id: 'a', 11 | type: 'SelectorHTML', 12 | multiple: false, 13 | selector: "div" 14 | }); 15 | 16 | var dataDeferred = selector.getData($("#selector-html-single-html")); 17 | 18 | waitsFor(function() { 19 | return dataDeferred.state() === 'resolved'; 20 | }, "wait for data extraction", 5000); 21 | 22 | runs(function () { 23 | dataDeferred.done(function(data) { 24 | expect(data).toEqual([ 25 | { 26 | a: "aaabbbccc" 27 | } 28 | ]); 29 | }); 30 | }); 31 | }); 32 | 33 | it("should extract multiple html elements", function () { 34 | 35 | var selector = new Selector({ 36 | id: 'a', 37 | type: 'SelectorHTML', 38 | multiple: true, 39 | selector: "div" 40 | }); 41 | 42 | var dataDeferred = selector.getData($("#selector-html-multiple-html")); 43 | 44 | waitsFor(function() { 45 | return dataDeferred.state() === 'resolved'; 46 | }, "wait for data extraction", 5000); 47 | 48 | runs(function () { 49 | dataDeferred.done(function(data) { 50 | expect(data).toEqual([ 51 | { 52 | a: "aaabbbccc" 53 | }, 54 | { 55 | a: "dddeeefff" 56 | } 57 | ]); 58 | }); 59 | }); 60 | }); 61 | 62 | it("should extract null when there are no elements", function () { 63 | 64 | var selector = new Selector({ 65 | id: 'a', 66 | type: 'SelectorHTML', 67 | multiple: 
false, 68 | selector: "div" 69 | }); 70 | 71 | var dataDeferred = selector.getData($("#selector-html-single-not-exist")); 72 | 73 | waitsFor(function() { 74 | return dataDeferred.state() === 'resolved'; 75 | }, "wait for data extraction", 5000); 76 | 77 | runs(function () { 78 | dataDeferred.done(function(data) { 79 | expect(data).toEqual([ 80 | { 81 | a: null 82 | } 83 | ]); 84 | }); 85 | }); 86 | }); 87 | 88 | it("should extract empty string when there is no regex match", function () { 89 | 90 | var selector = new Selector({ 91 | id: 'a', 92 | type: 'SelectorHTML', 93 | multiple: false, 94 | selector: "div", 95 | textmanipulation: { regex: "wontmatch" } 96 | }); 97 | 98 | var dataDeferred = selector.getData($("#selector-html-single-html")); 99 | 100 | waitsFor(function() { 101 | return dataDeferred.state() === 'resolved'; 102 | }, "wait for data extraction", 5000); 103 | 104 | runs(function () { 105 | dataDeferred.done(function(data) { 106 | expect(data).toEqual([ 107 | { 108 | a: '' 109 | } 110 | ]); 111 | }); 112 | }); 113 | }); 114 | 115 | it("should extract html+text using regex", function () { 116 | 117 | var selector = new Selector({ 118 | id: 'a', 119 | type: 'SelectorHTML', 120 | multiple: false, 121 | selector: "div", 122 | textmanipulation: { regex: "\\w+" } 123 | }); 124 | 125 | var dataDeferred = selector.getData($("#selector-html-single-html")); 126 | 127 | waitsFor(function() { 128 | return dataDeferred.state() === 'resolved'; 129 | }, "wait for data extraction", 5000); 130 | 131 | runs(function () { 132 | dataDeferred.done(function(data) { 133 | expect(data).toEqual([ 134 | { 135 | a: 'bbb' 136 | } 137 | ]); 138 | }); 139 | }); 140 | }); 141 | 142 | it("should return only one data column", function () { 143 | var selector = new Selector({ 144 | id: 'id', 145 | type: 'SelectorHTML', 146 | multiple: true, 147 | selector: "div" 148 | }); 149 | 150 | var columns = selector.getDataColumns(); 151 | expect(columns).toEqual(['id']); 152 | }); 153 | }); 
-------------------------------------------------------------------------------- /tests/spec/Selector/SelectorImageSpec.js: -------------------------------------------------------------------------------- 1 | describe("Image Selector", function () { 2 | 3 | var $el; 4 | 5 | beforeEach(function () { 6 | 7 | this.addMatchers(selectorMatchers); 8 | 9 | $el = jQuery("#tests").html(""); 10 | if($el.length === 0) { 11 | $el = $("").appendTo("body"); 12 | } 13 | }); 14 | 15 | it("should extract single image", function () { 16 | 17 | var selector = new Selector({ 18 | id: 'img', 19 | type: 'SelectorImage', 20 | multiple: false, 21 | selector: "img" 22 | }); 23 | 24 | var dataDeferred = selector.getData($("#selector-image-one-image")); 25 | 26 | waitsFor(function() { 27 | return dataDeferred.state() === 'resolved'; 28 | }, "wait for data extraction", 5000); 29 | 30 | runs(function () { 31 | dataDeferred.done(function(data) { 32 | expect(data).toEqual([ 33 | { 34 | 'img-src': "http://aa/" 35 | } 36 | ]); 37 | }); 38 | }); 39 | }); 40 | 41 | it("should extract multiple images", function () { 42 | 43 | var selector = new Selector({ 44 | id: 'img', 45 | type: 'SelectorImage', 46 | multiple: true, 47 | selector: "img" 48 | }); 49 | 50 | var dataDeferred = selector.getData($("#selector-image-multiple-images")); 51 | 52 | waitsFor(function() { 53 | return dataDeferred.state() === 'resolved'; 54 | }, "wait for data extraction", 5000); 55 | 56 | runs(function () { 57 | dataDeferred.done(function(data) { 58 | expect(data).toEqual([ 59 | { 60 | 'img-src': "http://aa/" 61 | }, 62 | { 63 | 'img-src': "http://bb/" 64 | } 65 | ]); 66 | }); 67 | }); 68 | }); 69 | 70 | it("should return only src column", function () { 71 | var selector = new Selector({ 72 | id: 'id', 73 | type: 'SelectorImage', 74 | multiple: true, 75 | selector: "img" 76 | }); 77 | 78 | var columns = selector.getDataColumns(); 79 | expect(columns).toEqual(['id-src']); 80 | }); 81 | 82 | it("should return empty array when 
no images are found", function () { 83 | var selector = new Selector({ 84 | id: 'img', 85 | type: 'SelectorImage', 86 | multiple: true, 87 | selector: "img.not-exist" 88 | }); 89 | 90 | var dataDeferred = selector.getData($("#not-exist")); 91 | 92 | waitsFor(function() { 93 | return dataDeferred.state() === 'resolved'; 94 | }, "wait for data extraction", 5000); 95 | 96 | runs(function () { 97 | dataDeferred.done(function(data) { 98 | expect(data).toEqual([]); 99 | }); 100 | }); 101 | }); 102 | 103 | it("should be able to download image as base64", function() { 104 | 105 | var deferredImage = SelectorImage.downloadImageBase64("../docs/images/chrome-store-logo.png"); 106 | 107 | waitsFor(function() { 108 | return deferredImage.state() === 'resolved'; 109 | }, "wait for data extraction", 5000); 110 | 111 | runs(function () { 112 | deferredImage.done(function(imageResponse) { 113 | expect(imageResponse.imageBase64.length > 100).toEqual(true); 114 | }); 115 | }); 116 | }); 117 | 118 | it("should be able to get data with image data attached", function() { 119 | 120 | $el.append(''); 121 | 122 | var selector = new Selector({ 123 | id: 'img', 124 | type: 'SelectorImage', 125 | multiple: true, 126 | selector: "img", 127 | downloadImage: true 128 | }); 129 | var deferredData = selector.getData($el[0]); 130 | 131 | waitsFor(function() { 132 | return deferredData.state() === 'resolved'; 133 | }, "wait for data extraction", 5000); 134 | 135 | runs(function () { 136 | deferredData.done(function(data) { 137 | expect(!!data[0]['_imageBase64-img']).toEqual(true); 138 | expect(!!data[0]['_imageMimeType-img']).toEqual(true); 139 | }); 140 | }); 141 | }); 142 | }); 143 | -------------------------------------------------------------------------------- /docs/Selectors/Link selector.md: -------------------------------------------------------------------------------- 1 | # Link selector 2 | 3 | Link selector is used for link selection and website navigation. 
If you use 4 | *Link selector* without any child selectors then it will extract the link and 5 | the href attribute of the link. If you add child selectors to *Link selector* 6 | then these child selectors will be used in the page that this link was leading 7 | to. If you are selecting multiple links then check the *multiple* property. 8 | 9 | Note! Link selector works only with `<a>` tags with `href` attribute. If the 10 | link selector is not working for you then you can try these workarounds: 11 | 12 | 1. Check that the link in the URL bar changes after clicking an item (changes 13 | only after the hash sign don't count). If the link doesn't change then the site 14 | is probably using AJAX for data loading. Instead of using link selector you 15 | should use [Element click selector][element-click]. 16 | 2. If the site opens a popup then you should use 17 | [Link popup selector][link-popup]. 18 | 3. The site might be using JavaScript `window.location` to change the URL. Web 19 | Scraper cannot handle this kind of navigation right now. 20 | 21 | ## Configuration options 22 | 23 | * selector - [CSS selector][css-selector] for the link element from which the 24 | link for navigation will be extracted. 25 | * multiple - multiple records are being extracted. Usually should be checked. 26 | * delay - delay the extraction 27 | 28 | ## Use cases 29 | 30 | **Navigate through multiple levels of navigation** 31 | 32 | For example, an e-commerce site has multi-level navigation - 33 | `categories -> subcategories`. To scrape data from all categories and 34 | subcategories you can create two *Link selectors*. One selector would select 35 | category links and the other selector would select subcategory links that are 36 | available in the category pages. The subcategory *Link selector* should be made 37 | a child of the category *Link selector*. The selectors for data extraction 38 | from subcategory pages should be made child selectors of the subcategory 39 | selector. 
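
The parent/child arrangement described above can be sketched as a sitemap object. This is only an illustration under stated assumptions: the start URL, the selector ids, and the CSS selectors are placeholders, the `parentSelectors` field and the `_root` id are assumed names for how a selector refers to its parent, and `pathToRoot` is a hypothetical helper written for this example rather than part of the extension.

```javascript
// Hypothetical sitemap for the `categories -> subcategories` case.
// URL, ids and CSS selectors are placeholders; `parentSelectors` and
// `_root` are assumed field/id names, not confirmed by this document.
var sitemap = {
    _id: "example-shop",
    startUrl: "http://example.com/",
    selectors: [
        {
            // Selects category links on the start page.
            id: "category",
            type: "SelectorLink",
            parentSelectors: ["_root"],
            selector: "ul.categories a",
            multiple: true,
            delay: 0
        },
        {
            // Selects subcategory links on each category page.
            id: "subcategory",
            type: "SelectorLink",
            parentSelectors: ["category"],
            selector: "ul.subcategories a",
            multiple: true,
            delay: 0
        },
        {
            // Extracts data on each subcategory page.
            id: "title",
            type: "SelectorText",
            parentSelectors: ["subcategory"],
            selector: "h2.product-title",
            multiple: true,
            delay: 0
        }
    ]
};

// Hypothetical helper: follows the first parent of each selector up to the
// root, which shows the navigation path a data selector depends on.
function pathToRoot(sitemap, id) {
    var path = [id];
    var current = id;
    while (current !== "_root") {
        var matches = sitemap.selectors.filter(function (s) {
            return s.id === current;
        });
        if (matches.length === 0) {
            throw new Error("Unknown selector id: " + current);
        }
        current = matches[0].parentSelectors[0];
        path.unshift(current);
    }
    return path;
}

console.log(pathToRoot(sitemap, "title").join(" -> "));
```

The helper only follows the first parent, so it is not suitable for a selector that lists itself as its first parent, as a recursive pagination selector might.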
40 | 41 | ![Fig. 1: Multiple link selectors for category navigation][multiple-level-link-selectors] 42 | 43 | **Handle pagination** 44 | 45 | For example, an e-commerce site has multiple categories. Each category has a 46 | list of items and pagination links. Also some pages are not directly available 47 | from the category but are available from pagination pages (you can see 48 | pagination links 1-5, but not 6-8). You can start by building a sitemap that 49 | visits each category and extracts items from the category page. This sitemap will 50 | extract items only from the first pagination page. To extract items from all of 51 | the pagination links, including the ones that are not visible at the beginning, 52 | you need to create another *Link selector* that selects the pagination links. 53 | Figure 2 shows how the link selector should be created in the sitemap. When 54 | the scraper opens a category link it will extract items that are available in 55 | the page. After that it will find the pagination links and also visit those. If 56 | the pagination link selector is made a child of itself it will recursively 57 | discover all pagination pages. Figure 3 shows a selector graph where you can 58 | see how pagination links discover more pagination links and more data. 59 | 60 | ![Fig. 2: Sitemap with Link selector for pagination][pagination-link-selectors] 61 | ![Fig. 
3: Selector graph with pagination][pagination-selector-graph] 62 | 63 | [multiple-level-link-selectors]: ../images/selectors/link/multiple-level-link-selectors.png?raw=true 64 | [pagination-link-selectors]: ../images/selectors/link/pagination-link-selectors.png?raw=true 65 | [pagination-selector-graph]: ../images/selectors/link/pagination-selector-graph.png?raw=true 66 | [element-click]: Element%20click%20selector.md 67 | [link-popup]: Link%20popup%20selector.md 68 | [css-selector]: ../CSS%20selector.md -------------------------------------------------------------------------------- /docs/Selectors/Text selector.md: -------------------------------------------------------------------------------- 1 | # Text selector 2 | 3 | Text selector is used for text selection. The text selector will extract text 4 | from the selected element and from all its child elements. HTML will be 5 | stripped and only text will be returned. Selector will ignore text within 6 | `