├── .codeclimate.yml
├── .gitignore
├── .travis.yml
├── LICENSE.txt
├── README.md
├── docs
│   ├── advanced.md
│   ├── api.md
│   ├── index.md
│   └── tutorial.md
├── examples
│   └── github.py
├── livescrape.py
├── mkdocs.yml
├── requirements.txt
├── setup.cfg
├── setup.py
├── test-requirements.txt
├── test.py
└── tox.ini

/.codeclimate.yml:
--------------------------------------------------------------------------------
1 | languages:
2 |   Ruby: true
3 |   JavaScript: true
4 |   PHP: true
5 |   Python: true
6 | exclude_paths:
7 | - "test.py"
8 | 
--------------------------------------------------------------------------------
/.gitignore:
--------------------------------------------------------------------------------
1 | __pycache__
2 | build
3 | dist
4 | *.py[cod]
5 | *.egg-info
6 | .tox/
7 | .coverage
8 | htmlcov
9 | site/
--------------------------------------------------------------------------------
/.travis.yml:
--------------------------------------------------------------------------------
1 | language: python
2 | python:
3 |   - "2.6"
4 |   - "2.7"
5 |   - "3.2"
6 |   - "3.3"
7 |   - "3.4"
8 |   - "3.5"
9 |   # - "nightly" Fails unpredictably
10 |   # - "pypy" Can't install lxml
11 |   # - "pypy3" Can't install lxml
12 | install:
13 |   - pip install .
14 |   - pip install -r requirements.txt
15 |   - pip install -r test-requirements.txt
16 | script: nosetests
--------------------------------------------------------------------------------
/LICENSE.txt:
--------------------------------------------------------------------------------
1 | The MIT License (MIT)
2 | Copyright (c) 2016 Koert van der Veer
3 | 
4 | Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files (the "Software"), to deal in the Software without restriction, including without limitation the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, and to permit persons to whom the Software is furnished to do so, subject to the following conditions:
5 | 
6 | The above copyright notice and this permission notice shall be included in all copies or substantial portions of the Software.
7 | 
8 | THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.
--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
1 | [![Build Status](https://travis-ci.org/ondergetekende/livescrape.png?branch=master)](https://travis-ci.org/ondergetekende/livescrape)
2 | [![PyPI version](https://badge.fury.io/py/livescrape.svg)](https://pypi.python.org/pypi/livescrape)
3 | [![Documentation](https://readthedocs.org/projects/livescrape/badge)](https://livescrape.readthedocs.org/en/latest/)
4 | 
5 | Introduction
6 | ============
7 | 
8 | `livescrape` is a tool for building pythonic web scrapers. Unlike other scrapers, it focuses on exposing the scraped site to your application in a semantic way. It allows you to define page objects, specifying the information to be extracted and how to navigate to other page objects.
9 | 
10 | While other scraping libraries would mostly be used in batch jobs, `livescrape` is intended to be used in the main application.
11 | 
12 | Example
13 | =======
14 | 
15 | For a more complete example, I recommend you check out the [Tutorial](docs/tutorial.md), but here's a quick primer using GitHub.
16 | 
17 |     from livescrape import ScrapedPage, Css, CssMulti
18 | 
19 |     class GithubProjectPage(ScrapedPage):
20 |         scrape_url = "https://github.com/%(username)s/%(projectname)s/"
21 |         scrape_args = ("username", "projectname")
22 | 
23 |         description = Css(".repository-meta-content",
24 |                           cleanup=lambda desc: desc.strip())
25 |         contents = Css('.js-directory-link', multiple=True)
26 |         table_contents = CssMulti(
27 |             'tr.js-navigation-item',
28 |             name=Css("td.content a"),
29 |             message=Css("td.message a"),
30 |             age=Css("td.age time", attribute="datetime"),
31 |         )
32 | 
33 |     project_page = GithubProjectPage("ondergetekende", "livescrape")
34 |     print(project_page.description)
35 |     # Prints out the description for this project
36 | 
37 |     print(project_page.contents)
38 |     # Prints the filenames in the root of the repository
39 | 
40 |     print(project_page.table_contents)
41 |     # Prints information for all files in the root of the repository
42 | 
--------------------------------------------------------------------------------
/docs/advanced.md:
--------------------------------------------------------------------------------
1 | # Advanced document retrieval
2 | 
3 | ## Introduction
4 | 
5 | Livescrape comes with simple `requests`-based document retrieval, implemented in `scrape_fetch`. The default implementation retrieves documents with a plain GET request, using a shared session (`livescrape.SHARED_SESSION`).
6 | 
7 | Sometimes this is not enough. You may need authentication or POST requests, or you may want to add some caching to reduce the load on the remote server.
8 | 
9 | # Recipes
10 | 
11 | In your `ScrapedPage`-derived class, you can define your own `scrape_fetch` function, which does whatever you need it to do. It should return the page contents as a unicode string.
12 | 
13 | ## Post requests
14 | 
15 | Implementations of `scrape_fetch` are not limited to GET requests; you can also use POST. In that case, you probably need more arguments than just those used in the URL template. You can define additional scrape arguments (along with any default values). They will end up in `scrape_args` after the `ScrapedPage` is constructed.
16 | 
17 |     import livescrape
18 | 
19 |     class MyCustomScrapedPage(livescrape.ScrapedPage):
20 |         scrape_url = "http://example.net/search"
21 |         scrape_args = ['query']
22 | 
23 |         def scrape_fetch(self, url):
24 |             query = self.scrape_args['query']
25 |             response = self.scrape_session.post(url, data={'q': query})
26 |             return response.text
27 | 
28 | ## Authentication
29 | 
30 | Many sites require some sort of authentication. The easiest approach is to request the target page, detect whether you need to log in, log in if needed, and try again. Using a persistent session is necessary to keep the number of logins down.
31 | 
32 |     import livescrape
33 | 
34 |     class MyCustomScrapedPage(livescrape.ScrapedPage):
35 |         def scrape_fetch(self, url):
36 |             response = self.scrape_session.get(url, allow_redirects=False)
37 |             if (response.status_code == 302 or  # redirect to login page
38 |                     response.status_code == 403 or  # HTTP forbidden
39 |                     "Anonymous user" in response.text):  # Something in the page
40 |                 self.scrape_session.post("http://example.net/login",
41 |                                          data={"username": "JohnDoe",
42 |                                                "password": "hunter2"})
43 |                 response = self.scrape_session.get(url, allow_redirects=False)
44 |             return response.text
45 | 
46 | 
47 | You may also need to add things like CSRF token retrieval, but that is left as an exercise for the reader.
48 | 
49 | ## Caching
50 | 
51 | When you're using a site as an API, you typically repeat the same request over and over again. Adding some sort of caching may reduce the impact on the remote site. In this example I'll use the Django caching framework, but this could easily be adapted for other caching methods.
52 | 
53 |     import livescrape
54 |     from django.core.cache import cache
55 | 
56 |     class MyCustomScrapedPage(livescrape.ScrapedPage):
57 |         def scrape_fetch(self, url):
58 |             cache_key = "some_prefix:" + url
59 |             page = cache.get(cache_key)
60 |             if not page:
61 |                 page = super(MyCustomScrapedPage, self).scrape_fetch(url)
62 |                 cache.set(cache_key, page, 3600)
63 |             return page
64 | 
65 | ## Local documents
66 | 
67 | When you have a local copy of a site (say, you downloaded an archive), you don't need the requests library; you just need to turn the URL into a filename and read the file from disk. Remember that you need to perform the unicode decoding as well.
68 | 
69 |     import livescrape, os
70 | 
71 |     class MyCustomScrapedPage(livescrape.ScrapedPage):
72 |         def scrape_fetch(self, url):
73 |             filename = url.split('/')[-1]
74 |             with open(os.path.join("my_archive", filename), "rb") as f:
75 |                 page = f.read()
76 |             return page.decode('utf8')
77 | 
--------------------------------------------------------------------------------
/docs/api.md:
--------------------------------------------------------------------------------
1 | # API documentation
2 | 
3 | ## ScrapedPage
4 | 
5 | Under normal circumstances, you'd derive a class from `ScrapedPage`. `ScrapedPage` converts any `ScrapedAttribute`s to properties which perform the actual scraping.
6 | 
7 | ### scrape_url
8 | 
9 | The URL for the scraped page. It can contain named percent-style formatting placeholders, e.g. `http://localhost:8000/%(directory)s/%(filename)s?q=%(querystring)s`. You will need to pre-encode any parameters you pass in; there is no automatic encoding of parameters.
10 | 
11 | ### scrape_args
12 | 
13 | Names of the positional arguments. After construction, this is replaced by a dictionary with all of the provided values.
14 | 
15 | ### scrape_arg_defaults
16 | 
17 | Default values for any arguments that are not provided.
18 | 
19 | ### scrape_headers
20 | 
21 | Defines additional headers to be sent. You may, for example, want to modify the user agent by adding `scrape_headers = {'User-Agent': 'My fake browser'}` to your `ScrapedPage` definition.
22 | 
23 | ### scrape_fetch(self, url)
24 | 
25 | Fetches the raw HTML for the `ScrapedPage`. In many cases, you'll want to customize your HTTP access, for example to add caching, throttling or authentication. To do that, override the `scrape_fetch(url)` method and add your logic there. You'll need to return a unicode string with the raw HTML; any character-set decoding should already have been applied.
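
As an illustration, here is a minimal throttling sketch. The page class, URL and one-second delay are made up for this example; only `scrape_url`, `scrape_args` and the `scrape_fetch` override itself are livescrape API.

```python
import time

import livescrape

_LAST_FETCH = 0.0  # timestamp of the most recent request, module-wide


class ThrottledPage(livescrape.ScrapedPage):
    scrape_url = "http://example.net/item/%(item_id)s"
    scrape_args = ["item_id"]

    def scrape_fetch(self, url):
        global _LAST_FETCH
        # Sleep until at least one second has passed since the last request,
        # then fall back to the default requests-based implementation.
        wait = _LAST_FETCH + 1.0 - time.time()
        if wait > 0:
            time.sleep(wait)
        _LAST_FETCH = time.time()
        return super(ThrottledPage, self).scrape_fetch(url)
```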
26 | 
27 | ### scrape_create_document(self, raw_html)
28 | 
29 | Creates an lxml document from the raw HTML. Sometimes your document isn't actually HTML; it may have been encoded in some other form. In that case, you can override this method.
30 | 
31 | ### _dict
32 | 
33 | A property which returns all of the defined scrape properties in dictionary form.
34 | 
35 | ### scrape_keys
36 | 
37 | A list of all the scrapable attributes. Automatically generated.
38 | 
39 | ### scrape_session
40 | 
41 | By default this property returns the session shared by all the `ScrapedPage` classes, but individual classes can override it according to their needs. The original shared session can be found in `livescrape.SHARED_SESSION`.
42 | 
43 | ## ScrapedAttribute(extract=None, cleanup=None, attribute=None, multiple=False)
44 | 
45 | This is the base class for all scraped attributes.
46 | 
47 | When `extract` is provided, it is used to extract data from an element that has been selected. The signature is `extract(element)`.
48 | 
49 | When `attribute` is provided, the data is extracted from the named attribute instead of the element's text. It cannot be used together with `extract`.
50 | 
51 | When `cleanup` is provided, it is called on the extracted data just before it is returned. The signature is `cleanup(extracted_data)`.
52 | 
53 | When `multiple` is provided, not just the first matching element is converted, but all of them. As a result, the attribute returns an iterable, even when only one element is selected. `cleanup` and `extract` are still applied per element.
54 | 
55 | ### extract(self, element, scraped_page)
56 | 
57 | Pulls data from the provided element. This can be overridden to create attributes which aren't simply based on the element text or attribute. For one-off jobs, you may want to use the `extract=` argument in the constructor.
58 | 
59 | ### cleanup(self, extracted_data, element)
60 | 
61 | Converts the extracted data into usable data. This can be overridden to create type-specific extractors. For one-off jobs, you may want to use the `cleanup=` argument in the constructor.
62 | 
63 | ### @decorator
64 | `ScrapedAttribute` can be used as a decorator. The decorated function acts as both cleanup and extract. Its signature is `attribute_name(self, value, element)`.
65 | 
66 | ## Specialized selectors
67 | ### Css(selector, ...)
68 | 
69 | Pulls data from the document using a `selector`, and returns it as a string.
70 | Supports all additional constructor arguments defined by `ScrapedAttribute`. If you pass an empty string, the entire document will be used (useful in combination with `CssGroup`).
71 | 
72 | ### CssFloat(selector, ...)
73 | 
74 | Pulls data from the document using a CSS selector, and converts it to a floating point number. Supports all additional constructor arguments defined by `ScrapedAttribute`.
75 | 
76 | ### CssInt(selector, ...)
77 | 
78 | Pulls data from the document using a CSS selector, and converts it to an integer. Supports all additional constructor arguments defined by `ScrapedAttribute`.
79 | 
80 | ### CssDate(selector, date_format, ...)
81 | 
82 | Pulls data from the document using a CSS selector, and converts it to a datetime object, using the second parameter (`date_format`). Supports all additional constructor arguments defined by `ScrapedAttribute`.
83 | 
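For example, a hypothetical blog page could declare the following (the site and selector are made up; only the `CssDate` call itself reflects the API):

```python
from livescrape import ScrapedPage, CssDate

class BlogPost(ScrapedPage):
    scrape_url = "http://example.net/posts/%(slug)s"
    scrape_args = ["slug"]

    # Text such as "2016-04-23" becomes datetime.datetime(2016, 4, 23, 0, 0)
    published = CssDate(".post .published", "%Y-%m-%d")
```
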
84 | ### CssBoolean(selector)
85 | 
86 | Pulls data from the document using a CSS selector, and returns `True` if a matching element was found. Supports none of the additional constructor arguments defined by `ScrapedAttribute`.
87 | 
88 | ### CssRaw(selector, ...)
89 | 
90 | Pulls data from the document using a CSS selector, and returns the element's content as raw HTML. Note that this HTML has been fixed up by lxml, and may differ from the HTML in the original document. Supports all additional constructor arguments defined by `ScrapedAttribute`, except `extract`.
91 | 
92 | ### CssGroup(selector)
93 | 
94 | Groups together several attributes, which all operate on the same element. Especially useful when used with `multiple=True`. Adding a sub-attribute is done by assigning a `ScrapedAttribute` to an attribute of the `CssGroup`, like this:
95 | 
96 | ```python
97 | from livescrape import ScrapedPage, CssGroup, Css
98 | 
99 | class SomePage(ScrapedPage):
100 |     user = CssGroup(".user")
101 |     user.name = Css("a.username")  # Effectively finds ".user a.username"
102 |     user.rank = Css("a.userrank")
103 | ```
104 | 
105 | ### CssMulti(selector, attr1=..., attr2=...)
106 | 
107 | Finds a list of elements in the document, and for each element applies additional `ScrapedAttribute`s to build a dictionary. The additional attributes are provided as keyword arguments to the constructor. Supports none of the additional constructor arguments defined by `ScrapedAttribute`.
108 | 
109 | Deprecated in favor of `CssGroup`.
110 | 
111 | ### CssLink(selector, page_type, referer=True, ...)
112 | 
113 | Finds links in the document using `selector`, and returns a new `ScrapedPage` for the link target. `page_type` is the type of `ScrapedPage` to be instantiated. You may pass the target type by name, to break circular dependencies. Supports all additional constructor arguments defined by `ScrapedAttribute`, although by default it defines `attribute='href'` for obvious reasons.
114 | 
115 | If `referer` is `True`, the `Referer` header is set automatically. You can also set it to a custom URL, or to `False` (for no `Referer` header).
116 | 
117 | ## SHARED_SESSION
118 | 
119 | All of the `ScrapedPage` descendants share a [requests](http://docs.python-requests.org/) session. In the classes, this is exposed as an overridable `scrape_session` property. It may be tempting to change things in the shared session, such as the user agent; however, as with any global variable, this is a bad idea. Libraries using livescrape may depend on the default values, and may break when you change them. If you need a custom session, it is best to override the `scrape_session` property to provide your own.
120 | 
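As an illustration, a minimal sketch of such an override (the session configuration and URL are made up for the example):

```python
import requests

import livescrape

MY_SESSION = requests.Session()
MY_SESSION.headers["User-Agent"] = "my-application/1.0"


class MyPage(livescrape.ScrapedPage):
    scrape_url = "http://example.net/"

    @property
    def scrape_session(self):
        # Use this module's own session instead of livescrape.SHARED_SESSION.
        return MY_SESSION
```
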
121 | Forward compatibility
122 | =====================
123 | 
124 | This project uses [semantic versioning](http://semver.org/). Names starting with `scrape_` or an underscore (`_`) are reserved. You should not define attributes with those names.
125 | 
--------------------------------------------------------------------------------
/docs/index.md:
--------------------------------------------------------------------------------
1 | Introduction
2 | ============
3 | 
4 | `livescrape` is a tool for building pythonic web scrapers. Unlike other scrapers, it focuses on exposing the scraped site to your application in a semantic way. It allows you to define page objects, specifying the information to be extracted and how to navigate to other page objects.
5 | 
6 | While other scraping libraries would mostly be used in batch jobs, `livescrape` is intended to be used in the main application. `livescrape` turns the human interface into an API.
7 | 
8 | Example
9 | =======
10 | 
11 | For a more complete example, I recommend you check out the [Tutorial](tutorial.md), but here's a quick primer using GitHub.
12 | 
13 |     from livescrape import ScrapedPage, Css, CssGroup
14 | 
15 |     class GithubProjectPage(ScrapedPage):
16 |         scrape_url = "https://github.com/%(username)s/%(projectname)s/"
17 |         scrape_args = ("username", "projectname")
18 | 
19 |         description = Css(".repository-meta-content",
20 |                           cleanup=lambda desc: desc.strip())
21 |         contents = Css('.js-directory-link', multiple=True)
22 |         table_contents = CssGroup('tr.js-navigation-item', multiple=True)
23 |         table_contents.name = Css("td.content a")
24 |         table_contents.message = Css("td.message a")
25 |         table_contents.age = Css("td.age time", attribute="datetime")
26 | 
27 |     project_page = GithubProjectPage("ondergetekende", "livescrape")
28 | 
29 |     print(project_page.description)  # Prints the description for this project
30 | 
31 |     print(project_page.contents)  # Prints the filenames in the repository root
32 | 
--------------------------------------------------------------------------------
/docs/tutorial.md:
--------------------------------------------------------------------------------
1 | Defining a scraper
2 | ==================
3 | 
4 | Defining a scraper is similar to defining a model in Django. In my example I'll use GitHub. GitHub has an API which would be far more suitable for any real use, but it's a well-known site, which makes the examples easier to follow.
5 | 
6 | Let's start by defining GitHub's repository page:
7 | 
8 |     from livescrape import ScrapedPage, Css
9 | 
10 |     class GithubProjectPage(ScrapedPage):
11 |         scrape_url = "https://github.com/python/cpython/"
12 | 
13 |         description = Css(".repository-meta-content")
14 | 
15 |     page = GithubProjectPage()
16 |     print(page.description)
17 |     # will output the description
18 |     print(page._dict)
19 |     # will output {"description": "<the description>"}
20 | 
21 | 
22 | That's nice and all, but we don't want to address just the project page for the cpython mirror; we want any project page. We can do this by adding string formatting parameters to `scrape_url`.
23 | 
24 |     from livescrape import ScrapedPage, Css
25 | 
26 |     class GithubProjectPage(ScrapedPage):
27 |         scrape_url = "https://github.com/%(username)s/%(projectname)s/"
28 | 
29 |         description = Css(".repository-meta-content")
30 | 
31 |     page = GithubProjectPage(username="python", projectname="cpython")
32 |     print(page.description)
33 |     # will output the description
34 | 
35 | You can avoid using keyword arguments by defining `scrape_args` like this:
36 | 
37 |     from livescrape import ScrapedPage, Css
38 | 
39 |     class GithubProjectPage(ScrapedPage):
40 |         scrape_url = "https://github.com/%(username)s/%(projectname)s/"
41 |         scrape_args = ("username", "projectname")
42 | 
43 |         description = Css(".repository-meta-content")
44 | 
45 |     page = GithubProjectPage("python", "cpython")
46 |     print(page.description)
47 |     # will output the description
48 | 
49 | 
50 | Cleaning up the data
51 | ====================
52 | 
53 | Now when you run the previous example, you may notice that the description is padded with a lot of whitespace. We really don't want that, so we can pass in a cleanup function with the `cleanup=` keyword argument. Its signature is `cleanup(extracted_data)`. In this example I'll use a lambda.
54 | 
55 |     from livescrape import ScrapedPage, Css
56 | 
57 |     class GithubProjectPage(ScrapedPage):
58 |         scrape_url = "https://github.com/%(username)s/%(projectname)s/"
59 |         scrape_args = ("username", "projectname")
60 | 
61 |         description = Css(".repository-meta-content",
62 |                           cleanup=lambda value: value.strip())
63 | 
64 |     page = GithubProjectPage("python", "cpython")
65 |     print(page.description)
66 |     # will output the description
67 | 
68 | By default, data is extracted by taking the text contents of the element. Sometimes, however, the data you need is in an attribute. In that case, you can provide the `attribute=` keyword argument:
69 | 
70 |     from livescrape import ScrapedPage, Css
71 | 
72 |     class GithubProjectPage(ScrapedPage):
73 |         scrape_url = "https://github.com/%(username)s/%(projectname)s/"
74 |         scrape_args = ("username", "projectname")
75 | 
76 |         git_repo = Css("input.input-monospace", attribute="value")
77 | 
78 |     page = GithubProjectPage("python", "cpython")
79 |     print(page.git_repo)
80 | 
81 | If the data you're after is even more complicated (e.g. a combination of elements), you may want to perform the extraction yourself, by providing an extractor function with the `extract=` argument. Its signature is `extract(element)`. The extracted data will be passed into the cleanup chain unmodified, which means you're not limited to strings.
82 | 
83 |     from livescrape import ScrapedPage, Css
84 | 
85 |     class GithubProjectPage(ScrapedPage):
86 |         scrape_url = "https://github.com/%(username)s/%(projectname)s/"
87 |         scrape_args = ("username", "projectname")
88 | 
89 |         git_repo = Css("input.input-monospace",
90 |                        extract=lambda elem: {"repo": elem.get("value")})
91 | 
92 |     page = GithubProjectPage("python", "cpython")
93 |     print(page.git_repo)
94 | 
95 | While lambdas are nice for simple conversions, sometimes you'll need to do something more complicated. A lambda would be too cramped for that. In that case, it may be useful to declare the cleanup function using the decorator syntax. The signature of the decorated function is `attribute_name(self, extracted_data, element)`.
96 | 
97 |     from livescrape import ScrapedPage, Css
98 | 
99 |     class GithubProjectPage(ScrapedPage):
100 |         scrape_url = "https://github.com/%(username)s/%(projectname)s/"
101 |         scrape_args = ("username", "projectname")
102 | 
103 |         @Css(".repository-meta-content")
104 |         def description(self, value, element):
105 |             return value.strip()
106 | 
107 |     page = GithubProjectPage("python", "cpython")
108 |     print(page.description)
109 |     # will output the description
110 | 
111 | For some common datatypes, there are special `Css` selectors: `CssInt`, `CssFloat`, `CssDate`, `CssRaw` (for raw HTML) and `CssBoolean` (testing whether some selector is present).
112 | 
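For example, a page definition could combine them like this (the selectors and date format below are illustrative and may not match GitHub's current markup; the `CssInt` and `CssDate` usage is what matters):

    from livescrape import ScrapedPage, CssInt, CssDate

    class GithubProjectPage(ScrapedPage):
        scrape_url = "https://github.com/%(username)s/%(projectname)s/"
        scrape_args = ("username", "projectname")

        # cleanup= runs before the integer conversion, so we can strip
        # whitespace and thousands separators here.
        star_count = CssInt(".social-count",
                            cleanup=lambda text: text.strip().replace(",", ""))
        last_update = CssDate("relative-time", "%Y-%m-%dT%H:%M:%SZ",
                              attribute="datetime")
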
113 | List data
114 | =========
115 | 
116 | Normally when scraping, only the first matching element is used, but sometimes you'll want to go over lists of things. To do so, specify the `multiple` argument. In this example, `contents` will produce the names of all the root directories in the project.
117 | 
118 |     from livescrape import ScrapedPage, Css
119 | 
120 |     class GithubProjectPage(ScrapedPage):
121 |         scrape_url = "https://github.com/%(username)s/%(projectname)s/"
122 |         scrape_args = ("username", "projectname")
123 | 
124 |         contents = Css('.js-directory-link', multiple=True)
125 | 
126 | Note that cleanup code runs per list item, not on the list as a whole.
127 | 
128 | Tabular data
129 | ============
130 | 
131 | If you need more than one datum per list item, you will need to use `CssGroup`. You can provide the additional selectors by assigning them to attributes of the group. It will produce an object for each of the matched rows.
132 | 
133 |     from livescrape import ScrapedPage, Css, CssGroup
134 | 
135 |     class GithubProjectPage(ScrapedPage):
136 |         scrape_url = "https://github.com/%(username)s/%(projectname)s/"
137 |         scrape_args = ("username", "projectname")
138 | 
139 |         table_contents = CssGroup('tr.js-navigation-item', multiple=True)
140 | 
141 |         table_contents.name = Css("td.content a")
142 |         table_contents.message = Css("td.message a")
143 |         table_contents.age = Css("td.age time", attribute="datetime")
144 | 
145 | Note that cleanup code runs per list item, not on the list as a whole.
146 | 
147 | Links
148 | =====
149 | 
150 | Websites typically have links, which you'll want to follow. The `CssLink` selector helps you by allowing you to specify which `ScrapedPage` should handle the target of that link. In the following example, we're reusing one of the `GithubProjectPage` definitions above.
151 | 
152 |     from livescrape import ScrapedPage, CssLink
153 | 
154 |     class GithubOverview(ScrapedPage):
155 |         scrape_url = "https://github.com/%(username)s"
156 |         scrape_args = ("username",)
157 | 
158 |         repos = CssLink(".repo-list-name a", GithubProjectPage, multiple=True)
159 | 
160 | You could now type `GithubOverview("python").repos[0].description` to retrieve the description of the first repository on the overview page.
--------------------------------------------------------------------------------
/examples/github.py:
--------------------------------------------------------------------------------
1 | from livescrape import ScrapedPage, Css, CssGroup, CssLink
2 | 
3 | 
4 | class GithubProjectPage(ScrapedPage):
5 |     scrape_url = "https://github.com/%(username)s/%(projectname)s/"
6 |     scrape_args = ("username", "projectname")
7 | 
8 |     description = Css(".repository-meta-content",
9 |                       cleanup=lambda desc: desc.strip())
10 |     contents = CssLink('.js-directory-link', "GithubOverview", multiple=True)
11 |     table_contents = CssGroup('tr.js-navigation-item', multiple=True)
12 |     table_contents.name = Css("td.content a")
13 |     table_contents.message = Css("td.message a")
14 |     table_contents.age = Css("td.age time", attribute="datetime")
15 | 
16 | 
17 | class GithubOverview(ScrapedPage):
18 |     scrape_url = "https://github.com/%(username)s"
19 |     scrape_args = ("username",)
20 | 
21 |     repos = CssLink(".repo-list-name a", GithubProjectPage, multiple=True)
22 | 
23 | 
24 | if __name__ == '__main__':
25 |     cpython = GithubOverview(username="python")
26 |     print(cpython.repos[0].description)
27 |     print(cpython.repos[0].description)
--------------------------------------------------------------------------------
/livescrape.py:
--------------------------------------------------------------------------------
1 | from abc import abstractmethod
2 | import cgi
3 | import datetime
4 | try:
5 |     import urlparse  # python2
6 | except ImportError:  # pragma: no cover
7 |     import urllib.parse as urlparse
8 | import warnings
9 | 
10 | import lxml.etree
11 | import lxml.html
12 | import requests
13 | import six
14 | 
15 | 
16 | SHARED_SESSION = requests.Session()
17 | SHARED_SESSION.headers['User-Agent'] = "Mozilla/5.0 (Livescrape)"
18 | 
19 | 
20 | class ScrapedAttribute(object):
21 |     """Base class for scraped attributes.
22 | 
23 |     When used in a ScrapedPage, it will be converted into an attribute.
24 | """ 25 | 26 | def __init__(self, extract=None, cleanup=None, attribute=None, 27 | multiple=False): 28 | if extract and attribute: 29 | raise ValueError("extract and attribute are mututally exclusive") 30 | 31 | self._cleanup = cleanup 32 | self._extract = extract 33 | self.attribute = attribute 34 | self.multiple = multiple 35 | 36 | # Placeholder for cleanup method when using the decorator syntax 37 | self._cleanup_method = None 38 | 39 | @abstractmethod 40 | def get(self, doc, scraped_page): # pragma: no cover 41 | raise NotImplementedError() 42 | 43 | def extract(self, element, scraped_page): 44 | if self._extract: 45 | value = self._extract(element) 46 | elif self.attribute is None: 47 | value = element.text_content() 48 | else: 49 | value = element.get(self.attribute) 50 | if value is None: 51 | return 52 | 53 | # In python2, lxml returns str if only ascii characters are used. 54 | # This leads to inconsistent return types, so in that case, we convert 55 | # to unicode (which should be a semantic no-op) 56 | if six.PY2 and isinstance(value, str): # pragma: no cover 57 | value = six.text_type(value) 58 | 59 | return self.perform_cleanups(value, element, scraped_page) 60 | 61 | def perform_cleanups(self, value, element, scraped_page=None): 62 | if self._cleanup: 63 | value = self._cleanup(value) 64 | 65 | if self._cleanup_method: 66 | value = self._cleanup_method(scraped_page, value, element) 67 | 68 | return self.cleanup(value, element, scraped_page) 69 | 70 | def cleanup(self, value, elements, scraped_page=None): 71 | return value 72 | 73 | def __call__(self, func): 74 | self._cleanup_method = func 75 | return self 76 | 77 | _SCRAPER_CLASSES = {} 78 | 79 | 80 | class _ScrapedMeta(type): 81 | """A metaclass for Scraped. 82 | 83 | Converts any ScrapedAttribute attributes to usable properties. 
84 | """ 85 | def __new__(cls, name, bases, namespace): 86 | keys = [] 87 | for key, value in namespace.items(): 88 | if isinstance(value, ScrapedAttribute): 89 | def mk_attribute(selector): 90 | def method(scraped): 91 | return scraped._get_value(selector) 92 | return property(method) 93 | 94 | namespace[key] = mk_attribute(value) 95 | keys.append(key) 96 | 97 | result = super(_ScrapedMeta, cls).__new__(cls, name, bases, namespace) 98 | result.scrape_keys = keys 99 | _SCRAPER_CLASSES[name] = result 100 | return result 101 | 102 | 103 | @six.add_metaclass(_ScrapedMeta) 104 | class ScrapedPage(object): 105 | _scrape_doc = None 106 | scrape_url = None 107 | scrape_args = [] 108 | scrape_arg_defaults = {} 109 | scrape_headers = {} 110 | 111 | def __init__(self, *pargs, **kwargs): 112 | scrape_url = kwargs.pop("scrape_url", None) 113 | scrape_referer = kwargs.pop("scrape_referer", None) 114 | 115 | arguments = dict(self.scrape_arg_defaults) 116 | arguments.update(kwargs) 117 | arguments.update(zip(self.scrape_args, pargs)) 118 | self.scrape_args = arguments 119 | 120 | if scrape_url: 121 | self.scrape_url = scrape_url 122 | elif not self.scrape_url: 123 | # We can't scrape if we don't actually have a url configured 124 | raise ValueError("%s.scrape_url needs to be defined" % 125 | type(self).__name__) 126 | else: 127 | self.scrape_url = self.scrape_url % arguments 128 | 129 | self.scrape_headers = dict(self.scrape_headers) 130 | if scrape_referer: 131 | self.scrape_headers['Referer'] = scrape_referer 132 | 133 | @property 134 | def scrape_session(self): 135 | return SHARED_SESSION 136 | 137 | def scrape_fetch(self, url): 138 | return self.scrape_session.get(url, 139 | headers=self.scrape_headers).text 140 | 141 | def scrape_create_document(self, page): 142 | return lxml.html.fromstring(page) 143 | 144 | def _get_value(self, property_scraper): 145 | if self._scrape_doc is None: 146 | page = self.scrape_fetch(self.scrape_url) 147 | self._scrape_doc = self.scrape_create_document(page) 148 | 149 | return property_scraper.get(self._scrape_doc, scraped_page=self) 150 | 151 | @property 152 | def _dict(self): 153 | return dict((key, getattr(self, key)) for key in self.scrape_keys) 154 | 155 | def __repr__(self): 156 | return "%s(scrape_url=%r)" % (type(self).__name__, self.scrape_url) 157 | 158 | 159 | class Css(ScrapedAttribute): 160 | def __init__(self, selector, **kwargs): 161 | self.selector = selector 162 | assert selector or not self.multiple, "Empty selectors are only "\ 163 | "with singular matches" 164 | 165 | super(Css, self).__init__(**kwargs) 166 | 167 | def get(self, doc, scraped_page): 168 | assert doc is not None 169 | if not self.selector: 170 | return doc 171 | 172 | elements = doc.cssselect(self.selector) 173 | 174 | if self.multiple: 175 | values = [self.extract(element, scraped_page) 176 | for element in elements] 177 | return [v for v in values if v is not None] 178 | elif len(elements): 179 | return self.extract(elements[0], scraped_page) 180 | 181 | 182 | class CssFloat(Css): 183 | def cleanup(self, value, elements, scraped_page=None): 184 | try: 185 | return float(value) 186 | except ValueError: 187 | return None 188 | 189 | 190 | class CssInt(Css): 191 | def cleanup(self, value, elements, scraped_page=None): 192 | try: 193 | return int(value) 194 | except ValueError: 195 | return None 196 | 197 | 198 | class CssDate(Css): 199 | def __init__(self, selector, date_format, tzinfo=None, **kwargs): 200 | self.date_format = date_format 201 | self.tzinfo = tzinfo 202 | super(CssDate, 
self).__init__(selector, **kwargs) 203 | 204 | def cleanup(self, value, elements, scraped_page=None): 205 | try: 206 | result = datetime.datetime.strptime(value, self.date_format) 207 | if self.tzinfo: 208 | result = result.replace(tzinfo=self.tzinfo) 209 | return result 210 | except ValueError: 211 | return None 212 | 213 | 214 | class CssBoolean(Css): 215 | def cleanup(self, value, elements, scraped_page=None): 216 | return True 217 | 218 | 219 | class CssRaw(Css): 220 | def __init__(self, selector, include_tag=False, **kwargs): 221 | self.include_tag = include_tag 222 | super(CssRaw, self).__init__(selector, **kwargs) 223 | 224 | def extract(self, element, scraped_page): 225 | if self.include_tag: 226 | value = lxml.html.tostring(element, encoding="unicode") 227 | else: 228 | value = six.text_type("") 229 | if element.text: 230 | value = cgi.escape(element.text) 231 | for child in element: 232 | value += lxml.html.tostring(child, encoding="unicode") 233 | 234 | return self.perform_cleanups(value, element, scraped_page) 235 | 236 | 237 | class CssMulti(Css): 238 | def __init__(self, selector, cleanup=None, **subselectors): 239 | super(CssMulti, self).__init__(selector, cleanup=cleanup, 240 | multiple=True) 241 | self.subselectors = subselectors 242 | warnings.warn( 243 | "The 'CssMulti' class was deprecated in favor of CssGroup", 244 | DeprecationWarning) 245 | 246 | def extract(self, element, scraped_page=None): 247 | value = {} 248 | 249 | for key, selector in self.subselectors.items(): 250 | value[key] = selector.get(element, 251 | scraped_page=scraped_page) 252 | 253 | return self.perform_cleanups(value, element, scraped_page) 254 | 255 | 256 | class CssGroup(Css): 257 | class _CompoundAttribute(object): 258 | def __init__(self, parent, element, scraped_page): 259 | self._subselectors = parent._subselectors 260 | self._element = element 261 | self._scaped_page = scraped_page 262 | 263 | def __getattr__(self, attribute): 264 | try: 265 | selector = self._subselectors[attribute] 266 | except KeyError: 267 | return getattr(super(CssGroup._CompoundAttribute, self), 268 | attribute) 269 | 270 | return selector.get(self._element, self._scaped_page) 271 | 272 | def __getitem__(self, attribute): 273 | # May raise keyerror, which is suitable for __getitem__ 274 | selector = self._subselectors[attribute] 275 | return selector.get(self._element, self._scaped_page) 276 | 277 | def __dir__(self): 278 | attrs = dir(super(CssGroup._CompoundAttribute, self)) 279 | attrs += self._subselectors.keys() 280 | return attrs 281 | 282 | def _dict(self): 283 | return dict( 284 | (key, selector.get(self._element, self._scaped_page)) 285 | for (key, selector) in self._subselectors.items()) 286 | 287 | def __init__(self, *pargs, **kwargs): 288 | super(CssGroup, self).__init__(*pargs, **kwargs) 289 | self._subselectors = {} 290 | 291 | def extract(self, element, scraped_page=None): 292 | value = CssGroup._CompoundAttribute(self, element, scraped_page) 293 | return self.perform_cleanups(value, element, scraped_page) 294 | 295 | def __setattr__(self, key, value): 296 | if isinstance(value, ScrapedAttribute): 297 | self._subselectors[key] = value 298 | else: 299 | super(CssGroup, self).__setattr__(key, value) 300 | 301 | 302 | class CssLink(Css): 303 | def __init__(self, selector, page_factory, referer=True, **kwargs): 304 | kwargs.setdefault('attribute', 'href') 305 | super(CssLink, self).__init__(selector, **kwargs) 306 | self.page_factory = page_factory 307 | self.referer = referer 308 | 309 | def cleanup(self, 
value, elements, scraped_page=None):
310 |         url = urlparse.urljoin(scraped_page.scrape_url, value)
311 |         factory = (_SCRAPER_CLASSES[self.page_factory]
312 |                    if isinstance(self.page_factory, six.string_types)
313 |                    else self.page_factory)
314 | 
315 |         if self.referer is True:  # automatic referer
316 |             referer = scraped_page.scrape_url
317 |         elif not self.referer:
318 |             referer = None
319 |         else:
320 |             referer = self.referer
321 | 
322 |         return factory(scrape_url=url, scrape_referer=referer)
323 | 
--------------------------------------------------------------------------------
/mkdocs.yml:
--------------------------------------------------------------------------------
1 | site_name: Livescrape documentation
2 | repo_url: https://github.com/ondergetekende/livescrape
3 | pages:
4 | - 'Introduction': 'index.md'
5 | - 'Tutorial': 'tutorial.md'
6 | - 'API documentation': 'api.md'
7 | - 'Advanced document retrieval': 'advanced.md'
8 | 
9 | 
10 | 
--------------------------------------------------------------------------------
/requirements.txt:
--------------------------------------------------------------------------------
1 | cssselect
2 | lxml
3 | requests
4 | six
--------------------------------------------------------------------------------
/setup.cfg:
--------------------------------------------------------------------------------
1 | [metadata]
2 | description-file = README.md
--------------------------------------------------------------------------------
/setup.py:
--------------------------------------------------------------------------------
1 | from setuptools import setup
2 | 
3 | setup(
4 |     name='livescrape',
5 |     version='0.9.8',
6 |     url='https://github.com/ondergetekende/python-livescrape',
7 |     description='A toolkit to build pythonic web scraper libraries',
8 |     author='Koert van der Veer',
9 |     author_email='koert@ondergetekende.nl',
10 |     py_modules=["livescrape"],
11 |     install_requires=["lxml", "requests", "cssselect", "six"],
12 |     classifiers=[
13 |         'Intended Audience :: Developers',
14 |         'Operating System :: OS Independent',
15 |         'Programming Language :: Python',
16 |         'Programming Language :: Python :: 3',
17 |         'Programming Language :: Python :: 3.2',
18 |         'Programming Language :: Python :: 3.3',
19 |         'Programming Language :: Python :: 3.4',
20 |         'Programming Language :: Python :: 3.5',
21 |         'Programming Language :: Python :: 2.7',
22 |     ],
23 | )
24 | 
--------------------------------------------------------------------------------
/test-requirements.txt:
--------------------------------------------------------------------------------
1 | pep8
2 | flake8
3 | hacking
4 | coverage
5 | unittest2
6 | responses
--------------------------------------------------------------------------------
/test.py:
--------------------------------------------------------------------------------
1 | import datetime
2 | import re
3 | 
4 | import responses
5 | import unittest2 as unittest
6 | 
7 | import livescrape
8 | 
9 | 
10 | class BasePage(livescrape.ScrapedPage):
11 |     scrape_url = "http://fake-host/test.html"
12 | 
13 | 
14 | class Test(unittest.TestCase):
15 |     def setUp(self):
16 |         responses.reset()
17 |         responses.add(
18 |             responses.GET, BasePage.scrape_url,
19 |             """
20 | 

Heading

21 |

15

22 | 3.14 23 | 42 24 | 25 | 2016-04-23 26 | link 27 | 28 | test123 29 | testmore 30 | 31 | 32 |
keyvalue
key2value2
tr") 151 | foo_withtag = livescrape.CssRaw("table>tr", include_tag=True) 152 | 153 | x = Page() 154 | html = re.sub(r'\s+', " ", x.foo).strip() 155 | self.assertEqual(html, "test123 key testmore value") 156 | 157 | html = re.sub(r'\s+', " ", x.foo_withtag).strip() 158 | self.assertEqual( 159 | html, 160 | "test123 key testmore value") 161 | 162 | def test_complex(self): 163 | class Page(BasePage): 164 | foo = livescrape.CssMulti( 165 | "table tr", 166 | key=livescrape.Css("th"), 167 | value=livescrape.Css("td")) 168 | 169 | x = Page() 170 | 171 | self.assertEqual(x.foo, [{"key": "key", "value": "value"}, 172 | {"key": "key2", "value": "value2"}]) 173 | 174 | def test_group(self): 175 | class Page(BasePage): 176 | foo = livescrape.CssGroup("table tr", multiple=True) 177 | foo.key = livescrape.Css("th") 178 | foo.value = livescrape.Css("td") 179 | 180 | x = Page() 181 | 182 | self.assertEqual(x.foo[0]["key"], "key") 183 | self.assertEqual(x.foo[0]["value"], "value") 184 | self.assertEqual(x.foo[1]["key"], "key2") 185 | self.assertEqual(x.foo[1]["value"], "value2") 186 | 187 | self.assertEqual(x.foo[0].key, "key") 188 | self.assertEqual(x.foo[0].value, "value") 189 | self.assertEqual(x.foo[1].key, "key2") 190 | self.assertEqual(x.foo[1].value, "value2") 191 | self.assertEqual(x.foo[0]._dict(), 192 | {"key": "key", "value": "value"}) 193 | 194 | # List members, but filter private ones 195 | self.assertEqual([att for att in dir(x.foo[1]) 196 | if att[0] != "_"], 197 | ["key", "value"]) 198 | 199 | with self.assertRaises(AttributeError): 200 | x.foo[0].nonexistent 201 | 202 | def test_cleanup(self): 203 | cleanup_args = [None] 204 | 205 | def cleanup(x): 206 | self.assertIsNone(cleanup_args[0]) 207 | cleanup_args[0] = x 208 | return "TESTed" 209 | 210 | class Page(BasePage): 211 | foo = livescrape.Css("h1.foo", 212 | cleanup=cleanup) 213 | 214 | x = Page() 215 | 216 | self.assertEqual(x.foo, "TESTed") 217 | self.assertEqual(cleanup_args[0], "Heading") 218 | 219 | def test_extract(self): 220 | extract_args = [None] 221 | 222 | def extract(x): 223 | self.assertIsNone(extract_args[0]) 224 | extract_args[0] = x 225 | return "TESTed" 226 | 227 | class Page(BasePage): 228 | foo = livescrape.Css("h1.foo", 229 | extract=extract) 230 | 231 | x = Page() 232 | 233 | self.assertEqual(x.foo, "TESTed") 234 | self.assertEqual(extract_args[0].text, "Heading") 235 | 236 | def test_cleanup_extract(self): 237 | cleanup_args = [None] 238 | extract_args = [None] 239 | 240 | def cleanup(x): 241 | self.assertIsNone(cleanup_args[0]) 242 | cleanup_args[0] = x 243 | return "TESTed" 244 | 245 | def extract(x): 246 | self.assertIsNone(extract_args[0]) 247 | extract_args[0] = x 248 | return "Xtracted" 249 | 250 | class Page(BasePage): 251 | foo = livescrape.Css("h1.foo", 252 | cleanup=cleanup, 253 | extract=extract) 254 | 255 | x = Page() 256 | value = x.foo 257 | 258 | self.assertEqual(extract_args[0].text, "Heading") 259 | self.assertEqual(cleanup_args[0], "Xtracted") 260 | self.assertEqual(value, "TESTed") 261 | 262 | def test_decorator(self): 263 | cleanup_args = [None] 264 | extract_args = [None] 265 | method_args = [None] 266 | 267 | def cleanup(x): 268 | self.assertIsNone(cleanup_args[0]) 269 | cleanup_args[0] = x 270 | return "TESTed" 271 | 272 | def extract(x): 273 | self.assertIsNone(extract_args[0]) 274 | extract_args[0] = x 275 | return "Xtracted" 276 | 277 | class Page(BasePage): 278 | @livescrape.Css("h1.foo", 279 | cleanup=cleanup, 280 | extract=extract) 281 | def foo(self, value, element): 282 | 
method_args[0] = (value, element) 283 | return "METhod" 284 | 285 | x = Page() 286 | value = x.foo 287 | 288 | self.assertEqual(extract_args[0].text, "Heading") 289 | self.assertEqual(cleanup_args[0], "Xtracted") 290 | self.assertEqual(method_args[0][0], "TESTed") 291 | self.assertEqual(method_args[0][1].text, "Heading") 292 | self.assertEqual(value, "METhod") 293 | 294 | def test_headers(self): 295 | class Page(BasePage): 296 | scrape_headers = {"foo": "bar"} 297 | 298 | Page().scrape_fetch(BasePage.scrape_url) 299 | self.assertEqual(len(responses.calls), 1) 300 | self.assertEqual(responses.calls[0].request.headers['Foo'], 301 | 'bar') 302 | 303 | def test_referer(self): 304 | class Page(BasePage): 305 | foo = livescrape.CssLink("a", "Page") 306 | 307 | x = Page() 308 | 309 | self.assertIsInstance(x.foo, Page) 310 | 311 | responses.add( 312 | responses.GET, x.foo.scrape_url, 313 | "") 314 | 315 | self.assertIsNone(x.foo.foo) 316 | 317 | self.assertEqual(len(responses.calls), 2) 318 | self.assertEqual(responses.calls[1].request.headers['Referer'], 319 | 'http://fake-host/test.html') 320 | 321 | def test_custom_referer(self): 322 | class Page(BasePage): 323 | foo = livescrape.CssLink("a", "Page", referer="http://no") 324 | 325 | x = Page() 326 | 327 | self.assertIsInstance(x.foo, Page) 328 | 329 | responses.add( 330 | responses.GET, x.foo.scrape_url, 331 | "") 332 | 333 | self.assertIsNone(x.foo.foo) 334 | 335 | self.assertEqual(len(responses.calls), 2) 336 | self.assertEqual(responses.calls[1].request.headers['Referer'], 337 | 'http://no') 338 | 339 | def test_no_referer(self): 340 | class Page(BasePage): 341 | foo = livescrape.CssLink("a", "Page", referer=False) 342 | 343 | x = Page() 344 | 345 | self.assertIsInstance(x.foo, Page) 346 | 347 | responses.add( 348 | responses.GET, x.foo.scrape_url, 349 | "") 350 | 351 | self.assertIsNone(x.foo.foo) 352 | 353 | self.assertEqual(len(responses.calls), 2) 354 | self.assertNotIn("Referer", responses.calls[1].request.headers) 355 | 356 | if __name__ == '__main__': 357 | unittest.main() 358 | -------------------------------------------------------------------------------- /tox.ini: -------------------------------------------------------------------------------- 1 | [tox] 2 | envlist = py27,py34,py35,pep8,coverage 3 | skipsdist = True 4 | 5 | [testenv] 6 | deps = -r{toxinidir}/test-requirements.txt 7 | -r{toxinidir}/requirements.txt 8 | ; setenv = 9 | ; PYTHONPATH = {toxinidir}:{toxinidir} 10 | commands = python test.py 11 | 12 | [testenv:pep8] 13 | deps = -r{toxinidir}/test-requirements.txt 14 | commands = flake8 {posargs} 15 | 16 | [testenv:coverage] 17 | deps = -r{toxinidir}/test-requirements.txt 18 | -r{toxinidir}/requirements.txt 19 | commands = 20 | coverage run --branch --omit={envdir}/*,examples/*.py,test.py test.py 21 | coverage html 22 | coverage report --skip-covered --fail-under 95 --show-missing 23 | 24 | [flake8] 25 | ignore = H101,H301,H302,H238 26 | show-source = True 27 | --------------------------------------------------------------------------------