├── .codeclimate.yml
├── .gitignore
├── .travis.yml
├── LICENSE.txt
├── README.md
├── docs
│   ├── advanced.md
│   ├── api.md
│   ├── index.md
│   └── tutorial.md
├── examples
│   └── github.py
├── livescrape.py
├── mkdocs.yml
├── requirements.txt
├── setup.cfg
├── setup.py
├── test-requirements.txt
├── test.py
└── tox.ini

/.codeclimate.yml:
--------------------------------------------------------------------------------
1 | languages:
2 |   Ruby: true
3 |   JavaScript: true
4 |   PHP: true
5 |   Python: true
6 | exclude_paths:
7 | - "test.py"
8 | 
--------------------------------------------------------------------------------
/.gitignore:
--------------------------------------------------------------------------------
1 | __pycache__
2 | build
3 | dist
4 | *.py[cod]
5 | *.egg-info
6 | .tox/
7 | .coverage
8 | htmlcov
9 | site/
--------------------------------------------------------------------------------
/.travis.yml:
--------------------------------------------------------------------------------
1 | language: python
2 | python:
3 |   - "2.6"
4 |   - "2.7"
5 |   - "3.2"
6 |   - "3.3"
7 |   - "3.4"
8 |   - "3.5"
9 |   # - "nightly" Fails unpredictably
10 |   # - "pypy" Can't install lxml
11 |   # - "pypy3" Can't install lxml
12 | install:
13 |   - pip install .
14 |   - pip install -r requirements.txt
15 |   - pip install -r test-requirements.txt
16 | script: nosetests
--------------------------------------------------------------------------------
/LICENSE.txt:
--------------------------------------------------------------------------------
1 | The MIT License (MIT)
2 | Copyright (c) 2016 Koert van der Veer
3 | 
4 | Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files (the "Software"), to deal in the Software without restriction, including without limitation the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, and to permit persons to whom the Software is furnished to do so, subject to the following conditions:
5 | 
6 | The above copyright notice and this permission notice shall be included in all copies or substantial portions of the Software.
7 | 
8 | THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.
--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
1 | [![Build Status](https://travis-ci.org/ondergetekende/livescrape.png?branch=master)](https://travis-ci.org/ondergetekende/livescrape)
2 | [![PyPI version](https://badge.fury.io/py/livescrape.svg)](https://pypi.python.org/pypi/livescrape)
3 | [![Documentation](https://readthedocs.org/projects/livescrape/badge)](https://livescrape.readthedocs.org/en/latest/)
4 | 
5 | Introduction
6 | ============
7 | 
8 | `livescrape` is a tool for building pythonic web scrapers. Unlike other scrapers, it focuses on exposing the scraped site to your application in a semantic way. It allows you to define page objects, specifying the information to be extracted and how to navigate to other page objects.
9 | 
10 | While other scraping libraries would mostly be used in batch jobs, `livescrape` is intended to be used in the main application.
11 | 
12 | Example
13 | =======
14 | 
15 | For a more complete example, I recommend you check out the [Tutorial](docs/tutorial.md), but here's a quick primer using GitHub.
16 | 
17 |     from livescrape import ScrapedPage, Css, CssMulti
18 | 
19 |     class GithubProjectPage(ScrapedPage):
20 |         scrape_url = "https://github.com/%(username)s/%(projectname)s/"
21 |         scrape_args = ("username", "projectname")
22 | 
23 |         description = Css(".repository-meta-content",
24 |                           cleanup=lambda desc: desc.strip())
25 |         contents = Css('.js-directory-link', multiple=True)
26 |         table_contents = CssMulti(
27 |             'tr.js-navigation-item',
28 |             name=Css("td.content a"),
29 |             message=Css("td.message a"),
30 |             age=Css("td.age time", attribute="datetime"),
31 |         )
32 | 
33 |     project_page = GithubProjectPage("ondergetekende", "livescrape")
34 |     print(project_page.description)
35 |     # Prints out the description for this project
36 | 
37 |     print(project_page.contents)
38 |     # Prints the filenames in the root of the repository
39 | 
40 |     print(project_page.table_contents)
41 |     # Prints information for all files in the root of the repository
42 | 
--------------------------------------------------------------------------------
/docs/advanced.md:
--------------------------------------------------------------------------------
1 | # Advanced document retrieval
2 | 
3 | ## Introduction
4 | 
5 | Livescrape comes with simple `requests`-based document retrieval, implemented in `scrape_fetch`. The default implementation retrieves documents with a plain GET request, using a shared session (`livescrape.SHARED_SESSION`).
6 | 
7 | Sometimes this is not enough. You may need authentication or POST requests, or you may want to add some caching to reduce the load on the remote server.
8 | 
9 | # Recipes
10 | 
11 | In your `ScrapedPage`-derived class, you can define your own `scrape_fetch` function, which does whatever you need it to do. It should return the page contents as a unicode string.
12 | 
13 | ## Post requests
14 | 
15 | Implementations of `scrape_fetch` are not limited to GET requests; you can also use POST. In that case, you probably need more arguments than just those used in the URL template. You can define additional scrape arguments (along with any default values). They will end up in `scrape_args` after the `ScrapedPage` is constructed.
16 | 
17 |     import livescrape
18 | 
19 |     class MyCustomScrapedPage(livescrape.ScrapedPage):
20 |         scrape_url = "http://example.net/search"
21 |         scrape_args = ['query']
22 | 
23 |         def scrape_fetch(self, url):
24 |             query = self.scrape_args['query']
25 |             response = self.scrape_session.post(url, data={'q': query})
26 |             return response.text
27 | 
28 | ## Authentication
29 | 
30 | Many sites require some sort of authentication. The easiest approach is to request the target page, detect whether you need to log in, log in if needed, and try again. Using a persistent session is necessary to keep the number of logins down.
31 | 
32 |     import livescrape
33 | 
34 |     class MyCustomScrapedPage(livescrape.ScrapedPage):
35 |         def scrape_fetch(self, url):
36 |             response = self.scrape_session.get(url, allow_redirects=False)
37 |             if (response.status_code == 302 or  # redirect to login page
38 |                     response.status_code == 403 or  # HTTP forbidden
39 |                     "Anonymous user" in response.text):  # Something in the page
40 |                 self.scrape_session.post("http://example.net/login",
41 |                                          data={"username": "JohnDoe",
42 |                                                "password": "hunter2"})
43 |                 response = self.scrape_session.get(url, allow_redirects=False)
44 |             return response.text
45 | 
46 | 
47 | You may also need to add things like CSRF token retrieval, but that is left as an exercise for the reader.
48 | 
49 | ## Caching
50 | 
51 | When you're using a site as an API, you typically repeat the same request over and over again. Adding some sort of caching may reduce the impact on the remote site. In this example I'll use the Django caching framework, but this could easily be adapted for other caching methods.
52 | 
53 |     import livescrape
54 |     from django.core.cache import cache
55 | 
56 |     class MyCustomScrapedPage(livescrape.ScrapedPage):
57 |         def scrape_fetch(self, url):
58 |             cache_key = "some_prefix:" + url
59 |             page = cache.get(cache_key)
60 |             if not page:
61 |                 page = super(MyCustomScrapedPage, self).scrape_fetch(url)
62 |                 cache.set(cache_key, page, 3600)
63 |             return page
64 | 
65 | ## Local documents
66 | 
67 | When you have a local copy of a site (say, you downloaded an archive), you don't need the requests library; you just need to turn the URL into a filename and read the file from disk. Remember that you need to perform the unicode decoding as well.
68 | 
69 |     import livescrape, os
70 | 
71 |     class MyCustomScrapedPage(livescrape.ScrapedPage):
72 |         def scrape_fetch(self, url):
73 |             filename = url.split('/')[-1]
74 |             with open(os.path.join("my_archive", filename), "rb") as f:
75 |                 page = f.read()
76 |             return page.decode('utf8')
77 | 
--------------------------------------------------------------------------------
/docs/api.md:
--------------------------------------------------------------------------------
1 | # API documentation
2 | 
3 | ## ScrapedPage
4 | 
5 | Under normal circumstances, you'd derive a class from `ScrapedPage`. `ScrapedPage` converts any `ScrapedAttribute`s to properties which perform the actual scraping.
6 | 
7 | ### scrape_url
8 | 
9 | The URL for the scraped page. It can contain named percent-style formatting placeholders, e.g. `http://localhost:8000/%(directory)s/%(filename)s?q=%(querystring)s`. You will need to pre-encode any parameters you pass in; there is no automatic encoding of parameters.
10 | 
11 | ### scrape_args
12 | 
13 | Names of the positional arguments. After construction, this is replaced by a dictionary with all of the provided values.
14 | 
15 | ### scrape_arg_defaults
16 | 
17 | Default values for any arguments that are not provided.
18 | 
19 | ### scrape_headers
20 | 
21 | Defines additional headers to be sent. You may, for example, want to modify the user agent by adding `scrape_headers = {'User-Agent': 'My fake browser'}` to your `ScrapedPage` definition.
22 | 
23 | ### scrape_fetch(self, url)
24 | 
25 | Fetches the raw HTML for the `ScrapedPage`. In many cases, you'll want to customize your HTTP access, for example to add caching, throttling or authentication. To do that, override the `scrape_fetch(url)` method and add your logic there. You'll need to return a unicode string with the raw HTML; any character-set decoding should already have been applied.
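
As an illustration, here is a minimal throttling sketch. The page class, URL and one-second delay are made up for this example; only `scrape_url`, `scrape_args` and the `scrape_fetch` override itself are livescrape API.

```python
import time

import livescrape

_LAST_FETCH = 0.0  # timestamp of the most recent request, module-wide


class ThrottledPage(livescrape.ScrapedPage):
    scrape_url = "http://example.net/item/%(item_id)s"
    scrape_args = ["item_id"]

    def scrape_fetch(self, url):
        global _LAST_FETCH
        # Sleep until at least one second has passed since the last request,
        # then fall back to the default requests-based implementation.
        wait = _LAST_FETCH + 1.0 - time.time()
        if wait > 0:
            time.sleep(wait)
        _LAST_FETCH = time.time()
        return super(ThrottledPage, self).scrape_fetch(url)
```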
26 | 
27 | ### scrape_create_document(self, raw_html)
28 | 
29 | Creates an lxml document from the raw HTML. Sometimes your document isn't actually HTML; it may have been encoded in some other form. In that case, you can override this method.
30 | 
31 | ### _dict
32 | 
33 | A property which returns all of the defined scrape properties in dictionary form.
34 | 
35 | ### scrape_keys
36 | 
37 | A list of all the scrapable attributes. Automatically generated.
38 | 
39 | ### scrape_session
40 | 
41 | By default this property returns the session shared by all the `ScrapedPage` classes, but individual classes can override it according to their needs. The original shared session can be found in `livescrape.SHARED_SESSION`.
42 | 
43 | ## ScrapedAttribute(extract=None, cleanup=None, attribute=None, multiple=False)
44 | 
45 | This is the base class for all scraped attributes.
46 | 
47 | When `extract` is provided, it is used to extract data from an element that has been selected. The signature is `extract(element)`.
48 | 
49 | When `attribute` is provided, the data is extracted from the named attribute instead of the element's text. It cannot be used together with `extract`.
50 | 
51 | When `cleanup` is provided, it is called on the extracted data just before it is returned. The signature is `cleanup(extracted_data)`.
52 | 
53 | When `multiple` is provided, not just the first matching element is converted, but all of them. As a result, the attribute returns an iterable, even when only one element is selected. `cleanup` and `extract` are still applied per element.
54 | 
55 | ### extract(self, element, scraped_page)
56 | 
57 | Pulls data from the provided element. This can be overridden to create attributes which aren't simply based on the element text or attribute. For one-off jobs, you may want to use the `extract=` argument in the constructor.
58 | 
59 | ### cleanup(self, extracted_data, element)
60 | 
61 | Converts the extracted data into usable data. This can be overridden to create type-specific extractors. For one-off jobs, you may want to use the `cleanup=` argument in the constructor.
62 | 
63 | ### @decorator
64 | `ScrapedAttribute` can be used as a decorator. The decorated function acts as both cleanup and extract. Its signature is `attribute_name(self, value, element)`.
65 | 
66 | ## Specialized selectors
67 | ### Css(selector, ...)
68 | 
69 | Pulls data from the document using a `selector`, and returns it as a string.
70 | Supports all additional constructor arguments defined by `ScrapedAttribute`. If you pass an empty string, the entire document will be used (useful in combination with `CssGroup`).
71 | 
72 | ### CssFloat(selector, ...)
73 | 
74 | Pulls data from the document using a CSS selector, and converts it to a floating point number. Supports all additional constructor arguments defined by `ScrapedAttribute`.
75 | 
76 | ### CssInt(selector, ...)
77 | 
78 | Pulls data from the document using a CSS selector, and converts it to an integer. Supports all additional constructor arguments defined by `ScrapedAttribute`.
79 | 
80 | ### CssDate(selector, date_format, ...)
81 | 
82 | Pulls data from the document using a CSS selector, and converts it to a datetime object, using the second parameter (`date_format`). Supports all additional constructor arguments defined by `ScrapedAttribute`.
83 | 
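For example, a hypothetical blog page could declare the following (the site and selector are made up; only the `CssDate` call itself reflects the API):

```python
from livescrape import ScrapedPage, CssDate

class BlogPost(ScrapedPage):
    scrape_url = "http://example.net/posts/%(slug)s"
    scrape_args = ["slug"]

    # Text such as "2016-04-23" becomes datetime.datetime(2016, 4, 23, 0, 0)
    published = CssDate(".post .published", "%Y-%m-%d")
```
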
84 | ### CssBoolean(selector)
85 | 
86 | Pulls data from the document using a CSS selector, and returns `True` if a matching element was found. Supports none of the additional constructor arguments defined by `ScrapedAttribute`.
87 | 
88 | ### CssRaw(selector, ...)
89 | 
90 | Pulls data from the document using a CSS selector, and returns the element's content as raw HTML. Note that this HTML has been fixed up by lxml, and may differ from the HTML in the original document. Supports all additional constructor arguments defined by `ScrapedAttribute`, except `extract`.
91 | 
92 | ### CssGroup(selector)
93 | 
94 | Groups together several attributes, which all operate on the same element. Especially useful when used with `multiple=True`. Adding a sub-attribute is done by assigning a `ScrapedAttribute` to an attribute of the `CssGroup`, like this:
95 | 
96 | ```python
97 | from livescrape import ScrapedPage, CssGroup, Css
98 | 
99 | class SomePage(ScrapedPage):
100 |     user = CssGroup(".user")
101 |     user.name = Css("a.username")  # Effectively finds ".user a.username"
102 |     user.rank = Css("a.userrank")
103 | ```
104 | 
105 | ### CssMulti(selector, attr1=..., attr2=...)
106 | 
107 | Finds a list of elements in the document, and for each element applies additional `ScrapedAttribute`s to build a dictionary. The additional attributes are provided as keyword arguments to the constructor. Supports none of the additional constructor arguments defined by `ScrapedAttribute`.
108 | 
109 | Deprecated in favor of `CssGroup`.
110 | 
111 | ### CssLink(selector, page_type, referer=True, ...)
112 | 
113 | Finds links in the document using `selector`, and returns a new `ScrapedPage` for the link target. `page_type` is the type of `ScrapedPage` to be instantiated. You may pass the target type by name, to break circular dependencies. Supports all additional constructor arguments defined by `ScrapedAttribute`, although by default it defines `attribute='href'` for obvious reasons.
114 | 
115 | If `referer` is `True`, the `Referer` header is set automatically. You can also set it to a custom URL, or to `False` (for no `Referer` header).
116 | 
117 | ## SHARED_SESSION
118 | 
119 | All of the `ScrapedPage` descendants share a [requests](http://docs.python-requests.org/) session. In the classes, this is exposed as an overridable `scrape_session` property. It may be tempting to change things in the shared session, such as the user agent; however, as with any global variable, this is a bad idea. Libraries using livescrape may depend on the default values, and may break when you change them. If you need a custom session, it is best to override the `scrape_session` property to provide your own.
120 | 
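As an illustration, a minimal sketch of such an override (the session configuration and URL are made up for the example):

```python
import requests

import livescrape

MY_SESSION = requests.Session()
MY_SESSION.headers["User-Agent"] = "my-application/1.0"


class MyPage(livescrape.ScrapedPage):
    scrape_url = "http://example.net/"

    @property
    def scrape_session(self):
        # Use this module's own session instead of livescrape.SHARED_SESSION.
        return MY_SESSION
```
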
121 | Forward compatibility
122 | =====================
123 | 
124 | This project uses [semantic versioning](http://semver.org/). Names starting with `scrape_` or an underscore (`_`) are reserved. You should not define attributes with those names.
125 | 
--------------------------------------------------------------------------------
/docs/index.md:
--------------------------------------------------------------------------------
1 | Introduction
2 | ============
3 | 
4 | `livescrape` is a tool for building pythonic web scrapers. Unlike other scrapers, it focuses on exposing the scraped site to your application in a semantic way. It allows you to define page objects, specifying the information to be extracted and how to navigate to other page objects.
5 | 
6 | While other scraping libraries would mostly be used in batch jobs, `livescrape` is intended to be used in the main application. `livescrape` turns the human interface into an API.
7 | 
8 | Example
9 | =======
10 | 
11 | For a more complete example, I recommend you check out the [Tutorial](tutorial.md), but here's a quick primer using GitHub.
12 | 
13 |     from livescrape import ScrapedPage, Css, CssGroup
14 | 
15 |     class GithubProjectPage(ScrapedPage):
16 |         scrape_url = "https://github.com/%(username)s/%(projectname)s/"
17 |         scrape_args = ("username", "projectname")
18 | 
19 |         description = Css(".repository-meta-content",
20 |                           cleanup=lambda desc: desc.strip())
21 |         contents = Css('.js-directory-link', multiple=True)
22 |         table_contents = CssGroup('tr.js-navigation-item', multiple=True)
23 |         table_contents.name = Css("td.content a")
24 |         table_contents.message = Css("td.message a")
25 |         table_contents.age = Css("td.age time", attribute="datetime")
26 | 
27 |     project_page = GithubProjectPage("ondergetekende", "livescrape")
28 | 
29 |     print(project_page.description)  # Prints the description for this project
30 | 
31 |     print(project_page.contents)  # Prints the filenames in the repository root
32 | 
--------------------------------------------------------------------------------
/docs/tutorial.md:
--------------------------------------------------------------------------------
1 | Defining a scraper
2 | ==================
3 | 
4 | Defining a scraper is similar to defining a model in Django. In my example I'll use GitHub. GitHub has an API which would be far more suitable for any real use, but it's a well-known site, which makes the examples easier to follow.
5 | 
6 | Let's start by defining GitHub's repository page:
7 | 
8 |     from livescrape import ScrapedPage, Css
9 | 
10 |     class GithubProjectPage(ScrapedPage):
11 |         scrape_url = "https://github.com/python/cpython/"
12 | 
13 |         description = Css(".repository-meta-content")
14 | 
15 |     page = GithubProjectPage()
16 |     print(page.description)
17 |     # will output the description
18 |     print(page._dict)
19 |     # will output {"description": "<the description>"}
20 | 
21 | 
22 | That's nice and all, but we don't want to address just the project page for the cpython mirror; we want any project page. We can do this by adding string formatting parameters to `scrape_url`.
23 | 
24 |     from livescrape import ScrapedPage, Css
25 | 
26 |     class GithubProjectPage(ScrapedPage):
27 |         scrape_url = "https://github.com/%(username)s/%(projectname)s/"
28 | 
29 |         description = Css(".repository-meta-content")
30 | 
31 |     page = GithubProjectPage(username="python", projectname="cpython")
32 |     print(page.description)
33 |     # will output the description
34 | 
35 | You can avoid using keyword arguments by defining `scrape_args` like this:
36 | 
37 |     from livescrape import ScrapedPage, Css
38 | 
39 |     class GithubProjectPage(ScrapedPage):
40 |         scrape_url = "https://github.com/%(username)s/%(projectname)s/"
41 |         scrape_args = ("username", "projectname")
42 | 
43 |         description = Css(".repository-meta-content")
44 | 
45 |     page = GithubProjectPage("python", "cpython")
46 |     print(page.description)
47 |     # will output the description
48 | 
49 | 
50 | Cleaning up the data
51 | ====================
52 | 
53 | Now when you run the previous example, you may notice that the description is padded with a lot of whitespace. We really don't want that, so we can pass in a cleanup function with the `cleanup=` keyword argument. Its signature is `cleanup(extracted_data)`. In this example I'll use a lambda.
54 | 
55 |     from livescrape import ScrapedPage, Css
56 | 
57 |     class GithubProjectPage(ScrapedPage):
58 |         scrape_url = "https://github.com/%(username)s/%(projectname)s/"
59 |         scrape_args = ("username", "projectname")
60 | 
61 |         description = Css(".repository-meta-content",
62 |                           cleanup=lambda value: value.strip())
63 | 
64 |     page = GithubProjectPage("python", "cpython")
65 |     print(page.description)
66 |     # will output the description
67 | 
68 | By default, data is extracted by taking the text contents of the element. Sometimes, however, the data you need is in an attribute. In that case, you can provide the `attribute=` keyword argument:
69 | 
70 |     from livescrape import ScrapedPage, Css
71 | 
72 |     class GithubProjectPage(ScrapedPage):
73 |         scrape_url = "https://github.com/%(username)s/%(projectname)s/"
74 |         scrape_args = ("username", "projectname")
75 | 
76 |         git_repo = Css("input.input-monospace", attribute="value")
77 | 
78 |     page = GithubProjectPage("python", "cpython")
79 |     print(page.git_repo)
80 | 
81 | If the data you're after is even more complicated (e.g. a combination of elements), you may want to perform the extraction yourself, by providing an extractor function with the `extract=` argument. Its signature is `extract(element)`. The extracted data will be passed into the cleanup chain unmodified, which means you're not limited to strings.
82 | 
83 |     from livescrape import ScrapedPage, Css
84 | 
85 |     class GithubProjectPage(ScrapedPage):
86 |         scrape_url = "https://github.com/%(username)s/%(projectname)s/"
87 |         scrape_args = ("username", "projectname")
88 | 
89 |         git_repo = Css("input.input-monospace",
90 |                        extract=lambda elem: {"repo": elem.get("value")})
91 | 
92 |     page = GithubProjectPage("python", "cpython")
93 |     print(page.git_repo)
94 | 
95 | While lambdas are nice for simple conversions, sometimes you'll need to do something more complicated. A lambda would be too cramped for that. In that case, it may be useful to declare the cleanup function using the decorator syntax. The signature of the decorated function is `attribute_name(self, extracted_data, element)`.
96 | 
97 |     from livescrape import ScrapedPage, Css
98 | 
99 |     class GithubProjectPage(ScrapedPage):
100 |         scrape_url = "https://github.com/%(username)s/%(projectname)s/"
101 |         scrape_args = ("username", "projectname")
102 | 
103 |         @Css(".repository-meta-content")
104 |         def description(self, value, element):
105 |             return value.strip()
106 | 
107 |     page = GithubProjectPage("python", "cpython")
108 |     print(page.description)
109 |     # will output the description
110 | 
111 | For some common datatypes, there are special `Css` selectors: `CssInt`, `CssFloat`, `CssDate`, `CssRaw` (for raw HTML) and `CssBoolean` (testing whether some selector is present).
112 | 
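For example, a page definition could combine them like this (the selectors and date format below are illustrative and may not match GitHub's current markup; the `CssInt` and `CssDate` usage is what matters):

    from livescrape import ScrapedPage, CssInt, CssDate

    class GithubProjectPage(ScrapedPage):
        scrape_url = "https://github.com/%(username)s/%(projectname)s/"
        scrape_args = ("username", "projectname")

        # cleanup= runs before the integer conversion, so we can strip
        # whitespace and thousands separators here.
        star_count = CssInt(".social-count",
                            cleanup=lambda text: text.strip().replace(",", ""))
        last_update = CssDate("relative-time", "%Y-%m-%dT%H:%M:%SZ",
                              attribute="datetime")
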
113 | List data
114 | =========
115 | 
116 | Normally when scraping, only the first matching element is used, but sometimes you'll want to go over lists of things. To do so, specify the `multiple` argument. In this example, `contents` will produce the names of all the root directories in the project.
117 | 
118 |     from livescrape import ScrapedPage, Css
119 | 
120 |     class GithubProjectPage(ScrapedPage):
121 |         scrape_url = "https://github.com/%(username)s/%(projectname)s/"
122 |         scrape_args = ("username", "projectname")
123 | 
124 |         contents = Css('.js-directory-link', multiple=True)
125 | 
126 | Note that cleanup code runs per list item, not on the list as a whole.
127 | 
128 | Tabular data
129 | ============
130 | 
131 | If you need more than one datum per list item, you will need to use `CssGroup`. You can provide the additional selectors by assigning them to attributes of the group. It will produce an object for each of the matched rows.
132 | 
133 |     from livescrape import ScrapedPage, Css, CssGroup
134 | 
135 |     class GithubProjectPage(ScrapedPage):
136 |         scrape_url = "https://github.com/%(username)s/%(projectname)s/"
137 |         scrape_args = ("username", "projectname")
138 | 
139 |         table_contents = CssGroup('tr.js-navigation-item', multiple=True)
140 | 
141 |         table_contents.name = Css("td.content a")
142 |         table_contents.message = Css("td.message a")
143 |         table_contents.age = Css("td.age time", attribute="datetime")
144 | 
145 | Note that cleanup code runs per list item, not on the list as a whole.
146 | 
147 | Links
148 | =====
149 | 
150 | Websites typically have links, which you'll want to follow. The `CssLink` selector helps you by allowing you to specify which `ScrapedPage` should handle the target of that link. In the following example, we're reusing one of the `GithubProjectPage` definitions above.
151 | 
152 |     from livescrape import ScrapedPage, CssLink
153 | 
154 |     class GithubOverview(ScrapedPage):
155 |         scrape_url = "https://github.com/%(username)s"
156 |         scrape_args = ("username",)
157 | 
158 |         repos = CssLink(".repo-list-name a", GithubProjectPage, multiple=True)
159 | 
160 | You could now type `GithubOverview("python").repos[0].description` to retrieve the description of the first repository on the overview page.
--------------------------------------------------------------------------------
/examples/github.py:
--------------------------------------------------------------------------------
1 | from livescrape import ScrapedPage, Css, CssGroup, CssLink
2 | 
3 | 
4 | class GithubProjectPage(ScrapedPage):
5 |     scrape_url = "https://github.com/%(username)s/%(projectname)s/"
6 |     scrape_args = ("username", "projectname")
7 | 
8 |     description = Css(".repository-meta-content",
9 |                       cleanup=lambda desc: desc.strip())
10 |     contents = CssLink('.js-directory-link', "GithubOverview", multiple=True)
11 |     table_contents = CssGroup('tr.js-navigation-item', multiple=True)
12 |     table_contents.name = Css("td.content a")
13 |     table_contents.message = Css("td.message a")
14 |     table_contents.age = Css("td.age time", attribute="datetime")
15 | 
16 | 
17 | class GithubOverview(ScrapedPage):
18 |     scrape_url = "https://github.com/%(username)s"
19 |     scrape_args = ("username",)
20 | 
21 |     repos = CssLink(".repo-list-name a", GithubProjectPage, multiple=True)
22 | 
23 | 
24 | if __name__ == '__main__':
25 |     cpython = GithubOverview(username="python")
26 |     print(cpython.repos[0].description)
27 |     print(cpython.repos[0].description)
--------------------------------------------------------------------------------
/livescrape.py:
--------------------------------------------------------------------------------
1 | from abc import abstractmethod
2 | import cgi
3 | import datetime
4 | try:
5 |     import urlparse  # python2
6 | except ImportError:  # pragma: no cover
7 |     import urllib.parse as urlparse
8 | import warnings
9 | 
10 | import lxml.etree
11 | import lxml.html
12 | import requests
13 | import six
14 | 
15 | 
16 | SHARED_SESSION = requests.Session()
17 | SHARED_SESSION.headers['User-Agent'] = "Mozilla/5.0 (Livescrape)"
18 | 
19 | 
20 | class ScrapedAttribute(object):
21 |     """Base class for scraped attributes.
22 | 
23 |     When used in a ScrapedPage, it will be converted into an attribute.
24 | """ 25 | 26 | def __init__(self, extract=None, cleanup=None, attribute=None, 27 | multiple=False): 28 | if extract and attribute: 29 | raise ValueError("extract and attribute are mututally exclusive") 30 | 31 | self._cleanup = cleanup 32 | self._extract = extract 33 | self.attribute = attribute 34 | self.multiple = multiple 35 | 36 | # Placeholder for cleanup method when using the decorator syntax 37 | self._cleanup_method = None 38 | 39 | @abstractmethod 40 | def get(self, doc, scraped_page): # pragma: no cover 41 | raise NotImplementedError() 42 | 43 | def extract(self, element, scraped_page): 44 | if self._extract: 45 | value = self._extract(element) 46 | elif self.attribute is None: 47 | value = element.text_content() 48 | else: 49 | value = element.get(self.attribute) 50 | if value is None: 51 | return 52 | 53 | # In python2, lxml returns str if only ascii characters are used. 54 | # This leads to inconsistent return types, so in that case, we convert 55 | # to unicode (which should be a semantic no-op) 56 | if six.PY2 and isinstance(value, str): # pragma: no cover 57 | value = six.text_type(value) 58 | 59 | return self.perform_cleanups(value, element, scraped_page) 60 | 61 | def perform_cleanups(self, value, element, scraped_page=None): 62 | if self._cleanup: 63 | value = self._cleanup(value) 64 | 65 | if self._cleanup_method: 66 | value = self._cleanup_method(scraped_page, value, element) 67 | 68 | return self.cleanup(value, element, scraped_page) 69 | 70 | def cleanup(self, value, elements, scraped_page=None): 71 | return value 72 | 73 | def __call__(self, func): 74 | self._cleanup_method = func 75 | return self 76 | 77 | _SCRAPER_CLASSES = {} 78 | 79 | 80 | class _ScrapedMeta(type): 81 | """A metaclass for Scraped. 82 | 83 | Converts any ScrapedAttribute attributes to usable properties. 
84 | """ 85 | def __new__(cls, name, bases, namespace): 86 | keys = [] 87 | for key, value in namespace.items(): 88 | if isinstance(value, ScrapedAttribute): 89 | def mk_attribute(selector): 90 | def method(scraped): 91 | return scraped._get_value(selector) 92 | return property(method) 93 | 94 | namespace[key] = mk_attribute(value) 95 | keys.append(key) 96 | 97 | result = super(_ScrapedMeta, cls).__new__(cls, name, bases, namespace) 98 | result.scrape_keys = keys 99 | _SCRAPER_CLASSES[name] = result 100 | return result 101 | 102 | 103 | @six.add_metaclass(_ScrapedMeta) 104 | class ScrapedPage(object): 105 | _scrape_doc = None 106 | scrape_url = None 107 | scrape_args = [] 108 | scrape_arg_defaults = {} 109 | scrape_headers = {} 110 | 111 | def __init__(self, *pargs, **kwargs): 112 | scrape_url = kwargs.pop("scrape_url", None) 113 | scrape_referer = kwargs.pop("scrape_referer", None) 114 | 115 | arguments = dict(self.scrape_arg_defaults) 116 | arguments.update(kwargs) 117 | arguments.update(zip(self.scrape_args, pargs)) 118 | self.scrape_args = arguments 119 | 120 | if scrape_url: 121 | self.scrape_url = scrape_url 122 | elif not self.scrape_url: 123 | # We can't scrape if we don't actually have a url configured 124 | raise ValueError("%s.scrape_url needs to be defined" % 125 | type(self).__name__) 126 | else: 127 | self.scrape_url = self.scrape_url % arguments 128 | 129 | self.scrape_headers = dict(self.scrape_headers) 130 | if scrape_referer: 131 | self.scrape_headers['Referer'] = scrape_referer 132 | 133 | @property 134 | def scrape_session(self): 135 | return SHARED_SESSION 136 | 137 | def scrape_fetch(self, url): 138 | return self.scrape_session.get(url, 139 | headers=self.scrape_headers).text 140 | 141 | def scrape_create_document(self, page): 142 | return lxml.html.fromstring(page) 143 | 144 | def _get_value(self, property_scraper): 145 | if self._scrape_doc is None: 146 | page = self.scrape_fetch(self.scrape_url) 147 | self._scrape_doc = self.scrape_create_document(page) 148 | 149 | return property_scraper.get(self._scrape_doc, scraped_page=self) 150 | 151 | @property 152 | def _dict(self): 153 | return dict((key, getattr(self, key)) for key in self.scrape_keys) 154 | 155 | def __repr__(self): 156 | return "%s(scrape_url=%r)" % (type(self).__name__, self.scrape_url) 157 | 158 | 159 | class Css(ScrapedAttribute): 160 | def __init__(self, selector, **kwargs): 161 | self.selector = selector 162 | assert selector or not self.multiple, "Empty selectors are only "\ 163 | "with singular matches" 164 | 165 | super(Css, self).__init__(**kwargs) 166 | 167 | def get(self, doc, scraped_page): 168 | assert doc is not None 169 | if not self.selector: 170 | return doc 171 | 172 | elements = doc.cssselect(self.selector) 173 | 174 | if self.multiple: 175 | values = [self.extract(element, scraped_page) 176 | for element in elements] 177 | return [v for v in values if v is not None] 178 | elif len(elements): 179 | return self.extract(elements[0], scraped_page) 180 | 181 | 182 | class CssFloat(Css): 183 | def cleanup(self, value, elements, scraped_page=None): 184 | try: 185 | return float(value) 186 | except ValueError: 187 | return None 188 | 189 | 190 | class CssInt(Css): 191 | def cleanup(self, value, elements, scraped_page=None): 192 | try: 193 | return int(value) 194 | except ValueError: 195 | return None 196 | 197 | 198 | class CssDate(Css): 199 | def __init__(self, selector, date_format, tzinfo=None, **kwargs): 200 | self.date_format = date_format 201 | self.tzinfo = tzinfo 202 | super(CssDate, 
self).__init__(selector, **kwargs) 203 | 204 | def cleanup(self, value, elements, scraped_page=None): 205 | try: 206 | result = datetime.datetime.strptime(value, self.date_format) 207 | if self.tzinfo: 208 | result = result.replace(tzinfo=self.tzinfo) 209 | return result 210 | except ValueError: 211 | return None 212 | 213 | 214 | class CssBoolean(Css): 215 | def cleanup(self, value, elements, scraped_page=None): 216 | return True 217 | 218 | 219 | class CssRaw(Css): 220 | def __init__(self, selector, include_tag=False, **kwargs): 221 | self.include_tag = include_tag 222 | super(CssRaw, self).__init__(selector, **kwargs) 223 | 224 | def extract(self, element, scraped_page): 225 | if self.include_tag: 226 | value = lxml.html.tostring(element, encoding="unicode") 227 | else: 228 | value = six.text_type("") 229 | if element.text: 230 | value = cgi.escape(element.text) 231 | for child in element: 232 | value += lxml.html.tostring(child, encoding="unicode") 233 | 234 | return self.perform_cleanups(value, element, scraped_page) 235 | 236 | 237 | class CssMulti(Css): 238 | def __init__(self, selector, cleanup=None, **subselectors): 239 | super(CssMulti, self).__init__(selector, cleanup=cleanup, 240 | multiple=True) 241 | self.subselectors = subselectors 242 | warnings.warn( 243 | "The 'CssMulti' class was deprecated in favor of CssGroup", 244 | DeprecationWarning) 245 | 246 | def extract(self, element, scraped_page=None): 247 | value = {} 248 | 249 | for key, selector in self.subselectors.items(): 250 | value[key] = selector.get(element, 251 | scraped_page=scraped_page) 252 | 253 | return self.perform_cleanups(value, element, scraped_page) 254 | 255 | 256 | class CssGroup(Css): 257 | class _CompoundAttribute(object): 258 | def __init__(self, parent, element, scraped_page): 259 | self._subselectors = parent._subselectors 260 | self._element = element 261 | self._scaped_page = scraped_page 262 | 263 | def __getattr__(self, attribute): 264 | try: 265 | selector = self._subselectors[attribute] 266 | except KeyError: 267 | return getattr(super(CssGroup._CompoundAttribute, self), 268 | attribute) 269 | 270 | return selector.get(self._element, self._scaped_page) 271 | 272 | def __getitem__(self, attribute): 273 | # May raise keyerror, which is suitable for __getitem__ 274 | selector = self._subselectors[attribute] 275 | return selector.get(self._element, self._scaped_page) 276 | 277 | def __dir__(self): 278 | attrs = dir(super(CssGroup._CompoundAttribute, self)) 279 | attrs += self._subselectors.keys() 280 | return attrs 281 | 282 | def _dict(self): 283 | return dict( 284 | (key, selector.get(self._element, self._scaped_page)) 285 | for (key, selector) in self._subselectors.items()) 286 | 287 | def __init__(self, *pargs, **kwargs): 288 | super(CssGroup, self).__init__(*pargs, **kwargs) 289 | self._subselectors = {} 290 | 291 | def extract(self, element, scraped_page=None): 292 | value = CssGroup._CompoundAttribute(self, element, scraped_page) 293 | return self.perform_cleanups(value, element, scraped_page) 294 | 295 | def __setattr__(self, key, value): 296 | if isinstance(value, ScrapedAttribute): 297 | self._subselectors[key] = value 298 | else: 299 | super(CssGroup, self).__setattr__(key, value) 300 | 301 | 302 | class CssLink(Css): 303 | def __init__(self, selector, page_factory, referer=True, **kwargs): 304 | kwargs.setdefault('attribute', 'href') 305 | super(CssLink, self).__init__(selector, **kwargs) 306 | self.page_factory = page_factory 307 | self.referer = referer 308 | 309 | def cleanup(self, 
value, elements, scraped_page=None):
310 |         url = urlparse.urljoin(scraped_page.scrape_url, value)
311 |         factory = (_SCRAPER_CLASSES[self.page_factory]
312 |                    if isinstance(self.page_factory, six.string_types)
313 |                    else self.page_factory)
314 | 
315 |         if self.referer is True:  # automatic referer
316 |             referer = scraped_page.scrape_url
317 |         elif not self.referer:
318 |             referer = None
319 |         else:
320 |             referer = self.referer
321 | 
322 |         return factory(scrape_url=url, scrape_referer=referer)
323 | 
--------------------------------------------------------------------------------
/mkdocs.yml:
--------------------------------------------------------------------------------
1 | site_name: Livescrape documentation
2 | repo_url: https://github.com/ondergetekende/livescrape
3 | pages:
4 | - 'Introduction': 'index.md'
5 | - 'Tutorial': 'tutorial.md'
6 | - 'API documentation': 'api.md'
7 | - 'Advanced document retrieval': 'advanced.md'
8 | 
9 | 
10 | 
--------------------------------------------------------------------------------
/requirements.txt:
--------------------------------------------------------------------------------
1 | cssselect
2 | lxml
3 | requests
4 | six
--------------------------------------------------------------------------------
/setup.cfg:
--------------------------------------------------------------------------------
1 | [metadata]
2 | description-file = README.md
--------------------------------------------------------------------------------
/setup.py:
--------------------------------------------------------------------------------
1 | from setuptools import setup
2 | 
3 | setup(
4 |     name='livescrape',
5 |     version='0.9.8',
6 |     url='https://github.com/ondergetekende/python-livescrape',
7 |     description='A toolkit to build pythonic web scraper libraries',
8 |     author='Koert van der Veer',
9 |     author_email='koert@ondergetekende.nl',
10 |     py_modules=["livescrape"],
11 |     install_requires=["lxml", "requests", "cssselect", "six"],
12 |     classifiers=[
13 |         'Intended Audience :: Developers',
14 |         'Operating System :: OS Independent',
15 |         'Programming Language :: Python',
16 |         'Programming Language :: Python :: 3',
17 |         'Programming Language :: Python :: 3.2',
18 |         'Programming Language :: Python :: 3.3',
19 |         'Programming Language :: Python :: 3.4',
20 |         'Programming Language :: Python :: 3.5',
21 |         'Programming Language :: Python :: 2.7',
22 |     ],
23 | )
24 | 
--------------------------------------------------------------------------------
/test-requirements.txt:
--------------------------------------------------------------------------------
1 | pep8
2 | flake8
3 | hacking
4 | coverage
5 | unittest2
6 | responses
--------------------------------------------------------------------------------
/test.py:
--------------------------------------------------------------------------------
1 | import datetime
2 | import re
3 | 
4 | import responses
5 | import unittest2 as unittest
6 | 
7 | import livescrape
8 | 
9 | 
10 | class BasePage(livescrape.ScrapedPage):
11 |     scrape_url = "http://fake-host/test.html"
12 | 
13 | 
14 | class Test(unittest.TestCase):
15 |     def setUp(self):
16 |         responses.reset()
17 |         responses.add(
18 |             responses.GET, BasePage.scrape_url,
19 |             """
20 | 

Heading

21 |

15

22 | 3.14 23 | 42 24 | 25 | 2016-04-23 26 | link 27 | 28 | test123 29 | testmore 30 | 31 | 32 |
keyvalue
key2value2
tr") 151 | foo_withtag = livescrape.CssRaw("table>tr", include_tag=True) 152 | 153 | x = Page() 154 | html = re.sub(r'\s+', " ", x.foo).strip() 155 | self.assertEqual(html, "test123 key testmore value") 156 | 157 | html = re.sub(r'\s+', " ", x.foo_withtag).strip() 158 | self.assertEqual( 159 | html, 160 | "test123 key testmore value") 161 | 162 | def test_complex(self): 163 | class Page(BasePage): 164 | foo = livescrape.CssMulti( 165 | "table tr", 166 | key=livescrape.Css("th"), 167 | value=livescrape.Css("td")) 168 | 169 | x = Page() 170 | 171 | self.assertEqual(x.foo, [{"key": "key", "value": "value"}, 172 | {"key": "key2", "value": "value2"}]) 173 | 174 | def test_group(self): 175 | class Page(BasePage): 176 | foo = livescrape.CssGroup("table tr", multiple=True) 177 | foo.key = livescrape.Css("th") 178 | foo.value = livescrape.Css("td") 179 | 180 | x = Page() 181 | 182 | self.assertEqual(x.foo[0]["key"], "key") 183 | self.assertEqual(x.foo[0]["value"], "value") 184 | self.assertEqual(x.foo[1]["key"], "key2") 185 | self.assertEqual(x.foo[1]["value"], "value2") 186 | 187 | self.assertEqual(x.foo[0].key, "key") 188 | self.assertEqual(x.foo[0].value, "value") 189 | self.assertEqual(x.foo[1].key, "key2") 190 | self.assertEqual(x.foo[1].value, "value2") 191 | self.assertEqual(x.foo[0]._dict(), 192 | {"key": "key", "value": "value"}) 193 | 194 | # List members, but filter private ones 195 | self.assertEqual([att for att in dir(x.foo[1]) 196 | if att[0] != "_"], 197 | ["key", "value"]) 198 | 199 | with self.assertRaises(AttributeError): 200 | x.foo[0].nonexistent 201 | 202 | def test_cleanup(self): 203 | cleanup_args = [None] 204 | 205 | def cleanup(x): 206 | self.assertIsNone(cleanup_args[0]) 207 | cleanup_args[0] = x 208 | return "TESTed" 209 | 210 | class Page(BasePage): 211 | foo = livescrape.Css("h1.foo", 212 | cleanup=cleanup) 213 | 214 | x = Page() 215 | 216 | self.assertEqual(x.foo, "TESTed") 217 | self.assertEqual(cleanup_args[0], "Heading") 218 | 219 | def test_extract(self): 220 | extract_args = [None] 221 | 222 | def extract(x): 223 | self.assertIsNone(extract_args[0]) 224 | extract_args[0] = x 225 | return "TESTed" 226 | 227 | class Page(BasePage): 228 | foo = livescrape.Css("h1.foo", 229 | extract=extract) 230 | 231 | x = Page() 232 | 233 | self.assertEqual(x.foo, "TESTed") 234 | self.assertEqual(extract_args[0].text, "Heading") 235 | 236 | def test_cleanup_extract(self): 237 | cleanup_args = [None] 238 | extract_args = [None] 239 | 240 | def cleanup(x): 241 | self.assertIsNone(cleanup_args[0]) 242 | cleanup_args[0] = x 243 | return "TESTed" 244 | 245 | def extract(x): 246 | self.assertIsNone(extract_args[0]) 247 | extract_args[0] = x 248 | return "Xtracted" 249 | 250 | class Page(BasePage): 251 | foo = livescrape.Css("h1.foo", 252 | cleanup=cleanup, 253 | extract=extract) 254 | 255 | x = Page() 256 | value = x.foo 257 | 258 | self.assertEqual(extract_args[0].text, "Heading") 259 | self.assertEqual(cleanup_args[0], "Xtracted") 260 | self.assertEqual(value, "TESTed") 261 | 262 | def test_decorator(self): 263 | cleanup_args = [None] 264 | extract_args = [None] 265 | method_args = [None] 266 | 267 | def cleanup(x): 268 | self.assertIsNone(cleanup_args[0]) 269 | cleanup_args[0] = x 270 | return "TESTed" 271 | 272 | def extract(x): 273 | self.assertIsNone(extract_args[0]) 274 | extract_args[0] = x 275 | return "Xtracted" 276 | 277 | class Page(BasePage): 278 | @livescrape.Css("h1.foo", 279 | cleanup=cleanup, 280 | extract=extract) 281 | def foo(self, value, element): 282 | 
method_args[0] = (value, element) 283 | return "METhod" 284 | 285 | x = Page() 286 | value = x.foo 287 | 288 | self.assertEqual(extract_args[0].text, "Heading") 289 | self.assertEqual(cleanup_args[0], "Xtracted") 290 | self.assertEqual(method_args[0][0], "TESTed") 291 | self.assertEqual(method_args[0][1].text, "Heading") 292 | self.assertEqual(value, "METhod") 293 | 294 | def test_headers(self): 295 | class Page(BasePage): 296 | scrape_headers = {"foo": "bar"} 297 | 298 | Page().scrape_fetch(BasePage.scrape_url) 299 | self.assertEqual(len(responses.calls), 1) 300 | self.assertEqual(responses.calls[0].request.headers['Foo'], 301 | 'bar') 302 | 303 | def test_referer(self): 304 | class Page(BasePage): 305 | foo = livescrape.CssLink("a", "Page") 306 | 307 | x = Page() 308 | 309 | self.assertIsInstance(x.foo, Page) 310 | 311 | responses.add( 312 | responses.GET, x.foo.scrape_url, 313 | "") 314 | 315 | self.assertIsNone(x.foo.foo) 316 | 317 | self.assertEqual(len(responses.calls), 2) 318 | self.assertEqual(responses.calls[1].request.headers['Referer'], 319 | 'http://fake-host/test.html') 320 | 321 | def test_custom_referer(self): 322 | class Page(BasePage): 323 | foo = livescrape.CssLink("a", "Page", referer="http://no") 324 | 325 | x = Page() 326 | 327 | self.assertIsInstance(x.foo, Page) 328 | 329 | responses.add( 330 | responses.GET, x.foo.scrape_url, 331 | "") 332 | 333 | self.assertIsNone(x.foo.foo) 334 | 335 | self.assertEqual(len(responses.calls), 2) 336 | self.assertEqual(responses.calls[1].request.headers['Referer'], 337 | 'http://no') 338 | 339 | def test_no_referer(self): 340 | class Page(BasePage): 341 | foo = livescrape.CssLink("a", "Page", referer=False) 342 | 343 | x = Page() 344 | 345 | self.assertIsInstance(x.foo, Page) 346 | 347 | responses.add( 348 | responses.GET, x.foo.scrape_url, 349 | "") 350 | 351 | self.assertIsNone(x.foo.foo) 352 | 353 | self.assertEqual(len(responses.calls), 2) 354 | self.assertNotIn("Referer", responses.calls[1].request.headers) 355 | 356 | if __name__ == '__main__': 357 | unittest.main() 358 | -------------------------------------------------------------------------------- /tox.ini: -------------------------------------------------------------------------------- 1 | [tox] 2 | envlist = py27,py34,py35,pep8,coverage 3 | skipsdist = True 4 | 5 | [testenv] 6 | deps = -r{toxinidir}/test-requirements.txt 7 | -r{toxinidir}/requirements.txt 8 | ; setenv = 9 | ; PYTHONPATH = {toxinidir}:{toxinidir} 10 | commands = python test.py 11 | 12 | [testenv:pep8] 13 | deps = -r{toxinidir}/test-requirements.txt 14 | commands = flake8 {posargs} 15 | 16 | [testenv:coverage] 17 | deps = -r{toxinidir}/test-requirements.txt 18 | -r{toxinidir}/requirements.txt 19 | commands = 20 | coverage run --branch --omit={envdir}/*,examples/*.py,test.py test.py 21 | coverage html 22 | coverage report --skip-covered --fail-under 95 --show-missing 23 | 24 | [flake8] 25 | ignore = H101,H301,H302,H238 26 | show-source = True 27 | --------------------------------------------------------------------------------