├── .gitignore ├── DESIGN.md ├── LICENSE ├── MANIFEST.in ├── README.md ├── docs ├── Makefile ├── api.rst ├── cache.rst ├── conf.py ├── config.rst ├── index.rst ├── install.rst ├── quickstart.rst ├── tasks.rst └── utils.rst ├── requirements.txt ├── scrapekit ├── __init__.py ├── config.py ├── core.py ├── exc.py ├── http.py ├── logs.py ├── reporting │ ├── __init__.py │ ├── db.py │ └── render.py ├── tasks.py ├── templates │ ├── index.html │ ├── layout.html │ ├── macros.html │ ├── task_run_item.html │ └── task_run_list.html └── util.py ├── setup.py └── test.py /.gitignore: -------------------------------------------------------------------------------- 1 | *.DS_Store 2 | 3 | # Byte-compiled / optimized / DLL files 4 | __pycache__/ 5 | *.py[cod] 6 | 7 | # C extensions 8 | *.so 9 | 10 | # Distribution / packaging 11 | .Python 12 | env/ 13 | bin/ 14 | build/ 15 | develop-eggs/ 16 | dist/ 17 | eggs/ 18 | lib/ 19 | lib64/ 20 | parts/ 21 | sdist/ 22 | var/ 23 | *.egg-info/ 24 | .installed.cfg 25 | *.egg 26 | 27 | # Installer logs 28 | pip-log.txt 29 | pip-delete-this-directory.txt 30 | 31 | # Unit test / coverage reports 32 | htmlcov/ 33 | .tox/ 34 | .coverage 35 | .cache 36 | nosetests.xml 37 | coverage.xml 38 | 39 | # Translations 40 | *.mo 41 | 42 | # Mr Developer 43 | .mr.developer.cfg 44 | .project 45 | .pydevproject 46 | 47 | # Rope 48 | .ropeproject 49 | 50 | # Django stuff: 51 | *.log 52 | *.pot 53 | 54 | # Sphinx documentation 55 | docs/_build/ 56 | 57 | # Scrapekit specific 58 | data/ 59 | reports/ 60 | -------------------------------------------------------------------------------- /DESIGN.md: -------------------------------------------------------------------------------- 1 | 2 | What should a typical session look like? 3 | 4 | ## Fetching stuff from the web 5 | 6 | ```python 7 | from scrapekit import config, http 8 | 9 | config.cache_forever = True 10 | config.cache_dir = '/tmp' 11 | 12 | res = http.get('http://databin.pudo.org/t/b2d9cf') 13 | ``` 14 | 15 | Good enough. This should retain a cache of the data locally, and do 16 | retrieval. 17 | 18 | Other concerns: 19 | 20 | * Rate limiting 21 | * User agent hiding, and very explicit UAs. 22 | 23 | ## Parallel processing 24 | 25 | Next up: threading. Basically a really light-weight, in-process 26 | version of celery? Perhaps with an option to go for the real thing 27 | when needed? 28 | 29 | ```python 30 | from scrapekit import processing 31 | 32 | @processing.task 33 | def scrape_page(url): 34 | pass 35 | 36 | scrape_page.queue(url) 37 | processing.init(num_threads=20) 38 | ``` 39 | 40 | Alternatively, this could support a system of pipelines like this: 41 | 42 | ```python 43 | from scrapekit import processing 44 | 45 | @processing.task 46 | def scrape_index(): 47 | for i in xrange(1000): 48 | yield 'http://example.com/%d' % i 49 | 50 | @processing.task 51 | def scrape_page(url): 52 | pass 53 | 54 | pipeline = scrape_index.pipeline() 55 | pipeline = pipeline.chain(scrape_page) 56 | pipeline.run(num_threads=20) 57 | ``` 58 | 59 | ## Logging 60 | 61 | What would really good logging for scrapers look like? 62 | 63 | * Includes context, such as the URL currently being processed 64 | * All the trivial stuff (HTTP issues, HTML/XML parsing) is handled 65 | * Logs go to a CSV file or database? Something that allows 66 | systematic analysis. 67 | * Does this actually generate nice-to-look at HTML? 
68 | * Set up sensible defaults for Requests logging 69 | 70 | ## Audits 71 | 72 | Audits are parts of a pipeline that validate the generated data against a 73 | pre-defined schema. This could be used to make sure the data meets 74 | certain expectations. 75 | 76 | ## Other functionality 77 | 78 | What else is repeated all over scrapers? 79 | 80 | * Text cleaning (remove multiple spaces, normalize). 81 | * Currency conversion and deflation 82 | * Geocoding of addresses 83 | 84 | -------------------------------------------------------------------------------- /LICENSE: -------------------------------------------------------------------------------- 1 | The MIT License (MIT) 2 | 3 | Copyright (c) 2014 Friedrich Lindenberg 4 | 5 | Permission is hereby granted, free of charge, to any person obtaining a copy 6 | of this software and associated documentation files (the "Software"), to deal 7 | in the Software without restriction, including without limitation the rights 8 | to use, copy, modify, merge, publish, distribute, sublicense, and/or sell 9 | copies of the Software, and to permit persons to whom the Software is 10 | furnished to do so, subject to the following conditions: 11 | 12 | The above copyright notice and this permission notice shall be included in all 13 | copies or substantial portions of the Software. 14 | 15 | THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR 16 | IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, 17 | FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE 18 | AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER 19 | LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, 20 | OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE 21 | SOFTWARE. 22 | -------------------------------------------------------------------------------- /MANIFEST.in: -------------------------------------------------------------------------------- 1 | include LICENSE requirements.txt README.md 2 | recursive-include scrapekit/templates * 3 | global-exclude *.pyc 4 | -------------------------------------------------------------------------------- /README.md: -------------------------------------------------------------------------------- 1 | # scrapekit 2 | 3 | Did you know the entire web was made of data? You probably did. 4 | Scrapekit helps you get that data with simple Python scripts. Based on 5 | [requests](http://docs.python-requests.org/), the library will handles 6 | caching, threading and logging. 7 | 8 | See the [full documentation](http://scrapekit.readthedocs.org/). 9 | 10 | ## Example 11 | 12 | ```python 13 | from scrapekit import Scraper 14 | 15 | scraper = Scraper('example') 16 | 17 | @scraper.task 18 | def get_index(): 19 | url = 'http://databin.pudo.org/t/b2d9cf' 20 | doc = scraper.get(url).html() 21 | for row in doc.findall('.//tr'): 22 | yield row 23 | 24 | @scraper.task 25 | def get_row(row): 26 | columns = row.findall('./td') 27 | print(columns) 28 | 29 | pipeline = get_index | get_row 30 | if __name__ == '__main__': 31 | pipeline.run() 32 | 33 | ``` 34 | 35 | ## Works well with 36 | 37 | Scrapekit doesn't aim to provide all functionality necessary for 38 | scraping. Specifically, it doesn't address HTML parsing, data storage 39 | and data validation. 
For these needs, check the following libraries: 40 | 41 | * [lxml](http://lxml.de/) for HTML/XML parsing; much faster and more 42 | flexible than [BeautifulSoup](http://www.crummy.com/software/BeautifulSoup/). 43 | * [dataset](http://dataset.rtfd.org) is a sister library of scrapekit 44 | that simplifies storing semi-structured data in SQL databases. 45 | 46 | ## Existing tools 47 | 48 | * [Scrapy](http://scrapy.org/) is a much more mature and comprehensive 49 | framework for developing scrapers. On the other hand, it requires you to 50 | develop scrapers within its class system. This can be too heavyweight 51 | for a simple script to grab data off a web site. 52 | * [scrapelib](http://scrapelib.readthedocs.org/) is a thin wrapper 53 | around requests that does throttling, retries and caching. 54 | * [MechanicalSoup](https://github.com/hickford/MechanicalSoup) binds 55 | BeautifulSoup and requests into an imperative, stateful API. 56 | 57 | ## Credits and license 58 | 59 | Scrapekit is licensed under the terms of the MIT license, which is also 60 | included in [LICENSE](LICENSE). It was developed through projects of 61 | [ICFJ](http://icfj.org), [ANCIR](http://investigativecenters.org) and 62 | [ICIJ](http://icij.org). 63 | -------------------------------------------------------------------------------- /docs/Makefile: -------------------------------------------------------------------------------- 1 | # Makefile for Sphinx documentation 2 | # 3 | 4 | # You can set these variables from the command line. 5 | SPHINXOPTS = 6 | SPHINXBUILD = sphinx-build 7 | PAPER = 8 | BUILDDIR = _build 9 | 10 | # User-friendly check for sphinx-build 11 | ifeq ($(shell which $(SPHINXBUILD) >/dev/null 2>&1; echo $$?), 1) 12 | $(error The '$(SPHINXBUILD)' command was not found. Make sure you have Sphinx installed, then set the SPHINXBUILD environment variable to point to the full path of the '$(SPHINXBUILD)' executable. Alternatively you can add the directory with the executable to your PATH. If you don't have Sphinx installed, grab it from http://sphinx-doc.org/) 13 | endif 14 | 15 | # Internal variables. 16 | PAPEROPT_a4 = -D latex_paper_size=a4 17 | PAPEROPT_letter = -D latex_paper_size=letter 18 | ALLSPHINXOPTS = -d $(BUILDDIR)/doctrees $(PAPEROPT_$(PAPER)) $(SPHINXOPTS) . 19 | # the i18n builder cannot share the environment and doctrees with the others 20 | I18NSPHINXOPTS = $(PAPEROPT_$(PAPER)) $(SPHINXOPTS) . 
21 | 22 | .PHONY: help clean html dirhtml singlehtml pickle json htmlhelp qthelp devhelp epub latex latexpdf text man changes linkcheck doctest gettext 23 | 24 | help: 25 | @echo "Please use \`make ' where is one of" 26 | @echo " html to make standalone HTML files" 27 | @echo " dirhtml to make HTML files named index.html in directories" 28 | @echo " singlehtml to make a single large HTML file" 29 | @echo " pickle to make pickle files" 30 | @echo " json to make JSON files" 31 | @echo " htmlhelp to make HTML files and a HTML help project" 32 | @echo " qthelp to make HTML files and a qthelp project" 33 | @echo " devhelp to make HTML files and a Devhelp project" 34 | @echo " epub to make an epub" 35 | @echo " latex to make LaTeX files, you can set PAPER=a4 or PAPER=letter" 36 | @echo " latexpdf to make LaTeX files and run them through pdflatex" 37 | @echo " latexpdfja to make LaTeX files and run them through platex/dvipdfmx" 38 | @echo " text to make text files" 39 | @echo " man to make manual pages" 40 | @echo " texinfo to make Texinfo files" 41 | @echo " info to make Texinfo files and run them through makeinfo" 42 | @echo " gettext to make PO message catalogs" 43 | @echo " changes to make an overview of all changed/added/deprecated items" 44 | @echo " xml to make Docutils-native XML files" 45 | @echo " pseudoxml to make pseudoxml-XML files for display purposes" 46 | @echo " linkcheck to check all external links for integrity" 47 | @echo " doctest to run all doctests embedded in the documentation (if enabled)" 48 | 49 | clean: 50 | rm -rf $(BUILDDIR)/* 51 | 52 | html: 53 | $(SPHINXBUILD) -b html $(ALLSPHINXOPTS) $(BUILDDIR)/html 54 | @echo 55 | @echo "Build finished. The HTML pages are in $(BUILDDIR)/html." 56 | 57 | dirhtml: 58 | $(SPHINXBUILD) -b dirhtml $(ALLSPHINXOPTS) $(BUILDDIR)/dirhtml 59 | @echo 60 | @echo "Build finished. The HTML pages are in $(BUILDDIR)/dirhtml." 61 | 62 | singlehtml: 63 | $(SPHINXBUILD) -b singlehtml $(ALLSPHINXOPTS) $(BUILDDIR)/singlehtml 64 | @echo 65 | @echo "Build finished. The HTML page is in $(BUILDDIR)/singlehtml." 66 | 67 | pickle: 68 | $(SPHINXBUILD) -b pickle $(ALLSPHINXOPTS) $(BUILDDIR)/pickle 69 | @echo 70 | @echo "Build finished; now you can process the pickle files." 71 | 72 | json: 73 | $(SPHINXBUILD) -b json $(ALLSPHINXOPTS) $(BUILDDIR)/json 74 | @echo 75 | @echo "Build finished; now you can process the JSON files." 76 | 77 | htmlhelp: 78 | $(SPHINXBUILD) -b htmlhelp $(ALLSPHINXOPTS) $(BUILDDIR)/htmlhelp 79 | @echo 80 | @echo "Build finished; now you can run HTML Help Workshop with the" \ 81 | ".hhp project file in $(BUILDDIR)/htmlhelp." 82 | 83 | qthelp: 84 | $(SPHINXBUILD) -b qthelp $(ALLSPHINXOPTS) $(BUILDDIR)/qthelp 85 | @echo 86 | @echo "Build finished; now you can run "qcollectiongenerator" with the" \ 87 | ".qhcp project file in $(BUILDDIR)/qthelp, like this:" 88 | @echo "# qcollectiongenerator $(BUILDDIR)/qthelp/scrapekit.qhcp" 89 | @echo "To view the help file:" 90 | @echo "# assistant -collectionFile $(BUILDDIR)/qthelp/scrapekit.qhc" 91 | 92 | devhelp: 93 | $(SPHINXBUILD) -b devhelp $(ALLSPHINXOPTS) $(BUILDDIR)/devhelp 94 | @echo 95 | @echo "Build finished." 96 | @echo "To view the help file:" 97 | @echo "# mkdir -p $$HOME/.local/share/devhelp/scrapekit" 98 | @echo "# ln -s $(BUILDDIR)/devhelp $$HOME/.local/share/devhelp/scrapekit" 99 | @echo "# devhelp" 100 | 101 | epub: 102 | $(SPHINXBUILD) -b epub $(ALLSPHINXOPTS) $(BUILDDIR)/epub 103 | @echo 104 | @echo "Build finished. The epub file is in $(BUILDDIR)/epub." 
105 | 106 | latex: 107 | $(SPHINXBUILD) -b latex $(ALLSPHINXOPTS) $(BUILDDIR)/latex 108 | @echo 109 | @echo "Build finished; the LaTeX files are in $(BUILDDIR)/latex." 110 | @echo "Run \`make' in that directory to run these through (pdf)latex" \ 111 | "(use \`make latexpdf' here to do that automatically)." 112 | 113 | latexpdf: 114 | $(SPHINXBUILD) -b latex $(ALLSPHINXOPTS) $(BUILDDIR)/latex 115 | @echo "Running LaTeX files through pdflatex..." 116 | $(MAKE) -C $(BUILDDIR)/latex all-pdf 117 | @echo "pdflatex finished; the PDF files are in $(BUILDDIR)/latex." 118 | 119 | latexpdfja: 120 | $(SPHINXBUILD) -b latex $(ALLSPHINXOPTS) $(BUILDDIR)/latex 121 | @echo "Running LaTeX files through platex and dvipdfmx..." 122 | $(MAKE) -C $(BUILDDIR)/latex all-pdf-ja 123 | @echo "pdflatex finished; the PDF files are in $(BUILDDIR)/latex." 124 | 125 | text: 126 | $(SPHINXBUILD) -b text $(ALLSPHINXOPTS) $(BUILDDIR)/text 127 | @echo 128 | @echo "Build finished. The text files are in $(BUILDDIR)/text." 129 | 130 | man: 131 | $(SPHINXBUILD) -b man $(ALLSPHINXOPTS) $(BUILDDIR)/man 132 | @echo 133 | @echo "Build finished. The manual pages are in $(BUILDDIR)/man." 134 | 135 | texinfo: 136 | $(SPHINXBUILD) -b texinfo $(ALLSPHINXOPTS) $(BUILDDIR)/texinfo 137 | @echo 138 | @echo "Build finished. The Texinfo files are in $(BUILDDIR)/texinfo." 139 | @echo "Run \`make' in that directory to run these through makeinfo" \ 140 | "(use \`make info' here to do that automatically)." 141 | 142 | info: 143 | $(SPHINXBUILD) -b texinfo $(ALLSPHINXOPTS) $(BUILDDIR)/texinfo 144 | @echo "Running Texinfo files through makeinfo..." 145 | make -C $(BUILDDIR)/texinfo info 146 | @echo "makeinfo finished; the Info files are in $(BUILDDIR)/texinfo." 147 | 148 | gettext: 149 | $(SPHINXBUILD) -b gettext $(I18NSPHINXOPTS) $(BUILDDIR)/locale 150 | @echo 151 | @echo "Build finished. The message catalogs are in $(BUILDDIR)/locale." 152 | 153 | changes: 154 | $(SPHINXBUILD) -b changes $(ALLSPHINXOPTS) $(BUILDDIR)/changes 155 | @echo 156 | @echo "The overview file is in $(BUILDDIR)/changes." 157 | 158 | linkcheck: 159 | $(SPHINXBUILD) -b linkcheck $(ALLSPHINXOPTS) $(BUILDDIR)/linkcheck 160 | @echo 161 | @echo "Link check complete; look for any errors in the above output " \ 162 | "or in $(BUILDDIR)/linkcheck/output.txt." 163 | 164 | doctest: 165 | $(SPHINXBUILD) -b doctest $(ALLSPHINXOPTS) $(BUILDDIR)/doctest 166 | @echo "Testing of doctests in the sources finished, look at the " \ 167 | "results in $(BUILDDIR)/doctest/output.txt." 168 | 169 | xml: 170 | $(SPHINXBUILD) -b xml $(ALLSPHINXOPTS) $(BUILDDIR)/xml 171 | @echo 172 | @echo "Build finished. The XML files are in $(BUILDDIR)/xml." 173 | 174 | pseudoxml: 175 | $(SPHINXBUILD) -b pseudoxml $(ALLSPHINXOPTS) $(BUILDDIR)/pseudoxml 176 | @echo 177 | @echo "Build finished. The pseudo-XML files are in $(BUILDDIR)/pseudoxml." 178 | -------------------------------------------------------------------------------- /docs/api.rst: -------------------------------------------------------------------------------- 1 | API documentation 2 | ================= 3 | 4 | The following documentation aims to present the internal API of the 5 | library. While it is possible to use all of these classes directly, 6 | following the usage patterns detailed in the rest of the documentation 7 | is advised. 8 | 9 | Basic Scraper 10 | ------------- 11 | 12 | .. automodule:: scrapekit.core 13 | :members: 14 | 15 | 16 | Tasks and threaded execution 17 | ---------------------------- 18 | 19 | .. 
automodule:: scrapekit.tasks 20 | :members: 21 | 22 | 23 | HTTP caching and parsing 24 | ------------------------ 25 | 26 | .. automodule:: scrapekit.http 27 | :members: 28 | 29 | 30 | Exceptions and Errors 31 | --------------------- 32 | 33 | .. automodule:: scrapekit.exc 34 | :members: 35 | 36 | -------------------------------------------------------------------------------- /docs/cache.rst: -------------------------------------------------------------------------------- 1 | Caching 2 | ======= 3 | 4 | Caching of response data is implemented via `CacheControl 5 | `_, a library that extends 6 | `requests `_. To enable a flexible usage of 7 | the caching mechanism, the use of cached data is steered through a cache 8 | policy, which can be specified either for the whole scraper or for a 9 | specific request. 10 | 11 | The following policies are supported: 12 | 13 | * ``http`` will perform response caching and validation according to HTTP 14 | semantic, i.e. in the way that a browser would do it. This requires the 15 | server to set accurate cache control headers - which many applications 16 | are too stupid to do. 17 | * ``none`` will disable caching entirely and always revert to the server for 18 | up-to-date information. 19 | * ``force`` will always use the cached data and not check with the server 20 | for updated pages. This is useful in debug mode, but dangerous when used 21 | in production. 22 | 23 | Per-request cache settings 24 | -------------------------- 25 | 26 | While caching will usually be configured on a scraper-wide basis, it can 27 | also be set for individual (``GET``) requests by passing a ``cache`` 28 | argument set to one of the policy names: 29 | 30 | .. code-block:: python 31 | 32 | import scrapekit 33 | scraper = scrapekit.Scraper('example') 34 | 35 | # No caching: 36 | scraper.get('http://google.com', cache='none') 37 | 38 | # Cache according to HTTP semantics: 39 | scraper.get('http://google.com', cache='http') 40 | 41 | # Force re-use of data, even if it is stale: 42 | scraper.get('http://google.com', cache='force') 43 | -------------------------------------------------------------------------------- /docs/conf.py: -------------------------------------------------------------------------------- 1 | # -*- coding: utf-8 -*- 2 | # 3 | # scrapekit documentation build configuration file, created by 4 | # sphinx-quickstart on Wed Aug 6 15:12:35 2014. 5 | # 6 | # This file is execfile()d with the current directory set to its 7 | # containing dir. 8 | # 9 | # Note that not all possible configuration values are present in this 10 | # autogenerated file. 11 | # 12 | # All configuration values have a default; values that are commented out 13 | # serve to show the default. 14 | 15 | import sys 16 | import os 17 | import sphinx_rtd_theme 18 | 19 | # If extensions (or modules to document with autodoc) are in another directory, 20 | # add these directories to sys.path here. If the directory is relative to the 21 | # documentation root, use os.path.abspath to make it absolute, like shown here. 22 | #sys.path.insert(0, os.path.abspath('.')) 23 | 24 | # -- General configuration ------------------------------------------------ 25 | 26 | # If your documentation needs a minimal Sphinx version, state it here. 27 | #needs_sphinx = '1.0' 28 | 29 | # Add any Sphinx extension module names here, as strings. They can be 30 | # extensions coming with Sphinx (named 'sphinx.ext.*') or your custom 31 | # ones. 
32 | extensions = [ 33 | 'sphinx.ext.autodoc', 34 | ] 35 | 36 | # Add any paths that contain templates here, relative to this directory. 37 | templates_path = ['_templates'] 38 | 39 | # The suffix of source filenames. 40 | source_suffix = '.rst' 41 | 42 | # The encoding of source files. 43 | #source_encoding = 'utf-8-sig' 44 | 45 | # The master toctree document. 46 | master_doc = 'index' 47 | 48 | # General information about the project. 49 | project = u'scrapekit' 50 | copyright = u'2014, Friedrich Lindenberg' 51 | 52 | # The version info for the project you're documenting, acts as replacement for 53 | # |version| and |release|, also used in various other places throughout the 54 | # built documents. 55 | # 56 | # The short X.Y version. 57 | version = '0.1' 58 | # The full version, including alpha/beta/rc tags. 59 | release = '0.1' 60 | 61 | # The language for content autogenerated by Sphinx. Refer to documentation 62 | # for a list of supported languages. 63 | #language = None 64 | 65 | # There are two options for replacing |today|: either, you set today to some 66 | # non-false value, then it is used: 67 | #today = '' 68 | # Else, today_fmt is used as the format for a strftime call. 69 | #today_fmt = '%B %d, %Y' 70 | 71 | # List of patterns, relative to source directory, that match files and 72 | # directories to ignore when looking for source files. 73 | exclude_patterns = ['_build'] 74 | 75 | # The reST default role (used for this markup: `text`) to use for all 76 | # documents. 77 | #default_role = None 78 | 79 | # If true, '()' will be appended to :func: etc. cross-reference text. 80 | #add_function_parentheses = True 81 | 82 | # If true, the current module name will be prepended to all description 83 | # unit titles (such as .. function::). 84 | #add_module_names = True 85 | 86 | # If true, sectionauthor and moduleauthor directives will be shown in the 87 | # output. They are ignored by default. 88 | #show_authors = False 89 | 90 | # The name of the Pygments (syntax highlighting) style to use. 91 | pygments_style = 'sphinx' 92 | 93 | # A list of ignored prefixes for module index sorting. 94 | #modindex_common_prefix = [] 95 | 96 | # If true, keep warnings as "system message" paragraphs in the built documents. 97 | #keep_warnings = False 98 | 99 | 100 | # -- Options for HTML output ---------------------------------------------- 101 | 102 | # The theme to use for HTML and HTML Help pages. See the documentation for 103 | # a list of builtin themes. 104 | html_theme = "sphinx_rtd_theme" 105 | 106 | # Theme options are theme-specific and customize the look and feel of a theme 107 | # further. For a list of options available for each theme, see the 108 | # documentation. 109 | #html_theme_options = {} 110 | 111 | # Add any paths that contain custom themes here, relative to this directory. 112 | html_theme_path = [sphinx_rtd_theme.get_html_theme_path()] 113 | 114 | # 'default'th = [sphinx_rtd_theme.get_html_theme_path()] The name for this set of Sphinx documents. If None, it defaults to 115 | # " v documentation". 116 | #html_title = None 117 | 118 | # A shorter title for the navigation bar. Default is the same as html_title. 119 | #html_short_title = None 120 | 121 | # The name of an image file (relative to this directory) to place at the top 122 | # of the sidebar. 123 | #html_logo = None 124 | 125 | # The name of an image file (within the static path) to use as favicon of the 126 | # docs. This file should be a Windows icon file (.ico) being 16x16 or 32x32 127 | # pixels large. 
128 | #html_favicon = None 129 | 130 | # Add any paths that contain custom static files (such as style sheets) here, 131 | # relative to this directory. They are copied after the builtin static files, 132 | # so a file named "default.css" will overwrite the builtin "default.css". 133 | html_static_path = ['_static'] 134 | 135 | # Add any extra paths that contain custom files (such as robots.txt or 136 | # .htaccess) here, relative to this directory. These files are copied 137 | # directly to the root of the documentation. 138 | #html_extra_path = [] 139 | 140 | # If not '', a 'Last updated on:' timestamp is inserted at every page bottom, 141 | # using the given strftime format. 142 | #html_last_updated_fmt = '%b %d, %Y' 143 | 144 | # If true, SmartyPants will be used to convert quotes and dashes to 145 | # typographically correct entities. 146 | #html_use_smartypants = True 147 | 148 | # Custom sidebar templates, maps document names to template names. 149 | #html_sidebars = {} 150 | 151 | # Additional templates that should be rendered to pages, maps page names to 152 | # template names. 153 | #html_additional_pages = {} 154 | 155 | # If false, no module index is generated. 156 | #html_domain_indices = True 157 | 158 | # If false, no index is generated. 159 | #html_use_index = True 160 | 161 | # If true, the index is split into individual pages for each letter. 162 | #html_split_index = False 163 | 164 | # If true, links to the reST sources are added to the pages. 165 | #html_show_sourcelink = True 166 | 167 | # If true, "Created using Sphinx" is shown in the HTML footer. Default is True. 168 | #html_show_sphinx = True 169 | 170 | # If true, "(C) Copyright ..." is shown in the HTML footer. Default is True. 171 | #html_show_copyright = True 172 | 173 | # If true, an OpenSearch description file will be output, and all pages will 174 | # contain a tag referring to it. The value of this option must be the 175 | # base URL from which the finished HTML is served. 176 | #html_use_opensearch = '' 177 | 178 | # This is the file name suffix for HTML files (e.g. ".xhtml"). 179 | #html_file_suffix = None 180 | 181 | # Output file base name for HTML help builder. 182 | htmlhelp_basename = 'scrapekitdoc' 183 | 184 | 185 | # -- Options for LaTeX output --------------------------------------------- 186 | 187 | latex_elements = { 188 | # The paper size ('letterpaper' or 'a4paper'). 189 | #'papersize': 'letterpaper', 190 | 191 | # The font size ('10pt', '11pt' or '12pt'). 192 | #'pointsize': '10pt', 193 | 194 | # Additional stuff for the LaTeX preamble. 195 | #'preamble': '', 196 | } 197 | 198 | # Grouping the document tree into LaTeX files. List of tuples 199 | # (source start file, target name, title, 200 | # author, documentclass [howto, manual, or own class]). 201 | latex_documents = [ 202 | ('index', 'scrapekit.tex', u'scrapekit Documentation', 203 | u'Friedrich Lindenberg', 'manual'), 204 | ] 205 | 206 | # The name of an image file (relative to this directory) to place at the top of 207 | # the title page. 208 | #latex_logo = None 209 | 210 | # For "manual" documents, if this is true, then toplevel headings are parts, 211 | # not chapters. 212 | #latex_use_parts = False 213 | 214 | # If true, show page references after internal links. 215 | #latex_show_pagerefs = False 216 | 217 | # If true, show URL addresses after external links. 218 | #latex_show_urls = False 219 | 220 | # Documents to append as an appendix to all manuals. 
221 | #latex_appendices = [] 222 | 223 | # If false, no module index is generated. 224 | #latex_domain_indices = True 225 | 226 | 227 | # -- Options for manual page output --------------------------------------- 228 | 229 | # One entry per manual page. List of tuples 230 | # (source start file, name, description, authors, manual section). 231 | man_pages = [ 232 | ('index', 'scrapekit', u'scrapekit Documentation', 233 | [u'Friedrich Lindenberg'], 1) 234 | ] 235 | 236 | # If true, show URL addresses after external links. 237 | #man_show_urls = False 238 | 239 | 240 | # -- Options for Texinfo output ------------------------------------------- 241 | 242 | # Grouping the document tree into Texinfo files. List of tuples 243 | # (source start file, target name, title, author, 244 | # dir menu entry, description, category) 245 | texinfo_documents = [ 246 | ('index', 'scrapekit', u'scrapekit Documentation', 247 | u'Friedrich Lindenberg', 'scrapekit', 'One line description of project.', 248 | 'Miscellaneous'), 249 | ] 250 | 251 | # Documents to append as an appendix to all manuals. 252 | #texinfo_appendices = [] 253 | 254 | # If false, no module index is generated. 255 | #texinfo_domain_indices = True 256 | 257 | # How to display URL addresses: 'footnote', 'no', or 'inline'. 258 | #texinfo_show_urls = 'footnote' 259 | 260 | # If true, do not generate a @detailmenu in the "Top" node's menu. 261 | #texinfo_no_detailmenu = False 262 | -------------------------------------------------------------------------------- /docs/config.rst: -------------------------------------------------------------------------------- 1 | Configuration 2 | ============= 3 | 4 | Scrapekit supports a broad range of configuration options, which can be 5 | set either via a configuration file, environment variables or 6 | programmatically at run-time. 7 | 8 | 9 | Configuration methods 10 | --------------------- 11 | 12 | As a first source of settings, scrapekit will attempt to read a per-user 13 | configuration file, ``~/.scrapekit.ini``. Inside the ini file, two 14 | sections will be read: ``[scrapekit]`` is expected to hold general 15 | settings, while a section, named after the current scraper, can be used to 16 | adapt these settings: 17 | 18 | .. code-block:: ini 19 | 20 | [scrapekit] 21 | reports_path = /var/www/scrapers/reports 22 | 23 | [craigslist-sf-boats] 24 | threads = 5 25 | 26 | After evaluating these settings, environment variables will be read (see 27 | below for their names). Finally, all of these settings will be overridden 28 | by any configuration provided to the constructor of 29 | :py:class:`Scraper `. 30 | 31 | 32 | Available settings 33 | ------------------ 34 | 35 | ============ ====================== ==================================== 36 | Name Environment variable Description 37 | ============ ====================== ==================================== 38 | threads SCRAPEKIT_THREADS Number of threads to be started. 39 | cache_policy SCRAPEKIT_CACHE_POLICY Policy for caching requests. Valid 40 | values are ``disable`` (no caching), 41 | ``http`` (cache according to HTTP 42 | header semantics) and ``force``, to 43 | force local storage and re-use of 44 | any requests. 45 | data_path SCRAPEKIT_DATA_PATH A storage directory for cached data 46 | from HTTP requests. This is set to 47 | be a temporary directory by default, 48 | which means caching will not work. 49 | reports_path SCRAPEKIT_REPORTS_PATH A directory to hold log files and - 50 | if generated - the reports for this 51 | scraper. 
52 | ============ ====================== ==================================== 53 | 54 | 55 | Custom settings 56 | --------------- 57 | 58 | The scraper configuration is not limited to loading the settings 59 | indicated above. Hence, custom configuration settings (e.g. for site 60 | credentials) can be added to the ini file and then retrieved from the 61 | ``config`` attribute of a :py:class:`Scraper ` 62 | instance. 63 | -------------------------------------------------------------------------------- /docs/index.rst: -------------------------------------------------------------------------------- 1 | .. scrapekit documentation master file, created by 2 | sphinx-quickstart on Wed Aug 6 15:12:35 2014. 3 | You can adapt this file completely to your liking, but it should at least 4 | contain the root `toctree` directive. 5 | 6 | scrapekit: get the data you need, fast. 7 | ======================================= 8 | 9 | .. toctree:: 10 | :hidden: 11 | 12 | Many web sites expose a great amount of data, and scraping it can help you build 13 | useful tools, services and analysis on top of that data. This can often be done 14 | with a simple Python script, using few external libraries. 15 | 16 | As your script grows, however, you will want to add more advanced features, such 17 | as **caching** of the downloaded pages, **multi-threading** to fetch many pieces 18 | of content at once, and **logging** to get a clear sense of which data failed to 19 | parse. 20 | 21 | Scrapekit provides a set of useful tools for these that help with these tasks, 22 | while also offering you simple ways to structure your scraper. This helps you to 23 | produce **fast, reliable and structured scraper scripts**. 24 | 25 | 26 | Example 27 | ------- 28 | 29 | Below is a simple scraper for postings on Craigslist. This will use 30 | multiple threads and request caching by default. 31 | 32 | .. code-block:: python 33 | 34 | import scrapekit 35 | from urlparse import urljoin 36 | 37 | scraper = scrapekit.Scraper('craigslist-sf-boats') 38 | 39 | @scraper.task 40 | def scrape_listing(url): 41 | doc = scraper.get(url).html() 42 | print(doc.find('.//h2[@class="postingtitle"]').text_content()) 43 | 44 | 45 | @scraper.task 46 | def scrape_index(url): 47 | doc = scraper.get(url).html() 48 | 49 | for listing in doc.findall('.//a[@class="hdrlnk"]'): 50 | listing_url = urljoin(url, listing.get('href')) 51 | scrape_listing.queue(listing_url) 52 | 53 | scrape_index.run('https://sfbay.craigslist.org/boo/') 54 | 55 | By default, this save cache data to a the working directory, in a folder called 56 | ``data``. 57 | 58 | 59 | Reporting 60 | --------- 61 | 62 | Upon completion, the scraper will also generate an HTML report that presents 63 | information about each task run within the scraper. 64 | 65 | .. image:: http://cl.ly/image/1J2o2T43422e/Screen%20Shot%202014-08-26%20at%2015.58.03.png 66 | 67 | This behaviour can be disabled by passing ``report=False`` to the constructor of 68 | the scraper. 69 | 70 | 71 | Contents 72 | -------- 73 | 74 | .. toctree:: 75 | :maxdepth: 2 76 | 77 | install 78 | quickstart 79 | tasks 80 | cache 81 | utils 82 | config 83 | api 84 | 85 | 86 | Contributors 87 | ------------ 88 | 89 | ``scrapekit`` is written and maintained by `Friedrich Lindenberg 90 | `_. 
It was developed as an outcome of scraping projects 91 | for the `African Network of Centers for Investigative Reporting (ANCIR) 92 | `_, supported by a `Knight International 93 | Journalism Fellowship `_ from the `International 94 | Center for Journalists (ICFJ) `_. 95 | 96 | Indices and tables 97 | ================== 98 | 99 | * :ref:`genindex` 100 | * :ref:`modindex` 101 | * :ref:`search` 102 | 103 | -------------------------------------------------------------------------------- /docs/install.rst: -------------------------------------------------------------------------------- 1 | Installation Guide 2 | ================== 3 | 4 | The easiest way is to install ``scrapekit`` via the `Python Package Index `_ using ``pip`` or ``easy_install``: 5 | 6 | .. code-block:: bash 7 | 8 | $ pip install scrapekit 9 | 10 | To install it manually simply download the repository from Github: 11 | 12 | .. code-block:: bash 13 | 14 | $ git clone git://github.com/pudo/scrapekit.git 15 | $ cd scrapekit/ 16 | $ python setup.py install 17 | -------------------------------------------------------------------------------- /docs/quickstart.rst: -------------------------------------------------------------------------------- 1 | Quickstart 2 | ========== 3 | 4 | Welcome to the scrapekit quickstart tutorial. In the following section, 5 | I'll show you how to write a simple scraper using the functions in 6 | scrapekit. 7 | 8 | Like many people, I've had a life-long, hidden desire to become a sail 9 | boat captain. To help me live the dream, we'll start by scraping 10 | `Craigslist boat sales in San Francisco `_. 11 | 12 | 13 | Getting started 14 | --------------- 15 | 16 | First, let's make a simple Python module, e.g. in a file called 17 | ``scrape_boats.py``. 18 | 19 | .. code-block:: python 20 | 21 | import scrapekit 22 | 23 | scraper = scrapekit.Scraper('craigslist-sf-boats') 24 | 25 | The first thing we've done is to instantiate a scraper and to give it 26 | a name. The name will later be used to configure the scraper and to 27 | read it's log ouput. Next, let's scrape our first page: 28 | 29 | .. code-block:: python 30 | 31 | from urlparse import urljoin 32 | 33 | @scraper.task 34 | def scrape_index(url): 35 | doc = scraper.get(url).html() 36 | 37 | next_link = doc.find('.//a[@class="button next"]') 38 | if next_link is not None: 39 | # make an absolute url. 40 | next_url = urljoin(url, next_link.get('href')) 41 | scrape_index.queue(next_url) 42 | 43 | scrape_index.run('https://sfbay.craigslist.org/boo/') 44 | 45 | This code will cycle through all the pages of listings, as long as 46 | a *Next* link is present. 47 | 48 | The key aspect of this snippet is the notion of a :py:class:`task 49 | `. Each scrapekit scraper is broken up into 50 | many small tasks, ideally one for fetching each web page. 51 | 52 | Tasks are executed in parallel to speed up the scraper. To do that, 53 | task functions aren't called directly, but by placing them on a 54 | queue (see :py:func:`scrape_index.queue ` 55 | above). Like normal functions, they can still receive arguments - 56 | in this case, the URL to be scraped. 57 | 58 | At the end of the snippet, we're calling :py:func:`scrape_index.run 59 | `. Unlike a simple queueing operation, this 60 | will tell the scraper to queue a task and then wait for all tasks to 61 | be executed. 
62 | 63 | 64 | Scraping details 65 | ---------------- 66 | 67 | Now that we have a basic task to scrape the index of listings, we 68 | might want to download each listing's page and get some data from it. 69 | To do this, we can extend our previous script: 70 | 71 | .. code-block:: python 72 | 73 | import scrapekit 74 | from urlparse import urljoin 75 | 76 | scraper = scrapekit.Scraper('craigslist-sf-boats') 77 | 78 | @scraper.task 79 | def scrape_listing(url): 80 | doc = scraper.get(url).html() 81 | print(doc.find('.//h2[@class="postingtitle"]').text_content()) 82 | 83 | 84 | @scraper.task 85 | def scrape_index(url): 86 | doc = scraper.get(url).html() 87 | 88 | for listing in doc.findall('.//a[@class="hdrlnk"]'): 89 | listing_url = urljoin(url, listing.get('href')) 90 | scrape_listing.queue(listing_url) 91 | 92 | next_link = doc.find('.//a[@class="button next"]') 93 | if next_link is not None: 94 | # make an absolute url. 95 | next_url = urljoin(url, next_link.get('href')) 96 | scrape_index.queue(next_url) 97 | 98 | scrape_index.run('https://sfbay.craigslist.org/boo/') 99 | 100 | This basic scraper could be extended to extract more information from 101 | each listing page, and to save that information to a set of files or 102 | to a database. 103 | 104 | 105 | Configuring the scraper 106 | ----------------------- 107 | 108 | As you may have noticed, Craigslist is sometimes a bit slow. You might 109 | want to configure your scraper to use caching, or a different number 110 | of simultaneous threads to retrieve data. The simplest way to set up 111 | caching is to set some environment variables: 112 | 113 | .. code-block:: bash 114 | 115 | $ export SCRAPEKIT_CACHE_POLICY="http" 116 | $ export SCRAPEKIT_DATA_PATH="data" 117 | $ export SCRAPEKIT_THREADS=10 118 | 119 | This will instruct scrapekit to cache requests according to the rules 120 | of HTTP (using headers like ``Cache-Control`` to determine what to cache 121 | and for how long), and to save downloaded data in a directory called 122 | ``data`` in the current working path. We've also instructed the tool to 123 | use 10 threads when scraping data. 124 | 125 | If you wanto to make these decisions at run-time, you could also pass 126 | them into the constructor of your :py:class:`Scraper 127 | `: 128 | 129 | .. code-block:: python 130 | 131 | import scrapekit 132 | 133 | config = { 134 | 'threads': 10, 135 | 'cache_policy': 'http', 136 | 'data_path': 'data' 137 | } 138 | scraper = scrapekit.Scraper('demo', config=config) 139 | 140 | For details on all available settings and their meaning, check out the 141 | configuration documentation. 142 | -------------------------------------------------------------------------------- /docs/tasks.rst: -------------------------------------------------------------------------------- 1 | Using tasks 2 | =========== 3 | 4 | Tasks are used by scrapekit to break up a complex script into small 5 | units of work which can be executed asynchronously. When needed, 6 | they can also be composed in a variety of ways to generate complex 7 | data processing pipelines. 8 | 9 | 10 | Explicit queueing 11 | ----------------- 12 | 13 | The most simple way of using tasks is by explicitly queueing them. 14 | Here's an example of a task queueing another task a few times: 15 | 16 | .. 
code-block:: python 17 | 18 | import scrapekit 19 | 20 | scraper = scrapekit.Scraper('test') 21 | 22 | @scraper.task 23 | def each_item(item): 24 | print(item) 25 | 26 | @scraper.task 27 | def generate_work(): 28 | for i in xrange(100): 29 | each_item.queue(i) 30 | 31 | if __name__ == '__main__': 32 | generate_work.queue().wait() 33 | 34 | 35 | As you can see, ``generate_work`` will call ``each_item`` for each 36 | item in the range. Since the items are processed asynchronously, 37 | the printed output will not be in order, but slightly mixed up. 38 | 39 | You can also see that on the last line, we're queueing the 40 | ``generate_work`` task itself, and then instructing scrapekit to 41 | wait for the completion of all tasks. Since the double call is a 42 | bit awkward, there's a helper function to make both calls at once: 43 | 44 | .. code-block:: python 45 | 46 | if __name__ == '__main__': 47 | generate_work.run() 48 | 49 | 50 | Task chaining and piping 51 | ------------------------ 52 | 53 | As an alternative to these explicit instructions to queue, you can 54 | also use a more pythonic model to declare processing pipelines. A 55 | processing pipeline connects tasks by feeding the output of one task 56 | to another task. 57 | 58 | To connect tasks, there are two methods: chaining and piping. Chaining 59 | will just take the return value of one task, and queue another task 60 | to process it. Piping, on the other hand, will expect the return value 61 | of the first task to be an iterable, or for the task itself to be a 62 | generator. It will then initiate the next task for each item in the 63 | sequence. 64 | 65 | Let's assume we have these functions defined: 66 | 67 | .. code-block:: python 68 | 69 | import scrapekit 70 | 71 | scraper = scrapekit.Scraper('test') 72 | 73 | @scraper.task 74 | def consume_item(item): 75 | print(item) 76 | 77 | @scraper.task 78 | def process_item(item): 79 | return item ** 3 80 | 81 | @scraper.task 82 | def generate_items(): 83 | for i in xrange(100): 84 | yield i 85 | 86 | 87 | The simplest link we could do would be this simple chaining: 88 | 89 | .. code-block:: python 90 | 91 | pipline = process_item > consume_item 92 | pipeline.run(5) 93 | 94 | This linked ``process_item`` to ``consume_item``. Similarly, we could 95 | use a very simple pipe: 96 | 97 | .. code-block:: python 98 | 99 | pipline = generate_items | consume_item 100 | pipeline.run() 101 | 102 | Finally, we can link all of the functions together: 103 | 104 | .. code-block:: python 105 | 106 | pipline = generate_items | process_item > consume_item 107 | pipeline.run() 108 | 109 | -------------------------------------------------------------------------------- /docs/utils.rst: -------------------------------------------------------------------------------- 1 | Utility functions 2 | ================= 3 | 4 | These helper functions are intended to support everyday tasks of 5 | data scraping, such as string sanitization, and validation. 6 | 7 | .. 
automodule:: scrapekit.util 8 | :members: 9 | -------------------------------------------------------------------------------- /requirements.txt: -------------------------------------------------------------------------------- 1 | sphinx_rtd_theme 2 | -------------------------------------------------------------------------------- /scrapekit/__init__.py: -------------------------------------------------------------------------------- 1 | from scrapekit.core import Scraper 2 | -------------------------------------------------------------------------------- /scrapekit/config.py: -------------------------------------------------------------------------------- 1 | import os 2 | import multiprocessing 3 | try: 4 | from ConfigParser import SafeConfigParser 5 | except ImportError: 6 | from configparser import ConfigParser as SafeConfigParser 7 | 8 | 9 | class Config(object): 10 | """ An object to load and represent the configuration of the current 11 | scraper. This loads scraper configuration from the environment and a 12 | per-user configuration file (``~/.scraperkit.ini``). """ 13 | 14 | def __init__(self, scraper, config): 15 | self.scraper = scraper 16 | self.config = self._get_defaults() 17 | self.config = self._get_file(self.config) 18 | self.config = self._get_env(self.config) 19 | if config is not None: 20 | self.config.update(config) 21 | 22 | def _get_defaults(self): 23 | name = self.scraper.name 24 | return { 25 | 'cache_policy': 'http', 26 | 'threads': multiprocessing.cpu_count() * 2, 27 | 'data_path': os.path.join(os.getcwd(), 'data', name), 28 | 'reports_path': None 29 | } 30 | 31 | def _get_env(self, config): 32 | """ Read environment variables based on the settings defined in 33 | the defaults. These are expected to be upper-case versions of 34 | the actual setting names, prefixed by ``SCRAPEKIT_``. """ 35 | for option, value in config.items(): 36 | env_name = 'SCRAPEKIT_%s' % option.upper() 37 | value = os.environ.get(env_name, value) 38 | config[option] = value 39 | return config 40 | 41 | def _get_file(self, config): 42 | """ Read a per-user .ini file, which is expected to have either 43 | a ``[scraperkit]`` or a ``[$SCRAPER_NAME]`` section. """ 44 | config_file = SafeConfigParser() 45 | config_file.read([os.path.expanduser('~/.scrapekit.ini')]) 46 | if config_file.has_section('scrapekit'): 47 | config.update(dict(config_file.items('scrapekit'))) 48 | if config_file.has_section(self.scraper.name): 49 | config.update(dict(config_file.items(self.scraper.name))) 50 | return config 51 | 52 | def items(self): 53 | return self.config.items() 54 | 55 | def __getattr__(self, name): 56 | if name != 'config' and name in self.config: 57 | return self.config.get(name) 58 | try: 59 | return object.__getattribute__(self, name) 60 | except AttributeError: 61 | return None 62 | -------------------------------------------------------------------------------- /scrapekit/core.py: -------------------------------------------------------------------------------- 1 | import os 2 | import atexit 3 | from uuid import uuid4 4 | from datetime import datetime 5 | from threading import local 6 | 7 | from scrapekit.config import Config 8 | from scrapekit.tasks import TaskManager, Task 9 | from scrapekit.http import make_session 10 | from scrapekit.logs import make_logger 11 | from scrapekit import reporting 12 | 13 | 14 | class Scraper(object): 15 | """ Scraper application object which handles resource management 16 | for a variety of related functions. 
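    A short usage sketch, mirroring the README example (the task body is
    illustrative only)::

        scraper = Scraper('example')

        @scraper.task
        def fetch(url):
            return scraper.get(url).html()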
""" 17 | 18 | def __init__(self, name, config=None, report=False): 19 | self.name = name 20 | self.id = uuid4() 21 | self.start_time = datetime.utcnow() 22 | self.config = Config(self, config) 23 | try: 24 | os.makedirs(self.config.data_path) 25 | except: 26 | pass 27 | self._task_manager = None 28 | self.task_ctx = local() 29 | self.log = make_logger(self) 30 | 31 | self.session = self.Session() 32 | if report: 33 | atexit.register(self.report) 34 | 35 | @property 36 | def task_manager(self): 37 | if self._task_manager is None: 38 | self._task_manager = \ 39 | TaskManager(threads=self.config.threads) 40 | return self._task_manager 41 | 42 | def task(self, fn): 43 | """ Decorate a function as a task in the scraper framework. 44 | This will enable the function to be queued and executed in 45 | a separate thread, allowing for the execution of the scraper 46 | to be asynchronous. 47 | """ 48 | return Task(self, fn) 49 | 50 | def Session(self): 51 | """ Create a pre-configured ``requests`` session instance 52 | that can be used to run HTTP requests. This instance will 53 | potentially be cached, or a stub, depending on the 54 | configuration of the scraper. """ 55 | return make_session(self) 56 | 57 | def head(self, url, **kwargs): 58 | """ HTTP HEAD via ``requests``. 59 | 60 | See: http://docs.python-requests.org/en/latest/api/#requests.head 61 | """ 62 | return self.Session().get(url, **kwargs) 63 | 64 | def get(self, url, **kwargs): 65 | """ HTTP GET via ``requests``. 66 | 67 | See: http://docs.python-requests.org/en/latest/api/#requests.get 68 | """ 69 | return self.Session().get(url, **kwargs) 70 | 71 | def post(self, url, **kwargs): 72 | """ HTTP POST via ``requests``. 73 | 74 | See: http://docs.python-requests.org/en/latest/api/#requests.post 75 | """ 76 | return self.Session().post(url, **kwargs) 77 | 78 | def put(self, url, **kwargs): 79 | """ HTTP PUT via ``requests``. 80 | 81 | See: http://docs.python-requests.org/en/latest/api/#requests.put 82 | """ 83 | return self.Session().put(url, **kwargs) 84 | 85 | def report(self): 86 | """ Generate a static HTML report for the last runs of the 87 | scraper from its log file. """ 88 | index_file = reporting.generate(self) 89 | print("Report available at: file://%s" % index_file) 90 | 91 | def __repr__(self): 92 | return '' % self.name 93 | -------------------------------------------------------------------------------- /scrapekit/exc.py: -------------------------------------------------------------------------------- 1 | 2 | 3 | class ScraperException(Exception): 4 | """ Generic scraper exception, the base for all other exceptions. 5 | """ 6 | 7 | 8 | class WrappedMixIn(): 9 | """ Mix-in for wrapped exceptions. """ 10 | 11 | def __init__(self, wrapped): 12 | self.wrapped = wrapped 13 | self.message = wrapped.message 14 | 15 | def __repr__(self): 16 | name = self.__class__.__name__ 17 | return '<%s(%s)>' % (name, self.wrapped) 18 | 19 | 20 | class DependencyException(ScraperException, WrappedMixIn): 21 | """ Triggered when an operation would require the installation 22 | of further dependencies. """ 23 | 24 | 25 | class ParseException(ScraperException, WrappedMixIn): 26 | """ Triggered when parsing an HTTP response into the desired 27 | format (e.g. an HTML DOM, or JSON) is not possible. 
""" 28 | -------------------------------------------------------------------------------- /scrapekit/http.py: -------------------------------------------------------------------------------- 1 | import os 2 | 3 | import requests 4 | from cachecontrol.adapter import CacheControlAdapter 5 | from cachecontrol.caches import FileCache 6 | from cachecontrol.controller import CacheController 7 | 8 | from scrapekit.exc import DependencyException, ParseException 9 | 10 | 11 | CARRIER_HEADER = 'X-Scrapekit-Cache-Policy' 12 | 13 | 14 | class ScraperResponse(requests.Response): 15 | """ A modified scraper response that can parse the content into 16 | HTML, XML, JSON or a BeautifulSoup instance. """ 17 | 18 | def html(self): 19 | """ Create an ``lxml``-based HTML DOM from the response. The tree 20 | will not have a root, so all queries need to be relative 21 | (i.e. start with a dot). 22 | """ 23 | try: 24 | from lxml import html 25 | return html.fromstring(self.content) 26 | except ImportError as ie: 27 | raise DependencyException(ie) 28 | 29 | def xml(self): 30 | """ Create an ``lxml``-based XML DOM from the response. The tree 31 | will not have a root, so all queries need to be relative 32 | (i.e. start with a dot). 33 | """ 34 | try: 35 | from lxml import etree 36 | return etree.fromstring(self.content) 37 | except ImportError as ie: 38 | raise DependencyException(ie) 39 | 40 | def json(self, **kwargs): 41 | """ Create JSON object out of the response. """ 42 | try: 43 | return super(ScraperResponse, self).json(**kwargs) 44 | except ValueError as ve: 45 | raise ParseException(ve) 46 | 47 | 48 | class ScraperSession(requests.Session): 49 | """ Sub-class requests session to be able to introduce additional 50 | state to sessions and responses. """ 51 | 52 | def request(self, method, url, cache=None, **kwargs): 53 | # decide the cache policy and place it in a fake HTTP header 54 | cache_policy = cache or self.cache_policy 55 | if 'headers' not in kwargs: 56 | kwargs['headers'] = {} 57 | kwargs['headers'][CARRIER_HEADER] = cache_policy 58 | 59 | # TODO: put UA fakery here. 60 | 61 | orig = super(ScraperSession, self).request(method, url, **kwargs) 62 | 63 | # log request details to the JSON log 64 | self.scraper.log.debug("%s %s", method, url, extra={ 65 | 'reqMethod': method, 66 | 'reqUrl': url, 67 | 'reqArgs': kwargs 68 | }) 69 | 70 | # Cast the response into our own subclass which has HTML/XML 71 | # parsing support. 72 | response = ScraperResponse() 73 | response.__setstate__(orig.__getstate__()) 74 | return response 75 | 76 | 77 | class PolicyCacheController(CacheController): 78 | """ Switch the caching mode based on the caching policy provided by 79 | request, which in turn can be given at request time or through the 80 | scraper configuration. """ 81 | 82 | def cached_request(self, request): 83 | cache_policy = request.headers.pop(CARRIER_HEADER, 'none') 84 | if cache_policy == 'force' or cache_policy is True: 85 | # Force using the cache, even if HTTP semantics forbid it. 86 | cache_url = self.cache_url(request.url) 87 | resp = self.serializer.loads(request, self.cache.get(cache_url)) 88 | return resp or False 89 | elif cache_policy == 'http': 90 | return super(PolicyCacheController, self).cached_request(request) 91 | else: 92 | return False 93 | 94 | 95 | def make_session(scraper): 96 | """ Instantiate a session with the desired configuration parameters, 97 | including the cache policy. 
""" 98 | cache_path = os.path.join(scraper.config.data_path, 'cache') 99 | cache_policy = scraper.config.cache_policy 100 | cache_policy = cache_policy.lower().strip() 101 | session = ScraperSession() 102 | session.scraper = scraper 103 | session.cache_policy = cache_policy 104 | 105 | adapter = CacheControlAdapter( 106 | FileCache(cache_path), 107 | cache_etags=True, 108 | controller_class=PolicyCacheController 109 | ) 110 | session.mount('http://', adapter) 111 | session.mount('https://', adapter) 112 | return session 113 | -------------------------------------------------------------------------------- /scrapekit/logs.py: -------------------------------------------------------------------------------- 1 | import os 2 | import logging 3 | 4 | try: 5 | import jsonlogger 6 | except ImportError: 7 | # python-json-logger version 0.1.0 has changed the import structure 8 | from pythonjsonlogger import jsonlogger 9 | 10 | 11 | class TaskAdapter(logging.LoggerAdapter): 12 | """ Enhance any log messages with extra information about the 13 | current context of the scraper. """ 14 | 15 | def __init__(self, logger, scraper): 16 | super(TaskAdapter, self).__init__(logger, {}) 17 | self.scraper = scraper 18 | 19 | def process(self, msg, kwargs): 20 | extra = kwargs.get('extra', {}) 21 | extra['scraperName'] = self.scraper.name 22 | extra['scraperId'] = self.scraper.id 23 | if hasattr(self.scraper.task_ctx, 'name'): 24 | extra['taskName'] = self.scraper.task_ctx.name 25 | if hasattr(self.scraper.task_ctx, 'id'): 26 | extra['taskId'] = self.scraper.task_ctx.id 27 | extra['scraperStartTime'] = self.scraper.start_time 28 | kwargs['extra'] = extra 29 | return (msg, kwargs) 30 | 31 | 32 | def make_json_format(): 33 | supported_keys = ['asctime', 'created', 'filename', 'funcName', 34 | 'levelname', 'levelno', 'lineno', 'module', 35 | 'msecs', 'message', 'name', 'pathname', 36 | 'process', 'processName', 'relativeCreated', 37 | 'thread', 'threadName'] 38 | log_format = lambda x: ['%({0:s})'.format(i) for i in x] 39 | return ' '.join(log_format(supported_keys)) 40 | 41 | 42 | def log_path(scraper): 43 | """ Determine the file name for the JSON log. """ 44 | return os.path.join(scraper.config.data_path, 45 | '%s.jsonlog' % scraper.name) 46 | 47 | 48 | def make_logger(scraper): 49 | """ Create two log handlers, one to output info-level ouput to the 50 | console, the other to store all logging in a JSON file which will 51 | later be used to generate reports. 
""" 52 | 53 | logger = logging.getLogger('') 54 | logger.setLevel(logging.DEBUG) 55 | 56 | requests_log = logging.getLogger("requests") 57 | requests_log.setLevel(logging.WARNING) 58 | 59 | json_handler = logging.FileHandler(log_path(scraper)) 60 | json_handler.setLevel(logging.DEBUG) 61 | json_formatter = jsonlogger.JsonFormatter(make_json_format()) 62 | json_handler.setFormatter(json_formatter) 63 | logger.addHandler(json_handler) 64 | 65 | console_handler = logging.StreamHandler() 66 | console_handler.setLevel(logging.INFO) 67 | fmt = '%(name)s [%(levelname)-8s]: %(message)s' 68 | formatter = logging.Formatter(fmt) 69 | console_handler.setFormatter(formatter) 70 | logger.addHandler(console_handler) 71 | 72 | logger = logging.getLogger(scraper.name) 73 | logger = TaskAdapter(logger, scraper) 74 | return logger 75 | -------------------------------------------------------------------------------- /scrapekit/reporting/__init__.py: -------------------------------------------------------------------------------- 1 | from os import path 2 | 3 | from scrapekit.reporting import db 4 | from scrapekit.reporting import render 5 | 6 | 7 | RUNS_QUERY = """ 8 | SELECT scraperId, scraperStartTime, levelname, COUNT(rowid) as messages, 9 | COUNT(DISTINCT taskId) AS tasks 10 | FROM log 11 | GROUP BY scraperId, scraperStartTime, levelname 12 | ORDER BY scraperStartTime DESC 13 | """ 14 | 15 | TASKS_QUERY = """ 16 | SELECT scraperId, scraperStartTime, levelname, taskName, 17 | COUNT(rowid) as messages, COUNT(DISTINCT taskId) AS tasks 18 | FROM log 19 | GROUP BY scraperId, scraperStartTime, levelname, taskName 20 | ORDER BY scraperId DESC, taskName DESC 21 | """ 22 | 23 | TASK_RUNS_QUERY = """ 24 | SELECT taskId, scraperId, taskName, asctime, levelname, 25 | COUNT(rowid) as messages, 26 | COUNT(DISTINCT taskId) AS tasks 27 | FROM log 28 | WHERE scraperId = :scraperId AND taskName = :taskName 29 | GROUP BY taskId, scraperId, taskName, asctime, levelname 30 | ORDER BY messages DESC 31 | """ 32 | 33 | TASK_RUNS_QUERY_NULL = """ 34 | SELECT taskId, scraperId, taskName, asctime, levelname, 35 | COUNT(rowid) as messages, 36 | COUNT(DISTINCT taskId) AS tasks 37 | FROM log 38 | WHERE scraperId = :scraperId AND taskName IS NULL 39 | GROUP BY taskId, scraperId, taskName, asctime, levelname 40 | ORDER BY messages DESC 41 | """ 42 | 43 | TASK_RUNS_LIST = """ 44 | SELECT DISTINCT scraperId, taskName FROM log ORDER BY asctime DESC 45 | """ 46 | 47 | 48 | def aggregate_loglevels(sql, keys, **kwargs): 49 | data, key = {}, None 50 | for row in db.query(sql, **kwargs): 51 | row_key = map(lambda k: row[k], keys) 52 | if key != row_key: 53 | if key is not None: 54 | yield data 55 | data = row 56 | key = row_key 57 | else: 58 | data['messages'] += row['messages'] 59 | data['tasks'] += row['tasks'] 60 | data[row['levelname']] = row['messages'] 61 | if key is not None: 62 | yield data 63 | 64 | 65 | def sort_aggregates(rows): 66 | rows = list(rows) 67 | 68 | def key(row): 69 | return row.get('ERROR', 0) * (len(rows) * 2) + \ 70 | row.get('WARN', 0) * len(rows) + \ 71 | row.get('INFO', 0) * 2 + \ 72 | row.get('DEBUG', 0) 73 | return sorted(rows, key=key) 74 | 75 | 76 | def all_task_runs(scraper, keys=('scraperId', 'taskId')): 77 | by_task = {} 78 | for row in db.log_parse(scraper): 79 | asctime = row.get('asctime') 80 | row['ts'] = '-' if asctime is None else asctime.rsplit(' ')[-1] 81 | row_key = tuple(map(lambda k: row.get(k), keys)) 82 | if row_key not in by_task: 83 | by_task[row_key] = [row] 84 | else: 85 | 
by_task[row_key].append(row) 86 | return by_task 87 | 88 | 89 | def generate(scraper): 90 | db.load(scraper) 91 | runs = list(aggregate_loglevels(RUNS_QUERY, ('scraperId',))) 92 | tasks = list(aggregate_loglevels(TASKS_QUERY, ('scraperId', 'taskName'))) 93 | index_file = render.paginate(scraper, runs, 'index%s.html', 'index.html', 94 | tasks=tasks) 95 | 96 | for task_run in db.query(TASK_RUNS_LIST): 97 | task = task_run.get('taskName') or render.PADDING 98 | file_name = '%s/%s/index%%s.html' % (task, task_run.get('scraperId')) 99 | if path.exists(path.join(path.dirname(index_file), file_name % '')): 100 | continue 101 | if task_run.get('taskName') is None: 102 | runs = aggregate_loglevels(TASK_RUNS_QUERY_NULL, ('taskId',), 103 | scraperId=task_run.get('scraperId')) 104 | else: 105 | runs = aggregate_loglevels(TASK_RUNS_QUERY, ('taskId',), 106 | scraperId=task_run.get('scraperId'), 107 | taskName=task_run.get('taskName')) 108 | runs = sort_aggregates(runs) 109 | render.paginate(scraper, runs, file_name, 'task_run_list.html', 110 | taskName=task) 111 | 112 | for (scraperId, taskId), rows in all_task_runs(scraper).items(): 113 | taskName = rows[0].get('taskName') 114 | file_name = (taskName or render.PADDING, 115 | scraperId or render.PADDING, 116 | taskId or render.PADDING) 117 | file_name = '%s/%s/%s%%s.html' % file_name 118 | if path.exists(path.join(path.dirname(index_file), file_name % '')): 119 | continue 120 | render.paginate(scraper, rows, file_name, 'task_run_item.html', 121 | scraperId=scraperId, taskId=taskId, taskName=taskName) 122 | 123 | return index_file 124 | -------------------------------------------------------------------------------- /scrapekit/reporting/db.py: -------------------------------------------------------------------------------- 1 | import json 2 | import sqlite3 3 | 4 | from scrapekit.logs import log_path 5 | 6 | 7 | conn = sqlite3.connect(':memory:') 8 | 9 | 10 | def dict_factory(cursor, row): 11 | d = {} 12 | for idx, col in enumerate(cursor.description): 13 | d[col[0]] = row[idx] 14 | return d 15 | 16 | 17 | def log_parse(scraper): 18 | path = log_path(scraper) 19 | with open(path, 'r') as fh: 20 | for line in fh: 21 | data = json.loads(line) 22 | if data.get('scraperName') != scraper.name: 23 | continue 24 | yield data 25 | 26 | 27 | def load(scraper): 28 | conn.row_factory = dict_factory 29 | conn.execute("""CREATE TABLE IF NOT EXISTS log (scraperId text, 30 | taskName text, scraperStartTime datetime, asctime text, 31 | levelname text, taskId text)""") 32 | conn.commit() 33 | for data in log_parse(scraper): 34 | conn.execute("""INSERT INTO log (scraperId, taskName, 35 | scraperStartTime, asctime, levelname, taskId) VALUES 36 | (?, ?, ?, ?, ?, ?)""", 37 | (data.get('scraperId'), data.get('taskName'), 38 | data.get('scraperStartTime'), data.get('asctime'), 39 | data.get('levelname'), data.get('taskId'))) 40 | conn.commit() 41 | 42 | 43 | def query(sql, **kwargs): 44 | rp = conn.execute(sql, kwargs) 45 | for row in rp.fetchall(): 46 | yield row 47 | -------------------------------------------------------------------------------- /scrapekit/reporting/render.py: -------------------------------------------------------------------------------- 1 | import os 2 | import math 3 | import platform 4 | import pkg_resources 5 | from datetime import datetime 6 | from collections import namedtuple 7 | 8 | from jinja2 import Environment, PackageLoader 9 | 10 | 11 | PAGE_SIZE = 15 12 | RANGE = 3 13 | PADDING = 'unkown' 14 | IGNORE_FIELDS = ['levelno', 'ts', 'processName', 
'filename', 15 | 'levelname', 'message', 'taskId', 'scraperId'] 16 | url = namedtuple('url', ['idx', 'rel', 'abs']) 17 | 18 | 19 | def datetimeformat(value): 20 | outfmt = '%b %d, %Y, %H:%M' 21 | if value is None: 22 | return 'no date' 23 | if not isinstance(value, datetime): 24 | value = datetime.strptime(value, '%Y-%m-%dT%H:%M') 25 | return value.strftime(outfmt) 26 | 27 | 28 | def render(scraper, dest_file, template, **kwargs): 29 | reports_path = scraper.config.reports_path 30 | if reports_path is None: 31 | reports_path = os.path.join(scraper.config.data_path, 'reports') 32 | dest_file = os.path.join(reports_path, dest_file) 33 | dest_path = os.path.dirname(dest_file) 34 | try: 35 | os.makedirs(dest_path) 36 | except: 37 | pass 38 | 39 | loader = PackageLoader('scrapekit', 'templates') 40 | env = Environment(loader=loader) 41 | env.filters['dateformat'] = datetimeformat 42 | template = env.get_template(template) 43 | kwargs['version'] = pkg_resources.require("scrapekit")[0].version 44 | kwargs['python'] = platform.python_version() 45 | kwargs['hostname'] = platform.uname()[1] 46 | kwargs['padding'] = PADDING 47 | kwargs['ignore_fields'] = IGNORE_FIELDS 48 | kwargs['config'] = scraper.config.items() 49 | kwargs['scraper'] = scraper 50 | with open(dest_file, 'w') as fh: 51 | text = template.render(**kwargs) 52 | fh.write(text.encode('utf-8')) 53 | return dest_file 54 | 55 | 56 | def paginate(scraper, elements, basename, template, **kwargs): 57 | basedir = os.path.dirname(basename) 58 | basefile = os.path.basename(basename) 59 | pages = int(math.ceil(float(len(elements)) / PAGE_SIZE)) 60 | 61 | urls = [] 62 | for i in range(1, pages + 1): 63 | fn = basefile % '' if i == 1 else basefile % ('_' + str(i)) 64 | urls.append(url(i, fn, os.path.join(basedir, fn))) 65 | 66 | link = None 67 | for page in reversed(urls): 68 | offset = (page.idx - 1) * PAGE_SIZE 69 | es = elements[offset:offset + PAGE_SIZE] 70 | 71 | low = page.idx - RANGE 72 | high = page.idx + RANGE 73 | 74 | if low < 1: 75 | low = 1 76 | high = min((2 * RANGE) + 1, pages) 77 | 78 | if high > pages: 79 | high = pages 80 | low = max(1, pages - (2 * RANGE) + 1) 81 | 82 | pager = { 83 | 'total': len(elements), 84 | 'page': page, 85 | 'elements': es, 86 | 'pages': urls[low - 1:high], 87 | 'show': pages > 1, 88 | 'prev': None if page.idx == 1 else urls[page.idx - 2], 89 | 'next': None if page.idx == len(urls) else urls[page.idx] 90 | } 91 | link = render(scraper, page.abs, template, pager=pager, 92 | **kwargs) 93 | return link 94 | -------------------------------------------------------------------------------- /scrapekit/tasks.py: -------------------------------------------------------------------------------- 1 | """ 2 | This module holds a simple system for the multi-threaded execution of 3 | scraper code. This can be used, for example, to split a scraper into 4 | several stages and to have multiple elements processed at the same 5 | time. 6 | 7 | The goal of this module is to handle simple multi-threaded scrapers, 8 | while making it easy to upgrade to a queue-based setup using celery 9 | later. 10 | """ 11 | 12 | from uuid import uuid4 13 | from time import sleep 14 | try: 15 | from queue import Queue 16 | except ImportError: 17 | from Queue import Queue 18 | from threading import Thread 19 | 20 | 21 | class TaskManager(object): 22 | """ The `TaskManager` is a singleton that manages the threads 23 | used to parallelize processing and the queue that manages the 24 | current set of prepared tasks. 
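
    A rough sketch of how tasks reach the manager (illustrative only;
    ``my_task`` stands for a hypothetical decorated task and ``record``
    for its argument)::

        my_task.queue(record)   # calls manager.put(my_task, (record,), {});
                                # worker threads are spawned on first use
        my_task.wait()          # calls manager.wait() and blocks until the
                                # queue has been processed completely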
""" 25 | 26 | def __init__(self, threads=10): 27 | """ 28 | :param threads: The number of threads to be spawned. Values 29 | ranging from 5 to 40 have shown useful, based on the amount 30 | of I/O involved in each task. 31 | :param daemon: Mark the worker threads as daemons in the 32 | operating system, so that they will not be included in the 33 | number of application threads for this script. 34 | """ 35 | self.num_threads = int(threads) 36 | self.queue = None 37 | 38 | def _spawn(self): 39 | """ Initialize the queue and the threads. """ 40 | self.queue = Queue(maxsize=self.num_threads * 10) 41 | for i in range(self.num_threads): 42 | t = Thread(target=self._consume) 43 | t.daemon = True 44 | t.start() 45 | 46 | def _consume(self): 47 | """ Main loop for each thread, handles picking a task off the 48 | queue, processing it and notifying the queue that it is done. 49 | """ 50 | while True: 51 | try: 52 | task, args, kwargs = self.queue.get(True) 53 | task(*args, **kwargs) 54 | finally: 55 | self.queue.task_done() 56 | 57 | def put(self, task, args, kwargs): 58 | """ Add a new item to the queue. An item is a task and the 59 | arguments needed to call it. 60 | 61 | Do not call this directly, use Task.queue/Task.run instead. 62 | """ 63 | if self.num_threads == 0: 64 | return task(*args, **kwargs) 65 | if self.queue is None: 66 | self._spawn() 67 | self.queue.put((task, args, kwargs)) 68 | 69 | def wait(self): 70 | """ Wait for each item in the queue to be processed. If this 71 | is not called, the main thread will end immediately and none 72 | of the tasks assigned to the threads would be executed. """ 73 | if self.queue is None: 74 | return 75 | 76 | self.queue.join() 77 | 78 | 79 | class ChainListener(object): 80 | 81 | def __init__(self, task): 82 | self.task = task 83 | 84 | def notify(self, value): 85 | self.task.queue(value) 86 | 87 | 88 | class PipeListener(ChainListener): 89 | 90 | def notify(self, value): 91 | # TODO: if value is a generator, it will be exhausted. 92 | # Thus no branching or return value is available. 93 | # -> consider using itertools.tee. 94 | for value_item in value: 95 | self.task.queue(value_item) 96 | 97 | 98 | class Task(object): 99 | """ A task is a decorator on a function which helps managing the 100 | execution of that function in a multi-threaded, queued context. 101 | 102 | After a task has been applied to a function, it can either be used 103 | in the normal way (by calling it directly), through a simple queue 104 | (using the `queue` method), or in pipeline mode (using `chain`, 105 | `pipe` and `run`). 106 | """ 107 | 108 | def __init__(self, scraper, fn, task_id=None): 109 | self.scraper = scraper 110 | self.fn = fn 111 | self.task_id = task_id 112 | self._listeners = [] 113 | self._source = None 114 | 115 | def __call__(self, *args, **kwargs): 116 | """ Execute the wrapped function. This will either call it in 117 | normal mode (returning the return value), or notify any 118 | pipeline listeners that have been associated with this task. 
119 | """ 120 | self.scraper.task_ctx.name = getattr(self.fn, 'func_name', self.fn.__name__) 121 | self.scraper.task_ctx.id = self.task_id or uuid4() 122 | 123 | try: 124 | self.scraper.log.debug('Begin task', extra={ 125 | 'taskArgs': args, 126 | 'taskKwargs': kwargs 127 | }) 128 | value = self.fn(*args, **kwargs) 129 | for listener in self._listeners: 130 | listener.notify(value) 131 | return value 132 | except Exception as e: 133 | self.scraper.log.exception(e) 134 | finally: 135 | self.scraper.task_ctx.name = None 136 | self.scraper.task_ctx.id = None 137 | 138 | def queue(self, *args, **kwargs): 139 | """ Schedule a task for execution. The task call (and its 140 | arguments) will be placed on the queue and processed 141 | asynchronously. """ 142 | self.scraper.task_manager.put(self, args, kwargs) 143 | return self 144 | 145 | def wait(self): 146 | """ Wait for task execution in the current queue to be 147 | complete (ie. the queue to be empty). If only `queue` is called 148 | without `wait`, no processing will occur. """ 149 | self.scraper.task_manager.wait() 150 | return self 151 | 152 | def run(self, *args, **kwargs): 153 | """ Queue a first item to execute, then wait for the queue to 154 | be empty before returning. This should be the default way of 155 | starting any scraper. 156 | """ 157 | if self._source is not None: 158 | return self._source.run(*args, **kwargs) 159 | else: 160 | self.queue(*args, **kwargs) 161 | return self.wait() 162 | 163 | def chain(self, other_task): 164 | """ Add a chain listener to the execution of this task. Whenever 165 | an item has been processed by the task, the registered listener 166 | task will be queued to be executed with the output of this task. 167 | 168 | Can also be written as:: 169 | 170 | pipeline = task1 > task2 171 | """ 172 | other_task._source = self 173 | self._listeners.append(ChainListener(other_task)) 174 | return other_task 175 | 176 | def pipe(self, other_task): 177 | """ Add a pipe listener to the execution of this task. The 178 | output of this task is required to be an iterable. Each item in 179 | the iterable will be queued as the sole argument to an execution 180 | of the listener task. 181 | 182 | Can also be written as:: 183 | 184 | pipeline = task1 | task2 185 | """ 186 | other_task._source = self 187 | self._listeners.append(PipeListener(other_task)) 188 | return other_task 189 | 190 | def __gt__(self, other_task): 191 | return self.chain(other_task) 192 | 193 | def __or__(self, other_task): 194 | return self.pipe(other_task) 195 | -------------------------------------------------------------------------------- /scrapekit/templates/index.html: -------------------------------------------------------------------------------- 1 | {% extends "layout.html" %} 2 | {% from 'macros.html' import pagination %} 3 | 4 | {% block title %} 5 | {{ scraper.name }} 6 | {% endblock %} 7 | 8 | {% block content %} 9 | 10 | 11 | 12 | 13 | 14 | 15 | 16 | 17 | 18 | {% for run in pager.elements %} 19 | 20 | 23 | 24 | 25 | 28 | 31 | 32 | {% for task in tasks %} 33 | {% if task.scraperId == run.scraperId %} 34 | 35 | 36 | 39 | 40 | 41 | 44 | 47 | 48 | {% endif %} 49 | {% endfor %} 50 | {% endfor %} 51 |
Scraper runTaskTasks executedMessagesWarningsErrors
21 | {{run.scraperStartTime | dateformat}} 22 | {{run.tasks or '-'}}{{run.messages or '-'}} 26 | {{run.get('WARN') or '-'}} 27 | 29 | {{run.get('ERROR') or '-'}} 30 |
37 | {{task.taskName or '(No task)'}} 38 | {{task.tasks or '-'}}{{task.messages or '-'}} 42 | {{task.WARN or '-'}} 43 | 45 | {{task.ERROR or '-'}} 46 |
52 | {% endblock %} 53 | -------------------------------------------------------------------------------- /scrapekit/templates/layout.html: -------------------------------------------------------------------------------- 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | {% block title %}(Untitled){% endblock %} - scrapekit 9 | 10 | 11 | 12 | 13 | 14 | 15 | 101 | 102 | 103 |
104 |
105 |

106 | {{self.title()}} 107 |

108 |
109 |
110 |

111 | This is an automatic report generated by the {{scraper.name}} 112 | scraper. It contains information about each run of the scraper, and the 113 | error messages and warnings emitted by the scraper. 114 |

115 | 116 |

Configuration

117 |

The following configuration settings applied during the most recent run 118 | of this scraper.

119 | 120 | {% for key, value in config %} 121 | 122 | 123 | 124 | 125 | {% endfor %} 126 |
{{key}}{{value}}
127 |
128 |
129 | {% block content %}{% endblock %} 130 | {{pagination(pager)}} 131 |
132 |
133 | 134 |
135 |
136 | 137 |
138 |
139 |

140 | scrapekit {{version}} 141 |

142 |

Python {{python}} / {{hostname}}

143 |
144 |
145 | 146 | {% block js %}{% endblock %} 147 | 148 | 149 | 150 | 151 | -------------------------------------------------------------------------------- /scrapekit/templates/macros.html: -------------------------------------------------------------------------------- 1 | {% macro pagination(pager) %} 2 | {% if pager.show %} 3 |
    4 | {% if pager.prev %} 5 |
  • «
  • 6 | {% else %} 7 |
  • «
  • 8 | {% endif %} 9 | {% for page in pager.pages %} 10 | {% if page.idx == pager.page.idx %} 11 |
  • {{page.idx}}
  • 12 | {% else %} 13 |
  • {{page.idx}}
  • 14 | {% endif %} 15 | {% endfor %} 16 | {% if pager.next %} 17 |
  • »
  • 18 | {% else %} 19 |
  • »
  • 20 | {% endif %} 21 |
22 | {% endif %} 23 | {% endmacro %} 24 | -------------------------------------------------------------------------------- /scrapekit/templates/task_run_item.html: -------------------------------------------------------------------------------- 1 | {% extends "layout.html" %} 2 | {% from 'macros.html' import pagination %} 3 | 4 | {% block title %} 5 | {{ scraper.name }}: {{taskName or padding}} 6 | {% endblock %} 7 | 8 | {% block content %} 9 | 10 | 11 | 12 | 13 | 14 | 15 | {% for row in pager.elements %} 16 | 17 | 18 | 19 | 20 | 21 | 22 | 31 | 32 | {% endfor %} 33 |
TimeLevelMessage
{{row.ts}}{{row.levelname}}{{row.message}}
23 | 24 | {% for k, v in row.items() %} 25 | {% if k not in ignore_fields %} 26 | 27 | {% endif %} 28 | {% endfor %} 29 |
{{k}}{{v}}
30 |
34 | {% endblock %} 35 | 36 | {% block js %} 37 | 38 | 47 | {% endblock %} 48 | -------------------------------------------------------------------------------- /scrapekit/templates/task_run_list.html: -------------------------------------------------------------------------------- 1 | {% extends "layout.html" %} 2 | {% from 'macros.html' import pagination %} 3 | 4 | {% block title %} 5 | {{ scraper.name }}: {{taskName or padding}} 6 | {% endblock %} 7 | 8 | {% block content %} 9 | 10 | 11 | 12 | 13 | 14 | 15 | 16 | {% for run in pager.elements %} 17 | 18 | 19 | 20 | 23 | 26 | 27 | {% endfor %} 28 |
TimeMessagesWarningsErrors
{{run.asctime}}{{run.messages}} 21 | {{run.WARN or '-'}} 22 | 24 | {{run.ERROR or '-'}} 25 |
29 | {% endblock %} 30 | 31 | 32 | -------------------------------------------------------------------------------- /scrapekit/util.py: -------------------------------------------------------------------------------- 1 | import re 2 | 3 | 4 | def collapse_whitespace(text): 5 | """ Collapse all consecutive whitespace, newlines and tabs 6 | in a string into single whitespaces, and strip the outer 7 | whitespace. This will also accept an ``lxml`` element and 8 | extract all text. """ 9 | if text is None: 10 | return None 11 | if hasattr(text, 'xpath'): 12 | text = text.xpath('string()') 13 | text = re.sub('\s+', ' ', text) 14 | return text.strip() 15 | -------------------------------------------------------------------------------- /setup.py: -------------------------------------------------------------------------------- 1 | from setuptools import setup, find_packages 2 | 3 | 4 | setup( 5 | name='scrapekit', 6 | version='0.2.1', 7 | description="Light-weight tools for web scraping", 8 | long_description="", 9 | classifiers=[ 10 | "Development Status :: 3 - Alpha", 11 | "Intended Audience :: Developers", 12 | "License :: OSI Approved :: MIT License", 13 | "Operating System :: OS Independent", 14 | 'Programming Language :: Python :: 2.6', 15 | 'Programming Language :: Python :: 2.7', 16 | 'Programming Language :: Python :: 3.3', 17 | 'Programming Language :: Python :: 3.4' 18 | ], 19 | keywords='web scraping crawling http cache threading', 20 | author='Friedrich Lindenberg', 21 | author_email='friedrich@pudo.org', 22 | url='http://github.com/pudo/scrapekit', 23 | license='MIT', 24 | packages=find_packages(exclude=['ez_setup', 'examples', 'test']), 25 | namespace_packages=[], 26 | package_data={'scrapekit': ['templates/*.html']}, 27 | include_package_data=True, 28 | zip_safe=False, 29 | install_requires=[ 30 | "requests>=2.3.0", 31 | "CacheControl>=0.10.2", 32 | "lockfile>=0.9.1", 33 | "Jinja2>=2.7.3", 34 | "python-json-logger>=0.0.5" 35 | ], 36 | tests_require=[], 37 | entry_points={ 38 | 'console_scripts': [] 39 | } 40 | ) 41 | -------------------------------------------------------------------------------- /test.py: -------------------------------------------------------------------------------- 1 | from scrapekit import Scraper 2 | 3 | scraper = Scraper('test') 4 | 5 | @scraper.task 6 | def fun(a, b): 7 | print(a, b, a + b) 8 | return a + b 9 | 10 | #print fun.queue(2, 3).wait() 11 | #print fun(2, 3) 12 | 13 | 14 | @scraper.task 15 | def funSource(): 16 | for i in xrange(100): 17 | yield i 18 | 19 | @scraper.task 20 | def funModifier(i): 21 | return i + 0.1 22 | 23 | @scraper.task 24 | def funSink(i): 25 | print(i ** 3) 26 | 27 | 28 | #pipeline = funSource | funModifier > funSink 29 | #pipeline.run() 30 | 31 | #funSource.chain(funSink).run() 32 | #funSource.queue().wait() 33 | 34 | 35 | @scraper.task 36 | def scrape_index(): 37 | url = 'https://sfbay.craigslist.org/boo/' 38 | res = scraper.get(url) 39 | print(res) 40 | 41 | 42 | scrape_index.run() 43 | 44 | --------------------------------------------------------------------------------