├── .gitignore ├── DESIGN.md ├── LICENSE ├── MANIFEST.in ├── README.md ├── docs ├── Makefile ├── api.rst ├── cache.rst ├── conf.py ├── config.rst ├── index.rst ├── install.rst ├── quickstart.rst ├── tasks.rst └── utils.rst ├── requirements.txt ├── scrapekit ├── __init__.py ├── config.py ├── core.py ├── exc.py ├── http.py ├── logs.py ├── reporting │ ├── __init__.py │ ├── db.py │ └── render.py ├── tasks.py ├── templates │ ├── index.html │ ├── layout.html │ ├── macros.html │ ├── task_run_item.html │ └── task_run_list.html └── util.py ├── setup.py └── test.py /.gitignore: -------------------------------------------------------------------------------- 1 | *.DS_Store 2 | 3 | # Byte-compiled / optimized / DLL files 4 | __pycache__/ 5 | *.py[cod] 6 | 7 | # C extensions 8 | *.so 9 | 10 | # Distribution / packaging 11 | .Python 12 | env/ 13 | bin/ 14 | build/ 15 | develop-eggs/ 16 | dist/ 17 | eggs/ 18 | lib/ 19 | lib64/ 20 | parts/ 21 | sdist/ 22 | var/ 23 | *.egg-info/ 24 | .installed.cfg 25 | *.egg 26 | 27 | # Installer logs 28 | pip-log.txt 29 | pip-delete-this-directory.txt 30 | 31 | # Unit test / coverage reports 32 | htmlcov/ 33 | .tox/ 34 | .coverage 35 | .cache 36 | nosetests.xml 37 | coverage.xml 38 | 39 | # Translations 40 | *.mo 41 | 42 | # Mr Developer 43 | .mr.developer.cfg 44 | .project 45 | .pydevproject 46 | 47 | # Rope 48 | .ropeproject 49 | 50 | # Django stuff: 51 | *.log 52 | *.pot 53 | 54 | # Sphinx documentation 55 | docs/_build/ 56 | 57 | # Scrapekit specific 58 | data/ 59 | reports/ 60 | -------------------------------------------------------------------------------- /DESIGN.md: -------------------------------------------------------------------------------- 1 | 2 | What should a typical session look like? 3 | 4 | ## Fetching stuff from the web 5 | 6 | ```python 7 | from scrapekit import config, http 8 | 9 | config.cache_forever = True 10 | config.cache_dir = '/tmp' 11 | 12 | res = http.get('http://databin.pudo.org/t/b2d9cf') 13 | ``` 14 | 15 | Good enough. This should retain a cache of the data locally, and do 16 | retrieval. 17 | 18 | Other concerns: 19 | 20 | * Rate limiting 21 | * User agent hiding, and very explicit UAs. 22 | 23 | ## Parallel processing 24 | 25 | Next up: threading. Basically a really light-weight, in-process 26 | version of celery? Perhaps with an option to go for the real thing 27 | when needed? 28 | 29 | ```python 30 | from scrapekit import processing 31 | 32 | @processing.task 33 | def scrape_page(url): 34 | pass 35 | 36 | scrape_page.queue(url) 37 | processing.init(num_threads=20) 38 | ``` 39 | 40 | Alternatively, this could support a system of pipelines like this: 41 | 42 | ```python 43 | from scrapekit import processing 44 | 45 | @processing.task 46 | def scrape_index(): 47 | for i in xrange(1000): 48 | yield 'http://example.com/%d' % i 49 | 50 | @processing.task 51 | def scrape_page(url): 52 | pass 53 | 54 | pipeline = scrape_index.pipeline() 55 | pipeline = pipeline.chain(scrape_page) 56 | pipeline.run(num_threads=20) 57 | ``` 58 | 59 | ## Logging 60 | 61 | What would really good logging for scrapers look like? 62 | 63 | * Includes context, such as the URL currently being processed 64 | * All the trivial stuff (HTTP issues, HTML/XML parsing) is handled 65 | * Logs go to a CSV file or database? Something that allows 66 | systematic analysis. 67 | * Does this actually generate nice-to-look at HTML? 
68 | * Set up sensible defaults for Requests logging 69 | 70 | ## Audits 71 | 72 | Audits are parts of a pipeline that validate the generated data against a 73 | pre-defined schema. This could be used to make sure the data meets 74 | certain expectations. 75 | 76 | ## Other functionality 77 | 78 | What else is repeated all over scrapers? 79 | 80 | * Text cleaning (remove multiple spaces, normalize). 81 | * Currency conversion and deflation 82 | * Geocoding of addresses 83 | 84 | -------------------------------------------------------------------------------- /LICENSE: -------------------------------------------------------------------------------- 1 | The MIT License (MIT) 2 | 3 | Copyright (c) 2014 Friedrich Lindenberg 4 | 5 | Permission is hereby granted, free of charge, to any person obtaining a copy 6 | of this software and associated documentation files (the "Software"), to deal 7 | in the Software without restriction, including without limitation the rights 8 | to use, copy, modify, merge, publish, distribute, sublicense, and/or sell 9 | copies of the Software, and to permit persons to whom the Software is 10 | furnished to do so, subject to the following conditions: 11 | 12 | The above copyright notice and this permission notice shall be included in all 13 | copies or substantial portions of the Software. 14 | 15 | THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR 16 | IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, 17 | FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE 18 | AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER 19 | LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, 20 | OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE 21 | SOFTWARE. 22 | -------------------------------------------------------------------------------- /MANIFEST.in: -------------------------------------------------------------------------------- 1 | include LICENSE requirements.txt README.md 2 | recursive-include scrapekit/templates * 3 | global-exclude *.pyc 4 | -------------------------------------------------------------------------------- /README.md: -------------------------------------------------------------------------------- 1 | # scrapekit 2 | 3 | Did you know the entire web was made of data? You probably did. 4 | Scrapekit helps you get that data with simple Python scripts. Based on 5 | [requests](http://docs.python-requests.org/), the library will handles 6 | caching, threading and logging. 7 | 8 | See the [full documentation](http://scrapekit.readthedocs.org/). 9 | 10 | ## Example 11 | 12 | ```python 13 | from scrapekit import Scraper 14 | 15 | scraper = Scraper('example') 16 | 17 | @scraper.task 18 | def get_index(): 19 | url = 'http://databin.pudo.org/t/b2d9cf' 20 | doc = scraper.get(url).html() 21 | for row in doc.findall('.//tr'): 22 | yield row 23 | 24 | @scraper.task 25 | def get_row(row): 26 | columns = row.findall('./td') 27 | print(columns) 28 | 29 | pipeline = get_index | get_row 30 | if __name__ == '__main__': 31 | pipeline.run() 32 | 33 | ``` 34 | 35 | ## Works well with 36 | 37 | Scrapekit doesn't aim to provide all functionality necessary for 38 | scraping. Specifically, it doesn't address HTML parsing, data storage 39 | and data validation. 
For these needs, check the following libraries: 40 | 41 | * [lxml](http://lxml.de/) for HTML/XML parsing; much faster and more 42 | flexible than [BeautifulSoup](http://www.crummy.com/software/BeautifulSoup/). 43 | * [dataset](http://dataset.rtfd.org) is a sister library of scrapekit 44 | that simplifies storing semi-structured data in SQL databases. 45 | 46 | ## Existing tools 47 | 48 | * [Scrapy](http://scrapy.org/) is a much more mature and comprehensive 49 | framework for developing scrapers. On the other hand, it requires you to 50 | develop scrapers within its class system. This can be too heavyweight 51 | for a simple script to grab data off a web site. 52 | * [scrapelib](http://scrapelib.readthedocs.org/) is a thin wrapper 53 | around requests that does throttling, retries and caching. 54 | * [MechanicalSoup](https://github.com/hickford/MechanicalSoup) binds 55 | BeautifulSoup and requests into an imperative, stateful API. 56 | 57 | ## Credits and license 58 | 59 | Scrapekit is licensed under the terms of the MIT license, which is also 60 | included in [LICENSE](LICENSE). It was developed through projects of 61 | [ICFJ](http://icfj.org), [ANCIR](http://investigativecenters.org) and 62 | [ICIJ](http://icij.org). 63 | -------------------------------------------------------------------------------- /docs/Makefile: -------------------------------------------------------------------------------- 1 | # Makefile for Sphinx documentation 2 | # 3 | 4 | # You can set these variables from the command line. 5 | SPHINXOPTS = 6 | SPHINXBUILD = sphinx-build 7 | PAPER = 8 | BUILDDIR = _build 9 | 10 | # User-friendly check for sphinx-build 11 | ifeq ($(shell which $(SPHINXBUILD) >/dev/null 2>&1; echo $$?), 1) 12 | $(error The '$(SPHINXBUILD)' command was not found. Make sure you have Sphinx installed, then set the SPHINXBUILD environment variable to point to the full path of the '$(SPHINXBUILD)' executable. Alternatively you can add the directory with the executable to your PATH. If you don't have Sphinx installed, grab it from http://sphinx-doc.org/) 13 | endif 14 | 15 | # Internal variables. 16 | PAPEROPT_a4 = -D latex_paper_size=a4 17 | PAPEROPT_letter = -D latex_paper_size=letter 18 | ALLSPHINXOPTS = -d $(BUILDDIR)/doctrees $(PAPEROPT_$(PAPER)) $(SPHINXOPTS) . 19 | # the i18n builder cannot share the environment and doctrees with the others 20 | I18NSPHINXOPTS = $(PAPEROPT_$(PAPER)) $(SPHINXOPTS) . 
21 | 22 | .PHONY: help clean html dirhtml singlehtml pickle json htmlhelp qthelp devhelp epub latex latexpdf text man changes linkcheck doctest gettext 23 | 24 | help: 25 | @echo "Please use \`make ' where is one of" 26 | @echo " html to make standalone HTML files" 27 | @echo " dirhtml to make HTML files named index.html in directories" 28 | @echo " singlehtml to make a single large HTML file" 29 | @echo " pickle to make pickle files" 30 | @echo " json to make JSON files" 31 | @echo " htmlhelp to make HTML files and a HTML help project" 32 | @echo " qthelp to make HTML files and a qthelp project" 33 | @echo " devhelp to make HTML files and a Devhelp project" 34 | @echo " epub to make an epub" 35 | @echo " latex to make LaTeX files, you can set PAPER=a4 or PAPER=letter" 36 | @echo " latexpdf to make LaTeX files and run them through pdflatex" 37 | @echo " latexpdfja to make LaTeX files and run them through platex/dvipdfmx" 38 | @echo " text to make text files" 39 | @echo " man to make manual pages" 40 | @echo " texinfo to make Texinfo files" 41 | @echo " info to make Texinfo files and run them through makeinfo" 42 | @echo " gettext to make PO message catalogs" 43 | @echo " changes to make an overview of all changed/added/deprecated items" 44 | @echo " xml to make Docutils-native XML files" 45 | @echo " pseudoxml to make pseudoxml-XML files for display purposes" 46 | @echo " linkcheck to check all external links for integrity" 47 | @echo " doctest to run all doctests embedded in the documentation (if enabled)" 48 | 49 | clean: 50 | rm -rf $(BUILDDIR)/* 51 | 52 | html: 53 | $(SPHINXBUILD) -b html $(ALLSPHINXOPTS) $(BUILDDIR)/html 54 | @echo 55 | @echo "Build finished. The HTML pages are in $(BUILDDIR)/html." 56 | 57 | dirhtml: 58 | $(SPHINXBUILD) -b dirhtml $(ALLSPHINXOPTS) $(BUILDDIR)/dirhtml 59 | @echo 60 | @echo "Build finished. The HTML pages are in $(BUILDDIR)/dirhtml." 61 | 62 | singlehtml: 63 | $(SPHINXBUILD) -b singlehtml $(ALLSPHINXOPTS) $(BUILDDIR)/singlehtml 64 | @echo 65 | @echo "Build finished. The HTML page is in $(BUILDDIR)/singlehtml." 66 | 67 | pickle: 68 | $(SPHINXBUILD) -b pickle $(ALLSPHINXOPTS) $(BUILDDIR)/pickle 69 | @echo 70 | @echo "Build finished; now you can process the pickle files." 71 | 72 | json: 73 | $(SPHINXBUILD) -b json $(ALLSPHINXOPTS) $(BUILDDIR)/json 74 | @echo 75 | @echo "Build finished; now you can process the JSON files." 76 | 77 | htmlhelp: 78 | $(SPHINXBUILD) -b htmlhelp $(ALLSPHINXOPTS) $(BUILDDIR)/htmlhelp 79 | @echo 80 | @echo "Build finished; now you can run HTML Help Workshop with the" \ 81 | ".hhp project file in $(BUILDDIR)/htmlhelp." 82 | 83 | qthelp: 84 | $(SPHINXBUILD) -b qthelp $(ALLSPHINXOPTS) $(BUILDDIR)/qthelp 85 | @echo 86 | @echo "Build finished; now you can run "qcollectiongenerator" with the" \ 87 | ".qhcp project file in $(BUILDDIR)/qthelp, like this:" 88 | @echo "# qcollectiongenerator $(BUILDDIR)/qthelp/scrapekit.qhcp" 89 | @echo "To view the help file:" 90 | @echo "# assistant -collectionFile $(BUILDDIR)/qthelp/scrapekit.qhc" 91 | 92 | devhelp: 93 | $(SPHINXBUILD) -b devhelp $(ALLSPHINXOPTS) $(BUILDDIR)/devhelp 94 | @echo 95 | @echo "Build finished." 96 | @echo "To view the help file:" 97 | @echo "# mkdir -p $$HOME/.local/share/devhelp/scrapekit" 98 | @echo "# ln -s $(BUILDDIR)/devhelp $$HOME/.local/share/devhelp/scrapekit" 99 | @echo "# devhelp" 100 | 101 | epub: 102 | $(SPHINXBUILD) -b epub $(ALLSPHINXOPTS) $(BUILDDIR)/epub 103 | @echo 104 | @echo "Build finished. The epub file is in $(BUILDDIR)/epub." 
105 | 106 | latex: 107 | $(SPHINXBUILD) -b latex $(ALLSPHINXOPTS) $(BUILDDIR)/latex 108 | @echo 109 | @echo "Build finished; the LaTeX files are in $(BUILDDIR)/latex." 110 | @echo "Run \`make' in that directory to run these through (pdf)latex" \ 111 | "(use \`make latexpdf' here to do that automatically)." 112 | 113 | latexpdf: 114 | $(SPHINXBUILD) -b latex $(ALLSPHINXOPTS) $(BUILDDIR)/latex 115 | @echo "Running LaTeX files through pdflatex..." 116 | $(MAKE) -C $(BUILDDIR)/latex all-pdf 117 | @echo "pdflatex finished; the PDF files are in $(BUILDDIR)/latex." 118 | 119 | latexpdfja: 120 | $(SPHINXBUILD) -b latex $(ALLSPHINXOPTS) $(BUILDDIR)/latex 121 | @echo "Running LaTeX files through platex and dvipdfmx..." 122 | $(MAKE) -C $(BUILDDIR)/latex all-pdf-ja 123 | @echo "pdflatex finished; the PDF files are in $(BUILDDIR)/latex." 124 | 125 | text: 126 | $(SPHINXBUILD) -b text $(ALLSPHINXOPTS) $(BUILDDIR)/text 127 | @echo 128 | @echo "Build finished. The text files are in $(BUILDDIR)/text." 129 | 130 | man: 131 | $(SPHINXBUILD) -b man $(ALLSPHINXOPTS) $(BUILDDIR)/man 132 | @echo 133 | @echo "Build finished. The manual pages are in $(BUILDDIR)/man." 134 | 135 | texinfo: 136 | $(SPHINXBUILD) -b texinfo $(ALLSPHINXOPTS) $(BUILDDIR)/texinfo 137 | @echo 138 | @echo "Build finished. The Texinfo files are in $(BUILDDIR)/texinfo." 139 | @echo "Run \`make' in that directory to run these through makeinfo" \ 140 | "(use \`make info' here to do that automatically)." 141 | 142 | info: 143 | $(SPHINXBUILD) -b texinfo $(ALLSPHINXOPTS) $(BUILDDIR)/texinfo 144 | @echo "Running Texinfo files through makeinfo..." 145 | make -C $(BUILDDIR)/texinfo info 146 | @echo "makeinfo finished; the Info files are in $(BUILDDIR)/texinfo." 147 | 148 | gettext: 149 | $(SPHINXBUILD) -b gettext $(I18NSPHINXOPTS) $(BUILDDIR)/locale 150 | @echo 151 | @echo "Build finished. The message catalogs are in $(BUILDDIR)/locale." 152 | 153 | changes: 154 | $(SPHINXBUILD) -b changes $(ALLSPHINXOPTS) $(BUILDDIR)/changes 155 | @echo 156 | @echo "The overview file is in $(BUILDDIR)/changes." 157 | 158 | linkcheck: 159 | $(SPHINXBUILD) -b linkcheck $(ALLSPHINXOPTS) $(BUILDDIR)/linkcheck 160 | @echo 161 | @echo "Link check complete; look for any errors in the above output " \ 162 | "or in $(BUILDDIR)/linkcheck/output.txt." 163 | 164 | doctest: 165 | $(SPHINXBUILD) -b doctest $(ALLSPHINXOPTS) $(BUILDDIR)/doctest 166 | @echo "Testing of doctests in the sources finished, look at the " \ 167 | "results in $(BUILDDIR)/doctest/output.txt." 168 | 169 | xml: 170 | $(SPHINXBUILD) -b xml $(ALLSPHINXOPTS) $(BUILDDIR)/xml 171 | @echo 172 | @echo "Build finished. The XML files are in $(BUILDDIR)/xml." 173 | 174 | pseudoxml: 175 | $(SPHINXBUILD) -b pseudoxml $(ALLSPHINXOPTS) $(BUILDDIR)/pseudoxml 176 | @echo 177 | @echo "Build finished. The pseudo-XML files are in $(BUILDDIR)/pseudoxml." 178 | -------------------------------------------------------------------------------- /docs/api.rst: -------------------------------------------------------------------------------- 1 | API documentation 2 | ================= 3 | 4 | The following documentation aims to present the internal API of the 5 | library. While it is possible to use all of these classes directly, 6 | following the usage patterns detailed in the rest of the documentation 7 | is advised. 8 | 9 | Basic Scraper 10 | ------------- 11 | 12 | .. automodule:: scrapekit.core 13 | :members: 14 | 15 | 16 | Tasks and threaded execution 17 | ---------------------------- 18 | 19 | .. 
automodule:: scrapekit.tasks 20 | :members: 21 | 22 | 23 | HTTP caching and parsing 24 | ------------------------ 25 | 26 | .. automodule:: scrapekit.http 27 | :members: 28 | 29 | 30 | Exceptions and Errors 31 | --------------------- 32 | 33 | .. automodule:: scrapekit.exc 34 | :members: 35 | 36 | -------------------------------------------------------------------------------- /docs/cache.rst: -------------------------------------------------------------------------------- 1 | Caching 2 | ======= 3 | 4 | Caching of response data is implemented via `CacheControl 5 | `_, a library that extends 6 | `requests `_. To enable a flexible usage of 7 | the caching mechanism, the use of cached data is steered through a cache 8 | policy, which can be specified either for the whole scraper or for a 9 | specific request. 10 | 11 | The following policies are supported: 12 | 13 | * ``http`` will perform response caching and validation according to HTTP 14 | semantic, i.e. in the way that a browser would do it. This requires the 15 | server to set accurate cache control headers - which many applications 16 | are too stupid to do. 17 | * ``none`` will disable caching entirely and always revert to the server for 18 | up-to-date information. 19 | * ``force`` will always use the cached data and not check with the server 20 | for updated pages. This is useful in debug mode, but dangerous when used 21 | in production. 22 | 23 | Per-request cache settings 24 | -------------------------- 25 | 26 | While caching will usually be configured on a scraper-wide basis, it can 27 | also be set for individual (``GET``) requests by passing a ``cache`` 28 | argument set to one of the policy names: 29 | 30 | .. code-block:: python 31 | 32 | import scrapekit 33 | scraper = scrapekit.Scraper('example') 34 | 35 | # No caching: 36 | scraper.get('http://google.com', cache='none') 37 | 38 | # Cache according to HTTP semantics: 39 | scraper.get('http://google.com', cache='http') 40 | 41 | # Force re-use of data, even if it is stale: 42 | scraper.get('http://google.com', cache='force') 43 | -------------------------------------------------------------------------------- /docs/conf.py: -------------------------------------------------------------------------------- 1 | # -*- coding: utf-8 -*- 2 | # 3 | # scrapekit documentation build configuration file, created by 4 | # sphinx-quickstart on Wed Aug 6 15:12:35 2014. 5 | # 6 | # This file is execfile()d with the current directory set to its 7 | # containing dir. 8 | # 9 | # Note that not all possible configuration values are present in this 10 | # autogenerated file. 11 | # 12 | # All configuration values have a default; values that are commented out 13 | # serve to show the default. 14 | 15 | import sys 16 | import os 17 | import sphinx_rtd_theme 18 | 19 | # If extensions (or modules to document with autodoc) are in another directory, 20 | # add these directories to sys.path here. If the directory is relative to the 21 | # documentation root, use os.path.abspath to make it absolute, like shown here. 22 | #sys.path.insert(0, os.path.abspath('.')) 23 | 24 | # -- General configuration ------------------------------------------------ 25 | 26 | # If your documentation needs a minimal Sphinx version, state it here. 27 | #needs_sphinx = '1.0' 28 | 29 | # Add any Sphinx extension module names here, as strings. They can be 30 | # extensions coming with Sphinx (named 'sphinx.ext.*') or your custom 31 | # ones. 
32 | extensions = [ 33 | 'sphinx.ext.autodoc', 34 | ] 35 | 36 | # Add any paths that contain templates here, relative to this directory. 37 | templates_path = ['_templates'] 38 | 39 | # The suffix of source filenames. 40 | source_suffix = '.rst' 41 | 42 | # The encoding of source files. 43 | #source_encoding = 'utf-8-sig' 44 | 45 | # The master toctree document. 46 | master_doc = 'index' 47 | 48 | # General information about the project. 49 | project = u'scrapekit' 50 | copyright = u'2014, Friedrich Lindenberg' 51 | 52 | # The version info for the project you're documenting, acts as replacement for 53 | # |version| and |release|, also used in various other places throughout the 54 | # built documents. 55 | # 56 | # The short X.Y version. 57 | version = '0.1' 58 | # The full version, including alpha/beta/rc tags. 59 | release = '0.1' 60 | 61 | # The language for content autogenerated by Sphinx. Refer to documentation 62 | # for a list of supported languages. 63 | #language = None 64 | 65 | # There are two options for replacing |today|: either, you set today to some 66 | # non-false value, then it is used: 67 | #today = '' 68 | # Else, today_fmt is used as the format for a strftime call. 69 | #today_fmt = '%B %d, %Y' 70 | 71 | # List of patterns, relative to source directory, that match files and 72 | # directories to ignore when looking for source files. 73 | exclude_patterns = ['_build'] 74 | 75 | # The reST default role (used for this markup: `text`) to use for all 76 | # documents. 77 | #default_role = None 78 | 79 | # If true, '()' will be appended to :func: etc. cross-reference text. 80 | #add_function_parentheses = True 81 | 82 | # If true, the current module name will be prepended to all description 83 | # unit titles (such as .. function::). 84 | #add_module_names = True 85 | 86 | # If true, sectionauthor and moduleauthor directives will be shown in the 87 | # output. They are ignored by default. 88 | #show_authors = False 89 | 90 | # The name of the Pygments (syntax highlighting) style to use. 91 | pygments_style = 'sphinx' 92 | 93 | # A list of ignored prefixes for module index sorting. 94 | #modindex_common_prefix = [] 95 | 96 | # If true, keep warnings as "system message" paragraphs in the built documents. 97 | #keep_warnings = False 98 | 99 | 100 | # -- Options for HTML output ---------------------------------------------- 101 | 102 | # The theme to use for HTML and HTML Help pages. See the documentation for 103 | # a list of builtin themes. 104 | html_theme = "sphinx_rtd_theme" 105 | 106 | # Theme options are theme-specific and customize the look and feel of a theme 107 | # further. For a list of options available for each theme, see the 108 | # documentation. 109 | #html_theme_options = {} 110 | 111 | # Add any paths that contain custom themes here, relative to this directory. 112 | html_theme_path = [sphinx_rtd_theme.get_html_theme_path()] 113 | 114 | # 'default'th = [sphinx_rtd_theme.get_html_theme_path()] The name for this set of Sphinx documents. If None, it defaults to 115 | # " v documentation". 116 | #html_title = None 117 | 118 | # A shorter title for the navigation bar. Default is the same as html_title. 119 | #html_short_title = None 120 | 121 | # The name of an image file (relative to this directory) to place at the top 122 | # of the sidebar. 123 | #html_logo = None 124 | 125 | # The name of an image file (within the static path) to use as favicon of the 126 | # docs. This file should be a Windows icon file (.ico) being 16x16 or 32x32 127 | # pixels large. 
128 | #html_favicon = None 129 | 130 | # Add any paths that contain custom static files (such as style sheets) here, 131 | # relative to this directory. They are copied after the builtin static files, 132 | # so a file named "default.css" will overwrite the builtin "default.css". 133 | html_static_path = ['_static'] 134 | 135 | # Add any extra paths that contain custom files (such as robots.txt or 136 | # .htaccess) here, relative to this directory. These files are copied 137 | # directly to the root of the documentation. 138 | #html_extra_path = [] 139 | 140 | # If not '', a 'Last updated on:' timestamp is inserted at every page bottom, 141 | # using the given strftime format. 142 | #html_last_updated_fmt = '%b %d, %Y' 143 | 144 | # If true, SmartyPants will be used to convert quotes and dashes to 145 | # typographically correct entities. 146 | #html_use_smartypants = True 147 | 148 | # Custom sidebar templates, maps document names to template names. 149 | #html_sidebars = {} 150 | 151 | # Additional templates that should be rendered to pages, maps page names to 152 | # template names. 153 | #html_additional_pages = {} 154 | 155 | # If false, no module index is generated. 156 | #html_domain_indices = True 157 | 158 | # If false, no index is generated. 159 | #html_use_index = True 160 | 161 | # If true, the index is split into individual pages for each letter. 162 | #html_split_index = False 163 | 164 | # If true, links to the reST sources are added to the pages. 165 | #html_show_sourcelink = True 166 | 167 | # If true, "Created using Sphinx" is shown in the HTML footer. Default is True. 168 | #html_show_sphinx = True 169 | 170 | # If true, "(C) Copyright ..." is shown in the HTML footer. Default is True. 171 | #html_show_copyright = True 172 | 173 | # If true, an OpenSearch description file will be output, and all pages will 174 | # contain a tag referring to it. The value of this option must be the 175 | # base URL from which the finished HTML is served. 176 | #html_use_opensearch = '' 177 | 178 | # This is the file name suffix for HTML files (e.g. ".xhtml"). 179 | #html_file_suffix = None 180 | 181 | # Output file base name for HTML help builder. 182 | htmlhelp_basename = 'scrapekitdoc' 183 | 184 | 185 | # -- Options for LaTeX output --------------------------------------------- 186 | 187 | latex_elements = { 188 | # The paper size ('letterpaper' or 'a4paper'). 189 | #'papersize': 'letterpaper', 190 | 191 | # The font size ('10pt', '11pt' or '12pt'). 192 | #'pointsize': '10pt', 193 | 194 | # Additional stuff for the LaTeX preamble. 195 | #'preamble': '', 196 | } 197 | 198 | # Grouping the document tree into LaTeX files. List of tuples 199 | # (source start file, target name, title, 200 | # author, documentclass [howto, manual, or own class]). 201 | latex_documents = [ 202 | ('index', 'scrapekit.tex', u'scrapekit Documentation', 203 | u'Friedrich Lindenberg', 'manual'), 204 | ] 205 | 206 | # The name of an image file (relative to this directory) to place at the top of 207 | # the title page. 208 | #latex_logo = None 209 | 210 | # For "manual" documents, if this is true, then toplevel headings are parts, 211 | # not chapters. 212 | #latex_use_parts = False 213 | 214 | # If true, show page references after internal links. 215 | #latex_show_pagerefs = False 216 | 217 | # If true, show URL addresses after external links. 218 | #latex_show_urls = False 219 | 220 | # Documents to append as an appendix to all manuals. 
221 | #latex_appendices = [] 222 | 223 | # If false, no module index is generated. 224 | #latex_domain_indices = True 225 | 226 | 227 | # -- Options for manual page output --------------------------------------- 228 | 229 | # One entry per manual page. List of tuples 230 | # (source start file, name, description, authors, manual section). 231 | man_pages = [ 232 | ('index', 'scrapekit', u'scrapekit Documentation', 233 | [u'Friedrich Lindenberg'], 1) 234 | ] 235 | 236 | # If true, show URL addresses after external links. 237 | #man_show_urls = False 238 | 239 | 240 | # -- Options for Texinfo output ------------------------------------------- 241 | 242 | # Grouping the document tree into Texinfo files. List of tuples 243 | # (source start file, target name, title, author, 244 | # dir menu entry, description, category) 245 | texinfo_documents = [ 246 | ('index', 'scrapekit', u'scrapekit Documentation', 247 | u'Friedrich Lindenberg', 'scrapekit', 'One line description of project.', 248 | 'Miscellaneous'), 249 | ] 250 | 251 | # Documents to append as an appendix to all manuals. 252 | #texinfo_appendices = [] 253 | 254 | # If false, no module index is generated. 255 | #texinfo_domain_indices = True 256 | 257 | # How to display URL addresses: 'footnote', 'no', or 'inline'. 258 | #texinfo_show_urls = 'footnote' 259 | 260 | # If true, do not generate a @detailmenu in the "Top" node's menu. 261 | #texinfo_no_detailmenu = False 262 | -------------------------------------------------------------------------------- /docs/config.rst: -------------------------------------------------------------------------------- 1 | Configuration 2 | ============= 3 | 4 | Scrapekit supports a broad range of configuration options, which can be 5 | set either via a configuration file, environment variables or 6 | programmatically at run-time. 7 | 8 | 9 | Configuration methods 10 | --------------------- 11 | 12 | As a first source of settings, scrapekit will attempt to read a per-user 13 | configuration file, ``~/.scrapekit.ini``. Inside the ini file, two 14 | sections will be read: ``[scrapekit]`` is expected to hold general 15 | settings, while a section, named after the current scraper, can be used to 16 | adapt these settings: 17 | 18 | .. code-block:: ini 19 | 20 | [scrapekit] 21 | reports_path = /var/www/scrapers/reports 22 | 23 | [craigslist-sf-boats] 24 | threads = 5 25 | 26 | After evaluating these settings, environment variables will be read (see 27 | below for their names). Finally, all of these settings will be overridden 28 | by any configuration provided to the constructor of 29 | :py:class:`Scraper `. 30 | 31 | 32 | Available settings 33 | ------------------ 34 | 35 | ============ ====================== ==================================== 36 | Name Environment variable Description 37 | ============ ====================== ==================================== 38 | threads SCRAPEKIT_THREADS Number of threads to be started. 39 | cache_policy SCRAPEKIT_CACHE_POLICY Policy for caching requests. Valid 40 | values are ``disable`` (no caching), 41 | ``http`` (cache according to HTTP 42 | header semantics) and ``force``, to 43 | force local storage and re-use of 44 | any requests. 45 | data_path SCRAPEKIT_DATA_PATH A storage directory for cached data 46 | from HTTP requests. This is set to 47 | be a temporary directory by default, 48 | which means caching will not work. 49 | reports_path SCRAPEKIT_REPORTS_PATH A directory to hold log files and - 50 | if generated - the reports for this 51 | scraper. 
52 | ============ ====================== ==================================== 53 | 54 | 55 | Custom settings 56 | --------------- 57 | 58 | The scraper configuration is not limited to loading the settings 59 | indicated above. Hence, custom configuration settings (e.g. for site 60 | credentials) can be added to the ini file and then retrieved from the 61 | ``config`` attribute of a :py:class:`Scraper ` 62 | instance. 63 | -------------------------------------------------------------------------------- /docs/index.rst: -------------------------------------------------------------------------------- 1 | .. scrapekit documentation master file, created by 2 | sphinx-quickstart on Wed Aug 6 15:12:35 2014. 3 | You can adapt this file completely to your liking, but it should at least 4 | contain the root `toctree` directive. 5 | 6 | scrapekit: get the data you need, fast. 7 | ======================================= 8 | 9 | .. toctree:: 10 | :hidden: 11 | 12 | Many web sites expose a great amount of data, and scraping it can help you build 13 | useful tools, services and analysis on top of that data. This can often be done 14 | with a simple Python script, using few external libraries. 15 | 16 | As your script grows, however, you will want to add more advanced features, such 17 | as **caching** of the downloaded pages, **multi-threading** to fetch many pieces 18 | of content at once, and **logging** to get a clear sense of which data failed to 19 | parse. 20 | 21 | Scrapekit provides a set of useful tools for these that help with these tasks, 22 | while also offering you simple ways to structure your scraper. This helps you to 23 | produce **fast, reliable and structured scraper scripts**. 24 | 25 | 26 | Example 27 | ------- 28 | 29 | Below is a simple scraper for postings on Craigslist. This will use 30 | multiple threads and request caching by default. 31 | 32 | .. code-block:: python 33 | 34 | import scrapekit 35 | from urlparse import urljoin 36 | 37 | scraper = scrapekit.Scraper('craigslist-sf-boats') 38 | 39 | @scraper.task 40 | def scrape_listing(url): 41 | doc = scraper.get(url).html() 42 | print(doc.find('.//h2[@class="postingtitle"]').text_content()) 43 | 44 | 45 | @scraper.task 46 | def scrape_index(url): 47 | doc = scraper.get(url).html() 48 | 49 | for listing in doc.findall('.//a[@class="hdrlnk"]'): 50 | listing_url = urljoin(url, listing.get('href')) 51 | scrape_listing.queue(listing_url) 52 | 53 | scrape_index.run('https://sfbay.craigslist.org/boo/') 54 | 55 | By default, this save cache data to a the working directory, in a folder called 56 | ``data``. 57 | 58 | 59 | Reporting 60 | --------- 61 | 62 | Upon completion, the scraper will also generate an HTML report that presents 63 | information about each task run within the scraper. 64 | 65 | .. image:: http://cl.ly/image/1J2o2T43422e/Screen%20Shot%202014-08-26%20at%2015.58.03.png 66 | 67 | This behaviour can be disabled by passing ``report=False`` to the constructor of 68 | the scraper. 69 | 70 | 71 | Contents 72 | -------- 73 | 74 | .. toctree:: 75 | :maxdepth: 2 76 | 77 | install 78 | quickstart 79 | tasks 80 | cache 81 | utils 82 | config 83 | api 84 | 85 | 86 | Contributors 87 | ------------ 88 | 89 | ``scrapekit`` is written and maintained by `Friedrich Lindenberg 90 | `_. 
It was developed as an outcome of scraping projects 91 | for the `African Network of Centers for Investigative Reporting (ANCIR) 92 | `_, supported by a `Knight International 93 | Journalism Fellowship `_ from the `International 94 | Center for Journalists (ICFJ) `_. 95 | 96 | Indices and tables 97 | ================== 98 | 99 | * :ref:`genindex` 100 | * :ref:`modindex` 101 | * :ref:`search` 102 | 103 | -------------------------------------------------------------------------------- /docs/install.rst: -------------------------------------------------------------------------------- 1 | Installation Guide 2 | ================== 3 | 4 | The easiest way is to install ``scrapekit`` via the `Python Package Index `_ using ``pip`` or ``easy_install``: 5 | 6 | .. code-block:: bash 7 | 8 | $ pip install scrapekit 9 | 10 | To install it manually simply download the repository from Github: 11 | 12 | .. code-block:: bash 13 | 14 | $ git clone git://github.com/pudo/scrapekit.git 15 | $ cd scrapekit/ 16 | $ python setup.py install 17 | -------------------------------------------------------------------------------- /docs/quickstart.rst: -------------------------------------------------------------------------------- 1 | Quickstart 2 | ========== 3 | 4 | Welcome to the scrapekit quickstart tutorial. In the following section, 5 | I'll show you how to write a simple scraper using the functions in 6 | scrapekit. 7 | 8 | Like many people, I've had a life-long, hidden desire to become a sail 9 | boat captain. To help me live the dream, we'll start by scraping 10 | `Craigslist boat sales in San Francisco `_. 11 | 12 | 13 | Getting started 14 | --------------- 15 | 16 | First, let's make a simple Python module, e.g. in a file called 17 | ``scrape_boats.py``. 18 | 19 | .. code-block:: python 20 | 21 | import scrapekit 22 | 23 | scraper = scrapekit.Scraper('craigslist-sf-boats') 24 | 25 | The first thing we've done is to instantiate a scraper and to give it 26 | a name. The name will later be used to configure the scraper and to 27 | read it's log ouput. Next, let's scrape our first page: 28 | 29 | .. code-block:: python 30 | 31 | from urlparse import urljoin 32 | 33 | @scraper.task 34 | def scrape_index(url): 35 | doc = scraper.get(url).html() 36 | 37 | next_link = doc.find('.//a[@class="button next"]') 38 | if next_link is not None: 39 | # make an absolute url. 40 | next_url = urljoin(url, next_link.get('href')) 41 | scrape_index.queue(next_url) 42 | 43 | scrape_index.run('https://sfbay.craigslist.org/boo/') 44 | 45 | This code will cycle through all the pages of listings, as long as 46 | a *Next* link is present. 47 | 48 | The key aspect of this snippet is the notion of a :py:class:`task 49 | `. Each scrapekit scraper is broken up into 50 | many small tasks, ideally one for fetching each web page. 51 | 52 | Tasks are executed in parallel to speed up the scraper. To do that, 53 | task functions aren't called directly, but by placing them on a 54 | queue (see :py:func:`scrape_index.queue ` 55 | above). Like normal functions, they can still receive arguments - 56 | in this case, the URL to be scraped. 57 | 58 | At the end of the snippet, we're calling :py:func:`scrape_index.run 59 | `. Unlike a simple queueing operation, this 60 | will tell the scraper to queue a task and then wait for all tasks to 61 | be executed. 
62 | 63 | 64 | Scraping details 65 | ---------------- 66 | 67 | Now that we have a basic task to scrape the index of listings, we 68 | might want to download each listing's page and get some data from it. 69 | To do this, we can extend our previous script: 70 | 71 | .. code-block:: python 72 | 73 | import scrapekit 74 | from urlparse import urljoin 75 | 76 | scraper = scrapekit.Scraper('craigslist-sf-boats') 77 | 78 | @scraper.task 79 | def scrape_listing(url): 80 | doc = scraper.get(url).html() 81 | print(doc.find('.//h2[@class="postingtitle"]').text_content()) 82 | 83 | 84 | @scraper.task 85 | def scrape_index(url): 86 | doc = scraper.get(url).html() 87 | 88 | for listing in doc.findall('.//a[@class="hdrlnk"]'): 89 | listing_url = urljoin(url, listing.get('href')) 90 | scrape_listing.queue(listing_url) 91 | 92 | next_link = doc.find('.//a[@class="button next"]') 93 | if next_link is not None: 94 | # make an absolute url. 95 | next_url = urljoin(url, next_link.get('href')) 96 | scrape_index.queue(next_url) 97 | 98 | scrape_index.run('https://sfbay.craigslist.org/boo/') 99 | 100 | This basic scraper could be extended to extract more information from 101 | each listing page, and to save that information to a set of files or 102 | to a database. 103 | 104 | 105 | Configuring the scraper 106 | ----------------------- 107 | 108 | As you may have noticed, Craigslist is sometimes a bit slow. You might 109 | want to configure your scraper to use caching, or a different number 110 | of simultaneous threads to retrieve data. The simplest way to set up 111 | caching is to set some environment variables: 112 | 113 | .. code-block:: bash 114 | 115 | $ export SCRAPEKIT_CACHE_POLICY="http" 116 | $ export SCRAPEKIT_DATA_PATH="data" 117 | $ export SCRAPEKIT_THREADS=10 118 | 119 | This will instruct scrapekit to cache requests according to the rules 120 | of HTTP (using headers like ``Cache-Control`` to determine what to cache 121 | and for how long), and to save downloaded data in a directory called 122 | ``data`` in the current working path. We've also instructed the tool to 123 | use 10 threads when scraping data. 124 | 125 | If you wanto to make these decisions at run-time, you could also pass 126 | them into the constructor of your :py:class:`Scraper 127 | `: 128 | 129 | .. code-block:: python 130 | 131 | import scrapekit 132 | 133 | config = { 134 | 'threads': 10, 135 | 'cache_policy': 'http', 136 | 'data_path': 'data' 137 | } 138 | scraper = scrapekit.Scraper('demo', config=config) 139 | 140 | For details on all available settings and their meaning, check out the 141 | configuration documentation. 142 | -------------------------------------------------------------------------------- /docs/tasks.rst: -------------------------------------------------------------------------------- 1 | Using tasks 2 | =========== 3 | 4 | Tasks are used by scrapekit to break up a complex script into small 5 | units of work which can be executed asynchronously. When needed, 6 | they can also be composed in a variety of ways to generate complex 7 | data processing pipelines. 8 | 9 | 10 | Explicit queueing 11 | ----------------- 12 | 13 | The most simple way of using tasks is by explicitly queueing them. 14 | Here's an example of a task queueing another task a few times: 15 | 16 | .. 
code-block:: python 17 | 18 | import scrapekit 19 | 20 | scraper = scrapekit.Scraper('test') 21 | 22 | @scraper.task 23 | def each_item(item): 24 | print(item) 25 | 26 | @scraper.task 27 | def generate_work(): 28 | for i in xrange(100): 29 | each_item.queue(i) 30 | 31 | if __name__ == '__main__': 32 | generate_work.queue().wait() 33 | 34 | 35 | As you can see, ``generate_work`` will call ``each_item`` for each 36 | item in the range. Since the items are processed asynchronously, 37 | the printed output will not be in order, but slightly mixed up. 38 | 39 | You can also see that on the last line, we're queueing the 40 | ``generate_work`` task itself, and then instructing scrapekit to 41 | wait for the completion of all tasks. Since the double call is a 42 | bit awkward, there's a helper function to make both calls at once: 43 | 44 | .. code-block:: python 45 | 46 | if __name__ == '__main__': 47 | generate_work.run() 48 | 49 | 50 | Task chaining and piping 51 | ------------------------ 52 | 53 | As an alternative to these explicit instructions to queue, you can 54 | also use a more pythonic model to declare processing pipelines. A 55 | processing pipeline connects tasks by feeding the output of one task 56 | to another task. 57 | 58 | To connect tasks, there are two methods: chaining and piping. Chaining 59 | will just take the return value of one task, and queue another task 60 | to process it. Piping, on the other hand, will expect the return value 61 | of the first task to be an iterable, or for the task itself to be a 62 | generator. It will then initiate the next task for each item in the 63 | sequence. 64 | 65 | Let's assume we have these functions defined: 66 | 67 | .. code-block:: python 68 | 69 | import scrapekit 70 | 71 | scraper = scrapekit.Scraper('test') 72 | 73 | @scraper.task 74 | def consume_item(item): 75 | print(item) 76 | 77 | @scraper.task 78 | def process_item(item): 79 | return item ** 3 80 | 81 | @scraper.task 82 | def generate_items(): 83 | for i in xrange(100): 84 | yield i 85 | 86 | 87 | The simplest link we could do would be this simple chaining: 88 | 89 | .. code-block:: python 90 | 91 | pipline = process_item > consume_item 92 | pipeline.run(5) 93 | 94 | This linked ``process_item`` to ``consume_item``. Similarly, we could 95 | use a very simple pipe: 96 | 97 | .. code-block:: python 98 | 99 | pipline = generate_items | consume_item 100 | pipeline.run() 101 | 102 | Finally, we can link all of the functions together: 103 | 104 | .. code-block:: python 105 | 106 | pipline = generate_items | process_item > consume_item 107 | pipeline.run() 108 | 109 | -------------------------------------------------------------------------------- /docs/utils.rst: -------------------------------------------------------------------------------- 1 | Utility functions 2 | ================= 3 | 4 | These helper functions are intended to support everyday tasks of 5 | data scraping, such as string sanitization, and validation. 6 | 7 | .. 
automodule:: scrapekit.util 8 | :members: 9 | -------------------------------------------------------------------------------- /requirements.txt: -------------------------------------------------------------------------------- 1 | sphinx_rtd_theme 2 | -------------------------------------------------------------------------------- /scrapekit/__init__.py: -------------------------------------------------------------------------------- 1 | from scrapekit.core import Scraper 2 | -------------------------------------------------------------------------------- /scrapekit/config.py: -------------------------------------------------------------------------------- 1 | import os 2 | import multiprocessing 3 | try: 4 | from ConfigParser import SafeConfigParser 5 | except ImportError: 6 | from configparser import ConfigParser as SafeConfigParser 7 | 8 | 9 | class Config(object): 10 | """ An object to load and represent the configuration of the current 11 | scraper. This loads scraper configuration from the environment and a 12 | per-user configuration file (``~/.scraperkit.ini``). """ 13 | 14 | def __init__(self, scraper, config): 15 | self.scraper = scraper 16 | self.config = self._get_defaults() 17 | self.config = self._get_file(self.config) 18 | self.config = self._get_env(self.config) 19 | if config is not None: 20 | self.config.update(config) 21 | 22 | def _get_defaults(self): 23 | name = self.scraper.name 24 | return { 25 | 'cache_policy': 'http', 26 | 'threads': multiprocessing.cpu_count() * 2, 27 | 'data_path': os.path.join(os.getcwd(), 'data', name), 28 | 'reports_path': None 29 | } 30 | 31 | def _get_env(self, config): 32 | """ Read environment variables based on the settings defined in 33 | the defaults. These are expected to be upper-case versions of 34 | the actual setting names, prefixed by ``SCRAPEKIT_``. """ 35 | for option, value in config.items(): 36 | env_name = 'SCRAPEKIT_%s' % option.upper() 37 | value = os.environ.get(env_name, value) 38 | config[option] = value 39 | return config 40 | 41 | def _get_file(self, config): 42 | """ Read a per-user .ini file, which is expected to have either 43 | a ``[scraperkit]`` or a ``[$SCRAPER_NAME]`` section. """ 44 | config_file = SafeConfigParser() 45 | config_file.read([os.path.expanduser('~/.scrapekit.ini')]) 46 | if config_file.has_section('scrapekit'): 47 | config.update(dict(config_file.items('scrapekit'))) 48 | if config_file.has_section(self.scraper.name): 49 | config.update(dict(config_file.items(self.scraper.name))) 50 | return config 51 | 52 | def items(self): 53 | return self.config.items() 54 | 55 | def __getattr__(self, name): 56 | if name != 'config' and name in self.config: 57 | return self.config.get(name) 58 | try: 59 | return object.__getattribute__(self, name) 60 | except AttributeError: 61 | return None 62 | -------------------------------------------------------------------------------- /scrapekit/core.py: -------------------------------------------------------------------------------- 1 | import os 2 | import atexit 3 | from uuid import uuid4 4 | from datetime import datetime 5 | from threading import local 6 | 7 | from scrapekit.config import Config 8 | from scrapekit.tasks import TaskManager, Task 9 | from scrapekit.http import make_session 10 | from scrapekit.logs import make_logger 11 | from scrapekit import reporting 12 | 13 | 14 | class Scraper(object): 15 | """ Scraper application object which handles resource management 16 | for a variety of related functions. 
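    A short usage sketch, mirroring the README example (the task body is
    illustrative only)::

        scraper = Scraper('example')

        @scraper.task
        def fetch(url):
            return scraper.get(url).html()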
""" 17 | 18 | def __init__(self, name, config=None, report=False): 19 | self.name = name 20 | self.id = uuid4() 21 | self.start_time = datetime.utcnow() 22 | self.config = Config(self, config) 23 | try: 24 | os.makedirs(self.config.data_path) 25 | except: 26 | pass 27 | self._task_manager = None 28 | self.task_ctx = local() 29 | self.log = make_logger(self) 30 | 31 | self.session = self.Session() 32 | if report: 33 | atexit.register(self.report) 34 | 35 | @property 36 | def task_manager(self): 37 | if self._task_manager is None: 38 | self._task_manager = \ 39 | TaskManager(threads=self.config.threads) 40 | return self._task_manager 41 | 42 | def task(self, fn): 43 | """ Decorate a function as a task in the scraper framework. 44 | This will enable the function to be queued and executed in 45 | a separate thread, allowing for the execution of the scraper 46 | to be asynchronous. 47 | """ 48 | return Task(self, fn) 49 | 50 | def Session(self): 51 | """ Create a pre-configured ``requests`` session instance 52 | that can be used to run HTTP requests. This instance will 53 | potentially be cached, or a stub, depending on the 54 | configuration of the scraper. """ 55 | return make_session(self) 56 | 57 | def head(self, url, **kwargs): 58 | """ HTTP HEAD via ``requests``. 59 | 60 | See: http://docs.python-requests.org/en/latest/api/#requests.head 61 | """ 62 | return self.Session().get(url, **kwargs) 63 | 64 | def get(self, url, **kwargs): 65 | """ HTTP GET via ``requests``. 66 | 67 | See: http://docs.python-requests.org/en/latest/api/#requests.get 68 | """ 69 | return self.Session().get(url, **kwargs) 70 | 71 | def post(self, url, **kwargs): 72 | """ HTTP POST via ``requests``. 73 | 74 | See: http://docs.python-requests.org/en/latest/api/#requests.post 75 | """ 76 | return self.Session().post(url, **kwargs) 77 | 78 | def put(self, url, **kwargs): 79 | """ HTTP PUT via ``requests``. 80 | 81 | See: http://docs.python-requests.org/en/latest/api/#requests.put 82 | """ 83 | return self.Session().put(url, **kwargs) 84 | 85 | def report(self): 86 | """ Generate a static HTML report for the last runs of the 87 | scraper from its log file. """ 88 | index_file = reporting.generate(self) 89 | print("Report available at: file://%s" % index_file) 90 | 91 | def __repr__(self): 92 | return '' % self.name 93 | -------------------------------------------------------------------------------- /scrapekit/exc.py: -------------------------------------------------------------------------------- 1 | 2 | 3 | class ScraperException(Exception): 4 | """ Generic scraper exception, the base for all other exceptions. 5 | """ 6 | 7 | 8 | class WrappedMixIn(): 9 | """ Mix-in for wrapped exceptions. """ 10 | 11 | def __init__(self, wrapped): 12 | self.wrapped = wrapped 13 | self.message = wrapped.message 14 | 15 | def __repr__(self): 16 | name = self.__class__.__name__ 17 | return '<%s(%s)>' % (name, self.wrapped) 18 | 19 | 20 | class DependencyException(ScraperException, WrappedMixIn): 21 | """ Triggered when an operation would require the installation 22 | of further dependencies. """ 23 | 24 | 25 | class ParseException(ScraperException, WrappedMixIn): 26 | """ Triggered when parsing an HTTP response into the desired 27 | format (e.g. an HTML DOM, or JSON) is not possible. 
""" 28 | -------------------------------------------------------------------------------- /scrapekit/http.py: -------------------------------------------------------------------------------- 1 | import os 2 | 3 | import requests 4 | from cachecontrol.adapter import CacheControlAdapter 5 | from cachecontrol.caches import FileCache 6 | from cachecontrol.controller import CacheController 7 | 8 | from scrapekit.exc import DependencyException, ParseException 9 | 10 | 11 | CARRIER_HEADER = 'X-Scrapekit-Cache-Policy' 12 | 13 | 14 | class ScraperResponse(requests.Response): 15 | """ A modified scraper response that can parse the content into 16 | HTML, XML, JSON or a BeautifulSoup instance. """ 17 | 18 | def html(self): 19 | """ Create an ``lxml``-based HTML DOM from the response. The tree 20 | will not have a root, so all queries need to be relative 21 | (i.e. start with a dot). 22 | """ 23 | try: 24 | from lxml import html 25 | return html.fromstring(self.content) 26 | except ImportError as ie: 27 | raise DependencyException(ie) 28 | 29 | def xml(self): 30 | """ Create an ``lxml``-based XML DOM from the response. The tree 31 | will not have a root, so all queries need to be relative 32 | (i.e. start with a dot). 33 | """ 34 | try: 35 | from lxml import etree 36 | return etree.fromstring(self.content) 37 | except ImportError as ie: 38 | raise DependencyException(ie) 39 | 40 | def json(self, **kwargs): 41 | """ Create JSON object out of the response. """ 42 | try: 43 | return super(ScraperResponse, self).json(**kwargs) 44 | except ValueError as ve: 45 | raise ParseException(ve) 46 | 47 | 48 | class ScraperSession(requests.Session): 49 | """ Sub-class requests session to be able to introduce additional 50 | state to sessions and responses. """ 51 | 52 | def request(self, method, url, cache=None, **kwargs): 53 | # decide the cache policy and place it in a fake HTTP header 54 | cache_policy = cache or self.cache_policy 55 | if 'headers' not in kwargs: 56 | kwargs['headers'] = {} 57 | kwargs['headers'][CARRIER_HEADER] = cache_policy 58 | 59 | # TODO: put UA fakery here. 60 | 61 | orig = super(ScraperSession, self).request(method, url, **kwargs) 62 | 63 | # log request details to the JSON log 64 | self.scraper.log.debug("%s %s", method, url, extra={ 65 | 'reqMethod': method, 66 | 'reqUrl': url, 67 | 'reqArgs': kwargs 68 | }) 69 | 70 | # Cast the response into our own subclass which has HTML/XML 71 | # parsing support. 72 | response = ScraperResponse() 73 | response.__setstate__(orig.__getstate__()) 74 | return response 75 | 76 | 77 | class PolicyCacheController(CacheController): 78 | """ Switch the caching mode based on the caching policy provided by 79 | request, which in turn can be given at request time or through the 80 | scraper configuration. """ 81 | 82 | def cached_request(self, request): 83 | cache_policy = request.headers.pop(CARRIER_HEADER, 'none') 84 | if cache_policy == 'force' or cache_policy is True: 85 | # Force using the cache, even if HTTP semantics forbid it. 86 | cache_url = self.cache_url(request.url) 87 | resp = self.serializer.loads(request, self.cache.get(cache_url)) 88 | return resp or False 89 | elif cache_policy == 'http': 90 | return super(PolicyCacheController, self).cached_request(request) 91 | else: 92 | return False 93 | 94 | 95 | def make_session(scraper): 96 | """ Instantiate a session with the desired configuration parameters, 97 | including the cache policy. 
""" 98 | cache_path = os.path.join(scraper.config.data_path, 'cache') 99 | cache_policy = scraper.config.cache_policy 100 | cache_policy = cache_policy.lower().strip() 101 | session = ScraperSession() 102 | session.scraper = scraper 103 | session.cache_policy = cache_policy 104 | 105 | adapter = CacheControlAdapter( 106 | FileCache(cache_path), 107 | cache_etags=True, 108 | controller_class=PolicyCacheController 109 | ) 110 | session.mount('http://', adapter) 111 | session.mount('https://', adapter) 112 | return session 113 | -------------------------------------------------------------------------------- /scrapekit/logs.py: -------------------------------------------------------------------------------- 1 | import os 2 | import logging 3 | 4 | try: 5 | import jsonlogger 6 | except ImportError: 7 | # python-json-logger version 0.1.0 has changed the import structure 8 | from pythonjsonlogger import jsonlogger 9 | 10 | 11 | class TaskAdapter(logging.LoggerAdapter): 12 | """ Enhance any log messages with extra information about the 13 | current context of the scraper. """ 14 | 15 | def __init__(self, logger, scraper): 16 | super(TaskAdapter, self).__init__(logger, {}) 17 | self.scraper = scraper 18 | 19 | def process(self, msg, kwargs): 20 | extra = kwargs.get('extra', {}) 21 | extra['scraperName'] = self.scraper.name 22 | extra['scraperId'] = self.scraper.id 23 | if hasattr(self.scraper.task_ctx, 'name'): 24 | extra['taskName'] = self.scraper.task_ctx.name 25 | if hasattr(self.scraper.task_ctx, 'id'): 26 | extra['taskId'] = self.scraper.task_ctx.id 27 | extra['scraperStartTime'] = self.scraper.start_time 28 | kwargs['extra'] = extra 29 | return (msg, kwargs) 30 | 31 | 32 | def make_json_format(): 33 | supported_keys = ['asctime', 'created', 'filename', 'funcName', 34 | 'levelname', 'levelno', 'lineno', 'module', 35 | 'msecs', 'message', 'name', 'pathname', 36 | 'process', 'processName', 'relativeCreated', 37 | 'thread', 'threadName'] 38 | log_format = lambda x: ['%({0:s})'.format(i) for i in x] 39 | return ' '.join(log_format(supported_keys)) 40 | 41 | 42 | def log_path(scraper): 43 | """ Determine the file name for the JSON log. """ 44 | return os.path.join(scraper.config.data_path, 45 | '%s.jsonlog' % scraper.name) 46 | 47 | 48 | def make_logger(scraper): 49 | """ Create two log handlers, one to output info-level ouput to the 50 | console, the other to store all logging in a JSON file which will 51 | later be used to generate reports. 
""" 52 | 53 | logger = logging.getLogger('') 54 | logger.setLevel(logging.DEBUG) 55 | 56 | requests_log = logging.getLogger("requests") 57 | requests_log.setLevel(logging.WARNING) 58 | 59 | json_handler = logging.FileHandler(log_path(scraper)) 60 | json_handler.setLevel(logging.DEBUG) 61 | json_formatter = jsonlogger.JsonFormatter(make_json_format()) 62 | json_handler.setFormatter(json_formatter) 63 | logger.addHandler(json_handler) 64 | 65 | console_handler = logging.StreamHandler() 66 | console_handler.setLevel(logging.INFO) 67 | fmt = '%(name)s [%(levelname)-8s]: %(message)s' 68 | formatter = logging.Formatter(fmt) 69 | console_handler.setFormatter(formatter) 70 | logger.addHandler(console_handler) 71 | 72 | logger = logging.getLogger(scraper.name) 73 | logger = TaskAdapter(logger, scraper) 74 | return logger 75 | -------------------------------------------------------------------------------- /scrapekit/reporting/__init__.py: -------------------------------------------------------------------------------- 1 | from os import path 2 | 3 | from scrapekit.reporting import db 4 | from scrapekit.reporting import render 5 | 6 | 7 | RUNS_QUERY = """ 8 | SELECT scraperId, scraperStartTime, levelname, COUNT(rowid) as messages, 9 | COUNT(DISTINCT taskId) AS tasks 10 | FROM log 11 | GROUP BY scraperId, scraperStartTime, levelname 12 | ORDER BY scraperStartTime DESC 13 | """ 14 | 15 | TASKS_QUERY = """ 16 | SELECT scraperId, scraperStartTime, levelname, taskName, 17 | COUNT(rowid) as messages, COUNT(DISTINCT taskId) AS tasks 18 | FROM log 19 | GROUP BY scraperId, scraperStartTime, levelname, taskName 20 | ORDER BY scraperId DESC, taskName DESC 21 | """ 22 | 23 | TASK_RUNS_QUERY = """ 24 | SELECT taskId, scraperId, taskName, asctime, levelname, 25 | COUNT(rowid) as messages, 26 | COUNT(DISTINCT taskId) AS tasks 27 | FROM log 28 | WHERE scraperId = :scraperId AND taskName = :taskName 29 | GROUP BY taskId, scraperId, taskName, asctime, levelname 30 | ORDER BY messages DESC 31 | """ 32 | 33 | TASK_RUNS_QUERY_NULL = """ 34 | SELECT taskId, scraperId, taskName, asctime, levelname, 35 | COUNT(rowid) as messages, 36 | COUNT(DISTINCT taskId) AS tasks 37 | FROM log 38 | WHERE scraperId = :scraperId AND taskName IS NULL 39 | GROUP BY taskId, scraperId, taskName, asctime, levelname 40 | ORDER BY messages DESC 41 | """ 42 | 43 | TASK_RUNS_LIST = """ 44 | SELECT DISTINCT scraperId, taskName FROM log ORDER BY asctime DESC 45 | """ 46 | 47 | 48 | def aggregate_loglevels(sql, keys, **kwargs): 49 | data, key = {}, None 50 | for row in db.query(sql, **kwargs): 51 | row_key = map(lambda k: row[k], keys) 52 | if key != row_key: 53 | if key is not None: 54 | yield data 55 | data = row 56 | key = row_key 57 | else: 58 | data['messages'] += row['messages'] 59 | data['tasks'] += row['tasks'] 60 | data[row['levelname']] = row['messages'] 61 | if key is not None: 62 | yield data 63 | 64 | 65 | def sort_aggregates(rows): 66 | rows = list(rows) 67 | 68 | def key(row): 69 | return row.get('ERROR', 0) * (len(rows) * 2) + \ 70 | row.get('WARN', 0) * len(rows) + \ 71 | row.get('INFO', 0) * 2 + \ 72 | row.get('DEBUG', 0) 73 | return sorted(rows, key=key) 74 | 75 | 76 | def all_task_runs(scraper, keys=('scraperId', 'taskId')): 77 | by_task = {} 78 | for row in db.log_parse(scraper): 79 | asctime = row.get('asctime') 80 | row['ts'] = '-' if asctime is None else asctime.rsplit(' ')[-1] 81 | row_key = tuple(map(lambda k: row.get(k), keys)) 82 | if row_key not in by_task: 83 | by_task[row_key] = [row] 84 | else: 85 | 
by_task[row_key].append(row) 86 | return by_task 87 | 88 | 89 | def generate(scraper): 90 | db.load(scraper) 91 | runs = list(aggregate_loglevels(RUNS_QUERY, ('scraperId',))) 92 | tasks = list(aggregate_loglevels(TASKS_QUERY, ('scraperId', 'taskName'))) 93 | index_file = render.paginate(scraper, runs, 'index%s.html', 'index.html', 94 | tasks=tasks) 95 | 96 | for task_run in db.query(TASK_RUNS_LIST): 97 | task = task_run.get('taskName') or render.PADDING 98 | file_name = '%s/%s/index%%s.html' % (task, task_run.get('scraperId')) 99 | if path.exists(path.join(path.dirname(index_file), file_name % '')): 100 | continue 101 | if task_run.get('taskName') is None: 102 | runs = aggregate_loglevels(TASK_RUNS_QUERY_NULL, ('taskId',), 103 | scraperId=task_run.get('scraperId')) 104 | else: 105 | runs = aggregate_loglevels(TASK_RUNS_QUERY, ('taskId',), 106 | scraperId=task_run.get('scraperId'), 107 | taskName=task_run.get('taskName')) 108 | runs = sort_aggregates(runs) 109 | render.paginate(scraper, runs, file_name, 'task_run_list.html', 110 | taskName=task) 111 | 112 | for (scraperId, taskId), rows in all_task_runs(scraper).items(): 113 | taskName = rows[0].get('taskName') 114 | file_name = (taskName or render.PADDING, 115 | scraperId or render.PADDING, 116 | taskId or render.PADDING) 117 | file_name = '%s/%s/%s%%s.html' % file_name 118 | if path.exists(path.join(path.dirname(index_file), file_name % '')): 119 | continue 120 | render.paginate(scraper, rows, file_name, 'task_run_item.html', 121 | scraperId=scraperId, taskId=taskId, taskName=taskName) 122 | 123 | return index_file 124 | -------------------------------------------------------------------------------- /scrapekit/reporting/db.py: -------------------------------------------------------------------------------- 1 | import json 2 | import sqlite3 3 | 4 | from scrapekit.logs import log_path 5 | 6 | 7 | conn = sqlite3.connect(':memory:') 8 | 9 | 10 | def dict_factory(cursor, row): 11 | d = {} 12 | for idx, col in enumerate(cursor.description): 13 | d[col[0]] = row[idx] 14 | return d 15 | 16 | 17 | def log_parse(scraper): 18 | path = log_path(scraper) 19 | with open(path, 'r') as fh: 20 | for line in fh: 21 | data = json.loads(line) 22 | if data.get('scraperName') != scraper.name: 23 | continue 24 | yield data 25 | 26 | 27 | def load(scraper): 28 | conn.row_factory = dict_factory 29 | conn.execute("""CREATE TABLE IF NOT EXISTS log (scraperId text, 30 | taskName text, scraperStartTime datetime, asctime text, 31 | levelname text, taskId text)""") 32 | conn.commit() 33 | for data in log_parse(scraper): 34 | conn.execute("""INSERT INTO log (scraperId, taskName, 35 | scraperStartTime, asctime, levelname, taskId) VALUES 36 | (?, ?, ?, ?, ?, ?)""", 37 | (data.get('scraperId'), data.get('taskName'), 38 | data.get('scraperStartTime'), data.get('asctime'), 39 | data.get('levelname'), data.get('taskId'))) 40 | conn.commit() 41 | 42 | 43 | def query(sql, **kwargs): 44 | rp = conn.execute(sql, kwargs) 45 | for row in rp.fetchall(): 46 | yield row 47 | -------------------------------------------------------------------------------- /scrapekit/reporting/render.py: -------------------------------------------------------------------------------- 1 | import os 2 | import math 3 | import platform 4 | import pkg_resources 5 | from datetime import datetime 6 | from collections import namedtuple 7 | 8 | from jinja2 import Environment, PackageLoader 9 | 10 | 11 | PAGE_SIZE = 15 12 | RANGE = 3 13 | PADDING = 'unkown' 14 | IGNORE_FIELDS = ['levelno', 'ts', 'processName', 
'filename', 15 | 'levelname', 'message', 'taskId', 'scraperId'] 16 | url = namedtuple('url', ['idx', 'rel', 'abs']) 17 | 18 | 19 | def datetimeformat(value): 20 | outfmt = '%b %d, %Y, %H:%M' 21 | if value is None: 22 | return 'no date' 23 | if not isinstance(value, datetime): 24 | value = datetime.strptime(value, '%Y-%m-%dT%H:%M') 25 | return value.strftime(outfmt) 26 | 27 | 28 | def render(scraper, dest_file, template, **kwargs): 29 | reports_path = scraper.config.reports_path 30 | if reports_path is None: 31 | reports_path = os.path.join(scraper.config.data_path, 'reports') 32 | dest_file = os.path.join(reports_path, dest_file) 33 | dest_path = os.path.dirname(dest_file) 34 | try: 35 | os.makedirs(dest_path) 36 | except: 37 | pass 38 | 39 | loader = PackageLoader('scrapekit', 'templates') 40 | env = Environment(loader=loader) 41 | env.filters['dateformat'] = datetimeformat 42 | template = env.get_template(template) 43 | kwargs['version'] = pkg_resources.require("scrapekit")[0].version 44 | kwargs['python'] = platform.python_version() 45 | kwargs['hostname'] = platform.uname()[1] 46 | kwargs['padding'] = PADDING 47 | kwargs['ignore_fields'] = IGNORE_FIELDS 48 | kwargs['config'] = scraper.config.items() 49 | kwargs['scraper'] = scraper 50 | with open(dest_file, 'w') as fh: 51 | text = template.render(**kwargs) 52 | fh.write(text.encode('utf-8')) 53 | return dest_file 54 | 55 | 56 | def paginate(scraper, elements, basename, template, **kwargs): 57 | basedir = os.path.dirname(basename) 58 | basefile = os.path.basename(basename) 59 | pages = int(math.ceil(float(len(elements)) / PAGE_SIZE)) 60 | 61 | urls = [] 62 | for i in range(1, pages + 1): 63 | fn = basefile % '' if i == 1 else basefile % ('_' + str(i)) 64 | urls.append(url(i, fn, os.path.join(basedir, fn))) 65 | 66 | link = None 67 | for page in reversed(urls): 68 | offset = (page.idx - 1) * PAGE_SIZE 69 | es = elements[offset:offset + PAGE_SIZE] 70 | 71 | low = page.idx - RANGE 72 | high = page.idx + RANGE 73 | 74 | if low < 1: 75 | low = 1 76 | high = min((2 * RANGE) + 1, pages) 77 | 78 | if high > pages: 79 | high = pages 80 | low = max(1, pages - (2 * RANGE) + 1) 81 | 82 | pager = { 83 | 'total': len(elements), 84 | 'page': page, 85 | 'elements': es, 86 | 'pages': urls[low - 1:high], 87 | 'show': pages > 1, 88 | 'prev': None if page.idx == 1 else urls[page.idx - 2], 89 | 'next': None if page.idx == len(urls) else urls[page.idx] 90 | } 91 | link = render(scraper, page.abs, template, pager=pager, 92 | **kwargs) 93 | return link 94 | -------------------------------------------------------------------------------- /scrapekit/tasks.py: -------------------------------------------------------------------------------- 1 | """ 2 | This module holds a simple system for the multi-threaded execution of 3 | scraper code. This can be used, for example, to split a scraper into 4 | several stages and to have multiple elements processed at the same 5 | time. 6 | 7 | The goal of this module is to handle simple multi-threaded scrapers, 8 | while making it easy to upgrade to a queue-based setup using celery 9 | later. 10 | """ 11 | 12 | from uuid import uuid4 13 | from time import sleep 14 | try: 15 | from queue import Queue 16 | except ImportError: 17 | from Queue import Queue 18 | from threading import Thread 19 | 20 | 21 | class TaskManager(object): 22 | """ The `TaskManager` is a singleton that manages the threads 23 | used to parallelize processing and the queue that manages the 24 | current set of prepared tasks. 
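
    A rough sketch of how tasks reach the manager (illustrative only;
    ``my_task`` stands for a hypothetical decorated task and ``record``
    for its argument)::

        my_task.queue(record)   # calls manager.put(my_task, (record,), {});
                                # worker threads are spawned on first use
        my_task.wait()          # calls manager.wait() and blocks until the
                                # queue has been processed completely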
""" 25 | 26 | def __init__(self, threads=10): 27 | """ 28 | :param threads: The number of threads to be spawned. Values 29 | ranging from 5 to 40 have shown useful, based on the amount 30 | of I/O involved in each task. 31 | :param daemon: Mark the worker threads as daemons in the 32 | operating system, so that they will not be included in the 33 | number of application threads for this script. 34 | """ 35 | self.num_threads = int(threads) 36 | self.queue = None 37 | 38 | def _spawn(self): 39 | """ Initialize the queue and the threads. """ 40 | self.queue = Queue(maxsize=self.num_threads * 10) 41 | for i in range(self.num_threads): 42 | t = Thread(target=self._consume) 43 | t.daemon = True 44 | t.start() 45 | 46 | def _consume(self): 47 | """ Main loop for each thread, handles picking a task off the 48 | queue, processing it and notifying the queue that it is done. 49 | """ 50 | while True: 51 | try: 52 | task, args, kwargs = self.queue.get(True) 53 | task(*args, **kwargs) 54 | finally: 55 | self.queue.task_done() 56 | 57 | def put(self, task, args, kwargs): 58 | """ Add a new item to the queue. An item is a task and the 59 | arguments needed to call it. 60 | 61 | Do not call this directly, use Task.queue/Task.run instead. 62 | """ 63 | if self.num_threads == 0: 64 | return task(*args, **kwargs) 65 | if self.queue is None: 66 | self._spawn() 67 | self.queue.put((task, args, kwargs)) 68 | 69 | def wait(self): 70 | """ Wait for each item in the queue to be processed. If this 71 | is not called, the main thread will end immediately and none 72 | of the tasks assigned to the threads would be executed. """ 73 | if self.queue is None: 74 | return 75 | 76 | self.queue.join() 77 | 78 | 79 | class ChainListener(object): 80 | 81 | def __init__(self, task): 82 | self.task = task 83 | 84 | def notify(self, value): 85 | self.task.queue(value) 86 | 87 | 88 | class PipeListener(ChainListener): 89 | 90 | def notify(self, value): 91 | # TODO: if value is a generator, it will be exhausted. 92 | # Thus no branching or return value is available. 93 | # -> consider using itertools.tee. 94 | for value_item in value: 95 | self.task.queue(value_item) 96 | 97 | 98 | class Task(object): 99 | """ A task is a decorator on a function which helps managing the 100 | execution of that function in a multi-threaded, queued context. 101 | 102 | After a task has been applied to a function, it can either be used 103 | in the normal way (by calling it directly), through a simple queue 104 | (using the `queue` method), or in pipeline mode (using `chain`, 105 | `pipe` and `run`). 106 | """ 107 | 108 | def __init__(self, scraper, fn, task_id=None): 109 | self.scraper = scraper 110 | self.fn = fn 111 | self.task_id = task_id 112 | self._listeners = [] 113 | self._source = None 114 | 115 | def __call__(self, *args, **kwargs): 116 | """ Execute the wrapped function. This will either call it in 117 | normal mode (returning the return value), or notify any 118 | pipeline listeners that have been associated with this task. 
119 | """ 120 | self.scraper.task_ctx.name = getattr(self.fn, 'func_name', self.fn.__name__) 121 | self.scraper.task_ctx.id = self.task_id or uuid4() 122 | 123 | try: 124 | self.scraper.log.debug('Begin task', extra={ 125 | 'taskArgs': args, 126 | 'taskKwargs': kwargs 127 | }) 128 | value = self.fn(*args, **kwargs) 129 | for listener in self._listeners: 130 | listener.notify(value) 131 | return value 132 | except Exception as e: 133 | self.scraper.log.exception(e) 134 | finally: 135 | self.scraper.task_ctx.name = None 136 | self.scraper.task_ctx.id = None 137 | 138 | def queue(self, *args, **kwargs): 139 | """ Schedule a task for execution. The task call (and its 140 | arguments) will be placed on the queue and processed 141 | asynchronously. """ 142 | self.scraper.task_manager.put(self, args, kwargs) 143 | return self 144 | 145 | def wait(self): 146 | """ Wait for task execution in the current queue to be 147 | complete (ie. the queue to be empty). If only `queue` is called 148 | without `wait`, no processing will occur. """ 149 | self.scraper.task_manager.wait() 150 | return self 151 | 152 | def run(self, *args, **kwargs): 153 | """ Queue a first item to execute, then wait for the queue to 154 | be empty before returning. This should be the default way of 155 | starting any scraper. 156 | """ 157 | if self._source is not None: 158 | return self._source.run(*args, **kwargs) 159 | else: 160 | self.queue(*args, **kwargs) 161 | return self.wait() 162 | 163 | def chain(self, other_task): 164 | """ Add a chain listener to the execution of this task. Whenever 165 | an item has been processed by the task, the registered listener 166 | task will be queued to be executed with the output of this task. 167 | 168 | Can also be written as:: 169 | 170 | pipeline = task1 > task2 171 | """ 172 | other_task._source = self 173 | self._listeners.append(ChainListener(other_task)) 174 | return other_task 175 | 176 | def pipe(self, other_task): 177 | """ Add a pipe listener to the execution of this task. The 178 | output of this task is required to be an iterable. Each item in 179 | the iterable will be queued as the sole argument to an execution 180 | of the listener task. 181 | 182 | Can also be written as:: 183 | 184 | pipeline = task1 | task2 185 | """ 186 | other_task._source = self 187 | self._listeners.append(PipeListener(other_task)) 188 | return other_task 189 | 190 | def __gt__(self, other_task): 191 | return self.chain(other_task) 192 | 193 | def __or__(self, other_task): 194 | return self.pipe(other_task) 195 | -------------------------------------------------------------------------------- /scrapekit/templates/index.html: -------------------------------------------------------------------------------- 1 | {% extends "layout.html" %} 2 | {% from 'macros.html' import pagination %} 3 | 4 | {% block title %} 5 | {{ scraper.name }} 6 | {% endblock %} 7 | 8 | {% block content %} 9 | 10 | 11 | 12 | 13 | 14 | 15 | 16 | 17 | 18 | {% for run in pager.elements %} 19 | 20 | 23 | 24 | 25 | 28 | 31 | 32 | {% for task in tasks %} 33 | {% if task.scraperId == run.scraperId %} 34 | 35 | 36 | 39 | 40 | 41 | 44 | 47 | 48 | {% endif %} 49 | {% endfor %} 50 | {% endfor %} 51 |
Scraper runTaskTasks executedMessagesWarningsErrors
21 | {{run.scraperStartTime | dateformat}} 22 | {{run.tasks or '-'}}{{run.messages or '-'}} 26 | {{run.get('WARN') or '-'}} 27 | 29 | {{run.get('ERROR') or '-'}} 30 |
37 | {{task.taskName or '(No task)'}} 38 | {{task.tasks or '-'}}{{task.messages or '-'}} 42 | {{task.WARN or '-'}} 43 | 45 | {{task.ERROR or '-'}} 46 |
52 | {% endblock %} 53 | -------------------------------------------------------------------------------- /scrapekit/templates/layout.html: -------------------------------------------------------------------------------- 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | {% block title %}(Untitled){% endblock %} - scrapekit 9 | 10 | 11 | 12 | 13 | 14 | 15 | 101 | 102 | 103 |
104 |
105 |

106 | {{self.title()}} 107 |

108 |
109 |
110 |

111 | This is an automatic report generated by the {{scraper.name}} 112 | scraper. It contains information about each run of the scraper, and the 113 | error messages and warnings emitted by the scraper. 114 |

115 | 116 |

Configuration

117 |

The following configuration settings applied during the most recent run 118 | of this scraper.

119 | 120 | {% for key, value in config %} 121 | 122 | 123 | 124 | 125 | {% endfor %} 126 |
{{key}}{{value}}
127 |
128 |
129 | {% block content %}{% endblock %} 130 | {{pagination(pager)}} 131 |
132 |
133 | 134 |
135 |
136 | 137 |
138 |
139 |

140 | scrapekit {{version}} 141 |

142 |

Python {{python}} / {{hostname}}

143 |
144 |
145 | 146 | {% block js %}{% endblock %} 147 | 148 | 149 | 150 | 151 | -------------------------------------------------------------------------------- /scrapekit/templates/macros.html: -------------------------------------------------------------------------------- 1 | {% macro pagination(pager) %} 2 | {% if pager.show %} 3 |
    4 | {% if pager.prev %} 5 |
  • «
  • 6 | {% else %} 7 |
  • «
  • 8 | {% endif %} 9 | {% for page in pager.pages %} 10 | {% if page.idx == pager.page.idx %} 11 |
  • {{page.idx}}
  • 12 | {% else %} 13 |
  • {{page.idx}}
  • 14 | {% endif %} 15 | {% endfor %} 16 | {% if pager.next %} 17 |
  • »
  • 18 | {% else %} 19 |
  • »
  • 20 | {% endif %} 21 |
22 | {% endif %} 23 | {% endmacro %} 24 | -------------------------------------------------------------------------------- /scrapekit/templates/task_run_item.html: -------------------------------------------------------------------------------- 1 | {% extends "layout.html" %} 2 | {% from 'macros.html' import pagination %} 3 | 4 | {% block title %} 5 | {{ scraper.name }}: {{taskName or padding}} 6 | {% endblock %} 7 | 8 | {% block content %} 9 | 10 | 11 | 12 | 13 | 14 | 15 | {% for row in pager.elements %} 16 | 17 | 18 | 19 | 20 | 21 | 22 | 31 | 32 | {% endfor %} 33 |
TimeLevelMessage
{{row.ts}}{{row.levelname}}{{row.message}}
23 | 24 | {% for k, v in row.items() %} 25 | {% if k not in ignore_fields %} 26 | 27 | {% endif %} 28 | {% endfor %} 29 |
{{k}}{{v}}
30 |
34 | {% endblock %} 35 | 36 | {% block js %} 37 | 38 | 47 | {% endblock %} 48 | -------------------------------------------------------------------------------- /scrapekit/templates/task_run_list.html: -------------------------------------------------------------------------------- 1 | {% extends "layout.html" %} 2 | {% from 'macros.html' import pagination %} 3 | 4 | {% block title %} 5 | {{ scraper.name }}: {{taskName or padding}} 6 | {% endblock %} 7 | 8 | {% block content %} 9 | 10 | 11 | 12 | 13 | 14 | 15 | 16 | {% for run in pager.elements %} 17 | 18 | 19 | 20 | 23 | 26 | 27 | {% endfor %} 28 |
TimeMessagesWarningsErrors
{{run.asctime}}{{run.messages}} 21 | {{run.WARN or '-'}} 22 | 24 | {{run.ERROR or '-'}} 25 |
29 | {% endblock %} 30 | 31 | 32 | -------------------------------------------------------------------------------- /scrapekit/util.py: -------------------------------------------------------------------------------- 1 | import re 2 | 3 | 4 | def collapse_whitespace(text): 5 | """ Collapse all consecutive whitespace, newlines and tabs 6 | in a string into single whitespaces, and strip the outer 7 | whitespace. This will also accept an ``lxml`` element and 8 | extract all text. """ 9 | if text is None: 10 | return None 11 | if hasattr(text, 'xpath'): 12 | text = text.xpath('string()') 13 | text = re.sub('\s+', ' ', text) 14 | return text.strip() 15 | -------------------------------------------------------------------------------- /setup.py: -------------------------------------------------------------------------------- 1 | from setuptools import setup, find_packages 2 | 3 | 4 | setup( 5 | name='scrapekit', 6 | version='0.2.1', 7 | description="Light-weight tools for web scraping", 8 | long_description="", 9 | classifiers=[ 10 | "Development Status :: 3 - Alpha", 11 | "Intended Audience :: Developers", 12 | "License :: OSI Approved :: MIT License", 13 | "Operating System :: OS Independent", 14 | 'Programming Language :: Python :: 2.6', 15 | 'Programming Language :: Python :: 2.7', 16 | 'Programming Language :: Python :: 3.3', 17 | 'Programming Language :: Python :: 3.4' 18 | ], 19 | keywords='web scraping crawling http cache threading', 20 | author='Friedrich Lindenberg', 21 | author_email='friedrich@pudo.org', 22 | url='http://github.com/pudo/scrapekit', 23 | license='MIT', 24 | packages=find_packages(exclude=['ez_setup', 'examples', 'test']), 25 | namespace_packages=[], 26 | package_data={'scrapekit': ['templates/*.html']}, 27 | include_package_data=True, 28 | zip_safe=False, 29 | install_requires=[ 30 | "requests>=2.3.0", 31 | "CacheControl>=0.10.2", 32 | "lockfile>=0.9.1", 33 | "Jinja2>=2.7.3", 34 | "python-json-logger>=0.0.5" 35 | ], 36 | tests_require=[], 37 | entry_points={ 38 | 'console_scripts': [] 39 | } 40 | ) 41 | -------------------------------------------------------------------------------- /test.py: -------------------------------------------------------------------------------- 1 | from scrapekit import Scraper 2 | 3 | scraper = Scraper('test') 4 | 5 | @scraper.task 6 | def fun(a, b): 7 | print(a, b, a + b) 8 | return a + b 9 | 10 | #print fun.queue(2, 3).wait() 11 | #print fun(2, 3) 12 | 13 | 14 | @scraper.task 15 | def funSource(): 16 | for i in xrange(100): 17 | yield i 18 | 19 | @scraper.task 20 | def funModifier(i): 21 | return i + 0.1 22 | 23 | @scraper.task 24 | def funSink(i): 25 | print(i ** 3) 26 | 27 | 28 | #pipeline = funSource | funModifier > funSink 29 | #pipeline.run() 30 | 31 | #funSource.chain(funSink).run() 32 | #funSource.queue().wait() 33 | 34 | 35 | @scraper.task 36 | def scrape_index(): 37 | url = 'https://sfbay.craigslist.org/boo/' 38 | res = scraper.get(url) 39 | print(res) 40 | 41 | 42 | scrape_index.run() 43 | 44 | --------------------------------------------------------------------------------