├── .gitignore
├── LICENSE
├── MANIFEST.in
├── README.md
├── debian
│   ├── changelog
│   ├── compat
│   ├── control
│   ├── copyright
│   ├── examples
│   ├── lintian-overrides
│   ├── rules
│   └── source
│       ├── format
│       └── options
├── docs
│   ├── .gitignore
│   ├── Makefile
│   ├── apidoc.rst
│   ├── conf.py
│   ├── index.rst
│   ├── installation.rst
│   └── usage.rst
├── dryscrape
│   ├── __init__.py
│   ├── driver
│   │   ├── __init__.py
│   │   └── webkit.py
│   ├── mixins.py
│   ├── session.py
│   └── xvfb.py
├── examples
│   └── google.py
├── requirements.txt
└── setup.py
/.gitignore: -------------------------------------------------------------------------------- 1 | *.png 2 | *.pyc 3 | *~ 4 | MANIFEST 5 | /dist 6 | /build 7 | 8 | -------------------------------------------------------------------------------- /LICENSE: -------------------------------------------------------------------------------- 1 | Copyright (c) Niklas Baumstark 2 | 3 | Permission is hereby granted, free of charge, to any person obtaining a copy 4 | of this software and associated documentation files (the "Software"), to deal 5 | in the Software without restriction, including without limitation the rights 6 | to use, copy, modify, merge, publish, distribute, sublicense, and/or sell 7 | copies of the Software, and to permit persons to whom the Software is 8 | furnished to do so, subject to the following conditions: 9 | 10 | The above copyright notice and this permission notice shall be included in 11 | all copies or substantial portions of the Software. 12 | 13 | THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR 14 | IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, 15 | FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE 16 | AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER 17 | LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, 18 | OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN 19 | THE SOFTWARE. 
20 | 21 | -------------------------------------------------------------------------------- /MANIFEST.in: -------------------------------------------------------------------------------- 1 | include LICENSE README.md 2 | recursive-include examples *.py 3 | -------------------------------------------------------------------------------- /README.md: -------------------------------------------------------------------------------- 1 | 2 | **NOTE: This package is not actively maintained. It uses QtWebkit, which is end-of-life and probably doesn't get security fixes backported. Consider using a similar package like [Spynner](https://github.com/makinacorpus/spynner) instead.** 3 | 4 | 5 | # Overview 6 | 7 | **Author:** Niklas Baumstark 8 | 9 | 10 | 11 | dryscrape is a lightweight web scraping library for Python. It uses a 12 | headless Webkit instance to evaluate Javascript on the visited pages. This 13 | enables painless scraping of plain web pages as well as Javascript-heavy 14 | “Web 2.0” applications like 15 | Facebook. 16 | 17 | It is built on the shoulders of 18 | [capybara-webkit](https://github.com/thoughtbot/capybara-webkit)'s 19 | [webkit-server](https://github.com/niklasb/webkit-server). A big thanks goes 20 | to thoughtbot, inc. for building this excellent piece of software! 21 | 22 | # Changelog 23 | 24 | * 1.0: Added Python 3 support, small performance fixes, header names are now 25 | properly normalized. Also added the function `dryscrape.start_xvfb()` to 26 | easily start Xvfb. 27 | * 0.9.1: Changed semantics of the `headers` function in 28 | a backwards-incompatible way: It now returns a list of (key, value) 29 | pairs instead of a dictionary. 30 | 31 | # Supported Platforms 32 | 33 | The library has been confirmed to work on the following platforms: 34 | 35 | * Mac OS X 10.9 Mavericks and 10.10 Yosemite 36 | * Ubuntu Linux 37 | * Arch Linux 38 | 39 | Other unixoid systems should work just fine. 
40 | 41 | Windows is not officially supported, although dryscrape should work 42 | with [cygwin](https://www.cygwin.com/). 43 | 44 | ### A word about Qt 5.6 45 | 46 | The 5.6 version of Qt removes the Qt WebKit module in favor of the new module Qt WebEngine. So far webkit-server has not been ported to WebEngine (and likely won't be in the near future), so Qt <= 5.5 is a requirement. 47 | 48 | # Installation, Usage, API Docs 49 | 50 | Documentation can be found at 51 | [dryscrape's ReadTheDocs page](http://dryscrape.readthedocs.io/). 52 | 53 | Quick installation instructions for Ubuntu: 54 | 55 | # apt-get install qt5-default libqt5webkit5-dev build-essential python-lxml python-pip xvfb 56 | # pip install dryscrape 57 | 58 | # Contact, Bugs, Contributions 59 | 60 | If you have any problems with this software, don't hesitate to open an 61 | issue on [Github](https://github.com/niklasb/dryscrape), open a pull 62 | request, or write a mail to **niklas baumstark at Gmail**. 63 | -------------------------------------------------------------------------------- /debian/changelog: -------------------------------------------------------------------------------- 1 | dryscrape (1.0-1) unstable; urgency=low 2 | 3 | * Initial import 4 | 5 | -- Niklas Baumstark Wed, 23 Sep 2015 13:00:10 +0000 6 | -------------------------------------------------------------------------------- /debian/compat: -------------------------------------------------------------------------------- 1 | 7 2 | -------------------------------------------------------------------------------- /debian/control: -------------------------------------------------------------------------------- 1 | Source: dryscrape 2 | Maintainer: Niklas Baumstark 3 | Section: python 4 | Priority: optional 5 | Build-Depends: python-all (>= 2.6.6-3), debhelper (>= 7) 6 | Standards-Version: 3.9.1 7 | 8 | Package: python-dryscrape 9 | Architecture: all 10 | XB-Python-Version: ${python:Versions} 11 | Depends: ${misc:Depends}, 
${python:Depends}, python-webkit-server, python-lxml, python-xvfbwrapper 12 | Provides: ${python:Provides} 13 | Description: a lightweight Javascript-aware, headless web scraping library 14 | dryscrape is a lightweight web scraping library for Python. 15 | It uses a headless Webkit instance to evaluate Javascript on the visited pages. 16 | This enables painless scraping of plain web pages as well 17 | as Javascript-heavy "Web 2.0" applications like Facebook. 18 | -------------------------------------------------------------------------------- /debian/copyright: -------------------------------------------------------------------------------- 1 | Format: http://www.debian.org/doc/packaging-manuals/copyright-format/1.0/ 2 | Upstream-Name: dryscrape 3 | Source: https://github.com/niklasb/dryscrape 4 | 5 | Files: * 6 | Copyright: Copyright (c) 2012 Niklas Baumstark 7 | License: MIT 8 | For details see http://opensource.org/licenses/MIT. 9 | -------------------------------------------------------------------------------- /debian/examples: -------------------------------------------------------------------------------- 1 | examples/* 2 | -------------------------------------------------------------------------------- /debian/lintian-overrides: -------------------------------------------------------------------------------- 1 | python-dryscrape binary: description-synopsis-starts-with-article 2 | python-dryscrape binary: new-package-should-close-itp-bug 3 | 4 | -------------------------------------------------------------------------------- /debian/rules: -------------------------------------------------------------------------------- 1 | #!/usr/bin/make -f 2 | 3 | %: 4 | dh $@ --with python2 --buildsystem=python_distutils 5 | 6 | 7 | -------------------------------------------------------------------------------- /debian/source/format: -------------------------------------------------------------------------------- 1 | 3.0 (quilt) 2 | 
-------------------------------------------------------------------------------- /debian/source/options: -------------------------------------------------------------------------------- 1 | extend-diff-ignore="\.egg-info" -------------------------------------------------------------------------------- /docs/.gitignore: -------------------------------------------------------------------------------- 1 | /_* 2 | -------------------------------------------------------------------------------- /docs/Makefile: -------------------------------------------------------------------------------- 1 | # Makefile for Sphinx documentation 2 | # 3 | 4 | # You can set these variables from the command line. 5 | SPHINXOPTS = 6 | SPHINXBUILD = sphinx-build 7 | PAPER = 8 | BUILDDIR = _build 9 | 10 | # Internal variables. 11 | PAPEROPT_a4 = -D latex_paper_size=a4 12 | PAPEROPT_letter = -D latex_paper_size=letter 13 | ALLSPHINXOPTS = -d $(BUILDDIR)/doctrees $(PAPEROPT_$(PAPER)) $(SPHINXOPTS) . 14 | 15 | .PHONY: help clean html dirhtml singlehtml pickle json htmlhelp qthelp devhelp epub latex latexpdf text man changes linkcheck doctest 16 | 17 | help: 18 | @echo "Please use \`make <target>' where <target> is one of" 19 | @echo " html to make standalone HTML files" 20 | @echo " dirhtml to make HTML files named index.html in directories" 21 | @echo " singlehtml to make a single large HTML file" 22 | @echo " pickle to make pickle files" 23 | @echo " json to make JSON files" 24 | @echo " htmlhelp to make HTML files and a HTML help project" 25 | @echo " qthelp to make HTML files and a qthelp project" 26 | @echo " devhelp to make HTML files and a Devhelp project" 27 | @echo " epub to make an epub" 28 | @echo " latex to make LaTeX files, you can set PAPER=a4 or PAPER=letter" 29 | @echo " latexpdf to make LaTeX files and run them through pdflatex" 30 | @echo " text to make text files" 31 | @echo " man to make manual pages" 32 | @echo " changes to make an overview of all changed/added/deprecated items" 33 | @echo
" linkcheck to check all external links for integrity" 34 | @echo " doctest to run all doctests embedded in the documentation (if enabled)" 35 | 36 | clean: 37 | -rm -rf $(BUILDDIR)/* 38 | 39 | html: 40 | $(SPHINXBUILD) -b html $(ALLSPHINXOPTS) $(BUILDDIR)/html 41 | @echo 42 | @echo "Build finished. The HTML pages are in $(BUILDDIR)/html." 43 | 44 | dirhtml: 45 | $(SPHINXBUILD) -b dirhtml $(ALLSPHINXOPTS) $(BUILDDIR)/dirhtml 46 | @echo 47 | @echo "Build finished. The HTML pages are in $(BUILDDIR)/dirhtml." 48 | 49 | singlehtml: 50 | $(SPHINXBUILD) -b singlehtml $(ALLSPHINXOPTS) $(BUILDDIR)/singlehtml 51 | @echo 52 | @echo "Build finished. The HTML page is in $(BUILDDIR)/singlehtml." 53 | 54 | pickle: 55 | $(SPHINXBUILD) -b pickle $(ALLSPHINXOPTS) $(BUILDDIR)/pickle 56 | @echo 57 | @echo "Build finished; now you can process the pickle files." 58 | 59 | json: 60 | $(SPHINXBUILD) -b json $(ALLSPHINXOPTS) $(BUILDDIR)/json 61 | @echo 62 | @echo "Build finished; now you can process the JSON files." 63 | 64 | htmlhelp: 65 | $(SPHINXBUILD) -b htmlhelp $(ALLSPHINXOPTS) $(BUILDDIR)/htmlhelp 66 | @echo 67 | @echo "Build finished; now you can run HTML Help Workshop with the" \ 68 | ".hhp project file in $(BUILDDIR)/htmlhelp." 69 | 70 | qthelp: 71 | $(SPHINXBUILD) -b qthelp $(ALLSPHINXOPTS) $(BUILDDIR)/qthelp 72 | @echo 73 | @echo "Build finished; now you can run "qcollectiongenerator" with the" \ 74 | ".qhcp project file in $(BUILDDIR)/qthelp, like this:" 75 | @echo "# qcollectiongenerator $(BUILDDIR)/qthelp/dryscrape.qhcp" 76 | @echo "To view the help file:" 77 | @echo "# assistant -collectionFile $(BUILDDIR)/qthelp/dryscrape.qhc" 78 | 79 | devhelp: 80 | $(SPHINXBUILD) -b devhelp $(ALLSPHINXOPTS) $(BUILDDIR)/devhelp 81 | @echo 82 | @echo "Build finished." 
83 | @echo "To view the help file:" 84 | @echo "# mkdir -p $$HOME/.local/share/devhelp/dryscrape" 85 | @echo "# ln -s $(BUILDDIR)/devhelp $$HOME/.local/share/devhelp/dryscrape" 86 | @echo "# devhelp" 87 | 88 | epub: 89 | $(SPHINXBUILD) -b epub $(ALLSPHINXOPTS) $(BUILDDIR)/epub 90 | @echo 91 | @echo "Build finished. The epub file is in $(BUILDDIR)/epub." 92 | 93 | latex: 94 | $(SPHINXBUILD) -b latex $(ALLSPHINXOPTS) $(BUILDDIR)/latex 95 | @echo 96 | @echo "Build finished; the LaTeX files are in $(BUILDDIR)/latex." 97 | @echo "Run \`make' in that directory to run these through (pdf)latex" \ 98 | "(use \`make latexpdf' here to do that automatically)." 99 | 100 | latexpdf: 101 | $(SPHINXBUILD) -b latex $(ALLSPHINXOPTS) $(BUILDDIR)/latex 102 | @echo "Running LaTeX files through pdflatex..." 103 | $(MAKE) -C $(BUILDDIR)/latex all-pdf 104 | @echo "pdflatex finished; the PDF files are in $(BUILDDIR)/latex." 105 | 106 | text: 107 | $(SPHINXBUILD) -b text $(ALLSPHINXOPTS) $(BUILDDIR)/text 108 | @echo 109 | @echo "Build finished. The text files are in $(BUILDDIR)/text." 110 | 111 | man: 112 | $(SPHINXBUILD) -b man $(ALLSPHINXOPTS) $(BUILDDIR)/man 113 | @echo 114 | @echo "Build finished. The manual pages are in $(BUILDDIR)/man." 115 | 116 | changes: 117 | $(SPHINXBUILD) -b changes $(ALLSPHINXOPTS) $(BUILDDIR)/changes 118 | @echo 119 | @echo "The overview file is in $(BUILDDIR)/changes." 120 | 121 | linkcheck: 122 | $(SPHINXBUILD) -b linkcheck $(ALLSPHINXOPTS) $(BUILDDIR)/linkcheck 123 | @echo 124 | @echo "Link check complete; look for any errors in the above output " \ 125 | "or in $(BUILDDIR)/linkcheck/output.txt." 126 | 127 | doctest: 128 | $(SPHINXBUILD) -b doctest $(ALLSPHINXOPTS) $(BUILDDIR)/doctest 129 | @echo "Testing of doctests in the sources finished, look at the " \ 130 | "results in $(BUILDDIR)/doctest/output.txt." 
131 | -------------------------------------------------------------------------------- /docs/apidoc.rst: -------------------------------------------------------------------------------- 1 | API Documentation 2 | ================= 3 | 4 | This documentation also contains the API docs for the ``webkit_server`` 5 | module, for convenience (and because I am too lazy to set up dedicated docs 6 | for it). 7 | 8 | Overview 9 | ---------- 10 | 11 | .. inheritance-diagram:: dryscrape.session 12 | dryscrape.mixins 13 | dryscrape.driver.webkit 14 | webkit_server 15 | 16 | Module :mod:`dryscrape.session` 17 | ------------------------------- 18 | 19 | .. automodule:: dryscrape.session 20 | :members: 21 | 22 | Module :mod:`dryscrape.mixins` 23 | ------------------------------- 24 | 25 | .. automodule:: dryscrape.mixins 26 | :members: 27 | 28 | Module :mod:`dryscrape.driver.webkit` 29 | ------------------------------------- 30 | 31 | .. automodule:: dryscrape.driver.webkit 32 | :members: 33 | 34 | Module :mod:`webkit_server` 35 | ------------------------------- 36 | 37 | .. automodule:: webkit_server 38 | :members: 39 | -------------------------------------------------------------------------------- /docs/conf.py: -------------------------------------------------------------------------------- 1 | # -*- coding: utf-8 -*- 2 | # 3 | # dryscrape documentation build configuration file, created by 4 | # sphinx-quickstart on Thu Jan 12 15:55:25 2012. 5 | # 6 | # This file is execfile()d with the current directory set to its containing dir. 7 | # 8 | # Note that not all possible configuration values are present in this 9 | # autogenerated file. 10 | # 11 | # All configuration values have a default; values that are commented out 12 | # serve to show the default. 
13 | 14 | import sys, os 15 | 16 | class Mock(object): 17 | def __init__(self, *args, **kwargs): 18 | pass 19 | 20 | def __call__(self, *args, **kwargs): 21 | return Mock() 22 | 23 | @classmethod 24 | def __getattr__(self, name): 25 | if name in ('__file__', '__path__'): 26 | return '/dev/null' 27 | elif name[0].upper() == name[0]: 28 | return type(name, (), {}) 29 | else: 30 | return Mock() 31 | 32 | # mock some modules... 33 | MOCK_MODULES = [] 34 | for mod_name in MOCK_MODULES: 35 | sys.modules[mod_name] = Mock() 36 | 37 | # If extensions (or modules to document with autodoc) are in another directory, 38 | # add these directories to sys.path here. If the directory is relative to the 39 | # documentation root, use os.path.abspath to make it absolute, like shown here. 40 | sys.path.insert(0, os.path.abspath('..')) 41 | 42 | # -- General configuration ----------------------------------------------------- 43 | 44 | # If your documentation needs a minimal Sphinx version, state it here. 45 | #needs_sphinx = '1.0' 46 | 47 | # Add any Sphinx extension module names here, as strings. They can be extensions 48 | # coming with Sphinx (named 'sphinx.ext.*') or your custom ones. 49 | extensions = ['sphinx.ext.autodoc', 50 | 'sphinx.ext.viewcode', 51 | 'sphinx.ext.graphviz', 52 | 'sphinx.ext.inheritance_diagram'] 53 | 54 | # autodoc config 55 | autodoc_default_flags = ['show-inheritance'] 56 | 57 | # Add any paths that contain templates here, relative to this directory. 58 | templates_path = ['_templates'] 59 | 60 | # The suffix of source filenames. 61 | source_suffix = '.rst' 62 | 63 | # The encoding of source files. 64 | #source_encoding = 'utf-8-sig' 65 | 66 | # The master toctree document. 67 | master_doc = 'index' 68 | 69 | # General information about the project. 
70 | project = u'dryscrape' 71 | copyright = u'2012, Niklas Baumstark' 72 | 73 | # The version info for the project you're documenting, acts as replacement for 74 | # |version| and |release|, also used in various other places throughout the 75 | # built documents. 76 | # 77 | # The short X.Y version. 78 | version = '1.0' 79 | # The full version, including alpha/beta/rc tags. 80 | release = '1.0.1' 81 | 82 | # The language for content autogenerated by Sphinx. Refer to documentation 83 | # for a list of supported languages. 84 | #language = None 85 | 86 | # There are two options for replacing |today|: either, you set today to some 87 | # non-false value, then it is used: 88 | #today = '' 89 | # Else, today_fmt is used as the format for a strftime call. 90 | #today_fmt = '%B %d, %Y' 91 | 92 | # List of patterns, relative to source directory, that match files and 93 | # directories to ignore when looking for source files. 94 | exclude_patterns = ['_build'] 95 | 96 | # The reST default role (used for this markup: `text`) to use for all documents. 97 | #default_role = None 98 | 99 | # If true, '()' will be appended to :func: etc. cross-reference text. 100 | #add_function_parentheses = True 101 | 102 | # If true, the current module name will be prepended to all description 103 | # unit titles (such as .. function::). 104 | #add_module_names = True 105 | 106 | # If true, sectionauthor and moduleauthor directives will be shown in the 107 | # output. They are ignored by default. 108 | #show_authors = False 109 | 110 | # The name of the Pygments (syntax highlighting) style to use. 111 | pygments_style = 'sphinx' 112 | 113 | # A list of ignored prefixes for module index sorting. 114 | #modindex_common_prefix = [] 115 | 116 | 117 | # -- Options for HTML output --------------------------------------------------- 118 | 119 | # The theme to use for HTML and HTML Help pages. See the documentation for 120 | # a list of builtin themes. 
121 | html_theme = 'default' 122 | 123 | # Theme options are theme-specific and customize the look and feel of a theme 124 | # further. For a list of options available for each theme, see the 125 | # documentation. 126 | #html_theme_options = {} 127 | 128 | # Add any paths that contain custom themes here, relative to this directory. 129 | #html_theme_path = [] 130 | 131 | # The name for this set of Sphinx documents. If None, it defaults to 132 | # " v documentation". 133 | #html_title = None 134 | 135 | # A shorter title for the navigation bar. Default is the same as html_title. 136 | #html_short_title = None 137 | 138 | # The name of an image file (relative to this directory) to place at the top 139 | # of the sidebar. 140 | #html_logo = None 141 | 142 | # The name of an image file (within the static path) to use as favicon of the 143 | # docs. This file should be a Windows icon file (.ico) being 16x16 or 32x32 144 | # pixels large. 145 | #html_favicon = None 146 | 147 | # Add any paths that contain custom static files (such as style sheets) here, 148 | # relative to this directory. They are copied after the builtin static files, 149 | # so a file named "default.css" will overwrite the builtin "default.css". 150 | html_static_path = ['_static'] 151 | 152 | # If not '', a 'Last updated on:' timestamp is inserted at every page bottom, 153 | # using the given strftime format. 154 | #html_last_updated_fmt = '%b %d, %Y' 155 | 156 | # If true, SmartyPants will be used to convert quotes and dashes to 157 | # typographically correct entities. 158 | #html_use_smartypants = True 159 | 160 | # Custom sidebar templates, maps document names to template names. 161 | #html_sidebars = {} 162 | 163 | # Additional templates that should be rendered to pages, maps page names to 164 | # template names. 165 | #html_additional_pages = {} 166 | 167 | # If false, no module index is generated. 168 | #html_domain_indices = True 169 | 170 | # If false, no index is generated. 
171 | #html_use_index = True 172 | 173 | # If true, the index is split into individual pages for each letter. 174 | #html_split_index = False 175 | 176 | # If true, links to the reST sources are added to the pages. 177 | #html_show_sourcelink = True 178 | 179 | # If true, "Created using Sphinx" is shown in the HTML footer. Default is True. 180 | #html_show_sphinx = True 181 | 182 | # If true, "(C) Copyright ..." is shown in the HTML footer. Default is True. 183 | #html_show_copyright = True 184 | 185 | # If true, an OpenSearch description file will be output, and all pages will 186 | # contain a tag referring to it. The value of this option must be the 187 | # base URL from which the finished HTML is served. 188 | #html_use_opensearch = '' 189 | 190 | # This is the file name suffix for HTML files (e.g. ".xhtml"). 191 | #html_file_suffix = None 192 | 193 | # Output file base name for HTML help builder. 194 | htmlhelp_basename = 'dryscrapedoc' 195 | 196 | 197 | # -- Options for LaTeX output -------------------------------------------------- 198 | 199 | # The paper size ('letter' or 'a4'). 200 | #latex_paper_size = 'letter' 201 | 202 | # The font size ('10pt', '11pt' or '12pt'). 203 | #latex_font_size = '10pt' 204 | 205 | # Grouping the document tree into LaTeX files. List of tuples 206 | # (source start file, target name, title, author, documentclass [howto/manual]). 207 | latex_documents = [ 208 | ('index', 'dryscrape.tex', u'dryscrape Documentation', 209 | u'Niklas Baumstark', 'manual'), 210 | ] 211 | 212 | # The name of an image file (relative to this directory) to place at the top of 213 | # the title page. 214 | #latex_logo = None 215 | 216 | # For "manual" documents, if this is true, then toplevel headings are parts, 217 | # not chapters. 218 | #latex_use_parts = False 219 | 220 | # If true, show page references after internal links. 221 | #latex_show_pagerefs = False 222 | 223 | # If true, show URL addresses after external links. 
224 | #latex_show_urls = False 225 | 226 | # Additional stuff for the LaTeX preamble. 227 | #latex_preamble = '' 228 | 229 | # Documents to append as an appendix to all manuals. 230 | #latex_appendices = [] 231 | 232 | # If false, no module index is generated. 233 | #latex_domain_indices = True 234 | 235 | 236 | # -- Options for manual page output -------------------------------------------- 237 | 238 | # One entry per manual page. List of tuples 239 | # (source start file, name, description, authors, manual section). 240 | man_pages = [ 241 | ('index', 'dryscrape', u'dryscrape Documentation', 242 | [u'Niklas Baumstark'], 1) 243 | ] 244 | -------------------------------------------------------------------------------- /docs/index.rst: -------------------------------------------------------------------------------- 1 | Welcome to dryscrape's documentation! 2 | ==================================== 3 | 4 | dryscrape_ is a lightweight web scraping library for Python. It uses a 5 | headless Webkit instance to evaluate Javascript on the visited pages. This 6 | enables painless scraping of plain web pages as well as Javascript-heavy 7 | “Web 2.0” applications like 8 | Facebook. 9 | 10 | It is built on the shoulders of capybara-webkit_'s webkit-server_. 11 | A big thanks goes to thoughtbot, inc. for building this excellent 12 | piece of software! 13 | 14 | .. _dryscrape: https://github.com/niklasb/dryscrape 15 | .. _capybara-webkit: https://github.com/thoughtbot/capybara-webkit 16 | .. _webkit-server: https://github.com/niklasb/webkit-server 17 | 18 | Contents 19 | ---------- 20 | 21 | .. 
toctree:: 22 | :maxdepth: 2 23 | 24 | installation 25 | usage 26 | apidoc 27 | 28 | Indices and tables 29 | ================== 30 | 31 | * :ref:`genindex` 32 | * :ref:`modindex` 33 | * :ref:`search` 34 | 35 | -------------------------------------------------------------------------------- /docs/installation.rst: -------------------------------------------------------------------------------- 1 | .. highlight:: none 2 | 3 | Installation 4 | ============ 5 | 6 | Prerequisites 7 | ------------- 8 | 9 | Before installing dryscrape_, you need to install some software it depends on: 10 | 11 | * Qt_, QtWebKit_ 12 | * lxml_ 13 | * pip_ 14 | * xvfb_ (necessary only if no other X server is available) 15 | 16 | On Ubuntu you can do that with one command (the ``#`` indicates that you need 17 | root privileges for this): 18 | 19 | :: 20 | 21 | # apt-get install qt5-default libqt5webkit5-dev build-essential \ 22 | python-lxml python-pip xvfb 23 | 24 | Please note that Qt4 is also supported. 25 | 26 | On Mac OS X, you can use Homebrew_ to install Qt and 27 | easy_install_ to install pip_: 28 | 29 | :: 30 | 31 | # brew install qt 32 | # easy_install pip 33 | 34 | On other operating systems, you can use pip_ to install lxml (though you might 35 | have to install libxml2 and the Python headers first). 36 | 37 | Recommended: Installing dryscrape from PyPI 38 | ------------------------------------------- 39 | 40 | This is as simple as a quick 41 | 42 | :: 43 | 44 | # pip install dryscrape 45 | 46 | Note that dryscrape supports Python 2.7 and 3 as of version 1.0. 47 | 48 | Installing dryscrape from Git 49 | ------------------------------- 50 | 51 | First, get a copy of dryscrape_ using Git: 52 | 53 | :: 54 | 55 | $ git clone https://github.com/niklasb/dryscrape.git dryscrape 56 | $ cd dryscrape 57 | 58 | To install dryscrape, you first need to install webkit-server_. You can use 59 | pip_ to do this for you (while still in the dryscrape directory). 
60 | 61 | :: 62 | 63 | # pip install -r requirements.txt 64 | 65 | If you want, you can of course also install the dependencies manually. 66 | 67 | Afterwards, you can use the included ``setup.py`` script to install dryscrape: 68 | 69 | :: 70 | 71 | # python setup.py install 72 | 73 | .. _Qt: http://www.qt.io 74 | .. _QtWebKit: http://doc.qt.io/qt-5/qtwebkit-index.html 75 | .. _lxml: http://lxml.de/ 76 | .. _webkit-server: https://github.com/niklasb/webkit-server/ 77 | .. _pip: http://pypi.python.org/pypi/pip 78 | .. _dryscrape: https://github.com/niklasb/dryscrape/ 79 | .. _Homebrew: http://brew.sh/ 80 | .. _easy_install: https://pypi.python.org/pypi/setuptools 81 | -------------------------------------------------------------------------------- /docs/usage.rst: -------------------------------------------------------------------------------- 1 | Usage 2 | ====== 3 | 4 | First demonstration 5 | ------------------------ 6 | 7 | A code sample tells more than a thousand words: 8 | 9 | .. literalinclude:: /../examples/google.py 10 | 11 | In this sample, we use dryscrape to do a simple web search on Google. 12 | Note that we set up a Webkit driver instance here and pass it to a dryscrape 13 | :py:class:`~dryscrape.session.Session` in the constructor. The session instance 14 | then passes every method call it cannot resolve -- such as 15 | :py:meth:`~webkit_server.CommandsMixin.visit`, in this case -- to the 16 | underlying driver. 
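The delegation described above is plain Python attribute lookup. A minimal, self-contained sketch of the same proxy pattern (``StubDriver`` and its canned ``visit`` are hypothetical stand-ins for the real Webkit driver, which needs a running webkit_server):

```python
# Sketch of the proxy pattern used by dryscrape's Session: attribute
# lookups that fail on the session fall through to the wrapped driver.
# StubDriver is a hypothetical stand-in for the real Webkit driver.

class StubDriver(object):
    def visit(self, url):
        return "visited %s" % url


class ProxySession(object):
    def __init__(self, driver):
        self.driver = driver

    def __getattr__(self, attr):
        # __getattr__ runs only when normal lookup fails, so methods
        # defined on the session itself still take precedence
        return getattr(self.driver, attr)


sess = ProxySession(StubDriver())
print(sess.visit("http://example.com/"))  # visited http://example.com/
```

Because ``__getattr__`` is consulted only after regular attribute lookup fails, the session can override selected driver methods (as the real ``Session.visit`` does) while transparently forwarding everything else.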
17 | -------------------------------------------------------------------------------- /dryscrape/__init__.py: -------------------------------------------------------------------------------- 1 | from .session import * 2 | from .xvfb import * 3 | import dryscrape.driver 4 | -------------------------------------------------------------------------------- /dryscrape/driver/__init__.py: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/niklasb/dryscrape/4d3dabdec02f321a37325ff8dbb43d049d451931/dryscrape/driver/__init__.py -------------------------------------------------------------------------------- /dryscrape/driver/webkit.py: -------------------------------------------------------------------------------- 1 | """ 2 | Headless Webkit driver for dryscrape. Wraps the ``webkit_server`` module. 3 | """ 4 | 5 | import dryscrape.mixins 6 | import webkit_server 7 | 8 | class Node(webkit_server.Node, 9 | dryscrape.mixins.SelectionMixin, 10 | dryscrape.mixins.AttributeMixin): 11 | """ Node implementation wrapping a ``webkit_server`` node. """ 12 | 13 | 14 | class NodeFactory(webkit_server.NodeFactory): 15 | """ overrides the NodeFactory provided by ``webkit_server``. """ 16 | def create(self, node_id): 17 | return Node(self.client, node_id) 18 | 19 | 20 | class Driver(webkit_server.Client, 21 | dryscrape.mixins.WaitMixin, 22 | dryscrape.mixins.HtmlParsingMixin): 23 | """ Driver implementation wrapping a ``webkit_server`` driver. 24 | 25 | Keyword arguments are passed through to the underlying ``webkit_server.Client`` 26 | constructor. By default, `node_factory_class` is set to use the dryscrape 27 | node implementation. 
""" 28 | def __init__(self, **kw): 29 | kw.setdefault('node_factory_class', NodeFactory) 30 | super(Driver, self).__init__(**kw) 31 | -------------------------------------------------------------------------------- /dryscrape/mixins.py: -------------------------------------------------------------------------------- 1 | """ 2 | Mixins for use in dryscrape drivers. 3 | """ 4 | 5 | import time 6 | import lxml.html 7 | 8 | class SelectionMixin(object): 9 | """ Mixin that adds different methods of node selection to an object that 10 | provides an ``xpath`` method returning a collection of matches. """ 11 | 12 | def css(self, css): 13 | """ Returns all nodes matching the given CSSv3 expression. """ 14 | return self.css(css) 15 | 16 | def at_css(self, css): 17 | """ Returns the first node matching the given CSSv3 18 | expression or ``None``. """ 19 | return self._first_or_none(self.css(css)) 20 | 21 | def at_xpath(self, xpath): 22 | """ Returns the first node matching the given XPath 2.0 expression or ``None``. 23 | """ 24 | return self._first_or_none(self.xpath(xpath)) 25 | 26 | def parent(self): 27 | """ Returns the parent node. """ 28 | return self.at_xpath('..') 29 | 30 | def children(self): 31 | """ Returns the child nodes. """ 32 | return self.xpath('*') 33 | 34 | def form(self): 35 | """ Returns the form wherein this node is contained or ``None``. """ 36 | return self.at_xpath("ancestor::form") 37 | 38 | def _first_or_none(self, list): 39 | return list[0] if list else None 40 | 41 | 42 | class AttributeMixin(object): 43 | """ Mixin that adds ``[]`` access syntax sugar to an object that supports a 44 | ``set_attr`` and ``get_attr`` method. 
""" 45 | 46 | def __getitem__(self, attr): 47 | """ Syntax sugar for accessing this node's attributes """ 48 | return self.get_attr(attr) 49 | 50 | def __setitem__(self, attr, value): 51 | """ Syntax sugar for setting this node's attributes """ 52 | self.set_attr(attr, value) 53 | 54 | 55 | class HtmlParsingMixin(object): 56 | """ Mixin that adds a ``document`` method to an object that supports a ``body`` 57 | method returning valid HTML. """ 58 | 59 | def document(self): 60 | """ Parses the HTML returned by ``body`` and returns it as an lxml.html 61 | document. If the driver supports live DOM manipulation (like webkit_server 62 | does), changes performed on the returned document will not take effect. """ 63 | return lxml.html.document_fromstring(self.body()) 64 | 65 | 66 | # default timeout values 67 | DEFAULT_WAIT_INTERVAL = 0.5 68 | DEFAULT_WAIT_TIMEOUT = 10 69 | DEFAULT_AT_TIMEOUT = 1 70 | 71 | class WaitTimeoutError(Exception): 72 | """ Raised when a wait times out """ 73 | 74 | class WaitMixin(SelectionMixin): 75 | """ Mixin that allows waiting for conditions or elements. """ 76 | 77 | def wait_for(self, 78 | condition, 79 | interval = DEFAULT_WAIT_INTERVAL, 80 | timeout = DEFAULT_WAIT_TIMEOUT): 81 | """ Wait until a condition holds by checking it in regular intervals. 82 | Raises ``WaitTimeoutError`` on timeout. """ 83 | 84 | start = time.time() 85 | 86 | # at least execute the check once! 87 | while True: 88 | res = condition() 89 | if res: 90 | return res 91 | 92 | # timeout? 93 | if time.time() - start > timeout: 94 | break 95 | 96 | # wait a bit 97 | time.sleep(interval) 98 | 99 | # timeout occured! 100 | raise WaitTimeoutError("wait_for timed out") 101 | 102 | def wait_for_safe(self, *args, **kw): 103 | """ Wait until a condition holds and return 104 | ``None`` on timeout. 
""" 105 | try: 106 | return self.wait_for(*args, **kw) 107 | except WaitTimeoutError: 108 | return None 109 | 110 | def wait_while(self, condition, *args, **kw): 111 | """ Wait while a condition holds. """ 112 | return self.wait_for(lambda: not condition(), *args, **kw) 113 | 114 | def at_css(self, css, timeout = DEFAULT_AT_TIMEOUT, **kw): 115 | """ Returns the first node matching the given CSSv3 expression or ``None`` 116 | if a timeout occurs. """ 117 | return self.wait_for_safe(lambda: super(WaitMixin, self).at_css(css), 118 | timeout = timeout, 119 | **kw) 120 | 121 | def at_xpath(self, xpath, timeout = DEFAULT_AT_TIMEOUT, **kw): 122 | """ Returns the first node matching the given XPath 2.0 expression or ``None`` 123 | if a timeout occurs. """ 124 | return self.wait_for_safe(lambda: super(WaitMixin, self).at_xpath(xpath), 125 | timeout = timeout, 126 | **kw) 127 | -------------------------------------------------------------------------------- /dryscrape/session.py: -------------------------------------------------------------------------------- 1 | from dryscrape.driver.webkit import Driver as DefaultDriver 2 | 3 | from itertools import chain 4 | try: 5 | import urlparse 6 | except ImportError: 7 | import urllib.parse 8 | urlparse = urllib.parse 9 | 10 | class Session(object): 11 | """ A web scraping session based on a driver instance. Implements the proxy 12 | pattern to pass unresolved method calls to the underlying driver. 13 | 14 | If no `driver` is specified, the instance will create an instance of 15 | ``dryscrape.session.DefaultDriver`` to get a driver instance (defaults to 16 | ``dryscrape.driver.webkit.Driver``). 17 | 18 | If `base_url` is present, relative URLs are completed with this URL base. 19 | If not, the `get_base_url` method is called on itself to get the base URL. 
""" 20 | 21 | def __init__(self, 22 | driver = None, 23 | base_url = None): 24 | self.driver = driver or DefaultDriver() 25 | self.base_url = base_url 26 | 27 | # implement proxy pattern 28 | def __getattr__(self, attr): 29 | """ Pass unresolved method calls to underlying driver. """ 30 | return getattr(self.driver, attr) 31 | 32 | def __dir__(self): 33 | """Allow for `dir` to detect proxied methods from `Driver`.""" 34 | dir_chain = chain(dir(type(self)), dir(self.driver)) 35 | return list(set(dir_chain)) 36 | 37 | def visit(self, url): 38 | """ Passes through the URL to the driver after completing it using the 39 | instance's URL base. """ 40 | return self.driver.visit(self.complete_url(url)) 41 | 42 | def complete_url(self, url): 43 | """ Completes a given URL with this instance's URL base. """ 44 | if self.base_url: 45 | return urlparse.urljoin(self.base_url, url) 46 | else: 47 | return url 48 | 49 | def interact(self, **local): 50 | """ Drops the user into an interactive Python session with the ``sess`` variable 51 | set to the current session instance. If keyword arguments are supplied, these 52 | names will also be available within the session. 
""" 53 | import code 54 | code.interact(local=dict(sess=self, **local)) 55 | -------------------------------------------------------------------------------- /dryscrape/xvfb.py: -------------------------------------------------------------------------------- 1 | import atexit 2 | import os 3 | 4 | _xvfb = None 5 | 6 | 7 | def start_xvfb(): 8 | from xvfbwrapper import Xvfb 9 | global _xvfb 10 | _xvfb = Xvfb() 11 | _xvfb.start() 12 | atexit.register(_xvfb.stop) 13 | 14 | 15 | def stop_xvfb(): 16 | global _xvfb 17 | if _xvfb is not None: 18 | _xvfb.stop() 19 | _xvfb = None 20 | -------------------------------------------------------------------------------- /examples/google.py: -------------------------------------------------------------------------------- 1 | import dryscrape 2 | import sys 3 | 4 | if 'linux' in sys.platform: 5 | # start xvfb in case no X is running. Make sure xvfb 6 | # is installed, otherwise this won't work! 7 | dryscrape.start_xvfb() 8 | 9 | search_term = 'dryscrape' 10 | 11 | # set up a web scraping session 12 | sess = dryscrape.Session(base_url = 'http://google.com') 13 | 14 | # we don't need images 15 | sess.set_attribute('auto_load_images', False) 16 | 17 | # visit homepage and search for a term 18 | sess.visit('/') 19 | q = sess.at_xpath('//*[@name="q"]') 20 | q.set(search_term) 21 | q.form().submit() 22 | 23 | # extract all links 24 | for link in sess.xpath('//a[@href]'): 25 | print(link['href']) 26 | 27 | # save a screenshot of the web page 28 | sess.render('google.png') 29 | print("Screenshot written to 'google.png'") 30 | -------------------------------------------------------------------------------- /requirements.txt: -------------------------------------------------------------------------------- 1 | lxml 2 | git+https://github.com/niklasb/webkit-server.git 3 | xvfbwrapper 4 | -------------------------------------------------------------------------------- /setup.py: -------------------------------------------------------------------------------- 1 | from setuptools import setup 2 | 3 | setup(name='dryscrape', 4 | version='1.0.1', 5 | description='a lightweight Javascript-aware, headless web scraping library for Python', 6 | author='Niklas Baumstark', 7 | author_email='niklas.baumstark@gmail.com', 8 | license='MIT', 9 | url='https://github.com/niklasb/dryscrape', 10 | packages=['dryscrape', 'dryscrape.driver'], 11 | install_requires=['webkit_server>=1.0', 'lxml', 'xvfbwrapper'], 12 | ) 13 | --------------------------------------------------------------------------------
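
Addendum: the polling loop at the heart of `WaitMixin.wait_for` in `dryscrape/mixins.py` can be exercised on its own, without a browser or driver. The sketch below reproduces that loop standalone; the `ready` helper and the short interval/timeout values are illustrative only and not part of the library.

```python
import time

class WaitTimeoutError(Exception):
    """ Raised when a wait times out (mirrors dryscrape.mixins). """

def wait_for(condition, interval=0.5, timeout=10):
    """ Check `condition` at regular intervals until it returns a truthy
    value; raise WaitTimeoutError once `timeout` seconds have elapsed. """
    start = time.time()
    while True:
        res = condition()
        if res:
            return res  # hand back the truthy result, as the mixin does
        if time.time() - start > timeout:
            raise WaitTimeoutError("wait_for timed out")
        time.sleep(interval)

# illustrative condition: becomes truthy on the third poll
calls = {'n': 0}
def ready():
    calls['n'] += 1
    return calls['n'] >= 3

result = wait_for(ready, interval=0.01, timeout=1.0)
print(result)  # True
```

This is also why `at_css`/`at_xpath` in the library accept a `timeout` keyword: they wrap the same loop via `wait_for_safe`, which swallows the `WaitTimeoutError` and returns `None` instead.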