├── docs
│   ├── .nojekyll
│   ├── _navbar.md
│   ├── _coverpage.md
│   ├── index.html
│   ├── README.md
│   ├── zh-cn
│   │   └── translate.md
│   └── requests-html-logo.svg
└── .gitignore

/docs/.nojekyll:
--------------------------------------------------------------------------------

--------------------------------------------------------------------------------
/.gitignore:
--------------------------------------------------------------------------------
.idea/

--------------------------------------------------------------------------------
/docs/_navbar.md:
--------------------------------------------------------------------------------
* [EN](http://html.python-requests.org/index.html)
* [中文](/#)
* Contact
* 635870838@qq.com
--------------------------------------------------------------------------------
/docs/_coverpage.md:
--------------------------------------------------------------------------------
![logo](requests-html-logo.svg)

# Requests-HTML

> HTML Parsing for Humans (writing Python 3)!

* Full JavaScript support!
* CSS selectors (jQuery-style, thanks to PyQuery).
* XPath selectors, for the faint at heart.
* Custom user-agent (like a real web browser).
* Automatic following of redirects.
* Connection pooling and cookie persistence.
* A delightful request experience, with magical page parsing.

[GitHub](https://github.com/kennethreitz/requests-html/)
[Get started](?id=安装)


--------------------------------------------------------------------------------
/docs/index.html:
--------------------------------------------------------------------------------
requests-html Chinese documentation
--------------------------------------------------------------------------------
/docs/README.md:
--------------------------------------------------------------------------------
## Installation

```shell
$ pip install requests-html
```
Only Python 3.6 and above is supported.

## Usage
Make a GET request to [python.org](https://python.org/), using [Requests](https://docs.python-requests.org/):
```python
>>> from requests_html import HTMLSession
>>> session = HTMLSession()

>>> r = session.get('https://python.org/')
```
Grab all links on the page as a list, in the form they appear in the page (anchors excluded):
```python
>>> r.html.links
['//docs.python.org/3/tutorial/', '/about/apps/', 'https://github.com/python/pythondotorg/issues', '/accounts/login/', '/dev/peps/', '/about/legal/', '//docs.python.org/3/tutorial/introduction.html#lists', '/download/alternatives', 'http://feedproxy.google.com/~r/PythonInsider/~3/kihd2DW98YY/python-370a4-is-available-for-testing.html', '/download/other/', '/downloads/windows/', 'https://mail.python.org/mailman/listinfo/python-dev', '/doc/av', 'https://devguide.python.org/', '/about/success/#engineering', 'https://wiki.python.org/moin/PythonEventsCalendar#Submitting_an_Event', 'https://www.openstack.org', '/about/gettingstarted/', 'http://feedproxy.google.com/~r/PythonInsider/~3/AMoBel8b8Mc/python-3.html', '/success-stories/industrial-light-magic-runs-python/', 'http://docs.python.org/3/tutorial/introduction.html#using-python-as-a-calculator', '/', 'http://pyfound.blogspot.com/', '/events/python-events/past/', '/downloads/release/python-2714/', 'https://wiki.python.org/moin/PythonBooks', 'http://plus.google.com/+Python', 'https://wiki.python.org/moin/', 'https://status.python.org/', '/community/workshops/', '/community/lists/', 'http://buildbot.net/', '/community/awards', 'http://twitter.com/ThePSF', 'https://docs.python.org/3/license.html', '/psf/donations/', 'http://wiki.python.org/moin/Languages', 
'/dev/', '/events/python-user-group/', 'https://wiki.qt.io/PySide', '/community/sigs/', 'https://wiki.gnome.org/Projects/PyGObject', 'http://www.ansible.com', 'http://www.saltstack.com', 'http://planetpython.org/', '/events/python-events', '/about/help/', '/events/python-user-group/past/', '/about/success/', '/psf-landing/', '/about/apps', '/about/', 'http://www.wxpython.org/', '/events/python-user-group/665/', 'https://www.python.org/psf/codeofconduct/', '/dev/peps/peps.rss', '/downloads/source/', '/psf/sponsorship/sponsors/', 'http://bottlepy.org', 'http://roundup.sourceforge.net/', 'http://pandas.pydata.org/', 'http://brochure.getpython.info/', 'https://bugs.python.org/', '/community/merchandise/', 'http://tornadoweb.org', '/events/python-user-group/650/', 'http://flask.pocoo.org/', '/downloads/release/python-364/', '/events/python-user-group/660/', '/events/python-user-group/638/', '/psf/', '/doc/', 'http://blog.python.org', '/events/python-events/604/', '/about/success/#government', 'http://python.org/dev/peps/', 'https://docs.python.org', 'http://feedproxy.google.com/~r/PythonInsider/~3/zVC80sq9s00/python-364-is-now-available.html', '/users/membership/', '/about/success/#arts', 'https://wiki.python.org/moin/Python2orPython3', '/downloads/', '/jobs/', 'http://trac.edgewall.org/', 'http://feedproxy.google.com/~r/PythonInsider/~3/wh73_1A-N7Q/python-355rc1-and-python-348rc1-are-now.html', '/privacy/', 'https://pypi.python.org/', 'http://www.riverbankcomputing.co.uk/software/pyqt/intro', 'http://www.scipy.org', '/community/forums/', '/about/success/#scientific', '/about/success/#software-development', '/shell/', '/accounts/signup/', 'http://www.facebook.com/pythonlang?fref=ts', '/community/', 'https://kivy.org/', '/about/quotes/', 'http://www.web2py.com/', '/community/logos/', '/community/diversity/', '/events/calendars/', 'https://wiki.python.org/moin/BeginnersGuide', '/success-stories/', '/doc/essays/', '/dev/core-mentorship/', 'http://ipython.org', '/events/', 
'//docs.python.org/3/tutorial/controlflow.html', '/about/success/#education', '/blogs/', '/community/irc/', 'http://pycon.blogspot.com/', '//jobs.python.org', 'http://www.pylonsproject.org/', 'http://www.djangoproject.com/', '/downloads/mac-osx/', '/about/success/#business', 'http://feedproxy.google.com/~r/PythonInsider/~3/x_c9D0S-4C4/python-370b1-is-now-available-for.html', 'http://wiki.python.org/moin/TkInter', 'https://docs.python.org/faq/', '//docs.python.org/3/tutorial/controlflow.html#defining-functions']
```
Grab all links on the page as a list, converted to [absolute form](https://www.navegabem.com/absolute-or-relative-links.html) (anchors excluded):
```python
>>> r.html.absolute_links
['https://github.com/python/pythondotorg/issues', 'https://docs.python.org/3/tutorial/', 'https://www.python.org/about/success/', 'http://feedproxy.google.com/~r/PythonInsider/~3/kihd2DW98YY/python-370a4-is-available-for-testing.html', 'https://www.python.org/dev/peps/', 'https://mail.python.org/mailman/listinfo/python-dev', 'https://www.python.org/doc/', 'https://www.python.org/', 'https://www.python.org/about/', 'https://www.python.org/events/python-events/past/', 'https://devguide.python.org/', 'https://wiki.python.org/moin/PythonEventsCalendar#Submitting_an_Event', 'https://www.openstack.org', 'http://feedproxy.google.com/~r/PythonInsider/~3/AMoBel8b8Mc/python-3.html', 'https://docs.python.org/3/tutorial/introduction.html#lists', 'http://docs.python.org/3/tutorial/introduction.html#using-python-as-a-calculator', 'http://pyfound.blogspot.com/', 'https://wiki.python.org/moin/PythonBooks', 'http://plus.google.com/+Python', 'https://wiki.python.org/moin/', 'https://www.python.org/events/python-events', 'https://status.python.org/', 'https://www.python.org/about/apps', 'https://www.python.org/downloads/release/python-2714/', 'https://www.python.org/psf/donations/', 'http://buildbot.net/', 'http://twitter.com/ThePSF', 'https://docs.python.org/3/license.html', 
'http://wiki.python.org/moin/Languages', 'https://docs.python.org/faq/', 'https://jobs.python.org', 'https://www.python.org/about/success/#software-development', 'https://www.python.org/about/success/#education', 'https://www.python.org/community/logos/', 'https://www.python.org/doc/av', 'https://wiki.qt.io/PySide', 'https://www.python.org/events/python-user-group/660/', 'https://wiki.gnome.org/Projects/PyGObject', 'http://www.ansible.com', 'http://www.saltstack.com', 'https://www.python.org/dev/peps/peps.rss', 'http://planetpython.org/', 'https://www.python.org/events/python-user-group/past/', 'https://docs.python.org/3/tutorial/controlflow.html#defining-functions', 'https://www.python.org/community/diversity/', 'https://docs.python.org/3/tutorial/controlflow.html', 'https://www.python.org/community/awards', 'https://www.python.org/events/python-user-group/638/', 'https://www.python.org/about/legal/', 'https://www.python.org/dev/', 'https://www.python.org/download/alternatives', 'https://www.python.org/downloads/', 'https://www.python.org/community/lists/', 'http://www.wxpython.org/', 'https://www.python.org/about/success/#government', 'https://www.python.org/psf/', 'https://www.python.org/psf/codeofconduct/', 'http://bottlepy.org', 'http://roundup.sourceforge.net/', 'http://pandas.pydata.org/', 'http://brochure.getpython.info/', 'https://www.python.org/downloads/source/', 'https://bugs.python.org/', 'https://www.python.org/downloads/mac-osx/', 'https://www.python.org/about/help/', 'http://tornadoweb.org', 'http://flask.pocoo.org/', 'https://www.python.org/users/membership/', 'http://blog.python.org', 'https://www.python.org/privacy/', 'https://www.python.org/about/gettingstarted/', 'http://python.org/dev/peps/', 'https://www.python.org/about/apps/', 'https://docs.python.org', 'https://www.python.org/success-stories/', 'https://www.python.org/community/forums/', 'http://feedproxy.google.com/~r/PythonInsider/~3/zVC80sq9s00/python-364-is-now-available.html', 
'https://www.python.org/community/merchandise/', 'https://www.python.org/about/success/#arts', 'https://wiki.python.org/moin/Python2orPython3', 'http://trac.edgewall.org/', 'http://feedproxy.google.com/~r/PythonInsider/~3/wh73_1A-N7Q/python-355rc1-and-python-348rc1-are-now.html', 'https://pypi.python.org/', 'https://www.python.org/events/python-user-group/650/', 'http://www.riverbankcomputing.co.uk/software/pyqt/intro', 'https://www.python.org/about/quotes/', 'https://www.python.org/downloads/windows/', 'https://www.python.org/events/calendars/', 'http://www.scipy.org', 'https://www.python.org/community/workshops/', 'https://www.python.org/blogs/', 'https://www.python.org/accounts/signup/', 'https://www.python.org/events/', 'https://kivy.org/', 'http://www.facebook.com/pythonlang?fref=ts', 'http://www.web2py.com/', 'https://www.python.org/psf/sponsorship/sponsors/', 'https://www.python.org/community/', 'https://www.python.org/download/other/', 'https://www.python.org/psf-landing/', 'https://www.python.org/events/python-user-group/665/', 'https://wiki.python.org/moin/BeginnersGuide', 'https://www.python.org/accounts/login/', 'https://www.python.org/downloads/release/python-364/', 'https://www.python.org/dev/core-mentorship/', 'https://www.python.org/about/success/#business', 'https://www.python.org/community/sigs/', 'https://www.python.org/events/python-user-group/', 'http://ipython.org', 'https://www.python.org/shell/', 'https://www.python.org/community/irc/', 'https://www.python.org/about/success/#engineering', 'http://www.pylonsproject.org/', 'http://pycon.blogspot.com/', 'https://www.python.org/about/success/#scientific', 'https://www.python.org/doc/essays/', 'http://www.djangoproject.com/', 'https://www.python.org/success-stories/industrial-light-magic-runs-python/', 'http://feedproxy.google.com/~r/PythonInsider/~3/x_c9D0S-4C4/python-370b1-is-now-available-for.html', 'http://wiki.python.org/moin/TkInter', 'https://www.python.org/jobs/', 
'https://www.python.org/events/python-events/604/']
```
Select an **Element object** with a CSS selector ([learn more](https://www.w3schools.com/cssref/css_selectors.asp)):
```python
>>> about = r.html.find('#about', first=True)
```
Grab the text content of an **Element object**:
```python
>>> print(about.text)
About
Applications
Quotes
Getting Started
Help
Python Brochure
```
Grab all attributes of an **Element object** ([learn more](https://developer.mozilla.org/en-US/docs/Web/HTML/Attributes)):
```python
>>> about.attrs
{'id': 'about', 'class': ('tier-1', 'element-1'), 'aria-haspopup': 'true'}
```
Render the HTML of an **Element object**:
```python
>>> about.html
'  • \nAbout\n\n  • '
```
Select specific child **Element objects** within an **Element object**; the result is a list:
```python
>>> about.find('a')
[, , , , , ]
```
Find the absolute links within an **Element object**:
```python
>>> about.absolute_links
{'http://brochure.getpython.info/', 'https://www.python.org/about/gettingstarted/', 'https://www.python.org/about/', 'https://www.python.org/about/quotes/', 'https://www.python.org/about/help/', 'https://www.python.org/about/apps/'}
```
Search for text on the fetched page:
```python
>>> r.html.search('Python is a {} language')[0]
programming
```
A more complex CSS selector example (copied from Chrome dev tools):
```python
>>> r = session.get('https://github.com/')
>>> sel = 'body > div.application-main > div.jumbotron.jumbotron-codelines > div > div > div.col-md-7.text-center.text-md-left > p'

>>> print(r.html.find(sel, first=True).text)
GitHub is a development platform inspired by the way you work. From open source to business, you can host and review code, manage projects, and build software alongside millions of other developers.
```
XPath is also supported ([learn more](https://msdn.microsoft.com/en-us/library/ms256086(v=vs.110).aspx)):
```python
>>> r.html.xpath('a')
[]
```
You can also select only the **Element objects** that contain certain text:
```python
>>> r = session.get('http://python-requests.org/')
>>> r.html.find('a', containing='kenneth')
[, , , ]
```


## JavaScript Support
Let's grab some text that is rendered by JavaScript:
```python
>>> r = session.get('http://python-requests.org/')

>>> r.html.render()

>>> r.html.search('Python 2 will retire in only {months} months!')['months']
''
```
Note: the first time you call `render()`, Chromium is downloaded automatically and saved under your home directory (e.g. `~/.pyppeteer/`). This download happens only once.

## Pagination
Smart pagination is supported (a work in progress):
```python
>>> r = session.get('https://reddit.com')
>>> for html in r.html:
...     print(html)
…
```
You can also easily request the next URL:
```python
>>> r = session.get('https://reddit.com')
>>> r.html.next()
'https://www.reddit.com/?count=25&after=t3_81pm82'
```

## Using without Requests
You can use requests-html without the Requests library:
```python
>>> from requests_html import HTML
>>> doc = """<a href='https://httpbin.org'>"""

>>> html = HTML(html=doc)
>>> html.links
{'https://httpbin.org'}
```
You can also render JavaScript pages without the Requests library:
```python
# ^^ continuing from the code above ^^
>>> script = """
        () => {
            return {
                width: document.documentElement.clientWidth,
                height: document.documentElement.clientHeight,
                deviceScaleFactor: window.devicePixelRatio,
            }
        }
    """
>>> val = html.render(script=script, reload=False)

>>> print(val)
{'width': 800, 'height': 600, 'deviceScaleFactor': 1}

>>> print(html.html)
<html><head></head><body><a href="https://httpbin.org"></a></body></html>
```

## API Documentation

These classes are the main interface to `requests-html`:
### ****HTML class****
class **requests_html.HTML**(\*,* session: Union[_ForwardRef('HTTPSession'), _ForwardRef('AsyncHTMLSession')] = None, url: str = 'https://example.org/', html: Union[str, bytes], default_encoding: str = 'utf-8'*) → None [[source](http://html.python-requests.org/_modules/requests_html.html#HTML)]

Parses an HTML document.

****Parameters****:
- **url** - the URL the HTML came from; used by `absolute_links`
- **html** - the HTML to parse, as a string or bytes (optional)
- **default_encoding** - the default character encoding

#### ****absolute_links****

All links found on the page, converted to absolute form.

#### ****base_url****

The base URL of the page. Supports the `<base>` tag ([learn more](http://www.w3school.com.cn/tags/tag_base.asp)).

#### ****encoding****

The encoding to use, extracted from the HTML and the HTML response headers.

#### ****find****

****find****(selector: str = '\*', \*, containing: Union[str, typing.List[str]] = None, clean: bool = False, first: bool = False, _encoding: str = None) → Union[typing.List[_ForwardRef('Element')], _ForwardRef('Element')]

Given a CSS selector, returns an **Element object** or a list of **Element objects**.

****Parameters****:

- **selector** - the CSS selector
- **clean** - for the found `