├── LICENSE
├── README.md
├── image.png
├── img
│   ├── 20171214020322682.jpg
│   ├── 20171215043409613.jpg
│   ├── 20171219102251229.jpg
│   ├── 20171220114008366.jpg
│   ├── 20171226032741976.jpg
│   ├── 20171227102206573.jpg
│   ├── image1.png
│   ├── sreenshot1.png
│   └── sreenshot2.png
├── notebook
│   ├── 1-1-urllib.ipynb
│   ├── 2-1-beautifulsoup-basic.ipynb
│   ├── 2-2-beautifulsoup-css.ipynb
│   ├── 2-3-beautifulsoup-regex.ipynb
│   ├── 2-4-practice-baidu-baike.ipynb
│   ├── 3-1-requests.ipynb
│   ├── 3-2-download.ipynb
│   ├── 3-3-practice-download-images.ipynb
│   ├── 4-1-distributed-scraping.ipynb
│   ├── 4-2-asyncio.ipynb
│   ├── 5-1-selenium.ipynb
│   └── 5-2-scrapy.ipynb
├── scraping.jpg
└── source_code
    ├── 1-1-urllib.py
    ├── 2-1-beautifulsoup-basic.py
    ├── 2-2-beautifulsoup-css.py
    ├── 2-3-beautifulsoup-regex.py
    ├── 2-4-practice-baidu-baike.py
    ├── 2-5-beautifulsoup-table.py
    ├── 3-1-requests.py
    ├── 3-2-download.py
    ├── 3-3-practice-download-images.py
    ├── 4-1-distributed-scraping.py
    ├── 4-2-asyncio.py
    ├── 5-1-selenium.py
    └── 5-2-scrapy.py

/LICENSE:
--------------------------------------------------------------------------------
MIT License

Copyright (c) 2017

Permission is hereby granted, free of charge, to any person obtaining a copy
of this software and associated documentation files (the "Software"), to deal
in the Software without restriction, including without limitation the rights
to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
copies of the Software, and to permit persons to whom the Software is
furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in all
copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
SOFTWARE.

--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
# Web scraping tutorials (Python)

In these tutorials, we will learn to build some simple but useful scrapers from scratch. You will see how to read a web page, select the sections you need, and even download files.
If you understand Chinese, you are lucky! I made Chinese video and text tutorials for all of this content. You can find them at [莫烦Python](https://mofanpy.com/).

**To learn from the code, you have two options:**

1. learn from the [source code](/source_code/)
2. learn from the [jupyter notebook](/notebook/)

## The contents

* Basic concept and package
  * [Urllib](/notebook/1-1-urllib.ipynb)
* BeautifulSoup
  * [Basic](/notebook/2-1-beautifulsoup-basic.ipynb)
  * [CSS](/notebook/2-2-beautifulsoup-css.ipynb)
  * [RegEx](/notebook/2-3-beautifulsoup-regex.ipynb)
  * [Practice random scraping](/notebook/2-4-practice-baidu-baike.ipynb)
* Requests and Download
  * [Requests](/notebook/3-1-requests.ipynb)
  * [Download](/notebook/3-2-download.ipynb)
  * [Practice download image](/notebook/3-3-practice-download-images.ipynb)
* Speed up scraping
  * [Distributed scraping (multiprocessing)](/notebook/4-1-distributed-scraping.ipynb)
  * [Asyncio](/notebook/4-2-asyncio.ipynb)
* Advanced
  * [Selenium](/notebook/5-1-selenium.ipynb)
  * [Scrapy](/notebook/5-2-scrapy.ipynb)

# Donation

*If this helps you, please consider donating to support me in making better tutorials. Any contribution is greatly appreciated!*

* Paypal
* Patreon
-------------------------------------------------------------------------------- /image.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/MorvanZhou/easy-scraping-tutorial/501e9862249d3bb00ae77026ff1b4eeb4b4a48fb/image.png -------------------------------------------------------------------------------- /img/20171214020322682.jpg: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/MorvanZhou/easy-scraping-tutorial/501e9862249d3bb00ae77026ff1b4eeb4b4a48fb/img/20171214020322682.jpg -------------------------------------------------------------------------------- /img/20171215043409613.jpg: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/MorvanZhou/easy-scraping-tutorial/501e9862249d3bb00ae77026ff1b4eeb4b4a48fb/img/20171215043409613.jpg -------------------------------------------------------------------------------- /img/20171219102251229.jpg: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/MorvanZhou/easy-scraping-tutorial/501e9862249d3bb00ae77026ff1b4eeb4b4a48fb/img/20171219102251229.jpg -------------------------------------------------------------------------------- /img/20171220114008366.jpg: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/MorvanZhou/easy-scraping-tutorial/501e9862249d3bb00ae77026ff1b4eeb4b4a48fb/img/20171220114008366.jpg -------------------------------------------------------------------------------- /img/20171226032741976.jpg: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/MorvanZhou/easy-scraping-tutorial/501e9862249d3bb00ae77026ff1b4eeb4b4a48fb/img/20171226032741976.jpg -------------------------------------------------------------------------------- 
/img/20171227102206573.jpg: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/MorvanZhou/easy-scraping-tutorial/501e9862249d3bb00ae77026ff1b4eeb4b4a48fb/img/20171227102206573.jpg -------------------------------------------------------------------------------- /img/image1.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/MorvanZhou/easy-scraping-tutorial/501e9862249d3bb00ae77026ff1b4eeb4b4a48fb/img/image1.png -------------------------------------------------------------------------------- /img/sreenshot1.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/MorvanZhou/easy-scraping-tutorial/501e9862249d3bb00ae77026ff1b4eeb4b4a48fb/img/sreenshot1.png -------------------------------------------------------------------------------- /img/sreenshot2.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/MorvanZhou/easy-scraping-tutorial/501e9862249d3bb00ae77026ff1b4eeb4b4a48fb/img/sreenshot2.png -------------------------------------------------------------------------------- /notebook/1-1-urllib.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "metadata": {}, 6 | "source": [ 7 | "# Simplest scraping\n", 8 | "**We are going to show how to open a very simple [webpage](https://mofanpy.com/static/scraping/basic-structure.html) and read all of its content.**" 9 | ] 10 | }, 11 | { 12 | "cell_type": "code", 13 | "execution_count": 1, 14 | "metadata": { 15 | "collapsed": true 16 | }, 17 | "outputs": [ 18 | { 19 | "name": "stdout", 20 | "output_type": "stream", 21 | "text": [ 22 | "\n\n\n\t\n\tScraping tutorial 1 | 莫烦Python\n\t\n\n\n\t

爬虫测试1

\n\t

\n\t\t这是一个在 莫烦Python\n\t\t爬虫教程 中的简单测试.\n\t

\n\n\n\n" 23 | ] 24 | } 25 | ], 26 | "source": [ 27 | "from urllib.request import urlopen\n", 28 | "\n", 29 | "# read() returns bytes; decode as UTF-8 so the Chinese text is readable\n", 30 | "html = urlopen(\"https://mofanpy.com/static/scraping/basic-structure.html\").read().decode('utf-8')\n", 31 | "print(html)\n" 32 | ] 33 | }, 34 | { 35 | "cell_type": "markdown", 36 | "metadata": {}, 37 | "source": [ 38 | "**Then we select some text according to the tags around it, using a regular expression.**" 39 | ] 40 | }, 41 | { 42 | "cell_type": "code", 43 | "execution_count": 4, 44 | "metadata": {}, 45 | "outputs": [ 46 | { 47 | "name": "stdout", 48 | "output_type": "stream", 49 | "text": [ 50 | "\nPage title is: Scraping tutorial 1 | 莫烦Python\n" 51 | ] 52 | } 53 | ], 54 | "source": [ 55 | "import re\n", 56 | "res = re.findall(r\"<title>(.+?)</title>\", html)\n", 57 | "print(\"\\nPage title is: \", res[0])" 58 | ] 59 | }, 60 | { 61 | "cell_type": "markdown", 62 | "metadata": {}, 63 | "source": [ 64 | "**Another example: select the paragraph content out of the HTML.**" 65 | ] 66 | }, 67 | { 68 | "cell_type": "code", 69 | "execution_count": 10, 70 | "metadata": {}, 71 | "outputs": [ 72 | { 73 | "name": "stdout", 74 | "output_type": "stream", 75 | "text": [ 76 | "\nPage paragraph is: \n\t\t这是一个在 莫烦Python\n\t\t爬虫教程 中的简单测试.\n\t\n" 77 | ] 78 | } 79 | ], 80 | "source": [ 81 | "res = re.findall(r\"<p>(.*?)</p>\", html, flags=re.DOTALL) # re.DOTALL lets . match newlines, since the paragraph spans several lines\n", 82 | "print(\"\\nPage paragraph is: \", res[0])" 83 | ] 84 | }, 85 | { 86 | "cell_type": "markdown", 87 | "metadata": {}, 88 | "source": [ 89 | "**And select links using regex**" 90 | ] 91 | }, 92 | { 93 | "cell_type": "code", 94 | "execution_count": 9, 95 | "metadata": {}, 96 | "outputs": [ 97 | { 98 | "name": "stdout", 99 | "output_type": "stream", 100 | "text": [ 101 | "\nAll links: ['https://mofanpy.com/static/img/description/tab_icon.png', 'https://mofanpy.com/', 'https://mofanpy.com/tutorials/scraping']\n" 102 | ] 103 | } 104 | ], 105 | "source": [ 106 | "res = re.findall(r'href=\"(.*?)\"', html)\n", 107 | "print(\"\\nAll links: \", res)" 108 | ] 109 | }, 110 | { 111 | "cell_type": "code", 112 | "execution_count": null, 113 | "metadata": {}, 114 | "outputs": [], 115 | "source": [] 116 | } 117 | ], 118 | "metadata": { 119 | "kernelspec": { 120 | "display_name": "Python 2", 121 | "language": "python", 122 | "name": "python2" 123 | }, 124 | "language_info": { 125 | "codemirror_mode": { 126 | "name": "ipython", 127 | "version": 2 128 | }, 129 | "file_extension": ".py", 130 | "mimetype": "text/x-python", 131 | "name": "python", 132 | "nbconvert_exporter": "python", 133 | "pygments_lexer": "ipython2", 134 | "version": "2.7.6" 135 | } 136 | }, 137 | "nbformat": 4, 138 | "nbformat_minor": 0 139 | } 140 | -------------------------------------------------------------------------------- /notebook/2-1-beautifulsoup-basic.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "metadata": {}, 6 | "source": [ 7 | "# Basic usage of beautifulsoup\n", 8 | "**First, we again open a [page](https://mofanpy.com/static/scraping/basic-structure.html); then we can apply BeautifulSoup to this page's HTML.**" 9 | ] 10 | }, 11 | { 12 | "cell_type": "code", 13 | "execution_count": 1, 14 | "metadata": { 15 | "collapsed": true 16 | },
17 | "outputs": [ 18 | { 19 | "name": "stdout", 20 | "output_type": "stream", 21 | "text": [ 22 | "\n\n\n\t\n\tScraping tutorial 1 | 莫烦Python\n\t\n\n\n\t

爬虫测试1

\n\t

\n\t\t这是一个在 莫烦Python\n\t\t爬虫教程 中的简单测试.\n\t

\n\n\n\n" 23 | ] 24 | } 25 | ], 26 | "source": [ 27 | "from bs4 import BeautifulSoup\n", 28 | "from urllib.request import urlopen\n", 29 | "\n", 30 | "# read() returns bytes; decode as UTF-8 so the Chinese text is readable\n", 31 | "html = urlopen(\"https://mofanpy.com/static/scraping/basic-structure.html\").read().decode('utf-8')\n", 32 | "print(html)" 33 | ] 34 | }, 35 | { 36 | "cell_type": "markdown", 37 | "metadata": {}, 38 | "source": [ 39 | "**Parse this HTML with the lxml parser to create a soup object. You can then use soup.h1 or soup.p to get the first heading 1 and paragraph tags from the soup.**" 40 | ] 41 | }, 42 | { 43 | "cell_type": "code", 44 | "execution_count": 2, 45 | "metadata": {}, 46 | "outputs": [ 47 | { 48 | "name": "stdout", 49 | "output_type": "stream", 50 | "text": [ 51 | "

爬虫测试1

\n\n

\n\t\t这是一个在 莫烦Python\n爬虫教程 中的简单测试.\n\t

\n" 52 | ] 53 | } 54 | ], 55 | "source": [ 56 | "soup = BeautifulSoup(html, features='lxml')\n", 57 | "print(soup.h1)\n", 58 | "print('\\n', soup.p)" 59 | ] 60 | }, 61 | { 62 | "cell_type": "markdown", 63 | "metadata": {}, 64 | "source": [ 65 | "**Or use helper functions such as find_all() to find tags. Access an attribute of a found tag with a key, just as you would with a Python dictionary.**" 66 | ] 67 | }, 68 | { 69 | "cell_type": "code", 70 | "execution_count": 3, 71 | "metadata": {}, 72 | "outputs": [ 73 | { 74 | "name": "stdout", 75 | "output_type": "stream", 76 | "text": [ 77 | "\n ['https://mofanpy.com/', 'https://mofanpy.com/tutorials/scraping']\n" 78 | ] 79 | } 80 | ], 81 | "source": [ 82 | "all_href = soup.find_all('a')\n", 83 | "all_href = [link['href'] for link in all_href]\n", 84 | "print('\\n', all_href)" 85 | ] 86 | }, 87 | { 88 | "cell_type": "code", 89 | "execution_count": null, 90 | "metadata": {}, 91 | "outputs": [], 92 | "source": [] 93 | } 94 | ], 95 | "metadata": { 96 | "kernelspec": { 97 | "display_name": "Python 2", 98 | "language": "python", 99 | "name": "python2" 100 | }, 101 | "language_info": { 102 | "codemirror_mode": { 103 | "name": "ipython", 104 | "version": 2 105 | }, 106 | "file_extension": ".py", 107 | "mimetype": "text/x-python", 108 | "name": "python", 109 | "nbconvert_exporter": "python", 110 | "pygments_lexer": "ipython2", 111 | "version": "2.7.6" 112 | } 113 | }, 114 | "nbformat": 4, 115 | "nbformat_minor": 0 116 | } 117 | -------------------------------------------------------------------------------- /notebook/2-2-beautifulsoup-css.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "metadata": {}, 6 | "source": [ 7 | "# Beautifulsoup: find by CSS class\n", 8 | "**First, we again open a [page](https://mofanpy.com/static/scraping/list.html); then we can apply BeautifulSoup to this page's HTML.**" 9 | ] 10 | }, 11 | { 12 | "cell_type":
"code", 13 | "execution_count": 1, 14 | "metadata": { 15 | "collapsed": true 16 | }, 17 | "outputs": [ 18 | { 19 | "name": "stdout", 20 | "output_type": "stream", 21 | "text": [ 22 | "\n\n\n\t\n\t爬虫练习 列表 class | 莫烦 Python\n\t\n\n\n\n\n

列表 爬虫练习

\n\n

这是一个在 莫烦 Python爬虫教程\n\t里无敌简单的网页, 所有的 code 让你一目了然, 清晰无比.

\n\n\n\n\n\n" 23 | ] 24 | } 25 | ], 26 | "source": [ 27 | "from bs4 import BeautifulSoup\n", 28 | "from urllib.request import urlopen\n", 29 | "\n", 30 | "# read() returns bytes; decode as UTF-8 so the Chinese text is readable\n", 31 | "html = urlopen(\"https://mofanpy.com/static/scraping/list.html\").read().decode('utf-8')\n", 32 | "print(html)" 33 | ] 34 | }, 35 | { 36 | "cell_type": "markdown", 37 | "metadata": {}, 38 | "source": [ 39 | "**Parse this HTML with the lxml parser to create a soup object. Find all \"li\" tags that have class=month.**" 40 | ] 41 | }, 42 | { 43 | "cell_type": "code", 44 | "execution_count": 2, 45 | "metadata": { 46 | "collapsed": true 47 | }, 48 | "outputs": [ 49 | { 50 | "name": "stdout", 51 | "output_type": "stream", 52 | "text": [ 53 | "一月\n二月\n三月\n四月\n五月\n" 54 | ] 55 | } 56 | ], 57 | "source": [ 58 | "soup = BeautifulSoup(html, features='lxml')\n", 59 | "\n", 60 | "# use class to narrow search\n", 61 | "month = soup.find_all('li', {\"class\": \"month\"})\n", 62 | "for m in month:\n", 63 | "    print(m.get_text())" 64 | ] 65 | }, 66 | { 67 | "cell_type": "markdown", 68 | "metadata": {}, 69 | "source": [ 70 | "**Or find a parent tag first, then search only inside it. Here we find the \"ul\" tag with class=jan and list its \"li\" children.**" 71 | ] 72 | }, 73 | { 74 | "cell_type": "code", 75 | "execution_count": 3, 76 | "metadata": {}, 77 | "outputs": [ 78 | { 79 | "name": "stdout", 80 | "output_type": "stream", 81 | "text": [ 82 | "一月一号\n一月二号\n一月三号\n" 83 | ] 84 | } 85 | ], 86 | "source": [ 87 | "\n", 88 | "jan = soup.find('ul', {\"class\": 'jan'})\n", 89 | "d_jan = jan.find_all('li') # use jan as a parent\n", 90 | "for d in d_jan:\n", 91 | "    print(d.get_text())\n" 92 | ] 93 | }, 94 | { 95 | "cell_type": "code", 96 | "execution_count": null, 97 | "metadata": {}, 98 | "outputs": [], 99 | "source": [] 100 | } 101 | ], 102 | "metadata": { 103 | "kernelspec": { 104 | "display_name": "Python 2", 105 | "language": "python", 106 | "name": "python2" 107 | }, 108 | "language_info": { 109 | "codemirror_mode": { 110 | "name": "ipython", 111 | "version": 2 112 | }, 113 | "file_extension": ".py", 114 | "mimetype": "text/x-python", 115 | "name": "python", 116 | "nbconvert_exporter": "python", 117 | "pygments_lexer": "ipython2", 118 | "version": "2.7.6" 119 | } 120 | }, 121 | "nbformat": 4, 122 | "nbformat_minor": 0 123 | } 124 | -------------------------------------------------------------------------------- /notebook/2-3-beautifulsoup-regex.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "metadata": {}, 6 | "source": [ 7 | "# Beautifulsoup: find by regular expression\n", 8 | "**First we import re for regex. Then we open a [page](https://mofanpy.com/static/scraping/table.html) and apply BeautifulSoup to this page's HTML.**" 9 | ] 10 | }, 11 | { 12 | "cell_type": "code", 13 | "execution_count": 1, 14 | "metadata": { 15 | "collapsed": true 16 | }, 17 | "outputs": [ 18 | { 19 | "name": "stdout", 20 | "output_type": "stream", 21 | "text": [ 22 | "\n\n\n\t\n\t爬虫练习 表格 table | 莫烦 Python\n\n\t\n\n\n\n

表格 爬虫练习

\n\n

这是一个在 莫烦 Python爬虫教程\n\t里无敌简单的网页, 所有的 code 让你一目了然, 清晰无比.

\n\n
\n\n\t\n\t\t\n\t\n\n\t\n\t\t\n\t\n\n\t\n\t\t\n\t\n\n\t\n\t\t\n\t\n\n
\n\t\t\t分类\n\t\t\n\t\t\t名字\n\t\t\n\t\t\t时长\n\t\t\n\t\t\t预览\n\t\t
\n\t\t\t机器学习\n\t\t\n\t\t\t\n\t\t\t\tTensorflow 神经网络\n\t\t\n\t\t\t2:00\n\t\t\n\t\t\t\n\t\t
\n\t\t\t机器学习\n\t\t\n\t\t\t\n\t\t\t\t强化学习\n\t\t\n\t\t\t5:00\n\t\t\n\t\t\t\n\t\t
\n\t\t\t数据处理\n\t\t\n\t\t\t\n\t\t\t\t爬虫\n\t\t\n\t\t\t3:00\n\t\t\n\t\t\t\n\t\t
\n\n\n\n" 23 | ] 24 | } 25 | ], 26 | "source": [ 27 | "from bs4 import BeautifulSoup\n", 28 | "from urllib.request import urlopen\n", 29 | "import re\n", 30 | "\n", 31 | "# read() returns bytes; decode as UTF-8 so the Chinese text is readable\n", 32 | "html = urlopen(\"https://mofanpy.com/static/scraping/table.html\").read().decode('utf-8')\n", 33 | "print(html)" 34 | ] 35 | }, 36 | { 37 | "cell_type": "markdown", 38 | "metadata": {}, 39 | "source": [ 40 | "**Parse this HTML with the lxml parser to create a soup object. Find all \"img\" tags whose src matches a given pattern.**" 41 | ] 42 | }, 43 | { 44 | "cell_type": "code", 45 | "execution_count": 2, 46 | "metadata": { 47 | "collapsed": true 48 | }, 49 | "outputs": [ 50 | { 51 | "name": "stdout", 52 | "output_type": "stream", 53 | "text": [ 54 | "https://mofanpy.com/static/img/course_cover/tf.jpg\nhttps://mofanpy.com/static/img/course_cover/rl.jpg\nhttps://mofanpy.com/static/img/course_cover/scraping.jpg\n" 55 | ] 56 | } 57 | ], 58 | "source": [ 59 | "soup = BeautifulSoup(html, features='lxml')\n", 60 | "\n", 61 | "img_links = soup.find_all(\"img\", {\"src\": re.compile(r'.*?\\.jpg')})\n", 62 | "for link in img_links:\n", 63 | "    print(link['src'])" 64 | ] 65 | }, 66 | { 67 | "cell_type": "markdown", 68 | "metadata": {}, 69 | "source": [ 70 | "**We can match other attributes the same way. Here we find all \"a\" tags whose href matches a given pattern.**" 71 | ] 72 | }, 73 | { 74 | "cell_type": "code", 75 | "execution_count": 3, 76 | "metadata": {}, 77 | "outputs": [ 78 | { 79 | "name": "stdout", 80 | "output_type": "stream", 81 | "text": [ 82 | "https://mofanpy.com/\nhttps://mofanpy.com/tutorials/scraping\nhttps://mofanpy.com/tutorials/machine-learning/tensorflow/\nhttps://mofanpy.com/tutorials/machine-learning/reinforcement-learning/\nhttps://mofanpy.com/tutorials/data-manipulation/scraping/\n" 83 | ] 84 | } 85 | ], 86 | "source": [ 87 | "course_links = soup.find_all('a', {'href': re.compile('https://mofanpy.*')})\n", 88 | "for link in course_links:\n", 89 | "    print(link['href'])" 90 | ] 91 | }, 92 | { 93 | "cell_type": "code", 94 | "execution_count": null, 95 | "metadata": {}, 96 | "outputs": [], 97 | "source": [] 98 | } 99 | ], 100 | "metadata": { 101 | "kernelspec": { 102 | "display_name": "Python 2", 103 | "language": "python", 104 | "name": "python2" 105 | }, 106 | "language_info": { 107 | "codemirror_mode": { 108 | "name": "ipython", 109 | "version": 2 110 | }, 111 | "file_extension": ".py", 112 | "mimetype": "text/x-python", 113 | "name": "python", 114 | "nbconvert_exporter": "python", 115 | "pygments_lexer": "ipython2", 116 | "version": "2.7.6" 117 | } 118 | }, 119 | "nbformat": 4, 120 | "nbformat_minor": 0 121 | } 122 | -------------------------------------------------------------------------------- /notebook/2-4-practice-baidu-baike.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "metadata": {}, 6 | "source": [ 7 | "# Practice: scrape Baidu Baike\n", 8 | "**Here we build a scraper to crawl Baidu Baike from this [page](https://baike.baidu.com/item/%E7%BD%91%E7%BB%9C%E7%88%AC%E8%99%AB/5162711) onwards.
We keep a history of the pages we have already visited, so we can track the path we took.**" 9 | ] 10 | }, 11 | { 12 | "cell_type": "code", 13 | "execution_count": 2, 14 | "metadata": { 15 | "collapsed": false 16 | }, 17 | "outputs": [], 18 | "source": [ 19 | "from bs4 import BeautifulSoup\n", 20 | "from urllib.request import urlopen\n", 21 | "import re\n", 22 | "import random\n", 23 | "\n", 24 | "\n", 25 | "base_url = \"https://baike.baidu.com\"\n", 26 | "his = [\"/item/%E7%BD%91%E7%BB%9C%E7%88%AC%E8%99%AB/5162711\"]" 27 | ] 28 | }, 29 | { 30 | "cell_type": "markdown", 31 | "metadata": {}, 32 | "source": [ 33 | "**Select the last sub-URL in \"his\", then print the page title and the URL.**" 34 | ] 35 | }, 36 | { 37 | "cell_type": "code", 38 | "execution_count": 4, 39 | "metadata": { 40 | "collapsed": true 41 | }, 42 | "outputs": [ 43 | { 44 | "name": "stdout", 45 | "output_type": "stream", 46 | "text": [ 47 | "网络爬虫 url: /item/%E7%BD%91%E7%BB%9C%E7%88%AC%E8%99%AB/5162711\n" 48 | ] 49 | } 50 | ], 51 | "source": [ 52 | "url = base_url + his[-1]\n", 53 | "\n", 54 | "html = urlopen(url).read().decode('utf-8')\n", 55 | "soup = BeautifulSoup(html, features='lxml')\n", 56 | "print(soup.find('h1').get_text(), ' url: ', his[-1])" 57 | ] 58 | }, 59 | { 60 | "cell_type": "markdown", 61 | "metadata": {}, 62 | "source": [ 63 | "**Find all sub-URLs that point to Baidu Baike item pages, randomly select one of them, and append it to \"his\". If no valid sub-link is found, pop the last URL from \"his\" instead.**" 64 | ] 65 | }, 66 | { 67 | "cell_type": "code", 68 | "execution_count": 5, 69 | "metadata": {}, 70 | "outputs": [ 71 | { 72 | "name": "stdout", 73 | "output_type": "stream", 74 | "text": [ 75 | "['/item/%E7%BD%91%E7%BB%9C%E7%88%AC%E8%99%AB/5162711', '/item/%E4%B8%8B%E8%BD%BD%E8%80%85']\n" 76 | ] 77 | } 78 | ], 79 | "source": [ 80 | "# find valid urls\n", 81 | "sub_urls = soup.find_all(\"a\", {\"target\": \"_blank\", \"href\": re.compile(\"/item/(%.{2})+$\")})\n", 82 | "\n", 83 | "if len(sub_urls) != 0:\n", 84 | "    his.append(random.sample(sub_urls, 1)[0]['href'])\n", 85 | "else:\n", 86 | "    # no valid sub link found\n", 87 | "    his.pop()\n", 88 | "print(his)" 89 | ] 90 | }, 91 | { 92 | "cell_type": "markdown", 93 | "metadata": {}, 94 | "source": [ 95 | "**Put everything together. Run the random walk for 20 iterations and see where we end up.**" 96 | ] 97 | }, 98 | { 99 | "cell_type": "code", 100 | "execution_count": 6, 101 | "metadata": {}, 102 | "outputs": [ 103 | { 104 | "name": "stdout", 105 | "output_type": "stream", 106 | "text": [ 107 | "0 网络爬虫 url: /item/%E7%BD%91%E7%BB%9C%E7%88%AC%E8%99%AB/5162711\n" 108 | ] 109 | }, 110 | { 111 | "name": "stdout", 112 | "output_type": "stream", 113 | "text": [ 114 | "1 路由器 url: /item/%E8%B7%AF%E7%94%B1%E5%99%A8\n" 115 | ] 116 | }, 117 | { 118 | "name": "stdout", 119 | "output_type": "stream", 120 | "text": [ 121 | "2 服务等级 url: /item/%E6%9C%8D%E5%8A%A1%E7%AD%89%E7%BA%A7\n" 122 | ] 123 | }, 124 | { 125 | "name": "stdout", 126 | "output_type": "stream", 127 | "text": [ 128 | "3 呼损率 url: /item/%E5%91%BC%E6%8D%9F%E7%8E%87\n" 129 | ] 130 | }, 131 | { 132 | "name": "stdout", 133 | "output_type": "stream", 134 | "text": [ 135 | "4 服务等级 url: /item/%E6%9C%8D%E5%8A%A1%E7%AD%89%E7%BA%A7\n" 136 | ] 137 | }, 138 | { 139 | "name": "stdout", 140 | "output_type": "stream", 141 | "text": [ 142 | "5 呼损率 url: /item/%E5%91%BC%E6%8D%9F%E7%8E%87\n" 143 | ] 144 | }, 145 | { 146 | "name":
"stdout", 147 | "output_type": "stream", 148 | "text": [ 149 | "6 服务等级 url: /item/%E6%9C%8D%E5%8A%A1%E7%AD%89%E7%BA%A7\n" 150 | ] 151 | }, 152 | { 153 | "name": "stdout", 154 | "output_type": "stream", 155 | "text": [ 156 | "7 呼损率 url: /item/%E5%91%BC%E6%8D%9F%E7%8E%87\n" 157 | ] 158 | }, 159 | { 160 | "name": "stdout", 161 | "output_type": "stream", 162 | "text": [ 163 | "8 服务等级 url: /item/%E6%9C%8D%E5%8A%A1%E7%AD%89%E7%BA%A7\n" 164 | ] 165 | }, 166 | { 167 | "name": "stdout", 168 | "output_type": "stream", 169 | "text": [ 170 | "9 呼损率 url: /item/%E5%91%BC%E6%8D%9F%E7%8E%87\n" 171 | ] 172 | }, 173 | { 174 | "name": "stdout", 175 | "output_type": "stream", 176 | "text": [ 177 | "10 服务等级 url: /item/%E6%9C%8D%E5%8A%A1%E7%AD%89%E7%BA%A7\n" 178 | ] 179 | }, 180 | { 181 | "name": "stdout", 182 | "output_type": "stream", 183 | "text": [ 184 | "11 呼损率 url: /item/%E5%91%BC%E6%8D%9F%E7%8E%87\n" 185 | ] 186 | }, 187 | { 188 | "name": "stdout", 189 | "output_type": "stream", 190 | "text": [ 191 | "12 服务等级 url: /item/%E6%9C%8D%E5%8A%A1%E7%AD%89%E7%BA%A7\n" 192 | ] 193 | }, 194 | { 195 | "name": "stdout", 196 | "output_type": "stream", 197 | "text": [ 198 | "13 呼损率 url: /item/%E5%91%BC%E6%8D%9F%E7%8E%87\n" 199 | ] 200 | }, 201 | { 202 | "name": "stdout", 203 | "output_type": "stream", 204 | "text": [ 205 | "14 服务等级 url: /item/%E6%9C%8D%E5%8A%A1%E7%AD%89%E7%BA%A7\n" 206 | ] 207 | }, 208 | { 209 | "name": "stdout", 210 | "output_type": "stream", 211 | "text": [ 212 | "15 呼损率 url: /item/%E5%91%BC%E6%8D%9F%E7%8E%87\n" 213 | ] 214 | }, 215 | { 216 | "name": "stdout", 217 | "output_type": "stream", 218 | "text": [ 219 | "16 服务等级 url: /item/%E6%9C%8D%E5%8A%A1%E7%AD%89%E7%BA%A7\n" 220 | ] 221 | }, 222 | { 223 | "name": "stdout", 224 | "output_type": "stream", 225 | "text": [ 226 | "17 呼损率 url: /item/%E5%91%BC%E6%8D%9F%E7%8E%87\n" 227 | ] 228 | }, 229 | { 230 | "name": "stdout", 231 | "output_type": "stream", 232 | "text": [ 233 | "18 服务等级 url: 
/item/%E6%9C%8D%E5%8A%A1%E7%AD%89%E7%BA%A7\n" 234 | ] 235 | }, 236 | { 237 | "name": "stdout", 238 | "output_type": "stream", 239 | "text": [ 240 | "19 呼损率 url: /item/%E5%91%BC%E6%8D%9F%E7%8E%87\n" 241 | ] 242 | } 243 | ], 244 | "source": [ 245 | "his = [\"/item/%E7%BD%91%E7%BB%9C%E7%88%AC%E8%99%AB/5162711\"]\n", 246 | "\n", 247 | "for i in range(20):\n", 248 | "    url = base_url + his[-1]\n", 249 | "\n", 250 | "    html = urlopen(url).read().decode('utf-8')\n", 251 | "    soup = BeautifulSoup(html, features='lxml')\n", 252 | "    print(i, soup.find('h1').get_text(), ' url: ', his[-1])\n", 253 | "\n", 254 | "    # find valid urls\n", 255 | "    sub_urls = soup.find_all(\"a\", {\"target\": \"_blank\", \"href\": re.compile(\"/item/(%.{2})+$\")})\n", 256 | "\n", 257 | "    if len(sub_urls) != 0:\n", 258 | "        his.append(random.sample(sub_urls, 1)[0]['href'])\n", 259 | "    else:\n", 260 | "        # no valid sub link found\n", 261 | "        his.pop()" 262 | ] 263 | }, 264 | { 265 | "cell_type": "code", 266 | "execution_count": null, 267 | "metadata": {}, 268 | "outputs": [], 269 | "source": [] 270 | } 271 | ], 272 | "metadata": { 273 | "kernelspec": { 274 | "display_name": "Python 2", 275 | "language": "python", 276 | "name": "python2" 277 | }, 278 | "language_info": { 279 | "codemirror_mode": { 280 | "name": "ipython", 281 | "version": 2 282 | }, 283 | "file_extension": ".py", 284 | "mimetype": "text/x-python", 285 | "name": "python", 286 | "nbconvert_exporter": "python", 287 | "pygments_lexer": "ipython2", 288 | "version": "2.7.6" 289 | } 290 | }, 291 | "nbformat": 4, 292 | "nbformat_minor": 0 293 | } 294 | -------------------------------------------------------------------------------- /notebook/3-1-requests.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "metadata": {}, 6 | "source": [ 7 | "# requests: an alternative to urllib\n", 8 | "**requests offers more features than urlopen.
Use requests.get() in place of urlopen() and pass some parameters to the webpage. The webbrowser module opens the resulting URL in your browser so you can see the result.**" 9 | ] 10 | }, 11 | { 12 | "cell_type": "code", 13 | "execution_count": 2, 14 | "metadata": { 15 | "collapsed": false 16 | }, 17 | "outputs": [ 18 | { 19 | "name": "stdout", 20 | "output_type": "stream", 21 | "text": [ 22 | "http://www.baidu.com/s?wd=%E8%8E%AB%E7%83%A6Python\n" 23 | ] 24 | }, 25 | { 26 | "data": { 27 | "text/plain": [ 28 | "True" 29 | ] 30 | }, 31 | "execution_count": 2, 32 | "metadata": {}, 33 | "output_type": "execute_result" 34 | } 35 | ], 36 | "source": [ 37 | "import requests\n", 38 | "import webbrowser\n", 39 | "param = {\"wd\": \"莫烦Python\"}\n", 40 | "r = requests.get('http://www.baidu.com/s', params=param)\n", 41 | "print(r.url)\n", 42 | "webbrowser.open(r.url)" 43 | ] 44 | }, 45 | { 46 | "cell_type": "markdown", 47 | "metadata": {}, 48 | "source": [ 49 | "## post\n", 50 | "\n", 51 | "**We test the post function on this [page](http://pythonscraping.com/pages/files/form.html). A POST request passes some data to the server, which processes it and sends back a response accordingly.**" 52 | ] 53 | }, 54 | { 55 | "cell_type": "code", 56 | "execution_count": 2, 57 | "metadata": { 58 | "collapsed": true 59 | }, 60 | "outputs": [ 61 | { 62 | "name": "stdout", 63 | "output_type": "stream", 64 | "text": [ 65 | "Hello there, 莫烦 周!\n" 66 | ] 67 | } 68 | ], 69 | "source": [ 70 | "data = {'firstname': '莫烦', 'lastname': '周'}\n", 71 | "r = requests.post('http://pythonscraping.com/files/processing.php', data=data)\n", 72 | "print(r.text)" 73 | ] 74 | }, 75 | { 76 | "cell_type": "markdown", 77 | "metadata": {}, 78 | "source": [ 79 | "## upload image\n", 80 | "\n", 81 | "**We again use the post function, this time to upload an image on this [page](http://pythonscraping.com/files/form2.html).**" 82 | ] 83 | }, 84 | { 85 | "cell_type": "code", 86 | "execution_count": 5, 87 | "metadata": {}, 88 | "outputs": [ 89 | { 90 | "name": "stdout", 91 | "output_type": "stream", 92 | "text": [ 93 | "The file image.png has been uploaded.\n" 94 | ] 95 | } 96 | ], 97 | "source": [ 98 | "file = {'uploadFile': open('./image.png', 'rb')}\n", 99 | "r = requests.post('http://pythonscraping.com/files/processing2.php', files=file)\n", 100 | "print(r.text)" 101 | ] 102 | }, 103 | { 104 | "cell_type": "markdown", 105 | "metadata": {}, 106 | "source": [ 107 | "## login\n", 108 | "\n", 109 | "**Use the post method to log in to a [website](http://pythonscraping.com/pages/cookies/login.html).**" 110 | ] 111 | }, 112 | { 113 | "cell_type": "code", 114 | "execution_count": 4, 115 | "metadata": {}, 116 | "outputs": [ 117 | { 118 | "name": "stdout", 119 | "output_type": "stream", 120 | "text": [ 121 | "{'username': 'Morvan', 'loggedin': '1'}\n" 122 | ] 123 | }, 124 | { 125 | "name": "stdout", 126 | "output_type": "stream", 127 | "text": [ 128 | "Hey Morvan!
Looks like you're still logged into the site!\n" 129 | ] 130 | } 131 | ], 132 | "source": [ 133 | "payload = {'username': 'Morvan', 'password': 'password'}\n", 134 | "r = requests.post('http://pythonscraping.com/pages/cookies/welcome.php', data=payload)\n", 135 | "print(r.cookies.get_dict())\n", 136 | "r = requests.get('http://pythonscraping.com/pages/cookies/profile.php', cookies=r.cookies)\n", 137 | "print(r.text)" 138 | ] 139 | }, 140 | { 141 | "cell_type": "markdown", 142 | "metadata": {}, 143 | "source": [ 144 | "## another general way to login\n", 145 | "\n", 146 | "**Use a Session object instead of calling requests directly. The session keeps track of the cookies for you across requests.**" 147 | ] 148 | }, 149 | { 150 | "cell_type": "code", 151 | "execution_count": 3, 152 | "metadata": {}, 153 | "outputs": [ 154 | { 155 | "name": "stdout", 156 | "output_type": "stream", 157 | "text": [ 158 | "{'username': 'Morvan', 'loggedin': '1'}\n" 159 | ] 160 | }, 161 | { 162 | "name": "stdout", 163 | "output_type": "stream", 164 | "text": [ 165 | "Hey Morvan!
Looks like you're still logged into the site!\n" 166 | ] 167 | } 168 | ], 169 | "source": [ 170 | "session = requests.Session()\n", 171 | "payload = {'username': 'Morvan', 'password': 'password'}\n", 172 | "r = session.post('http://pythonscraping.com/pages/cookies/welcome.php', data=payload)\n", 173 | "print(r.cookies.get_dict())\n", 174 | "r = session.get(\"http://pythonscraping.com/pages/cookies/profile.php\")\n", 175 | "print(r.text)" 176 | ] 177 | }, 178 | { 179 | "cell_type": "code", 180 | "execution_count": null, 181 | "metadata": {}, 182 | "outputs": [], 183 | "source": [] 184 | } 185 | ], 186 | "metadata": { 187 | "kernelspec": { 188 | "display_name": "Python 2", 189 | "language": "python", 190 | "name": "python2" 191 | }, 192 | "language_info": { 193 | "codemirror_mode": { 194 | "name": "ipython", 195 | "version": 2 196 | }, 197 | "file_extension": ".py", 198 | "mimetype": "text/x-python", 199 | "name": "python", 200 | "nbconvert_exporter": "python", 201 | "pygments_lexer": "ipython2", 202 | "version": "2.7.6" 203 | } 204 | }, 205 | "nbformat": 4, 206 | "nbformat_minor": 0 207 | } 208 | -------------------------------------------------------------------------------- /notebook/3-2-download.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "metadata": {}, 6 | "source": [ 7 | "# Download something\n", 8 | "\n", 9 | "**Introduce multiple ways to download something using python.**" 10 | ] 11 | }, 12 | { 13 | "cell_type": "code", 14 | "execution_count": 1, 15 | "metadata": { 16 | "collapsed": false 17 | }, 18 | "outputs": [], 19 | "source": [ 20 | "import os\n", 21 | "os.makedirs('./img/', exist_ok=True)\n", 22 | "\n", 23 | "IMAGE_URL = \"https://mofanpy.com/static/img/description/learning_step_flowchart.png\"" 24 | ] 25 | }, 26 | { 27 | "cell_type": "markdown", 28 | "metadata": {}, 29 | "source": [ 30 | "## urlretrieve\n", 31 | "\n", 32 | "**Download the image url 
using urlretrieve.**" 33 | ] 34 | }, 35 | { 36 | "cell_type": "code", 37 | "execution_count": 2, 38 | "metadata": { 39 | "collapsed": true 40 | }, 41 | "outputs": [ 42 | { 43 | "data": { 44 | "text/plain": [ 45 | "('./img/image1.png', )" 46 | ] 47 | }, 48 | "execution_count": 2, 49 | "metadata": {}, 50 | "output_type": "execute_result" 51 | } 52 | ], 53 | "source": [ 54 | "from urllib.request import urlretrieve\n", 55 | "urlretrieve(IMAGE_URL, './img/image1.png') # whole document" 56 | ] 57 | }, 58 | { 59 | "cell_type": "markdown", 60 | "metadata": {}, 61 | "source": [ 62 | "![image1](/img/image1.png)\n", 63 | "## request download\n", 64 | "\n", 65 | "**Use requests.get to download the whole file at once.**" 66 | ] 67 | }, 68 | { 69 | "cell_type": "code", 70 | "execution_count": 3, 71 | "metadata": {}, 72 | "outputs": [], 73 | "source": [ 74 | "import requests\n", 75 | "r = requests.get(IMAGE_URL)\n", 76 | "with open('./img/image2.png', 'wb') as f:\n", 77 | " f.write(r.content) # whole document" 78 | ] 79 | }, 80 | { 81 | "cell_type": "markdown", 82 | "metadata": {}, 83 | "source": [ 84 | "## download chunk by chunk\n", 85 | "\n", 86 | "**Set stream=True in the get() function. 
This avoids loading the whole file into memory at once.**" 87 | ] 88 | }, 89 | { 90 | "cell_type": "code", 91 | "execution_count": 6, 92 | "metadata": {}, 93 | "outputs": [], 94 | "source": [ 95 | "r = requests.get(IMAGE_URL, stream=True) # stream loading\n", 96 | "\n", 97 | "with open('./img/image3.png', 'wb') as f:\n", 98 | " for chunk in r.iter_content(chunk_size=32):\n", 99 | " f.write(chunk)" 100 | ] 101 | }, 102 | { 103 | "cell_type": "code", 104 | "execution_count": null, 105 | "metadata": {}, 106 | "outputs": [], 107 | "source": [] 108 | } 109 | ], 110 | "metadata": { 111 | "kernelspec": { 112 | "display_name": "Python 2", 113 | "language": "python", 114 | "name": "python2" 115 | }, 116 | "language_info": { 117 | "codemirror_mode": { 118 | "name": "ipython", 119 | "version": 2 120 | }, 121 | "file_extension": ".py", 122 | "mimetype": "text/x-python", 123 | "name": "python", 124 | "nbconvert_exporter": "python", 125 | "pygments_lexer": "ipython2", 126 | "version": "2.7.6" 127 | } 128 | }, 129 | "nbformat": 4, 130 | "nbformat_minor": 0 131 | } 132 | -------------------------------------------------------------------------------- /notebook/3-3-practice-download-images.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "metadata": {}, 6 | "source": [ 7 | "# Practice: download images from the web\n", 8 | "\n", 9 | "**Download amazing pictures from [National Geographic](http://www.nationalgeographic.com.cn/animals/)**" 10 | ] 11 | }, 12 | { 13 | "cell_type": "code", 14 | "execution_count": 1, 15 | "metadata": { 16 | "collapsed": false 17 | }, 18 | "outputs": [], 19 | "source": [ 20 | "from bs4 import BeautifulSoup\n", 21 | "import requests\n", 22 | "\n", 23 | "URL = \"http://www.nationalgeographic.com.cn/animals/\"" 24 | ] 25 | }, 26 | { 27 | "cell_type": "markdown", 28 | "metadata": {}, 29 | "source": [ 30 | "## find the list of image holders" 31 | ] 32 | }, 33 | { 34 | "cell_type": "code", 35 | 
"execution_count": 2, 36 | "metadata": { 37 | "collapsed": true 38 | }, 39 | "outputs": [], 40 | "source": [ 41 | "html = requests.get(URL).text\n", 42 | "soup = BeautifulSoup(html, 'lxml')\n", 43 | "img_ul = soup.find_all('ul', {\"class\": \"img_list\"})" 44 | ] 45 | }, 46 | { 47 | "cell_type": "markdown", 48 | "metadata": {}, 49 | "source": [ 50 | "**Create a folder for these pictures**" 51 | ] 52 | }, 53 | { 54 | "cell_type": "code", 55 | "execution_count": 3, 56 | "metadata": {}, 57 | "outputs": [], 58 | "source": [ 59 | "import os\n", 60 | "\n", 61 | "os.makedirs('./img/', exist_ok=True)" 62 | ] 63 | }, 64 | { 65 | "cell_type": "markdown", 66 | "metadata": {}, 67 | "source": [ 68 | "## download\n", 69 | "\n", 70 | "**Find all picture urls and download them.**" 71 | ] 72 | }, 73 | { 74 | "cell_type": "code", 75 | "execution_count": 4, 76 | "metadata": {}, 77 | "outputs": [ 78 | { 79 | "name": "stdout", 80 | "output_type": "stream", 81 | "text": [ 82 | "Saved 20171227102206573.jpg\n" 83 | ] 84 | }, 85 | { 86 | "name": "stdout", 87 | "output_type": "stream", 88 | "text": [ 89 | "Saved 20171226032741976.jpg\n" 90 | ] 91 | }, 92 | { 93 | "name": "stdout", 94 | "output_type": "stream", 95 | "text": [ 96 | "Saved 20171220114008366.jpg\n" 97 | ] 98 | }, 99 | { 100 | "name": "stdout", 101 | "output_type": "stream", 102 | "text": [ 103 | "Saved 20171219102251229.jpg\n" 104 | ] 105 | }, 106 | { 107 | "name": "stdout", 108 | "output_type": "stream", 109 | "text": [ 110 | "Saved 20171215043409613.jpg\n" 111 | ] 112 | }, 113 | { 114 | "name": "stdout", 115 | "output_type": "stream", 116 | "text": [ 117 | "Saved 20171214020322682.jpg\n" 118 | ] 119 | } 120 | ], 121 | "source": [ 122 | "for ul in img_ul:\n", 123 | " imgs = ul.find_all('img')\n", 124 | " for img in imgs:\n", 125 | " url = img['src']\n", 126 | " r = requests.get(url, stream=True)\n", 127 | " image_name = url.split('/')[-1]\n", 128 | " with open('./img/%s' % image_name, 'wb') as f:\n", 129 | " for chunk in 
r.iter_content(chunk_size=128):\n", 130 | " f.write(chunk)\n", 131 | " print('Saved %s' % image_name)" 132 | ] 133 | }, 134 | { 135 | "cell_type": "markdown", 136 | "metadata": {}, 137 | "source": [ 138 | "![image](/img/20171214020322682.jpg)" 139 | ] 140 | }, 141 | { 142 | "cell_type": "code", 143 | "execution_count": null, 144 | "metadata": {}, 145 | "outputs": [], 146 | "source": [] 147 | } 148 | ], 149 | "metadata": { 150 | "kernelspec": { 151 | "display_name": "Python 2", 152 | "language": "python", 153 | "name": "python2" 154 | }, 155 | "language_info": { 156 | "codemirror_mode": { 157 | "name": "ipython", 158 | "version": 2 159 | }, 160 | "file_extension": ".py", 161 | "mimetype": "text/x-python", 162 | "name": "python", 163 | "nbconvert_exporter": "python", 164 | "pygments_lexer": "ipython2", 165 | "version": "2.7.6" 166 | } 167 | }, 168 | "nbformat": 4, 169 | "nbformat_minor": 0 170 | } 171 | -------------------------------------------------------------------------------- /notebook/4-1-distributed-scraping.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "metadata": {}, 6 | "source": [ 7 | "# Distributed scraping: multiprocessing\n", 8 | "\n", 9 | "**Speed up scraping by distributing the crawling and parsing across processes. I'm going to scrape [my website](https://mofanpy.com/), but served from a local server \"http://127.0.0.1:4000/\" to eliminate differences in download speed. This makes the time measurements more accurate. 
You can use \"https://mofanpy.com/\" instead, because you cannot access \"http://127.0.0.1:4000/\".**\n", 10 | "\n", 11 | "**We are going to scrape all web pages on my website and return the title and url for each page.**" 12 | ] 13 | }, 14 | { 15 | "cell_type": "code", 16 | "execution_count": 1, 17 | "metadata": { 18 | "collapsed": false 19 | }, 20 | "outputs": [], 21 | "source": [ 22 | "import multiprocessing as mp\n", 23 | "import time\n", 24 | "from urllib.request import urlopen, urljoin\n", 25 | "from bs4 import BeautifulSoup\n", 26 | "import re\n", 27 | "\n", 28 | "base_url = \"http://127.0.0.1:4000/\"\n", 29 | "# base_url = 'https://mofanpy.com/'\n", 30 | "\n", 31 | "# DON'T OVER CRAWL THE WEBSITE OR YOU MAY NEVER VISIT AGAIN\n", 32 | "if base_url != \"http://127.0.0.1:4000/\":\n", 33 | " restricted_crawl = True\n", 34 | "else:\n", 35 | " restricted_crawl = False" 36 | ] 37 | }, 38 | { 39 | "cell_type": "markdown", 40 | "metadata": {}, 41 | "source": [ 42 | "**Create a crawl function to open a url in parallel.**" 43 | ] 44 | }, 45 | { 46 | "cell_type": "code", 47 | "execution_count": 1, 48 | "metadata": {}, 49 | "outputs": [], 50 | "source": [ 51 | "def crawl(url):\n", 52 | " response = urlopen(url)\n", 53 | " time.sleep(0.1) # slight delay for downloading\n", 54 | " return response.read().decode()" 55 | ] 56 | }, 57 | { 58 | "cell_type": "markdown", 59 | "metadata": {}, 60 | "source": [ 61 | "**Create a parse function to find all results we need in parallel**" 62 | ] 63 | }, 64 | { 65 | "cell_type": "code", 66 | "execution_count": 3, 67 | "metadata": {}, 68 | "outputs": [], 69 | "source": [ 70 | "def parse(html):\n", 71 | " soup = BeautifulSoup(html, 'lxml')\n", 72 | " urls = soup.find_all('a', {\"href\": re.compile('^/.+?/$')})\n", 73 | " title = soup.find('h1').get_text().strip()\n", 74 | " page_urls = set([urljoin(base_url, url['href']) for url in urls])\n", 75 | " url = soup.find('meta', {'property': \"og:url\"})['content']\n", 76 | " return title, 
page_urls, url" 77 | ] 78 | }, 79 | { 80 | "cell_type": "markdown", 81 | "metadata": {}, 82 | "source": [ 83 | "## Normal way\n", 84 | "\n", 85 | "**Test the speed without multiprocessing. First, keep track of the urls we have already seen and those we haven't in python sets.**" 86 | ] 87 | }, 88 | { 89 | "cell_type": "code", 90 | "execution_count": 6, 91 | "metadata": {}, 92 | "outputs": [ 93 | { 94 | "name": "stdout", 95 | "output_type": "stream", 96 | "text": [ 97 | "\nDistributed Crawling...\n\nDistributed Parsing...\n\nAnalysing...\n1 教程 http://localhost:4000\n\nDistributed Crawling...\n" 98 | ] 99 | }, 100 | { 101 | "name": "stdout", 102 | "output_type": "stream", 103 | "text": [ 104 | "\nDistributed Parsing...\n" 105 | ] 106 | }, 107 | { 108 | "name": "stdout", 109 | "output_type": "stream", 110 | "text": [ 111 | "\nAnalysing...\n2 其他教学系列 http://localhost:4000tutorials/others/\n3 Linux 简易教学 http://localhost:4000tutorials/others/linux-basic/\n4 近期更新 http://localhost:4000recent-posts/\n5 正则表达式 http://localhost:4000tutorials/python-basic/basic/13-10-regular-expression/\n6 从头开始做一个汽车状态分类器1: 分析数据 http://localhost:4000tutorials/machine-learning/ML-practice/build-car-classifier-from-scratch1/\n7 Threading 多线程教程系列 http://localhost:4000tutorials/python-basic/threading/\n8 强化学习 Reinforcement Learning 教程系列 http://localhost:4000tutorials/machine-learning/reinforcement-learning/\n9 从头开始做一个汽车状态分类器2: 搭建模型 http://localhost:4000tutorials/machine-learning/ML-practice/build-car-classifier-from-scratch2/\n10 我的一点点背景资料 (Mofan's Background) http://localhost:4000about/\n11 Numpy & Pandas 教程系列 http://localhost:4000tutorials/data-manipulation/np-pd/\n12 推荐学习顺序 http://localhost:4000learning-steps/\n13 为了更优秀 http://localhost:4000support/\n14 机器学习实践 http://localhost:4000tutorials/machine-learning/ML-practice/\n15 Git 版本管理 教程系列 http://localhost:4000tutorials/others/git/\n16 multiprocessing 多进程教程系列 http://localhost:4000tutorials/python-basic/multiprocessing/\n17 网页爬虫教程系列 
http://localhost:4000tutorials/data-manipulation/scraping/\n18 进化算法 Evolutionary Strategies 教程系列 http://localhost:4000tutorials/machine-learning/evolutionary-algorithm/\n19 Keras 教程系列 http://localhost:4000tutorials/machine-learning/keras/\n20 Python基础 教程系列 http://localhost:4000tutorials/python-basic/\n21 Pytorch 教程系列 http://localhost:4000tutorials/machine-learning/torch/\n22 Tkinter GUI 教程系列 http://localhost:4000tutorials/python-basic/tkinter/\n23 有趣的机器学习系列 http://localhost:4000tutorials/machine-learning/ML-intro/\n24 数据处理教程系列 http://localhost:4000tutorials/data-manipulation/\n25 从头开始做一个机器手臂5 完善测试 http://localhost:4000tutorials/machine-learning/ML-practice/RL-build-arm-from-scratch5/\n26 基础教程系列 http://localhost:4000tutorials/python-basic/basic/\n27 机器学习系列 http://localhost:4000tutorials/machine-learning/\n28 Sklearn 通用机器学习 教程系列 http://localhost:4000tutorials/machine-learning/sklearn/\n29 Tensorflow 教程系列 http://localhost:4000tutorials/machine-learning/tensorflow/\n30 从头开始做一个机器手臂1 搭建结构 http://localhost:4000tutorials/machine-learning/ML-practice/RL-build-arm-from-scratch1/\n31 说吧~ http://localhost:4000discuss/\n32 Matplotlib 画图教程系列 http://localhost:4000tutorials/data-manipulation/plt/\n33 Theano 教程系列 http://localhost:4000tutorials/machine-learning/theano/\n\nDistributed Crawling...\n" 112 | ] 113 | }, 114 | { 115 | "name": "stdout", 116 | "output_type": "stream", 117 | "text": [ 118 | "\nDistributed Parsing...\n" 119 | ] 120 | }, 121 | { 122 | "name": "stdout", 123 | "output_type": "stream", 124 | "text": [ 125 | "\nAnalysing...\n34 Actor Critic (Tensorflow) http://localhost:4000tutorials/machine-learning/reinforcement-learning/6-1-actor-critic/\n35 Numpy 基础运算1 http://localhost:4000tutorials/data-manipulation/np-pd/2-3-np-math1/\n36 读写文件 3 http://localhost:4000tutorials/python-basic/basic/08-3-read-file3/\n37 小例子 http://localhost:4000tutorials/machine-learning/reinforcement-learning/2-1-general-rl/\n38 Why Keras? 
http://localhost:4000tutorials/machine-learning/keras/1-1-why/\n39 什么是 强化学习 (Reinforcement Learning) http://localhost:4000tutorials/machine-learning/reinforcement-learning/1-1-A-RL/\n40 Actor Critic http://localhost:4000tutorials/machine-learning/ML-intro/4-08-AC/\n41 临时修改 (stash) http://localhost:4000tutorials/others/git/4-4-stash/\n42 什么是 DQN http://localhost:4000tutorials/machine-learning/torch/4-05-A-DQN/\n43 messagebox 弹窗 http://localhost:4000tutorials/python-basic/tkinter/2-10-messagebox/\n44 Canvas 画布 http://localhost:4000tutorials/python-basic/tkinter/2-07-canvas/\n45 交叉验证 2 Cross-validation http://localhost:4000tutorials/machine-learning/sklearn/3-3-cross-validation2/\n46 Image 图片 http://localhost:4000tutorials/data-manipulation/plt/3-4-image/\n47 线程锁 Lock http://localhost:4000tutorials/python-basic/threading/6-lock/\n48 什么是 Asynchronous Advantage Actor-Critic (A3C) http://localhost:4000tutorials/machine-learning/reinforcement-learning/6-3-A1-A3C/\n49 神经进化 http://localhost:4000tutorials/machine-learning/evolutionary-algorithm/4-01-neuro-evolution/\n50 Prioritized Experience Replay (DQN) (Tensorflow) http://localhost:4000tutorials/machine-learning/reinforcement-learning/4-6-prioritized-replay/\n51 RNN 循环神经网络 (回归) http://localhost:4000tutorials/machine-learning/torch/4-03-RNN-regression/\n52 Classification 分类学习 http://localhost:4000tutorials/machine-learning/tensorflow/5-01-classifier/\n53 什么是 Policy Gradients http://localhost:4000tutorials/machine-learning/reinforcement-learning/5-1-A-PG/\n54 Sarsa http://localhost:4000tutorials/machine-learning/ML-intro/4-04-sarsa/\n55 LSTM RNN 循环神经网络 (LSTM) http://localhost:4000tutorials/machine-learning/ML-intro/2-4-LSTM/\n56 神经网络在干嘛 http://localhost:4000tutorials/machine-learning/tensorflow/1-3-what-does-NN-do/\n57 为什么用 Matplotlib http://localhost:4000tutorials/data-manipulation/plt/1-1-why/\n58 while 循环 http://localhost:4000tutorials/python-basic/basic/03-1-while/\n59 Legend 图例 
http://localhost:4000tutorials/data-manipulation/plt/2-5-lagend/\n60 基本用法 http://localhost:4000tutorials/data-manipulation/plt/2-1-basic-usage/\n61 什么是 tkinter 窗口 http://localhost:4000tutorials/python-basic/basic/13-07-tkinter/\n62 模块安装 http://localhost:4000tutorials/python-basic/basic/07-1-module-install/\n63 课程要求 http://localhost:4000tutorials/machine-learning/reinforcement-learning/1-2-requirment/\n64 GAN (Generative Adversarial Nets 生成对抗网络) http://localhost:4000tutorials/machine-learning/torch/4-06-GAN/\n65 Pandas 基本介绍 http://localhost:4000tutorials/data-manipulation/np-pd/3-1-pd-intro/\n66 if elif else 判断 http://localhost:4000tutorials/python-basic/basic/04-3-if-elif-else/\n67 自编码 (Autoencoder) http://localhost:4000tutorials/machine-learning/ML-intro/2-5-autoencoder/\n68 什么是自编码 (Autoencoder) http://localhost:4000tutorials/machine-learning/keras/2-6-A-autoencoder/\n69 自编码 Autoencoder (非监督学习) http://localhost:4000tutorials/machine-learning/tensorflow/5-11-autoencoder/\n70 Dueling DQN (Tensorflow) http://localhost:4000tutorials/machine-learning/reinforcement-learning/4-7-dueling-DQN/\n71 for 循环 http://localhost:4000tutorials/python-basic/basic/03-2-for/\n72 Linux 基本指令 mkdir, rmdir 和 rm http://localhost:4000tutorials/others/linux-basic/2-03-basic-command/\n73 Numpy 基础运算2 http://localhost:4000tutorials/data-manipulation/np-pd/2-4-np-math2/\n74 Github 在线代码管理 http://localhost:4000tutorials/others/git/5-1-github/\n75 CNN 卷积神经网络 3 http://localhost:4000tutorials/machine-learning/tensorflow/5-05-CNN3/\n76 Asynchronous Advantage Actor-Critic (A3C) (Tensorflow) http://localhost:4000tutorials/machine-learning/reinforcement-learning/6-3-A3C/\n77 Regressor 回归 http://localhost:4000tutorials/machine-learning/keras/2-1-regressor/\n78 Q-learning 思维决策 http://localhost:4000tutorials/machine-learning/reinforcement-learning/2-3-tabular-q2/\n79 Subplot 分格显示 http://localhost:4000tutorials/data-manipulation/plt/4-2-subplot2/\n80 Pandas plot 出图 
http://localhost:4000tutorials/data-manipulation/np-pd/3-8-pd-plot/\n81 批标准化 (Batch Normalization) http://localhost:4000tutorials/machine-learning/ML-intro/3-08-batch-normalization/\n82 Policy Gradients http://localhost:4000tutorials/machine-learning/ML-intro/4-07-PG/\n83 什么是多线程 Threading http://localhost:4000tutorials/python-basic/threading/1-why/\n84 神经网络 梯度下降 http://localhost:4000tutorials/machine-learning/ML-intro/2-8-gradient-descent/\n85 为什么 Torch 是动态的 http://localhost:4000tutorials/machine-learning/torch/5-01-dynamic/\n86 保存提取 http://localhost:4000tutorials/machine-learning/torch/3-04-save-reload/\n87 添加线程 Thread http://localhost:4000tutorials/python-basic/threading/2-add-thread/\n88 DQN 神经网络 (Tensorflow) http://localhost:4000tutorials/machine-learning/reinforcement-learning/4-2-DQN2/\n89 什么是自编码 (Autoencoder) http://localhost:4000tutorials/machine-learning/torch/4-04-A-autoencoder/\n90 Autoencoder 自编码 http://localhost:4000tutorials/machine-learning/keras/2-6-autoencoder/\n91 元组 列表 http://localhost:4000tutorials/python-basic/basic/11-1-array-list/\n92 Sarsa-lambda http://localhost:4000tutorials/machine-learning/reinforcement-learning/3-3-tabular-sarsa-lambda/\n93 设置坐标轴2 http://localhost:4000tutorials/data-manipulation/plt/2-4-axis2/\n94 函数参数 http://localhost:4000tutorials/python-basic/basic/05-2-def-parameters/\n95 神经网络在做什么 http://localhost:4000tutorials/machine-learning/theano/1-3-NN-job/\n96 GIL 不一定有效率 http://localhost:4000tutorials/python-basic/threading/5-GIL/\n97 Pandas 设置值 http://localhost:4000tutorials/data-manipulation/np-pd/3-3-pd-assign/\n98 Pandas 选择数据 http://localhost:4000tutorials/data-manipulation/np-pd/3-2-pd-indexing/\n99 CNN 卷积神经网络 http://localhost:4000tutorials/machine-learning/torch/4-01-CNN/\n100 Placeholder 传入值 http://localhost:4000tutorials/machine-learning/tensorflow/2-5-placeholde/\n101 Tensorboard 可视化好帮手 1 http://localhost:4000tutorials/machine-learning/tensorflow/4-1-tensorboard1/\n102 科普: 人工神经网络 VS 生物神经网络 
http://localhost:4000tutorials/machine-learning/theano/1-1-A-ANN-and-NN/\n103 回到从前 (checkout 针对单个文件) http://localhost:4000tutorials/others/git/3-2-checkout/\n104 为什么用强化学习 Why? http://localhost:4000tutorials/machine-learning/reinforcement-learning/1-1-why/\n105 Linux 基本指令 touch, cp 和 mv http://localhost:4000tutorials/others/linux-basic/2-02-basic-command/\n106 OpenAI gym 环境库 http://localhost:4000tutorials/machine-learning/reinforcement-learning/4-4-gym/\n107 NEAT 强化学习 http://localhost:4000tutorials/machine-learning/evolutionary-algorithm/4-03-neat-reinforcement-learning/\n108 什么是卷积神经网络 CNN (Convolutional Neural Network) http://localhost:4000tutorials/machine-learning/tensorflow/5-03-A-CNN/\n109 Tensorflow 2017 更新 http://localhost:4000tutorials/machine-learning/tensorflow/5-14-tf2017/\n110 continue & break http://localhost:4000tutorials/python-basic/basic/13-01-continue-break/\n111 multiprocessing 什么是多进程 http://localhost:4000tutorials/python-basic/basic/13-06-multiprocessing/\n112 Saver 保存读取 http://localhost:4000tutorials/machine-learning/tensorflow/5-06-save/\n113 科普: 人工神经网络 VS 生物神经网络 http://localhost:4000tutorials/machine-learning/tensorflow/1-1-A-ANN-and-NN/\n114 什么是 Sarsa http://localhost:4000tutorials/machine-learning/reinforcement-learning/3-1-A-sarsa/\n115 神经网络 (Neural Network) http://localhost:4000tutorials/machine-learning/ML-intro/2-1-NN/\n116 通用学习模式 http://localhost:4000tutorials/machine-learning/sklearn/2-2-general-pattern/\n117 第一个版本库 Repository http://localhost:4000tutorials/others/git/2-1-repository/\n118 Dropout 缓解过拟合 http://localhost:4000tutorials/machine-learning/torch/5-03-dropout/\n119 Sarsa 算法更新 http://localhost:4000tutorials/machine-learning/reinforcement-learning/3-1-tabular-sarsa1/\n120 set 找不同 http://localhost:4000tutorials/python-basic/basic/13-09-set/\n121 多维列表 http://localhost:4000tutorials/python-basic/basic/11-3-multi-dim-list/\n122 检验神经网络 (Evaluation) http://localhost:4000tutorials/machine-learning/sklearn/3-2-A-Evaluate-NN/\n123 3D 数据 
http://localhost:4000tutorials/data-manipulation/plt/3-5-3d/\n124 Regression 回归例子 http://localhost:4000tutorials/machine-learning/theano/3-2-regression/\n125 激励函数 (Activation Function) http://localhost:4000tutorials/machine-learning/ML-intro/3-04-activation-function/\n126 Scatter 散点图 http://localhost:4000tutorials/data-manipulation/plt/3-1-scatter/\n127 什么是神经网络进化 (Neuro-Evolution) http://localhost:4000tutorials/machine-learning/evolutionary-algorithm/4-00-neuro-evolution/\n128 什么是循环神经网络 RNN (Recurrent Neural Network) http://localhost:4000tutorials/machine-learning/tensorflow/5-07-A-RNN/\n129 什么是神经网络 (Neural Network) http://localhost:4000tutorials/machine-learning/tensorflow/1-1-B-NN/\n130 DQN 算法更新 (Tensorflow) http://localhost:4000tutorials/machine-learning/reinforcement-learning/4-1-DQN1/\n131 怎么样从 MacOS 或 Linux 通过 SSH 远程 Linux http://localhost:4000tutorials/others/linux-basic/4-01-ssh-from-linux-or-mac/\n132 什么是神经网络 (Neural Network) http://localhost:4000tutorials/machine-learning/keras/1-1-B-NN/\n133 区分类型 (分类) http://localhost:4000tutorials/machine-learning/torch/3-02-classification/\n134 Optimizer 优化器 http://localhost:4000tutorials/machine-learning/torch/3-06-optimizer/\n135 AlphaGo Zero 为什么更厉害? 
http://localhost:4000tutorials/machine-learning/ML-intro/4-11-AlphaGo-zero/\n136 Deep Deterministic Policy Gradient (DDPG) http://localhost:4000tutorials/machine-learning/ML-intro/4-09-DDPG/\n137 if 判断 http://localhost:4000tutorials/python-basic/basic/04-1-if/\n138 class 类 init 功能 http://localhost:4000tutorials/python-basic/basic/09-2-class-init/\n139 Numpy copy & deep copy http://localhost:4000tutorials/data-manipulation/np-pd/2-8-np-copy/\n140 Tensorboard 可视化好帮手 2 http://localhost:4000tutorials/machine-learning/tensorflow/4-2-tensorboard2/\n141 Frame 框架 http://localhost:4000tutorials/python-basic/tkinter/2-09-frame/\n142 Microbial Genetic Algorithm http://localhost:4000tutorials/machine-learning/evolutionary-algorithm/2-04-microbial-genetic-algorithm/\n143 NEAT 监督学习 http://localhost:4000tutorials/machine-learning/evolutionary-algorithm/4-02-neat-supervised-learning/\n144 安装 http://localhost:4000tutorials/python-basic/basic/01-1-install/\n145 怎么样从 Windows 通过 SSH 远程 Linux http://localhost:4000tutorials/others/linux-basic/4-02-ssh-from-windows/\n146 什么是 Actor Critic http://localhost:4000tutorials/machine-learning/reinforcement-learning/6-1-A-AC/\n147 遗传算法 (Genetic Algorithm) http://localhost:4000tutorials/machine-learning/ML-intro/5-01-genetic-algorithm/\n148 CNN 卷积神经网络 http://localhost:4000tutorials/machine-learning/keras/2-3-CNN/\n149 选择好特征 (Good Features) http://localhost:4000tutorials/machine-learning/ML-intro/3-03-choose-feature/\n150 什么是 Q Leaning http://localhost:4000tutorials/machine-learning/reinforcement-learning/2-2-A-q-learning/\n151 Evolution Strategy 强化学习 http://localhost:4000tutorials/machine-learning/evolutionary-algorithm/4-04-evolution-strategy-reinforcement-learning/\n152 兼容 backend http://localhost:4000tutorials/machine-learning/keras/1-3-backend/\n153 什么是过拟合 (Overfitting) http://localhost:4000tutorials/machine-learning/theano/3-5-A-overfitting/\n154 神经网络 梯度下降 http://localhost:4000tutorials/machine-learning/theano/1-1-C-gradient-descent/\n155 
Numpy 和 Pandas 安装 http://localhost:4000tutorials/data-manipulation/np-pd/1-2-install/\n156 例子 旅行商人问题 (Travel Sales Problem) http://localhost:4000tutorials/machine-learning/evolutionary-algorithm/2-03-genetic-algorithm-travel-sales-problem/\n157 Distributed Proximal Policy Optimization (DPPO) (Tensorflow) http://localhost:4000tutorials/machine-learning/reinforcement-learning/6-4-DPPO/\n158 Pytorch 安装 http://localhost:4000tutorials/machine-learning/torch/1-2-install/\n159 基础数学运算 http://localhost:4000tutorials/python-basic/basic/02-2-basic-math/\n160 激励函数 (Activation) http://localhost:4000tutorials/machine-learning/torch/2-03-activation/\n161 科普: 神经网络的黑盒不黑 http://localhost:4000tutorials/machine-learning/tensorflow/1-1-D-feature-representation/\n162 例子3 结果可视化 http://localhost:4000tutorials/machine-learning/tensorflow/3-3-visualize-result/\n163 Activation function 激励函数 http://localhost:4000tutorials/machine-learning/theano/2-4-activation/\n164 可视化结果 回归例子 http://localhost:4000tutorials/machine-learning/theano/3-3-visualize/\n165 什么是循环神经网络 RNN (Recurrent Neural Network) http://localhost:4000tutorials/machine-learning/torch/4-02-A-RNN/\n166 什么是 Deep Deterministic Policy Gradient (DDPG) http://localhost:4000tutorials/machine-learning/reinforcement-learning/6-2-A-DDPG/\n167 例子 登录窗口1 http://localhost:4000tutorials/python-basic/tkinter/3-01-example1/\n168 怎么样用 TeamViewer 和 VNC 从远程控制电脑 http://localhost:4000tutorials/others/linux-basic/4-04-teamviewer-vnc/\n169 什么是循环神经网络 RNN (Recurrent Neural Network) http://localhost:4000tutorials/machine-learning/keras/2-4-A-RNN/\n170 Pandas 导入导出 http://localhost:4000tutorials/data-manipulation/np-pd/3-5-pd-to/\n171 选择学习方法 http://localhost:4000tutorials/machine-learning/sklearn/2-1-select-method/\n172 class 类 http://localhost:4000tutorials/python-basic/basic/09-1-class/\n173 什么是激励函数 (Activation Function) http://localhost:4000tutorials/machine-learning/theano/2-4-A-activation-function/\n174 特征标准化 (Feature Normalization) 
http://localhost:4000tutorials/machine-learning/ML-intro/3-02-normalization/\n175 Policy Gradients 思维决策 (Tensorflow) http://localhost:4000tutorials/machine-learning/reinforcement-learning/5-2-policy-gradient-softmax2/\n176 RNN LSTM (回归例子可视化) http://localhost:4000tutorials/machine-learning/tensorflow/5-10-RNN4/\n177 关系拟合 (回归) http://localhost:4000tutorials/machine-learning/torch/3-01-regression/\n178 正规化 Normalization http://localhost:4000tutorials/machine-learning/sklearn/3-1-normalization/\n179 GPU 加速运算 http://localhost:4000tutorials/machine-learning/torch/5-02-GPU/\n180 DQN 强化学习 http://localhost:4000tutorials/machine-learning/torch/4-05-DQN/\n181 科普: 神经网络的黑盒不黑 http://localhost:4000tutorials/machine-learning/ML-intro/2-7-feature-representation/\n182 加速神经网络训练 (Speed Up Training) http://localhost:4000tutorials/machine-learning/ML-intro/3-06-speed-up-learning/\n183 交叉验证 1 Cross-validation http://localhost:4000tutorials/machine-learning/sklearn/3-2-cross-validation1/\n184 读写文件 1 http://localhost:4000tutorials/python-basic/basic/08-1-read-file/\n185 回到从前 (reset) http://localhost:4000tutorials/others/git/3-1-reset/\n186 什么是 Tkinter http://localhost:4000tutorials/python-basic/tkinter/1-1-why/\n187 Theano 安装 http://localhost:4000tutorials/machine-learning/theano/1-2-install/\n188 Save 保存 提取 http://localhost:4000tutorials/machine-learning/theano/3-6-save/\n189 sklearn 常用属性与功能 http://localhost:4000tutorials/machine-learning/sklearn/2-4-model-attributes/\n190 变量 (Variable) http://localhost:4000tutorials/machine-learning/torch/2-02-variable/\n191 threading 什么是多线程 http://localhost:4000tutorials/python-basic/basic/13-05-threading/\n192 RNN LSTM 循环神经网络 (分类例子) http://localhost:4000tutorials/machine-learning/tensorflow/5-08-RNN2/\n193 记录修改 (log & diff) http://localhost:4000tutorials/others/git/2-2-modified/\n194 dictionary 字典 http://localhost:4000tutorials/python-basic/basic/11-4-dictionary/\n195 Session 会话控制 
http://localhost:4000tutorials/machine-learning/tensorflow/2-3-session/\n196 神经网络进化 (Neuro-Evolution) http://localhost:4000tutorials/machine-learning/ML-intro/5-03-neuro-evolution/\n197 神经网络 梯度下降 http://localhost:4000tutorials/machine-learning/tensorflow/1-1-C-gradient-descent/\n198 例子3 添加层 def add_layer() http://localhost:4000tutorials/machine-learning/tensorflow/3-1-add-layer/\n199 Scale 尺度 http://localhost:4000tutorials/python-basic/tkinter/2-05-scale/\n200 例子 配对句子 http://localhost:4000tutorials/machine-learning/evolutionary-algorithm/2-02-genetic-algorithm-match-phrase/\n201 rebase 分支冲突 http://localhost:4000tutorials/others/git/4-3-rebase/\n202 Batch Normalization 批标准化 http://localhost:4000tutorials/machine-learning/torch/5-04-batch-normalization/\n203 科普: 神经网络的黑盒不黑 http://localhost:4000tutorials/machine-learning/theano/1-1-D-feature-representation/\n204 自己的云计算, 把 Linux 当成你的云计算平台 http://localhost:4000tutorials/others/linux-basic/5-01-remote-learning/\n205 例子 登录窗口2 http://localhost:4000tutorials/python-basic/tkinter/3-02-example2/\n206 机器学习 (Machine Learning) http://localhost:4000tutorials/machine-learning/sklearn/1-1-A-ML/\n207 自己的模块 http://localhost:4000tutorials/python-basic/basic/12-2-personal-module/\n208 Matplotlib 安装 http://localhost:4000tutorials/data-manipulation/plt/1-2-install/\n209 进化策略 (Evolution Strategy) http://localhost:4000tutorials/machine-learning/ML-intro/5-02-evolution-strategy/\n210 list 列表 http://localhost:4000tutorials/python-basic/basic/11-2-list/\n211 Double DQN (Tensorflow) http://localhost:4000tutorials/machine-learning/reinforcement-learning/4-5-double_DQN/\n212 Numpy 的创建 array http://localhost:4000tutorials/data-manipulation/np-pd/2-2-np-array/\n213 遗传算法 http://localhost:4000tutorials/machine-learning/evolutionary-algorithm/2-01-genetic-algorithm/\n214 Torch 或 Numpy http://localhost:4000tutorials/machine-learning/torch/2-01-torch-numpy/\n215 快速搭建法 http://localhost:4000tutorials/machine-learning/torch/3-03-fast-nn/\n216 例子 登录窗口3 
http://localhost:4000tutorials/python-basic/tkinter/3-03-example3/\n217 科普: 人工神经网络 VS 生物神经网络 http://localhost:4000tutorials/machine-learning/ML-intro/2-0-ANN-and-NN/\n218 zip lambda map http://localhost:4000tutorials/python-basic/basic/13-03-zip-lambda-map/\n219 Batch Normalization 批标准化 http://localhost:4000tutorials/machine-learning/tensorflow/5-13-BN/\n220 pack grid place 放置位置 http://localhost:4000tutorials/python-basic/tkinter/2-11-pack-grid-place/\n221 scope 命名方法 http://localhost:4000tutorials/machine-learning/tensorflow/5-12-scope/\n222 Subplot 多合一显示 http://localhost:4000tutorials/data-manipulation/plt/4-1-subpot1/\n223 L1 / L2 正规化 (Regularization) http://localhost:4000tutorials/machine-learning/ML-intro/3-09-l1l2regularization/\n224 强化学习方法汇总 (Reinforcement Learning) http://localhost:4000tutorials/machine-learning/ML-intro/4-02-RL-methods/\n225 次坐标轴 http://localhost:4000tutorials/data-manipulation/plt/4-4-sencondary-axis/\n226 什么是 LSTM 循环神经网络 http://localhost:4000tutorials/machine-learning/keras/2-4-B-LSTM/\n227 Asynchronous Advantage Actor-Critic (A3C) http://localhost:4000tutorials/machine-learning/ML-intro/4-10-A3C/\n228 强化学习 (Reinforcement Learning) http://localhost:4000tutorials/machine-learning/ML-intro/4-01-RL/\n229 Numpy array 分割 http://localhost:4000tutorials/data-manipulation/np-pd/2-7-np-split/\n230 Numpy array 合并 http://localhost:4000tutorials/data-manipulation/np-pd/2-6-np-concat/\n231 if else 判断 http://localhost:4000tutorials/python-basic/basic/04-2-if-else/\n232 进程锁 Lock http://localhost:4000tutorials/python-basic/multiprocessing/7-lock/\n233 import 模块 http://localhost:4000tutorials/python-basic/basic/12-1-import/\n234 RNN Regressor 循环神经网络 http://localhost:4000tutorials/machine-learning/keras/2-5-RNN-LSTM-Regressor/\n235 sklearn 强大数据库 http://localhost:4000tutorials/machine-learning/sklearn/2-3-database/\n236 Regularization 正规化 http://localhost:4000tutorials/machine-learning/theano/3-5-regularization/\n237 效率对比 threading & multiprocessing 
http://localhost:4000tutorials/python-basic/multiprocessing/4-comparison/\n238 Why Numpy & Pandas? http://localhost:4000tutorials/data-manipulation/np-pd/1-1-why/\n239 共享内存 shared memory http://localhost:4000tutorials/python-basic/multiprocessing/6-shared-memory/\n240 什么是激励函数 (Activation Function) http://localhost:4000tutorials/machine-learning/torch/2-03-A-activation-function/\n241 Menubar 菜单 http://localhost:4000tutorials/python-basic/tkinter/2-08-menubar/\n242 Annotation 标注 http://localhost:4000tutorials/data-manipulation/plt/2-6-annotation/\n243 RNN 循环神经网络 (分类) http://localhost:4000tutorials/machine-learning/torch/4-02-RNN-classification/\n244 什么是进化策略 (Evolution Strategy) http://localhost:4000tutorials/machine-learning/evolutionary-algorithm/3-00-evolution-strategy/\n245 例子3 建造神经网络 http://localhost:4000tutorials/machine-learning/tensorflow/3-2-create-NN/\n246 (1+1)-ES http://localhost:4000tutorials/machine-learning/evolutionary-algorithm/3-02-evolution-strategy-1+1-ES/\n247 保存模型 http://localhost:4000tutorials/machine-learning/sklearn/3-5-save/\n248 进化算法 简介 http://localhost:4000tutorials/machine-learning/evolutionary-algorithm/1-01-intro/\n249 Sarsa 思维决策 http://localhost:4000tutorials/machine-learning/reinforcement-learning/3-2-tabular-sarsa2/\n250 Dropout 解决 overfitting http://localhost:4000tutorials/machine-learning/tensorflow/5-02-dropout/\n251 什么是激励函数 (Activation Function) http://localhost:4000tutorials/machine-learning/tensorflow/2-6-A-activation-function/\n252 进程池 Pool http://localhost:4000tutorials/python-basic/multiprocessing/5-pool/\n253 Asyncio http://localhost:4000tutorials/data-manipulation/scraping/asyncio/\n254 添加进程 Process http://localhost:4000tutorials/python-basic/multiprocessing/2-add/\n255 存储进程输出 Queue http://localhost:4000tutorials/python-basic/multiprocessing/3-queue/\n256 过拟合 (Overfitting) http://localhost:4000tutorials/machine-learning/ML-intro/3-05-overfitting/\n257 图中图 
http://localhost:4000tutorials/data-manipulation/plt/4-3-plot-in-plot/\n258 Save & reload 保存提取 http://localhost:4000tutorials/machine-learning/keras/3-1-save/\n259 Checkbutton 勾选项 http://localhost:4000tutorials/python-basic/tkinter/2-06-checkbutton/\n260 分支 (branch) http://localhost:4000tutorials/others/git/4-1-branch/\n261 Q-learning 算法更新 http://localhost:4000tutorials/machine-learning/reinforcement-learning/2-2-tabular-q1/\n262 变量 variable http://localhost:4000tutorials/python-basic/basic/02-3-variable/\n263 神经网络 梯度下降 http://localhost:4000tutorials/machine-learning/torch/1-1-C-gradient-descent/\n264 生成对抗网络 (GAN) http://localhost:4000tutorials/machine-learning/ML-intro/2-6-GAN/\n265 什么是 Sarsa(lambda) http://localhost:4000tutorials/machine-learning/reinforcement-learning/3-3-A-sarsa-lambda/\n266 Pandas 合并 merge http://localhost:4000tutorials/data-manipulation/np-pd/3-7-pd-merge/\n267 遗传算法 (Genetic Algorithm) http://localhost:4000tutorials/machine-learning/evolutionary-algorithm/2-00-genetic-algorithm/\n268 科普: 神经网络的黑盒不黑 http://localhost:4000tutorials/machine-learning/keras/1-1-D-feature-representation/\n269 什么是批标准化 (Batch Normalization) http://localhost:4000tutorials/machine-learning/torch/5-04-A-batch-normalization/\n270 自己的云计算, 多电脑共享你云端文件 http://localhost:4000tutorials/others/linux-basic/5-02-share-folder/\n271 从头开始做一个机器手臂3 写动态环境 http://localhost:4000tutorials/machine-learning/ML-practice/RL-build-arm-from-scratch3/\n272 定义 Layer 类 http://localhost:4000tutorials/machine-learning/theano/3-1-layer/\n273 什么是 LSTM 循环神经网络 http://localhost:4000tutorials/machine-learning/torch/4-02-B-LSTM/\n274 pickle 保存数据 http://localhost:4000tutorials/python-basic/basic/13-08-pickle/\n275 Classifier 分类 http://localhost:4000tutorials/machine-learning/keras/2-2-classifier/\n276 CNN 卷积神经网络 1 http://localhost:4000tutorials/machine-learning/tensorflow/5-03-CNN1/\n277 科普: 神经网络的黑盒不黑 http://localhost:4000tutorials/machine-learning/torch/1-1-D-feature-representation/\n278 什么是生成对抗网络 (GAN) 
http://localhost:4000tutorials/machine-learning/torch/4-06-A-GAN/\n279 强化学习方法汇总 (Reinforcement Learning) http://localhost:4000tutorials/machine-learning/reinforcement-learning/1-1-B-RL-methods/\n280 Keras 安装 http://localhost:4000tutorials/machine-learning/keras/1-2-install/\n281 Entry & Text 输入, 文本框 http://localhost:4000tutorials/python-basic/tkinter/2-02-entry-text/\n282 RNN 循环神经网络 http://localhost:4000tutorials/machine-learning/tensorflow/5-07-RNN1/\n283 Policy Gradients 算法更新 (Tensorflow) http://localhost:4000tutorials/machine-learning/reinforcement-learning/5-1-policy-gradient-softmax1/\n284 Function 用法 http://localhost:4000tutorials/machine-learning/theano/2-2-function/\n285 优化器 optimizer http://localhost:4000tutorials/machine-learning/tensorflow/3-4-optimizer/\n286 Pandas 合并 concat http://localhost:4000tutorials/data-manipulation/np-pd/3-6-pd-concat/\n287 Tensorflow 安装 http://localhost:4000tutorials/machine-learning/tensorflow/1-2-install/\n288 从头开始做一个机器手臂4 加入强化学习算法 http://localhost:4000tutorials/machine-learning/ML-practice/RL-build-arm-from-scratch4/\n289 Deep Deterministic Policy Gradient (DDPG) (Tensorflow) http://localhost:4000tutorials/machine-learning/reinforcement-learning/6-2-DDPG/\n290 什么是卷积神经网络 CNN (Convolutional Neural Network) http://localhost:4000tutorials/machine-learning/keras/2-3-A-CNN/\n291 什么是批标准化 (Batch Normalization) http://localhost:4000tutorials/machine-learning/tensorflow/5-13-A-batch-normalization/\n292 什么是过拟合 (Overfitting) http://localhost:4000tutorials/machine-learning/tensorflow/5-02-A-overfitting/\n293 函数默认参数 http://localhost:4000tutorials/python-basic/basic/05-3-def-default-parameters/\n294 Linux 文件权限 http://localhost:4000tutorials/others/linux-basic/3-01-file-permissions/\n295 Animation 动画 http://localhost:4000tutorials/data-manipulation/plt/5-1-animation/\n296 全局 & 局部 变量 http://localhost:4000tutorials/python-basic/basic/06-1-global-local/\n297 从头开始做一个机器手臂2 写静态环境 
http://localhost:4000tutorials/machine-learning/ML-practice/RL-build-arm-from-scratch2/\n298 Natural Evolution Strategy http://localhost:4000tutorials/machine-learning/evolutionary-algorithm/3-03-evolution-strategy-natural-evolution-strategy/\n299 Pandas 处理丢失数据 http://localhost:4000tutorials/data-manipulation/np-pd/3-4-pd-nan/\n300 什么是神经网络 (Neural Network) http://localhost:4000tutorials/machine-learning/theano/1-1-B-NN/\n301 Numpy 索引 http://localhost:4000tutorials/data-manipulation/np-pd/2-5-np-indexing/\n302 Radiobutton 选择按钮 http://localhost:4000tutorials/python-basic/tkinter/2-04-radiobutton/\n303 激励函数 Activation Function http://localhost:4000tutorials/machine-learning/tensorflow/2-6-activation/\n304 Why Theano? http://localhost:4000tutorials/machine-learning/theano/1-1-why/\n305 Classification 分类学习 http://localhost:4000tutorials/machine-learning/theano/3-4-classification/\n306 了解网页结构 http://localhost:4000tutorials/data-manipulation/scraping/1-01-understand-website/\n307 Variable 变量 http://localhost:4000tutorials/machine-learning/tensorflow/2-4-variable/\n308 加速神经网络训练 (Speed Up Training) http://localhost:4000tutorials/machine-learning/tensorflow/3-4-A-speed-up-learning/\n309 Linux 基本指令 nano 和 cat http://localhost:4000tutorials/others/linux-basic/2-04-basic-command/\n310 处理结构 http://localhost:4000tutorials/machine-learning/tensorflow/2-1-structure/\n311 循环神经网络 RNN (Recurrent Neural Network) http://localhost:4000tutorials/machine-learning/ML-intro/2-3-RNN/\n312 Bar 柱状图 http://localhost:4000tutorials/data-manipulation/plt/3-2-Bar/\n313 Shared 变量 http://localhost:4000tutorials/machine-learning/theano/2-3-shared-variable/\n314 总结和更多 http://localhost:4000tutorials/machine-learning/theano/4-1-summary/\n315 print 功能 http://localhost:4000tutorials/python-basic/basic/02-1-print/\n316 merge 分支冲突 http://localhost:4000tutorials/others/git/4-2-merge-conflict/\n317 join 功能 http://localhost:4000tutorials/python-basic/threading/3-join/\n318 Why Git? 
http://localhost:4000tutorials/others/git/1-1-why/\n319 RNN Classifier 循环神经网络 http://localhost:4000tutorials/machine-learning/keras/2-4-RNN-classifier/\n320 def 函数 http://localhost:4000tutorials/python-basic/basic/05-1-def/\n321 安装 Ubuntu 17.10 http://localhost:4000tutorials/others/linux-basic/1-2-install/\n322 Linux 基本指令 ls 和 cd http://localhost:4000tutorials/others/linux-basic/2-01-basic-command/\n323 Sklearn 安装 http://localhost:4000tutorials/machine-learning/sklearn/1-2-install/\n324 AutoEncoder (自编码/非监督学习) http://localhost:4000tutorials/machine-learning/torch/4-04-autoencoder/\n325 DQN http://localhost:4000tutorials/machine-learning/ML-intro/4-06-DQN/\n326 为什么用 Numpy 还是慢, 你用对了吗? http://localhost:4000tutorials/data-manipulation/np-pd/4-1-speed-up-numpy/\n327 什么是过拟合 (Overfitting) http://localhost:4000tutorials/machine-learning/torch/5-03-A-overfitting/\n328 机器学习 (Machine Learning) http://localhost:4000tutorials/machine-learning/ML-intro/1-1-machine-learning/\n329 Sarsa(lambda) http://localhost:4000tutorials/machine-learning/ML-intro/4-05-sarsa-lambda/\n330 卷积神经网络 CNN (Convolutional Neural Network) http://localhost:4000tutorials/machine-learning/ML-intro/2-2-CNN/\n331 交叉验证 3 Cross-validation http://localhost:4000tutorials/machine-learning/sklearn/3-4-cross-validation3/\n332 例子2 http://localhost:4000tutorials/machine-learning/tensorflow/2-2-example2/\n333 什么是 DQN http://localhost:4000tutorials/machine-learning/reinforcement-learning/4-1-A-DQN/\n334 为什么选 Tensorflow? http://localhost:4000tutorials/machine-learning/tensorflow/1-1-why/\n335 科普: 人工神经网络 VS 生物神经网络 http://localhost:4000tutorials/machine-learning/keras/1-1-A-ANN-and-NN/\n336 设置坐标轴1 http://localhost:4000tutorials/data-manipulation/plt/2-3-axis1/\n337 Why Sklearn? 
http://localhost:4000tutorials/machine-learning/sklearn/1-1-why/\n338 Q Leaning http://localhost:4000tutorials/machine-learning/ML-intro/4-03-q-learning/\n339 进化策略 http://localhost:4000tutorials/machine-learning/evolutionary-algorithm/3-01-evolution-strategy/\n340 神经网络 梯度下降 http://localhost:4000tutorials/machine-learning/keras/1-1-C-gradient-descent/\n341 什么是 Multiprocessing http://localhost:4000tutorials/python-basic/multiprocessing/1-why/\n342 Listbox 列表部件 http://localhost:4000tutorials/python-basic/tkinter/2-03-listbox/\n343 Numpy 属性 http://localhost:4000tutorials/data-manipulation/np-pd/2-1-np-attributes/\n344 科普: 人工神经网络 VS 生物神经网络 http://localhost:4000tutorials/machine-learning/torch/1-1-A-ANN-and-NN/\n345 RNN LSTM (回归例子) http://localhost:4000tutorials/machine-learning/tensorflow/5-09-RNN3/\n346 CNN 卷积神经网络 2 http://localhost:4000tutorials/machine-learning/tensorflow/5-04-CNN2/\n347 figure 图像 http://localhost:4000tutorials/data-manipulation/plt/2-2-figure/\n348 用 Tensorflow 可视化梯度下降 http://localhost:4000tutorials/machine-learning/tensorflow/5-15-tf-gradient-descent/\n349 储存进程结果 Queue http://localhost:4000tutorials/python-basic/threading/4-queue/\n350 Git 安装 http://localhost:4000tutorials/others/git/1-2-install/\n351 tick 能见度 http://localhost:4000tutorials/data-manipulation/plt/2-7-tick-visibility/\n352 给你的 Ubuntu 安装软件 http://localhost:4000tutorials/others/linux-basic/1-4-install-software/\n353 基本用法 http://localhost:4000tutorials/machine-learning/theano/2-1-basic-usage/\n354 加速神经网络训练 (Speed Up Training) http://localhost:4000tutorials/machine-learning/torch/3-06-A-speed-up-learning/\n355 Contours 等高线图 http://localhost:4000tutorials/data-manipulation/plt/3-3-contours/\n356 处理不均衡数据 (Imbalanced data) http://localhost:4000tutorials/machine-learning/ML-intro/3-07-imbalanced-data/\n357 批训练 http://localhost:4000tutorials/machine-learning/torch/3-05-train-on-batch/\n358 DQN 思维决策 (Tensorflow) 
http://localhost:4000tutorials/machine-learning/reinforcement-learning/4-3-DQN3/\n359 快速了解 Ubuntu 17.10 基本界面 http://localhost:4000tutorials/others/linux-basic/1-3-system-overview/\n360 Why Pytorch? http://localhost:4000tutorials/machine-learning/torch/1-1-why/\n361 什么是卷积神经网络 CNN (Convolutional Neural Network) http://localhost:4000tutorials/machine-learning/torch/4-01-A-CNN/\n362 Label & Button 标签和按钮 http://localhost:4000tutorials/python-basic/tkinter/2-01-label-button/\n363 读写文件 2 http://localhost:4000tutorials/python-basic/basic/08-2-read-file2/\n364 什么是神经网络 (Neural Network) http://localhost:4000tutorials/machine-learning/torch/1-1-B-NN/\n365 什么是自编码 (Autoencoder) http://localhost:4000tutorials/machine-learning/tensorflow/5-11-A-autoencoder/\n366 Why Linux? http://localhost:4000tutorials/others/linux-basic/1-1-why/\n367 怎么样从手机 (Android安卓/IOS苹果) 通过 SSH 远程 Linux http://localhost:4000tutorials/others/linux-basic/4-03-ssh-from-phone/\n368 copy & deepcopy 浅复制 & 深复制 http://localhost:4000tutorials/python-basic/basic/13-04-copy/\n369 检验神经网络 (Evaluation) http://localhost:4000tutorials/machine-learning/ML-intro/3-01-Evaluate-NN/\n370 input 输入 http://localhost:4000tutorials/python-basic/basic/10-1-input/\n371 什么是 LSTM 循环神经网络 http://localhost:4000tutorials/machine-learning/tensorflow/5-07-B-LSTM/\n372 try 错误处理 http://localhost:4000tutorials/python-basic/basic/13-02-try/\nTotal time: 52.3 s\n" 126 | ] 127 | } 128 | ], 129 | "source": [ 130 | "unseen = set([base_url,])\n", 131 | "seen = set()\n", 132 | "\n", 133 | "count, t1 = 1, time.time()\n", 134 | "\n", 135 | "while len(unseen) != 0: # still get some url to visit\n", 136 | " if restricted_crawl and len(seen) > 20:\n", 137 | " break\n", 138 | " \n", 139 | " print('\\nDistributed Crawling...')\n", 140 | " htmls = [crawl(url) for url in unseen]\n", 141 | "\n", 142 | " print('\\nDistributed Parsing...')\n", 143 | " results = [parse(html) for html in htmls]\n", 144 | "\n", 145 | " print('\\nAnalysing...')\n", 146 | " 
seen.update(unseen) # mark the crawled URLs as seen\n", 147 | " unseen.clear() # reset the unseen set\n", 148 | "\n", 149 | " for title, page_urls, url in results:\n", 150 | " print(count, title, url)\n", 151 | " count += 1\n", 152 | " unseen.update(page_urls - seen) # queue newly found URLs to crawl\n", 153 | "print('Total time: %.1f s' % (time.time()-t1, )) # 53 s" 154 | ] 155 | }, 156 | { 157 | "cell_type": "markdown", 158 | "metadata": {}, 159 | "source": [ 160 | "## multiprocessing\n", 161 | "**Create a process pool and scrape in parallel.**" 162 | ] 163 | }, 164 | { 165 | "cell_type": "code", 166 | "execution_count": 7, 167 | "metadata": { 168 | "collapsed": true 169 | }, 170 | "outputs": [ 171 | { 172 | "name": "stdout", 173 | "output_type": "stream", 174 | "text": [ 175 | "\nDistributed Crawling...\n\nDistributed Parsing...\n\nAnalysing...\n1 教程 http://localhost:4000\n\nDistributed Crawling...\n" 176 | ] 177 | }, 178 | { 179 | "name": "stdout", 180 | "output_type": "stream", 181 | "text": [ 182 | "\nDistributed Parsing...\n" 183 | ] 184 | }, 185 | { 186 | "name": "stdout", 187 | "output_type": "stream", 188 | "text": [ 189 | "\nAnalysing...\n2 其他教学系列 http://localhost:4000tutorials/others/\n3 Linux 简易教学 http://localhost:4000tutorials/others/linux-basic/\n4 近期更新 http://localhost:4000recent-posts/\n5 正则表达式 http://localhost:4000tutorials/python-basic/basic/13-10-regular-expression/\n6 从头开始做一个汽车状态分类器1: 分析数据 http://localhost:4000tutorials/machine-learning/ML-practice/build-car-classifier-from-scratch1/\n7 Threading 多线程教程系列 http://localhost:4000tutorials/python-basic/threading/\n8 强化学习 Reinforcement Learning 教程系列 http://localhost:4000tutorials/machine-learning/reinforcement-learning/\n9 从头开始做一个汽车状态分类器2: 搭建模型 http://localhost:4000tutorials/machine-learning/ML-practice/build-car-classifier-from-scratch2/\n10 我的一点点背景资料 (Mofan's Background) http://localhost:4000about/\n11 Numpy & Pandas 教程系列 http://localhost:4000tutorials/data-manipulation/np-pd/\n12 推荐学习顺序 http://localhost:4000learning-steps/\n13 
为了更优秀 http://localhost:4000support/\n14 机器学习实践 http://localhost:4000tutorials/machine-learning/ML-practice/\n15 Git 版本管理 教程系列 http://localhost:4000tutorials/others/git/\n16 multiprocessing 多进程教程系列 http://localhost:4000tutorials/python-basic/multiprocessing/\n17 网页爬虫教程系列 http://localhost:4000tutorials/data-manipulation/scraping/\n18 进化算法 Evolutionary Strategies 教程系列 http://localhost:4000tutorials/machine-learning/evolutionary-algorithm/\n19 Keras 教程系列 http://localhost:4000tutorials/machine-learning/keras/\n20 Python基础 教程系列 http://localhost:4000tutorials/python-basic/\n21 Pytorch 教程系列 http://localhost:4000tutorials/machine-learning/torch/\n22 Tkinter GUI 教程系列 http://localhost:4000tutorials/python-basic/tkinter/\n23 有趣的机器学习系列 http://localhost:4000tutorials/machine-learning/ML-intro/\n24 数据处理教程系列 http://localhost:4000tutorials/data-manipulation/\n25 从头开始做一个机器手臂5 完善测试 http://localhost:4000tutorials/machine-learning/ML-practice/RL-build-arm-from-scratch5/\n26 基础教程系列 http://localhost:4000tutorials/python-basic/basic/\n27 机器学习系列 http://localhost:4000tutorials/machine-learning/\n28 Sklearn 通用机器学习 教程系列 http://localhost:4000tutorials/machine-learning/sklearn/\n29 Tensorflow 教程系列 http://localhost:4000tutorials/machine-learning/tensorflow/\n30 从头开始做一个机器手臂1 搭建结构 http://localhost:4000tutorials/machine-learning/ML-practice/RL-build-arm-from-scratch1/\n31 说吧~ http://localhost:4000discuss/\n32 Matplotlib 画图教程系列 http://localhost:4000tutorials/data-manipulation/plt/\n33 Theano 教程系列 http://localhost:4000tutorials/machine-learning/theano/\n\nDistributed Crawling...\n" 190 | ] 191 | }, 192 | { 193 | "name": "stdout", 194 | "output_type": "stream", 195 | "text": [ 196 | "\nDistributed Parsing...\n" 197 | ] 198 | }, 199 | { 200 | "name": "stdout", 201 | "output_type": "stream", 202 | "text": [ 203 | "\nAnalysing...\n34 Actor Critic (Tensorflow) http://localhost:4000tutorials/machine-learning/reinforcement-learning/6-1-actor-critic/\n35 Numpy 基础运算1 
http://localhost:4000tutorials/data-manipulation/np-pd/2-3-np-math1/\n36 读写文件 3 http://localhost:4000tutorials/python-basic/basic/08-3-read-file3/\n37 小例子 http://localhost:4000tutorials/machine-learning/reinforcement-learning/2-1-general-rl/\n38 Why Keras? http://localhost:4000tutorials/machine-learning/keras/1-1-why/\n39 什么是 强化学习 (Reinforcement Learning) http://localhost:4000tutorials/machine-learning/reinforcement-learning/1-1-A-RL/\n40 Actor Critic http://localhost:4000tutorials/machine-learning/ML-intro/4-08-AC/\n41 临时修改 (stash) http://localhost:4000tutorials/others/git/4-4-stash/\n42 什么是 DQN http://localhost:4000tutorials/machine-learning/torch/4-05-A-DQN/\n43 messagebox 弹窗 http://localhost:4000tutorials/python-basic/tkinter/2-10-messagebox/\n44 Canvas 画布 http://localhost:4000tutorials/python-basic/tkinter/2-07-canvas/\n45 交叉验证 2 Cross-validation http://localhost:4000tutorials/machine-learning/sklearn/3-3-cross-validation2/\n46 Image 图片 http://localhost:4000tutorials/data-manipulation/plt/3-4-image/\n47 线程锁 Lock http://localhost:4000tutorials/python-basic/threading/6-lock/\n48 什么是 Asynchronous Advantage Actor-Critic (A3C) http://localhost:4000tutorials/machine-learning/reinforcement-learning/6-3-A1-A3C/\n49 神经进化 http://localhost:4000tutorials/machine-learning/evolutionary-algorithm/4-01-neuro-evolution/\n50 Prioritized Experience Replay (DQN) (Tensorflow) http://localhost:4000tutorials/machine-learning/reinforcement-learning/4-6-prioritized-replay/\n51 RNN 循环神经网络 (回归) http://localhost:4000tutorials/machine-learning/torch/4-03-RNN-regression/\n52 Classification 分类学习 http://localhost:4000tutorials/machine-learning/tensorflow/5-01-classifier/\n53 什么是 Policy Gradients http://localhost:4000tutorials/machine-learning/reinforcement-learning/5-1-A-PG/\n54 Sarsa http://localhost:4000tutorials/machine-learning/ML-intro/4-04-sarsa/\n55 LSTM RNN 循环神经网络 (LSTM) http://localhost:4000tutorials/machine-learning/ML-intro/2-4-LSTM/\n56 神经网络在干嘛 
http://localhost:4000tutorials/machine-learning/tensorflow/1-3-what-does-NN-do/\n57 为什么用 Matplotlib http://localhost:4000tutorials/data-manipulation/plt/1-1-why/\n58 while 循环 http://localhost:4000tutorials/python-basic/basic/03-1-while/\n59 Legend 图例 http://localhost:4000tutorials/data-manipulation/plt/2-5-lagend/\n60 基本用法 http://localhost:4000tutorials/data-manipulation/plt/2-1-basic-usage/\n61 什么是 tkinter 窗口 http://localhost:4000tutorials/python-basic/basic/13-07-tkinter/\n62 模块安装 http://localhost:4000tutorials/python-basic/basic/07-1-module-install/\n63 课程要求 http://localhost:4000tutorials/machine-learning/reinforcement-learning/1-2-requirment/\n64 GAN (Generative Adversarial Nets 生成对抗网络) http://localhost:4000tutorials/machine-learning/torch/4-06-GAN/\n65 Pandas 基本介绍 http://localhost:4000tutorials/data-manipulation/np-pd/3-1-pd-intro/\n66 if elif else 判断 http://localhost:4000tutorials/python-basic/basic/04-3-if-elif-else/\n67 自编码 (Autoencoder) http://localhost:4000tutorials/machine-learning/ML-intro/2-5-autoencoder/\n68 什么是自编码 (Autoencoder) http://localhost:4000tutorials/machine-learning/keras/2-6-A-autoencoder/\n69 自编码 Autoencoder (非监督学习) http://localhost:4000tutorials/machine-learning/tensorflow/5-11-autoencoder/\n70 Dueling DQN (Tensorflow) http://localhost:4000tutorials/machine-learning/reinforcement-learning/4-7-dueling-DQN/\n71 for 循环 http://localhost:4000tutorials/python-basic/basic/03-2-for/\n72 Linux 基本指令 mkdir, rmdir 和 rm http://localhost:4000tutorials/others/linux-basic/2-03-basic-command/\n73 Numpy 基础运算2 http://localhost:4000tutorials/data-manipulation/np-pd/2-4-np-math2/\n74 Github 在线代码管理 http://localhost:4000tutorials/others/git/5-1-github/\n75 CNN 卷积神经网络 3 http://localhost:4000tutorials/machine-learning/tensorflow/5-05-CNN3/\n76 Asynchronous Advantage Actor-Critic (A3C) (Tensorflow) http://localhost:4000tutorials/machine-learning/reinforcement-learning/6-3-A3C/\n77 Regressor 回归 
http://localhost:4000tutorials/machine-learning/keras/2-1-regressor/\n78 Q-learning 思维决策 http://localhost:4000tutorials/machine-learning/reinforcement-learning/2-3-tabular-q2/\n79 Subplot 分格显示 http://localhost:4000tutorials/data-manipulation/plt/4-2-subplot2/\n80 Pandas plot 出图 http://localhost:4000tutorials/data-manipulation/np-pd/3-8-pd-plot/\n81 批标准化 (Batch Normalization) http://localhost:4000tutorials/machine-learning/ML-intro/3-08-batch-normalization/\n82 Policy Gradients http://localhost:4000tutorials/machine-learning/ML-intro/4-07-PG/\n83 什么是多线程 Threading http://localhost:4000tutorials/python-basic/threading/1-why/\n84 神经网络 梯度下降 http://localhost:4000tutorials/machine-learning/ML-intro/2-8-gradient-descent/\n85 为什么 Torch 是动态的 http://localhost:4000tutorials/machine-learning/torch/5-01-dynamic/\n86 保存提取 http://localhost:4000tutorials/machine-learning/torch/3-04-save-reload/\n87 添加线程 Thread http://localhost:4000tutorials/python-basic/threading/2-add-thread/\n88 DQN 神经网络 (Tensorflow) http://localhost:4000tutorials/machine-learning/reinforcement-learning/4-2-DQN2/\n89 什么是自编码 (Autoencoder) http://localhost:4000tutorials/machine-learning/torch/4-04-A-autoencoder/\n90 Autoencoder 自编码 http://localhost:4000tutorials/machine-learning/keras/2-6-autoencoder/\n91 元组 列表 http://localhost:4000tutorials/python-basic/basic/11-1-array-list/\n92 Sarsa-lambda http://localhost:4000tutorials/machine-learning/reinforcement-learning/3-3-tabular-sarsa-lambda/\n93 设置坐标轴2 http://localhost:4000tutorials/data-manipulation/plt/2-4-axis2/\n94 函数参数 http://localhost:4000tutorials/python-basic/basic/05-2-def-parameters/\n95 神经网络在做什么 http://localhost:4000tutorials/machine-learning/theano/1-3-NN-job/\n96 GIL 不一定有效率 http://localhost:4000tutorials/python-basic/threading/5-GIL/\n97 Pandas 设置值 http://localhost:4000tutorials/data-manipulation/np-pd/3-3-pd-assign/\n98 Pandas 选择数据 http://localhost:4000tutorials/data-manipulation/np-pd/3-2-pd-indexing/\n99 CNN 卷积神经网络 
http://localhost:4000tutorials/machine-learning/torch/4-01-CNN/\n100 Placeholder 传入值 http://localhost:4000tutorials/machine-learning/tensorflow/2-5-placeholde/\n101 Tensorboard 可视化好帮手 1 http://localhost:4000tutorials/machine-learning/tensorflow/4-1-tensorboard1/\n102 科普: 人工神经网络 VS 生物神经网络 http://localhost:4000tutorials/machine-learning/theano/1-1-A-ANN-and-NN/\n103 回到从前 (checkout 针对单个文件) http://localhost:4000tutorials/others/git/3-2-checkout/\n104 为什么用强化学习 Why? http://localhost:4000tutorials/machine-learning/reinforcement-learning/1-1-why/\n105 Linux 基本指令 touch, cp 和 mv http://localhost:4000tutorials/others/linux-basic/2-02-basic-command/\n106 OpenAI gym 环境库 http://localhost:4000tutorials/machine-learning/reinforcement-learning/4-4-gym/\n107 NEAT 强化学习 http://localhost:4000tutorials/machine-learning/evolutionary-algorithm/4-03-neat-reinforcement-learning/\n108 什么是卷积神经网络 CNN (Convolutional Neural Network) http://localhost:4000tutorials/machine-learning/tensorflow/5-03-A-CNN/\n109 Tensorflow 2017 更新 http://localhost:4000tutorials/machine-learning/tensorflow/5-14-tf2017/\n110 continue & break http://localhost:4000tutorials/python-basic/basic/13-01-continue-break/\n111 multiprocessing 什么是多进程 http://localhost:4000tutorials/python-basic/basic/13-06-multiprocessing/\n112 Saver 保存读取 http://localhost:4000tutorials/machine-learning/tensorflow/5-06-save/\n113 科普: 人工神经网络 VS 生物神经网络 http://localhost:4000tutorials/machine-learning/tensorflow/1-1-A-ANN-and-NN/\n114 什么是 Sarsa http://localhost:4000tutorials/machine-learning/reinforcement-learning/3-1-A-sarsa/\n115 神经网络 (Neural Network) http://localhost:4000tutorials/machine-learning/ML-intro/2-1-NN/\n116 通用学习模式 http://localhost:4000tutorials/machine-learning/sklearn/2-2-general-pattern/\n117 第一个版本库 Repository http://localhost:4000tutorials/others/git/2-1-repository/\n118 Dropout 缓解过拟合 http://localhost:4000tutorials/machine-learning/torch/5-03-dropout/\n119 Sarsa 算法更新 
http://localhost:4000tutorials/machine-learning/reinforcement-learning/3-1-tabular-sarsa1/\n120 set 找不同 http://localhost:4000tutorials/python-basic/basic/13-09-set/\n121 多维列表 http://localhost:4000tutorials/python-basic/basic/11-3-multi-dim-list/\n122 检验神经网络 (Evaluation) http://localhost:4000tutorials/machine-learning/sklearn/3-2-A-Evaluate-NN/\n123 3D 数据 http://localhost:4000tutorials/data-manipulation/plt/3-5-3d/\n124 Regression 回归例子 http://localhost:4000tutorials/machine-learning/theano/3-2-regression/\n125 激励函数 (Activation Function) http://localhost:4000tutorials/machine-learning/ML-intro/3-04-activation-function/\n126 Scatter 散点图 http://localhost:4000tutorials/data-manipulation/plt/3-1-scatter/\n127 什么是神经网络进化 (Neuro-Evolution) http://localhost:4000tutorials/machine-learning/evolutionary-algorithm/4-00-neuro-evolution/\n128 什么是循环神经网络 RNN (Recurrent Neural Network) http://localhost:4000tutorials/machine-learning/tensorflow/5-07-A-RNN/\n129 什么是神经网络 (Neural Network) http://localhost:4000tutorials/machine-learning/tensorflow/1-1-B-NN/\n130 DQN 算法更新 (Tensorflow) http://localhost:4000tutorials/machine-learning/reinforcement-learning/4-1-DQN1/\n131 怎么样从 MacOS 或 Linux 通过 SSH 远程 Linux http://localhost:4000tutorials/others/linux-basic/4-01-ssh-from-linux-or-mac/\n132 什么是神经网络 (Neural Network) http://localhost:4000tutorials/machine-learning/keras/1-1-B-NN/\n133 区分类型 (分类) http://localhost:4000tutorials/machine-learning/torch/3-02-classification/\n134 Optimizer 优化器 http://localhost:4000tutorials/machine-learning/torch/3-06-optimizer/\n135 AlphaGo Zero 为什么更厉害? 
http://localhost:4000tutorials/machine-learning/ML-intro/4-11-AlphaGo-zero/\n136 Deep Deterministic Policy Gradient (DDPG) http://localhost:4000tutorials/machine-learning/ML-intro/4-09-DDPG/\n137 if 判断 http://localhost:4000tutorials/python-basic/basic/04-1-if/\n138 class 类 init 功能 http://localhost:4000tutorials/python-basic/basic/09-2-class-init/\n139 Numpy copy & deep copy http://localhost:4000tutorials/data-manipulation/np-pd/2-8-np-copy/\n140 Tensorboard 可视化好帮手 2 http://localhost:4000tutorials/machine-learning/tensorflow/4-2-tensorboard2/\n141 Frame 框架 http://localhost:4000tutorials/python-basic/tkinter/2-09-frame/\n142 Microbial Genetic Algorithm http://localhost:4000tutorials/machine-learning/evolutionary-algorithm/2-04-microbial-genetic-algorithm/\n143 NEAT 监督学习 http://localhost:4000tutorials/machine-learning/evolutionary-algorithm/4-02-neat-supervised-learning/\n144 安装 http://localhost:4000tutorials/python-basic/basic/01-1-install/\n145 怎么样从 Windows 通过 SSH 远程 Linux http://localhost:4000tutorials/others/linux-basic/4-02-ssh-from-windows/\n146 什么是 Actor Critic http://localhost:4000tutorials/machine-learning/reinforcement-learning/6-1-A-AC/\n147 遗传算法 (Genetic Algorithm) http://localhost:4000tutorials/machine-learning/ML-intro/5-01-genetic-algorithm/\n148 CNN 卷积神经网络 http://localhost:4000tutorials/machine-learning/keras/2-3-CNN/\n149 选择好特征 (Good Features) http://localhost:4000tutorials/machine-learning/ML-intro/3-03-choose-feature/\n150 什么是 Q Leaning http://localhost:4000tutorials/machine-learning/reinforcement-learning/2-2-A-q-learning/\n151 Evolution Strategy 强化学习 http://localhost:4000tutorials/machine-learning/evolutionary-algorithm/4-04-evolution-strategy-reinforcement-learning/\n152 兼容 backend http://localhost:4000tutorials/machine-learning/keras/1-3-backend/\n153 什么是过拟合 (Overfitting) http://localhost:4000tutorials/machine-learning/theano/3-5-A-overfitting/\n154 神经网络 梯度下降 http://localhost:4000tutorials/machine-learning/theano/1-1-C-gradient-descent/\n155 
Numpy 和 Pandas 安装 http://localhost:4000tutorials/data-manipulation/np-pd/1-2-install/\n156 例子 旅行商人问题 (Travel Sales Problem) http://localhost:4000tutorials/machine-learning/evolutionary-algorithm/2-03-genetic-algorithm-travel-sales-problem/\n157 Distributed Proximal Policy Optimization (DPPO) (Tensorflow) http://localhost:4000tutorials/machine-learning/reinforcement-learning/6-4-DPPO/\n158 Pytorch 安装 http://localhost:4000tutorials/machine-learning/torch/1-2-install/\n159 基础数学运算 http://localhost:4000tutorials/python-basic/basic/02-2-basic-math/\n160 激励函数 (Activation) http://localhost:4000tutorials/machine-learning/torch/2-03-activation/\n161 科普: 神经网络的黑盒不黑 http://localhost:4000tutorials/machine-learning/tensorflow/1-1-D-feature-representation/\n162 例子3 结果可视化 http://localhost:4000tutorials/machine-learning/tensorflow/3-3-visualize-result/\n163 Activation function 激励函数 http://localhost:4000tutorials/machine-learning/theano/2-4-activation/\n164 可视化结果 回归例子 http://localhost:4000tutorials/machine-learning/theano/3-3-visualize/\n165 什么是循环神经网络 RNN (Recurrent Neural Network) http://localhost:4000tutorials/machine-learning/torch/4-02-A-RNN/\n166 什么是 Deep Deterministic Policy Gradient (DDPG) http://localhost:4000tutorials/machine-learning/reinforcement-learning/6-2-A-DDPG/\n167 例子 登录窗口1 http://localhost:4000tutorials/python-basic/tkinter/3-01-example1/\n168 怎么样用 TeamViewer 和 VNC 从远程控制电脑 http://localhost:4000tutorials/others/linux-basic/4-04-teamviewer-vnc/\n169 什么是循环神经网络 RNN (Recurrent Neural Network) http://localhost:4000tutorials/machine-learning/keras/2-4-A-RNN/\n170 Pandas 导入导出 http://localhost:4000tutorials/data-manipulation/np-pd/3-5-pd-to/\n171 选择学习方法 http://localhost:4000tutorials/machine-learning/sklearn/2-1-select-method/\n172 class 类 http://localhost:4000tutorials/python-basic/basic/09-1-class/\n173 什么是激励函数 (Activation Function) http://localhost:4000tutorials/machine-learning/theano/2-4-A-activation-function/\n174 特征标准化 (Feature Normalization) 
http://localhost:4000tutorials/machine-learning/ML-intro/3-02-normalization/\n175 Policy Gradients 思维决策 (Tensorflow) http://localhost:4000tutorials/machine-learning/reinforcement-learning/5-2-policy-gradient-softmax2/\n176 RNN LSTM (回归例子可视化) http://localhost:4000tutorials/machine-learning/tensorflow/5-10-RNN4/\n177 关系拟合 (回归) http://localhost:4000tutorials/machine-learning/torch/3-01-regression/\n178 正规化 Normalization http://localhost:4000tutorials/machine-learning/sklearn/3-1-normalization/\n179 GPU 加速运算 http://localhost:4000tutorials/machine-learning/torch/5-02-GPU/\n180 DQN 强化学习 http://localhost:4000tutorials/machine-learning/torch/4-05-DQN/\n181 科普: 神经网络的黑盒不黑 http://localhost:4000tutorials/machine-learning/ML-intro/2-7-feature-representation/\n182 加速神经网络训练 (Speed Up Training) http://localhost:4000tutorials/machine-learning/ML-intro/3-06-speed-up-learning/\n183 交叉验证 1 Cross-validation http://localhost:4000tutorials/machine-learning/sklearn/3-2-cross-validation1/\n184 读写文件 1 http://localhost:4000tutorials/python-basic/basic/08-1-read-file/\n185 回到从前 (reset) http://localhost:4000tutorials/others/git/3-1-reset/\n186 什么是 Tkinter http://localhost:4000tutorials/python-basic/tkinter/1-1-why/\n187 Theano 安装 http://localhost:4000tutorials/machine-learning/theano/1-2-install/\n188 Save 保存 提取 http://localhost:4000tutorials/machine-learning/theano/3-6-save/\n189 sklearn 常用属性与功能 http://localhost:4000tutorials/machine-learning/sklearn/2-4-model-attributes/\n190 变量 (Variable) http://localhost:4000tutorials/machine-learning/torch/2-02-variable/\n191 threading 什么是多线程 http://localhost:4000tutorials/python-basic/basic/13-05-threading/\n192 RNN LSTM 循环神经网络 (分类例子) http://localhost:4000tutorials/machine-learning/tensorflow/5-08-RNN2/\n193 记录修改 (log & diff) http://localhost:4000tutorials/others/git/2-2-modified/\n194 dictionary 字典 http://localhost:4000tutorials/python-basic/basic/11-4-dictionary/\n195 Session 会话控制 
http://localhost:4000tutorials/machine-learning/tensorflow/2-3-session/\n196 神经网络进化 (Neuro-Evolution) http://localhost:4000tutorials/machine-learning/ML-intro/5-03-neuro-evolution/\n197 神经网络 梯度下降 http://localhost:4000tutorials/machine-learning/tensorflow/1-1-C-gradient-descent/\n198 例子3 添加层 def add_layer() http://localhost:4000tutorials/machine-learning/tensorflow/3-1-add-layer/\n199 Scale 尺度 http://localhost:4000tutorials/python-basic/tkinter/2-05-scale/\n200 例子 配对句子 http://localhost:4000tutorials/machine-learning/evolutionary-algorithm/2-02-genetic-algorithm-match-phrase/\n201 rebase 分支冲突 http://localhost:4000tutorials/others/git/4-3-rebase/\n202 Batch Normalization 批标准化 http://localhost:4000tutorials/machine-learning/torch/5-04-batch-normalization/\n203 科普: 神经网络的黑盒不黑 http://localhost:4000tutorials/machine-learning/theano/1-1-D-feature-representation/\n204 自己的云计算, 把 Linux 当成你的云计算平台 http://localhost:4000tutorials/others/linux-basic/5-01-remote-learning/\n205 例子 登录窗口2 http://localhost:4000tutorials/python-basic/tkinter/3-02-example2/\n206 机器学习 (Machine Learning) http://localhost:4000tutorials/machine-learning/sklearn/1-1-A-ML/\n207 自己的模块 http://localhost:4000tutorials/python-basic/basic/12-2-personal-module/\n208 Matplotlib 安装 http://localhost:4000tutorials/data-manipulation/plt/1-2-install/\n209 进化策略 (Evolution Strategy) http://localhost:4000tutorials/machine-learning/ML-intro/5-02-evolution-strategy/\n210 list 列表 http://localhost:4000tutorials/python-basic/basic/11-2-list/\n211 Double DQN (Tensorflow) http://localhost:4000tutorials/machine-learning/reinforcement-learning/4-5-double_DQN/\n212 Numpy 的创建 array http://localhost:4000tutorials/data-manipulation/np-pd/2-2-np-array/\n213 遗传算法 http://localhost:4000tutorials/machine-learning/evolutionary-algorithm/2-01-genetic-algorithm/\n214 Torch 或 Numpy http://localhost:4000tutorials/machine-learning/torch/2-01-torch-numpy/\n215 快速搭建法 http://localhost:4000tutorials/machine-learning/torch/3-03-fast-nn/\n216 例子 登录窗口3 
http://localhost:4000tutorials/python-basic/tkinter/3-03-example3/\n217 科普: 人工神经网络 VS 生物神经网络 http://localhost:4000tutorials/machine-learning/ML-intro/2-0-ANN-and-NN/\n218 zip lambda map http://localhost:4000tutorials/python-basic/basic/13-03-zip-lambda-map/\n219 Batch Normalization 批标准化 http://localhost:4000tutorials/machine-learning/tensorflow/5-13-BN/\n220 pack grid place 放置位置 http://localhost:4000tutorials/python-basic/tkinter/2-11-pack-grid-place/\n221 scope 命名方法 http://localhost:4000tutorials/machine-learning/tensorflow/5-12-scope/\n222 Subplot 多合一显示 http://localhost:4000tutorials/data-manipulation/plt/4-1-subpot1/\n223 L1 / L2 正规化 (Regularization) http://localhost:4000tutorials/machine-learning/ML-intro/3-09-l1l2regularization/\n224 强化学习方法汇总 (Reinforcement Learning) http://localhost:4000tutorials/machine-learning/ML-intro/4-02-RL-methods/\n225 次坐标轴 http://localhost:4000tutorials/data-manipulation/plt/4-4-sencondary-axis/\n226 什么是 LSTM 循环神经网络 http://localhost:4000tutorials/machine-learning/keras/2-4-B-LSTM/\n227 Asynchronous Advantage Actor-Critic (A3C) http://localhost:4000tutorials/machine-learning/ML-intro/4-10-A3C/\n228 强化学习 (Reinforcement Learning) http://localhost:4000tutorials/machine-learning/ML-intro/4-01-RL/\n229 Numpy array 分割 http://localhost:4000tutorials/data-manipulation/np-pd/2-7-np-split/\n230 Numpy array 合并 http://localhost:4000tutorials/data-manipulation/np-pd/2-6-np-concat/\n231 if else 判断 http://localhost:4000tutorials/python-basic/basic/04-2-if-else/\n232 进程锁 Lock http://localhost:4000tutorials/python-basic/multiprocessing/7-lock/\n233 import 模块 http://localhost:4000tutorials/python-basic/basic/12-1-import/\n234 RNN Regressor 循环神经网络 http://localhost:4000tutorials/machine-learning/keras/2-5-RNN-LSTM-Regressor/\n235 sklearn 强大数据库 http://localhost:4000tutorials/machine-learning/sklearn/2-3-database/\n236 Regularization 正规化 http://localhost:4000tutorials/machine-learning/theano/3-5-regularization/\n237 效率对比 threading & multiprocessing 
http://localhost:4000tutorials/python-basic/multiprocessing/4-comparison/\n238 Why Numpy & Pandas? http://localhost:4000tutorials/data-manipulation/np-pd/1-1-why/\n239 共享内存 shared memory http://localhost:4000tutorials/python-basic/multiprocessing/6-shared-memory/\n240 什么是激励函数 (Activation Function) http://localhost:4000tutorials/machine-learning/torch/2-03-A-activation-function/\n241 Menubar 菜单 http://localhost:4000tutorials/python-basic/tkinter/2-08-menubar/\n242 Annotation 标注 http://localhost:4000tutorials/data-manipulation/plt/2-6-annotation/\n243 RNN 循环神经网络 (分类) http://localhost:4000tutorials/machine-learning/torch/4-02-RNN-classification/\n244 什么是进化策略 (Evolution Strategy) http://localhost:4000tutorials/machine-learning/evolutionary-algorithm/3-00-evolution-strategy/\n245 例子3 建造神经网络 http://localhost:4000tutorials/machine-learning/tensorflow/3-2-create-NN/\n246 (1+1)-ES http://localhost:4000tutorials/machine-learning/evolutionary-algorithm/3-02-evolution-strategy-1+1-ES/\n247 保存模型 http://localhost:4000tutorials/machine-learning/sklearn/3-5-save/\n248 进化算法 简介 http://localhost:4000tutorials/machine-learning/evolutionary-algorithm/1-01-intro/\n249 Sarsa 思维决策 http://localhost:4000tutorials/machine-learning/reinforcement-learning/3-2-tabular-sarsa2/\n250 Dropout 解决 overfitting http://localhost:4000tutorials/machine-learning/tensorflow/5-02-dropout/\n251 什么是激励函数 (Activation Function) http://localhost:4000tutorials/machine-learning/tensorflow/2-6-A-activation-function/\n252 进程池 Pool http://localhost:4000tutorials/python-basic/multiprocessing/5-pool/\n253 Asyncio http://localhost:4000tutorials/data-manipulation/scraping/asyncio/\n254 添加进程 Process http://localhost:4000tutorials/python-basic/multiprocessing/2-add/\n255 存储进程输出 Queue http://localhost:4000tutorials/python-basic/multiprocessing/3-queue/\n256 过拟合 (Overfitting) http://localhost:4000tutorials/machine-learning/ML-intro/3-05-overfitting/\n257 图中图 
http://localhost:4000tutorials/data-manipulation/plt/4-3-plot-in-plot/\n258 Save & reload 保存提取 http://localhost:4000tutorials/machine-learning/keras/3-1-save/\n259 Checkbutton 勾选项 http://localhost:4000tutorials/python-basic/tkinter/2-06-checkbutton/\n260 分支 (branch) http://localhost:4000tutorials/others/git/4-1-branch/\n261 Q-learning 算法更新 http://localhost:4000tutorials/machine-learning/reinforcement-learning/2-2-tabular-q1/\n262 变量 variable http://localhost:4000tutorials/python-basic/basic/02-3-variable/\n263 神经网络 梯度下降 http://localhost:4000tutorials/machine-learning/torch/1-1-C-gradient-descent/\n264 生成对抗网络 (GAN) http://localhost:4000tutorials/machine-learning/ML-intro/2-6-GAN/\n265 什么是 Sarsa(lambda) http://localhost:4000tutorials/machine-learning/reinforcement-learning/3-3-A-sarsa-lambda/\n266 Pandas 合并 merge http://localhost:4000tutorials/data-manipulation/np-pd/3-7-pd-merge/\n267 遗传算法 (Genetic Algorithm) http://localhost:4000tutorials/machine-learning/evolutionary-algorithm/2-00-genetic-algorithm/\n268 科普: 神经网络的黑盒不黑 http://localhost:4000tutorials/machine-learning/keras/1-1-D-feature-representation/\n269 什么是批标准化 (Batch Normalization) http://localhost:4000tutorials/machine-learning/torch/5-04-A-batch-normalization/\n270 自己的云计算, 多电脑共享你云端文件 http://localhost:4000tutorials/others/linux-basic/5-02-share-folder/\n271 从头开始做一个机器手臂3 写动态环境 http://localhost:4000tutorials/machine-learning/ML-practice/RL-build-arm-from-scratch3/\n272 定义 Layer 类 http://localhost:4000tutorials/machine-learning/theano/3-1-layer/\n273 什么是 LSTM 循环神经网络 http://localhost:4000tutorials/machine-learning/torch/4-02-B-LSTM/\n274 pickle 保存数据 http://localhost:4000tutorials/python-basic/basic/13-08-pickle/\n275 Classifier 分类 http://localhost:4000tutorials/machine-learning/keras/2-2-classifier/\n276 CNN 卷积神经网络 1 http://localhost:4000tutorials/machine-learning/tensorflow/5-03-CNN1/\n277 科普: 神经网络的黑盒不黑 http://localhost:4000tutorials/machine-learning/torch/1-1-D-feature-representation/\n278 什么是生成对抗网络 (GAN) 
http://localhost:4000tutorials/machine-learning/torch/4-06-A-GAN/\n279 强化学习方法汇总 (Reinforcement Learning) http://localhost:4000tutorials/machine-learning/reinforcement-learning/1-1-B-RL-methods/\n280 Keras 安装 http://localhost:4000tutorials/machine-learning/keras/1-2-install/\n281 Entry & Text 输入, 文本框 http://localhost:4000tutorials/python-basic/tkinter/2-02-entry-text/\n282 RNN 循环神经网络 http://localhost:4000tutorials/machine-learning/tensorflow/5-07-RNN1/\n283 Policy Gradients 算法更新 (Tensorflow) http://localhost:4000tutorials/machine-learning/reinforcement-learning/5-1-policy-gradient-softmax1/\n284 Function 用法 http://localhost:4000tutorials/machine-learning/theano/2-2-function/\n285 优化器 optimizer http://localhost:4000tutorials/machine-learning/tensorflow/3-4-optimizer/\n286 Pandas 合并 concat http://localhost:4000tutorials/data-manipulation/np-pd/3-6-pd-concat/\n287 Tensorflow 安装 http://localhost:4000tutorials/machine-learning/tensorflow/1-2-install/\n288 从头开始做一个机器手臂4 加入强化学习算法 http://localhost:4000tutorials/machine-learning/ML-practice/RL-build-arm-from-scratch4/\n289 Deep Deterministic Policy Gradient (DDPG) (Tensorflow) http://localhost:4000tutorials/machine-learning/reinforcement-learning/6-2-DDPG/\n290 什么是卷积神经网络 CNN (Convolutional Neural Network) http://localhost:4000tutorials/machine-learning/keras/2-3-A-CNN/\n291 什么是批标准化 (Batch Normalization) http://localhost:4000tutorials/machine-learning/tensorflow/5-13-A-batch-normalization/\n292 什么是过拟合 (Overfitting) http://localhost:4000tutorials/machine-learning/tensorflow/5-02-A-overfitting/\n293 函数默认参数 http://localhost:4000tutorials/python-basic/basic/05-3-def-default-parameters/\n294 Linux 文件权限 http://localhost:4000tutorials/others/linux-basic/3-01-file-permissions/\n295 Animation 动画 http://localhost:4000tutorials/data-manipulation/plt/5-1-animation/\n296 全局 & 局部 变量 http://localhost:4000tutorials/python-basic/basic/06-1-global-local/\n297 从头开始做一个机器手臂2 写静态环境 
http://localhost:4000tutorials/machine-learning/ML-practice/RL-build-arm-from-scratch2/\n298 Natural Evolution Strategy http://localhost:4000tutorials/machine-learning/evolutionary-algorithm/3-03-evolution-strategy-natural-evolution-strategy/\n299 Pandas 处理丢失数据 http://localhost:4000tutorials/data-manipulation/np-pd/3-4-pd-nan/\n300 什么是神经网络 (Neural Network) http://localhost:4000tutorials/machine-learning/theano/1-1-B-NN/\n301 Numpy 索引 http://localhost:4000tutorials/data-manipulation/np-pd/2-5-np-indexing/\n302 Radiobutton 选择按钮 http://localhost:4000tutorials/python-basic/tkinter/2-04-radiobutton/\n303 激励函数 Activation Function http://localhost:4000tutorials/machine-learning/tensorflow/2-6-activation/\n304 Why Theano? http://localhost:4000tutorials/machine-learning/theano/1-1-why/\n305 Classification 分类学习 http://localhost:4000tutorials/machine-learning/theano/3-4-classification/\n306 了解网页结构 http://localhost:4000tutorials/data-manipulation/scraping/1-01-understand-website/\n307 Variable 变量 http://localhost:4000tutorials/machine-learning/tensorflow/2-4-variable/\n308 加速神经网络训练 (Speed Up Training) http://localhost:4000tutorials/machine-learning/tensorflow/3-4-A-speed-up-learning/\n309 Linux 基本指令 nano 和 cat http://localhost:4000tutorials/others/linux-basic/2-04-basic-command/\n310 处理结构 http://localhost:4000tutorials/machine-learning/tensorflow/2-1-structure/\n311 循环神经网络 RNN (Recurrent Neural Network) http://localhost:4000tutorials/machine-learning/ML-intro/2-3-RNN/\n312 Bar 柱状图 http://localhost:4000tutorials/data-manipulation/plt/3-2-Bar/\n313 Shared 变量 http://localhost:4000tutorials/machine-learning/theano/2-3-shared-variable/\n314 总结和更多 http://localhost:4000tutorials/machine-learning/theano/4-1-summary/\n315 print 功能 http://localhost:4000tutorials/python-basic/basic/02-1-print/\n316 merge 分支冲突 http://localhost:4000tutorials/others/git/4-2-merge-conflict/\n317 join 功能 http://localhost:4000tutorials/python-basic/threading/3-join/\n318 Why Git? 
http://localhost:4000tutorials/others/git/1-1-why/\n319 RNN Classifier 循环神经网络 http://localhost:4000tutorials/machine-learning/keras/2-4-RNN-classifier/\n320 def 函数 http://localhost:4000tutorials/python-basic/basic/05-1-def/\n321 安装 Ubuntu 17.10 http://localhost:4000tutorials/others/linux-basic/1-2-install/\n322 Linux 基本指令 ls 和 cd http://localhost:4000tutorials/others/linux-basic/2-01-basic-command/\n323 Sklearn 安装 http://localhost:4000tutorials/machine-learning/sklearn/1-2-install/\n324 AutoEncoder (自编码/非监督学习) http://localhost:4000tutorials/machine-learning/torch/4-04-autoencoder/\n325 DQN http://localhost:4000tutorials/machine-learning/ML-intro/4-06-DQN/\n326 为什么用 Numpy 还是慢, 你用对了吗? http://localhost:4000tutorials/data-manipulation/np-pd/4-1-speed-up-numpy/\n327 什么是过拟合 (Overfitting) http://localhost:4000tutorials/machine-learning/torch/5-03-A-overfitting/\n328 机器学习 (Machine Learning) http://localhost:4000tutorials/machine-learning/ML-intro/1-1-machine-learning/\n329 Sarsa(lambda) http://localhost:4000tutorials/machine-learning/ML-intro/4-05-sarsa-lambda/\n330 卷积神经网络 CNN (Convolutional Neural Network) http://localhost:4000tutorials/machine-learning/ML-intro/2-2-CNN/\n331 交叉验证 3 Cross-validation http://localhost:4000tutorials/machine-learning/sklearn/3-4-cross-validation3/\n332 例子2 http://localhost:4000tutorials/machine-learning/tensorflow/2-2-example2/\n333 什么是 DQN http://localhost:4000tutorials/machine-learning/reinforcement-learning/4-1-A-DQN/\n334 为什么选 Tensorflow? http://localhost:4000tutorials/machine-learning/tensorflow/1-1-why/\n335 科普: 人工神经网络 VS 生物神经网络 http://localhost:4000tutorials/machine-learning/keras/1-1-A-ANN-and-NN/\n336 设置坐标轴1 http://localhost:4000tutorials/data-manipulation/plt/2-3-axis1/\n337 Why Sklearn? 
http://localhost:4000tutorials/machine-learning/sklearn/1-1-why/\n338 Q Leaning http://localhost:4000tutorials/machine-learning/ML-intro/4-03-q-learning/\n339 进化策略 http://localhost:4000tutorials/machine-learning/evolutionary-algorithm/3-01-evolution-strategy/\n340 神经网络 梯度下降 http://localhost:4000tutorials/machine-learning/keras/1-1-C-gradient-descent/\n341 什么是 Multiprocessing http://localhost:4000tutorials/python-basic/multiprocessing/1-why/\n342 Listbox 列表部件 http://localhost:4000tutorials/python-basic/tkinter/2-03-listbox/\n343 Numpy 属性 http://localhost:4000tutorials/data-manipulation/np-pd/2-1-np-attributes/\n344 科普: 人工神经网络 VS 生物神经网络 http://localhost:4000tutorials/machine-learning/torch/1-1-A-ANN-and-NN/\n345 RNN LSTM (回归例子) http://localhost:4000tutorials/machine-learning/tensorflow/5-09-RNN3/\n346 CNN 卷积神经网络 2 http://localhost:4000tutorials/machine-learning/tensorflow/5-04-CNN2/\n347 figure 图像 http://localhost:4000tutorials/data-manipulation/plt/2-2-figure/\n348 用 Tensorflow 可视化梯度下降 http://localhost:4000tutorials/machine-learning/tensorflow/5-15-tf-gradient-descent/\n349 储存进程结果 Queue http://localhost:4000tutorials/python-basic/threading/4-queue/\n350 Git 安装 http://localhost:4000tutorials/others/git/1-2-install/\n351 tick 能见度 http://localhost:4000tutorials/data-manipulation/plt/2-7-tick-visibility/\n352 给你的 Ubuntu 安装软件 http://localhost:4000tutorials/others/linux-basic/1-4-install-software/\n353 基本用法 http://localhost:4000tutorials/machine-learning/theano/2-1-basic-usage/\n354 加速神经网络训练 (Speed Up Training) http://localhost:4000tutorials/machine-learning/torch/3-06-A-speed-up-learning/\n355 Contours 等高线图 http://localhost:4000tutorials/data-manipulation/plt/3-3-contours/\n356 处理不均衡数据 (Imbalanced data) http://localhost:4000tutorials/machine-learning/ML-intro/3-07-imbalanced-data/\n357 批训练 http://localhost:4000tutorials/machine-learning/torch/3-05-train-on-batch/\n358 DQN 思维决策 (Tensorflow) 
http://localhost:4000tutorials/machine-learning/reinforcement-learning/4-3-DQN3/\n359 快速了解 Ubuntu 17.10 基本界面 http://localhost:4000tutorials/others/linux-basic/1-3-system-overview/\n360 Why Pytorch? http://localhost:4000tutorials/machine-learning/torch/1-1-why/\n361 什么是卷积神经网络 CNN (Convolutional Neural Network) http://localhost:4000tutorials/machine-learning/torch/4-01-A-CNN/\n362 Label & Button 标签和按钮 http://localhost:4000tutorials/python-basic/tkinter/2-01-label-button/\n363 读写文件 2 http://localhost:4000tutorials/python-basic/basic/08-2-read-file2/\n364 什么是神经网络 (Neural Network) http://localhost:4000tutorials/machine-learning/torch/1-1-B-NN/\n365 什么是自编码 (Autoencoder) http://localhost:4000tutorials/machine-learning/tensorflow/5-11-A-autoencoder/\n366 Why Linux? http://localhost:4000tutorials/others/linux-basic/1-1-why/\n367 怎么样从手机 (Android安卓/IOS苹果) 通过 SSH 远程 Linux http://localhost:4000tutorials/others/linux-basic/4-03-ssh-from-phone/\n368 copy & deepcopy 浅复制 & 深复制 http://localhost:4000tutorials/python-basic/basic/13-04-copy/\n369 检验神经网络 (Evaluation) http://localhost:4000tutorials/machine-learning/ML-intro/3-01-Evaluate-NN/\n370 input 输入 http://localhost:4000tutorials/python-basic/basic/10-1-input/\n371 什么是 LSTM 循环神经网络 http://localhost:4000tutorials/machine-learning/tensorflow/5-07-B-LSTM/\n372 try 错误处理 http://localhost:4000tutorials/python-basic/basic/13-02-try/\nTotal time: 16.3 s\n" 204 | ] 205 | } 206 | ], 207 | "source": [ 208 | "unseen = set([base_url,])\n", 209 | "seen = set()\n", 210 | "\n", 211 | "pool = mp.Pool(4) \n", 212 | "count, t1 = 1, time.time()\n", 213 | "while len(unseen) != 0: # still get some url to visit\n", 214 | " if restricted_crawl and len(seen) > 20:\n", 215 | " break\n", 216 | " print('\\nDistributed Crawling...')\n", 217 | " crawl_jobs = [pool.apply_async(crawl, args=(url,)) for url in unseen]\n", 218 | " htmls = [j.get() for j in crawl_jobs] # request connection\n", 219 | "\n", 220 | " print('\\nDistributed Parsing...')\n", 221 | " 
parse_jobs = [pool.apply_async(parse, args=(html,)) for html in htmls]\n", 222 | " results = [j.get() for j in parse_jobs] # parse html\n", 223 | "\n", 224 | " print('\\nAnalysing...')\n", 225 | " seen.update(unseen) # seen the crawled\n", 226 | " unseen.clear() # nothing unseen\n", 227 | "\n", 228 | " for title, page_urls, url in results:\n", 229 | " print(count, title, url)\n", 230 | " count += 1\n", 231 | " unseen.update(page_urls - seen) # get new url to crawl\n", 232 | "print('Total time: %.1f s' % (time.time()-t1, )) # 16 s !!!" 233 | ] 234 | }, 235 | { 236 | "cell_type": "code", 237 | "execution_count": null, 238 | "metadata": {}, 239 | "outputs": [], 240 | "source": [] 241 | } 242 | ], 243 | "metadata": { 244 | "kernelspec": { 245 | "display_name": "Python 2", 246 | "language": "python", 247 | "name": "python2" 248 | }, 249 | "language_info": { 250 | "codemirror_mode": { 251 | "name": "ipython", 252 | "version": 2 253 | }, 254 | "file_extension": ".py", 255 | "mimetype": "text/x-python", 256 | "name": "python", 257 | "nbconvert_exporter": "python", 258 | "pygments_lexer": "ipython2", 259 | "version": "2.7.6" 260 | } 261 | }, 262 | "nbformat": 4, 263 | "nbformat_minor": 0 264 | } 265 | -------------------------------------------------------------------------------- /notebook/4-2-asyncio.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "metadata": {}, 6 | "source": [ 7 | "# Asyncio tutorial\n", 8 | "\n", 9 | "## A normal way in python\n", 10 | "**Firstly, let's see a running time in a normal way.**" 11 | ] 12 | }, 13 | { 14 | "cell_type": "code", 15 | "execution_count": 3, 16 | "metadata": {}, 17 | "outputs": [ 18 | { 19 | "name": "stdout", 20 | "output_type": "stream", 21 | "text": [ 22 | "Start job 1\n" 23 | ] 24 | }, 25 | { 26 | "name": "stdout", 27 | "output_type": "stream", 28 | "text": [ 29 | "Job 1 takes 1 s\nStart job 2\n" 30 | ] 31 | }, 32 | { 33 | 
"name": "stdout", 34 | "output_type": "stream", 35 | "text": [ 36 | "Job 2 takes 2 s\nNO async total time : 3.0082271099090576\n" 37 | ] 38 | } 39 | ], 40 | "source": [ 41 | "import time\n", 42 | "\n", 43 | "\n", 44 | "def job(t):\n", 45 | " print('Start job ', t)\n", 46 | " time.sleep(t) # wait for \"t\" seconds\n", 47 | " print('Job ', t, ' takes ', t, ' s')\n", 48 | " \n", 49 | "\n", 50 | "def main():\n", 51 | " [job(t) for t in range(1, 3)]\n", 52 | " \n", 53 | " \n", 54 | "t1 = time.time()\n", 55 | "main()\n", 56 | "print(\"NO async total time : \", time.time() - t1)" 57 | ] 58 | }, 59 | { 60 | "cell_type": "markdown", 61 | "metadata": {}, 62 | "source": [ 63 | "## Translate above to async\n", 64 | "**Now, let's see the running time using asyncio**" 65 | ] 66 | }, 67 | { 68 | "cell_type": "code", 69 | "execution_count": 4, 70 | "metadata": {}, 71 | "outputs": [ 72 | { 73 | "name": "stdout", 74 | "output_type": "stream", 75 | "text": [ 76 | "Start job 1\nStart job 2\n" 77 | ] 78 | }, 79 | { 80 | "name": "stdout", 81 | "output_type": "stream", 82 | "text": [ 83 | "Job 1 takes 1 s\n" 84 | ] 85 | }, 86 | { 87 | "name": "stdout", 88 | "output_type": "stream", 89 | "text": [ 90 | "Job 2 takes 2 s\nAsync total time : 2.0054221153259277\n" 91 | ] 92 | } 93 | ], 94 | "source": [ 95 | "import asyncio\n", 96 | "\n", 97 | "\n", 98 | "async def job(t):\n", 99 | " print('Start job ', t)\n", 100 | " await asyncio.sleep(t) # wait for \"t\" seconds, it will look for another job while await\n", 101 | " print('Job ', t, ' takes ', t, ' s')\n", 102 | " \n", 103 | "\n", 104 | "async def main(loop):\n", 105 | " tasks = [loop.create_task(job(t)) for t in range(1, 3)] # just create, not run job\n", 106 | " await asyncio.wait(tasks) # run jobs and wait for all tasks done\n", 107 | "\n", 108 | "t1 = time.time()\n", 109 | "loop = asyncio.get_event_loop()\n", 110 | "loop.run_until_complete(main(loop))\n", 111 | "# loop.close() # Ipython notebook gives error if close loop\n", 112 | 
"print(\"Async total time : \", time.time() - t1)\n" 113 | ] 114 | }, 115 | { 116 | "cell_type": "markdown", 117 | "metadata": {}, 118 | "source": [ 119 | "## The normal way to crawl a webpage\n", 120 | "**We can use this mechanism when requesting a website: while awaiting one page download, the program switches to another job.**" 121 | ] 122 | }, 123 | { 124 | "cell_type": "code", 125 | "execution_count": 10, 126 | "metadata": {}, 127 | "outputs": [ 128 | { 129 | "name": "stdout", 130 | "output_type": "stream", 131 | "text": [ 132 | "https://mofanpy.com/\nhttps://mofanpy.com/\nNormal total time: 0.3869960308074951\n" 133 | ] 134 | } 135 | ], 136 | "source": [ 137 | "import requests\n", 138 | "\n", 139 | "URL = 'https://mofanpy.com/'\n", 140 | "\n", 141 | "\n", 142 | "def normal(): \n", 143 | " for i in range(2):\n", 144 | " r = requests.get(URL)\n", 145 | " url = r.url\n", 146 | " print(url)\n", 147 | " \n", 148 | "t1 = time.time()\n", 149 | "normal()\n", 150 | "print(\"Normal total time:\", time.time()-t1)" 151 | ] 152 | }, 153 | { 154 | "cell_type": "markdown", 155 | "metadata": {}, 156 | "source": [ 157 | "## Translate the above to async using aiohttp\n", 158 | "**We have to install another useful package called [aiohttp](https://aiohttp.readthedocs.io/en/stable/index.html). 
You can simply run \"pip3 install aiohttp\" in your terminal.**" 159 | ] 160 | }, 161 | { 162 | "cell_type": "code", 163 | "execution_count": 9, 164 | "metadata": {}, 165 | "outputs": [ 166 | { 167 | "name": "stdout", 168 | "output_type": "stream", 169 | "text": [ 170 | "['https://mofanpy.com/', 'https://mofanpy.com/']\nAsync total time: 0.11447715759277344\n" 171 | ] 172 | } 173 | ], 174 | "source": [ 175 | "import aiohttp\n", 176 | "\n", 177 | "\n", 178 | "async def job(session):\n", 179 | " response = await session.get(URL)\n", 180 | " return str(response.url)\n", 181 | "\n", 182 | "\n", 183 | "async def main(loop):\n", 184 | " async with aiohttp.ClientSession() as session:\n", 185 | " tasks = [loop.create_task(job(session)) for _ in range(2)]\n", 186 | " finished, unfinished = await asyncio.wait(tasks)\n", 187 | " all_results = [r.result() for r in finished] # get return from job\n", 188 | " print(all_results)\n", 189 | " \n", 190 | "t1 = time.time()\n", 191 | "loop = asyncio.get_event_loop()\n", 192 | "loop.run_until_complete(main(loop))\n", 193 | "# loop.close() # Ipython notebook gives error if close loop\n", 194 | "print(\"Async total time:\", time.time() - t1)" 195 | ] 196 | }, 197 | { 198 | "cell_type": "markdown", 199 | "metadata": {}, 200 | "source": [ 201 | "## Compare async with multiprocessing\n", 202 | "**The following code scrape my website with async**" 203 | ] 204 | }, 205 | { 206 | "cell_type": "code", 207 | "execution_count": 2, 208 | "metadata": {}, 209 | "outputs": [ 210 | { 211 | "name": "stdout", 212 | "output_type": "stream", 213 | "text": [ 214 | "\nAsync Crawling...\n\nDistributed Parsing...\n\nAnalysing...\n\nAsync Crawling...\n" 215 | ] 216 | }, 217 | { 218 | "name": "stdout", 219 | "output_type": "stream", 220 | "text": [ 221 | "\nDistributed Parsing...\n" 222 | ] 223 | }, 224 | { 225 | "name": "stdout", 226 | "output_type": "stream", 227 | "text": [ 228 | "\nAnalysing...\n\nAsync Crawling...\n" 229 | ] 230 | }, 231 | { 232 | "name": 
"stdout", 233 | "output_type": "stream", 234 | "text": [ 235 | "\nDistributed Parsing...\n" 236 | ] 237 | }, 238 | { 239 | "name": "stdout", 240 | "output_type": "stream", 241 | "text": [ 242 | "\nAnalysing...\nAsync total time: 7.21798300743103\n" 243 | ] 244 | } 245 | ], 246 | "source": [ 247 | "import aiohttp\n", 248 | "import asyncio\n", 249 | "import time\n", 250 | "from bs4 import BeautifulSoup\n", 251 | "from urllib.request import urljoin\n", 252 | "import re\n", 253 | "import multiprocessing as mp\n", 254 | "\n", 255 | "# base_url = \"https://mofanpy.com/\"\n", 256 | "base_url = \"http://127.0.0.1:4000/\"\n", 257 | "\n", 258 | "# DON'T OVER CRAWL THE WEBSITE OR YOU MAY NEVER VISIT AGAIN\n", 259 | "if base_url != \"http://127.0.0.1:4000/\":\n", 260 | " restricted_crawl = True\n", 261 | "else:\n", 262 | " restricted_crawl = False\n", 263 | " \n", 264 | " \n", 265 | "seen = set()\n", 266 | "unseen = set([base_url])\n", 267 | "\n", 268 | "\n", 269 | "def parse(html):\n", 270 | " soup = BeautifulSoup(html, 'lxml')\n", 271 | " urls = soup.find_all('a', {\"href\": re.compile('^/.+?/$')})\n", 272 | " title = soup.find('h1').get_text().strip()\n", 273 | " page_urls = set([urljoin(base_url, url['href']) for url in urls])\n", 274 | " url = soup.find('meta', {'property': \"og:url\"})['content']\n", 275 | " return title, page_urls, url\n", 276 | "\n", 277 | "\n", 278 | "async def crawl(url, session):\n", 279 | " r = await session.get(url)\n", 280 | " html = await r.text()\n", 281 | " await asyncio.sleep(0.1) # slightly delay for downloading\n", 282 | " return html\n", 283 | "\n", 284 | "\n", 285 | "async def main(loop):\n", 286 | " pool = mp.Pool(8) # slightly affected\n", 287 | " async with aiohttp.ClientSession() as session:\n", 288 | " count = 1\n", 289 | " while len(unseen) != 0:\n", 290 | " print('\\nAsync Crawling...')\n", 291 | " tasks = [loop.create_task(crawl(url, session)) for url in unseen]\n", 292 | " finished, unfinished = await asyncio.wait(tasks)\n", 293 
| " htmls = [f.result() for f in finished]\n", 294 | " \n", 295 | " print('\\nDistributed Parsing...')\n", 296 | " parse_jobs = [pool.apply_async(parse, args=(html,)) for html in htmls]\n", 297 | " results = [j.get() for j in parse_jobs]\n", 298 | " \n", 299 | " print('\\nAnalysing...')\n", 300 | " seen.update(unseen)\n", 301 | " unseen.clear()\n", 302 | " for title, page_urls, url in results:\n", 303 | " # print(count, title, url)\n", 304 | " unseen.update(page_urls - seen)\n", 305 | " count += 1\n", 306 | "\n", 307 | "if __name__ == \"__main__\":\n", 308 | " t1 = time.time()\n", 309 | " loop = asyncio.get_event_loop()\n", 310 | " loop.run_until_complete(main(loop))\n", 311 | " # loop.close()\n", 312 | " print(\"Async total time: \", time.time() - t1)" 313 | ] 314 | }, 315 | { 316 | "cell_type": "markdown", 317 | "metadata": {}, 318 | "source": [ 319 | "**Here we try multiprocessing and test the speed**" 320 | ] 321 | }, 322 | { 323 | "cell_type": "code", 324 | "execution_count": 3, 325 | "metadata": {}, 326 | "outputs": [ 327 | { 328 | "name": "stdout", 329 | "output_type": "stream", 330 | "text": [ 331 | "\nDistributed Crawling...\n\nDistributed Parsing...\n\nAnalysing...\n\nDistributed Crawling...\n" 332 | ] 333 | }, 334 | { 335 | "name": "stdout", 336 | "output_type": "stream", 337 | "text": [ 338 | "\nDistributed Parsing...\n" 339 | ] 340 | }, 341 | { 342 | "name": "stdout", 343 | "output_type": "stream", 344 | "text": [ 345 | "\nAnalysing...\n\nDistributed Crawling...\n" 346 | ] 347 | }, 348 | { 349 | "name": "stdout", 350 | "output_type": "stream", 351 | "text": [ 352 | "\nDistributed Parsing...\n" 353 | ] 354 | }, 355 | { 356 | "name": "stdout", 357 | "output_type": "stream", 358 | "text": [ 359 | "\nAnalysing...\nTotal time: 11.5 s\n" 360 | ] 361 | } 362 | ], 363 | "source": [ 364 | "from urllib.request import urlopen, urljoin\n", 365 | "from bs4 import BeautifulSoup\n", 366 | "import multiprocessing as mp\n", 367 | "import re\n", 368 | "import time\n", 
369 | "\n", 370 | "\n", 371 | "def crawl(url):\n", 372 | " response = urlopen(url)\n", 373 | " time.sleep(0.1) # slightly delay for downloading\n", 374 | " return response.read().decode()\n", 375 | "\n", 376 | "\n", 377 | "def parse(html):\n", 378 | " soup = BeautifulSoup(html, 'lxml')\n", 379 | " urls = soup.find_all('a', {\"href\": re.compile('^/.+?/$')})\n", 380 | " title = soup.find('h1').get_text().strip()\n", 381 | " page_urls = set([urljoin(base_url, url['href']) for url in urls])\n", 382 | " url = soup.find('meta', {'property': \"og:url\"})['content']\n", 383 | " return title, page_urls, url\n", 384 | "\n", 385 | "\n", 386 | "if __name__ == '__main__':\n", 387 | " # base_url = 'https://mofanpy.com/'\n", 388 | " base_url = \"http://127.0.0.1:4000/\"\n", 389 | " \n", 390 | " # DON'T OVER CRAWL THE WEBSITE OR YOU MAY NEVER VISIT AGAIN\n", 391 | " if base_url != \"http://127.0.0.1:4000/\":\n", 392 | " restricted_crawl = True\n", 393 | " else:\n", 394 | " restricted_crawl = False\n", 395 | " \n", 396 | " unseen = set([base_url,])\n", 397 | " seen = set()\n", 398 | "\n", 399 | " pool = mp.Pool(8) # number strongly affected\n", 400 | " count, t1 = 1, time.time()\n", 401 | " while len(unseen) != 0: # still get some url to visit\n", 402 | " if restricted_crawl and len(seen) > 20:\n", 403 | " break\n", 404 | " print('\\nDistributed Crawling...')\n", 405 | " crawl_jobs = [pool.apply_async(crawl, args=(url,)) for url in unseen]\n", 406 | " htmls = [j.get() for j in crawl_jobs] # request connection\n", 407 | " htmls = [h for h in htmls if h is not None] # remove None\n", 408 | "\n", 409 | " print('\\nDistributed Parsing...')\n", 410 | " parse_jobs = [pool.apply_async(parse, args=(html,)) for html in htmls]\n", 411 | " results = [j.get() for j in parse_jobs] # parse html\n", 412 | "\n", 413 | " print('\\nAnalysing...')\n", 414 | " seen.update(unseen)\n", 415 | " unseen.clear()\n", 416 | "\n", 417 | " for title, page_urls, url in results:\n", 418 | " # print(count, title, 
url)\n", 419 | " count += 1\n", 420 | " unseen.update(page_urls - seen)\n", 421 | "\n", 422 | " print('Total time: %.1f s' % (time.time()-t1, ))\n", 423 | "\n" 424 | ] 425 | }, 426 | { 427 | "cell_type": "code", 428 | "execution_count": 44, 429 | "metadata": {}, 430 | "outputs": [], 431 | "source": [] 432 | }, 433 | { 434 | "cell_type": "code", 435 | "execution_count": null, 436 | "metadata": {}, 437 | "outputs": [], 438 | "source": [] 439 | } 440 | ], 441 | "metadata": { 442 | "kernelspec": { 443 | "display_name": "Python 3", 444 | "language": "python", 445 | "name": "python3" 446 | }, 447 | "language_info": { 448 | "codemirror_mode": { 449 | "name": "ipython", 450 | "version": 3 451 | }, 452 | "file_extension": ".py", 453 | "mimetype": "text/x-python", 454 | "name": "python", 455 | "nbconvert_exporter": "python", 456 | "pygments_lexer": "ipython3", 457 | "version": "3.5.1" 458 | } 459 | }, 460 | "nbformat": 4, 461 | "nbformat_minor": 1 462 | } 463 | -------------------------------------------------------------------------------- /notebook/5-1-selenium.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "metadata": {}, 6 | "source": [ 7 | "# Selenium tutorial\n", 8 | "\n", 9 | "**[Selenium](http://selenium-python.readthedocs.io) is a project to test website then be used to scraping because some website need to run javascript. 
With the tools covered so far, we cannot get what we need when the actual content is rendered by JavaScript.**\n",
    "\n",
    "**Install [Selenium](http://selenium-python.readthedocs.io/installation.html); it opens a real browser to do all the work.**"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 43,
   "metadata": {},
   "outputs": [],
   "source": [
    "import os\n",
    "\n",
    "os.makedirs('./img/', exist_ok=True)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 17,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "\n \n \n \n
...
res = re.findall(r"<title>(.+?)</title>", html)
print("\nPage title is: ", res[0])
# Page title is:  Scraping tutorial 1 | 莫烦Python


res = re.findall(r"<p>(.*?)</p>", html, flags=re.DOTALL)    # re.DOTALL lets "." match across line breaks
print("\nPage paragraph is: ", res[0])
# Page paragraph is:
#   这是一个在 莫烦Python
#   爬虫教程 中的简单测试.


res = re.findall(r'href="(.*?)"', html)
print("\nAll links: ", res)
# All links:  ['https://mofanpy.com/static/img/description/tab_icon.png', 'https://mofanpy.com/', 'https://mofanpy.com/tutorials/scraping']

--------------------------------------------------------------------------------
/source_code/2-1-beautifulsoup-basic.py:
--------------------------------------------------------------------------------
from bs4 import BeautifulSoup
from urllib.request import urlopen

# decode the response bytes as UTF-8 (the page contains Chinese text)
html = urlopen("https://mofanpy.com/static/scraping/basic-structure.html").read().decode('utf-8')

soup = BeautifulSoup(html, features='lxml')
print(soup.h1)
print('\n', soup.p)

all_href = soup.find_all('a')
all_href = [link['href'] for link in all_href]
print('\n', all_href)

--------------------------------------------------------------------------------
/source_code/2-2-beautifulsoup-css.py:
--------------------------------------------------------------------------------
from bs4 import BeautifulSoup
from urllib.request import urlopen

# decode the response bytes as UTF-8 (the page contains Chinese text)
html = urlopen("https://mofanpy.com/static/scraping/list.html").read().decode('utf-8')

soup = BeautifulSoup(html, features='lxml')

# use the class attribute to narrow the search
month = soup.find_all('li', {"class": "month"})
for m in month:
    print(m.get_text())


jan = soup.find('ul', {"class": 'jan'})
d_jan = jan.find_all('li')              # search only inside the jan element
for d in d_jan:
    print(d.get_text())

--------------------------------------------------------------------------------
/source_code/2-3-beautifulsoup-regex.py:
--------------------------------------------------------------------------------
from bs4 import BeautifulSoup
from urllib.request import urlopen
import re

# decode the response bytes as UTF-8 (the page contains Chinese text)
html = urlopen("https://mofanpy.com/static/scraping/table.html").read().decode('utf-8')

soup = BeautifulSoup(html, features='lxml')

# raw string so that "\." reaches the regex engine unchanged
img_links = soup.find_all("img", {"src": re.compile(r'.*?\.jpg')})
for link in img_links:
    print(link['src'])

print('\n')

course_links = soup.find_all('a', {'href': re.compile('https://morvan.*')})
for link in course_links:
    print(link['href'])

--------------------------------------------------------------------------------
/source_code/2-4-practice-baidu-baike.py:
--------------------------------------------------------------------------------
from bs4 import BeautifulSoup
from urllib.request import urlopen
import re
import random


base_url = "https://baike.baidu.com"
his = ["/item/%E7%BD%91%E7%BB%9C%E7%88%AC%E8%99%AB/5162711"]

for i in range(20):
    # the path is percent-encoded, so the Chinese characters are URL-safe
    url = base_url + his[-1]

    html = urlopen(url).read().decode('utf-8')
    soup = BeautifulSoup(html, features='lxml')
    print(i, soup.find('h1').get_text(), '    url: ', his[-1])

    # find valid urls
    sub_urls = soup.find_all("a", {"target": "_blank", "href": re.compile("/item/(%.{2})+$")})

    if len(sub_urls) != 0:
        his.append(random.sample(sub_urls, 1)[0]['href'])
    else:
        # no valid sub link found, so step back to the previous page
        his.pop()

--------------------------------------------------------------------------------
/source_code/2-5-beautifulsoup-table.py:
--------------------------------------------------------------------------------
from bs4 import BeautifulSoup
from urllib.request import urlopen

# decode the response bytes as UTF-8 (the page contains Chinese text)
html = urlopen("https://mofanpy.com/static/scraping/table.html").read().decode('utf-8')

soup = BeautifulSoup(html, features='lxml')

# print every row, including the title row
for item in soup.find("table", {"id": "course-list"}).children:
    print(item)

print("-------------------------")
# print the rows without the title row
for item in soup.find("table", {"id": "course-list"}).tr.next_siblings:
    print(item)

print("-------------------------")
# navigate using next_sibling/previous_sibling
print(soup.find("img", {"src": "https://mofanpy.com/static/img/course_cover/scraping.jpg"}
                ).parent.previous_sibling.get_text())

--------------------------------------------------------------------------------
/source_code/3-1-requests.py:
--------------------------------------------------------------------------------
import requests


def get():
    print('\nget')
    param = {"wd": "莫烦Python"}
    r = requests.get('http://www.baidu.com/s', params=param)
    print(r.url)
    print(r.text)


def post_name():
    print('\npost name')
    # http://pythonscraping.com/pages/files/form.html
    data = {'firstname': '莫烦', 'lastname': '周'}
    r = requests.post('http://pythonscraping.com/files/processing.php', data=data)
    print(r.text)


def post_image():
    print('\npost image')
    # http://pythonscraping.com/files/form2.html
    with open('./image.png', 'rb') as f:     # close the file once the upload is done
        file = {'uploadFile': f}
        r = requests.post('http://pythonscraping.com/files/processing2.php', files=file)
    print(r.text)


def post_login():
    print('\npost login')
    # http://pythonscraping.com/pages/cookies/login.html
    payload = {'username': 'Morvan', 'password': 'password'}
    r = requests.post('http://pythonscraping.com/pages/cookies/welcome.php', data=payload)
    print(r.cookies.get_dict())
    # http://pythonscraping.com/pages/cookies/profile.php
    r = requests.get('http://pythonscraping.com/pages/cookies/profile.php', cookies=r.cookies)
    print(r.text)


def session_login():
    print('\nsession login')
    # http://pythonscraping.com/pages/cookies/login.html
    session = requests.Session()
    payload = {'username': 'Morvan', 'password': 'password'}
    r = session.post('http://pythonscraping.com/pages/cookies/welcome.php', data=payload)
    print(r.cookies.get_dict())
    r = session.get("http://pythonscraping.com/pages/cookies/profile.php")
    print(r.text)


get()
post_name()
post_image()
post_login()
session_login()

--------------------------------------------------------------------------------
/source_code/3-2-download.py:
--------------------------------------------------------------------------------
import os
os.makedirs('./img/', exist_ok=True)

IMAGE_URL = "https://mofanpy.com/static/img/description/learning_step_flowchart.png"


def urllib_download():
    from urllib.request import urlretrieve
    urlretrieve(IMAGE_URL, './img/image1.png')      # whole document


def request_download():
    import requests
    r = requests.get(IMAGE_URL)
    with open('./img/image2.png', 'wb') as f:
        f.write(r.content)                      # whole document


def chunk_download():
    import requests
    r = requests.get(IMAGE_URL, stream=True)    # stream loading

    with open('./img/image3.png', 'wb') as f:
        for chunk in r.iter_content(chunk_size=32):
            f.write(chunk)


urllib_download()
print('download image1')
request_download()
print('download image2')
chunk_download()
print('download image3')

--------------------------------------------------------------------------------
/source_code/3-3-practice-download-images.py:
--------------------------------------------------------------------------------
from bs4 import BeautifulSoup
import requests
import os
os.makedirs('./img/', exist_ok=True)

URL = "http://www.nationalgeographic.com.cn/animals/"

html = requests.get(URL).text
soup = BeautifulSoup(html, 'lxml')
img_ul = soup.find_all('ul', {"class": "img_list"})

for ul in img_ul:
    imgs = ul.find_all('img')
    for img in imgs:
        url = img['src']
        r = requests.get(url, stream=True)
        image_name = url.split('/')[-1]
        with open('./img/%s' % image_name, 'wb') as f:
            for chunk in r.iter_content(chunk_size=128):
                f.write(chunk)
        print('Saved %s' % image_name)

--------------------------------------------------------------------------------
/source_code/4-1-distributed-scraping.py:
--------------------------------------------------------------------------------
from urllib.request import urlopen
from urllib.parse import urljoin
from bs4 import BeautifulSoup
import multiprocessing as mp
import re
import time

# Defined at module level so that parse() can see it inside worker processes,
# including on platforms that start workers with "spawn" instead of "fork".
base_url = 'https://mofanpy.com/'
# base_url = "http://127.0.0.1:4000/"


def crawl(url):
    response = urlopen(url)
    time.sleep(0.1)             # brief delay between downloads
    return response.read().decode()


def parse(html):
    soup = BeautifulSoup(html, 'lxml')
    urls = soup.find_all('a', {"href": re.compile('^/.+?/$')})
    title = soup.find('h1').get_text().strip()
    page_urls = set([urljoin(base_url, url['href']) for url in urls])   # remove duplication
    url = soup.find('meta', {'property': "og:url"})['content']
    return title, page_urls, url


if __name__ == '__main__':
    # DON'T OVER-CRAWL THE WEBSITE OR YOU MAY NEVER VISIT AGAIN
    restricted_crawl = base_url != "http://127.0.0.1:4000/"

    unseen = set([base_url,])
    seen = set()

    pool = mp.Pool(4)                       # pool size strongly affects total time
    count, t1 = 1, time.time()

    while len(unseen) != 0:                 # there are still URLs to visit
        if restricted_crawl and len(seen) > 20:
            break
        print('\nDistributed Crawling...')
        crawl_jobs = [pool.apply_async(crawl, args=(url,)) for url in unseen]
        htmls = [j.get() for j in crawl_jobs]                                       # request connection
        htmls = [h for h in htmls if h is not None]     # remove None

        print('\nDistributed Parsing...')
        parse_jobs = [pool.apply_async(parse, args=(html,)) for html in htmls]
        results = [j.get() for j in parse_jobs]                                     # parse html

        print('\nAnalysing...')
        seen.update(unseen)
        unseen.clear()

        for title, page_urls, url in results:
            print(count, title, url)
            count += 1
            unseen.update(page_urls - seen)

    print('Total time: %.1f s' % (time.time()-t1, ))

--------------------------------------------------------------------------------
/source_code/4-2-asyncio.py:
--------------------------------------------------------------------------------
import aiohttp
import asyncio
import time
from bs4 import BeautifulSoup
from urllib.parse import urljoin
import re
import multiprocessing as mp

base_url = "https://mofanpy.com/"
# base_url = "http://127.0.0.1:4000/"

# DON'T OVER-CRAWL THE WEBSITE OR YOU MAY NEVER VISIT AGAIN
restricted_crawl = base_url != "http://127.0.0.1:4000/"

seen = set()
unseen = set([base_url])


def parse(html):
    soup = BeautifulSoup(html, 'lxml')
    urls = soup.find_all('a', {"href": re.compile('^/.+?/$')})
    title = soup.find('h1').get_text().strip()
    page_urls = set([urljoin(base_url, url['href']) for url in urls])
    url = soup.find('meta', {'property': "og:url"})['content']
    return title, page_urls, url


async def crawl(url, session):
    r = await session.get(url)
    html = await r.text()
    await asyncio.sleep(0.1)       # brief delay between downloads
    return html


async def main():
    pool = mp.Pool(2)               # pool size only slightly affects total time
    async with aiohttp.ClientSession() as session:
        count = 1
        while len(unseen) != 0:
            if restricted_crawl and len(seen) > 20:
                break
            # download every unseen page concurrently
            tasks = [asyncio.create_task(crawl(url, session)) for url in unseen]
            htmls = await asyncio.gather(*tasks)

            parse_jobs = [pool.apply_async(parse, args=(html,)) for html in htmls]
            results = [j.get() for j in parse_jobs]

            seen.update(unseen)
            unseen.clear()
            for title, page_urls, url in results:
                print(count, title, url)
                unseen.update(page_urls - seen)
                count += 1

if __name__ == "__main__":
    t1 = time.time()
    asyncio.run(main())             # requires Python 3.7+
    print("Async total time: ", time.time() - t1)

--------------------------------------------------------------------------------
/source_code/5-1-selenium.py:
--------------------------------------------------------------------------------
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.common.by import By

# firefox plugin
# https://askubuntu.com/questions/870530/how-to-install-geckodriver-in-ubuntu

# hide browser window
chrome_options = Options()
chrome_options.add_argument("--headless")       # define headless

# add the option when creating driver
# (Selenium 4 API: options= and find_element(By...) replace chrome_options= and find_element_by_*)
driver = webdriver.Chrome(options=chrome_options)
driver.get("https://mofanpy.com/")
driver.find_element(By.XPATH, "//img[@alt='强化学习 (Reinforcement Learning)']").click()
driver.find_element(By.LINK_TEXT, "About").click()
driver.find_element(By.LINK_TEXT, "赞助").click()
driver.find_element(By.LINK_TEXT, "教程 ▾").click()
driver.find_element(By.LINK_TEXT, "数据处理 ▾").click()
driver.find_element(By.LINK_TEXT, "网页爬虫").click()

print(driver.page_source[:200])
driver.get_screenshot_as_file("./img/sreenshot2.png")
driver.close()
print('finish')

--------------------------------------------------------------------------------
/source_code/5-2-scrapy.py:
--------------------------------------------------------------------------------
import scrapy


class MofanSpider(scrapy.Spider):
    name = "mofan"
    start_urls = [
        'https://mofanpy.com/',
    ]
    # unseen = set()
    # seen = set()      # we don't need these two as scrapy will deal with them automatically

    def parse(self, response):
        yield {     # return some results
            'title': response.css('h1::text').extract_first(default='Missing').strip().replace('"', ""),
            'url': response.url,
        }

        urls = response.css('a::attr(href)').re(r'^/.+?/$')     # find all sub urls
        for url in urls:
            yield response.follow(url, callback=self.parse)     # it will filter duplication automatically


# lastly, run this in terminal
# scrapy runspider 5-2-scrapy.py -o res.json

--------------------------------------------------------------------------------