├── .gitignore
├── img
    ├── pic2.jpg
    ├── Scrapy.jpg
    ├── 公众号—AI派.jpg
    ├── 1564807382096.png
    ├── 1564807688948.png
    ├── 1564808967916.png
    ├── 1564809335298.png
    ├── 1564809595601.png
    ├── 1564810175374.png
    ├── 1564810438026.png
    ├── 1564812010365.png
    ├── 1564812252811.png
    ├── 1564812421392.png
    ├── 1564812728360.png
    ├── 1564812936144.png
    ├── 1564813159414.png
    ├── 1564813334042.png
    ├── 1565418696252.png
    ├── 1566045583808.png
    ├── 1566046413728.png
    ├── 1566049929368.png
    ├── 1566655220447.png
    ├── 1566656347663.png
    ├── 1566656597002.png
    ├── 1566656895958.png
    ├── 1566657375088.png
    ├── 1567265431802.png
    ├── 1567856337964.png
    ├── 1567857170021.png
    ├── 1567857830629.png
    ├── 1567859645775.png
    ├── 1567860876635.png
    ├── 1567861127369.png
    ├── 1568037229251.png
    ├── 1568037265483.png
    ├── 1568039617382.png
    ├── 1568437445627.png
    ├── 1568438083620.png
    ├── 1568438111443.png
    ├── 1568438387703.png
    ├── 1568438437191.png
    ├── 1568438455277.png
    ├── 1568442757901.png
    ├── 1568442863312.png
    ├── 1568442901842.png
    ├── 1568443442203.png
    ├── 1568447114208.png
    ├── 1568447140842.png
    ├── 1568447241296.png
    ├── 1568447268341.png
    ├── 1569076766706.png
    ├── 1569077409152.png
    ├── 1569077656059.png
    ├── 1569077783474.png
    ├── 1569084529057.png
    ├── 1569653221455.png
    ├── 1569658444818.png
    ├── 1569658481643.png
    ├── 1569681231891.png
    ├── 1569681364026.png
    ├── 1569682015625.png
    ├── 1569682063882.png
    ├── 1569683001422.png
    ├── 1569683170817.png
    ├── 1569684196456.png
    ├── 1569684397046.png
    ├── 1569684428783.png
    ├── 1569684489904.png
    ├── 1569684645930.png
    ├── 1570202323349.png
    ├── 1570202356214.png
    ├── 1570202509673.png
    ├── 1570886392682.png
    ├── 1570886711412.png
    ├── 1570886893828.png
    ├── 1570886939630.png
    ├── 1570886966890.png
    ├── 1570887591900.png
    ├── 1570893855785.png
    ├── 1571484984537.png
    ├── 1571485766709.png
    ├── 1571487145703.png
    ├── 1571489386804.png
    ├── 1571489620723.png
    ├── 1571489715078.png
    ├── 1571489797290.png
    ├── 1571489930881.png
    ├── 1571490007402.png
    ├── image-20191026224441467.png
    ├── image-20191026224553401.png
    ├── image-20191026224709438.png
    ├── image-20191026225204779.png
    ├── image-20191026225551121.png
    ├── image-20191026225959992.png
    ├── image-20191026230235017.png
    ├── image-20191026231439674.png
    ├── image-20191026231525926.png
    ├── image-20191026231748209.png
    ├── image-20191026231937842.png
    ├── image-20191026232056812.png
    ├── image-20191026232214264.png
    ├── image-20191026232312105.png
    ├── image-20191026232459509.png
    ├── image-20191026232535573.png
    ├── image-20191026232551493.png
    ├── image-20191026232936265.png
    └── image-20191026234438718.png
├── README.md
└── md
    ├── 05-第五周-多进程爬虫.md
    ├── 02-第二周-爬虫原理和网页构造.md
    ├── 09-第九周-表单与模拟登录.md
    ├── 07-第七周-MongoDB简介.md
    ├── 01-第一周-Python基础.md
    ├── 11-第十一周-Selenium简介.md
    ├── 04-第四周-正则表达式.md
    ├── 12-第十二周-通过OCR识别验证码.md
    ├── 08-第八周-Scrapy爬虫开发.md
    ├── 06-第六周-Lxml库与Xpath语法.md
    ├── 10-第十周-ORM框架.md
    ├── 03-第三周-我们的第一个爬虫.md
    └── 13-第十三周-如何利用爬取的数据.md


/.gitignore:
--------------------------------------------------------------------------------
1 | .ipynb_checkpoints/
2 | *pyc
3 | 


--------------------------------------------------------------------------------
/img/pic2.jpg:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/ai-union/PythonSpider/HEAD/img/pic2.jpg


--------------------------------------------------------------------------------
/img/Scrapy.jpg:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/ai-union/PythonSpider/HEAD/img/Scrapy.jpg


--------------------------------------------------------------------------------
/img/公众号—AI派.jpg:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/ai-union/PythonSpider/HEAD/img/公众号—AI派.jpg


--------------------------------------------------------------------------------
/img/1564807382096.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/ai-union/PythonSpider/HEAD/img/1564807382096.png


--------------------------------------------------------------------------------
/img/1564807688948.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/ai-union/PythonSpider/HEAD/img/1564807688948.png


--------------------------------------------------------------------------------
/img/1564808967916.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/ai-union/PythonSpider/HEAD/img/1564808967916.png


--------------------------------------------------------------------------------
/img/1564809335298.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/ai-union/PythonSpider/HEAD/img/1564809335298.png


--------------------------------------------------------------------------------
/img/1564809595601.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/ai-union/PythonSpider/HEAD/img/1564809595601.png


--------------------------------------------------------------------------------
/img/1564810175374.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/ai-union/PythonSpider/HEAD/img/1564810175374.png


--------------------------------------------------------------------------------
/img/1564810438026.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/ai-union/PythonSpider/HEAD/img/1564810438026.png


--------------------------------------------------------------------------------
/img/1564812010365.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/ai-union/PythonSpider/HEAD/img/1564812010365.png


--------------------------------------------------------------------------------
/img/1564812252811.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/ai-union/PythonSpider/HEAD/img/1564812252811.png


--------------------------------------------------------------------------------
/img/1564812421392.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/ai-union/PythonSpider/HEAD/img/1564812421392.png


--------------------------------------------------------------------------------
/img/1564812728360.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/ai-union/PythonSpider/HEAD/img/1564812728360.png


--------------------------------------------------------------------------------
/img/1564812936144.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/ai-union/PythonSpider/HEAD/img/1564812936144.png


--------------------------------------------------------------------------------
/img/1564813159414.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/ai-union/PythonSpider/HEAD/img/1564813159414.png


--------------------------------------------------------------------------------
/img/1564813334042.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/ai-union/PythonSpider/HEAD/img/1564813334042.png


--------------------------------------------------------------------------------
/img/1565418696252.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/ai-union/PythonSpider/HEAD/img/1565418696252.png


--------------------------------------------------------------------------------
/img/1566045583808.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/ai-union/PythonSpider/HEAD/img/1566045583808.png


--------------------------------------------------------------------------------
/img/1566046413728.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/ai-union/PythonSpider/HEAD/img/1566046413728.png


--------------------------------------------------------------------------------
/img/1566049929368.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/ai-union/PythonSpider/HEAD/img/1566049929368.png


--------------------------------------------------------------------------------
/img/1566655220447.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/ai-union/PythonSpider/HEAD/img/1566655220447.png


--------------------------------------------------------------------------------
/img/1566656347663.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/ai-union/PythonSpider/HEAD/img/1566656347663.png


--------------------------------------------------------------------------------
/img/1566656597002.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/ai-union/PythonSpider/HEAD/img/1566656597002.png


--------------------------------------------------------------------------------
/img/1566656895958.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/ai-union/PythonSpider/HEAD/img/1566656895958.png


--------------------------------------------------------------------------------
/img/1566657375088.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/ai-union/PythonSpider/HEAD/img/1566657375088.png


--------------------------------------------------------------------------------
/img/1567265431802.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/ai-union/PythonSpider/HEAD/img/1567265431802.png


--------------------------------------------------------------------------------
/img/1567856337964.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/ai-union/PythonSpider/HEAD/img/1567856337964.png


--------------------------------------------------------------------------------
/img/1567857170021.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/ai-union/PythonSpider/HEAD/img/1567857170021.png


--------------------------------------------------------------------------------
/img/1567857830629.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/ai-union/PythonSpider/HEAD/img/1567857830629.png


--------------------------------------------------------------------------------
/img/1567859645775.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/ai-union/PythonSpider/HEAD/img/1567859645775.png


--------------------------------------------------------------------------------
/img/1567860876635.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/ai-union/PythonSpider/HEAD/img/1567860876635.png


--------------------------------------------------------------------------------
/img/1567861127369.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/ai-union/PythonSpider/HEAD/img/1567861127369.png


--------------------------------------------------------------------------------
/img/1568037229251.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/ai-union/PythonSpider/HEAD/img/1568037229251.png


--------------------------------------------------------------------------------
/img/1568037265483.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/ai-union/PythonSpider/HEAD/img/1568037265483.png


--------------------------------------------------------------------------------
/img/1568039617382.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/ai-union/PythonSpider/HEAD/img/1568039617382.png


--------------------------------------------------------------------------------
/img/1568437445627.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/ai-union/PythonSpider/HEAD/img/1568437445627.png


--------------------------------------------------------------------------------
/img/1568438083620.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/ai-union/PythonSpider/HEAD/img/1568438083620.png


--------------------------------------------------------------------------------
/img/1568438111443.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/ai-union/PythonSpider/HEAD/img/1568438111443.png


--------------------------------------------------------------------------------
/img/1568438387703.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/ai-union/PythonSpider/HEAD/img/1568438387703.png


--------------------------------------------------------------------------------
/img/1568438437191.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/ai-union/PythonSpider/HEAD/img/1568438437191.png


--------------------------------------------------------------------------------
/img/1568438455277.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/ai-union/PythonSpider/HEAD/img/1568438455277.png


--------------------------------------------------------------------------------
/img/1568442757901.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/ai-union/PythonSpider/HEAD/img/1568442757901.png


--------------------------------------------------------------------------------
/img/1568442863312.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/ai-union/PythonSpider/HEAD/img/1568442863312.png


--------------------------------------------------------------------------------
/img/1568442901842.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/ai-union/PythonSpider/HEAD/img/1568442901842.png


--------------------------------------------------------------------------------
/img/1568443442203.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/ai-union/PythonSpider/HEAD/img/1568443442203.png


--------------------------------------------------------------------------------
/img/1568447114208.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/ai-union/PythonSpider/HEAD/img/1568447114208.png


--------------------------------------------------------------------------------
/img/1568447140842.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/ai-union/PythonSpider/HEAD/img/1568447140842.png


--------------------------------------------------------------------------------
/img/1568447241296.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/ai-union/PythonSpider/HEAD/img/1568447241296.png


--------------------------------------------------------------------------------
/img/1568447268341.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/ai-union/PythonSpider/HEAD/img/1568447268341.png


--------------------------------------------------------------------------------
/img/1569076766706.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/ai-union/PythonSpider/HEAD/img/1569076766706.png


--------------------------------------------------------------------------------
/img/1569077409152.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/ai-union/PythonSpider/HEAD/img/1569077409152.png


--------------------------------------------------------------------------------
/img/1569077656059.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/ai-union/PythonSpider/HEAD/img/1569077656059.png


--------------------------------------------------------------------------------
/img/1569077783474.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/ai-union/PythonSpider/HEAD/img/1569077783474.png


--------------------------------------------------------------------------------
/img/1569084529057.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/ai-union/PythonSpider/HEAD/img/1569084529057.png


--------------------------------------------------------------------------------
/img/1569653221455.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/ai-union/PythonSpider/HEAD/img/1569653221455.png


--------------------------------------------------------------------------------
/img/1569658444818.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/ai-union/PythonSpider/HEAD/img/1569658444818.png


--------------------------------------------------------------------------------
/img/1569658481643.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/ai-union/PythonSpider/HEAD/img/1569658481643.png


--------------------------------------------------------------------------------
/img/1569681231891.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/ai-union/PythonSpider/HEAD/img/1569681231891.png


--------------------------------------------------------------------------------
/img/1569681364026.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/ai-union/PythonSpider/HEAD/img/1569681364026.png


--------------------------------------------------------------------------------
/img/1569682015625.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/ai-union/PythonSpider/HEAD/img/1569682015625.png


--------------------------------------------------------------------------------
/img/1569682063882.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/ai-union/PythonSpider/HEAD/img/1569682063882.png


--------------------------------------------------------------------------------
/img/1569683001422.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/ai-union/PythonSpider/HEAD/img/1569683001422.png


--------------------------------------------------------------------------------
/img/1569683170817.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/ai-union/PythonSpider/HEAD/img/1569683170817.png


--------------------------------------------------------------------------------
/img/1569684196456.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/ai-union/PythonSpider/HEAD/img/1569684196456.png


--------------------------------------------------------------------------------
/img/1569684397046.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/ai-union/PythonSpider/HEAD/img/1569684397046.png


--------------------------------------------------------------------------------
/img/1569684428783.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/ai-union/PythonSpider/HEAD/img/1569684428783.png


--------------------------------------------------------------------------------
/img/1569684489904.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/ai-union/PythonSpider/HEAD/img/1569684489904.png


--------------------------------------------------------------------------------
/img/1569684645930.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/ai-union/PythonSpider/HEAD/img/1569684645930.png


--------------------------------------------------------------------------------
/img/1570202323349.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/ai-union/PythonSpider/HEAD/img/1570202323349.png


--------------------------------------------------------------------------------
/img/1570202356214.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/ai-union/PythonSpider/HEAD/img/1570202356214.png


--------------------------------------------------------------------------------
/img/1570202509673.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/ai-union/PythonSpider/HEAD/img/1570202509673.png


--------------------------------------------------------------------------------
/img/1570886392682.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/ai-union/PythonSpider/HEAD/img/1570886392682.png


--------------------------------------------------------------------------------
/img/1570886711412.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/ai-union/PythonSpider/HEAD/img/1570886711412.png


--------------------------------------------------------------------------------
/img/1570886893828.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/ai-union/PythonSpider/HEAD/img/1570886893828.png


--------------------------------------------------------------------------------
/img/1570886939630.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/ai-union/PythonSpider/HEAD/img/1570886939630.png


--------------------------------------------------------------------------------
/img/1570886966890.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/ai-union/PythonSpider/HEAD/img/1570886966890.png


--------------------------------------------------------------------------------
/img/1570887591900.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/ai-union/PythonSpider/HEAD/img/1570887591900.png


--------------------------------------------------------------------------------
/img/1570893855785.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/ai-union/PythonSpider/HEAD/img/1570893855785.png


--------------------------------------------------------------------------------
/img/1571484984537.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/ai-union/PythonSpider/HEAD/img/1571484984537.png


--------------------------------------------------------------------------------
/img/1571485766709.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/ai-union/PythonSpider/HEAD/img/1571485766709.png


--------------------------------------------------------------------------------
/img/1571487145703.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/ai-union/PythonSpider/HEAD/img/1571487145703.png


--------------------------------------------------------------------------------
/img/1571489386804.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/ai-union/PythonSpider/HEAD/img/1571489386804.png


--------------------------------------------------------------------------------
/img/1571489620723.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/ai-union/PythonSpider/HEAD/img/1571489620723.png


--------------------------------------------------------------------------------
/img/1571489715078.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/ai-union/PythonSpider/HEAD/img/1571489715078.png


--------------------------------------------------------------------------------
/img/1571489797290.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/ai-union/PythonSpider/HEAD/img/1571489797290.png


--------------------------------------------------------------------------------
/img/1571489930881.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/ai-union/PythonSpider/HEAD/img/1571489930881.png


--------------------------------------------------------------------------------
/img/1571490007402.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/ai-union/PythonSpider/HEAD/img/1571490007402.png


--------------------------------------------------------------------------------
/img/image-20191026224441467.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/ai-union/PythonSpider/HEAD/img/image-20191026224441467.png


--------------------------------------------------------------------------------
/img/image-20191026224553401.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/ai-union/PythonSpider/HEAD/img/image-20191026224553401.png


--------------------------------------------------------------------------------
/img/image-20191026224709438.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/ai-union/PythonSpider/HEAD/img/image-20191026224709438.png


--------------------------------------------------------------------------------
/img/image-20191026225204779.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/ai-union/PythonSpider/HEAD/img/image-20191026225204779.png


--------------------------------------------------------------------------------
/img/image-20191026225551121.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/ai-union/PythonSpider/HEAD/img/image-20191026225551121.png


--------------------------------------------------------------------------------
/img/image-20191026225959992.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/ai-union/PythonSpider/HEAD/img/image-20191026225959992.png


--------------------------------------------------------------------------------
/img/image-20191026230235017.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/ai-union/PythonSpider/HEAD/img/image-20191026230235017.png


--------------------------------------------------------------------------------
/img/image-20191026231439674.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/ai-union/PythonSpider/HEAD/img/image-20191026231439674.png


--------------------------------------------------------------------------------
/img/image-20191026231525926.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/ai-union/PythonSpider/HEAD/img/image-20191026231525926.png


--------------------------------------------------------------------------------
/img/image-20191026231748209.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/ai-union/PythonSpider/HEAD/img/image-20191026231748209.png


--------------------------------------------------------------------------------
/img/image-20191026231937842.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/ai-union/PythonSpider/HEAD/img/image-20191026231937842.png


--------------------------------------------------------------------------------
/img/image-20191026232056812.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/ai-union/PythonSpider/HEAD/img/image-20191026232056812.png


--------------------------------------------------------------------------------
/img/image-20191026232214264.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/ai-union/PythonSpider/HEAD/img/image-20191026232214264.png


--------------------------------------------------------------------------------
/img/image-20191026232312105.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/ai-union/PythonSpider/HEAD/img/image-20191026232312105.png


--------------------------------------------------------------------------------
/img/image-20191026232459509.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/ai-union/PythonSpider/HEAD/img/image-20191026232459509.png


--------------------------------------------------------------------------------
/img/image-20191026232535573.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/ai-union/PythonSpider/HEAD/img/image-20191026232535573.png


--------------------------------------------------------------------------------
/img/image-20191026232551493.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/ai-union/PythonSpider/HEAD/img/image-20191026232551493.png


--------------------------------------------------------------------------------
/img/image-20191026232936265.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/ai-union/PythonSpider/HEAD/img/image-20191026232936265.png


--------------------------------------------------------------------------------
/img/image-20191026234438718.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/ai-union/PythonSpider/HEAD/img/image-20191026234438718.png


--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
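 1 | This is a series of tutorials on writing web crawlers in Python.
 2 | 
 3 | # Contents
 4 | 
 5 | - [x] [Week 1: Python Basics](md/01-第一周-Python基础.md)
 6 | - [x] [Week 2: How Crawlers Work and How Web Pages Are Structured](md/02-第二周-爬虫原理和网页构造.md)
 7 | - [x] [Week 3: Our First Crawler](md/03-第三周-我们的第一个爬虫.md)
 8 | - [x] [Week 4: Regular Expressions](md/04-第四周-正则表达式.md)
 9 | - [x] [Week 5: Multiprocess Crawlers](md/05-第五周-多进程爬虫.md)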
10 | - [x] [Week 6: The Lxml Library and XPath Syntax](md/06-第六周-Lxml库与Xpath语法.md)
11 | - [x] [Week 7: Introduction to MongoDB](md/07-第七周-MongoDB简介.md)
12 | - [x] [Week 8: Crawler Development with Scrapy](md/08-第八周-Scrapy爬虫开发.md)
13 | - [x] [Week 9: Forms and Simulated Login](md/09-第九周-表单与模拟登录.md)
14 | - [x] [Week 10: ORM Frameworks](md/10-第十周-ORM框架.md)
15 | - [x] [Week 11: Introduction to Selenium](md/11-第十一周-Selenium简介.md)
16 | - [x] [Week 12: Recognizing CAPTCHAs with OCR](md/12-第十二周-通过OCR识别验证码.md)
17 | - [x] [Week 13: How to Use the Crawled Data](md/13-第十三周-如何利用爬取的数据.md)
18 | 
19 | # More
20 | 
21 | Stars and forks are welcome. To learn more about Python and machine learning, follow the WeChat official account **AI派**.
22 | 
23 | ![](img/公众号—AI派.jpg)
24 | 


--------------------------------------------------------------------------------
/md/05-第五周-多进程爬虫.md:
--------------------------------------------------------------------------------
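 1 | # Column Week 5: Multiprocess Crawlers
 2 | 
 3 | So far we have covered the basic crawling techniques and written a few crawlers. As the amount of data grows, however, those crawlers become too slow: each one finishes fetching one item before it starts on the next, a pattern usually called a single-threaded, or serial, crawler. How can we speed it up? In this chapter you will learn:
 4 | 
 5 | - Multithreading: the basic concept
 6 | - Multiprocessing: the basic concept
 7 | - Performance comparison: a crawler example that compares run times
 8 | - How to use multiple processes
 9 | 
10 | ## Multithreading and Multiprocessing
11 | 
12 | 1. Overview of threads and processes
13 | 
14 |    When a computer runs a program, it creates a process that holds the program's code and state. Processes are executed by the computer's one or more CPUs. At any given moment each CPU runs only one process, but it switches between processes so quickly that many programs appear to run at the same time.
15 | 
16 |    Within a process, execution likewise switches between threads, each of which runs a different part of the program.
17 | 
18 |    As an analogy, picture a large factory that builds computers. The factory has many workshops, each producing a different component, and in each workshop many workers cooperate and share resources to make that component. The factory is the crawler project, each workshop is a process, and each worker is a thread.
19 | 
20 | 2. Using multiple processes
21 | 
22 |    Here we use Python's multiprocessing library. The basic usage is: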
23 | 
24 |    ```python
25 |    from multiprocessing import Pool
26 |    pool = Pool(processes = 4) # 4 is the number of worker processes
27 |    pool.map(func, iterable) # func is the function to run; iterable supplies one argument per call
28 |    ```
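The chapter lists multithreading among its topics but only shows code for processes. As a rough counterpart, the sketch below applies the same map pattern with threads from the standard library's `concurrent.futures`. `fake_fetch`, `fetch_all`, and the example.com URLs are made up for illustration; `fake_fetch` only sleeps, so the sketch runs without network access.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def fake_fetch(url):
    """Stand-in for a network request: wait briefly, then return the URL's length."""
    time.sleep(0.1)  # simulates waiting on the network
    return len(url)


def fetch_all(urls, workers):
    """Run fake_fetch over urls with `workers` threads; results keep the input order."""
    with ThreadPoolExecutor(max_workers=workers) as executor:
        return list(executor.map(fake_fetch, urls))


if __name__ == "__main__":
    urls = ["https://example.com/page/{}/".format(i) for i in range(1, 9)]
    for workers in (1, 4):
        start = time.time()
        fetch_all(urls, workers)
        print(workers, "thread(s):", round(time.time() - start, 2), "seconds")
```

Threads tend to help when the work is mostly waiting on the network, which is the usual case for a crawler; a process pool is the safer choice when parsing itself is CPU-heavy.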
29 | 
30 | 3. Performance comparison
31 | 
32 |    Using the user names on Qiushibaike (糗事百科) as the example, we compare the run time with one, two, and four processes:
33 | 
34 | ```python
35 | import requests
36 | import re
37 | import time
38 | from multiprocessing import Pool
39 | 
40 | 
41 | headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) '
42 |      'AppleWebKit/537.36 (KHTML, like Gecko) Chrome/76.0.3809.132 Safari/537.36'}
43 | 
44 | def spyder(url):
45 |     '''
46 |     Fetch one page and parse it.
47 |     '''
48 |     res = requests.get(url, headers = headers)
49 |     ids = re.findall('<h2>(.*?)</h2>', res.text, re.S) # extract the user names
50 |     time.sleep(1)
51 | 
52 | if __name__ == "__main__":
53 |     urls = ["https://www.qiushibaike.com/text/page/{}/".format(str(i)) for i in range(1, 10)]
54 |     start_1 = time.time()
55 |     for url in urls:
56 |         spyder(url)
57 |     end_1 = time.time()
58 |     print("1 process:", end_1 - start_1)
59 | 
60 |     start_2 = time.time()
61 |     pool = Pool(processes = 2)
62 |     pool.map(spyder, urls)
63 |     end_2 = time.time()
64 |     print("2 processes:", end_2 - start_2)
65 | 
66 |     start_3 = time.time()
67 |     pool = Pool(processes = 4)
68 |     pool.map(spyder, urls)
69 |     end_3 = time.time()
70 |     print("4 processes:", end_3 - start_3)
71 | ```
72 | 
73 | Output:
74 | 
75 | ![1567265431802](https://github.com/ai-union/PythonSpyder/blob/master/img/1567265431802.png?raw=true)
76 | 
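77 | This example collects only the user names. Try scraping the other fields as well and see how much time multiple processes save. That is all for this week. The material is short but important, so practice it well.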


--------------------------------------------------------------------------------
/md/02-第二周-爬虫原理和网页构造.md:
--------------------------------------------------------------------------------
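  1 | # How Crawlers Work and How Web Pages Are Structured
  2 | 
  3 | ## How Crawlers Work
  4 | 
  5 | 1. Network connections
  6 | 
  7 |    A network connection works much like buying a ticket at a train station: the traveler picks a destination, inserts coins or banknotes or swipes a card, and the ticket machine hands over a ticket printed with the train details.
  8 | 
  9 |    The computer (the traveler) sends a request (buying a ticket) to the server (the ticket machine), carrying **request headers** and a **message body** (the destination, train number, and so on). The server then returns an HTML file as the Response (the ticket).
 10 | 
 11 |    > This describes a GET request. POST requests are also common.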
 12 | 
 13 |    
 14 | 
 15 | ![u=2670977504,417194345&fm=15&gp=0 (2)](https://github.com/ai-union/PythonSpyder/blob/master/img/pic2.jpg?raw=true)
 16 | 
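 17 | 2. How a crawler works
 18 | 
 19 |    Once the basics of a network connection are clear, a crawler is easy to understand. A connection consists of a Request from the computer and a Response from the server, so a crawler has two jobs:
 20 | 
 21 |    - Send a Request to the server the way a browser would.
 22 |    - Receive the server's Response, parse it, and extract the data we want.
 23 | 
 24 |    The information we want is usually spread over more than one page, so the crawler needs an execution flow. Two flows are common:
 25 | 
 26 |    - Multi-page crawler flow
 27 | 
 28 |      Sites of this kind have many pages that all share a similar structure, so we can proceed as follows:
 29 | 
 30 |      - Page through the site by hand, observe how the URLs are built, and construct a list of all page URLs.
 31 |      - Loop over the URL list, taking one URL at a time.
 32 |      - Call the crawl function on each URL and store the data.
 33 |      - When the loop ends, the crawler is finished.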
 34 | 
 35 |      ```mermaid
 36 |      graph TD
 37 |      A(Start) --> B[Build URL list]
 38 |      B --> C{Continue looping?}
 39 |      C -->|yes|D[Crawl data]
 40 |      D-->E[Store data]-->C
 41 |      C-->|no|F[Stop crawling]
 42 |      F-->G(End)
 43 |      ```
 44 | 
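 45 |    - Cross-page crawler flow
 46 | 
 47 |      - Define a function that crawls the list (index) page and collects the URLs of all topics.
 48 |      - Store the topic URLs in a list (the seed URLs).
 49 |      - Define a function that crawls the data on a detail page.
 50 |      - Visit each topic's detail page and crawl its data.
 51 |      - Store the data. When the loop ends, the crawler is finished.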
 52 | 
 53 |      ```mermaid
 54 |      graph TD
 55 |      A(Start)-->B[Crawl list-page URLs]
 56 |      B-->C[Store URLs in a list]
 57 |      C-->D{Continue looping?}
 58 |      D-->|yes|E[Crawl detail-page data]
 59 |      E-->F[Store data]-->D
 60 |      D-->|no|G[Stop crawling]
 61 |      G-->H(End)
 62 |      ```
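The two flows above can be sketched in a few lines. This is only an illustration under stated assumptions: `build_page_urls`, `crawl_pages`, and `crawl_across_pages` are hypothetical helper names, the example.com URLs are made up, and the fetch and parse steps are passed in as functions so the sketch runs against an in-memory "site" instead of the network.

```python
def build_page_urls(template, first_page, last_page):
    """Multi-page flow, step 1: build every page URL from the observed pattern."""
    return [template.format(page) for page in range(first_page, last_page + 1)]


def crawl_pages(urls, fetch, parse):
    """Loop over the URLs, crawl each one, and collect the parsed records."""
    records = []
    for url in urls:
        html = fetch(url)            # Request -> Response
        records.extend(parse(html))  # extract the data and keep it
    return records


def crawl_across_pages(list_url, fetch, parse_links, parse_detail):
    """Cross-page flow: collect seed URLs from the list page, then visit each one."""
    seed_urls = parse_links(fetch(list_url))
    return [parse_detail(fetch(url)) for url in seed_urls]


if __name__ == "__main__":
    # A tiny in-memory "website" so the sketch runs without network access.
    site = {
        "https://example.com/list/1/": "a,b",
        "https://example.com/list/2/": "c",
    }
    urls = build_page_urls("https://example.com/list/{}/", 1, 2)
    print(crawl_pages(urls, site.get, lambda html: html.split(",")))
```

In a real crawler, `fetch` would wrap an HTTP request and `parse` would extract fields from the returned HTML.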
 63 | 
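 64 | ##  Page Structure
 65 | 
 66 | 1. Google Chrome is the recommended browser.
 67 | 
 68 | 2. Page structure
 69 | 
 70 |    1. Most of the web pages we talk about are written in HTML, which expresses ownership through nesting levels.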
 71 | 
 72 |       ```html
 73 |       <html>
 74 |           <head>
 75 |               This part defines the page's styles and JavaScript.
 76 |               
 77 |           </head>
 78 |           <body>
 79 |               <div>
 80 |                   <table>
 81 |                       <tr>
 82 |                       	<td>
 83 |                           </td>
 84 |                       </tr>
 85 |                        <tr>
 86 |                       	<td>
 87 |                           </td>
 88 |                       </tr>
 89 |                   </table>
 90 |               </div>
 91 |           </body>
 92 |       </html>
 93 |       ```
 94 | 
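 95 |       > The code above is HTML.
 96 |       >
 97 |       > Code such as `<div>` is called a **tag**; this one is a **div tag**.
 98 |       >
 99 |       > - The div tag is the **parent node** of the table tag.
100 |       > - Each tr tag is a **child node** of the table tag.
101 |       > - Each tr tag is a **grandchild node** of the div tag.
102 |       > - The two tr tags are **sibling nodes** of each other.
103 | 
104 |       For reasons of space we will not go deeper into front-end topics here. We will share the relevant points gradually in real projects, so stay tuned to this column.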
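To make the parent, child, and sibling relationships concrete, here is a small sketch that parses a compact copy of the HTML above and walks the tree. It uses the standard library's `xml.etree.ElementTree`, which is an assumption made for convenience: it only accepts well-formed markup, and real pages usually need a more forgiving HTML parser.

```python
import xml.etree.ElementTree as ET

# The same nesting as the HTML above, reduced to the tags that matter.
page = "<html><head/><body><div><table><tr><td/></tr><tr><td/></tr></table></div></body></html>"
root = ET.fromstring(page)

div = root.find("body/div")
table = div.find("table")

print([child.tag for child in div])    # children of div
print([child.tag for child in table])  # children of table, siblings of each other
print(len(div.findall("./table/tr")))  # tr tags that are grandchildren of div
```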
105 | 
106 |    2. Inspecting a page
107 | 
108 |       Open the site you want to crawl and press F12 to bring up a view like this:
109 | 
110 |       ![1565418696252](https://github.com/ai-union/PythonSpyder/blob/master/img/1565418696252.png?raw=true)
111 | 
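112 |       Here we can easily inspect how the target site's pages are built.
113 |
114 |       That is all for this issue. If you know of an interesting site, join the group and we can study it together.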
115 | 
116 | 


--------------------------------------------------------------------------------
/md/09-第九周-表单与模拟登录.md:
--------------------------------------------------------------------------------
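  1 | # Column Week 9: Forms and Simulated Login
  2 | 
  3 | Many of the websites we use every day require registration and login, and much of their content is closed to visitors who are not logged in. Handling account login is therefore an important part of writing a crawler. This week we look at the techniques commonly needed for logging in.
  4 | 
  5 | In this article you will learn about:
  6 | 
  7 | > - Forms and POST requests
  8 | > - Cookies and how to use them in Python
  9 | 
 10 | ## Forms and POST
 11 | 
 12 | Our earlier crawlers used almost nothing but HTTP GET, that is, the program only "read" data from web pages. Real browsing also involves a great many HTTP POST operations.
 13 | 
 14 | The form we refer to here is the form element in an HTML page.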
 15 | 
 16 | ![1569653221455](https://github.com/ai-union/PythonSpyder/blob/master/img/1569653221455.png?raw=true)
 17 | 
 18 | 1. Sending form data
 19 | 
 20 |    In Python, the post() method of the requests library handles simple POST operations. The code below shows the most basic usage:
 21 | 
 22 |    ```python
 23 |    import requests
 24 |    form_data = {'username':'root', 'password':'pwd123'}
 25 |    req = requests.post('http://xxxxx.com', data = form_data)
 26 |    ```
 27 | 
 28 |    > The site and the user details here are placeholders.
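To see what a form POST actually carries, the sketch below builds the request without sending it. It uses the standard library's `urllib` rather than requests so that it runs offline; the login URL is hypothetical and the credentials are the placeholders from the snippet above.

```python
from urllib.parse import urlencode
from urllib.request import Request

form_data = {'username': 'root', 'password': 'pwd123'}  # placeholder credentials

# A form POST carries its fields URL-encoded in the request body.
body = urlencode(form_data).encode('utf-8')
req = Request('http://example.com/login', data=body, method='POST')  # hypothetical URL

print(req.get_method())  # the HTTP verb that would be used
print(req.data)          # the encoded form body
```

Passing `data=form_data` to `requests.post()` performs the same encoding for you.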
 29 | 
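 30 |    Now let us look at the login request of a real site.
 31 |
 32 |    - First, find the URL that the login request is sent to.
 33 |
 34 |      ![1569681231891](https://github.com/ai-union/PythonSpyder/blob/master/img/1569681231891.png?raw=true)
 35 |
 36 |      The address is visible when the form is submitted.
 37 |
 38 |    - Next, look at what the form submits.
 39 |
 40 |      ![1569681364026](https://github.com/ai-union/PythonSpyder/blob/master/img/1569681364026.png?raw=true)
 41 |
 42 |      Most sites, however, encrypt or otherwise wrap their login forms, so simply imitating the form contents does not log us in. In that case we can simulate a login by sending cookie information instead. Next we look at how to log in with cookies in Python.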
 43 |    
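 44 | ## Cookies: Overview and Use
 45 |
 46 |    1. A cookie is data that a website stores on the user's machine to identify the user and track the session. E-commerce sites commonly track users' cookies to recommend products. Because the cookie also carries the user's information, we can simulate a login by sending it.
 47 |
 48 |    2. How do we obtain the cookie?
 49 |
 50 |       It is also available from the browser's developer tools (F12), as shown below:
 51 |
 52 |       ![1569682063882](https://github.com/ai-union/PythonSpyder/blob/master/img/1569682063882.png?raw=true)
 53 |
 54 |    3. Using it in Python
 55 |
 56 |       - Response without the cookie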
 57 |    
 58 |         ```python
 59 |         import requests
 60 |         header = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) '
 61 |              'AppleWebKit/537.36 (KHTML, like Gecko) Chrome/77.0.3865.90 Safari/537.36'}
 62 |         req = requests.get("https://accounts.douban.com/passport/setting", headers=header)
 63 |         req.encoding = "UTF-8"
 64 |         print(req.text)
 65 |         ```
 66 |    
 67 |         ![1569683001422](https://github.com/ai-union/PythonSpyder/blob/master/img/1569683001422.png?raw=true)
 68 |    
 69 |         We requested the personal settings page, but the returned content is evidently the login page.
 70 |
 71 |       - Response with the cookie
 72 |    
 73 |         ```python
 74 |         import requests
 75 |         
 76 |         header = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) '
 77 |              'AppleWebKit/537.36 (KHTML, like Gecko) Chrome/77.0.3865.90 Safari/537.36',
 78 |              'Cookie': 'paste your own cookie string here'}
 79 |         req = requests.get("https://accounts.douban.com/passport/setting", headers=header)
 80 |         req.encoding = "UTF-8"
 81 |         print(req.text)
 82 |         ```
 83 |    
 84 |         ![1569683170817](https://github.com/ai-union/PythonSpyder/blob/master/img/1569683170817.png?raw=true)
 85 |    
 86 |         The returned content is now our settings page, even though the code never performs a login. This shows that a cookie can be used to simulate a login.
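The example above pastes the whole cookie string into the request headers. As an alternative sketch, the cookie string copied from the browser can be split into a dictionary with the standard library's `http.cookies.SimpleCookie`. `cookie_string_to_dict` is a hypothetical helper name and the cookie value is made up.

```python
from http.cookies import SimpleCookie


def cookie_string_to_dict(raw_cookie):
    """Split a Cookie header value copied from the browser into a name -> value dict."""
    jar = SimpleCookie()
    jar.load(raw_cookie)
    return {name: morsel.value for name, morsel in jar.items()}


raw = 'sessionid=abc123; theme=dark'  # made-up cookie, not a real session
cookies = cookie_string_to_dict(raw)
print(cookies)
```

The resulting dictionary can be passed to requests through its `cookies=` parameter, for example `requests.get(url, headers=header, cookies=cookies)`. `SimpleCookie` is strict about unusual characters, so for awkward cookie strings the header approach shown above may be more robust.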
 87 |    
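 88 |       ## Hands-on Example
 89 |
 90 |       As an exercise, we use Lagou (拉勾网) to tie together what we have covered so far.
 91 |
 92 |       - Target site: https://www.lagou.com/
 93 |       - Goal: crawl the jobs related to 爬虫 (crawlers) and save them to MongoDB
 94 |
 95 |       1. Analyze the page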
 96 |    
 97 |          ![1569684196456](https://github.com/ai-union/PythonSpyder/blob/master/img/1569684196456.png?raw=true)
 98 |    
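 99 |          The page source contains no job listings, so we can conclude that the site loads its data with AJAX.
100 |
101 |          > Note: do not use F12 for this check. Right-click the page and choose "View page source".
102 |
103 |       2. Work out the URL pattern, the request headers, and the cookie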
104 |    
105 |          ![1569684397046](https://github.com/ai-union/PythonSpyder/blob/master/img/1569684397046.png?raw=true)
106 |    
107 |          ![1569684428783](https://github.com/ai-union/PythonSpyder/blob/master/img/1569684428783.png?raw=true)
108 |    
109 |          ![1569684489904](https://github.com/ai-union/PythonSpyder/blob/master/img/1569684489904.png?raw=true)
110 |    
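111 |          Paging through the results shows that the page is loaded with a POST request in which **pn** is the page number. The response also tells us that there are 15 pages in total.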
112 |    
113 |          ![1569684645930](https://github.com/ai-union/PythonSpyder/blob/master/img/1569684645930.png?raw=true)
114 |    
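115 |       3. Implementation
116 |
117 |          This site's anti-crawling measures are rather interesting. Try it yourself, and if you get stuck, feel free to join the group to discuss.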
118 |    
119 |          
120 |    
121 |       
122 | 
123 | 


--------------------------------------------------------------------------------
/md/07-第七周-MongoDB简介.md:
--------------------------------------------------------------------------------
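  1 | # Column Week 7: Introduction to MongoDB
  2 | 
  3 | MongoDB is a database built on distributed file storage and written in C++. It aims to give web applications a scalable, high-performance data store, and that scalability makes it a good fit for storing crawler data.
  4 | 
  5 | > In this article you will learn:
  6 | >
  7 | > - How to install MongoDB
  8 | > - Basic use of MongoDB from Python
  9 | > - How to use MongoDB in a crawler
 10 | 
 11 | ## Downloading and Installing MongoDB
 12 | 
 13 | 1. Download page: https://www.mongodb.com/download-center/community
 14 | 
 15 |    On the official site, choose Server and then pick the version that matches your operating system.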
 16 | 
 17 |    ![1568437445627](https://github.com/ai-union/PythonSpyder/blob/master/img/1568437445627.png?raw=true)
 18 | 
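 19 |    > **Note: MongoDB has not supported Windows XP since version 2.2, and the latest versions no longer ship 32-bit installers.**
 20 | 
 21 | 2. Once the file has downloaded, double-click it to run the installer.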
 22 | 
 23 |    ![1568438083620](https://github.com/ai-union/PythonSpyder/blob/master/img/1568438083620.png?raw=true)
 24 | 
 25 |    ![1568438111443](https://github.com/ai-union/PythonSpyder/blob/master/img/1568438111443.png?raw=true)
 26 | 
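 27 |    > For the install path, I recommend avoiding folders whose names contain Chinese characters or symbols.
 28 | 
 29 |    Then click Next. I left every other option at its default.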
 30 | 
 31 |    ![1568438387703](https://github.com/ai-union/PythonSpyder/blob/master/img/1568438387703.png?raw=true)
 32 | 
 33 |    ![1568438437191](https://github.com/ai-union/PythonSpyder/blob/master/img/1568438437191.png?raw=true)
 34 | 
 35 |    ![1568438455277](https://github.com/ai-union/PythonSpyder/blob/master/img/1568438455277.png?raw=true)
 36 | 
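 37 |    > Note: at the last step the installer asks whether to close application XXX. I chose No.
 38 |    >
 39 |    > ※ **The last step appears to hang**
 40 |    >
 41 |    > This happens when the checkbox in the lower-left corner is left ticked. MongoDB Compass is a graphical management tool; downloading it takes a very long time and may never finish. Untick the box and the installation completes right away.
 42 | 
 43 | 3. Configure and start
 44 | 
 45 |    Before starting MongoDB, create a folder to hold the data, for example D:\mongoDB\data
 46 | 
 47 |    Then change to the bin directory under the MongoDB install directory:
 48 | 
 49 |    ```shell
 50 |    cd /d D:\Program Files\MongoDB\Server\4.2\bin
 51 |    ```
 52 | 
 53 |    Then run the following command to start MongoDB:
 54 | 
 55 |    ```shell
 56 |    mongod --dbpath D:\mongoDB\data
 57 |    ```
 58 | 
 59 |    D:\mongoDB\data is the folder you just created. Make sure you have permission to write to it; otherwise the command fails.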
 60 | 
 61 |    ![1568442757901](https://github.com/ai-union/PythonSpyder/blob/master/img/1568442757901.png?raw=true)
 62 | 
 63 | 4. Open a second terminal window
 64 | 
 65 |    Again change to MongoDB's bin directory:
 66 | 
 67 |    ```shell
 68 |    cd /d D:\Program Files\MongoDB\Server\4.2\bin
 69 |    ```
 70 | 
 71 |    ![1568442863312](https://github.com/ai-union/PythonSpyder/blob/master/img/1568442863312.png?raw=true)
 72 | 
 73 |    Then type: mongo
 74 | 
 75 |    ![1568442901842](https://github.com/ai-union/PythonSpyder/blob/master/img/1568442901842.png?raw=true)
 76 | 
 77 |    A `>` prompt means the shell has started successfully.
 78 | 
 79 | ## Using MongoDB from Python
 80 | 
 81 | 1. Install the pymongo library
 82 | 
 83 |    ```shell
 84 |    pip install pymongo
 85 |    ```
 86 | 
 87 |    If the installation fails, try the following command:
 88 | 
 89 |    ```shell
 90 |    pip install pymongo -i http://pypi.douban.com/simple/ --trusted-host pypi.douban.com
 91 |    ```
 92 | 
 93 |    ![1568443442203](https://github.com/ai-union/PythonSpyder/blob/master/img/1568443442203.png?raw=true)
 94 | 
 95 |    Output like the screenshot above means the installation succeeded.
 96 | 
 97 | 2. Import pymongo
 98 | 
 99 |    ```python
100 |    import pymongo # MongoDB itself must be running before the connection below is used
101 |    ```
102 | 
103 | 3. Connect
104 | 
105 |    ```python
106 |    client = pymongo.MongoClient('localhost', 27017)
107 |    ```
108 | 
109 | 4. Create a database
110 | 
111 |    ```python
112 |    db = client['db_name']
113 |    ```
114 | 
115 | 5. Create a collection (MongoDB's counterpart of a table)
116 | 
117 |    ```python
118 |    table = db['table_name']
119 |    ```
120 | 
121 | 6. Insert data
122 | 
123 |    ```python
124 |    table.insert_one({'key1':value1, 'key2':value2})
125 |    ```
126 | 
127 | 7. Delete data
128 | 
129 |    ```python
130 |    table.delete_many({'key':value})
131 |    ```
132 | 
133 | 8. Update data
134 | 
135 |    ```python
136 |    table.update_one({'key':value},{"$set":{'key1':value1, 'key2':value2}})
137 |    ```
138 | 
139 | 9. Query data
140 | 
141 |    ```python
142 |    table.find({'key':value})
143 |    ```
144 | 
145 | ##  Using MongoDB in a Crawler
146 | 
147 | Here we crawl the popular-city codes from https://where.heweather.com/location.html
148 | 
149 | ![1568447114208](https://github.com/ai-union/PythonSpyder/blob/master/img/1568447114208.png?raw=true)
150 | 
151 | ![1568447140842](https://github.com/ai-union/PythonSpyder/blob/master/img/1568447140842.png?raw=true)
152 | 
153 | We save the crawled data to MongoDB.
154 | 
155 | The full code is as follows:
156 | 
157 | ```python
158 | import requests
159 | import pymongo # MongoDB driver
160 | 
161 | 
162 | header = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) '
163 |      'AppleWebKit/537.36 (KHTML, like Gecko) Chrome/77.0.3865.75 Safari/537.36'}
164 | 
165 | req = requests.get('https://search.heweather.net/top?group=cn&lang=zh&key=b44262251121469585bc2d212d33a3b3', headers=header)
166 | client = pymongo.MongoClient('localhost', 27017)
167 | db = client['cityCode'] # the cityCode database (created on first write)
168 | table = db['code_list'] # the code_list collection
169 | table.insert_one(req.json())
170 | ```
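The code above stores the whole JSON response as a single document. If you would rather query cities one at a time, one option is to insert one document per entry. The sketch below is only an illustration: `split_records` and `save_records` are hypothetical helpers, and the `cities` key and sample payload are invented because the real response layout is not shown in this article. `insert_many` is the pymongo method for inserting a list of documents.

```python
def split_records(payload, list_key):
    """Turn one JSON payload into a list of documents, one per entry under list_key."""
    return [dict(entry) for entry in payload.get(list_key, [])]


def save_records(collection, payload, list_key):
    """Insert each entry as its own document and return how many were written."""
    docs = split_records(payload, list_key)
    if docs:  # insert_many rejects an empty list
        collection.insert_many(docs)
    return len(docs)


if __name__ == "__main__":
    # Hypothetical payload; the real response layout is not shown in this article.
    sample = {"status": "ok", "cities": [{"name": "A", "code": "001"},
                                         {"name": "B", "code": "002"}]}
    print(split_records(sample, "cities"))
    # With a running server:
    #   import pymongo
    #   table = pymongo.MongoClient('localhost', 27017)['cityCode']['code_list']
    #   save_records(table, sample, "cities")
```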
171 | 
172 | Result:
173 | 
174 | ![1568447241296](https://github.com/ai-union/PythonSpyder/blob/master/img/1568447241296.png?raw=true)
175 | 
176 | The management tool used here is:
177 | 
178 | ![1568447268341](https://github.com/ai-union/PythonSpyder/blob/master/img/1568447268341.png?raw=true)
179 | 
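180 | Download: https://robomongo.org/download
181 | 
182 | That is all for this week. MongoDB can do far more than this brief introduction covers. To study it in depth, I suggest getting a book or joining the group for electronic materials.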
183 | 


--------------------------------------------------------------------------------
/md/01-第一周-Python基础.md:
--------------------------------------------------------------------------------
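  1 | # Python Basics
  2 | 
  3 | ## Installing Python
  4 | 
  5 | 1. Official site: https://www.python.org/
  6 | 
  7 | 2. Anaconda distribution: https://www.anaconda.com/
  8 | 
  9 |    > This column recommends this environment.
 10 | 
 11 |    **Things to note**
 12 | 
 13 |    - When downloading, choose the build that matches your operating system.
 14 | 
 15 |      ![1564807382096](https://github.com/ai-union/PythonSpyder/blob/master/img/1564807382096.png?raw=true)
 16 | 
 17 |    - Download version **3.7**. *It was the latest version when this article was published.*
 18 | 
 19 |    - The install path must not contain Chinese characters or special symbols.
 20 | 
 21 |    - During installation you will see **two checkboxes**. **Tick both.**
 22 | 
 23 | 3. After installation, open a command-line tool and type **python**. What do you see?
 24 | 
 25 |    ![1564807688948](https://github.com/ai-union/PythonSpyder/blob/master/img/1564807688948.png?raw=true)
 26 | 
 27 |
 28 | 
 29 |    > I did not use the CMD that ships with Windows here; I use a program called "cmder".
 30 | 
 31 |    If you see something like the screenshot above, congratulations: you have already learned 50% of Python. What remains is learning how to write code.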
 32 | 
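 33 | ## Python Fundamentals
 34 | 
 35 | 1. Python uses indentation (four spaces) to express the nesting of code. This matters a great deal, which is why it comes first.
 36 | 
 37 | 2. Basic data types
 38 | 
 39 |    - Number
 40 | 
 41 |      - int (signed integer), for example 10, -10, 0
 42 |      - float (floating point), what we usually call decimals, for example 0.5, 3.14
 43 |      - complex (complex number), made up of a real part and an imaginary part and written a + bj. Crawlers rarely need it, so it is enough to know that it exists.
 44 | 
 45 |    - String
 46 | 
 47 |      A string is the data type used for text, for example name = "路飞"
 48 | 
 49 |      - Extracting a character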
 50 | 
 51 |        ```python
 52 |        name = "路飞"
 53 |        name[0] # returns '路'
 54 |        ```
 55 | 
 56 |        ![1564808967916](https://github.com/ai-union/PythonSpyder/blob/master/img/1564808967916.png?raw=true)
 57 | 
 58 |      - Slicing
 59 | 
 60 |        ```python
 61 |        word = "人生苦短，我用Python"
 62 |        # extract 苦短
 63 |        word[2:4]
 64 |        ```
 65 | 
 66 |        ![1564809335298](https://github.com/ai-union/PythonSpyder/blob/master/img/1564809335298.png?raw=true)
 67 | 
 68 |        > Note that a slice includes the left index but excludes the right one.
 69 | 
 70 |        You can also count from the right:
 71 | 
 72 |        ```python
 73 |        # extract Python
 74 |        word[-6:]
 75 |        ```
 76 | 
 77 |        ![1564809595601](https://github.com/ai-union/PythonSpyder/blob/master/img/1564809595601.png?raw=true)
 78 | 
 79 |      - Concatenation: the plus sign joins several strings together
 80 | 
 81 |        ```python
 82 |        name = "路飞"
 83 |        word = "人生苦短，我用Python"
 84 |        name + word
 85 |        ```
 86 | 
 87 |        ![1564810175374](https://github.com/ai-union/PythonSpyder/blob/master/img/1564810175374.png?raw=true)
 88 | 
 89 |      - Formatting: when part of a string needs to vary, use the format() method.
 90 | 
 91 |        ```python
 92 |        name = "路飞"
 93 |        msg = "你好啊，{}".format(name)
 94 |        ```
 95 | 
 96 |        ![1564810438026](https://github.com/ai-union/PythonSpyder/blob/master/img/1564810438026.png?raw=true)
 97 | 
 98 |        > We will introduce more uses in real code later.
 99 | 
100 |    - List
101 | 
102 |      Python uses [ ] to denote a list.
103 | 
104 |      ```python
105 |      name_list = ['路飞', '索隆', '娜美']
106 |      ```
107 | 
108 |    - Dictionary
109 | 
110 |      A dictionary is a collection of data made up of keys and values.
111 | 
112 |      ```python
113 |      info = {'name':'路飞', 'age':18}
114 |      ```
115 | 
116 |      > Note that keys must be unique, while values may repeat.
117 | 
118 |    - Tuple
119 | 
120 |      A tuple is much like a list and is written with ( ), but it cannot be modified once it has been created.
121 | 
122 |      ```python
123 |      t = ('a', 'b')
124 |      ```
125 | 
126 | 3. Functions
127 | 
128 |    - Defining a function
129 | 
130 |      ```python
131 |      def getInfo():
132 |          print("这是一个函数")
133 |      ```
134 | 
135 |      > A function is created with def followed by the function name.
136 | 
137 |    - Calling a function
138 | 
139 |      ```python
140 |      def getInfo():
141 |          print("这是一个函数")
142 |      getInfo()
143 |      ```
144 | 
145 |      ![1564812010365](https://github.com/ai-union/PythonSpyder/blob/master/img/1564812010365.png?raw=true)
146 | 
147 |    - Global variables
148 | 
149 |      ![1564812252811](https://github.com/ai-union/PythonSpyder/blob/master/img/1564812252811.png?raw=true)
150 | 
151 |      A function can read a global variable, but to modify one it must declare the variable with the global keyword.
152 | 
153 |      ![1564812421392](https://github.com/ai-union/PythonSpyder/blob/master/img/1564812421392.png?raw=true)
154 | 
155 |    - Parameters
156 | 
157 |      Functions can accept parameters.
158 | 
159 |      ![1564812728360](https://github.com/ai-union/PythonSpyder/blob/master/img/1564812728360.png?raw=true)
160 | 
161 |      Here name is the parameter, usually called a "formal parameter".
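The global-variable and parameter examples above exist only as screenshots, so here is a text sketch of the same ideas. The names `count`, `show_count`, `increment`, and `greet` are made up for illustration and are not taken from the screenshots.

```python
count = 0  # a global variable


def show_count():
    """Reading a global needs no declaration."""
    return count


def increment():
    """Modifying a global requires the global keyword."""
    global count
    count += 1


def greet(name):
    """name is the formal parameter; the caller supplies the actual value."""
    return "Hello, {}".format(name)


increment()
print(show_count())
print(greet("Luffy"))
```

Without the `global count` line, `count += 1` inside `increment` would raise an UnboundLocalError, because Python would treat `count` as a new local variable.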
162 | 
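163 | 4. Conditionals and loops
164 | 
165 |    - Conditionals
166 | 
167 |      ![1564812936144](https://github.com/ai-union/PythonSpyder/blob/master/img/1564812936144.png?raw=true)
168 | 
169 |      if is followed by an expression. When the expression is true, the code under that statement runs. When it is false, the elif condition is tested or the else branch runs.
170 | 
171 |    - Loops
172 | 
173 |      - for loop
174 | 
175 |        We already know that Python has lists. How do we iterate over every element of a list?
176 | 
177 |        ![1564813159414](https://github.com/ai-union/PythonSpyder/blob/master/img/1564813159414.png?raw=true)
178 | 
179 |        As we can see, the names in the list are printed one after another.
180 | 
181 |      - while loop
182 | 
183 |        Suppose we want the program to count from 1 and stop at 10.
184 | 
185 |        ![1564813334042](https://github.com/ai-union/PythonSpyder/blob/master/img/1564813334042.png?raw=true)
186 | 
187 |        while is also followed by an expression; the body runs as long as the condition holds.
188 | 
189 |        > The expression must eventually become false, or the loop never ends. Here i += 1 adds 1 to i on every pass, so the loop stops running once i is greater than or equal to 10.
190 | 
191 |        - break ends a loop early.
192 |        - continue skips the rest of the current iteration. We will cover the details in the crawler projects later.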
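The conditional and loop examples above are also screenshots only. The following sketch covers the same constructs in text form; the function names and the age thresholds are invented for illustration and may differ from the code in the screenshots.

```python
def classify(age):
    """if / elif / else: exactly one branch runs."""
    if age < 18:
        return "minor"
    elif age < 60:
        return "adult"
    else:
        return "senior"


def count_up(limit):
    """while loop: collect 1, 2, ... up to and including limit."""
    numbers = []
    i = 1
    while i <= limit:
        numbers.append(i)
        i += 1  # without this line the condition never becomes false
    return numbers


def first_even(numbers):
    """for loop with continue and break."""
    found = None
    for n in numbers:
        if n % 2 != 0:
            continue  # skip odd numbers
        found = n
        break         # stop at the first even number
    return found


for name in ['路飞', '索隆', '娜美']:
    print(name)
print(classify(18), count_up(10), first_even([1, 3, 4, 6]))
```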
193 | 
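194 | 5. Classes
195 | 
196 |    This concept is rather abstract. Put simply, a class describes a category of things in code, such as dogs, cats, or cars.
197 | 
198 |    In Python, the class keyword defines a class:
199 | 
200 |    ```python
201 |    class Car():
202 |        '''
203 |        A class that represents a car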
204 |        '''
205 |    ```
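The class above has no attributes or methods yet. As a minimal sketch of what usually comes next, the version below adds an `__init__` method and one ordinary method. The `brand` and `mileage` attributes and the `drive` method are invented for illustration.

```python
class Car():
    '''A class that represents a car.'''

    def __init__(self, brand):
        self.brand = brand   # attribute stored on each instance
        self.mileage = 0

    def drive(self, km):
        '''Add km to the mileage and return the new total.'''
        self.mileage += km
        return self.mileage


car = Car("demo")  # the brand name is made up
car.drive(10)
print(car.brand, car.mileage)
```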
206 | 
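207 | ## Finally
208 | 
209 | This article only briefly introduces the Python features that crawlers need. Space did not allow much detail, and I hope you will bear with that. Follow our official account for more articles on Python basics.
210 | 
211 | Next we will put these points to work in real crawler projects. Join Tango for the crawler material that follows. See you next week.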


--------------------------------------------------------------------------------
/md/11-第十一周-Selenium简介.md:
--------------------------------------------------------------------------------
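  1 | # Column Week 11: Introduction to Selenium
  2 | 
  3 | Selenium is a tool for automating web applications. It runs directly in the browser, just as a real user would operate it, and it supports most browsers.
  4 | 
  5 | Over the years it has become a mainstay of automated testing and is also widely used for crawling. Let us see how it is applied to crawlers.
  6 | 
  7 | ## Installing Selenium
  8 | 
  9 | >  We use Google Chrome as the example here.
 10 | 
 11 | Setting up a Selenium development environment means installing the Selenium library and configuring the WebDriver for Chrome.
 12 | 
 13 | Install the library with the following command:
 14 | 
 15 | ```shell
 16 | pip install selenium
 17 | ```
 18 | 
 19 | After installation, start Python from CMD and check that the library imports: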
 20 | 
 21 | ```python
 22 | import selenium
 23 | selenium.__version__
 24 | ```
 25 | 
 26 | ![1570886392682](https://github.com/ai-union/PythonSpyder/blob/master/img/1570886392682.png?raw=true)
 27 | 
 28 | Next, install the WebDriver for Chrome. Different Chrome versions need different WebDriver versions, so check your browser version before downloading.
 29 | 
 30 | Enter [chrome://settings/help](chrome://settings/help) in the address bar
 31 | 
 32 | ![1570886711412](https://github.com/ai-union/PythonSpyder/blob/master/img/1570886711412.png?raw=true)
 33 | 
 34 | to see the current browser version. Then go to http://chromedriver.storage.googleapis.com/index.html and download the matching driver.
 35 | 
 36 | ![1570886893828](https://github.com/ai-union/PythonSpyder/blob/master/img/1570886893828.png?raw=true)
 37 | 
 38 | A mismatch in the last part of the version number has little effect.
 39 | 
 40 | ![1570886966890](https://github.com/ai-union/PythonSpyder/blob/master/img/1570886966890.png?raw=true)
 41 | 
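 42 | Download the zip file that matches your operating system, unzip it, and put it somewhere you will not delete it.
 43 | 
 44 | The environment is now configured. Let us write some code to check it: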
 45 | 
 46 | ```python
 47 | # import the webdriver module from Selenium
 48 | from selenium import webdriver
 49 | # the site to visit
 50 | url = "https://www.baidu.com"
 51 | # create a Chrome driver instance
 52 | # executable_path is the path to our chromedriver.exe
 53 | path = r"F:\chromedriver.exe"
 54 | browser = webdriver.Chrome(executable_path=path)
 55 | # open the browser and visit the target site
 56 | browser.get(url)
 57 | ```
 58 | 
 59 | ![1570887591900](https://github.com/ai-union/PythonSpyder/blob/master/img/1570887591900.png?raw=true)
 60 | 
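 61 | Run the script. If a browser window opens the site automatically, the environment is set up correctly.
 62 | 
 63 | ## Locating Page Elements
 64 | 
 65 | Selenium locates page elements mainly by attribute values or by the element's path in the HTML. There are eight locator methods in all: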
 66 | 
 67 | ```python
 68 | # locate by the id and name attributes
 69 | find_element_by_id()
 70 | find_element_by_name()
 71 | 
 72 | # locate by the class attribute and by HTML tag type
 73 | find_element_by_class_name()
 74 | find_element_by_tag_name()
 75 | 
 76 | # locate links by their text; partial_link does a fuzzy match
 77 | find_element_by_link_text()
 78 | find_element_by_partial_link_text()
 79 | 
 80 | # locate by the element's path
 81 | find_element_by_xpath()
 82 | find_element_by_css_selector()
 83 | ```
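The eight `find_element_by_*` methods above come from the Selenium releases current when this article was written. Newer Selenium 4 releases dropped them in favor of `driver.find_element(by, value)`. The sketch below maps each legacy method to the locator string that the `By` class uses. It is written with plain strings so it runs without Selenium installed; `LEGACY_TO_BY`, `modern_locator`, and the element id `kw` are names chosen for illustration.

```python
# Values mirror selenium.webdriver.common.by.By, written out as plain
# strings so this table can be read (and run) without Selenium installed.
LEGACY_TO_BY = {
    "find_element_by_id": "id",
    "find_element_by_name": "name",
    "find_element_by_class_name": "class name",
    "find_element_by_tag_name": "tag name",
    "find_element_by_link_text": "link text",
    "find_element_by_partial_link_text": "partial link text",
    "find_element_by_xpath": "xpath",
    "find_element_by_css_selector": "css selector",
}


def modern_locator(legacy_method, value):
    """Return the (by, value) pair that driver.find_element() expects."""
    return LEGACY_TO_BY[legacy_method], value


print(modern_locator("find_element_by_id", "kw"))
# With Selenium 4 and a driver instance this would be used as:
#   driver.find_element(*modern_locator("find_element_by_id", "kw"))
```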
 84 | 
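 85 | Besides reading page content, we sometimes need to click or type on the page. The most common operations are:
 86 | 
 87 | - click: click with the mouse
 88 | - send_keys: type text
 89 | 
 90 | There are many more. You do not need to memorize them all; look them up when you need them in real development.
 91 | 
 92 | In crawlers we usually set the User-Agent and other browser arguments through an Options object, to reduce the chance of being blocked: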
 93 | 
 94 | ```python
 95 | # import the webdriver module from Selenium
 96 | from selenium import webdriver
 97 | # import the Options class
 98 | from selenium.webdriver.chrome.options import Options
 99 | # create an Options instance
100 | chrome_options = Options()
101 | # set browser arguments
102 | chrome_options.add_argument('--headless')
103 | chrome_options.add_argument('--lang=zh-CN') # adjust to the target site
104 | chrome_options.add_argument('--user-agent=xxxxxx') # xxxxxx is your User-Agent string
105 | # start the browser with these options
106 | driver = webdriver.Chrome(options = chrome_options)
107 | ```
108 | 
109 | ## 实例
110 | 
111 | ```python
112 | from selenium import webdriver
113 | import codecs
114 | 
115 | pageMax = 2 # number of pages to crawl
116 | saveFileName = 'proxy.txt'
117 | 
118 | class Item(object):
119 | 	ip = None # proxy IP address
120 | 	port = None # proxy port
121 | 	anonymous = None # whether the proxy is anonymous
122 | 	protocol = None # supported protocol: http or https
123 | 	local = None # physical location
124 | 	speed = None # measured speed
125 | 	uptime = None # time of the last check
126 | 
127 | class GetProxy(object):
128 | 	def __init__(self):
129 | 		self.startUrl = 'https://www.kuaidaili.com/free'
130 | 		self.fileName = saveFileName # set before getProxyList(), which saves the file
131 | 		self.urls = self.getUrls()
132 | 		self.getProxyList(self.urls)
133 | 		
134 | 
135 | 	def getUrls(self):
136 | 		urls = []
137 | 		for word in ['inha', 'intr']:
138 | 			for page in range(1, pageMax + 1): 
139 | 				urlTemp = []
140 | 				urlTemp.append(self.startUrl)
141 | 				urlTemp.append(word)
142 | 				urlTemp.append(str(page))
143 | 				urlTemp.append('')
144 | 				url = '/'.join(urlTemp)
145 | 				urls.append(url)
146 | 		return urls
147 | 
148 | 
149 | 	def getProxyList(self, urls):
150 | 		proxyList = []
151 | 		# a fresh Item is created for every table row below
152 | 		for url in urls:
153 | 			print(url)
154 | 			path = r"F:\chromedriver.exe"
155 | 			browser = webdriver.Chrome(executable_path=path)
156 | 			browser.get(url)
157 | 			browser.implicitly_wait(5)
158 | 			elements = browser.find_elements_by_xpath('//tbody/tr')
159 | 			for element in elements:
160 | 				item = Item() # reusing one Item would overwrite the earlier rows
161 | 				item.ip = element.find_element_by_xpath('./td[1]').text
162 | 				item.port = element.find_element_by_xpath('./td[2]').text
163 | 				item.anonymous = element.find_element_by_xpath('./td[3]').text
164 | 				item.protocol = element.find_element_by_xpath('./td[4]').text
165 | 				item.local = element.find_element_by_xpath('./td[5]').text
166 | 				item.speed = element.find_element_by_xpath('./td[6]').text
167 | 				item.uptime = element.find_element_by_xpath('./td[7]').text
168 | 				proxyList.append(item)
169 | 			browser.quit()
170 | 		self.saveFile(proxyList)
171 | 
172 | 	def saveFile(self, proxyList):
173 | 		with codecs.open(self.fileName, 'w', 'utf-8') as fp:
174 | 			for item in proxyList:
175 | 				fp.write('%s \t' %item.ip)
176 | 				fp.write('%s \t' %item.port)
177 | 				fp.write('%s \t' %item.anonymous)
178 | 				fp.write('%s \t' %item.protocol)
179 | 				fp.write('%s \t' %item.local)
180 | 				fp.write('%s \t' %item.speed)
181 | 				fp.write('%s \r\n' %item.uptime)
182 | 				
183 | 
184 | if __name__ == '__main__':
185 | 	GP = GetProxy()
186 | 
187 | ```
188 | 
189 | Output:
190 | 
191 | ![1570893855785](https://github.com/ai-union/PythonSpyder/blob/master/img/1570893855785.png?raw=true)
192 | 
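193 | Since you have been studying crawlers for quite a while now, I will not walk through this code. Read it through, and if you have questions, join the group.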


--------------------------------------------------------------------------------
/md/04-第四周-正则表达式.md:
--------------------------------------------------------------------------------
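  1 | # Column Week 4: Regular Expressions
  2 | 
  3 | A regular expression is a special sequence of symbols that lets developers check whether a string matches a given pattern. Regular expressions are very important in crawlers too.
  4 | 
  5 | > This article covers:
  6 | >
  7 | > - Regular expressions: the commonly used symbols
  8 | > - Using Python's re module
  9 | > - How re is used in crawlers
 10 | 
 11 | ## Regular Expression Basics
 12 | 
 13 | 1. Line anchors
 14 | 
 15 |    Line anchors describe the boundaries of a substring. **^** marks the start of a line and **$** marks the end.
 16 | 
 17 |    For example:
 18 | 
 19 |    > ^py
 20 | 
 21 |    This expression requires the substring py to start at the beginning of the line. "python" matches, while "Java and py" does not.
 22 | 
 23 |    If we instead use:
 24 | 
 25 |    > py$
 26 | 
 27 |    then "Java and py" does match.
 28 | 
 29 |    To match the substring anywhere in the string, simply write:
 30 | 
 31 |    > py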
 32 | 
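 33 | 2. Metacharacters
 34 | 
 35 |    The most commonly used ones are listed below. Get a first impression now; the later crawlers will explain them gradually.
 36 | 
 37 |    | Code | Meaning                                                      |
 38 |    | ---- | ------------------------------------------------------------ |
 39 |    | .    | Matches any character except a newline                       |
 40 |    | \w   | Matches a letter, digit, underscore, or Chinese character    |
 41 |    | \W   | Matches anything **except** a letter, digit, underscore, or Chinese character |
 42 |    | \s   | Matches a single whitespace character (including Tab and newline) |
 43 |    | \S   | Matches any character **except** whitespace (including Tab and newline) |
 44 |    | \d   | Matches a digit                                              |
 45 |    | \b   | Matches the start or end of a word; words are usually delimited by spaces, punctuation, or line breaks |
 46 |    | ^    | Matches the start of the string                              |
 47 |    | $    | Matches the end of the string                                |
 48 | 
 49 |    Take the following expression:
 50 | 
 51 |    > \bgo\w*\b
 52 | 
 53 |    It matches words that begin with the letters go.
 54 | 
 55 |    Words such as goto, google, and gogo123 all match, but words such as bo and to do not, because they do not begin with go.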
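The anchor and word-boundary examples above can be checked directly with Python's re module. This is a small sketch; the sample sentence passed to `findall` is made up.

```python
import re

# ^ anchors the match at the start of the line, $ at the end.
print(bool(re.search(r'^py', 'python')))       # py at the start
print(bool(re.search(r'^py', 'Java and py')))  # py is not at the start
print(bool(re.search(r'py$', 'Java and py')))  # py at the end

# \b marks a word boundary, so only whole words starting with "go" are returned.
words = re.findall(r'\bgo\w*\b', 'goto google gogo123 bo to ago')
print(words)
```

Note that "ago" is not returned: its "go" does not sit at the start of a word, so the leading `\b` fails there.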
 56 | 
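 57 | 3. Quantifiers
 58 | 
 59 |    Common quantifiers:
 60 | 
 61 |    | Quantifier | Meaning                                             | Example                                                |
 62 |    | ---------- | --------------------------------------------------- | ------------------------------------------------------ |
 63 |    | ?          | Matches the preceding character zero or one time    | colou?r matches color and colour                       |
 64 |    | +          | Matches the preceding character one or more times   | go+gle matches everything from gogle to goo...gle      |
 65 |    | *          | Matches the preceding character zero or more times  | go*gle matches everything from ggle to goo...gle       |
 66 |    | {n}        | Matches the preceding character exactly n times     | go{2}gle matches only google                           |
 67 |    | {n,}       | Matches the preceding character at least n times    | go{2,}gle matches everything from google to goo...gle  |
 68 |    | {n,m}      | Matches the preceding character at least n and at most m times | employe{0,2} matches employ, employe, and employee |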
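Each row of the table above can be verified with `re.fullmatch`, which succeeds only when the whole string fits the pattern. The sketch below checks the table's own examples and then a few counter-examples.

```python
import re

examples = {
    r'colou?r': ['color', 'colour'],
    r'go+gle': ['gogle', 'google', 'gooogle'],
    r'go*gle': ['ggle', 'gogle', 'google'],
    r'go{2}gle': ['google'],
    r'go{2,}gle': ['google', 'gooogle'],
    r'employe{0,2}': ['employ', 'employe', 'employee'],
}

for pattern, strings in examples.items():
    for s in strings:
        # fullmatch succeeds only if the whole string fits the pattern
        assert re.fullmatch(pattern, s), (pattern, s)

# Counter-examples: too few repetitions.
print(re.fullmatch(r'go{2}gle', 'gogle'))
print(re.fullmatch(r'go+gle', 'ggle'))
# A space inside the braces turns the quantifier into literal text.
print(re.fullmatch(r'employe{0, 2}', 'employee'))
```

The last line is why the quantifier must be written `{0,2}` with no space: with the space, Python's re treats the braces as ordinary characters.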
 69 | 
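 70 | 4. Alternation
 71 | 
 72 |    The **|** character expresses a choice. For example, ID card numbers come in 15-digit and 18-digit forms. How do we match both?
 73 | 
 74 |    > (^\d{15}$)|(^\d{18}$)|(^\d{17})(\d|X|x)$
 75 | 
 76 |    This means: either 15 digits, or 18 digits, or 17 digits followed by an X (upper or lower case).
 77 | 
 78 |    The parentheses here denote groups.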
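As a quick check of the ID-number pattern above, the sketch below compiles it and tries a few strings. `looks_like_id` is a hypothetical helper name, and the digit strings are made up; the pattern only tests length and allowed characters, not whether the number is valid.

```python
import re

ID_PATTERN = re.compile(r'(^\d{15}$)|(^\d{18}$)|(^\d{17})(\d|X|x)$')


def looks_like_id(text):
    """Check the format only (length and allowed characters), not the checksum."""
    return ID_PATTERN.match(text) is not None


print(looks_like_id('1' * 15))
print(looks_like_id('1' * 17 + 'X'))
print(looks_like_id('1' * 16))
```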
 79 | 
 80 | ## Python's re Module
 81 | 
 82 | The re library is imported with the following code:
 83 | 
 84 | ```python
 85 | import re
 86 | ```
 87 | 
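 88 | 1. Matching with match()
 89 | 
 90 |    The match() method matches from the beginning of the string. If the pattern matches at the start position it returns a Match object; otherwise it returns None. The syntax is:
 91 | 
 92 |    > re.match(pattern, string, [flags])
 93 | 
 94 |    - pattern: the pattern string, i.e. the regular expression to match
 95 |    - string: the string to match against
 96 |    - flags: optional flags that control how matching is done, such as whether it is case-sensitive.
 97 | 
 98 |    For example, the following code checks, case-insensitively, whether a string starts with "py":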
 99 | 
100 |    ```python
101 |    import re
102 |    pattern = r'py\w+'
103 |    string = 'python is PYTHON'
104 |    match = re.match(pattern, string, re.I) # re.I makes the match case-insensitive
105 |    print(match)
106 |    string = "这是python"
107 |    match = re.match(pattern, string, re.I)
108 |    print(match)
109 |    ```
110 | 
111 |    ![1566655220447](https://github.com/ai-union/PythonSpyder/blob/master/img/1566655220447.png?raw=true)
112 | 
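113 |    Because match() starts matching at the beginning of the string and "这是python" does not start with py, the second call returns None. The span tuple gives the position of the match: the first value is the start index and the second is the end index.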
114 | 
115 |    ```python
116 |    import re
117 |    pattern = r'py\w+'
118 |    string = 'python is PYTHON'
119 |    match = re.match(pattern, string, re.I) # re.I makes the match case-insensitive
120 |    print(match)
121 |    print("Match start position:", match.start())
122 |    print("Match end position:", match.end())
123 |    print("Match span tuple:", match.span())
124 |    print("String that was searched:", match.string)
125 |    print("Matched text:", match.group())
126 |    ```
127 | 
128 |    ![1566656347663](https://github.com/ai-union/PythonSpyder/blob/master/img/1566656347663.png?raw=true)
129 | 
130 | 2. Matching with search()
131 | 
132 |    The search() method scans the whole string for the first match. It returns a Match object if one is found and None otherwise. The syntax is:
133 | 
134 |    > re.search(pattern, string, [flags])
135 | 
136 |    ```python
137 |    import re
138 |    pattern = r'py\w+'
139 |    string = 'python is PYTHON'
140 |    match = re.search(pattern, string, re.I) # re.I makes the match case-insensitive
141 |    print(match)
142 |    string = "这是python"
143 |    match = re.search(pattern, string, re.I)
144 |    print(match)
145 |    ```
146 | 
147 |    ![1566656597002](https://github.com/ai-union/PythonSpyder/blob/master/img/1566656597002.png?raw=true)
148 | 
149 |    The output shows that this method does not only look at the start of the string; a match anywhere else in the string is found as well.
150 | 
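151 | 3. Matching with findall()
152 | 
153 |    The findall() method searches the whole string for every substring that matches the regular expression and returns them as a list. If nothing is found it returns an empty list.
154 | 
155 |    > re.findall(pattern, string, [flags])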
156 | 
157 |    ```python
158 |    import re
159 |    pattern = r'py\w+'
160 |    string = 'python is PYTHON'
161 |    match = re.findall(pattern, string, re.I) # re.I makes the match case-insensitive
162 |    print(match)
163 |    string = "这是python"
164 |    match = re.findall(pattern, string, re.I)
165 |    print(match)
166 |    ```
167 | 
168 |    ![1566656895958](https://github.com/ai-union/PythonSpyder/blob/master/img/1566656895958.png?raw=true)
169 | 
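170 | 4. Replacing strings
171 | 
172 |    The sub() method performs string replacement. It is rarely needed in crawlers, so we will cover it when we run into it.
173 | 
174 | 5. Splitting strings with split()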
175 | 
176 |    ```python
177 |    import re
178 |    pattern = r'[?&]'  # split on '?' or '&'; inside [] a '|' would be matched literally
179 |    url = 'https://search.bilibili.com/live?keyword=IT蜗壳'
180 |    result = re.split(pattern, url)
181 |    print(result)
182 |    ```
183 | 
184 |    ![1566657375088](https://github.com/ai-union/PythonSpyder/blob/master/img/1566657375088.png?raw=true)
185 | 
186 |    Here the ? splits the URL into two parts.
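
Since sub() and split() are only touched on briefly above, here is a small sketch of both. The URL with an extra `page=2` parameter and the sample text are assumptions added for illustration; `urlsplit` and `parse_qs` come from the standard library and are shown only as an alternative for query strings.

```python
import re
from urllib.parse import urlsplit, parse_qs

# Hypothetical URL with two query parameters.
url = 'https://search.bilibili.com/live?keyword=IT&page=2'

# re.split with a character class: '?' and '&' are both separators.
parts = re.split(r'[?&]', url)
print(parts)

# re.sub replaces every match; here each run of whitespace becomes one space.
cleaned = re.sub(r'\s+', ' ', 'python   is \n PYTHON').strip()
print(cleaned)

# The standard library can also parse the query string directly.
query = parse_qs(urlsplit(url).query)
print(query)
```

For real URLs the `urllib.parse` route is usually the more robust choice, because it also handles percent-encoding.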
187 | 
188 |    That's all for today. See you next time.
189 | 
190 | 


--------------------------------------------------------------------------------
/md/12-第十二周-通过OCR识别验证码.md:
--------------------------------------------------------------------------------
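  1 | # Column Week 12: Recognizing CAPTCHAs with OCR
  2 | 
  3 | CAPTCHA stands for **Completely Automated Public Turing test to tell Computers and Humans Apart**. As the name suggests, a CAPTCHA tests whether the user is a real human.
  4 | 
  5 | Many websites show a distorted image containing some text when you log in or register. This week we will use a Python program to recognize these CAPTCHAs automatically. Be prepared, though: the program cannot recognize them with 100% accuracy.
  6 | 
  7 | ## Scenario
  8 | 
  9 | When we visit http://example.python-scraping.com/user/register, we see the following page: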
 10 | 
 11 | ![1571484984537](https://github.com/ai-union/PythonSpyder/blob/master/img/1571484984537.png?raw=true)
 12 | 
 13 | The content in the red box is the CAPTCHA. Let's fetch this page with a crawler:
 14 | 
 15 | ```python
 16 | import requests
 17 | from lxml.html import fromstring
 18 | 
 19 | def parse_form(html):
 20 |     tree = fromstring(html)
 21 |     data = {}
 22 |     for e in tree.cssselect('form input'):
 23 |         if e.get('name'):
 24 |             data[e.get('name')] = e.get('value')
 25 |     return data
 26 | 
 27 | url = "http://example.python-scraping.com/user/register"
 28 | session = requests.session()
 29 | html = session.get(url)
 30 | form = parse_form(html.content) # extract the form fields from the page
 31 | print(form)
 32 | ```
 33 | 
 34 | ![1571485766709](https://github.com/ai-union/PythonSpyder/blob/master/img/1571485766709.png?raw=true)
 35 | 
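 36 | > This requires the cssselect library: pip install cssselect
 37 | 
 38 | These form fields are easy to handle. To deal with the CAPTCHA, we first need to load the image.
 39 | 
 40 | In Python we usually process images with the Pillow package. Install it with **pip install Pillow**
 41 | 
 42 | The following code returns an Image object.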
 43 | 
 44 | ```python
 45 | from io import BytesIO
 46 | from PIL import Image
 47 | import base64
 48 | 
 49 | def get_img(html):
 50 |     tree = fromstring(html)
 51 |     img_data = tree.cssselect('div#recaptcha img')[0].get('src')
 52 |     img_data = img_data.partition(',')[-1]
 53 |     binary_img_data = base64.b64decode(img_data)
 54 |     img = Image.open(BytesIO(binary_img_data))
 55 |     return img
 56 | ```
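
To make the `partition(',')` and `b64decode` steps in `get_img` easier to follow, here is a minimal sketch that decodes a data URI by hand. It assumes, as the code above does, that the image `src` is a base64 data URI; the sample payload and the helper name `decode_data_uri` are made up.

```python
import base64

def decode_data_uri(src):
    """Split a data URI into (header, decoded bytes)."""
    # Everything before the first comma is the header, e.g. "data:image/png;base64".
    header, _, payload = src.partition(',')
    return header, base64.b64decode(payload)

# A tiny hand-made data URI standing in for the CAPTCHA's src attribute.
sample_src = 'data:image/png;base64,' + base64.b64encode(b'fake image bytes').decode('ascii')
header, raw = decode_data_uri(sample_src)
print(header)
print(raw)
```

In `get_img` the decoded bytes are wrapped in `BytesIO` so that Pillow can open them like a file.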
 57 | 
 58 | The complete code is as follows:
 59 | 
 60 | ```python
 61 | import requests
 62 | from lxml.html import fromstring
 63 | 
 64 | from io import BytesIO
 65 | from PIL import Image
 66 | import base64
 67 | 
 68 | def parse_form(html):
 69 |     tree = fromstring(html)
 70 |     data = {}
 71 |     for e in tree.cssselect('form input'):
 72 |         if e.get('name'):
 73 |             data[e.get('name')] = e.get('value')
 74 |     return data
 75 | def get_img(html):
 76 |     tree = fromstring(html)
 77 |     img_data = tree.cssselect('div#recaptcha img')[0].get('src')
 78 |     img_data = img_data.partition(',')[-1]
 79 |     binary_img_data = base64.b64decode(img_data)
 80 |     img = Image.open(BytesIO(binary_img_data))
 81 |     return img
 82 | 
 83 | url = "http://example.python-scraping.com/user/register"
 84 | session = requests.session()
 85 | html = session.get(url)
 86 | form = parse_form(html.content) # extract the form fields from the page
 87 | img = get_img(html.content) # load the CAPTCHA image
 88 | print(img)
 89 | ```
 90 | 
 91 | ![1571487145703](https://github.com/ai-union/PythonSpyder/blob/master/img/1571487145703.png?raw=true)
 92 | 
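 93 | To recognize the characters in the image, we next need **optical character recognition (OCR)**.
 94 | 
 95 | ## Using Tesseract
 96 | 
 97 | Install it with **pip install pytesseract**
 98 | 
 99 | Besides this library, you also need to install the Tesseract program locally. Download it from https://github.com/tesseract-ocr/tesseract/wiki
100 | 
101 | Download the version for your system. I use Windows as the example here.
102 | 
103 | At this point, running the code will most likely raise an error: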
104 | 
105 | ![1571489386804](https://github.com/ai-union/PythonSpyder/blob/master/img/1571489386804.png?raw=true)
106 | 
107 | We also need to modify the source code, namely the .py file in the red box.
108 | 
109 | ![1571489620723](https://github.com/ai-union/PythonSpyder/blob/master/img/1571489620723.png?raw=true)
110 | 
111 | Replace the content in the red box with the installation path of the program. Mine is shown below:
112 | 
113 | ![1571489930881](https://github.com/ai-union/PythonSpyder/blob/master/img/1571489930881.png?raw=true)
114 | 
115 | ```python
116 | import pytesseract
117 | str_info = pytesseract.image_to_string(img)
118 | print(str_info)
119 | ```
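
As an alternative to editing the library's source file, the path can also be set from your own script. The sketch below only locates the executable using the standard library; the Windows path is a hypothetical install location and `find_tesseract` is a made-up helper name.

```python
import shutil
from pathlib import Path

def find_tesseract(candidates=()):
    """Return the path of the tesseract executable, or None if it cannot be found."""
    on_path = shutil.which('tesseract')   # works when the install directory is on PATH
    if on_path:
        return on_path
    for candidate in candidates:          # otherwise try explicit install locations
        if Path(candidate).is_file():
            return str(candidate)
    return None

# Hypothetical Windows install location; replace it with your own.
path = find_tesseract([r'C:\Program Files\Tesseract-OCR\tesseract.exe'])
print(path)
```

If a path is found, assigning it to `pytesseract.pytesseract.tesseract_cmd` before calling `image_to_string` tells pytesseract where the program is, so the installed package does not have to be changed.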
120 | 
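121 | If we run recognition directly on this image, it will very likely return an empty string. That is because the image contains background noise, i.e. interfering elements.
122 | 
123 | The CAPTCHA text is black, so we can isolate the black part of the image. This process is usually called **thresholding**.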
124 | 
125 | ```python
126 | img.save('temp.png')
127 | new_img = img.convert('L') # convert to greyscale
128 | new_img.save('temp2.png')
129 | bw = new_img.point(lambda x: 0 if x < 1 else 255,'1') # keep only pure black pixels
130 | bw.save('temp3.png')
131 | ```
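
The rule passed to `point()` above can be illustrated without Pillow. This is a minimal sketch on a plain list of grey values (0 is black, 255 is white); the sample values and the function name `threshold` are made up.

```python
def threshold(gray_rows, cutoff=1):
    """Map every grey value below cutoff to 0 (black) and the rest to 255 (white)."""
    return [[0 if value < cutoff else 255 for value in row] for row in gray_rows]

# Two rows of hypothetical grey values: pure black text on a noisy background.
sample = [
    [0, 12, 200],
    [0, 0, 90],
]
print(threshold(sample))
```

With `cutoff=1` only pure black pixels survive, which matches the `x < 1` test above. A larger cutoff keeps darker greys as well, which may help if the text is not perfectly black.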
132 | 
133 | After running the code, the following images appear in the same directory as the script:
134 | 
135 | ![1571489797290](https://github.com/ai-union/PythonSpyder/blob/master/img/1571489797290.png?raw=true)
136 | 
137 | The processed image has far less noise. Passing it to pytesseract now gives a much higher success rate.
138 | 
139 | Complete code:
140 | 
141 | ```python
142 | import requests
143 | from lxml.html import fromstring
144 | 
145 | from io import BytesIO
146 | from PIL import Image
147 | import base64
148 | 
149 | import pytesseract
150 | 
151 | def parse_form(html):
152 |     tree = fromstring(html)
153 |     data = {}
154 |     for e in tree.cssselect('form input'):
155 |         if e.get('name'):
156 |             data[e.get('name')] = e.get('value')
157 |     return data
158 | def get_img(html):
159 |     tree = fromstring(html)
160 |     img_data = tree.cssselect('div#recaptcha img')[0].get('src')
161 |     img_data = img_data.partition(',')[-1]
162 |     binary_img_data = base64.b64decode(img_data)
163 |     img = Image.open(BytesIO(binary_img_data))
164 |     return img
165 | 
166 | url = "http://example.python-scraping.com/user/register"
167 | session = requests.session()
168 | html = session.get(url)
169 | form = parse_form(html.content) # extract the form fields from the page
170 | img = get_img(html.content)
171 | print(img)
172 | img.save('temp.png')
173 | new_img = img.convert('L') # convert to greyscale
174 | new_img.save('temp2.png')
175 | bw = new_img.point(lambda x: 0 if x < 1 else 255,'1') # keep only pure black pixels
176 | bw.save('temp3.png')
177 | str_info = pytesseract.image_to_string(bw)
178 | print(str_info)
179 | ```
180 | 
181 | Result:
182 | 
183 | ![1571490007402](https://github.com/ai-union/PythonSpyder/blob/master/img/1571490007402.png?raw=true)
184 | 
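185 | We can see that the text in the image has been recognized.
186 | 
187 | ## Caveats
188 | 
189 | For a CAPTCHA this simple, a success rate below 100% does not matter: a few more requests will eventually succeed. Watch the request frequency, though, or your IP can easily get banned.
190 | 
191 | Beyond simple CAPTCHAs like this one, there are many complex ones. I do not recommend spending too much effort on cracking them. Instead, we can rely on third-party tools, CAPTCHA-solving services, or even manual intervention to get past them.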
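
The "retry a few times, but not too fast" advice above can be sketched as a small helper. Here `attempt` is a hypothetical callable that would fetch the page, run OCR, submit the form, and return True on success; the default of five tries and a two-second pause are arbitrary assumptions.

```python
import time

def solve_with_retries(attempt, max_tries=5, delay=2.0, sleep=time.sleep):
    """Call attempt() until it returns True; return the try number, or None."""
    for try_number in range(1, max_tries + 1):
        if attempt():
            return try_number
        if try_number < max_tries:
            sleep(delay)   # pause so the site is not hit too frequently
    return None

# Stand-in for a real attempt: fails twice, then succeeds.
outcomes = iter([False, False, True])
print(solve_with_retries(lambda: next(outcomes), delay=0.01))
```

Passing `sleep` as a parameter keeps the helper easy to test without actually waiting.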
192 | 
193 | That's all for this issue. You are welcome to join the group to discuss.


--------------------------------------------------------------------------------
/md/08-第八周-Scrapy爬虫开发.md:
--------------------------------------------------------------------------------
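  1 | # Column Week 8: Scrapy Crawler Development
  2 | 
  3 | So far we have covered the basics of crawler development, but mostly on fairly simple websites. Is there a quick and convenient way to build a complete solution? This week we introduce Scrapy, a very popular crawling framework.
  4 | 
  5 | ## Introducing and Installing Scrapy
  6 | 
  7 | A crawling framework imposes structure on a project, and for that very reason it gives up some flexibility. Many people compare Requests with Scrapy. The former is a third-party library and the latter is a crawler development framework; although they do not sit at the same level, they can still be compared.
  8 | 
  9 | - Structure: Scrapy's built-in modules are enough to build a complete crawler, with clearly separated responsibilities. Requests only covers fetching data; it does not handle data cleaning or storage, so it must be combined with other libraries to build a crawler.
 10 | - Flexibility: Scrapy is less flexible than Requests. For poorly designed or unusual websites, Requests is a better fit.
 11 | - Scope: Scrapy suits large crawler projects and makes code easier to maintain and manage. Requests-based code depends heavily on the developer's habits; if the architecture is poorly designed or the developers change, the code becomes hard to maintain.
 12 | 
 13 | 1. How Scrapy works
 14 | 
 15 |    Scrapy uses the Twisted asynchronous networking library for network communication. Its architecture is clear and it exposes several middleware interfaces, so it can flexibly meet a wide range of needs. The overall architecture is shown below: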
 16 | 
 17 |    ![Scrapy](https://github.com/ai-union/PythonSpyder/blob/master/img/Scrapy.jpg?raw=true)
 18 | 
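 19 |    Scrapy roughly works as follows:
 20 | 
 21 |    - The engine takes a URL from the scheduler for the next fetch.
 22 |    - The engine wraps the URL in a request and passes it to the downloader, which downloads the resource and wraps it in a response.
 23 |    - The spider parses the response.
 24 |    - If an item is parsed out, it is handed to the item pipeline for further processing.
 25 |    - If a URL is parsed out, it is handed to the scheduler to wait for fetching.
 26 | 
 27 |    With reference to the diagram above, the components do the following:
 28 | 
 29 |    - Engine (Scrapy Engine): handles the data flow of the whole system and triggers events (the core of the framework).
 30 |    - Scheduler: accepts requests from the engine, pushes them onto a queue, and returns them when the engine asks again.
 31 |    - Downloader: downloads page content and returns it to the spiders.
 32 |    - Spiders: extract the information you need, i.e. items, from specific pages. They can also extract URLs so that Scrapy continues to the next page.
 33 |    - Item Pipeline: processes the items the spiders extract. Its main jobs are persisting items, validating them, and removing unwanted information. After a page is parsed by a spider, the items are sent to the item pipeline and processed in a specific order.
 34 |    - Downloader Middlewares: a layer between the engine and the downloader that processes the requests and responses passed between them.
 35 |    - Spider Middlewares: a layer between the engine and the spiders whose main job is to process the spiders' response input and request output.
 36 |    - Scheduler Middlewares: middleware between the engine and the scheduler that processes the requests and responses sent from the engine to the scheduler.
 37 | 
 38 | 2. Installing Scrapy
 39 | 
 40 |    Before installing Scrapy, you need to install Twisted. You can install it with pip, but that tends to cause problems, so installing from a downloaded whl file is recommended.
 41 | 
 42 |    Once Twisted is installed, install Scrapy with pip: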
 43 | 
 44 |    ```shell
 45 |    pip install Scrapy
 46 |    ```
 47 | 
 48 |    ![1569076766706](https://github.com/ai-union/PythonSpyder/blob/master/img/1569076766706.png?raw=true)
 49 | 
 50 |    That completes the installation.
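
The run cycle described under "How Scrapy works" can be sketched as a toy loop in plain Python. This is not Scrapy code: the in-memory pages and the function names `download`, `parse`, and `run` are invented purely to show how URLs and items move between the components.

```python
from collections import deque

# Hypothetical in-memory "web": URL -> (items found, links found).
FAKE_PAGES = {
    'page-1': (['question A'], ['page-2']),
    'page-2': (['question B', 'question C'], []),
}

def download(url):
    """Downloader: turn a request URL into a response."""
    return FAKE_PAGES[url]

def parse(response):
    """Spider: yield items and follow-up URLs found in a response."""
    items, links = response
    for item in items:
        yield ('item', item)
    for link in links:
        yield ('url', link)

def run(start_urls):
    """Engine: move data between scheduler, downloader, spider, and pipeline."""
    scheduler = deque(start_urls)   # URLs waiting to be fetched
    seen = set(start_urls)
    pipeline = []                   # stands in for the item pipeline
    while scheduler:
        url = scheduler.popleft()
        for kind, value in parse(download(url)):
            if kind == 'item':
                pipeline.append(value)
            elif value not in seen:
                seen.add(value)
                scheduler.append(value)
    return pipeline

print(run(['page-1']))
```

The real framework does the same hand-offs asynchronously and passes every request and response through the middleware layers.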
 51 | 
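 52 | ## A Scrapy Crawler Example
 53 | 
 54 | We will crawl Baidu Zhidao (https://zhidao.baidu.com/list?cid=110106) to show how to get a list of questions with Scrapy.
 55 | 
 56 | Inspecting the page shows that each entry is in an \<a> tag, which in turn sits inside a \<div> tag whose class attribute is question-title.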
 57 | 
 58 | ![1569077409152](https://github.com/ai-union/PythonSpyder/blob/master/img/1569077409152.png?raw=true)
 59 | 
 60 | 1. Change to your working directory, then create a project called baidu with this command:
 61 | 
 62 |    ```shell
 63 |    scrapy startproject baidu
 64 |    ```
 65 | 
 66 |    ![1569077656059](https://github.com/ai-union/PythonSpyder/blob/master/img/1569077656059.png?raw=true)
 67 | 
 68 |    We can see the directory structure that was created automatically:
 69 | 
 70 |    ![1569077783474](https://github.com/ai-union/PythonSpyder/blob/master/img/1569077783474.png?raw=true)
 71 | 
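 72 |    - spiders (folder): holds the crawl rules, which implement data fetching and cleaning.
 73 |    - items.py: defines and instantiates the data structures that hold the cleaned data.
 74 |    - middlewares.py: a hook framework around Scrapy's request/response processing; a light, low-level system for globally altering Scrapy's requests and responses.
 75 |    - pipelines.py: saves the data; the data objects come from items.py.
 76 |    - settings.py: the configuration file for the whole framework.
 77 |    - scrapy.cfg: the project deployment file.
 78 | 
 79 |    The usual development order in Scrapy is:
 80 | 
 81 |    - settings.py: configure the crawler, for example the request headers.
 82 |    - items.py: define the objects that store data; this mainly connects spiders and pipelines.py.
 83 |    - pipelines.py: store the data. Items behave like dictionaries whose keys are the fields defined in items.py.
 84 |    - spiders: write the crawl rules.
 85 | 
 86 | 2. Start writing the crawler
 87 | 
 88 |    - Edit settings.py: most of the initial file is commented out. This project only needs the item pipeline and the request headers. Change it as follows: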
 89 | 
 90 |      ```python
 91 |      pass # settings above are omitted
 92 |      # Override the default request headers:
 93 |      DEFAULT_REQUEST_HEADERS = {
 94 |         'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
 95 |         'Accept-Language': 'en',
 96 |      }
 97 |      pass # omitted
 98 |      # Configure item pipelines
 99 |      # See https://docs.scrapy.org/en/latest/topics/item-pipeline.html
100 |      ITEM_PIPELINES = {
101 |          'baidu.pipelines.BaiduPipeline': 300,
102 |      }
103 |      ```
104 | 
105 |      > The rest of the file is omitted for brevity
106 | 
107 |    - Edit items.py
108 | 
109 |      ```python
110 |      import scrapy
111 |      
112 |      
113 |      class BaiduItem(scrapy.Item):
114 |          # define the fields for your item here like:
115 |          # name = scrapy.Field()
116 |          TitleName = scrapy.Field()
117 |          pass
118 |      ```
119 | 
120 |    - Edit pipelines.py. We save the crawled content to a txt file.
121 | 
122 |      ```python
123 |      class BaiduPipeline(object):
124 |          def process_item(self, item, spider):
125 |              with open('data.txt', 'a', encoding='utf-8') as f:  # explicit encoding for Chinese titles
126 |                  for i in item['TitleName']:
127 |                      value = i.replace("\n","")
128 |                      f.write(value + "\r\n")
129 |              return item
130 |      ```
131 | 
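132 |    - Create the crawl rules in the spiders folder, in a file named Spider_spiders.py, with the following code:
133 | 
134 |      ```python
135 |      # Import BaiduItem from items.py to hold the crawled data
136 |      from baidu.items import BaiduItem
137 |      # Scrapy's built-in selector module for extracting data
138 |      from scrapy.selector import Selector
139 |      # Scrapy's base spider class
140 |      from scrapy import Spider
141 |      # Crawl rules
142 |      class Baiduspider(Spider):
143 |          # The name attribute is required and must be unique; it is used to run the spider
144 |          name = "Baidu_know"
145 |          # Domains the spider is allowed to visit
146 |          allowed_domains = ['baidu.com']
147 |          # Start URLs
148 |          start_urls = ["https://zhidao.baidu.com/list?cid=110106","https://zhidao.baidu.com/list?cid=110106"]
149 |          # parse handles the response; it is the default callback, so keep this name
150 |          def parse(self, response):
151 |              # Build a Selector from the response for data extraction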
152 |              sel = Selector(response)
153 |              items = []
154 |              item = BaiduItem()
155 |              title = sel.xpath('//div[@class="question-title"]/a/text()').extract()
156 |              for i in title:
157 |                  items.append(i)
158 |              item['TitleName'] = items
159 |              return item
160 |      ```
161 | 
162 |    - Run the following command in a terminal:
163 | 
164 |      ```shell
165 |      scrapy crawl Baidu_know
166 |      ```
167 | 
168 |      ![1569084529057](https://github.com/ai-union/PythonSpyder/blob/master/img/1569084529057.png?raw=true)
169 | 
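170 |      We can see that both start URLs were crawled, although no data was extracted. With that, our first Scrapy project has run successfully. For the details of Scrapy, it is still worth reading the documentation and books. You are welcome to join the group to discuss.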
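
As a follow-up to the pipelines.py step above, here is a sketch of a variant that opens the output file once per crawl instead of once per item. It relies on the `open_spider`, `close_spider`, and `process_item` hooks that Scrapy calls on item pipelines; the class name `TitleFilePipeline` and the `path` argument are assumptions for illustration.

```python
class TitleFilePipeline:
    """Write each title on its own line, opening the file once per crawl."""

    def __init__(self, path='data.txt'):
        self.path = path
        self.file = None

    def open_spider(self, spider):
        # Called when the spider starts: open the file with an explicit encoding.
        self.file = open(self.path, 'a', encoding='utf-8')

    def close_spider(self, spider):
        # Called when the spider finishes: flush and close the file.
        self.file.close()

    def process_item(self, item, spider):
        # Strip surrounding whitespace and newlines from every title.
        for title in item['TitleName']:
            self.file.write(title.strip() + '\n')
        return item
```

To use a class like this it would have to be registered in `ITEM_PIPELINES` in settings.py in place of `BaiduPipeline`.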
171 | 
172 | 


--------------------------------------------------------------------------------
/md/06-第六周-Lxml库与Xpath语法.md:
--------------------------------------------------------------------------------
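  1 | # Column Week 6: The Lxml Library and XPath Syntax
  2 | 
  3 | Lxml is a Python wrapper around the libxml2 XML parsing library. It is written in C and parses faster than BeautifulSoup. It uses XPath syntax to locate data in web pages. In this article you will learn:
  4 | 
  5 | > - How to use the Lxml library
  6 | > - XPath syntax
  7 | 
  8 | ## Installing and Using Lxml
  9 | 
 10 | 1. Installation:
 11 | 
 12 |    As with other third-party libraries, run the following in a command line: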
 13 | 
 14 |    ```shell
 15 |    pip install lxml
 16 |    ```
 17 | 
 18 |    If the installation reports a missing library, look up the fix with a search engine.
 19 | 
 20 | 2. Using Lxml
 21 | 
 22 |    - Repairing HTML
 23 | 
 24 |      ```python
 25 |      from lxml import etree
 26 |      
 27 |      html_text = '''
 28 |      <div>
 29 |          <ul>
 30 |              <li class="one">one</li>
 31 |              <li class="two">two
 32 |          </ul>
 33 |      </div>
 34 |      '''
 35 |      html = etree.HTML(html_text)
 36 |      print(html) # Lxml parses the data into an Element object
 37 |      result = etree.tostring(html)
 38 |      print(result) # the HTML is repaired automatically
 39 |      ```
 40 | 
 41 |      ![1567856337964](https://github.com/ai-union/PythonSpyder/blob/master/img/1567856337964.png?raw=true)
 42 | 
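 43 |      The output shows that etree parses the HTML document into an Element object and repairs it automatically. You may have noticed that the last li tag in my code is incomplete; in the output, not only is the li tag closed, but html and body tags have been added as well.
 44 | 
 45 |    - Reading an HTML file
 46 | 
 47 |      Create a file named html_demo.html in the same local directory, with the following content: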
 48 | 
 49 |      ```html
 50 |      <html>
 51 |          <head>
 52 |              <title>DemoHTML</title>
 53 |          </head>
 54 |          <body>
 55 |              <div>
 56 |                  <ul>
 57 |                      <li>a</li>
 58 |                      <li>b</li>
 59 |                      <li>c</li>
 60 |                      <li>d</li>
 61 |                  </ul>
 62 |              </div>
 63 |          </body>
 64 |      </html>
 65 |      ```
 66 | 
 67 |      We can now read the contents of the HTML file with Lxml:
 68 | 
 69 |      ```python
 70 |      from lxml import etree
 71 |      
 72 |      html= etree.parse("html_demo.html")
 73 |      result = etree.tostring(html) 
 74 |      print(result)
 75 |      ```
 76 | 
 77 |      ![1567857170021](https://github.com/ai-union/PythonSpyder/blob/master/img/1567857170021.png?raw=true)
 78 | 
 79 |    - Parsing a web page
 80 | 
 81 |      We can fetch the HTML with the requests library and then parse it with Lxml.
 82 | 
 83 |      The site is: https://www.qiushibaike.com/text/
 84 | 
 85 |      ```python
 86 |      import requests
 87 |      from lxml import etree
 88 |      
 89 |      header = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64)\
 90 |           AppleWebKit/537.36 (KHTML, like Gecko) Chrome/76.0.3809.132 Safari/537.36'}
 91 |      res = requests.get('https://www.qiushibaike.com/text/', headers = header)
 92 |      html = etree.HTML(res.text)
 93 |      result = etree.tostring(html)
 94 |      print(result)
 95 |      ```
 96 | 
 97 |      ![1567857830629](https://github.com/ai-union/PythonSpyder/blob/master/img/1567857830629.png?raw=true)
 98 | 
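 99 | ## Using XPath
100 | 
101 | 1. Selecting nodes
102 | 
103 |    - Node selection
104 | 
105 |      | Expression | Description                                                          |
106 |      | ---------- | -------------------------------------------------------------------- |
107 |      | nodename   | Selects all child nodes of the named node                            |
108 |      | /          | Selects from the root node                                           |
109 |      | //         | Selects matching nodes anywhere in the document from the current node down, regardless of their position |
110 |      | .          | Selects the current node                                             |
111 |      | ..         | Selects the parent of the current node                               |
112 |      | @          | Selects attributes                                                   |
113 | 
114 |    - Predicates
115 | 
116 |      | Path expression         | Result                                                                  |
117 |      | ----------------------- | ----------------------------------------------------------------------- |
118 |      | /ul/li[1]               | Selects the first li element that is a child of the ul element          |
119 |      | //li[@attribute]        | Selects all li elements that have an attribute named attribute          |
120 |      | //li[@attribute='info'] | Selects all li elements whose attribute named attribute has the value info |
121 | 
122 |      Writing these by hand is error-prone, so we usually obtain them with the Copy XPath command in Chrome's developer tools.
123 | 
124 |    For example, to get a user name from Qiushibaike: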
125 | 
126 |    ```python
127 |    import requests
128 |    from lxml import etree
129 |    
130 |    header = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64)\
131 |         AppleWebKit/537.36 (KHTML, like Gecko) Chrome/76.0.3809.132 Safari/537.36'}
132 |    res = requests.get('https://www.qiushibaike.com/text/', headers = header)
133 |    html = etree.HTML(res.text)
134 |    user_id = html.xpath('//*[@id="qiushi_tag_122217427"]/div[1]/a[2]/h2/text()') # text() returns the element's text content
135 |    print(user_id)
136 |    ```
137 | 
138 |    ![1567859645775](https://github.com/ai-union/PythonSpyder/blob/master/img/1567859645775.png?raw=true)
139 | 
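140 |    > One thing to note: the id attribute contains a string of digits. When you write this yourself, fill in the value from the page you actually see, otherwise no data will be returned.
141 | 
142 |    Next, let's see how to get the names of all users on the current page: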
143 | 
144 |    ```python
145 |    import requests
146 |    from lxml import etree
147 |    
148 |    header = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64)\
149 |         AppleWebKit/537.36 (KHTML, like Gecko) Chrome/76.0.3809.132 Safari/537.36'}
150 |    res = requests.get('https://www.qiushibaike.com/text/', headers = header)
151 |    html = etree.HTML(res.text)
152 |    
153 |    users = html.xpath('//div[@class="article block untagged mb15 typs_hot"]')
154 |    for user_info in users:
155 |        user_id = user_info.xpath('div[1]/a[2]/h2/text()')[0]
156 |        print(user_id)
157 |    
158 |    ```
159 | 
160 |    ![1567860876635](https://github.com/ai-union/PythonSpyder/blob/master/img/1567860876635.png?raw=true)
161 | 
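162 |    If we look at the page, however, we find that the first user name was not picked up, because the class of the first entry is "article block untagged mb15 typs_long".
163 | 
164 |    So we change the code to the following: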
165 | 
166 |    ```python
167 |    import requests
168 |    from lxml import etree
169 |    
170 |    header = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64)\
171 |         AppleWebKit/537.36 (KHTML, like Gecko) Chrome/76.0.3809.132 Safari/537.36'}
172 |    res = requests.get('https://www.qiushibaike.com/text/', headers = header)
173 |    html = etree.HTML(res.text)
174 |    
175 |    users = html.xpath('//div[starts-with(@id,"qiushi_tag")]')
176 |    for user_info in users:
177 |        user_id = user_info.xpath('div[1]/a[2]/h2/text()')[0]
178 |        print(user_id)
179 |    
180 |    ```
181 | 
182 |    ![1567861127369](https://github.com/ai-union/PythonSpyder/blob/master/img/1567861127369.png?raw=true)
183 | 
184 |    Here we use **starts-with()**, which matches on only the beginning of the target element's attribute value.
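
The predicate forms from the tables above can be tried offline on a small document. The sketch below uses the standard library's `xml.etree.ElementTree` rather than lxml, which is a deliberate substitution so that it runs without extra installs; ElementTree supports only a subset of XPath, so functions such as `starts-with()` and `text()` still require lxml. The sample markup is made up.

```python
import xml.etree.ElementTree as ET

doc = ET.fromstring('''
<div>
    <ul>
        <li class="one">one</li>
        <li class="two">two</li>
        <li>three</li>
    </ul>
</div>
''')

first = doc.find('./ul/li[1]')              # position predicate: first li under ul
with_class = doc.findall('.//li[@class]')   # li elements that have a class attribute
two = doc.findall(".//li[@class='two']")    # li elements whose class equals "two"

print(first.text)
print([li.text for li in with_class])
print([li.text for li in two])
```

With lxml the same expressions can be passed to `xpath()`, and appending `/text()` returns the text directly instead of the elements.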
185 | 
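186 |    That's it for this week. As an exercise, try using XPath to extract the text of every joke on the page. If you run into problems, join our study group for help.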


--------------------------------------------------------------------------------
/md/10-第十周-ORM框架.md:
--------------------------------------------------------------------------------
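  1 | # Column Week 10: ORM Frameworks
  2 | 
  3 | Our crawler column is essentially complete at this point. The next few articles introduce third-party libraries and frameworks that we often use in projects; with their support, writing crawler projects becomes much more convenient. The ORM framework introduced today is used to work with databases.
  4 | 
  5 | ## Introducing and Installing SQLAlchemy
  6 | 
  7 | SQLAlchemy is open-source software for the Python programming language that provides a SQL toolkit and an object-relational mapper, released under the MIT license. One of its goals is to provide an enterprise-level persistence model that is compatible with many databases (such as SQLite, MySQL, PostgreSQL, Oracle, Microsoft SQL Server, and Firebird).
  8 | 
  9 | 1. Installing SQLAlchemy
 10 | 
 11 |    Install it with the following command:
 12 | 
 13 |    ```shell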
 14 |    pip install SQLAlchemy
 15 |    ```
 16 | 
 17 |    If the installation is slow or fails, try switching to a mirror, for example:
 18 | 
 19 |    ```shell
 20 |    pip install -i https://pypi.tuna.tsinghua.edu.cn/simple SQLAlchemy
 21 |    ```
 22 | 
 23 |    ![1570202323349](https://github.com/ai-union/PythonSpyder/blob/master/img/1570202323349.png?raw=true)
 24 | 
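 25 |    SQLAlchemy still connects to the database through a database driver, so after installing SQLAlchemy you also need to install the driver module for your database. Taking MySQL as the example, install pymysql:
 26 | 
 27 |    ```shell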
 28 |    pip install pymysql
 29 |    ```
 30 | 
 31 |    ![1570202356214](https://github.com/ai-union/PythonSpyder/blob/master/img/1570202356214.png?raw=true)
 32 | 
 33 |    After installation, import the modules in a command-line window to check that the installation succeeded:
 34 | 
 35 |    ![1570202509673](https://github.com/ai-union/PythonSpyder/blob/master/img/1570202509673.png?raw=true)
 36 | 
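 37 | 2. Working with a database from Python using SQLAlchemy
 38 | 
 39 |    SQLAlchemy connects to the database through a connection pool. The idea is that database connections are kept in memory as objects. When a user needs database access, no new connection is created; instead, an idle connection that is already open is taken from the pool. When the user is done, the connection is not closed but returned to the pool for the next request. Opening and closing connections is managed by the pool itself. Pool parameters control the initial number of connections, the lower and upper limits, the maximum number of uses per connection, the maximum idle time, and so on. The pool's own management mechanism can also be used to monitor the number of connections and how they are used.
 40 | 
 41 |    With the theory covered, let's look at the code that connects SQLAlchemy to a database: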
 42 | 
 43 |    ```python
 44 |    from sqlalchemy import create_engine
 45 |    engine = create_engine("mysql+pymysql://root:pwd@localhost:3306/test?charset=utf8",echo=True)
 46 |    ```
 47 | 
 48 |    - mysql+pymysql://root:pwd@localhost:3306/test: mysql is the database system, pymysql is the driver module used to connect, root is the database user name, pwd is the password, localhost:3306 is the local database host and port, and test is the database name.
 49 |    - echo=True: prints the SQL statements SQLAlchemy executes. It works like a monitor, so you can see exactly what is being run.
 50 |    
 51 |    Connection strings for other databases:
 52 |    
 53 |    | Database             | Connection string                              |
 54 |    | -------------------- | ---------------------------------------------- |
 55 |    | Microsoft SQL Server | mssql+pymssql://username:pwd@ip:port/dbname    |
 56 |    | MySQL                | mysql+pymysql://username:pwd@ip:port/dbname    |
 57 |    | Oracle               | oracle+cx_oracle://username:pwd@ip:port/dbname |
 58 |    | PostgreSQL           | postgresql://username:pwd@ip:port/dbname       |
 59 |    | SQLite               | sqlite:///file_path                            |
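
The connection strings above all follow the same `dialect+driver://user:password@host:port/database` shape. The sketch below assembles one with the standard library only; `make_url` is a made-up helper, and the password containing `@` and `/` is an invented example showing why special characters in credentials need to be URL-encoded.

```python
from urllib.parse import quote_plus

def make_url(dialect, driver, user, password, host, port, database, **params):
    """Build a dialect+driver://user:password@host:port/database URL string."""
    # Encode the credentials so characters such as '@' or '/' do not break the URL.
    auth = f'{quote_plus(user)}:{quote_plus(password)}'
    url = f'{dialect}+{driver}://{auth}@{host}:{port}/{database}'
    if params:
        url += '?' + '&'.join(f'{key}={quote_plus(str(value))}' for key, value in params.items())
    return url

print(make_url('mysql', 'pymysql', 'root', 'p@ss/word', 'localhost', 3306, 'test', charset='utf8'))
```

The resulting string can be passed to `create_engine` in the same way as the hand-written one.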
 60 | 
 61 | ## Creating Tables
 62 | 
 63 | Once the database connection is set up, the next step is to create tables. Here is how to create a table with SQLAlchemy:
 64 | 
 65 | ```python
 66 | # Connect to the database
 67 | from sqlalchemy import create_engine
 68 | engine = create_engine("mysql+pymysql://root:root@localhost:3306/test?charset=utf8",echo=True)
 69 | 
 70 | # Define the table
 71 | from sqlalchemy import Column, Integer, String, DateTime # commonly used column types
 72 | from sqlalchemy.ext.declarative import declarative_base
 73 | Base = declarative_base()
 74 | 
 75 | class mytable(Base):
 76 |     # Table name
 77 |     __tablename__ = "mytable"
 78 |     # Columns
 79 |     id = Column(Integer, primary_key = True)
 80 |     name = Column(String(10), unique = True)
 81 |     age = Column(Integer)
 82 |     birth = Column(DateTime)
 83 |     class_name = Column(String(50))
 84 | 
 85 | Base.metadata.create_all(engine)
 86 | ```
 87 | 
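 88 | \__tablename__ = "mytable" sets the table name. Plain SQLAlchemy declarative classes require it (some extensions, such as Flask-SQLAlchemy, can derive it from the class name). The class attributes are the column names and types of the table. Finally, Base.metadata.create_all(engine) creates the corresponding tables in the database.
 89 | 
 90 | To drop a table, use the following code:
 91 | 
 92 | ```python
 93 | # Drop a single table through its Table object, or drop every table known to Base
 94 | mytable.__table__.drop(bind = engine)
 95 | Base.metadata.drop_all(engine)
 96 | ```
 97 | 
 98 | Both creating and dropping tables require a database connection first, i.e. an engine object.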
 99 | 
100 | ## Adding Data
101 | 
102 | With the table created, we can start working with data. First create a session object, which is used to execute SQL statements:
103 | 
104 | ```python
105 | from sqlalchemy.orm import sessionmaker
106 | DBSession = sessionmaker(bind = engine)
107 | session = DBSession()
108 | ```
109 | 
110 | SQLAlchemy has dedicated syntax for the usual create, delete, update, and query operations.
111 | 
112 | ```python
113 | new_data = mytable(name = 'Tango', age = 10, birth='2019-10-05', class_name = '三年一班')
114 | session.add(new_data)
115 | session.commit()
116 | session.close()
117 | ```
118 | 
119 | ## Updating Data
120 | 
121 | - Updating with the update method
122 | 
123 |   ```python
124 |   session.query(mytable).filter_by(id=1).update({mytable.age:12})
125 |   session.commit()
126 |   session.close()
127 |   ```
128 | 
129 |   We first find the row whose id is 1 and then change the age of that row.
130 | 
131 | - Updating by assignment:
132 | 
133 |   ```python
134 |   get_data = session.query(mytable).filter_by(id=1).first()
135 |   get_data.class_name = '三年二班'
136 |   session.commit()
137 |   session.close()
138 |   ```
139 | 
140 |   This approach is usually used to update a single row. For bulk updates, update performs better.
141 | 
142 | ## Querying Data
143 | 
144 | - Querying all rows
145 | 
146 |   ```python
147 |   get_data = session.query(mytable).all()
148 |   for i in get_data:
149 |       print('My name: ' + i.name)
150 |       print('My class: ' + i.class_name)
151 |   session.close()
152 |   ```
153 | 
154 |   This is equivalent to the following SQL:
155 | 
156 |   ```sql
157 |   select * from mytable;
158 |   ```
159 | 
160 |   all() returns a list.
161 | 
162 | - Querying specific columns
163 | 
164 |   ```python
165 |   get_data = session.query(mytable.name, mytable.class_name).all()
166 |   for i in get_data:
167 |       print('My name: ' + i.name)
168 |       print('My class: ' + i.class_name)
169 |   session.close()
170 |   ```
171 | 
172 |   We only need to pass the columns we want to query.
173 | 
174 | - Filtering
175 | 
176 |   ```python
177 |   get_data = session.query(mytable.name, mytable.class_name).filter_by(id=1).all()
178 |   for i in get_data:
179 |       print('My name: ' + i.name)
180 |       print('My class: ' + i.class_name)
181 |   session.close()
182 |   
183 |   ```
184 | 
185 |   filter_by(id=1) selects the row whose id is 1.
186 | 
187 |   To get only the first row of the result, replace all() with first().
188 | 
189 |   - Multiple conditions with and
190 | 
191 |     ```python
192 |     get_data = session.query(mytable.name, mytable.class_name).filter(mytable.id>2, mytable.name=='Tango').first()
193 |     print('My name: ' + get_data.name)
194 |     print('My class: ' + get_data.class_name)
195 |     ```
196 | 
197 |   - Conditions with or
198 | 
199 |     ```python
200 |     from sqlalchemy import or_
201 |     session.query(mytable).filter(or_(mytable.id>2, mytable.name=='Tango')).all()
202 |     ```
203 | 
204 |     This requires importing or_.
205 | 
206 |   - Querying multiple tables
207 | 
208 |     ```python
209 |     get_data = session.query(mytable).join(mytable2).filter(mytable.id>=2).all()
210 |     ```
211 | 
212 |     join performs an inner join. For an outer join, replace join with outerjoin.
213 | 
214 |     If the SQL statement is very complex, we can also execute raw SQL directly:
215 | 
216 |     ```python
217 |     from sqlalchemy import text
218 |     result = session.execute(text("select * from mytable"))  # wrap raw SQL in text()
219 |     rows = result.fetchall()
220 |     ```
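
To make the mapping between the ORM calls in this article and plain SQL more concrete, here is a sketch that runs roughly equivalent statements by hand. It uses the standard library's `sqlite3` module with an in-memory database instead of SQLAlchemy and MySQL, purely so that it runs without a server; the table layout is a simplified version of `mytable` and the class name value is invented.

```python
import sqlite3

conn = sqlite3.connect(':memory:')
conn.execute(
    'CREATE TABLE mytable ('
    'id INTEGER PRIMARY KEY, name TEXT UNIQUE, age INTEGER, class_name TEXT)'
)

# session.add(mytable(...)) followed by commit  ->  INSERT
conn.execute(
    'INSERT INTO mytable (name, age, class_name) VALUES (?, ?, ?)',
    ('Tango', 10, 'Class 3-1'),
)

# query(mytable).filter_by(id=1).update({mytable.age: 12})  ->  UPDATE ... WHERE
conn.execute('UPDATE mytable SET age = ? WHERE id = ?', (12, 1))

# query(name, class_name).filter(or_(id > 2, name == 'Tango')).all()  ->  SELECT ... OR
rows = conn.execute(
    'SELECT name, class_name FROM mytable WHERE id > ? OR name = ?',
    (2, 'Tango'),
).fetchall()
age = conn.execute('SELECT age FROM mytable WHERE id = 1').fetchone()[0]
conn.commit()
conn.close()

print(rows, age)
```

Setting `echo=True` on the engine, as shown earlier, is the way to see the statements SQLAlchemy itself generates.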
221 | 
222 | That's all for this week. Try storing the data from your earlier crawlers in a database with this framework and see how it works.


--------------------------------------------------------------------------------
/md/03-第三周-我们的第一个爬虫.md:
--------------------------------------------------------------------------------
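  1 | # Our First Crawler
  2 | 
  3 | > This article covers:
  4 | >
  5 | > - How to install Python third-party libraries
  6 | > - How to use the Requests library
  7 | > - How to use the BeautifulSoup library
  8 | > - How to use Requests and BeautifulSoup together.
  9 | 
 10 | ## Python Third-Party Libraries
 11 | 
 12 | A big reason for Python's popularity is its many powerful third-party libraries. They let programmers write the most functionality with the least code, without having to understand much of the underlying implementation. Take the train-ticket example from earlier. Buying a ticket needs:
 13 | 
 14 | - Paper and ink to print the ticket
 15 | - A machine that can print
 16 | - Operations such as looking up seats.
 17 | 
 18 | Without third-party libraries, buying a ticket would mean implementing every one of those things ourselves, step by step, which would take a very long time. With the ticket machine as our "third-party library", we get a ticket in a few dozen seconds. In Python, this kind of ready-to-use component is a third-party library.
 19 | 
 20 | 1. How to install third-party libraries
 21 | 
 22 |    - With plain pip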
 23 | 
 24 |      ```shell
 25 |      pip install <package-name>
 26 |      ```
 27 | 
 28 |    - In an Anaconda environment
 29 | 
 30 |      ```shell
 31 |      conda install <package-name>
 32 |      ```
 33 | 
 34 |    - From source
 35 | 
 36 |      Download the source code and extract it. The resulting folder contains a file named **setup.py**. In the **same directory** as that file, run:
 37 | 
 38 |      ```shell
 39 |      python setup.py install
 40 |      ```
 41 | 
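 42 |    > All of these methods assume there is only one Python environment on your computer. If Python 2 and Python 3 are both installed, be careful and check which one is the default. If you run into problems, mention me in the group. Not in the group yet? Scan the QR code at the end of the article to join; there are plenty of perks.
 43 | 
 44 | 2. How to use a library after installing it
 45 | 
 46 |    Using a Python third-party library is simple: just import it at the top of your code.
 47 | 
 48 |    ```python
 49 |    import time # this module ships with Python and can be imported directly
 50 |    from bs4 import BeautifulSoup # imported from the third-party library we installed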
 51 |    ```
 52 | 
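 53 | ## The Requests Library
 54 | 
 55 | The Requests library requests a website and fetches its page data. Let's start with a simple example to explain how to use it.
 56 | 
 57 | Our example site is Aizhan's ICP filing lookup, [https://icp.aizhan.com](https://icp.aizhan.com/). The site we look up is www.csdn.net
 58 | 
 59 | > The target URL is: https://icp.aizhan.com/www.csdn.net/
 60 | 
 61 | Create a file named main.py and put the following code in it.
 62 | 
 63 | > I suggest creating a folder in a fixed location to keep all the code from this column. When the column ends, look back at what you have achieved; you may be surprised.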
 64 | 
 65 | ```python
 66 | import requests
 67 | # Look up the ICP filing information for CSDN
 68 | res = requests.get('https://icp.aizhan.com/www.csdn.net/') 
 69 | print(res)
 70 | print(res.text)
 71 | ```
 72 | 
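 73 | Then, in a console (cmd), enter:
 74 | 
 75 | ```shell
 76 | python main.py
 77 | ```
 78 | 
 79 | If you see the following output, you have successfully fetched the page source of this site. Congratulations, you have written your first crawler.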
 80 | 
 81 | ![1566045583808](https://github.com/ai-union/PythonSpyder/blob/master/img/1566045583808.png?raw=true)
 82 | 
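 83 | 1. We see **<Response [200]>**, which means the request succeeded. A status such as 404 or 400 means the request failed. There are many possible causes, for example a mistyped URL, or an anti-crawling mechanism being triggered.
 84 | 2. The block of HTML above is what we fetched. We can press F12 in the browser and compare what we got with the source of the target page to see whether they match.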
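
To make the meaning of status codes such as 200, 400, and 404 more concrete, here is a small sketch using the standard library's `http.HTTPStatus`. The helper name `describe` and the chosen category labels are made up for illustration; with Requests, the number itself is available as `res.status_code`.

```python
from http import HTTPStatus

def describe(status_code):
    """Return a short description such as '404 Not Found: client error'."""
    phrase = HTTPStatus(status_code).phrase
    if 200 <= status_code < 300:
        outcome = 'success'
    elif 300 <= status_code < 400:
        outcome = 'redirect'
    elif 400 <= status_code < 500:
        outcome = 'client error'
    elif 500 <= status_code < 600:
        outcome = 'server error'
    else:
        outcome = 'informational'
    return f'{status_code} {phrase}: {outcome}'

for code in (200, 400, 404, 503):
    print(describe(code))
```

Codes in the 4xx range point at the request (wrong URL, blocked client), while 5xx codes point at the server.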
 85 | 
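 86 | Most of the time we need to add request headers in a crawler so that the server believes the request comes from a browser. This reduces the chance of triggering anti-crawling mechanisms.
 87 | 
 88 | So how do we pretend to be a browser? We only need to pass a headers argument to the get method.
 89 | 
 90 | How do we get the header content?
 91 | 
 92 | Press F12 in the browser to open the developer tools, refresh the page, and find **User-Agent**. This is the request header we need.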
 93 | 
 94 | ![1566046413728](https://github.com/ai-union/PythonSpyder/blob/master/img/1566046413728.png?raw=true)
 95 | 
 96 | 然后修改代码：
 97 | 
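```python
import requests

# Adjacent string literals inside parentheses are joined into one string
headers = {
    'User-Agent': ('Mozilla/5.0 (Windows NT 10.0; Win64; x64) '
                   'AppleWebKit/537.36 (KHTML, like Gecko) '
                   'Chrome/76.0.3809.100 Safari/537.36')
}
# Look up the ICP registration info for CSDN
res = requests.get('https://icp.aizhan.com/www.csdn.net/', headers=headers)
print(res)
print(res.text)
```

> Here headers is a dictionary. The value is long, so we split it across lines. Avoid breaking it with a `\` inside the quotes: the indentation of the next line would then become part of the string. Adjacent string literals inside parentheses are joined without adding anything.
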
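How a long string is wrapped matters. A backslash at the end of a line inside the quotes continues the string, and the leading spaces of the next line end up in the value. The sketch below compares that with adjacent string literals; the shortened User-Agent value and the variable names are only for illustration.

```python
# Backslash continuation inside the quotes: the indentation of the
# second line becomes part of the string.
ua_backslash = 'Mozilla/5.0 (KHTML, like Gecko)\
     Chrome/76.0.3809.100'

# Adjacent literals inside parentheses: joined exactly as written.
ua_joined = ('Mozilla/5.0 (KHTML, like Gecko) '
             'Chrome/76.0.3809.100')

print(repr(ua_backslash))
print(repr(ua_joined))
print(ua_backslash == ua_joined)  # False: the first contains extra spaces
```
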
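Besides get, the Requests library also has a post method, which is commonly used to submit data such as a login form. We will use it in the hands-on projects later, so we won't go into it here.

When we use this library to access a site, things sometimes go wrong. Common exceptions are:

- ConnectionError: usually a network problem (DNS lookup failure, connection refused, and so on).
- HTTPError: the request returned an unsuccessful status code such as 404. Requests raises it only when you call `res.raise_for_status()`.
- Timeout: the request timed out.
- TooManyRedirects: the request exceeded the configured maximum number of redirects.

In real code we usually need to handle these exceptions:
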
```python
import requests

headers = {
    'User-Agent': ('Mozilla/5.0 (Windows NT 10.0; Win64; x64) '
                   'AppleWebKit/537.36 (KHTML, like Gecko) '
                   'Chrome/76.0.3809.100 Safari/537.36')
}
# Exception handling: the request itself must be inside the try block
try:
    # Look up the ICP registration info for CSDN
    res = requests.get('https://icp.aizhan.com/www.csdn.net/', headers=headers)
    print(res.text)
except requests.exceptions.ConnectionError:
    print("Connection failed")
```

When the connection fails, a message is printed to the console to tell the developer that it failed.

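The example above handles only the connection error. A slightly fuller sketch that covers the other exceptions in the list might look like this. The function name `fetch` and the 10-second timeout are my own choices, not something this column prescribes.

```python
import requests


def fetch(url, headers=None, timeout=10):
    """Return the page text, or None if the request fails."""
    try:
        res = requests.get(url, headers=headers, timeout=timeout)
        res.raise_for_status()  # turn 4xx/5xx responses into HTTPError
        return res.text
    except requests.exceptions.Timeout:
        print("Request timed out")
    except requests.exceptions.ConnectionError:
        print("Connection failed")
    except requests.exceptions.HTTPError as err:
        print("Bad status code:", err)
    except requests.exceptions.TooManyRedirects:
        print("Too many redirects")
    except requests.exceptions.RequestException as err:
        # Base class of all Requests exceptions, e.g. a malformed URL
        print("Request failed:", err)
    return None
```

Calling `fetch('https://icp.aizhan.com/www.csdn.net/')` then returns the HTML on success and `None` otherwise. Timeout is listed before ConnectionError because a connection timeout belongs to both families.
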
## The BeautifulSoup library

Above we fetched the page source with the requests library. But how do we accurately extract the information we need from it?

We bring the BeautifulSoup library into the earlier code.

```python
import requests
from bs4 import BeautifulSoup

headers = {
    'User-Agent': ('Mozilla/5.0 (Windows NT 10.0; Win64; x64) '
                   'AppleWebKit/537.36 (KHTML, like Gecko) '
                   'Chrome/76.0.3809.100 Safari/537.36')
}
# Exception handling
try:
    # Look up the ICP registration info for CSDN
    res = requests.get('https://icp.aizhan.com/www.csdn.net/', headers=headers)
    # Parse the response
    soup = BeautifulSoup(res.text, 'html.parser')
    print(soup.prettify())
except requests.exceptions.ConnectionError:
    print("Connection failed")
```

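Again, run it from the command line:

```shell
python main.py
```

You get the page source. It is actually the same code as before, except that BS4 (*from here on we refer to the BeautifulSoup library as BS4*) has formatted it with standard indentation.

Besides the HTML parser in the Python standard library, BS4 also supports several third-party parsers:
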
| Parser | Usage | Advantages | Disadvantages |
| --- | --- | --- | --- |
| Python standard library | `BeautifulSoup(res.text, 'html.parser')` | Built in, moderate speed, good tolerance of malformed documents | Poor tolerance in older Python versions (before 3.2.2) |
| lxml HTML parser | `BeautifulSoup(res.text, 'lxml')` | Fast, good tolerance of malformed documents | Requires a C library to be installed |
| lxml XML parser | `BeautifulSoup(res.text, 'xml')` or `BeautifulSoup(res.text, 'lxml-xml')` | Fast, the only parser that supports XML | Same as above |
| html5lib | `BeautifulSoup(res.text, 'html5lib')` | Best tolerance, parses the document the way a browser does and produces valid HTML5 | Slow, requires an extra Python package |

> The official BS4 documentation recommends lxml as the parser because it is fast.

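Since lxml is a separate install, a script can prefer it and fall back to the built-in parser when it is missing. The following is a minimal sketch under that assumption; `make_soup` is a hypothetical helper, and the sample HTML is made up for the demonstration.

```python
from bs4 import BeautifulSoup, FeatureNotFound


def make_soup(html):
    """Parse with lxml when available, otherwise with html.parser."""
    try:
        return BeautifulSoup(html, 'lxml')
    except FeatureNotFound:
        # Raised by BS4 when the requested parser is not installed
        return BeautifulSoup(html, 'html.parser')


soup = make_soup('<html><head><title>demo</title></head><body></body></html>')
print(soup.title.get_text())  # demo
```
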
Once parsed, the Soup document lets you locate the elements you need with the **find(), find_all(), and select()** methods.

find and find_all are used in similar ways:

```python
soup.find_all('div', 'item')
soup.find_all('div', class_='item')  # class is a Python keyword, so BS4 uses class_
soup.find_all('div', attrs={'class': 'item'})
```

These three lines do the same thing: they all select div elements by class.

find_all and find return different types: the former returns a list of all matching elements, while the latter returns only the first matching element.

```python
soup.select('#icp-table > table > tbody > tr:nth-child(1) > td:nth-child(2)')
```

This selects content through the hierarchy of tags. In Chrome, right-click the target element, choose Copy, then Copy selector, to get the string passed to select().

> tr:nth-child(1) needs to be changed: in Python, write tr:nth-of-type(1) instead. (BS4 versions before 4.7 support only nth-of-type.)

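To try these methods without hitting a live site, here is a small self-contained sketch. The HTML snippet is invented for illustration and only loosely imitates the table on the target page.

```python
from bs4 import BeautifulSoup

# Made-up HTML that loosely imitates the lookup page
html = """
<div id="icp-table">
  <table>
    <tr><td>Organization</td><td>Example Co.</td></tr>
    <tr><td>Type</td><td>Company</td></tr>
    <tr><td>ICP number</td><td><span>ICP-0000</span></td></tr>
  </table>
</div>
<div class="item">first</div>
<div class="item">second</div>
"""
soup = BeautifulSoup(html, 'html.parser')

# find() returns the first match (or None); find_all() returns a list
first_item = soup.find('div', class_='item')
all_items = soup.find_all('div', attrs={'class': 'item'})
print(first_item.get_text())              # first
print([d.get_text() for d in all_items])  # ['first', 'second']

# select() takes a CSS selector and always returns a list
cells = soup.select('#icp-table > table > tr:nth-of-type(3) > td:nth-of-type(2) > span')
print(cells[0].get_text())                # ICP-0000
```
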
Next, let's see how to get the *organization name* and the *ICP registration number*.

```python
import requests
from bs4 import BeautifulSoup

headers = {
    'User-Agent': ('Mozilla/5.0 (Windows NT 10.0; Win64; x64) '
                   'AppleWebKit/537.36 (KHTML, like Gecko) '
                   'Chrome/76.0.3809.100 Safari/537.36')
}
# Exception handling
try:
    # Look up the ICP registration info for CSDN
    res = requests.get('https://icp.aizhan.com/www.csdn.net/', headers=headers)
    # Parse the response
    soup = BeautifulSoup(res.text, 'html.parser')
    # Get the organization name with find() and find_all()
    div = soup.find('div', attrs={'id': 'icp-table'})
    td_list = div.find_all('td')
    name_info = td_list[0].text + ":" + td_list[1].text
    print(name_info)
    # Get the ICP registration number with a CSS selector
    icp_no = soup.select('#icp-table > table > tr:nth-of-type(3) > td:nth-of-type(2) > span')[0].get_text()
    print(icp_no)
except requests.exceptions.ConnectionError:
    print("Connection failed")
```

Running it gives:

![1566049929368](https://github.com/ai-union/PythonSpyder/blob/master/img/1566049929368.png?raw=true)

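> A few points to note here:
>
> 1. Although we talk about a selector, the method to call is **select()**.
>
> 2. The copied selector cannot be used as is; it must be modified first (the tbody level is dropped and nth-child becomes nth-of-type).
>
>    What we copied:
>
>    `#icp-table > table > tbody > tr:nth-child(3) > td:nth-child(2) > span`
>
>    What actually goes into the code:
>
>    `#icp-table > table > tr:nth-of-type(3) > td:nth-of-type(2) > span`
>
> 3. We want the text, so after getting the element we still need to call a method such as get_text() to extract it.

That's all for today. As an exercise, try extracting the other fields as well. If you have questions, feel free to discuss them in the group.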
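
For the exercise, one possible approach is to read all the cells and pair them up as label and value. This is only a sketch: it assumes the table alternates label cells and value cells, which has not been verified against the live page, and `cells_to_dict` is a hypothetical helper.

```python
def cells_to_dict(cell_texts):
    """Pair up [label, value, label, value, ...] into a dict."""
    cleaned = [text.strip() for text in cell_texts]
    return dict(zip(cleaned[0::2], cleaned[1::2]))


# Stand-in for [td.get_text() for td in td_list] from the code above
sample = [' Organization ', 'Example Co.', 'ICP number', ' ICP-0000 ']
info = cells_to_dict(sample)
print(info['ICP number'])  # ICP-0000
```
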


--------------------------------------------------------------------------------
/md/13-第十三周-如何利用爬取的数据.md:
--------------------------------------------------------------------------------
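# Column Week 13: How to Use the Scraped Data

We worked hard to collect data in all sorts of ways. So what comes next? What else is there to do?

- Use the downloaded images as training data for machine learning.
- Use scraped product or job listings for data analysis, to see what sells well and which positions pay the most.
- Use the data to build your own website, for example a search site of some kind.

> For machine learning and data analysis, see our other column articles.

Today we'll look at how to use Django to quickly build a website with both a front end and an admin back end.
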
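## Setting up the environment

- In your working directory, create a folder named Pyweb to hold the code. You can name it anything you like, but **avoid Chinese characters and special symbols**.

- Inside Pyweb, create a folder named env. This folder holds the virtual environment and gives it its name. Create the virtual environment with the following commands.

  ```shell
  cd env
  python -m venv .
  ```

  > Note the space between venv and the dot.
  >
  > This shows how to do it on Windows; for other operating systems, please look it up. Where ls is used, dir works as well.

  ![image-20191026224553401](https://github.com/ai-union/PythonSpyder/blob/master/img/image-20191026224553401.png?raw=true)
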
- Activate the virtual environment

  ```shell
  Scripts\activate
  ```

  ![image-20191026224709438](https://github.com/ai-union/PythonSpyder/blob/master/img/image-20191026224709438.png?raw=true)

  After activation succeeds, you will see an extra prefix at the start of the command prompt.

  > Type deactivate to leave the virtual environment.

- Install the Django library

  Inside the virtual environment, run the following command to install the Django framework:

  ```shell
  pip install django
  ```

  If the download is slow, switch to a mirror in China:

  ```shell
  pip install django -i https://pypi.mirrors.ustc.edu.cn/simple/
  ```

  ![image-20191026225204779](https://github.com/ai-union/PythonSpyder/blob/master/img/image-20191026225204779.png?raw=true)

The environment is now ready, so next we create our website. All the steps below are carried out inside the virtual environment.

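The `python -m venv .` command is a front end to the standard-library `venv` module, so the same step can be scripted. The sketch below creates an environment in a temporary folder purely for demonstration. The folder names are assumptions, and `with_pip=False` is used only to keep the demo quick (the command-line tool installs pip by default).

```python
import tempfile
import venv
from pathlib import Path

# Throwaway location for the demo; in the tutorial this would be Pyweb/env
base = Path(tempfile.mkdtemp())
env_dir = base / "env"

# Create the environment without bootstrapping pip to keep it fast
venv.create(env_dir, with_pip=False)

# Every virtual environment contains a pyvenv.cfg marker file
print((env_dir / "pyvenv.cfg").exists())  # True
```
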
## Creating the site

- To keep the code organized, go back to the Pyweb directory and create a Django site called mysite with the following command.

  ```shell
  django-admin startproject mysite
  ```

  ![image-20191026225551121](https://github.com/ai-union/PythonSpyder/blob/master/img/image-20191026225551121.png?raw=true)

You can see that a mysite directory is generated automatically, with the following structure:

![image-20191026225959992](https://github.com/ai-union/PythonSpyder/blob/master/img/image-20191026225959992.png?raw=true)

- With the project created, create an app named movie. Run this from the mysite directory that contains manage.py:

  ```shell
  python manage.py startapp movie
  ```

  After it is created, the directory structure looks like this:

  ![image-20191026230235017](https://github.com/ai-union/PythonSpyder/blob/master/img/image-20191026230235017.png?raw=true)

- Once the app is created, we need to add it to the project. Edit settings.py:

  ```python
  INSTALLED_APPS = [
      'django.contrib.admin',
      'django.contrib.auth',
      'django.contrib.contenttypes',
      'django.contrib.sessions',
      'django.contrib.messages',
      'django.contrib.staticfiles',
      'movie',  # the newly added app
  ]
  ```

- After adding the app, we also need to create a database table for it, to store the scraped data or whatever we want to display. Edit models.py in the movie directory:

  ```python
  from django.db import models

  # Create your models here.
  class Movie(models.Model):
      name = models.CharField(max_length=50, verbose_name='Movie name')
      url = models.URLField(verbose_name='Movie URL', max_length=200)
      desc = models.TextField(verbose_name='Movie description')

      def __str__(self):
          return self.name

      class Meta:
          verbose_name = 'Movie info'
          verbose_name_plural = 'Movie info'
  ```

  This gives the model three fields and sets its display name in the admin site to "Movie info".

- Register the model with the admin site so that it is easy to manage later. Edit admin.py:

  ```python
  from django.contrib import admin
  from .models import Movie

  # Register your models here.
  admin.site.register(Movie)
  ```

  With that done, one last step remains: creating the database and the administrator account.

  - Sync the database

    ```shell
    python manage.py makemigrations
    ```

    ![image-20191026231439674](https://github.com/ai-union/PythonSpyder/blob/master/img/image-20191026231439674.png?raw=true)

    ```shell
    python manage.py migrate
    ```

    ![image-20191026231525926](https://github.com/ai-union/PythonSpyder/blob/master/img/image-20191026231525926.png?raw=true)

    Notice that besides the table for our own app there are many others. These are tables that come with the Django framework; don't worry about them for now.

  - Create the administrator account

    ```shell
    python manage.py createsuperuser
    ```

    ![image-20191026231748209](https://github.com/ai-union/PythonSpyder/blob/master/img/image-20191026231748209.png?raw=true)

    After entering the command, just follow the prompts step by step.

- Start the site

  ```shell
  python manage.py runserver
  ```

  ![image-20191026231937842](https://github.com/ai-union/PythonSpyder/blob/master/img/image-20191026231937842.png?raw=true)

  Once it has started, enter the address shown in the red box (by default http://127.0.0.1:8000/) in the browser. You will see the following page, which means the site has been created successfully.

  ![image-20191026232056812](https://github.com/ai-union/PythonSpyder/blob/master/img/image-20191026232056812.png?raw=true)

Let's take a look at the admin site. Append **/admin** to the URL and you will see this page:

![image-20191026232214264](https://github.com/ai-union/PythonSpyder/blob/master/img/image-20191026232214264.png?raw=true)

Enter the administrator username and password we just created:

![image-20191026232312105](https://github.com/ai-union/PythonSpyder/blob/master/img/image-20191026232312105.png?raw=true)

You can see that our app now appears in the admin site. Click add on the right to create a few records.

![image-20191026232535573](https://github.com/ai-union/PythonSpyder/blob/master/img/image-20191026232535573.png?raw=true)

![image-20191026232551493](https://github.com/ai-union/PythonSpyder/blob/master/img/image-20191026232551493.png?raw=true)

I filled these in by hand, but you can collect the data with a scraper and add it automatically. Also note that the database used here is SQLite; you can look up how to use it online. If anything is unclear, you are welcome to join the group and discuss.

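As a rough idea of how a scraper could fill the table instead of typing entries by hand, the sketch below saves a list of scraped items through the model's manager. `get_or_create` is Django's standard manager method. The helper name `save_movies`, the item dictionaries, and the stand-in classes used to exercise it outside a Django project are all my own illustration.

```python
def save_movies(items, model):
    """Insert each scraped item unless a row with the same url exists.

    `model` is expected to be the Movie model class.
    Returns the number of newly created rows.
    """
    created_count = 0
    for item in items:
        # get_or_create returns (object, created_flag)
        _, created = model.objects.get_or_create(
            url=item['url'],
            defaults={'name': item['name'], 'desc': item['desc']},
        )
        created_count += int(created)
    return created_count


class _FakeManager:
    """Tiny in-memory stand-in for Movie.objects, for this demo only."""

    def __init__(self):
        self.rows = {}

    def get_or_create(self, defaults=None, **lookup):
        key = lookup['url']
        if key in self.rows:
            return self.rows[key], False
        self.rows[key] = {**lookup, **(defaults or {})}
        return self.rows[key], True


class FakeMovie:
    objects = _FakeManager()


scraped = [
    {'name': 'Movie A', 'url': 'https://example.com/a', 'desc': 'first'},
    {'name': 'Movie A', 'url': 'https://example.com/a', 'desc': 'duplicate'},
]
print(save_movies(scraped, FakeMovie))  # 1
```

Inside the project (for example from `python manage.py shell`), the call would be `save_movies(scraped, Movie)` after `from movie.models import Movie`. Using the URL as the lookup key is a design choice here, so that running the scraper twice does not create duplicate rows.
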
Now that we have data, how do we display it?

## Displaying the data

- Create a template for displaying the data

  Create a folder named templates to hold the HTML templates. The resulting directory structure:

  ![image-20191026232936265](https://github.com/ai-union/PythonSpyder/blob/master/img/image-20191026232936265.png?raw=true)

With the folder created, we need to add it to the project and tell Django that this is where page templates are kept. Edit settings.py:

```python
TEMPLATES = [
    {
        'BACKEND': 'django.template.backends.django.DjangoTemplates',
        # The changed line; it needs "import os" at the top of settings.py
        'DIRS': [os.path.join(BASE_DIR, 'templates')],
        'APP_DIRS': True,
        'OPTIONS': {
            'context_processors': [
                'django.template.context_processors.debug',
                'django.template.context_processors.request',
                'django.contrib.auth.context_processors.auth',
                'django.contrib.messages.context_processors.messages',
            ],
        },
    },
]
```

- Write the business logic. Edit views.py in the movie folder:

  ```python
  from django.shortcuts import render
  from .models import Movie

  # Create your views here.
  def Index(request):
      movie_list = Movie.objects.all()  # query all records
      return render(request, 'index.html', {'movie_list': movie_list})
  ```

  index.html is the HTML file we will create shortly to display the data.

  {'movie_list': movie_list} is the context passed to the template. It tells the page what data to display; here, all the movie records we fetched are shown on the page.

- Tell the project which path should display the movie information. Edit urls.py, which sits in the same directory as settings.py:

  ```python
  from django.contrib import admin
  from django.urls import path
  from movie.views import Index

  urlpatterns = [
      path('admin/', admin.site.urls),
      path('info/', Index, name='info'),
  ]
  ```

- Create the HTML file index.html in the templates folder

  ```html
  <html>
      <head>
          <title>Movie info</title>
      </head>
      <body>
          <ul>
              {% for movie in movie_list %}
                  <li>
                      <a href="{{movie.url}}" target="_blank">{{movie.name}}</a>
                  </li>
              {% endfor %}
          </ul>
      </body>
  </html>
  ```

  The `{% %}` and `{{ }}` markers here are Django's template language. See the official documentation for details; we won't go into them here.

  Visit: http://127.0.0.1:8000/info/

  ![image-20191026234438718](https://github.com/ai-union/PythonSpyder/blob/master/img/image-20191026234438718.png?raw=true)

And with that, our data is displayed on the page.

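To make the template loop less magical, here is roughly what `{% for movie in movie_list %}` produces, written as plain Python. This is not how Django renders templates internally; it is a simplified illustration, with dictionaries standing in for the Movie objects. Like Django's `{{ ... }}` output, the values are HTML-escaped.

```python
from html import escape


def render_movie_list(movie_list):
    """Build a simplified version of the <ul> from index.html."""
    lines = ["<ul>"]
    for movie in movie_list:
        link = '<a href="{}">{}</a>'.format(
            escape(movie["url"]), escape(movie["name"])
        )
        lines.append("  <li>" + link + "</li>")
    lines.append("</ul>")
    return "\n".join(lines)


movies = [{"name": "Tom & Jerry", "url": "https://example.com/tj"}]
print(render_movie_list(movies))
```
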
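This was only a brief introduction to getting started with the Django framework. If you want to study it in depth, you can pick up a book or two, or follow our official account, where related content may appear later. If you have any questions, come and discuss them in the group.

## Final words

Thank you all for supporting this column. This is the first column I have written, and it has many shortcomings; I hope you will bear with them. It is also thanks to your support that I was able to see this column through to the end.

The goal of the column is to give you a first understanding of web scraping. To truly stand out at work or in interviews, you still need plenty of practice. We also hope you will keep following us: more interesting and more valuable content is on the way. See you there.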


--------------------------------------------------------------------------------