├── images
│   ├── 栗山未来头像_1.jpg
│   ├── 栗山未来头像_2.jpg
│   ├── 栗山未来头像_3.jpg
│   ├── 栗山未来头像_4.jpg
│   ├── 栗山未来头像_5.jpg
│   ├── 栗山未来头像_6.jpg
│   ├── 栗山未来头像_7.jpg
│   ├── 栗山未来头像_8.jpg
│   ├── 栗山未来头像_9.jpg
│   └── 栗山未来头像_10.jpg
├── src
│   └── main.py
├── LICENSE
└── README.md
/images/栗山未来头像_1.jpg:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/nnngu/BaiduImageDownload/HEAD/images/栗山未来头像_1.jpg
--------------------------------------------------------------------------------
/images/栗山未来头像_2.jpg:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/nnngu/BaiduImageDownload/HEAD/images/栗山未来头像_2.jpg
--------------------------------------------------------------------------------
/images/栗山未来头像_4.jpg:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/nnngu/BaiduImageDownload/HEAD/images/栗山未来头像_4.jpg
--------------------------------------------------------------------------------
/images/栗山未来头像_5.jpg:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/nnngu/BaiduImageDownload/HEAD/images/栗山未来头像_5.jpg
--------------------------------------------------------------------------------
/images/栗山未来头像_6.jpg:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/nnngu/BaiduImageDownload/HEAD/images/栗山未来头像_6.jpg
--------------------------------------------------------------------------------
/images/栗山未来头像_7.jpg:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/nnngu/BaiduImageDownload/HEAD/images/栗山未来头像_7.jpg
--------------------------------------------------------------------------------
/images/栗山未来头像_8.jpg:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/nnngu/BaiduImageDownload/HEAD/images/栗山未来头像_8.jpg
--------------------------------------------------------------------------------
/images/栗山未来头像_9.jpg:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/nnngu/BaiduImageDownload/HEAD/images/栗山未来头像_9.jpg
--------------------------------------------------------------------------------
/images/栗山未来头像_10.jpg:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/nnngu/BaiduImageDownload/HEAD/images/栗山未来头像_10.jpg
--------------------------------------------------------------------------------
/images/栗山未来头像_3.jpg:
--------------------------------------------------------------------------------
404 File Not Found (the image file is missing from the repository)
--------------------------------------------------------------------------------
/src/main.py:
--------------------------------------------------------------------------------
# -*- coding:utf-8 -*-
import re

import requests


def download_pic(html, keyword):
    # Extract every objURL (the full-size image address) from the page source
    pic_urls = re.findall('"objURL":"(.*?)",', html, re.S)
    i = 1
    print('Found images for keyword "' + keyword + '", starting download...')
    for each in pic_urls:
        print('Downloading image ' + str(i) + ', URL: ' + str(each))
        try:
            pic = requests.get(each, timeout=10)
        except requests.exceptions.RequestException:
            # Covers dead links, connection errors, and read timeouts
            print('[Error] this image could not be downloaded')
            continue

        path = '../images/' + keyword + '_' + str(i) + '.jpg'
        with open(path, 'wb') as fp:
            fp.write(pic.content)
        i += 1


if __name__ == '__main__':
    word = input("Input key word: ")
    url = 'http://image.baidu.com/search/flip?tn=baiduimage&ie=utf-8&word=' + word + '&ct=201326592&v=flip'
    result = requests.get(url)
    download_pic(result.text, word)
--------------------------------------------------------------------------------
/LICENSE:
--------------------------------------------------------------------------------
MIT License

Copyright (c) 2018

Permission is hereby granted, free of charge, to any person obtaining a copy
of this software and associated documentation files (the "Software"), to deal
in the Software without restriction, including without limitation the rights
to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
copies of the Software, and to permit persons to whom the Software is
furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in all
copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
SOFTWARE.
--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
# Automatically Downloading Baidu Images with a Python Crawler

## Steps for Building a Crawler

Building a crawler generally involves the following steps:

* Analyze the requirements
* Analyze the page source, with help from the browser's developer tools
* Write a regular expression or an XPath expression
* Write the actual Python crawler code

## Preview

The crawler in action:

![][1]

The folder where the images are saved:

![][2]

## Requirements Analysis

Our crawler needs to implement at least two functions: search for images, and download them automatically.

Searching for images: the most obvious source is Baidu Image search results, so let's head over to Baidu Images:

![][3]

Searching a few arbitrary keywords turns up plenty of images:

![][4]

## Analyzing the Page

Right-click and view the page source:

![][5]

The raw source is a wall of markup, and it is hard to pick out the resources we want from it.

This is where the developer tools come in. Go back to the page, open the developer tools, and use the element picker in the top-left corner (it follows the mouse).

![][6]

Click the part of the page you want to inspect, and the code pane below automatically jumps to the matching position, as shown here:

![][7]

![][8]

Copy that address and search for it in the page source from before. There it is, but now a question arises: the image has several addresses, so which one should we use? We can see thumbURL, middleURL, hoverURL, and objURL.

![][9]

Analysis shows that the first two are scaled-down versions, hoverURL is the version shown on mouse-over, and objURL should be the one we need. Open each of these URLs and you will find that objURL gives the largest and sharpest image.

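For reference, each image record in the page data carries these four fields side by side. A minimal sketch of the shape of one record (the field names are real; every URL value below is an invented placeholder):

```python
# One image record as it appears in the page data.
# Values are placeholders for illustration only.
record = {
    "thumbURL": "https://img0.example.com/thumb.jpg",    # small thumbnail
    "middleURL": "https://img0.example.com/middle.jpg",  # medium-size preview
    "hoverURL": "https://img0.example.com/hover.jpg",    # shown on mouse-over
    "objURL": "http://example.com/original.jpg",         # original full-size image
}
```
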
Now that we have found the image address, let's go through the source and check whether every objURL really points to an image.

![][10]

They all turn out to be images ending in .jpg.

## Writing the Regular Expression

```python
pic_url = re.findall('"objURL":"(.*?)",', html, re.S)
```

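Here re.S lets `.` match across newlines, and the non-greedy `(.*?)` stops at the first `",` it reaches. A quick self-contained check of the pattern on a made-up fragment of page source:

```python
import re

# A tiny, invented fragment of page source (the URL is a placeholder):
html = 'foo"objURL":"http://example.com/a.jpg","fromURL":"bar"'
print(re.findall('"objURL":"(.*?)",', html, re.S))
# ['http://example.com/a.jpg']
```
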
## Writing the Crawler Code

We use two packages here: the built-in re module for regular expressions, and requests.

```python
# -*- coding:utf-8 -*-
import re
import requests
```

Copy the Baidu image search link, fetch it with requests, and apply the regular expression we just wrote:

![][11]

```python
url = 'https://image.baidu.com/search/index?tn=baiduimage&ie=utf-8&word=%E6%A0%97%E5%B1%B1%E6%9C%AA%E6%9D%A5%E5%A4%B4%E5%83%8F&ct=201326592&ic=0&lm=-1&width=&height=&v=index'

html = requests.get(url).text
pic_url = re.findall('"objURL":"(.*?)",', html, re.S)
```
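The long `%E6...` blob in the word parameter is simply the URL-encoded search keyword, so you can build this URL for any keyword with the standard library:

```python
from urllib.parse import quote

keyword = '栗山未来头像'
print(quote(keyword))
# %E6%A0%97%E5%B1%B1%E6%9C%AA%E6%9D%A5%E5%A4%B4%E5%83%8F
```
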

There are many images, so we loop over the URLs, printing each one, and fetch it with requests. Since some of the links may be dead, we add a 10-second timeout and catch requests' exceptions so that one bad link does not crash the loop.

```python
pic_url = re.findall('"objURL":"(.*?)",', html, re.S)
i = 1
for each in pic_url:
    print(each)
    try:
        pic = requests.get(each, timeout=10)
    except requests.exceptions.RequestException:
        # Also covers read timeouts, which ConnectionError alone would miss
        print('[Error] this image could not be downloaded')
        continue
```

Next we save the images. We create an images directory ahead of time, put every image in it, and use the running counter to name the files.

```python
# 'path' avoids shadowing the built-in name 'dir'
path = '../images/' + keyword + '_' + str(i) + '.jpg'
with open(path, 'wb') as fp:
    fp.write(pic.content)
i += 1
```

## The Complete Code

```python
# -*- coding:utf-8 -*-
import re

import requests


def download_pic(html, keyword):
    # Extract every objURL (the full-size image address) from the page source
    pic_urls = re.findall('"objURL":"(.*?)",', html, re.S)
    i = 1
    print('Found images for keyword "' + keyword + '", starting download...')
    for each in pic_urls:
        print('Downloading image ' + str(i) + ', URL: ' + str(each))
        try:
            pic = requests.get(each, timeout=10)
        except requests.exceptions.RequestException:
            # Covers dead links, connection errors, and read timeouts
            print('[Error] this image could not be downloaded')
            continue

        path = '../images/' + keyword + '_' + str(i) + '.jpg'
        with open(path, 'wb') as fp:
            fp.write(pic.content)
        i += 1


if __name__ == '__main__':
    word = input("Input key word: ")
    url = 'http://image.baidu.com/search/flip?tn=baiduimage&ie=utf-8&word=' + word + '&ct=201326592&v=flip'
    result = requests.get(url)
    download_pic(result.text, word)
```

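To try it, run the script from the src directory and type a keyword at the prompt; it writes files into ../images/, so that directory must already exist, as noted above.
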
![][12]

![][13]

Notice that some images fail to display; opening their URLs confirms that they are indeed gone.

![][14]

This happens because Baidu caches some images on its own servers: we can still see them on Baidu even though the original links have gone dead.
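
If you want the script to skip such dead links instead of saving error pages, one option is to check the HTTP status code and the Content-Type header before writing the file. A minimal sketch (the helper name is our own invention, not part of the original script):

```python
import requests


def is_live_image(url):
    """Return True only if the URL still serves an actual image."""
    try:
        resp = requests.get(url, timeout=10)
    except requests.exceptions.RequestException:
        return False
    content_type = resp.headers.get('Content-Type', '')
    return resp.status_code == 200 and content_type.startswith('image')
```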

## Summary

Enjoy our first image-downloading crawler! It is not limited to Baidu images: following the same pattern, you should now be able to do a lot more, such as scraping avatars or Taobao product photos.

The complete code is on GitHub: [https://github.com/nnngu/BaiduImageDownload](https://github.com/nnngu/BaiduImageDownload)

[1]: https://www.github.com/nnngu/FigureBed/raw/master/2018/2/3/1517624440357.jpg
[2]: https://www.github.com/nnngu/FigureBed/raw/master/2018/2/3/1517624588214.jpg
[3]: https://www.github.com/nnngu/FigureBed/raw/master/2018/2/3/1517624851741.jpg
[4]: https://www.github.com/nnngu/FigureBed/raw/master/2018/2/3/1517625097976.jpg
[5]: https://www.github.com/nnngu/FigureBed/raw/master/2018/2/3/1517625636570.jpg
[6]: https://www.github.com/nnngu/FigureBed/raw/master/2018/2/3/1517626066422.jpg
[7]: https://www.github.com/nnngu/FigureBed/raw/master/2018/2/3/1517626276983.jpg
[8]: https://www.github.com/nnngu/FigureBed/raw/master/2018/2/3/1517626329451.jpg
[9]: https://www.github.com/nnngu/FigureBed/raw/master/2018/2/3/1517626739154.jpg
[10]: https://www.github.com/nnngu/FigureBed/raw/master/2018/2/3/1517627100214.jpg
[11]: https://www.github.com/nnngu/FigureBed/raw/master/2018/2/3/1517627638515.jpg
[12]: https://www.github.com/nnngu/FigureBed/raw/master/2018/2/3/1517629256979.jpg
[13]: https://www.github.com/nnngu/FigureBed/raw/master/2018/2/3/1517629346426.jpg
[14]: https://www.github.com/nnngu/FigureBed/raw/master/2018/2/3/1517629377850.jpg
--------------------------------------------------------------------------------