├── images
│   ├── 栗山未来头像_1.jpg
│   ├── 栗山未来头像_2.jpg
│   ├── 栗山未来头像_3.jpg
│   ├── 栗山未来头像_4.jpg
│   ├── 栗山未来头像_5.jpg
│   ├── 栗山未来头像_6.jpg
│   ├── 栗山未来头像_7.jpg
│   ├── 栗山未来头像_8.jpg
│   ├── 栗山未来头像_9.jpg
│   └── 栗山未来头像_10.jpg
├── src
│   └── main.py
├── LICENSE
└── README.md
/images/栗山未来头像_1.jpg:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/nnngu/BaiduImageDownload/HEAD/images/栗山未来头像_1.jpg
--------------------------------------------------------------------------------
/images/栗山未来头像_2.jpg:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/nnngu/BaiduImageDownload/HEAD/images/栗山未来头像_2.jpg
--------------------------------------------------------------------------------
/images/栗山未来头像_4.jpg:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/nnngu/BaiduImageDownload/HEAD/images/栗山未来头像_4.jpg
--------------------------------------------------------------------------------
/images/栗山未来头像_5.jpg:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/nnngu/BaiduImageDownload/HEAD/images/栗山未来头像_5.jpg
--------------------------------------------------------------------------------
/images/栗山未来头像_6.jpg:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/nnngu/BaiduImageDownload/HEAD/images/栗山未来头像_6.jpg
--------------------------------------------------------------------------------
/images/栗山未来头像_7.jpg:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/nnngu/BaiduImageDownload/HEAD/images/栗山未来头像_7.jpg
--------------------------------------------------------------------------------
/images/栗山未来头像_8.jpg:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/nnngu/BaiduImageDownload/HEAD/images/栗山未来头像_8.jpg
--------------------------------------------------------------------------------
/images/栗山未来头像_9.jpg:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/nnngu/BaiduImageDownload/HEAD/images/栗山未来头像_9.jpg
--------------------------------------------------------------------------------
/images/栗山未来头像_10.jpg:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/nnngu/BaiduImageDownload/HEAD/images/栗山未来头像_10.jpg
--------------------------------------------------------------------------------
/images/栗山未来头像_3.jpg:
--------------------------------------------------------------------------------
404 File Not Found (the image file is missing from the repository)
--------------------------------------------------------------------------------
/src/main.py:
--------------------------------------------------------------------------------
# -*- coding:utf-8 -*-
import re

import requests


def download_pic(html, keyword):
    # Extract every objURL (the full-size image address) from the page source
    pic_urls = re.findall('"objURL":"(.*?)",', html, re.S)
    i = 1
    print('Found images for keyword "' + keyword + '", starting download...')
    for each in pic_urls:
        print('Downloading image ' + str(i) + ', URL: ' + str(each))
        try:
            pic = requests.get(each, timeout=10)
        except requests.exceptions.RequestException:
            # Covers dead links, connection errors, and read timeouts
            print('[Error] this image could not be downloaded')
            continue

        path = '../images/' + keyword + '_' + str(i) + '.jpg'
        with open(path, 'wb') as fp:
            fp.write(pic.content)
        i += 1


if __name__ == '__main__':
    word = input("Input key word: ")
    url = 'http://image.baidu.com/search/flip?tn=baiduimage&ie=utf-8&word=' + word + '&ct=201326592&v=flip'
    result = requests.get(url)
    download_pic(result.text, word)
--------------------------------------------------------------------------------
/LICENSE:
--------------------------------------------------------------------------------
MIT License

Copyright (c) 2018

Permission is hereby granted, free of charge, to any person obtaining a copy
of this software and associated documentation files (the "Software"), to deal
in the Software without restriction, including without limitation the rights
to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
copies of the Software, and to permit persons to whom the Software is
furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in all
copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
SOFTWARE.
--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
# Automatically Downloading Baidu Images with a Python Crawler

## Steps for Building a Crawler

Building a crawler generally involves the following steps:

* Analyze the requirements
* Analyze the page source, with help from the browser's developer tools
* Write a regular expression or an XPath expression
* Write the actual Python crawler code

## Preview

The crawler in action:

![][1]

The folder where the images are saved:

![][2]

## Requirements Analysis

Our crawler needs to implement at least two functions: search for images, and download them automatically.

Searching for images: the most obvious source is Baidu Image search results, so let's head over to Baidu Images:

![][3]

Searching a few arbitrary keywords turns up plenty of images:

![][4]

## Analyzing the Page

Right-click and view the page source:

![][5]

The raw source is a wall of markup, and it is hard to pick out the resources we want from it.

This is where the developer tools come in. Go back to the page, open the developer tools, and use the element picker in the top-left corner (it follows the mouse).

![][6]

Click the part of the page you want to inspect, and the code pane below automatically jumps to the matching position, as shown here:

![][7]

![][8]

Copy that address and search for it in the page source from before. There it is, but now a question arises: the image has several addresses, so which one should we use? We can see thumbURL, middleURL, hoverURL, and objURL.

![][9]

Analysis shows that the first two are scaled-down versions, hoverURL is the version shown on mouse-over, and objURL should be the one we need. Open each of these URLs and you will find that objURL gives the largest and sharpest image.

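For reference, each image record in the page data carries these four fields side by side. A minimal sketch of the shape of one record (the field names are real; every URL value below is an invented placeholder):

```python
# One image record as it appears in the page data.
# Values are placeholders for illustration only.
record = {
    "thumbURL": "https://img0.example.com/thumb.jpg",    # small thumbnail
    "middleURL": "https://img0.example.com/middle.jpg",  # medium-size preview
    "hoverURL": "https://img0.example.com/hover.jpg",    # shown on mouse-over
    "objURL": "http://example.com/original.jpg",         # original full-size image
}
```
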
Now that we have found the image address, let's go through the source and check whether every objURL really points to an image.

![][10]

They all turn out to be images ending in .jpg.

## Writing the Regular Expression

```python
pic_url = re.findall('"objURL":"(.*?)",', html, re.S)
```

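Here re.S lets `.` match across newlines, and the non-greedy `(.*?)` stops at the first `",` it reaches. A quick self-contained check of the pattern on a made-up fragment of page source:

```python
import re

# A tiny, invented fragment of page source (the URL is a placeholder):
html = 'foo"objURL":"http://example.com/a.jpg","fromURL":"bar"'
print(re.findall('"objURL":"(.*?)",', html, re.S))
# ['http://example.com/a.jpg']
```
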
## Writing the Crawler Code

We use two packages here: the built-in re module for regular expressions, and requests.

```python
# -*- coding:utf-8 -*-
import re
import requests
```

Copy the Baidu image search link, fetch it with requests, and apply the regular expression we just wrote:

![][11]

```python
url = 'https://image.baidu.com/search/index?tn=baiduimage&ie=utf-8&word=%E6%A0%97%E5%B1%B1%E6%9C%AA%E6%9D%A5%E5%A4%B4%E5%83%8F&ct=201326592&ic=0&lm=-1&width=&height=&v=index'

html = requests.get(url).text
pic_url = re.findall('"objURL":"(.*?)",', html, re.S)
```
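The long `%E6...` blob in the word parameter is simply the URL-encoded search keyword, so you can build this URL for any keyword with the standard library:

```python
from urllib.parse import quote

keyword = '栗山未来头像'
print(quote(keyword))
# %E6%A0%97%E5%B1%B1%E6%9C%AA%E6%9D%A5%E5%A4%B4%E5%83%8F
```
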

There are many images, so we loop over the URLs, printing each one, and fetch it with requests. Since some of the links may be dead, we add a 10-second timeout and catch requests' exceptions so that one bad link does not crash the loop.

```python
pic_url = re.findall('"objURL":"(.*?)",', html, re.S)
i = 1
for each in pic_url:
    print(each)
    try:
        pic = requests.get(each, timeout=10)
    except requests.exceptions.RequestException:
        # Also covers read timeouts, which ConnectionError alone would miss
        print('[Error] this image could not be downloaded')
        continue
```

Next we save the images. We create an images directory ahead of time, put every image in it, and use the running counter to name the files.

```python
# 'path' avoids shadowing the built-in name 'dir'
path = '../images/' + keyword + '_' + str(i) + '.jpg'
with open(path, 'wb') as fp:
    fp.write(pic.content)
i += 1
```

## The Complete Code

```python
# -*- coding:utf-8 -*-
import re

import requests


def download_pic(html, keyword):
    # Extract every objURL (the full-size image address) from the page source
    pic_urls = re.findall('"objURL":"(.*?)",', html, re.S)
    i = 1
    print('Found images for keyword "' + keyword + '", starting download...')
    for each in pic_urls:
        print('Downloading image ' + str(i) + ', URL: ' + str(each))
        try:
            pic = requests.get(each, timeout=10)
        except requests.exceptions.RequestException:
            # Covers dead links, connection errors, and read timeouts
            print('[Error] this image could not be downloaded')
            continue

        path = '../images/' + keyword + '_' + str(i) + '.jpg'
        with open(path, 'wb') as fp:
            fp.write(pic.content)
        i += 1


if __name__ == '__main__':
    word = input("Input key word: ")
    url = 'http://image.baidu.com/search/flip?tn=baiduimage&ie=utf-8&word=' + word + '&ct=201326592&v=flip'
    result = requests.get(url)
    download_pic(result.text, word)
```

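To try it, run the script from the src directory and type a keyword at the prompt; it writes files into ../images/, so that directory must already exist, as noted above.
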
![][12]

![][13]

Notice that some images fail to display; opening their URLs confirms that they are indeed gone.

![][14]

This happens because Baidu caches some images on its own servers: we can still see them on Baidu even though the original links have gone dead.
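
If you want the script to skip such dead links instead of saving error pages, one option is to check the HTTP status code and the Content-Type header before writing the file. A minimal sketch (the helper name is our own invention, not part of the original script):

```python
import requests


def is_live_image(url):
    """Return True only if the URL still serves an actual image."""
    try:
        resp = requests.get(url, timeout=10)
    except requests.exceptions.RequestException:
        return False
    content_type = resp.headers.get('Content-Type', '')
    return resp.status_code == 200 and content_type.startswith('image')
```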

## Summary

Enjoy our first image-downloading crawler! It is not limited to Baidu images: following the same pattern, you should now be able to do a lot more, such as scraping avatars or Taobao product photos.

The complete code is on GitHub: [https://github.com/nnngu/BaiduImageDownload](https://github.com/nnngu/BaiduImageDownload)

[1]: https://www.github.com/nnngu/FigureBed/raw/master/2018/2/3/1517624440357.jpg
[2]: https://www.github.com/nnngu/FigureBed/raw/master/2018/2/3/1517624588214.jpg
[3]: https://www.github.com/nnngu/FigureBed/raw/master/2018/2/3/1517624851741.jpg
[4]: https://www.github.com/nnngu/FigureBed/raw/master/2018/2/3/1517625097976.jpg
[5]: https://www.github.com/nnngu/FigureBed/raw/master/2018/2/3/1517625636570.jpg
[6]: https://www.github.com/nnngu/FigureBed/raw/master/2018/2/3/1517626066422.jpg
[7]: https://www.github.com/nnngu/FigureBed/raw/master/2018/2/3/1517626276983.jpg
[8]: https://www.github.com/nnngu/FigureBed/raw/master/2018/2/3/1517626329451.jpg
[9]: https://www.github.com/nnngu/FigureBed/raw/master/2018/2/3/1517626739154.jpg
[10]: https://www.github.com/nnngu/FigureBed/raw/master/2018/2/3/1517627100214.jpg
[11]: https://www.github.com/nnngu/FigureBed/raw/master/2018/2/3/1517627638515.jpg
[12]: https://www.github.com/nnngu/FigureBed/raw/master/2018/2/3/1517629256979.jpg
[13]: https://www.github.com/nnngu/FigureBed/raw/master/2018/2/3/1517629346426.jpg
[14]: https://www.github.com/nnngu/FigureBed/raw/master/2018/2/3/1517629377850.jpg
--------------------------------------------------------------------------------