├── .gitignore ├── 201802111155445124jhQyer.yasuotu.gif ├── LICENSE ├── README.md ├── Selection_131.png ├── Selection_132.png ├── __init__.py ├── _request.py ├── config.py ├── custom_get_ip ├── __init__.py └── get_ip_from_peauland.py ├── db_method.py ├── delete_not_update_ip.py ├── get_proxies_base_spider.py ├── proxy_api.py ├── proxy_basic_config.py ├── requirements.txt ├── work_spider.py ├── 个人公众号.gif ├── 修改记录.md └── 数据库截图.png /.gitignore: -------------------------------------------------------------------------------- 1 | *.pyc 2 | -------------------------------------------------------------------------------- /201802111155445124jhQyer.yasuotu.gif: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/xiaosimao/IP_POOL/b438fb2f15f7ebebbe561aa1b4f9cf0eec06e0f9/201802111155445124jhQyer.yasuotu.gif -------------------------------------------------------------------------------- /LICENSE: -------------------------------------------------------------------------------- 1 | Apache License 2 | Version 2.0, January 2004 3 | http://www.apache.org/licenses/ 4 | 5 | TERMS AND CONDITIONS FOR USE, REPRODUCTION, AND DISTRIBUTION 6 | 7 | 1. Definitions. 8 | 9 | "License" shall mean the terms and conditions for use, reproduction, 10 | and distribution as defined by Sections 1 through 9 of this document. 11 | 12 | "Licensor" shall mean the copyright owner or entity authorized by 13 | the copyright owner that is granting the License. 14 | 15 | "Legal Entity" shall mean the union of the acting entity and all 16 | other entities that control, are controlled by, or are under common 17 | control with that entity. For the purposes of this definition, 18 | "control" means (i) the power, direct or indirect, to cause the 19 | direction or management of such entity, whether by contract or 20 | otherwise, or (ii) ownership of fifty percent (50%) or more of the 21 | outstanding shares, or (iii) beneficial ownership of such entity. 22 | 23 | "You" (or "Your") shall mean an individual or Legal Entity 24 | exercising permissions granted by this License. 25 | 26 | "Source" form shall mean the preferred form for making modifications, 27 | including but not limited to software source code, documentation 28 | source, and configuration files. 29 | 30 | "Object" form shall mean any form resulting from mechanical 31 | transformation or translation of a Source form, including but 32 | not limited to compiled object code, generated documentation, 33 | and conversions to other media types. 34 | 35 | "Work" shall mean the work of authorship, whether in Source or 36 | Object form, made available under the License, as indicated by a 37 | copyright notice that is included in or attached to the work 38 | (an example is provided in the Appendix below). 39 | 40 | "Derivative Works" shall mean any work, whether in Source or Object 41 | form, that is based on (or derived from) the Work and for which the 42 | editorial revisions, annotations, elaborations, or other modifications 43 | represent, as a whole, an original work of authorship. For the purposes 44 | of this License, Derivative Works shall not include works that remain 45 | separable from, or merely link (or bind by name) to the interfaces of, 46 | the Work and Derivative Works thereof. 
47 | 48 | "Contribution" shall mean any work of authorship, including 49 | the original version of the Work and any modifications or additions 50 | to that Work or Derivative Works thereof, that is intentionally 51 | submitted to Licensor for inclusion in the Work by the copyright owner 52 | or by an individual or Legal Entity authorized to submit on behalf of 53 | the copyright owner. For the purposes of this definition, "submitted" 54 | means any form of electronic, verbal, or written communication sent 55 | to the Licensor or its representatives, including but not limited to 56 | communication on electronic mailing lists, source code control systems, 57 | and issue tracking systems that are managed by, or on behalf of, the 58 | Licensor for the purpose of discussing and improving the Work, but 59 | excluding communication that is conspicuously marked or otherwise 60 | designated in writing by the copyright owner as "Not a Contribution." 61 | 62 | "Contributor" shall mean Licensor and any individual or Legal Entity 63 | on behalf of whom a Contribution has been received by Licensor and 64 | subsequently incorporated within the Work. 65 | 66 | 2. Grant of Copyright License. Subject to the terms and conditions of 67 | this License, each Contributor hereby grants to You a perpetual, 68 | worldwide, non-exclusive, no-charge, royalty-free, irrevocable 69 | copyright license to reproduce, prepare Derivative Works of, 70 | publicly display, publicly perform, sublicense, and distribute the 71 | Work and such Derivative Works in Source or Object form. 72 | 73 | 3. Grant of Patent License. Subject to the terms and conditions of 74 | this License, each Contributor hereby grants to You a perpetual, 75 | worldwide, non-exclusive, no-charge, royalty-free, irrevocable 76 | (except as stated in this section) patent license to make, have made, 77 | use, offer to sell, sell, import, and otherwise transfer the Work, 78 | where such license applies only to those patent claims licensable 79 | by such Contributor that are necessarily infringed by their 80 | Contribution(s) alone or by combination of their Contribution(s) 81 | with the Work to which such Contribution(s) was submitted. If You 82 | institute patent litigation against any entity (including a 83 | cross-claim or counterclaim in a lawsuit) alleging that the Work 84 | or a Contribution incorporated within the Work constitutes direct 85 | or contributory patent infringement, then any patent licenses 86 | granted to You under this License for that Work shall terminate 87 | as of the date such litigation is filed. 88 | 89 | 4. Redistribution. 
You may reproduce and distribute copies of the 90 | Work or Derivative Works thereof in any medium, with or without 91 | modifications, and in Source or Object form, provided that You 92 | meet the following conditions: 93 | 94 | (a) You must give any other recipients of the Work or 95 | Derivative Works a copy of this License; and 96 | 97 | (b) You must cause any modified files to carry prominent notices 98 | stating that You changed the files; and 99 | 100 | (c) You must retain, in the Source form of any Derivative Works 101 | that You distribute, all copyright, patent, trademark, and 102 | attribution notices from the Source form of the Work, 103 | excluding those notices that do not pertain to any part of 104 | the Derivative Works; and 105 | 106 | (d) If the Work includes a "NOTICE" text file as part of its 107 | distribution, then any Derivative Works that You distribute must 108 | include a readable copy of the attribution notices contained 109 | within such NOTICE file, excluding those notices that do not 110 | pertain to any part of the Derivative Works, in at least one 111 | of the following places: within a NOTICE text file distributed 112 | as part of the Derivative Works; within the Source form or 113 | documentation, if provided along with the Derivative Works; or, 114 | within a display generated by the Derivative Works, if and 115 | wherever such third-party notices normally appear. The contents 116 | of the NOTICE file are for informational purposes only and 117 | do not modify the License. You may add Your own attribution 118 | notices within Derivative Works that You distribute, alongside 119 | or as an addendum to the NOTICE text from the Work, provided 120 | that such additional attribution notices cannot be construed 121 | as modifying the License. 122 | 123 | You may add Your own copyright statement to Your modifications and 124 | may provide additional or different license terms and conditions 125 | for use, reproduction, or distribution of Your modifications, or 126 | for any such Derivative Works as a whole, provided Your use, 127 | reproduction, and distribution of the Work otherwise complies with 128 | the conditions stated in this License. 129 | 130 | 5. Submission of Contributions. Unless You explicitly state otherwise, 131 | any Contribution intentionally submitted for inclusion in the Work 132 | by You to the Licensor shall be under the terms and conditions of 133 | this License, without any additional terms or conditions. 134 | Notwithstanding the above, nothing herein shall supersede or modify 135 | the terms of any separate license agreement you may have executed 136 | with Licensor regarding such Contributions. 137 | 138 | 6. Trademarks. This License does not grant permission to use the trade 139 | names, trademarks, service marks, or product names of the Licensor, 140 | except as required for reasonable and customary use in describing the 141 | origin of the Work and reproducing the content of the NOTICE file. 142 | 143 | 7. Disclaimer of Warranty. Unless required by applicable law or 144 | agreed to in writing, Licensor provides the Work (and each 145 | Contributor provides its Contributions) on an "AS IS" BASIS, 146 | WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or 147 | implied, including, without limitation, any warranties or conditions 148 | of TITLE, NON-INFRINGEMENT, MERCHANTABILITY, or FITNESS FOR A 149 | PARTICULAR PURPOSE. 
You are solely responsible for determining the 150 | appropriateness of using or redistributing the Work and assume any 151 | risks associated with Your exercise of permissions under this License. 152 | 153 | 8. Limitation of Liability. In no event and under no legal theory, 154 | whether in tort (including negligence), contract, or otherwise, 155 | unless required by applicable law (such as deliberate and grossly 156 | negligent acts) or agreed to in writing, shall any Contributor be 157 | liable to You for damages, including any direct, indirect, special, 158 | incidental, or consequential damages of any character arising as a 159 | result of this License or out of the use or inability to use the 160 | Work (including but not limited to damages for loss of goodwill, 161 | work stoppage, computer failure or malfunction, or any and all 162 | other commercial damages or losses), even if such Contributor 163 | has been advised of the possibility of such damages. 164 | 165 | 9. Accepting Warranty or Additional Liability. While redistributing 166 | the Work or Derivative Works thereof, You may choose to offer, 167 | and charge a fee for, acceptance of support, warranty, indemnity, 168 | or other liability obligations and/or rights consistent with this 169 | License. However, in accepting such obligations, You may act only 170 | on Your own behalf and on Your sole responsibility, not on behalf 171 | of any other Contributor, and only if You agree to indemnify, 172 | defend, and hold each Contributor harmless for any liability 173 | incurred by, or claims asserted against, such Contributor by reason 174 | of your accepting any such warranty or additional liability. 175 | 176 | END OF TERMS AND CONDITIONS 177 | 178 | APPENDIX: How to apply the Apache License to your work. 179 | 180 | To apply the Apache License to your work, attach the following 181 | boilerplate notice, with the fields enclosed by brackets "[]" 182 | replaced with your own identifying information. (Don't include 183 | the brackets!) The text should be enclosed in the appropriate 184 | comment syntax for the file format. We also recommend that a 185 | file or class name and description of purpose be included on the 186 | same "printed page" as the copyright notice for easier 187 | identification within third-party archives. 188 | 189 | Copyright [yyyy] [name of copyright owner] 190 | 191 | Licensed under the Apache License, Version 2.0 (the "License"); 192 | you may not use this file except in compliance with the License. 193 | You may obtain a copy of the License at 194 | 195 | http://www.apache.org/licenses/LICENSE-2.0 196 | 197 | Unless required by applicable law or agreed to in writing, software 198 | distributed under the License is distributed on an "AS IS" BASIS, 199 | WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. 200 | See the License for the specific language governing permissions and 201 | limitations under the License. 
202 | -------------------------------------------------------------------------------- /README.md: -------------------------------------------------------------------------------- 1 | 2 | # 爬虫代理 3 | # 个人公众号 4 |  5 | 6 | # 更新 7 | 时间:2022年10月17日 8 | 9 | 最近后台收到了很多小伙伴的反馈,说现在免费的ip可以使用的时长太短,而且免费的多是透明代理,不安全,希望我推荐一个商业级的代理IP,于是有了这个更新,有需要的小伙伴可以试用一下。 10 | 11 | ### 地址 12 | - [IPIDEA官网地址](https://share.ipidea.net/VLLYbk0KpY2YLdj) 13 | 14 | ### 产品介绍(来自官网) 15 | ``` 16 | 1、覆盖220+的国家和地区,9000万真实住宅IP资源,汇聚成大规模代理服务池。 17 | 2、提供动态住宅代理、静态住宅代理、数据中心、移动代理等多种解决方案,满足电子商务、市场调查、抓取索引、网站测试、广告验证、seo监控优化等多个业务场景。 18 | 3、支持HTTP/HTTPS/Socks5协议 19 | 4、真实住宅IP,支持从指定国家城市访问目标网站,隐藏真实网络环境,保护隐私,24小时持续过滤并更新,IP纯净度高,快速响应,无限并发,99.9%的成功率,确保高效稳定连接,让您的业务得心应手 20 | 5、支持海量IP免费试用 21 | ``` 22 | 23 | # 环境: python 2.7 24 | ### 特点 25 | - 通过配置文件,即可对IP代理网站进行爬取 26 | - 构建web服务,提供api接口 27 | - 获取与检测IP完全自动化 28 | - 可根据IP代理网站的特殊性,自行扩展获取、检测脚本 29 | 30 | ### 数据库可用IP截图 31 | ******** 32 |  33 | ******** 34 | 35 | ******** 36 | > **insert_time** 37 | 插入时间, 会有一个专门的脚本根据插入时间与现在时间的时间差来将已经存在一段时间的ip取出来,利用target_url对其重新检测, 38 | 如果无法使用,则会将其删除。 39 | **ip** 40 | 可用IP 41 | **response_time** 42 | 利用代理IP去访问target_url时的响应时间, 利用requests库的elapsed属性获得, 数据库中的单位为秒 43 | **source** 44 | 代理IP的来源 45 | **target_url** 46 | 目标网站, 如你想获得可以访问豆瓣的IP, 那么豆瓣网的网址就是target_url 47 | 48 | ******** 49 | ### 注:请求的框架基于我自己写的一个小框架, 地址在:https://github.com/xiaosimao/AiSpider 50 | #### 欢迎star以及交流, 我的微信在上面的地址中的readme.md中, 加的话注明: 来自github。 51 | 52 | #### 毕竟是免费IP,质量没法保证,所以还是尽量多找几个网站吧。 53 | ******** 54 | 55 | ## 1. **使用的方法** 56 | > 到上面提到的请求框架地址中下载框架到本地, 然后在work_spider.py,delete_not_update_ip.py, 57 | get_proxies_base_spider.py 中对sys.path.append(...)中的地址进行更换,更换为你本地框架所在的地址,现在脚本中写的是我 58 | 这里的路径,路径中尽量不要带中文。 59 | 60 | **上面这一步是必须的** 61 | **下面这一步也是必须的** 62 | `pip install -r requirements.txt` 63 | 64 | 65 | * 1.1 66 | 在proxy_basic_config.py中对代理IP网站进行配置,如果你只想试试,那么就不用配置了,已经有了几个可以用的配置项。 67 | * 1.2 68 | 在config.py中对请求框架进行配置。如果你还是只想试运行,那么这里也不用配置了,现在的配置可以使用。 69 | * 1.3 70 | 确保已经正确安装mongodb数据库。这一步必须安装正确,否则没法使用。 71 | * 1.4 72 | 如果你的网站特殊, 那么请自定义解析函数, 并在第一步中配置正确 73 | * 1.5 74 | 执行work_spider.py脚本, 开始抓取,检测,入库。如果配置都没有改而且上面的步骤都正确, 那么数据在你本地mongodb数据库free_ip库的proxy集合中。 75 | * 1.6 76 | 执行proxy_api.py脚本, 开启API服务 77 | * 1.7 78 | 执行delete_not_update_ip.py脚本, 对超过存活时间阈值的IP进行重新检测, 删除或更新插入时间 79 | 80 | * 如果你要爬的代理IP网站不是很变态的那种, 基本的请求加xpath和re就能获得ip的话, 那么你只要简简单单地 81 | 对上面提到的配置文件进行简单的配置就行了,配置文件中各个字段的具体含义在下面有详细解释,也可以先参考紧接着给出的示意配置。 82 |
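下面给出一个 `url_parse_dict` 配置项的最小示意,仿照仓库中已有的 xicidaili 配置写成;其中 example_site 的网址与 xpath 均为假设的占位,并非真实可用的站点,各字段的具体含义见下文第 3 部分:

```python
# proxy_basic_config.py 中 url_parse_dict 的一个示意配置项
# 注意: example_site 的网址与 xpath 仅为假设的占位示例
url_parse_dict = {
    'example_site': {
        'status': 'active',          # 不想爬取该网站时, 改为任意非 active 的值
        'request_method': 'get',     # 为 post 时必须同时提供 submit_data
        'url': ['http://www.example.com/free/%d.html' % page for page in range(1, 3)],
        'parse_type': 'xpath',       # 默认解析器支持 xpath 与 re 两种方式
        'ip_port_together': False,   # ip 与端口不在同一个字段中
        'parse_method': {
            'ip_address': '//table//tr/td[1]/text()',
            'ip_port': '//table//tr/td[2]/text()',
        },
        'parse_func': 'system',      # 使用默认解析函数 parse_to_get_ip
    },
}
```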
83 | ******** 84 | 85 | ## 2.工作流程 86 | * 定时获取IP; 87 | * 根据目标网站,分别对IP进行检测; 88 | * 检测成功的入库; 89 | * 构建API; 90 | * 对入库的IP再次检测,删除不可用; 91 | ******** 92 | ## 3. 具体介绍 93 | ******** 94 | ### 获取IP及检测IP并入库 95 | * **涉及脚本:** 96 | >work_spider.py 97 | get_proxies_base_spider.py 98 | _request.py 99 | proxy_basic_config.py 100 | custom_get_ip下的脚本 101 | ******** 102 | 103 | * **work_spider.py** 104 | `入口脚本` 105 | > 从 get_proxies_base_spider.py 中继承SpiderMain。 在WorkSpider类中重写父类的run方法,在其中可以 106 | 传入自定义的请求函数, 若没有, 则使用默认的上述框架中的请求函数。 107 | ******** 108 | * **get_proxies_base_spider.py** 109 | `主程序` 110 | > 提供了默认的ip获取, 检测, 入库函数。其中,解析函数可以自定义。 111 | 112 | 包括的方法: 113 | >**run** 114 | 启动函数, 启动请求框架的线程池, 调用craw函数; 115 | **craw** 116 | 开始爬取函数, 对配置中的配置项目进行读取,发起初始请求; 117 | **get_and_check** 118 | 获取返回的源代码, 并调用解析函数; 119 | **parse_to_get_ip** 120 | 默认的解析函数, 用户可通过配置自定义。解析出原始IP, 然后调用检测函数; 121 | **start_check** 122 | 开始检测函数, 其中检测的函数为_request.py中的valid函数; 123 | **save_ip** 124 | 检测成功的入库函数, 将检测成功的数据放入mongodb数据库中; 125 | ******** 126 | 127 | * **_request.py** 128 | `检测脚本程序` 129 | 包括的方法: 130 | > **valid** 131 | 检测函数, 因为要符合请求框架的要求,所以自定义的这个函数需要返回两个值, 在这里返回了 **响应时间与检测的IP**。 132 | ******** 133 | 134 | * **proxy_basic_config.py** 135 | `代理网页设置脚本` 136 | 包括的字段: 137 | >**target_urls** 138 | 待检测的目标网站, list类型; 139 | **collection_name** 140 | mongodb数据库的集合名, 字符串类型; 141 | **over_time** 142 | 数据库中IP存活时间阈值, 超过即对其重新检测, int类型, 默认1800秒; 143 | **url_parse_dict** 144 | 代理IP网址配置字典, 字典类型 145 | 146 | 下面将对这一配置字典作着重介绍: 147 | >(1) 格式 148 | {ip_website_name: value} 149 | **ip_website** 150 | 代理IP网站的名字, 如xicidaili 151 | **value** 152 | 其他设置值, 字典类型 153 | 154 | > (2) value 155 | 因为value是一个内嵌的字典,所以其也有自己的键和值。 156 | \[1] **status** 157 | 代理状态, 若不想爬取此网站,可以将status设置为非active的任意值, 必须; 158 | \[2] **request_method** 159 | 请求方法, 当为post的时候, 必须定义提交的submit_data, 否则会报错。默认为get, 非必须。 160 | \[3] **submit_data** 161 | 提交的数据。因项目的特殊性,如需要提交的数据里面带着网页页码字段,为了获得 162 | 多页的数据, 这里就需要定义多个post_data数据,然后分别进行请求。所以在这里将submit_data定义为列表, 里面的数据为字典格式。 163 | \[4] **url** 164 | 代理IP网址, 列表类型, 把需要爬取的页面都放在里面,必须。 165 | \[5] **parse_type** 166 | 解析类型, 默认提供: xpath, re。 如果你需要别的解析方式, 那么请定义自己的parse_func。默认的解析函数即上文中的\[parse_to_get_ip], 167 | 只提供xpath与re两种解析获取IP的方式。当使用默认解析器的时候, 必须定义。 168 | \[6] **ip_port_together** 169 | ip和port是否在一个字段中, 若是则设置为True, 不是则设置为False, 当使用re进行解析和使用了自定义的解析函数时,非必须。 170 | 若地址与端口在一起,且通过xpath解析时,则建议 \[parse_method] 中的key为ip_address_and_port 171 | 若地址与端口不在一起,且通过xpath解析时,则建议 \[parse_method] 中的key为ip_address, ip_port 172 | \[7] **parse_method** 173 | 解析的具体路径, 上一条有了部分的解释, 当使用默认解析器时,必须定义。 174 | 当通过re进行解析时, ip_port_together可以为任意的值,parse_method中只有一个键: _pattern 175 | \[8] **parse_func** 176 | 解析函数, 默认值为system, 当需要使用自定义的解析函数的时候, 需要显式地定义该字段为自定义的解析函数, 177 | 自定义的解析函数必须要有四个参数, 分别为value, html_content, parse_type, website_name, 后面有详细解释 178 | \[9] **header** 179 | 因网址较多, 所以在这里可以自定义浏览器头, 非必须。 180 | ******** 181 | 182 | * **custom_get_ip下的文件** 183 | `自定义解析函数` 184 | 在这里以get_ip_from_peauland.py为例 185 | > 通常, 如果你因为代理IP网站的特殊性而需要自定义一个解析函数, 那么你可以在custom_get_ip文件夹下定义脚本。 186 | 定义的脚本中必须定义解析函数, 并在配置脚本proxy_basic_config.py中该网站对应的设置中将parse_func设置为你定义的解析函数 187 | 188 | > 在这里最好为需要自定义解析的网站写一个独立的脚本, 这样利于维护, 将需要自定义的值都放在其中, 然后在设置 189 | 脚本中进行导入和定义。 190 | 191 | > 具体的解释可以参考上面脚本中的注释。 192 | ******** 193 | 194 | ## 4. API 195 | 为了方便地获取可用的代理IP,在这里利用FLASK简单地起了一个服务, 可以获得全部的IP、随机的一个IP、全部IP的数量, 以及删除一个IP 196 | * 涉及脚本 197 | > db_method.py 198 | proxy_api.py 199 | 200 | * **db_method.py** 201 | > 封装一些数据库的方法, 有 get_one, get_all, delete_one, total 202 | 203 | * **proxy_api.py** 204 | `接口服务主脚本` 205 | > 默认启动**22555**端口。 206 | 访问 http://0.0.0.0:22555/, 可以看到基本的api调用及介绍 207 | 访问 http://0.0.0.0:22555/count/, 可以看到数据库中现存的IP数量 208 | 访问 http://0.0.0.0:22555/get_one/, 可以看到随机返回了一条IP 209 | 访问 http://0.0.0.0:22555/get_all/, 可以看到返回了所有可以使用的IP 210 | 访问 http://0.0.0.0:22555/delete/?ip=127.0.0.1:8080&target_url=https://www.baidu.com, 删除你想删除的ip, 两个参数都是必须的, 否则会报错 211 |
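下面是用 requests 调用上述接口的一个简单示意,假设 proxy_api.py 已在本机 22555 端口启动,示例中的 ip 与 target_url 仅为占位:

```python
# -*- coding: utf-8 -*-
# 调用代理池 API 的示意脚本(假设服务运行在本机 22555 端口)
import requests

base = 'http://127.0.0.1:22555'

print(requests.get(base + '/count/').text)      # 当前库中可用 IP 的数量
print(requests.get(base + '/get_one/').text)    # 随机返回一条可用 IP
print(requests.get(base + '/get_all/').json())  # 返回所有可用 IP 的列表
# 删除一条不可用 IP, ip 与 target_url 两个参数缺一不可(此处的值仅为占位示例)
requests.get(base + '/delete/', params={'ip': '127.0.0.1:8080', 'target_url': 'https://www.baidu.com'})
```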
212 | ## 5. 删除不可用IP 213 | * 涉及脚本 214 | > delete_not_update_ip.py 215 | 216 | * **原理** 217 | > 在前面我们只是一味地将获取到的IP插入到数据库中,如果新插入的数据已经存在于数据库中,那么我们会 218 | 更新这一条数据的插入时间为当前的插入时间。 219 | 220 | >所以如果数据库中一些数据的插入时间依然停留在一段 221 | 时间之前, 那么就可以认为这条数据已经有一段时间没有更新了。 222 | 223 | > 这里就存在两种可能性: 这条数据中的ip依然可以使用,或者已经失效。 224 | 225 | > 所以在这里我们将对超过一定存活时间阈值的IP进行重新检测, 其实就是再次对目标网站进行请求, 成功的话则更新其插入时间为当前时间,若失败则删除。 226 | -------------------------------------------------------------------------------- /Selection_131.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/xiaosimao/IP_POOL/b438fb2f15f7ebebbe561aa1b4f9cf0eec06e0f9/Selection_131.png -------------------------------------------------------------------------------- /Selection_132.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/xiaosimao/IP_POOL/b438fb2f15f7ebebbe561aa1b4f9cf0eec06e0f9/Selection_132.png -------------------------------------------------------------------------------- /__init__.py: -------------------------------------------------------------------------------- 1 | #!/usr/bin/env python 2 | # -*- coding: utf-8 -*- 3 | # Created by shimeng on 17-9-20 4 | 5 | -------------------------------------------------------------------------------- /_request.py: -------------------------------------------------------------------------------- 1 | #!/usr/bin/env python 2 | # -*- coding: utf-8 -*- 3 | # Created by shimeng on 17-9-20 4 | 5 | """ 6 | 验证函数 7 | 若请求成功,则返回响应时间, ip 8 | 若错误, 则返回None, None, 请求框架将不对这一结果进行处理 9 | 10 | """ 11 | import requests 12 | import urlparse 13 | 14 | header = {'User-Agent':'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/60.0.3112.90 Safari/537.36'} 15 | 16 | 17 | def valid(_args, dont_filter): 18 | url = _args.get('url') 19 | header['Host'] = urlparse.urlparse(url).netloc 20 | time_out = _args.get('time_out') 21 | _ip = _args.get('ip') 22 | diy_header = _args.get('diy_header') if _args.get('diy_header') else header 23 | 24 | try: 25 | proxy = { 26 | 'http': 'http://%s' % _ip, 27 | 'https': 'http://%s' % _ip 28 | } 29 | con = requests.get(url, headers=diy_header, proxies=proxy, timeout=time_out) 30 | except Exception,e: 31 | # print e 32 | return None, None 33 | else: 34 | if con.status_code == 200: 35 | return con.elapsed.microseconds/1000000., _ip 36 | # 非200状态码同样视为检测失败, 与异常情况一样返回两个None, 避免调用方解包出错 37 | return None, None -------------------------------------------------------------------------------- /config.py: -------------------------------------------------------------------------------- 1 | #!/usr/bin/env python 2 | # -*- coding: utf-8 -*- 3 | # Created by shimeng on 17-8-17 4 | 5 | 6 | # 爬虫名称 7 | spider_name = 'get_ip' 8 | 9 | # 日志设置 10 | log_folder_name = '%s_logs' % spider_name 11 | delete_existed_logs = True 12 | 13 | # 请求参数设置 14 | thread_num = 50 15 | sleep_time = 0.5 16 | retry_times = 10 17 | time_out = 5 18 | # 当use_proxy为True时,必须在请求的args中或者在配置文件中定义ip, eg: ip="120.52.72.58:80", 否则程序将报错 19 | use_proxy = False 20 | ip = None 21 | 22 | # 移动端设置为 ua_type = 'mobile' 23 | ua_type = 'pc' 24 | 25 | # 队列顺序 26 | FIFO = 0 27 | # 默认提供的浏览器头包括user_agent和host, 若需要更丰富的header,可自行定义新的header,并赋值给diy_header 28 | 29 | diy_header = None 30 | 31 | # 定义状态码,不在其中的均视为请求错误或异常 32 | status_code = [200, 304, 404] 33 | 34 | # 保存设置 35 | # 当你定义好了host,port,database_name之后再将connect改为True 36 |
connect = True 37 | 38 | host = 'localhost' 39 | 40 | port = 27017 41 | 42 | database_name = 'free_ip' 43 | -------------------------------------------------------------------------------- /custom_get_ip/__init__.py: -------------------------------------------------------------------------------- 1 | #!/usr/bin/env python 2 | # -*- coding: utf-8 -*- 3 | # Created by shimeng on 17-9-21 4 | 5 | -------------------------------------------------------------------------------- /custom_get_ip/get_ip_from_peauland.py: -------------------------------------------------------------------------------- 1 | #!/usr/bin/env python 2 | # -*- coding: utf-8 -*- 3 | # Created by shimeng on 17-9-21 4 | 5 | import json 6 | import base64 7 | 8 | # 因为这个网站是post方法, 所以这里需要有post_data 9 | # 如果你确定要提交数据, 那么请定义该值为列表类型 10 | def peauland_format_post_data(): 11 | post_datas = [] 12 | 13 | base_post_data = { 14 | 'country_code': '', 15 | 'is_clusters': '', 16 | 'is_https': '', 17 | 'level_type': '', 18 | 'search_type': 'all', 19 | 'type': '', 20 | } 21 | for i in range(1,6): 22 | base_post_data['page'] = i 23 | post_datas.append(dict(base_post_data))  # 注意追加的是副本, 否则列表中保存的都是同一个字典, 每页的page最终都会变成5 24 | 25 | return post_datas 26 | 27 | # 根据网站的特殊性, 构建自定义header, 非必须. 28 | # 如果你的代理IP获取网站对请求头有特殊的要求,就可以自定义一个,然后在proxy_basic_config.py中对该代理网站的设置header字段并赋值即可 29 | def peauland_header(): 30 | cookie='peuland_id=649e2152bad01e29298950671635e44a; UM_distinctid=15ea23fc7b838f-096617276fe5c1-1c2f170b-1fa400-15ea23fc7b9752; CNZZDATA1253154494=2109729896-1505960642-%7C1505960642; peuland_md5=9b941affd9b676f62ab93081f6cc9a1b; w_h=1080; w_w=1920; w_cd=24; w_a_h=1056; w_a_w=1855; PHPSESSID=mv893pclb2qhc6etu4hvbl8067; php_id=916419018' 31 | headers = { 32 | 'Accept': '*/*', 33 | 'Accept-Encoding': 'gzip, deflate, br', 34 | 'Accept-Language': 'en-US,en;q=0.5', 35 | 'Connection': 'keep-alive', 36 | 'Content-Type': 'application/x-www-form-urlencoded; charset=UTF-8', 37 | 'Cookie':cookie, 38 | 'Host': 'proxy.peuland.com', 39 | 'Referer': 'https://proxy.peuland.com/proxy_list_by_category.htm', 40 | 'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10.11; rv:50.0) Gecko/20100101 Firefox/50.0', 41 | 'X-Requested-With': 'XMLHttpRequest', 42 | } 43 | return headers 44 | 45 | # 这里就是根据网站的特殊性自定义的解析方法, 有固定的四个参数 46 | # 跟默认的解析方法一样, 需要将获得的ips和website_name作为参数, 然后调用SpiderMain的start_check方法 47 | def peauland_parser(value, html_content, parse_type, website_name): 48 | # 获取ips , 和website_name 作为参数传到开始检测的函数中 49 | from get_proxies_base_spider import SpiderMain 50 | spider_main = SpiderMain() 51 | ips = [] 52 | content = json.loads(html_content) 53 | datas = content.get('data') 54 | for data in datas: 55 | ip = base64.b64decode(data.get('ip')) 56 | port = base64.b64decode(data.get('port')) 57 | proxy = ip+':'+port 58 | ips.append(proxy) 59 | if ips: 60 | spider_main.start_check(ips, website_name) 61 | -------------------------------------------------------------------------------- /db_method.py: -------------------------------------------------------------------------------- 1 | #!/usr/bin/env python 2 | # -*- coding: utf-8 -*- 3 | # Created by shimeng on 17-9-20 4 | 5 | import random 6 | 7 | 8 | class DB(object): 9 | def __init__(self, collection): 10 | self.collection = collection 11 | 12 | def get_one(self): 13 | _ips = [] 14 | for _ in self.collection.find(): 15 | _ips.append(_.get('ip')) 16 | return random.choice(_ips) 17 | 18 | def get_all(self): 19 | _ips = [] 20 | for _ in self.collection.find(): 21 | _ips.append(_.get('ip')) 22 | return _ips 23 | 24 | def delete_one(self, ip): 25 |
self.collection.delete_one({'_id': ip})  # 这里传入的参数实际上是文档的_id, 即由 ip + '_' + target_url 拼接成的字符串 26 | 27 | def total(self): 28 | return self.collection.count() 29 | -------------------------------------------------------------------------------- /delete_not_update_ip.py: -------------------------------------------------------------------------------- 1 | #!/usr/bin/env python 2 | # -*- coding: utf-8 -*- 3 | # Created by shimeng on 17-9-21 4 | 5 | """ 6 | 7 | 按照ip的插入时间距离现在的时间, 大于阈值时间的拿出来进行检测, 若已经无法使用, 则删除;若还可以使用,则更新插入的时间; 8 | 9 | """ 10 | import sys 11 | 12 | # 这里写你自己的框架保存地址 13 | sys.path.append('/home/shimeng/code/spider_framework_github_responsity') 14 | from spider.log_format import spider_log 15 | from config import log_folder_name 16 | from config import host, port, database_name 17 | from proxy_basic_config import collection_name, over_time 18 | import pymongo 19 | import time 20 | import urlparse 21 | from _request import valid 22 | 23 | diy_header = { 24 | 'User-Agent': 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/60.0.3112.90 Safari/537.36'} 25 | log_name = 'delete_not_update_ip' 26 | file_folder = log_folder_name 27 | delete_existed_log = False 28 | 29 | logger = spider_log(log_name=log_name, file_folder=file_folder, delete_existed_log=delete_existed_log) 30 | 31 | 32 | def connect_to_mongodb(): 33 | client = pymongo.MongoClient(host, port) 34 | db = client[database_name] 35 | collection = db[collection_name] 36 | return collection 37 | 38 | 39 | def format_time_to_timestamp(format_time): 40 | st = time.strptime(format_time, '%Y-%m-%d %H:%M:%S') 41 | return time.mktime(st) 42 | 43 | 44 | def check(when=time.time): 45 | collection = connect_to_mongodb() 46 | for data in collection.find(): 47 | ip = data.get('ip') 48 | target_url = data.get('target_url') 49 | ip_stamp = format_time_to_timestamp(data.get('insert_time')) 50 | 51 | has_existed = int(when() - ip_stamp) 52 | if has_existed > over_time: 53 | diy_header['Host'] = urlparse.urlparse(target_url).netloc 54 | 55 | _args = { 56 | "url": target_url, 57 | "diy_header": diy_header, 58 | "time_out": 5, 59 | "ip": ip,  # 注意键名必须是 ip, 与 _request.valid 中读取的键保持一致 60 | } 61 | 62 | _id = ip + '_' + target_url 63 | # 调用验证函数 64 | result1, result2 = valid(_args, False) 65 | if result1 is None: 66 | msg = 'delete ip: [{ip}], target_url is [{target_url}]'.format(ip=ip, target_url=target_url) 67 | logger.info(msg) 68 | collection.delete_one({'_id': _id}) 69 | else: 70 | msg = 'update ip: [{ip}], target_url is [{target_url}]'.format(ip=ip, target_url=target_url) 71 | logger.info(msg) 72 | # 只更新插入时间字段, 并保持与入库时一致的时间格式, 否则下次计算时间差会出错 73 | collection.update({'_id': _id}, {'$set': {'insert_time': time.strftime('%Y-%m-%d %H:%M:%S', time.localtime(when()))}}) 74 | 75 | if __name__ == '__main__': 76 | check(time.time) -------------------------------------------------------------------------------- /get_proxies_base_spider.py: -------------------------------------------------------------------------------- 1 | #!/usr/bin/env python 2 | # -*- coding: utf-8 -*- 3 | # Created by shimeng on 17-9-20 4 | 5 | import sys 6 | import re 7 | import os 8 | import time 9 | 10 | # 这里写你自己的框架保存地址 11 | sys.path.append('/home/shimeng/code/spider_framework_github_responsity') 12 | 13 | from spider.tools import format_put_data 14 | from spider.data_save import pipeline 15 | from spider.html_parser import parser 16 | from spider.page_downloader import aispider 17 | from spider.threads import start, work_queue, save_queue 18 | from spider.log_format import logger 19 | from proxy_basic_config import url_parse_dict, target_urls,collection_name 20 | from _request import valid 21 | 22 | 23 | # 定义主程序 24 | class SpiderMain(object): 25 | def
__init__(self, ): 26 | 27 | self.logger = logger 28 | self.parser = parser 29 | self.pipeline = pipeline 30 | self.target_urls = target_urls 31 | self.collection_name = collection_name 32 | 33 | 34 | def run(self): 35 | start() 36 | self.craw() 37 | 38 | def craw(self, request=aispider.request): 39 | 40 | for key, value in url_parse_dict.iteritems(): 41 | if value.get('status') == 'active': 42 | # 网站名 43 | website_name = key 44 | # 网站url 45 | website_urls = value.get('url') 46 | # 请求方法 47 | method = value.get('request_method') 48 | # 请求需要提交的数据 49 | post_datas = value.get('submit_data') 50 | # 解析方法 51 | parse_func = value.get('parse_func', 'system') 52 | if parse_func == 'system': 53 | parser = self.parse_to_get_ip 54 | else: 55 | parser = parse_func 56 | 57 | # 自定义头 58 | diy_header = value.get('header') 59 | 60 | for url in website_urls: 61 | # 调用format_put_data 构造放入队列中的数据 62 | if post_datas: 63 | for post_data in post_datas: 64 | put_data = format_put_data( 65 | args={"url": url, "method": method, 'submit_data': post_data, 'diy_header': diy_header}, 66 | work_func=request, 67 | follow_func=self.get_and_check, 68 | meta={'value': value, 'website_name': website_name, 'parser': parser}) 69 | # 放入队列 70 | work_queue.put(put_data) 71 | 72 | else: 73 | put_data = format_put_data(args={"url": url, "method": method, 'data': post_datas}, 74 | work_func=request, 75 | follow_func=self.get_and_check, 76 | meta={'value': value, 'website_name': website_name, 77 | 'parser': parser}) 78 | # 放入队列 79 | work_queue.put(put_data) 80 | 81 | def get_and_check(self, response): 82 | value = response.get('meta').get('value') 83 | html_content = response.get('content') 84 | # 网站名 85 | website_name = response.get('meta').get('website_name') 86 | # 解析类型: xpath, re 87 | parse_type = value.get('parse_type') 88 | # 解析函数 89 | parser = response.get('meta').get('parser') 90 | 91 | parser(value=value, html_content=html_content, parse_type=parse_type, website_name=website_name) 92 | 93 | def parse_to_get_ip(self, value, html_content, parse_type, website_name): 94 | ips = [] 95 | 96 | if parse_type == 'xpath': 97 | # xpath 98 | # 端口与地址是否在一起 99 | ip_port_together = value.get('ip_port_together') 100 | if ip_port_together: 101 | ip_and_port_xpath = value.get('parse_method').get('ip_address_and_port') 102 | ip_and_port = self.parser.get_data_by_xpath(html_content, ip_and_port_xpath) 103 | ips.extend(ip_and_port) 104 | 105 | else: 106 | ip_address_xpath = value.get('parse_method').get('ip_address') 107 | ip_port_xpath = value.get('parse_method').get('ip_port') 108 | 109 | ip_address = self.parser.get_data_by_xpath(html_content, ip_address_xpath) 110 | ip_port = self.parser.get_data_by_xpath(html_content, ip_port_xpath) 111 | for index, value in enumerate(ip_address): 112 | ips.append((ip_address[index] + ':' + ip_port[index])) 113 | 114 | elif parse_type == 're': 115 | # re 116 | ip_and_port_pattern = value.get('parse_method').get('_pattern') 117 | ip_and_port = parser.get_data_by_re(html_content, ip_and_port_pattern, flags=re.S) 118 | 119 | if ip_and_port: 120 | for data in ip_and_port: 121 | proxy = ':'.join(data) 122 | ips.append(proxy) 123 | 124 | # 调用检测程序 125 | self.start_check(ips, website_name) 126 | 127 | def start_check(self, ips, website_name): 128 | if ips: 129 | # 检测 130 | for _ip in ips: 131 | for target_url in self.target_urls: 132 | url = target_url 133 | # 调用format_put_data 构造放入队列中的数据 134 | put_data = format_put_data(args={"url": url, 'ip': _ip, 'time_out': 5}, work_func=valid, 135 | need_save=True, 136 | 
save_func=self.save_ip, 137 | meta={'website_name': website_name, 'target_url': target_url}) 138 | # 放入队列 139 | work_queue.put(put_data) 140 | else: 141 | msg = 'There Are No Available From [{website_name}] Can Be Used To Check, Please Check!!!!!!!'.format( 142 | website_name=website_name) 143 | logger.error(msg) 144 | 145 | # 上一步中定义的保存函数 146 | def save_ip(self, response): 147 | website_name = response.get('meta').get('website_name') 148 | response_time = response.get('content') 149 | target_url = response.get('meta').get('target_url') 150 | _ip = response.get('url') 151 | 152 | msg = '[{ip}] can visit the target url [{target_url}], source is [{source}]'.format(ip=_ip, 153 | target_url=target_url, 154 | source=website_name) 155 | logger.info(msg) 156 | # mongodb 集合名称 157 | 158 | insert_data = {} 159 | 160 | insert_data['_id'] = _ip+'_'+target_url 161 | insert_data['ip'] = _ip 162 | insert_data['source'] = website_name 163 | insert_data['response_time'] = response_time 164 | insert_data['target_url'] = target_url 165 | 166 | insert_data['insert_time'] = time.strftime('%Y-%m-%d %H:%M:%S') 167 | 168 | # 保存数入库 169 | self.pipeline.process_item(insert_data, self.collection_name) 170 | 171 | 172 | 173 | if __name__ == '__main__': 174 | # 测试代码 175 | spidermain = SpiderMain() 176 | spidermain.run() 177 | 178 | # blocking 179 | work_queue.join() 180 | save_queue.join() 181 | 182 | # finishing crawl origin ip 183 | logger.info('available proxy has been saved in your database, please check!') 184 | -------------------------------------------------------------------------------- /proxy_api.py: -------------------------------------------------------------------------------- 1 | #!/usr/bin/env python 2 | # -*- coding: utf-8 -*- 3 | # Created by shimeng on 17-9-20 4 | """ 5 | 6 | API 7 | 8 | """ 9 | 10 | import sys 11 | 12 | sys.path.append('/home/shimeng/code/spider_framework_github_responsity') 13 | from spider.data_save import pipeline 14 | from db_method import DB 15 | from flask import Flask, jsonify, request 16 | 17 | db = DB(pipeline.db['proxy']) 18 | 19 | app = Flask(__name__) 20 | 21 | api_list = { 22 | 'count': u'get the count of proxy', 23 | 'get_one': u'get a random proxy', 24 | 'get_all': u'get all proxy from proxy pool', 25 | 'delete?ip=127.0.0.1:8080&target_url=https://www.baidu.com': u'delete the proxy which is unavailable', 26 | } 27 | 28 | 29 | @app.route('/') 30 | def index(): 31 | return jsonify(api_list) 32 | 33 | 34 | @app.route('/count/') 35 | def count(): 36 | num = db.total() 37 | msg = 'the total number is [%d]' % num 38 | return msg 39 | 40 | 41 | @app.route('/get_one/') 42 | def get(): 43 | proxy = db.get_one() 44 | print proxy 45 | return proxy 46 | 47 | 48 | @app.route('/get_all/') 49 | def get_all(): 50 | proxies = db.get_all() 51 | 52 | return jsonify([proxy.decode('utf8') for proxy in proxies]) 53 | 54 | 55 | @app.route('/delete/', methods=['GET']) 56 | def delete(): 57 | proxy = request.args.get('ip') 58 | target_url = request.args.get('target_url') 59 | _id = proxy + '_' + target_url 60 | db.delete_one(_id) 61 | return 'Delete Successfully' 62 | 63 | 64 | def run(): 65 | app.run(host='0.0.0.0', port=22555, debug=True) 66 | 67 | 68 | if __name__ == '__main__': 69 | run() 70 | -------------------------------------------------------------------------------- /proxy_basic_config.py: -------------------------------------------------------------------------------- 1 | #!/usr/bin/env python 2 | # -*- coding: utf-8 -*- 3 | # Created by shimeng on 17-9-19 4 | 5 | """ 6 | 7 | 
代理网址及解析字典 8 | 9 | status 代理状态, 若不想爬取此网站,可以将status设置为非active的任意值 10 | request_method , 请求方法, 必写, 当为post的时候, 必须定义提交的post_data, 否则会报错.因项目的特殊性, 提交的数据中会带有页码数据, 所以在这里将 11 | post_data 定义为列表, 里面的数据为字典格式 12 | url 代理网址 13 | 14 | parse_type 解析类型,默认提供: xpath, re 15 | 16 | (1) xpath 17 | ip_port_together ip地址和ip的端口是否在一个字段中 18 | 若为地址与端口在一起,则建议key为ip_address_and_port 19 | 若为地址与端口不在一起,则建议key为ip_address, ip_port 20 | 21 | (2) re 22 | 若解析的类型为re, 则ip_port_together可以为任意的值 23 | parse_method中只有一个键: _pattern 24 | 25 | parse_func 解析函数, 默认值为system, 当需要使用自定义的解析函数的时候, 需要显式的定义该字段为自定义的解析函数 26 | 解析函数要有四个参数, 分别为value, html_content, parse_type, website_name 27 | 28 | header 因网址较多, 所以在这里可以自定义头 29 | """ 30 | 31 | from custom_get_ip.get_ip_from_peauland import peauland_parser, peauland_format_post_data, peauland_header 32 | 33 | # 定义检测的目标网站 34 | target_urls = ['https://www.baidu.com', 'https://httpbin.org/get'] 35 | 36 | # 数据库集合名 37 | collection_name = 'proxy' 38 | 39 | # 数据库中IP存活时间阀值, 超过及对其重新检测 40 | over_time = 1800 41 | 42 | url_parse_dict = { 43 | # data5u 44 | 'data5u': { 45 | 'status':'active', 46 | 'request_method':'get', 47 | 'url': ['http://www.data5u.com/free/{tag}/index.shtml'.format(tag=tag) for tag in ['gngn', 'gnpt', 'gwgn', 'gwpt']], 48 | 'parse_type': 'xpath', 49 | 'ip_port_together': False, 50 | 'parse_method':{ 51 | 'ip_address': '//ul[@class="l2"]/span[1]/li/text()', 52 | 'ip_port': '//ul[@class="l2"]/span[2]/li/text()', 53 | }, 54 | 'parse_func': 'system' 55 | }, 56 | 57 | # xicidaili 58 | 'xicidaili': { 59 | 'status': 'active', 60 | 'request_method': 'get', 61 | 'url': ['http://www.xicidaili.com/nn/{page}'.format(page=page) for page in range(1, 10)], 62 | 'parse_type': 'xpath', 63 | 'ip_port_together': False, 64 | 'parse_method': { 65 | 'ip_address': '//tr[@class="odd"]/td[2]/text()', 66 | 'ip_port': '//tr[@class="odd"]/td[3]/text()', 67 | }, 68 | 'parse_func': 'system' 69 | 70 | }, 71 | 72 | # 66ip 73 | '66ip': { 74 | 'status': 'active', 75 | 'request_method': 'get', 76 | 'url': ['http://m.66ip.cn/{page}.html'.format(page=page) for page in range(1, 10)], 77 | 'parse_type': 're', 78 | 'ip_port_together': False, 79 | 'parse_method': { 80 | '_pattern': '