├── .gitignore ├── 201802111155445124jhQyer.yasuotu.gif ├── LICENSE ├── README.md ├── Selection_131.png ├── Selection_132.png ├── __init__.py ├── _request.py ├── config.py ├── custom_get_ip ├── __init__.py └── get_ip_from_peauland.py ├── db_method.py ├── delete_not_update_ip.py ├── get_proxies_base_spider.py ├── proxy_api.py ├── proxy_basic_config.py ├── requirements.txt ├── work_spider.py ├── 个人公众号.gif ├── 修改记录.md └── 数据库截图.png /.gitignore: -------------------------------------------------------------------------------- 1 | *.pyc 2 | -------------------------------------------------------------------------------- /201802111155445124jhQyer.yasuotu.gif: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/xiaosimao/IP_POOL/b438fb2f15f7ebebbe561aa1b4f9cf0eec06e0f9/201802111155445124jhQyer.yasuotu.gif -------------------------------------------------------------------------------- /LICENSE: -------------------------------------------------------------------------------- 1 | Apache License 2 | Version 2.0, January 2004 3 | http://www.apache.org/licenses/ 4 | 5 | TERMS AND CONDITIONS FOR USE, REPRODUCTION, AND DISTRIBUTION 6 | 7 | 1. Definitions. 8 | 9 | "License" shall mean the terms and conditions for use, reproduction, 10 | and distribution as defined by Sections 1 through 9 of this document. 11 | 12 | "Licensor" shall mean the copyright owner or entity authorized by 13 | the copyright owner that is granting the License. 14 | 15 | "Legal Entity" shall mean the union of the acting entity and all 16 | other entities that control, are controlled by, or are under common 17 | control with that entity. For the purposes of this definition, 18 | "control" means (i) the power, direct or indirect, to cause the 19 | direction or management of such entity, whether by contract or 20 | otherwise, or (ii) ownership of fifty percent (50%) or more of the 21 | outstanding shares, or (iii) beneficial ownership of such entity. 22 | 23 | "You" (or "Your") shall mean an individual or Legal Entity 24 | exercising permissions granted by this License. 25 | 26 | "Source" form shall mean the preferred form for making modifications, 27 | including but not limited to software source code, documentation 28 | source, and configuration files. 29 | 30 | "Object" form shall mean any form resulting from mechanical 31 | transformation or translation of a Source form, including but 32 | not limited to compiled object code, generated documentation, 33 | and conversions to other media types. 34 | 35 | "Work" shall mean the work of authorship, whether in Source or 36 | Object form, made available under the License, as indicated by a 37 | copyright notice that is included in or attached to the work 38 | (an example is provided in the Appendix below). 39 | 40 | "Derivative Works" shall mean any work, whether in Source or Object 41 | form, that is based on (or derived from) the Work and for which the 42 | editorial revisions, annotations, elaborations, or other modifications 43 | represent, as a whole, an original work of authorship. For the purposes 44 | of this License, Derivative Works shall not include works that remain 45 | separable from, or merely link (or bind by name) to the interfaces of, 46 | the Work and Derivative Works thereof. 
47 | 48 | "Contribution" shall mean any work of authorship, including 49 | the original version of the Work and any modifications or additions 50 | to that Work or Derivative Works thereof, that is intentionally 51 | submitted to Licensor for inclusion in the Work by the copyright owner 52 | or by an individual or Legal Entity authorized to submit on behalf of 53 | the copyright owner. For the purposes of this definition, "submitted" 54 | means any form of electronic, verbal, or written communication sent 55 | to the Licensor or its representatives, including but not limited to 56 | communication on electronic mailing lists, source code control systems, 57 | and issue tracking systems that are managed by, or on behalf of, the 58 | Licensor for the purpose of discussing and improving the Work, but 59 | excluding communication that is conspicuously marked or otherwise 60 | designated in writing by the copyright owner as "Not a Contribution." 61 | 62 | "Contributor" shall mean Licensor and any individual or Legal Entity 63 | on behalf of whom a Contribution has been received by Licensor and 64 | subsequently incorporated within the Work. 65 | 66 | 2. Grant of Copyright License. Subject to the terms and conditions of 67 | this License, each Contributor hereby grants to You a perpetual, 68 | worldwide, non-exclusive, no-charge, royalty-free, irrevocable 69 | copyright license to reproduce, prepare Derivative Works of, 70 | publicly display, publicly perform, sublicense, and distribute the 71 | Work and such Derivative Works in Source or Object form. 72 | 73 | 3. Grant of Patent License. Subject to the terms and conditions of 74 | this License, each Contributor hereby grants to You a perpetual, 75 | worldwide, non-exclusive, no-charge, royalty-free, irrevocable 76 | (except as stated in this section) patent license to make, have made, 77 | use, offer to sell, sell, import, and otherwise transfer the Work, 78 | where such license applies only to those patent claims licensable 79 | by such Contributor that are necessarily infringed by their 80 | Contribution(s) alone or by combination of their Contribution(s) 81 | with the Work to which such Contribution(s) was submitted. If You 82 | institute patent litigation against any entity (including a 83 | cross-claim or counterclaim in a lawsuit) alleging that the Work 84 | or a Contribution incorporated within the Work constitutes direct 85 | or contributory patent infringement, then any patent licenses 86 | granted to You under this License for that Work shall terminate 87 | as of the date such litigation is filed. 88 | 89 | 4. Redistribution. 
You may reproduce and distribute copies of the 90 | Work or Derivative Works thereof in any medium, with or without 91 | modifications, and in Source or Object form, provided that You 92 | meet the following conditions: 93 | 94 | (a) You must give any other recipients of the Work or 95 | Derivative Works a copy of this License; and 96 | 97 | (b) You must cause any modified files to carry prominent notices 98 | stating that You changed the files; and 99 | 100 | (c) You must retain, in the Source form of any Derivative Works 101 | that You distribute, all copyright, patent, trademark, and 102 | attribution notices from the Source form of the Work, 103 | excluding those notices that do not pertain to any part of 104 | the Derivative Works; and 105 | 106 | (d) If the Work includes a "NOTICE" text file as part of its 107 | distribution, then any Derivative Works that You distribute must 108 | include a readable copy of the attribution notices contained 109 | within such NOTICE file, excluding those notices that do not 110 | pertain to any part of the Derivative Works, in at least one 111 | of the following places: within a NOTICE text file distributed 112 | as part of the Derivative Works; within the Source form or 113 | documentation, if provided along with the Derivative Works; or, 114 | within a display generated by the Derivative Works, if and 115 | wherever such third-party notices normally appear. The contents 116 | of the NOTICE file are for informational purposes only and 117 | do not modify the License. You may add Your own attribution 118 | notices within Derivative Works that You distribute, alongside 119 | or as an addendum to the NOTICE text from the Work, provided 120 | that such additional attribution notices cannot be construed 121 | as modifying the License. 122 | 123 | You may add Your own copyright statement to Your modifications and 124 | may provide additional or different license terms and conditions 125 | for use, reproduction, or distribution of Your modifications, or 126 | for any such Derivative Works as a whole, provided Your use, 127 | reproduction, and distribution of the Work otherwise complies with 128 | the conditions stated in this License. 129 | 130 | 5. Submission of Contributions. Unless You explicitly state otherwise, 131 | any Contribution intentionally submitted for inclusion in the Work 132 | by You to the Licensor shall be under the terms and conditions of 133 | this License, without any additional terms or conditions. 134 | Notwithstanding the above, nothing herein shall supersede or modify 135 | the terms of any separate license agreement you may have executed 136 | with Licensor regarding such Contributions. 137 | 138 | 6. Trademarks. This License does not grant permission to use the trade 139 | names, trademarks, service marks, or product names of the Licensor, 140 | except as required for reasonable and customary use in describing the 141 | origin of the Work and reproducing the content of the NOTICE file. 142 | 143 | 7. Disclaimer of Warranty. Unless required by applicable law or 144 | agreed to in writing, Licensor provides the Work (and each 145 | Contributor provides its Contributions) on an "AS IS" BASIS, 146 | WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or 147 | implied, including, without limitation, any warranties or conditions 148 | of TITLE, NON-INFRINGEMENT, MERCHANTABILITY, or FITNESS FOR A 149 | PARTICULAR PURPOSE. 
You are solely responsible for determining the 150 | appropriateness of using or redistributing the Work and assume any 151 | risks associated with Your exercise of permissions under this License. 152 | 153 | 8. Limitation of Liability. In no event and under no legal theory, 154 | whether in tort (including negligence), contract, or otherwise, 155 | unless required by applicable law (such as deliberate and grossly 156 | negligent acts) or agreed to in writing, shall any Contributor be 157 | liable to You for damages, including any direct, indirect, special, 158 | incidental, or consequential damages of any character arising as a 159 | result of this License or out of the use or inability to use the 160 | Work (including but not limited to damages for loss of goodwill, 161 | work stoppage, computer failure or malfunction, or any and all 162 | other commercial damages or losses), even if such Contributor 163 | has been advised of the possibility of such damages. 164 | 165 | 9. Accepting Warranty or Additional Liability. While redistributing 166 | the Work or Derivative Works thereof, You may choose to offer, 167 | and charge a fee for, acceptance of support, warranty, indemnity, 168 | or other liability obligations and/or rights consistent with this 169 | License. However, in accepting such obligations, You may act only 170 | on Your own behalf and on Your sole responsibility, not on behalf 171 | of any other Contributor, and only if You agree to indemnify, 172 | defend, and hold each Contributor harmless for any liability 173 | incurred by, or claims asserted against, such Contributor by reason 174 | of your accepting any such warranty or additional liability. 175 | 176 | END OF TERMS AND CONDITIONS 177 | 178 | APPENDIX: How to apply the Apache License to your work. 179 | 180 | To apply the Apache License to your work, attach the following 181 | boilerplate notice, with the fields enclosed by brackets "[]" 182 | replaced with your own identifying information. (Don't include 183 | the brackets!) The text should be enclosed in the appropriate 184 | comment syntax for the file format. We also recommend that a 185 | file or class name and description of purpose be included on the 186 | same "printed page" as the copyright notice for easier 187 | identification within third-party archives. 188 | 189 | Copyright [yyyy] [name of copyright owner] 190 | 191 | Licensed under the Apache License, Version 2.0 (the "License"); 192 | you may not use this file except in compliance with the License. 193 | You may obtain a copy of the License at 194 | 195 | http://www.apache.org/licenses/LICENSE-2.0 196 | 197 | Unless required by applicable law or agreed to in writing, software 198 | distributed under the License is distributed on an "AS IS" BASIS, 199 | WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. 200 | See the License for the specific language governing permissions and 201 | limitations under the License. 
202 | -------------------------------------------------------------------------------- /README.md: -------------------------------------------------------------------------------- 1 | 2 | # 爬虫代理 3 | # 个人公众号 4 |  5 | 6 | # 更新 7 | 时间:2022年10月17日 8 | 9 | 最近后台收到了很多小伙伴的反馈,说现在免费的ip可以使用的时长太短,而且免费的多是透明代理,不安全,希望我推荐一个商业级的代理IP,于是有了这个更新,有需要的小伙伴可以试用一下。 10 | 11 | ### 地址 12 | - [IPIDEA官网地址](https://share.ipidea.net/VLLYbk0KpY2YLdj) 13 | 14 | ### 产品介绍(来自官网) 15 | ``` 16 | 1、覆盖220+的国家和地区,9000万真实住宅IP资源,汇聚成大规模代理服务池。 17 | 2、提供动态住宅代理、静态住宅代理、数据中心、移动代理等多种解决方案,满足电子商务、市场调查、抓取索引、网站测试、广告验证、seo监控优化等多个业务场景。 18 | 3、支持HTTP/HTTPS/Socks5协议 19 | 4、真实住宅IP,支持从指定国家城市访问目标网站,隐藏真实网络环境,保护隐私,24小时持续过滤并更新,IP纯净度高,快速响应,无限并发,99.9%的成功率,确保高效稳定连接,让您的业务得心应手 20 | 5、支持海量IP免费试用 21 | ``` 22 | 23 | # 环境: python 2.7 24 | ### 特点 25 | - 通过配置文件,即可对IP代理网站进行爬取 26 | - 构建web服务,提供api接口 27 | - 获取与检测IP完全自动化 28 | - 可根据IP代理网站的特殊性,自行扩展获取、检测脚本 29 | 30 | ### 数据库可用IP截图 31 | ******** 32 |  33 | ******** 34 | 35 | ******** 36 | > **insert_time** 37 | 插入时间, 会有一个专门的脚本根据插入时间与现在时间的时间差来将已经存在一段时间的ip取出来,利用target_url对其重新检测, 38 | 如果无法使用,则会将其删除。 39 | **ip** 40 | 可用IP 41 | **response_time** 42 | 利用代理IP去访问target_url时的响应时间, 利用requests库的elapsed属性获得, 数据库中的单位为秒 43 | **source** 44 | 代理IP的来源 45 | **target_url** 46 | 目标网站, 如你想获得可以访问豆瓣的IP, 那么豆瓣网的网址就是target_url 47 | 48 | ******** 49 | ### 注:请求的框架基于我自己写的一个小框架, 地址在:https://github.com/xiaosimao/AiSpider 50 | #### 欢迎star以及交流, 我的微信在上面的地址中的readme.md中, 加的话注明: 来自github。 51 | 52 | #### 毕竟是免费IP,质量没法保证,所以还是尽量多找几个网站吧。 53 | ******** 54 | 55 | ## 1. **使用的方法** 56 | > 到上面提到的请求框架地址中下载框架到本地, 然后在work_spider.py,delete_not_update_ip.py, 57 | get_proxies_base_spider.py 中对sys.path.append(...)中的地址进行更换,更换为你本地框架所在的地址,现在脚本中写的是我 58 | 这里的路径,路径中尽量不要带中文。 59 | 60 | **上面这一步是必须的** 61 | **下面这一步也是必须的** 62 | `pip install -r requirements.txt` 63 | 64 | 65 | * 1.1 66 | 在proxy_basic_config.py中对代理IP网站进行配置,如果你只想试试,那么就不用配置了,已经有了几个可以用的配置项。 67 | * 1.2 68 | 在config.py中对请求框架进行配置。如果你还是只想试运行,那么这里也不用配置了,现在的配置可以使用。 69 | * 1.3 70 | 确保已经正确安装mongodb数据库。这一步必须安装正确,否则没法使用。 71 | * 1.4 72 | 如果你的网站特殊, 那么请自定义解析函数, 并在第一步中配置正确 73 | * 1.5 74 | 执行work_spider.py脚本, 开始抓取,检测,入库。如果配置都没有改而且上面的步骤都正确, 那么数据在你本地mongodb数据库free_ip库的proxy集合中。 75 | * 1.6 76 | 执行proxy_api.py脚本, 开启API服务 77 | * 1.7 78 | 执行delete_not_update_ip.py脚本, 对超过存活时间阈值的IP进行重新检测, 删除或更新插入时间 79 | 80 | * 如果你要爬的代理IP网站不是很变态的那种, 基本的请求加xpath和re就能获得ip的话, 那么你只要简简单单地 81 | 对上面提到的配置文件进行简单的配置就行了,配置文件中各个字段的具体含义在下面有详细解释,也可以先参考紧接着给出的示意配置。 82 |
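下面给出一个 `url_parse_dict` 配置项的最小示意,仿照仓库中已有的 xicidaili 配置写成;其中 example_site 的网址与 xpath 均为假设的占位,并非真实可用的站点,各字段的具体含义见下文第 3 部分:

```python
# proxy_basic_config.py 中 url_parse_dict 的一个示意配置项
# 注意: example_site 的网址与 xpath 仅为假设的占位示例
url_parse_dict = {
    'example_site': {
        'status': 'active',          # 不想爬取该网站时, 改为任意非 active 的值
        'request_method': 'get',     # 为 post 时必须同时提供 submit_data
        'url': ['http://www.example.com/free/%d.html' % page for page in range(1, 3)],
        'parse_type': 'xpath',       # 默认解析器支持 xpath 与 re 两种方式
        'ip_port_together': False,   # ip 与端口不在同一个字段中
        'parse_method': {
            'ip_address': '//table//tr/td[1]/text()',
            'ip_port': '//table//tr/td[2]/text()',
        },
        'parse_func': 'system',      # 使用默认解析函数 parse_to_get_ip
    },
}
```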
83 | ******** 84 | 85 | ## 2.工作流程 86 | * 定时获取IP; 87 | * 根据目标网站,分别对IP进行检测; 88 | * 检测成功的入库; 89 | * 构建API; 90 | * 对入库的IP再次检测,删除不可用; 91 | ******** 92 | ## 3. 具体介绍 93 | ******** 94 | ### 获取IP及检测IP并入库 95 | * **涉及脚本:** 96 | >work_spider.py 97 | get_proxies_base_spider.py 98 | _request.py 99 | proxy_basic_config.py 100 | custom_get_ip下的脚本 101 | ******** 102 | 103 | * **work_spider.py** 104 | `入口脚本` 105 | > 从 get_proxies_base_spider.py 中继承SpiderMain。 在WorkSpider类中重写父类的run方法,在其中可以 106 | 传入自定义的请求函数, 若没有, 则使用默认的上述框架中的请求函数。 107 | ******** 108 | * **get_proxies_base_spider.py** 109 | `主程序` 110 | > 提供了默认的ip获取, 检测, 入库函数。其中,解析函数可以自定义。 111 | 112 | 包括的方法: 113 | >**run** 114 | 启动函数, 启动请求框架的线程池, 调用craw函数; 115 | **craw** 116 | 开始爬取函数, 对配置中的配置项目进行读取,发起初始请求; 117 | **get_and_check** 118 | 获取返回的源代码, 并调用解析函数; 119 | **parse_to_get_ip** 120 | 默认的解析函数, 用户可通过配置自定义。解析出原始IP, 然后调用检测函数; 121 | **start_check** 122 | 开始检测函数, 其中检测的函数为_request.py中的valid函数; 123 | **save_ip** 124 | 检测成功的入库函数, 将检测成功的数据放入mongodb数据库中; 125 | ******** 126 | 127 | * **_request.py** 128 | `检测脚本程序` 129 | 包括的方法: 130 | > **valid** 131 | 检测函数, 因为要符合请求框架的要求,所以自定义的这个函数需要返回两个值, 在这里返回了 **响应时间与检测的IP**。 132 | ******** 133 | 134 | * **proxy_basic_config.py** 135 | `代理网页设置脚本` 136 | 包括的字段: 137 | >**target_urls** 138 | 待检测的目标网站, list类型; 139 | **collection_name** 140 | mongodb数据库的集合名, 字符串类型; 141 | **over_time** 142 | 数据库中IP存活时间阈值, 超过即对其重新检测, int类型, 默认1800秒; 143 | **url_parse_dict** 144 | 代理IP网址配置字典, 字典类型 145 | 146 | 下面将对这一配置字典作着重介绍: 147 | >(1) 格式 148 | {ip_website_name: value} 149 | **ip_website** 150 | 代理IP网站的名字, 如xicidaili 151 | **value** 152 | 其他设置值, 字典类型 153 | 154 | > (2) value 155 | 因为value是一个内嵌的字典,所以其也有自己的键和值。 156 | \[1] **status** 157 | 代理状态, 若不想爬取此网站,可以将status设置为非active的任意值, 必须; 158 | \[2] **request_method** 159 | 请求方法, 当为post的时候, 必须定义提交的submit_data, 否则会报错。默认为get, 非必须。 160 | \[3] **submit_data** 161 | 提交的数据。因项目的特殊性,如需要提交的数据里面带着网页页码字段,为了获得 162 | 多页的数据, 这里就需要定义多个post_data数据,然后分别进行请求。所以在这里将submit_data定义为列表, 里面的数据为字典格式。 163 | \[4] **url** 164 | 代理IP网址, 列表类型, 把需要爬取的页面都放在里面,必须。 165 | \[5] **parse_type** 166 | 解析类型, 默认提供: xpath, re。 如果你需要别的解析方式, 那么请定义自己的parse_func。默认的解析函数即上文中的\[parse_to_get_ip], 167 | 只提供xpath与re两种解析获取IP的方式。当使用默认解析器的时候, 必须定义。 168 | \[6] **ip_port_together** 169 | ip和port是否在一个字段中, 若是则设置为True, 不是则设置为False, 当使用re进行解析和使用了自定义的解析函数时,非必须。 170 | 若地址与端口在一起,且通过xpath解析时,则建议 \[parse_method] 中的key为ip_address_and_port 171 | 若地址与端口不在一起,且通过xpath解析时,则建议 \[parse_method] 中的key为ip_address, ip_port 172 | \[7] **parse_method** 173 | 解析的具体路径, 上一条有了部分的解释, 当使用默认解析器时,必须定义。 174 | 当通过re进行解析时, ip_port_together可以为任意的值,parse_method中只有一个键: _pattern 175 | \[8] **parse_func** 176 | 解析函数, 默认值为system, 当需要使用自定义的解析函数的时候, 需要显式地定义该字段为自定义的解析函数, 177 | 自定义的解析函数必须要有四个参数, 分别为value, html_content, parse_type, website_name, 后面有详细解释 178 | \[9] **header** 179 | 因网址较多, 所以在这里可以自定义浏览器头, 非必须。 180 | ******** 181 | 182 | * **custom_get_ip下的文件** 183 | `自定义解析函数` 184 | 在这里以get_ip_from_peauland.py为例 185 | > 通常, 如果你因为代理IP网站的特殊性而需要自定义一个解析函数, 那么你可以在custom_get_ip文件夹下定义脚本。 186 | 定义的脚本中必须定义解析函数, 并在配置脚本proxy_basic_config.py中该网站对应的设置中将parse_func设置为你定义的解析函数 187 | 188 | > 在这里最好为需要自定义解析的网站写一个独立的脚本, 这样利于维护, 将需要自定义的值都放在其中, 然后在设置 189 | 脚本中进行导入和定义。 190 | 191 | > 具体的解释可以参考上面脚本中的注释。 192 | ******** 193 | 194 | ## 4. API 195 | 为了方便地获取可用的代理IP,在这里利用FLASK简单地起了一个服务, 可以获得全部的IP、随机的一个IP、全部IP的数量, 以及删除一个IP 196 | * 涉及脚本 197 | > db_method.py 198 | proxy_api.py 199 | 200 | * **db_method.py** 201 | > 封装一些数据库的方法, 有 get_one, get_all, delete_one, total 202 | 203 | * **proxy_api.py** 204 | `接口服务主脚本` 205 | > 默认启动**22555**端口。 206 | 访问 http://0.0.0.0:22555/, 可以看到基本的api调用及介绍 207 | 访问 http://0.0.0.0:22555/count/, 可以看到数据库中现存的IP数量 208 | 访问 http://0.0.0.0:22555/get_one/, 可以看到随机返回了一条IP 209 | 访问 http://0.0.0.0:22555/get_all/, 可以看到返回了所有可以使用的IP 210 | 访问 http://0.0.0.0:22555/delete/?ip=127.0.0.1:8080&target_url=https://www.baidu.com, 删除你想删除的ip, 两个参数都是必须的, 否则会报错 211 |
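下面是用 requests 调用上述接口的一个简单示意,假设 proxy_api.py 已在本机 22555 端口启动,示例中的 ip 与 target_url 仅为占位:

```python
# -*- coding: utf-8 -*-
# 调用代理池 API 的示意脚本(假设服务运行在本机 22555 端口)
import requests

base = 'http://127.0.0.1:22555'

print(requests.get(base + '/count/').text)      # 当前库中可用 IP 的数量
print(requests.get(base + '/get_one/').text)    # 随机返回一条可用 IP
print(requests.get(base + '/get_all/').json())  # 返回所有可用 IP 的列表
# 删除一条不可用 IP, ip 与 target_url 两个参数缺一不可(此处的值仅为占位示例)
requests.get(base + '/delete/', params={'ip': '127.0.0.1:8080', 'target_url': 'https://www.baidu.com'})
```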
212 | ## 5. 删除不可用IP 213 | * 涉及脚本 214 | > delete_not_update_ip.py 215 | 216 | * **原理** 217 | > 在前面我们只是一味地将获取到的IP插入到数据库中,如果新插入的数据已经存在于数据库中,那么我们会 218 | 更新这一条数据的插入时间为当前的插入时间。 219 | 220 | >所以如果数据库中一些数据的插入时间依然停留在一段 221 | 时间之前, 那么就可以认为这条数据已经有一段时间没有更新了。 222 | 223 | > 这里就存在两种可能性: 这条数据中的ip依然可以使用,或者已经失效。 224 | 225 | > 所以在这里我们将对超过一定存活时间阈值的IP进行重新检测, 其实就是再次对目标网站进行请求, 成功的话则更新其插入时间为当前时间,若失败则删除。 226 | -------------------------------------------------------------------------------- /Selection_131.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/xiaosimao/IP_POOL/b438fb2f15f7ebebbe561aa1b4f9cf0eec06e0f9/Selection_131.png -------------------------------------------------------------------------------- /Selection_132.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/xiaosimao/IP_POOL/b438fb2f15f7ebebbe561aa1b4f9cf0eec06e0f9/Selection_132.png -------------------------------------------------------------------------------- /__init__.py: -------------------------------------------------------------------------------- 1 | #!/usr/bin/env python 2 | # -*- coding: utf-8 -*- 3 | # Created by shimeng on 17-9-20 4 | 5 | -------------------------------------------------------------------------------- /_request.py: -------------------------------------------------------------------------------- 1 | #!/usr/bin/env python 2 | # -*- coding: utf-8 -*- 3 | # Created by shimeng on 17-9-20 4 | 5 | """ 6 | 验证函数 7 | 若请求成功,则返回响应时间, ip 8 | 若错误, 则返回None, None, 请求框架将不对这一结果进行处理 9 | 10 | """ 11 | import requests 12 | import urlparse 13 | 14 | header = {'User-Agent':'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/60.0.3112.90 Safari/537.36'} 15 | 16 | 17 | def valid(_args, dont_filter): 18 | url = _args.get('url') 19 | header['Host'] = urlparse.urlparse(url).netloc 20 | time_out = _args.get('time_out') 21 | _ip = _args.get('ip') 22 | diy_header = _args.get('diy_header') if _args.get('diy_header') else header 23 | 24 | try: 25 | proxy = { 26 | 'http': 'http://%s' % _ip, 27 | 'https': 'http://%s' % _ip 28 | } 29 | con = requests.get(url, headers=diy_header, proxies=proxy, timeout=time_out) 30 | except Exception,e: 31 | # print e 32 | return None, None 33 | else: 34 | if con.status_code == 200: 35 | return con.elapsed.microseconds/1000000., _ip 36 | # 非200状态码同样视为检测失败, 与异常情况一样返回两个None, 避免调用方解包出错 37 | return None, None -------------------------------------------------------------------------------- /config.py: -------------------------------------------------------------------------------- 1 | #!/usr/bin/env python 2 | # -*- coding: utf-8 -*- 3 | # Created by shimeng on 17-8-17 4 | 5 | 6 | # 爬虫名称 7 | spider_name = 'get_ip' 8 | 9 | # 日志设置 10 | log_folder_name = '%s_logs' % spider_name 11 | delete_existed_logs = True 12 | 13 | # 请求参数设置 14 | thread_num = 50 15 | sleep_time = 0.5 16 | retry_times = 10 17 | time_out = 5 18 | # 当use_proxy为True时,必须在请求的args中或者在配置文件中定义ip, eg: ip="120.52.72.58:80", 否则程序将报错 19 | use_proxy = False 20 | ip = None 21 | 22 | # 移动端设置为 ua_type = 'mobile' 23 | ua_type = 'pc' 24 | 25 | # 队列顺序 26 | FIFO = 0 27 | # 默认提供的浏览器头包括user_agent和host, 若需要更丰富的header,可自行定义新的header,并赋值给diy_header 28 | 29 | diy_header = None 30 | 31 | # 定义状态码,不在其中的均视为请求错误或异常 32 | status_code = [200, 304, 404] 33 | 34 | # 保存设置 35 | # 当你定义好了host,port,database_name之后再将connect改为True 36 |
connect = True 37 | 38 | host = 'localhost' 39 | 40 | port = 27017 41 | 42 | database_name = 'free_ip' 43 | -------------------------------------------------------------------------------- /custom_get_ip/__init__.py: -------------------------------------------------------------------------------- 1 | #!/usr/bin/env python 2 | # -*- coding: utf-8 -*- 3 | # Created by shimeng on 17-9-21 4 | 5 | -------------------------------------------------------------------------------- /custom_get_ip/get_ip_from_peauland.py: -------------------------------------------------------------------------------- 1 | #!/usr/bin/env python 2 | # -*- coding: utf-8 -*- 3 | # Created by shimeng on 17-9-21 4 | 5 | import json 6 | import base64 7 | 8 | # 因为这个网站是post方法, 所以这里需要有post_data 9 | # 如果你确定要提交数据, 那么请定义该值为列表类型 10 | def peauland_format_post_data(): 11 | post_datas = [] 12 | 13 | base_post_data = { 14 | 'country_code': '', 15 | 'is_clusters': '', 16 | 'is_https': '', 17 | 'level_type': '', 18 | 'search_type': 'all', 19 | 'type': '', 20 | } 21 | for i in range(1,6): 22 | base_post_data['page'] = i 23 | post_datas.append(dict(base_post_data))  # 注意追加的是副本, 否则列表中保存的都是同一个字典, 每页的page最终都会变成5 24 | 25 | return post_datas 26 | 27 | # 根据网站的特殊性, 构建自定义header, 非必须. 28 | # 如果你的代理IP获取网站对请求头有特殊的要求,就可以自定义一个,然后在proxy_basic_config.py中对该代理网站的设置header字段并赋值即可 29 | def peauland_header(): 30 | cookie='peuland_id=649e2152bad01e29298950671635e44a; UM_distinctid=15ea23fc7b838f-096617276fe5c1-1c2f170b-1fa400-15ea23fc7b9752; CNZZDATA1253154494=2109729896-1505960642-%7C1505960642; peuland_md5=9b941affd9b676f62ab93081f6cc9a1b; w_h=1080; w_w=1920; w_cd=24; w_a_h=1056; w_a_w=1855; PHPSESSID=mv893pclb2qhc6etu4hvbl8067; php_id=916419018' 31 | headers = { 32 | 'Accept': '*/*', 33 | 'Accept-Encoding': 'gzip, deflate, br', 34 | 'Accept-Language': 'en-US,en;q=0.5', 35 | 'Connection': 'keep-alive', 36 | 'Content-Type': 'application/x-www-form-urlencoded; charset=UTF-8', 37 | 'Cookie':cookie, 38 | 'Host': 'proxy.peuland.com', 39 | 'Referer': 'https://proxy.peuland.com/proxy_list_by_category.htm', 40 | 'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10.11; rv:50.0) Gecko/20100101 Firefox/50.0', 41 | 'X-Requested-With': 'XMLHttpRequest', 42 | } 43 | return headers 44 | 45 | # 这里就是根据网站的特殊性自定义的解析方法, 有固定的四个参数 46 | # 跟默认的解析方法一样, 需要将获得的ips和website_name作为参数, 然后调用SpiderMain的start_check方法 47 | def peauland_parser(value, html_content, parse_type, website_name): 48 | # 获取ips , 和website_name 作为参数传到开始检测的函数中 49 | from get_proxies_base_spider import SpiderMain 50 | spider_main = SpiderMain() 51 | ips = [] 52 | content = json.loads(html_content) 53 | datas = content.get('data') 54 | for data in datas: 55 | ip = base64.b64decode(data.get('ip')) 56 | port = base64.b64decode(data.get('port')) 57 | proxy = ip+':'+port 58 | ips.append(proxy) 59 | if ips: 60 | spider_main.start_check(ips, website_name) 61 | -------------------------------------------------------------------------------- /db_method.py: -------------------------------------------------------------------------------- 1 | #!/usr/bin/env python 2 | # -*- coding: utf-8 -*- 3 | # Created by shimeng on 17-9-20 4 | 5 | import random 6 | 7 | 8 | class DB(object): 9 | def __init__(self, collection): 10 | self.collection = collection 11 | 12 | def get_one(self): 13 | _ips = [] 14 | for _ in self.collection.find(): 15 | _ips.append(_.get('ip')) 16 | return random.choice(_ips) 17 | 18 | def get_all(self): 19 | _ips = [] 20 | for _ in self.collection.find(): 21 | _ips.append(_.get('ip')) 22 | return _ips 23 | 24 | def delete_one(self, ip): 25 |
self.collection.delete_one({'_id': ip})  # 这里传入的参数实际上是文档的_id, 即由 ip + '_' + target_url 拼接成的字符串 26 | 27 | def total(self): 28 | return self.collection.count() 29 | -------------------------------------------------------------------------------- /delete_not_update_ip.py: -------------------------------------------------------------------------------- 1 | #!/usr/bin/env python 2 | # -*- coding: utf-8 -*- 3 | # Created by shimeng on 17-9-21 4 | 5 | """ 6 | 7 | 按照ip的插入时间距离现在的时间, 大于阈值时间的拿出来进行检测, 若已经无法使用, 则删除;若还可以使用,则更新插入的时间; 8 | 9 | """ 10 | import sys 11 | 12 | # 这里写你自己的框架保存地址 13 | sys.path.append('/home/shimeng/code/spider_framework_github_responsity') 14 | from spider.log_format import spider_log 15 | from config import log_folder_name 16 | from config import host, port, database_name 17 | from proxy_basic_config import collection_name, over_time 18 | import pymongo 19 | import time 20 | import urlparse 21 | from _request import valid 22 | 23 | diy_header = { 24 | 'User-Agent': 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/60.0.3112.90 Safari/537.36'} 25 | log_name = 'delete_not_update_ip' 26 | file_folder = log_folder_name 27 | delete_existed_log = False 28 | 29 | logger = spider_log(log_name=log_name, file_folder=file_folder, delete_existed_log=delete_existed_log) 30 | 31 | 32 | def connect_to_mongodb(): 33 | client = pymongo.MongoClient(host, port) 34 | db = client[database_name] 35 | collection = db[collection_name] 36 | return collection 37 | 38 | 39 | def format_time_to_timestamp(format_time): 40 | st = time.strptime(format_time, '%Y-%m-%d %H:%M:%S') 41 | return time.mktime(st) 42 | 43 | 44 | def check(when=time.time): 45 | collection = connect_to_mongodb() 46 | for data in collection.find(): 47 | ip = data.get('ip') 48 | target_url = data.get('target_url') 49 | ip_stamp = format_time_to_timestamp(data.get('insert_time')) 50 | 51 | has_existed = int(when() - ip_stamp) 52 | if has_existed > over_time: 53 | diy_header['Host'] = urlparse.urlparse(target_url).netloc 54 | 55 | _args = { 56 | "url": target_url, 57 | "diy_header": diy_header, 58 | "time_out": 5, 59 | "ip": ip,  # 注意键名必须是 ip, 与 _request.valid 中读取的键保持一致 60 | } 61 | 62 | _id = ip + '_' + target_url 63 | # 调用验证函数 64 | result1, result2 = valid(_args, False) 65 | if result1 is None: 66 | msg = 'delete ip: [{ip}], target_url is [{target_url}]'.format(ip=ip, target_url=target_url) 67 | logger.info(msg) 68 | collection.delete_one({'_id': _id}) 69 | else: 70 | msg = 'update ip: [{ip}], target_url is [{target_url}]'.format(ip=ip, target_url=target_url) 71 | logger.info(msg) 72 | # 只更新插入时间字段, 并保持与入库时一致的时间格式, 否则下次计算时间差会出错 73 | collection.update({'_id': _id}, {'$set': {'insert_time': time.strftime('%Y-%m-%d %H:%M:%S', time.localtime(when()))}}) 74 | 75 | if __name__ == '__main__': 76 | check(time.time) -------------------------------------------------------------------------------- /get_proxies_base_spider.py: -------------------------------------------------------------------------------- 1 | #!/usr/bin/env python 2 | # -*- coding: utf-8 -*- 3 | # Created by shimeng on 17-9-20 4 | 5 | import sys 6 | import re 7 | import os 8 | import time 9 | 10 | # 这里写你自己的框架保存地址 11 | sys.path.append('/home/shimeng/code/spider_framework_github_responsity') 12 | 13 | from spider.tools import format_put_data 14 | from spider.data_save import pipeline 15 | from spider.html_parser import parser 16 | from spider.page_downloader import aispider 17 | from spider.threads import start, work_queue, save_queue 18 | from spider.log_format import logger 19 | from proxy_basic_config import url_parse_dict, target_urls,collection_name 20 | from _request import valid 21 | 22 | 23 | # 定义主程序 24 | class SpiderMain(object): 25 | def
__init__(self, ): 26 | 27 | self.logger = logger 28 | self.parser = parser 29 | self.pipeline = pipeline 30 | self.target_urls = target_urls 31 | self.collection_name = collection_name 32 | 33 | 34 | def run(self): 35 | start() 36 | self.craw() 37 | 38 | def craw(self, request=aispider.request): 39 | 40 | for key, value in url_parse_dict.iteritems(): 41 | if value.get('status') == 'active': 42 | # 网站名 43 | website_name = key 44 | # 网站url 45 | website_urls = value.get('url') 46 | # 请求方法 47 | method = value.get('request_method') 48 | # 请求需要提交的数据 49 | post_datas = value.get('submit_data') 50 | # 解析方法 51 | parse_func = value.get('parse_func', 'system') 52 | if parse_func == 'system': 53 | parser = self.parse_to_get_ip 54 | else: 55 | parser = parse_func 56 | 57 | # 自定义头 58 | diy_header = value.get('header') 59 | 60 | for url in website_urls: 61 | # 调用format_put_data 构造放入队列中的数据 62 | if post_datas: 63 | for post_data in post_datas: 64 | put_data = format_put_data( 65 | args={"url": url, "method": method, 'submit_data': post_data, 'diy_header': diy_header}, 66 | work_func=request, 67 | follow_func=self.get_and_check, 68 | meta={'value': value, 'website_name': website_name, 'parser': parser}) 69 | # 放入队列 70 | work_queue.put(put_data) 71 | 72 | else: 73 | put_data = format_put_data(args={"url": url, "method": method, 'data': post_datas}, 74 | work_func=request, 75 | follow_func=self.get_and_check, 76 | meta={'value': value, 'website_name': website_name, 77 | 'parser': parser}) 78 | # 放入队列 79 | work_queue.put(put_data) 80 | 81 | def get_and_check(self, response): 82 | value = response.get('meta').get('value') 83 | html_content = response.get('content') 84 | # 网站名 85 | website_name = response.get('meta').get('website_name') 86 | # 解析类型: xpath, re 87 | parse_type = value.get('parse_type') 88 | # 解析函数 89 | parser = response.get('meta').get('parser') 90 | 91 | parser(value=value, html_content=html_content, parse_type=parse_type, website_name=website_name) 92 | 93 | def parse_to_get_ip(self, value, html_content, parse_type, website_name): 94 | ips = [] 95 | 96 | if parse_type == 'xpath': 97 | # xpath 98 | # 端口与地址是否在一起 99 | ip_port_together = value.get('ip_port_together') 100 | if ip_port_together: 101 | ip_and_port_xpath = value.get('parse_method').get('ip_address_and_port') 102 | ip_and_port = self.parser.get_data_by_xpath(html_content, ip_and_port_xpath) 103 | ips.extend(ip_and_port) 104 | 105 | else: 106 | ip_address_xpath = value.get('parse_method').get('ip_address') 107 | ip_port_xpath = value.get('parse_method').get('ip_port') 108 | 109 | ip_address = self.parser.get_data_by_xpath(html_content, ip_address_xpath) 110 | ip_port = self.parser.get_data_by_xpath(html_content, ip_port_xpath) 111 | for index, value in enumerate(ip_address): 112 | ips.append((ip_address[index] + ':' + ip_port[index])) 113 | 114 | elif parse_type == 're': 115 | # re 116 | ip_and_port_pattern = value.get('parse_method').get('_pattern') 117 | ip_and_port = parser.get_data_by_re(html_content, ip_and_port_pattern, flags=re.S) 118 | 119 | if ip_and_port: 120 | for data in ip_and_port: 121 | proxy = ':'.join(data) 122 | ips.append(proxy) 123 | 124 | # 调用检测程序 125 | self.start_check(ips, website_name) 126 | 127 | def start_check(self, ips, website_name): 128 | if ips: 129 | # 检测 130 | for _ip in ips: 131 | for target_url in self.target_urls: 132 | url = target_url 133 | # 调用format_put_data 构造放入队列中的数据 134 | put_data = format_put_data(args={"url": url, 'ip': _ip, 'time_out': 5}, work_func=valid, 135 | need_save=True, 136 | 
save_func=self.save_ip, 137 | meta={'website_name': website_name, 'target_url': target_url}) 138 | # 放入队列 139 | work_queue.put(put_data) 140 | else: 141 | msg = 'There Are No Available From [{website_name}] Can Be Used To Check, Please Check!!!!!!!'.format( 142 | website_name=website_name) 143 | logger.error(msg) 144 | 145 | # 上一步中定义的保存函数 146 | def save_ip(self, response): 147 | website_name = response.get('meta').get('website_name') 148 | response_time = response.get('content') 149 | target_url = response.get('meta').get('target_url') 150 | _ip = response.get('url') 151 | 152 | msg = '[{ip}] can visit the target url [{target_url}], source is [{source}]'.format(ip=_ip, 153 | target_url=target_url, 154 | source=website_name) 155 | logger.info(msg) 156 | # mongodb 集合名称 157 | 158 | insert_data = {} 159 | 160 | insert_data['_id'] = _ip+'_'+target_url 161 | insert_data['ip'] = _ip 162 | insert_data['source'] = website_name 163 | insert_data['response_time'] = response_time 164 | insert_data['target_url'] = target_url 165 | 166 | insert_data['insert_time'] = time.strftime('%Y-%m-%d %H:%M:%S') 167 | 168 | # 保存数入库 169 | self.pipeline.process_item(insert_data, self.collection_name) 170 | 171 | 172 | 173 | if __name__ == '__main__': 174 | # 测试代码 175 | spidermain = SpiderMain() 176 | spidermain.run() 177 | 178 | # blocking 179 | work_queue.join() 180 | save_queue.join() 181 | 182 | # finishing crawl origin ip 183 | logger.info('available proxy has been saved in your database, please check!') 184 | -------------------------------------------------------------------------------- /proxy_api.py: -------------------------------------------------------------------------------- 1 | #!/usr/bin/env python 2 | # -*- coding: utf-8 -*- 3 | # Created by shimeng on 17-9-20 4 | """ 5 | 6 | API 7 | 8 | """ 9 | 10 | import sys 11 | 12 | sys.path.append('/home/shimeng/code/spider_framework_github_responsity') 13 | from spider.data_save import pipeline 14 | from db_method import DB 15 | from flask import Flask, jsonify, request 16 | 17 | db = DB(pipeline.db['proxy']) 18 | 19 | app = Flask(__name__) 20 | 21 | api_list = { 22 | 'count': u'get the count of proxy', 23 | 'get_one': u'get a random proxy', 24 | 'get_all': u'get all proxy from proxy pool', 25 | 'delete?ip=127.0.0.1:8080&target_url=https://www.baidu.com': u'delete the proxy which is unavailable', 26 | } 27 | 28 | 29 | @app.route('/') 30 | def index(): 31 | return jsonify(api_list) 32 | 33 | 34 | @app.route('/count/') 35 | def count(): 36 | num = db.total() 37 | msg = 'the total number is [%d]' % num 38 | return msg 39 | 40 | 41 | @app.route('/get_one/') 42 | def get(): 43 | proxy = db.get_one() 44 | print proxy 45 | return proxy 46 | 47 | 48 | @app.route('/get_all/') 49 | def get_all(): 50 | proxies = db.get_all() 51 | 52 | return jsonify([proxy.decode('utf8') for proxy in proxies]) 53 | 54 | 55 | @app.route('/delete/', methods=['GET']) 56 | def delete(): 57 | proxy = request.args.get('ip') 58 | target_url = request.args.get('target_url') 59 | _id = proxy + '_' + target_url 60 | db.delete_one(_id) 61 | return 'Delete Successfully' 62 | 63 | 64 | def run(): 65 | app.run(host='0.0.0.0', port=22555, debug=True) 66 | 67 | 68 | if __name__ == '__main__': 69 | run() 70 | -------------------------------------------------------------------------------- /proxy_basic_config.py: -------------------------------------------------------------------------------- 1 | #!/usr/bin/env python 2 | # -*- coding: utf-8 -*- 3 | # Created by shimeng on 17-9-19 4 | 5 | """ 6 | 7 | 
代理网址及解析字典 8 | 9 | status 代理状态, 若不想爬取此网站,可以将status设置为非active的任意值 10 | request_method , 请求方法, 必写, 当为post的时候, 必须定义提交的post_data, 否则会报错.因项目的特殊性, 提交的数据中会带有页码数据, 所以在这里将 11 | post_data 定义为列表, 里面的数据为字典格式 12 | url 代理网址 13 | 14 | parse_type 解析类型,默认提供: xpath, re 15 | 16 | (1) xpath 17 | ip_port_together ip地址和ip的端口是否在一个字段中 18 | 若为地址与端口在一起,则建议key为ip_address_and_port 19 | 若为地址与端口不在一起,则建议key为ip_address, ip_port 20 | 21 | (2) re 22 | 若解析的类型为re, 则ip_port_together可以为任意的值 23 | parse_method中只有一个键: _pattern 24 | 25 | parse_func 解析函数, 默认值为system, 当需要使用自定义的解析函数的时候, 需要显式的定义该字段为自定义的解析函数 26 | 解析函数要有四个参数, 分别为value, html_content, parse_type, website_name 27 | 28 | header 因网址较多, 所以在这里可以自定义头 29 | """ 30 | 31 | from custom_get_ip.get_ip_from_peauland import peauland_parser, peauland_format_post_data, peauland_header 32 | 33 | # 定义检测的目标网站 34 | target_urls = ['https://www.baidu.com', 'https://httpbin.org/get'] 35 | 36 | # 数据库集合名 37 | collection_name = 'proxy' 38 | 39 | # 数据库中IP存活时间阀值, 超过及对其重新检测 40 | over_time = 1800 41 | 42 | url_parse_dict = { 43 | # data5u 44 | 'data5u': { 45 | 'status':'active', 46 | 'request_method':'get', 47 | 'url': ['http://www.data5u.com/free/{tag}/index.shtml'.format(tag=tag) for tag in ['gngn', 'gnpt', 'gwgn', 'gwpt']], 48 | 'parse_type': 'xpath', 49 | 'ip_port_together': False, 50 | 'parse_method':{ 51 | 'ip_address': '//ul[@class="l2"]/span[1]/li/text()', 52 | 'ip_port': '//ul[@class="l2"]/span[2]/li/text()', 53 | }, 54 | 'parse_func': 'system' 55 | }, 56 | 57 | # xicidaili 58 | 'xicidaili': { 59 | 'status': 'active', 60 | 'request_method': 'get', 61 | 'url': ['http://www.xicidaili.com/nn/{page}'.format(page=page) for page in range(1, 10)], 62 | 'parse_type': 'xpath', 63 | 'ip_port_together': False, 64 | 'parse_method': { 65 | 'ip_address': '//tr[@class="odd"]/td[2]/text()', 66 | 'ip_port': '//tr[@class="odd"]/td[3]/text()', 67 | }, 68 | 'parse_func': 'system' 69 | 70 | }, 71 | 72 | # 66ip 73 | '66ip': { 74 | 'status': 'active', 75 | 'request_method': 'get', 76 | 'url': ['http://m.66ip.cn/{page}.html'.format(page=page) for page in range(1, 10)], 77 | 'parse_type': 're', 78 | 'ip_port_together': False, 79 | 'parse_method': { 80 | '_pattern': '