├── .gitignore
├── requirements.txt
├── Config.ini
├── DB
│   ├── __init__.py
│   ├── SsdbClient.py
│   └── DbClient.py
├── Util
│   ├── __init__.py
│   ├── utilFunction.py
│   ├── utilClass.py
│   └── GetConfig.py
├── ProxyGetter
│   ├── __init__.py
│   └── getFreeProxy.py
├── __init__.py
├── Api
│   ├── __init__.py
│   └── ProxyApi.py
├── Manager
│   ├── __init__.py
│   └── ProxyManager.py
├── Schedule
│   ├── __init__.py
│   └── ProxyRefreshSchedule.py
└── README.md

--------------------------------------------------------------------------------
/.gitignore:
--------------------------------------------------------------------------------
.idea/
*.pyc

--------------------------------------------------------------------------------
/requirements.txt:
--------------------------------------------------------------------------------
APScheduler==3.2.0
Flask==0.11.1
requests==2.11.0
ssdb==0.0.3
lxml

--------------------------------------------------------------------------------
/Config.ini:
--------------------------------------------------------------------------------
[DB]
type = SSDB
host = localhost
port = 8080
name = proxy

[ProxyGetter]
; register the proxy getter functions here
freeProxyFirst = 1
freeProxySecond = 1
freeProxyThird = 1
freeProxyFourth = 1
freeProxyFifth = 1

--------------------------------------------------------------------------------
/DB/__init__.py:
--------------------------------------------------------------------------------
# -*- coding: utf-8 -*-
"""
-------------------------------------------------
File Name:   __init__.py
Description :
Author :     JHao
date:        2016/12/2
-------------------------------------------------
Change Activity:
    2016/12/2:
-------------------------------------------------
"""

--------------------------------------------------------------------------------
/Util/__init__.py:
--------------------------------------------------------------------------------
# -*- coding: utf-8 -*-
"""
-------------------------------------------------
File Name:   __init__.py
Description :
Author :     JHao
date:        2016/11/25
-------------------------------------------------
Change Activity:
    2016/11/25:
-------------------------------------------------
"""

--------------------------------------------------------------------------------
/ProxyGetter/__init__.py:
--------------------------------------------------------------------------------
# -*- coding: utf-8 -*-
"""
-------------------------------------------------
File Name:   __init__.py
Description :
Author :     JHao
date:        2016/11/25
-------------------------------------------------
Change Activity:
    2016/11/25:
-------------------------------------------------
"""

--------------------------------------------------------------------------------
/__init__.py:
--------------------------------------------------------------------------------
# -*- coding: utf-8 -*-
"""
-------------------------------------------------
File Name:   __init__.py
Description :
Author :     JHao
date:        2016/12/3
-------------------------------------------------
Change Activity:
    2016/12/3:
-------------------------------------------------
"""
__author__ = 'JHao'

--------------------------------------------------------------------------------
/Api/__init__.py:
--------------------------------------------------------------------------------
# -*- coding: utf-8 -*-
"""
-------------------------------------------------
File Name:   __init__.py
Description :
Author :     JHao
date:        2016/12/3
-------------------------------------------------
Change Activity:
    2016/12/3:
-------------------------------------------------
"""
__author__ = 'JHao'

--------------------------------------------------------------------------------
/Manager/__init__.py:
--------------------------------------------------------------------------------
# -*- coding: utf-8 -*-
"""
-------------------------------------------------
File Name:   __init__.py
Description :
Author :     JHao
date:        2016/12/3
-------------------------------------------------
Change Activity:
    2016/12/3:
-------------------------------------------------
"""
__author__ = 'JHao'

--------------------------------------------------------------------------------
/Schedule/__init__.py:
--------------------------------------------------------------------------------
# -*- coding: utf-8 -*-
"""
-------------------------------------------------
File Name:   __init__.py
Description :
Author :     JHao
date:        2016/12/3
-------------------------------------------------
Change Activity:
    2016/12/3:
-------------------------------------------------
"""
__author__ = 'JHao'

--------------------------------------------------------------------------------
/Util/utilFunction.py:
--------------------------------------------------------------------------------
# -*- coding: utf-8 -*-
#!/usr/bin/env python
"""
-------------------------------------------------
File Name:   utilFunction.py
Description: tool functions
Author :     JHao
date:        2016/11/25
-------------------------------------------------
Change Activity:
    2016/11/25: add robustCrawl, verifyProxy, getHtmlTree
-------------------------------------------------
"""


# noinspection PyPep8Naming
def robustCrawl(func):
    """decorator: swallow and report any exception raised while crawling"""
    def decorate(*args, **kwargs):
        try:
            return func(*args, **kwargs)
        except Exception as e:
            print u"sorry, crawl error. reason:"
            print e

    return decorate


def verifyProxy(proxy):
    """
    check that the proxy looks like ip:port
    :param proxy:
    :return:
    """
    import re
    verify_regex = r"\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3}:\d{1,5}"
    return bool(re.findall(verify_regex, proxy))


def getHtmlTree(url, **kwargs):
    """
    fetch a url and return the parsed lxml element tree
    :param url:
    :param kwargs:
    :return:
    """
    import requests
    from lxml import etree
    html = requests.get(url=url).content
    return etree.HTML(html)

--------------------------------------------------------------------------------
/Api/ProxyApi.py:
--------------------------------------------------------------------------------
# -*- coding: utf-8 -*-
#!/usr/bin/env python
"""
-------------------------------------------------
File Name:   ProxyApi.py
Description :
Author :     JHao
date:        2016/12/4
-------------------------------------------------
Change Activity:
    2016/12/4:
-------------------------------------------------
"""
__author__ = 'JHao'

import sys

sys.path.append('../')

from flask import Flask, jsonify, request

from Manager.ProxyManager import ProxyManager

app = Flask(__name__)

api_list = {
    'get': u'get a usable proxy',
    'refresh': u'refresh the proxy pool',
    'get_all': u'get all proxies from the proxy pool',
    'delete?proxy=127.0.0.1:8080': u'delete an unusable proxy',
}


@app.route('/')
def index():
    return jsonify(api_list)


@app.route('/get/')
def get():
    proxy = ProxyManager().get()
    return proxy


@app.route('/refresh/')
def refresh():
    ProxyManager().refresh()
    return 'success'


@app.route('/get_all/')
def getAll():
    proxies = ProxyManager().getAll()
    return jsonify(proxies)


@app.route('/delete/', methods=['GET'])
def delete():
    proxy = request.args.get('proxy')
    ProxyManager().delete(proxy)
    return 'success'


if __name__ == '__main__':
    app.run()
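The endpoints above can be exercised from any HTTP client. Below is a minimal client sketch, assuming the Flask app runs at its default address `127.0.0.1:5000` (the address the README's own example uses); the helper names `proxies_dict`, `fetch_proxy` and `remove_proxy` are illustrative and not part of the project.

```python
# Minimal client sketch for the ProxyApi endpoints (base URL assumed below).

API_BASE = "http://127.0.0.1:5000"


def proxies_dict(proxy):
    """Turn an 'ip:port' string into the mapping that requests expects."""
    return {"http": "http://{0}".format(proxy),
            "https": "https://{0}".format(proxy)}


def fetch_proxy():
    """GET /get/ returns a single proxy as plain text."""
    import requests  # deferred so proxies_dict stays dependency-free
    return requests.get(API_BASE + "/get/").content


def remove_proxy(proxy):
    """GET /delete/?proxy=... removes a dead proxy from the pool."""
    import requests
    requests.get(API_BASE + "/delete/", params={"proxy": proxy})
```

A spider would call `fetch_proxy()`, pass the result through `proxies_dict()` into `requests.get`, and call `remove_proxy()` when a proxy stops working.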
--------------------------------------------------------------------------------
/Util/utilClass.py:
--------------------------------------------------------------------------------
# -*- coding: utf-8 -*-
#!/usr/bin/env python
"""
-------------------------------------------------
File Name:   utilClass.py
Description: tool classes
Author :     JHao
date:        2016/12/3
-------------------------------------------------
Change Activity:
    2016/12/3: class LazyProperty
    2016/12/4: rewrite ConfigParser
-------------------------------------------------
"""
__author__ = 'JHao'

from ConfigParser import ConfigParser


class LazyProperty(object):
    """
    LazyProperty: compute the attribute once, then cache it on the instance
    explain: http://www.spiderpy.cn/blog/5/
    """

    def __init__(self, func):
        self.func = func

    def __get__(self, instance, owner):
        if instance is None:
            return self
        else:
            value = self.func(instance)
            setattr(instance, self.func.__name__, value)
            return value


class ConfigParse(ConfigParser):
    """
    ConfigParser subclass that keeps option names case-sensitive
    """

    def __init__(self):
        ConfigParser.__init__(self)

    def optionxform(self, optionstr):
        return optionstr


class Singleton(type):
    """
    Singleton metaclass
    """

    _inst = {}

    def __call__(cls, *args, **kwargs):
        if cls not in cls._inst:
            cls._inst[cls] = super(Singleton, cls).__call__(*args, **kwargs)
        return cls._inst[cls]

--------------------------------------------------------------------------------
/Util/GetConfig.py:
--------------------------------------------------------------------------------
# -*- coding: utf-8 -*-
#!/usr/bin/env python
"""
-------------------------------------------------
File Name:   GetConfig.py
Description: fetch config from Config.ini
Author :     JHao
date:        2016/12/3
-------------------------------------------------
Change Activity:
    2016/12/3: db property getters
-------------------------------------------------
"""
__author__ = 'JHao'

import os

from Util.utilClass import ConfigParse
from Util.utilClass import LazyProperty


class GetConfig(object):
    """
    read config from Config.ini
    """

    def __init__(self):
        self.pwd = os.path.split(os.path.realpath(__file__))[0]
        self.config_path = os.path.join(os.path.split(self.pwd)[0], 'Config.ini')
        self.config_file = ConfigParse()
        self.config_file.read(self.config_path)

    @LazyProperty
    def db_type(self):
        return self.config_file.get('DB', 'type')

    @LazyProperty
    def db_name(self):
        return self.config_file.get('DB', 'name')

    @LazyProperty
    def db_host(self):
        return self.config_file.get('DB', 'host')

    @LazyProperty
    def db_port(self):
        return int(self.config_file.get('DB', 'port'))

    @LazyProperty
    def proxy_getter_functions(self):
        return self.config_file.options('ProxyGetter')


if __name__ == '__main__':
    gg = GetConfig()
    print gg.db_type
    print gg.db_name
    print gg.db_host
    print gg.db_port
    print gg.proxy_getter_functions

--------------------------------------------------------------------------------
/DB/SsdbClient.py:
--------------------------------------------------------------------------------
# -*- coding: utf-8 -*-
#!/usr/bin/env python
"""
-------------------------------------------------
File Name:   SsdbClient.py
Description: SSDB operations wrapper
Author :     JHao
date:        2016/12/2
-------------------------------------------------
Change Activity:
    2016/12/2:
-------------------------------------------------
"""
__author__ = 'JHao'

import json
import random

from ssdb import SSDB
from ssdb.connection import BlockingConnectionPool


class SsdbClient(object):
    """
    SSDB client
    """

    def __init__(self, name, host, port):
        """
        init
        :param name:
        :param host:
        :param port:
        :return:
        """
        self.name = name
        self.__conn = SSDB(connection_pool=BlockingConnectionPool(host=host, port=port))

    def get(self):
        """
        get a random item
        :return:
        """
        values = self.__conn.hgetall(name=self.name)
        return random.choice(values.keys()) if values else None

    def put(self, value):
        """
        put an item
        :param value:
        :return:
        """
        # json.dumps, not json.dump (which writes to a file object)
        value = json.dumps(value, ensure_ascii=False).encode('utf-8') if isinstance(value, (dict, list)) else value
        return self.__conn.hset(self.name, value, None)

    def pop(self):
        """
        get and remove an item
        :return:
        """
        key = self.get()
        if key:
            self.__conn.hdel(self.name, key)
        return key

    def delete(self, key):
        """
        delete an item
        :param key:
        :return:
        """
        self.__conn.hdel(self.name, key)

    def getAll(self):
        return self.__conn.hgetall(self.name).keys()

    def changeTable(self, name):
        self.name = name

--------------------------------------------------------------------------------
/Manager/ProxyManager.py:
--------------------------------------------------------------------------------
# -*- coding: utf-8 -*-
#!/usr/bin/env python
"""
-------------------------------------------------
File Name:   ProxyManager.py
Description :
Author :     JHao
date:        2016/12/3
-------------------------------------------------
Change Activity:
    2016/12/3:
-------------------------------------------------
"""
__author__ = 'JHao'

from DB.DbClient import DbClient
from Util.GetConfig import GetConfig
from ProxyGetter.getFreeProxy import GetFreeProxy


class ProxyManager(object):
    """
    ProxyManager
    """

    def __init__(self):
        self.db = DbClient()
        self.config = GetConfig()
        self.raw_proxy_queue = 'raw_proxy'
        self.useful_proxy_queue = 'useful_proxy_queue'

    def refresh(self):
        """
        fetch proxies into the DB via every registered ProxyGetter
        :return:
        """
        for proxyGetter in self.config.proxy_getter_functions:
            proxy_set = set()
            # fetch raw proxies
            for proxy in getattr(GetFreeProxy, proxyGetter.strip())():
                proxy_set.add(proxy)

            # store raw proxies
            self.db.changeTable(self.raw_proxy_queue)
            for proxy in proxy_set:
                self.db.put(proxy)

    def get(self):
        """
        return a usable proxy
        :return:
        """
        self.db.changeTable(self.useful_proxy_queue)
        return self.db.pop()

    def delete(self, proxy):
        """
        delete a proxy from the pool
        :param proxy:
        :return:
        """
        self.db.changeTable(self.useful_proxy_queue)
        self.db.delete(proxy)

    def getAll(self):
        """
        get all proxies from the pool
        :return:
        """
        self.db.changeTable(self.useful_proxy_queue)
        return self.db.getAll()


if __name__ == '__main__':
    pp = ProxyManager()
    pp.refresh()

--------------------------------------------------------------------------------
/DB/DbClient.py:
--------------------------------------------------------------------------------
# -*- coding: utf-8 -*-
#!/usr/bin/env python
"""
-------------------------------------------------
File Name:   DbClient.py
Description: DB factory class
Author :     JHao
date:        2016/12/2
-------------------------------------------------
Change Activity:
    2016/12/2:
-------------------------------------------------
"""
__author__ = 'JHao'

import os
import sys

from Util.GetConfig import GetConfig
from Util.utilClass import Singleton

sys.path.append(os.path.dirname(os.path.abspath(__file__)))


class DbClient(object):
    """
    DbClient: dispatch to the concrete client for the configured DB type
    """

    __metaclass__ = Singleton

    def __init__(self):
        """
        init
        :return:
        """
        self.config = GetConfig()
        self.__initDbClient()

    def __initDbClient(self):
        """
        init the concrete DB client
        :return:
        """
        __type = None
        if "SSDB" == self.config.db_type:
            __type = "SsdbClient"
        else:
            pass
        assert __type, 'type error, unsupported DB type: {}'.format(self.config.db_type)
        self.client = getattr(__import__(__type), __type)(name=self.config.db_name,
                                                          host=self.config.db_host,
                                                          port=self.config.db_port)

    def get(self, **kwargs):
        return self.client.get(**kwargs)

    def put(self, value, **kwargs):
        return self.client.put(value, **kwargs)

    def pop(self, **kwargs):
        return self.client.pop(**kwargs)

    def delete(self, value, **kwargs):
        return self.client.delete(value, **kwargs)

    def getAll(self):
        return self.client.getAll()

    def changeTable(self, name):
        self.client.changeTable(name)


if __name__ == "__main__":
    account = DbClient()
    print account.get()
    account.changeTable('use')
    account.put('ac')
    print account

--------------------------------------------------------------------------------
/Schedule/ProxyRefreshSchedule.py:
--------------------------------------------------------------------------------
# -*- coding: utf-8 -*-
#!/usr/bin/env python
"""
-------------------------------------------------
File Name:   ProxyRefreshSchedule.py
Description: scheduled proxy refresh
Author :     JHao
date:        2016/12/4
-------------------------------------------------
Change Activity:
    2016/12/4: scheduled proxy refresh
-------------------------------------------------
"""
__author__ = 'JHao'

import sys
import time

import requests
from apscheduler.schedulers.blocking import BlockingScheduler
from multiprocessing import Process

sys.path.append('../')

from Manager.ProxyManager import ProxyManager


class ProxyRefreshSchedule(ProxyManager):
    """
    scheduled proxy refresh
    """

    def __init__(self):
        ProxyManager.__init__(self)

    def validProxy(self):
        """
        pop raw proxies one by one; keep those that can reach a stable site
        """
        self.db.changeTable(self.raw_proxy_queue)
        raw_proxy = self.db.pop()
        while raw_proxy:
            proxies = {"http": "http://{proxy}".format(proxy=raw_proxy),
                       "https": "https://{proxy}".format(proxy=raw_proxy)}
            try:
                r = requests.get('https://www.baidu.com/', proxies=proxies, timeout=50, verify=False)
                if r.status_code == 200:
                    self.db.changeTable(self.useful_proxy_queue)
                    self.db.put(raw_proxy)
            except Exception:
                pass
            self.db.changeTable(self.raw_proxy_queue)
            raw_proxy = self.db.pop()


def refreshPool():
    pp = ProxyRefreshSchedule()
    pp.validProxy()


def main(process_num=100):
    p = ProxyRefreshSchedule()
    p.refresh()

    for num in range(process_num):
        P = Process(target=refreshPool, args=())
        P.start()
    print '{time}: refresh complete!'.format(time=time.ctime())


if __name__ == '__main__':
    main()
    sched = BlockingScheduler()
    # the interval trigger takes "minutes", not "minute"
    sched.add_job(main, 'interval', minutes=20)
    sched.start()

--------------------------------------------------------------------------------
/ProxyGetter/getFreeProxy.py:
--------------------------------------------------------------------------------
# -*- coding: utf-8 -*-
#!/usr/bin/env python
"""
-------------------------------------------------
File Name:   GetFreeProxy.py
Description: fetch free proxies
Author :     JHao
date:        2016/11/25
-------------------------------------------------
Change Activity:
    2016/11/25:
-------------------------------------------------
"""
import re
import sys

import requests

reload(sys)
sys.setdefaultencoding('utf-8')

from Util.utilFunction import robustCrawl, getHtmlTree


class GetFreeProxy(object):
    """
    proxy getter
    """

    def __init__(self):
        pass

    @staticmethod
    @robustCrawl
    def freeProxyFirst(page=10):
        """
        fetch proxies from kuaidaili http://www.kuaidaili.com/
        :param page: number of pages to crawl
        :return:
        """
        url_list = ('http://www.kuaidaili.com/proxylist/{page}/'.format(page=page) for page in range(1, page + 1))
        # no need for many pages: the later ones are all stale IPs with low availability
        for url in url_list:
            tree = getHtmlTree(url)
            proxy_list = tree.xpath('.//div[@id="index_free_list"]//tbody/tr')
            for proxy in proxy_list:
                yield ':'.join(proxy.xpath('./td/text()')[0:2])

    @staticmethod
    @robustCrawl
    def freeProxySecond(proxy_number=100):
        """
        fetch proxies from 66ip http://www.66ip.cn/
        :param proxy_number: number of proxies to request
        :return:
        """
        url = "http://m.66ip.cn/mo.php?sxb=&tqsl={}&port=&export=&ktip=&sxa=&submit=%CC%E1++%C8%A1&textarea=".format(
            proxy_number)
        html = requests.get(url).content
        for proxy in re.findall(r'\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3}:\d{1,5}', html):
            yield proxy

    @staticmethod
    @robustCrawl
    def freeProxyThird(days=1):
        """
        fetch proxies from youdaili http://www.youdaili.net/Daili/http/
        :param days: number of recent pages to crawl
        :return:
        """
        url = "http://www.youdaili.net/Daili/http/"
        tree = getHtmlTree(url)
        page_url_list = tree.xpath('.//div[@class="chunlist"]/ul//a/@href')[0:days]
        for page_url in page_url_list:
            html = requests.get(page_url).content
            proxy_list = re.findall(r'\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3}:\d{1,5}', html)
            for proxy in proxy_list:
                yield proxy

    @staticmethod
    @robustCrawl
    def freeProxyFourth():
        """
        fetch proxies from xicidaili http://api.xicidaili.com/free2016.txt
        :return:
        """
        url = "http://api.xicidaili.com/free2016.txt"
        html = requests.get(url).content
        for proxy in re.findall(r'\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3}:\d{1,5}', html):
            yield proxy

    @staticmethod
    @robustCrawl
    def freeProxyFifth():
        """
        fetch proxies from goubanjia http://www.goubanjia.com/free/gngn/index.shtml
        :return:
        """
        url = "http://www.goubanjia.com/free/gngn/index.shtml"
        tree = getHtmlTree(url)
        proxy_list = tree.xpath('.//td[@class="ip"]')
        for proxy in proxy_list:
            yield ''.join(proxy.xpath('.//text()'))


if __name__ == '__main__':
    gg = GetFreeProxy()
    for e in gg.freeProxyFifth():
        print e

--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------

Proxy IP Pool for Spiders
=======

> At work I built a stable proxy-pool service for a distributed deep-web crawler. It supplies valid proxies to thousands of spiders and makes sure each spider gets proxy IPs that actually work for its target site, so the spiders run fast and reliably. Work code cannot be open-sourced, of course, but in my spare time I got the itch and put together a simple proxy-pool service out of free resources.


### 1. The Problem

* Where do the proxy IPs come from?

  When I was first teaching myself to write spiders and had no proxies, I crawled sites that publish free ones, such as Xici and Kuaidaili; a handful of those proxies do work. If you have a better proxy source, you can of course plug it in yourself. Collecting free proxies is simple: visit the page -> extract with regex/xpath -> save.

* How is proxy quality ensured?

  Most free proxy IPs are certain not to work, otherwise nobody would sell paid ones (though in practice many paid proxies are unstable and unusable too). So harvested proxies cannot be used as-is; a checker should continuously try them against a stable site and see whether they still work. Since checking proxies is slow, this is best done with multiple threads or asynchronously.

* How are the harvested proxies stored?

  Here I have to recommend [SSDB](http://ssdb.io/docs/zh_cn/), a high-performance NoSQL database supporting multiple data structures, used here as a Redis alternative. It supports queues, hashes, sets and key-value pairs, handles terabyte-scale data, and makes a good intermediate store for distributed crawling.

* How can spiders use these proxies conveniently?
  The answer is obviously to turn it into a service. Python has plenty of web frameworks; pick any one and write an API for spiders to call. This has many benefits: when a spider finds a proxy unusable it can actively delete it through the API, and when the pool runs low it can actively trigger a refresh. That is more reliable than relying on the checker alone.

### 2. Proxy Pool Design

The pool consists of four parts:

* ProxyGetter:

  The proxy-fetching interface. There are currently five free proxy sources; each call crawls the latest proxies from these five sites into the DB. You can add further getters yourself;

* DB:

  Stores the proxy IPs; only SSDB is supported for now. For why SSDB, see this [article](https://www.sdk.cn/news/2684); I consider it a solid Redis alternative. If you have never used SSDB, installation is simple too, see [here](https://github.com/jhao104/memory-notes/blob/master/SSDB/SSDB%E5%AE%89%E8%A3%85%E9%85%8D%E7%BD%AE%E8%AE%B0%E5%BD%95.md);

* Schedule:

  A scheduled task that periodically checks the availability of the proxies in the DB and deletes the unusable ones. It also actively fetches fresh proxies into the DB via ProxyGetter;

* ProxyApi:

  The pool's external interface. Since the pool is still simple, I spent a couple of hours reading up on [Flask](http://flask.pocoo.org/) and happily settled on it. It exposes get/delete/refresh endpoints for spiders to use directly.

![design](https://pic2.zhimg.com/v2-f2756da2986aa8a8cab1f9562a115b55_b.png)

### 3. Code Modules

Python's high-level data structures, dynamic typing and dynamic binding make it well suited to rapid application development and to gluing existing components together. Building this proxy IP pool in Python is accordingly simple; the code is split into six modules:

* Api:

  The API code, currently implemented with Flask and very simple. Flask receives the client request and delegates to the implementation in ProxyManager, covering `get/delete/refresh/get_all`;

* DB:

  The database code; SSDB for now. Implemented with a factory pattern so other database types can be added later;

* Manager:

  The concrete implementation of the `get/delete/refresh/get_all` interfaces. For now the pool only manages proxies; later it may gain more features, such as binding proxies to spiders or binding proxies to accounts;

* ProxyGetter:

  The code that crawls free proxies, currently from five sites: [kuaidaili](http://www.kuaidaili.com), [66ip](http://www.66ip.cn/), [youdaili](http://www.youdaili.net/Daili/http/), [xicidaili](http://api.xicidaili.com/free2016.txt) and [goubanjia](http://www.goubanjia.com/free/gngn/index.shtml). In my tests these five sites publish only sixty or seventy usable fresh proxies per day; adding your own getter interfaces is supported;

* Schedule:

  The scheduled-task code. For now it only refreshes the pool periodically and validates proxies, using multiple processes;

* Util:

  Shared helper modules and functions: `GetConfig`, the class that reads Config.ini; `ConfigParse`, a ConfigParser subclass that makes options case-sensitive; `Singleton`, a singleton implementation; `LazyProperty`, lazily computed instance attributes; and so on;

* Other files:

  The configuration file Config.ini holds the database settings and the registered proxy getters. Add a new fetch method in GetFreeProxy and register it in Config.ini to enable it;

### 4. Install

Download the code:
```
git clone git@github.com:jhao104/proxy_pool.git

or download the zip file from https://github.com/jhao104/proxy_pool
```

Install dependencies:
```
pip install -r requirements.txt
```

Start:

```
The scheduled task and the API are started separately.
Configure your SSDB in Config.ini first.

From the project root:
>>> python -m Schedule.ProxyRefreshSchedule

and:
>>> python -m Api.ProxyApi
```

### 5. Usage

Once the scheduled task is running, it fetches proxies via every getter into the database and validates them; by default this repeats every 20 minutes. A minute or two after startup you can see the refreshed, usable proxies in SSDB:

![useful_proxy](https://pic2.zhimg.com/v2-12f9b7eb72f60663212f317535a113d1_b.png)

After ProxyApi.py is started you can use the endpoints from a browser to obtain proxies; screenshots below.

index page:

![index](https://pic3.zhimg.com/v2-a867aa3db1d413fea8aeeb4c693f004a_b.png)

get:

![get](https://pic1.zhimg.com/v2-f54b876b428893235533de20f2edbfe0_b.png)

get_all:

![get_all](https://pic3.zhimg.com/v2-5c79f8c07e04f9ef655b9bea406d0306_b.png)


To use the pool in spider code, wrap the API into helper functions, e.g.:
```
import requests

def get_proxy():
    return requests.get("http://127.0.0.1:5000/get/").content

def delete_proxy(proxy):
    requests.get("http://127.0.0.1:5000/delete/?proxy={}".format(proxy))

# your spider code

def spider():
    # ....
    proxy = get_proxy()
    requests.get('https://www.example.com',
                 proxies={"http": "http://{}".format(proxy),
                          "https": "https://{}".format(proxy)})
    # ....

```

### 6. Finally

Time was short, so both the features and the code are rough; I will improve them when I have time. If you like the project, give it a star on GitHub. Thanks!
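The README notes that a new proxy source is added by writing a fetch method in GetFreeProxy and registering it in Config.ini. Below is a hedged sketch of such an extension: the source URL and the method name `freeProxySixth` are hypothetical, while the extraction regex is the one the project already uses in getFreeProxy.py.

```python
# Sketch of an additional proxy getter; the URL and method name are placeholders.
import re

# same ip:port pattern the project uses throughout getFreeProxy.py
PROXY_RE = r'\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3}:\d{1,5}'


def extract_proxies(html):
    """Pull every ip:port pair out of a page body."""
    return re.findall(PROXY_RE, html)


# Inside class GetFreeProxy one could then add:
#
#     @staticmethod
#     @robustCrawl
#     def freeProxySixth():
#         import requests
#         html = requests.get("http://example.com/free-proxy-list").content
#         for proxy in extract_proxies(html):
#             yield proxy
#
# and register it under [ProxyGetter] in Config.ini:
#
#     freeProxySixth = 1
```

On the next `refresh()`, ProxyManager iterates `proxy_getter_functions` from Config.ini and would call the new getter along with the existing five.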