├── .gitignore
├── Readme.md
├── __init__.py
├── doc
│   ├── API.md
│   └── design.md
├── handlers
│   ├── handler_8000.py
│   ├── handler_8001.py
│   ├── handler_8002.py
│   ├── handler_8003.py
│   ├── handler_8004.py
│   └── handler_template.py
├── proxypool.py
├── requirements.txt
└── settings.yaml

/.gitignore:
--------------------------------------------------------------------------------
# Byte-compiled / optimized / DLL files
__pycache__/
*.py[cod]

# C extensions
*.so

# Distribution / packaging
.Python
env/
bin/
build/
develop-eggs/
dist/
eggs/
lib/
lib64/
parts/
sdist/
var/
*.egg-info/
.installed.cfg
*.egg

# Installer logs
pip-log.txt
pip-delete-this-directory.txt

# Unit test / coverage reports
htmlcov/
.tox/
.coverage
.cache
nosetests.xml
coverage.xml

# Translations
*.mo

# Mr Developer
.mr.developer.cfg
.project
.pydevproject

# Rope
.ropeproject

# Django stuff:
*.log
*.pot

# Sphinx documentation
docs/_build/

bak/

--------------------------------------------------------------------------------
/Readme.md:
--------------------------------------------------------------------------------
### Purpose
Work around the rate limits that target sites place on crawlers by accessing them through anonymous proxies.

### Features
Serves lists of anonymous proxies over a simple HTTP API.

### Setup
1. Install the project dependencies

Python dependencies (note that this project targets Python 3.3):

```shell
$ pip3 install -r requirements.txt
```

A redis-server also needs to be installed on the system.

2. Fetch and validate proxies

From the project root, run:

```shell
$ python3.3 proxypool.py
```

3. Configure nginx

A minimal nginx.conf looks like this:

```nginx
worker_processes 1;

events {
    worker_connections 1024;
}

http {
    include mime.types;
    default_type application/octet-stream;

    sendfile on;

    keepalive_timeout 65;

    upstream frontends {
        server 127.0.0.1:8000;
        server 127.0.0.1:8001;
        server 127.0.0.1:8002;
        server 127.0.0.1:8003;
        server 127.0.0.1:8004;
    }

    server {
        listen 9000;
        server_name localhost;

        location / {
            proxy_pass http://frontends;
            proxy_set_header Host $host;
            proxy_set_header X-Real-IP $remote_addr;
            proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
            proxy_set_header X-Forwarded-Proto $scheme;
        }
    }
}
```

Check the configuration, then start nginx:

```shell
# nginx -t
# nginx
```

4. Start the handlers

From the project directory, run:

```shell
$ cd handlers
$ for f in handler_800?.py; do python3.3 "$f" & done
```

(A bare `python3.3 handler_800* &` would pass all five files to a single interpreter invocation; the loop starts one process per handler.)

5. Test

Install the
[Postman](https://chrome.google.com/webstore/detail/postman-rest-client/fdmmgilgnpjigdojojpjoooidkmcomcm?utm_source=chrome-ntp-launcher)
extension for Chrome, then send a POST request to http://127.0.0.1:9000/proxylist and inspect the result.
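
For a quick check without Postman, the same request can be made with curl (assuming nginx is listening on port 9000 as configured above; the field values are just an illustration):

```shell
$ curl -X POST -d 'target=58&num=5&delay=10' http://127.0.0.1:9000/proxylist
```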

### Other documentation
1. [API reference](/proxypool/doc/API.md)

2. [Design notes](/proxypool/doc/design.md)

### Feedback
Problems encountered while using this project can be reported to me (zhangyifei@baixing.com) at any time.

--------------------------------------------------------------------------------
/__init__.py:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/wenson/proxypool/75bb59a591adacb30caec4b37a83908cc0b762b3/__init__.py
--------------------------------------------------------------------------------
/doc/API.md:
--------------------------------------------------------------------------------
### API reference
#### Request
Send an HTTP POST request to http://127.0.0.1:9000/proxylist. The POST fields are:

* target (optional)
  The target site, i.e. the site for which the best-performing proxies are wanted. Currently baidu, 58 and ganji are stored. If the field is empty or names a site that is not predefined, the proxy list best suited for baidu is returned.
* num (optional)
  The number of proxies wanted; the shipped handlers default to 5. If the proxy pool holds fewer matching proxies than requested, fewer proxies are returned.
* delay (optional)
  The maximum acceptable proxy latency, in seconds; defaults to 10.

Examples:

* no POST data
  Returns the 5 best proxies for baidu, each reaching baidu within 10s.
* target=baixing
  Returns the 5 best proxies for baidu, each reaching baidu within 10s (there is no configuration for baixing, so it falls back to baidu).
* target=58
  Returns the 5 best proxies for 58, each reaching 58 within 10s.
* ...
  Combine target, num and delay to describe the proxy list you need, as in the client sketch below.
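
A minimal client sketch using the requests library (already one of this project's dependencies); the field values are illustrative:

```python
import requests

# ask for at most 3 proxies that reach 58 within 5 seconds
resp = requests.post('http://127.0.0.1:9000/proxylist',
                     data={'target': '58', 'num': 3, 'delay': 5})
data = resp.json()

if data['status'].startswith('success'):   # 'success' or 'success-partial'
    for proxy in data['proxylist']['proxies']:
        print(proxy)
else:
    print('failed:', data['err'])
```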

#### Response
All responses are JSON, with three possible statuses:

1. The proxy list was retrieved successfully and the target site is in the default configuration:

```javascript
{
    "status": "success",
    "proxylist": {
        "target": "ganji",
        "num": 3,
        "proxies": [
            "http://120.197.85.182:18253",
            "http://222.87.129.29:80",
            "http://122.96.59.103:80"
        ]
    }
}
```

2. The proxy list was retrieved successfully, but the target site is not in the default configuration. The "success-partial" status below means the returned proxies may only partially work for the target site:

```javascript
{
    "status": "success-partial",
    "proxylist": {
        "target": "baixing",
        "num": 3,
        "proxies": [
            "http://120.197.85.182:18253",
            "http://222.87.129.29:80",
            "http://122.96.59.103:80"
        ]
    }
}
```

3. Failure:

```javascript
{
    "status": "failure",
    "err": "reason for the failure",
    "target": "the target site"
}
```

--------------------------------------------------------------------------------
/doc/design.md:
--------------------------------------------------------------------------------
### Overall design
The system is split into two parts:

1. The proxy pool

Three layers, all stored in redis (a redis-level sketch follows the list):

* crawl free http proxies from the web and store them in a set
* check whether each crawled proxy is anonymous and store the anonymous ones in a second set
* measure each anonymous proxy's access latency and store it in sorted sets, one per target site
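
A rough redis-cli view of the three layers, using the key names from settings.yaml:

```shell
$ redis-cli -n 3                              # proxies live in redis db 3
> SMEMBERS sproxy_all                         # layer 1: every crawled proxy
> SMEMBERS sproxy_anon                        # layer 2: proxies verified as anonymous
> ZRANGEBYSCORE zproxy_58 0 10 WITHSCORES     # layer 3: latency (s) per target site
> GET mtime_58                                # unix timestamp of the last refresh
```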

2. The HTTP service

An nginx + tornado + proxypool stack:

* nginx provides the single access point and load-balances across the handlers
* tornado handles the requests nginx forwards, pulls a matching proxy list out of the proxy pool and returns it
* proxypool supplies the anonymous proxy lists to tornado

--------------------------------------------------------------------------------
/handlers/handler_8000.py:
--------------------------------------------------------------------------------
#!/usr/bin/env python3
# -*- coding: utf-8 -*-

# -----------------------------------------
# Author: zhangyifei
# Date: 2014-03-04 21:18:55
# -----------------------------------------

"""Handle requests forwarded by nginx and return the matching proxy list.
"""

import os
import sys
import json

import tornado.ioloop
import tornado.web

# the project root (one level up) must be importable when this file is run
# from inside handlers/, as the Readme suggests
sys.path.insert(0, os.path.dirname(os.path.dirname(os.path.abspath(__file__))))

from proxypool import ProxyPool


class ProxyListHandler(tornado.web.RequestHandler):
    """Return the proxy list the client asked for, as JSON.
    Example:
    + success
        {
            'status': 'success',
            'proxylist': {
                'num': 5,
                'mtime': 1394069326,
                'proxies': [
                    'http://220.248.180.149:3128',
                    'http://61.55.141.11:81',
                    ......,
                ],
            },
        }
        - if the database holds a proxy list for the requested site and it is
          returned correctly, the status is 'success'
        - if the database holds no proxy list for the requested site but a
          list is returned anyway, the status is 'success-partial', meaning
          the returned proxies may only partially work for that site
    + failure
        {
            'status': 'failure',
            'err': 'reason for the failure',
        }
    """
    def get(self):
        self.write('Please refer to the API doc.')

    def post(self):
        target = self.get_argument('target', default='') or 'all'
        num = int(self.get_argument('num', default='') or 5)
        delay = int(self.get_argument('delay', default='') or 10)

        proxypool = ProxyPool()

        try:
            proxies = proxypool.get_many(target=target, num=num, maxscore=delay)
            num_ret = len(proxies)
            mtime = proxypool.get_mtime(target=target)

            proxylist = []
            for proxy in proxies:
                proxylist.append(proxy.decode('utf-8'))

            if str(target).upper() in proxypool.targets:
                status = 'success'
            else:
                status = 'success-partial'

            ret = {
                'status': status,
                'proxylist': {
                    'num': num_ret,
                    'mtime': mtime,
                    'target': target,
                    'proxies': proxylist,
                },
            }
        except Exception as e:
            ret = {
                'status': 'failure',
                'target': target,
                'err': str(e),
            }

        self.set_header('Content-Type', 'application/json')

        self.write(json.dumps(ret))


class MainHandler(tornado.web.RequestHandler):
    """Handle requests that match no other route."""
    def get(self):
        self.write('Please refer to the API doc.')

    def post(self):
        self.write('Please refer to the API doc.')


app = tornado.web.Application([
    (r'/proxylist', ProxyListHandler),
    (r'.*', MainHandler),
])


if __name__ == '__main__':
    app.listen(8000)
    tornado.ioloop.IOLoop.instance().start()
--------------------------------------------------------------------------------
/handlers/handler_8001.py:
--------------------------------------------------------------------------------
#!/usr/bin/env python3
# -*- coding: utf-8 -*-

# -----------------------------------------
# Author: zhangyifei
# Date: 2014-03-04 21:18:55
# -----------------------------------------

"""Handle requests forwarded by nginx and return the matching proxy list.
"""

import os
import sys
import json

import tornado.ioloop
import tornado.web

# the project root (one level up) must be importable when this file is run
# from inside handlers/, as the Readme suggests
sys.path.insert(0, os.path.dirname(os.path.dirname(os.path.abspath(__file__))))

from proxypool import ProxyPool


class ProxyListHandler(tornado.web.RequestHandler):
    """Return the proxy list the client asked for, as JSON.
    Example:
    + success
        {
            'status': 'success',
            'proxylist': {
                'num': 5,
                'mtime': 1394069326,
                'proxies': [
                    'http://220.248.180.149:3128',
                    'http://61.55.141.11:81',
                    ......,
                ],
            },
        }
        - if the database holds a proxy list for the requested site and it is
          returned correctly, the status is 'success'
        - if the database holds no proxy list for the requested site but a
          list is returned anyway, the status is 'success-partial', meaning
          the returned proxies may only partially work for that site
    + failure
        {
            'status': 'failure',
            'err': 'reason for the failure',
        }
    """
    def get(self):
        self.write('Please refer to the API doc.')

    def post(self):
        target = self.get_argument('target', default='') or 'all'
        num = int(self.get_argument('num', default='') or 5)
        delay = int(self.get_argument('delay', default='') or 10)

        proxypool = ProxyPool()

        try:
            proxies = proxypool.get_many(target=target, num=num, maxscore=delay)
            num_ret = len(proxies)
            mtime = proxypool.get_mtime(target=target)

            proxylist = []
            for proxy in proxies:
                proxylist.append(proxy.decode('utf-8'))

            if str(target).upper() in proxypool.targets:
                status = 'success'
            else:
                status = 'success-partial'

            ret = {
                'status': status,
                'proxylist': {
                    'num': num_ret,
                    'mtime': mtime,
                    'target': target,
                    'proxies': proxylist,
                },
            }
        except Exception as e:
            ret = {
                'status': 'failure',
                'target': target,
                'err': str(e),
            }

        self.set_header('Content-Type', 'application/json')

        self.write(json.dumps(ret))


class MainHandler(tornado.web.RequestHandler):
    """Handle requests that match no other route."""
    def get(self):
        self.write('Please refer to the API doc.')

    def post(self):
        self.write('Please refer to the API doc.')


app = tornado.web.Application([
    (r'/proxylist', ProxyListHandler),
    (r'.*', MainHandler),
])


if __name__ == '__main__':
    app.listen(8001)
    tornado.ioloop.IOLoop.instance().start()
--------------------------------------------------------------------------------
/handlers/handler_8002.py:
--------------------------------------------------------------------------------
#!/usr/bin/env python3
# -*- coding: utf-8 -*-

# -----------------------------------------
# Author: zhangyifei
# Date: 2014-03-04 21:18:55
# -----------------------------------------

"""Handle requests forwarded by nginx and return the matching proxy list.
"""

import os
import sys
import json

import tornado.ioloop
import tornado.web

# the project root (one level up) must be importable when this file is run
# from inside handlers/, as the Readme suggests
sys.path.insert(0, os.path.dirname(os.path.dirname(os.path.abspath(__file__))))

from proxypool import ProxyPool


class ProxyListHandler(tornado.web.RequestHandler):
    """Return the proxy list the client asked for, as JSON.
    Example:
    + success
        {
            'status': 'success',
            'proxylist': {
                'num': 5,
                'mtime': 1394069326,
                'proxies': [
                    'http://220.248.180.149:3128',
                    'http://61.55.141.11:81',
                    ......,
                ],
            },
        }
        - if the database holds a proxy list for the requested site and it is
          returned correctly, the status is 'success'
        - if the database holds no proxy list for the requested site but a
          list is returned anyway, the status is 'success-partial', meaning
          the returned proxies may only partially work for that site
    + failure
        {
            'status': 'failure',
            'err': 'reason for the failure',
        }
    """
    def get(self):
        self.write('Please refer to the API doc.')

    def post(self):
        target = self.get_argument('target', default='') or 'all'
        num = int(self.get_argument('num', default='') or 5)
        delay = int(self.get_argument('delay', default='') or 10)

        proxypool = ProxyPool()

        try:
            proxies = proxypool.get_many(target=target, num=num, maxscore=delay)
            num_ret = len(proxies)
            mtime = proxypool.get_mtime(target=target)

            proxylist = []
            for proxy in proxies:
                proxylist.append(proxy.decode('utf-8'))

            if str(target).upper() in proxypool.targets:
                status = 'success'
            else:
                status = 'success-partial'

            ret = {
                'status': status,
                'proxylist': {
                    'num': num_ret,
                    'mtime': mtime,
                    'target': target,
                    'proxies': proxylist,
                },
            }
        except Exception as e:
            ret = {
                'status': 'failure',
                'target': target,
                'err': str(e),
            }

        self.set_header('Content-Type', 'application/json')

        self.write(json.dumps(ret))


class MainHandler(tornado.web.RequestHandler):
    """Handle requests that match no other route."""
    def get(self):
        self.write('Please refer to the API doc.')

    def post(self):
        self.write('Please refer to the API doc.')


app = tornado.web.Application([
    (r'/proxylist', ProxyListHandler),
    (r'.*', MainHandler),
])


if __name__ == '__main__':
    app.listen(8002)
    tornado.ioloop.IOLoop.instance().start()
--------------------------------------------------------------------------------
/handlers/handler_8003.py:
--------------------------------------------------------------------------------
#!/usr/bin/env python3
# -*- coding: utf-8 -*-

# -----------------------------------------
# Author: zhangyifei
# Date: 2014-03-04 21:18:55
# -----------------------------------------

"""Handle requests forwarded by nginx and return the matching proxy list.
"""

import os
import sys
import json

import tornado.ioloop
import tornado.web

# the project root (one level up) must be importable when this file is run
# from inside handlers/, as the Readme suggests
sys.path.insert(0, os.path.dirname(os.path.dirname(os.path.abspath(__file__))))

from proxypool import ProxyPool


class ProxyListHandler(tornado.web.RequestHandler):
    """Return the proxy list the client asked for, as JSON.
    Example:
    + success
        {
            'status': 'success',
            'proxylist': {
                'num': 5,
                'mtime': 1394069326,
                'proxies': [
                    'http://220.248.180.149:3128',
                    'http://61.55.141.11:81',
                    ......,
                ],
            },
        }
        - if the database holds a proxy list for the requested site and it is
          returned correctly, the status is 'success'
        - if the database holds no proxy list for the requested site but a
          list is returned anyway, the status is 'success-partial', meaning
          the returned proxies may only partially work for that site
    + failure
        {
            'status': 'failure',
            'err': 'reason for the failure',
        }
    """
    def get(self):
        self.write('Please refer to the API doc.')

    def post(self):
        target = self.get_argument('target', default='') or 'all'
        num = int(self.get_argument('num', default='') or 5)
        delay = int(self.get_argument('delay', default='') or 10)

        proxypool = ProxyPool()

        try:
            proxies = proxypool.get_many(target=target, num=num, maxscore=delay)
            num_ret = len(proxies)
            mtime = proxypool.get_mtime(target=target)

            proxylist = []
            for proxy in proxies:
                proxylist.append(proxy.decode('utf-8'))

            if str(target).upper() in proxypool.targets:
                status = 'success'
            else:
                status = 'success-partial'

            ret = {
                'status': status,
                'proxylist': {
                    'num': num_ret,
                    'mtime': mtime,
                    'target': target,
                    'proxies': proxylist,
                },
            }
        except Exception as e:
            ret = {
                'status': 'failure',
                'target': target,
                'err': str(e),
            }

        self.set_header('Content-Type', 'application/json')

        self.write(json.dumps(ret))


class MainHandler(tornado.web.RequestHandler):
    """Handle requests that match no other route."""
    def get(self):
        self.write('Please refer to the API doc.')

    def post(self):
        self.write('Please refer to the API doc.')


app = tornado.web.Application([
    (r'/proxylist', ProxyListHandler),
    (r'.*', MainHandler),
])


if __name__ == '__main__':
    app.listen(8003)
    tornado.ioloop.IOLoop.instance().start()
--------------------------------------------------------------------------------
/handlers/handler_8004.py:
--------------------------------------------------------------------------------
#!/usr/bin/env python3
# -*- coding: utf-8 -*-

# -----------------------------------------
# Author: zhangyifei
# Date: 2014-03-04 21:18:55
# -----------------------------------------

"""Handle requests forwarded by nginx and return the matching proxy list.
"""

import os
import sys
import json

import tornado.ioloop
import tornado.web

# the project root (one level up) must be importable when this file is run
# from inside handlers/, as the Readme suggests
sys.path.insert(0, os.path.dirname(os.path.dirname(os.path.abspath(__file__))))

from proxypool import ProxyPool


class ProxyListHandler(tornado.web.RequestHandler):
    """Return the proxy list the client asked for, as JSON.
    Example:
    + success
        {
            'status': 'success',
            'proxylist': {
                'num': 5,
                'mtime': 1394069326,
                'proxies': [
                    'http://220.248.180.149:3128',
                    'http://61.55.141.11:81',
                    ......,
                ],
            },
        }
        - if the database holds a proxy list for the requested site and it is
          returned correctly, the status is 'success'
        - if the database holds no proxy list for the requested site but a
          list is returned anyway, the status is 'success-partial', meaning
          the returned proxies may only partially work for that site
    + failure
        {
            'status': 'failure',
            'err': 'reason for the failure',
        }
    """
    def get(self):
        self.write('Please refer to the API doc.')

    def post(self):
        target = self.get_argument('target', default='') or 'all'
        num = int(self.get_argument('num', default='') or 5)
        delay = int(self.get_argument('delay', default='') or 10)

        proxypool = ProxyPool()

        try:
            proxies = proxypool.get_many(target=target, num=num, maxscore=delay)
            num_ret = len(proxies)
            mtime = proxypool.get_mtime(target=target)

            proxylist = []
            for proxy in proxies:
                proxylist.append(proxy.decode('utf-8'))

            if str(target).upper() in proxypool.targets:
                status = 'success'
            else:
                status = 'success-partial'

            ret = {
                'status': status,
                'proxylist': {
                    'num': num_ret,
                    'mtime': mtime,
                    'target': target,
                    'proxies': proxylist,
                },
            }
        except Exception as e:
            ret = {
                'status': 'failure',
                'target': target,
                'err': str(e),
            }

        self.set_header('Content-Type', 'application/json')

        self.write(json.dumps(ret))


class MainHandler(tornado.web.RequestHandler):
    """Handle requests that match no other route."""
    def get(self):
        self.write('Please refer to the API doc.')

    def post(self):
        self.write('Please refer to the API doc.')


app = tornado.web.Application([
    (r'/proxylist', ProxyListHandler),
    (r'.*', MainHandler),
])


if __name__ == '__main__':
    app.listen(8004)
    tornado.ioloop.IOLoop.instance().start()
--------------------------------------------------------------------------------
/handlers/handler_template.py:
--------------------------------------------------------------------------------
#!/usr/bin/env python3
# -*- coding: utf-8 -*-

# -----------------------------------------
# Author: zhangyifei
# Date: 2014-03-04 21:18:55
# -----------------------------------------

"""Handle requests forwarded by nginx and return the matching proxy list.
"""

import os
import sys
import json

import tornado.ioloop
import tornado.web

# the project root (one level up) must be importable when this file is run
# from inside handlers/, as the Readme suggests
sys.path.insert(0, os.path.dirname(os.path.dirname(os.path.abspath(__file__))))

from proxypool import ProxyPool


class ProxyListHandler(tornado.web.RequestHandler):
    """Return the proxy list the client asked for, as JSON.
    Example:
    + success
        {
            'status': 'success',
            'proxylist': {
                'num': 5,
                'mtime': 1394069326,
                'proxies': [
                    'http://220.248.180.149:3128',
                    'http://61.55.141.11:81',
                    ......,
                ],
            },
        }
        - if the database holds a proxy list for the requested site and it is
          returned correctly, the status is 'success'
        - if the database holds no proxy list for the requested site but a
          list is returned anyway, the status is 'success-partial', meaning
          the returned proxies may only partially work for that site
    + failure
        {
            'status': 'failure',
            'err': 'reason for the failure',
        }
    """
    def get(self):
        self.write('Please refer to the API doc.')

    def post(self):
        target = self.get_argument('target', default='') or 'all'
        num = int(self.get_argument('num', default='') or 5)
        delay = int(self.get_argument('delay', default='') or 10)

        proxypool = ProxyPool()

        try:
            proxies = proxypool.get_many(target=target, num=num, maxscore=delay)
            num_ret = len(proxies)
            mtime = proxypool.get_mtime(target=target)

            proxylist = []
            for proxy in proxies:
                proxylist.append(proxy.decode('utf-8'))

            if str(target).upper() in proxypool.targets:
                status = 'success'
            else:
                status = 'success-partial'

            ret = {
                'status': status,
                'proxylist': {
                    'num': num_ret,
                    'mtime': mtime,
                    'target': target,
                    'proxies': proxylist,
                },
            }
        except Exception as e:
            ret = {
                'status': 'failure',
                'target': target,
                'err': str(e),
            }

        self.set_header('Content-Type', 'application/json')

        self.write(json.dumps(ret))


class MainHandler(tornado.web.RequestHandler):
    """Handle requests that match no other route."""
    def get(self):
        self.write('Please refer to the API doc.')

    def post(self):
        self.write('Please refer to the API doc.')


app = tornado.web.Application([
    (r'/proxylist', ProxyListHandler),
    (r'.*', MainHandler),
])


if __name__ == '__main__':
    app.listen(8000)
    tornado.ioloop.IOLoop.instance().start()
--------------------------------------------------------------------------------
/proxypool.py:
--------------------------------------------------------------------------------
#!/usr/bin/env python3
# -*- coding: utf-8 -*-

# -----------------------------------------
# Author: zhangyifei
# Date: 2014-03-04
# -----------------------------------------

"""Proxy pool service.

NOTE:
+ currently serves anonymous http proxies only
+ pass str into etree.HTML(), not bytes, e.g.:
  - etree.HTML(requests.get('http://www.baidu.com').text)
+ redis is single-threaded and the client is thread-safe, so no locking is
  needed when several threads operate on it

TODO:

Thinking:
+ gevent does not support python3.x (at the time of writing), so it cannot be used
+ asyncio offers coroutines, but there is no http library built on it yet,
  so it does not help here either
+ for now, concurrency is handled with threads

!!!:
+ redis is reached over a socket, so the limit on open file descriptors
  applies; two workarounds:
  - raise the limit with `ulimit -n <number>`
  - sleep a few random seconds, then reconnect to redis
+ 58 and ganji ban proxies aggressively
"""

import os
import sys
import time
import random
import pprint
import logging
import threading
import concurrent.futures
from concurrent.futures import ThreadPoolExecutor

import requests
import yaml
import redis
from lxml import etree

logging.basicConfig(level=logging.INFO,
                    format='[%(asctime)s][%(levelname)s] %(message)s',
                    datefmt='%Y-%m-%d %H:%M:%S')


class ProxyPool(object):
    """The proxy pool.
    """
    def __init__(self, configfile='settings.yaml'):
        self.configs = self._get_configs(configfile)

        self.rdb = redis.StrictRedis(db=self.configs['STORE']['RDB'])
        self.try_times_db = self.configs['STORE']['TRY']
        self.try_time_wait = self.configs['STORE']['TIME_WAIT']
        self.sproxy_all = self.configs['STORE']['SPROXY_ALL']
        self.sproxy_anon = self.configs['STORE']['SPROXY_ANON']
        self.init_value = self.configs['VALIDATE']['INIT_VALUE']
        self.timeout_valid = self.configs['VALIDATE']['TIMEOUT_VALID']
        self.time_exception = self.configs['VALIDATE']['TIME_EXCEPTION']
        self.targets = list(self.configs['TARGET'].keys())
        self.url_reflect = self.configs['URL']['REFLECT']

        self.tnum_proxy_getter = self.configs['CONCURRENT']['PROXY_GETTER']
        self.tnum_proxy_filter = self.configs['CONCURRENT']['PROXY_FILTER']
        self.tnum_proxy_valid = self.configs['CONCURRENT']['PROXY_VALID']

    def _get_configs(self, configfile):
        """Return the configuration dict"""
        # resolve the config file relative to this module, so the handlers
        # can be started from the handlers/ directory as well
        configpath = os.path.join(os.path.dirname(os.path.abspath(__file__)),
                                  configfile)
        with open(configpath, 'rb') as fp:
            configs = yaml.safe_load(fp)

        return configs

    def get_mtime(self, target='all'):
        """Return the time the proxy list for `target` was last updated."""
        target = str(target).upper()
        if target not in self.targets:
            target = 'ALL'

        db_mtime = self.configs['TARGET'][target]['DB_MTIME']
        mtime = self.rdb.get(db_mtime)

        return int(mtime)

    def get_many(self, target='all', num=10, minscore=0, maxscore=None):
        """
        Return a list of at most 'num' proxies whose scores are between
        'minscore' and 'maxscore'.
        If no proxy matches, return an empty list.
        """
        # XXX: the current strategy fetches every matching proxy from the
        # database and only then cuts the list down to the requested number.
        # There is room for improvement here.
        target = str(target).upper()
        if target not in self.targets:
            target = 'ALL'

        db = self.configs['TARGET'][target]['DB_PROXY']
        maxscore = maxscore or self.init_value
        res = self.rdb.zrangebyscore(db, minscore, maxscore)
        if res:
            random.shuffle(res)  # for getting random results
            if len(res) < num:
                logging.warning("Only %d proxies match, fewer than the %d requested"
                                % (len(res), num))
            return res[:num]
        else:
            logging.warning("There are no proxies with scores between %d and %d"
                            % (minscore, maxscore))
            return []

    def get_one(self, target='all', minscore=0, maxscore=None):
        """
        Return one proxy whose score is between 'minscore' and 'maxscore'.
        If no proxy matches, return an empty string.
        """
        maxscore = maxscore or self.init_value
        res = self.get_many(target=target, num=1, minscore=minscore, maxscore=maxscore)

        if res:
            return res[0]
        else:
            return ''

    def fetch_proxies(self):
        """Get proxies from various sources."""
        # from web pages
        self._crawl_proxies_sites()

    def _crawl_proxies_sites(self):
        """Get proxies from web pages."""
        with ThreadPoolExecutor(max_workers=self.tnum_proxy_getter) as executor:
            for url, val in self.configs['PROXY_SITES'].items():
                executor.submit(self._crawl_proxies_one_site, url, val['rules'], val['proxies'])

    def _crawl_proxies_one_site(self, url=None, rules=None, proxies=None):
        # Get proxies (ip:port) from url and then write them into redis.
        # XXX: fetching and parsing should be separated
        headers = self.configs['CRAWL']['HEADERS']
        logging.info('Begin crawl page %s' % (url,))

        res = requests.get(url, headers=headers, proxies=proxies)
        html = etree.HTML(res.text)
        found = []  # don't reuse `proxies`, which holds the request proxies
        len_rules = len(rules)

        nodes = html.xpath(rules[0])
        if nodes:
            if len_rules == 1:
                for node in nodes:
                    text = node.text.strip()
                    if text:
                        found.append('http://%s' % (text,))
            elif len_rules == 2:
                rule_1 = rules[1].split(',')
                rule_1_len = len(rule_1)

                if rule_1_len == 3:
                    for node in nodes:
                        try:
                            cells = node.xpath(rule_1[0])
                            # rule_1[1] and rule_1[2] are the column indices of
                            # ip and port (previously hardcoded as 1 and 2,
                            # which ignored the per-site settings)
                            ip = cells[int(rule_1[1])].text.strip()
                            port = cells[int(rule_1[2])].text.strip() or '80'
                            if ip:
                                found.append('http://%s:%s' % (ip, port))
                        except Exception as e:
                            logging.error('Error when parsing %s: %r' % (url, e))
                elif rule_1_len == 4:
                    for node in nodes:
                        try:
                            ip = node.xpath(rule_1[0])[0].text.strip()
                            port = node.xpath(rule_1[1])[0].text.strip()
                            anon_label = node.xpath(rule_1[2])[0].text.strip()
                            site_tag = rule_1[-1]

                            # compare str with str; the old code encoded one
                            # side to bytes, so the comparison never matched
                            if site_tag == 'proxy360' and ip and anon_label == '高匿':
                                found.append('http://%s:%s' % (ip, port))
                            elif site_tag == 'cn-proxy' and ip and anon_label == '高度匿名':
                                found.append('http://%s:%s' % (ip, port))
                        except Exception as e:
                            logging.error('Error when parsing %s: %r' % (url, e))

        for proxy in found:
            logging.info('Got proxy %s from %s' % (proxy, url))
            self.rdb.sadd(self.sproxy_all, proxy)
    def get_ip_local(self):
        # Get this machine's public ip, trying at most `try_times` times. If
        # every attempt fails, abort the whole program: without the local ip
        # we cannot tell whether a proxy is actually anonymous.
        timeout = self.configs['LOCAL_IP']['TIMEOUT']
        try_times = self.configs['LOCAL_IP']['TRY']
        for times_try in range(try_times):
            try:
                headers = self.configs['CRAWL']['HEADERS']
                res = requests.get(self.url_reflect, headers=headers, timeout=timeout)

                # http://httpbin.org/ip answers {"origin": "<ip>"}
                ip_local = res.json()['origin']

                return ip_local
            except Exception:
                logging.debug('Times of trying to get local ip: %d' % (times_try+1,))
        else:
            # all attempts failed
            sys.exit("Fatal error: couldn't get the local ip.")

    def filter_anony(self):
        # check whether each crawled proxy is anonymous
        # XXX: decide whether the loop below is needed
        # while True:
        #     # wait until the crawler threads have finished
        #     num_active = threading.active_count()
        #     if num_active == 1:
        #         break
        #     else:
        #         logging.debug('Number of active: %d' % (num_active,))
        #         time.sleep(1)

        self.ip_local = self.get_ip_local()
        proxies = self.rdb.smembers(self.sproxy_all)
        self._filter_anony(proxies)

    def _filter_anony(self, proxies):
        # pick the anonymous proxies out of `proxies`, whose members look
        # like b'http://ip:port'
        with ThreadPoolExecutor(max_workers=self.tnum_proxy_filter) as executor:
            for proxy in proxies:
                executor.submit(self._valid_anony, proxy)

    def _valid_anony(self, proxy):
        # Check whether `proxy` ('http://ip:port') is an anonymous http proxy.
        # Strategy:
        # + if it is anonymous, add it to sproxy_anon
        # + if not, remove it from sproxy_anon (removing a missing member is a no-op)
        proxies = {
            'http': proxy.decode('utf-8'),
        }

        try:
            headers = self.configs['CRAWL']['HEADERS']
            res = requests.get(self.url_reflect, headers=headers, proxies=proxies, timeout=10)
        except Exception as e:
            logging.error('Error when validating anonymous: %r' % (e,))
            return

        if self.ip_local not in res.text:
            logging.info('Anonymous: %s' % (proxy,))
            self.rdb.sadd(self.sproxy_anon, proxy)
        else:
            logging.info('NON-Anonymous: %s' % (proxy,))
            self.rdb.srem(self.sproxy_anon, proxy)

    def valid_active(self):
        # measure the usability of every anonymous proxy
        proxies = self.rdb.smembers(self.sproxy_anon)

        with ThreadPoolExecutor(max_workers=self.tnum_proxy_valid) as executor:
            for target in self.targets:
                for proxy in proxies:
                    executor.submit(self._efficiency_proxy, proxy, target)

    def _efficiency_proxy(self, proxy, target):
        # Judge whether an anonymous proxy is alive by timing requests to the
        # configured target sites through it.
        # XXX: the sites are currently visited one after another; consider
        # making these requests concurrent
        target = str(target).upper()
        test_site = self.configs['TARGET'][target]['URL']
        db_proxy = self.configs['TARGET'][target]['DB_PROXY']
        db_mtime = self.configs['TARGET'][target]['DB_MTIME']
        validate = self.configs['TARGET'][target]['VALIDATE']

        proxy = proxy.decode('utf-8')
        time_delay = self._timing_proxy(proxy, test_site, val=validate)
        mtime = int(time.time())

        try_times = 0
        while True:
            # try writing to redis, reconnecting up to self.try_times_db times
            try:
                # NOTE: redis-py 2.x signature, as pinned in requirements.txt;
                # redis-py >= 3 expects zadd(name, {member: score}) instead
                self.rdb.zadd(db_proxy, time_delay, proxy)
                self.rdb.set(db_mtime, mtime)

                logging.info('Have validated %s' % (proxy,))

                break
            except Exception as e:
                try_times += 1
                logging.info('Tried %d' % (try_times,))
                if try_times >= self.try_times_db:
                    logging.error(e)
                    break
                time.sleep(random.randint(0, self.try_time_wait))
                self.rdb = redis.StrictRedis(db=self.configs['STORE']['RDB'])

    def _timing_proxy(self, proxy, site, val):
        # measure how long fetching `site` through `proxy` takes
        time_start = time.time()

        try:
            # the old code omitted proxies=..., so it timed a direct request
            # rather than one through the proxy under test
            res = requests.get(site, proxies={'http': proxy},
                               timeout=self.timeout_valid)
            html = etree.HTML(res.text)
            title = html.xpath("/html/head/title")[0].text

            # the title must contain the marker configured under VALIDATE in
            # settings.yaml (presumably the intent of the unused `val`),
            # otherwise the proxy served a bogus page
            if res.status_code == 200 and val in title:
                time_end = time.time()
            else:
                logging.error('Error when validating %s' % (proxy,))
                time_end = -1
        except Exception as e:
            logging.error('Error when validating %s: %r' % (proxy, e))
            time_end = -1

        if time_end == -1:
            time_interval = self.time_exception
        else:
            time_interval = time_end - time_start

        return time_interval


if __name__ == '__main__':
    proxypool = ProxyPool()
    proxypool.fetch_proxies()   # crawl proxies
    proxypool.filter_anony()    # pick out the anonymous ones
    proxypool.valid_active()    # measure their usability
    # pprint.pprint(proxypool.get_many(num=3, maxscore=10, target='58'))
    # pprint.pprint(proxypool.get_many(num=3, maxscore=10, target=58))
    # pprint.pprint(proxypool.get_many(num=3, maxscore=10, target='baixing'))
    # pprint.pprint(proxypool.get_many())
--------------------------------------------------------------------------------
/requirements.txt:
--------------------------------------------------------------------------------
lxml==3.3.2
requests==2.2.1
pyaml==13.12.0
redis==2.9.1
tornado==3.2
--------------------------------------------------------------------------------
/settings.yaml:
--------------------------------------------------------------------------------
STORE:
  RDB: 3                     # proxies live in redis db 3
  TRY: 3                     # retries when connecting to redis
  TIME_WAIT: 5
  SPROXY_ALL: sproxy_all     # set of all crawled proxies, keyed by proxy
  SPROXY_ANON: sproxy_anon   # set of anonymous proxies, keyed by proxy

TARGET:
  ALL:
    URL: http://www.baidu.com
    DB_PROXY: zproxy_all
    DB_MTIME: mtime_all      # time of last update
    VALIDATE: '百度'
  '58':
    URL: http://www.58.com
    DB_PROXY: zproxy_58
    DB_MTIME: mtime_58
    VALIDATE: '58同城'
  GANJI:
    URL: http://www.ganji.com
    DB_PROXY: zproxy_ganji
    DB_MTIME: mtime_ganji
    VALIDATE: '赶集'
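  # Hypothetical example of tracking one more site -- add an entry of the
  # same shape under TARGET (the names below are made up, not part of the
  # project):
  #
  # BAIXING:
  #   URL: http://www.baixing.com
  #   DB_PROXY: zproxy_baixing
  #   DB_MTIME: mtime_baixing
  #   VALIDATE: '百姓网'      # string the target page's title must contain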

URL:
  REFLECT: http://httpbin.org/ip

CRAWL:
  HEADERS:
    Accept: text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8
    Accept-Encoding: gzip,deflate,sdch
    Accept-Language: en-US,en;q=0.8
    Connection: keep-alive
    Referer: http://www.baidu.com
    User-Agent: Mozilla/5.0 (X11; Linux i686) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/30.0.1599.114 Safari/537.36


CONCURRENT:
  PROXY_GETTER: 100   # max threads crawling proxies from the web
  PROXY_FILTER: 100   # max threads filtering for anonymous proxies
  PROXY_VALID: 10     # max threads validating proxy usability

PROXY_SITES:
  http://www.site-digger.com/html/articles/20110516/proxieslist.html:
    rules:
      - //table[@id='proxies_table']/tbody/tr/td[1]
    proxies:
      http: ''
      https: ''

  http://pachong.org/area/short/name/cn.html:
    rules:
      - //tr[@data-type='anonymous'] | //tr[@data-type='high']
      - .//td,1,2
    proxies:            # proxies used to reach this site itself, if it is blocked locally
      http: ''
      https: ''

  http://proxy.com.ru/gaoni:
    freq: 600           # refresh interval, in seconds
    rules:
      - //table[@bordercolor='#CCCCCC'][last()-1]//tr[position()>1]   # nodes holding ip and port
      - ./td,2,3        # columns for ip and port
    proxies:
      http: ''
      https: ''

  http://proxy.com.ru/niming:
    freq: 600           # refresh interval, in seconds
    rules:
      - //table[@bordercolor='#CCCCCC'][last()-1]//tr[position()>1]   # nodes holding ip and port
      - ./td,1,2        # columns for ip and port
    proxies:
      http: ''
      https: ''

  http://www.proxy360.cn/Proxy:
    freq: -1            # refresh interval unknown
    rules:
      - //div[@name='list_proxy_ip']/div[1]          # nodes holding ip and port
      - ./span[1],./span[2],./span[3],proxy360       # ip, port, anonymity flag
    proxies:
      http: ''
      https: ''

  http://cn-proxy.com/archives/218:
    freq: -1            # refresh interval unknown
    rules:
      - //table/tbody/tr                             # nodes holding ip and port
      - ./td[1],./td[2],./td[3],cn-proxy             # ip, port, anonymity flag
    proxies:
      http: ''
      https: ''

  # http://www.cnproxy.com/proxy1.html:
  #   freq: -1
  #   rules:
  #     - //table[last()]/tr[position()>1]           # nodes holding ip and port
  #     - ./td,

VALIDATE:
  INIT_VALUE: 20
  TIMEOUT_VALID: 10
  TIME_EXCEPTION: 100000

LOCAL_IP:
  TIMEOUT: 10
  TRY: 3
--------------------------------------------------------------------------------