├── .gitignore
├── Readme.md
├── __init__.py
├── doc
│   ├── API.md
│   └── design.md
├── handlers
│   ├── handler_8000.py
│   ├── handler_8001.py
│   ├── handler_8002.py
│   ├── handler_8003.py
│   ├── handler_8004.py
│   └── handler_template.py
├── proxypool.py
├── requirements.txt
└── settings.yaml

/.gitignore:
--------------------------------------------------------------------------------
# Byte-compiled / optimized / DLL files
__pycache__/
*.py[cod]

# C extensions
*.so

# Distribution / packaging
.Python
env/
bin/
build/
develop-eggs/
dist/
eggs/
lib/
lib64/
parts/
sdist/
var/
*.egg-info/
.installed.cfg
*.egg

# Installer logs
pip-log.txt
pip-delete-this-directory.txt

# Unit test / coverage reports
htmlcov/
.tox/
.coverage
.cache
nosetests.xml
coverage.xml

# Translations
*.mo

# Mr Developer
.mr.developer.cfg
.project
.pydevproject

# Rope
.ropeproject

# Django stuff:
*.log
*.pot

# Sphinx documentation
docs/_build/

bak/

--------------------------------------------------------------------------------
/Readme.md:
--------------------------------------------------------------------------------
### Purpose
Work around the rate limits that target sites place on crawlers by accessing them through anonymous proxies.

### Features
Serves lists of anonymous proxies over a simple HTTP API.

### Setup
1. Install the project dependencies

Python dependencies (note that this project targets Python 3.3):

```shell
$ pip3 install -r requirements.txt
```

A redis-server also needs to be installed on the system.

2. Fetch and validate proxies

From the project root, run:

```shell
$ python3.3 proxypool.py
```

3. Configure nginx

A minimal nginx.conf looks like this:

```nginx
worker_processes 1;

events {
    worker_connections 1024;
}

http {
    include mime.types;
    default_type application/octet-stream;

    sendfile on;

    keepalive_timeout 65;

    upstream frontends {
        server 127.0.0.1:8000;
        server 127.0.0.1:8001;
        server 127.0.0.1:8002;
        server 127.0.0.1:8003;
        server 127.0.0.1:8004;
    }

    server {
        listen 9000;
        server_name localhost;

        location / {
            proxy_pass http://frontends;
            proxy_set_header Host $host;
            proxy_set_header X-Real-IP $remote_addr;
            proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
            proxy_set_header X-Forwarded-Proto $scheme;
        }
    }
}
```

Check the configuration, then start nginx:

```shell
# nginx -t
# nginx
```

4. Start the handlers

From the project directory, run:

```shell
$ cd handlers
$ for f in handler_800?.py; do python3.3 "$f" & done
```

(A bare `python3.3 handler_800* &` would pass all five files to a single interpreter invocation; the loop starts one process per handler.)

5. Test

Install the
[Postman](https://chrome.google.com/webstore/detail/postman-rest-client/fdmmgilgnpjigdojojpjoooidkmcomcm?utm_source=chrome-ntp-launcher)
extension for Chrome, then send a POST request to http://127.0.0.1:9000/proxylist and inspect the result.
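
For a quick check without Postman, the same request can be made with curl (assuming nginx is listening on port 9000 as configured above; the field values are just an illustration):

```shell
$ curl -X POST -d 'target=58&num=5&delay=10' http://127.0.0.1:9000/proxylist
```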

### Other documentation
1. [API reference](/proxypool/doc/API.md)

2. [Design notes](/proxypool/doc/design.md)

### Feedback
Problems encountered while using this project can be reported to me (zhangyifei@baixing.com) at any time.

--------------------------------------------------------------------------------
/__init__.py:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/wenson/proxypool/75bb59a591adacb30caec4b37a83908cc0b762b3/__init__.py
--------------------------------------------------------------------------------
/doc/API.md:
--------------------------------------------------------------------------------
### API reference
#### Request
Send an HTTP POST request to http://127.0.0.1:9000/proxylist. The POST fields are:

* target (optional)
  The target site, i.e. the site for which the best-performing proxies are wanted. Currently baidu, 58 and ganji are stored. If the field is empty or names a site that is not predefined, the proxy list best suited for baidu is returned.
* num (optional)
  The number of proxies wanted; the shipped handlers default to 5. If the proxy pool holds fewer matching proxies than requested, fewer proxies are returned.
* delay (optional)
  The maximum acceptable proxy latency, in seconds; defaults to 10.

Examples:

* no POST data
  Returns the 5 best proxies for baidu, each reaching baidu within 10s.
* target=baixing
  Returns the 5 best proxies for baidu, each reaching baidu within 10s (there is no configuration for baixing, so it falls back to baidu).
* target=58
  Returns the 5 best proxies for 58, each reaching 58 within 10s.
* ...
  Combine target, num and delay to describe the proxy list you need, as in the client sketch below.
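
A minimal client sketch using the requests library (already one of this project's dependencies); the field values are illustrative:

```python
import requests

# ask for at most 3 proxies that reach 58 within 5 seconds
resp = requests.post('http://127.0.0.1:9000/proxylist',
                     data={'target': '58', 'num': 3, 'delay': 5})
data = resp.json()

if data['status'].startswith('success'):   # 'success' or 'success-partial'
    for proxy in data['proxylist']['proxies']:
        print(proxy)
else:
    print('failed:', data['err'])
```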

#### Response
All responses are JSON, with three possible statuses:

1. The proxy list was retrieved successfully and the target site is in the default configuration:

```javascript
{
    "status": "success",
    "proxylist": {
        "target": "ganji",
        "num": 3,
        "proxies": [
            "http://120.197.85.182:18253",
            "http://222.87.129.29:80",
            "http://122.96.59.103:80"
        ]
    }
}
```

2. The proxy list was retrieved successfully, but the target site is not in the default configuration. The "success-partial" status below means the returned proxies may only partially work for the target site:

```javascript
{
    "status": "success-partial",
    "proxylist": {
        "target": "baixing",
        "num": 3,
        "proxies": [
            "http://120.197.85.182:18253",
            "http://222.87.129.29:80",
            "http://122.96.59.103:80"
        ]
    }
}
```

3. Failure:

```javascript
{
    "status": "failure",
    "err": "reason for the failure",
    "target": "the target site"
}
```

--------------------------------------------------------------------------------
/doc/design.md:
--------------------------------------------------------------------------------
### Overall design
The system is split into two parts:

1. The proxy pool

Three layers, all stored in redis (a redis-level sketch follows the list):

* crawl free http proxies from the web and store them in a set
* check whether each crawled proxy is anonymous and store the anonymous ones in a second set
* measure each anonymous proxy's access latency and store it in sorted sets, one per target site
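
A rough redis-cli view of the three layers, using the key names from settings.yaml:

```shell
$ redis-cli -n 3                              # proxies live in redis db 3
> SMEMBERS sproxy_all                         # layer 1: every crawled proxy
> SMEMBERS sproxy_anon                        # layer 2: proxies verified as anonymous
> ZRANGEBYSCORE zproxy_58 0 10 WITHSCORES     # layer 3: latency (s) per target site
> GET mtime_58                                # unix timestamp of the last refresh
```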

2. The HTTP service

An nginx + tornado + proxypool stack:

* nginx provides the single access point and load-balances across the handlers
* tornado handles the requests nginx forwards, pulls a matching proxy list out of the proxy pool and returns it
* proxypool supplies the anonymous proxy lists to tornado

--------------------------------------------------------------------------------
/handlers/handler_8000.py:
--------------------------------------------------------------------------------
#!/usr/bin/env python3
# -*- coding: utf-8 -*-

# -----------------------------------------
# Author: zhangyifei
# Date: 2014-03-04 21:18:55
# -----------------------------------------

"""Handle requests forwarded by nginx and return the matching proxy list.
"""

import os
import sys
import json

import tornado.ioloop
import tornado.web

# the project root (one level up) must be importable when this file is run
# from inside handlers/, as the Readme suggests
sys.path.insert(0, os.path.dirname(os.path.dirname(os.path.abspath(__file__))))

from proxypool import ProxyPool


class ProxyListHandler(tornado.web.RequestHandler):
    """Return the proxy list the client asked for, as JSON.
    Example:
    + success
        {
            'status': 'success',
            'proxylist': {
                'num': 5,
                'mtime': 1394069326,
                'proxies': [
                    'http://220.248.180.149:3128',
                    'http://61.55.141.11:81',
                    ......,
                ],
            },
        }
        - if the database holds a proxy list for the requested site and it is
          returned correctly, the status is 'success'
        - if the database holds no proxy list for the requested site but a
          list is returned anyway, the status is 'success-partial', meaning
          the returned proxies may only partially work for that site
    + failure
        {
            'status': 'failure',
            'err': 'reason for the failure',
        }
    """
    def get(self):
        self.write('Please refer to the API doc.')

    def post(self):
        target = self.get_argument('target', default='') or 'all'
        num = int(self.get_argument('num', default='') or 5)
        delay = int(self.get_argument('delay', default='') or 10)

        proxypool = ProxyPool()

        try:
            proxies = proxypool.get_many(target=target, num=num, maxscore=delay)
            num_ret = len(proxies)
            mtime = proxypool.get_mtime(target=target)

            proxylist = []
            for proxy in proxies:
                proxylist.append(proxy.decode('utf-8'))

            if str(target).upper() in proxypool.targets:
                status = 'success'
            else:
                status = 'success-partial'

            ret = {
                'status': status,
                'proxylist': {
                    'num': num_ret,
                    'mtime': mtime,
                    'target': target,
                    'proxies': proxylist,
                },
            }
        except Exception as e:
            ret = {
                'status': 'failure',
                'target': target,
                'err': str(e),
            }

        self.set_header('Content-Type', 'application/json')

        self.write(json.dumps(ret))


class MainHandler(tornado.web.RequestHandler):
    """Handle requests that match no other route."""
    def get(self):
        self.write('Please refer to the API doc.')

    def post(self):
        self.write('Please refer to the API doc.')


app = tornado.web.Application([
    (r'/proxylist', ProxyListHandler),
    (r'.*', MainHandler),
])


if __name__ == '__main__':
    app.listen(8000)
    tornado.ioloop.IOLoop.instance().start()
--------------------------------------------------------------------------------
/handlers/handler_8001.py:
--------------------------------------------------------------------------------
#!/usr/bin/env python3
# -*- coding: utf-8 -*-

# -----------------------------------------
# Author: zhangyifei
# Date: 2014-03-04 21:18:55
# -----------------------------------------

"""Handle requests forwarded by nginx and return the matching proxy list.
"""

import os
import sys
import json

import tornado.ioloop
import tornado.web

# the project root (one level up) must be importable when this file is run
# from inside handlers/, as the Readme suggests
sys.path.insert(0, os.path.dirname(os.path.dirname(os.path.abspath(__file__))))

from proxypool import ProxyPool


class ProxyListHandler(tornado.web.RequestHandler):
    """Return the proxy list the client asked for, as JSON.
    Example:
    + success
        {
            'status': 'success',
            'proxylist': {
                'num': 5,
                'mtime': 1394069326,
                'proxies': [
                    'http://220.248.180.149:3128',
                    'http://61.55.141.11:81',
                    ......,
                ],
            },
        }
        - if the database holds a proxy list for the requested site and it is
          returned correctly, the status is 'success'
        - if the database holds no proxy list for the requested site but a
          list is returned anyway, the status is 'success-partial', meaning
          the returned proxies may only partially work for that site
    + failure
        {
            'status': 'failure',
            'err': 'reason for the failure',
        }
    """
    def get(self):
        self.write('Please refer to the API doc.')

    def post(self):
        target = self.get_argument('target', default='') or 'all'
        num = int(self.get_argument('num', default='') or 5)
        delay = int(self.get_argument('delay', default='') or 10)

        proxypool = ProxyPool()

        try:
            proxies = proxypool.get_many(target=target, num=num, maxscore=delay)
            num_ret = len(proxies)
            mtime = proxypool.get_mtime(target=target)

            proxylist = []
            for proxy in proxies:
                proxylist.append(proxy.decode('utf-8'))

            if str(target).upper() in proxypool.targets:
                status = 'success'
            else:
                status = 'success-partial'

            ret = {
                'status': status,
                'proxylist': {
                    'num': num_ret,
                    'mtime': mtime,
                    'target': target,
                    'proxies': proxylist,
                },
            }
        except Exception as e:
            ret = {
                'status': 'failure',
                'target': target,
                'err': str(e),
            }

        self.set_header('Content-Type', 'application/json')

        self.write(json.dumps(ret))


class MainHandler(tornado.web.RequestHandler):
    """Handle requests that match no other route."""
    def get(self):
        self.write('Please refer to the API doc.')

    def post(self):
        self.write('Please refer to the API doc.')


app = tornado.web.Application([
    (r'/proxylist', ProxyListHandler),
    (r'.*', MainHandler),
])


if __name__ == '__main__':
    app.listen(8001)
    tornado.ioloop.IOLoop.instance().start()
--------------------------------------------------------------------------------
/handlers/handler_8002.py:
--------------------------------------------------------------------------------
#!/usr/bin/env python3
# -*- coding: utf-8 -*-

# -----------------------------------------
# Author: zhangyifei
# Date: 2014-03-04 21:18:55
# -----------------------------------------

"""Handle requests forwarded by nginx and return the matching proxy list.
"""

import os
import sys
import json

import tornado.ioloop
import tornado.web

# the project root (one level up) must be importable when this file is run
# from inside handlers/, as the Readme suggests
sys.path.insert(0, os.path.dirname(os.path.dirname(os.path.abspath(__file__))))

from proxypool import ProxyPool


class ProxyListHandler(tornado.web.RequestHandler):
    """Return the proxy list the client asked for, as JSON.
    Example:
    + success
        {
            'status': 'success',
            'proxylist': {
                'num': 5,
                'mtime': 1394069326,
                'proxies': [
                    'http://220.248.180.149:3128',
                    'http://61.55.141.11:81',
                    ......,
                ],
            },
        }
        - if the database holds a proxy list for the requested site and it is
          returned correctly, the status is 'success'
        - if the database holds no proxy list for the requested site but a
          list is returned anyway, the status is 'success-partial', meaning
          the returned proxies may only partially work for that site
    + failure
        {
            'status': 'failure',
            'err': 'reason for the failure',
        }
    """
    def get(self):
        self.write('Please refer to the API doc.')

    def post(self):
        target = self.get_argument('target', default='') or 'all'
        num = int(self.get_argument('num', default='') or 5)
        delay = int(self.get_argument('delay', default='') or 10)

        proxypool = ProxyPool()

        try:
            proxies = proxypool.get_many(target=target, num=num, maxscore=delay)
            num_ret = len(proxies)
            mtime = proxypool.get_mtime(target=target)

            proxylist = []
            for proxy in proxies:
                proxylist.append(proxy.decode('utf-8'))

            if str(target).upper() in proxypool.targets:
                status = 'success'
            else:
                status = 'success-partial'

            ret = {
                'status': status,
                'proxylist': {
                    'num': num_ret,
                    'mtime': mtime,
                    'target': target,
                    'proxies': proxylist,
                },
            }
        except Exception as e:
            ret = {
                'status': 'failure',
                'target': target,
                'err': str(e),
            }

        self.set_header('Content-Type', 'application/json')

        self.write(json.dumps(ret))


class MainHandler(tornado.web.RequestHandler):
    """Handle requests that match no other route."""
    def get(self):
        self.write('Please refer to the API doc.')

    def post(self):
        self.write('Please refer to the API doc.')


app = tornado.web.Application([
    (r'/proxylist', ProxyListHandler),
    (r'.*', MainHandler),
])


if __name__ == '__main__':
    app.listen(8002)
    tornado.ioloop.IOLoop.instance().start()
--------------------------------------------------------------------------------
/handlers/handler_8003.py:
--------------------------------------------------------------------------------
#!/usr/bin/env python3
# -*- coding: utf-8 -*-

# -----------------------------------------
# Author: zhangyifei
# Date: 2014-03-04 21:18:55
# -----------------------------------------

"""Handle requests forwarded by nginx and return the matching proxy list.
"""

import os
import sys
import json

import tornado.ioloop
import tornado.web

# the project root (one level up) must be importable when this file is run
# from inside handlers/, as the Readme suggests
sys.path.insert(0, os.path.dirname(os.path.dirname(os.path.abspath(__file__))))

from proxypool import ProxyPool


class ProxyListHandler(tornado.web.RequestHandler):
    """Return the proxy list the client asked for, as JSON.
    Example:
    + success
        {
            'status': 'success',
            'proxylist': {
                'num': 5,
                'mtime': 1394069326,
                'proxies': [
                    'http://220.248.180.149:3128',
                    'http://61.55.141.11:81',
                    ......,
                ],
            },
        }
        - if the database holds a proxy list for the requested site and it is
          returned correctly, the status is 'success'
        - if the database holds no proxy list for the requested site but a
          list is returned anyway, the status is 'success-partial', meaning
          the returned proxies may only partially work for that site
    + failure
        {
            'status': 'failure',
            'err': 'reason for the failure',
        }
    """
    def get(self):
        self.write('Please refer to the API doc.')

    def post(self):
        target = self.get_argument('target', default='') or 'all'
        num = int(self.get_argument('num', default='') or 5)
        delay = int(self.get_argument('delay', default='') or 10)

        proxypool = ProxyPool()

        try:
            proxies = proxypool.get_many(target=target, num=num, maxscore=delay)
            num_ret = len(proxies)
            mtime = proxypool.get_mtime(target=target)

            proxylist = []
            for proxy in proxies:
                proxylist.append(proxy.decode('utf-8'))

            if str(target).upper() in proxypool.targets:
                status = 'success'
            else:
                status = 'success-partial'

            ret = {
                'status': status,
                'proxylist': {
                    'num': num_ret,
                    'mtime': mtime,
                    'target': target,
                    'proxies': proxylist,
                },
            }
        except Exception as e:
            ret = {
                'status': 'failure',
                'target': target,
                'err': str(e),
            }

        self.set_header('Content-Type', 'application/json')

        self.write(json.dumps(ret))


class MainHandler(tornado.web.RequestHandler):
    """Handle requests that match no other route."""
    def get(self):
        self.write('Please refer to the API doc.')

    def post(self):
        self.write('Please refer to the API doc.')


app = tornado.web.Application([
    (r'/proxylist', ProxyListHandler),
    (r'.*', MainHandler),
])


if __name__ == '__main__':
    app.listen(8003)
    tornado.ioloop.IOLoop.instance().start()
--------------------------------------------------------------------------------
/handlers/handler_8004.py:
--------------------------------------------------------------------------------
#!/usr/bin/env python3
# -*- coding: utf-8 -*-

# -----------------------------------------
# Author: zhangyifei
# Date: 2014-03-04 21:18:55
# -----------------------------------------

"""Handle requests forwarded by nginx and return the matching proxy list.
"""

import os
import sys
import json

import tornado.ioloop
import tornado.web

# the project root (one level up) must be importable when this file is run
# from inside handlers/, as the Readme suggests
sys.path.insert(0, os.path.dirname(os.path.dirname(os.path.abspath(__file__))))

from proxypool import ProxyPool


class ProxyListHandler(tornado.web.RequestHandler):
    """Return the proxy list the client asked for, as JSON.
    Example:
    + success
        {
            'status': 'success',
            'proxylist': {
                'num': 5,
                'mtime': 1394069326,
                'proxies': [
                    'http://220.248.180.149:3128',
                    'http://61.55.141.11:81',
                    ......,
                ],
            },
        }
        - if the database holds a proxy list for the requested site and it is
          returned correctly, the status is 'success'
        - if the database holds no proxy list for the requested site but a
          list is returned anyway, the status is 'success-partial', meaning
          the returned proxies may only partially work for that site
    + failure
        {
            'status': 'failure',
            'err': 'reason for the failure',
        }
    """
    def get(self):
        self.write('Please refer to the API doc.')

    def post(self):
        target = self.get_argument('target', default='') or 'all'
        num = int(self.get_argument('num', default='') or 5)
        delay = int(self.get_argument('delay', default='') or 10)

        proxypool = ProxyPool()

        try:
            proxies = proxypool.get_many(target=target, num=num, maxscore=delay)
            num_ret = len(proxies)
            mtime = proxypool.get_mtime(target=target)

            proxylist = []
            for proxy in proxies:
                proxylist.append(proxy.decode('utf-8'))

            if str(target).upper() in proxypool.targets:
                status = 'success'
            else:
                status = 'success-partial'

            ret = {
                'status': status,
                'proxylist': {
                    'num': num_ret,
                    'mtime': mtime,
                    'target': target,
                    'proxies': proxylist,
                },
            }
        except Exception as e:
            ret = {
                'status': 'failure',
                'target': target,
                'err': str(e),
            }

        self.set_header('Content-Type', 'application/json')

        self.write(json.dumps(ret))


class MainHandler(tornado.web.RequestHandler):
    """Handle requests that match no other route."""
    def get(self):
        self.write('Please refer to the API doc.')

    def post(self):
        self.write('Please refer to the API doc.')


app = tornado.web.Application([
    (r'/proxylist', ProxyListHandler),
    (r'.*', MainHandler),
])


if __name__ == '__main__':
    app.listen(8004)
    tornado.ioloop.IOLoop.instance().start()
--------------------------------------------------------------------------------
/handlers/handler_template.py:
--------------------------------------------------------------------------------
#!/usr/bin/env python3
# -*- coding: utf-8 -*-

# -----------------------------------------
# Author: zhangyifei
# Date: 2014-03-04 21:18:55
# -----------------------------------------

"""Handle requests forwarded by nginx and return the matching proxy list.
"""

import os
import sys
import json

import tornado.ioloop
import tornado.web

# the project root (one level up) must be importable when this file is run
# from inside handlers/, as the Readme suggests
sys.path.insert(0, os.path.dirname(os.path.dirname(os.path.abspath(__file__))))

from proxypool import ProxyPool


class ProxyListHandler(tornado.web.RequestHandler):
    """Return the proxy list the client asked for, as JSON.
    Example:
    + success
        {
            'status': 'success',
            'proxylist': {
                'num': 5,
                'mtime': 1394069326,
                'proxies': [
                    'http://220.248.180.149:3128',
                    'http://61.55.141.11:81',
                    ......,
                ],
            },
        }
        - if the database holds a proxy list for the requested site and it is
          returned correctly, the status is 'success'
        - if the database holds no proxy list for the requested site but a
          list is returned anyway, the status is 'success-partial', meaning
          the returned proxies may only partially work for that site
    + failure
        {
            'status': 'failure',
            'err': 'reason for the failure',
        }
    """
    def get(self):
        self.write('Please refer to the API doc.')

    def post(self):
        target = self.get_argument('target', default='') or 'all'
        num = int(self.get_argument('num', default='') or 5)
        delay = int(self.get_argument('delay', default='') or 10)

        proxypool = ProxyPool()

        try:
            proxies = proxypool.get_many(target=target, num=num, maxscore=delay)
            num_ret = len(proxies)
            mtime = proxypool.get_mtime(target=target)

            proxylist = []
            for proxy in proxies:
                proxylist.append(proxy.decode('utf-8'))

            if str(target).upper() in proxypool.targets:
                status = 'success'
            else:
                status = 'success-partial'

            ret = {
                'status': status,
                'proxylist': {
                    'num': num_ret,
                    'mtime': mtime,
                    'target': target,
                    'proxies': proxylist,
                },
            }
        except Exception as e:
            ret = {
                'status': 'failure',
                'target': target,
                'err': str(e),
            }

        self.set_header('Content-Type', 'application/json')

        self.write(json.dumps(ret))


class MainHandler(tornado.web.RequestHandler):
    """Handle requests that match no other route."""
    def get(self):
        self.write('Please refer to the API doc.')

    def post(self):
        self.write('Please refer to the API doc.')


app = tornado.web.Application([
    (r'/proxylist', ProxyListHandler),
    (r'.*', MainHandler),
])


if __name__ == '__main__':
    app.listen(8000)
    tornado.ioloop.IOLoop.instance().start()
--------------------------------------------------------------------------------
/proxypool.py:
--------------------------------------------------------------------------------
#!/usr/bin/env python3
# -*- coding: utf-8 -*-

# -----------------------------------------
# Author: zhangyifei
# Date: 2014-03-04
# -----------------------------------------

"""Proxy pool service.

NOTE:
+ currently serves anonymous http proxies only
+ pass str into etree.HTML(), not bytes, e.g.:
  - etree.HTML(requests.get('http://www.baidu.com').text)
+ redis is single-threaded and the client is thread-safe, so no locking is
  needed when several threads operate on it

TODO:

Thinking:
+ gevent does not support python3.x (at the time of writing), so it cannot be used
+ asyncio offers coroutines, but there is no http library built on it yet,
  so it does not help here either
+ for now, concurrency is handled with threads

!!!:
+ redis is reached over a socket, so the limit on open file descriptors
  applies; two workarounds:
  - raise the limit with `ulimit -n <number>`
  - sleep a few random seconds, then reconnect to redis
+ 58 and ganji ban proxies aggressively
"""

import os
import sys
import time
import random
import pprint
import logging
import threading
import concurrent.futures
from concurrent.futures import ThreadPoolExecutor

import requests
import yaml
import redis
from lxml import etree

logging.basicConfig(level=logging.INFO,
                    format='[%(asctime)s][%(levelname)s] %(message)s',
                    datefmt='%Y-%m-%d %H:%M:%S')


class ProxyPool(object):
    """The proxy pool.
    """
    def __init__(self, configfile='settings.yaml'):
        self.configs = self._get_configs(configfile)

        self.rdb = redis.StrictRedis(db=self.configs['STORE']['RDB'])
        self.try_times_db = self.configs['STORE']['TRY']
        self.try_time_wait = self.configs['STORE']['TIME_WAIT']
        self.sproxy_all = self.configs['STORE']['SPROXY_ALL']
        self.sproxy_anon = self.configs['STORE']['SPROXY_ANON']
        self.init_value = self.configs['VALIDATE']['INIT_VALUE']
        self.timeout_valid = self.configs['VALIDATE']['TIMEOUT_VALID']
        self.time_exception = self.configs['VALIDATE']['TIME_EXCEPTION']
        self.targets = list(self.configs['TARGET'].keys())
        self.url_reflect = self.configs['URL']['REFLECT']

        self.tnum_proxy_getter = self.configs['CONCURRENT']['PROXY_GETTER']
        self.tnum_proxy_filter = self.configs['CONCURRENT']['PROXY_FILTER']
        self.tnum_proxy_valid = self.configs['CONCURRENT']['PROXY_VALID']

    def _get_configs(self, configfile):
        """Return the configuration dict"""
        # resolve the config file relative to this module, so the handlers
        # can be started from the handlers/ directory as well
        configpath = os.path.join(os.path.dirname(os.path.abspath(__file__)),
                                  configfile)
        with open(configpath, 'rb') as fp:
            configs = yaml.safe_load(fp)

        return configs

    def get_mtime(self, target='all'):
        """Return the time the proxy list for `target` was last updated."""
        target = str(target).upper()
        if target not in self.targets:
            target = 'ALL'

        db_mtime = self.configs['TARGET'][target]['DB_MTIME']
        mtime = self.rdb.get(db_mtime)

        return int(mtime)

    def get_many(self, target='all', num=10, minscore=0, maxscore=None):
        """
        Return a list of at most 'num' proxies whose scores are between
        'minscore' and 'maxscore'.
        If no proxy matches, return an empty list.
        """
        # XXX: the current strategy fetches every matching proxy from the
        # database and only then cuts the list down to the requested number.
        # There is room for improvement here.
        target = str(target).upper()
        if target not in self.targets:
            target = 'ALL'

        db = self.configs['TARGET'][target]['DB_PROXY']
        maxscore = maxscore or self.init_value
        res = self.rdb.zrangebyscore(db, minscore, maxscore)
        if res:
            random.shuffle(res)  # for getting random results
            if len(res) < num:
                logging.warning("Only %d proxies match, fewer than the %d requested"
                                % (len(res), num))
            return res[:num]
        else:
            logging.warning("There are no proxies with scores between %d and %d"
                            % (minscore, maxscore))
            return []

    def get_one(self, target='all', minscore=0, maxscore=None):
        """
        Return one proxy whose score is between 'minscore' and 'maxscore'.
        If no proxy matches, return an empty string.
        """
        maxscore = maxscore or self.init_value
        res = self.get_many(target=target, num=1, minscore=minscore, maxscore=maxscore)

        if res:
            return res[0]
        else:
            return ''

    def fetch_proxies(self):
        """Get proxies from various sources."""
        # from web pages
        self._crawl_proxies_sites()

    def _crawl_proxies_sites(self):
        """Get proxies from web pages."""
        with ThreadPoolExecutor(max_workers=self.tnum_proxy_getter) as executor:
            for url, val in self.configs['PROXY_SITES'].items():
                executor.submit(self._crawl_proxies_one_site, url, val['rules'], val['proxies'])

    def _crawl_proxies_one_site(self, url=None, rules=None, proxies=None):
        # Get proxies (ip:port) from url and then write them into redis.
        # XXX: fetching and parsing should be separated
        headers = self.configs['CRAWL']['HEADERS']
        logging.info('Begin crawl page %s' % (url,))

        res = requests.get(url, headers=headers, proxies=proxies)
        html = etree.HTML(res.text)
        found = []  # don't reuse `proxies`, which holds the request proxies
        len_rules = len(rules)

        nodes = html.xpath(rules[0])
        if nodes:
            if len_rules == 1:
                for node in nodes:
                    text = node.text.strip()
                    if text:
                        found.append('http://%s' % (text,))
            elif len_rules == 2:
                rule_1 = rules[1].split(',')
                rule_1_len = len(rule_1)

                if rule_1_len == 3:
                    for node in nodes:
                        try:
                            cells = node.xpath(rule_1[0])
                            # rule_1[1] and rule_1[2] are the column indices of
                            # ip and port (previously hardcoded as 1 and 2,
                            # which ignored the per-site settings)
                            ip = cells[int(rule_1[1])].text.strip()
                            port = cells[int(rule_1[2])].text.strip() or '80'
                            if ip:
                                found.append('http://%s:%s' % (ip, port))
                        except Exception as e:
                            logging.error('Error when parsing %s: %r' % (url, e))
                elif rule_1_len == 4:
                    for node in nodes:
                        try:
                            ip = node.xpath(rule_1[0])[0].text.strip()
                            port = node.xpath(rule_1[1])[0].text.strip()
                            anon_label = node.xpath(rule_1[2])[0].text.strip()
                            site_tag = rule_1[-1]

                            # compare str with str; the old code encoded one
                            # side to bytes, so the comparison never matched
                            if site_tag == 'proxy360' and ip and anon_label == '高匿':
                                found.append('http://%s:%s' % (ip, port))
                            elif site_tag == 'cn-proxy' and ip and anon_label == '高度匿名':
                                found.append('http://%s:%s' % (ip, port))
                        except Exception as e:
                            logging.error('Error when parsing %s: %r' % (url, e))

        for proxy in found:
            logging.info('Got proxy %s from %s' % (proxy, url))
            self.rdb.sadd(self.sproxy_all, proxy)
    def get_ip_local(self):
        # Get this machine's public ip, trying at most `try_times` times. If
        # every attempt fails, abort the whole program: without the local ip
        # we cannot tell whether a proxy is actually anonymous.
        timeout = self.configs['LOCAL_IP']['TIMEOUT']
        try_times = self.configs['LOCAL_IP']['TRY']
        for times_try in range(try_times):
            try:
                headers = self.configs['CRAWL']['HEADERS']
                res = requests.get(self.url_reflect, headers=headers, timeout=timeout)

                # http://httpbin.org/ip answers {"origin": "<ip>"}
                ip_local = res.json()['origin']

                return ip_local
            except Exception:
                logging.debug('Times of trying to get local ip: %d' % (times_try+1,))
        else:
            # all attempts failed
            sys.exit("Fatal error: couldn't get the local ip.")

    def filter_anony(self):
        # check whether each crawled proxy is anonymous
        # XXX: decide whether the loop below is needed
        # while True:
        #     # wait until the crawler threads have finished
        #     num_active = threading.active_count()
        #     if num_active == 1:
        #         break
        #     else:
        #         logging.debug('Number of active: %d' % (num_active,))
        #         time.sleep(1)

        self.ip_local = self.get_ip_local()
        proxies = self.rdb.smembers(self.sproxy_all)
        self._filter_anony(proxies)

    def _filter_anony(self, proxies):
        # pick the anonymous proxies out of `proxies`, whose members look
        # like b'http://ip:port'
        with ThreadPoolExecutor(max_workers=self.tnum_proxy_filter) as executor:
            for proxy in proxies:
                executor.submit(self._valid_anony, proxy)

    def _valid_anony(self, proxy):
        # Check whether `proxy` ('http://ip:port') is an anonymous http proxy.
        # Strategy:
        # + if it is anonymous, add it to sproxy_anon
        # + if not, remove it from sproxy_anon (removing a missing member is a no-op)
        proxies = {
            'http': proxy.decode('utf-8'),
        }

        try:
            headers = self.configs['CRAWL']['HEADERS']
            res = requests.get(self.url_reflect, headers=headers, proxies=proxies, timeout=10)
        except Exception as e:
            logging.error('Error when validating anonymous: %r' % (e,))
            return

        if self.ip_local not in res.text:
            logging.info('Anonymous: %s' % (proxy,))
            self.rdb.sadd(self.sproxy_anon, proxy)
        else:
            logging.info('NON-Anonymous: %s' % (proxy,))
            self.rdb.srem(self.sproxy_anon, proxy)

    def valid_active(self):
        # measure the usability of every anonymous proxy
        proxies = self.rdb.smembers(self.sproxy_anon)

        with ThreadPoolExecutor(max_workers=self.tnum_proxy_valid) as executor:
            for target in self.targets:
                for proxy in proxies:
                    executor.submit(self._efficiency_proxy, proxy, target)

    def _efficiency_proxy(self, proxy, target):
        # Judge whether an anonymous proxy is alive by timing requests to the
        # configured target sites through it.
        # XXX: the sites are currently visited one after another; consider
        # making these requests concurrent
        target = str(target).upper()
        test_site = self.configs['TARGET'][target]['URL']
        db_proxy = self.configs['TARGET'][target]['DB_PROXY']
        db_mtime = self.configs['TARGET'][target]['DB_MTIME']
        validate = self.configs['TARGET'][target]['VALIDATE']

        proxy = proxy.decode('utf-8')
        time_delay = self._timing_proxy(proxy, test_site, val=validate)
        mtime = int(time.time())

        try_times = 0
        while True:
            # try writing to redis, reconnecting up to self.try_times_db times
            try:
                # NOTE: redis-py 2.x signature, as pinned in requirements.txt;
                # redis-py >= 3 expects zadd(name, {member: score}) instead
                self.rdb.zadd(db_proxy, time_delay, proxy)
                self.rdb.set(db_mtime, mtime)

                logging.info('Have validated %s' % (proxy,))

                break
            except Exception as e:
                try_times += 1
                logging.info('Tried %d' % (try_times,))
                if try_times >= self.try_times_db:
                    logging.error(e)
                    break
                time.sleep(random.randint(0, self.try_time_wait))
                self.rdb = redis.StrictRedis(db=self.configs['STORE']['RDB'])

    def _timing_proxy(self, proxy, site, val):
        # measure how long fetching `site` through `proxy` takes
        time_start = time.time()

        try:
            # the old code omitted proxies=..., so it timed a direct request
            # rather than one through the proxy under test
            res = requests.get(site, proxies={'http': proxy},
                               timeout=self.timeout_valid)
            html = etree.HTML(res.text)
            title = html.xpath("/html/head/title")[0].text

            # the title must contain the marker configured under VALIDATE in
            # settings.yaml (presumably the intent of the unused `val`),
            # otherwise the proxy served a bogus page
            if res.status_code == 200 and val in title:
                time_end = time.time()
            else:
                logging.error('Error when validating %s' % (proxy,))
                time_end = -1
        except Exception as e:
            logging.error('Error when validating %s: %r' % (proxy, e))
            time_end = -1

        if time_end == -1:
            time_interval = self.time_exception
        else:
            time_interval = time_end - time_start

        return time_interval


if __name__ == '__main__':
    proxypool = ProxyPool()
    proxypool.fetch_proxies()   # crawl proxies
    proxypool.filter_anony()    # pick out the anonymous ones
    proxypool.valid_active()    # measure their usability
    # pprint.pprint(proxypool.get_many(num=3, maxscore=10, target='58'))
    # pprint.pprint(proxypool.get_many(num=3, maxscore=10, target=58))
    # pprint.pprint(proxypool.get_many(num=3, maxscore=10, target='baixing'))
    # pprint.pprint(proxypool.get_many())
--------------------------------------------------------------------------------
/requirements.txt:
--------------------------------------------------------------------------------
lxml==3.3.2
requests==2.2.1
pyaml==13.12.0
redis==2.9.1
tornado==3.2
--------------------------------------------------------------------------------
/settings.yaml:
--------------------------------------------------------------------------------
STORE:
  RDB: 3                     # proxies live in redis db 3
  TRY: 3                     # retries when connecting to redis
  TIME_WAIT: 5
  SPROXY_ALL: sproxy_all     # set of all crawled proxies, keyed by proxy
  SPROXY_ANON: sproxy_anon   # set of anonymous proxies, keyed by proxy

TARGET:
  ALL:
    URL: http://www.baidu.com
    DB_PROXY: zproxy_all
    DB_MTIME: mtime_all      # time of last update
    VALIDATE: '百度'
  '58':
    URL: http://www.58.com
    DB_PROXY: zproxy_58
    DB_MTIME: mtime_58
    VALIDATE: '58同城'
  GANJI:
    URL: http://www.ganji.com
    DB_PROXY: zproxy_ganji
    DB_MTIME: mtime_ganji
    VALIDATE: '赶集'
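  # Hypothetical example of tracking one more site -- add an entry of the
  # same shape under TARGET (the names below are made up, not part of the
  # project):
  #
  # BAIXING:
  #   URL: http://www.baixing.com
  #   DB_PROXY: zproxy_baixing
  #   DB_MTIME: mtime_baixing
  #   VALIDATE: '百姓网'      # string the target page's title must contain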

URL:
  REFLECT: http://httpbin.org/ip

CRAWL:
  HEADERS:
    Accept: text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8
    Accept-Encoding: gzip,deflate,sdch
    Accept-Language: en-US,en;q=0.8
    Connection: keep-alive
    Referer: http://www.baidu.com
    User-Agent: Mozilla/5.0 (X11; Linux i686) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/30.0.1599.114 Safari/537.36


CONCURRENT:
  PROXY_GETTER: 100   # max threads crawling proxies from the web
  PROXY_FILTER: 100   # max threads filtering for anonymous proxies
  PROXY_VALID: 10     # max threads validating proxy usability

PROXY_SITES:
  http://www.site-digger.com/html/articles/20110516/proxieslist.html:
    rules:
      - //table[@id='proxies_table']/tbody/tr/td[1]
    proxies:
      http: ''
      https: ''

  http://pachong.org/area/short/name/cn.html:
    rules:
      - //tr[@data-type='anonymous'] | //tr[@data-type='high']
      - .//td,1,2
    proxies:            # proxies used to reach this site itself, if it is blocked locally
      http: ''
      https: ''

  http://proxy.com.ru/gaoni:
    freq: 600           # refresh interval, in seconds
    rules:
      - //table[@bordercolor='#CCCCCC'][last()-1]//tr[position()>1]   # nodes holding ip and port
      - ./td,2,3        # columns for ip and port
    proxies:
      http: ''
      https: ''

  http://proxy.com.ru/niming:
    freq: 600           # refresh interval, in seconds
    rules:
      - //table[@bordercolor='#CCCCCC'][last()-1]//tr[position()>1]   # nodes holding ip and port
      - ./td,1,2        # columns for ip and port
    proxies:
      http: ''
      https: ''

  http://www.proxy360.cn/Proxy:
    freq: -1            # refresh interval unknown
    rules:
      - //div[@name='list_proxy_ip']/div[1]          # nodes holding ip and port
      - ./span[1],./span[2],./span[3],proxy360       # ip, port, anonymity flag
    proxies:
      http: ''
      https: ''

  http://cn-proxy.com/archives/218:
    freq: -1            # refresh interval unknown
    rules:
      - //table/tbody/tr                             # nodes holding ip and port
      - ./td[1],./td[2],./td[3],cn-proxy             # ip, port, anonymity flag
    proxies:
      http: ''
      https: ''

  # http://www.cnproxy.com/proxy1.html:
  #   freq: -1
  #   rules:
  #     - //table[last()]/tr[position()>1]           # nodes holding ip and port
  #     - ./td,

VALIDATE:
  INIT_VALUE: 20
  TIMEOUT_VALID: 10
  TIME_EXCEPTION: 100000

LOCAL_IP:
  TIMEOUT: 10
  TRY: 3
--------------------------------------------------------------------------------