├── README.MD ├── __pycache__ ├── config.cpython-35.pyc └── config.cpython-36.pyc ├── assess_logger.log ├── assess_quality.py ├── config.py ├── crawl_demo.py ├── images └── data.png └── ip_pool.py /README.MD: -------------------------------------------------------------------------------- 1 | 一个小巧的代理ip抓取+评估+存储一体化的工具,使用requests+mysql完成。 2 | 3 | ### 简介 4 | 5 | 程序有如下几个功能: 6 | 7 | * 每天从多个代理ip网站上抓下最新高匿ip数据。 8 | * 经过筛选后的ip将存入数据库。 9 | * 存入数据库的ip每天也要经过测试,存在剔除、评分机制,多次不合格的ip将被删除,每个ip都被评分,我们最终可以按得分排名获得稳定、低响应时间的优质ip。 10 | 11 | 存储如下图所示: 12 | ![ip在数据库中的存储结构](https://github.com/fancoo/Proxy/blob/master/images/data.png) 13 | 14 | 15 | ### 参数 16 | ```python 17 | USELESS_TIME = 4 # 最大失效次数 18 | SUCCESS_RATE = 0.8 19 | TIME_OUT_PENALTY = 10 # 超时惩罚时间 20 | CHECK_TIME_INTERVAL = 24*3600 # 每天更新一次 21 | ``` 22 | 除数据库配置参数外,主要用到的几个参数说明如下: 23 | 24 | * ```USELESS_TIME```和```SUCCESS_RATE```是配合使用的,当某个```ip```的```USELESS_TIME < 4 && SUCCESS_RATE < 0.8```时(同时兼顾到ip短期和长期的检测表现),则剔除该ip。 25 | * ```TIME_OUT_PENALTY```, 当某个ip在某次检测时失效,而又没有达到上一条的条件时(比如检测了100次后第一次出现超时),设置一个```response_time```的惩罚项,此处为10秒。 26 | * ```CHECK_TIME_INTERVAL```, 检测周期。此处设置为每隔12小时检测一次数据库里每一个ip的可用性。 27 | 28 | 29 | ### 策略 30 | 31 | * 每天如下5个代理ip网站上抓下最新高匿ip数据: 32 | * ```mimi``` 33 | * ```66ip``` 34 | * ```xici``` 35 | * ```cn-proxy``` 36 | * ```kuaidaili``` 37 | * N轮筛选 38 | * 收集到的ip集合将经过N轮,间隔为t的连接测试,对于每一个ip,必须全部通过这N轮测试才能最终进入数据库。如果当天进入数据库的ip较少,则暂停一段时间(一天)再抓。 39 | 40 | * 数据库中ip评价准则 41 | * 检测过程中累计超时次数>```USELESS_TIME```&&成功率<```SUCCESS_RATE```就被剔除。 42 | ```score = (success_rate + test_times / 500) / avg_response_time``` 43 | 原来的考虑是```score = success_rate / avg_response_time```, 即:评分=成功率/平均响应时间, 考虑到检测合格过100次的老ip比新ip更有价值,检测次数也被引入评分。 44 | 45 | ### 使用 46 | 有`ip_pool.py`和`assess_quality.py`两个程序,前者负责每天抓ip存进数据库,后者负责数据库中ip的清理和评估。 47 | 48 | #### 运行程序 49 | ```bash 50 | python ip_pool.py 51 | # 等待上述程序抓取完结果后再运行评测程序 52 | python assess_quality.py 53 | ``` 54 | 之后按默认配置,这两个程序每天分别执行抓取和评估工作。 55 | 56 | #### 使用代理Demo 57 | 以抓取[豆瓣](https://www.douban.com/)主页为例: 58 | ```python 59 | # 访问数据库拿到代理 60 | ip_list = [] 61 | try: 62 | cursor.execute('SELECT content FROM %s' % cfg.TABLE_NAME) 63 | result = cursor.fetchall() 64 | for i in result: 65 | ip_list.append(i[0]) 66 | except Exception as e: 67 | print e 68 | finally: 69 | cursor.close() 70 | conn.close() 71 | 72 | # 利用代理抓取豆瓣页面 73 | for i in ip_list: 74 | proxy = {'http': 'http://'+i} 75 | url = "https://www.douban.com/" 76 | r = requests.get(url, proxies=proxy, timeout=4) 77 | print r.text 78 | ``` 79 | 详见[crawl_demo.py](https://github.com/fancoo/Proxy/blob/master/crawl_demo.py)。 80 | 81 | ### 问题反馈 82 | 联系邮箱:myfancoo@qq.com 欢迎指正。 83 | -------------------------------------------------------------------------------- /__pycache__/config.cpython-35.pyc: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/chungminglu/Proxy/cb658fc92310a140d92dbfd503a432f310bed045/__pycache__/config.cpython-35.pyc -------------------------------------------------------------------------------- /__pycache__/config.cpython-36.pyc: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/chungminglu/Proxy/cb658fc92310a140d92dbfd503a432f310bed045/__pycache__/config.cpython-36.pyc -------------------------------------------------------------------------------- /assess_logger.log: -------------------------------------------------------------------------------- 1 | WARNING:root:2017-08-15 12:20:34: >>>> 1 round! <<<< 2 | WARNING:root:2017-08-15 12:22:09: >>>> 1 round! <<<< 3 | WARNING:root:2017-08-15 12:27:57: >>>> 1 round! <<<< 4 | -------------------------------------------------------------------------------- /assess_quality.py: -------------------------------------------------------------------------------- 1 | #!/user/bin/env python 2 | # -*- coding:utf-8 -*- 3 | # 4 | # @author Ringo 5 | # @email myfancoo@qq.com 6 | # @date 2016/10/12 7 | # 8 | 9 | import requests 10 | import time 11 | import datetime 12 | import logging 13 | import pymysql as mdb 14 | import config as cfg 15 | 16 | log_file = 'assess_logger.log' 17 | logging.basicConfig(filename=log_file, level=logging.WARNING) 18 | 19 | TEST_ROUND_COUNT = 0 20 | 21 | 22 | def modify_score(ip, success, response_time): 23 | # type = 0 means ip hasn't pass the test 24 | 25 | # 连接数据库 26 | conn = mdb.connect(cfg.host, cfg.user, cfg.passwd, cfg.DB_NAME) 27 | cursor = conn.cursor() 28 | 29 | # ip超时 30 | if success == 0: 31 | logging.warning(datetime.datetime.now().strftime("%Y-%m-%d %H:%M:%S")+": " + ip + \ 32 | " out of time") 33 | try: 34 | cursor.execute('SELECT * FROM %s WHERE content= "%s"' % (cfg.TABLE_NAME, ip)) 35 | q_result = cursor.fetchall() 36 | for r in q_result: 37 | test_times = r[1] + 1 38 | failure_times = r[2] 39 | success_rate = r[3] 40 | avg_response_time = r[4] 41 | 42 | # 超时达到4次且成功率低于标准 43 | if failure_times > 4 and success_rate < cfg.SUCCESS_RATE: 44 | cursor.execute('DELETE FROM %s WHERE content= "%s"' % (cfg.TABLE_NAME, ip)) 45 | conn.commit() 46 | logging.warning(datetime.datetime.now().strftime("%Y-%m-%d %H:%M:%S")+": " + ip + \ 47 | " was deleted.") 48 | else: 49 | # not too bad 50 | failure_times += 1 51 | success_rate = 1 - float(failure_times) / test_times 52 | avg_response_time = (avg_response_time * (test_times - 1) + cfg.TIME_OUT_PENALTY) / test_times 53 | score = (success_rate + float(test_times) / 500) / avg_response_time 54 | n = cursor.execute('UPDATE %s SET test_times = %d, failure_times = %d, success_rate = %.2f, avg_response_time = %.2f, score = %.2f WHERE content = "%s"' % (TABLE_NAME, test_times, failure_times, success_rate, avg_response_time, score, ip)) 55 | conn.commit() 56 | if n: 57 | logging.error(datetime.datetime.now().strftime("%Y-%m-%d %H:%M:%S")+": " + \ 58 | ip + ' has been modify successfully!') 59 | break 60 | except Exception as e: 61 | logging.error(datetime.datetime.now().strftime("%Y-%m-%d %H:%M:%S")+": " + \ 62 | 'Error when try to delete ' + ip + str(e)) 63 | finally: 64 | cursor.close() 65 | conn.close() 66 | elif success == 1: 67 | # pass the test 68 | try: 69 | cursor.execute('SELECT * FROM %s WHERE content= "%s"' % (cfg.TABLE_NAME, ip)) 70 | q_result = cursor.fetchall() 71 | for r in q_result: 72 | test_times = r[1] + 1 73 | failure_times = r[2] 74 | avg_response_time = r[4] 75 | success_rate = 1 - float(failure_times) / test_times 76 | avg_response_time = (avg_response_time * (test_times - 1) + response_time) / test_times 77 | score = (success_rate + float(test_times) / 500) / avg_response_time 78 | n = cursor.execute('UPDATE %s SET test_times = %d, success_rate = %.2f, avg_response_time = %.2f, score = %.2f WHERE content = "%s"' %(cfg.TABLE_NAME, test_times, success_rate, avg_response_time, score, ip)) 79 | conn.commit() 80 | if n: 81 | logging.error(datetime.datetime.now().strftime("%Y-%m-%d %H:%M:%S")+": " + \ 82 | ip + 'has been modify successfully!') 83 | break 84 | except Exception as e: 85 | logging.error(datetime.datetime.now().strftime("%Y-%m-%d %H:%M:%S")+": " + \ 86 | 'Error when try to modify ' + ip + str(e)) 87 | finally: 88 | cursor.close() 89 | conn.close() 90 | 91 | 92 | def ip_test(proxies, timeout): 93 | 94 | # 挨个检查代理是否可用 95 | url = 'https://www.baidu.com' 96 | 97 | for p in proxies: 98 | proxy = {'http': 'http://'+p} 99 | try: 100 | # 请求开始时间 101 | start = time.time() 102 | r = requests.get(url, proxies=proxy, timeout=timeout) 103 | # 请求结束时间 104 | end = time.time() 105 | # 判断是否可用 106 | if r.text is not None: 107 | logging.warning(datetime.datetime.now().strftime("%Y-%m-%d %H:%M:%S")+": " + p + \ 108 | " out of time") 109 | resp_time = end -start 110 | modify_score(p, 1, resp_time) 111 | print ('Database test succeed: '+p+'\t'+str(resp_time)) 112 | except OSError: 113 | modify_score(p, 0, 0) 114 | 115 | 116 | def assess(): 117 | global TEST_ROUND_COUNT 118 | TEST_ROUND_COUNT += 1 119 | logging.warning(datetime.datetime.now().strftime("%Y-%m-%d %H:%M:%S")+": " + ">>>>\t" + str(TEST_ROUND_COUNT) + " round!\t<<<<") 120 | # 连接数据库 121 | conn = mdb.connect(cfg.host, cfg.user, cfg.passwd, cfg.DB_NAME) 122 | cursor = conn.cursor() 123 | 124 | try: 125 | cursor.execute('SELECT content FROM %s' % cfg.TABLE_NAME) 126 | result = cursor.fetchall() 127 | ip_list = [] 128 | for i in result: 129 | ip_list.append(i[0]) 130 | if len(ip_list) == 0: 131 | return 132 | ip_test(ip_list, cfg.timeout) 133 | except Exception as e: 134 | logging.warning(datetime.datetime.now().strftime("%Y-%m-%d %H:%M:%S")+": " + str(e)) 135 | finally: 136 | cursor.close() 137 | conn.close() 138 | 139 | 140 | def main(): 141 | while True: 142 | assess() 143 | # 每天定时 144 | time.sleep(cfg.CHECK_TIME_INTERVAL) 145 | 146 | if __name__ == '__main__': 147 | main() -------------------------------------------------------------------------------- /config.py: -------------------------------------------------------------------------------- 1 | # coding:utf-8 2 | 3 | # 从代理ip网站上总共要爬取的ip页数。一般每页20条,小项目(20-30个代理ip即可完成的)可以设置为1-2页。 4 | page_num = 3 5 | 6 | # 对已经检测成功的ip测试轮次。 7 | examine_round = 3 8 | 9 | # 超时时间。代理ip在测试过程中的超时时间。 10 | timeout = 2 11 | 12 | # 数据库链接地址 13 | host = '10.120.27.11' 14 | 15 | # 数据库链接端口 16 | port = 3306 17 | 18 | # 数据库链接用户名 19 | user = 'root' 20 | 21 | # 数据库密码 22 | passwd = 'kmkm0404' 23 | 24 | # 数据库名 25 | DB_NAME = 'proxies' 26 | 27 | # 表名 28 | TABLE_NAME = 'valid_ip' 29 | 30 | # 数据库字符 31 | charset = 'utf8' 32 | 33 | # 1个代理ip最大容忍失败次数,超过则从db中删去。 34 | USELESS_TIME = 4 35 | 36 | # 1个代理ip最小容忍成功率 37 | SUCCESS_RATE = 0.8 38 | 39 | # 超时惩罚时间 40 | TIME_OUT_PENALTY = 10 41 | 42 | # 每隔多久检测一次 43 | CHECK_TIME_INTERVAL = 24*3600 44 | -------------------------------------------------------------------------------- /crawl_demo.py: -------------------------------------------------------------------------------- 1 | import pymysql as mdb 2 | import config as cfg 3 | import requests 4 | 5 | conn = mdb.connect(cfg.host, cfg.user, cfg.passwd, cfg.DB_NAME) 6 | cursor = conn.cursor() 7 | 8 | ip_list = [] 9 | try: 10 | cursor.execute('SELECT content FROM %s' % cfg.TABLE_NAME) 11 | result = cursor.fetchall() 12 | for i in result: 13 | ip_list.append(i[0]) 14 | except Exception as e: 15 | print e 16 | finally: 17 | cursor.close() 18 | conn.close() 19 | 20 | for i in ip_list: 21 | proxy = {'http': 'http://'+i} 22 | url = "https://www.douban.com/" 23 | r = requests.get(url, proxies=proxy, timeout=4) 24 | print r.text -------------------------------------------------------------------------------- /images/data.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/chungminglu/Proxy/cb658fc92310a140d92dbfd503a432f310bed045/images/data.png -------------------------------------------------------------------------------- /ip_pool.py: -------------------------------------------------------------------------------- 1 | # coding:utf-8 2 | import time 3 | import config as cfg 4 | import requests 5 | from lxml import etree 6 | import pymysql as mdb 7 | import datetime 8 | 9 | 10 | class IPFactory: 11 | """ 12 | 代理ip抓取/评估/存储一体化。 13 | """ 14 | def __init__(self): 15 | self.page_num = cfg.page_num 16 | self.round = cfg.examine_round 17 | self.timeout = cfg.timeout 18 | self.all_ip = set() 19 | 20 | # 创建数据库 21 | self.create_db() 22 | 23 | # # 抓取全部ip 24 | # current_ips = self.get_all_ip() 25 | # # 获取有效ip 26 | # valid_ip = self.get_the_best(current_ips, self.timeout, self.round) 27 | # print valid_ip 28 | 29 | def create_db(self): 30 | """ 31 | 创建数据库用于保存有效ip 32 | """ 33 | # 创建数据库/表语句 34 | # 创建数据库 35 | drop_db_str = 'drop database if exists ' + cfg.DB_NAME + ' ;' 36 | create_db_str = 'create database ' + cfg.DB_NAME + ' ;' 37 | # 选择该数据库 38 | use_db_str = 'use ' + cfg.DB_NAME + ' ;' 39 | # 创建表格 40 | create_table_str = "CREATE TABLE " + cfg.TABLE_NAME + """( 41 | `content` varchar(30) NOT NULL, 42 | `test_times` int(5) NOT NULL DEFAULT '0', 43 | `failure_times` int(5) NOT NULL DEFAULT '0', 44 | `success_rate` float(5,2) NOT NULL DEFAULT '0.00', 45 | `avg_response_time` float NOT NULL DEFAULT '0', 46 | `score` float(5,2) NOT NULL DEFAULT '0.00' 47 | ) ENGINE=InnoDB DEFAULT CHARSET=utf8;""" 48 | 49 | # 连接数据库 50 | conn = mdb.connect(cfg.host, cfg.user, cfg.passwd) 51 | cursor = conn.cursor() 52 | try: 53 | cursor.execute(drop_db_str) 54 | cursor.execute(create_db_str) 55 | cursor.execute(use_db_str) 56 | cursor.execute(create_table_str) 57 | conn.commit() 58 | except OSError: 59 | print "无法创建数据库!" 60 | finally: 61 | cursor.close() 62 | conn.close() 63 | 64 | def get_content(self, url, url_xpath, port_xpath): 65 | """ 66 | 使用xpath解析网页内容,并返回ip列表。 67 | """ 68 | # 返回列表 69 | ip_list = [] 70 | 71 | try: 72 | # 设置请求头信息 73 | headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 6.3; Trident/7.0; rv:11.0) like Gecko'} 74 | 75 | # 获取页面数据 76 | results = requests.get(url, headers=headers, timeout=4) 77 | tree = etree.HTML(results.text) 78 | 79 | # 提取ip:port 80 | url_results = tree.xpath(url_xpath) 81 | port_results = tree.xpath(port_xpath) 82 | urls = [line.strip() for line in url_results] 83 | ports = [line.strip() for line in port_results] 84 | 85 | if len(urls) == len(ports): 86 | for i in range(len(urls)): 87 | # 匹配ip:port对 88 | full_ip = urls[i]+":"+ports[i] 89 | # 此处利用all_ip对过往爬取的ip做了记录,下次再爬时如果发现 90 | # 已经爬过,就不再加入ip列表。 91 | if full_ip in self.all_ip: 92 | continue 93 | # 存储 94 | ip_list.append(full_ip) 95 | except Exception as e: 96 | print 'get proxies error: ', e 97 | 98 | return ip_list 99 | 100 | def get_all_ip(self): 101 | """ 102 | 各大网站抓取的ip聚合。 103 | """ 104 | # 有2个概念:all_ip和current_all_ip。前者保存了历次抓取的ip,后者只保存本次的抓取。 105 | current_all_ip = set() 106 | 107 | ################################## 108 | # 66ip网 109 | ################################### 110 | url_xpath_66 = '/html/body/div[last()]//table//tr[position()>1]/td[1]/text()' 111 | port_xpath_66 = '/html/body/div[last()]//table//tr[position()>1]/td[2]/text()' 112 | for i in xrange(self.page_num): 113 | url_66 = 'http://www.66ip.cn/' + str(i+1) + '.html' 114 | results = self.get_content(url_66, url_xpath_66, port_xpath_66) 115 | self.all_ip.update(results) 116 | current_all_ip.update(results) 117 | # 停0.5s再抓取 118 | time.sleep(0.5) 119 | 120 | ################################## 121 | # xici代理 122 | ################################### 123 | url_xpath_xici = '//table[@id="ip_list"]//tr[position()>1]/td[position()=2]/text()' 124 | port_xpath_xici = '//table[@id="ip_list"]//tr[position()>1]/td[position()=3]/text()' 125 | for i in xrange(self.page_num): 126 | url_xici = 'http://www.xicidaili.com/nn/' + str(i+1) 127 | results = self.get_content(url_xici, url_xpath_xici, port_xpath_xici) 128 | self.all_ip.update(results) 129 | current_all_ip.update(results) 130 | time.sleep(0.5) 131 | 132 | ################################## 133 | # mimiip网 134 | ################################### 135 | url_xpath_mimi = '//table[@class="list"]//tr[position()>1]/td[1]/text()' 136 | port_xpath_mimi = '//table[@class="list"]//tr[position()>1]/td[2]/text()' 137 | for i in xrange(self.page_num): 138 | url_mimi = 'http://www.mimiip.com/gngao/' + str(i+1) 139 | results = self.get_content(url_mimi, url_xpath_mimi, port_xpath_mimi) 140 | self.all_ip.update(results) 141 | current_all_ip.update(results) 142 | time.sleep(0.5) 143 | 144 | ################################## 145 | # kuaidaili网 146 | ################################### 147 | url_xpath_kuaidaili = '//td[@data-title="IP"]/text()' 148 | port_xpath_kuaidaili = '//td[@data-title="PORT"]/text()' 149 | for i in xrange(self.page_num): 150 | url_kuaidaili = 'http://www.kuaidaili.com/free/inha/' + str(i+1) + '/' 151 | results = self.get_content(url_kuaidaili, url_xpath_kuaidaili, port_xpath_kuaidaili) 152 | self.all_ip.update(results) 153 | current_all_ip.update(results) 154 | time.sleep(0.5) 155 | 156 | return current_all_ip 157 | 158 | def get_valid_ip(self, ip_set, timeout): 159 | """ 160 | 代理ip可用性测试 161 | """ 162 | # 设置请求地址 163 | url = 'https://www.baidu.com' 164 | 165 | # 可用代理结果 166 | results = set() 167 | 168 | # 挨个检查代理是否可用 169 | for p in ip_set: 170 | proxy = {'http': 'http://'+p} 171 | try: 172 | # 请求开始时间 173 | start = time.time() 174 | r = requests.get(url, proxies=proxy, timeout=timeout) 175 | # 请求结束时间 176 | end = time.time() 177 | # 判断是否可用 178 | if r.status_code == 200: 179 | print 'succeed: ' + p + '\t' + " in " + format(end-start, '0.2f') + 's' 180 | # 追加代理ip到返回的set中 181 | results.add(p) 182 | except requests.ConnectionError, e: 183 | print p, e 184 | 185 | return results 186 | 187 | def get_the_best(self, valid_ip, timeout, round): 188 | """ 189 | N轮检测ip列表,避免"辉煌的15分钟" 190 | """ 191 | # 循环检查次数 192 | for i in range(round): 193 | print "\n>>>>>>>\tRound\t"+str(i+1)+"\t<<<<<<<<<<" 194 | # 检查代理是否可用 195 | valid_ip = self.get_valid_ip(valid_ip, timeout) 196 | # 停一下 197 | if i < round-1: 198 | print ">>>>>>>\tRound"+str(i+2)+"\t还有30秒开始\t<<<<<<<<<<" 199 | time.sleep(30) 200 | 201 | # 返回可用数据 202 | return valid_ip 203 | 204 | def save_to_db(self, valid_ips): 205 | """ 206 | 将可用的ip存储进mysql数据库 207 | """ 208 | if len(valid_ips) == 0: 209 | print "本次没有抓到可用ip。" 210 | return 211 | # 连接数据库 212 | print "\n>>>>>>>>>>>>>>>>>>>> 代理数据入库处理 Start <<<<<<<<<<<<<<<<<<<<<<\n" 213 | conn = mdb.connect(cfg.host, cfg.user, cfg.passwd, cfg.DB_NAME) 214 | cursor = conn.cursor() 215 | try: 216 | for item in valid_ips: 217 | # 检查表中是否存在数据 218 | item_exist = cursor.execute('SELECT * FROM %s WHERE content="%s"' %(cfg.TABLE_NAME, item)) 219 | 220 | # 新增代理数据入库 221 | if item_exist == 0: 222 | # 插入数据 223 | n = cursor.execute('INSERT INTO %s VALUES("%s", 1, 0, 0, 1.0, 2.5)' %(cfg.TABLE_NAME, item)) 224 | conn.commit() 225 | 226 | # 输出入库状态 227 | if n: 228 | print datetime.datetime.now().strftime("%Y-%m-%d %H:%M:%S")+" "+item+" 插入成功。\n" 229 | else: 230 | print datetime.datetime.now().strftime("%Y-%m-%d %H:%M:%S")+" "+item+" 插入失败。\n" 231 | 232 | else: 233 | print datetime.datetime.now().strftime("%Y-%m-%d %H:%M:%S")+" "+ item +" 已存在。\n" 234 | except Exception as e: 235 | print "入库失败:" + str(e) 236 | finally: 237 | cursor.close() 238 | conn.close() 239 | print "\n>>>>>>>>>>>>>>>>>>>> 代理数据入库处理 End <<<<<<<<<<<<<<<<<<<<<<\n" 240 | 241 | def get_proxies(self): 242 | ip_list = [] 243 | 244 | # 连接数据库 245 | conn = mdb.connect(cfg.host, cfg.user, cfg.passwd, cfg.DB_NAME) 246 | cursor = conn.cursor() 247 | 248 | # 检查数据表中是否有数据 249 | try: 250 | ip_exist = cursor.execute('SELECT * FROM %s ' % cfg.TABLE_NAME) 251 | 252 | # 提取数据 253 | result = cursor.fetchall() 254 | 255 | # 若表里有数据 直接返回,没有则抓取再返回 256 | if len(result): 257 | for item in result: 258 | ip_list.append(item[0]) 259 | else: 260 | # 获取代理数据 261 | current_ips = self.get_all_ip() 262 | valid_ips = self.get_the_best(current_ips, self.timeout, self.round) 263 | self.save_to_db(valid_ips) 264 | ip_list.extend(valid_ips) 265 | except Exception as e: 266 | print "从数据库获取ip失败!" 267 | finally: 268 | cursor.close() 269 | conn.close() 270 | 271 | return ip_list 272 | 273 | 274 | def main(): 275 | ip_pool = IPFactory() 276 | while True: 277 | current_ips = ip_pool.get_all_ip() 278 | # 获取有效ip 279 | valid_ip = ip_pool.get_the_best(current_ips, cfg.timeout, cfg.examine_round) 280 | print valid_ip 281 | ip_pool.save_to_db(valid_ip) 282 | time.sleep(cfg.CHECK_TIME_INTERVAL) 283 | 284 | if __name__ == '__main__': 285 | main() --------------------------------------------------------------------------------