├── .gitignore
├── README.md
├── doc
│   ├── queues.png
│   ├── repos.png
│   ├── routing.png
│   ├── user.png
│   ├── 爬虫流程.graffle
│   ├── 爬虫流程.png
│   ├── 程序流程.graffle
│   └── 程序流程.png
├── github_spider
│   ├── __init__.py
│   ├── const.py
│   ├── extensions.py
│   ├── proxy
│   │   ├── __init__.py
│   │   └── extract.py
│   ├── queue
│   │   ├── __init__.py
│   │   ├── consumer.py
│   │   ├── main.py
│   │   └── producer.py
│   ├── recursion
│   │   ├── __init__.py
│   │   ├── flow.py
│   │   ├── main.py
│   │   └── request.py
│   ├── settings.py
│   ├── utils.py
│   └── worker.py
└── requirements.txt

/.gitignore:
--------------------------------------------------------------------------------
*.pyc
.idea/*

--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
# github_spider

Downloads GitHub users and their repository data through the GitHub API. Starting from a single user, it walks the whole user network by following each user's following and follower relationships. The crawl flow is shown below:

![](doc/爬虫流程.png)


# Recursive implementation

## How to run

The code lives in the recursion directory. Data is stored in MongoDB, Redis is used to detect duplicate requests, and MongoDB writes go through asynchronous Celery tasks, so a running RabbitMQ service is required. After configuring settings.py correctly, start it with the following steps:

1. Enter the github_spider directory
2. Run ```celery -A github_spider.worker worker --loglevel=info``` to start the async worker
3. Run ```python github_spider/recursion/main.py``` to start the crawler

## Results

Because every request has high latency, the crawler runs slowly; after a few thousand requests it had collected only part of the data. These are the Python projects sorted by watch count in descending order:
![repo](doc/repos.png)

And the users sorted by follower count in descending order:
![user](doc/user.png)

## Shortcomings

1. The traversal is depth-first, so when the user graph is large, single-machine recursion can exhaust memory and crash the program; it can only run on one machine for a short time.
2. Each request takes too long, so data downloads far too slowly.
3. There is no retry mechanism for URLs that failed during a given period, so data can be lost.

## Async optimization

For an I/O-bound problem like this there are only a few remedies: more concurrency, asynchronous access, or both. For problem 2 above, my first fix was to call the API asynchronously. I had this in mind when writing the original code, so the call sites were already structured for it and the change was quick. The implementation uses [grequests](https://github.com/kennethreitz/grequests), a library by the same author as requests; the code is very simple, essentially a thin gevent wrapper around requests that lets you fetch data without blocking.

Asynchronous access, however, quickly got the IP blocked, so I wrote a helper script (proxy/extract.py) that scrapes free HTTPS proxies from the web and stores them in Redis. Every request then goes out through a proxy; on an error the request is retried with a different proxy and the failing proxy is removed. But free HTTPS proxies are scarce and many of them do not work, so with all the error retries the overall speed did not improve noticeably. A sketch of the proxied, non-blocking request pattern follows.
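To make this concrete, here is a minimal sketch of the proxied, non-blocking request pattern. It is illustrative only, not the project's code: the real implementation is `async_get` in `github_spider/recursion/request.py`, which additionally counts proxy usage in Redis and retries whole batches via `retrying`. The `fetch_all` helper and its round-robin proxy choice below are made up for the example.

```python
# Sketch only: fetch a batch of URLs concurrently through HTTPS proxies
# with grequests (gevent under the hood).
import grequests


def fetch_all(urls, proxies, timeout=10):
    """Issue all requests at once and return the JSON bodies that succeeded.

    `proxies` is assumed to be a non-empty list of 'host:port' strings,
    e.g. taken from the Redis sorted set that proxy/extract.py fills.
    """
    pending = [
        grequests.get(
            url,
            proxies={'https': 'http://{}'.format(proxies[i % len(proxies)])},
            timeout=timeout,
        )
        for i, url in enumerate(urls)
    ]
    # grequests.map blocks until every request has finished or failed;
    # failed requests come back as None in the result list.
    responses = grequests.map(pending, exception_handler=lambda req, exc: None)
    return [resp.json() for resp in responses if resp is not None and resp.ok]
```

Requests that fail simply drop out of the result, so the corresponding URLs can be collected and retried in a later pass.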
# Multi-process implementation

## How it works

With a breadth-first traversal, the URLs to be visited can be kept in queues, and applying the producer-consumer pattern on top of that makes high concurrency easy, which solves problem 2 above. If a URL keeps failing for a while, putting it back into the queue completely solves problem 3. Beyond that, this approach can resume after an interruption. The program flow is shown below:

![program flow](doc/程序流程.png)


## How to run

To allow multi-machine deployment, RabbitMQ is used as the message queue. You need to create an exchange named github with type direct, and then four queues named user, repo, follower and following. The detailed bindings are shown in the figure below (a short declaration sketch follows the figure):

![routing](doc/routing.png)
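For reference, here is a rough sketch of how the exchange, the four queues and their bindings could be declared with kombu. The project itself only declares the `github` exchange on the producer side (`github_spider/queue/producer.py`) and consumes plain named queues (`github_spider/queue/consumer.py`), so it expects the bindings to already exist in RabbitMQ as described above; the broker URL below is the default from `settings.py`.

```python
# Sketch only: create the exchange, queues and bindings that the crawler expects.
from kombu import Connection, Exchange, Queue

github_exchange = Exchange('github', type='direct')

queues = [
    Queue(name, exchange=github_exchange, routing_key=name)
    for name in ('user', 'repo', 'follower', 'following')
]

with Connection('amqp://guest:guest@localhost:5672/') as conn:
    channel = conn.channel()
    for queue in queues:
        # binding a Queue to a channel and declaring it creates the exchange,
        # the queue, and the routing-key binding in one step
        queue(channel).declare()
```

Running this once against the broker (or creating the same objects in the RabbitMQ management UI) is enough; the routing keys deliberately mirror the queue names, matching `RoutingKey` and `QueueName` in `github_spider/const.py`.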
The detailed startup steps are:

1. Enter the github_spider directory
2. Run ```celery -A github_spider.worker worker --loglevel=info``` to start the async worker
3. Run ```python github_spider/proxy/extract.py``` to keep the proxy list updated
4. Run ```python github_spider/queue/main.py``` to start the crawler script

Queue status:

![queue](doc/queues.png)

--------------------------------------------------------------------------------
/doc/queues.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/LiuRoy/github_spider/afec86a58655ef8e7c28de940bf711191dea0de0/doc/queues.png
--------------------------------------------------------------------------------
/doc/repos.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/LiuRoy/github_spider/afec86a58655ef8e7c28de940bf711191dea0de0/doc/repos.png
--------------------------------------------------------------------------------
/doc/routing.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/LiuRoy/github_spider/afec86a58655ef8e7c28de940bf711191dea0de0/doc/routing.png
--------------------------------------------------------------------------------
/doc/user.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/LiuRoy/github_spider/afec86a58655ef8e7c28de940bf711191dea0de0/doc/user.png
--------------------------------------------------------------------------------
/doc/爬虫流程.graffle:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/LiuRoy/github_spider/afec86a58655ef8e7c28de940bf711191dea0de0/doc/爬虫流程.graffle
--------------------------------------------------------------------------------
/doc/爬虫流程.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/LiuRoy/github_spider/afec86a58655ef8e7c28de940bf711191dea0de0/doc/爬虫流程.png
--------------------------------------------------------------------------------
/doc/程序流程.graffle:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/LiuRoy/github_spider/afec86a58655ef8e7c28de940bf711191dea0de0/doc/程序流程.graffle
--------------------------------------------------------------------------------
/doc/程序流程.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/LiuRoy/github_spider/afec86a58655ef8e7c28de940bf711191dea0de0/doc/程序流程.png
--------------------------------------------------------------------------------
/github_spider/__init__.py:
--------------------------------------------------------------------------------
# -*- coding=utf8 -*-

--------------------------------------------------------------------------------
/github_spider/const.py:
--------------------------------------------------------------------------------
# -*- coding=utf8 -*-

GITHUB_API_HOST = 'api.github.com'

# maximum number of items the API returns per page
PAGE_SIZE = 30

MONGO_DB_NAME = 'github'

REDIS_VISITED_URLS = 'visited_urls'

PROXY_KEY = 'proxy_zset'

HEADERS = {
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8',
    'Accept-Encoding': 'gzip, deflate, sdch',
    'Accept-Language': 'zh-CN,zh;q=0.8,en;q=0.6,zh-TW;q=0.4',
    'Cache-Control': 'no-cache',
    'Connection': 'keep-alive',
    'Host': 'api.github.com',
    'Pragma': 'no-cache',
    'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_5) '
                  'AppleWebKit/537.36 (KHTML, like Gecko) '
                  'Chrome/49.0.2623.87 Safari/537.36'
}


class MongodbCollection(object):
    """
    Names of the MongoDB collections in use.
    """
    USER = 'user'
    REPO = 'repo'
    USER_REPO = 'user_repo'
    FOLLOWER = 'follower'
    FOLLOWING = 'following'


class RoutingKey(object):
    """
    Message queue routing keys.
    """
    USER = 'user'
    REPO = 'repo'
    FOLLOWER = 'follower'
    FOLLOWING = 'following'


class QueueName(object):
    """
    Message queue names.
    """
    USER = 'user'
    REPO = 'repo'
    FOLLOWER = 'follower'
    FOLLOWING = 'following'

MESSAGE_QUEUE_EXCHANGE = 'github'

--------------------------------------------------------------------------------
/github_spider/extensions.py:
--------------------------------------------------------------------------------
# -*- coding=utf8 -*-
"""
Data access clients (MongoDB and Redis).
"""
from pymongo import MongoClient
from redis import Redis

from github_spider.const import MONGO_DB_NAME
from github_spider.settings import (
    MONGO_URI,
    REDIS_URI
)

mongo_client = MongoClient(MONGO_URI)
redis_client = Redis.from_url(REDIS_URI)

mongo_db = mongo_client[MONGO_DB_NAME]

--------------------------------------------------------------------------------
/github_spider/proxy/__init__.py:
--------------------------------------------------------------------------------
# -*- coding=utf8 -*-

--------------------------------------------------------------------------------
/github_spider/proxy/extract.py:
--------------------------------------------------------------------------------
# -*- coding=utf8 -*-
"""
Scrape free HTTPS proxies from the web and keep them in Redis.
"""
import re
import time

import requests
from pyquery import PyQuery

from github_spider.extensions import redis_client
from github_spider.const import PROXY_KEY
from github_spider.settings import PROXY_USE_COUNT


def get_ip181_proxies():
    """
    Fetch HTTPS proxies from http://www.ip181.com/.
    """
    html_page = requests.get('http://www.ip181.com/').content.decode('gb2312')
    jq = PyQuery(html_page)

    proxy_list = []
    for tr in jq("tr"):
        element = [PyQuery(td).text() for td in PyQuery(tr)("td")]
        if 'HTTPS' not in element[3]:
            continue

        result = re.search(r'\d+\.\d+', element[4], re.UNICODE)
        if result and float(result.group()) > 5:
            continue
        proxy_list.append((element[0], element[1]))

    return proxy_list

if __name__ == '__main__':
    while 1:
        try:
            proxies = get_ip181_proxies()

            # drop proxies that have already been used too many times
            redis_client.zremrangebyscore(PROXY_KEY, PROXY_USE_COUNT, 10000)

            for host, port in proxies:
                address = '{}:{}'.format(host, port)
                print(address)
                if redis_client.zscore(PROXY_KEY, address) is None:
                    redis_client.zadd(PROXY_KEY, address, 0)
        except Exception as exc:
            print(exc)
        finally:
            time.sleep(10 * 60)

--------------------------------------------------------------------------------
/github_spider/queue/__init__.py:
--------------------------------------------------------------------------------
# -*- coding=utf8 -*-
"""
Queue-based download.
"""

--------------------------------------------------------------------------------
/github_spider/queue/consumer.py:
--------------------------------------------------------------------------------
# -*- coding=utf8 -*-
"""
Queue consumers.
"""
import logging
import functools


import requests
10 | from retrying import retry 11 | from kombu import Connection, Queue 12 | from kombu.mixins import ConsumerMixin 13 | 14 | from github_spider.queue.producer import url_sender 15 | from github_spider.extensions import redis_client 16 | from github_spider.worker import ( 17 | mongo_save_entity, 18 | mongo_save_relation, 19 | ) 20 | from github_spider.utils import ( 21 | gen_user_repo_url, 22 | gen_user_following_url, 23 | gen_user_follwer_url, 24 | gen_url_list, 25 | check_url_visited, 26 | get_proxy, 27 | find_login_by_url, 28 | gen_user_page_url, 29 | ) 30 | from github_spider.settings import ( 31 | MESSAGE_BROKER_URI, 32 | USER_CONSUMER_COUNT, 33 | FOLLOWER_CONSUMER_COUNT, 34 | FOLLOWING_CONSUMER_COUNT, 35 | REPO_CONSUMER_COUNT, 36 | REQUEST_RETRY_COUNT, 37 | TIMEOUT 38 | ) 39 | from github_spider.const import ( 40 | PROXY_KEY, 41 | HEADERS, 42 | REDIS_VISITED_URLS, 43 | RoutingKey, 44 | QueueName, 45 | MongodbCollection 46 | ) 47 | 48 | LOGGER = logging.getLogger(__name__) 49 | retry_get = retry(stop_max_attempt_number=REQUEST_RETRY_COUNT)(requests.get) 50 | 51 | 52 | def request_deco(func): 53 | """ 54 | 调用github API的装饰器 55 | """ 56 | @functools.wraps(func) 57 | def _inner(self, body, message): 58 | url = body 59 | if not check_url_visited([url]): 60 | message.reject() 61 | 62 | # 当前没有可用的代理仍会队列稍后再请求 63 | proxy = get_proxy() 64 | if not proxy: 65 | message.requeue() 66 | 67 | proxy_map = {'https': 'http://{}'.format(proxy.decode('ascii'))} 68 | redis_client.zincrby(PROXY_KEY, proxy) 69 | try: 70 | response = retry_get(url, proxies=proxy_map, headers=HEADERS, 71 | timeout=TIMEOUT) 72 | except Exception as exc: 73 | LOGGER.error('get {} failed'.format(url)) 74 | LOGGER.exception(exc) 75 | redis_client.zrem(PROXY_KEY, proxy) 76 | message.requeue() 77 | else: 78 | if response.status_code == 200: 79 | data = response.json() 80 | if not data: 81 | message.reject() 82 | else: 83 | message.ack() 84 | print 'url:', url, 'body: ', data 85 | func(self, (url, data), message) 86 | redis_client.sadd(REDIS_VISITED_URLS, url) 87 | else: 88 | message.requeue() 89 | 90 | return _inner 91 | 92 | 93 | class BaseConsumer(ConsumerMixin): 94 | """ 95 | 消费者公共类 96 | """ 97 | def __init__(self, broker_url, queue_name, fetch_count=10): 98 | """初始化消费者 99 | 100 | Args: 101 | broker_url (string): broker地址 102 | queue_name (string): 消费的队列名称 103 | fetch_count (int): 一次消费的个数 104 | """ 105 | self.queue_name = queue_name 106 | self.broker_url = broker_url 107 | self.fetch_count = fetch_count 108 | 109 | self.connection = Connection(broker_url) 110 | self.queue = Queue(queue_name) 111 | 112 | def handle_url(self, body, message): 113 | """ 114 | 处理队列中的url 115 | """ 116 | raise NotImplementedError 117 | 118 | def get_consumers(self, Consumer, channel): 119 | """ 120 | 继承 121 | """ 122 | consumer = Consumer( 123 | self.queue, 124 | callbacks=[self.handle_url], 125 | auto_declare=True 126 | ) 127 | consumer.qos(prefetch_count=self.fetch_count) 128 | return [consumer] 129 | 130 | 131 | class UserConsumer(BaseConsumer): 132 | """ 133 | 用户队列消费者 134 | """ 135 | @request_deco 136 | def handle_url(self, body, message): 137 | """ 138 | 解析用户数据 139 | """ 140 | url, data = body 141 | user = { 142 | 'id': data.get('login'), 143 | 'type': data.get('type'), 144 | 'name': data.get('name'), 145 | 'company': data.get('company'), 146 | 'blog': data.get('blog'), 147 | 'location': data.get('location'), 148 | 'email': data.get('email'), 149 | 'repos_count': data.get('public_repos', 0), 150 | 'gists_count': data.get('public+gists', 0), 151 | 
'followers': data.get('followers', 0), 152 | 'following': data.get('following', 0), 153 | 'created_at': data.get('created_at') 154 | } 155 | mongo_save_entity.delay(user) 156 | 157 | follower_urls = gen_url_list(user['id'], gen_user_follwer_url, 158 | user['followers']) 159 | following_urls = gen_url_list(user['id'], gen_user_following_url, 160 | user['following']) 161 | repo_urls = gen_url_list(user['id'], gen_user_repo_url, 162 | user['repos_count']) 163 | map(lambda x: url_sender.send_url(x, RoutingKey.REPO), repo_urls) 164 | map(lambda x: url_sender.send_url(x, RoutingKey.FOLLOWING), 165 | following_urls) 166 | map(lambda x: url_sender.send_url(x, RoutingKey.FOLLOWER), 167 | follower_urls) 168 | 169 | 170 | class RepoConsumer(BaseConsumer): 171 | """ 172 | repo队列消费者 173 | """ 174 | @request_deco 175 | def handle_url(self, body, message): 176 | """ 177 | 解析项目数据 178 | """ 179 | url, data = body 180 | 181 | user = find_login_by_url(str(url)) 182 | repo_list = [] 183 | for element in data: 184 | # fork的项目不关心 185 | if element.get('fork'): 186 | continue 187 | 188 | repo = { 189 | 'id': element.get('id'), 190 | 'name': element.get('full_name'), 191 | 'description': element.get('description'), 192 | 'size': element.get('size'), 193 | 'language': element.get('language'), 194 | 'watchers_count': element.get('watchers_count'), 195 | 'fork_count': element.get('fork_count'), 196 | } 197 | repo_list.append(repo['name']) 198 | mongo_save_entity.delay(repo, False) 199 | mongo_save_relation.delay({'id': user, 'list': repo_list}, 200 | MongodbCollection.USER_REPO) 201 | 202 | 203 | class FollowConsumer(BaseConsumer): 204 | """ 205 | 用户关系消费者 206 | """ 207 | def __init__(self, kind, broker_url, queue_name, fetch_count=10): 208 | """ 209 | kind指是following还是follower 210 | """ 211 | BaseConsumer.__init__(self, broker_url, queue_name, fetch_count) 212 | self.kind = kind 213 | 214 | @request_deco 215 | def handle_url(self, body, message): 216 | """ 217 | 解析人员关系数据 218 | """ 219 | url, data = body 220 | 221 | user = find_login_by_url(str(url)) 222 | users = [] 223 | urls = [] 224 | for element in data: 225 | users.append(element.get('login')) 226 | urls.append(gen_user_page_url(element.get('login'))) 227 | 228 | mongo_save_relation.delay({'id': user, 'list': users}, self.kind) 229 | map(lambda x: url_sender.send_url(x, RoutingKey.USER), urls) 230 | 231 | consumer_list = [UserConsumer(MESSAGE_BROKER_URI, QueueName.USER)] * USER_CONSUMER_COUNT + \ 232 | [RepoConsumer(MESSAGE_BROKER_URI, QueueName.REPO)] * REPO_CONSUMER_COUNT + \ 233 | [FollowConsumer(MongodbCollection.FOLLOWER, MESSAGE_BROKER_URI, QueueName.FOLLOWER)] * FOLLOWER_CONSUMER_COUNT + \ 234 | [FollowConsumer(MongodbCollection.FOLLOWING, MESSAGE_BROKER_URI, QueueName.FOLLOWING)] * FOLLOWING_CONSUMER_COUNT 235 | -------------------------------------------------------------------------------- /github_spider/queue/main.py: -------------------------------------------------------------------------------- 1 | # -*- coding=utf8 -*- 2 | """ 3 | 主函数 4 | """ 5 | from multiprocessing import Process 6 | 7 | from github_spider.queue.consumer import consumer_list 8 | from github_spider.queue.producer import url_sender 9 | from github_spider.extensions import redis_client 10 | from github_spider.settings import START_USER 11 | from github_spider.const import REDIS_VISITED_URLS, RoutingKey 12 | from github_spider.utils import gen_user_page_url 13 | 14 | if __name__ == '__main__': 15 | #redis_client.delete(REDIS_VISITED_URLS) 16 | #start_user_url = 
gen_user_page_url(START_USER) 17 | #url_sender.send_url(start_user_url, RoutingKey.USER) 18 | 19 | # user_consumer = consumer_list[0] 20 | # user_consumer.run() 21 | process_list = map(lambda x: Process(target=x.run), consumer_list) 22 | map(lambda p: p.start(), process_list) 23 | map(lambda p: p.join(), process_list) 24 | -------------------------------------------------------------------------------- /github_spider/queue/producer.py: -------------------------------------------------------------------------------- 1 | # -*- coding=utf8 -*- 2 | """ 3 | 往队列发送消息 4 | """ 5 | import gevent 6 | from retrying import retry 7 | from kombu import Connection, Exchange 8 | from kombu.pools import producers 9 | 10 | from github_spider.settings import MESSAGE_BROKER_URI 11 | from github_spider.const import MESSAGE_QUEUE_EXCHANGE 12 | 13 | SYNC = 1 14 | ASYNC = 2 15 | 16 | 17 | class Producer(object): 18 | """ 19 | 生产者 20 | """ 21 | def __init__(self, exchange_name, broker_url, mode=ASYNC): 22 | """初始化生产者 23 | 24 | Args: 25 | exchange_name (string): 路由名称 26 | broker_url (string): 连接地址 27 | mode (int): 发送 28 | """ 29 | self.exchange_name = exchange_name 30 | self.broker_url = broker_url 31 | self.mode = mode 32 | 33 | self.exchange = Exchange(exchange_name, type='direct') 34 | self.connection = Connection(broker_url) 35 | 36 | @retry(stop_max_attempt_number=5) 37 | def _sync_send(self, payload, routing_key, **kwargs): 38 | """发送url至指定队列 39 | 40 | Args: 41 | payload (string): 消息内容 42 | routing_key (string) 43 | """ 44 | with producers[self.connection].acquire(block=True) as p: 45 | p.publish(payload, exchange=self.exchange, 46 | routing_key=routing_key, **kwargs) 47 | 48 | def _async_send(self, payload, routing_key, **kwargs): 49 | """发送url至指定队列 50 | 51 | Args: 52 | payload (string): 消息内容 53 | routing_key (string) 54 | """ 55 | gevent.spawn(self._sync_send, payload, routing_key, **kwargs) 56 | gevent.sleep(0) 57 | 58 | def send_url(self, url, routing_key, **kwargs): 59 | """发送url至指定队列 60 | 61 | Args: 62 | url (string): url地址 63 | routing_key (string) 64 | """ 65 | if self.mode == SYNC: 66 | self._sync_send(url, routing_key, **kwargs) 67 | elif self.mode == ASYNC: 68 | self._async_send(url, routing_key, **kwargs) 69 | 70 | url_sender = Producer(MESSAGE_QUEUE_EXCHANGE, MESSAGE_BROKER_URI, ASYNC) 71 | -------------------------------------------------------------------------------- /github_spider/recursion/__init__.py: -------------------------------------------------------------------------------- 1 | # -*- coding=utf8 -*- 2 | """ 3 | 递归下载 4 | """ 5 | -------------------------------------------------------------------------------- /github_spider/recursion/flow.py: -------------------------------------------------------------------------------- 1 | # -*- coding=utf8 -*- 2 | """ 3 | 流程控制 4 | """ 5 | import logging 6 | 7 | from github_spider.const import ( 8 | REDIS_VISITED_URLS, 9 | MongodbCollection, 10 | ) 11 | from github_spider.worker import ( 12 | mongo_save_entity, 13 | mongo_save_relation, 14 | ) 15 | from github_spider.extensions import redis_client 16 | from github_spider.utils import ( 17 | gen_user_repo_url, 18 | gen_user_following_url, 19 | gen_user_follwer_url, 20 | gen_url_list, 21 | gen_user_page_url, 22 | check_url_visited, 23 | ) 24 | 25 | LOGGER = logging.getLogger(__name__) 26 | 27 | 28 | def request_api(urls, method, callback, **kwargs): 29 | """请求API数据 30 | 31 | Args: 32 | urls (list): 请求url列表 33 | method (func): 请求方法 34 | callback (func): 回调函数 35 | """ 36 | unvisited_urls = 
check_url_visited(urls) 37 | if not unvisited_urls: 38 | return 39 | 40 | try: 41 | bodies = method(unvisited_urls) 42 | except Exception as exc: 43 | LOGGER.exception(exc) 44 | else: 45 | redis_client.sadd(REDIS_VISITED_URLS, *unvisited_urls) 46 | map(lambda body: callback(body, method, **kwargs), bodies) 47 | 48 | 49 | def parse_user(data, method): 50 | """解析用户数据 51 | 52 | Args: 53 | data (dict): 用户数据 54 | method (func): 请求方法 55 | """ 56 | if not data: 57 | return 58 | 59 | user_id = data.get('login') 60 | if not user_id: 61 | return 62 | 63 | user = { 64 | 'id': user_id, 65 | 'type': data.get('type'), 66 | 'name': data.get('name'), 67 | 'company': data.get('company'), 68 | 'blog': data.get('blog'), 69 | 'location': data.get('location'), 70 | 'email': data.get('email'), 71 | 'repos_count': data.get('public_repos', 0), 72 | 'gists_count': data.get('public+gists', 0), 73 | 'followers': data.get('followers', 0), 74 | 'following': data.get('following', 0), 75 | 'created_at': data.get('created_at') 76 | } 77 | mongo_save_entity.delay(user) 78 | follower_urls = gen_url_list(user_id, gen_user_follwer_url, 79 | user['followers']) 80 | following_urls = gen_url_list(user_id, gen_user_following_url, 81 | user['following']) 82 | repo_urls = gen_url_list(user_id, gen_user_repo_url, user['repos_count']) 83 | 84 | request_api(repo_urls, method, parse_repos, user=user_id) 85 | request_api(following_urls, method, parse_follow, 86 | user=user_id, kind=MongodbCollection.FOLLOWING) 87 | request_api(follower_urls, method, parse_follow, 88 | user=user_id, kind=MongodbCollection.FOLLOWER) 89 | 90 | 91 | def parse_repos(data, method, user=None): 92 | """解析项目数据 93 | 94 | Args: 95 | data (list): 用户数据 96 | method (func): 请求函数 97 | user (string): 用户 98 | """ 99 | if not data: 100 | return 101 | 102 | user = user or data[0].get('owner', {}).get('login') 103 | if not user: 104 | return 105 | 106 | repo_list = [] 107 | for element in data: 108 | # fork的项目不关心 109 | if element.get('fork'): 110 | continue 111 | 112 | repo = { 113 | 'id': element.get('id'), 114 | 'name': element.get('full_name'), 115 | 'description': element.get('description'), 116 | 'size': element.get('size'), 117 | 'language': element.get('language'), 118 | 'watchers_count': element.get('watchers_count'), 119 | 'fork_count': element.get('fork_count'), 120 | } 121 | repo_list.append(repo['name']) 122 | mongo_save_entity.delay(repo, False) 123 | mongo_save_relation.delay({'id': user, 'list': repo_list}, 124 | MongodbCollection.USER_REPO) 125 | 126 | 127 | def parse_follow(data, method, kind=MongodbCollection.FOLLOWER, user=None): 128 | """解析关注关系 129 | 130 | Args: 131 | data (list): 请求数据 132 | method (func): 请求函数 133 | kind (string): 是关注的人还是关注着 134 | user (string): 用户 135 | """ 136 | if not data or not user: 137 | return 138 | 139 | users = [] 140 | urls = [] 141 | for element in data: 142 | users.append(element.get('login')) 143 | urls.append(gen_user_page_url(element.get('login'))) 144 | 145 | mongo_save_relation.delay({'id': user, 'list': users}, kind) 146 | request_api(urls, method, parse_user) 147 | -------------------------------------------------------------------------------- /github_spider/recursion/main.py: -------------------------------------------------------------------------------- 1 | # -*- coding=utf8 -*- 2 | """ 3 | 主流程 4 | """ 5 | import signal 6 | import gevent 7 | 8 | from github_spider.recursion.flow import request_api, parse_user 9 | from github_spider.recursion.request import sync_get, async_get 10 | from github_spider.utils import 
gen_user_page_url 11 | from github_spider.extensions import redis_client 12 | from github_spider.settings import START_USER 13 | from github_spider.const import REDIS_VISITED_URLS 14 | 15 | 16 | def main(): 17 | """ 18 | 主流程函数 19 | """ 20 | redis_client.delete(REDIS_VISITED_URLS) 21 | start_user_url = gen_user_page_url(START_USER) 22 | 23 | gevent.signal(signal.SIGQUIT, gevent.kill) 24 | request_api([start_user_url], async_get, parse_user) 25 | 26 | if __name__ == '__main__': 27 | main() 28 | -------------------------------------------------------------------------------- /github_spider/recursion/request.py: -------------------------------------------------------------------------------- 1 | # -*- coding=utf8 -*- 2 | """ 3 | 异步请求方法 4 | """ 5 | import time 6 | import logging 7 | import requests 8 | import grequests 9 | from retrying import retry 10 | 11 | from github_spider.extensions import redis_client 12 | from github_spider.const import ( 13 | PROXY_KEY, 14 | HEADERS, 15 | ) 16 | from github_spider.settings import ( 17 | TIMEOUT, 18 | REQUEST_RETRY_COUNT, 19 | ) 20 | from github_spider.utils import get_proxy 21 | 22 | LOGGER = logging.getLogger(__name__) 23 | 24 | 25 | def request_with_proxy(url): 26 | """ 27 | proxy访问url 28 | """ 29 | for i in range(REQUEST_RETRY_COUNT): 30 | proxy = get_proxy() 31 | if not proxy: 32 | time.sleep(10 * 60) 33 | 34 | try: 35 | proxy_map = {'https': 'http://{}'.format(proxy.decode('ascii'))} 36 | redis_client.zincrby(PROXY_KEY, proxy) 37 | response = requests.get(url, proxies=proxy_map, 38 | headers=HEADERS, timeout=TIMEOUT) 39 | except Exception as exc: 40 | LOGGER.exception(exc) 41 | redis_client.zrem(PROXY_KEY, proxy) 42 | else: 43 | if response.status_code == 200: 44 | return response.json() 45 | 46 | 47 | def exception_handler(request, exception): 48 | """ 49 | 操作错误 50 | """ 51 | proxy = request.kwargs.get('proxies', {}).get('https', '')[7:] 52 | redis_client.zrem(PROXY_KEY, proxy) 53 | LOGGER.error('request url:{} failed'.format(request.url)) 54 | LOGGER.error('proxy: {}'.format(proxy)) 55 | LOGGER.exception(exception) 56 | 57 | 58 | @retry(stop_max_attempt_number=REQUEST_RETRY_COUNT, 59 | retry_on_result=lambda result: not result) 60 | def async_get(urls): 61 | """ 62 | 异步请求数据 63 | """ 64 | rs = [] 65 | for url in urls: 66 | proxy = get_proxy() 67 | if not proxy: 68 | time.sleep(10 * 60) 69 | 70 | proxy_map = {'https': 'http://{}'.format(proxy.decode('ascii'))} 71 | redis_client.zincrby(PROXY_KEY, proxy) 72 | rs.append(grequests.get(url, proxies=proxy_map, 73 | headers=HEADERS, timeout=TIMEOUT)) 74 | result = grequests.map(rs, exception_handler=exception_handler) 75 | return [x.json() for x in result if x] 76 | 77 | 78 | def sync_get(urls): 79 | """ 80 | 同步请求数据 81 | """ 82 | result = [] 83 | for url in urls: 84 | try: 85 | LOGGER.debug('get {}'.format(url)) 86 | response = requests.get(url, headers=HEADERS) 87 | if not response.ok: 88 | LOGGER.info('get {} failed'.format(url)) 89 | continue 90 | 91 | result.append(response.json()) 92 | # response = request_with_proxy(url) 93 | # result.append(response) 94 | except Exception as exc: 95 | LOGGER.error('get {} fail'.format(url)) 96 | LOGGER.exception(exc) 97 | continue 98 | return result 99 | -------------------------------------------------------------------------------- /github_spider/settings.py: -------------------------------------------------------------------------------- 1 | # -*- coding=utf8 -*- 2 | """ 3 | 参数设置 4 | """ 5 | 6 | # 爬取的第一个用户 7 | START_USER = 'LiuRoy' 8 | 9 | # 请求超时时间 10 | 
TIMEOUT = 10 11 | 12 | # mongo配置 13 | MONGO_URI = 'mongodb://localhost:27017' 14 | 15 | # redis配置 16 | REDIS_URI = 'redis://localhost:6379' 17 | 18 | # rabbitmq 配置 19 | CELERY_BROKER_URI = 'amqp://guest:guest@localhost:5672/' 20 | MESSAGE_BROKER_URI = 'amqp://guest:guest@localhost:5672/' 21 | 22 | # 每个代理的最大使用次数 23 | PROXY_USE_COUNT = 100 24 | 25 | # 请求失败重试次数 26 | REQUEST_RETRY_COUNT = 5 27 | 28 | USER_CONSUMER_COUNT = 1 29 | FOLLOWER_CONSUMER_COUNT = 1 30 | FOLLOWING_CONSUMER_COUNT = 1 31 | REPO_CONSUMER_COUNT = 1 32 | -------------------------------------------------------------------------------- /github_spider/utils.py: -------------------------------------------------------------------------------- 1 | # -*- coding=utf8 -*- 2 | """ 3 | 工具函数 4 | """ 5 | import urlparse 6 | from github_spider.extensions import redis_client 7 | from github_spider.settings import PROXY_USE_COUNT 8 | from github_spider.const import ( 9 | GITHUB_API_HOST, 10 | PAGE_SIZE, 11 | PROXY_KEY, 12 | REDIS_VISITED_URLS, 13 | ) 14 | 15 | 16 | def gen_user_page_url(user_name): 17 | """获取用户主页url 18 | 19 | Args: 20 | user_name (string): github用户id 21 | page (int): 页号 22 | """ 23 | return 'https://{}/users/{}'.format(GITHUB_API_HOST, user_name) 24 | 25 | 26 | def gen_user_follwer_url(user_name, page=1): 27 | """获取用户粉丝列表url 28 | 29 | Args: 30 | user_name (string): github用户id 31 | page (int): 页号 32 | """ 33 | return 'https://{}/users/{}/followers?page={}'.format(GITHUB_API_HOST, 34 | user_name, page) 35 | 36 | 37 | def gen_user_following_url(user_name, page=1): 38 | """获取用户关注用户列表url 39 | 40 | Args: 41 | user_name (string): github用户id 42 | page (int): 页号 43 | """ 44 | return 'https://{}/users/{}/following?page={}'.format(GITHUB_API_HOST, 45 | user_name, page) 46 | 47 | 48 | def gen_user_repo_url(user_name, page=1): 49 | """获取用户项目列表url 50 | 51 | Args: 52 | user_name (string): github用户id 53 | page (int): 页号 54 | """ 55 | return 'https://{}/users/{}/repos?page={}'.format(GITHUB_API_HOST, 56 | user_name, page) 57 | 58 | 59 | def get_short_url(url): 60 | """ 61 | 去掉url前面的https://api.github.com/ 62 | """ 63 | return url[23:-1] 64 | 65 | 66 | def find_login_by_url(url): 67 | """ 68 | 获取url中的用户名 69 | """ 70 | result = urlparse.urlsplit(url) 71 | return result.path.split('/')[2] 72 | 73 | 74 | def gen_url_list(user_name, func, count): 75 | """调用func生成url列表 76 | 77 | Args: 78 | user_name (string): 用户名 79 | func (func): 生成函数 80 | count (int): 总个数 81 | """ 82 | result = [] 83 | page = 1 84 | while (page - 1) * PAGE_SIZE < count: 85 | result.append(func(user_name, page)) 86 | page += 1 87 | return result 88 | 89 | 90 | def check_url_visited(urls): 91 | """ 92 | 判断url是否重复访问过 93 | """ 94 | result = [] 95 | for url in urls: 96 | short_url = get_short_url(url) 97 | visited = redis_client.sismember(REDIS_VISITED_URLS, short_url) 98 | if not visited: 99 | result.append(url) 100 | return result 101 | 102 | 103 | def get_proxy(): 104 | """ 105 | 从redis获取代理 106 | """ 107 | available_proxy = redis_client.zrangebyscore(PROXY_KEY, 0, PROXY_USE_COUNT) 108 | if available_proxy: 109 | return available_proxy[0] 110 | return None 111 | -------------------------------------------------------------------------------- /github_spider/worker.py: -------------------------------------------------------------------------------- 1 | # -*- coding=utf8 -*- 2 | """ 3 | 异步任务 4 | """ 5 | from celery import Celery 6 | from github_spider.const import MongodbCollection 7 | from github_spider.settings import CELERY_BROKER_URI 8 | from github_spider.extensions import 
mongo_db

app = Celery('write_mongo', broker=CELERY_BROKER_URI)


@app.task
def mongo_save_entity(entity, is_user=True):
    """Write a user or repo document to MongoDB.

    Args:
        entity (dict): the data to save
        is_user (bool): whether this is user data or repo data
    """
    collection_name = MongodbCollection.USER \
        if is_user else MongodbCollection.REPO
    mongo_collection = mongo_db[collection_name]
    mongo_collection.update({'id': entity['id']}, entity, upsert=True)


@app.task
def mongo_save_relation(entity, relation_type):
    """Write a user-user or user-repo relation to MongoDB.

    Args:
        entity (dict): the data to save
        relation_type (string): the relation type (collection name)
    """
    mongo_collection = mongo_db[relation_type]
    data = mongo_collection.find_one({'id': entity['id']})
    if not data:
        mongo_collection.insert(entity)
    else:
        origin_list = entity['list']
        new_list = data['list']
        data['list'] = list(set(origin_list) | set(new_list))
        mongo_collection.update({'id': entity['id']}, data)

--------------------------------------------------------------------------------
/requirements.txt:
--------------------------------------------------------------------------------
gevent
pymongo
redis
grequests
requests
pyquery
celery
retrying

--------------------------------------------------------------------------------