├── .gitignore
├── README.md
├── doc
│   ├── queues.png
│   ├── repos.png
│   ├── routing.png
│   ├── user.png
│   ├── 爬虫流程.graffle
│   ├── 爬虫流程.png
│   ├── 程序流程.graffle
│   └── 程序流程.png
├── github_spider
│   ├── __init__.py
│   ├── const.py
│   ├── extensions.py
│   ├── proxy
│   │   ├── __init__.py
│   │   └── extract.py
│   ├── queue
│   │   ├── __init__.py
│   │   ├── consumer.py
│   │   ├── main.py
│   │   └── producer.py
│   ├── recursion
│   │   ├── __init__.py
│   │   ├── flow.py
│   │   ├── main.py
│   │   └── request.py
│   ├── settings.py
│   ├── utils.py
│   └── worker.py
└── requirements.txt

/.gitignore:
--------------------------------------------------------------------------------
*.pyc
.idea/*

--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
# github_spider

Downloads GitHub users and their repository data through the GitHub API. Starting from a single user, it walks the whole user network by following each user's following and follower relationships. The crawl flow is shown below:

![](doc/爬虫流程.png)


# Recursive implementation

## How to run

The code lives in the recursion directory. Data is stored in MongoDB, Redis is used to detect duplicate requests, and MongoDB writes go through asynchronous Celery tasks, so a running RabbitMQ service is required. After configuring settings.py correctly, start it with the following steps:

1. Enter the github_spider directory
2. Run ```celery -A github_spider.worker worker --loglevel=info``` to start the async worker
3. Run ```python github_spider/recursion/main.py``` to start the crawler

## Results

Because every request has high latency, the crawler runs slowly; after a few thousand requests it had collected only part of the data. These are the Python projects sorted by watch count in descending order:
![repo](doc/repos.png)

And the users sorted by follower count in descending order:
![user](doc/user.png)

## Shortcomings

1. The traversal is depth-first, so when the user graph is large, single-machine recursion can exhaust memory and crash the program; it can only run on one machine for a short time.
2. Each request takes too long, so data downloads far too slowly.
3. There is no retry mechanism for URLs that failed during a given period, so data can be lost.

## Async optimization

For an I/O-bound problem like this there are only a few remedies: more concurrency, asynchronous access, or both. For problem 2 above, my first fix was to call the API asynchronously. I had this in mind when writing the original code, so the call sites were already structured for it and the change was quick. The implementation uses [grequests](https://github.com/kennethreitz/grequests), a library by the same author as requests; the code is very simple, essentially a thin gevent wrapper around requests that lets you fetch data without blocking.

Asynchronous access, however, quickly got the IP blocked, so I wrote a helper script (proxy/extract.py) that scrapes free HTTPS proxies from the web and stores them in Redis. Every request then goes out through a proxy; on an error the request is retried with a different proxy and the failing proxy is removed. But free HTTPS proxies are scarce and many of them do not work, so with all the error retries the overall speed did not improve noticeably. A sketch of the proxied, non-blocking request pattern follows.
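To make this concrete, here is a minimal sketch of the proxied, non-blocking request pattern. It is illustrative only, not the project's code: the real implementation is `async_get` in `github_spider/recursion/request.py`, which additionally counts proxy usage in Redis and retries whole batches via `retrying`. The `fetch_all` helper and its round-robin proxy choice below are made up for the example.

```python
# Sketch only: fetch a batch of URLs concurrently through HTTPS proxies
# with grequests (gevent under the hood).
import grequests


def fetch_all(urls, proxies, timeout=10):
    """Issue all requests at once and return the JSON bodies that succeeded.

    `proxies` is assumed to be a non-empty list of 'host:port' strings,
    e.g. taken from the Redis sorted set that proxy/extract.py fills.
    """
    pending = [
        grequests.get(
            url,
            proxies={'https': 'http://{}'.format(proxies[i % len(proxies)])},
            timeout=timeout,
        )
        for i, url in enumerate(urls)
    ]
    # grequests.map blocks until every request has finished or failed;
    # failed requests come back as None in the result list.
    responses = grequests.map(pending, exception_handler=lambda req, exc: None)
    return [resp.json() for resp in responses if resp is not None and resp.ok]
```

Requests that fail simply drop out of the result, so the corresponding URLs can be collected and retried in a later pass.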
# Multi-process implementation

## How it works

With a breadth-first traversal, the URLs to be visited can be kept in queues, and applying the producer-consumer pattern on top of that makes high concurrency easy, which solves problem 2 above. If a URL keeps failing for a while, putting it back into the queue completely solves problem 3. Beyond that, this approach can resume after an interruption. The program flow is shown below:

![program flow](doc/程序流程.png)


## How to run

To allow multi-machine deployment, RabbitMQ is used as the message queue. You need to create an exchange named github with type direct, and then four queues named user, repo, follower and following. The detailed bindings are shown in the figure below (a short declaration sketch follows the figure):

![routing](doc/routing.png)
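For reference, here is a rough sketch of how the exchange, the four queues and their bindings could be declared with kombu. The project itself only declares the `github` exchange on the producer side (`github_spider/queue/producer.py`) and consumes plain named queues (`github_spider/queue/consumer.py`), so it expects the bindings to already exist in RabbitMQ as described above; the broker URL below is the default from `settings.py`.

```python
# Sketch only: create the exchange, queues and bindings that the crawler expects.
from kombu import Connection, Exchange, Queue

github_exchange = Exchange('github', type='direct')

queues = [
    Queue(name, exchange=github_exchange, routing_key=name)
    for name in ('user', 'repo', 'follower', 'following')
]

with Connection('amqp://guest:guest@localhost:5672/') as conn:
    channel = conn.channel()
    for queue in queues:
        # binding a Queue to a channel and declaring it creates the exchange,
        # the queue, and the routing-key binding in one step
        queue(channel).declare()
```

Running this once against the broker (or creating the same objects in the RabbitMQ management UI) is enough; the routing keys deliberately mirror the queue names, matching `RoutingKey` and `QueueName` in `github_spider/const.py`.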
The detailed startup steps are:

1. Enter the github_spider directory
2. Run ```celery -A github_spider.worker worker --loglevel=info``` to start the async worker
3. Run ```python github_spider/proxy/extract.py``` to keep the proxy list updated
4. Run ```python github_spider/queue/main.py``` to start the crawler script

Queue status:

![queue](doc/queues.png)

--------------------------------------------------------------------------------
/doc/queues.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/LiuRoy/github_spider/afec86a58655ef8e7c28de940bf711191dea0de0/doc/queues.png
--------------------------------------------------------------------------------
/doc/repos.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/LiuRoy/github_spider/afec86a58655ef8e7c28de940bf711191dea0de0/doc/repos.png
--------------------------------------------------------------------------------
/doc/routing.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/LiuRoy/github_spider/afec86a58655ef8e7c28de940bf711191dea0de0/doc/routing.png
--------------------------------------------------------------------------------
/doc/user.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/LiuRoy/github_spider/afec86a58655ef8e7c28de940bf711191dea0de0/doc/user.png
--------------------------------------------------------------------------------
/doc/爬虫流程.graffle:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/LiuRoy/github_spider/afec86a58655ef8e7c28de940bf711191dea0de0/doc/爬虫流程.graffle
--------------------------------------------------------------------------------
/doc/爬虫流程.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/LiuRoy/github_spider/afec86a58655ef8e7c28de940bf711191dea0de0/doc/爬虫流程.png
--------------------------------------------------------------------------------
/doc/程序流程.graffle:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/LiuRoy/github_spider/afec86a58655ef8e7c28de940bf711191dea0de0/doc/程序流程.graffle
--------------------------------------------------------------------------------
/doc/程序流程.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/LiuRoy/github_spider/afec86a58655ef8e7c28de940bf711191dea0de0/doc/程序流程.png
--------------------------------------------------------------------------------
/github_spider/__init__.py:
--------------------------------------------------------------------------------
# -*- coding=utf8 -*-

--------------------------------------------------------------------------------
/github_spider/const.py:
--------------------------------------------------------------------------------
# -*- coding=utf8 -*-

GITHUB_API_HOST = 'api.github.com'

# maximum number of items the API returns per page
PAGE_SIZE = 30

MONGO_DB_NAME = 'github'

REDIS_VISITED_URLS = 'visited_urls'

PROXY_KEY = 'proxy_zset'

HEADERS = {
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8',
    'Accept-Encoding': 'gzip, deflate, sdch',
    'Accept-Language': 'zh-CN,zh;q=0.8,en;q=0.6,zh-TW;q=0.4',
    'Cache-Control': 'no-cache',
    'Connection': 'keep-alive',
    'Host': 'api.github.com',
    'Pragma': 'no-cache',
    'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_5) '
                  'AppleWebKit/537.36 (KHTML, like Gecko) '
                  'Chrome/49.0.2623.87 Safari/537.36'
}


class MongodbCollection(object):
    """
    Names of the MongoDB collections in use.
    """
    USER = 'user'
    REPO = 'repo'
    USER_REPO = 'user_repo'
    FOLLOWER = 'follower'
    FOLLOWING = 'following'


class RoutingKey(object):
    """
    Message queue routing keys.
    """
    USER = 'user'
    REPO = 'repo'
    FOLLOWER = 'follower'
    FOLLOWING = 'following'


class QueueName(object):
    """
    Message queue names.
    """
    USER = 'user'
    REPO = 'repo'
    FOLLOWER = 'follower'
    FOLLOWING = 'following'

MESSAGE_QUEUE_EXCHANGE = 'github'

--------------------------------------------------------------------------------
/github_spider/extensions.py:
--------------------------------------------------------------------------------
# -*- coding=utf8 -*-
"""
Data access clients (MongoDB and Redis).
"""
from pymongo import MongoClient
from redis import Redis

from github_spider.const import MONGO_DB_NAME
from github_spider.settings import (
    MONGO_URI,
    REDIS_URI
)

mongo_client = MongoClient(MONGO_URI)
redis_client = Redis.from_url(REDIS_URI)

mongo_db = mongo_client[MONGO_DB_NAME]

--------------------------------------------------------------------------------
/github_spider/proxy/__init__.py:
--------------------------------------------------------------------------------
# -*- coding=utf8 -*-

--------------------------------------------------------------------------------
/github_spider/proxy/extract.py:
--------------------------------------------------------------------------------
# -*- coding=utf8 -*-
"""
Scrape free HTTPS proxies from the web and keep them in Redis.
"""
import re
import time

import requests
from pyquery import PyQuery

from github_spider.extensions import redis_client
from github_spider.const import PROXY_KEY
from github_spider.settings import PROXY_USE_COUNT


def get_ip181_proxies():
    """
    Fetch HTTPS proxies from http://www.ip181.com/.
    """
    html_page = requests.get('http://www.ip181.com/').content.decode('gb2312')
    jq = PyQuery(html_page)

    proxy_list = []
    for tr in jq("tr"):
        element = [PyQuery(td).text() for td in PyQuery(tr)("td")]
        if 'HTTPS' not in element[3]:
            continue

        result = re.search(r'\d+\.\d+', element[4], re.UNICODE)
        if result and float(result.group()) > 5:
            continue
        proxy_list.append((element[0], element[1]))

    return proxy_list

if __name__ == '__main__':
    while 1:
        try:
            proxies = get_ip181_proxies()

            # drop proxies that have already been used too many times
            redis_client.zremrangebyscore(PROXY_KEY, PROXY_USE_COUNT, 10000)

            for host, port in proxies:
                address = '{}:{}'.format(host, port)
                print(address)
                if redis_client.zscore(PROXY_KEY, address) is None:
                    redis_client.zadd(PROXY_KEY, address, 0)
        except Exception as exc:
            print(exc)
        finally:
            time.sleep(10 * 60)

--------------------------------------------------------------------------------
/github_spider/queue/__init__.py:
--------------------------------------------------------------------------------
# -*- coding=utf8 -*-
"""
Queue-based download.
"""

--------------------------------------------------------------------------------
/github_spider/queue/consumer.py:
--------------------------------------------------------------------------------
# -*- coding=utf8 -*-
"""
Queue consumers.
"""
import logging
import functools


import requests
10 | from retrying import retry 11 | from kombu import Connection, Queue 12 | from kombu.mixins import ConsumerMixin 13 | 14 | from github_spider.queue.producer import url_sender 15 | from github_spider.extensions import redis_client 16 | from github_spider.worker import ( 17 | mongo_save_entity, 18 | mongo_save_relation, 19 | ) 20 | from github_spider.utils import ( 21 | gen_user_repo_url, 22 | gen_user_following_url, 23 | gen_user_follwer_url, 24 | gen_url_list, 25 | check_url_visited, 26 | get_proxy, 27 | find_login_by_url, 28 | gen_user_page_url, 29 | ) 30 | from github_spider.settings import ( 31 | MESSAGE_BROKER_URI, 32 | USER_CONSUMER_COUNT, 33 | FOLLOWER_CONSUMER_COUNT, 34 | FOLLOWING_CONSUMER_COUNT, 35 | REPO_CONSUMER_COUNT, 36 | REQUEST_RETRY_COUNT, 37 | TIMEOUT 38 | ) 39 | from github_spider.const import ( 40 | PROXY_KEY, 41 | HEADERS, 42 | REDIS_VISITED_URLS, 43 | RoutingKey, 44 | QueueName, 45 | MongodbCollection 46 | ) 47 | 48 | LOGGER = logging.getLogger(__name__) 49 | retry_get = retry(stop_max_attempt_number=REQUEST_RETRY_COUNT)(requests.get) 50 | 51 | 52 | def request_deco(func): 53 | """ 54 | 调用github API的装饰器 55 | """ 56 | @functools.wraps(func) 57 | def _inner(self, body, message): 58 | url = body 59 | if not check_url_visited([url]): 60 | message.reject() 61 | 62 | # 当前没有可用的代理仍会队列稍后再请求 63 | proxy = get_proxy() 64 | if not proxy: 65 | message.requeue() 66 | 67 | proxy_map = {'https': 'http://{}'.format(proxy.decode('ascii'))} 68 | redis_client.zincrby(PROXY_KEY, proxy) 69 | try: 70 | response = retry_get(url, proxies=proxy_map, headers=HEADERS, 71 | timeout=TIMEOUT) 72 | except Exception as exc: 73 | LOGGER.error('get {} failed'.format(url)) 74 | LOGGER.exception(exc) 75 | redis_client.zrem(PROXY_KEY, proxy) 76 | message.requeue() 77 | else: 78 | if response.status_code == 200: 79 | data = response.json() 80 | if not data: 81 | message.reject() 82 | else: 83 | message.ack() 84 | print 'url:', url, 'body: ', data 85 | func(self, (url, data), message) 86 | redis_client.sadd(REDIS_VISITED_URLS, url) 87 | else: 88 | message.requeue() 89 | 90 | return _inner 91 | 92 | 93 | class BaseConsumer(ConsumerMixin): 94 | """ 95 | 消费者公共类 96 | """ 97 | def __init__(self, broker_url, queue_name, fetch_count=10): 98 | """初始化消费者 99 | 100 | Args: 101 | broker_url (string): broker地址 102 | queue_name (string): 消费的队列名称 103 | fetch_count (int): 一次消费的个数 104 | """ 105 | self.queue_name = queue_name 106 | self.broker_url = broker_url 107 | self.fetch_count = fetch_count 108 | 109 | self.connection = Connection(broker_url) 110 | self.queue = Queue(queue_name) 111 | 112 | def handle_url(self, body, message): 113 | """ 114 | 处理队列中的url 115 | """ 116 | raise NotImplementedError 117 | 118 | def get_consumers(self, Consumer, channel): 119 | """ 120 | 继承 121 | """ 122 | consumer = Consumer( 123 | self.queue, 124 | callbacks=[self.handle_url], 125 | auto_declare=True 126 | ) 127 | consumer.qos(prefetch_count=self.fetch_count) 128 | return [consumer] 129 | 130 | 131 | class UserConsumer(BaseConsumer): 132 | """ 133 | 用户队列消费者 134 | """ 135 | @request_deco 136 | def handle_url(self, body, message): 137 | """ 138 | 解析用户数据 139 | """ 140 | url, data = body 141 | user = { 142 | 'id': data.get('login'), 143 | 'type': data.get('type'), 144 | 'name': data.get('name'), 145 | 'company': data.get('company'), 146 | 'blog': data.get('blog'), 147 | 'location': data.get('location'), 148 | 'email': data.get('email'), 149 | 'repos_count': data.get('public_repos', 0), 150 | 'gists_count': data.get('public+gists', 0), 151 | 
'followers': data.get('followers', 0), 152 | 'following': data.get('following', 0), 153 | 'created_at': data.get('created_at') 154 | } 155 | mongo_save_entity.delay(user) 156 | 157 | follower_urls = gen_url_list(user['id'], gen_user_follwer_url, 158 | user['followers']) 159 | following_urls = gen_url_list(user['id'], gen_user_following_url, 160 | user['following']) 161 | repo_urls = gen_url_list(user['id'], gen_user_repo_url, 162 | user['repos_count']) 163 | map(lambda x: url_sender.send_url(x, RoutingKey.REPO), repo_urls) 164 | map(lambda x: url_sender.send_url(x, RoutingKey.FOLLOWING), 165 | following_urls) 166 | map(lambda x: url_sender.send_url(x, RoutingKey.FOLLOWER), 167 | follower_urls) 168 | 169 | 170 | class RepoConsumer(BaseConsumer): 171 | """ 172 | repo队列消费者 173 | """ 174 | @request_deco 175 | def handle_url(self, body, message): 176 | """ 177 | 解析项目数据 178 | """ 179 | url, data = body 180 | 181 | user = find_login_by_url(str(url)) 182 | repo_list = [] 183 | for element in data: 184 | # fork的项目不关心 185 | if element.get('fork'): 186 | continue 187 | 188 | repo = { 189 | 'id': element.get('id'), 190 | 'name': element.get('full_name'), 191 | 'description': element.get('description'), 192 | 'size': element.get('size'), 193 | 'language': element.get('language'), 194 | 'watchers_count': element.get('watchers_count'), 195 | 'fork_count': element.get('fork_count'), 196 | } 197 | repo_list.append(repo['name']) 198 | mongo_save_entity.delay(repo, False) 199 | mongo_save_relation.delay({'id': user, 'list': repo_list}, 200 | MongodbCollection.USER_REPO) 201 | 202 | 203 | class FollowConsumer(BaseConsumer): 204 | """ 205 | 用户关系消费者 206 | """ 207 | def __init__(self, kind, broker_url, queue_name, fetch_count=10): 208 | """ 209 | kind指是following还是follower 210 | """ 211 | BaseConsumer.__init__(self, broker_url, queue_name, fetch_count) 212 | self.kind = kind 213 | 214 | @request_deco 215 | def handle_url(self, body, message): 216 | """ 217 | 解析人员关系数据 218 | """ 219 | url, data = body 220 | 221 | user = find_login_by_url(str(url)) 222 | users = [] 223 | urls = [] 224 | for element in data: 225 | users.append(element.get('login')) 226 | urls.append(gen_user_page_url(element.get('login'))) 227 | 228 | mongo_save_relation.delay({'id': user, 'list': users}, self.kind) 229 | map(lambda x: url_sender.send_url(x, RoutingKey.USER), urls) 230 | 231 | consumer_list = [UserConsumer(MESSAGE_BROKER_URI, QueueName.USER)] * USER_CONSUMER_COUNT + \ 232 | [RepoConsumer(MESSAGE_BROKER_URI, QueueName.REPO)] * REPO_CONSUMER_COUNT + \ 233 | [FollowConsumer(MongodbCollection.FOLLOWER, MESSAGE_BROKER_URI, QueueName.FOLLOWER)] * FOLLOWER_CONSUMER_COUNT + \ 234 | [FollowConsumer(MongodbCollection.FOLLOWING, MESSAGE_BROKER_URI, QueueName.FOLLOWING)] * FOLLOWING_CONSUMER_COUNT 235 | -------------------------------------------------------------------------------- /github_spider/queue/main.py: -------------------------------------------------------------------------------- 1 | # -*- coding=utf8 -*- 2 | """ 3 | 主函数 4 | """ 5 | from multiprocessing import Process 6 | 7 | from github_spider.queue.consumer import consumer_list 8 | from github_spider.queue.producer import url_sender 9 | from github_spider.extensions import redis_client 10 | from github_spider.settings import START_USER 11 | from github_spider.const import REDIS_VISITED_URLS, RoutingKey 12 | from github_spider.utils import gen_user_page_url 13 | 14 | if __name__ == '__main__': 15 | #redis_client.delete(REDIS_VISITED_URLS) 16 | #start_user_url = 
gen_user_page_url(START_USER) 17 | #url_sender.send_url(start_user_url, RoutingKey.USER) 18 | 19 | # user_consumer = consumer_list[0] 20 | # user_consumer.run() 21 | process_list = map(lambda x: Process(target=x.run), consumer_list) 22 | map(lambda p: p.start(), process_list) 23 | map(lambda p: p.join(), process_list) 24 | -------------------------------------------------------------------------------- /github_spider/queue/producer.py: -------------------------------------------------------------------------------- 1 | # -*- coding=utf8 -*- 2 | """ 3 | 往队列发送消息 4 | """ 5 | import gevent 6 | from retrying import retry 7 | from kombu import Connection, Exchange 8 | from kombu.pools import producers 9 | 10 | from github_spider.settings import MESSAGE_BROKER_URI 11 | from github_spider.const import MESSAGE_QUEUE_EXCHANGE 12 | 13 | SYNC = 1 14 | ASYNC = 2 15 | 16 | 17 | class Producer(object): 18 | """ 19 | 生产者 20 | """ 21 | def __init__(self, exchange_name, broker_url, mode=ASYNC): 22 | """初始化生产者 23 | 24 | Args: 25 | exchange_name (string): 路由名称 26 | broker_url (string): 连接地址 27 | mode (int): 发送 28 | """ 29 | self.exchange_name = exchange_name 30 | self.broker_url = broker_url 31 | self.mode = mode 32 | 33 | self.exchange = Exchange(exchange_name, type='direct') 34 | self.connection = Connection(broker_url) 35 | 36 | @retry(stop_max_attempt_number=5) 37 | def _sync_send(self, payload, routing_key, **kwargs): 38 | """发送url至指定队列 39 | 40 | Args: 41 | payload (string): 消息内容 42 | routing_key (string) 43 | """ 44 | with producers[self.connection].acquire(block=True) as p: 45 | p.publish(payload, exchange=self.exchange, 46 | routing_key=routing_key, **kwargs) 47 | 48 | def _async_send(self, payload, routing_key, **kwargs): 49 | """发送url至指定队列 50 | 51 | Args: 52 | payload (string): 消息内容 53 | routing_key (string) 54 | """ 55 | gevent.spawn(self._sync_send, payload, routing_key, **kwargs) 56 | gevent.sleep(0) 57 | 58 | def send_url(self, url, routing_key, **kwargs): 59 | """发送url至指定队列 60 | 61 | Args: 62 | url (string): url地址 63 | routing_key (string) 64 | """ 65 | if self.mode == SYNC: 66 | self._sync_send(url, routing_key, **kwargs) 67 | elif self.mode == ASYNC: 68 | self._async_send(url, routing_key, **kwargs) 69 | 70 | url_sender = Producer(MESSAGE_QUEUE_EXCHANGE, MESSAGE_BROKER_URI, ASYNC) 71 | -------------------------------------------------------------------------------- /github_spider/recursion/__init__.py: -------------------------------------------------------------------------------- 1 | # -*- coding=utf8 -*- 2 | """ 3 | 递归下载 4 | """ 5 | -------------------------------------------------------------------------------- /github_spider/recursion/flow.py: -------------------------------------------------------------------------------- 1 | # -*- coding=utf8 -*- 2 | """ 3 | 流程控制 4 | """ 5 | import logging 6 | 7 | from github_spider.const import ( 8 | REDIS_VISITED_URLS, 9 | MongodbCollection, 10 | ) 11 | from github_spider.worker import ( 12 | mongo_save_entity, 13 | mongo_save_relation, 14 | ) 15 | from github_spider.extensions import redis_client 16 | from github_spider.utils import ( 17 | gen_user_repo_url, 18 | gen_user_following_url, 19 | gen_user_follwer_url, 20 | gen_url_list, 21 | gen_user_page_url, 22 | check_url_visited, 23 | ) 24 | 25 | LOGGER = logging.getLogger(__name__) 26 | 27 | 28 | def request_api(urls, method, callback, **kwargs): 29 | """请求API数据 30 | 31 | Args: 32 | urls (list): 请求url列表 33 | method (func): 请求方法 34 | callback (func): 回调函数 35 | """ 36 | unvisited_urls = 
check_url_visited(urls) 37 | if not unvisited_urls: 38 | return 39 | 40 | try: 41 | bodies = method(unvisited_urls) 42 | except Exception as exc: 43 | LOGGER.exception(exc) 44 | else: 45 | redis_client.sadd(REDIS_VISITED_URLS, *unvisited_urls) 46 | map(lambda body: callback(body, method, **kwargs), bodies) 47 | 48 | 49 | def parse_user(data, method): 50 | """解析用户数据 51 | 52 | Args: 53 | data (dict): 用户数据 54 | method (func): 请求方法 55 | """ 56 | if not data: 57 | return 58 | 59 | user_id = data.get('login') 60 | if not user_id: 61 | return 62 | 63 | user = { 64 | 'id': user_id, 65 | 'type': data.get('type'), 66 | 'name': data.get('name'), 67 | 'company': data.get('company'), 68 | 'blog': data.get('blog'), 69 | 'location': data.get('location'), 70 | 'email': data.get('email'), 71 | 'repos_count': data.get('public_repos', 0), 72 | 'gists_count': data.get('public+gists', 0), 73 | 'followers': data.get('followers', 0), 74 | 'following': data.get('following', 0), 75 | 'created_at': data.get('created_at') 76 | } 77 | mongo_save_entity.delay(user) 78 | follower_urls = gen_url_list(user_id, gen_user_follwer_url, 79 | user['followers']) 80 | following_urls = gen_url_list(user_id, gen_user_following_url, 81 | user['following']) 82 | repo_urls = gen_url_list(user_id, gen_user_repo_url, user['repos_count']) 83 | 84 | request_api(repo_urls, method, parse_repos, user=user_id) 85 | request_api(following_urls, method, parse_follow, 86 | user=user_id, kind=MongodbCollection.FOLLOWING) 87 | request_api(follower_urls, method, parse_follow, 88 | user=user_id, kind=MongodbCollection.FOLLOWER) 89 | 90 | 91 | def parse_repos(data, method, user=None): 92 | """解析项目数据 93 | 94 | Args: 95 | data (list): 用户数据 96 | method (func): 请求函数 97 | user (string): 用户 98 | """ 99 | if not data: 100 | return 101 | 102 | user = user or data[0].get('owner', {}).get('login') 103 | if not user: 104 | return 105 | 106 | repo_list = [] 107 | for element in data: 108 | # fork的项目不关心 109 | if element.get('fork'): 110 | continue 111 | 112 | repo = { 113 | 'id': element.get('id'), 114 | 'name': element.get('full_name'), 115 | 'description': element.get('description'), 116 | 'size': element.get('size'), 117 | 'language': element.get('language'), 118 | 'watchers_count': element.get('watchers_count'), 119 | 'fork_count': element.get('fork_count'), 120 | } 121 | repo_list.append(repo['name']) 122 | mongo_save_entity.delay(repo, False) 123 | mongo_save_relation.delay({'id': user, 'list': repo_list}, 124 | MongodbCollection.USER_REPO) 125 | 126 | 127 | def parse_follow(data, method, kind=MongodbCollection.FOLLOWER, user=None): 128 | """解析关注关系 129 | 130 | Args: 131 | data (list): 请求数据 132 | method (func): 请求函数 133 | kind (string): 是关注的人还是关注着 134 | user (string): 用户 135 | """ 136 | if not data or not user: 137 | return 138 | 139 | users = [] 140 | urls = [] 141 | for element in data: 142 | users.append(element.get('login')) 143 | urls.append(gen_user_page_url(element.get('login'))) 144 | 145 | mongo_save_relation.delay({'id': user, 'list': users}, kind) 146 | request_api(urls, method, parse_user) 147 | -------------------------------------------------------------------------------- /github_spider/recursion/main.py: -------------------------------------------------------------------------------- 1 | # -*- coding=utf8 -*- 2 | """ 3 | 主流程 4 | """ 5 | import signal 6 | import gevent 7 | 8 | from github_spider.recursion.flow import request_api, parse_user 9 | from github_spider.recursion.request import sync_get, async_get 10 | from github_spider.utils import 
gen_user_page_url 11 | from github_spider.extensions import redis_client 12 | from github_spider.settings import START_USER 13 | from github_spider.const import REDIS_VISITED_URLS 14 | 15 | 16 | def main(): 17 | """ 18 | 主流程函数 19 | """ 20 | redis_client.delete(REDIS_VISITED_URLS) 21 | start_user_url = gen_user_page_url(START_USER) 22 | 23 | gevent.signal(signal.SIGQUIT, gevent.kill) 24 | request_api([start_user_url], async_get, parse_user) 25 | 26 | if __name__ == '__main__': 27 | main() 28 | -------------------------------------------------------------------------------- /github_spider/recursion/request.py: -------------------------------------------------------------------------------- 1 | # -*- coding=utf8 -*- 2 | """ 3 | 异步请求方法 4 | """ 5 | import time 6 | import logging 7 | import requests 8 | import grequests 9 | from retrying import retry 10 | 11 | from github_spider.extensions import redis_client 12 | from github_spider.const import ( 13 | PROXY_KEY, 14 | HEADERS, 15 | ) 16 | from github_spider.settings import ( 17 | TIMEOUT, 18 | REQUEST_RETRY_COUNT, 19 | ) 20 | from github_spider.utils import get_proxy 21 | 22 | LOGGER = logging.getLogger(__name__) 23 | 24 | 25 | def request_with_proxy(url): 26 | """ 27 | proxy访问url 28 | """ 29 | for i in range(REQUEST_RETRY_COUNT): 30 | proxy = get_proxy() 31 | if not proxy: 32 | time.sleep(10 * 60) 33 | 34 | try: 35 | proxy_map = {'https': 'http://{}'.format(proxy.decode('ascii'))} 36 | redis_client.zincrby(PROXY_KEY, proxy) 37 | response = requests.get(url, proxies=proxy_map, 38 | headers=HEADERS, timeout=TIMEOUT) 39 | except Exception as exc: 40 | LOGGER.exception(exc) 41 | redis_client.zrem(PROXY_KEY, proxy) 42 | else: 43 | if response.status_code == 200: 44 | return response.json() 45 | 46 | 47 | def exception_handler(request, exception): 48 | """ 49 | 操作错误 50 | """ 51 | proxy = request.kwargs.get('proxies', {}).get('https', '')[7:] 52 | redis_client.zrem(PROXY_KEY, proxy) 53 | LOGGER.error('request url:{} failed'.format(request.url)) 54 | LOGGER.error('proxy: {}'.format(proxy)) 55 | LOGGER.exception(exception) 56 | 57 | 58 | @retry(stop_max_attempt_number=REQUEST_RETRY_COUNT, 59 | retry_on_result=lambda result: not result) 60 | def async_get(urls): 61 | """ 62 | 异步请求数据 63 | """ 64 | rs = [] 65 | for url in urls: 66 | proxy = get_proxy() 67 | if not proxy: 68 | time.sleep(10 * 60) 69 | 70 | proxy_map = {'https': 'http://{}'.format(proxy.decode('ascii'))} 71 | redis_client.zincrby(PROXY_KEY, proxy) 72 | rs.append(grequests.get(url, proxies=proxy_map, 73 | headers=HEADERS, timeout=TIMEOUT)) 74 | result = grequests.map(rs, exception_handler=exception_handler) 75 | return [x.json() for x in result if x] 76 | 77 | 78 | def sync_get(urls): 79 | """ 80 | 同步请求数据 81 | """ 82 | result = [] 83 | for url in urls: 84 | try: 85 | LOGGER.debug('get {}'.format(url)) 86 | response = requests.get(url, headers=HEADERS) 87 | if not response.ok: 88 | LOGGER.info('get {} failed'.format(url)) 89 | continue 90 | 91 | result.append(response.json()) 92 | # response = request_with_proxy(url) 93 | # result.append(response) 94 | except Exception as exc: 95 | LOGGER.error('get {} fail'.format(url)) 96 | LOGGER.exception(exc) 97 | continue 98 | return result 99 | -------------------------------------------------------------------------------- /github_spider/settings.py: -------------------------------------------------------------------------------- 1 | # -*- coding=utf8 -*- 2 | """ 3 | 参数设置 4 | """ 5 | 6 | # 爬取的第一个用户 7 | START_USER = 'LiuRoy' 8 | 9 | # 请求超时时间 10 | 
TIMEOUT = 10 11 | 12 | # mongo配置 13 | MONGO_URI = 'mongodb://localhost:27017' 14 | 15 | # redis配置 16 | REDIS_URI = 'redis://localhost:6379' 17 | 18 | # rabbitmq 配置 19 | CELERY_BROKER_URI = 'amqp://guest:guest@localhost:5672/' 20 | MESSAGE_BROKER_URI = 'amqp://guest:guest@localhost:5672/' 21 | 22 | # 每个代理的最大使用次数 23 | PROXY_USE_COUNT = 100 24 | 25 | # 请求失败重试次数 26 | REQUEST_RETRY_COUNT = 5 27 | 28 | USER_CONSUMER_COUNT = 1 29 | FOLLOWER_CONSUMER_COUNT = 1 30 | FOLLOWING_CONSUMER_COUNT = 1 31 | REPO_CONSUMER_COUNT = 1 32 | -------------------------------------------------------------------------------- /github_spider/utils.py: -------------------------------------------------------------------------------- 1 | # -*- coding=utf8 -*- 2 | """ 3 | 工具函数 4 | """ 5 | import urlparse 6 | from github_spider.extensions import redis_client 7 | from github_spider.settings import PROXY_USE_COUNT 8 | from github_spider.const import ( 9 | GITHUB_API_HOST, 10 | PAGE_SIZE, 11 | PROXY_KEY, 12 | REDIS_VISITED_URLS, 13 | ) 14 | 15 | 16 | def gen_user_page_url(user_name): 17 | """获取用户主页url 18 | 19 | Args: 20 | user_name (string): github用户id 21 | page (int): 页号 22 | """ 23 | return 'https://{}/users/{}'.format(GITHUB_API_HOST, user_name) 24 | 25 | 26 | def gen_user_follwer_url(user_name, page=1): 27 | """获取用户粉丝列表url 28 | 29 | Args: 30 | user_name (string): github用户id 31 | page (int): 页号 32 | """ 33 | return 'https://{}/users/{}/followers?page={}'.format(GITHUB_API_HOST, 34 | user_name, page) 35 | 36 | 37 | def gen_user_following_url(user_name, page=1): 38 | """获取用户关注用户列表url 39 | 40 | Args: 41 | user_name (string): github用户id 42 | page (int): 页号 43 | """ 44 | return 'https://{}/users/{}/following?page={}'.format(GITHUB_API_HOST, 45 | user_name, page) 46 | 47 | 48 | def gen_user_repo_url(user_name, page=1): 49 | """获取用户项目列表url 50 | 51 | Args: 52 | user_name (string): github用户id 53 | page (int): 页号 54 | """ 55 | return 'https://{}/users/{}/repos?page={}'.format(GITHUB_API_HOST, 56 | user_name, page) 57 | 58 | 59 | def get_short_url(url): 60 | """ 61 | 去掉url前面的https://api.github.com/ 62 | """ 63 | return url[23:-1] 64 | 65 | 66 | def find_login_by_url(url): 67 | """ 68 | 获取url中的用户名 69 | """ 70 | result = urlparse.urlsplit(url) 71 | return result.path.split('/')[2] 72 | 73 | 74 | def gen_url_list(user_name, func, count): 75 | """调用func生成url列表 76 | 77 | Args: 78 | user_name (string): 用户名 79 | func (func): 生成函数 80 | count (int): 总个数 81 | """ 82 | result = [] 83 | page = 1 84 | while (page - 1) * PAGE_SIZE < count: 85 | result.append(func(user_name, page)) 86 | page += 1 87 | return result 88 | 89 | 90 | def check_url_visited(urls): 91 | """ 92 | 判断url是否重复访问过 93 | """ 94 | result = [] 95 | for url in urls: 96 | short_url = get_short_url(url) 97 | visited = redis_client.sismember(REDIS_VISITED_URLS, short_url) 98 | if not visited: 99 | result.append(url) 100 | return result 101 | 102 | 103 | def get_proxy(): 104 | """ 105 | 从redis获取代理 106 | """ 107 | available_proxy = redis_client.zrangebyscore(PROXY_KEY, 0, PROXY_USE_COUNT) 108 | if available_proxy: 109 | return available_proxy[0] 110 | return None 111 | -------------------------------------------------------------------------------- /github_spider/worker.py: -------------------------------------------------------------------------------- 1 | # -*- coding=utf8 -*- 2 | """ 3 | 异步任务 4 | """ 5 | from celery import Celery 6 | from github_spider.const import MongodbCollection 7 | from github_spider.settings import CELERY_BROKER_URI 8 | from github_spider.extensions import 
mongo_db

app = Celery('write_mongo', broker=CELERY_BROKER_URI)


@app.task
def mongo_save_entity(entity, is_user=True):
    """Write a user or repo document to MongoDB.

    Args:
        entity (dict): the data to save
        is_user (bool): whether this is user data or repo data
    """
    collection_name = MongodbCollection.USER \
        if is_user else MongodbCollection.REPO
    mongo_collection = mongo_db[collection_name]
    mongo_collection.update({'id': entity['id']}, entity, upsert=True)


@app.task
def mongo_save_relation(entity, relation_type):
    """Write a user-user or user-repo relation to MongoDB.

    Args:
        entity (dict): the data to save
        relation_type (string): the relation type (collection name)
    """
    mongo_collection = mongo_db[relation_type]
    data = mongo_collection.find_one({'id': entity['id']})
    if not data:
        mongo_collection.insert(entity)
    else:
        origin_list = entity['list']
        new_list = data['list']
        data['list'] = list(set(origin_list) | set(new_list))
        mongo_collection.update({'id': entity['id']}, data)

--------------------------------------------------------------------------------
/requirements.txt:
--------------------------------------------------------------------------------
gevent
pymongo
redis
grequests
requests
pyquery
celery
retrying

--------------------------------------------------------------------------------