├── README.md
├── db
│   ├── __init__.py
│   ├── mongo_helper.py
│   └── motor_helper.py
├── infoq_detail_spider.py
├── infoq_seed_spider.py
├── logger
│   ├── __init__.py
│   └── log.py
└── tool
    ├── __init__.py
    └── headers_format.py

/README.md:
--------------------------------------------------------------------------------
### Bug fixes
1. Fixed the SSL error.
2. Fixed the task splitting in `branch`.

----
### The async module aiohttp in practice: crawling InfoQ
> I have written quite a few articles about asyncio before, but most of them only explained its use through small, isolated snippets. This time I want to walk through a concrete project that shows how the async HTTP module aiohttp works together with asyncio.

The example site is InfoQ. What kind of site is it? Quoting the official description: "InfoQ is a practice-driven community news site dedicated to spreading knowledge and innovation in software development. ... Subscribe to the InfoQ weekly digest and join a large technical community of experienced developers." So InfoQ is clearly a technology site, and in this article we crawl its recommended-content feed.

### The approach
The crawl is split into two steps: first scrape the list pages, then scrape the detail pages.

The list-page data is saved to a database (MongoDB here), together with an extra status field so that the detail crawl can later be resumed. I use three status values: 0 means initial, i.e. not crawled yet; 1 means the download has started; 2 means the download succeeded. The status tells us how many links are already finished, which is what makes resuming possible: the detail spider only reads documents whose status is 0. The status also tells us what to do after a run. For example, if all tasks have finished but some list-page documents are still at status 1, something went wrong during that crawl, so we reset those documents to 0 and run the detail spider again (a minimal reset snippet is shown at the end of the list-page code section below).

### Site analysis: the list pages
First let's see which requests the recommended-content section makes. Inspection shows that the response of https://www.infoq.cn/public/v1/my/recommond contains the data we need. This is clearly content loaded via AJAX, so the next step is to work out the request method and parameters.

We already know the URL; observing the payload shows that the request body looks like
```
{"size":12,"score":1549328400000}
```
Now we only need to find out where these parameters come from. The usual trick is to page through a few times and look for a pattern. `size` is always 12, i.e. the number of items per page, while `score` changes from page to page; searching for its value shows that it appears in the JSON of the previous page.
![](https://img2018.cnblogs.com/blog/736399/201902/736399-20190207174247059-348757324.gif)
With that we can start writing code.

### The list-page code
The list pages have to be fetched one after another (each page depends on the previous one), so concurrency does not help here and the ordinary requests module is enough.

File: infoq_seed_spider.py

The network-request part:
```python
def get_req(self, data=None):
    req = self.session.post(self.start_url, data=json.dumps(data))
    if req.status_code in [200, 201]:
        return req
```
The parsing part:
```python
def save_data(self, data):
    tasks = []
    for item in data:
        try:
            dic = {}
            uuid = item.get("uuid")
            dic["uuid"] = uuid  # the uuid turns out to be the last part of the detail-page URL
            dic["url"] = f"https://www.infoq.cn/article/{uuid}"
            dic["title"] = item.get("article_title")
            dic["cover"] = item.get("article_cover")
            dic["summary"] = item.get("article_summary")
            author = item.get("author")
            if author:
                dic["author"] = author[0].get("nickname")
            else:
                dic["author"] = item.get("no_author", "").split(":")[-1]
            score = item.get("publish_time")
            dic["publish_time"] = datetime.datetime.utcfromtimestamp(score / 1000).strftime("%Y-%m-%d %H:%M:%S")
            dic["tags"] = ",".join([data.get("name") for data in item.get("topic")])
            translate = item.get("translator")
            dic["translator"] = dic["author"]
            if translate:
                dic["translator"] = translate[0].get("nickname")
            dic["status"] = 0
            dic["update_time"] = datetime.datetime.now().strftime("%Y-%m-%d %H:%M:%S")
            tasks.append(dic)
        except IndexError:
            crawler.error("Failed to parse an item")
    Mongo().save_data(tasks)
    crawler.info(f"add {len(tasks)} datas to mongodb")
    return score
```
The entry point:
```python
def start(self):
    i = 0
    post_data = {"size": 12}
    while i < 10:  # only ten pages are crawled here; adjust the limit as you like
        req = self.get_req(post_data)
        data = req.json().get("data")
        score = self.save_data(data)
        post_data.update({"score": score})  # the previous page provides the parameter for the next request
        i += 1
        time.sleep(random.randint(0, 5))
```
The list-page logic is simple and synchronous, so there is not much to add. Beginners should mainly watch the request volume and frequency; since this is only a learning example, both are kept deliberately low here.
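About the resume logic mentioned in the approach section: if a detail-spider run dies halfway, documents stuck at status 1 just need to be reset to 0 before the spider is started again. A minimal sketch of that reset, assuming MongoDB is reachable with the `db_configs` shown later and the `infoq_seed` collection used by this project (the repository's `MotorBase.reset_status` does the same thing asynchronously with motor):
```python
import pymongo

# connection values taken from db_configs: host 127.0.0.1, port 27017, db spider_data
client = pymongo.MongoClient("mongodb://127.0.0.1:27017")
col = client["spider_data"]["infoq_seed"]

# documents left at status 1 were still downloading when the run died;
# resetting them to 0 lets the detail spider pick them up again
result = col.update_many({"status": 1}, {"$set": {"status": 0}})
print(f"reset {result.modified_count} documents to status 0")
```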
### Site analysis: the detail pages
The detail data is stored in two parts: the cover image of each article is saved to the local disk, and the article content itself goes into the database. When an image is saved, its file path is stored alongside the article so that the illustration can easily be found again later.

### Thinking about the detail pages
At first I assumed that requesting a URL such as https://www.infoq.cn/article/cY3lZj1G-cR2iJ3ONUd6 would be enough, but it turned out I was thinking too simply: the detail page is also loaded via AJAX. An analysis similar to the one for the list pages shows that the request goes to https://www.infoq.cn/public/v1/article/getDetail and that the only parameter is uuid, i.e. the last part of the detail-page URL, which we already collected while crawling the list pages.

In other words, the detail spider is driven by the database, and that is exactly what allows us to fetch the pages concurrently. This is where we get to know aiohttp and a few other async libraries.

### The detail-page code
First, the packages we need: aiohttp for async network requests, motor for async MongoDB access, aiofiles for async file operations, and async_timeout for request timeouts. All of them can be installed with pip. Let's go through them one by one.

#### Reading the seed data
First we need to read the seed data, using pymongo (install it with pip install pymongo). The read itself involves very little work; we simply return the result as a generator:
```python
def find_data(self, col="infoq_seed"):
    # fetch the documents whose status is 0
    data = self.db[col].find({"status": 0})
    gen = (item for item in data)
    return gen
```
#### The entry function
```python
async def run(data):
    crawler.info("Start Spider")
    async with aiohttp.connector.TCPConnector(limit=300, force_close=True, enable_cleanup_closed=True) as tc:  # limit the number of TCP connections
        async with aiohttp.ClientSession(connector=tc) as session:  # one reusable session, passed down to the workers
            coros = (asyncio.ensure_future(bound_fetch(item, session)) for item in data)
            await start_branch(coros)
```
#### Splitting the coroutines into batches
The goal here is to spread the tasks evenly. For a synchronous iterator we could simply use islice from itertools:
```python
# -*- coding: utf-8 -*-
# @Time    : 2019/1/2 11:52 AM
# @Author  : cxa
# @File    : 切片.py
# @Software: PyCharm
from itertools import islice

la = (x for x in range(20))
print(type(la))
for item in islice(la, 5, 9):  # take the elements at indices 5-8
    print(item)
```
An async generator has no such helper, so the following code is used to split the work instead. It runs 10 coroutines at a time; changing the second argument of limited_as_completed sets a different concurrency level.
```python
async def start_branch(tasks):
    # feed the tasks through the limiter
    [await _ for _ in limited_as_completed(tasks, 10)]


async def first_to_finish(futures, coros):
    while True:
        await asyncio.sleep(0.01)
        for f in futures:
            if f.done():
                futures.remove(f)
                try:
                    new_future = next(coros)
                    futures.append(asyncio.ensure_future(new_future))
                except StopIteration:
                    pass
                return f.result()


def limited_as_completed(coros, limit):
    futures = [asyncio.ensure_future(c) for c in islice(coros, 0, limit)]

    while len(futures) > 0:
        yield first_to_finish(futures, coros)
```
This scheme is generally worth using once the workload grows to a million tasks or more.
Next, the network-request logic in detail.
#### Network requests
The crawl is split into two parts that run side by side: the cover image and the article content.
```python
async def bound_fetch(item, session):
    md5name = item.get("md5name")
    file_path = os.path.join(os.getcwd(), "infoq_cover")
    image_path = os.path.join(file_path, f"{md5name}.jpg")

    item["md5name"] = md5name
    item["image_path"] = image_path
    item["file_path"] = file_path
    async with sema:
        await fetch(item, session)     # coroutine that fetches the article content
        await get_buff(item, session)  # coroutine that fetches the cover image
```
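The `sema` used in `bound_fetch` is a module-level `asyncio.Semaphore(5)` in the repository, acting as a second cap on how many workers touch the network at once. A semaphore on its own is often the simplest way to limit concurrency; here is a minimal self-contained sketch of that pattern (the `worker` coroutine and the numbers are invented for illustration):
```python
import asyncio


async def worker(sema, n):
    # stand-in for bound_fetch: the semaphore caps how many run at the same time
    async with sema:
        await asyncio.sleep(0.1)  # pretend to download something
        return n


async def main():
    sema = asyncio.Semaphore(3)  # at most three workers in flight
    tasks = [asyncio.ensure_future(worker(sema, n)) for n in range(10)]
    results = await asyncio.gather(*tasks)
    print(results)  # [0, 1, ..., 9]


if __name__ == "__main__":
    loop = asyncio.get_event_loop()
    loop.run_until_complete(main())
```
Note that this creates every task object up front, which is exactly what `limited_as_completed` above avoids when the number of tasks is very large.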
The core of the content part:
```python
async def fetch(item, session, retry_index=0):
    jsondata = None
    try:
        refer = item.get("url")
        name = item.get("title")
        uuid = item.get("uuid")
        md5name = hashlib.md5(name.encode("utf-8")).hexdigest()  # file name for the cover image
        item["md5name"] = md5name
        data = {"uuid": uuid}
        headers["Referer"] = refer
        if retry_index == 0:
            await MotorBase().change_status(uuid, item, 1)  # download started; update the seed status
        with async_timeout.timeout(60):
            async with session.post(url=base_url, headers=headers, data=json.dumps(data)) as req:
                res_status = req.status

                if res_status == 200:
                    jsondata = await req.json()
                    await get_content(jsondata, item)  # extract and save the content
                    await MotorBase().change_status(uuid, item, 2)  # mark the download as successful
    except Exception as e:
        jsondata = None
    if not jsondata:
        crawler.error(f'Retry times: {retry_index + 1}')
        retry_index += 1
        if retry_index < 5:  # give up after five attempts
            return await fetch(item, session, retry_index)
```
The core of the image part:
```python
async def get_buff(item, session):
    url = item.get("cover")
    with async_timeout.timeout(60):
        async with session.get(url) as r:
            if r.status == 200:
                buff = await r.read()
                if len(buff):
                    crawler.info(f"NOW_IMAGE_URL:, {url}")
                    await get_img(item, buff)
```
#### Async database operations
First import motor, the async MongoDB driver:
```python
from motor.motor_asyncio import AsyncIOMotorClient
```
Then create the database connection:
```python
class MotorBase():
    def __init__(self):
        self.__dict__.update(**db_configs)
        if self.user:
            self.motor_uri = f"mongodb://{self.user}:{self.password}@{self.host}:{self.port}/{self.db_name}?authSource={self.user}"
        else:
            self.motor_uri = f"mongodb://{self.host}:{self.port}/{self.db_name}"
        self.client = AsyncIOMotorClient(self.motor_uri)
        self.db = self.client.spider_data
```
The code above picks one of two connection strings depending on whether a username is configured. The configuration it reads looks like this:
```python
# basic database settings
db_configs = {
    'type': 'mongo',
    'host': '127.0.0.1',
    'port': '27017',
    "user": "",
    "password": "",
    'db_name': 'spider_data'
}
```
Updating the status:
```python
async def change_status(self, uuid, item, status_code=0):
    # status_code 0: initial, 1: download started, 2: download finished
    try:
        # storage.info(f"changing status, current data: {item}")
        item["status"] = status_code
        await self.db.infoq_seed.update_one({'uuid': uuid}, {'$set': item}, upsert=True)
    except Exception as e:
        if "immutable" in e.args[0]:
            await self.db.infoq_seed.delete_one({'_id': item["_id"]})
            storage.info(f"duplicate removed: {e.args}, current data: {item}")
        else:
            storage.error(f"failed to change status: {e.args}, current data: {item}")
```
Saving the data:
```python
async def save_data(self, item):
    try:
        await self.db.infoq_details.update_one({
            'uuid': item.get("uuid")},
            {'$set': item},
            upsert=True)
    except Exception as e:
        storage.error(f"failed to insert data: {e.args}, current item: {item}")
```
Saving the image:
```python
async def get_img(item, buff):
    # make sure the cover directory exists
    file_path = item.get("file_path")
    image_path = item.get("image_path")
    if not os.path.exists(file_path):
        os.makedirs(file_path)

    # only write the file if it is not there yet
    if not os.path.exists(image_path):
        storage.info(f"SAVE_PATH:{image_path}")
        async with aiofiles.open(image_path, 'wb') as f:
            await f.write(buff)
```
motor and aiofiles are used just like pymongo and the built-in open(); the only difference is the async/await syntax.
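To make that comparison concrete, here is a minimal side-by-side sketch; the file name `demo.txt` and the sample query are invented for illustration, and the motor part assumes a local MongoDB with the `spider_data.infoq_seed` collection used in this project:
```python
import asyncio

import aiofiles
from motor.motor_asyncio import AsyncIOMotorClient


def sync_version():
    # blocking: the built-in open()
    with open("demo.txt", "w", encoding="utf-8") as f:
        f.write("hello")


async def async_version():
    # the same shape, just async/await plus the async libraries
    async with aiofiles.open("demo.txt", "w", encoding="utf-8") as f:
        await f.write("hello")

    db = AsyncIOMotorClient("mongodb://127.0.0.1:27017").spider_data
    doc = await db.infoq_seed.find_one({"status": 0})  # same query API as pymongo, but awaited
    print(doc)


if __name__ == "__main__":
    sync_version()
    asyncio.get_event_loop().run_until_complete(async_version())
```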
--------------------------------------------------------------------------------
/db/__init__.py:
--------------------------------------------------------------------------------
1 | # -*- coding: utf-8 -*-
2 | # @Time : 2019-01-28 15:41
3 | # @Author : cxa
4 | # @File : __init__.py
5 | # @Software: PyCharm
--------------------------------------------------------------------------------
/db/mongo_helper.py:
--------------------------------------------------------------------------------
1 | # -*- coding: utf-8 -*-
2 | # @Time : 2019-01-28 15:33
3 | # @Author : cxa
4 | # @File : mongo_helper.py
5 | # @Software: PyCharm
6 | import pymongo
7 | from logger.log import storage
8 | 
9 | # basic database settings
10 | db_configs = {
11 |     'type': 'mongo',
12 |     'host': '127.0.0.1',
13 |     'port': '27017',
14 |     "user": "",
15 |     "password": "",
16 |     'db_name': 'spider_data'
17 | }
18 | 
19 | 
20 | class Mongo():
21 |     def __init__(self):
22 |         self.db_name = db_configs.get("db_name")
23 |         self.host = db_configs.get("host")
24 |         self.port = db_configs.get("port")
25 |         self.client = pymongo.MongoClient(f'mongodb://{self.host}:{self.port}')
26 |         self.username = db_configs.get("user")
27 |         self.password = db_configs.get("password")
28 |         if self.username and self.password:
29 |             # connect with credentials when they are configured
30 |             self.client = pymongo.MongoClient(f'mongodb://{self.username}:{self.password}@{self.host}:{self.port}')
31 |         self.db = self.client[self.db_name]
32 | 
33 |     def find_data(self, col="infoq_seed"):
34 |         # fetch the documents whose status is 0
35 |         data = self.db[col].find({"status": 0}, {"_id": 0})
36 |         gen = (item for item in data)
37 |         return gen
38 | 
39 |     def change_status(self, uuid, item, col="infoq_seed", status_code=0):
40 |         # status_code 0: initial, 1: download started, 2: download finished
41 |         item["status"] = status_code
42 |         self.db[col].update_one({'uuid': uuid}, {'$set': item})
43 | 
44 |     def save_data(self, items, col="infoq_seed"):
45 |         if isinstance(items, list):
46 |             for item in items:
47 |                 try:
48 |                     self.db[col].update_one({
49 |                         'uuid': item.get("uuid")},
50 |                         {'$set': item},
51 |                         upsert=True)
52 |                 except Exception as e:
53 |                     storage.error(f"failed to insert data: {e.args}, current item: {item}")
54 |         else:
55 |             try:
56 |                 self.db[col].update_one({
57 |                     'uuid': items.get("uuid")},
58 |                     {'$set': items},
59 |                     upsert=True)
60 |             except Exception as e:
61 |                 storage.error(f"failed to insert data: {e.args}, current item: {items}")
62 | 
63 | 
64 | if __name__ == '__main__':
65 |     m = Mongo()
66 |     m.find_data()
67 | 
--------------------------------------------------------------------------------
/db/motor_helper.py:
--------------------------------------------------------------------------------
1 | # -*- coding: utf-8 -*-
2 | # @Time : 2019-02-02 15:57
3 | # @Author : cxa
4 | # @File : motor_helper.py
5 | # @Software: PyCharm
6 | import asyncio
7 | from logger.log import storage
8 | from motor.motor_asyncio import AsyncIOMotorClient
9 | from bson import SON
10 | import pprint
11 | 
12 | try:
13 |     import uvloop
14 | 
15 |     asyncio.set_event_loop_policy(uvloop.EventLoopPolicy())
16 | except ImportError:
17 |     pass
18 | # basic database settings
19 | db_configs = {
20 |     'type': 'mongo',
21 |     'host': '127.0.0.1',
22 |     'port': '27017',
23 |     "user": "",
24 |     "password": "",
25 |     'db_name': 'spider_data'
26 | }
27 | 
28 | 
29 | class MotorBase():
30 |     def __init__(self):
31 |         self.__dict__.update(**db_configs)
32 |         if self.user:
33 |             self.motor_uri = 
f"mongodb://{self.user}:{self.passwd}@{self.host}:{self.port}/{self.db_name}?authSource={self.user}" 34 | self.motor_uri = f"mongodb://{self.host}:{self.port}/{self.db_name}" 35 | self.client = AsyncIOMotorClient(self.motor_uri) 36 | self.db = self.client.spider_data 37 | 38 | async def save_data(self, item): 39 | try: 40 | await self.db.infoq_details.update_one({ 41 | 'uuid': item.get("uuid")}, 42 | {'$set': item}, 43 | upsert=True) 44 | except Exception as e: 45 | storage.error(f"数据插入出错:{e.args}此时的item是:{item}") 46 | 47 | async def change_status(self, uuid, status_code=0): 48 | # status_code 0:初始,1:开始下载,2下载完了 49 | # storage.info(f"修改状态,此时的数据是:{item}") 50 | item = {} 51 | item["status"] = status_code 52 | await self.db.infoq_seed.update_one({'uuid': uuid}, {'$set': item}, upsert=True) 53 | 54 | async def reset_status(self): 55 | await self.db.infoq_seed.update_many({'status': 1}, {'$set': {"status": 0}}) 56 | 57 | async def reset_all_status(self): 58 | await self.db.infoq_seed.update_many({}, {'$set': {"status": 0}}) 59 | 60 | async def get_detail_datas(self): 61 | data = self.db.infoq_seed.find({'status': 1}) 62 | 63 | async for item in data: 64 | print(item) 65 | return data 66 | 67 | async def find(self): 68 | data = self.db.infoq_seed.find({'status': 0}) 69 | 70 | async_gen = (item async for item in data) 71 | return async_gen 72 | 73 | async def use_count_command(self): 74 | response = await self.db.command(SON([("count", "infoq_seed")])) 75 | print(f'response:{pprint.pformat(response)}') 76 | 77 | 78 | if __name__ == '__main__': 79 | m = MotorBase() 80 | loop = asyncio.get_event_loop() 81 | loop.run_until_complete(m.find()) 82 | -------------------------------------------------------------------------------- /infoq_detail_spider.py: -------------------------------------------------------------------------------- 1 | # @Time : 2019/03/12 10:02 AM 2 | # @Author : cxa 3 | # @Software: PyCharm 4 | # encoding: utf-8 5 | import os 6 | import aiohttp 7 | import aiofiles 8 | import async_timeout 9 | import asyncio 10 | from logger.log import crawler, storage 11 | from db.motor_helper import MotorBase 12 | import datetime 13 | import json 14 | from w3lib.html import remove_tags 15 | from aiostream import stream 16 | from async_retrying import retry 17 | 18 | base_url = "https://www.infoq.cn/public/v1/article/getDetail" 19 | headers = { 20 | "Accept": "application/json, text/plain, */*", 21 | "Accept-Encoding": "gzip, deflate, br", 22 | "Accept-Language": "zh-CN,zh;q=0.9", 23 | "Content-Type": "application/json", 24 | "Host": "www.infoq.cn", 25 | "Origin": "https://www.infoq.cn", 26 | "Referer": "https://www.infoq.cn/article/Ns2yelhHTd0rhmu2-IzN", 27 | "User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_13_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/71.0.3578.98 Safari/537.36", 28 | } 29 | 30 | headers2 = { 31 | "User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_13_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/71.0.3578.98 Safari/537.36", 32 | } 33 | try: 34 | import uvloop 35 | 36 | asyncio.set_event_loop_policy(uvloop.EventLoopPolicy()) 37 | except ImportError: 38 | pass 39 | 40 | sema = asyncio.Semaphore(5) 41 | 42 | 43 | async def get_buff(item, session): 44 | url = item.get("cover") 45 | with async_timeout.timeout(60): 46 | async with session.get(url) as r: 47 | if r.status == 200: 48 | buff = await r.read() 49 | if len(buff): 50 | crawler.info(f"NOW_IMAGE_URL:, {url}") 51 | await get_img(item, buff) 52 | 53 | 54 | async def get_img(item, buff): 55 | # 题目层目录是否存在 56 | 
file_path = item.get("file_path") 57 | image_path = item.get("image_path") 58 | if not os.path.exists(file_path): 59 | os.makedirs(file_path) 60 | 61 | # 文件是否存在 62 | if not os.path.exists(image_path): 63 | storage.info(f"SAVE_PATH:{image_path}") 64 | async with aiofiles.open(image_path, 'wb') as f: 65 | await f.write(buff) 66 | 67 | 68 | async def get_content(source, item): 69 | dic = {} 70 | dic["uuid"] = item.get("uuid") 71 | dic["title"] = item.get("title") 72 | dic["author"] = item.get("author") 73 | dic["publish_time"] = item.get("publish_time") 74 | dic["cover_url"] = item.get("cover") 75 | dic["tags"] = item.get("tags") 76 | dic["image_path"] = item.get("image_path") 77 | dic["md5name"] = item.get("md5name") 78 | html_content = source.get("data").get("content") 79 | dic["html"] = html_content 80 | dic["content"] = remove_tags(html_content) 81 | dic["update_time"] = datetime.datetime.now().strftime("%Y-%m-%d %H:%M:%S") 82 | await MotorBase().save_data(dic) 83 | 84 | 85 | @retry(attempts=5) 86 | async def fetch(item, session, retry_index=0): 87 | ''' 88 | 对内容的下载,重试次数5次 89 | :param item: 90 | :param session: 91 | :param retry_index: 92 | :return: 93 | ''' 94 | refer = item.get("url") 95 | uuid = item.get("uuid") 96 | if retry_index == 0: 97 | await MotorBase().change_status(uuid, 1) # 开始下载 98 | data = {"uuid": uuid} 99 | headers["Referer"] = refer 100 | with async_timeout.timeout(60): 101 | async with session.post(url=base_url, headers=headers, data=json.dumps(data)) as req: 102 | res_status = req.status 103 | 104 | if res_status == 200: 105 | jsondata = await req.json() 106 | await get_content(jsondata, item) 107 | await MotorBase().change_status(uuid, 2) # 下载成功 108 | 109 | 110 | async def bound_fetch(item, session): 111 | ''' 112 | 分别处理图片和内容的下载 113 | :param item: 114 | :param session: 115 | :return: 116 | ''' 117 | md5name = item.get("md5name") 118 | file_path = os.path.join(os.getcwd(), "infoq_cover") 119 | image_path = os.path.join(file_path, f"{md5name}.jpg") 120 | item["md5name"] = md5name 121 | item["image_path"] = image_path 122 | item["file_path"] = file_path 123 | async with sema: 124 | await fetch(item, session) 125 | await get_buff(item, session) 126 | 127 | 128 | async def branch(coros, limit=10): 129 | ''' 130 | 使用aiostream模块对异步生成器做一个切片操作。这里并发量为10. 
131 | :param coros: 异步生成器 132 | :param limit: 并发次数 133 | :return: 134 | ''' 135 | index = 0 136 | flag = True 137 | while flag: 138 | xs = stream.iterate(coros) 139 | ys = xs[index:index + limit] 140 | t = await stream.list(ys) 141 | if not t: 142 | flag = False 143 | else: 144 | await asyncio.ensure_future(asyncio.gather(*t)) 145 | index += limit 146 | 147 | 148 | async def run(): 149 | ''' 150 | 入口函数 151 | :return: 152 | ''' 153 | data = await MotorBase().find() 154 | crawler.info("Start Spider") 155 | async with aiohttp.connector.TCPConnector(limit=300, force_close=True, enable_cleanup_closed=True,verify_ssl=False) as tc: 156 | async with aiohttp.ClientSession(connector=tc) as session: 157 | coros = (asyncio.ensure_future(bound_fetch(item, session)) async for item in data) 158 | await branch(coros) 159 | 160 | 161 | if __name__ == '__main__': 162 | loop = asyncio.get_event_loop() 163 | loop.run_until_complete(run()) 164 | loop.close() 165 | -------------------------------------------------------------------------------- /infoq_seed_spider.py: -------------------------------------------------------------------------------- 1 | # -*- coding: utf-8 -*- 2 | # @Time : 2019-01-28 14:41 3 | # @Author : cxa 4 | # @File : infoq_seed_spider.py 5 | # @Software: PyCharm 6 | import requests 7 | import json 8 | import time 9 | import random 10 | import datetime 11 | from logger.log import crawler, storage 12 | from db.mongo_helper import Mongo 13 | import hashlib 14 | 15 | 16 | class InfoQ_Seed_Spider(): 17 | def __init__(self): 18 | self.start_url = "https://www.infoq.cn/public/v1/my/recommond" 19 | self.headers = { 20 | "Accept": "application/json, text/plain, */*", 21 | "Accept-Encoding": "gzip, deflate, br", 22 | "Accept-Language": "zh-CN,zh;q=0.9", 23 | "Content-Type": "application/json", 24 | "Host": "www.infoq.cn", 25 | "Origin": "https://www.infoq.cn", 26 | "Referer": "https://www.infoq.cn/", 27 | "User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_13_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/71.0.3578.98 Safari/537.36", 28 | } 29 | self.session = requests.Session() 30 | self.session.headers.update(self.headers) 31 | 32 | def get_req(self, data=None): 33 | ''' 34 | 请求列表页 35 | :param data: 36 | :return: 37 | ''' 38 | req = self.session.post(self.start_url, data=json.dumps(data)) 39 | if req.status_code in [200, 201]: 40 | return req 41 | 42 | def save_data(self, data): 43 | tasks = [] 44 | for item in data: 45 | try: 46 | dic = {} 47 | uuid = item.get("uuid") 48 | dic["uuid"] = uuid 49 | dic["url"] = f"https://www.infoq.cn/article/{uuid}" 50 | title = item.get("article_title") 51 | dic["title"] = title 52 | dic["cover"] = item.get("article_cover") 53 | dic["summary"] = item.get("article_summary") 54 | author = item.get("author") 55 | if author: 56 | dic["author"] = author[0].get("nickname") 57 | else: 58 | dic["author"] = item.get("no_author", "").split(":")[-1] 59 | score = item.get("publish_time") 60 | dic["publish_time"] = datetime.datetime.utcfromtimestamp(score / 1000).strftime("%Y-%m-%d %H:%M:%S") 61 | dic["tags"] = ",".join([data.get("name") for data in item.get("topic")]) 62 | translate = item.get("translator") 63 | dic["translator"] = dic["author"] 64 | if translate: 65 | dic["translator"] = translate[0].get("nickname") 66 | dic["status"] = 0 67 | md5name = hashlib.md5(title.encode("utf-8")).hexdigest() # 图片的名字 68 | dic["md5name"] = md5name 69 | dic["update_time"] = datetime.datetime.now().strftime("%Y-%m-%d %H:%M:%S") 70 | tasks.append(dic) 71 | except IndexError as e: 72 
| crawler.error("解析出错") 73 | Mongo().save_data(tasks) 74 | crawler.info(f"add {len(tasks)} datas to mongodb") 75 | return score 76 | 77 | def start(self): 78 | i = 0 79 | post_data = {"size": 12} 80 | while i < 4: 81 | req = self.get_req(post_data) 82 | data = req.json().get("data") 83 | score = self.save_data(data) 84 | post_data.update({"score": score}) 85 | i += 1 86 | time.sleep(random.randint(0, 5)) 87 | 88 | 89 | if __name__ == '__main__': 90 | iss = InfoQ_Seed_Spider() 91 | iss.start() 92 | -------------------------------------------------------------------------------- /logger/__init__.py: -------------------------------------------------------------------------------- 1 | # -*- coding: utf-8 -*- 2 | # @Time : 2019-01-28 15:37 3 | # @Author : cxa 4 | # @File : __init__.py.py 5 | # @Software: PyCharm -------------------------------------------------------------------------------- /logger/log.py: -------------------------------------------------------------------------------- 1 | # -*- coding: utf-8 -*- 2 | # @Time : 2019-01-28 15:37 3 | # @Author : cxa 4 | # @File : log.py 5 | # @Software: PyCharm 6 | import os 7 | import logging 8 | import logging.config as log_conf 9 | import datetime 10 | import coloredlogs 11 | 12 | log_dir = os.path.dirname(os.path.dirname(__file__)) + '/logs' 13 | if not os.path.exists(log_dir): 14 | os.mkdir(log_dir) 15 | today = datetime.datetime.now().strftime("%Y%m%d") 16 | 17 | log_path = os.path.join(log_dir, f'infoq_{today}.log') 18 | 19 | log_config = { 20 | 'version': 1.0, 21 | 'formatters': { 22 | 'colored_console': {'()': 'coloredlogs.ColoredFormatter', 23 | 'format': "%(asctime)s - %(name)s - %(levelname)s - %(message)s", 'datefmt': '%H:%M:%S'}, 24 | 'detail': { 25 | 'format': '%(asctime)s - %(name)s - %(levelname)s - %(message)s', 26 | 'datefmt': "%Y-%m-%d %H:%M:%S" # 如果不加这个会显示到毫秒。 27 | }, 28 | 'simple': { 29 | 'format': '%(name)s - %(levelname)s - %(message)s', 30 | }, 31 | }, 32 | 'handlers': { 33 | 'console': { 34 | 'class': 'logging.StreamHandler', # 日志打印到屏幕显示的类。 35 | 'level': 'INFO', 36 | 'formatter': 'colored_console' 37 | }, 38 | 'file': { 39 | 'class': 'logging.handlers.RotatingFileHandler', # 日志打印到文件的类。 40 | 'maxBytes': 1024 * 1024 * 1024, # 单个文件最大内存 41 | 'backupCount': 1, # 备份的文件个数 42 | 'filename': log_path, # 日志文件名 43 | 'level': 'INFO', # 日志等级 44 | 'formatter': 'detail', # 调用上面的哪个格式 45 | 'encoding': 'utf-8', # 编码 46 | }, 47 | }, 48 | 'loggers': { 49 | 'crawler': { 50 | 'handlers': ['console', 'file'], # 只打印屏幕 51 | 'level': 'DEBUG', # 只显示错误的log 52 | }, 53 | 'parser': { 54 | 'handlers': ['file'], 55 | 'level': 'INFO', 56 | }, 57 | 'other': { 58 | 'handlers': ['console', 'file'], 59 | 'level': 'INFO', 60 | }, 61 | 'storage': { 62 | 'handlers': ['console', 'file'], 63 | 'level': 'INFO', 64 | } 65 | } 66 | } 67 | 68 | log_conf.dictConfig(log_config) 69 | 70 | crawler = logging.getLogger('crawler') 71 | storage = logging.getLogger('storage') 72 | coloredlogs.install(level='DEBUG', logger=crawler) 73 | coloredlogs.install(level='DEBUG', logger=storage) 74 | -------------------------------------------------------------------------------- /tool/__init__.py: -------------------------------------------------------------------------------- 1 | # -*- coding: utf-8 -*- 2 | # @Time : 2019-01-28 14:47 3 | # @Author : cxa 4 | # @File : __init__.py.py 5 | # @Software: PyCharm -------------------------------------------------------------------------------- /tool/headers_format.py: 
-------------------------------------------------------------------------------- 1 | # -*- coding: utf-8 -*- 2 | # @Time : 2019-01-28 14:47 3 | # @Author : cxa 4 | # @File : headers_format.py 5 | # @Software: PyCharm 6 | headers = """ 7 | Accept: application/json, text/plain, */* 8 | Accept-Encoding: gzip, deflate, br 9 | Accept-Language: zh-CN,zh;q=0.9 10 | Content-Type: application/json 11 | Host: www.infoq.cn 12 | Origin: https://www.infoq.cn 13 | Referer: https://www.infoq.cn/article/Ns2yelhHTd0rhmu2-IzN 14 | User-Agent: Mozilla/5.0 (Macintosh; Intel Mac OS X 10_13_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/71.0.3578.98 Safari/537.36 15 | """ 16 | hs = headers.split('\n') 17 | b = [k for k in hs if len(k)] 18 | e = b 19 | f = {(i.split(":")[0], i.split(":", 1)[1].strip()) for i in e} 20 | g = sorted(f) 21 | index = 0 22 | print("{") 23 | for k, v in g: 24 | print(repr(k).replace('\'','"'), repr(v).replace('\'','"'), sep=':', end=",\n") 25 | print("}") 26 | --------------------------------------------------------------------------------