├── README.md
├── db
│   ├── __init__.py
│   ├── mongo_helper.py
│   └── motor_helper.py
├── infoq_detail_spider.py
├── infoq_seed_spider.py
├── logger
│   ├── __init__.py
│   └── log.py
└── tool
    ├── __init__.py
    └── headers_format.py

/README.md:
--------------------------------------------------------------------------------
### Bug fixes
1. Fixed the SSL error.
2. Fixed the task splitting in `branch`.

----
### The async module aiohttp in practice: crawling InfoQ
> I have written quite a few articles about asyncio before, but most of them only explained its use through small, isolated snippets. This time I want to walk through a concrete project that shows how the async HTTP module aiohttp works together with asyncio.

The example site is InfoQ. What kind of site is it? Quoting the official description: "InfoQ is a practice-driven community news site dedicated to spreading knowledge and innovation in software development. ... Subscribe to the InfoQ weekly digest and join a large technical community of experienced developers." So InfoQ is clearly a technology site, and in this article we crawl its recommended-content feed.

### The approach
The crawl is split into two steps: first scrape the list pages, then scrape the detail pages.

The list-page data is saved to a database (MongoDB here), together with an extra status field so that the detail crawl can later be resumed. I use three status values: 0 means initial, i.e. not crawled yet; 1 means the download has started; 2 means the download succeeded. The status tells us how many links are already finished, which is what makes resuming possible: the detail spider only reads documents whose status is 0. The status also tells us what to do after a run. For example, if all tasks have finished but some list-page documents are still at status 1, something went wrong during that crawl, so we reset those documents to 0 and run the detail spider again (a minimal reset snippet is shown at the end of the list-page code section below).

### Site analysis: the list pages
First let's see which requests the recommended-content section makes. Inspection shows that the response of https://www.infoq.cn/public/v1/my/recommond contains the data we need. This is clearly content loaded via AJAX, so the next step is to work out the request method and parameters.

We already know the URL; observing the payload shows that the request body looks like
```
{"size":12,"score":1549328400000}
```
Now we only need to find out where these parameters come from. The usual trick is to page through a few times and look for a pattern. `size` is always 12, i.e. the number of items per page, while `score` changes from page to page; searching for its value shows that it appears in the JSON of the previous page.
![](https://img2018.cnblogs.com/blog/736399/201902/736399-20190207174247059-348757324.gif)
With that we can start writing code.

### The list-page code
The list pages have to be fetched one after another (each page depends on the previous one), so concurrency does not help here and the ordinary requests module is enough.

File: infoq_seed_spider.py

The network-request part:
```python
def get_req(self, data=None):
    req = self.session.post(self.start_url, data=json.dumps(data))
    if req.status_code in [200, 201]:
        return req
```
The parsing part:
```python
def save_data(self, data):
    tasks = []
    for item in data:
        try:
            dic = {}
            uuid = item.get("uuid")
            dic["uuid"] = uuid  # the uuid turns out to be the last part of the detail-page URL
            dic["url"] = f"https://www.infoq.cn/article/{uuid}"
            dic["title"] = item.get("article_title")
            dic["cover"] = item.get("article_cover")
            dic["summary"] = item.get("article_summary")
            author = item.get("author")
            if author:
                dic["author"] = author[0].get("nickname")
            else:
                dic["author"] = item.get("no_author", "").split(":")[-1]
            score = item.get("publish_time")
            dic["publish_time"] = datetime.datetime.utcfromtimestamp(score / 1000).strftime("%Y-%m-%d %H:%M:%S")
            dic["tags"] = ",".join([data.get("name") for data in item.get("topic")])
            translate = item.get("translator")
            dic["translator"] = dic["author"]
            if translate:
                dic["translator"] = translate[0].get("nickname")
            dic["status"] = 0
            dic["update_time"] = datetime.datetime.now().strftime("%Y-%m-%d %H:%M:%S")
            tasks.append(dic)
        except IndexError:
            crawler.error("Failed to parse an item")
    Mongo().save_data(tasks)
    crawler.info(f"add {len(tasks)} datas to mongodb")
    return score
```
The entry point:
```python
def start(self):
    i = 0
    post_data = {"size": 12}
    while i < 10:  # only ten pages are crawled here; adjust the limit as you like
        req = self.get_req(post_data)
        data = req.json().get("data")
        score = self.save_data(data)
        post_data.update({"score": score})  # the previous page provides the parameter for the next request
        i += 1
        time.sleep(random.randint(0, 5))
```
The list-page logic is simple and synchronous, so there is not much to add. Beginners should mainly watch the request volume and frequency; since this is only a learning example, both are kept deliberately low here.
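About the resume logic mentioned in the approach section: if a detail-spider run dies halfway, documents stuck at status 1 just need to be reset to 0 before the spider is started again. A minimal sketch of that reset, assuming MongoDB is reachable with the `db_configs` shown later and the `infoq_seed` collection used by this project (the repository's `MotorBase.reset_status` does the same thing asynchronously with motor):
```python
import pymongo

# connection values taken from db_configs: host 127.0.0.1, port 27017, db spider_data
client = pymongo.MongoClient("mongodb://127.0.0.1:27017")
col = client["spider_data"]["infoq_seed"]

# documents left at status 1 were still downloading when the run died;
# resetting them to 0 lets the detail spider pick them up again
result = col.update_many({"status": 1}, {"$set": {"status": 0}})
print(f"reset {result.modified_count} documents to status 0")
```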
### Site analysis: the detail pages
The detail data is stored in two parts: the cover image of each article is saved to the local disk, and the article content itself goes into the database. When an image is saved, its file path is stored alongside the article so that the illustration can easily be found again later.

### Thinking about the detail pages
At first I assumed that requesting a URL such as https://www.infoq.cn/article/cY3lZj1G-cR2iJ3ONUd6 would be enough, but it turned out I was thinking too simply: the detail page is also loaded via AJAX. An analysis similar to the one for the list pages shows that the request goes to https://www.infoq.cn/public/v1/article/getDetail and that the only parameter is uuid, i.e. the last part of the detail-page URL, which we already collected while crawling the list pages.

In other words, the detail spider is driven by the database, and that is exactly what allows us to fetch the pages concurrently. This is where we get to know aiohttp and a few other async libraries.

### The detail-page code
First, the packages we need: aiohttp for async network requests, motor for async MongoDB access, aiofiles for async file operations, and async_timeout for request timeouts. All of them can be installed with pip. Let's go through them one by one.

#### Reading the seed data
First we need to read the seed data, using pymongo (install it with pip install pymongo). The read itself involves very little work; we simply return the result as a generator:
```python
def find_data(self, col="infoq_seed"):
    # fetch the documents whose status is 0
    data = self.db[col].find({"status": 0})
    gen = (item for item in data)
    return gen
```
#### The entry function
```python
async def run(data):
    crawler.info("Start Spider")
    async with aiohttp.connector.TCPConnector(limit=300, force_close=True, enable_cleanup_closed=True) as tc:  # limit the number of TCP connections
        async with aiohttp.ClientSession(connector=tc) as session:  # one reusable session, passed down to the workers
            coros = (asyncio.ensure_future(bound_fetch(item, session)) for item in data)
            await start_branch(coros)
```
#### Splitting the coroutines into batches
The goal here is to spread the tasks evenly. For a synchronous iterator we could simply use islice from itertools:
```python
# -*- coding: utf-8 -*-
# @Time    : 2019/1/2 11:52 AM
# @Author  : cxa
# @File    : 切片.py
# @Software: PyCharm
from itertools import islice

la = (x for x in range(20))
print(type(la))
for item in islice(la, 5, 9):  # take the elements at indices 5-8
    print(item)
```
An async generator has no such helper, so the following code is used to split the work instead. It runs 10 coroutines at a time; changing the second argument of limited_as_completed sets a different concurrency level.
```python
async def start_branch(tasks):
    # feed the tasks through the limiter
    [await _ for _ in limited_as_completed(tasks, 10)]


async def first_to_finish(futures, coros):
    while True:
        await asyncio.sleep(0.01)
        for f in futures:
            if f.done():
                futures.remove(f)
                try:
                    new_future = next(coros)
                    futures.append(asyncio.ensure_future(new_future))
                except StopIteration:
                    pass
                return f.result()


def limited_as_completed(coros, limit):
    futures = [asyncio.ensure_future(c) for c in islice(coros, 0, limit)]

    while len(futures) > 0:
        yield first_to_finish(futures, coros)
```
This scheme is generally worth using once the workload grows to a million tasks or more.
Next, the network-request logic in detail.
#### Network requests
The crawl is split into two parts that run side by side: the cover image and the article content.
```python
async def bound_fetch(item, session):
    md5name = item.get("md5name")
    file_path = os.path.join(os.getcwd(), "infoq_cover")
    image_path = os.path.join(file_path, f"{md5name}.jpg")

    item["md5name"] = md5name
    item["image_path"] = image_path
    item["file_path"] = file_path
    async with sema:
        await fetch(item, session)     # coroutine that fetches the article content
        await get_buff(item, session)  # coroutine that fetches the cover image
```
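The `sema` used in `bound_fetch` is a module-level `asyncio.Semaphore(5)` in the repository, acting as a second cap on how many workers touch the network at once. A semaphore on its own is often the simplest way to limit concurrency; here is a minimal self-contained sketch of that pattern (the `worker` coroutine and the numbers are invented for illustration):
```python
import asyncio


async def worker(sema, n):
    # stand-in for bound_fetch: the semaphore caps how many run at the same time
    async with sema:
        await asyncio.sleep(0.1)  # pretend to download something
        return n


async def main():
    sema = asyncio.Semaphore(3)  # at most three workers in flight
    tasks = [asyncio.ensure_future(worker(sema, n)) for n in range(10)]
    results = await asyncio.gather(*tasks)
    print(results)  # [0, 1, ..., 9]


if __name__ == "__main__":
    loop = asyncio.get_event_loop()
    loop.run_until_complete(main())
```
Note that this creates every task object up front, which is exactly what `limited_as_completed` above avoids when the number of tasks is very large.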
The core of the content part:
```python
async def fetch(item, session, retry_index=0):
    jsondata = None
    try:
        refer = item.get("url")
        name = item.get("title")
        uuid = item.get("uuid")
        md5name = hashlib.md5(name.encode("utf-8")).hexdigest()  # file name for the cover image
        item["md5name"] = md5name
        data = {"uuid": uuid}
        headers["Referer"] = refer
        if retry_index == 0:
            await MotorBase().change_status(uuid, item, 1)  # download started; update the seed status
        with async_timeout.timeout(60):
            async with session.post(url=base_url, headers=headers, data=json.dumps(data)) as req:
                res_status = req.status

                if res_status == 200:
                    jsondata = await req.json()
                    await get_content(jsondata, item)  # extract and save the content
                    await MotorBase().change_status(uuid, item, 2)  # mark the download as successful
    except Exception as e:
        jsondata = None
    if not jsondata:
        crawler.error(f'Retry times: {retry_index + 1}')
        retry_index += 1
        if retry_index < 5:  # give up after five attempts
            return await fetch(item, session, retry_index)
```
The core of the image part:
```python
async def get_buff(item, session):
    url = item.get("cover")
    with async_timeout.timeout(60):
        async with session.get(url) as r:
            if r.status == 200:
                buff = await r.read()
                if len(buff):
                    crawler.info(f"NOW_IMAGE_URL:, {url}")
                    await get_img(item, buff)
```
#### Async database operations
First import motor, the async MongoDB driver:
```python
from motor.motor_asyncio import AsyncIOMotorClient
```
Then create the database connection:
```python
class MotorBase():
    def __init__(self):
        self.__dict__.update(**db_configs)
        if self.user:
            self.motor_uri = f"mongodb://{self.user}:{self.password}@{self.host}:{self.port}/{self.db_name}?authSource={self.user}"
        else:
            self.motor_uri = f"mongodb://{self.host}:{self.port}/{self.db_name}"
        self.client = AsyncIOMotorClient(self.motor_uri)
        self.db = self.client.spider_data
```
The code above picks one of two connection strings depending on whether a username is configured. The configuration it reads looks like this:
```python
# basic database settings
db_configs = {
    'type': 'mongo',
    'host': '127.0.0.1',
    'port': '27017',
    "user": "",
    "password": "",
    'db_name': 'spider_data'
}
```
Updating the status:
```python
async def change_status(self, uuid, item, status_code=0):
    # status_code 0: initial, 1: download started, 2: download finished
    try:
        # storage.info(f"changing status, current data: {item}")
        item["status"] = status_code
        await self.db.infoq_seed.update_one({'uuid': uuid}, {'$set': item}, upsert=True)
    except Exception as e:
        if "immutable" in e.args[0]:
            await self.db.infoq_seed.delete_one({'_id': item["_id"]})
            storage.info(f"duplicate removed: {e.args}, current data: {item}")
        else:
            storage.error(f"failed to change status: {e.args}, current data: {item}")
```
Saving the data:
```python
async def save_data(self, item):
    try:
        await self.db.infoq_details.update_one({
            'uuid': item.get("uuid")},
            {'$set': item},
            upsert=True)
    except Exception as e:
        storage.error(f"failed to insert data: {e.args}, current item: {item}")
```
Saving the image:
```python
async def get_img(item, buff):
    # make sure the cover directory exists
    file_path = item.get("file_path")
    image_path = item.get("image_path")
    if not os.path.exists(file_path):
        os.makedirs(file_path)

    # only write the file if it is not there yet
    if not os.path.exists(image_path):
        storage.info(f"SAVE_PATH:{image_path}")
        async with aiofiles.open(image_path, 'wb') as f:
            await f.write(buff)
```
motor and aiofiles are used just like pymongo and the built-in open(); the only difference is the async/await syntax.
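To make that comparison concrete, here is a minimal side-by-side sketch; the file name `demo.txt` and the sample query are invented for illustration, and the motor part assumes a local MongoDB with the `spider_data.infoq_seed` collection used in this project:
```python
import asyncio

import aiofiles
from motor.motor_asyncio import AsyncIOMotorClient


def sync_version():
    # blocking: the built-in open()
    with open("demo.txt", "w", encoding="utf-8") as f:
        f.write("hello")


async def async_version():
    # the same shape, just async/await plus the async libraries
    async with aiofiles.open("demo.txt", "w", encoding="utf-8") as f:
        await f.write("hello")

    db = AsyncIOMotorClient("mongodb://127.0.0.1:27017").spider_data
    doc = await db.infoq_seed.find_one({"status": 0})  # same query API as pymongo, but awaited
    print(doc)


if __name__ == "__main__":
    sync_version()
    asyncio.get_event_loop().run_until_complete(async_version())
```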
--------------------------------------------------------------------------------
/db/__init__.py:
--------------------------------------------------------------------------------
1 | # -*- coding: utf-8 -*-
2 | # @Time : 2019-01-28 15:41
3 | # @Author : cxa
4 | # @File : __init__.py
5 | # @Software: PyCharm
--------------------------------------------------------------------------------
/db/mongo_helper.py:
--------------------------------------------------------------------------------
1 | # -*- coding: utf-8 -*-
2 | # @Time : 2019-01-28 15:33
3 | # @Author : cxa
4 | # @File : mongo_helper.py
5 | # @Software: PyCharm
6 | import pymongo
7 | from logger.log import storage
8 | 
9 | # basic database settings
10 | db_configs = {
11 |     'type': 'mongo',
12 |     'host': '127.0.0.1',
13 |     'port': '27017',
14 |     "user": "",
15 |     "password": "",
16 |     'db_name': 'spider_data'
17 | }
18 | 
19 | 
20 | class Mongo():
21 |     def __init__(self):
22 |         self.db_name = db_configs.get("db_name")
23 |         self.host = db_configs.get("host")
24 |         self.port = db_configs.get("port")
25 |         self.client = pymongo.MongoClient(f'mongodb://{self.host}:{self.port}')
26 |         self.username = db_configs.get("user")
27 |         self.password = db_configs.get("password")
28 |         if self.username and self.password:
29 |             # connect with credentials when they are configured
30 |             self.client = pymongo.MongoClient(f'mongodb://{self.username}:{self.password}@{self.host}:{self.port}')
31 |         self.db = self.client[self.db_name]
32 | 
33 |     def find_data(self, col="infoq_seed"):
34 |         # fetch the documents whose status is 0
35 |         data = self.db[col].find({"status": 0}, {"_id": 0})
36 |         gen = (item for item in data)
37 |         return gen
38 | 
39 |     def change_status(self, uuid, item, col="infoq_seed", status_code=0):
40 |         # status_code 0: initial, 1: download started, 2: download finished
41 |         item["status"] = status_code
42 |         self.db[col].update_one({'uuid': uuid}, {'$set': item})
43 | 
44 |     def save_data(self, items, col="infoq_seed"):
45 |         if isinstance(items, list):
46 |             for item in items:
47 |                 try:
48 |                     self.db[col].update_one({
49 |                         'uuid': item.get("uuid")},
50 |                         {'$set': item},
51 |                         upsert=True)
52 |                 except Exception as e:
53 |                     storage.error(f"failed to insert data: {e.args}, current item: {item}")
54 |         else:
55 |             try:
56 |                 self.db[col].update_one({
57 |                     'uuid': items.get("uuid")},
58 |                     {'$set': items},
59 |                     upsert=True)
60 |             except Exception as e:
61 |                 storage.error(f"failed to insert data: {e.args}, current item: {items}")
62 | 
63 | 
64 | if __name__ == '__main__':
65 |     m = Mongo()
66 |     m.find_data()
67 | 
--------------------------------------------------------------------------------
/db/motor_helper.py:
--------------------------------------------------------------------------------
1 | # -*- coding: utf-8 -*-
2 | # @Time : 2019-02-02 15:57
3 | # @Author : cxa
4 | # @File : motor_helper.py
5 | # @Software: PyCharm
6 | import asyncio
7 | from logger.log import storage
8 | from motor.motor_asyncio import AsyncIOMotorClient
9 | from bson import SON
10 | import pprint
11 | 
12 | try:
13 |     import uvloop
14 | 
15 |     asyncio.set_event_loop_policy(uvloop.EventLoopPolicy())
16 | except ImportError:
17 |     pass
18 | # basic database settings
19 | db_configs = {
20 |     'type': 'mongo',
21 |     'host': '127.0.0.1',
22 |     'port': '27017',
23 |     "user": "",
24 |     "password": "",
25 |     'db_name': 'spider_data'
26 | }
27 | 
28 | 
29 | class MotorBase():
30 |     def __init__(self):
31 |         self.__dict__.update(**db_configs)
32 |         if self.user:
33 |             self.motor_uri = 
f"mongodb://{self.user}:{self.passwd}@{self.host}:{self.port}/{self.db_name}?authSource={self.user}" 34 | self.motor_uri = f"mongodb://{self.host}:{self.port}/{self.db_name}" 35 | self.client = AsyncIOMotorClient(self.motor_uri) 36 | self.db = self.client.spider_data 37 | 38 | async def save_data(self, item): 39 | try: 40 | await self.db.infoq_details.update_one({ 41 | 'uuid': item.get("uuid")}, 42 | {'$set': item}, 43 | upsert=True) 44 | except Exception as e: 45 | storage.error(f"数据插入出错:{e.args}此时的item是:{item}") 46 | 47 | async def change_status(self, uuid, status_code=0): 48 | # status_code 0:初始,1:开始下载,2下载完了 49 | # storage.info(f"修改状态,此时的数据是:{item}") 50 | item = {} 51 | item["status"] = status_code 52 | await self.db.infoq_seed.update_one({'uuid': uuid}, {'$set': item}, upsert=True) 53 | 54 | async def reset_status(self): 55 | await self.db.infoq_seed.update_many({'status': 1}, {'$set': {"status": 0}}) 56 | 57 | async def reset_all_status(self): 58 | await self.db.infoq_seed.update_many({}, {'$set': {"status": 0}}) 59 | 60 | async def get_detail_datas(self): 61 | data = self.db.infoq_seed.find({'status': 1}) 62 | 63 | async for item in data: 64 | print(item) 65 | return data 66 | 67 | async def find(self): 68 | data = self.db.infoq_seed.find({'status': 0}) 69 | 70 | async_gen = (item async for item in data) 71 | return async_gen 72 | 73 | async def use_count_command(self): 74 | response = await self.db.command(SON([("count", "infoq_seed")])) 75 | print(f'response:{pprint.pformat(response)}') 76 | 77 | 78 | if __name__ == '__main__': 79 | m = MotorBase() 80 | loop = asyncio.get_event_loop() 81 | loop.run_until_complete(m.find()) 82 | -------------------------------------------------------------------------------- /infoq_detail_spider.py: -------------------------------------------------------------------------------- 1 | # @Time : 2019/03/12 10:02 AM 2 | # @Author : cxa 3 | # @Software: PyCharm 4 | # encoding: utf-8 5 | import os 6 | import aiohttp 7 | import aiofiles 8 | import async_timeout 9 | import asyncio 10 | from logger.log import crawler, storage 11 | from db.motor_helper import MotorBase 12 | import datetime 13 | import json 14 | from w3lib.html import remove_tags 15 | from aiostream import stream 16 | from async_retrying import retry 17 | 18 | base_url = "https://www.infoq.cn/public/v1/article/getDetail" 19 | headers = { 20 | "Accept": "application/json, text/plain, */*", 21 | "Accept-Encoding": "gzip, deflate, br", 22 | "Accept-Language": "zh-CN,zh;q=0.9", 23 | "Content-Type": "application/json", 24 | "Host": "www.infoq.cn", 25 | "Origin": "https://www.infoq.cn", 26 | "Referer": "https://www.infoq.cn/article/Ns2yelhHTd0rhmu2-IzN", 27 | "User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_13_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/71.0.3578.98 Safari/537.36", 28 | } 29 | 30 | headers2 = { 31 | "User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_13_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/71.0.3578.98 Safari/537.36", 32 | } 33 | try: 34 | import uvloop 35 | 36 | asyncio.set_event_loop_policy(uvloop.EventLoopPolicy()) 37 | except ImportError: 38 | pass 39 | 40 | sema = asyncio.Semaphore(5) 41 | 42 | 43 | async def get_buff(item, session): 44 | url = item.get("cover") 45 | with async_timeout.timeout(60): 46 | async with session.get(url) as r: 47 | if r.status == 200: 48 | buff = await r.read() 49 | if len(buff): 50 | crawler.info(f"NOW_IMAGE_URL:, {url}") 51 | await get_img(item, buff) 52 | 53 | 54 | async def get_img(item, buff): 55 | # 题目层目录是否存在 56 | 
file_path = item.get("file_path") 57 | image_path = item.get("image_path") 58 | if not os.path.exists(file_path): 59 | os.makedirs(file_path) 60 | 61 | # 文件是否存在 62 | if not os.path.exists(image_path): 63 | storage.info(f"SAVE_PATH:{image_path}") 64 | async with aiofiles.open(image_path, 'wb') as f: 65 | await f.write(buff) 66 | 67 | 68 | async def get_content(source, item): 69 | dic = {} 70 | dic["uuid"] = item.get("uuid") 71 | dic["title"] = item.get("title") 72 | dic["author"] = item.get("author") 73 | dic["publish_time"] = item.get("publish_time") 74 | dic["cover_url"] = item.get("cover") 75 | dic["tags"] = item.get("tags") 76 | dic["image_path"] = item.get("image_path") 77 | dic["md5name"] = item.get("md5name") 78 | html_content = source.get("data").get("content") 79 | dic["html"] = html_content 80 | dic["content"] = remove_tags(html_content) 81 | dic["update_time"] = datetime.datetime.now().strftime("%Y-%m-%d %H:%M:%S") 82 | await MotorBase().save_data(dic) 83 | 84 | 85 | @retry(attempts=5) 86 | async def fetch(item, session, retry_index=0): 87 | ''' 88 | 对内容的下载,重试次数5次 89 | :param item: 90 | :param session: 91 | :param retry_index: 92 | :return: 93 | ''' 94 | refer = item.get("url") 95 | uuid = item.get("uuid") 96 | if retry_index == 0: 97 | await MotorBase().change_status(uuid, 1) # 开始下载 98 | data = {"uuid": uuid} 99 | headers["Referer"] = refer 100 | with async_timeout.timeout(60): 101 | async with session.post(url=base_url, headers=headers, data=json.dumps(data)) as req: 102 | res_status = req.status 103 | 104 | if res_status == 200: 105 | jsondata = await req.json() 106 | await get_content(jsondata, item) 107 | await MotorBase().change_status(uuid, 2) # 下载成功 108 | 109 | 110 | async def bound_fetch(item, session): 111 | ''' 112 | 分别处理图片和内容的下载 113 | :param item: 114 | :param session: 115 | :return: 116 | ''' 117 | md5name = item.get("md5name") 118 | file_path = os.path.join(os.getcwd(), "infoq_cover") 119 | image_path = os.path.join(file_path, f"{md5name}.jpg") 120 | item["md5name"] = md5name 121 | item["image_path"] = image_path 122 | item["file_path"] = file_path 123 | async with sema: 124 | await fetch(item, session) 125 | await get_buff(item, session) 126 | 127 | 128 | async def branch(coros, limit=10): 129 | ''' 130 | 使用aiostream模块对异步生成器做一个切片操作。这里并发量为10. 
131 | :param coros: 异步生成器 132 | :param limit: 并发次数 133 | :return: 134 | ''' 135 | index = 0 136 | flag = True 137 | while flag: 138 | xs = stream.iterate(coros) 139 | ys = xs[index:index + limit] 140 | t = await stream.list(ys) 141 | if not t: 142 | flag = False 143 | else: 144 | await asyncio.ensure_future(asyncio.gather(*t)) 145 | index += limit 146 | 147 | 148 | async def run(): 149 | ''' 150 | 入口函数 151 | :return: 152 | ''' 153 | data = await MotorBase().find() 154 | crawler.info("Start Spider") 155 | async with aiohttp.connector.TCPConnector(limit=300, force_close=True, enable_cleanup_closed=True,verify_ssl=False) as tc: 156 | async with aiohttp.ClientSession(connector=tc) as session: 157 | coros = (asyncio.ensure_future(bound_fetch(item, session)) async for item in data) 158 | await branch(coros) 159 | 160 | 161 | if __name__ == '__main__': 162 | loop = asyncio.get_event_loop() 163 | loop.run_until_complete(run()) 164 | loop.close() 165 | -------------------------------------------------------------------------------- /infoq_seed_spider.py: -------------------------------------------------------------------------------- 1 | # -*- coding: utf-8 -*- 2 | # @Time : 2019-01-28 14:41 3 | # @Author : cxa 4 | # @File : infoq_seed_spider.py 5 | # @Software: PyCharm 6 | import requests 7 | import json 8 | import time 9 | import random 10 | import datetime 11 | from logger.log import crawler, storage 12 | from db.mongo_helper import Mongo 13 | import hashlib 14 | 15 | 16 | class InfoQ_Seed_Spider(): 17 | def __init__(self): 18 | self.start_url = "https://www.infoq.cn/public/v1/my/recommond" 19 | self.headers = { 20 | "Accept": "application/json, text/plain, */*", 21 | "Accept-Encoding": "gzip, deflate, br", 22 | "Accept-Language": "zh-CN,zh;q=0.9", 23 | "Content-Type": "application/json", 24 | "Host": "www.infoq.cn", 25 | "Origin": "https://www.infoq.cn", 26 | "Referer": "https://www.infoq.cn/", 27 | "User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_13_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/71.0.3578.98 Safari/537.36", 28 | } 29 | self.session = requests.Session() 30 | self.session.headers.update(self.headers) 31 | 32 | def get_req(self, data=None): 33 | ''' 34 | 请求列表页 35 | :param data: 36 | :return: 37 | ''' 38 | req = self.session.post(self.start_url, data=json.dumps(data)) 39 | if req.status_code in [200, 201]: 40 | return req 41 | 42 | def save_data(self, data): 43 | tasks = [] 44 | for item in data: 45 | try: 46 | dic = {} 47 | uuid = item.get("uuid") 48 | dic["uuid"] = uuid 49 | dic["url"] = f"https://www.infoq.cn/article/{uuid}" 50 | title = item.get("article_title") 51 | dic["title"] = title 52 | dic["cover"] = item.get("article_cover") 53 | dic["summary"] = item.get("article_summary") 54 | author = item.get("author") 55 | if author: 56 | dic["author"] = author[0].get("nickname") 57 | else: 58 | dic["author"] = item.get("no_author", "").split(":")[-1] 59 | score = item.get("publish_time") 60 | dic["publish_time"] = datetime.datetime.utcfromtimestamp(score / 1000).strftime("%Y-%m-%d %H:%M:%S") 61 | dic["tags"] = ",".join([data.get("name") for data in item.get("topic")]) 62 | translate = item.get("translator") 63 | dic["translator"] = dic["author"] 64 | if translate: 65 | dic["translator"] = translate[0].get("nickname") 66 | dic["status"] = 0 67 | md5name = hashlib.md5(title.encode("utf-8")).hexdigest() # 图片的名字 68 | dic["md5name"] = md5name 69 | dic["update_time"] = datetime.datetime.now().strftime("%Y-%m-%d %H:%M:%S") 70 | tasks.append(dic) 71 | except IndexError as e: 72 
| crawler.error("解析出错") 73 | Mongo().save_data(tasks) 74 | crawler.info(f"add {len(tasks)} datas to mongodb") 75 | return score 76 | 77 | def start(self): 78 | i = 0 79 | post_data = {"size": 12} 80 | while i < 4: 81 | req = self.get_req(post_data) 82 | data = req.json().get("data") 83 | score = self.save_data(data) 84 | post_data.update({"score": score}) 85 | i += 1 86 | time.sleep(random.randint(0, 5)) 87 | 88 | 89 | if __name__ == '__main__': 90 | iss = InfoQ_Seed_Spider() 91 | iss.start() 92 | -------------------------------------------------------------------------------- /logger/__init__.py: -------------------------------------------------------------------------------- 1 | # -*- coding: utf-8 -*- 2 | # @Time : 2019-01-28 15:37 3 | # @Author : cxa 4 | # @File : __init__.py.py 5 | # @Software: PyCharm -------------------------------------------------------------------------------- /logger/log.py: -------------------------------------------------------------------------------- 1 | # -*- coding: utf-8 -*- 2 | # @Time : 2019-01-28 15:37 3 | # @Author : cxa 4 | # @File : log.py 5 | # @Software: PyCharm 6 | import os 7 | import logging 8 | import logging.config as log_conf 9 | import datetime 10 | import coloredlogs 11 | 12 | log_dir = os.path.dirname(os.path.dirname(__file__)) + '/logs' 13 | if not os.path.exists(log_dir): 14 | os.mkdir(log_dir) 15 | today = datetime.datetime.now().strftime("%Y%m%d") 16 | 17 | log_path = os.path.join(log_dir, f'infoq_{today}.log') 18 | 19 | log_config = { 20 | 'version': 1.0, 21 | 'formatters': { 22 | 'colored_console': {'()': 'coloredlogs.ColoredFormatter', 23 | 'format': "%(asctime)s - %(name)s - %(levelname)s - %(message)s", 'datefmt': '%H:%M:%S'}, 24 | 'detail': { 25 | 'format': '%(asctime)s - %(name)s - %(levelname)s - %(message)s', 26 | 'datefmt': "%Y-%m-%d %H:%M:%S" # 如果不加这个会显示到毫秒。 27 | }, 28 | 'simple': { 29 | 'format': '%(name)s - %(levelname)s - %(message)s', 30 | }, 31 | }, 32 | 'handlers': { 33 | 'console': { 34 | 'class': 'logging.StreamHandler', # 日志打印到屏幕显示的类。 35 | 'level': 'INFO', 36 | 'formatter': 'colored_console' 37 | }, 38 | 'file': { 39 | 'class': 'logging.handlers.RotatingFileHandler', # 日志打印到文件的类。 40 | 'maxBytes': 1024 * 1024 * 1024, # 单个文件最大内存 41 | 'backupCount': 1, # 备份的文件个数 42 | 'filename': log_path, # 日志文件名 43 | 'level': 'INFO', # 日志等级 44 | 'formatter': 'detail', # 调用上面的哪个格式 45 | 'encoding': 'utf-8', # 编码 46 | }, 47 | }, 48 | 'loggers': { 49 | 'crawler': { 50 | 'handlers': ['console', 'file'], # 只打印屏幕 51 | 'level': 'DEBUG', # 只显示错误的log 52 | }, 53 | 'parser': { 54 | 'handlers': ['file'], 55 | 'level': 'INFO', 56 | }, 57 | 'other': { 58 | 'handlers': ['console', 'file'], 59 | 'level': 'INFO', 60 | }, 61 | 'storage': { 62 | 'handlers': ['console', 'file'], 63 | 'level': 'INFO', 64 | } 65 | } 66 | } 67 | 68 | log_conf.dictConfig(log_config) 69 | 70 | crawler = logging.getLogger('crawler') 71 | storage = logging.getLogger('storage') 72 | coloredlogs.install(level='DEBUG', logger=crawler) 73 | coloredlogs.install(level='DEBUG', logger=storage) 74 | -------------------------------------------------------------------------------- /tool/__init__.py: -------------------------------------------------------------------------------- 1 | # -*- coding: utf-8 -*- 2 | # @Time : 2019-01-28 14:47 3 | # @Author : cxa 4 | # @File : __init__.py.py 5 | # @Software: PyCharm -------------------------------------------------------------------------------- /tool/headers_format.py: 
-------------------------------------------------------------------------------- 1 | # -*- coding: utf-8 -*- 2 | # @Time : 2019-01-28 14:47 3 | # @Author : cxa 4 | # @File : headers_format.py 5 | # @Software: PyCharm 6 | headers = """ 7 | Accept: application/json, text/plain, */* 8 | Accept-Encoding: gzip, deflate, br 9 | Accept-Language: zh-CN,zh;q=0.9 10 | Content-Type: application/json 11 | Host: www.infoq.cn 12 | Origin: https://www.infoq.cn 13 | Referer: https://www.infoq.cn/article/Ns2yelhHTd0rhmu2-IzN 14 | User-Agent: Mozilla/5.0 (Macintosh; Intel Mac OS X 10_13_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/71.0.3578.98 Safari/537.36 15 | """ 16 | hs = headers.split('\n') 17 | b = [k for k in hs if len(k)] 18 | e = b 19 | f = {(i.split(":")[0], i.split(":", 1)[1].strip()) for i in e} 20 | g = sorted(f) 21 | index = 0 22 | print("{") 23 | for k, v in g: 24 | print(repr(k).replace('\'','"'), repr(v).replace('\'','"'), sep=':', end=",\n") 25 | print("}") 26 | --------------------------------------------------------------------------------