├── .gitignore ├── LICENSE ├── README.md ├── WeiboCrawler ├── __init__.py ├── items.py ├── middlewares.py ├── pipelines.py ├── settings.py └── spiders │ ├── __init__.py │ ├── comment.py │ ├── mblog.py │ ├── repost.py │ ├── user.py │ └── utils.py ├── requirements.txt └── scrapy.cfg /.gitignore: -------------------------------------------------------------------------------- 1 | *.DS_Store 2 | WeiboCrawler/__pycache__ 3 | WeiboCrawler/spiders/__pycache__ 4 | -------------------------------------------------------------------------------- /LICENSE: -------------------------------------------------------------------------------- 1 | MIT License 2 | 3 | Copyright (c) 2021 XWang 4 | 5 | Permission is hereby granted, free of charge, to any person obtaining a copy 6 | of this software and associated documentation files (the "Software"), to deal 7 | in the Software without restriction, including without limitation the rights 8 | to use, copy, modify, merge, publish, distribute, sublicense, and/or sell 9 | copies of the Software, and to permit persons to whom the Software is 10 | furnished to do so, subject to the following conditions: 11 | 12 | The above copyright notice and this permission notice shall be included in all 13 | copies or substantial portions of the Software. 14 | 15 | THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR 16 | IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, 17 | FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE 18 | AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER 19 | LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, 20 | OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE 21 | SOFTWARE. 22 | -------------------------------------------------------------------------------- /README.md: -------------------------------------------------------------------------------- 1 | # WeiboCrawler 2 | 3 | ## 项目说明 4 | 5 | ### 项目介绍 6 | 7 | 新浪微博是国内主要的社交舆论平台,对社交媒体中的数据进行采集是舆论分析的方法之一。本项目无需cookie,可以连续爬取一个或多个新浪微博用户信息、用户微博及其微博评论转发。 8 | 9 | ### 实例 10 | 抓取用户信息 11 | ![user](https://xtopia-1258297046.cos.ap-shanghai.myqcloud.com/20210406163110.jpg) 12 | 抓取用户微博 13 | ![mblog](https://xtopia-1258297046.cos.ap-shanghai.myqcloud.com/20210406163931.png) 14 | 抓取微博转发 15 | ![Repost](https://xtopia-1258297046.cos.ap-shanghai.myqcloud.com/20210406164056.png) 16 | 抓取微博评论 17 | ![comment](https://xtopia-1258297046.cos.ap-shanghai.myqcloud.com/20210406164127.jpg) 18 | 19 | ## 使用方法 20 | ### 拉取项目 21 | ``` 22 | $ git clone https://github.com/XWang20/WeiboCrawler.git 23 | ``` 24 | 25 | ### 安装依赖 26 | 本项目Python版本为Python3.8 27 | ``` 28 | $ cd WeiboCrawler 29 | $ python -m pip install -r requirements.txt 30 | ``` 31 | 32 | ### 安装数据库(可选) 33 | 默认使用MongoDB数据库,可在[settings.py](https://github.com/XWang20/WeiboCrawler/blob/main/WeiboCrawler/settings.py)中修改URL和数据库名,默认为localhost、weibo。 34 | 35 | ### 运行程序 36 | #### 基本程序 37 | 在命令行中运行以下命令: 38 | 39 | 抓取用户信息 40 | ``` 41 | $ scrapy crawl user 42 | ``` 43 | 抓取用户微博 44 | ``` 45 | $ scrapy crawl mblog 46 | ``` 47 | 抓取微博转发 48 | ``` 49 | $ scrapy crawl repost 50 | ``` 51 | 抓取微博评论 52 | ``` 53 | $ scrapy crawl comment 54 | ``` 55 | 56 | #### 自定义选项 57 | 1. 关键词检索,需要将`./WeiboCrawler/spiders/mblog.py`中第28行代码替换为`urls = init_url_by_search()`,并在`init_url_by_search()`中增加关键词列表。 58 | 59 | 2. 采集id和时间范围等信息可根据自己实际需要重写`./WeiboCrawler/spiders/*.py`中的`start_requests`函数。 60 | 61 | 3. 
输出方式:支持输出到mongo数据库中,或输出json或csv文件。 62 | 63 | 如果输出json或csv文件,需要在命令后加入`-o *.json`或`-o *.csv`,例如: 64 | ``` 65 | $ scrapy crawl user -o user.csv 66 | ``` 67 | 68 | 如果输出到mongo数据库中,需要将`./WeiboCrawler/settings.py`中 mongo 数据库的部分取消注释: 69 | ``` 70 | ITEM_PIPELINES = { 71 | 'WeiboCrawler.pipelines.MongoPipeline': 400, 72 | } 73 | MONGO_URI = 'localhost' 74 | MONGO_DB = 'weibo' 75 | ``` 76 | 77 | 4. 添加账号cookie:可在[settings.py](WeiboCrawler/settings.py)中添加默认头,或在start_request函数中添加。 78 | 79 | 5. 默认下载延迟为3,可在[settings.py](WeiboCrawler/settings.py)修改DOWNLOAD_DELAY。 80 | 81 | 6. 默认会爬取二级评论,如果不需要可以在[comment.py](WeiboCrawler/spiders/comment.py)中注释以下代码: 82 | 83 | ```python 84 | if comment['total_number']: 85 | secondary_url = 'https://m.weibo.cn/comments/hotFlowChild?cid=' + comment['idstr'] 86 | yield Request(secondary_url, callback=self.parse_secondary_comment, meta={"mblog_id": mblog_id}) 87 | ``` 88 | 89 | ## 无cookie版限制的说明 90 | * 单用户微博最多采集200页,每页最多10条 91 | 限制可以通过添加账号cookie解决。 92 | 93 | ## 设置多线程和代理ip 94 | 95 | * 多线程:(**单ip池或单账号不建议采用多线程**) 96 | 在[settings.py](https://github.com/XWang20/WeiboCrawler/blob/main/WeiboCrawler/settings.py)文件中将以下代码取消注释: 97 | ```python 98 | # Configure maximum concurrent requests performed by Scrapy (default: 16) 99 | CONCURRENT_REQUESTS = 100 100 | # The download delay setting will honor only one of: 101 | CONCURRENT_REQUESTS_PER_DOMAIN = 100 102 | CONCURRENT_REQUESTS_PER_IP = 100 103 | ``` 104 | 105 | * 代理ip池 106 | 1. 填写[middlewares.py](https://github.com/XWang20/WeiboCrawler/blob/main/WeiboCrawler/middlewares.py)中的`fetch_proxy`函数。 107 | 2. 在[settings.py](https://github.com/XWang20/WeiboCrawler/blob/main/WeiboCrawler/settings.py)文件中将以下代码取消注释: 108 | ```python 109 | # Enable or disable downloader middlewares 110 | # See https://doc.scrapy.org/en/latest/topics/downloader-middleware.html 111 | DOWNLOADER_MIDDLEWARES = { 112 | 'WeiboCrawler.middlewares.IPProxyMiddleware': 100, 113 | } 114 | ``` 115 | 3. 
在[settings.py](https://github.com/XWang20/WeiboCrawler/blob/main/WeiboCrawler/settings.py)文件中将DOWNLOAD_DELAY设置为0。 116 | ```python 117 | DOWNLOAD_DELAY = 0 118 | ``` 119 | 120 | ## 字段说明 121 | ### 用户信息 122 | * _id: 用户ID 123 | * nick_name: 昵称 124 | * gender: 性别 125 | * brief_introduction: 简介 126 | * location: 所在地 127 | * mblogs_num: 微博数 128 | * follows_num: 关注数 129 | * fans_num: 粉丝数 130 | * vip_level: 会员等级 131 | * authentication: 认证,对于已认证用户该字段会显示认证信息 132 | * person_url: 首页链接 133 | 134 | ### 微博信息 135 | * _id: 微博id 136 | * bid: 微博bid 137 | * weibo_url: 微博URL 138 | * created_at: 微博发表时间 139 | * like_num: 点赞数 140 | * repost_num: 转发数 141 | * comment_num: 评论数 142 | * content: 微博内容 143 | * user_id: 发表该微博用户的id 144 | * tool: 发布微博的工具 145 | 146 | 147 | ### 转发信息 148 | * _id: 转发id 149 | * repost_user_id: 转发用户的id 150 | * content: 转发的内容 151 | * mblog_id: 转发的微博的id 152 | * created_at: 转发时间 153 | * source: 转发工具 154 | 155 | 156 | ### 评论信息 157 | * _id: 评论id 158 | * comment_user_id: 评论用户的id 159 | * content: 评论的内容 160 | * mblog_id: 评论的微博的id 161 | * created_at: 评论发表时间 162 | * like_num: 点赞数 163 | * root_comment_id: 根评论id,只有二级评论有该项 164 | * img_url: 图片地址 165 | * reply_comment_id: 评论的id,只有二级评论有该项 166 | 167 | ## 写在最后 168 | 169 | 170 | 本项目参考了[dataabc/weibo-crawler](https://github.com/dataabc/weibo-crawler)和[nghuyong/WeiboSpider](https://github.com/nghuyong/WeiboSpider),感谢他们的开源。 171 | 172 | 欢迎为本项目贡献力量。欢迎大家提交PR、通过issue提建议(如新功能、改进方案等)、通过issue告知项目存在哪些bug、缺点等。 173 | 174 | 如有问题和交流,也欢迎联系我: 175 | -------------------------------------------------------------------------------- /WeiboCrawler/__init__.py: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/XWang20/WeiboCrawler/591825b526c3d5c9b0e8b9f997803e3529fc3ed1/WeiboCrawler/__init__.py -------------------------------------------------------------------------------- /WeiboCrawler/items.py: -------------------------------------------------------------------------------- 1 | # -*- coding: utf-8 -*- 2 | 3 | # Define here the models for your scraped items 4 | # 5 | # See documentation in: 6 | # https://doc.scrapy.org/en/latest/topics/items.html 7 | 8 | from scrapy import Item, Field 9 | 10 | class UserItem(Item): 11 | """ User Information """ 12 | _id = Field() # 用户ID 13 | nick_name = Field() # 昵称 14 | gender = Field() # 性别 15 | brief_introduction = Field() # 简介 16 | location = Field() # 首页链接 17 | mblogs_num = Field() # 微博数 18 | follows_num = Field() # 关注数 19 | fans_num = Field() # 粉丝数 20 | vip_level = Field() # 会员等级 21 | authentication = Field() # 认证 22 | person_url = Field() # 首页链接 23 | 24 | class MblogItem(Item): 25 | """ Mblog information """ 26 | _id = Field() # 微博id 27 | bid = Field() 28 | weibo_url = Field() # 微博URL 29 | created_at = Field() # 微博发表时间 30 | like_num = Field() # 点赞数 31 | repost_num = Field() # 转发数 32 | comment_num = Field() # 评论数 33 | content = Field() # 微博内容 34 | user_id = Field() # 发表该微博用户的id 35 | tool = Field() # 发布微博的工具 36 | 37 | class CommentItem(Item): 38 | """ Mblog Comment Information """ 39 | _id = Field() 40 | comment_user_id = Field() # 评论用户的id 41 | content = Field() # 评论的内容 42 | mblog_id = Field() # 评论的微博的id 43 | created_at = Field() # 评论发表时间 44 | like_num = Field() # 点赞数 45 | root_comment_id = Field() # 根评论id,只有二级评论有该项 46 | img_url = Field() 47 | img_name = Field() 48 | reply_comment_id =Field() 49 | 50 | class RepostItem(Item): 51 | """ Mblog Repost Information """ 52 | _id = Field() 53 | repost_user_id = Field() # 转发用户的id 54 | content = Field() # 转发的内容 55 | mblog_id = Field() # 
转发的微博的id 56 | created_at = Field() # 转发时间 57 | source = Field() # 转发工具 58 | -------------------------------------------------------------------------------- /WeiboCrawler/middlewares.py: -------------------------------------------------------------------------------- 1 | # -*- coding: utf-8 -*- 2 | 3 | # Define here the models for your spider middleware 4 | # 5 | # See documentation in: 6 | # https://doc.scrapy.org/en/latest/topics/spider-middleware.html 7 | 8 | from scrapy.downloadermiddlewares.retry import RetryMiddleware 9 | from scrapy.utils.response import response_status_message 10 | from scrapy import signals 11 | import time 12 | 13 | # class IPProxyMiddleware(object): 14 | 15 | # def fetch_proxy(self): 16 | # # You need to rewrite this function if you want to add a proxy pool 17 | # # the function should return an ip in the format "ip:port", e.g. "12.34.1.4:9090" 18 | # pass 19 | 20 | # def process_request(self, request, spider): 21 | # proxy_data = self.fetch_proxy() 22 | # if proxy_data: 23 | # current_proxy = proxy_data 24 | # spider.logger.debug(f"current proxy:{current_proxy}") 25 | # request.meta['proxy'] = current_proxy 26 | 27 | class TooManyRequestsRetryMiddleware(RetryMiddleware): 28 | 29 | def __init__(self, crawler): 30 | super(TooManyRequestsRetryMiddleware, self).__init__(crawler.settings) 31 | self.crawler = crawler 32 | 33 | @classmethod 34 | def from_crawler(cls, crawler): 35 | return cls(crawler) 36 | 37 | def process_response(self, request, response, spider): 38 | if request.meta.get('dont_retry', False): 39 | return response 40 | elif response.status in (429, 418): 41 | self.crawler.engine.pause() 42 | time.sleep(60) # If the rate limit resets after a minute, sleep 60 seconds; adjust accordingly.
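# engine.pause() above stops Scrapy from scheduling new requests while this process sleeps; unpause() below resumes the crawl before the request is retried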
43 | self.crawler.engine.unpause() 44 | reason = response_status_message(response.status) 45 | return self._retry(request, reason, spider) or response 46 | elif response.status in self.retry_http_codes: 47 | reason = response_status_message(response.status) 48 | return self._retry(request, reason, spider) or response 49 | return response -------------------------------------------------------------------------------- /WeiboCrawler/pipelines.py: -------------------------------------------------------------------------------- 1 | # -*- coding: utf-8 -*- 2 | 3 | # Define your item pipelines here 4 | # 5 | # Don't forget to add your pipeline to the ITEM_PIPELINES setting 6 | # See: https://doc.scrapy.org/en/latest/topics/item-pipeline.html 7 | import re 8 | import pymongo 9 | from scrapy.pipelines.images import ImagesPipeline 10 | from scrapy import Request 11 | 12 | class WeibocrawlerPipeline(object): 13 | def process_item(self, item, spider): 14 | return item 15 | 16 | class MongoPipeline(object): 17 | def __init__(self, mongo_uri, mongo_db): 18 | self.mongo_uri = mongo_uri 19 | self.mongo_db = mongo_db 20 | 21 | @classmethod 22 | def from_crawler(cls, crawler): 23 | return cls( 24 | mongo_uri=crawler.settings.get('MONGO_URI'), 25 | mongo_db=crawler.settings.get('MONGO_DB') 26 | ) 27 | 28 | def open_spider(self, spider): 29 | self.client = pymongo.MongoClient(self.mongo_uri) 30 | self.db = self.client[self.mongo_db] 31 | 32 | def process_item(self, item, spider): 33 | name = item.__class__.__name__ 34 | self.db[name].insert_one(dict(item)) # insert_one replaces the deprecated Collection.insert 35 | return item 36 | 37 | def close_spider(self, spider): 38 | self.client.close() 39 | 40 | class ImagesnamePipeline(ImagesPipeline): 41 | def get_media_requests(self, item, info): 42 | if 'img_url' in item: 43 | for image_url in item['img_url']: 44 | # the value in meta comes from the spider and is passed on to file_path below 45 | yield Request(image_url, meta={'name': item['created_at']}, dont_filter=True, headers={'Host': 'wx1.sinaimg.cn'}) 46 | 47 | def file_path(self, request, response=None, info=None): 48 | # use the last url segment as the image id 49 | image_guid = request.url.split('/')[-1] 50 | # the image name defaults to the comment's creation date 51 | name = request.meta['name'] 52 | name = re.sub(r'[?\\*|“<>:/]', '', name) 53 | # default storage path: images/<comment date>/<comment date>_<image id> 54 | filename = u'{0}/{0}_{1}'.format(name, image_guid) 55 | return filename -------------------------------------------------------------------------------- /WeiboCrawler/settings.py: -------------------------------------------------------------------------------- 1 | # -*- coding: utf-8 -*- 2 | 3 | # Scrapy settings for WeiboCrawler project 4 | # 5 | # For simplicity, this file contains only settings considered important or 6 | # commonly used.
You can find more settings consulting the documentation: 7 | # 8 | # https://doc.scrapy.org/en/latest/topics/settings.html 9 | # https://doc.scrapy.org/en/latest/topics/downloader-middleware.html 10 | # https://doc.scrapy.org/en/latest/topics/spider-middleware.html 11 | 12 | BOT_NAME = 'WeiboCrawler' 13 | 14 | SPIDER_MODULES = ['WeiboCrawler.spiders'] 15 | NEWSPIDER_MODULE = 'WeiboCrawler.spiders' 16 | 17 | 18 | # Crawl responsibly by identifying yourself (and your website) on the user-agent 19 | USER_AGENT = 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/86.0.4240.111 Safari/537.36' 20 | 21 | # Obey robots.txt rules 22 | ROBOTSTXT_OBEY = False 23 | 24 | # Configure a delay for requests for the same website (default: 0) 25 | # See https://doc.scrapy.org/en/latest/topics/settings.html#download-delay 26 | # See also autothrottle settings and docs 27 | DOWNLOAD_DELAY = 3 28 | DOWNLOAD_TIMEOUT = 15 29 | # # Configure maximum concurrent requests performed by Scrapy (default: 16) 30 | # CONCURRENT_REQUESTS = 100 31 | # # The download delay setting will honor only one of: 32 | # CONCURRENT_REQUESTS_PER_DOMAIN = 100 33 | # CONCURRENT_REQUESTS_PER_IP = 100 34 | 35 | # Disable cookies (enabled by default) 36 | COOKIES_ENABLED = False 37 | 38 | # Disable Telnet Console (enabled by default) 39 | #TELNETCONSOLE_ENABLED = False 40 | 41 | # Override the default request headers: 42 | DEFAULT_REQUEST_HEADERS = { 43 | 'Accept': 'application/json, text/plain, */*', 44 | 'Accept-Encoding': 'gzip, deflate, sdch', 45 | 'Accept-Language': 'zh-CN,zh;q=0.8,en;q=0.6,ja;q=0.4,zh-TW;q=0.2,mt;q=0.2', 46 | 'Connection': 'keep-alive', 47 | 'Host': 'm.weibo.cn', 48 | 'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 11_3_0) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/89.0.4389.114 Safari/537.36', 49 | 'X-Requested-With': 'XMLHttpRequest', 50 | # 'cookie': 51 | } 52 | 53 | # Enable or disable downloader middlewares 54 | # See https://doc.scrapy.org/en/latest/topics/downloader-middleware.html 55 | DOWNLOADER_MIDDLEWARES = { 56 | # 'WeiboCrawler.middlewares.IPProxyMiddleware': 100, 57 | 'WeiboCrawler.middlewares.TooManyRequestsRetryMiddleware': 543, 58 | } 59 | RETRY_HTTP_CODES = [429, 418, 502] 60 | 61 | # Enable or disable extensions 62 | # See https://doc.scrapy.org/en/latest/topics/extensions.html 63 | #EXTENSIONS = { 64 | # 'scrapy.extensions.telnet.TelnetConsole': None, 65 | #} 66 | 67 | # Configure item pipelines 68 | # See https://doc.scrapy.org/en/latest/topics/item-pipeline.html 69 | ITEM_PIPELINES = { 70 | 'WeiboCrawler.pipelines.WeibocrawlerPipeline': 300, 71 | # 'WeiboCrawler.pipelines.MongoPipeline': 400, 72 | } 73 | # MONGO_URI = 'localhost' 74 | # MONGO_DB = 'weibo' 75 | 76 | # Enable and configure the AutoThrottle extension (disabled by default) 77 | # See https://doc.scrapy.org/en/latest/topics/autothrottle.html 78 | #AUTOTHROTTLE_ENABLED = True 79 | # The initial download delay 80 | #AUTOTHROTTLE_START_DELAY = 5 81 | # The maximum download delay to be set in case of high latencies 82 | #AUTOTHROTTLE_MAX_DELAY = 60 83 | # The average number of requests Scrapy should be sending in parallel to 84 | # each remote server 85 | #AUTOTHROTTLE_TARGET_CONCURRENCY = 1.0 86 | # Enable showing throttling stats for every response received: 87 | #AUTOTHROTTLE_DEBUG = False 88 | 89 | # Enable and configure HTTP caching (disabled by default) 90 | # See https://doc.scrapy.org/en/latest/topics/downloader-middleware.html#httpcache-middleware-settings 91 | #HTTPCACHE_ENABLED = 
True 92 | #HTTPCACHE_EXPIRATION_SECS = 0 93 | #HTTPCACHE_DIR = 'httpcache' 94 | #HTTPCACHE_IGNORE_HTTP_CODES = [] 95 | #HTTPCACHE_STORAGE = 'scrapy.extensions.httpcache.FilesystemCacheStorage' 96 | 97 | # # 图片存储目录 98 | # IMAGES_STORE = 'images/' 99 | # ITEM_PIPELINES = { 100 | # 'WeiboCrawler.pipelines.ImagesnamePipeline': 300, 101 | # } -------------------------------------------------------------------------------- /WeiboCrawler/spiders/__init__.py: -------------------------------------------------------------------------------- 1 | # This package will contain the spiders of your Scrapy project 2 | # 3 | # Please refer to the documentation for information on how to create and manage 4 | # your spiders. 5 | -------------------------------------------------------------------------------- /WeiboCrawler/spiders/comment.py: -------------------------------------------------------------------------------- 1 | # -*- coding: utf-8 -*- 2 | import re 3 | import json 4 | from datetime import datetime 5 | from scrapy import Request, Spider 6 | from WeiboCrawler.items import CommentItem 7 | from WeiboCrawler.spiders.utils import standardize_date, extract_content 8 | 9 | class CommentSpider(Spider): 10 | name = 'comment' 11 | base_url = 'https://api.weibo.cn/2/comments/build_comments?' 12 | 13 | def start_requests(self): 14 | mblog_ids = ['4704521537720325'] 15 | urls = [f'{self.base_url}is_show_bulletin=2&c=android&s=746fd605&id={mblog_id}&from=10A8195010&gsid=_2AkMolNMzf8NhqwJRmf4dxWzgb49zzQrEieKeyCLoJRM3HRl-wT9jqmwMtRV6AgOZP3LqGBH-29qGRB4vP3j-Hng6DkBJ&count=50&max_id_type=1' for mblog_id in mblog_ids] 16 | for url in urls: 17 | yield Request(url, callback=self.parse, dont_filter=True, headers={'Host': 'api.weibo.cn'}) 18 | 19 | def parse(self, response): 20 | js = json.loads(response.text) 21 | mblog_id = re.search(r'[\d]{16}', response.url).group(0) 22 | 23 | comments = js['root_comments'] 24 | for comment in comments: 25 | commentItem = CommentItem() 26 | commentItem['_id'] = comment['idstr'] 27 | commentItem['comment_user_id'] = comment['user']['id'] 28 | commentItem['mblog_id'] = mblog_id 29 | commentItem['created_at'] = standardize_date(comment['created_at']).strftime('%Y-%m-%d') 30 | commentItem['like_num'] = comment['like_counts'] 31 | commentItem['content'] = extract_content(comment['text']) 32 | img_url = [] 33 | if 'pic_infos' in comment: 34 | for pic in comment['pic_infos']: 35 | img_url.append(comment['pic_infos'][pic]['original']['url']) 36 | commentItem['img_url'] = img_url 37 | yield commentItem 38 | 39 | if comment['total_number']: 40 | cid = comment['idstr'] 41 | secondary_url = f'https://weibo.com/ajax/statuses/buildComments?is_reload=1&id={cid}&is_show_bulletin=2&is_mix=1&fetch_level=1&count=20&flow=1' 42 | yield Request(secondary_url, callback=self.parse_secondary_comment, meta={"mblog_id": mblog_id, 'secondary_url': secondary_url}, headers={'Host': 'weibo.com'}) 43 | 44 | max_id = js['max_id'] 45 | if max_id > 0: 46 | max_id_type = js['max_id_type'] 47 | next_url = f'{self.base_url}is_show_bulletin=2&c=android&s=746fd605&id={mblog_id}&from=10A8195010&gsid=_2AkMolNMzf8NhqwJRmf4dxWzgb49zzQrEieKeyCLoJRM3HRl-wT9jqmwMtRV6AgOZP3LqGBH-29qGRB4vP3j-Hng6DkBJ&count=50&max_id={max_id}&max_id_type={max_id_type}' 48 | yield Request(next_url, callback=self.parse, dont_filter=True, headers={'Host': 'api.weibo.cn'}) 49 | 50 | def parse_secondary_comment(self, response): 51 | js = json.loads(response.text) 52 | max_id = js['max_id'] 53 | 54 | if js['ok']: 55 | seccomments = js['data'] 56 | for 
seccomment in seccomments: 57 | commentItem = CommentItem() 58 | commentItem['_id'] = seccomment['id'] 59 | commentItem['comment_user_id'] = seccomment['user']['id'] 60 | commentItem['mblog_id'] = response.meta['mblog_id'] 61 | commentItem['created_at'] = standardize_date(seccomment['created_at']).strftime('%Y-%m-%d') 62 | commentItem['like_num'] = seccomment['like_counts'] 63 | commentItem['content'] = extract_content(seccomment['text']) 64 | commentItem['reply_comment_id'] = seccomment['reply_comment']['idstr'] 65 | commentItem['root_comment_id'] = seccomment['rootidstr'] 66 | yield commentItem 67 | 68 | if max_id > 0: 69 | secondary_url = response.meta['secondary_url'] 70 | next_url = f'{secondary_url}&max_id={max_id}' 71 | yield Request(next_url, callback=self.parse_secondary_comment, meta={"mblog_id": response.meta['mblog_id'], 'secondary_url': secondary_url}, headers={'Host': 'weibo.com'}) 72 | -------------------------------------------------------------------------------- /WeiboCrawler/spiders/mblog.py: -------------------------------------------------------------------------------- 1 | # -*- coding: utf-8 -*- 2 | import json 3 | from datetime import date 4 | from datetime import datetime 5 | from scrapy import Request, Spider 6 | from WeiboCrawler.items import MblogItem 7 | from WeiboCrawler.spiders.utils import standardize_date, extract_content 8 | 9 | class MblogSpider(Spider): 10 | name = 'mblog' 11 | base_url = 'https://m.weibo.cn/api/container/getIndex?' 12 | 13 | def start_requests(self): 14 | 15 | # 通过用户id搜索 16 | def init_url_by_user_id(): 17 | # crawl mblogs post by users 18 | user_ids = ['1699432410'] 19 | urls = [f'{self.base_url}containerid=107603{user_id}&page=1' for user_id in user_ids] 20 | return urls 21 | 22 | # 通过关键词搜索 23 | def init_url_by_search(): 24 | key_words = [''] 25 | urls = [f'{self.base_url}type=wb&queryVal={key_word}&containerid=100103type=61%26q%3D{key_word}&page=1' for key_word in key_words] 26 | return urls 27 | 28 | urls = init_url_by_user_id() 29 | for url in urls: 30 | yield Request(url, callback=self.parse) 31 | 32 | def parse(self, response): 33 | js = json.loads(response.text) 34 | page_num = int(response.url.split('=')[-1]) 35 | # 设定采集的时间段 36 | date_start = datetime.strptime("2019-12-01", '%Y-%m-%d') 37 | date_end = datetime.strptime("2022-2-8", '%Y-%m-%d') 38 | if js['ok']: 39 | weibos = js['data']['cards'] 40 | for w in weibos: 41 | if w['card_type'] == 9: 42 | weibo_info = w['mblog'] 43 | created_at = standardize_date(weibo_info['created_at']) 44 | if date_start <= created_at and created_at <= date_end: 45 | mblogItem = MblogItem() 46 | weiboid = mblogItem['_id'] = weibo_info['id'] 47 | mblogItem['bid'] = weibo_info['bid'] 48 | mblogItem['user_id'] = weibo_info['user']['id'] if weibo_info['user'] else '' 49 | mblogItem['like_num'] = weibo_info['attitudes_count'] 50 | mblogItem['repost_num'] = weibo_info['reposts_count'] 51 | mblogItem['comment_num'] = weibo_info['comments_count'] 52 | mblogItem['tool'] = weibo_info['source'] 53 | mblogItem['created_at'] = created_at.strftime('%Y-%m-%d') 54 | weibo_url = mblogItem['weibo_url'] = 'https://m.weibo.cn/detail/'+weiboid 55 | is_long = True if weibo_info.get('pic_num') > 9 else weibo_info.get('isLongText') # 判断是否为长微博 56 | if is_long: 57 | yield Request(weibo_url, callback=self.parse_all_content, meta={'item': mblogItem}, priority=1) 58 | else: 59 | mblogItem['content'] = extract_content(weibo_info['text']) 60 | yield mblogItem 61 | 62 | elif date_end < created_at: # 微博在采集时间段后 63 | continue 64 | 
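# posts are assumed to come back newest-first, so anything older than date_start means the remaining pages can be skipped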
else: # 微博超过需采集的时间 65 | page_num = 0 # 退出采集该用户 66 | break 67 | 68 | if js['ok'] and page_num: 69 | next_url = response.url.replace('page={}'.format(page_num), 'page={}'.format(page_num+1)) 70 | yield Request(next_url, callback=self.parse) 71 | 72 | def parse_all_content(self,response): 73 | mblogItem = response.meta['item'] 74 | html = response.text 75 | html = html[html.find('"status":'):] 76 | html = html[:html.rfind('"hotScheme"')] 77 | html = html[:html.rfind(',')] 78 | html = '{' + html + '}' 79 | js = json.loads(html, strict=False) 80 | weibo_info = js.get('status') 81 | if weibo_info: 82 | mblogItem['content'] = extract_content(weibo_info['text']) 83 | yield mblogItem 84 | -------------------------------------------------------------------------------- /WeiboCrawler/spiders/repost.py: -------------------------------------------------------------------------------- 1 | # -*- coding: utf-8 -*- 2 | import json 3 | from datetime import datetime 4 | from scrapy import Request, Spider 5 | from WeiboCrawler.items import RepostItem 6 | from WeiboCrawler.spiders.utils import extract_content, standardize_date 7 | 8 | class RepostSpider(Spider): 9 | name = 'repost' 10 | base_url = 'https://api.weibo.cn/2/guest/statuses_repost_timeline?' 11 | 12 | def start_requests(self): 13 | mblog_ids = ['4750304827933227'] # 原微博id文件 14 | urls = [f"{self.base_url}c=android&s=746fd605&ft=0&id={mblog_id}&from=10A8195010&gsid=_2AkMomC38f8NhqwJRmf4dxWzgb49zzQrEieKexNwnJRM3HRl-wT9kqmkAtRV6AgOZPxlCXdki_q9a-GZtfNgXXwAhZ5en&page=1" for mblog_id in mblog_ids] 15 | for url in urls: 16 | yield Request(url, callback=self.parse, headers={'Host': 'api.weibo.cn'}) 17 | 18 | def parse(self, response): 19 | js = json.loads(response.text) 20 | 21 | if js['next_cursor']: 22 | page_list = response.url.split("=") 23 | page_list[-1] = str(int(response.url.split("=")[-1])+1) 24 | page_url = '='.join(page_list) 25 | print(page_url) 26 | yield Request(page_url, self.parse, headers={'Host': 'api.weibo.cn'}, priority=1) 27 | 28 | reposts = js['reposts'] 29 | for repost in reposts: 30 | repostItem = RepostItem() 31 | repostItem['_id'] = repost['id'] 32 | repostItem['repost_user_id'] = repost['user']['id'] 33 | repostItem['mblog_id'] = repost['retweeted_status']['id'] 34 | repostItem['created_at'] = standardize_date(repost['created_at']).strftime('%Y-%m-%d') 35 | repostItem['content'] = repost['text'] 36 | repostItem['source'] = extract_content(repost['source']) 37 | yield repostItem 38 | -------------------------------------------------------------------------------- /WeiboCrawler/spiders/user.py: -------------------------------------------------------------------------------- 1 | # -*- coding: utf-8 -*- 2 | import json 3 | from scrapy import Request, Spider 4 | from WeiboCrawler.items import UserItem 5 | 6 | 7 | class UserSpider(Spider): 8 | name = 'user' 9 | base_url = 'https://m.weibo.cn/api/container/getIndex?' 
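# containerid 100505<uid> returns the userInfo profile card on m.weibo.cn; containerid 230283<uid>, requested later in parse(), returns the detail cards that include the location field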
10 | 11 | def start_requests(self): 12 | # file_path = '' #用户id文件 13 | # with open(file_path, 'rb') as f: 14 | # try: 15 | # lines = f.read().splitlines() 16 | # lines = [line.decode('utf-8-sig') for line in lines] 17 | # except UnicodeDecodeError: 18 | # logger.error(u'%s文件应为utf-8编码,请先将文件编码转为utf-8再运行程序', file_path) 19 | # sys.exit() 20 | user_ids = ['1749127163', '2028810631'] # 用户id列表 21 | urls = [f'{self.base_url}containerid=100505{user_id}' for user_id in user_ids] 22 | for url in urls: 23 | yield Request(url, callback=self.parse) 24 | 25 | def parse(self, response): 26 | userItem = UserItem() 27 | js = json.loads(response.text) 28 | if js['ok']: 29 | userInfo = js['data']['userInfo'] 30 | userItem['_id'] = userInfo['id'] 31 | userItem['nick_name'] = userInfo['screen_name'] 32 | userItem['gender'] = userInfo['gender'] 33 | userItem['brief_introduction'] = userInfo['description'] 34 | userItem['mblogs_num'] = userInfo['statuses_count'] 35 | userItem['follows_num'] = userInfo['follow_count'] 36 | userItem['fans_num'] = userInfo['followers_count'] 37 | userItem['authentication'] = userInfo['verified'] 38 | if userItem['authentication']: 39 | userItem['authentication'] = userInfo['verified_reason'] 40 | userItem['vip_level'] = userInfo['mbrank'] 41 | userItem['person_url'] = userInfo['profile_url'].split('?')[0] 42 | profile_url = f"{self.base_url}containerid=230283{userInfo['id']}" 43 | yield Request(profile_url, callback=self.parse_location, meta={'item': userItem}, priority=1) 44 | 45 | def parse_location(self,response): 46 | userItem = response.meta['item'] 47 | js = json.loads(response.text) 48 | if js['ok']: 49 | userItem['location']=js['data']['cards'][0]['card_group'][0]['item_content'] 50 | print(userItem['location']) 51 | yield userItem 52 | -------------------------------------------------------------------------------- /WeiboCrawler/spiders/utils.py: -------------------------------------------------------------------------------- 1 | #!/usr/bin/env python 2 | # encoding: utf-8 3 | import re 4 | from datetime import datetime, timedelta 5 | 6 | def standardize_date(created_at): 7 | """标准化微博发布时间""" 8 | if u'刚刚' in created_at: 9 | created_at = datetime.now().strftime('%Y-%m-%d') 10 | elif u'分钟' in created_at: 11 | minute = created_at[:created_at.find(u'分钟')] 12 | minute = timedelta(minutes=int(minute)) 13 | created_at = (datetime.now() - minute).strftime('%Y-%m-%d') 14 | elif u'小时' in created_at: 15 | hour = created_at[:created_at.find(u'小时')] 16 | hour = timedelta(hours=int(hour)) 17 | created_at = (datetime.now() - hour).strftime('%Y-%m-%d') 18 | elif u'昨天' in created_at: 19 | day = timedelta(days=1) 20 | created_at = (datetime.now() - day).strftime('%Y-%m-%d') 21 | else: 22 | created_at = created_at.replace('+0800 ', '') 23 | temp = datetime.strptime(created_at, '%c') 24 | created_at = datetime.strftime(temp, '%Y-%m-%d') 25 | return datetime.strptime(created_at, '%Y-%m-%d') 26 | 27 | 28 | def extract_content(text): 29 | text_body = text 30 | dr = re.compile(r'<[^>]+>',re.S) # 过滤html标签 31 | text_body = dr.sub('',text_body) 32 | return text_body 33 | -------------------------------------------------------------------------------- /requirements.txt: -------------------------------------------------------------------------------- 1 | pymongo==3.10.1 2 | Scrapy==1.5.1 3 | Pillow==8.2.0 -------------------------------------------------------------------------------- /scrapy.cfg: -------------------------------------------------------------------------------- 1 | # Automatically 
created by: scrapy startproject 2 | # 3 | # For more information about the [deploy] section see: 4 | # https://scrapyd.readthedocs.io/en/latest/deploy.html 5 | 6 | [settings] 7 | default = WeiboCrawler.settings 8 | 9 | [deploy] 10 | #url = http://localhost:6800/ 11 | project = WeiboCrawler 12 | --------------------------------------------------------------------------------
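A note on the proxy-pool option: `fetch_proxy` in WeiboCrawler/middlewares.py is left as a stub. Below is a minimal sketch of one way it might be filled in, assuming a locally running proxy-pool HTTP service; the endpoint URL and JSON response shape are placeholders, not part of this project.

```python
# Hypothetical fetch_proxy for WeiboCrawler/middlewares.py.
# Assumes a local pool service that answers GET /get/ with
# {"proxy": "12.34.1.4:9090"}; adapt the URL and parsing to your pool.
import json
from urllib.request import urlopen


class IPProxyMiddleware(object):

    def fetch_proxy(self):
        # placeholder endpoint -- replace with your own proxy pool
        with urlopen('http://127.0.0.1:5010/get/', timeout=3) as resp:
            data = json.loads(resp.read().decode('utf-8'))
        proxy = data.get('proxy')
        # Scrapy expects a full URL in request.meta['proxy']
        return f'http://{proxy}' if proxy else None

    def process_request(self, request, spider):
        proxy_data = self.fetch_proxy()
        if proxy_data:
            spider.logger.debug(f"current proxy: {proxy_data}")
            request.meta['proxy'] = proxy_data
```

With a middleware like this enabled, also uncomment the `IPProxyMiddleware` entry in DOWNLOADER_MIDDLEWARES and set DOWNLOAD_DELAY = 0 in settings.py, as described in the README.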