├── .gitignore ├── LICENSE ├── README.md ├── WeiboCrawler ├── __init__.py ├── items.py ├── middlewares.py ├── pipelines.py ├── settings.py └── spiders │ ├── __init__.py │ ├── comment.py │ ├── mblog.py │ ├── repost.py │ ├── user.py │ └── utils.py ├── requirements.txt └── scrapy.cfg /.gitignore: -------------------------------------------------------------------------------- 1 | *.DS_Store 2 | WeiboCrawler/__pycache__ 3 | WeiboCrawler/spiders/__pycache__ 4 | -------------------------------------------------------------------------------- /LICENSE: -------------------------------------------------------------------------------- 1 | MIT License 2 | 3 | Copyright (c) 2021 XWang 4 | 5 | Permission is hereby granted, free of charge, to any person obtaining a copy 6 | of this software and associated documentation files (the "Software"), to deal 7 | in the Software without restriction, including without limitation the rights 8 | to use, copy, modify, merge, publish, distribute, sublicense, and/or sell 9 | copies of the Software, and to permit persons to whom the Software is 10 | furnished to do so, subject to the following conditions: 11 | 12 | The above copyright notice and this permission notice shall be included in all 13 | copies or substantial portions of the Software. 14 | 15 | THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR 16 | IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, 17 | FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE 18 | AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER 19 | LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, 20 | OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE 21 | SOFTWARE. 22 | -------------------------------------------------------------------------------- /README.md: -------------------------------------------------------------------------------- 1 | # WeiboCrawler 2 | 3 | ## 项目说明 4 | 5 | ### 项目介绍 6 | 7 | 新浪微博是国内主要的社交舆论平台,对社交媒体中的数据进行采集是舆论分析的方法之一。本项目无需cookie,可以连续爬取一个或多个新浪微博用户信息、用户微博及其微博评论转发。 8 | 9 | ### 实例 10 | 抓取用户信息 11 | ![user](https://xtopia-1258297046.cos.ap-shanghai.myqcloud.com/20210406163110.jpg) 12 | 抓取用户微博 13 | ![mblog](https://xtopia-1258297046.cos.ap-shanghai.myqcloud.com/20210406163931.png) 14 | 抓取微博转发 15 | ![Repost](https://xtopia-1258297046.cos.ap-shanghai.myqcloud.com/20210406164056.png) 16 | 抓取微博评论 17 | ![comment](https://xtopia-1258297046.cos.ap-shanghai.myqcloud.com/20210406164127.jpg) 18 | 19 | ## 使用方法 20 | ### 拉取项目 21 | ``` 22 | $ git clone https://github.com/XWang20/WeiboCrawler.git 23 | ``` 24 | 25 | ### 安装依赖 26 | 本项目Python版本为Python3.8 27 | ``` 28 | $ cd WeiboCrawler 29 | $ python -m pip install -r requirements.txt 30 | ``` 31 | 32 | ### 安装数据库(可选) 33 | 默认使用MongoDB数据库,可在[settings.py](https://github.com/XWang20/WeiboCrawler/blob/main/WeiboCrawler/settings.py)中修改URL和数据库名,默认为localhost、weibo。 34 | 35 | ### 运行程序 36 | #### 基本程序 37 | 在命令行中运行以下命令: 38 | 39 | 抓取用户信息 40 | ``` 41 | $ scrapy crawl user 42 | ``` 43 | 抓取用户微博 44 | ``` 45 | $ scrapy crawl mblog 46 | ``` 47 | 抓取微博转发 48 | ``` 49 | $ scrapy crawl repost 50 | ``` 51 | 抓取微博评论 52 | ``` 53 | $ scrapy crawl comment 54 | ``` 55 | 56 | #### 自定义选项 57 | 1. 关键词检索,需要将`./WeiboCrawler/spiders/mblog.py`中第28行代码替换为`urls = init_url_by_search()`,并在`init_url_by_search()`中增加关键词列表。 58 | 59 | 2. 采集id和时间范围等信息可根据自己实际需要重写`./WeiboCrawler/spiders/*.py`中的`start_requests`函数。 60 | 61 | 3. 
输出方式:支持输出到mongo数据库中,或输出json或csv文件。 62 | 63 | 如果输出json或csv文件,需要在命令后加入`-o *.json`或`-o *.csv`,例如: 64 | ``` 65 | $ scrapy crawl user -o user.csv 66 | ``` 67 | 68 | 如果输出到mongo数据库中,需要将`./WeiboCrawler/settings.py`中 mongo 数据库的部分取消注释: 69 | ``` 70 | ITEM_PIPELINES = { 71 | 'WeiboCrawler.pipelines.MongoPipeline': 400, 72 | } 73 | MONGO_URI = 'localhost' 74 | MONGO_DB = 'weibo' 75 | ``` 76 | 77 | 4. 添加账号cookie:可在[settings.py](WeiboCrawler/settings.py)中添加默认头,或在start_request函数中添加。 78 | 79 | 5. 默认下载延迟为3,可在[settings.py](WeiboCrawler/settings.py)修改DOWNLOAD_DELAY。 80 | 81 | 6. 默认会爬取二级评论,如果不需要可以在[comment.py](WeiboCrawler/spiders/comment.py)中注释以下代码: 82 | 83 | ```python 84 | if comment['total_number']: 85 | secondary_url = 'https://m.weibo.cn/comments/hotFlowChild?cid=' + comment['idstr'] 86 | yield Request(secondary_url, callback=self.parse_secondary_comment, meta={"mblog_id": mblog_id}) 87 | ``` 88 | 89 | ## 无cookie版限制的说明 90 | * 单用户微博最多采集200页,每页最多10条 91 | 限制可以通过添加账号cookie解决。 92 | 93 | ## 设置多线程和代理ip 94 | 95 | * 多线程:(**单ip池或单账号不建议采用多线程**) 96 | 在[settings.py](https://github.com/XWang20/WeiboCrawler/blob/main/WeiboCrawler/settings.py)文件中将以下代码取消注释: 97 | ```python 98 | # Configure maximum concurrent requests performed by Scrapy (default: 16) 99 | CONCURRENT_REQUESTS = 100 100 | # The download delay setting will honor only one of: 101 | CONCURRENT_REQUESTS_PER_DOMAIN = 100 102 | CONCURRENT_REQUESTS_PER_IP = 100 103 | ``` 104 | 105 | * 代理ip池 106 | 1. 填写[middlewares.py](https://github.com/XWang20/WeiboCrawler/blob/main/WeiboCrawler/middlewares.py)中的`fetch_proxy`函数。 107 | 2. 在[settings.py](https://github.com/XWang20/WeiboCrawler/blob/main/WeiboCrawler/settings.py)文件中将以下代码取消注释: 108 | ```python 109 | # Enable or disable downloader middlewares 110 | # See https://doc.scrapy.org/en/latest/topics/downloader-middleware.html 111 | DOWNLOADER_MIDDLEWARES = { 112 | 'WeiboCrawler.middlewares.IPProxyMiddleware': 100, 113 | } 114 | ``` 115 | 3. 
在[settings.py](https://github.com/XWang20/WeiboCrawler/blob/main/WeiboCrawler/settings.py)文件中将DOWNLOAD_DELAY设置为0。 116 | ```python 117 | DOWNLOAD_DELAY = 0 118 | ``` 119 | 120 | ## 字段说明 121 | ### 用户信息 122 | * _id: 用户ID 123 | * nick_name: 昵称 124 | * gender: 性别 125 | * brief_introduction: 简介 126 | * location: 所在地 127 | * mblogs_num: 微博数 128 | * follows_num: 关注数 129 | * fans_num: 粉丝数 130 | * vip_level: 会员等级 131 | * authentication: 认证,对于已认证用户该字段会显示认证信息 132 | * person_url: 首页链接 133 | 134 | ### 微博信息 135 | * _id: 微博id 136 | * bid: 微博bid 137 | * weibo_url: 微博URL 138 | * created_at: 微博发表时间 139 | * like_num: 点赞数 140 | * repost_num: 转发数 141 | * comment_num: 评论数 142 | * content: 微博内容 143 | * user_id: 发表该微博用户的id 144 | * tool: 发布微博的工具 145 | 146 | 147 | ### 转发信息 148 | * _id: 转发id 149 | * repost_user_id: 转发用户的id 150 | * content: 转发的内容 151 | * mblog_id: 转发的微博的id 152 | * created_at: 转发时间 153 | * source: 转发工具 154 | 155 | 156 | ### 评论信息 157 | * _id: 评论id 158 | * comment_user_id: 评论用户的id 159 | * content: 评论的内容 160 | * mblog_id: 评论的微博的id 161 | * created_at: 评论发表时间 162 | * like_num: 点赞数 163 | * root_comment_id: 根评论id,只有二级评论有该项 164 | * img_url: 图片地址 165 | * reply_comment_id: 评论的id,只有二级评论有该项 166 | 167 | ## 写在最后 168 | 169 | 170 | 本项目参考了[dataabc/weibo-crawler](https://github.com/dataabc/weibo-crawler)和[nghuyong/WeiboSpider](https://github.com/nghuyong/WeiboSpider),感谢他们的开源。 171 | 172 | 欢迎为本项目贡献力量。欢迎大家提交PR、通过issue提建议(如新功能、改进方案等)、通过issue告知项目存在哪些bug、缺点等。 173 | 174 | 如有问题和交流,也欢迎联系我: 175 | -------------------------------------------------------------------------------- /WeiboCrawler/__init__.py: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/XWang20/WeiboCrawler/591825b526c3d5c9b0e8b9f997803e3529fc3ed1/WeiboCrawler/__init__.py -------------------------------------------------------------------------------- /WeiboCrawler/items.py: -------------------------------------------------------------------------------- 1 | # -*- coding: utf-8 -*- 2 | 3 | # Define here the models for your scraped items 4 | # 5 | # See documentation in: 6 | # https://doc.scrapy.org/en/latest/topics/items.html 7 | 8 | from scrapy import Item, Field 9 | 10 | class UserItem(Item): 11 | """ User Information """ 12 | _id = Field() # 用户ID 13 | nick_name = Field() # 昵称 14 | gender = Field() # 性别 15 | brief_introduction = Field() # 简介 16 | location = Field() # 首页链接 17 | mblogs_num = Field() # 微博数 18 | follows_num = Field() # 关注数 19 | fans_num = Field() # 粉丝数 20 | vip_level = Field() # 会员等级 21 | authentication = Field() # 认证 22 | person_url = Field() # 首页链接 23 | 24 | class MblogItem(Item): 25 | """ Mblog information """ 26 | _id = Field() # 微博id 27 | bid = Field() 28 | weibo_url = Field() # 微博URL 29 | created_at = Field() # 微博发表时间 30 | like_num = Field() # 点赞数 31 | repost_num = Field() # 转发数 32 | comment_num = Field() # 评论数 33 | content = Field() # 微博内容 34 | user_id = Field() # 发表该微博用户的id 35 | tool = Field() # 发布微博的工具 36 | 37 | class CommentItem(Item): 38 | """ Mblog Comment Information """ 39 | _id = Field() 40 | comment_user_id = Field() # 评论用户的id 41 | content = Field() # 评论的内容 42 | mblog_id = Field() # 评论的微博的id 43 | created_at = Field() # 评论发表时间 44 | like_num = Field() # 点赞数 45 | root_comment_id = Field() # 根评论id,只有二级评论有该项 46 | img_url = Field() 47 | img_name = Field() 48 | reply_comment_id =Field() 49 | 50 | class RepostItem(Item): 51 | """ Mblog Repost Information """ 52 | _id = Field() 53 | repost_user_id = Field() # 转发用户的id 54 | content = Field() # 转发的内容 55 | mblog_id = Field() # 
转发的微博的id 56 | created_at = Field() # 转发时间 57 | source = Field() # 转发工具 58 | -------------------------------------------------------------------------------- /WeiboCrawler/middlewares.py: -------------------------------------------------------------------------------- 1 | # -*- coding: utf-8 -*- 2 | 3 | # Define here the models for your spider middleware 4 | # 5 | # See documentation in: 6 | # https://doc.scrapy.org/en/latest/topics/spider-middleware.html 7 | 8 | from scrapy.downloadermiddlewares.retry import RetryMiddleware 9 | from scrapy.utils.response import response_status_message 10 | from scrapy import signals 11 | import time 12 | 13 | # class IPProxyMiddleware(object): 14 | 15 | # def fetch_proxy(self): 16 | # # You need to rewrite this function if you want to add a proxy pool 17 | # # the function should return an ip in the format "ip:port", e.g. "12.34.1.4:9090" 18 | # pass 19 | 20 | # def process_request(self, request, spider): 21 | # proxy_data = self.fetch_proxy() 22 | # if proxy_data: 23 | # current_proxy = proxy_data 24 | # spider.logger.debug(f"current proxy:{current_proxy}") 25 | # request.meta['proxy'] = current_proxy 26 | 27 | class TooManyRequestsRetryMiddleware(RetryMiddleware): 28 | 29 | def __init__(self, crawler): 30 | super(TooManyRequestsRetryMiddleware, self).__init__(crawler.settings) 31 | self.crawler = crawler 32 | 33 | @classmethod 34 | def from_crawler(cls, crawler): 35 | return cls(crawler) 36 | 37 | def process_response(self, request, response, spider): 38 | if request.meta.get('dont_retry', False): 39 | return response 40 | elif response.status in (429, 418): 41 | self.crawler.engine.pause() 42 | time.sleep(60) # If the rate limit resets after a minute, sleep 60 seconds; adjust accordingly.
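# engine.pause() above stops Scrapy from scheduling new requests while this process sleeps; unpause() below resumes the crawl before the request is retried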
43 | self.crawler.engine.unpause() 44 | reason = response_status_message(response.status) 45 | return self._retry(request, reason, spider) or response 46 | elif response.status in self.retry_http_codes: 47 | reason = response_status_message(response.status) 48 | return self._retry(request, reason, spider) or response 49 | return response -------------------------------------------------------------------------------- /WeiboCrawler/pipelines.py: -------------------------------------------------------------------------------- 1 | # -*- coding: utf-8 -*- 2 | 3 | # Define your item pipelines here 4 | # 5 | # Don't forget to add your pipeline to the ITEM_PIPELINES setting 6 | # See: https://doc.scrapy.org/en/latest/topics/item-pipeline.html 7 | import re 8 | import pymongo 9 | from scrapy.pipelines.images import ImagesPipeline 10 | from scrapy import Request 11 | 12 | class WeibocrawlerPipeline(object): 13 | def process_item(self, item, spider): 14 | return item 15 | 16 | class MongoPipeline(object): 17 | def __init__(self, mongo_uri, mongo_db): 18 | self.mongo_uri = mongo_uri 19 | self.mongo_db = mongo_db 20 | 21 | @classmethod 22 | def from_crawler(cls, crawler): 23 | return cls( 24 | mongo_uri=crawler.settings.get('MONGO_URI'), 25 | mongo_db=crawler.settings.get('MONGO_DB') 26 | ) 27 | 28 | def open_spider(self, spider): 29 | self.client = pymongo.MongoClient(self.mongo_uri) 30 | self.db = self.client[self.mongo_db] 31 | 32 | def process_item(self, item, spider): 33 | name = item.__class__.__name__ 34 | self.db[name].insert_one(dict(item)) # insert_one replaces the deprecated Collection.insert 35 | return item 36 | 37 | def close_spider(self, spider): 38 | self.client.close() 39 | 40 | class ImagesnamePipeline(ImagesPipeline): 41 | def get_media_requests(self, item, info): 42 | if 'img_url' in item: 43 | for image_url in item['img_url']: 44 | # the value in meta comes from the spider and is passed on to file_path below 45 | yield Request(image_url, meta={'name': item['created_at']}, dont_filter=True, headers={'Host': 'wx1.sinaimg.cn'}) 46 | 47 | def file_path(self, request, response=None, info=None): 48 | # use the last url segment as the image id 49 | image_guid = request.url.split('/')[-1] 50 | # the image name defaults to the comment's creation date 51 | name = request.meta['name'] 52 | name = re.sub(r'[?\\*|“<>:/]', '', name) 53 | # default storage path: images/<comment date>/<comment date>_<image id> 54 | filename = u'{0}/{0}_{1}'.format(name, image_guid) 55 | return filename -------------------------------------------------------------------------------- /WeiboCrawler/settings.py: -------------------------------------------------------------------------------- 1 | # -*- coding: utf-8 -*- 2 | 3 | # Scrapy settings for WeiboCrawler project 4 | # 5 | # For simplicity, this file contains only settings considered important or 6 | # commonly used.
You can find more settings consulting the documentation: 7 | # 8 | # https://doc.scrapy.org/en/latest/topics/settings.html 9 | # https://doc.scrapy.org/en/latest/topics/downloader-middleware.html 10 | # https://doc.scrapy.org/en/latest/topics/spider-middleware.html 11 | 12 | BOT_NAME = 'WeiboCrawler' 13 | 14 | SPIDER_MODULES = ['WeiboCrawler.spiders'] 15 | NEWSPIDER_MODULE = 'WeiboCrawler.spiders' 16 | 17 | 18 | # Crawl responsibly by identifying yourself (and your website) on the user-agent 19 | USER_AGENT = 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/86.0.4240.111 Safari/537.36' 20 | 21 | # Obey robots.txt rules 22 | ROBOTSTXT_OBEY = False 23 | 24 | # Configure a delay for requests for the same website (default: 0) 25 | # See https://doc.scrapy.org/en/latest/topics/settings.html#download-delay 26 | # See also autothrottle settings and docs 27 | DOWNLOAD_DELAY = 3 28 | DOWNLOAD_TIMEOUT = 15 29 | # # Configure maximum concurrent requests performed by Scrapy (default: 16) 30 | # CONCURRENT_REQUESTS = 100 31 | # # The download delay setting will honor only one of: 32 | # CONCURRENT_REQUESTS_PER_DOMAIN = 100 33 | # CONCURRENT_REQUESTS_PER_IP = 100 34 | 35 | # Disable cookies (enabled by default) 36 | COOKIES_ENABLED = False 37 | 38 | # Disable Telnet Console (enabled by default) 39 | #TELNETCONSOLE_ENABLED = False 40 | 41 | # Override the default request headers: 42 | DEFAULT_REQUEST_HEADERS = { 43 | 'Accept': 'application/json, text/plain, */*', 44 | 'Accept-Encoding': 'gzip, deflate, sdch', 45 | 'Accept-Language': 'zh-CN,zh;q=0.8,en;q=0.6,ja;q=0.4,zh-TW;q=0.2,mt;q=0.2', 46 | 'Connection': 'keep-alive', 47 | 'Host': 'm.weibo.cn', 48 | 'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 11_3_0) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/89.0.4389.114 Safari/537.36', 49 | 'X-Requested-With': 'XMLHttpRequest', 50 | # 'cookie': 51 | } 52 | 53 | # Enable or disable downloader middlewares 54 | # See https://doc.scrapy.org/en/latest/topics/downloader-middleware.html 55 | DOWNLOADER_MIDDLEWARES = { 56 | # 'WeiboCrawler.middlewares.IPProxyMiddleware': 100, 57 | 'WeiboCrawler.middlewares.TooManyRequestsRetryMiddleware': 543, 58 | } 59 | RETRY_HTTP_CODES = [429, 418, 502] 60 | 61 | # Enable or disable extensions 62 | # See https://doc.scrapy.org/en/latest/topics/extensions.html 63 | #EXTENSIONS = { 64 | # 'scrapy.extensions.telnet.TelnetConsole': None, 65 | #} 66 | 67 | # Configure item pipelines 68 | # See https://doc.scrapy.org/en/latest/topics/item-pipeline.html 69 | ITEM_PIPELINES = { 70 | 'WeiboCrawler.pipelines.WeibocrawlerPipeline': 300, 71 | # 'WeiboCrawler.pipelines.MongoPipeline': 400, 72 | } 73 | # MONGO_URI = 'localhost' 74 | # MONGO_DB = 'weibo' 75 | 76 | # Enable and configure the AutoThrottle extension (disabled by default) 77 | # See https://doc.scrapy.org/en/latest/topics/autothrottle.html 78 | #AUTOTHROTTLE_ENABLED = True 79 | # The initial download delay 80 | #AUTOTHROTTLE_START_DELAY = 5 81 | # The maximum download delay to be set in case of high latencies 82 | #AUTOTHROTTLE_MAX_DELAY = 60 83 | # The average number of requests Scrapy should be sending in parallel to 84 | # each remote server 85 | #AUTOTHROTTLE_TARGET_CONCURRENCY = 1.0 86 | # Enable showing throttling stats for every response received: 87 | #AUTOTHROTTLE_DEBUG = False 88 | 89 | # Enable and configure HTTP caching (disabled by default) 90 | # See https://doc.scrapy.org/en/latest/topics/downloader-middleware.html#httpcache-middleware-settings 91 | #HTTPCACHE_ENABLED = 
True 92 | #HTTPCACHE_EXPIRATION_SECS = 0 93 | #HTTPCACHE_DIR = 'httpcache' 94 | #HTTPCACHE_IGNORE_HTTP_CODES = [] 95 | #HTTPCACHE_STORAGE = 'scrapy.extensions.httpcache.FilesystemCacheStorage' 96 | 97 | # # 图片存储目录 98 | # IMAGES_STORE = 'images/' 99 | # ITEM_PIPELINES = { 100 | # 'WeiboCrawler.pipelines.ImagesnamePipeline': 300, 101 | # } -------------------------------------------------------------------------------- /WeiboCrawler/spiders/__init__.py: -------------------------------------------------------------------------------- 1 | # This package will contain the spiders of your Scrapy project 2 | # 3 | # Please refer to the documentation for information on how to create and manage 4 | # your spiders. 5 | -------------------------------------------------------------------------------- /WeiboCrawler/spiders/comment.py: -------------------------------------------------------------------------------- 1 | # -*- coding: utf-8 -*- 2 | import re 3 | import json 4 | from datetime import datetime 5 | from scrapy import Request, Spider 6 | from WeiboCrawler.items import CommentItem 7 | from WeiboCrawler.spiders.utils import standardize_date, extract_content 8 | 9 | class CommentSpider(Spider): 10 | name = 'comment' 11 | base_url = 'https://api.weibo.cn/2/comments/build_comments?' 12 | 13 | def start_requests(self): 14 | mblog_ids = ['4704521537720325'] 15 | urls = [f'{self.base_url}is_show_bulletin=2&c=android&s=746fd605&id={mblog_id}&from=10A8195010&gsid=_2AkMolNMzf8NhqwJRmf4dxWzgb49zzQrEieKeyCLoJRM3HRl-wT9jqmwMtRV6AgOZP3LqGBH-29qGRB4vP3j-Hng6DkBJ&count=50&max_id_type=1' for mblog_id in mblog_ids] 16 | for url in urls: 17 | yield Request(url, callback=self.parse, dont_filter=True, headers={'Host': 'api.weibo.cn'}) 18 | 19 | def parse(self, response): 20 | js = json.loads(response.text) 21 | mblog_id = re.search(r'[\d]{16}', response.url).group(0) 22 | 23 | comments = js['root_comments'] 24 | for comment in comments: 25 | commentItem = CommentItem() 26 | commentItem['_id'] = comment['idstr'] 27 | commentItem['comment_user_id'] = comment['user']['id'] 28 | commentItem['mblog_id'] = mblog_id 29 | commentItem['created_at'] = standardize_date(comment['created_at']).strftime('%Y-%m-%d') 30 | commentItem['like_num'] = comment['like_counts'] 31 | commentItem['content'] = extract_content(comment['text']) 32 | img_url = [] 33 | if 'pic_infos' in comment: 34 | for pic in comment['pic_infos']: 35 | img_url.append(comment['pic_infos'][pic]['original']['url']) 36 | commentItem['img_url'] = img_url 37 | yield commentItem 38 | 39 | if comment['total_number']: 40 | cid = comment['idstr'] 41 | secondary_url = f'https://weibo.com/ajax/statuses/buildComments?is_reload=1&id={cid}&is_show_bulletin=2&is_mix=1&fetch_level=1&count=20&flow=1' 42 | yield Request(secondary_url, callback=self.parse_secondary_comment, meta={"mblog_id": mblog_id, 'secondary_url': secondary_url}, headers={'Host': 'weibo.com'}) 43 | 44 | max_id = js['max_id'] 45 | if max_id > 0: 46 | max_id_type = js['max_id_type'] 47 | next_url = f'{self.base_url}is_show_bulletin=2&c=android&s=746fd605&id={mblog_id}&from=10A8195010&gsid=_2AkMolNMzf8NhqwJRmf4dxWzgb49zzQrEieKeyCLoJRM3HRl-wT9jqmwMtRV6AgOZP3LqGBH-29qGRB4vP3j-Hng6DkBJ&count=50&max_id={max_id}&max_id_type={max_id_type}' 48 | yield Request(next_url, callback=self.parse, dont_filter=True, headers={'Host': 'api.weibo.cn'}) 49 | 50 | def parse_secondary_comment(self, response): 51 | js = json.loads(response.text) 52 | max_id = js['max_id'] 53 | 54 | if js['ok']: 55 | seccomments = js['data'] 56 | for 
seccomment in seccomments: 57 | commentItem = CommentItem() 58 | commentItem['_id'] = seccomment['id'] 59 | commentItem['comment_user_id'] = seccomment['user']['id'] 60 | commentItem['mblog_id'] = response.meta['mblog_id'] 61 | commentItem['created_at'] = standardize_date(seccomment['created_at']).strftime('%Y-%m-%d') 62 | commentItem['like_num'] = seccomment['like_counts'] 63 | commentItem['content'] = extract_content(seccomment['text']) 64 | commentItem['reply_comment_id'] = seccomment['reply_comment']['idstr'] 65 | commentItem['root_comment_id'] = seccomment['rootidstr'] 66 | yield commentItem 67 | 68 | if max_id > 0: 69 | secondary_url = response.meta['secondary_url'] 70 | next_url = f'{secondary_url}&max_id={max_id}' 71 | yield Request(next_url, callback=self.parse_secondary_comment, meta={"mblog_id": response.meta['mblog_id'], 'secondary_url': secondary_url}, headers={'Host': 'weibo.com'}) 72 | -------------------------------------------------------------------------------- /WeiboCrawler/spiders/mblog.py: -------------------------------------------------------------------------------- 1 | # -*- coding: utf-8 -*- 2 | import json 3 | from datetime import date 4 | from datetime import datetime 5 | from scrapy import Request, Spider 6 | from WeiboCrawler.items import MblogItem 7 | from WeiboCrawler.spiders.utils import standardize_date, extract_content 8 | 9 | class MblogSpider(Spider): 10 | name = 'mblog' 11 | base_url = 'https://m.weibo.cn/api/container/getIndex?' 12 | 13 | def start_requests(self): 14 | 15 | # 通过用户id搜索 16 | def init_url_by_user_id(): 17 | # crawl mblogs post by users 18 | user_ids = ['1699432410'] 19 | urls = [f'{self.base_url}containerid=107603{user_id}&page=1' for user_id in user_ids] 20 | return urls 21 | 22 | # 通过关键词搜索 23 | def init_url_by_search(): 24 | key_words = [''] 25 | urls = [f'{self.base_url}type=wb&queryVal={key_word}&containerid=100103type=61%26q%3D{key_word}&page=1' for key_word in key_words] 26 | return urls 27 | 28 | urls = init_url_by_user_id() 29 | for url in urls: 30 | yield Request(url, callback=self.parse) 31 | 32 | def parse(self, response): 33 | js = json.loads(response.text) 34 | page_num = int(response.url.split('=')[-1]) 35 | # 设定采集的时间段 36 | date_start = datetime.strptime("2019-12-01", '%Y-%m-%d') 37 | date_end = datetime.strptime("2022-2-8", '%Y-%m-%d') 38 | if js['ok']: 39 | weibos = js['data']['cards'] 40 | for w in weibos: 41 | if w['card_type'] == 9: 42 | weibo_info = w['mblog'] 43 | created_at = standardize_date(weibo_info['created_at']) 44 | if date_start <= created_at and created_at <= date_end: 45 | mblogItem = MblogItem() 46 | weiboid = mblogItem['_id'] = weibo_info['id'] 47 | mblogItem['bid'] = weibo_info['bid'] 48 | mblogItem['user_id'] = weibo_info['user']['id'] if weibo_info['user'] else '' 49 | mblogItem['like_num'] = weibo_info['attitudes_count'] 50 | mblogItem['repost_num'] = weibo_info['reposts_count'] 51 | mblogItem['comment_num'] = weibo_info['comments_count'] 52 | mblogItem['tool'] = weibo_info['source'] 53 | mblogItem['created_at'] = created_at.strftime('%Y-%m-%d') 54 | weibo_url = mblogItem['weibo_url'] = 'https://m.weibo.cn/detail/'+weiboid 55 | is_long = True if weibo_info.get('pic_num') > 9 else weibo_info.get('isLongText') # 判断是否为长微博 56 | if is_long: 57 | yield Request(weibo_url, callback=self.parse_all_content, meta={'item': mblogItem}, priority=1) 58 | else: 59 | mblogItem['content'] = extract_content(weibo_info['text']) 60 | yield mblogItem 61 | 62 | elif date_end < created_at: # 微博在采集时间段后 63 | continue 64 | 
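# posts are assumed to come back newest-first, so anything older than date_start means the remaining pages can be skipped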
else: # 微博超过需采集的时间 65 | page_num = 0 # 退出采集该用户 66 | break 67 | 68 | if js['ok'] and page_num: 69 | next_url = response.url.replace('page={}'.format(page_num), 'page={}'.format(page_num+1)) 70 | yield Request(next_url, callback=self.parse) 71 | 72 | def parse_all_content(self,response): 73 | mblogItem = response.meta['item'] 74 | html = response.text 75 | html = html[html.find('"status":'):] 76 | html = html[:html.rfind('"hotScheme"')] 77 | html = html[:html.rfind(',')] 78 | html = '{' + html + '}' 79 | js = json.loads(html, strict=False) 80 | weibo_info = js.get('status') 81 | if weibo_info: 82 | mblogItem['content'] = extract_content(weibo_info['text']) 83 | yield mblogItem 84 | -------------------------------------------------------------------------------- /WeiboCrawler/spiders/repost.py: -------------------------------------------------------------------------------- 1 | # -*- coding: utf-8 -*- 2 | import json 3 | from datetime import datetime 4 | from scrapy import Request, Spider 5 | from WeiboCrawler.items import RepostItem 6 | from WeiboCrawler.spiders.utils import extract_content, standardize_date 7 | 8 | class RepostSpider(Spider): 9 | name = 'repost' 10 | base_url = 'https://api.weibo.cn/2/guest/statuses_repost_timeline?' 11 | 12 | def start_requests(self): 13 | mblog_ids = ['4750304827933227'] # 原微博id文件 14 | urls = [f"{self.base_url}c=android&s=746fd605&ft=0&id={mblog_id}&from=10A8195010&gsid=_2AkMomC38f8NhqwJRmf4dxWzgb49zzQrEieKexNwnJRM3HRl-wT9kqmkAtRV6AgOZPxlCXdki_q9a-GZtfNgXXwAhZ5en&page=1" for mblog_id in mblog_ids] 15 | for url in urls: 16 | yield Request(url, callback=self.parse, headers={'Host': 'api.weibo.cn'}) 17 | 18 | def parse(self, response): 19 | js = json.loads(response.text) 20 | 21 | if js['next_cursor']: 22 | page_list = response.url.split("=") 23 | page_list[-1] = str(int(response.url.split("=")[-1])+1) 24 | page_url = '='.join(page_list) 25 | print(page_url) 26 | yield Request(page_url, self.parse, headers={'Host': 'api.weibo.cn'}, priority=1) 27 | 28 | reposts = js['reposts'] 29 | for repost in reposts: 30 | repostItem = RepostItem() 31 | repostItem['_id'] = repost['id'] 32 | repostItem['repost_user_id'] = repost['user']['id'] 33 | repostItem['mblog_id'] = repost['retweeted_status']['id'] 34 | repostItem['created_at'] = standardize_date(repost['created_at']).strftime('%Y-%m-%d') 35 | repostItem['content'] = repost['text'] 36 | repostItem['source'] = extract_content(repost['source']) 37 | yield repostItem 38 | -------------------------------------------------------------------------------- /WeiboCrawler/spiders/user.py: -------------------------------------------------------------------------------- 1 | # -*- coding: utf-8 -*- 2 | import json 3 | from scrapy import Request, Spider 4 | from WeiboCrawler.items import UserItem 5 | 6 | 7 | class UserSpider(Spider): 8 | name = 'user' 9 | base_url = 'https://m.weibo.cn/api/container/getIndex?' 
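# containerid 100505<uid> returns the userInfo profile card on m.weibo.cn; containerid 230283<uid>, requested later in parse(), returns the detail cards that include the location field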
10 | 11 | def start_requests(self): 12 | # file_path = '' #用户id文件 13 | # with open(file_path, 'rb') as f: 14 | # try: 15 | # lines = f.read().splitlines() 16 | # lines = [line.decode('utf-8-sig') for line in lines] 17 | # except UnicodeDecodeError: 18 | # logger.error(u'%s文件应为utf-8编码,请先将文件编码转为utf-8再运行程序', file_path) 19 | # sys.exit() 20 | user_ids = ['1749127163', '2028810631'] # 用户id列表 21 | urls = [f'{self.base_url}containerid=100505{user_id}' for user_id in user_ids] 22 | for url in urls: 23 | yield Request(url, callback=self.parse) 24 | 25 | def parse(self, response): 26 | userItem = UserItem() 27 | js = json.loads(response.text) 28 | if js['ok']: 29 | userInfo = js['data']['userInfo'] 30 | userItem['_id'] = userInfo['id'] 31 | userItem['nick_name'] = userInfo['screen_name'] 32 | userItem['gender'] = userInfo['gender'] 33 | userItem['brief_introduction'] = userInfo['description'] 34 | userItem['mblogs_num'] = userInfo['statuses_count'] 35 | userItem['follows_num'] = userInfo['follow_count'] 36 | userItem['fans_num'] = userInfo['followers_count'] 37 | userItem['authentication'] = userInfo['verified'] 38 | if userItem['authentication']: 39 | userItem['authentication'] = userInfo['verified_reason'] 40 | userItem['vip_level'] = userInfo['mbrank'] 41 | userItem['person_url'] = userInfo['profile_url'].split('?')[0] 42 | profile_url = f"{self.base_url}containerid=230283{userInfo['id']}" 43 | yield Request(profile_url, callback=self.parse_location, meta={'item': userItem}, priority=1) 44 | 45 | def parse_location(self,response): 46 | userItem = response.meta['item'] 47 | js = json.loads(response.text) 48 | if js['ok']: 49 | userItem['location']=js['data']['cards'][0]['card_group'][0]['item_content'] 50 | print(userItem['location']) 51 | yield userItem 52 | -------------------------------------------------------------------------------- /WeiboCrawler/spiders/utils.py: -------------------------------------------------------------------------------- 1 | #!/usr/bin/env python 2 | # encoding: utf-8 3 | import re 4 | from datetime import datetime, timedelta 5 | 6 | def standardize_date(created_at): 7 | """标准化微博发布时间""" 8 | if u'刚刚' in created_at: 9 | created_at = datetime.now().strftime('%Y-%m-%d') 10 | elif u'分钟' in created_at: 11 | minute = created_at[:created_at.find(u'分钟')] 12 | minute = timedelta(minutes=int(minute)) 13 | created_at = (datetime.now() - minute).strftime('%Y-%m-%d') 14 | elif u'小时' in created_at: 15 | hour = created_at[:created_at.find(u'小时')] 16 | hour = timedelta(hours=int(hour)) 17 | created_at = (datetime.now() - hour).strftime('%Y-%m-%d') 18 | elif u'昨天' in created_at: 19 | day = timedelta(days=1) 20 | created_at = (datetime.now() - day).strftime('%Y-%m-%d') 21 | else: 22 | created_at = created_at.replace('+0800 ', '') 23 | temp = datetime.strptime(created_at, '%c') 24 | created_at = datetime.strftime(temp, '%Y-%m-%d') 25 | return datetime.strptime(created_at, '%Y-%m-%d') 26 | 27 | 28 | def extract_content(text): 29 | text_body = text 30 | dr = re.compile(r'<[^>]+>',re.S) # 过滤html标签 31 | text_body = dr.sub('',text_body) 32 | return text_body 33 | -------------------------------------------------------------------------------- /requirements.txt: -------------------------------------------------------------------------------- 1 | pymongo==3.10.1 2 | Scrapy==1.5.1 3 | Pillow==8.2.0 -------------------------------------------------------------------------------- /scrapy.cfg: -------------------------------------------------------------------------------- 1 | # Automatically 
created by: scrapy startproject 2 | # 3 | # For more information about the [deploy] section see: 4 | # https://scrapyd.readthedocs.io/en/latest/deploy.html 5 | 6 | [settings] 7 | default = WeiboCrawler.settings 8 | 9 | [deploy] 10 | #url = http://localhost:6800/ 11 | project = WeiboCrawler 12 | --------------------------------------------------------------------------------
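A note on the proxy-pool option: `fetch_proxy` in WeiboCrawler/middlewares.py is left as a stub. Below is a minimal sketch of one way it might be filled in, assuming a locally running proxy-pool HTTP service; the endpoint URL and JSON response shape are placeholders, not part of this project.

```python
# Hypothetical fetch_proxy for WeiboCrawler/middlewares.py.
# Assumes a local pool service that answers GET /get/ with
# {"proxy": "12.34.1.4:9090"}; adapt the URL and parsing to your pool.
import json
from urllib.request import urlopen


class IPProxyMiddleware(object):

    def fetch_proxy(self):
        # placeholder endpoint -- replace with your own proxy pool
        with urlopen('http://127.0.0.1:5010/get/', timeout=3) as resp:
            data = json.loads(resp.read().decode('utf-8'))
        proxy = data.get('proxy')
        # Scrapy expects a full URL in request.meta['proxy']
        return f'http://{proxy}' if proxy else None

    def process_request(self, request, spider):
        proxy_data = self.fetch_proxy()
        if proxy_data:
            spider.logger.debug(f"current proxy: {proxy_data}")
            request.meta['proxy'] = proxy_data
```

With a middleware like this enabled, also uncomment the `IPProxyMiddleware` entry in DOWNLOADER_MIDDLEWARES and set DOWNLOAD_DELAY = 0 in settings.py, as described in the README.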