├── .gitignore
├── README.md
├── script.py
└── work.py
/.gitignore:
--------------------------------------------------------------------------------
1 | __pycache__/
2 | tmp/
3 | 1.*
4 | *.json
5 | *.jsonl
6 | *.html
7 | *.xlsx
8 |
9 | keywords.py
10 | work1.py
11 |
--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
## WeChat Official Account Article Crawler

### Requirements

Given a chosen set of WeChat subscription accounts, batch-crawl the following fields for every article published within a given time range:

- Link
- Title
- Cover image
- Author
- Publishing account
- Body text
- Publish time
- Read count
- Like count (点赞)
- "Wow" count (在看)
- Comment count
- Whether comments are enabled

and export everything to an Excel spreadsheet.

### Implementation

Overall approach: enumerate the official accounts -> crawl every article under each account -> extract the required fields from each article.

#### Step 1: Crawl all articles under an official account

Even without a crawler, it is not easy to see every article an official account has published. The usual way is to open the account inside the WeChat app, but that page cannot be packet-captured, so it yields neither the requests nor the responses a crawler needs. Some accounts provide a list of past articles that could be crawled, but this depends entirely on the individual account and is not a general solution.

Known approaches found online:

- WeChat history messages, but the account may get banned
- [Sogou WeChat search](https://weixin.sogou.com/), which no longer seems to be maintained
- [WeChat Official Accounts Platform](https://mp.weixin.qq.com/), which may also get the account banned

This crawler uses the WeChat Official Accounts Platform: when composing an image-text post, the "insert hyperlink" dialog can search every article published under a given official account.

Using the platform requires registering an account first and having its name pass review.

In practice, the account gets blocked for a while (tens of minutes) after a few dozen pages have been crawled. The crawler therefore has to be restarted several times, so it records the account and page number at the point of failure, making it easy to jump straight back there on the next run.

The implementation drives a browser with `playwright`; see [script.py](https://github.com/zsq259/WeChat-official-account-crawler/blob/main/script.py).

The crawled article URLs are appended to `article_links.jsonl`.
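
Each line is a standalone JSON object pairing the account name with one article URL. As written by `script.py`, a record looks like this (the URL parameters are placeholders):

```json
{"pub_name": "<official account name>", "url": "http://mp.weixin.qq.com/s?__biz=...&mid=...&idx=...&sn=...&chksm=..."}
```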

#### Step 2: Crawl all information from a single article

The links obtained in Step 1 already contain the request parameters used here (`__biz`, `mid`, `idx`, `sn`).

The hard part is figuring out which response each piece of information lives in, and how to send the requests and extract the data.

The relevant markers and parameters were found through web searches and hands-on experimentation; the information is split across two responses. See the code for the details (a condensed sketch follows the list):

- Title: rich_media_title

- Cover image: js_row_immersive_cover_img

- Author:

```html
class="rich_media_meta_list"
class="rich_media_meta rich_media_meta_text"
```

- Publishing account: id="js_name"

- Body text:

```html
id="js_content"
```

- Publish time: id="publish_time"
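
For reference, here is a condensed sketch of how `work.py` actually reads these page fields (it takes the title and cover image from the `og:` meta tags and the publish time from the inline `createTime` variable, which correspond to the markers listed above):

```python
import re

import requests
from bs4 import BeautifulSoup


def extract_page_fields(article_url, headers):
    """Condensed version of the parsing done in work1() of work.py."""
    soup = BeautifulSoup(requests.get(article_url, headers=headers).text, 'lxml')

    # Title and cover image from the og: meta tags
    title = soup.find('meta', property='og:title')
    cover = soup.find('meta', property='og:image')
    # Author: the rich_media_meta_text element inside rich_media_meta_list
    author = soup.find(class_='rich_media_meta rich_media_meta_text')
    # Publishing account: the element with id="js_name"
    publisher = soup.find(id='js_name')
    # Body text: the <div id="js_content"> container
    body = soup.find('div', id='js_content')
    # Publish time: the createTime variable inside an inline <script>
    match = re.search(r'var createTime = [\'"](.*?)[\'"]', str(soup.find_all('script')))

    return {
        'title': title['content'] if title else '',
        'cover_image': cover['content'] if cover else '',
        'author': author.text.strip() if author else '',
        'publisher': publisher.text.strip() if publisher else '',
        'text': body.get_text(separator=' ', strip=True) if body else '',
        'publish_time': match.group(1) if match else '',
    }
```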

The other response, getappmsgext ([reference](https://wnma3mz.github.io/hexo_blog/2017/11/18/%E8%AE%B0%E4%B8%80%E6%AC%A1%E5%BE%AE%E4%BF%A1%E5%85%AC%E4%BC%97%E5%8F%B7%E7%88%AC%E8%99%AB%E7%9A%84%E7%BB%8F%E5%8E%86%EF%BC%88%E5%BE%AE%E4%BF%A1%E6%96%87%E7%AB%A0%E9%98%85%E8%AF%BB%E7%82%B9%E8%B5%9E%E7%9A%84%E8%8E%B7%E5%8F%96%EF%BC%89/)), provides:

- Read count: read_num
- Like count: old_like_num
- "Wow" count: like_num
- Comment count: comment_count
- Comments enabled: comment_enabled

To get this response, i.e. these five fields, the request must include a `key` parameter. This parameter is not easy to obtain: it is tied to the official account and expires after a while. For now the program relies on manually inspecting a captured packet and pasting the `key` into the code, so one manual step is needed per official account. A possible improvement would be to extract the `key` automatically from the captured packets. A sketch of the request is shown below.
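
A minimal sketch of that request, following `work2()` in `work.py` (`key`, `uin`, `cookies`, and `headers` come from a captured packet; `comment_id` is scraped from the article page itself):

```python
import requests


def fetch_article_stats(__biz, mid, idx, sn, key, uin, comment_id,
                        cookies, headers):
    """Condensed version of work2() in work.py."""
    params = {
        "__biz": __biz, "mid": mid, "idx": idx, "sn": sn,
        "key": key,        # captured manually; expires after a while
        "uin": uin,
        "comment_id": comment_id,
    }
    data = {
        "is_only_read": "1",
        "is_temp_url": "0",
        "appmsg_type": "9",  # without this, like_num is not returned
        "reward_uin_count": "0",
    }
    resp = requests.post("https://mp.weixin.qq.com/mp/getappmsgext",
                         params=params, data=data,
                         cookies=cookies, headers=headers).json()

    stat = resp.get("appmsgstat", {})
    return {
        "read_num": stat.get("read_num", -1),            # read count
        "old_like_num": stat.get("old_like_num", -1),    # like count
        "like_num": stat.get("like_num", -1),            # "wow" count
        "comment_count": resp.get("comment_count", 0),   # comment count
        "comment_enabled": resp.get("comment_enabled", 0),
    }
```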

#### Limitations

In Step 1 the crawl speed is severely limited: going even slightly too fast gets the account blocked for a long time, and no data can be fetched. Even at a pace of 10 s per page, the account is still blocked for tens of minutes after fifty or sixty pages. With several accounts this can be mitigated by pipelining. Due to implementation details, the crawler does not record whether an account has been fully crawled, so `keywords.py` has to be edited by hand to comment out the accounts that are already done.

In Step 2, automatic retrieval of the `key` is not implemented yet, so you must first open one of the account's articles in the WeChat client, capture the request with a packet-capture tool, and copy the `key` into the code. Crawling too fast is also detected, after which the account can no longer retrieve like counts and the other statistics. When saving results, make sure `output.xlsx` is not open in another application (such as WPS), otherwise writing fails with a permission error.

### Usage (current)

`script.py` does the work of Step 1. Install the required libraries and run the script, scan the QR code in the window that pops up to log in to the WeChat Official Accounts Platform, and wait for the crawl to finish. **After an official account has been fully crawled, comment it out manually in `keywords.py` to avoid wasting time re-crawling it.**
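
`keywords.py` is not included in the repository (it is listed in `.gitignore`); judging from `from keywords import keywords` in `script.py`, it is expected to define a list of official-account names to crawl, for example:

```python
# keywords.py (not committed): the official accounts to crawl.
# Comment out an account once it has been fully crawled.
keywords = [
    "Official account A",
    # "An account that is already finished",
    "Official account B",
]
```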

`work.py` does the work of Step 2. First install the required libraries. Then capture an article request with a packet-capture tool and fill the obtained parameters (`headers`, `cookies`, `key`) into the code, after which you can run the script. **While it is running, avoid opening the output file `output.xlsx` in other applications, to prevent permission errors when writing.** The program stops once all articles of the official account have been crawled or the `key` has expired; at that point use the packet-capture tool again to obtain a new `key`.
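
The captured values go into the placeholders at the top of `work.py`:

```python
# Top of work.py: fill these in from a captured article request.
cookies = {}      # cookies from the captured request
headers = {}      # request headers from the captured request
uin = ""          # the uin parameter of the captured URL
article_key = ""  # the key parameter; expires after a while
```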

### References

[https://github.com/wnma3mz/wechat_articles_spider](https://github.com/wnma3mz/wechat_articles_spider)

[https://blog.csdn.net/nzbing/article/details/131601790](https://blog.csdn.net/nzbing/article/details/131601790)

[https://www.bilibili.com/video/BV1mK4y1F7Kp/](https://www.bilibili.com/video/BV1mK4y1F7Kp/)

--------------------------------------------------------------------------------
/script.py:
--------------------------------------------------------------------------------
1 | from playwright.sync_api import Playwright, sync_playwright, expect
2 | import time
3 | from datetime import datetime
4 | import json, jsonlines
5 | import os
6 | from keywords import keywords
7 |
8 | def fake(page):  # scroll down in small steps with pauses to simulate browsing
9 | page.mouse.wheel(0,1200)
10 | time.sleep(1)
11 | page.mouse.wheel(0,200)
12 | time.sleep(1)
13 | page.mouse.wheel(0,200)
14 | time.sleep(1)
15 | page.mouse.wheel(0,200)
16 | time.sleep(1)
17 | page.mouse.wheel(0,200)
18 | time.sleep(1)
19 | page.mouse.wheel(0,200)
20 |
21 | file_path = "article_links.jsonl"
22 |
23 | def init_dic():
24 | data = {}
25 | if os.path.exists(file_path):
26 | with open(file_path, 'r', encoding='utf-8') as f:
27 | for line in f:
28 | line = json.loads(line)
29 | if data.get(line["pub_name"]) is None:
30 | data[line["pub_name"]] = []
31 | data[line["pub_name"]].append(line["url"])
32 | return data
33 |
34 | data = init_dic()
35 |
36 | comparison_date = datetime.strptime("2023-03-01", "%Y-%m-%d")
37 |
38 | def get_links(page, pub_name):
39 | time.sleep(5)
40 |     # Locate all the <label> elements, one per article on the current page
41 |     labels = page.locator('label.inner_link_article_item')
42 | 
43 |     # Number of articles on the current page
44 |     count = labels.count()
45 |     print(f"This page lists {count} articles")
46 | if count == 0:
47 | return False
48 |
49 |     # Iterate over the label elements
50 |     for i in range(count):
51 |         # For each label, read the href of the <a> inside its second <span>
52 | href_value = labels.nth(i).locator('span:nth-of-type(2) a').get_attribute('href')
53 | date_element_text = labels.nth(i).locator('.inner_link_article_date').text_content()
54 | test_date = datetime.strptime(date_element_text, "%Y-%m-%d")
55 | if test_date < comparison_date:
56 | return None
57 | if href_value not in data[pub_name]:
58 | data[pub_name].append(href_value)
59 | with jsonlines.open(file_path, mode='a') as writer:
60 | writer.write({"pub_name": pub_name, "url": href_value})
61 | if len(data[pub_name]) >= 500:
62 | return None
63 |
64 | return True
65 |
66 | def login(playwright: Playwright):
67 | browser = playwright.chromium.launch(headless=False)
68 | context = browser.new_context()
69 | page = context.new_page()
70 | page.goto("https://mp.weixin.qq.com/")
71 | with page.expect_popup() as page1_info:
72 |         page.locator(".new-creation__icon > svg").first.click(timeout=1000000)  # long timeout: wait for the user to scan the QR code and log in
73 | page1 = page1_info.value
74 | cookies = page.context.cookies()
75 | page1.close()
76 | page.close()
77 | context.close()
78 | browser.close()
79 | return cookies
80 |
81 | def get_cookies():
82 | with sync_playwright() as playwright:
83 | cookies = login(playwright)
84 | return cookies
85 |
86 | cookies = get_cookies()
87 |
88 | def record_state(count_path, page_count):
89 | with open(count_path, 'w') as f:
90 | f.write(str(page_count - 1))
91 |
92 |
93 | def run(playwright: Playwright, pub_name) -> None:
94 | browser = playwright.chromium.launch(headless=True)
95 | context = browser.new_context()
96 | page = context.new_page()
97 | page.context.add_cookies(cookies)
98 | page.goto("https://mp.weixin.qq.com/")
99 |
100 | with page.expect_popup() as page1_info:
101 | page.locator(".new-creation__icon > svg").first.click(timeout=1000000)
102 | page1 = page1_info.value
103 |
104 |     page1.get_by_text("超链接").click()  # open the "insert hyperlink" dialog
105 |     page1.get_by_text("选择其他公众号").click()  # "choose another official account"
106 |     page1.get_by_placeholder("输入文章来源的公众号名称或微信号,回车进行搜索").click()  # account search box
107 |     page1.get_by_placeholder("输入文章来源的公众号名称或微信号,回车进行搜索").fill(pub_name)
108 |     page1.get_by_placeholder("输入文章来源的公众号名称或微信号,回车进行搜索").press("Enter")
109 |     page1.get_by_text("订阅号", exact=True).nth(1).click()  # pick the matching subscription account
110 |
111 | page_count = 0
112 | count_path = f"./tmp/page_count_{pub_name}.txt"
113 | if os.path.exists(count_path):
114 | with open(count_path, 'r') as f:
115 | page_count = int(f.read())
116 |
117 |     if page_count > 0:  # resume from the page recorded by a previous run
118 |         page1.fill('input[type="number"]', str(page_count))
119 |         page1.get_by_role("link", name="跳转").click()  # "跳转" = jump to that page
120 |
121 |     error_flag = False  # True: stop the whole run (blocked / no more pages); None: account finished, go to the next one
122 |     this_count = 0      # pages crawled in this session
123 |
124 |     while True:
125 | page_count += 1
126 | this_count += 1
127 |         if this_count > 48:  # stop this session before the platform blocks the account
128 | record_state(count_path, page_count)
129 | error_flag = True
130 | break
131 |         print(f"Account {pub_name}, page {page_count}")
132 | flag = get_links(page1, pub_name)
133 | if flag is None:
134 | record_state(count_path, page_count)
135 | error_flag = None
136 | break
137 |
138 | if not flag:
139 | record_state(count_path, page_count)
140 | error_flag = True
141 | break
142 | fake(page1)
143 |         next_button = page1.get_by_role("link", name="下一页")  # "下一页" = next page
144 | 
145 |         if next_button.count() == 0:
146 | record_state(count_path, page_count)
147 | error_flag = True
148 | break
149 |         print(f"Collected {len(data[pub_name])} article links for this account so far")
150 | next_button.click()
151 |
152 | page1.close()
153 | page.close()
154 | context.close()
155 | browser.close()
156 | return error_flag
157 |
158 | def check_folder():
159 | if not os.path.exists("tmp"):
160 | os.makedirs("tmp")
161 |
162 | check_folder()
163 |
164 | for keyword in keywords:
165 | if data.get(keyword) is None:
166 | data[keyword] = []
167 | with sync_playwright() as playwright:
168 | flag = run(playwright, keyword)
169 | if flag is None:
170 | continue
171 | if flag:
172 | break
--------------------------------------------------------------------------------
/work.py:
--------------------------------------------------------------------------------
1 | import requests
2 | from bs4 import BeautifulSoup
3 | import re
4 | import os
5 | import pandas as pd
6 | import json
7 | import ssl
8 | import time
9 | ssl._create_default_https_context = ssl._create_unverified_context
10 |
11 | cookies = {} # fill in the cookies you get
12 |
13 | headers = {} # fill in the headers you get
14 |
15 | data = {
16 | "is_only_read": "1",
17 | "is_temp_url": "0",
18 |     "appmsg_type": "9", # newer parameter; without it like_num is not returned
19 | "reward_uin_count": "0",
20 | }
21 |
22 | # Additionally required by work2():
23 | # both can be taken from the packet captured for work1()
24 | uin = "" # fill in the uin you get
25 | article_key = "" # fill in the article_key you get
26 |
27 | links = []
28 | titles = []
29 | cover_images = []
30 | arthors = []
31 | publishers = []
32 | texts = []
33 | publish_times = []
34 | read_nums = []
35 | old_like_nums = []
36 | like_nums = []
37 | conment_nums = []
38 | conment_enableds = []
39 |
40 | def get_params(url):  # split __biz, mid, idx, sn out of a crawled article URL
41 | url = url.split("&")
42 | __biz = url[0].split("__biz=")[1]
43 | article_mid = url[1].split("=")[1]
44 | article_idx = url[2].split("=")[1]
45 | article_sn = url[3].split("=")[1]
46 | return __biz, article_mid, article_idx, article_sn
47 |
48 | def work1(origin_url, need_more_info=False):  # fetch one article page and parse its fields
49 | __biz, article_mid, article_idx, article_sn = get_params(origin_url)
50 | url = "https://mp.weixin.qq.com/s?"
51 | article_url = url + "__biz={}&mid={}&idx={}&sn={}".format(__biz, article_mid, article_idx, article_sn)
52 | url2 = url + "__biz={}&key={}&uin={}&mid={}&idx={}&sn={}".format(__biz, article_key, uin, article_mid, article_idx, article_sn)
53 | content = requests.get(url2, headers=headers, data=data)
54 | links.append(article_url)
55 |
56 | soup = BeautifulSoup(content.text, 'lxml')
57 |
58 |     # Extract the title, cover image, author, publishing account and publish time
59 | title = soup.find('meta', property="og:title")['content'] if soup.find('meta', property="og:title") else ""
60 | cover_image = soup.find('meta', property="og:image")['content'] if soup.find('meta', property="og:image") else ""
61 | author = soup.find(class_="rich_media_meta rich_media_meta_text").text.strip() if soup.find(class_="rich_media_meta rich_media_meta_text") else ""
62 |
63 |     publisher_tag = soup.find(id="js_name")
64 |     publisher = ""
65 |     if publisher_tag is None:
66 |         # log the URL of articles whose publisher element is missing
67 |         print(origin_url)
68 |     else:
69 |         publisher = publisher_tag.text.strip()
70 |
71 |     # Find the <div id="js_content"> tag that holds the article body
72 | text_content = soup.find('div', id='js_content')
73 | text = None
74 |
75 |     # If the tag is found, extract its text; otherwise print the URL for inspection
76 | if text_content:
77 | text = text_content.get_text(separator=' ', strip=True)
78 | else:
79 | print(origin_url)
80 | # raise Exception("找不到文章正文")
81 | text = ""
82 | texts.append(text)
83 |
84 |     # Extract variable values (publish time, comment_id) from the inline JavaScript with regexes
85 | js_content = str(soup.find_all('script'))
86 | publish_time = re.search(r'var createTime = [\'"](.*?)[\'"]', js_content).group(1) if re.search(r'var createTime = [\'"](.*?)[\'"]', js_content) else ""
87 | comment_id = re.search(r'var comment_id = [\'"](.*?)[\'"]', js_content).group(1) if re.search(r'var comment_id = [\'"](.*?)[\'"]', js_content) else ""
88 |
89 | titles.append(title)
90 | cover_images.append(cover_image)
91 | arthors.append(author)
92 | publishers.append(publisher)
93 | publish_times.append(publish_time)
94 |
95 | if need_more_info:
96 | work2(__biz, article_mid, article_idx, article_sn, article_key, uin, comment_id)
97 | else:
98 | read_nums.append(-1)
99 | old_like_nums.append(-1)
100 | like_nums.append(-1)
101 | conment_nums.append(-1)
102 | conment_enableds.append(-1)
103 |
104 | def work2(__biz, article_mid, article_idx, article_sn, article_key, uin, article_comment_id):  # fetch read/like/comment stats via getappmsgext
105 | params = {
106 | "__biz": __biz,
107 | "mid": article_mid,
108 | "sn": article_sn,
109 | "idx": article_idx,
110 | "key": article_key,
111 | "uin": uin,
112 | "comment_id": article_comment_id,
113 | }
114 | url = "https://mp.weixin.qq.com/mp/getappmsgext?"
115 | contents = requests.post(url, params=params, data=data, cookies=cookies, headers=headers).json()
116 | # print(contents)
117 |
118 | if 'appmsgstat' not in contents:
119 | read_nums.append(-1)
120 | old_like_nums.append(-1)
121 | like_nums.append(-1)
122 | conment_nums.append(-1)
123 | conment_enableds.append(-1)
124 | output()
125 |         raise Exception("Failed to fetch article stats (the key may have expired)")
126 |
127 | read_num = contents['appmsgstat']['read_num'] if 'appmsgstat' in contents else -1
128 | old_like_num = contents['appmsgstat']['old_like_num'] if 'appmsgstat' in contents else -1
129 | like_num = contents['appmsgstat']['like_num'] if 'appmsgstat' in contents else -1
130 |     conment_num = contents['comment_count'] if 'comment_count' in contents else 0
131 | conment_enabled = contents['comment_enabled'] if 'comment_enabled' in contents else 0
132 |
133 | read_nums.append(read_num)
134 | old_like_nums.append(old_like_num)
135 | like_nums.append(like_num)
136 | conment_nums.append(conment_num)
137 | conment_enableds.append(conment_enabled)
138 |
139 | file_path = "./output.xlsx"
140 |
141 | def output(file_path=file_path):
142 | df = pd.DataFrame({
143 | "链接": links,
144 | "标题": titles,
145 | "封面图": cover_images,
146 | "作者": arthors,
147 | "发布账号": publishers,
148 | "发布时间": publish_times,
149 | "阅读量": read_nums,
150 | "点赞数": old_like_nums,
151 | "在看数": like_nums,
152 | "留言数": conment_nums,
153 | "留言是否开启": conment_enableds,
154 | "正文": texts
155 | })
156 |
157 | if not os.path.exists(file_path):
158 | df.to_excel(file_path, index=False)
159 | return
160 | old_df = pd.read_excel(file_path)
161 | new_df = pd.concat([old_df, df], ignore_index=True).drop_duplicates()
162 | new_df.to_excel(file_path, index=False)
163 |
164 | links_path = "./article_links.jsonl"
165 |
166 | def init(file_path=file_path):
167 |     # Collect the links already present in the output file so they can be skipped
168 |     links = []
169 |     if os.path.exists(file_path):
170 |         df = pd.read_excel(file_path)
171 |         links = df["链接"].tolist()
172 |
173 | return links
174 |
175 | ready_links = init()
176 |
177 | now_biz = ""
178 |
179 | with open(links_path, 'r', encoding='utf-8') as f:
180 | for line in f:
181 | line = json.loads(line)
182 |         url = line["url"].split("&chksm=")[0]  # drop the chksm parameter
183 |         url = "https" + url.split("http")[1]   # normalize the scheme to https
184 |
185 | if url not in ready_links:
186 | # print(url)
187 | __biz = url.split("&")[0].split("__biz=")[1]
188 |         if __biz != now_biz:  # only one account (one __biz, i.e. one key) is handled per run
189 | if now_biz != "":
190 | break
191 | now_biz = __biz
192 | time.sleep(2)
193 | ready_links.append(url)
194 | work1(url, True)
195 | output()
196 | df = pd.read_excel(file_path).drop_duplicates()
197 | df.to_excel(file_path, index=False)
--------------------------------------------------------------------------------