├── .gitignore
├── README.md
├── script.py
└── work.py
/.gitignore:
--------------------------------------------------------------------------------
1 | __pycache__/
2 | tmp/
3 | 1.*
4 | *.json
5 | *.jsonl
6 | *.html
7 | *.xlsx
8 |
9 | keywords.py
10 | work1.py
11 |
--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
## WeChat Official Account Article Crawler

### Requirements

Given a chosen set of WeChat subscription accounts, batch-crawl the following fields for every article published within a given time range:

- Link
- Title
- Cover image
- Author
- Publishing account
- Body text
- Publish time
- Read count
- Like count (点赞)
- "Wow" count (在看)
- Comment count
- Whether comments are enabled

and export everything to an Excel spreadsheet.

### Implementation

Overall approach: enumerate the official accounts -> crawl every article under each account -> extract the required fields from each article.

#### Step 1: Crawl all articles under an official account

Even without a crawler, it is not easy to see every article an official account has published. The usual way is to open the account inside the WeChat app, but that page cannot be packet-captured, so it yields neither the requests nor the responses a crawler needs. Some accounts provide a list of past articles that could be crawled, but this depends entirely on the individual account and is not a general solution.

Known approaches found online:

- WeChat history messages, but the account may get banned
- [Sogou WeChat search](https://weixin.sogou.com/), which no longer seems to be maintained
- [WeChat Official Accounts Platform](https://mp.weixin.qq.com/), which may also get the account banned

This crawler uses the WeChat Official Accounts Platform: when composing an image-text post, the "insert hyperlink" dialog can search every article published under a given official account.

Using the platform requires registering an account first and having its name pass review.

In practice, the account gets blocked for a while (tens of minutes) after a few dozen pages have been crawled. The crawler therefore has to be restarted several times, so it records the account and page number at the point of failure, making it easy to jump straight back there on the next run.

The implementation drives a browser with `playwright`; see [script.py](https://github.com/zsq259/WeChat-official-account-crawler/blob/main/script.py).

The crawled article URLs are appended to `article_links.jsonl`.
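
Each line is a standalone JSON object pairing the account name with one article URL. As written by `script.py`, a record looks like this (the URL parameters are placeholders):

```json
{"pub_name": "<official account name>", "url": "http://mp.weixin.qq.com/s?__biz=...&mid=...&idx=...&sn=...&chksm=..."}
```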

#### Step 2: Crawl all information from a single article

The links obtained in Step 1 already contain the request parameters used here (`__biz`, `mid`, `idx`, `sn`).

The hard part is figuring out which response each piece of information lives in, and how to send the requests and extract the data.

The relevant markers and parameters were found through web searches and hands-on experimentation; the information is split across two responses. See the code for the details (a condensed sketch follows the list):

- Title: rich_media_title

- Cover image: js_row_immersive_cover_img

- Author:

```html
class="rich_media_meta_list"
class="rich_media_meta rich_media_meta_text"
```

- Publishing account: id="js_name"

- Body text:

```html
id="js_content"
```

- Publish time: id="publish_time"
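
For reference, here is a condensed sketch of how `work.py` actually reads these page fields (it takes the title and cover image from the `og:` meta tags and the publish time from the inline `createTime` variable, which correspond to the markers listed above):

```python
import re

import requests
from bs4 import BeautifulSoup


def extract_page_fields(article_url, headers):
    """Condensed version of the parsing done in work1() of work.py."""
    soup = BeautifulSoup(requests.get(article_url, headers=headers).text, 'lxml')

    # Title and cover image from the og: meta tags
    title = soup.find('meta', property='og:title')
    cover = soup.find('meta', property='og:image')
    # Author: the rich_media_meta_text element inside rich_media_meta_list
    author = soup.find(class_='rich_media_meta rich_media_meta_text')
    # Publishing account: the element with id="js_name"
    publisher = soup.find(id='js_name')
    # Body text: the <div id="js_content"> container
    body = soup.find('div', id='js_content')
    # Publish time: the createTime variable inside an inline <script>
    match = re.search(r'var createTime = [\'"](.*?)[\'"]', str(soup.find_all('script')))

    return {
        'title': title['content'] if title else '',
        'cover_image': cover['content'] if cover else '',
        'author': author.text.strip() if author else '',
        'publisher': publisher.text.strip() if publisher else '',
        'text': body.get_text(separator=' ', strip=True) if body else '',
        'publish_time': match.group(1) if match else '',
    }
```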

The other response, getappmsgext ([reference](https://wnma3mz.github.io/hexo_blog/2017/11/18/%E8%AE%B0%E4%B8%80%E6%AC%A1%E5%BE%AE%E4%BF%A1%E5%85%AC%E4%BC%97%E5%8F%B7%E7%88%AC%E8%99%AB%E7%9A%84%E7%BB%8F%E5%8E%86%EF%BC%88%E5%BE%AE%E4%BF%A1%E6%96%87%E7%AB%A0%E9%98%85%E8%AF%BB%E7%82%B9%E8%B5%9E%E7%9A%84%E8%8E%B7%E5%8F%96%EF%BC%89/)), provides:

- Read count: read_num
- Like count: old_like_num
- "Wow" count: like_num
- Comment count: comment_count
- Comments enabled: comment_enabled

To get this response, i.e. these five fields, the request must include a `key` parameter. This parameter is not easy to obtain: it is tied to the official account and expires after a while. For now the program relies on manually inspecting a captured packet and pasting the `key` into the code, so one manual step is needed per official account. A possible improvement would be to extract the `key` automatically from the captured packets. A sketch of the request is shown below.
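
A minimal sketch of that request, following `work2()` in `work.py` (`key`, `uin`, `cookies`, and `headers` come from a captured packet; `comment_id` is scraped from the article page itself):

```python
import requests


def fetch_article_stats(__biz, mid, idx, sn, key, uin, comment_id,
                        cookies, headers):
    """Condensed version of work2() in work.py."""
    params = {
        "__biz": __biz, "mid": mid, "idx": idx, "sn": sn,
        "key": key,        # captured manually; expires after a while
        "uin": uin,
        "comment_id": comment_id,
    }
    data = {
        "is_only_read": "1",
        "is_temp_url": "0",
        "appmsg_type": "9",  # without this, like_num is not returned
        "reward_uin_count": "0",
    }
    resp = requests.post("https://mp.weixin.qq.com/mp/getappmsgext",
                         params=params, data=data,
                         cookies=cookies, headers=headers).json()

    stat = resp.get("appmsgstat", {})
    return {
        "read_num": stat.get("read_num", -1),            # read count
        "old_like_num": stat.get("old_like_num", -1),    # like count
        "like_num": stat.get("like_num", -1),            # "wow" count
        "comment_count": resp.get("comment_count", 0),   # comment count
        "comment_enabled": resp.get("comment_enabled", 0),
    }
```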

#### Limitations

In Step 1 the crawl speed is severely limited: going even slightly too fast gets the account blocked for a long time, and no data can be fetched. Even at a pace of 10 s per page, the account is still blocked for tens of minutes after fifty or sixty pages. With several accounts this can be mitigated by pipelining. Due to implementation details, the crawler does not record whether an account has been fully crawled, so `keywords.py` has to be edited by hand to comment out the accounts that are already done.

In Step 2, automatic retrieval of the `key` is not implemented yet, so you must first open one of the account's articles in the WeChat client, capture the request with a packet-capture tool, and copy the `key` into the code. Crawling too fast is also detected, after which the account can no longer retrieve like counts and the other statistics. When saving results, make sure `output.xlsx` is not open in another application (such as WPS), otherwise writing fails with a permission error.

### Usage (current)

`script.py` does the work of Step 1. Install the required libraries and run the script, scan the QR code in the window that pops up to log in to the WeChat Official Accounts Platform, and wait for the crawl to finish. **After an official account has been fully crawled, comment it out manually in `keywords.py` to avoid wasting time re-crawling it.**
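
`keywords.py` is not included in the repository (it is listed in `.gitignore`); judging from `from keywords import keywords` in `script.py`, it is expected to define a list of official-account names to crawl, for example:

```python
# keywords.py (not committed): the official accounts to crawl.
# Comment out an account once it has been fully crawled.
keywords = [
    "Official account A",
    # "An account that is already finished",
    "Official account B",
]
```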

`work.py` does the work of Step 2. First install the required libraries. Then capture an article request with a packet-capture tool and fill the obtained parameters (`headers`, `cookies`, `key`) into the code, after which you can run the script. **While it is running, avoid opening the output file `output.xlsx` in other applications, to prevent permission errors when writing.** The program stops once all articles of the official account have been crawled or the `key` has expired; at that point use the packet-capture tool again to obtain a new `key`.
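
The captured values go into the placeholders at the top of `work.py`:

```python
# Top of work.py: fill these in from a captured article request.
cookies = {}      # cookies from the captured request
headers = {}      # request headers from the captured request
uin = ""          # the uin parameter of the captured URL
article_key = ""  # the key parameter; expires after a while
```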

### References

[https://github.com/wnma3mz/wechat_articles_spider](https://github.com/wnma3mz/wechat_articles_spider)

[https://blog.csdn.net/nzbing/article/details/131601790](https://blog.csdn.net/nzbing/article/details/131601790)

[https://www.bilibili.com/video/BV1mK4y1F7Kp/](https://www.bilibili.com/video/BV1mK4y1F7Kp/)

--------------------------------------------------------------------------------
/script.py:
--------------------------------------------------------------------------------
1 | from playwright.sync_api import Playwright, sync_playwright, expect
2 | import time
3 | from datetime import datetime
4 | import json, jsonlines
5 | import os
6 | from keywords import keywords
7 |
8 | def fake(page):  # scroll down in small steps with pauses to simulate browsing
9 | page.mouse.wheel(0,1200)
10 | time.sleep(1)
11 | page.mouse.wheel(0,200)
12 | time.sleep(1)
13 | page.mouse.wheel(0,200)
14 | time.sleep(1)
15 | page.mouse.wheel(0,200)
16 | time.sleep(1)
17 | page.mouse.wheel(0,200)
18 | time.sleep(1)
19 | page.mouse.wheel(0,200)
20 |
21 | file_path = "article_links.jsonl"
22 |
23 | def init_dic():
24 | data = {}
25 | if os.path.exists(file_path):
26 | with open(file_path, 'r', encoding='utf-8') as f:
27 | for line in f:
28 | line = json.loads(line)
29 | if data.get(line["pub_name"]) is None:
30 | data[line["pub_name"]] = []
31 | data[line["pub_name"]].append(line["url"])
32 | return data
33 |
34 | data = init_dic()
35 |
36 | comparison_date = datetime.strptime("2023-03-01", "%Y-%m-%d")
37 |
38 | def get_links(page, pub_name):
39 | time.sleep(5)
40 |     # Locate all the <label> elements, one per article on the current page
41 |     labels = page.locator('label.inner_link_article_item')
42 | 
43 |     # Number of articles on the current page
44 |     count = labels.count()
45 |     print(f"This page lists {count} articles")
46 | if count == 0:
47 | return False
48 |
49 |     # Iterate over the label elements
50 |     for i in range(count):
51 |         # For each label, read the href of the <a> inside its second <span>
52 | href_value = labels.nth(i).locator('span:nth-of-type(2) a').get_attribute('href')
53 | date_element_text = labels.nth(i).locator('.inner_link_article_date').text_content()
54 | test_date = datetime.strptime(date_element_text, "%Y-%m-%d")
55 | if test_date < comparison_date:
56 | return None
57 | if href_value not in data[pub_name]:
58 | data[pub_name].append(href_value)
59 | with jsonlines.open(file_path, mode='a') as writer:
60 | writer.write({"pub_name": pub_name, "url": href_value})
61 | if len(data[pub_name]) >= 500:
62 | return None
63 |
64 | return True
65 |
66 | def login(playwright: Playwright):
67 | browser = playwright.chromium.launch(headless=False)
68 | context = browser.new_context()
69 | page = context.new_page()
70 | page.goto("https://mp.weixin.qq.com/")
71 | with page.expect_popup() as page1_info:
72 |         page.locator(".new-creation__icon > svg").first.click(timeout=1000000)  # long timeout: wait for the user to scan the QR code and log in
73 | page1 = page1_info.value
74 | cookies = page.context.cookies()
75 | page1.close()
76 | page.close()
77 | context.close()
78 | browser.close()
79 | return cookies
80 |
81 | def get_cookies():
82 | with sync_playwright() as playwright:
83 | cookies = login(playwright)
84 | return cookies
85 |
86 | cookies = get_cookies()
87 |
88 | def record_state(count_path, page_count):
89 | with open(count_path, 'w') as f:
90 | f.write(str(page_count - 1))
91 |
92 |
93 | def run(playwright: Playwright, pub_name) -> None:
94 | browser = playwright.chromium.launch(headless=True)
95 | context = browser.new_context()
96 | page = context.new_page()
97 | page.context.add_cookies(cookies)
98 | page.goto("https://mp.weixin.qq.com/")
99 |
100 | with page.expect_popup() as page1_info:
101 | page.locator(".new-creation__icon > svg").first.click(timeout=1000000)
102 | page1 = page1_info.value
103 |
104 |     page1.get_by_text("超链接").click()  # open the "insert hyperlink" dialog
105 |     page1.get_by_text("选择其他公众号").click()  # "choose another official account"
106 |     page1.get_by_placeholder("输入文章来源的公众号名称或微信号,回车进行搜索").click()  # account search box
107 |     page1.get_by_placeholder("输入文章来源的公众号名称或微信号,回车进行搜索").fill(pub_name)
108 |     page1.get_by_placeholder("输入文章来源的公众号名称或微信号,回车进行搜索").press("Enter")
109 |     page1.get_by_text("订阅号", exact=True).nth(1).click()  # pick the matching subscription account
110 |
111 | page_count = 0
112 | count_path = f"./tmp/page_count_{pub_name}.txt"
113 | if os.path.exists(count_path):
114 | with open(count_path, 'r') as f:
115 | page_count = int(f.read())
116 |
117 |     if page_count > 0:  # resume from the page recorded by a previous run
118 |         page1.fill('input[type="number"]', str(page_count))
119 |         page1.get_by_role("link", name="跳转").click()  # "跳转" = jump to that page
120 |
121 |     error_flag = False  # True: stop the whole run (blocked / no more pages); None: account finished, go to the next one
122 |     this_count = 0      # pages crawled in this session
123 |
124 |     while True:
125 | page_count += 1
126 | this_count += 1
127 |         if this_count > 48:  # stop this session before the platform blocks the account
128 | record_state(count_path, page_count)
129 | error_flag = True
130 | break
131 |         print(f"Account {pub_name}, page {page_count}")
132 | flag = get_links(page1, pub_name)
133 | if flag is None:
134 | record_state(count_path, page_count)
135 | error_flag = None
136 | break
137 |
138 | if not flag:
139 | record_state(count_path, page_count)
140 | error_flag = True
141 | break
142 | fake(page1)
143 |         next_button = page1.get_by_role("link", name="下一页")  # "下一页" = next page
144 | 
145 |         if next_button.count() == 0:
146 | record_state(count_path, page_count)
147 | error_flag = True
148 | break
149 |         print(f"Collected {len(data[pub_name])} article links for this account so far")
150 | next_button.click()
151 |
152 | page1.close()
153 | page.close()
154 | context.close()
155 | browser.close()
156 | return error_flag
157 |
158 | def check_folder():
159 | if not os.path.exists("tmp"):
160 | os.makedirs("tmp")
161 |
162 | check_folder()
163 |
164 | for keyword in keywords:
165 | if data.get(keyword) is None:
166 | data[keyword] = []
167 | with sync_playwright() as playwright:
168 | flag = run(playwright, keyword)
169 | if flag is None:
170 | continue
171 | if flag:
172 | break
--------------------------------------------------------------------------------
/work.py:
--------------------------------------------------------------------------------
1 | import requests
2 | from bs4 import BeautifulSoup
3 | import re
4 | import os
5 | import pandas as pd
6 | import json
7 | import ssl
8 | import time
9 | ssl._create_default_https_context = ssl._create_unverified_context
10 |
11 | cookies = {} # fill in the cookies you get
12 |
13 | headers = {} # fill in the headers you get
14 |
15 | data = {
16 | "is_only_read": "1",
17 | "is_temp_url": "0",
18 |     "appmsg_type": "9", # newer parameter; without it like_num is not returned
19 | "reward_uin_count": "0",
20 | }
21 |
22 | # Additionally required by work2():
23 | # both can be taken from the packet captured for work1()
24 | uin = "" # fill in the uin you get
25 | article_key = "" # fill in the article_key you get
26 |
27 | links = []
28 | titles = []
29 | cover_images = []
30 | arthors = []
31 | publishers = []
32 | texts = []
33 | publish_times = []
34 | read_nums = []
35 | old_like_nums = []
36 | like_nums = []
37 | conment_nums = []
38 | conment_enableds = []
39 |
40 | def get_params(url):  # split __biz, mid, idx, sn out of a crawled article URL
41 | url = url.split("&")
42 | __biz = url[0].split("__biz=")[1]
43 | article_mid = url[1].split("=")[1]
44 | article_idx = url[2].split("=")[1]
45 | article_sn = url[3].split("=")[1]
46 | return __biz, article_mid, article_idx, article_sn
47 |
48 | def work1(origin_url, need_more_info=False):  # fetch one article page and parse its fields
49 | __biz, article_mid, article_idx, article_sn = get_params(origin_url)
50 | url = "https://mp.weixin.qq.com/s?"
51 | article_url = url + "__biz={}&mid={}&idx={}&sn={}".format(__biz, article_mid, article_idx, article_sn)
52 | url2 = url + "__biz={}&key={}&uin={}&mid={}&idx={}&sn={}".format(__biz, article_key, uin, article_mid, article_idx, article_sn)
53 | content = requests.get(url2, headers=headers, data=data)
54 | links.append(article_url)
55 |
56 | soup = BeautifulSoup(content.text, 'lxml')
57 |
58 |     # Extract the title, cover image, author, publishing account and publish time
59 | title = soup.find('meta', property="og:title")['content'] if soup.find('meta', property="og:title") else ""
60 | cover_image = soup.find('meta', property="og:image")['content'] if soup.find('meta', property="og:image") else ""
61 | author = soup.find(class_="rich_media_meta rich_media_meta_text").text.strip() if soup.find(class_="rich_media_meta rich_media_meta_text") else ""
62 |
63 |     publisher_tag = soup.find(id="js_name")
64 |     publisher = ""
65 |     if publisher_tag is None:
66 |         # log the URL of articles whose publisher element is missing
67 |         print(origin_url)
68 |     else:
69 |         publisher = publisher_tag.text.strip()
70 |
71 |     # Find the <div id="js_content"> tag that holds the article body
72 | text_content = soup.find('div', id='js_content')
73 | text = None
74 |
75 |     # If the tag is found, extract its text; otherwise print the URL for inspection
76 | if text_content:
77 | text = text_content.get_text(separator=' ', strip=True)
78 | else:
79 | print(origin_url)
80 | # raise Exception("找不到文章正文")
81 | text = ""
82 | texts.append(text)
83 |
84 |     # Extract variable values (publish time, comment_id) from the inline JavaScript with regexes
85 | js_content = str(soup.find_all('script'))
86 | publish_time = re.search(r'var createTime = [\'"](.*?)[\'"]', js_content).group(1) if re.search(r'var createTime = [\'"](.*?)[\'"]', js_content) else ""
87 | comment_id = re.search(r'var comment_id = [\'"](.*?)[\'"]', js_content).group(1) if re.search(r'var comment_id = [\'"](.*?)[\'"]', js_content) else ""
88 |
89 | titles.append(title)
90 | cover_images.append(cover_image)
91 | arthors.append(author)
92 | publishers.append(publisher)
93 | publish_times.append(publish_time)
94 |
95 | if need_more_info:
96 | work2(__biz, article_mid, article_idx, article_sn, article_key, uin, comment_id)
97 | else:
98 | read_nums.append(-1)
99 | old_like_nums.append(-1)
100 | like_nums.append(-1)
101 | conment_nums.append(-1)
102 | conment_enableds.append(-1)
103 |
104 | def work2(__biz, article_mid, article_idx, article_sn, article_key, uin, article_comment_id):  # fetch read/like/comment stats via getappmsgext
105 | params = {
106 | "__biz": __biz,
107 | "mid": article_mid,
108 | "sn": article_sn,
109 | "idx": article_idx,
110 | "key": article_key,
111 | "uin": uin,
112 | "comment_id": article_comment_id,
113 | }
114 | url = "https://mp.weixin.qq.com/mp/getappmsgext?"
115 | contents = requests.post(url, params=params, data=data, cookies=cookies, headers=headers).json()
116 | # print(contents)
117 |
118 | if 'appmsgstat' not in contents:
119 | read_nums.append(-1)
120 | old_like_nums.append(-1)
121 | like_nums.append(-1)
122 | conment_nums.append(-1)
123 | conment_enableds.append(-1)
124 | output()
125 |         raise Exception("Failed to fetch article stats (the key may have expired)")
126 |
127 | read_num = contents['appmsgstat']['read_num'] if 'appmsgstat' in contents else -1
128 | old_like_num = contents['appmsgstat']['old_like_num'] if 'appmsgstat' in contents else -1
129 | like_num = contents['appmsgstat']['like_num'] if 'appmsgstat' in contents else -1
130 |     conment_num = contents['comment_count'] if 'comment_count' in contents else 0
131 | conment_enabled = contents['comment_enabled'] if 'comment_enabled' in contents else 0
132 |
133 | read_nums.append(read_num)
134 | old_like_nums.append(old_like_num)
135 | like_nums.append(like_num)
136 | conment_nums.append(conment_num)
137 | conment_enableds.append(conment_enabled)
138 |
139 | file_path = "./output.xlsx"
140 |
141 | def output(file_path=file_path):
142 | df = pd.DataFrame({
143 | "链接": links,
144 | "标题": titles,
145 | "封面图": cover_images,
146 | "作者": arthors,
147 | "发布账号": publishers,
148 | "发布时间": publish_times,
149 | "阅读量": read_nums,
150 | "点赞数": old_like_nums,
151 | "在看数": like_nums,
152 | "留言数": conment_nums,
153 | "留言是否开启": conment_enableds,
154 | "正文": texts
155 | })
156 |
157 | if not os.path.exists(file_path):
158 | df.to_excel(file_path, index=False)
159 | return
160 | old_df = pd.read_excel(file_path)
161 | new_df = pd.concat([old_df, df], ignore_index=True).drop_duplicates()
162 | new_df.to_excel(file_path, index=False)
163 |
164 | links_path = "./article_links.jsonl"
165 |
166 | def init(file_path=file_path):
167 |     # Collect the links already present in the output file so they can be skipped
168 |     links = []
169 |     if os.path.exists(file_path):
170 |         df = pd.read_excel(file_path)
171 |         links = df["链接"].tolist()
172 |
173 | return links
174 |
175 | ready_links = init()
176 |
177 | now_biz = ""
178 |
179 | with open(links_path, 'r', encoding='utf-8') as f:
180 | for line in f:
181 | line = json.loads(line)
182 |         url = line["url"].split("&chksm=")[0]  # drop the chksm parameter
183 |         url = "https" + url.split("http")[1]   # normalize the scheme to https
184 |
185 | if url not in ready_links:
186 | # print(url)
187 | __biz = url.split("&")[0].split("__biz=")[1]
188 |         if __biz != now_biz:  # only one account (one __biz, i.e. one key) is handled per run
189 | if now_biz != "":
190 | break
191 | now_biz = __biz
192 | time.sleep(2)
193 | ready_links.append(url)
194 | work1(url, True)
195 | output()
196 | df = pd.read_excel(file_path).drop_duplicates()
197 | df.to_excel(file_path, index=False)
--------------------------------------------------------------------------------