├── .gitignore
├── README.md
├── script.py
└── work.py

/.gitignore:
--------------------------------------------------------------------------------
__pycache__/
tmp/
1.*
*.json
*.jsonl
*.html
*.xlsx

keywords.py
work1.py
--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
## WeChat Official Account Article Crawler

### Requirements

Given a chosen set of WeChat subscription accounts, batch-crawl the following for every post published within a given time range:

- link
- title
- cover image
- author
- publishing account
- body text
- publish time
- read count
- like count
- "Wow" (在看) count
- comment count
- whether comments are enabled

and export the results to an Excel sheet.

### Implementation

Overall approach: enumerate the official accounts -> crawl every article under each account -> extract the required information from each article.

#### Step 1: crawl all articles under an official account

Even without a crawler, seeing every article of an official account is not easy. The only uniform way is to open the account in the WeChat app, but that page cannot be packet-captured, so it yields none of the requests and responses a crawler needs. Some accounts publish a list of past articles that could be crawled, but this depends entirely on the individual account and is not a general method.

Known implementations found online include:

- WeChat history messages, which may get the account banned
- [Sogou WeChat search](https://weixin.sogou.com/), which no longer seems to be maintained
- [WeChat Official Account Platform](https://mp.weixin.qq.com/), which may also get the account banned

This crawler uses the Official Account Platform: when composing an article and inserting a hyperlink, the link picker can search every article published by a given official account.

Using the platform requires registering an account first and having its name change pass review.

In practice, after crawling a few dozen pages the account gets blocked for a while (tens of minutes). The crawler therefore has to be restarted several times, so it records the account and page number at the point of failure, making it easy to jump straight back there on the next run.

The implementation drives a browser with `playwright`; see [script.py](https://github.com/zsq259/WeChat-official-account-crawler/blob/main/script.py).

The crawled article URLs are stored in `article_links.jsonl`, one record per line (see the example below).
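For reference, each line of `article_links.jsonl` is a small JSON object with the two fields written by `script.py`. The sketch below uses a made-up placeholder URL and shows one way to group the stored links by account, mirroring `init_dic()` in `script.py` (it assumes the file already exists):

```python
import json

# Shape of a single line in article_links.jsonl (placeholder values).
record = {
    "pub_name": "example-account",
    "url": "http://mp.weixin.qq.com/s?__biz=BIZ_ID&mid=1234567890&idx=1&sn=abcdef&chksm=ignored",
}

# Group all recorded links by account name, as init_dic() in script.py does.
links_by_account = {}
with open("article_links.jsonl", encoding="utf-8") as f:
    for line in f:
        item = json.loads(line)
        links_by_account.setdefault(item["pub_name"], []).append(item["url"])
```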
#### Step 2: crawl all the information in a single article

The links obtained in step 1 carry the parameters `__biz`, `mid`, `idx` and `sn` that identify each article; `get_params()` in `work.py` extracts them by splitting the URL, and the sketch below shows the same extraction done with the standard library.
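This snippet is only an illustration and is not part of the repository code; `example_url` is a placeholder in the same shape as the stored links:

```python
from urllib.parse import urlparse, parse_qs

# Placeholder link in the same shape as those stored in article_links.jsonl.
example_url = "http://mp.weixin.qq.com/s?__biz=BIZ_ID&mid=1234567890&idx=1&sn=abcdef&chksm=ignored"

# parse_qs returns a dict of lists, one entry per query parameter.
query = parse_qs(urlparse(example_url).query)
__biz, mid, idx, sn = (query[k][0] for k in ("__biz", "mid", "idx", "sn"))
print(__biz, mid, idx, sn)
```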
The hard part is figuring out which packet holds each required piece of information, and how to send the requests and extract the values.

The relevant parameters were found through web searches and hands-on experimentation; they live in two packets (see the code for the concrete implementation). The first packet is the article page itself:

- title: `rich_media_title`

- cover image: `js_row_immersive_cover_img`

- author:

  ```html
  class="rich_media_meta_list"
  class="rich_media_meta rich_media_meta_text"
  ```

- publishing account: `id="js_name"`

- body text:

  ```html
  <div class="rich_media_content js_underline_content autoTypeSetting24psection" id="js_content" style="visibility: hidden;">
  ```

- publish time: `id="publish_time"`

The other packet is getappmsgext ([reference](https://wnma3mz.github.io/hexo_blog/2017/11/18/%E8%AE%B0%E4%B8%80%E6%AC%A1%E5%BE%AE%E4%BF%A1%E5%85%AC%E4%BC%97%E5%8F%B7%E7%88%AC%E8%99%AB%E7%9A%84%E7%BB%8F%E5%8E%86%EF%BC%88%E5%BE%AE%E4%BF%A1%E6%96%87%E7%AB%A0%E9%98%85%E8%AF%BB%E7%82%B9%E8%B5%9E%E7%9A%84%E8%8E%B7%E5%8F%96%EF%BC%89/)); it provides:

- read count: `read_num`
- like count: `old_like_num`
- "Wow" (在看) count: `like_num`
- comment count: `comment_count`
- comments enabled: `comment_enabled`

To get this packet's response, i.e. these five fields, the request must carry a `key` parameter. This parameter is not easy to obtain: it is tied to the official account and expires after a while. For now the program relies on manually inspecting a captured packet and pasting the `key` into the code, so each official account requires one manual step. A possible fix is to extract the `key` from the captured packets automatically; a rough sketch of that idea follows below.
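Automatic `key` capture is not implemented in this repository. One possible direction, sketched below under the assumption that mitmproxy is installed and the WeChat client's traffic is routed through it, is a small addon that prints any `key`/`uin` it sees on requests to mp.weixin.qq.com; the file name `catch_key.py` and the class `KeyCatcher` are hypothetical:

```python
# catch_key.py (hypothetical helper, not part of this repository)
# Run with:  mitmdump -s catch_key.py
from mitmproxy import http


class KeyCatcher:
    def request(self, flow: http.HTTPFlow) -> None:
        # Article and getappmsgext requests from the WeChat client carry
        # key/uin in their query string.
        if "mp.weixin.qq.com" in flow.request.pretty_host:
            key = flow.request.query.get("key")
            uin = flow.request.query.get("uin")
            if key:
                print(f"key={key} uin={uin} url={flow.request.pretty_url}")


addons = [KeyCatcher()]
```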
#### Limitations

In step 1 the crawl speed is tightly constrained: going even slightly too fast gets the account blocked for a long time, and no data can be fetched. Even at a slow pace of 10 s per page, the account is still blocked for tens of minutes after fifty or sixty pages. With several accounts this can be mitigated by pipelining them. Because of implementation details, the crawler does not record whether an official account has been fully crawled, so `keywords.py` has to be edited by hand to comment out the accounts that are done.

In step 2, automatic retrieval of the `key` has not been implemented yet, so you first have to open one of the account's articles in the WeChat client and use a packet-capture tool to obtain the `key` and paste it into the code. Crawling too fast also gets detected, after which the account can no longer fetch like counts and the other stats. When the output is saved, `output.xlsx` must not be open in another application (such as WPS), otherwise writing fails with a permission error.

### Usage (current)

`script.py` does the work of step 1. Install the required libraries and run it, scan the QR code in the window that pops up to log in to the WeChat Official Account Platform, and wait for the crawl run to finish. **After all data of an official account has been crawled, comment it out manually in `keywords.py` so that no time is wasted crawling it again** (a sketch of that file follows below).
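`keywords.py` is not included in the repository (it is listed in `.gitignore`); `script.py` only expects it to define a list named `keywords` with the account names to crawl. A minimal sketch with placeholder names:

```python
# keywords.py -- consumed by script.py via `from keywords import keywords`.
# The names below are placeholders; comment out accounts that are finished.
keywords = [
    "example-account-1",
    "example-account-2",
    # "finished-account",  # already crawled, skipped on the next run
]
```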
`work.py` does the work of step 2. First install the required libraries. Then capture an article request with a packet-capture tool, take the relevant parameters (`headers`, `cookies`, `key`) from it, fill them into the code, and run the script. **While the script is running, try not to open the output file `output.xlsx` in another application, to avoid permission errors when writing.** After all articles of an account have been crawled, or after the `key` has expired, the program stops; at that point a fresh `key` has to be captured again.

### References

[https://github.com/wnma3mz/wechat_articles_spider](https://github.com/wnma3mz/wechat_articles_spider)

[https://blog.csdn.net/nzbing/article/details/131601790](https://blog.csdn.net/nzbing/article/details/131601790)

[https://www.bilibili.com/video/BV1mK4y1F7Kp/](https://www.bilibili.com/video/BV1mK4y1F7Kp/)
--------------------------------------------------------------------------------
/script.py:
--------------------------------------------------------------------------------
from playwright.sync_api import Playwright, sync_playwright
import time
from datetime import datetime
import json, jsonlines
import os
from keywords import keywords

def fake(page):
    # Scroll the page in small steps to mimic a human reader.
    page.mouse.wheel(0, 1200)
    for _ in range(5):
        time.sleep(1)
        page.mouse.wheel(0, 200)

file_path = "article_links.jsonl"

def init_dic():
    # Load previously collected links so a restarted run can skip them.
    data = {}
    if os.path.exists(file_path):
        with open(file_path, 'r', encoding='utf-8') as f:
            for line in f:
                line = json.loads(line)
                if data.get(line["pub_name"]) is None:
                    data[line["pub_name"]] = []
                data[line["pub_name"]].append(line["url"])
    return data

data = init_dic()

comparison_date = datetime.strptime("2023-03-01", "%Y-%m-%d")

def get_links(page, pub_name):
    time.sleep(5)
    # Locate all the article entries on the current page
    labels = page.locator('label.inner_link_article_item')

    # Number of entries found
    count = labels.count()
    print(f"This page contains {count} articles")
    if count == 0:
        return False

    # Walk through every entry
    for i in range(count):
        # For each entry, read the href of the <a> inside its second <span>
        href_value = labels.nth(i).locator('span:nth-of-type(2) a').get_attribute('href')
        date_element_text = labels.nth(i).locator('.inner_link_article_date').text_content()
        test_date = datetime.strptime(date_element_text, "%Y-%m-%d")
        if test_date < comparison_date:
            return None
        if href_value not in data[pub_name]:
            data[pub_name].append(href_value)
            with jsonlines.open(file_path, mode='a') as writer:
                writer.write({"pub_name": pub_name, "url": href_value})
        if len(data[pub_name]) >= 500:
            return None

    return True

def login(playwright: Playwright):
    # Open a visible browser so the user can scan the QR code, then keep the cookies.
    browser = playwright.chromium.launch(headless=False)
    context = browser.new_context()
    page = context.new_page()
    page.goto("https://mp.weixin.qq.com/")
    with page.expect_popup() as page1_info:
        page.locator(".new-creation__icon > svg").first.click(timeout=1000000)
    page1 = page1_info.value
    cookies = page.context.cookies()
    page1.close()
    page.close()
    context.close()
    browser.close()
    return cookies

def get_cookies():
    with sync_playwright() as playwright:
        cookies = login(playwright)
    return cookies

cookies = get_cookies()

def record_state(count_path, page_count):
    # Remember the last page so the next run can jump straight back to it.
    with open(count_path, 'w') as f:
        f.write(str(page_count - 1))


def run(playwright: Playwright, pub_name):
    browser = playwright.chromium.launch(headless=True)
    context = browser.new_context()
    page = context.new_page()
    page.context.add_cookies(cookies)
    page.goto("https://mp.weixin.qq.com/")

    with page.expect_popup() as page1_info:
        page.locator(".new-creation__icon > svg").first.click(timeout=1000000)
    page1 = page1_info.value

    page1.get_by_text("超链接").click()
    page1.get_by_text("选择其他公众号").click()
    page1.get_by_placeholder("输入文章来源的公众号名称或微信号,回车进行搜索").click()
    page1.get_by_placeholder("输入文章来源的公众号名称或微信号,回车进行搜索").fill(pub_name)
    page1.get_by_placeholder("输入文章来源的公众号名称或微信号,回车进行搜索").press("Enter")
    page1.get_by_text("订阅号", exact=True).nth(1).click()

    page_count = 0
    count_path = f"./tmp/page_count_{pub_name}.txt"
    if os.path.exists(count_path):
        with open(count_path, 'r') as f:
            page_count = int(f.read())

    if page_count > 0:
        page1.fill('input[type="number"]', str(page_count))
        page1.get_by_role("link", name="跳转").click()

    error_flag = False
    this_count = 0

    while True:
        page_count += 1
        this_count += 1
        if this_count > 48:
            record_state(count_path, page_count)
            error_flag = True
            break
        print(f"Account {pub_name}: page {page_count}")
        flag = get_links(page1, pub_name)
        if flag is None:
            record_state(count_path, page_count)
            error_flag = None
            break

        if not flag:
            record_state(count_path, page_count)
            error_flag = True
            break
        fake(page1)
        next_button = page1.get_by_role("link", name="下一页")

        if next_button.count() == 0:
            record_state(count_path, page_count)
            error_flag = True
            break
        print(f"Collected {len(data[pub_name])} articles from this account so far")
        next_button.click()

    page1.close()
    page.close()
    context.close()
    browser.close()
    return error_flag

def check_folder():
    if not os.path.exists("tmp"):
        os.makedirs("tmp")

check_folder()

for keyword in keywords:
    if data.get(keyword) is None:
        data[keyword] = []
    with sync_playwright() as playwright:
        flag = run(playwright, keyword)
    # None: reached the date cutoff or the 500-link cap, move on to the next account;
    # True: blocked or no further pages, stop and retry later.
    if flag is None:
        continue
    if flag:
        break
--------------------------------------------------------------------------------
/work.py:
--------------------------------------------------------------------------------
import requests
from bs4 import BeautifulSoup
import re
import os
import pandas as pd
import json
import ssl
import time
ssl._create_default_https_context = ssl._create_unverified_context

cookies = {} # fill in the cookies you get

headers = {} # fill in the headers you get

data = {
    "is_only_read": "1",
    "is_temp_url": "0",
    "appmsg_type": "9",  # new parameter; like_num cannot be fetched without it
    "reward_uin_count": "0",
}

# Extra values needed by work2();
# both can be taken from the packet captured for work1()
uin = "" # fill in the uin you get
article_key = "" # fill in the article_key you get

links = []
titles = []
cover_images = []
authors = []
publishers = []
texts = []
publish_times = []
read_nums = []
old_like_nums = []
like_nums = []
comment_nums = []
comment_enableds = []

def get_params(url):
    # The article URL carries __biz, mid, idx and sn as its first four query parameters.
    url = url.split("&")
    __biz = url[0].split("__biz=")[1]
    article_mid = url[1].split("=")[1]
    article_idx = url[2].split("=")[1]
    article_sn = url[3].split("=")[1]
    return __biz, article_mid, article_idx, article_sn

def work1(origin_url, need_more_info=False):
    __biz, article_mid, article_idx, article_sn = get_params(origin_url)
    url = "https://mp.weixin.qq.com/s?"
    article_url = url + "__biz={}&mid={}&idx={}&sn={}".format(__biz, article_mid, article_idx, article_sn)
    url2 = url + "__biz={}&key={}&uin={}&mid={}&idx={}&sn={}".format(__biz, article_key, uin, article_mid, article_idx, article_sn)
    content = requests.get(url2, headers=headers, data=data)
    links.append(article_url)

    soup = BeautifulSoup(content.text, 'lxml')

    # Extract the title, cover image, author, publishing account and publish time
    title = soup.find('meta', property="og:title")['content'] if soup.find('meta', property="og:title") else ""
    cover_image = soup.find('meta', property="og:image")['content'] if soup.find('meta', property="og:image") else ""
    author = soup.find(class_="rich_media_meta rich_media_meta_text").text.strip() if soup.find(class_="rich_media_meta rich_media_meta_text") else ""

    publisher_test = soup.find(id="js_name")
    publisher = ""
    if publisher_test is None:
        print(origin_url)
        publisher = ""
    else:
        publisher = publisher_test.text.strip()

    # Locate the <div id="js_content"> element that holds the article body
    text_content = soup.find('div', id='js_content')
    text = None

    # If the tag is found, extract its text; otherwise report the problematic URL
    if text_content:
        text = text_content.get_text(separator=' ', strip=True)
    else:
        print(origin_url)
        # raise Exception("article body not found")
        text = ""
    texts.append(text)

    # Pull variable values out of the inline JavaScript with regular expressions
    js_content = str(soup.find_all('script'))
    publish_time = re.search(r'var createTime = [\'"](.*?)[\'"]', js_content).group(1) if re.search(r'var createTime = [\'"](.*?)[\'"]', js_content) else ""
    comment_id = re.search(r'var comment_id = [\'"](.*?)[\'"]', js_content).group(1) if re.search(r'var comment_id = [\'"](.*?)[\'"]', js_content) else ""

    titles.append(title)
    cover_images.append(cover_image)
    authors.append(author)
    publishers.append(publisher)
    publish_times.append(publish_time)

    if need_more_info:
        work2(__biz, article_mid, article_idx, article_sn, article_key, uin, comment_id)
    else:
        read_nums.append(-1)
        old_like_nums.append(-1)
        like_nums.append(-1)
        comment_nums.append(-1)
        comment_enableds.append(-1)

def work2(__biz, article_mid, article_idx, article_sn, article_key, uin, article_comment_id):
    params = {
        "__biz": __biz,
        "mid": article_mid,
        "sn": article_sn,
        "idx": article_idx,
        "key": article_key,
        "uin": uin,
        "comment_id": article_comment_id,
    }
    url = "https://mp.weixin.qq.com/mp/getappmsgext?"
    contents = requests.post(url, params=params, data=data, cookies=cookies, headers=headers).json()
    # print(contents)

    if 'appmsgstat' not in contents:
        # The key has probably expired or the account is rate-limited;
        # flush what has been collected so far before bailing out.
        read_nums.append(-1)
        old_like_nums.append(-1)
        like_nums.append(-1)
        comment_nums.append(-1)
        comment_enableds.append(-1)
        output()
        raise Exception("failed to fetch the article stats")

    read_num = contents['appmsgstat']['read_num'] if 'appmsgstat' in contents else -1
    old_like_num = contents['appmsgstat']['old_like_num'] if 'appmsgstat' in contents else -1
    like_num = contents['appmsgstat']['like_num'] if 'appmsgstat' in contents else -1
    comment_num = contents['comment_count'] if 'comment_count' in contents else 0
    comment_enabled = contents['comment_enabled'] if 'comment_enabled' in contents else 0

    read_nums.append(read_num)
    old_like_nums.append(old_like_num)
    like_nums.append(like_num)
    comment_nums.append(comment_num)
    comment_enableds.append(comment_enabled)

file_path = "./output.xlsx"

def output(file_path=file_path):
    df = pd.DataFrame({
        "链接": links,
        "标题": titles,
        "封面图": cover_images,
        "作者": authors,
        "发布账号": publishers,
        "发布时间": publish_times,
        "阅读量": read_nums,
        "点赞数": old_like_nums,
        "在看数": like_nums,
        "留言数": comment_nums,
        "留言是否开启": comment_enableds,
        "正文": texts
    })

    if not os.path.exists(file_path):
        df.to_excel(file_path, index=False)
        return
    old_df = pd.read_excel(file_path)
    new_df = pd.concat([old_df, df], ignore_index=True).drop_duplicates()
    new_df.to_excel(file_path, index=False)

links_path = "./article_links.jsonl"

def init(file_path=file_path):
    # Collect the links that are already present in the output file
    links = []
    if os.path.exists(file_path):
        data = pd.read_excel(file_path)
        links = data["链接"].tolist()

    return links

ready_links = init()

now_biz = ""

with open(links_path, 'r', encoding='utf-8') as f:
    for line in f:
        line = json.loads(line)
        # Strip the chksm parameter and normalise the scheme to https
        url = line["url"].split("&chksm=")[0]
        url = "https" + url.split("http")[1]

        if url not in ready_links:
            # print(url)
            __biz = url.split("&")[0].split("__biz=")[1]
            if __biz != now_biz:
                if now_biz != "":
                    # A different account needs a different key, so stop here.
                    break
                now_biz = __biz
            time.sleep(2)
            ready_links.append(url)
            work1(url, True)
output()
df = pd.read_excel(file_path).drop_duplicates()
df.to_excel(file_path, index=False)
--------------------------------------------------------------------------------