├── README.md
├── daily_update.sh
├── data
│   ├── README.md
│   ├── id_info.json
│   ├── issues_message.json
│   ├── message_info.json
│   ├── minhash_dict.pickle
│   ├── name2fakeid.json
│   ├── 微信公众号聚合平台_按公众号区分.md
│   └── 微信公众号聚合平台_按时间区分.md
├── figures
│   └── blog_preview.png
├── main.py
├── request_
│   └── wechat_request.py
├── requirements.txt
└── util
    ├── filter_duplication.py
    ├── message2md.py
    └── util.py

/README.md:
--------------------------------------------------------------------------------
1 | # WeChatOA_Aggregation
2 | A WeChat Official Account aggregation platform: it fetches posts from multiple official accounts, then screens and filters them so users can read all the articles in one place more conveniently.
3 | 
4 | ![blog_preview.png](figures/blog_preview.png)
5 | 
6 | 
7 | ## About token and cookie
8 | Open the WeChat Official Account Platform and log in by scanning the QR code; `token=xxxxxxxxx` then appears at the end of the URL in the address bar.
9 | Next press F12, open the Network panel, select Fetch/XHR, refresh the page, and click any request to find the Cookie field.
10 | 
11 | When the token or cookie expires, the tool now automatically opens a browser window with the Official Account Platform page; after you scan the QR code to log in, the new token and cookie are captured automatically.
12 | 
13 | ## TODO
14 | - [x] Pre-filter potentially similar posts by title, then fetch the full text and compute an overlap ratio to deduplicate, removing large numbers of reposted articles
15 | - [ ] Encode articles with an embedding model to remove duplicates and catch cases where titles differ but content is identical
16 |   - Accuracy on long texts is rather low
17 | - [x] Encode articles with MinHash + LSH to remove duplicates
18 |   - With a 0.9 threshold, 528 articles were flagged; precision is 100%, recall not yet measured
19 | - [x] Crawl on a schedule, every morning at 8:00, covering posts from 6:00 a.m. yesterday to 6:00 a.m. today
20 |   - Needs a server; for now you can run `daily_update.sh` in a terminal to fetch the latest articles. I upload the result straight to my Hexo blog; adapt the shell script to your own needs
21 | - [x] Automatically simulate login to refresh the cookie and token when they expire
22 | - [x] Periodically check whether already-collected posts have been deleted
23 | - [x] Work around the request-count limit: record the latest crawl time, skip accounts already crawled within the past day, and rerun until the crawl completes
24 | - [x] Build a personal blog with GitHub Pages and deploy the aggregation platform there (simplified version): https://zejuncao.github.io/
25 | - [ ] Add search: coarse keyword recall first, then vector re-ranking
26 | - [ ] Remove ads and other useless posts
27 | - [x] Switch proxy IPs when requests are rate-limited
28 |   - The goal is zero cost, but free proxy IPs are unstable
29 |   - WeChat has since removed the request-count limit, so everything can be crawled in one pass
30 | - [x] Improve the Hexo page layout or build a custom blog (use a better-looking Hexo theme)
31 |   - [x] Split posts into individual pages and show cover images
32 |   - [x] Support on-site search
33 | 
34 | ## MinHash experiment notes
35 | - Experiments were run on a test set of 4005 posts
36 | - minhash_0.9 means the MinHashLSH threshold was set to 0.9 (a minimal standalone sketch of this MinHash + LSH flow is appended at the end of this document)
37 | 
38 | | Method                 | Duplicates detected | Errors                               |
39 | |------------------------|---------------------|--------------------------------------|
40 | | minhash_0.9            | 528                 | 0                                    |
41 | | minhash_0.8            | 699                 | 24                                   |
42 | | minhash_0.8 + rule 0.7 | 665                 | 1 (very little text, mostly images)  |
43 | 
44 | ## Similar projects
45 | - https://github.com/jooooock/wechat-article-exporter
46 | - https://github.com/1061700625/WeChat_Article
--------------------------------------------------------------------------------
/daily_update.sh:
--------------------------------------------------------------------------------
1 | # 注:调用shell时不要开科学上网工具(亲测在windows系统git bash中运行shell导致request失败)
2 | echo "爬取公众号最新文章ing"
3 | cd "D:\learning\python\WeChatOA_Aggregation"
4 | D:\\anaconda\\envs\\torch20\\python main.py
5 | 
6 | if [ $? 
-ne 0 ]; then 7 | echo "请求次数过多,或session及token过期,请查看异常获取详细信息" 8 | exit 1 9 | else 10 | echo "爬取成功" 11 | fi 12 | 13 | echo "上传md文件中ing" 14 | SOURCE_PATH1="D:\learning\python\WeChatOA_Aggregation\data\微信公众号聚合平台_按时间区分.md" 15 | SOURCE_PATH2="D:\learning\python\WeChatOA_Aggregation\data\微信公众号聚合平台_按公众号区分.md" 16 | TARGET_PATH="D:\learning\zejun'blog\Hexo\source\_posts" 17 | 18 | # 检查目标路径是否存在,如果不存在,则打印错误并退出 19 | if test -d "$TARGET_PATH"; then 20 | cp $SOURCE_PATH1 $TARGET_PATH 21 | cp $SOURCE_PATH2 $TARGET_PATH 22 | else 23 | echo "目标路径不存在: $TARGET_PATH" 24 | exit 1 25 | fi 26 | 27 | cd "D:\learning\zejun'blog\Hexo" 28 | hexo g -d -------------------------------------------------------------------------------- /data/README.md: -------------------------------------------------------------------------------- 1 | # 微信公众号聚合平台数据文件说明 2 | 3 | 本目录包含了微信公众号聚合平台所需的各类数据文件。以下是各个文件的详细说明: 4 | 5 | ## 数据文件 6 | 7 | ### 基础配置文件 8 | - `id_info.json` - 包含微信公众平台的基础配置信息,如token和cookie等认证信息 9 | - `name2fakeid.json` - 存储公众号名称到其对应的fakeid映射关系,用于公众号文章的获取 10 | 11 | ### 文章数据文件 12 | - `message_info.json` - 原始的公众号文章信息数据 13 | - `message_detail_text.json` - 存储文章的详细内容文本 14 | - `issues_message.json` - 记录文章处理过程中的问题和异常情况 15 | 16 | ### 去重相关文件 17 | - `minhash_dict.pickle` - 使用MinHash算法生成的文章指纹数据,用于文章去重 18 | 19 | ### 展示相关文件 20 | - `微信公众号聚合平台_按公众号区分.md` - 按照公众号分类整理的文章展示文件 21 | - `微信公众号聚合平台_按时间区分.md` - 按照发布时间整理的文章展示文件 22 | 23 | ## 文件用途说明 24 | 25 | 1. 数据采集和认证: 26 | - 使用`id_info.json`中的认证信息进行API访问 27 | - 通过`name2fakeid.json`获取目标公众号的唯一标识 28 | 29 | 2. 文章处理流程: 30 | - 首先获取的原始数据存储在`message_info.json` 31 | - 文章详细内容保存在`message_detail_text.json` 32 | 33 | 3. 文章去重机制: 34 | - 使用MinHash算法生成文章指纹,存储在`minhash_dict.pickle` 35 | - 记录MinHash重复和已被删除的博文,存储在`issues_message` 36 | 37 | 4. 内容展示: 38 | - 分别通过按公众号和按时间两种方式组织文章,生成对应的md文件 -------------------------------------------------------------------------------- /data/id_info.json: -------------------------------------------------------------------------------- 1 | { 2 | "token": 898827626, 3 | "cookie": "ua_id=KAfRPZPRS2CqGIwQAAAAAOxEmPdP6Vk1H5bwISXzYUk=; wxuin=22183570217006; mm_lang=zh_CN; poc_sid=HJkQrWaj5W6CBCIfsly25ckYc4UzuNQJvPbaBwzV; rewardsn=; wxtokenkey=777; _clck=betmb9|1|fo2|0; uuid=52ac2d250677497626280bcbeda85442; rand_info=CAESIGBrTd+lT29MhXC2Dnc5L5q6Yl7TmlRgX7MCTEfs41XR; slave_bizuin=3931536317; data_bizuin=3931536317; bizuin=3931536317; data_ticket=0mP/0TzvYKySQxnaES+sQCmwovHoEt3s/MohTlPys85ULkIhigmEmUt10qP5uPPj; slave_sid=enBlSmcxdEs0SDhkdkpFWHk2SUlTRVlJUzF5SzVCUXNLSFZNWk1KUFBCWVVDMmQ3b2VLQTVqdUl5MVpXaERmODBRWjd1Y29ZVWRhMkRqdEdnZzRDbjR5UjFtd3hSVlpwNGdHUGduNTdHa3paQTZMN0l0WDhpYllQSVp0VGJ6ZUdPalF4ZkdJZmdTWk12TkFt; slave_user=gh_bfe9a82e08da; xid=46473c49590e212a727969aea580390d; _clsk=6s9zmy|1722881201857|2|1|mp.weixin.qq.com/weheat-agent/payload/record" 4 | } -------------------------------------------------------------------------------- /data/minhash_dict.pickle: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/ZejunCao/WeChatOA_Aggregation/952341cdc9e7f00d10239b536ede6f5521eba825/data/minhash_dict.pickle -------------------------------------------------------------------------------- /data/name2fakeid.json: -------------------------------------------------------------------------------- 1 | { 2 | "老刘说NLP": "MzAxMjc3MjkyMg==", 3 | "包包算法笔记": "MzIwNDY1NTU5Mg==", 4 | "机器之心": "MzA3MzI4MjgzMw==", 5 | "量子位": "MzIzNjc1NzUzMw==", 6 | "手撕LLM": "MzkzNzU4MTU5Nw==", 7 | "NLP前沿": "MzkyOTU5NzY1Mw==", 8 | "机器学习研习院": "MzkxODI5NjE1OQ==", 9 | 
"简说Python": "MzUyOTAwMzI4NA==", 10 | "深度学习初学者": "Mzg4NTUzNzE5OQ==", 11 | "学姐带你玩AI": "Mzg2NzYxODI3MQ==", 12 | "机器之心SOTA模型": "MzkyMzcwMDIyMQ==", 13 | "人工智能前沿讲习": "MzIzNjc0MTMwMA==", 14 | "AI科技大本营": "Mzg4NDQwNTI0OQ==", 15 | "人工智能学家": "MzIwOTA1MDAyNA==", 16 | "机器学习算法那些事": "MzU0MDQ1NjAzNg==", 17 | "Coggle数据科学": "MzIwNDA5NDYzNA==", 18 | "AI前线": "MzU1NDA4NjU2MA==", 19 | "程序锅锅": "MzkwNzY3ODU5MA==", 20 | "ScienceAI": "MzI3MjM3ODk0NQ==", 21 | "数据STUDIO": "Mzk0OTI1OTQ2MQ==", 22 | "机器学习初学者": "MzIwODI2NDkxNQ==", 23 | "机器学习实战": "MzkxMzUxNzEzMQ==", 24 | "深度学习技术前沿": "MzU2NDExMzE5Nw==", 25 | "机器学习与大模型": "MzkxNDYzMjA0Ng==", 26 | "深度学习基础与进阶": "Mzg3MjY1MzExMA==", 27 | "玩机器学习的章北海": "MzA5Njg3NDU0Nw==", 28 | "机器学习算法与Python实战": "MzA4MjYwMTc5Nw==", 29 | "pythonic生物人": "MzUwOTg0MjczNw==", 30 | "DASOU": "MzIyNTY1MDUwNQ==", 31 | "AI有道": "MzIwOTc2MTUyMg==", 32 | "IT咖啡馆": "MzI1NzEzOTAzOA==", 33 | "程序员好物馆": "MzkzMzI4MjMyNA==", 34 | "小白学视觉": "MzU0NjgzMDIxMQ==", 35 | "AINLP": "MjM5ODkzMzMwMQ==", 36 | "江大白": "Mzg5NzgyNTU2Mg==", 37 | "kaggle竞赛宝典": "Mzk0NDE5Nzg1Ng==", 38 | "AI大模型前沿": "Mzk0MDY2ODM3NQ==", 39 | "飞桨PaddlePaddle": "Mzg2OTEzODA5MA==", 40 | "智源社区": "MzU5ODg0MTAwMw==", 41 | "阿郎小哥的随笔驿站": "MzkyNDY0MTU5MA==", 42 | "AI算法工程师Future": "MzkyOTQwOTMzMg==", 43 | "算法美食屋": "MzU3OTQzNTU2OA==", 44 | "数智笔记": "MzkxNjY0MDgxNg==", 45 | "知识工场": "MzI0MTI1Nzk1MA==", 46 | "深度学习自然语言处理": "MzI3ODgwODA2MA==", 47 | "五角钱的程序员": "MzAwODc2ODgzMw==", 48 | "李rumor": "MzAxMTk4NDkwNw==", 49 | "码农的荒岛求生": "Mzg4OTYzODM4Mw==", 50 | "YeungNLP": "MzA3MTgwODE1Ng==", 51 | "RUC AI Engine": "Mzk0MTYwNjEwNw==", 52 | "猴子数据分析": "MzAxMTMwNTMxMQ==", 53 | "ChallengeHub": "MzAxOTU5NTU4MQ==", 54 | "NLP日志": "Mzk0NzMwNjU5Nw==", 55 | "CVHub": "MzU1MzY0MDI2NA==", 56 | "AIGC Studio": "MzU2OTg5NTU2Ng==", 57 | "AIGC最前线": "Mzk0NjMyNjUwOQ==", 58 | "慢慢学 AIGC": "MzI2MzYwNzUyNg==", 59 | "AIGC小白入门记": "MzkyNTY0Mjg0OQ==", 60 | "AIGC启示录": "MzU5NjczODIzMg==", 61 | "硬核AIGC": "MzI4ODEwMzI3Nw==", 62 | "AIGC智谷": "MzkxNTUxMzYwMw==", 63 | "AIGC创想者": "MzkzODY1MTQzOQ==", 64 | "AIGC Research": "MzkzMzI5MDE0Nw==", 65 | "AIGC开放社区": "Mzg3Mzg5MjY3Nw==", 66 | "AGI Hunt": "MzA4NzgzMjA4MQ==" 67 | } -------------------------------------------------------------------------------- /figures/blog_preview.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/ZejunCao/WeChatOA_Aggregation/952341cdc9e7f00d10239b536ede6f5521eba825/figures/blog_preview.png -------------------------------------------------------------------------------- /main.py: -------------------------------------------------------------------------------- 1 | #!/usr/bin/env python 2 | # -*- coding:utf-8 -*- 3 | # @Author : Cao Zejun 4 | # @Time : 2024/7/31 1:11 5 | # @File : main.py 6 | # @Software : Pycharm 7 | # @description : 主程序,爬取文章并存储 8 | 9 | from tqdm import tqdm 10 | from request_.wechat_request import WechatRequest 11 | from util.message2md import message2md, single_message2md 12 | from util.util import read_json, write_json, time_delta, time_now 13 | from util.filter_duplication import minHashLSH 14 | 15 | 16 | if __name__ == '__main__': 17 | # 获取必要信息 18 | name2fakeid_dict = read_json('name2fakeid') 19 | message_info = read_json('message_info') 20 | 21 | wechat_request = WechatRequest() 22 | try: 23 | for n, id in tqdm(name2fakeid_dict.items()): 24 | # 如果是新增加的公众号 25 | if not id: 26 | name2fakeid_dict[n] = wechat_request.name2fakeid(n) 27 | write_json('name2fakeid', data=name2fakeid_dict) 28 | message_info[n] = { 29 | 'latest_time': "2000-01-01 00:00", # 默认一个很久远的时间 
30 | 'blogs': [], 31 | } 32 | # 如果latest_time非空(之前太久不发文章的),或者今天已经爬取过,则跳过 33 | if message_info[n]['latest_time'] and time_delta(time_now(), message_info[n]['latest_time']).days < 1: 34 | continue 35 | message_info[n]['blogs'].extend(wechat_request.fakeid2message_update(id, message_info[n]['blogs'])) 36 | message_info[n]['latest_time'] = time_now() 37 | except Exception as e: 38 | # 写入message_info,如果请求中间失败,及时写入 39 | write_json('message_info', data=message_info) 40 | raise e 41 | 42 | # 写入message_info,如果请求顺利进行,则正常写入 43 | write_json('message_info', data=message_info) 44 | 45 | # 每次更新时验证去重 46 | with minHashLSH() as minhash: 47 | minhash.write_vector() 48 | 49 | # 将message_info转换为md上传到个人博客系统 50 | message2md(message_info) 51 | single_message2md(message_info) -------------------------------------------------------------------------------- /request_/wechat_request.py: -------------------------------------------------------------------------------- 1 | #!/usr/bin/env python 2 | # -*- coding:utf-8 -*- 3 | # @Author : Cao Zejun 4 | # @Time : 2024/7/31 0:54 5 | # @File : wechat_request.py 6 | # @Software : Pycharm 7 | # @description : 微信公众号爬虫工具函数 8 | 9 | import json 10 | import re 11 | 12 | import requests 13 | 14 | from util.util import headers, jstime2realtime, read_json, write_json 15 | 16 | 17 | class WechatRequest: 18 | def __init__(self): 19 | id_info = read_json('id_info') 20 | self.headers = headers 21 | self.headers['Cookie'] = id_info['cookie'] 22 | self.token = id_info['token'] 23 | 24 | # 使用公众号名字获取 id 值 25 | def name2fakeid(self, name): 26 | params = { 27 | 'action': 'search_biz', 28 | 'begin': 0, 29 | 'count': 5, 30 | 'query': name, 31 | 'token': self.token, 32 | 'lang': 'zh_CN', 33 | 'f': 'json', 34 | 'ajax': 1, 35 | } 36 | 37 | nickname = {} 38 | url = 'https://mp.weixin.qq.com/cgi-bin/searchbiz?' 39 | response = requests.get(url=url, params=params, headers=self.headers).json() 40 | if self.session_is_overdue(response): 41 | params['token'] = self.token 42 | response = requests.get(url=url, params=params, headers=headers).json() 43 | self.session_is_overdue(response) 44 | for l in response['list']: 45 | nickname[l['nickname']] = l['fakeid'] 46 | if name in nickname.keys(): 47 | return nickname[name] 48 | else: 49 | return None 50 | 51 | # 请求次数限制,不是请求文章条数限制 52 | def fakeid2message_update(self, fakeid, message_exist=[]): 53 | params = { 54 | 'sub': 'list', 55 | 'search_field': 'null', 56 | 'begin': 0, 57 | 'count': 20, 58 | 'query': '', 59 | 'fakeid': fakeid, 60 | 'type': '101_1', 61 | 'free_publish_type': 1, 62 | 'sub_action': 'list_ex', 63 | 'token': self.token, 64 | 'lang': 'zh_CN', 65 | 'f': 'json', 66 | 'ajax': 1, 67 | } 68 | # 根据文章id判断新爬取的文章是否已存在 69 | msgid_exist = set() 70 | for m in message_exist: 71 | msgid_exist.add(int(m['id'].split('/')[0])) 72 | 73 | message_url = [] 74 | url = "https://mp.weixin.qq.com/cgi-bin/appmsgpublish?" 
75 | response = requests.get(url=url, params=params, headers=headers).json() 76 | if self.session_is_overdue(response): 77 | params['token'] = self.token 78 | response = requests.get(url=url, params=params, headers=headers).json() 79 | self.session_is_overdue(response) 80 | # if 'publish_page' not in response.keys(): 81 | # raise Exception('The number of requests is too fast, please try again later') 82 | messages = json.loads(response['publish_page'])['publish_list'] 83 | for message_i in range(len(messages)): 84 | message = json.loads(messages[message_i]['publish_info']) 85 | if message['msgid'] in msgid_exist: 86 | continue 87 | for i in range(len(message['appmsgex'])): 88 | link = message['appmsgex'][i]['link'] 89 | if not message['appmsgex'][i]['create_time']: 90 | continue 91 | real_time = jstime2realtime(message['appmsgex'][i]['create_time']) 92 | message_url.append({ 93 | 'title': message['appmsgex'][i]['title'], 94 | 'create_time': real_time, 95 | 'link': link, 96 | 'id': str(message['msgid']) + '/' + str(message['appmsgex'][i]['aid']), 97 | }) 98 | message_url.sort(key=lambda x: x['create_time']) 99 | return message_url 100 | 101 | def login(self): 102 | from DrissionPage import ChromiumPage 103 | 104 | bro = ChromiumPage() 105 | bro.get('https://mp.weixin.qq.com/') 106 | bro.set.window.max() 107 | while 'token' not in bro.url: 108 | pass 109 | 110 | match = re.search(r'token=(.*)', bro.url) 111 | if not match: 112 | raise ValueError("无法在URL中找到token") 113 | token = match.group(1) 114 | cookie = bro.cookies() 115 | cookie_str = '' 116 | for c in cookie: 117 | cookie_str += c['name'] + '=' + c['value'] + '; ' 118 | 119 | self.token = token 120 | self.headers['Cookie'] = cookie_str 121 | id_info = { 122 | 'token': token, 123 | 'cookie': cookie_str, 124 | } 125 | write_json('id_info', data=id_info) 126 | bro.close() 127 | 128 | # 检查session和token是否过期 129 | def session_is_overdue(self, response): 130 | err_msg = response['base_resp']['err_msg'] 131 | if err_msg in ['invalid session', 'invalid csrf token']: 132 | self.login() 133 | return True 134 | if err_msg == 'freq control': 135 | raise Exception('The number of requests is too fast, please try again later') 136 | return False 137 | 138 | def sort_messages(self): 139 | message_info = read_json('message_info') 140 | 141 | for k, v in message_info.items(): 142 | v['blogs'].sort(key=lambda x: x['create_time']) 143 | 144 | write_json('message_info', data=message_info) -------------------------------------------------------------------------------- /requirements.txt: -------------------------------------------------------------------------------- 1 | DrissionPage 2 | lxml 3 | requests 4 | tqdm 5 | datasketch -------------------------------------------------------------------------------- /util/filter_duplication.py: -------------------------------------------------------------------------------- 1 | #!/usr/bin/env python 2 | # -*- coding:utf-8 -*- 3 | # @Author : Cao Zejun 4 | # @Time : 2024/8/5 23:20 5 | # @File : filter_duplication.py 6 | # @Software : Pycharm 7 | # @description : 去重操作 8 | ''' 9 | ## 实验过程 10 | - 根据标题去重 11 | - 存在问题:存在标题相同内容不同,例如"今日Github最火的10个Python项目",该公众号每天都用这个标题,但是内容每日更新 12 | - [ ] 解决方案1:增加白名单,保留该标题所有博文(不需要) 13 | - [x] 解决方案2:获取文章具体内容,使用`tree.xpath`提取对应`div`下的所有`//text()`,以列表形式返回,计算两个文本列表的重叠个数占比 14 | - 存在问题:取`//text()`的方式是按照标签分割,一些加粗的文本会单独列出,导致文章结尾多出很多无意义文本,但在列表长度上占比很大 15 | - [x] 解决方案1:以重叠字数计算占比,而不是重叠列表长度 16 | - [x] 解决方案2:改进`tree.xpath`取文本策略,获取所有section和p标签,取此标签下的所有文本并还原顺序 17 | 18 | - 
datasketch官方文档:https://ekzhu.com/datasketch/lsh.html 19 | ''' 20 | import pickle 21 | from pathlib import Path 22 | 23 | from nltk.translate.bleu_score import corpus_bleu, sentence_bleu 24 | from tqdm import tqdm 25 | 26 | from .util import read_json, url2text, write_json 27 | 28 | 29 | def calc_duplicate_rate(text_list1, text_list2) -> float: 30 | ''' 31 | 计算重复率方法1:以提取文本方法1中的返回值为参数,比对列表1中的每个元素是否在列表2中,若在计入重复字数,最后统计重复字数比例 32 | :param text_list1: 相同 title 下最早发布的文章 33 | :param text_list2: 其余相同 title 的文章 34 | :return: 重复字数比例 35 | ''' 36 | if len(''.join(text_list1)) == 0: 37 | return 0 38 | text_set2 = set(text_list2) 39 | co_word_count = 0 40 | for t in text_list1: 41 | if t in text_set2: 42 | co_word_count += len(t) 43 | co_rate = co_word_count / len(''.join(text_list1)) 44 | return co_rate 45 | 46 | 47 | def calc_duplicate_rate_max(text_list1, text_list2) -> float: 48 | '''重复字数判断,调换顺序计算两次''' 49 | dup_rate = max([calc_duplicate_rate(text_list1, text_list2), calc_duplicate_rate(text_list2, text_list1)]) 50 | # 再次计算bleu值 51 | if dup_rate < 0.8: 52 | bleu_score = sentence_bleu([list(''.join(text_list1))], list(''.join(text_list2))) 53 | if isinstance(bleu_score, (int, float)): 54 | dup_rate = max(dup_rate, float(bleu_score)) 55 | return dup_rate 56 | 57 | 58 | class minHashLSH: 59 | def __init__(self): 60 | from datasketch import MinHash, MinHashLSH 61 | self.lsh = MinHashLSH(threshold=0.8, num_perm=128) 62 | 63 | # 加载minhash重复文件 64 | self.issues_message = read_json('issues_message') 65 | if 'dup_minhash' not in self.issues_message.keys(): 66 | self.issues_message['dup_minhash'] = {} 67 | 68 | self.delete_messages_set = set(self.issues_message['is_delete']) 69 | self.message_detail_text = read_json('message_detail_text') 70 | 71 | # 加载minhash签名缓存文件 72 | self.minhash_dict_path = Path(__file__).parent.parent / 'data' / 'minhash_dict.pickle' 73 | # minhash_dict 字典记录所有id的minhash签名,key: id, value: minhash签名 74 | if self.minhash_dict_path.exists(): 75 | with open(self.minhash_dict_path, 'rb') as fp: 76 | self.minhash_dict = pickle.load(fp) # 此时v是minhash签名的hash值(数组) 77 | for k, v in self.minhash_dict.items(): 78 | self.minhash_dict[k] = MinHash(hashvalues=v) # 将其转换为MinHash对象 79 | else: 80 | self.minhash_dict = {} 81 | 82 | def write_vector(self): 83 | from datasketch import MinHash 84 | message_info = read_json('message_info') 85 | id2url = {m['id']: m['link'] for v in message_info.values() for m in v['blogs']} 86 | 87 | message_total = [m for v in message_info.values() for m in v['blogs'] 88 | if m['id'] not in self.delete_messages_set 89 | and m['create_time'] > "2024-07-01"] 90 | message_total.sort(key=lambda x: x['create_time']) 91 | for i, m in tqdm(enumerate(message_total), total=len(message_total)): 92 | # 如果文章没有minhash编码,则进行minhash编码 93 | if m['id'] not in self.minhash_dict.keys(): 94 | if m['id'] not in self.message_detail_text: 95 | self.message_detail_text[m['id']] = url2text(m['link']) 96 | text_list = self.message_detail_text[m['id']] 97 | if self.is_delete(text_list, m['id']): continue 98 | text_list = ' '.join(text_list) 99 | text_list = self.split_text(text_list) 100 | min1 = MinHash(num_perm=128) 101 | for d in text_list: 102 | min1.update(d.encode('utf8')) 103 | self.minhash_dict[m['id']] = min1 104 | else: 105 | # 已 minhash 编码的文章也已去过重 106 | continue 107 | 108 | sim_m = self.lsh.query(self.minhash_dict[m['id']]) 109 | if sim_m: 110 | if m['id'] in self.issues_message['dup_minhash'].keys(): 111 | continue 112 | # 如果有相似的,先判断jaccard相似度,大于0.9直接通过,若在0.8-0.9之间则使用规则再次判断 113 | sim_m_res = [] 
114 | for s in sim_m: 115 | if self.minhash_dict[m['id']].jaccard(self.minhash_dict[s]) >= 0.9: # .jaccard会和MinHashLSH计算的有点差异 116 | sim_m_res.append(s) 117 | else: 118 | dup_rate = calc_duplicate_rate_max(text_list, url2text(id2url[s])) 119 | # 规则大于0.7则认为是重复的 120 | if dup_rate > 0.7: 121 | sim_m_res.append(s) 122 | if sim_m_res: 123 | self.issues_message['dup_minhash'][m['id']] = { 124 | 'from_id': sim_m, 125 | } 126 | else: 127 | self.lsh.insert(m['id'], self.minhash_dict[m['id']]) 128 | write_json('message_detail_text', self.message_detail_text) 129 | 130 | def is_delete(self, text_list, id_): 131 | if text_list in ['已删除']: 132 | self.issues_message['is_delete'].append(id_) 133 | write_json('issues_message', self.issues_message) 134 | return True 135 | return False 136 | 137 | def split_text(self, text): 138 | # words = re.findall(r'\w| |[\u4e00-\u9fff]', text) 139 | words = list(text) 140 | 141 | # 结果列表 142 | result = [] 143 | last_word = 0 # 0:中文,1:英文 144 | 145 | for word in words: 146 | if '\u4e00' <= word <= '\u9fff': # 如果是中文字符 147 | result.append(word) 148 | last_word = 0 149 | else: # 如果是英文单词 150 | if not result: 151 | if word != ' ': 152 | result.append(word) 153 | last_word = 1 154 | else: 155 | if last_word == 1: 156 | if word != ' ': 157 | result[-1] += word 158 | last_word = 1 159 | else: 160 | last_word = 0 161 | else: 162 | if word != ' ': 163 | result.append(word) 164 | last_word = 1 165 | 166 | return result 167 | 168 | # 为了正确调用with 169 | def __enter__(self): 170 | return self 171 | 172 | # 在debug停止或发生异常时能及时保存 173 | def __exit__(self, exc_type, exc_val, exc_tb): 174 | hashvalues_dict = {} 175 | for k, v in self.minhash_dict.items(): 176 | hashvalues_dict[k] = v.hashvalues 177 | with open(self.minhash_dict_path, 'wb') as fp: 178 | pickle.dump(hashvalues_dict, fp) 179 | write_json('issues_message', self.issues_message) 180 | # 返回 True 表示异常已被处理,不会向外传播 181 | # return True -------------------------------------------------------------------------------- /util/message2md.py: -------------------------------------------------------------------------------- 1 | #!/usr/bin/env python 2 | # -*- coding:utf-8 -*- 3 | # @Author : Cao Zejun 4 | # @Time : 2024/7/31 23:43 5 | # @File : message2md.py 6 | # @Software : Pycharm 7 | # @description : 将微信公众号聚合平台数据转换为markdown文件,上传博客平台 8 | 9 | import datetime 10 | import os 11 | import re 12 | from collections import defaultdict 13 | from pathlib import Path 14 | 15 | import requests 16 | from PIL import Image 17 | from tqdm import tqdm 18 | 19 | from .util import check_text_ratio, headers, nunjucks_escape, read_json 20 | 21 | 22 | def get_valid_message(message_info=None): 23 | if not message_info: 24 | message_info = read_json('message_info') 25 | 26 | name2fakeid = read_json('name2fakeid') 27 | issues_message = read_json('issues_message') 28 | delete_messages_set = set(issues_message['is_delete']) if issues_message else set() 29 | 30 | delete_count = 0 31 | dup_count = 0 32 | md_dict_by_date = defaultdict(list) # 按日期分割,key=时间,年月日,value=文章 33 | md_dict_by_blogger = defaultdict(list) # 按博主分割,key=博主名,value=文章 34 | for k, v in message_info.items(): 35 | if k not in name2fakeid.keys(): 36 | continue 37 | for m in v['blogs']: 38 | # 历史遗留,有些文章没有创建时间,疑似已删除,待验证 39 | if not m['create_time']: 40 | continue 41 | # 去除已删除文章 42 | if m['id'] in delete_messages_set: 43 | delete_count += 1 44 | continue 45 | # 按博主分割,不需要文章去重 46 | md_dict_by_blogger[k].append(m) 47 | # 去掉重复率高的文章 48 | if m['id'] in issues_message['dup_minhash'].keys(): 49 | dup_count += 1 50 | 
continue 51 | # 按日期分割,需要文章去重 52 | t = datetime.datetime.strptime(m['create_time'],"%Y-%m-%d %H:%M").strftime("%Y-%m-%d") 53 | md_dict_by_date[t].append(m) 54 | 55 | print(f'{delete_count} messages have been deleted') 56 | print(f'{dup_count} messages have been deduplicated') 57 | return md_dict_by_date, md_dict_by_blogger 58 | 59 | 60 | def message2md(message_info=None): 61 | md_dict_by_date, md_dict_by_blogger = get_valid_message(message_info) 62 | # 1. 写入按日期区分的md文件 63 | md_by_date = '''--- 64 | layout: post 65 | title: "微信公众号聚合平台_按时间区分" 66 | date: 2024-07-29 01:36 67 | top: true 68 | hide: true 69 | tags: 70 | - 开源项目 71 | - 微信公众号聚合平台 72 | --- 73 | ''' 74 | # 获取所有时间并逆序排列 75 | date_list = sorted(md_dict_by_date.keys(), reverse=True) 76 | now = datetime.datetime.now() 77 | for date in date_list: 78 | # 为方便查看,只保留近半年的 79 | if now - datetime.datetime.strptime(date, '%Y-%m-%d') > datetime.timedelta(days=6*30): 80 | continue 81 | md_by_date += f'## {date}\n' 82 | for m in md_dict_by_date[date]: 83 | md_by_date += f'* [{m["title"]}]({m["link"]})\n' 84 | 85 | md_path = Path(__file__).parent.parent / 'data' / '微信公众号聚合平台_按时间区分.md' 86 | with open(md_path, 'w', encoding='utf-8') as f: 87 | f.write(md_by_date) 88 | 89 | # 2. 写入按公众号区分的md文件 90 | md_by_blogger = '''--- 91 | layout: post 92 | title: "微信公众号聚合平台_按公众号区分" 93 | date: 2024-08-31 02:16 94 | top: true 95 | hide: true 96 | tags: 97 | - 开源项目 98 | - 微信公众号聚合平台 99 | --- 100 | ''' 101 | md_dict_by_blogger = {k: sorted(v, key=lambda x: x['create_time'], reverse=True) for k, v in md_dict_by_blogger.items()} 102 | for k, v in md_dict_by_blogger.items(): 103 | md_by_blogger += f'## {k}\n' 104 | for m in v: 105 | # 为方便查看,只保留近半年的 106 | if now - datetime.datetime.strptime(m['create_time'], '%Y-%m-%d %H:%M') > datetime.timedelta(days=6*30): 107 | continue 108 | md_by_blogger += f'* [{m["title"]}]({m["link"]})\n' 109 | 110 | md_path = Path(__file__).parent.parent / 'data' / '微信公众号聚合平台_按公众号区分.md' 111 | with open(md_path, 'w', encoding='utf-8') as f: 112 | f.write(md_by_blogger) 113 | 114 | 115 | def single_message2md(message_info=None): 116 | if not message_info: 117 | message_info = read_json('message_info') 118 | md_dict_by_date, _ = get_valid_message(message_info) 119 | message_detail_text = read_json('message_detail_text') 120 | # hexo路径 121 | img_path = r"D:\learning\zejun'blog\Hexo\themes\hexo-theme-matery\source\medias\frontcover" 122 | md_path = r"D:\learning\zejun'blog\Hexo\source\_posts" 123 | 124 | # 将近半月的文章单独写入md文件展示 125 | # 1. 先收集近半月的文章id 126 | # - 先获取每个id对应的博主名字 127 | id2oaname = {} 128 | for k, v in message_info.items(): 129 | for m in v['blogs']: 130 | id2oaname[m['id']] = k 131 | # - 获取每个id对应的文章信息 132 | id2message_info = {} 133 | now = datetime.datetime.now() 134 | for k, v in md_dict_by_date.items(): 135 | if now - datetime.datetime.strptime(k, '%Y-%m-%d') > datetime.timedelta(days=15): 136 | continue 137 | for m in v: 138 | id2message_info[m['id']] = m 139 | id2message_info[m['id']]['oaname'] = id2oaname[m['id']] 140 | 141 | # 2. 
下载文章封面图 142 | all_frontcover_img = os.listdir(img_path) 143 | for _id in tqdm(id2message_info.keys(), desc='downloading frontcover img', total=len(id2message_info)): 144 | if _id.replace('/', '_') + '.jpg' in all_frontcover_img: 145 | continue 146 | # 下载处理 147 | d = id2message_info[_id] 148 | url = d['link'] 149 | response = requests.get(url=url, headers=headers) 150 | msg_cdn_url = re.search(r'var msg_cdn_url = "/*?(.*)"', response.text) 151 | if msg_cdn_url: 152 | msg_cdn_url = msg_cdn_url.group(1) 153 | else: 154 | continue 155 | img = requests.get(url=msg_cdn_url, headers=headers).content 156 | single_img_path = os.path.join(img_path, f"{_id.replace('/', '_')}.jpg") 157 | with open(single_img_path, 'wb') as fp: 158 | fp.write(img) 159 | # 缩放图片,防止封面太大占用空间 160 | img = Image.open(single_img_path) 161 | # 获取图片尺寸 162 | width, height = img.size 163 | # 如果高度大于640,进行缩放 164 | if width > 640: 165 | # 计算缩放比例 166 | ratio = 640.0 / width 167 | new_height = int(height * ratio) 168 | new_width = 640 169 | img = img.resize((new_width, new_height), Image.Resampling.LANCZOS) 170 | # 保存缩放后的图片 171 | img.save(single_img_path) 172 | 173 | # 3. 将近半月的文章写入成单个md文件 174 | for _id in id2message_info.keys(): 175 | d = id2message_info[_id] 176 | d['title'] = d['title'].replace('"', "'") 177 | md = f'''--- 178 | layout: post 179 | title: "{d['title']}" 180 | date: {d['create_time']} 181 | top: false 182 | hide: false 183 | img: /medias/frontcover/{_id.replace('/', '_')}.jpg 184 | tags: 185 | - {d['oaname']} 186 | --- 187 | ''' 188 | # - 开源项目 189 | # - 微信公众号聚合平台 190 | md += f'[{d["title"]}]({d["link"]})\n\n' 191 | md += '> 仅用于站内搜索,没有排版格式,具体信息请跳转上方微信公众号内链接\n\n' 192 | all_text = message_detail_text[_id] 193 | all_text = [all_text] if isinstance(all_text, str) else all_text 194 | for i in range(len(all_text)): 195 | # 替换一些字符,防止 Nunjucks 转义失败 196 | all_text[i] = nunjucks_escape(all_text[i]) 197 | # 去掉大段代码 198 | if len(all_text[i]) > 100 and sum(check_text_ratio(all_text[i])) > 0.5: 199 | all_text[i] = "" 200 | 201 | md += '\n'.join(all_text) 202 | 203 | single_md_path = os.path.join(md_path, f"{_id.replace('/', '_')}.md") 204 | with open(single_md_path, 'w', encoding='utf-8') as f: 205 | f.write(md) 206 | 207 | valid_id = [id.replace('/', '_') for id in id2message_info.keys()] 208 | # 4. 删除多余的md文件 209 | for filename in os.listdir(md_path): 210 | if filename in ["微信公众号聚合平台.md", "微信公众号聚合平台_byname.md"]: 211 | continue 212 | if filename[:-3] not in valid_id: 213 | os.remove(os.path.join(md_path, filename)) 214 | 215 | # 5. 
删除多余的图片 216 | for filename in os.listdir(img_path): 217 | if filename[:-4] not in valid_id: 218 | os.remove(os.path.join(img_path, filename)) -------------------------------------------------------------------------------- /util/util.py: -------------------------------------------------------------------------------- 1 | #!/usr/bin/env python 2 | # -*- coding:utf-8 -*- 3 | # @Author : Cao Zejun 4 | # @Time : 2024/8/5 22:06 5 | # @File : util.py 6 | # @Software : Pycharm 7 | # @description : 工具函数,存储一些通用的函数 8 | 9 | import datetime 10 | import json 11 | import re 12 | import shutil 13 | from pathlib import Path 14 | 15 | import requests 16 | from lxml import etree 17 | 18 | headers = { 19 | 'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/86.0.4240.198 Safari/537.36', 20 | } 21 | 22 | 23 | def jstime2realtime(jstime): 24 | """将js获取的时间id转化成真实时间,截止到分钟""" 25 | return (datetime.datetime.strptime("1970-01-01 08:00", "%Y-%m-%d %H:%M") + datetime.timedelta( 26 | minutes=jstime // 60)).strftime("%Y-%m-%d %H:%M") 27 | 28 | 29 | def time_delta(time1, time2): 30 | """计算时间差 31 | 32 | Args: 33 | time1: 第一个时间,格式为"YYYY-MM-DD HH:MM" 34 | time2: 第二个时间,格式为"YYYY-MM-DD HH:MM" 35 | 36 | Returns: 37 | 返回时间差对象,可以通过: 38 | - result.days 获取天数 39 | - result.seconds 获取不满一天的秒数(0-86399) 40 | - result.total_seconds() 获取总的秒数 41 | """ 42 | delta = datetime.datetime.strptime(time1,"%Y-%m-%d %H:%M") - datetime.datetime.strptime(time2,"%Y-%m-%d %H:%M") 43 | return delta 44 | 45 | 46 | def time_now(): 47 | """获取当前时间,格式为:2024-05-29 10:00""" 48 | return datetime.datetime.now().strftime("%Y-%m-%d %H:%M") 49 | 50 | 51 | def url2text(url, num=0): 52 | ''' 53 | 提取文本方法1:直接获取对应div下的所有文本,未处理 54 | :param url: 55 | :return: 列表形式,每个元素对应 div 下的一个子标签内的文本 56 | ''' 57 | response = requests.get(url, headers=headers).text 58 | tree = etree.HTML(response, parser=etree.HTMLParser(encoding='utf-8')) 59 | # 不同文章存储字段的class标签名不同 60 | div = tree.xpath('//div[@class="rich_media_content js_underline_content\n autoTypeSetting24psection\n "]') 61 | if not div: 62 | div = tree.xpath('//div[@class="rich_media_content js_underline_content\n defaultNoSetting\n "]') 63 | # 点进去显示分享一篇文章,然后需要再点阅读原文跳转 64 | if not div: 65 | data_url = tree.xpath('//div[@class="original_panel_tool"]/span/@data-url') 66 | if data_url: 67 | response = requests.get(data_url[0], headers=headers).text 68 | tree = etree.HTML(response, parser=etree.HTMLParser(encoding='utf-8')) 69 | # 不同文章存储字段的class标签名不同 70 | div = tree.xpath('//div[@class="rich_media_content js_underline_content\n autoTypeSetting24psection\n "]') 71 | if not div: 72 | div = tree.xpath('//div[@class="rich_media_content js_underline_content\n defaultNoSetting\n "]') 73 | 74 | # 判断是博文删除了还是请求错误 75 | if not div: 76 | if message_is_delete(response=response): 77 | return '已删除' 78 | else: 79 | # '请求错误'则再次重新请求,最多3次 80 | if num == 3: 81 | return '请求错误' 82 | return url2text(url, num=num+1) 83 | 84 | s_p = [p for p in div[0].iter() if p.tag in ['section', 'p']] 85 | text_list = [] 86 | tag = [] 87 | filter_char = ['\xa0', '\u200d', ' ', '■', ' '] 88 | pattern = '|'.join(filter_char) 89 | for s in s_p: 90 | text = ''.join([re.sub(pattern, '', i) for i in s.xpath('.//text()') if i != '\u200d']) 91 | if not text: 92 | continue 93 | if text_list and text in text_list[-1]: 94 | parent_tag = [] 95 | tmp = s 96 | while tmp.tag != 'div': 97 | tmp = tmp.getparent() 98 | parent_tag.append(tmp) 99 | if tag[-1] in parent_tag: 100 | del text_list[-1] 101 | tag.append(s) 102 | text_list.append(text) 
103 | return text_list 104 | 105 | 106 | def message_is_delete(url='', response=None): 107 | """检查文章是否正常运行(未被作者删除)""" 108 | if not response: 109 | response = requests.get(url=url, headers=headers).text 110 | tree = etree.HTML(response, parser=etree.HTMLParser(encoding='utf-8')) 111 | warn = tree.xpath('//div[@class="weui-msg__title warn"]/text()') 112 | if len(warn) > 0 and warn[0] == '该内容已被发布者删除': 113 | return True 114 | return False 115 | 116 | 117 | def read_json(file_name) -> dict: 118 | """读取json文件,传入文件名可自动补全路径,若没有文件则返回空字典""" 119 | if not file_name.endswith('.json'): 120 | file_name = Path(__file__).parent.parent / 'data' / f'{file_name}.json' 121 | 122 | if not file_name.exists(): 123 | return {} 124 | with open(file_name, 'r', encoding='utf-8') as f: 125 | return json.load(f) 126 | 127 | 128 | def write_json(file_name, data) -> None: 129 | """安全写入,防止在写入过程中中断程序导致数据丢失""" 130 | if not file_name.endswith('.json'): 131 | file_name = Path(__file__).parent.parent / 'data' / f'{file_name}.json' 132 | 133 | with open('tmp.json', 'w', encoding='utf-8') as f: 134 | json.dump(data, f, ensure_ascii=False, indent=4) 135 | shutil.move('tmp.json', file_name) 136 | 137 | 138 | def check_text_ratio(text): 139 | """检测文本中英文和符号的占比 140 | Args: 141 | text: 输入文本字符串 142 | Returns: 英文字符占比和符号占比 143 | """ 144 | # 统计字符数 145 | total_chars = len(text) 146 | if total_chars == 0: 147 | return 0, 0 148 | 149 | # 统计英文字符 150 | english_chars = sum(1 for c in text if c.isascii() and c.isalpha()) 151 | 152 | # 统计符号 (不包括空格) 153 | symbols = sum(1 for c in text if not c.isalnum() and not c.isspace()) 154 | 155 | # 计算占比 156 | english_ratio = english_chars / total_chars 157 | symbol_ratio = symbols / total_chars 158 | 159 | return english_ratio, symbol_ratio 160 | 161 | 162 | def nunjucks_escape(text): 163 | """替换 Nunjucks 转义字符""" 164 | text = text.replace('{{', '{ {') 165 | text = text.replace('}}', '} }') # 补充右大括号 166 | text = text.replace('{%', '{ %') # 补充 Nunjucks 标签 167 | text = text.replace('%}', '% }') # 补充 Nunjucks 标签 168 | text = text.replace('{#', '{ #') 169 | text = text.replace('#}', '# }') # 补充注释标签 170 | text = text.replace('https:', 'https :') 171 | text = text.replace('http:', 'http :') 172 | 173 | # 新增:处理可能引起解析错误的特殊字符组合 174 | text = text.replace('{-', '{ -') 175 | text = text.replace('-}', '- }') 176 | text = text.replace('{{-', '{ { -') 177 | text = text.replace('-}}', '- } }') 178 | text = text.replace('{%-', '{ % -') 179 | text = text.replace('-%}', '- % }') 180 | 181 | # 处理可能的变量访问语法 182 | text = re.sub(r'(\w+)\.(\w+)', r'\1\. 
\2', text) # 处理点号访问 183 | text = re.sub(r'(\w+)\[(\w+)\]', r'\1\[ \2\]', text) # 处理方括号访问 184 | 185 | # 处理管道符(Nunjucks 过滤器语法) 186 | text = text.replace('|', '\\|') 187 | 188 | # 处理反引号和特殊引号 189 | text = text.replace('`', '\\`') 190 | text = text.replace('"', '\\"') 191 | text = text.replace('"', '\\"') 192 | 193 | # 处理可能的数学表达式或特殊符号 194 | text = text.replace('<', '< ') 195 | text = text.replace('>', '> ') 196 | text = text.replace('&', '& ') 197 | text = text.replace('"', '\\" ') 198 | 199 | # 处理HTML实体编码中的特殊字符 200 | text = re.sub(r'&#x([0-9A-Fa-f]+);', r'&# x\1;', text) 201 | text = re.sub(r'&#(\d+);', r'&# \1;', text) 202 | 203 | # 处理可能被误认为是 Nunjucks 语法的其他字符组合 204 | text = text.replace('\\n\\n\\n', '\\n \\n \\n') # 根据你的错误可能相关 205 | 206 | for ext in ['.png', '.jpg', '.jpeg', '.gif', '.bmp']: 207 | text = text.replace(ext, '') 208 | 209 | # 去掉html标签,防止转义失败 210 | text = re.sub(r'<[^>]*>', '', text) 211 | 212 | return text --------------------------------------------------------------------------------
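
## Appendix: minimal MinHash + LSH deduplication sketch

The experiment notes in README.md describe deduplicating posts with MinHash signatures and MinHashLSH, as implemented in `util/filter_duplication.py`. The snippet below is a minimal, standalone sketch of that flow using the datasketch API the project already depends on; the article ids and texts are made up, tokenization is a plain character split rather than the project's `split_text` (which keeps English words whole), and the 0.9 threshold mirrors the minhash_0.9 row of the experiment table rather than the exact production setting.

```python
from datasketch import MinHash, MinHashLSH


def make_minhash(text: str, num_perm: int = 128) -> MinHash:
    """Build a MinHash signature from a text, hashing one character per token."""
    m = MinHash(num_perm=num_perm)
    for token in text:
        m.update(token.encode('utf8'))
    return m


# threshold=0.9 corresponds to the minhash_0.9 row of the experiment table
lsh = MinHashLSH(threshold=0.9, num_perm=128)

articles = {  # hypothetical id -> body text
    'a1': '今日Github最火的10个Python项目,第一名是一个爬虫框架',
    'a2': '今日Github最火的10个Python项目,第一名是一个爬虫框架',  # near-duplicate of a1
    'a3': '完全不同的另一篇文章,讲的是MinHash去重',
}

duplicates = {}
for art_id, body in articles.items():
    sig = make_minhash(body)
    similar = lsh.query(sig)  # ids already indexed whose estimated Jaccard >= threshold
    if similar:
        duplicates[art_id] = similar  # mark as duplicate of earlier posts
    else:
        lsh.insert(art_id, sig)  # only the first copy enters the index

print(duplicates)  # expected: {'a2': ['a1']}
```

The project's own `minHashLSH.write_vector` is stricter than this sketch: it builds the LSH index with threshold 0.8, accepts candidates whose Jaccard similarity is at least 0.9 directly, and re-checks the remaining candidates with `calc_duplicate_rate_max` (character-overlap ratio plus BLEU, cutoff 0.7), which corresponds to the `minhash_0.8 + rule 0.7` row of the table.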