├── README.md
├── config.json
├── faceimg.py
├── fetchNewArticle.py
├── setupPackage.py
├── start.py
└── wkhtmltopdf.exe

/README.md:
--------------------------------------------------------------------------------
# vWeChatCrawl - 小V WeChat Official Account Article Downloader (open-source edition)
Batch-export the historical articles of any WeChat official account. If you can write "hello world" in Python, you can use this tool.
Companion video tutorial: https://www.bilibili.com/video/BV1jv4y1f7j5/

# Note:

The latest articles about this project are published on the WeChat official account "不止技术流" - feel free to follow it.
![avatar](https://www.xiaokuake.com/p/wp-content/uploads/2019/08/2019081511223334.jpeg)
QQ discussion group: 703431832, join code "不止技术流"


# Usage:
## a. Install the Python packages
Run `python setupPackage.py` to install the packages this project needs. Some users report that the default PyPI source is slow, so the script installs from a domestic mirror (Tsinghua).
## b. Install and configure Fiddler
The official Fiddler site is sometimes unreachable; if so, search for Fiddler4 on pc.qq.com and install it from there.
![avatar](http://img1.xiaokuake.com/p/wp-content/uploads/2019/08/2019080602070412.png)

Several dialogs will pop up - click Yes in all of them.

![avatar](http://img1.xiaokuake.com/p/wp-content/uploads/2019/08/2019080602072832.png)

In the end it should look like this, with all 3 boxes checked. Click OK to save.

![avatar](http://img1.xiaokuake.com/p/wp-content/uploads/2019/08/2019080602075168.png)

On the right side of the main window, configure the filter as shown below; the URL to fill in is mp.weixin.qq.com/mp/profile_ext?action=getms

![avatar](http://img1.xiaokuake.com/p/wp-content/uploads/2019/08/201908060209546.png)


That completes the setup. Click the small square in the lower-left corner of Fiddler: when it shows "Capturing", Fiddler is recording traffic; clicking it again pauses the capture. Leave it capturing for now.
![avatar](http://img1.xiaokuake.com/p/wp-content/uploads/2019/08/2019080602082132.png)

If Fiddler does not capture any HTTPS requests, go through the steps above again carefully. Most other Fiddler-related problems can be solved with a quick web search.

## c. Open the history article list of a WeChat official account
[Follow this video first](https://www.bilibili.com/video/BV1yu4y1n7PV/)

After opening the account's history list, keep scrolling down so that the whole article history is loaded, but do not scroll too fast.

Fiddler now shows the requests we need:

![avatar](http://img1.xiaokuake.com/p/wp-content/uploads/2019/08/2019080602101979.png)

Save these requests - they contain the list of article URLs.

![avatar](http://img1.xiaokuake.com/p/wp-content/uploads/2019/08/2019080602105916.png)

If the following dialog appears, choose "Raw Files".

![avatar](http://www.xiaokuake.com/p/wp-content/uploads/2022/05/2022050912413657.png)

![avatar](http://img1.xiaokuake.com/p/wp-content/uploads/2019/08/2019080602105929.png)

## d. Run the Python scripts
Open this project's config.json and set
- jsonDir: the folder with the files you just saved from Fiddler
- htmlDir: the folder for the downloaded html files; the path must not contain spaces
- pdfDir: the folder for the generated pdf files; the path must not contain spaces
Remember to save the file after editing.


Run `python start.py`     # download the articles as html
Run `python start.py pdf` # convert the downloaded html files to pdf

Some users find that the most recent batch of articles is missing from the download. That is because, by default, the newest articles are not returned as JSON. You can work around it like this:

In the article list, the first article has an "全部" (All) switch in its top-right corner. Select "视频" (Video) first and then "全部" (All) again; the first page of the list will then show up in Fiddler.
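
Before starting the full download, you can quickly check that the files saved from Fiddler really are the article-list responses the scripts expect. The snippet below is only an illustrative sketch (not part of the project; the folder path is an example matching config.json). It counts the message groups in the dump folder using the same fields that start.py parses:

```python
import os, json

jsonDir = "c:/Users/kklwin10/Desktop/Dump-0103-20-14-29"  # example path, same value as in config.json
groups = 0
for name in os.listdir(jsonDir):
    try:
        with open(os.path.join(jsonDir, name), "r", encoding="utf-8") as f:
            body = json.load(f)
        # each saved response embeds a JSON string whose "list" holds the article groups
        groups += len(json.loads(body["general_msg_list"])["list"])
    except Exception:
        pass  # files that are not article-list responses are simply skipped
print("message groups found:", groups)
```

If this prints 0, the capture probably did not include the profile_ext responses described in step c.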
The files not mentioned above implement other features (the author keeps several small projects in this repo). They may be worth a look, but they are not needed for the workflow described above.

## Extras

We can also crawl articles, read counts and other data from a large number of official accounts in bulk; business cooperation with corporate users is welcome: [https://www.xiaokuake.com](https://www.xiaokuake.com?id=github)

Kuaishou data can be crawled as well, and tools for Kuaishou live-streaming are under development; discussion is welcome.

The author's WeChat is xiaov0755; other custom crawler needs can be discussed too (nothing illegal).

This open-source project is for technical study and exchange only. Do not use it for illegal purposes; the author takes no responsibility for any consequences.


The main approach follows these articles:
[Build an article crawler step by step (1) - Overview](https://mp.weixin.qq.com/s?__biz=MzAxMDM4MTA2MA==&mid=2455304602&idx=1&sn=4beadc781c44c17cb4451b579d077c45&chksm=8cfd6bf1bb8ae2e7d5a9f1a66696dd12e260ac7919c7bebe317af81e90bd25591ba286da1f0f&token=2137480545&lang=zh_CN#rd)
[Build an article crawler step by step (2) - Downloading pages](https://mp.weixin.qq.com/s?__biz=MzAxMDM4MTA2MA==&mid=2455304609&idx=1&sn=b7496563aab42e92060bd68936bc4212&chksm=8cfd6bcabb8ae2dc606b060fecf3f837177e3ef22a05a30ee28ebefd75c6677b29df3e426692&token=2137480545&lang=zh_CN#rd)
Read part 3 especially carefully:
[Build an article crawler step by step (3) - Batch download](https://mp.weixin.qq.com/s?__biz=MzAxMDM4MTA2MA==&mid=2455304632&idx=1&sn=d0a1f6ef7e5d4356d17219a2b79f65d4&chksm=8cfd6bd3bb8ae2c532f901e11aa4b080c19f16626f0dceb291fcb8270e2d7689d7b97d232683&token=2137480545&lang=zh_CN#rd)

Common questions are answered here:

[Build an article crawler step by step (4) - FAQ](https://mp.weixin.qq.com/s/jMHeQGKuEb5G6iFDg6jmDA)

# Other services:
You specify a list of official accounts and our server monitors them around the clock. As soon as a new article appears you are notified (usually within seconds to a few minutes) with the article link, title, publication time, cover image, author, summary and more. [Details here](https://www.xiaokuake.com/p/wxpush.html)

--------------------------------------------------------------------------------
/config.json:
--------------------------------------------------------------------------------
{
    "jsonDir": "c:/Users/kklwin10/Desktop/Dump-0103-20-14-29",
    "htmlDir": "c:/vWeChatFiles/html",
    "pdfDir": "c:/vWeChatFiles/pdf"
}
--------------------------------------------------------------------------------
/faceimg.py:
--------------------------------------------------------------------------------
from PIL import Image


def GenFaceFlag(mainPicPath, flagPath, savepath):
    mainImg = Image.open(mainPicPath)  # main picture
    flagImg = Image.open(flagPath)  # small flag image to paste onto it
    mw, mh = mainImg.size
    fw, fh = flagImg.size
    if fw > int(mw * 0.3):  # scale the flag down if it is too large
        newwidth = int(mw * 0.3)
        newheight = int(mw * 0.3 * fh / fw)
        flagImgNew = flagImg.resize((newwidth, newheight))
    else:
        newwidth = fw
        newheight = fh
        flagImgNew = flagImg
    lt_x = mw - newwidth  # paste position: bottom-right corner
    lt_y = mh - newheight
    mainImg.paste(flagImgNew, (lt_x, lt_y))  # paste the flag
    mainImg.save(savepath)  # save the new image


if __name__ == "__main__":
    mainpath = "C:\\Python\\vWXCrawl\\pub\\vWeChatCrawl\\main2.jpg"
    flagpath = "C:\\Python\\vWXCrawl\\pub\\vWeChatCrawl\\flag2.png"
    savepath = "C:\\Python\\vWXCrawl\\pub\\vWeChatCrawl\\save.png"

    GenFaceFlag(mainpath, flagpath, savepath)
--------------------------------------------------------------------------------
/fetchNewArticle.py:
--------------------------------------------------------------------------------
import json
import time
import requests
from pprint import pprint


def SaveFile(fpath, fileContent):
    with open(fpath, 'w', encoding='UTF-8') as f:
        f.write(fileContent)


def run(token, customerid, starttime, bizlist):
    url = "http://tst.xiaokuake.com/w/fetch/?customerid=" + customerid + "&token=" + token

    postdata = {'customerid': customerid,
                'bizlist': bizlist,
                'starttime': starttime,
                }
    poststr = json.dumps(postdata)  # serialize the payload to a JSON string
    headers = {'content-type': "application/json"}
    response = requests.post(url, data=poststr, headers=headers)  # send the POST request
    pprint(response.text)  # pretty-print the response
    # SaveFile("a.txt", response.text)


token = "93cO7O302oDS"  # fixed value for the test account
customerid = "weiyan"  # fixed value for the test account
starttime = 1591784106  # fetch data from starttime until now; at most the last 24 hours are available
bizlist = [
    "MzA5NDc1NzQ4MA==",  # 差评
    "MzUxNjUxMTg3OA==",  # 占豪
    "MjM5MjAxNDM4MA=="   # 人民日报
]
# The test account may only use the biz values listed above. The paid version also requires an
# agreed-upon list of biz values: we follow those accounts first so their new posts can be pushed.
# If the accounts are not fixed and you want to query any account on demand, a different API is
# required; contact the author for it (it costs more than the fixed-list version).
# Contact: www.xiaokuake.com/w/fetch/

run(token, customerid, starttime, bizlist)
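
# Illustrative note (editor's sketch, not part of the original script): because the test
# endpoint only returns articles from roughly the last 24 hours, starttime is usually computed
# relative to "now" instead of being hard-coded, for example:
#     starttime = int(time.time()) - 24 * 3600  # everything from the last 24 hours
#     run(token, customerid, starttime, bizlist)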
--------------------------------------------------------------------------------
/setupPackage.py:
--------------------------------------------------------------------------------
from subprocess import call

# If installation from the default PyPI source is too slow, just run this file: it installs the
# required packages from a domestic mirror.
lst = ["beautifulsoup4", "lxml", "requests"]
for pkg in lst:
    call("pip install -i https://pypi.tuna.tsinghua.edu.cn/simple --upgrade " + pkg)
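
# Editor's sketch (not in the original file): a slightly more portable variant calls pip through
# the current interpreter and passes an argument list, which avoids picking up a "pip" from a
# different Python installation and sidesteps shell quoting issues:
#     import sys
#     from subprocess import call
#     for pkg in ["beautifulsoup4", "lxml", "requests"]:
#         call([sys.executable, "-m", "pip", "install",
#               "-i", "https://pypi.tuna.tsinghua.edu.cn/simple", "--upgrade", pkg])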
--------------------------------------------------------------------------------
/start.py:
--------------------------------------------------------------------------------
import os, sys
import requests
import json
import subprocess
from bs4 import BeautifulSoup
from datetime import datetime, timedelta
from time import sleep

"""
This project is open-sourced at https://github.com/LeLe86/vWeChatCrawl
Discussion QQ group 703431832, join code: 不止技术流
"""


# Save text to a file
def SaveFile(fpath, fileContent):
    with open(fpath, 'w', encoding='utf-8') as f:
        f.write(fileContent)


# Read a text file
def ReadFile(filepath):
    with open(filepath, 'r', encoding='utf-8') as f:
        all_the_text = f.read()
    return all_the_text


# Convert a unix timestamp to a Beijing-time (UTC+8) date string
def Timestamp2Datetime(stampstr):
    dt = datetime.utcfromtimestamp(stampstr)
    dt = dt + timedelta(hours=8)
    newtimestr = dt.strftime("%Y%m%d_%H%M%S")
    return newtimestr


# Load the configuration
def GetJson():
    jstxt = ReadFile("config.json")
    jstxt = jstxt.replace("\\\\", "/").replace("\\", "/")  # normalize backslashes so Windows-style paths in the json parse correctly
    jsbd = json.loads(jstxt)
    if jsbd["htmlDir"][-1] == "/":
        jsbd["htmlDir"] = jsbd["htmlDir"][:-1]
    if jsbd["jsonDir"][-1] == "/":
        jsbd["jsonDir"] = jsbd["jsonDir"][:-1]
    return jsbd


# Download the html of a url
def DownLoadHtml(url):
    # Update 2024-05-06: the url needs a small fix-up first, otherwise the download misbehaves
    url = url.replace("&amp;", "&")
    # build the request headers
    headers = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/124.0.0.0 Safari/537.36',
        'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.7',
        'Connection': 'keep-alive',
        'Accept-Language': 'zh-CN,zh;q=0.9'
    }
    session = requests.Session()
    session.trust_env = False
    response = session.get(url, headers=headers)
    if response.status_code == 200:
        htmltxt = response.text  # the page body
        return htmltxt
    else:
        return None


# Download an image and save it locally
def DownImg(url, savepath):
    # build the request headers
    headers = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/59.0.3071.115 Safari/537.36',
        'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
        'Connection': 'keep-alive',
        'Accept-Language': 'zh-CN,zh;q=0.8,en-US;q=0.5,en;q=0.3'
    }
    session = requests.Session()
    session.trust_env = False
    response = session.get(url, headers=headers)
    with open(savepath, 'wb') as f:
        f.write(response.content)


# Rewrite the img src attributes so the images display correctly in the saved html
def ChangeImgSrc(htmltxt, saveimgdir, htmlname):
    bs = BeautifulSoup(htmltxt, "lxml")  # parse the page source; the second argument is always "lxml"
    imgList = bs.findAll("img")
    imgindex = 0
    for img in imgList:
        imgindex += 1
        originalURL = ""  # the real url of the image
        if "data-src" in img.attrs:  # article images usually carry the real url in data-src
            originalURL = img.attrs["data-src"]
        elif "src" in img.attrs:  # fall back to src when data-src is missing
            originalURL = img.attrs["src"]
        if len(originalURL) > 20:
            print("\r down imgs " + "▇" * imgindex + " " + str(imgindex), end="")
            if "data-type" in img.attrs:
                imgtype = img.attrs["data-type"]
            else:
                imgtype = "png"
            imgname = htmlname + "_" + str(imgindex) + "." + imgtype  # image file name such as xxx_1.png
            imgsavepath = os.path.join(saveimgdir, imgname)  # where the image is saved
            DownImg(originalURL, imgsavepath)
            img.attrs["src"] = "images/" + imgname  # relative path used inside the saved page
        else:
            img.attrs["src"] = ""
    ChangeCssSrc(bs)  # fix the link tags
    ChangeContent(bs)  # fix the style of js_content so the article body is visible
    allscript = bs.findAll("script")
    for script in allscript:
        if "src" in script.attrs:  # drop remote js, which otherwise makes the saved page very slow to open
            script["src"] = ""
    return str(bs)  # convert the BeautifulSoup object back to a string for saving


def ChangeCssSrc(bs):
    linkList = bs.findAll("link")
    for link in linkList:
        href = link.attrs["href"]
        if href.startswith("//"):
            newhref = "http:" + href
            link.attrs["href"] = newhref


def ChangeContent(bs):
    jscontent = bs.find(id="js_content")
    if jscontent:
        jscontent.attrs["style"] = ""
    else:
        print("-----the article may have been deleted-----")


# Article entity
class Article():
    def __init__(self, url, pubdate, idx, title):
        self.url = url
        self.pubdate = pubdate
        self.idx = idx
        self.title = title


# Extract the article urls and related info from the json files saved by Fiddler
def GetArticleList(jsondir):
    filelist = os.listdir(jsondir)
    ArtList = []
    for file in filelist:
        try:
            filepath = os.path.join(jsondir, file)
            filetxt = ReadFile(filepath)
            jsbody = json.loads(filetxt)
            general_msg_list = jsbody["general_msg_list"]
            jsbd2 = json.loads(general_msg_list)
            msglist = jsbd2["list"]
            for item in msglist:  # one item may contain several articles
                artidx = 1  # note: this index is only used to name the html files; it does not reflect the article's real position in the push
                comm_msg_info = item["comm_msg_info"]

                pubstamp = comm_msg_info["datetime"]
                pubdate = Timestamp2Datetime(pubstamp)
                if comm_msg_info["type"] == 49:  # 49 is the normal image-and-text type; other types are ignored for now
                    app_msg_ext_info = item["app_msg_ext_info"]
                    url = app_msg_ext_info["content_url"]  # article link
                    idx = artidx
                    title = app_msg_ext_info["title"]
                    art = Article(url, pubdate, idx, title)
                    if len(url) > 3:  # skip incomplete urls
                        ArtList.append(art)
                        print(len(ArtList), pubdate, idx, title)
                    if app_msg_ext_info["is_multi"] == 1:  # this push contains more than one article
                        multi_app_msg_item_list = app_msg_ext_info["multi_app_msg_item_list"]
                        for subArt in multi_app_msg_item_list:
                            artidx += 1  # give each secondary article its own index so file names do not collide
                            url = subArt["content_url"]
                            idx = artidx
                            title = subArt["title"]
                            art = Article(url, pubdate, idx, title)
                            if len(url) > 3:
                                ArtList.append(art)
                                print(len(ArtList), pubdate, idx, title)
        except Exception:
            print("skipped (safe to ignore):", file)
    return ArtList
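
# For reference, an abridged sketch (editor's note, showing only the fields used above) of what
# each file saved from Fiddler looks like:
#     {"general_msg_list": "<json string>", ...}
# and the embedded json string decodes to roughly:
#     {"list": [
#         {"comm_msg_info": {"datetime": <unix timestamp>, "type": 49, ...},
#          "app_msg_ext_info": {"title": "...", "content_url": "http://mp.weixin.qq.com/s?...",
#                               "is_multi": 1,
#                               "multi_app_msg_item_list": [{"title": "...", "content_url": "..."}, ...],
#                               ...}},
#         ...]}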


def DownHtmlMain(jsonDir, saveHtmlDir):
    if not os.path.exists(saveHtmlDir):
        os.makedirs(saveHtmlDir)
    saveImgDir = saveHtmlDir + "/images"
    if not os.path.exists(saveImgDir):
        os.makedirs(saveImgDir)
    ArtList = GetArticleList(jsonDir)
    ArtList.sort(key=lambda x: x.pubdate, reverse=True)  # newest first
    totalCount = len(ArtList)
    idx = 0
    for art in ArtList:
        idx += 1
        artname = art.pubdate + "_" + str(art.idx)
        arthtmlname = artname + ".html"
        arthtmlsavepath = saveHtmlDir + "/" + arthtmlname
        print(idx, "of", totalCount, artname, art.title)
        # skip articles that are already downloaded, so an interrupted run can be resumed
        if os.path.exists(arthtmlsavepath):
            print("exists", arthtmlsavepath)
            continue
        arthtmlstr = DownLoadHtml(art.url)
        arthtmlstr = ChangeImgSrc(arthtmlstr, saveImgDir, artname)
        print("\r", end="")
        SaveFile(arthtmlsavepath, arthtmlstr)

        sleep(3)  # wait 3 seconds between articles so WeChat does not block us for downloading too fast


# Convert every html file in a folder to pdf
def PDFDir(htmldir, pdfdir):
    if not os.path.exists(pdfdir):
        os.makedirs(pdfdir)
    flist = os.listdir(htmldir)
    for f in flist:
        if (not f[-5:] == ".html") or ("tmp" in f):  # only convert html files, and skip the tmp files
            continue
        htmlpath = htmldir + "/" + f
        tmppath = htmlpath[:-5] + "_tmp.html"  # temporary file used for the pdf conversion
        htmlstr = ReadFile(htmlpath)
        bs = BeautifulSoup(htmlstr, "lxml")
        title = ""
        # the pdf file name includes the article title, but characters that are illegal in file names make the conversion fail
        titleTag = bs.find(id="activity-name")
        if titleTag is not None:
            title = "_" + titleTag.get_text().replace(" ", "").replace(" ", "").replace("\n", "").replace("|", "").replace(":", "")
        ridx = htmlpath.rindex("/") + 1
        pdfname = htmlpath[ridx:-5] + title
        pdfpath = pdfdir + "/" + pdfname + ".pdf"

        """
        Strip js and similar resources to reduce what has to be loaded during the pdf conversion.
        Note that css (link) is stripped here too; if the pdf layout looks broken, keep the css.
        """
        [s.extract() for s in bs(["script", "iframe", "link"])]
        SaveFile(tmppath, str(bs))
        try:
            PDFOne(tmppath, pdfpath)
        except Exception:
            print("pdf conversion failed, possibly because of special characters in the title:", f)


# Convert a single html file to pdf
def PDFOne(htmlpath, pdfpath, skipExists=True, removehtml=True):
    if skipExists and os.path.exists(pdfpath):
        print("pdf exists", pdfpath)
        if removehtml:
            os.remove(htmlpath)
        return
    exepath = "wkhtmltopdf.exe"  # put wkhtmltopdf.exe in the same directory as this .py file
    cmdlist = []
    cmdlist.append(" --load-error-handling ignore ")
    cmdlist.append(" --page-height 200 ")  # the numbers can be tuned; these two lines may also be omitted
    cmdlist.append(" --page-width 140 ")
    cmdlist.append(" " + htmlpath + " ")
    cmdlist.append(" " + pdfpath + " ")
    cmdstr = exepath + "".join(cmdlist)
    print(cmdstr)
    result = subprocess.check_call(cmdstr, shell=False)
    # stdout, stderr = result.communicate()
    # result.wait()  # wait until one conversion finishes before starting the next
    if removehtml:
        os.remove(htmlpath)


"""
1. Configuration:
    Edit config.json first and set
    jsonDir: the folder with the files saved from Fiddler
    htmlDir: the folder for the downloaded html files; the path must not contain spaces
    pdfDir: the folder for the generated pdf files; the path must not contain spaces
2. Usage:
    python start.py        # download the articles as html
    python start.py pdf    # convert the downloaded html files to pdf
"""


if __name__ == "__main__":
    if len(sys.argv) == 1:
        arg = None
    else:
        arg = sys.argv[1]
    if arg is None or arg == "html":
        jsbd = GetJson()
        saveHtmlDir = jsbd["htmlDir"]
        jsdir = jsbd["jsonDir"]
        DownHtmlMain(jsdir, saveHtmlDir)
    elif arg == "pdf":
        jsbd = GetJson()
        saveHtmlDir = jsbd["htmlDir"]
        savePdfDir = jsbd["pdfDir"]
        PDFDir(saveHtmlDir, savePdfDir)
--------------------------------------------------------------------------------
/wkhtmltopdf.exe:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/LeLe86/vWeChatCrawl/1c1345321fe6f0e5388f7fcad09cabc8eb5c9918/wkhtmltopdf.exe
--------------------------------------------------------------------------------