├── requirements.txt ├── static ├── loading.gif └── css │ └── style.css ├── README.md ├── templates └── post.html ├── GetTxt.py ├── GetPpt.py └── GetAll.py /requirements.txt: -------------------------------------------------------------------------------- 1 | requests 2 | chardet 3 | bs4 4 | Pillow 5 | pdfkit 6 | flask 7 | imgkit 8 | img2pdf -------------------------------------------------------------------------------- /static/loading.gif: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/ChangeWeDer/BaiduWenkuSpider_flaskWeb/HEAD/static/loading.gif -------------------------------------------------------------------------------- /README.md: -------------------------------------------------------------------------------- 1 | # BaiduWenkuSpider_flaskWeb 2 | 以web server形式实现对百度文库文档以pdf形式原格式下载 3 | 如果觉得可以的话,可以点个**🌟**哦 4 | (**当前爬取方式可能已经不支持,仅提供flask开发参考**) 5 | 6 | ## 前言 7 | 首先,这是根据 8 | [https://github.com/M010K/BaiduWenkuSpider](https://github.com/M010K/BaiduWenkuSpider) 9 | 的项目进行一点修改得到的基于flask框架的python web项目, 10 | 可以对百度文库的文档转换为pdf格式进行下载 11 | 12 | **[博客地址](https://www.upstudy.top/index.php/archives/21/)** 13 | 14 | ## 如何使用? 15 | #### 一、下载项目zip包,或者直接用git获取 16 | 17 | **$ git clone https://github.com/ChangeWeDer/BaiduWenkuSpider_flaskWeb** 18 | 19 | 20 | #### 二、安装依赖 21 | 项目使用的依赖有 22 | 1. requests 23 | 2. chardet 24 | 3. bs4 25 | 4. Pillow 26 | 5. pdfkit 27 | 6. flask 28 | 7. imgkit 29 | 8. img2pdf 30 | 31 | cd到项目文件夹中使用命令,直接一键安装 32 | **pip install -r requirements.txt** 33 | 34 | #### 三、安装wkhtmltopdf工具 35 | [官网下载地址](https://wkhtmltopdf.org/downloads.html) 36 | 37 | 下载后按当前系统 38 | 配置环境变量即可 39 | 40 | **window:** 41 | ![在这里插入图片描述](https://img-blog.csdnimg.cn/20200421234401464.png?x-oss-process=image/watermark,type_ZmFuZ3poZW5naGVpdGk,shadow_10,text_aHR0cHM6Ly9ibG9nLmNzZG4ubmV0L3dlaXhpbl80Mzg3ODMzMg==,size_16,color_FFFFFF,t_70) 42 | 43 | **Centos:** 44 | 45 | [https://blog.csdn.net/LookingTomorrow/article/details/93513457](https://blog.csdn.net/LookingTomorrow/article/details/93513457) 46 | 47 | #### 四、直接运行GetAll.py文件,访问http://127.0.0.1:5000/post 即可(运行在服务器端则访问IP:5000/post) 48 | 49 | ![在这里插入图片描述](https://img-blog.csdnimg.cn/20200421234635967.png?x-oss-process=image/watermark,type_ZmFuZ3poZW5naGVpdGk,shadow_10,text_aHR0cHM6Ly9ibG9nLmNzZG4ubmV0L3dlaXhpbl80Mzg3ODMzMg==,size_16,color_FFFFFF,t_70) 50 | 51 | ps:ppt格式的文档不支持预览 52 | #### 五、Github源码下载地址 53 | [https://github.com/ChangeWeDer/BaiduWenkuSpider_flaskWeb](https://github.com/ChangeWeDer/BaiduWenkuSpider_flaskWeb) 54 | -------------------------------------------------------------------------------- /static/css/style.css: -------------------------------------------------------------------------------- 1 | .container { 2 | width: 500px; 3 | height: 50px; 4 | margin: 100px auto; 5 | } 6 | 7 | .parent { 8 | width: 100%; 9 | height: 42px; 10 | top: 4px; 11 | position: relative; 12 | } 13 | 14 | .parent>input:first-of-type { 15 | /*输入框高度设置为40px, border占据2px,总高度为42px*/ 16 | width: 380px; 17 | height: 40px; 18 | border: 1px solid #ccc; 19 | font-size: 16px; 20 | outline: none; 21 | } 22 | 23 | .parent>input:first-of-type:focus { 24 | border: 1px solid #317ef3; 25 | padding-left: 10px; 26 | } 27 | 28 | .parent>input:last-of-type { 29 | /*button按钮border并不占据外围大小,设置高度42px*/ 30 | width: 100px; 31 | height: 44px; 32 | position: absolute; 33 | background: #317ef3; 34 | border: 1px solid #317ef3; 35 | color: #fff; 36 | font-size: 16px; 37 | outline: none; 38 | } 39 | 40 | .a_demo_two { 41 | background-color:#317ef3; 42 | padding:10px; 43 | position:relative; 44 | font-family: 'Open Sans', sans-serif; 45 | font-size:12px; 46 | text-decoration:none; 47 | color:#fff; 48 | background-image: linear-gradient(bottom, rgb(100,170,30) 0%, rgb(129,212,51) 100%); 49 | box-shadow: inset 0px 1px 0px #b2f17f, 0px 6px 0px #3d6f0d; 50 | border-radius: 5px; 51 | } 52 | 53 | .a_demo_two:active { 54 | top:7px; 55 | background-image: linear-gradient(bottom, rgb(100,170,30) 100%, rgb(129,212,51) 0%); 56 | box-shadow: inset 0px 1px 0px #b2f17f, inset 0px -1px 0px #3d6f0d; 57 | color: #156785; 58 | text-shadow: 0px 1px 1px rgba(255,255,255,0.3); 59 | background: rgb(44,160,202); 60 | } 61 | 62 | .a_demo_two::before { 63 | background-color:#072239; 64 | content:""; 65 | display:block; 66 | position:absolute; 67 | width:100%; 68 | height:100%; 69 | padding-left:2px; 70 | padding-right:2px; 71 | padding-bottom:4px; 72 | left:-2px; 73 | top:5px; 74 | z-index:-1; 75 | border-radius: 6px; 76 | box-shadow: 0px 1px 0px #fff; 77 | } 78 | 79 | .a_demo_two:active::before { 80 | top:-2px; 81 | } 82 | 83 | 84 | -------------------------------------------------------------------------------- /templates/post.html: -------------------------------------------------------------------------------- 1 | 2 | 3 | 4 | 5 | 6 | 7 | 百度文库下载 8 | 25 | 26 | 27 | 28 | 29 |
30 |

仅供学习交流使用

31 | 32 |
33 |
34 | 35 | 36 |
37 |
38 | 39 | {% if get_flashed_messages()[1] %} 40 |
41 | 42 | 预览(不支持ppt) 43 | 44 |           45 | 46 | 不废话,下载 47 | 48 |
49 | 50 | {% endif %} 51 |
52 | 64 | 65 | 66 |
67 |
68 |

博客地址

69 |
70 |
71 |

Github项目地址

72 |
73 | 74 | 83 | 84 | 85 | -------------------------------------------------------------------------------- /GetTxt.py: -------------------------------------------------------------------------------- 1 | from requests import get 2 | from requests.exceptions import ReadTimeout 3 | from chardet import detect 4 | from bs4 import BeautifulSoup 5 | from os import getcwd,mkdir 6 | from os.path import join,exists 7 | from re import findall 8 | from json import loads 9 | from time import time 10 | 11 | 12 | class GetTxt: 13 | def __init__(self, url, savepath): 14 | self.url = url 15 | self.savepath = savepath if savepath != '' else getcwd() 16 | self.txtsavepath = self.makeDirForTxtSave() 17 | self.html = '' 18 | self.wkinfo = {} # 存储文档基本信息:title、docType、docID 19 | self.txturls = [] 20 | 21 | self.getHtml() 22 | self.getWkInfo() 23 | 24 | 25 | # 创建临时文件夹保存ppt 26 | def makeDirForTxtSave(self): 27 | if not exists(join(self.savepath,'txtfiles')): 28 | mkdir(join(self.savepath,'txtfiles')) 29 | return join(self.savepath,'txtfiles') 30 | 31 | # 获取网站源代码 32 | def getHtml(self): 33 | try: 34 | header = {'User-Agent': 'Mozilla/5.0 ' 35 | '(Macintosh; Intel Mac OS X 10_14_6) ' 36 | 'AppleWebKit/537.36 (KHTML, like Gecko) ' 37 | 'Chrome/78.0.3904.108 Safari/537.36'} 38 | response = get(self.url, headers=header) 39 | self.transfromEncoding(response) 40 | self.html = BeautifulSoup(response.text, 'html.parser') # 格式化 41 | 42 | except ReadTimeout as e: 43 | print(e) 44 | return None 45 | 46 | # 转换网页源代码为对应编码格式 47 | def transfromEncoding(self, html): 48 | # 检测并修改html内容的编码方式 49 | html.encoding = detect(html.content).get("encoding") 50 | 51 | # 获取文档基本信息:名字,类型,文档ID 52 | def getWkInfo(self): 53 | items = ["'title'", "'docType'", "'docId'", "'totalPageNum"] 54 | for item in items: 55 | ls = findall(item + ".*'", str(self.html)) 56 | if len(ls) != 0: 57 | message = ls[0].split(':') 58 | self.wkinfo[eval(message[0])] = eval(message[1]) 59 | 60 | # 获取json字符串 61 | def getJson(self, url): 62 | """ 63 | :param url: json文件所在页面的url 64 | :return: json格式字符串 65 | """ 66 | response = get(url) 67 | # 获取json格式数据 68 | jsonstr = response.text[response.text.find('(') + 1:response.text.rfind(')')] 69 | return jsonstr 70 | 71 | # 获取json字符串对应的字典 72 | def convertJsonToDict(self, jsonstr): 73 | """ 74 | :param: jsonstr: json格式字符串 75 | :return: json字符串所对应的python字典 76 | """ 77 | textdict = loads(jsonstr) # 将json字符串转换为python的字典对象 78 | return textdict 79 | 80 | # 获取包含txt文本的json文件的url 81 | def getTxtUrlForTXT(self): 82 | timestamp = round(time() * 1000) # 获取时间戳 83 | # 构造请求url,获取json文件所在url的参数 84 | messageurlprefix = "https://wenku.baidu.com/api/doc/getdocinfo?" \ 85 | "callback=cb&doc_id=" 86 | messageurlsuffix = self.wkinfo.get("docId") + "&t=" + \ 87 | str(timestamp) + "&_=" + str(timestamp + 1) 88 | 89 | textdict = self.convertJsonToDict( 90 | self.getJson(messageurlprefix + messageurlsuffix)) 91 | 92 | # 获取json文件所在url的参数 93 | self.txturls.append("https://wkretype.bdimg.com/retype/text/" + 94 | self.wkinfo.get('docId') + 95 | textdict.get('md5sum') + 96 | "&callback=cb&pn=1&rn=" + 97 | textdict.get("docInfo").get("totalPageNum") + 98 | "&rsign=" + textdict.get("rsign") + "&_=" + 99 | str(timestamp)) 100 | 101 | # 将文本内容保存 102 | def saveToTxt(self, content): 103 | savepath = join(self.txtsavepath,self.wkinfo.get('title') + '.txt') 104 | with open(savepath, "a",encoding='utf-8') as f: 105 | f.write(content) 106 | 107 | 108 | def getTXT(self): 109 | self.getTxtUrlForTXT() 110 | for url in self.txturls: 111 | textls = self.convertJsonToDict(self.getJson(url)) 112 | for text in textls: 113 | content = text.get("parags")[0].get("c") 114 | self.saveToTxt(content) 115 | url = join('\\txtfiles',self.wkinfo.get('title') + '.txt') 116 | return url 117 | 118 | 119 | 120 | 121 | if __name__ == '__main__': 122 | url = input('请输入网址:') 123 | GetTxt(url, '').getTXT() 124 | 125 | 126 | 127 | -------------------------------------------------------------------------------- /GetPpt.py: -------------------------------------------------------------------------------- 1 | from requests import get 2 | from PIL import Image 3 | from os import removedirs,remove,mkdir,getcwd 4 | from os.path import join, exists 5 | from requests.exceptions import ReadTimeout 6 | from chardet import detect 7 | from bs4 import BeautifulSoup 8 | from re import findall 9 | from json import loads 10 | from time import time 11 | 12 | 13 | class GetPpt: 14 | def __init__(self, url, savepath): 15 | self.url = url 16 | self.savepath = savepath if savepath != '' else getcwd() 17 | self.tempdirpath = self.makeDirForImageSave() 18 | self.pptsavepath = self.makeDirForPptSave() 19 | 20 | self.html = '' 21 | self.wkinfo ={} # 存储文档基本信息:title、docType、docID 22 | self.ppturls = [] # 顺序存储包含ppt图片的url 23 | 24 | self.getHtml() 25 | self.getWkInfo() 26 | 27 | 28 | # 获取网站源代码 29 | def getHtml(self): 30 | try: 31 | header = {'User-Agent': 'Mozilla/5.0 ' 32 | '(Macintosh; Intel Mac OS X 10_14_6) ' 33 | 'AppleWebKit/537.36 (KHTML, like Gecko) ' 34 | 'Chrome/78.0.3904.108 Safari/537.36'} 35 | response = get(self.url, headers = header) 36 | self.transfromEncoding(response) 37 | self.html = BeautifulSoup(response.text, 'html.parser') #格式化 38 | except ReadTimeout as e: 39 | print(e) 40 | return None 41 | 42 | 43 | # 转换网页源代码为对应编码格式 44 | def transfromEncoding(self,html): 45 | html.encoding = detect(html.content).get("encoding") #检测并修改html内容的编码方式 46 | 47 | 48 | # 获取文档基本信息:名字,类型,文档ID 49 | def getWkInfo(self): 50 | items = ["'title'","'docType'","'docId'","'totalPageNum"] 51 | for item in items: 52 | ls = findall(item+".*'", str(self.html)) 53 | if len(ls) != 0: 54 | message = ls[0].split(':') 55 | self.wkinfo[eval(message[0])] = eval(message[1]) 56 | 57 | 58 | # 获取json字符串 59 | def getJson(self, url): 60 | """ 61 | :param url: json文件所在页面的url 62 | :return: json格式字符串 63 | """ 64 | response = get(url) 65 | jsonstr = response.text[response.text.find('(')+1: response.text.rfind(')')] # 获取json格式数据 66 | return jsonstr 67 | 68 | 69 | # 获取json字符串对应的字典 70 | def convertJsonToDict(self, jsonstr): 71 | """ 72 | :param jsonstr: json格式字符串 73 | :return: json字符串所对应的python字典 74 | """ 75 | textdict = loads(jsonstr) # 将json字符串转换为python的字典对象 76 | return textdict 77 | 78 | 79 | # 创建临时文件夹保存图片 80 | def makeDirForImageSave(self): 81 | if not exists(join(self.savepath,'tempimages')): 82 | mkdir(join(self.savepath,'tempimages')) 83 | return join(self.savepath,'tempimages') 84 | 85 | # 创建临时文件夹保存ppt 86 | def makeDirForPptSave(self): 87 | if not exists(join(self.savepath,'pptfiles')): 88 | mkdir(join(self.savepath,'pptfiles')) 89 | return join(self.savepath,'pptfiles') 90 | 91 | 92 | # 从json文件中提取ppt图片的url 93 | def getImageUrlForPPT(self): 94 | timestamp = round(time()*1000) # 获取时间戳 95 | desturl = "https://wenku.baidu.com/browse/getbcsurl?doc_id="+\ 96 | self.wkinfo.get("docId")+\ 97 | "&pn=1&rn=99999&type=ppt&callback=jQuery1101000870141751143283_"+\ 98 | str(timestamp) + "&_=" + str(timestamp+1) 99 | 100 | 101 | textdict = self.convertJsonToDict(self.getJson(desturl)) 102 | self.ppturls = [x.get('zoom') for x in textdict.get('list')] 103 | 104 | 105 | # 通过给定的图像url及名称保存图像至临时文件夹 106 | def getImage(self, imagename, imageurl): 107 | imagename = join(self.tempdirpath, imagename) 108 | with open(imagename,'wb') as ig: 109 | ig.write(get(imageurl).content) #content属性为byte 110 | 111 | 112 | # 将获取的图片合成pdf文件 113 | def mergeImageToPDF(self, pages): 114 | if pages == 0: 115 | raise IOError 116 | 117 | 118 | namelist = [join(self.tempdirpath, str(x)+'.png') for x in range(pages)] 119 | firstimg = Image.open(namelist[0]) 120 | imglist = [] 121 | for imgname in namelist[1:]: 122 | img = Image.open(imgname) 123 | img.load() 124 | 125 | if img.mode == 'RGBA': # png图片的转为RGB mode,否则保存时会引发异常 126 | img.mode = 'RGB' 127 | imglist.append(img) 128 | 129 | savepath = join(self.pptsavepath, self.wkinfo.get('title')+'.pdf') 130 | url = join('\pptfiles', self.wkinfo.get('title')+'.pdf') 131 | firstimg.save(savepath, "PDF", resolution=100.0, 132 | save_all=True, append_images=imglist) 133 | return url 134 | 135 | # 清除下载的图片 136 | def removeImage(self,pages): 137 | namelist = [join(self.tempdirpath, str(x)+'.png') for x in range(pages)] 138 | for name in namelist: 139 | if exists(name): 140 | remove(name) 141 | if exists(join(self.savepath,'tempimages')): 142 | removedirs(join(self.savepath,'tempimages')) 143 | 144 | 145 | def getPPT(self): 146 | self.getImageUrlForPPT() 147 | for page, url in enumerate(self.ppturls): 148 | self.getImage(str(page)+'.png', url) 149 | url = self.mergeImageToPDF(len(self.ppturls)) 150 | self.removeImage(len(self.ppturls)) 151 | return url 152 | 153 | 154 | if __name__ == '__main__': 155 | url = input('请输入网址:') 156 | GetPpt(url, '').getPPT() 157 | 158 | -------------------------------------------------------------------------------- /GetAll.py: -------------------------------------------------------------------------------- 1 | # -*- coding: utf-8 -* 2 | from flask import Flask,render_template, redirect, url_for, request, flash, get_flashed_messages , send_from_directory 3 | from flask.helpers import safe_join 4 | import requests 5 | import os 6 | from requests.exceptions import ReadTimeout 7 | import chardet 8 | from bs4 import BeautifulSoup 9 | import re 10 | import json 11 | import math 12 | import imgkit 13 | from PIL import Image 14 | import img2pdf 15 | import pdfkit 16 | from GetPpt import GetPpt 17 | from GetTxt import GetTxt 18 | import time 19 | 20 | 21 | 22 | class GetAll: 23 | def __init__(self, url, savepath): 24 | """ 25 | :param url: 待爬取文档所在页面的url 26 | :param savepath: 生成文档保存路径 27 | """ 28 | self.url = url 29 | self.savepath = savepath if savepath != '' else os.getcwd() 30 | self.startpage = 1 31 | self.url = self.url + "&pn=1" 32 | self.html = '' 33 | self.wkinfo ={} # 存储文档基本信息:title、docType、docID、totalPageNum 34 | self.jsonurls = [] 35 | self.pdfurls = [] 36 | 37 | self.getHtml() 38 | self.getWkInfo() 39 | self.htmlsdirpath = self.makeDirForHtmlSave() 40 | self.pdfsdirpath = self.makeDirForPdfSave() 41 | self.htmlfile = self.wkinfo.get('title')+".html" 42 | 43 | 44 | # 创建临时文件夹保存html文件 45 | def makeDirForHtmlSave(self): 46 | if not os.path.exists(os.path.join(self.savepath,'htmlfiles')): 47 | os.mkdir(os.path.join(self.savepath,'htmlfiles')) 48 | return os.path.join(self.savepath, 'htmlfiles') 49 | 50 | 51 | def makeDirForPdfSave(self): 52 | if not os.path.exists(os.path.join(self.savepath, 'pdffiles')): 53 | os.mkdir(os.path.join(self.savepath,'pdffiles')) 54 | return os.path.join(self.savepath,'pdffiles') 55 | 56 | # 创建html文档,用于组织爬取的文件 57 | def creatHtml(self): 58 | with open(os.path.join(self.htmlsdirpath, self.htmlfile), "w",encoding='utf-8') as f: 59 | # 生成文档头 60 | message = """ 61 | 62 | 63 | 64 | 65 | 文库""" 66 | f.write(message) 67 | 68 | 69 | def addMessageToHtml(self,message): 70 | """:param message:向html文档中添加内容 """ 71 | with open(os.path.join(self.htmlsdirpath, self.htmlfile), "a",encoding='utf-8') as a: 72 | a.write(message) 73 | 74 | # 获取网站源代码 75 | def getHtml(self): 76 | try: 77 | header = {'User-Agent': 'Mozilla/5.0 ' 78 | '(Macintosh; Intel Mac OS X 10_14_6) ' 79 | 'AppleWebKit/537.36 (KHTML, like Gecko) ' 80 | 'Chrome/78.0.3904.108 Safari/537.36'} 81 | response = requests.get(self.url, headers = header) 82 | self.transfromEncoding(response) 83 | self.html = BeautifulSoup(response.text, 'html.parser') # 格式化html 84 | except ReadTimeout as e: 85 | print(e) 86 | 87 | # 转换网页源代码为对应编码格式 88 | def transfromEncoding(self, html): 89 | html.encoding = chardet.detect(html.content).get("encoding") #检测并修改html内容的编码方式 90 | 91 | # 获取文档基本信息:名字,类型,文档ID 92 | def getWkInfo(self): 93 | items = ["'title'","'docType'","'docId'","'totalPageNum"] 94 | for item in items: 95 | ls = re.findall(item+".*'", str(self.html)) 96 | if len(ls) != 0: 97 | message = ls[0].split(':') 98 | self.wkinfo[eval(message[0])] = eval(message[1]) 99 | 100 | 101 | # 获取存储信息的json文件的url 102 | def getJsonUrl(self): 103 | urlinfo = re.findall("WkInfo.htmlUrls = '.*?;", str(self.html)) 104 | urls = re.findall("https:.*?}", urlinfo[0]) 105 | urls = [str(url).replace("\\", "").replace('x22}','') for url in urls ] 106 | self.jsonurls = urls 107 | 108 | # 获取json字符串 109 | def getJson(self, url): 110 | """ 111 | :param url: json文件所在页面的url 112 | :return: json格式字符串 113 | """ 114 | response = requests.get(url) 115 | jsonstr = response.text[response.text.find('(')+1: response.text.rfind(')')] # 获取json格式数据 116 | return jsonstr 117 | 118 | # 获取json字符串对应的字典 119 | def convertJsonToDict(self, jsonstr): 120 | """ 121 | :param jsonstr: json格式字符串 122 | :return: json字符串所对应的python字典 123 | """ 124 | textdict = json.loads(jsonstr) # 将json字符串转换为python的字典对象 125 | return textdict 126 | 127 | # 判断文档是否为ppt格式 128 | def isPptStyle(self): 129 | iswholepic = False 130 | ispptlike = False 131 | for url in self.jsonurls: 132 | if "0.json" in url: 133 | textdict = self.convertJsonToDict(self.getJson(url)) 134 | # 若json文件中的style属性为空字符串且font属性为None,则说明pdf全由图片组成 135 | if textdict.get("style") == "" and textdict.get("font") is None: 136 | iswholepic = True 137 | break 138 | elif textdict.get('page').get('pptlike'): 139 | ispptlike = True 140 | break 141 | break 142 | 143 | return iswholepic and ispptlike 144 | 145 | 146 | # 从html中匹配出与控制格式相关的css文件的url 147 | def getCssUrl(self): 148 | pattern = re.compile('全部内容""" 170 | page = 1 171 | style = '\n" 180 | 181 | return style 182 | 183 | 184 | def getReaderRenderStyle(self, allstyle, font, r, page): 185 | """ 186 | :param allstyle: json数据的style内容 187 | :param font: json数据的font内容 188 | :param r: TODO:解析作用未知,先取值与e相同 189 | :param page: 当前页面 190 | :return: style 191 | """ 192 | p, stylecontent = "", [] 193 | for index in range(len(allstyle)): 194 | style = allstyle[index] 195 | if style.get('s'): 196 | p = self.getPartReaderRenderStyle(style.get('s'), font, r).strip(" ") 197 | l = "reader-word-s" + str(page) + "-" 198 | p and stylecontent.append("." + l + (",." + l).join([str(x) for x in style.get('c')]) + "{ " + p + "}") 199 | if style.get('s').get("font-family"): 200 | pass 201 | stylecontent.append("#pageNo-" + str(page) + " .reader-parent{visibility:visible;}") 202 | return "".join(stylecontent) 203 | 204 | 205 | def getPartReaderRenderStyle(self, s, font, r): 206 | """ 207 | :param s: json style下的s属性 208 | :param font: json font属性 209 | :param r: fontMapping TODO:先取为与e相同 210 | :return: style 中的部分字符串 211 | """ 212 | content = [] 213 | n, p = 10, 1349.19 / 1262.85 # n为倍数, p为比例系数, 通过页面宽度比得出 214 | 215 | def fontsize(f): 216 | content.append("font-size:" + str(math.floor(eval(f) * n * p)) + "px;") 217 | 218 | def letterspacing(l): 219 | content.append("letter-spacing:" + str(eval(l) * n) + "px;") 220 | 221 | def bold(b): 222 | "false" == b or content.append("font-weight:600;") 223 | 224 | def fontfamily(o): 225 | n = font.get(o) or o if font else o 226 | content.append("font-family:'" + n + "','" + o + "','" + (r.get(n) and r[n] or n) + "';") 227 | 228 | for attribute in s: 229 | if attribute == "font-size": 230 | fontsize(s[attribute]) 231 | elif attribute == "letter-spacing": 232 | letterspacing(s[attribute]) 233 | elif attribute == "bold": 234 | bold(s[attribute]) 235 | elif attribute == "font-family": 236 | fontfamily(s[attribute]) 237 | else: 238 | content.append(attribute + ":" + s[attribute] + ";") 239 | return "".join(content) 240 | 241 | 242 | # 向html中添加css 243 | def AddCss(self): 244 | urls = self.getCssUrl() 245 | urls = [url for url in urls if "htmlReader" in url or "core" in url or "main" in url or "base" in url] 246 | for url in urls: 247 | message = '>" 248 | self.addMessageToHtml(message) 249 | 250 | content = self.getAllReaderRenderStyle() # 获取文本控制属性css 251 | self.addMessageToHtml(content) 252 | 253 | 254 | def addMainContent(self): 255 | """ 256 | :param startpage: 开始生成的页面数 257 | :return: 258 | """ 259 | 260 | self.addMessageToHtml("\n\n\n\n") 261 | docidupdate = self.getDocIdUpdate() 262 | 263 | # 分别获取json和png所在的url 264 | jsonurl = [x for x in self.jsonurls if "json" in x] 265 | pngurl = [x for x in self.jsonurls if "png" in x] 266 | 267 | tags = self.getPageTag() 268 | for page, tag in enumerate(tags): 269 | if page > 50: 270 | break 271 | tag['style'] = "height: 1349.19px;" 272 | tag['id'] = "pageNo-" + str(page+1) 273 | self.addMessageToHtml(str(tag).replace('', '')) 274 | diu = self.getDocIdUpdate() 275 | n = "-webkit-transform:scale(1.00);-webkit-transform-origin:left top;" 276 | textdict = self.convertJsonToDict(self.getJson(jsonurl[page])) 277 | 278 | # 判断是否出现图片url少于json文件url情况 279 | if page < len(pngurl): 280 | maincontent = self.creatMainContent(textdict.get('body'), textdict.get('page'), textdict.get('font'), page + 1, docidupdate, 281 | pngurl[page]) 282 | else: 283 | maincontent = self.creatMainContent(textdict.get('body'), textdict.get('page'), textdict.get('font'), page + 1, docidupdate, "") 284 | content = "".join([ 285 | '
', 286 | '
', 287 | '
', maincontent, 288 | "
", "
", "
", ""]) 289 | 290 | self.addMessageToHtml(content) 291 | print("已完成%s页的写入,当前写入进度为%f" % (str(page+self.startpage), 100*(page+self.startpage)/int(self.wkinfo.get('totalPageNum'))) + '%') 292 | 293 | self.addMessageToHtml("\n\n\n\n") 294 | 295 | 296 | def isNumber(self, obj): 297 | """ 298 | :param obj:任意对象 299 | :return: obj是否为数字 300 | """ 301 | return isinstance(obj, int) or isinstance(obj, float) 302 | 303 | 304 | def creatMainContent(self, body, page, font, currentpage, o, pngurl): 305 | """ 306 | :param body: body属性 307 | :param page: page属性 308 | :param font: font属性 309 | :param currentpage: 当前页面数 310 | :param o:doc_id_update 311 | :param pngurl: 图片所在url 312 | :return:文本及图片的html内容字符串 313 | """ 314 | content, p, s, h = 0, 0, 0, 0 315 | main = [] 316 | l = 2 317 | c = page.get('v') 318 | 319 | d = font # d原本为fongmapping 320 | y = { 321 | "pic": '
', 322 | "word": '
' 323 | } 324 | g = "
" 325 | MAX1 , MAX2 = 0, 0 326 | body = sorted(body, key=lambda k: k.get('p').get('z')) 327 | for index in range(len(body)): 328 | content = body[index] 329 | if "pic" == content.get('t'): 330 | MAX1 = max(MAX1, content.get('c').get('ih') + content.get('c').get('iy') + 5) 331 | MAX2 = max(MAX2, content.get('c').get('iw')) 332 | for index in range(len(body)): 333 | content = body[index] 334 | s = content.get('t') 335 | if not p: 336 | p = h = s 337 | if p == s: 338 | if content.get('t') == "word": 339 | # m函数需要接受可变参数 340 | main.append(self.creatTagOfWord(content, currentpage, font, d, c)) 341 | elif content.get('t') == 'pic': 342 | main.append(self.creatTagOfImage(content, pngurl, MAX1, MAX2)) 343 | else: 344 | main.append(g) 345 | main.append(y.get(s).replace('__NUM__', str(l))) 346 | l += 1 347 | if content.get('t') == "word": 348 | # m函数需要接受可变参数 349 | main.append(self.creatTagOfWord(content, currentpage, font, d, c)) 350 | elif content.get('t') == 'pic': 351 | main.append(self.creatTagOfImage(content, pngurl, MAX1, MAX2)) 352 | p = s 353 | return y.get(h).replace('__NUM__', "1") + "".join(main) + g 354 | 355 | 356 | def creatTagOfWord(self, t, currentpage, font, o, version, *args): 357 | """ 358 | :param t: body中的每个属性 359 | :param currentpage: page 360 | :param font: font属性 361 | :param o:font属性 362 | :param version: page中的version属性 363 | :param args: 364 | :return:

标签--文本内容 365 | """ 366 | p = t.get('p') 367 | ps = t.get('ps') 368 | s = t.get('s') 369 | z = [' ', "\n"] 370 | k, N = 10, 1349.19 / 1262.85 371 | # T = self.j 372 | U = self.O(ps) 373 | w, h, y, x, D= p.get('w'), p.get('h'), p.get('y'), p.get('x'), p.get('z') 374 | pattern=re.compile("[\s\t\0xa0]| [\0xa0\s\t]$") 375 | final = [] 376 | 377 | if U and ps and ((ps.get('_opacity') and ps.get('_opacity') == 1) or (ps.get('_alpha') and ps.get('_alpha') == 0)): 378 | return "" 379 | else: 380 | width = math.floor(w * k * N) 381 | height = math.floor(h * k * N) 382 | final.append("

') 390 | final.append(t.get('c') if t.get('c') else "") 391 | final.append(U and ps and str(self.isNumber(ps.get('_enter'))) and z[ps.get('_enter') if ps.get('_enter') else 1] or "") 392 | final.append("

") 393 | 394 | return "".join(final) 395 | 396 | 397 | def processStyleOfS(self, t, font, r, version): 398 | """ 399 | :param t: 文本的s属性 400 | :param font: font属性 401 | :param r:font属性 402 | :param version: 403 | :return:处理好的S属性字符串 404 | """ 405 | infoOfS = [] 406 | n = {"font-size": 1} 407 | p , u = 10, 1349.19 / 1262.85 408 | 409 | def fontfamily(o): 410 | n = font.get(o) or o if font else o 411 | if abs(version) > 5: 412 | infoOfS.append("font-family:'"+ n + "','" + o + "','" + (r.get('n') and r[n] or n) + "';") 413 | else: 414 | infoOfS.append("font-family:'" + o + "','" + n + "','" + (r.get(n) and r[n] or n) + "';") 415 | 416 | def bold(e): 417 | "false" == e or infoOfS.append("font-weight:600;") 418 | 419 | def letter(e): 420 | infoOfS.append("letter-spacing:" + str(eval(e) * p) + "px;") 421 | 422 | if t is not None: 423 | for attribute in t: 424 | if attribute == "font-family": 425 | fontfamily(t[attribute]) 426 | elif attribute == "bold": 427 | bold(t[attribute]) 428 | elif attribute == "letter-spacing": 429 | letter(t[attribute]) 430 | else: 431 | infoOfS.append(attribute + ":" + (str(math.floor(((t[attribute] if self.isNumber(t[attribute]) else eval(t[attribute])) * p * u))) + "px" if n.get(attribute) else t[attribute]) + ";") 432 | 433 | return "".join(infoOfS) 434 | 435 | 436 | def processStyleOfR(self, r, page): 437 | """ 438 | :param r: 文本的r属性 439 | :param page: 当前页面 440 | :return: 441 | """ 442 | l = " " + "reader-word-s" + str(page) + "-" 443 | return "".join([l + str(x) for x in r]) if isinstance(r, list) and len(r) != 0 else "" 444 | 445 | 446 | def processStyleOf_rotate(self, t, w, h, x, y, k, N): 447 | """ 448 | :param t: _rotate属性 449 | :param w: body中p.w 450 | :param h: body中p.h 451 | :param x: body中p.x 452 | :param y: body中p.y 453 | :param k: 倍数10 454 | :param N: 比例系数 455 | :return: 处理好的_rotate属性字符串 456 | """ 457 | p = [] 458 | s = k * N 459 | if t == 90: 460 | p.append("left:" + str(math.floor(x + (w - h) / 2) * s) + "px;" + "top:" + str(math.floor(y - (h - w) / 2) * s) + "px;" + "text-align: right;" + "height:" + str(math.floor(h + 7) * s) + "px;") 461 | elif t == 180: 462 | p.append("left:" + str(math.floor(x - w) * s) + "px;" + "top:" + str(math.floor(y - h) * s) + "px;") 463 | elif t == 270: 464 | p.append("left:" + str(math.floor(x + (h - w) / 2) * s) + "px;" + "top:" + str(math.floor(y - (w - h) / 2) * s) + "px;") 465 | 466 | return "-webkit-"+"transform:rotate("+str(t)+"deg);"+"".join(p) 467 | 468 | 469 | def processStyleOf_scaleX(self, t, width, height): 470 | """ 471 | :param t: _scaleX属性 472 | :param width: 计算好的页面width 473 | :param height:计算好的页面height 474 | :return: 处理好的_scaleX属性字符串 475 | """ 476 | return "-webkit-" + "transform: scaleX(" + str(t) + ");" + "-webkit-" + "transform-origin:left top;width:" + str(width + math.floor(width / 2)) + "px;height:" + str(height + math.floor(height / 2)) + "px;" 477 | 478 | 479 | def processStyleOfOpacity(self,t): 480 | """ 481 | :param t: opacity属性 482 | :return:处理好的opacity属性字符串 483 | """ 484 | t = (t or 0), 485 | return "opacity:" + str(t) + ";" 486 | 487 | 488 | def creatTagOfImage(self,t,url, *args): 489 | """ 490 | :param t: 图片的字典 491 | :param url:图片链接 492 | :param args: 493 | :return:图像标签 494 | """ 495 | u, l = t.get('p'), t.get('c') 496 | if u.get("opacity") and u.get('opacity') == 0: 497 | return "" 498 | else: 499 | if u.get("x1") or (u.get('rotate') != 0 and u.get('opacity') != 1): 500 | message = '
' 502 | else: 503 | [s, h] = [str(x) for x in args] 504 | message = '

' 505 | 506 | return message 507 | 508 | 509 | def getStyleOfImage(self, t, e): 510 | """ 511 | :param t: 图片p属性 512 | :param e: 图片c属性 513 | :return: 514 | """ 515 | def parseFloat(string): 516 | """ 517 | :param string:待处理的字符串 518 | :return: 返回字符串中的首个有效float值,若字符首位为非数字,则返回nan 519 | """ 520 | if string is None: 521 | return math.nan 522 | elif isinstance(string, float): 523 | return string 524 | elif isinstance(string, int): 525 | return float(string) 526 | elif string[0] != ' ' and not str.isdigit(string[0]): 527 | return math.nan 528 | else: 529 | p = re.compile("\d+\.?\d*") 530 | all = p.findall(string) 531 | return float(all[0]) if len(all) != 0 else math.nan 532 | 533 | if t is None: 534 | return "" 535 | else: 536 | r, o, a, n = 0, 0, "", 0 537 | iw = e.get('iw') 538 | ih = e.get('ih') 539 | u = 1349.19 / 1262.85 540 | l = str(t.get('x') * u) + "px" 541 | c = str(t.get('y') * u) + "px" 542 | d = "" 543 | x = {} 544 | w = {"opacity": 1, "rotate": 1, "z": 1} 545 | for n in t: 546 | x[n] = t[n] * u if (self.isNumber(t[n]) and not w.get(n)) else t[n] 547 | 548 | if x.get('w') != iw or x.get('h') != ih: 549 | if x.get('x1'): 550 | a = self.P(x.get('x0'), x.get('y0'), x.get('x1'), x.get('y1'), x.get('x2'), x.get('y2')) 551 | r = parseFloat(parseFloat(a[0])/iw if len(a) else x.get('w') / iw) 552 | o = parseFloat(parseFloat(a[1])/ih if len(a) else x.get('h') / ih) 553 | 554 | m, v = iw * (r-1), ih * (o-1) 555 | c = str((x.get('y1') + x.get('y3')) / 2 - parseFloat(ih) / 2)+"px" if x.get('x1') else str(x.get('y') + v / 2) + "px" 556 | l = str((x.get('x1') + x.get('x3')) / 2 - parseFloat(iw) / 2)+"px" if x.get('x1') else str(x.get('x') + m / 2) + "px" 557 | d = "-webkit-" + "transform:scale(" + str(r) + "," + str(o) + ")" 558 | 559 | message = "z-index:" + str(x.get('z')) + ";" + "left:" + l + ";" + "top:" + c + ";" + "opacity:" + str(x.get('opacity') or 1) + ";" 560 | if x.get('x1'): 561 | message += self.O(x.get('rotate')) if x.get('rotate') > 0.01 else self.O(0, x.get('x1'), x.get('x2'), x.get('y0'), x.get('y1'), d) 562 | else: 563 | message += d + ";" 564 | 565 | return message 566 | 567 | 568 | def P(self,t, e, r, i, o, a): 569 | p = round(math.sqrt(math.pow(abs(t - r), 2) + math.pow(abs(e - i), 2)), 4) 570 | s = round(math.sqrt(math.pow(abs(r - o), 2) + math.pow(abs(i - a), 2)), 4) 571 | return [s, p] 572 | 573 | 574 | def O(self, t, *args): 575 | [e, r, i, o, a] = [0, 0, 0, 0, ""] if len(args) == 0 else [x for x in args] 576 | n = o > i 577 | p = e > r 578 | if n and p: 579 | a += " Matrix(1,0,0,-1,0,0)" 580 | elif n: 581 | a += " Matrix(1,0,0,-1,0,0)" 582 | elif p: 583 | a += " Matrix(-1,0,0,1,0,0)" 584 | elif t: 585 | a += " rotate(" + str(t) + "deg)" 586 | return a + ";" 587 | 588 | 589 | def convertHtmlToPdf(self): 590 | savepath = os.path.join(self.pdfsdirpath, self.wkinfo.get('title') + '.pdf') 591 | url = os.path.join('\pdffiles', self.wkinfo.get('title') + '.pdf') 592 | 593 | # 每个url的最大页数为50 594 | exactpages = int(self.wkinfo.get('totalPageNum')) 595 | if exactpages > 50: 596 | exactpages = 50 597 | options = {'disable-smart-shrinking':'', 598 | 'lowquality': '', 599 | 'image-quality': 60, 600 | 'page-height': str(1349.19*0.26458333), 601 | 'page-width': '291', 602 | 'margin-bottom': '0', 603 | 'margin-top': '0', 604 | } 605 | pdfkit.from_file(os.path.join(self.htmlsdirpath, self.htmlfile), savepath, options=options) 606 | return url 607 | 608 | def Run(self): 609 | if self.wkinfo.get('docType') == 'ppt': 610 | savePath = GetPpt(self.url, self.savepath).getPPT() 611 | htmlurl = "/htmlfiles/" + self.htmlfile 612 | flag = 1 613 | return savePath, flag , htmlurl 614 | 615 | if self.wkinfo.get('docType') == 'txt': 616 | savePath = GetTxt(self.url, self.savepath).getTXT() 617 | flag = 1 618 | htmlurl = "/htmlfiles/" + self.htmlfile 619 | return savePath, flag , htmlurl 620 | 621 | else: 622 | self.getJsonUrl() 623 | for epoch in range(int(int(self.wkinfo.get('totalPageNum'))/50)+1): 624 | self.startpage = epoch * 50 + 1 625 | if epoch == 0: 626 | self.creatHtml() 627 | 628 | start = time.time() 629 | print('-------------Start Add Css--------------') 630 | self.AddCss() 631 | print('-------------Css Add Finissed-----------') 632 | end = time.time() 633 | print("Add Css Cost: %ss" % str(end - start)) 634 | 635 | start = time.time() 636 | print('-------------Start Add Content----------') 637 | self.addMainContent() 638 | print('-------------Content Add Finished-------') 639 | end = time.time() 640 | print("Add MainContent Cost: %ss" % str(end - start)) 641 | 642 | start = time.time() 643 | print('-------------Start Convert--------------') 644 | url = self.convertHtmlToPdf() 645 | print('-------------Convert Finished-----------') 646 | end = time.time() 647 | print("Convert Cost: %ss" % str(end - start)) 648 | 649 | htmlurl = "/htmlfiles/" + self.htmlfile 650 | 651 | flag = 1 652 | return url, flag , htmlurl 653 | 654 | else: 655 | self.url = self.url[:self.url.find('&pn=')] + "&pn=" + str(self.startpage) 656 | print(self.url) 657 | self.getHtml() 658 | self.getJsonUrl() 659 | 660 | self.creatHtml() 661 | 662 | start = time.time() 663 | print('-------------Start Add Css--------------') 664 | self.AddCss() 665 | print('-------------Css Add Finissed-----------') 666 | end = time.time() 667 | print("Add Css Cost: %ss" % str(end - start)) 668 | 669 | start = time.time() 670 | print('-------------Start Add Content----------') 671 | self.addMainContent() 672 | print('-------------Content Add Finished-------') 673 | end = time.time() 674 | print("Add MainContent Cost: %ss" % str(end - start)) 675 | 676 | start = time.time() 677 | print('-------------Start Convert--------------') 678 | url = self.convertHtmlToPdf() 679 | print('-------------Convert Finished-----------') 680 | end = time.time() 681 | print("Convert Cost: %ss" % str(end - start)) 682 | 683 | htmlurl = "/htmlfiles/" + self.htmlfile 684 | 685 | flag = 1 686 | return url , flag , htmlurl 687 | 688 | app = Flask(__name__) 689 | app.secret_key = '520' 690 | 691 | # 捕获异常 692 | @app.errorhandler(500) 693 | def internal_server_error(e): 694 | return '当前文档无法下载' 695 | 696 | # 文件下载 697 | @app.route("//") 698 | def downloader(url,filename): 699 | dirpath = safe_join(app.root_path, url) # 下载文件目录路径 700 | return send_from_directory(dirpath, filename, as_attachment=True) # as_attachment=True 一定要写,不然会变成打开,而不是下载 701 | 702 | # 文件预览 703 | @app.route("/htmlfiles/") 704 | def htmlView(filename): 705 | dirpath = os.path.join(app.root_path, "htmlfiles") # 下载文件目录路径 706 | return send_from_directory(dirpath, filename) # 打开html文件 707 | 708 | 709 | 710 | # 获取提交后的文档链接,然后转换 711 | @app.route('/post', methods=['POST','GET']) 712 | def post(): 713 | 714 | if request.method=='POST': 715 | flag = 0 716 | # 如果连接输入为空,点击则直接刷新页面 717 | if request.form.get('url') == '': 718 | return redirect("post") 719 | url , flag , htmlurl = GetAll((request.form.get('url')),"").Run() 720 | # 直接用flash返回该文件地址,以及一个flag用作按钮显示 721 | flash(url)# 生成的pdf地址 722 | flash(flag)# 是否已生成文档,用来显示按钮 723 | flash(htmlurl)# html地址 724 | return redirect(url_for('post')) 725 | 726 | return render_template('post.html') 727 | 728 | if __name__ == '__main__': 729 | # 运行在本地5000端口 730 | app.run('127.0.0.1',5000) 731 | 732 | 733 | --------------------------------------------------------------------------------