├── requirements.txt
├── static
│   ├── loading.gif
│   └── css
│       └── style.css
├── README.md
├── templates
│   └── post.html
├── GetTxt.py
├── GetPpt.py
└── GetAll.py


/requirements.txt:
--------------------------------------------------------------------------------
1 | requests
2 | chardet
3 | bs4
4 | Pillow
5 | pdfkit
6 | flask
7 | imgkit
8 | img2pdf


--------------------------------------------------------------------------------
/static/loading.gif:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/ChangeWeDer/BaiduWenkuSpider_flaskWeb/HEAD/static/loading.gif


--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
 1 | # BaiduWenkuSpider_flaskWeb
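 2 | Download Baidu Wenku documents as PDFs in their original layout, served through a web server.
 3 | If you find it useful, feel free to give it a **🌟**.
 4 | (**The scraping approach used here may no longer work; the project is offered only as a Flask development reference.**)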
 5 | 
 6 | ## Preface
 7 | This project is a lightly modified version of
 8 | [https://github.com/M010K/BaiduWenkuSpider](https://github.com/M010K/BaiduWenkuSpider),
 9 | repackaged as a Python web project built on the Flask framework.
10 | It converts Baidu Wenku documents to PDF for download.
11 | 
12 | **[Blog post](https://www.upstudy.top/index.php/archives/21/)**
13 | 
14 | ## How to use
15 | #### 1. Download the project zip, or fetch it directly with git
16 | 
17 | **$ git clone https://github.com/ChangeWeDer/BaiduWenkuSpider_flaskWeb**
18 | 
19 | 
20 | #### 2. Install the dependencies
21 | The project depends on:
22 | 1. requests
23 | 2. chardet
24 | 3. bs4
25 | 4. Pillow
26 | 5. pdfkit
27 | 6. flask
28 | 7. imgkit
29 | 8. img2pdf
30 | 
31 | cd into the project folder and install everything in one step:
32 | **pip install -r requirements.txt**
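After installation, a quick check like the one below can report which modules still cannot be imported. This is only a sketch: the helper name `missing_modules` is hypothetical, and the list uses import names rather than package names (Pillow is imported as `PIL`).

```python
from importlib.util import find_spec

# Import names for the packages in requirements.txt (Pillow installs as PIL).
REQUIRED_MODULES = ["requests", "chardet", "bs4", "PIL", "pdfkit", "flask", "imgkit", "img2pdf"]


def missing_modules(names):
    """Return the top-level module names from `names` that cannot be found."""
    return [name for name in names if find_spec(name) is None]


if __name__ == "__main__":
    missing = missing_modules(REQUIRED_MODULES)
    print("missing:", ", ".join(missing) if missing else "none")
```

If anything is listed as missing, rerunning the pip command above inside the same Python environment is the first thing to try.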
33 | 
34 | #### 3. Install the wkhtmltopdf tool
35 | [Official download page](https://wkhtmltopdf.org/downloads.html)
36 | 
37 | After downloading, configure the environment variables
38 | for your operating system.
39 | 
40 | **Windows:**
41 | ![Screenshot](https://img-blog.csdnimg.cn/20200421234401464.png?x-oss-process=image/watermark,type_ZmFuZ3poZW5naGVpdGk,shadow_10,text_aHR0cHM6Ly9ibG9nLmNzZG4ubmV0L3dlaXhpbl80Mzg3ODMzMg==,size_16,color_FFFFFF,t_70)
42 | 
43 | **CentOS:**
44 | 
45 | [https://blog.csdn.net/LookingTomorrow/article/details/93513457](https://blog.csdn.net/LookingTomorrow/article/details/93513457)
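To confirm that the environment variables are set up, one option is to check whether the executables are visible on `PATH` from Python. The sketch below assumes the standard executable names `wkhtmltopdf` (used by pdfkit) and `wkhtmltoimage` (used by imgkit); the helper name `find_tools` is hypothetical.

```python
import shutil


def find_tools(names=("wkhtmltopdf", "wkhtmltoimage")):
    """Map each executable name to its full path, or None if it is not on PATH."""
    return {name: shutil.which(name) for name in names}


if __name__ == "__main__":
    for name, path in find_tools().items():
        print(f"{name}: {path or 'not found on PATH'}")
```

A `not found on PATH` result usually means the install directory still has to be added to the environment variables, or the terminal has to be restarted to pick up the change.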
46 | 
47 | #### 4. Run GetAll.py directly, then open http://127.0.0.1:5000/post (when running on a server, open IP:5000/post)
48 | 
49 | ![Screenshot](https://img-blog.csdnimg.cn/20200421234635967.png?x-oss-process=image/watermark,type_ZmFuZ3poZW5naGVpdGk,shadow_10,text_aHR0cHM6Ly9ibG9nLmNzZG4ubmV0L3dlaXhpbl80Mzg3ODMzMg==,size_16,color_FFFFFF,t_70)
50 | 
51 | Note: documents in ppt format cannot be previewed.
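The page submits an ordinary HTML form: a POST to `/post` with a single field named `url` (see `templates/post.html`). A minimal sketch of the same request built from Python is shown below. The helper name `build_post_request` and the example link are placeholders, and the request is only constructed, not sent.

```python
from urllib.parse import urlencode
from urllib.request import Request


def build_post_request(doc_url, base="http://127.0.0.1:5000"):
    """Build (but do not send) the form POST that the page's submit button issues."""
    data = urlencode({"url": doc_url}).encode("utf-8")
    return Request(base + "/post", data=data, method="POST")


if __name__ == "__main__":
    # Placeholder link; substitute a real document URL.
    req = build_post_request("https://wenku.baidu.com/view/EXAMPLE_ID.html")
    print(req.get_method(), req.full_url, req.data)
```

With the server running, passing the request to `urllib.request.urlopen` would submit it; the response is the rendered page rather than the PDF itself.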
52 | #### 5. GitHub source code
53 | [https://github.com/ChangeWeDer/BaiduWenkuSpider_flaskWeb](https://github.com/ChangeWeDer/BaiduWenkuSpider_flaskWeb)
54 | 


--------------------------------------------------------------------------------
/static/css/style.css:
--------------------------------------------------------------------------------
 1 | .container {
 2 |     width: 500px;
 3 |     height: 50px;
 4 |     margin: 100px auto;
 5 | }
 6 | 
 7 | .parent {
 8 |     width: 100%;
 9 |     height: 42px;
10 |     top: 4px;
11 |     position: relative;
12 | }
13 | 
14 | .parent>input:first-of-type {
15 |     /* Input height is set to 40px; the border takes up 2px, giving a total height of 42px */
16 |     width: 380px;
17 |     height: 40px; 
18 |     border: 1px solid #ccc;
19 |     font-size: 16px;
20 |     outline: none;
21 | }
22 | 
23 | .parent>input:first-of-type:focus {
24 |     border: 1px solid #317ef3;
25 |     padding-left: 10px;
26 | }
27 | 
28 | .parent>input:last-of-type {
29 |     /* The button's border does not add to its outer size, so the height is set explicitly */
30 |     width: 100px;
31 |     height: 44px; 
32 |     position: absolute;
33 |     background: #317ef3;
34 |     border: 1px solid #317ef3;
35 |     color: #fff;
36 |     font-size: 16px;
37 |     outline: none;
38 | }
39 | 
40 | .a_demo_two {
41 |     background-color:#317ef3;
42 |     padding:10px;
43 |     position:relative;
44 |     font-family: 'Open Sans', sans-serif;
45 |     font-size:12px;
46 |     text-decoration:none;
47 |     color:#fff;
48 |     background-image: linear-gradient(to top, rgb(100,170,30) 0%, rgb(129,212,51) 100%);
49 |     box-shadow: inset 0px 1px 0px #b2f17f, 0px 6px 0px #3d6f0d;
50 |     border-radius: 5px;
51 | }
52 |   
53 | .a_demo_two:active {
54 |     top:7px;
55 |     background-image: linear-gradient(to top, rgb(100,170,30) 100%, rgb(129,212,51) 0%);
56 |     box-shadow: inset 0px 1px 0px #b2f17f, inset 0px -1px 0px #3d6f0d;
57 |     color: #156785;
58 |     text-shadow: 0px 1px 1px rgba(255,255,255,0.3);
59 |     background: rgb(44,160,202);
60 | }
61 | 
62 | .a_demo_two::before {
63 |     background-color:#072239;
64 |     content:"";
65 |     display:block;
66 |     position:absolute;
67 |     width:100%;
68 |     height:100%;
69 |     padding-left:2px;
70 |     padding-right:2px;
71 |     padding-bottom:4px;
72 |     left:-2px;
73 |     top:5px;
74 |     z-index:-1;
75 |     border-radius: 6px;
76 |     box-shadow: 0px 1px 0px #fff;
77 | }
78 |   
79 | .a_demo_two:active::before {
80 |     top:-2px;
81 | }
82 | 
83 | 
84 | 


--------------------------------------------------------------------------------
/templates/post.html:
--------------------------------------------------------------------------------
 1 | <!DOCTYPE html>
 2 | <html lang="en">
 3 | 
 4 | <head>
 5 |    <meta http-equiv="Content-Type" content="text/html; charset=utf-8" />
 6 |    <link rel="stylesheet" href="../static/css/style.css">
 7 |    <title>Baidu Wenku Download</title>
 8 |    <style>
 9 |       body {
10 |          background-color: #E8E8E8;
11 |       }
12 | 
13 |       .shadow {
14 |          width: 100%;
15 |          height: 100%;
16 |          position: absolute;
17 |          left: 0;
18 |          top: 0;
19 |          z-index: 998;
20 |          background-color: #000;
21 |          opacity: 0.6;
22 |          display: none;
23 |       }
24 |    </style>
25 | </head>
26 | 
27 | <body>
28 | 
29 |    <div style="text-align: center; margin-top: 50px;">
30 |       <h1>For learning and exchange only</h1>
31 | 
32 |    <div class="container">
33 |       <form action="/post" class="parent" method="post">
34 |          <input type="text" name="url" placeholder="   Enter a Wenku link">
35 |          <input type="submit" value="Download" onclick="fn()">
36 |       </form>
37 |    </div>
38 |    
39 | {% if get_flashed_messages()[1] %}
40 |    <div class="container">
41 |       <a href="{{ get_flashed_messages()[2] }}" class="a_demo_two">
42 |          Preview (ppt not supported)
43 |       </a>
44 |       &nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;
45 |       <a href="{{ get_flashed_messages()[0] }}" class="a_demo_two">
46 |          Skip the talk, download
47 |       </a>
48 |    </div>
49 | 
50 |    {% endif %}
51 |    <div class="shadow" id="shadow"></div>
52 |    <img src="../static/loading.gif" alt="" id="img" style="
53 | display: none;
54 | position: absolute;
55 | 	top: 50%;
56 | 	left: 50%;
57 | 	transform: translate(-50%,-50%);
58 | 	-ms-transform: translate(-50%,-50%);
59 | 	-o-transform: translate(-50%,-50%);
60 | 	-moz-transform: translate(-50%,-50%);
61 | 	-webkit-transform: translate(-50%,-50%);
62 |    -khtml-transform: translate(-50%,-50%);
63 |    z-index:999;" />
64 | 
65 | 
66 | </div>
67 | <div style="text-align: center; margin-top: 50px;">
68 |    <p><a href="https://www.upstudy.top">Blog</a></p>
69 | </div>
70 | <div style="text-align: center; margin-top: 50px;">
71 |    <p><a href="https://github.com/ChangeWeDer/BaiduWenkuSpider_flaskWeb">GitHub project</a></p>
72 | </div>
73 | </body>
74 | <script>
75 | 
76 | 
77 |    function fn() {
78 |       document.getElementById('img').style.display = "block";
79 |       document.getElementById('shadow').style.display = "block";
80 |    }
81 | 
82 | </script>
83 | 
84 | </html>
85 | 


--------------------------------------------------------------------------------
/GetTxt.py:
--------------------------------------------------------------------------------
  1 | from requests import get
  2 | from requests.exceptions import ReadTimeout
  3 | from chardet import detect
  4 | from bs4 import BeautifulSoup
  5 | from os import getcwd,mkdir
  6 | from os.path import join,exists
  7 | from re import findall
  8 | from json import loads
  9 | from time import time
 10 | 
 11 | 
 12 | class GetTxt:
 13 |     def __init__(self, url, savepath):
 14 |         self.url = url
 15 |         self.savepath = savepath if savepath != '' else getcwd()
 16 |         self.txtsavepath = self.makeDirForTxtSave()
 17 |         self.html = ''
 18 |         self.wkinfo = {}  # basic document info: title, docType, docId
 19 |         self.txturls = []
 20 | 
 21 |         self.getHtml()
 22 |         self.getWkInfo()
 23 | 
 24 | 
 25 |     # Create the folder where txt files are saved
 26 |     def makeDirForTxtSave(self):
 27 |         if not exists(join(self.savepath,'txtfiles')):
 28 |             mkdir(join(self.savepath,'txtfiles'))
 29 |         return join(self.savepath,'txtfiles')
 30 | 
 31 |     # Fetch the page source
 32 |     def getHtml(self):
 33 |         try:
 34 |             header = {'User-Agent': 'Mozilla/5.0 '
 35 |                                     '(Macintosh; Intel Mac OS X 10_14_6) '
 36 |                                     'AppleWebKit/537.36 (KHTML, like Gecko) '
 37 |                                     'Chrome/78.0.3904.108 Safari/537.36'}
 38 |             response = get(self.url, headers=header)
 39 |             self.transfromEncoding(response)
 40 |             self.html = BeautifulSoup(response.text, 'html.parser')  # parse the HTML
 41 | 
 42 |         except ReadTimeout as e:
 43 |             print(e)
 44 |             return None
 45 | 
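 46 |     # Set the response encoding to match the page
 47 |     def transfromEncoding(self, html):
 48 |         # Detect the encoding of the HTML content and apply it to the response
 49 |         html.encoding = detect(html.content).get("encoding")
 50 | 
 51 |     # Extract basic document info: title, type, document ID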
 52 |     def getWkInfo(self):
 53 |         items = ["'title'", "'docType'", "'docId'", "'totalPageNum"]
 54 |         for item in items:
 55 |             ls = findall(item + ".*'", str(self.html))
 56 |             if len(ls) != 0:
 57 |                 message = ls[0].split(':')
 58 |                 self.wkinfo[eval(message[0])] = eval(message[1])
 59 | 
 60 |     # Fetch the JSON string
 61 |     def getJson(self, url):
 62 |         """
 63 |         :param url: URL of the page that serves the JSON
 64 |         :return: the JSON string
 65 |         """
 66 |         response = get(url)
 67 |         # Keep only the JSON between the first '(' and the last ')'
 68 |         jsonstr = response.text[response.text.find('(') + 1:response.text.rfind(')')]
 69 |         return jsonstr
 70 | 
 71 |     # Convert a JSON string to a dict
 72 |     def convertJsonToDict(self, jsonstr):
 73 |         """
 74 |         :param jsonstr: JSON string
 75 |         :return: the Python dict parsed from the JSON string
 76 |         """
 77 |         textdict = loads(jsonstr)  # parse the JSON string into a Python dict
 78 |         return textdict
 79 | 
 80 |     # Build the URL of the JSON file that contains the txt content
 81 |     def getTxtUrlForTXT(self):
 82 |         timestamp = round(time() * 1000)  # millisecond timestamp
 83 |         # Build the request URL that returns the parameters for the JSON file's URL
 84 |         messageurlprefix = "https://wenku.baidu.com/api/doc/getdocinfo?" \
 85 |                            "callback=cb&doc_id="
 86 |         messageurlsuffix = self.wkinfo.get("docId") + "&t=" + \
 87 |                            str(timestamp) + "&_=" + str(timestamp + 1)
 88 | 
 89 |         textdict = self.convertJsonToDict(
 90 |             self.getJson(messageurlprefix + messageurlsuffix))
 91 | 
 92 |         # Assemble the URL of the JSON file from those parameters
 93 |         self.txturls.append("https://wkretype.bdimg.com/retype/text/" +
 94 |                             self.wkinfo.get('docId') +
 95 |                             textdict.get('md5sum') +
 96 |                             "&callback=cb&pn=1&rn=" +
 97 |                             textdict.get("docInfo").get("totalPageNum") +
 98 |                             "&rsign=" + textdict.get("rsign") + "&_=" +
 99 |                             str(timestamp))
100 | 
101 |     # Append the text content to the output file
102 |     def saveToTxt(self, content):
103 |         savepath = join(self.txtsavepath,self.wkinfo.get('title') + '.txt')
104 |         with open(savepath, "a",encoding='utf-8') as f:
105 |             f.write(content)
106 | 
107 | 
108 |     def getTXT(self):
109 |         self.getTxtUrlForTXT()
110 |         for url in self.txturls:
111 |             textls = self.convertJsonToDict(self.getJson(url))
112 |             for text in textls:
113 |                 content = text.get("parags")[0].get("c")
114 |                 self.saveToTxt(content)
115 |         url = join('\\txtfiles',self.wkinfo.get('title') + '.txt')
116 |         return url
117 |         
118 |         
119 | 
120 | 
121 | if __name__ == '__main__':
122 |     url = input('请输入网址：')
123 |     GetTxt(url, '').getTXT()
124 | 
125 | 
126 | 
127 | 


--------------------------------------------------------------------------------
/GetPpt.py:
--------------------------------------------------------------------------------
  1 | from requests import get
  2 | from PIL import Image
  3 | from os import removedirs,remove,mkdir,getcwd
  4 | from os.path import join, exists
  5 | from requests.exceptions import ReadTimeout
  6 | from chardet import detect
  7 | from bs4 import BeautifulSoup
  8 | from re import findall
  9 | from json import loads
 10 | from time import time
 11 | 
 12 | 
 13 | class GetPpt:
 14 |     def __init__(self, url, savepath):
 15 |         self.url = url
 16 |         self.savepath = savepath if savepath != '' else getcwd()
 17 |         self.tempdirpath = self.makeDirForImageSave()
 18 |         self.pptsavepath = self.makeDirForPptSave()
 19 | 
 20 |         self.html = ''
 21 |         self.wkinfo ={}     # basic document info: title, docType, docId
 22 |         self.ppturls = []   # URLs of the ppt page images, in order
 23 | 
 24 |         self.getHtml()
 25 |         self.getWkInfo()
 26 | 
 27 | 
 28 |     # Fetch the page source
 29 |     def getHtml(self):
 30 |         try:
 31 |             header = {'User-Agent': 'Mozilla/5.0 '
 32 |                                     '(Macintosh; Intel Mac OS X 10_14_6) '
 33 |                                     'AppleWebKit/537.36 (KHTML, like Gecko) '
 34 |                                     'Chrome/78.0.3904.108 Safari/537.36'}
 35 |             response = get(self.url, headers = header)
 36 |             self.transfromEncoding(response)
 37 |             self.html = BeautifulSoup(response.text, 'html.parser')  # parse the HTML
 38 |         except ReadTimeout as e:
 39 |             print(e)
 40 |             return None
 41 | 
 42 | 
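 43 |     # Set the response encoding to match the page
 44 |     def transfromEncoding(self,html):
 45 |         html.encoding =  detect(html.content).get("encoding")   # detect the encoding of the HTML content and apply it
 46 | 
 47 | 
 48 |     # Extract basic document info: title, type, document ID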
 49 |     def getWkInfo(self):
 50 |         items = ["'title'","'docType'","'docId'","'totalPageNum"]
 51 |         for item in items:
 52 |             ls = findall(item+".*'", str(self.html))
 53 |             if len(ls) != 0:
 54 |                 message = ls[0].split(':')
 55 |                 self.wkinfo[eval(message[0])] = eval(message[1])
 56 | 
 57 | 
 58 |     # Fetch the JSON string
 59 |     def getJson(self, url):
 60 |         """
 61 |         :param url: URL of the page that serves the JSON
 62 |         :return: the JSON string
 63 |         """
 64 |         response = get(url)
 65 |         jsonstr = response.text[response.text.find('(')+1: response.text.rfind(')')]  # keep only the JSON inside the parentheses
 66 |         return jsonstr
 67 | 
 68 | 
 69 |     # Convert a JSON string to a dict
 70 |     def convertJsonToDict(self, jsonstr):
 71 |         """
 72 |         :param jsonstr: JSON string
 73 |         :return: the Python dict parsed from the JSON string
 74 |         """
 75 |         textdict = loads(jsonstr)  # parse the JSON string into a Python dict
 76 |         return textdict
 77 | 
 78 | 
 79 |     # Create a temporary folder for the downloaded images
 80 |     def makeDirForImageSave(self):
 81 |         if not exists(join(self.savepath,'tempimages')):
 82 |             mkdir(join(self.savepath,'tempimages'))
 83 |         return join(self.savepath,'tempimages')
 84 | 
 85 |     # Create the folder where the generated ppt files are saved
 86 |     def makeDirForPptSave(self):
 87 |         if not exists(join(self.savepath,'pptfiles')):
 88 |             mkdir(join(self.savepath,'pptfiles'))
 89 |         return join(self.savepath,'pptfiles')
 90 | 
 91 | 
 92 |     # Extract the ppt image URLs from the JSON response
 93 |     def getImageUrlForPPT(self):
 94 |         timestamp = round(time()*1000)  # millisecond timestamp
 95 |         desturl = "https://wenku.baidu.com/browse/getbcsurl?doc_id="+\
 96 |                   self.wkinfo.get("docId")+\
 97 |                   "&pn=1&rn=99999&type=ppt&callback=jQuery1101000870141751143283_"+\
 98 |                   str(timestamp) + "&_=" + str(timestamp+1)
 99 | 
100 | 
101 |         textdict = self.convertJsonToDict(self.getJson(desturl))
102 |         self.ppturls = [x.get('zoom') for x in textdict.get('list')]
103 | 
104 | 
105 |     # Download the image at the given URL into the temporary folder under the given name
106 |     def getImage(self, imagename, imageurl):
107 |         imagename = join(self.tempdirpath, imagename)
108 |         with open(imagename,'wb') as ig:
109 |             ig.write(get(imageurl).content)  # .content holds the raw bytes
110 | 
111 | 
112 |     # Merge the downloaded images into a single PDF file
113 |     def mergeImageToPDF(self, pages):
114 |         if pages == 0:
115 |             raise IOError
116 | 
117 | 
118 |         namelist = [join(self.tempdirpath, str(x)+'.png')  for x in range(pages)]
119 |         firstimg = Image.open(namelist[0])
120 |         imglist = []
121 |         for imgname in namelist[1:]:
122 |             img = Image.open(imgname)
123 |             img.load()
124 | 
125 |             if img.mode == 'RGBA':  # convert RGBA pngs to RGB, otherwise saving as PDF raises an exception
126 |                 img = img.convert('RGB')
127 |             imglist.append(img)
128 | 
129 |         savepath = join(self.pptsavepath, self.wkinfo.get('title')+'.pdf')
130 |         url = join('\\pptfiles', self.wkinfo.get('title')+'.pdf')
131 |         firstimg.save(savepath, "PDF", resolution=100.0,
132 |                       save_all=True, append_images=imglist)
133 |         return url
134 | 
135 |     # Delete the downloaded images
136 |     def removeImage(self,pages):
137 |         namelist = [join(self.tempdirpath, str(x)+'.png') for x in range(pages)]
138 |         for name in namelist:
139 |             if  exists(name):
140 |                 remove(name)
141 |         if exists(join(self.savepath,'tempimages')):
142 |             removedirs(join(self.savepath,'tempimages'))
143 | 
144 | 
145 |     def getPPT(self):
146 |         self.getImageUrlForPPT()
147 |         for page, url in enumerate(self.ppturls):
148 |             self.getImage(str(page)+'.png', url)
149 |         url = self.mergeImageToPDF(len(self.ppturls))
150 |         self.removeImage(len(self.ppturls))
151 |         return url
152 | 
153 | 
154 | if __name__ == '__main__':
155 |     url = input('请输入网址：')
156 |     GetPpt(url, '').getPPT()
157 | 
158 | 


--------------------------------------------------------------------------------
/GetAll.py:
--------------------------------------------------------------------------------
  1 | # -*- coding: utf-8 -*-
  2 | from flask import Flask,render_template, redirect, url_for, request, flash, get_flashed_messages , send_from_directory
  3 | from werkzeug.utils import safe_join  # flask.helpers.safe_join is gone from recent Flask releases
  4 | import requests
  5 | import os
  6 | from requests.exceptions import ReadTimeout
  7 | import chardet
  8 | from bs4 import BeautifulSoup
  9 | import re
 10 | import json
 11 | import math
 12 | import imgkit
 13 | from PIL import Image
 14 | import img2pdf
 15 | import pdfkit
 16 | from GetPpt import GetPpt
 17 | from GetTxt import GetTxt
 18 | import time
 19 | 
 20 | 
 21 | 
 22 | class GetAll:
 23 |     def __init__(self, url, savepath):
 24 |         """
 25 |         :param url: URL of the page containing the document to scrape
 26 |         :param savepath: directory where generated files are saved
 27 |         """
 28 |         self.url = url
 29 |         self.savepath = savepath if savepath != '' else os.getcwd()
 30 |         self.startpage = 1
 31 |         self.url = self.url + "&pn=1"
 32 |         self.html = ''
 33 |         self.wkinfo ={}     # basic document info: title, docType, docId, totalPageNum
 34 |         self.jsonurls = []
 35 |         self.pdfurls = []
 36 | 
 37 |         self.getHtml()
 38 |         self.getWkInfo()
 39 |         self.htmlsdirpath = self.makeDirForHtmlSave()
 40 |         self.pdfsdirpath = self.makeDirForPdfSave()
 41 |         self.htmlfile = self.wkinfo.get('title')+".html"
 42 | 
 43 | 
 44 |     # Create a temporary folder for the html files
 45 |     def makeDirForHtmlSave(self):
 46 |         if not os.path.exists(os.path.join(self.savepath,'htmlfiles')):
 47 |             os.mkdir(os.path.join(self.savepath,'htmlfiles'))
 48 |         return os.path.join(self.savepath, 'htmlfiles')
 49 | 
 50 | 
 51 |     def makeDirForPdfSave(self):
 52 |         if not os.path.exists(os.path.join(self.savepath, 'pdffiles')):
 53 |             os.mkdir(os.path.join(self.savepath,'pdffiles'))
 54 |         return os.path.join(self.savepath,'pdffiles')
 55 | 
 56 |     # Create the html document that holds the scraped content
 57 |     def creatHtml(self):
 58 |         with open(os.path.join(self.htmlsdirpath, self.htmlfile), "w",encoding='utf-8') as f:
 59 |             # Write the document head
 60 |             message = """
 61 |             <!DOCTYPE html>
 62 |             <html class="expanded screen-max">
 63 |                 <head>
 64 |                 <meta charset="utf-8">
 65 |                 <title>文库</title>"""
 66 |             f.write(message)
 67 | 
 68 | 
 69 |     def addMessageToHtml(self,message):
 70 |         """:param message: content to append to the html document"""
 71 |         with open(os.path.join(self.htmlsdirpath, self.htmlfile), "a",encoding='utf-8') as a:
 72 |             a.write(message)
 73 | 
 74 |     # Fetch the page source
 75 |     def getHtml(self):
 76 |         try:
 77 |             header = {'User-Agent': 'Mozilla/5.0 '
 78 |                                     '(Macintosh; Intel Mac OS X 10_14_6) '
 79 |                                     'AppleWebKit/537.36 (KHTML, like Gecko) '
 80 |                                     'Chrome/78.0.3904.108 Safari/537.36'}
 81 |             response = requests.get(self.url, headers = header)
 82 |             self.transfromEncoding(response)
 83 |             self.html = BeautifulSoup(response.text, 'html.parser')  # parse the html
 84 |         except ReadTimeout as e:
 85 |             print(e)
 86 | 
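 87 |     # Set the response encoding to match the page
 88 |     def transfromEncoding(self, html):
 89 |         html.encoding = chardet.detect(html.content).get("encoding")   # detect the encoding of the html content and apply it
 90 | 
 91 |     # Extract basic document info: title, type, document ID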
 92 |     def getWkInfo(self):
 93 |         items = ["'title'","'docType'","'docId'","'totalPageNum"]
 94 |         for item in items:
 95 |             ls = re.findall(item+".*'", str(self.html))
 96 |             if len(ls) != 0:
 97 |                 message = ls[0].split(':')
 98 |                 self.wkinfo[eval(message[0])] = eval(message[1])
 99 | 
100 | 
101 |     # Collect the URLs of the json files that hold the document data
102 |     def getJsonUrl(self):
103 |         urlinfo = re.findall("WkInfo.htmlUrls = '.*?;", str(self.html))
104 |         urls = re.findall("https:.*?}", urlinfo[0])
105 |         urls = [str(url).replace("\\", "").replace('x22}','') for url in urls ]
106 |         self.jsonurls = urls
107 | 
108 |     # Fetch the json string
109 |     def getJson(self, url):
110 |         """
111 |         :param url: URL of the page that serves the json
112 |         :return: the json string
113 |         """
114 |         response = requests.get(url)
115 |         jsonstr = response.text[response.text.find('(')+1: response.text.rfind(')')]  # keep only the json inside the parentheses
116 |         return jsonstr
117 | 
118 |     # Convert a json string to a dict
119 |     def convertJsonToDict(self, jsonstr):
120 |         """
121 |         :param jsonstr: json string
122 |         :return: the python dict parsed from the json string
123 |         """
124 |         textdict = json.loads(jsonstr)  # parse the json string into a python dict
125 |         return textdict
126 | 
127 |     # Check whether the document is in ppt format
128 |     def isPptStyle(self):
129 |         iswholepic = False
130 |         ispptlike = False
131 |         for url in self.jsonurls:
132 |             if "0.json" in url:
133 |                 textdict = self.convertJsonToDict(self.getJson(url))
134 |                 # If the json's style attribute is an empty string and its font attribute is None, the pdf consists entirely of images
135 |                 if textdict.get("style") == "" and textdict.get("font") is None:
136 |                     iswholepic = True
137 |                     break
138 |                 elif textdict.get('page').get('pptlike'):
139 |                     ispptlike = True
140 |                     break
141 |             break
142 | 
143 |         return iswholepic or ispptlike  # the flags are set in mutually exclusive branches, so "and" could never be True
144 | 
145 | 
146 |     # Match the URLs of the layout-related css files in the html
147 |     def getCssUrl(self):
148 |         pattern =  re.compile(r'<link href="//.*?\.css')
149 |         allmessage =  pattern.findall(str(self.html))
150 |         allcss = [x.replace('<link href="', "https:") for x in allmessage]
151 |         return allcss
152 | 
153 | 
154 |     def getPageTag(self):
155 |         """:return: all tags that carry a data-page-no attribute, i.e. the first tag of every page"""
156 |         def attributeFilter(tag):
157 |             return tag.has_attr('data-page-no')
158 |         return self.html.find_all(attributeFilter)
159 | 
160 | 
161 |     def getDocIdUpdate(self):
162 |         """:return: the doc_id_update string"""
163 |         pattern = re.compile('doc_id_update:".*?"')
164 |         for i in pattern.findall(str(self.html)):
165 |             return i.split('"')[1]
166 | 
167 | 
168 |     def getAllReaderRenderStyle(self):
169 |         """:return: the full content of <style id="reader-render-style">"""
170 |         page = 1
171 |         style = '<style id='+'"reader-render-style">\n'
172 |         for url in self.jsonurls:
173 |             if "json" in url:
174 |                 textdict = self.convertJsonToDict(self.getJson(url))
175 |                 style += self.getReaderRenderStyle(textdict.get('style'), textdict.get('font'), textdict.get('font'), page)
176 |                 page += 1
177 |             else:
178 |                 break
179 |         style += "</style>\n"
180 | 
181 |         return style
182 | 
183 | 
184 |     def getReaderRenderStyle(self, allstyle, font, r, page):
185 |         """
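186 |         :param allstyle: the style content of the json data
187 |         :param font: the font content of the json data
188 |         :param r: TODO: purpose unknown; for now it takes the same value as e (here, font)
189 |         :param page: current page number
190 |         :return: the <style id="reader-render-style"> rules for this page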
191 |         """
192 |         p, stylecontent = "", []
193 |         for index in range(len(allstyle)):
194 |             style = allstyle[index]
195 |             if style.get('s'):
196 |                 p = self.getPartReaderRenderStyle(style.get('s'), font, r).strip(" ")
197 |             l = "reader-word-s" + str(page) + "-"
198 |             p and stylecontent.append("." + l + (",." + l).join([str(x) for x in style.get('c')]) + "{ " + p + "}")
199 |             if style.get('s').get("font-family"):
200 |                 pass
201 |         stylecontent.append("#pageNo-" + str(page) + " .reader-parent{visibility:visible;}")
202 |         return "".join(stylecontent)
203 | 
204 | 
205 |     def getPartReaderRenderStyle(self, s, font, r):
206 |         """
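207 |         :param s: the s attribute under the json style
208 |         :param font: the json font attribute
209 |         :param r: fontMapping TODO: for now it takes the same value as e (here, font)
210 |         :return: part of the <style id="reader-render-style"> string
211 |         """
212 |         content = []
213 |         n, p = 10, 1349.19 / 1262.85  # n is the multiplier, p the scale factor derived from the page-width ratio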
214 | 
215 |         def fontsize(f):
216 |             content.append("font-size:" + str(math.floor(float(f) * n * p)) + "px;")
217 | 
218 |         def letterspacing(l):
219 |             content.append("letter-spacing:" + str(float(l) * n) + "px;")
220 | 
221 |         def bold(b):
222 |             "false" == b or content.append("font-weight:600;")
223 | 
224 |         def fontfamily(o):
225 |             n = font.get(o) or o if font else o
226 |             content.append("font-family:'" + n + "','" + o + "','" + ((r or {}).get(n) or n) + "';")
227 | 
228 |         for attribute in s:
229 |             if attribute == "font-size":
230 |                 fontsize(s[attribute])
231 |             elif attribute == "letter-spacing":
232 |                 letterspacing(s[attribute])
233 |             elif attribute == "bold":
234 |                 bold(s[attribute])
235 |             elif attribute == "font-family":
236 |                 fontfamily(s[attribute])
237 |             else:
238 |                 content.append(attribute + ":" + s[attribute] + ";")
239 |         return "".join(content)
240 | 
241 | 
242 |     # Add the css to the html
243 |     def AddCss(self):
244 |         urls = self.getCssUrl()
245 |         urls = [url  for url in urls if "htmlReader" in url or "core" in url or "main" in url or "base" in url]
246 |         for url in urls:
247 |             message = '<style type="text/css">'+requests.get(url).text+"</style>"
248 |             self.addMessageToHtml(message)
249 | 
250 |         content = self.getAllReaderRenderStyle()  # css that controls the text styling
251 |         self.addMessageToHtml(content)
252 | 
253 | 
254 |     def addMainContent(self):
255 |         """
256 |         Append the <body> section, with the content of each page, to the html document.
257 |         :return: None
258 |         """
259 | 
260 |         self.addMessageToHtml("\n\n\n<body>\n")
261 |         docidupdate = self.getDocIdUpdate()
262 |         
263 |         # Split the URLs into json and png
264 |         jsonurl = [x for x in self.jsonurls if "json" in x]
265 |         pngurl = [x for x in self.jsonurls if "png" in x]
266 | 
267 |         tags = self.getPageTag()
268 |         for page, tag in enumerate(tags):
269 |             if page > 50:
270 |                 break
271 |             tag['style'] = "height: 1349.19px;"
272 |             tag['id'] = "pageNo-" + str(page+1)
273 |             self.addMessageToHtml(str(tag).replace('</div>', ''))
274 |             diu = self.getDocIdUpdate()
275 |             n = "-webkit-transform:scale(1.00);-webkit-transform-origin:left top;"
276 |             textdict = self.convertJsonToDict(self.getJson(jsonurl[page]))
277 | 
278 |             # Handle the case where there are fewer image URLs than json URLs
279 |             if page < len(pngurl):
280 |                 maincontent = self.creatMainContent(textdict.get('body'), textdict.get('page'), textdict.get('font'), page + 1, docidupdate,
281 |                                                     pngurl[page])
282 |             else:
283 |                 maincontent = self.creatMainContent(textdict.get('body'), textdict.get('page'), textdict.get('font'), page + 1, docidupdate, "")
284 |             content = "".join([
285 |                 '<div class="reader-parent-' + diu + " reader-parent " + '" style="position:relative;top:0;left:0;' + n + '">',
286 |                 '<div class="reader-wrap' + diu + '" style="position:absolute;top:0;left:0;width:100%;height:100%;">',
287 |                 '<div class="reader-main-' + diu + '" style="position:relative;top:0;left:0;width:100%;height:100%;">', maincontent,
288 |                 "</div>", "</div>", "</div>", "</div>"])
289 | 
290 |             self.addMessageToHtml(content)
291 |             print("Finished writing page %s, progress so far: %f" % (str(page+self.startpage), 100*(page+self.startpage)/int(self.wkinfo.get('totalPageNum'))) + '%')
292 | 
293 |         self.addMessageToHtml("\n\n\n</body>\n</html>")
294 | 
295 | 
296 |     def isNumber(self, obj):
297 |         """
298 |         :param obj: any object
299 |         :return: whether obj is a number
300 |         """
301 |         return isinstance(obj, int) or isinstance(obj, float)
302 | 
303 | 
304 |     def creatMainContent(self, body, page, font, currentpage, o, pngurl):
305 |         """
306 |         :param body: body attribute
307 |         :param page: page attribute
308 |         :param font: font attribute
309 |         :param currentpage: current page number
310 |         :param o: doc_id_update
311 |         :param pngurl: URL of the page image
312 |         :return: html string with the page's text and images
313 |         """
314 |         content, p, s, h = 0, 0, 0, 0
315 |         main = []
316 |         l = 2
317 |         c = page.get('v')
318 | 
319 |         d = font   # d was originally fontMapping
320 |         y = {
321 |                 "pic": '<div class="reader-pic-layer" style="z-index:__NUM__"><div class="ie-fix">',
322 |                 "word": '<div class="reader-txt-layer" style="z-index:__NUM__"><div class="ie-fix">'
323 |             }
324 |         g = "</div></div>"
325 |         MAX1 , MAX2 = 0, 0
326 |         body = sorted(body, key=lambda k: k.get('p').get('z'))
327 |         for index in range(len(body)):
328 |             content = body[index]
329 |             if "pic" == content.get('t'):
330 |                 MAX1 = max(MAX1, content.get('c').get('ih') + content.get('c').get('iy') + 5)
331 |                 MAX2 = max(MAX2, content.get('c').get('iw'))
332 |         for index in range(len(body)):
333 |             content = body[index]
334 |             s = content.get('t')
335 |             if not p:
336 |                 p = h = s
337 |             if p == s:
338 |                 if content.get('t') == "word":
339 |                     # function m must accept variadic arguments
340 |                     main.append(self.creatTagOfWord(content, currentpage, font, d, c))
341 |                 elif content.get('t') == 'pic':
342 |                     main.append(self.creatTagOfImage(content, pngurl, MAX1, MAX2))
343 |             else:
344 |                 main.append(g)
345 |                 main.append(y.get(s).replace('__NUM__', str(l)))
346 |                 l += 1
347 |                 if content.get('t') == "word":
348 |                     # function m must accept variadic arguments
349 |                     main.append(self.creatTagOfWord(content, currentpage, font, d, c))
350 |                 elif content.get('t') == 'pic':
351 |                     main.append(self.creatTagOfImage(content, pngurl, MAX1, MAX2))
352 |                 p = s
353 |         return y.get(h).replace('__NUM__', "1") + "".join(main) + g
354 | 
355 | 
356 |     def creatTagOfWord(self, t, currentpage, font, o, version, *args):
357 |         """
358 |         :param t: one entry of body
359 |         :param currentpage: current page number
360 |         :param font: the font attribute
361 |         :param o: the font attribute
362 |         :param version: the version attribute of page
363 |         :param args:
364 |         :return: <p> tag holding the text content
365 |         """
366 |         p = t.get('p')
367 |         ps = t.get('ps')
368 |         s = t.get('s')
369 |         z = ['<b style="font-family:simsun;">&nbsp;</b>', "\n"]
370 |         k, N = 10, 1349.19 / 1262.85
371 |         # T = self.j
372 |         U = self.O(ps)
373 |         w, h, y, x, D= p.get('w'), p.get('h'), p.get('y'), p.get('x'), p.get('z')
374 |         pattern = re.compile(r"[\s\xa0]| [\xa0\s]$")
375 |         final = []
376 | 
377 |         if U and ps and ((ps.get('_opacity') and ps.get('_opacity') == 1) or (ps.get('_alpha') and ps.get('_alpha') == 0)):
378 |             return ""
379 |         else:
380 |             width = math.floor(w * k * N)
381 |             height = math.floor(h * k * N)
382 |             final.append("<p "+'class="'+"reader-word-layer" + self.processStyleOfR(t.get('r'), currentpage) + '" ' + 'style="' + "width:" +str(width) + "px;" + "height:" + str(height) + "px;" + "line-height:" + str(height) + "px;")
383 |             final.append("top:"+str(math.floor(y * k * N))+"px;"+"left:"+str(math.floor(x * k * N))+"px;"+"z-index:"+str(D)+";")
384 |             final.append(self.processStyleOfS(s, font, o, version))
385 |             final.append(self.processStyleOf_rotate(ps.get('_rotate'), w, h, x, y, k, N) if U and ps and self.isNumber(ps.get('_rotate')) else "")
386 |             final.append(self.processStyleOfOpacity(ps.get('_opacity')) if U and ps and ps.get('_opacity') else "")
387 |             final.append(self.processStyleOf_scaleX(ps.get('_scaleX'), width, height) if U and ps and ps.get('_scaleX') else "")
388 |             final.append("font-family:simsun;" if isinstance(t.get('c'), str) and len(t.get('c')) == 1 and pattern.match(t.get('c')) else "")
389 |             final.append('">')
390 |             final.append(t.get('c') if t.get('c') else "")
391 |             final.append(z[ps.get('_enter') or 1] if U and ps and self.isNumber(ps.get('_enter')) else "")
392 |             final.append("</p>")
393 | 
394 |             return "".join(final)
395 | 
396 | 
397 |     def processStyleOfS(self, t, font, r, version):
398 |         """
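399 |         :param t: the s attribute of the text
400 |         :param font: the font attribute
401 |         :param r: the font attribute
402 |         :param version: the version attribute of page
403 |         :return: the processed style string for the s attribute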
404 |         """
405 |         infoOfS = []
406 |         n = {"font-size": 1}
407 |         p , u = 10, 1349.19 / 1262.85
408 | 
409 |         def fontfamily(o):
410 |             n = font.get(o) or o if font else o
411 |             if abs(version) > 5:
412 |                 infoOfS.append("font-family:'"+ n + "','" + o + "','" + (r.get(n) and r[n] or n) + "';")
413 |             else:
414 |                 infoOfS.append("font-family:'" + o + "','" + n + "','" + (r.get(n) and r[n] or n) + "';")
415 | 
416 |         def bold(e):
417 |             "false" == e or infoOfS.append("font-weight:600;")
418 | 
419 |         def letter(e):
420 |             infoOfS.append("letter-spacing:" + str(float(e) * p) + "px;")
421 | 
422 |         if t is not None:
423 |             for attribute in t:
424 |                 if attribute == "font-family":
425 |                     fontfamily(t[attribute])
426 |                 elif attribute == "bold":
427 |                     bold(t[attribute])
428 |                 elif attribute == "letter-spacing":
429 |                     letter(t[attribute])
430 |                 else:
431 |                     infoOfS.append(attribute + ":" + (str(math.floor(((t[attribute] if self.isNumber(t[attribute]) else float(t[attribute])) * p * u))) + "px" if n.get(attribute) else str(t[attribute])) + ";")
432 | 
433 |         return "".join(infoOfS)
434 | 
435 | 
436 |     def processStyleOfR(self, r, page):
437 |         """
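438 |         :param r: the r attribute of the text
439 |         :param page: current page number
440 |         :return: class-name string built from r, or "" if r is empty or not a list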
441 |         """
442 |         l = " " + "reader-word-s" + str(page) + "-"
443 |         return "".join([l + str(x) for x in r]) if isinstance(r, list) and len(r) != 0 else ""
444 | 
445 | 
446 |     def processStyleOf_rotate(self, t, w, h, x, y, k, N):
447 |         """
448 |         :param t: the _rotate attribute
449 |         :param w: p.w in body
450 |         :param h: p.h in body
451 |         :param x: p.x in body
452 |         :param y: p.y in body
453 |         :param k: scale multiplier (10)
454 |         :param N: ratio coefficient
455 |         :return: the processed style string for the _rotate attribute
456 |         """
457 |         p = []
458 |         s = k * N
459 |         if t == 90:
460 |             p.append("left:" + str(math.floor(x + (w - h) / 2) * s) + "px;" + "top:" + str(math.floor(y - (h - w) / 2) * s) + "px;" + "text-align: right;" + "height:" + str(math.floor(h + 7) * s) + "px;")
461 |         elif t == 180:
462 |             p.append("left:" + str(math.floor(x - w) * s) + "px;" + "top:" + str(math.floor(y - h) * s) + "px;")
463 |         elif t == 270:
464 |             p.append("left:" + str(math.floor(x + (h - w) / 2) * s) + "px;" + "top:" + str(math.floor(y - (w - h) / 2) * s) + "px;")
465 | 
466 |         return "-webkit-"+"transform:rotate("+str(t)+"deg);"+"".join(p)
467 | 
468 | 
469 |     def processStyleOf_scaleX(self, t, width, height):
470 |         """
471 |         :param t:     the _scaleX attribute
472 |         :param width: computed page width
473 |         :param height: computed page height
474 |         :return: the processed style string for the _scaleX attribute
475 |         """
476 |         return "-webkit-" + "transform: scaleX(" + str(t) + ");" + "-webkit-" + "transform-origin:left top;width:" + str(width + math.floor(width / 2)) + "px;height:" + str(height + math.floor(height / 2)) + "px;"
477 | 
478 | 
479 |     def processStyleOfOpacity(self,t):
480 |         """
481 |         :param t: the opacity attribute
482 |         :return: the processed style string for the opacity attribute
483 |         """
484 |         t = t or 0
485 |         return "opacity:" + str(t) + ";"
486 | 
487 | 
488 |     def creatTagOfImage(self,t,url, *args):
489 |         """
490 |         :param t: dict describing the image
491 |         :param url: image URL
492 |         :param args:
493 |         :return: the image tag
494 |         """
495 |         u, l = t.get('p'), t.get('c')
496 |         if u.get("opacity") and u.get('opacity') == 0:
497 |             return ""
498 |         else:
499 |             if u.get("x1") or (u.get('rotate') != 0 and u.get('opacity') != 1):
500 |                 message = '<div class="reader-pic-item" style="' + "background-image: url(" + url + ");" + "background-position:" + str(-l.get('ix')) + "px " + str(-l.get('iy')) + "px;" \
501 |                           + "width:" + str(l.get('iw')) + "px;" + "height:" + str(l.get('ih')) + "px;" + self.getStyleOfImage(u, l) + 'position:absolute;overflow:hidden;"></div>'
502 |             else:
503 |                 [s, h] = [str(x) for x in args]
504 |                 message = '<p class="reader-pic-item" style="' + "width:" + str(l.get('iw')) + "px;" + "height:" + str(l.get('ih')) + "px;" + self.getStyleOfImage(u, l) + 'position:absolute;overflow:hidden;"><img width="' + str(h) + '" height="' + str(s) + '" style="position:absolute;top:-' + str(l.get('iy')) + "px;left:-" + str(l.get('ix')) + "px;clip:rect(" + str(l.get('iy')) + "px," + str(int(h) - l.get('ix')) + "px, " + str(s) + "px, " + str(l.get('ix')) + 'px);" src="' + url + '" alt=""></p>'
505 | 
506 |             return message
507 | 
508 | 
509 |     def getStyleOfImage(self, t, e):
510 |         """
511 |         :param t: the p attribute of the image
512 |         :param e: the c attribute of the image
513 |         :return: inline CSS string for the image
514 |         """
515 |         def parseFloat(string):
516 |             """
517 |             :param string: the string to parse
518 |             :return: the first valid float in the string; nan if the first character is not a digit
519 |             """
520 |             if string is None:
521 |                 return math.nan
522 |             elif isinstance(string, float):
523 |                 return string
524 |             elif isinstance(string, int):
525 |                 return float(string)
526 |             elif string[0] != ' ' and not str.isdigit(string[0]):
527 |                 return math.nan
528 |             else:
529 |                 p = re.compile(r"\d+\.?\d*")
530 |                 matches = p.findall(string)
531 |                 return float(matches[0]) if matches else math.nan
532 | 
533 |         if t is None:
534 |             return ""
535 |         else:
536 |             r, o, a, n = 0, 0, "", 0
537 |             iw = e.get('iw')
538 |             ih = e.get('ih')
539 |             u = 1349.19 / 1262.85
540 |             l = str(t.get('x') * u) + "px"
541 |             c = str(t.get('y') * u) + "px"
542 |             d = ""
543 |             x = {}
544 |             w = {"opacity": 1, "rotate": 1, "z": 1}
545 |             for n in t:
546 |                 x[n] = t[n] * u if (self.isNumber(t[n]) and not w.get(n)) else t[n]
547 | 
548 |             if x.get('w') != iw or x.get('h') != ih:
549 |                 if x.get('x1'):
550 |                     a = self.P(x.get('x0'), x.get('y0'), x.get('x1'), x.get('y1'), x.get('x2'), x.get('y2'))
551 |                 r = parseFloat(parseFloat(a[0])/iw if len(a) else x.get('w') / iw)
552 |                 o = parseFloat(parseFloat(a[1])/ih if len(a) else x.get('h') / ih)
553 | 
554 |                 m, v = iw * (r-1), ih * (o-1)
555 |                 c = str((x.get('y1') + x.get('y3')) / 2 - parseFloat(ih) / 2)+"px" if x.get('x1') else str(x.get('y') + v / 2) + "px"
556 |                 l = str((x.get('x1') + x.get('x3')) / 2 - parseFloat(iw) / 2)+"px" if x.get('x1') else str(x.get('x') + m / 2) + "px"
557 |                 d = "-webkit-" + "transform:scale(" + str(r) + "," + str(o) + ")"
558 | 
559 |             message = "z-index:" + str(x.get('z')) + ";" + "left:" + l + ";" + "top:" + c + ";" + "opacity:" + str(x.get('opacity') or 1) + ";"
560 |             if x.get('x1'):
561 |                 message += self.O(x.get('rotate')) if (x.get('rotate') or 0) > 0.01 else self.O(0, x.get('x1'), x.get('x2'), x.get('y0'), x.get('y1'), d)
562 |             else:
563 |                 message += d + ";"
564 | 
565 |             return message
566 | 
567 | 
568 |     def P(self,t, e, r, i, o, a):
569 |         p = round(math.sqrt(math.pow(abs(t - r), 2) + math.pow(abs(e - i), 2)), 4)
570 |         s = round(math.sqrt(math.pow(abs(r - o), 2) + math.pow(abs(i - a), 2)), 4)
571 |         return [s, p]
572 | 
573 | 
574 |     def O(self, t, *args):
575 |         [e, r, i, o, a] = [0, 0, 0, 0, ""] if len(args) == 0 else [x for x in args]
576 |         n = o > i
577 |         p = e > r
578 |         if n and p:
579 |             a += " Matrix(1,0,0,-1,0,0)"
580 |         elif n:
581 |             a += " Matrix(1,0,0,-1,0,0)"
582 |         elif p:
583 |             a += " Matrix(-1,0,0,1,0,0)"
584 |         elif t:
585 |             a += " rotate(" + str(t) + "deg)"
586 |         return a + ";"
587 | 
588 | 
589 |     def convertHtmlToPdf(self):
590 |         savepath = os.path.join(self.pdfsdirpath, self.wkinfo.get('title') + '.pdf')
591 |         url = '/pdffiles/' + self.wkinfo.get('title') + '.pdf'
592 | 
593 |         # each URL covers at most 50 pages
594 |         exactpages = int(self.wkinfo.get('totalPageNum'))
595 |         if exactpages > 50:
596 |             exactpages = 50
597 |         options = {'disable-smart-shrinking':'',
598 |                    'lowquality': '',
599 |                    'image-quality': 60,
600 |                    'page-height': str(1349.19*0.26458333),
601 |                    'page-width': '291',
602 |                    'margin-bottom': '0',
603 |                    'margin-top': '0',
604 |                    }
605 |         pdfkit.from_file(os.path.join(self.htmlsdirpath, self.htmlfile), savepath, options=options)
606 |         return url
607 | 
608 |     def Run(self):
609 |         if self.wkinfo.get('docType') == 'ppt':
610 |             savePath = GetPpt(self.url, self.savepath).getPPT()
611 |             htmlurl = "/htmlfiles/" + self.htmlfile
612 |             flag = 1
613 |             return savePath, flag , htmlurl
614 | 
615 |         if self.wkinfo.get('docType') == 'txt':
616 |             savePath = GetTxt(self.url, self.savepath).getTXT()
617 |             flag = 1
618 |             htmlurl = "/htmlfiles/" + self.htmlfile
619 |             return savePath, flag , htmlurl
620 | 
621 |         else:
622 |             self.getJsonUrl()
623 |             for epoch in range(int(int(self.wkinfo.get('totalPageNum'))/50)+1):
624 |                 self.startpage = epoch * 50 + 1
625 |                 if epoch == 0:
626 |                     self.creatHtml()
627 | 
628 |                     start = time.time()
629 |                     print('-------------Start Add Css--------------')
630 |                     self.AddCss()
631 |                     print('-------------Css Add Finished-----------')
632 |                     end = time.time()
633 |                     print("Add Css Cost: %ss" % str(end - start))
634 | 
635 |                     start = time.time()
636 |                     print('-------------Start Add Content----------')
637 |                     self.addMainContent()
638 |                     print('-------------Content Add Finished-------')
639 |                     end = time.time()
640 |                     print("Add MainContent Cost: %ss" % str(end - start))
641 | 
642 |                     start = time.time()
643 |                     print('-------------Start Convert--------------')
644 |                     url = self.convertHtmlToPdf()
645 |                     print('-------------Convert Finished-----------')
646 |                     end = time.time()
647 |                     print("Convert Cost: %ss" % str(end - start))
648 | 
649 |                     htmlurl = "/htmlfiles/" + self.htmlfile
650 |               
651 |                     flag = 1
652 |                     return url, flag , htmlurl
653 | 
654 |                 else:
655 |                     self.url = self.url[:self.url.find('&pn=')] + "&pn=" + str(self.startpage)
656 |                     print(self.url)
657 |                     self.getHtml()
658 |                     self.getJsonUrl()
659 | 
660 |                     self.creatHtml()
661 | 
662 |                     start = time.time()
663 |                     print('-------------Start Add Css--------------')
664 |                     self.AddCss()
665 |                     print('-------------Css Add Finished-----------')
666 |                     end = time.time()
667 |                     print("Add Css Cost: %ss" % str(end - start))
668 | 
669 |                     start = time.time()
670 |                     print('-------------Start Add Content----------')
671 |                     self.addMainContent()
672 |                     print('-------------Content Add Finished-------')
673 |                     end = time.time()
674 |                     print("Add MainContent Cost: %ss" % str(end - start))
675 | 
676 |                     start = time.time()
677 |                     print('-------------Start Convert--------------')
678 |                     url = self.convertHtmlToPdf()
679 |                     print('-------------Convert Finished-----------')
680 |                     end = time.time()
681 |                     print("Convert Cost: %ss" % str(end - start))
682 | 
683 |                     htmlurl = "/htmlfiles/" + self.htmlfile
684 |           
685 |                     flag = 1
686 |                     return url , flag , htmlurl
687 | 
688 | app = Flask(__name__)
689 | app.secret_key = '520'
690 | 
691 | # Handle server errors
692 | @app.errorhandler(500)
693 | def internal_server_error(e):
694 |     return 'This document cannot be downloaded'
695 | 
696 | # File download
697 | @app.route("/<path:url>/<path:filename>")
698 | def downloader(url,filename):
699 |     dirpath = safe_join(app.root_path, url)  # directory containing the file to download
700 |     return send_from_directory(dirpath, filename, as_attachment=True)  # as_attachment=True is required; otherwise the browser opens the file instead of downloading it
701 | 
702 | # File preview
703 | @app.route("/htmlfiles/<path:filename>")
704 | def htmlView(filename):
705 |     dirpath = os.path.join(app.root_path, "htmlfiles")  # directory containing the generated HTML files
706 |     return send_from_directory(dirpath, filename)  # open the HTML file in the browser
707 | 
708 | 
709 | 
710 | # Receive the submitted document URL, then convert it
711 | @app.route('/post', methods=['POST','GET'])
712 | def post():
713 |     
714 |     if request.method=='POST':
715 |         flag = 0
716 |         # If the submitted URL is empty, just reload the page
717 |         if not request.form.get('url'):
718 |             return redirect(url_for('post'))
719 |         url , flag , htmlurl = GetAll((request.form.get('url')),"").Run()
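720 |         # Pass the results to the template via flash: the file URL plus a flag that controls button display
721 |         flash(url)  # URL of the generated PDF
722 |         flash(flag)  # whether the document has been generated; controls button display
723 |         flash(htmlurl)  # URL of the HTML preview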
724 |         return redirect(url_for('post'))
725 | 
726 |     return render_template('post.html')
727 | 
728 | if __name__ == '__main__':
729 |     # Run locally on port 5000
730 |     app.run('127.0.0.1',5000)
731 | 
732 | 
733 | 


--------------------------------------------------------------------------------