├── LICENSE
├── README.md
├── crawl_wqxt.py
├── imgautocompress.py
└── requirements.txt
/LICENSE:
--------------------------------------------------------------------------------
1 | GNU LESSER GENERAL PUBLIC LICENSE
2 | Version 3, 29 June 2007
3 |
4 | Copyright (C) 2007 Free Software Foundation, Inc.
5 | Everyone is permitted to copy and distribute verbatim copies
6 | of this license document, but changing it is not allowed.
7 |
8 |
9 | This version of the GNU Lesser General Public License incorporates
10 | the terms and conditions of version 3 of the GNU General Public
11 | License, supplemented by the additional permissions listed below.
12 |
13 | 0. Additional Definitions.
14 |
15 | As used herein, "this License" refers to version 3 of the GNU Lesser
16 | General Public License, and the "GNU GPL" refers to version 3 of the GNU
17 | General Public License.
18 |
19 | "The Library" refers to a covered work governed by this License,
20 | other than an Application or a Combined Work as defined below.
21 |
22 | An "Application" is any work that makes use of an interface provided
23 | by the Library, but which is not otherwise based on the Library.
24 | Defining a subclass of a class defined by the Library is deemed a mode
25 | of using an interface provided by the Library.
26 |
27 | A "Combined Work" is a work produced by combining or linking an
28 | Application with the Library. The particular version of the Library
29 | with which the Combined Work was made is also called the "Linked
30 | Version".
31 |
32 | The "Minimal Corresponding Source" for a Combined Work means the
33 | Corresponding Source for the Combined Work, excluding any source code
34 | for portions of the Combined Work that, considered in isolation, are
35 | based on the Application, and not on the Linked Version.
36 |
37 | The "Corresponding Application Code" for a Combined Work means the
38 | object code and/or source code for the Application, including any data
39 | and utility programs needed for reproducing the Combined Work from the
40 | Application, but excluding the System Libraries of the Combined Work.
41 |
42 | 1. Exception to Section 3 of the GNU GPL.
43 |
44 | You may convey a covered work under sections 3 and 4 of this License
45 | without being bound by section 3 of the GNU GPL.
46 |
47 | 2. Conveying Modified Versions.
48 |
49 | If you modify a copy of the Library, and, in your modifications, a
50 | facility refers to a function or data to be supplied by an Application
51 | that uses the facility (other than as an argument passed when the
52 | facility is invoked), then you may convey a copy of the modified
53 | version:
54 |
55 | a) under this License, provided that you make a good faith effort to
56 | ensure that, in the event an Application does not supply the
57 | function or data, the facility still operates, and performs
58 | whatever part of its purpose remains meaningful, or
59 |
60 | b) under the GNU GPL, with none of the additional permissions of
61 | this License applicable to that copy.
62 |
63 | 3. Object Code Incorporating Material from Library Header Files.
64 |
65 | The object code form of an Application may incorporate material from
66 | a header file that is part of the Library. You may convey such object
67 | code under terms of your choice, provided that, if the incorporated
68 | material is not limited to numerical parameters, data structure
69 | layouts and accessors, or small macros, inline functions and templates
70 | (ten or fewer lines in length), you do both of the following:
71 |
72 | a) Give prominent notice with each copy of the object code that the
73 | Library is used in it and that the Library and its use are
74 | covered by this License.
75 |
76 | b) Accompany the object code with a copy of the GNU GPL and this license
77 | document.
78 |
79 | 4. Combined Works.
80 |
81 | You may convey a Combined Work under terms of your choice that,
82 | taken together, effectively do not restrict modification of the
83 | portions of the Library contained in the Combined Work and reverse
84 | engineering for debugging such modifications, if you also do each of
85 | the following:
86 |
87 | a) Give prominent notice with each copy of the Combined Work that
88 | the Library is used in it and that the Library and its use are
89 | covered by this License.
90 |
91 | b) Accompany the Combined Work with a copy of the GNU GPL and this license
92 | document.
93 |
94 | c) For a Combined Work that displays copyright notices during
95 | execution, include the copyright notice for the Library among
96 | these notices, as well as a reference directing the user to the
97 | copies of the GNU GPL and this license document.
98 |
99 | d) Do one of the following:
100 |
101 | 0) Convey the Minimal Corresponding Source under the terms of this
102 | License, and the Corresponding Application Code in a form
103 | suitable for, and under terms that permit, the user to
104 | recombine or relink the Application with a modified version of
105 | the Linked Version to produce a modified Combined Work, in the
106 | manner specified by section 6 of the GNU GPL for conveying
107 | Corresponding Source.
108 |
109 | 1) Use a suitable shared library mechanism for linking with the
110 | Library. A suitable mechanism is one that (a) uses at run time
111 | a copy of the Library already present on the user's computer
112 | system, and (b) will operate properly with a modified version
113 | of the Library that is interface-compatible with the Linked
114 | Version.
115 |
116 | e) Provide Installation Information, but only if you would otherwise
117 | be required to provide such information under section 6 of the
118 | GNU GPL, and only to the extent that such information is
119 | necessary to install and execute a modified version of the
120 | Combined Work produced by recombining or relinking the
121 | Application with a modified version of the Linked Version. (If
122 | you use option 4d0, the Installation Information must accompany
123 | the Minimal Corresponding Source and Corresponding Application
124 | Code. If you use option 4d1, you must provide the Installation
125 | Information in the manner specified by section 6 of the GNU GPL
126 | for conveying Corresponding Source.)
127 |
128 | 5. Combined Libraries.
129 |
130 | You may place library facilities that are a work based on the
131 | Library side by side in a single library together with other library
132 | facilities that are not Applications and are not covered by this
133 | License, and convey such a combined library under terms of your
134 | choice, if you do both of the following:
135 |
136 | a) Accompany the combined library with a copy of the same work based
137 | on the Library, uncombined with any other library facilities,
138 | conveyed under the terms of this License.
139 |
140 | b) Give prominent notice with the combined library that part of it
141 | is a work based on the Library, and explaining where to find the
142 | accompanying uncombined form of the same work.
143 |
144 | 6. Revised Versions of the GNU Lesser General Public License.
145 |
146 | The Free Software Foundation may publish revised and/or new versions
147 | of the GNU Lesser General Public License from time to time. Such new
148 | versions will be similar in spirit to the present version, but may
149 | differ in detail to address new problems or concerns.
150 |
151 | Each version is given a distinguishing version number. If the
152 | Library as you received it specifies that a certain numbered version
153 | of the GNU Lesser General Public License "or any later version"
154 | applies to it, you have the option of following the terms and
155 | conditions either of that published version or of any later version
156 | published by the Free Software Foundation. If the Library as you
157 | received it does not specify a version number of the GNU Lesser
158 | General Public License, you may choose any version of the GNU Lesser
159 | General Public License ever published by the Free Software Foundation.
160 |
161 | If the Library as you received it specifies that a proxy can decide
162 | whether future versions of the GNU Lesser General Public License shall
163 | apply, that proxy's public statement of acceptance of any version is
164 | permanent authorization for you to choose that version for the
165 | Library.
166 |
--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
1 | “文泉学堂”PDF下载
2 | ====================
3 |
4 | [文泉学堂](https://lib-nuanxin.wqxuetang.com/)
5 |
6 | 1. 安装 requirements.txt 里的依赖
7 | 2. 找到你要的书,看地址栏的数字为 id
8 | 3. 运行 `python3 crawl_wqxt.py `
9 |
10 | 服务器生成图片需要时间,可能出现 not loaded,会稍候重试。若一直出现 not loaded(第二遍还是),请尝试重新运行,已下载的图片不会重新下载。
11 |
12 | 若需要清理缓存,请删除 wqxt.db 或自行更改其内容(SQLite 数据库)。
13 |
14 | 若需要登录,请自行在 `crawl_wqxt.py` 的 HEADERS(36行)里加 Cookie 等内容。
15 |
16 | 请合理使用服务器资源。版权问题概不负责。
17 |
18 | imgautocompress.py 会对下载的图片判断是否为灰度、是否为黑白,并转成相应格式,减少图片大小。
19 |
20 | 要类似地减少其他扫描版 PDF 文件大小,可以使用 [pdfreduce](https://github.com/gumblex/pdfreduce)。要添加 OCR 层,可使用 [ocrmypdf](https://github.com/jbarlow83/OCRmyPDF)。
21 |
--------------------------------------------------------------------------------
/crawl_wqxt.py:
--------------------------------------------------------------------------------
1 | #!/usr/bin/env python3
2 | # -*- coding: utf-8 -*-
3 |
4 | import os
5 | import re
6 | import sys
7 | import time
8 | import json
9 | import sqlite3
10 | import logging
11 | import collections
12 |
13 | import jwt
14 | import img2pdf
15 | import imgautocompress
16 |
17 | try:
18 | from httpx import Client as Session
19 | except ImportError:
20 | from requests import Session
21 |
22 | WITH_PDFRW = True
23 |
24 | if WITH_PDFRW:
25 | try:
26 | from pdfrw import PdfDict, PdfName
27 | except ImportError:
28 | PdfDict = img2pdf.MyPdfDict
29 | PdfName = img2pdf.MyPdfName
30 | WITH_PDFRW = False
31 | else:
32 | PdfDict = img2pdf.MyPdfDict
33 | PdfName = img2pdf.MyPdfName
34 |
35 |
36 | HEADERS = {
37 | "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8",
38 | "Accept-Encoding": "gzip, deflate",
39 | "Accept-Language": "zh-CN,zh;q=0.8,en;q=0.6",
40 | "User-Agent": "Mozilla/5.0 (Windows NT 6.1; rv:60.0) Gecko/20100101 Firefox/60.0",
41 | }
42 |
43 | re_author = re.compile(r'《.+?》\s*(.+?)\s*【')
44 |
45 | logging.basicConfig(stream=sys.stderr, format='%(asctime)s [%(levelname)s] %(message)s', level=logging.INFO)
46 |
47 |
48 | class APIError(ValueError):
49 | pass
50 |
51 |
52 | class TryAgain(ValueError):
53 | pass
54 |
55 |
56 | def generate_pdf_outline(pdf, contents, parent=None):
57 | if parent is None:
58 | parent = PdfDict(indirect=True)
59 | if not contents:
60 | return parent
61 | first = prev = None
62 | for k, row in enumerate(contents):
63 | try:
64 | page = pdf.writer.pagearray[int(row['pnum'])-1]
65 | except IndexError:
66 | # bad bookmark
67 | continue
68 | bookmark = PdfDict(
69 | Parent=parent,
70 | Title=row['label'],
71 | A=PdfDict(
72 | D=[page, PdfName.Fit],
73 | S=PdfName.GoTo
74 | ),
75 | indirect=True
76 | )
77 | children = row.get('children')
78 | if children:
79 | bookmark = generate_pdf_outline(pdf, children, bookmark)
80 | if first:
81 | bookmark[PdfName.Prev] = prev
82 | prev[PdfName.Next] = bookmark
83 | else:
84 | first = bookmark
85 | prev = bookmark
86 | parent[PdfName.Count] = k + 1
87 | parent[PdfName.First] = first
88 | parent[PdfName.Last] = prev
89 | return parent
90 |
91 |
92 | def pdf_convert(*images, **kwargs):
93 |
94 | _default_kwargs = dict(
95 | title=None,
96 | author=None,
97 | creator=None,
98 | producer=None,
99 | creationdate=None,
100 | moddate=None,
101 | subject=None,
102 | keywords=None,
103 | colorspace=None,
104 | contents=None,
105 | nodate=False,
106 | layout_fun=img2pdf.default_layout_fun,
107 | viewer_panes=None,
108 | viewer_initial_page=None,
109 | viewer_magnification=None,
110 | viewer_page_layout=None,
111 | viewer_fit_window=False,
112 | viewer_center_window=False,
113 | viewer_fullscreen=False,
114 | with_pdfrw=True,
115 | first_frame_only=False,
116 | allow_oversized=True,
117 | )
118 | for kwname, default in _default_kwargs.items():
119 | if kwname not in kwargs:
120 | kwargs[kwname] = default
121 |
122 | pdf = img2pdf.pdfdoc(
123 | "1.3",
124 | kwargs["title"],
125 | kwargs["author"],
126 | kwargs["creator"],
127 | kwargs["producer"],
128 | kwargs["creationdate"],
129 | kwargs["moddate"],
130 | kwargs["subject"],
131 | kwargs["keywords"],
132 | kwargs["nodate"],
133 | kwargs["viewer_panes"],
134 | kwargs["viewer_initial_page"],
135 | kwargs["viewer_magnification"],
136 | kwargs["viewer_page_layout"],
137 | kwargs["viewer_fit_window"],
138 | kwargs["viewer_center_window"],
139 | kwargs["viewer_fullscreen"],
140 | kwargs["with_pdfrw"],
141 | )
142 |
143 | # backwards compatibility with older img2pdf versions where the first
144 | # argument to the function had to be given as a list
145 | if len(images) == 1:
146 | # if only one argument was given and it is a list, expand it
147 | if isinstance(images[0], (list, tuple)):
148 | images = images[0]
149 |
150 | if not isinstance(images, (list, tuple)):
151 | images = [images]
152 |
153 | for img in images:
154 | # img is allowed to be a path, a binary string representing image data
155 | # or a file-like object (really anything that implements read())
156 | try:
157 | rawdata = img.read()
158 | except AttributeError:
159 | if not isinstance(img, (str, bytes)):
160 | raise TypeError("Neither implements read() nor is str or bytes")
161 | # the thing doesn't have a read() function, so try if we can treat
162 | # it as a file name
163 | try:
164 | with open(img, "rb") as f:
165 | rawdata = f.read()
166 | except Exception:
167 | # whatever the exception is (string could contain NUL
168 | # characters or the path could just not exist) it's not a file
169 | # name so we now try treating it as raw image content
170 | rawdata = img
171 |
172 | for (
173 | color,
174 | ndpi,
175 | imgformat,
176 | imgdata,
177 | imgwidthpx,
178 | imgheightpx,
179 | palette,
180 | inverted,
181 | depth,
182 | rotation,
183 | ) in img2pdf.read_images(rawdata, kwargs["colorspace"], kwargs["first_frame_only"]):
184 | pagewidth, pageheight, imgwidthpdf, imgheightpdf = kwargs["layout_fun"](
185 | imgwidthpx, imgheightpx, ndpi
186 | )
187 |
188 | userunit = None
189 | if pagewidth < 3.00 or pageheight < 3.00:
190 | logging.warning(
191 | "pdf width or height is below 3.00 - too " "small for some viewers!"
192 | )
193 | elif pagewidth > 14400.0 or pageheight > 14400.0:
194 | if kwargs["allow_oversized"]:
195 | userunit = img2pdf.find_scale(pagewidth, pageheight)
196 | pagewidth /= userunit
197 | pageheight /= userunit
198 | imgwidthpdf /= userunit
199 | imgheightpdf /= userunit
200 | else:
201 | raise img2pdf.PdfTooLargeError(
202 | "pdf width or height must not exceed 200 inches."
203 | )
204 | # the image is always centered on the page
205 | imgxpdf = (pagewidth - imgwidthpdf) / 2.0
206 | imgypdf = (pageheight - imgheightpdf) / 2.0
207 | pdf.add_imagepage(
208 | color,
209 | imgwidthpx,
210 | imgheightpx,
211 | imgformat,
212 | imgdata,
213 | imgwidthpdf,
214 | imgheightpdf,
215 | imgxpdf,
216 | imgypdf,
217 | pagewidth,
218 | pageheight,
219 | userunit,
220 | palette,
221 | inverted,
222 | depth,
223 | rotation,
224 | )
225 |
226 | if kwargs['contents']:
227 | if pdf.with_pdfrw:
228 | catalog = pdf.writer.trailer.Root
229 | else:
230 | catalog = pdf.writer.catalog
231 | catalog[PdfName.Outlines] = generate_pdf_outline(pdf, kwargs['contents'])
232 |
233 | if kwargs["outputstream"]:
234 | pdf.tostream(kwargs["outputstream"])
235 | return
236 |
237 | return pdf.tostring()
238 |
239 |
240 | class WQXTDownloader:
241 | baseurl = 'https://lib-nuanxin.wqxuetang.com/read/pdf/'
242 | jwt_secret = "g0NnWdSE8qEjdMD8a1aq12qEYphwErKctvfd3IktWHWiOBpVsgkecur38aBRPn2w"
243 | loading_img = '3f08d2c4b0d8cac7641730c7f27f7263c8687bc67cdf179de6996edb9d8409bf09664e035b56d72c00d0b46d8dca1868a48290f469064efd5ba611958fe614e1'
244 |
245 | def __init__(self, downloadpath='.', db='wqxt.db'):
246 | self.downloadpath = downloadpath
247 | self.db = sqlite3.connect(db)
248 | self.session = Session()
249 | self.session.headers.update(HEADERS)
250 | self.init_db()
251 |
252 | def init_db(self):
253 | cur = self.db.cursor()
254 | cur.execute('PRAGMA case_sensitive_like=1')
255 | cur.execute('CREATE TABLE IF NOT EXISTS api_cache ('
256 | 'url TEXT PRIMARY KEY,'
257 | 'updated INTEGER,'
258 | 'value TEXT'
259 | ')')
260 | cur.execute('CREATE TABLE IF NOT EXISTS book_img ('
261 | 'bookid INTEGER,'
262 | 'page INTEGER,'
263 | 'updated INTEGER,'
264 | 'data BLOB,'
265 | 'PRIMARY KEY (bookid, page)'
266 | ')')
267 | self.db.commit()
268 |
269 | def json_call(self, bookid, url, cache=True):
270 | cur = self.db.cursor()
271 | url = url % bookid
272 | if cache:
273 | cur.execute('SELECT value FROM api_cache WHERE url=?', (url,))
274 | res = cur.fetchone()
275 | if res:
276 | return json.loads(res[0])
277 | r = self.session.get(url, headers={
278 | 'referer': self.baseurl + str(bookid),
279 | 'sec-fetch-mode': 'cors',
280 | 'sec-fetch-site': 'same-origin',
281 | 'user': 'bapkg/com.bookask.wqxuetang baver/1.1.1',
282 | })
283 | r.raise_for_status()
284 | result = r.json()
285 | if result['errcode']:
286 | name = url.rsplit('/', 1)[-1]
287 | raise APIError('%s [%s]: %s', name, result['errcode'], result['errmsg'])
288 | cur.execute('REPLACE INTO api_cache VALUES (?,?,?)', (
289 | url, int(time.time()), json.dumps(result['data'])))
290 | self.db.commit()
291 | return result['data']
292 |
293 | def get_img(self, bookid, page, jwtkey):
294 | cur = self.db.cursor()
295 | cur.execute('SELECT data FROM book_img WHERE bookid=? AND page=?',
296 | (bookid, page))
297 | res = cur.fetchone()
298 | if res:
299 | return res[0]
300 | cur_time = time.time()
301 | jwttoken = jwt.encode({
302 | "p": page,
303 | "t": int(cur_time*1000),
304 | "b": str(bookid),
305 | "w": 1000,
306 | "k": json.dumps(jwtkey),
307 | "iat": int(cur_time)
308 | }, self.jwt_secret, algorithm='HS256').decode('ascii')
309 | r = self.session.get(
310 | "https://lib-nuanxin.wqxuetang.com/page/img/%s/%s?k=%s" % (
311 | bookid, page, jwttoken), headers={
312 | 'referer': self.baseurl + str(bookid),
313 | 'sec-fetch-mode': 'no-cors',
314 | 'sec-fetch-site': 'same-origin',
315 | })
316 | r.raise_for_status()
317 | result = r.content
318 | if r.headers.get('pragma') != 'catch':
319 | raise TryAgain()
320 | cur.execute('REPLACE INTO book_img VALUES (?,?,?,?)', (
321 | bookid, page, int(cur_time), result))
322 | self.db.commit()
323 | return result
324 |
325 | def download_pdf(self, bookid, convertimg=True):
326 | logging.info('%s: Loading metadata', bookid)
327 | r = self.session.get(self.baseurl + str(bookid))
328 | r.raise_for_status()
329 | metadata = self.json_call(bookid, "https://lib-nuanxin.wqxuetang.com/v1/read/initread?bid=%s")
330 | title = metadata['name']
331 | try:
332 | author = re_author.match(metadata['title']).group(1)
333 | except Exception:
334 | author = None
335 | contents = self.json_call(bookid, "https://lib-nuanxin.wqxuetang.com/v1/book/catatree?bid=%s")
336 | sizes = self.json_call(bookid, "https://lib-nuanxin.wqxuetang.com/page/size/?bid=%s")
337 | jwtkey = self.json_call(bookid, "https://lib-nuanxin.wqxuetang.com/v1/read/k?bid=%s", cache=False)
338 | page_num = int(metadata['pages'])
339 | images = [None] * page_num
340 | tasks = collections.deque(range(1, page_num+1))
341 | while tasks:
342 | i = tasks.popleft()
343 | try:
344 | img = self.get_img(bookid, i, jwtkey)
345 | if convertimg:
346 | img, imgfmt = imgautocompress.auto_encode(img)
347 | images[i-1] = img
348 | logging.info('%s: %s/%s', bookid, i, page_num)
349 | except TryAgain:
350 | tasks.append(i)
351 | logging.info('%s: %s/%s not loaded', bookid, i, page_num)
352 | time.sleep(0.5)
353 | except Exception:
354 | tasks.append(i)
355 | logging.exception('%s: %s/%s', bookid, i, page_num)
356 | time.sleep(1)
357 | logging.info('%s: Generating PDF', bookid)
358 | filename = "%s-%s.pdf" % (
359 | bookid, title.replace('/', '_').replace(':', ':'))
360 | with open(filename, "wb") as f:
361 | pdf_convert(
362 | images,
363 | title=metadata['name'],
364 | author=author,
365 | with_pdfrw=True,
366 | contents=contents,
367 | outputstream=f
368 | )
369 |
370 | if __name__ == '__main__':
371 | # usage: python3 crawl_wqxt.py
372 | dl = WQXTDownloader()
373 | dl.download_pdf(int(sys.argv[1]))
374 |
--------------------------------------------------------------------------------
/imgautocompress.py:
--------------------------------------------------------------------------------
1 | #!/usr/bin/env python3
2 | # -*- coding: utf-8 -*-
3 |
4 | import os
5 | import io
6 | import sys
7 | import math
8 | import numpy as np
9 | from PIL import Image, ImageStat, ImageFilter
10 |
11 | _PIXWEIGHT = np.concatenate((np.arange(128, 0, -1), np.arange(0, 128))) / 128
12 |
13 |
14 | def otsu_threshold(hist):
15 | total = sum(hist)
16 | sumB = 0
17 | wB = 0
18 | maximum = 0.0
19 | sum1 = np.dot(np.arange(256), hist)
20 | for i in range(256):
21 | wB += hist[i]
22 | wF = total - wB
23 | if wB == 0 or wF == 0:
24 | continue
25 | sumB += i * hist[i]
26 | mF = (sum1 - sumB) / wF
27 | between = wB * wF * ((sumB / wB) - mF) * ((sumB / wB) - mF)
28 | if between >= maximum:
29 | level = i + 1
30 | maximum = between
31 | return level
32 |
33 |
34 | def auto_downgrade(pil_img, thumb_size=128, grey_cutoff=1, bw_ratio=0.99, bw_supersample=1):
35 | mode = pil_img.mode
36 | if mode == '1' and mode not in ('L', 'LA', 'RGB', 'RGBA'):
37 | # ignore special modes
38 | return pil_img
39 | elif mode == 'P':
40 | pil_img = pil_img.convert('RGB')
41 | elif mode == 'PA':
42 | pil_img = pil_img.convert('RGBA')
43 | bands = pil_img.getbands()
44 | alpha_band = False
45 | if bands[-1] == 'A':
46 | alpha_band = True
47 | if all(x == 255 for x in pil_img.getdata(len(bands) - 1)):
48 | alpha_band = False
49 | if bands[:3] == ('R', 'G', 'B'):
50 | thumb = pil_img.resize((thumb_size,thumb_size), resample=Image.BILINEAR)
51 | pixels = np.array(thumb.getdata(), dtype=float)[:, :3]
52 | pixels_max = np.max(pixels, axis=1)
53 | pixels_min = np.min(pixels, axis=1)
54 | val = np.mean(pixels_max - pixels_min)
55 | if val > grey_cutoff:
56 | if bands[-1] == 'A' and not alpha_band:
57 | return pil_img.convert('RGB')
58 | else:
59 | return pil_img
60 | if alpha_band:
61 | return pil_img.convert('LA')
62 | else:
63 | pil_img = pil_img.convert('L')
64 | if alpha_band:
65 | return pil_img
66 | hist = pil_img.histogram()[:256]
67 | if np.average(_PIXWEIGHT, weights=hist) > bw_ratio:
68 | if bw_supersample != 1:
69 | width, height = pil_img.size
70 | width = round(width * bw_supersample)
71 | height = round(height * bw_supersample)
72 | scaled = pil_img.resize((width, height), resample=Image.BICUBIC)
73 | else:
74 | scaled = pil_img
75 | threshold = otsu_threshold(hist)
76 | if 50 < threshold < 250: # resonable range
77 | scaled = scaled.point(lambda p: p > threshold and 255)
78 | return scaled.convert('1', dither=Image.NONE)
79 | if bands[-1] == 'A':
80 | return pil_img.convert('L')
81 | return pil_img
82 |
83 |
84 | def auto_encode(fp, quality=95, thumb_size=128, grey_cutoff=1, bw_ratio=0.99, bw_supersample=1):
85 | if isinstance(fp, str):
86 | with open(fp, 'rb') as f:
87 | orig_data = f.read()
88 | elif isinstance(fp, bytes):
89 | orig_data = fp
90 | else:
91 | orig_data = fp.read()
92 | orig_buf = io.BytesIO(orig_data)
93 | orig_size = len(orig_data)
94 | im = Image.open(orig_buf)
95 | out_im = auto_downgrade(im, thumb_size, grey_cutoff, bw_ratio)
96 | buf = io.BytesIO()
97 | if out_im.mode == '1':
98 | out_im.save(buf, 'TIFF', compression='group4')
99 | return buf.getvalue(), 'TIFF'
100 | elif out_im.mode[0] == 'L' or out_im.mode[-1] == 'A':
101 | out_im.save(buf, 'PNG', optimize=True)
102 | return buf.getvalue(), 'PNG'
103 | if im.format.startswith('JPEG'):
104 | out_format = 'PNG'
105 | out_im.save(buf, 'PNG', optimize=True)
106 | else:
107 | out_format = 'JPEG'
108 | out_im.convert('RGB').save(buf, 'JPEG', quality=quality, optimize=True)
109 | out_data = buf.getvalue()
110 | if len(out_data) > orig_size:
111 | if out_im.mode == im.mode:
112 | return orig_data, im.format
113 | else:
114 | buf = io.BytesIO()
115 | out_im.save(buf, 'PNG', optimize=True)
116 | return buf.getvalue(), 'PNG'
117 | else:
118 | return out_data, out_format
119 |
120 |
121 | if __name__ == '__main__':
122 | input_file = sys.argv[1]
123 | output_prefix = sys.argv[2]
124 | output_data, output_format = auto_encode(input_file)
125 | if output_format == 'JPEG':
126 | output_name = output_prefix + '.jpg'
127 | else:
128 | output_name = output_prefix + '.' + output_format.lower()
129 | with open(output_name, 'wb') as f:
130 | f.write(output_data)
131 |
--------------------------------------------------------------------------------
/requirements.txt:
--------------------------------------------------------------------------------
1 | httpx
2 | Pillow
3 | numpy
4 | img2pdf
5 | pyjwt
6 | pdfrw
7 |
--------------------------------------------------------------------------------