├── img
│   ├── .DS_Store
│   └── 允许远程自动化.png
├── .idea
│   ├── encodings.xml
│   ├── vcs.xml
│   ├── modules.xml
│   ├── misc.xml
│   ├── xzl.iml
│   └── workspace.xml
├── SECURITY.md
├── requirements.txt
├── README.md
├── LICENSE
├── .gitignore
└── xzl.py
/img/允许远程自动化.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/xxxxxzl/xzl/HEAD/img/允许远程自动化.png
--------------------------------------------------------------------------------
/SECURITY.md:
--------------------------------------------------------------------------------
1 | # Security Policy
2 |
3 | ## Supported Versions
4 |
5 | Use this section to tell people about which versions of your project are
6 | currently being supported with security updates.
7 |
8 | | Version | Supported |
9 | | ------- | ------------------ |
10 | | 5.1.x | :white_check_mark: |
11 | | 5.0.x | :x: |
12 | | 4.0.x | :white_check_mark: |
13 | | < 4.0 | :x: |
14 |
15 | ## Reporting a Vulnerability
16 |
17 | Use this section to tell people how to report a vulnerability.
18 |
19 | Tell them where to go, how often they can expect to get an update on a
20 | reported vulnerability, what to expect if the vulnerability is accepted or
21 | declined, etc.
22 |
--------------------------------------------------------------------------------
/requirements.txt:
--------------------------------------------------------------------------------
1 | asn1crypto==0.24.0
2 | attrs==19.1.0
3 | Automat==0.7.0
4 | browsercookie==0.7.5
5 | certifi==2019.6.16
6 | cffi==1.12.3
7 | chardet==3.0.4
8 | constantly==15.1.0
9 | cryptography==3.2
10 | cssselect==1.0.3
11 | entrypoints==0.3
12 | html2text==2018.1.9
13 | hyperlink==19.0.0
14 | idna==2.8
15 | incremental==17.5.0
16 | keyring==19.0.2
17 | lxml==4.6.3
18 | lz4==2.1.10
19 | parsel==1.5.1
20 | pdfkit==0.6.1
21 | pip==19.2
22 | pyasn1==0.4.5
23 | pyasn1-modules==0.2.5
24 | pycparser==2.19
25 | pycrypto==2.6.1
26 | PyDispatcher==2.0.5
27 | PyHamcrest==1.9.0
28 | pyOpenSSL==19.0.0
29 | queuelib==1.5.0
30 | requests==2.22.0
31 | Scrapy==1.6.0
32 | selenium==3.141.0
33 | service-identity==18.1.0
34 | setuptools==40.8.0
35 | six==1.12.0
36 | Twisted==19.7.0
37 | urllib3==1.26.5
38 | w3lib==1.20.0
39 | zope.interface==4.6.0
--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
1 | # xzl
2 | 
3 |
4 | Export content from [小专栏](https://xiaozhuanlan.com) as Markdown or PDF.
5 |
6 | # Xiaozhuanlan has changed its rules; currently only free column content can be exported
7 | # Please respect every author's work; do not redistribute downloaded files
8 |
9 | ~~Before use, make sure you have logged in to Xiaozhuanlan in Chrome; otherwise the export may fail~~
10 |
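The note above concerns the login cookie: xzl.py reads the `_xiaozhuanlan_session` cookie from Chrome and later splits the resulting `Cookie` header on `;` and `=` before handing it to Selenium. A minimal sketch of that splitting step (stdlib only; `cookie_header_to_pairs` is a hypothetical helper for illustration, not part of xzl.py):

```python
def cookie_header_to_pairs(header):
    """Split a Cookie header such as 'a=1; b=2' into (name, value) pairs,
    mirroring how xzl.py converts headers['Cookie'] into Selenium cookies."""
    pairs = []
    for part in header.replace(' ', '').split(';'):
        if part:  # skip empty fragments left by trailing separators
            name, _, value = part.partition('=')
            pairs.append((name, value))
    return pairs

print(cookie_header_to_pairs('_xiaozhuanlan_session=abc123; other=1'))
```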
11 | #### Supported
12 | - 专栏 (columns)
13 | - 小书 (books)
14 | - All subscriptions
15 |
16 | #### Export formats
17 | - markdown
18 | - pdf
19 |
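Exported files are named from the article title, optionally prefixed with its creation date (the `hasTime` switch in xzl.py), with `/` replaced so the title is a valid file name. A rough sketch of that naming rule (`export_filename` is a hypothetical helper for illustration, not part of xzl.py):

```python
def export_filename(title, created=None, ext='md'):
    """Build an export file name the way xzl.py does: strip newlines and
    spaces, replace '/' (illegal in file names), optionally prefix the date."""
    safe = title.replace('\n', '').replace(' ', '').replace('/', '-')
    if created:
        safe = created + ' ' + safe
    return safe + '.' + ext

print(export_filename('iOS 面试/进阶', created='2019-07-09'))
```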
20 | #### Fetching the table of contents requires simulated mouse scrolling, so enable Safari's Allow Remote Automation: Develop -> Allow Remote Automation
21 | ![允许远程自动化](img/允许远程自动化.png)
22 |
23 | ### Create a virtualenv environment and install all dependencies:
24 |
25 | # Install virtualenv globally
26 | $ pip3 install virtualenv
27 |
28 | # Clone the repo
29 | $ git clone git@github.com:iizvv/xzl.git && cd xzl
30 |
31 | # Create the virtual environment
32 | $ virtualenv venv
33 |
34 | # Activate the environment
35 | $ source venv/bin/activate
36 |
37 | # Install dependencies
38 | $ pip3 install -r requirements.txt
39 |
40 | # Run the script
41 | $ python3 xzl.py
42 |
43 | - Thanks to [瓜神](https://github.com/Desgard) for providing the approach
44 |
--------------------------------------------------------------------------------
/LICENSE:
--------------------------------------------------------------------------------
1 | MIT License
2 |
3 | Copyright (c) 2019 A
4 |
5 | Permission is hereby granted, free of charge, to any person obtaining a copy
6 | of this software and associated documentation files (the "Software"), to deal
7 | in the Software without restriction, including without limitation the rights
8 | to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
9 | copies of the Software, and to permit persons to whom the Software is
10 | furnished to do so, subject to the following conditions:
11 |
12 | The above copyright notice and this permission notice shall be included in all
13 | copies or substantial portions of the Software.
14 |
15 | THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
16 | IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
17 | FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
18 | AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
19 | LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
20 | OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
21 | SOFTWARE.
22 |
--------------------------------------------------------------------------------
/.gitignore:
--------------------------------------------------------------------------------
1 | # Byte-compiled / optimized / DLL files
2 | __pycache__/
3 | *.py[cod]
4 | *$py.class
5 |
6 | # C extensions
7 | *.so
8 |
9 | # Distribution / packaging
10 | .Python
11 | build/
12 | develop-eggs/
13 | dist/
14 | downloads/
15 | eggs/
16 | .eggs/
17 | lib/
18 | lib64/
19 | parts/
20 | sdist/
21 | var/
22 | wheels/
23 | *.egg-info/
24 | .installed.cfg
25 | *.egg
26 | MANIFEST
27 |
28 | # PyInstaller
29 | # Usually these files are written by a python script from a template
30 | # before PyInstaller builds the exe, so as to inject date/other infos into it.
31 | *.manifest
32 | *.spec
33 |
34 | # Installer logs
35 | pip-log.txt
36 | pip-delete-this-directory.txt
37 |
38 | # Unit test / coverage reports
39 | htmlcov/
40 | .tox/
41 | .coverage
42 | .coverage.*
43 | .cache
44 | nosetests.xml
45 | coverage.xml
46 | *.cover
47 | .hypothesis/
48 | .pytest_cache/
49 |
50 | # Translations
51 | *.mo
52 | *.pot
53 |
54 | # Django stuff:
55 | *.log
56 | local_settings.py
57 | db.sqlite3
58 |
59 | # Flask stuff:
60 | instance/
61 | .webassets-cache
62 |
63 | # Scrapy stuff:
64 | .scrapy
65 |
66 | # Sphinx documentation
67 | docs/_build/
68 |
69 | # PyBuilder
70 | target/
71 |
72 | # Jupyter Notebook
73 | .ipynb_checkpoints
74 |
75 | # pyenv
76 | .python-version
77 |
78 | # celery beat schedule file
79 | celerybeat-schedule
80 |
81 | # SageMath parsed files
82 | *.sage.py
83 |
84 | # Environments
85 | .env
86 | .venv
87 | env/
88 | venv/
89 | ENV/
90 | env.bak/
91 | venv.bak/
92 |
93 | # Spyder project settings
94 | .spyderproject
95 | .spyproject
96 |
97 | # Rope project settings
98 | .ropeproject
99 |
100 | # mkdocs documentation
101 | /site
102 |
103 | # mypy
104 | .mypy_cache/
105 | cookie.cache
106 |
--------------------------------------------------------------------------------
/xzl.py:
--------------------------------------------------------------------------------
1 | # coding=utf-8
2 |
3 | import requests
4 | from scrapy.selector import Selector
5 | import time
6 | import html2text as ht
7 | import os
8 | from selenium import webdriver
9 | import pdfkit
10 | import browsercookie
11 |
12 | # Please respect every author's work; do not redistribute downloaded files.
13 |
14 | # Fetching the table of contents requires simulated mouse scrolling, so
15 | # enable Safari's Allow Remote Automation: Develop -> Allow Remote Automation.
16 |
17 | # Xiaozhuanlan base URL
18 | xzl = 'https://xiaozhuanlan.com'
19 |
20 | # Delay between requests, in seconds
21 | seconds = 3
22 |
23 | # Whether to prefix exported file names with the article's creation time
24 | hasTime = True
25 |
26 | # Whether to export as Markdown; PDF export requires wkhtmltopdf,
27 | # which on macOS can be installed with `brew install Caskroom/cask/wkhtmltopdf`
28 | markdown = True
29 |
30 | # For books (小书) with `markdown=False`: whether to concatenate all chapters into one PDF
31 | xs_pdf = True
32 |
33 | # The account cookie is read from Chrome to simulate a logged-in user.
34 | # No need to edit this; it is filled in automatically.
35 | headers = {
36 |     'Cookie': ''
37 | }
38 |
39 |
40 | def fetch_cookie():
41 |     """
42 |     Fetch the session cookie from `cookie.cache`, falling back to Chrome's cookie store.
43 |     """
44 |     global headers
45 |     if os.path.exists('./cookie.cache'):
46 |         with open('./cookie.cache', 'r') as cookie:
47 |             headers['Cookie'] = cookie.read()
48 |     else:
49 |         xzl_session = ''
50 |         for cookie in browsercookie.chrome():
51 |             if cookie.name == '_xiaozhuanlan_session':
52 |                 xzl_session = cookie.name + '=' + cookie.value
53 |                 with open('./cookie.cache', 'w') as f:
54 |                     f.write(xzl_session)
55 |                 headers['Cookie'] = xzl_session
56 |         if not xzl_session:
57 |             print('\n\n\n\nPlease log in to Xiaozhuanlan in Chrome first\n\n\n\n')
58 |
59 |
60 | # Collect the subscription list
61 | def get_subscribes():
62 |     print('Collecting subscription list\n')
63 |     url = xzl + '/me/subscribes'
64 |     driver = webdriver.Safari()
65 |     driver.get(xzl)
66 |     cookies = headers.get('Cookie').replace(' ', '').split(';')
67 |     for cookie in cookies:
68 |         cs = cookie.split('=')
69 |         driver.add_cookie({'name': cs[0], 'value': cs[1]})
70 |     driver.get(url)
71 |     print('Collecting the subscription catalog; the browser closes automatically when done\n')
72 |     style = ''
73 |     while style != 'display: block;':
74 |         print('Collecting...\n')
75 |         time.sleep(seconds)
76 |         # Simulate scrolling so the page loads more entries
77 |         js = "window.scrollTo(0, document.documentElement.scrollHeight*2)"
78 |         driver.execute_script(js)
79 |         style = driver.find_element_by_class_name('xzl-topic-list-no-topics').get_attribute('style')
80 |     selector = Selector(text=driver.page_source)
81 |     items = selector.css(u'.streamItem-cardInner').extract()
82 |     print('Catalog collected; found %d entries\n' % len(items))
83 |     for item in items:
84 |         selector = Selector(text=item)
85 |         href = selector.css(u'.zl-title a::attr(href)').extract_first()
86 |         title = selector.css(u'.zl-title a::text').extract_first()
87 |         book = selector.css('.zl-bookContent').extract_first()
88 |         print('Collecting the catalog of: ' + title + '\n')
89 |         if book:
90 |             print('This item is a book (小书)\n')
91 |             get_xs(href, True)
92 |         else:
93 |             print('This item is a column (专栏)\n')
94 |             get_zl(href, driver)
95 |         time.sleep(seconds)
96 |     driver.quit()
97 |     print("All content exported. Please respect every author's work; do not redistribute downloaded files\n")
98 |
99 |
100 | # Collect a book's chapter catalog
101 | def get_xs(href, is_all=False):
102 |     url = xzl + href + '#a4'
103 |     print('Collecting book info; book URL: ' + url + '\n')
104 |     xzl_path = ''
105 |     if is_all:
106 |         xzl_path = '小专栏/'
107 |     response = close_session().get(url=url, headers=headers)
108 |     selector = Selector(text=response.text)
109 |     chapter = selector.css(u'.book-cata-item').extract()
110 |     xs_title = selector.css(u'.bannerMsg .title ::text').extract_first()
111 |     html = ''
112 |     if xs_pdf:
113 |         html = selector.css(u'.dot-list').extract_first()
114 |     for idx, c in enumerate(chapter):
115 |         selector = Selector(text=c)
116 |         items = selector.css(u'.cata-sm-item').extract()
117 |         z_title = selector.css(u'a::text').extract_first()
118 |         z_href = selector.css(u'a::attr(href)').extract_first()
119 |         path = os.path.join(os.path.expanduser("~"), 'Desktop') + '/' + xzl_path + xs_title + '/' + z_title + '/'
120 |         if xs_pdf:
121 |             path = os.path.join(os.path.expanduser("~"), 'Desktop') + '/' + xzl_path + xs_title + '/'
122 |         else:
123 |             print(xs_title + ' has %d chapters; creating the storage directory\n' % len(chapter))
124 |         print('Files will be stored in: ' + path + '\n')
125 |         if not os.path.exists(path):
126 |             os.makedirs(path)
127 |             print('Directory created\n')
128 |         html += get_xs_detail(z_href, z_title, path)
129 |         for item in items:
130 |             selector = Selector(text=item)
131 |             j_title = selector.css(u'.cata-sm-item a::text').extract_first()
132 |             j_href = selector.css(u'.cata-sm-item a::attr(href)').extract_first()
133 |             html += get_xs_detail(j_href, j_title, path)
134 |             time.sleep(seconds)
135 |         time.sleep(seconds)
136 |     if xs_pdf:
137 |         # Prepend a charset declaration, otherwise Chinese text is garbled in the PDF
138 |         html = '<meta charset="UTF-8">' + html
139 |         pdfkit.from_string(html, path + xs_title + '.pdf')
140 |     print('Book ' + xs_title + ' has been collected\n')
141 |     print("Please respect every author's work; do not redistribute downloaded files\n")
142 |
143 |
144 | # Collect a book chapter's content
145 | def get_xs_detail(href, title, path):
146 |     url = xzl + href
147 |     print('Collecting ' + title + '; chapter URL: ' + url + '\n')
148 |     text_maker = ht.HTML2Text()
149 |     response = close_session().get(url=url, headers=headers)
150 |     selector = Selector(text=response.text)
151 |     html = selector.css(u'.cata-book-content').extract_first()
152 |     file_name = title
153 |     if markdown:
154 |         md = text_maker.handle(html)
155 |         with open(path + file_name + '.md', 'w') as f:
156 |             f.write(md)
157 |     elif not xs_pdf:
158 |         # Prepend a charset declaration, otherwise Chinese text is garbled in the PDF
159 |         html = '<meta charset="UTF-8">' + html
160 |         pdfkit.from_string(html, path + file_name + '.pdf')
161 |     else:
162 |         return html
163 |     # Callers concatenate the return value, so always return a string
164 |     return ''
165 |
166 |
167 | # Collect a column's article list
168 | def get_zl(href, driver=None):
169 |     url = xzl + href
170 |     print('Collecting column info; column URL: ' + url + '\n')
171 |     xzl_path = ''
172 |     own_driver = driver is None
173 |     if own_driver:
174 |         driver = webdriver.Safari()
175 |         driver.get(xzl)
176 |         cookies = headers.get('Cookie').replace(' ', '').split(';')
177 |         for cookie in cookies:
178 |             cs = cookie.split('=')
179 |             driver.add_cookie({'name': cs[0], 'value': cs[1]})
180 |     else:
181 |         xzl_path = '小专栏/'
182 |     driver.get(url)
183 |     print("Collecting the column's article catalog\n")
184 |     style = ''
185 |     while style != 'display: block;':
186 |         print('Collecting...\n')
187 |         time.sleep(seconds)
188 |         # Simulate scrolling so the page loads more entries
189 |         js = "window.scrollTo(0, document.documentElement.scrollHeight*2)"
190 |         driver.execute_script(js)
191 |         style = driver.find_element_by_class_name('xzl-topic-list-no-topics').get_attribute('style')
192 |     selector = Selector(text=driver.page_source)
193 |     items = selector.css(u'.topic-body').extract()
194 |     print('Found %d articles\n' % len(items))
195 |     zl_title = selector.css(u'.zhuanlan-title ::text').extract_first().replace('\n', '').replace(' ', '')
196 |     print('Catalog collected; current column: ' + zl_title + '; creating the storage path\n')
197 |     path = os.path.join(os.path.expanduser("~"), 'Desktop') + '/' + xzl_path + zl_title + '/'
198 |     print('Files will be stored in: ' + path + '\n')
199 |     if not os.path.exists(path):
200 |         os.makedirs(path)
201 |         print('Directory created\n')
202 |     print('Collecting article contents\n')
203 |     for idx, item in enumerate(items):
204 |         selector = Selector(text=item)
205 |         link = selector.css(u'a::attr(href)').extract_first()
206 |         title = selector.css(u'h3::text').extract_first().replace('\n', '').replace(' ', '').replace('/', '-')
207 |         detail_url = xzl + link
208 |         print('Collecting article: ' + title + '; article URL: ' + detail_url + '\n')
209 |         get_zl_detail(detail_url, path, title)
210 |         # Wait before collecting the next article
211 |         time.sleep(seconds)
212 |     if own_driver:
213 |         driver.quit()
214 |     print('Column ' + zl_title + ' has been collected\n')
215 |     print("Please respect every author's work; do not redistribute downloaded files\n")
216 |
217 |
218 | # Collect a column article's content
219 | def get_zl_detail(url, path, name):
220 |     response = close_session().get(url=url, headers=headers)
221 |     selector = Selector(text=response.text)
222 |     text_maker = ht.HTML2Text()
223 |     create_time = selector.css(u'.time abbr::attr(title)').extract_first()
224 |     html = selector.css(u'.xzl-topic-body-content').extract_first()
225 |     file_name = name
226 |     if hasTime:
227 |         file_name = create_time + ' ' + name
228 |     if markdown:
229 |         md = text_maker.handle(html)
230 |         with open(path + file_name + '.md', 'w') as f:
231 |             f.write(md)
232 |     else:
233 |         # Prepend a charset declaration, otherwise Chinese text is garbled in the PDF
234 |         html = '<meta charset="UTF-8">' + html
235 |         pdfkit.from_string(html, path + file_name + '.pdf')
236 |
237 |
238 | # Use a fresh session and drop the connection after each request
239 | def close_session():
240 |     request = requests.session()
241 |     # Disable keep-alive to avoid piling up idle connections
242 |     request.keep_alive = False
243 |     return request
244 |
245 |
246 | if __name__ == '__main__':
247 |     print("Please respect every author's work; do not redistribute downloaded files\n")
248 |     print('When the browser opens automatically, do not close it; it closes on its own once collection and export finish\n')
249 |     # Increase the retry count
250 |     # requests.adapters.DEFAULT_RETRIES = 5
251 |     # Fetch the cookie
252 |     fetch_cookie()
253 |     # Collect a book: pass only the path component of the URL,
254 |     # e.g. for https://xiaozhuanlan.com/ios-interview pass '/ios-interview'
255 |     # get_xs('/ios-interview')
256 |     # Collect a column: pass only the path component of the URL,
257 |     # e.g. for https://xiaozhuanlan.com/The-story-of-the-programmer pass '/The-story-of-the-programmer'
258 |     get_zl('/The-story-of-the-programmer')
259 |     # Collect all subscriptions
260 |     # get_subscribes()
--------------------------------------------------------------------------------