├── img ├── .DS_Store └── 允许远程自动化.png ├── .idea ├── encodings.xml ├── vcs.xml ├── modules.xml ├── misc.xml ├── xzl.iml └── workspace.xml ├── SECURITY.md ├── requirements.txt ├── README.md ├── LICENSE ├── .gitignore └── xzl.py /img/.DS_Store: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/xxxxxzl/xzl/HEAD/img/.DS_Store -------------------------------------------------------------------------------- /img/允许远程自动化.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/xxxxxzl/xzl/HEAD/img/允许远程自动化.png -------------------------------------------------------------------------------- /.idea/encodings.xml: -------------------------------------------------------------------------------- 1 | 2 | 3 | 4 | -------------------------------------------------------------------------------- /.idea/vcs.xml: -------------------------------------------------------------------------------- 1 | 2 | 3 | 4 | 5 | 6 | -------------------------------------------------------------------------------- /.idea/modules.xml: -------------------------------------------------------------------------------- 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | -------------------------------------------------------------------------------- /.idea/misc.xml: -------------------------------------------------------------------------------- 1 | 2 | 3 | 4 | 6 | 7 | -------------------------------------------------------------------------------- /.idea/xzl.iml: -------------------------------------------------------------------------------- 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 | 13 | -------------------------------------------------------------------------------- /SECURITY.md: -------------------------------------------------------------------------------- 1 | # Security Policy 2 | 3 | ## Supported Versions 4 | 5 | Use this section to tell people about which versions of your project are 6 | 
currently being supported with security updates. 7 | 8 | | Version | Supported | 9 | | ------- | ------------------ | 10 | | 5.1.x | :white_check_mark: | 11 | | 5.0.x | :x: | 12 | | 4.0.x | :white_check_mark: | 13 | | < 4.0 | :x: | 14 | 15 | ## Reporting a Vulnerability 16 | 17 | Use this section to tell people how to report a vulnerability. 18 | 19 | Tell them where to go, how often they can expect to get an update on a 20 | reported vulnerability, what to expect if the vulnerability is accepted or 21 | declined, etc. 22 | -------------------------------------------------------------------------------- /requirements.txt: -------------------------------------------------------------------------------- 1 | asn1crypto==0.24.0 2 | attrs==19.1.0 3 | Automat==0.7.0 4 | browsercookie==0.7.5 5 | certifi==2019.6.16 6 | cffi==1.12.3 7 | chardet==3.0.4 8 | constantly==15.1.0 9 | cryptography==3.2 10 | cssselect==1.0.3 11 | entrypoints==0.3 12 | html2text==2018.1.9 13 | hyperlink==19.0.0 14 | idna==2.8 15 | incremental==17.5.0 16 | keyring==19.0.2 17 | lxml==4.6.3 18 | lz4==2.1.10 19 | parsel==1.5.1 20 | pdfkit==0.6.1 21 | pip==19.2 22 | pyasn1==0.4.5 23 | pyasn1-modules==0.2.5 24 | pycparser==2.19 25 | pycrypto ==2.6.1 26 | PyDispatcher==2.0.5 27 | PyHamcrest==1.9.0 28 | pyOpenSSL==19.0.0 29 | queuelib==1.5.0 30 | requests==2.22.0 31 | Scrapy==1.6.0 32 | selenium==3.141.0 33 | service-identity==18.1.0 34 | setuptools==40.8.0 35 | six==1.12.0 36 | Twisted==19.7.0 37 | urllib3==1.26.5 38 | w3lib==1.20.0 39 | zope.interface==4.6.0 -------------------------------------------------------------------------------- /README.md: -------------------------------------------------------------------------------- 1 | # xzl 2 | ![python3](https://img.shields.io/badge/Python-3.7.3-green.svg) 3 | 4 | 将[小专栏](https://xiaozhuanlan.com)的内容通过markdown/pdf格式导出 5 | 6 | # 小专栏规则修改, 目前仅能导出免费专栏内容 7 | # 我们应该尊重每一位作者的付出, 请不要随意传播下载后的文件 8 | 9 | ~~使用前请确定已在Chrome中登录过小专栏, 否则可能会出现导出失败的问题~~ 10 | 11 | #### 已支持 12 | 
- 专栏 13 | - 小书 14 | - 全部订阅
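The README notes that you must be logged in to 小专栏 in Chrome; xzl.py reads the `_xiaozhuanlan_session` cookie through the `browsercookie` package to reuse that login. A minimal sketch of that lookup — the helper name `find_session_cookie` is illustrative (the script inlines this logic in `fetch_cookie()`):

```python
# Illustrative sketch of the cookie lookup in xzl.py's fetch_cookie().
# The function name and signature are not from the script itself.

def find_session_cookie(cookies, name='_xiaozhuanlan_session'):
    """Return 'name=value' for the first matching cookie, else None."""
    for cookie in cookies:
        if cookie.name == name:
            return cookie.name + '=' + cookie.value
    return None

# In the real script, the cookie jar comes from Chrome:
#   import browsercookie
#   session = find_session_cookie(browsercookie.chrome())
```

If no cookie is found, the script falls back to asking the user to log in to 小专栏 in Chrome first.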
15 | 16 | #### 导出格式 17 | - markdown 18 | - pdf
19 | 20 | #### 由于获取目录时需要模拟鼠标滑动, 所以请打开Safari的允许远程自动化, 开发->允许远程自动化 21 | ![允许远程自动化](./img/允许远程自动化.png) 22 | 23 | ### 做 virtualenv 虚拟环境,直接安装全部依赖: 24 | 25 | # 全局安装 virtualenv 26 | $ pip3 install virtualenv 27 | 28 | # clone repo 29 | $ git clone git@github.com:iizvv/xzl.git && cd xzl 30 | 31 | # 创建虚拟环境 32 | $ virtualenv venv 33 | 34 | # 激活环境 35 | $ source venv/bin/activate 36 | 37 | # 安装依赖 38 | $ pip3 install -r requirements.txt 39 | 40 | # 执行脚本 41 | $ python3 xzl.py 42 | 43 | - 感谢[瓜神](https://github.com/Desgard)提供的方案 44 | -------------------------------------------------------------------------------- /LICENSE: -------------------------------------------------------------------------------- 1 | MIT License 2 | 3 | Copyright (c) 2019 A 4 | 5 | Permission is hereby granted, free of charge, to any person obtaining a copy 6 | of this software and associated documentation files (the "Software"), to deal 7 | in the Software without restriction, including without limitation the rights 8 | to use, copy, modify, merge, publish, distribute, sublicense, and/or sell 9 | copies of the Software, and to permit persons to whom the Software is 10 | furnished to do so, subject to the following conditions: 11 | 12 | The above copyright notice and this permission notice shall be included in all 13 | copies or substantial portions of the Software. 14 | 15 | THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR 16 | IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, 17 | FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE 18 | AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER 19 | LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, 20 | OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE 21 | SOFTWARE. 
22 | -------------------------------------------------------------------------------- /.gitignore: -------------------------------------------------------------------------------- 1 | # Byte-compiled / optimized / DLL files 2 | __pycache__/ 3 | *.py[cod] 4 | *$py.class 5 | 6 | # C extensions 7 | *.so 8 | 9 | # Distribution / packaging 10 | .Python 11 | build/ 12 | develop-eggs/ 13 | dist/ 14 | downloads/ 15 | eggs/ 16 | .eggs/ 17 | lib/ 18 | lib64/ 19 | parts/ 20 | sdist/ 21 | var/ 22 | wheels/ 23 | *.egg-info/ 24 | .installed.cfg 25 | *.egg 26 | MANIFEST 27 | 28 | # PyInstaller 29 | # Usually these files are written by a python script from a template 30 | # before PyInstaller builds the exe, so as to inject date/other infos into it. 31 | *.manifest 32 | *.spec 33 | 34 | # Installer logs 35 | pip-log.txt 36 | pip-delete-this-directory.txt 37 | 38 | # Unit test / coverage reports 39 | htmlcov/ 40 | .tox/ 41 | .coverage 42 | .coverage.* 43 | .cache 44 | nosetests.xml 45 | coverage.xml 46 | *.cover 47 | .hypothesis/ 48 | .pytest_cache/ 49 | 50 | # Translations 51 | *.mo 52 | *.pot 53 | 54 | # Django stuff: 55 | *.log 56 | local_settings.py 57 | db.sqlite3 58 | 59 | # Flask stuff: 60 | instance/ 61 | .webassets-cache 62 | 63 | # Scrapy stuff: 64 | .scrapy 65 | 66 | # Sphinx documentation 67 | docs/_build/ 68 | 69 | # PyBuilder 70 | target/ 71 | 72 | # Jupyter Notebook 73 | .ipynb_checkpoints 74 | 75 | # pyenv 76 | .python-version 77 | 78 | # celery beat schedule file 79 | celerybeat-schedule 80 | 81 | # SageMath parsed files 82 | *.sage.py 83 | 84 | # Environments 85 | .env 86 | .venv 87 | env/ 88 | venv/ 89 | ENV/ 90 | env.bak/ 91 | venv.bak/ 92 | 93 | # Spyder project settings 94 | .spyderproject 95 | .spyproject 96 | 97 | # Rope project settings 98 | .ropeproject 99 | 100 | # mkdocs documentation 101 | /site 102 | 103 | # mypy 104 | .mypy_cache/ 105 | cookie.cache 106 | -------------------------------------------------------------------------------- /xzl.py: 
-------------------------------------------------------------------------------- 1 | #coding=utf8 2 | 3 | import requests 4 | from scrapy.selector import Selector 5 | import time 6 | import html2text as ht 7 | import os 8 | from selenium import webdriver 9 | import pdfkit 10 | import browsercookie 11 | 12 | # 我们应该尊重每一位作者的付出, 请不要随意传播下载后的文件 13 | 14 | # 由于采集目录时需要模拟鼠标滑动, 所以请打开Safari的允许远程自动化, 开发->允许远程自动化 15 | 16 | # 小专栏基础地址 17 | xzl = 'https://xiaozhuanlan.com' 18 | 19 | # 设置等待时长 20 | seconds = 3 21 | 22 | # 文件标题是否添加文章编写时间 23 | hasTime = True 24 | 25 | # 是否以MarkDown格式导出, 导出pdf需先下载![wkhtmltopdf](https://wkhtmltopdf.org/downloads.html) 26 | # mac可以直接通过 `brew install Caskroom/cask/wkhtmltopdf` 进行安装 27 | markdown = True 28 | 29 | # 当为小书时,且`markdown=False`,是否将所有章节进行拼接为一个pdf 30 | xs_pdf = True 31 | 32 | # 通过Chrome采集到账号cookie,模拟用户登录状态 33 | # 此处不需要修改,将会自动从Chrome中获取cookie内容 34 | headers = { 35 | 'Cookie': '' 36 | } 37 | 38 | 39 | def fetch_cookie(): 40 | """ 41 | Fetch cookie from `cookie.cache` 42 | """ 43 | global headers 44 | if os.path.exists('./cookie.cache'): 45 | with open('./cookie.cache', 'r') as cookie: 46 | headers['Cookie'] = cookie.read() 47 | else: 48 | chrome_cookie = browsercookie.chrome() 49 | for cookie in chrome_cookie: 50 | if cookie.name == '_xiaozhuanlan_session': 51 | xzl_session = cookie.name + '=' + cookie.value 52 | with open('./cookie.cache', 'w') as f: 53 | f.write(xzl_session) 54 | headers['Cookie'] = xzl_session 55 | if not headers['Cookie']: 56 | print('\n\n\n\n请先在Chrome上登录小专栏\n\n\n\n') 57 | 58 | 59 | # 采集订阅列表 60 | def get_subscribes(): 61 | print('开始采集订阅列表\n') 62 | url = xzl + '/' + 'me/subscribes' 63 | driver = webdriver.Safari() 64 | driver.get(xzl) 65 | cookies = headers.get('Cookie').replace(' ', '').split(';') 66 | for cookie in cookies: 67 | cs = cookie.split('=') 68 | driver.add_cookie({'name': cs[0], 'value': cs[1]}) 69 | driver.get(url) 70 | print('开始采集订阅目录, 采集完成后自动关闭浏览器\n') 71 | style = '' 72 | while not style == 'display: block;':
print('正在采集。。。\n') 74 | time.sleep(seconds) 75 | # 此处模拟浏览器滚动, 以采集更多数据 76 | js = "window.scrollTo(0, document.documentElement.scrollHeight*2)" 77 | driver.execute_script(js) 78 | style = driver.find_element_by_class_name('xzl-topic-list-no-topics').get_attribute('style') 79 | selector = Selector(text=driver.page_source) 80 | items = selector.css(u'.streamItem-cardInner').extract() 81 | print('列表采集完成,共找到%d条数据\n'%len(items)) 82 | for item in items: 83 | selector = Selector(text=item) 84 | href = selector.css(u'.zl-title a::attr(href)').extract_first() 85 | title = selector.css(u'.zl-title a::text').extract_first() 86 | book = selector.css('.zl-bookContent').extract_first() 87 | print('开始采集: ' + title + '的目录信息\n') 88 | if book: 89 | print('当前内容为小书\n') 90 | get_xs(href, True) 91 | else: 92 | print('当前内容为专栏\n') 93 | get_zl(href, driver) 94 | time.sleep(seconds) 95 | print('所有内容已导出完成,我们应该尊重每一位作者的付出,请不要随意传播下载后的文件\n') 96 | 97 | 98 | # 采集小书章节目录 99 | def get_xs(href, is_all=False): 100 | url = xzl + href + '#a4' 101 | print('开始采集小书信息,小书地址为: ' + url + '\n') 102 | xzl_path = '' 103 | if is_all: 104 | xzl_path = '小专栏/' 105 | response = close_session().get(url=url, headers=headers) 106 | selector = Selector(text=response.text) 107 | chapter = selector.css(u'.book-cata-item').extract() 108 | xs_title = selector.css(u'.bannerMsg .title ::text').extract_first() 109 | html = '' 110 | if xs_pdf: 111 | html = '
' + selector.css(u'.dot-list').extract_first() + '
' 112 | for idx, c in enumerate(chapter): 113 | selector = Selector(text=c) 114 | items = selector.css(u'.cata-sm-item').extract() 115 | z_title = selector.css(u'a::text').extract_first() 116 | z_href = selector.css(u'a::attr(href)').extract_first() 117 | path = os.path.join(os.path.expanduser("~"), 'Desktop')+'/'+xzl_path+xs_title+'/'+z_title+'/' 118 | if xs_pdf: 119 | path = os.path.join(os.path.expanduser("~"), 'Desktop')+'/'+xzl_path+xs_title+'/' 120 | else: 121 | print(xs_title + '共%d章, 正在创建存储目录\n' % len(chapter)) 122 | print('文件存储位置: ' + path + '\n') 123 | if not os.path.exists(path): 124 | os.makedirs(path) 125 | print('文件夹创建成功\n') 126 | html += get_xs_detail(z_href, z_title, path) 127 | for item in items: 128 | selector = Selector(text=item) 129 | j_title = selector.css(u'.cata-sm-item a::text').extract_first() 130 | j_href = selector.css(u'.cata-sm-item a::attr(href)').extract_first() 131 | html += get_xs_detail(j_href, j_title, path) 132 | time.sleep(seconds) 133 | time.sleep(seconds) 134 | if xs_pdf: 135 | # 在html中加入编码, 否则中文会乱码 136 | html = " " + html + "" 137 | pdfkit.from_string(html, path+xs_title+'.pdf') 138 | print('小书:' + xs_title + '的文章已采集完成\n') 139 | print('我们应该尊重每一位作者的付出, 请不要随意传播下载后的文件\n') 140 | 141 | 142 | # 采集小书章节详情 143 | def get_xs_detail(href, title, path): 144 | url = xzl+href 145 | print('开始采集' + title + '的详情, 章节地址为: ' + url + '\n') 146 | text_maker = ht.HTML2Text() 147 | response = close_session().get(url=url, headers=headers) 148 | selector = Selector(text=response.text) 149 | html = selector.css(u'.cata-book-content').extract_first() 150 | file_name = title 151 | if markdown: 152 | md = text_maker.handle(html) 153 | with open(path + file_name + '.md', 'w') as f: 154 | f.write(md) 155 | else: 156 | if not xs_pdf: 157 | # 在html中加入编码, 否则中文会乱码 158 | html = " " + html + "" 159 | pdfkit.from_string(html, path + file_name + '.pdf') 160 | else: 161 | return html 162 | 163 | 164 | # 采集专栏列表 165 | def get_zl(href, driver=None): 166 | url = xzl + 
href 167 | print('开始采集专栏信息,专栏地址为: ' + url + '\n') 168 | xzl_path = '' 169 | if not driver: 170 | driver = webdriver.Safari() 171 | driver.get(xzl) 172 | cookies = headers.get('Cookie').replace(' ', '').split(';') 173 | for cookie in cookies: 174 | cs = cookie.split('=') 175 | driver.add_cookie({'name': cs[0], 'value': cs[1]}) 176 | else: 177 | xzl_path = '小专栏/' 178 | driver.get(url) 179 | print('开始采集专栏文章目录\n') 180 | style = '' 181 | while not style == 'display: block;': 182 | print('正在采集。。。\n') 183 | time.sleep(seconds) 184 | # 此处模拟浏览器滚动, 以采集更多数据 185 | js = "window.scrollTo(0, document.documentElement.scrollHeight*2)" 186 | driver.execute_script(js) 187 | style = driver.find_element_by_class_name('xzl-topic-list-no-topics').get_attribute('style') 188 | selector = Selector(text=driver.page_source) 189 | items = selector.css(u'.topic-body').extract() 190 | print('采集文章数量: %d篇\n'%len(items)) 191 | zl_title = selector.css(u'.zhuanlan-title ::text').extract_first().replace('\n', '').replace(' ', '') 192 | print('目录采集完成, 正在采集的专栏为: ' + zl_title + ', 开始为您创建存储路径\n') 193 | path = os.path.join(os.path.expanduser("~"), 'Desktop') + '/' + xzl_path + zl_title + '/' 194 | print('文件存储位置: ' + path + '\n') 195 | if not os.path.exists(path): 196 | os.makedirs(path) 197 | print('文件夹创建成功\n') 198 | print('开始采集文章详情\n') 199 | for idx, item in enumerate(items): 200 | selector = Selector(text=item) 201 | link = selector.css(u'a::attr(href)').extract_first() 202 | title = selector.css(u'h3::text').extract_first().replace('\n', '').replace(' ', '').replace('/', '-') 203 | detail_url = xzl+link 204 | print('开始采集文章: ' + title + ', 文章地址为: ' + detail_url + '\n') 205 | get_zl_detail(detail_url, path, title) 206 | # 延迟三秒后采集下一文章 207 | time.sleep(seconds) 208 | print('专栏:' + zl_title + '的文章已采集完成' + '\n') 209 | print('我们应该尊重每一位作者的付出, 请不要随意传播下载后的文件\n') 210 | 211 | 212 | # 采集专栏详情 213 | def get_zl_detail(url, path, name): 214 | response = close_session().get(url=url, headers=headers) 215 | selector = 
Selector(text=response.text) 216 | text_maker = ht.HTML2Text() 217 | create_time = selector.css(u'.time abbr::attr(title)').extract_first() 218 | html = selector.css(u'.xzl-topic-body-content').extract_first() 219 | file_name = name 220 | if hasTime: 221 | file_name = create_time+' '+name 222 | if markdown: 223 | md = text_maker.handle(html) 224 | with open(path + file_name + '.md', 'w') as f: 225 | f.write(md) 226 | else: 227 | # 在html中加入编码, 否则中文会乱码 228 | html = " " + html + "" 229 | pdfkit.from_string(html, path + file_name + '.pdf') 230 | 231 | 232 | # 关闭多余连接 233 | def close_session(): 234 | request = requests.session() 235 | # 关闭多余连接 236 | request.keep_alive = False 237 | return request 238 | 239 | 240 | if __name__ == '__main__': 241 | print('我们应该尊重每一位作者的付出, 请不要随意传播下载后的文件\n') 242 | print('当浏览器自动打开后,请勿关闭浏览器,内容采集、导出完成后将会自动关闭\n') 243 | # 增加重连次数 244 | # requests.adapters.DEFAULT_RETRIES = 5 245 | # 获取cookie 246 | fetch_cookie() 247 | # 采集小书 248 | # 专栏地址,仅填写最后一位即可,如:https://xiaozhuanlan.com/ios-interview, 填写/ios-interview即可 249 | # get_xs('/ios-interview') 250 | # 采集专栏 251 | # 专栏地址,仅填写最后一位即可,如:https://xiaozhuanlan.com/The-story-of-the-programmer, 填写/The-story-of-the-programmer即可 252 | get_zl('/The-story-of-the-programmer') 253 | # 采集全部订阅内容 254 | # get_subscribes() 255 | 256 | 257 | -------------------------------------------------------------------------------- /.idea/workspace.xml: -------------------------------------------------------------------------------- 1 | 2 | 3 | 4 | 5 | 11 | 12 | 13 | 14 | 15 | 16 | 17 | 18 | 19 | 20 | 21 | 22 | 23 | 24 | 25 | 26 | 27 | 28 | 29 | 30 | 31 | 32 | 33 | 34 | 35 | 36 | 37 | 38 | 39 | 40 | 45 | 46 | 47 | 48 | 已经 49 | xzl-topic-item 50 | 51 | 小时 52 | 开始获取 53 | 获取 54 | _xiao 55 | 56 | 57 | 58 | 59 | 64 | 67 | 68 | 75 | 76 | 77 | 82 | 83 | 84 | 85 | 86 | 87 | 88 | 89 | 90 | 91 | 92 | 93 | 94 | 95 | 96 | 97 | 98 | 99 |