├── README.md
├── cookie.png
└── csdn_to_md.py

/README.md:
--------------------------------------------------------------------------------
# Batch-convert CSDN articles to Markdown and download them locally

## 1. Background

I have recently been preparing to set up a new blog. CSDN's online editor, publishing flow, columns, custom modules, and templates are fairly mature, but the site has no visual appeal, which is disappointing. Online editing is fast, though, note-taking is convenient, and search is decent. The operations staff of the [Alibaba Cloud Developer Community](https://developer.aliyun.com/) and the [InfoQ China community](https://www.infoq.cn/) recently invited me to publish on their platforms, but I would rather try writing notes locally with [Obsidian](https://obsidian.md/) and syncing them to a [github](https://github.com/) or [gitee](https://gitee.com/) repository. The blog itself may be hosted on GitHub Pages and published with a tool such as [hexo](https://hexo.io/zh-cn/) or [Jekyll](https://jekyllcn.com/). Syncing to [notion](https://www.notion.so/) or [Yuque](https://www.yuque.com/) as well would be even better.

## 2. Features

1. Locate the user's home page from the ID passed in;
2. Create the blog directory and one directory per column;
3. Get each column's URL, name, and article count;
4. Use the column URL to get the article URLs and titles from each of its pages;
5. Iterate over the article URLs, fetch each article's content with the cookie, and save it as Markdown.

## 3. Download

```bash
$ git clone https://github.com/Ghostwritten/csdn_to_md.git
```

## 4. Configuration

Log in to CSDN in Chrome, press "F12", and find the cookie for your page. Copy the relevant part of the cookie into the commented-out `"Cookie"` entry of the `headers` dict in `request_md()` in `csdn_to_md.py`, then uncomment that line.

![Getting the cookie](https://github.com/Ghostwritten/csdn_to_md/blob/main/cookie.png)

## 5. Demo
* [Watch the video](https://www.youtube.com/watch?v=vVJJDB0xQfA&t=25s)

```bash
$ python3 csdn_to_md.py -i xixihahalelehehe
download blog markdown blog:【helm】helm_快速学习手册
download blog markdown blog:【helm】如何开发一个完整的Helm_charts应用实例
download blog markdown blog:【helm】helm_将yaml文件转换json的插件helm-schema-gen
download blog markdown blog:【helm】helm_NOTES.txt
download blog markdown blog:【helm】helm_test_测试详解
download blog markdown blog:【helm】helm_charts_入门指南
download blog markdown blog:【helm】openshift_Certified_Helm_Charts_实践
download blog markdown blog:【helm】Helm_Values.yaml
......


$ cd xixihahalelehehe

$ /xixihahalelehehe# tree
.
├── ansible
│   ├── anible_【模块】_notify.md
│   ├── ansbile【模块】replace_替换.md
│   ├── ansbile_模块开发-自定义模块.md
│   ├── ansible_assert_模块.md
│   ├── ansible_become配置.md
│   ├── ansible_cron_模块.md
│   ├── ansible_debug模块.md
│   ├── ansible_delegate_to_模块.md
│   ├── ansible_file模块详解.md
│   ├── ansible_gather_facts配置.md
│   ├── ansible_hosts_and_groups配置.md
│   ├── ansible_jinja2详解.md
│   ├── ansible-playbook_role角色.md
│   ├── ansible-playbook实战.md
│   ├── ansible_script模块.md
│   ├── ansible_set_fact模块.md
│   ├── ansible_URI模块.md
│   ├── ansible【任务】安装httpd.md
│   ├── ansible变量.md
│   ├── ansible_安装.md
│   ├── ansible_快速学习手册.md
│   ├── ansible【模块】add_host.md
│   ├── ansible【模块】blockinfile.md
│   ├── ansible_【模块】find.md
│   ├── ansible【模块】include_tasks.md
│   ├── ansible【模块】linefile_文件行处理.md
│   ├── ansible【模块】modprobe.md
│   ├── ansible【模块】pause.md
│   ├── ansible_【模块】sysctl.md
│   ├── ansible【模块】systemd.md
│   ├── ansible【模块】template.md
│   ├── ansible【模块】yum.md
│   ├── ansible_系统选择性执行脚本.md
│   ├── ansible远程容器机种方法.md
│   └── ansible_配置.md
├── blog
│   ├── github如何搭建一个博客.md
│   ├── jekyll的一个主题TeXt-theme拆解.md
│   ├── jekyll配置管理github博客.md
│   ├── 如何使用jekyll插件.md
│   ├── 如何安装jekyll并搭建一个博客.md
│   └── 如何购买域名.md
├── c++
│   └── makefile入门.md
├── camera
│   ├── A7R2_图标列表.md
│   ├── sony_A7R2介绍.md
│   └── SONY_A7R2_基础操作.md
├── Cisco
│   ├── 运维之思科篇_-----1.VLAN_、_Trunk_、_以太通道及DHCP.md
│   ├── 运维之思科篇_-----2.vlan间通讯_、_动态路由.md
│   ├── 运维之思科篇_-----3.HSRP(热备份路由协议),STP(生成树协议),PVST(增强版PST).md
│   ├── 运维之思科篇_-----4._标准与扩展ACL_、_命名ACL.md
│   ├── 运维之思科篇_-----5._NAT及静态转换_、_动态转换及PAT.md
│   ├── 运维之思科篇_-----6..md
│   └── 运维之思科篇_-----6.思科项目练习.md

```

## 6. Techniques

- [python json](https://blog.csdn.net/xixihahalelehehe/article/details/106550900)
- [python os](https://blog.csdn.net/xixihahalelehehe/article/details/104253123)
- [python time](https://blog.csdn.net/xixihahalelehehe/article/details/108998768)
- [python requests](https://blog.csdn.net/xixihahalelehehe/article/details/108996025)
- [python argparse](https://blog.csdn.net/xixihahalelehehe/article/details/121199110)
- [python re](https://blog.csdn.net/xixihahalelehehe/article/details/106247378)
- [python bs4](https://blog.csdn.net/xixihahalelehehe/article/details/124152439)

- [python split()](https://blog.csdn.net/xixihahalelehehe/article/details/124547771)
- [python list](https://blog.csdn.net/xixihahalelehehe/article/details/104437743)
- [python division](https://blog.csdn.net/xixihahalelehehe/article/details/124549366)
- [python range()](https://ghostwritten.blog.csdn.net/article/details/124549150)

## 7. References
- [https://blog.csdn.net/pang787559613/article/details/105444286](https://blog.csdn.net/pang787559613/article/details/105444286)


---

--------------------------------------------------------------------------------
/cookie.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/Ghostwritten/csdn_to_md/6eb909c657e091efb77ec9db99145fb7df6ba209/cookie.png

--------------------------------------------------------------------------------
/csdn_to_md.py:
--------------------------------------------------------------------------------
#!/usr/bin/env python
# -*- coding: utf-8 -*-

import argparse
import json
import math
import os
import re

import requests
from bs4 import BeautifulSoup

# Browser-like User-Agent sent with every request.
HEADERS = {
    'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/100.0.4896.75 Safari/537.36'
}

# Number of articles listed on one column page.
PAGE_SIZE = 40


def request_blog_column(user_id):
    """Return [url, name, column_id, article_count] for each column on the user's home page."""
    home_url = 'https://blog.csdn.net/' + user_id
    reply = requests.get(url=home_url, headers=HEADERS)
    parse = BeautifulSoup(reply.content, "html.parser")
    links = parse.find_all('a', attrs={'class': 'special-column-name'})

    blog_columns = []

    for link in links:
        # Read the href attribute directly instead of regex-matching the tag's HTML.
        href = link.get('href', '')

        blog_column_reply = requests.get(url=href, headers=HEADERS)
        # Assumes the first "<digits>篇" on the column page is its article count.
        blogs_num = re.findall(r'(\d+)\s*篇', blog_column_reply.text)
        # Fall back to 0 so that at least the first page is still fetched.
        blogs_column_num = blogs_num[0] if blogs_num else '0'

        # '/' would create nested directories, so replace it as the titles do.
        blog_column = link.text.strip().replace('/', '_')

        blog_column_dir = './' + str(user_id) + '/' + blog_column
        os.makedirs(blog_column_dir, exist_ok=True)

        # The column id is the part between the last '_' and the file extension.
        blog_id = href.split("_")[-1].split(".")[0]
        blog_columns.append([href, blog_column, blog_id, blogs_column_num])

    return blog_columns


def request_blog_list(user_id):
    """Collect a {'column', 'url', 'title'} dict for every article in every column."""
    blog_columns = request_blog_column(user_id)
    blogs = []
    for blog_column_url, blog_column_name, _blog_column_id, blog_column_num in blog_columns:
        blog_column_num = int(blog_column_num)

        if blog_column_num > PAGE_SIZE:
            # Round up so a partly filled last page is not skipped;
            # round() gave 1 page for 50 articles and 2 pages for 100.
            page_num = math.ceil(blog_column_num / PAGE_SIZE)
            url_str = blog_column_url.split('.html')[0]
            for i in range(page_num, 0, -1):
                page_url = url_str + '_' + str(i) + '.html'
                append_blog_info(page_url, blog_column_name, blogs)

        # The unsuffixed column URL is fetched in every case.
        append_blog_info(blog_column_url, blog_column_name, blogs)

    return blogs


def append_blog_info(blog_column_url, blog_column_name, blogs):
    """Append the articles listed on one column page to blogs and return it."""
    reply = requests.get(url=blog_column_url, headers=HEADERS)
    page = BeautifulSoup(reply.content, "html.parser")
    article_lists = page.find_all('ul', attrs={'class': 'column_article_list'})
    for article_list in article_lists:
        for blog_info in article_list.find_all('li'):
            blog_url = blog_info.find('a', attrs={'target': '_blank'})['href']
            # Spaces and slashes are replaced so the title is usable as a file name.
            blog_title = blog_info.find('h2', attrs={'class': "title"}).get_text().strip().replace(" ", "_").replace('/', '_')
            blogs.append({'column': blog_column_name, 'url': blog_url, 'title': blog_title})
    return blogs


def request_md(user_id):
    """Download every article as Markdown into ./<user_id>/<column>/<title>.md."""
    blogs = request_blog_list(user_id)

    for blog_dict in blogs:
        blog_url = blog_dict['url']
        blog_title = blog_dict['title']
        blog_column = blog_dict['column']
        blog_id = blog_url.split("/")[-1]
        url = f"https://blog-console-api.csdn.net/v1/editor/getArticle?id={blog_id}"
        headers = {
            # Uncomment and paste the cookie copied from your logged-in browser session.
            # "Cookie": "UserName=...; UserInfo=...; UserToken=...; ...",
            **HEADERS,
        }

        try:
            # The article id is already in the query string, so no request body is sent.
            reply = requests.get(url, headers=headers)
            reply_data = reply.json()
            content = reply_data["data"]["markdowncontent"].replace("@[toc]", "")
            blog_path = './' + str(user_id) + '/' + str(blog_column) + '/' + str(blog_title) + '.md'
            with open(blog_path, "w", encoding="utf-8") as f:
                f.write(content)

            print("download blog markdown blog:" + '【' + blog_column + '】' + str(blog_title))
        except Exception as e:
            # Report the failing article and continue with the next one.
            print("***********************************")
            print(e)
            print(url)


def read_from_json(filename):
    with open(filename, "r", encoding='utf-8') as jsonfile:
        return json.loads(jsonfile.read())


def write_to_json(filename, data):
    with open(filename, "w", encoding='utf-8') as jsonfile:
        jsonfile.write(data)


def main():
    parser = argparse.ArgumentParser()
    parser.add_argument('-i', '--id', dest='id', type=str, required=True, help='csdn name')
    args = parser.parse_args()

    csdn_id = args.id

    name_dir = './' + str(csdn_id)
    os.makedirs(name_dir, exist_ok=True)

    request_md(csdn_id)


if __name__ == '__main__':
    main()
--------------------------------------------------------------------------------
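
The page arithmetic in `request_blog_list` can be checked offline, without touching CSDN. The following is a minimal sketch, assuming each column page lists 40 articles and that numbered pages are named by appending `_<n>` before `.html`, as the script builds them. `column_page_urls` and the example URL are hypothetical and not part of the repository.

```python
import math

PAGE_SIZE = 40  # assumed articles per column page, taken from the script's constant 40


def column_page_urls(column_url, article_count, page_size=PAGE_SIZE):
    """Return the page URLs needed to cover article_count articles.

    Hypothetical helper: mirrors the "<column>_<n>.html" naming the script uses.
    """
    if article_count <= page_size:
        # A single page: the unsuffixed column URL is enough.
        return [column_url]
    # Round up: a partly filled last page still has to be fetched.
    pages = math.ceil(article_count / page_size)
    stem = column_url.split('.html')[0]
    return [f"{stem}_{n}.html" for n in range(1, pages + 1)]
```

Rounding up matters here because Python's `round()` rounds halves to the nearest even number, so `round(100 / 40)` is 2 and the third page of a 100-article column would never be requested.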