├── README.md
├── cookie.png
└── csdn_to_md.py


/README.md:
--------------------------------------------------------------------------------
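  1 | # Batch-download CSDN articles to local Markdown files
  2 | 
  3 | ## 1. Background
  4 | 
  5 | I am preparing to set up a new blog. CSDN's online editor, publishing flow, columns, custom modules, and templates are fairly mature, but the site has no visual appeal, which is disappointing. Online editing is fast, though, which makes note-taking convenient, and search is passable. Recently, staff from the [Alibaba Cloud Developer Community](https://developer.aliyun.com/) and [InfoQ China](https://www.infoq.cn/) invited me to publish on their platforms, but I would rather try writing notes locally with [Obsidian](https://obsidian.md/) and syncing them to a [GitHub](https://github.com/) or [Gitee](https://gitee.com/) repository. The blog itself may end up on GitHub Pages, published with a tool such as [Hexo](https://hexo.io/zh-cn/) or [Jekyll](https://jekyllcn.com/). Syncing to [Notion](https://www.notion.so/) or [Yuque](https://www.yuque.com/) as well would be even better.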
  6 | 
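  7 | ## 2. Features
  8 | 
  9 | 1. Identify the user's home page from the id passed on the command line.
 10 | 2. Create a directory for the blog and one for each column.
 11 | 3. Fetch each column's URL, name, and article count.
 12 | 4. Use the column URL to collect the URLs and titles of its articles across all pages.
 13 | 5. Walk the article URLs, fetch each article's content using the cookie, and save it as Markdown.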
 14 | 
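Step 4 depends on knowing how many pages a column has. The sketch below shows one way to derive the page URLs from the article count, assuming 40 articles per page and the `_<n>.html` suffix that `csdn_to_md.py` uses. The helper name `column_page_urls` and the example URL are hypothetical and not part of the script.

```python
def column_page_urls(column_url, article_count, per_page=40):
    """Return the URL of every page of a column, first page first."""
    # Ceiling division: a partially filled last page still counts as a page.
    pages = max(1, -(-article_count // per_page))
    base = column_url.split('.html')[0]
    # Page 1 is the column URL itself; later pages insert "_<n>" before ".html".
    return [column_url] + [f"{base}_{n}.html" for n in range(2, pages + 1)]


if __name__ == "__main__":
    # Hypothetical column URL, used only for illustration.
    for page_url in column_page_urls("https://blog.csdn.net/someuser/category_1.html", 100):
        print(page_url)
```

Rounding up matters here: with `round()`, a column of 50 articles would be treated as a single page and the second page would be skipped.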
 15 | ## 3. Download
 16 | 
 17 | ```bash
 18 | $ git clone https://github.com/Ghostwritten/csdn_to_md.git 
 19 | ```
 20 | 
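 21 | ## 4. Configuration
 22 | 
 23 | Log in to CSDN in Chrome, press F12 to open the developer tools, and find the cookie your browser sends for the page. Copy the relevant part of it into the commented-out `"Cookie"` header in `request_md` (line 116 of `csdn_to_md.py`) and uncomment that line.
 24 | 
 25 | ![Getting the cookie](https://github.com/Ghostwritten/csdn_to_md/blob/main/cookie.png)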
 26 | 
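If you would rather not paste a session cookie into the source file, one possible alternative is to read it from the environment. This is only a sketch: the `CSDN_COOKIE` variable name and the `build_headers` helper are assumptions for illustration and do not exist in `csdn_to_md.py`.

```python
import os

# Shortened user-agent string, used here only as a placeholder.
USER_AGENT = "Mozilla/5.0 (Windows NT 10.0; Win64; x64)"


def build_headers(env_var="CSDN_COOKIE"):
    """Build request headers, adding Cookie only when the variable is set."""
    headers = {"user-agent": USER_AGENT}
    cookie = os.environ.get(env_var, "").strip()
    if cookie:
        headers["Cookie"] = cookie
    return headers


if __name__ == "__main__":
    # Prints the header names that would be sent.
    print(sorted(build_headers()))
```

The resulting dictionary could be passed as `headers=` to `requests.get` in place of the hard-coded one, which keeps the cookie out of version control.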
 27 | ## 5. Demo
 28 | * [Watch the video](https://www.youtube.com/watch?v=vVJJDB0xQfA&t=25s)
 29 | 
 30 | ```bash
 31 | $ python3 csdn_to_md.py -i  xixihahalelehehe
 32 | download blog markdown blog:【helm】helm_快速学习手册
 33 | download blog markdown blog:【helm】如何开发一个完整的Helm_charts应用实例
 34 | download blog markdown blog:【helm】helm_将yaml文件转换json的插件helm-schema-gen
 35 | download blog markdown blog:【helm】helm_NOTES.txt
 36 | download blog markdown blog:【helm】helm_test_测试详解
 37 | download blog markdown blog:【helm】helm_charts_入门指南
 38 | download blog markdown blog:【helm】openshift_Certified_Helm_Charts_实践
 39 | download blog markdown blog:【helm】Helm_Values.yaml
 40 | ......
 41 | 
 42 | 
 43 | $ cd xixihahalelehehe 
 44 | 
 45 | $ /xixihahalelehehe# tree 
 46 | .
 47 | ├── ansible
 48 | │   ├── anible_【模块】_notify.md
 49 | │   ├── ansbile【模块】replace_替换.md
 50 | │   ├── ansbile_模块开发-自定义模块.md
 51 | │   ├── ansible_assert_模块.md
 52 | │   ├── ansible_become配置.md
 53 | │   ├── ansible_cron_模块.md
 54 | │   ├── ansible_debug模块.md
 55 | │   ├── ansible_delegate_to_模块.md
 56 | │   ├── ansible_file模块详解.md
 57 | │   ├── ansible_gather_facts配置.md
 58 | │   ├── ansible_hosts_and_groups配置.md
 59 | │   ├── ansible_jinja2详解.md
 60 | │   ├── ansible-playbook_role角色.md
 61 | │   ├── ansible-playbook实战.md
 62 | │   ├── ansible_script模块.md
 63 | │   ├── ansible_set_fact模块.md
 64 | │   ├── ansible_URI模块.md
 65 | │   ├── ansible【任务】安装httpd.md
 66 | │   ├── ansible变量.md
 67 | │   ├── ansible_安装.md
 68 | │   ├── ansible_快速学习手册.md
 69 | │   ├── ansible【模块】add_host.md
 70 | │   ├── ansible【模块】blockinfile.md
 71 | │   ├── ansible_【模块】find.md
 72 | │   ├── ansible【模块】include_tasks.md
 73 | │   ├── ansible【模块】linefile_文件行处理.md
 74 | │   ├── ansible【模块】modprobe.md
 75 | │   ├── ansible【模块】pause.md
 76 | │   ├── ansible_【模块】sysctl.md
 77 | │   ├── ansible【模块】systemd.md
 78 | │   ├── ansible【模块】template.md
 79 | │   ├── ansible【模块】yum.md
 80 | │   ├── ansible_系统选择性执行脚本.md
 81 | │   ├── ansible远程容器机种方法.md
 82 | │   └── ansible_配置.md
 83 | ├── blog
 84 | │   ├── github如何搭建一个博客.md
 85 | │   ├── jekyll的一个主题TeXt-theme拆解.md
 86 | │   ├── jekyll配置管理github博客.md
 87 | │   ├── 如何使用jekyll插件.md
 88 | │   ├── 如何安装jekyll并搭建一个博客.md
 89 | │   └── 如何购买域名.md
 90 | ├── c++
 91 | │   └── makefile入门.md
 92 | ├── camera
 93 | │   ├── A7R2_图标列表.md
 94 | │   ├── sony_A7R2介绍.md
 95 | │   └── SONY_A7R2_基础操作.md
 96 | ├── Cisco
 97 | │   ├── 运维之思科篇_-----1.VLAN_、_Trunk_、_以太通道及DHCP.md
 98 | │   ├── 运维之思科篇_-----2.vlan间通讯_、_动态路由.md
 99 | │   ├── 运维之思科篇_-----3.HSRP（热备份路由协议），STP（生成树协议），PVST（增强版PST）.md
100 | │   ├── 运维之思科篇_-----4._标准与扩展ACL_、_命名ACL.md
101 | │   ├── 运维之思科篇_-----5._NAT及静态转换_、_动态转换及PAT.md
102 | │   ├── 运维之思科篇_-----6..md
103 | │   └── 运维之思科篇_-----6.思科项目练习.md
104 | 
105 | ```
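The file names in the tree above come from the article titles. As a rough illustration of the rule the script applies (strip surrounding whitespace, then replace spaces and `/` with `_`), here is a small sketch. The helper name `title_to_filename` and the sample titles are hypothetical.

```python
def title_to_filename(title):
    """Turn an article title into a Markdown file name."""
    # Spaces become underscores, and "/" is replaced so that the title
    # cannot be mistaken for a nested path.
    safe_title = title.strip().replace(" ", "_").replace("/", "_")
    return safe_title + ".md"


if __name__ == "__main__":
    print(title_to_filename(" ansible cron 模块 "))
    print(title_to_filename("CI/CD notes"))
```

Other characters that are invalid in file names on some systems (for example `:` or `?` on Windows) are not handled by this rule.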
106 | ## 6. Technologies
107 | 
108 | - [python json](https://blog.csdn.net/xixihahalelehehe/article/details/106550900)
109 | - [python os](https://blog.csdn.net/xixihahalelehehe/article/details/104253123)
110 | - [python time](https://blog.csdn.net/xixihahalelehehe/article/details/108998768)
111 | - [python request](https://blog.csdn.net/xixihahalelehehe/article/details/108996025)
112 | - [python argparse](https://blog.csdn.net/xixihahalelehehe/article/details/121199110)
113 | - [python re](https://blog.csdn.net/xixihahalelehehe/article/details/106247378)
114 | - [python bs4](https://blog.csdn.net/xixihahalelehehe/article/details/124152439)
115 | 
116 | - [python split()](https://blog.csdn.net/xixihahalelehehe/article/details/124547771)
117 | - [python list](https://blog.csdn.net/xixihahalelehehe/article/details/104437743)
118 | - [python division](https://blog.csdn.net/xixihahalelehehe/article/details/124549366)
119 | - [python range()](https://ghostwritten.blog.csdn.net/article/details/124549150)
120 | 
121 | ## 7. References
122 | - [https://blog.csdn.net/pang787559613/article/details/105444286](https://blog.csdn.net/pang787559613/article/details/105444286)
123 | 
124 | 
125 | ---
126 | 


--------------------------------------------------------------------------------
/cookie.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/Ghostwritten/csdn_to_md/6eb909c657e091efb77ec9db99145fb7df6ba209/cookie.png


--------------------------------------------------------------------------------
/csdn_to_md.py:
--------------------------------------------------------------------------------
  1 | #!/usr/bin/env python3
  2 | # -*- coding: utf-8 -*-
  3 | 
  4 | import json
  5 | import os
  6 | import uuid
  7 | import time
  8 | import requests
  9 | import datetime
 10 | import argparse
 11 | import re
 12 | from bs4 import BeautifulSoup
 13 | 
 14 | 
 15 | def request_blog_column(id):
 16 |     """Collect column metadata from the user's home page.
 17 | 
 18 |     Returns a list of [url, name, column id, article count] entries.
 19 |     """
 20 |     headers = {
 21 |       'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/100.0.4896.75 Safari/537.36'
 22 |     }
 23 | 
 24 |     urls = 'https://blog.csdn.net/' + id
 25 |     reply = requests.get(url=urls, headers=headers)
 26 |     parse = BeautifulSoup(reply.content, "html.parser")
 27 |     spans = parse.find_all('a', attrs={'class': 'special-column-name'})
 28 | 
 29 |     blog_columns = []
 30 | 
 31 |     for span in spans:
 32 |         # Pull the column URL out of the anchor's markup.
 33 |         href = re.findall(r'href=\"(.*?)\".*?', str(span), re.S)
 34 |         href = ''.join(href)
 35 | 
 36 |         # Fetch the column page and read its article count (shown as "<n>篇").
 37 |         blog_column_reply = requests.get(url=href, headers=headers)
 38 |         count_pattern = r'<a class="clearfix special-column-name"  href="' + re.escape(href) + r'".*?<span class="special-column-num">(.+?)篇</span>'
 39 |         blogs_num = re.findall(count_pattern, blog_column_reply.text, re.S)
 40 |         if not blogs_num:
 41 |             # Skip a column whose count cannot be parsed instead of raising IndexError.
 42 |             continue
 43 |         blogs_column_num = str(blogs_num[0])
 44 | 
 45 |         blog_column = span.text.strip()
 46 | 
 47 |         # Create ./<id>/<column>; makedirs also creates ./<id> if it is missing.
 48 |         blog_column_dir = './' + str(id) + '/' + str(blog_column)
 49 |         os.makedirs(blog_column_dir, exist_ok=True)
 50 | 
 51 |         # The column id is the text after the last "_" and before the next ".".
 52 |         blog_id = href.split("_")[-1]
 53 |         blog_id = blog_id.split(".")[0]
 54 |         blog_columns.append([href, blog_column, blog_id, blogs_column_num])
 55 | 
 56 |     return blog_columns
 57 | 
 58 | def request_blog_list(id):
 59 |     blog_columns = request_blog_column(id)
 60 |     blogs = []
 61 |     for blog_column in blog_columns:
 62 |         blog_column_url = blog_column[0]
 63 |         blog_column_name = blog_column[1]
 64 |         blog_column_id = blog_column[2]
 65 |         blog_column_num = int(blog_column[3])
 66 |         #print(blog_column_url) 
 67 | 
 68 |         if blog_column_num > 40: 
 69 |            page_num = (blog_column_num + 39) // 40  # round up: 40 articles per page, and round() could drop a partial last page
 70 |            for i in range(page_num,0,-1):
 71 |                blog_column_url = blog_column[0]
 72 |                url_str = blog_column_url.split('.html')[0]
 73 |                blog_column_url = url_str + '_' + str(i) + '.html'
 74 |                append_blog_info(blog_column_url,blog_column_name,blogs)
 75 |            blog_column_url = blog_column[0]
 76 |            blogs = append_blog_info(blog_column_url,blog_column_name,blogs)
 77 |         #   print(blogs)
 78 |                
 79 |         else:
 80 |            blogs = append_blog_info(blog_column_url,blog_column_name,blogs)
 81 |         #   print(blogs)
 82 |    
 83 |     #print(blogs)
 84 |     return blogs
 85 | 
 86 | 
 87 | def append_blog_info(blog_column_url,blog_column_name,blogs):
 88 |     headers = {
 89 |       'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/100.0.4896.75 Safari/537.36'
 90 |     }
 91 |     reply = requests.get(url=blog_column_url,headers=headers)
 92 |     blog_span = BeautifulSoup(reply.content, "html.parser")
 93 |     blogs_list = blog_span.find_all('ul', attrs={'class':'column_article_list'})
 94 |     for arch_blog_info in blogs_list:
 95 |         blogs_list = arch_blog_info.find_all('li')
 96 |         for blog_info in blogs_list:
 97 | 
 98 |             blog_url = blog_info.find('a', attrs={'target':'_blank'})['href']
 99 |             blog_title = blog_info.find('h2', attrs={'class':"title"}).get_text().strip().replace(" ", "_").replace('/','_')
100 |             blog_dict = {'column': blog_column_name,'url': blog_url, 'title': blog_title}
101 |             blogs.append(blog_dict)
102 |     return blogs 
103 | 
104 | def request_md(id):
105 | 
106 |     blogs = request_blog_list(id)
107 |     
108 |     for blog_dict in blogs:
109 |         blog_url = blog_dict['url']
110 |         blog_title = blog_dict['title']
111 |         blog_column = blog_dict['column']
112 |         blog_id = blog_url.split("/")[-1] 
113 |         url = f"https://blog-console-api.csdn.net/v1/editor/getArticle?id={blog_id}"
114 |         headers = {
115 |         
116 |          #"Cookie": "<paste your CSDN cookie string here>",
117 |          "user-agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/76.0.3809.87 Safari/537.36"
118 |            
119 |         }
120 |         # The article id is already in the query string, so no request body is sent.
121 |         reply = requests.get(url, headers=headers)
122 | 
123 |         # Parse the reply inside the try block so that a reply that is not valid
124 |         # JSON is reported for this URL instead of stopping the whole run.
125 |         try:
126 |             reply_data = reply.json()
127 |             content = reply_data["data"]["markdowncontent"].replace("@[toc]", "")
128 |             blog_title_dir = './' + str(id) + '/' + str(blog_column) + '/' + str(blog_title) + '.md'
129 |             with open(blog_title_dir, "w", encoding="utf-8") as f:
130 |                 f.write(content)
131 | 
132 |             print("download blog markdown blog:" + '【' + blog_column + '】' + str(blog_title))
133 |         except Exception as e:
134 |             print("***********************************")
135 |             print(e)
136 |             print(url)
137 | 
138 | 
139 | 
140 | 
141 | def read_from_json(filename):
142 |     # Use a context manager so the file handle is always closed.
143 |     with open(filename, "r", encoding='utf-8') as jsonfile:
144 |         return json.load(jsonfile)
145 | 
146 | def write_to_json(filename, data):
147 |     with open(filename, "w", encoding='utf-8') as jsonfile:
148 |         jsonfile.write(data)
149 | 
150 | 
151 | 
152 | def main():
153 |     parser = argparse.ArgumentParser()
154 | 
155 |     parser.add_argument('-i', '--id', dest='id',type=str,  required=True, help='csdn name')
156 | 
157 |     args = parser.parse_args()
158 |     
159 |     csdn_id = args.id
160 | 
161 |     name_dir = './' + str(csdn_id)
162 |     if not os.path.exists(name_dir):
163 |        os.mkdir(name_dir)
164 | 
165 |     request_md(csdn_id)
166 | 
167 | if __name__ == '__main__':
168 |     main()
169 | 


--------------------------------------------------------------------------------