├── red_book_spider.py
├── config.py
└── README.md

/red_book_spider.py:
--------------------------------------------------------------------------------
#!/usr/bin/env python
# writer: yuej1njia0ke

--------------------------------------------------------------------------------
/config.py:
--------------------------------------------------------------------------------
# Set your cookie value here
cookie = ''
# note_id = '64449a34000000002702a9e8'
--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
# xiaohongshu_spider

Scrapes comments from Xiaohongshu (小红书).

Note: this code is a hobbyist exploration only. Do not use it commercially or for illegal purposes; you bear full responsibility for any consequences, and the author accepts no liability.

## 1. Overview

The scraped data includes:

```
commenter nickname, id, comment level, comment content
```

A screenshot first:

![image-20240209122147148](https://gitee.com/yuejinjianke/tuchuang/raw/master/image/image-20240209122147148.png)

## 2. Crawling process

![image-20240208112541353](https://gitee.com/yuejinjianke/tuchuang/raw/master/image/image-20240208112541353.png)

Open a Xiaohongshu page, press F12, and inspect the XHR requests to find the relevant one:

![image-20240209122031074](https://gitee.com/yuejinjianke/tuchuang/raw/master/image/image-20240209122031074.png)

All the content sits under `comments`, and paging is driven by the `cursor` parameter. The logic looks like this:

```python
# the cursor for the next page comes from the previous response's payload
next_cursor = json_text['data']['cursor']

if page == 1:
    url = 'https://edith.xiaohongshu.com/api/sns/web/v2/comment/page?note_id={}&cursor=&top_comment_id=&image_formats=jpg,webp,avif'.format(note_id)
else:
    print(colorama.Fore.GREEN + "[info] entering the next iteration")
    url = 'https://edith.xiaohongshu.com/api/sns/web/v2/comment/page?note_id={}&cursor={}&top_comment_id=&image_formats=jpg,webp,avif'.format(note_id, next_cursor)
```

How do we know when crawling is finished?
![image-20240209122855262](https://gitee.com/yuejinjianke/tuchuang/raw/master/image/image-20240209122855262.png)

As long as this parameter is true, there are more comments to fetch.

The data-processing step is here:

![image-20240209123048799](https://gitee.com/yuejinjianke/tuchuang/raw/master/image/image-20240209123048799.png)

How can we save time by crawling concurrently?

![image-20240209124738104](https://gitee.com/yuejinjianke/tuchuang/raw/master/image/image-20240209124738104.png)

The overall result looks like this:

![image-20240209131356340](https://gitee.com/yuejinjianke/tuchuang/raw/master/image/image-20240209131356340.png)

![image-20240209131817005](https://gitee.com/yuejinjianke/tuchuang/raw/master/image/image-20240209131817005.png)

A link to the full code is up on GitHub; take it if you need it.

## 3. README

Fill in your own cookie in the config file.

Xiaohongshu has anti-crawling measures, so you have to find the `note_id` of the post you want to scrape yourself:

![image-20240209174346194](https://gitee.com/yuejinjianke/tuchuang/raw/master/image/image-20240209174346194.png)

After searching for a keyword, use F12 to grab the `note_id`. It is recommended to sort by "hottest", so the posts have enough comments to satisfy your data needs.

Once you have it, initialize it in the main function here:

![image-20240209174603533](https://gitee.com/yuejinjianke/tuchuang/raw/master/image/image-20240209174603533.png)

Official WeChat account post:

https://mp.weixin.qq.com/s/3mZ66SBusCsZg7lqcusMzg

The source code has since been taken down; if you need it, reach out on WeChat via the official account.

--------------------------------------------------------------------------------
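The cursor paging, the "keep going while the flag is true" termination check, and the concurrent crawling described above can be sketched together as below. This is a minimal illustrative sketch, not the author's actual code: the endpoint URL and `data.cursor` come from the README, but the response field names `has_more`, `comments`, `sub_comments`, `user_info`, `nickname`, and `user_id` are assumptions inferred from the screenshots. The network call is injected as a `fetch_json(url) -> dict` callable so the paging logic itself stays testable; in real use that callable would wrap `requests.get` with your cookie from `config.py`.

```python
from concurrent.futures import ThreadPoolExecutor

BASE = ('https://edith.xiaohongshu.com/api/sns/web/v2/comment/page'
        '?note_id={}&cursor={}&top_comment_id=&image_formats=jpg,webp,avif')


def build_comment_url(note_id, cursor=''):
    # First page uses an empty cursor; later pages pass the cursor
    # returned by the previous response.
    return BASE.format(note_id, cursor)


def parse_page(payload):
    """Extract (nickname, user_id, level, content) rows from one page of JSON.

    NOTE: the field names below are assumed, not confirmed API fields.
    """
    data = payload.get('data', {})
    rows = []
    for c in data.get('comments', []):
        user = c.get('user_info', {})
        rows.append((user.get('nickname'), user.get('user_id'),
                     1, c.get('content')))                 # level 1: top-level comment
        for sub in c.get('sub_comments', []):              # level 2: replies, if present
            sub_user = sub.get('user_info', {})
            rows.append((sub_user.get('nickname'), sub_user.get('user_id'),
                         2, sub.get('content')))
    return rows, data.get('cursor', ''), data.get('has_more', False)


def crawl(note_id, fetch_json):
    # Page through one note's comments until the has_more flag goes false.
    cursor, has_more, all_rows = '', True, []
    while has_more:
        rows, cursor, has_more = parse_page(fetch_json(build_comment_url(note_id, cursor)))
        all_rows.extend(rows)
    return all_rows


def crawl_many(note_ids, fetch_json, workers=4):
    # Concurrency sketch: crawl several notes in parallel, one thread per note.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(lambda nid: crawl(nid, fetch_json), note_ids))
```

In real use, `fetch_json` would be something like `lambda url: requests.get(url, headers={'Cookie': cookie}).json()`, with a polite delay between requests.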