├── red_book_spider.py
├── config.py
└── README.md

/red_book_spider.py:
--------------------------------------------------------------------------------
#!/usr/bin/env python
# writer: yuej1njia0ke

--------------------------------------------------------------------------------
/config.py:
--------------------------------------------------------------------------------
# Set your cookie value here
cookie = ''
# note_id = '64449a34000000002702a9e8'
--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
# xiaohongshu_spider

Scrapes comments from Xiaohongshu (小红书).

Note: this code is a hobbyist exploration only. Do not use it commercially or for illegal purposes; you bear full responsibility for any consequences, and the author accepts no liability.

## 1. Overview

The scraped data includes:

```
commenter nickname, id, comment level, comment content
```

A screenshot first:

![image-20240209122147148](https://gitee.com/yuejinjianke/tuchuang/raw/master/image/image-20240209122147148.png)

## 2. Crawling process

![image-20240208112541353](https://gitee.com/yuejinjianke/tuchuang/raw/master/image/image-20240208112541353.png)

Open a Xiaohongshu page, press F12, and inspect the XHR requests to find the relevant one:

![image-20240209122031074](https://gitee.com/yuejinjianke/tuchuang/raw/master/image/image-20240209122031074.png)

All the content sits under `comments`, and paging is driven by the `cursor` parameter. The logic looks like this:

```python
# the cursor for the next page comes from the previous response's payload
next_cursor = json_text['data']['cursor']

if page == 1:
    url = 'https://edith.xiaohongshu.com/api/sns/web/v2/comment/page?note_id={}&cursor=&top_comment_id=&image_formats=jpg,webp,avif'.format(note_id)
else:
    print(colorama.Fore.GREEN + "[info] entering the next iteration")
    url = 'https://edith.xiaohongshu.com/api/sns/web/v2/comment/page?note_id={}&cursor={}&top_comment_id=&image_formats=jpg,webp,avif'.format(note_id, next_cursor)
```

How do we know when crawling is finished?
![image-20240209122855262](https://gitee.com/yuejinjianke/tuchuang/raw/master/image/image-20240209122855262.png)

As long as this parameter is true, there are more comments to fetch.

The data-processing step is here:

![image-20240209123048799](https://gitee.com/yuejinjianke/tuchuang/raw/master/image/image-20240209123048799.png)

How can we save time by crawling concurrently?

![image-20240209124738104](https://gitee.com/yuejinjianke/tuchuang/raw/master/image/image-20240209124738104.png)

The overall result looks like this:

![image-20240209131356340](https://gitee.com/yuejinjianke/tuchuang/raw/master/image/image-20240209131356340.png)

![image-20240209131817005](https://gitee.com/yuejinjianke/tuchuang/raw/master/image/image-20240209131817005.png)

A link to the full code is up on GitHub; take it if you need it.

## 3. README

Fill in your own cookie in the config file.

Xiaohongshu has anti-crawling measures, so you have to find the `note_id` of the post you want to scrape yourself:

![image-20240209174346194](https://gitee.com/yuejinjianke/tuchuang/raw/master/image/image-20240209174346194.png)

After searching for a keyword, use F12 to grab the `note_id`. It is recommended to sort by "hottest", so the posts have enough comments to satisfy your data needs.

Once you have it, initialize it in the main function here:

![image-20240209174603533](https://gitee.com/yuejinjianke/tuchuang/raw/master/image/image-20240209174603533.png)

Official WeChat account post:

https://mp.weixin.qq.com/s/3mZ66SBusCsZg7lqcusMzg

The source code has since been taken down; if you need it, reach out on WeChat via the official account.

--------------------------------------------------------------------------------
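The cursor paging, the "keep going while the flag is true" termination check, and the concurrent crawling described above can be sketched together as below. This is a minimal illustrative sketch, not the author's actual code: the endpoint URL and `data.cursor` come from the README, but the response field names `has_more`, `comments`, `sub_comments`, `user_info`, `nickname`, and `user_id` are assumptions inferred from the screenshots. The network call is injected as a `fetch_json(url) -> dict` callable so the paging logic itself stays testable; in real use that callable would wrap `requests.get` with your cookie from `config.py`.

```python
from concurrent.futures import ThreadPoolExecutor

BASE = ('https://edith.xiaohongshu.com/api/sns/web/v2/comment/page'
        '?note_id={}&cursor={}&top_comment_id=&image_formats=jpg,webp,avif')


def build_comment_url(note_id, cursor=''):
    # First page uses an empty cursor; later pages pass the cursor
    # returned by the previous response.
    return BASE.format(note_id, cursor)


def parse_page(payload):
    """Extract (nickname, user_id, level, content) rows from one page of JSON.

    NOTE: the field names below are assumed, not confirmed API fields.
    """
    data = payload.get('data', {})
    rows = []
    for c in data.get('comments', []):
        user = c.get('user_info', {})
        rows.append((user.get('nickname'), user.get('user_id'),
                     1, c.get('content')))                 # level 1: top-level comment
        for sub in c.get('sub_comments', []):              # level 2: replies, if present
            sub_user = sub.get('user_info', {})
            rows.append((sub_user.get('nickname'), sub_user.get('user_id'),
                         2, sub.get('content')))
    return rows, data.get('cursor', ''), data.get('has_more', False)


def crawl(note_id, fetch_json):
    # Page through one note's comments until the has_more flag goes false.
    cursor, has_more, all_rows = '', True, []
    while has_more:
        rows, cursor, has_more = parse_page(fetch_json(build_comment_url(note_id, cursor)))
        all_rows.extend(rows)
    return all_rows


def crawl_many(note_ids, fetch_json, workers=4):
    # Concurrency sketch: crawl several notes in parallel, one thread per note.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(lambda nid: crawl(nid, fetch_json), note_ids))
```

In real use, `fetch_json` would be something like `lambda url: requests.get(url, headers={'Cookie': cookie}).json()`, with a polite delay between requests.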