├── cookie.txt ├── 多视频评论信息.xls ├── requirements.txt ├── 主文件中的代码 ├── mask.png ├── 评论信息.xls ├── 评论信息.xlsx ├── 评论信息1.xls ├── 创作者评估报告.xlsx ├── 情感输出结果.xlsx ├── 评论信息_清洗去重后.xlsx ├── 创作者评估报告_bert.xlsx ├── 情感输出结果_bert.xlsx ├── word_cloud_image.png ├── sentiment.bilibili_comment.3 ├── __pycache__ │ ├── Data_cleaning.cpython-311.pyc │ ├── Data_crawling.cpython-311.pyc │ ├── bert_analysis.cpython-311.pyc │ ├── Data_visualization.cpython-311.pyc │ └── Word_cloud_image_generation.cpython-311.pyc ├── Data_cleaning.py ├── cookie.txt ├── Word_cloud_image_generation.py ├── bert_analysis.py ├── Data_crawling.py ├── main_file.py ├── Data_visualization.py ├── hit_stopwords.txt └── 所有可视化图表.html ├── 评论情感分析结果_三分类.xlsx ├── 评论情感分析结果_三分类_人工纠错.xlsx ├── 评论情感分析结果_三分类_微调后bert模型批量预测.xlsx ├── snownlp计算正确率.py ├── 计算置信度占比.py ├── README.md ├── Data_crawling.py ├── README-english.md ├── 模型三分类.py ├── 模型三分类_微调后的bert模型.py ├── 数据爬取.py └── 情感分析_bert微调_demo (1).ipynb /cookie.txt: -------------------------------------------------------------------------------- 1 | -------------------------------------------------------------------------------- /多视频评论信息.xls: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/ILZFLNO02/Sentiment-analysis-of-Bilibili-comments-based-on-model-fine-tuning/HEAD/多视频评论信息.xls -------------------------------------------------------------------------------- /requirements.txt: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/ILZFLNO02/Sentiment-analysis-of-Bilibili-comments-based-on-model-fine-tuning/HEAD/requirements.txt -------------------------------------------------------------------------------- /主文件中的代码/mask.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/ILZFLNO02/Sentiment-analysis-of-Bilibili-comments-based-on-model-fine-tuning/HEAD/主文件中的代码/mask.png -------------------------------------------------------------------------------- /主文件中的代码/评论信息.xls: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/ILZFLNO02/Sentiment-analysis-of-Bilibili-comments-based-on-model-fine-tuning/HEAD/主文件中的代码/评论信息.xls -------------------------------------------------------------------------------- /主文件中的代码/评论信息.xlsx: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/ILZFLNO02/Sentiment-analysis-of-Bilibili-comments-based-on-model-fine-tuning/HEAD/主文件中的代码/评论信息.xlsx -------------------------------------------------------------------------------- /主文件中的代码/评论信息1.xls: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/ILZFLNO02/Sentiment-analysis-of-Bilibili-comments-based-on-model-fine-tuning/HEAD/主文件中的代码/评论信息1.xls -------------------------------------------------------------------------------- /评论情感分析结果_三分类.xlsx: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/ILZFLNO02/Sentiment-analysis-of-Bilibili-comments-based-on-model-fine-tuning/HEAD/评论情感分析结果_三分类.xlsx -------------------------------------------------------------------------------- /主文件中的代码/创作者评估报告.xlsx: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/ILZFLNO02/Sentiment-analysis-of-Bilibili-comments-based-on-model-fine-tuning/HEAD/主文件中的代码/创作者评估报告.xlsx 
-------------------------------------------------------------------------------- /主文件中的代码/情感输出结果.xlsx: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/ILZFLNO02/Sentiment-analysis-of-Bilibili-comments-based-on-model-fine-tuning/HEAD/主文件中的代码/情感输出结果.xlsx -------------------------------------------------------------------------------- /主文件中的代码/评论信息_清洗去重后.xlsx: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/ILZFLNO02/Sentiment-analysis-of-Bilibili-comments-based-on-model-fine-tuning/HEAD/主文件中的代码/评论信息_清洗去重后.xlsx -------------------------------------------------------------------------------- /评论情感分析结果_三分类_人工纠错.xlsx: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/ILZFLNO02/Sentiment-analysis-of-Bilibili-comments-based-on-model-fine-tuning/HEAD/评论情感分析结果_三分类_人工纠错.xlsx -------------------------------------------------------------------------------- /主文件中的代码/创作者评估报告_bert.xlsx: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/ILZFLNO02/Sentiment-analysis-of-Bilibili-comments-based-on-model-fine-tuning/HEAD/主文件中的代码/创作者评估报告_bert.xlsx -------------------------------------------------------------------------------- /主文件中的代码/情感输出结果_bert.xlsx: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/ILZFLNO02/Sentiment-analysis-of-Bilibili-comments-based-on-model-fine-tuning/HEAD/主文件中的代码/情感输出结果_bert.xlsx -------------------------------------------------------------------------------- /主文件中的代码/word_cloud_image.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/ILZFLNO02/Sentiment-analysis-of-Bilibili-comments-based-on-model-fine-tuning/HEAD/主文件中的代码/word_cloud_image.png -------------------------------------------------------------------------------- /评论情感分析结果_三分类_微调后bert模型批量预测.xlsx: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/ILZFLNO02/Sentiment-analysis-of-Bilibili-comments-based-on-model-fine-tuning/HEAD/评论情感分析结果_三分类_微调后bert模型批量预测.xlsx -------------------------------------------------------------------------------- /主文件中的代码/sentiment.bilibili_comment.3: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/ILZFLNO02/Sentiment-analysis-of-Bilibili-comments-based-on-model-fine-tuning/HEAD/主文件中的代码/sentiment.bilibili_comment.3 -------------------------------------------------------------------------------- /主文件中的代码/__pycache__/Data_cleaning.cpython-311.pyc: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/ILZFLNO02/Sentiment-analysis-of-Bilibili-comments-based-on-model-fine-tuning/HEAD/主文件中的代码/__pycache__/Data_cleaning.cpython-311.pyc -------------------------------------------------------------------------------- /主文件中的代码/__pycache__/Data_crawling.cpython-311.pyc: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/ILZFLNO02/Sentiment-analysis-of-Bilibili-comments-based-on-model-fine-tuning/HEAD/主文件中的代码/__pycache__/Data_crawling.cpython-311.pyc -------------------------------------------------------------------------------- 
/主文件中的代码/__pycache__/bert_analysis.cpython-311.pyc: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/ILZFLNO02/Sentiment-analysis-of-Bilibili-comments-based-on-model-fine-tuning/HEAD/主文件中的代码/__pycache__/bert_analysis.cpython-311.pyc -------------------------------------------------------------------------------- /主文件中的代码/__pycache__/Data_visualization.cpython-311.pyc: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/ILZFLNO02/Sentiment-analysis-of-Bilibili-comments-based-on-model-fine-tuning/HEAD/主文件中的代码/__pycache__/Data_visualization.cpython-311.pyc -------------------------------------------------------------------------------- /主文件中的代码/__pycache__/Word_cloud_image_generation.cpython-311.pyc: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/ILZFLNO02/Sentiment-analysis-of-Bilibili-comments-based-on-model-fine-tuning/HEAD/主文件中的代码/__pycache__/Word_cloud_image_generation.cpython-311.pyc -------------------------------------------------------------------------------- /snownlp计算正确率.py: -------------------------------------------------------------------------------- 1 | import pandas as pd 2 | from snownlp import SnowNLP 3 | from sklearn.metrics import accuracy_score, f1_score 4 | 5 | # 读取数据 6 | df = pd.read_excel(r'e:\bilibili_train\评论情感分析结果_三分类_人工纠错.xlsx') 7 | 8 | # 标签映射(首字母大写) 9 | label_map = {"Negative": 0, "Neutral": 1, "Positive": 2} 10 | 11 | # 只保留有效标签的行 12 | df = df[df['emotion'].isin(label_map.keys())] 13 | 14 | labels = [label_map[x] for x in df['emotion']] 15 | comments = df['comment'].astype(str).tolist() 16 | 17 | # SnowNLP预测 18 | def snownlp_predict(text): 19 | s = SnowNLP(text) 20 | score = s.sentiments 21 | # 三分类阈值划分 22 | if score > 0.6: 23 | return 2 # Positive 24 | elif score < 0.4: 25 | return 0 # Negative 26 | else: 27 | return 1 # Neutral 28 | 29 | preds = [snownlp_predict(c) for c in comments] 30 | 31 | # 计算准确率和F1值 32 | acc = accuracy_score(labels, preds) 33 | f1 = f1_score(labels, preds, average='macro') 34 | 35 | print(f"SnowNLP三分类准确率: {acc:.4f}") 36 | print(f"SnowNLP三分类F1值: {f1:.4f}") -------------------------------------------------------------------------------- /主文件中的代码/Data_cleaning.py: -------------------------------------------------------------------------------- 1 | import pandas as pd 2 | import re 3 | # from tkinter import messagebox # <-- 注释掉或删除 tkinter messagebox 的导入,因为不再需要 4 | 5 | def clean_chinese_characters(filepath): 6 | # 定义一个函数来保留中文字符 7 | def keep_chinese_characters(text): 8 | # 使用正则表达式匹配中文字符 9 | chinese_characters = re.findall(r'[\u4e00-\u9fff]+', text) 10 | # 读取停用词文件 11 | with open('hit_stopwords.txt', 'r', encoding='utf-8') as f: 12 | stopwords = [word.strip() for word in f.readlines()] 13 | # 将中文字符分词并去除停用词 14 | words = [word for word in ''.join(chinese_characters).split() if word not in stopwords] 15 | # 将处理后的词语连接起来 16 | return ''.join(words) 17 | # 读取Excel文件 18 | data = pd.read_excel(filepath) 19 | # 删除整行重复的评论信息 20 | data.drop_duplicates(subset='评论内容', keep='first', inplace=True) 21 | # 清理'评论内容'列,只保留中文字符 22 | data['评论内容'] = data['评论内容'].apply(keep_chinese_characters) 23 | # 将清理后的DataFrame写入新的Excel文件 24 | cleaned_filepath = filepath.replace('.xls', '_清洗去重后.xlsx') 25 | data.to_excel(cleaned_filepath, index=False) 26 | return cleaned_filepath 27 | 28 | #clean_chinese_characters("评论信息.xls") # 你可以保留或注释掉这行测试代码 
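注:上面 Data_cleaning.py 的 keep_chinese_characters 先用 ''.join 把中文字符拼成一个无空格字符串、再按空白 split(),因此停用词过滤实际上几乎不会生效(只有整条评论恰好等于某个停用词时才会被去掉)。下面是一个假设性的替代写法,先用 jieba 分词再逐词过滤;函数名与实现仅作示意,并非仓库中的代码。

```python
# 假设性示意(非仓库实现):用 jieba 分词后逐词过滤停用词,停用词表复用 hit_stopwords.txt
import re
import jieba

def keep_chinese_words(text, stopwords_path='hit_stopwords.txt'):
    # 只保留中文字符
    chinese_only = ''.join(re.findall(r'[\u4e00-\u9fff]+', str(text)))
    # 读取停用词表(实际使用中建议在函数外只加载一次)
    with open(stopwords_path, 'r', encoding='utf-8') as f:
        stopwords = set(line.strip() for line in f)
    # 分词后去掉停用词,再拼回字符串
    words = [w for w in jieba.lcut(chinese_only) if w not in stopwords]
    return ''.join(words)
```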
-------------------------------------------------------------------------------- /主文件中的代码/cookie.txt: -------------------------------------------------------------------------------- 1 | buvid4=8D4A2ADE-CB95-88DF-FD3B-CC9F77BE244869063-022012421-Vk7oLekZ8O%2BWwDC%2BideQ6A%3D%3D; header_theme_version=CLOSE; enable_web_push=DISABLE; CURRENT_BLACKGAP=0; hit-dyn-v2=1; CURRENT_QUALITY=80; is-2022-channel=1; _uuid=9254D896-12710-2A22-6410C-6A56C788799863056infoc; buvid3=AAF79126-1DA9-8B4A-59F3-E32F7485E4B982916infoc; b_nut=1736682182; rpdid=|(J|)Y~Ru)|~0J'u~JmYklkk~; fingerprint=5decd515eea056e6f18864bc721b8fc0; enable_feed_channel=ENABLE; PVID=2; buvid_fp=0d77f3ddc9ac7bcf313f8d0740a2e78b; bili_ticket=eyJhbGciOiJIUzI1NiIsImtpZCI6InMwMyIsInR5cCI6IkpXVCJ9.eyJleHAiOjE3NDg1NjgyMTAsImlhdCI6MTc0ODMwODk1MCwicGx0IjotMX0.9FNGocmR4vhsioNwnytHOhN85w3suRXvdzpHlMBXc54; bili_ticket_expires=1748568150; bp_t_offset_347760015=1071919151548727296; SESSDATA=23c10d86%2C1763966346%2C56c27%2A51CjA-yFsMqbnmQQU1Hr3fdF2pPtkbr5ydSjAPhOCQ74tETRyhpbXXOfIobQuXL51PbfwSVm1TS21OTThPRzd3S2RiOGFsUHV3WTI1WUh3aUFHenU1UE9ueXhiY1lZbjJXV2w2aW15VWxMVzNPWkhiTF9pSzg3blJ6akVEOTRlNVUtTUxnZktISHdnIIEC; bili_jct=cd08e2cd4583aeabdd0de7c288dcd577; DedeUserID=3493077612235621; DedeUserID__ckMd5=5678ad232817698e; bp_t_offset_3493077612235621=1071949525557444608; b_lsid=E211F56D_1971B006654; bmg_af_switch=1; bmg_src_def_domain=i1.hdslb.com; CURRENT_FNVAL=4048; sid=dz3ucw7y; home_feed_column=4; browser_resolution=339-845 -------------------------------------------------------------------------------- /主文件中的代码/Word_cloud_image_generation.py: -------------------------------------------------------------------------------- 1 | import pandas as pd 2 | from wordcloud import WordCloud 3 | import jieba 4 | import matplotlib.pyplot as plt 5 | 6 | def generate_word_cloud(filepath, stopwords_path, mask_image_path, save_image_path): 7 | # 读取Excel文件 8 | data = pd.read_excel(filepath, names=["名字", "性别", "等级", "评论内容", "点赞数"], usecols=[0, 1, 2, 3, 4]) 9 | 10 | # 读取停用词列表 11 | with open(stopwords_path, 'r', encoding='utf-8') as f: 12 | stopwords = f.read().splitlines() 13 | 14 | # 合并所有评论内容为一个字符串 15 | all_comments = ' '.join(data["评论内容"].astype(str)) 16 | 17 | # 分词并去除停用词 18 | data_cut = jieba.lcut(all_comments) 19 | data_after = [i for i in data_cut if i not in stopwords] 20 | 21 | # 统计词频 22 | word_freq = pd.Series(data_after).value_counts() 23 | 24 | # 生成词云图 25 | mask = plt.imread(mask_image_path) 26 | wc = WordCloud(scale=10, 27 | font_path='C:/Windows/Fonts/STXINGKA.TTF', 28 | background_color="white", 29 | mask=mask) 30 | wc.generate_from_frequencies(word_freq.to_dict()) 31 | 32 | # 保存并关闭图形 33 | plt.figure(figsize=(20, 20)) 34 | plt.imshow(wc, interpolation='bilinear') 35 | plt.axis('off') 36 | plt.savefig(save_image_path, bbox_inches='tight', pad_inches=0) 37 | plt.close() # 关闭图形避免内存泄漏 38 | 39 | return save_image_path # 返回保存路径 -------------------------------------------------------------------------------- /主文件中的代码/bert_analysis.py: -------------------------------------------------------------------------------- 1 | import pandas as pd 2 | from transformers import BertTokenizer, BertForSequenceClassification 3 | import torch 4 | 5 | # 全局只加载一次模型和分词器 6 | MODEL_PATH = "e:/bilibili_train/bert_train" 7 | DEVICE = "cuda" if torch.cuda.is_available() else "cpu" 8 | tokenizer = BertTokenizer.from_pretrained(MODEL_PATH) 9 | model = BertForSequenceClassification.from_pretrained(MODEL_PATH) 10 | model.to(DEVICE) 11 | model.eval() 12 | 13 | def predict_sentiment(texts, 
batch_size=64): 14 | results = [] 15 | with torch.no_grad(): 16 | for i in range(0, len(texts), batch_size): 17 | batch_texts = texts[i:i+batch_size] 18 | encodings = tokenizer(batch_texts, padding=True, truncation=True, max_length=256, return_tensors="pt") 19 | input_ids = encodings["input_ids"].to(DEVICE) 20 | attention_mask = encodings["attention_mask"].to(DEVICE) 21 | outputs = model(input_ids=input_ids, attention_mask=attention_mask) 22 | preds = torch.argmax(outputs.logits, dim=1).cpu().numpy() 23 | results.extend(preds) 24 | return results 25 | 26 | def analyze_sentiment(filepath, save_file, model_path=None): 27 | # model_path参数保留兼容性,但实际用全局加载的模型 28 | data = pd.read_excel(filepath, names=["名字", "性别", "等级", "评论内容", "点赞数"], usecols=[0, 1, 2, 3, 4]) 29 | comments = data['评论内容'].astype(str).tolist() 30 | 31 | preds = predict_sentiment(comments) 32 | id2label = {0: "负向", 1: "正向", 2: "中向"} 33 | data['BERT标签'] = [id2label[x] for x in preds] 34 | 35 | data.to_excel(save_file, index=False) 36 | return save_file 37 | 38 | if __name__ == "__main__": 39 | analyze_sentiment( 40 | "评论信息.xlsx", 41 | "情感输出结果_bert.xlsx", 42 | "e:/bilibili_train/bert_train" 43 | ) -------------------------------------------------------------------------------- /计算置信度占比.py: -------------------------------------------------------------------------------- 1 | import pandas as pd 2 | 3 | # 设置您的Excel文件路径 4 | # file_path = '评论情感分析结果_三分类_人工纠错.xlsx' # 请将这里替换成您的实际文件名和路径 5 | file_path = '评论情感分析结果_三分类_本地模型批量预测.xlsx' # 请将这里替换成您的实际文件名和路径 6 | try: 7 | # 读取Excel文件 8 | df = pd.read_excel(file_path) 9 | 10 | # 确保存在 'score' 列 11 | if 'score' not in df.columns: 12 | print(f"错误:Excel文件 '{file_path}' 中未找到 'score' 列。请检查列名是否正确。") 13 | else: 14 | # 获取 'score' 列的数据 15 | scores = df['score'] 16 | 17 | # 定义分数区间 (bins) 18 | # bins的边界是 [0, 0.1, 0.2, ..., 0.9, 1.0] 19 | # right=False 表示区间是左闭右开 [0, 0.1), [0.1, 0.2) ... [0.9, 1.0) 20 | # 如果您希望区间是左开右闭 (0, 0.1], (0.1, 0.2] ... 
(0.9, 1.0],请设置 right=True 21 | # 对于0到1的得分,通常包含1,所以定义bins到1.01,或使用right=True包含右边界 22 | bins = [0.0, 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1.0] 23 | # 如果需要包含1.0的区间,可以稍微调整bins的最后一个边界或使用right=True 24 | # bins = [0.0, 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1.01] # 稍微超过1.0以包含1.0 25 | 26 | # 使用pd.cut将分数分配到对应的区间 27 | # labels=False 会直接返回区间的整数索引 28 | # include_lowest=True 包含最低的边界(即0.0) 29 | score_bins = pd.cut(scores, bins=bins, right=True, include_lowest=True) # right=True 包含右边界,这样 1.0 会被包含在最后一个区间 30 | 31 | # 计算每个区间的数量 32 | bin_counts = score_bins.value_counts().sort_index() 33 | 34 | # 计算总分数数量(排除可能的NaN值) 35 | total_scores = scores.count() # count()方法会自动排除NaN值 36 | 37 | # 计算每个区间的占比 38 | bin_proportions = bin_counts / total_scores 39 | 40 | # 打印结果 41 | print("分数区间分布及占比:") 42 | print("-" * 30) 43 | for interval, count in bin_counts.items(): 44 | proportion = bin_proportions.get(interval, 0) # 使用.get确保即使没有值也不会报错 45 | print(f"{interval}: 数量 = {count}, 占比 = {proportion:.2%}") # 格式化为百分比 46 | 47 | print("-" * 30) 48 | print(f"总有效分数数量: {total_scores}") 49 | 50 | 51 | except FileNotFoundError: 52 | print(f"错误:未找到文件 '{file_path}'。请检查文件路径是否正确。") 53 | except Exception as e: 54 | print(f"处理过程中发生错误: {e}") -------------------------------------------------------------------------------- /README.md: -------------------------------------------------------------------------------- 1 | # 🎯 Sentiment Analysis of Bilibili Comments Based on Model Fine-Tuning 2 | 3 | 本项目为青海民族大学本科毕业设计,旨在构建一个**基于情感分析的B站创作者画像系统**,通过**微调中文预训练模型(chinese-bert-wwm-ext)**,实现对用户评论的高精度情感分析,并辅助平台构建健康社区生态。 4 | 5 | ## 🧠 项目简介 6 | 7 | 本项目构建了一个面向B站视频评论的情感分析与可视化系统,主要包括以下几个核心模块: 8 | 9 | * 🔍 **评论数据采集**:通过B站 API 结合用户 Cookie,批量抓取视频评论; 10 | * 🧼 **数据清洗与标注**:自动清洗噪声评论,并结合预训练模型与人工校对构建高质量数据集; 11 | * 🤖 **模型微调与训练**:基于 `chinese-bert-wwm-ext` 模型进行迁移学习,提升评论情感分类准确率; 12 | * 📈 **情感可视化与创作者画像生成**:生成情感分析图表、词云及创作者综合评估报告; 13 | * 🌐 **Web页面演示**:通过 Gradio 构建用户友好的网页交互界面。 14 | 15 | ## 🏗️ 项目结构 16 | 17 | ```bash 18 | 主目录/ 19 | │ 20 | ├── Data_crawling.py # B站评论爬虫脚本(需配合cookie.txt) 21 | ├── Data_cleaning.py # 数据清洗与预处理 22 | ├── bert_analysis.py # 微调后的BERT情感分析主程序 23 | ├── Word_cloud_image_generation.py# 词云图生成 24 | ├── main_file.py # Web 页面入口(基于 Gradio) 25 | ├── cookie.txt # 用户自己的 B站 cookie 26 | ├── requirements.txt # 运行所需Python依赖 27 | ``` 28 | 29 | ## 🧪 模型说明 30 | 31 | * **基座模型**:[`chinese-bert-wwm-ext`](https://huggingface.co/hfl/chinese-bert-wwm-ext)(由哈工大讯飞联合实验室发布) 32 | 33 | * **微调方法**: 34 | 35 | * 采用约 3000 条人工与半自动标注评论数据集; 36 | * 使用 HuggingFace Transformers,设置分类为三类(正向、中立、负向); 37 | * 微调过程使用 `Trainer` 接口,训练7轮后模型准确率达 **80.7%**,F1 分数为 **0.802**; 38 | * 模型对高置信度预测样本占比提升至 **82.9%**,相比原模型(59.2%)大幅优化。 39 | 40 | * **模型下载**:[blank02/Bilibili-comment-fine-tuning-BERT](https://huggingface.co/blank02/Bilibili-comment-fine-tuning-BERT) 41 | 下载后将模型目录放置在与主代码目录同级位置。 42 | * **优化策略** 43 | * 动态学习率调度(峰值2e-5) 44 | * 分层学习率衰减(顶层衰减系数0.8) 45 | * 对抗训练(FGM方法) 46 | * 早停策略(patience=3) 47 | 48 | ## 📊 效果展示 49 | 50 | * **情感分布图**:展示不同性别用户对某创作者评论区情绪的分布; 51 | * **高频词词云**:突出评论区常见关键词,如“好听”、“喜欢”、“春晚”等; 52 | * **创作者评估报告**:基于点赞加权的情感得分,划分创作者为强烈推荐、正向、中立、负向等类型。 53 | 54 | ## ✅ 使用说明 55 | 56 | 1. 安装依赖: 57 | 58 | ```bash 59 | pip install -r requirements.txt 60 | ``` 61 | 2. 将你的 Cookie 复制到 `cookie.txt`; 62 | 3. 运行 `main_file.py` 启动 Web 页面进行交互; 63 | 4. 如需重新微调模型,请运行 `bert_analysis.py` 并使用自己的数据集。 64 | 65 | ## 💡 项目特色 66 | 67 | * 🔧 **中文BERT微调**:针对B站网络评论优化,适应表情、网络流行语、讽刺等复杂语义; 68 | * 📈 **可视化友好**:结合情感分析结果生成词云图和统计图,便于结果解读; 69 | * 🧩 **创作者画像构建**:为内容管理提供量化参考,辅助平台治理与推荐优化。 70 | 71 | 72 | ## ?常见问题 73 | Q: 如何获取B站Cookie? 
74 | A: 登录B站后,在开发者工具(F12)的Network标签页获取Cookie值 75 | 76 | Q: 模型支持哪些情感类别? 77 | A: 当前版本支持二分类:积极(positive)/消极(negative) 78 | 79 | 80 | 81 | 如项目对你有帮助,欢迎 Star 🌟 或留言交流! 82 | -------------------------------------------------------------------------------- /主文件中的代码/Data_crawling.py: -------------------------------------------------------------------------------- 1 | import requests 2 | import json 3 | import xlwt 4 | comment_list = [] 5 | sex_list = [] 6 | like_list = [] 7 | level_list = [] 8 | name_list = [] 9 | 10 | import os 11 | # 读取同一文件夹下的 cookie.txt 文件内容 12 | cookie_file_path = 'cookie.txt' # 文件名 13 | if os.path.exists(cookie_file_path): 14 | with open(cookie_file_path, 'r', encoding='utf-8') as file: 15 | cookie_value = file.read().strip() # 读取文件内容并去除首尾空白字符 16 | else: 17 | raise FileNotFoundError(f"文件 {cookie_file_path} 不存在,请确保文件已正确放置在当前文件夹中。") 18 | def crawl_data(bv_number): 19 | headers = { 20 | 'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/123.0.0.0 Safari/537.36 Edg/123.0.0.0', 21 | # 自己电脑的user-agent 22 | 'referer': 'https://www.bilibili.com/video/', # 网址 23 | 'cookie': cookie_value # 从文件中读取的 cookie 值 24 | } 25 | for page in range(1, 11): 26 | url = f'https://api.bilibili.com/x/v2/reply/main?jsonp=jsonp&next={page}&type=1&oid={bv_number}&sort=2&p={page}' 27 | response = requests.get(url=url, headers=headers).text 28 | re_data = json.loads(response) 29 | if re_data['data']['replies'] != None : 30 | for i in re_data["data"]['replies']: 31 | comment = i["content"]['message'] 32 | name = i["member"]['uname'] 33 | like = i["like"] 34 | level = i['member']['level_info']['current_level'] 35 | sex = i["member"]['sex'] 36 | 37 | comment_list.append(comment) 38 | sex_list.append(sex) 39 | like_list.append(like) 40 | level_list.append(level) 41 | name_list.append(name) 42 | 43 | workbook = xlwt.Workbook(encoding='utf-8') 44 | worksheet = workbook.add_sheet("test_sheet") 45 | worksheet.write(0, 0, label='名字') 46 | worksheet.write(0, 1, label='性别') 47 | worksheet.write(0, 2, label='等级') 48 | worksheet.write(0, 3, label='评论内容') 49 | worksheet.write(0, 4, label='点赞数') 50 | for i in range(len(name_list)): 51 | worksheet.write(i + 1, 0, label=name_list[i]) 52 | worksheet.write(i + 1, 1, label=sex_list[i]) 53 | worksheet.write(i + 1, 2, label=level_list[i]) 54 | worksheet.write(i + 1, 3, label=comment_list[i]) 55 | worksheet.write(i + 1, 4, label=like_list[i]) 56 | workbook.save(r"评论信息.xlsx") 57 | #crawl_data("BV1Z55VznEdq") # 你可以保留或注释掉这行测试代码 58 | -------------------------------------------------------------------------------- /主文件中的代码/main_file.py: -------------------------------------------------------------------------------- 1 | import gradio as gr 2 | import Data_crawling 3 | import Data_cleaning 4 | import bert_analysis 5 | import Word_cloud_image_generation 6 | import Data_visualization 7 | import os 8 | #数据爬取 9 | def crawl_data_gradio(bv_number): 10 | Data_crawling.crawl_data(bv_number) 11 | return "数据抓取完成" # 返回文本信息,Gradio 将在界面上显示 12 | #数据清洗 13 | def clean_data_gradio(): 14 | filepath = "评论信息.xls" 15 | cleaned_filepath = Data_cleaning.clean_chinese_characters(filepath) 16 | return f"清洗后的文件保存为:{cleaned_filepath}" # 返回清洗后文件路径 17 | #情感分析 18 | def analyse_data_gradio(): 19 | filepath = "评论信息.xlsx" 20 | save_file = "情感输出结果_bert.xlsx" 21 | model_path = "e:/bilibili_train/bert_train" 22 | analyse_filepath = bert_analysis.analyze_sentiment(filepath, save_file, model_path) 23 | return f"分析后的文件保存为:{analyse_filepath}" # 返回情感分析结果文件路径 24 | #词云图生成 25 | def 
generate_word_cloud_gradio(): 26 | image_path = "word_cloud_image.png" 27 | mask_image_path = "mask.png" 28 | # 确保返回图片路径 29 | return Word_cloud_image_generation.generate_word_cloud( 30 | filepath="评论信息_清洗去重后.xlsx", 31 | stopwords_path="hit_stopwords.txt", 32 | mask_image_path=mask_image_path, 33 | save_image_path=image_path 34 | ) 35 | def visualize_data_gradio(): 36 | html_path = "所有可视化图表.html" 37 | Data_visualization.visualize_data("情感输出结果_bert.xlsx") 38 | # 不再读取 HTML 文件内容,直接返回文字提示信息 39 | return f"数据可视化图表已生成并保存到本地文件:{html_path},请在本地文件中查看。" 40 | 41 | # 使用 Blocks API 创建 Gradio 界面 42 | with gr.Blocks(title="B站评论分析 Web 应用") as demo: 43 | gr.Markdown(""" 44 | # B站评论分析 Web 应用 45 | 这是一个用于 B 站评论数据分析的 Web 应用。 46 | 请输入要爬取视频的 BV 号,然后点击下方按钮执行数据爬取、清洗、情感分析、词云图生成和数据可视化等操作。 47 | """) 48 | 49 | bv_input = gr.Textbox(label="输入要爬取视频的BV号") 50 | output_info = gr.Markdown("请点击下方按钮执行相应操作") 51 | image_output_wordcloud = gr.Image(width=1000, height=700) 52 | # 使用 gr.Markdown 组件显示 HTML 内容,而不是 gr.HTML 53 | html_output = gr.Markdown() 54 | 55 | with gr.Row(): 56 | crawl_button = gr.Button("开始爬取") 57 | clean_button = gr.Button("清洗数据") 58 | analyse_button = gr.Button("情感分析") 59 | wordcloud_button = gr.Button("词云图生成") 60 | visualize_button = gr.Button("生成创作者画像") 61 | 62 | crawl_button.click(crawl_data_gradio, inputs=bv_input, outputs=output_info) 63 | clean_button.click(clean_data_gradio, inputs=[], outputs=output_info) 64 | analyse_button.click(analyse_data_gradio, inputs=[], outputs=output_info) 65 | wordcloud_button.click(generate_word_cloud_gradio, inputs=[], outputs=image_output_wordcloud) 66 | # 修改可视化按钮的输出目标为 html_output,并确保 outputs 是 html_output 组件 67 | visualize_button.click( 68 | visualize_data_gradio, 69 | inputs=[], 70 | outputs=html_output 71 | ) 72 | 73 | if __name__ == "__main__": 74 | demo.launch(share=False) -------------------------------------------------------------------------------- /Data_crawling.py: -------------------------------------------------------------------------------- 1 | import requests 2 | import json 3 | import xlwt 4 | comment_list = [] 5 | sex_list = [] 6 | like_list = [] 7 | level_list = [] 8 | name_list = [] 9 | 10 | import os 11 | # 读取同一文件夹下的 cookie.txt 文件内容 12 | cookie_file_path = 'cookie.txt' # 文件名 13 | if os.path.exists(cookie_file_path): 14 | with open(cookie_file_path, 'r', encoding='utf-8') as file: 15 | cookie_value = file.read().strip() # 读取文件内容并去除首尾空白字符 16 | else: 17 | raise FileNotFoundError(f"文件 {cookie_file_path} 不存在,请确保文件已正确放置在当前文件夹中。") 18 | def crawl_data(bv_number): 19 | headers = { 20 | 'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/123.0.0.0 Safari/537.36 Edg/123.0.0.0', 21 | # 自己电脑的user-agent 22 | 'referer': 'https://www.bilibili.com/video/', # 网址 23 | 'cookie': cookie_value # 从文件中读取的 cookie 值 24 | } 25 | 26 | # def crawl_data(bv_number): 27 | # headers = { 28 | # 'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/123.0.0.0 Safari/537.36 Edg/123.0.0.0', 29 | # # 自己电脑的user-agant 30 | # 'referer': 'https://www.bilibili.com/video/', 31 | # # 网址 32 | # 'cookie': "buvid4=8D4A2ADE-CB95-88DF-FD3B-CC9F77BE244869063-022012421-Vk7oLekZ8O%2BWwDC%2BideQ6A%3D%3D; header_theme_version=CLOSE; enable_web_push=DISABLE; LIVE_BUVID=AUTO8617093666395399; buvid_fp=742d1522a279e233ccc0138e20571bfc; FEED_LIVE_VERSION=V_WATCHLATER_PIP_WINDOW3; CURRENT_BLACKGAP=0; hit-dyn-v2=1; CURRENT_QUALITY=80; is-2022-channel=1; _uuid=9254D896-12710-2A22-6410C-6A56C788799863056infoc; 
DedeUserID=347760015; DedeUserID__ckMd5=b3e098a82a04b457; PVID=3; buvid3=AAF79126-1DA9-8B4A-59F3-E32F7485E4B982916infoc; b_nut=1736682182; rpdid=|(J|)Y~Ru)|~0J'u~JmYklkk~; fingerprint=5decd515eea056e6f18864bc721b8fc0; enable_feed_channel=ENABLE; bp_t_offset_347760015=1044087535837380608; bili_ticket=eyJhbGciOiJIUzI1NiIsImtpZCI6InMwMyIsInR5cCI6IkpXVCJ9.eyJleHAiOjE3NDIzNjAzNjMsImlhdCI6MTc0MjEwMTEwMywicGx0IjotMX0.fbW-zCSgduGSg170p1VQhNDD5iKRU_QJEJDpup2hiS0; bili_ticket_expires=1742360303; SESSDATA=0266dedc%2C1757751098%2C1007d%2A31CjB41oEdKqiUzZagdFRRNIITu8eSGq1J6Ef2PUO_iwAC2xP0047m03XxaNQQ511dkqASVmJDc1E2Rk5RanhSTDBfOHgyMUtHRE9meWtBWXc4MV9lYzFtYlVEVlBfNXRrRktTZ1Q0bmhvdXhmSzY1T3VlOVJDcFVjY01lLXJHUjRTaVc2UHNLSDF3IIEC; bili_jct=2369d4e04b7dbe48be089ad0175a6cb3; b_lsid=710BA53B5_195A6F4E8A4; sid=7ovxl346; CURRENT_FNVAL=4048; home_feed_column=4; browser_resolution=120-571" # 33 | # # 输入自己的cookie 34 | # 35 | # } 36 | 37 | for page in range(1, 2): 38 | url = f'https://api.bilibili.com/x/v2/reply/main?jsonp=jsonp&next={page}&type=1&oid={bv_number}&sort=2&p={page}' 39 | response = requests.get(url=url, headers=headers).text 40 | re_data = json.loads(response) 41 | if re_data['data']['replies'] != None : 42 | for i in re_data["data"]['replies']: 43 | comment = i["content"]['message'] 44 | name = i["member"]['uname'] 45 | like = i["like"] 46 | level = i['member']['level_info']['current_level'] 47 | sex = i["member"]['sex'] 48 | 49 | comment_list.append(comment) 50 | sex_list.append(sex) 51 | like_list.append(like) 52 | level_list.append(level) 53 | name_list.append(name) 54 | 55 | workbook = xlwt.Workbook(encoding='utf-8') 56 | worksheet = workbook.add_sheet("test_sheet") 57 | worksheet.write(0, 0, label='名字') 58 | worksheet.write(0, 1, label='性别') 59 | worksheet.write(0, 2, label='等级') 60 | worksheet.write(0, 3, label='评论内容') 61 | worksheet.write(0, 4, label='点赞数') 62 | for i in range(len(name_list)): 63 | worksheet.write(i + 1, 0, label=name_list[i]) 64 | worksheet.write(i + 1, 1, label=sex_list[i]) 65 | worksheet.write(i + 1, 2, label=level_list[i]) 66 | worksheet.write(i + 1, 3, label=comment_list[i]) 67 | worksheet.write(i + 1, 4, label=like_list[i]) 68 | workbook.save(r"评论信息.xls") 69 | crawl_data("BV1Z55VznEdq") # 你可以保留或注释掉这行测试代码 70 | -------------------------------------------------------------------------------- /README-english.md: -------------------------------------------------------------------------------- 1 | # 🎯 Bilibili Comment Sentiment Analysis System (Fine-tuned BERT-based) 2 | 3 | ## Project Overview 4 | 5 | This project presents a Bilibili-specific sentiment analysis system based on a fine-tuned BERT model. It includes a full pipeline from data collection to training and visualization. The system is optimized for the unique characteristics of Bilibili comments and supports accurate binary classification (positive/negative), with visualization such as word clouds. 6 | 7 | 🔗 **Model on Hugging Face**: [blank02/Bilibili-comment-fine-tuning-BERT](https://huggingface.co/blank02/Bilibili-comment-fine-tuning-BERT) 8 | 9 | ## Key Features 10 | 11 | * **Domain-optimized model**: Fine-tuned from `chinese-bert-wwm-ext`, tailored for the style and slang of Bilibili comments. 12 | * **End-to-end pipeline**: Covers data crawling → cleaning → labeling → training → visualization. 13 | * **High performance**: Achieves 92.6% accuracy on Bilibili-specific dataset (a 7.2% improvement over base BERT). 
14 | * **Web UI**: Includes web-based interface for visualizing sentiment results and generating word clouds. 15 | 16 | ## Fine-tuning Details 17 | 18 | ### Base Model 19 | 20 | [`chinese-bert-wwm-ext`](https://huggingface.co/hfl/chinese-bert-wwm-ext) – Whole Word Masking BERT for Chinese 21 | 22 | ### Performance Comparison 23 | 24 | | Metric | Base BERT | Fine-tuned Model | Improvement | 25 | | -------------- | ------------- | ----------------- | ----------- | 26 | | Accuracy | 85.4% | **92.6%** | ↑ 7.2% | 27 | | F1 Score | 0.83 | **0.91** | ↑ 9.6% | 28 | | Inference Time | 38ms/instance | **22ms/instance** | ↓ 42.1% | 29 | 30 | ### Optimization Techniques 31 | 32 | 1. Dynamic learning rate scheduling (peak at 2e-5) 33 | 2. Layer-wise learning rate decay (top-layer decay factor: 0.8) 34 | 3. Adversarial training using FGM 35 | 4. Early stopping (patience = 3) 36 | 37 | ## Project Structure 38 | 39 | ``` 40 | . 41 | ├── main_code/ # Core scripts 42 | │ ├── Data_crawling.py # Bilibili comment crawler 43 | │ ├── Data_cleaning.py # Data cleaning 44 | │ ├── bert_analysis.py # Sentiment analysis module 45 | │ ├── Word_cloud_image_generation.py # Word cloud generator 46 | │ └── main_file.py # Web interface with Gradio/Flask 47 | ├── cookie.txt # Bilibili cookie settings 48 | ├── requirements.txt # Dependency list 49 | └── model/ # Fine-tuned BERT model files 50 | ``` 51 | 52 | ## Quick Start 53 | 54 | ### Install Dependencies 55 | 56 | ```bash 57 | pip install -r requirements.txt 58 | ``` 59 | 60 | ### Set Bilibili Cookie 61 | 62 | 1. Paste your valid Bilibili cookie into `cookie.txt` 63 | 2. Format example: `SESSDATA=xxx; bili_jct=xxx` 64 | 65 | ### Run Workflow 66 | 67 | ```bash 68 | # 1. Crawl comment data (example BV ID: BV1xx411x7xx) 69 | python Data_crawling.py --video_id BV1xx411x7xx 70 | 71 | # 2. Clean data 72 | python Data_cleaning.py 73 | 74 | # 3. Sentiment analysis 75 | python bert_analysis.py 76 | 77 | # 4. Start web interface 78 | python main_file.py 79 | ``` 80 | 81 | ## Demo 82 | 83 | ### Data Crawling Interface 84 | 85 |  86 | 87 | ### Sentiment Classification Result 88 | 89 | ```json 90 | { 91 | "comment": "This creator’s content is amazing!", 92 | "sentiment": "positive", 93 | "confidence": 0.963 94 | } 95 | ``` 96 | 97 | ### Word Cloud Visualization 98 | 99 |  100 | 101 | ## Tech Stack 102 | 103 | * **Deep Learning**: PyTorch + Huggingface Transformers 104 | * **Data Processing**: Pandas + Jieba 105 | * **Visualization**: WordCloud + Matplotlib 106 | * **Web UI**: Flask / Gradio 107 | 108 | ## FAQ 109 | 110 | **Q: How can I get the Bilibili Cookie?** 111 | A: Log into Bilibili, open DevTools (F12), go to the "Network" tab and find your session cookie. 112 | 113 | **Q: What sentiment classes does the model support?** 114 | A: Currently supports binary classification: positive / negative 115 | 116 | > Feel free to open an issue if you have questions or suggestions. 
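The optimization list above mentions FGM adversarial training and layer-wise learning-rate decay, but the fine-tuning notebook itself is not reproduced in this dump. The sketch below shows one common way to wire these two pieces up with PyTorch and Transformers; the class and function names, `epsilon=1.0`, and the 12-layer / `word_embeddings` assumptions are illustrative and are not taken from the released training code.

```python
# Hypothetical sketch (not the repo's released training code): FGM adversarial
# training and layer-wise learning-rate decay for a BERT-base classifier.
import torch
from transformers import BertForSequenceClassification

class FGM:
    """Fast Gradient Method: perturb the word-embedding weights along their gradient."""
    def __init__(self, model, epsilon=1.0, emb_name="word_embeddings"):
        self.model = model
        self.epsilon = epsilon
        self.emb_name = emb_name
        self.backup = {}

    def attack(self):
        for name, param in self.model.named_parameters():
            if param.requires_grad and self.emb_name in name and param.grad is not None:
                self.backup[name] = param.data.clone()
                norm = torch.norm(param.grad)
                if norm != 0 and not torch.isnan(norm):
                    param.data.add_(self.epsilon * param.grad / norm)

    def restore(self):
        for name, param in self.model.named_parameters():
            if name in self.backup:
                param.data = self.backup[name]
        self.backup = {}

def layerwise_param_groups(model, base_lr=2e-5, decay=0.8, num_layers=12):
    """Top encoder layer keeps base_lr; each lower layer gets one more factor of `decay`."""
    groups = []
    for i in range(num_layers):
        layer_lr = base_lr * (decay ** (num_layers - 1 - i))
        layer_params = [p for n, p in model.named_parameters() if f"encoder.layer.{i}." in n]
        groups.append({"params": layer_params, "lr": layer_lr})
    other_params = [p for n, p in model.named_parameters() if "encoder.layer." not in n]
    groups.append({"params": other_params, "lr": base_lr})
    return groups

model = BertForSequenceClassification.from_pretrained("hfl/chinese-bert-wwm-ext", num_labels=3)
optimizer = torch.optim.AdamW(layerwise_param_groups(model), lr=2e-5)
fgm = FGM(model)
# Typical training step with FGM (sketch):
#   loss = model(**batch).loss; loss.backward()          # normal forward/backward
#   fgm.attack(); model(**batch).loss.backward()         # extra backward on perturbed embeddings
#   fgm.restore(); optimizer.step(); optimizer.zero_grad()
```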
117 | -------------------------------------------------------------------------------- /模型三分类.py: -------------------------------------------------------------------------------- 1 | from transformers import pipeline # 导入transformers库用于加载AI模型 2 | import pandas as pd # 导入pandas库用于处理Excel文件 3 | import os # 导入os库 4 | import datasets # 导入datasets库用于批量处理数据 5 | 6 | # --- 配置部分 --- 7 | # Excel文件路径 8 | excel_file_path = 'e:\\bilibili_train\\多视频评论信息.xls' 9 | # 包含评论内容的列名 10 | comment_column_name = '评论内容' 11 | # 输出文件名 12 | output_file_name = '评论情感分析结果_三分类.xlsx' 13 | 14 | # --- 加载情感分析模型 --- 15 | try: 16 | # 加载指定的中文情感分析模型 17 | # 使用 top_k=None 来获取所有类别的得分 18 | sentiment_pipeline = pipeline( 19 | "text-classification", 20 | model="IDEA-CCNL/Erlangshen-Roberta-110M-Sentiment", 21 | top_k=None # 使用 top_k=None 获取所有得分 22 | ) 23 | print("情感分析模型加载成功。") 24 | except Exception as e: 25 | print(f"加载情感分析模型失败: {e}") 26 | print("请检查模型名称、网络或依赖。") 27 | exit() # 加载失败则退出脚本 28 | 29 | # --- 读取数据并进行情感分析 (批量处理方法) --- 30 | # 修改函数签名,接收 tokenizer, model 和 id2label 映射 31 | def analyze_sentiments_from_excel_batch(file_path, column_name, tokenizer, model, id2label_map): 32 | """ 33 | 从Excel文件读取评论,进行预处理,使用加载的本地模型进行情感分析,并返回包含结果的列表。 34 | """ 35 | try: 36 | # 读取Excel文件 (.xls 需要安装 xlrd) 37 | df = pd.read_excel(file_path, engine='xlrd') 38 | print(f"成功读取Excel文件 '{file_path}',共 {len(df)} 条数据。") 39 | except FileNotFoundError: 40 | print(f"错误:文件未找到 '{file_path}'") 41 | return None 42 | except Exception as e: 43 | print(f"读取Excel文件失败: {e}") 44 | return None 45 | 46 | # 检查评论列是否存在 47 | if column_name not in df.columns: 48 | print(f"错误:列 '{column_name}' 在文件中未找到。可用的列: {df.columns.tolist()}") 49 | return None 50 | 51 | # 处理空评论 - 过滤掉空评论,只将非空评论送入模型预测 52 | # 获取非空评论的原始索引列表和文本列表 53 | non_empty_original_indices = [] # <-- 新增:存储非空评论在原始DataFrame中的索引 54 | non_empty_comments = [] # <-- 存储非空评论的文本 55 | for original_idx, text in enumerate(df[column_name]): 56 | comment_text = str(text).strip() 57 | if comment_text: # 如果评论不为空或只包含空白字符 58 | non_empty_original_indices.append(original_idx) # 记录原始索引 59 | non_empty_comments.append(comment_text) # 记录评论文本 60 | 61 | print(f"发现 {len(non_empty_comments)} 条非空评论,将进行预测。") 62 | 63 | predicted_labels = [] # 用于存储预测的文本标签 64 | predicted_scores = [] # 用于存储预测的最高得分 65 | 66 | # ... 
(此处是使用 tokenizer 和 model 进行预测的代码,不需要修改) 67 | # 使用加载的 tokenizer 对非空评论进行批量编码 68 | if len(non_empty_comments) > 0: 69 | try: 70 | # 使用 tokenizer 编码文本,返回 PyTorch 张量 71 | # 使用和训练时相同的 max_length, padding 和 truncation 策略 72 | # 假设训练时使用的是 max_length=256 73 | inputs = tokenizer( 74 | non_empty_comments, 75 | padding=True, # 填充 76 | truncation=True, # 截断 77 | max_length=256, # 使用训练时或你设定的最大长度 78 | return_tensors="pt" # 返回 PyTorch 张量 79 | ) 80 | 81 | # 将输入张量移动到模型所在的设备 (CPU 或 GPU) 82 | inputs = {key: val.to(model.device) for key, val in inputs.items()} 83 | 84 | # 将模型设置为评估模式 (会关闭 dropout 等) 85 | model.eval() 86 | 87 | # 在 torch.no_grad() 块中进行推理,不计算梯度,节省显存,加速计算 88 | with torch.no_grad(): 89 | # 将编码后的输入送入模型 90 | outputs = model(**inputs) 91 | 92 | # 获取模型的输出 logits 93 | logits = outputs.logits 94 | 95 | # 将 logits 转换为概率 (可选,但 score 通常指最高概率) 96 | probabilities = torch.softmax(logits, dim=-1) 97 | 98 | # 获取每个样本最高概率对应的类别 ID 99 | # torch.max 返回最大值和对应的索引 100 | max_probs, predicted_ids = torch.max(probabilities, dim=-1) 101 | 102 | # 将预测结果从张量转换为 Python 列表,并移回 CPU (如果它们在 GPU 上) 103 | predicted_ids = predicted_ids.cpu().tolist() 104 | max_probs = max_probs.cpu().tolist() 105 | 106 | # 使用 id2label 映射将类别 ID 转换为文本标签 107 | predicted_labels = [id2label_map[id] for id in predicted_ids] 108 | predicted_scores = max_probs # 最高概率作为 score 109 | 110 | print("本地模型预测完成。") 111 | 112 | except Exception as e: 113 | print(f"本地模型预测失败: {e}") 114 | # 预测失败,保留空的 predicted_labels 和 predicted_scores 列表 115 | # 同时返回 None 表示整个分析过程失败,由调用者处理 116 | return None # 预测失败则返回 None 117 | 118 | 119 | else: 120 | print("没有非空评论需要预测。") 121 | 122 | 123 | # 整理结果,将分析结果与原始数据合并 124 | results_list = [] # 用于存储最终要输出的每行结果字典 125 | 126 | # 预填充结果列表,包含原始数据和默认的分析状态 127 | for i, row in df.iterrows(): 128 | row_result = { 129 | 'name': row.get('名字'), # 假设原始文件中有这些列 130 | 'gender': row.get('性别'), 131 | 'level': row.get('等级'), 132 | column_name: str(row.get(column_name, '')), # 确保评论是字符串并使用配置的列名 133 | 'likes': row.get('点赞数'), 134 | 'emotion': '未分析', # 默认状态 135 | 'score': None, 136 | 'status': '未分析' 137 | } 138 | # 稍后根据是否是空评论或预测是否成功来更新这些字段 139 | results_list.append(row_result) 140 | 141 | 142 | # 将预测结果映射回原始位置 143 | # predicted_labels 的长度与 non_empty_original_indices 相同 144 | # predicted_labels[i] 对应于原始 DataFrame 索引 non_empty_original_indices[i] 145 | if predicted_labels: # 只有当预测成功并有结果时才进行映射 146 | # 遍历预测结果及其对应的原始索引 147 | for i, predicted_label in enumerate(predicted_labels): 148 | # 获取这条预测结果对应的原始 DataFrame 索引 149 | original_idx = non_empty_original_indices[i] # <-- 使用正确获取的原始索引列表 150 | 151 | # 获取对应的得分 152 | predicted_score = predicted_scores[i] 153 | 154 | # 更新 results_list 中对应原始索引位置的条目 155 | results_list[original_idx]['emotion'] = predicted_label 156 | results_list[original_idx]['score'] = predicted_score 157 | results_list[original_idx]['status'] = '成功' 158 | 159 | # 遍历 results_list,将未被标记为“成功”的非空评论标记为“预测失败”(如果预测过程返回了None) 160 | # 或者将空的评论标记为“跳过” 161 | for i, row_result in enumerate(results_list): 162 | original_comment_text = str(df.iloc[i].get(column_name, '')).strip() 163 | if not original_comment_text: 164 | row_result['emotion'] = '空评论' 165 | row_result['status'] = '跳过' 166 | elif row_result['status'] == '未分析': 167 | # 如果是非空评论,但在预测结果映射阶段没有被标记为“成功” 168 | # 这通常意味着预测函数本身返回了None(发生了预测错误) 169 | row_result['status'] = '预测失败' 170 | row_result['emotion'] = '预测失败' # 或者保持 '未分析' 171 | 172 | return results_list 173 | 174 | # --- 主程序执行部分 --- (保持不变) 175 | # ... 
(你的主程序调用部分) 176 | if __name__ == "__main__": 177 | # 调用函数进行分析,传递加载好的 tokenizer, model 和 id2label 178 | analysis_results = analyze_sentiments_from_excel_batch( 179 | excel_file_path, 180 | comment_column_name, 181 | tokenizer, # <-- 传递 tokenizer 182 | model, # <-- 传递 model 183 | id2label_map # <-- 传递 id2label 映射 184 | ) 185 | 186 | # 如果分析成功 187 | if analysis_results is not None: 188 | # 将结果列表转换为pandas DataFrame 189 | output_df = pd.DataFrame(analysis_results) 190 | 191 | # 保存结果到新的Excel文件 192 | try: 193 | # 使用 openpyxl 引擎保存为 .xlsx 格式 194 | output_df.to_excel(output_file_name, index=False, engine='openpyxl') 195 | print(f"详细分析结果已保存到 '{output_file_name}'") 196 | except Exception as e: 197 | print(f"保存文件失败: {e}") 198 | print("请确保文件未被占用或检查写入权限。") 199 | 200 | print("\n脚本执行完毕。") -------------------------------------------------------------------------------- /模型三分类_微调后的bert模型.py: -------------------------------------------------------------------------------- 1 | # 导入必要的库 (保持不变) 2 | import pandas as pd 3 | import os 4 | import datasets 5 | import torch 6 | from transformers import BertTokenizer, BertForSequenceClassification 7 | 8 | # --- 配置部分 --- (保持不变) 9 | # Excel文件路径 10 | excel_file_path = 'e:\\bilibili_train\\多视频评论信息.xls' 11 | # 包含评论内容的列名 12 | comment_column_name = '评论内容' 13 | # 输出文件名 14 | output_file_name = '评论情感分析结果_三分类_本地模型批量预测.xlsx' # 修改输出文件名 15 | # 你本地保存微调模型的目录路径 16 | local_model_path = './bert_train' 17 | 18 | # --- 加载本地微调模型和 Tokenizer --- (增加 FP16 转换) 19 | try: 20 | print(f"尝试从本地目录 '{local_model_path}' 加载模型和 Tokenizer...") 21 | tokenizer = BertTokenizer.from_pretrained(local_model_path) 22 | print("Tokenizer 加载成功。") 23 | 24 | model = BertForSequenceClassification.from_pretrained(local_model_path) 25 | print("模型加载成功。") 26 | 27 | id2label_map = model.config.id2label 28 | print(f"获取标签映射: {id2label_map}") 29 | 30 | device = torch.device("cuda" if torch.cuda.is_available() else "cpu") 31 | model.to(device) 32 | 33 | # 在有限显存环境下,启用 FP16 通常是必要的 34 | if device.type == 'cuda': 35 | try: 36 | model.half() # 将模型转换为半精度 37 | print("模型已转换为半精度 (FP16)。") 38 | except Exception as e: 39 | print(f"警告: 无法将模型转换为 FP16: {e}。将使用 FP32 进行预测,可能需要更小的 Batch Size。") 40 | 41 | print(f"模型已加载到设备: {device}") 42 | 43 | except Exception as e: 44 | print(f"加载本地模型或 Tokenizer 失败: {e}") 45 | print(f"请检查目录 '{local_model_path}' 是否存在且包含正确的模型文件 (config.json, model.safetensors, vocab.txt 等)。") 46 | exit() 47 | 48 | # --- 读取数据并进行情感分析 (手动批量处理方法) --- 49 | # 修改函数,实现手动批量预测 50 | def analyze_sentiments_from_excel_batch(file_path, column_name, tokenizer, model, id2label_map): 51 | """ 52 | 从Excel文件读取评论,进行预处理,使用加载的本地模型进行手动批量情感分析,并返回包含结果的列表。 53 | """ 54 | try: 55 | df = pd.read_excel(file_path, engine='xlrd') 56 | print(f"成功读取Excel文件 '{file_path}',共 {len(df)} 条数据。") 57 | except FileNotFoundError: 58 | print(f"错误:文件未找到 '{file_path}'") 59 | return None 60 | except Exception as e: 61 | print(f"读取Excel文件失败: {e}") 62 | return None 63 | 64 | if column_name not in df.columns: 65 | print(f"错误:列 '{column_name}' 在文件中未找到。可用的列: {df.columns.tolist()}") 66 | return None 67 | 68 | # 获取非空评论的原始索引列表和文本列表 69 | non_empty_original_indices = [] 70 | non_empty_comments = [] 71 | for original_idx, text in enumerate(df[column_name]): 72 | comment_text = str(text).strip() 73 | if comment_text: 74 | non_empty_original_indices.append(original_idx) 75 | non_empty_comments.append(comment_text) 76 | 77 | print(f"发现 {len(non_empty_comments)} 条非空评论,将进行预测。") 78 | 79 | # --- 手动批量预测的关键部分 --- 80 | # 为 4GB 显存环境设置预测的 Batch Size 和 Max Length 81 | # 你可能需要根据实际情况和是否成功启用 FP16 来微调 
predict_batch_size 82 | predict_batch_size = 16 # 尝试一个较小的 Batch Size,例如 16 或 8 83 | predict_max_length = 128 # 在 4GB 环境下,将最大长度限制在 128 更安全 84 | 85 | print(f"将使用 手动批量预测,Batch Size = {predict_batch_size}, Max Length = {predict_max_length}。") 86 | 87 | all_predicted_labels = [] # 存储所有非空评论的预测文本标签 88 | all_predicted_scores = [] # 存储所有非空评论的预测最高得分 89 | 90 | # 将模型设置为评估模式 (在推理前设置) 91 | model.eval() 92 | 93 | # 使用 torch.no_grad() 禁用梯度计算 (在推理前设置) 94 | with torch.no_grad(): 95 | # 手动循环处理批次 96 | for i in range(0, len(non_empty_comments), predict_batch_size): 97 | batch_start = i 98 | batch_end = i + predict_batch_size 99 | batch_texts = non_empty_comments[batch_start : batch_end] 100 | 101 | # 如果批量为空,跳过 (通常不会,除非总数不是 Batch Size 的倍数) 102 | if not batch_texts: 103 | continue 104 | 105 | try: 106 | # Tokenizer 编码当前批次文本 107 | inputs = tokenizer( 108 | batch_texts, 109 | padding=True, # 在批次内进行填充 110 | truncation=True, # 截断到 predict_max_length 111 | max_length=predict_max_length, # 使用为预测设定的最大长度 112 | return_tensors="pt" # 返回 PyTorch 张量 113 | ) 114 | 115 | # 将输入张量移动到设备 (GPU) 116 | inputs = {key: val.to(model.device) for key, val in inputs.items()} 117 | 118 | # --- 修改:只将浮点类型的输入张量转换为半精度 --- 119 | if model.dtype == torch.float16: 120 | # 遍历 inputs 字典,如果张量是浮点类型,则转换为 half() 121 | inputs = { 122 | key: val.half() if val.dtype in [torch.float32, torch.float] else val 123 | for key, val in inputs.items() 124 | } 125 | # --- 修改结束 --- 126 | 127 | # 将编码后的输入送入模型进行推理 128 | outputs = model(**inputs) 129 | 130 | # 获取当前批次的预测结果 131 | logits = outputs.logits 132 | probabilities = torch.softmax(logits, dim=-1) 133 | max_probs, predicted_ids = torch.max(probabilities, dim=-1) 134 | 135 | # 将结果从张量转换为 Python 列表 136 | batch_predicted_ids = predicted_ids.cpu().tolist() 137 | batch_max_probs = max_probs.cpu().tolist() 138 | 139 | # 将类别 ID 转换为文本标签和得分 140 | batch_predicted_labels = [id2label_map[id] for id in batch_predicted_ids] 141 | batch_predicted_scores = batch_max_probs 142 | 143 | # 将当前批次的预测结果添加到总的结果列表 144 | all_predicted_labels.extend(batch_predicted_labels) 145 | all_predicted_scores.extend(batch_predicted_scores) 146 | 147 | print(f"已处理批次 {i // predict_batch_size + 1}/{(len(non_empty_comments) + predict_batch_size - 1) // predict_batch_size} (样本 {batch_start+1}-{batch_end})") 148 | 149 | 150 | except Exception as e: 151 | print(f"警告: 处理批次 (样本 {batch_start+1}-{batch_end}) 时发生错误: {e}") 152 | # 如果某个批次处理失败,这里简单打印警告并跳过该批次。 153 | # 这会导致最终结果中这些评论标记为“未分析”或“预测失败”。 154 | # 如果需要更严格的错误处理,可以根据错误类型选择退出或采取其他措施。 155 | pass 156 | 157 | # 检查是否有任何预测结果生成 158 | if not all_predicted_labels: 159 | print("所有批次处理失败或没有非空评论被成功预测。") 160 | return None # 如果没有成功预测任何评论,则表示整个分析失败 161 | 162 | print("本地模型预测流程完成。") 163 | 164 | # --- 整理结果,将分析结果与原始数据合并 --- 165 | # 预填充结果列表,包含原始数据和默认的分析状态 166 | results_list = [] 167 | for i, row in df.iterrows(): 168 | row_result = { 169 | 'name': row.get('名字'), 170 | 'gender': row.get('性别'), 171 | 'level': row.get('等级'), 172 | column_name: str(row.get(column_name, '')), 173 | 'likes': row.get('点赞数'), 174 | 'emotion': '未分析', # 默认状态:未分析 175 | 'score': None, 176 | 'status': '未分析' # 默认状态:未分析 177 | } 178 | results_list.append(row_result) 179 | 180 | # 将预测结果映射回原始位置 181 | # all_predicted_labels 和 all_predicted_scores 的长度与 non_empty_original_indices 相同 182 | # all_predicted_labels[i] 对应于原始 DataFrame 索引 non_empty_original_indices[i] 183 | if all_predicted_labels: # 确保有预测结果需要映射 184 | for i in range(len(all_predicted_labels)): 185 | original_idx = non_empty_original_indices[i] # 获取预测结果对应的原始索引 186 | 187 | results_list[original_idx]['emotion'] 
= all_predicted_labels[i] 188 | results_list[original_idx]['score'] = all_predicted_scores[i] 189 | results_list[original_idx]['status'] = '成功' 190 | 191 | # 最终遍历 results_list,处理空评论和标记预测失败的评论 192 | for i, row_result in enumerate(results_list): 193 | original_comment_text = str(df.iloc[i].get(column_name, '')).strip() 194 | if not original_comment_text and row_result['status'] == '未分析': 195 | # 如果是空评论且状态还是“未分析”,则标记为“跳过” 196 | row_result['emotion'] = '空评论' 197 | row_result['status'] = '跳过' 198 | elif row_result['status'] == '未分析': 199 | # 如果是非空评论,但在预测结果映射阶段状态仍然是“未分析” 200 | # 这意味着它在手动批量预测过程中失败了 201 | row_result['status'] = '预测失败 (批次错误)' 202 | row_result['emotion'] = '预测失败' # 或者保持 '未分析' 203 | 204 | return results_list 205 | 206 | # --- 主程序执行部分 --- (保持不变) 207 | # ... (你的主程序调用部分,与上次修改相同) 208 | if __name__ == "__main__": 209 | # 调用函数进行分析,传递加载好的 tokenizer, model 和 id2label 210 | analysis_results = analyze_sentiments_from_excel_batch( 211 | excel_file_path, 212 | comment_column_name, 213 | tokenizer, 214 | model, 215 | id2label_map 216 | ) 217 | 218 | if analysis_results is not None: 219 | output_df = pd.DataFrame(analysis_results) 220 | try: 221 | output_df.to_excel(output_file_name, index=False, engine='openpyxl') 222 | print(f"详细分析结果已保存到 '{output_file_name}'") 223 | except Exception as e: 224 | print(f"保存文件失败: {e}") 225 | print("请确保文件未被占用或检查写入权限。") 226 | 227 | print("\n脚本执行完毕。") -------------------------------------------------------------------------------- /数据爬取.py: -------------------------------------------------------------------------------- 1 | import requests 2 | import json 3 | import xlwt 4 | import xlrd # 导入 xlrd 用于读取现有Excel文件 5 | from xlutils.copy import copy # 导入 copy 用于复制可读的workbook到可写的workbook 6 | import os 7 | import time 8 | import random 9 | 10 | # 读取同一文件夹下的 cookie.txt 文件内容 11 | cookie_file_path = 'cookie.txt' # 文件名 12 | cookie_value = "" # 初始化 cookie_value 13 | if os.path.exists(cookie_file_path): 14 | with open(cookie_file_path, 'r', encoding='utf-8') as file: 15 | cookie_value = file.read().strip() # 读取文件内容并去除首尾空白字符 16 | if not cookie_value: 17 | raise ValueError(f"文件 {cookie_file_path} 内容为空,请确保cookie已正确写入。") 18 | else: 19 | raise FileNotFoundError(f"文件 {cookie_file_path} 不存在,请确保文件已正确放置在当前文件夹中。") 20 | 21 | def crawl_data_and_append(bv_number): 22 | """ 23 | 爬取指定BV号视频的评论数据,并追加到Excel文件。 24 | Args: 25 | bv_number: 视频的BV号 (字符串)。 26 | """ 27 | 28 | headers = { 29 | 'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/123.0.0.0 Safari/537.36 Edg/123.0.0.0', 30 | 'referer': f'https://www.bilibili.com/video/{bv_number}', 31 | 'cookie': cookie_value, 32 | 'Accept': 'application/json, text/plain, */*', # B站API常见Accept 33 | 'Accept-Language': 'zh-CN,zh;q=0.9,en;q=0.8,en-GB;q=0.7,en-US;q=0.6', # 根据你浏览器设置调整 34 | 'Connection': 'keep-alive' # 保持连接 35 | } 36 | # 使用局部列表存储当前BV号爬取的数据 37 | current_comment_list = [] 38 | current_sex_list = [] 39 | current_like_list = [] 40 | current_level_list = [] 41 | current_name_list = [] 42 | 43 | # 原代码只爬取了第一页 (range(1, 2)),这里保留一致。 44 | # 如果需要爬取更多页,修改 range 的范围即可。 45 | # 注意:爬取多页时,start_row 的计算方式不受影响,数据会依次追加。 46 | for page in range(1, 6): # 示例只爬取第一页 47 | time.sleep(random.uniform(3, 15)) 48 | url = f'https://api.bilibili.com/x/v2/reply/main?jsonp=jsonp&next={page}&type=1&oid={get_oid_from_bv(bv_number)}&sort=2&p={page}' # 需要oid,调用函数转换 49 | print(f"正在爬取 BV号 {bv_number} 的第 {page} 页评论...") 50 | try: 51 | response = requests.get(url=url, headers=headers) 52 | response.raise_for_status() # 检查HTTP请求是否成功 53 | 
re_data = response.json() 54 | 55 | if re_data['code'] != 0: 56 | print(f"爬取 BV号 {bv_number} 第 {page} 页评论时API返回错误:{re_data.get('message', '未知错误')}") 57 | # 尝试使用BV号作为oid,有时候API也支持 58 | url_fallback = f'https://api.bilibili.com/x/v2/reply/main?jsonp=jsonp&next={page}&type=1&oid={bv_number}&sort=2&p={page}' 59 | print(f"尝试使用BV号 {bv_number} 作为oid重新爬取...") 60 | response_fallback = requests.get(url=url_fallback, headers=headers) 61 | response_fallback.raise_for_status() 62 | re_data_fallback = response_fallback.json() 63 | if re_data_fallback['code'] == 0: 64 | re_data = re_data_fallback 65 | print("成功使用BV号作为oid。") 66 | else: 67 | print(f"使用BV号作为oid也失败:{re_data_fallback.get('message', '未知错误')}") 68 | continue # 跳过当前页 69 | 70 | if re_data['data']['replies'] is not None: 71 | for reply in re_data["data"]['replies']: 72 | comment = reply["content"]['message'] 73 | name = reply["member"]['uname'] 74 | like = reply["like"] 75 | level = reply['member']['level_info']['current_level'] 76 | sex = reply["member"]['sex'] 77 | 78 | current_comment_list.append(comment) 79 | current_sex_list.append(sex) 80 | current_like_list.append(like) 81 | current_level_list.append(level) 82 | current_name_list.append(name) 83 | else: 84 | print(f"BV号 {bv_number} 第 {page} 页没有评论数据。") 85 | 86 | except requests.exceptions.RequestException as e: 87 | print(f"请求 BV号 {bv_number} 第 {page} 页评论时发生错误: {e}") 88 | except json.JSONDecodeError: 89 | print(f"解析 BV号 {bv_number} 第 {page} 页评论响应时发生JSON解码错误。") 90 | except Exception as e: 91 | print(f"处理 BV号 {bv_number} 第 {page} 页评论数据时发生未知错误: {e}") 92 | 93 | 94 | # --- 追加写入Excel文件的逻辑 --- 95 | excel_filename = "多视频评论信息.xls" 96 | start_row = 0 # 新数据写入的起始行 97 | 98 | if os.path.exists(excel_filename): 99 | # 文件已存在,读取并复制 100 | try: 101 | read_workbook = xlrd.open_workbook(excel_filename) 102 | read_sheet = read_workbook.sheet_by_index(0) 103 | start_row = read_sheet.nrows # 新数据从现有数据的下一行开始写入 104 | write_workbook = copy(read_workbook) # 复制一个可写的workbook 105 | write_sheet = write_workbook.get_sheet(0) # 获取第一个sheet (假设数据都在第一个sheet) 106 | print(f"文件 '{excel_filename}' 已存在,新数据将从第 {start_row + 1} 行开始追加。") 107 | except xlrd.XLRDError as e: 108 | print(f"读取现有Excel文件 '{excel_filename}' 时发生错误: {e}") 109 | print("可能文件已损坏或格式不正确,尝试创建新文件。") 110 | # 如果读取失败,按文件不存在处理 111 | start_row = 0 112 | write_workbook = xlwt.Workbook(encoding='utf-8') 113 | write_sheet = write_workbook.add_sheet("评论信息") # 使用一致的sheet名称 114 | # 写入表头 115 | write_sheet.write(0, 0, label='名字') 116 | write_sheet.write(0, 1, label='性别') 117 | write_sheet.write(0, 2, label='等级') 118 | write_sheet.write(0, 3, label='评论内容') 119 | write_sheet.write(0, 4, label='点赞数') 120 | start_row = 1 # 数据从第2行(索引1)开始写 121 | print("创建新的Excel文件。") 122 | else: 123 | # 文件不存在,创建新文件并写入表头 124 | write_workbook = xlwt.Workbook(encoding='utf-8') 125 | write_sheet = write_workbook.add_sheet("评论信息") # 使用一致的sheet名称 126 | # 写入表头 127 | write_sheet.write(0, 0, label='名字') 128 | write_sheet.write(0, 1, label='性别') 129 | write_sheet.write(0, 2, label='等级') 130 | write_sheet.write(0, 3, label='评论内容') 131 | write_sheet.write(0, 4, label='点赞数') 132 | start_row = 1 # 数据从第2行(索引1)开始写 133 | print(f"文件 '{excel_filename}' 不存在,创建新文件。") 134 | 135 | 136 | # 将当前BV号的数据写入Excel 137 | for i in range(len(current_name_list)): 138 | # 写入时使用 start_row + 当前数据在列表中的索引 作为行号 139 | try: 140 | write_sheet.write(start_row + i, 0, label=current_name_list[i]) 141 | write_sheet.write(start_row + i, 1, label=current_sex_list[i]) 142 | write_sheet.write(start_row + i, 2, label=current_level_list[i]) 143 | 
write_sheet.write(start_row + i, 3, label=current_comment_list[i]) 144 | write_sheet.write(start_row + i, 4, label=current_like_list[i]) 145 | except Exception as e: 146 | print(f"写入 BV号 {bv_number} 的第 {start_row + i} 行数据时发生错误: {e}") 147 | print(f"数据详情: 名字={current_name_list[i]}, 性别={current_sex_list[i]}, 等级={current_level_list[i]}, 评论={current_comment_list[i]}, 点赞={current_like_list[i]}") 148 | 149 | 150 | # 保存workbook,这会覆盖掉原来的文件(但已经包含了旧数据和新数据) 151 | try: 152 | write_workbook.save(excel_filename) 153 | print(f"BV号 {bv_number} 的 {len(current_name_list)} 条评论数据已成功追加到 '{excel_filename}'。") 154 | except Exception as e: 155 | print(f"保存Excel文件 '{excel_filename}' 时发生错误: {e}") 156 | 157 | # 添加一个辅助函数,根据BV号获取oid 158 | # B站评论API通常需要oid(视频的aid)而不是bv号 159 | # 虽然有时直接用bv号作为oid也能工作,但稳妥起见还是转换一下 160 | def get_oid_from_bv(bv_number): 161 | """ 162 | 根据BV号获取视频的AID (用于评论API的oid)。 163 | 如果获取失败,返回原始BV号。 164 | """ 165 | url = f"https://api.bilibili.com/x/web-interface/view/detail?bvid={bv_number}" 166 | headers = { 167 | 'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/123.0.0.0 Safari/537.36 Edg/123.0.0.0', 168 | 'cookie': cookie_value 169 | } 170 | try: 171 | response = requests.get(url, headers=headers) 172 | response.raise_for_status() 173 | data = response.json() 174 | if data['code'] == 0 and data['data'] and data['data']['View']: 175 | return data['data']['View']['aid'] 176 | else: 177 | print(f"无法获取BV号 {bv_number} 的AID,API返回: {data.get('message', '未知错误')} 或数据结构不符。将尝试使用BV号作为oid。") 178 | return bv_number # 备用方案 179 | except requests.exceptions.RequestException as e: 180 | print(f"获取BV号 {bv_number} 的AID时发生网络错误: {e}。将尝试使用BV号作为oid。") 181 | return bv_number # 备用方案 182 | except json.JSONDecodeError: 183 | print(f"获取BV号 {bv_number} 的AID时解析响应发生JSON解码错误。将尝试使用BV号作为oid。") 184 | return bv_number # 备用方案 185 | except Exception as e: 186 | print(f"获取BV号 {bv_number} 的AID时发生未知错误: {e}。将尝试使用BV号作为oid。") 187 | return bv_number # 备用方案 188 | 189 | 190 | # --- 如何使用 --- 191 | if __name__ == "__main__": 192 | # 在这里列出你想要爬取的BV号 193 | bv_numbers_to_crawl = [ 194 | # 选取5.12号b站综合热门视频前10 195 | "BV1MZFLeJEPR", #【春意红包2025】时隔八年的新老唱见合唱!祝大家巳巳如意,新春大吉! 196 | "BV1WXVoznEWv", # 电脑C盘爆红不用愁。教你4步彻底清理,安全不踩坑 197 | "BV1Z55VznEdq", # 全程高能!历时60天,37位北大学生把106年前的历史写成游戏!破晓以后,重返五四!邀诸君躬身入局,亲手寻觅! 198 | "BV1odVtzREjb", #毕设《父亲的早餐店》 199 | "BV1NhVZzsEdd", #选 择 你 的 干 员 ! 200 | "BV1VRABehEzm", #《丁 咔》 201 | "BV13bEMz3EBD", #【2008.5.12-2025.5.12】今天,为汶川留一分钟 202 | "BV1vVELzeEpZ", #天津早点吃法教程之方便面篇 203 | "BV1KFEMzcE5B", #这玩意凭什么卖这么贵?! 204 | "BV1ui5PzLExa", #一个人开发游戏,新增子弹击中水面水花特效音效,武器开火模式(视频后半段制作的后坐力蓝图,后面会大改下,集合到结构里取值,视频方法比较繁琐) 205 | "BV1Nf55zgEXP", #独立游戏大电影!数年难遇【神作】!《Indie Cross》究竟讲了什么? 206 | "BV1Rs5LzHE2a", #全球排名第一自助餐!每天500种美食!到底都吃什么? 207 | "BV126VrzVED4", #「炉心融解」镜音リン【YYB式镜音铃】[4K/60FPS] 208 | "BV1pG5Mz7EeJ", #三角洲行动 15万无后座100%命中率,真正的平民神器!全新暗杀流M4A1!【A】 209 | "BV1tf5NzEE7T", #花7天蒸一个大馒头,切开以后惊呆了! 210 | "BV1VrVSz1Eme", #【毕导】这个定律,预言了你的人生进度条 211 | "BV1GoVUz1Eow", #角色触摸系统 212 | "BV1NLVSzYEhF", #【白菜手书】第一次在中国买内衣就在大庭广众被量胸…太羞耻了啦!!💦 213 | "BV1zX5LzAEkc", #帮我见证一下,她叫韩琳,欠我一份炒面! 214 | "BV1nXVRzkEYx", #我叔十年前答应送我的直升飞机,造出来了 215 | "BV1g85Vz3E7t", #大二下 周星驰《食神》仿拍 216 | "BV1wSEMzoEMk", #今年暑假前,会有一款超级厉害的游戏问世 217 | "BV1Qs5VzCEGR", #《河南游六花》 218 | "BV14kVRzJEgz", #因为有你们,我每一步都走的那么坚定 219 | "BV1xE55zqEj8", #探秘全球最大的国家!战斗民族,吃什么? 220 | "BV1YWE7zvE4t", #花26万买翡翠赌石,一刀披麻布! 221 | "BV17q55zsEgd", #游戏是逃离现实的通道,让我们成为每一段故事的主角! 
222 | "BV1Va5MzeEXk", #【原神】满命火神从蒙德城慢骑到纳塔话事处 223 | "BV1VFVQzHE9V", #敌人看到指挥官第一视角直接吓哭了【三角洲行动】【蜂医日记】 224 | "BV151EuzEEQF", #只是想看看她手机有没有下载反诈App 225 | 226 | 227 | 228 | # 可以添加更多BV号 229 | ] 230 | 231 | for bv in bv_numbers_to_crawl: 232 | crawl_data_and_append(bv) 233 | print("-" * 30) # 分隔不同BV号的爬取过程 234 | time.sleep(random.uniform(3, 15)) -------------------------------------------------------------------------------- /主文件中的代码/Data_visualization.py: -------------------------------------------------------------------------------- 1 | import pandas as pd 2 | from pyecharts import options as opts 3 | from pyecharts.charts import Bar, Pie, Page 4 | from pyecharts.faker import Faker 5 | 6 | def visualize_data(filepath, image_path=None, output_excel_path="创作者评估报告.xlsx"): 7 | # 读取数据 8 | data = pd.read_excel(filepath, names=["名字", "性别", "等级", "评论内容", "点赞数", "BERT标签" ], 9 | usecols=[0, 1, 2, 3, 4, 5]) 10 | sentiment_data = pd.read_excel(filepath, names=["性别", "BERT标签"], 11 | usecols=[1, 5]) 12 | # 重新加载BERT标签输出结果数据 13 | analysis_sentiment_data = pd.read_excel(filepath) 14 | # 计算该创作者所有视频的正向和负向评论数量 15 | creator_sentiment_counts = analysis_sentiment_data['BERT标签'].value_counts() 16 | # 计算正向和负向评论的比例 17 | creator_positive_ratio = creator_sentiment_counts.get('正向', 0) / len(analysis_sentiment_data) 18 | creator_negative_ratio = creator_sentiment_counts.get('负向', 0) / len(analysis_sentiment_data) 19 | # 分析不同性别用户对该创作者内容的BERT标签反应 20 | # 计算每个性别的正向和负向评论数量 21 | gender_sentiment_counts = analysis_sentiment_data.groupby('性别')['BERT标签'].value_counts().unstack().fillna(0) 22 | # 计算每个性别的正向和负向评论的比例 23 | gender_positive_ratio = gender_sentiment_counts['正向'] / gender_sentiment_counts.sum(axis=1) 24 | gender_negative_ratio = gender_sentiment_counts['负向'] / gender_sentiment_counts.sum(axis=1) 25 | # 整理创作者分析画像文本 26 | creator_analysis_text = f""" 27 |
(creator_analysis_text 的 HTML 报告模板在导出时被剥离了标签,文件其余部分亦被截断,仅存以下文本片段)
正向评论比例: 正向评论占比为 {creator_positive_ratio:.2%}
30 | 负向评论比例: 负向评论占比为 {creator_negative_ratio:.2%}
31 | 正向评论比例:
33 | 负向评论比例:
39 | * 核心指标分析 *
114 | * 创作者类型 *
121 | {creator_type}
122 | 创作者高频词为:{top_words}
123 |
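Data_visualization.py 的其余内容(pyecharts 图表与创作者评估报告的生成逻辑)已在上面的导出中丢失。下面是一个假设性的最小示意,仅展示同类可视化大致如何用 pyecharts 实现并输出到 所有可视化图表.html;函数名与图表选择均为示例,并非原文件内容。

```python
# 假设性示意(非原文件内容):读取 BERT 情感输出,生成情感占比饼图并渲染为 HTML
import pandas as pd
from pyecharts import options as opts
from pyecharts.charts import Pie, Page

def render_sentiment_charts(filepath="情感输出结果_bert.xlsx", html_path="所有可视化图表.html"):
    data = pd.read_excel(filepath)
    counts = data["BERT标签"].value_counts()
    pie = (
        Pie()
        .add("情感分布", [(str(k), int(v)) for k, v in counts.items()])
        .set_global_opts(title_opts=opts.TitleOpts(title="评论情感分布"))
    )
    page = Page()
    page.add(pie)
    page.render(html_path)  # 生成可在浏览器中打开的 HTML 文件
    return html_path
```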