├── .gitignore ├── README.md ├── __pycache__ ├── common.cpython-39.pyc ├── config.cpython-39.pyc ├── metrics.cpython-39.pyc ├── model.cpython-39.pyc ├── process_data_dl.cpython-39.pyc ├── process_data_ml.cpython-39.pyc └── process_data_pretrain.cpython-39.pyc ├── common.py ├── config.py ├── dl_algorithm ├── capsules_model.py ├── cnn.py ├── dl_config.py ├── dl_model.py ├── lstm.py └── transformer.py ├── logs └── events.out.tfevents.1679558718.huangzihengdeMacBook-Air.local ├── main.py ├── metrics.py ├── ml_algorithm └── ml_model.py ├── model.py ├── pic ├── pic_dl.png ├── pic_ml.png ├── pretrain_pic.png ├── result.png ├── tensorboard.png ├── test_pic.png └── train_pic.png ├── pretrain_algorithm ├── bert_graph.py ├── deberta_graph.py ├── nezha_graph.py ├── pre_model.py └── roberta_wwm.py ├── pretrain_model └── bert_wwm │ └── readme.txt ├── process_data_dl.py ├── process_data_ml.py ├── process_data_pretrain.py ├── requirements.txt ├── save_model └── knn.pkl ├── trick ├── dynamic_padding.py ├── early_stop.py ├── fgm_pgd_ema.py ├── init_model.py └── set_all_seed.py └── word2vec_train.py /.gitignore: -------------------------------------------------------------------------------- 1 | # .gitignore 2 | data/ 3 | logs/ 4 | */__pycache__ 5 | visualization_data.ipynb 6 | cs.py 7 | pretrain_model/ 8 | save_model/ 9 | pic/ -------------------------------------------------------------------------------- /README.md: -------------------------------------------------------------------------------- 1 | # 机器学习代码模板(开箱即用) 2 | 3 | 4 | - [机器学习代码模板(开箱即用)](#机器学习代码模板开箱即用) 5 | - [关于 ](#关于-) 6 | - [1. 介绍](#1-介绍) 7 | - [2. 目前已经涵盖的算法](#2-目前已经涵盖的算法) 8 | - [2.1 常见的机器学习算法](#21-常见的机器学习算法) 9 | - [2.2 常见的深度学习算法](#22-常见的深度学习算法) 10 | - [2.3 预训练模型](#23-预训练模型) 11 | - [前提准备 ](#前提准备-) 12 | - [环境安装](#环境安装) 13 | - [具体使用方法 ](#具体使用方法-) 14 | - [数据集github传不上来,请移步 数据集下载,下载后直接解压到根目录即可](#数据集github传不上来请移步-数据集下载下载后直接解压到根目录即可) 15 | - [3. 参数介绍](#3-参数介绍) 16 | - [3.1 针对常见的机器学习算法](#31-针对常见的机器学习算法) 17 | - [3.2 针对深度神经网络算法](#32-针对深度神经网络算法) 18 | - [3.3 针对预训练模型](#33-针对预训练模型) 19 | - [Note:](#note) 20 | - [文件目录介绍](#文件目录介绍) 21 | - [开发日志](#开发日志) 22 | 23 | 24 | --- 25 | 26 | ## 关于 27 | ### 1. 介绍 28 | 29 | > 这是一个包含多种机器学习算法的代码模板库,其主要用于NLP中文本分类的下游任务,包括二分类及多分类。使用者只需更改一些参数例如数据集地址,算法名称等,即可以使用其中的各种模型来进行文本分类(前提是数据集与我提供的数据集形式一致,具体可以看data/ 下我提供的数据集),各种算法的参数只在xx_config.py单个文件中提供,方便用户对神经网络模型进行调参。 30 | ### 2. 目前已经涵盖的算法 31 | #### 2.1 常见的机器学习算法 32 | 33 | - Logistic Regression 34 | - KNN 35 | - Decision Tree 36 | - Random Forest 37 | - GBDT(Gradient Boosting Decision Tree) 38 | - XGBoost 39 | - Catboost 40 | - SVM 41 | - Bayes 42 | - todo... 43 | 44 | 45 | #### 2.2 常见的深度学习算法 46 | 47 | - TextCNN 48 | - Bi-LSTM 49 | - Transformer 50 | - Capsules 51 | - todo... 52 | 53 | #### 2.3 预训练模型 54 | - Bert_WWM 55 | - MacBert 56 | - NEZHA_WWM 57 | - RoBerta_WWM 58 | - todo... 59 | --- 60 | 61 | 62 | 63 | ## 前提准备 64 | 65 | ### 环境安装 66 | 67 | 具体的相关库的版本见requestments.txt 68 | 69 | - 使用命令安装 70 | 71 | ``` 72 | pip install -r requestments.txt 73 | ``` 74 | 75 | 76 | 77 | ## 具体使用方法 78 |
79 | 80 | ### 数据集github传不上来,请移步 [数据集下载](https://pan.baidu.com/s/1_2qhpb4eRbraFAShSoPVhQ?pwd=c4n6),下载后直接解压到根目录即可 81 |
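After extracting the archive to the repository root, a quick sanity check can confirm the data is where the example commands expect it. This is a minimal sketch, not part of the repository; the path comes from the example commands in section 3.1 below, and no column names are assumed (they are simply printed):

```
# Quick sanity check after extracting the downloaded archive to the repository root.
# './data/processed_data.csv' is the path used by the example commands below;
# no column names are assumed here, they are only printed for inspection.
import pandas as pd

df = pd.read_csv('./data/processed_data.csv')
print(df.shape)             # number of rows / columns
print(df.columns.tolist())  # column layout of the provided dataset
print(df.head())
```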
82 | 83 | ### 3. 参数介绍 84 | ***主程序:main.py,其中各个参数的含义如下:*** 85 | 86 | > *--data_path*: 一个完整的(未切分训练集测试集)的数据集路径 87 | > 88 | > *--model_name*: 需要使用的算法名称,填写的简称见config.py中的ML_MODEL_NAME和DL_MODEL_NAME 89 | > 90 | > *--model_saved_path*: 模型存储的路径 91 | > 92 | > *--type_obj*: 程序的运行目的:train,test,predict三个选项 93 | > 94 | > *--train_data_path*: 切分好的训练集路径 95 | > 96 | > *--test_data_path*: 切分好的测试集路径 97 | > 98 | > *--dev_data_path*: 切分好的验证集路径 99 | ### 3.1 针对常见的机器学习算法 100 | 101 | ***终端命令如下:*** 102 | ``` 103 | python main.py --data_path [] --model_name [] --model_saved_path [] --type_obj [] 104 | ``` 105 | ***示例*** 106 | 107 | ``` 108 | # 训练 109 | python main.py --data_path ./data/processed_data.csv --model_saved_path ./save_model/ --model_name lg --type_obj train 110 | # 测试 111 | python main.py --test_data_path ./data/processed_data.csv --model_saved_path ./save_model/ --model_name lg --type_obj test 112 | # 预测 113 | python main.py --dev_data_path ./data/processed_data.csv --model_saved_path ./save_model/ --model_name lg --type_obj predict 114 | ``` 115 | 116 | 解释:这里的train_data_path, test_data_path, dev_data_path都默认为空,ml的数据处理模块会自动按照7:3划分训练集和测试集,并且默认进行下采样,避免数据不平衡带来的不良影响,划分比例和是否下采样参数可在config.py自行修改,如果参数train_data_path, test_data_path被指定,则无需指定data_path, split_size, is_sample参数 117 | 118 | ***运行结果如下:*** 119 | 120 | ![result.png](pic/result.png) 121 | 122 | ***结果图片展示:*** 123 | 124 | ![result_ml.png](pic/pic_ml.png) 125 | 126 | ### 3.2 针对深度神经网络算法 127 | 128 | 129 | ***示例*** 130 | 131 | ``` 132 | # 训练代码 133 | # python main.py --model_name lstm --model_saved_path ./save_model/ --type_obj train --train_data_path ./data/dl_data/test.csv --test_data_path ./data/dl_data/dev.csv 134 | # 测试代码 135 | # python main.py --model_name lstm --model_saved_path ./save_model/ --type_obj test --test_data_path ./data/dl_data/test.csv 136 | # 预测代码 137 | # python main.py --model_name lstm --model_saved_path ./save_model/ --type_obj predict --dev_data_path ./data/dl_data/dev.csv 138 | ``` 139 | ***运行结果如下:*** 140 | 141 | ![result2.png](pic/pic_dl.png) 142 | ***结果图片展示:*** 143 | 144 | 由于采用的数据是多分类,画的图比较乱,多分类暂时不输出图 145 | 146 | ***训练过程指标参数变化可视化展示:*** 147 | ![result3.png](pic/tensorboard.png) 148 | 149 | ### 3.3 针对预训练模型 150 | ***示例*** 151 | ``` 152 | # 训练 153 | # python main.py --model_name mac_bert --model_saved_path ./save_model/mac_bert --type_obj train --train_data_path ./data/dl_data/train.csv --test_data_path ./data/dl_data/test.csv --pretrain_file_path ./pretrain_model/mac_bert/ 154 | # 测试 155 | # python main.py --model_name mac_bert --model_saved_path ./save_model/mac_bert --type_obj test --test_data_path ./data/dl_data/test.csv 156 | # 预测 157 | # python main.py --model_name mac_bert --model_saved_path ./save_model/mac_bert --type_obj predict --test_data_path ./data/dl_data/dev.csv 158 | ``` 159 | ***运行结果如下:*** 160 | 161 | ![result3.png](pic/pretrain_pic.png) 162 | ### Note: 163 | >> **常见的机器学习算法调参在 ml_algorithm/ml_moel.py下
深度神经网络/预训练模型的调参在 dl_algorithm/dl_config.py下
其他全局参数调参在 ./config.py下
从transformers官网下载的预训练模型放在pretrain_model/下
在运行代码后,新建一个终端输入 `tensorboard --logdir logs`,然后打开浏览器访问 http://localhost:6006,即可看到训练过程中 f1、loss、learning rate 的变化,具体使用教程可自行搜索 tensorboard** 164 | 165 | ## 文件目录介绍 166 | 167 | ## 开发日志 168 | |已添加的机器学习相关算法|已添加的深度学习相关算法|其他功能新增或优化| 169 | |:-|:-|:-| 170 | |1. LogisticRegression<br/>
2. KNN<br/>
3. DecisionTree
4. SVM|1. TextCNN
2. Bi-LSTM<br/>
3. Transformer<br/>
4. Capsules # 按照2017年的论文直接改过来的,原论文做的是图片分类,直接改成文本分类效果特别差;2018年出了一篇基于胶囊网络的文本分类的论文,还没有看如何实现(**todo**)|1. 优化读取文件(增加用户指定训练集和测试集位置)<br/>
2. 区分DL和ML模型的构建
3. DL模型的参数文件撰写
4. 处理DL的数据集,兼容整体的DataLoader通用方法| 171 | |5. GaussianNB<br/>
6. RandomForest
7. GBDT
8. XGBoost|5. Bert_WWM<br/>
6. Mac_bert
7. NEZHA_WWM
8. RoBerta_WWM|5. 解决plt.show阻塞问题,改为显示1秒,然后保存在当前目录下<br/>
6. 深度学习中数据的处理(转换id,构建词表)
7. dataset类构建
8. 添加3种模型权重初始化代码| 172 | |9. CatBOOST||9. 模型训练代码
10. 模型评估代码
11. 添加早停机制
12. 参数优化| 173 | |||13. 解决Lstm的输出bug
14. 完成test,predict模块
15. main函数优化
16. 新增预训练词向量的载入功能| 174 | |||17. 深度学习下,输入单个数据集,自动进行数据切分及下采样,无需人工划分和采样(**todo**)
18. 英文文本分类待添加,主要体现在分词部分(**todo**)
19. 添加竞赛trick【FGM、PGD、EMA】策略
20. 添加竞赛trick【将bert的cls输出修改为中间多层embed的输出加权平均,详情看bert_graph.py】| 175 | |||21. 所有代码的关键地方添加注释,方便理解修改代码<br/>
22. 完善readme文档
23. 添加mac电脑m系列芯片加速支持
24. 优化代码逻辑| 176 | |||25. 利用tensorboardx的添加可视化过程| 177 | -------------------------------------------------------------------------------- /__pycache__/common.cpython-39.pyc: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/hziheng/Machine-learning-project-for-text-classification/ec6a7517adaf4618148d25f9d192d76b3f747e10/__pycache__/common.cpython-39.pyc -------------------------------------------------------------------------------- /__pycache__/config.cpython-39.pyc: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/hziheng/Machine-learning-project-for-text-classification/ec6a7517adaf4618148d25f9d192d76b3f747e10/__pycache__/config.cpython-39.pyc -------------------------------------------------------------------------------- /__pycache__/metrics.cpython-39.pyc: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/hziheng/Machine-learning-project-for-text-classification/ec6a7517adaf4618148d25f9d192d76b3f747e10/__pycache__/metrics.cpython-39.pyc -------------------------------------------------------------------------------- /__pycache__/model.cpython-39.pyc: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/hziheng/Machine-learning-project-for-text-classification/ec6a7517adaf4618148d25f9d192d76b3f747e10/__pycache__/model.cpython-39.pyc -------------------------------------------------------------------------------- /__pycache__/process_data_dl.cpython-39.pyc: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/hziheng/Machine-learning-project-for-text-classification/ec6a7517adaf4618148d25f9d192d76b3f747e10/__pycache__/process_data_dl.cpython-39.pyc -------------------------------------------------------------------------------- /__pycache__/process_data_ml.cpython-39.pyc: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/hziheng/Machine-learning-project-for-text-classification/ec6a7517adaf4618148d25f9d192d76b3f747e10/__pycache__/process_data_ml.cpython-39.pyc -------------------------------------------------------------------------------- /__pycache__/process_data_pretrain.cpython-39.pyc: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/hziheng/Machine-learning-project-for-text-classification/ec6a7517adaf4618148d25f9d192d76b3f747e10/__pycache__/process_data_pretrain.cpython-39.pyc -------------------------------------------------------------------------------- /common.py: -------------------------------------------------------------------------------- 1 | #!/usr/bin/env python 2 | # -*- encoding: utf-8 -*- 3 | ''' 4 | @File : common.py 5 | @Time : 2023/02/09 14:33:09 6 | @Author : Huang zh 7 | @Contact : jacob.hzh@qq.com 8 | @Version : 0.1 9 | @Desc : some common func 10 | ''' 11 | 12 | import time 13 | import os 14 | import glob 15 | import torch 16 | from datetime import timedelta 17 | 18 | def get_time_dif(start_time): 19 | """获取已使用时间""" 20 | end_time = time.time() 21 | time_dif = end_time - start_time 22 | return timedelta(seconds=int(round(time_dif))) 23 | 24 | 25 | def check_args(args): 26 | args.setting_file = os.path.join(args.checkpoint_dir, args.setting_file) 27 | args.log_file = os.path.join(args.checkpoint_dir, args.log_file) 28 | 
os.makedirs(args.checkpoint_dir, exist_ok=True) 29 | with open(args.setting_file, 'wt') as opt_file: 30 | opt_file.write('------------ Options -------------\n') 31 | print('------------ Options -------------') 32 | for k in args.__dict__: 33 | v = args.__dict__[k] 34 | opt_file.write('%s: %s\n' % (str(k), str(v))) 35 | print('%s: %s' % (str(k), str(v))) 36 | opt_file.write('-------------- End ----------------\n') 37 | print('------------ End -------------') 38 | 39 | return args 40 | 41 | 42 | def torch_show_all_params(model, rank=0): 43 | params = list(model.parameters()) 44 | k = 0 45 | for i in params: 46 | l = 1 47 | for j in i.size(): 48 | l *= j 49 | k = k + l 50 | if rank == 0: 51 | print("Total param num:" + str(k)) 52 | 53 | 54 | def torch_init_model(model, init_checkpoint, delete_module=False): 55 | state_dict = torch.load(init_checkpoint, map_location='cpu') 56 | state_dict_new = {} 57 | # delete module代表是否你是用了DistributeDataParallel分布式训练. 58 | # 这里如果是用了pytorch的DDP方式训练,要删掉module这个字段的内容 59 | if delete_module: 60 | for key in state_dict.keys(): 61 | v = state_dict[key] 62 | state_dict_new[key.replace('module.', '')] = v 63 | state_dict = state_dict_new 64 | missing_keys = [] 65 | unexpected_keys = [] 66 | error_msgs = [] 67 | # copy state_dict so _load_from_state_dict can modify it 68 | metadata = getattr(state_dict, '_metadata', None) 69 | state_dict = state_dict.copy() 70 | if metadata is not None: 71 | state_dict._metadata = metadata 72 | 73 | def load(module, prefix=''): 74 | local_metadata = {} if metadata is None else metadata.get(prefix[:-1], {}) 75 | 76 | module._load_from_state_dict( 77 | state_dict, prefix, local_metadata, True, missing_keys, unexpected_keys, error_msgs) 78 | for name, child in module._modules.items(): 79 | if child is not None: 80 | load(child, prefix + name + '.') 81 | 82 | load(model, prefix='' if hasattr(model, 'bert') else 'bert.') 83 | 84 | 85 | def torch_save_model(model, output_dir, scores, max_save_num=1): 86 | # Save model checkpoint 87 | if not os.path.exists(output_dir): 88 | os.makedirs(output_dir) 89 | model_to_save = model.module if hasattr(model, 'module') else model # Take care of distributed/parallel training 90 | saved_pths = glob(os.path.join(output_dir, '*.pth')) 91 | saved_pths.sort() 92 | while len(saved_pths) >= max_save_num: 93 | if os.path.exists(saved_pths[0].replace('//', '/')): 94 | os.remove(saved_pths[0].replace('//', '/')) 95 | del saved_pths[0] 96 | 97 | save_prex = "checkpoint_score" 98 | for k in scores: 99 | save_prex += ('_' + k + '-' + str(scores[k])[:6]) 100 | save_prex += '.pth' 101 | 102 | torch.save(model_to_save.state_dict(), 103 | os.path.join(output_dir, save_prex)) 104 | print("Saving model checkpoint to %s", output_dir) -------------------------------------------------------------------------------- /config.py: -------------------------------------------------------------------------------- 1 | #!/usr/bin/env python 2 | # -*- encoding: utf-8 -*- 3 | ''' 4 | @File : config.py 5 | @Time : 2023/01/13 16:19:42 6 | @Author : Huang zh 7 | @Contact : jacob.hzh@qq.com 8 | @Version : 0.1 9 | @Desc : None 10 | ''' 11 | 12 | ML_MODEL_NAME = ['lg', 'knn', 'dt', 'rf', 'gbdt', 'xgb', 'catboost', 'svm', 'bayes'] 13 | 14 | DL_MODEL_NAME = ['lstm', 'cnn', 'transformer', 'capsules'] 15 | 16 | PRE_MODEL_NAME = ['mac_bert', 'bert_wwm', 'bert', 'nezha_wwm', 'roberta_wwm'] 17 | 18 | BATCH_SIZE = 8 19 | 20 | SPLIT_SIZE = 0.3 21 | 22 | IS_SAMPLE = True 23 | 24 | PIC_SAVED_PATH = './pic/' # result的pic图片保存的路径 25 | 26 | VOCAB_MAX_SIZE 
= 100000 # 词表中词的最大数量 27 | 28 | WORD_MIN_FREQ = 5 # 词表中一个单词出现的最小频率 29 | 30 | VOCAB_SAVE_PATH = './data/vocab_dic.pkl' # 词表存储的位置 31 | 32 | L2I_SAVE_PATH = './data/label2id.pkl' # label的映射表 33 | 34 | PRETRAIN_EMBEDDING_FILE = './data/embed.txt' 35 | 36 | VERBOSE = 1 # 每隔10个epoch 输出一次训练结果和测试的loss 37 | 38 | MAX_SEQ_LEN = 100 # 使用预训练模型时,设置允许每条文本数据的最长长度 -------------------------------------------------------------------------------- /dl_algorithm/capsules_model.py: -------------------------------------------------------------------------------- 1 | #!/usr/bin/env python 2 | # -*- encoding: utf-8 -*- 3 | ''' 4 | @File : cap.py 5 | @Time : 2023/03/02 19:34:32 6 | @Author : Huang zh 7 | @Contact : jacob.hzh@qq.com 8 | @Version : 0.1 9 | @Desc : capsules network for text classfier 10 | ''' 11 | 12 | import torch 13 | import torch.nn as nn 14 | import torch.nn.functional as F 15 | 16 | 17 | class Squash(nn.Module): 18 | def __init__(self, epsilon=1e-8): 19 | super().__init__() 20 | # 防止除数为0 21 | self.epsilon = epsilon 22 | 23 | def forward(self, x): 24 | # x: [batch_size, nums_capsules, n_features] 25 | s2 = (x ** 2).sum(dim=-1, keepdim=True) 26 | return (s2 / (1+s2)) * (x / torch.sqrt(s2 + self.epsilon)) 27 | 28 | class Router(nn.Module): 29 | def __init__(self, in_d, out_d, iterations=3): 30 | """ 31 | Args: 32 | 33 | in_d (int): per capsues features, paper set 8 34 | out_d (int): 4*4=16 in paper 35 | iterations (int): Cij更新迭代的次数,论文里说3次就可以了 36 | """ 37 | super().__init__() 38 | self.in_d = in_d 39 | self.out_d = out_d 40 | self.iterations = iterations 41 | self.softmax = nn.Softmax(dim=1) 42 | self.squash = Squash() 43 | 44 | def forward(self, nums_caps, out_caps, x): 45 | # nums_caps (int): nums of capsules 46 | # out_caps (int): unique labels 47 | # x: [batch_size, nums_capsules, n_features] 48 | # [1152,10,8,16]*[64,1152,8] -> [64,1152,10,16] 49 | 50 | # init Wij 51 | # [1152,10,8,16] 52 | self.w = nn.Parameter(torch.randn(nums_caps, out_caps, self.in_d, self.out_d)) 53 | 54 | u_hat = torch.einsum('ijnm,bin->bijm', self.w, x) 55 | 56 | # init bij --> zero [batch, nums_capsules, out_caps] [64, 1152,10] 57 | b = x.new_zeros(x.shape[0], nums_caps, out_caps) 58 | v = None 59 | 60 | for i in range(self.iterations): 61 | c = self.softmax(b) #[64, 1152, 16] 62 | s = torch.einsum('bij,bijm->bjm', c, u_hat) 63 | v = self.squash(s) 64 | a = torch.einsum('bjm,bijm->bij', v, u_hat) 65 | b = b + a 66 | return v 67 | 68 | class MarginLoss(nn.Module): 69 | def __init__(self, lambda_=0.5, m1=0.9, m2=0.1): 70 | super().__init__() 71 | self.m1 = m1 72 | self.m2 = m2 73 | self.lambda_ = lambda_ 74 | 75 | def forward(self, v, labels): 76 | # v: [batch_size, out_caps, out_d] 64,10,16 there is a capsule for each label 77 | # labels : [batch_size] 78 | n_labels = v.shape[1] 79 | v_norm = torch.sqrt(v) #[batch_size, out_caps] 80 | labels = torch.eye(n_labels, device=labels.device)[labels] #[batch_size, out_caps] 81 | loss = labels * F.relu(self.m1 - v_norm) + self.lambda_ * (1.0-labels) * F.relu(v_norm - self.m2) 82 | return loss.sum(dim=-1).mean() 83 | 84 | class capsules_model(nn.Module): 85 | def __init__(self, dlconfig): 86 | super().__init__() 87 | if dlconfig.embedding_pretrained == 'random': 88 | self.embedding = nn.Embedding(dlconfig.vocab_size, dlconfig.embedding_size, padding_idx=dlconfig.vocab_size-1) 89 | else: 90 | self.embedding = nn.Embedding.from_pretrained(dlconfig.embedding_matrix, freeze=False, padding_idx=dlconfig.vocab_size-1) 91 | self.in_d = dlconfig.in_d 92 | self.out_d = dlconfig.out_d 93 | 
self.nums_label = dlconfig.nums_label 94 | self.reshape_num = dlconfig.reshape_num 95 | self.conv1 = nn.Conv2d(1, 256, (2, dlconfig.embedding_size), stride=1, padding=dlconfig.pad_size) 96 | self.conv2 = nn.Conv2d(256, self.reshape_num * self.in_d, (2, 1), stride=2, padding=dlconfig.pad_size) 97 | self.squash = Squash() 98 | self.digit_capsules = Router(self.in_d, self.out_d, dlconfig.iter) 99 | 100 | def forward(self, data): 101 | x = self.embedding(data) 102 | x = x.unsqueeze(1) 103 | x = F.relu(self.conv1(x)) 104 | x = self.conv2(x) 105 | 106 | caps = x.view(x.shape[0], self.in_d, self.reshape_num*x.shape[-1]*x.shape[-2]).permute(0, 2, 1) 107 | caps = self.digit_capsules(caps.shape[1], self.nums_label, caps) 108 | 109 | # pre = (caps ** 2).sum(-1).argmax(-1) 110 | return (caps ** 2).sum(-1) 111 | -------------------------------------------------------------------------------- /dl_algorithm/cnn.py: -------------------------------------------------------------------------------- 1 | #!/usr/bin/env python 2 | # -*- encoding: utf-8 -*- 3 | ''' 4 | @File : cnn.py 5 | @Time : 2023/01/17 15:22:31 6 | @Author : Huang zh 7 | @Contact : jacob.hzh@qq.com 8 | @Version : 0.1 9 | @Desc : cnn for textclassifier 10 | ''' 11 | 12 | import torch 13 | import torch.nn as nn 14 | import torch.nn.functional as F 15 | 16 | 17 | class TextCNN(nn.Module): 18 | def __init__(self, dlconfig): 19 | super().__init__() 20 | if dlconfig.embedding_pretrained == 'random': 21 | self.embedding = nn.Embedding(dlconfig.vocab_size, dlconfig.embedding_size, padding_idx=dlconfig.vocab_size-1) 22 | else: 23 | self.embedding = nn.Embedding.from_pretrained(dlconfig.embedding_matrix, freeze=False, padding_idx=dlconfig.vocab_size-1) 24 | self.convs = nn.ModuleList( 25 | [nn.Conv2d(1, dlconfig.nums_filters, (k, dlconfig.embedding_size), stride=dlconfig.stride, padding=dlconfig.pad_size) for k in dlconfig.filter_size] 26 | ) 27 | self.dropout = nn.Dropout(p=dlconfig.dropout) 28 | self.relu = nn.ReLU(inplace=True) 29 | self.fc = nn.Linear(dlconfig.nums_filters * len(dlconfig.filter_size), dlconfig.nums_label) 30 | 31 | def conv_and_pool(self, x, conv): 32 | x = self.relu(conv(x)) 33 | x = x.squeeze(3) 34 | x = F.max_pool1d(x, x.size(2)) 35 | x = x.squeeze(2) 36 | return x 37 | 38 | def forward(self, x): 39 | x = self.embedding(x) 40 | x = x.unsqueeze(1) # 增加通道数为1 41 | x = [self.conv_and_pool(x, conv) for conv in self.convs] 42 | x = torch.cat(x, 1) 43 | x = self.dropout(x) 44 | x = self.fc(x) 45 | return x -------------------------------------------------------------------------------- /dl_algorithm/dl_config.py: -------------------------------------------------------------------------------- 1 | #!/usr/bin/env python 2 | # -*- encoding: utf-8 -*- 3 | ''' 4 | @File : dl_config.py 5 | @Time : 2023/02/07 17:27:38 6 | @Author : Huang zh 7 | @Contact : jacob.hzh@qq.com 8 | @Version : 0.1 9 | @Desc : some params of dl model, 参数全部在这里改 10 | ''' 11 | 12 | import torch 13 | from process_data_dl import DataSetProcess 14 | from dl_algorithm.capsules_model import MarginLoss 15 | from config import PRE_MODEL_NAME 16 | 17 | class DlConfig: 18 | """ 19 | model_name: LSTM, CNN, Transformer, capsules... 
20 | """ 21 | 22 | def __init__(self, model_name, vocab_size, label2id_nums, vocab_dict, embedding_pretrained='pretrain'): 23 | self.model_name = model_name 24 | self.train_data_path = '' 25 | self.test_data_path = '' 26 | self.dev_data_path = '' 27 | self.vocab_size = vocab_size 28 | self.nums_label = label2id_nums 29 | self.embedding_size = 200 30 | self.embedding_pretrained = embedding_pretrained # random, pretrain 31 | if self.embedding_pretrained != 'random': 32 | self.embedding_matrix, dim = DataSetProcess().load_emb(vocab_dict) 33 | self.embedding_size = dim 34 | self.device = 'cuda' if torch.cuda.is_available() else 'cpu' 35 | # 针对mac m1 m2芯片,使用mps加速代码 36 | if torch.backends.mps.is_available(): 37 | self.device = 'mps' 38 | print(f'use device: {self.device}') 39 | self.dropout = 0.5 40 | self.epochs = 10 41 | self.learning_rate = 3e-5 42 | self.update_lr = True # 是否使用衰减学习率的方法动态更新学习率 43 | self.warmup_prop = 0.1 # 学习率更新策略系数 44 | self.loss_type = 'multi' # 'binary, regression, marginLoss, multi' 45 | self.judge_loss_fct() 46 | self.create_special_params() 47 | 48 | def create_special_params(self): 49 | if self.model_name == 'lstm': 50 | self.hidden_size = 128 51 | self.nums_layer = 1 # lstm的层数stack的 52 | elif self.model_name == 'cnn': 53 | self.nums_filters = 256 # 卷积核的数量 54 | self.filter_size = (2, 3, 4) # 相当于提取2gram,3gram,4gram的信息 55 | self.stride = 1 56 | self.pad_size = 0 57 | elif self.model_name == 'transformer': 58 | self.heads = 5 # 确保能被embed_size 整除 59 | self.n_layers = 2 # encoder里有几个transformer 60 | self.hidden = 1024 61 | self.d_model = self.embedding_size 62 | elif self.model_name == 'capsules': 63 | # 注意:self.in_d * self.reshape_num = 256 64 | self.in_d = 8 65 | self.reshape_num = 32 66 | self.out_d = 16 67 | self.iter = 3 # cij 的迭代次数 68 | self.pad_size = 0 69 | #==============================# 70 | elif self.model_name in PRE_MODEL_NAME: 71 | self.use_fgm = True # 是否使用fgm (Fast Gradient Method) 72 | else: 73 | pass 74 | 75 | def judge_loss_fct(self): 76 | if self.loss_type == 'multi': 77 | # torch.nn.CrossEntropyLoss(input, target)的input是没有归一化的每个类的得分,而不是softmax之后的分布 78 | # target是:类别的序号。形如 target = [1, 3, 2] 79 | self.loss_fct = torch.nn.CrossEntropyLoss() 80 | elif self.loss_type == 'binary': 81 | self.loss_fct = torch.nn.BCELoss() 82 | elif self.loss_type == 'regression': 83 | self.loss_fct = torch.nn.MSELoss() 84 | elif self.loss_type == 'marginLoss': 85 | self.loss_fct = MarginLoss() 86 | else: 87 | #! 
这里自定义loss函数 88 | pass 89 | 90 | 91 | -------------------------------------------------------------------------------- /dl_algorithm/dl_model.py: -------------------------------------------------------------------------------- 1 | #!/usr/bin/env python 2 | # -*- encoding: utf-8 -*- 3 | ''' 4 | @File : dl_model.py 5 | @Time : 2023/02/07 19:54:48 6 | @Author : Huang zh 7 | @Contact : jacob.hzh@qq.com 8 | @Version : 0.1 9 | @Desc : deep net model excuter 10 | ''' 11 | 12 | import os 13 | import time 14 | import torch 15 | import gc 16 | import numpy as np 17 | from tqdm import tqdm 18 | from metrics import Matrix 19 | from config import DL_MODEL_NAME, VERBOSE 20 | from dl_algorithm.lstm import LSTM 21 | from dl_algorithm.cnn import TextCNN 22 | from dl_algorithm.transformer import TransformerModel 23 | from dl_algorithm.capsules_model import capsules_model 24 | from trick.init_model import init_network 25 | from trick.early_stop import EarlyStopping 26 | from common import get_time_dif 27 | from tensorboardX import SummaryWriter 28 | 29 | 30 | class DL_EXCUTER: 31 | def __init__(self, dl_config): 32 | self.dlconfig = dl_config 33 | 34 | def judge_model(self, assign_path=''): 35 | if self.dlconfig.model_name not in DL_MODEL_NAME: 36 | print('dl model name is not support, please see DL_MODEL_NAME of config.py') 37 | if self.dlconfig.model_name == 'lstm': 38 | self.model = LSTM(self.dlconfig) 39 | elif self.dlconfig.model_name == 'cnn': 40 | self.model = TextCNN(self.dlconfig) 41 | elif self.dlconfig.model_name == 'transformer': 42 | self.model = TransformerModel(self.dlconfig) 43 | elif self.dlconfig.model_name == 'capsules': 44 | self.model = capsules_model(self.dlconfig) 45 | #! 其他模型 46 | else: 47 | pass 48 | init_network(self.model) 49 | print('初始化网络权重完成,默认采用xavier') 50 | self.model.to(self.dlconfig.device) 51 | 52 | 53 | def train(self, train_loader, test_loader, dev_loader, model_saved_path, model_name): 54 | # 设置优化器 55 | optimizer = torch.optim.AdamW(self.model.parameters(), lr=self.dlconfig.learning_rate) 56 | best_test_f1 = 0 57 | # 定义一个summarywriter对象,用来可视化 58 | writer = SummaryWriter(logdir='./logs') 59 | # 学习率指数衰减,每次epoch:学习率 = gamma * 学习率 60 | # scheduler = torch.optim.lr_scheduler.ExponentialLR(optimizer, gamma=0.9) 61 | 62 | # 学习更新策略--预热(warmup) 63 | if self.dlconfig.update_lr: 64 | from transformers import get_linear_schedule_with_warmup 65 | num_warmup_steps = int(self.dlconfig.warmup_prop * self.dlconfig.epochs * len(train_loader)) 66 | num_training_steps = int(self.dlconfig.epochs * len(train_loader)) 67 | scheduler = get_linear_schedule_with_warmup(optimizer, num_warmup_steps, num_training_steps) 68 | 69 | # 早停策略 70 | early_stopping = EarlyStopping(patience = 20, delta=0) 71 | 72 | for epoch in range(self.dlconfig.epochs): 73 | # 设定训练模式 74 | self.model.train() 75 | # 梯度清零 76 | self.model.zero_grad() 77 | start_time = time.time() 78 | avg_loss = 0 79 | first_epoch_eval = 0 80 | for index, data in enumerate(tqdm(train_loader, ncols=100)): 81 | pred = self.model(data['input_ids'].to(self.dlconfig.device)) 82 | loss = self.dlconfig.loss_fct(pred, data['label'].to(self.dlconfig.device)).mean() 83 | # 反向传播 84 | loss.backward() 85 | avg_loss += loss.item() / len(train_loader) 86 | 87 | # 更新优化器 88 | optimizer.step() 89 | # 更新学习率 90 | if self.dlconfig.update_lr: 91 | scheduler.step() 92 | 93 | # 用以下方式替代model.zero_grad(),可以提高gpu利用率 94 | for param in self.model.parameters(): 95 | param.grad = None 96 | 97 | # 计算模型运行时间 98 | elapsed_time = get_time_dif(start_time) 99 | # 打印间隔 100 | if 
(epoch + 1) % VERBOSE == 0: 101 | # 在测试集上看下效果 102 | avg_test_loss, test_f1, pred_all, true_all = self.evaluate(test_loader) 103 | elapsed_time = elapsed_time * VERBOSE 104 | if self.dlconfig.update_lr: 105 | lr = scheduler.get_last_lr()[0] 106 | else: 107 | lr = self.dlconfig.learning_rate 108 | tqdm.write( 109 | f"Epoch {epoch + 1:02d}/{self.dlconfig.epochs:02d} \t time={elapsed_time} \t" 110 | f"loss={avg_loss:.3f}\t lr={lr:.1e}", 111 | end="\t", 112 | ) 113 | 114 | if (epoch + 1 >= first_epoch_eval) or (epoch + 1 == self.dlconfig.epochs): 115 | tqdm.write(f"val_loss={avg_test_loss:.3f}\ttest_f1={test_f1:.4f}") 116 | else: 117 | tqdm.write("") 118 | writer.add_scalar('Loss/train', avg_loss, epoch) 119 | writer.add_scalar('Loss/test', avg_test_loss, epoch) 120 | writer.add_scalar('F1/test', test_f1, epoch) 121 | writer.add_scalar('lr/train', lr, epoch) 122 | 123 | # 每次保存最优的模型,以测试集f1为准 124 | if best_test_f1 < test_f1: 125 | best_test_f1 = test_f1 126 | tqdm.write('*'*20) 127 | self.save_model(model_saved_path, model_name) 128 | tqdm.write('new model saved') 129 | tqdm.write('*'*20) 130 | 131 | early_stopping(avg_test_loss) 132 | if early_stopping.early_stop: 133 | break 134 | # 删除数据加载器以及变量 135 | del (test_loader, train_loader, loss, data, pred) 136 | # 释放内存 137 | gc.collect() 138 | torch.cuda.empty_cache() 139 | writer.close() 140 | 141 | 142 | def evaluate(self, test_loader): 143 | pre_all = [] 144 | true_all = [] 145 | # 设定评估模式 146 | self.model.eval() 147 | avg_test_loss = 0 148 | with torch.no_grad(): 149 | for test_data in test_loader: 150 | pred = self.model(test_data['input_ids'].to(self.dlconfig.device)) 151 | test_loss = self.dlconfig.loss_fct(pred, test_data['label'].to(self.dlconfig.device)).mean() 152 | avg_test_loss += test_loss.item() / len(test_loader) 153 | true_all.extend(test_data['label'].detach().cpu().numpy()) 154 | pre_all.append(pred.softmax(-1).detach().cpu().numpy()) 155 | pre_all = np.concatenate(pre_all) 156 | pre_all = np.argmax(pre_all, axis=-1) 157 | if self.dlconfig.loss_type == 'multi' or self.dlconfig.loss_type == 'marginLoss': 158 | multi = True 159 | else: 160 | multi = False 161 | matrix = Matrix(true_all, pre_all, multi=multi) 162 | return avg_test_loss, matrix.get_f1(), pre_all, true_all 163 | 164 | 165 | def predict(self, dev_loader): 166 | pre_all = [] 167 | with torch.no_grad(): 168 | for test_data in dev_loader: 169 | pred = self.model(test_data['input_ids'].to(self.dlconfig.device)) 170 | pre_all.append(pred.softmax(-1).detach().cpu().numpy()) 171 | pre_all = np.concatenate(pre_all) 172 | pre_all = np.argmax(pre_all, axis=-1) 173 | return pre_all 174 | 175 | # 保存模型权重 176 | def save_model(self, path, name): 177 | if not os.path.exists(path): 178 | os.makedirs(path) 179 | output_path = os.path.join(path, name) 180 | torch.save(self.model.state_dict(), output_path) 181 | print(f'model is saved, in {str(output_path)}') 182 | 183 | def load_model(self, path, name): 184 | output_path = os.path.join(path, name) 185 | try: 186 | self.judge_model() 187 | self.model.load_state_dict(torch.load(output_path)) 188 | self.model.eval() 189 | print('model 已加载预训练参数') 190 | except: 191 | print('model load error') 192 | 193 | 194 | 195 | -------------------------------------------------------------------------------- /dl_algorithm/lstm.py: -------------------------------------------------------------------------------- 1 | #!/usr/bin/env python 2 | # -*- encoding: utf-8 -*- 3 | ''' 4 | @File : lstm.py 5 | @Time : 2023/02/07 16:09:05 6 | @Author : Huang zh 7 | @Contact 
: jacob.hzh@qq.com 8 | @Version : 0.1 9 | @Desc : lstm for classifier 10 | ''' 11 | 12 | import torch.nn as nn 13 | 14 | 15 | class LSTM(nn.Module): 16 | def __init__(self, dlconfig): 17 | super().__init__() 18 | if dlconfig.embedding_pretrained == 'random': 19 | self.embedding = nn.Embedding(dlconfig.vocab_size, dlconfig.embedding_size, padding_idx=dlconfig.vocab_size-1) 20 | else: 21 | self.embedding = nn.Embedding.from_pretrained(dlconfig.embedding_matrix, freeze=False, padding_idx=dlconfig.vocab_size-1) 22 | self.lstm = nn.LSTM(dlconfig.embedding_size, dlconfig.hidden_size, batch_first=True, bidirectional=True, dropout=dlconfig.dropout) 23 | self.fc1 = nn.Linear(dlconfig.hidden_size*2, dlconfig.hidden_size) 24 | self.fc2 = nn.Linear(dlconfig.hidden_size, dlconfig.nums_label) 25 | self.dropout = nn.Dropout(p=dlconfig.dropout) 26 | self.relu = nn.ReLU(inplace=True) 27 | 28 | 29 | def forward(self, x): 30 | x = self.embedding(x) # [batch_size, seq_len, embeding_size] 31 | x,_ = self.lstm(x) 32 | x = self.fc1(x) 33 | x = self.dropout(self.relu(x)) 34 | x = self.fc2(x) 35 | return x[:, -1, :] 36 | 37 | -------------------------------------------------------------------------------- /dl_algorithm/transformer.py: -------------------------------------------------------------------------------- 1 | #!/usr/bin/env python 2 | # -*- encoding: utf-8 -*- 3 | ''' 4 | @File : transformer.py 5 | @Time : 2023/02/16 14:58:06 6 | @Author : Huang zh 7 | @Contact : jacob.hzh@qq.com 8 | @Version : 0.1 9 | @Desc : None 10 | ''' 11 | 12 | import copy 13 | import math 14 | import torch 15 | import torch.nn as nn 16 | 17 | class PrepareForMultiHeadAttention(nn.Module): 18 | """生成Wq,Wk,Wv三个权重矩阵 19 | """ 20 | def __init__(self, d_model, heads, d_k, bias): 21 | """ 22 | Args: 23 | d_model (int): dim for model : 512 24 | heads (int): nums of attention head: 8 25 | d_k (int): dim for K 26 | bias (bool): bias for linear layer 27 | """ 28 | super().__init__() 29 | self.linear = nn.Linear(d_model, heads * d_k, bias=bias) 30 | self.heads = heads 31 | self.d_k = d_k 32 | 33 | def forward(self, x): 34 | # input_shape: [batch, seqlenth, d_model] 35 | head_shape = x.shape[:-1] 36 | x = self.linear(x) 37 | # reshape 38 | x = x.view(*head_shape, self.heads, self.d_k) 39 | return x 40 | 41 | class MultiHeadAtttention(nn.Module): 42 | """计算过程 43 | """ 44 | def __init__(self, heads, d_model, dropout, bias=True): 45 | """ 46 | Args: 47 | heads (int): 48 | d_model (int): 49 | dropout (float): 50 | bias (bool, optional): Defaults to True. 
51 | """ 52 | super().__init__() 53 | self.d_k = d_model // heads 54 | self.heads = heads 55 | self.query = PrepareForMultiHeadAttention(d_model, heads, self.d_k, bias=bias) 56 | self.key = PrepareForMultiHeadAttention(d_model, heads, self.d_k, bias=bias) 57 | self.value = PrepareForMultiHeadAttention(d_model, heads, self.d_k, bias=bias) 58 | self.softmax = nn.Softmax(dim=1) 59 | self.fc = nn.Linear(d_model, d_model) 60 | self.dropout = nn.Dropout(dropout) 61 | self.scale = 1 / math.sqrt(self.d_k) 62 | self.attn = None # 主要用于画图或者输出debug 63 | 64 | def get_scores(self, query, key): 65 | # return [batch, seq_len, seq_len, heads] 66 | return torch.einsum('bihd,bjhd->bijh', query, key) 67 | 68 | def prepare_mask(self, mask, query_shape, key_shape): 69 | assert mask.shape[0] == 1 or mask.shape[0] == query_shape[0] 70 | assert mask.shape[1] == key_shape[0] 71 | assert mask.shape[2] == 1 or mask.shape[2] == query_shape[1] 72 | mask = mask.unsqueeze(-1) 73 | return mask 74 | 75 | def forward(self, query, key, value, mask=None): 76 | # 自注意力机制里,这里的query,key,value其实都是x 77 | batch_size, seq_len, _ = query.shape 78 | if mask: 79 | mask = self.prepare_mask(mask, query.shape, key.shape) 80 | Q = self.query(query) 81 | K = self.key(key) 82 | V = self.value(value) 83 | scores = self.get_scores(Q, K) 84 | scores *= self.scale 85 | if mask: 86 | scores = scores.masked_fill(mask==0, float('-inf')) 87 | attn = self.softmax(scores) 88 | attn = self.dropout(attn) 89 | x = torch.einsum('bijh,bjhd->bihd', attn, V) 90 | self.attn = attn.detach() 91 | x = x.reshape(batch_size, seq_len, -1) 92 | x = self.fc(x) 93 | return x 94 | 95 | class PositionalEncoding(nn.Module): 96 | def __init__(self, d_model, dropout, max_len=5000): 97 | super(PositionalEncoding, self).__init__() 98 | self.dropout = nn.Dropout(p=dropout) 99 | pe = torch.zeros(max_len, d_model) 100 | position = torch.arange(0, max_len).unsqueeze(1) 101 | div_term = torch.exp(torch.arange(0, d_model, 2) * 102 | -(math.log(10000.0) / d_model)) 103 | pe[:, 0::2] = torch.sin(position * div_term) 104 | pe[:, 1::2] = torch.cos(position * div_term) 105 | pe = pe.unsqueeze(0) 106 | self.register_buffer('pe', pe) 107 | 108 | def forward(self, x): 109 | x = x + self.pe[:, :x.size(1)].requires_grad_(False) 110 | return self.dropout(x) 111 | 112 | class FeedForward(nn.Module): 113 | """FFN module 114 | """ 115 | def __init__(self, d_model: int, hidden: int, 116 | dropout: float = 0.1, 117 | activation=nn.ReLU(), 118 | is_gated: bool = False, 119 | bias1: bool = True, 120 | bias2: bool = True, 121 | bias_gate: bool = True): 122 | """ 123 | * `d_model` is the number of features in a token embedding 124 | * `hidden` is the number of features in the hidden layer of the FFN 125 | * `dropout` is dropout probability for the hidden layer 126 | * `is_gated` specifies whether the hidden layer is gated 127 | * `bias1` specified whether the first fully connected layer should have a learnable bias 128 | * `bias2` specified whether the second fully connected layer should have a learnable bias 129 | * `bias_gate` specified whether the fully connected layer for the gate should have a learnable bias 130 | """ 131 | super().__init__() 132 | # Layer one parameterized by weight $W_1$ and bias $b_1$ 133 | self.layer1 = nn.Linear(d_model, hidden, bias=bias1) 134 | # Layer one parameterized by weight $W_1$ and bias $b_1$ 135 | self.layer2 = nn.Linear(hidden, d_model, bias=bias2) 136 | # Hidden layer dropout 137 | self.dropout = nn.Dropout(dropout) 138 | # Activation function $f$ 139 | 
self.activation = activation 140 | # Whether there is a gate 141 | self.is_gated = is_gated 142 | if is_gated: 143 | # If there is a gate the linear layer to transform inputs to 144 | # be multiplied by the gate, parameterized by weight $V$ and bias $c$ 145 | self.linear_v = nn.Linear(d_model, hidden, bias=bias_gate) 146 | 147 | def forward(self, x: torch.Tensor): 148 | # $f(x W_1 + b_1)$ 149 | g = self.activation(self.layer1(x)) 150 | # If gated, $f(x W_1 + b_1) \otimes (x V + b) $ 151 | if self.is_gated: 152 | x = g * self.linear_v(x) 153 | # Otherwise 154 | else: 155 | x = g 156 | # Apply dropout 157 | x = self.dropout(x) 158 | # $(f(x W_1 + b_1) \otimes (x V + b)) W_2 + b_2$ or $f(x W_1 + b_1) W_2 + b_2$ 159 | # depending on whether it is gated 160 | return self.layer2(x) 161 | 162 | class TransformerLayer(nn.Module): 163 | def __init__(self, d_model, self_attn, src_attn, feed_forward, dropout): 164 | """transformer layer 165 | 166 | Args: 167 | d_model (int): 168 | self_attn (): multi-head-attention layer 169 | src_attn (): multi-head-attention layer 170 | feed_forward (): feed forwardd layer 171 | dropout (float): dropout prob 172 | """ 173 | super().__init__() 174 | self.size = d_model 175 | self.self_attn = self_attn 176 | self.src_attn =src_attn 177 | self.feed_forward = feed_forward 178 | self.dropout = nn.Dropout(dropout) 179 | self.layernorm = nn.LayerNorm([d_model]) 180 | 181 | def forward(self, x, mask): 182 | z = self.layernorm(x) 183 | self_attn = self.self_attn(z, z, z, mask) # mha 184 | x = x + self.dropout(self_attn) # add 185 | z = self.layernorm(x) # norm 186 | ff = self.feed_forward(z) # ff 187 | x = z + self.dropout(ff) # add 188 | x = self.layernorm(x) # norm 189 | return x 190 | 191 | class Encoder(nn.Module): 192 | def __init__(self, layer, n_layers): 193 | """encoder layer 194 | 195 | Args: 196 | layer (): transformer layer 197 | n_layers (int): nums of layer: default 6 198 | """ 199 | super().__init__() 200 | self.layers = self.clones(layer, n_layers) 201 | self.layernorm = nn.LayerNorm([layer.size]) 202 | 203 | 204 | def clones(self, layer, N): 205 | return nn.ModuleList([copy.deepcopy(layer) for _ in range(N)]) 206 | 207 | def forward(self, x, mask): 208 | for l in self.layers: 209 | x = l(x, mask) 210 | return self.layernorm(x) 211 | 212 | 213 | class TransformerModel(nn.Module): 214 | def __init__(self, dlconfig): 215 | super().__init__() 216 | if dlconfig.embedding_pretrained == 'random': 217 | self.embedding = nn.Embedding(dlconfig.vocab_size, dlconfig.embedding_size, padding_idx=dlconfig.vocab_size-1) 218 | else: 219 | self.embedding = nn.Embedding.from_pretrained(dlconfig.embedding_matrix, freeze=False, padding_idx=dlconfig.vocab_size-1) 220 | 221 | self.postion_embedding = PositionalEncoding(dlconfig.d_model, dlconfig.dropout) 222 | self.transformerlayer = TransformerLayer(d_model=dlconfig.d_model, 223 | self_attn=MultiHeadAtttention(dlconfig.heads, dlconfig.d_model, dlconfig.dropout), 224 | src_attn=None, 225 | feed_forward=FeedForward(dlconfig.d_model, dlconfig.hidden, dlconfig.dropout), 226 | dropout=dlconfig.dropout) 227 | 228 | self.encoder = Encoder(self.transformerlayer, dlconfig.n_layers) 229 | self.fc1 = nn.Linear(dlconfig.embedding_size, dlconfig.nums_label) 230 | 231 | def forward(self, x): 232 | x = self.embedding(x) 233 | x = self.postion_embedding(x) 234 | x = self.encoder(x, mask=None) 235 | x = self.fc1(x) 236 | return x[:, -1, :] 237 | -------------------------------------------------------------------------------- 
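To make the expected tensor shapes of `TransformerModel` concrete, here is a minimal smoke test. It is a sketch, not part of the repository: the stub config mirrors the attribute names and default values set in `dl_algorithm/dl_config.py` (`embedding_size=200`, `heads=5`, `hidden=1024`, `n_layers=2`); in the real pipeline a `DlConfig` instance is passed instead.

```
# Minimal smoke test for dl_algorithm/transformer.py (illustration only).
import torch
from types import SimpleNamespace
from dl_algorithm.transformer import TransformerModel

cfg = SimpleNamespace(
    embedding_pretrained='random',  # use a randomly initialized embedding table
    vocab_size=5000,
    embedding_size=200,             # d_model must equal embedding_size
    d_model=200,
    heads=5,                        # must divide embedding_size
    hidden=1024,                    # FFN hidden size
    n_layers=2,                     # number of stacked TransformerLayer blocks
    dropout=0.5,
    nums_label=10,
)

model = TransformerModel(cfg)
model.eval()                                      # disable dropout for a deterministic check
ids = torch.randint(0, cfg.vocab_size, (8, 100))  # [batch_size, seq_len] token ids
logits = model(ids)                               # forward returns the last position's logits
print(logits.shape)                               # torch.Size([8, 10])
```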
/logs/events.out.tfevents.1679558718.huangzihengdeMacBook-Air.local: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/hziheng/Machine-learning-project-for-text-classification/ec6a7517adaf4618148d25f9d192d76b3f747e10/logs/events.out.tfevents.1679558718.huangzihengdeMacBook-Air.local -------------------------------------------------------------------------------- /main.py: -------------------------------------------------------------------------------- 1 | # !usr/bin/env python 2 | # -*- coding:utf-8 -*- 3 | 4 | ''' 5 | Author : Huang zh 6 | Email : jacob.hzh@qq.com 7 | Date : 2023-03-09 19:27:58 8 | LastEditTime : 2023-03-23 15:16:11 9 | FilePath : \\codes\\main.py 10 | Description : 11 | ''' 12 | 13 | import os 14 | os.environ["CUDA_VISIBLE_DEVICES"] = '0' 15 | 16 | import argparse 17 | import transformers 18 | from process_data_ml import ML_Data_Excuter 19 | from process_data_dl import DL_Data_Excuter 20 | from process_data_pretrain import PRE_Data_Excuter 21 | from metrics import Matrix 22 | from model import Model_Excuter 23 | from config import ML_MODEL_NAME, DL_MODEL_NAME, PRE_MODEL_NAME, BATCH_SIZE, SPLIT_SIZE, IS_SAMPLE 24 | from dl_algorithm.dl_config import DlConfig 25 | from trick.set_all_seed import set_seed 26 | import warnings 27 | 28 | 29 | warnings.filterwarnings("ignore") 30 | transformers.logging.set_verbosity_error() 31 | 32 | # def set_args(): 33 | # parser = argparse.ArgumentParser() 34 | # parser.add_argument('--data_path', help='data path', default='', type=str) 35 | # parser.add_argument( 36 | # '--model_name', help='model name ex: knn', default='lg', type=str) 37 | # parser.add_argument( 38 | # '--model_saved_path', help='the path of model saved', default='./save_model/', type=str) 39 | # parser.add_argument( 40 | # '--type_obj', help='need train or test or only predict', default='test', type=str) 41 | # parser.add_argument('--train_data_path', 42 | # help='train set', default='', type=str) 43 | # parser.add_argument('--test_data_path', help='test set', 44 | # default='./data/processed_data.csv', type=str) 45 | # parser.add_argument('--dev_data_path', help='dev set', 46 | # default='', type=str) 47 | # args = parser.parse_args() 48 | # return args 49 | 50 | 51 | def set_args(): 52 | # 训练代码 53 | # python main.py --model_name transformer --model_saved_path ./save_model/ --type_obj train --train_data_path ./data/dl_data/test.csv --test_data_path ./data/dl_data/dev.csv 54 | # 测试代码 55 | # python main.py --model_name lstm --model_saved_path ./save_model/ --type_obj test --test_data_path ./data/dl_data/test.csv 56 | # 预测代码 57 | # python main.py --model_name lstm --model_saved_path './save_model/ --type_obj predict --dev_data_path ./data/dl_data/dev.csv 58 | parser = argparse.ArgumentParser() 59 | parser.add_argument('--data_path', help='data path', default='', type=str) 60 | parser.add_argument( 61 | '--model_name', help='model name ex: knn', default='transformer', type=str) 62 | parser.add_argument( 63 | '--model_saved_path', help='the path of model saved', default='./save_model/transformer', type=str) 64 | parser.add_argument( 65 | '--type_obj', help='need train or test or only predict', default='train', type=str) 66 | parser.add_argument('--train_data_path', 67 | help='train set', default='./data/dl_data/test.csv', type=str) 68 | parser.add_argument('--test_data_path', 69 | help='./data/dl_data/test.csv', default='./data/dl_data/dev.csv', type=str) 70 | parser.add_argument('--dev_data_path', help='dev 
set', 71 | default='', type=str) 72 | parser.add_argument('--pretrain_file_path', help='# 预训练模型的文件地址(模型在transformers官网下载)', 73 | default='./pretrain_model/roberta_wwm/', type=str) 74 | args = parser.parse_args() 75 | return args 76 | 77 | 78 | def print_msg(metrix_ex_train, metrix_ex_test, data_ex, pic_name='pic'): 79 | if metrix_ex_train: 80 | print('train dataset:') 81 | print(f"acc: {round(metrix_ex_train.get_acc(), 4)}") 82 | print(f"presion: {round(metrix_ex_train.get_precision(), 4)}") 83 | print(f"recall: {round(metrix_ex_train.get_recall(), 4)}") 84 | print(f"f1: {round(metrix_ex_train.get_f1(), 4)}") 85 | print('=' * 20) 86 | if metrix_ex_test: 87 | print('test dataset:') 88 | print(f"acc: {round(metrix_ex_test.get_acc(), 4)}") 89 | print(f"presion: {round(metrix_ex_test.get_precision(), 4)}") 90 | print(f"recall: {round(metrix_ex_test.get_recall(), 4)}") 91 | print(f"f1: {round(metrix_ex_test.get_f1(), 4)}") 92 | print(metrix_ex_test.plot_confusion_matrix(data_ex.i2l_dic, pic_name)) 93 | 94 | 95 | def create_me_de(args, split_size=SPLIT_SIZE, is_sample=IS_SAMPLE, split=True, batch_size=BATCH_SIZE, train_data_path='', test_data_path='', need_predict=False): 96 | if args.model_type == 'ML': 97 | data_ex = ML_Data_Excuter(args.data_path, split_size=split_size, is_sample=is_sample, 98 | split=split, train_data_path=train_data_path, test_data_path=test_data_path) 99 | # 初始化模型 100 | model_ex = Model_Excuter().init(model_name=args.model_name) 101 | if need_predict and args.type_obj == 'test': 102 | model_ex.load_model(args.model_saved_path, 103 | args.model_name + '.pkl') 104 | y_pre_test = model_ex.predict(data_ex.X) 105 | true_all = data_ex.label 106 | return data_ex, model_ex, true_all, y_pre_test 107 | elif need_predict and args.type_obj == 'predict': 108 | model_ex.load_model(args.model_saved_path, 109 | args.model_name + '.pkl') 110 | y_pre_test = model_ex.predict(data_ex.X) 111 | return data_ex, model_ex, y_pre_test 112 | elif args.model_type == 'DL': 113 | data_ex = DL_Data_Excuter() 114 | vocab_size, nums_class = data_ex.process(batch_size=batch_size, 115 | train_data_path=args.train_data_path, 116 | test_data_path=args.test_data_path, 117 | dev_data_path=args.dev_data_path) 118 | dl_config = DlConfig(args.model_name, vocab_size, 119 | nums_class, data_ex.vocab) 120 | # 初始化模型 121 | model_ex = Model_Excuter().init(dl_config=dl_config) 122 | if need_predict and args.type_obj == 'test': 123 | model_ex.load_model(args.model_saved_path, 124 | args.model_name + '.pth') 125 | _, _, y_pre_test, true_all = model_ex.evaluate( 126 | data_ex.test_data_loader) 127 | return data_ex, model_ex, true_all, y_pre_test 128 | elif need_predict and args.type_obj == 'predict': 129 | model_ex.load_model(args.model_saved_path, 130 | args.model_name + '.pth') 131 | y_pre_test = model_ex.predict(data_ex.dev_data_loader) 132 | return data_ex, model_ex, y_pre_test 133 | else: 134 | data_ex = PRE_Data_Excuter(args.model_name) 135 | nums_class = data_ex.process(batch_size=batch_size, 136 | train_data_path=args.train_data_path, 137 | test_data_path=args.test_data_path, 138 | dev_data_path=args.dev_data_path, 139 | pretrain_file_path=args.pretrain_file_path 140 | ) 141 | dl_config = DlConfig(args.model_name, 0, nums_class, '', 'random') 142 | # 初始化模型 143 | model_ex = Model_Excuter().init(dl_config=dl_config) 144 | if need_predict and args.type_obj == 'test': 145 | model_ex.load_model(args.model_saved_path) 146 | _, _, y_pre_test, true_all = model_ex.evaluate( 147 | data_ex.test_data_loader) 148 | return 
data_ex, model_ex, true_all, y_pre_test 149 | elif need_predict and args.type_obj == 'predict': 150 | model_ex.load_model(args.model_saved_path) 151 | y_pre_test = model_ex.predict(data_ex.dev_data_loader) 152 | return data_ex, model_ex, y_pre_test 153 | 154 | return data_ex, model_ex 155 | 156 | 157 | def main(args): 158 | """ 159 | 1. 载入数据 160 | 2. 载入模型 161 | 3. 训练模型 162 | 4. 预测结果 163 | 5. 保存模型 164 | """ 165 | if args.model_name in ML_MODEL_NAME: 166 | args.model_type = 'ML' 167 | elif args.model_name in DL_MODEL_NAME: 168 | args.model_type = 'DL' 169 | elif args.model_name in PRE_MODEL_NAME: 170 | args.model_type = 'PRE' 171 | else: 172 | print('model name error') 173 | exit(0) 174 | 175 | set_seed(96) 176 | 177 | if args.type_obj == 'train': 178 | data_ex, model_ex = create_me_de(args) 179 | 180 | model_ex.judge_model(args.pretrain_file_path) 181 | 182 | # 这里dl和ml的train得用if分开,数据的接口不一样 183 | if args.model_type == 'ML': 184 | model_ex.train(data_ex.train_data_x, data_ex.train_data_label) 185 | 186 | y_pre_train = model_ex.predict(data_ex.train_data_x) 187 | y_pre_test = model_ex.predict(data_ex.test_data_x) 188 | 189 | mtrix_ex_train = Matrix( 190 | data_ex.train_data_label, y_pre_train, multi=data_ex.multi) 191 | mtrix_ex_test = Matrix( 192 | data_ex.test_data_label, y_pre_test, multi=data_ex.multi) 193 | print_msg(mtrix_ex_train, mtrix_ex_test, data_ex, 'train_pic') 194 | 195 | model_ex.save_model(args.model_saved_path, 196 | args.model_name + '.pkl') 197 | elif args.model_type == 'DL': 198 | model_ex.train(data_ex.train_data_loader, 199 | data_ex.test_data_loader, 200 | data_ex.dev_data_loader, 201 | args.model_saved_path, 202 | args.model_name + '.pth') 203 | else: 204 | model_ex.dlconfig.pretrain_file_path = args.pretrain_file_path 205 | model_ex.train(data_ex.train_data_loader, 206 | data_ex.test_data_loader, 207 | data_ex.dev_data_loader, 208 | args.model_saved_path 209 | ) 210 | 211 | elif args.type_obj == 'test': 212 | args.data_path = args.test_data_path 213 | args.train_data_path, args.dev_data_path = '', '' 214 | data_ex, model_ex, true_all, y_pre_test = create_me_de( 215 | args, split_size=0, is_sample=False, split=False, need_predict=True) 216 | mtrix_ex_test = Matrix(true_all, y_pre_test, multi=data_ex.multi) 217 | print_msg(None, mtrix_ex_test, data_ex, 'test_pic') 218 | 219 | elif args.type_obj == 'predict': 220 | args.data_path = args.dev_data_path 221 | args.train_data_path, args.test_data_path = '', '' 222 | data_ex, model_ex, y_pre_test = create_me_de( 223 | args, split_size=0, is_sample=False, split=False, need_predict=True) 224 | # data_ex.i2l_dic可以将y_pre_test中的数字转成文字标签,按需使用 225 | #! 如何保存数据,按需求填写 226 | else: 227 | print('please input train, test or predict in type_obj of params!') 228 | exit(0) 229 | 230 | 231 | if __name__ == '__main__': 232 | args = set_args() 233 | main(args) 234 | -------------------------------------------------------------------------------- /metrics.py: -------------------------------------------------------------------------------- 1 | #!/usr/bin/env python 2 | # -*- encoding: utf-8 -*- 3 | ''' 4 | @File : metrics.py 5 | @Time : 2023/01/15 11:35:31 6 | @Author : Huang zh 7 | @Contact : jacob.hzh@qq.com 8 | @Version : 0.1 9 | @Desc : 一系列的评估函数f1, recall, acc, presion, confusion_matrix... 
10 | ''' 11 | 12 | import os 13 | import numpy as np 14 | import matplotlib.pyplot as plt 15 | from matplotlib import rcParams 16 | from sklearn.metrics import accuracy_score, recall_score, f1_score, precision_score 17 | from sklearn.metrics import confusion_matrix 18 | from config import PIC_SAVED_PATH 19 | 20 | 21 | class Matrix: 22 | def __init__(self, y_true, y_pre, multi=False): 23 | self.true = y_true 24 | self.pre = y_pre 25 | # 是否是多分类, 默认二分类 26 | self.multi = multi # average的参数有micro、macro、weighted,如果选择micro,那么recall和pre和acc没区别,建议使用macro,同时数据集最好已经没有不平衡的问题 27 | 28 | def get_acc(self): 29 | return accuracy_score(self.true, self.pre) 30 | 31 | def get_recall(self): 32 | # tp / (tp + fn) 33 | if self.multi: 34 | return recall_score(self.true, self.pre, average='macro') 35 | return recall_score(self.true, self.pre) 36 | 37 | def get_precision(self): 38 | # tp / (tp + fp) 39 | if self.multi: 40 | return precision_score(self.true, self.pre, average='macro') 41 | return precision_score(self.true, self.pre) 42 | 43 | def get_f1(self): 44 | # F1 = 2 * (precision * recall) / (precision + recall) 45 | if self.multi: 46 | return f1_score(self.true, self.pre, average='macro') 47 | return f1_score(self.true, self.pre) 48 | 49 | def get_confusion_matrix(self): 50 | return confusion_matrix(self.true, self.pre) 51 | 52 | def plot_confusion_matrix(self, dic_labels, pic_name): 53 | """plot 54 | 55 | Args: 56 | dic_labels (dict): {0: 'label1', 1: 'label2'} # 一定是个有序字典 57 | """ 58 | proportion = [] 59 | con_matrix = self.get_confusion_matrix() 60 | num_class = len(dic_labels) 61 | labels = [v for k, v in dic_labels.items()] 62 | for i in con_matrix: 63 | for j in i: 64 | temp = j / (np.sum(i)) 65 | proportion.append(temp) 66 | pshow = [] 67 | for i in proportion: 68 | pt = "%.2f%%" % (i * 100) 69 | pshow.append(pt) 70 | proportion = np.array(proportion).reshape(num_class, num_class) 71 | pshow = np.array(pshow).reshape(num_class, num_class) 72 | config = {"font.family": "Times New Roman"} 73 | rcParams.update(config) 74 | plt.imshow(proportion, interpolation='nearest', 75 | cmap=plt.cm.Blues) # 按照像素显示出矩阵 76 | # (改变颜色:'Greys', 'Purples', 'Blues', 'Greens', 'Oranges', 'Reds','YlOrBr', 'YlOrRd', 77 | # 'OrRd', 'PuRd', 'RdPu', 'BuPu','GnBu', 'PuBu', 'YlGnBu', 'PuBuGn', 'BuGn', 'YlGn') 78 | plt.title('confusion_matrix') 79 | plt.colorbar() 80 | tick_marks = np.arange(len(labels)) 81 | plt.xticks(tick_marks, labels, fontsize=12) 82 | plt.yticks(tick_marks, labels, fontsize=12) 83 | # iters = [[i,j] for i in range(len(classes)) for j in range((classes))] 84 | # ij配对,遍历矩阵迭代器 85 | iters = np.reshape([[[i, j] for j in range(num_class)] 86 | for i in range(num_class)], (con_matrix.size, 2)) 87 | for i, j in iters: 88 | if (i == j): 89 | plt.text(j, i - 0.12, format(con_matrix[i, j]), va='center', 90 | ha='center', fontsize=12, color='white', weight=5) # 显示对应的数字 91 | plt.text(j, i + 0.12, pshow[i, j], va='center', 92 | ha='center', fontsize=12, color='white') 93 | else: 94 | # 显示对应的数字 95 | plt.text( 96 | j, i - 0.12, format(con_matrix[i, j]), va='center', ha='center', fontsize=12) 97 | plt.text(j, i + 0.12, pshow[i, j], 98 | va='center', ha='center', fontsize=12) 99 | 100 | plt.ylabel('True label', fontsize=16) 101 | plt.xlabel('Predict label', fontsize=16) 102 | plt.tight_layout() 103 | plt.pause(1) 104 | plt.show(block=False) 105 | if not os.path.exists(PIC_SAVED_PATH): 106 | os.makedirs(PIC_SAVED_PATH) 107 | pic_name = pic_name + '.png' 108 | save_path = os.path.join(PIC_SAVED_PATH, pic_name) 109 | 
plt.savefig(save_path) 110 | print(f'result pic is saved in {save_path}') 111 | 112 | 113 | if __name__ == '__main__': 114 | # dic_labels = {0: 'W', 1: 'LS', 2: 'SWS', 3: 'REM', 4: 'E'} 115 | # cm = np.array([(193, 31, 0, 41, 42), (87, 1038, 32, 126, 125), 116 | # (17, 337, 862, 1, 2), (17, 70, 0, 638, 54), (1, 2, 3, 4, 5)]) 117 | # matrix_excute = Matrix(None, None) 118 | # matrix_excute.plot_confusion_matrix(cm, dic_labels) 119 | y_true = np.array([0]*30 + [1]*240 + [2]*30) 120 | y_pred = np.array([0]*10 + [1]*10 + [2]*10 + 121 | [0]*40 + [1]*160 + [2]*40 + 122 | [0]*5 + [1]*5 + [2]*20) 123 | dic_labels = {0:0, 1:1, 2:2} 124 | matrix_excute = Matrix(y_true=y_true, y_pre=y_pred, multi=True) 125 | print(matrix_excute.get_acc()) 126 | print(matrix_excute.get_precision()) 127 | print(matrix_excute.get_recall()) 128 | print(matrix_excute.get_f1()) 129 | matrix_excute.plot_confusion_matrix(dic_labels) 130 | 131 | -------------------------------------------------------------------------------- /ml_algorithm/ml_model.py: -------------------------------------------------------------------------------- 1 | #!/usr/bin/env python 2 | # -*- encoding: utf-8 -*- 3 | ''' 4 | @File : ml_model.py 5 | @Time : 2023/01/13 16:26:41 6 | @Author : Huang zh 7 | @Contact : jacob.hzh@qq.com 8 | @Version : 0.1 9 | @Desc : lg, knn, dt, rt, gbdt, xgb, catboost, svm ... etc. 10 | ''' 11 | 12 | import pickle 13 | import random 14 | import os 15 | import joblib 16 | import catboost as cb 17 | from sklearn.linear_model import LogisticRegression 18 | from sklearn.tree import DecisionTreeClassifier 19 | from sklearn.svm import SVC 20 | from sklearn.naive_bayes import GaussianNB 21 | from sklearn.neighbors import KNeighborsClassifier 22 | from sklearn.ensemble import RandomForestClassifier 23 | from sklearn.ensemble import GradientBoostingClassifier 24 | from xgboost import XGBClassifier 25 | from config import ML_MODEL_NAME 26 | 27 | 28 | class ML_EXCUTER: 29 | def __init__(self, model_name): 30 | self.model_name = model_name 31 | 32 | def judge_model(self, assign_path=''): 33 | if self.model_name not in ML_MODEL_NAME: 34 | print('ml model name is not support, please see ML_MODEL_NAME of config.py') 35 | 36 | if self.model_name == 'lg': 37 | model = LogisticRegression(random_state=96) 38 | elif self.model_name == 'knn': 39 | model = KNeighborsClassifier(n_neighbors=5) 40 | elif self.model_name == 'bayes': 41 | model = GaussianNB() 42 | elif self.model_name == 'svm': 43 | model = SVC(kernel='rbf') 44 | elif self.model_name == 'dt': 45 | model = DecisionTreeClassifier(random_state=96) 46 | elif self.model_name == 'rf': 47 | model = RandomForestClassifier(n_estimators=100, random_state=96) 48 | elif self.model_name == 'gbdt': 49 | model = GradientBoostingClassifier( 50 | learning_rate=0.1, n_estimators=100, random_state=96) 51 | elif self.model_name == 'xgb': 52 | model = XGBClassifier(learning_rate=0.1, 53 | # n_estimatores 54 | # 含义:总共迭代的次数,即决策树的个数 55 | n_estimators=1000, 56 | # max_depth 57 | # 含义:树的深度,默认值为6,典型值3-10。 58 | max_depth=6, 59 | # min_child_weight 60 | # 调参:值越大,越容易欠拟合;值越小,越容易过拟合 61 | # (值较大时,避免模型学习到局部的特殊样本)。 62 | min_child_weight=1, 63 | # 惩罚项系数,指定节点分裂所需的最小损失函数下降值。 64 | gamma=0, 65 | # subsample 66 | # 含义:训练每棵树时,使用的数据占全部训练集的比例。 67 | # 默认值为1,典型值为0.5-1。 68 | subsample=0.8, 69 | # colsample_bytree 70 | # 含义:训练每棵树时,使用的特征占全部特征的比例。默认值为1,典型值为0.5-1。 71 | colsample_btree=0.8, 72 | # objective 目标函数 73 | # multi:softmax num_class=n 返回类别 74 | # binary:logistic,二元分类的逻辑回归,输出概率 binary:hinge:二进制分类的铰链损失。这使预测为0或1,而不是产生概率。 75 | 
--------------------------------------------------------------------------------
/model.py:
--------------------------------------------------------------------------------
1 | #!/usr/bin/env python
2 | # -*- encoding: utf-8 -*-
3 | '''
4 | @File : model.py
5 | @Time : 2023/02/07 19:54:07
6 | @Author : Huang zh
7 | @Contact : jacob.hzh@qq.com
8 | @Version : 0.1
9 | @Desc : None
10 | '''
11 | 
12 | from config import ML_MODEL_NAME, DL_MODEL_NAME, PRE_MODEL_NAME
13 | from ml_algorithm.ml_model import ML_EXCUTER
14 | from dl_algorithm.dl_model import DL_EXCUTER
15 | from pretrain_algorithm.pre_model import PRE_EXCUTER
16 | 
17 | class Model_Excuter:
18 |     def __init__(self):
19 |         pass
20 | 
21 |     def init(self, model_name='', dl_config=''):
22 |         if model_name in ML_MODEL_NAME:
23 |             return ML_EXCUTER(model_name)
24 |         elif dl_config.model_name in DL_MODEL_NAME:
25 |             return DL_EXCUTER(dl_config)
26 |         elif dl_config.model_name in PRE_MODEL_NAME:
27 |             return PRE_EXCUTER(dl_config)
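# --- How the dispatcher above is presumably called (a sketch; see main.py for the real wiring) ---
# For the classic ML algorithms only the short name is needed; for DL / pretrained models
# init() looks at dl_config.model_name, so a config object must be passed. The config class
# below is a hypothetical stand-in for the real one from dl_algorithm/dl_config.py.
from model import Model_Excuter

ml_excuter = Model_Excuter().init(model_name='lg')            # -> ML_EXCUTER('lg')

class _FakeDLConfig:                                          # hypothetical stand-in config
    model_name = 'lstm'

dl_excuter = Model_Excuter().init(dl_config=_FakeDLConfig())  # -> DL_EXCUTER(config)
# Note: if the name is not found in any of the three name lists, init() falls through
# and returns None, so callers should check the returned object.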
--------------------------------------------------------------------------------
/pic/pic_dl.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/hziheng/Machine-learning-project-for-text-classification/ec6a7517adaf4618148d25f9d192d76b3f747e10/pic/pic_dl.png
--------------------------------------------------------------------------------
/pic/pic_ml.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/hziheng/Machine-learning-project-for-text-classification/ec6a7517adaf4618148d25f9d192d76b3f747e10/pic/pic_ml.png
--------------------------------------------------------------------------------
/pic/pretrain_pic.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/hziheng/Machine-learning-project-for-text-classification/ec6a7517adaf4618148d25f9d192d76b3f747e10/pic/pretrain_pic.png
--------------------------------------------------------------------------------
/pic/result.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/hziheng/Machine-learning-project-for-text-classification/ec6a7517adaf4618148d25f9d192d76b3f747e10/pic/result.png
--------------------------------------------------------------------------------
/pic/tensorboard.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/hziheng/Machine-learning-project-for-text-classification/ec6a7517adaf4618148d25f9d192d76b3f747e10/pic/tensorboard.png
--------------------------------------------------------------------------------
/pic/test_pic.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/hziheng/Machine-learning-project-for-text-classification/ec6a7517adaf4618148d25f9d192d76b3f747e10/pic/test_pic.png
--------------------------------------------------------------------------------
/pic/train_pic.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/hziheng/Machine-learning-project-for-text-classification/ec6a7517adaf4618148d25f9d192d76b3f747e10/pic/train_pic.png
--------------------------------------------------------------------------------
/pretrain_algorithm/bert_graph.py:
--------------------------------------------------------------------------------
1 | # !usr/bin/env python
2 | # -*- coding:utf-8 -*-
3 | 
4 | '''
5 | Author : Huang zh
6 | Email : jacob.hzh@qq.com
7 | Date : 2023-03-12 14:39:21
8 | LastEditTime : 2023-03-14 19:20:06
9 | FilePath : \\codes\\pretrain_algorithm\\bert_graph.py
10 | Description : 
11 | '''
12 | 
13 | 
14 | import torch
15 | import torch.nn as nn
16 | from transformers import BertPreTrainedModel, BertModel
17 | 
18 | 
19 | class bert_classifier(BertPreTrainedModel):
20 |     '''
21 |     pooler_output: shape (batch_size, hidden_size); this is the last-layer hidden state of the first token ([CLS]),
22 |     further processed by a linear layer and a Tanh activation.
23 |     That output is not a very good summary of the semantic content of the input; averaging or pooling the hidden states over the whole input sequence usually represents a sentence better. (Here, in addition, the embedding output and
24 |     the [CLS] of every hidden layer are combined with a weighted average to represent the sentence.)
25 |     '''
26 | 
27 |     def __init__(self, config):
28 |         super().__init__(config, )
29 |         config.output_hidden_states = True
30 |         '''
31 |         hidden_states: an optional output; to get it, set config.output_hidden_states=True. It is a tuple of 13 elements:
32 |         the first can be treated as the embedding output (i.e. the cls), and the remaining 12 are the hidden states of each layer, each of shape (batch_size, sequence_length, hidden_size).
33 |         '''
34 |         self.num_labels = config.num_labels
35 |         self.bert = BertModel(config)
36 |         self.dropout = nn.Dropout(p=0.2)
37 |         self.high_dropout = nn.Dropout(p=0.5)
38 |         n_weights = config.num_hidden_layers + 1  # +1 because output_hidden_states adds one more element (the embedding layer)
39 |         weights_init = torch.zeros(n_weights).float()
40 |         weights_init.data[:-1] = -3
41 |         self.layer_weights = torch.nn.Parameter(weights_init)
42 |         self.classifier = nn.Linear(config.hidden_size, self.num_labels)
43 |         self.init_weights()
44 | 
45 |     def forward(self, input_ids=None, attention_mask=None, token_type_ids=None, label=None,):
46 |         outputs = self.bert(
47 |             input_ids, attention_mask=attention_mask, token_type_ids=token_type_ids)
48 |         '''
49 |         BERT outputs
50 |         # output[0] last-layer hidden states (batch_size, sequence_length, hidden_size)
51 |         # output[1] last-layer hidden state of the first token ([CLS]) (batch_size, hidden_size)
52 |         # output[2] requires output_hidden_states=True: all hidden states; the first element is the embedding output, the rest are the per-layer outputs (batch_size, sequence_length, hidden_size)
53 |         # output[3] requires output_attentions=True: the attention weights of every layer, used to compute the weighted average over the self-attention heads, each of shape (batch_size, num_heads, sequence_length, sequence_length)
54 |         '''
55 |         hidden_layers = outputs[2]
56 |         # take the cls of every layer (shape: batch_size * hidden_size), apply dropout and stack, shape: 13 * batch_size * hidden_size
57 |         cls_outputs = torch.stack(
58 |             [self.dropout(layer[:, 0, :]) for layer in hidden_layers], dim=0
59 |         )
60 |         # weighted sum over the layers, shape: batch_size * hidden_size
61 |         cls_output = (torch.softmax(self.layer_weights,
62 |                       dim=0).unsqueeze(-1).unsqueeze(-1) * cls_outputs).sum(0)
63 |         # apply dropout to the pooled cls vector, feed it into the linear layer, repeat five times, then average to get the final output logits
64 |         logits = torch.mean(
65 |             torch.stack(
66 |                 [self.classifier(self.high_dropout(cls_output))
67 |                  for _ in range(5)],
68 |                 dim=0,
69 |             ),
70 |             dim=0,
71 |         )
72 | 
73 |         return logits
74 | 
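# --- Shape-level sketch of the pooling used in bert_classifier above (toy tensors, no BERT weights) ---
# It reproduces the two tricks described in the docstrings: a softmax-weighted average of the
# [CLS] vector taken from the embedding output plus all 12 encoder layers, followed by
# multi-sample dropout before the classifier. Sizes are the usual bert-base values; the
# random tensors stand in for outputs[2] and nothing here touches the real model.
import torch

batch_size, seq_len, hidden_size, n_layers, num_labels = 4, 16, 768, 12, 3
hidden_states = [torch.randn(batch_size, seq_len, hidden_size) for _ in range(n_layers + 1)]

layer_weights = torch.zeros(n_layers + 1)
layer_weights[:-1] = -3                       # same init as weights_init above
dropout = torch.nn.Dropout(p=0.2)
high_dropout = torch.nn.Dropout(p=0.5)
classifier = torch.nn.Linear(hidden_size, num_labels)

cls_outputs = torch.stack([dropout(h[:, 0, :]) for h in hidden_states], dim=0)  # (13, B, H)
weights = torch.softmax(layer_weights, dim=0).unsqueeze(-1).unsqueeze(-1)       # (13, 1, 1)
cls_output = (weights * cls_outputs).sum(0)                                     # (B, H)
logits = torch.mean(
    torch.stack([classifier(high_dropout(cls_output)) for _ in range(5)], dim=0),
    dim=0,
)                                                                               # (B, num_labels)
print(cls_outputs.shape, cls_output.shape, logits.shape)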
--------------------------------------------------------------------------------
/pretrain_algorithm/deberta_graph.py:
--------------------------------------------------------------------------------
1 | # coding=utf-8
2 | # Copyright 2020 Microsoft and the Hugging Face Inc. team.
3 | #
4 | # Licensed under the Apache License, Version 2.0 (the "License");
5 | # you may not use this file except in compliance with the License.
6 | # You may obtain a copy of the License at
7 | #
8 | #     http://www.apache.org/licenses/LICENSE-2.0
9 | #
10 | # Unless required by applicable law or agreed to in writing, software
11 | # distributed under the License is distributed on an "AS IS" BASIS,
12 | # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
13 | # See the License for the specific language governing permissions and
14 | # limitations under the License.
15 | """ PyTorch DeBERTa-v2 model."""
16 | 
17 | from collections.abc import Sequence
18 | from typing import Optional, Tuple, Union
19 | 
20 | import torch
21 | import torch.utils.checkpoint
22 | from torch import nn
23 | from torch.nn import BCEWithLogitsLoss, CrossEntropyLoss, LayerNorm, MSELoss
24 | 
25 | from transformers.activations import ACT2FN
26 | from transformers.modeling_outputs import (
27 |     BaseModelOutput,
28 |     MaskedLMOutput,
29 |     MultipleChoiceModelOutput,
30 |     QuestionAnsweringModelOutput,
31 |     SequenceClassifierOutput,
32 |     TokenClassifierOutput,
33 | )
34 | from transformers.modeling_utils import PreTrainedModel
35 | from transformers.pytorch_utils import softmax_backward_data
36 | from transformers.utils import add_code_sample_docstrings, add_start_docstrings, add_start_docstrings_to_model_forward, \
37 |     logging
38 | from transformers import DebertaV2Config
39 | 
40 | logger = logging.get_logger(__name__)
41 | 
42 | _CONFIG_FOR_DOC = "DebertaV2Config"
43 | _TOKENIZER_FOR_DOC = "DebertaV2Tokenizer"
44 | _CHECKPOINT_FOR_DOC = "microsoft/deberta-v2-xlarge"
45 | 
46 | # Masked LM docstring
47 | _CHECKPOINT_FOR_MASKED_LM = "hf-internal-testing/tiny-random-deberta-v2"
48 | _MASKED_LM_EXPECTED_OUTPUT = "'enberry'"
49 | _MASKED_LM_EXPECTED_LOSS = "11.85"
50 | 
51 | # TokenClassification docstring
52 | _CHECKPOINT_FOR_TOKEN_CLASSIFICATION = "hf-internal-testing/tiny-random-deberta-v2"
53 | _TOKEN_CLASS_EXPECTED_OUTPUT = (
54 |     "['LABEL_0', 'LABEL_0', 'LABEL_1', 'LABEL_0', 'LABEL_0', 'LABEL_1', 'LABEL_0', 'LABEL_0', 'LABEL_0', 'LABEL_0',"
55 |     " 'LABEL_0', 'LABEL_0']"
56 | )
57 | _TOKEN_CLASS_EXPECTED_LOSS = 0.61
58 | 
59 | # QuestionAnswering docstring
60 | _CHECKPOINT_FOR_QA = "hf-internal-testing/tiny-random-deberta-v2"
61 | _QA_EXPECTED_OUTPUT = "'was Jim Henson? 
Jim Henson was'" 62 | _QA_EXPECTED_LOSS = 2.47 63 | _QA_TARGET_START_INDEX = 2 64 | _QA_TARGET_END_INDEX = 9 65 | 66 | # SequenceClassification docstring 67 | _CHECKPOINT_FOR_SEQUENCE_CLASSIFICATION = "hf-internal-testing/tiny-random-deberta-v2" 68 | _SEQ_CLASS_EXPECTED_OUTPUT = "'LABEL_1'" 69 | _SEQ_CLASS_EXPECTED_LOSS = "0.69" 70 | 71 | DEBERTA_V2_PRETRAINED_MODEL_ARCHIVE_LIST = [ 72 | "microsoft/deberta-v2-xlarge", 73 | "microsoft/deberta-v2-xxlarge", 74 | "microsoft/deberta-v2-xlarge-mnli", 75 | "microsoft/deberta-v2-xxlarge-mnli", 76 | ] 77 | 78 | 79 | # Copied from transformers.models.deberta.modeling_deberta.ContextPooler 80 | class ContextPooler(nn.Module): 81 | def __init__(self, config): 82 | super().__init__() 83 | self.dense = nn.Linear(config.pooler_hidden_size, 84 | config.pooler_hidden_size) 85 | self.dropout = StableDropout(config.pooler_dropout) 86 | self.config = config 87 | 88 | def forward(self, hidden_states): 89 | # We "pool" the model by simply taking the hidden state corresponding 90 | # to the first token. 91 | 92 | context_token = hidden_states[:, 0] 93 | context_token = self.dropout(context_token) 94 | pooled_output = self.dense(context_token) 95 | pooled_output = ACT2FN[self.config.pooler_hidden_act](pooled_output) 96 | return pooled_output 97 | 98 | @property 99 | def output_dim(self): 100 | return self.config.hidden_size 101 | 102 | 103 | # Copied from transformers.models.deberta.modeling_deberta.XSoftmax with deberta->deberta_v2 104 | class XSoftmax(torch.autograd.Function): 105 | """ 106 | Masked Softmax which is optimized for saving memory 107 | 108 | Args: 109 | input (`torch.tensor`): The input tensor that will apply softmax. 110 | mask (`torch.IntTensor`): 111 | The mask matrix where 0 indicate that element will be ignored in the softmax calculation. 
112 | dim (int): The dimension that will apply softmax 113 | 114 | Example: 115 | 116 | ```python 117 | >>> import torch 118 | >>> from transformers.models.deberta_v2.modeling_deberta_v2 import XSoftmax 119 | 120 | >>> # Make a tensor 121 | >>> x = torch.randn([4, 20, 100]) 122 | 123 | >>> # Create a mask 124 | >>> mask = (x > 0).int() 125 | 126 | >>> # Specify the dimension to apply softmax 127 | >>> dim = -1 128 | 129 | >>> y = XSoftmax.apply(x, mask, dim) 130 | ```""" 131 | 132 | @staticmethod 133 | def forward(self, input, mask, dim): 134 | self.dim = dim 135 | rmask = ~(mask.to(torch.bool)) 136 | 137 | output = input.masked_fill( 138 | rmask, torch.tensor(torch.finfo(input.dtype).min)) 139 | output = torch.softmax(output, self.dim) 140 | output.masked_fill_(rmask, 0) 141 | self.save_for_backward(output) 142 | return output 143 | 144 | @staticmethod 145 | def backward(self, grad_output): 146 | (output,) = self.saved_tensors 147 | inputGrad = softmax_backward_data( 148 | self, grad_output, output, self.dim, output) 149 | return inputGrad, None, None 150 | 151 | @staticmethod 152 | def symbolic(g, self, mask, dim): 153 | import torch.onnx.symbolic_helper as sym_help 154 | from torch.onnx.symbolic_opset9 import masked_fill, softmax 155 | 156 | mask_cast_value = g.op( 157 | "Cast", mask, to_i=sym_help.cast_pytorch_to_onnx["Long"]) 158 | r_mask = g.op( 159 | "Cast", 160 | g.op("Sub", g.op("Constant", value_t=torch.tensor( 161 | 1, dtype=torch.int64)), mask_cast_value), 162 | to_i=sym_help.cast_pytorch_to_onnx["Byte"], 163 | ) 164 | output = masked_fill( 165 | g, self, r_mask, g.op("Constant", value_t=torch.tensor( 166 | torch.finfo(self.type().dtype()).min)) 167 | ) 168 | output = softmax(g, output, dim) 169 | return masked_fill(g, output, r_mask, g.op("Constant", value_t=torch.tensor(0, dtype=torch.uint8))) 170 | 171 | 172 | # Copied from transformers.models.deberta.modeling_deberta.DropoutContext 173 | class DropoutContext(object): 174 | def __init__(self): 175 | self.dropout = 0 176 | self.mask = None 177 | self.scale = 1 178 | self.reuse_mask = True 179 | 180 | 181 | # Copied from transformers.models.deberta.modeling_deberta.get_mask 182 | def get_mask(input, local_context): 183 | if not isinstance(local_context, DropoutContext): 184 | dropout = local_context 185 | mask = None 186 | else: 187 | dropout = local_context.dropout 188 | dropout *= local_context.scale 189 | mask = local_context.mask if local_context.reuse_mask else None 190 | 191 | if dropout > 0 and mask is None: 192 | mask = (1 - torch.empty_like(input).bernoulli_(1 - dropout)).to(torch.bool) 193 | 194 | if isinstance(local_context, DropoutContext): 195 | if local_context.mask is None: 196 | local_context.mask = mask 197 | 198 | return mask, dropout 199 | 200 | 201 | # Copied from transformers.models.deberta.modeling_deberta.XDropout 202 | class XDropout(torch.autograd.Function): 203 | """Optimized dropout function to save computation and memory by using mask operation instead of multiplication.""" 204 | 205 | @staticmethod 206 | def forward(ctx, input, local_ctx): 207 | mask, dropout = get_mask(input, local_ctx) 208 | ctx.scale = 1.0 / (1 - dropout) 209 | if dropout > 0: 210 | ctx.save_for_backward(mask) 211 | return input.masked_fill(mask, 0) * ctx.scale 212 | else: 213 | return input 214 | 215 | @staticmethod 216 | def backward(ctx, grad_output): 217 | if ctx.scale > 1: 218 | (mask,) = ctx.saved_tensors 219 | return grad_output.masked_fill(mask, 0) * ctx.scale, None 220 | else: 221 | return grad_output, None 222 | 223 | 
@staticmethod 224 | def symbolic(g: torch._C.Graph, input: torch._C.Value, local_ctx: Union[float, DropoutContext]) -> torch._C.Value: 225 | from torch.onnx import symbolic_opset12 226 | 227 | dropout_p = local_ctx 228 | if isinstance(local_ctx, DropoutContext): 229 | dropout_p = local_ctx.dropout 230 | # StableDropout only calls this function when training. 231 | train = True 232 | # TODO: We should check if the opset_version being used to export 233 | # is > 12 here, but there's no good way to do that. As-is, if the 234 | # opset_version < 12, export will fail with a CheckerError. 235 | # Once https://github.com/pytorch/pytorch/issues/78391 is fixed, do something like: 236 | # if opset_version < 12: 237 | # return torch.onnx.symbolic_opset9.dropout(g, input, dropout_p, train) 238 | return symbolic_opset12.dropout(g, input, dropout_p, train) 239 | 240 | 241 | # Copied from transformers.models.deberta.modeling_deberta.StableDropout 242 | class StableDropout(nn.Module): 243 | """ 244 | Optimized dropout module for stabilizing the training 245 | 246 | Args: 247 | drop_prob (float): the dropout probabilities 248 | """ 249 | 250 | def __init__(self, drop_prob): 251 | super().__init__() 252 | self.drop_prob = drop_prob 253 | self.count = 0 254 | self.context_stack = None 255 | 256 | def forward(self, x): 257 | """ 258 | Call the module 259 | 260 | Args: 261 | x (`torch.tensor`): The input tensor to apply dropout 262 | """ 263 | if self.training and self.drop_prob > 0: 264 | return XDropout.apply(x, self.get_context()) 265 | return x 266 | 267 | def clear_context(self): 268 | self.count = 0 269 | self.context_stack = None 270 | 271 | def init_context(self, reuse_mask=True, scale=1): 272 | if self.context_stack is None: 273 | self.context_stack = [] 274 | self.count = 0 275 | for c in self.context_stack: 276 | c.reuse_mask = reuse_mask 277 | c.scale = scale 278 | 279 | def get_context(self): 280 | if self.context_stack is not None: 281 | if self.count >= len(self.context_stack): 282 | self.context_stack.append(DropoutContext()) 283 | ctx = self.context_stack[self.count] 284 | ctx.dropout = self.drop_prob 285 | self.count += 1 286 | return ctx 287 | else: 288 | return self.drop_prob 289 | 290 | 291 | # Copied from transformers.models.deberta.modeling_deberta.DebertaSelfOutput with DebertaLayerNorm->LayerNorm 292 | class DebertaV2SelfOutput(nn.Module): 293 | def __init__(self, config): 294 | super().__init__() 295 | self.dense = nn.Linear(config.hidden_size, config.hidden_size) 296 | self.LayerNorm = LayerNorm(config.hidden_size, config.layer_norm_eps) 297 | self.dropout = StableDropout(config.hidden_dropout_prob) 298 | 299 | def forward(self, hidden_states, input_tensor): 300 | hidden_states = self.dense(hidden_states) 301 | hidden_states = self.dropout(hidden_states) 302 | hidden_states = self.LayerNorm(hidden_states + input_tensor) 303 | return hidden_states 304 | 305 | 306 | # Copied from transformers.models.deberta.modeling_deberta.DebertaAttention with Deberta->DebertaV2 307 | class DebertaV2Attention(nn.Module): 308 | def __init__(self, config): 309 | super().__init__() 310 | self.self = DisentangledSelfAttention(config) 311 | self.output = DebertaV2SelfOutput(config) 312 | self.config = config 313 | 314 | def forward( 315 | self, 316 | hidden_states, 317 | attention_mask, 318 | output_attentions=False, 319 | query_states=None, 320 | relative_pos=None, 321 | rel_embeddings=None, 322 | ): 323 | self_output = self.self( 324 | hidden_states, 325 | attention_mask, 326 | output_attentions, 327 | 
query_states=query_states, 328 | relative_pos=relative_pos, 329 | rel_embeddings=rel_embeddings, 330 | ) 331 | if output_attentions: 332 | self_output, att_matrix = self_output 333 | if query_states is None: 334 | query_states = hidden_states 335 | attention_output = self.output(self_output, query_states) 336 | 337 | if output_attentions: 338 | return (attention_output, att_matrix) 339 | else: 340 | return attention_output 341 | 342 | 343 | # Copied from transformers.models.bert.modeling_bert.BertIntermediate with Bert->DebertaV2 344 | class DebertaV2Intermediate(nn.Module): 345 | def __init__(self, config): 346 | super().__init__() 347 | self.dense = nn.Linear(config.hidden_size, config.intermediate_size) 348 | if isinstance(config.hidden_act, str): 349 | self.intermediate_act_fn = ACT2FN[config.hidden_act] 350 | else: 351 | self.intermediate_act_fn = config.hidden_act 352 | 353 | def forward(self, hidden_states: torch.Tensor) -> torch.Tensor: 354 | hidden_states = self.dense(hidden_states) 355 | hidden_states = self.intermediate_act_fn(hidden_states) 356 | return hidden_states 357 | 358 | 359 | # Copied from transformers.models.deberta.modeling_deberta.DebertaOutput with DebertaLayerNorm->LayerNorm 360 | class DebertaV2Output(nn.Module): 361 | def __init__(self, config): 362 | super().__init__() 363 | self.dense = nn.Linear(config.intermediate_size, config.hidden_size) 364 | self.LayerNorm = LayerNorm(config.hidden_size, config.layer_norm_eps) 365 | self.dropout = StableDropout(config.hidden_dropout_prob) 366 | self.config = config 367 | 368 | def forward(self, hidden_states, input_tensor): 369 | hidden_states = self.dense(hidden_states) 370 | hidden_states = self.dropout(hidden_states) 371 | hidden_states = self.LayerNorm(hidden_states + input_tensor) 372 | return hidden_states 373 | 374 | 375 | # Copied from transformers.models.deberta.modeling_deberta.DebertaLayer with Deberta->DebertaV2 376 | class DebertaV2Layer(nn.Module): 377 | def __init__(self, config): 378 | super().__init__() 379 | self.attention = DebertaV2Attention(config) 380 | self.intermediate = DebertaV2Intermediate(config) 381 | self.output = DebertaV2Output(config) 382 | 383 | def forward( 384 | self, 385 | hidden_states, 386 | attention_mask, 387 | query_states=None, 388 | relative_pos=None, 389 | rel_embeddings=None, 390 | output_attentions=False, 391 | ): 392 | attention_output = self.attention( 393 | hidden_states, 394 | attention_mask, 395 | output_attentions=output_attentions, 396 | query_states=query_states, 397 | relative_pos=relative_pos, 398 | rel_embeddings=rel_embeddings, 399 | ) 400 | if output_attentions: 401 | attention_output, att_matrix = attention_output 402 | intermediate_output = self.intermediate(attention_output) 403 | layer_output = self.output(intermediate_output, attention_output) 404 | if output_attentions: 405 | return (layer_output, att_matrix) 406 | else: 407 | return layer_output 408 | 409 | 410 | class ConvLayer(nn.Module): 411 | def __init__(self, config): 412 | super().__init__() 413 | kernel_size = getattr(config, "conv_kernel_size", 3) 414 | groups = getattr(config, "conv_groups", 1) 415 | self.conv_act = getattr(config, "conv_act", "tanh") 416 | self.conv = nn.Conv1d( 417 | config.hidden_size, config.hidden_size, kernel_size, padding=(kernel_size - 1) // 2, groups=groups 418 | ) 419 | self.LayerNorm = LayerNorm(config.hidden_size, config.layer_norm_eps) 420 | self.dropout = StableDropout(config.hidden_dropout_prob) 421 | self.config = config 422 | 423 | def forward(self, 
hidden_states, residual_states, input_mask): 424 | out = self.conv(hidden_states.permute( 425 | 0, 2, 1).contiguous()).permute(0, 2, 1).contiguous() 426 | rmask = (1 - input_mask).bool() 427 | out.masked_fill_(rmask.unsqueeze(-1).expand(out.size()), 0) 428 | out = ACT2FN[self.conv_act](self.dropout(out)) 429 | 430 | layer_norm_input = residual_states + out 431 | output = self.LayerNorm(layer_norm_input).to(layer_norm_input) 432 | 433 | if input_mask is None: 434 | output_states = output 435 | else: 436 | if input_mask.dim() != layer_norm_input.dim(): 437 | if input_mask.dim() == 4: 438 | input_mask = input_mask.squeeze(1).squeeze(1) 439 | input_mask = input_mask.unsqueeze(2) 440 | 441 | input_mask = input_mask.to(output.dtype) 442 | output_states = output * input_mask 443 | 444 | return output_states 445 | 446 | 447 | class DebertaV2Encoder(nn.Module): 448 | """Modified BertEncoder with relative position bias support""" 449 | 450 | def __init__(self, config): 451 | super().__init__() 452 | 453 | self.layer = nn.ModuleList([DebertaV2Layer(config) 454 | for _ in range(config.num_hidden_layers)]) 455 | self.relative_attention = getattr(config, "relative_attention", False) 456 | 457 | if self.relative_attention: 458 | self.max_relative_positions = getattr( 459 | config, "max_relative_positions", -1) 460 | if self.max_relative_positions < 1: 461 | self.max_relative_positions = config.max_position_embeddings 462 | 463 | self.position_buckets = getattr(config, "position_buckets", -1) 464 | pos_ebd_size = self.max_relative_positions * 2 465 | 466 | if self.position_buckets > 0: 467 | pos_ebd_size = self.position_buckets * 2 468 | 469 | self.rel_embeddings = nn.Embedding( 470 | pos_ebd_size, config.hidden_size) 471 | 472 | self.norm_rel_ebd = [x.strip() for x in getattr( 473 | config, "norm_rel_ebd", "none").lower().split("|")] 474 | 475 | if "layer_norm" in self.norm_rel_ebd: 476 | self.LayerNorm = LayerNorm( 477 | config.hidden_size, config.layer_norm_eps, elementwise_affine=True) 478 | 479 | self.conv = ConvLayer(config) if getattr( 480 | config, "conv_kernel_size", 0) > 0 else None 481 | self.gradient_checkpointing = False 482 | 483 | def get_rel_embedding(self): 484 | rel_embeddings = self.rel_embeddings.weight if self.relative_attention else None 485 | if rel_embeddings is not None and ("layer_norm" in self.norm_rel_ebd): 486 | rel_embeddings = self.LayerNorm(rel_embeddings) 487 | return rel_embeddings 488 | 489 | def get_attention_mask(self, attention_mask): 490 | if attention_mask.dim() <= 2: 491 | extended_attention_mask = attention_mask.unsqueeze(1).unsqueeze(2) 492 | attention_mask = extended_attention_mask * \ 493 | extended_attention_mask.squeeze(-2).unsqueeze(-1) 494 | attention_mask = attention_mask.byte() 495 | elif attention_mask.dim() == 3: 496 | attention_mask = attention_mask.unsqueeze(1) 497 | 498 | return attention_mask 499 | 500 | def get_rel_pos(self, hidden_states, query_states=None, relative_pos=None): 501 | if self.relative_attention and relative_pos is None: 502 | q = query_states.size( 503 | -2) if query_states is not None else hidden_states.size(-2) 504 | relative_pos = build_relative_position( 505 | q, hidden_states.size(-2), bucket_size=self.position_buckets, max_position=self.max_relative_positions 506 | ) 507 | return relative_pos 508 | 509 | def forward( 510 | self, 511 | hidden_states, 512 | attention_mask, 513 | output_hidden_states=True, 514 | output_attentions=False, 515 | query_states=None, 516 | relative_pos=None, 517 | return_dict=True, 518 | ): 519 | if 
attention_mask.dim() <= 2: 520 | input_mask = attention_mask 521 | else: 522 | input_mask = (attention_mask.sum(-2) > 0).byte() 523 | attention_mask = self.get_attention_mask(attention_mask) 524 | relative_pos = self.get_rel_pos( 525 | hidden_states, query_states, relative_pos) 526 | 527 | all_hidden_states = () if output_hidden_states else None 528 | all_attentions = () if output_attentions else None 529 | 530 | if isinstance(hidden_states, Sequence): 531 | next_kv = hidden_states[0] 532 | else: 533 | next_kv = hidden_states 534 | rel_embeddings = self.get_rel_embedding() 535 | output_states = next_kv 536 | for i, layer_module in enumerate(self.layer): 537 | 538 | if output_hidden_states: 539 | all_hidden_states = all_hidden_states + (output_states,) 540 | 541 | if self.gradient_checkpointing and self.training: 542 | 543 | def create_custom_forward(module): 544 | def custom_forward(*inputs): 545 | return module(*inputs, output_attentions) 546 | 547 | return custom_forward 548 | 549 | output_states = torch.utils.checkpoint.checkpoint( 550 | create_custom_forward(layer_module), 551 | next_kv, 552 | attention_mask, 553 | query_states, 554 | relative_pos, 555 | rel_embeddings, 556 | ) 557 | else: 558 | output_states = layer_module( 559 | next_kv, 560 | attention_mask, 561 | query_states=query_states, 562 | relative_pos=relative_pos, 563 | rel_embeddings=rel_embeddings, 564 | output_attentions=output_attentions, 565 | ) 566 | 567 | if output_attentions: 568 | output_states, att_m = output_states 569 | 570 | if i == 0 and self.conv is not None: 571 | output_states = self.conv( 572 | hidden_states, output_states, input_mask) 573 | 574 | if query_states is not None: 575 | query_states = output_states 576 | if isinstance(hidden_states, Sequence): 577 | next_kv = hidden_states[i + 1] if i + \ 578 | 1 < len(self.layer) else None 579 | else: 580 | next_kv = output_states 581 | 582 | if output_attentions: 583 | all_attentions = all_attentions + (att_m,) 584 | 585 | if output_hidden_states: 586 | all_hidden_states = all_hidden_states + (output_states,) 587 | 588 | if not return_dict: 589 | return tuple(v for v in [output_states, all_hidden_states, all_attentions] if v is not None) 590 | return BaseModelOutput( 591 | last_hidden_state=output_states, hidden_states=all_hidden_states, attentions=all_attentions 592 | ) 593 | 594 | 595 | def make_log_bucket_position(relative_pos, bucket_size, max_position): 596 | sign = torch.sign(relative_pos) 597 | mid = bucket_size // 2 598 | abs_pos = torch.where( 599 | (relative_pos < mid) & (relative_pos > -mid), 600 | torch.tensor(mid - 1).type_as(relative_pos), 601 | torch.abs(relative_pos), 602 | ) 603 | log_pos = ( 604 | torch.ceil(torch.log(abs_pos / mid) / 605 | torch.log(torch.tensor((max_position - 1) / mid)) * (mid - 1)) + mid 606 | ) 607 | bucket_pos = torch.where( 608 | abs_pos <= mid, relative_pos.type_as(log_pos), log_pos * sign) 609 | return bucket_pos 610 | 611 | 612 | def build_relative_position(query_size, key_size, bucket_size=-1, max_position=-1): 613 | """ 614 | Build relative position according to the query and key 615 | 616 | We assume the absolute position of query \\(P_q\\) is range from (0, query_size) and the absolute position of key 617 | \\(P_k\\) is range from (0, key_size), The relative positions from query to key is \\(R_{q \\rightarrow k} = P_q - 618 | P_k\\) 619 | 620 | Args: 621 | query_size (int): the length of query 622 | key_size (int): the length of key 623 | bucket_size (int): the size of position bucket 624 | max_position (int): 
the maximum allowed absolute position 625 | 626 | Return: 627 | `torch.LongTensor`: A tensor with shape [1, query_size, key_size] 628 | 629 | """ 630 | q_ids = torch.arange(0, query_size) 631 | k_ids = torch.arange(0, key_size) 632 | rel_pos_ids = q_ids[:, None] - k_ids[None, :] 633 | if bucket_size > 0 and max_position > 0: 634 | rel_pos_ids = make_log_bucket_position( 635 | rel_pos_ids, bucket_size, max_position) 636 | rel_pos_ids = rel_pos_ids.to(torch.long) 637 | rel_pos_ids = rel_pos_ids[:query_size, :] 638 | rel_pos_ids = rel_pos_ids.unsqueeze(0) 639 | return rel_pos_ids 640 | 641 | 642 | @torch.jit.script 643 | # Copied from transformers.models.deberta.modeling_deberta.c2p_dynamic_expand 644 | def c2p_dynamic_expand(c2p_pos, query_layer, relative_pos): 645 | return c2p_pos.expand([query_layer.size(0), query_layer.size(1), query_layer.size(2), relative_pos.size(-1)]) 646 | 647 | 648 | @torch.jit.script 649 | # Copied from transformers.models.deberta.modeling_deberta.p2c_dynamic_expand 650 | def p2c_dynamic_expand(c2p_pos, query_layer, key_layer): 651 | return c2p_pos.expand([query_layer.size(0), query_layer.size(1), key_layer.size(-2), key_layer.size(-2)]) 652 | 653 | 654 | @torch.jit.script 655 | # Copied from transformers.models.deberta.modeling_deberta.pos_dynamic_expand 656 | def pos_dynamic_expand(pos_index, p2c_att, key_layer): 657 | return pos_index.expand(p2c_att.size()[:2] + (pos_index.size(-2), key_layer.size(-2))) 658 | 659 | 660 | class DisentangledSelfAttention(nn.Module): 661 | """ 662 | Disentangled self-attention module 663 | 664 | Parameters: 665 | config (`DebertaV2Config`): 666 | A model config class instance with the configuration to build a new model. The schema is similar to 667 | *BertConfig*, for more details, please refer [`DebertaV2Config`] 668 | 669 | """ 670 | 671 | def __init__(self, config): 672 | super().__init__() 673 | if config.hidden_size % config.num_attention_heads != 0: 674 | raise ValueError( 675 | f"The hidden size ({config.hidden_size}) is not a multiple of the number of attention " 676 | f"heads ({config.num_attention_heads})" 677 | ) 678 | self.num_attention_heads = config.num_attention_heads 679 | _attention_head_size = config.hidden_size // config.num_attention_heads 680 | self.attention_head_size = getattr( 681 | config, "attention_head_size", _attention_head_size) 682 | self.all_head_size = self.num_attention_heads * self.attention_head_size 683 | self.query_proj = nn.Linear( 684 | config.hidden_size, self.all_head_size, bias=True) 685 | self.key_proj = nn.Linear( 686 | config.hidden_size, self.all_head_size, bias=True) 687 | self.value_proj = nn.Linear( 688 | config.hidden_size, self.all_head_size, bias=True) 689 | 690 | self.share_att_key = getattr(config, "share_att_key", False) 691 | self.pos_att_type = config.pos_att_type if config.pos_att_type is not None else [] 692 | self.relative_attention = getattr(config, "relative_attention", False) 693 | 694 | if self.relative_attention: 695 | self.position_buckets = getattr(config, "position_buckets", -1) 696 | self.max_relative_positions = getattr( 697 | config, "max_relative_positions", -1) 698 | if self.max_relative_positions < 1: 699 | self.max_relative_positions = config.max_position_embeddings 700 | self.pos_ebd_size = self.max_relative_positions 701 | if self.position_buckets > 0: 702 | self.pos_ebd_size = self.position_buckets 703 | 704 | self.pos_dropout = StableDropout(config.hidden_dropout_prob) 705 | 706 | if not self.share_att_key: 707 | if "c2p" in self.pos_att_type: 708 | 
self.pos_key_proj = nn.Linear( 709 | config.hidden_size, self.all_head_size, bias=True) 710 | if "p2c" in self.pos_att_type: 711 | self.pos_query_proj = nn.Linear( 712 | config.hidden_size, self.all_head_size) 713 | 714 | self.dropout = StableDropout(config.attention_probs_dropout_prob) 715 | 716 | def transpose_for_scores(self, x, attention_heads): 717 | new_x_shape = x.size()[:-1] + (attention_heads, -1) 718 | x = x.view(new_x_shape) 719 | return x.permute(0, 2, 1, 3).contiguous().view(-1, x.size(1), x.size(-1)) 720 | 721 | def forward( 722 | self, 723 | hidden_states, 724 | attention_mask, 725 | output_attentions=False, 726 | query_states=None, 727 | relative_pos=None, 728 | rel_embeddings=None, 729 | ): 730 | """ 731 | Call the module 732 | 733 | Args: 734 | hidden_states (`torch.FloatTensor`): 735 | Input states to the module usually the output from previous layer, it will be the Q,K and V in 736 | *Attention(Q,K,V)* 737 | 738 | attention_mask (`torch.ByteTensor`): 739 | An attention mask matrix of shape [*B*, *N*, *N*] where *B* is the batch size, *N* is the maximum 740 | sequence length in which element [i,j] = *1* means the *i* th token in the input can attend to the *j* 741 | th token. 742 | 743 | output_attentions (`bool`, optional): 744 | Whether return the attention matrix. 745 | 746 | query_states (`torch.FloatTensor`, optional): 747 | The *Q* state in *Attention(Q,K,V)*. 748 | 749 | relative_pos (`torch.LongTensor`): 750 | The relative position encoding between the tokens in the sequence. It's of shape [*B*, *N*, *N*] with 751 | values ranging in [*-max_relative_positions*, *max_relative_positions*]. 752 | 753 | rel_embeddings (`torch.FloatTensor`): 754 | The embedding of relative distances. It's a tensor of shape [\\(2 \\times 755 | \\text{max_relative_positions}\\), *hidden_size*]. 756 | 757 | 758 | """ 759 | if query_states is None: 760 | query_states = hidden_states 761 | query_layer = self.transpose_for_scores( 762 | self.query_proj(query_states), self.num_attention_heads) 763 | key_layer = self.transpose_for_scores( 764 | self.key_proj(hidden_states), self.num_attention_heads) 765 | value_layer = self.transpose_for_scores( 766 | self.value_proj(hidden_states), self.num_attention_heads) 767 | 768 | rel_att = None 769 | # Take the dot product between "query" and "key" to get the raw attention scores. 
770 | scale_factor = 1 771 | if "c2p" in self.pos_att_type: 772 | scale_factor += 1 773 | if "p2c" in self.pos_att_type: 774 | scale_factor += 1 775 | scale = torch.sqrt(torch.tensor(query_layer.size(-1), 776 | dtype=torch.float) * scale_factor) 777 | attention_scores = torch.bmm(query_layer, key_layer.transpose(-1, -2)) / torch.tensor( 778 | scale, dtype=query_layer.dtype 779 | ) 780 | if self.relative_attention: 781 | rel_embeddings = self.pos_dropout(rel_embeddings) 782 | rel_att = self.disentangled_attention_bias( 783 | query_layer, key_layer, relative_pos, rel_embeddings, scale_factor 784 | ) 785 | 786 | if rel_att is not None: 787 | attention_scores = attention_scores + rel_att 788 | attention_scores = attention_scores 789 | attention_scores = attention_scores.view( 790 | -1, self.num_attention_heads, attention_scores.size(-2), attention_scores.size(-1) 791 | ) 792 | 793 | # bsz x height x length x dimension 794 | attention_probs = XSoftmax.apply(attention_scores, attention_mask, -1) 795 | attention_probs = self.dropout(attention_probs) 796 | context_layer = torch.bmm( 797 | attention_probs.view(-1, attention_probs.size(-2), 798 | attention_probs.size(-1)), value_layer 799 | ) 800 | context_layer = ( 801 | context_layer.view(-1, self.num_attention_heads, 802 | context_layer.size(-2), context_layer.size(-1)) 803 | .permute(0, 2, 1, 3) 804 | .contiguous() 805 | ) 806 | new_context_layer_shape = context_layer.size()[:-2] + (-1,) 807 | context_layer = context_layer.view(new_context_layer_shape) 808 | if output_attentions: 809 | return (context_layer, attention_probs) 810 | else: 811 | return context_layer 812 | 813 | def disentangled_attention_bias(self, query_layer, key_layer, relative_pos, rel_embeddings, scale_factor): 814 | if relative_pos is None: 815 | q = query_layer.size(-2) 816 | relative_pos = build_relative_position( 817 | q, key_layer.size(-2), bucket_size=self.position_buckets, max_position=self.max_relative_positions 818 | ) 819 | if relative_pos.dim() == 2: 820 | relative_pos = relative_pos.unsqueeze(0).unsqueeze(0) 821 | elif relative_pos.dim() == 3: 822 | relative_pos = relative_pos.unsqueeze(1) 823 | # bsz x height x query x key 824 | elif relative_pos.dim() != 4: 825 | raise ValueError( 826 | f"Relative position ids must be of dim 2 or 3 or 4. 
{relative_pos.dim()}") 827 | 828 | att_span = self.pos_ebd_size 829 | relative_pos = relative_pos.long().to(query_layer.device) 830 | 831 | rel_embeddings = rel_embeddings[0: att_span * 2, :].unsqueeze(0) 832 | if self.share_att_key: 833 | pos_query_layer = self.transpose_for_scores( 834 | self.query_proj(rel_embeddings), self.num_attention_heads 835 | ).repeat(query_layer.size(0) // self.num_attention_heads, 1, 1) 836 | pos_key_layer = self.transpose_for_scores(self.key_proj(rel_embeddings), self.num_attention_heads).repeat( 837 | query_layer.size(0) // self.num_attention_heads, 1, 1 838 | ) 839 | else: 840 | if "c2p" in self.pos_att_type: 841 | pos_key_layer = self.transpose_for_scores( 842 | self.pos_key_proj(rel_embeddings), self.num_attention_heads 843 | ).repeat( 844 | query_layer.size(0) // self.num_attention_heads, 1, 1 845 | ) # .split(self.all_head_size, dim=-1) 846 | if "p2c" in self.pos_att_type: 847 | pos_query_layer = self.transpose_for_scores( 848 | self.pos_query_proj( 849 | rel_embeddings), self.num_attention_heads 850 | ).repeat( 851 | query_layer.size(0) // self.num_attention_heads, 1, 1 852 | ) # .split(self.all_head_size, dim=-1) 853 | 854 | score = 0 855 | # content->position 856 | if "c2p" in self.pos_att_type: 857 | scale = torch.sqrt(torch.tensor( 858 | pos_key_layer.size(-1), dtype=torch.float) * scale_factor) 859 | c2p_att = torch.bmm(query_layer, pos_key_layer.transpose(-1, -2)) 860 | c2p_pos = torch.clamp(relative_pos + att_span, 0, att_span * 2 - 1) 861 | c2p_att = torch.gather( 862 | c2p_att, 863 | dim=-1, 864 | index=c2p_pos.squeeze(0).expand( 865 | [query_layer.size(0), query_layer.size(1), relative_pos.size(-1)]), 866 | ) 867 | score += c2p_att / torch.tensor(scale, dtype=c2p_att.dtype) 868 | 869 | # position->content 870 | if "p2c" in self.pos_att_type: 871 | scale = torch.sqrt(torch.tensor( 872 | pos_query_layer.size(-1), dtype=torch.float) * scale_factor) 873 | if key_layer.size(-2) != query_layer.size(-2): 874 | r_pos = build_relative_position( 875 | key_layer.size(-2), 876 | key_layer.size(-2), 877 | bucket_size=self.position_buckets, 878 | max_position=self.max_relative_positions, 879 | ).to(query_layer.device) 880 | r_pos = r_pos.unsqueeze(0) 881 | else: 882 | r_pos = relative_pos 883 | 884 | p2c_pos = torch.clamp(-r_pos + att_span, 0, att_span * 2 - 1) 885 | p2c_att = torch.bmm(key_layer, pos_query_layer.transpose(-1, -2)) 886 | p2c_att = torch.gather( 887 | p2c_att, 888 | dim=-1, 889 | index=p2c_pos.squeeze(0).expand( 890 | [query_layer.size(0), key_layer.size(-2), key_layer.size(-2)]), 891 | ).transpose(-1, -2) 892 | score += p2c_att / torch.tensor(scale, dtype=p2c_att.dtype) 893 | 894 | return score 895 | 896 | 897 | # Copied from transformers.models.deberta.modeling_deberta.DebertaEmbeddings with DebertaLayerNorm->LayerNorm 898 | class DebertaV2Embeddings(nn.Module): 899 | """Construct the embeddings from word, position and token_type embeddings.""" 900 | 901 | def __init__(self, config): 902 | super().__init__() 903 | pad_token_id = getattr(config, "pad_token_id", 0) 904 | self.embedding_size = getattr( 905 | config, "embedding_size", config.hidden_size) 906 | self.word_embeddings = nn.Embedding( 907 | config.vocab_size, self.embedding_size, padding_idx=pad_token_id) 908 | 909 | self.position_biased_input = getattr( 910 | config, "position_biased_input", True) 911 | if not self.position_biased_input: 912 | self.position_embeddings = None 913 | else: 914 | self.position_embeddings = nn.Embedding( 915 | config.max_position_embeddings, 
self.embedding_size) 916 | 917 | if config.type_vocab_size > 0: 918 | self.token_type_embeddings = nn.Embedding( 919 | config.type_vocab_size, self.embedding_size) 920 | 921 | if self.embedding_size != config.hidden_size: 922 | self.embed_proj = nn.Linear( 923 | self.embedding_size, config.hidden_size, bias=False) 924 | self.LayerNorm = LayerNorm(config.hidden_size, config.layer_norm_eps) 925 | self.dropout = StableDropout(config.hidden_dropout_prob) 926 | self.config = config 927 | 928 | # position_ids (1, len position emb) is contiguous in memory and exported when serialized 929 | self.register_buffer("position_ids", torch.arange( 930 | config.max_position_embeddings).expand((1, -1))) 931 | 932 | def forward(self, input_ids=None, token_type_ids=None, position_ids=None, mask=None, inputs_embeds=None): 933 | if input_ids is not None: 934 | input_shape = input_ids.size() 935 | else: 936 | input_shape = inputs_embeds.size()[:-1] 937 | 938 | seq_length = input_shape[1] 939 | 940 | if position_ids is None: 941 | position_ids = self.position_ids[:, :seq_length] 942 | 943 | if token_type_ids is None: 944 | token_type_ids = torch.zeros( 945 | input_shape, dtype=torch.long, device=self.position_ids.device) 946 | 947 | if inputs_embeds is None: 948 | inputs_embeds = self.word_embeddings(input_ids) 949 | 950 | if self.position_embeddings is not None: 951 | position_embeddings = self.position_embeddings(position_ids.long()) 952 | else: 953 | position_embeddings = torch.zeros_like(inputs_embeds) 954 | 955 | embeddings = inputs_embeds 956 | if self.position_biased_input: 957 | embeddings += position_embeddings 958 | if self.config.type_vocab_size > 0: 959 | token_type_embeddings = self.token_type_embeddings(token_type_ids) 960 | embeddings += token_type_embeddings 961 | 962 | if self.embedding_size != self.config.hidden_size: 963 | embeddings = self.embed_proj(embeddings) 964 | 965 | embeddings = self.LayerNorm(embeddings) 966 | 967 | if mask is not None: 968 | if mask.dim() != embeddings.dim(): 969 | if mask.dim() == 4: 970 | mask = mask.squeeze(1).squeeze(1) 971 | mask = mask.unsqueeze(2) 972 | mask = mask.to(embeddings.dtype) 973 | 974 | embeddings = embeddings * mask 975 | 976 | embeddings = self.dropout(embeddings) 977 | return embeddings 978 | 979 | 980 | # Copied from transformers.models.deberta.modeling_deberta.DebertaPreTrainedModel with Deberta->DebertaV2 981 | class DebertaV2PreTrainedModel(PreTrainedModel): 982 | """ 983 | An abstract class to handle weights initialization and a simple interface for downloading and loading pretrained 984 | models. 
985 | """ 986 | 987 | config_class = DebertaV2Config 988 | base_model_prefix = "deberta" 989 | _keys_to_ignore_on_load_missing = ["position_ids"] 990 | _keys_to_ignore_on_load_unexpected = ["position_embeddings"] 991 | supports_gradient_checkpointing = True 992 | 993 | def _init_weights(self, module): 994 | """Initialize the weights.""" 995 | if isinstance(module, nn.Linear): 996 | # Slightly different from the TF version which uses truncated_normal for initialization 997 | # cf https://github.com/pytorch/pytorch/pull/5617 998 | module.weight.data.normal_( 999 | mean=0.0, std=self.config.initializer_range) 1000 | if module.bias is not None: 1001 | module.bias.data.zero_() 1002 | elif isinstance(module, nn.Embedding): 1003 | module.weight.data.normal_( 1004 | mean=0.0, std=self.config.initializer_range) 1005 | if module.padding_idx is not None: 1006 | module.weight.data[module.padding_idx].zero_() 1007 | 1008 | def _set_gradient_checkpointing(self, module, value=False): 1009 | if isinstance(module, DebertaV2Encoder): 1010 | module.gradient_checkpointing = value 1011 | 1012 | 1013 | DEBERTA_START_DOCSTRING = r""" 1014 | The DeBERTa model was proposed in [DeBERTa: Decoding-enhanced BERT with Disentangled 1015 | Attention](https://arxiv.org/abs/2006.03654) by Pengcheng He, Xiaodong Liu, Jianfeng Gao, Weizhu Chen. It's build 1016 | on top of BERT/RoBERTa with two improvements, i.e. disentangled attention and enhanced mask decoder. With those two 1017 | improvements, it out perform BERT/RoBERTa on a majority of tasks with 80GB pretraining data. 1018 | 1019 | This model is also a PyTorch [torch.nn.Module](https://pytorch.org/docs/stable/nn.html#torch.nn.Module) subclass. 1020 | Use it as a regular PyTorch Module and refer to the PyTorch documentation for all matter related to general usage 1021 | and behavior. 1022 | 1023 | 1024 | Parameters: 1025 | config ([`DebertaV2Config`]): Model configuration class with all the parameters of the model. 1026 | Initializing with a config file does not load the weights associated with the model, only the 1027 | configuration. Check out the [`~PreTrainedModel.from_pretrained`] method to load the model weights. 1028 | """ 1029 | 1030 | DEBERTA_INPUTS_DOCSTRING = r""" 1031 | Args: 1032 | input_ids (`torch.LongTensor` of shape `({0})`): 1033 | Indices of input sequence tokens in the vocabulary. 1034 | 1035 | Indices can be obtained using [`DebertaV2Tokenizer`]. See [`PreTrainedTokenizer.encode`] and 1036 | [`PreTrainedTokenizer.__call__`] for details. 1037 | 1038 | [What are input IDs?](../glossary#input-ids) 1039 | attention_mask (`torch.FloatTensor` of shape `({0})`, *optional*): 1040 | Mask to avoid performing attention on padding token indices. Mask values selected in `[0, 1]`: 1041 | 1042 | - 1 for tokens that are **not masked**, 1043 | - 0 for tokens that are **masked**. 1044 | 1045 | [What are attention masks?](../glossary#attention-mask) 1046 | token_type_ids (`torch.LongTensor` of shape `({0})`, *optional*): 1047 | Segment token indices to indicate first and second portions of the inputs. Indices are selected in `[0, 1048 | 1]`: 1049 | 1050 | - 0 corresponds to a *sentence A* token, 1051 | - 1 corresponds to a *sentence B* token. 1052 | 1053 | [What are token type IDs?](../glossary#token-type-ids) 1054 | position_ids (`torch.LongTensor` of shape `({0})`, *optional*): 1055 | Indices of positions of each input sequence tokens in the position embeddings. Selected in the range `[0, 1056 | config.max_position_embeddings - 1]`. 
1057 | 1058 | [What are position IDs?](../glossary#position-ids) 1059 | inputs_embeds (`torch.FloatTensor` of shape `({0}, hidden_size)`, *optional*): 1060 | Optionally, instead of passing `input_ids` you can choose to directly pass an embedded representation. This 1061 | is useful if you want more control over how to convert *input_ids* indices into associated vectors than the 1062 | model's internal embedding lookup matrix. 1063 | output_attentions (`bool`, *optional*): 1064 | Whether or not to return the attentions tensors of all attention layers. See `attentions` under returned 1065 | tensors for more detail. 1066 | output_hidden_states (`bool`, *optional*): 1067 | Whether or not to return the hidden states of all layers. See `hidden_states` under returned tensors for 1068 | more detail. 1069 | return_dict (`bool`, *optional*): 1070 | Whether or not to return a [`~utils.ModelOutput`] instead of a plain tuple. 1071 | """ 1072 | 1073 | 1074 | @add_start_docstrings( 1075 | "The bare DeBERTa Model transformer outputting raw hidden-states without any specific head on top.", 1076 | DEBERTA_START_DOCSTRING, 1077 | ) 1078 | # Copied from transformers.models.deberta.modeling_deberta.DebertaModel with Deberta->DebertaV2 1079 | class DebertaV2Model(DebertaV2PreTrainedModel): 1080 | def __init__(self, config): 1081 | super().__init__(config) 1082 | 1083 | self.embeddings = DebertaV2Embeddings(config) 1084 | self.encoder = DebertaV2Encoder(config) 1085 | self.z_steps = 0 1086 | self.config = config 1087 | # Initialize weights and apply final processing 1088 | self.post_init() 1089 | 1090 | def get_input_embeddings(self): 1091 | return self.embeddings.word_embeddings 1092 | 1093 | def set_input_embeddings(self, new_embeddings): 1094 | self.embeddings.word_embeddings = new_embeddings 1095 | 1096 | def _prune_heads(self, heads_to_prune): 1097 | """ 1098 | Prunes heads of the model. 
heads_to_prune: dict of {layer_num: list of heads to prune in this layer} See base 1099 | class PreTrainedModel 1100 | """ 1101 | raise NotImplementedError( 1102 | "The prune function is not implemented in DeBERTa model.") 1103 | 1104 | @add_start_docstrings_to_model_forward(DEBERTA_INPUTS_DOCSTRING.format("batch_size, sequence_length")) 1105 | @add_code_sample_docstrings( 1106 | processor_class=_TOKENIZER_FOR_DOC, 1107 | checkpoint=_CHECKPOINT_FOR_DOC, 1108 | output_type=BaseModelOutput, 1109 | config_class=_CONFIG_FOR_DOC, 1110 | ) 1111 | def forward( 1112 | self, 1113 | input_ids: Optional[torch.Tensor] = None, 1114 | attention_mask: Optional[torch.Tensor] = None, 1115 | token_type_ids: Optional[torch.Tensor] = None, 1116 | position_ids: Optional[torch.Tensor] = None, 1117 | inputs_embeds: Optional[torch.Tensor] = None, 1118 | output_attentions: Optional[bool] = None, 1119 | output_hidden_states: Optional[bool] = None, 1120 | return_dict: Optional[bool] = None, 1121 | ) -> Union[Tuple, BaseModelOutput]: 1122 | output_attentions = output_attentions if output_attentions is not None else self.config.output_attentions 1123 | output_hidden_states = ( 1124 | output_hidden_states if output_hidden_states is not None else self.config.output_hidden_states 1125 | ) 1126 | return_dict = return_dict if return_dict is not None else self.config.use_return_dict 1127 | 1128 | if input_ids is not None and inputs_embeds is not None: 1129 | raise ValueError( 1130 | "You cannot specify both input_ids and inputs_embeds at the same time") 1131 | elif input_ids is not None: 1132 | input_shape = input_ids.size() 1133 | elif inputs_embeds is not None: 1134 | input_shape = inputs_embeds.size()[:-1] 1135 | else: 1136 | raise ValueError( 1137 | "You have to specify either input_ids or inputs_embeds") 1138 | 1139 | device = input_ids.device if input_ids is not None else inputs_embeds.device 1140 | 1141 | if attention_mask is None: 1142 | attention_mask = torch.ones(input_shape, device=device) 1143 | if token_type_ids is None: 1144 | token_type_ids = torch.zeros( 1145 | input_shape, dtype=torch.long, device=device) 1146 | 1147 | embedding_output = self.embeddings( 1148 | input_ids=input_ids, 1149 | token_type_ids=token_type_ids, 1150 | position_ids=position_ids, 1151 | mask=attention_mask, 1152 | inputs_embeds=inputs_embeds, 1153 | ) 1154 | 1155 | encoder_outputs = self.encoder( 1156 | embedding_output, 1157 | attention_mask, 1158 | output_hidden_states=True, 1159 | output_attentions=output_attentions, 1160 | return_dict=return_dict, 1161 | ) 1162 | encoded_layers = encoder_outputs[1] 1163 | 1164 | if self.z_steps > 1: 1165 | hidden_states = encoded_layers[-2] 1166 | layers = [self.encoder.layer[-1] for _ in range(self.z_steps)] 1167 | query_states = encoded_layers[-1] 1168 | rel_embeddings = self.encoder.get_rel_embedding() 1169 | attention_mask = self.encoder.get_attention_mask(attention_mask) 1170 | rel_pos = self.encoder.get_rel_pos(embedding_output) 1171 | for layer in layers[1:]: 1172 | query_states = layer( 1173 | hidden_states, 1174 | attention_mask, 1175 | output_attentions=False, 1176 | query_states=query_states, 1177 | relative_pos=rel_pos, 1178 | rel_embeddings=rel_embeddings, 1179 | ) 1180 | encoded_layers.append(query_states) 1181 | 1182 | sequence_output = encoded_layers[-1] 1183 | 1184 | if not return_dict: 1185 | return (sequence_output,) + encoder_outputs[(1 if output_hidden_states else 2):] 1186 | 1187 | return BaseModelOutput( 1188 | last_hidden_state=sequence_output, 1189 | 
hidden_states=encoder_outputs.hidden_states if output_hidden_states else None, 1190 | attentions=encoder_outputs.attentions, 1191 | ) 1192 | 1193 | 1194 | @add_start_docstrings("""DeBERTa Model with a `language modeling` head on top.""", DEBERTA_START_DOCSTRING) 1195 | # Copied from transformers.models.deberta.modeling_deberta.DebertaForMaskedLM with Deberta->DebertaV2 1196 | class DebertaV2ForMaskedLM(DebertaV2PreTrainedModel): 1197 | _keys_to_ignore_on_load_unexpected = [r"pooler"] 1198 | _keys_to_ignore_on_load_missing = [ 1199 | r"position_ids", r"predictions.decoder.bias"] 1200 | 1201 | def __init__(self, config): 1202 | super().__init__(config) 1203 | 1204 | self.deberta = DebertaV2Model(config) 1205 | self.cls = DebertaV2OnlyMLMHead(config) 1206 | 1207 | # Initialize weights and apply final processing 1208 | self.post_init() 1209 | 1210 | def get_output_embeddings(self): 1211 | return self.cls.predictions.decoder 1212 | 1213 | def set_output_embeddings(self, new_embeddings): 1214 | self.cls.predictions.decoder = new_embeddings 1215 | 1216 | @add_start_docstrings_to_model_forward(DEBERTA_INPUTS_DOCSTRING.format("batch_size, sequence_length")) 1217 | @add_code_sample_docstrings( 1218 | processor_class=_TOKENIZER_FOR_DOC, 1219 | checkpoint=_CHECKPOINT_FOR_MASKED_LM, 1220 | output_type=MaskedLMOutput, 1221 | config_class=_CONFIG_FOR_DOC, 1222 | mask="[MASK]", 1223 | expected_output=_MASKED_LM_EXPECTED_OUTPUT, 1224 | expected_loss=_MASKED_LM_EXPECTED_LOSS, 1225 | ) 1226 | def forward( 1227 | self, 1228 | input_ids: Optional[torch.Tensor] = None, 1229 | attention_mask: Optional[torch.Tensor] = None, 1230 | token_type_ids: Optional[torch.Tensor] = None, 1231 | position_ids: Optional[torch.Tensor] = None, 1232 | inputs_embeds: Optional[torch.Tensor] = None, 1233 | labels: Optional[torch.Tensor] = None, 1234 | output_attentions: Optional[bool] = None, 1235 | output_hidden_states: Optional[bool] = None, 1236 | return_dict: Optional[bool] = None, 1237 | ) -> Union[Tuple, MaskedLMOutput]: 1238 | r""" 1239 | labels (`torch.LongTensor` of shape `(batch_size, sequence_length)`, *optional*): 1240 | Labels for computing the masked language modeling loss. 
Indices should be in `[-100, 0, ..., 1241 | config.vocab_size]` (see `input_ids` docstring) Tokens with indices set to `-100` are ignored (masked), the 1242 | loss is only computed for the tokens with labels in `[0, ..., config.vocab_size]` 1243 | """ 1244 | 1245 | return_dict = return_dict if return_dict is not None else self.config.use_return_dict 1246 | 1247 | outputs = self.deberta( 1248 | input_ids, 1249 | attention_mask=attention_mask, 1250 | token_type_ids=token_type_ids, 1251 | position_ids=position_ids, 1252 | inputs_embeds=inputs_embeds, 1253 | output_attentions=output_attentions, 1254 | output_hidden_states=output_hidden_states, 1255 | return_dict=return_dict, 1256 | ) 1257 | 1258 | sequence_output = outputs[0] 1259 | prediction_scores = self.cls(sequence_output) 1260 | 1261 | masked_lm_loss = None 1262 | if labels is not None: 1263 | loss_fct = CrossEntropyLoss() # -100 index = padding token 1264 | masked_lm_loss = loss_fct( 1265 | prediction_scores.view(-1, self.config.vocab_size), labels.view(-1)) 1266 | 1267 | if not return_dict: 1268 | output = (prediction_scores,) + outputs[1:] 1269 | return ((masked_lm_loss,) + output) if masked_lm_loss is not None else output 1270 | 1271 | return MaskedLMOutput( 1272 | loss=masked_lm_loss, 1273 | logits=prediction_scores, 1274 | hidden_states=outputs.hidden_states, 1275 | attentions=outputs.attentions, 1276 | ) 1277 | 1278 | 1279 | # copied from transformers.models.bert.BertPredictionHeadTransform with bert -> deberta 1280 | class DebertaV2PredictionHeadTransform(nn.Module): 1281 | def __init__(self, config): 1282 | super().__init__() 1283 | self.dense = nn.Linear(config.hidden_size, config.hidden_size) 1284 | if isinstance(config.hidden_act, str): 1285 | self.transform_act_fn = ACT2FN[config.hidden_act] 1286 | else: 1287 | self.transform_act_fn = config.hidden_act 1288 | self.LayerNorm = nn.LayerNorm( 1289 | config.hidden_size, eps=config.layer_norm_eps) 1290 | 1291 | def forward(self, hidden_states): 1292 | hidden_states = self.dense(hidden_states) 1293 | hidden_states = self.transform_act_fn(hidden_states) 1294 | hidden_states = self.LayerNorm(hidden_states) 1295 | return hidden_states 1296 | 1297 | 1298 | # copied from transformers.models.bert.BertLMPredictionHead with bert -> deberta 1299 | class DebertaV2LMPredictionHead(nn.Module): 1300 | def __init__(self, config): 1301 | super().__init__() 1302 | self.transform = DebertaV2PredictionHeadTransform(config) 1303 | 1304 | # The output weights are the same as the input embeddings, but there is 1305 | # an output-only bias for each token. 
1306 | self.decoder = nn.Linear( 1307 | config.hidden_size, config.vocab_size, bias=False) 1308 | 1309 | self.bias = nn.Parameter(torch.zeros(config.vocab_size)) 1310 | 1311 | # Need a link between the two variables so that the bias is correctly resized with `resize_token_embeddings` 1312 | self.decoder.bias = self.bias 1313 | 1314 | def forward(self, hidden_states): 1315 | hidden_states = self.transform(hidden_states) 1316 | hidden_states = self.decoder(hidden_states) 1317 | return hidden_states 1318 | 1319 | 1320 | # copied from transformers.models.bert.BertOnlyMLMHead with bert -> deberta 1321 | class DebertaV2OnlyMLMHead(nn.Module): 1322 | def __init__(self, config): 1323 | super().__init__() 1324 | self.predictions = DebertaV2LMPredictionHead(config) 1325 | 1326 | def forward(self, sequence_output): 1327 | prediction_scores = self.predictions(sequence_output) 1328 | return prediction_scores 1329 | 1330 | 1331 | @add_start_docstrings( 1332 | """ 1333 | DeBERTa Model transformer with a sequence classification/regression head on top (a linear layer on top of the 1334 | pooled output) e.g. for GLUE tasks. 1335 | """, 1336 | DEBERTA_START_DOCSTRING, 1337 | ) 1338 | # Copied from transformers.models.deberta.modeling_deberta.DebertaForSequenceClassification with Deberta->DebertaV2 1339 | class DebertaV2ForSequenceClassification(DebertaV2PreTrainedModel): 1340 | def __init__(self, config): 1341 | super().__init__(config) 1342 | 1343 | num_labels = getattr(config, "num_labels", 2) 1344 | self.num_labels = num_labels 1345 | 1346 | self.deberta = DebertaV2Model(config) 1347 | self.pooler = ContextPooler(config) 1348 | output_dim = self.pooler.output_dim 1349 | 1350 | self.classifier = nn.Linear(output_dim, num_labels) 1351 | drop_out = getattr(config, "cls_dropout", None) 1352 | drop_out = self.config.hidden_dropout_prob if drop_out is None else drop_out 1353 | self.dropout = StableDropout(drop_out) 1354 | 1355 | # Initialize weights and apply final processing 1356 | self.post_init() 1357 | 1358 | def get_input_embeddings(self): 1359 | return self.deberta.get_input_embeddings() 1360 | 1361 | def set_input_embeddings(self, new_embeddings): 1362 | self.deberta.set_input_embeddings(new_embeddings) 1363 | 1364 | @add_start_docstrings_to_model_forward(DEBERTA_INPUTS_DOCSTRING.format("batch_size, sequence_length")) 1365 | @add_code_sample_docstrings( 1366 | processor_class=_TOKENIZER_FOR_DOC, 1367 | checkpoint=_CHECKPOINT_FOR_SEQUENCE_CLASSIFICATION, 1368 | output_type=SequenceClassifierOutput, 1369 | config_class=_CONFIG_FOR_DOC, 1370 | expected_output=_SEQ_CLASS_EXPECTED_OUTPUT, 1371 | expected_loss=_SEQ_CLASS_EXPECTED_LOSS, 1372 | ) 1373 | def forward( 1374 | self, 1375 | input_ids: Optional[torch.Tensor] = None, 1376 | attention_mask: Optional[torch.Tensor] = None, 1377 | token_type_ids: Optional[torch.Tensor] = None, 1378 | position_ids: Optional[torch.Tensor] = None, 1379 | inputs_embeds: Optional[torch.Tensor] = None, 1380 | labels: Optional[torch.Tensor] = None, 1381 | output_attentions: Optional[bool] = None, 1382 | output_hidden_states: Optional[bool] = None, 1383 | return_dict: Optional[bool] = None, 1384 | ) -> Union[Tuple, SequenceClassifierOutput]: 1385 | r""" 1386 | labels (`torch.LongTensor` of shape `(batch_size,)`, *optional*): 1387 | Labels for computing the sequence classification/regression loss. Indices should be in `[0, ..., 1388 | config.num_labels - 1]`. 
If `config.num_labels == 1` a regression loss is computed (Mean-Square loss), If 1389 | `config.num_labels > 1` a classification loss is computed (Cross-Entropy). 1390 | """ 1391 | return_dict = return_dict if return_dict is not None else self.config.use_return_dict 1392 | 1393 | outputs = self.deberta( 1394 | input_ids, 1395 | token_type_ids=token_type_ids, 1396 | attention_mask=attention_mask, 1397 | position_ids=position_ids, 1398 | inputs_embeds=inputs_embeds, 1399 | output_attentions=output_attentions, 1400 | output_hidden_states=output_hidden_states, 1401 | return_dict=return_dict, 1402 | ) 1403 | 1404 | encoder_layer = outputs[0] 1405 | pooled_output = self.pooler(encoder_layer) 1406 | pooled_output = self.dropout(pooled_output) 1407 | logits = self.classifier(pooled_output) 1408 | 1409 | loss = None 1410 | if labels is not None: 1411 | if self.config.problem_type is None: 1412 | if self.num_labels == 1: 1413 | # regression task 1414 | loss_fn = nn.MSELoss() 1415 | logits = logits.view(-1).to(labels.dtype) 1416 | loss = loss_fn(logits, labels.view(-1)) 1417 | elif labels.dim() == 1 or labels.size(-1) == 1: 1418 | label_index = (labels >= 0).nonzero() 1419 | labels = labels.long() 1420 | if label_index.size(0) > 0: 1421 | labeled_logits = torch.gather( 1422 | logits, 0, label_index.expand( 1423 | label_index.size(0), logits.size(1)) 1424 | ) 1425 | labels = torch.gather(labels, 0, label_index.view(-1)) 1426 | loss_fct = CrossEntropyLoss() 1427 | loss = loss_fct( 1428 | labeled_logits.view(-1, self.num_labels).float(), labels.view(-1)) 1429 | else: 1430 | loss = torch.tensor(0).to(logits) 1431 | else: 1432 | log_softmax = nn.LogSoftmax(-1) 1433 | loss = -((log_softmax(logits) * labels).sum(-1)).mean() 1434 | elif self.config.problem_type == "regression": 1435 | loss_fct = MSELoss() 1436 | if self.num_labels == 1: 1437 | loss = loss_fct(logits.squeeze(), labels.squeeze()) 1438 | else: 1439 | loss = loss_fct(logits, labels) 1440 | elif self.config.problem_type == "single_label_classification": 1441 | loss_fct = CrossEntropyLoss() 1442 | loss = loss_fct( 1443 | logits.view(-1, self.num_labels), labels.view(-1)) 1444 | elif self.config.problem_type == "multi_label_classification": 1445 | loss_fct = BCEWithLogitsLoss() 1446 | loss = loss_fct(logits, labels) 1447 | if not return_dict: 1448 | output = (logits,) + outputs[1:] 1449 | return ((loss,) + output) if loss is not None else output 1450 | 1451 | return SequenceClassifierOutput( 1452 | loss=loss, logits=logits, hidden_states=outputs.hidden_states, attentions=outputs.attentions 1453 | ) 1454 | 1455 | 1456 | @add_start_docstrings( 1457 | """ 1458 | DeBERTa Model with a token classification head on top (a linear layer on top of the hidden-states output) e.g. for 1459 | Named-Entity-Recognition (NER) tasks. 
1460 | """, 1461 | DEBERTA_START_DOCSTRING, 1462 | ) 1463 | # Copied from transformers.models.deberta.modeling_deberta.DebertaForTokenClassification with Deberta->DebertaV2 1464 | class DebertaV2ForTokenClassification(DebertaV2PreTrainedModel): 1465 | _keys_to_ignore_on_load_unexpected = [r"pooler"] 1466 | 1467 | def __init__(self, config): 1468 | super().__init__(config) 1469 | self.num_labels = config.num_labels 1470 | 1471 | self.deberta = DebertaV2Model(config) 1472 | self.dropout = nn.Dropout(config.hidden_dropout_prob) 1473 | self.classifier = nn.Linear(config.hidden_size, config.num_labels) 1474 | 1475 | # Initialize weights and apply final processing 1476 | self.post_init() 1477 | 1478 | @add_start_docstrings_to_model_forward(DEBERTA_INPUTS_DOCSTRING.format("batch_size, sequence_length")) 1479 | @add_code_sample_docstrings( 1480 | processor_class=_TOKENIZER_FOR_DOC, 1481 | checkpoint=_CHECKPOINT_FOR_TOKEN_CLASSIFICATION, 1482 | output_type=TokenClassifierOutput, 1483 | config_class=_CONFIG_FOR_DOC, 1484 | expected_output=_TOKEN_CLASS_EXPECTED_OUTPUT, 1485 | expected_loss=_TOKEN_CLASS_EXPECTED_LOSS, 1486 | ) 1487 | def forward( 1488 | self, 1489 | input_ids: Optional[torch.Tensor] = None, 1490 | attention_mask: Optional[torch.Tensor] = None, 1491 | token_type_ids: Optional[torch.Tensor] = None, 1492 | position_ids: Optional[torch.Tensor] = None, 1493 | inputs_embeds: Optional[torch.Tensor] = None, 1494 | labels: Optional[torch.Tensor] = None, 1495 | output_attentions: Optional[bool] = None, 1496 | output_hidden_states: Optional[bool] = None, 1497 | return_dict: Optional[bool] = None, 1498 | ) -> Union[Tuple, TokenClassifierOutput]: 1499 | r""" 1500 | labels (`torch.LongTensor` of shape `(batch_size, sequence_length)`, *optional*): 1501 | Labels for computing the token classification loss. Indices should be in `[0, ..., config.num_labels - 1]`. 1502 | """ 1503 | return_dict = return_dict if return_dict is not None else self.config.use_return_dict 1504 | 1505 | outputs = self.deberta( 1506 | input_ids, 1507 | attention_mask=attention_mask, 1508 | token_type_ids=token_type_ids, 1509 | position_ids=position_ids, 1510 | inputs_embeds=inputs_embeds, 1511 | output_attentions=output_attentions, 1512 | output_hidden_states=output_hidden_states, 1513 | return_dict=return_dict, 1514 | ) 1515 | 1516 | sequence_output = outputs[0] 1517 | 1518 | sequence_output = self.dropout(sequence_output) 1519 | logits = self.classifier(sequence_output) 1520 | 1521 | loss = None 1522 | if labels is not None: 1523 | loss_fct = CrossEntropyLoss() 1524 | loss = loss_fct(logits.view(-1, self.num_labels), labels.view(-1)) 1525 | 1526 | if not return_dict: 1527 | output = (logits,) + outputs[1:] 1528 | return ((loss,) + output) if loss is not None else output 1529 | 1530 | return TokenClassifierOutput( 1531 | loss=loss, logits=logits, hidden_states=outputs.hidden_states, attentions=outputs.attentions 1532 | ) 1533 | 1534 | 1535 | @add_start_docstrings( 1536 | """ 1537 | DeBERTa Model with a span classification head on top for extractive question-answering tasks like SQuAD (a linear 1538 | layers on top of the hidden-states output to compute `span start logits` and `span end logits`). 
1539 | """, 1540 | DEBERTA_START_DOCSTRING, 1541 | ) 1542 | # Copied from transformers.models.deberta.modeling_deberta.DebertaForQuestionAnswering with Deberta->DebertaV2 1543 | class DebertaV2ForQuestionAnswering(DebertaV2PreTrainedModel): 1544 | _keys_to_ignore_on_load_unexpected = [r"pooler"] 1545 | 1546 | def __init__(self, config): 1547 | super().__init__(config) 1548 | self.num_labels = config.num_labels 1549 | 1550 | self.deberta = DebertaV2Model(config) 1551 | self.qa_outputs = nn.Linear(config.hidden_size, config.num_labels) 1552 | 1553 | # Initialize weights and apply final processing 1554 | self.post_init() 1555 | 1556 | @add_start_docstrings_to_model_forward(DEBERTA_INPUTS_DOCSTRING.format("batch_size, sequence_length")) 1557 | @add_code_sample_docstrings( 1558 | processor_class=_TOKENIZER_FOR_DOC, 1559 | checkpoint=_CHECKPOINT_FOR_QA, 1560 | output_type=QuestionAnsweringModelOutput, 1561 | config_class=_CONFIG_FOR_DOC, 1562 | expected_output=_QA_EXPECTED_OUTPUT, 1563 | expected_loss=_QA_EXPECTED_LOSS, 1564 | qa_target_start_index=_QA_TARGET_START_INDEX, 1565 | qa_target_end_index=_QA_TARGET_END_INDEX, 1566 | ) 1567 | def forward( 1568 | self, 1569 | input_ids: Optional[torch.Tensor] = None, 1570 | attention_mask: Optional[torch.Tensor] = None, 1571 | token_type_ids: Optional[torch.Tensor] = None, 1572 | position_ids: Optional[torch.Tensor] = None, 1573 | inputs_embeds: Optional[torch.Tensor] = None, 1574 | start_positions: Optional[torch.Tensor] = None, 1575 | end_positions: Optional[torch.Tensor] = None, 1576 | output_attentions: Optional[bool] = None, 1577 | output_hidden_states: Optional[bool] = None, 1578 | return_dict: Optional[bool] = None, 1579 | ) -> Union[Tuple, QuestionAnsweringModelOutput]: 1580 | r""" 1581 | start_positions (`torch.LongTensor` of shape `(batch_size,)`, *optional*): 1582 | Labels for position (index) of the start of the labelled span for computing the token classification loss. 1583 | Positions are clamped to the length of the sequence (`sequence_length`). Position outside of the sequence 1584 | are not taken into account for computing the loss. 1585 | end_positions (`torch.LongTensor` of shape `(batch_size,)`, *optional*): 1586 | Labels for position (index) of the end of the labelled span for computing the token classification loss. 1587 | Positions are clamped to the length of the sequence (`sequence_length`). Position outside of the sequence 1588 | are not taken into account for computing the loss. 
1589 | """ 1590 | return_dict = return_dict if return_dict is not None else self.config.use_return_dict 1591 | 1592 | outputs = self.deberta( 1593 | input_ids, 1594 | attention_mask=attention_mask, 1595 | token_type_ids=token_type_ids, 1596 | position_ids=position_ids, 1597 | inputs_embeds=inputs_embeds, 1598 | output_attentions=output_attentions, 1599 | output_hidden_states=output_hidden_states, 1600 | return_dict=return_dict, 1601 | ) 1602 | 1603 | sequence_output = outputs[0] 1604 | 1605 | logits = self.qa_outputs(sequence_output) 1606 | start_logits, end_logits = logits.split(1, dim=-1) 1607 | start_logits = start_logits.squeeze(-1).contiguous() 1608 | end_logits = end_logits.squeeze(-1).contiguous() 1609 | 1610 | total_loss = None 1611 | if start_positions is not None and end_positions is not None: 1612 | # If we are on multi-GPU, split add a dimension 1613 | if len(start_positions.size()) > 1: 1614 | start_positions = start_positions.squeeze(-1) 1615 | if len(end_positions.size()) > 1: 1616 | end_positions = end_positions.squeeze(-1) 1617 | # sometimes the start/end positions are outside our model inputs, we ignore these terms 1618 | ignored_index = start_logits.size(1) 1619 | start_positions = start_positions.clamp(0, ignored_index) 1620 | end_positions = end_positions.clamp(0, ignored_index) 1621 | 1622 | loss_fct = CrossEntropyLoss(ignore_index=ignored_index) 1623 | start_loss = loss_fct(start_logits, start_positions) 1624 | end_loss = loss_fct(end_logits, end_positions) 1625 | total_loss = (start_loss + end_loss) / 2 1626 | 1627 | if not return_dict: 1628 | output = (start_logits, end_logits) + outputs[1:] 1629 | return ((total_loss,) + output) if total_loss is not None else output 1630 | 1631 | return QuestionAnsweringModelOutput( 1632 | loss=total_loss, 1633 | start_logits=start_logits, 1634 | end_logits=end_logits, 1635 | hidden_states=outputs.hidden_states, 1636 | attentions=outputs.attentions, 1637 | ) 1638 | 1639 | 1640 | @add_start_docstrings( 1641 | """ 1642 | DeBERTa Model with a multiple choice classification head on top (a linear layer on top of the pooled output and a 1643 | softmax) e.g. for RocStories/SWAG tasks. 
1644 | """, 1645 | DEBERTA_START_DOCSTRING, 1646 | ) 1647 | class DebertaV2ForMultipleChoice(DebertaV2PreTrainedModel): 1648 | def __init__(self, config): 1649 | super().__init__(config) 1650 | 1651 | num_labels = getattr(config, "num_labels", 2) 1652 | self.num_labels = num_labels 1653 | 1654 | self.deberta = DebertaV2Model(config) 1655 | self.pooler = ContextPooler(config) 1656 | output_dim = self.pooler.output_dim 1657 | 1658 | self.classifier = nn.Linear(output_dim, 1) 1659 | drop_out = getattr(config, "cls_dropout", None) 1660 | drop_out = self.config.hidden_dropout_prob if drop_out is None else drop_out 1661 | self.dropout = StableDropout(drop_out) 1662 | 1663 | self.init_weights() 1664 | 1665 | def get_input_embeddings(self): 1666 | return self.deberta.get_input_embeddings() 1667 | 1668 | def set_input_embeddings(self, new_embeddings): 1669 | self.deberta.set_input_embeddings(new_embeddings) 1670 | 1671 | @add_start_docstrings_to_model_forward(DEBERTA_INPUTS_DOCSTRING.format("batch_size, sequence_length")) 1672 | @add_code_sample_docstrings( 1673 | processor_class=_TOKENIZER_FOR_DOC, 1674 | checkpoint=_CHECKPOINT_FOR_DOC, 1675 | output_type=MultipleChoiceModelOutput, 1676 | config_class=_CONFIG_FOR_DOC, 1677 | ) 1678 | def forward( 1679 | self, 1680 | input_ids=None, 1681 | attention_mask=None, 1682 | token_type_ids=None, 1683 | position_ids=None, 1684 | inputs_embeds=None, 1685 | labels=None, 1686 | output_attentions=None, 1687 | output_hidden_states=None, 1688 | return_dict=None, 1689 | ): 1690 | r""" 1691 | labels (`torch.LongTensor` of shape `(batch_size,)`, *optional*): 1692 | Labels for computing the multiple choice classification loss. Indices should be in `[0, ..., 1693 | num_choices-1]` where `num_choices` is the size of the second dimension of the input tensors. 
(See 1694 | `input_ids` above) 1695 | """ 1696 | return_dict = return_dict if return_dict is not None else self.config.use_return_dict 1697 | num_choices = input_ids.shape[1] if input_ids is not None else inputs_embeds.shape[1] 1698 | 1699 | flat_input_ids = input_ids.view(-1, input_ids.size(-1) 1700 | ) if input_ids is not None else None 1701 | flat_position_ids = position_ids.view( 1702 | -1, position_ids.size(-1)) if position_ids is not None else None 1703 | flat_token_type_ids = token_type_ids.view( 1704 | -1, token_type_ids.size(-1)) if token_type_ids is not None else None 1705 | flat_attention_mask = attention_mask.view( 1706 | -1, attention_mask.size(-1)) if attention_mask is not None else None 1707 | flat_inputs_embeds = ( 1708 | inputs_embeds.view(-1, inputs_embeds.size(-2), 1709 | inputs_embeds.size(-1)) 1710 | if inputs_embeds is not None 1711 | else None 1712 | ) 1713 | 1714 | outputs = self.deberta( 1715 | flat_input_ids, 1716 | position_ids=flat_position_ids, 1717 | token_type_ids=flat_token_type_ids, 1718 | attention_mask=flat_attention_mask, 1719 | inputs_embeds=flat_inputs_embeds, 1720 | output_attentions=output_attentions, 1721 | output_hidden_states=output_hidden_states, 1722 | return_dict=return_dict, 1723 | ) 1724 | 1725 | encoder_layer = outputs[0] 1726 | pooled_output = self.pooler(encoder_layer) 1727 | pooled_output = self.dropout(pooled_output) 1728 | logits = self.classifier(pooled_output) 1729 | reshaped_logits = logits.view(-1, num_choices) 1730 | 1731 | loss = None 1732 | if labels is not None: 1733 | loss_fct = CrossEntropyLoss() 1734 | loss = loss_fct(reshaped_logits, labels) 1735 | 1736 | if not return_dict: 1737 | output = (reshaped_logits,) + outputs[1:] 1738 | return ((loss,) + output) if loss is not None else output 1739 | 1740 | return MultipleChoiceModelOutput( 1741 | loss=loss, 1742 | logits=reshaped_logits, 1743 | hidden_states=outputs.hidden_states, 1744 | attentions=outputs.attentions, 1745 | ) 1746 | 1747 | 1748 | class bert_classify(DebertaV2PreTrainedModel): 1749 | _keys_to_ignore_on_load_unexpected = [r"pooler"] 1750 | 1751 | def __init__(self, config): 1752 | super().__init__(config, ) 1753 | config.output_hidden_states = True 1754 | ''' 1755 | hidden_states:这是输出的一个可选项,如果输出,需要指定config.output_hidden_states=True,它是一个元组,含有13个元素, 1756 | 第一个元素可以当做是embedding,也就是cls,其余12个元素是各层隐藏状态的输出,每个元素的形状是(batch_size, sequence_length, hidden_size), 1757 | ''' 1758 | self.num_labels = config.num_labels 1759 | self.deberta = DebertaV2Model(config) 1760 | self.dropout = nn.Dropout(p=0.2) 1761 | self.high_dropout = nn.Dropout(p=0.5) 1762 | n_weights = config.num_hidden_layers + 1 # 因为指定了输出hidden_states,所以多了一层,加1 1763 | weights_init = torch.zeros(n_weights).float() 1764 | weights_init.data[:-1] = -3 1765 | self.layer_weights = torch.nn.Parameter(weights_init) 1766 | self.bilstm = nn.LSTM(config.hidden_size, 100, bidirectional=True) 1767 | self.classifier = nn.Linear(config.hidden_size + 200, self.num_labels) 1768 | # 随机初始化只对最后增加的线性层,已加载的模型参数不受影响 1769 | self.init_weights() 1770 | self.post_init() 1771 | 1772 | def forward( 1773 | self, 1774 | input_ids: Optional[torch.Tensor] = None, 1775 | attention_mask: Optional[torch.Tensor] = None, 1776 | token_type_ids: Optional[torch.Tensor] = None, 1777 | position_ids: Optional[torch.Tensor] = None, 1778 | inputs_embeds: Optional[torch.Tensor] = None, 1779 | labels: Optional[torch.Tensor] = None, 1780 | output_attentions: Optional[bool] = None, 1781 | output_hidden_states: Optional[bool] = None, 1782 | return_dict: 
Optional[bool] = None, 1783 | ): 1784 | return_dict = return_dict if return_dict is not None else self.config.use_return_dict 1785 | 1786 | outputs = self.deberta( 1787 | input_ids, 1788 | attention_mask=attention_mask, 1789 | token_type_ids=token_type_ids, 1790 | position_ids=position_ids, 1791 | inputs_embeds=inputs_embeds, 1792 | output_attentions=output_attentions, 1793 | output_hidden_states=output_hidden_states, 1794 | return_dict=return_dict, 1795 | ) 1796 | 1797 | hidden_layers = outputs[1] 1798 | hidden_layers_last = outputs[0] 1799 | rnn_output, _ = self.bilstm(hidden_layers_last) # shape [batchsize, 200] 1800 | 1801 | # 取每一层的cls(shape:batchsize * hidden_size) dropout叠加 shape: 13*bathsize*hidden_size 1802 | cls_outputs = torch.stack( 1803 | [self.dropout(layer[:, 0, :]) for layer in hidden_layers], dim=0 1804 | ) 1805 | 1806 | # 然后加权求和 shape: bathsize*hidden_size 1807 | cls_output = (torch.softmax(self.layer_weights, dim=0).unsqueeze(-1).unsqueeze(-1) * cls_outputs).sum(0) 1808 | cls_output = torch.cat((rnn_output[:, -1], cls_output), dim=1) 1809 | 1810 | # 对求和后的cls向量进行dropout,在输入线性层,重复五次,然后求平均的到最后的输出logit 1811 | logits = torch.mean( 1812 | torch.stack( 1813 | [self.classifier(self.high_dropout(cls_output)) for _ in range(5)], 1814 | dim=0, 1815 | ), 1816 | dim=0, 1817 | ) 1818 | outputs = (logits,) + outputs[2:] 1819 | if labels is not None: 1820 | loss_fct = CrossEntropyLoss() 1821 | loss1 = loss_fct(logits.view(-1, self.num_labels), labels.view(-1)) 1822 | loss = loss1 1823 | 1824 | outputs = (loss.mean(),) + outputs # loss, logits, output[2:] 1825 | 1826 | return outputs -------------------------------------------------------------------------------- /pretrain_algorithm/nezha_graph.py: -------------------------------------------------------------------------------- 1 | # !usr/bin/env python 2 | # -*- coding:utf-8 -*- 3 | 4 | ''' 5 | Author : Huang zh 6 | Email : jacob.hzh@qq.com 7 | Date : 2022-09-16 14:39:21 8 | LastEditTime : 2023-03-21 19:12:37 9 | FilePath : \\codes\\pretrain_algorithm\\nezha_graph.py 10 | Description : 11 | ''' 12 | 13 | import torch 14 | from torch import nn 15 | from transformers import NezhaPreTrainedModel, NezhaModel 16 | 17 | class nezha_classify(NezhaPreTrainedModel): 18 | def __init__(self, config): 19 | super().__init__(config, ) 20 | 21 | self.num_labels = config.num_labels 22 | 23 | self.nezha = NezhaModel(config) 24 | 25 | self.dropout = nn.Dropout(p=0.2) 26 | self.high_dropout = nn.Dropout(p=0.5) 27 | 28 | n_weights = config.num_hidden_layers + 1 29 | weights_init = torch.zeros(n_weights).float() 30 | weights_init.data[:-1] = -3 31 | 32 | self.layer_weights = torch.nn.Parameter(weights_init) 33 | 34 | self.classifier = nn.Linear(config.hidden_size, self.num_labels) 35 | 36 | self.post_init() 37 | 38 | def forward( 39 | self, 40 | input_ids=None, 41 | attention_mask=None, 42 | token_type_ids=None, 43 | label=None, 44 | ): 45 | outputs = self.nezha( 46 | input_ids, 47 | attention_mask=attention_mask, 48 | token_type_ids=token_type_ids, 49 | output_hidden_states=True 50 | ) 51 | 52 | hidden_layers = outputs[2] 53 | 54 | cls_outputs = torch.stack( 55 | [self.dropout(layer[:, 0, :]) for layer in hidden_layers], dim=2 56 | ) 57 | 58 | cls_output = (torch.softmax(self.layer_weights, dim=0) * cls_outputs).sum(-1) 59 | 60 | logits = torch.mean( 61 | torch.stack( 62 | [self.classifier(self.high_dropout(cls_output)) for _ in range(5)], 63 | dim=0, 64 | ), 65 | dim=0, 66 | ) 67 | 68 | return logits 69 | 
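
The `nezha_classify` head above shares one pattern with `roberta_classify` and the `bert_classify` head in `deberta_graph.py`: take the [CLS] vector from every hidden layer, combine them with learnable softmax weights, then pass the pooled vector through the classifier several times under different dropout masks (multi-sample dropout) and average the logits. The sketch below isolates that pattern with synthetic hidden states so it can run on its own; `WeightedLayerPoolingHead` and its arguments are illustrative names, not part of this repo, and the extra BiLSTM branch used by `bert_classify` is omitted.

```
import torch
from torch import nn


class WeightedLayerPoolingHead(nn.Module):
    """Illustrative head: learnable softmax weights over per-layer [CLS]
    vectors plus multi-sample dropout, mirroring the *_classify heads."""

    def __init__(self, hidden_size, num_layers, num_labels, n_drop=5):
        super().__init__()
        # one weight per hidden_states entry (embedding output + each encoder layer)
        weights_init = torch.zeros(num_layers + 1)
        weights_init[:-1] = -3  # bias the softmax toward the last layer at init
        self.layer_weights = nn.Parameter(weights_init)
        self.dropout = nn.Dropout(0.2)
        self.high_dropout = nn.Dropout(0.5)
        self.classifier = nn.Linear(hidden_size, num_labels)
        self.n_drop = n_drop

    def forward(self, hidden_states):
        # hidden_states: tuple of (num_layers + 1) tensors [batch, seq, hidden]
        cls_per_layer = torch.stack(
            [self.dropout(h[:, 0, :]) for h in hidden_states], dim=2
        )  # [batch, hidden, num_layers + 1]
        cls = (torch.softmax(self.layer_weights, dim=0) * cls_per_layer).sum(-1)
        # multi-sample dropout: average the classifier over several dropout masks
        logits = torch.stack(
            [self.classifier(self.high_dropout(cls)) for _ in range(self.n_drop)],
            dim=0,
        ).mean(0)
        return logits


if __name__ == "__main__":
    # synthetic hidden states standing in for a 12-layer encoder's output
    batch, seq, hidden, layers, labels = 4, 16, 768, 12, 3
    fake_hidden = tuple(torch.randn(batch, seq, hidden) for _ in range(layers + 1))
    head = WeightedLayerPoolingHead(hidden, layers, labels)
    print(head(fake_hidden).shape)  # torch.Size([4, 3])
```
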
-------------------------------------------------------------------------------- /pretrain_algorithm/pre_model.py: -------------------------------------------------------------------------------- 1 | # !usr/bin/env python 2 | # -*- coding:utf-8 -*- 3 | 4 | ''' 5 | Author : Huang zh 6 | Email : jacob.hzh@qq.com 7 | Date : 2023-03-13 17:10:12 8 | LastEditTime : 2023-03-23 16:12:19 9 | FilePath : \\codes\\pretrain_algorithm\\pre_model.py 10 | Description : 11 | ''' 12 | 13 | import gc 14 | import os 15 | import shutil 16 | import numpy as np 17 | import torch 18 | import time 19 | from tqdm import tqdm 20 | from common import get_time_dif 21 | from config import PRE_MODEL_NAME, VERBOSE 22 | from metrics import Matrix 23 | from pretrain_algorithm.bert_graph import bert_classifier 24 | from pretrain_algorithm.nezha_graph import nezha_classify 25 | from pretrain_algorithm.roberta_wwm import roberta_classify 26 | from transformers import BertConfig, NezhaConfig, RobertaConfig 27 | from trick.early_stop import EarlyStopping 28 | from trick.fgm_pgd_ema import FGM 29 | from tensorboardX import SummaryWriter 30 | 31 | 32 | class PRE_EXCUTER: 33 | def __init__(self, dl_config): 34 | self.dlconfig = dl_config 35 | 36 | def judge_model(self, assign_path=''): 37 | load_path = assign_path 38 | if self.dlconfig.model_name not in PRE_MODEL_NAME: 39 | print('pretrain model name is not support, please see PRE_MODEL_NAME of config.py') 40 | #* 后续添加模型需要在这里酌情修改对应的方法 41 | if self.dlconfig.model_name in ['mac_bert', 'bert', 'bert_wwm']: 42 | self.pre_config = BertConfig.from_pretrained(os.path.join(load_path, 'config.json')) 43 | self.pre_config.num_labels = self.dlconfig.nums_label 44 | self.model = bert_classifier.from_pretrained(os.path.join( 45 | load_path, 'pytorch_model.bin'), config=self.pre_config) 46 | elif self.dlconfig.model_name == 'nezha_wwm': 47 | self.pre_config = NezhaConfig.from_pretrained(os.path.join(load_path, 'config.json')) 48 | self.pre_config.num_labels = self.dlconfig.nums_label 49 | self.model = nezha_classify.from_pretrained(os.path.join( 50 | load_path, 'pytorch_model.bin'), config=self.pre_config) 51 | elif self.dlconfig.model_name == 'roberta_wwm': 52 | self.pre_config = RobertaConfig.from_pretrained(os.path.join(load_path, 'config.json')) 53 | self.pre_config.num_labels = self.dlconfig.nums_label 54 | self.model = roberta_classify.from_pretrained(os.path.join( 55 | load_path, 'pytorch_model.bin'), config=self.pre_config) 56 | 57 | #! 
其他模型 58 | else: 59 | pass 60 | self.model.to(self.dlconfig.device) 61 | 62 | def train(self, train_loader, test_loader, dev_loader, model_saved_path): 63 | # 设置优化器 64 | # 带这些名字的参数不需要做权重衰减 65 | no_decay = ["bias", "LayerNorm.weight"] 66 | optimizer_grouped_parameters = [ 67 | {'params': [p for n, p in self.model.named_parameters() if not any(nd in n for nd in no_decay)], 68 | 'weight_decay': 0.01, 'lr': self.dlconfig.learning_rate}, 69 | {'params': [p for n, p in self.model.named_parameters() if any(nd in n for nd in no_decay)], 70 | 'weight_decay': 0.0, 71 | 'lr': self.dlconfig.learning_rate}, 72 | ] 73 | optimizer = torch.optim.AdamW( 74 | optimizer_grouped_parameters, lr=self.dlconfig.learning_rate) 75 | best_test_f1 = 0 76 | writer = SummaryWriter(logdir='./logs') 77 | # 学习率指数衰减,每次epoch:学习率 = gamma * 学习率 78 | # scheduler = torch.optim.lr_scheduler.ExponentialLR(optimizer, gamma=0.9) 79 | 80 | # 学习更新策略--预热(warmup) 81 | if self.dlconfig.update_lr: 82 | from transformers import get_linear_schedule_with_warmup 83 | num_warmup_steps = int( 84 | self.dlconfig.warmup_prop * self.dlconfig.epochs * len(train_loader)) 85 | num_training_steps = int(self.dlconfig.epochs * len(train_loader)) 86 | # 由于transformers自带的adamw优化器只实现了权重衰减,因此还要自己调用调度器和做梯度裁剪,下面使用线性调度器 87 | scheduler = get_linear_schedule_with_warmup( 88 | optimizer, num_warmup_steps, num_training_steps) 89 | 90 | # 早停策略 91 | early_stopping = EarlyStopping(patience=20, delta=0) 92 | 93 | for epoch in range(self.dlconfig.epochs): 94 | # 设定训练模式 95 | self.model.train() 96 | # 梯度清零 97 | self.model.zero_grad() 98 | start_time = time.time() 99 | avg_loss = 0 100 | first_epoch_eval = 0 101 | for data in tqdm(train_loader, ncols=100): 102 | data['input_ids'] = data['input_ids'].to(self.dlconfig.device) 103 | data['attention_mask'] = data['attention_mask'].to(self.dlconfig.device) 104 | data['token_type_ids'] = data['token_type_ids'].to(self.dlconfig.device) 105 | data['label'] = data['label'].to(self.dlconfig.device) 106 | pred = self.model(**data) 107 | loss = self.dlconfig.loss_fct(pred, data['label']).mean() 108 | # 反向传播 109 | loss.backward() 110 | avg_loss += loss.item() / len(train_loader) 111 | 112 | # 使用fgm 113 | if self.dlconfig.use_fgm: 114 | fgm = FGM(self.model) 115 | fgm.attack() 116 | loss_adv = self.model(**data).mean() 117 | # 通过扰乱后的embedding训练后得到对抗训练后的loss值,然后反向传播计算对抗后的梯度,累加到前面正常的梯度上,最后再去更新参数 118 | loss_adv.backward() 119 | fgm.restore() 120 | 121 | if self.dlconfig.update_lr: 122 | torch.nn.utils.clip_grad_norm_(self.model.parameters(), 1.0) # 对应上面的梯度衰减 123 | 124 | # 更新优化器 125 | optimizer.step() 126 | # 更新学习率 127 | if self.dlconfig.update_lr: 128 | scheduler.step() 129 | 130 | # 用以下方式替代model.zero_grad(),可以提高gpu利用率 131 | for param in self.model.parameters(): 132 | param.grad = None 133 | 134 | # 计算模型运行时间 135 | elapsed_time = get_time_dif(start_time) 136 | # 打印间隔 137 | if (epoch + 1) % VERBOSE == 0: 138 | # 在测试集上看下效果 139 | avg_test_loss, test_f1, pred_all, true_all = self.evaluate( 140 | test_loader) 141 | elapsed_time = elapsed_time * VERBOSE 142 | if self.dlconfig.update_lr: 143 | lr = scheduler.get_last_lr()[0] 144 | else: 145 | lr = self.dlconfig.learning_rate 146 | tqdm.write( 147 | f"Epoch {epoch + 1:02d}/{self.dlconfig.epochs:02d} \t time={elapsed_time} \t" 148 | f"loss={avg_loss:.3f}\t lr={lr:.1e}", 149 | end="\t", 150 | ) 151 | 152 | if (epoch + 1 >= first_epoch_eval) or (epoch + 1 == self.dlconfig.epochs): 153 | tqdm.write( 154 | f"val_loss={avg_test_loss:.3f}\ttest_f1={test_f1:.4f}\t lr={lr:.1e}") 155 | else: 156 | 
tqdm.write("") 157 | writer.add_scalar('Loss/train', avg_loss, epoch) 158 | writer.add_scalar('Loss/test', avg_test_loss, epoch) 159 | writer.add_scalar('F1/test', test_f1, epoch) 160 | writer.add_scalar('lr/train', lr, epoch) 161 | 162 | # 每次保存最优的模型,以测试集f1为准 163 | if best_test_f1 < test_f1: 164 | best_test_f1 = test_f1 165 | tqdm.write('*' * 20) 166 | self.save_model(model_saved_path) 167 | tqdm.write('new model saved') 168 | tqdm.write('*' * 20) 169 | 170 | early_stopping(avg_test_loss) 171 | if early_stopping.early_stop: 172 | break 173 | # 删除数据加载器以及变量 174 | del (test_loader, train_loader, loss, data, pred) 175 | # 释放内存 176 | gc.collect() 177 | torch.cuda.empty_cache() 178 | writer.close() 179 | 180 | def evaluate(self, test_loader): 181 | pre_all = [] 182 | true_all = [] 183 | # 设定评估模式 184 | self.model.eval() 185 | avg_test_loss = 0 186 | with torch.no_grad(): 187 | for test_data in test_loader: 188 | pred = self.model(test_data['input_ids'].to(self.dlconfig.device), 189 | test_data['attention_mask'].to(self.dlconfig.device), 190 | test_data['token_type_ids'].to(self.dlconfig.device), 191 | ) 192 | test_loss = self.dlconfig.loss_fct(pred, test_data['label'].to(self.dlconfig.device)).mean() 193 | avg_test_loss += test_loss.item() / len(test_loader) 194 | true_all.extend(test_data['label'].detach().cpu().numpy()) 195 | pre_all.append(pred.softmax(-1).detach().cpu().numpy()) 196 | pre_all = np.concatenate(pre_all) 197 | pre_all = np.argmax(pre_all, axis=-1) 198 | if self.dlconfig.loss_type == 'multi' or self.dlconfig.loss_type == 'marginLoss': 199 | multi = True 200 | else: 201 | multi = False 202 | matrix = Matrix(true_all, pre_all, multi=multi) 203 | return avg_test_loss, matrix.get_f1(), pre_all, true_all 204 | 205 | def predict(self, dev_loader): 206 | pre_all = [] 207 | with torch.no_grad(): 208 | for test_data in dev_loader: 209 | pred = self.model(test_data['input_ids'].to(self.dlconfig.device), 210 | test_data['attention_mask'].to(self.dlconfig.device), 211 | test_data['token_type_ids'].to(self.dlconfig.device), 212 | ) 213 | pre_all.append(pred.softmax(-1).detach().cpu().numpy()) 214 | pre_all = np.concatenate(pre_all) 215 | pre_all = np.argmax(pre_all, axis=-1) 216 | return pre_all 217 | 218 | # 保存模型权重 219 | def save_model(self, path): 220 | if not os.path.exists(path): 221 | os.makedirs(path) 222 | if not os.path.exists(os.path.join(path, 'config.json')): 223 | shutil.copy(f'{self.dlconfig.pretrain_file_path}/config.json', f'{path}/config.json') 224 | if not os.path.exists(os.path.join(path, 'vocab.txt')): 225 | shutil.copy(f'{self.dlconfig.pretrain_file_path}/vocab.txt', f'{path}/vocab.txt') 226 | name = 'pytorch_model.bin' 227 | output_path = os.path.join(path, name) 228 | torch.save(self.model.state_dict(), output_path) 229 | print(f'model is saved, in {str(output_path)}') 230 | 231 | def load_model(self, path): 232 | try: 233 | self.judge_model(path) 234 | self.model.eval() 235 | print('model 已加载预训练参数') 236 | except: 237 | print('model load error') 238 | -------------------------------------------------------------------------------- /pretrain_algorithm/roberta_wwm.py: -------------------------------------------------------------------------------- 1 | # !usr/bin/env python 2 | # -*- coding:utf-8 -*- 3 | 4 | ''' 5 | Author : Huang zh 6 | Email : jacob.hzh@qq.com 7 | Date : 2023-03-21 19:14:06 8 | LastEditTime : 2023-03-22 11:06:19 9 | FilePath : \\codes\\pretrain_algorithm\\roberta_wwm.py 10 | Description : 11 | ''' 12 | 13 | 14 | import torch 15 | from torch import nn 16 | 
from transformers import RobertaPreTrainedModel, RobertaModel 17 | 18 | class roberta_classify(RobertaPreTrainedModel): 19 | _keys_to_ignore_on_load_unexpected = [r"pooler"] 20 | _keys_to_ignore_on_load_missing = [r"position_ids"] 21 | def __init__(self, config): 22 | super().__init__(config, ) 23 | 24 | self.num_labels = config.num_labels 25 | # 如果add_pooling_layer设置为True,那么output会多一个池化层结果,可以选择用这个池化层的结果去做下游任务 26 | # 由于这里用多个隐层的平均作为下游任务的输入,所以设置为Fasle 27 | self.roberta = RobertaModel(config, add_pooling_layer=False) 28 | 29 | self.dropout = nn.Dropout(p=0.2) 30 | self.high_dropout = nn.Dropout(p=0.5) 31 | 32 | n_weights = config.num_hidden_layers + 1 33 | weights_init = torch.zeros(n_weights).float() 34 | weights_init.data[:-1] = -3 35 | 36 | self.layer_weights = torch.nn.Parameter(weights_init) 37 | 38 | self.classifier = nn.Linear(config.hidden_size, self.num_labels) 39 | 40 | self.post_init() 41 | 42 | def forward( 43 | self, 44 | input_ids=None, 45 | attention_mask=None, 46 | token_type_ids=None, 47 | label=None, 48 | ): 49 | outputs = self.roberta( 50 | input_ids, 51 | attention_mask=attention_mask, 52 | token_type_ids=token_type_ids, 53 | output_hidden_states=True 54 | ) 55 | 56 | hidden_layers = outputs[1] 57 | 58 | cls_outputs = torch.stack( 59 | [self.dropout(layer[:, 0, :]) for layer in hidden_layers], dim=2 60 | ) 61 | 62 | cls_output = (torch.softmax(self.layer_weights, dim=0) * cls_outputs).sum(-1) 63 | 64 | logits = torch.mean( 65 | torch.stack( 66 | [self.classifier(self.high_dropout(cls_output)) for _ in range(5)], 67 | dim=0, 68 | ), 69 | dim=0, 70 | ) 71 | 72 | return logits 73 | -------------------------------------------------------------------------------- /pretrain_model/bert_wwm/readme.txt: -------------------------------------------------------------------------------- 1 | # 这里放模型文件,vocab.txt config.json pytorch_model.bin -------------------------------------------------------------------------------- /process_data_dl.py: -------------------------------------------------------------------------------- 1 | #!/usr/bin/env python 2 | # -*- encoding: utf-8 -*- 3 | ''' 4 | @File : utils.py 5 | @Time : 2023/02/08 14:57:32 6 | @Author : Huang zh 7 | @Contact : jacob.hzh@qq.com 8 | @Version : 0.1 9 | @Desc : get vocab, label, label_nums, label2n, word2n, n2word, n2label, dataset定义 10 | ''' 11 | 12 | import os 13 | import jieba 14 | import pickle as pkl 15 | import pandas as pd 16 | import numpy as np 17 | import torch 18 | import torch.nn as nn 19 | from collections import OrderedDict 20 | from torch.utils.data import Dataset, DataLoader 21 | from config import VOCAB_MAX_SIZE, WORD_MIN_FREQ, VOCAB_SAVE_PATH, L2I_SAVE_PATH, PRETRAIN_EMBEDDING_FILE 22 | from trick.dynamic_padding import collater 23 | 24 | 25 | class DataSetProcess: 26 | def __init__(self, train_data_path='', test_data_path='', dev_data_path=''): 27 | self.train_data_path = train_data_path 28 | self.test_data_path = test_data_path 29 | self.dev_data_path = dev_data_path 30 | self.train_data, self.l1 = self.load_data( 31 | self.train_data_path) if self.train_data_path else [[], []] 32 | self.test_data, self.l2 = self.load_data( 33 | self.test_data_path) if self.test_data_path else [[], []] 34 | self.dev_data, self.l3 = self.load_data( 35 | self.dev_data_path) if self.dev_data_path else [[], []] 36 | 37 | def load_data(self, path): 38 | """默认是处理csv文件,其他形式的要改,csv的话文本内容要改成我提供的demo格式 39 | """ 40 | if path.endswith('csv'): 41 | df = pd.read_csv(path, encoding='utf-8') 42 | contents = df['content'].values.tolist() 43 | 
try: 44 | labels = df['label'].values.tolist() 45 | except: 46 | labels = [] 47 | return contents, labels 48 | else: 49 | #! todo 其他格式的文件读取 50 | pass 51 | 52 | def build_vocab(self, save=False): 53 | if os.path.exists(VOCAB_SAVE_PATH): 54 | with open(VOCAB_SAVE_PATH, 'rb') as f: 55 | vocab_dic = pkl.load(f) 56 | print(f"vocab size {len(vocab_dic)}") 57 | return vocab_dic 58 | 59 | vocab_dic = {} 60 | UNK, PAD = '', '' 61 | min_freq = WORD_MIN_FREQ 62 | vocab_max_size = VOCAB_MAX_SIZE 63 | 64 | all_data = self.train_data + self.test_data + self.dev_data 65 | 66 | for sentence in all_data: 67 | sentence = sentence.strip() 68 | #! 这里只设置了中文,英文用空格,还没写 69 | tokens = jieba.cut(sentence) 70 | for token in tokens: 71 | vocab_dic[token] = vocab_dic.get(token, 0) + 1 72 | # 对词表进行排序 73 | vocab_list = sorted([_ for _ in vocab_dic.items() if _[ 74 | 1] >= min_freq], key=lambda x: x[1], reverse=True)[:vocab_max_size] 75 | 76 | # 还原成字典 77 | vocab_dic = {word_count[0]: idx for idx, 78 | word_count in enumerate(vocab_list)} 79 | 80 | # 使用UNK填充单词表的尾部 81 | vocab_dic.update({UNK: len(vocab_dic), PAD: len(vocab_dic) + 1}) 82 | 83 | # 是否保存 84 | if save: 85 | abs_path = VOCAB_SAVE_PATH.rsplit('/', 1)[0] 86 | if not os.path.exists(abs_path): 87 | os.makedirs(abs_path) 88 | with open(VOCAB_SAVE_PATH, 'wb') as f: 89 | pkl.dump(vocab_dic, f) 90 | print(f'vocab_dic is saved in {VOCAB_SAVE_PATH}') 91 | print(f"vocab size {len(vocab_dic)}") 92 | return vocab_dic 93 | 94 | def build_label2id(self, save=False): 95 | if os.path.exists(L2I_SAVE_PATH): 96 | with open(L2I_SAVE_PATH, 'rb') as f: 97 | l2i_dic = pkl.load(f) 98 | i2l_dic = {} 99 | for k, n in l2i_dic.items(): 100 | i2l_dic[n] = k 101 | return l2i_dic, i2l_dic 102 | 103 | 104 | i2l_dic = OrderedDict() 105 | l2i_dic = OrderedDict() 106 | all_label_list = self.l1 + self.l2 + self.l3 107 | all_label_list = list(set(all_label_list)) 108 | for i in range(len(all_label_list)): 109 | i2l_dic[i] = all_label_list[i] 110 | l2i_dic[all_label_list[i]] = i 111 | 112 | # 是否保存 113 | if save: 114 | abs_path = L2I_SAVE_PATH.rsplit('/', 1)[0] 115 | if not os.path.exists(abs_path): 116 | os.makedirs(abs_path) 117 | with open(L2I_SAVE_PATH, 'wb') as f: 118 | pkl.dump(l2i_dic, f) 119 | print(f'label2id_dic is saved in {L2I_SAVE_PATH}') 120 | 121 | return l2i_dic, i2l_dic 122 | 123 | def trans_data(self, data_path, vocab_dic, label_dic): 124 | contents = [] 125 | datas, labels = self.load_data(data_path) 126 | if not labels: 127 | labels = [-1] * len(datas) 128 | for d, l in zip(datas, labels): 129 | if not d.strip(): 130 | continue 131 | wordlists = [] 132 | tokens = jieba.cut(d.strip()) 133 | for token in tokens: 134 | wordlists.append(vocab_dic.get(token, vocab_dic.get(""))) 135 | if l != -1: 136 | contents.append((wordlists, int(label_dic.get(l)))) 137 | else: 138 | contents.append((wordlists,)) 139 | return contents 140 | 141 | def load_emb(self, vocab_dic): 142 | skip = True 143 | emb_dic = {} 144 | with open(PRETRAIN_EMBEDDING_FILE, 'r', encoding='utf-8') as f: 145 | for i in f: 146 | if skip: 147 | skip = False 148 | dim = int(i.split(' ', 1)[1].strip()) 149 | continue 150 | word, embeds = i.split(' ', 1) 151 | embeds = embeds.strip().split(' ') 152 | if vocab_dic.get(word, None): 153 | emb_dic[vocab_dic[word]] = embeds 154 | emb_dics = sorted(emb_dic.items(), key=lambda x: x[0]) 155 | orignal_emb = nn.Embedding(len(vocab_dic), dim, padding_idx=len(vocab_dic)-1) 156 | emb_array = orignal_emb.weight.data.numpy() 157 | for i in emb_dics: 158 | index = i[0] 159 | weight = 
np.array(i[1], dtype=float) 160 | emb_array[index] = weight 161 | print(f'已载入预训练词向量,维度为{dim}') 162 | return torch.FloatTensor(emb_array), dim 163 | 164 | 165 | class DLDataset(Dataset): 166 | """自定义torch的dataset 167 | """ 168 | 169 | def __init__(self, contents): 170 | self.data, self.label = self.get_data_label(contents) 171 | self.len = len(self.data) 172 | 173 | def __len__(self): 174 | return self.len 175 | 176 | def __getitem__(self, index): 177 | if self.label: 178 | return { 179 | 'input_ids': self.data[index], 180 | 'label': self.label[index] 181 | } 182 | else: 183 | return {'input_ids': self.data[index]} 184 | 185 | def get_data_label(self, contents): 186 | """contents: [([xx,x,,], label?), ()] 187 | """ 188 | data = [] 189 | label = [] 190 | for i in contents: 191 | data.append(i[0]) 192 | if len(i) == 2: 193 | label.append(i[1]) 194 | return data, label 195 | 196 | class DL_Data_Excuter: 197 | def __init__(self): 198 | pass 199 | def process(self,batch_size, train_data_path='', test_data_path='', dev_data_path=''): 200 | """内部构建各个数据集的dataloader,返回词表大小和类别数量 201 | """ 202 | 203 | p = DataSetProcess(train_data_path, test_data_path, dev_data_path) 204 | self.vocab = p.build_vocab(save=True) 205 | pad_index = self.vocab[''] 206 | self.label_dic, self.i2l_dic = p.build_label2id(save=True) 207 | if len(self.label_dic) > 2: 208 | self.multi = True 209 | else: 210 | self.multi = False 211 | collater_fn = collater(pad_index) 212 | self.train_data_loader = '' 213 | self.test_data_loader = '' 214 | self.dev_data_loader = '' 215 | if train_data_path: 216 | content = p.trans_data(train_data_path, self.vocab, self.label_dic) 217 | data_set = DLDataset(content) 218 | self.train_data_loader = DataLoader( 219 | data_set, batch_size=batch_size, shuffle=True, collate_fn=collater_fn) 220 | if test_data_path: 221 | content = p.trans_data(test_data_path, self.vocab, self.label_dic) 222 | data_set = DLDataset(content) 223 | self.test_data_loader = DataLoader( 224 | data_set, batch_size=batch_size, shuffle=False, collate_fn=collater_fn) 225 | if dev_data_path: 226 | content = p.trans_data(dev_data_path, self.vocab, self.label_dic) 227 | data_set = DLDataset(content) 228 | self.dev_data_loader = DataLoader( 229 | data_set, batch_size=batch_size, shuffle=False, collate_fn=collater_fn) 230 | return len(self.vocab), len(self.label_dic) 231 | 232 | 233 | if __name__ == '__main__': 234 | d = DL_Data_Excuter() 235 | d.get_dataloader(2, '', './data/dl_data/test.csv', '') 236 | print(1) 237 | -------------------------------------------------------------------------------- /process_data_ml.py: -------------------------------------------------------------------------------- 1 | #!/usr/bin/env python 2 | # -*- encoding: utf-8 -*- 3 | ''' 4 | @File : process_data.py 5 | @Time : 2023/01/13 16:25:15 6 | @Author : Huang zh 7 | @Contact : jacob.hzh@qq.com 8 | @Version : 0.1 9 | @Desc : process data, 两件事:将label和特征分开,对label做好映射和转换;采样, 解决不平衡的问题 10 | ''' 11 | 12 | import pandas as pd 13 | from sklearn.utils import shuffle 14 | from collections import OrderedDict 15 | from imblearn.under_sampling import RandomUnderSampler 16 | 17 | class ML_Data_Excuter: 18 | def __init__(self, data_path, split_size, is_sample=False, split=True, train_data_path='', test_data_path=''): 19 | """数据处理类 20 | 21 | Args: 22 | data_path (str): 数据的路径 23 | split_size (int): 切分训练集和测试集的比例 24 | is_sample (bool, optional): 是否对数据进行采样,当数据不平衡时推荐True. Defaults to False. 25 | split (bool, optional): 是否进行训练集和测试集的切分操作. Defaults to True. 
26 | train_data_path (str, optional): 如果这个路径存在,那么默认不进行程序默认的训练集和测试集的切分,用用户已经切分好的数据. Defaults to ''. 27 | test_data_path (str, optional): 同上. Defaults to ''. 28 | """ 29 | self.train_data_path = train_data_path 30 | self.test_data_path = test_data_path 31 | if self.train_data_path and self.test_data_path: 32 | self.train_data = pd.read_csv(self.train_data_path) 33 | self.test_data = pd.read_csv(self.test_data_path) 34 | self.data = pd.concat([self.train_data, self.test_data], axis=0) 35 | self.l2i_dic, self.i2l_dic = self.create_l2i() 36 | self.label = self.data['label'] 37 | if len(set(self.label.values.tolist())) > 2: 38 | self.multi = True 39 | elif len(set(self.label.values.tolist())) == 2: 40 | self.multi = False 41 | else: 42 | print('there have only one label, must >= 2') 43 | exit(0) 44 | self.X = self.data.loc[:, self.data.columns!='label'] 45 | print('data nums: ') 46 | print(self.X.shape[0]) 47 | self.train_data_label = self.train_data['label'] 48 | self.train_data_x = self.train_data.loc[:, self.train_data.columns!='label'] 49 | self.test_data_label = self.test_data['label'] 50 | self.test_data_x = self.test_data.loc[:, self.test_data.columns!='label'] 51 | print('split train_test data:') 52 | print('train_data num:') 53 | print(self.train_data_x.shape) 54 | print('test_data num:') 55 | print(self.test_data_x.shape) 56 | else: 57 | self.split_size = split_size 58 | self.data = pd.read_csv(data_path) 59 | self.l2i_dic, self.i2l_dic = self.create_l2i() 60 | self.label = self.data['label'] 61 | if len(set(self.label.values.tolist())) > 2: 62 | self.multi = True 63 | elif len(set(self.label.values.tolist())) == 2: 64 | self.multi = False 65 | else: 66 | print('there have only one label, must >= 2') 67 | exit(0) 68 | self.X = self.data.loc[:, self.data.columns!='label'] 69 | if is_sample: 70 | self.sample() 71 | print('data nums: ') 72 | print(self.X.shape[0]) 73 | if split: 74 | self.train_test_split() 75 | 76 | def create_l2i(self): 77 | i2l_dic = OrderedDict() 78 | l2i_dic = OrderedDict() 79 | # 将label转成数字,并且生成有序字典,方便后续画confusion_matrix 80 | classes = sorted(list(set(self.data['label'].values.tolist()))) 81 | print(classes) 82 | num_classes = len(set(classes)) 83 | for i in range(num_classes): 84 | i2l_dic[i] = classes[i] 85 | l2i_dic[classes[i]] = i 86 | self.data['label'] = self.data['label'].map(l2i_dic) 87 | 88 | return l2i_dic, i2l_dic 89 | 90 | def sample(self): 91 | # 这里采用简单的随机下采样,换方法可以在这里改 92 | def get_res(): 93 | res = sorted(Counter(self.label).items()) 94 | res_ = [] 95 | for i in res: 96 | tmp = (self.i2l_dic[i[0]], i[1]) 97 | res_.append(tmp) 98 | return res_ 99 | from collections import Counter 100 | 101 | print('before sample,data nums:') 102 | print(get_res()) 103 | sample_excuter = RandomUnderSampler(random_state=96) 104 | self.X, self.label = sample_excuter.fit_resample(self.X, self.label) 105 | print('after sample,data nums:') 106 | print(get_res()) 107 | self.data = pd.concat([self.X, self.label], axis=1) 108 | 109 | def train_test_split(self): 110 | """ 111 | 这里的划分是按照每个标签的数量进行划分,确保训练集和验证集中的标签种类一致,不会出现训练集里有的标签,而测试集里没有出现过 112 | """ 113 | type_label = list(set(self.data.label.values.tolist())) 114 | test_data_index = [] 115 | for l in type_label: 116 | tmp_data = self.data[self.data['label']==l] 117 | tmp_data = shuffle(tmp_data) 118 | random_test = tmp_data.sample(frac=self.split_size, random_state=96) 119 | index_num = random_test.index.tolist() 120 | test_data_index += index_num 121 | test_data = self.data.iloc[test_data_index, :] 122 | train_data = 
self.data[~self.data.index.isin(test_data_index)] 123 | self.train_data_label = train_data['label'] 124 | self.train_data_x = train_data.loc[:, train_data.columns!='label'] 125 | self.test_data_label = test_data['label'] 126 | self.test_data_x = test_data.loc[:, test_data.columns!='label'] 127 | print('split train_test data:') 128 | print('train_data num:') 129 | print(self.train_data_x.shape) 130 | print('test_data num:') 131 | print(self.test_data_x.shape) 132 | 133 | 134 | if __name__ == '__main__': 135 | data_path = './data/processed_data.csv' 136 | data_ex = ML_Data_Excuter(data_path, 0.3, is_sample=True, split=True) 137 | print(1) -------------------------------------------------------------------------------- /process_data_pretrain.py: -------------------------------------------------------------------------------- 1 | # !usr/bin/env python 2 | # -*- coding:utf-8 -*- 3 | 4 | ''' 5 | Author : Huang zh 6 | Email : jacob.hzh@qq.com 7 | Date : 2023-03-13 15:09:48 8 | LastEditTime : 2023-03-21 19:51:13 9 | FilePath : \\codes\\process_data_pretrain.py 10 | Description : data process for pretrain method 11 | ''' 12 | 13 | 14 | from process_data_dl import DataSetProcess 15 | from trick.dynamic_padding import collater 16 | from torch.utils.data import Dataset, DataLoader 17 | from config import MAX_SEQ_LEN 18 | from transformers import AutoTokenizer 19 | 20 | 21 | class DataSetProcess_pre(DataSetProcess): 22 | def trans_data(self, data_path, label_dic): 23 | contents = [] 24 | datas, labels = self.load_data(data_path) 25 | if not labels: 26 | labels = [-1] * len(datas) 27 | for d, l in zip(datas, labels): 28 | if not d.strip(): 29 | continue 30 | if l != -1: 31 | contents.append(([d], int(label_dic.get(l)))) 32 | else: 33 | contents.append(([d],)) 34 | return contents 35 | 36 | 37 | class PREDataset(Dataset): 38 | def __init__(self, contents, tokenizer, max_seq_len): 39 | self. tokenizer = tokenizer 40 | self.max_seq_len = max_seq_len 41 | self.data, self.label = self.get_data_label(contents) 42 | self.len = len(self.data) 43 | 44 | def __len__(self): 45 | return self.len 46 | 47 | def __getitem__(self, index): 48 | #! 
todo 增量训练的数据集构造 49 | # 预测返回的label和上面不一样,因为预测是和任务有关,不是做pretrain,因此要用数据集的标签,上面要用自己的数据增量训练,所以要自己预测mask的地方 50 | tokenize_result = self.tokenizer.encode_plus(self.data[index], max_length=self.max_seq_len) 51 | if self.label: 52 | return { 53 | 'input_ids': tokenize_result["input_ids"], 54 | 'attention_mask': tokenize_result["attention_mask"], 55 | 'token_type_ids': tokenize_result["token_type_ids"], 56 | 'label': self.label[index] 57 | 58 | } 59 | return { 60 | 'input_ids': tokenize_result["input_ids"], 61 | 'attention_mask': tokenize_result["attention_mask"], 62 | 'token_type_ids': tokenize_result["token_type_ids"], 63 | } 64 | 65 | def get_data_label(self, contents): 66 | """contents: [([data], label?), ()] 67 | """ 68 | data = [] 69 | label = [] 70 | for i in contents: 71 | data.append(i[0][0]) 72 | if len(i) == 2: 73 | label.append(i[1]) 74 | return data, label 75 | 76 | 77 | class PRE_Data_Excuter: 78 | def __init__(self, model_type): 79 | self.model_type = model_type 80 | 81 | def process(self,batch_size, train_data_path='', test_data_path='', dev_data_path='', pretrain_file_path=''): 82 | self.pretrain_file_path = pretrain_file_path 83 | #* 分词器的设置,不同模型不一样的分词器 84 | # if self.model_type in ['mac_bert','bert', 'bert_wwm', 'nezha_wwm']: 85 | # from transformers import BertTokenizer 86 | # tokenizer = BertTokenizer.from_pretrained(self.pretrain_file_path) 87 | # #// 其他分词器,先不用Autotokenizer这个类 88 | # else: 89 | # print('tokenizer is null, please check model_name') 90 | # exit() 91 | tokenizer = AutoTokenizer.from_pretrained(self.pretrain_file_path) 92 | p = DataSetProcess_pre(train_data_path, test_data_path, dev_data_path) 93 | self.label_dic, self.i2l_dic = p.build_label2id(save=True) 94 | if len(self.label_dic) > 2: 95 | self.multi = True 96 | else: 97 | self.multi = False 98 | collater_fn = collater(pad_index=0, for_pretrain=True) 99 | self.train_data_loader = '' 100 | self.test_data_loader = '' 101 | self.dev_data_loader = '' 102 | if train_data_path: 103 | content = p.trans_data(train_data_path, self.label_dic) 104 | data_set = PREDataset(content,tokenizer=tokenizer, max_seq_len=MAX_SEQ_LEN) 105 | self.train_data_loader = DataLoader( 106 | data_set, batch_size=batch_size, shuffle=True, collate_fn=collater_fn) 107 | if test_data_path: 108 | content = p.trans_data(test_data_path, self.label_dic) 109 | data_set = PREDataset(content,tokenizer=tokenizer, max_seq_len=MAX_SEQ_LEN) 110 | self.test_data_loader = DataLoader( 111 | data_set, batch_size=batch_size, shuffle=False, collate_fn=collater_fn) 112 | if dev_data_path: 113 | content = p.trans_data(dev_data_path, self.label_dic) 114 | data_set = PREDataset(content,tokenizer=tokenizer, max_seq_len=MAX_SEQ_LEN) 115 | self.dev_data_loader = DataLoader( 116 | data_set, batch_size=batch_size, shuffle=False, collate_fn=collater_fn) 117 | return len(self.label_dic) 118 | 119 | -------------------------------------------------------------------------------- /requirements.txt: -------------------------------------------------------------------------------- 1 | catboost==1.1.1 2 | gensim==4.3.1 3 | imbalanced_learn==0.10.1 4 | imblearn==0.0 5 | jieba==0.42.1 6 | joblib==1.2.0 7 | matplotlib==3.6.3 8 | numpy==1.24.1 9 | pandas==1.5.2 10 | scikit_learn==1.2.2 11 | torch==1.13.1 12 | tqdm==4.64.1 13 | transformers==4.25.1 14 | xgboost==1.7.3 15 | -------------------------------------------------------------------------------- /save_model/knn.pkl: -------------------------------------------------------------------------------- 
https://raw.githubusercontent.com/hziheng/Machine-learning-project-for-text-classification/ec6a7517adaf4618148d25f9d192d76b3f747e10/save_model/knn.pkl -------------------------------------------------------------------------------- /trick/dynamic_padding.py: -------------------------------------------------------------------------------- 1 | #!/usr/bin/env python 2 | # -*- encoding: utf-8 -*- 3 | ''' 4 | @File : dynamic_padding.py 5 | @Time : 2023/02/09 10:17:47 6 | @Author : Huang zh 7 | @Contact : jacob.hzh@qq.com 8 | @Version : 0.1 9 | @Desc : 每个batch保持一个长度,而不是所有的数据保持一个长度 10 | ''' 11 | 12 | import torch 13 | 14 | class collater(): 15 | def __init__(self, pad_index, for_pretrain=False): 16 | # 如果for_pretrain=True,说明返回的数据要包含attention矩阵和token_ids矩阵 17 | self.pad_index = pad_index 18 | self.for_pretrain = for_pretrain 19 | 20 | def __call__(self, batch): 21 | # dynamic_pad 22 | input_ids, label = [], [] 23 | collate_max_len = 0 24 | attention_mask, token_type_ids = [], [] 25 | 26 | # get maxlen for a batch 27 | for data in batch: 28 | collate_max_len = max(collate_max_len, len(data['input_ids'])) 29 | 30 | for data in batch: 31 | # padding to maxlen for each data 32 | length = len(data['input_ids']) 33 | input_ids.append(data['input_ids'] + [self.pad_index] * (collate_max_len - length)) 34 | if self.for_pretrain: 35 | attention_mask.append(data['attention_mask'] + [self.pad_index] * (collate_max_len - length)) 36 | token_type_ids.append(data['token_type_ids'] + [self.pad_index] * (collate_max_len - length)) 37 | if len(data) >= 2: 38 | label.append(data['label']) 39 | input_ids = torch.tensor(input_ids, dtype=torch.long) 40 | result = {'input_ids': input_ids} 41 | if label: 42 | label = torch.tensor(label, dtype=torch.long) 43 | result['label'] = label 44 | if self.for_pretrain: 45 | attention_mask = torch.tensor(attention_mask, dtype=torch.long) 46 | token_type_ids = torch.tensor(token_type_ids, dtype=torch.long) 47 | result['attention_mask'] = attention_mask 48 | result['token_type_ids'] = token_type_ids 49 | return result 50 | 51 | -------------------------------------------------------------------------------- /trick/early_stop.py: -------------------------------------------------------------------------------- 1 | #!/usr/bin/env python 2 | # -*- encoding: utf-8 -*- 3 | ''' 4 | @File : early_stop.py 5 | @Time : 2023/02/09 16:21:34 6 | @Author : Huang zh 7 | @Contact : jacob.hzh@qq.com 8 | @Version : 0.1 9 | @Desc : 早停策略 10 | ''' 11 | 12 | # 早停策略 13 | class EarlyStopping: 14 | def __init__(self, patience=10, delta=0): 15 | self.patience = patience 16 | self.counter = 0 17 | self.best_score = None 18 | self.early_stop = False 19 | self.delta = delta 20 | 21 | def __call__(self, val_loss): 22 | score = -val_loss 23 | if self.best_score is None: 24 | self.best_score = score 25 | elif score < self.best_score + self.delta: 26 | self.counter += 1 27 | if self.counter >= self.patience: 28 | self.early_stop = True 29 | else: 30 | self.best_score = score 31 | self.counter = 0 -------------------------------------------------------------------------------- /trick/fgm_pgd_ema.py: -------------------------------------------------------------------------------- 1 | # !usr/bin/env python 2 | # -*- coding:utf-8 -*- 3 | 4 | ''' 5 | Author : Huang zh 6 | Email : jacob.hzh@qq.com 7 | Date : 2023-03-13 17:45:35 8 | LastEditTime : 2023-03-13 17:45:36 9 | FilePath : \\codes\\trick\\fgm.py 10 | Description : FGM, PGD, EMA 11 | ''' 12 | 13 | import torch 14 | 15 | 16 | class FGM: 17 | ''' 18 | 对于每个x: 19 | 
1.计算x的前向loss、反向传播得到梯度 20 | 2.根据embedding矩阵的梯度计算出r,并加到当前embedding上,相当于x+r 21 | 3.计算x+r的前向loss,反向传播得到对抗的梯度,累加到(1)的梯度上 22 | 4.将embedding恢复为(1)时的值 23 | 5.根据(3)的梯度对参数进行更新 24 | ''' 25 | 26 | def __init__(self, model): 27 | self.model = model 28 | self.backup = {} 29 | 30 | def attack(self, epsilon=0.5, emb_name='word_embeddings'): 31 | # emb_name这个参数要换成你模型中embedding的参数名 32 | for name, param in self.model.named_parameters(): 33 | if param.requires_grad and emb_name in name: 34 | self.backup[name] = param.data.clone() 35 | norm = torch.norm(param.grad) 36 | if norm != 0: 37 | r_at = epsilon * param.grad / norm 38 | param.data.add_(r_at) 39 | 40 | def restore(self, emb_name='word_embeddings'): 41 | # emb_name这个参数要换成你模型中embedding的参数名 42 | for name, param in self.model.named_parameters(): 43 | if param.requires_grad and emb_name in name: 44 | assert name in self.backup 45 | param.data = self.backup[name] 46 | self.backup = {} 47 | 48 | 49 | class PGD: 50 | def __init__(self, model, eps=1., alpha=0.3): 51 | self.model = ( 52 | model.module if hasattr(model, "module") else model 53 | ) 54 | self.eps = eps 55 | self.alpha = alpha 56 | self.emb_backup = {} 57 | self.grad_backup = {} 58 | 59 | def attack(self, emb_name='embeddings', is_first_attack=False): 60 | for name, param in self.model.named_parameters(): 61 | if param.requires_grad and emb_name in name: 62 | if is_first_attack: 63 | self.emb_backup[name] = param.data.clone() 64 | norm = torch.norm(param.grad) 65 | if norm != 0 and not torch.isnan(norm): 66 | r_at = self.alpha * param.grad / norm 67 | param.data.add_(r_at) 68 | param.data = self.project(name, param.data) 69 | 70 | def restore(self, emb_name='embeddings'): 71 | for name, param in self.model.named_parameters(): 72 | if param.requires_grad and emb_name in name: 73 | assert name in self.emb_backup 74 | param.data = self.emb_backup[name] 75 | self.emb_backup = {} 76 | 77 | def project(self, param_name, param_data): 78 | r = param_data - self.emb_backup[param_name] 79 | if torch.norm(r) > self.eps: 80 | r = self.eps * r / torch.norm(r) 81 | return self.emb_backup[param_name] + r 82 | 83 | def backup_grad(self): 84 | for name, param in self.model.named_parameters(): 85 | if param.requires_grad and param.grad is not None: 86 | self.grad_backup[name] = param.grad.clone() 87 | 88 | def restore_grad(self): 89 | for name, param in self.model.named_parameters(): 90 | if param.requires_grad and param.grad is not None: 91 | param.grad = self.grad_backup[name] 92 | 93 | 94 | class EMA: 95 | def __init__(self, model, decay): 96 | self.model = model 97 | self.decay = decay 98 | self.shadow = {} 99 | self.backup = {} 100 | 101 | def register(self): 102 | for name, param in self.model.named_parameters(): 103 | if param.requires_grad: 104 | self.shadow[name] = param.data.clone() 105 | 106 | def update(self): 107 | for name, param in self.model.named_parameters(): 108 | if param.requires_grad: 109 | assert name in self.shadow 110 | new_average = (1.0 - self.decay) * param.data + self.decay * self.shadow[name] 111 | self.shadow[name] = new_average.clone() 112 | 113 | def apply_shadow(self): 114 | for name, param in self.model.named_parameters(): 115 | if param.requires_grad: 116 | assert name in self.shadow 117 | self.backup[name] = param.data 118 | param.data = self.shadow[name] 119 | 120 | def restore(self): 121 | for name, param in self.model.named_parameters(): 122 | if param.requires_grad: 123 | assert name in self.backup 124 | param.data = self.backup[name] 125 | self.backup = {} 126 | 
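
The FGM docstring above lists the five steps of the attack; the self-contained sketch below wires them into a single training step. `ToyModel`, its sizes, and the sample tensors are made up purely for illustration. Note that `pre_model.py` rebuilds `FGM(self.model)` inside the batch loop and uses `self.model(**data).mean()` as the adversarial loss, while with heads that return raw logits you would normally build the FGM object once and reapply the task loss, as done here.

```
import torch
from torch import nn
from trick.fgm_pgd_ema import FGM


# Tiny stand-in model with a `word_embeddings` parameter so FGM's default
# emb_name matches; it is illustrative, not one of the repo's models.
class ToyModel(nn.Module):
    def __init__(self, vocab=100, dim=16, labels=2):
        super().__init__()
        self.word_embeddings = nn.Embedding(vocab, dim)
        self.fc = nn.Linear(dim, labels)

    def forward(self, input_ids):
        return self.fc(self.word_embeddings(input_ids).mean(1))


model = ToyModel()
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3)
loss_fct = nn.CrossEntropyLoss()
fgm = FGM(model)  # build once, reuse for every batch

input_ids = torch.randint(0, 100, (8, 12))
labels = torch.randint(0, 2, (8,))

# one training step, following the five steps in the docstring
loss = loss_fct(model(input_ids), labels)
loss.backward()                  # 1. gradients on the clean embeddings
fgm.attack(epsilon=0.5)          # 2. add r = eps * grad / ||grad|| to word_embeddings
loss_adv = loss_fct(model(input_ids), labels)
loss_adv.backward()              # 3. adversarial gradients accumulate onto step 1's
fgm.restore()                    # 4. restore the original embedding weights
optimizer.step()                 # 5. update with the combined gradients
optimizer.zero_grad()
```
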
--------------------------------------------------------------------------------
/trick/init_model.py:
--------------------------------------------------------------------------------
#!/usr/bin/env python
# -*- encoding: utf-8 -*-
'''
@File : init_model.py
@Time : 2023/02/09 14:26:06
@Author : Huang zh
@Contact : jacob.hzh@qq.com
@Version : 0.1
@Desc : weight initialization schemes for the dl nets
'''

import torch.nn as nn


def init_network(model, method='xavier', exclude='embedding'):
    # weight initialization: different schemes lead to different accuracy and convergence time
    # xavier is the default
    # xavier: an effective initialization method for neural networks
    # kaiming: Kaiming (He) initialization
    # normal_: initialization from a normal distribution
    for name, w in model.named_parameters():
        if exclude not in name:
            if 'weight' in name and 'layernorm' not in name:
                if method == 'xavier':
                    nn.init.xavier_normal_(w)
                elif method == 'kaiming':
                    nn.init.kaiming_normal_(w)
                else:
                    nn.init.normal_(w)
            elif 'bias' in name:
                nn.init.constant_(w, 0)
            else:
                pass
--------------------------------------------------------------------------------
/trick/set_all_seed.py:
--------------------------------------------------------------------------------
#!/usr/bin/env python
# -*- encoding: utf-8 -*-
'''
@File : set_all_seed.py
@Time : 2023/02/07 17:37:34
@Author : Huang zh
@Contact : jacob.hzh@qq.com
@Version : 0.1
@Desc : fix all random seeds
'''

import os
import torch
import numpy as np
import random
from torch.backends import cudnn

# fix every source of randomness so results are reproducible
def set_seed(seed):
    random.seed(seed)
    os.environ["PYTHONHASHSEED"] = str(seed)
    np.random.seed(seed)
    torch.manual_seed(seed)
    torch.cuda.manual_seed(seed)
    cudnn.deterministic = True
    cudnn.benchmark = False


--------------------------------------------------------------------------------
/word2vec_train.py:
--------------------------------------------------------------------------------
# !usr/bin/env python
# -*- coding:utf-8 -*-

'''
Author : Huang zh
Email : jacob.hzh@qq.com
Date : 2023-03-20 21:13:24
LastEditTime : 2023-03-20 21:21:18
FilePath : \\codes\\word2vec_train.py
Description : 
'''



import os
import pickle
import argparse
from gensim.models import word2vec, keyedvectors
from gensim.models.callbacks import CallbackAny2Vec



def pickle_read(path):
    with open(path, 'rb') as f:
        data = pickle.load(f)
    return data



# callback that logs the word2vec training loss after every epoch
class callback(CallbackAny2Vec):
    def __init__(self):
        self.epoch = 0
        self.loss_to_be_subed = 0

    def on_epoch_end(self, model):
        loss = model.get_latest_training_loss()
        loss_now = loss - self.loss_to_be_subed
        self.loss_to_be_subed = loss
        print('Loss after epoch {}: {}'.format(self.epoch, loss_now))
        self.epoch += 1

def input():
    parser = argparse.ArgumentParser()
    parser.add_argument('-i', '--input_dir', help='input dir name', default='./result_pickle_1')
    parser.add_argument('-o', '--outputfile', help='output file name', default='./result_pickle_1')
    args = parser.parse_args()
    print(args)
    return args


def word2vec_train(sentences, only_vec=False):
    if only_vec:
        if os.path.exists("w2v_vec_300.bin.gz"):
            model = keyedvectors.load_word2vec_format("w2v_vec_300.bin.gz", binary=True)
            return model
        else:
            vec_path = 'w2v_vec_300.bin.gz'
            # save model, word_vectors
            model = word2vec.Word2Vec(sentences=sentences, min_count=5, vector_size=300, epochs=100, callbacks=[callback()], compute_loss=True, workers=16)
            model.wv.save_word2vec_format(vec_path, binary=True)
            return model.wv

    else:
        if os.path.exists("w2v_model.bin"):
            model = word2vec.Word2Vec.load("w2v_model.bin")
        else:
            model = word2vec.Word2Vec(sentences=sentences, min_count=5, vector_size=300, epochs=100, callbacks=[callback()], compute_loss=True, workers=16)
            model.save("w2v_model.bin")
            model.wv.save_word2vec_format('./embed.txt')
        return model.wv



def main(args):
    # all_data_tokens = word_token(args.input_dir)
    with open('./d.pkl', 'rb') as f:
        all_data_tokens = pickle.load(f)
    print('train begin')
    model = word2vec_train(all_data_tokens, only_vec=False)
    print('train over')
    print(model.get_vector('我'))

def test_model():
    # model = word2vec.Word2Vec.load("w2v_model.bin")
    # print(model.wv.get_vector('00'))
    model = keyedvectors.load_word2vec_format('w2v_vec_300.bin.gz', binary=True)
    print(model.get_vector('我'))

if __name__ == '__main__':
    args = input()
    main(args)
    # test_model()
--------------------------------------------------------------------------------
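
word2vec_train.py saves the trained vectors (`w2v_vec_300.bin.gz` or `embed.txt`) but does not show how they are consumed downstream. The sketch below is one possible way, not code from this repository, to load the saved binary vectors into an `nn.Embedding` weight matrix; `vocab` (a token-to-index mapping) and `pad_index` are placeholder assumptions.

```
import numpy as np
import torch
import torch.nn as nn
from gensim.models import keyedvectors

def build_embedding_layer(vocab, vec_path='w2v_vec_300.bin.gz', dim=300, pad_index=0):
    wv = keyedvectors.load_word2vec_format(vec_path, binary=True)
    # random init keeps a usable vector for tokens missing from the word2vec vocabulary
    weights = np.random.normal(0, 0.1, size=(len(vocab), dim)).astype('float32')
    for token, idx in vocab.items():
        if token in wv:
            weights[idx] = wv.get_vector(token)
    weights[pad_index] = 0.0  # zero vector for the padding index
    return nn.Embedding.from_pretrained(torch.from_numpy(weights),
                                        freeze=False, padding_idx=pad_index)
```

A layer built this way can stand in for a randomly initialized embedding in the Bi-LSTM/TextCNN models under dl_algorithm/.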