├── .gitignore
├── README.md
├── __pycache__
│   ├── common.cpython-39.pyc
│   ├── config.cpython-39.pyc
│   ├── metrics.cpython-39.pyc
│   ├── model.cpython-39.pyc
│   ├── process_data_dl.cpython-39.pyc
│   ├── process_data_ml.cpython-39.pyc
│   └── process_data_pretrain.cpython-39.pyc
├── common.py
├── config.py
├── dl_algorithm
│   ├── capsules_model.py
│   ├── cnn.py
│   ├── dl_config.py
│   ├── dl_model.py
│   ├── lstm.py
│   └── transformer.py
├── logs
│   └── events.out.tfevents.1679558718.huangzihengdeMacBook-Air.local
├── main.py
├── metrics.py
├── ml_algorithm
│   └── ml_model.py
├── model.py
├── pic
│   ├── pic_dl.png
│   ├── pic_ml.png
│   ├── pretrain_pic.png
│   ├── result.png
│   ├── tensorboard.png
│   ├── test_pic.png
│   └── train_pic.png
├── pretrain_algorithm
│   ├── bert_graph.py
│   ├── deberta_graph.py
│   ├── nezha_graph.py
│   ├── pre_model.py
│   └── roberta_wwm.py
├── pretrain_model
│   └── bert_wwm
│       └── readme.txt
├── process_data_dl.py
├── process_data_ml.py
├── process_data_pretrain.py
├── requirements.txt
├── save_model
│   └── knn.pkl
├── trick
│   ├── dynamic_padding.py
│   ├── early_stop.py
│   ├── fgm_pgd_ema.py
│   ├── init_model.py
│   └── set_all_seed.py
└── word2vec_train.py
/.gitignore:
--------------------------------------------------------------------------------
1 | # .gitignore
2 | data/
3 | logs/
4 | */__pycache__
5 | visualization_data.ipynb
6 | cs.py
7 | pretrain_model/
8 | save_model/
9 | pic/
--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
1 | # Machine Learning Code Templates (Ready to Use Out of the Box)
2 |
3 |
4 | - [Machine Learning Code Templates (Ready to Use Out of the Box)](#machine-learning-code-templates-ready-to-use-out-of-the-box)
5 |   - [About](#about)
6 |     - [1. Introduction](#1-introduction)
7 |     - [2. Algorithms covered so far](#2-algorithms-covered-so-far)
8 |       - [2.1 Common machine learning algorithms](#21-common-machine-learning-algorithms)
9 |       - [2.2 Common deep learning algorithms](#22-common-deep-learning-algorithms)
10 |       - [2.3 Pretrained models](#23-pretrained-models)
11 |   - [Prerequisites](#prerequisites)
12 |     - [Environment setup](#environment-setup)
13 |   - [Usage](#usage)
14 |     - [The dataset is too large for GitHub; download it from this link and unzip it into the project root](#the-dataset-is-too-large-for-github-download-it-from-this-link-and-unzip-it-into-the-project-root)
15 |     - [3. Parameters](#3-parameters)
16 |     - [3.1 For common machine learning algorithms](#31-for-common-machine-learning-algorithms)
17 |     - [3.2 For deep neural network algorithms](#32-for-deep-neural-network-algorithms)
18 |     - [3.3 For pretrained models](#33-for-pretrained-models)
19 |     - [Note:](#note)
20 |   - [Directory overview](#directory-overview)
21 |   - [Development log](#development-log)
22 |
23 |
24 | ---
25 |
26 | ## About
27 | ### 1. Introduction
28 |
29 | > This is a library of code templates covering a range of machine learning algorithms, aimed at the downstream NLP task of text classification, both binary and multi-class. To run any of the included models on your own data, you only need to change a few arguments, such as the dataset path and the algorithm name, provided your dataset follows the same format as the sample data under data/. All hyperparameters of each algorithm live in a single xx_config.py file, which makes tuning the neural network models straightforward.
30 | ### 2. Algorithms covered so far
31 | #### 2.1 Common machine learning algorithms
32 |
33 | - Logistic Regression
34 | - KNN
35 | - Decision Tree
36 | - Random Forest
37 | - GBDT(Gradient Boosting Decision Tree)
38 | - XGBoost
39 | - Catboost
40 | - SVM
41 | - Bayes
42 | - todo...
43 |
44 |
45 | #### 2.2 Common deep learning algorithms
46 |
47 | - TextCNN
48 | - Bi-LSTM
49 | - Transformer
50 | - Capsules
51 | - todo...
52 |
53 | #### 2.3 Pretrained models
54 | - Bert_WWM
55 | - MacBert
56 | - NEZHA_WWM
57 | - RoBerta_WWM
58 | - todo...
59 | ---
60 |
61 |
62 |
63 | ## Prerequisites
64 |
65 | ### Environment setup
66 |
67 | The exact versions of the required libraries are listed in requirements.txt.
68 |
69 | - Install them with:
70 |
71 | ```
72 | pip install -r requirements.txt
73 | ```
74 |
75 |
76 |
77 | ## Usage
78 |
79 |
80 | ### The dataset is too large for GitHub; download it from [this link](https://pan.baidu.com/s/1_2qhpb4eRbraFAShSoPVhQ?pwd=c4n6) and unzip it into the project root
81 |
82 |
83 | ### 3. Parameters
84 | ***The entry point is main.py; its arguments are:***
85 |
86 | > *--data_path*: path to a complete dataset (not yet split into training and test sets)
87 | >
88 | > *--model_name*: short name of the algorithm to use; see ML_MODEL_NAME and DL_MODEL_NAME in config.py for the accepted values
89 | >
90 | > *--model_saved_path*: directory where the model is saved
91 | >
92 | > *--type_obj*: what the program should do: one of train, test, or predict
93 | >
94 | > *--train_data_path*: path to a pre-split training set
95 | >
96 | > *--test_data_path*: path to a pre-split test set
97 | >
98 | > *--dev_data_path*: path to a pre-split validation set
99 | ### 3.1 For common machine learning algorithms
100 |
101 | ***Command-line template:***
102 | ```
103 | python main.py --data_path [] --model_name [] --model_saved_path [] --type_obj []
104 | ```
105 | ***Examples***
106 |
107 | ```
108 | # train
109 | python main.py --data_path ./data/processed_data.csv --model_saved_path ./save_model/ --model_name lg --type_obj train
110 | # test
111 | python main.py --test_data_path ./data/processed_data.csv --model_saved_path ./save_model/ --model_name lg --type_obj test
112 | # predict
113 | python main.py --dev_data_path ./data/processed_data.csv --model_saved_path ./save_model/ --model_name lg --type_obj predict
114 | ```
115 |
116 | Explanation: train_data_path, test_data_path, and dev_data_path all default to empty. In that case the ML data module automatically splits the data at data_path into training and test sets at a 7:3 ratio and downsamples by default to avoid the ill effects of class imbalance. The split ratio and the downsampling switch can be changed in config.py. If train_data_path and test_data_path are specified, the data_path, split_size, and is_sample parameters are not needed.
117 |
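For reference, the defaults that control this behaviour sit in config.py; an excerpt of that file:

```
# config.py (excerpt)
SPLIT_SIZE = 0.3   # test-set fraction used when only data_path is given (7:3 split)
IS_SAMPLE = True   # whether to downsample to balance the classes
```
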
118 | ***Sample output:***
119 |
120 | 
121 |
122 | ***Result plot:***
123 |
124 | 
125 |
126 | ### 3.2 For deep neural network algorithms
127 |
128 |
129 | ***Examples***
130 |
131 | ```
132 | # training command
133 | # python main.py --model_name lstm --model_saved_path ./save_model/ --type_obj train --train_data_path ./data/dl_data/test.csv --test_data_path ./data/dl_data/dev.csv
134 | # test command
135 | # python main.py --model_name lstm --model_saved_path ./save_model/ --type_obj test --test_data_path ./data/dl_data/test.csv
136 | # predict command
137 | # python main.py --model_name lstm --model_saved_path ./save_model/ --type_obj predict --dev_data_path ./data/dl_data/dev.csv
138 | ```
139 | ***Sample output:***
140 |
141 | 
142 | ***Result plot:***
143 |
144 | The sample data is multi-class, which makes the plot cluttered, so no plot is produced for multi-class runs for now.
145 |
146 | ***Training metrics visualized over time:***
147 | 
148 |
149 | ### 3.3 For pretrained models
150 | ***Examples***
151 | ```
152 | # train
153 | # python main.py --model_name mac_bert --model_saved_path ./save_model/mac_bert --type_obj train --train_data_path ./data/dl_data/train.csv --test_data_path ./data/dl_data/test.csv --pretrain_file_path ./pretrain_model/mac_bert/
154 | # test
155 | # python main.py --model_name mac_bert --model_saved_path ./save_model/mac_bert --type_obj test --test_data_path ./data/dl_data/test.csv
156 | # predict
157 | # python main.py --model_name mac_bert --model_saved_path ./save_model/mac_bert --type_obj predict --test_data_path ./data/dl_data/dev.csv
158 | ```
159 | ***Sample output:***
160 |
161 | 
162 | ### Note:
163 | >> **Hyperparameters for the classic machine learning algorithms are set in ml_algorithm/ml_model.py.
Hyperparameters for the deep neural networks and pretrained models are set in dl_algorithm/dl_config.py.
Other global parameters are set in ./config.py.
Pretrained models downloaded from the transformers hub go under pretrain_model/.
After starting a run, open a new terminal, run `tensorboard --logdir logs`, then visit http://localhost:6006 in a browser to watch f1, loss, and learning rate change during training; search online for a TensorBoard tutorial for details.**
164 |
165 | ## Directory overview
166 |
167 | ## Development log
168 | |Machine learning algorithms added|Deep learning algorithms added|Other features added or improved|
169 | |:-|:-|:-|
170 | |1. LogisticRegression<br>2. KNN<br>3. DecisionTree<br>4. SVM|1. TextCNN<br>2. Bi-LSTM<br>3. Transformer<br>4. Capsules (ported directly from the 2017 paper, which targets image classification; used as-is for text classification it performs very poorly. A 2018 paper adapts capsule networks to text classification, which I have not yet looked into (**todo**))|1. Improved file loading (users can specify the training and test set locations)<br>2. Separated the construction of DL and ML models<br>3. Wrote the parameter file for the DL models<br>4. Made the DL datasets compatible with the shared DataLoader pipeline|
171 | |5. GaussianNB<br>6. RandomForest<br>7. GBDT<br>8. XGBoost|5. Bert_WWM<br>6. Mac_bert<br>7. NEZHA_WWM<br>8. RoBerta_WWM|5. Fixed the plt.show blocking issue: plots are shown for 1 s and then saved to the current directory<br>6. Data processing for deep learning (token-to-id conversion, vocabulary building)<br>7. Dataset class implementation<br>8. Added three kinds of model weight initialization|
172 | |9. CatBoost||9. Model training code<br>10. Model evaluation code<br>11. Added early stopping<br>12. Parameter tuning|
173 | |||13. Fixed the LSTM output bug<br>14. Finished the test and predict modules<br>15. Refactored the main function<br>16. Added loading of pretrained word embeddings|
174 | |||17. For deep learning, automatically split and downsample a single input dataset so no manual splitting or sampling is needed (**todo**)<br>18. English text classification still to be added, mainly in the tokenization step (**todo**)<br>19. Added competition tricks (FGM, PGD, EMA)<br>20. Added a competition trick: replace BERT's cls output with a weighted average of several intermediate layer embeddings, see bert_graph.py|
175 | |||21. Added comments at the key points of the code to make it easier to understand and modify<br>22. Improved the README<br>23. Added acceleration support for Apple M-series chips<br>24. Streamlined the code logic|
176 | |||25. Added training-process visualization with tensorboardX|
177 |
--------------------------------------------------------------------------------
/__pycache__/common.cpython-39.pyc:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/hziheng/Machine-learning-project-for-text-classification/ec6a7517adaf4618148d25f9d192d76b3f747e10/__pycache__/common.cpython-39.pyc
--------------------------------------------------------------------------------
/__pycache__/config.cpython-39.pyc:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/hziheng/Machine-learning-project-for-text-classification/ec6a7517adaf4618148d25f9d192d76b3f747e10/__pycache__/config.cpython-39.pyc
--------------------------------------------------------------------------------
/__pycache__/metrics.cpython-39.pyc:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/hziheng/Machine-learning-project-for-text-classification/ec6a7517adaf4618148d25f9d192d76b3f747e10/__pycache__/metrics.cpython-39.pyc
--------------------------------------------------------------------------------
/__pycache__/model.cpython-39.pyc:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/hziheng/Machine-learning-project-for-text-classification/ec6a7517adaf4618148d25f9d192d76b3f747e10/__pycache__/model.cpython-39.pyc
--------------------------------------------------------------------------------
/__pycache__/process_data_dl.cpython-39.pyc:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/hziheng/Machine-learning-project-for-text-classification/ec6a7517adaf4618148d25f9d192d76b3f747e10/__pycache__/process_data_dl.cpython-39.pyc
--------------------------------------------------------------------------------
/__pycache__/process_data_ml.cpython-39.pyc:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/hziheng/Machine-learning-project-for-text-classification/ec6a7517adaf4618148d25f9d192d76b3f747e10/__pycache__/process_data_ml.cpython-39.pyc
--------------------------------------------------------------------------------
/__pycache__/process_data_pretrain.cpython-39.pyc:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/hziheng/Machine-learning-project-for-text-classification/ec6a7517adaf4618148d25f9d192d76b3f747e10/__pycache__/process_data_pretrain.cpython-39.pyc
--------------------------------------------------------------------------------
/common.py:
--------------------------------------------------------------------------------
1 | #!/usr/bin/env python
2 | # -*- encoding: utf-8 -*-
3 | '''
4 | @File : common.py
5 | @Time : 2023/02/09 14:33:09
6 | @Author : Huang zh
7 | @Contact : jacob.hzh@qq.com
8 | @Version : 0.1
9 | @Desc : some common func
10 | '''
11 |
12 | import time
13 | import os
14 | import glob
15 | import torch
16 | from datetime import timedelta
17 |
18 | def get_time_dif(start_time):
19 |     """Return the elapsed time since start_time."""
20 | end_time = time.time()
21 | time_dif = end_time - start_time
22 | return timedelta(seconds=int(round(time_dif)))
23 |
24 |
25 | def check_args(args):
26 | args.setting_file = os.path.join(args.checkpoint_dir, args.setting_file)
27 | args.log_file = os.path.join(args.checkpoint_dir, args.log_file)
28 | os.makedirs(args.checkpoint_dir, exist_ok=True)
29 | with open(args.setting_file, 'wt') as opt_file:
30 | opt_file.write('------------ Options -------------\n')
31 | print('------------ Options -------------')
32 | for k in args.__dict__:
33 | v = args.__dict__[k]
34 | opt_file.write('%s: %s\n' % (str(k), str(v)))
35 | print('%s: %s' % (str(k), str(v)))
36 | opt_file.write('-------------- End ----------------\n')
37 | print('------------ End -------------')
38 |
39 | return args
40 |
41 |
42 | def torch_show_all_params(model, rank=0):
43 | params = list(model.parameters())
44 | k = 0
45 | for i in params:
46 | l = 1
47 | for j in i.size():
48 | l *= j
49 | k = k + l
50 | if rank == 0:
51 | print("Total param num:" + str(k))
52 |
53 |
54 | def torch_init_model(model, init_checkpoint, delete_module=False):
55 | state_dict = torch.load(init_checkpoint, map_location='cpu')
56 | state_dict_new = {}
57 |     # delete_module indicates whether the model was trained with DistributedDataParallel.
58 |     # if it was trained with PyTorch DDP, the 'module.' prefix has to be stripped from the state-dict keys
59 | if delete_module:
60 | for key in state_dict.keys():
61 | v = state_dict[key]
62 | state_dict_new[key.replace('module.', '')] = v
63 | state_dict = state_dict_new
64 | missing_keys = []
65 | unexpected_keys = []
66 | error_msgs = []
67 | # copy state_dict so _load_from_state_dict can modify it
68 | metadata = getattr(state_dict, '_metadata', None)
69 | state_dict = state_dict.copy()
70 | if metadata is not None:
71 | state_dict._metadata = metadata
72 |
73 | def load(module, prefix=''):
74 | local_metadata = {} if metadata is None else metadata.get(prefix[:-1], {})
75 |
76 | module._load_from_state_dict(
77 | state_dict, prefix, local_metadata, True, missing_keys, unexpected_keys, error_msgs)
78 | for name, child in module._modules.items():
79 | if child is not None:
80 | load(child, prefix + name + '.')
81 |
82 | load(model, prefix='' if hasattr(model, 'bert') else 'bert.')
83 |
84 |
85 | def torch_save_model(model, output_dir, scores, max_save_num=1):
86 | # Save model checkpoint
87 | if not os.path.exists(output_dir):
88 | os.makedirs(output_dir)
89 | model_to_save = model.module if hasattr(model, 'module') else model # Take care of distributed/parallel training
90 |     saved_pths = glob.glob(os.path.join(output_dir, '*.pth'))
91 | saved_pths.sort()
92 | while len(saved_pths) >= max_save_num:
93 | if os.path.exists(saved_pths[0].replace('//', '/')):
94 | os.remove(saved_pths[0].replace('//', '/'))
95 | del saved_pths[0]
96 |
97 | save_prex = "checkpoint_score"
98 | for k in scores:
99 | save_prex += ('_' + k + '-' + str(scores[k])[:6])
100 | save_prex += '.pth'
101 |
102 | torch.save(model_to_save.state_dict(),
103 | os.path.join(output_dir, save_prex))
104 | print("Saving model checkpoint to %s", output_dir)
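
# Minimal usage sketch, assuming `model` is a trained nn.Module: the call below
# would write its weights to ./save_model/ as checkpoint_score_f1-0.9132.pth and
# keep at most one older checkpoint around.
# torch_save_model(model, './save_model/', {'f1': 0.9132}, max_save_num=1)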
--------------------------------------------------------------------------------
/config.py:
--------------------------------------------------------------------------------
1 | #!/usr/bin/env python
2 | # -*- encoding: utf-8 -*-
3 | '''
4 | @File : config.py
5 | @Time : 2023/01/13 16:19:42
6 | @Author : Huang zh
7 | @Contact : jacob.hzh@qq.com
8 | @Version : 0.1
9 | @Desc : None
10 | '''
11 |
12 | ML_MODEL_NAME = ['lg', 'knn', 'dt', 'rf', 'gbdt', 'xgb', 'catboost', 'svm', 'bayes']
13 |
14 | DL_MODEL_NAME = ['lstm', 'cnn', 'transformer', 'capsules']
15 |
16 | PRE_MODEL_NAME = ['mac_bert', 'bert_wwm', 'bert', 'nezha_wwm', 'roberta_wwm']
17 |
18 | BATCH_SIZE = 8
19 |
20 | SPLIT_SIZE = 0.3
21 |
22 | IS_SAMPLE = True
23 |
24 | PIC_SAVED_PATH = './pic/' # where the result plots are saved
25 |
26 | VOCAB_MAX_SIZE = 100000 # maximum number of words in the vocabulary
27 |
28 | WORD_MIN_FREQ = 5 # minimum frequency for a word to be kept in the vocabulary
29 |
30 | VOCAB_SAVE_PATH = './data/vocab_dic.pkl' # where the vocabulary is stored
31 |
32 | L2I_SAVE_PATH = './data/label2id.pkl' # label-to-id mapping
33 |
34 | PRETRAIN_EMBEDDING_FILE = './data/embed.txt'
35 |
36 | VERBOSE = 1 # print the training result and test loss every VERBOSE epochs
37 |
38 | MAX_SEQ_LEN = 100 # maximum length allowed per text sample when using pretrained models
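
# These constants are imported directly by the rest of the code base, e.g. main.py does:
# from config import ML_MODEL_NAME, DL_MODEL_NAME, PRE_MODEL_NAME, BATCH_SIZE, SPLIT_SIZE, IS_SAMPLE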
--------------------------------------------------------------------------------
/dl_algorithm/capsules_model.py:
--------------------------------------------------------------------------------
1 | #!/usr/bin/env python
2 | # -*- encoding: utf-8 -*-
3 | '''
4 | @File : cap.py
5 | @Time : 2023/03/02 19:34:32
6 | @Author : Huang zh
7 | @Contact : jacob.hzh@qq.com
8 | @Version : 0.1
9 | @Desc    :   capsules network for text classifier
10 | '''
11 |
12 | import torch
13 | import torch.nn as nn
14 | import torch.nn.functional as F
15 |
16 |
17 | class Squash(nn.Module):
18 | def __init__(self, epsilon=1e-8):
19 | super().__init__()
20 |         # avoid division by zero
21 | self.epsilon = epsilon
22 |
23 | def forward(self, x):
24 | # x: [batch_size, nums_capsules, n_features]
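        # Squash non-linearity from the 2017 capsule-network paper:
        # v = (||s||^2 / (1 + ||s||^2)) * (s / ||s||); s2 below is ||s||^2.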
25 | s2 = (x ** 2).sum(dim=-1, keepdim=True)
26 | return (s2 / (1+s2)) * (x / torch.sqrt(s2 + self.epsilon))
27 |
28 | class Router(nn.Module):
29 | def __init__(self, in_d, out_d, iterations=3):
30 | """
31 | Args:
32 |
33 |             in_d (int): features per capsule; the paper uses 8
34 |             out_d (int): output capsule dimension, 4*4=16 in the paper
35 |             iterations (int): number of routing iterations for updating c_ij; the paper finds 3 is enough
36 | """
37 | super().__init__()
38 | self.in_d = in_d
39 | self.out_d = out_d
40 | self.iterations = iterations
41 | self.softmax = nn.Softmax(dim=1)
42 | self.squash = Squash()
43 |
44 | def forward(self, nums_caps, out_caps, x):
45 | # nums_caps (int): nums of capsules
46 | # out_caps (int): unique labels
47 | # x: [batch_size, nums_capsules, n_features]
48 | # [1152,10,8,16]*[64,1152,8] -> [64,1152,10,16]
49 |
50 | # init Wij
51 | # [1152,10,8,16]
52 | self.w = nn.Parameter(torch.randn(nums_caps, out_caps, self.in_d, self.out_d))
53 |
54 | u_hat = torch.einsum('ijnm,bin->bijm', self.w, x)
55 |
56 | # init bij --> zero [batch, nums_capsules, out_caps] [64, 1152,10]
57 | b = x.new_zeros(x.shape[0], nums_caps, out_caps)
58 | v = None
59 |
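        # Dynamic routing by agreement: coupling coefficients c = softmax(b),
        # candidate outputs s_j = sum_i c_ij * u_hat_ij, v_j = squash(s_j),
        # then b_ij is increased by the agreement <v_j, u_hat_ij>.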
60 | for i in range(self.iterations):
61 |             c = self.softmax(b)  # same shape as b: [batch, nums_caps, out_caps], e.g. [64, 1152, 10]
62 | s = torch.einsum('bij,bijm->bjm', c, u_hat)
63 | v = self.squash(s)
64 | a = torch.einsum('bjm,bijm->bij', v, u_hat)
65 | b = b + a
66 | return v
67 |
68 | class MarginLoss(nn.Module):
69 | def __init__(self, lambda_=0.5, m1=0.9, m2=0.1):
70 | super().__init__()
71 | self.m1 = m1
72 | self.m2 = m2
73 | self.lambda_ = lambda_
74 |
75 | def forward(self, v, labels):
76 |         # v: [batch_size, out_caps] squared capsule lengths (one capsule per label), as returned by capsules_model.forward
77 | # labels : [batch_size]
78 | n_labels = v.shape[1]
79 | v_norm = torch.sqrt(v) #[batch_size, out_caps]
80 | labels = torch.eye(n_labels, device=labels.device)[labels] #[batch_size, out_caps]
81 | loss = labels * F.relu(self.m1 - v_norm) + self.lambda_ * (1.0-labels) * F.relu(v_norm - self.m2)
82 | return loss.sum(dim=-1).mean()
83 |
84 | class capsules_model(nn.Module):
85 | def __init__(self, dlconfig):
86 | super().__init__()
87 | if dlconfig.embedding_pretrained == 'random':
88 | self.embedding = nn.Embedding(dlconfig.vocab_size, dlconfig.embedding_size, padding_idx=dlconfig.vocab_size-1)
89 | else:
90 | self.embedding = nn.Embedding.from_pretrained(dlconfig.embedding_matrix, freeze=False, padding_idx=dlconfig.vocab_size-1)
91 | self.in_d = dlconfig.in_d
92 | self.out_d = dlconfig.out_d
93 | self.nums_label = dlconfig.nums_label
94 | self.reshape_num = dlconfig.reshape_num
95 | self.conv1 = nn.Conv2d(1, 256, (2, dlconfig.embedding_size), stride=1, padding=dlconfig.pad_size)
96 | self.conv2 = nn.Conv2d(256, self.reshape_num * self.in_d, (2, 1), stride=2, padding=dlconfig.pad_size)
97 | self.squash = Squash()
98 | self.digit_capsules = Router(self.in_d, self.out_d, dlconfig.iter)
99 |
100 | def forward(self, data):
101 | x = self.embedding(data)
102 | x = x.unsqueeze(1)
103 | x = F.relu(self.conv1(x))
104 | x = self.conv2(x)
105 |
106 | caps = x.view(x.shape[0], self.in_d, self.reshape_num*x.shape[-1]*x.shape[-2]).permute(0, 2, 1)
107 | caps = self.digit_capsules(caps.shape[1], self.nums_label, caps)
108 |
109 | # pre = (caps ** 2).sum(-1).argmax(-1)
110 | return (caps ** 2).sum(-1)
111 |
--------------------------------------------------------------------------------
/dl_algorithm/cnn.py:
--------------------------------------------------------------------------------
1 | #!/usr/bin/env python
2 | # -*- encoding: utf-8 -*-
3 | '''
4 | @File : cnn.py
5 | @Time : 2023/01/17 15:22:31
6 | @Author : Huang zh
7 | @Contact : jacob.hzh@qq.com
8 | @Version : 0.1
9 | @Desc : cnn for textclassifier
10 | '''
11 |
12 | import torch
13 | import torch.nn as nn
14 | import torch.nn.functional as F
15 |
16 |
17 | class TextCNN(nn.Module):
18 | def __init__(self, dlconfig):
19 | super().__init__()
20 | if dlconfig.embedding_pretrained == 'random':
21 | self.embedding = nn.Embedding(dlconfig.vocab_size, dlconfig.embedding_size, padding_idx=dlconfig.vocab_size-1)
22 | else:
23 | self.embedding = nn.Embedding.from_pretrained(dlconfig.embedding_matrix, freeze=False, padding_idx=dlconfig.vocab_size-1)
24 | self.convs = nn.ModuleList(
25 | [nn.Conv2d(1, dlconfig.nums_filters, (k, dlconfig.embedding_size), stride=dlconfig.stride, padding=dlconfig.pad_size) for k in dlconfig.filter_size]
26 | )
27 | self.dropout = nn.Dropout(p=dlconfig.dropout)
28 | self.relu = nn.ReLU(inplace=True)
29 | self.fc = nn.Linear(dlconfig.nums_filters * len(dlconfig.filter_size), dlconfig.nums_label)
30 |
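    # Each convolution spans the full embedding width, so conv(x) produces one
    # feature map per filter; max-over-time pooling then keeps the strongest
    # activation per filter (the usual TextCNN recipe).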
31 | def conv_and_pool(self, x, conv):
32 | x = self.relu(conv(x))
33 | x = x.squeeze(3)
34 | x = F.max_pool1d(x, x.size(2))
35 | x = x.squeeze(2)
36 | return x
37 |
38 | def forward(self, x):
39 | x = self.embedding(x)
40 |         x = x.unsqueeze(1)  # add a channel dimension of size 1
41 | x = [self.conv_and_pool(x, conv) for conv in self.convs]
42 | x = torch.cat(x, 1)
43 | x = self.dropout(x)
44 | x = self.fc(x)
45 | return x
--------------------------------------------------------------------------------
/dl_algorithm/dl_config.py:
--------------------------------------------------------------------------------
1 | #!/usr/bin/env python
2 | # -*- encoding: utf-8 -*-
3 | '''
4 | @File : dl_config.py
5 | @Time : 2023/02/07 17:27:38
6 | @Author : Huang zh
7 | @Contact : jacob.hzh@qq.com
8 | @Version : 0.1
9 | @Desc    :   hyperparameters of the DL models; all of them are changed here
10 | '''
11 |
12 | import torch
13 | from process_data_dl import DataSetProcess
14 | from dl_algorithm.capsules_model import MarginLoss
15 | from config import PRE_MODEL_NAME
16 |
17 | class DlConfig:
18 | """
19 | model_name: LSTM, CNN, Transformer, capsules...
20 | """
21 |
22 | def __init__(self, model_name, vocab_size, label2id_nums, vocab_dict, embedding_pretrained='pretrain'):
23 | self.model_name = model_name
24 | self.train_data_path = ''
25 | self.test_data_path = ''
26 | self.dev_data_path = ''
27 | self.vocab_size = vocab_size
28 | self.nums_label = label2id_nums
29 | self.embedding_size = 200
30 | self.embedding_pretrained = embedding_pretrained # random, pretrain
31 | if self.embedding_pretrained != 'random':
32 | self.embedding_matrix, dim = DataSetProcess().load_emb(vocab_dict)
33 | self.embedding_size = dim
34 | self.device = 'cuda' if torch.cuda.is_available() else 'cpu'
35 |         # on Apple M1/M2 chips, use MPS to accelerate the code
36 | if torch.backends.mps.is_available():
37 | self.device = 'mps'
38 | print(f'use device: {self.device}')
39 | self.dropout = 0.5
40 | self.epochs = 10
41 | self.learning_rate = 3e-5
42 |         self.update_lr = True  # whether to update the learning rate dynamically with a decay schedule
43 |         self.warmup_prop = 0.1  # fraction of training steps used for learning-rate warmup
44 | self.loss_type = 'multi' # 'binary, regression, marginLoss, multi'
45 | self.judge_loss_fct()
46 | self.create_special_params()
47 |
48 | def create_special_params(self):
49 | if self.model_name == 'lstm':
50 | self.hidden_size = 128
51 |             self.nums_layer = 1  # number of stacked LSTM layers
52 | elif self.model_name == 'cnn':
53 |             self.nums_filters = 256  # number of convolution filters
54 |             self.filter_size = (2, 3, 4)  # roughly equivalent to extracting 2-gram, 3-gram, and 4-gram features
55 | self.stride = 1
56 | self.pad_size = 0
57 | elif self.model_name == 'transformer':
58 |             self.heads = 5  # make sure embedding_size is divisible by heads
59 |             self.n_layers = 2  # number of transformer layers in the encoder
60 | self.hidden = 1024
61 | self.d_model = self.embedding_size
62 | elif self.model_name == 'capsules':
63 |             # note: self.in_d * self.reshape_num must equal 256
64 | self.in_d = 8
65 | self.reshape_num = 32
66 | self.out_d = 16
67 |             self.iter = 3  # number of routing iterations for c_ij
68 | self.pad_size = 0
69 | #==============================#
70 | elif self.model_name in PRE_MODEL_NAME:
71 |             self.use_fgm = True  # whether to use FGM (Fast Gradient Method) adversarial training
72 | else:
73 | pass
74 |
75 | def judge_loss_fct(self):
76 | if self.loss_type == 'multi':
77 |             # torch.nn.CrossEntropyLoss(input, target) expects unnormalized class scores (logits), not a softmax distribution
78 |             # target holds class indices, e.g. target = [1, 3, 2]
79 | self.loss_fct = torch.nn.CrossEntropyLoss()
80 | elif self.loss_type == 'binary':
81 | self.loss_fct = torch.nn.BCELoss()
82 | elif self.loss_type == 'regression':
83 | self.loss_fct = torch.nn.MSELoss()
84 | elif self.loss_type == 'marginLoss':
85 | self.loss_fct = MarginLoss()
86 | else:
87 |             #! define a custom loss function here
88 | pass
89 |
90 |
91 |
--------------------------------------------------------------------------------
/dl_algorithm/dl_model.py:
--------------------------------------------------------------------------------
1 | #!/usr/bin/env python
2 | # -*- encoding: utf-8 -*-
3 | '''
4 | @File : dl_model.py
5 | @Time : 2023/02/07 19:54:48
6 | @Author : Huang zh
7 | @Contact : jacob.hzh@qq.com
8 | @Version : 0.1
9 | @Desc    :   deep net model executor
10 | '''
11 |
12 | import os
13 | import time
14 | import torch
15 | import gc
16 | import numpy as np
17 | from tqdm import tqdm
18 | from metrics import Matrix
19 | from config import DL_MODEL_NAME, VERBOSE
20 | from dl_algorithm.lstm import LSTM
21 | from dl_algorithm.cnn import TextCNN
22 | from dl_algorithm.transformer import TransformerModel
23 | from dl_algorithm.capsules_model import capsules_model
24 | from trick.init_model import init_network
25 | from trick.early_stop import EarlyStopping
26 | from common import get_time_dif
27 | from tensorboardX import SummaryWriter
28 |
29 |
30 | class DL_EXCUTER:
31 | def __init__(self, dl_config):
32 | self.dlconfig = dl_config
33 |
34 | def judge_model(self, assign_path=''):
35 | if self.dlconfig.model_name not in DL_MODEL_NAME:
36 |             print('dl model name is not supported, please see DL_MODEL_NAME in config.py')
37 | if self.dlconfig.model_name == 'lstm':
38 | self.model = LSTM(self.dlconfig)
39 | elif self.dlconfig.model_name == 'cnn':
40 | self.model = TextCNN(self.dlconfig)
41 | elif self.dlconfig.model_name == 'transformer':
42 | self.model = TransformerModel(self.dlconfig)
43 | elif self.dlconfig.model_name == 'capsules':
44 | self.model = capsules_model(self.dlconfig)
45 |         #! other models go here
46 |         else:
47 |             pass
48 |         init_network(self.model)
49 |         print('Network weights initialized (xavier by default)')
50 | self.model.to(self.dlconfig.device)
51 |
52 |
53 | def train(self, train_loader, test_loader, dev_loader, model_saved_path, model_name):
54 |         # set up the optimizer
55 |         optimizer = torch.optim.AdamW(self.model.parameters(), lr=self.dlconfig.learning_rate)
56 |         best_test_f1 = 0
57 |         # create a SummaryWriter for TensorBoard visualization
58 |         writer = SummaryWriter(logdir='./logs')
59 |         # exponential lr decay: after every epoch, lr = gamma * lr
60 |         # scheduler = torch.optim.lr_scheduler.ExponentialLR(optimizer, gamma=0.9)
61 |
62 |         # learning-rate schedule with warmup
63 | if self.dlconfig.update_lr:
64 | from transformers import get_linear_schedule_with_warmup
65 | num_warmup_steps = int(self.dlconfig.warmup_prop * self.dlconfig.epochs * len(train_loader))
66 | num_training_steps = int(self.dlconfig.epochs * len(train_loader))
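            # get_linear_schedule_with_warmup increases the lr linearly from 0 to
            # learning_rate over num_warmup_steps, then decays it linearly back
            # towards 0 by num_training_steps.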
67 | scheduler = get_linear_schedule_with_warmup(optimizer, num_warmup_steps, num_training_steps)
68 |
69 |         # early stopping
70 | early_stopping = EarlyStopping(patience = 20, delta=0)
71 |
72 | for epoch in range(self.dlconfig.epochs):
73 |             # switch to training mode
74 |             self.model.train()
75 |             # zero the gradients
76 | self.model.zero_grad()
77 | start_time = time.time()
78 | avg_loss = 0
79 | first_epoch_eval = 0
80 | for index, data in enumerate(tqdm(train_loader, ncols=100)):
81 | pred = self.model(data['input_ids'].to(self.dlconfig.device))
82 | loss = self.dlconfig.loss_fct(pred, data['label'].to(self.dlconfig.device)).mean()
83 |                 # backpropagation
84 | loss.backward()
85 | avg_loss += loss.item() / len(train_loader)
86 |
87 |                 # optimizer step
88 |                 optimizer.step()
89 |                 # update the learning rate
90 | if self.dlconfig.update_lr:
91 | scheduler.step()
92 |
93 |                 # setting grads to None instead of model.zero_grad() improves GPU utilization
94 | for param in self.model.parameters():
95 | param.grad = None
96 |
97 |             # compute elapsed time
98 |             elapsed_time = get_time_dif(start_time)
99 |             # logging interval
100 |             if (epoch + 1) % VERBOSE == 0:
101 |                 # evaluate on the test set
102 | avg_test_loss, test_f1, pred_all, true_all = self.evaluate(test_loader)
103 | elapsed_time = elapsed_time * VERBOSE
104 | if self.dlconfig.update_lr:
105 | lr = scheduler.get_last_lr()[0]
106 | else:
107 | lr = self.dlconfig.learning_rate
108 | tqdm.write(
109 | f"Epoch {epoch + 1:02d}/{self.dlconfig.epochs:02d} \t time={elapsed_time} \t"
110 | f"loss={avg_loss:.3f}\t lr={lr:.1e}",
111 | end="\t",
112 | )
113 |
114 | if (epoch + 1 >= first_epoch_eval) or (epoch + 1 == self.dlconfig.epochs):
115 | tqdm.write(f"val_loss={avg_test_loss:.3f}\ttest_f1={test_f1:.4f}")
116 | else:
117 | tqdm.write("")
118 | writer.add_scalar('Loss/train', avg_loss, epoch)
119 | writer.add_scalar('Loss/test', avg_test_loss, epoch)
120 | writer.add_scalar('F1/test', test_f1, epoch)
121 | writer.add_scalar('lr/train', lr, epoch)
122 |
123 |                 # keep the best model so far, judged by test-set f1
124 | if best_test_f1 < test_f1:
125 | best_test_f1 = test_f1
126 | tqdm.write('*'*20)
127 | self.save_model(model_saved_path, model_name)
128 | tqdm.write('new model saved')
129 | tqdm.write('*'*20)
130 |
131 | early_stopping(avg_test_loss)
132 | if early_stopping.early_stop:
133 | break
134 |         # delete the data loaders and temporary variables
135 |         del (test_loader, train_loader, loss, data, pred)
136 |         # free memory
137 | gc.collect()
138 | torch.cuda.empty_cache()
139 | writer.close()
140 |
141 |
142 | def evaluate(self, test_loader):
143 | pre_all = []
144 | true_all = []
145 |         # switch to evaluation mode
146 | self.model.eval()
147 | avg_test_loss = 0
148 | with torch.no_grad():
149 | for test_data in test_loader:
150 | pred = self.model(test_data['input_ids'].to(self.dlconfig.device))
151 | test_loss = self.dlconfig.loss_fct(pred, test_data['label'].to(self.dlconfig.device)).mean()
152 | avg_test_loss += test_loss.item() / len(test_loader)
153 | true_all.extend(test_data['label'].detach().cpu().numpy())
154 | pre_all.append(pred.softmax(-1).detach().cpu().numpy())
155 | pre_all = np.concatenate(pre_all)
156 | pre_all = np.argmax(pre_all, axis=-1)
157 | if self.dlconfig.loss_type == 'multi' or self.dlconfig.loss_type == 'marginLoss':
158 | multi = True
159 | else:
160 | multi = False
161 | matrix = Matrix(true_all, pre_all, multi=multi)
162 | return avg_test_loss, matrix.get_f1(), pre_all, true_all
163 |
164 |
165 | def predict(self, dev_loader):
166 | pre_all = []
167 | with torch.no_grad():
168 | for test_data in dev_loader:
169 | pred = self.model(test_data['input_ids'].to(self.dlconfig.device))
170 | pre_all.append(pred.softmax(-1).detach().cpu().numpy())
171 | pre_all = np.concatenate(pre_all)
172 | pre_all = np.argmax(pre_all, axis=-1)
173 | return pre_all
174 |
175 |     # save the model weights
176 | def save_model(self, path, name):
177 | if not os.path.exists(path):
178 | os.makedirs(path)
179 | output_path = os.path.join(path, name)
180 | torch.save(self.model.state_dict(), output_path)
181 | print(f'model is saved, in {str(output_path)}')
182 |
183 | def load_model(self, path, name):
184 | output_path = os.path.join(path, name)
185 | try:
186 | self.judge_model()
187 | self.model.load_state_dict(torch.load(output_path))
188 | self.model.eval()
189 |             print('model checkpoint loaded')
190 | except:
191 | print('model load error')
192 |
193 |
194 |
195 |
--------------------------------------------------------------------------------
/dl_algorithm/lstm.py:
--------------------------------------------------------------------------------
1 | #!/usr/bin/env python
2 | # -*- encoding: utf-8 -*-
3 | '''
4 | @File : lstm.py
5 | @Time : 2023/02/07 16:09:05
6 | @Author : Huang zh
7 | @Contact : jacob.hzh@qq.com
8 | @Version : 0.1
9 | @Desc : lstm for classifier
10 | '''
11 |
12 | import torch.nn as nn
13 |
14 |
15 | class LSTM(nn.Module):
16 | def __init__(self, dlconfig):
17 | super().__init__()
18 | if dlconfig.embedding_pretrained == 'random':
19 | self.embedding = nn.Embedding(dlconfig.vocab_size, dlconfig.embedding_size, padding_idx=dlconfig.vocab_size-1)
20 | else:
21 | self.embedding = nn.Embedding.from_pretrained(dlconfig.embedding_matrix, freeze=False, padding_idx=dlconfig.vocab_size-1)
22 | self.lstm = nn.LSTM(dlconfig.embedding_size, dlconfig.hidden_size, batch_first=True, bidirectional=True, dropout=dlconfig.dropout)
23 | self.fc1 = nn.Linear(dlconfig.hidden_size*2, dlconfig.hidden_size)
24 | self.fc2 = nn.Linear(dlconfig.hidden_size, dlconfig.nums_label)
25 | self.dropout = nn.Dropout(p=dlconfig.dropout)
26 | self.relu = nn.ReLU(inplace=True)
27 |
28 |
29 | def forward(self, x):
30 | x = self.embedding(x) # [batch_size, seq_len, embeding_size]
31 | x,_ = self.lstm(x)
32 | x = self.fc1(x)
33 | x = self.dropout(self.relu(x))
34 | x = self.fc2(x)
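        # classification uses only the final time step's representation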
35 | return x[:, -1, :]
36 |
37 |
--------------------------------------------------------------------------------
/dl_algorithm/transformer.py:
--------------------------------------------------------------------------------
1 | #!/usr/bin/env python
2 | # -*- encoding: utf-8 -*-
3 | '''
4 | @File : transformer.py
5 | @Time : 2023/02/16 14:58:06
6 | @Author : Huang zh
7 | @Contact : jacob.hzh@qq.com
8 | @Version : 0.1
9 | @Desc : None
10 | '''
11 |
12 | import copy
13 | import math
14 | import torch
15 | import torch.nn as nn
16 |
17 | class PrepareForMultiHeadAttention(nn.Module):
18 |     """Builds the Wq, Wk, and Wv projection matrices.
19 | """
20 | def __init__(self, d_model, heads, d_k, bias):
21 | """
22 | Args:
23 | d_model (int): dim for model : 512
24 | heads (int): nums of attention head: 8
25 | d_k (int): dim for K
26 | bias (bool): bias for linear layer
27 | """
28 | super().__init__()
29 | self.linear = nn.Linear(d_model, heads * d_k, bias=bias)
30 | self.heads = heads
31 | self.d_k = d_k
32 |
33 | def forward(self, x):
34 | # input_shape: [batch, seqlenth, d_model]
35 | head_shape = x.shape[:-1]
36 | x = self.linear(x)
37 | # reshape
38 | x = x.view(*head_shape, self.heads, self.d_k)
39 | return x
40 |
41 | class MultiHeadAtttention(nn.Module):
42 | """计算过程
43 | """
44 | def __init__(self, heads, d_model, dropout, bias=True):
45 | """
46 | Args:
47 | heads (int):
48 | d_model (int):
49 | dropout (float):
50 | bias (bool, optional): Defaults to True.
51 | """
52 | super().__init__()
53 | self.d_k = d_model // heads
54 | self.heads = heads
55 | self.query = PrepareForMultiHeadAttention(d_model, heads, self.d_k, bias=bias)
56 | self.key = PrepareForMultiHeadAttention(d_model, heads, self.d_k, bias=bias)
57 | self.value = PrepareForMultiHeadAttention(d_model, heads, self.d_k, bias=bias)
58 |         self.softmax = nn.Softmax(dim=2)  # normalize over the key dimension (dim 2 of [batch, query, key, heads])
59 | self.fc = nn.Linear(d_model, d_model)
60 | self.dropout = nn.Dropout(dropout)
61 | self.scale = 1 / math.sqrt(self.d_k)
62 |         self.attn = None  # kept mainly for plotting or debugging
63 |
64 | def get_scores(self, query, key):
65 | # return [batch, seq_len, seq_len, heads]
66 | return torch.einsum('bihd,bjhd->bijh', query, key)
67 |
68 | def prepare_mask(self, mask, query_shape, key_shape):
69 | assert mask.shape[0] == 1 or mask.shape[0] == query_shape[0]
70 | assert mask.shape[1] == key_shape[0]
71 | assert mask.shape[2] == 1 or mask.shape[2] == query_shape[1]
72 | mask = mask.unsqueeze(-1)
73 | return mask
74 |
75 | def forward(self, query, key, value, mask=None):
76 |         # in self-attention, query, key, and value are all the same x
77 | batch_size, seq_len, _ = query.shape
78 |         if mask is not None:
79 | mask = self.prepare_mask(mask, query.shape, key.shape)
80 | Q = self.query(query)
81 | K = self.key(key)
82 | V = self.value(value)
83 | scores = self.get_scores(Q, K)
84 | scores *= self.scale
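        # scaled dot-product attention: softmax(Q·K^T / sqrt(d_k)) · V,
        # computed per head over the [batch, query, key, heads] score tensor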
85 |         if mask is not None:
86 | scores = scores.masked_fill(mask==0, float('-inf'))
87 | attn = self.softmax(scores)
88 | attn = self.dropout(attn)
89 | x = torch.einsum('bijh,bjhd->bihd', attn, V)
90 | self.attn = attn.detach()
91 | x = x.reshape(batch_size, seq_len, -1)
92 | x = self.fc(x)
93 | return x
94 |
95 | class PositionalEncoding(nn.Module):
96 | def __init__(self, d_model, dropout, max_len=5000):
97 | super(PositionalEncoding, self).__init__()
98 | self.dropout = nn.Dropout(p=dropout)
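        # sinusoidal positional encoding:
        # PE(pos, 2i) = sin(pos / 10000^(2i/d_model)), PE(pos, 2i+1) = cos(pos / 10000^(2i/d_model))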
99 | pe = torch.zeros(max_len, d_model)
100 | position = torch.arange(0, max_len).unsqueeze(1)
101 | div_term = torch.exp(torch.arange(0, d_model, 2) *
102 | -(math.log(10000.0) / d_model))
103 | pe[:, 0::2] = torch.sin(position * div_term)
104 | pe[:, 1::2] = torch.cos(position * div_term)
105 | pe = pe.unsqueeze(0)
106 | self.register_buffer('pe', pe)
107 |
108 | def forward(self, x):
109 | x = x + self.pe[:, :x.size(1)].requires_grad_(False)
110 | return self.dropout(x)
111 |
112 | class FeedForward(nn.Module):
113 | """FFN module
114 | """
115 | def __init__(self, d_model: int, hidden: int,
116 | dropout: float = 0.1,
117 | activation=nn.ReLU(),
118 | is_gated: bool = False,
119 | bias1: bool = True,
120 | bias2: bool = True,
121 | bias_gate: bool = True):
122 | """
123 | * `d_model` is the number of features in a token embedding
124 | * `hidden` is the number of features in the hidden layer of the FFN
125 | * `dropout` is dropout probability for the hidden layer
126 | * `is_gated` specifies whether the hidden layer is gated
127 | * `bias1` specified whether the first fully connected layer should have a learnable bias
128 | * `bias2` specified whether the second fully connected layer should have a learnable bias
129 | * `bias_gate` specified whether the fully connected layer for the gate should have a learnable bias
130 | """
131 | super().__init__()
132 | # Layer one parameterized by weight $W_1$ and bias $b_1$
133 | self.layer1 = nn.Linear(d_model, hidden, bias=bias1)
134 |         # Layer two parameterized by weight $W_2$ and bias $b_2$
135 | self.layer2 = nn.Linear(hidden, d_model, bias=bias2)
136 | # Hidden layer dropout
137 | self.dropout = nn.Dropout(dropout)
138 | # Activation function $f$
139 | self.activation = activation
140 | # Whether there is a gate
141 | self.is_gated = is_gated
142 | if is_gated:
143 | # If there is a gate the linear layer to transform inputs to
144 | # be multiplied by the gate, parameterized by weight $V$ and bias $c$
145 | self.linear_v = nn.Linear(d_model, hidden, bias=bias_gate)
146 |
147 | def forward(self, x: torch.Tensor):
148 | # $f(x W_1 + b_1)$
149 | g = self.activation(self.layer1(x))
150 | # If gated, $f(x W_1 + b_1) \otimes (x V + b) $
151 | if self.is_gated:
152 | x = g * self.linear_v(x)
153 | # Otherwise
154 | else:
155 | x = g
156 | # Apply dropout
157 | x = self.dropout(x)
158 | # $(f(x W_1 + b_1) \otimes (x V + b)) W_2 + b_2$ or $f(x W_1 + b_1) W_2 + b_2$
159 | # depending on whether it is gated
160 | return self.layer2(x)
161 |
162 | class TransformerLayer(nn.Module):
163 | def __init__(self, d_model, self_attn, src_attn, feed_forward, dropout):
164 | """transformer layer
165 |
166 | Args:
167 | d_model (int):
168 | self_attn (): multi-head-attention layer
169 | src_attn (): multi-head-attention layer
170 |             feed_forward (): feed forward layer
171 | dropout (float): dropout prob
172 | """
173 | super().__init__()
174 | self.size = d_model
175 | self.self_attn = self_attn
176 | self.src_attn =src_attn
177 | self.feed_forward = feed_forward
178 | self.dropout = nn.Dropout(dropout)
179 | self.layernorm = nn.LayerNorm([d_model])
180 |
181 | def forward(self, x, mask):
182 | z = self.layernorm(x)
183 | self_attn = self.self_attn(z, z, z, mask) # mha
184 | x = x + self.dropout(self_attn) # add
185 | z = self.layernorm(x) # norm
186 | ff = self.feed_forward(z) # ff
187 | x = z + self.dropout(ff) # add
188 | x = self.layernorm(x) # norm
189 | return x
190 |
191 | class Encoder(nn.Module):
192 | def __init__(self, layer, n_layers):
193 | """encoder layer
194 |
195 | Args:
196 | layer (): transformer layer
197 | n_layers (int): nums of layer: default 6
198 | """
199 | super().__init__()
200 | self.layers = self.clones(layer, n_layers)
201 | self.layernorm = nn.LayerNorm([layer.size])
202 |
203 |
204 | def clones(self, layer, N):
205 | return nn.ModuleList([copy.deepcopy(layer) for _ in range(N)])
206 |
207 | def forward(self, x, mask):
208 | for l in self.layers:
209 | x = l(x, mask)
210 | return self.layernorm(x)
211 |
212 |
213 | class TransformerModel(nn.Module):
214 | def __init__(self, dlconfig):
215 | super().__init__()
216 | if dlconfig.embedding_pretrained == 'random':
217 | self.embedding = nn.Embedding(dlconfig.vocab_size, dlconfig.embedding_size, padding_idx=dlconfig.vocab_size-1)
218 | else:
219 | self.embedding = nn.Embedding.from_pretrained(dlconfig.embedding_matrix, freeze=False, padding_idx=dlconfig.vocab_size-1)
220 |
221 | self.postion_embedding = PositionalEncoding(dlconfig.d_model, dlconfig.dropout)
222 | self.transformerlayer = TransformerLayer(d_model=dlconfig.d_model,
223 | self_attn=MultiHeadAtttention(dlconfig.heads, dlconfig.d_model, dlconfig.dropout),
224 | src_attn=None,
225 | feed_forward=FeedForward(dlconfig.d_model, dlconfig.hidden, dlconfig.dropout),
226 | dropout=dlconfig.dropout)
227 |
228 | self.encoder = Encoder(self.transformerlayer, dlconfig.n_layers)
229 | self.fc1 = nn.Linear(dlconfig.embedding_size, dlconfig.nums_label)
230 |
231 | def forward(self, x):
232 | x = self.embedding(x)
233 | x = self.postion_embedding(x)
234 | x = self.encoder(x, mask=None)
235 | x = self.fc1(x)
236 | return x[:, -1, :]
237 |
--------------------------------------------------------------------------------
/logs/events.out.tfevents.1679558718.huangzihengdeMacBook-Air.local:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/hziheng/Machine-learning-project-for-text-classification/ec6a7517adaf4618148d25f9d192d76b3f747e10/logs/events.out.tfevents.1679558718.huangzihengdeMacBook-Air.local
--------------------------------------------------------------------------------
/main.py:
--------------------------------------------------------------------------------
1 | # !usr/bin/env python
2 | # -*- coding:utf-8 -*-
3 |
4 | '''
5 | Author : Huang zh
6 | Email : jacob.hzh@qq.com
7 | Date : 2023-03-09 19:27:58
8 | LastEditTime : 2023-03-23 15:16:11
9 | FilePath : \\codes\\main.py
10 | Description :
11 | '''
12 |
13 | import os
14 | os.environ["CUDA_VISIBLE_DEVICES"] = '0'
15 |
16 | import argparse
17 | import transformers
18 | from process_data_ml import ML_Data_Excuter
19 | from process_data_dl import DL_Data_Excuter
20 | from process_data_pretrain import PRE_Data_Excuter
21 | from metrics import Matrix
22 | from model import Model_Excuter
23 | from config import ML_MODEL_NAME, DL_MODEL_NAME, PRE_MODEL_NAME, BATCH_SIZE, SPLIT_SIZE, IS_SAMPLE
24 | from dl_algorithm.dl_config import DlConfig
25 | from trick.set_all_seed import set_seed
26 | import warnings
27 |
28 |
29 | warnings.filterwarnings("ignore")
30 | transformers.logging.set_verbosity_error()
31 |
32 | # def set_args():
33 | # parser = argparse.ArgumentParser()
34 | # parser.add_argument('--data_path', help='data path', default='', type=str)
35 | # parser.add_argument(
36 | # '--model_name', help='model name ex: knn', default='lg', type=str)
37 | # parser.add_argument(
38 | # '--model_saved_path', help='the path of model saved', default='./save_model/', type=str)
39 | # parser.add_argument(
40 | # '--type_obj', help='need train or test or only predict', default='test', type=str)
41 | # parser.add_argument('--train_data_path',
42 | # help='train set', default='', type=str)
43 | # parser.add_argument('--test_data_path', help='test set',
44 | # default='./data/processed_data.csv', type=str)
45 | # parser.add_argument('--dev_data_path', help='dev set',
46 | # default='', type=str)
47 | # args = parser.parse_args()
48 | # return args
49 |
50 |
51 | def set_args():
52 |     # training command
53 |     # python main.py --model_name transformer --model_saved_path ./save_model/ --type_obj train --train_data_path ./data/dl_data/test.csv --test_data_path ./data/dl_data/dev.csv
54 |     # test command
55 |     # python main.py --model_name lstm --model_saved_path ./save_model/ --type_obj test --test_data_path ./data/dl_data/test.csv
56 |     # predict command
57 |     # python main.py --model_name lstm --model_saved_path ./save_model/ --type_obj predict --dev_data_path ./data/dl_data/dev.csv
58 | parser = argparse.ArgumentParser()
59 | parser.add_argument('--data_path', help='data path', default='', type=str)
60 | parser.add_argument(
61 | '--model_name', help='model name ex: knn', default='transformer', type=str)
62 | parser.add_argument(
63 | '--model_saved_path', help='the path of model saved', default='./save_model/transformer', type=str)
64 | parser.add_argument(
65 | '--type_obj', help='need train or test or only predict', default='train', type=str)
66 | parser.add_argument('--train_data_path',
67 | help='train set', default='./data/dl_data/test.csv', type=str)
68 | parser.add_argument('--test_data_path',
69 |                         help='test set', default='./data/dl_data/dev.csv', type=str)
70 | parser.add_argument('--dev_data_path', help='dev set',
71 | default='', type=str)
72 |     parser.add_argument('--pretrain_file_path', help='path to the pretrained model files (downloaded from the transformers hub)',
73 | default='./pretrain_model/roberta_wwm/', type=str)
74 | args = parser.parse_args()
75 | return args
76 |
77 |
78 | def print_msg(metrix_ex_train, metrix_ex_test, data_ex, pic_name='pic'):
79 | if metrix_ex_train:
80 | print('train dataset:')
81 | print(f"acc: {round(metrix_ex_train.get_acc(), 4)}")
82 |         print(f"precision: {round(metrix_ex_train.get_precision(), 4)}")
83 | print(f"recall: {round(metrix_ex_train.get_recall(), 4)}")
84 | print(f"f1: {round(metrix_ex_train.get_f1(), 4)}")
85 | print('=' * 20)
86 | if metrix_ex_test:
87 | print('test dataset:')
88 | print(f"acc: {round(metrix_ex_test.get_acc(), 4)}")
89 |         print(f"precision: {round(metrix_ex_test.get_precision(), 4)}")
90 | print(f"recall: {round(metrix_ex_test.get_recall(), 4)}")
91 | print(f"f1: {round(metrix_ex_test.get_f1(), 4)}")
92 | print(metrix_ex_test.plot_confusion_matrix(data_ex.i2l_dic, pic_name))
93 |
94 |
95 | def create_me_de(args, split_size=SPLIT_SIZE, is_sample=IS_SAMPLE, split=True, batch_size=BATCH_SIZE, train_data_path='', test_data_path='', need_predict=False):
96 | if args.model_type == 'ML':
97 | data_ex = ML_Data_Excuter(args.data_path, split_size=split_size, is_sample=is_sample,
98 | split=split, train_data_path=train_data_path, test_data_path=test_data_path)
99 |         # initialize the model
100 | model_ex = Model_Excuter().init(model_name=args.model_name)
101 | if need_predict and args.type_obj == 'test':
102 | model_ex.load_model(args.model_saved_path,
103 | args.model_name + '.pkl')
104 | y_pre_test = model_ex.predict(data_ex.X)
105 | true_all = data_ex.label
106 | return data_ex, model_ex, true_all, y_pre_test
107 | elif need_predict and args.type_obj == 'predict':
108 | model_ex.load_model(args.model_saved_path,
109 | args.model_name + '.pkl')
110 | y_pre_test = model_ex.predict(data_ex.X)
111 | return data_ex, model_ex, y_pre_test
112 | elif args.model_type == 'DL':
113 | data_ex = DL_Data_Excuter()
114 | vocab_size, nums_class = data_ex.process(batch_size=batch_size,
115 | train_data_path=args.train_data_path,
116 | test_data_path=args.test_data_path,
117 | dev_data_path=args.dev_data_path)
118 | dl_config = DlConfig(args.model_name, vocab_size,
119 | nums_class, data_ex.vocab)
120 |         # initialize the model
121 | model_ex = Model_Excuter().init(dl_config=dl_config)
122 | if need_predict and args.type_obj == 'test':
123 | model_ex.load_model(args.model_saved_path,
124 | args.model_name + '.pth')
125 | _, _, y_pre_test, true_all = model_ex.evaluate(
126 | data_ex.test_data_loader)
127 | return data_ex, model_ex, true_all, y_pre_test
128 | elif need_predict and args.type_obj == 'predict':
129 | model_ex.load_model(args.model_saved_path,
130 | args.model_name + '.pth')
131 | y_pre_test = model_ex.predict(data_ex.dev_data_loader)
132 | return data_ex, model_ex, y_pre_test
133 | else:
134 | data_ex = PRE_Data_Excuter(args.model_name)
135 | nums_class = data_ex.process(batch_size=batch_size,
136 | train_data_path=args.train_data_path,
137 | test_data_path=args.test_data_path,
138 | dev_data_path=args.dev_data_path,
139 | pretrain_file_path=args.pretrain_file_path
140 | )
141 | dl_config = DlConfig(args.model_name, 0, nums_class, '', 'random')
142 |         # initialize the model
143 | model_ex = Model_Excuter().init(dl_config=dl_config)
144 | if need_predict and args.type_obj == 'test':
145 | model_ex.load_model(args.model_saved_path)
146 | _, _, y_pre_test, true_all = model_ex.evaluate(
147 | data_ex.test_data_loader)
148 | return data_ex, model_ex, true_all, y_pre_test
149 | elif need_predict and args.type_obj == 'predict':
150 | model_ex.load_model(args.model_saved_path)
151 | y_pre_test = model_ex.predict(data_ex.dev_data_loader)
152 | return data_ex, model_ex, y_pre_test
153 |
154 | return data_ex, model_ex
155 |
156 |
157 | def main(args):
158 | """
159 |     1. load the data
160 |     2. load the model
161 |     3. train the model
162 |     4. predict
163 |     5. save the model
164 | """
165 | if args.model_name in ML_MODEL_NAME:
166 | args.model_type = 'ML'
167 | elif args.model_name in DL_MODEL_NAME:
168 | args.model_type = 'DL'
169 | elif args.model_name in PRE_MODEL_NAME:
170 | args.model_type = 'PRE'
171 | else:
172 | print('model name error')
173 | exit(0)
174 |
175 | set_seed(96)
176 |
177 | if args.type_obj == 'train':
178 | data_ex, model_ex = create_me_de(args)
179 |
180 | model_ex.judge_model(args.pretrain_file_path)
181 |
182 |         # DL and ML training are handled by separate branches because their data interfaces differ
183 | if args.model_type == 'ML':
184 | model_ex.train(data_ex.train_data_x, data_ex.train_data_label)
185 |
186 | y_pre_train = model_ex.predict(data_ex.train_data_x)
187 | y_pre_test = model_ex.predict(data_ex.test_data_x)
188 |
189 | mtrix_ex_train = Matrix(
190 | data_ex.train_data_label, y_pre_train, multi=data_ex.multi)
191 | mtrix_ex_test = Matrix(
192 | data_ex.test_data_label, y_pre_test, multi=data_ex.multi)
193 | print_msg(mtrix_ex_train, mtrix_ex_test, data_ex, 'train_pic')
194 |
195 | model_ex.save_model(args.model_saved_path,
196 | args.model_name + '.pkl')
197 | elif args.model_type == 'DL':
198 | model_ex.train(data_ex.train_data_loader,
199 | data_ex.test_data_loader,
200 | data_ex.dev_data_loader,
201 | args.model_saved_path,
202 | args.model_name + '.pth')
203 | else:
204 | model_ex.dlconfig.pretrain_file_path = args.pretrain_file_path
205 | model_ex.train(data_ex.train_data_loader,
206 | data_ex.test_data_loader,
207 | data_ex.dev_data_loader,
208 | args.model_saved_path
209 | )
210 |
211 | elif args.type_obj == 'test':
212 | args.data_path = args.test_data_path
213 | args.train_data_path, args.dev_data_path = '', ''
214 | data_ex, model_ex, true_all, y_pre_test = create_me_de(
215 | args, split_size=0, is_sample=False, split=False, need_predict=True)
216 | mtrix_ex_test = Matrix(true_all, y_pre_test, multi=data_ex.multi)
217 | print_msg(None, mtrix_ex_test, data_ex, 'test_pic')
218 |
219 | elif args.type_obj == 'predict':
220 | args.data_path = args.dev_data_path
221 | args.train_data_path, args.test_data_path = '', ''
222 | data_ex, model_ex, y_pre_test = create_me_de(
223 | args, split_size=0, is_sample=False, split=False, need_predict=True)
224 |         # data_ex.i2l_dic can map the numeric ids in y_pre_test back to text labels, if needed
225 |         #! fill in how to save the predictions here, as required
226 | else:
227 | print('please input train, test or predict in type_obj of params!')
228 | exit(0)
229 |
230 |
231 | if __name__ == '__main__':
232 | args = set_args()
233 | main(args)
234 |
--------------------------------------------------------------------------------
/metrics.py:
--------------------------------------------------------------------------------
1 | #!/usr/bin/env python
2 | # -*- encoding: utf-8 -*-
3 | '''
4 | @File : metrics.py
5 | @Time : 2023/01/15 11:35:31
6 | @Author : Huang zh
7 | @Contact : jacob.hzh@qq.com
8 | @Version : 0.1
9 | @Desc    :   a set of evaluation utilities: f1, recall, acc, precision, confusion_matrix...
10 | '''
11 |
12 | import os
13 | import numpy as np
14 | import matplotlib.pyplot as plt
15 | from matplotlib import rcParams
16 | from sklearn.metrics import accuracy_score, recall_score, f1_score, precision_score
17 | from sklearn.metrics import confusion_matrix
18 | from config import PIC_SAVED_PATH
19 |
20 |
21 | class Matrix:
22 | def __init__(self, y_true, y_pre, multi=False):
23 | self.true = y_true
24 | self.pre = y_pre
25 |         # whether this is multi-class; defaults to binary
26 |         self.multi = multi  # average can be 'micro', 'macro', or 'weighted'; with 'micro', recall, precision, and acc are identical, so 'macro' is recommended, ideally on a dataset that is no longer imbalanced
27 |
28 | def get_acc(self):
29 | return accuracy_score(self.true, self.pre)
30 |
31 | def get_recall(self):
32 | # tp / (tp + fn)
33 | if self.multi:
34 | return recall_score(self.true, self.pre, average='macro')
35 | return recall_score(self.true, self.pre)
36 |
37 | def get_precision(self):
38 | # tp / (tp + fp)
39 | if self.multi:
40 | return precision_score(self.true, self.pre, average='macro')
41 | return precision_score(self.true, self.pre)
42 |
43 | def get_f1(self):
44 | # F1 = 2 * (precision * recall) / (precision + recall)
45 | if self.multi:
46 | return f1_score(self.true, self.pre, average='macro')
47 | return f1_score(self.true, self.pre)
48 |
49 | def get_confusion_matrix(self):
50 | return confusion_matrix(self.true, self.pre)
51 |
52 | def plot_confusion_matrix(self, dic_labels, pic_name):
53 | """plot
54 |
55 | Args:
56 |             dic_labels (dict): {0: 'label1', 1: 'label2'}  # must be an ordered dict
57 | """
58 | proportion = []
59 | con_matrix = self.get_confusion_matrix()
60 | num_class = len(dic_labels)
61 | labels = [v for k, v in dic_labels.items()]
62 | for i in con_matrix:
63 | for j in i:
64 | temp = j / (np.sum(i))
65 | proportion.append(temp)
66 | pshow = []
67 | for i in proportion:
68 | pt = "%.2f%%" % (i * 100)
69 | pshow.append(pt)
70 | proportion = np.array(proportion).reshape(num_class, num_class)
71 | pshow = np.array(pshow).reshape(num_class, num_class)
72 | config = {"font.family": "Times New Roman"}
73 | rcParams.update(config)
74 |         plt.imshow(proportion, interpolation='nearest',
75 |                    cmap=plt.cm.Blues)  # render the matrix as an image
76 |         # (alternative colormaps: 'Greys', 'Purples', 'Blues', 'Greens', 'Oranges', 'Reds', 'YlOrBr', 'YlOrRd',
77 |         # 'OrRd', 'PuRd', 'RdPu', 'BuPu', 'GnBu', 'PuBu', 'YlGnBu', 'PuBuGn', 'BuGn', 'YlGn')
78 | plt.title('confusion_matrix')
79 | plt.colorbar()
80 | tick_marks = np.arange(len(labels))
81 | plt.xticks(tick_marks, labels, fontsize=12)
82 | plt.yticks(tick_marks, labels, fontsize=12)
83 | # iters = [[i,j] for i in range(len(classes)) for j in range((classes))]
84 |         # iterate over all (i, j) index pairs of the matrix
85 | iters = np.reshape([[[i, j] for j in range(num_class)]
86 | for i in range(num_class)], (con_matrix.size, 2))
87 | for i, j in iters:
88 | if (i == j):
89 |                 plt.text(j, i - 0.12, format(con_matrix[i, j]), va='center',
90 |                          ha='center', fontsize=12, color='white', weight=5)  # draw the count
91 | plt.text(j, i + 0.12, pshow[i, j], va='center',
92 | ha='center', fontsize=12, color='white')
93 | else:
94 |                 # draw the count
95 | plt.text(
96 | j, i - 0.12, format(con_matrix[i, j]), va='center', ha='center', fontsize=12)
97 | plt.text(j, i + 0.12, pshow[i, j],
98 | va='center', ha='center', fontsize=12)
99 |
100 | plt.ylabel('True label', fontsize=16)
101 | plt.xlabel('Predict label', fontsize=16)
102 | plt.tight_layout()
103 | plt.pause(1)
104 | plt.show(block=False)
105 | if not os.path.exists(PIC_SAVED_PATH):
106 | os.makedirs(PIC_SAVED_PATH)
107 | pic_name = pic_name + '.png'
108 | save_path = os.path.join(PIC_SAVED_PATH, pic_name)
109 | plt.savefig(save_path)
110 | print(f'result pic is saved in {save_path}')
111 |
112 |
113 | if __name__ == '__main__':
114 | # dic_labels = {0: 'W', 1: 'LS', 2: 'SWS', 3: 'REM', 4: 'E'}
115 | # cm = np.array([(193, 31, 0, 41, 42), (87, 1038, 32, 126, 125),
116 | # (17, 337, 862, 1, 2), (17, 70, 0, 638, 54), (1, 2, 3, 4, 5)])
117 | # matrix_excute = Matrix(None, None)
118 | # matrix_excute.plot_confusion_matrix(cm, dic_labels)
119 | y_true = np.array([0]*30 + [1]*240 + [2]*30)
120 | y_pred = np.array([0]*10 + [1]*10 + [2]*10 +
121 | [0]*40 + [1]*160 + [2]*40 +
122 | [0]*5 + [1]*5 + [2]*20)
123 | dic_labels = {0:0, 1:1, 2:2}
124 | matrix_excute = Matrix(y_true=y_true, y_pre=y_pred, multi=True)
125 | print(matrix_excute.get_acc())
126 | print(matrix_excute.get_precision())
127 | print(matrix_excute.get_recall())
128 | print(matrix_excute.get_f1())
129 |     matrix_excute.plot_confusion_matrix(dic_labels, 'demo_pic')  # 'demo_pic' is just an example file name
130 |
131 |
--------------------------------------------------------------------------------
/ml_algorithm/ml_model.py:
--------------------------------------------------------------------------------
1 | #!/usr/bin/env python
2 | # -*- encoding: utf-8 -*-
3 | '''
4 | @File : ml_model.py
5 | @Time : 2023/01/13 16:26:41
6 | @Author : Huang zh
7 | @Contact : jacob.hzh@qq.com
8 | @Version : 0.1
9 | @Desc : lg, knn, dt, rt, gbdt, xgb, catboost, svm ... etc.
10 | '''
11 |
12 | import pickle
13 | import random
14 | import os
15 | import joblib
16 | import catboost as cb
17 | from sklearn.linear_model import LogisticRegression
18 | from sklearn.tree import DecisionTreeClassifier
19 | from sklearn.svm import SVC
20 | from sklearn.naive_bayes import GaussianNB
21 | from sklearn.neighbors import KNeighborsClassifier
22 | from sklearn.ensemble import RandomForestClassifier
23 | from sklearn.ensemble import GradientBoostingClassifier
24 | from xgboost import XGBClassifier
25 | from config import ML_MODEL_NAME
26 |
27 |
28 | class ML_EXCUTER:
29 | def __init__(self, model_name):
30 | self.model_name = model_name
31 |
32 | def judge_model(self, assign_path=''):
33 | if self.model_name not in ML_MODEL_NAME:
34 |             print('ml model name is not supported, please see ML_MODEL_NAME in config.py')
35 |
36 | if self.model_name == 'lg':
37 | model = LogisticRegression(random_state=96)
38 | elif self.model_name == 'knn':
39 | model = KNeighborsClassifier(n_neighbors=5)
40 | elif self.model_name == 'bayes':
41 | model = GaussianNB()
42 | elif self.model_name == 'svm':
43 | model = SVC(kernel='rbf')
44 | elif self.model_name == 'dt':
45 | model = DecisionTreeClassifier(random_state=96)
46 | elif self.model_name == 'rf':
47 | model = RandomForestClassifier(n_estimators=100, random_state=96)
48 | elif self.model_name == 'gbdt':
49 | model = GradientBoostingClassifier(
50 | learning_rate=0.1, n_estimators=100, random_state=96)
51 | elif self.model_name == 'xgb':
52 | model = XGBClassifier(learning_rate=0.1,
53 | # n_estimators
54 | # meaning: total number of boosting rounds, i.e. the number of trees
55 | n_estimators=1000,
56 | # max_depth
57 | # meaning: tree depth; default 6, typical values 3-10.
58 | max_depth=6,
59 | # min_child_weight
60 | # tuning: larger values underfit more easily; smaller values overfit more easily
61 | # (a larger value keeps the model from fitting locally peculiar samples).
62 | min_child_weight=1,
63 | # penalty term: the minimum loss reduction required to make a node split.
64 | gamma=0,
65 | # subsample
66 | # meaning: fraction of the training set sampled when building each tree.
67 | # default 1, typical values 0.5-1.
68 | subsample=0.8,
69 | # colsample_bytree
70 | # meaning: fraction of features sampled when building each tree; default 1, typical values 0.5-1.
71 | colsample_bytree=0.8,
72 | # objective: the learning objective
73 | # multi:softmax with num_class=n returns the predicted class
74 | # binary:logistic: logistic regression for binary classification, outputs probabilities; binary:hinge: hinge loss for binary classification, predicting 0 or 1 instead of probabilities.
75 | objective='multi:softmax',
76 | num_class=3,
77 | # scale_pos_weight
78 | # weight of positive samples; in binary classification with an imbalanced class ratio, setting it improves the model, e.g. with a 1:10 positive:negative ratio use scale_pos_weight=10
79 | scale_pos_weight=1,
80 | random_state=96
81 | )
82 | # for xgb parameter tuning, see this article: https://zhuanlan.zhihu.com/p/143009353
83 | elif self.model_name == 'catboost':
84 | # for detailed tuning and GPU training, see: http://t.zoukankan.com/webRobot-p-9249906.html
85 | model = cb.CatBoostClassifier(iterations=500,
86 | learning_rate=0.1,
87 | max_depth=6,
88 | verbose=100,
89 | early_stopping_rounds=500,
90 | loss_function='Logloss',
91 | task_type='CPU', # 'GPU'
92 | random_seed=96,
93 | one_hot_max_size=2
94 | )
95 |
96 | else:
97 | raise ValueError(f'unsupported ml model name: {self.model_name}')
98 | self.model = model
99 |
100 | def train(self, x_data, y_data):
101 | self.model.fit(x_data, y_data)
102 |
103 | def predict(self, data):
104 | return self.model.predict(data)
105 |
106 | def save_model(self, path, name):
107 | if not os.path.exists(path):
108 | os.makedirs(path)
109 | output_path = os.path.join(path, name)
110 | # with open(output_path, 'wb') as f:
111 | # pickle.dump(self.model, f)
112 | joblib.dump(self.model, output_path)
113 | print(f'model is saved in {output_path}')
114 |
115 | def load_model(self, path, name):
116 | output_path = os.path.join(path, name)
117 | try:
118 | # with open(output_path, 'rb') as f:
119 | # self.model = pickle.load(f)
120 | self.model = joblib.load(output_path)
121 | print('model is loaded')
122 | except Exception:
123 | print('model load failed, check the path')
124 |
--------------------------------------------------------------------------------
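
For orientation, the following is a minimal usage sketch of the `ML_EXCUTER` class defined in `ml_algorithm/ml_model.py` above. The random feature arrays, the `'rf'` model key, and the save path are illustrative assumptions for this sketch, not files or data shipped with the repository.

```python
# Illustrative sketch only -- random features and hypothetical paths, not repo data.
import numpy as np
from ml_algorithm.ml_model import ML_EXCUTER

X_train = np.random.rand(100, 20)              # hypothetical feature matrix
y_train = np.random.randint(0, 2, size=100)    # hypothetical binary labels
X_test = np.random.rand(10, 20)

excuter = ML_EXCUTER('rf')        # any key listed in ML_MODEL_NAME (config.py)
excuter.judge_model()             # instantiates the underlying sklearn model
excuter.train(X_train, y_train)
preds = excuter.predict(X_test)

excuter.save_model('./save_model', 'rf.pkl')   # persisted with joblib
excuter.load_model('./save_model', 'rf.pkl')   # reload for later inference
```
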
/model.py:
--------------------------------------------------------------------------------
1 | #!/usr/bin/env python
2 | # -*- encoding: utf-8 -*-
3 | '''
4 | @File : model.py
5 | @Time : 2023/02/07 19:54:07
6 | @Author : Huang zh
7 | @Contact : jacob.hzh@qq.com
8 | @Version : 0.1
9 | @Desc : None
10 | '''
11 |
12 | from config import ML_MODEL_NAME, DL_MODEL_NAME, PRE_MODEL_NAME
13 | from ml_algorithm.ml_model import ML_EXCUTER
14 | from dl_algorithm.dl_model import DL_EXCUTER
15 | from pretrain_algorithm.pre_model import PRE_EXCUTER
16 |
17 | class Model_Excuter:
18 | def __init__(self):
19 | pass
20 |
21 | def init(self, model_name='', dl_config=''):
22 | if model_name in ML_MODEL_NAME:
23 | return ML_EXCUTER(model_name)
24 | elif dl_config.model_name in DL_MODEL_NAME:
25 | return DL_EXCUTER(dl_config)
26 | elif dl_config.model_name in PRE_MODEL_NAME:
27 | return PRE_EXCUTER(dl_config)
--------------------------------------------------------------------------------
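
As a quick reference, here is a minimal sketch of how the `Model_Excuter` dispatcher above is meant to be driven. The model keys shown are assumptions; the key point it illustrates is the contract that, for deep-learning or pretrained models, the object passed as `dl_config` must expose a `model_name` attribute contained in DL_MODEL_NAME or PRE_MODEL_NAME.

```python
# Illustrative sketch only -- the model keys here are assumptions, see config.py.
from model import Model_Excuter

# Classic ML branch: the plain string decides which executor comes back.
ml_excuter = Model_Excuter().init(model_name='knn')   # -> ML_EXCUTER('knn')
ml_excuter.judge_model()

# DL / pretrain branch: selection is driven by dl_config.model_name instead, so a
# fully populated config object (built from dl_algorithm/dl_config.py) must be
# passed in, e.g.:
#   dl_excuter = Model_Excuter().init(dl_config=some_dl_config)
# where some_dl_config.model_name is a key of DL_MODEL_NAME or PRE_MODEL_NAME.
```
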
/pic/pic_dl.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/hziheng/Machine-learning-project-for-text-classification/ec6a7517adaf4618148d25f9d192d76b3f747e10/pic/pic_dl.png
--------------------------------------------------------------------------------
/pic/pic_ml.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/hziheng/Machine-learning-project-for-text-classification/ec6a7517adaf4618148d25f9d192d76b3f747e10/pic/pic_ml.png
--------------------------------------------------------------------------------
/pic/pretrain_pic.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/hziheng/Machine-learning-project-for-text-classification/ec6a7517adaf4618148d25f9d192d76b3f747e10/pic/pretrain_pic.png
--------------------------------------------------------------------------------
/pic/result.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/hziheng/Machine-learning-project-for-text-classification/ec6a7517adaf4618148d25f9d192d76b3f747e10/pic/result.png
--------------------------------------------------------------------------------
/pic/tensorboard.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/hziheng/Machine-learning-project-for-text-classification/ec6a7517adaf4618148d25f9d192d76b3f747e10/pic/tensorboard.png
--------------------------------------------------------------------------------
/pic/test_pic.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/hziheng/Machine-learning-project-for-text-classification/ec6a7517adaf4618148d25f9d192d76b3f747e10/pic/test_pic.png
--------------------------------------------------------------------------------
/pic/train_pic.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/hziheng/Machine-learning-project-for-text-classification/ec6a7517adaf4618148d25f9d192d76b3f747e10/pic/train_pic.png
--------------------------------------------------------------------------------
/pretrain_algorithm/bert_graph.py:
--------------------------------------------------------------------------------
1 | #!/usr/bin/env python
2 | # -*- coding:utf-8 -*-
3 |
4 | '''
5 | Author : Huang zh
6 | Email : jacob.hzh@qq.com
7 | Date : 2023-03-12 14:39:21
8 | LastEditTime : 2023-03-14 19:20:06
9 | FilePath : \\codes\\pretrain_algorithm\\bert_graph.py
10 | Description :
11 | '''
12 |
13 |
14 | import torch
15 | import torch.nn as nn
16 | from transformers import BertPreTrainedModel, BertModel
17 |
18 |
19 | class bert_classifier(BertPreTrainedModel):
20 | '''
21 | pooler_output: shape (batch_size, hidden_size); the last-layer hidden state of the first token (cls) of the sequence,
22 | further processed by a linear layer and a Tanh activation.
23 | This output is not a very good summary of the semantic content of the input; averaging or pooling the hidden states over the whole input sequence represents a sentence better. (Here, the cls vectors of the embedding layer and
24 | of every hidden layer are additionally combined with a weighted average to represent the sentence.)
25 | '''
26 |
27 | def __init__(self, config):
28 | super().__init__(config, )
29 | config.output_hidden_states = True
30 | '''
31 | hidden_states: an optional output; to get it, set config.output_hidden_states=True. It is a tuple of 13 elements:
32 | the first element can be taken as the embedding-layer output, and the remaining 12 are the per-layer hidden states; every element has shape (batch_size, sequence_length, hidden_size).
33 | '''
34 | self.num_labels = config.num_labels
35 | self.bert = BertModel(config)
36 | self.dropout = nn.Dropout(p=0.2)
37 | self.high_dropout = nn.Dropout(p=0.5)
38 | n_weights = config.num_hidden_layers + 1  # output_hidden_states adds the embedding layer, hence the +1
39 | weights_init = torch.zeros(n_weights).float()
40 | weights_init.data[:-1] = -3
41 | self.layer_weights = torch.nn.Parameter(weights_init)
42 | self.classifier = nn.Linear(config.hidden_size, self.num_labels)
43 | self.init_weights()
44 |
45 | def forward(self, input_ids=None, attention_mask=None, token_type_ids=None, label=None,):
46 | outputs = self.bert(
47 | input_ids, attention_mask=attention_mask, token_type_ids=token_type_ids)
48 | '''
49 | BERT outputs
50 | # output[0]: last-layer hidden states (batch_size, sequence_length, hidden_size)
51 | # output[1]: last-layer hidden state of the first token (cls) (batch_size, hidden_size)
52 | # output[2]: requires output_hidden_states=True; all hidden states, the first element being the embedding output and the rest the per-layer outputs (batch_size, sequence_length, hidden_size)
53 | # output[3]: requires output_attentions=True; the attention weights of every layer, used to compute the weighted average of the self-attention heads (batch_size, layer_nums, sequence_length, sequence_length)
54 | '''
55 | hidden_layers = outputs[2]
56 | # take each layer's cls (shape: batch_size * hidden_size), apply dropout, and stack -> shape: 13 * batch_size * hidden_size
57 | cls_outputs = torch.stack(
58 | [self.dropout(layer[:, 0, :]) for layer in hidden_layers], dim=0
59 | )
60 | # weighted sum over layers -> shape: batch_size * hidden_size
61 | cls_output = (torch.softmax(self.layer_weights,
62 | dim=0).unsqueeze(-1).unsqueeze(-1) * cls_outputs).sum(0)
63 | # apply dropout to the pooled cls vector, feed it to the linear layer, repeat five times, and average to get the final output logits
64 | logits = torch.mean(
65 | torch.stack(
66 | [self.classifier(self.high_dropout(cls_output))
67 | for _ in range(5)],
68 | dim=0,
69 | ),
70 | dim=0,
71 | )
72 |
73 | return logits
74 |
--------------------------------------------------------------------------------
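
The docstrings in `bert_graph.py` describe two tricks used by `bert_classifier`: a learned, softmax-weighted average of the [CLS] vector taken from the embedding layer and every hidden layer, followed by multi-sample dropout before the classifier. The snippet below is a shape-only sketch of that pooling logic on random tensors (batch size, sequence length, and label count are arbitrary assumptions); it is not part of the repository's training loop.

```python
# Shape-only sketch of weighted [CLS] pooling + multi-sample dropout (random tensors).
import torch
import torch.nn as nn

batch_size, seq_len, hidden_size, num_labels = 4, 128, 768, 3
n_layers = 13                         # 12 transformer layers + the embedding output

# Stand-ins for the tuple returned when config.output_hidden_states=True.
hidden_layers = [torch.randn(batch_size, seq_len, hidden_size) for _ in range(n_layers)]

dropout, high_dropout = nn.Dropout(0.2), nn.Dropout(0.5)
layer_weights = nn.Parameter(torch.zeros(n_layers))          # learned mixing weights
classifier = nn.Linear(hidden_size, num_labels)

# Stack every layer's [CLS] vector: (n_layers, batch_size, hidden_size).
cls_outputs = torch.stack([dropout(layer[:, 0, :]) for layer in hidden_layers], dim=0)

# Softmax over the layer dimension, then a weighted sum: (batch_size, hidden_size).
cls_output = (torch.softmax(layer_weights, dim=0).unsqueeze(-1).unsqueeze(-1) * cls_outputs).sum(0)

# Multi-sample dropout: average five independently dropped-out classifier passes.
logits = torch.mean(torch.stack([classifier(high_dropout(cls_output)) for _ in range(5)], dim=0), dim=0)
print(logits.shape)                   # torch.Size([4, 3])
```
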
/pretrain_algorithm/deberta_graph.py:
--------------------------------------------------------------------------------
1 | # coding=utf-8
2 | # Copyright 2020 Microsoft and the Hugging Face Inc. team.
3 | #
4 | # Licensed under the Apache License, Version 2.0 (the "License");
5 | # you may not use this file except in compliance with the License.
6 | # You may obtain a copy of the License at
7 | #
8 | # http://www.apache.org/licenses/LICENSE-2.0
9 | #
10 | # Unless required by applicable law or agreed to in writing, software
11 | # distributed under the License is distributed on an "AS IS" BASIS,
12 | # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
13 | # See the License for the specific language governing permissions and
14 | # limitations under the License.
15 | """ PyTorch DeBERTa-v2 model."""
16 |
17 | from collections.abc import Sequence
18 | from typing import Optional, Tuple, Union
19 |
20 | import torch
21 | import torch.utils.checkpoint
22 | from torch import nn
23 | from torch.nn import BCEWithLogitsLoss, CrossEntropyLoss, LayerNorm, MSELoss
24 |
25 | from transformers.activations import ACT2FN
26 | from transformers.modeling_outputs import (
27 | BaseModelOutput,
28 | MaskedLMOutput,
29 | MultipleChoiceModelOutput,
30 | QuestionAnsweringModelOutput,
31 | SequenceClassifierOutput,
32 | TokenClassifierOutput,
33 | )
34 | from transformers.modeling_utils import PreTrainedModel
35 | from transformers.pytorch_utils import softmax_backward_data
36 | from transformers.utils import add_code_sample_docstrings, add_start_docstrings, add_start_docstrings_to_model_forward, \
37 | logging
38 | from transformers import DebertaV2Config
39 |
40 | logger = logging.get_logger(__name__)
41 |
42 | _CONFIG_FOR_DOC = "DebertaV2Config"
43 | _TOKENIZER_FOR_DOC = "DebertaV2Tokenizer"
44 | _CHECKPOINT_FOR_DOC = "microsoft/deberta-v2-xlarge"
45 |
46 | # Masked LM docstring
47 | _CHECKPOINT_FOR_MASKED_LM = "hf-internal-testing/tiny-random-deberta-v2"
48 | _MASKED_LM_EXPECTED_OUTPUT = "'enberry'"
49 | _MASKED_LM_EXPECTED_LOSS = "11.85"
50 |
51 | # TokenClassification docstring
52 | _CHECKPOINT_FOR_TOKEN_CLASSIFICATION = "hf-internal-testing/tiny-random-deberta-v2"
53 | _TOKEN_CLASS_EXPECTED_OUTPUT = (
54 | "['LABEL_0', 'LABEL_0', 'LABEL_1', 'LABEL_0', 'LABEL_0', 'LABEL_1', 'LABEL_0', 'LABEL_0', 'LABEL_0', 'LABEL_0',"
55 | " 'LABEL_0', 'LABEL_0']"
56 | )
57 | _TOKEN_CLASS_EXPECTED_LOSS = 0.61
58 |
59 | # QuestionAnswering docstring
60 | _CHECKPOINT_FOR_QA = "hf-internal-testing/tiny-random-deberta-v2"
61 | _QA_EXPECTED_OUTPUT = "'was Jim Henson? Jim Henson was'"
62 | _QA_EXPECTED_LOSS = 2.47
63 | _QA_TARGET_START_INDEX = 2
64 | _QA_TARGET_END_INDEX = 9
65 |
66 | # SequenceClassification docstring
67 | _CHECKPOINT_FOR_SEQUENCE_CLASSIFICATION = "hf-internal-testing/tiny-random-deberta-v2"
68 | _SEQ_CLASS_EXPECTED_OUTPUT = "'LABEL_1'"
69 | _SEQ_CLASS_EXPECTED_LOSS = "0.69"
70 |
71 | DEBERTA_V2_PRETRAINED_MODEL_ARCHIVE_LIST = [
72 | "microsoft/deberta-v2-xlarge",
73 | "microsoft/deberta-v2-xxlarge",
74 | "microsoft/deberta-v2-xlarge-mnli",
75 | "microsoft/deberta-v2-xxlarge-mnli",
76 | ]
77 |
78 |
79 | # Copied from transformers.models.deberta.modeling_deberta.ContextPooler
80 | class ContextPooler(nn.Module):
81 | def __init__(self, config):
82 | super().__init__()
83 | self.dense = nn.Linear(config.pooler_hidden_size,
84 | config.pooler_hidden_size)
85 | self.dropout = StableDropout(config.pooler_dropout)
86 | self.config = config
87 |
88 | def forward(self, hidden_states):
89 | # We "pool" the model by simply taking the hidden state corresponding
90 | # to the first token.
91 |
92 | context_token = hidden_states[:, 0]
93 | context_token = self.dropout(context_token)
94 | pooled_output = self.dense(context_token)
95 | pooled_output = ACT2FN[self.config.pooler_hidden_act](pooled_output)
96 | return pooled_output
97 |
98 | @property
99 | def output_dim(self):
100 | return self.config.hidden_size
101 |
102 |
103 | # Copied from transformers.models.deberta.modeling_deberta.XSoftmax with deberta->deberta_v2
104 | class XSoftmax(torch.autograd.Function):
105 | """
106 | Masked Softmax which is optimized for saving memory
107 |
108 | Args:
109 | input (`torch.tensor`): The input tensor that will apply softmax.
110 | mask (`torch.IntTensor`):
111 | The mask matrix where 0 indicate that element will be ignored in the softmax calculation.
112 | dim (int): The dimension that will apply softmax
113 |
114 | Example:
115 |
116 | ```python
117 | >>> import torch
118 | >>> from transformers.models.deberta_v2.modeling_deberta_v2 import XSoftmax
119 |
120 | >>> # Make a tensor
121 | >>> x = torch.randn([4, 20, 100])
122 |
123 | >>> # Create a mask
124 | >>> mask = (x > 0).int()
125 |
126 | >>> # Specify the dimension to apply softmax
127 | >>> dim = -1
128 |
129 | >>> y = XSoftmax.apply(x, mask, dim)
130 | ```"""
131 |
132 | @staticmethod
133 | def forward(self, input, mask, dim):
134 | self.dim = dim
135 | rmask = ~(mask.to(torch.bool))
136 |
137 | output = input.masked_fill(
138 | rmask, torch.tensor(torch.finfo(input.dtype).min))
139 | output = torch.softmax(output, self.dim)
140 | output.masked_fill_(rmask, 0)
141 | self.save_for_backward(output)
142 | return output
143 |
144 | @staticmethod
145 | def backward(self, grad_output):
146 | (output,) = self.saved_tensors
147 | inputGrad = softmax_backward_data(
148 | self, grad_output, output, self.dim, output)
149 | return inputGrad, None, None
150 |
151 | @staticmethod
152 | def symbolic(g, self, mask, dim):
153 | import torch.onnx.symbolic_helper as sym_help
154 | from torch.onnx.symbolic_opset9 import masked_fill, softmax
155 |
156 | mask_cast_value = g.op(
157 | "Cast", mask, to_i=sym_help.cast_pytorch_to_onnx["Long"])
158 | r_mask = g.op(
159 | "Cast",
160 | g.op("Sub", g.op("Constant", value_t=torch.tensor(
161 | 1, dtype=torch.int64)), mask_cast_value),
162 | to_i=sym_help.cast_pytorch_to_onnx["Byte"],
163 | )
164 | output = masked_fill(
165 | g, self, r_mask, g.op("Constant", value_t=torch.tensor(
166 | torch.finfo(self.type().dtype()).min))
167 | )
168 | output = softmax(g, output, dim)
169 | return masked_fill(g, output, r_mask, g.op("Constant", value_t=torch.tensor(0, dtype=torch.uint8)))
170 |
171 |
172 | # Copied from transformers.models.deberta.modeling_deberta.DropoutContext
173 | class DropoutContext(object):
174 | def __init__(self):
175 | self.dropout = 0
176 | self.mask = None
177 | self.scale = 1
178 | self.reuse_mask = True
179 |
180 |
181 | # Copied from transformers.models.deberta.modeling_deberta.get_mask
182 | def get_mask(input, local_context):
183 | if not isinstance(local_context, DropoutContext):
184 | dropout = local_context
185 | mask = None
186 | else:
187 | dropout = local_context.dropout
188 | dropout *= local_context.scale
189 | mask = local_context.mask if local_context.reuse_mask else None
190 |
191 | if dropout > 0 and mask is None:
192 | mask = (1 - torch.empty_like(input).bernoulli_(1 - dropout)).to(torch.bool)
193 |
194 | if isinstance(local_context, DropoutContext):
195 | if local_context.mask is None:
196 | local_context.mask = mask
197 |
198 | return mask, dropout
199 |
200 |
201 | # Copied from transformers.models.deberta.modeling_deberta.XDropout
202 | class XDropout(torch.autograd.Function):
203 | """Optimized dropout function to save computation and memory by using mask operation instead of multiplication."""
204 |
205 | @staticmethod
206 | def forward(ctx, input, local_ctx):
207 | mask, dropout = get_mask(input, local_ctx)
208 | ctx.scale = 1.0 / (1 - dropout)
209 | if dropout > 0:
210 | ctx.save_for_backward(mask)
211 | return input.masked_fill(mask, 0) * ctx.scale
212 | else:
213 | return input
214 |
215 | @staticmethod
216 | def backward(ctx, grad_output):
217 | if ctx.scale > 1:
218 | (mask,) = ctx.saved_tensors
219 | return grad_output.masked_fill(mask, 0) * ctx.scale, None
220 | else:
221 | return grad_output, None
222 |
223 | @staticmethod
224 | def symbolic(g: torch._C.Graph, input: torch._C.Value, local_ctx: Union[float, DropoutContext]) -> torch._C.Value:
225 | from torch.onnx import symbolic_opset12
226 |
227 | dropout_p = local_ctx
228 | if isinstance(local_ctx, DropoutContext):
229 | dropout_p = local_ctx.dropout
230 | # StableDropout only calls this function when training.
231 | train = True
232 | # TODO: We should check if the opset_version being used to export
233 | # is > 12 here, but there's no good way to do that. As-is, if the
234 | # opset_version < 12, export will fail with a CheckerError.
235 | # Once https://github.com/pytorch/pytorch/issues/78391 is fixed, do something like:
236 | # if opset_version < 12:
237 | # return torch.onnx.symbolic_opset9.dropout(g, input, dropout_p, train)
238 | return symbolic_opset12.dropout(g, input, dropout_p, train)
239 |
240 |
241 | # Copied from transformers.models.deberta.modeling_deberta.StableDropout
242 | class StableDropout(nn.Module):
243 | """
244 | Optimized dropout module for stabilizing the training
245 |
246 | Args:
247 | drop_prob (float): the dropout probabilities
248 | """
249 |
250 | def __init__(self, drop_prob):
251 | super().__init__()
252 | self.drop_prob = drop_prob
253 | self.count = 0
254 | self.context_stack = None
255 |
256 | def forward(self, x):
257 | """
258 | Call the module
259 |
260 | Args:
261 | x (`torch.tensor`): The input tensor to apply dropout
262 | """
263 | if self.training and self.drop_prob > 0:
264 | return XDropout.apply(x, self.get_context())
265 | return x
266 |
267 | def clear_context(self):
268 | self.count = 0
269 | self.context_stack = None
270 |
271 | def init_context(self, reuse_mask=True, scale=1):
272 | if self.context_stack is None:
273 | self.context_stack = []
274 | self.count = 0
275 | for c in self.context_stack:
276 | c.reuse_mask = reuse_mask
277 | c.scale = scale
278 |
279 | def get_context(self):
280 | if self.context_stack is not None:
281 | if self.count >= len(self.context_stack):
282 | self.context_stack.append(DropoutContext())
283 | ctx = self.context_stack[self.count]
284 | ctx.dropout = self.drop_prob
285 | self.count += 1
286 | return ctx
287 | else:
288 | return self.drop_prob
289 |
290 |
291 | # Copied from transformers.models.deberta.modeling_deberta.DebertaSelfOutput with DebertaLayerNorm->LayerNorm
292 | class DebertaV2SelfOutput(nn.Module):
293 | def __init__(self, config):
294 | super().__init__()
295 | self.dense = nn.Linear(config.hidden_size, config.hidden_size)
296 | self.LayerNorm = LayerNorm(config.hidden_size, config.layer_norm_eps)
297 | self.dropout = StableDropout(config.hidden_dropout_prob)
298 |
299 | def forward(self, hidden_states, input_tensor):
300 | hidden_states = self.dense(hidden_states)
301 | hidden_states = self.dropout(hidden_states)
302 | hidden_states = self.LayerNorm(hidden_states + input_tensor)
303 | return hidden_states
304 |
305 |
306 | # Copied from transformers.models.deberta.modeling_deberta.DebertaAttention with Deberta->DebertaV2
307 | class DebertaV2Attention(nn.Module):
308 | def __init__(self, config):
309 | super().__init__()
310 | self.self = DisentangledSelfAttention(config)
311 | self.output = DebertaV2SelfOutput(config)
312 | self.config = config
313 |
314 | def forward(
315 | self,
316 | hidden_states,
317 | attention_mask,
318 | output_attentions=False,
319 | query_states=None,
320 | relative_pos=None,
321 | rel_embeddings=None,
322 | ):
323 | self_output = self.self(
324 | hidden_states,
325 | attention_mask,
326 | output_attentions,
327 | query_states=query_states,
328 | relative_pos=relative_pos,
329 | rel_embeddings=rel_embeddings,
330 | )
331 | if output_attentions:
332 | self_output, att_matrix = self_output
333 | if query_states is None:
334 | query_states = hidden_states
335 | attention_output = self.output(self_output, query_states)
336 |
337 | if output_attentions:
338 | return (attention_output, att_matrix)
339 | else:
340 | return attention_output
341 |
342 |
343 | # Copied from transformers.models.bert.modeling_bert.BertIntermediate with Bert->DebertaV2
344 | class DebertaV2Intermediate(nn.Module):
345 | def __init__(self, config):
346 | super().__init__()
347 | self.dense = nn.Linear(config.hidden_size, config.intermediate_size)
348 | if isinstance(config.hidden_act, str):
349 | self.intermediate_act_fn = ACT2FN[config.hidden_act]
350 | else:
351 | self.intermediate_act_fn = config.hidden_act
352 |
353 | def forward(self, hidden_states: torch.Tensor) -> torch.Tensor:
354 | hidden_states = self.dense(hidden_states)
355 | hidden_states = self.intermediate_act_fn(hidden_states)
356 | return hidden_states
357 |
358 |
359 | # Copied from transformers.models.deberta.modeling_deberta.DebertaOutput with DebertaLayerNorm->LayerNorm
360 | class DebertaV2Output(nn.Module):
361 | def __init__(self, config):
362 | super().__init__()
363 | self.dense = nn.Linear(config.intermediate_size, config.hidden_size)
364 | self.LayerNorm = LayerNorm(config.hidden_size, config.layer_norm_eps)
365 | self.dropout = StableDropout(config.hidden_dropout_prob)
366 | self.config = config
367 |
368 | def forward(self, hidden_states, input_tensor):
369 | hidden_states = self.dense(hidden_states)
370 | hidden_states = self.dropout(hidden_states)
371 | hidden_states = self.LayerNorm(hidden_states + input_tensor)
372 | return hidden_states
373 |
374 |
375 | # Copied from transformers.models.deberta.modeling_deberta.DebertaLayer with Deberta->DebertaV2
376 | class DebertaV2Layer(nn.Module):
377 | def __init__(self, config):
378 | super().__init__()
379 | self.attention = DebertaV2Attention(config)
380 | self.intermediate = DebertaV2Intermediate(config)
381 | self.output = DebertaV2Output(config)
382 |
383 | def forward(
384 | self,
385 | hidden_states,
386 | attention_mask,
387 | query_states=None,
388 | relative_pos=None,
389 | rel_embeddings=None,
390 | output_attentions=False,
391 | ):
392 | attention_output = self.attention(
393 | hidden_states,
394 | attention_mask,
395 | output_attentions=output_attentions,
396 | query_states=query_states,
397 | relative_pos=relative_pos,
398 | rel_embeddings=rel_embeddings,
399 | )
400 | if output_attentions:
401 | attention_output, att_matrix = attention_output
402 | intermediate_output = self.intermediate(attention_output)
403 | layer_output = self.output(intermediate_output, attention_output)
404 | if output_attentions:
405 | return (layer_output, att_matrix)
406 | else:
407 | return layer_output
408 |
409 |
410 | class ConvLayer(nn.Module):
411 | def __init__(self, config):
412 | super().__init__()
413 | kernel_size = getattr(config, "conv_kernel_size", 3)
414 | groups = getattr(config, "conv_groups", 1)
415 | self.conv_act = getattr(config, "conv_act", "tanh")
416 | self.conv = nn.Conv1d(
417 | config.hidden_size, config.hidden_size, kernel_size, padding=(kernel_size - 1) // 2, groups=groups
418 | )
419 | self.LayerNorm = LayerNorm(config.hidden_size, config.layer_norm_eps)
420 | self.dropout = StableDropout(config.hidden_dropout_prob)
421 | self.config = config
422 |
423 | def forward(self, hidden_states, residual_states, input_mask):
424 | out = self.conv(hidden_states.permute(
425 | 0, 2, 1).contiguous()).permute(0, 2, 1).contiguous()
426 | rmask = (1 - input_mask).bool()
427 | out.masked_fill_(rmask.unsqueeze(-1).expand(out.size()), 0)
428 | out = ACT2FN[self.conv_act](self.dropout(out))
429 |
430 | layer_norm_input = residual_states + out
431 | output = self.LayerNorm(layer_norm_input).to(layer_norm_input)
432 |
433 | if input_mask is None:
434 | output_states = output
435 | else:
436 | if input_mask.dim() != layer_norm_input.dim():
437 | if input_mask.dim() == 4:
438 | input_mask = input_mask.squeeze(1).squeeze(1)
439 | input_mask = input_mask.unsqueeze(2)
440 |
441 | input_mask = input_mask.to(output.dtype)
442 | output_states = output * input_mask
443 |
444 | return output_states
445 |
446 |
447 | class DebertaV2Encoder(nn.Module):
448 | """Modified BertEncoder with relative position bias support"""
449 |
450 | def __init__(self, config):
451 | super().__init__()
452 |
453 | self.layer = nn.ModuleList([DebertaV2Layer(config)
454 | for _ in range(config.num_hidden_layers)])
455 | self.relative_attention = getattr(config, "relative_attention", False)
456 |
457 | if self.relative_attention:
458 | self.max_relative_positions = getattr(
459 | config, "max_relative_positions", -1)
460 | if self.max_relative_positions < 1:
461 | self.max_relative_positions = config.max_position_embeddings
462 |
463 | self.position_buckets = getattr(config, "position_buckets", -1)
464 | pos_ebd_size = self.max_relative_positions * 2
465 |
466 | if self.position_buckets > 0:
467 | pos_ebd_size = self.position_buckets * 2
468 |
469 | self.rel_embeddings = nn.Embedding(
470 | pos_ebd_size, config.hidden_size)
471 |
472 | self.norm_rel_ebd = [x.strip() for x in getattr(
473 | config, "norm_rel_ebd", "none").lower().split("|")]
474 |
475 | if "layer_norm" in self.norm_rel_ebd:
476 | self.LayerNorm = LayerNorm(
477 | config.hidden_size, config.layer_norm_eps, elementwise_affine=True)
478 |
479 | self.conv = ConvLayer(config) if getattr(
480 | config, "conv_kernel_size", 0) > 0 else None
481 | self.gradient_checkpointing = False
482 |
483 | def get_rel_embedding(self):
484 | rel_embeddings = self.rel_embeddings.weight if self.relative_attention else None
485 | if rel_embeddings is not None and ("layer_norm" in self.norm_rel_ebd):
486 | rel_embeddings = self.LayerNorm(rel_embeddings)
487 | return rel_embeddings
488 |
489 | def get_attention_mask(self, attention_mask):
490 | if attention_mask.dim() <= 2:
491 | extended_attention_mask = attention_mask.unsqueeze(1).unsqueeze(2)
492 | attention_mask = extended_attention_mask * \
493 | extended_attention_mask.squeeze(-2).unsqueeze(-1)
494 | attention_mask = attention_mask.byte()
495 | elif attention_mask.dim() == 3:
496 | attention_mask = attention_mask.unsqueeze(1)
497 |
498 | return attention_mask
499 |
500 | def get_rel_pos(self, hidden_states, query_states=None, relative_pos=None):
501 | if self.relative_attention and relative_pos is None:
502 | q = query_states.size(
503 | -2) if query_states is not None else hidden_states.size(-2)
504 | relative_pos = build_relative_position(
505 | q, hidden_states.size(-2), bucket_size=self.position_buckets, max_position=self.max_relative_positions
506 | )
507 | return relative_pos
508 |
509 | def forward(
510 | self,
511 | hidden_states,
512 | attention_mask,
513 | output_hidden_states=True,
514 | output_attentions=False,
515 | query_states=None,
516 | relative_pos=None,
517 | return_dict=True,
518 | ):
519 | if attention_mask.dim() <= 2:
520 | input_mask = attention_mask
521 | else:
522 | input_mask = (attention_mask.sum(-2) > 0).byte()
523 | attention_mask = self.get_attention_mask(attention_mask)
524 | relative_pos = self.get_rel_pos(
525 | hidden_states, query_states, relative_pos)
526 |
527 | all_hidden_states = () if output_hidden_states else None
528 | all_attentions = () if output_attentions else None
529 |
530 | if isinstance(hidden_states, Sequence):
531 | next_kv = hidden_states[0]
532 | else:
533 | next_kv = hidden_states
534 | rel_embeddings = self.get_rel_embedding()
535 | output_states = next_kv
536 | for i, layer_module in enumerate(self.layer):
537 |
538 | if output_hidden_states:
539 | all_hidden_states = all_hidden_states + (output_states,)
540 |
541 | if self.gradient_checkpointing and self.training:
542 |
543 | def create_custom_forward(module):
544 | def custom_forward(*inputs):
545 | return module(*inputs, output_attentions)
546 |
547 | return custom_forward
548 |
549 | output_states = torch.utils.checkpoint.checkpoint(
550 | create_custom_forward(layer_module),
551 | next_kv,
552 | attention_mask,
553 | query_states,
554 | relative_pos,
555 | rel_embeddings,
556 | )
557 | else:
558 | output_states = layer_module(
559 | next_kv,
560 | attention_mask,
561 | query_states=query_states,
562 | relative_pos=relative_pos,
563 | rel_embeddings=rel_embeddings,
564 | output_attentions=output_attentions,
565 | )
566 |
567 | if output_attentions:
568 | output_states, att_m = output_states
569 |
570 | if i == 0 and self.conv is not None:
571 | output_states = self.conv(
572 | hidden_states, output_states, input_mask)
573 |
574 | if query_states is not None:
575 | query_states = output_states
576 | if isinstance(hidden_states, Sequence):
577 | next_kv = hidden_states[i + 1] if i + \
578 | 1 < len(self.layer) else None
579 | else:
580 | next_kv = output_states
581 |
582 | if output_attentions:
583 | all_attentions = all_attentions + (att_m,)
584 |
585 | if output_hidden_states:
586 | all_hidden_states = all_hidden_states + (output_states,)
587 |
588 | if not return_dict:
589 | return tuple(v for v in [output_states, all_hidden_states, all_attentions] if v is not None)
590 | return BaseModelOutput(
591 | last_hidden_state=output_states, hidden_states=all_hidden_states, attentions=all_attentions
592 | )
593 |
594 |
595 | def make_log_bucket_position(relative_pos, bucket_size, max_position):
596 | sign = torch.sign(relative_pos)
597 | mid = bucket_size // 2
598 | abs_pos = torch.where(
599 | (relative_pos < mid) & (relative_pos > -mid),
600 | torch.tensor(mid - 1).type_as(relative_pos),
601 | torch.abs(relative_pos),
602 | )
603 | log_pos = (
604 | torch.ceil(torch.log(abs_pos / mid) /
605 | torch.log(torch.tensor((max_position - 1) / mid)) * (mid - 1)) + mid
606 | )
607 | bucket_pos = torch.where(
608 | abs_pos <= mid, relative_pos.type_as(log_pos), log_pos * sign)
609 | return bucket_pos
610 |
611 |
612 | def build_relative_position(query_size, key_size, bucket_size=-1, max_position=-1):
613 | """
614 | Build relative position according to the query and key
615 |
616 | We assume the absolute position of the query \\(P_q\\) ranges over (0, query_size) and the absolute position of the key
617 | \\(P_k\\) ranges over (0, key_size). The relative position from query to key is \\(R_{q \\rightarrow k} = P_q -
618 | P_k\\)
619 |
620 | Args:
621 | query_size (int): the length of query
622 | key_size (int): the length of key
623 | bucket_size (int): the size of position bucket
624 | max_position (int): the maximum allowed absolute position
625 |
626 | Return:
627 | `torch.LongTensor`: A tensor with shape [1, query_size, key_size]
628 |
629 | """
630 | q_ids = torch.arange(0, query_size)
631 | k_ids = torch.arange(0, key_size)
632 | rel_pos_ids = q_ids[:, None] - k_ids[None, :]
633 | if bucket_size > 0 and max_position > 0:
634 | rel_pos_ids = make_log_bucket_position(
635 | rel_pos_ids, bucket_size, max_position)
636 | rel_pos_ids = rel_pos_ids.to(torch.long)
637 | rel_pos_ids = rel_pos_ids[:query_size, :]
638 | rel_pos_ids = rel_pos_ids.unsqueeze(0)
639 | return rel_pos_ids
640 |
641 |
642 | @torch.jit.script
643 | # Copied from transformers.models.deberta.modeling_deberta.c2p_dynamic_expand
644 | def c2p_dynamic_expand(c2p_pos, query_layer, relative_pos):
645 | return c2p_pos.expand([query_layer.size(0), query_layer.size(1), query_layer.size(2), relative_pos.size(-1)])
646 |
647 |
648 | @torch.jit.script
649 | # Copied from transformers.models.deberta.modeling_deberta.p2c_dynamic_expand
650 | def p2c_dynamic_expand(c2p_pos, query_layer, key_layer):
651 | return c2p_pos.expand([query_layer.size(0), query_layer.size(1), key_layer.size(-2), key_layer.size(-2)])
652 |
653 |
654 | @torch.jit.script
655 | # Copied from transformers.models.deberta.modeling_deberta.pos_dynamic_expand
656 | def pos_dynamic_expand(pos_index, p2c_att, key_layer):
657 | return pos_index.expand(p2c_att.size()[:2] + (pos_index.size(-2), key_layer.size(-2)))
658 |
659 |
660 | class DisentangledSelfAttention(nn.Module):
661 | """
662 | Disentangled self-attention module
663 |
664 | Parameters:
665 | config (`DebertaV2Config`):
666 | A model config class instance with the configuration to build a new model. The schema is similar to
667 | *BertConfig*, for more details, please refer [`DebertaV2Config`]
668 |
669 | """
670 |
671 | def __init__(self, config):
672 | super().__init__()
673 | if config.hidden_size % config.num_attention_heads != 0:
674 | raise ValueError(
675 | f"The hidden size ({config.hidden_size}) is not a multiple of the number of attention "
676 | f"heads ({config.num_attention_heads})"
677 | )
678 | self.num_attention_heads = config.num_attention_heads
679 | _attention_head_size = config.hidden_size // config.num_attention_heads
680 | self.attention_head_size = getattr(
681 | config, "attention_head_size", _attention_head_size)
682 | self.all_head_size = self.num_attention_heads * self.attention_head_size
683 | self.query_proj = nn.Linear(
684 | config.hidden_size, self.all_head_size, bias=True)
685 | self.key_proj = nn.Linear(
686 | config.hidden_size, self.all_head_size, bias=True)
687 | self.value_proj = nn.Linear(
688 | config.hidden_size, self.all_head_size, bias=True)
689 |
690 | self.share_att_key = getattr(config, "share_att_key", False)
691 | self.pos_att_type = config.pos_att_type if config.pos_att_type is not None else []
692 | self.relative_attention = getattr(config, "relative_attention", False)
693 |
694 | if self.relative_attention:
695 | self.position_buckets = getattr(config, "position_buckets", -1)
696 | self.max_relative_positions = getattr(
697 | config, "max_relative_positions", -1)
698 | if self.max_relative_positions < 1:
699 | self.max_relative_positions = config.max_position_embeddings
700 | self.pos_ebd_size = self.max_relative_positions
701 | if self.position_buckets > 0:
702 | self.pos_ebd_size = self.position_buckets
703 |
704 | self.pos_dropout = StableDropout(config.hidden_dropout_prob)
705 |
706 | if not self.share_att_key:
707 | if "c2p" in self.pos_att_type:
708 | self.pos_key_proj = nn.Linear(
709 | config.hidden_size, self.all_head_size, bias=True)
710 | if "p2c" in self.pos_att_type:
711 | self.pos_query_proj = nn.Linear(
712 | config.hidden_size, self.all_head_size)
713 |
714 | self.dropout = StableDropout(config.attention_probs_dropout_prob)
715 |
716 | def transpose_for_scores(self, x, attention_heads):
717 | new_x_shape = x.size()[:-1] + (attention_heads, -1)
718 | x = x.view(new_x_shape)
719 | return x.permute(0, 2, 1, 3).contiguous().view(-1, x.size(1), x.size(-1))
720 |
721 | def forward(
722 | self,
723 | hidden_states,
724 | attention_mask,
725 | output_attentions=False,
726 | query_states=None,
727 | relative_pos=None,
728 | rel_embeddings=None,
729 | ):
730 | """
731 | Call the module
732 |
733 | Args:
734 | hidden_states (`torch.FloatTensor`):
735 | Input states to the module usually the output from previous layer, it will be the Q,K and V in
736 | *Attention(Q,K,V)*
737 |
738 | attention_mask (`torch.ByteTensor`):
739 | An attention mask matrix of shape [*B*, *N*, *N*] where *B* is the batch size, *N* is the maximum
740 | sequence length in which element [i,j] = *1* means the *i* th token in the input can attend to the *j*
741 | th token.
742 |
743 | output_attentions (`bool`, optional):
744 | Whether return the attention matrix.
745 |
746 | query_states (`torch.FloatTensor`, optional):
747 | The *Q* state in *Attention(Q,K,V)*.
748 |
749 | relative_pos (`torch.LongTensor`):
750 | The relative position encoding between the tokens in the sequence. It's of shape [*B*, *N*, *N*] with
751 | values ranging in [*-max_relative_positions*, *max_relative_positions*].
752 |
753 | rel_embeddings (`torch.FloatTensor`):
754 | The embedding of relative distances. It's a tensor of shape [\\(2 \\times
755 | \\text{max_relative_positions}\\), *hidden_size*].
756 |
757 |
758 | """
759 | if query_states is None:
760 | query_states = hidden_states
761 | query_layer = self.transpose_for_scores(
762 | self.query_proj(query_states), self.num_attention_heads)
763 | key_layer = self.transpose_for_scores(
764 | self.key_proj(hidden_states), self.num_attention_heads)
765 | value_layer = self.transpose_for_scores(
766 | self.value_proj(hidden_states), self.num_attention_heads)
767 |
768 | rel_att = None
769 | # Take the dot product between "query" and "key" to get the raw attention scores.
770 | scale_factor = 1
771 | if "c2p" in self.pos_att_type:
772 | scale_factor += 1
773 | if "p2c" in self.pos_att_type:
774 | scale_factor += 1
775 | scale = torch.sqrt(torch.tensor(query_layer.size(-1),
776 | dtype=torch.float) * scale_factor)
777 | attention_scores = torch.bmm(query_layer, key_layer.transpose(-1, -2)) / torch.tensor(
778 | scale, dtype=query_layer.dtype
779 | )
780 | if self.relative_attention:
781 | rel_embeddings = self.pos_dropout(rel_embeddings)
782 | rel_att = self.disentangled_attention_bias(
783 | query_layer, key_layer, relative_pos, rel_embeddings, scale_factor
784 | )
785 |
786 | if rel_att is not None:
787 | attention_scores = attention_scores + rel_att
788 | attention_scores = attention_scores
789 | attention_scores = attention_scores.view(
790 | -1, self.num_attention_heads, attention_scores.size(-2), attention_scores.size(-1)
791 | )
792 |
793 | # bsz x height x length x dimension
794 | attention_probs = XSoftmax.apply(attention_scores, attention_mask, -1)
795 | attention_probs = self.dropout(attention_probs)
796 | context_layer = torch.bmm(
797 | attention_probs.view(-1, attention_probs.size(-2),
798 | attention_probs.size(-1)), value_layer
799 | )
800 | context_layer = (
801 | context_layer.view(-1, self.num_attention_heads,
802 | context_layer.size(-2), context_layer.size(-1))
803 | .permute(0, 2, 1, 3)
804 | .contiguous()
805 | )
806 | new_context_layer_shape = context_layer.size()[:-2] + (-1,)
807 | context_layer = context_layer.view(new_context_layer_shape)
808 | if output_attentions:
809 | return (context_layer, attention_probs)
810 | else:
811 | return context_layer
812 |
813 | def disentangled_attention_bias(self, query_layer, key_layer, relative_pos, rel_embeddings, scale_factor):
814 | if relative_pos is None:
815 | q = query_layer.size(-2)
816 | relative_pos = build_relative_position(
817 | q, key_layer.size(-2), bucket_size=self.position_buckets, max_position=self.max_relative_positions
818 | )
819 | if relative_pos.dim() == 2:
820 | relative_pos = relative_pos.unsqueeze(0).unsqueeze(0)
821 | elif relative_pos.dim() == 3:
822 | relative_pos = relative_pos.unsqueeze(1)
823 | # bsz x height x query x key
824 | elif relative_pos.dim() != 4:
825 | raise ValueError(
826 | f"Relative position ids must be of dim 2 or 3 or 4. {relative_pos.dim()}")
827 |
828 | att_span = self.pos_ebd_size
829 | relative_pos = relative_pos.long().to(query_layer.device)
830 |
831 | rel_embeddings = rel_embeddings[0: att_span * 2, :].unsqueeze(0)
832 | if self.share_att_key:
833 | pos_query_layer = self.transpose_for_scores(
834 | self.query_proj(rel_embeddings), self.num_attention_heads
835 | ).repeat(query_layer.size(0) // self.num_attention_heads, 1, 1)
836 | pos_key_layer = self.transpose_for_scores(self.key_proj(rel_embeddings), self.num_attention_heads).repeat(
837 | query_layer.size(0) // self.num_attention_heads, 1, 1
838 | )
839 | else:
840 | if "c2p" in self.pos_att_type:
841 | pos_key_layer = self.transpose_for_scores(
842 | self.pos_key_proj(rel_embeddings), self.num_attention_heads
843 | ).repeat(
844 | query_layer.size(0) // self.num_attention_heads, 1, 1
845 | ) # .split(self.all_head_size, dim=-1)
846 | if "p2c" in self.pos_att_type:
847 | pos_query_layer = self.transpose_for_scores(
848 | self.pos_query_proj(
849 | rel_embeddings), self.num_attention_heads
850 | ).repeat(
851 | query_layer.size(0) // self.num_attention_heads, 1, 1
852 | ) # .split(self.all_head_size, dim=-1)
853 |
854 | score = 0
855 | # content->position
856 | if "c2p" in self.pos_att_type:
857 | scale = torch.sqrt(torch.tensor(
858 | pos_key_layer.size(-1), dtype=torch.float) * scale_factor)
859 | c2p_att = torch.bmm(query_layer, pos_key_layer.transpose(-1, -2))
860 | c2p_pos = torch.clamp(relative_pos + att_span, 0, att_span * 2 - 1)
861 | c2p_att = torch.gather(
862 | c2p_att,
863 | dim=-1,
864 | index=c2p_pos.squeeze(0).expand(
865 | [query_layer.size(0), query_layer.size(1), relative_pos.size(-1)]),
866 | )
867 | score += c2p_att / torch.tensor(scale, dtype=c2p_att.dtype)
868 |
869 | # position->content
870 | if "p2c" in self.pos_att_type:
871 | scale = torch.sqrt(torch.tensor(
872 | pos_query_layer.size(-1), dtype=torch.float) * scale_factor)
873 | if key_layer.size(-2) != query_layer.size(-2):
874 | r_pos = build_relative_position(
875 | key_layer.size(-2),
876 | key_layer.size(-2),
877 | bucket_size=self.position_buckets,
878 | max_position=self.max_relative_positions,
879 | ).to(query_layer.device)
880 | r_pos = r_pos.unsqueeze(0)
881 | else:
882 | r_pos = relative_pos
883 |
884 | p2c_pos = torch.clamp(-r_pos + att_span, 0, att_span * 2 - 1)
885 | p2c_att = torch.bmm(key_layer, pos_query_layer.transpose(-1, -2))
886 | p2c_att = torch.gather(
887 | p2c_att,
888 | dim=-1,
889 | index=p2c_pos.squeeze(0).expand(
890 | [query_layer.size(0), key_layer.size(-2), key_layer.size(-2)]),
891 | ).transpose(-1, -2)
892 | score += p2c_att / torch.tensor(scale, dtype=p2c_att.dtype)
893 |
894 | return score
895 |
896 |
897 | # Copied from transformers.models.deberta.modeling_deberta.DebertaEmbeddings with DebertaLayerNorm->LayerNorm
898 | class DebertaV2Embeddings(nn.Module):
899 | """Construct the embeddings from word, position and token_type embeddings."""
900 |
901 | def __init__(self, config):
902 | super().__init__()
903 | pad_token_id = getattr(config, "pad_token_id", 0)
904 | self.embedding_size = getattr(
905 | config, "embedding_size", config.hidden_size)
906 | self.word_embeddings = nn.Embedding(
907 | config.vocab_size, self.embedding_size, padding_idx=pad_token_id)
908 |
909 | self.position_biased_input = getattr(
910 | config, "position_biased_input", True)
911 | if not self.position_biased_input:
912 | self.position_embeddings = None
913 | else:
914 | self.position_embeddings = nn.Embedding(
915 | config.max_position_embeddings, self.embedding_size)
916 |
917 | if config.type_vocab_size > 0:
918 | self.token_type_embeddings = nn.Embedding(
919 | config.type_vocab_size, self.embedding_size)
920 |
921 | if self.embedding_size != config.hidden_size:
922 | self.embed_proj = nn.Linear(
923 | self.embedding_size, config.hidden_size, bias=False)
924 | self.LayerNorm = LayerNorm(config.hidden_size, config.layer_norm_eps)
925 | self.dropout = StableDropout(config.hidden_dropout_prob)
926 | self.config = config
927 |
928 | # position_ids (1, len position emb) is contiguous in memory and exported when serialized
929 | self.register_buffer("position_ids", torch.arange(
930 | config.max_position_embeddings).expand((1, -1)))
931 |
932 | def forward(self, input_ids=None, token_type_ids=None, position_ids=None, mask=None, inputs_embeds=None):
933 | if input_ids is not None:
934 | input_shape = input_ids.size()
935 | else:
936 | input_shape = inputs_embeds.size()[:-1]
937 |
938 | seq_length = input_shape[1]
939 |
940 | if position_ids is None:
941 | position_ids = self.position_ids[:, :seq_length]
942 |
943 | if token_type_ids is None:
944 | token_type_ids = torch.zeros(
945 | input_shape, dtype=torch.long, device=self.position_ids.device)
946 |
947 | if inputs_embeds is None:
948 | inputs_embeds = self.word_embeddings(input_ids)
949 |
950 | if self.position_embeddings is not None:
951 | position_embeddings = self.position_embeddings(position_ids.long())
952 | else:
953 | position_embeddings = torch.zeros_like(inputs_embeds)
954 |
955 | embeddings = inputs_embeds
956 | if self.position_biased_input:
957 | embeddings += position_embeddings
958 | if self.config.type_vocab_size > 0:
959 | token_type_embeddings = self.token_type_embeddings(token_type_ids)
960 | embeddings += token_type_embeddings
961 |
962 | if self.embedding_size != self.config.hidden_size:
963 | embeddings = self.embed_proj(embeddings)
964 |
965 | embeddings = self.LayerNorm(embeddings)
966 |
967 | if mask is not None:
968 | if mask.dim() != embeddings.dim():
969 | if mask.dim() == 4:
970 | mask = mask.squeeze(1).squeeze(1)
971 | mask = mask.unsqueeze(2)
972 | mask = mask.to(embeddings.dtype)
973 |
974 | embeddings = embeddings * mask
975 |
976 | embeddings = self.dropout(embeddings)
977 | return embeddings
978 |
979 |
980 | # Copied from transformers.models.deberta.modeling_deberta.DebertaPreTrainedModel with Deberta->DebertaV2
981 | class DebertaV2PreTrainedModel(PreTrainedModel):
982 | """
983 | An abstract class to handle weights initialization and a simple interface for downloading and loading pretrained
984 | models.
985 | """
986 |
987 | config_class = DebertaV2Config
988 | base_model_prefix = "deberta"
989 | _keys_to_ignore_on_load_missing = ["position_ids"]
990 | _keys_to_ignore_on_load_unexpected = ["position_embeddings"]
991 | supports_gradient_checkpointing = True
992 |
993 | def _init_weights(self, module):
994 | """Initialize the weights."""
995 | if isinstance(module, nn.Linear):
996 | # Slightly different from the TF version which uses truncated_normal for initialization
997 | # cf https://github.com/pytorch/pytorch/pull/5617
998 | module.weight.data.normal_(
999 | mean=0.0, std=self.config.initializer_range)
1000 | if module.bias is not None:
1001 | module.bias.data.zero_()
1002 | elif isinstance(module, nn.Embedding):
1003 | module.weight.data.normal_(
1004 | mean=0.0, std=self.config.initializer_range)
1005 | if module.padding_idx is not None:
1006 | module.weight.data[module.padding_idx].zero_()
1007 |
1008 | def _set_gradient_checkpointing(self, module, value=False):
1009 | if isinstance(module, DebertaV2Encoder):
1010 | module.gradient_checkpointing = value
1011 |
1012 |
1013 | DEBERTA_START_DOCSTRING = r"""
1014 | The DeBERTa model was proposed in [DeBERTa: Decoding-enhanced BERT with Disentangled
1015 | Attention](https://arxiv.org/abs/2006.03654) by Pengcheng He, Xiaodong Liu, Jianfeng Gao, Weizhu Chen. It's built
1016 | on top of BERT/RoBERTa with two improvements, i.e. disentangled attention and an enhanced mask decoder. With those two
1017 | improvements, it outperforms BERT/RoBERTa on a majority of tasks with 80GB of pretraining data.
1018 |
1019 | This model is also a PyTorch [torch.nn.Module](https://pytorch.org/docs/stable/nn.html#torch.nn.Module) subclass.
1020 | Use it as a regular PyTorch Module and refer to the PyTorch documentation for all matter related to general usage
1021 | and behavior.
1022 |
1023 |
1024 | Parameters:
1025 | config ([`DebertaV2Config`]): Model configuration class with all the parameters of the model.
1026 | Initializing with a config file does not load the weights associated with the model, only the
1027 | configuration. Check out the [`~PreTrainedModel.from_pretrained`] method to load the model weights.
1028 | """
1029 |
1030 | DEBERTA_INPUTS_DOCSTRING = r"""
1031 | Args:
1032 | input_ids (`torch.LongTensor` of shape `({0})`):
1033 | Indices of input sequence tokens in the vocabulary.
1034 |
1035 | Indices can be obtained using [`DebertaV2Tokenizer`]. See [`PreTrainedTokenizer.encode`] and
1036 | [`PreTrainedTokenizer.__call__`] for details.
1037 |
1038 | [What are input IDs?](../glossary#input-ids)
1039 | attention_mask (`torch.FloatTensor` of shape `({0})`, *optional*):
1040 | Mask to avoid performing attention on padding token indices. Mask values selected in `[0, 1]`:
1041 |
1042 | - 1 for tokens that are **not masked**,
1043 | - 0 for tokens that are **masked**.
1044 |
1045 | [What are attention masks?](../glossary#attention-mask)
1046 | token_type_ids (`torch.LongTensor` of shape `({0})`, *optional*):
1047 | Segment token indices to indicate first and second portions of the inputs. Indices are selected in `[0,
1048 | 1]`:
1049 |
1050 | - 0 corresponds to a *sentence A* token,
1051 | - 1 corresponds to a *sentence B* token.
1052 |
1053 | [What are token type IDs?](../glossary#token-type-ids)
1054 | position_ids (`torch.LongTensor` of shape `({0})`, *optional*):
1055 | Indices of positions of each input sequence tokens in the position embeddings. Selected in the range `[0,
1056 | config.max_position_embeddings - 1]`.
1057 |
1058 | [What are position IDs?](../glossary#position-ids)
1059 | inputs_embeds (`torch.FloatTensor` of shape `({0}, hidden_size)`, *optional*):
1060 | Optionally, instead of passing `input_ids` you can choose to directly pass an embedded representation. This
1061 | is useful if you want more control over how to convert *input_ids* indices into associated vectors than the
1062 | model's internal embedding lookup matrix.
1063 | output_attentions (`bool`, *optional*):
1064 | Whether or not to return the attentions tensors of all attention layers. See `attentions` under returned
1065 | tensors for more detail.
1066 | output_hidden_states (`bool`, *optional*):
1067 | Whether or not to return the hidden states of all layers. See `hidden_states` under returned tensors for
1068 | more detail.
1069 | return_dict (`bool`, *optional*):
1070 | Whether or not to return a [`~utils.ModelOutput`] instead of a plain tuple.
1071 | """
1072 |
1073 |
1074 | @add_start_docstrings(
1075 | "The bare DeBERTa Model transformer outputting raw hidden-states without any specific head on top.",
1076 | DEBERTA_START_DOCSTRING,
1077 | )
1078 | # Copied from transformers.models.deberta.modeling_deberta.DebertaModel with Deberta->DebertaV2
1079 | class DebertaV2Model(DebertaV2PreTrainedModel):
1080 | def __init__(self, config):
1081 | super().__init__(config)
1082 |
1083 | self.embeddings = DebertaV2Embeddings(config)
1084 | self.encoder = DebertaV2Encoder(config)
1085 | self.z_steps = 0
1086 | self.config = config
1087 | # Initialize weights and apply final processing
1088 | self.post_init()
1089 |
1090 | def get_input_embeddings(self):
1091 | return self.embeddings.word_embeddings
1092 |
1093 | def set_input_embeddings(self, new_embeddings):
1094 | self.embeddings.word_embeddings = new_embeddings
1095 |
1096 | def _prune_heads(self, heads_to_prune):
1097 | """
1098 | Prunes heads of the model. heads_to_prune: dict of {layer_num: list of heads to prune in this layer} See base
1099 | class PreTrainedModel
1100 | """
1101 | raise NotImplementedError(
1102 | "The prune function is not implemented in DeBERTa model.")
1103 |
1104 | @add_start_docstrings_to_model_forward(DEBERTA_INPUTS_DOCSTRING.format("batch_size, sequence_length"))
1105 | @add_code_sample_docstrings(
1106 | processor_class=_TOKENIZER_FOR_DOC,
1107 | checkpoint=_CHECKPOINT_FOR_DOC,
1108 | output_type=BaseModelOutput,
1109 | config_class=_CONFIG_FOR_DOC,
1110 | )
1111 | def forward(
1112 | self,
1113 | input_ids: Optional[torch.Tensor] = None,
1114 | attention_mask: Optional[torch.Tensor] = None,
1115 | token_type_ids: Optional[torch.Tensor] = None,
1116 | position_ids: Optional[torch.Tensor] = None,
1117 | inputs_embeds: Optional[torch.Tensor] = None,
1118 | output_attentions: Optional[bool] = None,
1119 | output_hidden_states: Optional[bool] = None,
1120 | return_dict: Optional[bool] = None,
1121 | ) -> Union[Tuple, BaseModelOutput]:
1122 | output_attentions = output_attentions if output_attentions is not None else self.config.output_attentions
1123 | output_hidden_states = (
1124 | output_hidden_states if output_hidden_states is not None else self.config.output_hidden_states
1125 | )
1126 | return_dict = return_dict if return_dict is not None else self.config.use_return_dict
1127 |
1128 | if input_ids is not None and inputs_embeds is not None:
1129 | raise ValueError(
1130 | "You cannot specify both input_ids and inputs_embeds at the same time")
1131 | elif input_ids is not None:
1132 | input_shape = input_ids.size()
1133 | elif inputs_embeds is not None:
1134 | input_shape = inputs_embeds.size()[:-1]
1135 | else:
1136 | raise ValueError(
1137 | "You have to specify either input_ids or inputs_embeds")
1138 |
1139 | device = input_ids.device if input_ids is not None else inputs_embeds.device
1140 |
1141 | if attention_mask is None:
1142 | attention_mask = torch.ones(input_shape, device=device)
1143 | if token_type_ids is None:
1144 | token_type_ids = torch.zeros(
1145 | input_shape, dtype=torch.long, device=device)
1146 |
1147 | embedding_output = self.embeddings(
1148 | input_ids=input_ids,
1149 | token_type_ids=token_type_ids,
1150 | position_ids=position_ids,
1151 | mask=attention_mask,
1152 | inputs_embeds=inputs_embeds,
1153 | )
1154 |
1155 | encoder_outputs = self.encoder(
1156 | embedding_output,
1157 | attention_mask,
1158 | output_hidden_states=True,
1159 | output_attentions=output_attentions,
1160 | return_dict=return_dict,
1161 | )
1162 | encoded_layers = encoder_outputs[1]
1163 |
1164 | if self.z_steps > 1:
1165 | hidden_states = encoded_layers[-2]
1166 | layers = [self.encoder.layer[-1] for _ in range(self.z_steps)]
1167 | query_states = encoded_layers[-1]
1168 | rel_embeddings = self.encoder.get_rel_embedding()
1169 | attention_mask = self.encoder.get_attention_mask(attention_mask)
1170 | rel_pos = self.encoder.get_rel_pos(embedding_output)
1171 | for layer in layers[1:]:
1172 | query_states = layer(
1173 | hidden_states,
1174 | attention_mask,
1175 | output_attentions=False,
1176 | query_states=query_states,
1177 | relative_pos=rel_pos,
1178 | rel_embeddings=rel_embeddings,
1179 | )
1180 | encoded_layers.append(query_states)
1181 |
1182 | sequence_output = encoded_layers[-1]
1183 |
1184 | if not return_dict:
1185 | return (sequence_output,) + encoder_outputs[(1 if output_hidden_states else 2):]
1186 |
1187 | return BaseModelOutput(
1188 | last_hidden_state=sequence_output,
1189 | hidden_states=encoder_outputs.hidden_states if output_hidden_states else None,
1190 | attentions=encoder_outputs.attentions,
1191 | )
1192 |
1193 |
1194 | @add_start_docstrings("""DeBERTa Model with a `language modeling` head on top.""", DEBERTA_START_DOCSTRING)
1195 | # Copied from transformers.models.deberta.modeling_deberta.DebertaForMaskedLM with Deberta->DebertaV2
1196 | class DebertaV2ForMaskedLM(DebertaV2PreTrainedModel):
1197 | _keys_to_ignore_on_load_unexpected = [r"pooler"]
1198 | _keys_to_ignore_on_load_missing = [
1199 | r"position_ids", r"predictions.decoder.bias"]
1200 |
1201 | def __init__(self, config):
1202 | super().__init__(config)
1203 |
1204 | self.deberta = DebertaV2Model(config)
1205 | self.cls = DebertaV2OnlyMLMHead(config)
1206 |
1207 | # Initialize weights and apply final processing
1208 | self.post_init()
1209 |
1210 | def get_output_embeddings(self):
1211 | return self.cls.predictions.decoder
1212 |
1213 | def set_output_embeddings(self, new_embeddings):
1214 | self.cls.predictions.decoder = new_embeddings
1215 |
1216 | @add_start_docstrings_to_model_forward(DEBERTA_INPUTS_DOCSTRING.format("batch_size, sequence_length"))
1217 | @add_code_sample_docstrings(
1218 | processor_class=_TOKENIZER_FOR_DOC,
1219 | checkpoint=_CHECKPOINT_FOR_MASKED_LM,
1220 | output_type=MaskedLMOutput,
1221 | config_class=_CONFIG_FOR_DOC,
1222 | mask="[MASK]",
1223 | expected_output=_MASKED_LM_EXPECTED_OUTPUT,
1224 | expected_loss=_MASKED_LM_EXPECTED_LOSS,
1225 | )
1226 | def forward(
1227 | self,
1228 | input_ids: Optional[torch.Tensor] = None,
1229 | attention_mask: Optional[torch.Tensor] = None,
1230 | token_type_ids: Optional[torch.Tensor] = None,
1231 | position_ids: Optional[torch.Tensor] = None,
1232 | inputs_embeds: Optional[torch.Tensor] = None,
1233 | labels: Optional[torch.Tensor] = None,
1234 | output_attentions: Optional[bool] = None,
1235 | output_hidden_states: Optional[bool] = None,
1236 | return_dict: Optional[bool] = None,
1237 | ) -> Union[Tuple, MaskedLMOutput]:
1238 | r"""
1239 | labels (`torch.LongTensor` of shape `(batch_size, sequence_length)`, *optional*):
1240 | Labels for computing the masked language modeling loss. Indices should be in `[-100, 0, ...,
1241 | config.vocab_size]` (see `input_ids` docstring) Tokens with indices set to `-100` are ignored (masked), the
1242 | loss is only computed for the tokens with labels in `[0, ..., config.vocab_size]`
1243 | """
1244 |
1245 | return_dict = return_dict if return_dict is not None else self.config.use_return_dict
1246 |
1247 | outputs = self.deberta(
1248 | input_ids,
1249 | attention_mask=attention_mask,
1250 | token_type_ids=token_type_ids,
1251 | position_ids=position_ids,
1252 | inputs_embeds=inputs_embeds,
1253 | output_attentions=output_attentions,
1254 | output_hidden_states=output_hidden_states,
1255 | return_dict=return_dict,
1256 | )
1257 |
1258 | sequence_output = outputs[0]
1259 | prediction_scores = self.cls(sequence_output)
1260 |
1261 | masked_lm_loss = None
1262 | if labels is not None:
1263 | loss_fct = CrossEntropyLoss() # -100 index = padding token
1264 | masked_lm_loss = loss_fct(
1265 | prediction_scores.view(-1, self.config.vocab_size), labels.view(-1))
1266 |
1267 | if not return_dict:
1268 | output = (prediction_scores,) + outputs[1:]
1269 | return ((masked_lm_loss,) + output) if masked_lm_loss is not None else output
1270 |
1271 | return MaskedLMOutput(
1272 | loss=masked_lm_loss,
1273 | logits=prediction_scores,
1274 | hidden_states=outputs.hidden_states,
1275 | attentions=outputs.attentions,
1276 | )
1277 |
1278 |
1279 | # copied from transformers.models.bert.BertPredictionHeadTransform with bert -> deberta
1280 | class DebertaV2PredictionHeadTransform(nn.Module):
1281 | def __init__(self, config):
1282 | super().__init__()
1283 | self.dense = nn.Linear(config.hidden_size, config.hidden_size)
1284 | if isinstance(config.hidden_act, str):
1285 | self.transform_act_fn = ACT2FN[config.hidden_act]
1286 | else:
1287 | self.transform_act_fn = config.hidden_act
1288 | self.LayerNorm = nn.LayerNorm(
1289 | config.hidden_size, eps=config.layer_norm_eps)
1290 |
1291 | def forward(self, hidden_states):
1292 | hidden_states = self.dense(hidden_states)
1293 | hidden_states = self.transform_act_fn(hidden_states)
1294 | hidden_states = self.LayerNorm(hidden_states)
1295 | return hidden_states
1296 |
1297 |
1298 | # copied from transformers.models.bert.BertLMPredictionHead with bert -> deberta
1299 | class DebertaV2LMPredictionHead(nn.Module):
1300 | def __init__(self, config):
1301 | super().__init__()
1302 | self.transform = DebertaV2PredictionHeadTransform(config)
1303 |
1304 | # The output weights are the same as the input embeddings, but there is
1305 | # an output-only bias for each token.
1306 | self.decoder = nn.Linear(
1307 | config.hidden_size, config.vocab_size, bias=False)
1308 |
1309 | self.bias = nn.Parameter(torch.zeros(config.vocab_size))
1310 |
1311 | # Need a link between the two variables so that the bias is correctly resized with `resize_token_embeddings`
1312 | self.decoder.bias = self.bias
1313 |
1314 | def forward(self, hidden_states):
1315 | hidden_states = self.transform(hidden_states)
1316 | hidden_states = self.decoder(hidden_states)
1317 | return hidden_states
1318 |
1319 |
1320 | # copied from transformers.models.bert.BertOnlyMLMHead with bert -> deberta
1321 | class DebertaV2OnlyMLMHead(nn.Module):
1322 | def __init__(self, config):
1323 | super().__init__()
1324 | self.predictions = DebertaV2LMPredictionHead(config)
1325 |
1326 | def forward(self, sequence_output):
1327 | prediction_scores = self.predictions(sequence_output)
1328 | return prediction_scores
1329 |
1330 |
1331 | @add_start_docstrings(
1332 | """
1333 | DeBERTa Model transformer with a sequence classification/regression head on top (a linear layer on top of the
1334 | pooled output) e.g. for GLUE tasks.
1335 | """,
1336 | DEBERTA_START_DOCSTRING,
1337 | )
1338 | # Copied from transformers.models.deberta.modeling_deberta.DebertaForSequenceClassification with Deberta->DebertaV2
1339 | class DebertaV2ForSequenceClassification(DebertaV2PreTrainedModel):
1340 | def __init__(self, config):
1341 | super().__init__(config)
1342 |
1343 | num_labels = getattr(config, "num_labels", 2)
1344 | self.num_labels = num_labels
1345 |
1346 | self.deberta = DebertaV2Model(config)
1347 | self.pooler = ContextPooler(config)
1348 | output_dim = self.pooler.output_dim
1349 |
1350 | self.classifier = nn.Linear(output_dim, num_labels)
1351 | drop_out = getattr(config, "cls_dropout", None)
1352 | drop_out = self.config.hidden_dropout_prob if drop_out is None else drop_out
1353 | self.dropout = StableDropout(drop_out)
1354 |
1355 | # Initialize weights and apply final processing
1356 | self.post_init()
1357 |
1358 | def get_input_embeddings(self):
1359 | return self.deberta.get_input_embeddings()
1360 |
1361 | def set_input_embeddings(self, new_embeddings):
1362 | self.deberta.set_input_embeddings(new_embeddings)
1363 |
1364 | @add_start_docstrings_to_model_forward(DEBERTA_INPUTS_DOCSTRING.format("batch_size, sequence_length"))
1365 | @add_code_sample_docstrings(
1366 | processor_class=_TOKENIZER_FOR_DOC,
1367 | checkpoint=_CHECKPOINT_FOR_SEQUENCE_CLASSIFICATION,
1368 | output_type=SequenceClassifierOutput,
1369 | config_class=_CONFIG_FOR_DOC,
1370 | expected_output=_SEQ_CLASS_EXPECTED_OUTPUT,
1371 | expected_loss=_SEQ_CLASS_EXPECTED_LOSS,
1372 | )
1373 | def forward(
1374 | self,
1375 | input_ids: Optional[torch.Tensor] = None,
1376 | attention_mask: Optional[torch.Tensor] = None,
1377 | token_type_ids: Optional[torch.Tensor] = None,
1378 | position_ids: Optional[torch.Tensor] = None,
1379 | inputs_embeds: Optional[torch.Tensor] = None,
1380 | labels: Optional[torch.Tensor] = None,
1381 | output_attentions: Optional[bool] = None,
1382 | output_hidden_states: Optional[bool] = None,
1383 | return_dict: Optional[bool] = None,
1384 | ) -> Union[Tuple, SequenceClassifierOutput]:
1385 | r"""
1386 | labels (`torch.LongTensor` of shape `(batch_size,)`, *optional*):
1387 | Labels for computing the sequence classification/regression loss. Indices should be in `[0, ...,
1388 | config.num_labels - 1]`. If `config.num_labels == 1` a regression loss is computed (Mean-Square loss), If
1389 | `config.num_labels > 1` a classification loss is computed (Cross-Entropy).
1390 | """
1391 | return_dict = return_dict if return_dict is not None else self.config.use_return_dict
1392 |
1393 | outputs = self.deberta(
1394 | input_ids,
1395 | token_type_ids=token_type_ids,
1396 | attention_mask=attention_mask,
1397 | position_ids=position_ids,
1398 | inputs_embeds=inputs_embeds,
1399 | output_attentions=output_attentions,
1400 | output_hidden_states=output_hidden_states,
1401 | return_dict=return_dict,
1402 | )
1403 |
1404 | encoder_layer = outputs[0]
1405 | pooled_output = self.pooler(encoder_layer)
1406 | pooled_output = self.dropout(pooled_output)
1407 | logits = self.classifier(pooled_output)
1408 |
1409 | loss = None
1410 | if labels is not None:
1411 | if self.config.problem_type is None:
1412 | if self.num_labels == 1:
1413 | # regression task
1414 | loss_fn = nn.MSELoss()
1415 | logits = logits.view(-1).to(labels.dtype)
1416 | loss = loss_fn(logits, labels.view(-1))
1417 | elif labels.dim() == 1 or labels.size(-1) == 1:
1418 | label_index = (labels >= 0).nonzero()
1419 | labels = labels.long()
1420 | if label_index.size(0) > 0:
1421 | labeled_logits = torch.gather(
1422 | logits, 0, label_index.expand(
1423 | label_index.size(0), logits.size(1))
1424 | )
1425 | labels = torch.gather(labels, 0, label_index.view(-1))
1426 | loss_fct = CrossEntropyLoss()
1427 | loss = loss_fct(
1428 | labeled_logits.view(-1, self.num_labels).float(), labels.view(-1))
1429 | else:
1430 | loss = torch.tensor(0).to(logits)
1431 | else:
1432 | log_softmax = nn.LogSoftmax(-1)
1433 | loss = -((log_softmax(logits) * labels).sum(-1)).mean()
1434 | elif self.config.problem_type == "regression":
1435 | loss_fct = MSELoss()
1436 | if self.num_labels == 1:
1437 | loss = loss_fct(logits.squeeze(), labels.squeeze())
1438 | else:
1439 | loss = loss_fct(logits, labels)
1440 | elif self.config.problem_type == "single_label_classification":
1441 | loss_fct = CrossEntropyLoss()
1442 | loss = loss_fct(
1443 | logits.view(-1, self.num_labels), labels.view(-1))
1444 | elif self.config.problem_type == "multi_label_classification":
1445 | loss_fct = BCEWithLogitsLoss()
1446 | loss = loss_fct(logits, labels)
1447 | if not return_dict:
1448 | output = (logits,) + outputs[1:]
1449 | return ((loss,) + output) if loss is not None else output
1450 |
1451 | return SequenceClassifierOutput(
1452 | loss=loss, logits=logits, hidden_states=outputs.hidden_states, attentions=outputs.attentions
1453 | )
1454 |
1455 |
1456 | @add_start_docstrings(
1457 | """
1458 | DeBERTa Model with a token classification head on top (a linear layer on top of the hidden-states output) e.g. for
1459 | Named-Entity-Recognition (NER) tasks.
1460 | """,
1461 | DEBERTA_START_DOCSTRING,
1462 | )
1463 | # Copied from transformers.models.deberta.modeling_deberta.DebertaForTokenClassification with Deberta->DebertaV2
1464 | class DebertaV2ForTokenClassification(DebertaV2PreTrainedModel):
1465 | _keys_to_ignore_on_load_unexpected = [r"pooler"]
1466 |
1467 | def __init__(self, config):
1468 | super().__init__(config)
1469 | self.num_labels = config.num_labels
1470 |
1471 | self.deberta = DebertaV2Model(config)
1472 | self.dropout = nn.Dropout(config.hidden_dropout_prob)
1473 | self.classifier = nn.Linear(config.hidden_size, config.num_labels)
1474 |
1475 | # Initialize weights and apply final processing
1476 | self.post_init()
1477 |
1478 | @add_start_docstrings_to_model_forward(DEBERTA_INPUTS_DOCSTRING.format("batch_size, sequence_length"))
1479 | @add_code_sample_docstrings(
1480 | processor_class=_TOKENIZER_FOR_DOC,
1481 | checkpoint=_CHECKPOINT_FOR_TOKEN_CLASSIFICATION,
1482 | output_type=TokenClassifierOutput,
1483 | config_class=_CONFIG_FOR_DOC,
1484 | expected_output=_TOKEN_CLASS_EXPECTED_OUTPUT,
1485 | expected_loss=_TOKEN_CLASS_EXPECTED_LOSS,
1486 | )
1487 | def forward(
1488 | self,
1489 | input_ids: Optional[torch.Tensor] = None,
1490 | attention_mask: Optional[torch.Tensor] = None,
1491 | token_type_ids: Optional[torch.Tensor] = None,
1492 | position_ids: Optional[torch.Tensor] = None,
1493 | inputs_embeds: Optional[torch.Tensor] = None,
1494 | labels: Optional[torch.Tensor] = None,
1495 | output_attentions: Optional[bool] = None,
1496 | output_hidden_states: Optional[bool] = None,
1497 | return_dict: Optional[bool] = None,
1498 | ) -> Union[Tuple, TokenClassifierOutput]:
1499 | r"""
1500 | labels (`torch.LongTensor` of shape `(batch_size, sequence_length)`, *optional*):
1501 | Labels for computing the token classification loss. Indices should be in `[0, ..., config.num_labels - 1]`.
1502 | """
1503 | return_dict = return_dict if return_dict is not None else self.config.use_return_dict
1504 |
1505 | outputs = self.deberta(
1506 | input_ids,
1507 | attention_mask=attention_mask,
1508 | token_type_ids=token_type_ids,
1509 | position_ids=position_ids,
1510 | inputs_embeds=inputs_embeds,
1511 | output_attentions=output_attentions,
1512 | output_hidden_states=output_hidden_states,
1513 | return_dict=return_dict,
1514 | )
1515 |
1516 | sequence_output = outputs[0]
1517 |
1518 | sequence_output = self.dropout(sequence_output)
1519 | logits = self.classifier(sequence_output)
1520 |
1521 | loss = None
1522 | if labels is not None:
1523 | loss_fct = CrossEntropyLoss()
1524 | loss = loss_fct(logits.view(-1, self.num_labels), labels.view(-1))
1525 |
1526 | if not return_dict:
1527 | output = (logits,) + outputs[1:]
1528 | return ((loss,) + output) if loss is not None else output
1529 |
1530 | return TokenClassifierOutput(
1531 | loss=loss, logits=logits, hidden_states=outputs.hidden_states, attentions=outputs.attentions
1532 | )
1533 |
1534 |
1535 | @add_start_docstrings(
1536 | """
1537 | DeBERTa Model with a span classification head on top for extractive question-answering tasks like SQuAD (a linear
1538 | layers on top of the hidden-states output to compute `span start logits` and `span end logits`).
1539 | """,
1540 | DEBERTA_START_DOCSTRING,
1541 | )
1542 | # Copied from transformers.models.deberta.modeling_deberta.DebertaForQuestionAnswering with Deberta->DebertaV2
1543 | class DebertaV2ForQuestionAnswering(DebertaV2PreTrainedModel):
1544 | _keys_to_ignore_on_load_unexpected = [r"pooler"]
1545 |
1546 | def __init__(self, config):
1547 | super().__init__(config)
1548 | self.num_labels = config.num_labels
1549 |
1550 | self.deberta = DebertaV2Model(config)
1551 | self.qa_outputs = nn.Linear(config.hidden_size, config.num_labels)
1552 |
1553 | # Initialize weights and apply final processing
1554 | self.post_init()
1555 |
1556 | @add_start_docstrings_to_model_forward(DEBERTA_INPUTS_DOCSTRING.format("batch_size, sequence_length"))
1557 | @add_code_sample_docstrings(
1558 | processor_class=_TOKENIZER_FOR_DOC,
1559 | checkpoint=_CHECKPOINT_FOR_QA,
1560 | output_type=QuestionAnsweringModelOutput,
1561 | config_class=_CONFIG_FOR_DOC,
1562 | expected_output=_QA_EXPECTED_OUTPUT,
1563 | expected_loss=_QA_EXPECTED_LOSS,
1564 | qa_target_start_index=_QA_TARGET_START_INDEX,
1565 | qa_target_end_index=_QA_TARGET_END_INDEX,
1566 | )
1567 | def forward(
1568 | self,
1569 | input_ids: Optional[torch.Tensor] = None,
1570 | attention_mask: Optional[torch.Tensor] = None,
1571 | token_type_ids: Optional[torch.Tensor] = None,
1572 | position_ids: Optional[torch.Tensor] = None,
1573 | inputs_embeds: Optional[torch.Tensor] = None,
1574 | start_positions: Optional[torch.Tensor] = None,
1575 | end_positions: Optional[torch.Tensor] = None,
1576 | output_attentions: Optional[bool] = None,
1577 | output_hidden_states: Optional[bool] = None,
1578 | return_dict: Optional[bool] = None,
1579 | ) -> Union[Tuple, QuestionAnsweringModelOutput]:
1580 | r"""
1581 | start_positions (`torch.LongTensor` of shape `(batch_size,)`, *optional*):
1582 | Labels for position (index) of the start of the labelled span for computing the token classification loss.
1583 | Positions are clamped to the length of the sequence (`sequence_length`). Position outside of the sequence
1584 | are not taken into account for computing the loss.
1585 | end_positions (`torch.LongTensor` of shape `(batch_size,)`, *optional*):
1586 | Labels for position (index) of the end of the labelled span for computing the token classification loss.
1587 | Positions are clamped to the length of the sequence (`sequence_length`). Position outside of the sequence
1588 | are not taken into account for computing the loss.
1589 | """
1590 | return_dict = return_dict if return_dict is not None else self.config.use_return_dict
1591 |
1592 | outputs = self.deberta(
1593 | input_ids,
1594 | attention_mask=attention_mask,
1595 | token_type_ids=token_type_ids,
1596 | position_ids=position_ids,
1597 | inputs_embeds=inputs_embeds,
1598 | output_attentions=output_attentions,
1599 | output_hidden_states=output_hidden_states,
1600 | return_dict=return_dict,
1601 | )
1602 |
1603 | sequence_output = outputs[0]
1604 |
1605 | logits = self.qa_outputs(sequence_output)
1606 | start_logits, end_logits = logits.split(1, dim=-1)
1607 | start_logits = start_logits.squeeze(-1).contiguous()
1608 | end_logits = end_logits.squeeze(-1).contiguous()
1609 |
1610 | total_loss = None
1611 | if start_positions is not None and end_positions is not None:
1612 | # If we are on multi-GPU, split add a dimension
1613 | if len(start_positions.size()) > 1:
1614 | start_positions = start_positions.squeeze(-1)
1615 | if len(end_positions.size()) > 1:
1616 | end_positions = end_positions.squeeze(-1)
1617 | # sometimes the start/end positions are outside our model inputs, we ignore these terms
1618 | ignored_index = start_logits.size(1)
1619 | start_positions = start_positions.clamp(0, ignored_index)
1620 | end_positions = end_positions.clamp(0, ignored_index)
1621 |
1622 | loss_fct = CrossEntropyLoss(ignore_index=ignored_index)
1623 | start_loss = loss_fct(start_logits, start_positions)
1624 | end_loss = loss_fct(end_logits, end_positions)
1625 | total_loss = (start_loss + end_loss) / 2
1626 |
1627 | if not return_dict:
1628 | output = (start_logits, end_logits) + outputs[1:]
1629 | return ((total_loss,) + output) if total_loss is not None else output
1630 |
1631 | return QuestionAnsweringModelOutput(
1632 | loss=total_loss,
1633 | start_logits=start_logits,
1634 | end_logits=end_logits,
1635 | hidden_states=outputs.hidden_states,
1636 | attentions=outputs.attentions,
1637 | )
1638 |
1639 |
1640 | @add_start_docstrings(
1641 | """
1642 | DeBERTa Model with a multiple choice classification head on top (a linear layer on top of the pooled output and a
1643 | softmax) e.g. for RocStories/SWAG tasks.
1644 | """,
1645 | DEBERTA_START_DOCSTRING,
1646 | )
1647 | class DebertaV2ForMultipleChoice(DebertaV2PreTrainedModel):
1648 | def __init__(self, config):
1649 | super().__init__(config)
1650 |
1651 | num_labels = getattr(config, "num_labels", 2)
1652 | self.num_labels = num_labels
1653 |
1654 | self.deberta = DebertaV2Model(config)
1655 | self.pooler = ContextPooler(config)
1656 | output_dim = self.pooler.output_dim
1657 |
1658 | self.classifier = nn.Linear(output_dim, 1)
1659 | drop_out = getattr(config, "cls_dropout", None)
1660 | drop_out = self.config.hidden_dropout_prob if drop_out is None else drop_out
1661 | self.dropout = StableDropout(drop_out)
1662 |
1663 | self.init_weights()
1664 |
1665 | def get_input_embeddings(self):
1666 | return self.deberta.get_input_embeddings()
1667 |
1668 | def set_input_embeddings(self, new_embeddings):
1669 | self.deberta.set_input_embeddings(new_embeddings)
1670 |
1671 | @add_start_docstrings_to_model_forward(DEBERTA_INPUTS_DOCSTRING.format("batch_size, sequence_length"))
1672 | @add_code_sample_docstrings(
1673 | processor_class=_TOKENIZER_FOR_DOC,
1674 | checkpoint=_CHECKPOINT_FOR_DOC,
1675 | output_type=MultipleChoiceModelOutput,
1676 | config_class=_CONFIG_FOR_DOC,
1677 | )
1678 | def forward(
1679 | self,
1680 | input_ids=None,
1681 | attention_mask=None,
1682 | token_type_ids=None,
1683 | position_ids=None,
1684 | inputs_embeds=None,
1685 | labels=None,
1686 | output_attentions=None,
1687 | output_hidden_states=None,
1688 | return_dict=None,
1689 | ):
1690 | r"""
1691 | labels (`torch.LongTensor` of shape `(batch_size,)`, *optional*):
1692 | Labels for computing the multiple choice classification loss. Indices should be in `[0, ...,
1693 | num_choices-1]` where `num_choices` is the size of the second dimension of the input tensors. (See
1694 | `input_ids` above)
1695 | """
1696 | return_dict = return_dict if return_dict is not None else self.config.use_return_dict
1697 | num_choices = input_ids.shape[1] if input_ids is not None else inputs_embeds.shape[1]
1698 |
1699 | flat_input_ids = input_ids.view(-1, input_ids.size(-1)
1700 | ) if input_ids is not None else None
1701 | flat_position_ids = position_ids.view(
1702 | -1, position_ids.size(-1)) if position_ids is not None else None
1703 | flat_token_type_ids = token_type_ids.view(
1704 | -1, token_type_ids.size(-1)) if token_type_ids is not None else None
1705 | flat_attention_mask = attention_mask.view(
1706 | -1, attention_mask.size(-1)) if attention_mask is not None else None
1707 | flat_inputs_embeds = (
1708 | inputs_embeds.view(-1, inputs_embeds.size(-2),
1709 | inputs_embeds.size(-1))
1710 | if inputs_embeds is not None
1711 | else None
1712 | )
1713 |
1714 | outputs = self.deberta(
1715 | flat_input_ids,
1716 | position_ids=flat_position_ids,
1717 | token_type_ids=flat_token_type_ids,
1718 | attention_mask=flat_attention_mask,
1719 | inputs_embeds=flat_inputs_embeds,
1720 | output_attentions=output_attentions,
1721 | output_hidden_states=output_hidden_states,
1722 | return_dict=return_dict,
1723 | )
1724 |
1725 | encoder_layer = outputs[0]
1726 | pooled_output = self.pooler(encoder_layer)
1727 | pooled_output = self.dropout(pooled_output)
1728 | logits = self.classifier(pooled_output)
1729 | reshaped_logits = logits.view(-1, num_choices)
1730 |
1731 | loss = None
1732 | if labels is not None:
1733 | loss_fct = CrossEntropyLoss()
1734 | loss = loss_fct(reshaped_logits, labels)
1735 |
1736 | if not return_dict:
1737 | output = (reshaped_logits,) + outputs[1:]
1738 | return ((loss,) + output) if loss is not None else output
1739 |
1740 | return MultipleChoiceModelOutput(
1741 | loss=loss,
1742 | logits=reshaped_logits,
1743 | hidden_states=outputs.hidden_states,
1744 | attentions=outputs.attentions,
1745 | )
1746 |
1747 |
1748 | class bert_classify(DebertaV2PreTrainedModel):
1749 | _keys_to_ignore_on_load_unexpected = [r"pooler"]
1750 |
1751 | def __init__(self, config):
1752 | super().__init__(config, )
1753 | config.output_hidden_states = True
1754 | '''
1755 | hidden_states: an optional output, returned only when config.output_hidden_states=True; it is a tuple of 13 elements,
1756 | the first being the embedding-layer output and the remaining 12 the outputs of each encoder layer; each element has shape (batch_size, sequence_length, hidden_size).
1757 | '''
1758 | self.num_labels = config.num_labels
1759 | self.deberta = DebertaV2Model(config)
1760 | self.dropout = nn.Dropout(p=0.2)
1761 | self.high_dropout = nn.Dropout(p=0.5)
1762 | n_weights = config.num_hidden_layers + 1  # hidden_states also contains the embedding output, hence the extra 1
1763 | weights_init = torch.zeros(n_weights).float()
1764 | weights_init.data[:-1] = -3
1765 | self.layer_weights = torch.nn.Parameter(weights_init)
1766 | self.bilstm = nn.LSTM(config.hidden_size, 100, bidirectional=True)
1767 | self.classifier = nn.Linear(config.hidden_size + 200, self.num_labels)
1768 | # Random initialization only applies to the newly added layers; parameters loaded from the pretrained model are not affected
1769 | self.init_weights()
1770 | self.post_init()
1771 |
1772 | def forward(
1773 | self,
1774 | input_ids: Optional[torch.Tensor] = None,
1775 | attention_mask: Optional[torch.Tensor] = None,
1776 | token_type_ids: Optional[torch.Tensor] = None,
1777 | position_ids: Optional[torch.Tensor] = None,
1778 | inputs_embeds: Optional[torch.Tensor] = None,
1779 | labels: Optional[torch.Tensor] = None,
1780 | output_attentions: Optional[bool] = None,
1781 | output_hidden_states: Optional[bool] = None,
1782 | return_dict: Optional[bool] = None,
1783 | ):
1784 | return_dict = return_dict if return_dict is not None else self.config.use_return_dict
1785 |
1786 | outputs = self.deberta(
1787 | input_ids,
1788 | attention_mask=attention_mask,
1789 | token_type_ids=token_type_ids,
1790 | position_ids=position_ids,
1791 | inputs_embeds=inputs_embeds,
1792 | output_attentions=output_attentions,
1793 | output_hidden_states=output_hidden_states,
1794 | return_dict=return_dict,
1795 | )
1796 |
1797 | hidden_layers = outputs[1]
1798 | hidden_layers_last = outputs[0]
1799 | rnn_output, _ = self.bilstm(hidden_layers_last)  # BiLSTM over the last hidden layer; output feature dim is 2*100=200, the last position is used below
1800 |
1801 | # Take the [CLS] vector of each layer (shape: batch_size * hidden_size), apply dropout and stack them; shape: 13 * batch_size * hidden_size
1802 | cls_outputs = torch.stack(
1803 | [self.dropout(layer[:, 0, :]) for layer in hidden_layers], dim=0
1804 | )
1805 |
1806 | # Weighted sum over the layers; shape: batch_size * hidden_size
1807 | cls_output = (torch.softmax(self.layer_weights, dim=0).unsqueeze(-1).unsqueeze(-1) * cls_outputs).sum(0)
1808 | cls_output = torch.cat((rnn_output[:, -1], cls_output), dim=1)
1809 |
1810 | # Apply dropout to the pooled CLS vector and feed it through the linear layer; repeat 5 times and average to get the final logits (multi-sample dropout)
1811 | logits = torch.mean(
1812 | torch.stack(
1813 | [self.classifier(self.high_dropout(cls_output)) for _ in range(5)],
1814 | dim=0,
1815 | ),
1816 | dim=0,
1817 | )
1818 | outputs = (logits,) + outputs[2:]
1819 | if labels is not None:
1820 | loss_fct = CrossEntropyLoss()
1821 | loss1 = loss_fct(logits.view(-1, self.num_labels), labels.view(-1))
1822 | loss = loss1
1823 |
1824 | outputs = (loss.mean(),) + outputs # loss, logits, output[2:]
1825 |
1826 | return outputs
--------------------------------------------------------------------------------
/pretrain_algorithm/nezha_graph.py:
--------------------------------------------------------------------------------
1 | # !usr/bin/env python
2 | # -*- coding:utf-8 -*-
3 |
4 | '''
5 | Author : Huang zh
6 | Email : jacob.hzh@qq.com
7 | Date : 2022-09-16 14:39:21
8 | LastEditTime : 2023-03-21 19:12:37
9 | FilePath : \\codes\\pretrain_algorithm\\nezha_graph.py
10 | Description :
11 | '''
12 |
13 | import torch
14 | from torch import nn
15 | from transformers import NezhaPreTrainedModel, NezhaModel
16 |
17 | class nezha_classify(NezhaPreTrainedModel):
18 | def __init__(self, config):
19 | super().__init__(config, )
20 |
21 | self.num_labels = config.num_labels
22 |
23 | self.nezha = NezhaModel(config)
24 |
25 | self.dropout = nn.Dropout(p=0.2)
26 | self.high_dropout = nn.Dropout(p=0.5)
27 |
28 | n_weights = config.num_hidden_layers + 1
29 | weights_init = torch.zeros(n_weights).float()
30 | weights_init.data[:-1] = -3
31 |
32 | self.layer_weights = torch.nn.Parameter(weights_init)
33 |
34 | self.classifier = nn.Linear(config.hidden_size, self.num_labels)
35 |
36 | self.post_init()
37 |
38 | def forward(
39 | self,
40 | input_ids=None,
41 | attention_mask=None,
42 | token_type_ids=None,
43 | label=None,
44 | ):
45 | outputs = self.nezha(
46 | input_ids,
47 | attention_mask=attention_mask,
48 | token_type_ids=token_type_ids,
49 | output_hidden_states=True
50 | )
51 |
52 | hidden_layers = outputs[2]
53 |
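# Layer-weighted [CLS] pooling with multi-sample dropout: stack the [CLS] vector of every
# hidden layer, combine them with a learned softmax weighting over layers, then average the
# classifier output over 5 independently dropped copies of the pooled vector.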
54 | cls_outputs = torch.stack(
55 | [self.dropout(layer[:, 0, :]) for layer in hidden_layers], dim=2
56 | )
57 |
58 | cls_output = (torch.softmax(self.layer_weights, dim=0) * cls_outputs).sum(-1)
59 |
60 | logits = torch.mean(
61 | torch.stack(
62 | [self.classifier(self.high_dropout(cls_output)) for _ in range(5)],
63 | dim=0,
64 | ),
65 | dim=0,
66 | )
67 |
68 | return logits
69 |
--------------------------------------------------------------------------------
/pretrain_algorithm/pre_model.py:
--------------------------------------------------------------------------------
1 | # !usr/bin/env python
2 | # -*- coding:utf-8 -*-
3 |
4 | '''
5 | Author : Huang zh
6 | Email : jacob.hzh@qq.com
7 | Date : 2023-03-13 17:10:12
8 | LastEditTime : 2023-03-23 16:12:19
9 | FilePath : \\codes\\pretrain_algorithm\\pre_model.py
10 | Description :
11 | '''
12 |
13 | import gc
14 | import os
15 | import shutil
16 | import numpy as np
17 | import torch
18 | import time
19 | from tqdm import tqdm
20 | from common import get_time_dif
21 | from config import PRE_MODEL_NAME, VERBOSE
22 | from metrics import Matrix
23 | from pretrain_algorithm.bert_graph import bert_classifier
24 | from pretrain_algorithm.nezha_graph import nezha_classify
25 | from pretrain_algorithm.roberta_wwm import roberta_classify
26 | from transformers import BertConfig, NezhaConfig, RobertaConfig
27 | from trick.early_stop import EarlyStopping
28 | from trick.fgm_pgd_ema import FGM
29 | from tensorboardX import SummaryWriter
30 |
31 |
32 | class PRE_EXCUTER:
33 | def __init__(self, dl_config):
34 | self.dlconfig = dl_config
35 |
36 | def judge_model(self, assign_path=''):
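# Select the backbone by self.dlconfig.model_name and build its config and classification
# head from a local directory expected to contain config.json and pytorch_model.bin
# (the same layout that save_model() writes and load_model() reads).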
37 | load_path = assign_path
38 | if self.dlconfig.model_name not in PRE_MODEL_NAME:
39 | print('pretrained model name is not supported, please see PRE_MODEL_NAME in config.py')
40 | #* When adding new models later, extend the corresponding branches here
41 | if self.dlconfig.model_name in ['mac_bert', 'bert', 'bert_wwm']:
42 | self.pre_config = BertConfig.from_pretrained(os.path.join(load_path, 'config.json'))
43 | self.pre_config.num_labels = self.dlconfig.nums_label
44 | self.model = bert_classifier.from_pretrained(os.path.join(
45 | load_path, 'pytorch_model.bin'), config=self.pre_config)
46 | elif self.dlconfig.model_name == 'nezha_wwm':
47 | self.pre_config = NezhaConfig.from_pretrained(os.path.join(load_path, 'config.json'))
48 | self.pre_config.num_labels = self.dlconfig.nums_label
49 | self.model = nezha_classify.from_pretrained(os.path.join(
50 | load_path, 'pytorch_model.bin'), config=self.pre_config)
51 | elif self.dlconfig.model_name == 'roberta_wwm':
52 | self.pre_config = RobertaConfig.from_pretrained(os.path.join(load_path, 'config.json'))
53 | self.pre_config.num_labels = self.dlconfig.nums_label
54 | self.model = roberta_classify.from_pretrained(os.path.join(
55 | load_path, 'pytorch_model.bin'), config=self.pre_config)
56 |
57 | #! other models
58 | else:
59 | pass
60 | self.model.to(self.dlconfig.device)
61 |
62 | def train(self, train_loader, test_loader, dev_loader, model_saved_path):
63 | # Set up the optimizer
64 | # Parameters whose names contain these substrings are excluded from weight decay
65 | no_decay = ["bias", "LayerNorm.weight"]
66 | optimizer_grouped_parameters = [
67 | {'params': [p for n, p in self.model.named_parameters() if not any(nd in n for nd in no_decay)],
68 | 'weight_decay': 0.01, 'lr': self.dlconfig.learning_rate},
69 | {'params': [p for n, p in self.model.named_parameters() if any(nd in n for nd in no_decay)],
70 | 'weight_decay': 0.0,
71 | 'lr': self.dlconfig.learning_rate},
72 | ]
73 | optimizer = torch.optim.AdamW(
74 | optimizer_grouped_parameters, lr=self.dlconfig.learning_rate)
75 | best_test_f1 = 0
76 | writer = SummaryWriter(logdir='./logs')
77 | # Exponential learning-rate decay: after each epoch, lr = gamma * lr
78 | # scheduler = torch.optim.lr_scheduler.ExponentialLR(optimizer, gamma=0.9)
79 |
80 | # Learning-rate schedule with warmup
81 | if self.dlconfig.update_lr:
82 | from transformers import get_linear_schedule_with_warmup
83 | num_warmup_steps = int(
84 | self.dlconfig.warmup_prop * self.dlconfig.epochs * len(train_loader))
85 | num_training_steps = int(self.dlconfig.epochs * len(train_loader))
86 | # The AdamW optimizer shipped with transformers only implements weight decay, so the scheduler and gradient clipping must be handled manually; a linear schedule with warmup is used below
87 | scheduler = get_linear_schedule_with_warmup(
88 | optimizer, num_warmup_steps, num_training_steps)
89 |
90 | # Early stopping
91 | early_stopping = EarlyStopping(patience=20, delta=0)
92 |
93 | for epoch in range(self.dlconfig.epochs):
94 | # Switch to training mode
95 | self.model.train()
96 | # Zero the gradients
97 | self.model.zero_grad()
98 | start_time = time.time()
99 | avg_loss = 0
100 | first_epoch_eval = 0
101 | for data in tqdm(train_loader, ncols=100):
102 | data['input_ids'] = data['input_ids'].to(self.dlconfig.device)
103 | data['attention_mask'] = data['attention_mask'].to(self.dlconfig.device)
104 | data['token_type_ids'] = data['token_type_ids'].to(self.dlconfig.device)
105 | data['label'] = data['label'].to(self.dlconfig.device)
106 | pred = self.model(**data)
107 | loss = self.dlconfig.loss_fct(pred, data['label']).mean()
108 | # Backward pass
109 | loss.backward()
110 | avg_loss += loss.item() / len(train_loader)
111 |
112 | # Adversarial training with FGM
113 | if self.dlconfig.use_fgm:
114 | fgm = FGM(self.model)
115 | fgm.attack()
116 | loss_adv = self.dlconfig.loss_fct(self.model(**data), data['label']).mean()
117 | # Forward pass on the perturbed embeddings gives the adversarial loss; backpropagating it accumulates the adversarial gradients on top of the normal ones before the parameter update
118 | loss_adv.backward()
119 | fgm.restore()
120 |
121 | if self.dlconfig.update_lr:
122 | torch.nn.utils.clip_grad_norm_(self.model.parameters(), 1.0)  # gradient clipping, see the note above about the scheduler
123 |
124 | # Optimizer step
125 | optimizer.step()
126 | # Update the learning rate
127 | if self.dlconfig.update_lr:
128 | scheduler.step()
129 |
130 | # Setting grads to None instead of calling model.zero_grad() can improve GPU utilization
131 | for param in self.model.parameters():
132 | param.grad = None
133 |
134 | # Elapsed time for this epoch
135 | elapsed_time = get_time_dif(start_time)
136 | # Logging interval
137 | if (epoch + 1) % VERBOSE == 0:
138 | # Evaluate on the test set
139 | avg_test_loss, test_f1, pred_all, true_all = self.evaluate(
140 | test_loader)
141 | elapsed_time = elapsed_time * VERBOSE
142 | if self.dlconfig.update_lr:
143 | lr = scheduler.get_last_lr()[0]
144 | else:
145 | lr = self.dlconfig.learning_rate
146 | tqdm.write(
147 | f"Epoch {epoch + 1:02d}/{self.dlconfig.epochs:02d} \t time={elapsed_time} \t"
148 | f"loss={avg_loss:.3f}\t lr={lr:.1e}",
149 | end="\t",
150 | )
151 |
152 | if (epoch + 1 >= first_epoch_eval) or (epoch + 1 == self.dlconfig.epochs):
153 | tqdm.write(
154 | f"val_loss={avg_test_loss:.3f}\ttest_f1={test_f1:.4f}\t lr={lr:.1e}")
155 | else:
156 | tqdm.write("")
157 | writer.add_scalar('Loss/train', avg_loss, epoch)
158 | writer.add_scalar('Loss/test', avg_test_loss, epoch)
159 | writer.add_scalar('F1/test', test_f1, epoch)
160 | writer.add_scalar('lr/train', lr, epoch)
161 |
162 | # Save the best model so far, judged by test-set F1
163 | if best_test_f1 < test_f1:
164 | best_test_f1 = test_f1
165 | tqdm.write('*' * 20)
166 | self.save_model(model_saved_path)
167 | tqdm.write('new model saved')
168 | tqdm.write('*' * 20)
169 |
170 | early_stopping(avg_test_loss)
171 | if early_stopping.early_stop:
172 | break
173 | # Delete the data loaders and temporary variables
174 | del (test_loader, train_loader, loss, data, pred)
175 | # Free memory
176 | gc.collect()
177 | torch.cuda.empty_cache()
178 | writer.close()
179 |
180 | def evaluate(self, test_loader):
181 | pre_all = []
182 | true_all = []
183 | # Switch to evaluation mode
184 | self.model.eval()
185 | avg_test_loss = 0
186 | with torch.no_grad():
187 | for test_data in test_loader:
188 | pred = self.model(test_data['input_ids'].to(self.dlconfig.device),
189 | test_data['attention_mask'].to(self.dlconfig.device),
190 | test_data['token_type_ids'].to(self.dlconfig.device),
191 | )
192 | test_loss = self.dlconfig.loss_fct(pred, test_data['label'].to(self.dlconfig.device)).mean()
193 | avg_test_loss += test_loss.item() / len(test_loader)
194 | true_all.extend(test_data['label'].detach().cpu().numpy())
195 | pre_all.append(pred.softmax(-1).detach().cpu().numpy())
196 | pre_all = np.concatenate(pre_all)
197 | pre_all = np.argmax(pre_all, axis=-1)
198 | if self.dlconfig.loss_type == 'multi' or self.dlconfig.loss_type == 'marginLoss':
199 | multi = True
200 | else:
201 | multi = False
202 | matrix = Matrix(true_all, pre_all, multi=multi)
203 | return avg_test_loss, matrix.get_f1(), pre_all, true_all
204 |
205 | def predict(self, dev_loader):
206 | pre_all = []
207 | with torch.no_grad():
208 | for test_data in dev_loader:
209 | pred = self.model(test_data['input_ids'].to(self.dlconfig.device),
210 | test_data['attention_mask'].to(self.dlconfig.device),
211 | test_data['token_type_ids'].to(self.dlconfig.device),
212 | )
213 | pre_all.append(pred.softmax(-1).detach().cpu().numpy())
214 | pre_all = np.concatenate(pre_all)
215 | pre_all = np.argmax(pre_all, axis=-1)
216 | return pre_all
217 |
218 | # Save the model weights
219 | def save_model(self, path):
220 | if not os.path.exists(path):
221 | os.makedirs(path)
222 | if not os.path.exists(os.path.join(path, 'config.json')):
223 | shutil.copy(f'{self.dlconfig.pretrain_file_path}/config.json', f'{path}/config.json')
224 | if not os.path.exists(os.path.join(path, 'vocab.txt')):
225 | shutil.copy(f'{self.dlconfig.pretrain_file_path}/vocab.txt', f'{path}/vocab.txt')
226 | name = 'pytorch_model.bin'
227 | output_path = os.path.join(path, name)
228 | torch.save(self.model.state_dict(), output_path)
229 | print(f'model is saved, in {str(output_path)}')
230 |
231 | def load_model(self, path):
232 | try:
233 | self.judge_model(path)
234 | self.model.eval()
235 | print('pretrained weights loaded into the model')
236 | except:
237 | print('model load error')
238 |
--------------------------------------------------------------------------------
/pretrain_algorithm/roberta_wwm.py:
--------------------------------------------------------------------------------
1 | # !usr/bin/env python
2 | # -*- coding:utf-8 -*-
3 |
4 | '''
5 | Author : Huang zh
6 | Email : jacob.hzh@qq.com
7 | Date : 2023-03-21 19:14:06
8 | LastEditTime : 2023-03-22 11:06:19
9 | FilePath : \\codes\\pretrain_algorithm\\roberta_wwm.py
10 | Description :
11 | '''
12 |
13 |
14 | import torch
15 | from torch import nn
16 | from transformers import RobertaPreTrainedModel, RobertaModel
17 |
18 | class roberta_classify(RobertaPreTrainedModel):
19 | _keys_to_ignore_on_load_unexpected = [r"pooler"]
20 | _keys_to_ignore_on_load_missing = [r"position_ids"]
21 | def __init__(self, config):
22 | super().__init__(config, )
23 |
24 | self.num_labels = config.num_labels
25 | # If add_pooling_layer is True, the output contains an extra pooled result that could be used for the downstream task
26 | # Since a weighted combination of the hidden layers is used as the downstream input here, it is set to False
27 | self.roberta = RobertaModel(config, add_pooling_layer=False)
28 |
29 | self.dropout = nn.Dropout(p=0.2)
30 | self.high_dropout = nn.Dropout(p=0.5)
31 |
32 | n_weights = config.num_hidden_layers + 1
33 | weights_init = torch.zeros(n_weights).float()
34 | weights_init.data[:-1] = -3
35 |
36 | self.layer_weights = torch.nn.Parameter(weights_init)
37 |
38 | self.classifier = nn.Linear(config.hidden_size, self.num_labels)
39 |
40 | self.post_init()
41 |
42 | def forward(
43 | self,
44 | input_ids=None,
45 | attention_mask=None,
46 | token_type_ids=None,
47 | label=None,
48 | ):
49 | outputs = self.roberta(
50 | input_ids,
51 | attention_mask=attention_mask,
52 | token_type_ids=token_type_ids,
53 | output_hidden_states=True
54 | )
55 |
56 | hidden_layers = outputs[1]
57 |
58 | cls_outputs = torch.stack(
59 | [self.dropout(layer[:, 0, :]) for layer in hidden_layers], dim=2
60 | )
61 |
62 | cls_output = (torch.softmax(self.layer_weights, dim=0) * cls_outputs).sum(-1)
63 |
64 | logits = torch.mean(
65 | torch.stack(
66 | [self.classifier(self.high_dropout(cls_output)) for _ in range(5)],
67 | dim=0,
68 | ),
69 | dim=0,
70 | )
71 |
72 | return logits
73 |
--------------------------------------------------------------------------------
/pretrain_model/bert_wwm/readme.txt:
--------------------------------------------------------------------------------
1 | # Put the pretrained model files here: vocab.txt, config.json, pytorch_model.bin
--------------------------------------------------------------------------------
/process_data_dl.py:
--------------------------------------------------------------------------------
1 | #!/usr/bin/env python
2 | # -*- encoding: utf-8 -*-
3 | '''
4 | @File : utils.py
5 | @Time : 2023/02/08 14:57:32
6 | @Author : Huang zh
7 | @Contact : jacob.hzh@qq.com
8 | @Version : 0.1
9 | @Desc    :   get vocab, label, label_nums, label2n, word2n, n2word, n2label, dataset definitions
10 | '''
11 |
12 | import os
13 | import jieba
14 | import pickle as pkl
15 | import pandas as pd
16 | import numpy as np
17 | import torch
18 | import torch.nn as nn
19 | from collections import OrderedDict
20 | from torch.utils.data import Dataset, DataLoader
21 | from config import VOCAB_MAX_SIZE, WORD_MIN_FREQ, VOCAB_SAVE_PATH, L2I_SAVE_PATH, PRETRAIN_EMBEDDING_FILE
22 | from trick.dynamic_padding import collater
23 |
24 |
25 | class DataSetProcess:
26 | def __init__(self, train_data_path='', test_data_path='', dev_data_path=''):
27 | self.train_data_path = train_data_path
28 | self.test_data_path = test_data_path
29 | self.dev_data_path = dev_data_path
30 | self.train_data, self.l1 = self.load_data(
31 | self.train_data_path) if self.train_data_path else [[], []]
32 | self.test_data, self.l2 = self.load_data(
33 | self.test_data_path) if self.test_data_path else [[], []]
34 | self.dev_data, self.l3 = self.load_data(
35 | self.dev_data_path) if self.dev_data_path else [[], []]
36 |
37 | def load_data(self, path):
38 | """默认是处理csv文件,其他形式的要改,csv的话文本内容要改成我提供的demo格式
39 | """
40 | if path.endswith('csv'):
41 | df = pd.read_csv(path, encoding='utf-8')
42 | contents = df['content'].values.tolist()
43 | try:
44 | labels = df['label'].values.tolist()
45 | except:
46 | labels = []
47 | return contents, labels
48 | else:
49 | #! todo: reading files in other formats
50 | pass
51 |
52 | def build_vocab(self, save=False):
53 | if os.path.exists(VOCAB_SAVE_PATH):
54 | with open(VOCAB_SAVE_PATH, 'rb') as f:
55 | vocab_dic = pkl.load(f)
56 | print(f"vocab size {len(vocab_dic)}")
57 | return vocab_dic
58 |
59 | vocab_dic = {}
60 | UNK, PAD = '<UNK>', '<PAD>'
61 | min_freq = WORD_MIN_FREQ
62 | vocab_max_size = VOCAB_MAX_SIZE
63 |
64 | all_data = self.train_data + self.test_data + self.dev_data
65 |
66 | for sentence in all_data:
67 | sentence = sentence.strip()
68 | #! Only Chinese tokenization is handled here; splitting English on whitespace is not implemented yet
69 | tokens = jieba.cut(sentence)
70 | for token in tokens:
71 | vocab_dic[token] = vocab_dic.get(token, 0) + 1
72 | # Sort the vocabulary by frequency
73 | vocab_list = sorted([_ for _ in vocab_dic.items() if _[
74 | 1] >= min_freq], key=lambda x: x[1], reverse=True)[:vocab_max_size]
75 |
76 | # Rebuild as a dict mapping word -> index
77 | vocab_dic = {word_count[0]: idx for idx,
78 | word_count in enumerate(vocab_list)}
79 |
80 | # Append UNK and PAD at the end of the vocabulary
81 | vocab_dic.update({UNK: len(vocab_dic), PAD: len(vocab_dic) + 1})
82 |
83 | # Optionally save
84 | if save:
85 | abs_path = VOCAB_SAVE_PATH.rsplit('/', 1)[0]
86 | if not os.path.exists(abs_path):
87 | os.makedirs(abs_path)
88 | with open(VOCAB_SAVE_PATH, 'wb') as f:
89 | pkl.dump(vocab_dic, f)
90 | print(f'vocab_dic is saved in {VOCAB_SAVE_PATH}')
91 | print(f"vocab size {len(vocab_dic)}")
92 | return vocab_dic
93 |
94 | def build_label2id(self, save=False):
95 | if os.path.exists(L2I_SAVE_PATH):
96 | with open(L2I_SAVE_PATH, 'rb') as f:
97 | l2i_dic = pkl.load(f)
98 | i2l_dic = {}
99 | for k, n in l2i_dic.items():
100 | i2l_dic[n] = k
101 | return l2i_dic, i2l_dic
102 |
103 |
104 | i2l_dic = OrderedDict()
105 | l2i_dic = OrderedDict()
106 | all_label_list = self.l1 + self.l2 + self.l3
107 | all_label_list = list(set(all_label_list))
108 | for i in range(len(all_label_list)):
109 | i2l_dic[i] = all_label_list[i]
110 | l2i_dic[all_label_list[i]] = i
111 |
112 | # Optionally save
113 | if save:
114 | abs_path = L2I_SAVE_PATH.rsplit('/', 1)[0]
115 | if not os.path.exists(abs_path):
116 | os.makedirs(abs_path)
117 | with open(L2I_SAVE_PATH, 'wb') as f:
118 | pkl.dump(l2i_dic, f)
119 | print(f'label2id_dic is saved in {L2I_SAVE_PATH}')
120 |
121 | return l2i_dic, i2l_dic
122 |
123 | def trans_data(self, data_path, vocab_dic, label_dic):
124 | contents = []
125 | datas, labels = self.load_data(data_path)
126 | if not labels:
127 | labels = [-1] * len(datas)
128 | for d, l in zip(datas, labels):
129 | if not d.strip():
130 | continue
131 | wordlists = []
132 | tokens = jieba.cut(d.strip())
133 | for token in tokens:
134 | wordlists.append(vocab_dic.get(token, vocab_dic.get('<UNK>')))
135 | if l != -1:
136 | contents.append((wordlists, int(label_dic.get(l))))
137 | else:
138 | contents.append((wordlists,))
139 | return contents
140 |
141 | def load_emb(self, vocab_dic):
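# Expects a word2vec-style text file: the first line is "<vocab_size> <dim>", followed by one
# "<word> <v1> <v2> ..." line per word. Vectors for words present in vocab_dic are copied into
# a freshly initialized nn.Embedding weight matrix; all other rows keep their random init.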
142 | skip = True
143 | emb_dic = {}
144 | with open(PRETRAIN_EMBEDDING_FILE, 'r', encoding='utf-8') as f:
145 | for i in f:
146 | if skip:
147 | skip = False
148 | dim = int(i.split(' ', 1)[1].strip())
149 | continue
150 | word, embeds = i.split(' ', 1)
151 | embeds = embeds.strip().split(' ')
152 | if vocab_dic.get(word, None):
153 | emb_dic[vocab_dic[word]] = embeds
154 | emb_dics = sorted(emb_dic.items(), key=lambda x: x[0])
155 | orignal_emb = nn.Embedding(len(vocab_dic), dim, padding_idx=len(vocab_dic)-1)
156 | emb_array = orignal_emb.weight.data.numpy()
157 | for i in emb_dics:
158 | index = i[0]
159 | weight = np.array(i[1], dtype=float)
160 | emb_array[index] = weight
161 | print(f'Loaded pretrained word vectors with dimension {dim}')
162 | return torch.FloatTensor(emb_array), dim
163 |
164 |
165 | class DLDataset(Dataset):
166 | """自定义torch的dataset
167 | """
168 |
169 | def __init__(self, contents):
170 | self.data, self.label = self.get_data_label(contents)
171 | self.len = len(self.data)
172 |
173 | def __len__(self):
174 | return self.len
175 |
176 | def __getitem__(self, index):
177 | if self.label:
178 | return {
179 | 'input_ids': self.data[index],
180 | 'label': self.label[index]
181 | }
182 | else:
183 | return {'input_ids': self.data[index]}
184 |
185 | def get_data_label(self, contents):
186 | """contents: [([xx,x,,], label?), ()]
187 | """
188 | data = []
189 | label = []
190 | for i in contents:
191 | data.append(i[0])
192 | if len(i) == 2:
193 | label.append(i[1])
194 | return data, label
195 |
196 | class DL_Data_Excuter:
197 | def __init__(self):
198 | pass
199 | def process(self,batch_size, train_data_path='', test_data_path='', dev_data_path=''):
200 | """内部构建各个数据集的dataloader,返回词表大小和类别数量
201 | """
202 |
203 | p = DataSetProcess(train_data_path, test_data_path, dev_data_path)
204 | self.vocab = p.build_vocab(save=True)
205 | pad_index = self.vocab['<PAD>']
206 | self.label_dic, self.i2l_dic = p.build_label2id(save=True)
207 | if len(self.label_dic) > 2:
208 | self.multi = True
209 | else:
210 | self.multi = False
211 | collater_fn = collater(pad_index)
212 | self.train_data_loader = ''
213 | self.test_data_loader = ''
214 | self.dev_data_loader = ''
215 | if train_data_path:
216 | content = p.trans_data(train_data_path, self.vocab, self.label_dic)
217 | data_set = DLDataset(content)
218 | self.train_data_loader = DataLoader(
219 | data_set, batch_size=batch_size, shuffle=True, collate_fn=collater_fn)
220 | if test_data_path:
221 | content = p.trans_data(test_data_path, self.vocab, self.label_dic)
222 | data_set = DLDataset(content)
223 | self.test_data_loader = DataLoader(
224 | data_set, batch_size=batch_size, shuffle=False, collate_fn=collater_fn)
225 | if dev_data_path:
226 | content = p.trans_data(dev_data_path, self.vocab, self.label_dic)
227 | data_set = DLDataset(content)
228 | self.dev_data_loader = DataLoader(
229 | data_set, batch_size=batch_size, shuffle=False, collate_fn=collater_fn)
230 | return len(self.vocab), len(self.label_dic)
231 |
232 |
233 | if __name__ == '__main__':
234 | d = DL_Data_Excuter()
235 | d.process(2, '', './data/dl_data/test.csv', '')
236 | print(1)
237 |
--------------------------------------------------------------------------------
/process_data_ml.py:
--------------------------------------------------------------------------------
1 | #!/usr/bin/env python
2 | # -*- encoding: utf-8 -*-
3 | '''
4 | @File : process_data.py
5 | @Time : 2023/01/13 16:25:15
6 | @Author : Huang zh
7 | @Contact : jacob.hzh@qq.com
8 | @Version : 0.1
9 | @Desc    :   process data. Two jobs: separate the label from the features and map/convert the label; resample to handle class imbalance
10 | '''
11 |
12 | import pandas as pd
13 | from sklearn.utils import shuffle
14 | from collections import OrderedDict
15 | from imblearn.under_sampling import RandomUnderSampler
16 |
17 | class ML_Data_Excuter:
18 | def __init__(self, data_path, split_size, is_sample=False, split=True, train_data_path='', test_data_path=''):
19 | """数据处理类
20 |
21 | Args:
22 | data_path (str): 数据的路径
23 | split_size (int): 切分训练集和测试集的比例
24 | is_sample (bool, optional): 是否对数据进行采样,当数据不平衡时推荐True. Defaults to False.
25 | split (bool, optional): 是否进行训练集和测试集的切分操作. Defaults to True.
26 | train_data_path (str, optional): 如果这个路径存在,那么默认不进行程序默认的训练集和测试集的切分,用用户已经切分好的数据. Defaults to ''.
27 | test_data_path (str, optional): 同上. Defaults to ''.
28 | """
29 | self.train_data_path = train_data_path
30 | self.test_data_path = test_data_path
31 | if self.train_data_path and self.test_data_path:
32 | self.train_data = pd.read_csv(self.train_data_path)
33 | self.test_data = pd.read_csv(self.test_data_path)
34 | self.data = pd.concat([self.train_data, self.test_data], axis=0)
35 | self.l2i_dic, self.i2l_dic = self.create_l2i()
36 | self.label = self.data['label']
37 | if len(set(self.label.values.tolist())) > 2:
38 | self.multi = True
39 | elif len(set(self.label.values.tolist())) == 2:
40 | self.multi = False
41 | else:
42 | print('there is only one label; at least 2 are required')
43 | exit(0)
44 | self.X = self.data.loc[:, self.data.columns!='label']
45 | print('data nums: ')
46 | print(self.X.shape[0])
47 | self.train_data_label = self.train_data['label']
48 | self.train_data_x = self.train_data.loc[:, self.train_data.columns!='label']
49 | self.test_data_label = self.test_data['label']
50 | self.test_data_x = self.test_data.loc[:, self.test_data.columns!='label']
51 | print('split train_test data:')
52 | print('train_data num:')
53 | print(self.train_data_x.shape)
54 | print('test_data num:')
55 | print(self.test_data_x.shape)
56 | else:
57 | self.split_size = split_size
58 | self.data = pd.read_csv(data_path)
59 | self.l2i_dic, self.i2l_dic = self.create_l2i()
60 | self.label = self.data['label']
61 | if len(set(self.label.values.tolist())) > 2:
62 | self.multi = True
63 | elif len(set(self.label.values.tolist())) == 2:
64 | self.multi = False
65 | else:
66 | print('there is only one label; at least 2 are required')
67 | exit(0)
68 | self.X = self.data.loc[:, self.data.columns!='label']
69 | if is_sample:
70 | self.sample()
71 | print('data nums: ')
72 | print(self.X.shape[0])
73 | if split:
74 | self.train_test_split()
75 |
76 | def create_l2i(self):
77 | i2l_dic = OrderedDict()
78 | l2i_dic = OrderedDict()
79 | # Convert labels to integers and build ordered dicts, which makes drawing the confusion matrix later easier
80 | classes = sorted(list(set(self.data['label'].values.tolist())))
81 | print(classes)
82 | num_classes = len(set(classes))
83 | for i in range(num_classes):
84 | i2l_dic[i] = classes[i]
85 | l2i_dic[classes[i]] = i
86 | self.data['label'] = self.data['label'].map(l2i_dic)
87 |
88 | return l2i_dic, i2l_dic
89 |
90 | def sample(self):
91 | # Simple random under-sampling is used here; change the method here if another strategy is needed
92 | def get_res():
93 | res = sorted(Counter(self.label).items())
94 | res_ = []
95 | for i in res:
96 | tmp = (self.i2l_dic[i[0]], i[1])
97 | res_.append(tmp)
98 | return res_
99 | from collections import Counter
100 |
101 | print('before sample,data nums:')
102 | print(get_res())
103 | sample_excuter = RandomUnderSampler(random_state=96)
104 | self.X, self.label = sample_excuter.fit_resample(self.X, self.label)
105 | print('after sample,data nums:')
106 | print(get_res())
107 | self.data = pd.concat([self.X, self.label], axis=1)
108 |
109 | def train_test_split(self):
110 | """
111 | The split is done per label (stratified), so the train and test sets cover the same set of label classes; no label appears in the train set without also appearing in the test set.
112 | """
113 | type_label = list(set(self.data.label.values.tolist()))
114 | test_data_index = []
115 | for l in type_label:
116 | tmp_data = self.data[self.data['label']==l]
117 | tmp_data = shuffle(tmp_data)
118 | random_test = tmp_data.sample(frac=self.split_size, random_state=96)
119 | index_num = random_test.index.tolist()
120 | test_data_index += index_num
121 | test_data = self.data.iloc[test_data_index, :]
122 | train_data = self.data[~self.data.index.isin(test_data_index)]
123 | self.train_data_label = train_data['label']
124 | self.train_data_x = train_data.loc[:, train_data.columns!='label']
125 | self.test_data_label = test_data['label']
126 | self.test_data_x = test_data.loc[:, test_data.columns!='label']
127 | print('split train_test data:')
128 | print('train_data num:')
129 | print(self.train_data_x.shape)
130 | print('test_data num:')
131 | print(self.test_data_x.shape)
132 |
133 |
134 | if __name__ == '__main__':
135 | data_path = './data/processed_data.csv'
136 | data_ex = ML_Data_Excuter(data_path, 0.3, is_sample=True, split=True)
137 | print(1)
--------------------------------------------------------------------------------
/process_data_pretrain.py:
--------------------------------------------------------------------------------
1 | # !usr/bin/env python
2 | # -*- coding:utf-8 -*-
3 |
4 | '''
5 | Author : Huang zh
6 | Email : jacob.hzh@qq.com
7 | Date : 2023-03-13 15:09:48
8 | LastEditTime : 2023-03-21 19:51:13
9 | FilePath : \\codes\\process_data_pretrain.py
10 | Description : data process for pretrain method
11 | '''
12 |
13 |
14 | from process_data_dl import DataSetProcess
15 | from trick.dynamic_padding import collater
16 | from torch.utils.data import Dataset, DataLoader
17 | from config import MAX_SEQ_LEN
18 | from transformers import AutoTokenizer
19 |
20 |
21 | class DataSetProcess_pre(DataSetProcess):
22 | def trans_data(self, data_path, label_dic):
23 | contents = []
24 | datas, labels = self.load_data(data_path)
25 | if not labels:
26 | labels = [-1] * len(datas)
27 | for d, l in zip(datas, labels):
28 | if not d.strip():
29 | continue
30 | if l != -1:
31 | contents.append(([d], int(label_dic.get(l))))
32 | else:
33 | contents.append(([d],))
34 | return contents
35 |
36 |
37 | class PREDataset(Dataset):
38 | def __init__(self, contents, tokenizer, max_seq_len):
39 | self.tokenizer = tokenizer
40 | self.max_seq_len = max_seq_len
41 | self.data, self.label = self.get_data_label(contents)
42 | self.len = len(self.data)
43 |
44 | def __len__(self):
45 | return self.len
46 |
47 | def __getitem__(self, index):
48 | #! todo: dataset construction for continued (incremental) pretraining
49 | # The label returned here differs from the pretraining case: prediction is task-specific rather than pretraining, so the dataset's labels are used; continued pretraining on your own data would instead require predicting the masked positions yourself
50 | tokenize_result = self.tokenizer.encode_plus(self.data[index], max_length=self.max_seq_len)
51 | if self.label:
52 | return {
53 | 'input_ids': tokenize_result["input_ids"],
54 | 'attention_mask': tokenize_result["attention_mask"],
55 | 'token_type_ids': tokenize_result["token_type_ids"],
56 | 'label': self.label[index]
57 |
58 | }
59 | return {
60 | 'input_ids': tokenize_result["input_ids"],
61 | 'attention_mask': tokenize_result["attention_mask"],
62 | 'token_type_ids': tokenize_result["token_type_ids"],
63 | }
64 |
65 | def get_data_label(self, contents):
66 | """contents: [([data], label?), ()]
67 | """
68 | data = []
69 | label = []
70 | for i in contents:
71 | data.append(i[0][0])
72 | if len(i) == 2:
73 | label.append(i[1])
74 | return data, label
75 |
76 |
77 | class PRE_Data_Excuter:
78 | def __init__(self, model_type):
79 | self.model_type = model_type
80 |
81 | def process(self,batch_size, train_data_path='', test_data_path='', dev_data_path='', pretrain_file_path=''):
82 | self.pretrain_file_path = pretrain_file_path
83 | #* Tokenizer setup: different models use different tokenizers
84 | # if self.model_type in ['mac_bert','bert', 'bert_wwm', 'nezha_wwm']:
85 | # from transformers import BertTokenizer
86 | # tokenizer = BertTokenizer.from_pretrained(self.pretrain_file_path)
87 | #     #// other tokenizers; AutoTokenizer was not used here at first
88 | # else:
89 | # print('tokenizer is null, please check model_name')
90 | # exit()
91 | tokenizer = AutoTokenizer.from_pretrained(self.pretrain_file_path)
92 | p = DataSetProcess_pre(train_data_path, test_data_path, dev_data_path)
93 | self.label_dic, self.i2l_dic = p.build_label2id(save=True)
94 | if len(self.label_dic) > 2:
95 | self.multi = True
96 | else:
97 | self.multi = False
98 | collater_fn = collater(pad_index=0, for_pretrain=True)
99 | self.train_data_loader = ''
100 | self.test_data_loader = ''
101 | self.dev_data_loader = ''
102 | if train_data_path:
103 | content = p.trans_data(train_data_path, self.label_dic)
104 | data_set = PREDataset(content,tokenizer=tokenizer, max_seq_len=MAX_SEQ_LEN)
105 | self.train_data_loader = DataLoader(
106 | data_set, batch_size=batch_size, shuffle=True, collate_fn=collater_fn)
107 | if test_data_path:
108 | content = p.trans_data(test_data_path, self.label_dic)
109 | data_set = PREDataset(content,tokenizer=tokenizer, max_seq_len=MAX_SEQ_LEN)
110 | self.test_data_loader = DataLoader(
111 | data_set, batch_size=batch_size, shuffle=False, collate_fn=collater_fn)
112 | if dev_data_path:
113 | content = p.trans_data(dev_data_path, self.label_dic)
114 | data_set = PREDataset(content,tokenizer=tokenizer, max_seq_len=MAX_SEQ_LEN)
115 | self.dev_data_loader = DataLoader(
116 | data_set, batch_size=batch_size, shuffle=False, collate_fn=collater_fn)
117 | return len(self.label_dic)
118 |
119 |
--------------------------------------------------------------------------------
/requirements.txt:
--------------------------------------------------------------------------------
1 | catboost==1.1.1
2 | gensim==4.3.1
3 | imbalanced_learn==0.10.1
4 | imblearn==0.0
5 | jieba==0.42.1
6 | joblib==1.2.0
7 | matplotlib==3.6.3
8 | numpy==1.24.1
9 | pandas==1.5.2
10 | scikit_learn==1.2.2
11 | torch==1.13.1
12 | tqdm==4.64.1
13 | transformers==4.25.1
14 | xgboost==1.7.3
15 |
--------------------------------------------------------------------------------
/save_model/knn.pkl:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/hziheng/Machine-learning-project-for-text-classification/ec6a7517adaf4618148d25f9d192d76b3f747e10/save_model/knn.pkl
--------------------------------------------------------------------------------
/trick/dynamic_padding.py:
--------------------------------------------------------------------------------
1 | #!/usr/bin/env python
2 | # -*- encoding: utf-8 -*-
3 | '''
4 | @File : dynamic_padding.py
5 | @Time : 2023/02/09 10:17:47
6 | @Author : Huang zh
7 | @Contact : jacob.hzh@qq.com
8 | @Version : 0.1
9 | @Desc    :   pad to one length per batch instead of one global length for all data
10 | '''
11 |
12 | import torch
13 |
14 | class collater():
15 | def __init__(self, pad_index, for_pretrain=False):
16 | # If for_pretrain=True, the returned batch also contains the attention_mask and token_type_ids matrices
17 | self.pad_index = pad_index
18 | self.for_pretrain = for_pretrain
19 |
20 | def __call__(self, batch):
21 | # dynamic_pad
22 | input_ids, label = [], []
23 | collate_max_len = 0
24 | attention_mask, token_type_ids = [], []
25 |
26 | # get maxlen for a batch
27 | for data in batch:
28 | collate_max_len = max(collate_max_len, len(data['input_ids']))
29 |
30 | for data in batch:
31 | # padding to maxlen for each data
32 | length = len(data['input_ids'])
33 | input_ids.append(data['input_ids'] + [self.pad_index] * (collate_max_len - length))
34 | if self.for_pretrain:
35 | attention_mask.append(data['attention_mask'] + [self.pad_index] * (collate_max_len - length))
36 | token_type_ids.append(data['token_type_ids'] + [self.pad_index] * (collate_max_len - length))
37 | if len(data) >= 2:
38 | label.append(data['label'])
39 | input_ids = torch.tensor(input_ids, dtype=torch.long)
40 | result = {'input_ids': input_ids}
41 | if label:
42 | label = torch.tensor(label, dtype=torch.long)
43 | result['label'] = label
44 | if self.for_pretrain:
45 | attention_mask = torch.tensor(attention_mask, dtype=torch.long)
46 | token_type_ids = torch.tensor(token_type_ids, dtype=torch.long)
47 | result['attention_mask'] = attention_mask
48 | result['token_type_ids'] = token_type_ids
49 | return result
50 |
51 |
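# A minimal usage sketch (mirrors process_data_dl.py and process_data_pretrain.py; `dataset`,
# `batch_size`, and `pad_index` are placeholder names):
#
#   collate_fn = collater(pad_index=pad_index)               # word-id inputs (pad_index = PAD token id)
#   collate_fn = collater(pad_index=0, for_pretrain=True)    # tokenizer inputs: attention_mask /
#                                                            # token_type_ids are padded with 0 as well
#   loader = DataLoader(dataset, batch_size=batch_size, shuffle=True, collate_fn=collate_fn)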
--------------------------------------------------------------------------------
/trick/early_stop.py:
--------------------------------------------------------------------------------
1 | #!/usr/bin/env python
2 | # -*- encoding: utf-8 -*-
3 | '''
4 | @File : early_stop.py
5 | @Time : 2023/02/09 16:21:34
6 | @Author : Huang zh
7 | @Contact : jacob.hzh@qq.com
8 | @Version : 0.1
9 | @Desc    :   early stopping
10 | '''
11 |
12 | # Early stopping
13 | class EarlyStopping:
14 | def __init__(self, patience=10, delta=0):
15 | self.patience = patience
16 | self.counter = 0
17 | self.best_score = None
18 | self.early_stop = False
19 | self.delta = delta
20 |
21 | def __call__(self, val_loss):
22 | score = -val_loss
23 | if self.best_score is None:
24 | self.best_score = score
25 | elif score < self.best_score + self.delta:
26 | self.counter += 1
27 | if self.counter >= self.patience:
28 | self.early_stop = True
29 | else:
30 | self.best_score = score
31 | self.counter = 0
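
A minimal, self-contained sketch of how EarlyStopping is meant to be driven; the hard-coded validation losses below just stand in for a real `evaluate()` call, and the import path assumes the script is run from the repo root.

```
# Hypothetical usage of trick/early_stop.py with fake validation losses.
from trick.early_stop import EarlyStopping

early_stopping = EarlyStopping(patience=3, delta=0)
val_losses = [0.90, 0.80, 0.79, 0.795, 0.80, 0.81, 0.82]  # stand-in for evaluate(model, dev_loader)

for epoch, val_loss in enumerate(val_losses):
    early_stopping(val_loss)          # feed the latest validation loss
    if early_stopping.early_stop:     # patience exhausted without improvement
        print(f'early stopping triggered at epoch {epoch}')
        break
```

In a real loop the call sits right after validation; once `early_stop` flips to True you break out of training (and typically reload the best checkpoint you saved separately).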
--------------------------------------------------------------------------------
/trick/fgm_pgd_ema.py:
--------------------------------------------------------------------------------
1 | # !usr/bin/env python
2 | # -*- coding:utf-8 -*-
3 |
4 | '''
5 | Author : Huang zh
6 | Email : jacob.hzh@qq.com
7 | Date : 2023-03-13 17:45:35
8 | LastEditTime : 2023-03-13 17:45:36
9 | FilePath : \\codes\\trick\\fgm_pgd_ema.py
10 | Description : FGM, PGD, EMA
11 | '''
12 |
13 | import torch
14 |
15 |
16 | class FGM:
17 | '''
18 |     For each input x:
19 |     1. Compute the forward loss on x and backpropagate to obtain the gradients.
20 |     2. Compute the perturbation r from the gradient of the embedding matrix and add it to the current embedding, i.e. x + r.
21 |     3. Compute the forward loss on x + r and backpropagate, accumulating the adversarial gradients onto those from step 1.
22 |     4. Restore the embedding to its value from step 1.
23 |     5. Update the parameters with the accumulated gradients from step 3.
24 | '''
25 |
26 | def __init__(self, model):
27 | self.model = model
28 | self.backup = {}
29 |
30 | def attack(self, epsilon=0.5, emb_name='word_embeddings'):
31 |         # emb_name should match (a substring of) the embedding parameter name in your model
32 | for name, param in self.model.named_parameters():
33 | if param.requires_grad and emb_name in name:
34 | self.backup[name] = param.data.clone()
35 | norm = torch.norm(param.grad)
36 | if norm != 0:
37 | r_at = epsilon * param.grad / norm
38 | param.data.add_(r_at)
39 |
40 | def restore(self, emb_name='word_embeddings'):
41 |         # emb_name should match (a substring of) the embedding parameter name in your model
42 | for name, param in self.model.named_parameters():
43 | if param.requires_grad and emb_name in name:
44 | assert name in self.backup
45 | param.data = self.backup[name]
46 | self.backup = {}
47 |
48 |
49 | class PGD:
50 | def __init__(self, model, eps=1., alpha=0.3):
51 | self.model = (
52 | model.module if hasattr(model, "module") else model
53 | )
54 | self.eps = eps
55 | self.alpha = alpha
56 | self.emb_backup = {}
57 | self.grad_backup = {}
58 |
59 | def attack(self, emb_name='embeddings', is_first_attack=False):
60 | for name, param in self.model.named_parameters():
61 | if param.requires_grad and emb_name in name:
62 | if is_first_attack:
63 | self.emb_backup[name] = param.data.clone()
64 | norm = torch.norm(param.grad)
65 | if norm != 0 and not torch.isnan(norm):
66 | r_at = self.alpha * param.grad / norm
67 | param.data.add_(r_at)
68 | param.data = self.project(name, param.data)
69 |
70 | def restore(self, emb_name='embeddings'):
71 | for name, param in self.model.named_parameters():
72 | if param.requires_grad and emb_name in name:
73 | assert name in self.emb_backup
74 | param.data = self.emb_backup[name]
75 | self.emb_backup = {}
76 |
77 | def project(self, param_name, param_data):
78 | r = param_data - self.emb_backup[param_name]
79 | if torch.norm(r) > self.eps:
80 | r = self.eps * r / torch.norm(r)
81 | return self.emb_backup[param_name] + r
82 |
83 | def backup_grad(self):
84 | for name, param in self.model.named_parameters():
85 | if param.requires_grad and param.grad is not None:
86 | self.grad_backup[name] = param.grad.clone()
87 |
88 | def restore_grad(self):
89 | for name, param in self.model.named_parameters():
90 | if param.requires_grad and param.grad is not None:
91 | param.grad = self.grad_backup[name]
92 |
93 |
94 | class EMA:
95 | def __init__(self, model, decay):
96 | self.model = model
97 | self.decay = decay
98 | self.shadow = {}
99 | self.backup = {}
100 |
101 | def register(self):
102 | for name, param in self.model.named_parameters():
103 | if param.requires_grad:
104 | self.shadow[name] = param.data.clone()
105 |
106 | def update(self):
107 | for name, param in self.model.named_parameters():
108 | if param.requires_grad:
109 | assert name in self.shadow
110 | new_average = (1.0 - self.decay) * param.data + self.decay * self.shadow[name]
111 | self.shadow[name] = new_average.clone()
112 |
113 | def apply_shadow(self):
114 | for name, param in self.model.named_parameters():
115 | if param.requires_grad:
116 | assert name in self.shadow
117 | self.backup[name] = param.data
118 | param.data = self.shadow[name]
119 |
120 | def restore(self):
121 | for name, param in self.model.named_parameters():
122 | if param.requires_grad:
123 | assert name in self.backup
124 | param.data = self.backup[name]
125 | self.backup = {}
126 |
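
The five steps in the FGM docstring map onto a training step as sketched below. Everything here (the ToyClassifier, the random batch, the optimizer) is a hypothetical stand-in; only the FGM calls come from this file, and `emb_name='word_embeddings'` assumes a BERT-style embedding parameter name.

```
# Hypothetical FGM adversarial-training step mirroring the 5 steps in the docstring above.
import torch
import torch.nn as nn
from trick.fgm_pgd_ema import FGM

class ToyClassifier(nn.Module):
    """Tiny stand-in model whose embedding parameter name contains 'word_embeddings'."""
    def __init__(self, vocab_size=100, dim=16, num_labels=2):
        super().__init__()
        self.word_embeddings = nn.Embedding(vocab_size, dim)
        self.fc = nn.Linear(dim, num_labels)

    def forward(self, input_ids):
        return self.fc(self.word_embeddings(input_ids).mean(dim=1))

model = ToyClassifier()
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3)
loss_fn = nn.CrossEntropyLoss()
fgm = FGM(model)

input_ids = torch.randint(0, 100, (4, 10))  # fake batch of 4 sequences, length 10
labels = torch.randint(0, 2, (4,))

loss = loss_fn(model(input_ids), labels)
loss.backward()                                      # 1. clean forward/backward
fgm.attack(epsilon=0.5, emb_name='word_embeddings')  # 2. add perturbation r to the embedding
loss_fn(model(input_ids), labels).backward()         # 3. adversarial gradients accumulate
fgm.restore(emb_name='word_embeddings')              # 4. put the original embedding back
optimizer.step()                                     # 5. update with the accumulated gradients
optimizer.zero_grad()
```

PGD follows the same pattern but runs `attack()` for several small steps (backing up and restoring the clean gradients with `backup_grad()` / `restore_grad()`), while EMA is `register()`ed once after the model is built, `update()`d after every optimizer step, and wrapped around evaluation with `apply_shadow()` / `restore()`.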
--------------------------------------------------------------------------------
/trick/init_model.py:
--------------------------------------------------------------------------------
1 | #!/usr/bin/env python
2 | # -*- encoding: utf-8 -*-
3 | '''
4 | @File : init_model.py
5 | @Time : 2023/02/09 14:26:06
6 | @Author : Huang zh
7 | @Contact : jacob.hzh@qq.com
8 | @Version : 0.1
9 | @Desc : Weight initialization schemes for DL networks
10 | '''
11 |
12 | import torch.nn as nn
13 |
14 |
15 | def init_network(model, method='xavier', exclude='embedding'):
16 |     # Weight initialization: different schemes lead to different accuracy and convergence time
17 |     # default: xavier
18 |     # xavier: Xavier initialization, a widely used and effective scheme for neural networks
19 |     # kaiming: Kaiming (He) initialization
20 |     # normal_: initialization from a normal distribution
21 | for name, w in model.named_parameters():
22 | if exclude not in name:
23 | if 'weight' in name and 'layernorm' not in name:
24 | if method == 'xavier':
25 | nn.init.xavier_normal_(w)
26 | elif method == 'kaiming':
27 | nn.init.kaiming_normal_(w)
28 | else:
29 | nn.init.normal_(w)
30 | elif 'bias' in name:
31 | nn.init.constant_(w, 0)
32 | else:
33 | pass
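
A quick sketch of applying `init_network` right after a model is constructed; `ToyNet` is illustrative only. With the default `exclude='embedding'` the embedding weights are left untouched, other weights get Xavier initialization, and biases are zeroed.

```
# Hypothetical usage of trick/init_model.py on a small text classifier.
import torch.nn as nn
from trick.init_model import init_network

class ToyNet(nn.Module):
    def __init__(self, vocab_size=1000, emb_dim=128, hidden=64, num_labels=2):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, emb_dim)  # skipped: name contains 'embedding'
        self.lstm = nn.LSTM(emb_dim, hidden, batch_first=True)
        self.fc = nn.Linear(hidden, num_labels)

model = ToyNet()
init_network(model, method='xavier', exclude='embedding')  # or method='kaiming'; anything else falls back to normal_
```

Calling it before loading any pretrained weights matters: running it afterwards would overwrite them.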
--------------------------------------------------------------------------------
/trick/set_all_seed.py:
--------------------------------------------------------------------------------
1 | #!/usr/bin/env python
2 | # -*- encoding: utf-8 -*-
3 | '''
4 | @File : set_all_seed.py
5 | @Time : 2023/02/07 17:37:34
6 | @Author : Huang zh
7 | @Contact : jacob.hzh@qq.com
8 | @Version : 0.1
9 | @Desc : Fix all random seeds
10 | '''
11 |
12 | import os
13 | import torch
14 | import numpy as np
15 | import random
16 | from torch.backends import cudnn
17 |
18 | # Fix all random seeds so results are reproducible
19 | def set_seed(seed):
20 | random.seed(seed)
21 | os.environ["PYTHONHASHSEED"] = str(seed)
22 | np.random.seed(seed)
23 | torch.manual_seed(seed)
24 | torch.cuda.manual_seed(seed)
25 | cudnn.deterministic = True
26 | cudnn.benchmark = False
27 |
28 |
29 |
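
Typical usage is a single call at program start, before the datasets, dataloaders, and model are created; the seed value 42 is just an example.

```
# Hypothetical usage of trick/set_all_seed.py at the top of a training script.
from trick.set_all_seed import set_seed

set_seed(42)  # fixes python, numpy, torch CPU/GPU seeds and makes cuDNN deterministic
# ... build datasets, dataloaders and the model only after this point ...
```

Note that `cudnn.deterministic = True` / `cudnn.benchmark = False` trade some speed for reproducibility, and multi-GPU runs would additionally need `torch.cuda.manual_seed_all(seed)`, which this helper does not call.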
--------------------------------------------------------------------------------
/word2vec_train.py:
--------------------------------------------------------------------------------
1 | # !usr/bin/env python
2 | # -*- coding:utf-8 -*-
3 |
4 | '''
5 | Author : Huang zh
6 | Email : jacob.hzh@qq.com
7 | Date : 2023-03-20 21:13:24
8 | LastEditTime : 2023-03-20 21:21:18
9 | FilePath : \\codes\\word2vec_train.py
10 | Description : Train word2vec word vectors on the tokenized corpus
11 | '''
12 |
13 |
14 |
15 | import os
16 | import pickle
17 | import argparse
18 | from gensim.models import word2vec, keyedvectors
19 | from gensim.models.callbacks import CallbackAny2Vec
20 |
21 |
22 |
23 | def pickle_read(path):
24 | with open(path, 'rb') as f:
25 | data = pickle.load(f)
26 | return data
27 |
28 |
29 |
30 | # callback that prints the training loss after each epoch
31 | class callback(CallbackAny2Vec):
32 | def __init__(self):
33 | self.epoch = 0
34 | self.loss_to_be_subed = 0
35 |
36 | def on_epoch_end(self, model):
37 | loss = model.get_latest_training_loss()
38 | loss_now = loss - self.loss_to_be_subed
39 | self.loss_to_be_subed = loss
40 | print('Loss after epoch {}: {}'.format(self.epoch, loss_now))
41 | self.epoch += 1
42 |
43 | def parse_args():
44 | parser = argparse.ArgumentParser()
45 | parser.add_argument('-i', '--input_dir', help='input dir name', default='./result_pickle_1')
46 | parser.add_argument('-o', '--outputfile', help='output file name', default='./result_pickle_1')
47 | args = parser.parse_args()
48 | print(args)
49 | return args
50 |
51 |
52 | def word2vec_train(sentences, only_vec=False):
53 | if only_vec:
54 | if os.path.exists("w2v_vec_300.bin.gz"):
55 | model = keyedvectors.load_word2vec_format("w2v_vec_300.bin.gz", binary=True)
56 | return model
57 | else:
58 | vec_path = 'w2v_vec_300.bin.gz'
59 | # save model, word_vectors
60 |             model = word2vec.Word2Vec(sentences=sentences, min_count=5, vector_size=300, epochs=100, callbacks=[callback()], compute_loss=True, workers=16)
61 | model.wv.save_word2vec_format(vec_path, binary=True)
62 | return model.wv
63 |
64 | else:
65 | if os.path.exists("w2v_model.bin"):
66 | model = word2vec.Word2Vec.load("w2v_model.bin")
67 | else:
68 |             model = word2vec.Word2Vec(sentences=sentences, min_count=5, vector_size=300, epochs=100, callbacks=[callback()], compute_loss=True, workers=16)
69 | model.save("w2v_model.bin")
70 | model.wv.save_word2vec_format('./embed.txt')
71 | return model.wv
72 |
73 |
74 |
75 | def main(args):
76 | # all_data_tokens = word_token(args.input_dir)
77 | with open('./d.pkl', 'rb') as f:
78 | all_data_tokens = pickle.load(f)
79 | print('train begin')
80 | model = word2vec_train(all_data_tokens, only_vec=False)
81 | print('train over')
82 | print(model.get_vector('我'))
83 |
84 | def test_model():
85 | # model = word2vec.Word2Vec.load("w2v_model.bin")
86 | # print(model.wv.get_vector('00'))
87 | model = keyedvectors.load_word2vec_format('w2v_vec_300.bin.gz', binary=True)
88 | print(model.get_vector('我'))
89 |
90 | if __name__ == '__main__':
91 |     args = parse_args()
92 | main(args)
93 | # test_model()
94 |
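
A sketch of consuming the vectors this script saves; the `w2v_vec_300.bin.gz` filename comes from `word2vec_train` above, while building a padded vocabulary and a torch embedding layer from it is an illustrative assumption rather than something this script does.

```
# Hypothetical downstream use of the saved word2vec vectors.
import numpy as np
import torch
from gensim.models import KeyedVectors

wv = KeyedVectors.load_word2vec_format('w2v_vec_300.bin.gz', binary=True)

# vocabulary: index 0 reserved for padding, the rest taken straight from the word2vec vocab
itos = ['<pad>'] + list(wv.index_to_key)
embedding_matrix = np.zeros((len(itos), wv.vector_size), dtype=np.float32)
for idx, word in enumerate(itos[1:], start=1):
    embedding_matrix[idx] = wv.get_vector(word)

# embedding layer initialised from the pretrained vectors (set freeze=False to fine-tune them)
embedding = torch.nn.Embedding.from_pretrained(
    torch.from_numpy(embedding_matrix), freeze=True, padding_idx=0)
print(embedding.weight.shape)  # (len(itos), 300)
```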
--------------------------------------------------------------------------------