├── README.md
├── data_augmentation.py
├── data_utils.py
├── extract_bert_char_embedding.py
├── feature_engineering.py
├── model_feature.py
├── model_final.py
├── model_final_2.py
├── model_multitask.py
├── model_qpm.py
└── post_processing.py

/README.md:
--------------------------------------------------------------------------------
# chip2019_task2_question_pairs_matching

[CHIP 2019 Ping An Medical Technology disease QA transfer learning competition](https://www.biendata.com/competition/chip2019/). At heart this is a question pair matching task, much like Quora Question Pairs. This repository is a BERT baseline built on [huggingface/pytorch-transformers](https://github.com/huggingface/transformers/blob/master/examples/run_glue.py) (the code is fairly redundant); the Chinese pretrained models come from [ymcui/Chinese-BERT-wwm](https://github.com/ymcui/Chinese-BERT-wwm). With limited resources (no GPU...), only a few baselines were run and few submissions made: no tricks, no model ensembling, no hyperparameter search, just 10-fold cross-validation. That alone reaches about 0.878+ on leaderboard A and 0.864+ on leaderboard B (rank 9 on A, rank 11 on B, and rank 7 after removing duplicate accounts and unregistered teams), which is respectable for a baseline.

Project layout and file descriptions:
```
|-- README.md
|-- data_augmentation.py             data augmentation via question-similarity transitivity; also generates the 10-fold cross-validation files
|-- data_utils.py                    reads the data files and converts them into model inputs
|-- extract_bert_char_embedding.py   extracts BERT's character embeddings for use in conventional models such as ESIM; did not work well
|-- feature_engineering.py           extracts char- and word-level tf-idf and other features for each fold
|-- model_qpm.py                     plain BERT sentence-pair classification
|-- model_multitask.py               qpm plus a subtask: concatenate the two questions and predict their disease category
|-- model_feature.py                 qpm plus a dense layer over hand-crafted features
|-- model_final.py                   combines the qpm, category-classification and hand-crafted-feature approaches
|-- model_final_2.py                 essentially identical to model_final.py; only the model saving strategy differs
|-- post_processing.py               post-processing via question-similarity transitivity
|-- data                             the 10-fold cross-validation data, one folder per fold
    |-- noextension                  data without augmentation
        |-- 0
        |-- ...
    |-- THUOCL_medical.txt           Tsinghua's open-source medical lexicon, loaded into jieba for word-level feature extraction
|-- tmp                              model checkpoints, one folder per fold
    |-- 0
    |-- ...
```
`model_final.py` and `model_final_2.py` differ only in how models are saved: the former costs disk space, the latter costs time.

To train:
```
python3 model_final_2.py --model_name_or_path ./chinese_roberta_wwm_ext_pytorch/ --do_train --do_lower_case --data_dir ./data/noextension/ --max_seq_length 128 --per_gpu_train_batch_size 16 --learning_rate 2e-5 --num_train_epochs 5.0 --output_dir ./tmp/ --overwrite_output_dir --evaluate_during_training
```

To predict:
```
python3 model_final_2.py --model_name_or_path ./chinese_roberta_wwm_ext_pytorch/ --do_predict --do_lower_case --data_dir ./data/noextension/ --per_gpu_test_batch_size 16 --output_dir ./tmp/
```

Notes on model performance:
- `RoBERTa-wwm-ext` and `RoBERTa-wwm-ext-large` beat `BERT-wwm-ext` by about 0.005, while the RoBERTa base and large variants differ little from each other, only about 0.002.
- Data augmentation via question-similarity transitivity overfits easily (see the sketch at the end of this README), and the training set has many labeling problems; without cleaning the training set, augmentation actually costs about 0.005. Similarity-transitivity post-processing likewise drops the score by 0.001 to 0.005, so neither augmentation nor post-processing was used.
- Adding the sentence-category classification loss gains about 0.001.
- Adding the dense layer over engineered features is unstable, possibly because the features were poorly chosen.

One more idea that helped: encode the two sentences separately with BERT, apply attention between the two representations, and concatenate the result to the sentence pair's [CLS] output. GPU access ran out before a new RoBERTa-wwm model could be trained that way, so the final submission is still the output of `model_final_2.py`.
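For reference, here is a minimal, self-contained sketch of the similarity-transitivity idea used by `data_augmentation.py` and `post_processing.py` (toy data and simplified logic; the real script also propagates categories and turns cross-category pairs into negatives):

```python
import pandas as pd

# Toy training set: (A, B) and (A, C) are labeled as similar pairs.
train = pd.DataFrame({
    'question1': ['艾滋病怎么治疗', '艾滋病怎么治疗'],
    'question2': ['艾滋病如何治疗', '艾滋病的治疗方法'],
    'label': [1, 1],
})

# For every question, collect the set of questions marked similar to it.
similar = {}
for q1, q2, y in train.itertuples(index=False):
    if y == 1:
        similar.setdefault(q1, set()).add(q2)
        similar.setdefault(q2, set()).add(q1)

# Transitivity: two questions that share a similar neighbour but are not
# already paired form a new positive candidate, here (B, C).
new_pairs = set()
for neighbours in similar.values():
    for a in neighbours:
        for b in neighbours:
            if a < b and b not in similar.get(a, set()):
                new_pairs.add((a, b))
print(new_pairs)
```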
""" 15 | # df = pd.read_csv('./data/aug_train.csv', encoding='utf-8', engine='python') 16 | # df = shuffle(df) 17 | # train_df = df[2000:] 18 | # dev_df = df[:2000] 19 | # train_df.to_csv('./data/origin_train.csv', index=False) 20 | # dev_df.to_csv('./data/dev.csv', index=False) 21 | 22 | """ KFold """ 23 | # parent_directory = './data/extension/' 24 | # df = pd.read_csv(parent_directory + 'final_train.csv', encoding='utf-8', engine='python') 25 | # kFold = KFold(n_splits=10, shuffle=True, random_state=12345) 26 | # folds = kFold.split(df) 27 | # for i in range(10): 28 | # os.makedirs(parent_directory + str(i)) 29 | # for i, (train, dev) in enumerate(folds): 30 | # df.iloc[train].to_csv(parent_directory + str(i) + '/origin_train.csv', index=False) 31 | # df.iloc[dev].to_csv(parent_directory + str(i) + '/dev.csv', index=False) 32 | 33 | """ Data extension by questions similarity. """ 34 | df_train = pd.read_csv('./data/origin_train.csv', encoding='utf-8', engine='python') 35 | q1 = df_train['question1'].values 36 | q2 = df_train['question2'].values 37 | label = df_train['label'].values 38 | category = df_train['category'].values 39 | dict_1 = {} 40 | dict_ct = {} 41 | for i in range(0, df_train.shape[0]): 42 | dict_ct[q1[i]] = category[i] 43 | dict_ct[q2[i]] = category[i] 44 | if label[i] == 1: 45 | if dict_1.get(q1[i], -1) == -1: 46 | dict_1[q1[i]] = [q2[i]] 47 | else: 48 | dict_1[q1[i]].append(q2[i]) 49 | if dict_1.get(q2[i], -1) == -1: 50 | dict_1[q2[i]] = [q1[i]] 51 | else: 52 | dict_1[q2[i]].append(q1[i]) 53 | if i % 5000 == 0: 54 | sys.stdout.flush() 55 | sys.stdout.write('#') 56 | print(len(dict_1)) 57 | 58 | listxy = [] 59 | for x in dict_1: 60 | listx = dict_1[x] 61 | if len(listx) > 1: 62 | listy = listx[:] 63 | random.shuffle(listy) 64 | for x, y in zip(listx, listy): 65 | if x != y and y not in dict_1[x] and x not in dict_1[y]: 66 | if dict_ct[x] != dict_ct[y]: 67 | ct = 'wrong' 68 | listxy.append([x, y, 0, ct]) 69 | else: 70 | ct = dict_ct[x] 71 | listxy.append([x, y, 1, ct]) 72 | print(len(listxy)) 73 | random.shuffle(listxy) 74 | df_ext = pd.DataFrame(listxy) 75 | df_ext.columns = ['question1', 'question2', 'label', 'category'] 76 | df_ext.to_csv('./data/extension/ext_train.csv', index=False) 77 | 78 | """ Produce negative samples and ombine extension dataset. 
""" 79 | df_ext_train = pd.read_csv('./data/extension/ext_train.csv') 80 | temp_q1 = df_train['question1'].values.copy() 81 | temp_q2 = df_train['question2'].values.copy() 82 | np.random.shuffle(temp_q1) 83 | np.random.shuffle(temp_q2) 84 | temp_df = pd.DataFrame() 85 | temp_df['label'] = np.zeros(temp_q1.shape[0], dtype=int) 86 | temp_df['question1'] = temp_q1 87 | temp_df['question2'] = temp_q2 88 | category_col = [] 89 | for i in range(len(temp_df)): 90 | if dict_ct[temp_df.iloc[i]['question1']] == dict_ct[temp_df.iloc[i]['question2']]: 91 | category_col.append(dict_ct[temp_df.iloc[i]['question1']]) 92 | else: 93 | category_col.append('wrong') 94 | temp_df['category'] = category_col 95 | temp_df = temp_df.sample(n=int(df_ext_train.shape[0]*0.8)) 96 | df_train = pd.concat([df_train, df_ext_train, temp_df], sort=False) 97 | df_train = df_train.drop_duplicates(['question1', 'question2']).reset_index(drop=True) 98 | df_train.columns = ['question1', 'question2', 'label', 'category'] 99 | df_train.to_csv('./data/extension/final_train.csv', index=False) 100 | print('Complete.') 101 | 102 | parent_directory = './data/extension/' 103 | df = pd.read_csv(parent_directory + 'final_train.csv', encoding='utf-8', engine='python') 104 | kFold = KFold(n_splits=10, shuffle=True, random_state=12345) 105 | folds = kFold.split(df) 106 | for i in range(10): 107 | os.makedirs(parent_directory + str(i)) 108 | for i, (train, dev) in enumerate(folds): 109 | df.iloc[train].to_csv(parent_directory + str(i) + '/origin_train.csv', index=False) 110 | df.iloc[dev].to_csv(parent_directory + str(i) + '/dev.csv', index=False) 111 | 112 | 113 | category_map = { 114 | 'aids': '艾滋病', 115 | 'breast_cancer': '乳腺癌', 116 | 'diabetes': '糖尿病', 117 | 'hepatitis': '乙肝', 118 | 'hypertension': '高血压' 119 | } 120 | -------------------------------------------------------------------------------- /data_utils.py: -------------------------------------------------------------------------------- 1 | # -*- coding: utf-8 -*- 2 | """ BERT classification fine-tuning: utilities to work with QPM tasks. """ 3 | 4 | import logging 5 | import os 6 | import sys 7 | 8 | import pandas as pd 9 | from sklearn.metrics import f1_score 10 | 11 | logger = logging.getLogger(__name__) 12 | 13 | 14 | class InputExample(object): 15 | """ A single training/test example for question pairs matching task. """ 16 | def __init__(self, guid, question_a, question_b, label=None, category=None, hand_features=None): 17 | """ Constructs a InputExample. 18 | Args: 19 | guid: Unique id for the example. 20 | question_a: string. The untokenized question sentence of the first sequence. 21 | question_b: string. The untokenized question sentence of the second sequence. 22 | label: string. The label of the example. This should be specified for train and dev examples, 23 | but not for test examples 24 | """ 25 | self.guid = guid 26 | self.question_a = question_a 27 | self.question_b = question_b 28 | self.label = label 29 | self.category = category 30 | self.hand_features = hand_features 31 | 32 | 33 | class InputFeatures(object): 34 | """ A single set of features of data. 
""" 35 | def __init__(self, input_ids, input_mask, segment_ids, label_id, 36 | category_clf_input_ids, category_clf_input_mask, category_clf_segment_ids, category_id, hand_features): 37 | self.input_ids = input_ids 38 | self.input_mask = input_mask 39 | self.segment_ids = segment_ids 40 | self.label_id = label_id 41 | self.category_clf_input_ids = category_clf_input_ids 42 | self.category_clf_input_mask = category_clf_input_mask 43 | self.category_clf_segment_ids = category_clf_segment_ids 44 | self.category_id = category_id 45 | self.hand_features = hand_features 46 | 47 | 48 | class DataProcessor(object): 49 | """ Base class for data converters for sequence classfication data sets. """ 50 | def get_examples(self, data_dir, set_type): 51 | """ Gets a collection of `InputExample`s for the train set. """ 52 | raise NotImplementedError() 53 | 54 | def get_labels(self): 55 | """ Gets the list of labels for this data set.""" 56 | raise NotImplementedError() 57 | 58 | @classmethod 59 | def _read_csv(cls, input_file, quotechar=None): 60 | """ Reads a `,` seperated value file. """ 61 | with open(input_file, 'r', encoding='utf-8') as f: 62 | print(input_file) 63 | df = pd.read_csv(f, delimiter=',') 64 | df_feat = pd.read_csv(input_file.replace('.csv', '_feats.csv')) 65 | lines = [] 66 | for index in df.index: 67 | line = df.iloc[index].values 68 | line = line.tolist() 69 | line.append(df_feat.iloc[index]) 70 | lines.append(line) 71 | return lines 72 | 73 | 74 | class QPMProcessor(DataProcessor): 75 | """ Processor for the QPM data set. """ 76 | def get_examples(self, data_dir, set_type): 77 | """ See base class. """ 78 | return self._create_examples( 79 | self._read_csv(os.path.join(data_dir, set_type + '.csv')), set_type) 80 | 81 | def get_labels(self): 82 | """ See base class. """ 83 | return [0, 1] 84 | 85 | def get_categories(self): 86 | return ['aids', 'hypertension', 'hepatitis', 'diabetes', 'breast_cancer', 'wrong'] 87 | 88 | def _create_examples(self, lines, set_type): 89 | """ Creates examples for the training and dev sets. 
""" 90 | examples = [] 91 | 92 | for (i, line) in enumerate(lines): 93 | guid = '%s-%s' % (set_type, i) 94 | try: 95 | if set_type == 'train' or set_type == 'dev': 96 | question_a = line[0] 97 | question_b = line[1] 98 | label = line[2] 99 | category = line[3] 100 | hand_features = line[4] 101 | elif set_type == 'test': 102 | question_a = line[2] 103 | question_b = line[3] 104 | category = line[0] 105 | hand_features = line[4] 106 | label = 0 107 | else: 108 | raise ValueError() 109 | except IndexError: 110 | continue 111 | examples.append(InputExample(guid=guid, question_a=question_a, question_b=question_b, label=label, category=category, hand_features=hand_features)) 112 | return examples 113 | 114 | 115 | def convert_examples_to_features(examples, label_list, category_list, max_seq_length, tokenizer, 116 | cls_token_at_end=False, cls_token='[CLS]', cls_token_segment_id=1, 117 | sep_token='[SEP]', sep_token_extra=False, 118 | pad_on_left=False, pad_token=0, pad_token_segment_id=0, 119 | sequence_a_segment_id=0, sequence_b_segment_id=1, 120 | mask_padding_with_zero=True): 121 | """ Loads a data file into a list of `InputBatch`s 122 | `cls_toekn_at_end` define the location of the CLS token: 123 | - False (Default, BERT/XLM pattern): [CLS] + A + [SEP] + B + [SEP] 124 | - True (XLNet/GPT pattern): A + [SEP] + B + [SEP] + [CLS] 125 | `cls_token_segment_id` define the segment id associated to the CLS token (0 for BERT, 2 for XLNet) 126 | """ 127 | label_map = {label: i for i, label in enumerate(label_list)} 128 | category_map = {category: i for i, category in enumerate(category_list)} 129 | features = [] 130 | for (ex_index, example) in enumerate(examples): 131 | hand_features = example.hand_features 132 | 133 | if ex_index % 10000 == 0: 134 | logger.info('Writing example %d of %d' % (ex_index, len(examples))) 135 | 136 | tokens_a = tokenizer.tokenize(example.question_a) 137 | tokens_b = tokenizer.tokenize(example.question_b) 138 | # Modifies `tokens_a` and `tokens_b` in place so that the total length is less than the specified length. 139 | # Account for [CLS], [SEP], [SEP] with '- 3'. '- 4' for ro RoBERTa 140 | special_tokens_count = 4 if sep_token_extra else 3 141 | _truncate_seq_pair(tokens_a, tokens_b, max_seq_length-special_tokens_count) 142 | 143 | # The convention in BERT is: 144 | # (a) For sequence pairs: 145 | # tokens: [CLS] is this jack ##son ##ville ? [SEP] no it is not . [SEP] 146 | # type_ids: 0 0 0 0 0 0 0 0 1 1 1 1 1 1 147 | # (b) For single sequences: 148 | # tokens: [CLS] the dog is hairy . [SEP] 149 | # type_ids: 0 0 0 0 0 0 0 150 | # 151 | # Where "type_ids" are used to indicate whether this is the first 152 | # sequence or the second sequence. The embedding vectors for `type=0` and 153 | # `type=1` were learned during pre-training and are added to the wordpiece 154 | # embedding vector (and position vector). This is not *strictly* necessary 155 | # since the [SEP] token unambiguously separates the sequences, but it makes 156 | # it easier for the model to learn the concept of sequences. 157 | # 158 | # For classification tasks, the first vector (corresponding to [CLS]) is 159 | # used as as the "sentence vector". Note that this only makes sense because 160 | # the entire model is fine-tuned. 
        tokens = tokens_a + [sep_token]
        segment_ids = [sequence_a_segment_id] * len(tokens)
        tokens += tokens_b + [sep_token]
        segment_ids += [sequence_b_segment_id] * (len(tokens_b) + 1)

        # A second input for the category-classification subtask: both
        # questions concatenated into a single segment.
        category_clf_tokens = tokens_a + tokens_b
        special_tokens_count = 3 if sep_token_extra else 2
        if len(category_clf_tokens) > max_seq_length - special_tokens_count:
            category_clf_tokens = category_clf_tokens[:(max_seq_length - special_tokens_count)]
        category_clf_tokens += [sep_token]
        category_clf_segment_ids = [sequence_a_segment_id] * len(category_clf_tokens)

        if cls_token_at_end:
            tokens = tokens + [cls_token]
            segment_ids = segment_ids + [cls_token_segment_id]
        else:
            tokens = [cls_token] + tokens
            segment_ids = [cls_token_segment_id] + segment_ids

        category_clf_tokens = [cls_token] + category_clf_tokens
        category_clf_segment_ids = [cls_token_segment_id] + category_clf_segment_ids

        input_ids = tokenizer.convert_tokens_to_ids(tokens)
        category_clf_input_ids = tokenizer.convert_tokens_to_ids(category_clf_tokens)

        # The mask has 1 for real tokens and 0 for padding tokens. Only real
        # tokens are attended to.
        input_mask = [1 if mask_padding_with_zero else 0] * len(input_ids)
        category_clf_input_mask = [1 if mask_padding_with_zero else 0] * len(category_clf_input_ids)

        # Zero-pad up to the sequence length.
        padding_length = max_seq_length - len(input_ids)
        category_clf_padding_length = max_seq_length - len(category_clf_input_ids)
        if pad_on_left:
            input_ids = ([pad_token] * padding_length) + input_ids
            input_mask = ([0 if mask_padding_with_zero else 1] * padding_length) + input_mask
            segment_ids = ([pad_token_segment_id] * padding_length) + segment_ids

            category_clf_input_ids = ([pad_token] * category_clf_padding_length) + category_clf_input_ids
            category_clf_input_mask = ([0 if mask_padding_with_zero else 1] * category_clf_padding_length) + category_clf_input_mask
            category_clf_segment_ids = ([pad_token_segment_id] * category_clf_padding_length) + category_clf_segment_ids
        else:
            input_ids = input_ids + ([pad_token] * padding_length)
            input_mask = input_mask + ([0 if mask_padding_with_zero else 1] * padding_length)
            segment_ids = segment_ids + ([pad_token_segment_id] * padding_length)

            category_clf_input_ids = category_clf_input_ids + ([pad_token] * category_clf_padding_length)
            category_clf_input_mask = category_clf_input_mask + ([0 if mask_padding_with_zero else 1] * category_clf_padding_length)
            category_clf_segment_ids = category_clf_segment_ids + ([pad_token_segment_id] * category_clf_padding_length)

        assert len(input_ids) == max_seq_length
        assert len(input_mask) == max_seq_length
        assert len(segment_ids) == max_seq_length
        assert len(category_clf_input_ids) == max_seq_length
        assert len(category_clf_input_mask) == max_seq_length
        assert len(category_clf_segment_ids) == max_seq_length

        label_id = label_map[example.label]
        category_id = category_map[example.category]

        if ex_index < 5:
            logger.info("*** Example ***")
            logger.info("guid: %s" % (example.guid))
            logger.info("tokens: %s" % " ".join([str(x) for x in tokens]))
            logger.info("input_ids: %s" % " ".join([str(x) for x in input_ids]))
            logger.info("input_mask: %s" % " ".join([str(x) for x in input_mask]))
| logger.info("segment_ids: %s" % " ".join([str(x) for x in segment_ids])) 229 | logger.info("hand_features: %s" % " ".join([str(x) for x in hand_features])) 230 | if example.label is not None: 231 | logger.info("label: %s (id = %d)" % (example.label, label_id)) 232 | 233 | features.append( 234 | InputFeatures(input_ids=input_ids, 235 | input_mask=input_mask, 236 | segment_ids=segment_ids, 237 | label_id=label_id, 238 | category_clf_input_ids=category_clf_input_ids, 239 | category_clf_input_mask=category_clf_input_mask, 240 | category_clf_segment_ids=category_clf_segment_ids, 241 | category_id=category_id, 242 | hand_features=hand_features 243 | )) 244 | return features 245 | 246 | 247 | def _truncate_seq_pair(tokens_a, tokens_b, max_length): 248 | """ Truncates a sequence pair in place to the maximum length. """ 249 | # This is a simple heuristic which will always truncate the longer sequence 250 | # one token at a time. This makes more sense than truncating an equal percent 251 | # of tokens from each, since if one sequence is very short then each token 252 | # that's truncated likely contains more information that a longer sequence. 253 | while True: 254 | total_length = len(tokens_a) + len(tokens_b) 255 | if total_length <= max_length: 256 | break 257 | if len(tokens_a) > len(tokens_b): 258 | tokens_a.pop() 259 | else: 260 | tokens_b.pop() 261 | 262 | 263 | def compute_metrics(preds, labels): 264 | assert len(preds) == len(labels) 265 | return acc_and_f1(preds, labels) 266 | 267 | 268 | def acc_and_f1(preds, labels): 269 | acc = simple_accuracy(preds, labels) 270 | f1 = f1_score(y_true=labels, y_pred=preds) 271 | return { 272 | "acc": acc, 273 | "f1": f1, 274 | "acc_and_f1": (acc + f1) / 2, 275 | } 276 | 277 | 278 | def simple_accuracy(preds, labels): 279 | return (preds == labels).mean() 280 | -------------------------------------------------------------------------------- /extract_bert_char_embedding.py: -------------------------------------------------------------------------------- 1 | # -*- coding: utf-8 -*- 2 | """ Extract the pretrained character level embedding from BERT hidden outputs. """ 3 | import re 4 | 5 | import numpy as np 6 | from pytorch_transformers import BertTokenizer, BertModel 7 | 8 | 9 | if __name__ == '__main__': 10 | print('# Load pretrained model tokenizer.') 11 | tokenizer = BertTokenizer.from_pretrained('./bert_wwm/') 12 | 13 | print('# Construct vocab.') 14 | vocab = [token for token in tokenizer.vocab] 15 | 16 | print('# Load pretrained model.') 17 | model = BertModel.from_pretrained('./bert_wwm') 18 | 19 | print('# Load word embeddings') 20 | emb = model.embeddings.word_embeddings.weight.data 21 | emb = emb.numpy() 22 | 23 | print('# Write') 24 | with open('{}.{}.{}d.vec'.format('bert_wwm', len(vocab), emb.shape[-1]), 'w', encoding='utf-8') as fout: 25 | fout.write('{} {}\n'.format(len(vocab), emb.shape[-1])) 26 | assert len(vocab) == len(emb), 'The number of vocab and embeddings MUST be identical.' 
--------------------------------------------------------------------------------
/extract_bert_char_embedding.py:
--------------------------------------------------------------------------------
# -*- coding: utf-8 -*-
""" Extract the pretrained character level embeddings from BERT's embedding layer. """
import re

import numpy as np
from pytorch_transformers import BertTokenizer, BertModel


if __name__ == '__main__':
    print('# Load pretrained model tokenizer.')
    tokenizer = BertTokenizer.from_pretrained('./bert_wwm/')

    print('# Construct vocab.')
    vocab = [token for token in tokenizer.vocab]

    print('# Load pretrained model.')
    model = BertModel.from_pretrained('./bert_wwm')

    print('# Load word embeddings.')
    emb = model.embeddings.word_embeddings.weight.data
    emb = emb.numpy()

    print('# Write.')
    with open('{}.{}.{}d.vec'.format('bert_wwm', len(vocab), emb.shape[-1]), 'w', encoding='utf-8') as fout:
        fout.write('{} {}\n'.format(len(vocab), emb.shape[-1]))
        assert len(vocab) == len(emb), 'The number of vocab and embeddings MUST be identical.'
        for token, e in zip(vocab, emb):
            e = np.array2string(e, max_line_width=np.inf)[1:-1]
            e = re.sub('[ ]+', ' ', e)
            fout.write('{} {}\n'.format(token, e))
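
# The output follows the word2vec text format, so it can be consumed by common
# loaders. Illustrative sketch (assumes gensim is installed; the file name
# depends on the vocab and embedding sizes written above):
# from gensim.models import KeyedVectors
# vectors = KeyedVectors.load_word2vec_format('bert_wwm.21128.768d.vec')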
--------------------------------------------------------------------------------
/feature_engineering.py:
--------------------------------------------------------------------------------
# -*- coding: utf-8 -*-

import jieba
import numpy as np
import pandas as pd

from sklearn.feature_extraction.text import TfidfVectorizer
from tqdm import tqdm


def get_len_diff(data):
    """
    Get the difference in length of a question pair, normalized by the length
    of the longer question.
    """
    q1_len = data.question1.apply(lambda x: len(x.split(' '))).values
    q2_len = data.question2.apply(lambda x: len(x.split(' '))).values
    len_diff = np.abs(q1_len - q2_len) / np.max([q1_len, q2_len], axis=0)
    return len_diff


def get_num_common_units(data):
    """
    Get the number of units (words or chars) common to q1 and q2.
    """
    q1_unit_set = data.question1.apply(lambda x: x.split(' ')).apply(set).values
    q2_unit_set = data.question2.apply(lambda x: x.split(' ')).apply(set).values
    result = [len(q1_unit_set[i] & q2_unit_set[i]) for i in range(len(q1_unit_set))]
    result = pd.DataFrame(result, index=data.index)
    result.columns = ['num_common_units']
    return result


def get_common_units_ratio(data):
    q1_unit_set = data.question1.apply(lambda x: x.split(' ')).apply(set).values
    q2_unit_set = data.question2.apply(lambda x: x.split(' ')).apply(set).values
    q1_len = data.question1.apply(lambda x: len(x.split(' '))).values
    q2_len = data.question2.apply(lambda x: len(x.split(' '))).values
    result = [len(q1_unit_set[i] & q2_unit_set[i]) / max(q1_len[i], q2_len[i]) for i in range(len(q1_unit_set))]
    result = pd.DataFrame(result, index=data.index)
    result.columns = ['common_units_ratio']
    return result


def get_tfidf_vector(data, vectorizer):
    q1_tfidf = vectorizer.transform(data.question1.values)
    q2_tfidf = vectorizer.transform(data.question2.values)
    return vectorizer.vocabulary_, q1_tfidf, q2_tfidf


def adjust_common_units_ratio_by_tfidf(data, unit2index, q1_tfidf, q2_tfidf):
    """ Common-unit ratio where every unit is weighted by its tf-idf score. """
    adjusted_common_units_ratio = []
    for i in range(q1_tfidf.shape[0]):
        q1_units = {}
        q2_units = {}
        for unit in data.loc[i, 'question1'].lower().split():
            q1_units[unit] = q1_units.get(unit, 0) + 1
        for unit in data.loc[i, 'question2'].lower().split():
            q2_units[unit] = q2_units.get(unit, 0) + 1

        sum_shared_unit_in_q1 = sum([q1_units[u] * q1_tfidf[i, unit2index[u]] for u in q1_units if u in q2_units])
        sum_shared_unit_in_q2 = sum([q2_units[u] * q2_tfidf[i, unit2index[u]] for u in q2_units if u in q1_units])
        sum_total = sum([q1_units[u] * q1_tfidf[i, unit2index[u]] for u in q1_units]) +\
                    sum([q2_units[u] * q2_tfidf[i, unit2index[u]] for u in q2_units])
        if 1e-6 > sum_total:
            adjusted_common_units_ratio.append(0.)
        else:
            adjusted_common_units_ratio.append(1.0 * (sum_shared_unit_in_q1 + sum_shared_unit_in_q2) / sum_total)
    return adjusted_common_units_ratio


def generate_powerful_unit(data):
    """
    Calculate the influence of each unit:
        0. the number of question pairs the unit appears in
        1. the ratio of question pairs the unit appears in
        2. the ratio of question pairs labeled 1 that the unit appears in
        3. the ratio of pairs where the unit appears in only one question
        4. the ratio of pairs where the unit appears in only one question and the pair is labeled 1
        5. the ratio of pairs where the unit appears in both questions
        6. the ratio of pairs where the unit appears in both questions and the pair is labeled 1
    """
    units_power = {}
    for i in data.index:
        label = int(data.loc[i, 'label'])
        q1_units = list(data.loc[i, 'question1'].lower().split())
        q2_units = list(data.loc[i, 'question2'].lower().split())
        all_units = set(q1_units + q2_units)
        q1_units = set(q1_units)
        q2_units = set(q2_units)

        for unit in all_units:
            if unit not in units_power:
                units_power[unit] = [0. for _ in range(7)]
            units_power[unit][0] += 1.
            units_power[unit][1] += 1.

            if (unit in q1_units and unit not in q2_units) or (unit not in q1_units and unit in q2_units):
                units_power[unit][3] += 1.
                if 1 == label:
                    units_power[unit][2] += 1.
                    units_power[unit][4] += 1.

            if unit in q1_units and unit in q2_units:
                units_power[unit][5] += 1.
                if 1 == label:
                    units_power[unit][2] += 1.
                    units_power[unit][6] += 1.

    for unit in units_power:
        # convert the raw counts to ratios
        units_power[unit][1] /= data.shape[0]
        units_power[unit][2] /= data.shape[0]
        if units_power[unit][3] > 1e-6:
            units_power[unit][4] /= units_power[unit][3]
        units_power[unit][3] /= units_power[unit][0]
        if units_power[unit][5] > 1e-6:
            units_power[unit][6] /= units_power[unit][5]
        units_power[unit][5] /= units_power[unit][0]

    sorted_units_power = sorted(units_power.items(), key=lambda d: d[1][0], reverse=True)
    return sorted_units_power
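
# Illustrative reading of units_power (made-up numbers): if a unit occurs in
# 100 pairs, appears on both sides in 40 of them, and 30 of those 40 pairs are
# labeled 1, then after normalization units_power[unit][5] == 0.4 and
# units_power[unit][6] == 0.75.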
def powerful_units_dside_tag(punit, data, threshold_num, threshold_rate):
    """
    If a powerful unit appears in both questions, the tag is set to 1, otherwise 0.
    """
    punit_dside = []
    punit = filter(lambda x: x[1][0] * x[1][5] >= threshold_num, punit)
    punit_sort = sorted(punit, key=lambda d: d[1][6], reverse=True)
    punit_dside.extend(map(lambda x: x[0], filter(lambda x: x[1][6] >= threshold_rate, punit_sort)))

    punit_dside_tags = []
    for i in data.index:
        tags = []
        q1_units = set(data.loc[i, 'question1'].lower().split())
        q2_units = set(data.loc[i, 'question2'].lower().split())
        for unit in punit_dside:
            if unit in q1_units and unit in q2_units:
                tags.append(1.0)
            else:
                tags.append(0.0)
        punit_dside_tags.append(tags)
    return punit_dside, punit_dside_tags


def powerful_units_oside_tag(punit, data, threshold_num, threshold_rate):
    """
    If a powerful unit appears in exactly one of the two questions, the tag is
    set to 1, otherwise 0.
    """
    punit_oside = []
    punit = filter(lambda x: x[1][0] * x[1][3] >= threshold_num, punit)
    punit_oside.extend(map(lambda x: x[0], filter(lambda x: x[1][4] >= threshold_rate, punit)))

    punit_oside_tags = []
    for i in data.index:
        tags = []
        q1_units = set(data.loc[i, 'question1'].lower().split())
        q2_units = set(data.loc[i, 'question2'].lower().split())
        for unit in punit_oside:
            if unit in q1_units and unit not in q2_units:
                tags.append(1.0)
            elif unit not in q1_units and unit in q2_units:
                tags.append(1.0)
            else:
                tags.append(0.0)
        punit_oside_tags.append(tags)
    return punit_oside, punit_oside_tags


def powerful_units_dside_rate(sorted_units_power, punit_dside, data):
    units_power = dict(sorted_units_power)
    punit_dside_rate = []
    for i in data.index:
        rate = 1.0
        q1_units = set(data.loc[i, 'question1'].lower().split())
        q2_units = set(data.loc[i, 'question2'].lower().split())
        share_units = list(q1_units.intersection(q2_units))
        for unit in share_units:
            if unit in punit_dside:
                rate *= (1.0 - units_power[unit][6])
        punit_dside_rate.append(1 - rate)
    return punit_dside_rate


def powerful_units_oside_rate(sorted_units_power, punit_oside, data):
    units_power = dict(sorted_units_power)
    punits_oside_rate = []
    for i in data.index:
        rate = 1.0
        q1_units = set(data.loc[i, 'question1'].lower().split())
        q2_units = set(data.loc[i, 'question2'].lower().split())
        q1_diff = list(set(q1_units).difference(set(q2_units)))
        q2_diff = list(set(q2_units).difference(set(q1_units)))
        all_diff = set(q1_diff + q2_diff)
        for unit in all_diff:
            if unit in punit_oside:
                rate *= (1.0 - units_power[unit][4])
        punits_oside_rate.append(1 - rate)
    return punits_oside_rate


def edit_distance(q1, q2):
    """ Token-level edit distance; an adjacent-token transposition is not penalized. """
    str1 = q1.split(' ')
    str2 = q2.split(' ')
    matrix = [[i + j for j in range(len(str2) + 1)] for i in range(len(str1) + 1)]
    for i in range(1, len(str1) + 1):
        for j in range(1, len(str2) + 1):
            if str1[i - 1] == str2[j - 1]:
                d = 0
            else:
                d = 1
            matrix[i][j] = min(matrix[i - 1][j] + 1, matrix[i][j - 1] + 1, matrix[i - 1][j - 1] + d)
            if j > i > 1 and str1[i - 1] == str2[j - 2] and str1[i - 2] == str2[j - 1]:
                d = 0
                matrix[i][j] = min(matrix[i][j], matrix[i - 2][j - 2] + d)
    return matrix[len(str1)][len(str2)]


def get_edit_distance(data):
    q1_len = data['question1'].apply(lambda x: len(list(x.split(' ')))).values
    q2_len = data['question2'].apply(lambda x: len(list(x.split(' ')))).values

    dist = [edit_distance(data.loc[i, 'question1'], data.loc[i, 'question2']) / np.max([q1_len, q2_len], axis=0)[i] for i in data.index]
    return dist
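
# Example: edit_distance('糖尿病 的 症状', '糖尿病 症状') == 1 (one token
# deletion); get_edit_distance then divides by the longer question length,
# giving 1/3 for this pair.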
def generate_split_chars():
    for mode in ['train', 'test']:
        df_temp = pd.read_csv('./data/noextension/' + mode + '.csv', encoding='utf-8', engine='python')
        question1 = df_temp.question1.apply(lambda x: ' '.join(list(x.replace(' ', ''))))
        question2 = df_temp.question2.apply(lambda x: ' '.join(list(x.replace(' ', ''))))
        df_corpus = pd.DataFrame({
            'question1': question1,
            'question2': question2,
        })
        df_corpus.to_csv('./data/noextension/' + mode + '_corpus_char.csv', index=False)

    for i in range(10):
        for mode in ['train', 'dev', 'test']:
            df_temp = pd.read_csv('./data/noextension/' + str(i) + '/' + mode + '.csv', encoding='utf-8', engine='python')
            question1 = df_temp.question1.apply(lambda x: ' '.join(list(x.replace(' ', ''))))
            question2 = df_temp.question2.apply(lambda x: ' '.join(list(x.replace(' ', ''))))

            if mode == 'train':
                df_corpus = pd.DataFrame({
                    'question1': question1,
                    'question2': question2,
                    'label': df_temp.label
                })
            else:
                df_corpus = pd.DataFrame({
                    'question1': question1,
                    'question2': question2,
                })
            df_corpus.to_csv('./data/noextension/' + str(i) + '/' + mode + '_corpus_char.csv', index=False)


def generate_split_words():
    for mode in ['train', 'test']:
        df_temp = pd.read_csv('./data/noextension/' + mode + '.csv', encoding='utf-8', engine='python')
        question1 = df_temp.question1.apply(lambda x: ' '.join(jieba.cut(x.replace(' ', ''))))
        question2 = df_temp.question2.apply(lambda x: ' '.join(jieba.cut(x.replace(' ', ''))))
        df_corpus = pd.DataFrame({
            'question1': question1,
            'question2': question2,
        })
        df_corpus.to_csv('./data/noextension/' + mode + '_corpus_word.csv', index=False)

    for i in range(10):
        for mode in ['train', 'dev', 'test']:
            df_temp = pd.read_csv('./data/noextension/' + str(i) + '/' + mode + '.csv', encoding='utf-8', engine='python')
            question1 = df_temp.question1.apply(lambda x: ' '.join(jieba.cut(x.replace(' ', ''))))
            question2 = df_temp.question2.apply(lambda x: ' '.join(jieba.cut(x.replace(' ', ''))))

            if mode == 'train':
                df_corpus = pd.DataFrame({
                    'question1': question1,
                    'question2': question2,
                    'label': df_temp.label
                })
            else:
                df_corpus = pd.DataFrame({
                    'question1': question1,
                    'question2': question2,
                })
            df_corpus.to_csv('./data/noextension/' + str(i) + '/' + mode + '_corpus_word.csv', index=False)
def generate_features_csv():
    # Prepare and load the THUOCL medical lexicon: jieba's load_userdict
    # expects space-separated entries, while THUOCL ships tab-separated ones.
    with open('./data/THUOCL_medical.txt', 'r', encoding='utf-8') as f:
        content = f.read()
    with open('./data/THUOCL_medical.txt', 'w', encoding='utf-8') as f:
        content = content.replace('\t', ' ')
        f.write(content)
    jieba.load_userdict('./data/THUOCL_medical.txt')

    print('*' * 10 + ' Generating chars corpus file ' + '*' * 10)
    generate_split_chars()
    print('*' * 10 + ' Generating words corpus file ' + '*' * 10)
    generate_split_words()

    # Fit the tf-idf vectorizers on the full train + test corpora.
    all_train_data = pd.read_csv('./data/noextension/train_corpus_char.csv', encoding='utf-8', engine='python')
    corpus = list(all_train_data.question1) + list(all_train_data.question2)
    all_test_data = pd.read_csv('./data/noextension/test_corpus_char.csv', encoding='utf-8', engine='python')
    corpus += list(all_test_data.question1) + list(all_test_data.question2)
    vectorizer_char = TfidfVectorizer(token_pattern=r'[^\s]+').fit(corpus)

    all_train_data = pd.read_csv('./data/noextension/train_corpus_word.csv', encoding='utf-8', engine='python')
    corpus = list(all_train_data.question1) + list(all_train_data.question2)
    all_test_data = pd.read_csv('./data/noextension/test_corpus_word.csv', encoding='utf-8', engine='python')
    corpus += list(all_test_data.question1) + list(all_test_data.question2)
    vectorizer_word = TfidfVectorizer(token_pattern=r'[^\s]+').fit(corpus)

    print('*' * 10 + ' Generating feature file ' + '*' * 10)
    for i in tqdm(range(10)):
        sorted_chars_power = None
        sorted_words_power = None
        for mode in ['train', 'dev', 'test']:
            data = pd.read_csv('./data/noextension/' + str(i) + '/' + mode + '_corpus_char.csv', encoding='utf-8', engine='python')
            if mode == 'train':
                # Powerful units are estimated on the fold's train split only.
                sorted_chars_power = generate_powerful_unit(data)

            len_diff_char = get_len_diff(data)
            edit_char = get_edit_distance(data)
            vocab, q1_tfidf, q2_tfidf = get_tfidf_vector(data, vectorizer_char)
            adjusted_common_char_ratio = adjust_common_units_ratio_by_tfidf(data, vocab, q1_tfidf, q2_tfidf)
            pchar_dside, pchar_dside_tags = powerful_units_dside_tag(sorted_chars_power, data, 1, 0.7)
            pchar_dside_rate = powerful_units_dside_rate(sorted_chars_power, pchar_dside, data)
            pchar_oside, pchar_oside_tags = powerful_units_oside_tag(sorted_chars_power, data, 1, 0.7)
            pchar_oside_rate = powerful_units_oside_rate(sorted_chars_power, pchar_oside, data)

            data = pd.read_csv('./data/noextension/' + str(i) + '/' + mode + '_corpus_word.csv', encoding='utf-8', engine='python')
            if mode == 'train':
                sorted_words_power = generate_powerful_unit(data)

            len_diff_word = get_len_diff(data)
            edit_word = get_edit_distance(data)
            vocab, q1_tfidf, q2_tfidf = get_tfidf_vector(data, vectorizer_word)
            adjusted_common_word_ratio = adjust_common_units_ratio_by_tfidf(data, vocab, q1_tfidf, q2_tfidf)
            pword_dside, pword_dside_tags = powerful_units_dside_tag(sorted_words_power, data, 1, 0.7)
            pword_dside_rate = powerful_units_dside_rate(sorted_words_power, pword_dside, data)
            pword_oside, pword_oside_tags = powerful_units_oside_tag(sorted_words_power, data, 1, 0.7)
            pword_oside_rate = powerful_units_oside_rate(sorted_words_power, pword_oside, data)

            df = pd.DataFrame({'len_diff_char': len_diff_char, 'edit_char': edit_char, 'len_diff_word': len_diff_word, 'edit_word': edit_word,
                               'adjusted_common_char_ratio': adjusted_common_char_ratio, 'adjusted_common_word_ratio': adjusted_common_word_ratio,
                               'pchar_dside_rate': pchar_dside_rate, 'pchar_oside_rate': pchar_oside_rate, 'pword_dside_rate': pword_dside_rate, 'pword_oside_rate': pword_oside_rate})
            df.to_csv('./data/noextension/' + str(i) + '/' + mode + '_feats.csv', index=False)


if __name__ == '__main__':
    generate_features_csv()
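
# Run this script once before training: data_utils.DataProcessor._read_csv
# expects the <set>_feats.csv files generated here next to each <set>.csv.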
""" 3 | 4 | import argparse 5 | import glob 6 | import logging 7 | import os 8 | import random 9 | import shutil 10 | 11 | import numpy as np 12 | import pandas as pd 13 | import torch 14 | import torch.nn as nn 15 | from torch.nn import CrossEntropyLoss, MSELoss 16 | from torch.utils.data import (DataLoader, RandomSampler, SequentialSampler, TensorDataset) 17 | from tensorboardX import SummaryWriter 18 | from tqdm import tqdm, trange 19 | from pytorch_transformers import (WEIGHTS_NAME, BertConfig, BertModel, BertTokenizer) 20 | from pytorch_transformers import AdamW, WarmupLinearSchedule 21 | from pytorch_transformers.modeling_bert import BertPreTrainedModel 22 | from data_utils import (compute_metrics, convert_examples_to_features, QPMProcessor) 23 | 24 | logger = logging.getLogger(__name__) 25 | 26 | 27 | class FeatureBert(BertPreTrainedModel): 28 | def __init__(self, config): 29 | super(FeatureBert, self).__init__(config) 30 | self.num_labels = config.num_labels 31 | 32 | self.bert = BertModel(config) 33 | self.dropout = nn.Dropout(config.hidden_dropout_prob) 34 | # self.classifier = nn.Linear(config.hidden_size + 5, self.config.num_labels) 35 | self.classifier = nn.Linear(config.hidden_size, self.config.num_labels) 36 | 37 | self.features_bn = nn.BatchNorm1d(5) 38 | self.features_dense = nn.Linear(5, 5) 39 | self.init_weights() 40 | 41 | def forward(self, input_ids, token_type_ids=None, attention_mask=None, labels=None, 42 | position_ids=None, head_mask=None, hand_features=None): 43 | outputs = self.bert(input_ids, position_ids=position_ids, token_type_ids=token_type_ids, 44 | attention_mask=attention_mask, head_mask=head_mask) 45 | 46 | hand_features = hand_features.float() 47 | # features = self.features_dense(self.features_bn(hand_features)) 48 | pooled_output = outputs[1] 49 | 50 | pooled_output = self.dropout(pooled_output) 51 | # pooled_output = torch.cat((pooled_output, features), dim=1) 52 | logits = self.classifier(pooled_output) 53 | 54 | outputs = (logits,) + outputs[2:] # add hidden states and attention if they are here 55 | 56 | if labels is not None: 57 | if self.num_labels == 1: 58 | # We are doing regression 59 | loss_fct = MSELoss() 60 | loss = loss_fct(logits.view(-1), labels.view(-1)) 61 | else: 62 | loss_fct = CrossEntropyLoss() 63 | loss = loss_fct(logits.view(-1, self.num_labels), labels.view(-1)) 64 | outputs = (loss,) + outputs 65 | 66 | return outputs # (loss), logits, (hidden_states), (attentions) 67 | 68 | 69 | def set_seed(args): 70 | random.seed(args.seed) 71 | np.random.seed(args.seed) 72 | torch.manual_seed(args.seed) 73 | if args.n_gpu > 0: 74 | torch.cuda.manual_seed_all(args.seed) 75 | 76 | 77 | def train(args, train_dataset, model, tokenizer): 78 | """ Train the model. 
""" 79 | tb_writer = SummaryWriter() 80 | 81 | args.train_batch_size = args.per_gpu_train_batch_size * max(1, args.n_gpu) 82 | train_sampler = RandomSampler(train_dataset) 83 | train_dataloader = DataLoader(train_dataset, sampler=train_sampler, batch_size=args.train_batch_size) 84 | 85 | if args.max_steps > 0: 86 | t_total = args.max_steps 87 | args.num_train_epochs = args.max_steps // (len(train_dataloader) // args.gradient_accumulation_steps) + 1 88 | else: 89 | t_total = len(train_dataloader) // args.gradient_accumulation_steps * args.num_train_epochs 90 | 91 | # Prepare optimizer and schedule (linear warmup and decay) 92 | no_decay = ['bias', 'LayerNorm.weight'] 93 | optimizer_grouped_parameters = [ 94 | {'params': [p for n, p in model.named_parameters() if not any(nd in n for nd in no_decay)], 'weight_decay': args.weight_decay}, 95 | {'params': [p for n, p in model.named_parameters() if any(nd in n for nd in no_decay)], 'weight_decay': 0.0} 96 | ] 97 | optimizer = AdamW(optimizer_grouped_parameters, lr=args.learning_rate, eps=args.adam_epsilon) 98 | scheduler = WarmupLinearSchedule(optimizer, warmup_steps=args.warmup_steps, t_total=t_total) 99 | if args.fp16: 100 | try: 101 | from apex import amp 102 | except ImportError: 103 | raise ImportError('Please install apex from https://www.github.com/nvidia/apex to use fp16 training.') 104 | model, optimizer = amp.initialize(model, optimizer, opt_level=args.fp16_opt_level) 105 | 106 | # multi-gpu training (should be after apex fp16 initialization) 107 | if args.n_gpu > 1: 108 | model = torch.nn.DataParallel(model) 109 | 110 | # Train! 111 | logger.info('***** Running training *****') 112 | logger.info(' Num examples = %d', len(train_dataset)) 113 | logger.info(' Num Epochs = %d', args.num_train_epochs) 114 | logger.info(' Instantaneous batch size per GPU = %d', args.per_gpu_train_batch_size) 115 | logger.info(' Total train batch size (w. 
    global_step = 0
    tr_loss, logging_loss = 0.0, 0.0
    model.zero_grad()
    train_iterator = trange(int(args.num_train_epochs), desc='Epoch')
    set_seed(args)  # Added here for reproducibility
    for _ in train_iterator:
        epoch_iterator = tqdm(train_dataloader, desc='Iteration')
        for step, batch in enumerate(epoch_iterator):
            model.train()
            batch = tuple(t.to(args.device) for t in batch)
            inputs = {'input_ids': batch[0],
                      'attention_mask': batch[1],
                      'token_type_ids': batch[2],
                      'labels': batch[3],
                      'hand_features': batch[4]}
            outputs = model(**inputs)
            loss = outputs[0]  # model outputs are always tuple in pytorch_transformers (see doc)

            if args.n_gpu > 1:
                loss = loss.mean()
            if args.gradient_accumulation_steps > 1:
                loss = loss / args.gradient_accumulation_steps

            if args.fp16:
                with amp.scale_loss(loss, optimizer) as scaled_loss:
                    scaled_loss.backward()
                torch.nn.utils.clip_grad_norm_(amp.master_params(optimizer), args.max_grad_norm)
            else:
                loss.backward()
                torch.nn.utils.clip_grad_norm_(model.parameters(), args.max_grad_norm)

            tr_loss += loss.item()
            if (step + 1) % args.gradient_accumulation_steps == 0:
                optimizer.step()
                scheduler.step()
                model.zero_grad()
                global_step += 1

                if args.logging_steps > 0 and global_step % args.logging_steps == 0:
                    # Log metrics
                    if args.evaluate_during_training:
                        result = evaluate(args, model, tokenizer)
                        for key, value in result.items():
                            tb_writer.add_scalar('eval_{}'.format(key), value, global_step)
                    tb_writer.add_scalar('lr', scheduler.get_lr()[0], global_step)
                    tb_writer.add_scalar('loss', (tr_loss - logging_loss) / args.logging_steps, global_step)
                    logging_loss = tr_loss

                if args.save_steps > 0 and global_step % args.save_steps == 0:
                    # Save model checkpoint
                    output_dir = os.path.join(args.output_dir, 'checkpoint-{}'.format(global_step))
                    if not os.path.exists(output_dir):
                        os.makedirs(output_dir)
                    model_to_save = model.module if hasattr(model, 'module') else model
                    model_to_save.save_pretrained(output_dir)
                    torch.save(args, os.path.join(output_dir, 'training_args.bin'))
                    logger.info('Saving model checkpoint to %s', output_dir)

            if args.max_steps > 0 and global_step > args.max_steps:
                epoch_iterator.close()
                break
        if args.max_steps > 0 and global_step > args.max_steps:
            train_iterator.close()
            break

    tb_writer.close()
    return global_step, tr_loss / global_step
def evaluate(args, model, tokenizer, prefix=''):
    eval_output_dir = args.output_dir

    results = {}
    eval_dataset = load_and_cache_examples(args, tokenizer, set_type='dev')

    if not os.path.exists(eval_output_dir):
        os.makedirs(eval_output_dir)

    args.eval_batch_size = args.per_gpu_eval_batch_size * max(1, args.n_gpu)
    eval_sampler = SequentialSampler(eval_dataset)
    eval_dataloader = DataLoader(eval_dataset, sampler=eval_sampler, batch_size=args.eval_batch_size)

    # Eval!
    logger.info('***** Running evaluation {} *****'.format(prefix))
    logger.info('  Num examples = %d', len(eval_dataset))
    logger.info('  Batch size = %d', args.eval_batch_size)
    eval_loss = 0.0
    nb_eval_steps = 0
    preds = None
    out_label_ids = None
    for batch in tqdm(eval_dataloader, desc='Evaluating'):
        model.eval()
        batch = tuple(t.to(args.device) for t in batch)

        with torch.no_grad():
            inputs = {'input_ids': batch[0],
                      'attention_mask': batch[1],
                      'token_type_ids': batch[2],
                      'labels': batch[3],
                      'hand_features': batch[4]}
            outputs = model(**inputs)
            tmp_eval_loss, logits = outputs[:2]
            eval_loss += tmp_eval_loss.mean().item()
        nb_eval_steps += 1
        if preds is None:
            preds = logits.detach().cpu().numpy()
            out_label_ids = inputs['labels'].detach().cpu().numpy()
        else:
            preds = np.append(preds, logits.detach().cpu().numpy(), axis=0)
            out_label_ids = np.append(out_label_ids, inputs['labels'].detach().cpu().numpy(), axis=0)

    eval_loss = eval_loss / nb_eval_steps
    preds = np.argmax(preds, axis=1)
    result = compute_metrics(preds, out_label_ids)
    results.update(result)

    output_eval_file = os.path.join(eval_output_dir, 'eval_results.txt')
    with open(output_eval_file, 'a') as writer:
        for key in sorted(result.keys()):
            logger.info('  %s = %s', key, str(result[key]))
            writer.write('%s = %s\n' % (key, str(result[key])))
        writer.write('=' * 20 + '\n')

    return results
# def predict(args, model, tokenizer, index):
def predict(args, model, tokenizer):
    test_dataset = load_and_cache_examples(args, tokenizer, set_type='test')

    args.test_batch_size = args.per_gpu_test_batch_size * max(1, args.n_gpu)
    test_sampler = SequentialSampler(test_dataset)
    test_dataloader = DataLoader(test_dataset, sampler=test_sampler, batch_size=args.test_batch_size)

    # Predict!
    logger.info('***** Running prediction *****')
    logger.info('  Num examples = %d', len(test_dataset))
    logger.info('  Batch size = %d', args.test_batch_size)
    preds = None
    for batch in tqdm(test_dataloader, desc='Testing'):
        model.eval()
        batch = tuple(t.to(args.device) for t in batch)

        with torch.no_grad():
            inputs = {'input_ids': batch[0],
                      'attention_mask': batch[1],
                      'token_type_ids': batch[2],
                      'labels': batch[3],
                      'hand_features': batch[4]}
            outputs = model(**inputs)
            tmp_eval_loss, logits = outputs[:2]
        if preds is None:
            preds = logits.detach().cpu().numpy()
        else:
            preds = np.append(preds, logits.detach().cpu().numpy(), axis=0)

    preds = np.argmax(preds, axis=1)
    # with open(os.path.join(args.data_dir + str(index), 'result.csv'), 'w') as f:
    with open(os.path.join(args.data_dir, 'result.csv'), 'w') as f:
        f.write('id,label\n')
        for i, pred in enumerate(preds):
            f.write('%d,%d\n' % (i, pred))
def load_and_cache_examples(args, tokenizer, set_type):
    processor = QPMProcessor()
    # Load data features from cache or dataset file
    cached_features_file = os.path.join(args.data_dir, 'cached_{}_{}_{}_hand_feature'.format(
        set_type,
        list(filter(None, args.model_name_or_path.split('/'))).pop(),
        str(args.max_seq_length)
    ))
    if os.path.exists(cached_features_file) and not args.overwrite_cache:
        logger.info('Loading features from cache file %s', cached_features_file)
        features = torch.load(cached_features_file)
    else:
        logger.info('Creating features from dataset file at %s', args.data_dir)
        label_list = processor.get_labels()
        category_list = processor.get_categories()
        examples = processor.get_examples(args.data_dir, set_type)
        features = convert_examples_to_features(examples, label_list, category_list, args.max_seq_length, tokenizer,
                                                cls_token_at_end=False,
                                                cls_token=tokenizer.cls_token,
                                                cls_token_segment_id=0,
                                                sep_token=tokenizer.sep_token,
                                                sep_token_extra=False,
                                                pad_on_left=False,
                                                pad_token=tokenizer.convert_tokens_to_ids([tokenizer.pad_token])[0],
                                                pad_token_segment_id=0
                                                # Settings for other model types, kept for reference:
                                                # cls_token_at_end=bool(args.model_type in ['xlnet']),  # xlnet has a cls token at the end
                                                # cls_token_segment_id=2 if args.model_type in ['xlnet'] else 0,
                                                # sep_token_extra=bool(args.model_type in ['roberta']),  # roberta uses an extra separator b/w pairs of sentences, cf. github.com/pytorch/fairseq/commit/1684e166e3da03f5b600dbb7855cb98ddfcd0805
                                                # pad_on_left=bool(args.model_type in ['xlnet']),  # pad on the left for xlnet
                                                # pad_token_segment_id=4 if args.model_type in ['xlnet'] else 0,
                                                )
        logger.info('Saving features into cached file %s', cached_features_file)
        torch.save(features, cached_features_file)

    # Convert to Tensors and build dataset
    all_input_ids = torch.tensor([f.input_ids for f in features], dtype=torch.long)
    all_input_mask = torch.tensor([f.input_mask for f in features], dtype=torch.long)
    all_segment_ids = torch.tensor([f.segment_ids for f in features], dtype=torch.long)
    all_label_ids = torch.tensor([f.label_id for f in features], dtype=torch.long)
    all_ct_clf_input_ids = torch.tensor([f.category_clf_input_ids for f in features], dtype=torch.long)
    all_ct_clf_input_mask = torch.tensor([f.category_clf_input_mask for f in features], dtype=torch.long)
    all_ct_clf_segment_ids = torch.tensor([f.category_clf_segment_ids for f in features], dtype=torch.long)
    all_category_ids = torch.tensor([f.category_id for f in features], dtype=torch.long)
    # Hand-crafted features are real-valued, so store them as floats.
    all_hand_features = torch.tensor([f.hand_features for f in features], dtype=torch.float)

    dataset = TensorDataset(all_input_ids, all_input_mask, all_segment_ids, all_label_ids, all_hand_features,
                            all_ct_clf_input_ids, all_ct_clf_input_mask, all_ct_clf_segment_ids, all_category_ids)
    return dataset
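
# Batch layout note: the TensorDataset above yields
# (input_ids, input_mask, segment_ids, label_ids, hand_features,
#  ct_clf_input_ids, ct_clf_input_mask, ct_clf_segment_ids, category_ids),
# which is why train(), evaluate() and predict() index batch[0] .. batch[4].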
def main():
    parser = argparse.ArgumentParser()

    ## Required parameters
    parser.add_argument('--data_dir', default=None, type=str, required=True,
                        help='The input data dir. Should contain the .csv files for the task.')
    parser.add_argument('--model_name_or_path', default=None, type=str, required=True,
                        help='Path to pretrained model or shortcut name selected in the list.')
    parser.add_argument('--output_dir', default=None, type=str, required=True,
                        help='The output directory where the model predictions and checkpoints will be written.')

    ## Other parameters
    parser.add_argument('--config_name', default='', type=str,
                        help='Pretrained config name or path if not the same as model_name.')
    parser.add_argument('--tokenizer_name', default='', type=str,
                        help='Pretrained tokenizer name or path if not the same as model_name.')
    parser.add_argument('--max_seq_length', default=128, type=int,
                        help='The maximum total input sequence length after tokenization. Sequences longer than this '
                             'will be truncated, sequences shorter will be padded.')
    parser.add_argument('--do_train', action='store_true',
                        help='Whether to run training.')
    parser.add_argument('--do_eval', action='store_true',
                        help='Whether to run eval on the dev set.')
    parser.add_argument('--do_predict', action='store_true',
                        help='Whether to run prediction on the test set.')
    parser.add_argument('--evaluate_during_training', action='store_true',
                        help='Run evaluation during training at each logging step.')
    parser.add_argument('--do_lower_case', action='store_true',
                        help='Set this flag if you are using an uncased model.')

    parser.add_argument('--per_gpu_train_batch_size', default=1, type=int,
                        help='Batch size per GPU/CPU for training.')
    parser.add_argument('--per_gpu_eval_batch_size', default=8, type=int,
                        help='Batch size per GPU/CPU for evaluation.')
    parser.add_argument('--per_gpu_test_batch_size', default=8, type=int,
                        help='Batch size per GPU/CPU for prediction.')
    parser.add_argument('--gradient_accumulation_steps', type=int, default=1,
                        help='Number of update steps to accumulate before performing a backward/update pass.')
    parser.add_argument('--learning_rate', default=5e-5, type=float,
                        help='The initial learning rate for Adam.')
    parser.add_argument('--weight_decay', default=0.0, type=float,
                        help='Weight decay if we apply some.')
    parser.add_argument('--adam_epsilon', default=1e-8, type=float,
                        help='Epsilon for Adam optimizer.')
    parser.add_argument('--max_grad_norm', default=1.0, type=float,
                        help='Max gradient norm.')
    parser.add_argument('--num_train_epochs', default=4.0, type=float,
                        help='Total number of training epochs to perform.')
    parser.add_argument('--max_steps', default=-1, type=int,
                        help='If > 0: set total number of training steps to perform. Overrides num_train_epochs.')
    parser.add_argument('--warmup_steps', default=0, type=int,
                        help='Linear warmup over warmup_steps.')

    parser.add_argument('--logging_steps', type=int, default=50,
                        help='Log every X update steps.')
    parser.add_argument('--save_steps', type=int, default=100,
                        help='Save checkpoint every X update steps.')
    parser.add_argument('--eval_all_checkpoints', action='store_true',
                        help='Evaluate all checkpoints starting with the same prefix as model_name and ending with a step number.')
    parser.add_argument('--no_cuda', action='store_true',
                        help='Avoid using CUDA when available.')
    parser.add_argument('--overwrite_output_dir', action='store_true',
                        help='Overwrite the content of the output directory.')
    parser.add_argument('--overwrite_cache', action='store_true',
                        help='Overwrite the cached training and evaluation sets.')
    parser.add_argument('--seed', type=int, default=42,
                        help='random seed for initialization')

    parser.add_argument('--fp16', action='store_true',
                        help='Whether to use 16-bit (mixed) precision (through NVIDIA apex) instead of 32-bit')
    parser.add_argument('--fp16_opt_level', type=str, default='O1',
                        help="For fp16: Apex AMP optimization level selected in ['O0', 'O1', 'O2', and 'O3']. "
                             "See details at https://nvidia.github.io/apex/amp.html")
    args = parser.parse_args()
    # Setup CUDA, GPU
    if os.path.exists(args.output_dir) and os.listdir(args.output_dir) and args.do_train and not args.overwrite_output_dir:
        raise ValueError('Output directory ({}) already exists and is not empty. Use --overwrite_output_dir to overcome.'.format(args.output_dir))

    device = torch.device('cuda' if torch.cuda.is_available() and not args.no_cuda else 'cpu')
    args.n_gpu = torch.cuda.device_count()
    args.device = device

    # Setup logging
    logging.basicConfig(format='%(asctime)s - %(levelname)s - %(name)s - %(message)s',
                        datefmt='%m/%d/%Y %H:%M:%S',
                        level=logging.INFO)
    logger.warning('Process device: %s, n_gpu: %s, 16-bits training: %s',
                   device, args.n_gpu, args.fp16)

    # Set seed
    set_seed(args)
    # Prepare QPM task
    processor = QPMProcessor()
    label_list = processor.get_labels()
    num_labels = len(label_list)

    # Load pretrained model and tokenizer
    config_class, model_class, tokenizer_class = BertConfig, FeatureBert, BertTokenizer
    config = config_class.from_pretrained(args.config_name if args.config_name else args.model_name_or_path, num_labels=num_labels)
    tokenizer = tokenizer_class.from_pretrained(args.tokenizer_name if args.tokenizer_name else args.model_name_or_path, do_lower_case=args.do_lower_case)
    model = model_class.from_pretrained(args.model_name_or_path, from_tf=bool('.ckpt' in args.model_name_or_path), config=config)
    model.to(args.device)

    logger.info('Training/evaluation parameters %s', args)
    parent_data_dir = args.data_dir
    parent_output_dir = args.output_dir

    # Training
    results_tmp = {}
    if args.do_train:
        # 10-fold dataset for training.
        # for i in range(0, 10):
        # Reload the pretrained model.
        model = model_class.from_pretrained(args.model_name_or_path,
                                            from_tf=bool('.ckpt' in args.model_name_or_path),
                                            config=config)
        model.to(args.device)

        # args.data_dir = parent_data_dir + str(i)
        # args.output_dir = parent_output_dir + str(i)

        train_dataset = load_and_cache_examples(args, tokenizer, set_type='train')
        global_step, tr_loss = train(args, train_dataset, model, tokenizer)
        logger.info(' global_step = %s, average loss = %s', global_step, tr_loss)
        # Saving best-practices: if you use default names for the model, you can reload it using from_pretrained()
        # Create output directory if needed
        if not os.path.exists(args.output_dir):
            os.makedirs(args.output_dir)

        logger.info('Saving model checkpoint to %s', args.output_dir)
469 | # They can then be reloaded using `from_pretrained()` 470 | model_to_save = model.module if hasattr(model, 'module') else model # Take care of distributed/parallel training 471 | model_to_save.save_pretrained(args.output_dir) 472 | tokenizer.save_pretrained(args.output_dir) 473 | 474 | # Good practice: save your training arguments together with the trained model 475 | torch.save(args, os.path.join(args.output_dir, 'training_args.bin')) 476 | 477 | # Load a trained model and vocabulary that you have fine-tuned 478 | model = model_class.from_pretrained(args.output_dir) 479 | tokenizer = tokenizer_class.from_pretrained(args.output_dir, do_lower_case=args.do_lower_case) 480 | model.to(args.device) 481 | 482 | # To reduce disk usage, evaluate and keep only the best checkpoint for each sub-dataset. 483 | # args.data_dir = parent_data_dir + str(i) 484 | # args.output_dir = parent_output_dir + str(i) 485 | tokenizer = tokenizer_class.from_pretrained(args.output_dir, do_lower_case=args.do_lower_case) 486 | checkpoints = [args.output_dir] 487 | if args.eval_all_checkpoints: 488 | checkpoints = list(os.path.dirname(c) for c in sorted(glob.glob(args.output_dir + '/**/' + WEIGHTS_NAME, recursive=True))) 489 | logging.getLogger("pytorch_transformers.modeling_utils").setLevel(logging.WARN) # Reduce logging 490 | logger.info("Evaluate the following checkpoints: %s", checkpoints) 491 | best_f1 = 0.0 492 | for checkpoint in checkpoints: 493 | global_step = checkpoint.split('-')[-1] if len(checkpoints) > 1 else "" 494 | model = model_class.from_pretrained(checkpoint) 495 | model.to(args.device) 496 | result = evaluate(args, model, tokenizer, prefix=global_step) 497 | if result['f1'] > best_f1: 498 | best_f1 = result['f1'] 499 | # Save the best model checkpoint 500 | output_dir = os.path.join(args.output_dir, 'best_checkpoint_fold') 501 | if not os.path.exists(output_dir): 502 | os.makedirs(output_dir) 503 | model_to_save = model.module if hasattr(model, 'module') else model 504 | model_to_save.save_pretrained(output_dir) 505 | torch.save(args, os.path.join(output_dir, 'training_args.bin')) 506 | logger.info('Saving model checkpoint to %s', output_dir) 507 | 508 | result = dict((k + '_{}'.format(global_step), v) for k, v in result.items()) 509 | results_tmp.update(result) 510 | checkpoints.remove(args.output_dir) 511 | for checkpoint in checkpoints: 512 | shutil.rmtree(checkpoint) 513 | 514 | # Evaluation 515 | results = {} 516 | if args.do_eval: 517 | for i in range(10): 518 | args.data_dir = parent_data_dir + str(i) 519 | args.output_dir = parent_output_dir + str(i) 520 | tokenizer = tokenizer_class.from_pretrained(args.output_dir, do_lower_case=args.do_lower_case) 521 | checkpoints = [args.output_dir] 522 | if args.eval_all_checkpoints: 523 | checkpoints = list(os.path.dirname(c) for c in sorted(glob.glob(args.output_dir + '/**/' + WEIGHTS_NAME, recursive=True))) 524 | logging.getLogger("pytorch_transformers.modeling_utils").setLevel(logging.WARN) # Reduce logging 525 | logger.info("Evaluate the following checkpoints: %s", checkpoints) 526 | best_f1 = 0.0 527 | for checkpoint in checkpoints: 528 | global_step = checkpoint.split('-')[-1] if len(checkpoints) > 1 else "" 529 | model = model_class.from_pretrained(checkpoint) 530 | model.to(args.device) 531 | result = evaluate(args, model, tokenizer, prefix=global_step) 532 | if result['f1'] > best_f1: 533 | best_f1 = result['f1'] 534 | # Save the best model checkpoint 535 | output_dir = os.path.join(args.output_dir, 'best_checkpoint_fold' + str(i)) 536 | if not
os.path.exists(output_dir): 537 | os.makedirs(output_dir) 538 | model_to_save = model.module if hasattr(model, 'module') else model 539 | model_to_save.save_pretrained(output_dir) 540 | torch.save(args, os.path.join(output_dir, 'training_args.bin')) 541 | logger.info('Saving model checkpoint to %s', output_dir) 542 | 543 | result = dict((k + '_{}'.format(global_step), v) for k, v in result.items()) 544 | results.update(result) 545 | 546 | # Prediction 547 | if args.do_predict: 548 | # for i in range(10): 549 | # args.output_dir = parent_output_dir + str(i) 550 | args.output_dir = parent_output_dir 551 | tokenizer = tokenizer_class.from_pretrained(args.output_dir, do_lower_case=args.do_lower_case) 552 | logging.getLogger("pytorch_transformers.modeling_utils").setLevel(logging.WARN) # Reduce logging 553 | # checkpoint = args.output_dir + '/best_checkpoint_fold' + str(i) 554 | checkpoint = args.output_dir + '/best_checkpoint_fold' 555 | model = model_class.from_pretrained(checkpoint) 556 | model.to(args.device) 557 | # predict(args, model, tokenizer, i) 558 | predict(args, model, tokenizer) 559 | 560 | # For bagging: sum the ten per-fold 0/1 labels, then keep 1 only when at least 6 of the 10 folds agree. 561 | all = pd.read_csv('./data/sample_submission.csv') 562 | for i in range(10): 563 | df = pd.read_csv(args.data_dir + str(i) + '/result.csv') 564 | all['label'] += df['label'] 565 | all['label'] = all['label'] // 6 566 | all.to_csv('./data/result.csv', index=False) 567 | 568 | 569 | if __name__ == '__main__': 570 | main() 571 | -------------------------------------------------------------------------------- /model_final.py: -------------------------------------------------------------------------------- 1 | # -*- coding: utf-8 -*- 2 | """ Finetuning the library models for chip2019 question pairs matching. """ 3 | 4 | import argparse 5 | import glob 6 | import logging 7 | import os 8 | import random 9 | import shutil 10 | 11 | import numpy as np 12 | import pandas as pd 13 | import torch 14 | import torch.nn as nn 15 | from torch.nn import CrossEntropyLoss, MSELoss 16 | from torch.utils.data import (DataLoader, RandomSampler, SequentialSampler, TensorDataset) 17 | from tensorboardX import SummaryWriter 18 | from tqdm import tqdm, trange 19 | from pytorch_transformers import (WEIGHTS_NAME, BertConfig, BertModel, BertTokenizer) 20 | from pytorch_transformers import AdamW, WarmupLinearSchedule 21 | from pytorch_transformers.modeling_bert import BertPreTrainedModel 22 | from data_utils import (compute_metrics, convert_examples_to_features, QPMProcessor) 23 | 24 | logger = logging.getLogger(__name__) 25 | 26 | 27 | class CustomizedBert(BertPreTrainedModel): 28 | def __init__(self, config): 29 | super(CustomizedBert, self).__init__(config) 30 | self.num_labels = 2 31 | self.num_categories = 5 32 | self.num_features = 10 33 | self.bert = BertModel(config) 34 | self.dropout = nn.Dropout(config.hidden_dropout_prob) 35 | self.classifier1 = nn.Linear(config.hidden_size + self.num_features, self.num_labels) 36 | self.classifier2 = nn.Linear(config.hidden_size, self.num_categories) 37 | self.features_bn = nn.BatchNorm1d(self.num_features) 38 | self.features_dense = nn.Linear(self.num_features, self.num_features) 39 | 40 | self.init_weights() 41 | 42 | def forward(self, input_ids, ct_clf_input_ids, token_type_ids=None, attention_mask=None, position_ids=None, labels=None, 43 | ct_clf_token_type_ids=None, ct_clf_attention_mask=None, ct_clf_position_ids=None, categories=None, 44 | head_mask=None, hand_features=None): 45 | outputs1 = self.bert(input_ids, position_ids=position_ids, token_type_ids=token_type_ids, 46 |
attention_mask=attention_mask, head_mask=head_mask) 47 | outputs2 = self.bert(ct_clf_input_ids, position_ids=ct_clf_position_ids, token_type_ids=ct_clf_token_type_ids, 48 | attention_mask=ct_clf_attention_mask, head_mask=head_mask) 49 | pooled_output1 = outputs1[1] 50 | pooled_output2 = outputs2[1] 51 | 52 | hand_features = hand_features.float() 53 | hand_features = self.features_dense(self.features_bn(hand_features)) 54 | pooled_output1 = torch.cat((pooled_output1, hand_features), dim=1) 55 | pooled_output1 = self.dropout(pooled_output1) 56 | pooled_output2 = self.dropout(pooled_output2) 57 | logits1 = self.classifier1(pooled_output1) 58 | logits2 = self.classifier2(pooled_output2) 59 | 60 | outputs1 = (logits1,) + outputs1[2:] # add hidden states and attention if they are here 61 | outputs2 = (logits2,) + outputs2[2:] 62 | 63 | if labels is not None: 64 | if self.num_labels == 1: 65 | # We are doing regression 66 | loss_fct = MSELoss() 67 | loss = loss_fct(logits1.view(-1), labels.view(-1)) 68 | else: 69 | loss_fct = CrossEntropyLoss() 70 | loss = loss_fct(logits1.view(-1, self.num_labels), labels.view(-1)) 71 | outputs1 = (loss,) + outputs1 72 | if categories is not None: 73 | if self.num_categories == 1: 74 | # We are doing regression 75 | loss_fct = MSELoss() 76 | loss = loss_fct(logits2.view(-1), categories.view(-1)) 77 | else: 78 | loss_fct = CrossEntropyLoss() 79 | loss = loss_fct(logits2.view(-1, self.num_categories), categories.view(-1)) 80 | outputs2 = (loss,) + outputs2 81 | 82 | return outputs1, outputs2 # (loss), logits, (hidden_states), (attentions) 83 | 84 | 85 | def set_seed(args): 86 | random.seed(args.seed) 87 | np.random.seed(args.seed) 88 | torch.manual_seed(args.seed) 89 | if args.n_gpu > 0: 90 | torch.cuda.manual_seed_all(args.seed) 91 | 92 | 93 | def train(args, train_dataset, model, tokenizer): 94 | """ Train the model. """ 95 | tb_writer = SummaryWriter() 96 | 97 | args.train_batch_size = args.per_gpu_train_batch_size * max(1, args.n_gpu) 98 | train_sampler = RandomSampler(train_dataset) 99 | train_dataloader = DataLoader(train_dataset, sampler=train_sampler, batch_size=args.train_batch_size) 100 | 101 | if args.max_steps > 0: 102 | t_total = args.max_steps 103 | args.num_train_epochs = args.max_steps // (len(train_dataloader) // args.gradient_accumulation_steps) + 1 104 | else: 105 | t_total = len(train_dataloader) // args.gradient_accumulation_steps * args.num_train_epochs 106 | 107 | # Prepare optimizer and schedule (linear warmup and decay) 108 | no_decay = ['bias', 'LayerNorm.weight'] 109 | optimizer_grouped_parameters = [ 110 | {'params': [p for n, p in model.named_parameters() if not any(nd in n for nd in no_decay)], 'weight_decay': args.weight_decay}, 111 | {'params': [p for n, p in model.named_parameters() if any(nd in n for nd in no_decay)], 'weight_decay': 0.0} 112 | ] 113 | optimizer = AdamW(optimizer_grouped_parameters, lr=args.learning_rate, eps=args.adam_epsilon) 114 | scheduler = WarmupLinearSchedule(optimizer, warmup_steps=args.warmup_steps, t_total=t_total) 115 | if args.fp16: 116 | try: 117 | from apex import amp 118 | except ImportError: 119 | raise ImportError('Please install apex from https://www.github.com/nvidia/apex to use fp16 training.') 120 | model, optimizer = amp.initialize(model, optimizer, opt_level=args.fp16_opt_level) 121 | 122 | # multi-gpu training (should be after apex fp16 initialization) 123 | if args.n_gpu > 1: 124 | model = torch.nn.DataParallel(model) 125 | 126 | # Train! 
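    # Joint training note: each batch below yields two losses from the shared BERT
    # encoder, a question-pair matching loss (outputs[0][0]) and an auxiliary
    # category-classification loss (outputs[1][0]); they are summed with equal
    # weight before the backward pass. A weighted sum, e.g. a hypothetical
    #     total_loss = loss + alpha * clf_loss
    # with alpha < 1, would be one way to down-weight the auxiliary task.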
127 | logger.info('***** Running training *****') 128 | logger.info(' Num examples = %d', len(train_dataset)) 129 | logger.info(' Num Epochs = %d', args.num_train_epochs) 130 | logger.info(' Instantaneous batch size per GPU = %d', args.per_gpu_train_batch_size) 131 | logger.info(' Total train batch size (w. parallel & accumulation) = %d', 132 | args.train_batch_size * args.gradient_accumulation_steps) 133 | logger.info(' Gradient Accumulation steps = %d', args.gradient_accumulation_steps) 134 | logger.info(' Total optimization steps = %d', t_total) 135 | 136 | global_step = 0 137 | tr_loss, logging_loss = 0.0, 0.0 138 | model.zero_grad() 139 | train_iterator = trange(int(args.num_train_epochs), desc='Epoch') 140 | set_seed(args) # Added here for reproducibility 141 | for _ in train_iterator: 142 | epoch_iterator = tqdm(train_dataloader, desc='Iteration') 143 | for step, batch in enumerate(epoch_iterator): 144 | model.train() 145 | batch = tuple(t.to(args.device) for t in batch) 146 | inputs = {'input_ids': batch[0], 147 | 'attention_mask': batch[1], 148 | 'token_type_ids': batch[2], 149 | 'labels': batch[3], 150 | 'ct_clf_input_ids': batch[4], 151 | 'ct_clf_attention_mask': batch[5], 152 | 'ct_clf_token_type_ids': batch[6], 153 | 'categories': batch[7], 154 | 'hand_features': batch[8]} 155 | outputs = model(**inputs) 156 | loss, clf_loss = outputs[0][0], outputs[1][0] # model outputs are always tuple in pytorch_transformers (see doc) 157 | 158 | total_loss = loss + clf_loss 159 | if args.n_gpu > 1: 160 | total_loss = total_loss.mean() 161 | if args.gradient_accumulation_steps > 1: 162 | total_loss = total_loss / args.gradient_accumulation_steps 163 | 164 | if args.fp16: 165 | with amp.scale_loss(total_loss, optimizer) as scaled_loss: 166 | scaled_loss.backward() 167 | torch.nn.utils.clip_grad_norm_(amp.master_params(optimizer), args.max_grad_norm) 168 | else: 169 | total_loss.backward() 170 | torch.nn.utils.clip_grad_norm_(model.parameters(), args.max_grad_norm) 171 | 172 | tr_loss += total_loss.item() 173 | if (step + 1) % args.gradient_accumulation_steps == 0: 174 | optimizer.step() 175 | scheduler.step() 176 | model.zero_grad() 177 | global_step += 1 178 | 179 | if args.logging_steps > 0 and global_step % args.logging_steps == 0: 180 | # Log metrics 181 | if args.evaluate_during_training: 182 | result = evaluate(args, model, tokenizer) 183 | for key, value in result.items(): 184 | tb_writer.add_scalar('eval_{}'.format(key), value, global_step) 185 | tb_writer.add_scalar('lr', scheduler.get_lr()[0], global_step) 186 | tb_writer.add_scalar('loss', (tr_loss-logging_loss)/args.logging_steps, global_step) 187 | logging_loss = tr_loss 188 | 189 | if args.save_steps > 0 and global_step % args.save_steps == 0: 190 | # Save model checkpoint 191 | output_dir = os.path.join(args.output_dir, 'checkpoint-{}'.format(global_step)) 192 | if not os.path.exists(output_dir): 193 | os.makedirs(output_dir) 194 | model_to_save = model.module if hasattr(model, 'module') else model 195 | model_to_save.save_pretrained(output_dir) 196 | torch.save(args, os.path.join(output_dir, 'training_args.bin')) 197 | logger.info('Saving model checkpoint to %s', output_dir) 198 | 199 | if args.max_steps > 0 and global_step > args.max_steps: 200 | epoch_iterator.close() 201 | break 202 | if args.max_steps > 0 and global_step > args.max_steps: 203 | train_iterator.close() 204 | break 205 | 206 | tb_writer.close() 207 | return global_step, tr_loss / global_step 208 | 209 | 210 | def evaluate(args, model, tokenizer, prefix=''): 211 | eval_output_dir =
args.output_dir 212 | 213 | results = {} 214 | eval_dataset = load_and_cache_examples(args, tokenizer, set_type='dev') 215 | 216 | if not os.path.exists(eval_output_dir): 217 | os.makedirs(eval_output_dir) 218 | 219 | args.eval_batch_size = args.per_gpu_eval_batch_size * max(1, args.n_gpu) 220 | eval_sampler = SequentialSampler(eval_dataset) 221 | eval_dataloader = DataLoader(eval_dataset, sampler=eval_sampler, batch_size=args.eval_batch_size) 222 | 223 | # Eval! 224 | logger.info('***** Running evaluation {} *****'.format(prefix)) 225 | logger.info(' Num examples = %d', len(eval_dataset)) 226 | logger.info(' Batch size = %d', args.eval_batch_size) 227 | eval_loss = 0.0 228 | nb_eval_steps = 0 229 | preds = None 230 | out_label_ids = None 231 | for batch in tqdm(eval_dataloader, desc='Evaluating'): 232 | model.eval() 233 | batch = tuple(t.to(args.device) for t in batch) 234 | 235 | with torch.no_grad(): 236 | inputs = {'input_ids': batch[0], 237 | 'attention_mask': batch[1], 238 | 'token_type_ids': batch[2], 239 | 'labels': batch[3], 240 | 'ct_clf_input_ids': batch[4], 241 | 'ct_clf_attention_mask': batch[5], 242 | 'ct_clf_token_type_ids': batch[6], 243 | 'categories': batch[7], 244 | 'hand_features': batch[8]} 245 | outputs = model(**inputs) 246 | tmp_eval_loss, logits = outputs[0][:2] 247 | eval_loss += tmp_eval_loss.mean().item() 248 | nb_eval_steps += 1 249 | if preds is None: 250 | preds = logits.detach().cpu().numpy() 251 | out_label_ids = inputs['labels'].detach().cpu().numpy() 252 | else: 253 | preds = np.append(preds, logits.detach().cpu().numpy(), axis=0) 254 | out_label_ids = np.append(out_label_ids, inputs['labels'].detach().cpu().numpy(), axis=0) 255 | 256 | eval_loss = eval_loss / nb_eval_steps 257 | preds = np.argmax(preds, axis=1) 258 | result = compute_metrics(preds, out_label_ids) 259 | results.update(result) 260 | 261 | output_eval_file = os.path.join(eval_output_dir, 'eval_results.txt') 262 | with open(output_eval_file, 'a') as writer: 263 | for key in sorted(result.keys()): 264 | logger.info(' %s = %s', key, str(result[key])) 265 | writer.write('%s = %s\n' % (key, str(result[key]))) 266 | writer.write('='*20 + '\n') 267 | 268 | return results 269 | 270 | 271 | def predict(args, model, tokenizer, index): 272 | test_dataset = load_and_cache_examples(args, tokenizer, set_type='test') 273 | 274 | args.test_batch_size = args.per_gpu_test_batch_size * max(1, args.n_gpu) 275 | test_sampler = SequentialSampler(test_dataset) 276 | test_dataloader = DataLoader(test_dataset, sampler=test_sampler, batch_size=args.test_batch_size) 277 | 278 | # Eval! 
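    # Prediction note: unlike evaluate() above, predict() computes no metrics; it
    # simply argmaxes the pair-matching logits into hard 0/1 labels and writes them
    # to a per-fold result.csv, which the bagging step in main() aggregates later.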
279 | logger.info('***** Running prediction *****') 280 | logger.info(' Num examples = %d', len(test_dataset)) 281 | logger.info(' Batch size = %d', args.test_batch_size) 282 | preds = None 283 | for batch in tqdm(test_dataloader, desc='Testing'): 284 | model.eval() 285 | batch = tuple(t.to(args.device) for t in batch) 286 | 287 | with torch.no_grad(): 288 | inputs = {'input_ids': batch[0], 289 | 'attention_mask': batch[1], 290 | 'token_type_ids': batch[2], 291 | 'labels': batch[3], 292 | 'ct_clf_input_ids': batch[4], 293 | 'ct_clf_attention_mask': batch[5], 294 | 'ct_clf_token_type_ids': batch[6], 295 | 'categories': batch[7], 296 | 'hand_features': batch[8]} 297 | outputs = model(**inputs) 298 | tmp_eval_loss, logits = outputs[0][:2] 299 | if preds is None: 300 | preds = logits.detach().cpu().numpy() 301 | else: 302 | preds = np.append(preds, logits.detach().cpu().numpy(), axis=0) 303 | 304 | preds = np.argmax(preds, axis=1) 305 | with open(os.path.join(args.data_dir, 'result.csv'), 'w') as f: 306 | f.write('id,label\n') 307 | for i, pred in enumerate(preds): 308 | f.write('%d,%d\n' % (i, pred)) 309 | 310 | 311 | def load_and_cache_examples(args, tokenizer, set_type): 312 | processor = QPMProcessor() 313 | # Load data features from cache or dataset file 314 | cached_features_file = os.path.join(args.data_dir, 'cached_{}_{}_{}_customized'.format( 315 | set_type, 316 | list(filter(None, args.model_name_or_path.split('/'))).pop(), 317 | str(args.max_seq_length) 318 | )) 319 | if os.path.exists(cached_features_file): 320 | logger.info('Loading features from cache file %s', cached_features_file) 321 | features = torch.load(cached_features_file) 322 | else: 323 | logger.info('Creating features from dataset file at %s', args.data_dir) 324 | label_list = processor.get_labels() 325 | category_list = processor.get_categories() 326 | examples = processor.get_examples(args.data_dir, set_type) 327 | features = convert_examples_to_features(examples, label_list, category_list, args.max_seq_length, tokenizer, 328 | cls_token_at_end=False, # xlnet has a cls token at the end 329 | cls_token=tokenizer.cls_token, 330 | cls_token_segment_id=0, 331 | sep_token=tokenizer.sep_token, 332 | sep_token_extra=False, 333 | pad_on_left=False, 334 | pad_token=tokenizer.convert_tokens_to_ids([tokenizer.pad_token])[0], 335 | pad_token_segment_id=0 336 | # cls_token_at_end=bool(args.model_type in ['xlnet']), # xlnet has a cls token at the end 337 | # cls_token=tokenizer.cls_token, 338 | # cls_token_segment_id=2 if args.model_type in ['xlnet'] else 0, 339 | # sep_token=tokenizer.sep_token, 340 | # sep_token_extra=bool(args.model_type in ['roberta']), # roberta uses an extra separator b/w pairs of sentences, cf.
github.com/pytorch/fairseq/commit/1684e166e3da03f5b600dbb7855cb98ddfcd0805 341 | # pad_on_left=bool(args.model_type in ['xlnet']), # pad on the left for xlnet 342 | # pad_token=tokenizer.convert_tokens_to_ids([tokenizer.pad_token])[0], 343 | # pad_token_segment_id=4 if args.model_type in ['xlnet'] else 0, 344 | ) 345 | logger.info("Saving features into cached file %s", cached_features_file) 346 | torch.save(features, cached_features_file) 347 | 348 | # Convert to Tensors and build dataset 349 | all_input_ids = torch.tensor([f.input_ids for f in features], dtype=torch.long) 350 | all_input_mask = torch.tensor([f.input_mask for f in features], dtype=torch.long) 351 | all_segment_ids = torch.tensor([f.segment_ids for f in features], dtype=torch.long) 352 | all_label_ids = torch.tensor([f.label_id for f in features], dtype=torch.long) 353 | all_ct_clf_input_ids = torch.tensor([f.category_clf_input_ids for f in features], dtype=torch.long) 354 | all_ct_clf_input_mask = torch.tensor([f.category_clf_input_mask for f in features], dtype=torch.long) 355 | all_ct_clf_segment_ids = torch.tensor([f.category_clf_segment_ids for f in features], dtype=torch.long) 356 | all_category_ids = torch.tensor([f.category_id for f in features], dtype=torch.long) 357 | all_hand_features = torch.tensor([f.hand_features for f in features], dtype=torch.long) 358 | 359 | dataset = TensorDataset(all_input_ids, all_input_mask, all_segment_ids, all_label_ids, 360 | all_ct_clf_input_ids, all_ct_clf_input_mask, all_ct_clf_segment_ids, all_category_ids, 361 | all_hand_features) 362 | return dataset 363 | 364 | 365 | def main(): 366 | parser = argparse.ArgumentParser() 367 | 368 | ## Required parameters 369 | parser.add_argument('--data_dir', default=None, type=str, required=True, 370 | help='The input data dir. Should contain the .csv files for the task.') 371 | parser.add_argument('--model_name_or_path', default=None, type=str, required=True, 372 | help='Path to pretrained model or shortcut name selected in the list.') 373 | parser.add_argument('--output_dir', default=None, type=str, required=True, 374 | help='The output directory where the model predictions and checkpoints will be written.') 375 | 376 | ## Other parameters 377 | parser.add_argument('--config_name', default='', type=str, 378 | help='Pretrained config name or path if not the same as model_name.') 379 | parser.add_argument('--tokenizer_name', default='', type=str, 380 | help='Pretrained tokenizer name or path if not the same as model_name.') 381 | parser.add_argument('--max_seq_length', default=128, type=int, 382 | help='The maximum total input sequence length after tokenization.
Sequences longer than this ' 383 | 'will be truncated, sequences shorter will be padded.') 384 | parser.add_argument('--do_train', action='store_true', 385 | help='Whether to run training.') 386 | parser.add_argument('--do_eval', action='store_true', 387 | help='Whether to run eval on the dev set.') 388 | parser.add_argument('--do_predict', action='store_true', 389 | help='Whether to run test on the test set.') 390 | parser.add_argument('--evaluate_during_training', action='store_true', 391 | help='Run evaluation during training at each logging step.') 392 | parser.add_argument('--do_lower_case', action='store_true', 393 | help='Set this flag if you are using an uncased model.') 394 | 395 | parser.add_argument('--num_features', default=10, type=int, 396 | help='Number of hand-crafted features.') 397 | parser.add_argument('--per_gpu_train_batch_size', default=8, type=int, 398 | help='Batch size per GPU/CPU for training.') 399 | parser.add_argument('--per_gpu_eval_batch_size', default=8, type=int, 400 | help='Batch size per GPU/CPU for evaluation.') 401 | parser.add_argument('--per_gpu_test_batch_size', default=8, type=int, 402 | help='Batch size per GPU/CPU for prediction.') 403 | parser.add_argument('--gradient_accumulation_steps', type=int, default=1, 404 | help='Number of update steps to accumulate before performing a backward/update pass.') 405 | parser.add_argument('--learning_rate', default=5e-5, type=float, 406 | help='The initial learning rate for Adam.') 407 | parser.add_argument('--weight_decay', default=0.0, type=float, 408 | help='Weight decay if we apply some.') 409 | parser.add_argument('--adam_epsilon', default=1e-8, type=float, 410 | help='Epsilon for Adam optimizer.') 411 | parser.add_argument('--max_grad_norm', default=1.0, type=float, 412 | help='Max gradient norm.') 413 | parser.add_argument('--num_train_epochs', default=3.0, type=float, 414 | help='Total number of training epochs to perform.') 415 | parser.add_argument('--max_steps', default=-1, type=int, 416 | help='If > 0: set total number of training steps to perform. Overrides num_train_epochs.') 417 | parser.add_argument('--warmup_steps', default=0, type=int, 418 | help='Linear warmup over warmup_steps.') 419 | 420 | parser.add_argument('--logging_steps', type=int, default=50, 421 | help='Log every X update steps.') 422 | parser.add_argument('--save_steps', type=int, default=200, 423 | help='Save checkpoint every X update steps.') 424 | parser.add_argument('--eval_all_checkpoints', action='store_true', 425 | help='Evaluate all checkpoints starting with the same prefix as model_name and ending with the step number.') 426 | parser.add_argument('--no_cuda', action='store_true', 427 | help='Avoid using CUDA when available.') 428 | parser.add_argument('--overwrite_output_dir', action='store_true', 429 | help='Overwrite the content of the output directory.') 430 | parser.add_argument('--overwrite_cache', action='store_true', 431 | help='Overwrite the cached training and evaluation sets.') 432 | parser.add_argument('--seed', type=int, default=42, 433 | help='Random seed for initialization.') 434 | 435 | parser.add_argument('--fp16', action='store_true', 436 | help="Whether to use 16-bit (mixed) precision (through NVIDIA apex) instead of 32-bit") 437 | parser.add_argument('--fp16_opt_level', type=str, default='O1', 438 | help="For fp16: Apex AMP optimization level selected in ['O0', 'O1', 'O2', and 'O3']."
439 | "See details at https://nvidia.github.io/apex/amp.html") 440 | args = parser.parse_args() 441 | 442 | # Setup CUDA, GPU 443 | if os.path.exists(args.output_dir) and os.listdir(args.output_dir) and args.do_train and not args.overwrite_output_dir: 444 | raise ValueError('Output directory ({}) already exists and is not empty. Use --overwrite_output_dir to overcome.'.format(args.output_dir)) 445 | 446 | device = torch.device('cuda' if torch.cuda.is_available() and not args.no_cuda else 'cpu') 447 | args.n_gpu = torch.cuda.device_count() 448 | args.device = device 449 | 450 | # Setup logging 451 | logging.basicConfig(format='%(asctime)s - %(levelname)s - %(name)s - %(message)s', 452 | datefmt='%m/%d/%Y %H:%M:%S', 453 | level=logging.INFO) 454 | logger.warning('Process device: %s, n_gpu: %s, 16-bits training: %s', 455 | device, args.n_gpu, args.fp16) 456 | 457 | # Set seed 458 | set_seed(args) 459 | # Prepare QPM task 460 | processor = QPMProcessor() 461 | label_list = processor.get_labels() 462 | num_labels = len(label_list) 463 | category_list = processor.get_categories() 464 | clf_num_labels = len(category_list) 465 | 466 | # Load pretrained model and tokenizer 467 | config_class, model_class, tokenizer_class = BertConfig, CustomizedBert, BertTokenizer 468 | config = config_class.from_pretrained(args.config_name if args.config_name else args.model_name_or_path, num_labels=num_labels) 469 | tokenizer = tokenizer_class.from_pretrained(args.tokenizer_name if args.tokenizer_name else args.model_name_or_path, do_lower_case=args.do_lower_case) 470 | model = model_class.from_pretrained(args.model_name_or_path, from_tf=bool('.ckpt' in args.model_name_or_path), config=config) 471 | model.to(args.device) 472 | 473 | logger.info('Training/evaluation parameters %s', args) 474 | parent_data_dir = args.data_dir 475 | parent_output_dir = args.output_dir 476 | 477 | # Training 478 | results_tmp = {} 479 | if args.do_train: 480 | # 10-Fold dataset for training. 481 | for i in range(0, 10): 482 | # Reload the pretrained model. 483 | model = model_class.from_pretrained(args.model_name_or_path, 484 | from_tf=bool('.ckpt' in args.model_name_or_path), 485 | config=config) 486 | model.to(args.device) 487 | 488 | args.data_dir = parent_data_dir + str(i) 489 | args.output_dir = parent_output_dir + str(i) 490 | 491 | train_dataset = load_and_cache_examples(args, tokenizer, set_type='train') 492 | global_step, tr_loss = train(args, train_dataset, model, tokenizer) 493 | logger.info(" global_step = %s, average loss = %s", global_step, tr_loss) 494 | # Saving best-practices: if you use default names for the model, you can reload it using from_pretrained() 495 | # Create output directory if needed 496 | if not os.path.exists(args.output_dir): 497 | os.makedirs(args.output_dir) 498 | 499 | logger.info("Saving model checkpoint to %s", args.output_dir) 500 | # Save a trained model, configuration and tokenizer using `save_pretrained()`.
501 | # They can then be reloaded using `from_pretrained()` 502 | model_to_save = model.module if hasattr(model, 'module') else model # Take care of distributed/parallel training 503 | model_to_save.save_pretrained(args.output_dir) 504 | tokenizer.save_pretrained(args.output_dir) 505 | 506 | # Good practice: save your training arguments together with the trained model 507 | torch.save(args, os.path.join(args.output_dir, 'training_args.bin')) 508 | 509 | # Load a trained model and vocabulary that you have fine-tuned 510 | model = model_class.from_pretrained(args.output_dir) 511 | tokenizer = tokenizer_class.from_pretrained(args.output_dir, do_lower_case=args.do_lower_case) 512 | model.to(args.device) 513 | 514 | # To reduce disk usage, evaluate and keep only the best checkpoint for each sub-dataset. 515 | args.data_dir = parent_data_dir + str(i) 516 | args.output_dir = parent_output_dir + str(i) 517 | checkpoints = [args.output_dir] 518 | if args.eval_all_checkpoints: 519 | checkpoints = list(os.path.dirname(c) for c in sorted(glob.glob(args.output_dir + '/**/' + WEIGHTS_NAME, recursive=True))) 520 | logging.getLogger("pytorch_transformers.modeling_utils").setLevel(logging.WARN) # Reduce logging 521 | logger.info("Evaluate the following checkpoints: %s", checkpoints) 522 | best_f1 = 0.0 523 | for checkpoint in checkpoints: 524 | global_step = checkpoint.split('-')[-1] if len(checkpoints) > 1 else "" 525 | model = model_class.from_pretrained(checkpoint) 526 | model.to(args.device) 527 | result = evaluate(args, model, tokenizer, prefix=global_step) 528 | if result['f1'] > best_f1: 529 | best_f1 = result['f1'] 530 | # Save the best model checkpoint 531 | output_dir = os.path.join(args.output_dir, 'best_checkpoint_fold' + str(i)) 532 | if not os.path.exists(output_dir): 533 | os.makedirs(output_dir) 534 | model_to_save = model.module if hasattr(model, 'module') else model 535 | model_to_save.save_pretrained(output_dir) 536 | torch.save(args, os.path.join(output_dir, 'training_args.bin')) 537 | logger.info('Saving model checkpoint to %s', output_dir) 538 | 539 | result = dict((k + '_{}'.format(global_step), v) for k, v in result.items()) 540 | results_tmp.update(result) 541 | checkpoints.remove(args.output_dir) 542 | for checkpoint in checkpoints: 543 | shutil.rmtree(checkpoint) 544 | 545 | # Evaluation 546 | results = {} 547 | if args.do_eval: 548 | for i in range(10): 549 | args.data_dir = parent_data_dir + str(i) 550 | args.output_dir = parent_output_dir + str(i) 551 | tokenizer = tokenizer_class.from_pretrained(args.output_dir, do_lower_case=args.do_lower_case) 552 | checkpoints = [args.output_dir] 553 | if args.eval_all_checkpoints: 554 | checkpoints = list(os.path.dirname(c) for c in sorted(glob.glob(args.output_dir + '/**/' + WEIGHTS_NAME, recursive=True))) 555 | logging.getLogger("pytorch_transformers.modeling_utils").setLevel(logging.WARN) # Reduce logging 556 | logger.info("Evaluate the following checkpoints: %s", checkpoints) 557 | best_f1 = 0.0 558 | for checkpoint in checkpoints: 559 | global_step = checkpoint.split('-')[-1] if len(checkpoints) > 1 else "" 560 | model = model_class.from_pretrained(checkpoint) 561 | model.to(args.device) 562 | result = evaluate(args, model, tokenizer, prefix=global_step) 563 | if result['f1'] > best_f1: 564 | best_f1 = result['f1'] 565 | # Save the best model checkpoint 566 | output_dir = os.path.join(args.output_dir, 'best_checkpoint_fold' + str(i)) 567 | if not os.path.exists(output_dir): 568 | os.makedirs(output_dir) 569 | model_to_save = model.module if hasattr(model,
'module') else model 570 | model_to_save.save_pretrained(output_dir) 571 | torch.save(args, os.path.join(output_dir, 'training_args.bin')) 572 | logger.info('Saving model checkpoint to %s', output_dir) 573 | 574 | result = dict((k + '_{}'.format(global_step), v) for k, v in result.items()) 575 | results.update(result) 576 | 577 | # Prediction 578 | if args.do_predict: 579 | for i in range(10): 580 | args.data_dir = parent_data_dir + str(i) 581 | args.output_dir = parent_output_dir + str(i) 582 | tokenizer = tokenizer_class.from_pretrained(args.output_dir, do_lower_case=args.do_lower_case) 583 | logging.getLogger("pytorch_transformers.modeling_utils").setLevel(logging.WARN) # Reduce logging 584 | checkpoint = args.output_dir + '/best_checkpoint_fold' + str(i) 585 | model = model_class.from_pretrained(checkpoint) 586 | model.to(args.device) 587 | predict(args, model, tokenizer, i) 588 | 589 | # For bagging: majority vote over the 10 fold models (1 only when at least 6 folds predict 1). 590 | all = pd.read_csv('./data/sample_submission.csv') 591 | for i in range(10): 592 | df = pd.read_csv(parent_data_dir + str(i) + '/result.csv') 593 | all['label'] += df['label'] 594 | all['label'] = all['label'] // 6 595 | all.to_csv('./data/result.csv', index=False) 596 | 597 | 598 | if __name__ == '__main__': 599 | main() 600 | -------------------------------------------------------------------------------- /model_final_2.py: -------------------------------------------------------------------------------- 1 | # -*- coding: utf-8 -*- 2 | """ Finetuning the library models for chip2019 question pairs matching. """ 3 | 4 | import argparse 5 | import glob 6 | import logging 7 | import os 8 | import random 9 | import shutil 10 | 11 | import numpy as np 12 | import pandas as pd 13 | import torch 14 | import torch.nn as nn 15 | from torch.nn import CrossEntropyLoss, MSELoss 16 | from torch.utils.data import (DataLoader, RandomSampler, SequentialSampler, TensorDataset) 17 | from tensorboardX import SummaryWriter 18 | from tqdm import tqdm, trange 19 | from pytorch_transformers import (WEIGHTS_NAME, BertConfig, BertModel, BertTokenizer) 20 | from pytorch_transformers import AdamW, WarmupLinearSchedule 21 | from pytorch_transformers.modeling_bert import BertPreTrainedModel 22 | from data_utils import (compute_metrics, convert_examples_to_features, QPMProcessor) 23 | 24 | logger = logging.getLogger(__name__) 25 | 26 | 27 | class CustomizedBert(BertPreTrainedModel): 28 | def __init__(self, config): 29 | super(CustomizedBert, self).__init__(config) 30 | self.num_labels = 2 31 | self.num_categories = 5 32 | self.num_features = 10 33 | self.bert = BertModel(config) 34 | self.dropout = nn.Dropout(config.hidden_dropout_prob) 35 | self.classifier1 = nn.Linear(config.hidden_size + self.num_features, self.num_labels) 36 | self.classifier2 = nn.Linear(config.hidden_size, self.num_categories) 37 | self.features_bn = nn.BatchNorm1d(self.num_features) 38 | self.features_dense = nn.Linear(self.num_features, self.num_features) 39 | 40 | self.init_weights() 41 | 42 | def forward(self, input_ids, ct_clf_input_ids, token_type_ids=None, attention_mask=None, position_ids=None, labels=None, 43 | ct_clf_token_type_ids=None, ct_clf_attention_mask=None, ct_clf_position_ids=None, categories=None, 44 | head_mask=None, hand_features=None): 45 | outputs1 = self.bert(input_ids, position_ids=position_ids, token_type_ids=token_type_ids, 46 | attention_mask=attention_mask, head_mask=head_mask) 47 | outputs2 = self.bert(ct_clf_input_ids, position_ids=ct_clf_position_ids, token_type_ids=ct_clf_token_type_ids, 48 | attention_mask=ct_clf_attention_mask,
head_mask=head_mask) 49 | pooled_output1 = outputs1[1] 50 | pooled_output2 = outputs2[1] 51 | 52 | hand_features = hand_features.float() 53 | hand_features = self.features_dense(self.features_bn(hand_features)) 54 | pooled_output1 = torch.cat((pooled_output1, hand_features), dim=1) 55 | pooled_output1 = self.dropout(pooled_output1) 56 | pooled_output2 = self.dropout(pooled_output2) 57 | logits1 = self.classifier1(pooled_output1) 58 | logits2 = self.classifier2(pooled_output2) 59 | 60 | outputs1 = (logits1,) + outputs1[2:] # add hidden states and attention if they are here 61 | outputs2 = (logits2,) + outputs2[2:] 62 | 63 | if labels is not None: 64 | if self.num_labels == 1: 65 | # We are doing regression 66 | loss_fct = MSELoss() 67 | loss = loss_fct(logits1.view(-1), labels.view(-1)) 68 | else: 69 | loss_fct = CrossEntropyLoss() 70 | loss = loss_fct(logits1.view(-1, self.num_labels), labels.view(-1)) 71 | outputs1 = (loss,) + outputs1 72 | if categories is not None: 73 | if self.num_categories == 1: 74 | # We are doing regression 75 | loss_fct = MSELoss() 76 | loss = loss_fct(logits2.view(-1), categories.view(-1)) 77 | else: 78 | loss_fct = CrossEntropyLoss() 79 | loss = loss_fct(logits2.view(-1, self.num_categories), categories.view(-1)) 80 | outputs2 = (loss,) + outputs2 81 | 82 | return outputs1, outputs2 # (loss), logits, (hidden_states), (attentions) 83 | 84 | 85 | def set_seed(args): 86 | random.seed(args.seed) 87 | np.random.seed(args.seed) 88 | torch.manual_seed(args.seed) 89 | if args.n_gpu > 0: 90 | torch.cuda.manual_seed_all(args.seed) 91 | 92 | 93 | def train(args, train_dataset, model, tokenizer): 94 | """ Train the model. """ 95 | tb_writer = SummaryWriter() 96 | 97 | args.train_batch_size = args.per_gpu_train_batch_size * max(1, args.n_gpu) 98 | train_sampler = RandomSampler(train_dataset) 99 | train_dataloader = DataLoader(train_dataset, sampler=train_sampler, batch_size=args.train_batch_size) 100 | 101 | if args.max_steps > 0: 102 | t_total = args.max_steps 103 | args.num_train_epochs = args.max_steps // (len(train_dataloader) // args.gradient_accumulation_steps) + 1 104 | else: 105 | t_total = len(train_dataloader) // args.gradient_accumulation_steps * args.num_train_epochs 106 | 107 | # Prepare optimizer and schedule (linear warmup and decay) 108 | no_decay = ['bias', 'LayerNorm.weight'] 109 | optimizer_grouped_parameters = [ 110 | {'params': [p for n, p in model.named_parameters() if not any(nd in n for nd in no_decay)], 'weight_decay': args.weight_decay}, 111 | {'params': [p for n, p in model.named_parameters() if any(nd in n for nd in no_decay)], 'weight_decay': 0.0} 112 | ] 113 | optimizer = AdamW(optimizer_grouped_parameters, lr=args.learning_rate, eps=args.adam_epsilon) 114 | scheduler = WarmupLinearSchedule(optimizer, warmup_steps=args.warmup_steps, t_total=t_total) 115 | if args.fp16: 116 | try: 117 | from apex import amp 118 | except ImportError: 119 | raise ImportError('Please install apex from https://www.github.com/nvidia/apex to use fp16 training.') 120 | model, optimizer = amp.initialize(model, optimizer, opt_level=args.fp16_opt_level) 121 | 122 | # multi-gpu training (should be after apex fp16 initialization) 123 | if args.n_gpu > 1: 124 | model = torch.nn.DataParallel(model) 125 | 126 | # Train! 
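    # This is where model_final_2.py diverges from model_final.py: instead of
    # checkpointing every save_steps steps and pruning afterwards, the loop below
    # tracks the best dev F1 seen so far and overwrites a single best_checkpoint/
    # directory whenever it improves, trading extra evaluation time for much lower
    # disk usage (the trade-off noted in the README).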
127 | logger.info('***** Running training *****') 128 | logger.info(' Num examples = %d', len(train_dataset)) 129 | logger.info(' Num Epochs = %d', args.num_train_epochs) 130 | logger.info(' Instantaneous batch size per GPU = %d', args.per_gpu_train_batch_size) 131 | logger.info(' Total train batch size (w. parallel & accumulation) = %d', 132 | args.train_batch_size * args.gradient_accumulation_steps) 133 | logger.info(' Gradient Accumulation steps = %d', args.gradient_accumulation_steps) 134 | logger.info(' Total optimization steps = %d', t_total) 135 | 136 | global_step = 0 137 | tr_loss, logging_loss = 0.0, 0.0 138 | model.zero_grad() 139 | train_iterator = trange(int(args.num_train_epochs), desc='Epoch') 140 | set_seed(args) # Added here for reproducibility 141 | 142 | max_val_acc = 0 143 | max_val_f1 = 0 144 | 145 | for _ in train_iterator: 146 | for step, batch in enumerate(train_dataloader): 147 | model.train() 148 | batch = tuple(t.to(args.device) for t in batch) 149 | inputs = {'input_ids': batch[0], 150 | 'attention_mask': batch[1], 151 | 'token_type_ids': batch[2], 152 | 'labels': batch[3], 153 | 'ct_clf_input_ids': batch[4], 154 | 'ct_clf_attention_mask': batch[5], 155 | 'ct_clf_token_type_ids': batch[6], 156 | 'categories': batch[7], 157 | 'hand_features': batch[8]} 158 | outputs = model(**inputs) 159 | loss, clf_loss = outputs[0][0], outputs[1][0] # model outputs are always tuple in pytorch_transformers (see doc) 160 | 161 | total_loss = loss + clf_loss 162 | if args.n_gpu > 1: 163 | total_loss = total_loss.mean() 164 | if args.gradient_accumulation_steps > 1: 165 | total_loss = total_loss / args.gradient_accumulation_steps 166 | 167 | if args.fp16: 168 | with amp.scale_loss(total_loss, optimizer) as scaled_loss: 169 | scaled_loss.backward() 170 | torch.nn.utils.clip_grad_norm_(amp.master_params(optimizer), args.max_grad_norm) 171 | else: 172 | total_loss.backward() 173 | torch.nn.utils.clip_grad_norm_(model.parameters(), args.max_grad_norm) 174 | 175 | tr_loss += total_loss.item() 176 | if (step + 1) % args.gradient_accumulation_steps == 0: 177 | optimizer.step() 178 | scheduler.step() 179 | model.zero_grad() 180 | global_step += 1 181 | 182 | if args.logging_steps > 0 and global_step % args.logging_steps == 0: 183 | # Log metrics 184 | if args.evaluate_during_training: 185 | result = evaluate(args, model, tokenizer) 186 | for key, value in result.items(): 187 | tb_writer.add_scalar('eval_{}'.format(key), value, global_step) 188 | if result['acc'] > max_val_acc: 189 | max_val_acc = result['acc'] 190 | if result['f1'] > max_val_f1: 191 | max_val_f1 = result['f1'] 192 | output_dir = os.path.join(args.output_dir, 'best_checkpoint') 193 | if not os.path.exists(output_dir): 194 | os.makedirs(output_dir) 195 | model_to_save = model.module if hasattr(model, 'module') else model 196 | model_to_save.save_pretrained(output_dir) 197 | torch.save(args, os.path.join(output_dir, 'training_args.bin')) 198 | logger.info('Saving model checkpoint with f1 {:.4f}'.format(max_val_f1)) 199 | tb_writer.add_scalar('lr', scheduler.get_lr()[0], global_step) 200 | tb_writer.add_scalar('loss', (tr_loss-logging_loss)/args.logging_steps, global_step) 201 | logging_loss = tr_loss 202 | 203 | if args.max_steps > 0 and global_step > args.max_steps: 204 | train_iterator.close() 205 | break 206 | 207 | tb_writer.close() 208 | return global_step, tr_loss / global_step 209 | 210 | 211 | def evaluate(args, model, tokenizer, prefix=''): 212 | eval_output_dir = args.output_dir 213 | 214 | results = {} 215 | eval_dataset =
load_and_cache_examples(args, tokenizer, set_type='dev') 216 | 217 | if not os.path.exists(eval_output_dir): 218 | os.makedirs(eval_output_dir) 219 | 220 | args.eval_batch_size = args.per_gpu_eval_batch_size * max(1, args.n_gpu) 221 | eval_sampler = SequentialSampler(eval_dataset) 222 | eval_dataloader = DataLoader(eval_dataset, sampler=eval_sampler, batch_size=args.eval_batch_size) 223 | 224 | # Eval! 225 | logger.info('***** Running evaluation {} *****'.format(prefix)) 226 | logger.info(' Num examples = %d', len(eval_dataset)) 227 | logger.info(' Batch size = %d', args.eval_batch_size) 228 | eval_loss = 0.0 229 | nb_eval_steps = 0 230 | preds = None 231 | out_label_ids = None 232 | for batch in eval_dataloader: 233 | model.eval() 234 | batch = tuple(t.to(args.device) for t in batch) 235 | 236 | with torch.no_grad(): 237 | inputs = {'input_ids': batch[0], 238 | 'attention_mask': batch[1], 239 | 'token_type_ids': batch[2], 240 | 'labels': batch[3], 241 | 'ct_clf_input_ids': batch[4], 242 | 'ct_clf_attention_mask': batch[5], 243 | 'ct_clf_token_type_ids': batch[6], 244 | 'categories': batch[7], 245 | 'hand_features': batch[8]} 246 | outputs = model(**inputs) 247 | tmp_eval_loss, logits = outputs[0][:2] 248 | eval_loss += tmp_eval_loss.mean().item() 249 | nb_eval_steps += 1 250 | if preds is None: 251 | preds = logits.detach().cpu().numpy() 252 | out_label_ids = inputs['labels'].detach().cpu().numpy() 253 | else: 254 | preds = np.append(preds, logits.detach().cpu().numpy(), axis=0) 255 | out_label_ids = np.append(out_label_ids, inputs['labels'].detach().cpu().numpy(), axis=0) 256 | 257 | eval_loss = eval_loss / nb_eval_steps 258 | preds = np.argmax(preds, axis=1) 259 | result = compute_metrics(preds, out_label_ids) 260 | results.update(result) 261 | 262 | output_eval_file = os.path.join(eval_output_dir, 'eval_results.txt') 263 | with open(output_eval_file, 'a') as writer: 264 | for key in sorted(result.keys()): 265 | logger.info(' %s = %s', key, str(result[key])) 266 | writer.write('%s = %s\n' % (key, str(result[key]))) 267 | writer.write('='*20 + '\n') 268 | 269 | return results 270 | 271 | 272 | def predict(args, model, tokenizer, index): 273 | test_dataset = load_and_cache_examples(args, tokenizer, set_type='test') 274 | 275 | args.test_batch_size = args.per_gpu_test_batch_size * max(1, args.n_gpu) 276 | test_sampler = SequentialSampler(test_dataset) 277 | test_dataloader = DataLoader(test_dataset, sampler=test_sampler, batch_size=args.test_batch_size) 278 | 279 | # Eval! 
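    # As in model_final.py, only the pair-matching head is used at test time; here
    # the argmaxed predictions go to result.csv directly under args.data_dir (which
    # already carries the fold suffix), so the index argument is effectively unused.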
280 | logger.info('***** Running prediction *****') 281 | logger.info(' Num examples = %d', len(test_dataset)) 282 | logger.info(' Batch size = %d', args.test_batch_size) 283 | preds = None 284 | for batch in tqdm(test_dataloader, desc='Testing'): 285 | model.eval() 286 | batch = tuple(t.to(args.device) for t in batch) 287 | 288 | with torch.no_grad(): 289 | inputs = {'input_ids': batch[0], 290 | 'attention_mask': batch[1], 291 | 'token_type_ids': batch[2], 292 | 'labels': batch[3], 293 | 'ct_clf_input_ids': batch[4], 294 | 'ct_clf_attention_mask': batch[5], 295 | 'ct_clf_token_type_ids': batch[6], 296 | 'categories': batch[7], 297 | 'hand_features': batch[8]} 298 | outputs = model(**inputs) 299 | tmp_eval_loss, logits = outputs[0][:2] 300 | if preds is None: 301 | preds = logits.detach().cpu().numpy() 302 | else: 303 | preds = np.append(preds, logits.detach().cpu().numpy(), axis=0) 304 | 305 | preds = np.argmax(preds, axis=1) 306 | with open(os.path.join(args.data_dir, 'result.csv'), 'w') as f: 307 | f.write('id,label\n') 308 | for i, pred in enumerate(preds): 309 | f.write('%d,%d\n' % (i, pred)) 310 | 311 | 312 | def load_and_cache_examples(args, tokenizer, set_type): 313 | processor = QPMProcessor() 314 | # Load data features from cache or dataset file 315 | cached_features_file = os.path.join(args.data_dir, 'cached_{}_{}_{}_customized'.format( 316 | set_type, 317 | list(filter(None, args.model_name_or_path.split('/'))).pop(), 318 | str(args.max_seq_length) 319 | )) 320 | if os.path.exists(cached_features_file): 321 | logger.info('Loading features from cache file %s', cached_features_file) 322 | features = torch.load(cached_features_file) 323 | else: 324 | logger.info('Creating features from dataset file at %s', args.data_dir) 325 | label_list = processor.get_labels() 326 | category_list = processor.get_categories() 327 | examples = processor.get_examples(args.data_dir, set_type) 328 | features = convert_examples_to_features(examples, label_list, category_list, args.max_seq_length, tokenizer, 329 | cls_token_at_end=False, # xlnet has a cls token at the end 330 | cls_token=tokenizer.cls_token, 331 | cls_token_segment_id=0, 332 | sep_token=tokenizer.sep_token, 333 | sep_token_extra=False, 334 | pad_on_left=False, 335 | pad_token=tokenizer.convert_tokens_to_ids([tokenizer.pad_token])[0], 336 | pad_token_segment_id=0 337 | # cls_token_at_end=bool(args.model_type in ['xlnet']), # xlnet has a cls token at the end 338 | # cls_token=tokenizer.cls_token, 339 | # cls_token_segment_id=2 if args.model_type in ['xlnet'] else 0, 340 | # sep_token=tokenizer.sep_token, 341 | # sep_token_extra=bool(args.model_type in ['roberta']), # roberta uses an extra separator b/w pairs of sentences, cf. 
github.com/pytorch/fairseq/commit/1684e166e3da03f5b600dbb7855cb98ddfcd0805 342 | # pad_on_left=bool(args.model_type in ['xlnet']), # pad on the left for xlnet 343 | # pad_token=tokenizer.convert_tokens_to_ids([tokenizer.pad_token])[0], 344 | # pad_token_segment_id=4 if args.model_type in ['xlnet'] else 0, 345 | ) 346 | logger.info("Saving features into cached file %s", cached_features_file) 347 | torch.save(features, cached_features_file) 348 | 349 | # Convert to Tensors and build dataset 350 | all_input_ids = torch.tensor([f.input_ids for f in features], dtype=torch.long) 351 | all_input_mask = torch.tensor([f.input_mask for f in features], dtype=torch.long) 352 | all_segment_ids = torch.tensor([f.segment_ids for f in features], dtype=torch.long) 353 | all_label_ids = torch.tensor([f.label_id for f in features], dtype=torch.long) 354 | all_ct_clf_input_ids = torch.tensor([f.category_clf_input_ids for f in features], dtype=torch.long) 355 | all_ct_clf_input_mask = torch.tensor([f.category_clf_input_mask for f in features], dtype=torch.long) 356 | all_ct_clf_segment_ids = torch.tensor([f.category_clf_segment_ids for f in features], dtype=torch.long) 357 | all_category_ids = torch.tensor([f.category_id for f in features], dtype=torch.long) 358 | all_hand_features = torch.tensor([f.hand_features for f in features], dtype=torch.long) 359 | 360 | dataset = TensorDataset(all_input_ids, all_input_mask, all_segment_ids, all_label_ids, 361 | all_ct_clf_input_ids, all_ct_clf_input_mask, all_ct_clf_segment_ids, all_category_ids, 362 | all_hand_features) 363 | return dataset 364 | 365 | 366 | def main(): 367 | parser = argparse.ArgumentParser() 368 | 369 | ## Required parameters 370 | parser.add_argument('--data_dir', default=None, type=str, required=True, 371 | help='The input data dir. Should contain the .csv files for the task.') 372 | parser.add_argument('--model_name_or_path', default=None, type=str, required=True, 373 | help='Path to pretrained model or shortcut name selected in the list.') 374 | parser.add_argument('--output_dir', default=None, type=str, required=True, 375 | help='The output directory where the model predictions and checkpoints will be written.') 376 | 377 | ## Other parameters 378 | parser.add_argument('--config_name', default='', type=str, 379 | help='Pretrained config name or path if not the same as model_name.') 380 | parser.add_argument('--tokenizer_name', default='', type=str, 381 | help='Pretrained tokenizer name or path if not the same as model_name.') 382 | parser.add_argument('--max_seq_length', default=128, type=int, 383 | help='The maximum total input sequence length after tokenization.
Sequences longer than this ' 384 | 'will be truncated, sequences shorter will be padded.') 385 | parser.add_argument('--do_train', action='store_true', 386 | help='Whether to run training.') 387 | parser.add_argument('--do_eval', action='store_true', 388 | help='Whether to run eval on the dev set.') 389 | parser.add_argument('--do_predict', action='store_true', 390 | help='Whether to run test on the test set.') 391 | parser.add_argument('--evaluate_during_training', action='store_true', 392 | help='Run evaluation during training at each logging step.') 393 | parser.add_argument('--do_lower_case', action='store_true', 394 | help='Set this flag if you are using an uncased model.') 395 | 396 | parser.add_argument('--num_features', default=10, type=int, 397 | help='Number of hand-crafted features.') 398 | parser.add_argument('--per_gpu_train_batch_size', default=8, type=int, 399 | help='Batch size per GPU/CPU for training.') 400 | parser.add_argument('--per_gpu_eval_batch_size', default=8, type=int, 401 | help='Batch size per GPU/CPU for evaluation.') 402 | parser.add_argument('--per_gpu_test_batch_size', default=8, type=int, 403 | help='Batch size per GPU/CPU for prediction.') 404 | parser.add_argument('--gradient_accumulation_steps', type=int, default=1, 405 | help='Number of update steps to accumulate before performing a backward/update pass.') 406 | parser.add_argument('--learning_rate', default=5e-5, type=float, 407 | help='The initial learning rate for Adam.') 408 | parser.add_argument('--weight_decay', default=0.0, type=float, 409 | help='Weight decay if we apply some.') 410 | parser.add_argument('--adam_epsilon', default=1e-8, type=float, 411 | help='Epsilon for Adam optimizer.') 412 | parser.add_argument('--max_grad_norm', default=1.0, type=float, 413 | help='Max gradient norm.') 414 | parser.add_argument('--num_train_epochs', default=3.0, type=float, 415 | help='Total number of training epochs to perform.') 416 | parser.add_argument('--max_steps', default=-1, type=int, 417 | help='If > 0: set total number of training steps to perform. Overrides num_train_epochs.') 418 | parser.add_argument('--warmup_steps', default=0, type=int, 419 | help='Linear warmup over warmup_steps.') 420 | 421 | parser.add_argument('--logging_steps', type=int, default=100, 422 | help='Log every X update steps.') 423 | parser.add_argument('--save_steps', type=int, default=100, 424 | help='Save checkpoint every X update steps.') 425 | parser.add_argument('--eval_all_checkpoints', action='store_true', 426 | help='Evaluate all checkpoints starting with the same prefix as model_name and ending with the step number.') 427 | parser.add_argument('--no_cuda', action='store_true', 428 | help='Avoid using CUDA when available.') 429 | parser.add_argument('--overwrite_output_dir', action='store_true', 430 | help='Overwrite the content of the output directory.') 431 | parser.add_argument('--overwrite_cache', action='store_true', 432 | help='Overwrite the cached training and evaluation sets.') 433 | parser.add_argument('--seed', type=int, default=42, 434 | help='Random seed for initialization.') 435 | 436 | parser.add_argument('--fp16', action='store_true', 437 | help="Whether to use 16-bit (mixed) precision (through NVIDIA apex) instead of 32-bit") 438 | parser.add_argument('--fp16_opt_level', type=str, default='O1', 439 | help="For fp16: Apex AMP optimization level selected in ['O0', 'O1', 'O2', and 'O3']."
440 | "See details at https://nvidia.github.io/apex/amp.html") 441 | args = parser.parse_args() 442 | 443 | # Setup CUDA, GPU 444 | if os.path.exists(args.output_dir) and os.listdir(args.output_dir) and args.do_train and not args.overwrite_output_dir: 445 | raise ValueError('Output directory ({}) already exists and is not empty. Use --overwrite_output_dir to overcome.'.format(args.output_dir)) 446 | 447 | device = torch.device('cuda' if torch.cuda.is_available() and not args.no_cuda else 'cpu') 448 | args.n_gpu = torch.cuda.device_count() 449 | args.device = device 450 | 451 | # Setup logging 452 | logging.basicConfig(format='%(asctime)s - %(levelname)s - %(name)s - %(message)s', 453 | datefmt='%m/%d/%Y %H:%M:%S', 454 | level=logging.INFO) 455 | logger.warning('Process device: %s, n_gpu: %s, 16-bits training: %s', 456 | device, args.n_gpu, args.fp16) 457 | 458 | # Set seed 459 | set_seed(args) 460 | # Prepare QPM task 461 | processor = QPMProcessor() 462 | label_list = processor.get_labels() 463 | num_labels = len(label_list) 464 | category_list = processor.get_categories() 465 | clf_num_labels = len(category_list) 466 | 467 | # Load pretrained model and tokenizer 468 | config_class, model_class, tokenizer_class = BertConfig, CustomizedBert, BertTokenizer 469 | config = config_class.from_pretrained(args.config_name if args.config_name else args.model_name_or_path, num_labels=num_labels) 470 | tokenizer = tokenizer_class.from_pretrained(args.tokenizer_name if args.tokenizer_name else args.model_name_or_path, do_lower_case=args.do_lower_case) 471 | model = model_class.from_pretrained(args.model_name_or_path, from_tf=bool('.ckpt' in args.model_name_or_path), config=config) 472 | model.to(args.device) 473 | 474 | logger.info('Training/evaluation parameters %s', args) 475 | parent_data_dir = args.data_dir 476 | parent_output_dir = args.output_dir 477 | 478 | # Training 479 | results_tmp = {} 480 | if args.do_train: 481 | # 10-Fold dataset for training. 482 | for i in range(0, 10): 483 | # Reload the pretrained model. 484 | model = model_class.from_pretrained(args.model_name_or_path, 485 | from_tf=bool('.ckpt' in args.model_name_or_path), 486 | config=config) 487 | model.to(args.device) 488 | 489 | args.data_dir = parent_data_dir + str(i) 490 | args.output_dir = parent_output_dir + str(i) 491 | 492 | train_dataset = load_and_cache_examples(args, tokenizer, set_type='train') 493 | global_step, tr_loss = train(args, train_dataset, model, tokenizer) 494 | logger.info(" global_step = %s, average loss = %s", global_step, tr_loss) 495 | # Saving best-practices: if you use default names for the model, you can reload it using from_pretrained() 496 | # Create output directory if needed 497 | if not os.path.exists(args.output_dir): 498 | os.makedirs(args.output_dir) 499 | 500 | logger.info("Saving model checkpoint to %s", args.output_dir) 501 | # Save a trained model, configuration and tokenizer using `save_pretrained()`.
502 | # They can then be reloaded using `from_pretrained()` 503 | model_to_save = model.module if hasattr(model, 'module') else model # Take care of distributed/parallel training 504 | model_to_save.save_pretrained(args.output_dir) 505 | tokenizer.save_pretrained(args.output_dir) 506 | 507 | # Good practice: save your training arguments together with the trained model 508 | torch.save(args, os.path.join(args.output_dir, 'training_args.bin')) 509 | 510 | # Load a trained model and vocabulary that you have fine-tuned 511 | model = model_class.from_pretrained(args.output_dir) 512 | tokenizer = tokenizer_class.from_pretrained(args.output_dir, do_lower_case=args.do_lower_case) 513 | model.to(args.device) 514 | 515 | # Evaluation 516 | results = {} 517 | if args.do_eval: 518 | for i in range(10): 519 | args.data_dir = parent_data_dir + str(i) 520 | args.output_dir = parent_output_dir + str(i) 521 | tokenizer = tokenizer_class.from_pretrained(args.output_dir, do_lower_case=args.do_lower_case) 522 | checkpoints = [args.output_dir] 523 | if args.eval_all_checkpoints: 524 | checkpoints = list(os.path.dirname(c) for c in sorted(glob.glob(args.output_dir + '/**/' + WEIGHTS_NAME, recursive=True))) 525 | logging.getLogger("pytorch_transformers.modeling_utils").setLevel(logging.WARN) # Reduce logging 526 | logger.info("Evaluate the following checkpoints: %s", checkpoints) 527 | best_f1 = 0.0 528 | for checkpoint in checkpoints: 529 | global_step = checkpoint.split('-')[-1] if len(checkpoints) > 1 else "" 530 | model = model_class.from_pretrained(checkpoint) 531 | model.to(args.device) 532 | result = evaluate(args, model, tokenizer, prefix=global_step) 533 | if result['f1'] > best_f1: 534 | best_f1 = result['f1'] 535 | # Save the best model checkpoint 536 | output_dir = os.path.join(args.output_dir, 'best_checkpoint_fold' + str(i)) 537 | if not os.path.exists(output_dir): 538 | os.makedirs(output_dir) 539 | model_to_save = model.module if hasattr(model, 'module') else model 540 | model_to_save.save_pretrained(output_dir) 541 | torch.save(args, os.path.join(output_dir, 'training_args.bin')) 542 | logger.info('Saving model checkpoint to %s', output_dir) 543 | 544 | result = dict((k + '_{}'.format(global_step), v) for k, v in result.items()) 545 | results.update(result) 546 | 547 | # Prediction 548 | if args.do_predict: 549 | for i in range(10): 550 | args.data_dir = parent_data_dir + str(i) 551 | args.output_dir = parent_output_dir + str(i) 552 | tokenizer = tokenizer_class.from_pretrained(args.output_dir, do_lower_case=args.do_lower_case) 553 | logging.getLogger("pytorch_transformers.modeling_utils").setLevel(logging.WARN) # Reduce logging 554 | checkpoint = args.output_dir + '/best_checkpoint' 555 | model = model_class.from_pretrained(checkpoint) 556 | model.to(args.device) 557 | predict(args, model, tokenizer, i) 558 | 559 | # For bagging.
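        # The ensemble below is a majority vote over the 10 fold models: each fold
        # contributes a 0/1 label per test example, the votes are summed, and the
        # integer division by 6 maps the sum to 1 only when at least 6 of the 10
        # folds predicted a match, e.g. sum(votes) = 7 gives 7 // 6 = 1, while
        # sum(votes) = 5 gives 5 // 6 = 0.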
560 | num_test = 50000 561 | all = pd.DataFrame({'id': [i for i in range(num_test)], 'label': num_test*[0]}) 562 | for i in range(10): 563 | args.data_dir = parent_data_dir + str(i) 564 | df = pd.read_csv(args.data_dir + '/result.csv') 565 | all['label'] += df['label'] 566 | all['label'] = all['label'] // 6 567 | all.to_csv('./data/result.csv', index=False) 568 | 569 | 570 | if __name__ == '__main__': 571 | main() 572 | -------------------------------------------------------------------------------- /model_multitask.py: -------------------------------------------------------------------------------- 1 | # -*— coding: utf-8 -*- 2 | """ Finetuning the library models for chip2019 question pairs matching. """ 3 | 4 | import argparse 5 | import glob 6 | import logging 7 | import os 8 | import random 9 | import shutil 10 | 11 | import numpy as np 12 | import pandas as pd 13 | import torch 14 | import torch.nn as nn 15 | from torch.nn import CrossEntropyLoss, MSELoss 16 | from torch.utils.data import (DataLoader, RandomSampler, SequentialSampler, TensorDataset) 17 | from tensorboardX import SummaryWriter 18 | from tqdm import tqdm, trange 19 | from pytorch_transformers import (WEIGHTS_NAME, BertConfig, BertModel, BertTokenizer) 20 | from pytorch_transformers import AdamW, WarmupLinearSchedule 21 | from pytorch_transformers.modeling_bert import BertPreTrainedModel 22 | from data_utils import (compute_metrics, convert_examples_to_features, QPMProcessor) 23 | 24 | logger = logging.getLogger(__name__) 25 | 26 | 27 | class MultiTaskBert(BertPreTrainedModel): 28 | def __init__(self, config): 29 | super(MultiTaskBert, self).__init__(config) 30 | self.num_labels = 3 31 | self.num_categories = 5 32 | self.bert = BertModel(config) 33 | self.dropout = nn.Dropout(config.hidden_dropout_prob) 34 | self.classifier1 = nn.Linear(config.hidden_size, self.num_labels) 35 | self.classifier2 = nn.Linear(config.hidden_size, self.num_categories) 36 | 37 | self.init_weights() 38 | 39 | def forward(self, input_ids, ct_clf_input_ids, token_type_ids=None, attention_mask=None, position_ids=None, labels=None, 40 | ct_clf_token_type_ids=None, ct_clf_attention_mask=None, ct_clf_position_ids=None, categories=None, 41 | head_mask=None): 42 | outputs1 = self.bert(input_ids, position_ids=position_ids, token_type_ids=token_type_ids, 43 | attention_mask=attention_mask, head_mask=head_mask) 44 | outputs2 = self.bert(ct_clf_input_ids, position_ids=ct_clf_position_ids, token_type_ids=ct_clf_token_type_ids, 45 | attention_mask=ct_clf_attention_mask, head_mask=head_mask) 46 | pooled_output1 = outputs1[1] 47 | pooled_output2 = outputs2[1] 48 | pooled_output1 = self.dropout(pooled_output1) 49 | pooled_output2 = self.dropout(pooled_output2) 50 | logits1 = self.classifier1(pooled_output1) 51 | logits2 = self.classifier2(pooled_output2) 52 | 53 | outputs1 = (logits1,) + outputs1[2:] # add hidden states and attention if they are here 54 | outputs2 = (logits2,) + outputs2[2:] 55 | 56 | if labels is not None: 57 | if self.num_labels == 1: 58 | # We are doing regression 59 | loss_fct = MSELoss() 60 | loss = loss_fct(logits1.view(-1), labels.view(-1)) 61 | else: 62 | loss_fct = CrossEntropyLoss() 63 | loss = loss_fct(logits1.view(-1, self.num_labels), labels.view(-1)) 64 | outputs1 = (loss,) + outputs1 65 | if categories is not None: 66 | if self.num_categories == 1: 67 | # We are doing regression 68 | loss_fct = MSELoss() 69 | loss = loss_fct(logits2.view(-1), categories.view(-1)) 70 | else: 71 | loss_fct = CrossEntropyLoss() 72 | loss = 
loss_fct(logits2.view(-1, self.num_categories), categories.view(-1)) 73 | outputs2 = (loss,) + outputs2 74 | 75 | return outputs1, outputs2 # (loss), logits, (hidden_states), (attentions) 76 | 77 | 78 | def set_seed(args): 79 | random.seed(args.seed) 80 | np.random.seed(args.seed) 81 | torch.manual_seed(args.seed) 82 | if args.n_gpu > 0: 83 | torch.cuda.manual_seed_all(args.seed) 84 | 85 | 86 | def train(args, train_dataset, model, tokenizer): 87 | """ Train the model. """ 88 | tb_writer = SummaryWriter() 89 | 90 | args.train_batch_size = args.per_gpu_train_batch_size * max(1, args.n_gpu) 91 | train_sampler = RandomSampler(train_dataset) 92 | train_dataloader = DataLoader(train_dataset, sampler=train_sampler, batch_size=args.train_batch_size) 93 | 94 | if args.max_steps > 0: 95 | t_total = args.max_steps 96 | args.num_train_epochs = args.max_steps // (len(train_dataloader) // args.gradient_accumulation_steps) + 1 97 | else: 98 | t_total = len(train_dataloader) // args.gradient_accumulation_steps * args.num_train_epochs 99 | 100 | # Prepare optimizer and schedule (linear warmup and decay) 101 | no_decay = ['bias', 'LayerNorm.weight'] 102 | optimizer_grouped_parameters = [ 103 | {'params': [p for n, p in model.named_parameters() if not any(nd in n for nd in no_decay)], 'weight_decay': args.weight_decay}, 104 | {'params': [p for n, p in model.named_parameters() if any(nd in n for nd in no_decay)], 'weight_decay': 0.0} 105 | ] 106 | optimizer = AdamW(optimizer_grouped_parameters, lr=args.learning_rate, eps=args.adam_epsilon) 107 | scheduler = WarmupLinearSchedule(optimizer, warmup_steps=args.warmup_steps, t_total=t_total) 108 | if args.fp16: 109 | try: 110 | from apex import amp 111 | except ImportError: 112 | raise ImportError('Please install apex from https://www.github.com/nvidia/apex to use fp16 training.') 113 | model, optimizer = amp.initialize(model, optimizer, opt_level=args.fp16_opt_level) 114 | 115 | # multi-gpu training (should be after apex fp16 initialization) 116 | if args.n_gpu > 1: 117 | model = torch.nn.DataParallel(model) 118 | 119 | # Train! 120 | logger.info('***** Running training *****') 121 | logger.info(' Num examples = %d', len(train_dataset)) 122 | logger.info(' Num Epochs = %d', args.num_train_epochs) 123 | logger.info(' Instantaneous batch size per GPU = %d', args.per_gpu_train_batch_size) 124 | logger.info(' Total train batch size (w. 
parallel & accumulation) = %d', 125 | args.train_batch_size * args.gradient_accumulation_steps) 126 | logger.info(' Gradient Accumulation steps = %d', args.gradient_accumulation_steps) 127 | logger.info(' Total optimization steps = %d', t_total) 128 | 129 | global_step = 0 130 | tr_loss, logging_loss = 0.0, 0.0 131 | model.zero_grad() 132 | train_iterator = trange(int(args.num_train_epochs), desc='Epoch') 133 | set_seed(args) # Added here for reproductibility 134 | for _ in train_iterator: 135 | epoch_iterator = tqdm(train_dataloader, desc='Iteration') 136 | for step, batch in enumerate(epoch_iterator): 137 | model.train() 138 | batch = tuple(t.to(args.device) for t in batch) 139 | inputs = {'input_ids': batch[0], 140 | 'attention_mask': batch[1], 141 | 'token_type_ids': batch[2], 142 | 'labels': batch[3], 143 | 'ct_clf_input_ids': batch[4], 144 | 'ct_clf_attention_mask': batch[5], 145 | 'ct_clf_token_type_ids': batch[6], 146 | 'categories': batch[7]} 147 | outputs = model(**inputs) 148 | loss, clf_loss = outputs[0][0], outputs[1][0] # model outputs are always tuple in pytorch_transformers (see doc) 149 | 150 | total_loss = loss + clf_loss 151 | if args.n_gpu > 1: 152 | total_loss = total_loss.mean() 153 | if args.gradient_accumulation_steps > 1: 154 | total_loss = total_loss / args.gradient_accumulation_steps 155 | 156 | if args.fp16: 157 | with amp.scale_los(total_loss.optimizer) as scaled_loss: 158 | scaled_loss.backward() 159 | torch.nn.utils.clip_grad_norm_(amp.master_params(optimizer), args.max_grad_norm) 160 | else: 161 | total_loss.backward() 162 | torch.nn.utils.clip_grad_norm_(model.parameters(), args.max_grad_norm) 163 | 164 | tr_loss += total_loss.item() 165 | if (step + 1) % args.gradient_accumulation_steps == 0: 166 | optimizer.step() 167 | scheduler.step() 168 | model.zero_grad() 169 | global_step += 1 170 | 171 | if args.logging_steps > 0 and global_step % args.logging_steps == 0: 172 | # Log metrics 173 | if args.evaluate_during_training: 174 | result = evaluate(args, model, tokenizer) 175 | for key, value in result.items(): 176 | tb_writer.add_scalar('eval_{}'.format(key), value, global_step) 177 | tb_writer.add_scalar('lr', scheduler.get_lr()[0], global_step) 178 | tb_writer.add_scalar('loss', (tr_loss-logging_loss)/args.logging_steps, global_step) 179 | logging_loss = tr_loss 180 | 181 | if args.save_steps > 0 and global_step % args.save_steps == 0: 182 | # Save model checkpoint 183 | output_dir = os.path.join(args.output_dir, 'checkpoint-{}'.format(global_step)) 184 | if not os.path.exists(output_dir): 185 | os.makedirs(output_dir) 186 | model_to_save = model.module if hasattr(model, 'module') else model 187 | model_to_save.save_pretrained(output_dir) 188 | torch.save(args, 'training_args.bin') 189 | logger.info('Saving model checkpoint to %s', output_dir) 190 | 191 | if args.max_steps > 0 and global_step > args.max_steps: 192 | epoch_iterator.close() 193 | break 194 | if args.max_steps > 0 and global_step > args.max_steps: 195 | train_iterator.close() 196 | break 197 | 198 | tb_writer.close() 199 | return global_step, tr_loss / global_step 200 | 201 | 202 | def evaluate(args, model, tokenizer, prefix=''): 203 | eval_output_dir = args.output_dir 204 | 205 | results = {} 206 | eval_dataset = load_and_cache_examples(args, tokenizer, set_type='dev') 207 | 208 | if not os.path.exists(eval_output_dir): 209 | os.makedirs(eval_output_dir) 210 | 211 | args.eval_batch_size = args.per_gpu_eval_batch_size * max(1, args.n_gpu) 212 | eval_sampler = SequentialSampler(eval_dataset) 
213 | eval_dataloader = DataLoader(eval_dataset, sampler=eval_sampler, batch_size=args.eval_batch_size) 214 | 215 | # Eval! 216 | logger.info('***** Running evaluation {} *****'.format(prefix)) 217 | logger.info(' Num examples = %d', len(eval_dataset)) 218 | logger.info(' Batch size = %d', args.eval_batch_size) 219 | eval_loss = 0.0 220 | nb_eval_steps = 0 221 | preds = None 222 | out_label_ids = None 223 | for batch in tqdm(eval_dataloader, desc='Evaluating'): 224 | model.eval() 225 | batch = tuple(t.to(args.device) for t in batch) 226 | 227 | with torch.no_grad(): 228 | inputs = {'input_ids': batch[0], 229 | 'attention_mask': batch[1], 230 | 'token_type_ids': batch[2], 231 | 'labels': batch[3], 232 | 'ct_clf_input_ids': batch[4], 233 | 'ct_clf_attention_mask': batch[5], 234 | 'ct_clf_token_type_ids': batch[6], 235 | 'categories': batch[7]} 236 | outputs = model(**inputs) 237 | tmp_eval_loss, logits = outputs[0][:2] 238 | eval_loss += tmp_eval_loss.mean().item() 239 | nb_eval_steps += 1 240 | if preds is None: 241 | preds = logits.detach().cpu().numpy() 242 | out_label_ids = inputs['labels'].detach().cpu().numpy() 243 | else: 244 | preds = np.append(preds, logits.detach().cpu().numpy(), axis=0) 245 | out_label_ids = np.append(out_label_ids, inputs['labels'].detach().cpu().numpy(), axis=0) 246 | 247 | eval_loss = eval_loss / nb_eval_steps 248 | preds = np.argmax(preds, axis=1) 249 | result = compute_metrics(preds, out_label_ids) 250 | results.update(result) 251 | 252 | output_eval_file = os.path.join(eval_output_dir, 'eval_results.txt') 253 | with open(output_eval_file, 'a') as writer: 254 | for key in sorted(result.keys()): 255 | logger.info(' %s = %s', key, str(result[key])) 256 | writer.write('%s = %s\n' % (key, str(result[key]))) 257 | writer.write('='*20 + '\n') 258 | 259 | return results 260 | 261 | 262 | def predict(args, model, tokenizer, index): 263 | test_dataset = load_and_cache_examples(args, tokenizer, set_type='test') 264 | 265 | args.test_batch_size = args.per_gpu_test_batch_size * max(1, args.n_gpu) 266 | test_sampler = SequentialSampler(test_dataset) 267 | test_dataloader = DataLoader(test_dataset, sampler=test_sampler, batch_size=args.test_batch_size) 268 | 269 | # Eval! 
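    # (SequentialSampler keeps the test examples in file order, so the enumerate()
    # index written out below as the submission id matches the original rows.)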
270 | logger.info('***** Running prediction *****') 271 | logger.info(' Num examples = %d', len(test_dataset)) 272 | logger.info(' Batch size = %d', args.test_batch_size) 273 | preds = None 274 | for batch in tqdm(test_dataloader, desc='Testing'): 275 | model.eval() 276 | batch = tuple(t.to(args.device) for t in batch) 277 | 278 | with torch.no_grad(): 279 | inputs = {'input_ids': batch[0], 280 | 'attention_mask': batch[1], 281 | 'token_type_ids': batch[2], 282 | 'labels': batch[3], 283 | 'ct_clf_input_ids': batch[4], 284 | 'ct_clf_attention_mask': batch[5], 285 | 'ct_clf_token_type_ids': batch[6], 286 | 'categories': batch[7]} 287 | outputs = model(**inputs) 288 | tmp_eval_loss, logits = outputs[0][:2] 289 | if preds is None: 290 | preds = logits.detach().cpu().numpy() 291 | else: 292 | preds = np.append(preds, logits.detach().cpu().numpy(), axis=0) 293 | 294 | preds = np.argmax(preds, axis=1) 295 | with open(os.path.join(args.data_dir + str(index), 'result.csv'), 'w') as f: 296 | f.write('id,label\n') 297 | for i, pred in enumerate(preds): 298 | f.write('%d,%d\n' % (i, pred)) 299 | 300 | 301 | def load_and_cache_examples(args, tokenizer, set_type): 302 | processor = QPMProcessor() 303 | # Load data features from cache or dataset file 304 | cached_features_file = os.path.join(args.data_dir, 'cached_{}_{}_{}'.format( 305 | set_type, 306 | list(filter(None, args.model_name_or_path.split('/'))).pop(), 307 | str(args.max_seq_length) 308 | )) 309 | if os.path.exists(cached_features_file): 310 | logger.info('Loading features from cache file %s', cached_features_file) 311 | features = torch.load(cached_features_file) 312 | else: 313 | logger.info('Creating features from dataset file at %s', args.data_dir) 314 | label_list = processor.get_labels() 315 | category_list = processor.get_categories() 316 | examples = processor.get_examples(args.data_dir, set_type) 317 | features = convert_examples_to_features(examples, label_list, category_list, args.max_seq_length, tokenizer, 318 | cls_token_at_end=False, # xlnet has a cls token at the end 319 | cls_token=tokenizer.cls_token, 320 | cls_token_segment_id=0, 321 | sep_token=tokenizer.sep_token, 322 | sep_token_extra=False, 323 | pad_on_left=False, 324 | pad_token=tokenizer.convert_tokens_to_ids([tokenizer.pad_token])[0], 325 | pad_token_segment_id=0 326 | # cls_token_at_end=bool(args.model_type in ['xlnet']), # xlnet has a cls token at the end 327 | # cls_token=tokenizer.cls_token, 328 | # cls_token_segment_id=2 if args.model_type in ['xlnet'] else 0, 329 | # sep_token=tokenizer.sep_token, 330 | # sep_token_extra=bool(args.model_type in ['roberta']), # roberta uses an extra separator b/w pairs of sentences, cf. 
github.com/pytorch/fairseq/commit/1684e166e3da03f5b600dbb7855cb98ddfcd0805 331 | # pad_on_left=bool(args.model_type in ['xlnet']), # pad on the left for xlnet 332 | # pad_token=tokenizer.convert_tokens_to_ids([tokenizer.pad_token])[0], 333 | # pad_token_segment_id=4 if args.model_type in ['xlnet'] else 0, 334 | ) 335 | logger.info("Saving features into cached file %s", cached_features_file) 336 | torch.save(features, cached_features_file) 337 | 338 | # Convert to Tensors and build dataset 339 | all_input_ids = torch.tensor([f.input_ids for f in features], dtype=torch.long) 340 | all_input_mask = torch.tensor([f.input_mask for f in features], dtype=torch.long) 341 | all_segment_ids = torch.tensor([f.segment_ids for f in features], dtype=torch.long) 342 | all_label_ids = torch.tensor([f.label_id for f in features], dtype=torch.long) 343 | all_ct_clf_input_ids = torch.tensor([f.category_clf_input_ids for f in features], dtype=torch.long) 344 | all_ct_clf_input_mask = torch.tensor([f.category_clf_input_mask for f in features], dtype=torch.long) 345 | all_ct_clf_segment_ids = torch.tensor([f.category_clf_segment_ids for f in features], dtype=torch.long) 346 | all_category_ids = torch.tensor([f.category_id for f in features], dtype=torch.long) 347 | 348 | dataset = TensorDataset(all_input_ids, all_input_mask, all_segment_ids, all_label_ids, 349 | all_ct_clf_input_ids, all_ct_clf_input_mask, all_ct_clf_segment_ids, all_category_ids) 350 | return dataset 351 | 352 | 353 | def main(): 354 | parser = argparse.ArgumentParser() 355 | 356 | ## Required parameters 357 | parser.add_argument('--data_dir', default=None, type=str, required=True, 358 | help='The input data dir. Should contain the .csv files for the task.') 359 | parser.add_argument('--model_name_or_path', default=None, type=str, required=True, 360 | help='Path to pretrained model or shortcut name selected in the list.') 361 | parser.add_argument('--output_dir', default=None, type=str, required=True, 362 | help='The output directory where the model predictions and checkpoints will be written.') 363 | 364 | ## Other parameters 365 | parser.add_argument('--config_name', default='', type=str, 366 | help='Pretrained config name or path if not the same as model_name.') 367 | parser.add_argument('--tokenizer_name', default='', type=str, 368 | help='Pretrained tokenizer name or path if not the same as model_name.') 369 | parser.add_argument('--max_seq_length', default='128', type=int, 370 | help='The maximum total input sequence length after tokenization. 
Sequences longer than this ' 371 | 'will be truncated, sequences shorter will be padded.') 372 | parser.add_argument('--do_train', action='store_true', 373 | help='Whether to run training.') 374 | parser.add_argument('--do_eval', action='store_true', 375 | help='Whether to run eval on the dev set.') 376 | parser.add_argument('--do_predict', action='store_true', 377 | help='Whether to run test on the test set.') 378 | parser.add_argument('--evaluate_during_training', action='store_true', 379 | help='Rul evaluation during training at each logging step.') 380 | parser.add_argument('--do_lower_case', action='store_true', 381 | help='Set this flag if you are using an uncased model.') 382 | 383 | parser.add_argument('--per_gpu_train_batch_size', default=8, type=int, 384 | help='Batch size per GPU/CPU for training.') 385 | parser.add_argument('--per_gpu_eval_batch_size', default=8, type=int, 386 | help='Batch size per GPU/CPU for evaluation.') 387 | parser.add_argument('--per_gpu_test_batch_size', default=8, type=int, 388 | help='Batch size per GPU/CPU for prediction.') 389 | parser.add_argument('--gradient_accumulation_steps', type=int, default=1, 390 | help='Number of updates steps to accumulate before performing a backward/update pass.') 391 | parser.add_argument('--learning_rate', default=5e-5, type=float, 392 | help='The initial learning rate for Adam.') 393 | parser.add_argument('--weight_decay', default=0.0, type=float, 394 | help='Weight decay if we apply some.') 395 | parser.add_argument('--adam_epsilon', default=1e-8, type=float, 396 | help='Epsilon for Adam optimizer.') 397 | parser.add_argument('--max_grad_norm', default=1.0, type=float, 398 | help='Max gradient norm.') 399 | parser.add_argument('--num_train_epochs', default=3.0, type=float, 400 | help='Total number of training epochs to perform.') 401 | parser.add_argument('--max_steps', default=-1, type=int, 402 | help='If > 0: set total number of training steps to perform. Override num_train_epochs.') 403 | parser.add_argument('--warmup_steps', default=0, type=int, 404 | help='Linear warmup over warmup_steps.') 405 | 406 | parser.add_argument('--logging_steps', type=int, default=50, 407 | help='Log every X updates steps.') 408 | parser.add_argument('--save_steps', type=int, default=100, 409 | help='Save checkpoint every X updates steps.') 410 | parser.add_argument('--eval_all_checkpoints', action='store_true', 411 | help='Evaluate all checkpoints starting with the same prefix as model_name ending and ending with step number.') 412 | parser.add_argument('--no_cuda', action='store_true', 413 | help='Avoid using CUDA when available.') 414 | parser.add_argument('--overwrite_output_dir', action='store_true', 415 | help='Overwrite the content of the output directory.') 416 | parser.add_argument('--overwrite_cache', action='store_true', 417 | help='Overwrite the cached training and evaluation sets.') 418 | parser.add_argument('--seed', type=int, default=42, 419 | help='random seed for initialization') 420 | 421 | parser.add_argument('--fp16', action='store_true', 422 | help="Whether to use 16-bit (mixed) precision (through NVIDIA apex) instead of 32-bit") 423 | parser.add_argument('--fp16_opt_level', type=str, default='O1', 424 | help="For fp16: Apex AMP optimization level selected in ['O0', 'O1', 'O2', and 'O3']." 
425 | "See details at https://nvidia.github.io/apex/amp.html") 426 | args = parser.parse_args() 427 | 428 | # Setup CUDA, GPU 429 | if os.path.exists(args.output_dir) and os.listdir(args.output_dir) and args.do_train and not args.overwrite_output_dir: 430 | raise ValueError('Output directory ({}) already exists and is not empty. Use --overwrite_output_dir to overcome.') 431 | 432 | device = torch.device('cuda' if torch.cuda.is_available() and not args.no_cuda else 'cpu') 433 | args.n_gpu = torch.cuda.device_count() 434 | args.device = device 435 | 436 | # Setup logging 437 | logging.basicConfig(format = '%(asctime)s - %(levelname)s - %(name)s - %(message)s', 438 | datefmt = '%m/%d/%Y %H:%M:%S', 439 | level = logging.INFO) 440 | logger.warning('Process device: %s, n_gpu: %s, 16-bits training: %s', 441 | device, args.n_gpu, args.fp16) 442 | 443 | # Set seed 444 | set_seed(args) 445 | # Prepare QPM task 446 | processor = QPMProcessor() 447 | label_list = processor.get_labels() 448 | num_labels = len(label_list) 449 | category_list = processor.get_categories() 450 | clf_num_labels = len(category_list) 451 | 452 | # Load pretrained model and tokenizer 453 | config_class, model_class, tokenizer_class = BertConfig, MultiTaskBert, BertTokenizer 454 | config = config_class.from_pretrained(args.config_name if args.config_name else args.model_name_or_path, num_labels=num_labels) 455 | tokenizer = tokenizer_class.from_pretrained(args.tokenizer_name if args.tokenizer_name else args.model_name_or_path, do_lower_case=args.do_lower_case) 456 | model = model_class.from_pretrained(args.model_name_or_path, from_tf=bool('.ckpt' in args.model_name_or_path), config=config) 457 | model.to(args.device) 458 | 459 | logger.info('Trainning/evaluation parameters %s', args) 460 | parent_data_dir = args.data_dir 461 | parent_output_dir = args.output_dir 462 | 463 | # Trainning 464 | results_tmp = {} 465 | if args.do_train: 466 | # 10-Fold dataset for training. 467 | for i in range(2, 10): 468 | # Reload the pretrained model. 469 | model = model_class.from_pretrained(args.model_name_or_path, 470 | from_tf=bool('.ckpt' in args.model_name_or_path), 471 | config=config) 472 | model.to(args.device) 473 | 474 | args.data_dir = parent_data_dir + str(i) 475 | args.output_dir = parent_output_dir + str(i) 476 | 477 | train_dataset = load_and_cache_examples(args, tokenizer, set_type='train') 478 | global_step, tr_loss = train(args, train_dataset, model, tokenizer) 479 | logger.info(" global_step = %s, average loss = %s", global_step, tr_loss) 480 | # Saving best-practices: if you use defaults names for the model, you can reload it using from_pretrained() 481 | # Create output directory if needed 482 | if not os.path.exists(args.output_dir): 483 | os.makedirs(args.output_dir) 484 | 485 | logger.info("Saving model checkpoint to %s", args.output_dir) 486 | # Save a trained model, configuration and tokenizer using `save_pretrained()`. 
487 |             # They can then be reloaded using `from_pretrained()`
488 |             model_to_save = model.module if hasattr(model, 'module') else model  # Take care of distributed/parallel training
489 |             model_to_save.save_pretrained(args.output_dir)
490 |             tokenizer.save_pretrained(args.output_dir)
491 | 
492 |             # Good practice: save your training arguments together with the trained model
493 |             torch.save(args, os.path.join(args.output_dir, 'training_args.bin'))
494 | 
495 |             # Load a trained model and vocabulary that you have fine-tuned
496 |             model = model_class.from_pretrained(args.output_dir)
497 |             tokenizer = tokenizer_class.from_pretrained(args.output_dir, do_lower_case=args.do_lower_case)
498 |             model.to(args.device)
499 | 
500 |             # To reduce disk usage, evaluate each fold right away and keep only its best checkpoint.
501 |             args.data_dir = parent_data_dir + str(i)
502 |             args.output_dir = parent_output_dir + str(i)
503 |             tokenizer = tokenizer_class.from_pretrained(args.output_dir, do_lower_case=args.do_lower_case)
504 |             checkpoints = [args.output_dir]
505 |             if args.eval_all_checkpoints:
506 |                 checkpoints = list(os.path.dirname(c) for c in sorted(glob.glob(args.output_dir + '/**/' + WEIGHTS_NAME, recursive=True)))
507 |                 logging.getLogger("pytorch_transformers.modeling_utils").setLevel(logging.WARN)  # Reduce logging
508 |             logger.info("Evaluate the following checkpoints: %s", checkpoints)
509 |             best_f1 = 0.0
510 |             for checkpoint in checkpoints:
511 |                 global_step = checkpoint.split('-')[-1] if len(checkpoints) > 1 else ""
512 |                 model = model_class.from_pretrained(checkpoint)
513 |                 model.to(args.device)
514 |                 result = evaluate(args, model, tokenizer, prefix=global_step)
515 |                 if result['f1'] > best_f1:
516 |                     best_f1 = result['f1']
517 |                     # Save the best model checkpoint
518 |                     output_dir = os.path.join(args.output_dir, 'best_checkpoint_fold' + str(i))
519 |                     if not os.path.exists(output_dir):
520 |                         os.makedirs(output_dir)
521 |                     model_to_save = model.module if hasattr(model, 'module') else model
522 |                     model_to_save.save_pretrained(output_dir)
523 |                     torch.save(args, os.path.join(output_dir, 'training_args.bin'))
524 |                     logger.info('Saving model checkpoint to %s', output_dir)
525 | 
526 |                 result = dict((k + '_{}'.format(global_step), v) for k, v in result.items())
527 |                 results_tmp.update(result)
528 |             checkpoints.remove(args.output_dir)
529 |             for checkpoint in checkpoints:
530 |                 shutil.rmtree(checkpoint)
531 | 
532 |     # Evaluation
533 |     results = {}
534 |     if args.do_eval:
535 |         for i in range(10):
536 |             args.data_dir = parent_data_dir + str(i)
537 |             args.output_dir = parent_output_dir + str(i)
538 |             tokenizer = tokenizer_class.from_pretrained(args.output_dir, do_lower_case=args.do_lower_case)
539 |             checkpoints = [args.output_dir]
540 |             if args.eval_all_checkpoints:
541 |                 checkpoints = list(os.path.dirname(c) for c in sorted(glob.glob(args.output_dir + '/**/' + WEIGHTS_NAME, recursive=True)))
542 |                 logging.getLogger("pytorch_transformers.modeling_utils").setLevel(logging.WARN)  # Reduce logging
543 |             logger.info("Evaluate the following checkpoints: %s", checkpoints)
544 |             best_f1 = 0.0
545 |             for checkpoint in checkpoints:
546 |                 global_step = checkpoint.split('-')[-1] if len(checkpoints) > 1 else ""
547 |                 model = model_class.from_pretrained(checkpoint)
548 |                 model.to(args.device)
549 |                 result = evaluate(args, model, tokenizer, prefix=global_step)
550 |                 if result['f1'] > best_f1:
551 |                     best_f1 = result['f1']
552 |                     # Save the best model checkpoint
553 |                     output_dir = os.path.join(args.output_dir, 'best_checkpoint_fold' + str(i))
554 |                     if not os.path.exists(output_dir):
555 |                         os.makedirs(output_dir)
556 |                     model_to_save = model.module if hasattr(model, 'module') else model
557 |                     model_to_save.save_pretrained(output_dir)
558 |                     torch.save(args, os.path.join(output_dir, 'training_args.bin'))
559 |                     logger.info('Saving model checkpoint to %s', output_dir)
560 | 
561 |                 result = dict((k + '_{}'.format(global_step), v) for k, v in result.items())
562 |                 results.update(result)
563 | 
564 |     # Prediction
565 |     if args.do_predict:
566 |         for i in range(10):
567 |             args.data_dir = parent_data_dir; args.output_dir = parent_output_dir + str(i)
568 |             tokenizer = tokenizer_class.from_pretrained(args.output_dir, do_lower_case=args.do_lower_case)
569 |             logging.getLogger("pytorch_transformers.modeling_utils").setLevel(logging.WARN)  # Reduce logging
570 |             checkpoint = args.output_dir + '/best_checkpoint_fold' + str(i)
571 |             model = model_class.from_pretrained(checkpoint)
572 |             model.to(args.device)
573 |             predict(args, model, tokenizer, i)
574 | 
575 |         # For bagging.
576 |         all = pd.read_csv('./data/sample_submission.csv')
577 |         for i in range(10):
578 |             df = pd.read_csv(parent_data_dir + str(i) + '/result.csv')
579 |             all['label'] += df['label']
580 |         all['label'] = all['label'] // 6
581 |         all.to_csv('./data/result.csv', index=False)
582 | 
583 | 
584 | 
585 | if __name__ == '__main__':
586 |     main()
587 | 
--------------------------------------------------------------------------------
/model_qpm.py:
--------------------------------------------------------------------------------
1 | # -*- coding: utf-8 -*-
2 | """ Finetuning the library models for chip2019 question pairs matching. """
3 | 
4 | import argparse
5 | import glob
6 | import logging
7 | import os
8 | import random
9 | import shutil
10 | 
11 | import numpy as np
12 | import pandas as pd
13 | import torch
14 | from torch.utils.data import (DataLoader, RandomSampler, SequentialSampler, TensorDataset)
15 | from tensorboardX import SummaryWriter
16 | from tqdm import tqdm, trange
17 | from pytorch_transformers import (WEIGHTS_NAME, BertConfig, BertForSequenceClassification, BertTokenizer)
18 | from pytorch_transformers import AdamW, WarmupLinearSchedule
19 | from data_utils import (compute_metrics, convert_examples_to_features, QPMProcessor)
20 | 
21 | logger = logging.getLogger(__name__)
22 | 
23 | 
24 | def set_seed(args):
25 |     random.seed(args.seed)
26 |     np.random.seed(args.seed)
27 |     torch.manual_seed(args.seed)
28 |     if args.n_gpu > 0:
29 |         torch.cuda.manual_seed_all(args.seed)
30 | 
31 | 
32 | def train(args, train_dataset, model, tokenizer):
33 |     """ Train the model.
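    Builds AdamW with weight decay disabled for bias/LayerNorm parameters and a
    linear warmup-then-decay schedule over t_total steps; supports optional fp16
    via apex, gradient accumulation and gradient clipping at max_grad_norm, and
    saves the checkpoint with the best dev F1 when evaluating during training.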
""" 34 | tb_writer = SummaryWriter() 35 | 36 | args.train_batch_size = args.per_gpu_train_batch_size * max(1, args.n_gpu) 37 | train_sampler = RandomSampler(train_dataset) 38 | train_dataloader = DataLoader(train_dataset, sampler=train_sampler, batch_size=args.train_batch_size) 39 | 40 | if args.max_steps > 0: 41 | t_total = args.max_steps 42 | args.num_train_epochs = args.max_steps // (len(train_dataloader) // args.gradient_accumulation_steps) + 1 43 | else: 44 | t_total = len(train_dataloader) // args.gradient_accumulation_steps * args.num_train_epochs 45 | 46 | # Prepare optimizer and schedule (linear warmup and decay) 47 | no_decay = ['bias', 'LayerNorm.weight'] 48 | optimizer_grouped_parameters = [ 49 | {'params': [p for n, p in model.named_parameters() if not any(nd in n for nd in no_decay)], 'weight_decay': args.weight_decay}, 50 | {'params': [p for n, p in model.named_parameters() if any(nd in n for nd in no_decay)], 'weight_decay': 0.0} 51 | ] 52 | optimizer = AdamW(optimizer_grouped_parameters, lr=args.learning_rate, eps=args.adam_epsilon) 53 | scheduler = WarmupLinearSchedule(optimizer, warmup_steps=args.warmup_steps, t_total=t_total) 54 | if args.fp16: 55 | try: 56 | from apex import amp 57 | except ImportError: 58 | raise ImportError('Please install apex from https://www.github.com/nvidia/apex to use fp16 training.') 59 | model, optimizer = amp.initialize(model, optimizer, opt_level=args.fp16_opt_level) 60 | 61 | # multi-gpu training (should be after apex fp16 initialization) 62 | if args.n_gpu > 1: 63 | model = torch.nn.DataParallel(model) 64 | 65 | # Train! 66 | logger.info('***** Running training *****') 67 | logger.info(' Num examples = %d', len(train_dataset)) 68 | logger.info(' Num Epochs = %d', args.num_train_epochs) 69 | logger.info(' Instantaneous batch size per GPU = %d', args.per_gpu_train_batch_size) 70 | logger.info(' Total train batch size (w. 
parallel & accumulation) = %d', 71 | args.train_batch_size * args.gradient_accumulation_steps) 72 | logger.info(' Gradient Accumulation steps = %d', args.gradient_accumulation_steps) 73 | logger.info(' Total optimization steps = %d', t_total) 74 | 75 | global_step = 0 76 | tr_loss, logging_loss = 0.0, 0.0 77 | model.zero_grad() 78 | train_iterator = trange(int(args.num_train_epochs), desc='Epoch') 79 | set_seed(args) # Added here for reproductibility 80 | 81 | max_val_acc = 0 82 | max_val_f1 = 0 83 | 84 | for _ in train_iterator: 85 | # epoch_iterator = tqdm(train_dataloader, desc='Iteration') 86 | # for step, batch in enumerate(epoch_iterator): 87 | for step, batch in enumerate(train_dataloader): 88 | model.train() 89 | batch = tuple(t.to(args.device) for t in batch) 90 | inputs = {'input_ids': batch[0], 91 | 'attention_mask': batch[1], 92 | 'token_type_ids': batch[2], 93 | 'labels': batch[3]} 94 | outputs = model(**inputs) 95 | loss = outputs[0] # model outputs are always tuple in pytorch_transformers (see doc) 96 | 97 | if args.n_gpu > 1: 98 | loss = loss.mean() 99 | if args.gradient_accumulation_steps > 1: 100 | loss = loss / args.gradient_accumulation_steps 101 | 102 | if args.fp16: 103 | with amp.scale_los(loss.optimizer) as scaled_loss: 104 | scaled_loss.backward() 105 | torch.nn.utils.clip_grad_norm_(amp.master_params(optimizer), args.max_grad_norm) 106 | else: 107 | loss.backward() 108 | torch.nn.utils.clip_grad_norm_(model.parameters(), args.max_grad_norm) 109 | 110 | tr_loss += loss.item() 111 | if (step + 1) % args.gradient_accumulation_steps == 0: 112 | optimizer.step() 113 | scheduler.step() 114 | model.zero_grad() 115 | global_step += 1 116 | 117 | if args.logging_steps > 0 and global_step % args.logging_steps == 0: 118 | # Log metrics 119 | if args.evaluate_during_training: 120 | result = evaluate(args, model, tokenizer) 121 | for key, value in result.items(): 122 | tb_writer.add_scalar('eval_{}'.format(key), value, global_step) 123 | if result['acc'] > max_val_acc: 124 | max_val_acc = result['acc'] 125 | if result['f1'] > max_val_f1: 126 | max_val_f1 = result['f1'] 127 | output_dir = os.path.join(args.output_dir, 'best_checkpoint') 128 | if not os.path.exists(output_dir): 129 | os.makedirs(output_dir) 130 | model_to_save = model.module if hasattr(model, 'module') else model 131 | model_to_save.save_pretrained(output_dir) 132 | torch.save(args, 'training_args.bin') 133 | logger.info('Saving model checkpoint with f1 {:.4f}'.format(max_val_f1)) 134 | tb_writer.add_scalar('lr', scheduler.get_lr()[0], global_step) 135 | tb_writer.add_scalar('loss', (tr_loss-logging_loss)/args.logging_steps, global_step) 136 | logging_loss = tr_loss 137 | 138 | # if args.save_steps > 0 and global_step % args.save_steps == 0: 139 | # # Save model checkpoint 140 | # output_dir = os.path.join(args.output_dir, 'checkpoint-{}'.format(global_step)) 141 | # if not os.path.exists(output_dir): 142 | # os.makedirs(output_dir) 143 | # model_to_save = model.module if hasattr(model, 'module') else model 144 | # model_to_save.save_pretrained(output_dir) 145 | # torch.save(args, 'training_args.bin') 146 | # logger.info('Saving model checkpoint to %s', output_dir) 147 | 148 | # if args.max_steps > 0 and global_step > args.max_steps: 149 | # epoch_iterator.close() 150 | # break 151 | if args.max_steps > 0 and global_step > args.max_steps: 152 | train_iterator.close() 153 | break 154 | 155 | tb_writer.close() 156 | return global_step, tr_loss / global_step 157 | 158 | 159 | def evaluate(args, model, tokenizer, 
prefix=''): 160 | eval_output_dir = args.output_dir 161 | 162 | results = {} 163 | eval_dataset = load_and_cache_examples(args, tokenizer, set_type='dev') 164 | 165 | if not os.path.exists(eval_output_dir): 166 | os.makedirs(eval_output_dir) 167 | 168 | args.eval_batch_size = args.per_gpu_eval_batch_size * max(1, args.n_gpu) 169 | eval_sampler = SequentialSampler(eval_dataset) 170 | eval_dataloader = DataLoader(eval_dataset, sampler=eval_sampler, batch_size=args.eval_batch_size) 171 | 172 | # Eval! 173 | logger.info('***** Running evaluation {} *****'.format(prefix)) 174 | logger.info(' Num examples = %d', len(eval_dataset)) 175 | logger.info(' Batch size = %d', args.eval_batch_size) 176 | eval_loss = 0.0 177 | nb_eval_steps = 0 178 | preds = None 179 | out_label_ids = None 180 | # for batch in tqdm(eval_dataloader, desc='Evaluating'): 181 | for batch in eval_dataloader: 182 | model.eval() 183 | batch = tuple(t.to(args.device) for t in batch) 184 | 185 | with torch.no_grad(): 186 | inputs = {'input_ids': batch[0], 187 | 'attention_mask': batch[1], 188 | 'token_type_ids': batch[2], 189 | 'labels': batch[3]} 190 | outputs = model(**inputs) 191 | tmp_eval_loss, logits = outputs[:2] 192 | eval_loss += tmp_eval_loss.mean().item() 193 | nb_eval_steps += 1 194 | if preds is None: 195 | preds = logits.detach().cpu().numpy() 196 | out_label_ids = inputs['labels'].detach().cpu().numpy() 197 | else: 198 | preds = np.append(preds, logits.detach().cpu().numpy(), axis=0) 199 | out_label_ids = np.append(out_label_ids, inputs['labels'].detach().cpu().numpy(), axis=0) 200 | 201 | eval_loss = eval_loss / nb_eval_steps 202 | preds = np.argmax(preds, axis=1) 203 | result = compute_metrics(preds, out_label_ids) 204 | results.update(result) 205 | 206 | output_eval_file = os.path.join(eval_output_dir, 'eval_results.txt') 207 | with open(output_eval_file, 'a') as writer: 208 | for key in sorted(result.keys()): 209 | logger.info(' %s = %s', key, str(result[key])) 210 | writer.write('%s = %s\n' % (key, str(result[key]))) 211 | writer.write('='*20 + '\n') 212 | 213 | return results 214 | 215 | 216 | def predict(args, model, tokenizer, index): 217 | test_dataset = load_and_cache_examples(args, tokenizer, set_type='test') 218 | 219 | args.test_batch_size = args.per_gpu_test_batch_size * max(1, args.n_gpu) 220 | test_sampler = SequentialSampler(test_dataset) 221 | test_dataloader = DataLoader(test_dataset, sampler=test_sampler, batch_size=args.test_batch_size) 222 | 223 | # Eval! 
224 | logger.info('***** Running prediction *****') 225 | logger.info(' Num examples = %d', len(test_dataset)) 226 | logger.info(' Batch size = %d', args.test_batch_size) 227 | preds = None 228 | for batch in tqdm(test_dataloader, desc='Testing'): 229 | model.eval() 230 | batch = tuple(t.to(args.device) for t in batch) 231 | 232 | with torch.no_grad(): 233 | inputs = {'input_ids': batch[0], 234 | 'attention_mask': batch[1], 235 | 'token_type_ids': batch[2], 236 | 'labels': batch[3]} 237 | outputs = model(**inputs) 238 | tmp_eval_loss, logits = outputs[:2] 239 | if preds is None: 240 | preds = logits.detach().cpu().numpy() 241 | else: 242 | preds = np.append(preds, logits.detach().cpu().numpy(), axis=0) 243 | 244 | preds = np.argmax(preds, axis=1) 245 | with open(os.path.join(args.data_dir + str(index), 'result.csv'), 'w') as f: 246 | f.write('id,label\n') 247 | for i, pred in enumerate(preds): 248 | f.write('%d,%d\n' % (i, pred)) 249 | 250 | 251 | def load_and_cache_examples(args, tokenizer, set_type): 252 | processor = QPMProcessor() 253 | # Load data features from cache or dataset file 254 | cached_features_file = os.path.join(args.data_dir, 'cached_{}_{}_{}'.format( 255 | set_type, 256 | list(filter(None, args.model_name_or_path.split('/'))).pop(), 257 | str(args.max_seq_length) 258 | )) 259 | if os.path.exists(cached_features_file): 260 | logger.info('Loading features from cache file %s', cached_features_file) 261 | features = torch.load(cached_features_file) 262 | else: 263 | logger.info('Creating features from dataset file at %s', args.data_dir) 264 | label_list = processor.get_labels() 265 | category_list = processor.get_categories() 266 | examples = processor.get_examples(args.data_dir, set_type) 267 | features = convert_examples_to_features(examples, label_list, category_list, args.max_seq_length, tokenizer, 268 | cls_token_at_end=False, # xlnet has a cls token at the end 269 | cls_token=tokenizer.cls_token, 270 | cls_token_segment_id=0, 271 | sep_token=tokenizer.sep_token, 272 | sep_token_extra=False, 273 | pad_on_left=False, 274 | pad_token=tokenizer.convert_tokens_to_ids([tokenizer.pad_token])[0], 275 | pad_token_segment_id=0 276 | # cls_token_at_end=bool(args.model_type in ['xlnet']), # xlnet has a cls token at the end 277 | # cls_token=tokenizer.cls_token, 278 | # cls_token_segment_id=2 if args.model_type in ['xlnet'] else 0, 279 | # sep_token=tokenizer.sep_token, 280 | # sep_token_extra=bool(args.model_type in ['roberta']), # roberta uses an extra separator b/w pairs of sentences, cf. 
github.com/pytorch/fairseq/commit/1684e166e3da03f5b600dbb7855cb98ddfcd0805 281 | # pad_on_left=bool(args.model_type in ['xlnet']), # pad on the left for xlnet 282 | # pad_token=tokenizer.convert_tokens_to_ids([tokenizer.pad_token])[0], 283 | # pad_token_segment_id=4 if args.model_type in ['xlnet'] else 0, 284 | ) 285 | logger.info("Saving features into cached file %s", cached_features_file) 286 | torch.save(features, cached_features_file) 287 | 288 | # Convert to Tensors and build dataset 289 | all_input_ids = torch.tensor([f.input_ids for f in features], dtype=torch.long) 290 | all_input_mask = torch.tensor([f.input_mask for f in features], dtype=torch.long) 291 | all_segment_ids = torch.tensor([f.segment_ids for f in features], dtype=torch.long) 292 | all_label_ids = torch.tensor([f.label_id for f in features], dtype=torch.long) 293 | all_ct_clf_input_ids = torch.tensor([f.category_clf_input_ids for f in features], dtype=torch.long) 294 | all_ct_clf_input_mask = torch.tensor([f.category_clf_input_mask for f in features], dtype=torch.long) 295 | all_ct_clf_segment_ids = torch.tensor([f.category_clf_segment_ids for f in features], dtype=torch.long) 296 | all_category_ids = torch.tensor([f.category_id for f in features], dtype=torch.long) 297 | 298 | dataset = TensorDataset(all_input_ids, all_input_mask, all_segment_ids, all_label_ids, 299 | all_ct_clf_input_ids, all_ct_clf_input_mask, all_ct_clf_segment_ids, all_category_ids) 300 | return dataset 301 | 302 | 303 | def main(): 304 | parser = argparse.ArgumentParser() 305 | 306 | ## Required parameters 307 | parser.add_argument('--data_dir', default=None, type=str, required=True, 308 | help='The input data dir. Should contain the .csv files for the task.') 309 | parser.add_argument('--model_name_or_path', default=None, type=str, required=True, 310 | help='Path to pretrained model or shortcut name selected in the list.') 311 | parser.add_argument('--output_dir', default=None, type=str, required=True, 312 | help='The output directory where the model predictions and checkpoints will be written.') 313 | 314 | ## Other parameters 315 | parser.add_argument('--config_name', default='', type=str, 316 | help='Pretrained config name or path if not the same as model_name.') 317 | parser.add_argument('--tokenizer_name', default='', type=str, 318 | help='Pretrained tokenizer name or path if not the same as model_name.') 319 | parser.add_argument('--max_seq_length', default='128', type=int, 320 | help='The maximum total input sequence length after tokenization. 
Sequences longer than this ' 321 | 'will be truncated, sequences shorter will be padded.') 322 | parser.add_argument('--do_train', action='store_true', 323 | help='Whether to run training.') 324 | parser.add_argument('--do_eval', action='store_true', 325 | help='Whether to run eval on the dev set.') 326 | parser.add_argument('--do_predict', action='store_true', 327 | help='Whether to run test on the test set.') 328 | parser.add_argument('--evaluate_during_training', action='store_true', 329 | help='Rul evaluation during training at each logging step.') 330 | parser.add_argument('--do_lower_case', action='store_true', 331 | help='Set this flag if you are using an uncased model.') 332 | 333 | parser.add_argument('--per_gpu_train_batch_size', default=1, type=int, 334 | help='Batch size per GPU/CPU for training.') 335 | parser.add_argument('--per_gpu_eval_batch_size', default=8, type=int, 336 | help='Batch size per GPU/CPU for evaluation.') 337 | parser.add_argument('--per_gpu_test_batch_size', default=8, type=int, 338 | help='Batch size per GPU/CPU for prediction.') 339 | parser.add_argument('--gradient_accumulation_steps', type=int, default=1, 340 | help='Number of updates steps to accumulate before performing a backward/update pass.') 341 | parser.add_argument('--learning_rate', default=5e-5, type=float, 342 | help='The initial learning rate for Adam.') 343 | parser.add_argument('--weight_decay', default=0.0, type=float, 344 | help='Weight decay if we apply some.') 345 | parser.add_argument('--adam_epsilon', default=1e-8, type=float, 346 | help='Epsilon for Adam optimizer.') 347 | parser.add_argument('--max_grad_norm', default=1.0, type=float, 348 | help='Max gradient norm.') 349 | parser.add_argument('--num_train_epochs', default=4.0, type=float, 350 | help='Total number of training epochs to perform.') 351 | parser.add_argument('--max_steps', default=-1, type=int, 352 | help='If > 0: set total number of training steps to perform. Override num_train_epochs.') 353 | parser.add_argument('--warmup_steps', default=0, type=int, 354 | help='Linear warmup over warmup_steps.') 355 | 356 | parser.add_argument('--logging_steps', type=int, default=50, 357 | help='Log every X updates steps.') 358 | parser.add_argument('--save_steps', type=int, default=100, 359 | help='Save checkpoint every X updates steps.') 360 | parser.add_argument('--eval_all_checkpoints', action='store_true', 361 | help='Evaluate all checkpoints starting with the same prefix as model_name ending and ending with step number.') 362 | parser.add_argument('--no_cuda', action='store_true', 363 | help='Avoid using CUDA when available.') 364 | parser.add_argument('--overwrite_output_dir', action='store_true', 365 | help='Overwrite the content of the output directory.') 366 | parser.add_argument('--overwrite_cache', action='store_true', 367 | help='Overwrite the cached training and evaluation sets.') 368 | parser.add_argument('--seed', type=int, default=42, 369 | help='random seed for initialization') 370 | 371 | parser.add_argument('--fp16', action='store_true', 372 | help="Whether to use 16-bit (mixed) precision (through NVIDIA apex) instead of 32-bit") 373 | parser.add_argument('--fp16_opt_level', type=str, default='O1', 374 | help="For fp16: Apex AMP optimization level selected in ['O0', 'O1', 'O2', and 'O3']." 
375 | "See details at https://nvidia.github.io/apex/amp.html") 376 | args = parser.parse_args() 377 | 378 | # Setup CUDA, GPU 379 | if os.path.exists(args.output_dir) and os.listdir(args.output_dir) and args.do_train and not args.overwrite_output_dir: 380 | raise ValueError('Output directory ({}) already exists and is not empty. Use --overwrite_output_dir to overcome.') 381 | 382 | device = torch.device('cuda' if torch.cuda.is_available() and not args.no_cuda else 'cpu') 383 | args.n_gpu = torch.cuda.device_count() 384 | args.device = device 385 | 386 | # Setup logging 387 | logging.basicConfig(format = '%(asctime)s - %(levelname)s - %(name)s - %(message)s', 388 | datefmt = '%m/%d/%Y %H:%M:%S', 389 | level = logging.INFO) 390 | logger.warning('Process device: %s, n_gpu: %s, 16-bits training: %s', 391 | device, args.n_gpu, args.fp16) 392 | 393 | # Set seed 394 | set_seed(args) 395 | # Prepare QPM task 396 | processor = QPMProcessor() 397 | label_list = processor.get_labels() 398 | num_labels = len(label_list) 399 | 400 | # Load pretrained model and tokenizer 401 | config_class, model_class, tokenizer_class = BertConfig, BertForSequenceClassification, BertTokenizer 402 | config = config_class.from_pretrained(args.config_name if args.config_name else args.model_name_or_path, num_labels=num_labels) 403 | tokenizer = tokenizer_class.from_pretrained(args.tokenizer_name if args.tokenizer_name else args.model_name_or_path, do_lower_case=args.do_lower_case) 404 | model = model_class.from_pretrained(args.model_name_or_path, from_tf=bool('.ckpt' in args.model_name_or_path), config=config) 405 | model.to(args.device) 406 | 407 | logger.info('Trainning/evaluation parameters %s', args) 408 | parent_data_dir = args.data_dir 409 | parent_output_dir = args.output_dir 410 | 411 | # Trainning 412 | results_tmp = {} 413 | if args.do_train: 414 | # 10-Fold dataset for training. 415 | for i in range(0, 10): 416 | # Reload the pretrained model. 417 | model = model_class.from_pretrained(args.model_name_or_path, 418 | from_tf=bool('.ckpt' in args.model_name_or_path), 419 | config=config) 420 | model.to(args.device) 421 | 422 | args.data_dir = parent_data_dir + str(i) 423 | args.output_dir = parent_output_dir + str(i) 424 | 425 | train_dataset = load_and_cache_examples(args, tokenizer, set_type='train') 426 | global_step, tr_loss = train(args, train_dataset, model, tokenizer) 427 | logger.info(" global_step = %s, average loss = %s", global_step, tr_loss) 428 | # Saving best-practices: if you use defaults names for the model, you can reload it using from_pretrained() 429 | # Create output directory if needed 430 | if not os.path.exists(args.output_dir): 431 | os.makedirs(args.output_dir) 432 | 433 | logger.info("Saving model checkpoint to %s", args.output_dir) 434 | # Save a trained model, configuration and tokenizer using `save_pretrained()`. 
435 | # They can then be reloaded using `from_pretrained()` 436 | model_to_save = model.module if hasattr(model, 'module') else model # Take care of distributed/parallel training 437 | model_to_save.save_pretrained(args.output_dir) 438 | tokenizer.save_pretrained(args.output_dir) 439 | 440 | # Good practice: save your training arguments together with the trained model 441 | torch.save(args, os.path.join(args.output_dir, 'training_args.bin')) 442 | 443 | # Load a trained model and vocabulary that you have fine-tuned 444 | model = model_class.from_pretrained(args.output_dir) 445 | tokenizer = tokenizer_class.from_pretrained(args.output_dir, do_lower_case=args.do_lower_case) 446 | model.to(args.device) 447 | 448 | # for reduce the usage of disk, evluate and find the best checkpoint every sub dataset. 449 | # args.data_dir = parent_data_dir + str(i) 450 | # args.output_dir = parent_output_dir + str(i) 451 | # tokenizer = tokenizer_class.from_pretrained(args.output_dir, do_lower_case=args.do_lower_case) 452 | # checkpoints = [args.output_dir] 453 | # if args.eval_all_checkpoints: 454 | # checkpoints = list(os.path.dirname(c) for c in sorted(glob.glob(args.output_dir + '/**/' + WEIGHTS_NAME, recursive=True))) 455 | # logging.getLogger("pytorch_transformers.modeling_utils").setLevel(logging.WARN) # Reduce logging 456 | # logger.info("Evaluate the following checkpoints: %s", checkpoints) 457 | # best_f1 = 0.0 458 | # for checkpoint in checkpoints: 459 | # global_step = checkpoint.split('-')[-1] if len(checkpoints) > 1 else "" 460 | # model = model_class.from_pretrained(checkpoint) 461 | # model.to(args.device) 462 | # result = evaluate(args, model, tokenizer, prefix=global_step) 463 | # if result['f1'] > best_f1: 464 | # best_f1 = result['f1'] 465 | # # Save the best model checkpoint 466 | # output_dir = os.path.join(args.output_dir, 'best_checkpoint_fold' + str(i)) 467 | # if not os.path.exists(output_dir): 468 | # os.makedirs(output_dir) 469 | # model_to_save = model.module if hasattr(model, 'module') else model 470 | # model_to_save.save_pretrained(output_dir) 471 | # torch.save(args, 'training_args.bin') 472 | # logger.info('Saving model checkpoint to %s', output_dir) 473 | # 474 | # result = dict((k + '_{}'.format(global_step), v) for k, v in result.items()) 475 | # results_tmp.update(result) 476 | # checkpoints.remove(args.output_dir) 477 | # for checkpoint in checkpoints: 478 | # shutil.rmtree(checkpoint) 479 | 480 | # Evaluation 481 | results = {} 482 | if args.do_eval: 483 | for i in range(10): 484 | args.data_dir = parent_data_dir + str(i) 485 | args.output_dir = parent_output_dir + str(i) 486 | tokenizer = tokenizer_class.from_pretrained(args.output_dir, do_lower_case=args.do_lower_case) 487 | checkpoints = [args.output_dir] 488 | if args.eval_all_checkpoints: 489 | checkpoints = list(os.path.dirname(c) for c in sorted(glob.glob(args.output_dir + '/**/' + WEIGHTS_NAME, recursive=True))) 490 | logging.getLogger("pytorch_transformers.modeling_utils").setLevel(logging.WARN) # Reduce logging 491 | logger.info("Evaluate the following checkpoints: %s", checkpoints) 492 | best_f1 = 0.0 493 | for checkpoint in checkpoints: 494 | global_step = checkpoint.split('-')[-1] if len(checkpoints) > 1 else "" 495 | model = model_class.from_pretrained(checkpoint) 496 | model.to(args.device) 497 | result = evaluate(args, model, tokenizer, prefix=global_step) 498 | if result['f1'] > best_f1: 499 | best_f1 = result['f1'] 500 | # Save the best model checkpoint 501 | output_dir = os.path.join(args.output_dir, 
                'best_checkpoint_fold' + str(i))
502 |                     if not os.path.exists(output_dir):
503 |                         os.makedirs(output_dir)
504 |                     model_to_save = model.module if hasattr(model, 'module') else model
505 |                     model_to_save.save_pretrained(output_dir)
506 |                     torch.save(args, os.path.join(output_dir, 'training_args.bin'))
507 |                     logger.info('Saving model checkpoint to %s', output_dir)
508 | 
509 |                 result = dict((k + '_{}'.format(global_step), v) for k, v in result.items())
510 |                 results.update(result)
511 | 
512 |     # Prediction
513 |     if args.do_predict:
514 |         for i in range(10):
515 |             args.data_dir = parent_data_dir; args.output_dir = parent_output_dir + str(i)
516 |             tokenizer = tokenizer_class.from_pretrained(args.output_dir, do_lower_case=args.do_lower_case)
517 |             logging.getLogger("pytorch_transformers.modeling_utils").setLevel(logging.WARN)  # Reduce logging
518 |             checkpoint = args.output_dir + '/best_checkpoint_fold' + str(i)
519 |             model = model_class.from_pretrained(checkpoint)
520 |             model.to(args.device)
521 |             predict(args, model, tokenizer, i)
522 | 
523 |         # For bagging.
524 |         all = pd.read_csv('./data/sample_submission.csv')
525 |         for i in range(10):
526 |             df = pd.read_csv(parent_data_dir + str(i) + '/result.csv')
527 |             all['label'] += df['label']
528 |         all['label'] = all['label'] // 6
529 |         all.to_csv('./data/result.csv', index=False)
530 | 
531 | 
532 | if __name__ == '__main__':
533 |     main()
534 | 
--------------------------------------------------------------------------------
/post_processing.py:
--------------------------------------------------------------------------------
1 | # -*- coding: utf-8 -*-
2 | 
3 | import sys
4 | 
5 | import pandas as pd
6 | 
7 | df_train = pd.read_csv('./data/origin_train.csv', encoding='utf-8', engine='python')
8 | q1 = df_train['question1'].values
9 | q2 = df_train['question2'].values
10 | label = df_train['label'].values
11 | category = df_train['category'].values
12 | dict_1 = {}
13 | dict_2 = {}
14 | dict_ct = {}
15 | for i in range(0, df_train.shape[0]):
16 |     dict_ct[q1[i]] = category[i]
17 |     dict_ct[q2[i]] = category[i]
18 |     if label[i] == 1:
19 |         if dict_1.get(q1[i], -1) == -1:
20 |             dict_1[q1[i]] = [q2[i]]
21 |         else:
22 |             dict_1[q1[i]].append(q2[i])
23 |         if dict_1.get(q2[i], -1) == -1:
24 |             dict_1[q2[i]] = [q1[i]]
25 |         else:
26 |             dict_1[q2[i]].append(q1[i])
27 |     else:
28 |         if dict_2.get(q1[i], -1) == -1:
29 |             dict_2[q1[i]] = [q2[i]]
30 |         else:
31 |             dict_2[q1[i]].append(q2[i])
32 |         if dict_2.get(q2[i], -1) == -1:
33 |             dict_2[q2[i]] = [q1[i]]
34 |         else:
35 |             dict_2[q2[i]].append(q1[i])
36 | 
37 |     if i % 5000 == 0:
38 |         sys.stdout.flush()
39 |         sys.stdout.write('#')
40 | print(len(dict_1))
41 | 
42 | df_result = pd.read_csv('./data/result.csv', encoding='utf-8', engine='python', index_col='id')
43 | df_test = pd.read_csv('./data/noextension/test.csv', encoding='utf-8', engine='python')
44 | q1_test = df_test['question1'].values
45 | q2_test = df_test['question2'].values
46 | category_test = df_test['category'].values
47 | id_test = df_test['id']
48 | 
49 | cnt = 0
50 | for i in range(0, df_test.shape[0]):
51 |     # print(q1_test[i], q2_test[i], id_test[i], category_test[i])
52 |     list1 = dict_1.get(q1_test[i], -1)
53 |     list2 = dict_1.get(q2_test[i], -1)
54 |     ct = category_test[i]
55 |     if list1 != -1:
56 |         if list2 != -1:
57 |             if len(set(list1).intersection(set(list2))) != 0 and dict_ct.get(q1_test[i], -1) == dict_ct.get(q2_test[i], -1) and dict_ct.get(q1_test[i], -1) != -1:
58 |                 df_result.loc[id_test[i], 'label'] = 1
59 |                 print(3)
60 | 
61 |     # For each question q similar to q1, check whether the training set marks q != q2, which implies q1 != q2.
62 |     if list1 != -1:
63 |         for q in list1:
64 |             neq_list = dict_2.get(q, -1)
65 |             if neq_list != -1:
66 |                 if q2_test[i] in neq_list:
67 |                     df_result.loc[id_test[i], 'label'] = 0
68 |                     print(1)
69 |     # Similarly for q2.
70 |     if list2 != -1:
71 |         for q in list2:
72 |             neq_list = dict_2.get(q, -1)
73 |             if neq_list != -1:
74 |                 if q1_test[i] in neq_list:
75 |                     df_result.loc[id_test[i], 'label'] = 0
76 |                     print(2)
77 | 
78 | df_result.to_csv('./data/post_result.csv')
79 | 
--------------------------------------------------------------------------------
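For reference, the label-propagation rules that `post_processing.py` applies can be stated compactly. The sketch below is illustrative (the function and argument names are not from the repo): `same`, `diff` and `cat` mirror `dict_1`, `dict_2` and `dict_ct` above, mapping each training question to the questions labelled as equivalent to it, the questions labelled as different from it, and its category.

```python
def propagate(q1, q2, same, diff, cat):
    """Return 1 or 0 to override the model's prediction, or None to keep it."""
    # q1 ~ x and q2 ~ x, with a consistent known category  =>  q1 ~ q2
    if set(same.get(q1, [])) & set(same.get(q2, [])) \
            and cat.get(q1) is not None and cat.get(q1) == cat.get(q2):
        return 1
    # q1 ~ x and x != q2 (or symmetrically q2 ~ x and x != q1)  =>  q1 != q2
    if any(q2 in diff.get(x, []) for x in same.get(q1, [])):
        return 0
    if any(q1 in diff.get(x, []) for x in same.get(q2, [])):
        return 0
    return None
```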