├── README.md
├── data_augmentation.py
├── data_utils.py
├── extract_bert_char_embedding.py
├── feature_engineering.py
├── model_feature.py
├── model_final.py
├── model_final_2.py
├── model_multitask.py
├── model_qpm.py
└── post_processing.py


/README.md:
--------------------------------------------------------------------------------
 1 | # chip2019_task2_question_pairs_matching
 2 | [CHIP 2019 平安医疗科技疾病问答迁移学习比赛](https://www.biendata.com/competition/chip2019/)，本质上就是一个类似于Quora Question Pairs的问句匹配问题。基于[huggingface/pytorch-transformers](https://github.com/huggingface/transformers/blob/master/examples/run_glue.py)实现的BERT baseline，代码比较冗余，中文预训练模型采用[ymcui/Chinese-BERT-wwm](https://github.com/ymcui/Chinese-BERT-wwm)。因为条件有限（没有GPU。。。），所以只跑了几个baseline（提交次数惨淡），没有trick、模型融合以及超参数选择，只做10折交叉验证，A榜应该就能达到0.878+，B榜0.864+，A榜rank 9，B榜rank11，剔除了小号和未报名队伍后rank7，作为baseline效果还是可以的。
 3 | 
 4 | 项目文件目录结构及文件说明如下：
 5 | ```
 6 | |-- README.md
 7 | |-- data_augmentation.py 利用问句相似性传递进行数据增强和生成10折交叉验证数据文件
 8 | |-- data_utils.py 读取数据文件，转化为模型输入等操作
 9 | |-- extract_bert_char_embedding.py 抽取出BERT的字向量，然后用在例如ESIM之类的常规模型上，效果不好
10 | |-- feature_engineering.py 对每一折数据抽取出数据字词级别的tfidf等特征
11 | |-- model_qpm.py BERT常规句对分类
12 | |-- model_multitask.py 在qpm的基础上，增加一个子任务，将句对句子合在一起，判断所属类目
13 | |-- model_feature.py 在qpm的基础上，加入手工特征的dense层
14 | |-- model_final.py 将qpm、判断类目和手工特征三种做法融合
15 | |-- model_final_2.py 和model_final.py基本一致，只是改变了模型保存方式
16 | |-- post_processing.py 利用问句相似性传递进行后处理文件
17 | |-- data 按照10折交叉验证分别存放10折数据文件
18 |     |-- noextension 未数据增强的文件
19 |         |-- 0
20 |         |-- ...
21 |     |-- THUOCL_medical.txt 清华开源的医疗词库，用于jieba加载后分词做词特征抽取
22 | |-- tmp 按照10折交叉验证分别保存模型的目录
23 |     |-- 0
24 |     |-- ...
25 | ```
26 | `model_final.py`和`model_final_2.py`只是在模型保存方式上有区别，前者占空间，后者费时间。
27 | 
28 | 训练模型请使用
29 | ```
30 | python3 model_final_2.py --model_name_or_path ./chinese_roberta_wwm_ext_pytorch/ --do_train --do_lower_case --data_dir ./data/noextension/ --max_seq_length 128 --per_gpu_train_batch_size 16 --learning_rate 2e-5 --num_train_epochs 5.0 --output_dir ./tmp/ --overwrite_output_dir --evaluate_during_training
31 | ```
32 | 
33 | 预测结果请使用
34 | ```
35 | python3 model_final_2.py --model_name_or_path ./chinese_roberta_wwm_ext_pytorch/ --do_predict --do_lower_case --data_dir ./data/noextension/ --per_gpu_test_batch_size 16 --output_dir ./tmp/
36 | ```
37 | 
38 | 说说模型效果：
39 | - 使用`RoBERTa-wwm-ext`和`RoBERTa-wwm-ext-large`能比`BERT-wwm-ext`提升0.005，而roberta base和large效果差别不大，就0.002左右。
40 | - 利用问句相似性传递做数据增强容易过拟合，而且本训练集标注有很多问题，不对训练集做任何修改的话，用了数据增强反而下降0.005左右。同样做相似性传递后处理也会下降0.001~0.005，所以就没用数据增强和后处理。
41 | - 加入句子分类的loss，能提升0.001。
42 | - 加入特征工程的dense层，效果不稳定，可能是我的特征选得不好。
43 | 
44 | 另外两个句子分别通过BERT得到representation后，互相做一下attention拼接到句子对的[CLS]输出后也能提升模型效果，不过后期没GPU了，没基于roberta-wwm训练新模型，所以最终提交的还是roberta_final_2.py跑出来的结果。
45 | 


--------------------------------------------------------------------------------
/data_augmentation.py:
--------------------------------------------------------------------------------
  1 | # -*- coding: utf-8 -*-
  2 | 
  3 | import copy
  4 | import os
  5 | import numpy as np
  6 | import random
  7 | import sys
  8 | 
  9 | import pandas as pd
 10 | from sklearn.model_selection import KFold, train_test_split
 11 | from sklearn.utils import shuffle
 12 | from tqdm import tqdm
 13 | 
 14 | """ Naive train test split. """
 15 | # df = pd.read_csv('./data/aug_train.csv', encoding='utf-8', engine='python')
 16 | # df = shuffle(df)
 17 | # train_df = df[2000:]
 18 | # dev_df = df[:2000]
 19 | # train_df.to_csv('./data/origin_train.csv', index=False)
 20 | # dev_df.to_csv('./data/dev.csv', index=False)
 21 | 
 22 | """ KFold """
 23 | # parent_directory = './data/extension/'
 24 | # df = pd.read_csv(parent_directory + 'final_train.csv', encoding='utf-8', engine='python')
 25 | # kFold = KFold(n_splits=10, shuffle=True, random_state=12345)
 26 | # folds = kFold.split(df)
 27 | # for i in range(10):
 28 | #     os.makedirs(parent_directory + str(i))
 29 | # for i, (train, dev) in enumerate(folds):
 30 | #     df.iloc[train].to_csv(parent_directory + str(i) + '/origin_train.csv', index=False)
 31 | #     df.iloc[dev].to_csv(parent_directory + str(i) + '/dev.csv', index=False)
 32 | 
 33 | """ Data extension by questions similarity. """
 34 | df_train = pd.read_csv('./data/origin_train.csv', encoding='utf-8', engine='python')
 35 | q1 = df_train['question1'].values
 36 | q2 = df_train['question2'].values
 37 | label = df_train['label'].values
 38 | category = df_train['category'].values
 39 | dict_1 = {}
 40 | dict_ct = {}
 41 | for i in range(0, df_train.shape[0]):
 42 |     dict_ct[q1[i]] = category[i]
 43 |     dict_ct[q2[i]] = category[i]
 44 |     if label[i] == 1:
 45 |         if dict_1.get(q1[i], -1) == -1:
 46 |             dict_1[q1[i]] = [q2[i]]
 47 |         else:
 48 |             dict_1[q1[i]].append(q2[i])
 49 |         if dict_1.get(q2[i], -1) == -1:
 50 |             dict_1[q2[i]] = [q1[i]]
 51 |         else:
 52 |             dict_1[q2[i]].append(q1[i])
 53 |     if i % 5000 == 0:
 54 |         sys.stdout.flush()
 55 |         sys.stdout.write('#')
 56 | print(len(dict_1))
 57 | 
 58 | listxy = []
 59 | for x in dict_1:
 60 |     listx = dict_1[x]
 61 |     if len(listx) > 1:
 62 |         listy = listx[:]
 63 |         random.shuffle(listy)
 64 |         for x, y in zip(listx, listy):
 65 |             if x != y and y not in dict_1[x] and x not in dict_1[y]:
 66 |                 if dict_ct[x] != dict_ct[y]:
 67 |                     ct = 'wrong'
 68 |                     listxy.append([x, y, 0, ct])
 69 |                 else:
 70 |                     ct = dict_ct[x]
 71 |                     listxy.append([x, y, 1, ct])
 72 | print(len(listxy))
 73 | random.shuffle(listxy)
 74 | df_ext = pd.DataFrame(listxy)
 75 | df_ext.columns = ['question1', 'question2', 'label', 'category']
 76 | df_ext.to_csv('./data/extension/ext_train.csv', index=False)
 77 | 
 78 | """ Produce negative samples and ombine extension dataset. """
 79 | df_ext_train = pd.read_csv('./data/extension/ext_train.csv')
 80 | temp_q1 = df_train['question1'].values.copy()
 81 | temp_q2 = df_train['question2'].values.copy()
 82 | np.random.shuffle(temp_q1)
 83 | np.random.shuffle(temp_q2)
 84 | temp_df = pd.DataFrame()
 85 | temp_df['label'] = np.zeros(temp_q1.shape[0], dtype=int)
 86 | temp_df['question1'] = temp_q1
 87 | temp_df['question2'] = temp_q2
 88 | category_col = []
 89 | for i in range(len(temp_df)):
 90 |     if dict_ct[temp_df.iloc[i]['question1']] == dict_ct[temp_df.iloc[i]['question2']]:
 91 |         category_col.append(dict_ct[temp_df.iloc[i]['question1']])
 92 |     else:
 93 |         category_col.append('wrong')
 94 | temp_df['category'] = category_col
 95 | temp_df = temp_df.sample(n=int(df_ext_train.shape[0]*0.8))
 96 | df_train = pd.concat([df_train, df_ext_train, temp_df], sort=False)
 97 | df_train = df_train.drop_duplicates(['question1', 'question2']).reset_index(drop=True)
 98 | df_train.columns = ['question1', 'question2', 'label', 'category']
 99 | df_train.to_csv('./data/extension/final_train.csv', index=False)
100 | print('Complete.')
101 | 
102 | parent_directory = './data/extension/'
103 | df = pd.read_csv(parent_directory + 'final_train.csv', encoding='utf-8', engine='python')
104 | kFold = KFold(n_splits=10, shuffle=True, random_state=12345)
105 | folds = kFold.split(df)
106 | for i in range(10):
107 |     os.makedirs(parent_directory + str(i))
108 | for i, (train, dev) in enumerate(folds):
109 |     df.iloc[train].to_csv(parent_directory + str(i) + '/origin_train.csv', index=False)
110 |     df.iloc[dev].to_csv(parent_directory + str(i) + '/dev.csv', index=False)
111 | 
112 | 
113 | category_map = {
114 |     'aids':             '艾滋病',
115 |     'breast_cancer':    '乳腺癌',
116 |     'diabetes':         '糖尿病',
117 |     'hepatitis':        '乙肝',
118 |     'hypertension':     '高血压'
119 | }
120 | 


--------------------------------------------------------------------------------
/data_utils.py:
--------------------------------------------------------------------------------
  1 | # -*- coding: utf-8 -*-
  2 | """ BERT classification fine-tuning: utilities to work with QPM tasks. """
  3 | 
  4 | import logging
  5 | import os
  6 | import sys
  7 | 
  8 | import pandas as pd
  9 | from sklearn.metrics import f1_score
 10 | 
 11 | logger = logging.getLogger(__name__)
 12 | 
 13 | 
 14 | class InputExample(object):
 15 |     """ A single training/test example for question pairs matching task. """
 16 |     def __init__(self, guid, question_a, question_b, label=None, category=None, hand_features=None):
 17 |         """ Constructs a InputExample.
 18 |         Args:
 19 |             guid: Unique id for the example.
 20 |             question_a: string. The untokenized question sentence of the first sequence.
 21 |             question_b: string. The untokenized question sentence of the second sequence.
 22 |             label: string. The label of the example. This should be specified for train and dev examples,
 23 |             but not for test examples
 24 |         """
 25 |         self.guid = guid
 26 |         self.question_a = question_a
 27 |         self.question_b = question_b
 28 |         self.label = label
 29 |         self.category = category
 30 |         self.hand_features = hand_features
 31 | 
 32 | 
 33 | class InputFeatures(object):
 34 |     """ A single set of features of data. """
 35 |     def __init__(self, input_ids, input_mask, segment_ids, label_id,
 36 |                  category_clf_input_ids, category_clf_input_mask, category_clf_segment_ids, category_id, hand_features):
 37 |         self.input_ids = input_ids
 38 |         self.input_mask = input_mask
 39 |         self.segment_ids = segment_ids
 40 |         self.label_id = label_id
 41 |         self.category_clf_input_ids = category_clf_input_ids
 42 |         self.category_clf_input_mask = category_clf_input_mask
 43 |         self.category_clf_segment_ids = category_clf_segment_ids
 44 |         self.category_id = category_id
 45 |         self.hand_features = hand_features
 46 | 
 47 | 
 48 | class DataProcessor(object):
 49 |     """ Base class for data converters for sequence classfication data sets. """
 50 |     def get_examples(self, data_dir, set_type):
 51 |         """ Gets a collection of `InputExample`s for the train set. """
 52 |         raise NotImplementedError()
 53 | 
 54 |     def get_labels(self):
 55 |         """ Gets the list of labels for this data set."""
 56 |         raise NotImplementedError()
 57 | 
 58 |     @classmethod
 59 |     def _read_csv(cls, input_file, quotechar=None):
 60 |         """ Reads a `,` seperated value file. """
 61 |         with open(input_file, 'r', encoding='utf-8') as f:
 62 |             print(input_file)
 63 |             df = pd.read_csv(f, delimiter=',')
 64 |             df_feat = pd.read_csv(input_file.replace('.csv', '_feats.csv'))
 65 |             lines = []
 66 |             for index in df.index:
 67 |                 line = df.iloc[index].values
 68 |                 line = line.tolist()
 69 |                 line.append(df_feat.iloc[index])
 70 |                 lines.append(line)
 71 |             return lines
 72 | 
 73 | 
 74 | class QPMProcessor(DataProcessor):
 75 |     """ Processor for the QPM data set. """
 76 |     def get_examples(self, data_dir, set_type):
 77 |         """ See base class. """
 78 |         return self._create_examples(
 79 |             self._read_csv(os.path.join(data_dir, set_type + '.csv')), set_type)
 80 | 
 81 |     def get_labels(self):
 82 |         """ See base class. """
 83 |         return [0, 1]
 84 | 
 85 |     def get_categories(self):
 86 |         return ['aids', 'hypertension', 'hepatitis', 'diabetes', 'breast_cancer', 'wrong']
 87 | 
 88 |     def _create_examples(self, lines, set_type):
 89 |         """ Creates examples for the training and dev sets. """
 90 |         examples = []
 91 | 
 92 |         for (i, line) in enumerate(lines):
 93 |             guid = '%s-%s' % (set_type, i)
 94 |             try:
 95 |                 if set_type == 'train' or set_type == 'dev':
 96 |                     question_a = line[0]
 97 |                     question_b = line[1]
 98 |                     label = line[2]
 99 |                     category = line[3]
100 |                     hand_features = line[4]
101 |                 elif set_type == 'test':
102 |                     question_a = line[2]
103 |                     question_b = line[3]
104 |                     category = line[0]
105 |                     hand_features = line[4]
106 |                     label = 0
107 |                 else:
108 |                     raise ValueError()
109 |             except IndexError:
110 |                 continue
111 |             examples.append(InputExample(guid=guid, question_a=question_a, question_b=question_b, label=label, category=category, hand_features=hand_features))
112 |         return examples
113 | 
114 | 
115 | def convert_examples_to_features(examples, label_list, category_list, max_seq_length, tokenizer,
116 |                                  cls_token_at_end=False, cls_token='[CLS]', cls_token_segment_id=1,
117 |                                  sep_token='[SEP]', sep_token_extra=False,
118 |                                  pad_on_left=False, pad_token=0, pad_token_segment_id=0,
119 |                                  sequence_a_segment_id=0, sequence_b_segment_id=1,
120 |                                  mask_padding_with_zero=True):
121 |     """ Loads a data file into a list of `InputBatch`s
122 |         `cls_toekn_at_end` define the location of the CLS token:
123 |             - False (Default, BERT/XLM pattern): [CLS] + A + [SEP] + B + [SEP]
124 |             - True (XLNet/GPT pattern): A + [SEP] + B + [SEP] + [CLS]
125 |         `cls_token_segment_id` define the segment id associated to the CLS token (0 for BERT, 2 for XLNet)
126 |     """
127 |     label_map = {label: i for i, label in enumerate(label_list)}
128 |     category_map = {category: i for i, category in enumerate(category_list)}
129 |     features = []
130 |     for (ex_index, example) in enumerate(examples):
131 |         hand_features = example.hand_features
132 | 
133 |         if ex_index % 10000 == 0:
134 |             logger.info('Writing example %d of %d' % (ex_index, len(examples)))
135 | 
136 |         tokens_a = tokenizer.tokenize(example.question_a)
137 |         tokens_b = tokenizer.tokenize(example.question_b)
138 |         # Modifies `tokens_a` and `tokens_b` in place so that the total length is less than the specified length.
139 |         # Account for [CLS], [SEP], [SEP] with '- 3'. '- 4' for ro RoBERTa
140 |         special_tokens_count = 4  if sep_token_extra else 3
141 |         _truncate_seq_pair(tokens_a, tokens_b, max_seq_length-special_tokens_count)
142 | 
143 |         # The convention in BERT is:
144 |         # (a) For sequence pairs:
145 |         #  tokens:   [CLS] is this jack ##son ##ville ? [SEP] no it is not . [SEP]
146 |         #  type_ids:   0   0  0    0    0     0       0   0   1  1  1  1   1   1
147 |         # (b) For single sequences:
148 |         #  tokens:   [CLS] the dog is hairy . [SEP]
149 |         #  type_ids:   0   0   0   0  0     0   0
150 |         #
151 |         # Where "type_ids" are used to indicate whether this is the first
152 |         # sequence or the second sequence. The embedding vectors for `type=0` and
153 |         # `type=1` were learned during pre-training and are added to the wordpiece
154 |         # embedding vector (and position vector). This is not *strictly* necessary
155 |         # since the [SEP] token unambiguously separates the sequences, but it makes
156 |         # it easier for the model to learn the concept of sequences.
157 |         #
158 |         # For classification tasks, the first vector (corresponding to [CLS]) is
159 |         # used as as the "sentence vector". Note that this only makes sense because
160 |         # the entire model is fine-tuned.
161 |         tokens = tokens_a + [sep_token]
162 |         segment_ids = [sequence_a_segment_id] * len(tokens)
163 |         tokens += tokens_b + [sep_token]
164 |         segment_ids += [sequence_b_segment_id] * (len(tokens_b) + 1)
165 | 
166 |         category_clf_tokens = tokens_a + tokens_b
167 |         special_tokens_count = 3 if sep_token_extra else 2
168 |         if len(category_clf_tokens) > max_seq_length - special_tokens_count:
169 |             category_clf_tokens = category_clf_tokens[:(max_seq_length - special_tokens_count)]
170 |         category_clf_tokens += [sep_token]
171 |         category_clf_segment_ids = [sequence_a_segment_id] * len(category_clf_tokens)
172 | 
173 |         if cls_token_at_end:
174 |             tokens = tokens + [cls_token]
175 |             segment_ids = segment_ids + [cls_token_segment_id]
176 |         else:
177 |             tokens = [cls_token] + tokens
178 |             segment_ids = [cls_token_segment_id] + segment_ids
179 | 
180 |             category_clf_tokens = [cls_token] + category_clf_tokens
181 |             category_clf_segment_ids = [cls_token_segment_id] + category_clf_segment_ids
182 | 
183 |         input_ids = tokenizer.convert_tokens_to_ids(tokens)
184 |         category_clf_input_ids = tokenizer.convert_tokens_to_ids(category_clf_tokens)
185 | 
186 |         # The mask has 1 for real tokens and 0 for padding tokens. Only real
187 |         # tokens are attended to.
188 |         input_mask = [1 if mask_padding_with_zero else 0] * len(input_ids)
189 |         category_clf_input_mask = [1 if mask_padding_with_zero else 0] * len(category_clf_input_ids)
190 | 
191 |         # Zero-pad up to the sequence length.
192 |         padding_length = max_seq_length - len(input_ids)
193 |         category_clf_padding_length = max_seq_length - len(category_clf_input_ids)
194 |         if pad_on_left:
195 |             input_ids = ([pad_token] * padding_length) + input_ids
196 |             input_mask = ([0 if mask_padding_with_zero else 1] * padding_length) + input_mask
197 |             segment_ids = ([pad_token_segment_id] * padding_length) + segment_ids
198 | 
199 |             category_clf_input_ids = ([pad_token] * category_clf_padding_length) + category_clf_input_ids
200 |             category_clf_input_mask = ([0 if mask_padding_with_zero else 1] * category_clf_padding_length) + category_clf_input_mask
201 |             category_clf_segment_ids = ([pad_token_segment_id] * category_clf_padding_length) + category_clf_segment_ids
202 |         else:
203 |             input_ids = input_ids + ([pad_token] * padding_length)
204 |             input_mask = input_mask + ([0 if mask_padding_with_zero else 1] * padding_length)
205 |             segment_ids = segment_ids + ([pad_token_segment_id] * padding_length)
206 | 
207 |             category_clf_input_ids = category_clf_input_ids + ([pad_token] * category_clf_padding_length)
208 |             category_clf_input_mask =  category_clf_input_mask + ([0 if mask_padding_with_zero else 1] * category_clf_padding_length)
209 |             category_clf_segment_ids = category_clf_segment_ids + ([pad_token_segment_id] * category_clf_padding_length)
210 | 
211 |         assert len(input_ids) == max_seq_length
212 |         assert len(input_mask) == max_seq_length
213 |         assert len(segment_ids) == max_seq_length
214 |         assert len(category_clf_input_ids) == max_seq_length
215 |         assert len(category_clf_input_mask) == max_seq_length
216 |         assert len(category_clf_segment_ids) == max_seq_length
217 | 
218 |         label_id = label_map[example.label]
219 |         category_id = category_map[example.category]
220 | 
221 |         if ex_index < 5:
222 |             logger.info("*** Example ***")
223 |             logger.info("guid: %s" % (example.guid))
224 |             logger.info("tokens: %s" % " ".join(
225 |                 [str(x) for x in tokens]))
226 |             logger.info("input_ids: %s" % " ".join([str(x) for x in input_ids]))
227 |             logger.info("input_mask: %s" % " ".join([str(x) for x in input_mask]))
228 |             logger.info("segment_ids: %s" % " ".join([str(x) for x in segment_ids]))
229 |             logger.info("hand_features: %s" % " ".join([str(x) for x in hand_features]))
230 |             if example.label is not None:
231 |                 logger.info("label: %s (id = %d)" % (example.label, label_id))
232 | 
233 |         features.append(
234 |             InputFeatures(input_ids=input_ids,
235 |                           input_mask=input_mask,
236 |                           segment_ids=segment_ids,
237 |                           label_id=label_id,
238 |                           category_clf_input_ids=category_clf_input_ids,
239 |                           category_clf_input_mask=category_clf_input_mask,
240 |                           category_clf_segment_ids=category_clf_segment_ids,
241 |                           category_id=category_id,
242 |                           hand_features=hand_features
243 |                           ))
244 |     return features
245 | 
246 | 
247 | def _truncate_seq_pair(tokens_a, tokens_b, max_length):
248 |     """ Truncates a sequence pair in place to the maximum length. """
249 |     # This is a simple heuristic which will always truncate the longer sequence
250 |     # one token at a time. This makes more sense than truncating an equal percent
251 |     # of tokens from each, since if one sequence is very short then each token
252 |     # that's truncated likely contains more information that a longer sequence.
253 |     while True:
254 |         total_length = len(tokens_a) + len(tokens_b)
255 |         if total_length <= max_length:
256 |             break
257 |         if len(tokens_a) > len(tokens_b):
258 |             tokens_a.pop()
259 |         else:
260 |             tokens_b.pop()
261 | 
262 | 
263 | def compute_metrics(preds, labels):
264 |     assert len(preds) == len(labels)
265 |     return acc_and_f1(preds, labels)
266 | 
267 | 
268 | def acc_and_f1(preds, labels):
269 |     acc = simple_accuracy(preds, labels)
270 |     f1 = f1_score(y_true=labels, y_pred=preds)
271 |     return {
272 |         "acc": acc,
273 |         "f1": f1,
274 |         "acc_and_f1": (acc + f1) / 2,
275 |     }
276 | 
277 | 
278 | def simple_accuracy(preds, labels):
279 |     return (preds == labels).mean()
280 | 


--------------------------------------------------------------------------------
/extract_bert_char_embedding.py:
--------------------------------------------------------------------------------
 1 | # -*- coding: utf-8 -*-
 2 | """ Extract the pretrained character level embedding from BERT hidden outputs. """
 3 | import re
 4 | 
 5 | import numpy as np
 6 | from pytorch_transformers import BertTokenizer, BertModel
 7 | 
 8 | 
 9 | if __name__ == '__main__':
10 |     print('# Load pretrained model tokenizer.')
11 |     tokenizer = BertTokenizer.from_pretrained('./bert_wwm/')
12 | 
13 |     print('# Construct vocab.')
14 |     vocab = [token for token in tokenizer.vocab]
15 | 
16 |     print('# Load pretrained model.')
17 |     model = BertModel.from_pretrained('./bert_wwm')
18 | 
19 |     print('# Load word embeddings')
20 |     emb = model.embeddings.word_embeddings.weight.data
21 |     emb = emb.numpy()
22 | 
23 |     print('# Write')
24 |     with open('{}.{}.{}d.vec'.format('bert_wwm', len(vocab), emb.shape[-1]), 'w', encoding='utf-8') as fout:
25 |         fout.write('{} {}\n'.format(len(vocab), emb.shape[-1]))
26 |         assert len(vocab) == len(emb), 'The number of vocab and embeddings MUST be identical.'
27 |         for token, e in zip(vocab, emb):
28 |             e = np.array2string(e, max_line_width=np.inf)[1:-1]
29 |             e = re.sub('[ ]+', ' ', e)
30 |             fout.write('{} {}\n'.format(token, e))
31 | 


--------------------------------------------------------------------------------
/feature_engineering.py:
--------------------------------------------------------------------------------
  1 | # -*- coding: utf-8 -*-
  2 | 
  3 | import jieba
  4 | import numpy as np
  5 | import pandas as pd
  6 | 
  7 | from sklearn.feature_extraction.text import TfidfVectorizer
  8 | from tqdm import tqdm
  9 | 
 10 | 
 11 | def get_len_diff(data):
 12 |     """
 13 |     Get the difference of length and normalize by the longest one of question pairs.
 14 |     """
 15 |     q1_len = data.question1.apply(lambda x: len(x.split(' '))).values
 16 |     q2_len = data.question2.apply(lambda x: len(x.split(' '))).values
 17 |     len_diff = np.abs(q1_len - q2_len) / np.max([q1_len, q2_len], axis=0)
 18 |     return len_diff
 19 | 
 20 | 
 21 | def get_num_common_units(data):
 22 |     """
 23 |     Get the common units(words or chars) in q1 and q2.
 24 |     """
 25 |     q1_unit_set = data.question1.apply(lambda x: x.split(' ')).apply(set).values
 26 |     q2_unit_set = data.question2.apply(lambda x: x.split(' ')).apply(set).values
 27 |     result = [len(q1_unit_set[i] & q2_unit_set[i]) for i in range(len(q1_unit_set))]
 28 |     result = pd.DataFrame(result, index=data.index)
 29 |     result.columns = ['num_common_units']
 30 |     return result
 31 | 
 32 | 
 33 | def get_common_units_ratio(data):
 34 |     q1_unit_set = data.question1.apply(lambda x: x.split(' ')).apply(set).values()
 35 |     q2_unit_set = data.question2.apply(lambda x: x.split(' ')).apply(set).values()
 36 |     q1_len = data.question1.apply(lambda x: len(x.split(' '))).values
 37 |     q2_len = data.question2.apply(lambda x: len(x.split(' '))).values
 38 |     result = [len(q1_unit_set[i] & q2_unit_set[i])/max(q1_len[i], q2_len[i]) for i in range(len(q1_unit_set))]
 39 |     result = pd.DataFrame(result, index=data.index)
 40 |     result.columns = ['common_units_ratio']
 41 |     return result
 42 | 
 43 | 
 44 | def get_tfidf_vector(data, vectorizer):
 45 |     q1_tfidf = vectorizer.transform(data.question1.values)
 46 |     q2_tfidf = vectorizer.transform(data.question2.values)
 47 |     return vectorizer.vocabulary_, q1_tfidf, q2_tfidf
 48 | 
 49 | 
 50 | def adjust_common_units_ratio_by_tfidf(data, unit2index, q1_tfidf, q2_tfidf):
 51 |     adjusted_common_units_ratio = []
 52 |     for i in range(q1_tfidf.shape[0]):
 53 |         q1_units = {}
 54 |         q2_units = {}
 55 |         for unit in data.loc[i, 'question1'].lower().split():
 56 |             q1_units[unit] = q1_units.get(unit, 0) + 1
 57 |         for unit in data.loc[i, 'question2'].lower().split():
 58 |             q2_units[unit] = q2_units.get(unit, 0) + 1
 59 | 
 60 |         sum_shared_unit_in_q1 = sum([q1_units[u] * q1_tfidf[i, unit2index[u]] for u in q1_units if u in q2_units])
 61 |         sum_shared_unit_in_q2 = sum([q2_units[u] * q2_tfidf[i, unit2index[u]] for u in q2_units if u in q1_units])
 62 |         sum_total = sum([q1_units[u] * q1_tfidf[i, unit2index[u]] for u in q1_units]) +\
 63 |                     sum([q2_units[u] * q2_tfidf[i, unit2index[u]] for u in q2_units])
 64 |         if 1e-6 > sum_total:
 65 |             adjusted_common_units_ratio.append(0.)
 66 |         else:
 67 |             adjusted_common_units_ratio.append(1.0 * (sum_shared_unit_in_q1 + sum_shared_unit_in_q2) / sum_total)
 68 |     return adjusted_common_units_ratio
 69 | 
 70 | 
 71 | def generate_powerful_unit(data):
 72 |     """
 73 |     Calculate the influence of unit.
 74 |     0. the num of unit appeared in question pairs
 75 |     1. the ratio of unit appeared in question pairs
 76 |     2. the ratio of unit appeared in question pairs labeled 1
 77 |     3. the ratio of unit appeared in only one question
 78 |     4. the ratio of unit appeared in only one question and pair labeled 1
 79 |     5. the ratio of unit appeared in both two questions
 80 |     6. the ratio of unit appeared in both two questions and pair labeled 1
 81 |     """
 82 |     units_power = {}
 83 |     for i in data.index:
 84 |         label = int(data.loc[i, 'label'])
 85 |         q1_units = list(data.loc[i, 'question1'].lower().split())
 86 |         q2_units = list(data.loc[i, 'question2'].lower().split())
 87 |         all_units = set(q1_units + q2_units)
 88 |         q1_units = set(q1_units)
 89 |         q2_units = set(q2_units)
 90 | 
 91 |         for unit in all_units:
 92 |             if unit not in units_power:
 93 |                 units_power[unit] = [0. for i in range(7)]
 94 |             units_power[unit][0] += 1.
 95 |             units_power[unit][1] += 1.
 96 | 
 97 |             if (unit in q1_units and unit not in q2_units) or (unit not in q1_units and unit in q2_units):
 98 |                 units_power[unit][3] += 1.
 99 |                 if 1 == label:
100 |                     units_power[unit][2] += 1.
101 |                     units_power[unit][4] += 1.
102 | 
103 |             if unit in q1_units and unit in q2_units:
104 |                 units_power[unit][5] += 1.
105 |                 if 1 == label:
106 |                     units_power[unit][2] += 1.
107 |                     units_power[unit][6] += 1.
108 | 
109 |     for unit in units_power:
110 |         # calculate ratios
111 |         units_power[unit][1] /= data.shape[0]
112 |         units_power[unit][2] /= data.shape[0]
113 |         if units_power[unit][3] > 1e-6:
114 |             units_power[unit][4] /= units_power[unit][3]
115 |         units_power[unit][4] /= units_power[unit][0]
116 |         if units_power[unit][5] > 1e-6:
117 |             units_power[unit][6] /= units_power[unit][5]
118 |         units_power[unit][5] /= units_power[unit][0]
119 | 
120 |     sorted_units_power = sorted(units_power.items(), key=lambda d: d[1][0], reverse=True)
121 |     return sorted_units_power
122 | 
123 | 
124 | def powerful_units_dside_tag(punit, data, threshold_num, threshold_rate):
125 |     """
126 |     If a powerful units appeared in both questions, the tag was set as 1, otherwise 0.
127 |     """
128 |     punit_dside = []
129 |     punit = filter(lambda x: x[1][0] * x[1][5] >= threshold_num, punit)
130 |     punit_sort = sorted(punit, key=lambda d: d[1][6], reverse=True)
131 |     punit_dside.extend(map(lambda x: x[0], filter(lambda x: x[1][6] >= threshold_rate, punit_sort)))
132 | 
133 |     punit_dside_tags = []
134 |     for i in data.index:
135 |         tags = []
136 |         q1_units = set(data.loc[i, 'question1'].lower().split())
137 |         q2_units = set(data.loc[i, 'question2'].lower().split())
138 |         for unit in punit_dside:
139 |             if unit in q1_units and unit in q2_units:
140 |                 tags.append(1.0)
141 |             else:
142 |                 tags.append(0.0)
143 |         punit_dside_tags.append(tags)
144 |     return punit_dside, punit_dside_tags
145 | 
146 | 
147 | def powerful_units_oside_tag(punit, data, threshold_num, threshold_rate):
148 |     punit_oside = []
149 |     punit = filter(lambda x: x[1][0] * x[1][3] >= threshold_num, punit)
150 |     punit_oside.extend(map(lambda x: x[0], filter(lambda x: x[1][4] >= threshold_rate, punit)))
151 | 
152 |     punit_oside_tags = []
153 |     for i in data.index:
154 |         tags = []
155 |         q1_units = set(data.loc[i, 'question1'].lower().split())
156 |         q2_units = set(data.loc[i, 'question2'].lower().split())
157 |         for unit in punit_oside:
158 |             if unit in q1_units and unit not in q2_units:
159 |                 tags.append(1.0)
160 |             elif unit not in q1_units and unit in q2_units:
161 |                 tags.append(1.0)
162 |             else:
163 |                 tags.append(0.0)
164 |         punit_oside_tags.append(tags)
165 |     return punit_oside, punit_oside_tags
166 | 
167 | 
168 | def powerful_units_dside_rate(sorted_units_power, punit_dside, data):
169 |     num_least = 300
170 |     units_power = dict(sorted_units_power)
171 |     punit_dside_rate = []
172 |     for i in data.index:
173 |         rate = 1.0
174 |         q1_units = set(data.loc[i, 'question1'].lower().split())
175 |         q2_units = set(data.loc[i, 'question2'].lower().split())
176 |         share_units = list(q1_units.intersection(q2_units))
177 |         for unit in share_units:
178 |             if unit in punit_dside:
179 |                 rate *= (1.0 - units_power[unit][6])
180 |         punit_dside_rate.append(1-rate)
181 |     return punit_dside_rate
182 | 
183 | 
184 | def powerful_units_oside_rate(sorted_units_power, punit_oside, data):
185 |     num_least = 300
186 |     units_power = dict(sorted_units_power)
187 |     punits_oside_rate = []
188 |     for i in data.index:
189 |         rate = 1.0
190 |         q1_units = set(data.loc[i, 'question1'].lower().split())
191 |         q2_units = set(data.loc[i, 'question2'].lower().split())
192 |         q1_diff = list(set(q1_units).difference(set(q2_units)))
193 |         q2_diff = list(set(q2_units).difference(set(q1_units)))
194 |         all_diff = set(q1_diff + q2_diff)
195 |         for unit in all_diff:
196 |             if unit in punit_oside:
197 |                 rate *= (1.0 - units_power[unit][4])
198 |         punits_oside_rate.append(1-rate)
199 |     return punits_oside_rate
200 | 
201 | 
202 | def edit_distance(q1, q2):
203 |     str1 = q1.split(' ')
204 |     str2 = q2.split(' ')
205 |     matrix = [[i+j for j in range(len(str2) + 1)] for i in range(len(str1) + 1)]
206 |     for i in range(1, len(str1) + 1):
207 |         for j in range(1, len(str2) + 1):
208 |             if str1[i - 1] == str2[j - 1]:
209 |                 d = 0
210 |             else:
211 |                 d = 1
212 |             matrix[i][j] = min(matrix[i-1][j]+1, matrix[i][j-1] + 1, matrix[i-1][j-1]+d)
213 |             if j > i > 1 and str1[i-1] == str2[j-2] and str1[i-2] == str2[j-1]:
214 |                 d = 0
215 |                 matrix[i][j] = min(matrix[i][j], matrix[i-2][j-2] + d)
216 |     return matrix[len(str1)][len(str2)]
217 | 
218 | 
219 | def get_edit_distance(data):
220 |     q1_len = data['question1'].apply(lambda x: len(list(x.split(' ')))).values
221 |     q2_len = data['question2'].apply(lambda x: len(list(x.split(' ')))).values
222 | 
223 |     dist = [edit_distance(data.loc[i, 'question1'], data.loc[i, 'question2']) / np.max([q1_len, q2_len], axis=0)[i] for i in data.index]
224 |     return dist
225 | 
226 | 
227 | def generate_split_chars():
228 |     for mode in ['train', 'test']:
229 |         df_temp = pd.read_csv('./data/noextension/' + mode + '.csv', encoding='utf-8', engine='python')
230 |         question1 = df_temp.question1.apply(lambda x: ' '.join(list(x.replace(' ', ''))))
231 |         question2 = df_temp.question2.apply(lambda x: ' '.join(list(x.replace(' ', ''))))
232 |         df_corpus = pd.DataFrame({
233 |             'question1': question1,
234 |             'question2': question2,
235 |         })
236 |         df_corpus.to_csv('./data/noextension/' + mode + '_corpus_char.csv', index=False)
237 | 
238 |     for i in range(10):
239 |         for mode in ['train', 'dev', 'test']:
240 |             df_temp = pd.read_csv('./data/noextension/' + str(i) + '/' + mode + '.csv', encoding='utf-8', engine='python')
241 |             question1 = df_temp.question1.apply(lambda x: ' '.join(list(x.replace(' ', ''))))
242 |             question2 = df_temp.question2.apply(lambda x: ' '.join(list(x.replace(' ', ''))))
243 | 
244 |             if mode == 'train':
245 |                 df_corpus = pd.DataFrame({
246 |                     'question1': question1,
247 |                     'question2': question2,
248 |                     'label': df_temp.label
249 |                 })
250 |             else:
251 |                 df_corpus = pd.DataFrame({
252 |                     'question1': question1,
253 |                     'question2': question2,
254 |                 })
255 |             df_corpus.to_csv('./data/noextension/' + str(i) + '/' + mode + '_corpus_char.csv', index=False)
256 | 
257 | 
258 | def generate_split_words():
259 |     for mode in ['train', 'test']:
260 |         df_temp = pd.read_csv('./data/noextension/' + mode + '.csv', encoding='utf-8', engine='python')
261 |         question1 = df_temp.question1.apply(lambda x: ' '.join(jieba.cut(x.replace(' ', ''))))
262 |         question2 = df_temp.question2.apply(lambda x: ' '.join(jieba.cut(x.replace(' ', ''))))
263 |         df_corpus = pd.DataFrame({
264 |             'question1': question1,
265 |             'question2': question2,
266 |         })
267 |         df_corpus.to_csv('./data/noextension/' + mode + '_corpus_word.csv', index=False)
268 | 
269 |     for i in range(10):
270 |         for mode in ['train', 'dev', 'test']:
271 |             df_temp = pd.read_csv('./data/noextension/' + str(i) + '/' + mode + '.csv', encoding='utf-8', engine='python')
272 |             question1 = df_temp.question1.apply(lambda x: ' '.join(jieba.cut(x.replace(' ', ''))))
273 |             question2 = df_temp.question2.apply(lambda x: ' '.join(jieba.cut(x.replace(' ', ''))))
274 | 
275 |             if mode == 'train':
276 |                 df_corpus = pd.DataFrame({
277 |                     'question1': question1,
278 |                     'question2': question2,
279 |                     'label': df_temp.label
280 |                 })
281 |             else:
282 |                 df_corpus = pd.DataFrame({
283 |                     'question1': question1,
284 |                     'question2': question2,
285 |                 })
286 |             df_corpus.to_csv('./data/noextension/' + str(i) + '/' + mode + '_corpus_word.csv', index=False)
287 | 
288 | 
289 | def generate_features_csv():
290 |     # prepare and load THUOCL medical file
291 |     with open('./data/THUOCL_medical.txt', 'r', encoding='utf-8') as f:
292 |         content = f.read()
293 |     with open('./data/THUOCL_medical.txt', 'w', encoding='utf-8') as f:
294 |         content = content.replace('\t', ' ')
295 |         f.write(content)
296 |     jieba.load_userdict('./data/THUOCL_medical.txt')
297 | 
298 |     print('*' * 10 + ' Generating chars corpus file ' + '*' * 10)
299 |     generate_split_chars()
300 |     print('*' * 10 + ' Generating words corpus file ' + '*' * 10)
301 |     generate_split_words()
302 | 
303 |     all_train_data = pd.read_csv('./data/noextension/train_corpus_char.csv', encoding='utf-8', engine='python')
304 |     corpus = list(all_train_data.question1) + list(all_train_data.question2)
305 |     all_test_data = pd.read_csv('./data/noextension/test_corpus_char.csv', encoding='utf-8', engine='python')
306 |     corpus += list(all_test_data.question1) + list(all_test_data.question2)
307 |     vectorizer_char = TfidfVectorizer(token_pattern=r'[^\s]+').fit(corpus)
308 | 
309 |     all_train_data = pd.read_csv('./data/noextension/train_corpus_word.csv', encoding='utf-8', engine='python')
310 |     corpus = list(all_train_data.question1) + list(all_train_data.question2)
311 |     all_test_data = pd.read_csv('./data/noextension/test_corpus_word.csv', encoding='utf-8', engine='python')
312 |     corpus += list(all_test_data.question1) + list(all_test_data.question2)
313 |     vectorizer_word = TfidfVectorizer(token_pattern=r'[^\s]+').fit(corpus)
314 | 
315 |     print('*' * 10 + ' Generating feature file ' + '*' * 10)
316 |     for i in tqdm(range(10)):
317 |         sorted_chars_power = None
318 |         sorted_words_power = None
319 |         for mode in ['train', 'dev', 'test']:
320 |             data = pd.read_csv('./data/noextension/' + str(i) + '/' + mode + '_corpus_char.csv', encoding='utf-8', engine='python')
321 |             if mode == 'train':
322 |                 sorted_chars_power = generate_powerful_unit(data)
323 | 
324 |             len_diff_char = get_len_diff(data)
325 |             edit_char = get_edit_distance(data)
326 |             vocab, q1_tfidf, q2_tfidf = get_tfidf_vector(data, vectorizer_char)
327 |             adjusted_common_char_ratio = adjust_common_units_ratio_by_tfidf(data, vocab, q1_tfidf, q2_tfidf)
328 |             pchar_dside, pchar_dside_tags = powerful_units_dside_tag(sorted_chars_power, data, 1, 0.7)
329 |             pchar_dside_rate = powerful_units_dside_rate(sorted_chars_power, pchar_dside, data)
330 |             pchar_oside, pchar_oside_tags = powerful_units_oside_tag(sorted_chars_power, data, 1, 0.7)
331 |             pchar_oside_rate = powerful_units_oside_rate(sorted_chars_power, pchar_oside, data)
332 | 
333 |             data = pd.read_csv('./data/noextension/' + str(i) + '/' + mode + '_corpus_word.csv', encoding='utf-8', engine='python')
334 |             if mode == 'train':
335 |                 sorted_words_power = generate_powerful_unit(data)
336 | 
337 |             len_diff_word = get_len_diff(data)
338 |             edit_word = get_edit_distance(data)
339 |             vocab, q1_tfidf, q2_tfidf = get_tfidf_vector(data, vectorizer_word)
340 |             adjusted_common_word_ratio = adjust_common_units_ratio_by_tfidf(data, vocab, q1_tfidf, q2_tfidf)
341 |             pword_dside, pword_dside_tags = powerful_units_dside_tag(sorted_words_power, data, 1, 0.7)
342 |             pword_dside_rate = powerful_units_dside_rate(sorted_words_power, pword_dside, data)
343 |             pword_oside, pword_oside_tags = powerful_units_oside_tag(sorted_words_power, data, 1, 0.7)
344 |             pword_oside_rate = powerful_units_oside_rate(sorted_words_power, pword_oside, data)
345 | 
346 |             df = pd.DataFrame({'len_diff_char': len_diff_char, 'edit_char': edit_char, 'len_diff_word': len_diff_word, 'edit_word': edit_word,
347 |                                'adjusted_common_char_ratio': adjusted_common_char_ratio, 'adjusted_common_word_ratio': adjusted_common_word_ratio,
348 |                                'pchar_dside_rate': pchar_dside_rate, 'pchar_oside_rate': pchar_oside_rate, 'pword_dside_rate': pword_dside_rate, 'pword_oside_rate': pword_oside_rate})
349 |             df.to_csv('./data/noextension/' + str(i) + '/' + mode + '_feats.csv', index=False)
350 | 
351 | 
352 | generate_features_csv()
353 | 


--------------------------------------------------------------------------------
/model_feature.py:
--------------------------------------------------------------------------------
  1 | # -*— coding: utf-8 -*-
  2 | """ Finetuning the library models for chip2019 question pairs matching. """
  3 | 
  4 | import argparse
  5 | import glob
  6 | import logging
  7 | import os
  8 | import random
  9 | import shutil
 10 | 
 11 | import numpy as np
 12 | import pandas as pd
 13 | import torch
 14 | import torch.nn as nn
 15 | from torch.nn import CrossEntropyLoss, MSELoss
 16 | from torch.utils.data import (DataLoader, RandomSampler, SequentialSampler, TensorDataset)
 17 | from tensorboardX import SummaryWriter
 18 | from tqdm import tqdm, trange
 19 | from pytorch_transformers import (WEIGHTS_NAME, BertConfig, BertModel, BertTokenizer)
 20 | from pytorch_transformers import AdamW, WarmupLinearSchedule
 21 | from pytorch_transformers.modeling_bert import BertPreTrainedModel
 22 | from data_utils import (compute_metrics, convert_examples_to_features, QPMProcessor)
 23 | 
 24 | logger = logging.getLogger(__name__)
 25 | 
 26 | 
 27 | class FeatureBert(BertPreTrainedModel):
 28 |     def __init__(self, config):
 29 |         super(FeatureBert, self).__init__(config)
 30 |         self.num_labels = config.num_labels
 31 | 
 32 |         self.bert = BertModel(config)
 33 |         self.dropout = nn.Dropout(config.hidden_dropout_prob)
 34 |         # self.classifier = nn.Linear(config.hidden_size + 5, self.config.num_labels)
 35 |         self.classifier = nn.Linear(config.hidden_size, self.config.num_labels)
 36 | 
 37 |         self.features_bn = nn.BatchNorm1d(5)
 38 |         self.features_dense = nn.Linear(5, 5)
 39 |         self.init_weights()
 40 | 
 41 |     def forward(self, input_ids, token_type_ids=None, attention_mask=None, labels=None,
 42 |                 position_ids=None, head_mask=None, hand_features=None):
 43 |         outputs = self.bert(input_ids, position_ids=position_ids, token_type_ids=token_type_ids,
 44 |                             attention_mask=attention_mask, head_mask=head_mask)
 45 | 
 46 |         hand_features = hand_features.float()
 47 |         # features = self.features_dense(self.features_bn(hand_features))
 48 |         pooled_output = outputs[1]
 49 | 
 50 |         pooled_output = self.dropout(pooled_output)
 51 |         # pooled_output = torch.cat((pooled_output, features), dim=1)
 52 |         logits = self.classifier(pooled_output)
 53 | 
 54 |         outputs = (logits,) + outputs[2:]  # add hidden states and attention if they are here
 55 | 
 56 |         if labels is not None:
 57 |             if self.num_labels == 1:
 58 |                 #  We are doing regression
 59 |                 loss_fct = MSELoss()
 60 |                 loss = loss_fct(logits.view(-1), labels.view(-1))
 61 |             else:
 62 |                 loss_fct = CrossEntropyLoss()
 63 |                 loss = loss_fct(logits.view(-1, self.num_labels), labels.view(-1))
 64 |             outputs = (loss,) + outputs
 65 | 
 66 |         return outputs  # (loss), logits, (hidden_states), (attentions)
 67 | 
 68 | 
 69 | def set_seed(args):
 70 |     random.seed(args.seed)
 71 |     np.random.seed(args.seed)
 72 |     torch.manual_seed(args.seed)
 73 |     if args.n_gpu > 0:
 74 |         torch.cuda.manual_seed_all(args.seed)
 75 | 
 76 | 
 77 | def train(args, train_dataset, model, tokenizer):
 78 |     """ Train the model. """
 79 |     tb_writer = SummaryWriter()
 80 | 
 81 |     args.train_batch_size = args.per_gpu_train_batch_size * max(1, args.n_gpu)
 82 |     train_sampler = RandomSampler(train_dataset)
 83 |     train_dataloader = DataLoader(train_dataset, sampler=train_sampler, batch_size=args.train_batch_size)
 84 | 
 85 |     if args.max_steps > 0:
 86 |         t_total = args.max_steps
 87 |         args.num_train_epochs = args.max_steps // (len(train_dataloader) // args.gradient_accumulation_steps) + 1
 88 |     else:
 89 |         t_total = len(train_dataloader) // args.gradient_accumulation_steps * args.num_train_epochs
 90 | 
 91 |     # Prepare optimizer and schedule (linear warmup and decay)
 92 |     no_decay = ['bias', 'LayerNorm.weight']
 93 |     optimizer_grouped_parameters = [
 94 |         {'params': [p for n, p in model.named_parameters() if not any(nd in n for nd in no_decay)], 'weight_decay': args.weight_decay},
 95 |         {'params': [p for n, p in model.named_parameters() if any(nd in n for nd in no_decay)], 'weight_decay': 0.0}
 96 |     ]
 97 |     optimizer = AdamW(optimizer_grouped_parameters, lr=args.learning_rate, eps=args.adam_epsilon)
 98 |     scheduler = WarmupLinearSchedule(optimizer, warmup_steps=args.warmup_steps, t_total=t_total)
 99 |     if args.fp16:
100 |         try:
101 |             from apex import amp
102 |         except ImportError:
103 |             raise ImportError('Please install apex from https://www.github.com/nvidia/apex to use fp16 training.')
104 |         model, optimizer = amp.initialize(model, optimizer, opt_level=args.fp16_opt_level)
105 | 
106 |     # multi-gpu training (should be after apex fp16 initialization)
107 |     if args.n_gpu > 1:
108 |         model = torch.nn.DataParallel(model)
109 | 
110 |     # Train!
111 |     logger.info('***** Running training *****')
112 |     logger.info('   Num examples = %d', len(train_dataset))
113 |     logger.info('   Num Epochs = %d', args.num_train_epochs)
114 |     logger.info('   Instantaneous batch size per GPU = %d', args.per_gpu_train_batch_size)
115 |     logger.info('   Total train batch size (w. parallel & accumulation) = %d',
116 |                 args.train_batch_size * args.gradient_accumulation_steps)
117 |     logger.info('   Gradient Accumulation steps = %d', args.gradient_accumulation_steps)
118 |     logger.info('   Total optimization steps = %d', t_total)
119 | 
120 |     global_step = 0
121 |     tr_loss, logging_loss = 0.0, 0.0
122 |     model.zero_grad()
123 |     train_iterator = trange(int(args.num_train_epochs), desc='Epoch')
124 |     set_seed(args)  # Added here for reproductibility
125 |     for _ in train_iterator:
126 |         epoch_iterator = tqdm(train_dataloader, desc='Iteration')
127 |         for step, batch in enumerate(epoch_iterator):
128 |             model.train()
129 |             batch = tuple(t.to(args.device) for t in batch)
130 |             inputs = {'input_ids':      batch[0],
131 |                       'attention_mask': batch[1],
132 |                       'token_type_ids': batch[2],
133 |                       'labels':         batch[3],
134 |                       'hand_features':  batch[4]}
135 |             outputs = model(**inputs)
136 |             loss = outputs[0]  # model outputs are always tuple in pytorch_transformers (see doc)
137 | 
138 |             if args.n_gpu > 1:
139 |                 loss = loss.mean()
140 |             if args.gradient_accumulation_steps > 1:
141 |                 loss = loss / args.gradient_accumulation_steps
142 | 
143 |             if args.fp16:
144 |                 with amp.scale_los(loss.optimizer) as scaled_loss:
145 |                     scaled_loss.backward()
146 |                 torch.nn.utils.clip_grad_norm_(amp.master_params(optimizer), args.max_grad_norm)
147 |             else:
148 |                 loss.backward()
149 |                 torch.nn.utils.clip_grad_norm_(model.parameters(), args.max_grad_norm)
150 | 
151 |             tr_loss += loss.item()
152 |             if (step + 1) % args.gradient_accumulation_steps == 0:
153 |                 optimizer.step()
154 |                 scheduler.step()
155 |                 model.zero_grad()
156 |                 global_step += 1
157 | 
158 |                 if args.logging_steps > 0 and global_step % args.logging_steps == 0:
159 |                     # Log metrics
160 |                     if args.evaluate_during_training:
161 |                         result = evaluate(args, model, tokenizer)
162 |                         for key, value in result.items():
163 |                             tb_writer.add_scalar('eval_{}'.format(key), value, global_step)
164 |                     tb_writer.add_scalar('lr', scheduler.get_lr()[0], global_step)
165 |                     tb_writer.add_scalar('loss', (tr_loss-logging_loss)/args.logging_steps, global_step)
166 |                     logging_loss = tr_loss
167 | 
168 |                 if args.save_steps > 0 and global_step % args.save_steps == 0:
169 |                     # Save model checkpoint
170 |                     output_dir = os.path.join(args.output_dir, 'checkpoint-{}'.format(global_step))
171 |                     if not os.path.exists(output_dir):
172 |                         os.makedirs(output_dir)
173 |                     model_to_save = model.module if hasattr(model, 'module') else model
174 |                     model_to_save.save_pretrained(output_dir)
175 |                     torch.save(args, 'training_args.bin')
176 |                     logger.info('Saving model checkpoint to %s', output_dir)
177 | 
178 |             if args.max_steps > 0 and global_step > args.max_steps:
179 |                 epoch_iterator.close()
180 |                 break
181 |         if args.max_steps > 0 and global_step > args.max_steps:
182 |             train_iterator.close()
183 |             break
184 | 
185 |     tb_writer.close()
186 |     return global_step, tr_loss / global_step
187 | 
188 | 
189 | def evaluate(args, model, tokenizer, prefix=''):
190 |     eval_output_dir = args.output_dir
191 | 
192 |     results = {}
193 |     eval_dataset = load_and_cache_examples(args, tokenizer, set_type='dev')
194 | 
195 |     if not os.path.exists(eval_output_dir):
196 |         os.makedirs(eval_output_dir)
197 | 
198 |     args.eval_batch_size = args.per_gpu_eval_batch_size * max(1, args.n_gpu)
199 |     eval_sampler = SequentialSampler(eval_dataset)
200 |     eval_dataloader = DataLoader(eval_dataset, sampler=eval_sampler, batch_size=args.eval_batch_size)
201 | 
202 |     # Eval!
203 |     logger.info('***** Running evaluation {} *****'.format(prefix))
204 |     logger.info('   Num examples = %d', len(eval_dataset))
205 |     logger.info('   Batch size = %d', args.eval_batch_size)
206 |     eval_loss = 0.0
207 |     nb_eval_steps = 0
208 |     preds = None
209 |     out_label_ids = None
210 |     for batch in tqdm(eval_dataloader, desc='Evaluating'):
211 |         model.eval()
212 |         batch = tuple(t.to(args.device) for t in batch)
213 | 
214 |         with torch.no_grad():
215 |             inputs = {'input_ids':      batch[0],
216 |                       'attention_mask': batch[1],
217 |                       'token_type_ids': batch[2],
218 |                       'labels':         batch[3],
219 |                       'hand_features':  batch[4]}
220 |             outputs = model(**inputs)
221 |             tmp_eval_loss, logits = outputs[:2]
222 |             eval_loss += tmp_eval_loss.mean().item()
223 |         nb_eval_steps += 1
224 |         if preds is None:
225 |             preds = logits.detach().cpu().numpy()
226 |             out_label_ids = inputs['labels'].detach().cpu().numpy()
227 |         else:
228 |             preds = np.append(preds, logits.detach().cpu().numpy(), axis=0)
229 |             out_label_ids = np.append(out_label_ids, inputs['labels'].detach().cpu().numpy(), axis=0)
230 | 
231 |     eval_loss = eval_loss / nb_eval_steps
232 |     preds = np.argmax(preds, axis=1)
233 |     result = compute_metrics(preds, out_label_ids)
234 |     results.update(result)
235 | 
236 |     output_eval_file = os.path.join(eval_output_dir, 'eval_results.txt')
237 |     with open(output_eval_file, 'a') as writer:
238 |         for key in sorted(result.keys()):
239 |             logger.info('   %s = %s', key, str(result[key]))
240 |             writer.write('%s = %s\n' % (key, str(result[key])))
241 |         writer.write('='*20 + '\n')
242 | 
243 |     return results
244 | 
245 | 
246 | # def predict(args, model, tokenizer, index):
247 | def predict(args, model, tokenizer):
248 |     test_dataset = load_and_cache_examples(args, tokenizer, set_type='test')
249 | 
250 |     args.test_batch_size = args.per_gpu_test_batch_size * max(1, args.n_gpu)
251 |     test_sampler = SequentialSampler(test_dataset)
252 |     test_dataloader = DataLoader(test_dataset, sampler=test_sampler, batch_size=args.test_batch_size)
253 | 
254 |     # Eval!
255 |     logger.info('***** Running prediction *****')
256 |     logger.info('   Num examples = %d', len(test_dataset))
257 |     logger.info('   Batch size = %d', args.test_batch_size)
258 |     preds = None
259 |     for batch in tqdm(test_dataloader, desc='Testing'):
260 |         model.eval()
261 |         batch = tuple(t.to(args.device) for t in batch)
262 | 
263 |         with torch.no_grad():
264 |             inputs = {'input_ids':      batch[0],
265 |                       'attention_mask': batch[1],
266 |                       'token_type_ids': batch[2],
267 |                       'labels':         batch[3],
268 |                       'hand_features':  batch[4]}
269 |             outputs = model(**inputs)
270 |             tmp_eval_loss, logits = outputs[:2]
271 |         if preds is None:
272 |             preds = logits.detach().cpu().numpy()
273 |         else:
274 |             preds = np.append(preds, logits.detach().cpu().numpy(), axis=0)
275 | 
276 |     preds = np.argmax(preds, axis=1)
277 |     # with open(os.path.join(args.data_dir + str(index), 'result.csv'), 'w') as f:
278 |     with open(os.path.join(args.data_dir, 'result.csv'), 'w') as f:
279 |         f.write('id,label\n')
280 |         for i, pred in enumerate(preds):
281 |             f.write('%d,%d\n' % (i, pred))
282 | 
283 | 
284 | def load_and_cache_examples(args, tokenizer, set_type):
285 |     processor = QPMProcessor()
286 |     # Load data features from cache or dataset file
287 |     cached_features_file = os.path.join(args.data_dir, 'cached_{}_{}_{}_hand_feature'.format(
288 |         set_type,
289 |         list(filter(None, args.model_name_or_path.split('/'))).pop(),
290 |         str(args.max_seq_length)
291 |     ))
292 |     if os.path.exists(cached_features_file):
293 |         logger.info('Loading features from cache file %s', cached_features_file)
294 |         features = torch.load(cached_features_file)
295 |     else:
296 |         logger.info('Creating features from dataset file at %s', args.data_dir)
297 |         label_list = processor.get_labels()
298 |         category_list = processor.get_categories()
299 |         examples = processor.get_examples(args.data_dir, set_type)
300 |         features = convert_examples_to_features(examples, label_list, category_list, args.max_seq_length, tokenizer,
301 |             cls_token_at_end=False,    # xlnet has a cls token at the end
302 |             cls_token=tokenizer.cls_token,
303 |             cls_token_segment_id=0,
304 |             sep_token=tokenizer.sep_token,
305 |             sep_token_extra=False,
306 |             pad_on_left=False,
307 |             pad_token=tokenizer.convert_tokens_to_ids([tokenizer.pad_token])[0],
308 |             pad_token_segment_id=0
309 |             # cls_token_at_end=bool(args.model_type in ['xlnet']),    # xlnet has a cls token at the end
310 |             # cls_token=tokenizer.cls_token,
311 |             # cls_token_segment_id=2 if args.model_type in ['xlnet'] else 0,
312 |             # sep_token=tokenizer.sep_token,
313 |             # sep_token_extra=bool(args.model_type in ['roberta']),   # roberta uses an extra separator b/w pairs of sentences, cf. github.com/pytorch/fairseq/commit/1684e166e3da03f5b600dbb7855cb98ddfcd0805
314 |             # pad_on_left=bool(args.model_type in ['xlnet']),         # pad on the left for xlnet
315 |             # pad_token=tokenizer.convert_tokens_to_ids([tokenizer.pad_token])[0],
316 |             # pad_token_segment_id=4 if args.model_type in ['xlnet'] else 0,
317 |         )
318 |         logger.info("Saving features into cached file %s", cached_features_file)
319 |         torch.save(features, cached_features_file)
320 | 
321 |     # Convert to Tensors and build dataset
322 |     all_input_ids = torch.tensor([f.input_ids for f in features], dtype=torch.long)
323 |     all_input_mask = torch.tensor([f.input_mask for f in features], dtype=torch.long)
324 |     all_segment_ids = torch.tensor([f.segment_ids for f in features], dtype=torch.long)
325 |     all_label_ids = torch.tensor([f.label_id for f in features], dtype=torch.long)
326 |     all_ct_clf_input_ids = torch.tensor([f.category_clf_input_ids for f in features], dtype=torch.long)
327 |     all_ct_clf_input_mask = torch.tensor([f.category_clf_input_mask for f in features], dtype=torch.long)
328 |     all_ct_clf_segment_ids = torch.tensor([f.category_clf_segment_ids for f in features], dtype=torch.long)
329 |     all_category_ids = torch.tensor([f.category_id for f in features], dtype=torch.long)
330 |     all_hand_features = torch.tensor([f.hand_features for f in features], dtype=torch.long)
331 | 
332 |     dataset = TensorDataset(all_input_ids, all_input_mask, all_segment_ids, all_label_ids, all_hand_features,
333 |                             all_ct_clf_input_ids, all_ct_clf_input_mask, all_ct_clf_segment_ids, all_category_ids)
334 |     return dataset
335 | 
336 | 
337 | def main():
338 |     parser = argparse.ArgumentParser()
339 | 
340 |     ## Required parameters
341 |     parser.add_argument('--data_dir', default=None, type=str, required=True,
342 |                         help='The input data dir. Should contain the .csv files for the task.')
343 |     parser.add_argument('--model_name_or_path', default=None, type=str, required=True,
344 |                         help='Path to pretrained model or shortcut name selected in the list.')
345 |     parser.add_argument('--output_dir', default=None, type=str, required=True,
346 |                         help='The output directory where the model predictions and checkpoints will be written.')
347 | 
348 |     ## Other parameters
349 |     parser.add_argument('--config_name', default='', type=str,
350 |                         help='Pretrained config name or path if not the same as model_name.')
351 |     parser.add_argument('--tokenizer_name', default='', type=str,
352 |                         help='Pretrained tokenizer name or path if not the same as model_name.')
353 |     parser.add_argument('--max_seq_length', default='128', type=int,
354 |                         help='The maximum total input sequence length after tokenization. Sequences longer than this '
355 |                              'will be truncated, sequences shorter will be padded.')
356 |     parser.add_argument('--do_train', action='store_true',
357 |                         help='Whether to run training.')
358 |     parser.add_argument('--do_eval', action='store_true',
359 |                         help='Whether to run eval on the dev set.')
360 |     parser.add_argument('--do_predict', action='store_true',
361 |                         help='Whether to run test on the test set.')
362 |     parser.add_argument('--evaluate_during_training', action='store_true',
363 |                         help='Rul evaluation during training at each logging step.')
364 |     parser.add_argument('--do_lower_case', action='store_true',
365 |                         help='Set this flag if you are using an uncased model.')
366 | 
367 |     parser.add_argument('--per_gpu_train_batch_size', default=1, type=int,
368 |                         help='Batch size per GPU/CPU for training.')
369 |     parser.add_argument('--per_gpu_eval_batch_size', default=8, type=int,
370 |                         help='Batch size per GPU/CPU for evaluation.')
371 |     parser.add_argument('--per_gpu_test_batch_size', default=8, type=int,
372 |                         help='Batch size per GPU/CPU for prediction.')
373 |     parser.add_argument('--gradient_accumulation_steps', type=int, default=1,
374 |                         help='Number of updates steps to accumulate before performing a backward/update pass.')
375 |     parser.add_argument('--learning_rate', default=5e-5, type=float,
376 |                         help='The initial learning rate for Adam.')
377 |     parser.add_argument('--weight_decay', default=0.0, type=float,
378 |                         help='Weight decay if we apply some.')
379 |     parser.add_argument('--adam_epsilon', default=1e-8, type=float,
380 |                         help='Epsilon for Adam optimizer.')
381 |     parser.add_argument('--max_grad_norm', default=1.0, type=float,
382 |                         help='Max gradient norm.')
383 |     parser.add_argument('--num_train_epochs', default=4.0, type=float,
384 |                         help='Total number of training epochs to perform.')
385 |     parser.add_argument('--max_steps', default=-1, type=int,
386 |                         help='If > 0: set total number of training steps to perform. Override num_train_epochs.')
387 |     parser.add_argument('--warmup_steps', default=0, type=int,
388 |                         help='Linear warmup over warmup_steps.')
389 | 
390 |     parser.add_argument('--logging_steps', type=int, default=50,
391 |                         help='Log every X updates steps.')
392 |     parser.add_argument('--save_steps', type=int, default=100,
393 |                         help='Save checkpoint every X updates steps.')
394 |     parser.add_argument('--eval_all_checkpoints', action='store_true',
395 |                         help='Evaluate all checkpoints starting with the same prefix as model_name ending and ending with step number.')
396 |     parser.add_argument('--no_cuda', action='store_true',
397 |                         help='Avoid using CUDA when available.')
398 |     parser.add_argument('--overwrite_output_dir', action='store_true',
399 |                         help='Overwrite the content of the output directory.')
400 |     parser.add_argument('--overwrite_cache', action='store_true',
401 |                         help='Overwrite the cached training and evaluation sets.')
402 |     parser.add_argument('--seed', type=int, default=42,
403 |                         help='random seed for initialization')
404 | 
405 |     parser.add_argument('--fp16', action='store_true',
406 |                         help="Whether to use 16-bit (mixed) precision (through NVIDIA apex) instead of 32-bit")
407 |     parser.add_argument('--fp16_opt_level', type=str, default='O1',
408 |                         help="For fp16: Apex AMP optimization level selected in ['O0', 'O1', 'O2', and 'O3']."
409 |                              "See details at https://nvidia.github.io/apex/amp.html")
410 |     args = parser.parse_args()
411 | 
412 |     # Setup CUDA, GPU
413 |     if os.path.exists(args.output_dir) and os.listdir(args.output_dir) and args.do_train and not args.overwrite_output_dir:
414 |         raise ValueError('Output directory ({}) already exists and is not empty. Use --overwrite_output_dir to overcome.')
415 | 
416 |     device = torch.device('cuda' if torch.cuda.is_available() and not args.no_cuda else 'cpu')
417 |     args.n_gpu = torch.cuda.device_count()
418 |     args.device = device
419 | 
420 |     # Setup logging
421 |     logging.basicConfig(format = '%(asctime)s - %(levelname)s - %(name)s - %(message)s',
422 |                         datefmt = '%m/%d/%Y %H:%M:%S',
423 |                         level = logging.INFO)
424 |     logger.warning('Process device: %s, n_gpu: %s, 16-bits training: %s',
425 |                    device, args.n_gpu, args.fp16)
426 | 
427 |     # Set seed
428 |     set_seed(args)
429 |     # Prepare QPM task
430 |     processor = QPMProcessor()
431 |     label_list = processor.get_labels()
432 |     num_labels = len(label_list)
433 | 
434 |     # Load pretrained model and tokenizer
435 |     config_class, model_class, tokenizer_class = BertConfig, FeatureBert, BertTokenizer
436 |     config = config_class.from_pretrained(args.config_name if args.config_name else args.model_name_or_path, num_labels=num_labels)
437 |     tokenizer = tokenizer_class.from_pretrained(args.tokenizer_name if args.tokenizer_name else args.model_name_or_path, do_lower_case=args.do_lower_case)
438 |     model = model_class.from_pretrained(args.model_name_or_path, from_tf=bool('.ckpt' in args.model_name_or_path), config=config)
439 |     model.to(args.device)
440 | 
441 |     logger.info('Trainning/evaluation parameters %s', args)
442 |     parent_data_dir = args.data_dir
443 |     parent_output_dir = args.output_dir
444 | 
445 |     # Trainning
446 |     results_tmp = {}
447 |     if args.do_train:
448 |         # 10-Fold dataset for training.
449 |         # for i in range(0, 10):
450 |         # Reload the pretrained model.
451 |         model = model_class.from_pretrained(args.model_name_or_path,
452 |                                             from_tf=bool('.ckpt' in args.model_name_or_path),
453 |                                             config=config)
454 |         model.to(args.device)
455 | 
456 |         # args.data_dir = parent_data_dir + str(i)
457 |         # args.output_dir = parent_output_dir + str(i)
458 | 
459 |         train_dataset = load_and_cache_examples(args, tokenizer, set_type='train')
460 |         global_step, tr_loss = train(args, train_dataset, model, tokenizer)
461 |         logger.info(" global_step = %s, average loss = %s", global_step, tr_loss)
462 |         # Saving best-practices: if you use defaults names for the model, you can reload it using from_pretrained()
463 |         # Create output directory if needed
464 |         if not os.path.exists(args.output_dir):
465 |             os.makedirs(args.output_dir)
466 | 
467 |         logger.info("Saving model checkpoint to %s", args.output_dir)
468 |         # Save a trained model, configuration and tokenizer using `save_pretrained()`.
469 |         # They can then be reloaded using `from_pretrained()`
470 |         model_to_save = model.module if hasattr(model, 'module') else model  # Take care of distributed/parallel training
471 |         model_to_save.save_pretrained(args.output_dir)
472 |         tokenizer.save_pretrained(args.output_dir)
473 | 
474 |         # Good practice: save your training arguments together with the trained model
475 |         torch.save(args, os.path.join(args.output_dir, 'training_args.bin'))
476 | 
477 |         # Load a trained model and vocabulary that you have fine-tuned
478 |         model = model_class.from_pretrained(args.output_dir)
479 |         tokenizer = tokenizer_class.from_pretrained(args.output_dir, do_lower_case=args.do_lower_case)
480 |         model.to(args.device)
481 | 
482 |         # for reduce the usage of disk, evluate and find the best checkpoint every sub dataset.
483 |         # args.data_dir = parent_data_dir + str(i)
484 |         # args.output_dir = parent_output_dir + str(i)
485 |         tokenizer = tokenizer_class.from_pretrained(args.output_dir, do_lower_case=args.do_lower_case)
486 |         checkpoints = [args.output_dir]
487 |         if args.eval_all_checkpoints:
488 |             checkpoints = list(os.path.dirname(c) for c in sorted(glob.glob(args.output_dir + '/**/' + WEIGHTS_NAME, recursive=True)))
489 |             logging.getLogger("pytorch_transformers.modeling_utils").setLevel(logging.WARN)  # Reduce logging
490 |         logger.info("Evaluate the following checkpoints: %s", checkpoints)
491 |         best_f1 = 0.0
492 |         for checkpoint in checkpoints:
493 |             global_step = checkpoint.split('-')[-1] if len(checkpoints) > 1 else ""
494 |             model = model_class.from_pretrained(checkpoint)
495 |             model.to(args.device)
496 |             result = evaluate(args, model, tokenizer, prefix=global_step)
497 |             if result['f1'] > best_f1:
498 |                 best_f1 = result['f1']
499 |                 # Save the best model checkpoint
500 |                 output_dir = os.path.join(args.output_dir, 'best_checkpoint_fold')
501 |                 if not os.path.exists(output_dir):
502 |                     os.makedirs(output_dir)
503 |                 model_to_save = model.module if hasattr(model, 'module') else model
504 |                 model_to_save.save_pretrained(output_dir)
505 |                 torch.save(args, 'training_args.bin')
506 |                 logger.info('Saving model checkpoint to %s', output_dir)
507 | 
508 |             result = dict((k + '_{}'.format(global_step), v) for k, v in result.items())
509 |             results_tmp.update(result)
510 |         checkpoints.remove(args.output_dir)
511 |         for checkpoint in checkpoints:
512 |             shutil.rmtree(checkpoint)
513 | 
514 |     # Evaluation
515 |     results = {}
516 |     if args.do_eval:
517 |         for i in range(10):
518 |             args.data_dir = parent_data_dir + str(i)
519 |             args.output_dir = parent_output_dir + str(i)
520 |             tokenizer = tokenizer_class.from_pretrained(args.output_dir, do_lower_case=args.do_lower_case)
521 |             checkpoints = [args.output_dir]
522 |             if args.eval_all_checkpoints:
523 |                 checkpoints = list(os.path.dirname(c) for c in sorted(glob.glob(args.output_dir + '/**/' + WEIGHTS_NAME, recursive=True)))
524 |                 logging.getLogger("pytorch_transformers.modeling_utils").setLevel(logging.WARN)  # Reduce logging
525 |             logger.info("Evaluate the following checkpoints: %s", checkpoints)
526 |             best_f1 = 0.0
527 |             for checkpoint in checkpoints:
528 |                 global_step = checkpoint.split('-')[-1] if len(checkpoints) > 1 else ""
529 |                 model = model_class.from_pretrained(checkpoint)
530 |                 model.to(args.device)
531 |                 result = evaluate(args, model, tokenizer, prefix=global_step)
532 |                 if result['f1'] > best_f1:
533 |                     best_f1 = result['f1']
534 |                     # Save the best model checkpoint
535 |                     output_dir = os.path.join(args.output_dir, 'best_checkpoint_fold' + str(i))
536 |                     if not os.path.exists(output_dir):
537 |                         os.makedirs(output_dir)
538 |                     model_to_save = model.module if hasattr(model, 'module') else model
539 |                     model_to_save.save_pretrained(output_dir)
540 |                     torch.save(args, 'training_args.bin')
541 |                     logger.info('Saving model checkpoint to %s', output_dir)
542 | 
543 |                 result = dict((k + '_{}'.format(global_step), v) for k, v in result.items())
544 |                 results.update(result)
545 | 
546 |     # Prediction
547 |     if args.do_predict:
548 |         # for i in range(10):
549 |         # args.output_dir = parent_output_dir + str(i)
550 |         args.output_dir = parent_output_dir
551 |         tokenizer = tokenizer_class.from_pretrained(args.output_dir, do_lower_case=args.do_lower_case)
552 |         logging.getLogger("pytorch_transformers.modeling_utils").setLevel(logging.WARN)  # Reduce logging
553 |         # checkpoint = args.output_dir + '/best_checkpoint_fold' + str(i)
554 |         checkpoint = args.output_dir + '/best_checkpoint_fold'
555 |         model = model_class.from_pretrained(checkpoint)
556 |         model.to(args.device)
557 |         # predict(args, model, tokenizer, i)
558 |         predict(args, model, tokenizer)
559 | 
560 |         # For bagging.
561 |         all = pd.read_csv('./data/sample_submission.csv')
562 |         for i in range(10):
563 |             df = pd.read_csv(args.data_dir + str(i) + '/result.csv')
564 |             all['label'] += df['label']
565 |         all['label'] = all['label'] // 6
566 |         all.to_csv('./data/result.csv', index=False)
567 | 
568 | 
569 | if __name__ == '__main__':
570 |     main()
571 | 


--------------------------------------------------------------------------------
/model_final.py:
--------------------------------------------------------------------------------
  1 | # -*— coding: utf-8 -*-
  2 | """ Finetuning the library models for chip2019 question pairs matching. """
  3 | 
  4 | import argparse
  5 | import glob
  6 | import logging
  7 | import os
  8 | import random
  9 | import shutil
 10 | 
 11 | import numpy as np
 12 | import pandas as pd
 13 | import torch
 14 | import torch.nn as nn
 15 | from torch.nn import CrossEntropyLoss, MSELoss
 16 | from torch.utils.data import (DataLoader, RandomSampler, SequentialSampler, TensorDataset)
 17 | from tensorboardX import SummaryWriter
 18 | from tqdm import tqdm, trange
 19 | from pytorch_transformers import (WEIGHTS_NAME, BertConfig, BertModel, BertTokenizer)
 20 | from pytorch_transformers import AdamW, WarmupLinearSchedule
 21 | from pytorch_transformers.modeling_bert import BertPreTrainedModel
 22 | from data_utils import (compute_metrics, convert_examples_to_features, QPMProcessor)
 23 | 
 24 | logger = logging.getLogger(__name__)
 25 | 
 26 | 
 27 | class CustomizedBert(BertPreTrainedModel):
 28 |     def __init__(self, config):
 29 |         super(CustomizedBert, self).__init__(config)
 30 |         self.num_labels = 2
 31 |         self.num_categories = 5
 32 |         self.num_features = 10
 33 |         self.bert = BertModel(config)
 34 |         self.dropout = nn.Dropout(config.hidden_dropout_prob)
 35 |         self.classifier1 = nn.Linear(config.hidden_size + self.num_features, self.num_labels)
 36 |         self.classifier2 = nn.Linear(config.hidden_size, self.num_categories)
 37 |         self.features_bn = nn.BatchNorm1d(self.num_features)
 38 |         self.features_dense = nn.Linear(self.num_features, self.num_features)
 39 | 
 40 |         self.init_weights()
 41 | 
 42 |     def forward(self, input_ids, ct_clf_input_ids, token_type_ids=None, attention_mask=None, position_ids=None, labels=None,
 43 |                 ct_clf_token_type_ids=None, ct_clf_attention_mask=None, ct_clf_position_ids=None, categories=None,
 44 |                 head_mask=None, hand_features=None):
 45 |         outputs1 = self.bert(input_ids, position_ids=position_ids, token_type_ids=token_type_ids,
 46 |                             attention_mask=attention_mask, head_mask=head_mask)
 47 |         outputs2 = self.bert(ct_clf_input_ids, position_ids=ct_clf_position_ids, token_type_ids=ct_clf_token_type_ids,
 48 |                              attention_mask=ct_clf_attention_mask, head_mask=head_mask)
 49 |         pooled_output1 = outputs1[1]
 50 |         pooled_output2 = outputs2[1]
 51 | 
 52 |         hand_features = hand_features.float()
 53 |         hand_features = self.features_dense(self.features_bn(hand_features))
 54 |         pooled_output1 = torch.cat((pooled_output1, hand_features), dim=1)
 55 |         pooled_output1 = self.dropout(pooled_output1)
 56 |         pooled_output2 = self.dropout(pooled_output2)
 57 |         logits1 = self.classifier1(pooled_output1)
 58 |         logits2 = self.classifier2(pooled_output2)
 59 | 
 60 |         outputs1 = (logits1,) + outputs1[2:]  # add hidden states and attention if they are here
 61 |         outputs2 = (logits2,) + outputs2[2:]
 62 | 
 63 |         if labels is not None:
 64 |             if self.num_labels == 1:
 65 |                 #  We are doing regression
 66 |                 loss_fct = MSELoss()
 67 |                 loss = loss_fct(logits1.view(-1), labels.view(-1))
 68 |             else:
 69 |                 loss_fct = CrossEntropyLoss()
 70 |                 loss = loss_fct(logits1.view(-1, self.num_labels), labels.view(-1))
 71 |             outputs1 = (loss,) + outputs1
 72 |         if categories is not None:
 73 |             if self.num_categories == 1:
 74 |                 #  We are doing regression
 75 |                 loss_fct = MSELoss()
 76 |                 loss = loss_fct(logits2.view(-1), categories.view(-1))
 77 |             else:
 78 |                 loss_fct = CrossEntropyLoss()
 79 |                 loss = loss_fct(logits2.view(-1, self.num_categories), categories.view(-1))
 80 |             outputs2 = (loss,) + outputs2
 81 | 
 82 |         return outputs1, outputs2  # (loss), logits, (hidden_states), (attentions)
 83 | 
 84 | 
 85 | def set_seed(args):
 86 |     random.seed(args.seed)
 87 |     np.random.seed(args.seed)
 88 |     torch.manual_seed(args.seed)
 89 |     if args.n_gpu > 0:
 90 |         torch.cuda.manual_seed_all(args.seed)
 91 | 
 92 | 
 93 | def train(args, train_dataset, model, tokenizer):
 94 |     """ Train the model. """
 95 |     tb_writer = SummaryWriter()
 96 | 
 97 |     args.train_batch_size = args.per_gpu_train_batch_size * max(1, args.n_gpu)
 98 |     train_sampler = RandomSampler(train_dataset)
 99 |     train_dataloader = DataLoader(train_dataset, sampler=train_sampler, batch_size=args.train_batch_size)
100 | 
101 |     if args.max_steps > 0:
102 |         t_total = args.max_steps
103 |         args.num_train_epochs = args.max_steps // (len(train_dataloader) // args.gradient_accumulation_steps) + 1
104 |     else:
105 |         t_total = len(train_dataloader) // args.gradient_accumulation_steps * args.num_train_epochs
106 | 
107 |     # Prepare optimizer and schedule (linear warmup and decay)
108 |     no_decay = ['bias', 'LayerNorm.weight']
109 |     optimizer_grouped_parameters = [
110 |         {'params': [p for n, p in model.named_parameters() if not any(nd in n for nd in no_decay)], 'weight_decay': args.weight_decay},
111 |         {'params': [p for n, p in model.named_parameters() if any(nd in n for nd in no_decay)], 'weight_decay': 0.0}
112 |     ]
113 |     optimizer = AdamW(optimizer_grouped_parameters, lr=args.learning_rate, eps=args.adam_epsilon)
114 |     scheduler = WarmupLinearSchedule(optimizer, warmup_steps=args.warmup_steps, t_total=t_total)
115 |     if args.fp16:
116 |         try:
117 |             from apex import amp
118 |         except ImportError:
119 |             raise ImportError('Please install apex from https://www.github.com/nvidia/apex to use fp16 training.')
120 |         model, optimizer = amp.initialize(model, optimizer, opt_level=args.fp16_opt_level)
121 | 
122 |     # multi-gpu training (should be after apex fp16 initialization)
123 |     if args.n_gpu > 1:
124 |         model = torch.nn.DataParallel(model)
125 | 
126 |     # Train!
127 |     logger.info('***** Running training *****')
128 |     logger.info('   Num examples = %d', len(train_dataset))
129 |     logger.info('   Num Epochs = %d', args.num_train_epochs)
130 |     logger.info('   Instantaneous batch size per GPU = %d', args.per_gpu_train_batch_size)
131 |     logger.info('   Total train batch size (w. parallel & accumulation) = %d',
132 |                 args.train_batch_size * args.gradient_accumulation_steps)
133 |     logger.info('   Gradient Accumulation steps = %d', args.gradient_accumulation_steps)
134 |     logger.info('   Total optimization steps = %d', t_total)
135 | 
136 |     global_step = 0
137 |     tr_loss, logging_loss = 0.0, 0.0
138 |     model.zero_grad()
139 |     train_iterator = trange(int(args.num_train_epochs), desc='Epoch')
140 |     set_seed(args)  # Added here for reproductibility
141 |     for _ in train_iterator:
142 |         epoch_iterator = tqdm(train_dataloader, desc='Iteration')
143 |         for step, batch in enumerate(epoch_iterator):
144 |             model.train()
145 |             batch = tuple(t.to(args.device) for t in batch)
146 |             inputs = {'input_ids':              batch[0],
147 |                       'attention_mask':         batch[1],
148 |                       'token_type_ids':         batch[2],
149 |                       'labels':                 batch[3],
150 |                       'ct_clf_input_ids':       batch[4],
151 |                       'ct_clf_attention_mask':  batch[5],
152 |                       'ct_clf_token_type_ids':  batch[6],
153 |                       'categories':             batch[7],
154 |                       'hand_features':          batch[8]}
155 |             outputs = model(**inputs)
156 |             loss, clf_loss = outputs[0][0], outputs[1][0]  # model outputs are always tuple in pytorch_transformers (see doc)
157 | 
158 |             total_loss = loss + clf_loss
159 |             if args.n_gpu > 1:
160 |                 total_loss = total_loss.mean()
161 |             if args.gradient_accumulation_steps > 1:
162 |                 total_loss = total_loss / args.gradient_accumulation_steps
163 | 
164 |             if args.fp16:
165 |                 with amp.scale_los(total_loss.optimizer) as scaled_loss:
166 |                     scaled_loss.backward()
167 |                 torch.nn.utils.clip_grad_norm_(amp.master_params(optimizer), args.max_grad_norm)
168 |             else:
169 |                 total_loss.backward()
170 |                 torch.nn.utils.clip_grad_norm_(model.parameters(), args.max_grad_norm)
171 | 
172 |             tr_loss += total_loss.item()
173 |             if (step + 1) % args.gradient_accumulation_steps == 0:
174 |                 optimizer.step()
175 |                 scheduler.step()
176 |                 model.zero_grad()
177 |                 global_step += 1
178 | 
179 |                 if args.logging_steps > 0 and global_step % args.logging_steps == 0:
180 |                     # Log metrics
181 |                     if args.evaluate_during_training:
182 |                         result = evaluate(args, model, tokenizer)
183 |                         for key, value in result.items():
184 |                             tb_writer.add_scalar('eval_{}'.format(key), value, global_step)
185 |                     tb_writer.add_scalar('lr', scheduler.get_lr()[0], global_step)
186 |                     tb_writer.add_scalar('loss', (tr_loss-logging_loss)/args.logging_steps, global_step)
187 |                     logging_loss = tr_loss
188 | 
189 |                 if args.save_steps > 0 and global_step % args.save_steps == 0:
190 |                     # Save model checkpoint
191 |                     output_dir = os.path.join(args.output_dir, 'checkpoint-{}'.format(global_step))
192 |                     if not os.path.exists(output_dir):
193 |                         os.makedirs(output_dir)
194 |                     model_to_save = model.module if hasattr(model, 'module') else model
195 |                     model_to_save.save_pretrained(output_dir)
196 |                     torch.save(args, 'training_args.bin')
197 |                     logger.info('Saving model checkpoint to %s', output_dir)
198 | 
199 |             if args.max_steps > 0 and global_step > args.max_steps:
200 |                 epoch_iterator.close()
201 |                 break
202 |         if args.max_steps > 0 and global_step > args.max_steps:
203 |             train_iterator.close()
204 |             break
205 | 
206 |     tb_writer.close()
207 |     return global_step, tr_loss / global_step
208 | 
209 | 
210 | def evaluate(args, model, tokenizer, prefix=''):
211 |     eval_output_dir = args.output_dir
212 | 
213 |     results = {}
214 |     eval_dataset = load_and_cache_examples(args, tokenizer, set_type='dev')
215 | 
216 |     if not os.path.exists(eval_output_dir):
217 |         os.makedirs(eval_output_dir)
218 | 
219 |     args.eval_batch_size = args.per_gpu_eval_batch_size * max(1, args.n_gpu)
220 |     eval_sampler = SequentialSampler(eval_dataset)
221 |     eval_dataloader = DataLoader(eval_dataset, sampler=eval_sampler, batch_size=args.eval_batch_size)
222 | 
223 |     # Eval!
224 |     logger.info('***** Running evaluation {} *****'.format(prefix))
225 |     logger.info('   Num examples = %d', len(eval_dataset))
226 |     logger.info('   Batch size = %d', args.eval_batch_size)
227 |     eval_loss = 0.0
228 |     nb_eval_steps = 0
229 |     preds = None
230 |     out_label_ids = None
231 |     for batch in tqdm(eval_dataloader, desc='Evaluating'):
232 |         model.eval()
233 |         batch = tuple(t.to(args.device) for t in batch)
234 | 
235 |         with torch.no_grad():
236 |             inputs = {'input_ids':              batch[0],
237 |                       'attention_mask':         batch[1],
238 |                       'token_type_ids':         batch[2],
239 |                       'labels':                 batch[3],
240 |                       'ct_clf_input_ids':       batch[4],
241 |                       'ct_clf_attention_mask':  batch[5],
242 |                       'ct_clf_token_type_ids':  batch[6],
243 |                       'categories':             batch[7],
244 |                       'hand_features':          batch[8]}
245 |             outputs = model(**inputs)
246 |             tmp_eval_loss, logits = outputs[0][:2]
247 |             eval_loss += tmp_eval_loss.mean().item()
248 |         nb_eval_steps += 1
249 |         if preds is None:
250 |             preds = logits.detach().cpu().numpy()
251 |             out_label_ids = inputs['labels'].detach().cpu().numpy()
252 |         else:
253 |             preds = np.append(preds, logits.detach().cpu().numpy(), axis=0)
254 |             out_label_ids = np.append(out_label_ids, inputs['labels'].detach().cpu().numpy(), axis=0)
255 | 
256 |     eval_loss = eval_loss / nb_eval_steps
257 |     preds = np.argmax(preds, axis=1)
258 |     result = compute_metrics(preds, out_label_ids)
259 |     results.update(result)
260 | 
261 |     output_eval_file = os.path.join(eval_output_dir, 'eval_results.txt')
262 |     with open(output_eval_file, 'a') as writer:
263 |         for key in sorted(result.keys()):
264 |             logger.info('   %s = %s', key, str(result[key]))
265 |             writer.write('%s = %s\n' % (key, str(result[key])))
266 |         writer.write('='*20 + '\n')
267 | 
268 |     return results
269 | 
270 | 
271 | def predict(args, model, tokenizer, index):
272 |     test_dataset = load_and_cache_examples(args, tokenizer, set_type='test')
273 | 
274 |     args.test_batch_size = args.per_gpu_test_batch_size * max(1, args.n_gpu)
275 |     test_sampler = SequentialSampler(test_dataset)
276 |     test_dataloader = DataLoader(test_dataset, sampler=test_sampler, batch_size=args.test_batch_size)
277 | 
278 |     # Eval!
279 |     logger.info('***** Running prediction *****')
280 |     logger.info('   Num examples = %d', len(test_dataset))
281 |     logger.info('   Batch size = %d', args.test_batch_size)
282 |     preds = None
283 |     for batch in tqdm(test_dataloader, desc='Testing'):
284 |         model.eval()
285 |         batch = tuple(t.to(args.device) for t in batch)
286 | 
287 |         with torch.no_grad():
288 |             inputs = {'input_ids':              batch[0],
289 |                       'attention_mask':         batch[1],
290 |                       'token_type_ids':         batch[2],
291 |                       'labels':                 batch[3],
292 |                       'ct_clf_input_ids':       batch[4],
293 |                       'ct_clf_attention_mask':  batch[5],
294 |                       'ct_clf_token_type_ids':  batch[6],
295 |                       'categories':             batch[7],
296 |                       'hand_features':          batch[8]}
297 |             outputs = model(**inputs)
298 |             tmp_eval_loss, logits = outputs[0][:2]
299 |         if preds is None:
300 |             preds = logits.detach().cpu().numpy()
301 |         else:
302 |             preds = np.append(preds, logits.detach().cpu().numpy(), axis=0)
303 | 
304 |     preds = np.argmax(preds, axis=1)
305 |     with open(os.path.join(args.data_dir + str(index), 'result.csv'), 'w') as f:
306 |         f.write('id,label\n')
307 |         for i, pred in enumerate(preds):
308 |             f.write('%d,%d\n' % (i, pred))
309 | 
310 | 
311 | def load_and_cache_examples(args, tokenizer, set_type):
312 |     processor = QPMProcessor()
313 |     # Load data features from cache or dataset file
314 |     cached_features_file = os.path.join(args.data_dir, 'cached_{}_{}_{}_customized'.format(
315 |         set_type,
316 |         list(filter(None, args.model_name_or_path.split('/'))).pop(),
317 |         str(args.max_seq_length)
318 |     ))
319 |     if os.path.exists(cached_features_file):
320 |         logger.info('Loading features from cache file %s', cached_features_file)
321 |         features = torch.load(cached_features_file)
322 |     else:
323 |         logger.info('Creating features from dataset file at %s', args.data_dir)
324 |         label_list = processor.get_labels()
325 |         category_list = processor.get_categories()
326 |         examples = processor.get_examples(args.data_dir, set_type)
327 |         features = convert_examples_to_features(examples, label_list, category_list, args.max_seq_length, tokenizer,
328 |             cls_token_at_end=False,    # xlnet has a cls token at the end
329 |             cls_token=tokenizer.cls_token,
330 |             cls_token_segment_id=0,
331 |             sep_token=tokenizer.sep_token,
332 |             sep_token_extra=False,
333 |             pad_on_left=False,
334 |             pad_token=tokenizer.convert_tokens_to_ids([tokenizer.pad_token])[0],
335 |             pad_token_segment_id=0
336 |             # cls_token_at_end=bool(args.model_type in ['xlnet']),    # xlnet has a cls token at the end
337 |             # cls_token=tokenizer.cls_token,
338 |             # cls_token_segment_id=2 if args.model_type in ['xlnet'] else 0,
339 |             # sep_token=tokenizer.sep_token,
340 |             # sep_token_extra=bool(args.model_type in ['roberta']),   # roberta uses an extra separator b/w pairs of sentences, cf. github.com/pytorch/fairseq/commit/1684e166e3da03f5b600dbb7855cb98ddfcd0805
341 |             # pad_on_left=bool(args.model_type in ['xlnet']),         # pad on the left for xlnet
342 |             # pad_token=tokenizer.convert_tokens_to_ids([tokenizer.pad_token])[0],
343 |             # pad_token_segment_id=4 if args.model_type in ['xlnet'] else 0,
344 |         )
345 |         logger.info("Saving features into cached file %s", cached_features_file)
346 |         torch.save(features, cached_features_file)
347 | 
348 |     # Convert to Tensors and build dataset
349 |     all_input_ids = torch.tensor([f.input_ids for f in features], dtype=torch.long)
350 |     all_input_mask = torch.tensor([f.input_mask for f in features], dtype=torch.long)
351 |     all_segment_ids = torch.tensor([f.segment_ids for f in features], dtype=torch.long)
352 |     all_label_ids = torch.tensor([f.label_id for f in features], dtype=torch.long)
353 |     all_ct_clf_input_ids = torch.tensor([f.category_clf_input_ids for f in features], dtype=torch.long)
354 |     all_ct_clf_input_mask = torch.tensor([f.category_clf_input_mask for f in features], dtype=torch.long)
355 |     all_ct_clf_segment_ids = torch.tensor([f.category_clf_segment_ids for f in features], dtype=torch.long)
356 |     all_category_ids = torch.tensor([f.category_id for f in features], dtype=torch.long)
357 |     all_hand_features = torch.tensor([f.hand_features for f in features], dtype=torch.long)
358 | 
359 |     dataset = TensorDataset(all_input_ids, all_input_mask, all_segment_ids, all_label_ids,
360 |                             all_ct_clf_input_ids, all_ct_clf_input_mask, all_ct_clf_segment_ids, all_category_ids,
361 |                             all_hand_features)
362 |     return dataset
363 | 
364 | 
365 | def main():
366 |     parser = argparse.ArgumentParser()
367 | 
368 |     ## Required parameters
369 |     parser.add_argument('--data_dir', default=None, type=str, required=True,
370 |                         help='The input data dir. Should contain the .csv files for the task.')
371 |     parser.add_argument('--model_name_or_path', default=None, type=str, required=True,
372 |                         help='Path to pretrained model or shortcut name selected in the list.')
373 |     parser.add_argument('--output_dir', default=None, type=str, required=True,
374 |                         help='The output directory where the model predictions and checkpoints will be written.')
375 | 
376 |     ## Other parameters
377 |     parser.add_argument('--config_name', default='', type=str,
378 |                         help='Pretrained config name or path if not the same as model_name.')
379 |     parser.add_argument('--tokenizer_name', default='', type=str,
380 |                         help='Pretrained tokenizer name or path if not the same as model_name.')
381 |     parser.add_argument('--max_seq_length', default='128', type=int,
382 |                         help='The maximum total input sequence length after tokenization. Sequences longer than this '
383 |                              'will be truncated, sequences shorter will be padded.')
384 |     parser.add_argument('--do_train', action='store_true',
385 |                         help='Whether to run training.')
386 |     parser.add_argument('--do_eval', action='store_true',
387 |                         help='Whether to run eval on the dev set.')
388 |     parser.add_argument('--do_predict', action='store_true',
389 |                         help='Whether to run test on the test set.')
390 |     parser.add_argument('--evaluate_during_training', action='store_true',
391 |                         help='Rul evaluation during training at each logging step.')
392 |     parser.add_argument('--do_lower_case', action='store_true',
393 |                         help='Set this flag if you are using an uncased model.')
394 | 
395 |     parser.add_argument('--num_features', default=10, type=int,
396 |                         help='Number of hand-crafted features.')
397 |     parser.add_argument('--per_gpu_train_batch_size', default=8, type=int,
398 |                         help='Batch size per GPU/CPU for training.')
399 |     parser.add_argument('--per_gpu_eval_batch_size', default=8, type=int,
400 |                         help='Batch size per GPU/CPU for evaluation.')
401 |     parser.add_argument('--per_gpu_test_batch_size', default=8, type=int,
402 |                         help='Batch size per GPU/CPU for prediction.')
403 |     parser.add_argument('--gradient_accumulation_steps', type=int, default=1,
404 |                         help='Number of updates steps to accumulate before performing a backward/update pass.')
405 |     parser.add_argument('--learning_rate', default=5e-5, type=float,
406 |                         help='The initial learning rate for Adam.')
407 |     parser.add_argument('--weight_decay', default=0.0, type=float,
408 |                         help='Weight decay if we apply some.')
409 |     parser.add_argument('--adam_epsilon', default=1e-8, type=float,
410 |                         help='Epsilon for Adam optimizer.')
411 |     parser.add_argument('--max_grad_norm', default=1.0, type=float,
412 |                         help='Max gradient norm.')
413 |     parser.add_argument('--num_train_epochs', default=3.0, type=float,
414 |                         help='Total number of training epochs to perform.')
415 |     parser.add_argument('--max_steps', default=-1, type=int,
416 |                         help='If > 0: set total number of training steps to perform. Override num_train_epochs.')
417 |     parser.add_argument('--warmup_steps', default=0, type=int,
418 |                         help='Linear warmup over warmup_steps.')
419 | 
420 |     parser.add_argument('--logging_steps', type=int, default=50,
421 |                         help='Log every X updates steps.')
422 |     parser.add_argument('--save_steps', type=int, default=200,
423 |                         help='Save checkpoint every X updates steps.')
424 |     parser.add_argument('--eval_all_checkpoints', action='store_true',
425 |                         help='Evaluate all checkpoints starting with the same prefix as model_name ending and ending with step number.')
426 |     parser.add_argument('--no_cuda', action='store_true',
427 |                         help='Avoid using CUDA when available.')
428 |     parser.add_argument('--overwrite_output_dir', action='store_true',
429 |                         help='Overwrite the content of the output directory.')
430 |     parser.add_argument('--overwrite_cache', action='store_true',
431 |                         help='Overwrite the cached training and evaluation sets.')
432 |     parser.add_argument('--seed', type=int, default=42,
433 |                         help='random seed for initialization')
434 | 
435 |     parser.add_argument('--fp16', action='store_true',
436 |                         help="Whether to use 16-bit (mixed) precision (through NVIDIA apex) instead of 32-bit")
437 |     parser.add_argument('--fp16_opt_level', type=str, default='O1',
438 |                         help="For fp16: Apex AMP optimization level selected in ['O0', 'O1', 'O2', and 'O3']."
439 |                              "See details at https://nvidia.github.io/apex/amp.html")
440 |     args = parser.parse_args()
441 | 
442 |     # Setup CUDA, GPU
443 |     if os.path.exists(args.output_dir) and os.listdir(args.output_dir) and args.do_train and not args.overwrite_output_dir:
444 |         raise ValueError('Output directory ({}) already exists and is not empty. Use --overwrite_output_dir to overcome.')
445 | 
446 |     device = torch.device('cuda' if torch.cuda.is_available() and not args.no_cuda else 'cpu')
447 |     args.n_gpu = torch.cuda.device_count()
448 |     args.device = device
449 | 
450 |     # Setup logging
451 |     logging.basicConfig(format = '%(asctime)s - %(levelname)s - %(name)s - %(message)s',
452 |                         datefmt = '%m/%d/%Y %H:%M:%S',
453 |                         level = logging.INFO)
454 |     logger.warning('Process device: %s, n_gpu: %s, 16-bits training: %s',
455 |                    device, args.n_gpu, args.fp16)
456 | 
457 |     # Set seed
458 |     set_seed(args)
459 |     # Prepare QPM task
460 |     processor = QPMProcessor()
461 |     label_list = processor.get_labels()
462 |     num_labels = len(label_list)
463 |     category_list = processor.get_categories()
464 |     clf_num_labels = len(category_list)
465 | 
466 |     # Load pretrained model and tokenizer
467 |     config_class, model_class, tokenizer_class = BertConfig, CustomizedBert, BertTokenizer
468 |     config = config_class.from_pretrained(args.config_name if args.config_name else args.model_name_or_path, num_labels=num_labels)
469 |     tokenizer = tokenizer_class.from_pretrained(args.tokenizer_name if args.tokenizer_name else args.model_name_or_path, do_lower_case=args.do_lower_case)
470 |     model = model_class.from_pretrained(args.model_name_or_path, from_tf=bool('.ckpt' in args.model_name_or_path), config=config)
471 |     model.to(args.device)
472 | 
473 |     logger.info('Trainning/evaluation parameters %s', args)
474 |     parent_data_dir = args.data_dir
475 |     parent_output_dir = args.output_dir
476 | 
477 |     # Trainning
478 |     results_tmp = {}
479 |     if args.do_train:
480 |         # 10-Fold dataset for training.
481 |         for i in range(0, 10):
482 |             # Reload the pretrained model.
483 |             model = model_class.from_pretrained(args.model_name_or_path,
484 |                                                 from_tf=bool('.ckpt' in args.model_name_or_path),
485 |                                                 config=config)
486 |             model.to(args.device)
487 | 
488 |             args.data_dir = parent_data_dir + str(i)
489 |             args.output_dir = parent_output_dir + str(i)
490 | 
491 |             train_dataset = load_and_cache_examples(args, tokenizer, set_type='train')
492 |             global_step, tr_loss = train(args, train_dataset, model, tokenizer)
493 |             logger.info(" global_step = %s, average loss = %s", global_step, tr_loss)
494 |             # Saving best-practices: if you use defaults names for the model, you can reload it using from_pretrained()
495 |             # Create output directory if needed
496 |             if not os.path.exists(args.output_dir):
497 |                 os.makedirs(args.output_dir)
498 | 
499 |             logger.info("Saving model checkpoint to %s", args.output_dir)
500 |             # Save a trained model, configuration and tokenizer using `save_pretrained()`.
501 |             # They can then be reloaded using `from_pretrained()`
502 |             model_to_save = model.module if hasattr(model, 'module') else model  # Take care of distributed/parallel training
503 |             model_to_save.save_pretrained(args.output_dir)
504 |             tokenizer.save_pretrained(args.output_dir)
505 | 
506 |             # Good practice: save your training arguments together with the trained model
507 |             torch.save(args, os.path.join(args.output_dir, 'training_args.bin'))
508 | 
509 |             # Load a trained model and vocabulary that you have fine-tuned
510 |             model = model_class.from_pretrained(args.output_dir)
511 |             tokenizer = tokenizer_class.from_pretrained(args.output_dir, do_lower_case=args.do_lower_case)
512 |             model.to(args.device)
513 | 
514 |             # for reduce the usage of disk, evluate and find the best checkpoint every sub dataset.
515 |             args.data_dir = parent_data_dir + str(i)
516 |             args.output_dir = parent_output_dir + str(i)
517 |             checkpoints = [args.output_dir]
518 |             if args.eval_all_checkpoints:
519 |                 checkpoints = list(os.path.dirname(c) for c in sorted(glob.glob(args.output_dir + '/**/' + WEIGHTS_NAME, recursive=True)))
520 |                 logging.getLogger("pytorch_transformers.modeling_utils").setLevel(logging.WARN)  # Reduce logging
521 |             logger.info("Evaluate the following checkpoints: %s", checkpoints)
522 |             best_f1 = 0.0
523 |             for checkpoint in checkpoints:
524 |                 global_step = checkpoint.split('-')[-1] if len(checkpoints) > 1 else ""
525 |                 model = model_class.from_pretrained(checkpoint)
526 |                 model.to(args.device)
527 |                 result = evaluate(args, model, tokenizer, prefix=global_step)
528 |                 if result['f1'] > best_f1:
529 |                     best_f1 = result['f1']
530 |                     # Save the best model checkpoint
531 |                     output_dir = os.path.join(args.output_dir, 'best_checkpoint_fold' + str(i))
532 |                     if not os.path.exists(output_dir):
533 |                         os.makedirs(output_dir)
534 |                     model_to_save = model.module if hasattr(model, 'module') else model
535 |                     model_to_save.save_pretrained(output_dir)
536 |                     torch.save(args, 'training_args.bin')
537 |                     logger.info('Saving model checkpoint to %s', output_dir)
538 | 
539 |                 result = dict((k + '_{}'.format(global_step), v) for k, v in result.items())
540 |                 results_tmp.update(result)
541 |             checkpoints.remove(args.output_dir)
542 |             for checkpoint in checkpoints:
543 |                 shutil.rmtree(checkpoint)
544 | 
545 |     # Evaluation
546 |     results = {}
547 |     if args.do_eval:
548 |         for i in range(10):
549 |             args.data_dir = parent_data_dir + str(i)
550 |             args.output_dir = parent_output_dir + str(i)
551 |             tokenizer = tokenizer_class.from_pretrained(args.output_dir, do_lower_case=args.do_lower_case)
552 |             checkpoints = [args.output_dir]
553 |             if args.eval_all_checkpoints:
554 |                 checkpoints = list(os.path.dirname(c) for c in sorted(glob.glob(args.output_dir + '/**/' + WEIGHTS_NAME, recursive=True)))
555 |                 logging.getLogger("pytorch_transformers.modeling_utils").setLevel(logging.WARN)  # Reduce logging
556 |             logger.info("Evaluate the following checkpoints: %s", checkpoints)
557 |             best_f1 = 0.0
558 |             for checkpoint in checkpoints:
559 |                 global_step = checkpoint.split('-')[-1] if len(checkpoints) > 1 else ""
560 |                 model = model_class.from_pretrained(checkpoint)
561 |                 model.to(args.device)
562 |                 result = evaluate(args, model, tokenizer, prefix=global_step)
563 |                 if result['f1'] > best_f1:
564 |                     best_f1 = result['f1']
565 |                     # Save the best model checkpoint
566 |                     output_dir = os.path.join(args.output_dir, 'best_checkpoint_fold' + str(i))
567 |                     if not os.path.exists(output_dir):
568 |                         os.makedirs(output_dir)
569 |                     model_to_save = model.module if hasattr(model, 'module') else model
570 |                     model_to_save.save_pretrained(output_dir)
571 |                     torch.save(args, 'training_args.bin')
572 |                     logger.info('Saving model checkpoint to %s', output_dir)
573 | 
574 |                 result = dict((k + '_{}'.format(global_step), v) for k, v in result.items())
575 |                 results.update(result)
576 | 
577 |     # Prediction
578 |     if args.do_predict:
579 |         for i in range(10):
580 |             args.data_dir = parent_data_dir + str(i)
581 |             args.output_dir = parent_output_dir + str(i)
582 |             tokenizer = tokenizer_class.from_pretrained(args.output_dir, do_lower_case=args.do_lower_case)
583 |             logging.getLogger("pytorch_transformers.modeling_utils").setLevel(logging.WARN)  # Reduce logging
584 |             checkpoint = args.output_dir + '/best_checkpoint_fold' + str(i)
585 |             model = model_class.from_pretrained(checkpoint)
586 |             model.to(args.device)
587 |             predict(args, model, tokenizer, i)
588 | 
589 |         # For bagging.
590 |         all = pd.read_csv('./data/sample_submission.csv')
591 |         for i in range(10):
592 |             df = pd.read_csv(args.data_dir + str(i) + '/result.csv')
593 |             all['label'] += df['label']
594 |         all['label'] = all['label'] // 6
595 |         all.to_csv('./data/result.csv', index=False)
596 | 
597 | 
598 | if __name__ == '__main__':
599 |     main()
600 | 


--------------------------------------------------------------------------------
/model_final_2.py:
--------------------------------------------------------------------------------
  1 | # -*— coding: utf-8 -*-
  2 | """ Finetuning the library models for chip2019 question pairs matching. """
  3 | 
  4 | import argparse
  5 | import glob
  6 | import logging
  7 | import os
  8 | import random
  9 | import shutil
 10 | 
 11 | import numpy as np
 12 | import pandas as pd
 13 | import torch
 14 | import torch.nn as nn
 15 | from torch.nn import CrossEntropyLoss, MSELoss
 16 | from torch.utils.data import (DataLoader, RandomSampler, SequentialSampler, TensorDataset)
 17 | from tensorboardX import SummaryWriter
 18 | from tqdm import tqdm, trange
 19 | from pytorch_transformers import (WEIGHTS_NAME, BertConfig, BertModel, BertTokenizer)
 20 | from pytorch_transformers import AdamW, WarmupLinearSchedule
 21 | from pytorch_transformers.modeling_bert import BertPreTrainedModel
 22 | from data_utils import (compute_metrics, convert_examples_to_features, QPMProcessor)
 23 | 
 24 | logger = logging.getLogger(__name__)
 25 | 
 26 | 
 27 | class CustomizedBert(BertPreTrainedModel):
 28 |     def __init__(self, config):
 29 |         super(CustomizedBert, self).__init__(config)
 30 |         self.num_labels = 2
 31 |         self.num_categories = 5
 32 |         self.num_features = 10
 33 |         self.bert = BertModel(config)
 34 |         self.dropout = nn.Dropout(config.hidden_dropout_prob)
 35 |         self.classifier1 = nn.Linear(config.hidden_size + self.num_features, self.num_labels)
 36 |         self.classifier2 = nn.Linear(config.hidden_size, self.num_categories)
 37 |         self.features_bn = nn.BatchNorm1d(self.num_features)
 38 |         self.features_dense = nn.Linear(self.num_features, self.num_features)
 39 | 
 40 |         self.init_weights()
 41 | 
 42 |     def forward(self, input_ids, ct_clf_input_ids, token_type_ids=None, attention_mask=None, position_ids=None, labels=None,
 43 |                 ct_clf_token_type_ids=None, ct_clf_attention_mask=None, ct_clf_position_ids=None, categories=None,
 44 |                 head_mask=None, hand_features=None):
 45 |         outputs1 = self.bert(input_ids, position_ids=position_ids, token_type_ids=token_type_ids,
 46 |                             attention_mask=attention_mask, head_mask=head_mask)
 47 |         outputs2 = self.bert(ct_clf_input_ids, position_ids=ct_clf_position_ids, token_type_ids=ct_clf_token_type_ids,
 48 |                              attention_mask=ct_clf_attention_mask, head_mask=head_mask)
 49 |         pooled_output1 = outputs1[1]
 50 |         pooled_output2 = outputs2[1]
 51 | 
 52 |         hand_features = hand_features.float()
 53 |         hand_features = self.features_dense(self.features_bn(hand_features))
 54 |         pooled_output1 = torch.cat((pooled_output1, hand_features), dim=1)
 55 |         pooled_output1 = self.dropout(pooled_output1)
 56 |         pooled_output2 = self.dropout(pooled_output2)
 57 |         logits1 = self.classifier1(pooled_output1)
 58 |         logits2 = self.classifier2(pooled_output2)
 59 | 
 60 |         outputs1 = (logits1,) + outputs1[2:]  # add hidden states and attention if they are here
 61 |         outputs2 = (logits2,) + outputs2[2:]
 62 | 
 63 |         if labels is not None:
 64 |             if self.num_labels == 1:
 65 |                 #  We are doing regression
 66 |                 loss_fct = MSELoss()
 67 |                 loss = loss_fct(logits1.view(-1), labels.view(-1))
 68 |             else:
 69 |                 loss_fct = CrossEntropyLoss()
 70 |                 loss = loss_fct(logits1.view(-1, self.num_labels), labels.view(-1))
 71 |             outputs1 = (loss,) + outputs1
 72 |         if categories is not None:
 73 |             if self.num_categories == 1:
 74 |                 #  We are doing regression
 75 |                 loss_fct = MSELoss()
 76 |                 loss = loss_fct(logits2.view(-1), categories.view(-1))
 77 |             else:
 78 |                 loss_fct = CrossEntropyLoss()
 79 |                 loss = loss_fct(logits2.view(-1, self.num_categories), categories.view(-1))
 80 |             outputs2 = (loss,) + outputs2
 81 | 
 82 |         return outputs1, outputs2  # (loss), logits, (hidden_states), (attentions)
 83 | 
 84 | 
 85 | def set_seed(args):
 86 |     random.seed(args.seed)
 87 |     np.random.seed(args.seed)
 88 |     torch.manual_seed(args.seed)
 89 |     if args.n_gpu > 0:
 90 |         torch.cuda.manual_seed_all(args.seed)
 91 | 
 92 | 
 93 | def train(args, train_dataset, model, tokenizer):
 94 |     """ Train the model. """
 95 |     tb_writer = SummaryWriter()
 96 | 
 97 |     args.train_batch_size = args.per_gpu_train_batch_size * max(1, args.n_gpu)
 98 |     train_sampler = RandomSampler(train_dataset)
 99 |     train_dataloader = DataLoader(train_dataset, sampler=train_sampler, batch_size=args.train_batch_size)
100 | 
101 |     if args.max_steps > 0:
102 |         t_total = args.max_steps
103 |         args.num_train_epochs = args.max_steps // (len(train_dataloader) // args.gradient_accumulation_steps) + 1
104 |     else:
105 |         t_total = len(train_dataloader) // args.gradient_accumulation_steps * args.num_train_epochs
106 | 
107 |     # Prepare optimizer and schedule (linear warmup and decay)
108 |     no_decay = ['bias', 'LayerNorm.weight']
109 |     optimizer_grouped_parameters = [
110 |         {'params': [p for n, p in model.named_parameters() if not any(nd in n for nd in no_decay)], 'weight_decay': args.weight_decay},
111 |         {'params': [p for n, p in model.named_parameters() if any(nd in n for nd in no_decay)], 'weight_decay': 0.0}
112 |     ]
113 |     optimizer = AdamW(optimizer_grouped_parameters, lr=args.learning_rate, eps=args.adam_epsilon)
114 |     scheduler = WarmupLinearSchedule(optimizer, warmup_steps=args.warmup_steps, t_total=t_total)
115 |     if args.fp16:
116 |         try:
117 |             from apex import amp
118 |         except ImportError:
119 |             raise ImportError('Please install apex from https://www.github.com/nvidia/apex to use fp16 training.')
120 |         model, optimizer = amp.initialize(model, optimizer, opt_level=args.fp16_opt_level)
121 | 
122 |     # multi-gpu training (should be after apex fp16 initialization)
123 |     if args.n_gpu > 1:
124 |         model = torch.nn.DataParallel(model)
125 | 
126 |     # Train!
127 |     logger.info('***** Running training *****')
128 |     logger.info('   Num examples = %d', len(train_dataset))
129 |     logger.info('   Num Epochs = %d', args.num_train_epochs)
130 |     logger.info('   Instantaneous batch size per GPU = %d', args.per_gpu_train_batch_size)
131 |     logger.info('   Total train batch size (w. parallel & accumulation) = %d',
132 |                 args.train_batch_size * args.gradient_accumulation_steps)
133 |     logger.info('   Gradient Accumulation steps = %d', args.gradient_accumulation_steps)
134 |     logger.info('   Total optimization steps = %d', t_total)
135 | 
136 |     global_step = 0
137 |     tr_loss, logging_loss = 0.0, 0.0
138 |     model.zero_grad()
139 |     train_iterator = trange(int(args.num_train_epochs), desc='Epoch')
140 |     set_seed(args)  # Added here for reproductibility
141 | 
142 |     max_val_acc = 0
143 |     max_val_f1 = 0
144 | 
145 |     for _ in train_iterator:
146 |         for step, batch in enumerate(train_dataloader):
147 |             model.train()
148 |             batch = tuple(t.to(args.device) for t in batch)
149 |             inputs = {'input_ids':              batch[0],
150 |                       'attention_mask':         batch[1],
151 |                       'token_type_ids':         batch[2],
152 |                       'labels':                 batch[3],
153 |                       'ct_clf_input_ids':       batch[4],
154 |                       'ct_clf_attention_mask':  batch[5],
155 |                       'ct_clf_token_type_ids':  batch[6],
156 |                       'categories':             batch[7],
157 |                       'hand_features':          batch[8]}
158 |             outputs = model(**inputs)
159 |             loss, clf_loss = outputs[0][0], outputs[1][0]  # model outputs are always tuple in pytorch_transformers (see doc)
160 | 
161 |             total_loss = loss + clf_loss
162 |             if args.n_gpu > 1:
163 |                 total_loss = total_loss.mean()
164 |             if args.gradient_accumulation_steps > 1:
165 |                 total_loss = total_loss / args.gradient_accumulation_steps
166 | 
167 |             if args.fp16:
168 |                 with amp.scale_los(total_loss.optimizer) as scaled_loss:
169 |                     scaled_loss.backward()
170 |                 torch.nn.utils.clip_grad_norm_(amp.master_params(optimizer), args.max_grad_norm)
171 |             else:
172 |                 total_loss.backward()
173 |                 torch.nn.utils.clip_grad_norm_(model.parameters(), args.max_grad_norm)
174 | 
175 |             tr_loss += total_loss.item()
176 |             if (step + 1) % args.gradient_accumulation_steps == 0:
177 |                 optimizer.step()
178 |                 scheduler.step()
179 |                 model.zero_grad()
180 |                 global_step += 1
181 | 
182 |                 if args.logging_steps > 0 and global_step % args.logging_steps == 0:
183 |                     # Log metrics
184 |                     if args.evaluate_during_training:
185 |                         result = evaluate(args, model, tokenizer)
186 |                         for key, value in result.items():
187 |                             tb_writer.add_scalar('eval_{}'.format(key), value, global_step)
188 |                         if result['acc'] > max_val_acc:
189 |                             max_val_acc = result['acc']
190 |                         if result['f1'] > max_val_f1:
191 |                             max_val_f1 = result['f1']
192 |                             output_dir = os.path.join(args.output_dir, 'best_checkpoint')
193 |                             if not os.path.exists(output_dir):
194 |                                 os.makedirs(output_dir)
195 |                             model_to_save = model.module if hasattr(model, 'module') else model
196 |                             model_to_save.save_pretrained(output_dir)
197 |                             torch.save(args, 'training_args.bin')
198 |                             logger.info('Saving model checkpoint with f1 {:.4f}'.format(max_val_f1))
199 |                     tb_writer.add_scalar('lr', scheduler.get_lr()[0], global_step)
200 |                     tb_writer.add_scalar('loss', (tr_loss-logging_loss)/args.logging_steps, global_step)
201 |                     logging_loss = tr_loss
202 | 
203 |         if args.max_steps > 0 and global_step > args.max_steps:
204 |             train_iterator.close()
205 |             break
206 | 
207 |     tb_writer.close()
208 |     return global_step, tr_loss / global_step
209 | 
210 | 
211 | def evaluate(args, model, tokenizer, prefix=''):
212 |     eval_output_dir = args.output_dir
213 | 
214 |     results = {}
215 |     eval_dataset = load_and_cache_examples(args, tokenizer, set_type='dev')
216 | 
217 |     if not os.path.exists(eval_output_dir):
218 |         os.makedirs(eval_output_dir)
219 | 
220 |     args.eval_batch_size = args.per_gpu_eval_batch_size * max(1, args.n_gpu)
221 |     eval_sampler = SequentialSampler(eval_dataset)
222 |     eval_dataloader = DataLoader(eval_dataset, sampler=eval_sampler, batch_size=args.eval_batch_size)
223 | 
224 |     # Eval!
225 |     logger.info('***** Running evaluation {} *****'.format(prefix))
226 |     logger.info('   Num examples = %d', len(eval_dataset))
227 |     logger.info('   Batch size = %d', args.eval_batch_size)
228 |     eval_loss = 0.0
229 |     nb_eval_steps = 0
230 |     preds = None
231 |     out_label_ids = None
232 |     for batch in eval_dataloader:
233 |         model.eval()
234 |         batch = tuple(t.to(args.device) for t in batch)
235 | 
236 |         with torch.no_grad():
237 |             inputs = {'input_ids':              batch[0],
238 |                       'attention_mask':         batch[1],
239 |                       'token_type_ids':         batch[2],
240 |                       'labels':                 batch[3],
241 |                       'ct_clf_input_ids':       batch[4],
242 |                       'ct_clf_attention_mask':  batch[5],
243 |                       'ct_clf_token_type_ids':  batch[6],
244 |                       'categories':             batch[7],
245 |                       'hand_features':          batch[8]}
246 |             outputs = model(**inputs)
247 |             tmp_eval_loss, logits = outputs[0][:2]
248 |             eval_loss += tmp_eval_loss.mean().item()
249 |         nb_eval_steps += 1
250 |         if preds is None:
251 |             preds = logits.detach().cpu().numpy()
252 |             out_label_ids = inputs['labels'].detach().cpu().numpy()
253 |         else:
254 |             preds = np.append(preds, logits.detach().cpu().numpy(), axis=0)
255 |             out_label_ids = np.append(out_label_ids, inputs['labels'].detach().cpu().numpy(), axis=0)
256 | 
257 |     eval_loss = eval_loss / nb_eval_steps
258 |     preds = np.argmax(preds, axis=1)
259 |     result = compute_metrics(preds, out_label_ids)
260 |     results.update(result)
261 | 
262 |     output_eval_file = os.path.join(eval_output_dir, 'eval_results.txt')
263 |     with open(output_eval_file, 'a') as writer:
264 |         for key in sorted(result.keys()):
265 |             logger.info('   %s = %s', key, str(result[key]))
266 |             writer.write('%s = %s\n' % (key, str(result[key])))
267 |         writer.write('='*20 + '\n')
268 | 
269 |     return results
270 | 
271 | 
272 | def predict(args, model, tokenizer, index):
273 |     test_dataset = load_and_cache_examples(args, tokenizer, set_type='test')
274 | 
275 |     args.test_batch_size = args.per_gpu_test_batch_size * max(1, args.n_gpu)
276 |     test_sampler = SequentialSampler(test_dataset)
277 |     test_dataloader = DataLoader(test_dataset, sampler=test_sampler, batch_size=args.test_batch_size)
278 | 
279 |     # Eval!
280 |     logger.info('***** Running prediction *****')
281 |     logger.info('   Num examples = %d', len(test_dataset))
282 |     logger.info('   Batch size = %d', args.test_batch_size)
283 |     preds = None
284 |     for batch in tqdm(test_dataloader, desc='Testing'):
285 |         model.eval()
286 |         batch = tuple(t.to(args.device) for t in batch)
287 | 
288 |         with torch.no_grad():
289 |             inputs = {'input_ids':              batch[0],
290 |                       'attention_mask':         batch[1],
291 |                       'token_type_ids':         batch[2],
292 |                       'labels':                 batch[3],
293 |                       'ct_clf_input_ids':       batch[4],
294 |                       'ct_clf_attention_mask':  batch[5],
295 |                       'ct_clf_token_type_ids':  batch[6],
296 |                       'categories':             batch[7],
297 |                       'hand_features':          batch[8]}
298 |             outputs = model(**inputs)
299 |             tmp_eval_loss, logits = outputs[0][:2]
300 |         if preds is None:
301 |             preds = logits.detach().cpu().numpy()
302 |         else:
303 |             preds = np.append(preds, logits.detach().cpu().numpy(), axis=0)
304 | 
305 |     preds = np.argmax(preds, axis=1)
306 |     with open(os.path.join(args.data_dir, 'result.csv'), 'w') as f:
307 |         f.write('id,label\n')
308 |         for i, pred in enumerate(preds):
309 |             f.write('%d,%d\n' % (i, pred))
310 | 
311 | 
312 | def load_and_cache_examples(args, tokenizer, set_type):
313 |     processor = QPMProcessor()
314 |     # Load data features from cache or dataset file
315 |     cached_features_file = os.path.join(args.data_dir, 'cached_{}_{}_{}_customized'.format(
316 |         set_type,
317 |         list(filter(None, args.model_name_or_path.split('/'))).pop(),
318 |         str(args.max_seq_length)
319 |     ))
320 |     if os.path.exists(cached_features_file):
321 |         logger.info('Loading features from cache file %s', cached_features_file)
322 |         features = torch.load(cached_features_file)
323 |     else:
324 |         logger.info('Creating features from dataset file at %s', args.data_dir)
325 |         label_list = processor.get_labels()
326 |         category_list = processor.get_categories()
327 |         examples = processor.get_examples(args.data_dir, set_type)
328 |         features = convert_examples_to_features(examples, label_list, category_list, args.max_seq_length, tokenizer,
329 |             cls_token_at_end=False,    # xlnet has a cls token at the end
330 |             cls_token=tokenizer.cls_token,
331 |             cls_token_segment_id=0,
332 |             sep_token=tokenizer.sep_token,
333 |             sep_token_extra=False,
334 |             pad_on_left=False,
335 |             pad_token=tokenizer.convert_tokens_to_ids([tokenizer.pad_token])[0],
336 |             pad_token_segment_id=0
337 |             # cls_token_at_end=bool(args.model_type in ['xlnet']),    # xlnet has a cls token at the end
338 |             # cls_token=tokenizer.cls_token,
339 |             # cls_token_segment_id=2 if args.model_type in ['xlnet'] else 0,
340 |             # sep_token=tokenizer.sep_token,
341 |             # sep_token_extra=bool(args.model_type in ['roberta']),   # roberta uses an extra separator b/w pairs of sentences, cf. github.com/pytorch/fairseq/commit/1684e166e3da03f5b600dbb7855cb98ddfcd0805
342 |             # pad_on_left=bool(args.model_type in ['xlnet']),         # pad on the left for xlnet
343 |             # pad_token=tokenizer.convert_tokens_to_ids([tokenizer.pad_token])[0],
344 |             # pad_token_segment_id=4 if args.model_type in ['xlnet'] else 0,
345 |         )
346 |         logger.info("Saving features into cached file %s", cached_features_file)
347 |         torch.save(features, cached_features_file)
348 | 
349 |     # Convert to Tensors and build dataset
350 |     all_input_ids = torch.tensor([f.input_ids for f in features], dtype=torch.long)
351 |     all_input_mask = torch.tensor([f.input_mask for f in features], dtype=torch.long)
352 |     all_segment_ids = torch.tensor([f.segment_ids for f in features], dtype=torch.long)
353 |     all_label_ids = torch.tensor([f.label_id for f in features], dtype=torch.long)
354 |     all_ct_clf_input_ids = torch.tensor([f.category_clf_input_ids for f in features], dtype=torch.long)
355 |     all_ct_clf_input_mask = torch.tensor([f.category_clf_input_mask for f in features], dtype=torch.long)
356 |     all_ct_clf_segment_ids = torch.tensor([f.category_clf_segment_ids for f in features], dtype=torch.long)
357 |     all_category_ids = torch.tensor([f.category_id for f in features], dtype=torch.long)
358 |     all_hand_features = torch.tensor([f.hand_features for f in features], dtype=torch.long)
359 | 
360 |     dataset = TensorDataset(all_input_ids, all_input_mask, all_segment_ids, all_label_ids,
361 |                             all_ct_clf_input_ids, all_ct_clf_input_mask, all_ct_clf_segment_ids, all_category_ids,
362 |                             all_hand_features)
363 |     return dataset
364 | 
365 | 
366 | def main():
367 |     parser = argparse.ArgumentParser()
368 | 
369 |     ## Required parameters
370 |     parser.add_argument('--data_dir', default=None, type=str, required=True,
371 |                         help='The input data dir. Should contain the .csv files for the task.')
372 |     parser.add_argument('--model_name_or_path', default=None, type=str, required=True,
373 |                         help='Path to pretrained model or shortcut name selected in the list.')
374 |     parser.add_argument('--output_dir', default=None, type=str, required=True,
375 |                         help='The output directory where the model predictions and checkpoints will be written.')
376 | 
377 |     ## Other parameters
378 |     parser.add_argument('--config_name', default='', type=str,
379 |                         help='Pretrained config name or path if not the same as model_name.')
380 |     parser.add_argument('--tokenizer_name', default='', type=str,
381 |                         help='Pretrained tokenizer name or path if not the same as model_name.')
382 |     parser.add_argument('--max_seq_length', default='128', type=int,
383 |                         help='The maximum total input sequence length after tokenization. Sequences longer than this '
384 |                              'will be truncated, sequences shorter will be padded.')
385 |     parser.add_argument('--do_train', action='store_true',
386 |                         help='Whether to run training.')
387 |     parser.add_argument('--do_eval', action='store_true',
388 |                         help='Whether to run eval on the dev set.')
389 |     parser.add_argument('--do_predict', action='store_true',
390 |                         help='Whether to run test on the test set.')
391 |     parser.add_argument('--evaluate_during_training', action='store_true',
392 |                         help='Rul evaluation during training at each logging step.')
393 |     parser.add_argument('--do_lower_case', action='store_true',
394 |                         help='Set this flag if you are using an uncased model.')
395 | 
396 |     parser.add_argument('--num_features', default=10, type=int,
397 |                         help='Number of hand-crafted features.')
398 |     parser.add_argument('--per_gpu_train_batch_size', default=8, type=int,
399 |                         help='Batch size per GPU/CPU for training.')
400 |     parser.add_argument('--per_gpu_eval_batch_size', default=8, type=int,
401 |                         help='Batch size per GPU/CPU for evaluation.')
402 |     parser.add_argument('--per_gpu_test_batch_size', default=8, type=int,
403 |                         help='Batch size per GPU/CPU for prediction.')
404 |     parser.add_argument('--gradient_accumulation_steps', type=int, default=1,
405 |                         help='Number of updates steps to accumulate before performing a backward/update pass.')
406 |     parser.add_argument('--learning_rate', default=5e-5, type=float,
407 |                         help='The initial learning rate for Adam.')
408 |     parser.add_argument('--weight_decay', default=0.0, type=float,
409 |                         help='Weight decay if we apply some.')
410 |     parser.add_argument('--adam_epsilon', default=1e-8, type=float,
411 |                         help='Epsilon for Adam optimizer.')
412 |     parser.add_argument('--max_grad_norm', default=1.0, type=float,
413 |                         help='Max gradient norm.')
414 |     parser.add_argument('--num_train_epochs', default=3.0, type=float,
415 |                         help='Total number of training epochs to perform.')
416 |     parser.add_argument('--max_steps', default=-1, type=int,
417 |                         help='If > 0: set total number of training steps to perform. Override num_train_epochs.')
418 |     parser.add_argument('--warmup_steps', default=0, type=int,
419 |                         help='Linear warmup over warmup_steps.')
420 | 
421 |     parser.add_argument('--logging_steps', type=int, default=100,
422 |                         help='Log every X updates steps.')
423 |     parser.add_argument('--save_steps', type=int, default=100,
424 |                         help='Save checkpoint every X updates steps.')
425 |     parser.add_argument('--eval_all_checkpoints', action='store_true',
426 |                         help='Evaluate all checkpoints starting with the same prefix as model_name ending and ending with step number.')
427 |     parser.add_argument('--no_cuda', action='store_true',
428 |                         help='Avoid using CUDA when available.')
429 |     parser.add_argument('--overwrite_output_dir', action='store_true',
430 |                         help='Overwrite the content of the output directory.')
431 |     parser.add_argument('--overwrite_cache', action='store_true',
432 |                         help='Overwrite the cached training and evaluation sets.')
433 |     parser.add_argument('--seed', type=int, default=42,
434 |                         help='random seed for initialization')
435 | 
436 |     parser.add_argument('--fp16', action='store_true',
437 |                         help="Whether to use 16-bit (mixed) precision (through NVIDIA apex) instead of 32-bit")
438 |     parser.add_argument('--fp16_opt_level', type=str, default='O1',
439 |                         help="For fp16: Apex AMP optimization level selected in ['O0', 'O1', 'O2', and 'O3']."
440 |                              "See details at https://nvidia.github.io/apex/amp.html")
441 |     args = parser.parse_args()
442 | 
443 |     # Setup CUDA, GPU
444 |     if os.path.exists(args.output_dir) and os.listdir(args.output_dir) and args.do_train and not args.overwrite_output_dir:
445 |         raise ValueError('Output directory ({}) already exists and is not empty. Use --overwrite_output_dir to overcome.')
446 | 
447 |     device = torch.device('cuda' if torch.cuda.is_available() and not args.no_cuda else 'cpu')
448 |     args.n_gpu = torch.cuda.device_count()
449 |     args.device = device
450 | 
451 |     # Setup logging
452 |     logging.basicConfig(format = '%(asctime)s - %(levelname)s - %(name)s - %(message)s',
453 |                         datefmt = '%m/%d/%Y %H:%M:%S',
454 |                         level = logging.INFO)
455 |     logger.warning('Process device: %s, n_gpu: %s, 16-bits training: %s',
456 |                    device, args.n_gpu, args.fp16)
457 | 
458 |     # Set seed
459 |     set_seed(args)
460 |     # Prepare QPM task
461 |     processor = QPMProcessor()
462 |     label_list = processor.get_labels()
463 |     num_labels = len(label_list)
464 |     category_list = processor.get_categories()
465 |     clf_num_labels = len(category_list)
466 | 
467 |     # Load pretrained model and tokenizer
468 |     config_class, model_class, tokenizer_class = BertConfig, CustomizedBert, BertTokenizer
469 |     config = config_class.from_pretrained(args.config_name if args.config_name else args.model_name_or_path, num_labels=num_labels)
470 |     tokenizer = tokenizer_class.from_pretrained(args.tokenizer_name if args.tokenizer_name else args.model_name_or_path, do_lower_case=args.do_lower_case)
471 |     model = model_class.from_pretrained(args.model_name_or_path, from_tf=bool('.ckpt' in args.model_name_or_path), config=config)
472 |     model.to(args.device)
473 | 
474 |     logger.info('Trainning/evaluation parameters %s', args)
475 |     parent_data_dir = args.data_dir
476 |     parent_output_dir = args.output_dir
477 | 
478 |     # Trainning
479 |     results_tmp = {}
480 |     if args.do_train:
481 |         # 10-Fold dataset for training.
482 |         for i in range(0, 10):
483 |             # Reload the pretrained model.
484 |             model = model_class.from_pretrained(args.model_name_or_path,
485 |                                                 from_tf=bool('.ckpt' in args.model_name_or_path),
486 |                                                 config=config)
487 |             model.to(args.device)
488 | 
489 |             args.data_dir = parent_data_dir + str(i)
490 |             args.output_dir = parent_output_dir + str(i)
491 | 
492 |             train_dataset = load_and_cache_examples(args, tokenizer, set_type='train')
493 |             global_step, tr_loss = train(args, train_dataset, model, tokenizer)
494 |             logger.info(" global_step = %s, average loss = %s", global_step, tr_loss)
495 |             # Saving best-practices: if you use defaults names for the model, you can reload it using from_pretrained()
496 |             # Create output directory if needed
497 |             if not os.path.exists(args.output_dir):
498 |                 os.makedirs(args.output_dir)
499 | 
500 |             logger.info("Saving model checkpoint to %s", args.output_dir)
501 |             # Save a trained model, configuration and tokenizer using `save_pretrained()`.
502 |             # They can then be reloaded using `from_pretrained()`
503 |             model_to_save = model.module if hasattr(model, 'module') else model  # Take care of distributed/parallel training
504 |             model_to_save.save_pretrained(args.output_dir)
505 |             tokenizer.save_pretrained(args.output_dir)
506 | 
507 |             # Good practice: save your training arguments together with the trained model
508 |             torch.save(args, os.path.join(args.output_dir, 'training_args.bin'))
509 | 
510 |             # Load a trained model and vocabulary that you have fine-tuned
511 |             model = model_class.from_pretrained(args.output_dir)
512 |             tokenizer = tokenizer_class.from_pretrained(args.output_dir, do_lower_case=args.do_lower_case)
513 |             model.to(args.device)
514 | 
515 |     # Evaluation
516 |     results = {}
517 |     if args.do_eval:
518 |         for i in range(10):
519 |             args.data_dir = parent_data_dir + str(i)
520 |             args.output_dir = parent_output_dir + str(i)
521 |             tokenizer = tokenizer_class.from_pretrained(args.output_dir, do_lower_case=args.do_lower_case)
522 |             checkpoints = [args.output_dir]
523 |             if args.eval_all_checkpoints:
524 |                 checkpoints = list(os.path.dirname(c) for c in sorted(glob.glob(args.output_dir + '/**/' + WEIGHTS_NAME, recursive=True)))
525 |                 logging.getLogger("pytorch_transformers.modeling_utils").setLevel(logging.WARN)  # Reduce logging
526 |             logger.info("Evaluate the following checkpoints: %s", checkpoints)
527 |             best_f1 = 0.0
528 |             for checkpoint in checkpoints:
529 |                 global_step = checkpoint.split('-')[-1] if len(checkpoints) > 1 else ""
530 |                 model = model_class.from_pretrained(checkpoint)
531 |                 model.to(args.device)
532 |                 result = evaluate(args, model, tokenizer, prefix=global_step)
533 |                 if result['f1'] > best_f1:
534 |                     best_f1 = result['f1']
535 |                     # Save the best model checkpoint
536 |                     output_dir = os.path.join(args.output_dir, 'best_checkpoint_fold' + str(i))
537 |                     if not os.path.exists(output_dir):
538 |                         os.makedirs(output_dir)
539 |                     model_to_save = model.module if hasattr(model, 'module') else model
540 |                     model_to_save.save_pretrained(output_dir)
541 |                     torch.save(args, 'training_args.bin')
542 |                     logger.info('Saving model checkpoint to %s', output_dir)
543 | 
544 |                 result = dict((k + '_{}'.format(global_step), v) for k, v in result.items())
545 |                 results.update(result)
546 | 
547 |     # Prediction
548 |     if args.do_predict:
549 |         for i in range(10):
550 |             args.data_dir = parent_data_dir + str(i)
551 |             args.output_dir = parent_output_dir + str(i)
552 |             tokenizer = tokenizer_class.from_pretrained(args.output_dir, do_lower_case=args.do_lower_case)
553 |             logging.getLogger("pytorch_transformers.modeling_utils").setLevel(logging.WARN)  # Reduce logging
554 |             checkpoint = args.output_dir + '/best_checkpoint'
555 |             model = model_class.from_pretrained(checkpoint)
556 |             model.to(args.device)
557 |             predict(args, model, tokenizer, i)
558 | 
559 |         # For bagging.
560 |         num_test = 50000
561 |         all = pd.DataFrame({'id': [i for i in range(num_test)], 'label': num_test*[0]})
562 |         for i in range(10):
563 |             args.data_dir = parent_data_dir + str(i)
564 |             df = pd.read_csv(args.data_dir + '/result.csv')
565 |             all['label'] += df['label']
566 |         all['label'] = all['label'] // 6
567 |         all.to_csv('./data/result.csv', index=False)
568 | 
569 | 
570 | if __name__ == '__main__':
571 |     main()
572 | 


--------------------------------------------------------------------------------
/model_multitask.py:
--------------------------------------------------------------------------------
  1 | # -*— coding: utf-8 -*-
  2 | """ Finetuning the library models for chip2019 question pairs matching. """
  3 | 
  4 | import argparse
  5 | import glob
  6 | import logging
  7 | import os
  8 | import random
  9 | import shutil
 10 | 
 11 | import numpy as np
 12 | import pandas as pd
 13 | import torch
 14 | import torch.nn as nn
 15 | from torch.nn import CrossEntropyLoss, MSELoss
 16 | from torch.utils.data import (DataLoader, RandomSampler, SequentialSampler, TensorDataset)
 17 | from tensorboardX import SummaryWriter
 18 | from tqdm import tqdm, trange
 19 | from pytorch_transformers import (WEIGHTS_NAME, BertConfig, BertModel, BertTokenizer)
 20 | from pytorch_transformers import AdamW, WarmupLinearSchedule
 21 | from pytorch_transformers.modeling_bert import BertPreTrainedModel
 22 | from data_utils import (compute_metrics, convert_examples_to_features, QPMProcessor)
 23 | 
 24 | logger = logging.getLogger(__name__)
 25 | 
 26 | 
 27 | class MultiTaskBert(BertPreTrainedModel):
 28 |     def __init__(self, config):
 29 |         super(MultiTaskBert, self).__init__(config)
 30 |         self.num_labels = 3
 31 |         self.num_categories = 5
 32 |         self.bert = BertModel(config)
 33 |         self.dropout = nn.Dropout(config.hidden_dropout_prob)
 34 |         self.classifier1 = nn.Linear(config.hidden_size, self.num_labels)
 35 |         self.classifier2 = nn.Linear(config.hidden_size, self.num_categories)
 36 | 
 37 |         self.init_weights()
 38 | 
 39 |     def forward(self, input_ids, ct_clf_input_ids, token_type_ids=None, attention_mask=None, position_ids=None, labels=None,
 40 |                 ct_clf_token_type_ids=None, ct_clf_attention_mask=None, ct_clf_position_ids=None, categories=None,
 41 |                 head_mask=None):
 42 |         outputs1 = self.bert(input_ids, position_ids=position_ids, token_type_ids=token_type_ids,
 43 |                             attention_mask=attention_mask, head_mask=head_mask)
 44 |         outputs2 = self.bert(ct_clf_input_ids, position_ids=ct_clf_position_ids, token_type_ids=ct_clf_token_type_ids,
 45 |                              attention_mask=ct_clf_attention_mask, head_mask=head_mask)
 46 |         pooled_output1 = outputs1[1]
 47 |         pooled_output2 = outputs2[1]
 48 |         pooled_output1 = self.dropout(pooled_output1)
 49 |         pooled_output2 = self.dropout(pooled_output2)
 50 |         logits1 = self.classifier1(pooled_output1)
 51 |         logits2 = self.classifier2(pooled_output2)
 52 | 
 53 |         outputs1 = (logits1,) + outputs1[2:]  # add hidden states and attention if they are here
 54 |         outputs2 = (logits2,) + outputs2[2:]
 55 | 
 56 |         if labels is not None:
 57 |             if self.num_labels == 1:
 58 |                 #  We are doing regression
 59 |                 loss_fct = MSELoss()
 60 |                 loss = loss_fct(logits1.view(-1), labels.view(-1))
 61 |             else:
 62 |                 loss_fct = CrossEntropyLoss()
 63 |                 loss = loss_fct(logits1.view(-1, self.num_labels), labels.view(-1))
 64 |             outputs1 = (loss,) + outputs1
 65 |         if categories is not None:
 66 |             if self.num_categories == 1:
 67 |                 #  We are doing regression
 68 |                 loss_fct = MSELoss()
 69 |                 loss = loss_fct(logits2.view(-1), categories.view(-1))
 70 |             else:
 71 |                 loss_fct = CrossEntropyLoss()
 72 |                 loss = loss_fct(logits2.view(-1, self.num_categories), categories.view(-1))
 73 |             outputs2 = (loss,) + outputs2
 74 | 
 75 |         return outputs1, outputs2  # (loss), logits, (hidden_states), (attentions)
 76 | 
 77 | 
 78 | def set_seed(args):
 79 |     random.seed(args.seed)
 80 |     np.random.seed(args.seed)
 81 |     torch.manual_seed(args.seed)
 82 |     if args.n_gpu > 0:
 83 |         torch.cuda.manual_seed_all(args.seed)
 84 | 
 85 | 
 86 | def train(args, train_dataset, model, tokenizer):
 87 |     """ Train the model. """
 88 |     tb_writer = SummaryWriter()
 89 | 
 90 |     args.train_batch_size = args.per_gpu_train_batch_size * max(1, args.n_gpu)
 91 |     train_sampler = RandomSampler(train_dataset)
 92 |     train_dataloader = DataLoader(train_dataset, sampler=train_sampler, batch_size=args.train_batch_size)
 93 | 
 94 |     if args.max_steps > 0:
 95 |         t_total = args.max_steps
 96 |         args.num_train_epochs = args.max_steps // (len(train_dataloader) // args.gradient_accumulation_steps) + 1
 97 |     else:
 98 |         t_total = len(train_dataloader) // args.gradient_accumulation_steps * args.num_train_epochs
 99 | 
100 |     # Prepare optimizer and schedule (linear warmup and decay)
101 |     no_decay = ['bias', 'LayerNorm.weight']
102 |     optimizer_grouped_parameters = [
103 |         {'params': [p for n, p in model.named_parameters() if not any(nd in n for nd in no_decay)], 'weight_decay': args.weight_decay},
104 |         {'params': [p for n, p in model.named_parameters() if any(nd in n for nd in no_decay)], 'weight_decay': 0.0}
105 |     ]
106 |     optimizer = AdamW(optimizer_grouped_parameters, lr=args.learning_rate, eps=args.adam_epsilon)
107 |     scheduler = WarmupLinearSchedule(optimizer, warmup_steps=args.warmup_steps, t_total=t_total)
108 |     if args.fp16:
109 |         try:
110 |             from apex import amp
111 |         except ImportError:
112 |             raise ImportError('Please install apex from https://www.github.com/nvidia/apex to use fp16 training.')
113 |         model, optimizer = amp.initialize(model, optimizer, opt_level=args.fp16_opt_level)
114 | 
115 |     # multi-gpu training (should be after apex fp16 initialization)
116 |     if args.n_gpu > 1:
117 |         model = torch.nn.DataParallel(model)
118 | 
119 |     # Train!
120 |     logger.info('***** Running training *****')
121 |     logger.info('   Num examples = %d', len(train_dataset))
122 |     logger.info('   Num Epochs = %d', args.num_train_epochs)
123 |     logger.info('   Instantaneous batch size per GPU = %d', args.per_gpu_train_batch_size)
124 |     logger.info('   Total train batch size (w. parallel & accumulation) = %d',
125 |                 args.train_batch_size * args.gradient_accumulation_steps)
126 |     logger.info('   Gradient Accumulation steps = %d', args.gradient_accumulation_steps)
127 |     logger.info('   Total optimization steps = %d', t_total)
128 | 
129 |     global_step = 0
130 |     tr_loss, logging_loss = 0.0, 0.0
131 |     model.zero_grad()
132 |     train_iterator = trange(int(args.num_train_epochs), desc='Epoch')
133 |     set_seed(args)  # Added here for reproductibility
134 |     for _ in train_iterator:
135 |         epoch_iterator = tqdm(train_dataloader, desc='Iteration')
136 |         for step, batch in enumerate(epoch_iterator):
137 |             model.train()
138 |             batch = tuple(t.to(args.device) for t in batch)
139 |             inputs = {'input_ids':      batch[0],
140 |                       'attention_mask': batch[1],
141 |                       'token_type_ids': batch[2],
142 |                       'labels':         batch[3],
143 |                       'ct_clf_input_ids': batch[4],
144 |                       'ct_clf_attention_mask': batch[5],
145 |                       'ct_clf_token_type_ids': batch[6],
146 |                       'categories': batch[7]}
147 |             outputs = model(**inputs)
148 |             loss, clf_loss = outputs[0][0], outputs[1][0]  # model outputs are always tuple in pytorch_transformers (see doc)
149 | 
150 |             total_loss = loss + clf_loss
151 |             if args.n_gpu > 1:
152 |                 total_loss = total_loss.mean()
153 |             if args.gradient_accumulation_steps > 1:
154 |                 total_loss = total_loss / args.gradient_accumulation_steps
155 | 
156 |             if args.fp16:
157 |                 with amp.scale_los(total_loss.optimizer) as scaled_loss:
158 |                     scaled_loss.backward()
159 |                 torch.nn.utils.clip_grad_norm_(amp.master_params(optimizer), args.max_grad_norm)
160 |             else:
161 |                 total_loss.backward()
162 |                 torch.nn.utils.clip_grad_norm_(model.parameters(), args.max_grad_norm)
163 | 
164 |             tr_loss += total_loss.item()
165 |             if (step + 1) % args.gradient_accumulation_steps == 0:
166 |                 optimizer.step()
167 |                 scheduler.step()
168 |                 model.zero_grad()
169 |                 global_step += 1
170 | 
171 |                 if args.logging_steps > 0 and global_step % args.logging_steps == 0:
172 |                     # Log metrics
173 |                     if args.evaluate_during_training:
174 |                         result = evaluate(args, model, tokenizer)
175 |                         for key, value in result.items():
176 |                             tb_writer.add_scalar('eval_{}'.format(key), value, global_step)
177 |                     tb_writer.add_scalar('lr', scheduler.get_lr()[0], global_step)
178 |                     tb_writer.add_scalar('loss', (tr_loss-logging_loss)/args.logging_steps, global_step)
179 |                     logging_loss = tr_loss
180 | 
181 |                 if args.save_steps > 0 and global_step % args.save_steps == 0:
182 |                     # Save model checkpoint
183 |                     output_dir = os.path.join(args.output_dir, 'checkpoint-{}'.format(global_step))
184 |                     if not os.path.exists(output_dir):
185 |                         os.makedirs(output_dir)
186 |                     model_to_save = model.module if hasattr(model, 'module') else model
187 |                     model_to_save.save_pretrained(output_dir)
188 |                     torch.save(args, 'training_args.bin')
189 |                     logger.info('Saving model checkpoint to %s', output_dir)
190 | 
191 |             if args.max_steps > 0 and global_step > args.max_steps:
192 |                 epoch_iterator.close()
193 |                 break
194 |         if args.max_steps > 0 and global_step > args.max_steps:
195 |             train_iterator.close()
196 |             break
197 | 
198 |     tb_writer.close()
199 |     return global_step, tr_loss / global_step
200 | 
201 | 
202 | def evaluate(args, model, tokenizer, prefix=''):
203 |     eval_output_dir = args.output_dir
204 | 
205 |     results = {}
206 |     eval_dataset = load_and_cache_examples(args, tokenizer, set_type='dev')
207 | 
208 |     if not os.path.exists(eval_output_dir):
209 |         os.makedirs(eval_output_dir)
210 | 
211 |     args.eval_batch_size = args.per_gpu_eval_batch_size * max(1, args.n_gpu)
212 |     eval_sampler = SequentialSampler(eval_dataset)
213 |     eval_dataloader = DataLoader(eval_dataset, sampler=eval_sampler, batch_size=args.eval_batch_size)
214 | 
215 |     # Eval!
216 |     logger.info('***** Running evaluation {} *****'.format(prefix))
217 |     logger.info('   Num examples = %d', len(eval_dataset))
218 |     logger.info('   Batch size = %d', args.eval_batch_size)
219 |     eval_loss = 0.0
220 |     nb_eval_steps = 0
221 |     preds = None
222 |     out_label_ids = None
223 |     for batch in tqdm(eval_dataloader, desc='Evaluating'):
224 |         model.eval()
225 |         batch = tuple(t.to(args.device) for t in batch)
226 | 
227 |         with torch.no_grad():
228 |             inputs = {'input_ids':      batch[0],
229 |                       'attention_mask': batch[1],
230 |                       'token_type_ids': batch[2],
231 |                       'labels':         batch[3],
232 |                       'ct_clf_input_ids': batch[4],
233 |                       'ct_clf_attention_mask': batch[5],
234 |                       'ct_clf_token_type_ids': batch[6],
235 |                       'categories': batch[7]}
236 |             outputs = model(**inputs)
237 |             tmp_eval_loss, logits = outputs[0][:2]
238 |             eval_loss += tmp_eval_loss.mean().item()
239 |         nb_eval_steps += 1
240 |         if preds is None:
241 |             preds = logits.detach().cpu().numpy()
242 |             out_label_ids = inputs['labels'].detach().cpu().numpy()
243 |         else:
244 |             preds = np.append(preds, logits.detach().cpu().numpy(), axis=0)
245 |             out_label_ids = np.append(out_label_ids, inputs['labels'].detach().cpu().numpy(), axis=0)
246 | 
247 |     eval_loss = eval_loss / nb_eval_steps
248 |     preds = np.argmax(preds, axis=1)
249 |     result = compute_metrics(preds, out_label_ids)
250 |     results.update(result)
251 | 
252 |     output_eval_file = os.path.join(eval_output_dir, 'eval_results.txt')
253 |     with open(output_eval_file, 'a') as writer:
254 |         for key in sorted(result.keys()):
255 |             logger.info('   %s = %s', key, str(result[key]))
256 |             writer.write('%s = %s\n' % (key, str(result[key])))
257 |         writer.write('='*20 + '\n')
258 | 
259 |     return results
260 | 
261 | 
262 | def predict(args, model, tokenizer, index):
263 |     test_dataset = load_and_cache_examples(args, tokenizer, set_type='test')
264 | 
265 |     args.test_batch_size = args.per_gpu_test_batch_size * max(1, args.n_gpu)
266 |     test_sampler = SequentialSampler(test_dataset)
267 |     test_dataloader = DataLoader(test_dataset, sampler=test_sampler, batch_size=args.test_batch_size)
268 | 
269 |     # Eval!
270 |     logger.info('***** Running prediction *****')
271 |     logger.info('   Num examples = %d', len(test_dataset))
272 |     logger.info('   Batch size = %d', args.test_batch_size)
273 |     preds = None
274 |     for batch in tqdm(test_dataloader, desc='Testing'):
275 |         model.eval()
276 |         batch = tuple(t.to(args.device) for t in batch)
277 | 
278 |         with torch.no_grad():
279 |             inputs = {'input_ids':      batch[0],
280 |                       'attention_mask': batch[1],
281 |                       'token_type_ids': batch[2],
282 |                       'labels':         batch[3],
283 |                       'ct_clf_input_ids': batch[4],
284 |                       'ct_clf_attention_mask': batch[5],
285 |                       'ct_clf_token_type_ids': batch[6],
286 |                       'categories': batch[7]}
287 |             outputs = model(**inputs)
288 |             tmp_eval_loss, logits = outputs[0][:2]
289 |         if preds is None:
290 |             preds = logits.detach().cpu().numpy()
291 |         else:
292 |             preds = np.append(preds, logits.detach().cpu().numpy(), axis=0)
293 | 
294 |     preds = np.argmax(preds, axis=1)
295 |     with open(os.path.join(args.data_dir + str(index), 'result.csv'), 'w') as f:
296 |         f.write('id,label\n')
297 |         for i, pred in enumerate(preds):
298 |             f.write('%d,%d\n' % (i, pred))
299 | 
300 | 
301 | def load_and_cache_examples(args, tokenizer, set_type):
302 |     processor = QPMProcessor()
303 |     # Load data features from cache or dataset file
304 |     cached_features_file = os.path.join(args.data_dir, 'cached_{}_{}_{}'.format(
305 |         set_type,
306 |         list(filter(None, args.model_name_or_path.split('/'))).pop(),
307 |         str(args.max_seq_length)
308 |     ))
309 |     if os.path.exists(cached_features_file):
310 |         logger.info('Loading features from cache file %s', cached_features_file)
311 |         features = torch.load(cached_features_file)
312 |     else:
313 |         logger.info('Creating features from dataset file at %s', args.data_dir)
314 |         label_list = processor.get_labels()
315 |         category_list = processor.get_categories()
316 |         examples = processor.get_examples(args.data_dir, set_type)
317 |         features = convert_examples_to_features(examples, label_list, category_list, args.max_seq_length, tokenizer,
318 |             cls_token_at_end=False,    # xlnet has a cls token at the end
319 |             cls_token=tokenizer.cls_token,
320 |             cls_token_segment_id=0,
321 |             sep_token=tokenizer.sep_token,
322 |             sep_token_extra=False,
323 |             pad_on_left=False,
324 |             pad_token=tokenizer.convert_tokens_to_ids([tokenizer.pad_token])[0],
325 |             pad_token_segment_id=0
326 |             # cls_token_at_end=bool(args.model_type in ['xlnet']),    # xlnet has a cls token at the end
327 |             # cls_token=tokenizer.cls_token,
328 |             # cls_token_segment_id=2 if args.model_type in ['xlnet'] else 0,
329 |             # sep_token=tokenizer.sep_token,
330 |             # sep_token_extra=bool(args.model_type in ['roberta']),   # roberta uses an extra separator b/w pairs of sentences, cf. github.com/pytorch/fairseq/commit/1684e166e3da03f5b600dbb7855cb98ddfcd0805
331 |             # pad_on_left=bool(args.model_type in ['xlnet']),         # pad on the left for xlnet
332 |             # pad_token=tokenizer.convert_tokens_to_ids([tokenizer.pad_token])[0],
333 |             # pad_token_segment_id=4 if args.model_type in ['xlnet'] else 0,
334 |         )
335 |         logger.info("Saving features into cached file %s", cached_features_file)
336 |         torch.save(features, cached_features_file)
337 | 
338 |     # Convert to Tensors and build dataset
339 |     all_input_ids = torch.tensor([f.input_ids for f in features], dtype=torch.long)
340 |     all_input_mask = torch.tensor([f.input_mask for f in features], dtype=torch.long)
341 |     all_segment_ids = torch.tensor([f.segment_ids for f in features], dtype=torch.long)
342 |     all_label_ids = torch.tensor([f.label_id for f in features], dtype=torch.long)
343 |     all_ct_clf_input_ids = torch.tensor([f.category_clf_input_ids for f in features], dtype=torch.long)
344 |     all_ct_clf_input_mask = torch.tensor([f.category_clf_input_mask for f in features], dtype=torch.long)
345 |     all_ct_clf_segment_ids = torch.tensor([f.category_clf_segment_ids for f in features], dtype=torch.long)
346 |     all_category_ids = torch.tensor([f.category_id for f in features], dtype=torch.long)
347 | 
348 |     dataset = TensorDataset(all_input_ids, all_input_mask, all_segment_ids, all_label_ids,
349 |                             all_ct_clf_input_ids, all_ct_clf_input_mask, all_ct_clf_segment_ids, all_category_ids)
350 |     return dataset
351 | 
352 | 
353 | def main():
354 |     parser = argparse.ArgumentParser()
355 | 
356 |     ## Required parameters
357 |     parser.add_argument('--data_dir', default=None, type=str, required=True,
358 |                         help='The input data dir. Should contain the .csv files for the task.')
359 |     parser.add_argument('--model_name_or_path', default=None, type=str, required=True,
360 |                         help='Path to pretrained model or shortcut name selected in the list.')
361 |     parser.add_argument('--output_dir', default=None, type=str, required=True,
362 |                         help='The output directory where the model predictions and checkpoints will be written.')
363 | 
364 |     ## Other parameters
365 |     parser.add_argument('--config_name', default='', type=str,
366 |                         help='Pretrained config name or path if not the same as model_name.')
367 |     parser.add_argument('--tokenizer_name', default='', type=str,
368 |                         help='Pretrained tokenizer name or path if not the same as model_name.')
369 |     parser.add_argument('--max_seq_length', default='128', type=int,
370 |                         help='The maximum total input sequence length after tokenization. Sequences longer than this '
371 |                              'will be truncated, sequences shorter will be padded.')
372 |     parser.add_argument('--do_train', action='store_true',
373 |                         help='Whether to run training.')
374 |     parser.add_argument('--do_eval', action='store_true',
375 |                         help='Whether to run eval on the dev set.')
376 |     parser.add_argument('--do_predict', action='store_true',
377 |                         help='Whether to run test on the test set.')
378 |     parser.add_argument('--evaluate_during_training', action='store_true',
379 |                         help='Rul evaluation during training at each logging step.')
380 |     parser.add_argument('--do_lower_case', action='store_true',
381 |                         help='Set this flag if you are using an uncased model.')
382 | 
383 |     parser.add_argument('--per_gpu_train_batch_size', default=8, type=int,
384 |                         help='Batch size per GPU/CPU for training.')
385 |     parser.add_argument('--per_gpu_eval_batch_size', default=8, type=int,
386 |                         help='Batch size per GPU/CPU for evaluation.')
387 |     parser.add_argument('--per_gpu_test_batch_size', default=8, type=int,
388 |                         help='Batch size per GPU/CPU for prediction.')
389 |     parser.add_argument('--gradient_accumulation_steps', type=int, default=1,
390 |                         help='Number of updates steps to accumulate before performing a backward/update pass.')
391 |     parser.add_argument('--learning_rate', default=5e-5, type=float,
392 |                         help='The initial learning rate for Adam.')
393 |     parser.add_argument('--weight_decay', default=0.0, type=float,
394 |                         help='Weight decay if we apply some.')
395 |     parser.add_argument('--adam_epsilon', default=1e-8, type=float,
396 |                         help='Epsilon for Adam optimizer.')
397 |     parser.add_argument('--max_grad_norm', default=1.0, type=float,
398 |                         help='Max gradient norm.')
399 |     parser.add_argument('--num_train_epochs', default=3.0, type=float,
400 |                         help='Total number of training epochs to perform.')
401 |     parser.add_argument('--max_steps', default=-1, type=int,
402 |                         help='If > 0: set total number of training steps to perform. Override num_train_epochs.')
403 |     parser.add_argument('--warmup_steps', default=0, type=int,
404 |                         help='Linear warmup over warmup_steps.')
405 | 
406 |     parser.add_argument('--logging_steps', type=int, default=50,
407 |                         help='Log every X updates steps.')
408 |     parser.add_argument('--save_steps', type=int, default=100,
409 |                         help='Save checkpoint every X updates steps.')
410 |     parser.add_argument('--eval_all_checkpoints', action='store_true',
411 |                         help='Evaluate all checkpoints starting with the same prefix as model_name ending and ending with step number.')
412 |     parser.add_argument('--no_cuda', action='store_true',
413 |                         help='Avoid using CUDA when available.')
414 |     parser.add_argument('--overwrite_output_dir', action='store_true',
415 |                         help='Overwrite the content of the output directory.')
416 |     parser.add_argument('--overwrite_cache', action='store_true',
417 |                         help='Overwrite the cached training and evaluation sets.')
418 |     parser.add_argument('--seed', type=int, default=42,
419 |                         help='random seed for initialization')
420 | 
421 |     parser.add_argument('--fp16', action='store_true',
422 |                         help="Whether to use 16-bit (mixed) precision (through NVIDIA apex) instead of 32-bit")
423 |     parser.add_argument('--fp16_opt_level', type=str, default='O1',
424 |                         help="For fp16: Apex AMP optimization level selected in ['O0', 'O1', 'O2', and 'O3']."
425 |                              "See details at https://nvidia.github.io/apex/amp.html")
426 |     args = parser.parse_args()
427 | 
428 |     # Setup CUDA, GPU
429 |     if os.path.exists(args.output_dir) and os.listdir(args.output_dir) and args.do_train and not args.overwrite_output_dir:
430 |         raise ValueError('Output directory ({}) already exists and is not empty. Use --overwrite_output_dir to overcome.')
431 | 
432 |     device = torch.device('cuda' if torch.cuda.is_available() and not args.no_cuda else 'cpu')
433 |     args.n_gpu = torch.cuda.device_count()
434 |     args.device = device
435 | 
436 |     # Setup logging
437 |     logging.basicConfig(format = '%(asctime)s - %(levelname)s - %(name)s - %(message)s',
438 |                         datefmt = '%m/%d/%Y %H:%M:%S',
439 |                         level = logging.INFO)
440 |     logger.warning('Process device: %s, n_gpu: %s, 16-bits training: %s',
441 |                    device, args.n_gpu, args.fp16)
442 | 
443 |     # Set seed
444 |     set_seed(args)
445 |     # Prepare QPM task
446 |     processor = QPMProcessor()
447 |     label_list = processor.get_labels()
448 |     num_labels = len(label_list)
449 |     category_list = processor.get_categories()
450 |     clf_num_labels = len(category_list)
451 | 
452 |     # Load pretrained model and tokenizer
453 |     config_class, model_class, tokenizer_class = BertConfig, MultiTaskBert, BertTokenizer
454 |     config = config_class.from_pretrained(args.config_name if args.config_name else args.model_name_or_path, num_labels=num_labels)
455 |     tokenizer = tokenizer_class.from_pretrained(args.tokenizer_name if args.tokenizer_name else args.model_name_or_path, do_lower_case=args.do_lower_case)
456 |     model = model_class.from_pretrained(args.model_name_or_path, from_tf=bool('.ckpt' in args.model_name_or_path), config=config)
457 |     model.to(args.device)
458 | 
459 |     logger.info('Trainning/evaluation parameters %s', args)
460 |     parent_data_dir = args.data_dir
461 |     parent_output_dir = args.output_dir
462 | 
463 |     # Trainning
464 |     results_tmp = {}
465 |     if args.do_train:
466 |         # 10-Fold dataset for training.
467 |         for i in range(2, 10):
468 |             # Reload the pretrained model.
469 |             model = model_class.from_pretrained(args.model_name_or_path,
470 |                                                 from_tf=bool('.ckpt' in args.model_name_or_path),
471 |                                                 config=config)
472 |             model.to(args.device)
473 | 
474 |             args.data_dir = parent_data_dir + str(i)
475 |             args.output_dir = parent_output_dir + str(i)
476 | 
477 |             train_dataset = load_and_cache_examples(args, tokenizer, set_type='train')
478 |             global_step, tr_loss = train(args, train_dataset, model, tokenizer)
479 |             logger.info(" global_step = %s, average loss = %s", global_step, tr_loss)
480 |             # Saving best-practices: if you use defaults names for the model, you can reload it using from_pretrained()
481 |             # Create output directory if needed
482 |             if not os.path.exists(args.output_dir):
483 |                 os.makedirs(args.output_dir)
484 | 
485 |             logger.info("Saving model checkpoint to %s", args.output_dir)
486 |             # Save a trained model, configuration and tokenizer using `save_pretrained()`.
487 |             # They can then be reloaded using `from_pretrained()`
488 |             model_to_save = model.module if hasattr(model, 'module') else model  # Take care of distributed/parallel training
489 |             model_to_save.save_pretrained(args.output_dir)
490 |             tokenizer.save_pretrained(args.output_dir)
491 | 
492 |             # Good practice: save your training arguments together with the trained model
493 |             torch.save(args, os.path.join(args.output_dir, 'training_args.bin'))
494 | 
495 |             # Load a trained model and vocabulary that you have fine-tuned
496 |             model = model_class.from_pretrained(args.output_dir)
497 |             tokenizer = tokenizer_class.from_pretrained(args.output_dir, do_lower_case=args.do_lower_case)
498 |             model.to(args.device)
499 | 
500 |             # for reduce the usage of disk, evluate and find the best checkpoint every sub dataset.
501 |             args.data_dir = parent_data_dir + str(i)
502 |             args.output_dir = parent_output_dir + str(i)
503 |             tokenizer = tokenizer_class.from_pretrained(args.output_dir, do_lower_case=args.do_lower_case)
504 |             checkpoints = [args.output_dir]
505 |             if args.eval_all_checkpoints:
506 |                 checkpoints = list(os.path.dirname(c) for c in sorted(glob.glob(args.output_dir + '/**/' + WEIGHTS_NAME, recursive=True)))
507 |                 logging.getLogger("pytorch_transformers.modeling_utils").setLevel(logging.WARN)  # Reduce logging
508 |             logger.info("Evaluate the following checkpoints: %s", checkpoints)
509 |             best_f1 = 0.0
510 |             for checkpoint in checkpoints:
511 |                 global_step = checkpoint.split('-')[-1] if len(checkpoints) > 1 else ""
512 |                 model = model_class.from_pretrained(checkpoint)
513 |                 model.to(args.device)
514 |                 result = evaluate(args, model, tokenizer, prefix=global_step)
515 |                 if result['f1'] > best_f1:
516 |                     best_f1 = result['f1']
517 |                     # Save the best model checkpoint
518 |                     output_dir = os.path.join(args.output_dir, 'best_checkpoint_fold' + str(i))
519 |                     if not os.path.exists(output_dir):
520 |                         os.makedirs(output_dir)
521 |                     model_to_save = model.module if hasattr(model, 'module') else model
522 |                     model_to_save.save_pretrained(output_dir)
523 |                     torch.save(args, 'training_args.bin')
524 |                     logger.info('Saving model checkpoint to %s', output_dir)
525 | 
526 |                 result = dict((k + '_{}'.format(global_step), v) for k, v in result.items())
527 |                 results_tmp.update(result)
528 |             checkpoints.remove(args.output_dir)
529 |             for checkpoint in checkpoints:
530 |                 shutil.rmtree(checkpoint)
531 | 
532 |     # Evaluation
533 |     results = {}
534 |     if args.do_eval:
535 |         for i in range(10):
536 |             args.data_dir = parent_data_dir + str(i)
537 |             args.output_dir = parent_output_dir + str(i)
538 |             tokenizer = tokenizer_class.from_pretrained(args.output_dir, do_lower_case=args.do_lower_case)
539 |             checkpoints = [args.output_dir]
540 |             if args.eval_all_checkpoints:
541 |                 checkpoints = list(os.path.dirname(c) for c in sorted(glob.glob(args.output_dir + '/**/' + WEIGHTS_NAME, recursive=True)))
542 |                 logging.getLogger("pytorch_transformers.modeling_utils").setLevel(logging.WARN)  # Reduce logging
543 |             logger.info("Evaluate the following checkpoints: %s", checkpoints)
544 |             best_f1 = 0.0
545 |             for checkpoint in checkpoints:
546 |                 global_step = checkpoint.split('-')[-1] if len(checkpoints) > 1 else ""
547 |                 model = model_class.from_pretrained(checkpoint)
548 |                 model.to(args.device)
549 |                 result = evaluate(args, model, tokenizer, prefix=global_step)
550 |                 if result['f1'] > best_f1:
551 |                     best_f1 = result['f1']
552 |                     # Save the best model checkpoint
553 |                     output_dir = os.path.join(args.output_dir, 'best_checkpoint_fold' + str(i))
554 |                     if not os.path.exists(output_dir):
555 |                         os.makedirs(output_dir)
556 |                     model_to_save = model.module if hasattr(model, 'module') else model
557 |                     model_to_save.save_pretrained(output_dir)
558 |                     torch.save(args, 'training_args.bin')
559 |                     logger.info('Saving model checkpoint to %s', output_dir)
560 | 
561 |                 result = dict((k + '_{}'.format(global_step), v) for k, v in result.items())
562 |                 results.update(result)
563 | 
564 |     # Prediction
565 |     if args.do_predict:
566 |         for i in range(10):
567 |             args.output_dir = parent_output_dir + str(i)
568 |             tokenizer = tokenizer_class.from_pretrained(args.output_dir, do_lower_case=args.do_lower_case)
569 |             logging.getLogger("pytorch_transformers.modeling_utils").setLevel(logging.WARN)  # Reduce logging
570 |             checkpoint = args.output_dir + '/best_checkpoint_fold' + str(i)
571 |             model = model_class.from_pretrained(checkpoint)
572 |             model.to(args.device)
573 |             predict(args, model, tokenizer, i)
574 | 
575 |         # For bagging.
576 |         all = pd.read_csv('./data/sample_submission.csv')
577 |         for i in range(10):
578 |             df = pd.read_csv(args.data_dir + str(i) + '/result.csv')
579 |             all['label'] += df['label']
580 |         all['label'] = all['label'] // 6
581 |         all.to_csv('./data/result.csv', index=False)
582 | 
583 | 
584 | 
585 | if __name__ == '__main__':
586 |     main()
587 | 


--------------------------------------------------------------------------------
/model_qpm.py:
--------------------------------------------------------------------------------
  1 | # -*— coding: utf-8 -*-
  2 | """ Finetuning the library models for chip2019 question pairs matching. """
  3 | 
  4 | import argparse
  5 | import glob
  6 | import logging
  7 | import os
  8 | import random
  9 | import shutil
 10 | 
 11 | import numpy as np
 12 | import pandas as pd
 13 | import torch
 14 | from torch.utils.data import (DataLoader, RandomSampler, SequentialSampler, TensorDataset)
 15 | from tensorboardX import SummaryWriter
 16 | from tqdm import tqdm, trange
 17 | from pytorch_transformers import (WEIGHTS_NAME, BertConfig, BertForSequenceClassification, BertTokenizer)
 18 | from pytorch_transformers import AdamW, WarmupLinearSchedule
 19 | from data_utils import (compute_metrics, convert_examples_to_features, QPMProcessor)
 20 | 
 21 | logger = logging.getLogger(__name__)
 22 | 
 23 | 
 24 | def set_seed(args):
 25 |     random.seed(args.seed)
 26 |     np.random.seed(args.seed)
 27 |     torch.manual_seed(args.seed)
 28 |     if args.n_gpu > 0:
 29 |         torch.cuda.manual_seed_all(args.seed)
 30 | 
 31 | 
 32 | def train(args, train_dataset, model, tokenizer):
 33 |     """ Train the model. """
 34 |     tb_writer = SummaryWriter()
 35 | 
 36 |     args.train_batch_size = args.per_gpu_train_batch_size * max(1, args.n_gpu)
 37 |     train_sampler = RandomSampler(train_dataset)
 38 |     train_dataloader = DataLoader(train_dataset, sampler=train_sampler, batch_size=args.train_batch_size)
 39 | 
 40 |     if args.max_steps > 0:
 41 |         t_total = args.max_steps
 42 |         args.num_train_epochs = args.max_steps // (len(train_dataloader) // args.gradient_accumulation_steps) + 1
 43 |     else:
 44 |         t_total = len(train_dataloader) // args.gradient_accumulation_steps * args.num_train_epochs
 45 | 
 46 |     # Prepare optimizer and schedule (linear warmup and decay)
 47 |     no_decay = ['bias', 'LayerNorm.weight']
 48 |     optimizer_grouped_parameters = [
 49 |         {'params': [p for n, p in model.named_parameters() if not any(nd in n for nd in no_decay)], 'weight_decay': args.weight_decay},
 50 |         {'params': [p for n, p in model.named_parameters() if any(nd in n for nd in no_decay)], 'weight_decay': 0.0}
 51 |     ]
 52 |     optimizer = AdamW(optimizer_grouped_parameters, lr=args.learning_rate, eps=args.adam_epsilon)
 53 |     scheduler = WarmupLinearSchedule(optimizer, warmup_steps=args.warmup_steps, t_total=t_total)
 54 |     if args.fp16:
 55 |         try:
 56 |             from apex import amp
 57 |         except ImportError:
 58 |             raise ImportError('Please install apex from https://www.github.com/nvidia/apex to use fp16 training.')
 59 |         model, optimizer = amp.initialize(model, optimizer, opt_level=args.fp16_opt_level)
 60 | 
 61 |     # multi-gpu training (should be after apex fp16 initialization)
 62 |     if args.n_gpu > 1:
 63 |         model = torch.nn.DataParallel(model)
 64 | 
 65 |     # Train!
 66 |     logger.info('***** Running training *****')
 67 |     logger.info('   Num examples = %d', len(train_dataset))
 68 |     logger.info('   Num Epochs = %d', args.num_train_epochs)
 69 |     logger.info('   Instantaneous batch size per GPU = %d', args.per_gpu_train_batch_size)
 70 |     logger.info('   Total train batch size (w. parallel & accumulation) = %d',
 71 |                 args.train_batch_size * args.gradient_accumulation_steps)
 72 |     logger.info('   Gradient Accumulation steps = %d', args.gradient_accumulation_steps)
 73 |     logger.info('   Total optimization steps = %d', t_total)
 74 | 
 75 |     global_step = 0
 76 |     tr_loss, logging_loss = 0.0, 0.0
 77 |     model.zero_grad()
 78 |     train_iterator = trange(int(args.num_train_epochs), desc='Epoch')
 79 |     set_seed(args)  # Added here for reproductibility
 80 | 
 81 |     max_val_acc = 0
 82 |     max_val_f1 = 0
 83 | 
 84 |     for _ in train_iterator:
 85 |         # epoch_iterator = tqdm(train_dataloader, desc='Iteration')
 86 |         # for step, batch in enumerate(epoch_iterator):
 87 |         for step, batch in enumerate(train_dataloader):
 88 |             model.train()
 89 |             batch = tuple(t.to(args.device) for t in batch)
 90 |             inputs = {'input_ids':      batch[0],
 91 |                       'attention_mask': batch[1],
 92 |                       'token_type_ids': batch[2],
 93 |                       'labels':         batch[3]}
 94 |             outputs = model(**inputs)
 95 |             loss = outputs[0]  # model outputs are always tuple in pytorch_transformers (see doc)
 96 | 
 97 |             if args.n_gpu > 1:
 98 |                 loss = loss.mean()
 99 |             if args.gradient_accumulation_steps > 1:
100 |                 loss = loss / args.gradient_accumulation_steps
101 | 
102 |             if args.fp16:
103 |                 with amp.scale_los(loss.optimizer) as scaled_loss:
104 |                     scaled_loss.backward()
105 |                 torch.nn.utils.clip_grad_norm_(amp.master_params(optimizer), args.max_grad_norm)
106 |             else:
107 |                 loss.backward()
108 |                 torch.nn.utils.clip_grad_norm_(model.parameters(), args.max_grad_norm)
109 | 
110 |             tr_loss += loss.item()
111 |             if (step + 1) % args.gradient_accumulation_steps == 0:
112 |                 optimizer.step()
113 |                 scheduler.step()
114 |                 model.zero_grad()
115 |                 global_step += 1
116 | 
117 |                 if args.logging_steps > 0 and global_step % args.logging_steps == 0:
118 |                     # Log metrics
119 |                     if args.evaluate_during_training:
120 |                         result = evaluate(args, model, tokenizer)
121 |                         for key, value in result.items():
122 |                             tb_writer.add_scalar('eval_{}'.format(key), value, global_step)
123 |                         if result['acc'] > max_val_acc:
124 |                             max_val_acc = result['acc']
125 |                         if result['f1'] > max_val_f1:
126 |                             max_val_f1 = result['f1']
127 |                             output_dir = os.path.join(args.output_dir, 'best_checkpoint')
128 |                             if not os.path.exists(output_dir):
129 |                                 os.makedirs(output_dir)
130 |                             model_to_save = model.module if hasattr(model, 'module') else model
131 |                             model_to_save.save_pretrained(output_dir)
132 |                             torch.save(args, 'training_args.bin')
133 |                             logger.info('Saving model checkpoint with f1 {:.4f}'.format(max_val_f1))
134 |                     tb_writer.add_scalar('lr', scheduler.get_lr()[0], global_step)
135 |                     tb_writer.add_scalar('loss', (tr_loss-logging_loss)/args.logging_steps, global_step)
136 |                     logging_loss = tr_loss
137 | 
138 |                 # if args.save_steps > 0 and global_step % args.save_steps == 0:
139 |                 #     # Save model checkpoint
140 |                 #     output_dir = os.path.join(args.output_dir, 'checkpoint-{}'.format(global_step))
141 |                 #     if not os.path.exists(output_dir):
142 |                 #         os.makedirs(output_dir)
143 |                 #     model_to_save = model.module if hasattr(model, 'module') else model
144 |                 #     model_to_save.save_pretrained(output_dir)
145 |                 #     torch.save(args, 'training_args.bin')
146 |                 #     logger.info('Saving model checkpoint to %s', output_dir)
147 | 
148 |             # if args.max_steps > 0 and global_step > args.max_steps:
149 |             #     epoch_iterator.close()
150 |             #     break
151 |         if args.max_steps > 0 and global_step > args.max_steps:
152 |             train_iterator.close()
153 |             break
154 | 
155 |     tb_writer.close()
156 |     return global_step, tr_loss / global_step
157 | 
158 | 
159 | def evaluate(args, model, tokenizer, prefix=''):
160 |     eval_output_dir = args.output_dir
161 | 
162 |     results = {}
163 |     eval_dataset = load_and_cache_examples(args, tokenizer, set_type='dev')
164 | 
165 |     if not os.path.exists(eval_output_dir):
166 |         os.makedirs(eval_output_dir)
167 | 
168 |     args.eval_batch_size = args.per_gpu_eval_batch_size * max(1, args.n_gpu)
169 |     eval_sampler = SequentialSampler(eval_dataset)
170 |     eval_dataloader = DataLoader(eval_dataset, sampler=eval_sampler, batch_size=args.eval_batch_size)
171 | 
172 |     # Eval!
173 |     logger.info('***** Running evaluation {} *****'.format(prefix))
174 |     logger.info('   Num examples = %d', len(eval_dataset))
175 |     logger.info('   Batch size = %d', args.eval_batch_size)
176 |     eval_loss = 0.0
177 |     nb_eval_steps = 0
178 |     preds = None
179 |     out_label_ids = None
180 |     # for batch in tqdm(eval_dataloader, desc='Evaluating'):
181 |     for batch in eval_dataloader:
182 |         model.eval()
183 |         batch = tuple(t.to(args.device) for t in batch)
184 | 
185 |         with torch.no_grad():
186 |             inputs = {'input_ids':      batch[0],
187 |                       'attention_mask': batch[1],
188 |                       'token_type_ids': batch[2],
189 |                       'labels':         batch[3]}
190 |             outputs = model(**inputs)
191 |             tmp_eval_loss, logits = outputs[:2]
192 |             eval_loss += tmp_eval_loss.mean().item()
193 |         nb_eval_steps += 1
194 |         if preds is None:
195 |             preds = logits.detach().cpu().numpy()
196 |             out_label_ids = inputs['labels'].detach().cpu().numpy()
197 |         else:
198 |             preds = np.append(preds, logits.detach().cpu().numpy(), axis=0)
199 |             out_label_ids = np.append(out_label_ids, inputs['labels'].detach().cpu().numpy(), axis=0)
200 | 
201 |     eval_loss = eval_loss / nb_eval_steps
202 |     preds = np.argmax(preds, axis=1)
203 |     result = compute_metrics(preds, out_label_ids)
204 |     results.update(result)
205 | 
206 |     output_eval_file = os.path.join(eval_output_dir, 'eval_results.txt')
207 |     with open(output_eval_file, 'a') as writer:
208 |         for key in sorted(result.keys()):
209 |             logger.info('   %s = %s', key, str(result[key]))
210 |             writer.write('%s = %s\n' % (key, str(result[key])))
211 |         writer.write('='*20 + '\n')
212 | 
213 |     return results
214 | 
215 | 
216 | def predict(args, model, tokenizer, index):
217 |     test_dataset = load_and_cache_examples(args, tokenizer, set_type='test')
218 | 
219 |     args.test_batch_size = args.per_gpu_test_batch_size * max(1, args.n_gpu)
220 |     test_sampler = SequentialSampler(test_dataset)
221 |     test_dataloader = DataLoader(test_dataset, sampler=test_sampler, batch_size=args.test_batch_size)
222 | 
223 |     # Eval!
224 |     logger.info('***** Running prediction *****')
225 |     logger.info('   Num examples = %d', len(test_dataset))
226 |     logger.info('   Batch size = %d', args.test_batch_size)
227 |     preds = None
228 |     for batch in tqdm(test_dataloader, desc='Testing'):
229 |         model.eval()
230 |         batch = tuple(t.to(args.device) for t in batch)
231 | 
232 |         with torch.no_grad():
233 |             inputs = {'input_ids':      batch[0],
234 |                       'attention_mask': batch[1],
235 |                       'token_type_ids': batch[2],
236 |                       'labels':         batch[3]}
237 |             outputs = model(**inputs)
238 |             tmp_eval_loss, logits = outputs[:2]
239 |         if preds is None:
240 |             preds = logits.detach().cpu().numpy()
241 |         else:
242 |             preds = np.append(preds, logits.detach().cpu().numpy(), axis=0)
243 | 
244 |     preds = np.argmax(preds, axis=1)
245 |     with open(os.path.join(args.data_dir + str(index), 'result.csv'), 'w') as f:
246 |         f.write('id,label\n')
247 |         for i, pred in enumerate(preds):
248 |             f.write('%d,%d\n' % (i, pred))
249 | 
250 | 
251 | def load_and_cache_examples(args, tokenizer, set_type):
252 |     processor = QPMProcessor()
253 |     # Load data features from cache or dataset file
254 |     cached_features_file = os.path.join(args.data_dir, 'cached_{}_{}_{}'.format(
255 |         set_type,
256 |         list(filter(None, args.model_name_or_path.split('/'))).pop(),
257 |         str(args.max_seq_length)
258 |     ))
259 |     if os.path.exists(cached_features_file):
260 |         logger.info('Loading features from cache file %s', cached_features_file)
261 |         features = torch.load(cached_features_file)
262 |     else:
263 |         logger.info('Creating features from dataset file at %s', args.data_dir)
264 |         label_list = processor.get_labels()
265 |         category_list = processor.get_categories()
266 |         examples = processor.get_examples(args.data_dir, set_type)
267 |         features = convert_examples_to_features(examples, label_list, category_list, args.max_seq_length, tokenizer,
268 |             cls_token_at_end=False,    # xlnet has a cls token at the end
269 |             cls_token=tokenizer.cls_token,
270 |             cls_token_segment_id=0,
271 |             sep_token=tokenizer.sep_token,
272 |             sep_token_extra=False,
273 |             pad_on_left=False,
274 |             pad_token=tokenizer.convert_tokens_to_ids([tokenizer.pad_token])[0],
275 |             pad_token_segment_id=0
276 |             # cls_token_at_end=bool(args.model_type in ['xlnet']),    # xlnet has a cls token at the end
277 |             # cls_token=tokenizer.cls_token,
278 |             # cls_token_segment_id=2 if args.model_type in ['xlnet'] else 0,
279 |             # sep_token=tokenizer.sep_token,
280 |             # sep_token_extra=bool(args.model_type in ['roberta']),   # roberta uses an extra separator b/w pairs of sentences, cf. github.com/pytorch/fairseq/commit/1684e166e3da03f5b600dbb7855cb98ddfcd0805
281 |             # pad_on_left=bool(args.model_type in ['xlnet']),         # pad on the left for xlnet
282 |             # pad_token=tokenizer.convert_tokens_to_ids([tokenizer.pad_token])[0],
283 |             # pad_token_segment_id=4 if args.model_type in ['xlnet'] else 0,
284 |         )
285 |         logger.info("Saving features into cached file %s", cached_features_file)
286 |         torch.save(features, cached_features_file)
287 | 
288 |     # Convert to Tensors and build dataset
289 |     all_input_ids = torch.tensor([f.input_ids for f in features], dtype=torch.long)
290 |     all_input_mask = torch.tensor([f.input_mask for f in features], dtype=torch.long)
291 |     all_segment_ids = torch.tensor([f.segment_ids for f in features], dtype=torch.long)
292 |     all_label_ids = torch.tensor([f.label_id for f in features], dtype=torch.long)
293 |     all_ct_clf_input_ids = torch.tensor([f.category_clf_input_ids for f in features], dtype=torch.long)
294 |     all_ct_clf_input_mask = torch.tensor([f.category_clf_input_mask for f in features], dtype=torch.long)
295 |     all_ct_clf_segment_ids = torch.tensor([f.category_clf_segment_ids for f in features], dtype=torch.long)
296 |     all_category_ids = torch.tensor([f.category_id for f in features], dtype=torch.long)
297 | 
298 |     dataset = TensorDataset(all_input_ids, all_input_mask, all_segment_ids, all_label_ids,
299 |                             all_ct_clf_input_ids, all_ct_clf_input_mask, all_ct_clf_segment_ids, all_category_ids)
300 |     return dataset
301 | 
302 | 
303 | def main():
304 |     parser = argparse.ArgumentParser()
305 | 
306 |     ## Required parameters
307 |     parser.add_argument('--data_dir', default=None, type=str, required=True,
308 |                         help='The input data dir. Should contain the .csv files for the task.')
309 |     parser.add_argument('--model_name_or_path', default=None, type=str, required=True,
310 |                         help='Path to pretrained model or shortcut name selected in the list.')
311 |     parser.add_argument('--output_dir', default=None, type=str, required=True,
312 |                         help='The output directory where the model predictions and checkpoints will be written.')
313 | 
314 |     ## Other parameters
315 |     parser.add_argument('--config_name', default='', type=str,
316 |                         help='Pretrained config name or path if not the same as model_name.')
317 |     parser.add_argument('--tokenizer_name', default='', type=str,
318 |                         help='Pretrained tokenizer name or path if not the same as model_name.')
319 |     parser.add_argument('--max_seq_length', default='128', type=int,
320 |                         help='The maximum total input sequence length after tokenization. Sequences longer than this '
321 |                              'will be truncated, sequences shorter will be padded.')
322 |     parser.add_argument('--do_train', action='store_true',
323 |                         help='Whether to run training.')
324 |     parser.add_argument('--do_eval', action='store_true',
325 |                         help='Whether to run eval on the dev set.')
326 |     parser.add_argument('--do_predict', action='store_true',
327 |                         help='Whether to run test on the test set.')
328 |     parser.add_argument('--evaluate_during_training', action='store_true',
329 |                         help='Rul evaluation during training at each logging step.')
330 |     parser.add_argument('--do_lower_case', action='store_true',
331 |                         help='Set this flag if you are using an uncased model.')
332 | 
333 |     parser.add_argument('--per_gpu_train_batch_size', default=1, type=int,
334 |                         help='Batch size per GPU/CPU for training.')
335 |     parser.add_argument('--per_gpu_eval_batch_size', default=8, type=int,
336 |                         help='Batch size per GPU/CPU for evaluation.')
337 |     parser.add_argument('--per_gpu_test_batch_size', default=8, type=int,
338 |                         help='Batch size per GPU/CPU for prediction.')
339 |     parser.add_argument('--gradient_accumulation_steps', type=int, default=1,
340 |                         help='Number of updates steps to accumulate before performing a backward/update pass.')
341 |     parser.add_argument('--learning_rate', default=5e-5, type=float,
342 |                         help='The initial learning rate for Adam.')
343 |     parser.add_argument('--weight_decay', default=0.0, type=float,
344 |                         help='Weight decay if we apply some.')
345 |     parser.add_argument('--adam_epsilon', default=1e-8, type=float,
346 |                         help='Epsilon for Adam optimizer.')
347 |     parser.add_argument('--max_grad_norm', default=1.0, type=float,
348 |                         help='Max gradient norm.')
349 |     parser.add_argument('--num_train_epochs', default=4.0, type=float,
350 |                         help='Total number of training epochs to perform.')
351 |     parser.add_argument('--max_steps', default=-1, type=int,
352 |                         help='If > 0: set total number of training steps to perform. Override num_train_epochs.')
353 |     parser.add_argument('--warmup_steps', default=0, type=int,
354 |                         help='Linear warmup over warmup_steps.')
355 | 
356 |     parser.add_argument('--logging_steps', type=int, default=50,
357 |                         help='Log every X updates steps.')
358 |     parser.add_argument('--save_steps', type=int, default=100,
359 |                         help='Save checkpoint every X updates steps.')
360 |     parser.add_argument('--eval_all_checkpoints', action='store_true',
361 |                         help='Evaluate all checkpoints starting with the same prefix as model_name ending and ending with step number.')
362 |     parser.add_argument('--no_cuda', action='store_true',
363 |                         help='Avoid using CUDA when available.')
364 |     parser.add_argument('--overwrite_output_dir', action='store_true',
365 |                         help='Overwrite the content of the output directory.')
366 |     parser.add_argument('--overwrite_cache', action='store_true',
367 |                         help='Overwrite the cached training and evaluation sets.')
368 |     parser.add_argument('--seed', type=int, default=42,
369 |                         help='random seed for initialization')
370 | 
371 |     parser.add_argument('--fp16', action='store_true',
372 |                         help="Whether to use 16-bit (mixed) precision (through NVIDIA apex) instead of 32-bit")
373 |     parser.add_argument('--fp16_opt_level', type=str, default='O1',
374 |                         help="For fp16: Apex AMP optimization level selected in ['O0', 'O1', 'O2', and 'O3']."
375 |                              "See details at https://nvidia.github.io/apex/amp.html")
376 |     args = parser.parse_args()
377 | 
378 |     # Setup CUDA, GPU
379 |     if os.path.exists(args.output_dir) and os.listdir(args.output_dir) and args.do_train and not args.overwrite_output_dir:
380 |         raise ValueError('Output directory ({}) already exists and is not empty. Use --overwrite_output_dir to overcome.')
381 | 
382 |     device = torch.device('cuda' if torch.cuda.is_available() and not args.no_cuda else 'cpu')
383 |     args.n_gpu = torch.cuda.device_count()
384 |     args.device = device
385 | 
386 |     # Setup logging
387 |     logging.basicConfig(format = '%(asctime)s - %(levelname)s - %(name)s - %(message)s',
388 |                         datefmt = '%m/%d/%Y %H:%M:%S',
389 |                         level = logging.INFO)
390 |     logger.warning('Process device: %s, n_gpu: %s, 16-bits training: %s',
391 |                    device, args.n_gpu, args.fp16)
392 | 
393 |     # Set seed
394 |     set_seed(args)
395 |     # Prepare QPM task
396 |     processor = QPMProcessor()
397 |     label_list = processor.get_labels()
398 |     num_labels = len(label_list)
399 | 
400 |     # Load pretrained model and tokenizer
401 |     config_class, model_class, tokenizer_class = BertConfig, BertForSequenceClassification, BertTokenizer
402 |     config = config_class.from_pretrained(args.config_name if args.config_name else args.model_name_or_path, num_labels=num_labels)
403 |     tokenizer = tokenizer_class.from_pretrained(args.tokenizer_name if args.tokenizer_name else args.model_name_or_path, do_lower_case=args.do_lower_case)
404 |     model = model_class.from_pretrained(args.model_name_or_path, from_tf=bool('.ckpt' in args.model_name_or_path), config=config)
405 |     model.to(args.device)
406 | 
407 |     logger.info('Trainning/evaluation parameters %s', args)
408 |     parent_data_dir = args.data_dir
409 |     parent_output_dir = args.output_dir
410 | 
411 |     # Trainning
412 |     results_tmp = {}
413 |     if args.do_train:
414 |         # 10-Fold dataset for training.
415 |         for i in range(0, 10):
416 |             # Reload the pretrained model.
417 |             model = model_class.from_pretrained(args.model_name_or_path,
418 |                                                 from_tf=bool('.ckpt' in args.model_name_or_path),
419 |                                                 config=config)
420 |             model.to(args.device)
421 | 
422 |             args.data_dir = parent_data_dir + str(i)
423 |             args.output_dir = parent_output_dir + str(i)
424 | 
425 |             train_dataset = load_and_cache_examples(args, tokenizer, set_type='train')
426 |             global_step, tr_loss = train(args, train_dataset, model, tokenizer)
427 |             logger.info(" global_step = %s, average loss = %s", global_step, tr_loss)
428 |             # Saving best-practices: if you use defaults names for the model, you can reload it using from_pretrained()
429 |             # Create output directory if needed
430 |             if not os.path.exists(args.output_dir):
431 |                 os.makedirs(args.output_dir)
432 | 
433 |             logger.info("Saving model checkpoint to %s", args.output_dir)
434 |             # Save a trained model, configuration and tokenizer using `save_pretrained()`.
435 |             # They can then be reloaded using `from_pretrained()`
436 |             model_to_save = model.module if hasattr(model, 'module') else model  # Take care of distributed/parallel training
437 |             model_to_save.save_pretrained(args.output_dir)
438 |             tokenizer.save_pretrained(args.output_dir)
439 | 
440 |             # Good practice: save your training arguments together with the trained model
441 |             torch.save(args, os.path.join(args.output_dir, 'training_args.bin'))
442 | 
443 |             # Load a trained model and vocabulary that you have fine-tuned
444 |             model = model_class.from_pretrained(args.output_dir)
445 |             tokenizer = tokenizer_class.from_pretrained(args.output_dir, do_lower_case=args.do_lower_case)
446 |             model.to(args.device)
447 | 
448 |             # for reduce the usage of disk, evluate and find the best checkpoint every sub dataset.
449 |             # args.data_dir = parent_data_dir + str(i)
450 |             # args.output_dir = parent_output_dir + str(i)
451 |             # tokenizer = tokenizer_class.from_pretrained(args.output_dir, do_lower_case=args.do_lower_case)
452 |             # checkpoints = [args.output_dir]
453 |             # if args.eval_all_checkpoints:
454 |             #     checkpoints = list(os.path.dirname(c) for c in sorted(glob.glob(args.output_dir + '/**/' + WEIGHTS_NAME, recursive=True)))
455 |             #     logging.getLogger("pytorch_transformers.modeling_utils").setLevel(logging.WARN)  # Reduce logging
456 |             # logger.info("Evaluate the following checkpoints: %s", checkpoints)
457 |             # best_f1 = 0.0
458 |             # for checkpoint in checkpoints:
459 |             #     global_step = checkpoint.split('-')[-1] if len(checkpoints) > 1 else ""
460 |             #     model = model_class.from_pretrained(checkpoint)
461 |             #     model.to(args.device)
462 |             #     result = evaluate(args, model, tokenizer, prefix=global_step)
463 |             #     if result['f1'] > best_f1:
464 |             #         best_f1 = result['f1']
465 |             #         # Save the best model checkpoint
466 |             #         output_dir = os.path.join(args.output_dir, 'best_checkpoint_fold' + str(i))
467 |             #         if not os.path.exists(output_dir):
468 |             #             os.makedirs(output_dir)
469 |             #         model_to_save = model.module if hasattr(model, 'module') else model
470 |             #         model_to_save.save_pretrained(output_dir)
471 |             #         torch.save(args, 'training_args.bin')
472 |             #         logger.info('Saving model checkpoint to %s', output_dir)
473 |             #
474 |             #     result = dict((k + '_{}'.format(global_step), v) for k, v in result.items())
475 |             #     results_tmp.update(result)
476 |             # checkpoints.remove(args.output_dir)
477 |             # for checkpoint in checkpoints:
478 |             #     shutil.rmtree(checkpoint)
479 | 
480 |     # Evaluation
481 |     results = {}
482 |     if args.do_eval:
483 |         for i in range(10):
484 |             args.data_dir = parent_data_dir + str(i)
485 |             args.output_dir = parent_output_dir + str(i)
486 |             tokenizer = tokenizer_class.from_pretrained(args.output_dir, do_lower_case=args.do_lower_case)
487 |             checkpoints = [args.output_dir]
488 |             if args.eval_all_checkpoints:
489 |                 checkpoints = list(os.path.dirname(c) for c in sorted(glob.glob(args.output_dir + '/**/' + WEIGHTS_NAME, recursive=True)))
490 |                 logging.getLogger("pytorch_transformers.modeling_utils").setLevel(logging.WARN)  # Reduce logging
491 |             logger.info("Evaluate the following checkpoints: %s", checkpoints)
492 |             best_f1 = 0.0
493 |             for checkpoint in checkpoints:
494 |                 global_step = checkpoint.split('-')[-1] if len(checkpoints) > 1 else ""
495 |                 model = model_class.from_pretrained(checkpoint)
496 |                 model.to(args.device)
497 |                 result = evaluate(args, model, tokenizer, prefix=global_step)
498 |                 if result['f1'] > best_f1:
499 |                     best_f1 = result['f1']
500 |                     # Save the best model checkpoint
501 |                     output_dir = os.path.join(args.output_dir, 'best_checkpoint_fold' + str(i))
502 |                     if not os.path.exists(output_dir):
503 |                         os.makedirs(output_dir)
504 |                     model_to_save = model.module if hasattr(model, 'module') else model
505 |                     model_to_save.save_pretrained(output_dir)
506 |                     torch.save(args, 'training_args.bin')
507 |                     logger.info('Saving model checkpoint to %s', output_dir)
508 | 
509 |                 result = dict((k + '_{}'.format(global_step), v) for k, v in result.items())
510 |                 results.update(result)
511 | 
512 |     # Prediction
513 |     if args.do_predict:
514 |         for i in range(1):
515 |             args.output_dir = parent_output_dir + str(i)
516 |             tokenizer = tokenizer_class.from_pretrained(args.output_dir, do_lower_case=args.do_lower_case)
517 |             logging.getLogger("pytorch_transformers.modeling_utils").setLevel(logging.WARN)  # Reduce logging
518 |             checkpoint = args.output_dir + '/best_checkpoint_fold' + str(i)
519 |             model = model_class.from_pretrained(checkpoint)
520 |             model.to(args.device)
521 |             predict(args, model, tokenizer, i)
522 | 
523 |         # For bagging.
524 |         all = pd.read_csv('./data/sample_submission.csv')
525 |         for i in range(10):
526 |             df = pd.read_csv(args.data_dir + str(i) + '/result.csv')
527 |             all['label'] += df['label']
528 |         all['label'] = all['label'] // 6
529 |         all.to_csv('./data/result.csv', index=False)
530 | 
531 | 
532 | if __name__ == '__main__':
533 |     main()
534 | 


--------------------------------------------------------------------------------
/post_processing.py:
--------------------------------------------------------------------------------
 1 | # -*- coding: utf-8 -*-
 2 | 
 3 | import sys
 4 | 
 5 | import pandas as pd
 6 | 
 7 | df_train = pd.read_csv('./data/origin_train.csv', encoding='utf-8', engine='python')
 8 | q1 = df_train['question1'].values
 9 | q2 = df_train['question2'].values
10 | label = df_train['label'].values
11 | category = df_train['category'].values
12 | dict_1 = {}
13 | dict_2 = {}
14 | dict_ct = {}
15 | for i in range(0, df_train.shape[0]):
16 |     dict_ct[q1[i]] = category[i]
17 |     dict_ct[q2[i]] = category[i]
18 |     if label[i] == 1:
19 |         if dict_1.get(q1[i], -1) == -1:
20 |             dict_1[q1[i]] = [q2[i]]
21 |         else:
22 |             dict_1[q1[i]].append(q2[i])
23 |         if dict_1.get(q2[i], -1) == -1:
24 |             dict_1[q2[i]] = [q1[i]]
25 |         else:
26 |             dict_1[q2[i]].append(q1[i])
27 |     else:
28 |         if dict_2.get(q1[i], -1) == -1:
29 |             dict_2[q1[i]] = [q2[i]]
30 |         else:
31 |             dict_2[q1[i]].append(q2[i])
32 |         if dict_2.get(q2[i], -1) == -1:
33 |             dict_2[q2[i]] = [q1[i]]
34 |         else:
35 |             dict_2[q2[i]].append(q1[i])
36 | 
37 |     if i % 5000 == 0:
38 |         sys.stdout.flush()
39 |         sys.stdout.write('#')
40 | print(len(dict_1))
41 | 
42 | df_result = pd.read_csv('./data/result.csv', encoding='utf-8', engine='python', index_col='id')
43 | df_test = pd.read_csv('./data/noextension/test.csv', encoding='utf-8', engine='python')
44 | q1_test = df_test['question1'].values
45 | q2_test = df_test['question2'].values
46 | category_test = df_test['category'].values
47 | id_test = df_test['id']
48 | 
49 | cnt = 0
50 | for i in range(0, df_test.shape[0]):
51 |     # print(q1_test[i], q2_test[i], id_test[i], category_test[i])
52 |     list1 = dict_1.get(q1_test[i], -1)
53 |     list2 = dict_1.get(q2_test[i], -1)
54 |     ct = category_test[i]
55 |     if list1 != -1:
56 |         if list2 != -1:
57 |             if len(set(list1).intersection(set(list2))) != 0 and dict_ct.get(q1_test[i], -1) == dict_ct.get(q2_test[i], -1) and dict_ct.get(q1_test[i], -1) != -1:
58 |                 df_result.iloc[id_test[i]]['label'] = 1
59 |                 print(3)
60 | 
61 |     # 找到每个q1相似问题q在训练集中是否存在q!=q2，从而推出q1!=q2
62 |     if list1 != -1:
63 |         for q in list1:
64 |             neq_list = dict_2.get(q, -1)
65 |             if neq_list != -1:
66 |                 if q2_test[i] in neq_list:
67 |                     df_result.iloc[id_test[i]]['label'] = 0
68 |                     print(1)
69 |     # 同理q2
70 |     if list2 != -1:
71 |         for q in list2:
72 |             neq_list = dict_2.get(q, -1)
73 |             if neq_list != -1:
74 |                 if q1_test[i] in neq_list:
75 |                     df_result.iloc[id_test[i]]['label'] = 0
76 |                     print(2)
77 | 
78 | df_result.to_csv('./data/post_result.csv')
79 | 


--------------------------------------------------------------------------------