├── README.md ├── __init__.py ├── atec_nlp_sim_test_0.4.csv ├── atec_nlp_sim_train_0.6.csv ├── create_pretraining_data.py ├── extract_features.py ├── modeling.py ├── modeling_test.py ├── optimization.py ├── optimization_test.py ├── requirements.txt ├── run_classifier.py ├── run_classifier_0214.py ├── run_classifier_0228.py ├── run_classifier_with_tfhub.py ├── run_pretraining.py ├── run_squad.py └── tokenization.py /README.md: -------------------------------------------------------------------------------- 1 | ## Bert 中文文本分类 2 | - 任务1 判断文本相似性 3 | - 任务2 文本多分类 4 | 5 | 6 | ## 运行环境 7 | - python3.6 8 | - tensorflow1.12 9 | 10 | 11 | ## 任务1 判断文本相似性 12 | #### 数据说明 13 | - 运行时数据所在目录为/Users/luyao/Desktop/bert_learn/MAYI,在这里为了方便起见我把数据一起上传了 14 | - atec_nlp_sim_train_0.6.csv和atec_nlp_sim_test_0.4.csv格式相同,各列分别为"index query1 query2 label","\t"分隔。来自网络上蚂蚁金服的公开数据集,是判断文本相似性的数据 15 | - 程序运行中只选择了部分数据(train3000,val500,test100) 16 | 17 | #### 程序说明 18 | >如下的目录都是我自己学习时的目录,大家根据自己的实际情况进行修改 19 | - 首先要下载**[`BERT-Base, Chinese`](https://storage.googleapis.com/bert_models/2018_11_03/chinese_L-12_H-768_A-12.zip)**并解压到/Users/luyao/Desktop/bert_learn/chinese_L-12_H-768_A-12 20 | - run_classifier.py是Bert自带的程序,在这里我把自己写的放在了run_classifier_0214.py中 21 | - 添加SelfProcessor类,修改了get_train_examples等三个方法,功能参考Bert自带的那些processor即可。或者也可以修改_create_examples 22 | - 在main(_)中增加自己定义的这个processor到字典中 23 | 24 | #### 训练 25 | - 定义环境变量 26 | >export BERT_BASE_DIR=/Users/luyao/Desktop/bert_learn/chinese_L-12_H-768_A-12 27 | >export MAYI_DIR=/Users/luyao/Desktop/bert_learn 28 | >查看环境变量是否生效 echo $BERT_BASE_DIR 等 29 | - 进行训练 30 | > python run_classifier_0214.py \\ 31 | --task_name=MAYI \\ 32 | --do_train=true \\ 33 | --do_eval=true \\ 34 | --data_dir=\$MAYI_DIR/MAYI \\ 35 | --vocab_file=\$BERT_BASE_DIR/vocab.txt \\ 36 | --bert_config_file=\$BERT_BASE_DIR/bert_config.json \\ 37 | --init_checkpoint=\$BERT_BASE_DIR/bert_model.ckpt \\ 38 | --max_seq_length=128 \\ 39 | --train_batch_size=32 \\ 40 | --learning_rate=2e-5 \\ 41 | --num_train_epochs=3.0 \\ 42 | --output_dir=./tmp/mayi_output/ 43 | output_dir这个目录可以自动生成,不必手动添加。训练完生成的这个目录太大,这里我没有上传 44 | 45 | - 结果在output_dir的eval_results.txt中 46 | >eval_accuracy = 0.8016032 47 | eval_loss = 0.5034697 48 | global_step = 281 49 | loss = 0.5032294 50 | 51 | #### 预测 52 | - 定义环境变量 53 | >export TRAINED_CLASSIFIER=/Users/luyao/Desktop/bert_learn/fine/tuned/classifier 54 | 使用官方上的demo时,这里有错。应该选择train之后生成的目录。 55 | tensorflow.python.framework.errors_impl.NotFoundError: Unsuccessful TensorSliceReader constructor: Failed to find any matching files for /Users/luyao/Desktop/bert_learn/fine/tuned/classifier 56 | 57 | >改成 58 | export TRAINED_CLASSIFIER=./tmp/mayi_output 这是上一步训练后的输出目录 59 | 60 | - 进行预测 61 | >python run_classifier_0214.py \\ 62 | --task_name=MAYI \\ 63 | --do_train=false \\ 64 | --do_eval=false \\ 65 | --do_predict=true \\ 66 | --data_dir=\$MAYI_DIR/MAYI \\ 67 | --vocab_file=\$BERT_BASE_DIR/vocab.txt \\ 68 | --bert_config_file=\$BERT_BASE_DIR/bert_config.json \\ 69 | --init_checkpoint=\$TRAINED_CLASSIFIER \\ 70 | --max_seq_length=128 \\ 71 | --output_dir=./tmp/mayi_output/ 72 | 73 | - 结果在output_dir的test_results.tsv中,第一列为label=0的概率,第二列为label=1的概率 74 | >0.54155594 0.45844415 75 | 0.99733293 0.0026670366 76 | 0.98726386 0.012736204 77 | ... 78 | 79 | 80 | 81 | ## 任务2 文本多分类 82 | #### 数据说明 83 | - 运行时数据所在目录为/Users/luyao/Desktop/bert_learn/CNEWS 84 | - 数据较大无法上传 链接:https://pan.baidu.com/s/1ZDez64S9cnzNnucIOrPapQ 密码:vutp 85 | - cenws.train.txt、cenws.val.txt、cenws.test.txt格式相同,各列分别为"label doc","\t"分隔 86 | - 包含"体育、科技、娱乐、家居、时政、财经、房产、游戏、时尚、教育"十个类别,其中train文件50000条,val5000条,test10000条 87 | - 程序运行时仍然只选择了部分数据,大家可以根据机器性能随意选择 88 | 89 | #### 程序说明 90 | >如下的目录都是我自己学习时的目录,大家根据自己的实际情况进行修改 91 | - 首先要下载**[`BERT-Base, Chinese`](https://storage.googleapis.com/bert_models/2018_11_03/chinese_L-12_H-768_A-12.zip)**并解压到/Users/luyao/Desktop/bert_learn/chinese_L-12_H-768_A-12 92 | - 把自己写的放在了run_classifier_0228.py中 93 | - 添加SelfProcessor类,修改了get_train_examples等三个方法,或者也可以修改_create_examples。修改了get_labels方法 94 | - 在main(_)中增加自己定义的这个processor到字典中,修改了FLAGS.do_train等三部分 95 | 96 | #### 训练 97 | - 参考任务1,运行run_classifier_0228.py,修改task_name,data_dir和output_dir即可 98 | >训练结果 99 | eval_accuracy = 0.955 100 | eval_loss = 0.21101545 101 | global_step = 93 102 | loss = 0.21101545 103 | 104 | #### 预测 105 | - 参考任务1,运行run_classifier_0228.py,修改task_name,data_dir和output_dir即可 106 | >预测结果 107 | 0.006859495 0.0022842707 0.005758511 0.0027749564 0.0021667227 0.0033973546 0.00307322 0.0014950271 0.9639163 0.008274185 108 | 0.002119527 0.9845195 0.0023118905 0.001905937 0.0012551323 0.002937188 0.00080464134 0.0015169261 0.0013669604 0.0012623245 109 | 0.008577874 0.0022037376 0.009644147 0.0033149354 0.0121542765 0.003821571 0.92171514 0.0034074073 0.009290288 0.025870731 110 | ... 111 | -------------------------------------------------------------------------------- /__init__.py: -------------------------------------------------------------------------------- 1 | # coding=utf-8 2 | # Copyright 2018 The Google AI Language Team Authors. 3 | # 4 | # Licensed under the Apache License, Version 2.0 (the "License"); 5 | # you may not use this file except in compliance with the License. 6 | # You may obtain a copy of the License at 7 | # 8 | # http://www.apache.org/licenses/LICENSE-2.0 9 | # 10 | # Unless required by applicable law or agreed to in writing, software 11 | # distributed under the License is distributed on an "AS IS" BASIS, 12 | # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. 13 | # See the License for the specific language governing permissions and 14 | # limitations under the License. 15 | 16 | -------------------------------------------------------------------------------- /create_pretraining_data.py: -------------------------------------------------------------------------------- 1 | # coding=utf-8 2 | # Copyright 2018 The Google AI Language Team Authors. 3 | # 4 | # Licensed under the Apache License, Version 2.0 (the "License"); 5 | # you may not use this file except in compliance with the License. 6 | # You may obtain a copy of the License at 7 | # 8 | # http://www.apache.org/licenses/LICENSE-2.0 9 | # 10 | # Unless required by applicable law or agreed to in writing, software 11 | # distributed under the License is distributed on an "AS IS" BASIS, 12 | # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. 13 | # See the License for the specific language governing permissions and 14 | # limitations under the License. 15 | """Create masked LM/next sentence masked_lm TF examples for BERT.""" 16 | 17 | from __future__ import absolute_import 18 | from __future__ import division 19 | from __future__ import print_function 20 | 21 | import collections 22 | import random 23 | import tokenization 24 | import tensorflow as tf 25 | 26 | flags = tf.flags 27 | 28 | FLAGS = flags.FLAGS 29 | 30 | flags.DEFINE_string("input_file", None, 31 | "Input raw text file (or comma-separated list of files).") 32 | 33 | flags.DEFINE_string( 34 | "output_file", None, 35 | "Output TF example file (or comma-separated list of files).") 36 | 37 | flags.DEFINE_string("vocab_file", None, 38 | "The vocabulary file that the BERT model was trained on.") 39 | 40 | flags.DEFINE_bool( 41 | "do_lower_case", True, 42 | "Whether to lower case the input text. Should be True for uncased " 43 | "models and False for cased models.") 44 | 45 | flags.DEFINE_integer("max_seq_length", 128, "Maximum sequence length.") 46 | 47 | flags.DEFINE_integer("max_predictions_per_seq", 20, 48 | "Maximum number of masked LM predictions per sequence.") 49 | 50 | flags.DEFINE_integer("random_seed", 12345, "Random seed for data generation.") 51 | 52 | flags.DEFINE_integer( 53 | "dupe_factor", 10, 54 | "Number of times to duplicate the input data (with different masks).") 55 | 56 | flags.DEFINE_float("masked_lm_prob", 0.15, "Masked LM probability.") 57 | 58 | flags.DEFINE_float( 59 | "short_seq_prob", 0.1, 60 | "Probability of creating sequences which are shorter than the " 61 | "maximum length.") 62 | 63 | 64 | class TrainingInstance(object): 65 | """A single training instance (sentence pair).""" 66 | 67 | def __init__(self, tokens, segment_ids, masked_lm_positions, masked_lm_labels, 68 | is_random_next): 69 | self.tokens = tokens 70 | self.segment_ids = segment_ids 71 | self.is_random_next = is_random_next 72 | self.masked_lm_positions = masked_lm_positions 73 | self.masked_lm_labels = masked_lm_labels 74 | 75 | def __str__(self): 76 | s = "" 77 | s += "tokens: %s\n" % (" ".join( 78 | [tokenization.printable_text(x) for x in self.tokens])) 79 | s += "segment_ids: %s\n" % (" ".join([str(x) for x in self.segment_ids])) 80 | s += "is_random_next: %s\n" % self.is_random_next 81 | s += "masked_lm_positions: %s\n" % (" ".join( 82 | [str(x) for x in self.masked_lm_positions])) 83 | s += "masked_lm_labels: %s\n" % (" ".join( 84 | [tokenization.printable_text(x) for x in self.masked_lm_labels])) 85 | s += "\n" 86 | return s 87 | 88 | def __repr__(self): 89 | return self.__str__() 90 | 91 | 92 | def write_instance_to_example_files(instances, tokenizer, max_seq_length, 93 | max_predictions_per_seq, output_files): 94 | """Create TF example files from `TrainingInstance`s.""" 95 | writers = [] 96 | for output_file in output_files: 97 | writers.append(tf.python_io.TFRecordWriter(output_file)) 98 | 99 | writer_index = 0 100 | 101 | total_written = 0 102 | for (inst_index, instance) in enumerate(instances): 103 | input_ids = tokenizer.convert_tokens_to_ids(instance.tokens) 104 | input_mask = [1] * len(input_ids) 105 | segment_ids = list(instance.segment_ids) 106 | assert len(input_ids) <= max_seq_length 107 | 108 | while len(input_ids) < max_seq_length: 109 | input_ids.append(0) 110 | input_mask.append(0) 111 | segment_ids.append(0) 112 | 113 | assert len(input_ids) == max_seq_length 114 | assert len(input_mask) == max_seq_length 115 | assert len(segment_ids) == max_seq_length 116 | 117 | masked_lm_positions = list(instance.masked_lm_positions) 118 | masked_lm_ids = tokenizer.convert_tokens_to_ids(instance.masked_lm_labels) 119 | masked_lm_weights = [1.0] * len(masked_lm_ids) 120 | 121 | while len(masked_lm_positions) < max_predictions_per_seq: 122 | masked_lm_positions.append(0) 123 | masked_lm_ids.append(0) 124 | masked_lm_weights.append(0.0) 125 | 126 | next_sentence_label = 1 if instance.is_random_next else 0 127 | 128 | features = collections.OrderedDict() 129 | features["input_ids"] = create_int_feature(input_ids) 130 | features["input_mask"] = create_int_feature(input_mask) 131 | features["segment_ids"] = create_int_feature(segment_ids) 132 | features["masked_lm_positions"] = create_int_feature(masked_lm_positions) 133 | features["masked_lm_ids"] = create_int_feature(masked_lm_ids) 134 | features["masked_lm_weights"] = create_float_feature(masked_lm_weights) 135 | features["next_sentence_labels"] = create_int_feature([next_sentence_label]) 136 | 137 | tf_example = tf.train.Example(features=tf.train.Features(feature=features)) 138 | 139 | writers[writer_index].write(tf_example.SerializeToString()) 140 | writer_index = (writer_index + 1) % len(writers) 141 | 142 | total_written += 1 143 | 144 | if inst_index < 20: 145 | tf.logging.info("*** Example ***") 146 | tf.logging.info("tokens: %s" % " ".join( 147 | [tokenization.printable_text(x) for x in instance.tokens])) 148 | 149 | for feature_name in features.keys(): 150 | feature = features[feature_name] 151 | values = [] 152 | if feature.int64_list.value: 153 | values = feature.int64_list.value 154 | elif feature.float_list.value: 155 | values = feature.float_list.value 156 | tf.logging.info( 157 | "%s: %s" % (feature_name, " ".join([str(x) for x in values]))) 158 | 159 | for writer in writers: 160 | writer.close() 161 | 162 | tf.logging.info("Wrote %d total instances", total_written) 163 | 164 | 165 | def create_int_feature(values): 166 | feature = tf.train.Feature(int64_list=tf.train.Int64List(value=list(values))) 167 | return feature 168 | 169 | 170 | def create_float_feature(values): 171 | feature = tf.train.Feature(float_list=tf.train.FloatList(value=list(values))) 172 | return feature 173 | 174 | 175 | def create_training_instances(input_files, tokenizer, max_seq_length, 176 | dupe_factor, short_seq_prob, masked_lm_prob, 177 | max_predictions_per_seq, rng): 178 | """Create `TrainingInstance`s from raw text.""" 179 | all_documents = [[]] 180 | 181 | # Input file format: 182 | # (1) One sentence per line. These should ideally be actual sentences, not 183 | # entire paragraphs or arbitrary spans of text. (Because we use the 184 | # sentence boundaries for the "next sentence prediction" task). 185 | # (2) Blank lines between documents. Document boundaries are needed so 186 | # that the "next sentence prediction" task doesn't span between documents. 187 | for input_file in input_files: 188 | with tf.gfile.GFile(input_file, "r") as reader: 189 | while True: 190 | line = tokenization.convert_to_unicode(reader.readline()) 191 | if not line: 192 | break 193 | line = line.strip() 194 | 195 | # Empty lines are used as document delimiters 196 | if not line: 197 | all_documents.append([]) 198 | tokens = tokenizer.tokenize(line) 199 | if tokens: 200 | all_documents[-1].append(tokens) 201 | 202 | # Remove empty documents 203 | all_documents = [x for x in all_documents if x] 204 | rng.shuffle(all_documents) 205 | 206 | vocab_words = list(tokenizer.vocab.keys()) 207 | instances = [] 208 | for _ in range(dupe_factor): 209 | for document_index in range(len(all_documents)): 210 | instances.extend( 211 | create_instances_from_document( 212 | all_documents, document_index, max_seq_length, short_seq_prob, 213 | masked_lm_prob, max_predictions_per_seq, vocab_words, rng)) 214 | 215 | rng.shuffle(instances) 216 | return instances 217 | 218 | 219 | def create_instances_from_document( 220 | all_documents, document_index, max_seq_length, short_seq_prob, 221 | masked_lm_prob, max_predictions_per_seq, vocab_words, rng): 222 | """Creates `TrainingInstance`s for a single document.""" 223 | document = all_documents[document_index] 224 | 225 | # Account for [CLS], [SEP], [SEP] 226 | max_num_tokens = max_seq_length - 3 227 | 228 | # We *usually* want to fill up the entire sequence since we are padding 229 | # to `max_seq_length` anyways, so short sequences are generally wasted 230 | # computation. However, we *sometimes* 231 | # (i.e., short_seq_prob == 0.1 == 10% of the time) want to use shorter 232 | # sequences to minimize the mismatch between pre-training and fine-tuning. 233 | # The `target_seq_length` is just a rough target however, whereas 234 | # `max_seq_length` is a hard limit. 235 | target_seq_length = max_num_tokens 236 | if rng.random() < short_seq_prob: 237 | target_seq_length = rng.randint(2, max_num_tokens) 238 | 239 | # We DON'T just concatenate all of the tokens from a document into a long 240 | # sequence and choose an arbitrary split point because this would make the 241 | # next sentence prediction task too easy. Instead, we split the input into 242 | # segments "A" and "B" based on the actual "sentences" provided by the user 243 | # input. 244 | instances = [] 245 | current_chunk = [] 246 | current_length = 0 247 | i = 0 248 | while i < len(document): 249 | segment = document[i] 250 | current_chunk.append(segment) 251 | current_length += len(segment) 252 | if i == len(document) - 1 or current_length >= target_seq_length: 253 | if current_chunk: 254 | # `a_end` is how many segments from `current_chunk` go into the `A` 255 | # (first) sentence. 256 | a_end = 1 257 | if len(current_chunk) >= 2: 258 | a_end = rng.randint(1, len(current_chunk) - 1) 259 | 260 | tokens_a = [] 261 | for j in range(a_end): 262 | tokens_a.extend(current_chunk[j]) 263 | 264 | tokens_b = [] 265 | # Random next 266 | is_random_next = False 267 | if len(current_chunk) == 1 or rng.random() < 0.5: 268 | is_random_next = True 269 | target_b_length = target_seq_length - len(tokens_a) 270 | 271 | # This should rarely go for more than one iteration for large 272 | # corpora. However, just to be careful, we try to make sure that 273 | # the random document is not the same as the document 274 | # we're processing. 275 | for _ in range(10): 276 | random_document_index = rng.randint(0, len(all_documents) - 1) 277 | if random_document_index != document_index: 278 | break 279 | 280 | random_document = all_documents[random_document_index] 281 | random_start = rng.randint(0, len(random_document) - 1) 282 | for j in range(random_start, len(random_document)): 283 | tokens_b.extend(random_document[j]) 284 | if len(tokens_b) >= target_b_length: 285 | break 286 | # We didn't actually use these segments so we "put them back" so 287 | # they don't go to waste. 288 | num_unused_segments = len(current_chunk) - a_end 289 | i -= num_unused_segments 290 | # Actual next 291 | else: 292 | is_random_next = False 293 | for j in range(a_end, len(current_chunk)): 294 | tokens_b.extend(current_chunk[j]) 295 | truncate_seq_pair(tokens_a, tokens_b, max_num_tokens, rng) 296 | 297 | assert len(tokens_a) >= 1 298 | assert len(tokens_b) >= 1 299 | 300 | tokens = [] 301 | segment_ids = [] 302 | tokens.append("[CLS]") 303 | segment_ids.append(0) 304 | for token in tokens_a: 305 | tokens.append(token) 306 | segment_ids.append(0) 307 | 308 | tokens.append("[SEP]") 309 | segment_ids.append(0) 310 | 311 | for token in tokens_b: 312 | tokens.append(token) 313 | segment_ids.append(1) 314 | tokens.append("[SEP]") 315 | segment_ids.append(1) 316 | 317 | (tokens, masked_lm_positions, 318 | masked_lm_labels) = create_masked_lm_predictions( 319 | tokens, masked_lm_prob, max_predictions_per_seq, vocab_words, rng) 320 | instance = TrainingInstance( 321 | tokens=tokens, 322 | segment_ids=segment_ids, 323 | is_random_next=is_random_next, 324 | masked_lm_positions=masked_lm_positions, 325 | masked_lm_labels=masked_lm_labels) 326 | instances.append(instance) 327 | current_chunk = [] 328 | current_length = 0 329 | i += 1 330 | 331 | return instances 332 | 333 | 334 | MaskedLmInstance = collections.namedtuple("MaskedLmInstance", 335 | ["index", "label"]) 336 | 337 | 338 | def create_masked_lm_predictions(tokens, masked_lm_prob, 339 | max_predictions_per_seq, vocab_words, rng): 340 | """Creates the predictions for the masked LM objective.""" 341 | 342 | cand_indexes = [] 343 | for (i, token) in enumerate(tokens): 344 | if token == "[CLS]" or token == "[SEP]": 345 | continue 346 | cand_indexes.append(i) 347 | 348 | rng.shuffle(cand_indexes) 349 | 350 | output_tokens = list(tokens) 351 | 352 | num_to_predict = min(max_predictions_per_seq, 353 | max(1, int(round(len(tokens) * masked_lm_prob)))) 354 | 355 | masked_lms = [] 356 | covered_indexes = set() 357 | for index in cand_indexes: 358 | if len(masked_lms) >= num_to_predict: 359 | break 360 | if index in covered_indexes: 361 | continue 362 | covered_indexes.add(index) 363 | 364 | masked_token = None 365 | # 80% of the time, replace with [MASK] 366 | if rng.random() < 0.8: 367 | masked_token = "[MASK]" 368 | else: 369 | # 10% of the time, keep original 370 | if rng.random() < 0.5: 371 | masked_token = tokens[index] 372 | # 10% of the time, replace with random word 373 | else: 374 | masked_token = vocab_words[rng.randint(0, len(vocab_words) - 1)] 375 | 376 | output_tokens[index] = masked_token 377 | 378 | masked_lms.append(MaskedLmInstance(index=index, label=tokens[index])) 379 | 380 | masked_lms = sorted(masked_lms, key=lambda x: x.index) 381 | 382 | masked_lm_positions = [] 383 | masked_lm_labels = [] 384 | for p in masked_lms: 385 | masked_lm_positions.append(p.index) 386 | masked_lm_labels.append(p.label) 387 | 388 | return (output_tokens, masked_lm_positions, masked_lm_labels) 389 | 390 | 391 | def truncate_seq_pair(tokens_a, tokens_b, max_num_tokens, rng): 392 | """Truncates a pair of sequences to a maximum sequence length.""" 393 | while True: 394 | total_length = len(tokens_a) + len(tokens_b) 395 | if total_length <= max_num_tokens: 396 | break 397 | 398 | trunc_tokens = tokens_a if len(tokens_a) > len(tokens_b) else tokens_b 399 | assert len(trunc_tokens) >= 1 400 | 401 | # We want to sometimes truncate from the front and sometimes from the 402 | # back to add more randomness and avoid biases. 403 | if rng.random() < 0.5: 404 | del trunc_tokens[0] 405 | else: 406 | trunc_tokens.pop() 407 | 408 | 409 | def main(_): 410 | tf.logging.set_verbosity(tf.logging.INFO) 411 | 412 | tokenizer = tokenization.FullTokenizer( 413 | vocab_file=FLAGS.vocab_file, do_lower_case=FLAGS.do_lower_case) 414 | 415 | input_files = [] 416 | for input_pattern in FLAGS.input_file.split(","): 417 | input_files.extend(tf.gfile.Glob(input_pattern)) 418 | 419 | tf.logging.info("*** Reading from input files ***") 420 | for input_file in input_files: 421 | tf.logging.info(" %s", input_file) 422 | 423 | rng = random.Random(FLAGS.random_seed) 424 | instances = create_training_instances( 425 | input_files, tokenizer, FLAGS.max_seq_length, FLAGS.dupe_factor, 426 | FLAGS.short_seq_prob, FLAGS.masked_lm_prob, FLAGS.max_predictions_per_seq, 427 | rng) 428 | 429 | output_files = FLAGS.output_file.split(",") 430 | tf.logging.info("*** Writing to output files ***") 431 | for output_file in output_files: 432 | tf.logging.info(" %s", output_file) 433 | 434 | write_instance_to_example_files(instances, tokenizer, FLAGS.max_seq_length, 435 | FLAGS.max_predictions_per_seq, output_files) 436 | 437 | 438 | if __name__ == "__main__": 439 | flags.mark_flag_as_required("input_file") 440 | flags.mark_flag_as_required("output_file") 441 | flags.mark_flag_as_required("vocab_file") 442 | tf.app.run() 443 | -------------------------------------------------------------------------------- /extract_features.py: -------------------------------------------------------------------------------- 1 | # coding=utf-8 2 | # Copyright 2018 The Google AI Language Team Authors. 3 | # 4 | # Licensed under the Apache License, Version 2.0 (the "License"); 5 | # you may not use this file except in compliance with the License. 6 | # You may obtain a copy of the License at 7 | # 8 | # http://www.apache.org/licenses/LICENSE-2.0 9 | # 10 | # Unless required by applicable law or agreed to in writing, software 11 | # distributed under the License is distributed on an "AS IS" BASIS, 12 | # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. 13 | # See the License for the specific language governing permissions and 14 | # limitations under the License. 15 | """Extract pre-computed feature vectors from BERT.""" 16 | 17 | from __future__ import absolute_import 18 | from __future__ import division 19 | from __future__ import print_function 20 | 21 | import codecs 22 | import collections 23 | import json 24 | import re 25 | 26 | import modeling 27 | import tokenization 28 | import tensorflow as tf 29 | 30 | flags = tf.flags 31 | 32 | FLAGS = flags.FLAGS 33 | 34 | flags.DEFINE_string("input_file", None, "") 35 | 36 | flags.DEFINE_string("output_file", None, "") 37 | 38 | flags.DEFINE_string("layers", "-1,-2,-3,-4", "") 39 | 40 | flags.DEFINE_string( 41 | "bert_config_file", None, 42 | "The config json file corresponding to the pre-trained BERT model. " 43 | "This specifies the model architecture.") 44 | 45 | flags.DEFINE_integer( 46 | "max_seq_length", 128, 47 | "The maximum total input sequence length after WordPiece tokenization. " 48 | "Sequences longer than this will be truncated, and sequences shorter " 49 | "than this will be padded.") 50 | 51 | flags.DEFINE_string( 52 | "init_checkpoint", None, 53 | "Initial checkpoint (usually from a pre-trained BERT model).") 54 | 55 | flags.DEFINE_string("vocab_file", None, 56 | "The vocabulary file that the BERT model was trained on.") 57 | 58 | flags.DEFINE_bool( 59 | "do_lower_case", True, 60 | "Whether to lower case the input text. Should be True for uncased " 61 | "models and False for cased models.") 62 | 63 | flags.DEFINE_integer("batch_size", 32, "Batch size for predictions.") 64 | 65 | flags.DEFINE_bool("use_tpu", False, "Whether to use TPU or GPU/CPU.") 66 | 67 | flags.DEFINE_string("master", None, 68 | "If using a TPU, the address of the master.") 69 | 70 | flags.DEFINE_integer( 71 | "num_tpu_cores", 8, 72 | "Only used if `use_tpu` is True. Total number of TPU cores to use.") 73 | 74 | flags.DEFINE_bool( 75 | "use_one_hot_embeddings", False, 76 | "If True, tf.one_hot will be used for embedding lookups, otherwise " 77 | "tf.nn.embedding_lookup will be used. On TPUs, this should be True " 78 | "since it is much faster.") 79 | 80 | 81 | class InputExample(object): 82 | 83 | def __init__(self, unique_id, text_a, text_b): 84 | self.unique_id = unique_id 85 | self.text_a = text_a 86 | self.text_b = text_b 87 | 88 | 89 | class InputFeatures(object): 90 | """A single set of features of data.""" 91 | 92 | def __init__(self, unique_id, tokens, input_ids, input_mask, input_type_ids): 93 | self.unique_id = unique_id 94 | self.tokens = tokens 95 | self.input_ids = input_ids 96 | self.input_mask = input_mask 97 | self.input_type_ids = input_type_ids 98 | 99 | 100 | def input_fn_builder(features, seq_length): 101 | """Creates an `input_fn` closure to be passed to TPUEstimator.""" 102 | 103 | all_unique_ids = [] 104 | all_input_ids = [] 105 | all_input_mask = [] 106 | all_input_type_ids = [] 107 | 108 | for feature in features: 109 | all_unique_ids.append(feature.unique_id) 110 | all_input_ids.append(feature.input_ids) 111 | all_input_mask.append(feature.input_mask) 112 | all_input_type_ids.append(feature.input_type_ids) 113 | 114 | def input_fn(params): 115 | """The actual input function.""" 116 | batch_size = params["batch_size"] 117 | 118 | num_examples = len(features) 119 | 120 | # This is for demo purposes and does NOT scale to large data sets. We do 121 | # not use Dataset.from_generator() because that uses tf.py_func which is 122 | # not TPU compatible. The right way to load data is with TFRecordReader. 123 | d = tf.data.Dataset.from_tensor_slices({ 124 | "unique_ids": 125 | tf.constant(all_unique_ids, shape=[num_examples], dtype=tf.int32), 126 | "input_ids": 127 | tf.constant( 128 | all_input_ids, shape=[num_examples, seq_length], 129 | dtype=tf.int32), 130 | "input_mask": 131 | tf.constant( 132 | all_input_mask, 133 | shape=[num_examples, seq_length], 134 | dtype=tf.int32), 135 | "input_type_ids": 136 | tf.constant( 137 | all_input_type_ids, 138 | shape=[num_examples, seq_length], 139 | dtype=tf.int32), 140 | }) 141 | 142 | d = d.batch(batch_size=batch_size, drop_remainder=False) 143 | return d 144 | 145 | return input_fn 146 | 147 | 148 | def model_fn_builder(bert_config, init_checkpoint, layer_indexes, use_tpu, 149 | use_one_hot_embeddings): 150 | """Returns `model_fn` closure for TPUEstimator.""" 151 | 152 | def model_fn(features, labels, mode, params): # pylint: disable=unused-argument 153 | """The `model_fn` for TPUEstimator.""" 154 | 155 | unique_ids = features["unique_ids"] 156 | input_ids = features["input_ids"] 157 | input_mask = features["input_mask"] 158 | input_type_ids = features["input_type_ids"] 159 | 160 | model = modeling.BertModel( 161 | config=bert_config, 162 | is_training=False, 163 | input_ids=input_ids, 164 | input_mask=input_mask, 165 | token_type_ids=input_type_ids, 166 | use_one_hot_embeddings=use_one_hot_embeddings) 167 | 168 | if mode != tf.estimator.ModeKeys.PREDICT: 169 | raise ValueError("Only PREDICT modes are supported: %s" % (mode)) 170 | 171 | tvars = tf.trainable_variables() 172 | scaffold_fn = None 173 | (assignment_map, 174 | initialized_variable_names) = modeling.get_assignment_map_from_checkpoint( 175 | tvars, init_checkpoint) 176 | if use_tpu: 177 | 178 | def tpu_scaffold(): 179 | tf.train.init_from_checkpoint(init_checkpoint, assignment_map) 180 | return tf.train.Scaffold() 181 | 182 | scaffold_fn = tpu_scaffold 183 | else: 184 | tf.train.init_from_checkpoint(init_checkpoint, assignment_map) 185 | 186 | tf.logging.info("**** Trainable Variables ****") 187 | for var in tvars: 188 | init_string = "" 189 | if var.name in initialized_variable_names: 190 | init_string = ", *INIT_FROM_CKPT*" 191 | tf.logging.info(" name = %s, shape = %s%s", var.name, var.shape, 192 | init_string) 193 | 194 | all_layers = model.get_all_encoder_layers() 195 | 196 | predictions = { 197 | "unique_id": unique_ids, 198 | } 199 | 200 | for (i, layer_index) in enumerate(layer_indexes): 201 | predictions["layer_output_%d" % i] = all_layers[layer_index] 202 | 203 | output_spec = tf.contrib.tpu.TPUEstimatorSpec( 204 | mode=mode, predictions=predictions, scaffold_fn=scaffold_fn) 205 | return output_spec 206 | 207 | return model_fn 208 | 209 | 210 | def convert_examples_to_features(examples, seq_length, tokenizer): 211 | """Loads a data file into a list of `InputBatch`s.""" 212 | 213 | features = [] 214 | for (ex_index, example) in enumerate(examples): 215 | tokens_a = tokenizer.tokenize(example.text_a) 216 | 217 | tokens_b = None 218 | if example.text_b: 219 | tokens_b = tokenizer.tokenize(example.text_b) 220 | 221 | if tokens_b: 222 | # Modifies `tokens_a` and `tokens_b` in place so that the total 223 | # length is less than the specified length. 224 | # Account for [CLS], [SEP], [SEP] with "- 3" 225 | _truncate_seq_pair(tokens_a, tokens_b, seq_length - 3) 226 | else: 227 | # Account for [CLS] and [SEP] with "- 2" 228 | if len(tokens_a) > seq_length - 2: 229 | tokens_a = tokens_a[0:(seq_length - 2)] 230 | 231 | # The convention in BERT is: 232 | # (a) For sequence pairs: 233 | # tokens: [CLS] is this jack ##son ##ville ? [SEP] no it is not . [SEP] 234 | # type_ids: 0 0 0 0 0 0 0 0 1 1 1 1 1 1 235 | # (b) For single sequences: 236 | # tokens: [CLS] the dog is hairy . [SEP] 237 | # type_ids: 0 0 0 0 0 0 0 238 | # 239 | # Where "type_ids" are used to indicate whether this is the first 240 | # sequence or the second sequence. The embedding vectors for `type=0` and 241 | # `type=1` were learned during pre-training and are added to the wordpiece 242 | # embedding vector (and position vector). This is not *strictly* necessary 243 | # since the [SEP] token unambiguously separates the sequences, but it makes 244 | # it easier for the model to learn the concept of sequences. 245 | # 246 | # For classification tasks, the first vector (corresponding to [CLS]) is 247 | # used as as the "sentence vector". Note that this only makes sense because 248 | # the entire model is fine-tuned. 249 | tokens = [] 250 | input_type_ids = [] 251 | tokens.append("[CLS]") 252 | input_type_ids.append(0) 253 | for token in tokens_a: 254 | tokens.append(token) 255 | input_type_ids.append(0) 256 | tokens.append("[SEP]") 257 | input_type_ids.append(0) 258 | 259 | if tokens_b: 260 | for token in tokens_b: 261 | tokens.append(token) 262 | input_type_ids.append(1) 263 | tokens.append("[SEP]") 264 | input_type_ids.append(1) 265 | 266 | input_ids = tokenizer.convert_tokens_to_ids(tokens) 267 | 268 | # The mask has 1 for real tokens and 0 for padding tokens. Only real 269 | # tokens are attended to. 270 | input_mask = [1] * len(input_ids) 271 | 272 | # Zero-pad up to the sequence length. 273 | while len(input_ids) < seq_length: 274 | input_ids.append(0) 275 | input_mask.append(0) 276 | input_type_ids.append(0) 277 | 278 | assert len(input_ids) == seq_length 279 | assert len(input_mask) == seq_length 280 | assert len(input_type_ids) == seq_length 281 | 282 | if ex_index < 5: 283 | tf.logging.info("*** Example ***") 284 | tf.logging.info("unique_id: %s" % (example.unique_id)) 285 | tf.logging.info("tokens: %s" % " ".join( 286 | [tokenization.printable_text(x) for x in tokens])) 287 | tf.logging.info("input_ids: %s" % " ".join([str(x) for x in input_ids])) 288 | tf.logging.info("input_mask: %s" % " ".join([str(x) for x in input_mask])) 289 | tf.logging.info( 290 | "input_type_ids: %s" % " ".join([str(x) for x in input_type_ids])) 291 | 292 | features.append( 293 | InputFeatures( 294 | unique_id=example.unique_id, 295 | tokens=tokens, 296 | input_ids=input_ids, 297 | input_mask=input_mask, 298 | input_type_ids=input_type_ids)) 299 | return features 300 | 301 | 302 | def _truncate_seq_pair(tokens_a, tokens_b, max_length): 303 | """Truncates a sequence pair in place to the maximum length.""" 304 | 305 | # This is a simple heuristic which will always truncate the longer sequence 306 | # one token at a time. This makes more sense than truncating an equal percent 307 | # of tokens from each, since if one sequence is very short then each token 308 | # that's truncated likely contains more information than a longer sequence. 309 | while True: 310 | total_length = len(tokens_a) + len(tokens_b) 311 | if total_length <= max_length: 312 | break 313 | if len(tokens_a) > len(tokens_b): 314 | tokens_a.pop() 315 | else: 316 | tokens_b.pop() 317 | 318 | 319 | def read_examples(input_file): 320 | """Read a list of `InputExample`s from an input file.""" 321 | examples = [] 322 | unique_id = 0 323 | with tf.gfile.GFile(input_file, "r") as reader: 324 | while True: 325 | line = tokenization.convert_to_unicode(reader.readline()) 326 | if not line: 327 | break 328 | line = line.strip() 329 | text_a = None 330 | text_b = None 331 | m = re.match(r"^(.*) \|\|\| (.*)$", line) 332 | if m is None: 333 | text_a = line 334 | else: 335 | text_a = m.group(1) 336 | text_b = m.group(2) 337 | examples.append( 338 | InputExample(unique_id=unique_id, text_a=text_a, text_b=text_b)) 339 | unique_id += 1 340 | return examples 341 | 342 | 343 | def main(_): 344 | tf.logging.set_verbosity(tf.logging.INFO) 345 | 346 | layer_indexes = [int(x) for x in FLAGS.layers.split(",")] 347 | 348 | bert_config = modeling.BertConfig.from_json_file(FLAGS.bert_config_file) 349 | 350 | tokenizer = tokenization.FullTokenizer( 351 | vocab_file=FLAGS.vocab_file, do_lower_case=FLAGS.do_lower_case) 352 | 353 | is_per_host = tf.contrib.tpu.InputPipelineConfig.PER_HOST_V2 354 | run_config = tf.contrib.tpu.RunConfig( 355 | master=FLAGS.master, 356 | tpu_config=tf.contrib.tpu.TPUConfig( 357 | num_shards=FLAGS.num_tpu_cores, 358 | per_host_input_for_training=is_per_host)) 359 | 360 | examples = read_examples(FLAGS.input_file) 361 | 362 | features = convert_examples_to_features( 363 | examples=examples, seq_length=FLAGS.max_seq_length, tokenizer=tokenizer) 364 | 365 | unique_id_to_feature = {} 366 | for feature in features: 367 | unique_id_to_feature[feature.unique_id] = feature 368 | 369 | model_fn = model_fn_builder( 370 | bert_config=bert_config, 371 | init_checkpoint=FLAGS.init_checkpoint, 372 | layer_indexes=layer_indexes, 373 | use_tpu=FLAGS.use_tpu, 374 | use_one_hot_embeddings=FLAGS.use_one_hot_embeddings) 375 | 376 | # If TPU is not available, this will fall back to normal Estimator on CPU 377 | # or GPU. 378 | estimator = tf.contrib.tpu.TPUEstimator( 379 | use_tpu=FLAGS.use_tpu, 380 | model_fn=model_fn, 381 | config=run_config, 382 | predict_batch_size=FLAGS.batch_size) 383 | 384 | input_fn = input_fn_builder( 385 | features=features, seq_length=FLAGS.max_seq_length) 386 | 387 | with codecs.getwriter("utf-8")(tf.gfile.Open(FLAGS.output_file, 388 | "w")) as writer: 389 | for result in estimator.predict(input_fn, yield_single_examples=True): 390 | unique_id = int(result["unique_id"]) 391 | feature = unique_id_to_feature[unique_id] 392 | output_json = collections.OrderedDict() 393 | output_json["linex_index"] = unique_id 394 | all_features = [] 395 | for (i, token) in enumerate(feature.tokens): 396 | all_layers = [] 397 | for (j, layer_index) in enumerate(layer_indexes): 398 | layer_output = result["layer_output_%d" % j] 399 | layers = collections.OrderedDict() 400 | layers["index"] = layer_index 401 | layers["values"] = [ 402 | round(float(x), 6) for x in layer_output[i:(i + 1)].flat 403 | ] 404 | all_layers.append(layers) 405 | features = collections.OrderedDict() 406 | features["token"] = token 407 | features["layers"] = all_layers 408 | all_features.append(features) 409 | output_json["features"] = all_features 410 | writer.write(json.dumps(output_json) + "\n") 411 | 412 | 413 | if __name__ == "__main__": 414 | flags.mark_flag_as_required("input_file") 415 | flags.mark_flag_as_required("vocab_file") 416 | flags.mark_flag_as_required("bert_config_file") 417 | flags.mark_flag_as_required("init_checkpoint") 418 | flags.mark_flag_as_required("output_file") 419 | tf.app.run() 420 | -------------------------------------------------------------------------------- /modeling.py: -------------------------------------------------------------------------------- 1 | # coding=utf-8 2 | # Copyright 2018 The Google AI Language Team Authors. 3 | # 4 | # Licensed under the Apache License, Version 2.0 (the "License"); 5 | # you may not use this file except in compliance with the License. 6 | # You may obtain a copy of the License at 7 | # 8 | # http://www.apache.org/licenses/LICENSE-2.0 9 | # 10 | # Unless required by applicable law or agreed to in writing, software 11 | # distributed under the License is distributed on an "AS IS" BASIS, 12 | # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. 13 | # See the License for the specific language governing permissions and 14 | # limitations under the License. 15 | """The main BERT model and related functions.""" 16 | 17 | from __future__ import absolute_import 18 | from __future__ import division 19 | from __future__ import print_function 20 | 21 | import collections 22 | import copy 23 | import json 24 | import math 25 | import re 26 | import numpy as np 27 | import six 28 | import tensorflow as tf 29 | 30 | 31 | class BertConfig(object): 32 | """Configuration for `BertModel`.""" 33 | 34 | def __init__(self, 35 | vocab_size, 36 | hidden_size=768, 37 | num_hidden_layers=12, 38 | num_attention_heads=12, 39 | intermediate_size=3072, 40 | hidden_act="gelu", 41 | hidden_dropout_prob=0.1, 42 | attention_probs_dropout_prob=0.1, 43 | max_position_embeddings=512, 44 | type_vocab_size=16, 45 | initializer_range=0.02): 46 | """Constructs BertConfig. 47 | 48 | Args: 49 | vocab_size: Vocabulary size of `inputs_ids` in `BertModel`. 50 | hidden_size: Size of the encoder layers and the pooler layer. 51 | num_hidden_layers: Number of hidden layers in the Transformer encoder. 52 | num_attention_heads: Number of attention heads for each attention layer in 53 | the Transformer encoder. 54 | intermediate_size: The size of the "intermediate" (i.e., feed-forward) 55 | layer in the Transformer encoder. 56 | hidden_act: The non-linear activation function (function or string) in the 57 | encoder and pooler. 58 | hidden_dropout_prob: The dropout probability for all fully connected 59 | layers in the embeddings, encoder, and pooler. 60 | attention_probs_dropout_prob: The dropout ratio for the attention 61 | probabilities. 62 | max_position_embeddings: The maximum sequence length that this model might 63 | ever be used with. Typically set this to something large just in case 64 | (e.g., 512 or 1024 or 2048). 65 | type_vocab_size: The vocabulary size of the `token_type_ids` passed into 66 | `BertModel`. 67 | initializer_range: The stdev of the truncated_normal_initializer for 68 | initializing all weight matrices. 69 | """ 70 | self.vocab_size = vocab_size 71 | self.hidden_size = hidden_size 72 | self.num_hidden_layers = num_hidden_layers 73 | self.num_attention_heads = num_attention_heads 74 | self.hidden_act = hidden_act 75 | self.intermediate_size = intermediate_size 76 | self.hidden_dropout_prob = hidden_dropout_prob 77 | self.attention_probs_dropout_prob = attention_probs_dropout_prob 78 | self.max_position_embeddings = max_position_embeddings 79 | self.type_vocab_size = type_vocab_size 80 | self.initializer_range = initializer_range 81 | 82 | @classmethod 83 | def from_dict(cls, json_object): 84 | """Constructs a `BertConfig` from a Python dictionary of parameters.""" 85 | config = BertConfig(vocab_size=None) 86 | for (key, value) in six.iteritems(json_object): 87 | config.__dict__[key] = value 88 | return config 89 | 90 | @classmethod 91 | def from_json_file(cls, json_file): 92 | """Constructs a `BertConfig` from a json file of parameters.""" 93 | with tf.gfile.GFile(json_file, "r") as reader: 94 | text = reader.read() 95 | return cls.from_dict(json.loads(text)) 96 | 97 | def to_dict(self): 98 | """Serializes this instance to a Python dictionary.""" 99 | output = copy.deepcopy(self.__dict__) 100 | return output 101 | 102 | def to_json_string(self): 103 | """Serializes this instance to a JSON string.""" 104 | return json.dumps(self.to_dict(), indent=2, sort_keys=True) + "\n" 105 | 106 | 107 | class BertModel(object): 108 | """BERT model ("Bidirectional Encoder Representations from Transformers"). 109 | 110 | Example usage: 111 | 112 | ```python 113 | # Already been converted into WordPiece token ids 114 | input_ids = tf.constant([[31, 51, 99], [15, 5, 0]]) 115 | input_mask = tf.constant([[1, 1, 1], [1, 1, 0]]) 116 | token_type_ids = tf.constant([[0, 0, 1], [0, 2, 0]]) 117 | 118 | config = modeling.BertConfig(vocab_size=32000, hidden_size=512, 119 | num_hidden_layers=8, num_attention_heads=6, intermediate_size=1024) 120 | 121 | model = modeling.BertModel(config=config, is_training=True, 122 | input_ids=input_ids, input_mask=input_mask, token_type_ids=token_type_ids) 123 | 124 | label_embeddings = tf.get_variable(...) 125 | pooled_output = model.get_pooled_output() 126 | logits = tf.matmul(pooled_output, label_embeddings) 127 | ... 128 | ``` 129 | """ 130 | 131 | def __init__(self, 132 | config, 133 | is_training, 134 | input_ids, 135 | input_mask=None, 136 | token_type_ids=None, 137 | use_one_hot_embeddings=False, 138 | scope=None): 139 | """Constructor for BertModel. 140 | 141 | Args: 142 | config: `BertConfig` instance. 143 | is_training: bool. true for training model, false for eval model. Controls 144 | whether dropout will be applied. 145 | input_ids: int32 Tensor of shape [batch_size, seq_length]. 146 | input_mask: (optional) int32 Tensor of shape [batch_size, seq_length]. 147 | token_type_ids: (optional) int32 Tensor of shape [batch_size, seq_length]. 148 | use_one_hot_embeddings: (optional) bool. Whether to use one-hot word 149 | embeddings or tf.embedding_lookup() for the word embeddings. 150 | scope: (optional) variable scope. Defaults to "bert". 151 | 152 | Raises: 153 | ValueError: The config is invalid or one of the input tensor shapes 154 | is invalid. 155 | """ 156 | config = copy.deepcopy(config) 157 | if not is_training: 158 | config.hidden_dropout_prob = 0.0 159 | config.attention_probs_dropout_prob = 0.0 160 | 161 | input_shape = get_shape_list(input_ids, expected_rank=2) 162 | batch_size = input_shape[0] 163 | seq_length = input_shape[1] 164 | 165 | if input_mask is None: 166 | input_mask = tf.ones(shape=[batch_size, seq_length], dtype=tf.int32) 167 | 168 | if token_type_ids is None: 169 | token_type_ids = tf.zeros(shape=[batch_size, seq_length], dtype=tf.int32) 170 | 171 | with tf.variable_scope(scope, default_name="bert"): 172 | with tf.variable_scope("embeddings"): 173 | # Perform embedding lookup on the word ids. 174 | (self.embedding_output, self.embedding_table) = embedding_lookup( 175 | input_ids=input_ids, 176 | vocab_size=config.vocab_size, 177 | embedding_size=config.hidden_size, 178 | initializer_range=config.initializer_range, 179 | word_embedding_name="word_embeddings", 180 | use_one_hot_embeddings=use_one_hot_embeddings) 181 | 182 | # Add positional embeddings and token type embeddings, then layer 183 | # normalize and perform dropout. 184 | self.embedding_output = embedding_postprocessor( 185 | input_tensor=self.embedding_output, 186 | use_token_type=True, 187 | token_type_ids=token_type_ids, 188 | token_type_vocab_size=config.type_vocab_size, 189 | token_type_embedding_name="token_type_embeddings", 190 | use_position_embeddings=True, 191 | position_embedding_name="position_embeddings", 192 | initializer_range=config.initializer_range, 193 | max_position_embeddings=config.max_position_embeddings, 194 | dropout_prob=config.hidden_dropout_prob) 195 | 196 | with tf.variable_scope("encoder"): 197 | # This converts a 2D mask of shape [batch_size, seq_length] to a 3D 198 | # mask of shape [batch_size, seq_length, seq_length] which is used 199 | # for the attention scores. 200 | attention_mask = create_attention_mask_from_input_mask( 201 | input_ids, input_mask) 202 | 203 | # Run the stacked transformer. 204 | # `sequence_output` shape = [batch_size, seq_length, hidden_size]. 205 | self.all_encoder_layers = transformer_model( 206 | input_tensor=self.embedding_output, 207 | attention_mask=attention_mask, 208 | hidden_size=config.hidden_size, 209 | num_hidden_layers=config.num_hidden_layers, 210 | num_attention_heads=config.num_attention_heads, 211 | intermediate_size=config.intermediate_size, 212 | intermediate_act_fn=get_activation(config.hidden_act), 213 | hidden_dropout_prob=config.hidden_dropout_prob, 214 | attention_probs_dropout_prob=config.attention_probs_dropout_prob, 215 | initializer_range=config.initializer_range, 216 | do_return_all_layers=True) 217 | 218 | self.sequence_output = self.all_encoder_layers[-1] 219 | # The "pooler" converts the encoded sequence tensor of shape 220 | # [batch_size, seq_length, hidden_size] to a tensor of shape 221 | # [batch_size, hidden_size]. This is necessary for segment-level 222 | # (or segment-pair-level) classification tasks where we need a fixed 223 | # dimensional representation of the segment. 224 | with tf.variable_scope("pooler"): 225 | # We "pool" the model by simply taking the hidden state corresponding 226 | # to the first token. We assume that this has been pre-trained 227 | first_token_tensor = tf.squeeze(self.sequence_output[:, 0:1, :], axis=1) 228 | self.pooled_output = tf.layers.dense( 229 | first_token_tensor, 230 | config.hidden_size, 231 | activation=tf.tanh, 232 | kernel_initializer=create_initializer(config.initializer_range)) 233 | 234 | def get_pooled_output(self): 235 | return self.pooled_output 236 | 237 | def get_sequence_output(self): 238 | """Gets final hidden layer of encoder. 239 | 240 | Returns: 241 | float Tensor of shape [batch_size, seq_length, hidden_size] corresponding 242 | to the final hidden of the transformer encoder. 243 | """ 244 | return self.sequence_output 245 | 246 | def get_all_encoder_layers(self): 247 | return self.all_encoder_layers 248 | 249 | def get_embedding_output(self): 250 | """Gets output of the embedding lookup (i.e., input to the transformer). 251 | 252 | Returns: 253 | float Tensor of shape [batch_size, seq_length, hidden_size] corresponding 254 | to the output of the embedding layer, after summing the word 255 | embeddings with the positional embeddings and the token type embeddings, 256 | then performing layer normalization. This is the input to the transformer. 257 | """ 258 | return self.embedding_output 259 | 260 | def get_embedding_table(self): 261 | return self.embedding_table 262 | 263 | 264 | def gelu(x): 265 | """Gaussian Error Linear Unit. 266 | 267 | This is a smoother version of the RELU. 268 | Original paper: https://arxiv.org/abs/1606.08415 269 | Args: 270 | x: float Tensor to perform activation. 271 | 272 | Returns: 273 | `x` with the GELU activation applied. 274 | """ 275 | cdf = 0.5 * (1.0 + tf.tanh( 276 | (np.sqrt(2 / np.pi) * (x + 0.044715 * tf.pow(x, 3))))) 277 | return x * cdf 278 | 279 | 280 | def get_activation(activation_string): 281 | """Maps a string to a Python function, e.g., "relu" => `tf.nn.relu`. 282 | 283 | Args: 284 | activation_string: String name of the activation function. 285 | 286 | Returns: 287 | A Python function corresponding to the activation function. If 288 | `activation_string` is None, empty, or "linear", this will return None. 289 | If `activation_string` is not a string, it will return `activation_string`. 290 | 291 | Raises: 292 | ValueError: The `activation_string` does not correspond to a known 293 | activation. 294 | """ 295 | 296 | # We assume that anything that"s not a string is already an activation 297 | # function, so we just return it. 298 | if not isinstance(activation_string, six.string_types): 299 | return activation_string 300 | 301 | if not activation_string: 302 | return None 303 | 304 | act = activation_string.lower() 305 | if act == "linear": 306 | return None 307 | elif act == "relu": 308 | return tf.nn.relu 309 | elif act == "gelu": 310 | return gelu 311 | elif act == "t" \ 312 | "anh": 313 | return tf.tanh 314 | else: 315 | raise ValueError("Unsupported activation: %s" % act) 316 | 317 | 318 | def get_assignment_map_from_checkpoint(tvars, init_checkpoint): 319 | """Compute the union of the current variables and checkpoint variables.""" 320 | assignment_map = {} 321 | initialized_variable_names = {} 322 | 323 | name_to_variable = collections.OrderedDict() 324 | for var in tvars: 325 | name = var.name 326 | m = re.match("^(.*):\\d+$", name) 327 | if m is not None: 328 | name = m.group(1) 329 | name_to_variable[name] = var 330 | 331 | init_vars = tf.train.list_variables(init_checkpoint) 332 | 333 | assignment_map = collections.OrderedDict() 334 | for x in init_vars: 335 | (name, var) = (x[0], x[1]) 336 | if name not in name_to_variable: 337 | continue 338 | assignment_map[name] = name 339 | initialized_variable_names[name] = 1 340 | initialized_variable_names[name + ":0"] = 1 341 | 342 | return (assignment_map, initialized_variable_names) 343 | 344 | 345 | def dropout(input_tensor, dropout_prob): 346 | """Perform dropout. 347 | 348 | Args: 349 | input_tensor: float Tensor. 350 | dropout_prob: Python float. The probability of dropping out a value (NOT of 351 | *keeping* a dimension as in `tf.nn.dropout`). 352 | 353 | Returns: 354 | A version of `input_tensor` with dropout applied. 355 | """ 356 | if dropout_prob is None or dropout_prob == 0.0: 357 | return input_tensor 358 | 359 | output = tf.nn.dropout(input_tensor, 1.0 - dropout_prob) 360 | return output 361 | 362 | 363 | def layer_norm(input_tensor, name=None): 364 | """Run layer normalization on the last dimension of the tensor.""" 365 | return tf.contrib.layers.layer_norm( 366 | inputs=input_tensor, begin_norm_axis=-1, begin_params_axis=-1, scope=name) 367 | 368 | 369 | def layer_norm_and_dropout(input_tensor, dropout_prob, name=None): 370 | """Runs layer normalization followed by dropout.""" 371 | output_tensor = layer_norm(input_tensor, name) 372 | output_tensor = dropout(output_tensor, dropout_prob) 373 | return output_tensor 374 | 375 | 376 | def create_initializer(initializer_range=0.02): 377 | """Creates a `truncated_normal_initializer` with the given range.""" 378 | return tf.truncated_normal_initializer(stddev=initializer_range) 379 | 380 | 381 | def embedding_lookup(input_ids, 382 | vocab_size, 383 | embedding_size=128, 384 | initializer_range=0.02, 385 | word_embedding_name="word_embeddings", 386 | use_one_hot_embeddings=False): 387 | """Looks up words embeddings for id tensor. 388 | 389 | Args: 390 | input_ids: int32 Tensor of shape [batch_size, seq_length] containing word 391 | ids. 392 | vocab_size: int. Size of the embedding vocabulary. 393 | embedding_size: int. Width of the word embeddings. 394 | initializer_range: float. Embedding initialization range. 395 | word_embedding_name: string. Name of the embedding table. 396 | use_one_hot_embeddings: bool. If True, use one-hot method for word 397 | embeddings. If False, use `tf.gather()`. 398 | 399 | Returns: 400 | float Tensor of shape [batch_size, seq_length, embedding_size]. 401 | """ 402 | # This function assumes that the input is of shape [batch_size, seq_length, 403 | # num_inputs]. 404 | # 405 | # If the input is a 2D tensor of shape [batch_size, seq_length], we 406 | # reshape to [batch_size, seq_length, 1]. 407 | if input_ids.shape.ndims == 2: 408 | input_ids = tf.expand_dims(input_ids, axis=[-1]) 409 | 410 | embedding_table = tf.get_variable( 411 | name=word_embedding_name, 412 | shape=[vocab_size, embedding_size], 413 | initializer=create_initializer(initializer_range)) 414 | 415 | flat_input_ids = tf.reshape(input_ids, [-1]) 416 | if use_one_hot_embeddings: 417 | one_hot_input_ids = tf.one_hot(flat_input_ids, depth=vocab_size) 418 | output = tf.matmul(one_hot_input_ids, embedding_table) 419 | else: 420 | output = tf.gather(embedding_table, flat_input_ids) 421 | 422 | input_shape = get_shape_list(input_ids) 423 | 424 | output = tf.reshape(output, 425 | input_shape[0:-1] + [input_shape[-1] * embedding_size]) 426 | return (output, embedding_table) 427 | 428 | 429 | def embedding_postprocessor(input_tensor, 430 | use_token_type=False, 431 | token_type_ids=None, 432 | token_type_vocab_size=16, 433 | token_type_embedding_name="token_type_embeddings", 434 | use_position_embeddings=True, 435 | position_embedding_name="position_embeddings", 436 | initializer_range=0.02, 437 | max_position_embeddings=512, 438 | dropout_prob=0.1): 439 | """Performs various post-processing on a word embedding tensor. 440 | 441 | Args: 442 | input_tensor: float Tensor of shape [batch_size, seq_length, 443 | embedding_size]. 444 | use_token_type: bool. Whether to add embeddings for `token_type_ids`. 445 | token_type_ids: (optional) int32 Tensor of shape [batch_size, seq_length]. 446 | Must be specified if `use_token_type` is True. 447 | token_type_vocab_size: int. The vocabulary size of `token_type_ids`. 448 | token_type_embedding_name: string. The name of the embedding table variable 449 | for token type ids. 450 | use_position_embeddings: bool. Whether to add position embeddings for the 451 | position of each token in the sequence. 452 | position_embedding_name: string. The name of the embedding table variable 453 | for positional embeddings. 454 | initializer_range: float. Range of the weight initialization. 455 | max_position_embeddings: int. Maximum sequence length that might ever be 456 | used with this model. This can be longer than the sequence length of 457 | input_tensor, but cannot be shorter. 458 | dropout_prob: float. Dropout probability applied to the final output tensor. 459 | 460 | Returns: 461 | float tensor with same shape as `input_tensor`. 462 | 463 | Raises: 464 | ValueError: One of the tensor shapes or input values is invalid. 465 | """ 466 | input_shape = get_shape_list(input_tensor, expected_rank=3) 467 | batch_size = input_shape[0] 468 | seq_length = input_shape[1] 469 | width = input_shape[2] 470 | 471 | output = input_tensor 472 | 473 | if use_token_type: 474 | if token_type_ids is None: 475 | raise ValueError("`token_type_ids` must be specified if" 476 | "`use_token_type` is True.") 477 | token_type_table = tf.get_variable( 478 | name=token_type_embedding_name, 479 | shape=[token_type_vocab_size, width], 480 | initializer=create_initializer(initializer_range)) 481 | # This vocab will be small so we always do one-hot here, since it is always 482 | # faster for a small vocabulary. 483 | flat_token_type_ids = tf.reshape(token_type_ids, [-1]) 484 | one_hot_ids = tf.one_hot(flat_token_type_ids, depth=token_type_vocab_size) 485 | token_type_embeddings = tf.matmul(one_hot_ids, token_type_table) 486 | token_type_embeddings = tf.reshape(token_type_embeddings, 487 | [batch_size, seq_length, width]) 488 | output += token_type_embeddings 489 | 490 | if use_position_embeddings: 491 | assert_op = tf.assert_less_equal(seq_length, max_position_embeddings) 492 | with tf.control_dependencies([assert_op]): 493 | full_position_embeddings = tf.get_variable( 494 | name=position_embedding_name, 495 | shape=[max_position_embeddings, width], 496 | initializer=create_initializer(initializer_range)) 497 | # Since the position embedding table is a learned variable, we create it 498 | # using a (long) sequence length `max_position_embeddings`. The actual 499 | # sequence length might be shorter than this, for faster training of 500 | # tasks that do not have long sequences. 501 | # 502 | # So `full_position_embeddings` is effectively an embedding table 503 | # for position [0, 1, 2, ..., max_position_embeddings-1], and the current 504 | # sequence has positions [0, 1, 2, ... seq_length-1], so we can just 505 | # perform a slice. 506 | position_embeddings = tf.slice(full_position_embeddings, [0, 0], 507 | [seq_length, -1]) 508 | num_dims = len(output.shape.as_list()) 509 | 510 | # Only the last two dimensions are relevant (`seq_length` and `width`), so 511 | # we broadcast among the first dimensions, which is typically just 512 | # the batch size. 513 | position_broadcast_shape = [] 514 | for _ in range(num_dims - 2): 515 | position_broadcast_shape.append(1) 516 | position_broadcast_shape.extend([seq_length, width]) 517 | position_embeddings = tf.reshape(position_embeddings, 518 | position_broadcast_shape) 519 | output += position_embeddings 520 | 521 | output = layer_norm_and_dropout(output, dropout_prob) 522 | return output 523 | 524 | 525 | def create_attention_mask_from_input_mask(from_tensor, to_mask): 526 | """Create 3D attention mask from a 2D tensor mask. 527 | 528 | Args: 529 | from_tensor: 2D or 3D Tensor of shape [batch_size, from_seq_length, ...]. 530 | to_mask: int32 Tensor of shape [batch_size, to_seq_length]. 531 | 532 | Returns: 533 | float Tensor of shape [batch_size, from_seq_length, to_seq_length]. 534 | """ 535 | from_shape = get_shape_list(from_tensor, expected_rank=[2, 3]) 536 | batch_size = from_shape[0] 537 | from_seq_length = from_shape[1] 538 | 539 | to_shape = get_shape_list(to_mask, expected_rank=2) 540 | to_seq_length = to_shape[1] 541 | 542 | to_mask = tf.cast( 543 | tf.reshape(to_mask, [batch_size, 1, to_seq_length]), tf.float32) 544 | 545 | # We don't assume that `from_tensor` is a mask (although it could be). We 546 | # don't actually care if we attend *from* padding tokens (only *to* padding) 547 | # tokens so we create a tensor of all ones. 548 | # 549 | # `broadcast_ones` = [batch_size, from_seq_length, 1] 550 | broadcast_ones = tf.ones( 551 | shape=[batch_size, from_seq_length, 1], dtype=tf.float32) 552 | 553 | # Here we broadcast along two dimensions to create the mask. 554 | mask = broadcast_ones * to_mask 555 | 556 | return mask 557 | 558 | 559 | def attention_layer(from_tensor, 560 | to_tensor, 561 | attention_mask=None, 562 | num_attention_heads=1, 563 | size_per_head=512, 564 | query_act=None, 565 | key_act=None, 566 | value_act=None, 567 | attention_probs_dropout_prob=0.0, 568 | initializer_range=0.02, 569 | do_return_2d_tensor=False, 570 | batch_size=None, 571 | from_seq_length=None, 572 | to_seq_length=None): 573 | """Performs multi-headed attention from `from_tensor` to `to_tensor`. 574 | 575 | This is an implementation of multi-headed attention based on "Attention 576 | is all you Need". If `from_tensor` and `to_tensor` are the same, then 577 | this is self-attention. Each timestep in `from_tensor` attends to the 578 | corresponding sequence in `to_tensor`, and returns a fixed-with vector. 579 | 580 | This function first projects `from_tensor` into a "query" tensor and 581 | `to_tensor` into "key" and "value" tensors. These are (effectively) a list 582 | of tensors of length `num_attention_heads`, where each tensor is of shape 583 | [batch_size, seq_length, size_per_head]. 584 | 585 | Then, the query and key tensors are dot-producted and scaled. These are 586 | softmaxed to obtain attention probabilities. The value tensors are then 587 | interpolated by these probabilities, then concatenated back to a single 588 | tensor and returned. 589 | 590 | In practice, the multi-headed attention are done with transposes and 591 | reshapes rather than actual separate tensors. 592 | 593 | Args: 594 | from_tensor: float Tensor of shape [batch_size, from_seq_length, 595 | from_width]. 596 | to_tensor: float Tensor of shape [batch_size, to_seq_length, to_width]. 597 | attention_mask: (optional) int32 Tensor of shape [batch_size, 598 | from_seq_length, to_seq_length]. The values should be 1 or 0. The 599 | attention scores will effectively be set to -infinity for any positions in 600 | the mask that are 0, and will be unchanged for positions that are 1. 601 | num_attention_heads: int. Number of attention heads. 602 | size_per_head: int. Size of each attention head. 603 | query_act: (optional) Activation function for the query transform. 604 | key_act: (optional) Activation function for the key transform. 605 | value_act: (optional) Activation function for the value transform. 606 | attention_probs_dropout_prob: (optional) float. Dropout probability of the 607 | attention probabilities. 608 | initializer_range: float. Range of the weight initializer. 609 | do_return_2d_tensor: bool. If True, the output will be of shape [batch_size 610 | * from_seq_length, num_attention_heads * size_per_head]. If False, the 611 | output will be of shape [batch_size, from_seq_length, num_attention_heads 612 | * size_per_head]. 613 | batch_size: (Optional) int. If the input is 2D, this might be the batch size 614 | of the 3D version of the `from_tensor` and `to_tensor`. 615 | from_seq_length: (Optional) If the input is 2D, this might be the seq length 616 | of the 3D version of the `from_tensor`. 617 | to_seq_length: (Optional) If the input is 2D, this might be the seq length 618 | of the 3D version of the `to_tensor`. 619 | 620 | Returns: 621 | float Tensor of shape [batch_size, from_seq_length, 622 | num_attention_heads * size_per_head]. (If `do_return_2d_tensor` is 623 | true, this will be of shape [batch_size * from_seq_length, 624 | num_attention_heads * size_per_head]). 625 | 626 | Raises: 627 | ValueError: Any of the arguments or tensor shapes are invalid. 628 | """ 629 | 630 | def transpose_for_scores(input_tensor, batch_size, num_attention_heads, 631 | seq_length, width): 632 | output_tensor = tf.reshape( 633 | input_tensor, [batch_size, seq_length, num_attention_heads, width]) 634 | 635 | output_tensor = tf.transpose(output_tensor, [0, 2, 1, 3]) 636 | return output_tensor 637 | 638 | from_shape = get_shape_list(from_tensor, expected_rank=[2, 3]) 639 | to_shape = get_shape_list(to_tensor, expected_rank=[2, 3]) 640 | 641 | if len(from_shape) != len(to_shape): 642 | raise ValueError( 643 | "The rank of `from_tensor` must match the rank of `to_tensor`.") 644 | 645 | if len(from_shape) == 3: 646 | batch_size = from_shape[0] 647 | from_seq_length = from_shape[1] 648 | to_seq_length = to_shape[1] 649 | elif len(from_shape) == 2: 650 | if (batch_size is None or from_seq_length is None or to_seq_length is None): 651 | raise ValueError( 652 | "When passing in rank 2 tensors to attention_layer, the values " 653 | "for `batch_size`, `from_seq_length`, and `to_seq_length` " 654 | "must all be specified.") 655 | 656 | # Scalar dimensions referenced here: 657 | # B = batch size (number of sequences) 658 | # F = `from_tensor` sequence length 659 | # T = `to_tensor` sequence length 660 | # N = `num_attention_heads` 661 | # H = `size_per_head` 662 | 663 | from_tensor_2d = reshape_to_matrix(from_tensor) 664 | to_tensor_2d = reshape_to_matrix(to_tensor) 665 | 666 | # `query_layer` = [B*F, N*H] 667 | query_layer = tf.layers.dense( 668 | from_tensor_2d, 669 | num_attention_heads * size_per_head, 670 | activation=query_act, 671 | name="query", 672 | kernel_initializer=create_initializer(initializer_range)) 673 | 674 | # `key_layer` = [B*T, N*H] 675 | key_layer = tf.layers.dense( 676 | to_tensor_2d, 677 | num_attention_heads * size_per_head, 678 | activation=key_act, 679 | name="key", 680 | kernel_initializer=create_initializer(initializer_range)) 681 | 682 | # `value_layer` = [B*T, N*H] 683 | value_layer = tf.layers.dense( 684 | to_tensor_2d, 685 | num_attention_heads * size_per_head, 686 | activation=value_act, 687 | name="value", 688 | kernel_initializer=create_initializer(initializer_range)) 689 | 690 | # `query_layer` = [B, N, F, H] 691 | query_layer = transpose_for_scores(query_layer, batch_size, 692 | num_attention_heads, from_seq_length, 693 | size_per_head) 694 | 695 | # `key_layer` = [B, N, T, H] 696 | key_layer = transpose_for_scores(key_layer, batch_size, num_attention_heads, 697 | to_seq_length, size_per_head) 698 | 699 | # Take the dot product between "query" and "key" to get the raw 700 | # attention scores. 701 | # `attention_scores` = [B, N, F, T] 702 | attention_scores = tf.matmul(query_layer, key_layer, transpose_b=True) 703 | attention_scores = tf.multiply(attention_scores, 704 | 1.0 / math.sqrt(float(size_per_head))) 705 | 706 | if attention_mask is not None: 707 | # `attention_mask` = [B, 1, F, T] 708 | attention_mask = tf.expand_dims(attention_mask, axis=[1]) 709 | 710 | # Since attention_mask is 1.0 for positions we want to attend and 0.0 for 711 | # masked positions, this operation will create a tensor which is 0.0 for 712 | # positions we want to attend and -10000.0 for masked positions. 713 | adder = (1.0 - tf.cast(attention_mask, tf.float32)) * -10000.0 714 | 715 | # Since we are adding it to the raw scores before the softmax, this is 716 | # effectively the same as removing these entirely. 717 | attention_scores += adder 718 | 719 | # Normalize the attention scores to probabilities. 720 | # `attention_probs` = [B, N, F, T] 721 | attention_probs = tf.nn.softmax(attention_scores) 722 | 723 | # This is actually dropping out entire tokens to attend to, which might 724 | # seem a bit unusual, but is taken from the original Transformer paper. 725 | attention_probs = dropout(attention_probs, attention_probs_dropout_prob) 726 | 727 | # `value_layer` = [B, T, N, H] 728 | value_layer = tf.reshape( 729 | value_layer, 730 | [batch_size, to_seq_length, num_attention_heads, size_per_head]) 731 | 732 | # `value_layer` = [B, N, T, H] 733 | value_layer = tf.transpose(value_layer, [0, 2, 1, 3]) 734 | 735 | # `context_layer` = [B, N, F, H] 736 | context_layer = tf.matmul(attention_probs, value_layer) 737 | 738 | # `context_layer` = [B, F, N, H] 739 | context_layer = tf.transpose(context_layer, [0, 2, 1, 3]) 740 | 741 | if do_return_2d_tensor: 742 | # `context_layer` = [B*F, N*H] 743 | context_layer = tf.reshape( 744 | context_layer, 745 | [batch_size * from_seq_length, num_attention_heads * size_per_head]) 746 | else: 747 | # `context_layer` = [B, F, N*H] 748 | context_layer = tf.reshape( 749 | context_layer, 750 | [batch_size, from_seq_length, num_attention_heads * size_per_head]) 751 | 752 | return context_layer 753 | 754 | 755 | def transformer_model(input_tensor, 756 | attention_mask=None, 757 | hidden_size=768, 758 | num_hidden_layers=12, 759 | num_attention_heads=12, 760 | intermediate_size=3072, 761 | intermediate_act_fn=gelu, 762 | hidden_dropout_prob=0.1, 763 | attention_probs_dropout_prob=0.1, 764 | initializer_range=0.02, 765 | do_return_all_layers=False): 766 | """Multi-headed, multi-layer Transformer from "Attention is All You Need". 767 | 768 | This is almost an exact implementation of the original Transformer encoder. 769 | 770 | See the original paper: 771 | https://arxiv.org/abs/1706.03762 772 | 773 | Also see: 774 | https://github.com/tensorflow/tensor2tensor/blob/master/tensor2tensor/models/transformer.py 775 | 776 | Args: 777 | input_tensor: float Tensor of shape [batch_size, seq_length, hidden_size]. 778 | attention_mask: (optional) int32 Tensor of shape [batch_size, seq_length, 779 | seq_length], with 1 for positions that can be attended to and 0 in 780 | positions that should not be. 781 | hidden_size: int. Hidden size of the Transformer. 782 | num_hidden_layers: int. Number of layers (blocks) in the Transformer. 783 | num_attention_heads: int. Number of attention heads in the Transformer. 784 | intermediate_size: int. The size of the "intermediate" (a.k.a., feed 785 | forward) layer. 786 | intermediate_act_fn: function. The non-linear activation function to apply 787 | to the output of the intermediate/feed-forward layer. 788 | hidden_dropout_prob: float. Dropout probability for the hidden layers. 789 | attention_probs_dropout_prob: float. Dropout probability of the attention 790 | probabilities. 791 | initializer_range: float. Range of the initializer (stddev of truncated 792 | normal). 793 | do_return_all_layers: Whether to also return all layers or just the final 794 | layer. 795 | 796 | Returns: 797 | float Tensor of shape [batch_size, seq_length, hidden_size], the final 798 | hidden layer of the Transformer. 799 | 800 | Raises: 801 | ValueError: A Tensor shape or parameter is invalid. 802 | """ 803 | if hidden_size % num_attention_heads != 0: 804 | raise ValueError( 805 | "The hidden size (%d) is not a multiple of the number of attention " 806 | "heads (%d)" % (hidden_size, num_attention_heads)) 807 | 808 | attention_head_size = int(hidden_size / num_attention_heads) 809 | input_shape = get_shape_list(input_tensor, expected_rank=3) 810 | batch_size = input_shape[0] 811 | seq_length = input_shape[1] 812 | input_width = input_shape[2] 813 | 814 | # The Transformer performs sum residuals on all layers so the input needs 815 | # to be the same as the hidden size. 816 | if input_width != hidden_size: 817 | raise ValueError("The width of the input tensor (%d) != hidden size (%d)" % 818 | (input_width, hidden_size)) 819 | 820 | # We keep the representation as a 2D tensor to avoid re-shaping it back and 821 | # forth from a 3D tensor to a 2D tensor. Re-shapes are normally free on 822 | # the GPU/CPU but may not be free on the TPU, so we want to minimize them to 823 | # help the optimizer. 824 | prev_output = reshape_to_matrix(input_tensor) 825 | 826 | all_layer_outputs = [] 827 | for layer_idx in range(num_hidden_layers): 828 | with tf.variable_scope("layer_%d" % layer_idx): 829 | layer_input = prev_output 830 | 831 | with tf.variable_scope("attention"): 832 | attention_heads = [] 833 | with tf.variable_scope("self"): 834 | attention_head = attention_layer( 835 | from_tensor=layer_input, 836 | to_tensor=layer_input, 837 | attention_mask=attention_mask, 838 | num_attention_heads=num_attention_heads, 839 | size_per_head=attention_head_size, 840 | attention_probs_dropout_prob=attention_probs_dropout_prob, 841 | initializer_range=initializer_range, 842 | do_return_2d_tensor=True, 843 | batch_size=batch_size, 844 | from_seq_length=seq_length, 845 | to_seq_length=seq_length) 846 | attention_heads.append(attention_head) 847 | 848 | attention_output = None 849 | if len(attention_heads) == 1: 850 | attention_output = attention_heads[0] 851 | else: 852 | # In the case where we have other sequences, we just concatenate 853 | # them to the self-attention head before the projection. 854 | attention_output = tf.concat(attention_heads, axis=-1) 855 | 856 | # Run a linear projection of `hidden_size` then add a residual 857 | # with `layer_input`. 858 | with tf.variable_scope("output"): 859 | attention_output = tf.layers.dense( 860 | attention_output, 861 | hidden_size, 862 | kernel_initializer=create_initializer(initializer_range)) 863 | attention_output = dropout(attention_output, hidden_dropout_prob) 864 | attention_output = layer_norm(attention_output + layer_input) 865 | 866 | # The activation is only applied to the "intermediate" hidden layer. 867 | with tf.variable_scope("intermediate"): 868 | intermediate_output = tf.layers.dense( 869 | attention_output, 870 | intermediate_size, 871 | activation=intermediate_act_fn, 872 | kernel_initializer=create_initializer(initializer_range)) 873 | 874 | # Down-project back to `hidden_size` then add the residual. 875 | with tf.variable_scope("output"): 876 | layer_output = tf.layers.dense( 877 | intermediate_output, 878 | hidden_size, 879 | kernel_initializer=create_initializer(initializer_range)) 880 | layer_output = dropout(layer_output, hidden_dropout_prob) 881 | layer_output = layer_norm(layer_output + attention_output) 882 | prev_output = layer_output 883 | all_layer_outputs.append(layer_output) 884 | 885 | if do_return_all_layers: 886 | final_outputs = [] 887 | for layer_output in all_layer_outputs: 888 | final_output = reshape_from_matrix(layer_output, input_shape) 889 | final_outputs.append(final_output) 890 | return final_outputs 891 | else: 892 | final_output = reshape_from_matrix(prev_output, input_shape) 893 | return final_output 894 | 895 | 896 | def get_shape_list(tensor, expected_rank=None, name=None): 897 | """Returns a list of the shape of tensor, preferring static dimensions. 898 | 899 | Args: 900 | tensor: A tf.Tensor object to find the shape of. 901 | expected_rank: (optional) int. The expected rank of `tensor`. If this is 902 | specified and the `tensor` has a different rank, and exception will be 903 | thrown. 904 | name: Optional name of the tensor for the error message. 905 | 906 | Returns: 907 | A list of dimensions of the shape of tensor. All static dimensions will 908 | be returned as python integers, and dynamic dimensions will be returned 909 | as tf.Tensor scalars. 910 | """ 911 | if name is None: 912 | name = tensor.name 913 | 914 | if expected_rank is not None: 915 | assert_rank(tensor, expected_rank, name) 916 | 917 | shape = tensor.shape.as_list() 918 | 919 | non_static_indexes = [] 920 | for (index, dim) in enumerate(shape): 921 | if dim is None: 922 | non_static_indexes.append(index) 923 | 924 | if not non_static_indexes: 925 | return shape 926 | 927 | dyn_shape = tf.shape(tensor) 928 | for index in non_static_indexes: 929 | shape[index] = dyn_shape[index] 930 | return shape 931 | 932 | 933 | def reshape_to_matrix(input_tensor): 934 | """Reshapes a >= rank 2 tensor to a rank 2 tensor (i.e., a matrix).""" 935 | ndims = input_tensor.shape.ndims 936 | if ndims < 2: 937 | raise ValueError("Input tensor must have at least rank 2. Shape = %s" % 938 | (input_tensor.shape)) 939 | if ndims == 2: 940 | return input_tensor 941 | 942 | width = input_tensor.shape[-1] 943 | output_tensor = tf.reshape(input_tensor, [-1, width]) 944 | return output_tensor 945 | 946 | 947 | def reshape_from_matrix(output_tensor, orig_shape_list): 948 | """Reshapes a rank 2 tensor back to its original rank >= 2 tensor.""" 949 | if len(orig_shape_list) == 2: 950 | return output_tensor 951 | 952 | output_shape = get_shape_list(output_tensor) 953 | 954 | orig_dims = orig_shape_list[0:-1] 955 | width = output_shape[-1] 956 | 957 | return tf.reshape(output_tensor, orig_dims + [width]) 958 | 959 | 960 | def assert_rank(tensor, expected_rank, name=None): 961 | """Raises an exception if the tensor rank is not of the expected rank. 962 | 963 | Args: 964 | tensor: A tf.Tensor to check the rank of. 965 | expected_rank: Python integer or list of integers, expected rank. 966 | name: Optional name of the tensor for the error message. 967 | 968 | Raises: 969 | ValueError: If the expected shape doesn't match the actual shape. 970 | """ 971 | if name is None: 972 | name = tensor.name 973 | 974 | expected_rank_dict = {} 975 | if isinstance(expected_rank, six.integer_types): 976 | expected_rank_dict[expected_rank] = True 977 | else: 978 | for x in expected_rank: 979 | expected_rank_dict[x] = True 980 | 981 | actual_rank = tensor.shape.ndims 982 | if actual_rank not in expected_rank_dict: 983 | scope_name = tf.get_variable_scope().name 984 | raise ValueError( 985 | "For the tensor `%s` in scope `%s`, the actual rank " 986 | "`%d` (shape = %s) is not equal to the expected rank `%s`" % 987 | (name, scope_name, actual_rank, str(tensor.shape), str(expected_rank))) 988 | -------------------------------------------------------------------------------- /modeling_test.py: -------------------------------------------------------------------------------- 1 | # coding=utf-8 2 | # Copyright 2018 The Google AI Language Team Authors. 3 | # 4 | # Licensed under the Apache License, Version 2.0 (the "License"); 5 | # you may not use this file except in compliance with the License. 6 | # You may obtain a copy of the License at 7 | # 8 | # http://www.apache.org/licenses/LICENSE-2.0 9 | # 10 | # Unless required by applicable law or agreed to in writing, software 11 | # distributed under the License is distributed on an "AS IS" BASIS, 12 | # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. 13 | # See the License for the specific language governing permissions and 14 | # limitations under the License. 15 | from __future__ import absolute_import 16 | from __future__ import division 17 | from __future__ import print_function 18 | 19 | import collections 20 | import json 21 | import random 22 | import re 23 | 24 | import modeling 25 | import six 26 | import tensorflow as tf 27 | 28 | 29 | class BertModelTest(tf.test.TestCase): 30 | 31 | class BertModelTester(object): 32 | 33 | def __init__(self, 34 | parent, 35 | batch_size=13, 36 | seq_length=7, 37 | is_training=True, 38 | use_input_mask=True, 39 | use_token_type_ids=True, 40 | vocab_size=99, 41 | hidden_size=32, 42 | num_hidden_layers=5, 43 | num_attention_heads=4, 44 | intermediate_size=37, 45 | hidden_act="gelu", 46 | hidden_dropout_prob=0.1, 47 | attention_probs_dropout_prob=0.1, 48 | max_position_embeddings=512, 49 | type_vocab_size=16, 50 | initializer_range=0.02, 51 | scope=None): 52 | self.parent = parent 53 | self.batch_size = batch_size 54 | self.seq_length = seq_length 55 | self.is_training = is_training 56 | self.use_input_mask = use_input_mask 57 | self.use_token_type_ids = use_token_type_ids 58 | self.vocab_size = vocab_size 59 | self.hidden_size = hidden_size 60 | self.num_hidden_layers = num_hidden_layers 61 | self.num_attention_heads = num_attention_heads 62 | self.intermediate_size = intermediate_size 63 | self.hidden_act = hidden_act 64 | self.hidden_dropout_prob = hidden_dropout_prob 65 | self.attention_probs_dropout_prob = attention_probs_dropout_prob 66 | self.max_position_embeddings = max_position_embeddings 67 | self.type_vocab_size = type_vocab_size 68 | self.initializer_range = initializer_range 69 | self.scope = scope 70 | 71 | def create_model(self): 72 | input_ids = BertModelTest.ids_tensor([self.batch_size, self.seq_length], 73 | self.vocab_size) 74 | 75 | input_mask = None 76 | if self.use_input_mask: 77 | input_mask = BertModelTest.ids_tensor( 78 | [self.batch_size, self.seq_length], vocab_size=2) 79 | 80 | token_type_ids = None 81 | if self.use_token_type_ids: 82 | token_type_ids = BertModelTest.ids_tensor( 83 | [self.batch_size, self.seq_length], self.type_vocab_size) 84 | 85 | config = modeling.BertConfig( 86 | vocab_size=self.vocab_size, 87 | hidden_size=self.hidden_size, 88 | num_hidden_layers=self.num_hidden_layers, 89 | num_attention_heads=self.num_attention_heads, 90 | intermediate_size=self.intermediate_size, 91 | hidden_act=self.hidden_act, 92 | hidden_dropout_prob=self.hidden_dropout_prob, 93 | attention_probs_dropout_prob=self.attention_probs_dropout_prob, 94 | max_position_embeddings=self.max_position_embeddings, 95 | type_vocab_size=self.type_vocab_size, 96 | initializer_range=self.initializer_range) 97 | 98 | model = modeling.BertModel( 99 | config=config, 100 | is_training=self.is_training, 101 | input_ids=input_ids, 102 | input_mask=input_mask, 103 | token_type_ids=token_type_ids, 104 | scope=self.scope) 105 | 106 | outputs = { 107 | "embedding_output": model.get_embedding_output(), 108 | "sequence_output": model.get_sequence_output(), 109 | "pooled_output": model.get_pooled_output(), 110 | "all_encoder_layers": model.get_all_encoder_layers(), 111 | } 112 | return outputs 113 | 114 | def check_output(self, result): 115 | self.parent.assertAllEqual( 116 | result["embedding_output"].shape, 117 | [self.batch_size, self.seq_length, self.hidden_size]) 118 | 119 | self.parent.assertAllEqual( 120 | result["sequence_output"].shape, 121 | [self.batch_size, self.seq_length, self.hidden_size]) 122 | 123 | self.parent.assertAllEqual(result["pooled_output"].shape, 124 | [self.batch_size, self.hidden_size]) 125 | 126 | def test_default(self): 127 | self.run_tester(BertModelTest.BertModelTester(self)) 128 | 129 | def test_config_to_json_string(self): 130 | config = modeling.BertConfig(vocab_size=99, hidden_size=37) 131 | obj = json.loads(config.to_json_string()) 132 | self.assertEqual(obj["vocab_size"], 99) 133 | self.assertEqual(obj["hidden_size"], 37) 134 | 135 | def run_tester(self, tester): 136 | with self.test_session() as sess: 137 | ops = tester.create_model() 138 | init_op = tf.group(tf.global_variables_initializer(), 139 | tf.local_variables_initializer()) 140 | sess.run(init_op) 141 | output_result = sess.run(ops) 142 | tester.check_output(output_result) 143 | 144 | self.assert_all_tensors_reachable(sess, [init_op, ops]) 145 | 146 | @classmethod 147 | def ids_tensor(cls, shape, vocab_size, rng=None, name=None): 148 | """Creates a random int32 tensor of the shape within the vocab size.""" 149 | if rng is None: 150 | rng = random.Random() 151 | 152 | total_dims = 1 153 | for dim in shape: 154 | total_dims *= dim 155 | 156 | values = [] 157 | for _ in range(total_dims): 158 | values.append(rng.randint(0, vocab_size - 1)) 159 | 160 | return tf.constant(value=values, dtype=tf.int32, shape=shape, name=name) 161 | 162 | def assert_all_tensors_reachable(self, sess, outputs): 163 | """Checks that all the tensors in the graph are reachable from outputs.""" 164 | graph = sess.graph 165 | 166 | ignore_strings = [ 167 | "^.*/assert_less_equal/.*$", 168 | "^.*/dilation_rate$", 169 | "^.*/Tensordot/concat$", 170 | "^.*/Tensordot/concat/axis$", 171 | "^testing/.*$", 172 | ] 173 | 174 | ignore_regexes = [re.compile(x) for x in ignore_strings] 175 | 176 | unreachable = self.get_unreachable_ops(graph, outputs) 177 | filtered_unreachable = [] 178 | for x in unreachable: 179 | do_ignore = False 180 | for r in ignore_regexes: 181 | m = r.match(x.name) 182 | if m is not None: 183 | do_ignore = True 184 | if do_ignore: 185 | continue 186 | filtered_unreachable.append(x) 187 | unreachable = filtered_unreachable 188 | 189 | self.assertEqual( 190 | len(unreachable), 0, "The following ops are unreachable: %s" % 191 | (" ".join([x.name for x in unreachable]))) 192 | 193 | @classmethod 194 | def get_unreachable_ops(cls, graph, outputs): 195 | """Finds all of the tensors in graph that are unreachable from outputs.""" 196 | outputs = cls.flatten_recursive(outputs) 197 | output_to_op = collections.defaultdict(list) 198 | op_to_all = collections.defaultdict(list) 199 | assign_out_to_in = collections.defaultdict(list) 200 | 201 | for op in graph.get_operations(): 202 | for x in op.inputs: 203 | op_to_all[op.name].append(x.name) 204 | for y in op.outputs: 205 | output_to_op[y.name].append(op.name) 206 | op_to_all[op.name].append(y.name) 207 | if str(op.type) == "Assign": 208 | for y in op.outputs: 209 | for x in op.inputs: 210 | assign_out_to_in[y.name].append(x.name) 211 | 212 | assign_groups = collections.defaultdict(list) 213 | for out_name in assign_out_to_in.keys(): 214 | name_group = assign_out_to_in[out_name] 215 | for n1 in name_group: 216 | assign_groups[n1].append(out_name) 217 | for n2 in name_group: 218 | if n1 != n2: 219 | assign_groups[n1].append(n2) 220 | 221 | seen_tensors = {} 222 | stack = [x.name for x in outputs] 223 | while stack: 224 | name = stack.pop() 225 | if name in seen_tensors: 226 | continue 227 | seen_tensors[name] = True 228 | 229 | if name in output_to_op: 230 | for op_name in output_to_op[name]: 231 | if op_name in op_to_all: 232 | for input_name in op_to_all[op_name]: 233 | if input_name not in stack: 234 | stack.append(input_name) 235 | 236 | expanded_names = [] 237 | if name in assign_groups: 238 | for assign_name in assign_groups[name]: 239 | expanded_names.append(assign_name) 240 | 241 | for expanded_name in expanded_names: 242 | if expanded_name not in stack: 243 | stack.append(expanded_name) 244 | 245 | unreachable_ops = [] 246 | for op in graph.get_operations(): 247 | is_unreachable = False 248 | all_names = [x.name for x in op.inputs] + [x.name for x in op.outputs] 249 | for name in all_names: 250 | if name not in seen_tensors: 251 | is_unreachable = True 252 | if is_unreachable: 253 | unreachable_ops.append(op) 254 | return unreachable_ops 255 | 256 | @classmethod 257 | def flatten_recursive(cls, item): 258 | """Flattens (potentially nested) a tuple/dictionary/list to a list.""" 259 | output = [] 260 | if isinstance(item, list): 261 | output.extend(item) 262 | elif isinstance(item, tuple): 263 | output.extend(list(item)) 264 | elif isinstance(item, dict): 265 | for (_, v) in six.iteritems(item): 266 | output.append(v) 267 | else: 268 | return [item] 269 | 270 | flat_output = [] 271 | for x in output: 272 | flat_output.extend(cls.flatten_recursive(x)) 273 | return flat_output 274 | 275 | 276 | if __name__ == "__main__": 277 | tf.test.main() 278 | -------------------------------------------------------------------------------- /optimization.py: -------------------------------------------------------------------------------- 1 | # coding=utf-8 2 | # Copyright 2018 The Google AI Language Team Authors. 3 | # 4 | # Licensed under the Apache License, Version 2.0 (the "License"); 5 | # you may not use this file except in compliance with the License. 6 | # You may obtain a copy of the License at 7 | # 8 | # http://www.apache.org/licenses/LICENSE-2.0 9 | # 10 | # Unless required by applicable law or agreed to in writing, software 11 | # distributed under the License is distributed on an "AS IS" BASIS, 12 | # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. 13 | # See the License for the specific language governing permissions and 14 | # limitations under the License. 15 | """Functions and classes related to optimization (weight updates).""" 16 | 17 | from __future__ import absolute_import 18 | from __future__ import division 19 | from __future__ import print_function 20 | 21 | import re 22 | import tensorflow as tf 23 | 24 | 25 | def create_optimizer(loss, init_lr, num_train_steps, num_warmup_steps, use_tpu): 26 | """Creates an optimizer training op.""" 27 | global_step = tf.train.get_or_create_global_step() 28 | 29 | learning_rate = tf.constant(value=init_lr, shape=[], dtype=tf.float32) 30 | 31 | # Implements linear decay of the learning rate. 32 | learning_rate = tf.train.polynomial_decay( 33 | learning_rate, 34 | global_step, 35 | num_train_steps, 36 | end_learning_rate=0.0, 37 | power=1.0, 38 | cycle=False) 39 | 40 | # Implements linear warmup. I.e., if global_step < num_warmup_steps, the 41 | # learning rate will be `global_step/num_warmup_steps * init_lr`. 42 | if num_warmup_steps: 43 | global_steps_int = tf.cast(global_step, tf.int32) 44 | warmup_steps_int = tf.constant(num_warmup_steps, dtype=tf.int32) 45 | 46 | global_steps_float = tf.cast(global_steps_int, tf.float32) 47 | warmup_steps_float = tf.cast(warmup_steps_int, tf.float32) 48 | 49 | warmup_percent_done = global_steps_float / warmup_steps_float 50 | warmup_learning_rate = init_lr * warmup_percent_done 51 | 52 | is_warmup = tf.cast(global_steps_int < warmup_steps_int, tf.float32) 53 | learning_rate = ( 54 | (1.0 - is_warmup) * learning_rate + is_warmup * warmup_learning_rate) 55 | 56 | # It is recommended that you use this optimizer for fine tuning, since this 57 | # is how the model was trained (note that the Adam m/v variables are NOT 58 | # loaded from init_checkpoint.) 59 | optimizer = AdamWeightDecayOptimizer( 60 | learning_rate=learning_rate, 61 | weight_decay_rate=0.01, 62 | beta_1=0.9, 63 | beta_2=0.999, 64 | epsilon=1e-6, 65 | exclude_from_weight_decay=["LayerNorm", "layer_norm", "bias"]) 66 | 67 | if use_tpu: 68 | optimizer = tf.contrib.tpu.CrossShardOptimizer(optimizer) 69 | 70 | tvars = tf.trainable_variables() 71 | grads = tf.gradients(loss, tvars) 72 | 73 | # This is how the model was pre-trained. 74 | (grads, _) = tf.clip_by_global_norm(grads, clip_norm=1.0) 75 | 76 | train_op = optimizer.apply_gradients( 77 | zip(grads, tvars), global_step=global_step) 78 | 79 | # Normally the global step update is done inside of `apply_gradients`. 80 | # However, `AdamWeightDecayOptimizer` doesn't do this. But if you use 81 | # a different optimizer, you should probably take this line out. 82 | new_global_step = global_step + 1 83 | train_op = tf.group(train_op, [global_step.assign(new_global_step)]) 84 | return train_op 85 | 86 | 87 | class AdamWeightDecayOptimizer(tf.train.Optimizer): 88 | """A basic Adam optimizer that includes "correct" L2 weight decay.""" 89 | 90 | def __init__(self, 91 | learning_rate, 92 | weight_decay_rate=0.0, 93 | beta_1=0.9, 94 | beta_2=0.999, 95 | epsilon=1e-6, 96 | exclude_from_weight_decay=None, 97 | name="AdamWeightDecayOptimizer"): 98 | """Constructs a AdamWeightDecayOptimizer.""" 99 | super(AdamWeightDecayOptimizer, self).__init__(False, name) 100 | 101 | self.learning_rate = learning_rate 102 | self.weight_decay_rate = weight_decay_rate 103 | self.beta_1 = beta_1 104 | self.beta_2 = beta_2 105 | self.epsilon = epsilon 106 | self.exclude_from_weight_decay = exclude_from_weight_decay 107 | 108 | def apply_gradients(self, grads_and_vars, global_step=None, name=None): 109 | """See base class.""" 110 | assignments = [] 111 | for (grad, param) in grads_and_vars: 112 | if grad is None or param is None: 113 | continue 114 | 115 | param_name = self._get_variable_name(param.name) 116 | 117 | m = tf.get_variable( 118 | name=param_name + "/adam_m", 119 | shape=param.shape.as_list(), 120 | dtype=tf.float32, 121 | trainable=False, 122 | initializer=tf.zeros_initializer()) 123 | v = tf.get_variable( 124 | name=param_name + "/adam_v", 125 | shape=param.shape.as_list(), 126 | dtype=tf.float32, 127 | trainable=False, 128 | initializer=tf.zeros_initializer()) 129 | 130 | # Standard Adam update. 131 | next_m = ( 132 | tf.multiply(self.beta_1, m) + tf.multiply(1.0 - self.beta_1, grad)) 133 | next_v = ( 134 | tf.multiply(self.beta_2, v) + tf.multiply(1.0 - self.beta_2, 135 | tf.square(grad))) 136 | 137 | update = next_m / (tf.sqrt(next_v) + self.epsilon) 138 | 139 | # Just adding the square of the weights to the loss function is *not* 140 | # the correct way of using L2 regularization/weight decay with Adam, 141 | # since that will interact with the m and v parameters in strange ways. 142 | # 143 | # Instead we want ot decay the weights in a manner that doesn't interact 144 | # with the m/v parameters. This is equivalent to adding the square 145 | # of the weights to the loss with plain (non-momentum) SGD. 146 | if self._do_use_weight_decay(param_name): 147 | update += self.weight_decay_rate * param 148 | 149 | update_with_lr = self.learning_rate * update 150 | 151 | next_param = param - update_with_lr 152 | 153 | assignments.extend( 154 | [param.assign(next_param), 155 | m.assign(next_m), 156 | v.assign(next_v)]) 157 | return tf.group(*assignments, name=name) 158 | 159 | def _do_use_weight_decay(self, param_name): 160 | """Whether to use L2 weight decay for `param_name`.""" 161 | if not self.weight_decay_rate: 162 | return False 163 | if self.exclude_from_weight_decay: 164 | for r in self.exclude_from_weight_decay: 165 | if re.search(r, param_name) is not None: 166 | return False 167 | return True 168 | 169 | def _get_variable_name(self, param_name): 170 | """Get the variable name from the tensor name.""" 171 | m = re.match("^(.*):\\d+$", param_name) 172 | if m is not None: 173 | param_name = m.group(1) 174 | return param_name 175 | -------------------------------------------------------------------------------- /optimization_test.py: -------------------------------------------------------------------------------- 1 | # coding=utf-8 2 | # Copyright 2018 The Google AI Language Team Authors. 3 | # 4 | # Licensed under the Apache License, Version 2.0 (the "License"); 5 | # you may not use this file except in compliance with the License. 6 | # You may obtain a copy of the License at 7 | # 8 | # http://www.apache.org/licenses/LICENSE-2.0 9 | # 10 | # Unless required by applicable law or agreed to in writing, software 11 | # distributed under the License is distributed on an "AS IS" BASIS, 12 | # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. 13 | # See the License for the specific language governing permissions and 14 | # limitations under the License. 15 | from __future__ import absolute_import 16 | from __future__ import division 17 | from __future__ import print_function 18 | 19 | import optimization 20 | import tensorflow as tf 21 | 22 | 23 | class OptimizationTest(tf.test.TestCase): 24 | 25 | def test_adam(self): 26 | with self.test_session() as sess: 27 | w = tf.get_variable( 28 | "w", 29 | shape=[3], 30 | initializer=tf.constant_initializer([0.1, -0.2, -0.1])) 31 | x = tf.constant([0.4, 0.2, -0.5]) 32 | loss = tf.reduce_mean(tf.square(x - w)) 33 | tvars = tf.trainable_variables() 34 | grads = tf.gradients(loss, tvars) 35 | global_step = tf.train.get_or_create_global_step() 36 | optimizer = optimization.AdamWeightDecayOptimizer(learning_rate=0.2) 37 | train_op = optimizer.apply_gradients(zip(grads, tvars), global_step) 38 | init_op = tf.group(tf.global_variables_initializer(), 39 | tf.local_variables_initializer()) 40 | sess.run(init_op) 41 | for _ in range(100): 42 | sess.run(train_op) 43 | w_np = sess.run(w) 44 | self.assertAllClose(w_np.flat, [0.4, 0.2, -0.5], rtol=1e-2, atol=1e-2) 45 | 46 | 47 | if __name__ == "__main__": 48 | tf.test.main() 49 | -------------------------------------------------------------------------------- /requirements.txt: -------------------------------------------------------------------------------- 1 | tensorflow >= 1.11.0 # CPU Version of TensorFlow. 2 | # tensorflow-gpu >= 1.11.0 # GPU version of TensorFlow. 3 | -------------------------------------------------------------------------------- /run_classifier.py: -------------------------------------------------------------------------------- 1 | # coding=utf-8 2 | # Copyright 2018 The Google AI Language Team Authors. 3 | # 4 | # Licensed under the Apache License, Version 2.0 (the "License"); 5 | # you may not use this file except in compliance with the License. 6 | # You may obtain a copy of the License at 7 | # 8 | # http://www.apache.org/licenses/LICENSE-2.0 9 | # 10 | # Unless required by applicable law or agreed to in writing, software 11 | # distributed under the License is distributed on an "AS IS" BASIS, 12 | # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. 13 | # See the License for the specific language governing permissions and 14 | # limitations under the License. 15 | """BERT finetuning runner.""" 16 | 17 | from __future__ import absolute_import 18 | from __future__ import division 19 | from __future__ import print_function 20 | 21 | import collections 22 | import csv 23 | import os 24 | import modeling 25 | import optimization 26 | import tokenization 27 | import tensorflow as tf 28 | 29 | flags = tf.flags 30 | 31 | FLAGS = flags.FLAGS 32 | 33 | ## Required parameters 34 | flags.DEFINE_string( 35 | "data_dir", None, 36 | "The input data dir. Should contain the .tsv files (or other data files) " 37 | "for the task.") 38 | 39 | flags.DEFINE_string( 40 | "bert_config_file", None, 41 | "The config json file corresponding to the pre-trained BERT model. " 42 | "This specifies the model architecture.") 43 | 44 | flags.DEFINE_string("task_name", None, "The name of the task to train.") 45 | 46 | flags.DEFINE_string("vocab_file", None, 47 | "The vocabulary file that the BERT model was trained on.") 48 | 49 | flags.DEFINE_string( 50 | "output_dir", None, 51 | "The output directory where the model checkpoints will be written.") 52 | 53 | ## Other parameters 54 | 55 | flags.DEFINE_string( 56 | "init_checkpoint", None, 57 | "Initial checkpoint (usually from a pre-trained BERT model).") 58 | 59 | flags.DEFINE_bool( 60 | "do_lower_case", True, 61 | "Whether to lower case the input text. Should be True for uncased " 62 | "models and False for cased models.") 63 | 64 | flags.DEFINE_integer( 65 | "max_seq_length", 128, 66 | "The maximum total input sequence length after WordPiece tokenization. " 67 | "Sequences longer than this will be truncated, and sequences shorter " 68 | "than this will be padded.") 69 | 70 | flags.DEFINE_bool("do_train", False, "Whether to run training.") 71 | 72 | flags.DEFINE_bool("do_eval", False, "Whether to run eval on the dev set.") 73 | 74 | flags.DEFINE_bool( 75 | "do_predict", False, 76 | "Whether to run the model in inference mode on the test set.") 77 | 78 | flags.DEFINE_integer("train_batch_size", 32, "Total batch size for training.") 79 | 80 | flags.DEFINE_integer("eval_batch_size", 8, "Total batch size for eval.") 81 | 82 | flags.DEFINE_integer("predict_batch_size", 8, "Total batch size for predict.") 83 | 84 | flags.DEFINE_float("learning_rate", 5e-5, "The initial learning rate for Adam.") 85 | 86 | flags.DEFINE_float("num_train_epochs", 3.0, 87 | "Total number of training epochs to perform.") 88 | 89 | flags.DEFINE_float( 90 | "warmup_proportion", 0.1, 91 | "Proportion of training to perform linear learning rate warmup for. " 92 | "E.g., 0.1 = 10% of training.") 93 | 94 | flags.DEFINE_integer("save_checkpoints_steps", 1000, 95 | "How often to save the model checkpoint.") 96 | 97 | flags.DEFINE_integer("iterations_per_loop", 1000, 98 | "How many steps to make in each estimator call.") 99 | 100 | flags.DEFINE_bool("use_tpu", False, "Whether to use TPU or GPU/CPU.") 101 | 102 | tf.flags.DEFINE_string( 103 | "tpu_name", None, 104 | "The Cloud TPU to use for training. This should be either the name " 105 | "used when creating the Cloud TPU, or a grpc://ip.address.of.tpu:8470 " 106 | "url.") 107 | 108 | tf.flags.DEFINE_string( 109 | "tpu_zone", None, 110 | "[Optional] GCE zone where the Cloud TPU is located in. If not " 111 | "specified, we will attempt to automatically detect the GCE project from " 112 | "metadata.") 113 | 114 | tf.flags.DEFINE_string( 115 | "gcp_project", None, 116 | "[Optional] Project name for the Cloud TPU-enabled project. If not " 117 | "specified, we will attempt to automatically detect the GCE project from " 118 | "metadata.") 119 | 120 | tf.flags.DEFINE_string("master", None, "[Optional] TensorFlow master URL.") 121 | 122 | flags.DEFINE_integer( 123 | "num_tpu_cores", 8, 124 | "Only used if `use_tpu` is True. Total number of TPU cores to use.") 125 | 126 | 127 | class InputExample(object): 128 | """A single training/test example for simple sequence classification.""" 129 | 130 | def __init__(self, guid, text_a, text_b=None, label=None): 131 | """Constructs a InputExample. 132 | 133 | Args: 134 | guid: Unique id for the example. 135 | text_a: string. The untokenized text of the first sequence. For single 136 | sequence tasks, only this sequence must be specified. 137 | text_b: (Optional) string. The untokenized text of the second sequence. 138 | Only must be specified for sequence pair tasks. 139 | label: (Optional) string. The label of the example. This should be 140 | specified for train and dev examples, but not for test examples. 141 | """ 142 | self.guid = guid 143 | self.text_a = text_a 144 | self.text_b = text_b 145 | self.label = label 146 | 147 | 148 | class PaddingInputExample(object): 149 | """Fake example so the num input examples is a multiple of the batch size. 150 | 151 | When running eval/predict on the TPU, we need to pad the number of examples 152 | to be a multiple of the batch size, because the TPU requires a fixed batch 153 | size. The alternative is to drop the last batch, which is bad because it means 154 | the entire output data won't be generated. 155 | 156 | We use this class instead of `None` because treating `None` as padding 157 | battches could cause silent errors. 158 | """ 159 | 160 | 161 | class InputFeatures(object): 162 | """A single set of features of data.""" 163 | 164 | def __init__(self, 165 | input_ids, 166 | input_mask, 167 | segment_ids, 168 | label_id, 169 | is_real_example=True): 170 | self.input_ids = input_ids 171 | self.input_mask = input_mask 172 | self.segment_ids = segment_ids 173 | self.label_id = label_id 174 | self.is_real_example = is_real_example 175 | 176 | 177 | class DataProcessor(object): 178 | """Base class for data converters for sequence classification data sets.""" 179 | 180 | def get_train_examples(self, data_dir): 181 | """Gets a collection of `InputExample`s for the train set.""" 182 | raise NotImplementedError() 183 | 184 | def get_dev_examples(self, data_dir): 185 | """Gets a collection of `InputExample`s for the dev set.""" 186 | raise NotImplementedError() 187 | 188 | def get_test_examples(self, data_dir): 189 | """Gets a collection of `InputExample`s for prediction.""" 190 | raise NotImplementedError() 191 | 192 | def get_labels(self): 193 | """Gets the list of labels for this data set.""" 194 | raise NotImplementedError() 195 | 196 | @classmethod 197 | def _read_tsv(cls, input_file, quotechar=None): 198 | """Reads a tab separated value file.""" 199 | with tf.gfile.Open(input_file, "r") as f: 200 | reader = csv.reader(f, delimiter="\t", quotechar=quotechar) 201 | lines = [] 202 | for line in reader: 203 | lines.append(line) 204 | return lines 205 | 206 | 207 | class XnliProcessor(DataProcessor): 208 | """Processor for the XNLI data set.""" 209 | 210 | def __init__(self): 211 | self.language = "zh" 212 | 213 | def get_train_examples(self, data_dir): 214 | """See base class.""" 215 | lines = self._read_tsv( 216 | os.path.join(data_dir, "multinli", 217 | "multinli.train.%s.tsv" % self.language)) 218 | examples = [] 219 | for (i, line) in enumerate(lines): 220 | if i == 0: 221 | continue 222 | guid = "train-%d" % (i) 223 | text_a = tokenization.convert_to_unicode(line[0]) 224 | text_b = tokenization.convert_to_unicode(line[1]) 225 | label = tokenization.convert_to_unicode(line[2]) 226 | if label == tokenization.convert_to_unicode("contradictory"): 227 | label = tokenization.convert_to_unicode("contradiction") 228 | examples.append( 229 | InputExample(guid=guid, text_a=text_a, text_b=text_b, label=label)) 230 | return examples 231 | 232 | def get_dev_examples(self, data_dir): 233 | """See base class.""" 234 | lines = self._read_tsv(os.path.join(data_dir, "xnli.dev.tsv")) 235 | examples = [] 236 | for (i, line) in enumerate(lines): 237 | if i == 0: 238 | continue 239 | guid = "dev-%d" % (i) 240 | language = tokenization.convert_to_unicode(line[0]) 241 | if language != tokenization.convert_to_unicode(self.language): 242 | continue 243 | text_a = tokenization.convert_to_unicode(line[6]) 244 | text_b = tokenization.convert_to_unicode(line[7]) 245 | label = tokenization.convert_to_unicode(line[1]) 246 | examples.append( 247 | InputExample(guid=guid, text_a=text_a, text_b=text_b, label=label)) 248 | return examples 249 | 250 | def get_labels(self): 251 | """See base class.""" 252 | return ["contradiction", "entailment", "neutral"] 253 | 254 | 255 | class MnliProcessor(DataProcessor): 256 | """Processor for the MultiNLI data set (GLUE version).""" 257 | 258 | def get_train_examples(self, data_dir): 259 | """See base class.""" 260 | return self._create_examples( 261 | self._read_tsv(os.path.join(data_dir, "train.tsv")), "train") 262 | 263 | def get_dev_examples(self, data_dir): 264 | """See base class.""" 265 | return self._create_examples( 266 | self._read_tsv(os.path.join(data_dir, "dev_matched.tsv")), 267 | "dev_matched") 268 | 269 | def get_test_examples(self, data_dir): 270 | """See base class.""" 271 | return self._create_examples( 272 | self._read_tsv(os.path.join(data_dir, "test_matched.tsv")), "test") 273 | 274 | def get_labels(self): 275 | """See base class.""" 276 | return ["contradiction", "entailment", "neutral"] 277 | 278 | def _create_examples(self, lines, set_type): 279 | """Creates examples for the training and dev sets.""" 280 | examples = [] 281 | for (i, line) in enumerate(lines): 282 | if i == 0: 283 | continue 284 | guid = "%s-%s" % (set_type, tokenization.convert_to_unicode(line[0])) 285 | text_a = tokenization.convert_to_unicode(line[8]) 286 | text_b = tokenization.convert_to_unicode(line[9]) 287 | if set_type == "test": 288 | label = "contradiction" 289 | else: 290 | label = tokenization.convert_to_unicode(line[-1]) 291 | examples.append( 292 | InputExample(guid=guid, text_a=text_a, text_b=text_b, label=label)) 293 | return examples 294 | 295 | 296 | class MrpcProcessor(DataProcessor): 297 | """Processor for the MRPC data set (GLUE version).""" 298 | 299 | def get_train_examples(self, data_dir): 300 | """See base class.""" 301 | return self._create_examples( 302 | self._read_tsv(os.path.join(data_dir, "train.tsv")), "train") 303 | 304 | def get_dev_examples(self, data_dir): 305 | """See base class.""" 306 | return self._create_examples( 307 | self._read_tsv(os.path.join(data_dir, "dev.tsv")), "dev") 308 | 309 | def get_test_examples(self, data_dir): 310 | """See base class.""" 311 | return self._create_examples( 312 | self._read_tsv(os.path.join(data_dir, "test.tsv")), "test") 313 | 314 | def get_labels(self): 315 | """See base class.""" 316 | return ["0", "1"] 317 | 318 | def _create_examples(self, lines, set_type): 319 | """Creates examples for the training and dev sets.""" 320 | examples = [] 321 | for (i, line) in enumerate(lines): 322 | if i == 0: 323 | continue 324 | guid = "%s-%s" % (set_type, i) 325 | text_a = tokenization.convert_to_unicode(line[3]) 326 | text_b = tokenization.convert_to_unicode(line[4]) 327 | if set_type == "test": 328 | label = "0" 329 | else: 330 | label = tokenization.convert_to_unicode(line[0]) 331 | examples.append( 332 | InputExample(guid=guid, text_a=text_a, text_b=text_b, label=label)) 333 | return examples 334 | 335 | 336 | 337 | class ColaProcessor(DataProcessor): 338 | """Processor for the CoLA data set (GLUE version).""" 339 | 340 | def get_train_examples(self, data_dir): 341 | """See base class.""" 342 | return self._create_examples( 343 | self._read_tsv(os.path.join(data_dir, "train.tsv")), "train") 344 | 345 | def get_dev_examples(self, data_dir): 346 | """See base class.""" 347 | return self._create_examples( 348 | self._read_tsv(os.path.join(data_dir, "dev.tsv")), "dev") 349 | 350 | def get_test_examples(self, data_dir): 351 | """See base class.""" 352 | return self._create_examples( 353 | self._read_tsv(os.path.join(data_dir, "test.tsv")), "test") 354 | 355 | def get_labels(self): 356 | """See base class.""" 357 | return ["0", "1"] 358 | 359 | def _create_examples(self, lines, set_type): 360 | """Creates examples for the training and dev sets.""" 361 | examples = [] 362 | for (i, line) in enumerate(lines): 363 | # Only the test set has a header 364 | if set_type == "test" and i == 0: 365 | continue 366 | guid = "%s-%s" % (set_type, i) 367 | if set_type == "test": 368 | text_a = tokenization.convert_to_unicode(line[1]) 369 | label = "0" 370 | else: 371 | text_a = tokenization.convert_to_unicode(line[3]) 372 | label = tokenization.convert_to_unicode(line[1]) 373 | examples.append( 374 | InputExample(guid=guid, text_a=text_a, text_b=None, label=label)) 375 | return examples 376 | 377 | 378 | def convert_single_example(ex_index, example, label_list, max_seq_length, 379 | tokenizer): 380 | """Converts a single `InputExample` into a single `InputFeatures`.""" 381 | 382 | if isinstance(example, PaddingInputExample): 383 | return InputFeatures( 384 | input_ids=[0] * max_seq_length, 385 | input_mask=[0] * max_seq_length, 386 | segment_ids=[0] * max_seq_length, 387 | label_id=0, 388 | is_real_example=False) 389 | 390 | label_map = {} 391 | for (i, label) in enumerate(label_list): 392 | label_map[label] = i 393 | 394 | tokens_a = tokenizer.tokenize(example.text_a) 395 | tokens_b = None 396 | if example.text_b: 397 | tokens_b = tokenizer.tokenize(example.text_b) 398 | 399 | if tokens_b: 400 | # Modifies `tokens_a` and `tokens_b` in place so that the total 401 | # length is less than the specified length. 402 | # Account for [CLS], [SEP], [SEP] with "- 3" 403 | _truncate_seq_pair(tokens_a, tokens_b, max_seq_length - 3) 404 | else: 405 | # Account for [CLS] and [SEP] with "- 2" 406 | if len(tokens_a) > max_seq_length - 2: 407 | tokens_a = tokens_a[0:(max_seq_length - 2)] 408 | 409 | # The convention in BERT is: 410 | # (a) For sequence pairs: 411 | # tokens: [CLS] is this jack ##son ##ville ? [SEP] no it is not . [SEP] 412 | # type_ids: 0 0 0 0 0 0 0 0 1 1 1 1 1 1 413 | # (b) For single sequences: 414 | # tokens: [CLS] the dog is hairy . [SEP] 415 | # type_ids: 0 0 0 0 0 0 0 416 | # 417 | # Where "type_ids" are used to indicate whether this is the first 418 | # sequence or the second sequence. The embedding vectors for `type=0` and 419 | # `type=1` were learned during pre-training and are added to the wordpiece 420 | # embedding vector (and position vector). This is not *strictly* necessary 421 | # since the [SEP] token unambiguously separates the sequences, but it makes 422 | # it easier for the model to learn the concept of sequences. 423 | # 424 | # For classification tasks, the first vector (corresponding to [CLS]) is 425 | # used as the "sentence vector". Note that this only makes sense because 426 | # the entire model is fine-tuned. 427 | tokens = [] 428 | segment_ids = [] 429 | tokens.append("[CLS]") 430 | segment_ids.append(0) 431 | for token in tokens_a: 432 | tokens.append(token) 433 | segment_ids.append(0) 434 | tokens.append("[SEP]") 435 | segment_ids.append(0) 436 | 437 | if tokens_b: 438 | for token in tokens_b: 439 | tokens.append(token) 440 | segment_ids.append(1) 441 | tokens.append("[SEP]") 442 | segment_ids.append(1) 443 | 444 | input_ids = tokenizer.convert_tokens_to_ids(tokens) 445 | 446 | # The mask has 1 for real tokens and 0 for padding tokens. Only real 447 | # tokens are attended to. 448 | input_mask = [1] * len(input_ids) 449 | 450 | # Zero-pad up to the sequence length. 451 | while len(input_ids) < max_seq_length: 452 | input_ids.append(0) 453 | input_mask.append(0) 454 | segment_ids.append(0) 455 | 456 | assert len(input_ids) == max_seq_length 457 | assert len(input_mask) == max_seq_length 458 | assert len(segment_ids) == max_seq_length 459 | 460 | label_id = label_map[example.label] 461 | if ex_index < 5: 462 | tf.logging.info("*** Example ***") 463 | tf.logging.info("guid: %s" % (example.guid)) 464 | tf.logging.info("tokens: %s" % " ".join( 465 | [tokenization.printable_text(x) for x in tokens])) 466 | tf.logging.info("input_ids: %s" % " ".join([str(x) for x in input_ids])) 467 | tf.logging.info("input_mask: %s" % " ".join([str(x) for x in input_mask])) 468 | tf.logging.info("segment_ids: %s" % " ".join([str(x) for x in segment_ids])) 469 | tf.logging.info("label: %s (id = %d)" % (example.label, label_id)) 470 | 471 | feature = InputFeatures( 472 | input_ids=input_ids, 473 | input_mask=input_mask, 474 | segment_ids=segment_ids, 475 | label_id=label_id, 476 | is_real_example=True) 477 | return feature 478 | 479 | 480 | def file_based_convert_examples_to_features( 481 | examples, label_list, max_seq_length, tokenizer, output_file): 482 | """Convert a set of `InputExample`s to a TFRecord file.""" 483 | 484 | writer = tf.python_io.TFRecordWriter(output_file) 485 | 486 | for (ex_index, example) in enumerate(examples): 487 | if ex_index % 10000 == 0: 488 | tf.logging.info("Writing example %d of %d" % (ex_index, len(examples))) 489 | 490 | feature = convert_single_example(ex_index, example, label_list, 491 | max_seq_length, tokenizer) 492 | 493 | def create_int_feature(values): 494 | f = tf.train.Feature(int64_list=tf.train.Int64List(value=list(values))) 495 | return f 496 | 497 | features = collections.OrderedDict() 498 | features["input_ids"] = create_int_feature(feature.input_ids) 499 | features["input_mask"] = create_int_feature(feature.input_mask) 500 | features["segment_ids"] = create_int_feature(feature.segment_ids) 501 | features["label_ids"] = create_int_feature([feature.label_id]) 502 | features["is_real_example"] = create_int_feature( 503 | [int(feature.is_real_example)]) 504 | 505 | tf_example = tf.train.Example(features=tf.train.Features(feature=features)) 506 | writer.write(tf_example.SerializeToString()) 507 | writer.close() 508 | 509 | 510 | def file_based_input_fn_builder(input_file, seq_length, is_training, 511 | drop_remainder): 512 | """Creates an `input_fn` closure to be passed to TPUEstimator.""" 513 | 514 | name_to_features = { 515 | "input_ids": tf.FixedLenFeature([seq_length], tf.int64), 516 | "input_mask": tf.FixedLenFeature([seq_length], tf.int64), 517 | "segment_ids": tf.FixedLenFeature([seq_length], tf.int64), 518 | "label_ids": tf.FixedLenFeature([], tf.int64), 519 | "is_real_example": tf.FixedLenFeature([], tf.int64), 520 | } 521 | 522 | def _decode_record(record, name_to_features): 523 | """Decodes a record to a TensorFlow example.""" 524 | example = tf.parse_single_example(record, name_to_features) 525 | 526 | # tf.Example only supports tf.int64, but the TPU only supports tf.int32. 527 | # So cast all int64 to int32. 528 | for name in list(example.keys()): 529 | t = example[name] 530 | if t.dtype == tf.int64: 531 | t = tf.to_int32(t) 532 | example[name] = t 533 | 534 | return example 535 | 536 | def input_fn(params): 537 | """The actual input function.""" 538 | batch_size = params["batch_size"] 539 | 540 | # For training, we want a lot of parallel reading and shuffling. 541 | # For eval, we want no shuffling and parallel reading doesn't matter. 542 | d = tf.data.TFRecordDataset(input_file) 543 | if is_training: 544 | d = d.repeat() 545 | d = d.shuffle(buffer_size=100) 546 | 547 | d = d.apply( 548 | tf.contrib.data.map_and_batch( 549 | lambda record: _decode_record(record, name_to_features), 550 | batch_size=batch_size, 551 | drop_remainder=drop_remainder)) 552 | 553 | return d 554 | 555 | return input_fn 556 | 557 | 558 | def _truncate_seq_pair(tokens_a, tokens_b, max_length): 559 | """Truncates a sequence pair in place to the maximum length.""" 560 | 561 | # This is a simple heuristic which will always truncate the longer sequence 562 | # one token at a time. This makes more sense than truncating an equal percent 563 | # of tokens from each, since if one sequence is very short then each token 564 | # that's truncated likely contains more information than a longer sequence. 565 | while True: 566 | total_length = len(tokens_a) + len(tokens_b) 567 | if total_length <= max_length: 568 | break 569 | if len(tokens_a) > len(tokens_b): 570 | tokens_a.pop() 571 | else: 572 | tokens_b.pop() 573 | 574 | 575 | def create_model(bert_config, is_training, input_ids, input_mask, segment_ids, 576 | labels, num_labels, use_one_hot_embeddings): 577 | """Creates a classification model.""" 578 | model = modeling.BertModel( 579 | config=bert_config, 580 | is_training=is_training, 581 | input_ids=input_ids, 582 | input_mask=input_mask, 583 | token_type_ids=segment_ids, 584 | use_one_hot_embeddings=use_one_hot_embeddings) 585 | 586 | # In the demo, we are doing a simple classification task on the entire 587 | # segment. 588 | # 589 | # If you want to use the token-level output, use model.get_sequence_output() 590 | # instead. 591 | output_layer = model.get_pooled_output() 592 | 593 | hidden_size = output_layer.shape[-1].value 594 | 595 | output_weights = tf.get_variable( 596 | "output_weights", [num_labels, hidden_size], 597 | initializer=tf.truncated_normal_initializer(stddev=0.02)) 598 | 599 | output_bias = tf.get_variable( 600 | "output_bias", [num_labels], initializer=tf.zeros_initializer()) 601 | 602 | with tf.variable_scope("loss"): 603 | if is_training: 604 | # I.e., 0.1 dropout 605 | output_layer = tf.nn.dropout(output_layer, keep_prob=0.9) 606 | 607 | logits = tf.matmul(output_layer, output_weights, transpose_b=True) 608 | logits = tf.nn.bias_add(logits, output_bias) 609 | probabilities = tf.nn.softmax(logits, axis=-1) 610 | log_probs = tf.nn.log_softmax(logits, axis=-1) 611 | 612 | one_hot_labels = tf.one_hot(labels, depth=num_labels, dtype=tf.float32) 613 | 614 | per_example_loss = -tf.reduce_sum(one_hot_labels * log_probs, axis=-1) 615 | loss = tf.reduce_mean(per_example_loss) 616 | 617 | return (loss, per_example_loss, logits, probabilities) 618 | 619 | 620 | def model_fn_builder(bert_config, num_labels, init_checkpoint, learning_rate, 621 | num_train_steps, num_warmup_steps, use_tpu, 622 | use_one_hot_embeddings): 623 | """Returns `model_fn` closure for TPUEstimator.""" 624 | 625 | def model_fn(features, labels, mode, params): # pylint: disable=unused-argument 626 | """The `model_fn` for TPUEstimator.""" 627 | 628 | tf.logging.info("*** Features ***") 629 | for name in sorted(features.keys()): 630 | tf.logging.info(" name = %s, shape = %s" % (name, features[name].shape)) 631 | 632 | input_ids = features["input_ids"] 633 | input_mask = features["input_mask"] 634 | segment_ids = features["segment_ids"] 635 | label_ids = features["label_ids"] 636 | is_real_example = None 637 | if "is_real_example" in features: 638 | is_real_example = tf.cast(features["is_real_example"], dtype=tf.float32) 639 | else: 640 | is_real_example = tf.ones(tf.shape(label_ids), dtype=tf.float32) 641 | 642 | is_training = (mode == tf.estimator.ModeKeys.TRAIN) 643 | 644 | (total_loss, per_example_loss, logits, probabilities) = create_model( 645 | bert_config, is_training, input_ids, input_mask, segment_ids, label_ids, 646 | num_labels, use_one_hot_embeddings) 647 | 648 | tvars = tf.trainable_variables() 649 | initialized_variable_names = {} 650 | scaffold_fn = None 651 | if init_checkpoint: 652 | (assignment_map, initialized_variable_names 653 | ) = modeling.get_assignment_map_from_checkpoint(tvars, init_checkpoint) 654 | if use_tpu: 655 | 656 | def tpu_scaffold(): 657 | tf.train.init_from_checkpoint(init_checkpoint, assignment_map) 658 | return tf.train.Scaffold() 659 | 660 | scaffold_fn = tpu_scaffold 661 | else: 662 | tf.train.init_from_checkpoint(init_checkpoint, assignment_map) 663 | 664 | tf.logging.info("**** Trainable Variables ****") 665 | for var in tvars: 666 | init_string = "" 667 | if var.name in initialized_variable_names: 668 | init_string = ", *INIT_FROM_CKPT*" 669 | tf.logging.info(" name = %s, shape = %s%s", var.name, var.shape, 670 | init_string) 671 | 672 | output_spec = None 673 | if mode == tf.estimator.ModeKeys.TRAIN: 674 | 675 | train_op = optimization.create_optimizer( 676 | total_loss, learning_rate, num_train_steps, num_warmup_steps, use_tpu) 677 | 678 | output_spec = tf.contrib.tpu.TPUEstimatorSpec( 679 | mode=mode, 680 | loss=total_loss, 681 | train_op=train_op, 682 | scaffold_fn=scaffold_fn) 683 | elif mode == tf.estimator.ModeKeys.EVAL: 684 | 685 | def metric_fn(per_example_loss, label_ids, logits, is_real_example): 686 | predictions = tf.argmax(logits, axis=-1, output_type=tf.int32) 687 | accuracy = tf.metrics.accuracy( 688 | labels=label_ids, predictions=predictions, weights=is_real_example) 689 | loss = tf.metrics.mean(values=per_example_loss, weights=is_real_example) 690 | return { 691 | "eval_accuracy": accuracy, 692 | "eval_loss": loss, 693 | } 694 | 695 | eval_metrics = (metric_fn, 696 | [per_example_loss, label_ids, logits, is_real_example]) 697 | output_spec = tf.contrib.tpu.TPUEstimatorSpec( 698 | mode=mode, 699 | loss=total_loss, 700 | eval_metrics=eval_metrics, 701 | scaffold_fn=scaffold_fn) 702 | else: 703 | output_spec = tf.contrib.tpu.TPUEstimatorSpec( 704 | mode=mode, 705 | predictions={"probabilities": probabilities}, 706 | scaffold_fn=scaffold_fn) 707 | return output_spec 708 | 709 | return model_fn 710 | 711 | 712 | # This function is not used by this file but is still used by the Colab and 713 | # people who depend on it. 714 | def input_fn_builder(features, seq_length, is_training, drop_remainder): 715 | """Creates an `input_fn` closure to be passed to TPUEstimator.""" 716 | 717 | all_input_ids = [] 718 | all_input_mask = [] 719 | all_segment_ids = [] 720 | all_label_ids = [] 721 | 722 | for feature in features: 723 | all_input_ids.append(feature.input_ids) 724 | all_input_mask.append(feature.input_mask) 725 | all_segment_ids.append(feature.segment_ids) 726 | all_label_ids.append(feature.label_id) 727 | 728 | def input_fn(params): 729 | """The actual input function.""" 730 | batch_size = params["batch_size"] 731 | 732 | num_examples = len(features) 733 | 734 | # This is for demo purposes and does NOT scale to large data sets. We do 735 | # not use Dataset.from_generator() because that uses tf.py_func which is 736 | # not TPU compatible. The right way to load data is with TFRecordReader. 737 | d = tf.data.Dataset.from_tensor_slices({ 738 | "input_ids": 739 | tf.constant( 740 | all_input_ids, shape=[num_examples, seq_length], 741 | dtype=tf.int32), 742 | "input_mask": 743 | tf.constant( 744 | all_input_mask, 745 | shape=[num_examples, seq_length], 746 | dtype=tf.int32), 747 | "segment_ids": 748 | tf.constant( 749 | all_segment_ids, 750 | shape=[num_examples, seq_length], 751 | dtype=tf.int32), 752 | "label_ids": 753 | tf.constant(all_label_ids, shape=[num_examples], dtype=tf.int32), 754 | }) 755 | 756 | if is_training: 757 | d = d.repeat() 758 | d = d.shuffle(buffer_size=100) 759 | 760 | d = d.batch(batch_size=batch_size, drop_remainder=drop_remainder) 761 | return d 762 | 763 | return input_fn 764 | 765 | 766 | # This function is not used by this file but is still used by the Colab and 767 | # people who depend on it. 768 | def convert_examples_to_features(examples, label_list, max_seq_length, 769 | tokenizer): 770 | """Convert a set of `InputExample`s to a list of `InputFeatures`.""" 771 | 772 | features = [] 773 | for (ex_index, example) in enumerate(examples): 774 | if ex_index % 10000 == 0: 775 | tf.logging.info("Writing example %d of %d" % (ex_index, len(examples))) 776 | 777 | feature = convert_single_example(ex_index, example, label_list, 778 | max_seq_length, tokenizer) 779 | 780 | features.append(feature) 781 | return features 782 | 783 | 784 | def main(_): 785 | tf.logging.set_verbosity(tf.logging.INFO) 786 | 787 | processors = { 788 | "cola": ColaProcessor, 789 | "mnli": MnliProcessor, 790 | "mrpc": MrpcProcessor, 791 | "xnli": XnliProcessor, 792 | } 793 | 794 | tokenization.validate_case_matches_checkpoint(FLAGS.do_lower_case, 795 | FLAGS.init_checkpoint) 796 | 797 | if not FLAGS.do_train and not FLAGS.do_eval and not FLAGS.do_predict: 798 | raise ValueError( 799 | "At least one of `do_train`, `do_eval` or `do_predict' must be True.") 800 | 801 | bert_config = modeling.BertConfig.from_json_file(FLAGS.bert_config_file) 802 | 803 | if FLAGS.max_seq_length > bert_config.max_position_embeddings: 804 | raise ValueError( 805 | "Cannot use sequence length %d because the BERT model " 806 | "was only trained up to sequence length %d" % 807 | (FLAGS.max_seq_length, bert_config.max_position_embeddings)) 808 | 809 | tf.gfile.MakeDirs(FLAGS.output_dir) 810 | 811 | task_name = FLAGS.task_name.lower() 812 | 813 | if task_name not in processors: 814 | raise ValueError("Task not found: %s" % (task_name)) 815 | 816 | processor = processors[task_name]() 817 | 818 | label_list = processor.get_labels() 819 | 820 | tokenizer = tokenization.FullTokenizer( 821 | vocab_file=FLAGS.vocab_file, do_lower_case=FLAGS.do_lower_case) 822 | 823 | tpu_cluster_resolver = None 824 | if FLAGS.use_tpu and FLAGS.tpu_name: 825 | tpu_cluster_resolver = tf.contrib.cluster_resolver.TPUClusterResolver( 826 | FLAGS.tpu_name, zone=FLAGS.tpu_zone, project=FLAGS.gcp_project) 827 | 828 | is_per_host = tf.contrib.tpu.InputPipelineConfig.PER_HOST_V2 829 | run_config = tf.contrib.tpu.RunConfig( 830 | cluster=tpu_cluster_resolver, 831 | master=FLAGS.master, 832 | model_dir=FLAGS.output_dir, 833 | save_checkpoints_steps=FLAGS.save_checkpoints_steps, 834 | tpu_config=tf.contrib.tpu.TPUConfig( 835 | iterations_per_loop=FLAGS.iterations_per_loop, 836 | num_shards=FLAGS.num_tpu_cores, 837 | per_host_input_for_training=is_per_host)) 838 | 839 | train_examples = None 840 | num_train_steps = None 841 | num_warmup_steps = None 842 | if FLAGS.do_train: 843 | train_examples = processor.get_train_examples(FLAGS.data_dir) 844 | num_train_steps = int( 845 | len(train_examples) / FLAGS.train_batch_size * FLAGS.num_train_epochs) 846 | num_warmup_steps = int(num_train_steps * FLAGS.warmup_proportion) 847 | 848 | model_fn = model_fn_builder( 849 | bert_config=bert_config, 850 | num_labels=len(label_list), 851 | init_checkpoint=FLAGS.init_checkpoint, 852 | learning_rate=FLAGS.learning_rate, 853 | num_train_steps=num_train_steps, 854 | num_warmup_steps=num_warmup_steps, 855 | use_tpu=FLAGS.use_tpu, 856 | use_one_hot_embeddings=FLAGS.use_tpu) 857 | 858 | # If TPU is not available, this will fall back to normal Estimator on CPU 859 | # or GPU. 860 | estimator = tf.contrib.tpu.TPUEstimator( 861 | use_tpu=FLAGS.use_tpu, 862 | model_fn=model_fn, 863 | config=run_config, 864 | train_batch_size=FLAGS.train_batch_size, 865 | eval_batch_size=FLAGS.eval_batch_size, 866 | predict_batch_size=FLAGS.predict_batch_size) 867 | 868 | if FLAGS.do_train: 869 | train_file = os.path.join(FLAGS.output_dir, "train.tf_record") 870 | file_based_convert_examples_to_features( 871 | train_examples, label_list, FLAGS.max_seq_length, tokenizer, train_file) 872 | tf.logging.info("***** Running training *****") 873 | tf.logging.info(" Num examples = %d", len(train_examples)) 874 | tf.logging.info(" Batch size = %d", FLAGS.train_batch_size) 875 | tf.logging.info(" Num steps = %d", num_train_steps) 876 | train_input_fn = file_based_input_fn_builder( 877 | input_file=train_file, 878 | seq_length=FLAGS.max_seq_length, 879 | is_training=True, 880 | drop_remainder=True) 881 | estimator.train(input_fn=train_input_fn, max_steps=num_train_steps) 882 | 883 | if FLAGS.do_eval: 884 | eval_examples = processor.get_dev_examples(FLAGS.data_dir) 885 | num_actual_eval_examples = len(eval_examples) 886 | if FLAGS.use_tpu: 887 | # TPU requires a fixed batch size for all batches, therefore the number 888 | # of examples must be a multiple of the batch size, or else examples 889 | # will get dropped. So we pad with fake examples which are ignored 890 | # later on. These do NOT count towards the metric (all tf.metrics 891 | # support a per-instance weight, and these get a weight of 0.0). 892 | while len(eval_examples) % FLAGS.eval_batch_size != 0: 893 | eval_examples.append(PaddingInputExample()) 894 | 895 | eval_file = os.path.join(FLAGS.output_dir, "eval.tf_record") 896 | file_based_convert_examples_to_features( 897 | eval_examples, label_list, FLAGS.max_seq_length, tokenizer, eval_file) 898 | 899 | tf.logging.info("***** Running evaluation *****") 900 | tf.logging.info(" Num examples = %d (%d actual, %d padding)", 901 | len(eval_examples), num_actual_eval_examples, 902 | len(eval_examples) - num_actual_eval_examples) 903 | tf.logging.info(" Batch size = %d", FLAGS.eval_batch_size) 904 | 905 | # This tells the estimator to run through the entire set. 906 | eval_steps = None 907 | # However, if running eval on the TPU, you will need to specify the 908 | # number of steps. 909 | if FLAGS.use_tpu: 910 | assert len(eval_examples) % FLAGS.eval_batch_size == 0 911 | eval_steps = int(len(eval_examples) // FLAGS.eval_batch_size) 912 | 913 | eval_drop_remainder = True if FLAGS.use_tpu else False 914 | eval_input_fn = file_based_input_fn_builder( 915 | input_file=eval_file, 916 | seq_length=FLAGS.max_seq_length, 917 | is_training=False, 918 | drop_remainder=eval_drop_remainder) 919 | 920 | result = estimator.evaluate(input_fn=eval_input_fn, steps=eval_steps) 921 | 922 | output_eval_file = os.path.join(FLAGS.output_dir, "eval_results.txt") 923 | with tf.gfile.GFile(output_eval_file, "w") as writer: 924 | tf.logging.info("***** Eval results *****") 925 | for key in sorted(result.keys()): 926 | tf.logging.info(" %s = %s", key, str(result[key])) 927 | writer.write("%s = %s\n" % (key, str(result[key]))) 928 | 929 | if FLAGS.do_predict: 930 | predict_examples = processor.get_test_examples(FLAGS.data_dir) 931 | num_actual_predict_examples = len(predict_examples) 932 | if FLAGS.use_tpu: 933 | # TPU requires a fixed batch size for all batches, therefore the number 934 | # of examples must be a multiple of the batch size, or else examples 935 | # will get dropped. So we pad with fake examples which are ignored 936 | # later on. 937 | while len(predict_examples) % FLAGS.predict_batch_size != 0: 938 | predict_examples.append(PaddingInputExample()) 939 | 940 | predict_file = os.path.join(FLAGS.output_dir, "predict.tf_record") 941 | file_based_convert_examples_to_features(predict_examples, label_list, 942 | FLAGS.max_seq_length, tokenizer, 943 | predict_file) 944 | 945 | tf.logging.info("***** Running prediction*****") 946 | tf.logging.info(" Num examples = %d (%d actual, %d padding)", 947 | len(predict_examples), num_actual_predict_examples, 948 | len(predict_examples) - num_actual_predict_examples) 949 | tf.logging.info(" Batch size = %d", FLAGS.predict_batch_size) 950 | 951 | predict_drop_remainder = True if FLAGS.use_tpu else False 952 | predict_input_fn = file_based_input_fn_builder( 953 | input_file=predict_file, 954 | seq_length=FLAGS.max_seq_length, 955 | is_training=False, 956 | drop_remainder=predict_drop_remainder) 957 | 958 | result = estimator.predict(input_fn=predict_input_fn) 959 | 960 | output_predict_file = os.path.join(FLAGS.output_dir, "test_results.tsv") 961 | with tf.gfile.GFile(output_predict_file, "w") as writer: 962 | num_written_lines = 0 963 | tf.logging.info("***** Predict results *****") 964 | for (i, prediction) in enumerate(result): 965 | probabilities = prediction["probabilities"] 966 | if i >= num_actual_predict_examples: 967 | break 968 | output_line = "\t".join( 969 | str(class_probability) 970 | for class_probability in probabilities) + "\n" 971 | writer.write(output_line) 972 | num_written_lines += 1 973 | assert num_written_lines == num_actual_predict_examples 974 | 975 | 976 | if __name__ == "__main__": 977 | flags.mark_flag_as_required("data_dir") 978 | flags.mark_flag_as_required("task_name") 979 | flags.mark_flag_as_required("vocab_file") 980 | flags.mark_flag_as_required("bert_config_file") 981 | flags.mark_flag_as_required("output_dir") 982 | tf.app.run() 983 | -------------------------------------------------------------------------------- /run_classifier_0214.py: -------------------------------------------------------------------------------- 1 | # coding=utf-8 2 | # Copyright 2018 The Google AI Language Team Authors. 3 | # 4 | # Licensed under the Apache License, Version 2.0 (the "License"); 5 | # you may not use this file except in compliance with the License. 6 | # You may obtain a copy of the License at 7 | # 8 | # http://www.apache.org/licenses/LICENSE-2.0 9 | # 10 | # Unless required by applicable law or agreed to in writing, software 11 | # distributed under the License is distributed on an "AS IS" BASIS, 12 | # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. 13 | # See the License for the specific language governing permissions and 14 | # limitations under the License. 15 | """BERT finetuning runner.""" 16 | 17 | from __future__ import absolute_import 18 | from __future__ import division 19 | from __future__ import print_function 20 | 21 | import collections 22 | import csv 23 | import os 24 | import modeling 25 | import optimization 26 | import tokenization 27 | import tensorflow as tf 28 | 29 | flags = tf.flags 30 | 31 | FLAGS = flags.FLAGS 32 | 33 | ## Required parameters 34 | flags.DEFINE_string( 35 | "data_dir", None, 36 | "The input data dir. Should contain the .tsv files (or other data files) " 37 | "for the task.") 38 | 39 | flags.DEFINE_string( 40 | "bert_config_file", None, 41 | "The config json file corresponding to the pre-trained BERT model. " 42 | "This specifies the model architecture.") 43 | 44 | flags.DEFINE_string("task_name", "MRPC", "The name of the task to train.") 45 | 46 | flags.DEFINE_string("vocab_file", None, 47 | "The vocabulary file that the BERT model was trained on.") 48 | 49 | flags.DEFINE_string( 50 | "output_dir", None, 51 | "The output directory where the model checkpoints will be written.") 52 | 53 | ## Other parameters 54 | 55 | flags.DEFINE_string( 56 | "init_checkpoint", None, 57 | "Initial checkpoint (usually from a pre-trained BERT model).") 58 | 59 | flags.DEFINE_bool( 60 | "do_lower_case", True, 61 | "Whether to lower case the input text. Should be True for uncased " 62 | "models and False for cased models.") 63 | 64 | flags.DEFINE_integer( 65 | "max_seq_length", 128, 66 | "The maximum total input sequence length after WordPiece tokenization. " 67 | "Sequences longer than this will be truncated, and sequences shorter " 68 | "than this will be padded.") 69 | 70 | flags.DEFINE_bool("do_train", True, "Whether to run training.") 71 | 72 | flags.DEFINE_bool("do_eval", True, "Whether to run eval on the dev set.") 73 | 74 | flags.DEFINE_bool( 75 | "do_predict", False, 76 | "Whether to run the model in inference mode on the test set.") 77 | 78 | flags.DEFINE_integer("train_batch_size", 32, "Total batch size for training.") 79 | 80 | flags.DEFINE_integer("eval_batch_size", 8, "Total batch size for eval.") 81 | 82 | flags.DEFINE_integer("predict_batch_size", 8, "Total batch size for predict.") 83 | 84 | flags.DEFINE_float("learning_rate", 2e-5, "The initial learning rate for Adam.") 85 | 86 | flags.DEFINE_float("num_train_epochs", 10.0, 87 | "Total number of training epochs to perform.") 88 | 89 | flags.DEFINE_float( 90 | "warmup_proportion", 0.1, 91 | "Proportion of training to perform linear learning rate warmup for. " 92 | "E.g., 0.1 = 10% of training.") 93 | 94 | flags.DEFINE_integer("save_checkpoints_steps", 50, 95 | "How often to save the model checkpoint.") 96 | 97 | flags.DEFINE_integer("iterations_per_loop", 50, 98 | "How many steps to make in each estimator call.") 99 | 100 | flags.DEFINE_bool("use_tpu", False, "Whether to use TPU or GPU/CPU.") 101 | 102 | tf.flags.DEFINE_string( 103 | "tpu_name", None, 104 | "The Cloud TPU to use for training. This should be either the name " 105 | "used when creating the Cloud TPU, or a grpc://ip.address.of.tpu:8470 " 106 | "url.") 107 | 108 | tf.flags.DEFINE_string( 109 | "tpu_zone", None, 110 | "[Optional] GCE zone where the Cloud TPU is located in. If not " 111 | "specified, we will attempt to automatically detect the GCE project from " 112 | "metadata.") 113 | 114 | tf.flags.DEFINE_string( 115 | "gcp_project", None, 116 | "[Optional] Project name for the Cloud TPU-enabled project. If not " 117 | "specified, we will attempt to automatically detect the GCE project from " 118 | "metadata.") 119 | 120 | tf.flags.DEFINE_string("master", None, "[Optional] TensorFlow master URL.") 121 | 122 | flags.DEFINE_integer( 123 | "num_tpu_cores", 8, 124 | "Only used if `use_tpu` is True. Total number of TPU cores to use.") 125 | 126 | 127 | class InputExample(object): 128 | """A single training/test example for simple sequence classification.""" 129 | 130 | def __init__(self, guid, text_a, text_b=None, label=None): 131 | """Constructs a InputExample. 132 | 133 | Args: 134 | guid: Unique id for the example. 135 | text_a: string. The untokenized text of the first sequence. For single 136 | sequence tasks, only this sequence must be specified. 137 | text_b: (Optional) string. The untokenized text of the second sequence. 138 | Only must be specified for sequence pair tasks. 139 | label: (Optional) string. The label of the example. This should be 140 | specified for train and dev examples, but not for test examples. 141 | """ 142 | self.guid = guid 143 | self.text_a = text_a 144 | self.text_b = text_b 145 | self.label = label 146 | 147 | 148 | class PaddingInputExample(object): 149 | """Fake example so the num input examples is a multiple of the batch size. 150 | 151 | When running eval/predict on the TPU, we need to pad the number of examples 152 | to be a multiple of the batch size, because the TPU requires a fixed batch 153 | size. The alternative is to drop the last batch, which is bad because it means 154 | the entire output data won't be generated. 155 | 156 | We use this class instead of `None` because treating `None` as padding 157 | battches could cause silent errors. 158 | """ 159 | 160 | 161 | class InputFeatures(object): 162 | """A single set of features of data.""" 163 | 164 | def __init__(self, 165 | input_ids, 166 | input_mask, 167 | segment_ids, 168 | label_id, 169 | is_real_example=True): 170 | self.input_ids = input_ids 171 | self.input_mask = input_mask 172 | self.segment_ids = segment_ids 173 | self.label_id = label_id 174 | self.is_real_example = is_real_example 175 | 176 | 177 | class DataProcessor(object): 178 | """Base class for data converters for sequence classification data sets.""" 179 | 180 | def get_train_examples(self, data_dir): 181 | """Gets a collection of `InputExample`s for the train set.""" 182 | raise NotImplementedError() 183 | 184 | def get_dev_examples(self, data_dir): 185 | """Gets a collection of `InputExample`s for the dev set.""" 186 | raise NotImplementedError() 187 | 188 | def get_test_examples(self, data_dir): 189 | """Gets a collection of `InputExample`s for prediction.""" 190 | raise NotImplementedError() 191 | 192 | def get_labels(self): 193 | """Gets the list of labels for this data set.""" 194 | raise NotImplementedError() 195 | 196 | @classmethod 197 | def _read_tsv(cls, input_file, quotechar=None): 198 | """Reads a tab separated value file.""" 199 | with tf.gfile.Open(input_file, "r") as f: 200 | reader = csv.reader(f, delimiter="\t", quotechar=quotechar) 201 | lines = [] 202 | for line in reader: 203 | lines.append(line) 204 | return lines 205 | 206 | 207 | class XnliProcessor(DataProcessor): 208 | """Processor for the XNLI data set.""" 209 | 210 | def __init__(self): 211 | self.language = "zh" 212 | 213 | def get_train_examples(self, data_dir): 214 | """See base class.""" 215 | lines = self._read_tsv( 216 | os.path.join(data_dir, "multinli", 217 | "multinli.train.%s.tsv" % self.language)) 218 | examples = [] 219 | for (i, line) in enumerate(lines): 220 | if i == 0: 221 | continue 222 | guid = "train-%d" % (i) 223 | text_a = tokenization.convert_to_unicode(line[0]) 224 | text_b = tokenization.convert_to_unicode(line[1]) 225 | label = tokenization.convert_to_unicode(line[2]) 226 | if label == tokenization.convert_to_unicode("contradictory"): 227 | label = tokenization.convert_to_unicode("contradiction") 228 | examples.append( 229 | InputExample(guid=guid, text_a=text_a, text_b=text_b, label=label)) 230 | return examples 231 | 232 | def get_dev_examples(self, data_dir): 233 | """See base class.""" 234 | lines = self._read_tsv(os.path.join(data_dir, "xnli.dev.tsv")) 235 | examples = [] 236 | for (i, line) in enumerate(lines): 237 | if i == 0: 238 | continue 239 | guid = "dev-%d" % (i) 240 | language = tokenization.convert_to_unicode(line[0]) 241 | if language != tokenization.convert_to_unicode(self.language): 242 | continue 243 | text_a = tokenization.convert_to_unicode(line[6]) 244 | text_b = tokenization.convert_to_unicode(line[7]) 245 | label = tokenization.convert_to_unicode(line[1]) 246 | examples.append( 247 | InputExample(guid=guid, text_a=text_a, text_b=text_b, label=label)) 248 | return examples 249 | 250 | def get_labels(self): 251 | """See base class.""" 252 | return ["contradiction", "entailment", "neutral"] 253 | 254 | 255 | class MnliProcessor(DataProcessor): 256 | """Processor for the MultiNLI data set (GLUE version).""" 257 | 258 | def get_train_examples(self, data_dir): 259 | """See base class.""" 260 | return self._create_examples( 261 | self._read_tsv(os.path.join(data_dir, "train.tsv")), "train") 262 | 263 | def get_dev_examples(self, data_dir): 264 | """See base class.""" 265 | return self._create_examples( 266 | self._read_tsv(os.path.join(data_dir, "dev_matched.tsv")), 267 | "dev_matched") 268 | 269 | def get_test_examples(self, data_dir): 270 | """See base class.""" 271 | return self._create_examples( 272 | self._read_tsv(os.path.join(data_dir, "test_matched.tsv")), "test") 273 | 274 | def get_labels(self): 275 | """See base class.""" 276 | return ["contradiction", "entailment", "neutral"] 277 | 278 | def _create_examples(self, lines, set_type): 279 | """Creates examples for the training and dev sets.""" 280 | examples = [] 281 | for (i, line) in enumerate(lines): 282 | if i == 0: 283 | continue 284 | guid = "%s-%s" % (set_type, tokenization.convert_to_unicode(line[0])) 285 | text_a = tokenization.convert_to_unicode(line[8]) 286 | text_b = tokenization.convert_to_unicode(line[9]) 287 | if set_type == "test": 288 | label = "contradiction" 289 | else: 290 | label = tokenization.convert_to_unicode(line[-1]) 291 | examples.append( 292 | InputExample(guid=guid, text_a=text_a, text_b=text_b, label=label)) 293 | return examples 294 | 295 | 296 | class MrpcProcessor(DataProcessor): 297 | """Processor for the MRPC data set (GLUE version).""" 298 | 299 | def get_train_examples(self, data_dir): 300 | """See base class.""" 301 | return self._create_examples( 302 | self._read_tsv(os.path.join(data_dir, "train.tsv")), "train") 303 | 304 | def get_dev_examples(self, data_dir): 305 | """See base class.""" 306 | return self._create_examples( 307 | self._read_tsv(os.path.join(data_dir, "dev.tsv")), "dev") 308 | 309 | def get_test_examples(self, data_dir): 310 | """See base class.""" 311 | return self._create_examples( 312 | self._read_tsv(os.path.join(data_dir, "test.tsv")), "test") 313 | 314 | def get_labels(self): 315 | """See base class.""" 316 | return ["0", "1"] 317 | 318 | def _create_examples(self, lines, set_type): 319 | """Creates examples for the training and dev sets.""" 320 | examples = [] 321 | for (i, line) in enumerate(lines): 322 | if i == 0: 323 | continue 324 | guid = "%s-%s" % (set_type, i) 325 | text_a = tokenization.convert_to_unicode(line[3]) 326 | text_b = tokenization.convert_to_unicode(line[4]) 327 | if set_type == "test": 328 | label = "0" 329 | else: 330 | label = tokenization.convert_to_unicode(line[0]) 331 | examples.append( 332 | InputExample(guid=guid, text_a=text_a, text_b=text_b, label=label)) 333 | return examples 334 | 335 | 336 | 337 | class ColaProcessor(DataProcessor): 338 | """Processor for the CoLA data set (GLUE version).""" 339 | 340 | def get_train_examples(self, data_dir): 341 | """See base class.""" 342 | return self._create_examples( 343 | self._read_tsv(os.path.join(data_dir, "train.tsv")), "train") 344 | 345 | def get_dev_examples(self, data_dir): 346 | """See base class.""" 347 | return self._create_examples( 348 | self._read_tsv(os.path.join(data_dir, "dev.tsv")), "dev") 349 | 350 | def get_test_examples(self, data_dir): 351 | """See base class.""" 352 | return self._create_examples( 353 | self._read_tsv(os.path.join(data_dir, "test.tsv")), "test") 354 | 355 | def get_labels(self): 356 | """See base class.""" 357 | return ["0", "1"] 358 | 359 | def _create_examples(self, lines, set_type): 360 | """Creates examples for the training and dev sets.""" 361 | examples = [] 362 | for (i, line) in enumerate(lines): 363 | # Only the test set has a header 364 | if set_type == "test" and i == 0: 365 | continue 366 | guid = "%s-%s" % (set_type, i) 367 | if set_type == "test": 368 | text_a = tokenization.convert_to_unicode(line[1]) 369 | label = "0" 370 | else: 371 | text_a = tokenization.convert_to_unicode(line[3]) 372 | label = tokenization.convert_to_unicode(line[1]) 373 | examples.append( 374 | InputExample(guid=guid, text_a=text_a, text_b=None, label=label)) 375 | return examples 376 | 377 | 378 | def convert_single_example(ex_index, example, label_list, max_seq_length, 379 | tokenizer): 380 | """Converts a single `InputExample` into a single `InputFeatures`.""" 381 | 382 | if isinstance(example, PaddingInputExample): 383 | return InputFeatures( 384 | input_ids=[0] * max_seq_length, 385 | input_mask=[0] * max_seq_length, 386 | segment_ids=[0] * max_seq_length, 387 | label_id=0, 388 | is_real_example=False) 389 | 390 | label_map = {} 391 | for (i, label) in enumerate(label_list): 392 | label_map[label] = i 393 | 394 | tokens_a = tokenizer.tokenize(example.text_a) 395 | tokens_b = None 396 | if example.text_b: 397 | tokens_b = tokenizer.tokenize(example.text_b) 398 | 399 | if tokens_b: 400 | # Modifies `tokens_a` and `tokens_b` in place so that the total 401 | # length is less than the specified length. 402 | # Account for [CLS], [SEP], [SEP] with "- 3" 403 | _truncate_seq_pair(tokens_a, tokens_b, max_seq_length - 3) 404 | else: 405 | # Account for [CLS] and [SEP] with "- 2" 406 | if len(tokens_a) > max_seq_length - 2: 407 | tokens_a = tokens_a[0:(max_seq_length - 2)] 408 | 409 | # The convention in BERT is: 410 | # (a) For sequence pairs: 411 | # tokens: [CLS] is this jack ##son ##ville ? [SEP] no it is not . [SEP] 412 | # type_ids: 0 0 0 0 0 0 0 0 1 1 1 1 1 1 413 | # (b) For single sequences: 414 | # tokens: [CLS] the dog is hairy . [SEP] 415 | # type_ids: 0 0 0 0 0 0 0 416 | # 417 | # Where "type_ids" are used to indicate whether this is the first 418 | # sequence or the second sequence. The embedding vectors for `type=0` and 419 | # `type=1` were learned during pre-training and are added to the wordpiece 420 | # embedding vector (and position vector). This is not *strictly* necessary 421 | # since the [SEP] token unambiguously separates the sequences, but it makes 422 | # it easier for the model to learn the concept of sequences. 423 | # 424 | # For classification tasks, the first vector (corresponding to [CLS]) is 425 | # used as the "sentence vector". Note that this only makes sense because 426 | # the entire model is fine-tuned. 427 | tokens = [] 428 | segment_ids = [] 429 | tokens.append("[CLS]") 430 | segment_ids.append(0) 431 | for token in tokens_a: 432 | tokens.append(token) 433 | segment_ids.append(0) 434 | tokens.append("[SEP]") 435 | segment_ids.append(0) 436 | 437 | if tokens_b: 438 | for token in tokens_b: 439 | tokens.append(token) 440 | segment_ids.append(1) 441 | tokens.append("[SEP]") 442 | segment_ids.append(1) 443 | 444 | input_ids = tokenizer.convert_tokens_to_ids(tokens) 445 | 446 | # The mask has 1 for real tokens and 0 for padding tokens. Only real 447 | # tokens are attended to. 448 | input_mask = [1] * len(input_ids) 449 | 450 | # Zero-pad up to the sequence length. 451 | while len(input_ids) < max_seq_length: 452 | input_ids.append(0) 453 | input_mask.append(0) 454 | segment_ids.append(0) 455 | 456 | assert len(input_ids) == max_seq_length 457 | assert len(input_mask) == max_seq_length 458 | assert len(segment_ids) == max_seq_length 459 | 460 | label_id = label_map[example.label] 461 | if ex_index < 5: 462 | tf.logging.info("*** Example ***") 463 | tf.logging.info("guid: %s" % (example.guid)) 464 | tf.logging.info("tokens: %s" % " ".join( 465 | [tokenization.printable_text(x) for x in tokens])) 466 | tf.logging.info("input_ids: %s" % " ".join([str(x) for x in input_ids])) 467 | tf.logging.info("input_mask: %s" % " ".join([str(x) for x in input_mask])) 468 | tf.logging.info("segment_ids: %s" % " ".join([str(x) for x in segment_ids])) 469 | tf.logging.info("label: %s (id = %d)" % (example.label, label_id)) 470 | 471 | feature = InputFeatures( 472 | input_ids=input_ids, 473 | input_mask=input_mask, 474 | segment_ids=segment_ids, 475 | label_id=label_id, 476 | is_real_example=True) 477 | return feature 478 | 479 | 480 | def file_based_convert_examples_to_features( 481 | examples, label_list, max_seq_length, tokenizer, output_file): 482 | """Convert a set of `InputExample`s to a TFRecord file.""" 483 | 484 | writer = tf.python_io.TFRecordWriter(output_file) 485 | 486 | for (ex_index, example) in enumerate(examples): 487 | if ex_index % 10000 == 0: 488 | tf.logging.info("Writing example %d of %d" % (ex_index, len(examples))) 489 | 490 | feature = convert_single_example(ex_index, example, label_list, 491 | max_seq_length, tokenizer) 492 | 493 | def create_int_feature(values): 494 | f = tf.train.Feature(int64_list=tf.train.Int64List(value=list(values))) 495 | return f 496 | 497 | features = collections.OrderedDict() 498 | features["input_ids"] = create_int_feature(feature.input_ids) 499 | features["input_mask"] = create_int_feature(feature.input_mask) 500 | features["segment_ids"] = create_int_feature(feature.segment_ids) 501 | features["label_ids"] = create_int_feature([feature.label_id]) 502 | features["is_real_example"] = create_int_feature( 503 | [int(feature.is_real_example)]) 504 | 505 | tf_example = tf.train.Example(features=tf.train.Features(feature=features)) 506 | writer.write(tf_example.SerializeToString()) 507 | writer.close() 508 | 509 | 510 | def file_based_input_fn_builder(input_file, seq_length, is_training, 511 | drop_remainder): 512 | """Creates an `input_fn` closure to be passed to TPUEstimator.""" 513 | 514 | name_to_features = { 515 | "input_ids": tf.FixedLenFeature([seq_length], tf.int64), 516 | "input_mask": tf.FixedLenFeature([seq_length], tf.int64), 517 | "segment_ids": tf.FixedLenFeature([seq_length], tf.int64), 518 | "label_ids": tf.FixedLenFeature([], tf.int64), 519 | "is_real_example": tf.FixedLenFeature([], tf.int64), 520 | } 521 | 522 | def _decode_record(record, name_to_features): 523 | """Decodes a record to a TensorFlow example.""" 524 | example = tf.parse_single_example(record, name_to_features) 525 | 526 | # tf.Example only supports tf.int64, but the TPU only supports tf.int32. 527 | # So cast all int64 to int32. 528 | for name in list(example.keys()): 529 | t = example[name] 530 | if t.dtype == tf.int64: 531 | t = tf.to_int32(t) 532 | example[name] = t 533 | 534 | return example 535 | 536 | def input_fn(params): 537 | """The actual input function.""" 538 | batch_size = params["batch_size"] 539 | 540 | # For training, we want a lot of parallel reading and shuffling. 541 | # For eval, we want no shuffling and parallel reading doesn't matter. 542 | d = tf.data.TFRecordDataset(input_file) 543 | if is_training: 544 | d = d.repeat() 545 | d = d.shuffle(buffer_size=100) 546 | 547 | d = d.apply( 548 | tf.contrib.data.map_and_batch( 549 | lambda record: _decode_record(record, name_to_features), 550 | batch_size=batch_size, 551 | drop_remainder=drop_remainder)) 552 | 553 | return d 554 | 555 | return input_fn 556 | 557 | 558 | def _truncate_seq_pair(tokens_a, tokens_b, max_length): 559 | """Truncates a sequence pair in place to the maximum length.""" 560 | 561 | # This is a simple heuristic which will always truncate the longer sequence 562 | # one token at a time. This makes more sense than truncating an equal percent 563 | # of tokens from each, since if one sequence is very short then each token 564 | # that's truncated likely contains more information than a longer sequence. 565 | while True: 566 | total_length = len(tokens_a) + len(tokens_b) 567 | if total_length <= max_length: 568 | break 569 | if len(tokens_a) > len(tokens_b): 570 | tokens_a.pop() 571 | else: 572 | tokens_b.pop() 573 | 574 | 575 | def create_model(bert_config, is_training, input_ids, input_mask, segment_ids, 576 | labels, num_labels, use_one_hot_embeddings): 577 | """Creates a classification model.""" 578 | model = modeling.BertModel( 579 | config=bert_config, 580 | is_training=is_training, 581 | input_ids=input_ids, 582 | input_mask=input_mask, 583 | token_type_ids=segment_ids, 584 | use_one_hot_embeddings=use_one_hot_embeddings) 585 | 586 | # In the demo, we are doing a simple classification task on the entire 587 | # segment. 588 | # 589 | # If you want to use the token-level output, use model.get_sequence_output() 590 | # instead. 591 | output_layer = model.get_pooled_output() 592 | 593 | hidden_size = output_layer.shape[-1].value 594 | 595 | output_weights = tf.get_variable( 596 | "output_weights", [num_labels, hidden_size], 597 | initializer=tf.truncated_normal_initializer(stddev=0.02)) 598 | 599 | output_bias = tf.get_variable( 600 | "output_bias", [num_labels], initializer=tf.zeros_initializer()) 601 | 602 | with tf.variable_scope("loss"): 603 | if is_training: 604 | # I.e., 0.1 dropout 605 | output_layer = tf.nn.dropout(output_layer, keep_prob=0.9) 606 | 607 | logits = tf.matmul(output_layer, output_weights, transpose_b=True) 608 | logits = tf.nn.bias_add(logits, output_bias) 609 | probabilities = tf.nn.softmax(logits, axis=-1) 610 | log_probs = tf.nn.log_softmax(logits, axis=-1) 611 | 612 | one_hot_labels = tf.one_hot(labels, depth=num_labels, dtype=tf.float32) 613 | 614 | per_example_loss = -tf.reduce_sum(one_hot_labels * log_probs, axis=-1) 615 | loss = tf.reduce_mean(per_example_loss) 616 | 617 | return (loss, per_example_loss, logits, probabilities) 618 | 619 | 620 | def model_fn_builder(bert_config, num_labels, init_checkpoint, learning_rate, 621 | num_train_steps, num_warmup_steps, use_tpu, 622 | use_one_hot_embeddings): 623 | """Returns `model_fn` closure for TPUEstimator.""" 624 | 625 | def model_fn(features, labels, mode, params): # pylint: disable=unused-argument 626 | """The `model_fn` for TPUEstimator.""" 627 | 628 | tf.logging.info("*** Features ***") 629 | for name in sorted(features.keys()): 630 | tf.logging.info(" name = %s, shape = %s" % (name, features[name].shape)) 631 | 632 | input_ids = features["input_ids"] 633 | input_mask = features["input_mask"] 634 | segment_ids = features["segment_ids"] 635 | label_ids = features["label_ids"] 636 | is_real_example = None 637 | if "is_real_example" in features: 638 | is_real_example = tf.cast(features["is_real_example"], dtype=tf.float32) 639 | else: 640 | is_real_example = tf.ones(tf.shape(label_ids), dtype=tf.float32) 641 | 642 | is_training = (mode == tf.estimator.ModeKeys.TRAIN) 643 | 644 | (total_loss, per_example_loss, logits, probabilities) = create_model( 645 | bert_config, is_training, input_ids, input_mask, segment_ids, label_ids, 646 | num_labels, use_one_hot_embeddings) 647 | 648 | tvars = tf.trainable_variables() 649 | initialized_variable_names = {} 650 | scaffold_fn = None 651 | if init_checkpoint: 652 | (assignment_map, initialized_variable_names 653 | ) = modeling.get_assignment_map_from_checkpoint(tvars, init_checkpoint) 654 | if use_tpu: 655 | 656 | def tpu_scaffold(): 657 | tf.train.init_from_checkpoint(init_checkpoint, assignment_map) 658 | return tf.train.Scaffold() 659 | 660 | scaffold_fn = tpu_scaffold 661 | else: 662 | tf.train.init_from_checkpoint(init_checkpoint, assignment_map) 663 | 664 | tf.logging.info("**** Trainable Variables ****") 665 | for var in tvars: 666 | init_string = "" 667 | if var.name in initialized_variable_names: 668 | init_string = ", *INIT_FROM_CKPT*" 669 | tf.logging.info(" name = %s, shape = %s%s", var.name, var.shape, 670 | init_string) 671 | 672 | output_spec = None 673 | if mode == tf.estimator.ModeKeys.TRAIN: 674 | 675 | train_op = optimization.create_optimizer( 676 | total_loss, learning_rate, num_train_steps, num_warmup_steps, use_tpu) 677 | 678 | output_spec = tf.contrib.tpu.TPUEstimatorSpec( 679 | mode=mode, 680 | loss=total_loss, 681 | train_op=train_op, 682 | scaffold_fn=scaffold_fn) 683 | elif mode == tf.estimator.ModeKeys.EVAL: 684 | 685 | def metric_fn(per_example_loss, label_ids, logits, is_real_example): 686 | predictions = tf.argmax(logits, axis=-1, output_type=tf.int32) 687 | accuracy = tf.metrics.accuracy( 688 | labels=label_ids, predictions=predictions, weights=is_real_example) 689 | loss = tf.metrics.mean(values=per_example_loss, weights=is_real_example) 690 | return { 691 | "eval_accuracy": accuracy, 692 | "eval_loss": loss, 693 | } 694 | 695 | eval_metrics = (metric_fn, 696 | [per_example_loss, label_ids, logits, is_real_example]) 697 | output_spec = tf.contrib.tpu.TPUEstimatorSpec( 698 | mode=mode, 699 | loss=total_loss, 700 | eval_metrics=eval_metrics, 701 | scaffold_fn=scaffold_fn) 702 | else: 703 | output_spec = tf.contrib.tpu.TPUEstimatorSpec( 704 | mode=mode, 705 | predictions={"probabilities": probabilities}, 706 | scaffold_fn=scaffold_fn) 707 | return output_spec 708 | 709 | return model_fn 710 | 711 | 712 | # This function is not used by this file but is still used by the Colab and 713 | # people who depend on it. 714 | def input_fn_builder(features, seq_length, is_training, drop_remainder): 715 | """Creates an `input_fn` closure to be passed to TPUEstimator.""" 716 | 717 | all_input_ids = [] 718 | all_input_mask = [] 719 | all_segment_ids = [] 720 | all_label_ids = [] 721 | 722 | for feature in features: 723 | all_input_ids.append(feature.input_ids) 724 | all_input_mask.append(feature.input_mask) 725 | all_segment_ids.append(feature.segment_ids) 726 | all_label_ids.append(feature.label_id) 727 | 728 | def input_fn(params): 729 | """The actual input function.""" 730 | batch_size = params["batch_size"] 731 | 732 | num_examples = len(features) 733 | 734 | # This is for demo purposes and does NOT scale to large data sets. We do 735 | # not use Dataset.from_generator() because that uses tf.py_func which is 736 | # not TPU compatible. The right way to load data is with TFRecordReader. 737 | d = tf.data.Dataset.from_tensor_slices({ 738 | "input_ids": 739 | tf.constant( 740 | all_input_ids, shape=[num_examples, seq_length], 741 | dtype=tf.int32), 742 | "input_mask": 743 | tf.constant( 744 | all_input_mask, 745 | shape=[num_examples, seq_length], 746 | dtype=tf.int32), 747 | "segment_ids": 748 | tf.constant( 749 | all_segment_ids, 750 | shape=[num_examples, seq_length], 751 | dtype=tf.int32), 752 | "label_ids": 753 | tf.constant(all_label_ids, shape=[num_examples], dtype=tf.int32), 754 | }) 755 | 756 | if is_training: 757 | d = d.repeat() 758 | d = d.shuffle(buffer_size=100) 759 | 760 | d = d.batch(batch_size=batch_size, drop_remainder=drop_remainder) 761 | return d 762 | 763 | return input_fn 764 | 765 | 766 | # This function is not used by this file but is still used by the Colab and 767 | # people who depend on it. 768 | def convert_examples_to_features(examples, label_list, max_seq_length, 769 | tokenizer): 770 | """Convert a set of `InputExample`s to a list of `InputFeatures`.""" 771 | 772 | features = [] 773 | for (ex_index, example) in enumerate(examples): 774 | if ex_index % 10000 == 0: 775 | tf.logging.info("Writing example %d of %d" % (ex_index, len(examples))) 776 | 777 | feature = convert_single_example(ex_index, example, label_list, 778 | max_seq_length, tokenizer) 779 | 780 | features.append(feature) 781 | return features 782 | 783 | 784 | 785 | import pandas as pd 786 | class SelfProcessor(DataProcessor): 787 | 788 | def get_train_examples(self, data_dir): 789 | file_path = os.path.join(data_dir, 'atec_nlp_sim_train_0.6.csv') 790 | reader=pd.read_csv(file_path, encoding='utf-8',error_bad_lines=False) 791 | # 如果数据不是乱序的,注意要shuffle 792 | # 这里的数据量比较大,取部分跑一下 793 | reader = reader.head(3000) 794 | print("train length:",len(reader)) 795 | 796 | examples = [] 797 | for _,row in reader.iterrows(): 798 | line=row[0] 799 | # print(line) 800 | split_line = line.strip().split("\t") 801 | if len(split_line)!=4: 802 | continue 803 | 804 | guid = split_line[0] 805 | text_a = tokenization.convert_to_unicode(split_line[1]) 806 | text_b = tokenization.convert_to_unicode(split_line[2]) 807 | label = split_line[3] 808 | examples.append(InputExample(guid=guid, text_a=text_a, 809 | text_b=text_b, label=label)) 810 | return examples 811 | 812 | def get_dev_examples(self, data_dir): 813 | file_path = os.path.join(data_dir, 'atec_nlp_sim_test_0.4.csv') 814 | reader=pd.read_csv(file_path, encoding='utf-8',error_bad_lines=False) 815 | #如果数据不是乱序的,注意要shuffle 816 | #这里的数据量比较大,取部分跑一下 817 | reader=reader.tail(500) 818 | 819 | examples = [] 820 | for _,row in reader.iterrows(): 821 | line=row[0] 822 | # print(line) 823 | split_line = line.strip().split("\t") 824 | if len(split_line)!=4: 825 | continue 826 | 827 | guid = split_line[0] 828 | text_a = tokenization.convert_to_unicode(split_line[1]) 829 | text_b = tokenization.convert_to_unicode(split_line[2]) 830 | label = split_line[3] 831 | examples.append(InputExample(guid=guid, text_a=text_a, 832 | text_b=text_b, label=label)) 833 | return examples 834 | 835 | def get_test_examples(self, data_dir): 836 | file_path = os.path.join(data_dir, 'atec_nlp_sim_test_0.4.csv') 837 | reader=pd.read_csv(file_path, encoding='utf-8',error_bad_lines=False) 838 | # 这里的数据量比较大,取部分跑一下,跟验证集的数据区分开 839 | reader = reader.head(100) 840 | 841 | examples = [] 842 | for _,row in reader.iterrows(): 843 | line=row[0] 844 | # print(line) 845 | split_line = line.strip().split("\t") 846 | if len(split_line)!=4: 847 | continue 848 | 849 | guid = split_line[0] 850 | text_a = tokenization.convert_to_unicode(split_line[1]) 851 | text_b = tokenization.convert_to_unicode(split_line[2]) 852 | label = split_line[3] 853 | examples.append(InputExample(guid=guid, text_a=text_a, 854 | text_b=text_b, label=label)) 855 | return examples 856 | 857 | def get_labels(self): 858 | """See base class.""" 859 | return ["0", "1"] 860 | 861 | def _create_examples(self, lines, set_type): 862 | """Creates examples for the training and dev sets.""" 863 | examples = [] 864 | for (i, line) in enumerate(lines): 865 | # Only the test set has a header 866 | if set_type == "test" and i == 0: 867 | continue 868 | guid = "%s-%s" % (set_type, i) 869 | if set_type == "test": 870 | text_a = tokenization.convert_to_unicode(line[1]) 871 | label = "0" 872 | else: 873 | text_a = tokenization.convert_to_unicode(line[3]) 874 | label = tokenization.convert_to_unicode(line[1]) 875 | examples.append( 876 | InputExample(guid=guid, text_a=text_a, text_b=None, label=label)) 877 | return examples 878 | 879 | 880 | 881 | def main(_): 882 | tf.logging.set_verbosity(tf.logging.INFO) 883 | 884 | processors = { 885 | "cola": ColaProcessor, 886 | "mnli": MnliProcessor, 887 | "mrpc": MrpcProcessor, 888 | "xnli": XnliProcessor, 889 | "mayi": SelfProcessor 890 | } 891 | 892 | tokenization.validate_case_matches_checkpoint(FLAGS.do_lower_case, 893 | FLAGS.init_checkpoint) 894 | 895 | if not FLAGS.do_train and not FLAGS.do_eval and not FLAGS.do_predict: 896 | raise ValueError( 897 | "At least one of `do_train`, `do_eval` or `do_predict' must be True.") 898 | 899 | bert_config = modeling.BertConfig.from_json_file(FLAGS.bert_config_file) 900 | 901 | if FLAGS.max_seq_length > bert_config.max_position_embeddings: 902 | raise ValueError( 903 | "Cannot use sequence length %d because the BERT model " 904 | "was only trained up to sequence length %d" % 905 | (FLAGS.max_seq_length, bert_config.max_position_embeddings)) 906 | 907 | tf.gfile.MakeDirs(FLAGS.output_dir) 908 | 909 | task_name = FLAGS.task_name.lower() 910 | 911 | if task_name not in processors: 912 | raise ValueError("Task not found: %s" % (task_name)) 913 | 914 | processor = processors[task_name]() 915 | 916 | tokenizer = tokenization.FullTokenizer( 917 | vocab_file=FLAGS.vocab_file, do_lower_case=FLAGS.do_lower_case) 918 | 919 | tpu_cluster_resolver = None 920 | if FLAGS.use_tpu and FLAGS.tpu_name: 921 | tpu_cluster_resolver = tf.contrib.cluster_resolver.TPUClusterResolver( 922 | FLAGS.tpu_name, zone=FLAGS.tpu_zone, project=FLAGS.gcp_project) 923 | 924 | is_per_host = tf.contrib.tpu.InputPipelineConfig.PER_HOST_V2 925 | run_config = tf.contrib.tpu.RunConfig( 926 | cluster=tpu_cluster_resolver, 927 | master=FLAGS.master, 928 | model_dir=FLAGS.output_dir, 929 | save_checkpoints_steps=FLAGS.save_checkpoints_steps, 930 | tpu_config=tf.contrib.tpu.TPUConfig( 931 | iterations_per_loop=FLAGS.iterations_per_loop, 932 | num_shards=FLAGS.num_tpu_cores, 933 | per_host_input_for_training=is_per_host)) 934 | 935 | train_examples = None 936 | num_train_steps = None 937 | num_warmup_steps = None 938 | 939 | train_examples = processor.get_train_examples(FLAGS.data_dir) 940 | # print("len of train_examples:",len(train_examples)) 941 | 942 | if FLAGS.do_train: 943 | num_train_steps = int( 944 | len(train_examples) / FLAGS.train_batch_size * FLAGS.num_train_epochs) 945 | num_warmup_steps = int(num_train_steps * FLAGS.warmup_proportion) 946 | 947 | label_list = processor.get_labels() 948 | 949 | model_fn = model_fn_builder( 950 | bert_config=bert_config, 951 | num_labels=len(label_list), 952 | init_checkpoint=FLAGS.init_checkpoint, 953 | learning_rate=FLAGS.learning_rate, 954 | num_train_steps=num_train_steps, 955 | num_warmup_steps=num_warmup_steps, 956 | use_tpu=FLAGS.use_tpu, 957 | use_one_hot_embeddings=FLAGS.use_tpu) 958 | 959 | # If TPU is not available, this will fall back to normal Estimator on CPU 960 | # or GPU. 961 | estimator = tf.contrib.tpu.TPUEstimator( 962 | use_tpu=FLAGS.use_tpu, 963 | model_fn=model_fn, 964 | config=run_config, 965 | train_batch_size=FLAGS.train_batch_size, 966 | eval_batch_size=FLAGS.eval_batch_size, 967 | predict_batch_size=FLAGS.predict_batch_size) 968 | 969 | 970 | if FLAGS.do_train: 971 | train_file = os.path.join(FLAGS.output_dir, "train.tf_record") 972 | file_based_convert_examples_to_features( 973 | train_examples, label_list, FLAGS.max_seq_length, tokenizer, train_file) 974 | tf.logging.info("***** Running training *****") 975 | tf.logging.info(" Num examples = %d", len(train_examples)) 976 | tf.logging.info(" Batch size = %d", FLAGS.train_batch_size) 977 | tf.logging.info(" Num steps = %d", num_train_steps) 978 | train_input_fn = file_based_input_fn_builder( 979 | input_file=train_file, 980 | seq_length=FLAGS.max_seq_length, 981 | is_training=True, 982 | drop_remainder=True) 983 | estimator.train(input_fn=train_input_fn, max_steps=num_train_steps) 984 | 985 | if FLAGS.do_eval: 986 | eval_examples = processor.get_dev_examples(FLAGS.data_dir) 987 | num_actual_eval_examples = len(eval_examples) 988 | if FLAGS.use_tpu: 989 | # TPU requires a fixed batch size for all batches, therefore the number 990 | # of examples must be a multiple of the batch size, or else examples 991 | # will get dropped. So we pad with fake examples which are ignored 992 | # later on. These do NOT count towards the metric (all tf.metrics 993 | # support a per-instance weight, and these get a weight of 0.0). 994 | while len(eval_examples) % FLAGS.eval_batch_size != 0: 995 | eval_examples.append(PaddingInputExample()) 996 | 997 | eval_file = os.path.join(FLAGS.output_dir, "eval.tf_record") 998 | file_based_convert_examples_to_features( 999 | eval_examples, label_list, FLAGS.max_seq_length, tokenizer, eval_file) 1000 | 1001 | tf.logging.info("***** Running evaluation *****") 1002 | tf.logging.info(" Num examples = %d (%d actual, %d padding)", 1003 | len(eval_examples), num_actual_eval_examples, 1004 | len(eval_examples) - num_actual_eval_examples) 1005 | tf.logging.info(" Batch size = %d", FLAGS.eval_batch_size) 1006 | 1007 | # This tells the estimator to run through the entire set. 1008 | eval_steps = None 1009 | # However, if running eval on the TPU, you will need to specify the 1010 | # number of steps. 1011 | if FLAGS.use_tpu: 1012 | assert len(eval_examples) % FLAGS.eval_batch_size == 0 1013 | eval_steps = int(len(eval_examples) // FLAGS.eval_batch_size) 1014 | 1015 | eval_drop_remainder = True if FLAGS.use_tpu else False 1016 | eval_input_fn = file_based_input_fn_builder( 1017 | input_file=eval_file, 1018 | seq_length=FLAGS.max_seq_length, 1019 | is_training=False, 1020 | drop_remainder=eval_drop_remainder) 1021 | 1022 | result = estimator.evaluate(input_fn=eval_input_fn, steps=eval_steps) 1023 | 1024 | output_eval_file = os.path.join(FLAGS.output_dir, "eval_results.txt") 1025 | with tf.gfile.GFile(output_eval_file, "w") as writer: 1026 | tf.logging.info("***** Eval results *****") 1027 | for key in sorted(result.keys()): 1028 | tf.logging.info(" %s = %s", key, str(result[key])) 1029 | writer.write("%s = %s\n" % (key, str(result[key]))) 1030 | 1031 | if FLAGS.do_predict: 1032 | predict_examples = processor.get_test_examples(FLAGS.data_dir) 1033 | num_actual_predict_examples = len(predict_examples) 1034 | if FLAGS.use_tpu: 1035 | # TPU requires a fixed batch size for all batches, therefore the number 1036 | # of examples must be a multiple of the batch size, or else examples 1037 | # will get dropped. So we pad with fake examples which are ignored 1038 | # later on. 1039 | while len(predict_examples) % FLAGS.predict_batch_size != 0: 1040 | predict_examples.append(PaddingInputExample()) 1041 | 1042 | predict_file = os.path.join(FLAGS.output_dir, "predict.tf_record") 1043 | file_based_convert_examples_to_features(predict_examples, label_list, 1044 | FLAGS.max_seq_length, tokenizer, 1045 | predict_file) 1046 | 1047 | tf.logging.info("***** Running prediction*****") 1048 | tf.logging.info(" Num examples = %d (%d actual, %d padding)", 1049 | len(predict_examples), num_actual_predict_examples, 1050 | len(predict_examples) - num_actual_predict_examples) 1051 | tf.logging.info(" Batch size = %d", FLAGS.predict_batch_size) 1052 | 1053 | predict_drop_remainder = True if FLAGS.use_tpu else False 1054 | predict_input_fn = file_based_input_fn_builder( 1055 | input_file=predict_file, 1056 | seq_length=FLAGS.max_seq_length, 1057 | is_training=False, 1058 | drop_remainder=predict_drop_remainder) 1059 | 1060 | result = estimator.predict(input_fn=predict_input_fn) 1061 | 1062 | output_predict_file = os.path.join(FLAGS.output_dir, "test_results.tsv") 1063 | with tf.gfile.GFile(output_predict_file, "w") as writer: 1064 | num_written_lines = 0 1065 | tf.logging.info("***** Predict results *****") 1066 | for (i, prediction) in enumerate(result): 1067 | probabilities = prediction["probabilities"] 1068 | if i >= num_actual_predict_examples: 1069 | break 1070 | output_line = "\t".join( 1071 | str(class_probability) 1072 | for class_probability in probabilities) + "\n" 1073 | writer.write(output_line) 1074 | num_written_lines += 1 1075 | assert num_written_lines == num_actual_predict_examples 1076 | 1077 | 1078 | 1079 | 1080 | 1081 | if __name__ == "__main__": 1082 | flags.mark_flag_as_required("data_dir") 1083 | flags.mark_flag_as_required("task_name") 1084 | flags.mark_flag_as_required("vocab_file") 1085 | flags.mark_flag_as_required("bert_config_file") 1086 | flags.mark_flag_as_required("output_dir") 1087 | tf.app.run() 1088 | -------------------------------------------------------------------------------- /run_classifier_with_tfhub.py: -------------------------------------------------------------------------------- 1 | # coding=utf-8 2 | # Copyright 2018 The Google AI Language Team Authors. 3 | # 4 | # Licensed under the Apache License, Version 2.0 (the "License"); 5 | # you may not use this file except in compliance with the License. 6 | # You may obtain a copy of the License at 7 | # 8 | # http://www.apache.org/licenses/LICENSE-2.0 9 | # 10 | # Unless required by applicable law or agreed to in writing, software 11 | # distributed under the License is distributed on an "AS IS" BASIS, 12 | # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. 13 | # See the License for the specific language governing permissions and 14 | # limitations under the License. 15 | """BERT finetuning runner with TF-Hub.""" 16 | 17 | from __future__ import absolute_import 18 | from __future__ import division 19 | from __future__ import print_function 20 | 21 | import os 22 | import optimization 23 | import run_classifier 24 | import tokenization 25 | import tensorflow as tf 26 | import tensorflow_hub as hub 27 | 28 | flags = tf.flags 29 | 30 | FLAGS = flags.FLAGS 31 | 32 | flags.DEFINE_string( 33 | "bert_hub_module_handle", None, 34 | "Handle for the BERT TF-Hub module.") 35 | 36 | 37 | def create_model(is_training, input_ids, input_mask, segment_ids, labels, 38 | num_labels): 39 | """Creates a classification model.""" 40 | tags = set() 41 | if is_training: 42 | tags.add("train") 43 | bert_module = hub.Module( 44 | FLAGS.bert_hub_module_handle, 45 | tags=tags, 46 | trainable=True) 47 | bert_inputs = dict( 48 | input_ids=input_ids, 49 | input_mask=input_mask, 50 | segment_ids=segment_ids) 51 | bert_outputs = bert_module( 52 | inputs=bert_inputs, 53 | signature="tokens", 54 | as_dict=True) 55 | 56 | # In the demo, we are doing a simple classification task on the entire 57 | # segment. 58 | # 59 | # If you want to use the token-level output, use 60 | # bert_outputs["sequence_output"] instead. 61 | output_layer = bert_outputs["pooled_output"] 62 | 63 | hidden_size = output_layer.shape[-1].value 64 | 65 | output_weights = tf.get_variable( 66 | "output_weights", [num_labels, hidden_size], 67 | initializer=tf.truncated_normal_initializer(stddev=0.02)) 68 | 69 | output_bias = tf.get_variable( 70 | "output_bias", [num_labels], initializer=tf.zeros_initializer()) 71 | 72 | with tf.variable_scope("loss"): 73 | if is_training: 74 | # I.e., 0.1 dropout 75 | output_layer = tf.nn.dropout(output_layer, keep_prob=0.9) 76 | 77 | logits = tf.matmul(output_layer, output_weights, transpose_b=True) 78 | logits = tf.nn.bias_add(logits, output_bias) 79 | log_probs = tf.nn.log_softmax(logits, axis=-1) 80 | 81 | one_hot_labels = tf.one_hot(labels, depth=num_labels, dtype=tf.float32) 82 | 83 | per_example_loss = -tf.reduce_sum(one_hot_labels * log_probs, axis=-1) 84 | loss = tf.reduce_mean(per_example_loss) 85 | 86 | return (loss, per_example_loss, logits) 87 | 88 | 89 | def model_fn_builder(num_labels, learning_rate, num_train_steps, 90 | num_warmup_steps, use_tpu): 91 | """Returns `model_fn` closure for TPUEstimator.""" 92 | 93 | def model_fn(features, labels, mode, params): # pylint: disable=unused-argument 94 | """The `model_fn` for TPUEstimator.""" 95 | 96 | tf.logging.info("*** Features ***") 97 | for name in sorted(features.keys()): 98 | tf.logging.info(" name = %s, shape = %s" % (name, features[name].shape)) 99 | 100 | input_ids = features["input_ids"] 101 | input_mask = features["input_mask"] 102 | segment_ids = features["segment_ids"] 103 | label_ids = features["label_ids"] 104 | 105 | is_training = (mode == tf.estimator.ModeKeys.TRAIN) 106 | 107 | (total_loss, per_example_loss, logits) = create_model( 108 | is_training, input_ids, input_mask, segment_ids, label_ids, num_labels) 109 | 110 | output_spec = None 111 | if mode == tf.estimator.ModeKeys.TRAIN: 112 | train_op = optimization.create_optimizer( 113 | total_loss, learning_rate, num_train_steps, num_warmup_steps, use_tpu) 114 | 115 | output_spec = tf.contrib.tpu.TPUEstimatorSpec( 116 | mode=mode, 117 | loss=total_loss, 118 | train_op=train_op) 119 | elif mode == tf.estimator.ModeKeys.EVAL: 120 | 121 | def metric_fn(per_example_loss, label_ids, logits): 122 | predictions = tf.argmax(logits, axis=-1, output_type=tf.int32) 123 | accuracy = tf.metrics.accuracy(label_ids, predictions) 124 | loss = tf.metrics.mean(per_example_loss) 125 | return { 126 | "eval_accuracy": accuracy, 127 | "eval_loss": loss, 128 | } 129 | 130 | eval_metrics = (metric_fn, [per_example_loss, label_ids, logits]) 131 | output_spec = tf.contrib.tpu.TPUEstimatorSpec( 132 | mode=mode, 133 | loss=total_loss, 134 | eval_metrics=eval_metrics) 135 | else: 136 | raise ValueError("Only TRAIN and EVAL modes are supported: %s" % (mode)) 137 | 138 | return output_spec 139 | 140 | return model_fn 141 | 142 | 143 | def create_tokenizer_from_hub_module(): 144 | """Get the vocab file and casing info from the Hub module.""" 145 | with tf.Graph().as_default(): 146 | bert_module = hub.Module(FLAGS.bert_hub_module_handle) 147 | tokenization_info = bert_module(signature="tokenization_info", as_dict=True) 148 | with tf.Session() as sess: 149 | vocab_file, do_lower_case = sess.run([tokenization_info["vocab_file"], 150 | tokenization_info["do_lower_case"]]) 151 | return tokenization.FullTokenizer( 152 | vocab_file=vocab_file, do_lower_case=do_lower_case) 153 | 154 | 155 | def main(_): 156 | tf.logging.set_verbosity(tf.logging.INFO) 157 | 158 | processors = { 159 | "cola": run_classifier.ColaProcessor, 160 | "mnli": run_classifier.MnliProcessor, 161 | "mrpc": run_classifier.MrpcProcessor, 162 | } 163 | 164 | if not FLAGS.do_train and not FLAGS.do_eval: 165 | raise ValueError("At least one of `do_train` or `do_eval` must be True.") 166 | 167 | tf.gfile.MakeDirs(FLAGS.output_dir) 168 | 169 | task_name = FLAGS.task_name.lower() 170 | 171 | if task_name not in processors: 172 | raise ValueError("Task not found: %s" % (task_name)) 173 | 174 | processor = processors[task_name]() 175 | 176 | label_list = processor.get_labels() 177 | 178 | tokenizer = create_tokenizer_from_hub_module() 179 | 180 | tpu_cluster_resolver = None 181 | if FLAGS.use_tpu and FLAGS.tpu_name: 182 | tpu_cluster_resolver = tf.contrib.cluster_resolver.TPUClusterResolver( 183 | FLAGS.tpu_name, zone=FLAGS.tpu_zone, project=FLAGS.gcp_project) 184 | 185 | is_per_host = tf.contrib.tpu.InputPipelineConfig.PER_HOST_V2 186 | run_config = tf.contrib.tpu.RunConfig( 187 | cluster=tpu_cluster_resolver, 188 | master=FLAGS.master, 189 | model_dir=FLAGS.output_dir, 190 | save_checkpoints_steps=FLAGS.save_checkpoints_steps, 191 | tpu_config=tf.contrib.tpu.TPUConfig( 192 | iterations_per_loop=FLAGS.iterations_per_loop, 193 | num_shards=FLAGS.num_tpu_cores, 194 | per_host_input_for_training=is_per_host)) 195 | 196 | train_examples = None 197 | num_train_steps = None 198 | num_warmup_steps = None 199 | if FLAGS.do_train: 200 | train_examples = processor.get_train_examples(FLAGS.data_dir) 201 | num_train_steps = int( 202 | len(train_examples) / FLAGS.train_batch_size * FLAGS.num_train_epochs) 203 | num_warmup_steps = int(num_train_steps * FLAGS.warmup_proportion) 204 | 205 | model_fn = model_fn_builder( 206 | num_labels=len(label_list), 207 | learning_rate=FLAGS.learning_rate, 208 | num_train_steps=num_train_steps, 209 | num_warmup_steps=num_warmup_steps, 210 | use_tpu=FLAGS.use_tpu) 211 | 212 | # If TPU is not available, this will fall back to normal Estimator on CPU 213 | # or GPU. 214 | estimator = tf.contrib.tpu.TPUEstimator( 215 | use_tpu=FLAGS.use_tpu, 216 | model_fn=model_fn, 217 | config=run_config, 218 | train_batch_size=FLAGS.train_batch_size, 219 | eval_batch_size=FLAGS.eval_batch_size) 220 | 221 | if FLAGS.do_train: 222 | train_features = run_classifier.convert_examples_to_features( 223 | train_examples, label_list, FLAGS.max_seq_length, tokenizer) 224 | tf.logging.info("***** Running training *****") 225 | tf.logging.info(" Num examples = %d", len(train_examples)) 226 | tf.logging.info(" Batch size = %d", FLAGS.train_batch_size) 227 | tf.logging.info(" Num steps = %d", num_train_steps) 228 | train_input_fn = run_classifier.input_fn_builder( 229 | features=train_features, 230 | seq_length=FLAGS.max_seq_length, 231 | is_training=True, 232 | drop_remainder=True) 233 | estimator.train(input_fn=train_input_fn, max_steps=num_train_steps) 234 | 235 | if FLAGS.do_eval: 236 | eval_examples = processor.get_dev_examples(FLAGS.data_dir) 237 | eval_features = run_classifier.convert_examples_to_features( 238 | eval_examples, label_list, FLAGS.max_seq_length, tokenizer) 239 | 240 | tf.logging.info("***** Running evaluation *****") 241 | tf.logging.info(" Num examples = %d", len(eval_examples)) 242 | tf.logging.info(" Batch size = %d", FLAGS.eval_batch_size) 243 | 244 | # This tells the estimator to run through the entire set. 245 | eval_steps = None 246 | # However, if running eval on the TPU, you will need to specify the 247 | # number of steps. 248 | if FLAGS.use_tpu: 249 | # Eval will be slightly WRONG on the TPU because it will truncate 250 | # the last batch. 251 | eval_steps = int(len(eval_examples) / FLAGS.eval_batch_size) 252 | 253 | eval_drop_remainder = True if FLAGS.use_tpu else False 254 | eval_input_fn = run_classifier.input_fn_builder( 255 | features=eval_features, 256 | seq_length=FLAGS.max_seq_length, 257 | is_training=False, 258 | drop_remainder=eval_drop_remainder) 259 | 260 | result = estimator.evaluate(input_fn=eval_input_fn, steps=eval_steps) 261 | 262 | output_eval_file = os.path.join(FLAGS.output_dir, "eval_results.txt") 263 | with tf.gfile.GFile(output_eval_file, "w") as writer: 264 | tf.logging.info("***** Eval results *****") 265 | for key in sorted(result.keys()): 266 | tf.logging.info(" %s = %s", key, str(result[key])) 267 | writer.write("%s = %s\n" % (key, str(result[key]))) 268 | 269 | 270 | if __name__ == "__main__": 271 | flags.mark_flag_as_required("data_dir") 272 | flags.mark_flag_as_required("task_name") 273 | flags.mark_flag_as_required("bert_hub_module_handle") 274 | flags.mark_flag_as_required("output_dir") 275 | tf.app.run() 276 | -------------------------------------------------------------------------------- /run_pretraining.py: -------------------------------------------------------------------------------- 1 | # coding=utf-8 2 | # Copyright 2018 The Google AI Language Team Authors. 3 | # 4 | # Licensed under the Apache License, Version 2.0 (the "License"); 5 | # you may not use this file except in compliance with the License. 6 | # You may obtain a copy of the License at 7 | # 8 | # http://www.apache.org/licenses/LICENSE-2.0 9 | # 10 | # Unless required by applicable law or agreed to in writing, software 11 | # distributed under the License is distributed on an "AS IS" BASIS, 12 | # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. 13 | # See the License for the specific language governing permissions and 14 | # limitations under the License. 15 | """Run masked LM/next sentence masked_lm pre-training for BERT.""" 16 | 17 | from __future__ import absolute_import 18 | from __future__ import division 19 | from __future__ import print_function 20 | 21 | import os 22 | import modeling 23 | import optimization 24 | import tensorflow as tf 25 | 26 | flags = tf.flags 27 | 28 | FLAGS = flags.FLAGS 29 | 30 | ## Required parameters 31 | flags.DEFINE_string( 32 | "bert_config_file", None, 33 | "The config json file corresponding to the pre-trained BERT model. " 34 | "This specifies the model architecture.") 35 | 36 | flags.DEFINE_string( 37 | "input_file", None, 38 | "Input TF example files (can be a glob or comma separated).") 39 | 40 | flags.DEFINE_string( 41 | "output_dir", None, 42 | "The output directory where the model checkpoints will be written.") 43 | 44 | ## Other parameters 45 | flags.DEFINE_string( 46 | "init_checkpoint", None, 47 | "Initial checkpoint (usually from a pre-trained BERT model).") 48 | 49 | flags.DEFINE_integer( 50 | "max_seq_length", 128, 51 | "The maximum total input sequence length after WordPiece tokenization. " 52 | "Sequences longer than this will be truncated, and sequences shorter " 53 | "than this will be padded. Must match data generation.") 54 | 55 | flags.DEFINE_integer( 56 | "max_predictions_per_seq", 20, 57 | "Maximum number of masked LM predictions per sequence. " 58 | "Must match data generation.") 59 | 60 | flags.DEFINE_bool("do_train", False, "Whether to run training.") 61 | 62 | flags.DEFINE_bool("do_eval", False, "Whether to run eval on the dev set.") 63 | 64 | flags.DEFINE_integer("train_batch_size", 32, "Total batch size for training.") 65 | 66 | flags.DEFINE_integer("eval_batch_size", 8, "Total batch size for eval.") 67 | 68 | flags.DEFINE_float("learning_rate", 5e-5, "The initial learning rate for Adam.") 69 | 70 | flags.DEFINE_integer("num_train_steps", 100000, "Number of training steps.") 71 | 72 | flags.DEFINE_integer("num_warmup_steps", 10000, "Number of warmup steps.") 73 | 74 | flags.DEFINE_integer("save_checkpoints_steps", 1000, 75 | "How often to save the model checkpoint.") 76 | 77 | flags.DEFINE_integer("iterations_per_loop", 1000, 78 | "How many steps to make in each estimator call.") 79 | 80 | flags.DEFINE_integer("max_eval_steps", 100, "Maximum number of eval steps.") 81 | 82 | flags.DEFINE_bool("use_tpu", False, "Whether to use TPU or GPU/CPU.") 83 | 84 | tf.flags.DEFINE_string( 85 | "tpu_name", None, 86 | "The Cloud TPU to use for training. This should be either the name " 87 | "used when creating the Cloud TPU, or a grpc://ip.address.of.tpu:8470 " 88 | "url.") 89 | 90 | tf.flags.DEFINE_string( 91 | "tpu_zone", None, 92 | "[Optional] GCE zone where the Cloud TPU is located in. If not " 93 | "specified, we will attempt to automatically detect the GCE project from " 94 | "metadata.") 95 | 96 | tf.flags.DEFINE_string( 97 | "gcp_project", None, 98 | "[Optional] Project name for the Cloud TPU-enabled project. If not " 99 | "specified, we will attempt to automatically detect the GCE project from " 100 | "metadata.") 101 | 102 | tf.flags.DEFINE_string("master", None, "[Optional] TensorFlow master URL.") 103 | 104 | flags.DEFINE_integer( 105 | "num_tpu_cores", 8, 106 | "Only used if `use_tpu` is True. Total number of TPU cores to use.") 107 | 108 | 109 | def model_fn_builder(bert_config, init_checkpoint, learning_rate, 110 | num_train_steps, num_warmup_steps, use_tpu, 111 | use_one_hot_embeddings): 112 | """Returns `model_fn` closure for TPUEstimator.""" 113 | 114 | def model_fn(features, labels, mode, params): # pylint: disable=unused-argument 115 | """The `model_fn` for TPUEstimator.""" 116 | 117 | tf.logging.info("*** Features ***") 118 | for name in sorted(features.keys()): 119 | tf.logging.info(" name = %s, shape = %s" % (name, features[name].shape)) 120 | 121 | input_ids = features["input_ids"] 122 | input_mask = features["input_mask"] 123 | segment_ids = features["segment_ids"] 124 | masked_lm_positions = features["masked_lm_positions"] 125 | masked_lm_ids = features["masked_lm_ids"] 126 | masked_lm_weights = features["masked_lm_weights"] 127 | next_sentence_labels = features["next_sentence_labels"] 128 | 129 | is_training = (mode == tf.estimator.ModeKeys.TRAIN) 130 | 131 | model = modeling.BertModel( 132 | config=bert_config, 133 | is_training=is_training, 134 | input_ids=input_ids, 135 | input_mask=input_mask, 136 | token_type_ids=segment_ids, 137 | use_one_hot_embeddings=use_one_hot_embeddings) 138 | 139 | (masked_lm_loss, 140 | masked_lm_example_loss, masked_lm_log_probs) = get_masked_lm_output( 141 | bert_config, model.get_sequence_output(), model.get_embedding_table(), 142 | masked_lm_positions, masked_lm_ids, masked_lm_weights) 143 | 144 | (next_sentence_loss, next_sentence_example_loss, 145 | next_sentence_log_probs) = get_next_sentence_output( 146 | bert_config, model.get_pooled_output(), next_sentence_labels) 147 | 148 | total_loss = masked_lm_loss + next_sentence_loss 149 | 150 | tvars = tf.trainable_variables() 151 | 152 | initialized_variable_names = {} 153 | scaffold_fn = None 154 | if init_checkpoint: 155 | (assignment_map, initialized_variable_names 156 | ) = modeling.get_assignment_map_from_checkpoint(tvars, init_checkpoint) 157 | if use_tpu: 158 | 159 | def tpu_scaffold(): 160 | tf.train.init_from_checkpoint(init_checkpoint, assignment_map) 161 | return tf.train.Scaffold() 162 | 163 | scaffold_fn = tpu_scaffold 164 | else: 165 | tf.train.init_from_checkpoint(init_checkpoint, assignment_map) 166 | 167 | tf.logging.info("**** Trainable Variables ****") 168 | for var in tvars: 169 | init_string = "" 170 | if var.name in initialized_variable_names: 171 | init_string = ", *INIT_FROM_CKPT*" 172 | tf.logging.info(" name = %s, shape = %s%s", var.name, var.shape, 173 | init_string) 174 | 175 | output_spec = None 176 | if mode == tf.estimator.ModeKeys.TRAIN: 177 | train_op = optimization.create_optimizer( 178 | total_loss, learning_rate, num_train_steps, num_warmup_steps, use_tpu) 179 | 180 | output_spec = tf.contrib.tpu.TPUEstimatorSpec( 181 | mode=mode, 182 | loss=total_loss, 183 | train_op=train_op, 184 | scaffold_fn=scaffold_fn) 185 | elif mode == tf.estimator.ModeKeys.EVAL: 186 | 187 | def metric_fn(masked_lm_example_loss, masked_lm_log_probs, masked_lm_ids, 188 | masked_lm_weights, next_sentence_example_loss, 189 | next_sentence_log_probs, next_sentence_labels): 190 | """Computes the loss and accuracy of the model.""" 191 | masked_lm_log_probs = tf.reshape(masked_lm_log_probs, 192 | [-1, masked_lm_log_probs.shape[-1]]) 193 | masked_lm_predictions = tf.argmax( 194 | masked_lm_log_probs, axis=-1, output_type=tf.int32) 195 | masked_lm_example_loss = tf.reshape(masked_lm_example_loss, [-1]) 196 | masked_lm_ids = tf.reshape(masked_lm_ids, [-1]) 197 | masked_lm_weights = tf.reshape(masked_lm_weights, [-1]) 198 | masked_lm_accuracy = tf.metrics.accuracy( 199 | labels=masked_lm_ids, 200 | predictions=masked_lm_predictions, 201 | weights=masked_lm_weights) 202 | masked_lm_mean_loss = tf.metrics.mean( 203 | values=masked_lm_example_loss, weights=masked_lm_weights) 204 | 205 | next_sentence_log_probs = tf.reshape( 206 | next_sentence_log_probs, [-1, next_sentence_log_probs.shape[-1]]) 207 | next_sentence_predictions = tf.argmax( 208 | next_sentence_log_probs, axis=-1, output_type=tf.int32) 209 | next_sentence_labels = tf.reshape(next_sentence_labels, [-1]) 210 | next_sentence_accuracy = tf.metrics.accuracy( 211 | labels=next_sentence_labels, predictions=next_sentence_predictions) 212 | next_sentence_mean_loss = tf.metrics.mean( 213 | values=next_sentence_example_loss) 214 | 215 | return { 216 | "masked_lm_accuracy": masked_lm_accuracy, 217 | "masked_lm_loss": masked_lm_mean_loss, 218 | "next_sentence_accuracy": next_sentence_accuracy, 219 | "next_sentence_loss": next_sentence_mean_loss, 220 | } 221 | 222 | eval_metrics = (metric_fn, [ 223 | masked_lm_example_loss, masked_lm_log_probs, masked_lm_ids, 224 | masked_lm_weights, next_sentence_example_loss, 225 | next_sentence_log_probs, next_sentence_labels 226 | ]) 227 | output_spec = tf.contrib.tpu.TPUEstimatorSpec( 228 | mode=mode, 229 | loss=total_loss, 230 | eval_metrics=eval_metrics, 231 | scaffold_fn=scaffold_fn) 232 | else: 233 | raise ValueError("Only TRAIN and EVAL modes are supported: %s" % (mode)) 234 | 235 | return output_spec 236 | 237 | return model_fn 238 | 239 | 240 | def get_masked_lm_output(bert_config, input_tensor, output_weights, positions, 241 | label_ids, label_weights): 242 | """Get loss and log probs for the masked LM.""" 243 | input_tensor = gather_indexes(input_tensor, positions) 244 | 245 | with tf.variable_scope("cls/predictions"): 246 | # We apply one more non-linear transformation before the output layer. 247 | # This matrix is not used after pre-training. 248 | with tf.variable_scope("transform"): 249 | input_tensor = tf.layers.dense( 250 | input_tensor, 251 | units=bert_config.hidden_size, 252 | activation=modeling.get_activation(bert_config.hidden_act), 253 | kernel_initializer=modeling.create_initializer( 254 | bert_config.initializer_range)) 255 | input_tensor = modeling.layer_norm(input_tensor) 256 | 257 | # The output weights are the same as the input embeddings, but there is 258 | # an output-only bias for each token. 259 | output_bias = tf.get_variable( 260 | "output_bias", 261 | shape=[bert_config.vocab_size], 262 | initializer=tf.zeros_initializer()) 263 | logits = tf.matmul(input_tensor, output_weights, transpose_b=True) 264 | logits = tf.nn.bias_add(logits, output_bias) 265 | log_probs = tf.nn.log_softmax(logits, axis=-1) 266 | 267 | label_ids = tf.reshape(label_ids, [-1]) 268 | label_weights = tf.reshape(label_weights, [-1]) 269 | 270 | one_hot_labels = tf.one_hot( 271 | label_ids, depth=bert_config.vocab_size, dtype=tf.float32) 272 | 273 | # The `positions` tensor might be zero-padded (if the sequence is too 274 | # short to have the maximum number of predictions). The `label_weights` 275 | # tensor has a value of 1.0 for every real prediction and 0.0 for the 276 | # padding predictions. 277 | per_example_loss = -tf.reduce_sum(log_probs * one_hot_labels, axis=[-1]) 278 | numerator = tf.reduce_sum(label_weights * per_example_loss) 279 | denominator = tf.reduce_sum(label_weights) + 1e-5 280 | loss = numerator / denominator 281 | 282 | return (loss, per_example_loss, log_probs) 283 | 284 | 285 | def get_next_sentence_output(bert_config, input_tensor, labels): 286 | """Get loss and log probs for the next sentence prediction.""" 287 | 288 | # Simple binary classification. Note that 0 is "next sentence" and 1 is 289 | # "random sentence". This weight matrix is not used after pre-training. 290 | with tf.variable_scope("cls/seq_relationship"): 291 | output_weights = tf.get_variable( 292 | "output_weights", 293 | shape=[2, bert_config.hidden_size], 294 | initializer=modeling.create_initializer(bert_config.initializer_range)) 295 | output_bias = tf.get_variable( 296 | "output_bias", shape=[2], initializer=tf.zeros_initializer()) 297 | 298 | logits = tf.matmul(input_tensor, output_weights, transpose_b=True) 299 | logits = tf.nn.bias_add(logits, output_bias) 300 | log_probs = tf.nn.log_softmax(logits, axis=-1) 301 | labels = tf.reshape(labels, [-1]) 302 | one_hot_labels = tf.one_hot(labels, depth=2, dtype=tf.float32) 303 | per_example_loss = -tf.reduce_sum(one_hot_labels * log_probs, axis=-1) 304 | loss = tf.reduce_mean(per_example_loss) 305 | return (loss, per_example_loss, log_probs) 306 | 307 | 308 | def gather_indexes(sequence_tensor, positions): 309 | """Gathers the vectors at the specific positions over a minibatch.""" 310 | sequence_shape = modeling.get_shape_list(sequence_tensor, expected_rank=3) 311 | batch_size = sequence_shape[0] 312 | seq_length = sequence_shape[1] 313 | width = sequence_shape[2] 314 | 315 | flat_offsets = tf.reshape( 316 | tf.range(0, batch_size, dtype=tf.int32) * seq_length, [-1, 1]) 317 | flat_positions = tf.reshape(positions + flat_offsets, [-1]) 318 | flat_sequence_tensor = tf.reshape(sequence_tensor, 319 | [batch_size * seq_length, width]) 320 | output_tensor = tf.gather(flat_sequence_tensor, flat_positions) 321 | return output_tensor 322 | 323 | 324 | def input_fn_builder(input_files, 325 | max_seq_length, 326 | max_predictions_per_seq, 327 | is_training, 328 | num_cpu_threads=4): 329 | """Creates an `input_fn` closure to be passed to TPUEstimator.""" 330 | 331 | def input_fn(params): 332 | """The actual input function.""" 333 | batch_size = params["batch_size"] 334 | 335 | name_to_features = { 336 | "input_ids": 337 | tf.FixedLenFeature([max_seq_length], tf.int64), 338 | "input_mask": 339 | tf.FixedLenFeature([max_seq_length], tf.int64), 340 | "segment_ids": 341 | tf.FixedLenFeature([max_seq_length], tf.int64), 342 | "masked_lm_positions": 343 | tf.FixedLenFeature([max_predictions_per_seq], tf.int64), 344 | "masked_lm_ids": 345 | tf.FixedLenFeature([max_predictions_per_seq], tf.int64), 346 | "masked_lm_weights": 347 | tf.FixedLenFeature([max_predictions_per_seq], tf.float32), 348 | "next_sentence_labels": 349 | tf.FixedLenFeature([1], tf.int64), 350 | } 351 | 352 | # For training, we want a lot of parallel reading and shuffling. 353 | # For eval, we want no shuffling and parallel reading doesn't matter. 354 | if is_training: 355 | d = tf.data.Dataset.from_tensor_slices(tf.constant(input_files)) 356 | d = d.repeat() 357 | d = d.shuffle(buffer_size=len(input_files)) 358 | 359 | # `cycle_length` is the number of parallel files that get read. 360 | cycle_length = min(num_cpu_threads, len(input_files)) 361 | 362 | # `sloppy` mode means that the interleaving is not exact. This adds 363 | # even more randomness to the training pipeline. 364 | d = d.apply( 365 | tf.contrib.data.parallel_interleave( 366 | tf.data.TFRecordDataset, 367 | sloppy=is_training, 368 | cycle_length=cycle_length)) 369 | d = d.shuffle(buffer_size=100) 370 | else: 371 | d = tf.data.TFRecordDataset(input_files) 372 | # Since we evaluate for a fixed number of steps we don't want to encounter 373 | # out-of-range exceptions. 374 | d = d.repeat() 375 | 376 | # We must `drop_remainder` on training because the TPU requires fixed 377 | # size dimensions. For eval, we assume we are evaluating on the CPU or GPU 378 | # and we *don't* want to drop the remainder, otherwise we wont cover 379 | # every sample. 380 | d = d.apply( 381 | tf.contrib.data.map_and_batch( 382 | lambda record: _decode_record(record, name_to_features), 383 | batch_size=batch_size, 384 | num_parallel_batches=num_cpu_threads, 385 | drop_remainder=True)) 386 | return d 387 | 388 | return input_fn 389 | 390 | 391 | def _decode_record(record, name_to_features): 392 | """Decodes a record to a TensorFlow example.""" 393 | example = tf.parse_single_example(record, name_to_features) 394 | 395 | # tf.Example only supports tf.int64, but the TPU only supports tf.int32. 396 | # So cast all int64 to int32. 397 | for name in list(example.keys()): 398 | t = example[name] 399 | if t.dtype == tf.int64: 400 | t = tf.to_int32(t) 401 | example[name] = t 402 | 403 | return example 404 | 405 | 406 | def main(_): 407 | tf.logging.set_verbosity(tf.logging.INFO) 408 | 409 | if not FLAGS.do_train and not FLAGS.do_eval: 410 | raise ValueError("At least one of `do_train` or `do_eval` must be True.") 411 | 412 | bert_config = modeling.BertConfig.from_json_file(FLAGS.bert_config_file) 413 | 414 | tf.gfile.MakeDirs(FLAGS.output_dir) 415 | 416 | input_files = [] 417 | for input_pattern in FLAGS.input_file.split(","): 418 | input_files.extend(tf.gfile.Glob(input_pattern)) 419 | 420 | tf.logging.info("*** Input Files ***") 421 | for input_file in input_files: 422 | tf.logging.info(" %s" % input_file) 423 | 424 | tpu_cluster_resolver = None 425 | if FLAGS.use_tpu and FLAGS.tpu_name: 426 | tpu_cluster_resolver = tf.contrib.cluster_resolver.TPUClusterResolver( 427 | FLAGS.tpu_name, zone=FLAGS.tpu_zone, project=FLAGS.gcp_project) 428 | 429 | is_per_host = tf.contrib.tpu.InputPipelineConfig.PER_HOST_V2 430 | run_config = tf.contrib.tpu.RunConfig( 431 | cluster=tpu_cluster_resolver, 432 | master=FLAGS.master, 433 | model_dir=FLAGS.output_dir, 434 | save_checkpoints_steps=FLAGS.save_checkpoints_steps, 435 | tpu_config=tf.contrib.tpu.TPUConfig( 436 | iterations_per_loop=FLAGS.iterations_per_loop, 437 | num_shards=FLAGS.num_tpu_cores, 438 | per_host_input_for_training=is_per_host)) 439 | 440 | model_fn = model_fn_builder( 441 | bert_config=bert_config, 442 | init_checkpoint=FLAGS.init_checkpoint, 443 | learning_rate=FLAGS.learning_rate, 444 | num_train_steps=FLAGS.num_train_steps, 445 | num_warmup_steps=FLAGS.num_warmup_steps, 446 | use_tpu=FLAGS.use_tpu, 447 | use_one_hot_embeddings=FLAGS.use_tpu) 448 | 449 | # If TPU is not available, this will fall back to normal Estimator on CPU 450 | # or GPU. 451 | estimator = tf.contrib.tpu.TPUEstimator( 452 | use_tpu=FLAGS.use_tpu, 453 | model_fn=model_fn, 454 | config=run_config, 455 | train_batch_size=FLAGS.train_batch_size, 456 | eval_batch_size=FLAGS.eval_batch_size) 457 | 458 | if FLAGS.do_train: 459 | tf.logging.info("***** Running training *****") 460 | tf.logging.info(" Batch size = %d", FLAGS.train_batch_size) 461 | train_input_fn = input_fn_builder( 462 | input_files=input_files, 463 | max_seq_length=FLAGS.max_seq_length, 464 | max_predictions_per_seq=FLAGS.max_predictions_per_seq, 465 | is_training=True) 466 | estimator.train(input_fn=train_input_fn, max_steps=FLAGS.num_train_steps) 467 | 468 | if FLAGS.do_eval: 469 | tf.logging.info("***** Running evaluation *****") 470 | tf.logging.info(" Batch size = %d", FLAGS.eval_batch_size) 471 | 472 | eval_input_fn = input_fn_builder( 473 | input_files=input_files, 474 | max_seq_length=FLAGS.max_seq_length, 475 | max_predictions_per_seq=FLAGS.max_predictions_per_seq, 476 | is_training=False) 477 | 478 | result = estimator.evaluate( 479 | input_fn=eval_input_fn, steps=FLAGS.max_eval_steps) 480 | 481 | output_eval_file = os.path.join(FLAGS.output_dir, "eval_results.txt") 482 | with tf.gfile.GFile(output_eval_file, "w") as writer: 483 | tf.logging.info("***** Eval results *****") 484 | for key in sorted(result.keys()): 485 | tf.logging.info(" %s = %s", key, str(result[key])) 486 | writer.write("%s = %s\n" % (key, str(result[key]))) 487 | 488 | 489 | if __name__ == "__main__": 490 | flags.mark_flag_as_required("input_file") 491 | flags.mark_flag_as_required("bert_config_file") 492 | flags.mark_flag_as_required("output_dir") 493 | tf.app.run() 494 | -------------------------------------------------------------------------------- /tokenization.py: -------------------------------------------------------------------------------- 1 | # coding=utf-8 2 | # Copyright 2018 The Google AI Language Team Authors. 3 | # 4 | # Licensed under the Apache License, Version 2.0 (the "License"); 5 | # you may not use this file except in compliance with the License. 6 | # You may obtain a copy of the License at 7 | # 8 | # http://www.apache.org/licenses/LICENSE-2.0 9 | # 10 | # Unless required by applicable law or agreed to in writing, software 11 | # distributed under the License is distributed on an "AS IS" BASIS, 12 | # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. 13 | # See the License for the specific language governing permissions and 14 | # limitations under the License. 15 | """Tokenization classes.""" 16 | 17 | from __future__ import absolute_import 18 | from __future__ import division 19 | from __future__ import print_function 20 | 21 | import collections 22 | import re 23 | import unicodedata 24 | import six 25 | import tensorflow as tf 26 | 27 | 28 | def validate_case_matches_checkpoint(do_lower_case, init_checkpoint): 29 | """Checks whether the casing config is consistent with the checkpoint name.""" 30 | 31 | # The casing has to be passed in by the user and there is no explicit check 32 | # as to whether it matches the checkpoint. The casing information probably 33 | # should have been stored in the bert_config.json file, but it's not, so 34 | # we have to heuristically detect it to validate. 35 | 36 | if not init_checkpoint: 37 | return 38 | 39 | m = re.match("^.*?([A-Za-z0-9_-]+)/bert_model.ckpt", init_checkpoint) 40 | if m is None: 41 | return 42 | 43 | model_name = m.group(1) 44 | 45 | lower_models = [ 46 | "uncased_L-24_H-1024_A-16", "uncased_L-12_H-768_A-12", 47 | "multilingual_L-12_H-768_A-12", "chinese_L-12_H-768_A-12" 48 | ] 49 | 50 | cased_models = [ 51 | "cased_L-12_H-768_A-12", "cased_L-24_H-1024_A-16", 52 | "multi_cased_L-12_H-768_A-12" 53 | ] 54 | 55 | is_bad_config = False 56 | if model_name in lower_models and not do_lower_case: 57 | is_bad_config = True 58 | actual_flag = "False" 59 | case_name = "lowercased" 60 | opposite_flag = "True" 61 | 62 | if model_name in cased_models and do_lower_case: 63 | is_bad_config = True 64 | actual_flag = "True" 65 | case_name = "cased" 66 | opposite_flag = "False" 67 | 68 | if is_bad_config: 69 | raise ValueError( 70 | "You passed in `--do_lower_case=%s` with `--init_checkpoint=%s`. " 71 | "However, `%s` seems to be a %s model, so you " 72 | "should pass in `--do_lower_case=%s` so that the fine-tuning matches " 73 | "how the model was pre-training. If this error is wrong, please " 74 | "just comment out this check." % (actual_flag, init_checkpoint, 75 | model_name, case_name, opposite_flag)) 76 | 77 | 78 | def convert_to_unicode(text): 79 | """Converts `text` to Unicode (if it's not already), assuming utf-8 input.""" 80 | if six.PY3: 81 | if isinstance(text, str): 82 | return text 83 | elif isinstance(text, bytes): 84 | return text.decode("utf-8", "ignore") 85 | else: 86 | raise ValueError("Unsupported string type: %s" % (type(text))) 87 | elif six.PY2: 88 | if isinstance(text, str): 89 | return text.decode("utf-8", "ignore") 90 | elif isinstance(text, unicode): 91 | return text 92 | else: 93 | raise ValueError("Unsupported string type: %s" % (type(text))) 94 | else: 95 | raise ValueError("Not running on Python2 or Python 3?") 96 | 97 | 98 | def printable_text(text): 99 | """Returns text encoded in a way suitable for print or `tf.logging`.""" 100 | 101 | # These functions want `str` for both Python2 and Python3, but in one case 102 | # it's a Unicode string and in the other it's a byte string. 103 | if six.PY3: 104 | if isinstance(text, str): 105 | return text 106 | elif isinstance(text, bytes): 107 | return text.decode("utf-8", "ignore") 108 | else: 109 | raise ValueError("Unsupported string type: %s" % (type(text))) 110 | elif six.PY2: 111 | if isinstance(text, str): 112 | return text 113 | elif isinstance(text, unicode): 114 | return text.encode("utf-8") 115 | else: 116 | raise ValueError("Unsupported string type: %s" % (type(text))) 117 | else: 118 | raise ValueError("Not running on Python2 or Python 3?") 119 | 120 | 121 | def load_vocab(vocab_file): 122 | """Loads a vocabulary file into a dictionary.""" 123 | vocab = collections.OrderedDict() 124 | index = 0 125 | with tf.gfile.GFile(vocab_file, "r") as reader: 126 | while True: 127 | token = convert_to_unicode(reader.readline()) 128 | if not token: 129 | break 130 | token = token.strip() 131 | vocab[token] = index 132 | index += 1 133 | return vocab 134 | 135 | 136 | def convert_by_vocab(vocab, items): 137 | """Converts a sequence of [tokens|ids] using the vocab.""" 138 | output = [] 139 | for item in items: 140 | output.append(vocab[item]) 141 | return output 142 | 143 | 144 | def convert_tokens_to_ids(vocab, tokens): 145 | return convert_by_vocab(vocab, tokens) 146 | 147 | 148 | def convert_ids_to_tokens(inv_vocab, ids): 149 | return convert_by_vocab(inv_vocab, ids) 150 | 151 | 152 | def whitespace_tokenize(text): 153 | """Runs basic whitespace cleaning and splitting on a piece of text.""" 154 | text = text.strip() 155 | if not text: 156 | return [] 157 | tokens = text.split() 158 | return tokens 159 | 160 | 161 | class FullTokenizer(object): 162 | """Runs end-to-end tokenziation.""" 163 | 164 | def __init__(self, vocab_file, do_lower_case=True): 165 | self.vocab = load_vocab(vocab_file) 166 | self.inv_vocab = {v: k for k, v in self.vocab.items()} 167 | self.basic_tokenizer = BasicTokenizer(do_lower_case=do_lower_case) 168 | self.wordpiece_tokenizer = WordpieceTokenizer(vocab=self.vocab) 169 | 170 | def tokenize(self, text): 171 | split_tokens = [] 172 | for token in self.basic_tokenizer.tokenize(text): 173 | for sub_token in self.wordpiece_tokenizer.tokenize(token): 174 | split_tokens.append(sub_token) 175 | 176 | return split_tokens 177 | 178 | def convert_tokens_to_ids(self, tokens): 179 | return convert_by_vocab(self.vocab, tokens) 180 | 181 | def convert_ids_to_tokens(self, ids): 182 | return convert_by_vocab(self.inv_vocab, ids) 183 | 184 | 185 | class BasicTokenizer(object): 186 | """Runs basic tokenization (punctuation splitting, lower casing, etc.).""" 187 | 188 | def __init__(self, do_lower_case=True): 189 | """Constructs a BasicTokenizer. 190 | 191 | Args: 192 | do_lower_case: Whether to lower case the input. 193 | """ 194 | self.do_lower_case = do_lower_case 195 | 196 | def tokenize(self, text): 197 | """Tokenizes a piece of text.""" 198 | text = convert_to_unicode(text) 199 | text = self._clean_text(text) 200 | 201 | # This was added on November 1st, 2018 for the multilingual and Chinese 202 | # models. This is also applied to the English models now, but it doesn't 203 | # matter since the English models were not trained on any Chinese data 204 | # and generally don't have any Chinese data in them (there are Chinese 205 | # characters in the vocabulary because Wikipedia does have some Chinese 206 | # words in the English Wikipedia.). 207 | text = self._tokenize_chinese_chars(text) 208 | 209 | orig_tokens = whitespace_tokenize(text) 210 | split_tokens = [] 211 | for token in orig_tokens: 212 | if self.do_lower_case: 213 | token = token.lower() 214 | token = self._run_strip_accents(token) 215 | split_tokens.extend(self._run_split_on_punc(token)) 216 | 217 | output_tokens = whitespace_tokenize(" ".join(split_tokens)) 218 | return output_tokens 219 | 220 | def _run_strip_accents(self, text): 221 | """Strips accents from a piece of text.""" 222 | text = unicodedata.normalize("NFD", text) 223 | output = [] 224 | for char in text: 225 | cat = unicodedata.category(char) 226 | if cat == "Mn": 227 | continue 228 | output.append(char) 229 | return "".join(output) 230 | 231 | def _run_split_on_punc(self, text): 232 | """Splits punctuation on a piece of text.""" 233 | chars = list(text) 234 | i = 0 235 | start_new_word = True 236 | output = [] 237 | while i < len(chars): 238 | char = chars[i] 239 | if _is_punctuation(char): 240 | output.append([char]) 241 | start_new_word = True 242 | else: 243 | if start_new_word: 244 | output.append([]) 245 | start_new_word = False 246 | output[-1].append(char) 247 | i += 1 248 | 249 | return ["".join(x) for x in output] 250 | 251 | def _tokenize_chinese_chars(self, text): 252 | """Adds whitespace around any CJK character.""" 253 | output = [] 254 | for char in text: 255 | cp = ord(char) 256 | if self._is_chinese_char(cp): 257 | output.append(" ") 258 | output.append(char) 259 | output.append(" ") 260 | else: 261 | output.append(char) 262 | return "".join(output) 263 | 264 | def _is_chinese_char(self, cp): 265 | """Checks whether CP is the codepoint of a CJK character.""" 266 | # This defines a "chinese character" as anything in the CJK Unicode block: 267 | # https://en.wikipedia.org/wiki/CJK_Unified_Ideographs_(Unicode_block) 268 | # 269 | # Note that the CJK Unicode block is NOT all Japanese and Korean characters, 270 | # despite its name. The modern Korean Hangul alphabet is a different block, 271 | # as is Japanese Hiragana and Katakana. Those alphabets are used to write 272 | # space-separated words, so they are not treated specially and handled 273 | # like the all of the other languages. 274 | if ((cp >= 0x4E00 and cp <= 0x9FFF) or # 275 | (cp >= 0x3400 and cp <= 0x4DBF) or # 276 | (cp >= 0x20000 and cp <= 0x2A6DF) or # 277 | (cp >= 0x2A700 and cp <= 0x2B73F) or # 278 | (cp >= 0x2B740 and cp <= 0x2B81F) or # 279 | (cp >= 0x2B820 and cp <= 0x2CEAF) or 280 | (cp >= 0xF900 and cp <= 0xFAFF) or # 281 | (cp >= 0x2F800 and cp <= 0x2FA1F)): # 282 | return True 283 | 284 | return False 285 | 286 | def _clean_text(self, text): 287 | """Performs invalid character removal and whitespace cleanup on text.""" 288 | output = [] 289 | for char in text: 290 | cp = ord(char) 291 | if cp == 0 or cp == 0xfffd or _is_control(char): 292 | continue 293 | if _is_whitespace(char): 294 | output.append(" ") 295 | else: 296 | output.append(char) 297 | return "".join(output) 298 | 299 | 300 | class WordpieceTokenizer(object): 301 | """Runs WordPiece tokenziation.""" 302 | 303 | def __init__(self, vocab, unk_token="[UNK]", max_input_chars_per_word=200): 304 | self.vocab = vocab 305 | self.unk_token = unk_token 306 | self.max_input_chars_per_word = max_input_chars_per_word 307 | 308 | def tokenize(self, text): 309 | """Tokenizes a piece of text into its word pieces. 310 | 311 | This uses a greedy longest-match-first algorithm to perform tokenization 312 | using the given vocabulary. 313 | 314 | For example: 315 | input = "unaffable" 316 | output = ["un", "##aff", "##able"] 317 | 318 | Args: 319 | text: A single token or whitespace separated tokens. This should have 320 | already been passed through `BasicTokenizer. 321 | 322 | Returns: 323 | A list of wordpiece tokens. 324 | """ 325 | 326 | text = convert_to_unicode(text) 327 | 328 | output_tokens = [] 329 | for token in whitespace_tokenize(text): 330 | chars = list(token) 331 | if len(chars) > self.max_input_chars_per_word: 332 | output_tokens.append(self.unk_token) 333 | continue 334 | 335 | is_bad = False 336 | start = 0 337 | sub_tokens = [] 338 | while start < len(chars): 339 | end = len(chars) 340 | cur_substr = None 341 | while start < end: 342 | substr = "".join(chars[start:end]) 343 | if start > 0: 344 | substr = "##" + substr 345 | if substr in self.vocab: 346 | cur_substr = substr 347 | break 348 | end -= 1 349 | if cur_substr is None: 350 | is_bad = True 351 | break 352 | sub_tokens.append(cur_substr) 353 | start = end 354 | 355 | if is_bad: 356 | output_tokens.append(self.unk_token) 357 | else: 358 | output_tokens.extend(sub_tokens) 359 | return output_tokens 360 | 361 | 362 | def _is_whitespace(char): 363 | """Checks whether `chars` is a whitespace character.""" 364 | # \t, \n, and \r are technically contorl characters but we treat them 365 | # as whitespace since they are generally considered as such. 366 | if char == " " or char == "\t" or char == "\n" or char == "\r": 367 | return True 368 | cat = unicodedata.category(char) 369 | if cat == "Zs": 370 | return True 371 | return False 372 | 373 | 374 | def _is_control(char): 375 | """Checks whether `chars` is a control character.""" 376 | # These are technically control characters but we count them as whitespace 377 | # characters. 378 | if char == "\t" or char == "\n" or char == "\r": 379 | return False 380 | cat = unicodedata.category(char) 381 | if cat.startswith("C"): 382 | return True 383 | return False 384 | 385 | 386 | def _is_punctuation(char): 387 | """Checks whether `chars` is a punctuation character.""" 388 | cp = ord(char) 389 | # We treat all non-letter/number ASCII as punctuation. 390 | # Characters such as "^", "$", and "`" are not in the Unicode 391 | # Punctuation class but we treat them as punctuation anyways, for 392 | # consistency. 393 | if ((cp >= 33 and cp <= 47) or (cp >= 58 and cp <= 64) or 394 | (cp >= 91 and cp <= 96) or (cp >= 123 and cp <= 126)): 395 | return True 396 | cat = unicodedata.category(char) 397 | if cat.startswith("P"): 398 | return True 399 | return False 400 | --------------------------------------------------------------------------------