├── README.md
├── __init__.py
├── atec_nlp_sim_test_0.4.csv
├── atec_nlp_sim_train_0.6.csv
├── create_pretraining_data.py
├── extract_features.py
├── modeling.py
├── modeling_test.py
├── optimization.py
├── optimization_test.py
├── requirements.txt
├── run_classifier.py
├── run_classifier_0214.py
├── run_classifier_0228.py
├── run_classifier_with_tfhub.py
├── run_pretraining.py
├── run_squad.py
└── tokenization.py


/README.md:
--------------------------------------------------------------------------------
  1 | ## Bert  中文文本分类
  2 | - 任务1 判断文本相似性 
  3 | - 任务2 文本多分类
  4 | 
  5 | 
  6 | ## 运行环境
  7 | - python3.6
  8 | - tensorflow1.12
  9 | 
 10 | 
 11 | ## 任务1 判断文本相似性
 12 | #### 数据说明
 13 | - 运行时数据所在目录为/Users/luyao/Desktop/bert_learn/MAYI，在这里为了方便起见我把数据一起上传了  
 14 | - atec_nlp_sim_train_0.6.csv和atec_nlp_sim_test_0.4.csv格式相同，各列分别为"index query1 query2 label"，"\t"分隔。来自网络上蚂蚁金服的公开数据集，是判断文本相似性的数据  
 15 | - 程序运行中只选择了部分数据(train3000,val500,test100)  
 16 | 
 17 | #### 程序说明
 18 | >如下的目录都是我自己学习时的目录，大家根据自己的实际情况进行修改
 19 | - 首先要下载**[`BERT-Base, Chinese`](https://storage.googleapis.com/bert_models/2018_11_03/chinese_L-12_H-768_A-12.zip)**并解压到/Users/luyao/Desktop/bert_learn/chinese_L-12_H-768_A-12
 20 | - run_classifier.py是Bert自带的程序，在这里我把自己写的放在了run_classifier_0214.py中
 21 | - 添加SelfProcessor类，修改了get_train_examples等三个方法，功能参考Bert自带的那些processor即可。或者也可以修改_create_examples
 22 | - 在main(_)中增加自己定义的这个processor到字典中
 23 | 
 24 | #### 训练
 25 | - 定义环境变量
 26 | >export BERT_BASE_DIR=/Users/luyao/Desktop/bert_learn/chinese_L-12_H-768_A-12  
 27 | >export MAYI_DIR=/Users/luyao/Desktop/bert_learn  
 28 | >查看环境变量是否生效 echo $BERT_BASE_DIR 等
 29 | - 进行训练
 30 | > python run_classifier_0214.py \\  
 31 |   --task_name=MAYI \\  
 32 |   --do_train=true \\  
 33 |   --do_eval=true \\  
 34 |   --data_dir=\$MAYI_DIR/MAYI  \\  
 35 |   --vocab_file=\$BERT_BASE_DIR/vocab.txt \\  
 36 |   --bert_config_file=\$BERT_BASE_DIR/bert_config.json \\  
 37 |   --init_checkpoint=\$BERT_BASE_DIR/bert_model.ckpt \\  
 38 |   --max_seq_length=128 \\  
 39 |   --train_batch_size=32 \\  
 40 |   --learning_rate=2e-5 \\  
 41 |   --num_train_epochs=3.0 \\  
 42 |   --output_dir=./tmp/mayi_output/  
 43 |   output_dir这个目录可以自动生成，不必手动添加。训练完生成的这个目录太大，这里我没有上传
 44 | 
 45 | - 结果在output_dir的eval_results.txt中
 46 | >eval_accuracy = 0.8016032  
 47 | eval_loss = 0.5034697  
 48 | global_step = 281  
 49 | loss = 0.5032294  
 50 | 
 51 | #### 预测
 52 | - 定义环境变量
 53 | >export TRAINED_CLASSIFIER=/Users/luyao/Desktop/bert_learn/fine/tuned/classifier
 54 | 使用官方上的demo时，这里有错。应该选择train之后生成的目录。  
 55 | tensorflow.python.framework.errors_impl.NotFoundError: Unsuccessful TensorSliceReader constructor: Failed to find any matching files for /Users/luyao/Desktop/bert_learn/fine/tuned/classifier
 56 | 
 57 | >改成    
 58 | export TRAINED_CLASSIFIER=./tmp/mayi_output  这是上一步训练后的输出目录
 59 | 
 60 | - 进行预测
 61 | >python run_classifier_0214.py \\  
 62 |   --task_name=MAYI \\  
 63 |   --do_train=false \\  
 64 |   --do_eval=false \\  
 65 |   --do_predict=true \\  
 66 |   --data_dir=\$MAYI_DIR/MAYI \\  
 67 |   --vocab_file=\$BERT_BASE_DIR/vocab.txt \\  
 68 |   --bert_config_file=\$BERT_BASE_DIR/bert_config.json \\  
 69 |   --init_checkpoint=\$TRAINED_CLASSIFIER \\  
 70 |   --max_seq_length=128 \\  
 71 |   --output_dir=./tmp/mayi_output/
 72 |   
 73 | - 结果在output_dir的test_results.tsv中，第一列为label=0的概率，第二列为label=1的概率
 74 | >0.54155594	0.45844415  
 75 | 0.99733293	0.0026670366  
 76 | 0.98726386	0.012736204   
 77 | ...  
 78 | 
 79 | 
 80 | 
 81 | ## 任务2 文本多分类
 82 | #### 数据说明
 83 | - 运行时数据所在目录为/Users/luyao/Desktop/bert_learn/CNEWS
 84 | - 数据较大无法上传 链接:https://pan.baidu.com/s/1ZDez64S9cnzNnucIOrPapQ  密码:vutp
 85 | - cenws.train.txt、cenws.val.txt、cenws.test.txt格式相同，各列分别为"label doc"，"\t"分隔
 86 | - 包含"体育、科技、娱乐、家居、时政、财经、房产、游戏、时尚、教育"十个类别，其中train文件50000条，val5000条，test10000条
 87 | - 程序运行时仍然只选择了部分数据，大家可以根据机器性能随意选择
 88 | 
 89 | #### 程序说明
 90 | >如下的目录都是我自己学习时的目录，大家根据自己的实际情况进行修改
 91 | - 首先要下载**[`BERT-Base, Chinese`](https://storage.googleapis.com/bert_models/2018_11_03/chinese_L-12_H-768_A-12.zip)**并解压到/Users/luyao/Desktop/bert_learn/chinese_L-12_H-768_A-12
 92 | - 把自己写的放在了run_classifier_0228.py中
 93 | - 添加SelfProcessor类，修改了get_train_examples等三个方法，或者也可以修改_create_examples。修改了get_labels方法
 94 | - 在main(_)中增加自己定义的这个processor到字典中，修改了FLAGS.do_train等三部分
 95 | 
 96 | #### 训练
 97 | - 参考任务1，运行run_classifier_0228.py，修改task_name，data_dir和output_dir即可
 98 | >训练结果  
 99 | eval_accuracy = 0.955  
100 | eval_loss = 0.21101545  
101 | global_step = 93  
102 | loss = 0.21101545  
103 | 
104 | #### 预测
105 | - 参考任务1，运行run_classifier_0228.py，修改task_name，data_dir和output_dir即可
106 | >预测结果  
107 | 0.006859495	0.0022842707	0.005758511	0.0027749564	0.0021667227	0.0033973546	0.00307322	0.0014950271	0.9639163	0.008274185  
108 | 0.002119527	0.9845195	0.0023118905	0.001905937	0.0012551323	0.002937188	0.00080464134	0.0015169261	0.0013669604	0.0012623245  
109 | 0.008577874	0.0022037376	0.009644147	0.0033149354	0.0121542765	0.003821571	0.92171514	0.0034074073	0.009290288	0.025870731  
110 | ...
111 | 


--------------------------------------------------------------------------------
/__init__.py:
--------------------------------------------------------------------------------
 1 | # coding=utf-8
 2 | # Copyright 2018 The Google AI Language Team Authors.
 3 | #
 4 | # Licensed under the Apache License, Version 2.0 (the "License");
 5 | # you may not use this file except in compliance with the License.
 6 | # You may obtain a copy of the License at
 7 | #
 8 | #     http://www.apache.org/licenses/LICENSE-2.0
 9 | #
10 | # Unless required by applicable law or agreed to in writing, software
11 | # distributed under the License is distributed on an "AS IS" BASIS,
12 | # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
13 | # See the License for the specific language governing permissions and
14 | # limitations under the License.
15 | 
16 | 


--------------------------------------------------------------------------------
/create_pretraining_data.py:
--------------------------------------------------------------------------------
  1 | # coding=utf-8
  2 | # Copyright 2018 The Google AI Language Team Authors.
  3 | #
  4 | # Licensed under the Apache License, Version 2.0 (the "License");
  5 | # you may not use this file except in compliance with the License.
  6 | # You may obtain a copy of the License at
  7 | #
  8 | #     http://www.apache.org/licenses/LICENSE-2.0
  9 | #
 10 | # Unless required by applicable law or agreed to in writing, software
 11 | # distributed under the License is distributed on an "AS IS" BASIS,
 12 | # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
 13 | # See the License for the specific language governing permissions and
 14 | # limitations under the License.
 15 | """Create masked LM/next sentence masked_lm TF examples for BERT."""
 16 | 
 17 | from __future__ import absolute_import
 18 | from __future__ import division
 19 | from __future__ import print_function
 20 | 
 21 | import collections
 22 | import random
 23 | import tokenization
 24 | import tensorflow as tf
 25 | 
 26 | flags = tf.flags
 27 | 
 28 | FLAGS = flags.FLAGS
 29 | 
 30 | flags.DEFINE_string("input_file", None,
 31 |                     "Input raw text file (or comma-separated list of files).")
 32 | 
 33 | flags.DEFINE_string(
 34 |     "output_file", None,
 35 |     "Output TF example file (or comma-separated list of files).")
 36 | 
 37 | flags.DEFINE_string("vocab_file", None,
 38 |                     "The vocabulary file that the BERT model was trained on.")
 39 | 
 40 | flags.DEFINE_bool(
 41 |     "do_lower_case", True,
 42 |     "Whether to lower case the input text. Should be True for uncased "
 43 |     "models and False for cased models.")
 44 | 
 45 | flags.DEFINE_integer("max_seq_length", 128, "Maximum sequence length.")
 46 | 
 47 | flags.DEFINE_integer("max_predictions_per_seq", 20,
 48 |                      "Maximum number of masked LM predictions per sequence.")
 49 | 
 50 | flags.DEFINE_integer("random_seed", 12345, "Random seed for data generation.")
 51 | 
 52 | flags.DEFINE_integer(
 53 |     "dupe_factor", 10,
 54 |     "Number of times to duplicate the input data (with different masks).")
 55 | 
 56 | flags.DEFINE_float("masked_lm_prob", 0.15, "Masked LM probability.")
 57 | 
 58 | flags.DEFINE_float(
 59 |     "short_seq_prob", 0.1,
 60 |     "Probability of creating sequences which are shorter than the "
 61 |     "maximum length.")
 62 | 
 63 | 
 64 | class TrainingInstance(object):
 65 |   """A single training instance (sentence pair)."""
 66 | 
 67 |   def __init__(self, tokens, segment_ids, masked_lm_positions, masked_lm_labels,
 68 |                is_random_next):
 69 |     self.tokens = tokens
 70 |     self.segment_ids = segment_ids
 71 |     self.is_random_next = is_random_next
 72 |     self.masked_lm_positions = masked_lm_positions
 73 |     self.masked_lm_labels = masked_lm_labels
 74 | 
 75 |   def __str__(self):
 76 |     s = ""
 77 |     s += "tokens: %s\n" % (" ".join(
 78 |         [tokenization.printable_text(x) for x in self.tokens]))
 79 |     s += "segment_ids: %s\n" % (" ".join([str(x) for x in self.segment_ids]))
 80 |     s += "is_random_next: %s\n" % self.is_random_next
 81 |     s += "masked_lm_positions: %s\n" % (" ".join(
 82 |         [str(x) for x in self.masked_lm_positions]))
 83 |     s += "masked_lm_labels: %s\n" % (" ".join(
 84 |         [tokenization.printable_text(x) for x in self.masked_lm_labels]))
 85 |     s += "\n"
 86 |     return s
 87 | 
 88 |   def __repr__(self):
 89 |     return self.__str__()
 90 | 
 91 | 
 92 | def write_instance_to_example_files(instances, tokenizer, max_seq_length,
 93 |                                     max_predictions_per_seq, output_files):
 94 |   """Create TF example files from `TrainingInstance`s."""
 95 |   writers = []
 96 |   for output_file in output_files:
 97 |     writers.append(tf.python_io.TFRecordWriter(output_file))
 98 | 
 99 |   writer_index = 0
100 | 
101 |   total_written = 0
102 |   for (inst_index, instance) in enumerate(instances):
103 |     input_ids = tokenizer.convert_tokens_to_ids(instance.tokens)
104 |     input_mask = [1] * len(input_ids)
105 |     segment_ids = list(instance.segment_ids)
106 |     assert len(input_ids) <= max_seq_length
107 | 
108 |     while len(input_ids) < max_seq_length:
109 |       input_ids.append(0)
110 |       input_mask.append(0)
111 |       segment_ids.append(0)
112 | 
113 |     assert len(input_ids) == max_seq_length
114 |     assert len(input_mask) == max_seq_length
115 |     assert len(segment_ids) == max_seq_length
116 | 
117 |     masked_lm_positions = list(instance.masked_lm_positions)
118 |     masked_lm_ids = tokenizer.convert_tokens_to_ids(instance.masked_lm_labels)
119 |     masked_lm_weights = [1.0] * len(masked_lm_ids)
120 | 
121 |     while len(masked_lm_positions) < max_predictions_per_seq:
122 |       masked_lm_positions.append(0)
123 |       masked_lm_ids.append(0)
124 |       masked_lm_weights.append(0.0)
125 | 
126 |     next_sentence_label = 1 if instance.is_random_next else 0
127 | 
128 |     features = collections.OrderedDict()
129 |     features["input_ids"] = create_int_feature(input_ids)
130 |     features["input_mask"] = create_int_feature(input_mask)
131 |     features["segment_ids"] = create_int_feature(segment_ids)
132 |     features["masked_lm_positions"] = create_int_feature(masked_lm_positions)
133 |     features["masked_lm_ids"] = create_int_feature(masked_lm_ids)
134 |     features["masked_lm_weights"] = create_float_feature(masked_lm_weights)
135 |     features["next_sentence_labels"] = create_int_feature([next_sentence_label])
136 | 
137 |     tf_example = tf.train.Example(features=tf.train.Features(feature=features))
138 | 
139 |     writers[writer_index].write(tf_example.SerializeToString())
140 |     writer_index = (writer_index + 1) % len(writers)
141 | 
142 |     total_written += 1
143 | 
144 |     if inst_index < 20:
145 |       tf.logging.info("*** Example ***")
146 |       tf.logging.info("tokens: %s" % " ".join(
147 |           [tokenization.printable_text(x) for x in instance.tokens]))
148 | 
149 |       for feature_name in features.keys():
150 |         feature = features[feature_name]
151 |         values = []
152 |         if feature.int64_list.value:
153 |           values = feature.int64_list.value
154 |         elif feature.float_list.value:
155 |           values = feature.float_list.value
156 |         tf.logging.info(
157 |             "%s: %s" % (feature_name, " ".join([str(x) for x in values])))
158 | 
159 |   for writer in writers:
160 |     writer.close()
161 | 
162 |   tf.logging.info("Wrote %d total instances", total_written)
163 | 
164 | 
165 | def create_int_feature(values):
166 |   feature = tf.train.Feature(int64_list=tf.train.Int64List(value=list(values)))
167 |   return feature
168 | 
169 | 
170 | def create_float_feature(values):
171 |   feature = tf.train.Feature(float_list=tf.train.FloatList(value=list(values)))
172 |   return feature
173 | 
174 | 
175 | def create_training_instances(input_files, tokenizer, max_seq_length,
176 |                               dupe_factor, short_seq_prob, masked_lm_prob,
177 |                               max_predictions_per_seq, rng):
178 |   """Create `TrainingInstance`s from raw text."""
179 |   all_documents = [[]]
180 | 
181 |   # Input file format:
182 |   # (1) One sentence per line. These should ideally be actual sentences, not
183 |   # entire paragraphs or arbitrary spans of text. (Because we use the
184 |   # sentence boundaries for the "next sentence prediction" task).
185 |   # (2) Blank lines between documents. Document boundaries are needed so
186 |   # that the "next sentence prediction" task doesn't span between documents.
187 |   for input_file in input_files:
188 |     with tf.gfile.GFile(input_file, "r") as reader:
189 |       while True:
190 |         line = tokenization.convert_to_unicode(reader.readline())
191 |         if not line:
192 |           break
193 |         line = line.strip()
194 | 
195 |         # Empty lines are used as document delimiters
196 |         if not line:
197 |           all_documents.append([])
198 |         tokens = tokenizer.tokenize(line)
199 |         if tokens:
200 |           all_documents[-1].append(tokens)
201 | 
202 |   # Remove empty documents
203 |   all_documents = [x for x in all_documents if x]
204 |   rng.shuffle(all_documents)
205 | 
206 |   vocab_words = list(tokenizer.vocab.keys())
207 |   instances = []
208 |   for _ in range(dupe_factor):
209 |     for document_index in range(len(all_documents)):
210 |       instances.extend(
211 |           create_instances_from_document(
212 |               all_documents, document_index, max_seq_length, short_seq_prob,
213 |               masked_lm_prob, max_predictions_per_seq, vocab_words, rng))
214 | 
215 |   rng.shuffle(instances)
216 |   return instances
217 | 
218 | 
219 | def create_instances_from_document(
220 |     all_documents, document_index, max_seq_length, short_seq_prob,
221 |     masked_lm_prob, max_predictions_per_seq, vocab_words, rng):
222 |   """Creates `TrainingInstance`s for a single document."""
223 |   document = all_documents[document_index]
224 | 
225 |   # Account for [CLS], [SEP], [SEP]
226 |   max_num_tokens = max_seq_length - 3
227 | 
228 |   # We *usually* want to fill up the entire sequence since we are padding
229 |   # to `max_seq_length` anyways, so short sequences are generally wasted
230 |   # computation. However, we *sometimes*
231 |   # (i.e., short_seq_prob == 0.1 == 10% of the time) want to use shorter
232 |   # sequences to minimize the mismatch between pre-training and fine-tuning.
233 |   # The `target_seq_length` is just a rough target however, whereas
234 |   # `max_seq_length` is a hard limit.
235 |   target_seq_length = max_num_tokens
236 |   if rng.random() < short_seq_prob:
237 |     target_seq_length = rng.randint(2, max_num_tokens)
238 | 
239 |   # We DON'T just concatenate all of the tokens from a document into a long
240 |   # sequence and choose an arbitrary split point because this would make the
241 |   # next sentence prediction task too easy. Instead, we split the input into
242 |   # segments "A" and "B" based on the actual "sentences" provided by the user
243 |   # input.
244 |   instances = []
245 |   current_chunk = []
246 |   current_length = 0
247 |   i = 0
248 |   while i < len(document):
249 |     segment = document[i]
250 |     current_chunk.append(segment)
251 |     current_length += len(segment)
252 |     if i == len(document) - 1 or current_length >= target_seq_length:
253 |       if current_chunk:
254 |         # `a_end` is how many segments from `current_chunk` go into the `A`
255 |         # (first) sentence.
256 |         a_end = 1
257 |         if len(current_chunk) >= 2:
258 |           a_end = rng.randint(1, len(current_chunk) - 1)
259 | 
260 |         tokens_a = []
261 |         for j in range(a_end):
262 |           tokens_a.extend(current_chunk[j])
263 | 
264 |         tokens_b = []
265 |         # Random next
266 |         is_random_next = False
267 |         if len(current_chunk) == 1 or rng.random() < 0.5:
268 |           is_random_next = True
269 |           target_b_length = target_seq_length - len(tokens_a)
270 | 
271 |           # This should rarely go for more than one iteration for large
272 |           # corpora. However, just to be careful, we try to make sure that
273 |           # the random document is not the same as the document
274 |           # we're processing.
275 |           for _ in range(10):
276 |             random_document_index = rng.randint(0, len(all_documents) - 1)
277 |             if random_document_index != document_index:
278 |               break
279 | 
280 |           random_document = all_documents[random_document_index]
281 |           random_start = rng.randint(0, len(random_document) - 1)
282 |           for j in range(random_start, len(random_document)):
283 |             tokens_b.extend(random_document[j])
284 |             if len(tokens_b) >= target_b_length:
285 |               break
286 |           # We didn't actually use these segments so we "put them back" so
287 |           # they don't go to waste.
288 |           num_unused_segments = len(current_chunk) - a_end
289 |           i -= num_unused_segments
290 |         # Actual next
291 |         else:
292 |           is_random_next = False
293 |           for j in range(a_end, len(current_chunk)):
294 |             tokens_b.extend(current_chunk[j])
295 |         truncate_seq_pair(tokens_a, tokens_b, max_num_tokens, rng)
296 | 
297 |         assert len(tokens_a) >= 1
298 |         assert len(tokens_b) >= 1
299 | 
300 |         tokens = []
301 |         segment_ids = []
302 |         tokens.append("[CLS]")
303 |         segment_ids.append(0)
304 |         for token in tokens_a:
305 |           tokens.append(token)
306 |           segment_ids.append(0)
307 | 
308 |         tokens.append("[SEP]")
309 |         segment_ids.append(0)
310 | 
311 |         for token in tokens_b:
312 |           tokens.append(token)
313 |           segment_ids.append(1)
314 |         tokens.append("[SEP]")
315 |         segment_ids.append(1)
316 | 
317 |         (tokens, masked_lm_positions,
318 |          masked_lm_labels) = create_masked_lm_predictions(
319 |              tokens, masked_lm_prob, max_predictions_per_seq, vocab_words, rng)
320 |         instance = TrainingInstance(
321 |             tokens=tokens,
322 |             segment_ids=segment_ids,
323 |             is_random_next=is_random_next,
324 |             masked_lm_positions=masked_lm_positions,
325 |             masked_lm_labels=masked_lm_labels)
326 |         instances.append(instance)
327 |       current_chunk = []
328 |       current_length = 0
329 |     i += 1
330 | 
331 |   return instances
332 | 
333 | 
334 | MaskedLmInstance = collections.namedtuple("MaskedLmInstance",
335 |                                           ["index", "label"])
336 | 
337 | 
338 | def create_masked_lm_predictions(tokens, masked_lm_prob,
339 |                                  max_predictions_per_seq, vocab_words, rng):
340 |   """Creates the predictions for the masked LM objective."""
341 | 
342 |   cand_indexes = []
343 |   for (i, token) in enumerate(tokens):
344 |     if token == "[CLS]" or token == "[SEP]":
345 |       continue
346 |     cand_indexes.append(i)
347 | 
348 |   rng.shuffle(cand_indexes)
349 | 
350 |   output_tokens = list(tokens)
351 | 
352 |   num_to_predict = min(max_predictions_per_seq,
353 |                        max(1, int(round(len(tokens) * masked_lm_prob))))
354 | 
355 |   masked_lms = []
356 |   covered_indexes = set()
357 |   for index in cand_indexes:
358 |     if len(masked_lms) >= num_to_predict:
359 |       break
360 |     if index in covered_indexes:
361 |       continue
362 |     covered_indexes.add(index)
363 | 
364 |     masked_token = None
365 |     # 80% of the time, replace with [MASK]
366 |     if rng.random() < 0.8:
367 |       masked_token = "[MASK]"
368 |     else:
369 |       # 10% of the time, keep original
370 |       if rng.random() < 0.5:
371 |         masked_token = tokens[index]
372 |       # 10% of the time, replace with random word
373 |       else:
374 |         masked_token = vocab_words[rng.randint(0, len(vocab_words) - 1)]
375 | 
376 |     output_tokens[index] = masked_token
377 | 
378 |     masked_lms.append(MaskedLmInstance(index=index, label=tokens[index]))
379 | 
380 |   masked_lms = sorted(masked_lms, key=lambda x: x.index)
381 | 
382 |   masked_lm_positions = []
383 |   masked_lm_labels = []
384 |   for p in masked_lms:
385 |     masked_lm_positions.append(p.index)
386 |     masked_lm_labels.append(p.label)
387 | 
388 |   return (output_tokens, masked_lm_positions, masked_lm_labels)
389 | 
390 | 
391 | def truncate_seq_pair(tokens_a, tokens_b, max_num_tokens, rng):
392 |   """Truncates a pair of sequences to a maximum sequence length."""
393 |   while True:
394 |     total_length = len(tokens_a) + len(tokens_b)
395 |     if total_length <= max_num_tokens:
396 |       break
397 | 
398 |     trunc_tokens = tokens_a if len(tokens_a) > len(tokens_b) else tokens_b
399 |     assert len(trunc_tokens) >= 1
400 | 
401 |     # We want to sometimes truncate from the front and sometimes from the
402 |     # back to add more randomness and avoid biases.
403 |     if rng.random() < 0.5:
404 |       del trunc_tokens[0]
405 |     else:
406 |       trunc_tokens.pop()
407 | 
408 | 
409 | def main(_):
410 |   tf.logging.set_verbosity(tf.logging.INFO)
411 | 
412 |   tokenizer = tokenization.FullTokenizer(
413 |       vocab_file=FLAGS.vocab_file, do_lower_case=FLAGS.do_lower_case)
414 | 
415 |   input_files = []
416 |   for input_pattern in FLAGS.input_file.split(","):
417 |     input_files.extend(tf.gfile.Glob(input_pattern))
418 | 
419 |   tf.logging.info("*** Reading from input files ***")
420 |   for input_file in input_files:
421 |     tf.logging.info("  %s", input_file)
422 | 
423 |   rng = random.Random(FLAGS.random_seed)
424 |   instances = create_training_instances(
425 |       input_files, tokenizer, FLAGS.max_seq_length, FLAGS.dupe_factor,
426 |       FLAGS.short_seq_prob, FLAGS.masked_lm_prob, FLAGS.max_predictions_per_seq,
427 |       rng)
428 | 
429 |   output_files = FLAGS.output_file.split(",")
430 |   tf.logging.info("*** Writing to output files ***")
431 |   for output_file in output_files:
432 |     tf.logging.info("  %s", output_file)
433 | 
434 |   write_instance_to_example_files(instances, tokenizer, FLAGS.max_seq_length,
435 |                                   FLAGS.max_predictions_per_seq, output_files)
436 | 
437 | 
438 | if __name__ == "__main__":
439 |   flags.mark_flag_as_required("input_file")
440 |   flags.mark_flag_as_required("output_file")
441 |   flags.mark_flag_as_required("vocab_file")
442 |   tf.app.run()
443 | 


--------------------------------------------------------------------------------
/extract_features.py:
--------------------------------------------------------------------------------
  1 | # coding=utf-8
  2 | # Copyright 2018 The Google AI Language Team Authors.
  3 | #
  4 | # Licensed under the Apache License, Version 2.0 (the "License");
  5 | # you may not use this file except in compliance with the License.
  6 | # You may obtain a copy of the License at
  7 | #
  8 | #     http://www.apache.org/licenses/LICENSE-2.0
  9 | #
 10 | # Unless required by applicable law or agreed to in writing, software
 11 | # distributed under the License is distributed on an "AS IS" BASIS,
 12 | # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
 13 | # See the License for the specific language governing permissions and
 14 | # limitations under the License.
 15 | """Extract pre-computed feature vectors from BERT."""
 16 | 
 17 | from __future__ import absolute_import
 18 | from __future__ import division
 19 | from __future__ import print_function
 20 | 
 21 | import codecs
 22 | import collections
 23 | import json
 24 | import re
 25 | 
 26 | import modeling
 27 | import tokenization
 28 | import tensorflow as tf
 29 | 
 30 | flags = tf.flags
 31 | 
 32 | FLAGS = flags.FLAGS
 33 | 
 34 | flags.DEFINE_string("input_file", None, "")
 35 | 
 36 | flags.DEFINE_string("output_file", None, "")
 37 | 
 38 | flags.DEFINE_string("layers", "-1,-2,-3,-4", "")
 39 | 
 40 | flags.DEFINE_string(
 41 |     "bert_config_file", None,
 42 |     "The config json file corresponding to the pre-trained BERT model. "
 43 |     "This specifies the model architecture.")
 44 | 
 45 | flags.DEFINE_integer(
 46 |     "max_seq_length", 128,
 47 |     "The maximum total input sequence length after WordPiece tokenization. "
 48 |     "Sequences longer than this will be truncated, and sequences shorter "
 49 |     "than this will be padded.")
 50 | 
 51 | flags.DEFINE_string(
 52 |     "init_checkpoint", None,
 53 |     "Initial checkpoint (usually from a pre-trained BERT model).")
 54 | 
 55 | flags.DEFINE_string("vocab_file", None,
 56 |                     "The vocabulary file that the BERT model was trained on.")
 57 | 
 58 | flags.DEFINE_bool(
 59 |     "do_lower_case", True,
 60 |     "Whether to lower case the input text. Should be True for uncased "
 61 |     "models and False for cased models.")
 62 | 
 63 | flags.DEFINE_integer("batch_size", 32, "Batch size for predictions.")
 64 | 
 65 | flags.DEFINE_bool("use_tpu", False, "Whether to use TPU or GPU/CPU.")
 66 | 
 67 | flags.DEFINE_string("master", None,
 68 |                     "If using a TPU, the address of the master.")
 69 | 
 70 | flags.DEFINE_integer(
 71 |     "num_tpu_cores", 8,
 72 |     "Only used if `use_tpu` is True. Total number of TPU cores to use.")
 73 | 
 74 | flags.DEFINE_bool(
 75 |     "use_one_hot_embeddings", False,
 76 |     "If True, tf.one_hot will be used for embedding lookups, otherwise "
 77 |     "tf.nn.embedding_lookup will be used. On TPUs, this should be True "
 78 |     "since it is much faster.")
 79 | 
 80 | 
 81 | class InputExample(object):
 82 | 
 83 |   def __init__(self, unique_id, text_a, text_b):
 84 |     self.unique_id = unique_id
 85 |     self.text_a = text_a
 86 |     self.text_b = text_b
 87 | 
 88 | 
 89 | class InputFeatures(object):
 90 |   """A single set of features of data."""
 91 | 
 92 |   def __init__(self, unique_id, tokens, input_ids, input_mask, input_type_ids):
 93 |     self.unique_id = unique_id
 94 |     self.tokens = tokens
 95 |     self.input_ids = input_ids
 96 |     self.input_mask = input_mask
 97 |     self.input_type_ids = input_type_ids
 98 | 
 99 | 
100 | def input_fn_builder(features, seq_length):
101 |   """Creates an `input_fn` closure to be passed to TPUEstimator."""
102 | 
103 |   all_unique_ids = []
104 |   all_input_ids = []
105 |   all_input_mask = []
106 |   all_input_type_ids = []
107 | 
108 |   for feature in features:
109 |     all_unique_ids.append(feature.unique_id)
110 |     all_input_ids.append(feature.input_ids)
111 |     all_input_mask.append(feature.input_mask)
112 |     all_input_type_ids.append(feature.input_type_ids)
113 | 
114 |   def input_fn(params):
115 |     """The actual input function."""
116 |     batch_size = params["batch_size"]
117 | 
118 |     num_examples = len(features)
119 | 
120 |     # This is for demo purposes and does NOT scale to large data sets. We do
121 |     # not use Dataset.from_generator() because that uses tf.py_func which is
122 |     # not TPU compatible. The right way to load data is with TFRecordReader.
123 |     d = tf.data.Dataset.from_tensor_slices({
124 |         "unique_ids":
125 |             tf.constant(all_unique_ids, shape=[num_examples], dtype=tf.int32),
126 |         "input_ids":
127 |             tf.constant(
128 |                 all_input_ids, shape=[num_examples, seq_length],
129 |                 dtype=tf.int32),
130 |         "input_mask":
131 |             tf.constant(
132 |                 all_input_mask,
133 |                 shape=[num_examples, seq_length],
134 |                 dtype=tf.int32),
135 |         "input_type_ids":
136 |             tf.constant(
137 |                 all_input_type_ids,
138 |                 shape=[num_examples, seq_length],
139 |                 dtype=tf.int32),
140 |     })
141 | 
142 |     d = d.batch(batch_size=batch_size, drop_remainder=False)
143 |     return d
144 | 
145 |   return input_fn
146 | 
147 | 
148 | def model_fn_builder(bert_config, init_checkpoint, layer_indexes, use_tpu,
149 |                      use_one_hot_embeddings):
150 |   """Returns `model_fn` closure for TPUEstimator."""
151 | 
152 |   def model_fn(features, labels, mode, params):  # pylint: disable=unused-argument
153 |     """The `model_fn` for TPUEstimator."""
154 | 
155 |     unique_ids = features["unique_ids"]
156 |     input_ids = features["input_ids"]
157 |     input_mask = features["input_mask"]
158 |     input_type_ids = features["input_type_ids"]
159 | 
160 |     model = modeling.BertModel(
161 |         config=bert_config,
162 |         is_training=False,
163 |         input_ids=input_ids,
164 |         input_mask=input_mask,
165 |         token_type_ids=input_type_ids,
166 |         use_one_hot_embeddings=use_one_hot_embeddings)
167 | 
168 |     if mode != tf.estimator.ModeKeys.PREDICT:
169 |       raise ValueError("Only PREDICT modes are supported: %s" % (mode))
170 | 
171 |     tvars = tf.trainable_variables()
172 |     scaffold_fn = None
173 |     (assignment_map,
174 |      initialized_variable_names) = modeling.get_assignment_map_from_checkpoint(
175 |          tvars, init_checkpoint)
176 |     if use_tpu:
177 | 
178 |       def tpu_scaffold():
179 |         tf.train.init_from_checkpoint(init_checkpoint, assignment_map)
180 |         return tf.train.Scaffold()
181 | 
182 |       scaffold_fn = tpu_scaffold
183 |     else:
184 |       tf.train.init_from_checkpoint(init_checkpoint, assignment_map)
185 | 
186 |     tf.logging.info("**** Trainable Variables ****")
187 |     for var in tvars:
188 |       init_string = ""
189 |       if var.name in initialized_variable_names:
190 |         init_string = ", *INIT_FROM_CKPT*"
191 |       tf.logging.info("  name = %s, shape = %s%s", var.name, var.shape,
192 |                       init_string)
193 | 
194 |     all_layers = model.get_all_encoder_layers()
195 | 
196 |     predictions = {
197 |         "unique_id": unique_ids,
198 |     }
199 | 
200 |     for (i, layer_index) in enumerate(layer_indexes):
201 |       predictions["layer_output_%d" % i] = all_layers[layer_index]
202 | 
203 |     output_spec = tf.contrib.tpu.TPUEstimatorSpec(
204 |         mode=mode, predictions=predictions, scaffold_fn=scaffold_fn)
205 |     return output_spec
206 | 
207 |   return model_fn
208 | 
209 | 
210 | def convert_examples_to_features(examples, seq_length, tokenizer):
211 |   """Loads a data file into a list of `InputBatch`s."""
212 | 
213 |   features = []
214 |   for (ex_index, example) in enumerate(examples):
215 |     tokens_a = tokenizer.tokenize(example.text_a)
216 | 
217 |     tokens_b = None
218 |     if example.text_b:
219 |       tokens_b = tokenizer.tokenize(example.text_b)
220 | 
221 |     if tokens_b:
222 |       # Modifies `tokens_a` and `tokens_b` in place so that the total
223 |       # length is less than the specified length.
224 |       # Account for [CLS], [SEP], [SEP] with "- 3"
225 |       _truncate_seq_pair(tokens_a, tokens_b, seq_length - 3)
226 |     else:
227 |       # Account for [CLS] and [SEP] with "- 2"
228 |       if len(tokens_a) > seq_length - 2:
229 |         tokens_a = tokens_a[0:(seq_length - 2)]
230 | 
231 |     # The convention in BERT is:
232 |     # (a) For sequence pairs:
233 |     #  tokens:   [CLS] is this jack ##son ##ville ? [SEP] no it is not . [SEP]
234 |     #  type_ids: 0     0  0    0    0     0       0 0     1  1  1  1   1 1
235 |     # (b) For single sequences:
236 |     #  tokens:   [CLS] the dog is hairy . [SEP]
237 |     #  type_ids: 0     0   0   0  0     0 0
238 |     #
239 |     # Where "type_ids" are used to indicate whether this is the first
240 |     # sequence or the second sequence. The embedding vectors for `type=0` and
241 |     # `type=1` were learned during pre-training and are added to the wordpiece
242 |     # embedding vector (and position vector). This is not *strictly* necessary
243 |     # since the [SEP] token unambiguously separates the sequences, but it makes
244 |     # it easier for the model to learn the concept of sequences.
245 |     #
246 |     # For classification tasks, the first vector (corresponding to [CLS]) is
247 |     # used as as the "sentence vector". Note that this only makes sense because
248 |     # the entire model is fine-tuned.
249 |     tokens = []
250 |     input_type_ids = []
251 |     tokens.append("[CLS]")
252 |     input_type_ids.append(0)
253 |     for token in tokens_a:
254 |       tokens.append(token)
255 |       input_type_ids.append(0)
256 |     tokens.append("[SEP]")
257 |     input_type_ids.append(0)
258 | 
259 |     if tokens_b:
260 |       for token in tokens_b:
261 |         tokens.append(token)
262 |         input_type_ids.append(1)
263 |       tokens.append("[SEP]")
264 |       input_type_ids.append(1)
265 | 
266 |     input_ids = tokenizer.convert_tokens_to_ids(tokens)
267 | 
268 |     # The mask has 1 for real tokens and 0 for padding tokens. Only real
269 |     # tokens are attended to.
270 |     input_mask = [1] * len(input_ids)
271 | 
272 |     # Zero-pad up to the sequence length.
273 |     while len(input_ids) < seq_length:
274 |       input_ids.append(0)
275 |       input_mask.append(0)
276 |       input_type_ids.append(0)
277 | 
278 |     assert len(input_ids) == seq_length
279 |     assert len(input_mask) == seq_length
280 |     assert len(input_type_ids) == seq_length
281 | 
282 |     if ex_index < 5:
283 |       tf.logging.info("*** Example ***")
284 |       tf.logging.info("unique_id: %s" % (example.unique_id))
285 |       tf.logging.info("tokens: %s" % " ".join(
286 |           [tokenization.printable_text(x) for x in tokens]))
287 |       tf.logging.info("input_ids: %s" % " ".join([str(x) for x in input_ids]))
288 |       tf.logging.info("input_mask: %s" % " ".join([str(x) for x in input_mask]))
289 |       tf.logging.info(
290 |           "input_type_ids: %s" % " ".join([str(x) for x in input_type_ids]))
291 | 
292 |     features.append(
293 |         InputFeatures(
294 |             unique_id=example.unique_id,
295 |             tokens=tokens,
296 |             input_ids=input_ids,
297 |             input_mask=input_mask,
298 |             input_type_ids=input_type_ids))
299 |   return features
300 | 
301 | 
302 | def _truncate_seq_pair(tokens_a, tokens_b, max_length):
303 |   """Truncates a sequence pair in place to the maximum length."""
304 | 
305 |   # This is a simple heuristic which will always truncate the longer sequence
306 |   # one token at a time. This makes more sense than truncating an equal percent
307 |   # of tokens from each, since if one sequence is very short then each token
308 |   # that's truncated likely contains more information than a longer sequence.
309 |   while True:
310 |     total_length = len(tokens_a) + len(tokens_b)
311 |     if total_length <= max_length:
312 |       break
313 |     if len(tokens_a) > len(tokens_b):
314 |       tokens_a.pop()
315 |     else:
316 |       tokens_b.pop()
317 | 
318 | 
319 | def read_examples(input_file):
320 |   """Read a list of `InputExample`s from an input file."""
321 |   examples = []
322 |   unique_id = 0
323 |   with tf.gfile.GFile(input_file, "r") as reader:
324 |     while True:
325 |       line = tokenization.convert_to_unicode(reader.readline())
326 |       if not line:
327 |         break
328 |       line = line.strip()
329 |       text_a = None
330 |       text_b = None
331 |       m = re.match(r"^(.*) \|\|\| (.*)$", line)
332 |       if m is None:
333 |         text_a = line
334 |       else:
335 |         text_a = m.group(1)
336 |         text_b = m.group(2)
337 |       examples.append(
338 |           InputExample(unique_id=unique_id, text_a=text_a, text_b=text_b))
339 |       unique_id += 1
340 |   return examples
341 | 
342 | 
343 | def main(_):
344 |   tf.logging.set_verbosity(tf.logging.INFO)
345 | 
346 |   layer_indexes = [int(x) for x in FLAGS.layers.split(",")]
347 | 
348 |   bert_config = modeling.BertConfig.from_json_file(FLAGS.bert_config_file)
349 | 
350 |   tokenizer = tokenization.FullTokenizer(
351 |       vocab_file=FLAGS.vocab_file, do_lower_case=FLAGS.do_lower_case)
352 | 
353 |   is_per_host = tf.contrib.tpu.InputPipelineConfig.PER_HOST_V2
354 |   run_config = tf.contrib.tpu.RunConfig(
355 |       master=FLAGS.master,
356 |       tpu_config=tf.contrib.tpu.TPUConfig(
357 |           num_shards=FLAGS.num_tpu_cores,
358 |           per_host_input_for_training=is_per_host))
359 | 
360 |   examples = read_examples(FLAGS.input_file)
361 | 
362 |   features = convert_examples_to_features(
363 |       examples=examples, seq_length=FLAGS.max_seq_length, tokenizer=tokenizer)
364 | 
365 |   unique_id_to_feature = {}
366 |   for feature in features:
367 |     unique_id_to_feature[feature.unique_id] = feature
368 | 
369 |   model_fn = model_fn_builder(
370 |       bert_config=bert_config,
371 |       init_checkpoint=FLAGS.init_checkpoint,
372 |       layer_indexes=layer_indexes,
373 |       use_tpu=FLAGS.use_tpu,
374 |       use_one_hot_embeddings=FLAGS.use_one_hot_embeddings)
375 | 
376 |   # If TPU is not available, this will fall back to normal Estimator on CPU
377 |   # or GPU.
378 |   estimator = tf.contrib.tpu.TPUEstimator(
379 |       use_tpu=FLAGS.use_tpu,
380 |       model_fn=model_fn,
381 |       config=run_config,
382 |       predict_batch_size=FLAGS.batch_size)
383 | 
384 |   input_fn = input_fn_builder(
385 |       features=features, seq_length=FLAGS.max_seq_length)
386 | 
387 |   with codecs.getwriter("utf-8")(tf.gfile.Open(FLAGS.output_file,
388 |                                                "w")) as writer:
389 |     for result in estimator.predict(input_fn, yield_single_examples=True):
390 |       unique_id = int(result["unique_id"])
391 |       feature = unique_id_to_feature[unique_id]
392 |       output_json = collections.OrderedDict()
393 |       output_json["linex_index"] = unique_id
394 |       all_features = []
395 |       for (i, token) in enumerate(feature.tokens):
396 |         all_layers = []
397 |         for (j, layer_index) in enumerate(layer_indexes):
398 |           layer_output = result["layer_output_%d" % j]
399 |           layers = collections.OrderedDict()
400 |           layers["index"] = layer_index
401 |           layers["values"] = [
402 |               round(float(x), 6) for x in layer_output[i:(i + 1)].flat
403 |           ]
404 |           all_layers.append(layers)
405 |         features = collections.OrderedDict()
406 |         features["token"] = token
407 |         features["layers"] = all_layers
408 |         all_features.append(features)
409 |       output_json["features"] = all_features
410 |       writer.write(json.dumps(output_json) + "\n")
411 | 
412 | 
413 | if __name__ == "__main__":
414 |   flags.mark_flag_as_required("input_file")
415 |   flags.mark_flag_as_required("vocab_file")
416 |   flags.mark_flag_as_required("bert_config_file")
417 |   flags.mark_flag_as_required("init_checkpoint")
418 |   flags.mark_flag_as_required("output_file")
419 |   tf.app.run()
420 | 


--------------------------------------------------------------------------------
/modeling.py:
--------------------------------------------------------------------------------
  1 | # coding=utf-8
  2 | # Copyright 2018 The Google AI Language Team Authors.
  3 | #
  4 | # Licensed under the Apache License, Version 2.0 (the "License");
  5 | # you may not use this file except in compliance with the License.
  6 | # You may obtain a copy of the License at
  7 | #
  8 | #     http://www.apache.org/licenses/LICENSE-2.0
  9 | #
 10 | # Unless required by applicable law or agreed to in writing, software
 11 | # distributed under the License is distributed on an "AS IS" BASIS,
 12 | # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
 13 | # See the License for the specific language governing permissions and
 14 | # limitations under the License.
 15 | """The main BERT model and related functions."""
 16 | 
 17 | from __future__ import absolute_import
 18 | from __future__ import division
 19 | from __future__ import print_function
 20 | 
 21 | import collections
 22 | import copy
 23 | import json
 24 | import math
 25 | import re
 26 | import numpy as np
 27 | import six
 28 | import tensorflow as tf
 29 | 
 30 | 
 31 | class BertConfig(object):
 32 |   """Configuration for `BertModel`."""
 33 | 
 34 |   def __init__(self,
 35 |                vocab_size,
 36 |                hidden_size=768,
 37 |                num_hidden_layers=12,
 38 |                num_attention_heads=12,
 39 |                intermediate_size=3072,
 40 |                hidden_act="gelu",
 41 |                hidden_dropout_prob=0.1,
 42 |                attention_probs_dropout_prob=0.1,
 43 |                max_position_embeddings=512,
 44 |                type_vocab_size=16,
 45 |                initializer_range=0.02):
 46 |     """Constructs BertConfig.
 47 | 
 48 |     Args:
 49 |       vocab_size: Vocabulary size of `inputs_ids` in `BertModel`.
 50 |       hidden_size: Size of the encoder layers and the pooler layer.
 51 |       num_hidden_layers: Number of hidden layers in the Transformer encoder.
 52 |       num_attention_heads: Number of attention heads for each attention layer in
 53 |         the Transformer encoder.
 54 |       intermediate_size: The size of the "intermediate" (i.e., feed-forward)
 55 |         layer in the Transformer encoder.
 56 |       hidden_act: The non-linear activation function (function or string) in the
 57 |         encoder and pooler.
 58 |       hidden_dropout_prob: The dropout probability for all fully connected
 59 |         layers in the embeddings, encoder, and pooler.
 60 |       attention_probs_dropout_prob: The dropout ratio for the attention
 61 |         probabilities.
 62 |       max_position_embeddings: The maximum sequence length that this model might
 63 |         ever be used with. Typically set this to something large just in case
 64 |         (e.g., 512 or 1024 or 2048).
 65 |       type_vocab_size: The vocabulary size of the `token_type_ids` passed into
 66 |         `BertModel`.
 67 |       initializer_range: The stdev of the truncated_normal_initializer for
 68 |         initializing all weight matrices.
 69 |     """
 70 |     self.vocab_size = vocab_size
 71 |     self.hidden_size = hidden_size
 72 |     self.num_hidden_layers = num_hidden_layers
 73 |     self.num_attention_heads = num_attention_heads
 74 |     self.hidden_act = hidden_act
 75 |     self.intermediate_size = intermediate_size
 76 |     self.hidden_dropout_prob = hidden_dropout_prob
 77 |     self.attention_probs_dropout_prob = attention_probs_dropout_prob
 78 |     self.max_position_embeddings = max_position_embeddings
 79 |     self.type_vocab_size = type_vocab_size
 80 |     self.initializer_range = initializer_range
 81 | 
 82 |   @classmethod
 83 |   def from_dict(cls, json_object):
 84 |     """Constructs a `BertConfig` from a Python dictionary of parameters."""
 85 |     config = BertConfig(vocab_size=None)
 86 |     for (key, value) in six.iteritems(json_object):
 87 |       config.__dict__[key] = value
 88 |     return config
 89 | 
 90 |   @classmethod
 91 |   def from_json_file(cls, json_file):
 92 |     """Constructs a `BertConfig` from a json file of parameters."""
 93 |     with tf.gfile.GFile(json_file, "r") as reader:
 94 |       text = reader.read()
 95 |     return cls.from_dict(json.loads(text))
 96 | 
 97 |   def to_dict(self):
 98 |     """Serializes this instance to a Python dictionary."""
 99 |     output = copy.deepcopy(self.__dict__)
100 |     return output
101 | 
102 |   def to_json_string(self):
103 |     """Serializes this instance to a JSON string."""
104 |     return json.dumps(self.to_dict(), indent=2, sort_keys=True) + "\n"
105 | 
106 | 
107 | class BertModel(object):
108 |   """BERT model ("Bidirectional Encoder Representations from Transformers").
109 | 
110 |   Example usage:
111 | 
112 |   ```python
113 |   # Already been converted into WordPiece token ids
114 |   input_ids = tf.constant([[31, 51, 99], [15, 5, 0]])
115 |   input_mask = tf.constant([[1, 1, 1], [1, 1, 0]])
116 |   token_type_ids = tf.constant([[0, 0, 1], [0, 2, 0]])
117 | 
118 |   config = modeling.BertConfig(vocab_size=32000, hidden_size=512,
119 |     num_hidden_layers=8, num_attention_heads=6, intermediate_size=1024)
120 | 
121 |   model = modeling.BertModel(config=config, is_training=True,
122 |     input_ids=input_ids, input_mask=input_mask, token_type_ids=token_type_ids)
123 | 
124 |   label_embeddings = tf.get_variable(...)
125 |   pooled_output = model.get_pooled_output()
126 |   logits = tf.matmul(pooled_output, label_embeddings)
127 |   ...
128 |   ```
129 |   """
130 | 
131 |   def __init__(self,
132 |                config,
133 |                is_training,
134 |                input_ids,
135 |                input_mask=None,
136 |                token_type_ids=None,
137 |                use_one_hot_embeddings=False,
138 |                scope=None):
139 |     """Constructor for BertModel.
140 | 
141 |     Args:
142 |       config: `BertConfig` instance.
143 |       is_training: bool. true for training model, false for eval model. Controls
144 |         whether dropout will be applied.
145 |       input_ids: int32 Tensor of shape [batch_size, seq_length].
146 |       input_mask: (optional) int32 Tensor of shape [batch_size, seq_length].
147 |       token_type_ids: (optional) int32 Tensor of shape [batch_size, seq_length].
148 |       use_one_hot_embeddings: (optional) bool. Whether to use one-hot word
149 |         embeddings or tf.embedding_lookup() for the word embeddings.
150 |       scope: (optional) variable scope. Defaults to "bert".
151 | 
152 |     Raises:
153 |       ValueError: The config is invalid or one of the input tensor shapes
154 |         is invalid.
155 |     """
156 |     config = copy.deepcopy(config)
157 |     if not is_training:
158 |       config.hidden_dropout_prob = 0.0
159 |       config.attention_probs_dropout_prob = 0.0
160 | 
161 |     input_shape = get_shape_list(input_ids, expected_rank=2)
162 |     batch_size = input_shape[0]
163 |     seq_length = input_shape[1]
164 | 
165 |     if input_mask is None:
166 |       input_mask = tf.ones(shape=[batch_size, seq_length], dtype=tf.int32)
167 | 
168 |     if token_type_ids is None:
169 |       token_type_ids = tf.zeros(shape=[batch_size, seq_length], dtype=tf.int32)
170 | 
171 |     with tf.variable_scope(scope, default_name="bert"):
172 |       with tf.variable_scope("embeddings"):
173 |         # Perform embedding lookup on the word ids.
174 |         (self.embedding_output, self.embedding_table) = embedding_lookup(
175 |             input_ids=input_ids,
176 |             vocab_size=config.vocab_size,
177 |             embedding_size=config.hidden_size,
178 |             initializer_range=config.initializer_range,
179 |             word_embedding_name="word_embeddings",
180 |             use_one_hot_embeddings=use_one_hot_embeddings)
181 | 
182 |         # Add positional embeddings and token type embeddings, then layer
183 |         # normalize and perform dropout.
184 |         self.embedding_output = embedding_postprocessor(
185 |             input_tensor=self.embedding_output,
186 |             use_token_type=True,
187 |             token_type_ids=token_type_ids,
188 |             token_type_vocab_size=config.type_vocab_size,
189 |             token_type_embedding_name="token_type_embeddings",
190 |             use_position_embeddings=True,
191 |             position_embedding_name="position_embeddings",
192 |             initializer_range=config.initializer_range,
193 |             max_position_embeddings=config.max_position_embeddings,
194 |             dropout_prob=config.hidden_dropout_prob)
195 | 
196 |       with tf.variable_scope("encoder"):
197 |         # This converts a 2D mask of shape [batch_size, seq_length] to a 3D
198 |         # mask of shape [batch_size, seq_length, seq_length] which is used
199 |         # for the attention scores.
200 |         attention_mask = create_attention_mask_from_input_mask(
201 |             input_ids, input_mask)
202 | 
203 |         # Run the stacked transformer.
204 |         # `sequence_output` shape = [batch_size, seq_length, hidden_size].
205 |         self.all_encoder_layers = transformer_model(
206 |             input_tensor=self.embedding_output,
207 |             attention_mask=attention_mask,
208 |             hidden_size=config.hidden_size,
209 |             num_hidden_layers=config.num_hidden_layers,
210 |             num_attention_heads=config.num_attention_heads,
211 |             intermediate_size=config.intermediate_size,
212 |             intermediate_act_fn=get_activation(config.hidden_act),
213 |             hidden_dropout_prob=config.hidden_dropout_prob,
214 |             attention_probs_dropout_prob=config.attention_probs_dropout_prob,
215 |             initializer_range=config.initializer_range,
216 |             do_return_all_layers=True)
217 | 
218 |       self.sequence_output = self.all_encoder_layers[-1]
219 |       # The "pooler" converts the encoded sequence tensor of shape
220 |       # [batch_size, seq_length, hidden_size] to a tensor of shape
221 |       # [batch_size, hidden_size]. This is necessary for segment-level
222 |       # (or segment-pair-level) classification tasks where we need a fixed
223 |       # dimensional representation of the segment.
224 |       with tf.variable_scope("pooler"):
225 |         # We "pool" the model by simply taking the hidden state corresponding
226 |         # to the first token. We assume that this has been pre-trained
227 |         first_token_tensor = tf.squeeze(self.sequence_output[:, 0:1, :], axis=1)
228 |         self.pooled_output = tf.layers.dense(
229 |             first_token_tensor,
230 |             config.hidden_size,
231 |             activation=tf.tanh,
232 |             kernel_initializer=create_initializer(config.initializer_range))
233 | 
234 |   def get_pooled_output(self):
235 |     return self.pooled_output
236 | 
237 |   def get_sequence_output(self):
238 |     """Gets final hidden layer of encoder.
239 | 
240 |     Returns:
241 |       float Tensor of shape [batch_size, seq_length, hidden_size] corresponding
242 |       to the final hidden of the transformer encoder.
243 |     """
244 |     return self.sequence_output
245 | 
246 |   def get_all_encoder_layers(self):
247 |     return self.all_encoder_layers
248 | 
249 |   def get_embedding_output(self):
250 |     """Gets output of the embedding lookup (i.e., input to the transformer).
251 | 
252 |     Returns:
253 |       float Tensor of shape [batch_size, seq_length, hidden_size] corresponding
254 |       to the output of the embedding layer, after summing the word
255 |       embeddings with the positional embeddings and the token type embeddings,
256 |       then performing layer normalization. This is the input to the transformer.
257 |     """
258 |     return self.embedding_output
259 | 
260 |   def get_embedding_table(self):
261 |     return self.embedding_table
262 | 
263 | 
264 | def gelu(x):
265 |   """Gaussian Error Linear Unit.
266 | 
267 |   This is a smoother version of the RELU.
268 |   Original paper: https://arxiv.org/abs/1606.08415
269 |   Args:
270 |     x: float Tensor to perform activation.
271 | 
272 |   Returns:
273 |     `x` with the GELU activation applied.
274 |   """
275 |   cdf = 0.5 * (1.0 + tf.tanh(
276 |       (np.sqrt(2 / np.pi) * (x + 0.044715 * tf.pow(x, 3)))))
277 |   return x * cdf
278 | 
279 | 
280 | def get_activation(activation_string):
281 |   """Maps a string to a Python function, e.g., "relu" => `tf.nn.relu`.
282 | 
283 |   Args:
284 |     activation_string: String name of the activation function.
285 | 
286 |   Returns:
287 |     A Python function corresponding to the activation function. If
288 |     `activation_string` is None, empty, or "linear", this will return None.
289 |     If `activation_string` is not a string, it will return `activation_string`.
290 | 
291 |   Raises:
292 |     ValueError: The `activation_string` does not correspond to a known
293 |       activation.
294 |   """
295 | 
296 |   # We assume that anything that"s not a string is already an activation
297 |   # function, so we just return it.
298 |   if not isinstance(activation_string, six.string_types):
299 |     return activation_string
300 | 
301 |   if not activation_string:
302 |     return None
303 | 
304 |   act = activation_string.lower()
305 |   if act == "linear":
306 |     return None
307 |   elif act == "relu":
308 |     return tf.nn.relu
309 |   elif act == "gelu":
310 |     return gelu
311 |   elif act == "t" \
312 |               "anh":
313 |     return tf.tanh
314 |   else:
315 |     raise ValueError("Unsupported activation: %s" % act)
316 | 
317 | 
318 | def get_assignment_map_from_checkpoint(tvars, init_checkpoint):
319 |   """Compute the union of the current variables and checkpoint variables."""
320 |   assignment_map = {}
321 |   initialized_variable_names = {}
322 | 
323 |   name_to_variable = collections.OrderedDict()
324 |   for var in tvars:
325 |     name = var.name
326 |     m = re.match("^(.*):\\d+$", name)
327 |     if m is not None:
328 |       name = m.group(1)
329 |     name_to_variable[name] = var
330 | 
331 |   init_vars = tf.train.list_variables(init_checkpoint)
332 | 
333 |   assignment_map = collections.OrderedDict()
334 |   for x in init_vars:
335 |     (name, var) = (x[0], x[1])
336 |     if name not in name_to_variable:
337 |       continue
338 |     assignment_map[name] = name
339 |     initialized_variable_names[name] = 1
340 |     initialized_variable_names[name + ":0"] = 1
341 | 
342 |   return (assignment_map, initialized_variable_names)
343 | 
344 | 
345 | def dropout(input_tensor, dropout_prob):
346 |   """Perform dropout.
347 | 
348 |   Args:
349 |     input_tensor: float Tensor.
350 |     dropout_prob: Python float. The probability of dropping out a value (NOT of
351 |       *keeping* a dimension as in `tf.nn.dropout`).
352 | 
353 |   Returns:
354 |     A version of `input_tensor` with dropout applied.
355 |   """
356 |   if dropout_prob is None or dropout_prob == 0.0:
357 |     return input_tensor
358 | 
359 |   output = tf.nn.dropout(input_tensor, 1.0 - dropout_prob)
360 |   return output
361 | 
362 | 
363 | def layer_norm(input_tensor, name=None):
364 |   """Run layer normalization on the last dimension of the tensor."""
365 |   return tf.contrib.layers.layer_norm(
366 |       inputs=input_tensor, begin_norm_axis=-1, begin_params_axis=-1, scope=name)
367 | 
368 | 
369 | def layer_norm_and_dropout(input_tensor, dropout_prob, name=None):
370 |   """Runs layer normalization followed by dropout."""
371 |   output_tensor = layer_norm(input_tensor, name)
372 |   output_tensor = dropout(output_tensor, dropout_prob)
373 |   return output_tensor
374 | 
375 | 
376 | def create_initializer(initializer_range=0.02):
377 |   """Creates a `truncated_normal_initializer` with the given range."""
378 |   return tf.truncated_normal_initializer(stddev=initializer_range)
379 | 
380 | 
381 | def embedding_lookup(input_ids,
382 |                      vocab_size,
383 |                      embedding_size=128,
384 |                      initializer_range=0.02,
385 |                      word_embedding_name="word_embeddings",
386 |                      use_one_hot_embeddings=False):
387 |   """Looks up words embeddings for id tensor.
388 | 
389 |   Args:
390 |     input_ids: int32 Tensor of shape [batch_size, seq_length] containing word
391 |       ids.
392 |     vocab_size: int. Size of the embedding vocabulary.
393 |     embedding_size: int. Width of the word embeddings.
394 |     initializer_range: float. Embedding initialization range.
395 |     word_embedding_name: string. Name of the embedding table.
396 |     use_one_hot_embeddings: bool. If True, use one-hot method for word
397 |       embeddings. If False, use `tf.gather()`.
398 | 
399 |   Returns:
400 |     float Tensor of shape [batch_size, seq_length, embedding_size].
401 |   """
402 |   # This function assumes that the input is of shape [batch_size, seq_length,
403 |   # num_inputs].
404 |   #
405 |   # If the input is a 2D tensor of shape [batch_size, seq_length], we
406 |   # reshape to [batch_size, seq_length, 1].
407 |   if input_ids.shape.ndims == 2:
408 |     input_ids = tf.expand_dims(input_ids, axis=[-1])
409 | 
410 |   embedding_table = tf.get_variable(
411 |       name=word_embedding_name,
412 |       shape=[vocab_size, embedding_size],
413 |       initializer=create_initializer(initializer_range))
414 | 
415 |   flat_input_ids = tf.reshape(input_ids, [-1])
416 |   if use_one_hot_embeddings:
417 |     one_hot_input_ids = tf.one_hot(flat_input_ids, depth=vocab_size)
418 |     output = tf.matmul(one_hot_input_ids, embedding_table)
419 |   else:
420 |     output = tf.gather(embedding_table, flat_input_ids)
421 | 
422 |   input_shape = get_shape_list(input_ids)
423 | 
424 |   output = tf.reshape(output,
425 |                       input_shape[0:-1] + [input_shape[-1] * embedding_size])
426 |   return (output, embedding_table)
427 | 
428 | 
429 | def embedding_postprocessor(input_tensor,
430 |                             use_token_type=False,
431 |                             token_type_ids=None,
432 |                             token_type_vocab_size=16,
433 |                             token_type_embedding_name="token_type_embeddings",
434 |                             use_position_embeddings=True,
435 |                             position_embedding_name="position_embeddings",
436 |                             initializer_range=0.02,
437 |                             max_position_embeddings=512,
438 |                             dropout_prob=0.1):
439 |   """Performs various post-processing on a word embedding tensor.
440 | 
441 |   Args:
442 |     input_tensor: float Tensor of shape [batch_size, seq_length,
443 |       embedding_size].
444 |     use_token_type: bool. Whether to add embeddings for `token_type_ids`.
445 |     token_type_ids: (optional) int32 Tensor of shape [batch_size, seq_length].
446 |       Must be specified if `use_token_type` is True.
447 |     token_type_vocab_size: int. The vocabulary size of `token_type_ids`.
448 |     token_type_embedding_name: string. The name of the embedding table variable
449 |       for token type ids.
450 |     use_position_embeddings: bool. Whether to add position embeddings for the
451 |       position of each token in the sequence.
452 |     position_embedding_name: string. The name of the embedding table variable
453 |       for positional embeddings.
454 |     initializer_range: float. Range of the weight initialization.
455 |     max_position_embeddings: int. Maximum sequence length that might ever be
456 |       used with this model. This can be longer than the sequence length of
457 |       input_tensor, but cannot be shorter.
458 |     dropout_prob: float. Dropout probability applied to the final output tensor.
459 | 
460 |   Returns:
461 |     float tensor with same shape as `input_tensor`.
462 | 
463 |   Raises:
464 |     ValueError: One of the tensor shapes or input values is invalid.
465 |   """
466 |   input_shape = get_shape_list(input_tensor, expected_rank=3)
467 |   batch_size = input_shape[0]
468 |   seq_length = input_shape[1]
469 |   width = input_shape[2]
470 | 
471 |   output = input_tensor
472 | 
473 |   if use_token_type:
474 |     if token_type_ids is None:
475 |       raise ValueError("`token_type_ids` must be specified if"
476 |                        "`use_token_type` is True.")
477 |     token_type_table = tf.get_variable(
478 |         name=token_type_embedding_name,
479 |         shape=[token_type_vocab_size, width],
480 |         initializer=create_initializer(initializer_range))
481 |     # This vocab will be small so we always do one-hot here, since it is always
482 |     # faster for a small vocabulary.
483 |     flat_token_type_ids = tf.reshape(token_type_ids, [-1])
484 |     one_hot_ids = tf.one_hot(flat_token_type_ids, depth=token_type_vocab_size)
485 |     token_type_embeddings = tf.matmul(one_hot_ids, token_type_table)
486 |     token_type_embeddings = tf.reshape(token_type_embeddings,
487 |                                        [batch_size, seq_length, width])
488 |     output += token_type_embeddings
489 | 
490 |   if use_position_embeddings:
491 |     assert_op = tf.assert_less_equal(seq_length, max_position_embeddings)
492 |     with tf.control_dependencies([assert_op]):
493 |       full_position_embeddings = tf.get_variable(
494 |           name=position_embedding_name,
495 |           shape=[max_position_embeddings, width],
496 |           initializer=create_initializer(initializer_range))
497 |       # Since the position embedding table is a learned variable, we create it
498 |       # using a (long) sequence length `max_position_embeddings`. The actual
499 |       # sequence length might be shorter than this, for faster training of
500 |       # tasks that do not have long sequences.
501 |       #
502 |       # So `full_position_embeddings` is effectively an embedding table
503 |       # for position [0, 1, 2, ..., max_position_embeddings-1], and the current
504 |       # sequence has positions [0, 1, 2, ... seq_length-1], so we can just
505 |       # perform a slice.
506 |       position_embeddings = tf.slice(full_position_embeddings, [0, 0],
507 |                                      [seq_length, -1])
508 |       num_dims = len(output.shape.as_list())
509 | 
510 |       # Only the last two dimensions are relevant (`seq_length` and `width`), so
511 |       # we broadcast among the first dimensions, which is typically just
512 |       # the batch size.
513 |       position_broadcast_shape = []
514 |       for _ in range(num_dims - 2):
515 |         position_broadcast_shape.append(1)
516 |       position_broadcast_shape.extend([seq_length, width])
517 |       position_embeddings = tf.reshape(position_embeddings,
518 |                                        position_broadcast_shape)
519 |       output += position_embeddings
520 | 
521 |   output = layer_norm_and_dropout(output, dropout_prob)
522 |   return output
523 | 
524 | 
525 | def create_attention_mask_from_input_mask(from_tensor, to_mask):
526 |   """Create 3D attention mask from a 2D tensor mask.
527 | 
528 |   Args:
529 |     from_tensor: 2D or 3D Tensor of shape [batch_size, from_seq_length, ...].
530 |     to_mask: int32 Tensor of shape [batch_size, to_seq_length].
531 | 
532 |   Returns:
533 |     float Tensor of shape [batch_size, from_seq_length, to_seq_length].
534 |   """
535 |   from_shape = get_shape_list(from_tensor, expected_rank=[2, 3])
536 |   batch_size = from_shape[0]
537 |   from_seq_length = from_shape[1]
538 | 
539 |   to_shape = get_shape_list(to_mask, expected_rank=2)
540 |   to_seq_length = to_shape[1]
541 | 
542 |   to_mask = tf.cast(
543 |       tf.reshape(to_mask, [batch_size, 1, to_seq_length]), tf.float32)
544 | 
545 |   # We don't assume that `from_tensor` is a mask (although it could be). We
546 |   # don't actually care if we attend *from* padding tokens (only *to* padding)
547 |   # tokens so we create a tensor of all ones.
548 |   #
549 |   # `broadcast_ones` = [batch_size, from_seq_length, 1]
550 |   broadcast_ones = tf.ones(
551 |       shape=[batch_size, from_seq_length, 1], dtype=tf.float32)
552 | 
553 |   # Here we broadcast along two dimensions to create the mask.
554 |   mask = broadcast_ones * to_mask
555 | 
556 |   return mask
557 | 
558 | 
559 | def attention_layer(from_tensor,
560 |                     to_tensor,
561 |                     attention_mask=None,
562 |                     num_attention_heads=1,
563 |                     size_per_head=512,
564 |                     query_act=None,
565 |                     key_act=None,
566 |                     value_act=None,
567 |                     attention_probs_dropout_prob=0.0,
568 |                     initializer_range=0.02,
569 |                     do_return_2d_tensor=False,
570 |                     batch_size=None,
571 |                     from_seq_length=None,
572 |                     to_seq_length=None):
573 |   """Performs multi-headed attention from `from_tensor` to `to_tensor`.
574 | 
575 |   This is an implementation of multi-headed attention based on "Attention
576 |   is all you Need". If `from_tensor` and `to_tensor` are the same, then
577 |   this is self-attention. Each timestep in `from_tensor` attends to the
578 |   corresponding sequence in `to_tensor`, and returns a fixed-with vector.
579 | 
580 |   This function first projects `from_tensor` into a "query" tensor and
581 |   `to_tensor` into "key" and "value" tensors. These are (effectively) a list
582 |   of tensors of length `num_attention_heads`, where each tensor is of shape
583 |   [batch_size, seq_length, size_per_head].
584 | 
585 |   Then, the query and key tensors are dot-producted and scaled. These are
586 |   softmaxed to obtain attention probabilities. The value tensors are then
587 |   interpolated by these probabilities, then concatenated back to a single
588 |   tensor and returned.
589 | 
590 |   In practice, the multi-headed attention are done with transposes and
591 |   reshapes rather than actual separate tensors.
592 | 
593 |   Args:
594 |     from_tensor: float Tensor of shape [batch_size, from_seq_length,
595 |       from_width].
596 |     to_tensor: float Tensor of shape [batch_size, to_seq_length, to_width].
597 |     attention_mask: (optional) int32 Tensor of shape [batch_size,
598 |       from_seq_length, to_seq_length]. The values should be 1 or 0. The
599 |       attention scores will effectively be set to -infinity for any positions in
600 |       the mask that are 0, and will be unchanged for positions that are 1.
601 |     num_attention_heads: int. Number of attention heads.
602 |     size_per_head: int. Size of each attention head.
603 |     query_act: (optional) Activation function for the query transform.
604 |     key_act: (optional) Activation function for the key transform.
605 |     value_act: (optional) Activation function for the value transform.
606 |     attention_probs_dropout_prob: (optional) float. Dropout probability of the
607 |       attention probabilities.
608 |     initializer_range: float. Range of the weight initializer.
609 |     do_return_2d_tensor: bool. If True, the output will be of shape [batch_size
610 |       * from_seq_length, num_attention_heads * size_per_head]. If False, the
611 |       output will be of shape [batch_size, from_seq_length, num_attention_heads
612 |       * size_per_head].
613 |     batch_size: (Optional) int. If the input is 2D, this might be the batch size
614 |       of the 3D version of the `from_tensor` and `to_tensor`.
615 |     from_seq_length: (Optional) If the input is 2D, this might be the seq length
616 |       of the 3D version of the `from_tensor`.
617 |     to_seq_length: (Optional) If the input is 2D, this might be the seq length
618 |       of the 3D version of the `to_tensor`.
619 | 
620 |   Returns:
621 |     float Tensor of shape [batch_size, from_seq_length,
622 |       num_attention_heads * size_per_head]. (If `do_return_2d_tensor` is
623 |       true, this will be of shape [batch_size * from_seq_length,
624 |       num_attention_heads * size_per_head]).
625 | 
626 |   Raises:
627 |     ValueError: Any of the arguments or tensor shapes are invalid.
628 |   """
629 | 
630 |   def transpose_for_scores(input_tensor, batch_size, num_attention_heads,
631 |                            seq_length, width):
632 |     output_tensor = tf.reshape(
633 |         input_tensor, [batch_size, seq_length, num_attention_heads, width])
634 | 
635 |     output_tensor = tf.transpose(output_tensor, [0, 2, 1, 3])
636 |     return output_tensor
637 | 
638 |   from_shape = get_shape_list(from_tensor, expected_rank=[2, 3])
639 |   to_shape = get_shape_list(to_tensor, expected_rank=[2, 3])
640 | 
641 |   if len(from_shape) != len(to_shape):
642 |     raise ValueError(
643 |         "The rank of `from_tensor` must match the rank of `to_tensor`.")
644 | 
645 |   if len(from_shape) == 3:
646 |     batch_size = from_shape[0]
647 |     from_seq_length = from_shape[1]
648 |     to_seq_length = to_shape[1]
649 |   elif len(from_shape) == 2:
650 |     if (batch_size is None or from_seq_length is None or to_seq_length is None):
651 |       raise ValueError(
652 |           "When passing in rank 2 tensors to attention_layer, the values "
653 |           "for `batch_size`, `from_seq_length`, and `to_seq_length` "
654 |           "must all be specified.")
655 | 
656 |   # Scalar dimensions referenced here:
657 |   #   B = batch size (number of sequences)
658 |   #   F = `from_tensor` sequence length
659 |   #   T = `to_tensor` sequence length
660 |   #   N = `num_attention_heads`
661 |   #   H = `size_per_head`
662 | 
663 |   from_tensor_2d = reshape_to_matrix(from_tensor)
664 |   to_tensor_2d = reshape_to_matrix(to_tensor)
665 | 
666 |   # `query_layer` = [B*F, N*H]
667 |   query_layer = tf.layers.dense(
668 |       from_tensor_2d,
669 |       num_attention_heads * size_per_head,
670 |       activation=query_act,
671 |       name="query",
672 |       kernel_initializer=create_initializer(initializer_range))
673 | 
674 |   # `key_layer` = [B*T, N*H]
675 |   key_layer = tf.layers.dense(
676 |       to_tensor_2d,
677 |       num_attention_heads * size_per_head,
678 |       activation=key_act,
679 |       name="key",
680 |       kernel_initializer=create_initializer(initializer_range))
681 | 
682 |   # `value_layer` = [B*T, N*H]
683 |   value_layer = tf.layers.dense(
684 |       to_tensor_2d,
685 |       num_attention_heads * size_per_head,
686 |       activation=value_act,
687 |       name="value",
688 |       kernel_initializer=create_initializer(initializer_range))
689 | 
690 |   # `query_layer` = [B, N, F, H]
691 |   query_layer = transpose_for_scores(query_layer, batch_size,
692 |                                      num_attention_heads, from_seq_length,
693 |                                      size_per_head)
694 | 
695 |   # `key_layer` = [B, N, T, H]
696 |   key_layer = transpose_for_scores(key_layer, batch_size, num_attention_heads,
697 |                                    to_seq_length, size_per_head)
698 | 
699 |   # Take the dot product between "query" and "key" to get the raw
700 |   # attention scores.
701 |   # `attention_scores` = [B, N, F, T]
702 |   attention_scores = tf.matmul(query_layer, key_layer, transpose_b=True)
703 |   attention_scores = tf.multiply(attention_scores,
704 |                                  1.0 / math.sqrt(float(size_per_head)))
705 | 
706 |   if attention_mask is not None:
707 |     # `attention_mask` = [B, 1, F, T]
708 |     attention_mask = tf.expand_dims(attention_mask, axis=[1])
709 | 
710 |     # Since attention_mask is 1.0 for positions we want to attend and 0.0 for
711 |     # masked positions, this operation will create a tensor which is 0.0 for
712 |     # positions we want to attend and -10000.0 for masked positions.
713 |     adder = (1.0 - tf.cast(attention_mask, tf.float32)) * -10000.0
714 | 
715 |     # Since we are adding it to the raw scores before the softmax, this is
716 |     # effectively the same as removing these entirely.
717 |     attention_scores += adder
718 | 
719 |   # Normalize the attention scores to probabilities.
720 |   # `attention_probs` = [B, N, F, T]
721 |   attention_probs = tf.nn.softmax(attention_scores)
722 | 
723 |   # This is actually dropping out entire tokens to attend to, which might
724 |   # seem a bit unusual, but is taken from the original Transformer paper.
725 |   attention_probs = dropout(attention_probs, attention_probs_dropout_prob)
726 | 
727 |   # `value_layer` = [B, T, N, H]
728 |   value_layer = tf.reshape(
729 |       value_layer,
730 |       [batch_size, to_seq_length, num_attention_heads, size_per_head])
731 | 
732 |   # `value_layer` = [B, N, T, H]
733 |   value_layer = tf.transpose(value_layer, [0, 2, 1, 3])
734 | 
735 |   # `context_layer` = [B, N, F, H]
736 |   context_layer = tf.matmul(attention_probs, value_layer)
737 | 
738 |   # `context_layer` = [B, F, N, H]
739 |   context_layer = tf.transpose(context_layer, [0, 2, 1, 3])
740 | 
741 |   if do_return_2d_tensor:
742 |     # `context_layer` = [B*F, N*H]
743 |     context_layer = tf.reshape(
744 |         context_layer,
745 |         [batch_size * from_seq_length, num_attention_heads * size_per_head])
746 |   else:
747 |     # `context_layer` = [B, F, N*H]
748 |     context_layer = tf.reshape(
749 |         context_layer,
750 |         [batch_size, from_seq_length, num_attention_heads * size_per_head])
751 | 
752 |   return context_layer
753 | 
754 | 
755 | def transformer_model(input_tensor,
756 |                       attention_mask=None,
757 |                       hidden_size=768,
758 |                       num_hidden_layers=12,
759 |                       num_attention_heads=12,
760 |                       intermediate_size=3072,
761 |                       intermediate_act_fn=gelu,
762 |                       hidden_dropout_prob=0.1,
763 |                       attention_probs_dropout_prob=0.1,
764 |                       initializer_range=0.02,
765 |                       do_return_all_layers=False):
766 |   """Multi-headed, multi-layer Transformer from "Attention is All You Need".
767 | 
768 |   This is almost an exact implementation of the original Transformer encoder.
769 | 
770 |   See the original paper:
771 |   https://arxiv.org/abs/1706.03762
772 | 
773 |   Also see:
774 |   https://github.com/tensorflow/tensor2tensor/blob/master/tensor2tensor/models/transformer.py
775 | 
776 |   Args:
777 |     input_tensor: float Tensor of shape [batch_size, seq_length, hidden_size].
778 |     attention_mask: (optional) int32 Tensor of shape [batch_size, seq_length,
779 |       seq_length], with 1 for positions that can be attended to and 0 in
780 |       positions that should not be.
781 |     hidden_size: int. Hidden size of the Transformer.
782 |     num_hidden_layers: int. Number of layers (blocks) in the Transformer.
783 |     num_attention_heads: int. Number of attention heads in the Transformer.
784 |     intermediate_size: int. The size of the "intermediate" (a.k.a., feed
785 |       forward) layer.
786 |     intermediate_act_fn: function. The non-linear activation function to apply
787 |       to the output of the intermediate/feed-forward layer.
788 |     hidden_dropout_prob: float. Dropout probability for the hidden layers.
789 |     attention_probs_dropout_prob: float. Dropout probability of the attention
790 |       probabilities.
791 |     initializer_range: float. Range of the initializer (stddev of truncated
792 |       normal).
793 |     do_return_all_layers: Whether to also return all layers or just the final
794 |       layer.
795 | 
796 |   Returns:
797 |     float Tensor of shape [batch_size, seq_length, hidden_size], the final
798 |     hidden layer of the Transformer.
799 | 
800 |   Raises:
801 |     ValueError: A Tensor shape or parameter is invalid.
802 |   """
803 |   if hidden_size % num_attention_heads != 0:
804 |     raise ValueError(
805 |         "The hidden size (%d) is not a multiple of the number of attention "
806 |         "heads (%d)" % (hidden_size, num_attention_heads))
807 | 
808 |   attention_head_size = int(hidden_size / num_attention_heads)
809 |   input_shape = get_shape_list(input_tensor, expected_rank=3)
810 |   batch_size = input_shape[0]
811 |   seq_length = input_shape[1]
812 |   input_width = input_shape[2]
813 | 
814 |   # The Transformer performs sum residuals on all layers so the input needs
815 |   # to be the same as the hidden size.
816 |   if input_width != hidden_size:
817 |     raise ValueError("The width of the input tensor (%d) != hidden size (%d)" %
818 |                      (input_width, hidden_size))
819 | 
820 |   # We keep the representation as a 2D tensor to avoid re-shaping it back and
821 |   # forth from a 3D tensor to a 2D tensor. Re-shapes are normally free on
822 |   # the GPU/CPU but may not be free on the TPU, so we want to minimize them to
823 |   # help the optimizer.
824 |   prev_output = reshape_to_matrix(input_tensor)
825 | 
826 |   all_layer_outputs = []
827 |   for layer_idx in range(num_hidden_layers):
828 |     with tf.variable_scope("layer_%d" % layer_idx):
829 |       layer_input = prev_output
830 | 
831 |       with tf.variable_scope("attention"):
832 |         attention_heads = []
833 |         with tf.variable_scope("self"):
834 |           attention_head = attention_layer(
835 |               from_tensor=layer_input,
836 |               to_tensor=layer_input,
837 |               attention_mask=attention_mask,
838 |               num_attention_heads=num_attention_heads,
839 |               size_per_head=attention_head_size,
840 |               attention_probs_dropout_prob=attention_probs_dropout_prob,
841 |               initializer_range=initializer_range,
842 |               do_return_2d_tensor=True,
843 |               batch_size=batch_size,
844 |               from_seq_length=seq_length,
845 |               to_seq_length=seq_length)
846 |           attention_heads.append(attention_head)
847 | 
848 |         attention_output = None
849 |         if len(attention_heads) == 1:
850 |           attention_output = attention_heads[0]
851 |         else:
852 |           # In the case where we have other sequences, we just concatenate
853 |           # them to the self-attention head before the projection.
854 |           attention_output = tf.concat(attention_heads, axis=-1)
855 | 
856 |         # Run a linear projection of `hidden_size` then add a residual
857 |         # with `layer_input`.
858 |         with tf.variable_scope("output"):
859 |           attention_output = tf.layers.dense(
860 |               attention_output,
861 |               hidden_size,
862 |               kernel_initializer=create_initializer(initializer_range))
863 |           attention_output = dropout(attention_output, hidden_dropout_prob)
864 |           attention_output = layer_norm(attention_output + layer_input)
865 | 
866 |       # The activation is only applied to the "intermediate" hidden layer.
867 |       with tf.variable_scope("intermediate"):
868 |         intermediate_output = tf.layers.dense(
869 |             attention_output,
870 |             intermediate_size,
871 |             activation=intermediate_act_fn,
872 |             kernel_initializer=create_initializer(initializer_range))
873 | 
874 |       # Down-project back to `hidden_size` then add the residual.
875 |       with tf.variable_scope("output"):
876 |         layer_output = tf.layers.dense(
877 |             intermediate_output,
878 |             hidden_size,
879 |             kernel_initializer=create_initializer(initializer_range))
880 |         layer_output = dropout(layer_output, hidden_dropout_prob)
881 |         layer_output = layer_norm(layer_output + attention_output)
882 |         prev_output = layer_output
883 |         all_layer_outputs.append(layer_output)
884 | 
885 |   if do_return_all_layers:
886 |     final_outputs = []
887 |     for layer_output in all_layer_outputs:
888 |       final_output = reshape_from_matrix(layer_output, input_shape)
889 |       final_outputs.append(final_output)
890 |     return final_outputs
891 |   else:
892 |     final_output = reshape_from_matrix(prev_output, input_shape)
893 |     return final_output
894 | 
895 | 
896 | def get_shape_list(tensor, expected_rank=None, name=None):
897 |   """Returns a list of the shape of tensor, preferring static dimensions.
898 | 
899 |   Args:
900 |     tensor: A tf.Tensor object to find the shape of.
901 |     expected_rank: (optional) int. The expected rank of `tensor`. If this is
902 |       specified and the `tensor` has a different rank, and exception will be
903 |       thrown.
904 |     name: Optional name of the tensor for the error message.
905 | 
906 |   Returns:
907 |     A list of dimensions of the shape of tensor. All static dimensions will
908 |     be returned as python integers, and dynamic dimensions will be returned
909 |     as tf.Tensor scalars.
910 |   """
911 |   if name is None:
912 |     name = tensor.name
913 | 
914 |   if expected_rank is not None:
915 |     assert_rank(tensor, expected_rank, name)
916 | 
917 |   shape = tensor.shape.as_list()
918 | 
919 |   non_static_indexes = []
920 |   for (index, dim) in enumerate(shape):
921 |     if dim is None:
922 |       non_static_indexes.append(index)
923 | 
924 |   if not non_static_indexes:
925 |     return shape
926 | 
927 |   dyn_shape = tf.shape(tensor)
928 |   for index in non_static_indexes:
929 |     shape[index] = dyn_shape[index]
930 |   return shape
931 | 
932 | 
933 | def reshape_to_matrix(input_tensor):
934 |   """Reshapes a >= rank 2 tensor to a rank 2 tensor (i.e., a matrix)."""
935 |   ndims = input_tensor.shape.ndims
936 |   if ndims < 2:
937 |     raise ValueError("Input tensor must have at least rank 2. Shape = %s" %
938 |                      (input_tensor.shape))
939 |   if ndims == 2:
940 |     return input_tensor
941 | 
942 |   width = input_tensor.shape[-1]
943 |   output_tensor = tf.reshape(input_tensor, [-1, width])
944 |   return output_tensor
945 | 
946 | 
947 | def reshape_from_matrix(output_tensor, orig_shape_list):
948 |   """Reshapes a rank 2 tensor back to its original rank >= 2 tensor."""
949 |   if len(orig_shape_list) == 2:
950 |     return output_tensor
951 | 
952 |   output_shape = get_shape_list(output_tensor)
953 | 
954 |   orig_dims = orig_shape_list[0:-1]
955 |   width = output_shape[-1]
956 | 
957 |   return tf.reshape(output_tensor, orig_dims + [width])
958 | 
959 | 
960 | def assert_rank(tensor, expected_rank, name=None):
961 |   """Raises an exception if the tensor rank is not of the expected rank.
962 | 
963 |   Args:
964 |     tensor: A tf.Tensor to check the rank of.
965 |     expected_rank: Python integer or list of integers, expected rank.
966 |     name: Optional name of the tensor for the error message.
967 | 
968 |   Raises:
969 |     ValueError: If the expected shape doesn't match the actual shape.
970 |   """
971 |   if name is None:
972 |     name = tensor.name
973 | 
974 |   expected_rank_dict = {}
975 |   if isinstance(expected_rank, six.integer_types):
976 |     expected_rank_dict[expected_rank] = True
977 |   else:
978 |     for x in expected_rank:
979 |       expected_rank_dict[x] = True
980 | 
981 |   actual_rank = tensor.shape.ndims
982 |   if actual_rank not in expected_rank_dict:
983 |     scope_name = tf.get_variable_scope().name
984 |     raise ValueError(
985 |         "For the tensor `%s` in scope `%s`, the actual rank "
986 |         "`%d` (shape = %s) is not equal to the expected rank `%s`" %
987 |         (name, scope_name, actual_rank, str(tensor.shape), str(expected_rank)))
988 | 


--------------------------------------------------------------------------------
/modeling_test.py:
--------------------------------------------------------------------------------
  1 | # coding=utf-8
  2 | # Copyright 2018 The Google AI Language Team Authors.
  3 | #
  4 | # Licensed under the Apache License, Version 2.0 (the "License");
  5 | # you may not use this file except in compliance with the License.
  6 | # You may obtain a copy of the License at
  7 | #
  8 | #     http://www.apache.org/licenses/LICENSE-2.0
  9 | #
 10 | # Unless required by applicable law or agreed to in writing, software
 11 | # distributed under the License is distributed on an "AS IS" BASIS,
 12 | # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
 13 | # See the License for the specific language governing permissions and
 14 | # limitations under the License.
 15 | from __future__ import absolute_import
 16 | from __future__ import division
 17 | from __future__ import print_function
 18 | 
 19 | import collections
 20 | import json
 21 | import random
 22 | import re
 23 | 
 24 | import modeling
 25 | import six
 26 | import tensorflow as tf
 27 | 
 28 | 
 29 | class BertModelTest(tf.test.TestCase):
 30 | 
 31 |   class BertModelTester(object):
 32 | 
 33 |     def __init__(self,
 34 |                  parent,
 35 |                  batch_size=13,
 36 |                  seq_length=7,
 37 |                  is_training=True,
 38 |                  use_input_mask=True,
 39 |                  use_token_type_ids=True,
 40 |                  vocab_size=99,
 41 |                  hidden_size=32,
 42 |                  num_hidden_layers=5,
 43 |                  num_attention_heads=4,
 44 |                  intermediate_size=37,
 45 |                  hidden_act="gelu",
 46 |                  hidden_dropout_prob=0.1,
 47 |                  attention_probs_dropout_prob=0.1,
 48 |                  max_position_embeddings=512,
 49 |                  type_vocab_size=16,
 50 |                  initializer_range=0.02,
 51 |                  scope=None):
 52 |       self.parent = parent
 53 |       self.batch_size = batch_size
 54 |       self.seq_length = seq_length
 55 |       self.is_training = is_training
 56 |       self.use_input_mask = use_input_mask
 57 |       self.use_token_type_ids = use_token_type_ids
 58 |       self.vocab_size = vocab_size
 59 |       self.hidden_size = hidden_size
 60 |       self.num_hidden_layers = num_hidden_layers
 61 |       self.num_attention_heads = num_attention_heads
 62 |       self.intermediate_size = intermediate_size
 63 |       self.hidden_act = hidden_act
 64 |       self.hidden_dropout_prob = hidden_dropout_prob
 65 |       self.attention_probs_dropout_prob = attention_probs_dropout_prob
 66 |       self.max_position_embeddings = max_position_embeddings
 67 |       self.type_vocab_size = type_vocab_size
 68 |       self.initializer_range = initializer_range
 69 |       self.scope = scope
 70 | 
 71 |     def create_model(self):
 72 |       input_ids = BertModelTest.ids_tensor([self.batch_size, self.seq_length],
 73 |                                            self.vocab_size)
 74 | 
 75 |       input_mask = None
 76 |       if self.use_input_mask:
 77 |         input_mask = BertModelTest.ids_tensor(
 78 |             [self.batch_size, self.seq_length], vocab_size=2)
 79 | 
 80 |       token_type_ids = None
 81 |       if self.use_token_type_ids:
 82 |         token_type_ids = BertModelTest.ids_tensor(
 83 |             [self.batch_size, self.seq_length], self.type_vocab_size)
 84 | 
 85 |       config = modeling.BertConfig(
 86 |           vocab_size=self.vocab_size,
 87 |           hidden_size=self.hidden_size,
 88 |           num_hidden_layers=self.num_hidden_layers,
 89 |           num_attention_heads=self.num_attention_heads,
 90 |           intermediate_size=self.intermediate_size,
 91 |           hidden_act=self.hidden_act,
 92 |           hidden_dropout_prob=self.hidden_dropout_prob,
 93 |           attention_probs_dropout_prob=self.attention_probs_dropout_prob,
 94 |           max_position_embeddings=self.max_position_embeddings,
 95 |           type_vocab_size=self.type_vocab_size,
 96 |           initializer_range=self.initializer_range)
 97 | 
 98 |       model = modeling.BertModel(
 99 |           config=config,
100 |           is_training=self.is_training,
101 |           input_ids=input_ids,
102 |           input_mask=input_mask,
103 |           token_type_ids=token_type_ids,
104 |           scope=self.scope)
105 | 
106 |       outputs = {
107 |           "embedding_output": model.get_embedding_output(),
108 |           "sequence_output": model.get_sequence_output(),
109 |           "pooled_output": model.get_pooled_output(),
110 |           "all_encoder_layers": model.get_all_encoder_layers(),
111 |       }
112 |       return outputs
113 | 
114 |     def check_output(self, result):
115 |       self.parent.assertAllEqual(
116 |           result["embedding_output"].shape,
117 |           [self.batch_size, self.seq_length, self.hidden_size])
118 | 
119 |       self.parent.assertAllEqual(
120 |           result["sequence_output"].shape,
121 |           [self.batch_size, self.seq_length, self.hidden_size])
122 | 
123 |       self.parent.assertAllEqual(result["pooled_output"].shape,
124 |                                  [self.batch_size, self.hidden_size])
125 | 
126 |   def test_default(self):
127 |     self.run_tester(BertModelTest.BertModelTester(self))
128 | 
129 |   def test_config_to_json_string(self):
130 |     config = modeling.BertConfig(vocab_size=99, hidden_size=37)
131 |     obj = json.loads(config.to_json_string())
132 |     self.assertEqual(obj["vocab_size"], 99)
133 |     self.assertEqual(obj["hidden_size"], 37)
134 | 
135 |   def run_tester(self, tester):
136 |     with self.test_session() as sess:
137 |       ops = tester.create_model()
138 |       init_op = tf.group(tf.global_variables_initializer(),
139 |                          tf.local_variables_initializer())
140 |       sess.run(init_op)
141 |       output_result = sess.run(ops)
142 |       tester.check_output(output_result)
143 | 
144 |       self.assert_all_tensors_reachable(sess, [init_op, ops])
145 | 
146 |   @classmethod
147 |   def ids_tensor(cls, shape, vocab_size, rng=None, name=None):
148 |     """Creates a random int32 tensor of the shape within the vocab size."""
149 |     if rng is None:
150 |       rng = random.Random()
151 | 
152 |     total_dims = 1
153 |     for dim in shape:
154 |       total_dims *= dim
155 | 
156 |     values = []
157 |     for _ in range(total_dims):
158 |       values.append(rng.randint(0, vocab_size - 1))
159 | 
160 |     return tf.constant(value=values, dtype=tf.int32, shape=shape, name=name)
161 | 
162 |   def assert_all_tensors_reachable(self, sess, outputs):
163 |     """Checks that all the tensors in the graph are reachable from outputs."""
164 |     graph = sess.graph
165 | 
166 |     ignore_strings = [
167 |         "^.*/assert_less_equal/.*$",
168 |         "^.*/dilation_rate$",
169 |         "^.*/Tensordot/concat$",
170 |         "^.*/Tensordot/concat/axis$",
171 |         "^testing/.*$",
172 |     ]
173 | 
174 |     ignore_regexes = [re.compile(x) for x in ignore_strings]
175 | 
176 |     unreachable = self.get_unreachable_ops(graph, outputs)
177 |     filtered_unreachable = []
178 |     for x in unreachable:
179 |       do_ignore = False
180 |       for r in ignore_regexes:
181 |         m = r.match(x.name)
182 |         if m is not None:
183 |           do_ignore = True
184 |       if do_ignore:
185 |         continue
186 |       filtered_unreachable.append(x)
187 |     unreachable = filtered_unreachable
188 | 
189 |     self.assertEqual(
190 |         len(unreachable), 0, "The following ops are unreachable: %s" %
191 |         (" ".join([x.name for x in unreachable])))
192 | 
193 |   @classmethod
194 |   def get_unreachable_ops(cls, graph, outputs):
195 |     """Finds all of the tensors in graph that are unreachable from outputs."""
196 |     outputs = cls.flatten_recursive(outputs)
197 |     output_to_op = collections.defaultdict(list)
198 |     op_to_all = collections.defaultdict(list)
199 |     assign_out_to_in = collections.defaultdict(list)
200 | 
201 |     for op in graph.get_operations():
202 |       for x in op.inputs:
203 |         op_to_all[op.name].append(x.name)
204 |       for y in op.outputs:
205 |         output_to_op[y.name].append(op.name)
206 |         op_to_all[op.name].append(y.name)
207 |       if str(op.type) == "Assign":
208 |         for y in op.outputs:
209 |           for x in op.inputs:
210 |             assign_out_to_in[y.name].append(x.name)
211 | 
212 |     assign_groups = collections.defaultdict(list)
213 |     for out_name in assign_out_to_in.keys():
214 |       name_group = assign_out_to_in[out_name]
215 |       for n1 in name_group:
216 |         assign_groups[n1].append(out_name)
217 |         for n2 in name_group:
218 |           if n1 != n2:
219 |             assign_groups[n1].append(n2)
220 | 
221 |     seen_tensors = {}
222 |     stack = [x.name for x in outputs]
223 |     while stack:
224 |       name = stack.pop()
225 |       if name in seen_tensors:
226 |         continue
227 |       seen_tensors[name] = True
228 | 
229 |       if name in output_to_op:
230 |         for op_name in output_to_op[name]:
231 |           if op_name in op_to_all:
232 |             for input_name in op_to_all[op_name]:
233 |               if input_name not in stack:
234 |                 stack.append(input_name)
235 | 
236 |       expanded_names = []
237 |       if name in assign_groups:
238 |         for assign_name in assign_groups[name]:
239 |           expanded_names.append(assign_name)
240 | 
241 |       for expanded_name in expanded_names:
242 |         if expanded_name not in stack:
243 |           stack.append(expanded_name)
244 | 
245 |     unreachable_ops = []
246 |     for op in graph.get_operations():
247 |       is_unreachable = False
248 |       all_names = [x.name for x in op.inputs] + [x.name for x in op.outputs]
249 |       for name in all_names:
250 |         if name not in seen_tensors:
251 |           is_unreachable = True
252 |       if is_unreachable:
253 |         unreachable_ops.append(op)
254 |     return unreachable_ops
255 | 
256 |   @classmethod
257 |   def flatten_recursive(cls, item):
258 |     """Flattens (potentially nested) a tuple/dictionary/list to a list."""
259 |     output = []
260 |     if isinstance(item, list):
261 |       output.extend(item)
262 |     elif isinstance(item, tuple):
263 |       output.extend(list(item))
264 |     elif isinstance(item, dict):
265 |       for (_, v) in six.iteritems(item):
266 |         output.append(v)
267 |     else:
268 |       return [item]
269 | 
270 |     flat_output = []
271 |     for x in output:
272 |       flat_output.extend(cls.flatten_recursive(x))
273 |     return flat_output
274 | 
275 | 
276 | if __name__ == "__main__":
277 |   tf.test.main()
278 | 


--------------------------------------------------------------------------------
/optimization.py:
--------------------------------------------------------------------------------
  1 | # coding=utf-8
  2 | # Copyright 2018 The Google AI Language Team Authors.
  3 | #
  4 | # Licensed under the Apache License, Version 2.0 (the "License");
  5 | # you may not use this file except in compliance with the License.
  6 | # You may obtain a copy of the License at
  7 | #
  8 | #     http://www.apache.org/licenses/LICENSE-2.0
  9 | #
 10 | # Unless required by applicable law or agreed to in writing, software
 11 | # distributed under the License is distributed on an "AS IS" BASIS,
 12 | # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
 13 | # See the License for the specific language governing permissions and
 14 | # limitations under the License.
 15 | """Functions and classes related to optimization (weight updates)."""
 16 | 
 17 | from __future__ import absolute_import
 18 | from __future__ import division
 19 | from __future__ import print_function
 20 | 
 21 | import re
 22 | import tensorflow as tf
 23 | 
 24 | 
 25 | def create_optimizer(loss, init_lr, num_train_steps, num_warmup_steps, use_tpu):
 26 |   """Creates an optimizer training op."""
 27 |   global_step = tf.train.get_or_create_global_step()
 28 | 
 29 |   learning_rate = tf.constant(value=init_lr, shape=[], dtype=tf.float32)
 30 | 
 31 |   # Implements linear decay of the learning rate.
 32 |   learning_rate = tf.train.polynomial_decay(
 33 |       learning_rate,
 34 |       global_step,
 35 |       num_train_steps,
 36 |       end_learning_rate=0.0,
 37 |       power=1.0,
 38 |       cycle=False)
 39 | 
 40 |   # Implements linear warmup. I.e., if global_step < num_warmup_steps, the
 41 |   # learning rate will be `global_step/num_warmup_steps * init_lr`.
 42 |   if num_warmup_steps:
 43 |     global_steps_int = tf.cast(global_step, tf.int32)
 44 |     warmup_steps_int = tf.constant(num_warmup_steps, dtype=tf.int32)
 45 | 
 46 |     global_steps_float = tf.cast(global_steps_int, tf.float32)
 47 |     warmup_steps_float = tf.cast(warmup_steps_int, tf.float32)
 48 | 
 49 |     warmup_percent_done = global_steps_float / warmup_steps_float
 50 |     warmup_learning_rate = init_lr * warmup_percent_done
 51 | 
 52 |     is_warmup = tf.cast(global_steps_int < warmup_steps_int, tf.float32)
 53 |     learning_rate = (
 54 |         (1.0 - is_warmup) * learning_rate + is_warmup * warmup_learning_rate)
 55 | 
 56 |   # It is recommended that you use this optimizer for fine tuning, since this
 57 |   # is how the model was trained (note that the Adam m/v variables are NOT
 58 |   # loaded from init_checkpoint.)
 59 |   optimizer = AdamWeightDecayOptimizer(
 60 |       learning_rate=learning_rate,
 61 |       weight_decay_rate=0.01,
 62 |       beta_1=0.9,
 63 |       beta_2=0.999,
 64 |       epsilon=1e-6,
 65 |       exclude_from_weight_decay=["LayerNorm", "layer_norm", "bias"])
 66 | 
 67 |   if use_tpu:
 68 |     optimizer = tf.contrib.tpu.CrossShardOptimizer(optimizer)
 69 | 
 70 |   tvars = tf.trainable_variables()
 71 |   grads = tf.gradients(loss, tvars)
 72 | 
 73 |   # This is how the model was pre-trained.
 74 |   (grads, _) = tf.clip_by_global_norm(grads, clip_norm=1.0)
 75 | 
 76 |   train_op = optimizer.apply_gradients(
 77 |       zip(grads, tvars), global_step=global_step)
 78 | 
 79 |   # Normally the global step update is done inside of `apply_gradients`.
 80 |   # However, `AdamWeightDecayOptimizer` doesn't do this. But if you use
 81 |   # a different optimizer, you should probably take this line out.
 82 |   new_global_step = global_step + 1
 83 |   train_op = tf.group(train_op, [global_step.assign(new_global_step)])
 84 |   return train_op
 85 | 
 86 | 
 87 | class AdamWeightDecayOptimizer(tf.train.Optimizer):
 88 |   """A basic Adam optimizer that includes "correct" L2 weight decay."""
 89 | 
 90 |   def __init__(self,
 91 |                learning_rate,
 92 |                weight_decay_rate=0.0,
 93 |                beta_1=0.9,
 94 |                beta_2=0.999,
 95 |                epsilon=1e-6,
 96 |                exclude_from_weight_decay=None,
 97 |                name="AdamWeightDecayOptimizer"):
 98 |     """Constructs a AdamWeightDecayOptimizer."""
 99 |     super(AdamWeightDecayOptimizer, self).__init__(False, name)
100 | 
101 |     self.learning_rate = learning_rate
102 |     self.weight_decay_rate = weight_decay_rate
103 |     self.beta_1 = beta_1
104 |     self.beta_2 = beta_2
105 |     self.epsilon = epsilon
106 |     self.exclude_from_weight_decay = exclude_from_weight_decay
107 | 
108 |   def apply_gradients(self, grads_and_vars, global_step=None, name=None):
109 |     """See base class."""
110 |     assignments = []
111 |     for (grad, param) in grads_and_vars:
112 |       if grad is None or param is None:
113 |         continue
114 | 
115 |       param_name = self._get_variable_name(param.name)
116 | 
117 |       m = tf.get_variable(
118 |           name=param_name + "/adam_m",
119 |           shape=param.shape.as_list(),
120 |           dtype=tf.float32,
121 |           trainable=False,
122 |           initializer=tf.zeros_initializer())
123 |       v = tf.get_variable(
124 |           name=param_name + "/adam_v",
125 |           shape=param.shape.as_list(),
126 |           dtype=tf.float32,
127 |           trainable=False,
128 |           initializer=tf.zeros_initializer())
129 | 
130 |       # Standard Adam update.
131 |       next_m = (
132 |           tf.multiply(self.beta_1, m) + tf.multiply(1.0 - self.beta_1, grad))
133 |       next_v = (
134 |           tf.multiply(self.beta_2, v) + tf.multiply(1.0 - self.beta_2,
135 |                                                     tf.square(grad)))
136 | 
137 |       update = next_m / (tf.sqrt(next_v) + self.epsilon)
138 | 
139 |       # Just adding the square of the weights to the loss function is *not*
140 |       # the correct way of using L2 regularization/weight decay with Adam,
141 |       # since that will interact with the m and v parameters in strange ways.
142 |       #
143 |       # Instead we want ot decay the weights in a manner that doesn't interact
144 |       # with the m/v parameters. This is equivalent to adding the square
145 |       # of the weights to the loss with plain (non-momentum) SGD.
146 |       if self._do_use_weight_decay(param_name):
147 |         update += self.weight_decay_rate * param
148 | 
149 |       update_with_lr = self.learning_rate * update
150 | 
151 |       next_param = param - update_with_lr
152 | 
153 |       assignments.extend(
154 |           [param.assign(next_param),
155 |            m.assign(next_m),
156 |            v.assign(next_v)])
157 |     return tf.group(*assignments, name=name)
158 | 
159 |   def _do_use_weight_decay(self, param_name):
160 |     """Whether to use L2 weight decay for `param_name`."""
161 |     if not self.weight_decay_rate:
162 |       return False
163 |     if self.exclude_from_weight_decay:
164 |       for r in self.exclude_from_weight_decay:
165 |         if re.search(r, param_name) is not None:
166 |           return False
167 |     return True
168 | 
169 |   def _get_variable_name(self, param_name):
170 |     """Get the variable name from the tensor name."""
171 |     m = re.match("^(.*):\\d+$", param_name)
172 |     if m is not None:
173 |       param_name = m.group(1)
174 |     return param_name
175 | 


--------------------------------------------------------------------------------
/optimization_test.py:
--------------------------------------------------------------------------------
 1 | # coding=utf-8
 2 | # Copyright 2018 The Google AI Language Team Authors.
 3 | #
 4 | # Licensed under the Apache License, Version 2.0 (the "License");
 5 | # you may not use this file except in compliance with the License.
 6 | # You may obtain a copy of the License at
 7 | #
 8 | #     http://www.apache.org/licenses/LICENSE-2.0
 9 | #
10 | # Unless required by applicable law or agreed to in writing, software
11 | # distributed under the License is distributed on an "AS IS" BASIS,
12 | # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
13 | # See the License for the specific language governing permissions and
14 | # limitations under the License.
15 | from __future__ import absolute_import
16 | from __future__ import division
17 | from __future__ import print_function
18 | 
19 | import optimization
20 | import tensorflow as tf
21 | 
22 | 
23 | class OptimizationTest(tf.test.TestCase):
24 | 
25 |   def test_adam(self):
26 |     with self.test_session() as sess:
27 |       w = tf.get_variable(
28 |           "w",
29 |           shape=[3],
30 |           initializer=tf.constant_initializer([0.1, -0.2, -0.1]))
31 |       x = tf.constant([0.4, 0.2, -0.5])
32 |       loss = tf.reduce_mean(tf.square(x - w))
33 |       tvars = tf.trainable_variables()
34 |       grads = tf.gradients(loss, tvars)
35 |       global_step = tf.train.get_or_create_global_step()
36 |       optimizer = optimization.AdamWeightDecayOptimizer(learning_rate=0.2)
37 |       train_op = optimizer.apply_gradients(zip(grads, tvars), global_step)
38 |       init_op = tf.group(tf.global_variables_initializer(),
39 |                          tf.local_variables_initializer())
40 |       sess.run(init_op)
41 |       for _ in range(100):
42 |         sess.run(train_op)
43 |       w_np = sess.run(w)
44 |       self.assertAllClose(w_np.flat, [0.4, 0.2, -0.5], rtol=1e-2, atol=1e-2)
45 | 
46 | 
47 | if __name__ == "__main__":
48 |   tf.test.main()
49 | 


--------------------------------------------------------------------------------
/requirements.txt:
--------------------------------------------------------------------------------
1 | tensorflow >= 1.11.0   # CPU Version of TensorFlow.
2 | # tensorflow-gpu  >= 1.11.0  # GPU version of TensorFlow.
3 | 


--------------------------------------------------------------------------------
/run_classifier.py:
--------------------------------------------------------------------------------
  1 | # coding=utf-8
  2 | # Copyright 2018 The Google AI Language Team Authors.
  3 | #
  4 | # Licensed under the Apache License, Version 2.0 (the "License");
  5 | # you may not use this file except in compliance with the License.
  6 | # You may obtain a copy of the License at
  7 | #
  8 | #     http://www.apache.org/licenses/LICENSE-2.0
  9 | #
 10 | # Unless required by applicable law or agreed to in writing, software
 11 | # distributed under the License is distributed on an "AS IS" BASIS,
 12 | # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
 13 | # See the License for the specific language governing permissions and
 14 | # limitations under the License.
 15 | """BERT finetuning runner."""
 16 | 
 17 | from __future__ import absolute_import
 18 | from __future__ import division
 19 | from __future__ import print_function
 20 | 
 21 | import collections
 22 | import csv
 23 | import os
 24 | import modeling
 25 | import optimization
 26 | import tokenization
 27 | import tensorflow as tf
 28 | 
 29 | flags = tf.flags
 30 | 
 31 | FLAGS = flags.FLAGS
 32 | 
 33 | ## Required parameters
 34 | flags.DEFINE_string(
 35 |     "data_dir", None,
 36 |     "The input data dir. Should contain the .tsv files (or other data files) "
 37 |     "for the task.")
 38 | 
 39 | flags.DEFINE_string(
 40 |     "bert_config_file", None,
 41 |     "The config json file corresponding to the pre-trained BERT model. "
 42 |     "This specifies the model architecture.")
 43 | 
 44 | flags.DEFINE_string("task_name", None, "The name of the task to train.")
 45 | 
 46 | flags.DEFINE_string("vocab_file", None,
 47 |                     "The vocabulary file that the BERT model was trained on.")
 48 | 
 49 | flags.DEFINE_string(
 50 |     "output_dir", None,
 51 |     "The output directory where the model checkpoints will be written.")
 52 | 
 53 | ## Other parameters
 54 | 
 55 | flags.DEFINE_string(
 56 |     "init_checkpoint", None,
 57 |     "Initial checkpoint (usually from a pre-trained BERT model).")
 58 | 
 59 | flags.DEFINE_bool(
 60 |     "do_lower_case", True,
 61 |     "Whether to lower case the input text. Should be True for uncased "
 62 |     "models and False for cased models.")
 63 | 
 64 | flags.DEFINE_integer(
 65 |     "max_seq_length", 128,
 66 |     "The maximum total input sequence length after WordPiece tokenization. "
 67 |     "Sequences longer than this will be truncated, and sequences shorter "
 68 |     "than this will be padded.")
 69 | 
 70 | flags.DEFINE_bool("do_train", False, "Whether to run training.")
 71 | 
 72 | flags.DEFINE_bool("do_eval", False, "Whether to run eval on the dev set.")
 73 | 
 74 | flags.DEFINE_bool(
 75 |     "do_predict", False,
 76 |     "Whether to run the model in inference mode on the test set.")
 77 | 
 78 | flags.DEFINE_integer("train_batch_size", 32, "Total batch size for training.")
 79 | 
 80 | flags.DEFINE_integer("eval_batch_size", 8, "Total batch size for eval.")
 81 | 
 82 | flags.DEFINE_integer("predict_batch_size", 8, "Total batch size for predict.")
 83 | 
 84 | flags.DEFINE_float("learning_rate", 5e-5, "The initial learning rate for Adam.")
 85 | 
 86 | flags.DEFINE_float("num_train_epochs", 3.0,
 87 |                    "Total number of training epochs to perform.")
 88 | 
 89 | flags.DEFINE_float(
 90 |     "warmup_proportion", 0.1,
 91 |     "Proportion of training to perform linear learning rate warmup for. "
 92 |     "E.g., 0.1 = 10% of training.")
 93 | 
 94 | flags.DEFINE_integer("save_checkpoints_steps", 1000,
 95 |                      "How often to save the model checkpoint.")
 96 | 
 97 | flags.DEFINE_integer("iterations_per_loop", 1000,
 98 |                      "How many steps to make in each estimator call.")
 99 | 
100 | flags.DEFINE_bool("use_tpu", False, "Whether to use TPU or GPU/CPU.")
101 | 
102 | tf.flags.DEFINE_string(
103 |     "tpu_name", None,
104 |     "The Cloud TPU to use for training. This should be either the name "
105 |     "used when creating the Cloud TPU, or a grpc://ip.address.of.tpu:8470 "
106 |     "url.")
107 | 
108 | tf.flags.DEFINE_string(
109 |     "tpu_zone", None,
110 |     "[Optional] GCE zone where the Cloud TPU is located in. If not "
111 |     "specified, we will attempt to automatically detect the GCE project from "
112 |     "metadata.")
113 | 
114 | tf.flags.DEFINE_string(
115 |     "gcp_project", None,
116 |     "[Optional] Project name for the Cloud TPU-enabled project. If not "
117 |     "specified, we will attempt to automatically detect the GCE project from "
118 |     "metadata.")
119 | 
120 | tf.flags.DEFINE_string("master", None, "[Optional] TensorFlow master URL.")
121 | 
122 | flags.DEFINE_integer(
123 |     "num_tpu_cores", 8,
124 |     "Only used if `use_tpu` is True. Total number of TPU cores to use.")
125 | 
126 | 
127 | class InputExample(object):
128 |   """A single training/test example for simple sequence classification."""
129 | 
130 |   def __init__(self, guid, text_a, text_b=None, label=None):
131 |     """Constructs a InputExample.
132 | 
133 |     Args:
134 |       guid: Unique id for the example.
135 |       text_a: string. The untokenized text of the first sequence. For single
136 |         sequence tasks, only this sequence must be specified.
137 |       text_b: (Optional) string. The untokenized text of the second sequence.
138 |         Only must be specified for sequence pair tasks.
139 |       label: (Optional) string. The label of the example. This should be
140 |         specified for train and dev examples, but not for test examples.
141 |     """
142 |     self.guid = guid
143 |     self.text_a = text_a
144 |     self.text_b = text_b
145 |     self.label = label
146 | 
147 | 
148 | class PaddingInputExample(object):
149 |   """Fake example so the num input examples is a multiple of the batch size.
150 | 
151 |   When running eval/predict on the TPU, we need to pad the number of examples
152 |   to be a multiple of the batch size, because the TPU requires a fixed batch
153 |   size. The alternative is to drop the last batch, which is bad because it means
154 |   the entire output data won't be generated.
155 | 
156 |   We use this class instead of `None` because treating `None` as padding
157 |   battches could cause silent errors.
158 |   """
159 | 
160 | 
161 | class InputFeatures(object):
162 |   """A single set of features of data."""
163 | 
164 |   def __init__(self,
165 |                input_ids,
166 |                input_mask,
167 |                segment_ids,
168 |                label_id,
169 |                is_real_example=True):
170 |     self.input_ids = input_ids
171 |     self.input_mask = input_mask
172 |     self.segment_ids = segment_ids
173 |     self.label_id = label_id
174 |     self.is_real_example = is_real_example
175 | 
176 | 
177 | class DataProcessor(object):
178 |   """Base class for data converters for sequence classification data sets."""
179 | 
180 |   def get_train_examples(self, data_dir):
181 |     """Gets a collection of `InputExample`s for the train set."""
182 |     raise NotImplementedError()
183 | 
184 |   def get_dev_examples(self, data_dir):
185 |     """Gets a collection of `InputExample`s for the dev set."""
186 |     raise NotImplementedError()
187 | 
188 |   def get_test_examples(self, data_dir):
189 |     """Gets a collection of `InputExample`s for prediction."""
190 |     raise NotImplementedError()
191 | 
192 |   def get_labels(self):
193 |     """Gets the list of labels for this data set."""
194 |     raise NotImplementedError()
195 | 
196 |   @classmethod
197 |   def _read_tsv(cls, input_file, quotechar=None):
198 |     """Reads a tab separated value file."""
199 |     with tf.gfile.Open(input_file, "r") as f:
200 |       reader = csv.reader(f, delimiter="\t", quotechar=quotechar)
201 |       lines = []
202 |       for line in reader:
203 |         lines.append(line)
204 |       return lines
205 | 
206 | 
207 | class XnliProcessor(DataProcessor):
208 |   """Processor for the XNLI data set."""
209 | 
210 |   def __init__(self):
211 |     self.language = "zh"
212 | 
213 |   def get_train_examples(self, data_dir):
214 |     """See base class."""
215 |     lines = self._read_tsv(
216 |         os.path.join(data_dir, "multinli",
217 |                      "multinli.train.%s.tsv" % self.language))
218 |     examples = []
219 |     for (i, line) in enumerate(lines):
220 |       if i == 0:
221 |         continue
222 |       guid = "train-%d" % (i)
223 |       text_a = tokenization.convert_to_unicode(line[0])
224 |       text_b = tokenization.convert_to_unicode(line[1])
225 |       label = tokenization.convert_to_unicode(line[2])
226 |       if label == tokenization.convert_to_unicode("contradictory"):
227 |         label = tokenization.convert_to_unicode("contradiction")
228 |       examples.append(
229 |           InputExample(guid=guid, text_a=text_a, text_b=text_b, label=label))
230 |     return examples
231 | 
232 |   def get_dev_examples(self, data_dir):
233 |     """See base class."""
234 |     lines = self._read_tsv(os.path.join(data_dir, "xnli.dev.tsv"))
235 |     examples = []
236 |     for (i, line) in enumerate(lines):
237 |       if i == 0:
238 |         continue
239 |       guid = "dev-%d" % (i)
240 |       language = tokenization.convert_to_unicode(line[0])
241 |       if language != tokenization.convert_to_unicode(self.language):
242 |         continue
243 |       text_a = tokenization.convert_to_unicode(line[6])
244 |       text_b = tokenization.convert_to_unicode(line[7])
245 |       label = tokenization.convert_to_unicode(line[1])
246 |       examples.append(
247 |           InputExample(guid=guid, text_a=text_a, text_b=text_b, label=label))
248 |     return examples
249 | 
250 |   def get_labels(self):
251 |     """See base class."""
252 |     return ["contradiction", "entailment", "neutral"]
253 | 
254 | 
255 | class MnliProcessor(DataProcessor):
256 |   """Processor for the MultiNLI data set (GLUE version)."""
257 | 
258 |   def get_train_examples(self, data_dir):
259 |     """See base class."""
260 |     return self._create_examples(
261 |         self._read_tsv(os.path.join(data_dir, "train.tsv")), "train")
262 | 
263 |   def get_dev_examples(self, data_dir):
264 |     """See base class."""
265 |     return self._create_examples(
266 |         self._read_tsv(os.path.join(data_dir, "dev_matched.tsv")),
267 |         "dev_matched")
268 | 
269 |   def get_test_examples(self, data_dir):
270 |     """See base class."""
271 |     return self._create_examples(
272 |         self._read_tsv(os.path.join(data_dir, "test_matched.tsv")), "test")
273 | 
274 |   def get_labels(self):
275 |     """See base class."""
276 |     return ["contradiction", "entailment", "neutral"]
277 | 
278 |   def _create_examples(self, lines, set_type):
279 |     """Creates examples for the training and dev sets."""
280 |     examples = []
281 |     for (i, line) in enumerate(lines):
282 |       if i == 0:
283 |         continue
284 |       guid = "%s-%s" % (set_type, tokenization.convert_to_unicode(line[0]))
285 |       text_a = tokenization.convert_to_unicode(line[8])
286 |       text_b = tokenization.convert_to_unicode(line[9])
287 |       if set_type == "test":
288 |         label = "contradiction"
289 |       else:
290 |         label = tokenization.convert_to_unicode(line[-1])
291 |       examples.append(
292 |           InputExample(guid=guid, text_a=text_a, text_b=text_b, label=label))
293 |     return examples
294 | 
295 | 
296 | class MrpcProcessor(DataProcessor):
297 |   """Processor for the MRPC data set (GLUE version)."""
298 | 
299 |   def get_train_examples(self, data_dir):
300 |     """See base class."""
301 |     return self._create_examples(
302 |         self._read_tsv(os.path.join(data_dir, "train.tsv")), "train")
303 | 
304 |   def get_dev_examples(self, data_dir):
305 |     """See base class."""
306 |     return self._create_examples(
307 |         self._read_tsv(os.path.join(data_dir, "dev.tsv")), "dev")
308 | 
309 |   def get_test_examples(self, data_dir):
310 |     """See base class."""
311 |     return self._create_examples(
312 |         self._read_tsv(os.path.join(data_dir, "test.tsv")), "test")
313 | 
314 |   def get_labels(self):
315 |     """See base class."""
316 |     return ["0", "1"]
317 | 
318 |   def _create_examples(self, lines, set_type):
319 |     """Creates examples for the training and dev sets."""
320 |     examples = []
321 |     for (i, line) in enumerate(lines):
322 |       if i == 0:
323 |         continue
324 |       guid = "%s-%s" % (set_type, i)
325 |       text_a = tokenization.convert_to_unicode(line[3])
326 |       text_b = tokenization.convert_to_unicode(line[4])
327 |       if set_type == "test":
328 |         label = "0"
329 |       else:
330 |         label = tokenization.convert_to_unicode(line[0])
331 |       examples.append(
332 |           InputExample(guid=guid, text_a=text_a, text_b=text_b, label=label))
333 |     return examples
334 | 
335 | 
336 | 
337 | class ColaProcessor(DataProcessor):
338 |   """Processor for the CoLA data set (GLUE version)."""
339 | 
340 |   def get_train_examples(self, data_dir):
341 |     """See base class."""
342 |     return self._create_examples(
343 |         self._read_tsv(os.path.join(data_dir, "train.tsv")), "train")
344 | 
345 |   def get_dev_examples(self, data_dir):
346 |     """See base class."""
347 |     return self._create_examples(
348 |         self._read_tsv(os.path.join(data_dir, "dev.tsv")), "dev")
349 | 
350 |   def get_test_examples(self, data_dir):
351 |     """See base class."""
352 |     return self._create_examples(
353 |         self._read_tsv(os.path.join(data_dir, "test.tsv")), "test")
354 | 
355 |   def get_labels(self):
356 |     """See base class."""
357 |     return ["0", "1"]
358 | 
359 |   def _create_examples(self, lines, set_type):
360 |     """Creates examples for the training and dev sets."""
361 |     examples = []
362 |     for (i, line) in enumerate(lines):
363 |       # Only the test set has a header
364 |       if set_type == "test" and i == 0:
365 |         continue
366 |       guid = "%s-%s" % (set_type, i)
367 |       if set_type == "test":
368 |         text_a = tokenization.convert_to_unicode(line[1])
369 |         label = "0"
370 |       else:
371 |         text_a = tokenization.convert_to_unicode(line[3])
372 |         label = tokenization.convert_to_unicode(line[1])
373 |       examples.append(
374 |           InputExample(guid=guid, text_a=text_a, text_b=None, label=label))
375 |     return examples
376 | 
377 | 
378 | def convert_single_example(ex_index, example, label_list, max_seq_length,
379 |                            tokenizer):
380 |   """Converts a single `InputExample` into a single `InputFeatures`."""
381 | 
382 |   if isinstance(example, PaddingInputExample):
383 |     return InputFeatures(
384 |         input_ids=[0] * max_seq_length,
385 |         input_mask=[0] * max_seq_length,
386 |         segment_ids=[0] * max_seq_length,
387 |         label_id=0,
388 |         is_real_example=False)
389 | 
390 |   label_map = {}
391 |   for (i, label) in enumerate(label_list):
392 |     label_map[label] = i
393 | 
394 |   tokens_a = tokenizer.tokenize(example.text_a)
395 |   tokens_b = None
396 |   if example.text_b:
397 |     tokens_b = tokenizer.tokenize(example.text_b)
398 | 
399 |   if tokens_b:
400 |     # Modifies `tokens_a` and `tokens_b` in place so that the total
401 |     # length is less than the specified length.
402 |     # Account for [CLS], [SEP], [SEP] with "- 3"
403 |     _truncate_seq_pair(tokens_a, tokens_b, max_seq_length - 3)
404 |   else:
405 |     # Account for [CLS] and [SEP] with "- 2"
406 |     if len(tokens_a) > max_seq_length - 2:
407 |       tokens_a = tokens_a[0:(max_seq_length - 2)]
408 | 
409 |   # The convention in BERT is:
410 |   # (a) For sequence pairs:
411 |   #  tokens:   [CLS] is this jack ##son ##ville ? [SEP] no it is not . [SEP]
412 |   #  type_ids: 0     0  0    0    0     0       0 0     1  1  1  1   1 1
413 |   # (b) For single sequences:
414 |   #  tokens:   [CLS] the dog is hairy . [SEP]
415 |   #  type_ids: 0     0   0   0  0     0 0
416 |   #
417 |   # Where "type_ids" are used to indicate whether this is the first
418 |   # sequence or the second sequence. The embedding vectors for `type=0` and
419 |   # `type=1` were learned during pre-training and are added to the wordpiece
420 |   # embedding vector (and position vector). This is not *strictly* necessary
421 |   # since the [SEP] token unambiguously separates the sequences, but it makes
422 |   # it easier for the model to learn the concept of sequences.
423 |   #
424 |   # For classification tasks, the first vector (corresponding to [CLS]) is
425 |   # used as the "sentence vector". Note that this only makes sense because
426 |   # the entire model is fine-tuned.
427 |   tokens = []
428 |   segment_ids = []
429 |   tokens.append("[CLS]")
430 |   segment_ids.append(0)
431 |   for token in tokens_a:
432 |     tokens.append(token)
433 |     segment_ids.append(0)
434 |   tokens.append("[SEP]")
435 |   segment_ids.append(0)
436 | 
437 |   if tokens_b:
438 |     for token in tokens_b:
439 |       tokens.append(token)
440 |       segment_ids.append(1)
441 |     tokens.append("[SEP]")
442 |     segment_ids.append(1)
443 | 
444 |   input_ids = tokenizer.convert_tokens_to_ids(tokens)
445 | 
446 |   # The mask has 1 for real tokens and 0 for padding tokens. Only real
447 |   # tokens are attended to.
448 |   input_mask = [1] * len(input_ids)
449 | 
450 |   # Zero-pad up to the sequence length.
451 |   while len(input_ids) < max_seq_length:
452 |     input_ids.append(0)
453 |     input_mask.append(0)
454 |     segment_ids.append(0)
455 | 
456 |   assert len(input_ids) == max_seq_length
457 |   assert len(input_mask) == max_seq_length
458 |   assert len(segment_ids) == max_seq_length
459 | 
460 |   label_id = label_map[example.label]
461 |   if ex_index < 5:
462 |     tf.logging.info("*** Example ***")
463 |     tf.logging.info("guid: %s" % (example.guid))
464 |     tf.logging.info("tokens: %s" % " ".join(
465 |         [tokenization.printable_text(x) for x in tokens]))
466 |     tf.logging.info("input_ids: %s" % " ".join([str(x) for x in input_ids]))
467 |     tf.logging.info("input_mask: %s" % " ".join([str(x) for x in input_mask]))
468 |     tf.logging.info("segment_ids: %s" % " ".join([str(x) for x in segment_ids]))
469 |     tf.logging.info("label: %s (id = %d)" % (example.label, label_id))
470 | 
471 |   feature = InputFeatures(
472 |       input_ids=input_ids,
473 |       input_mask=input_mask,
474 |       segment_ids=segment_ids,
475 |       label_id=label_id,
476 |       is_real_example=True)
477 |   return feature
478 | 
479 | 
480 | def file_based_convert_examples_to_features(
481 |     examples, label_list, max_seq_length, tokenizer, output_file):
482 |   """Convert a set of `InputExample`s to a TFRecord file."""
483 | 
484 |   writer = tf.python_io.TFRecordWriter(output_file)
485 | 
486 |   for (ex_index, example) in enumerate(examples):
487 |     if ex_index % 10000 == 0:
488 |       tf.logging.info("Writing example %d of %d" % (ex_index, len(examples)))
489 | 
490 |     feature = convert_single_example(ex_index, example, label_list,
491 |                                      max_seq_length, tokenizer)
492 | 
493 |     def create_int_feature(values):
494 |       f = tf.train.Feature(int64_list=tf.train.Int64List(value=list(values)))
495 |       return f
496 | 
497 |     features = collections.OrderedDict()
498 |     features["input_ids"] = create_int_feature(feature.input_ids)
499 |     features["input_mask"] = create_int_feature(feature.input_mask)
500 |     features["segment_ids"] = create_int_feature(feature.segment_ids)
501 |     features["label_ids"] = create_int_feature([feature.label_id])
502 |     features["is_real_example"] = create_int_feature(
503 |         [int(feature.is_real_example)])
504 | 
505 |     tf_example = tf.train.Example(features=tf.train.Features(feature=features))
506 |     writer.write(tf_example.SerializeToString())
507 |   writer.close()
508 | 
509 | 
510 | def file_based_input_fn_builder(input_file, seq_length, is_training,
511 |                                 drop_remainder):
512 |   """Creates an `input_fn` closure to be passed to TPUEstimator."""
513 | 
514 |   name_to_features = {
515 |       "input_ids": tf.FixedLenFeature([seq_length], tf.int64),
516 |       "input_mask": tf.FixedLenFeature([seq_length], tf.int64),
517 |       "segment_ids": tf.FixedLenFeature([seq_length], tf.int64),
518 |       "label_ids": tf.FixedLenFeature([], tf.int64),
519 |       "is_real_example": tf.FixedLenFeature([], tf.int64),
520 |   }
521 | 
522 |   def _decode_record(record, name_to_features):
523 |     """Decodes a record to a TensorFlow example."""
524 |     example = tf.parse_single_example(record, name_to_features)
525 | 
526 |     # tf.Example only supports tf.int64, but the TPU only supports tf.int32.
527 |     # So cast all int64 to int32.
528 |     for name in list(example.keys()):
529 |       t = example[name]
530 |       if t.dtype == tf.int64:
531 |         t = tf.to_int32(t)
532 |       example[name] = t
533 | 
534 |     return example
535 | 
536 |   def input_fn(params):
537 |     """The actual input function."""
538 |     batch_size = params["batch_size"]
539 | 
540 |     # For training, we want a lot of parallel reading and shuffling.
541 |     # For eval, we want no shuffling and parallel reading doesn't matter.
542 |     d = tf.data.TFRecordDataset(input_file)
543 |     if is_training:
544 |       d = d.repeat()
545 |       d = d.shuffle(buffer_size=100)
546 | 
547 |     d = d.apply(
548 |         tf.contrib.data.map_and_batch(
549 |             lambda record: _decode_record(record, name_to_features),
550 |             batch_size=batch_size,
551 |             drop_remainder=drop_remainder))
552 | 
553 |     return d
554 | 
555 |   return input_fn
556 | 
557 | 
558 | def _truncate_seq_pair(tokens_a, tokens_b, max_length):
559 |   """Truncates a sequence pair in place to the maximum length."""
560 | 
561 |   # This is a simple heuristic which will always truncate the longer sequence
562 |   # one token at a time. This makes more sense than truncating an equal percent
563 |   # of tokens from each, since if one sequence is very short then each token
564 |   # that's truncated likely contains more information than a longer sequence.
565 |   while True:
566 |     total_length = len(tokens_a) + len(tokens_b)
567 |     if total_length <= max_length:
568 |       break
569 |     if len(tokens_a) > len(tokens_b):
570 |       tokens_a.pop()
571 |     else:
572 |       tokens_b.pop()
573 | 
574 | 
575 | def create_model(bert_config, is_training, input_ids, input_mask, segment_ids,
576 |                  labels, num_labels, use_one_hot_embeddings):
577 |   """Creates a classification model."""
578 |   model = modeling.BertModel(
579 |       config=bert_config,
580 |       is_training=is_training,
581 |       input_ids=input_ids,
582 |       input_mask=input_mask,
583 |       token_type_ids=segment_ids,
584 |       use_one_hot_embeddings=use_one_hot_embeddings)
585 | 
586 |   # In the demo, we are doing a simple classification task on the entire
587 |   # segment.
588 |   #
589 |   # If you want to use the token-level output, use model.get_sequence_output()
590 |   # instead.
591 |   output_layer = model.get_pooled_output()
592 | 
593 |   hidden_size = output_layer.shape[-1].value
594 | 
595 |   output_weights = tf.get_variable(
596 |       "output_weights", [num_labels, hidden_size],
597 |       initializer=tf.truncated_normal_initializer(stddev=0.02))
598 | 
599 |   output_bias = tf.get_variable(
600 |       "output_bias", [num_labels], initializer=tf.zeros_initializer())
601 | 
602 |   with tf.variable_scope("loss"):
603 |     if is_training:
604 |       # I.e., 0.1 dropout
605 |       output_layer = tf.nn.dropout(output_layer, keep_prob=0.9)
606 | 
607 |     logits = tf.matmul(output_layer, output_weights, transpose_b=True)
608 |     logits = tf.nn.bias_add(logits, output_bias)
609 |     probabilities = tf.nn.softmax(logits, axis=-1)
610 |     log_probs = tf.nn.log_softmax(logits, axis=-1)
611 | 
612 |     one_hot_labels = tf.one_hot(labels, depth=num_labels, dtype=tf.float32)
613 | 
614 |     per_example_loss = -tf.reduce_sum(one_hot_labels * log_probs, axis=-1)
615 |     loss = tf.reduce_mean(per_example_loss)
616 | 
617 |     return (loss, per_example_loss, logits, probabilities)
618 | 
619 | 
620 | def model_fn_builder(bert_config, num_labels, init_checkpoint, learning_rate,
621 |                      num_train_steps, num_warmup_steps, use_tpu,
622 |                      use_one_hot_embeddings):
623 |   """Returns `model_fn` closure for TPUEstimator."""
624 | 
625 |   def model_fn(features, labels, mode, params):  # pylint: disable=unused-argument
626 |     """The `model_fn` for TPUEstimator."""
627 | 
628 |     tf.logging.info("*** Features ***")
629 |     for name in sorted(features.keys()):
630 |       tf.logging.info("  name = %s, shape = %s" % (name, features[name].shape))
631 | 
632 |     input_ids = features["input_ids"]
633 |     input_mask = features["input_mask"]
634 |     segment_ids = features["segment_ids"]
635 |     label_ids = features["label_ids"]
636 |     is_real_example = None
637 |     if "is_real_example" in features:
638 |       is_real_example = tf.cast(features["is_real_example"], dtype=tf.float32)
639 |     else:
640 |       is_real_example = tf.ones(tf.shape(label_ids), dtype=tf.float32)
641 | 
642 |     is_training = (mode == tf.estimator.ModeKeys.TRAIN)
643 | 
644 |     (total_loss, per_example_loss, logits, probabilities) = create_model(
645 |         bert_config, is_training, input_ids, input_mask, segment_ids, label_ids,
646 |         num_labels, use_one_hot_embeddings)
647 | 
648 |     tvars = tf.trainable_variables()
649 |     initialized_variable_names = {}
650 |     scaffold_fn = None
651 |     if init_checkpoint:
652 |       (assignment_map, initialized_variable_names
653 |       ) = modeling.get_assignment_map_from_checkpoint(tvars, init_checkpoint)
654 |       if use_tpu:
655 | 
656 |         def tpu_scaffold():
657 |           tf.train.init_from_checkpoint(init_checkpoint, assignment_map)
658 |           return tf.train.Scaffold()
659 | 
660 |         scaffold_fn = tpu_scaffold
661 |       else:
662 |         tf.train.init_from_checkpoint(init_checkpoint, assignment_map)
663 | 
664 |     tf.logging.info("**** Trainable Variables ****")
665 |     for var in tvars:
666 |       init_string = ""
667 |       if var.name in initialized_variable_names:
668 |         init_string = ", *INIT_FROM_CKPT*"
669 |       tf.logging.info("  name = %s, shape = %s%s", var.name, var.shape,
670 |                       init_string)
671 | 
672 |     output_spec = None
673 |     if mode == tf.estimator.ModeKeys.TRAIN:
674 | 
675 |       train_op = optimization.create_optimizer(
676 |           total_loss, learning_rate, num_train_steps, num_warmup_steps, use_tpu)
677 | 
678 |       output_spec = tf.contrib.tpu.TPUEstimatorSpec(
679 |           mode=mode,
680 |           loss=total_loss,
681 |           train_op=train_op,
682 |           scaffold_fn=scaffold_fn)
683 |     elif mode == tf.estimator.ModeKeys.EVAL:
684 | 
685 |       def metric_fn(per_example_loss, label_ids, logits, is_real_example):
686 |         predictions = tf.argmax(logits, axis=-1, output_type=tf.int32)
687 |         accuracy = tf.metrics.accuracy(
688 |             labels=label_ids, predictions=predictions, weights=is_real_example)
689 |         loss = tf.metrics.mean(values=per_example_loss, weights=is_real_example)
690 |         return {
691 |             "eval_accuracy": accuracy,
692 |             "eval_loss": loss,
693 |         }
694 | 
695 |       eval_metrics = (metric_fn,
696 |                       [per_example_loss, label_ids, logits, is_real_example])
697 |       output_spec = tf.contrib.tpu.TPUEstimatorSpec(
698 |           mode=mode,
699 |           loss=total_loss,
700 |           eval_metrics=eval_metrics,
701 |           scaffold_fn=scaffold_fn)
702 |     else:
703 |       output_spec = tf.contrib.tpu.TPUEstimatorSpec(
704 |           mode=mode,
705 |           predictions={"probabilities": probabilities},
706 |           scaffold_fn=scaffold_fn)
707 |     return output_spec
708 | 
709 |   return model_fn
710 | 
711 | 
712 | # This function is not used by this file but is still used by the Colab and
713 | # people who depend on it.
714 | def input_fn_builder(features, seq_length, is_training, drop_remainder):
715 |   """Creates an `input_fn` closure to be passed to TPUEstimator."""
716 | 
717 |   all_input_ids = []
718 |   all_input_mask = []
719 |   all_segment_ids = []
720 |   all_label_ids = []
721 | 
722 |   for feature in features:
723 |     all_input_ids.append(feature.input_ids)
724 |     all_input_mask.append(feature.input_mask)
725 |     all_segment_ids.append(feature.segment_ids)
726 |     all_label_ids.append(feature.label_id)
727 | 
728 |   def input_fn(params):
729 |     """The actual input function."""
730 |     batch_size = params["batch_size"]
731 | 
732 |     num_examples = len(features)
733 | 
734 |     # This is for demo purposes and does NOT scale to large data sets. We do
735 |     # not use Dataset.from_generator() because that uses tf.py_func which is
736 |     # not TPU compatible. The right way to load data is with TFRecordReader.
737 |     d = tf.data.Dataset.from_tensor_slices({
738 |         "input_ids":
739 |             tf.constant(
740 |                 all_input_ids, shape=[num_examples, seq_length],
741 |                 dtype=tf.int32),
742 |         "input_mask":
743 |             tf.constant(
744 |                 all_input_mask,
745 |                 shape=[num_examples, seq_length],
746 |                 dtype=tf.int32),
747 |         "segment_ids":
748 |             tf.constant(
749 |                 all_segment_ids,
750 |                 shape=[num_examples, seq_length],
751 |                 dtype=tf.int32),
752 |         "label_ids":
753 |             tf.constant(all_label_ids, shape=[num_examples], dtype=tf.int32),
754 |     })
755 | 
756 |     if is_training:
757 |       d = d.repeat()
758 |       d = d.shuffle(buffer_size=100)
759 | 
760 |     d = d.batch(batch_size=batch_size, drop_remainder=drop_remainder)
761 |     return d
762 | 
763 |   return input_fn
764 | 
765 | 
766 | # This function is not used by this file but is still used by the Colab and
767 | # people who depend on it.
768 | def convert_examples_to_features(examples, label_list, max_seq_length,
769 |                                  tokenizer):
770 |   """Convert a set of `InputExample`s to a list of `InputFeatures`."""
771 | 
772 |   features = []
773 |   for (ex_index, example) in enumerate(examples):
774 |     if ex_index % 10000 == 0:
775 |       tf.logging.info("Writing example %d of %d" % (ex_index, len(examples)))
776 | 
777 |     feature = convert_single_example(ex_index, example, label_list,
778 |                                      max_seq_length, tokenizer)
779 | 
780 |     features.append(feature)
781 |   return features
782 | 
783 | 
784 | def main(_):
785 |   tf.logging.set_verbosity(tf.logging.INFO)
786 | 
787 |   processors = {
788 |       "cola": ColaProcessor,
789 |       "mnli": MnliProcessor,
790 |       "mrpc": MrpcProcessor,
791 |       "xnli": XnliProcessor,
792 |   }
793 | 
794 |   tokenization.validate_case_matches_checkpoint(FLAGS.do_lower_case,
795 |                                                 FLAGS.init_checkpoint)
796 | 
797 |   if not FLAGS.do_train and not FLAGS.do_eval and not FLAGS.do_predict:
798 |     raise ValueError(
799 |         "At least one of `do_train`, `do_eval` or `do_predict' must be True.")
800 | 
801 |   bert_config = modeling.BertConfig.from_json_file(FLAGS.bert_config_file)
802 | 
803 |   if FLAGS.max_seq_length > bert_config.max_position_embeddings:
804 |     raise ValueError(
805 |         "Cannot use sequence length %d because the BERT model "
806 |         "was only trained up to sequence length %d" %
807 |         (FLAGS.max_seq_length, bert_config.max_position_embeddings))
808 | 
809 |   tf.gfile.MakeDirs(FLAGS.output_dir)
810 | 
811 |   task_name = FLAGS.task_name.lower()
812 | 
813 |   if task_name not in processors:
814 |     raise ValueError("Task not found: %s" % (task_name))
815 | 
816 |   processor = processors[task_name]()
817 | 
818 |   label_list = processor.get_labels()
819 | 
820 |   tokenizer = tokenization.FullTokenizer(
821 |       vocab_file=FLAGS.vocab_file, do_lower_case=FLAGS.do_lower_case)
822 | 
823 |   tpu_cluster_resolver = None
824 |   if FLAGS.use_tpu and FLAGS.tpu_name:
825 |     tpu_cluster_resolver = tf.contrib.cluster_resolver.TPUClusterResolver(
826 |         FLAGS.tpu_name, zone=FLAGS.tpu_zone, project=FLAGS.gcp_project)
827 | 
828 |   is_per_host = tf.contrib.tpu.InputPipelineConfig.PER_HOST_V2
829 |   run_config = tf.contrib.tpu.RunConfig(
830 |       cluster=tpu_cluster_resolver,
831 |       master=FLAGS.master,
832 |       model_dir=FLAGS.output_dir,
833 |       save_checkpoints_steps=FLAGS.save_checkpoints_steps,
834 |       tpu_config=tf.contrib.tpu.TPUConfig(
835 |           iterations_per_loop=FLAGS.iterations_per_loop,
836 |           num_shards=FLAGS.num_tpu_cores,
837 |           per_host_input_for_training=is_per_host))
838 | 
839 |   train_examples = None
840 |   num_train_steps = None
841 |   num_warmup_steps = None
842 |   if FLAGS.do_train:
843 |     train_examples = processor.get_train_examples(FLAGS.data_dir)
844 |     num_train_steps = int(
845 |         len(train_examples) / FLAGS.train_batch_size * FLAGS.num_train_epochs)
846 |     num_warmup_steps = int(num_train_steps * FLAGS.warmup_proportion)
847 | 
848 |   model_fn = model_fn_builder(
849 |       bert_config=bert_config,
850 |       num_labels=len(label_list),
851 |       init_checkpoint=FLAGS.init_checkpoint,
852 |       learning_rate=FLAGS.learning_rate,
853 |       num_train_steps=num_train_steps,
854 |       num_warmup_steps=num_warmup_steps,
855 |       use_tpu=FLAGS.use_tpu,
856 |       use_one_hot_embeddings=FLAGS.use_tpu)
857 | 
858 |   # If TPU is not available, this will fall back to normal Estimator on CPU
859 |   # or GPU.
860 |   estimator = tf.contrib.tpu.TPUEstimator(
861 |       use_tpu=FLAGS.use_tpu,
862 |       model_fn=model_fn,
863 |       config=run_config,
864 |       train_batch_size=FLAGS.train_batch_size,
865 |       eval_batch_size=FLAGS.eval_batch_size,
866 |       predict_batch_size=FLAGS.predict_batch_size)
867 | 
868 |   if FLAGS.do_train:
869 |     train_file = os.path.join(FLAGS.output_dir, "train.tf_record")
870 |     file_based_convert_examples_to_features(
871 |         train_examples, label_list, FLAGS.max_seq_length, tokenizer, train_file)
872 |     tf.logging.info("***** Running training *****")
873 |     tf.logging.info("  Num examples = %d", len(train_examples))
874 |     tf.logging.info("  Batch size = %d", FLAGS.train_batch_size)
875 |     tf.logging.info("  Num steps = %d", num_train_steps)
876 |     train_input_fn = file_based_input_fn_builder(
877 |         input_file=train_file,
878 |         seq_length=FLAGS.max_seq_length,
879 |         is_training=True,
880 |         drop_remainder=True)
881 |     estimator.train(input_fn=train_input_fn, max_steps=num_train_steps)
882 | 
883 |   if FLAGS.do_eval:
884 |     eval_examples = processor.get_dev_examples(FLAGS.data_dir)
885 |     num_actual_eval_examples = len(eval_examples)
886 |     if FLAGS.use_tpu:
887 |       # TPU requires a fixed batch size for all batches, therefore the number
888 |       # of examples must be a multiple of the batch size, or else examples
889 |       # will get dropped. So we pad with fake examples which are ignored
890 |       # later on. These do NOT count towards the metric (all tf.metrics
891 |       # support a per-instance weight, and these get a weight of 0.0).
892 |       while len(eval_examples) % FLAGS.eval_batch_size != 0:
893 |         eval_examples.append(PaddingInputExample())
894 | 
895 |     eval_file = os.path.join(FLAGS.output_dir, "eval.tf_record")
896 |     file_based_convert_examples_to_features(
897 |         eval_examples, label_list, FLAGS.max_seq_length, tokenizer, eval_file)
898 | 
899 |     tf.logging.info("***** Running evaluation *****")
900 |     tf.logging.info("  Num examples = %d (%d actual, %d padding)",
901 |                     len(eval_examples), num_actual_eval_examples,
902 |                     len(eval_examples) - num_actual_eval_examples)
903 |     tf.logging.info("  Batch size = %d", FLAGS.eval_batch_size)
904 | 
905 |     # This tells the estimator to run through the entire set.
906 |     eval_steps = None
907 |     # However, if running eval on the TPU, you will need to specify the
908 |     # number of steps.
909 |     if FLAGS.use_tpu:
910 |       assert len(eval_examples) % FLAGS.eval_batch_size == 0
911 |       eval_steps = int(len(eval_examples) // FLAGS.eval_batch_size)
912 | 
913 |     eval_drop_remainder = True if FLAGS.use_tpu else False
914 |     eval_input_fn = file_based_input_fn_builder(
915 |         input_file=eval_file,
916 |         seq_length=FLAGS.max_seq_length,
917 |         is_training=False,
918 |         drop_remainder=eval_drop_remainder)
919 | 
920 |     result = estimator.evaluate(input_fn=eval_input_fn, steps=eval_steps)
921 | 
922 |     output_eval_file = os.path.join(FLAGS.output_dir, "eval_results.txt")
923 |     with tf.gfile.GFile(output_eval_file, "w") as writer:
924 |       tf.logging.info("***** Eval results *****")
925 |       for key in sorted(result.keys()):
926 |         tf.logging.info("  %s = %s", key, str(result[key]))
927 |         writer.write("%s = %s\n" % (key, str(result[key])))
928 | 
929 |   if FLAGS.do_predict:
930 |     predict_examples = processor.get_test_examples(FLAGS.data_dir)
931 |     num_actual_predict_examples = len(predict_examples)
932 |     if FLAGS.use_tpu:
933 |       # TPU requires a fixed batch size for all batches, therefore the number
934 |       # of examples must be a multiple of the batch size, or else examples
935 |       # will get dropped. So we pad with fake examples which are ignored
936 |       # later on.
937 |       while len(predict_examples) % FLAGS.predict_batch_size != 0:
938 |         predict_examples.append(PaddingInputExample())
939 | 
940 |     predict_file = os.path.join(FLAGS.output_dir, "predict.tf_record")
941 |     file_based_convert_examples_to_features(predict_examples, label_list,
942 |                                             FLAGS.max_seq_length, tokenizer,
943 |                                             predict_file)
944 | 
945 |     tf.logging.info("***** Running prediction*****")
946 |     tf.logging.info("  Num examples = %d (%d actual, %d padding)",
947 |                     len(predict_examples), num_actual_predict_examples,
948 |                     len(predict_examples) - num_actual_predict_examples)
949 |     tf.logging.info("  Batch size = %d", FLAGS.predict_batch_size)
950 | 
951 |     predict_drop_remainder = True if FLAGS.use_tpu else False
952 |     predict_input_fn = file_based_input_fn_builder(
953 |         input_file=predict_file,
954 |         seq_length=FLAGS.max_seq_length,
955 |         is_training=False,
956 |         drop_remainder=predict_drop_remainder)
957 | 
958 |     result = estimator.predict(input_fn=predict_input_fn)
959 | 
960 |     output_predict_file = os.path.join(FLAGS.output_dir, "test_results.tsv")
961 |     with tf.gfile.GFile(output_predict_file, "w") as writer:
962 |       num_written_lines = 0
963 |       tf.logging.info("***** Predict results *****")
964 |       for (i, prediction) in enumerate(result):
965 |         probabilities = prediction["probabilities"]
966 |         if i >= num_actual_predict_examples:
967 |           break
968 |         output_line = "\t".join(
969 |             str(class_probability)
970 |             for class_probability in probabilities) + "\n"
971 |         writer.write(output_line)
972 |         num_written_lines += 1
973 |     assert num_written_lines == num_actual_predict_examples
974 | 
975 | 
976 | if __name__ == "__main__":
977 |   flags.mark_flag_as_required("data_dir")
978 |   flags.mark_flag_as_required("task_name")
979 |   flags.mark_flag_as_required("vocab_file")
980 |   flags.mark_flag_as_required("bert_config_file")
981 |   flags.mark_flag_as_required("output_dir")
982 |   tf.app.run()
983 | 


--------------------------------------------------------------------------------
/run_classifier_0214.py:
--------------------------------------------------------------------------------
   1 | # coding=utf-8
   2 | # Copyright 2018 The Google AI Language Team Authors.
   3 | #
   4 | # Licensed under the Apache License, Version 2.0 (the "License");
   5 | # you may not use this file except in compliance with the License.
   6 | # You may obtain a copy of the License at
   7 | #
   8 | #     http://www.apache.org/licenses/LICENSE-2.0
   9 | #
  10 | # Unless required by applicable law or agreed to in writing, software
  11 | # distributed under the License is distributed on an "AS IS" BASIS,
  12 | # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
  13 | # See the License for the specific language governing permissions and
  14 | # limitations under the License.
  15 | """BERT finetuning runner."""
  16 | 
  17 | from __future__ import absolute_import
  18 | from __future__ import division
  19 | from __future__ import print_function
  20 | 
  21 | import collections
  22 | import csv
  23 | import os
  24 | import modeling
  25 | import optimization
  26 | import tokenization
  27 | import tensorflow as tf
  28 | 
  29 | flags = tf.flags
  30 | 
  31 | FLAGS = flags.FLAGS
  32 | 
  33 | ## Required parameters
  34 | flags.DEFINE_string(
  35 |     "data_dir", None,
  36 |     "The input data dir. Should contain the .tsv files (or other data files) "
  37 |     "for the task.")
  38 | 
  39 | flags.DEFINE_string(
  40 |     "bert_config_file", None,
  41 |     "The config json file corresponding to the pre-trained BERT model. "
  42 |     "This specifies the model architecture.")
  43 | 
  44 | flags.DEFINE_string("task_name", "MRPC", "The name of the task to train.")
  45 | 
  46 | flags.DEFINE_string("vocab_file", None,
  47 |                     "The vocabulary file that the BERT model was trained on.")
  48 | 
  49 | flags.DEFINE_string(
  50 |     "output_dir", None,
  51 |     "The output directory where the model checkpoints will be written.")
  52 | 
  53 | ## Other parameters
  54 | 
  55 | flags.DEFINE_string(
  56 |     "init_checkpoint", None,
  57 |     "Initial checkpoint (usually from a pre-trained BERT model).")
  58 | 
  59 | flags.DEFINE_bool(
  60 |     "do_lower_case", True,
  61 |     "Whether to lower case the input text. Should be True for uncased "
  62 |     "models and False for cased models.")
  63 | 
  64 | flags.DEFINE_integer(
  65 |     "max_seq_length", 128,
  66 |     "The maximum total input sequence length after WordPiece tokenization. "
  67 |     "Sequences longer than this will be truncated, and sequences shorter "
  68 |     "than this will be padded.")
  69 | 
  70 | flags.DEFINE_bool("do_train", True, "Whether to run training.")
  71 | 
  72 | flags.DEFINE_bool("do_eval", True, "Whether to run eval on the dev set.")
  73 | 
  74 | flags.DEFINE_bool(
  75 |     "do_predict", False,
  76 |     "Whether to run the model in inference mode on the test set.")
  77 | 
  78 | flags.DEFINE_integer("train_batch_size", 32, "Total batch size for training.")
  79 | 
  80 | flags.DEFINE_integer("eval_batch_size", 8, "Total batch size for eval.")
  81 | 
  82 | flags.DEFINE_integer("predict_batch_size", 8, "Total batch size for predict.")
  83 | 
  84 | flags.DEFINE_float("learning_rate", 2e-5, "The initial learning rate for Adam.")
  85 | 
  86 | flags.DEFINE_float("num_train_epochs", 10.0,
  87 |                    "Total number of training epochs to perform.")
  88 | 
  89 | flags.DEFINE_float(
  90 |     "warmup_proportion", 0.1,
  91 |     "Proportion of training to perform linear learning rate warmup for. "
  92 |     "E.g., 0.1 = 10% of training.")
  93 | 
  94 | flags.DEFINE_integer("save_checkpoints_steps", 50,
  95 |                      "How often to save the model checkpoint.")
  96 | 
  97 | flags.DEFINE_integer("iterations_per_loop", 50,
  98 |                      "How many steps to make in each estimator call.")
  99 | 
 100 | flags.DEFINE_bool("use_tpu", False, "Whether to use TPU or GPU/CPU.")
 101 | 
 102 | tf.flags.DEFINE_string(
 103 |     "tpu_name", None,
 104 |     "The Cloud TPU to use for training. This should be either the name "
 105 |     "used when creating the Cloud TPU, or a grpc://ip.address.of.tpu:8470 "
 106 |     "url.")
 107 | 
 108 | tf.flags.DEFINE_string(
 109 |     "tpu_zone", None,
 110 |     "[Optional] GCE zone where the Cloud TPU is located in. If not "
 111 |     "specified, we will attempt to automatically detect the GCE project from "
 112 |     "metadata.")
 113 | 
 114 | tf.flags.DEFINE_string(
 115 |     "gcp_project", None,
 116 |     "[Optional] Project name for the Cloud TPU-enabled project. If not "
 117 |     "specified, we will attempt to automatically detect the GCE project from "
 118 |     "metadata.")
 119 | 
 120 | tf.flags.DEFINE_string("master", None, "[Optional] TensorFlow master URL.")
 121 | 
 122 | flags.DEFINE_integer(
 123 |     "num_tpu_cores", 8,
 124 |     "Only used if `use_tpu` is True. Total number of TPU cores to use.")
 125 | 
 126 | 
 127 | class InputExample(object):
 128 |   """A single training/test example for simple sequence classification."""
 129 | 
 130 |   def __init__(self, guid, text_a, text_b=None, label=None):
 131 |     """Constructs a InputExample.
 132 | 
 133 |     Args:
 134 |       guid: Unique id for the example.
 135 |       text_a: string. The untokenized text of the first sequence. For single
 136 |         sequence tasks, only this sequence must be specified.
 137 |       text_b: (Optional) string. The untokenized text of the second sequence.
 138 |         Only must be specified for sequence pair tasks.
 139 |       label: (Optional) string. The label of the example. This should be
 140 |         specified for train and dev examples, but not for test examples.
 141 |     """
 142 |     self.guid = guid
 143 |     self.text_a = text_a
 144 |     self.text_b = text_b
 145 |     self.label = label
 146 | 
 147 | 
 148 | class PaddingInputExample(object):
 149 |   """Fake example so the num input examples is a multiple of the batch size.
 150 | 
 151 |   When running eval/predict on the TPU, we need to pad the number of examples
 152 |   to be a multiple of the batch size, because the TPU requires a fixed batch
 153 |   size. The alternative is to drop the last batch, which is bad because it means
 154 |   the entire output data won't be generated.
 155 | 
 156 |   We use this class instead of `None` because treating `None` as padding
 157 |   battches could cause silent errors.
 158 |   """
 159 | 
 160 | 
 161 | class InputFeatures(object):
 162 |   """A single set of features of data."""
 163 | 
 164 |   def __init__(self,
 165 |                input_ids,
 166 |                input_mask,
 167 |                segment_ids,
 168 |                label_id,
 169 |                is_real_example=True):
 170 |     self.input_ids = input_ids
 171 |     self.input_mask = input_mask
 172 |     self.segment_ids = segment_ids
 173 |     self.label_id = label_id
 174 |     self.is_real_example = is_real_example
 175 | 
 176 | 
 177 | class DataProcessor(object):
 178 |   """Base class for data converters for sequence classification data sets."""
 179 | 
 180 |   def get_train_examples(self, data_dir):
 181 |     """Gets a collection of `InputExample`s for the train set."""
 182 |     raise NotImplementedError()
 183 | 
 184 |   def get_dev_examples(self, data_dir):
 185 |     """Gets a collection of `InputExample`s for the dev set."""
 186 |     raise NotImplementedError()
 187 | 
 188 |   def get_test_examples(self, data_dir):
 189 |     """Gets a collection of `InputExample`s for prediction."""
 190 |     raise NotImplementedError()
 191 | 
 192 |   def get_labels(self):
 193 |     """Gets the list of labels for this data set."""
 194 |     raise NotImplementedError()
 195 | 
 196 |   @classmethod
 197 |   def _read_tsv(cls, input_file, quotechar=None):
 198 |     """Reads a tab separated value file."""
 199 |     with tf.gfile.Open(input_file, "r") as f:
 200 |       reader = csv.reader(f, delimiter="\t", quotechar=quotechar)
 201 |       lines = []
 202 |       for line in reader:
 203 |         lines.append(line)
 204 |       return lines
 205 | 
 206 | 
 207 | class XnliProcessor(DataProcessor):
 208 |   """Processor for the XNLI data set."""
 209 | 
 210 |   def __init__(self):
 211 |     self.language = "zh"
 212 | 
 213 |   def get_train_examples(self, data_dir):
 214 |     """See base class."""
 215 |     lines = self._read_tsv(
 216 |         os.path.join(data_dir, "multinli",
 217 |                      "multinli.train.%s.tsv" % self.language))
 218 |     examples = []
 219 |     for (i, line) in enumerate(lines):
 220 |       if i == 0:
 221 |         continue
 222 |       guid = "train-%d" % (i)
 223 |       text_a = tokenization.convert_to_unicode(line[0])
 224 |       text_b = tokenization.convert_to_unicode(line[1])
 225 |       label = tokenization.convert_to_unicode(line[2])
 226 |       if label == tokenization.convert_to_unicode("contradictory"):
 227 |         label = tokenization.convert_to_unicode("contradiction")
 228 |       examples.append(
 229 |           InputExample(guid=guid, text_a=text_a, text_b=text_b, label=label))
 230 |     return examples
 231 | 
 232 |   def get_dev_examples(self, data_dir):
 233 |     """See base class."""
 234 |     lines = self._read_tsv(os.path.join(data_dir, "xnli.dev.tsv"))
 235 |     examples = []
 236 |     for (i, line) in enumerate(lines):
 237 |       if i == 0:
 238 |         continue
 239 |       guid = "dev-%d" % (i)
 240 |       language = tokenization.convert_to_unicode(line[0])
 241 |       if language != tokenization.convert_to_unicode(self.language):
 242 |         continue
 243 |       text_a = tokenization.convert_to_unicode(line[6])
 244 |       text_b = tokenization.convert_to_unicode(line[7])
 245 |       label = tokenization.convert_to_unicode(line[1])
 246 |       examples.append(
 247 |           InputExample(guid=guid, text_a=text_a, text_b=text_b, label=label))
 248 |     return examples
 249 | 
 250 |   def get_labels(self):
 251 |     """See base class."""
 252 |     return ["contradiction", "entailment", "neutral"]
 253 | 
 254 | 
 255 | class MnliProcessor(DataProcessor):
 256 |   """Processor for the MultiNLI data set (GLUE version)."""
 257 | 
 258 |   def get_train_examples(self, data_dir):
 259 |     """See base class."""
 260 |     return self._create_examples(
 261 |         self._read_tsv(os.path.join(data_dir, "train.tsv")), "train")
 262 | 
 263 |   def get_dev_examples(self, data_dir):
 264 |     """See base class."""
 265 |     return self._create_examples(
 266 |         self._read_tsv(os.path.join(data_dir, "dev_matched.tsv")),
 267 |         "dev_matched")
 268 | 
 269 |   def get_test_examples(self, data_dir):
 270 |     """See base class."""
 271 |     return self._create_examples(
 272 |         self._read_tsv(os.path.join(data_dir, "test_matched.tsv")), "test")
 273 | 
 274 |   def get_labels(self):
 275 |     """See base class."""
 276 |     return ["contradiction", "entailment", "neutral"]
 277 | 
 278 |   def _create_examples(self, lines, set_type):
 279 |     """Creates examples for the training and dev sets."""
 280 |     examples = []
 281 |     for (i, line) in enumerate(lines):
 282 |       if i == 0:
 283 |         continue
 284 |       guid = "%s-%s" % (set_type, tokenization.convert_to_unicode(line[0]))
 285 |       text_a = tokenization.convert_to_unicode(line[8])
 286 |       text_b = tokenization.convert_to_unicode(line[9])
 287 |       if set_type == "test":
 288 |         label = "contradiction"
 289 |       else:
 290 |         label = tokenization.convert_to_unicode(line[-1])
 291 |       examples.append(
 292 |           InputExample(guid=guid, text_a=text_a, text_b=text_b, label=label))
 293 |     return examples
 294 | 
 295 | 
 296 | class MrpcProcessor(DataProcessor):
 297 |   """Processor for the MRPC data set (GLUE version)."""
 298 | 
 299 |   def get_train_examples(self, data_dir):
 300 |     """See base class."""
 301 |     return self._create_examples(
 302 |         self._read_tsv(os.path.join(data_dir, "train.tsv")), "train")
 303 | 
 304 |   def get_dev_examples(self, data_dir):
 305 |     """See base class."""
 306 |     return self._create_examples(
 307 |         self._read_tsv(os.path.join(data_dir, "dev.tsv")), "dev")
 308 | 
 309 |   def get_test_examples(self, data_dir):
 310 |     """See base class."""
 311 |     return self._create_examples(
 312 |         self._read_tsv(os.path.join(data_dir, "test.tsv")), "test")
 313 | 
 314 |   def get_labels(self):
 315 |     """See base class."""
 316 |     return ["0", "1"]
 317 | 
 318 |   def _create_examples(self, lines, set_type):
 319 |     """Creates examples for the training and dev sets."""
 320 |     examples = []
 321 |     for (i, line) in enumerate(lines):
 322 |       if i == 0:
 323 |         continue
 324 |       guid = "%s-%s" % (set_type, i)
 325 |       text_a = tokenization.convert_to_unicode(line[3])
 326 |       text_b = tokenization.convert_to_unicode(line[4])
 327 |       if set_type == "test":
 328 |         label = "0"
 329 |       else:
 330 |         label = tokenization.convert_to_unicode(line[0])
 331 |       examples.append(
 332 |           InputExample(guid=guid, text_a=text_a, text_b=text_b, label=label))
 333 |     return examples
 334 | 
 335 | 
 336 | 
 337 | class ColaProcessor(DataProcessor):
 338 |   """Processor for the CoLA data set (GLUE version)."""
 339 | 
 340 |   def get_train_examples(self, data_dir):
 341 |     """See base class."""
 342 |     return self._create_examples(
 343 |         self._read_tsv(os.path.join(data_dir, "train.tsv")), "train")
 344 | 
 345 |   def get_dev_examples(self, data_dir):
 346 |     """See base class."""
 347 |     return self._create_examples(
 348 |         self._read_tsv(os.path.join(data_dir, "dev.tsv")), "dev")
 349 | 
 350 |   def get_test_examples(self, data_dir):
 351 |     """See base class."""
 352 |     return self._create_examples(
 353 |         self._read_tsv(os.path.join(data_dir, "test.tsv")), "test")
 354 | 
 355 |   def get_labels(self):
 356 |     """See base class."""
 357 |     return ["0", "1"]
 358 | 
 359 |   def _create_examples(self, lines, set_type):
 360 |     """Creates examples for the training and dev sets."""
 361 |     examples = []
 362 |     for (i, line) in enumerate(lines):
 363 |       # Only the test set has a header
 364 |       if set_type == "test" and i == 0:
 365 |         continue
 366 |       guid = "%s-%s" % (set_type, i)
 367 |       if set_type == "test":
 368 |         text_a = tokenization.convert_to_unicode(line[1])
 369 |         label = "0"
 370 |       else:
 371 |         text_a = tokenization.convert_to_unicode(line[3])
 372 |         label = tokenization.convert_to_unicode(line[1])
 373 |       examples.append(
 374 |           InputExample(guid=guid, text_a=text_a, text_b=None, label=label))
 375 |     return examples
 376 | 
 377 | 
 378 | def convert_single_example(ex_index, example, label_list, max_seq_length,
 379 |                            tokenizer):
 380 |   """Converts a single `InputExample` into a single `InputFeatures`."""
 381 | 
 382 |   if isinstance(example, PaddingInputExample):
 383 |     return InputFeatures(
 384 |         input_ids=[0] * max_seq_length,
 385 |         input_mask=[0] * max_seq_length,
 386 |         segment_ids=[0] * max_seq_length,
 387 |         label_id=0,
 388 |         is_real_example=False)
 389 | 
 390 |   label_map = {}
 391 |   for (i, label) in enumerate(label_list):
 392 |     label_map[label] = i
 393 | 
 394 |   tokens_a = tokenizer.tokenize(example.text_a)
 395 |   tokens_b = None
 396 |   if example.text_b:
 397 |     tokens_b = tokenizer.tokenize(example.text_b)
 398 | 
 399 |   if tokens_b:
 400 |     # Modifies `tokens_a` and `tokens_b` in place so that the total
 401 |     # length is less than the specified length.
 402 |     # Account for [CLS], [SEP], [SEP] with "- 3"
 403 |     _truncate_seq_pair(tokens_a, tokens_b, max_seq_length - 3)
 404 |   else:
 405 |     # Account for [CLS] and [SEP] with "- 2"
 406 |     if len(tokens_a) > max_seq_length - 2:
 407 |       tokens_a = tokens_a[0:(max_seq_length - 2)]
 408 | 
 409 |   # The convention in BERT is:
 410 |   # (a) For sequence pairs:
 411 |   #  tokens:   [CLS] is this jack ##son ##ville ? [SEP] no it is not . [SEP]
 412 |   #  type_ids: 0     0  0    0    0     0       0 0     1  1  1  1   1 1
 413 |   # (b) For single sequences:
 414 |   #  tokens:   [CLS] the dog is hairy . [SEP]
 415 |   #  type_ids: 0     0   0   0  0     0 0
 416 |   #
 417 |   # Where "type_ids" are used to indicate whether this is the first
 418 |   # sequence or the second sequence. The embedding vectors for `type=0` and
 419 |   # `type=1` were learned during pre-training and are added to the wordpiece
 420 |   # embedding vector (and position vector). This is not *strictly* necessary
 421 |   # since the [SEP] token unambiguously separates the sequences, but it makes
 422 |   # it easier for the model to learn the concept of sequences.
 423 |   #
 424 |   # For classification tasks, the first vector (corresponding to [CLS]) is
 425 |   # used as the "sentence vector". Note that this only makes sense because
 426 |   # the entire model is fine-tuned.
 427 |   tokens = []
 428 |   segment_ids = []
 429 |   tokens.append("[CLS]")
 430 |   segment_ids.append(0)
 431 |   for token in tokens_a:
 432 |     tokens.append(token)
 433 |     segment_ids.append(0)
 434 |   tokens.append("[SEP]")
 435 |   segment_ids.append(0)
 436 | 
 437 |   if tokens_b:
 438 |     for token in tokens_b:
 439 |       tokens.append(token)
 440 |       segment_ids.append(1)
 441 |     tokens.append("[SEP]")
 442 |     segment_ids.append(1)
 443 | 
 444 |   input_ids = tokenizer.convert_tokens_to_ids(tokens)
 445 | 
 446 |   # The mask has 1 for real tokens and 0 for padding tokens. Only real
 447 |   # tokens are attended to.
 448 |   input_mask = [1] * len(input_ids)
 449 | 
 450 |   # Zero-pad up to the sequence length.
 451 |   while len(input_ids) < max_seq_length:
 452 |     input_ids.append(0)
 453 |     input_mask.append(0)
 454 |     segment_ids.append(0)
 455 | 
 456 |   assert len(input_ids) == max_seq_length
 457 |   assert len(input_mask) == max_seq_length
 458 |   assert len(segment_ids) == max_seq_length
 459 | 
 460 |   label_id = label_map[example.label]
 461 |   if ex_index < 5:
 462 |     tf.logging.info("*** Example ***")
 463 |     tf.logging.info("guid: %s" % (example.guid))
 464 |     tf.logging.info("tokens: %s" % " ".join(
 465 |         [tokenization.printable_text(x) for x in tokens]))
 466 |     tf.logging.info("input_ids: %s" % " ".join([str(x) for x in input_ids]))
 467 |     tf.logging.info("input_mask: %s" % " ".join([str(x) for x in input_mask]))
 468 |     tf.logging.info("segment_ids: %s" % " ".join([str(x) for x in segment_ids]))
 469 |     tf.logging.info("label: %s (id = %d)" % (example.label, label_id))
 470 | 
 471 |   feature = InputFeatures(
 472 |       input_ids=input_ids,
 473 |       input_mask=input_mask,
 474 |       segment_ids=segment_ids,
 475 |       label_id=label_id,
 476 |       is_real_example=True)
 477 |   return feature
 478 | 
 479 | 
 480 | def file_based_convert_examples_to_features(
 481 |     examples, label_list, max_seq_length, tokenizer, output_file):
 482 |   """Convert a set of `InputExample`s to a TFRecord file."""
 483 | 
 484 |   writer = tf.python_io.TFRecordWriter(output_file)
 485 | 
 486 |   for (ex_index, example) in enumerate(examples):
 487 |     if ex_index % 10000 == 0:
 488 |       tf.logging.info("Writing example %d of %d" % (ex_index, len(examples)))
 489 | 
 490 |     feature = convert_single_example(ex_index, example, label_list,
 491 |                                      max_seq_length, tokenizer)
 492 | 
 493 |     def create_int_feature(values):
 494 |       f = tf.train.Feature(int64_list=tf.train.Int64List(value=list(values)))
 495 |       return f
 496 | 
 497 |     features = collections.OrderedDict()
 498 |     features["input_ids"] = create_int_feature(feature.input_ids)
 499 |     features["input_mask"] = create_int_feature(feature.input_mask)
 500 |     features["segment_ids"] = create_int_feature(feature.segment_ids)
 501 |     features["label_ids"] = create_int_feature([feature.label_id])
 502 |     features["is_real_example"] = create_int_feature(
 503 |         [int(feature.is_real_example)])
 504 | 
 505 |     tf_example = tf.train.Example(features=tf.train.Features(feature=features))
 506 |     writer.write(tf_example.SerializeToString())
 507 |   writer.close()
 508 | 
 509 | 
 510 | def file_based_input_fn_builder(input_file, seq_length, is_training,
 511 |                                 drop_remainder):
 512 |   """Creates an `input_fn` closure to be passed to TPUEstimator."""
 513 | 
 514 |   name_to_features = {
 515 |       "input_ids": tf.FixedLenFeature([seq_length], tf.int64),
 516 |       "input_mask": tf.FixedLenFeature([seq_length], tf.int64),
 517 |       "segment_ids": tf.FixedLenFeature([seq_length], tf.int64),
 518 |       "label_ids": tf.FixedLenFeature([], tf.int64),
 519 |       "is_real_example": tf.FixedLenFeature([], tf.int64),
 520 |   }
 521 | 
 522 |   def _decode_record(record, name_to_features):
 523 |     """Decodes a record to a TensorFlow example."""
 524 |     example = tf.parse_single_example(record, name_to_features)
 525 | 
 526 |     # tf.Example only supports tf.int64, but the TPU only supports tf.int32.
 527 |     # So cast all int64 to int32.
 528 |     for name in list(example.keys()):
 529 |       t = example[name]
 530 |       if t.dtype == tf.int64:
 531 |         t = tf.to_int32(t)
 532 |       example[name] = t
 533 | 
 534 |     return example
 535 | 
 536 |   def input_fn(params):
 537 |     """The actual input function."""
 538 |     batch_size = params["batch_size"]
 539 | 
 540 |     # For training, we want a lot of parallel reading and shuffling.
 541 |     # For eval, we want no shuffling and parallel reading doesn't matter.
 542 |     d = tf.data.TFRecordDataset(input_file)
 543 |     if is_training:
 544 |       d = d.repeat()
 545 |       d = d.shuffle(buffer_size=100)
 546 | 
 547 |     d = d.apply(
 548 |         tf.contrib.data.map_and_batch(
 549 |             lambda record: _decode_record(record, name_to_features),
 550 |             batch_size=batch_size,
 551 |             drop_remainder=drop_remainder))
 552 | 
 553 |     return d
 554 | 
 555 |   return input_fn
 556 | 
 557 | 
 558 | def _truncate_seq_pair(tokens_a, tokens_b, max_length):
 559 |   """Truncates a sequence pair in place to the maximum length."""
 560 | 
 561 |   # This is a simple heuristic which will always truncate the longer sequence
 562 |   # one token at a time. This makes more sense than truncating an equal percent
 563 |   # of tokens from each, since if one sequence is very short then each token
 564 |   # that's truncated likely contains more information than a longer sequence.
 565 |   while True:
 566 |     total_length = len(tokens_a) + len(tokens_b)
 567 |     if total_length <= max_length:
 568 |       break
 569 |     if len(tokens_a) > len(tokens_b):
 570 |       tokens_a.pop()
 571 |     else:
 572 |       tokens_b.pop()
 573 | 
 574 | 
 575 | def create_model(bert_config, is_training, input_ids, input_mask, segment_ids,
 576 |                  labels, num_labels, use_one_hot_embeddings):
 577 |   """Creates a classification model."""
 578 |   model = modeling.BertModel(
 579 |       config=bert_config,
 580 |       is_training=is_training,
 581 |       input_ids=input_ids,
 582 |       input_mask=input_mask,
 583 |       token_type_ids=segment_ids,
 584 |       use_one_hot_embeddings=use_one_hot_embeddings)
 585 | 
 586 |   # In the demo, we are doing a simple classification task on the entire
 587 |   # segment.
 588 |   #
 589 |   # If you want to use the token-level output, use model.get_sequence_output()
 590 |   # instead.
 591 |   output_layer = model.get_pooled_output()
 592 | 
 593 |   hidden_size = output_layer.shape[-1].value
 594 | 
 595 |   output_weights = tf.get_variable(
 596 |       "output_weights", [num_labels, hidden_size],
 597 |       initializer=tf.truncated_normal_initializer(stddev=0.02))
 598 | 
 599 |   output_bias = tf.get_variable(
 600 |       "output_bias", [num_labels], initializer=tf.zeros_initializer())
 601 | 
 602 |   with tf.variable_scope("loss"):
 603 |     if is_training:
 604 |       # I.e., 0.1 dropout
 605 |       output_layer = tf.nn.dropout(output_layer, keep_prob=0.9)
 606 | 
 607 |     logits = tf.matmul(output_layer, output_weights, transpose_b=True)
 608 |     logits = tf.nn.bias_add(logits, output_bias)
 609 |     probabilities = tf.nn.softmax(logits, axis=-1)
 610 |     log_probs = tf.nn.log_softmax(logits, axis=-1)
 611 | 
 612 |     one_hot_labels = tf.one_hot(labels, depth=num_labels, dtype=tf.float32)
 613 | 
 614 |     per_example_loss = -tf.reduce_sum(one_hot_labels * log_probs, axis=-1)
 615 |     loss = tf.reduce_mean(per_example_loss)
 616 | 
 617 |     return (loss, per_example_loss, logits, probabilities)
 618 | 
 619 | 
 620 | def model_fn_builder(bert_config, num_labels, init_checkpoint, learning_rate,
 621 |                      num_train_steps, num_warmup_steps, use_tpu,
 622 |                      use_one_hot_embeddings):
 623 |   """Returns `model_fn` closure for TPUEstimator."""
 624 | 
 625 |   def model_fn(features, labels, mode, params):  # pylint: disable=unused-argument
 626 |     """The `model_fn` for TPUEstimator."""
 627 | 
 628 |     tf.logging.info("*** Features ***")
 629 |     for name in sorted(features.keys()):
 630 |       tf.logging.info("  name = %s, shape = %s" % (name, features[name].shape))
 631 | 
 632 |     input_ids = features["input_ids"]
 633 |     input_mask = features["input_mask"]
 634 |     segment_ids = features["segment_ids"]
 635 |     label_ids = features["label_ids"]
 636 |     is_real_example = None
 637 |     if "is_real_example" in features:
 638 |       is_real_example = tf.cast(features["is_real_example"], dtype=tf.float32)
 639 |     else:
 640 |       is_real_example = tf.ones(tf.shape(label_ids), dtype=tf.float32)
 641 | 
 642 |     is_training = (mode == tf.estimator.ModeKeys.TRAIN)
 643 | 
 644 |     (total_loss, per_example_loss, logits, probabilities) = create_model(
 645 |         bert_config, is_training, input_ids, input_mask, segment_ids, label_ids,
 646 |         num_labels, use_one_hot_embeddings)
 647 | 
 648 |     tvars = tf.trainable_variables()
 649 |     initialized_variable_names = {}
 650 |     scaffold_fn = None
 651 |     if init_checkpoint:
 652 |       (assignment_map, initialized_variable_names
 653 |       ) = modeling.get_assignment_map_from_checkpoint(tvars, init_checkpoint)
 654 |       if use_tpu:
 655 | 
 656 |         def tpu_scaffold():
 657 |           tf.train.init_from_checkpoint(init_checkpoint, assignment_map)
 658 |           return tf.train.Scaffold()
 659 | 
 660 |         scaffold_fn = tpu_scaffold
 661 |       else:
 662 |         tf.train.init_from_checkpoint(init_checkpoint, assignment_map)
 663 | 
 664 |     tf.logging.info("**** Trainable Variables ****")
 665 |     for var in tvars:
 666 |       init_string = ""
 667 |       if var.name in initialized_variable_names:
 668 |         init_string = ", *INIT_FROM_CKPT*"
 669 |       tf.logging.info("  name = %s, shape = %s%s", var.name, var.shape,
 670 |                       init_string)
 671 | 
 672 |     output_spec = None
 673 |     if mode == tf.estimator.ModeKeys.TRAIN:
 674 | 
 675 |       train_op = optimization.create_optimizer(
 676 |           total_loss, learning_rate, num_train_steps, num_warmup_steps, use_tpu)
 677 | 
 678 |       output_spec = tf.contrib.tpu.TPUEstimatorSpec(
 679 |           mode=mode,
 680 |           loss=total_loss,
 681 |           train_op=train_op,
 682 |           scaffold_fn=scaffold_fn)
 683 |     elif mode == tf.estimator.ModeKeys.EVAL:
 684 | 
 685 |       def metric_fn(per_example_loss, label_ids, logits, is_real_example):
 686 |         predictions = tf.argmax(logits, axis=-1, output_type=tf.int32)
 687 |         accuracy = tf.metrics.accuracy(
 688 |             labels=label_ids, predictions=predictions, weights=is_real_example)
 689 |         loss = tf.metrics.mean(values=per_example_loss, weights=is_real_example)
 690 |         return {
 691 |             "eval_accuracy": accuracy,
 692 |             "eval_loss": loss,
 693 |         }
 694 | 
 695 |       eval_metrics = (metric_fn,
 696 |                       [per_example_loss, label_ids, logits, is_real_example])
 697 |       output_spec = tf.contrib.tpu.TPUEstimatorSpec(
 698 |           mode=mode,
 699 |           loss=total_loss,
 700 |           eval_metrics=eval_metrics,
 701 |           scaffold_fn=scaffold_fn)
 702 |     else:
 703 |       output_spec = tf.contrib.tpu.TPUEstimatorSpec(
 704 |           mode=mode,
 705 |           predictions={"probabilities": probabilities},
 706 |           scaffold_fn=scaffold_fn)
 707 |     return output_spec
 708 | 
 709 |   return model_fn
 710 | 
 711 | 
 712 | # This function is not used by this file but is still used by the Colab and
 713 | # people who depend on it.
 714 | def input_fn_builder(features, seq_length, is_training, drop_remainder):
 715 |   """Creates an `input_fn` closure to be passed to TPUEstimator."""
 716 | 
 717 |   all_input_ids = []
 718 |   all_input_mask = []
 719 |   all_segment_ids = []
 720 |   all_label_ids = []
 721 | 
 722 |   for feature in features:
 723 |     all_input_ids.append(feature.input_ids)
 724 |     all_input_mask.append(feature.input_mask)
 725 |     all_segment_ids.append(feature.segment_ids)
 726 |     all_label_ids.append(feature.label_id)
 727 | 
 728 |   def input_fn(params):
 729 |     """The actual input function."""
 730 |     batch_size = params["batch_size"]
 731 | 
 732 |     num_examples = len(features)
 733 | 
 734 |     # This is for demo purposes and does NOT scale to large data sets. We do
 735 |     # not use Dataset.from_generator() because that uses tf.py_func which is
 736 |     # not TPU compatible. The right way to load data is with TFRecordReader.
 737 |     d = tf.data.Dataset.from_tensor_slices({
 738 |         "input_ids":
 739 |             tf.constant(
 740 |                 all_input_ids, shape=[num_examples, seq_length],
 741 |                 dtype=tf.int32),
 742 |         "input_mask":
 743 |             tf.constant(
 744 |                 all_input_mask,
 745 |                 shape=[num_examples, seq_length],
 746 |                 dtype=tf.int32),
 747 |         "segment_ids":
 748 |             tf.constant(
 749 |                 all_segment_ids,
 750 |                 shape=[num_examples, seq_length],
 751 |                 dtype=tf.int32),
 752 |         "label_ids":
 753 |             tf.constant(all_label_ids, shape=[num_examples], dtype=tf.int32),
 754 |     })
 755 | 
 756 |     if is_training:
 757 |       d = d.repeat()
 758 |       d = d.shuffle(buffer_size=100)
 759 | 
 760 |     d = d.batch(batch_size=batch_size, drop_remainder=drop_remainder)
 761 |     return d
 762 | 
 763 |   return input_fn
 764 | 
 765 | 
 766 | # This function is not used by this file but is still used by the Colab and
 767 | # people who depend on it.
 768 | def convert_examples_to_features(examples, label_list, max_seq_length,
 769 |                                  tokenizer):
 770 |   """Convert a set of `InputExample`s to a list of `InputFeatures`."""
 771 | 
 772 |   features = []
 773 |   for (ex_index, example) in enumerate(examples):
 774 |     if ex_index % 10000 == 0:
 775 |       tf.logging.info("Writing example %d of %d" % (ex_index, len(examples)))
 776 | 
 777 |     feature = convert_single_example(ex_index, example, label_list,
 778 |                                      max_seq_length, tokenizer)
 779 | 
 780 |     features.append(feature)
 781 |   return features
 782 | 
 783 | 
 784 | 
 785 | import pandas as pd
 786 | class SelfProcessor(DataProcessor):
 787 | 
 788 |     def get_train_examples(self, data_dir):
 789 |         file_path = os.path.join(data_dir, 'atec_nlp_sim_train_0.6.csv')
 790 |         reader=pd.read_csv(file_path, encoding='utf-8',error_bad_lines=False)
 791 |         # 如果数据不是乱序的，注意要shuffle
 792 |         # 这里的数据量比较大，取部分跑一下
 793 |         reader = reader.head(3000)
 794 |         print("train length:",len(reader))
 795 | 
 796 |         examples = []
 797 |         for _,row in reader.iterrows():
 798 |             line=row[0]
 799 |             # print(line)
 800 |             split_line = line.strip().split("\t")
 801 |             if len(split_line)!=4:
 802 |                 continue
 803 | 
 804 |             guid = split_line[0]
 805 |             text_a = tokenization.convert_to_unicode(split_line[1])
 806 |             text_b = tokenization.convert_to_unicode(split_line[2])
 807 |             label = split_line[3]
 808 |             examples.append(InputExample(guid=guid, text_a=text_a,
 809 |                                          text_b=text_b, label=label))
 810 |         return examples
 811 | 
 812 |     def get_dev_examples(self, data_dir):
 813 |         file_path = os.path.join(data_dir, 'atec_nlp_sim_test_0.4.csv')
 814 |         reader=pd.read_csv(file_path, encoding='utf-8',error_bad_lines=False)
 815 |         #如果数据不是乱序的，注意要shuffle
 816 |         #这里的数据量比较大，取部分跑一下
 817 |         reader=reader.tail(500)
 818 | 
 819 |         examples = []
 820 |         for _,row in reader.iterrows():
 821 |             line=row[0]
 822 |             # print(line)
 823 |             split_line = line.strip().split("\t")
 824 |             if len(split_line)!=4:
 825 |                 continue
 826 | 
 827 |             guid = split_line[0]
 828 |             text_a = tokenization.convert_to_unicode(split_line[1])
 829 |             text_b = tokenization.convert_to_unicode(split_line[2])
 830 |             label = split_line[3]
 831 |             examples.append(InputExample(guid=guid, text_a=text_a,
 832 |                                          text_b=text_b, label=label))
 833 |         return examples
 834 | 
 835 |     def get_test_examples(self, data_dir):
 836 |         file_path = os.path.join(data_dir, 'atec_nlp_sim_test_0.4.csv')
 837 |         reader=pd.read_csv(file_path, encoding='utf-8',error_bad_lines=False)
 838 |         # 这里的数据量比较大，取部分跑一下,跟验证集的数据区分开
 839 |         reader = reader.head(100)
 840 | 
 841 |         examples = []
 842 |         for _,row in reader.iterrows():
 843 |             line=row[0]
 844 |             # print(line)
 845 |             split_line = line.strip().split("\t")
 846 |             if len(split_line)!=4:
 847 |                 continue
 848 | 
 849 |             guid = split_line[0]
 850 |             text_a = tokenization.convert_to_unicode(split_line[1])
 851 |             text_b = tokenization.convert_to_unicode(split_line[2])
 852 |             label = split_line[3]
 853 |             examples.append(InputExample(guid=guid, text_a=text_a,
 854 |                                          text_b=text_b, label=label))
 855 |         return examples
 856 | 
 857 |     def get_labels(self):
 858 |         """See base class."""
 859 |         return ["0", "1"]
 860 | 
 861 |     def _create_examples(self, lines, set_type):
 862 |         """Creates examples for the training and dev sets."""
 863 |         examples = []
 864 |         for (i, line) in enumerate(lines):
 865 |             # Only the test set has a header
 866 |             if set_type == "test" and i == 0:
 867 |                 continue
 868 |             guid = "%s-%s" % (set_type, i)
 869 |             if set_type == "test":
 870 |                 text_a = tokenization.convert_to_unicode(line[1])
 871 |                 label = "0"
 872 |             else:
 873 |                 text_a = tokenization.convert_to_unicode(line[3])
 874 |                 label = tokenization.convert_to_unicode(line[1])
 875 |             examples.append(
 876 |                 InputExample(guid=guid, text_a=text_a, text_b=None, label=label))
 877 |         return examples
 878 | 
 879 | 
 880 | 
 881 | def main(_):
 882 |   tf.logging.set_verbosity(tf.logging.INFO)
 883 | 
 884 |   processors = {
 885 |       "cola": ColaProcessor,
 886 |       "mnli": MnliProcessor,
 887 |       "mrpc": MrpcProcessor,
 888 |       "xnli": XnliProcessor,
 889 |       "mayi": SelfProcessor
 890 |   }
 891 | 
 892 |   tokenization.validate_case_matches_checkpoint(FLAGS.do_lower_case,
 893 |                                                 FLAGS.init_checkpoint)
 894 | 
 895 |   if not FLAGS.do_train and not FLAGS.do_eval and not FLAGS.do_predict:
 896 |     raise ValueError(
 897 |         "At least one of `do_train`, `do_eval` or `do_predict' must be True.")
 898 | 
 899 |   bert_config = modeling.BertConfig.from_json_file(FLAGS.bert_config_file)
 900 | 
 901 |   if FLAGS.max_seq_length > bert_config.max_position_embeddings:
 902 |     raise ValueError(
 903 |         "Cannot use sequence length %d because the BERT model "
 904 |         "was only trained up to sequence length %d" %
 905 |         (FLAGS.max_seq_length, bert_config.max_position_embeddings))
 906 | 
 907 |   tf.gfile.MakeDirs(FLAGS.output_dir)
 908 | 
 909 |   task_name = FLAGS.task_name.lower()
 910 | 
 911 |   if task_name not in processors:
 912 |     raise ValueError("Task not found: %s" % (task_name))
 913 | 
 914 |   processor = processors[task_name]()
 915 | 
 916 |   tokenizer = tokenization.FullTokenizer(
 917 |       vocab_file=FLAGS.vocab_file, do_lower_case=FLAGS.do_lower_case)
 918 | 
 919 |   tpu_cluster_resolver = None
 920 |   if FLAGS.use_tpu and FLAGS.tpu_name:
 921 |     tpu_cluster_resolver = tf.contrib.cluster_resolver.TPUClusterResolver(
 922 |         FLAGS.tpu_name, zone=FLAGS.tpu_zone, project=FLAGS.gcp_project)
 923 | 
 924 |   is_per_host = tf.contrib.tpu.InputPipelineConfig.PER_HOST_V2
 925 |   run_config = tf.contrib.tpu.RunConfig(
 926 |       cluster=tpu_cluster_resolver,
 927 |       master=FLAGS.master,
 928 |       model_dir=FLAGS.output_dir,
 929 |       save_checkpoints_steps=FLAGS.save_checkpoints_steps,
 930 |       tpu_config=tf.contrib.tpu.TPUConfig(
 931 |           iterations_per_loop=FLAGS.iterations_per_loop,
 932 |           num_shards=FLAGS.num_tpu_cores,
 933 |           per_host_input_for_training=is_per_host))
 934 | 
 935 |   train_examples = None
 936 |   num_train_steps = None
 937 |   num_warmup_steps = None
 938 | 
 939 |   train_examples = processor.get_train_examples(FLAGS.data_dir)
 940 |   # print("len of train_examples:",len(train_examples))
 941 | 
 942 |   if FLAGS.do_train:
 943 |     num_train_steps = int(
 944 |         len(train_examples) / FLAGS.train_batch_size * FLAGS.num_train_epochs)
 945 |     num_warmup_steps = int(num_train_steps * FLAGS.warmup_proportion)
 946 | 
 947 |   label_list = processor.get_labels()
 948 | 
 949 |   model_fn = model_fn_builder(
 950 |       bert_config=bert_config,
 951 |       num_labels=len(label_list),
 952 |       init_checkpoint=FLAGS.init_checkpoint,
 953 |       learning_rate=FLAGS.learning_rate,
 954 |       num_train_steps=num_train_steps,
 955 |       num_warmup_steps=num_warmup_steps,
 956 |       use_tpu=FLAGS.use_tpu,
 957 |       use_one_hot_embeddings=FLAGS.use_tpu)
 958 | 
 959 |   # If TPU is not available, this will fall back to normal Estimator on CPU
 960 |   # or GPU.
 961 |   estimator = tf.contrib.tpu.TPUEstimator(
 962 |       use_tpu=FLAGS.use_tpu,
 963 |       model_fn=model_fn,
 964 |       config=run_config,
 965 |       train_batch_size=FLAGS.train_batch_size,
 966 |       eval_batch_size=FLAGS.eval_batch_size,
 967 |       predict_batch_size=FLAGS.predict_batch_size)
 968 | 
 969 | 
 970 |   if FLAGS.do_train:
 971 |     train_file = os.path.join(FLAGS.output_dir, "train.tf_record")
 972 |     file_based_convert_examples_to_features(
 973 |         train_examples, label_list, FLAGS.max_seq_length, tokenizer, train_file)
 974 |     tf.logging.info("***** Running training *****")
 975 |     tf.logging.info("  Num examples = %d", len(train_examples))
 976 |     tf.logging.info("  Batch size = %d", FLAGS.train_batch_size)
 977 |     tf.logging.info("  Num steps = %d", num_train_steps)
 978 |     train_input_fn = file_based_input_fn_builder(
 979 |         input_file=train_file,
 980 |         seq_length=FLAGS.max_seq_length,
 981 |         is_training=True,
 982 |         drop_remainder=True)
 983 |     estimator.train(input_fn=train_input_fn, max_steps=num_train_steps)
 984 | 
 985 |   if FLAGS.do_eval:
 986 |     eval_examples = processor.get_dev_examples(FLAGS.data_dir)
 987 |     num_actual_eval_examples = len(eval_examples)
 988 |     if FLAGS.use_tpu:
 989 |       # TPU requires a fixed batch size for all batches, therefore the number
 990 |       # of examples must be a multiple of the batch size, or else examples
 991 |       # will get dropped. So we pad with fake examples which are ignored
 992 |       # later on. These do NOT count towards the metric (all tf.metrics
 993 |       # support a per-instance weight, and these get a weight of 0.0).
 994 |       while len(eval_examples) % FLAGS.eval_batch_size != 0:
 995 |         eval_examples.append(PaddingInputExample())
 996 | 
 997 |     eval_file = os.path.join(FLAGS.output_dir, "eval.tf_record")
 998 |     file_based_convert_examples_to_features(
 999 |         eval_examples, label_list, FLAGS.max_seq_length, tokenizer, eval_file)
1000 | 
1001 |     tf.logging.info("***** Running evaluation *****")
1002 |     tf.logging.info("  Num examples = %d (%d actual, %d padding)",
1003 |                     len(eval_examples), num_actual_eval_examples,
1004 |                     len(eval_examples) - num_actual_eval_examples)
1005 |     tf.logging.info("  Batch size = %d", FLAGS.eval_batch_size)
1006 | 
1007 |     # This tells the estimator to run through the entire set.
1008 |     eval_steps = None
1009 |     # However, if running eval on the TPU, you will need to specify the
1010 |     # number of steps.
1011 |     if FLAGS.use_tpu:
1012 |       assert len(eval_examples) % FLAGS.eval_batch_size == 0
1013 |       eval_steps = int(len(eval_examples) // FLAGS.eval_batch_size)
1014 | 
1015 |     eval_drop_remainder = True if FLAGS.use_tpu else False
1016 |     eval_input_fn = file_based_input_fn_builder(
1017 |         input_file=eval_file,
1018 |         seq_length=FLAGS.max_seq_length,
1019 |         is_training=False,
1020 |         drop_remainder=eval_drop_remainder)
1021 | 
1022 |     result = estimator.evaluate(input_fn=eval_input_fn, steps=eval_steps)
1023 | 
1024 |     output_eval_file = os.path.join(FLAGS.output_dir, "eval_results.txt")
1025 |     with tf.gfile.GFile(output_eval_file, "w") as writer:
1026 |       tf.logging.info("***** Eval results *****")
1027 |       for key in sorted(result.keys()):
1028 |         tf.logging.info("  %s = %s", key, str(result[key]))
1029 |         writer.write("%s = %s\n" % (key, str(result[key])))
1030 | 
1031 |   if FLAGS.do_predict:
1032 |     predict_examples = processor.get_test_examples(FLAGS.data_dir)
1033 |     num_actual_predict_examples = len(predict_examples)
1034 |     if FLAGS.use_tpu:
1035 |       # TPU requires a fixed batch size for all batches, therefore the number
1036 |       # of examples must be a multiple of the batch size, or else examples
1037 |       # will get dropped. So we pad with fake examples which are ignored
1038 |       # later on.
1039 |       while len(predict_examples) % FLAGS.predict_batch_size != 0:
1040 |         predict_examples.append(PaddingInputExample())
1041 | 
1042 |     predict_file = os.path.join(FLAGS.output_dir, "predict.tf_record")
1043 |     file_based_convert_examples_to_features(predict_examples, label_list,
1044 |                                             FLAGS.max_seq_length, tokenizer,
1045 |                                             predict_file)
1046 | 
1047 |     tf.logging.info("***** Running prediction*****")
1048 |     tf.logging.info("  Num examples = %d (%d actual, %d padding)",
1049 |                     len(predict_examples), num_actual_predict_examples,
1050 |                     len(predict_examples) - num_actual_predict_examples)
1051 |     tf.logging.info("  Batch size = %d", FLAGS.predict_batch_size)
1052 | 
1053 |     predict_drop_remainder = True if FLAGS.use_tpu else False
1054 |     predict_input_fn = file_based_input_fn_builder(
1055 |         input_file=predict_file,
1056 |         seq_length=FLAGS.max_seq_length,
1057 |         is_training=False,
1058 |         drop_remainder=predict_drop_remainder)
1059 | 
1060 |     result = estimator.predict(input_fn=predict_input_fn)
1061 | 
1062 |     output_predict_file = os.path.join(FLAGS.output_dir, "test_results.tsv")
1063 |     with tf.gfile.GFile(output_predict_file, "w") as writer:
1064 |       num_written_lines = 0
1065 |       tf.logging.info("***** Predict results *****")
1066 |       for (i, prediction) in enumerate(result):
1067 |         probabilities = prediction["probabilities"]
1068 |         if i >= num_actual_predict_examples:
1069 |           break
1070 |         output_line = "\t".join(
1071 |             str(class_probability)
1072 |             for class_probability in probabilities) + "\n"
1073 |         writer.write(output_line)
1074 |         num_written_lines += 1
1075 |     assert num_written_lines == num_actual_predict_examples
1076 | 
1077 | 
1078 | 
1079 | 
1080 | 
1081 | if __name__ == "__main__":
1082 |   flags.mark_flag_as_required("data_dir")
1083 |   flags.mark_flag_as_required("task_name")
1084 |   flags.mark_flag_as_required("vocab_file")
1085 |   flags.mark_flag_as_required("bert_config_file")
1086 |   flags.mark_flag_as_required("output_dir")
1087 |   tf.app.run()
1088 | 


--------------------------------------------------------------------------------
/run_classifier_with_tfhub.py:
--------------------------------------------------------------------------------
  1 | # coding=utf-8
  2 | # Copyright 2018 The Google AI Language Team Authors.
  3 | #
  4 | # Licensed under the Apache License, Version 2.0 (the "License");
  5 | # you may not use this file except in compliance with the License.
  6 | # You may obtain a copy of the License at
  7 | #
  8 | #     http://www.apache.org/licenses/LICENSE-2.0
  9 | #
 10 | # Unless required by applicable law or agreed to in writing, software
 11 | # distributed under the License is distributed on an "AS IS" BASIS,
 12 | # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
 13 | # See the License for the specific language governing permissions and
 14 | # limitations under the License.
 15 | """BERT finetuning runner with TF-Hub."""
 16 | 
 17 | from __future__ import absolute_import
 18 | from __future__ import division
 19 | from __future__ import print_function
 20 | 
 21 | import os
 22 | import optimization
 23 | import run_classifier
 24 | import tokenization
 25 | import tensorflow as tf
 26 | import tensorflow_hub as hub
 27 | 
 28 | flags = tf.flags
 29 | 
 30 | FLAGS = flags.FLAGS
 31 | 
 32 | flags.DEFINE_string(
 33 |     "bert_hub_module_handle", None,
 34 |     "Handle for the BERT TF-Hub module.")
 35 | 
 36 | 
 37 | def create_model(is_training, input_ids, input_mask, segment_ids, labels,
 38 |                  num_labels):
 39 |   """Creates a classification model."""
 40 |   tags = set()
 41 |   if is_training:
 42 |     tags.add("train")
 43 |   bert_module = hub.Module(
 44 |       FLAGS.bert_hub_module_handle,
 45 |       tags=tags,
 46 |       trainable=True)
 47 |   bert_inputs = dict(
 48 |       input_ids=input_ids,
 49 |       input_mask=input_mask,
 50 |       segment_ids=segment_ids)
 51 |   bert_outputs = bert_module(
 52 |       inputs=bert_inputs,
 53 |       signature="tokens",
 54 |       as_dict=True)
 55 | 
 56 |   # In the demo, we are doing a simple classification task on the entire
 57 |   # segment.
 58 |   #
 59 |   # If you want to use the token-level output, use
 60 |   # bert_outputs["sequence_output"] instead.
 61 |   output_layer = bert_outputs["pooled_output"]
 62 | 
 63 |   hidden_size = output_layer.shape[-1].value
 64 | 
 65 |   output_weights = tf.get_variable(
 66 |       "output_weights", [num_labels, hidden_size],
 67 |       initializer=tf.truncated_normal_initializer(stddev=0.02))
 68 | 
 69 |   output_bias = tf.get_variable(
 70 |       "output_bias", [num_labels], initializer=tf.zeros_initializer())
 71 | 
 72 |   with tf.variable_scope("loss"):
 73 |     if is_training:
 74 |       # I.e., 0.1 dropout
 75 |       output_layer = tf.nn.dropout(output_layer, keep_prob=0.9)
 76 | 
 77 |     logits = tf.matmul(output_layer, output_weights, transpose_b=True)
 78 |     logits = tf.nn.bias_add(logits, output_bias)
 79 |     log_probs = tf.nn.log_softmax(logits, axis=-1)
 80 | 
 81 |     one_hot_labels = tf.one_hot(labels, depth=num_labels, dtype=tf.float32)
 82 | 
 83 |     per_example_loss = -tf.reduce_sum(one_hot_labels * log_probs, axis=-1)
 84 |     loss = tf.reduce_mean(per_example_loss)
 85 | 
 86 |     return (loss, per_example_loss, logits)
 87 | 
 88 | 
 89 | def model_fn_builder(num_labels, learning_rate, num_train_steps,
 90 |                      num_warmup_steps, use_tpu):
 91 |   """Returns `model_fn` closure for TPUEstimator."""
 92 | 
 93 |   def model_fn(features, labels, mode, params):  # pylint: disable=unused-argument
 94 |     """The `model_fn` for TPUEstimator."""
 95 | 
 96 |     tf.logging.info("*** Features ***")
 97 |     for name in sorted(features.keys()):
 98 |       tf.logging.info("  name = %s, shape = %s" % (name, features[name].shape))
 99 | 
100 |     input_ids = features["input_ids"]
101 |     input_mask = features["input_mask"]
102 |     segment_ids = features["segment_ids"]
103 |     label_ids = features["label_ids"]
104 | 
105 |     is_training = (mode == tf.estimator.ModeKeys.TRAIN)
106 | 
107 |     (total_loss, per_example_loss, logits) = create_model(
108 |         is_training, input_ids, input_mask, segment_ids, label_ids, num_labels)
109 | 
110 |     output_spec = None
111 |     if mode == tf.estimator.ModeKeys.TRAIN:
112 |       train_op = optimization.create_optimizer(
113 |           total_loss, learning_rate, num_train_steps, num_warmup_steps, use_tpu)
114 | 
115 |       output_spec = tf.contrib.tpu.TPUEstimatorSpec(
116 |           mode=mode,
117 |           loss=total_loss,
118 |           train_op=train_op)
119 |     elif mode == tf.estimator.ModeKeys.EVAL:
120 | 
121 |       def metric_fn(per_example_loss, label_ids, logits):
122 |         predictions = tf.argmax(logits, axis=-1, output_type=tf.int32)
123 |         accuracy = tf.metrics.accuracy(label_ids, predictions)
124 |         loss = tf.metrics.mean(per_example_loss)
125 |         return {
126 |             "eval_accuracy": accuracy,
127 |             "eval_loss": loss,
128 |         }
129 | 
130 |       eval_metrics = (metric_fn, [per_example_loss, label_ids, logits])
131 |       output_spec = tf.contrib.tpu.TPUEstimatorSpec(
132 |           mode=mode,
133 |           loss=total_loss,
134 |           eval_metrics=eval_metrics)
135 |     else:
136 |       raise ValueError("Only TRAIN and EVAL modes are supported: %s" % (mode))
137 | 
138 |     return output_spec
139 | 
140 |   return model_fn
141 | 
142 | 
143 | def create_tokenizer_from_hub_module():
144 |   """Get the vocab file and casing info from the Hub module."""
145 |   with tf.Graph().as_default():
146 |     bert_module = hub.Module(FLAGS.bert_hub_module_handle)
147 |     tokenization_info = bert_module(signature="tokenization_info", as_dict=True)
148 |     with tf.Session() as sess:
149 |       vocab_file, do_lower_case = sess.run([tokenization_info["vocab_file"],
150 |                                             tokenization_info["do_lower_case"]])
151 |   return tokenization.FullTokenizer(
152 |       vocab_file=vocab_file, do_lower_case=do_lower_case)
153 | 
154 | 
155 | def main(_):
156 |   tf.logging.set_verbosity(tf.logging.INFO)
157 | 
158 |   processors = {
159 |       "cola": run_classifier.ColaProcessor,
160 |       "mnli": run_classifier.MnliProcessor,
161 |       "mrpc": run_classifier.MrpcProcessor,
162 |   }
163 | 
164 |   if not FLAGS.do_train and not FLAGS.do_eval:
165 |     raise ValueError("At least one of `do_train` or `do_eval` must be True.")
166 | 
167 |   tf.gfile.MakeDirs(FLAGS.output_dir)
168 | 
169 |   task_name = FLAGS.task_name.lower()
170 | 
171 |   if task_name not in processors:
172 |     raise ValueError("Task not found: %s" % (task_name))
173 | 
174 |   processor = processors[task_name]()
175 | 
176 |   label_list = processor.get_labels()
177 | 
178 |   tokenizer = create_tokenizer_from_hub_module()
179 | 
180 |   tpu_cluster_resolver = None
181 |   if FLAGS.use_tpu and FLAGS.tpu_name:
182 |     tpu_cluster_resolver = tf.contrib.cluster_resolver.TPUClusterResolver(
183 |         FLAGS.tpu_name, zone=FLAGS.tpu_zone, project=FLAGS.gcp_project)
184 | 
185 |   is_per_host = tf.contrib.tpu.InputPipelineConfig.PER_HOST_V2
186 |   run_config = tf.contrib.tpu.RunConfig(
187 |       cluster=tpu_cluster_resolver,
188 |       master=FLAGS.master,
189 |       model_dir=FLAGS.output_dir,
190 |       save_checkpoints_steps=FLAGS.save_checkpoints_steps,
191 |       tpu_config=tf.contrib.tpu.TPUConfig(
192 |           iterations_per_loop=FLAGS.iterations_per_loop,
193 |           num_shards=FLAGS.num_tpu_cores,
194 |           per_host_input_for_training=is_per_host))
195 | 
196 |   train_examples = None
197 |   num_train_steps = None
198 |   num_warmup_steps = None
199 |   if FLAGS.do_train:
200 |     train_examples = processor.get_train_examples(FLAGS.data_dir)
201 |     num_train_steps = int(
202 |         len(train_examples) / FLAGS.train_batch_size * FLAGS.num_train_epochs)
203 |     num_warmup_steps = int(num_train_steps * FLAGS.warmup_proportion)
204 | 
205 |   model_fn = model_fn_builder(
206 |       num_labels=len(label_list),
207 |       learning_rate=FLAGS.learning_rate,
208 |       num_train_steps=num_train_steps,
209 |       num_warmup_steps=num_warmup_steps,
210 |       use_tpu=FLAGS.use_tpu)
211 | 
212 |   # If TPU is not available, this will fall back to normal Estimator on CPU
213 |   # or GPU.
214 |   estimator = tf.contrib.tpu.TPUEstimator(
215 |       use_tpu=FLAGS.use_tpu,
216 |       model_fn=model_fn,
217 |       config=run_config,
218 |       train_batch_size=FLAGS.train_batch_size,
219 |       eval_batch_size=FLAGS.eval_batch_size)
220 | 
221 |   if FLAGS.do_train:
222 |     train_features = run_classifier.convert_examples_to_features(
223 |         train_examples, label_list, FLAGS.max_seq_length, tokenizer)
224 |     tf.logging.info("***** Running training *****")
225 |     tf.logging.info("  Num examples = %d", len(train_examples))
226 |     tf.logging.info("  Batch size = %d", FLAGS.train_batch_size)
227 |     tf.logging.info("  Num steps = %d", num_train_steps)
228 |     train_input_fn = run_classifier.input_fn_builder(
229 |         features=train_features,
230 |         seq_length=FLAGS.max_seq_length,
231 |         is_training=True,
232 |         drop_remainder=True)
233 |     estimator.train(input_fn=train_input_fn, max_steps=num_train_steps)
234 | 
235 |   if FLAGS.do_eval:
236 |     eval_examples = processor.get_dev_examples(FLAGS.data_dir)
237 |     eval_features = run_classifier.convert_examples_to_features(
238 |         eval_examples, label_list, FLAGS.max_seq_length, tokenizer)
239 | 
240 |     tf.logging.info("***** Running evaluation *****")
241 |     tf.logging.info("  Num examples = %d", len(eval_examples))
242 |     tf.logging.info("  Batch size = %d", FLAGS.eval_batch_size)
243 | 
244 |     # This tells the estimator to run through the entire set.
245 |     eval_steps = None
246 |     # However, if running eval on the TPU, you will need to specify the
247 |     # number of steps.
248 |     if FLAGS.use_tpu:
249 |       # Eval will be slightly WRONG on the TPU because it will truncate
250 |       # the last batch.
251 |       eval_steps = int(len(eval_examples) / FLAGS.eval_batch_size)
252 | 
253 |     eval_drop_remainder = True if FLAGS.use_tpu else False
254 |     eval_input_fn = run_classifier.input_fn_builder(
255 |         features=eval_features,
256 |         seq_length=FLAGS.max_seq_length,
257 |         is_training=False,
258 |         drop_remainder=eval_drop_remainder)
259 | 
260 |     result = estimator.evaluate(input_fn=eval_input_fn, steps=eval_steps)
261 | 
262 |     output_eval_file = os.path.join(FLAGS.output_dir, "eval_results.txt")
263 |     with tf.gfile.GFile(output_eval_file, "w") as writer:
264 |       tf.logging.info("***** Eval results *****")
265 |       for key in sorted(result.keys()):
266 |         tf.logging.info("  %s = %s", key, str(result[key]))
267 |         writer.write("%s = %s\n" % (key, str(result[key])))
268 | 
269 | 
270 | if __name__ == "__main__":
271 |   flags.mark_flag_as_required("data_dir")
272 |   flags.mark_flag_as_required("task_name")
273 |   flags.mark_flag_as_required("bert_hub_module_handle")
274 |   flags.mark_flag_as_required("output_dir")
275 |   tf.app.run()
276 | 


--------------------------------------------------------------------------------
/run_pretraining.py:
--------------------------------------------------------------------------------
  1 | # coding=utf-8
  2 | # Copyright 2018 The Google AI Language Team Authors.
  3 | #
  4 | # Licensed under the Apache License, Version 2.0 (the "License");
  5 | # you may not use this file except in compliance with the License.
  6 | # You may obtain a copy of the License at
  7 | #
  8 | #     http://www.apache.org/licenses/LICENSE-2.0
  9 | #
 10 | # Unless required by applicable law or agreed to in writing, software
 11 | # distributed under the License is distributed on an "AS IS" BASIS,
 12 | # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
 13 | # See the License for the specific language governing permissions and
 14 | # limitations under the License.
 15 | """Run masked LM/next sentence masked_lm pre-training for BERT."""
 16 | 
 17 | from __future__ import absolute_import
 18 | from __future__ import division
 19 | from __future__ import print_function
 20 | 
 21 | import os
 22 | import modeling
 23 | import optimization
 24 | import tensorflow as tf
 25 | 
 26 | flags = tf.flags
 27 | 
 28 | FLAGS = flags.FLAGS
 29 | 
 30 | ## Required parameters
 31 | flags.DEFINE_string(
 32 |     "bert_config_file", None,
 33 |     "The config json file corresponding to the pre-trained BERT model. "
 34 |     "This specifies the model architecture.")
 35 | 
 36 | flags.DEFINE_string(
 37 |     "input_file", None,
 38 |     "Input TF example files (can be a glob or comma separated).")
 39 | 
 40 | flags.DEFINE_string(
 41 |     "output_dir", None,
 42 |     "The output directory where the model checkpoints will be written.")
 43 | 
 44 | ## Other parameters
 45 | flags.DEFINE_string(
 46 |     "init_checkpoint", None,
 47 |     "Initial checkpoint (usually from a pre-trained BERT model).")
 48 | 
 49 | flags.DEFINE_integer(
 50 |     "max_seq_length", 128,
 51 |     "The maximum total input sequence length after WordPiece tokenization. "
 52 |     "Sequences longer than this will be truncated, and sequences shorter "
 53 |     "than this will be padded. Must match data generation.")
 54 | 
 55 | flags.DEFINE_integer(
 56 |     "max_predictions_per_seq", 20,
 57 |     "Maximum number of masked LM predictions per sequence. "
 58 |     "Must match data generation.")
 59 | 
 60 | flags.DEFINE_bool("do_train", False, "Whether to run training.")
 61 | 
 62 | flags.DEFINE_bool("do_eval", False, "Whether to run eval on the dev set.")
 63 | 
 64 | flags.DEFINE_integer("train_batch_size", 32, "Total batch size for training.")
 65 | 
 66 | flags.DEFINE_integer("eval_batch_size", 8, "Total batch size for eval.")
 67 | 
 68 | flags.DEFINE_float("learning_rate", 5e-5, "The initial learning rate for Adam.")
 69 | 
 70 | flags.DEFINE_integer("num_train_steps", 100000, "Number of training steps.")
 71 | 
 72 | flags.DEFINE_integer("num_warmup_steps", 10000, "Number of warmup steps.")
 73 | 
 74 | flags.DEFINE_integer("save_checkpoints_steps", 1000,
 75 |                      "How often to save the model checkpoint.")
 76 | 
 77 | flags.DEFINE_integer("iterations_per_loop", 1000,
 78 |                      "How many steps to make in each estimator call.")
 79 | 
 80 | flags.DEFINE_integer("max_eval_steps", 100, "Maximum number of eval steps.")
 81 | 
 82 | flags.DEFINE_bool("use_tpu", False, "Whether to use TPU or GPU/CPU.")
 83 | 
 84 | tf.flags.DEFINE_string(
 85 |     "tpu_name", None,
 86 |     "The Cloud TPU to use for training. This should be either the name "
 87 |     "used when creating the Cloud TPU, or a grpc://ip.address.of.tpu:8470 "
 88 |     "url.")
 89 | 
 90 | tf.flags.DEFINE_string(
 91 |     "tpu_zone", None,
 92 |     "[Optional] GCE zone where the Cloud TPU is located in. If not "
 93 |     "specified, we will attempt to automatically detect the GCE project from "
 94 |     "metadata.")
 95 | 
 96 | tf.flags.DEFINE_string(
 97 |     "gcp_project", None,
 98 |     "[Optional] Project name for the Cloud TPU-enabled project. If not "
 99 |     "specified, we will attempt to automatically detect the GCE project from "
100 |     "metadata.")
101 | 
102 | tf.flags.DEFINE_string("master", None, "[Optional] TensorFlow master URL.")
103 | 
104 | flags.DEFINE_integer(
105 |     "num_tpu_cores", 8,
106 |     "Only used if `use_tpu` is True. Total number of TPU cores to use.")
107 | 
108 | 
109 | def model_fn_builder(bert_config, init_checkpoint, learning_rate,
110 |                      num_train_steps, num_warmup_steps, use_tpu,
111 |                      use_one_hot_embeddings):
112 |   """Returns `model_fn` closure for TPUEstimator."""
113 | 
114 |   def model_fn(features, labels, mode, params):  # pylint: disable=unused-argument
115 |     """The `model_fn` for TPUEstimator."""
116 | 
117 |     tf.logging.info("*** Features ***")
118 |     for name in sorted(features.keys()):
119 |       tf.logging.info("  name = %s, shape = %s" % (name, features[name].shape))
120 | 
121 |     input_ids = features["input_ids"]
122 |     input_mask = features["input_mask"]
123 |     segment_ids = features["segment_ids"]
124 |     masked_lm_positions = features["masked_lm_positions"]
125 |     masked_lm_ids = features["masked_lm_ids"]
126 |     masked_lm_weights = features["masked_lm_weights"]
127 |     next_sentence_labels = features["next_sentence_labels"]
128 | 
129 |     is_training = (mode == tf.estimator.ModeKeys.TRAIN)
130 | 
131 |     model = modeling.BertModel(
132 |         config=bert_config,
133 |         is_training=is_training,
134 |         input_ids=input_ids,
135 |         input_mask=input_mask,
136 |         token_type_ids=segment_ids,
137 |         use_one_hot_embeddings=use_one_hot_embeddings)
138 | 
139 |     (masked_lm_loss,
140 |      masked_lm_example_loss, masked_lm_log_probs) = get_masked_lm_output(
141 |          bert_config, model.get_sequence_output(), model.get_embedding_table(),
142 |          masked_lm_positions, masked_lm_ids, masked_lm_weights)
143 | 
144 |     (next_sentence_loss, next_sentence_example_loss,
145 |      next_sentence_log_probs) = get_next_sentence_output(
146 |          bert_config, model.get_pooled_output(), next_sentence_labels)
147 | 
148 |     total_loss = masked_lm_loss + next_sentence_loss
149 | 
150 |     tvars = tf.trainable_variables()
151 | 
152 |     initialized_variable_names = {}
153 |     scaffold_fn = None
154 |     if init_checkpoint:
155 |       (assignment_map, initialized_variable_names
156 |       ) = modeling.get_assignment_map_from_checkpoint(tvars, init_checkpoint)
157 |       if use_tpu:
158 | 
159 |         def tpu_scaffold():
160 |           tf.train.init_from_checkpoint(init_checkpoint, assignment_map)
161 |           return tf.train.Scaffold()
162 | 
163 |         scaffold_fn = tpu_scaffold
164 |       else:
165 |         tf.train.init_from_checkpoint(init_checkpoint, assignment_map)
166 | 
167 |     tf.logging.info("**** Trainable Variables ****")
168 |     for var in tvars:
169 |       init_string = ""
170 |       if var.name in initialized_variable_names:
171 |         init_string = ", *INIT_FROM_CKPT*"
172 |       tf.logging.info("  name = %s, shape = %s%s", var.name, var.shape,
173 |                       init_string)
174 | 
175 |     output_spec = None
176 |     if mode == tf.estimator.ModeKeys.TRAIN:
177 |       train_op = optimization.create_optimizer(
178 |           total_loss, learning_rate, num_train_steps, num_warmup_steps, use_tpu)
179 | 
180 |       output_spec = tf.contrib.tpu.TPUEstimatorSpec(
181 |           mode=mode,
182 |           loss=total_loss,
183 |           train_op=train_op,
184 |           scaffold_fn=scaffold_fn)
185 |     elif mode == tf.estimator.ModeKeys.EVAL:
186 | 
187 |       def metric_fn(masked_lm_example_loss, masked_lm_log_probs, masked_lm_ids,
188 |                     masked_lm_weights, next_sentence_example_loss,
189 |                     next_sentence_log_probs, next_sentence_labels):
190 |         """Computes the loss and accuracy of the model."""
191 |         masked_lm_log_probs = tf.reshape(masked_lm_log_probs,
192 |                                          [-1, masked_lm_log_probs.shape[-1]])
193 |         masked_lm_predictions = tf.argmax(
194 |             masked_lm_log_probs, axis=-1, output_type=tf.int32)
195 |         masked_lm_example_loss = tf.reshape(masked_lm_example_loss, [-1])
196 |         masked_lm_ids = tf.reshape(masked_lm_ids, [-1])
197 |         masked_lm_weights = tf.reshape(masked_lm_weights, [-1])
198 |         masked_lm_accuracy = tf.metrics.accuracy(
199 |             labels=masked_lm_ids,
200 |             predictions=masked_lm_predictions,
201 |             weights=masked_lm_weights)
202 |         masked_lm_mean_loss = tf.metrics.mean(
203 |             values=masked_lm_example_loss, weights=masked_lm_weights)
204 | 
205 |         next_sentence_log_probs = tf.reshape(
206 |             next_sentence_log_probs, [-1, next_sentence_log_probs.shape[-1]])
207 |         next_sentence_predictions = tf.argmax(
208 |             next_sentence_log_probs, axis=-1, output_type=tf.int32)
209 |         next_sentence_labels = tf.reshape(next_sentence_labels, [-1])
210 |         next_sentence_accuracy = tf.metrics.accuracy(
211 |             labels=next_sentence_labels, predictions=next_sentence_predictions)
212 |         next_sentence_mean_loss = tf.metrics.mean(
213 |             values=next_sentence_example_loss)
214 | 
215 |         return {
216 |             "masked_lm_accuracy": masked_lm_accuracy,
217 |             "masked_lm_loss": masked_lm_mean_loss,
218 |             "next_sentence_accuracy": next_sentence_accuracy,
219 |             "next_sentence_loss": next_sentence_mean_loss,
220 |         }
221 | 
222 |       eval_metrics = (metric_fn, [
223 |           masked_lm_example_loss, masked_lm_log_probs, masked_lm_ids,
224 |           masked_lm_weights, next_sentence_example_loss,
225 |           next_sentence_log_probs, next_sentence_labels
226 |       ])
227 |       output_spec = tf.contrib.tpu.TPUEstimatorSpec(
228 |           mode=mode,
229 |           loss=total_loss,
230 |           eval_metrics=eval_metrics,
231 |           scaffold_fn=scaffold_fn)
232 |     else:
233 |       raise ValueError("Only TRAIN and EVAL modes are supported: %s" % (mode))
234 | 
235 |     return output_spec
236 | 
237 |   return model_fn
238 | 
239 | 
240 | def get_masked_lm_output(bert_config, input_tensor, output_weights, positions,
241 |                          label_ids, label_weights):
242 |   """Get loss and log probs for the masked LM."""
243 |   input_tensor = gather_indexes(input_tensor, positions)
244 | 
245 |   with tf.variable_scope("cls/predictions"):
246 |     # We apply one more non-linear transformation before the output layer.
247 |     # This matrix is not used after pre-training.
248 |     with tf.variable_scope("transform"):
249 |       input_tensor = tf.layers.dense(
250 |           input_tensor,
251 |           units=bert_config.hidden_size,
252 |           activation=modeling.get_activation(bert_config.hidden_act),
253 |           kernel_initializer=modeling.create_initializer(
254 |               bert_config.initializer_range))
255 |       input_tensor = modeling.layer_norm(input_tensor)
256 | 
257 |     # The output weights are the same as the input embeddings, but there is
258 |     # an output-only bias for each token.
259 |     output_bias = tf.get_variable(
260 |         "output_bias",
261 |         shape=[bert_config.vocab_size],
262 |         initializer=tf.zeros_initializer())
263 |     logits = tf.matmul(input_tensor, output_weights, transpose_b=True)
264 |     logits = tf.nn.bias_add(logits, output_bias)
265 |     log_probs = tf.nn.log_softmax(logits, axis=-1)
266 | 
267 |     label_ids = tf.reshape(label_ids, [-1])
268 |     label_weights = tf.reshape(label_weights, [-1])
269 | 
270 |     one_hot_labels = tf.one_hot(
271 |         label_ids, depth=bert_config.vocab_size, dtype=tf.float32)
272 | 
273 |     # The `positions` tensor might be zero-padded (if the sequence is too
274 |     # short to have the maximum number of predictions). The `label_weights`
275 |     # tensor has a value of 1.0 for every real prediction and 0.0 for the
276 |     # padding predictions.
277 |     per_example_loss = -tf.reduce_sum(log_probs * one_hot_labels, axis=[-1])
278 |     numerator = tf.reduce_sum(label_weights * per_example_loss)
279 |     denominator = tf.reduce_sum(label_weights) + 1e-5
280 |     loss = numerator / denominator
281 | 
282 |   return (loss, per_example_loss, log_probs)
283 | 
284 | 
285 | def get_next_sentence_output(bert_config, input_tensor, labels):
286 |   """Get loss and log probs for the next sentence prediction."""
287 | 
288 |   # Simple binary classification. Note that 0 is "next sentence" and 1 is
289 |   # "random sentence". This weight matrix is not used after pre-training.
290 |   with tf.variable_scope("cls/seq_relationship"):
291 |     output_weights = tf.get_variable(
292 |         "output_weights",
293 |         shape=[2, bert_config.hidden_size],
294 |         initializer=modeling.create_initializer(bert_config.initializer_range))
295 |     output_bias = tf.get_variable(
296 |         "output_bias", shape=[2], initializer=tf.zeros_initializer())
297 | 
298 |     logits = tf.matmul(input_tensor, output_weights, transpose_b=True)
299 |     logits = tf.nn.bias_add(logits, output_bias)
300 |     log_probs = tf.nn.log_softmax(logits, axis=-1)
301 |     labels = tf.reshape(labels, [-1])
302 |     one_hot_labels = tf.one_hot(labels, depth=2, dtype=tf.float32)
303 |     per_example_loss = -tf.reduce_sum(one_hot_labels * log_probs, axis=-1)
304 |     loss = tf.reduce_mean(per_example_loss)
305 |     return (loss, per_example_loss, log_probs)
306 | 
307 | 
308 | def gather_indexes(sequence_tensor, positions):
309 |   """Gathers the vectors at the specific positions over a minibatch."""
310 |   sequence_shape = modeling.get_shape_list(sequence_tensor, expected_rank=3)
311 |   batch_size = sequence_shape[0]
312 |   seq_length = sequence_shape[1]
313 |   width = sequence_shape[2]
314 | 
315 |   flat_offsets = tf.reshape(
316 |       tf.range(0, batch_size, dtype=tf.int32) * seq_length, [-1, 1])
317 |   flat_positions = tf.reshape(positions + flat_offsets, [-1])
318 |   flat_sequence_tensor = tf.reshape(sequence_tensor,
319 |                                     [batch_size * seq_length, width])
320 |   output_tensor = tf.gather(flat_sequence_tensor, flat_positions)
321 |   return output_tensor
322 | 
323 | 
324 | def input_fn_builder(input_files,
325 |                      max_seq_length,
326 |                      max_predictions_per_seq,
327 |                      is_training,
328 |                      num_cpu_threads=4):
329 |   """Creates an `input_fn` closure to be passed to TPUEstimator."""
330 | 
331 |   def input_fn(params):
332 |     """The actual input function."""
333 |     batch_size = params["batch_size"]
334 | 
335 |     name_to_features = {
336 |         "input_ids":
337 |             tf.FixedLenFeature([max_seq_length], tf.int64),
338 |         "input_mask":
339 |             tf.FixedLenFeature([max_seq_length], tf.int64),
340 |         "segment_ids":
341 |             tf.FixedLenFeature([max_seq_length], tf.int64),
342 |         "masked_lm_positions":
343 |             tf.FixedLenFeature([max_predictions_per_seq], tf.int64),
344 |         "masked_lm_ids":
345 |             tf.FixedLenFeature([max_predictions_per_seq], tf.int64),
346 |         "masked_lm_weights":
347 |             tf.FixedLenFeature([max_predictions_per_seq], tf.float32),
348 |         "next_sentence_labels":
349 |             tf.FixedLenFeature([1], tf.int64),
350 |     }
351 | 
352 |     # For training, we want a lot of parallel reading and shuffling.
353 |     # For eval, we want no shuffling and parallel reading doesn't matter.
354 |     if is_training:
355 |       d = tf.data.Dataset.from_tensor_slices(tf.constant(input_files))
356 |       d = d.repeat()
357 |       d = d.shuffle(buffer_size=len(input_files))
358 | 
359 |       # `cycle_length` is the number of parallel files that get read.
360 |       cycle_length = min(num_cpu_threads, len(input_files))
361 | 
362 |       # `sloppy` mode means that the interleaving is not exact. This adds
363 |       # even more randomness to the training pipeline.
364 |       d = d.apply(
365 |           tf.contrib.data.parallel_interleave(
366 |               tf.data.TFRecordDataset,
367 |               sloppy=is_training,
368 |               cycle_length=cycle_length))
369 |       d = d.shuffle(buffer_size=100)
370 |     else:
371 |       d = tf.data.TFRecordDataset(input_files)
372 |       # Since we evaluate for a fixed number of steps we don't want to encounter
373 |       # out-of-range exceptions.
374 |       d = d.repeat()
375 | 
376 |     # We must `drop_remainder` on training because the TPU requires fixed
377 |     # size dimensions. For eval, we assume we are evaluating on the CPU or GPU
378 |     # and we *don't* want to drop the remainder, otherwise we wont cover
379 |     # every sample.
380 |     d = d.apply(
381 |         tf.contrib.data.map_and_batch(
382 |             lambda record: _decode_record(record, name_to_features),
383 |             batch_size=batch_size,
384 |             num_parallel_batches=num_cpu_threads,
385 |             drop_remainder=True))
386 |     return d
387 | 
388 |   return input_fn
389 | 
390 | 
391 | def _decode_record(record, name_to_features):
392 |   """Decodes a record to a TensorFlow example."""
393 |   example = tf.parse_single_example(record, name_to_features)
394 | 
395 |   # tf.Example only supports tf.int64, but the TPU only supports tf.int32.
396 |   # So cast all int64 to int32.
397 |   for name in list(example.keys()):
398 |     t = example[name]
399 |     if t.dtype == tf.int64:
400 |       t = tf.to_int32(t)
401 |     example[name] = t
402 | 
403 |   return example
404 | 
405 | 
406 | def main(_):
407 |   tf.logging.set_verbosity(tf.logging.INFO)
408 | 
409 |   if not FLAGS.do_train and not FLAGS.do_eval:
410 |     raise ValueError("At least one of `do_train` or `do_eval` must be True.")
411 | 
412 |   bert_config = modeling.BertConfig.from_json_file(FLAGS.bert_config_file)
413 | 
414 |   tf.gfile.MakeDirs(FLAGS.output_dir)
415 | 
416 |   input_files = []
417 |   for input_pattern in FLAGS.input_file.split(","):
418 |     input_files.extend(tf.gfile.Glob(input_pattern))
419 | 
420 |   tf.logging.info("*** Input Files ***")
421 |   for input_file in input_files:
422 |     tf.logging.info("  %s" % input_file)
423 | 
424 |   tpu_cluster_resolver = None
425 |   if FLAGS.use_tpu and FLAGS.tpu_name:
426 |     tpu_cluster_resolver = tf.contrib.cluster_resolver.TPUClusterResolver(
427 |         FLAGS.tpu_name, zone=FLAGS.tpu_zone, project=FLAGS.gcp_project)
428 | 
429 |   is_per_host = tf.contrib.tpu.InputPipelineConfig.PER_HOST_V2
430 |   run_config = tf.contrib.tpu.RunConfig(
431 |       cluster=tpu_cluster_resolver,
432 |       master=FLAGS.master,
433 |       model_dir=FLAGS.output_dir,
434 |       save_checkpoints_steps=FLAGS.save_checkpoints_steps,
435 |       tpu_config=tf.contrib.tpu.TPUConfig(
436 |           iterations_per_loop=FLAGS.iterations_per_loop,
437 |           num_shards=FLAGS.num_tpu_cores,
438 |           per_host_input_for_training=is_per_host))
439 | 
440 |   model_fn = model_fn_builder(
441 |       bert_config=bert_config,
442 |       init_checkpoint=FLAGS.init_checkpoint,
443 |       learning_rate=FLAGS.learning_rate,
444 |       num_train_steps=FLAGS.num_train_steps,
445 |       num_warmup_steps=FLAGS.num_warmup_steps,
446 |       use_tpu=FLAGS.use_tpu,
447 |       use_one_hot_embeddings=FLAGS.use_tpu)
448 | 
449 |   # If TPU is not available, this will fall back to normal Estimator on CPU
450 |   # or GPU.
451 |   estimator = tf.contrib.tpu.TPUEstimator(
452 |       use_tpu=FLAGS.use_tpu,
453 |       model_fn=model_fn,
454 |       config=run_config,
455 |       train_batch_size=FLAGS.train_batch_size,
456 |       eval_batch_size=FLAGS.eval_batch_size)
457 | 
458 |   if FLAGS.do_train:
459 |     tf.logging.info("***** Running training *****")
460 |     tf.logging.info("  Batch size = %d", FLAGS.train_batch_size)
461 |     train_input_fn = input_fn_builder(
462 |         input_files=input_files,
463 |         max_seq_length=FLAGS.max_seq_length,
464 |         max_predictions_per_seq=FLAGS.max_predictions_per_seq,
465 |         is_training=True)
466 |     estimator.train(input_fn=train_input_fn, max_steps=FLAGS.num_train_steps)
467 | 
468 |   if FLAGS.do_eval:
469 |     tf.logging.info("***** Running evaluation *****")
470 |     tf.logging.info("  Batch size = %d", FLAGS.eval_batch_size)
471 | 
472 |     eval_input_fn = input_fn_builder(
473 |         input_files=input_files,
474 |         max_seq_length=FLAGS.max_seq_length,
475 |         max_predictions_per_seq=FLAGS.max_predictions_per_seq,
476 |         is_training=False)
477 | 
478 |     result = estimator.evaluate(
479 |         input_fn=eval_input_fn, steps=FLAGS.max_eval_steps)
480 | 
481 |     output_eval_file = os.path.join(FLAGS.output_dir, "eval_results.txt")
482 |     with tf.gfile.GFile(output_eval_file, "w") as writer:
483 |       tf.logging.info("***** Eval results *****")
484 |       for key in sorted(result.keys()):
485 |         tf.logging.info("  %s = %s", key, str(result[key]))
486 |         writer.write("%s = %s\n" % (key, str(result[key])))
487 | 
488 | 
489 | if __name__ == "__main__":
490 |   flags.mark_flag_as_required("input_file")
491 |   flags.mark_flag_as_required("bert_config_file")
492 |   flags.mark_flag_as_required("output_dir")
493 |   tf.app.run()
494 | 


--------------------------------------------------------------------------------
/tokenization.py:
--------------------------------------------------------------------------------
  1 | # coding=utf-8
  2 | # Copyright 2018 The Google AI Language Team Authors.
  3 | #
  4 | # Licensed under the Apache License, Version 2.0 (the "License");
  5 | # you may not use this file except in compliance with the License.
  6 | # You may obtain a copy of the License at
  7 | #
  8 | #     http://www.apache.org/licenses/LICENSE-2.0
  9 | #
 10 | # Unless required by applicable law or agreed to in writing, software
 11 | # distributed under the License is distributed on an "AS IS" BASIS,
 12 | # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
 13 | # See the License for the specific language governing permissions and
 14 | # limitations under the License.
 15 | """Tokenization classes."""
 16 | 
 17 | from __future__ import absolute_import
 18 | from __future__ import division
 19 | from __future__ import print_function
 20 | 
 21 | import collections
 22 | import re
 23 | import unicodedata
 24 | import six
 25 | import tensorflow as tf
 26 | 
 27 | 
 28 | def validate_case_matches_checkpoint(do_lower_case, init_checkpoint):
 29 |   """Checks whether the casing config is consistent with the checkpoint name."""
 30 | 
 31 |   # The casing has to be passed in by the user and there is no explicit check
 32 |   # as to whether it matches the checkpoint. The casing information probably
 33 |   # should have been stored in the bert_config.json file, but it's not, so
 34 |   # we have to heuristically detect it to validate.
 35 | 
 36 |   if not init_checkpoint:
 37 |     return
 38 | 
 39 |   m = re.match("^.*?([A-Za-z0-9_-]+)/bert_model.ckpt", init_checkpoint)
 40 |   if m is None:
 41 |     return
 42 | 
 43 |   model_name = m.group(1)
 44 | 
 45 |   lower_models = [
 46 |       "uncased_L-24_H-1024_A-16", "uncased_L-12_H-768_A-12",
 47 |       "multilingual_L-12_H-768_A-12", "chinese_L-12_H-768_A-12"
 48 |   ]
 49 | 
 50 |   cased_models = [
 51 |       "cased_L-12_H-768_A-12", "cased_L-24_H-1024_A-16",
 52 |       "multi_cased_L-12_H-768_A-12"
 53 |   ]
 54 | 
 55 |   is_bad_config = False
 56 |   if model_name in lower_models and not do_lower_case:
 57 |     is_bad_config = True
 58 |     actual_flag = "False"
 59 |     case_name = "lowercased"
 60 |     opposite_flag = "True"
 61 | 
 62 |   if model_name in cased_models and do_lower_case:
 63 |     is_bad_config = True
 64 |     actual_flag = "True"
 65 |     case_name = "cased"
 66 |     opposite_flag = "False"
 67 | 
 68 |   if is_bad_config:
 69 |     raise ValueError(
 70 |         "You passed in `--do_lower_case=%s` with `--init_checkpoint=%s`. "
 71 |         "However, `%s` seems to be a %s model, so you "
 72 |         "should pass in `--do_lower_case=%s` so that the fine-tuning matches "
 73 |         "how the model was pre-training. If this error is wrong, please "
 74 |         "just comment out this check." % (actual_flag, init_checkpoint,
 75 |                                           model_name, case_name, opposite_flag))
 76 | 
 77 | 
 78 | def convert_to_unicode(text):
 79 |   """Converts `text` to Unicode (if it's not already), assuming utf-8 input."""
 80 |   if six.PY3:
 81 |     if isinstance(text, str):
 82 |       return text
 83 |     elif isinstance(text, bytes):
 84 |       return text.decode("utf-8", "ignore")
 85 |     else:
 86 |       raise ValueError("Unsupported string type: %s" % (type(text)))
 87 |   elif six.PY2:
 88 |     if isinstance(text, str):
 89 |       return text.decode("utf-8", "ignore")
 90 |     elif isinstance(text, unicode):
 91 |       return text
 92 |     else:
 93 |       raise ValueError("Unsupported string type: %s" % (type(text)))
 94 |   else:
 95 |     raise ValueError("Not running on Python2 or Python 3?")
 96 | 
 97 | 
 98 | def printable_text(text):
 99 |   """Returns text encoded in a way suitable for print or `tf.logging`."""
100 | 
101 |   # These functions want `str` for both Python2 and Python3, but in one case
102 |   # it's a Unicode string and in the other it's a byte string.
103 |   if six.PY3:
104 |     if isinstance(text, str):
105 |       return text
106 |     elif isinstance(text, bytes):
107 |       return text.decode("utf-8", "ignore")
108 |     else:
109 |       raise ValueError("Unsupported string type: %s" % (type(text)))
110 |   elif six.PY2:
111 |     if isinstance(text, str):
112 |       return text
113 |     elif isinstance(text, unicode):
114 |       return text.encode("utf-8")
115 |     else:
116 |       raise ValueError("Unsupported string type: %s" % (type(text)))
117 |   else:
118 |     raise ValueError("Not running on Python2 or Python 3?")
119 | 
120 | 
121 | def load_vocab(vocab_file):
122 |   """Loads a vocabulary file into a dictionary."""
123 |   vocab = collections.OrderedDict()
124 |   index = 0
125 |   with tf.gfile.GFile(vocab_file, "r") as reader:
126 |     while True:
127 |       token = convert_to_unicode(reader.readline())
128 |       if not token:
129 |         break
130 |       token = token.strip()
131 |       vocab[token] = index
132 |       index += 1
133 |   return vocab
134 | 
135 | 
136 | def convert_by_vocab(vocab, items):
137 |   """Converts a sequence of [tokens|ids] using the vocab."""
138 |   output = []
139 |   for item in items:
140 |     output.append(vocab[item])
141 |   return output
142 | 
143 | 
144 | def convert_tokens_to_ids(vocab, tokens):
145 |   return convert_by_vocab(vocab, tokens)
146 | 
147 | 
148 | def convert_ids_to_tokens(inv_vocab, ids):
149 |   return convert_by_vocab(inv_vocab, ids)
150 | 
151 | 
152 | def whitespace_tokenize(text):
153 |   """Runs basic whitespace cleaning and splitting on a piece of text."""
154 |   text = text.strip()
155 |   if not text:
156 |     return []
157 |   tokens = text.split()
158 |   return tokens
159 | 
160 | 
161 | class FullTokenizer(object):
162 |   """Runs end-to-end tokenziation."""
163 | 
164 |   def __init__(self, vocab_file, do_lower_case=True):
165 |     self.vocab = load_vocab(vocab_file)
166 |     self.inv_vocab = {v: k for k, v in self.vocab.items()}
167 |     self.basic_tokenizer = BasicTokenizer(do_lower_case=do_lower_case)
168 |     self.wordpiece_tokenizer = WordpieceTokenizer(vocab=self.vocab)
169 | 
170 |   def tokenize(self, text):
171 |     split_tokens = []
172 |     for token in self.basic_tokenizer.tokenize(text):
173 |       for sub_token in self.wordpiece_tokenizer.tokenize(token):
174 |         split_tokens.append(sub_token)
175 | 
176 |     return split_tokens
177 | 
178 |   def convert_tokens_to_ids(self, tokens):
179 |     return convert_by_vocab(self.vocab, tokens)
180 | 
181 |   def convert_ids_to_tokens(self, ids):
182 |     return convert_by_vocab(self.inv_vocab, ids)
183 | 
184 | 
185 | class BasicTokenizer(object):
186 |   """Runs basic tokenization (punctuation splitting, lower casing, etc.)."""
187 | 
188 |   def __init__(self, do_lower_case=True):
189 |     """Constructs a BasicTokenizer.
190 | 
191 |     Args:
192 |       do_lower_case: Whether to lower case the input.
193 |     """
194 |     self.do_lower_case = do_lower_case
195 | 
196 |   def tokenize(self, text):
197 |     """Tokenizes a piece of text."""
198 |     text = convert_to_unicode(text)
199 |     text = self._clean_text(text)
200 | 
201 |     # This was added on November 1st, 2018 for the multilingual and Chinese
202 |     # models. This is also applied to the English models now, but it doesn't
203 |     # matter since the English models were not trained on any Chinese data
204 |     # and generally don't have any Chinese data in them (there are Chinese
205 |     # characters in the vocabulary because Wikipedia does have some Chinese
206 |     # words in the English Wikipedia.).
207 |     text = self._tokenize_chinese_chars(text)
208 | 
209 |     orig_tokens = whitespace_tokenize(text)
210 |     split_tokens = []
211 |     for token in orig_tokens:
212 |       if self.do_lower_case:
213 |         token = token.lower()
214 |         token = self._run_strip_accents(token)
215 |       split_tokens.extend(self._run_split_on_punc(token))
216 | 
217 |     output_tokens = whitespace_tokenize(" ".join(split_tokens))
218 |     return output_tokens
219 | 
220 |   def _run_strip_accents(self, text):
221 |     """Strips accents from a piece of text."""
222 |     text = unicodedata.normalize("NFD", text)
223 |     output = []
224 |     for char in text:
225 |       cat = unicodedata.category(char)
226 |       if cat == "Mn":
227 |         continue
228 |       output.append(char)
229 |     return "".join(output)
230 | 
231 |   def _run_split_on_punc(self, text):
232 |     """Splits punctuation on a piece of text."""
233 |     chars = list(text)
234 |     i = 0
235 |     start_new_word = True
236 |     output = []
237 |     while i < len(chars):
238 |       char = chars[i]
239 |       if _is_punctuation(char):
240 |         output.append([char])
241 |         start_new_word = True
242 |       else:
243 |         if start_new_word:
244 |           output.append([])
245 |         start_new_word = False
246 |         output[-1].append(char)
247 |       i += 1
248 | 
249 |     return ["".join(x) for x in output]
250 | 
251 |   def _tokenize_chinese_chars(self, text):
252 |     """Adds whitespace around any CJK character."""
253 |     output = []
254 |     for char in text:
255 |       cp = ord(char)
256 |       if self._is_chinese_char(cp):
257 |         output.append(" ")
258 |         output.append(char)
259 |         output.append(" ")
260 |       else:
261 |         output.append(char)
262 |     return "".join(output)
263 | 
264 |   def _is_chinese_char(self, cp):
265 |     """Checks whether CP is the codepoint of a CJK character."""
266 |     # This defines a "chinese character" as anything in the CJK Unicode block:
267 |     #   https://en.wikipedia.org/wiki/CJK_Unified_Ideographs_(Unicode_block)
268 |     #
269 |     # Note that the CJK Unicode block is NOT all Japanese and Korean characters,
270 |     # despite its name. The modern Korean Hangul alphabet is a different block,
271 |     # as is Japanese Hiragana and Katakana. Those alphabets are used to write
272 |     # space-separated words, so they are not treated specially and handled
273 |     # like the all of the other languages.
274 |     if ((cp >= 0x4E00 and cp <= 0x9FFF) or  #
275 |         (cp >= 0x3400 and cp <= 0x4DBF) or  #
276 |         (cp >= 0x20000 and cp <= 0x2A6DF) or  #
277 |         (cp >= 0x2A700 and cp <= 0x2B73F) or  #
278 |         (cp >= 0x2B740 and cp <= 0x2B81F) or  #
279 |         (cp >= 0x2B820 and cp <= 0x2CEAF) or
280 |         (cp >= 0xF900 and cp <= 0xFAFF) or  #
281 |         (cp >= 0x2F800 and cp <= 0x2FA1F)):  #
282 |       return True
283 | 
284 |     return False
285 | 
286 |   def _clean_text(self, text):
287 |     """Performs invalid character removal and whitespace cleanup on text."""
288 |     output = []
289 |     for char in text:
290 |       cp = ord(char)
291 |       if cp == 0 or cp == 0xfffd or _is_control(char):
292 |         continue
293 |       if _is_whitespace(char):
294 |         output.append(" ")
295 |       else:
296 |         output.append(char)
297 |     return "".join(output)
298 | 
299 | 
300 | class WordpieceTokenizer(object):
301 |   """Runs WordPiece tokenziation."""
302 | 
303 |   def __init__(self, vocab, unk_token="[UNK]", max_input_chars_per_word=200):
304 |     self.vocab = vocab
305 |     self.unk_token = unk_token
306 |     self.max_input_chars_per_word = max_input_chars_per_word
307 | 
308 |   def tokenize(self, text):
309 |     """Tokenizes a piece of text into its word pieces.
310 | 
311 |     This uses a greedy longest-match-first algorithm to perform tokenization
312 |     using the given vocabulary.
313 | 
314 |     For example:
315 |       input = "unaffable"
316 |       output = ["un", "##aff", "##able"]
317 | 
318 |     Args:
319 |       text: A single token or whitespace separated tokens. This should have
320 |         already been passed through `BasicTokenizer.
321 | 
322 |     Returns:
323 |       A list of wordpiece tokens.
324 |     """
325 | 
326 |     text = convert_to_unicode(text)
327 | 
328 |     output_tokens = []
329 |     for token in whitespace_tokenize(text):
330 |       chars = list(token)
331 |       if len(chars) > self.max_input_chars_per_word:
332 |         output_tokens.append(self.unk_token)
333 |         continue
334 | 
335 |       is_bad = False
336 |       start = 0
337 |       sub_tokens = []
338 |       while start < len(chars):
339 |         end = len(chars)
340 |         cur_substr = None
341 |         while start < end:
342 |           substr = "".join(chars[start:end])
343 |           if start > 0:
344 |             substr = "##" + substr
345 |           if substr in self.vocab:
346 |             cur_substr = substr
347 |             break
348 |           end -= 1
349 |         if cur_substr is None:
350 |           is_bad = True
351 |           break
352 |         sub_tokens.append(cur_substr)
353 |         start = end
354 | 
355 |       if is_bad:
356 |         output_tokens.append(self.unk_token)
357 |       else:
358 |         output_tokens.extend(sub_tokens)
359 |     return output_tokens
360 | 
361 | 
362 | def _is_whitespace(char):
363 |   """Checks whether `chars` is a whitespace character."""
364 |   # \t, \n, and \r are technically contorl characters but we treat them
365 |   # as whitespace since they are generally considered as such.
366 |   if char == " " or char == "\t" or char == "\n" or char == "\r":
367 |     return True
368 |   cat = unicodedata.category(char)
369 |   if cat == "Zs":
370 |     return True
371 |   return False
372 | 
373 | 
374 | def _is_control(char):
375 |   """Checks whether `chars` is a control character."""
376 |   # These are technically control characters but we count them as whitespace
377 |   # characters.
378 |   if char == "\t" or char == "\n" or char == "\r":
379 |     return False
380 |   cat = unicodedata.category(char)
381 |   if cat.startswith("C"):
382 |     return True
383 |   return False
384 | 
385 | 
386 | def _is_punctuation(char):
387 |   """Checks whether `chars` is a punctuation character."""
388 |   cp = ord(char)
389 |   # We treat all non-letter/number ASCII as punctuation.
390 |   # Characters such as "^", "$", and "`" are not in the Unicode
391 |   # Punctuation class but we treat them as punctuation anyways, for
392 |   # consistency.
393 |   if ((cp >= 33 and cp <= 47) or (cp >= 58 and cp <= 64) or
394 |       (cp >= 91 and cp <= 96) or (cp >= 123 and cp <= 126)):
395 |     return True
396 |   cat = unicodedata.category(char)
397 |   if cat.startswith("P"):
398 |     return True
399 |   return False
400 | 


--------------------------------------------------------------------------------