├── .gitignore
├── README.md
├── model_fine_tuning_scripts
│   ├── run_commonsense_qa.py
│   ├── run_commonsense_qa_recognition.py
│   ├── run_mnli.py
│   ├── run_openbookqa.py
│   └── run_openbookqa_recognition.py
└── obqa_create_data_splits.py
/.gitignore:
--------------------------------------------------------------------------------
1 | .idea/
2 |
3 |
--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
1 | # Annotator bias in Natural Language Understanding datasets
2 |
3 | This repository contains the accompanying code for the paper:
4 |
5 | **"Are We Modeling the Task or the Annotator? An Investigation of Annotator Bias in Natural Language Understanding Datasets."** Mor Geva, Yoav Goldberg and Jonathan Berant. *In EMNLP-IJCNLP, 2019*.
6 | [https://arxiv.org/abs/1908.07898](https://arxiv.org/abs/1908.07898)
7 |
8 | In this work, we investigate whether prevalent crowdsourcing practices for building NLU datasets introduce an "annotator bias" in the data that leads to an over-estimation of model performance.
9 |
10 | In this repository, we release our code for:
11 | * Fine-tuning BERT on the three datasets considered in the paper (experiments 1-3)
12 | * Converting the three datasets considered in the paper into the annotator recognition task format (experiment 2; this is done as part of the fine-tuning scripts)
13 | * Generating annotator-based splits (experiment 3)
14 |
15 |
16 | Please note that the data splits are generated randomly; therefore, reproducing the exact results in the paper is not possible, and some variation should be expected (see the standard deviation values reported in the paper).
17 |
18 | Our experiments were conducted in a **python 3.6.8** environment with **tensorflow 1.11.0** and **pandas 0.24.2**.
19 |
20 | ## Generation of annotator-based data splits
21 | We considered three NLU datasets in our experiments:
22 | * [MultiNLI](https://www.nyu.edu/projects/bowman/multinli/) (MNLI)
23 | * [OpenBookQA](http://data.allenai.org/OpenBookQA) (OBQA)
24 | * [CommonsenseQA](https://www.tau-nlp.org/commonsenseqa) (CSQA)
25 |
26 | The script `obqa_create_data_splits.py` contains the code used for generating annotator-based splits of OBQA and their corresponding random splits (a conceptual sketch is given below). The splitting method is exactly the same for MNLI and CSQA; however, we do not provide the scripts for MNLI and CSQA, since annotator information is not publicly available for these datasets (see "note on data availability" below).
27 |
28 | **Note on data availability**: At the time of publishing this work, annotator IDs were publicly available only for OBQA. For MNLI and CSQA, this information was not available as part of the official releases. If you are interested in the annotator information, please contact the creators of these datasets.
29 |
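To make the split semantics concrete, here is a minimal sketch of the two split types. The helper names are hypothetical; the actual `obqa_create_data_splits.py` covers more (e.g., the `--repeat` and `--augment_random_series` options shown below, and writing the split files to disk). It assumes each example is a dict carrying the anonymized annotator ID under `turkIdAnonymized`, the field name used by the CommonsenseQA recognition script in this repository:

```python
import random

def annotator_based_split(examples, annotator_id, augment_ratio=0.0, seed=0):
    """Hold out one annotator's examples as the dev set; train on the rest.

    With augment_ratio > 0, that proportion of the held-out annotator's
    examples is moved back into the training set (an "augmented" split).
    """
    rng = random.Random(seed)
    dev = [ex for ex in examples if ex["turkIdAnonymized"] == annotator_id]
    train = [ex for ex in examples if ex["turkIdAnonymized"] != annotator_id]
    rng.shuffle(dev)
    num_augment = int(augment_ratio * len(dev))
    return train + dev[:num_augment], dev[num_augment:]

def random_split(examples, dev_size, seed=0):
    """Control split: the same train/dev sizes, but assigned at random."""
    shuffled = list(examples)
    random.Random(seed).shuffle(shuffled)
    return shuffled[dev_size:], shuffled[:dev_size]
```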
30 | #### Example commands
31 | 1. Generating three annotator-based data splits for each annotator of the top 5 annotators of OBQA:
32 | ```bash
33 | python obqa_create_data_splits.py \
34 |     --only_annotator \
35 |     --repeat=3
36 | ```
37 | In order to generate both annotator-based data splits and corresponding random splits of the same size, run this command with the value of `only_annotator` set to `False`.
38 |
39 | 2. Generating three 20%-augmented annotator-based data splits for each annotator of the top 5 annotators of OBQA:
40 | ```bash
41 | python obqa_create_data_splits.py \
42 |     --augment_ratio 0.2 \
43 |     --only_annotator \
44 |     --repeat=3
45 | ```
46 |
47 | 3. Generating three series of random splits corresponding to the 0%-, 10%-, 20%-, and 30%-augmented annotator-based splits of the top 5 annotators of OBQA:
48 | ```bash
49 | python obqa_create_data_splits.py \
50 |     --augment_random_series \
51 |     --only_random \
52 |     --repeat=3
53 | ```
54 |
55 |
56 | ## Model fine-tuning
57 | In all our experiments, we used the pretrained BERT-base cased model from [Google's official repository](https://github.com/google-research/bert).
58 | The directory `model_fine_tuning_scripts` contains the scripts used for fine-tuning and evaluating the model on data splits of the three datasets considered in the paper.
59 | The table below indicates the experiments covered by each fine-tuning script:
60 |
61 | | Fine-tuning script | Dataset | Utility of annotator information | Annotator recognition | Generalization across annotators |
62 | |--------|:--------:|:--------:|:-----:|:-----:|
63 | | `run_mnli.py` | MNLI | V | V | V |
64 | | `run_openbookqa.py` | OBQA | V | | V |
65 | | `run_openbookqa_recognition.py` | OBQA | | V | |
66 | | `run_commonsense_qa.py` | CSQA | V | | V |
67 | | `run_commonsense_qa_recognition.py` | CSQA | | V | |
68 |
69 |
70 | The scripts follow the exact same format as the fine-tuning scripts provided in [Google's official repository](https://github.com/google-research/bert#fine-tuning-with-bert), and should be executed from its root path.
71 |
72 | Before running the scripts, make sure you have generated the relevant data split files. The directory containing these files should be passed to the argument `data_dir` when running any of these scripts (see the examples below and the file-naming reference that follows).
73 |
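As a reference for these file names, the snippet below condenses how the `CommonsenseQAProcessor` in `run_commonsense_qa.py` (included in this repository) resolves the `split`, `annotator_idx`, `augment_ratio`, and `take_number` flags into data file names. The OBQA and MNLI scripts accept the same `split` values, so their split files are expected to follow the same naming scheme:

```python
SPLIT_TO_NAME = {
    'annotator': 'annotator_{annotator_idx}{augment_ratio}{take_number}',
    'annotator_multi': 'annotator_multi_{annotator_idx}{augment_ratio}{take_number}',
    'rand': 'rand_{annotator_idx}{augment_ratio}{take_number}',
    'rand_multi': 'rand_multi_{annotator_idx}{augment_ratio}{take_number}',
    'with_annotator': 'with_annotator_id',
    'without_annotator': 'without_annotator_id',
}

# E.g., --split=annotator --annotator_idx=2 --augment_ratio=0.2 resolves to
# train_annotator_2_augment0.2.json and dev_annotator_2_augment0.2.json:
split_name = SPLIT_TO_NAME['annotator'].format(
    annotator_idx=2,
    augment_ratio='_augment0.2',  # empty string when augment_ratio is 0
    take_number='')               # '_take{n}' when take_number n > 1
train_file = 'train_{split_name}.json'.format(split_name=split_name)
dev_file = 'dev_{split_name}.json'.format(split_name=split_name)
```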
74 |
75 | #### Example commands
76 | 1. Fine-tuning the model on the original split of OBQA with annotator IDs concatenated to each example, to test the utility of annotator information:
77 | ```bash
78 | export BERT_BASE_DIR=/path/to/bert/cased_L-12_H-768_A-12
79 | export DATA_DIR=/path/to/obqa/data/splits/dir
80 |
81 | python run_openbookqa.py \
82 |   --do_train=true --do_eval=true \
83 |   --data_dir=$DATA_DIR --vocab_file=$BERT_BASE_DIR/vocab.txt \
84 |   --bert_config_file=$BERT_BASE_DIR/bert_config.json \
85 |   --init_checkpoint=$BERT_BASE_DIR/bert_model.ckpt \
86 |   --max_seq_length=128 --train_batch_size=10 \
87 |   --learning_rate=2e-5 --num_train_epochs=3.0 \
88 |   --output_dir=$BERT_BASE_DIR/openbookqa_with_annotator/ \
89 |   --split=with_annotator
90 | ```
91 | To fine-tune on the same data split without annotator IDs, replace the value of the `split` argument with `without_annotator`.
92 |
93 | 2. Fine-tuning the model to predict annotator IDs of the top 5 annotators of OBQA (annotator recognition):
94 | ```bash
95 | export BERT_BASE_DIR=/path/to/bert/cased_L-12_H-768_A-12
96 | export DATA_DIR=/path/to/obqa/data/splits/dir
97 |
98 | python run_openbookqa_recognition.py \
99 |   --do_train=true --do_eval=true \
100 |   --data_dir=$DATA_DIR --vocab_file=$BERT_BASE_DIR/vocab.txt \
101 |   --bert_config_file=$BERT_BASE_DIR/bert_config.json \
102 |   --init_checkpoint=$BERT_BASE_DIR/bert_model.ckpt \
103 |   --max_seq_length=128 --train_batch_size=10 \
104 |   --learning_rate=2e-5 --num_train_epochs=3.0 \
105 |   --output_dir=$BERT_BASE_DIR/openbookqa_annotator_recognition/ \
106 |   --split=without_annotator
107 | ```
108 |
109 | 3. Fine-tuning the model on the top annotator split of OBQA, to test model generalization from all other annotators:
110 | ```bash
111 | export BERT_BASE_DIR=/path/to/bert/cased_L-12_H-768_A-12
112 | export DATA_DIR=/path/to/obqa/data/splits/dir
113 |
114 | python run_openbookqa.py \
115 |   --do_train=true --do_eval=true \
116 |   --data_dir=$DATA_DIR --vocab_file=$BERT_BASE_DIR/vocab.txt \
117 |   --bert_config_file=$BERT_BASE_DIR/bert_config.json \
118 |   --init_checkpoint=$BERT_BASE_DIR/bert_model.ckpt \
119 |   --max_seq_length=128 --train_batch_size=10 \
120 |   --learning_rate=2e-5 --num_train_epochs=3.0 \
121 |   --output_dir=$BERT_BASE_DIR/openbookqa_annotator_0/ \
122 |   --split=annotator --annotator_idx=0
123 | ```
124 | To fine-tune on the corresponding random split of the top annotator, simply replace the value of the `split` argument with `rand`.
125 | Similarly, to fine-tune on the second multi-annotator split, replace the values of the `split` and `annotator_idx` arguments with `annotator_multi` and `1`, respectively.
126 |
127 | 4. Fine-tuning the model on the 20%-augmented data split of the third top annotator of OBQA:
128 | ```bash
129 | export BERT_BASE_DIR=/path/to/bert/cased_L-12_H-768_A-12
130 | export DATA_DIR=/path/to/obqa/data/splits/dir
131 |
132 | python run_openbookqa.py \
133 |   --do_train=true --do_eval=true \
134 |   --data_dir=$DATA_DIR --vocab_file=$BERT_BASE_DIR/vocab.txt \
135 |   --bert_config_file=$BERT_BASE_DIR/bert_config.json \
136 |   --init_checkpoint=$BERT_BASE_DIR/bert_model.ckpt \
137 |   --max_seq_length=128 --train_batch_size=10 \
138 |   --learning_rate=2e-5 --num_train_epochs=3.0 \
139 |   --output_dir=$BERT_BASE_DIR/openbookqa_annotator_2/ \
140 |   --split=annotator --annotator_idx=2 \
141 |   --augment_ratio=0.2
142 | ```
143 |
144 | ## Reproducing our experiments on a new dataset
145 |
146 | To reproduce our experiments on a crowdsourced NLU dataset for which annotator information is available (i.e., for every example there is an identifier of the annotator who created it), one needs the following for each of our three experiments:
147 | 1. **The utility of annotator information**
148 |     * The original data split of the dataset.
149 |     * The original data split of the dataset with the annotator ID concatenated as an additional feature to every example.
150 |     * A fine-tuning script for BERT, suitable for the original dataset task.
151 | 2. **Annotator recognition**
152 |     * The original data split of the dataset, with annotator IDs of the top 5 annotators as labels. Namely, every example `(x,y)` written by annotator `z` should be replaced with `(x,z*)`, where `z*=z` if `z` is in the top 5 annotators and `z*=OTHER` otherwise.
153 |     * A fine-tuning script for BERT, for a classification task.
154 | 3. **Model generalization across annotators**
155 |     * Annotator-based data splits and corresponding random splits of the same size.
156 |     * Augmented annotator-based data splits and corresponding random splits of the same size.
157 |     * A fine-tuning script for BERT, suitable for the original dataset task.
158 |
159 |
160 | ## Citation
161 | If you make use of our work in your research, please cite the following:
162 |
163 |
164 | > @InProceedings{GevaEtAl2019,
165 |     title = {{Are We Modeling the Task or the Annotator? 
An Investigation of Annotator Bias in Natural Language Understanding Datasets}}, 166 | author = {Geva, Mor and Goldberg, Yoav and Berant, Jonathan}, 167 | booktitle = {Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and 9th International Joint Conference on Natural Language Processing}, 168 | note = {arXiv preprint arXiv:1908.07898}, 169 | year = {2019} 170 | } 171 | -------------------------------------------------------------------------------- /model_fine_tuning_scripts/run_commonsense_qa.py: -------------------------------------------------------------------------------- 1 | """Run BERT on CommonsenseQA.""" 2 | 3 | from __future__ import absolute_import 4 | from __future__ import division 5 | from __future__ import print_function 6 | 7 | import collections 8 | import json 9 | import os 10 | 11 | import numpy as np 12 | import tensorflow as tf 13 | 14 | import modeling 15 | import optimization 16 | import tokenization 17 | 18 | 19 | flags = tf.flags 20 | 21 | FLAGS = flags.FLAGS 22 | 23 | ## Required parameters 24 | flags.DEFINE_string( 25 | "data_dir", None, 26 | "The input data dir. Should contain the .tsv files (or other data files) " 27 | "for the task.") 28 | 29 | flags.DEFINE_string( 30 | "bert_config_file", None, 31 | "The config json file corresponding to the pre-trained BERT model. " 32 | "This specifies the model architecture.") 33 | 34 | flags.DEFINE_string("vocab_file", None, 35 | "The vocabulary file that the BERT model was trained on.") 36 | 37 | flags.DEFINE_string( 38 | "output_dir", None, 39 | "The output directory where the model checkpoints will be written.") 40 | 41 | flags.DEFINE_string( 42 | "split", None, 43 | "The split you'd like to run on, either 'qtoken' or 'rand'.") 44 | 45 | flags.DEFINE_integer( 46 | "annotator_idx", 0, 47 | "Index of the annotator to split by.") 48 | 49 | flags.DEFINE_float( 50 | "augment_ratio", 0.0, 51 | "Proportion of dev examples moved to training set.") 52 | 53 | flags.DEFINE_integer( 54 | "take_number", 1, 55 | "In case this is a re-execution of previous experiment.") 56 | 57 | flags.DEFINE_bool( 58 | "swap_trn_dev", False, 59 | "Whether to use the train set as dev set and vice versa.") 60 | 61 | 62 | ## Other parameters 63 | 64 | flags.DEFINE_string( 65 | "init_checkpoint", None, 66 | "Initial checkpoint (usually from a pre-trained BERT model).") 67 | 68 | flags.DEFINE_bool( 69 | "do_lower_case", True, 70 | "Whether to lower case the input text. Should be True for uncased " 71 | "models and False for cased models.") 72 | 73 | flags.DEFINE_integer( 74 | "max_seq_length", 128, 75 | "The maximum total input sequence length after WordPiece tokenization. 
" 76 | "Sequences longer than this will be truncated, and sequences shorter " 77 | "than this will be padded.") 78 | 79 | flags.DEFINE_bool("do_train", False, "Whether to run training.") 80 | 81 | flags.DEFINE_bool("do_eval", False, "Whether to run eval on the dev set.") 82 | 83 | flags.DEFINE_bool( 84 | "do_predict", False, 85 | "Whether to run the model in inference mode on the test set.") 86 | 87 | flags.DEFINE_integer("train_batch_size", 32, "Total batch size for training.") 88 | 89 | flags.DEFINE_integer("eval_batch_size", 8, "Total batch size for eval.") 90 | 91 | flags.DEFINE_integer("predict_batch_size", 8, "Total batch size for predict.") 92 | 93 | flags.DEFINE_float("learning_rate", 5e-5, "The initial learning rate for Adam.") 94 | 95 | flags.DEFINE_float("num_train_epochs", 3.0, 96 | "Total number of training epochs to perform.") 97 | 98 | flags.DEFINE_float( 99 | "warmup_proportion", 0.1, 100 | "Proportion of training to perform linear learning rate warmup for. " 101 | "E.g., 0.1 = 10% of training.") 102 | 103 | flags.DEFINE_integer("save_checkpoints_steps", 1000, 104 | "How often to save the model checkpoint.") 105 | 106 | flags.DEFINE_integer("iterations_per_loop", 1000, 107 | "How many steps to make in each estimator call.") 108 | 109 | flags.DEFINE_bool("use_tpu", False, "Whether to use TPU or GPU/CPU.") 110 | 111 | tf.flags.DEFINE_string( 112 | "tpu_name", None, 113 | "The Cloud TPU to use for training. This should be either the name " 114 | "used when creating the Cloud TPU, or a grpc://ip.address.of.tpu:8470 " 115 | "url.") 116 | 117 | tf.flags.DEFINE_string( 118 | "tpu_zone", None, 119 | "[Optional] GCE zone where the Cloud TPU is located in. If not " 120 | "specified, we will attempt to automatically detect the GCE project from " 121 | "metadata.") 122 | 123 | tf.flags.DEFINE_string( 124 | "gcp_project", None, 125 | "[Optional] Project name for the Cloud TPU-enabled project. If not " 126 | "specified, we will attempt to automatically detect the GCE project from " 127 | "metadata.") 128 | 129 | tf.flags.DEFINE_string("master", None, "[Optional] TensorFlow master URL.") 130 | 131 | flags.DEFINE_integer( 132 | "num_tpu_cores", 8, 133 | "Only used if `use_tpu` is True. 
Total number of TPU cores to use.")
134 |
135 |
136 | class InputExample(object):
137 |   """A single multiple choice question."""
138 |
139 |   def __init__(
140 |       self,
141 |       qid,
142 |       question,
143 |       answers,
144 |       label):
145 |     """Construct an instance."""
146 |     self.qid = qid
147 |     self.question = question
148 |     self.answers = answers
149 |     self.label = label
150 |
151 |
152 | class DataProcessor(object):
153 |   """Base class for data converters for sequence classification data sets."""
154 |
155 |   def get_train_examples(self, data_dir):
156 |     """Gets a collection of `InputExample`s for the train set."""
157 |     raise NotImplementedError()
158 |
159 |   def get_dev_examples(self, data_dir):
160 |     """Gets a collection of `InputExample`s for the dev set."""
161 |     raise NotImplementedError()
162 |
163 |   def get_test_examples(self, data_dir):
164 |     """Gets a collection of `InputExample`s for prediction."""
165 |     raise NotImplementedError()
166 |
167 |   def get_labels(self):
168 |     """Gets the list of labels for this data set."""
169 |     raise NotImplementedError()
170 |
171 |   @classmethod
172 |   def _read_json(cls, input_file):
173 |     """Reads a JSON file."""
174 |     with tf.gfile.Open(input_file, "r") as f:
175 |       return json.load(f)
176 |
177 |
178 | class CommonsenseQAProcessor(DataProcessor):
179 |   """Processor for the CommonsenseQA data set."""
180 |
181 |   SPLIT_TO_NAME = {
182 |     # 'annotator': 'annotator_{annotator_idx}_cand_dists',
183 |     # 'annotator_multi': 'annotator_multi_{annotator_idx}_cand_dists',
184 |     # 'rand': 'rand_{annotator_idx}_cand_dists',
185 |     # 'rand_multi': 'rand_multi_{annotator_idx}_cand_dists',
186 |     'annotator': 'annotator_{annotator_idx}{augment_ratio}{take_number}',
187 |     'annotator_multi': 'annotator_multi_{annotator_idx}{augment_ratio}{take_number}',
188 |     'rand': 'rand_{annotator_idx}{augment_ratio}{take_number}',
189 |     'rand_multi': 'rand_multi_{annotator_idx}{augment_ratio}{take_number}',
190 |     'with_annotator': 'with_annotator_id',
191 |     'without_annotator': 'without_annotator_id',
192 |   }
193 |
194 |   TRAIN_FILE_NAME = 'train_{split_name}.json'
195 |   DEV_FILE_NAME = 'dev_{split_name}.json'
196 |   TEST_FILE_NAME = 'dev_{split_name}.json'
197 |
198 |   def __init__(self, split, annotator_idx, augment_ratio, take_number, swap_trn_dev):
199 |     if split not in self.SPLIT_TO_NAME.keys():
200 |       raise ValueError(
201 |         f'split must be one of {", ".join(self.SPLIT_TO_NAME.keys())}.')
202 |
203 |     self.split = split
204 |     self.annotator_idx = annotator_idx
205 |     self.augment_ratio = "_augment{}".format(augment_ratio) if augment_ratio > 0 else ""
206 |     self.take = "_take{}".format(take_number) if take_number > 1 else ""
207 |     self.swap_trn_dev = swap_trn_dev
208 |
209 |     if self.swap_trn_dev:
210 |       tmp = self.TRAIN_FILE_NAME
211 |       self.TRAIN_FILE_NAME = self.DEV_FILE_NAME
212 |       self.DEV_FILE_NAME = tmp
213 |
214 |   def get_train_examples(self, data_dir):
215 |     train_file_name = self.TRAIN_FILE_NAME.format(
216 |       split_name=self.SPLIT_TO_NAME[self.split].format(annotator_idx=self.annotator_idx, augment_ratio=self.augment_ratio, take_number=self.take))
217 |
218 |     return self._create_examples(
219 |       self._read_json(os.path.join(data_dir, train_file_name)),
220 |       'train')
221 |
222 |   def get_dev_examples(self, data_dir):
223 |     dev_file_name = self.DEV_FILE_NAME.format(
224 |       split_name=self.SPLIT_TO_NAME[self.split].format(annotator_idx=self.annotator_idx, augment_ratio=self.augment_ratio, take_number=self.take))
225 |
226 |     return self._create_examples(
227 |
self._read_json(os.path.join(data_dir, dev_file_name)),
228 |       'dev')
229 |
230 |   def get_test_examples(self, data_dir):
231 |     test_file_name = self.TEST_FILE_NAME.format(
232 |       split_name=self.SPLIT_TO_NAME[self.split].format(annotator_idx=self.annotator_idx, augment_ratio=self.augment_ratio, take_number=self.take))
233 |
234 |     return self._create_examples(
235 |       self._read_json(os.path.join(data_dir, test_file_name)),
236 |       'test')
237 |
238 |   def get_labels(self):
239 |     return [0, 1, 2, 3, 4]
240 |
241 |   def _create_examples(self, lines, set_type):
242 |     examples = []
243 |     for i, line in enumerate(lines):
244 |       qid = "%s-%s" % (set_type, i)
245 |
246 |       question = tokenization.convert_to_unicode(line['question'])
247 |
248 |       answers = np.array([
249 |         tokenization.convert_to_unicode(line['correct_answer']),
250 |         tokenization.convert_to_unicode(line['distractor_0']),
251 |         tokenization.convert_to_unicode(line['distractor_1']),
252 |         tokenization.convert_to_unicode(line['distractor_2']),
253 |         tokenization.convert_to_unicode(line['distractor_3'])
254 |       ])
255 |
256 |       label = 0  # the correct answer is always placed first in `answers`, so the gold label is always 0
257 |
258 |       examples.append(
259 |         InputExample(
260 |           qid=qid,
261 |           question=question,
262 |           answers=answers,
263 |           label=label))
264 |
265 |     return examples
266 |
267 |
268 | def example_to_token_ids_segment_ids_label_ids(
269 |     ex_index,
270 |     example,
271 |     max_seq_length,
272 |     tokenizer):
273 |   """Converts an ``InputExample`` to token ids and segment ids."""
274 |   if ex_index < 5:
275 |     tf.logging.info(f"*** Example {ex_index} ***")
276 |     tf.logging.info("qid: %s" % (example.qid))
277 |
278 |   question_tokens = tokenizer.tokenize(example.question)
279 |   answers_tokens = map(tokenizer.tokenize, example.answers)
280 |
281 |   token_ids = []
282 |   segment_ids = []
283 |   for choice_idx, answer_tokens in enumerate(answers_tokens):
284 |     truncated_question_tokens = question_tokens[
285 |       :max((max_seq_length - 3)//2, max_seq_length - (len(answer_tokens) + 3))]
286 |     truncated_answer_tokens = answer_tokens[
287 |       :max((max_seq_length - 3)//2, max_seq_length - (len(question_tokens) + 3))]
288 |
289 |     choice_tokens = []
290 |     choice_segment_ids = []
291 |     choice_tokens.append("[CLS]")
292 |     choice_segment_ids.append(0)
293 |     for question_token in truncated_question_tokens:
294 |       choice_tokens.append(question_token)
295 |       choice_segment_ids.append(0)
296 |     choice_tokens.append("[SEP]")
297 |     choice_segment_ids.append(0)
298 |     for answer_token in truncated_answer_tokens:
299 |       choice_tokens.append(answer_token)
300 |       choice_segment_ids.append(1)
301 |     choice_tokens.append("[SEP]")
302 |     choice_segment_ids.append(1)
303 |
304 |     choice_token_ids = tokenizer.convert_tokens_to_ids(choice_tokens)
305 |
306 |     token_ids.append(choice_token_ids)
307 |     segment_ids.append(choice_segment_ids)
308 |
309 |     if ex_index < 5:
310 |       tf.logging.info("choice %s" % choice_idx)
311 |       tf.logging.info("tokens: %s" % " ".join(
312 |         [tokenization.printable_text(t) for t in choice_tokens]))
313 |       tf.logging.info("token ids: %s" % " ".join(
314 |         [str(x) for x in choice_token_ids]))
315 |       tf.logging.info("segment ids: %s" % " ".join(
316 |         [str(x) for x in choice_segment_ids]))
317 |
318 |   label_ids = [example.label]
319 |
320 |   if ex_index < 5:
321 |     tf.logging.info("label: %s (id = %d)" % (example.label, label_ids[0]))
322 |
323 |   return token_ids, segment_ids, label_ids
324 |
325 |
326 | def file_based_convert_examples_to_features(
327 |     examples,
328 |     label_list,
329 |     max_seq_length,
330 |     tokenizer,
331 |     output_file
332 | ):
333 |
"""Convert a set of ``InputExamples`` to a TFRecord file.""" 334 | 335 | # encode examples into token_ids and segment_ids 336 | token_ids_segment_ids_label_ids = [ 337 | example_to_token_ids_segment_ids_label_ids( 338 | ex_index, 339 | example, 340 | max_seq_length, 341 | tokenizer) 342 | for ex_index, example in enumerate(examples) 343 | ] 344 | 345 | # compute the maximum sequence length for any of the inputs 346 | seq_length = max([ 347 | max([len(choice_token_ids) for choice_token_ids in token_ids]) 348 | for token_ids, _, _ in token_ids_segment_ids_label_ids 349 | ]) 350 | 351 | # encode the inputs into fixed-length vectors 352 | writer = tf.python_io.TFRecordWriter(output_file) 353 | 354 | for idx, (token_ids, segment_ids, label_ids) in enumerate( 355 | token_ids_segment_ids_label_ids 356 | ): 357 | if idx % 10000 == 0: 358 | tf.logging.info("Writing %d of %d" % ( 359 | idx, 360 | len(token_ids_segment_ids_label_ids))) 361 | 362 | features = collections.OrderedDict() 363 | for i, (choice_token_ids, choice_segment_ids) in enumerate( 364 | zip(token_ids, segment_ids)): 365 | input_ids = np.zeros(max_seq_length) 366 | input_ids[:len(choice_token_ids)] = np.array(choice_token_ids) 367 | 368 | input_mask = np.zeros(max_seq_length) 369 | input_mask[:len(choice_token_ids)] = 1 370 | 371 | segment_ids = np.zeros(max_seq_length) 372 | segment_ids[:len(choice_segment_ids)] = np.array(choice_segment_ids) 373 | 374 | features[f'input_ids{i}'] = tf.train.Feature( 375 | int64_list=tf.train.Int64List(value=list(input_ids.astype(np.int64)))) 376 | features[f'input_mask{i}'] = tf.train.Feature( 377 | int64_list=tf.train.Int64List(value=list(input_mask.astype(np.int64)))) 378 | features[f'segment_ids{i}'] = tf.train.Feature( 379 | int64_list=tf.train.Int64List(value=list(segment_ids.astype(np.int64)))) 380 | 381 | features['label_ids'] = tf.train.Feature( 382 | int64_list=tf.train.Int64List(value=label_ids)) 383 | 384 | tf_example = tf.train.Example(features=tf.train.Features(feature=features)) 385 | writer.write(tf_example.SerializeToString()) 386 | 387 | return seq_length 388 | 389 | 390 | def file_based_input_fn_builder( 391 | input_file, 392 | seq_length, 393 | is_training, 394 | drop_remainder 395 | ): 396 | """Creates an `input_fn` closure to be passed to TPUEstimator.""" 397 | 398 | name_to_features = { 399 | "input_ids0": tf.FixedLenFeature([seq_length], tf.int64), 400 | "input_mask0": tf.FixedLenFeature([seq_length], tf.int64), 401 | "segment_ids0": tf.FixedLenFeature([seq_length], tf.int64), 402 | "input_ids1": tf.FixedLenFeature([seq_length], tf.int64), 403 | "input_mask1": tf.FixedLenFeature([seq_length], tf.int64), 404 | "segment_ids1": tf.FixedLenFeature([seq_length], tf.int64), 405 | "input_ids2": tf.FixedLenFeature([seq_length], tf.int64), 406 | "input_mask2": tf.FixedLenFeature([seq_length], tf.int64), 407 | "segment_ids2": tf.FixedLenFeature([seq_length], tf.int64), 408 | "input_ids3": tf.FixedLenFeature([seq_length], tf.int64), 409 | "input_mask3": tf.FixedLenFeature([seq_length], tf.int64), 410 | "segment_ids3": tf.FixedLenFeature([seq_length], tf.int64), 411 | "input_ids4": tf.FixedLenFeature([seq_length], tf.int64), 412 | "input_mask4": tf.FixedLenFeature([seq_length], tf.int64), 413 | "segment_ids4": tf.FixedLenFeature([seq_length], tf.int64), 414 | "label_ids": tf.FixedLenFeature([], tf.int64), 415 | } 416 | 417 | def _decode_record(record, name_to_features): 418 | """Decodes a record to a TensorFlow example.""" 419 | example = tf.parse_single_example(record, name_to_features) 
420 | 421 | # tf.Example only supports tf.int64, but the TPU only supports tf.int32. 422 | # So cast all int64 to int32. 423 | for name in list(example.keys()): 424 | t = example[name] 425 | if t.dtype == tf.int64: 426 | t = tf.to_int32(t) 427 | example[name] = t 428 | 429 | return example 430 | 431 | def input_fn(params): 432 | """The actual input function.""" 433 | batch_size = params["batch_size"] 434 | 435 | # For training, we want a lot of parallel reading and shuffling. 436 | # For eval, we want no shuffling and parallel reading doesn't matter. 437 | d = tf.data.TFRecordDataset(input_file) 438 | if is_training: 439 | d = d.repeat() 440 | d = d.shuffle(buffer_size=100) 441 | 442 | d = d.apply( 443 | tf.contrib.data.map_and_batch( 444 | lambda record: _decode_record(record, name_to_features), 445 | batch_size=batch_size, 446 | drop_remainder=drop_remainder)) 447 | 448 | return d 449 | 450 | return input_fn 451 | 452 | 453 | def create_model( 454 | bert_config, 455 | is_training, 456 | input_ids0, 457 | input_mask0, 458 | segment_ids0, 459 | input_ids1, 460 | input_mask1, 461 | segment_ids1, 462 | input_ids2, 463 | input_mask2, 464 | segment_ids2, 465 | input_ids3, 466 | input_mask3, 467 | segment_ids3, 468 | input_ids4, 469 | input_mask4, 470 | segment_ids4, 471 | labels, 472 | num_labels, 473 | use_one_hot_embeddings 474 | ): 475 | """Creates a classification model.""" 476 | input_ids = tf.stack( 477 | [ 478 | input_ids0, 479 | input_ids1, 480 | input_ids2, 481 | input_ids3, 482 | input_ids4 483 | ], 484 | axis=1) 485 | input_mask = tf.stack( 486 | [ 487 | input_mask0, 488 | input_mask1, 489 | input_mask2, 490 | input_mask3, 491 | input_mask4 492 | ], 493 | axis=1) 494 | segment_ids = tf.stack( 495 | [ 496 | segment_ids0, 497 | segment_ids1, 498 | segment_ids2, 499 | segment_ids3, 500 | segment_ids4 501 | ], 502 | axis=1) 503 | 504 | _, num_choices, seq_length = input_ids.shape 505 | 506 | input_ids = tf.reshape(input_ids, (-1, seq_length)) 507 | input_mask = tf.reshape(input_mask, (-1, seq_length)) 508 | segment_ids = tf.reshape(segment_ids, (-1, seq_length)) 509 | 510 | output_layer = modeling.BertModel( 511 | config=bert_config, 512 | is_training=is_training, 513 | input_ids=input_ids, 514 | input_mask=input_mask, 515 | token_type_ids=segment_ids, 516 | use_one_hot_embeddings=use_one_hot_embeddings 517 | ).get_pooled_output() 518 | 519 | hidden_size = output_layer.shape[-1].value 520 | 521 | softmax_weights = tf.get_variable( 522 | "softmax_weights", [hidden_size, 1], 523 | initializer=tf.truncated_normal_initializer(stddev=0.02)) 524 | 525 | with tf.variable_scope("loss"): 526 | if is_training: 527 | # I.e., 0.1 dropout 528 | output_layer = tf.nn.dropout(output_layer, keep_prob=0.9) 529 | 530 | logits = tf.reshape( 531 | tf.matmul(output_layer, softmax_weights), 532 | (-1, num_choices)) 533 | 534 | probabilities = tf.nn.softmax(logits, axis=-1) 535 | log_probs = tf.nn.log_softmax(logits, axis=-1) 536 | 537 | one_hot_labels = tf.one_hot(labels, depth=num_labels, dtype=tf.float32) 538 | 539 | per_example_loss = -tf.reduce_sum(one_hot_labels * log_probs, axis=-1) 540 | loss = tf.reduce_mean(per_example_loss) 541 | 542 | return (loss, per_example_loss, logits, probabilities, output_layer) 543 | 544 | 545 | def model_fn_builder( 546 | bert_config, 547 | num_labels, 548 | init_checkpoint, 549 | learning_rate, 550 | num_train_steps, 551 | num_warmup_steps, 552 | use_tpu, 553 | use_one_hot_embeddings 554 | ): 555 | """Returns `model_fn` closure for TPUEstimator.""" 556 | 557 | def 
model_fn(features, labels, mode, params): # pylint: disable=unused-argument 558 | """The `model_fn` for TPUEstimator.""" 559 | 560 | tf.logging.info("*** Features ***") 561 | for name in sorted(features.keys()): 562 | tf.logging.info(" name = %s, shape = %s" % (name, features[name].shape)) 563 | 564 | input_ids0 = features["input_ids0"] 565 | input_mask0 = features["input_mask0"] 566 | segment_ids0 = features["segment_ids0"] 567 | input_ids1 = features["input_ids1"] 568 | input_mask1 = features["input_mask1"] 569 | segment_ids1 = features["segment_ids1"] 570 | input_ids2 = features["input_ids2"] 571 | input_mask2 = features["input_mask2"] 572 | segment_ids2 = features["segment_ids2"] 573 | input_ids3 = features["input_ids3"] 574 | input_mask3 = features["input_mask3"] 575 | segment_ids3 = features["segment_ids3"] 576 | input_ids4 = features["input_ids4"] 577 | input_mask4 = features["input_mask4"] 578 | segment_ids4 = features["segment_ids4"] 579 | label_ids = features["label_ids"] 580 | 581 | is_training = (mode == tf.estimator.ModeKeys.TRAIN) 582 | 583 | (total_loss, per_example_loss, logits, probabilities, output_layer) = create_model( 584 | bert_config, 585 | is_training, 586 | input_ids0, 587 | input_mask0, 588 | segment_ids0, 589 | input_ids1, 590 | input_mask1, 591 | segment_ids1, 592 | input_ids2, 593 | input_mask2, 594 | segment_ids2, 595 | input_ids3, 596 | input_mask3, 597 | segment_ids3, 598 | input_ids4, 599 | input_mask4, 600 | segment_ids4, 601 | label_ids, 602 | num_labels, 603 | use_one_hot_embeddings) 604 | 605 | tvars = tf.trainable_variables() 606 | initialized_variable_names = {} 607 | scaffold_fn = None 608 | if init_checkpoint: 609 | (assignment_map, initialized_variable_names 610 | ) = modeling.get_assignment_map_from_checkpoint(tvars, init_checkpoint) 611 | if use_tpu: 612 | 613 | def tpu_scaffold(): 614 | tf.train.init_from_checkpoint(init_checkpoint, assignment_map) 615 | return tf.train.Scaffold() 616 | 617 | scaffold_fn = tpu_scaffold 618 | else: 619 | tf.train.init_from_checkpoint(init_checkpoint, assignment_map) 620 | 621 | tf.logging.info("**** Trainable Variables ****") 622 | for var in tvars: 623 | init_string = "" 624 | if var.name in initialized_variable_names: 625 | init_string = ", *INIT_FROM_CKPT*" 626 | tf.logging.info(" name = %s, shape = %s%s", var.name, var.shape, 627 | init_string) 628 | 629 | output_spec = None 630 | if mode == tf.estimator.ModeKeys.TRAIN: 631 | 632 | train_op = optimization.create_optimizer( 633 | total_loss, learning_rate, num_train_steps, num_warmup_steps, use_tpu) 634 | 635 | output_spec = tf.contrib.tpu.TPUEstimatorSpec( 636 | mode=mode, 637 | loss=total_loss, 638 | train_op=train_op, 639 | scaffold_fn=scaffold_fn) 640 | elif mode == tf.estimator.ModeKeys.EVAL: 641 | 642 | def metric_fn(per_example_loss, label_ids, logits): 643 | predictions = tf.argmax(logits, axis=-1, output_type=tf.int32) 644 | accuracy = tf.metrics.accuracy(label_ids, predictions) 645 | loss = tf.metrics.mean(per_example_loss) 646 | return { 647 | "eval_accuracy": accuracy, 648 | "eval_loss": loss, 649 | } 650 | 651 | eval_metrics = (metric_fn, [per_example_loss, label_ids, logits]) 652 | output_spec = tf.contrib.tpu.TPUEstimatorSpec( 653 | mode=mode, 654 | loss=total_loss, 655 | eval_metrics=eval_metrics, 656 | scaffold_fn=scaffold_fn) 657 | else: 658 | output_spec = tf.contrib.tpu.TPUEstimatorSpec( 659 | mode=mode, predictions=probabilities, scaffold_fn=scaffold_fn) 660 | return output_spec 661 | 662 | return model_fn 663 | 664 | 665 | def main(_): 
666 | tf.logging.set_verbosity(tf.logging.INFO) 667 | 668 | if not FLAGS.do_train and not FLAGS.do_eval and not FLAGS.do_predict: 669 | raise ValueError( 670 | "At least one of `do_train`, `do_eval` or `do_predict' must be True.") 671 | 672 | bert_config = modeling.BertConfig.from_json_file(FLAGS.bert_config_file) 673 | 674 | if FLAGS.max_seq_length > bert_config.max_position_embeddings: 675 | raise ValueError( 676 | "Cannot use sequence length %d because the BERT model " 677 | "was only trained up to sequence length %d" % 678 | (FLAGS.max_seq_length, bert_config.max_position_embeddings)) 679 | 680 | tf.gfile.MakeDirs(FLAGS.output_dir) 681 | 682 | processor = CommonsenseQAProcessor(split=FLAGS.split, annotator_idx=FLAGS.annotator_idx, 683 | augment_ratio=FLAGS.augment_ratio, take_number=FLAGS.take_number, 684 | swap_trn_dev=FLAGS.swap_trn_dev) 685 | 686 | label_list = processor.get_labels() 687 | 688 | tokenizer = tokenization.FullTokenizer( 689 | vocab_file=FLAGS.vocab_file, do_lower_case=FLAGS.do_lower_case) 690 | 691 | tpu_cluster_resolver = None 692 | if FLAGS.use_tpu and FLAGS.tpu_name: 693 | tpu_cluster_resolver = tf.contrib.cluster_resolver.TPUClusterResolver( 694 | FLAGS.tpu_name, zone=FLAGS.tpu_zone, project=FLAGS.gcp_project) 695 | 696 | is_per_host = tf.contrib.tpu.InputPipelineConfig.PER_HOST_V2 697 | run_config = tf.contrib.tpu.RunConfig( 698 | cluster=tpu_cluster_resolver, 699 | master=FLAGS.master, 700 | model_dir=FLAGS.output_dir, 701 | save_checkpoints_steps=FLAGS.save_checkpoints_steps, 702 | tpu_config=tf.contrib.tpu.TPUConfig( 703 | iterations_per_loop=FLAGS.iterations_per_loop, 704 | num_shards=FLAGS.num_tpu_cores, 705 | per_host_input_for_training=is_per_host)) 706 | 707 | train_examples = None 708 | num_train_steps = None 709 | num_warmup_steps = None 710 | if FLAGS.do_train: 711 | train_examples = processor.get_train_examples(FLAGS.data_dir) 712 | num_train_steps = int( 713 | len(train_examples) / FLAGS.train_batch_size * FLAGS.num_train_epochs) 714 | num_warmup_steps = int(num_train_steps * FLAGS.warmup_proportion) 715 | 716 | model_fn = model_fn_builder( 717 | bert_config=bert_config, 718 | num_labels=len(label_list), 719 | init_checkpoint=FLAGS.init_checkpoint, 720 | learning_rate=FLAGS.learning_rate, 721 | num_train_steps=num_train_steps, 722 | num_warmup_steps=num_warmup_steps, 723 | use_tpu=FLAGS.use_tpu, 724 | use_one_hot_embeddings=FLAGS.use_tpu) 725 | 726 | # If TPU is not available, this will fall back to normal Estimator on CPU 727 | # or GPU. 
728 | estimator = tf.contrib.tpu.TPUEstimator( 729 | use_tpu=FLAGS.use_tpu, 730 | model_fn=model_fn, 731 | config=run_config, 732 | train_batch_size=FLAGS.train_batch_size, 733 | eval_batch_size=FLAGS.eval_batch_size, 734 | predict_batch_size=FLAGS.predict_batch_size) 735 | 736 | if FLAGS.do_train: 737 | train_file = os.path.join(FLAGS.output_dir, "train.tf_record") 738 | train_seq_length = file_based_convert_examples_to_features( 739 | train_examples, label_list, FLAGS.max_seq_length, tokenizer, train_file) 740 | tf.logging.info("***** Running training *****") 741 | tf.logging.info(" Num examples = %d", len(train_examples)) 742 | tf.logging.info(" Batch size = %d", FLAGS.train_batch_size) 743 | tf.logging.info(" Num steps = %d", num_train_steps) 744 | tf.logging.info(" Longest training sequence = %d", train_seq_length) 745 | train_input_fn = file_based_input_fn_builder( 746 | input_file=train_file, 747 | seq_length=FLAGS.max_seq_length, 748 | is_training=True, 749 | drop_remainder=True) 750 | estimator.train(input_fn=train_input_fn, max_steps=num_train_steps) 751 | 752 | if FLAGS.do_eval: 753 | eval_examples = processor.get_dev_examples(FLAGS.data_dir) 754 | eval_file = os.path.join(FLAGS.output_dir, "eval.tf_record") 755 | eval_seq_length = file_based_convert_examples_to_features( 756 | eval_examples, label_list, FLAGS.max_seq_length, tokenizer, eval_file) 757 | 758 | tf.logging.info("***** Running evaluation *****") 759 | tf.logging.info(" Num examples = %d", len(eval_examples)) 760 | tf.logging.info(" Batch size = %d", FLAGS.eval_batch_size) 761 | tf.logging.info(" Longest eval sequence = %d", eval_seq_length) 762 | 763 | # This tells the estimator to run through the entire set. 764 | eval_steps = None 765 | # However, if running eval on the TPU, you will need to specify the 766 | # number of steps. 767 | if FLAGS.use_tpu: 768 | # Eval will be slightly WRONG on the TPU because it will truncate 769 | # the last batch. 
770 |       eval_steps = int(len(eval_examples) / FLAGS.eval_batch_size)
771 |
772 |     eval_drop_remainder = True if FLAGS.use_tpu else False
773 |     eval_input_fn = file_based_input_fn_builder(
774 |         input_file=eval_file,
775 |         seq_length=FLAGS.max_seq_length,
776 |         is_training=False,
777 |         drop_remainder=eval_drop_remainder)
778 |
779 |     result = estimator.evaluate(input_fn=eval_input_fn, steps=eval_steps)
780 |
781 |     output_eval_file = os.path.join(FLAGS.output_dir, "eval_results.txt")
782 |     with tf.gfile.GFile(output_eval_file, "w") as writer:
783 |       tf.logging.info("***** Eval results *****")
784 |       for key in sorted(result.keys()):
785 |         tf.logging.info("  %s = %s", key, str(result[key]))
786 |         writer.write("%s = %s\n" % (key, str(result[key])))
787 |
788 |   if FLAGS.do_predict:
789 |     predict_examples = processor.get_test_examples(FLAGS.data_dir)
790 |     predict_file = os.path.join(FLAGS.output_dir, "predict.tf_record")
791 |     predict_seq_length = file_based_convert_examples_to_features(
792 |         predict_examples, label_list,
793 |         FLAGS.max_seq_length, tokenizer,
794 |         predict_file)
795 |
796 |     tf.logging.info("***** Running prediction*****")
797 |     tf.logging.info("  Num examples = %d", len(predict_examples))
798 |     tf.logging.info("  Batch size = %d", FLAGS.predict_batch_size)
799 |     tf.logging.info("  Longest predict sequence = %d", predict_seq_length)
800 |
801 |     if FLAGS.use_tpu:
802 |       # Warning: According to tpu_estimator.py Prediction on TPU is an
803 |       # experimental feature and hence not supported here
804 |       raise ValueError("Prediction in TPU not supported")
805 |
806 |     predict_drop_remainder = True if FLAGS.use_tpu else False
807 |     predict_input_fn = file_based_input_fn_builder(
808 |         input_file=predict_file,
809 |         seq_length=FLAGS.max_seq_length,
810 |         is_training=False,
811 |         drop_remainder=predict_drop_remainder)
812 |
813 |     result = estimator.predict(input_fn=predict_input_fn)
814 |
815 |     output_predict_file = os.path.join(FLAGS.output_dir, "test_results.tsv")
816 |     with tf.gfile.GFile(output_predict_file, "w") as writer:
817 |       tf.logging.info("***** Predict results *****")
818 |       for prediction in result:
819 |         output_line = "\t".join(
820 |             str(class_probability) for class_probability in prediction) + "\n"
821 |         writer.write(output_line)
822 |
823 |
824 | if __name__ == "__main__":
825 |   flags.mark_flag_as_required("data_dir")
826 |   flags.mark_flag_as_required("vocab_file")
827 |   flags.mark_flag_as_required("bert_config_file")
828 |   flags.mark_flag_as_required("output_dir")
829 |   tf.app.run()
830 |
--------------------------------------------------------------------------------
/model_fine_tuning_scripts/run_commonsense_qa_recognition.py:
--------------------------------------------------------------------------------
1 | """Run BERT on CommonsenseQA for annotator ID prediction."""
2 |
3 | from __future__ import absolute_import
4 | from __future__ import division
5 | from __future__ import print_function
6 |
7 | import collections
8 | import json
9 | import os
10 |
11 | import numpy as np
12 | import tensorflow as tf
13 |
14 | import modeling
15 | import optimization
16 | import tokenization
17 |
18 |
19 | flags = tf.flags
20 |
21 | FLAGS = flags.FLAGS
22 |
23 | ## Required parameters
24 | flags.DEFINE_string(
25 |     "data_dir", None,
26 |     "The input data dir. Should contain the .tsv files (or other data files) "
27 |     "for the task.")
28 |
29 | flags.DEFINE_string(
30 |     "bert_config_file", None,
31 |     "The config json file corresponding to the pre-trained BERT model. 
" 32 | "This specifies the model architecture.") 33 | 34 | flags.DEFINE_string("vocab_file", None, 35 | "The vocabulary file that the BERT model was trained on.") 36 | 37 | flags.DEFINE_string( 38 | "output_dir", None, 39 | "The output directory where the model checkpoints will be written.") 40 | 41 | flags.DEFINE_string( 42 | "split", None, 43 | "The split you'd like to run on, either 'annotator' or 'rand'.") 44 | 45 | flags.DEFINE_integer( 46 | "annotator_idx", 0, 47 | "Index of the annotator to split by.") 48 | 49 | flags.DEFINE_float( 50 | "augment_ratio", 0.0, 51 | "Proportion of dev examples moved to training set.") 52 | 53 | flags.DEFINE_integer( 54 | "take_number", 1, 55 | "In case this is a re-execution of previous experiment.") 56 | 57 | 58 | ## Other parameters 59 | 60 | flags.DEFINE_string( 61 | "init_checkpoint", None, 62 | "Initial checkpoint (usually from a pre-trained BERT model).") 63 | 64 | flags.DEFINE_bool( 65 | "do_lower_case", True, 66 | "Whether to lower case the input text. Should be True for uncased " 67 | "models and False for cased models.") 68 | 69 | flags.DEFINE_integer( 70 | "max_seq_length", 128, 71 | "The maximum total input sequence length after WordPiece tokenization. " 72 | "Sequences longer than this will be truncated, and sequences shorter " 73 | "than this will be padded.") 74 | 75 | flags.DEFINE_bool("do_train", False, "Whether to run training.") 76 | 77 | flags.DEFINE_bool("do_eval", False, "Whether to run eval on the dev set.") 78 | 79 | flags.DEFINE_bool( 80 | "do_predict", False, 81 | "Whether to run the model in inference mode on the test set.") 82 | 83 | flags.DEFINE_integer("train_batch_size", 32, "Total batch size for training.") 84 | 85 | flags.DEFINE_integer("eval_batch_size", 8, "Total batch size for eval.") 86 | 87 | flags.DEFINE_integer("predict_batch_size", 8, "Total batch size for predict.") 88 | 89 | flags.DEFINE_float("learning_rate", 5e-5, "The initial learning rate for Adam.") 90 | 91 | flags.DEFINE_float("num_train_epochs", 3.0, 92 | "Total number of training epochs to perform.") 93 | 94 | flags.DEFINE_float( 95 | "warmup_proportion", 0.1, 96 | "Proportion of training to perform linear learning rate warmup for. " 97 | "E.g., 0.1 = 10% of training.") 98 | 99 | flags.DEFINE_integer("save_checkpoints_steps", 1000, 100 | "How often to save the model checkpoint.") 101 | 102 | flags.DEFINE_integer("iterations_per_loop", 1000, 103 | "How many steps to make in each estimator call.") 104 | 105 | flags.DEFINE_bool("use_tpu", False, "Whether to use TPU or GPU/CPU.") 106 | 107 | tf.flags.DEFINE_string( 108 | "tpu_name", None, 109 | "The Cloud TPU to use for training. This should be either the name " 110 | "used when creating the Cloud TPU, or a grpc://ip.address.of.tpu:8470 " 111 | "url.") 112 | 113 | tf.flags.DEFINE_string( 114 | "tpu_zone", None, 115 | "[Optional] GCE zone where the Cloud TPU is located in. If not " 116 | "specified, we will attempt to automatically detect the GCE project from " 117 | "metadata.") 118 | 119 | tf.flags.DEFINE_string( 120 | "gcp_project", None, 121 | "[Optional] Project name for the Cloud TPU-enabled project. If not " 122 | "specified, we will attempt to automatically detect the GCE project from " 123 | "metadata.") 124 | 125 | tf.flags.DEFINE_string("master", None, "[Optional] TensorFlow master URL.") 126 | 127 | flags.DEFINE_integer( 128 | "num_tpu_cores", 8, 129 | "Only used if `use_tpu` is True. 
Total number of TPU cores to use.")
130 |
131 |
132 | class InputExample(object):
133 |   """A single training/test example for simple sequence classification."""
134 |
135 |   def __init__(self, guid, text_a, text_b=None, label=None):
136 |     """Constructs an InputExample.
137 |
138 |     Args:
139 |       guid: Unique id for the example.
140 |       text_a: string. The untokenized text of the first sequence. For single
141 |         sequence tasks, only this sequence must be specified.
142 |       text_b: (Optional) string. The untokenized text of the second sequence.
143 |         Only must be specified for sequence pair tasks.
144 |       label: (Optional) string. The label of the example. This should be
145 |         specified for train and dev examples, but not for test examples.
146 |     """
147 |     self.guid = guid
148 |     self.text_a = text_a
149 |     self.text_b = text_b
150 |     self.label = label
151 |
152 |
153 | class PaddingInputExample(object):
154 |   """Fake example so the num input examples is a multiple of the batch size.
155 |
156 |   When running eval/predict on the TPU, we need to pad the number of examples
157 |   to be a multiple of the batch size, because the TPU requires a fixed batch
158 |   size. The alternative is to drop the last batch, which is bad because it means
159 |   the entire output data won't be generated.
160 |
161 |   We use this class instead of `None` because treating `None` as padding
162 |   batches could cause silent errors.
163 |   """
164 |
165 |
166 | class InputFeatures(object):
167 |   """A single set of features of data."""
168 |
169 |   def __init__(self,
170 |                input_ids,
171 |                input_mask,
172 |                segment_ids,
173 |                label_id,
174 |                is_real_example=True):
175 |     self.input_ids = input_ids
176 |     self.input_mask = input_mask
177 |     self.segment_ids = segment_ids
178 |     self.label_id = label_id
179 |     self.is_real_example = is_real_example
180 |
181 |
182 | class DataProcessor(object):
183 |   """Base class for data converters for sequence classification data sets."""
184 |
185 |   def get_train_examples(self, data_dir):
186 |     """Gets a collection of `InputExample`s for the train set."""
187 |     raise NotImplementedError()
188 |
189 |   def get_dev_examples(self, data_dir):
190 |     """Gets a collection of `InputExample`s for the dev set."""
191 |     raise NotImplementedError()
192 |
193 |   def get_test_examples(self, data_dir):
194 |     """Gets a collection of `InputExample`s for prediction."""
195 |     raise NotImplementedError()
196 |
197 |   def get_labels(self):
198 |     """Gets the list of labels for this data set."""
199 |     raise NotImplementedError()
200 |
201 |   @classmethod
202 |   def _read_json(cls, input_file):
203 |     """Reads a JSON file."""
204 |     with tf.gfile.Open(input_file, "r") as f:
205 |       return json.load(f)
206 |
207 |
208 | class CommonsenseQAProcessor(DataProcessor):
209 |   """Processor for the CommonsenseQA data set (annotator recognition)."""
210 |
211 |   SPLIT_TO_NAME = {
212 |     'annotator': 'annotator_{annotator_idx}{augment_ratio}{take_number}',
213 |     'annotator_multi': 'annotator_multi_{annotator_idx}{augment_ratio}{take_number}',
214 |     'rand': 'rand_{annotator_idx}{augment_ratio}{take_number}',
215 |     'rand_multi': 'rand_multi_{annotator_idx}{augment_ratio}{take_number}',
216 |     'with_annotator': 'with_annotator_id',
217 |     'without_annotator': 'without_annotator_id'
218 |   }
219 |
220 |   TRAIN_FILE_NAME = 'train_{split_name}.json'
221 |   DEV_FILE_NAME = 'dev_{split_name}.json'
222 |   TEST_FILE_NAME = 'dev_{split_name}.json'
223 |
224 |   def __init__(self, split, annotator_idx, augment_ratio, take_number):
225 |     if split not in self.SPLIT_TO_NAME.keys():
226 |
raise ValueError(
227 |         f'split must be one of {", ".join(self.SPLIT_TO_NAME.keys())}.')
228 |
229 |     self.split = split
230 |     self.annotator_idx = annotator_idx
231 |     self.augment_ratio = "_augment{}".format(augment_ratio) if augment_ratio > 0 else ""
232 |     self.take = "_take{}".format(take_number) if take_number > 1 else ""
233 |
234 |   def get_train_examples(self, data_dir):
235 |     train_file_name = self.TRAIN_FILE_NAME.format(
236 |       split_name=self.SPLIT_TO_NAME[self.split].format(annotator_idx=self.annotator_idx, augment_ratio=self.augment_ratio, take_number=self.take))
237 |
238 |     return self._create_examples(
239 |       self._read_json(os.path.join(data_dir, train_file_name)),
240 |       'train')
241 |
242 |   def get_dev_examples(self, data_dir):
243 |     dev_file_name = self.DEV_FILE_NAME.format(
244 |       split_name=self.SPLIT_TO_NAME[self.split].format(annotator_idx=self.annotator_idx, augment_ratio=self.augment_ratio, take_number=self.take))
245 |
246 |     return self._create_examples(
247 |       self._read_json(os.path.join(data_dir, dev_file_name)),
248 |       'dev')
249 |
250 |   def get_test_examples(self, data_dir):
251 |     test_file_name = self.TEST_FILE_NAME.format(
252 |       split_name=self.SPLIT_TO_NAME[self.split].format(annotator_idx=self.annotator_idx, augment_ratio=self.augment_ratio, take_number=self.take))
253 |
254 |     return self._create_examples(
255 |       self._read_json(os.path.join(data_dir, test_file_name)),
256 |       'test')
257 |
258 |   def get_labels(self):
259 |     """See base class."""
260 |     # These are anonymized annotator IDs.
261 |     return ["ANNOT1", "ANNOT2", "ANNOT3", "ANNOT4", "ANNOT5", "OTHER"]
262 |
263 |   def _create_examples(self, lines, set_type):
264 |     """Creates examples for the training and dev sets."""
265 |     labels = self.get_labels()
266 |     examples = []
267 |     for i, line in enumerate(lines):
268 |       qid = "%s-%s" % (set_type, i)
269 |
270 |       answers = ' ; '.join(
271 |         [line['distractor_{}'.format(i)] for i in range(4)] + [line['correct_answer']]
272 |       )
273 |
274 |       text_a = tokenization.convert_to_unicode(line['question'])
275 |       text_b = tokenization.convert_to_unicode(answers)
276 |
277 |       if set_type == "test":
278 |         label = "OTHER"
279 |       else:
280 |         if line["turkIdAnonymized"] in labels:
281 |           label = tokenization.convert_to_unicode(line["turkIdAnonymized"])
282 |         else:
283 |           label = "OTHER"
284 |       examples.append(
285 |         InputExample(guid=qid, text_a=text_a, text_b=text_b, label=label))
286 |
287 |     return examples
288 |
289 |
290 | def convert_single_example(ex_index, example, label_list, max_seq_length,
291 |                            tokenizer):
292 |   """Converts a single `InputExample` into a single `InputFeatures`."""
293 |
294 |   if isinstance(example, PaddingInputExample):
295 |     return InputFeatures(
296 |         input_ids=[0] * max_seq_length,
297 |         input_mask=[0] * max_seq_length,
298 |         segment_ids=[0] * max_seq_length,
299 |         label_id=0,
300 |         is_real_example=False)
301 |
302 |   label_map = {}
303 |   for (i, label) in enumerate(label_list):
304 |     label_map[label] = i
305 |
306 |   tokens_a = tokenizer.tokenize(example.text_a)
307 |   tokens_b = None
308 |   if example.text_b:
309 |     tokens_b = tokenizer.tokenize(example.text_b)
310 |
311 |   if tokens_b:
312 |     # Modifies `tokens_a` and `tokens_b` in place so that the total
313 |     # length is less than the specified length.
314 | # Account for [CLS], [SEP], [SEP] with "- 3" 315 | _truncate_seq_pair(tokens_a, tokens_b, max_seq_length - 3) 316 | else: 317 | # Account for [CLS] and [SEP] with "- 2" 318 | if len(tokens_a) > max_seq_length - 2: 319 | tokens_a = tokens_a[0:(max_seq_length - 2)] 320 | 321 | # The convention in BERT is: 322 | # (a) For sequence pairs: 323 | # tokens: [CLS] is this jack ##son ##ville ? [SEP] no it is not . [SEP] 324 | # type_ids: 0 0 0 0 0 0 0 0 1 1 1 1 1 1 325 | # (b) For single sequences: 326 | # tokens: [CLS] the dog is hairy . [SEP] 327 | # type_ids: 0 0 0 0 0 0 0 328 | # 329 | # Where "type_ids" are used to indicate whether this is the first 330 | # sequence or the second sequence. The embedding vectors for `type=0` and 331 | # `type=1` were learned during pre-training and are added to the wordpiece 332 | # embedding vector (and position vector). This is not *strictly* necessary 333 | # since the [SEP] token unambiguously separates the sequences, but it makes 334 | # it easier for the model to learn the concept of sequences. 335 | # 336 | # For classification tasks, the first vector (corresponding to [CLS]) is 337 | # used as the "sentence vector". Note that this only makes sense because 338 | # the entire model is fine-tuned. 339 | tokens = [] 340 | segment_ids = [] 341 | tokens.append("[CLS]") 342 | segment_ids.append(0) 343 | for token in tokens_a: 344 | tokens.append(token) 345 | segment_ids.append(0) 346 | tokens.append("[SEP]") 347 | segment_ids.append(0) 348 | 349 | if tokens_b: 350 | for token in tokens_b: 351 | tokens.append(token) 352 | segment_ids.append(1) 353 | tokens.append("[SEP]") 354 | segment_ids.append(1) 355 | 356 | input_ids = tokenizer.convert_tokens_to_ids(tokens) 357 | 358 | # The mask has 1 for real tokens and 0 for padding tokens. Only real 359 | # tokens are attended to. 360 | input_mask = [1] * len(input_ids) 361 | 362 | # Zero-pad up to the sequence length. 
363 | while len(input_ids) < max_seq_length: 364 | input_ids.append(0) 365 | input_mask.append(0) 366 | segment_ids.append(0) 367 | 368 | assert len(input_ids) == max_seq_length 369 | assert len(input_mask) == max_seq_length 370 | assert len(segment_ids) == max_seq_length 371 | 372 | label_id = label_map[example.label] 373 | if ex_index < 5: 374 | tf.logging.info("*** Example ***") 375 | tf.logging.info("guid: %s" % (example.guid)) 376 | tf.logging.info("tokens: %s" % " ".join( 377 | [tokenization.printable_text(x) for x in tokens])) 378 | tf.logging.info("input_ids: %s" % " ".join([str(x) for x in input_ids])) 379 | tf.logging.info("input_mask: %s" % " ".join([str(x) for x in input_mask])) 380 | tf.logging.info("segment_ids: %s" % " ".join([str(x) for x in segment_ids])) 381 | tf.logging.info("label: %s (id = %d)" % (example.label, label_id)) 382 | 383 | feature = InputFeatures( 384 | input_ids=input_ids, 385 | input_mask=input_mask, 386 | segment_ids=segment_ids, 387 | label_id=label_id, 388 | is_real_example=True) 389 | return feature 390 | 391 | 392 | def file_based_convert_examples_to_features( 393 | examples, label_list, max_seq_length, tokenizer, output_file): 394 | """Convert a set of `InputExample`s to a TFRecord file.""" 395 | 396 | writer = tf.python_io.TFRecordWriter(output_file) 397 | 398 | for (ex_index, example) in enumerate(examples): 399 | if ex_index % 10000 == 0: 400 | tf.logging.info("Writing example %d of %d" % (ex_index, len(examples))) 401 | 402 | feature = convert_single_example(ex_index, example, label_list, 403 | max_seq_length, tokenizer) 404 | 405 | def create_int_feature(values): 406 | f = tf.train.Feature(int64_list=tf.train.Int64List(value=list(values))) 407 | return f 408 | 409 | features = collections.OrderedDict() 410 | features["input_ids"] = create_int_feature(feature.input_ids) 411 | features["input_mask"] = create_int_feature(feature.input_mask) 412 | features["segment_ids"] = create_int_feature(feature.segment_ids) 413 | features["label_ids"] = create_int_feature([feature.label_id]) 414 | features["is_real_example"] = create_int_feature( 415 | [int(feature.is_real_example)]) 416 | 417 | tf_example = tf.train.Example(features=tf.train.Features(feature=features)) 418 | writer.write(tf_example.SerializeToString()) 419 | writer.close() 420 | 421 | 422 | def file_based_input_fn_builder(input_file, seq_length, is_training, 423 | drop_remainder): 424 | """Creates an `input_fn` closure to be passed to TPUEstimator.""" 425 | 426 | name_to_features = { 427 | "input_ids": tf.FixedLenFeature([seq_length], tf.int64), 428 | "input_mask": tf.FixedLenFeature([seq_length], tf.int64), 429 | "segment_ids": tf.FixedLenFeature([seq_length], tf.int64), 430 | "label_ids": tf.FixedLenFeature([], tf.int64), 431 | "is_real_example": tf.FixedLenFeature([], tf.int64), 432 | } 433 | 434 | def _decode_record(record, name_to_features): 435 | """Decodes a record to a TensorFlow example.""" 436 | example = tf.parse_single_example(record, name_to_features) 437 | 438 | # tf.Example only supports tf.int64, but the TPU only supports tf.int32. 439 | # So cast all int64 to int32. 440 | for name in list(example.keys()): 441 | t = example[name] 442 | if t.dtype == tf.int64: 443 | t = tf.to_int32(t) 444 | example[name] = t 445 | 446 | return example 447 | 448 | def input_fn(params): 449 | """The actual input function.""" 450 | batch_size = params["batch_size"] 451 | 452 | # For training, we want a lot of parallel reading and shuffling. 
453 | # For eval, we want no shuffling and parallel reading doesn't matter. 454 | d = tf.data.TFRecordDataset(input_file) 455 | if is_training: 456 | d = d.repeat() 457 | d = d.shuffle(buffer_size=100) 458 | 459 | d = d.apply( 460 | tf.contrib.data.map_and_batch( 461 | lambda record: _decode_record(record, name_to_features), 462 | batch_size=batch_size, 463 | drop_remainder=drop_remainder)) 464 | 465 | return d 466 | 467 | return input_fn 468 | 469 | 470 | def _truncate_seq_pair(tokens_a, tokens_b, max_length): 471 | """Truncates a sequence pair in place to the maximum length.""" 472 | 473 | # This is a simple heuristic which will always truncate the longer sequence 474 | # one token at a time. This makes more sense than truncating an equal percent 475 | # of tokens from each, since if one sequence is very short then each token 476 | # that's truncated likely contains more information than a longer sequence. 477 | while True: 478 | total_length = len(tokens_a) + len(tokens_b) 479 | if total_length <= max_length: 480 | break 481 | if len(tokens_a) > len(tokens_b): 482 | tokens_a.pop() 483 | else: 484 | tokens_b.pop() 485 | 486 | 487 | def create_model(bert_config, is_training, input_ids, input_mask, segment_ids, 488 | labels, num_labels, use_one_hot_embeddings): 489 | """Creates a classification model.""" 490 | model = modeling.BertModel( 491 | config=bert_config, 492 | is_training=is_training, 493 | input_ids=input_ids, 494 | input_mask=input_mask, 495 | token_type_ids=segment_ids, 496 | use_one_hot_embeddings=use_one_hot_embeddings) 497 | 498 | # In the demo, we are doing a simple classification task on the entire 499 | # segment. 500 | # 501 | # If you want to use the token-level output, use model.get_sequence_output() 502 | # instead. 503 | output_layer = model.get_pooled_output() 504 | 505 | hidden_size = output_layer.shape[-1].value 506 | 507 | output_weights = tf.get_variable( 508 | "output_weights", [num_labels, hidden_size], 509 | initializer=tf.truncated_normal_initializer(stddev=0.02)) 510 | 511 | output_bias = tf.get_variable( 512 | "output_bias", [num_labels], initializer=tf.zeros_initializer()) 513 | 514 | with tf.variable_scope("loss"): 515 | if is_training: 516 | # I.e., 0.1 dropout 517 | output_layer = tf.nn.dropout(output_layer, keep_prob=0.9) 518 | 519 | logits = tf.matmul(output_layer, output_weights, transpose_b=True) 520 | logits = tf.nn.bias_add(logits, output_bias) 521 | probabilities = tf.nn.softmax(logits, axis=-1) 522 | log_probs = tf.nn.log_softmax(logits, axis=-1) 523 | 524 | one_hot_labels = tf.one_hot(labels, depth=num_labels, dtype=tf.float32) 525 | 526 | per_example_loss = -tf.reduce_sum(one_hot_labels * log_probs, axis=-1) 527 | loss = tf.reduce_mean(per_example_loss) 528 | 529 | return (loss, per_example_loss, logits, probabilities) 530 | 531 | 532 | def model_fn_builder(bert_config, num_labels, init_checkpoint, learning_rate, 533 | num_train_steps, num_warmup_steps, use_tpu, 534 | use_one_hot_embeddings): 535 | """Returns `model_fn` closure for TPUEstimator.""" 536 | 537 | def model_fn(features, labels, mode, params): # pylint: disable=unused-argument 538 | """The `model_fn` for TPUEstimator.""" 539 | 540 | tf.logging.info("*** Features ***") 541 | for name in sorted(features.keys()): 542 | tf.logging.info(" name = %s, shape = %s" % (name, features[name].shape)) 543 | 544 | input_ids = features["input_ids"] 545 | input_mask = features["input_mask"] 546 | segment_ids = features["segment_ids"] 547 | label_ids = features["label_ids"] 548 | is_real_example = 
None 549 | if "is_real_example" in features: 550 | is_real_example = tf.cast(features["is_real_example"], dtype=tf.float32) 551 | else: 552 | is_real_example = tf.ones(tf.shape(label_ids), dtype=tf.float32) 553 | 554 | is_training = (mode == tf.estimator.ModeKeys.TRAIN) 555 | 556 | (total_loss, per_example_loss, logits, probabilities) = create_model( 557 | bert_config, is_training, input_ids, input_mask, segment_ids, label_ids, 558 | num_labels, use_one_hot_embeddings) 559 | 560 | tvars = tf.trainable_variables() 561 | initialized_variable_names = {} 562 | scaffold_fn = None 563 | if init_checkpoint: 564 | (assignment_map, initialized_variable_names 565 | ) = modeling.get_assignment_map_from_checkpoint(tvars, init_checkpoint) 566 | if use_tpu: 567 | 568 | def tpu_scaffold(): 569 | tf.train.init_from_checkpoint(init_checkpoint, assignment_map) 570 | return tf.train.Scaffold() 571 | 572 | scaffold_fn = tpu_scaffold 573 | else: 574 | tf.train.init_from_checkpoint(init_checkpoint, assignment_map) 575 | 576 | tf.logging.info("**** Trainable Variables ****") 577 | for var in tvars: 578 | init_string = "" 579 | if var.name in initialized_variable_names: 580 | init_string = ", *INIT_FROM_CKPT*" 581 | tf.logging.info(" name = %s, shape = %s%s", var.name, var.shape, 582 | init_string) 583 | 584 | output_spec = None 585 | if mode == tf.estimator.ModeKeys.TRAIN: 586 | 587 | train_op = optimization.create_optimizer( 588 | total_loss, learning_rate, num_train_steps, num_warmup_steps, use_tpu) 589 | 590 | output_spec = tf.contrib.tpu.TPUEstimatorSpec( 591 | mode=mode, 592 | loss=total_loss, 593 | train_op=train_op, 594 | scaffold_fn=scaffold_fn) 595 | elif mode == tf.estimator.ModeKeys.EVAL: 596 | 597 | def metric_fn(per_example_loss, label_ids, logits, is_real_example): 598 | predictions = tf.argmax(logits, axis=-1, output_type=tf.int32) 599 | accuracy = tf.metrics.accuracy( 600 | labels=label_ids, predictions=predictions, weights=is_real_example) 601 | loss = tf.metrics.mean(values=per_example_loss, weights=is_real_example) 602 | return { 603 | "eval_accuracy": accuracy, 604 | "eval_loss": loss, 605 | } 606 | 607 | eval_metrics = (metric_fn, 608 | [per_example_loss, label_ids, logits, is_real_example]) 609 | output_spec = tf.contrib.tpu.TPUEstimatorSpec( 610 | mode=mode, 611 | loss=total_loss, 612 | eval_metrics=eval_metrics, 613 | scaffold_fn=scaffold_fn) 614 | else: 615 | output_spec = tf.contrib.tpu.TPUEstimatorSpec( 616 | mode=mode, 617 | predictions={"probabilities": probabilities}, 618 | scaffold_fn=scaffold_fn) 619 | return output_spec 620 | 621 | return model_fn 622 | 623 | 624 | # This function is not used by this file but is still used by the Colab and 625 | # people who depend on it. 626 | def input_fn_builder(features, seq_length, is_training, drop_remainder): 627 | """Creates an `input_fn` closure to be passed to TPUEstimator.""" 628 | 629 | all_input_ids = [] 630 | all_input_mask = [] 631 | all_segment_ids = [] 632 | all_label_ids = [] 633 | 634 | for feature in features: 635 | all_input_ids.append(feature.input_ids) 636 | all_input_mask.append(feature.input_mask) 637 | all_segment_ids.append(feature.segment_ids) 638 | all_label_ids.append(feature.label_id) 639 | 640 | def input_fn(params): 641 | """The actual input function.""" 642 | batch_size = params["batch_size"] 643 | 644 | num_examples = len(features) 645 | 646 | # This is for demo purposes and does NOT scale to large data sets. 
We do 647 | # not use Dataset.from_generator() because that uses tf.py_func which is 648 | # not TPU compatible. The right way to load data is with TFRecordReader. 649 | d = tf.data.Dataset.from_tensor_slices({ 650 | "input_ids": 651 | tf.constant( 652 | all_input_ids, shape=[num_examples, seq_length], 653 | dtype=tf.int32), 654 | "input_mask": 655 | tf.constant( 656 | all_input_mask, 657 | shape=[num_examples, seq_length], 658 | dtype=tf.int32), 659 | "segment_ids": 660 | tf.constant( 661 | all_segment_ids, 662 | shape=[num_examples, seq_length], 663 | dtype=tf.int32), 664 | "label_ids": 665 | tf.constant(all_label_ids, shape=[num_examples], dtype=tf.int32), 666 | }) 667 | 668 | if is_training: 669 | d = d.repeat() 670 | d = d.shuffle(buffer_size=100) 671 | 672 | d = d.batch(batch_size=batch_size, drop_remainder=drop_remainder) 673 | return d 674 | 675 | return input_fn 676 | 677 | 678 | # This function is not used by this file but is still used by the Colab and 679 | # people who depend on it. 680 | def convert_examples_to_features(examples, label_list, max_seq_length, 681 | tokenizer): 682 | """Convert a set of `InputExample`s to a list of `InputFeatures`.""" 683 | 684 | features = [] 685 | for (ex_index, example) in enumerate(examples): 686 | if ex_index % 10000 == 0: 687 | tf.logging.info("Writing example %d of %d" % (ex_index, len(examples))) 688 | 689 | feature = convert_single_example(ex_index, example, label_list, 690 | max_seq_length, tokenizer) 691 | 692 | features.append(feature) 693 | return features 694 | 695 | 696 | def main(_): 697 | tf.logging.set_verbosity(tf.logging.INFO) 698 | 699 | if not FLAGS.do_train and not FLAGS.do_eval and not FLAGS.do_predict: 700 | raise ValueError( 701 | "At least one of `do_train`, `do_eval` or `do_predict' must be True.") 702 | 703 | bert_config = modeling.BertConfig.from_json_file(FLAGS.bert_config_file) 704 | 705 | if FLAGS.max_seq_length > bert_config.max_position_embeddings: 706 | raise ValueError( 707 | "Cannot use sequence length %d because the BERT model " 708 | "was only trained up to sequence length %d" % 709 | (FLAGS.max_seq_length, bert_config.max_position_embeddings)) 710 | 711 | tf.gfile.MakeDirs(FLAGS.output_dir) 712 | 713 | processor = CommonsenseQAProcessor(split=FLAGS.split, annotator_idx=FLAGS.annotator_idx, 714 | augment_ratio=FLAGS.augment_ratio, take_number=FLAGS.take_number) 715 | 716 | label_list = processor.get_labels() 717 | 718 | tokenizer = tokenization.FullTokenizer( 719 | vocab_file=FLAGS.vocab_file, do_lower_case=FLAGS.do_lower_case) 720 | 721 | tpu_cluster_resolver = None 722 | if FLAGS.use_tpu and FLAGS.tpu_name: 723 | tpu_cluster_resolver = tf.contrib.cluster_resolver.TPUClusterResolver( 724 | FLAGS.tpu_name, zone=FLAGS.tpu_zone, project=FLAGS.gcp_project) 725 | 726 | is_per_host = tf.contrib.tpu.InputPipelineConfig.PER_HOST_V2 727 | run_config = tf.contrib.tpu.RunConfig( 728 | cluster=tpu_cluster_resolver, 729 | master=FLAGS.master, 730 | model_dir=FLAGS.output_dir, 731 | save_checkpoints_steps=FLAGS.save_checkpoints_steps, 732 | tpu_config=tf.contrib.tpu.TPUConfig( 733 | iterations_per_loop=FLAGS.iterations_per_loop, 734 | num_shards=FLAGS.num_tpu_cores, 735 | per_host_input_for_training=is_per_host)) 736 | 737 | train_examples = None 738 | num_train_steps = None 739 | num_warmup_steps = None 740 | if FLAGS.do_train: 741 | train_examples = processor.get_train_examples(FLAGS.data_dir) 742 | num_train_steps = int( 743 | len(train_examples) / FLAGS.train_batch_size * FLAGS.num_train_epochs) 744 | 
num_warmup_steps = int(num_train_steps * FLAGS.warmup_proportion) 745 | 746 | model_fn = model_fn_builder( 747 | bert_config=bert_config, 748 | num_labels=len(label_list), 749 | init_checkpoint=FLAGS.init_checkpoint, 750 | learning_rate=FLAGS.learning_rate, 751 | num_train_steps=num_train_steps, 752 | num_warmup_steps=num_warmup_steps, 753 | use_tpu=FLAGS.use_tpu, 754 | use_one_hot_embeddings=FLAGS.use_tpu) 755 | 756 | # If TPU is not available, this will fall back to normal Estimator on CPU 757 | # or GPU. 758 | estimator = tf.contrib.tpu.TPUEstimator( 759 | use_tpu=FLAGS.use_tpu, 760 | model_fn=model_fn, 761 | config=run_config, 762 | train_batch_size=FLAGS.train_batch_size, 763 | eval_batch_size=FLAGS.eval_batch_size, 764 | predict_batch_size=FLAGS.predict_batch_size) 765 | 766 | if FLAGS.do_train: 767 | train_file = os.path.join(FLAGS.output_dir, "train.tf_record") 768 | train_seq_length = file_based_convert_examples_to_features( 769 | train_examples, label_list, FLAGS.max_seq_length, tokenizer, train_file) 770 | tf.logging.info("***** Running training *****") 771 | tf.logging.info(" Num examples = %d", len(train_examples)) 772 | tf.logging.info(" Batch size = %d", FLAGS.train_batch_size) 773 | tf.logging.info(" Num steps = %d", num_train_steps) 774 | tf.logging.info(" Longest training sequence = %d", train_seq_length) 775 | train_input_fn = file_based_input_fn_builder( 776 | input_file=train_file, 777 | seq_length=FLAGS.max_seq_length, 778 | is_training=True, 779 | drop_remainder=True) 780 | estimator.train(input_fn=train_input_fn, max_steps=num_train_steps) 781 | 782 | if FLAGS.do_eval: 783 | eval_examples = processor.get_dev_examples(FLAGS.data_dir) 784 | eval_file = os.path.join(FLAGS.output_dir, "eval.tf_record") 785 | eval_seq_length = file_based_convert_examples_to_features( 786 | eval_examples, label_list, FLAGS.max_seq_length, tokenizer, eval_file) 787 | 788 | tf.logging.info("***** Running evaluation *****") 789 | tf.logging.info(" Num examples = %d", len(eval_examples)) 790 | tf.logging.info(" Batch size = %d", FLAGS.eval_batch_size) 791 | tf.logging.info(" Longest eval sequence = %d", eval_seq_length) 792 | 793 | # This tells the estimator to run through the entire set. 794 | eval_steps = None 795 | # However, if running eval on the TPU, you will need to specify the 796 | # number of steps. 797 | if FLAGS.use_tpu: 798 | # Eval will be slightly WRONG on the TPU because it will truncate 799 | # the last batch. 
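      # Illustrative arithmetic (ours, not part of the original script): with
      # 1,003 eval examples and eval_batch_size=8, eval_steps = int(1003 / 8)
      # = 125, so 125 * 8 = 1,000 examples are scored and the last 3 are dropped.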
800 | eval_steps = int(len(eval_examples) / FLAGS.eval_batch_size) 801 | 802 | eval_drop_remainder = True if FLAGS.use_tpu else False 803 | eval_input_fn = file_based_input_fn_builder( 804 | input_file=eval_file, 805 | seq_length=FLAGS.max_seq_length, 806 | is_training=False, 807 | drop_remainder=eval_drop_remainder) 808 | 809 | result = estimator.evaluate(input_fn=eval_input_fn, steps=eval_steps) 810 | 811 | output_eval_file = os.path.join(FLAGS.output_dir, "eval_results.txt") 812 | with tf.gfile.GFile(output_eval_file, "w") as writer: 813 | tf.logging.info("***** Eval results *****") 814 | for key in sorted(result.keys()): 815 | tf.logging.info(" %s = %s", key, str(result[key])) 816 | writer.write("%s = %s\n" % (key, str(result[key]))) 817 | 818 | if FLAGS.do_predict: 819 | predict_examples = processor.get_test_examples(FLAGS.data_dir) 820 | num_actual_predict_examples = len(predict_examples) 821 | predict_file = os.path.join(FLAGS.output_dir, "predict.tf_record") 822 | predict_seq_length = file_based_convert_examples_to_features( 823 | predict_examples, label_list, 824 | FLAGS.max_seq_length, tokenizer, 825 | predict_file) 826 | 827 | tf.logging.info("***** Running prediction*****") 828 | tf.logging.info(" Num examples = %d", len(predict_examples)) 829 | tf.logging.info(" Batch size = %d", FLAGS.predict_batch_size) 830 | tf.logging.info(" Longest predict sequence = %d", predict_seq_length) 831 | 832 | if FLAGS.use_tpu: 833 | # Warning: According to tpu_estimator.py Prediction on TPU is an 834 | # experimental feature and hence not supported here 835 | raise ValueError("Prediction in TPU not supported") 836 | 837 | predict_drop_remainder = True if FLAGS.use_tpu else False 838 | predict_input_fn = file_based_input_fn_builder( 839 | input_file=predict_file, 840 | seq_length=FLAGS.max_seq_length, 841 | is_training=False, 842 | drop_remainder=predict_drop_remainder) 843 | 844 | result = estimator.predict(input_fn=predict_input_fn) 845 | 846 | output_predict_file = os.path.join(FLAGS.output_dir, "test_results.tsv") 847 | with tf.gfile.GFile(output_predict_file, "w") as writer: 848 | num_written_lines = 0 849 | tf.logging.info("***** Predict results *****") 850 | for (i, prediction) in enumerate(result): 851 | probabilities = prediction["probabilities"] 852 | if i >= num_actual_predict_examples: 853 | break 854 | output_line = "\t".join( 855 | str(class_probability) 856 | for class_probability in probabilities) + "\n" 857 | writer.write(output_line) 858 | num_written_lines += 1 859 | assert num_written_lines == num_actual_predict_examples 860 | 861 | 862 | if __name__ == "__main__": 863 | flags.mark_flag_as_required("data_dir") 864 | flags.mark_flag_as_required("vocab_file") 865 | flags.mark_flag_as_required("bert_config_file") 866 | flags.mark_flag_as_required("output_dir") 867 | tf.app.run() 868 | -------------------------------------------------------------------------------- /model_fine_tuning_scripts/run_mnli.py: -------------------------------------------------------------------------------- 1 | """Run BERT on MNLI.""" 2 | 3 | from __future__ import absolute_import 4 | from __future__ import division 5 | from __future__ import print_function 6 | 7 | import collections 8 | import json 9 | import os 10 | 11 | import numpy as np 12 | import tensorflow as tf 13 | 14 | import modeling 15 | import optimization 16 | import tokenization 17 | 18 | 19 | flags = tf.flags 20 | 21 | FLAGS = flags.FLAGS 22 | 23 | ## Required parameters 24 | flags.DEFINE_string( 25 | "data_dir", None, 26 | "The input 
data dir. Should contain the .tsv files (or other data files) " 27 | "for the task.") 28 | 29 | flags.DEFINE_string( 30 | "bert_config_file", None, 31 | "The config json file corresponding to the pre-trained BERT model. " 32 | "This specifies the model architecture.") 33 | 34 | flags.DEFINE_string("vocab_file", None, 35 | "The vocabulary file that the BERT model was trained on.") 36 | 37 | flags.DEFINE_string( 38 | "output_dir", None, 39 | "The output directory where the model checkpoints will be written.") 40 | 41 | flags.DEFINE_string( 42 | "split", None, 43 | "The split you'd like to run on, either 'annotator' or 'rand'.") 44 | 45 | flags.DEFINE_integer( 46 | "annotator_idx", 0, 47 | "Index of the annotator to split by.") 48 | 49 | flags.DEFINE_float( 50 | "augment_ratio", 0.0, 51 | "Proportion of dev examples moved to training set.") 52 | 53 | flags.DEFINE_integer( 54 | "take_number", 1, 55 | "In case this is a re-execution of previous experiment.") 56 | 57 | flags.DEFINE_bool( 58 | "annotator_labels", False, 59 | "Whether to use top 5 annotator ids as labels " 60 | "(+ other label for all other annotators).") 61 | 62 | 63 | ## Other parameters 64 | 65 | flags.DEFINE_string( 66 | "init_checkpoint", None, 67 | "Initial checkpoint (usually from a pre-trained BERT model).") 68 | 69 | flags.DEFINE_bool( 70 | "do_lower_case", True, 71 | "Whether to lower case the input text. Should be True for uncased " 72 | "models and False for cased models.") 73 | 74 | flags.DEFINE_integer( 75 | "max_seq_length", 128, 76 | "The maximum total input sequence length after WordPiece tokenization. " 77 | "Sequences longer than this will be truncated, and sequences shorter " 78 | "than this will be padded.") 79 | 80 | flags.DEFINE_bool("do_train", False, "Whether to run training.") 81 | 82 | flags.DEFINE_bool("do_eval", False, "Whether to run eval on the dev set.") 83 | 84 | flags.DEFINE_bool( 85 | "do_predict", False, 86 | "Whether to run the model in inference mode on the test set.") 87 | 88 | flags.DEFINE_integer("train_batch_size", 32, "Total batch size for training.") 89 | 90 | flags.DEFINE_integer("eval_batch_size", 8, "Total batch size for eval.") 91 | 92 | flags.DEFINE_integer("predict_batch_size", 8, "Total batch size for predict.") 93 | 94 | flags.DEFINE_float("learning_rate", 5e-5, "The initial learning rate for Adam.") 95 | 96 | flags.DEFINE_float("num_train_epochs", 3.0, 97 | "Total number of training epochs to perform.") 98 | 99 | flags.DEFINE_float( 100 | "warmup_proportion", 0.1, 101 | "Proportion of training to perform linear learning rate warmup for. " 102 | "E.g., 0.1 = 10% of training.") 103 | 104 | flags.DEFINE_integer("save_checkpoints_steps", 1000, 105 | "How often to save the model checkpoint.") 106 | 107 | flags.DEFINE_integer("iterations_per_loop", 1000, 108 | "How many steps to make in each estimator call.") 109 | 110 | flags.DEFINE_bool("use_tpu", False, "Whether to use TPU or GPU/CPU.") 111 | 112 | tf.flags.DEFINE_string( 113 | "tpu_name", None, 114 | "The Cloud TPU to use for training. This should be either the name " 115 | "used when creating the Cloud TPU, or a grpc://ip.address.of.tpu:8470 " 116 | "url.") 117 | 118 | tf.flags.DEFINE_string( 119 | "tpu_zone", None, 120 | "[Optional] GCE zone where the Cloud TPU is located in. If not " 121 | "specified, we will attempt to automatically detect the GCE project from " 122 | "metadata.") 123 | 124 | tf.flags.DEFINE_string( 125 | "gcp_project", None, 126 | "[Optional] Project name for the Cloud TPU-enabled project. 
If not " 127 | "specified, we will attempt to automatically detect the GCE project from " 128 | "metadata.") 129 | 130 | tf.flags.DEFINE_string("master", None, "[Optional] TensorFlow master URL.") 131 | 132 | flags.DEFINE_integer( 133 | "num_tpu_cores", 8, 134 | "Only used if `use_tpu` is True. Total number of TPU cores to use.") 135 | 136 | 137 | class InputExample(object): 138 | """A single training/test example for simple sequence classification.""" 139 | 140 | def __init__(self, guid, text_a, text_b=None, label=None): 141 | """Constructs a InputExample. 142 | 143 | Args: 144 | guid: Unique id for the example. 145 | text_a: string. The untokenized text of the first sequence. For single 146 | sequence tasks, only this sequence must be specified. 147 | text_b: (Optional) string. The untokenized text of the second sequence. 148 | Only must be specified for sequence pair tasks. 149 | label: (Optional) string. The label of the example. This should be 150 | specified for train and dev examples, but not for test examples. 151 | """ 152 | self.guid = guid 153 | self.text_a = text_a 154 | self.text_b = text_b 155 | self.label = label 156 | 157 | 158 | class PaddingInputExample(object): 159 | """Fake example so the num input examples is a multiple of the batch size. 160 | 161 | When running eval/predict on the TPU, we need to pad the number of examples 162 | to be a multiple of the batch size, because the TPU requires a fixed batch 163 | size. The alternative is to drop the last batch, which is bad because it means 164 | the entire output data won't be generated. 165 | 166 | We use this class instead of `None` because treating `None` as padding 167 | battches could cause silent errors. 168 | """ 169 | 170 | 171 | class InputFeatures(object): 172 | """A single set of features of data.""" 173 | 174 | def __init__(self, 175 | input_ids, 176 | input_mask, 177 | segment_ids, 178 | label_id, 179 | is_real_example=True): 180 | self.input_ids = input_ids 181 | self.input_mask = input_mask 182 | self.segment_ids = segment_ids 183 | self.label_id = label_id 184 | self.is_real_example = is_real_example 185 | 186 | 187 | class DataProcessor(object): 188 | """Base class for data converters for sequence classification data sets.""" 189 | 190 | def get_train_examples(self, data_dir): 191 | """Gets a collection of `InputExample`s for the train set.""" 192 | raise NotImplementedError() 193 | 194 | def get_dev_examples(self, data_dir): 195 | """Gets a collection of `InputExample`s for the dev set.""" 196 | raise NotImplementedError() 197 | 198 | def get_test_examples(self, data_dir): 199 | """Gets a collection of `InputExample`s for prediction.""" 200 | raise NotImplementedError() 201 | 202 | def get_labels(self): 203 | """Gets the list of labels for this data set.""" 204 | raise NotImplementedError() 205 | 206 | @classmethod 207 | def _read_json(cls, input_file): 208 | """Reads a JSON file.""" 209 | with tf.gfile.Open(input_file, "r") as f: 210 | return json.load(f) 211 | 212 | 213 | class MnliProcessor(DataProcessor): 214 | """Processor for the MultiNLI data set (GLUE version).""" 215 | 216 | SPLIT_TO_NAME = { 217 | 'annotator': 'annotator_{annotator_idx}{augment_ratio}{take_number}', 218 | 'annotator_multi': 'annotator_multi_{annotator_idx}{augment_ratio}{take_number}', 219 | 'rand': 'rand_{annotator_idx}{augment_ratio}{take_number}', 220 | 'rand_multi': 'rand_multi_{annotator_idx}{augment_ratio}{take_number}', 221 | 'with_annotator': 'with_annotator_id', 222 | 'without_annotator': 'without_annotator_id' 223 | 
224 | 
225 |   TRAIN_FILE_NAME = 'train_{split_name}.json'
226 |   DEV_FILE_NAME = 'dev_{split_name}.json'
227 |   TEST_FILE_NAME = 'dev_{split_name}.json'  # note: 'test' reads the same file as 'dev'
228 | 
229 |   def __init__(self, split, annotator_idx, augment_ratio, take_number, annotator_labels):
230 |     if split not in self.SPLIT_TO_NAME.keys():
231 |       raise ValueError(
232 |           f'split must be one of {", ".join(self.SPLIT_TO_NAME.keys())}.')
233 | 
234 |     self.split = split
235 |     self.annotator_idx = annotator_idx
236 |     self.augment_ratio = "_augment{}".format(augment_ratio) if augment_ratio > 0 else ""
237 |     self.take = "_take{}".format(take_number) if take_number > 1 else ""
238 |     self.annotator_labels = annotator_labels
239 | 
240 |   def get_train_examples(self, data_dir):
241 |     train_file_name = self.TRAIN_FILE_NAME.format(
242 |         split_name=self.SPLIT_TO_NAME[self.split].format(annotator_idx=self.annotator_idx, augment_ratio=self.augment_ratio, take_number=self.take))
243 | 
244 |     return self._create_examples(
245 |         self._read_json(os.path.join(data_dir, train_file_name)),
246 |         'train')
247 | 
248 |   def get_dev_examples(self, data_dir):
249 |     dev_file_name = self.DEV_FILE_NAME.format(
250 |         split_name=self.SPLIT_TO_NAME[self.split].format(annotator_idx=self.annotator_idx, augment_ratio=self.augment_ratio, take_number=self.take))
251 | 
252 |     return self._create_examples(
253 |         self._read_json(os.path.join(data_dir, dev_file_name)),
254 |         'dev')
255 | 
256 |   def get_test_examples(self, data_dir):
257 |     test_file_name = self.TEST_FILE_NAME.format(
258 |         split_name=self.SPLIT_TO_NAME[self.split].format(annotator_idx=self.annotator_idx, augment_ratio=self.augment_ratio, take_number=self.take))
259 | 
260 |     return self._create_examples(
261 |         self._read_json(os.path.join(data_dir, test_file_name)),
262 |         'test')
263 | 
264 |   def get_labels(self):
265 |     """See base class."""
266 |     if self.annotator_labels:
267 |       # These are anonymized annotator IDs.
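      # The five most prolific annotators are mapped to ANNOT1..ANNOT5; every
      # other annotator is collapsed into the catch-all OTHER class (see
      # _create_examples below).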
268 | return ["ANNOT1", "ANNOT2", "ANNOT3", "ANNOT4", "ANNOT5", "OTHER"] 269 | else: 270 | return ["contradiction", "entailment", "neutral"] 271 | 272 | def _create_examples(self, lines, set_type): 273 | """Creates examples for the training and dev sets.""" 274 | labels = self.get_labels() 275 | examples = [] 276 | for i, line in enumerate(lines): 277 | guid = tokenization.convert_to_unicode(line["pairID"]) 278 | 279 | text_a = tokenization.convert_to_unicode(line["sentence1"]) 280 | text_b = tokenization.convert_to_unicode(line["sentence2"]) 281 | if self.annotator_labels: 282 | if set_type == "test": 283 | label = "OTHER" 284 | else: 285 | if line["turkIdAnonymized"] in labels: 286 | label = tokenization.convert_to_unicode(line["turkIdAnonymized"]) 287 | else: 288 | label = "OTHER" 289 | else: 290 | if set_type == "test": 291 | label = "contradiction" 292 | else: 293 | label = tokenization.convert_to_unicode(line["gold_label"]) 294 | examples.append( 295 | InputExample(guid=guid, text_a=text_a, text_b=text_b, label=label)) 296 | 297 | return examples 298 | 299 | 300 | def convert_single_example(ex_index, example, label_list, max_seq_length, 301 | tokenizer): 302 | """Converts a single `InputExample` into a single `InputFeatures`.""" 303 | 304 | if isinstance(example, PaddingInputExample): 305 | return InputFeatures( 306 | input_ids=[0] * max_seq_length, 307 | input_mask=[0] * max_seq_length, 308 | segment_ids=[0] * max_seq_length, 309 | label_id=0, 310 | is_real_example=False) 311 | 312 | label_map = {} 313 | for (i, label) in enumerate(label_list): 314 | label_map[label] = i 315 | 316 | tokens_a = tokenizer.tokenize(example.text_a) 317 | tokens_b = None 318 | if example.text_b: 319 | tokens_b = tokenizer.tokenize(example.text_b) 320 | 321 | if tokens_b: 322 | # Modifies `tokens_a` and `tokens_b` in place so that the total 323 | # length is less than the specified length. 324 | # Account for [CLS], [SEP], [SEP] with "- 3" 325 | _truncate_seq_pair(tokens_a, tokens_b, max_seq_length - 3) 326 | else: 327 | # Account for [CLS] and [SEP] with "- 2" 328 | if len(tokens_a) > max_seq_length - 2: 329 | tokens_a = tokens_a[0:(max_seq_length - 2)] 330 | 331 | # The convention in BERT is: 332 | # (a) For sequence pairs: 333 | # tokens: [CLS] is this jack ##son ##ville ? [SEP] no it is not . [SEP] 334 | # type_ids: 0 0 0 0 0 0 0 0 1 1 1 1 1 1 335 | # (b) For single sequences: 336 | # tokens: [CLS] the dog is hairy . [SEP] 337 | # type_ids: 0 0 0 0 0 0 0 338 | # 339 | # Where "type_ids" are used to indicate whether this is the first 340 | # sequence or the second sequence. The embedding vectors for `type=0` and 341 | # `type=1` were learned during pre-training and are added to the wordpiece 342 | # embedding vector (and position vector). This is not *strictly* necessary 343 | # since the [SEP] token unambiguously separates the sequences, but it makes 344 | # it easier for the model to learn the concept of sequences. 345 | # 346 | # For classification tasks, the first vector (corresponding to [CLS]) is 347 | # used as the "sentence vector". Note that this only makes sense because 348 | # the entire model is fine-tuned. 
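  # Illustrative encoding (ours) of an MNLI pair under this convention:
  #   text_a: "The man is playing guitar."  text_b: "A person makes music."
  #   tokens:   [CLS] the man is playing guitar . [SEP] a person makes music . [SEP]
  #   type_ids:   0    0   0  0    0       0   0   0    1    1      1     1   1   1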
349 | tokens = [] 350 | segment_ids = [] 351 | tokens.append("[CLS]") 352 | segment_ids.append(0) 353 | for token in tokens_a: 354 | tokens.append(token) 355 | segment_ids.append(0) 356 | tokens.append("[SEP]") 357 | segment_ids.append(0) 358 | 359 | if tokens_b: 360 | for token in tokens_b: 361 | tokens.append(token) 362 | segment_ids.append(1) 363 | tokens.append("[SEP]") 364 | segment_ids.append(1) 365 | 366 | input_ids = tokenizer.convert_tokens_to_ids(tokens) 367 | 368 | # The mask has 1 for real tokens and 0 for padding tokens. Only real 369 | # tokens are attended to. 370 | input_mask = [1] * len(input_ids) 371 | 372 | # Zero-pad up to the sequence length. 373 | while len(input_ids) < max_seq_length: 374 | input_ids.append(0) 375 | input_mask.append(0) 376 | segment_ids.append(0) 377 | 378 | assert len(input_ids) == max_seq_length 379 | assert len(input_mask) == max_seq_length 380 | assert len(segment_ids) == max_seq_length 381 | 382 | label_id = label_map[example.label] 383 | if ex_index < 5: 384 | tf.logging.info("*** Example ***") 385 | tf.logging.info("guid: %s" % (example.guid)) 386 | tf.logging.info("tokens: %s" % " ".join( 387 | [tokenization.printable_text(x) for x in tokens])) 388 | tf.logging.info("input_ids: %s" % " ".join([str(x) for x in input_ids])) 389 | tf.logging.info("input_mask: %s" % " ".join([str(x) for x in input_mask])) 390 | tf.logging.info("segment_ids: %s" % " ".join([str(x) for x in segment_ids])) 391 | tf.logging.info("label: %s (id = %d)" % (example.label, label_id)) 392 | 393 | feature = InputFeatures( 394 | input_ids=input_ids, 395 | input_mask=input_mask, 396 | segment_ids=segment_ids, 397 | label_id=label_id, 398 | is_real_example=True) 399 | return feature 400 | 401 | 402 | def file_based_convert_examples_to_features( 403 | examples, label_list, max_seq_length, tokenizer, output_file): 404 | """Convert a set of `InputExample`s to a TFRecord file.""" 405 | 406 | writer = tf.python_io.TFRecordWriter(output_file) 407 | 408 | for (ex_index, example) in enumerate(examples): 409 | if ex_index % 10000 == 0: 410 | tf.logging.info("Writing example %d of %d" % (ex_index, len(examples))) 411 | 412 | feature = convert_single_example(ex_index, example, label_list, 413 | max_seq_length, tokenizer) 414 | 415 | def create_int_feature(values): 416 | f = tf.train.Feature(int64_list=tf.train.Int64List(value=list(values))) 417 | return f 418 | 419 | features = collections.OrderedDict() 420 | features["input_ids"] = create_int_feature(feature.input_ids) 421 | features["input_mask"] = create_int_feature(feature.input_mask) 422 | features["segment_ids"] = create_int_feature(feature.segment_ids) 423 | features["label_ids"] = create_int_feature([feature.label_id]) 424 | features["is_real_example"] = create_int_feature( 425 | [int(feature.is_real_example)]) 426 | 427 | tf_example = tf.train.Example(features=tf.train.Features(feature=features)) 428 | writer.write(tf_example.SerializeToString()) 429 | writer.close() 430 | 431 | 432 | def file_based_input_fn_builder(input_file, seq_length, is_training, 433 | drop_remainder): 434 | """Creates an `input_fn` closure to be passed to TPUEstimator.""" 435 | 436 | name_to_features = { 437 | "input_ids": tf.FixedLenFeature([seq_length], tf.int64), 438 | "input_mask": tf.FixedLenFeature([seq_length], tf.int64), 439 | "segment_ids": tf.FixedLenFeature([seq_length], tf.int64), 440 | "label_ids": tf.FixedLenFeature([], tf.int64), 441 | "is_real_example": tf.FixedLenFeature([], tf.int64), 442 | } 443 | 444 | def _decode_record(record, 
name_to_features): 445 | """Decodes a record to a TensorFlow example.""" 446 | example = tf.parse_single_example(record, name_to_features) 447 | 448 | # tf.Example only supports tf.int64, but the TPU only supports tf.int32. 449 | # So cast all int64 to int32. 450 | for name in list(example.keys()): 451 | t = example[name] 452 | if t.dtype == tf.int64: 453 | t = tf.to_int32(t) 454 | example[name] = t 455 | 456 | return example 457 | 458 | def input_fn(params): 459 | """The actual input function.""" 460 | batch_size = params["batch_size"] 461 | 462 | # For training, we want a lot of parallel reading and shuffling. 463 | # For eval, we want no shuffling and parallel reading doesn't matter. 464 | d = tf.data.TFRecordDataset(input_file) 465 | if is_training: 466 | d = d.repeat() 467 | d = d.shuffle(buffer_size=100) 468 | 469 | d = d.apply( 470 | tf.contrib.data.map_and_batch( 471 | lambda record: _decode_record(record, name_to_features), 472 | batch_size=batch_size, 473 | drop_remainder=drop_remainder)) 474 | 475 | return d 476 | 477 | return input_fn 478 | 479 | 480 | def _truncate_seq_pair(tokens_a, tokens_b, max_length): 481 | """Truncates a sequence pair in place to the maximum length.""" 482 | 483 | # This is a simple heuristic which will always truncate the longer sequence 484 | # one token at a time. This makes more sense than truncating an equal percent 485 | # of tokens from each, since if one sequence is very short then each token 486 | # that's truncated likely contains more information than a longer sequence. 487 | while True: 488 | total_length = len(tokens_a) + len(tokens_b) 489 | if total_length <= max_length: 490 | break 491 | if len(tokens_a) > len(tokens_b): 492 | tokens_a.pop() 493 | else: 494 | tokens_b.pop() 495 | 496 | 497 | def create_model(bert_config, is_training, input_ids, input_mask, segment_ids, 498 | labels, num_labels, use_one_hot_embeddings): 499 | """Creates a classification model.""" 500 | model = modeling.BertModel( 501 | config=bert_config, 502 | is_training=is_training, 503 | input_ids=input_ids, 504 | input_mask=input_mask, 505 | token_type_ids=segment_ids, 506 | use_one_hot_embeddings=use_one_hot_embeddings) 507 | 508 | # In the demo, we are doing a simple classification task on the entire 509 | # segment. 510 | # 511 | # If you want to use the token-level output, use model.get_sequence_output() 512 | # instead. 
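  # Shape note (ours): get_pooled_output() is the transformed [CLS] vector of
  # shape [batch_size, hidden_size] (768 for BERT-base), whereas
  # get_sequence_output() would be [batch_size, seq_length, hidden_size].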
513 | output_layer = model.get_pooled_output() 514 | 515 | hidden_size = output_layer.shape[-1].value 516 | 517 | output_weights = tf.get_variable( 518 | "output_weights", [num_labels, hidden_size], 519 | initializer=tf.truncated_normal_initializer(stddev=0.02)) 520 | 521 | output_bias = tf.get_variable( 522 | "output_bias", [num_labels], initializer=tf.zeros_initializer()) 523 | 524 | with tf.variable_scope("loss"): 525 | if is_training: 526 | # I.e., 0.1 dropout 527 | output_layer = tf.nn.dropout(output_layer, keep_prob=0.9) 528 | 529 | logits = tf.matmul(output_layer, output_weights, transpose_b=True) 530 | logits = tf.nn.bias_add(logits, output_bias) 531 | probabilities = tf.nn.softmax(logits, axis=-1) 532 | log_probs = tf.nn.log_softmax(logits, axis=-1) 533 | 534 | one_hot_labels = tf.one_hot(labels, depth=num_labels, dtype=tf.float32) 535 | 536 | per_example_loss = -tf.reduce_sum(one_hot_labels * log_probs, axis=-1) 537 | loss = tf.reduce_mean(per_example_loss) 538 | 539 | return (loss, per_example_loss, logits, probabilities) 540 | 541 | 542 | def model_fn_builder(bert_config, num_labels, init_checkpoint, learning_rate, 543 | num_train_steps, num_warmup_steps, use_tpu, 544 | use_one_hot_embeddings): 545 | """Returns `model_fn` closure for TPUEstimator.""" 546 | 547 | def model_fn(features, labels, mode, params): # pylint: disable=unused-argument 548 | """The `model_fn` for TPUEstimator.""" 549 | 550 | tf.logging.info("*** Features ***") 551 | for name in sorted(features.keys()): 552 | tf.logging.info(" name = %s, shape = %s" % (name, features[name].shape)) 553 | 554 | input_ids = features["input_ids"] 555 | input_mask = features["input_mask"] 556 | segment_ids = features["segment_ids"] 557 | label_ids = features["label_ids"] 558 | is_real_example = None 559 | if "is_real_example" in features: 560 | is_real_example = tf.cast(features["is_real_example"], dtype=tf.float32) 561 | else: 562 | is_real_example = tf.ones(tf.shape(label_ids), dtype=tf.float32) 563 | 564 | is_training = (mode == tf.estimator.ModeKeys.TRAIN) 565 | 566 | (total_loss, per_example_loss, logits, probabilities) = create_model( 567 | bert_config, is_training, input_ids, input_mask, segment_ids, label_ids, 568 | num_labels, use_one_hot_embeddings) 569 | 570 | tvars = tf.trainable_variables() 571 | initialized_variable_names = {} 572 | scaffold_fn = None 573 | if init_checkpoint: 574 | (assignment_map, initialized_variable_names 575 | ) = modeling.get_assignment_map_from_checkpoint(tvars, init_checkpoint) 576 | if use_tpu: 577 | 578 | def tpu_scaffold(): 579 | tf.train.init_from_checkpoint(init_checkpoint, assignment_map) 580 | return tf.train.Scaffold() 581 | 582 | scaffold_fn = tpu_scaffold 583 | else: 584 | tf.train.init_from_checkpoint(init_checkpoint, assignment_map) 585 | 586 | tf.logging.info("**** Trainable Variables ****") 587 | for var in tvars: 588 | init_string = "" 589 | if var.name in initialized_variable_names: 590 | init_string = ", *INIT_FROM_CKPT*" 591 | tf.logging.info(" name = %s, shape = %s%s", var.name, var.shape, 592 | init_string) 593 | 594 | output_spec = None 595 | if mode == tf.estimator.ModeKeys.TRAIN: 596 | 597 | train_op = optimization.create_optimizer( 598 | total_loss, learning_rate, num_train_steps, num_warmup_steps, use_tpu) 599 | 600 | output_spec = tf.contrib.tpu.TPUEstimatorSpec( 601 | mode=mode, 602 | loss=total_loss, 603 | train_op=train_op, 604 | scaffold_fn=scaffold_fn) 605 | elif mode == tf.estimator.ModeKeys.EVAL: 606 | 607 | def metric_fn(per_example_loss, label_ids, logits, 
is_real_example): 608 | predictions = tf.argmax(logits, axis=-1, output_type=tf.int32) 609 | accuracy = tf.metrics.accuracy( 610 | labels=label_ids, predictions=predictions, weights=is_real_example) 611 | loss = tf.metrics.mean(values=per_example_loss, weights=is_real_example) 612 | return { 613 | "eval_accuracy": accuracy, 614 | "eval_loss": loss, 615 | } 616 | 617 | eval_metrics = (metric_fn, 618 | [per_example_loss, label_ids, logits, is_real_example]) 619 | output_spec = tf.contrib.tpu.TPUEstimatorSpec( 620 | mode=mode, 621 | loss=total_loss, 622 | eval_metrics=eval_metrics, 623 | scaffold_fn=scaffold_fn) 624 | else: 625 | output_spec = tf.contrib.tpu.TPUEstimatorSpec( 626 | mode=mode, 627 | predictions={"probabilities": probabilities}, 628 | scaffold_fn=scaffold_fn) 629 | return output_spec 630 | 631 | return model_fn 632 | 633 | 634 | # This function is not used by this file but is still used by the Colab and 635 | # people who depend on it. 636 | def input_fn_builder(features, seq_length, is_training, drop_remainder): 637 | """Creates an `input_fn` closure to be passed to TPUEstimator.""" 638 | 639 | all_input_ids = [] 640 | all_input_mask = [] 641 | all_segment_ids = [] 642 | all_label_ids = [] 643 | 644 | for feature in features: 645 | all_input_ids.append(feature.input_ids) 646 | all_input_mask.append(feature.input_mask) 647 | all_segment_ids.append(feature.segment_ids) 648 | all_label_ids.append(feature.label_id) 649 | 650 | def input_fn(params): 651 | """The actual input function.""" 652 | batch_size = params["batch_size"] 653 | 654 | num_examples = len(features) 655 | 656 | # This is for demo purposes and does NOT scale to large data sets. We do 657 | # not use Dataset.from_generator() because that uses tf.py_func which is 658 | # not TPU compatible. The right way to load data is with TFRecordReader. 659 | d = tf.data.Dataset.from_tensor_slices({ 660 | "input_ids": 661 | tf.constant( 662 | all_input_ids, shape=[num_examples, seq_length], 663 | dtype=tf.int32), 664 | "input_mask": 665 | tf.constant( 666 | all_input_mask, 667 | shape=[num_examples, seq_length], 668 | dtype=tf.int32), 669 | "segment_ids": 670 | tf.constant( 671 | all_segment_ids, 672 | shape=[num_examples, seq_length], 673 | dtype=tf.int32), 674 | "label_ids": 675 | tf.constant(all_label_ids, shape=[num_examples], dtype=tf.int32), 676 | }) 677 | 678 | if is_training: 679 | d = d.repeat() 680 | d = d.shuffle(buffer_size=100) 681 | 682 | d = d.batch(batch_size=batch_size, drop_remainder=drop_remainder) 683 | return d 684 | 685 | return input_fn 686 | 687 | 688 | # This function is not used by this file but is still used by the Colab and 689 | # people who depend on it. 
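# (In-memory counterpart of file_based_convert_examples_to_features above: it
# returns a Python list of InputFeatures instead of writing a TFRecord file.)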
690 | def convert_examples_to_features(examples, label_list, max_seq_length, 691 | tokenizer): 692 | """Convert a set of `InputExample`s to a list of `InputFeatures`.""" 693 | 694 | features = [] 695 | for (ex_index, example) in enumerate(examples): 696 | if ex_index % 10000 == 0: 697 | tf.logging.info("Writing example %d of %d" % (ex_index, len(examples))) 698 | 699 | feature = convert_single_example(ex_index, example, label_list, 700 | max_seq_length, tokenizer) 701 | 702 | features.append(feature) 703 | return features 704 | 705 | 706 | def main(_): 707 | tf.logging.set_verbosity(tf.logging.INFO) 708 | 709 | if not FLAGS.do_train and not FLAGS.do_eval and not FLAGS.do_predict: 710 | raise ValueError( 711 | "At least one of `do_train`, `do_eval` or `do_predict' must be True.") 712 | 713 | bert_config = modeling.BertConfig.from_json_file(FLAGS.bert_config_file) 714 | 715 | if FLAGS.max_seq_length > bert_config.max_position_embeddings: 716 | raise ValueError( 717 | "Cannot use sequence length %d because the BERT model " 718 | "was only trained up to sequence length %d" % 719 | (FLAGS.max_seq_length, bert_config.max_position_embeddings)) 720 | 721 | tf.gfile.MakeDirs(FLAGS.output_dir) 722 | 723 | processor = MnliProcessor(split=FLAGS.split, annotator_idx=FLAGS.annotator_idx, 724 | augment_ratio=FLAGS.augment_ratio, take_number=FLAGS.take_number, 725 | annotator_labels=FLAGS.annotator_labels) 726 | 727 | label_list = processor.get_labels() 728 | 729 | tokenizer = tokenization.FullTokenizer( 730 | vocab_file=FLAGS.vocab_file, do_lower_case=FLAGS.do_lower_case) 731 | 732 | tpu_cluster_resolver = None 733 | if FLAGS.use_tpu and FLAGS.tpu_name: 734 | tpu_cluster_resolver = tf.contrib.cluster_resolver.TPUClusterResolver( 735 | FLAGS.tpu_name, zone=FLAGS.tpu_zone, project=FLAGS.gcp_project) 736 | 737 | is_per_host = tf.contrib.tpu.InputPipelineConfig.PER_HOST_V2 738 | run_config = tf.contrib.tpu.RunConfig( 739 | cluster=tpu_cluster_resolver, 740 | master=FLAGS.master, 741 | model_dir=FLAGS.output_dir, 742 | save_checkpoints_steps=FLAGS.save_checkpoints_steps, 743 | tpu_config=tf.contrib.tpu.TPUConfig( 744 | iterations_per_loop=FLAGS.iterations_per_loop, 745 | num_shards=FLAGS.num_tpu_cores, 746 | per_host_input_for_training=is_per_host)) 747 | 748 | train_examples = None 749 | num_train_steps = None 750 | num_warmup_steps = None 751 | if FLAGS.do_train: 752 | train_examples = processor.get_train_examples(FLAGS.data_dir) 753 | num_train_steps = int( 754 | len(train_examples) / FLAGS.train_batch_size * FLAGS.num_train_epochs) 755 | num_warmup_steps = int(num_train_steps * FLAGS.warmup_proportion) 756 | 757 | model_fn = model_fn_builder( 758 | bert_config=bert_config, 759 | num_labels=len(label_list), 760 | init_checkpoint=FLAGS.init_checkpoint, 761 | learning_rate=FLAGS.learning_rate, 762 | num_train_steps=num_train_steps, 763 | num_warmup_steps=num_warmup_steps, 764 | use_tpu=FLAGS.use_tpu, 765 | use_one_hot_embeddings=FLAGS.use_tpu) 766 | 767 | # If TPU is not available, this will fall back to normal Estimator on CPU 768 | # or GPU. 
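  # Note (ours): TPUEstimator fixes all batch sizes up front and passes the
  # relevant one to each input_fn via params["batch_size"], which is why the
  # input builders above read the batch size from params rather than FLAGS.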
769 | estimator = tf.contrib.tpu.TPUEstimator( 770 | use_tpu=FLAGS.use_tpu, 771 | model_fn=model_fn, 772 | config=run_config, 773 | train_batch_size=FLAGS.train_batch_size, 774 | eval_batch_size=FLAGS.eval_batch_size, 775 | predict_batch_size=FLAGS.predict_batch_size) 776 | 777 | if FLAGS.do_train: 778 | train_file = os.path.join(FLAGS.output_dir, "train.tf_record") 779 | train_seq_length = file_based_convert_examples_to_features( 780 | train_examples, label_list, FLAGS.max_seq_length, tokenizer, train_file) 781 | tf.logging.info("***** Running training *****") 782 | tf.logging.info(" Num examples = %d", len(train_examples)) 783 | tf.logging.info(" Batch size = %d", FLAGS.train_batch_size) 784 | tf.logging.info(" Num steps = %d", num_train_steps) 785 | tf.logging.info(" Longest training sequence = %d", train_seq_length) 786 | train_input_fn = file_based_input_fn_builder( 787 | input_file=train_file, 788 | seq_length=FLAGS.max_seq_length, 789 | is_training=True, 790 | drop_remainder=True) 791 | estimator.train(input_fn=train_input_fn, max_steps=num_train_steps) 792 | 793 | if FLAGS.do_eval: 794 | eval_examples = processor.get_dev_examples(FLAGS.data_dir) 795 | eval_file = os.path.join(FLAGS.output_dir, "eval.tf_record") 796 | eval_seq_length = file_based_convert_examples_to_features( 797 | eval_examples, label_list, FLAGS.max_seq_length, tokenizer, eval_file) 798 | 799 | tf.logging.info("***** Running evaluation *****") 800 | tf.logging.info(" Num examples = %d", len(eval_examples)) 801 | tf.logging.info(" Batch size = %d", FLAGS.eval_batch_size) 802 | tf.logging.info(" Longest eval sequence = %d", eval_seq_length) 803 | 804 | # This tells the estimator to run through the entire set. 805 | eval_steps = None 806 | # However, if running eval on the TPU, you will need to specify the 807 | # number of steps. 808 | if FLAGS.use_tpu: 809 | # Eval will be slightly WRONG on the TPU because it will truncate 810 | # the last batch. 
811 | eval_steps = int(len(eval_examples) / FLAGS.eval_batch_size) 812 | 813 | eval_drop_remainder = True if FLAGS.use_tpu else False 814 | eval_input_fn = file_based_input_fn_builder( 815 | input_file=eval_file, 816 | seq_length=FLAGS.max_seq_length, 817 | is_training=False, 818 | drop_remainder=eval_drop_remainder) 819 | 820 | result = estimator.evaluate(input_fn=eval_input_fn, steps=eval_steps) 821 | 822 | output_eval_file = os.path.join(FLAGS.output_dir, "eval_results.txt") 823 | with tf.gfile.GFile(output_eval_file, "w") as writer: 824 | tf.logging.info("***** Eval results *****") 825 | for key in sorted(result.keys()): 826 | tf.logging.info(" %s = %s", key, str(result[key])) 827 | writer.write("%s = %s\n" % (key, str(result[key]))) 828 | 829 | if FLAGS.do_predict: 830 | predict_examples = processor.get_test_examples(FLAGS.data_dir) 831 | num_actual_predict_examples = len(predict_examples) 832 | predict_file = os.path.join(FLAGS.output_dir, "predict.tf_record") 833 | predict_seq_length = file_based_convert_examples_to_features( 834 | predict_examples, label_list, 835 | FLAGS.max_seq_length, tokenizer, 836 | predict_file) 837 | 838 | tf.logging.info("***** Running prediction*****") 839 | tf.logging.info(" Num examples = %d", len(predict_examples)) 840 | tf.logging.info(" Batch size = %d", FLAGS.predict_batch_size) 841 | tf.logging.info(" Longest predict sequence = %d", predict_seq_length) 842 | 843 | if FLAGS.use_tpu: 844 | # Warning: According to tpu_estimator.py Prediction on TPU is an 845 | # experimental feature and hence not supported here 846 | raise ValueError("Prediction in TPU not supported") 847 | 848 | predict_drop_remainder = True if FLAGS.use_tpu else False 849 | predict_input_fn = file_based_input_fn_builder( 850 | input_file=predict_file, 851 | seq_length=FLAGS.max_seq_length, 852 | is_training=False, 853 | drop_remainder=predict_drop_remainder) 854 | 855 | result = estimator.predict(input_fn=predict_input_fn) 856 | 857 | output_predict_file = os.path.join(FLAGS.output_dir, "test_results.tsv") 858 | with tf.gfile.GFile(output_predict_file, "w") as writer: 859 | num_written_lines = 0 860 | tf.logging.info("***** Predict results *****") 861 | for (i, prediction) in enumerate(result): 862 | probabilities = prediction["probabilities"] 863 | if i >= num_actual_predict_examples: 864 | break 865 | output_line = "\t".join( 866 | str(class_probability) 867 | for class_probability in probabilities) + "\n" 868 | writer.write(output_line) 869 | num_written_lines += 1 870 | assert num_written_lines == num_actual_predict_examples 871 | 872 | 873 | if __name__ == "__main__": 874 | flags.mark_flag_as_required("data_dir") 875 | flags.mark_flag_as_required("vocab_file") 876 | flags.mark_flag_as_required("bert_config_file") 877 | flags.mark_flag_as_required("output_dir") 878 | tf.app.run() 879 | -------------------------------------------------------------------------------- /model_fine_tuning_scripts/run_openbookqa.py: -------------------------------------------------------------------------------- 1 | """Run BERT on OpenBookQA.""" 2 | 3 | from __future__ import absolute_import 4 | from __future__ import division 5 | from __future__ import print_function 6 | 7 | import collections 8 | import json 9 | import os 10 | 11 | import numpy as np 12 | import tensorflow as tf 13 | 14 | import modeling 15 | import optimization 16 | import tokenization 17 | 18 | 19 | flags = tf.flags 20 | 21 | FLAGS = flags.FLAGS 22 | 23 | ## Required parameters 24 | flags.DEFINE_string( 25 | "data_dir", None, 26 | 
"The input data dir. Should contain the .tsv files (or other data files) " 27 | "for the task.") 28 | 29 | flags.DEFINE_string( 30 | "bert_config_file", None, 31 | "The config json file corresponding to the pre-trained BERT model. " 32 | "This specifies the model architecture.") 33 | 34 | flags.DEFINE_string("vocab_file", None, 35 | "The vocabulary file that the BERT model was trained on.") 36 | 37 | flags.DEFINE_string( 38 | "output_dir", None, 39 | "The output directory where the model checkpoints will be written.") 40 | 41 | flags.DEFINE_string( 42 | "split", None, 43 | "The split you'd like to run on, either 'annotator' or 'rand'.") 44 | 45 | flags.DEFINE_integer( 46 | "annotator_idx", 0, 47 | "Index of the annotator to split by.") 48 | 49 | flags.DEFINE_float( 50 | "augment_ratio", 0.0, 51 | "Proportion of dev examples moved to training set.") 52 | 53 | flags.DEFINE_integer( 54 | "take_number", 1, 55 | "In case this is a re-execution of previous experiment.") 56 | 57 | 58 | ## Other parameters 59 | 60 | flags.DEFINE_string( 61 | "init_checkpoint", None, 62 | "Initial checkpoint (usually from a pre-trained BERT model).") 63 | 64 | flags.DEFINE_bool( 65 | "do_lower_case", True, 66 | "Whether to lower case the input text. Should be True for uncased " 67 | "models and False for cased models.") 68 | 69 | flags.DEFINE_integer( 70 | "max_seq_length", 128, 71 | "The maximum total input sequence length after WordPiece tokenization. " 72 | "Sequences longer than this will be truncated, and sequences shorter " 73 | "than this will be padded.") 74 | 75 | flags.DEFINE_bool("do_train", False, "Whether to run training.") 76 | 77 | flags.DEFINE_bool("do_eval", False, "Whether to run eval on the dev set.") 78 | 79 | flags.DEFINE_bool( 80 | "do_predict", False, 81 | "Whether to run the model in inference mode on the test set.") 82 | 83 | flags.DEFINE_integer("train_batch_size", 32, "Total batch size for training.") 84 | 85 | flags.DEFINE_integer("eval_batch_size", 8, "Total batch size for eval.") 86 | 87 | flags.DEFINE_integer("predict_batch_size", 8, "Total batch size for predict.") 88 | 89 | flags.DEFINE_float("learning_rate", 5e-5, "The initial learning rate for Adam.") 90 | 91 | flags.DEFINE_float("num_train_epochs", 3.0, 92 | "Total number of training epochs to perform.") 93 | 94 | flags.DEFINE_float( 95 | "warmup_proportion", 0.1, 96 | "Proportion of training to perform linear learning rate warmup for. " 97 | "E.g., 0.1 = 10% of training.") 98 | 99 | flags.DEFINE_integer("save_checkpoints_steps", 1000, 100 | "How often to save the model checkpoint.") 101 | 102 | flags.DEFINE_integer("iterations_per_loop", 1000, 103 | "How many steps to make in each estimator call.") 104 | 105 | flags.DEFINE_bool("use_tpu", False, "Whether to use TPU or GPU/CPU.") 106 | 107 | tf.flags.DEFINE_string( 108 | "tpu_name", None, 109 | "The Cloud TPU to use for training. This should be either the name " 110 | "used when creating the Cloud TPU, or a grpc://ip.address.of.tpu:8470 " 111 | "url.") 112 | 113 | tf.flags.DEFINE_string( 114 | "tpu_zone", None, 115 | "[Optional] GCE zone where the Cloud TPU is located in. If not " 116 | "specified, we will attempt to automatically detect the GCE project from " 117 | "metadata.") 118 | 119 | tf.flags.DEFINE_string( 120 | "gcp_project", None, 121 | "[Optional] Project name for the Cloud TPU-enabled project. 
If not " 122 | "specified, we will attempt to automatically detect the GCE project from " 123 | "metadata.") 124 | 125 | tf.flags.DEFINE_string("master", None, "[Optional] TensorFlow master URL.") 126 | 127 | flags.DEFINE_integer( 128 | "num_tpu_cores", 8, 129 | "Only used if `use_tpu` is True. Total number of TPU cores to use.") 130 | 131 | 132 | class InputExample(object): 133 | """A single multiple choice question.""" 134 | 135 | def __init__( 136 | self, 137 | qid, 138 | question, 139 | answers, 140 | label): 141 | """Construct an instance.""" 142 | self.qid = qid 143 | self.question = question 144 | self.answers = answers 145 | self.label = label 146 | 147 | 148 | class DataProcessor(object): 149 | """Base class for data converters for sequence classification data sets.""" 150 | 151 | def get_train_examples(self, data_dir): 152 | """Gets a collection of `InputExample`s for the train set.""" 153 | raise NotImplementedError() 154 | 155 | def get_dev_examples(self, data_dir): 156 | """Gets a collection of `InputExample`s for the dev set.""" 157 | raise NotImplementedError() 158 | 159 | def get_test_examples(self, data_dir): 160 | """Gets a collection of `InputExample`s for prediction.""" 161 | raise NotImplementedError() 162 | 163 | def get_labels(self): 164 | """Gets the list of labels for this data set.""" 165 | raise NotImplementedError() 166 | 167 | @classmethod 168 | def _read_json(cls, input_file): 169 | """Reads a JSON file.""" 170 | with tf.gfile.Open(input_file, "r") as f: 171 | return json.load(f) 172 | 173 | 174 | class CommonsenseQAProcessor(DataProcessor): 175 | """Processor for the CommonsenseQA data set.""" 176 | 177 | SPLIT_TO_NAME = { 178 | 'annotator': 'annotator_{annotator_idx}{augment_ratio}{take_number}', 179 | 'annotator_multi': 'annotator_multi_{annotator_idx}{augment_ratio}{take_number}', 180 | 'rand': 'rand_{annotator_idx}{augment_ratio}{take_number}', 181 | 'rand_multi': 'rand_multi_{annotator_idx}{augment_ratio}{take_number}', 182 | 'with_annotator': 'with_annotator_id', 183 | 'without_annotator': 'without_annotator_id', 184 | } 185 | 186 | TRAIN_FILE_NAME = 'train_{split_name}.json' 187 | DEV_FILE_NAME = 'dev_{split_name}.json' 188 | TEST_FILE_NAME = 'dev_{split_name}.json' 189 | 190 | def __init__(self, split, annotator_idx, augment_ratio, take_number): 191 | if split not in self.SPLIT_TO_NAME.keys(): 192 | raise ValueError( 193 | 'split must be one of {", ".join(self.SPLIT_TO_NAME.keys())}.') 194 | 195 | self.split = split 196 | self.annotator_idx = annotator_idx 197 | self.augment_ratio = "_augment{}".format(augment_ratio) if augment_ratio > 0 else "" 198 | self.take = "_take{}".format(take_number) if take_number > 1 else "" 199 | 200 | def get_train_examples(self, data_dir): 201 | train_file_name = self.TRAIN_FILE_NAME.format( 202 | split_name=self.SPLIT_TO_NAME[self.split].format(annotator_idx=self.annotator_idx, augment_ratio=self.augment_ratio, take_number=self.take)) 203 | 204 | return self._create_examples( 205 | self._read_json(os.path.join(data_dir, train_file_name)), 206 | 'train') 207 | 208 | def get_dev_examples(self, data_dir): 209 | dev_file_name = self.DEV_FILE_NAME.format( 210 | split_name=self.SPLIT_TO_NAME[self.split].format(annotator_idx=self.annotator_idx, augment_ratio=self.augment_ratio, take_number=self.take)) 211 | 212 | return self._create_examples( 213 | self._read_json(os.path.join(data_dir, dev_file_name)), 214 | 'dev') 215 | 216 | def get_test_examples(self, data_dir): 217 | test_file_name = self.TEST_FILE_NAME.format( 218 | 
split_name=self.SPLIT_TO_NAME[self.split].format(annotator_idx=self.annotator_idx, augment_ratio=self.augment_ratio, take_number=self.take)) 219 | 220 | return self._create_examples( 221 | self._read_json(os.path.join(data_dir, test_file_name)), 222 | 'test') 223 | 224 | def get_labels(self): 225 | return [0, 1, 2, 3] 226 | 227 | def _create_examples(self, lines, set_type): 228 | examples = [] 229 | for i, line in enumerate(lines): 230 | qid = "%s-%s" % (set_type, i) 231 | 232 | question = tokenization.convert_to_unicode(line['question']) 233 | 234 | answers = np.array([ 235 | tokenization.convert_to_unicode(line['choice0']), 236 | tokenization.convert_to_unicode(line['choice1']), 237 | tokenization.convert_to_unicode(line['choice2']), 238 | tokenization.convert_to_unicode(line['choice3']) 239 | ]) 240 | 241 | label = np.argwhere(answers == line['correct_answer']) 242 | assert len(label) == 1 243 | label = label[0][0] 244 | 245 | examples.append( 246 | InputExample( 247 | qid=qid, 248 | question=question, 249 | answers=answers, 250 | label=label)) 251 | 252 | return examples 253 | 254 | 255 | def example_to_token_ids_segment_ids_label_ids( 256 | ex_index, 257 | example, 258 | max_seq_length, 259 | tokenizer): 260 | """Converts an ``InputExample`` to token ids and segment ids.""" 261 | if ex_index < 5: 262 | tf.logging.info(f"*** Example {ex_index} ***") 263 | tf.logging.info("qid: %s" % (example.qid)) 264 | 265 | question_tokens = tokenizer.tokenize(example.question) 266 | answers_tokens = map(tokenizer.tokenize, example.answers) 267 | 268 | token_ids = [] 269 | segment_ids = [] 270 | for choice_idx, answer_tokens in enumerate(answers_tokens): 271 | truncated_question_tokens = question_tokens[ 272 | :max((max_seq_length - 3)//2, max_seq_length - (len(answer_tokens) + 3))] 273 | truncated_answer_tokens = answer_tokens[ 274 | :max((max_seq_length - 3)//2, max_seq_length - (len(question_tokens) + 3))] 275 | 276 | choice_tokens = [] 277 | choice_segment_ids = [] 278 | choice_tokens.append("[CLS]") 279 | choice_segment_ids.append(0) 280 | for question_token in truncated_question_tokens: 281 | choice_tokens.append(question_token) 282 | choice_segment_ids.append(0) 283 | choice_tokens.append("[SEP]") 284 | choice_segment_ids.append(0) 285 | for answer_token in truncated_answer_tokens: 286 | choice_tokens.append(answer_token) 287 | choice_segment_ids.append(1) 288 | choice_tokens.append("[SEP]") 289 | choice_segment_ids.append(1) 290 | 291 | choice_token_ids = tokenizer.convert_tokens_to_ids(choice_tokens) 292 | 293 | token_ids.append(choice_token_ids) 294 | segment_ids.append(choice_segment_ids) 295 | 296 | if ex_index < 5: 297 | tf.logging.info("choice %s" % choice_idx) 298 | tf.logging.info("tokens: %s" % " ".join( 299 | [tokenization.printable_text(t) for t in choice_tokens])) 300 | tf.logging.info("token ids: %s" % " ".join( 301 | [str(x) for x in choice_token_ids])) 302 | tf.logging.info("segment ids: %s" % " ".join( 303 | [str(x) for x in choice_segment_ids])) 304 | 305 | label_ids = [example.label] 306 | 307 | if ex_index < 5: 308 | tf.logging.info("label: %s (id = %d)" % (example.label, label_ids[0])) 309 | 310 | return token_ids, segment_ids, label_ids 311 | 312 | 313 | def file_based_convert_examples_to_features( 314 | examples, 315 | label_list, 316 | max_seq_length, 317 | tokenizer, 318 | output_file 319 | ): 320 | """Convert a set of ``InputExamples`` to a TFRecord file.""" 321 | 322 | # encode examples into token_ids and segment_ids 323 | token_ids_segment_ids_label_ids = [ 324 | 
example_to_token_ids_segment_ids_label_ids( 325 | ex_index, 326 | example, 327 | max_seq_length, 328 | tokenizer) 329 | for ex_index, example in enumerate(examples) 330 | ] 331 | 332 | # compute the maximum sequence length for any of the inputs 333 | seq_length = max([ 334 | max([len(choice_token_ids) for choice_token_ids in token_ids]) 335 | for token_ids, _, _ in token_ids_segment_ids_label_ids 336 | ]) 337 | 338 | # encode the inputs into fixed-length vectors 339 | writer = tf.python_io.TFRecordWriter(output_file) 340 | 341 | for idx, (token_ids, segment_ids, label_ids) in enumerate( 342 | token_ids_segment_ids_label_ids 343 | ): 344 | if idx % 10000 == 0: 345 | tf.logging.info("Writing %d of %d" % ( 346 | idx, 347 | len(token_ids_segment_ids_label_ids))) 348 | 349 | features = collections.OrderedDict() 350 | for i, (choice_token_ids, choice_segment_ids) in enumerate( 351 | zip(token_ids, segment_ids)): 352 | input_ids = np.zeros(max_seq_length) 353 | input_ids[:len(choice_token_ids)] = np.array(choice_token_ids) 354 | 355 | input_mask = np.zeros(max_seq_length) 356 | input_mask[:len(choice_token_ids)] = 1 357 | 358 | segment_ids = np.zeros(max_seq_length) 359 | segment_ids[:len(choice_segment_ids)] = np.array(choice_segment_ids) 360 | 361 | features[f'input_ids{i}'] = tf.train.Feature( 362 | int64_list=tf.train.Int64List(value=list(input_ids.astype(np.int64)))) 363 | features[f'input_mask{i}'] = tf.train.Feature( 364 | int64_list=tf.train.Int64List(value=list(input_mask.astype(np.int64)))) 365 | features[f'segment_ids{i}'] = tf.train.Feature( 366 | int64_list=tf.train.Int64List(value=list(segment_ids.astype(np.int64)))) 367 | 368 | features['label_ids'] = tf.train.Feature( 369 | int64_list=tf.train.Int64List(value=label_ids)) 370 | 371 | tf_example = tf.train.Example(features=tf.train.Features(feature=features)) 372 | writer.write(tf_example.SerializeToString()) 373 | 374 | return seq_length 375 | 376 | 377 | def file_based_input_fn_builder( 378 | input_file, 379 | seq_length, 380 | is_training, 381 | drop_remainder 382 | ): 383 | """Creates an `input_fn` closure to be passed to TPUEstimator.""" 384 | 385 | name_to_features = { 386 | "input_ids0": tf.FixedLenFeature([seq_length], tf.int64), 387 | "input_mask0": tf.FixedLenFeature([seq_length], tf.int64), 388 | "segment_ids0": tf.FixedLenFeature([seq_length], tf.int64), 389 | "input_ids1": tf.FixedLenFeature([seq_length], tf.int64), 390 | "input_mask1": tf.FixedLenFeature([seq_length], tf.int64), 391 | "segment_ids1": tf.FixedLenFeature([seq_length], tf.int64), 392 | "input_ids2": tf.FixedLenFeature([seq_length], tf.int64), 393 | "input_mask2": tf.FixedLenFeature([seq_length], tf.int64), 394 | "segment_ids2": tf.FixedLenFeature([seq_length], tf.int64), 395 | "input_ids3": tf.FixedLenFeature([seq_length], tf.int64), 396 | "input_mask3": tf.FixedLenFeature([seq_length], tf.int64), 397 | "segment_ids3": tf.FixedLenFeature([seq_length], tf.int64), 398 | "label_ids": tf.FixedLenFeature([], tf.int64), 399 | } 400 | 401 | def _decode_record(record, name_to_features): 402 | """Decodes a record to a TensorFlow example.""" 403 | example = tf.parse_single_example(record, name_to_features) 404 | 405 | # tf.Example only supports tf.int64, but the TPU only supports tf.int32. 406 | # So cast all int64 to int32. 
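    # (Added note: every feature above is parsed as tf.int64, so each one is
    # cast here; e.g. "label_ids" leaves this loop as a tf.int32 scalar.)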
407 | for name in list(example.keys()): 408 | t = example[name] 409 | if t.dtype == tf.int64: 410 | t = tf.to_int32(t) 411 | example[name] = t 412 | 413 | return example 414 | 415 | def input_fn(params): 416 | """The actual input function.""" 417 | batch_size = params["batch_size"] 418 | 419 | # For training, we want a lot of parallel reading and shuffling. 420 | # For eval, we want no shuffling and parallel reading doesn't matter. 421 | d = tf.data.TFRecordDataset(input_file) 422 | if is_training: 423 | d = d.repeat() 424 | d = d.shuffle(buffer_size=100) 425 | 426 | d = d.apply( 427 | tf.contrib.data.map_and_batch( 428 | lambda record: _decode_record(record, name_to_features), 429 | batch_size=batch_size, 430 | drop_remainder=drop_remainder)) 431 | 432 | return d 433 | 434 | return input_fn 435 | 436 | 437 | def create_model( 438 | bert_config, 439 | is_training, 440 | input_ids0, 441 | input_mask0, 442 | segment_ids0, 443 | input_ids1, 444 | input_mask1, 445 | segment_ids1, 446 | input_ids2, 447 | input_mask2, 448 | segment_ids2, 449 | input_ids3, 450 | input_mask3, 451 | segment_ids3, 452 | labels, 453 | num_labels, 454 | use_one_hot_embeddings 455 | ): 456 | """Creates a classification model.""" 457 | input_ids = tf.stack( 458 | [ 459 | input_ids0, 460 | input_ids1, 461 | input_ids2, 462 | input_ids3 463 | ], 464 | axis=1) 465 | input_mask = tf.stack( 466 | [ 467 | input_mask0, 468 | input_mask1, 469 | input_mask2, 470 | input_mask3 471 | ], 472 | axis=1) 473 | segment_ids = tf.stack( 474 | [ 475 | segment_ids0, 476 | segment_ids1, 477 | segment_ids2, 478 | segment_ids3 479 | ], 480 | axis=1) 481 | 482 | _, num_choices, seq_length = input_ids.shape 483 | 484 | input_ids = tf.reshape(input_ids, (-1, seq_length)) 485 | input_mask = tf.reshape(input_mask, (-1, seq_length)) 486 | segment_ids = tf.reshape(segment_ids, (-1, seq_length)) 487 | 488 | output_layer = modeling.BertModel( 489 | config=bert_config, 490 | is_training=is_training, 491 | input_ids=input_ids, 492 | input_mask=input_mask, 493 | token_type_ids=segment_ids, 494 | use_one_hot_embeddings=use_one_hot_embeddings 495 | ).get_pooled_output() 496 | 497 | hidden_size = output_layer.shape[-1].value 498 | 499 | softmax_weights = tf.get_variable( 500 | "softmax_weights", [hidden_size, 1], 501 | initializer=tf.truncated_normal_initializer(stddev=0.02)) 502 | 503 | with tf.variable_scope("loss"): 504 | if is_training: 505 | # I.e., 0.1 dropout 506 | output_layer = tf.nn.dropout(output_layer, keep_prob=0.9) 507 | 508 | logits = tf.reshape( 509 | tf.matmul(output_layer, softmax_weights), 510 | (-1, num_choices)) 511 | 512 | probabilities = tf.nn.softmax(logits, axis=-1) 513 | log_probs = tf.nn.log_softmax(logits, axis=-1) 514 | 515 | one_hot_labels = tf.one_hot(labels, depth=num_labels, dtype=tf.float32) 516 | 517 | per_example_loss = -tf.reduce_sum(one_hot_labels * log_probs, axis=-1) 518 | loss = tf.reduce_mean(per_example_loss) 519 | 520 | return (loss, per_example_loss, logits, probabilities, output_layer) 521 | 522 | 523 | def model_fn_builder( 524 | bert_config, 525 | num_labels, 526 | init_checkpoint, 527 | learning_rate, 528 | num_train_steps, 529 | num_warmup_steps, 530 | use_tpu, 531 | use_one_hot_embeddings 532 | ): 533 | """Returns `model_fn` closure for TPUEstimator.""" 534 | 535 | def model_fn(features, labels, mode, params): # pylint: disable=unused-argument 536 | """The `model_fn` for TPUEstimator.""" 537 | 538 | tf.logging.info("*** Features ***") 539 | for name in sorted(features.keys()): 540 | tf.logging.info(" name = 
%s, shape = %s" % (name, features[name].shape)) 541 | 542 | input_ids0 = features["input_ids0"] 543 | input_mask0 = features["input_mask0"] 544 | segment_ids0 = features["segment_ids0"] 545 | input_ids1 = features["input_ids1"] 546 | input_mask1 = features["input_mask1"] 547 | segment_ids1 = features["segment_ids1"] 548 | input_ids2 = features["input_ids2"] 549 | input_mask2 = features["input_mask2"] 550 | segment_ids2 = features["segment_ids2"] 551 | input_ids3 = features["input_ids3"] 552 | input_mask3 = features["input_mask3"] 553 | segment_ids3 = features["segment_ids3"] 554 | label_ids = features["label_ids"] 555 | 556 | is_training = (mode == tf.estimator.ModeKeys.TRAIN) 557 | 558 | (total_loss, per_example_loss, logits, probabilities, output_layer) = create_model( 559 | bert_config, 560 | is_training, 561 | input_ids0, 562 | input_mask0, 563 | segment_ids0, 564 | input_ids1, 565 | input_mask1, 566 | segment_ids1, 567 | input_ids2, 568 | input_mask2, 569 | segment_ids2, 570 | input_ids3, 571 | input_mask3, 572 | segment_ids3, 573 | label_ids, 574 | num_labels, 575 | use_one_hot_embeddings) 576 | 577 | tvars = tf.trainable_variables() 578 | initialized_variable_names = {} 579 | scaffold_fn = None 580 | if init_checkpoint: 581 | (assignment_map, initialized_variable_names 582 | ) = modeling.get_assignment_map_from_checkpoint(tvars, init_checkpoint) 583 | if use_tpu: 584 | 585 | def tpu_scaffold(): 586 | tf.train.init_from_checkpoint(init_checkpoint, assignment_map) 587 | return tf.train.Scaffold() 588 | 589 | scaffold_fn = tpu_scaffold 590 | else: 591 | tf.train.init_from_checkpoint(init_checkpoint, assignment_map) 592 | 593 | tf.logging.info("**** Trainable Variables ****") 594 | for var in tvars: 595 | init_string = "" 596 | if var.name in initialized_variable_names: 597 | init_string = ", *INIT_FROM_CKPT*" 598 | tf.logging.info(" name = %s, shape = %s%s", var.name, var.shape, 599 | init_string) 600 | 601 | output_spec = None 602 | if mode == tf.estimator.ModeKeys.TRAIN: 603 | 604 | train_op = optimization.create_optimizer( 605 | total_loss, learning_rate, num_train_steps, num_warmup_steps, use_tpu) 606 | 607 | output_spec = tf.contrib.tpu.TPUEstimatorSpec( 608 | mode=mode, 609 | loss=total_loss, 610 | train_op=train_op, 611 | scaffold_fn=scaffold_fn) 612 | elif mode == tf.estimator.ModeKeys.EVAL: 613 | 614 | def metric_fn(per_example_loss, label_ids, logits): 615 | predictions = tf.argmax(logits, axis=-1, output_type=tf.int32) 616 | accuracy = tf.metrics.accuracy(label_ids, predictions) 617 | loss = tf.metrics.mean(per_example_loss) 618 | return { 619 | "eval_accuracy": accuracy, 620 | "eval_loss": loss, 621 | } 622 | 623 | eval_metrics = (metric_fn, [per_example_loss, label_ids, logits]) 624 | output_spec = tf.contrib.tpu.TPUEstimatorSpec( 625 | mode=mode, 626 | loss=total_loss, 627 | eval_metrics=eval_metrics, 628 | scaffold_fn=scaffold_fn,) 629 | else: 630 | output_spec = tf.contrib.tpu.TPUEstimatorSpec( 631 | mode=mode, predictions=probabilities, scaffold_fn=scaffold_fn) 632 | return output_spec 633 | 634 | return model_fn 635 | 636 | 637 | def main(_): 638 | tf.logging.set_verbosity(tf.logging.INFO) 639 | 640 | if not FLAGS.do_train and not FLAGS.do_eval and not FLAGS.do_predict: 641 | raise ValueError( 642 | "At least one of `do_train`, `do_eval` or `do_predict' must be True.") 643 | 644 | bert_config = modeling.BertConfig.from_json_file(FLAGS.bert_config_file) 645 | 646 | if FLAGS.max_seq_length > bert_config.max_position_embeddings: 647 | raise ValueError( 648 | "Cannot use 
sequence length %d because the BERT model "
649 |         "was only trained up to sequence length %d" %
650 |         (FLAGS.max_seq_length, bert_config.max_position_embeddings))
651 | 
652 |   tf.gfile.MakeDirs(FLAGS.output_dir)
653 | 
654 |   processor = CommonsenseQAProcessor(split=FLAGS.split, annotator_idx=FLAGS.annotator_idx, augment_ratio=FLAGS.augment_ratio, take_number=FLAGS.take_number)
655 | 
656 |   label_list = processor.get_labels()
657 | 
658 |   tokenizer = tokenization.FullTokenizer(
659 |       vocab_file=FLAGS.vocab_file, do_lower_case=FLAGS.do_lower_case)
660 | 
661 |   tpu_cluster_resolver = None
662 |   if FLAGS.use_tpu and FLAGS.tpu_name:
663 |     tpu_cluster_resolver = tf.contrib.cluster_resolver.TPUClusterResolver(
664 |         FLAGS.tpu_name, zone=FLAGS.tpu_zone, project=FLAGS.gcp_project)
665 | 
666 |   is_per_host = tf.contrib.tpu.InputPipelineConfig.PER_HOST_V2
667 |   run_config = tf.contrib.tpu.RunConfig(
668 |       cluster=tpu_cluster_resolver,
669 |       master=FLAGS.master,
670 |       model_dir=FLAGS.output_dir,
671 |       save_checkpoints_steps=FLAGS.save_checkpoints_steps,
672 |       tpu_config=tf.contrib.tpu.TPUConfig(
673 |           iterations_per_loop=FLAGS.iterations_per_loop,
674 |           num_shards=FLAGS.num_tpu_cores,
675 |           per_host_input_for_training=is_per_host))
676 | 
677 |   train_examples = None
678 |   num_train_steps = None
679 |   num_warmup_steps = None
680 |   if FLAGS.do_train:
681 |     train_examples = processor.get_train_examples(FLAGS.data_dir)
682 |     num_train_steps = int(
683 |         len(train_examples) / FLAGS.train_batch_size * FLAGS.num_train_epochs)
684 |     num_warmup_steps = int(num_train_steps * FLAGS.warmup_proportion)
685 | 
686 |   model_fn = model_fn_builder(
687 |       bert_config=bert_config,
688 |       num_labels=len(label_list),
689 |       init_checkpoint=FLAGS.init_checkpoint,
690 |       learning_rate=FLAGS.learning_rate,
691 |       num_train_steps=num_train_steps,
692 |       num_warmup_steps=num_warmup_steps,
693 |       use_tpu=FLAGS.use_tpu,
694 |       use_one_hot_embeddings=FLAGS.use_tpu)
695 | 
696 |   # If TPU is not available, this will fall back to normal Estimator on CPU
697 |   # or GPU.
698 | estimator = tf.contrib.tpu.TPUEstimator( 699 | use_tpu=FLAGS.use_tpu, 700 | model_fn=model_fn, 701 | config=run_config, 702 | train_batch_size=FLAGS.train_batch_size, 703 | eval_batch_size=FLAGS.eval_batch_size, 704 | predict_batch_size=FLAGS.predict_batch_size) 705 | 706 | if FLAGS.do_train: 707 | train_file = os.path.join(FLAGS.output_dir, "train.tf_record") 708 | train_seq_length = file_based_convert_examples_to_features( 709 | train_examples, label_list, FLAGS.max_seq_length, tokenizer, train_file) 710 | tf.logging.info("***** Running training *****") 711 | tf.logging.info(" Num examples = %d", len(train_examples)) 712 | tf.logging.info(" Batch size = %d", FLAGS.train_batch_size) 713 | tf.logging.info(" Num steps = %d", num_train_steps) 714 | tf.logging.info(" Longest training sequence = %d", train_seq_length) 715 | train_input_fn = file_based_input_fn_builder( 716 | input_file=train_file, 717 | seq_length=FLAGS.max_seq_length, 718 | is_training=True, 719 | drop_remainder=True) 720 | estimator.train(input_fn=train_input_fn, max_steps=num_train_steps) 721 | 722 | if FLAGS.do_eval: 723 | eval_examples = processor.get_dev_examples(FLAGS.data_dir) 724 | eval_file = os.path.join(FLAGS.output_dir, "eval.tf_record") 725 | eval_seq_length = file_based_convert_examples_to_features( 726 | eval_examples, label_list, FLAGS.max_seq_length, tokenizer, eval_file) 727 | 728 | tf.logging.info("***** Running evaluation *****") 729 | tf.logging.info(" Num examples = %d", len(eval_examples)) 730 | tf.logging.info(" Batch size = %d", FLAGS.eval_batch_size) 731 | tf.logging.info(" Longest eval sequence = %d", eval_seq_length) 732 | 733 | # This tells the estimator to run through the entire set. 734 | eval_steps = None 735 | # However, if running eval on the TPU, you will need to specify the 736 | # number of steps. 737 | if FLAGS.use_tpu: 738 | # Eval will be slightly WRONG on the TPU because it will truncate 739 | # the last batch. 
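      # (The int() below floors the division, so any examples in a final
      # partial batch are silently skipped during TPU evaluation.)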
740 | eval_steps = int(len(eval_examples) / FLAGS.eval_batch_size) 741 | 742 | eval_drop_remainder = True if FLAGS.use_tpu else False 743 | eval_input_fn = file_based_input_fn_builder( 744 | input_file=eval_file, 745 | seq_length=FLAGS.max_seq_length, 746 | is_training=False, 747 | drop_remainder=eval_drop_remainder) 748 | 749 | result = estimator.evaluate(input_fn=eval_input_fn, steps=eval_steps) 750 | 751 | output_eval_file = os.path.join(FLAGS.output_dir, "eval_results.txt") 752 | with tf.gfile.GFile(output_eval_file, "w") as writer: 753 | tf.logging.info("***** Eval results *****") 754 | for key in sorted(result.keys()): 755 | tf.logging.info(" %s = %s", key, str(result[key])) 756 | writer.write("%s = %s\n" % (key, str(result[key]))) 757 | 758 | if FLAGS.do_predict: 759 | predict_examples = processor.get_test_examples(FLAGS.data_dir) 760 | predict_file = os.path.join(FLAGS.output_dir, "predict.tf_record") 761 | predict_seq_length = file_based_convert_examples_to_features( 762 | predict_examples, label_list, 763 | FLAGS.max_seq_length, tokenizer, 764 | predict_file) 765 | 766 | tf.logging.info("***** Running prediction*****") 767 | tf.logging.info(" Num examples = %d", len(predict_examples)) 768 | tf.logging.info(" Batch size = %d", FLAGS.predict_batch_size) 769 | tf.logging.info(" Longest predict sequence = %d", predict_seq_length) 770 | 771 | if FLAGS.use_tpu: 772 | # Warning: According to tpu_estimator.py Prediction on TPU is an 773 | # experimental feature and hence not supported here 774 | raise ValueError("Prediction in TPU not supported") 775 | 776 | predict_drop_remainder = True if FLAGS.use_tpu else False 777 | predict_input_fn = file_based_input_fn_builder( 778 | input_file=predict_file, 779 | seq_length=FLAGS.max_seq_length, 780 | is_training=False, 781 | drop_remainder=predict_drop_remainder) 782 | 783 | result = estimator.predict(input_fn=predict_input_fn) 784 | 785 | output_predict_file = os.path.join(FLAGS.output_dir, "test_results.tsv") 786 | with tf.gfile.GFile(output_predict_file, "w") as writer: 787 | tf.logging.info("***** Predict results *****") 788 | for prediction in result: 789 | output_line = "\t".join( 790 | str(class_probability) for class_probability in prediction) + "\n" 791 | writer.write(output_line) 792 | 793 | 794 | if __name__ == "__main__": 795 | flags.mark_flag_as_required("data_dir") 796 | flags.mark_flag_as_required("vocab_file") 797 | flags.mark_flag_as_required("bert_config_file") 798 | flags.mark_flag_as_required("output_dir") 799 | tf.app.run() 800 | -------------------------------------------------------------------------------- /model_fine_tuning_scripts/run_openbookqa_recognition.py: -------------------------------------------------------------------------------- 1 | """Run BERT on OpenBookQA for annotator ID prediction.""" 2 | 3 | from __future__ import absolute_import 4 | from __future__ import division 5 | from __future__ import print_function 6 | 7 | import collections 8 | import json 9 | import os 10 | 11 | import numpy as np 12 | import tensorflow as tf 13 | 14 | import modeling 15 | import optimization 16 | import tokenization 17 | 18 | 19 | flags = tf.flags 20 | 21 | FLAGS = flags.FLAGS 22 | 23 | ## Required parameters 24 | flags.DEFINE_string( 25 | "data_dir", None, 26 | "The input data dir. Should contain the .tsv files (or other data files) " 27 | "for the task.") 28 | 29 | flags.DEFINE_string( 30 | "bert_config_file", None, 31 | "The config json file corresponding to the pre-trained BERT model. 
" 32 | "This specifies the model architecture.") 33 | 34 | flags.DEFINE_string("vocab_file", None, 35 | "The vocabulary file that the BERT model was trained on.") 36 | 37 | flags.DEFINE_string( 38 | "output_dir", None, 39 | "The output directory where the model checkpoints will be written.") 40 | 41 | flags.DEFINE_string( 42 | "split", None, 43 | "The split you'd like to run on, either 'annotator' or 'rand'.") 44 | 45 | flags.DEFINE_integer( 46 | "annotator_idx", 0, 47 | "Index of the annotator to split by.") 48 | 49 | flags.DEFINE_float( 50 | "augment_ratio", 0.0, 51 | "Proportion of dev examples moved to training set.") 52 | 53 | flags.DEFINE_integer( 54 | "take_number", 1, 55 | "In case this is a re-execution of previous experiment.") 56 | 57 | 58 | ## Other parameters 59 | 60 | flags.DEFINE_string( 61 | "init_checkpoint", None, 62 | "Initial checkpoint (usually from a pre-trained BERT model).") 63 | 64 | flags.DEFINE_bool( 65 | "do_lower_case", True, 66 | "Whether to lower case the input text. Should be True for uncased " 67 | "models and False for cased models.") 68 | 69 | flags.DEFINE_integer( 70 | "max_seq_length", 128, 71 | "The maximum total input sequence length after WordPiece tokenization. " 72 | "Sequences longer than this will be truncated, and sequences shorter " 73 | "than this will be padded.") 74 | 75 | flags.DEFINE_bool("do_train", False, "Whether to run training.") 76 | 77 | flags.DEFINE_bool("do_eval", False, "Whether to run eval on the dev set.") 78 | 79 | flags.DEFINE_bool( 80 | "do_predict", False, 81 | "Whether to run the model in inference mode on the test set.") 82 | 83 | flags.DEFINE_integer("train_batch_size", 32, "Total batch size for training.") 84 | 85 | flags.DEFINE_integer("eval_batch_size", 8, "Total batch size for eval.") 86 | 87 | flags.DEFINE_integer("predict_batch_size", 8, "Total batch size for predict.") 88 | 89 | flags.DEFINE_float("learning_rate", 5e-5, "The initial learning rate for Adam.") 90 | 91 | flags.DEFINE_float("num_train_epochs", 3.0, 92 | "Total number of training epochs to perform.") 93 | 94 | flags.DEFINE_float( 95 | "warmup_proportion", 0.1, 96 | "Proportion of training to perform linear learning rate warmup for. " 97 | "E.g., 0.1 = 10% of training.") 98 | 99 | flags.DEFINE_integer("save_checkpoints_steps", 1000, 100 | "How often to save the model checkpoint.") 101 | 102 | flags.DEFINE_integer("iterations_per_loop", 1000, 103 | "How many steps to make in each estimator call.") 104 | 105 | flags.DEFINE_bool("use_tpu", False, "Whether to use TPU or GPU/CPU.") 106 | 107 | tf.flags.DEFINE_string( 108 | "tpu_name", None, 109 | "The Cloud TPU to use for training. This should be either the name " 110 | "used when creating the Cloud TPU, or a grpc://ip.address.of.tpu:8470 " 111 | "url.") 112 | 113 | tf.flags.DEFINE_string( 114 | "tpu_zone", None, 115 | "[Optional] GCE zone where the Cloud TPU is located in. If not " 116 | "specified, we will attempt to automatically detect the GCE project from " 117 | "metadata.") 118 | 119 | tf.flags.DEFINE_string( 120 | "gcp_project", None, 121 | "[Optional] Project name for the Cloud TPU-enabled project. If not " 122 | "specified, we will attempt to automatically detect the GCE project from " 123 | "metadata.") 124 | 125 | tf.flags.DEFINE_string("master", None, "[Optional] TensorFlow master URL.") 126 | 127 | flags.DEFINE_integer( 128 | "num_tpu_cores", 8, 129 | "Only used if `use_tpu` is True. 
Total number of TPU cores to use.")
130 | 
131 | 
132 | class InputExample(object):
133 |   """A single training/test example for simple sequence classification."""
134 | 
135 |   def __init__(self, guid, text_a, text_b=None, label=None):
136 |     """Constructs an InputExample.
137 | 
138 |     Args:
139 |       guid: Unique id for the example.
140 |       text_a: string. The untokenized text of the first sequence. For single
141 |         sequence tasks, only this sequence must be specified.
142 |       text_b: (Optional) string. The untokenized text of the second sequence.
143 |         Only must be specified for sequence pair tasks.
144 |       label: (Optional) string. The label of the example. This should be
145 |         specified for train and dev examples, but not for test examples.
146 |     """
147 |     self.guid = guid
148 |     self.text_a = text_a
149 |     self.text_b = text_b
150 |     self.label = label
151 | 
152 | 
153 | class PaddingInputExample(object):
154 |   """Fake example so the num input examples is a multiple of the batch size.
155 | 
156 |   When running eval/predict on the TPU, we need to pad the number of examples
157 |   to be a multiple of the batch size, because the TPU requires a fixed batch
158 |   size. The alternative is to drop the last batch, which is bad because it means
159 |   the entire output data won't be generated.
160 | 
161 |   We use this class instead of `None` because treating `None` as padding
162 |   batches could cause silent errors.
163 |   """
164 | 
165 | 
166 | class InputFeatures(object):
167 |   """A single set of features of data."""
168 | 
169 |   def __init__(self,
170 |                input_ids,
171 |                input_mask,
172 |                segment_ids,
173 |                label_id,
174 |                is_real_example=True):
175 |     self.input_ids = input_ids
176 |     self.input_mask = input_mask
177 |     self.segment_ids = segment_ids
178 |     self.label_id = label_id
179 |     self.is_real_example = is_real_example
180 | 
181 | 
182 | class DataProcessor(object):
183 |   """Base class for data converters for sequence classification data sets."""
184 | 
185 |   def get_train_examples(self, data_dir):
186 |     """Gets a collection of `InputExample`s for the train set."""
187 |     raise NotImplementedError()
188 | 
189 |   def get_dev_examples(self, data_dir):
190 |     """Gets a collection of `InputExample`s for the dev set."""
191 |     raise NotImplementedError()
192 | 
193 |   def get_test_examples(self, data_dir):
194 |     """Gets a collection of `InputExample`s for prediction."""
195 |     raise NotImplementedError()
196 | 
197 |   def get_labels(self):
198 |     """Gets the list of labels for this data set."""
199 |     raise NotImplementedError()
200 | 
201 |   @classmethod
202 |   def _read_json(cls, input_file):
203 |     """Reads a JSON file."""
204 |     with tf.gfile.Open(input_file, "r") as f:
205 |       return json.load(f)
206 | 
207 | 
208 | class CommonsenseQAProcessor(DataProcessor):
209 |   """Processor for OpenBookQA, adapted here for the annotator recognition task."""
210 | 
211 |   SPLIT_TO_NAME = {
212 |     'annotator': 'annotator_{annotator_idx}{augment_ratio}{take_number}',
213 |     'annotator_multi': 'annotator_multi_{annotator_idx}{augment_ratio}{take_number}',
214 |     'rand': 'rand_{annotator_idx}{augment_ratio}{take_number}',
215 |     'rand_multi': 'rand_multi_{annotator_idx}{augment_ratio}{take_number}',
216 |     'with_annotator': 'with_annotator_id',
217 |     'without_annotator': 'without_annotator_id'
218 |   }
219 | 
220 |   TRAIN_FILE_NAME = 'train_{split_name}.json'
221 |   DEV_FILE_NAME = 'dev_{split_name}.json'
222 |   TEST_FILE_NAME = 'dev_{split_name}.json'  # the dev file is reused as the test set
223 | 
224 |   def __init__(self, split, annotator_idx, augment_ratio, take_number):
225 |     if split not in self.SPLIT_TO_NAME.keys():
226 |       raise ValueError(
227 |         f'split must be one of {", ".join(self.SPLIT_TO_NAME.keys())}.')
228 | 
229 |     self.split = split
230 |     self.annotator_idx = annotator_idx
231 |     self.augment_ratio = "_augment{}".format(augment_ratio) if augment_ratio > 0 else ""
232 |     self.take = "_take{}".format(take_number) if take_number > 1 else ""
233 | 
234 |   def get_train_examples(self, data_dir):
235 |     train_file_name = self.TRAIN_FILE_NAME.format(
236 |       split_name=self.SPLIT_TO_NAME[self.split].format(annotator_idx=self.annotator_idx, augment_ratio=self.augment_ratio, take_number=self.take))
237 | 
238 |     return self._create_examples(
239 |       self._read_json(os.path.join(data_dir, train_file_name)),
240 |       'train')
241 | 
242 |   def get_dev_examples(self, data_dir):
243 |     dev_file_name = self.DEV_FILE_NAME.format(
244 |       split_name=self.SPLIT_TO_NAME[self.split].format(annotator_idx=self.annotator_idx, augment_ratio=self.augment_ratio, take_number=self.take))
245 | 
246 |     return self._create_examples(
247 |       self._read_json(os.path.join(data_dir, dev_file_name)),
248 |       'dev')
249 | 
250 |   def get_test_examples(self, data_dir):
251 |     test_file_name = self.TEST_FILE_NAME.format(
252 |       split_name=self.SPLIT_TO_NAME[self.split].format(annotator_idx=self.annotator_idx, augment_ratio=self.augment_ratio, take_number=self.take))
253 | 
254 |     return self._create_examples(
255 |       self._read_json(os.path.join(data_dir, test_file_name)),
256 |       'test')
257 | 
258 |   def get_labels(self):
259 |     """See base class."""
260 |     # These are the anonymized IDs of the top 5 annotators of OBQA,
261 |     # as they appear in the released dataset.
262 |     return ["b356d338b7", "7b06152ffc", "dc78319bb0",
263 |             "1e6dc77bb6", "cee82219a0", "OTHER"]
264 | 
265 |   def _create_examples(self, lines, set_type):
266 |     """Creates examples for the training and dev sets."""
267 |     labels = self.get_labels()
268 |     examples = []
269 |     for i, line in enumerate(lines):
270 |       qid = "%s-%s" % (set_type, i)
271 | 
272 |       answers = ' ; '.join(
273 |         [line['choice{}'.format(i)] for i in range(4)] + [line['correct_answer']]
274 |       )
275 | 
276 |       text_a = tokenization.convert_to_unicode(line['question'])
277 |       text_b = tokenization.convert_to_unicode(answers)
278 | 
279 |       if set_type == "test":
280 |         label = "OTHER"
281 |       else:
282 |         if line["turkIdAnonymized"] in labels:
283 |           label = tokenization.convert_to_unicode(line["turkIdAnonymized"])
284 |         else:
285 |           label = "OTHER"
286 |       examples.append(
287 |           InputExample(guid=qid, text_a=text_a, text_b=text_b, label=label))
288 | 
289 |     return examples
290 | 
291 | 
292 | def convert_single_example(ex_index, example, label_list, max_seq_length,
293 |                            tokenizer):
294 |   """Converts a single `InputExample` into a single `InputFeatures`."""
295 | 
296 |   if isinstance(example, PaddingInputExample):
297 |     return InputFeatures(
298 |         input_ids=[0] * max_seq_length,
299 |         input_mask=[0] * max_seq_length,
300 |         segment_ids=[0] * max_seq_length,
301 |         label_id=0,
302 |         is_real_example=False)
303 | 
304 |   label_map = {}
305 |   for (i, label) in enumerate(label_list):
306 |     label_map[label] = i
307 | 
308 |   tokens_a = tokenizer.tokenize(example.text_a)
309 |   tokens_b = None
310 |   if example.text_b:
311 |     tokens_b = tokenizer.tokenize(example.text_b)
312 | 
313 |   if tokens_b:
314 |     # Modifies `tokens_a` and `tokens_b` in place so that the total
315 |     # length is less than the specified length.
316 | # Account for [CLS], [SEP], [SEP] with "- 3" 317 | _truncate_seq_pair(tokens_a, tokens_b, max_seq_length - 3) 318 | else: 319 | # Account for [CLS] and [SEP] with "- 2" 320 | if len(tokens_a) > max_seq_length - 2: 321 | tokens_a = tokens_a[0:(max_seq_length - 2)] 322 | 323 | # The convention in BERT is: 324 | # (a) For sequence pairs: 325 | # tokens: [CLS] is this jack ##son ##ville ? [SEP] no it is not . [SEP] 326 | # type_ids: 0 0 0 0 0 0 0 0 1 1 1 1 1 1 327 | # (b) For single sequences: 328 | # tokens: [CLS] the dog is hairy . [SEP] 329 | # type_ids: 0 0 0 0 0 0 0 330 | # 331 | # Where "type_ids" are used to indicate whether this is the first 332 | # sequence or the second sequence. The embedding vectors for `type=0` and 333 | # `type=1` were learned during pre-training and are added to the wordpiece 334 | # embedding vector (and position vector). This is not *strictly* necessary 335 | # since the [SEP] token unambiguously separates the sequences, but it makes 336 | # it easier for the model to learn the concept of sequences. 337 | # 338 | # For classification tasks, the first vector (corresponding to [CLS]) is 339 | # used as the "sentence vector". Note that this only makes sense because 340 | # the entire model is fine-tuned. 341 | tokens = [] 342 | segment_ids = [] 343 | tokens.append("[CLS]") 344 | segment_ids.append(0) 345 | for token in tokens_a: 346 | tokens.append(token) 347 | segment_ids.append(0) 348 | tokens.append("[SEP]") 349 | segment_ids.append(0) 350 | 351 | if tokens_b: 352 | for token in tokens_b: 353 | tokens.append(token) 354 | segment_ids.append(1) 355 | tokens.append("[SEP]") 356 | segment_ids.append(1) 357 | 358 | input_ids = tokenizer.convert_tokens_to_ids(tokens) 359 | 360 | # The mask has 1 for real tokens and 0 for padding tokens. Only real 361 | # tokens are attended to. 362 | input_mask = [1] * len(input_ids) 363 | 364 | # Zero-pad up to the sequence length. 
365 | while len(input_ids) < max_seq_length: 366 | input_ids.append(0) 367 | input_mask.append(0) 368 | segment_ids.append(0) 369 | 370 | assert len(input_ids) == max_seq_length 371 | assert len(input_mask) == max_seq_length 372 | assert len(segment_ids) == max_seq_length 373 | 374 | label_id = label_map[example.label] 375 | if ex_index < 5: 376 | tf.logging.info("*** Example ***") 377 | tf.logging.info("guid: %s" % (example.guid)) 378 | tf.logging.info("tokens: %s" % " ".join( 379 | [tokenization.printable_text(x) for x in tokens])) 380 | tf.logging.info("input_ids: %s" % " ".join([str(x) for x in input_ids])) 381 | tf.logging.info("input_mask: %s" % " ".join([str(x) for x in input_mask])) 382 | tf.logging.info("segment_ids: %s" % " ".join([str(x) for x in segment_ids])) 383 | tf.logging.info("label: %s (id = %d)" % (example.label, label_id)) 384 | 385 | feature = InputFeatures( 386 | input_ids=input_ids, 387 | input_mask=input_mask, 388 | segment_ids=segment_ids, 389 | label_id=label_id, 390 | is_real_example=True) 391 | return feature 392 | 393 | 394 | def file_based_convert_examples_to_features( 395 | examples, label_list, max_seq_length, tokenizer, output_file): 396 | """Convert a set of `InputExample`s to a TFRecord file.""" 397 | 398 | writer = tf.python_io.TFRecordWriter(output_file) 399 | 400 | for (ex_index, example) in enumerate(examples): 401 | if ex_index % 10000 == 0: 402 | tf.logging.info("Writing example %d of %d" % (ex_index, len(examples))) 403 | 404 | feature = convert_single_example(ex_index, example, label_list, 405 | max_seq_length, tokenizer) 406 | 407 | def create_int_feature(values): 408 | f = tf.train.Feature(int64_list=tf.train.Int64List(value=list(values))) 409 | return f 410 | 411 | features = collections.OrderedDict() 412 | features["input_ids"] = create_int_feature(feature.input_ids) 413 | features["input_mask"] = create_int_feature(feature.input_mask) 414 | features["segment_ids"] = create_int_feature(feature.segment_ids) 415 | features["label_ids"] = create_int_feature([feature.label_id]) 416 | features["is_real_example"] = create_int_feature( 417 | [int(feature.is_real_example)]) 418 | 419 | tf_example = tf.train.Example(features=tf.train.Features(feature=features)) 420 | writer.write(tf_example.SerializeToString()) 421 | writer.close() 422 | 423 | 424 | def file_based_input_fn_builder(input_file, seq_length, is_training, 425 | drop_remainder): 426 | """Creates an `input_fn` closure to be passed to TPUEstimator.""" 427 | 428 | name_to_features = { 429 | "input_ids": tf.FixedLenFeature([seq_length], tf.int64), 430 | "input_mask": tf.FixedLenFeature([seq_length], tf.int64), 431 | "segment_ids": tf.FixedLenFeature([seq_length], tf.int64), 432 | "label_ids": tf.FixedLenFeature([], tf.int64), 433 | "is_real_example": tf.FixedLenFeature([], tf.int64), 434 | } 435 | 436 | def _decode_record(record, name_to_features): 437 | """Decodes a record to a TensorFlow example.""" 438 | example = tf.parse_single_example(record, name_to_features) 439 | 440 | # tf.Example only supports tf.int64, but the TPU only supports tf.int32. 441 | # So cast all int64 to int32. 442 | for name in list(example.keys()): 443 | t = example[name] 444 | if t.dtype == tf.int64: 445 | t = tf.to_int32(t) 446 | example[name] = t 447 | 448 | return example 449 | 450 | def input_fn(params): 451 | """The actual input function.""" 452 | batch_size = params["batch_size"] 453 | 454 | # For training, we want a lot of parallel reading and shuffling. 
455 | # For eval, we want no shuffling and parallel reading doesn't matter. 456 | d = tf.data.TFRecordDataset(input_file) 457 | if is_training: 458 | d = d.repeat() 459 | d = d.shuffle(buffer_size=100) 460 | 461 | d = d.apply( 462 | tf.contrib.data.map_and_batch( 463 | lambda record: _decode_record(record, name_to_features), 464 | batch_size=batch_size, 465 | drop_remainder=drop_remainder)) 466 | 467 | return d 468 | 469 | return input_fn 470 | 471 | 472 | def _truncate_seq_pair(tokens_a, tokens_b, max_length): 473 | """Truncates a sequence pair in place to the maximum length.""" 474 | 475 | # This is a simple heuristic which will always truncate the longer sequence 476 | # one token at a time. This makes more sense than truncating an equal percent 477 | # of tokens from each, since if one sequence is very short then each token 478 | # that's truncated likely contains more information than a longer sequence. 479 | while True: 480 | total_length = len(tokens_a) + len(tokens_b) 481 | if total_length <= max_length: 482 | break 483 | if len(tokens_a) > len(tokens_b): 484 | tokens_a.pop() 485 | else: 486 | tokens_b.pop() 487 | 488 | 489 | def create_model(bert_config, is_training, input_ids, input_mask, segment_ids, 490 | labels, num_labels, use_one_hot_embeddings): 491 | """Creates a classification model.""" 492 | model = modeling.BertModel( 493 | config=bert_config, 494 | is_training=is_training, 495 | input_ids=input_ids, 496 | input_mask=input_mask, 497 | token_type_ids=segment_ids, 498 | use_one_hot_embeddings=use_one_hot_embeddings) 499 | 500 | # In the demo, we are doing a simple classification task on the entire 501 | # segment. 502 | # 503 | # If you want to use the token-level output, use model.get_sequence_output() 504 | # instead. 505 | output_layer = model.get_pooled_output() 506 | 507 | hidden_size = output_layer.shape[-1].value 508 | 509 | output_weights = tf.get_variable( 510 | "output_weights", [num_labels, hidden_size], 511 | initializer=tf.truncated_normal_initializer(stddev=0.02)) 512 | 513 | output_bias = tf.get_variable( 514 | "output_bias", [num_labels], initializer=tf.zeros_initializer()) 515 | 516 | with tf.variable_scope("loss"): 517 | if is_training: 518 | # I.e., 0.1 dropout 519 | output_layer = tf.nn.dropout(output_layer, keep_prob=0.9) 520 | 521 | logits = tf.matmul(output_layer, output_weights, transpose_b=True) 522 | logits = tf.nn.bias_add(logits, output_bias) 523 | probabilities = tf.nn.softmax(logits, axis=-1) 524 | log_probs = tf.nn.log_softmax(logits, axis=-1) 525 | 526 | one_hot_labels = tf.one_hot(labels, depth=num_labels, dtype=tf.float32) 527 | 528 | per_example_loss = -tf.reduce_sum(one_hot_labels * log_probs, axis=-1) 529 | loss = tf.reduce_mean(per_example_loss) 530 | 531 | return (loss, per_example_loss, logits, probabilities) 532 | 533 | 534 | def model_fn_builder(bert_config, num_labels, init_checkpoint, learning_rate, 535 | num_train_steps, num_warmup_steps, use_tpu, 536 | use_one_hot_embeddings): 537 | """Returns `model_fn` closure for TPUEstimator.""" 538 | 539 | def model_fn(features, labels, mode, params): # pylint: disable=unused-argument 540 | """The `model_fn` for TPUEstimator.""" 541 | 542 | tf.logging.info("*** Features ***") 543 | for name in sorted(features.keys()): 544 | tf.logging.info(" name = %s, shape = %s" % (name, features[name].shape)) 545 | 546 | input_ids = features["input_ids"] 547 | input_mask = features["input_mask"] 548 | segment_ids = features["segment_ids"] 549 | label_ids = features["label_ids"] 550 | is_real_example = 
None 551 | if "is_real_example" in features: 552 | is_real_example = tf.cast(features["is_real_example"], dtype=tf.float32) 553 | else: 554 | is_real_example = tf.ones(tf.shape(label_ids), dtype=tf.float32) 555 | 556 | is_training = (mode == tf.estimator.ModeKeys.TRAIN) 557 | 558 | (total_loss, per_example_loss, logits, probabilities) = create_model( 559 | bert_config, is_training, input_ids, input_mask, segment_ids, label_ids, 560 | num_labels, use_one_hot_embeddings) 561 | 562 | tvars = tf.trainable_variables() 563 | initialized_variable_names = {} 564 | scaffold_fn = None 565 | if init_checkpoint: 566 | (assignment_map, initialized_variable_names 567 | ) = modeling.get_assignment_map_from_checkpoint(tvars, init_checkpoint) 568 | if use_tpu: 569 | 570 | def tpu_scaffold(): 571 | tf.train.init_from_checkpoint(init_checkpoint, assignment_map) 572 | return tf.train.Scaffold() 573 | 574 | scaffold_fn = tpu_scaffold 575 | else: 576 | tf.train.init_from_checkpoint(init_checkpoint, assignment_map) 577 | 578 | tf.logging.info("**** Trainable Variables ****") 579 | for var in tvars: 580 | init_string = "" 581 | if var.name in initialized_variable_names: 582 | init_string = ", *INIT_FROM_CKPT*" 583 | tf.logging.info(" name = %s, shape = %s%s", var.name, var.shape, 584 | init_string) 585 | 586 | output_spec = None 587 | if mode == tf.estimator.ModeKeys.TRAIN: 588 | 589 | train_op = optimization.create_optimizer( 590 | total_loss, learning_rate, num_train_steps, num_warmup_steps, use_tpu) 591 | 592 | output_spec = tf.contrib.tpu.TPUEstimatorSpec( 593 | mode=mode, 594 | loss=total_loss, 595 | train_op=train_op, 596 | scaffold_fn=scaffold_fn) 597 | elif mode == tf.estimator.ModeKeys.EVAL: 598 | 599 | def metric_fn(per_example_loss, label_ids, logits, is_real_example): 600 | predictions = tf.argmax(logits, axis=-1, output_type=tf.int32) 601 | accuracy = tf.metrics.accuracy( 602 | labels=label_ids, predictions=predictions, weights=is_real_example) 603 | loss = tf.metrics.mean(values=per_example_loss, weights=is_real_example) 604 | return { 605 | "eval_accuracy": accuracy, 606 | "eval_loss": loss, 607 | } 608 | 609 | eval_metrics = (metric_fn, 610 | [per_example_loss, label_ids, logits, is_real_example]) 611 | output_spec = tf.contrib.tpu.TPUEstimatorSpec( 612 | mode=mode, 613 | loss=total_loss, 614 | eval_metrics=eval_metrics, 615 | scaffold_fn=scaffold_fn) 616 | else: 617 | output_spec = tf.contrib.tpu.TPUEstimatorSpec( 618 | mode=mode, 619 | predictions={"probabilities": probabilities}, 620 | scaffold_fn=scaffold_fn) 621 | return output_spec 622 | 623 | return model_fn 624 | 625 | 626 | # This function is not used by this file but is still used by the Colab and 627 | # people who depend on it. 628 | def input_fn_builder(features, seq_length, is_training, drop_remainder): 629 | """Creates an `input_fn` closure to be passed to TPUEstimator.""" 630 | 631 | all_input_ids = [] 632 | all_input_mask = [] 633 | all_segment_ids = [] 634 | all_label_ids = [] 635 | 636 | for feature in features: 637 | all_input_ids.append(feature.input_ids) 638 | all_input_mask.append(feature.input_mask) 639 | all_segment_ids.append(feature.segment_ids) 640 | all_label_ids.append(feature.label_id) 641 | 642 | def input_fn(params): 643 | """The actual input function.""" 644 | batch_size = params["batch_size"] 645 | 646 | num_examples = len(features) 647 | 648 | # This is for demo purposes and does NOT scale to large data sets. 
We do 649 | # not use Dataset.from_generator() because that uses tf.py_func which is 650 | # not TPU compatible. The right way to load data is with TFRecordReader. 651 | d = tf.data.Dataset.from_tensor_slices({ 652 | "input_ids": 653 | tf.constant( 654 | all_input_ids, shape=[num_examples, seq_length], 655 | dtype=tf.int32), 656 | "input_mask": 657 | tf.constant( 658 | all_input_mask, 659 | shape=[num_examples, seq_length], 660 | dtype=tf.int32), 661 | "segment_ids": 662 | tf.constant( 663 | all_segment_ids, 664 | shape=[num_examples, seq_length], 665 | dtype=tf.int32), 666 | "label_ids": 667 | tf.constant(all_label_ids, shape=[num_examples], dtype=tf.int32), 668 | }) 669 | 670 | if is_training: 671 | d = d.repeat() 672 | d = d.shuffle(buffer_size=100) 673 | 674 | d = d.batch(batch_size=batch_size, drop_remainder=drop_remainder) 675 | return d 676 | 677 | return input_fn 678 | 679 | 680 | # This function is not used by this file but is still used by the Colab and 681 | # people who depend on it. 682 | def convert_examples_to_features(examples, label_list, max_seq_length, 683 | tokenizer): 684 | """Convert a set of `InputExample`s to a list of `InputFeatures`.""" 685 | 686 | features = [] 687 | for (ex_index, example) in enumerate(examples): 688 | if ex_index % 10000 == 0: 689 | tf.logging.info("Writing example %d of %d" % (ex_index, len(examples))) 690 | 691 | feature = convert_single_example(ex_index, example, label_list, 692 | max_seq_length, tokenizer) 693 | 694 | features.append(feature) 695 | return features 696 | 697 | 698 | def main(_): 699 | tf.logging.set_verbosity(tf.logging.INFO) 700 | 701 | if not FLAGS.do_train and not FLAGS.do_eval and not FLAGS.do_predict: 702 | raise ValueError( 703 | "At least one of `do_train`, `do_eval` or `do_predict' must be True.") 704 | 705 | bert_config = modeling.BertConfig.from_json_file(FLAGS.bert_config_file) 706 | 707 | if FLAGS.max_seq_length > bert_config.max_position_embeddings: 708 | raise ValueError( 709 | "Cannot use sequence length %d because the BERT model " 710 | "was only trained up to sequence length %d" % 711 | (FLAGS.max_seq_length, bert_config.max_position_embeddings)) 712 | 713 | tf.gfile.MakeDirs(FLAGS.output_dir) 714 | 715 | processor = CommonsenseQAProcessor(split=FLAGS.split, annotator_idx=FLAGS.annotator_idx, 716 | augment_ratio=FLAGS.augment_ratio, take_number=FLAGS.take_number) 717 | 718 | label_list = processor.get_labels() 719 | 720 | tokenizer = tokenization.FullTokenizer( 721 | vocab_file=FLAGS.vocab_file, do_lower_case=FLAGS.do_lower_case) 722 | 723 | tpu_cluster_resolver = None 724 | if FLAGS.use_tpu and FLAGS.tpu_name: 725 | tpu_cluster_resolver = tf.contrib.cluster_resolver.TPUClusterResolver( 726 | FLAGS.tpu_name, zone=FLAGS.tpu_zone, project=FLAGS.gcp_project) 727 | 728 | is_per_host = tf.contrib.tpu.InputPipelineConfig.PER_HOST_V2 729 | run_config = tf.contrib.tpu.RunConfig( 730 | cluster=tpu_cluster_resolver, 731 | master=FLAGS.master, 732 | model_dir=FLAGS.output_dir, 733 | save_checkpoints_steps=FLAGS.save_checkpoints_steps, 734 | tpu_config=tf.contrib.tpu.TPUConfig( 735 | iterations_per_loop=FLAGS.iterations_per_loop, 736 | num_shards=FLAGS.num_tpu_cores, 737 | per_host_input_for_training=is_per_host)) 738 | 739 | train_examples = None 740 | num_train_steps = None 741 | num_warmup_steps = None 742 | if FLAGS.do_train: 743 | train_examples = processor.get_train_examples(FLAGS.data_dir) 744 | num_train_steps = int( 745 | len(train_examples) / FLAGS.train_batch_size * FLAGS.num_train_epochs) 746 | 
num_warmup_steps = int(num_train_steps * FLAGS.warmup_proportion) 747 | 748 | model_fn = model_fn_builder( 749 | bert_config=bert_config, 750 | num_labels=len(label_list), 751 | init_checkpoint=FLAGS.init_checkpoint, 752 | learning_rate=FLAGS.learning_rate, 753 | num_train_steps=num_train_steps, 754 | num_warmup_steps=num_warmup_steps, 755 | use_tpu=FLAGS.use_tpu, 756 | use_one_hot_embeddings=FLAGS.use_tpu) 757 | 758 | # If TPU is not available, this will fall back to normal Estimator on CPU 759 | # or GPU. 760 | estimator = tf.contrib.tpu.TPUEstimator( 761 | use_tpu=FLAGS.use_tpu, 762 | model_fn=model_fn, 763 | config=run_config, 764 | train_batch_size=FLAGS.train_batch_size, 765 | eval_batch_size=FLAGS.eval_batch_size, 766 | predict_batch_size=FLAGS.predict_batch_size) 767 | 768 | if FLAGS.do_train: 769 | train_file = os.path.join(FLAGS.output_dir, "train.tf_record") 770 | train_seq_length = file_based_convert_examples_to_features( 771 | train_examples, label_list, FLAGS.max_seq_length, tokenizer, train_file) 772 | tf.logging.info("***** Running training *****") 773 | tf.logging.info(" Num examples = %d", len(train_examples)) 774 | tf.logging.info(" Batch size = %d", FLAGS.train_batch_size) 775 | tf.logging.info(" Num steps = %d", num_train_steps) 776 | tf.logging.info(" Longest training sequence = %d", train_seq_length) 777 | train_input_fn = file_based_input_fn_builder( 778 | input_file=train_file, 779 | seq_length=FLAGS.max_seq_length, 780 | is_training=True, 781 | drop_remainder=True) 782 | estimator.train(input_fn=train_input_fn, max_steps=num_train_steps) 783 | 784 | if FLAGS.do_eval: 785 | eval_examples = processor.get_dev_examples(FLAGS.data_dir) 786 | eval_file = os.path.join(FLAGS.output_dir, "eval.tf_record") 787 | eval_seq_length = file_based_convert_examples_to_features( 788 | eval_examples, label_list, FLAGS.max_seq_length, tokenizer, eval_file) 789 | 790 | tf.logging.info("***** Running evaluation *****") 791 | tf.logging.info(" Num examples = %d", len(eval_examples)) 792 | tf.logging.info(" Batch size = %d", FLAGS.eval_batch_size) 793 | tf.logging.info(" Longest eval sequence = %d", eval_seq_length) 794 | 795 | # This tells the estimator to run through the entire set. 796 | eval_steps = None 797 | # However, if running eval on the TPU, you will need to specify the 798 | # number of steps. 799 | if FLAGS.use_tpu: 800 | # Eval will be slightly WRONG on the TPU because it will truncate 801 | # the last batch. 
802 | eval_steps = int(len(eval_examples) / FLAGS.eval_batch_size) 803 | 804 | eval_drop_remainder = True if FLAGS.use_tpu else False 805 | eval_input_fn = file_based_input_fn_builder( 806 | input_file=eval_file, 807 | seq_length=FLAGS.max_seq_length, 808 | is_training=False, 809 | drop_remainder=eval_drop_remainder) 810 | 811 | result = estimator.evaluate(input_fn=eval_input_fn, steps=eval_steps) 812 | 813 | output_eval_file = os.path.join(FLAGS.output_dir, "eval_results.txt") 814 | with tf.gfile.GFile(output_eval_file, "w") as writer: 815 | tf.logging.info("***** Eval results *****") 816 | for key in sorted(result.keys()): 817 | tf.logging.info(" %s = %s", key, str(result[key])) 818 | writer.write("%s = %s\n" % (key, str(result[key]))) 819 | 820 | if FLAGS.do_predict: 821 | predict_examples = processor.get_test_examples(FLAGS.data_dir) 822 | num_actual_predict_examples = len(predict_examples) 823 | predict_file = os.path.join(FLAGS.output_dir, "predict.tf_record") 824 | predict_seq_length = file_based_convert_examples_to_features( 825 | predict_examples, label_list, 826 | FLAGS.max_seq_length, tokenizer, 827 | predict_file) 828 | 829 | tf.logging.info("***** Running prediction*****") 830 | tf.logging.info(" Num examples = %d", len(predict_examples)) 831 | tf.logging.info(" Batch size = %d", FLAGS.predict_batch_size) 832 | tf.logging.info(" Longest predict sequence = %d", predict_seq_length) 833 | 834 | if FLAGS.use_tpu: 835 | # Warning: According to tpu_estimator.py Prediction on TPU is an 836 | # experimental feature and hence not supported here 837 | raise ValueError("Prediction in TPU not supported") 838 | 839 | predict_drop_remainder = True if FLAGS.use_tpu else False 840 | predict_input_fn = file_based_input_fn_builder( 841 | input_file=predict_file, 842 | seq_length=FLAGS.max_seq_length, 843 | is_training=False, 844 | drop_remainder=predict_drop_remainder) 845 | 846 | result = estimator.predict(input_fn=predict_input_fn) 847 | 848 | output_predict_file = os.path.join(FLAGS.output_dir, "test_results.tsv") 849 | with tf.gfile.GFile(output_predict_file, "w") as writer: 850 | num_written_lines = 0 851 | tf.logging.info("***** Predict results *****") 852 | for (i, prediction) in enumerate(result): 853 | probabilities = prediction["probabilities"] 854 | if i >= num_actual_predict_examples: 855 | break 856 | output_line = "\t".join( 857 | str(class_probability) 858 | for class_probability in probabilities) + "\n" 859 | writer.write(output_line) 860 | num_written_lines += 1 861 | assert num_written_lines == num_actual_predict_examples 862 | 863 | 864 | if __name__ == "__main__": 865 | flags.mark_flag_as_required("data_dir") 866 | flags.mark_flag_as_required("vocab_file") 867 | flags.mark_flag_as_required("bert_config_file") 868 | flags.mark_flag_as_required("output_dir") 869 | tf.app.run() 870 | -------------------------------------------------------------------------------- /obqa_create_data_splits.py: -------------------------------------------------------------------------------- 1 | 2 | import argparse 3 | import json 4 | import pandas as pd 5 | 6 | 7 | pd.set_option('display.width', 1000) 8 | pd.set_option('display.max_columns', 1000) 9 | 10 | 11 | def load_data(): 12 | trn_file_path = "./OpenBookQA-V1-Sep2018/Data/Additional/train_complete.jsonl" 13 | dev_file_path = "./OpenBookQA-V1-Sep2018/Data/Additional/dev_complete.jsonl" 14 | 15 | examples = [] 16 | with open(trn_file_path, "r") as fd: 17 | examples.extend(fd.readlines()) 18 | num_trn_examples = len(examples) 19 | print("read 
{} training examples.".format(num_trn_examples))
20 | 
21 |     with open(dev_file_path, "r") as fd:
22 |         examples.extend(fd.readlines())
23 |     print("read {} development examples.".format(len(examples) - num_trn_examples))
24 | 
25 |     def parse_json_line(jline):
26 |         line = json.loads(jline)
27 |         parsed_line = {
28 |             "id": line["id"],
29 |             "turkIdAnonymized": line["turkIdAnonymized"],
30 |             "answerKey": line["answerKey"],
31 |             "clarity": line["clarity"],
32 |             "fact1": line["fact1"],
33 |             "humanScore": line["humanScore"],
34 |             "question": line["question"]["stem"],
35 |             "choice0": line["question"]["choices"][0]["text"],
36 |             "choice1": line["question"]["choices"][1]["text"],
37 |             "choice2": line["question"]["choices"][2]["text"],
38 |             "choice3": line["question"]["choices"][3]["text"]
39 |         }
40 |         anskey_to_choice = {"A": "choice0", "B": "choice1", "C": "choice2", "D": "choice3"}
41 |         parsed_line["correct_answer"] = parsed_line[anskey_to_choice[parsed_line["answerKey"]]]
42 | 
43 |         return parsed_line
44 | 
45 |     print("total number of examples: {}.".format(len(examples)))
46 |     df = pd.DataFrame([parse_json_line(jline) for jline in examples])
47 |     df = df.sample(frac=1.0)  # shuffle the examples
48 | 
49 |     return df
50 | 
51 | 
52 | def dump_to_json(example_set, split_type, split_partition, split_index, augment):
53 |     example_set_filtered_cols = example_set.loc[:, ['id', 'turkIdAnonymized', 'question', 'correct_answer',
54 |                                                     'answerKey', 'choice0', 'choice1', 'choice2', 'choice3']]
55 | 
56 |     data_dict = example_set_filtered_cols.reset_index().to_dict(orient='records')
57 |     file_name = split_partition + '_' + split_type + '_' + str(split_index) + augment + ".json"
58 |     print('Saving %s with %d examples' % (file_name, len(data_dict)))
59 |     with open(file_name, "w") as f:
60 |         json.dump(data_dict, f, sort_keys=True, indent=4)
61 | 
62 | 
63 | def print_annotator_distribution(data):
64 |     dff = data.groupby(['turkIdAnonymized'])[
65 |         [c for c in data.columns if c != 'turkIdAnonymized']].nunique().sort_values(
66 |         'id', ascending=False).reset_index()
67 |     total_number_of_examples = dff.id.sum()
68 |     print("total number of unique examples: ", total_number_of_examples)
69 |     dff["percentage"] = round(dff.id * 100.0 / total_number_of_examples, 1)
70 |     print(dff)
71 | 
72 | 
73 | def split_dev_sets(dev_sets, num_splits, augment_ratio):
74 |     dev_sets_new = []
75 |     dev_sets_move = []
76 |     for i in range(num_splits):
77 |         split_annotators = dev_sets[i]['turkIdAnonymized'].unique()
78 |         split_annotator_splits = [
79 |             dev_sets[i][dev_sets[i]['turkIdAnonymized'] == annotator]
80 |             for annotator in split_annotators
81 |         ]
82 |         annotator_split_idx = [
83 |             int(augment_ratio * len(split_annotator_splits[j]))
84 |             for j in range(len(split_annotators))
85 |         ]
86 | 
87 |         dev_set_new = []
88 |         dev_set_move = []
89 |         for j in range(len(split_annotators)):
90 |             dev_set_new.append(split_annotator_splits[j][annotator_split_idx[j]:])
91 |             dev_set_move.append(split_annotator_splits[j][0:annotator_split_idx[j]])
92 | 
93 |         dev_sets_new.append(pd.concat(dev_set_new))
94 |         dev_sets_move.append(pd.concat(dev_set_move))
95 | 
96 |     dev_split_idx = [
97 |         len(dev_sets_move[i])
98 |         for i in range(num_splits)
99 |     ]
100 | 
101 |     return dev_sets_new, dev_sets_move, dev_split_idx
102 | 
103 | 
104 | def augment_data(train_sets, dev_sets, num_splits, augment_ratio,
105 |                  by_annotator, dev_split_idx=None, normalize=True):
106 |     if by_annotator:
107 |         dev_sets_new, dev_sets_move, dev_split_idx = split_dev_sets(
108 |             dev_sets, num_splits, augment_ratio)
109 |     else:
110 |         if not 
dev_split_idx: 111 | dev_split_idx = [ 112 | int(augment_ratio * len(dev_sets[i])) 113 | for i in range(num_splits) 114 | ] 115 | dev_sets_new = [ 116 | dev_sets[i][dev_split_idx[i]:] 117 | for i in range(num_splits) 118 | ] 119 | dev_sets_move = [ 120 | dev_sets[i][0:dev_split_idx[i]] 121 | for i in range(num_splits) 122 | ] 123 | 124 | if normalize: 125 | augmented_train_sets = [ 126 | train_sets[i].sample(n=len(train_sets[i]) - len(dev_sets_move[i])) 127 | for i in range(num_splits) 128 | ] 129 | else: 130 | augmented_train_sets = [ 131 | train_sets[i].sample(frac=1.0) 132 | for i in range(num_splits) 133 | ] 134 | 135 | augmented_train_sets = [ 136 | pd.concat([augmented_train_sets[i], dev_sets_move[i]], 137 | ignore_index=True) 138 | for i in range(num_splits) 139 | ] 140 | 141 | return augmented_train_sets, dev_sets_new, dev_split_idx 142 | 143 | 144 | def print_split_sizes(train_sets_by_annotator, dev_sets_by_annotator, num_splits): 145 | for i in range(num_splits): 146 | print("annotator {}: train set size {}, dev set size {}, total size {}".format( 147 | i, len(train_sets_by_annotator[i]), len(dev_sets_by_annotator[i]), 148 | len(train_sets_by_annotator[i]) + len(dev_sets_by_annotator[i]))) 149 | 150 | 151 | def get_annotator_splits(all_data, dev_annotator_list, num_splits, multi_mode): 152 | if multi_mode: 153 | dev_sets_by_annotator = [ 154 | all_data.loc[all_data["turkIdAnonymized"].isin(dev_annotator_list[i])] 155 | for i in range(num_splits) 156 | ] 157 | train_sets_by_annotator = [ 158 | all_data.loc[~all_data["turkIdAnonymized"].isin(dev_annotator_list[i])] 159 | for i in range(num_splits) 160 | ] 161 | else: 162 | dev_sets_by_annotator = [ 163 | all_data.loc[all_data["turkIdAnonymized"] == dev_annotator_list[i]] 164 | for i in range(num_splits) 165 | ] 166 | train_sets_by_annotator = [ 167 | all_data.loc[all_data["turkIdAnonymized"] != dev_annotator_list[i]] 168 | for i in range(num_splits) 169 | ] 170 | 171 | return train_sets_by_annotator, dev_sets_by_annotator 172 | 173 | 174 | def get_random_splits(all_data, train_sets_by_annotator, dev_sets_by_annotator, num_splits): 175 | shuffled_data_frames = [ 176 | all_data.sample(frac=1.0) 177 | for _ in range(num_splits) 178 | ] 179 | 180 | dev_sets_random = [ 181 | shuffled_data_frames[i][0:dev_sets_by_annotator[i].shape[0]] 182 | for i in range(num_splits) 183 | ] 184 | train_sets_random = [ 185 | shuffled_data_frames[i][dev_sets_by_annotator[i].shape[0]: 186 | dev_sets_by_annotator[i].shape[0] + train_sets_by_annotator[i].shape[0]] 187 | for i in range(num_splits) 188 | ] 189 | 190 | return train_sets_random, dev_sets_random 191 | 192 | 193 | def write_split_files(train_sets_by_annotator, dev_sets_by_annotator, 194 | train_sets_random, dev_sets_random, num_splits, 195 | multi_mode, augment_ratio, take_number, only_random, only_annotator): 196 | type_add = "_multi" if multi_mode else "" 197 | for i in range(0, num_splits): 198 | augment = "_augment{}".format(augment_ratio) if augment_ratio > 0 else "" 199 | take = "_take{}".format(take_number) if take_number > 1 else "" 200 | properties = augment + take 201 | 202 | if not only_random: 203 | dump_to_json(train_sets_by_annotator[i], "annotator" + type_add, "train", i, properties) 204 | dump_to_json(dev_sets_by_annotator[i], "annotator" + type_add, "dev", i, properties) 205 | 206 | if not only_annotator: 207 | dump_to_json(train_sets_random[i], "rand" + type_add, "train", i, properties) 208 | dump_to_json(dev_sets_random[i], "rand" + type_add, "dev", i, properties) 209 | 210 | 
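# Note on file naming (illustrative): a call such as
#   dump_to_json(train_set, "annotator", "train", 0, "_augment0.2_take3")
# writes train_annotator_0_augment0.2_take3.json, which is exactly the name the
# SPLIT_TO_NAME templates in the fine-tuning scripts resolve to for
# --split=annotator --annotator_idx=0 --augment_ratio=0.2 --take_number=3.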
211 | def create_random_augmented_series(multi_mode, take_number): 212 | all_data = load_data() 213 | print_annotator_distribution(all_data) 214 | 215 | sorted_annotators = all_data.groupby(['turkIdAnonymized']).nunique().sort_values( 216 | 'id', ascending=False)['id'].keys() 217 | 218 | if multi_mode: 219 | dev_annotator_list = [sorted_annotators[i:i + 5].values for i in range(20, 45, 5)] 220 | num_splits = 5 221 | else: 222 | dev_annotator_list = [sorted_annotators[i] for i in range(5)] 223 | num_splits = 5 224 | 225 | train_sets_by_annotator, dev_sets_by_annotator = get_annotator_splits( 226 | all_data, dev_annotator_list, num_splits, multi_mode) 227 | 228 | train_sets_random, dev_sets_random = get_random_splits( 229 | all_data, train_sets_by_annotator, dev_sets_by_annotator, num_splits) 230 | 231 | write_split_files([], [], train_sets_random, dev_sets_random, num_splits, 232 | multi_mode, 0.0, take_number, 233 | only_random=True, only_annotator=False) 234 | 235 | for augment_ratio in [0.1, 0.2, 0.3]: 236 | print("before augmentation.") 237 | print_split_sizes(train_sets_random, dev_sets_random, num_splits) 238 | 239 | augmented_train_sets_by_annotator, augmented_dev_sets_by_annotator, dev_split_idx = augment_data( 240 | train_sets_by_annotator, dev_sets_by_annotator, num_splits, augment_ratio, 241 | by_annotator=True) 242 | 243 | augmented_train_sets_random, augmented_dev_sets_random, _ = augment_data( 244 | train_sets_random, dev_sets_random, num_splits, augment_ratio, 245 | by_annotator=False, dev_split_idx=dev_split_idx) 246 | 247 | print("after augmentation.") 248 | print_split_sizes(augmented_train_sets_random, augmented_dev_sets_random, 249 | num_splits) 250 | 251 | for i in range(num_splits): 252 | assert augmented_train_sets_by_annotator[i].shape == augmented_train_sets_random[i].shape 253 | assert augmented_dev_sets_by_annotator[i].shape == augmented_dev_sets_random[i].shape 254 | assert augmented_train_sets_random[i].shape == train_sets_random[i].shape 255 | print("split {}: train {}, dev {}".format( 256 | i, augmented_train_sets_random[i].shape, augmented_dev_sets_random[i].shape)) 257 | 258 | write_split_files([], [], augmented_train_sets_random, augmented_dev_sets_random, 259 | num_splits, multi_mode, augment_ratio, take_number, 260 | only_random=True, only_annotator=False) 261 | 262 | 263 | def create_data_splits(multi_mode=True, augment_ratio=0.0, take_number=1, only_random=False, only_annotator=False): 264 | all_data = load_data() 265 | print_annotator_distribution(all_data) 266 | 267 | sorted_annotators = all_data.groupby(['turkIdAnonymized']).nunique().sort_values( 268 | 'id', ascending=False)['id'].keys() 269 | 270 | if multi_mode: 271 | dev_annotator_list = [sorted_annotators[i:i+5].values for i in range(20, 45, 5)] 272 | num_splits = 5 273 | else: 274 | dev_annotator_list = [sorted_annotators[i] for i in range(5)] 275 | num_splits = 5 276 | 277 | train_sets_by_annotator, dev_sets_by_annotator = get_annotator_splits( 278 | all_data, dev_annotator_list, num_splits, multi_mode) 279 | 280 | train_sets_random, dev_sets_random = get_random_splits( 281 | all_data, train_sets_by_annotator, dev_sets_by_annotator, num_splits) 282 | 283 | if augment_ratio > 0: 284 | print("before augmentation.") 285 | print_split_sizes(train_sets_by_annotator, dev_sets_by_annotator, num_splits) 286 | 287 | train_sets_by_annotator, dev_sets_by_annotator, dev_split_idx = augment_data( 288 | train_sets_by_annotator, dev_sets_by_annotator, num_splits, augment_ratio, 289 | by_annotator=True) 290 | 
291 |         train_sets_random, dev_sets_random, _ = augment_data(
292 |             train_sets_random, dev_sets_random, num_splits, augment_ratio,
293 |             by_annotator=False, dev_split_idx=dev_split_idx)
294 | 
295 |         print("after augmentation.")
296 |         print_split_sizes(train_sets_by_annotator, dev_sets_by_annotator, num_splits)
297 | 
298 |         for i in range(num_splits):
299 |             assert train_sets_random[i].shape == train_sets_by_annotator[i].shape
300 |             assert dev_sets_random[i].shape == dev_sets_by_annotator[i].shape
301 |             print("split {}: train {}, dev {}".format(i, train_sets_random[i].shape, dev_sets_random[i].shape))
302 | 
303 |     write_split_files(train_sets_by_annotator, dev_sets_by_annotator,
304 |                       train_sets_random, dev_sets_random, num_splits,
305 |                       multi_mode, augment_ratio, take_number, only_random, only_annotator)
306 | 
307 | 
308 | def main(args):
309 |     for i in range(args.repeat):
310 |         if args.augment_random_series:
311 |             create_random_augmented_series(args.multi_mode,
312 |                                            args.take_number + i)
313 |         else:
314 |             create_data_splits(args.multi_mode,
315 |                                args.augment_ratio,
316 |                                args.take_number + i,
317 |                                args.only_random,
318 |                                args.only_annotator)
319 | 
320 | 
321 | if __name__ == '__main__':
322 |     parser = argparse.ArgumentParser(description="OpenBookQA: create data splits")
323 |     parser.add_argument('--augment_ratio', type=float, default=0.0,
324 |                         help='fraction of annotator examples to augment the train set with.')
325 |     parser.add_argument('--take_number', type=int, default=1,
326 |                         help='the number of times the specified split is being generated. '
327 |                              'If repeat > 1, this argument is the starting index of the generated splits.')
328 |     parser.add_argument('--repeat', type=int, default=1,
329 |                         help='how many times to generate the specified split.')
330 | 
331 |     parser.add_argument('--multi_mode', action='store_true', default=False,
332 |                         help='multi-annotator mode.')
333 |     parser.add_argument('--augment_random_series', action='store_true', default=False,
334 |                         help='a series of augmentation splits, starting from random splits '
335 |                              'that match the annotator splits in size.')
336 |     parser.add_argument('--only_random', action='store_true', default=False,
337 |                         help='generate only random splits.')
338 |     parser.add_argument('--only_annotator', action='store_true', default=False,
339 |                         help='generate only annotator splits.')
340 | 
341 |     args = parser.parse_args()
342 | 
343 |     main(args)
344 | 
345 | 
--------------------------------------------------------------------------------
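As a quick sanity check, a generated split can be inspected directly. The sketch below is illustrative only (it is not part of the repository) and assumes a split produced with the default flags, i.e. a file named `train_annotator_0.json`:

```python
import json

# Illustrative only: load an annotator-based train split produced by
# obqa_create_data_splits.py with default flags (annotator index 0,
# no augmentation, take 1).
with open("train_annotator_0.json") as f:
    examples = json.load(f)

print("number of examples:", len(examples))

# Each record carries the columns selected in dump_to_json.
ex = examples[0]
print(ex["turkIdAnonymized"])                         # anonymized annotator ID
print(ex["question"])                                 # question stem
print([ex["choice{}".format(i)] for i in range(4)])   # the four answer choices
print(ex["answerKey"], ex["correct_answer"])          # gold answer
```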