├── .gitignore
├── README.md
├── model_fine_tuning_scripts
│   ├── run_commonsense_qa.py
│   ├── run_commonsense_qa_recognition.py
│   ├── run_mnli.py
│   ├── run_openbookqa.py
│   └── run_openbookqa_recognition.py
└── obqa_create_data_splits.py
/.gitignore:
--------------------------------------------------------------------------------
1 | .idea/
2 |
3 |
--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
1 | # Annotator bias in Natural Language Understanding datasets
2 |
3 | This repository contains the accompanying code for the paper:
4 |
5 | **"Are We Modeling the Task or the Annotator? An Investigation of Annotator Bias in Natural Language Understanding Datasets."** Mor Geva, Yoav Goldberg and Jonathan Berant. *In EMNLP-IJCNLP, 2019*.
6 | [https://arxiv.org/abs/1908.07898](https://arxiv.org/abs/1908.07898)
7 |
8 | In this work, we investigate whether prevalent crowdsourcing practices for building NLU datasets introduce an "annotator bias" in the data that leads to an over-estimation of model performance.
9 |
10 | In this repository, we release our code for:
11 | * Fine-tuning BERT on the three datasets considered in the paper (experiments 1-3)
12 | * Converting the three datasets considered in the paper into the annotator recognition task format (experiment 2; this is done as part of the fine-tuning scripts)
13 | * Generating annotator-based splits (experiment 3)
14 |
15 |
16 | Please note that the data splits are generated randomly; therefore, reproducing the exact results in the paper is not possible, and some variation should be expected (see the standard deviation values reported in the paper).
17 |
18 | Our experiments were conducted in a **python 3.6.8** environment with **tensorflow 1.11.0** and **pandas 0.24.2**.
19 |
20 | ## Generation of annotator-based data splits
21 | We considered three NLU datasets in our experiments:
22 | * [MultiNLI](https://www.nyu.edu/projects/bowman/multinli/) (MNLI)
23 | * [OpenBookQA](http://data.allenai.org/OpenBookQA) (OBQA)
24 | * [CommonsenseQA](https://www.tau-nlp.org/commonsenseqa) (CSQA)
25 |
26 | The script `obqa_create_data_splits.py` contains the code used for generating annotator-based splits of OBQA and their corresponding random splits (a conceptual sketch is given below). The splitting method is exactly the same for MNLI and CSQA; however, we do not provide the scripts for MNLI and CSQA, since annotator information is not publicly available for these datasets (see "note on data availability" below).
27 |
28 | **Note on data availability**: At the time of publishing this work, annotator IDs were publicly available only for OBQA. For MNLI and CSQA, this information was not available as part of the official releases. If you are interested in the annotator information, please contact the creators of these datasets.
29 |
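To make the split semantics concrete, here is a minimal sketch of the two split types. The helper names are hypothetical; the actual `obqa_create_data_splits.py` covers more (e.g., the `--repeat` and `--augment_random_series` options shown below, and writing the split files to disk). It assumes each example is a dict carrying the anonymized annotator ID under `turkIdAnonymized`, the field name used by the CommonsenseQA recognition script in this repository:

```python
import random

def annotator_based_split(examples, annotator_id, augment_ratio=0.0, seed=0):
    """Hold out one annotator's examples as the dev set; train on the rest.

    With augment_ratio > 0, that proportion of the held-out annotator's
    examples is moved back into the training set (an "augmented" split).
    """
    rng = random.Random(seed)
    dev = [ex for ex in examples if ex["turkIdAnonymized"] == annotator_id]
    train = [ex for ex in examples if ex["turkIdAnonymized"] != annotator_id]
    rng.shuffle(dev)
    num_augment = int(augment_ratio * len(dev))
    return train + dev[:num_augment], dev[num_augment:]

def random_split(examples, dev_size, seed=0):
    """Control split: the same train/dev sizes, but assigned at random."""
    shuffled = list(examples)
    random.Random(seed).shuffle(shuffled)
    return shuffled[dev_size:], shuffled[:dev_size]
```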
30 | #### Example commands
31 | 1. Generating three annotator-based data splits for each annotator of the top 5 annotators of OBQA:
32 | ```bash
33 | python obqa_create_data_splits.py \
34 |     --only_annotator \
35 |     --repeat=3
36 | ```
37 | In order to generate both annotator-based data splits and corresponding random splits of the same size, run this command with the value of `only_annotator` set to `False`.
38 |
39 | 2. Generating three 20%-augmented annotator-based data splits for each annotator of the top 5 annotators of OBQA:
40 | ```bash
41 | python obqa_create_data_splits.py \
42 |     --augment_ratio 0.2 \
43 |     --only_annotator \
44 |     --repeat=3
45 | ```
46 |
47 | 3. Generating three series of random splits corresponding to the 0%-, 10%-, 20%-, and 30%-augmented annotator-based splits of the top 5 annotators of OBQA:
48 | ```bash
49 | python obqa_create_data_splits.py \
50 |     --augment_random_series \
51 |     --only_random \
52 |     --repeat=3
53 | ```
54 |
55 |
56 | ## Model fine-tuning
57 | In all our experiments, we used the pretrained BERT-base cased model from [Google's official repository](https://github.com/google-research/bert).
58 | The directory `model_fine_tuning_scripts` contains the scripts used for fine-tuning and evaluating the model on data splits of the three datasets considered in the paper.
59 | The table below indicates the experiments covered by each fine-tuning script:
60 |
61 | | Fine-tuning script | Dataset | Utility of annotator information | Annotator recognition | Generalization across annotators |
62 | |--------|:--------:|:--------:|:-----:|:-----:|
63 | | `run_mnli.py` | MNLI | V | V | V |
64 | | `run_openbookqa.py` | OBQA | V | | V |
65 | | `run_openbookqa_recognition.py` | OBQA | | V | |
66 | | `run_commonsense_qa.py` | CSQA | V | | V |
67 | | `run_commonsense_qa_recognition.py` | CSQA | | V | |
68 |
69 |
70 | The scripts follow the exact same format as the fine-tuning scripts provided in [Google's official repository](https://github.com/google-research/bert#fine-tuning-with-bert), and should be executed from its root path.
71 |
72 | Before running the scripts, make sure you have generated the relevant data split files. The directory containing these files should be passed to the argument `data_dir` when running any of these scripts (see the examples below and the file-naming reference that follows).
73 |
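As a reference for these file names, the snippet below condenses how the `CommonsenseQAProcessor` in `run_commonsense_qa.py` (included in this repository) resolves the `split`, `annotator_idx`, `augment_ratio`, and `take_number` flags into data file names. The OBQA and MNLI scripts accept the same `split` values, so their split files are expected to follow the same naming scheme:

```python
SPLIT_TO_NAME = {
    'annotator': 'annotator_{annotator_idx}{augment_ratio}{take_number}',
    'annotator_multi': 'annotator_multi_{annotator_idx}{augment_ratio}{take_number}',
    'rand': 'rand_{annotator_idx}{augment_ratio}{take_number}',
    'rand_multi': 'rand_multi_{annotator_idx}{augment_ratio}{take_number}',
    'with_annotator': 'with_annotator_id',
    'without_annotator': 'without_annotator_id',
}

# E.g., --split=annotator --annotator_idx=2 --augment_ratio=0.2 resolves to
# train_annotator_2_augment0.2.json and dev_annotator_2_augment0.2.json:
split_name = SPLIT_TO_NAME['annotator'].format(
    annotator_idx=2,
    augment_ratio='_augment0.2',  # empty string when augment_ratio is 0
    take_number='')               # '_take{n}' when take_number n > 1
train_file = 'train_{split_name}.json'.format(split_name=split_name)
dev_file = 'dev_{split_name}.json'.format(split_name=split_name)
```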
74 |
75 | #### Example commands
76 | 1. Fine-tuning the model on the original split of OBQA with annotator IDs concatenated to each example, to test the utility of annotator information:
77 | ```bash
78 | export BERT_BASE_DIR=/path/to/bert/cased_L-12_H-768_A-12
79 | export DATA_DIR=/path/to/obqa/data/splits/dir
80 |
81 | python run_openbookqa.py \
82 |   --do_train=true --do_eval=true \
83 |   --data_dir=$DATA_DIR --vocab_file=$BERT_BASE_DIR/vocab.txt \
84 |   --bert_config_file=$BERT_BASE_DIR/bert_config.json \
85 |   --init_checkpoint=$BERT_BASE_DIR/bert_model.ckpt \
86 |   --max_seq_length=128 --train_batch_size=10 \
87 |   --learning_rate=2e-5 --num_train_epochs=3.0 \
88 |   --output_dir=$BERT_BASE_DIR/openbookqa_with_annotator/ \
89 |   --split=with_annotator
90 | ```
91 | To fine-tune on the same data split without annotator IDs, replace the value of the `split` argument with `without_annotator`.
92 |
93 | 2. Fine-tuning the model to predict annotator IDs of the top 5 annotators of OBQA (annotator recognition):
94 | ```bash
95 | export BERT_BASE_DIR=/path/to/bert/cased_L-12_H-768_A-12
96 | export DATA_DIR=/path/to/obqa/data/splits/dir
97 |
98 | python run_openbookqa_recognition.py \
99 |   --do_train=true --do_eval=true \
100 |   --data_dir=$DATA_DIR --vocab_file=$BERT_BASE_DIR/vocab.txt \
101 |   --bert_config_file=$BERT_BASE_DIR/bert_config.json \
102 |   --init_checkpoint=$BERT_BASE_DIR/bert_model.ckpt \
103 |   --max_seq_length=128 --train_batch_size=10 \
104 |   --learning_rate=2e-5 --num_train_epochs=3.0 \
105 |   --output_dir=$BERT_BASE_DIR/openbookqa_annotator_recognition/ \
106 |   --split=without_annotator
107 | ```
108 |
109 | 3. Fine-tuning the model on the top annotator split of OBQA, to test model generalization from all other annotators:
110 | ```bash
111 | export BERT_BASE_DIR=/path/to/bert/cased_L-12_H-768_A-12
112 | export DATA_DIR=/path/to/obqa/data/splits/dir
113 |
114 | python run_openbookqa.py \
115 |   --do_train=true --do_eval=true \
116 |   --data_dir=$DATA_DIR --vocab_file=$BERT_BASE_DIR/vocab.txt \
117 |   --bert_config_file=$BERT_BASE_DIR/bert_config.json \
118 |   --init_checkpoint=$BERT_BASE_DIR/bert_model.ckpt \
119 |   --max_seq_length=128 --train_batch_size=10 \
120 |   --learning_rate=2e-5 --num_train_epochs=3.0 \
121 |   --output_dir=$BERT_BASE_DIR/openbookqa_annotator_0/ \
122 |   --split=annotator --annotator_idx=0
123 | ```
124 | To fine-tune on the corresponding random split of the top annotator, simply replace the value of the `split` argument with `rand`.
125 | Similarly, to fine-tune on the second multi-annotator split, replace the values of the `split` and `annotator_idx` arguments with `annotator_multi` and `1`, respectively.
126 |
127 | 4. Fine-tuning the model on the 20%-augmented data split of the third top annotator of OBQA:
128 | ```bash
129 | export BERT_BASE_DIR=/path/to/bert/cased_L-12_H-768_A-12
130 | export DATA_DIR=/path/to/obqa/data/splits/dir
131 |
132 | python run_openbookqa.py \
133 |   --do_train=true --do_eval=true \
134 |   --data_dir=$DATA_DIR --vocab_file=$BERT_BASE_DIR/vocab.txt \
135 |   --bert_config_file=$BERT_BASE_DIR/bert_config.json \
136 |   --init_checkpoint=$BERT_BASE_DIR/bert_model.ckpt \
137 |   --max_seq_length=128 --train_batch_size=10 \
138 |   --learning_rate=2e-5 --num_train_epochs=3.0 \
139 |   --output_dir=$BERT_BASE_DIR/openbookqa_annotator_2/ \
140 |   --split=annotator --annotator_idx=2 \
141 |   --augment_ratio=0.2
142 | ```
143 |
144 | ## Reproducing our experiments on a new dataset
145 |
146 | To reproduce our experiments on a crowdsourced NLU dataset for which annotator information is available (i.e., for every example there is an identifier of the annotator who created it), one needs the following for each of our three experiments:
147 | 1. **The utility of annotator information**
148 |     * The original data split of the dataset.
149 |     * The original data split of the dataset with the annotator ID concatenated as an additional feature to every example.
150 |     * A fine-tuning script for BERT, suitable for the original dataset task.
151 | 2. **Annotator recognition**
152 |     * The original data split of the dataset, with annotator IDs of the top 5 annotators as labels. Namely, every example `(x,y)` written by annotator `z` should be replaced with `(x,z*)`, where `z*=z` if `z` is in the top 5 annotators and `z*=OTHER` otherwise.
153 |     * A fine-tuning script for BERT, for a classification task.
154 | 3. **Model generalization across annotators**
155 |     * Annotator-based data splits and corresponding random splits of the same size.
156 |     * Augmented annotator-based data splits and corresponding random splits of the same size.
157 |     * A fine-tuning script for BERT, suitable for the original dataset task.
158 |
159 |
160 | ## Citation
161 | If you make use of our work in your research, please cite the following:
162 |
163 |
164 | > @InProceedings{GevaEtAl2019,
165 |     title = {{Are We Modeling the Task or the Annotator? 
An Investigation of Annotator Bias in Natural Language Understanding Datasets}}, 166 | author = {Geva, Mor and Goldberg, Yoav and Berant, Jonathan}, 167 | booktitle = {Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and 9th International Joint Conference on Natural Language Processing}, 168 | note = {arXiv preprint arXiv:1908.07898}, 169 | year = {2019} 170 | } 171 | -------------------------------------------------------------------------------- /model_fine_tuning_scripts/run_commonsense_qa.py: -------------------------------------------------------------------------------- 1 | """Run BERT on CommonsenseQA.""" 2 | 3 | from __future__ import absolute_import 4 | from __future__ import division 5 | from __future__ import print_function 6 | 7 | import collections 8 | import json 9 | import os 10 | 11 | import numpy as np 12 | import tensorflow as tf 13 | 14 | import modeling 15 | import optimization 16 | import tokenization 17 | 18 | 19 | flags = tf.flags 20 | 21 | FLAGS = flags.FLAGS 22 | 23 | ## Required parameters 24 | flags.DEFINE_string( 25 | "data_dir", None, 26 | "The input data dir. Should contain the .tsv files (or other data files) " 27 | "for the task.") 28 | 29 | flags.DEFINE_string( 30 | "bert_config_file", None, 31 | "The config json file corresponding to the pre-trained BERT model. " 32 | "This specifies the model architecture.") 33 | 34 | flags.DEFINE_string("vocab_file", None, 35 | "The vocabulary file that the BERT model was trained on.") 36 | 37 | flags.DEFINE_string( 38 | "output_dir", None, 39 | "The output directory where the model checkpoints will be written.") 40 | 41 | flags.DEFINE_string( 42 | "split", None, 43 | "The split you'd like to run on, either 'qtoken' or 'rand'.") 44 | 45 | flags.DEFINE_integer( 46 | "annotator_idx", 0, 47 | "Index of the annotator to split by.") 48 | 49 | flags.DEFINE_float( 50 | "augment_ratio", 0.0, 51 | "Proportion of dev examples moved to training set.") 52 | 53 | flags.DEFINE_integer( 54 | "take_number", 1, 55 | "In case this is a re-execution of previous experiment.") 56 | 57 | flags.DEFINE_bool( 58 | "swap_trn_dev", False, 59 | "Whether to use the train set as dev set and vice versa.") 60 | 61 | 62 | ## Other parameters 63 | 64 | flags.DEFINE_string( 65 | "init_checkpoint", None, 66 | "Initial checkpoint (usually from a pre-trained BERT model).") 67 | 68 | flags.DEFINE_bool( 69 | "do_lower_case", True, 70 | "Whether to lower case the input text. Should be True for uncased " 71 | "models and False for cased models.") 72 | 73 | flags.DEFINE_integer( 74 | "max_seq_length", 128, 75 | "The maximum total input sequence length after WordPiece tokenization. 
" 76 | "Sequences longer than this will be truncated, and sequences shorter " 77 | "than this will be padded.") 78 | 79 | flags.DEFINE_bool("do_train", False, "Whether to run training.") 80 | 81 | flags.DEFINE_bool("do_eval", False, "Whether to run eval on the dev set.") 82 | 83 | flags.DEFINE_bool( 84 | "do_predict", False, 85 | "Whether to run the model in inference mode on the test set.") 86 | 87 | flags.DEFINE_integer("train_batch_size", 32, "Total batch size for training.") 88 | 89 | flags.DEFINE_integer("eval_batch_size", 8, "Total batch size for eval.") 90 | 91 | flags.DEFINE_integer("predict_batch_size", 8, "Total batch size for predict.") 92 | 93 | flags.DEFINE_float("learning_rate", 5e-5, "The initial learning rate for Adam.") 94 | 95 | flags.DEFINE_float("num_train_epochs", 3.0, 96 | "Total number of training epochs to perform.") 97 | 98 | flags.DEFINE_float( 99 | "warmup_proportion", 0.1, 100 | "Proportion of training to perform linear learning rate warmup for. " 101 | "E.g., 0.1 = 10% of training.") 102 | 103 | flags.DEFINE_integer("save_checkpoints_steps", 1000, 104 | "How often to save the model checkpoint.") 105 | 106 | flags.DEFINE_integer("iterations_per_loop", 1000, 107 | "How many steps to make in each estimator call.") 108 | 109 | flags.DEFINE_bool("use_tpu", False, "Whether to use TPU or GPU/CPU.") 110 | 111 | tf.flags.DEFINE_string( 112 | "tpu_name", None, 113 | "The Cloud TPU to use for training. This should be either the name " 114 | "used when creating the Cloud TPU, or a grpc://ip.address.of.tpu:8470 " 115 | "url.") 116 | 117 | tf.flags.DEFINE_string( 118 | "tpu_zone", None, 119 | "[Optional] GCE zone where the Cloud TPU is located in. If not " 120 | "specified, we will attempt to automatically detect the GCE project from " 121 | "metadata.") 122 | 123 | tf.flags.DEFINE_string( 124 | "gcp_project", None, 125 | "[Optional] Project name for the Cloud TPU-enabled project. If not " 126 | "specified, we will attempt to automatically detect the GCE project from " 127 | "metadata.") 128 | 129 | tf.flags.DEFINE_string("master", None, "[Optional] TensorFlow master URL.") 130 | 131 | flags.DEFINE_integer( 132 | "num_tpu_cores", 8, 133 | "Only used if `use_tpu` is True. 
Total number of TPU cores to use.")
134 |
135 |
136 | class InputExample(object):
137 |   """A single multiple choice question."""
138 |
139 |   def __init__(
140 |       self,
141 |       qid,
142 |       question,
143 |       answers,
144 |       label):
145 |     """Construct an instance."""
146 |     self.qid = qid
147 |     self.question = question
148 |     self.answers = answers
149 |     self.label = label
150 |
151 |
152 | class DataProcessor(object):
153 |   """Base class for data converters for sequence classification data sets."""
154 |
155 |   def get_train_examples(self, data_dir):
156 |     """Gets a collection of `InputExample`s for the train set."""
157 |     raise NotImplementedError()
158 |
159 |   def get_dev_examples(self, data_dir):
160 |     """Gets a collection of `InputExample`s for the dev set."""
161 |     raise NotImplementedError()
162 |
163 |   def get_test_examples(self, data_dir):
164 |     """Gets a collection of `InputExample`s for prediction."""
165 |     raise NotImplementedError()
166 |
167 |   def get_labels(self):
168 |     """Gets the list of labels for this data set."""
169 |     raise NotImplementedError()
170 |
171 |   @classmethod
172 |   def _read_json(cls, input_file):
173 |     """Reads a JSON file."""
174 |     with tf.gfile.Open(input_file, "r") as f:
175 |       return json.load(f)
176 |
177 |
178 | class CommonsenseQAProcessor(DataProcessor):
179 |   """Processor for the CommonsenseQA data set."""
180 |
181 |   SPLIT_TO_NAME = {
182 |     # 'annotator': 'annotator_{annotator_idx}_cand_dists',
183 |     # 'annotator_multi': 'annotator_multi_{annotator_idx}_cand_dists',
184 |     # 'rand': 'rand_{annotator_idx}_cand_dists',
185 |     # 'rand_multi': 'rand_multi_{annotator_idx}_cand_dists',
186 |     'annotator': 'annotator_{annotator_idx}{augment_ratio}{take_number}',
187 |     'annotator_multi': 'annotator_multi_{annotator_idx}{augment_ratio}{take_number}',
188 |     'rand': 'rand_{annotator_idx}{augment_ratio}{take_number}',
189 |     'rand_multi': 'rand_multi_{annotator_idx}{augment_ratio}{take_number}',
190 |     'with_annotator': 'with_annotator_id',
191 |     'without_annotator': 'without_annotator_id',
192 |   }
193 |
194 |   TRAIN_FILE_NAME = 'train_{split_name}.json'
195 |   DEV_FILE_NAME = 'dev_{split_name}.json'
196 |   TEST_FILE_NAME = 'dev_{split_name}.json'
197 |
198 |   def __init__(self, split, annotator_idx, augment_ratio, take_number, swap_trn_dev):
199 |     if split not in self.SPLIT_TO_NAME.keys():
200 |       raise ValueError(
201 |         f'split must be one of {", ".join(self.SPLIT_TO_NAME.keys())}.')
202 |
203 |     self.split = split
204 |     self.annotator_idx = annotator_idx
205 |     self.augment_ratio = "_augment{}".format(augment_ratio) if augment_ratio > 0 else ""
206 |     self.take = "_take{}".format(take_number) if take_number > 1 else ""
207 |     self.swap_trn_dev = swap_trn_dev
208 |
209 |     if self.swap_trn_dev:
210 |       tmp = self.TRAIN_FILE_NAME
211 |       self.TRAIN_FILE_NAME = self.DEV_FILE_NAME
212 |       self.DEV_FILE_NAME = tmp
213 |
214 |   def get_train_examples(self, data_dir):
215 |     train_file_name = self.TRAIN_FILE_NAME.format(
216 |       split_name=self.SPLIT_TO_NAME[self.split].format(annotator_idx=self.annotator_idx, augment_ratio=self.augment_ratio, take_number=self.take))
217 |
218 |     return self._create_examples(
219 |       self._read_json(os.path.join(data_dir, train_file_name)),
220 |       'train')
221 |
222 |   def get_dev_examples(self, data_dir):
223 |     dev_file_name = self.DEV_FILE_NAME.format(
224 |       split_name=self.SPLIT_TO_NAME[self.split].format(annotator_idx=self.annotator_idx, augment_ratio=self.augment_ratio, take_number=self.take))
225 |
226 |     return self._create_examples(
227 |
self._read_json(os.path.join(data_dir, dev_file_name)),
228 |       'dev')
229 |
230 |   def get_test_examples(self, data_dir):
231 |     test_file_name = self.TEST_FILE_NAME.format(
232 |       split_name=self.SPLIT_TO_NAME[self.split].format(annotator_idx=self.annotator_idx, augment_ratio=self.augment_ratio, take_number=self.take))
233 |
234 |     return self._create_examples(
235 |       self._read_json(os.path.join(data_dir, test_file_name)),
236 |       'test')
237 |
238 |   def get_labels(self):
239 |     return [0, 1, 2, 3, 4]
240 |
241 |   def _create_examples(self, lines, set_type):
242 |     examples = []
243 |     for i, line in enumerate(lines):
244 |       qid = "%s-%s" % (set_type, i)
245 |
246 |       question = tokenization.convert_to_unicode(line['question'])
247 |
248 |       answers = np.array([
249 |         tokenization.convert_to_unicode(line['correct_answer']),
250 |         tokenization.convert_to_unicode(line['distractor_0']),
251 |         tokenization.convert_to_unicode(line['distractor_1']),
252 |         tokenization.convert_to_unicode(line['distractor_2']),
253 |         tokenization.convert_to_unicode(line['distractor_3'])
254 |       ])
255 |
256 |       label = 0  # the correct answer is always placed first in `answers`, so the gold label is always 0
257 |
258 |       examples.append(
259 |         InputExample(
260 |           qid=qid,
261 |           question=question,
262 |           answers=answers,
263 |           label=label))
264 |
265 |     return examples
266 |
267 |
268 | def example_to_token_ids_segment_ids_label_ids(
269 |     ex_index,
270 |     example,
271 |     max_seq_length,
272 |     tokenizer):
273 |   """Converts an ``InputExample`` to token ids and segment ids."""
274 |   if ex_index < 5:
275 |     tf.logging.info(f"*** Example {ex_index} ***")
276 |     tf.logging.info("qid: %s" % (example.qid))
277 |
278 |   question_tokens = tokenizer.tokenize(example.question)
279 |   answers_tokens = map(tokenizer.tokenize, example.answers)
280 |
281 |   token_ids = []
282 |   segment_ids = []
283 |   for choice_idx, answer_tokens in enumerate(answers_tokens):
284 |     truncated_question_tokens = question_tokens[
285 |       :max((max_seq_length - 3)//2, max_seq_length - (len(answer_tokens) + 3))]
286 |     truncated_answer_tokens = answer_tokens[
287 |       :max((max_seq_length - 3)//2, max_seq_length - (len(question_tokens) + 3))]
288 |
289 |     choice_tokens = []
290 |     choice_segment_ids = []
291 |     choice_tokens.append("[CLS]")
292 |     choice_segment_ids.append(0)
293 |     for question_token in truncated_question_tokens:
294 |       choice_tokens.append(question_token)
295 |       choice_segment_ids.append(0)
296 |     choice_tokens.append("[SEP]")
297 |     choice_segment_ids.append(0)
298 |     for answer_token in truncated_answer_tokens:
299 |       choice_tokens.append(answer_token)
300 |       choice_segment_ids.append(1)
301 |     choice_tokens.append("[SEP]")
302 |     choice_segment_ids.append(1)
303 |
304 |     choice_token_ids = tokenizer.convert_tokens_to_ids(choice_tokens)
305 |
306 |     token_ids.append(choice_token_ids)
307 |     segment_ids.append(choice_segment_ids)
308 |
309 |     if ex_index < 5:
310 |       tf.logging.info("choice %s" % choice_idx)
311 |       tf.logging.info("tokens: %s" % " ".join(
312 |         [tokenization.printable_text(t) for t in choice_tokens]))
313 |       tf.logging.info("token ids: %s" % " ".join(
314 |         [str(x) for x in choice_token_ids]))
315 |       tf.logging.info("segment ids: %s" % " ".join(
316 |         [str(x) for x in choice_segment_ids]))
317 |
318 |   label_ids = [example.label]
319 |
320 |   if ex_index < 5:
321 |     tf.logging.info("label: %s (id = %d)" % (example.label, label_ids[0]))
322 |
323 |   return token_ids, segment_ids, label_ids
324 |
325 |
326 | def file_based_convert_examples_to_features(
327 |     examples,
328 |     label_list,
329 |     max_seq_length,
330 |     tokenizer,
331 |     output_file
332 | ):
333 |
"""Convert a set of ``InputExamples`` to a TFRecord file.""" 334 | 335 | # encode examples into token_ids and segment_ids 336 | token_ids_segment_ids_label_ids = [ 337 | example_to_token_ids_segment_ids_label_ids( 338 | ex_index, 339 | example, 340 | max_seq_length, 341 | tokenizer) 342 | for ex_index, example in enumerate(examples) 343 | ] 344 | 345 | # compute the maximum sequence length for any of the inputs 346 | seq_length = max([ 347 | max([len(choice_token_ids) for choice_token_ids in token_ids]) 348 | for token_ids, _, _ in token_ids_segment_ids_label_ids 349 | ]) 350 | 351 | # encode the inputs into fixed-length vectors 352 | writer = tf.python_io.TFRecordWriter(output_file) 353 | 354 | for idx, (token_ids, segment_ids, label_ids) in enumerate( 355 | token_ids_segment_ids_label_ids 356 | ): 357 | if idx % 10000 == 0: 358 | tf.logging.info("Writing %d of %d" % ( 359 | idx, 360 | len(token_ids_segment_ids_label_ids))) 361 | 362 | features = collections.OrderedDict() 363 | for i, (choice_token_ids, choice_segment_ids) in enumerate( 364 | zip(token_ids, segment_ids)): 365 | input_ids = np.zeros(max_seq_length) 366 | input_ids[:len(choice_token_ids)] = np.array(choice_token_ids) 367 | 368 | input_mask = np.zeros(max_seq_length) 369 | input_mask[:len(choice_token_ids)] = 1 370 | 371 | segment_ids = np.zeros(max_seq_length) 372 | segment_ids[:len(choice_segment_ids)] = np.array(choice_segment_ids) 373 | 374 | features[f'input_ids{i}'] = tf.train.Feature( 375 | int64_list=tf.train.Int64List(value=list(input_ids.astype(np.int64)))) 376 | features[f'input_mask{i}'] = tf.train.Feature( 377 | int64_list=tf.train.Int64List(value=list(input_mask.astype(np.int64)))) 378 | features[f'segment_ids{i}'] = tf.train.Feature( 379 | int64_list=tf.train.Int64List(value=list(segment_ids.astype(np.int64)))) 380 | 381 | features['label_ids'] = tf.train.Feature( 382 | int64_list=tf.train.Int64List(value=label_ids)) 383 | 384 | tf_example = tf.train.Example(features=tf.train.Features(feature=features)) 385 | writer.write(tf_example.SerializeToString()) 386 | 387 | return seq_length 388 | 389 | 390 | def file_based_input_fn_builder( 391 | input_file, 392 | seq_length, 393 | is_training, 394 | drop_remainder 395 | ): 396 | """Creates an `input_fn` closure to be passed to TPUEstimator.""" 397 | 398 | name_to_features = { 399 | "input_ids0": tf.FixedLenFeature([seq_length], tf.int64), 400 | "input_mask0": tf.FixedLenFeature([seq_length], tf.int64), 401 | "segment_ids0": tf.FixedLenFeature([seq_length], tf.int64), 402 | "input_ids1": tf.FixedLenFeature([seq_length], tf.int64), 403 | "input_mask1": tf.FixedLenFeature([seq_length], tf.int64), 404 | "segment_ids1": tf.FixedLenFeature([seq_length], tf.int64), 405 | "input_ids2": tf.FixedLenFeature([seq_length], tf.int64), 406 | "input_mask2": tf.FixedLenFeature([seq_length], tf.int64), 407 | "segment_ids2": tf.FixedLenFeature([seq_length], tf.int64), 408 | "input_ids3": tf.FixedLenFeature([seq_length], tf.int64), 409 | "input_mask3": tf.FixedLenFeature([seq_length], tf.int64), 410 | "segment_ids3": tf.FixedLenFeature([seq_length], tf.int64), 411 | "input_ids4": tf.FixedLenFeature([seq_length], tf.int64), 412 | "input_mask4": tf.FixedLenFeature([seq_length], tf.int64), 413 | "segment_ids4": tf.FixedLenFeature([seq_length], tf.int64), 414 | "label_ids": tf.FixedLenFeature([], tf.int64), 415 | } 416 | 417 | def _decode_record(record, name_to_features): 418 | """Decodes a record to a TensorFlow example.""" 419 | example = tf.parse_single_example(record, name_to_features) 
420 | 421 | # tf.Example only supports tf.int64, but the TPU only supports tf.int32. 422 | # So cast all int64 to int32. 423 | for name in list(example.keys()): 424 | t = example[name] 425 | if t.dtype == tf.int64: 426 | t = tf.to_int32(t) 427 | example[name] = t 428 | 429 | return example 430 | 431 | def input_fn(params): 432 | """The actual input function.""" 433 | batch_size = params["batch_size"] 434 | 435 | # For training, we want a lot of parallel reading and shuffling. 436 | # For eval, we want no shuffling and parallel reading doesn't matter. 437 | d = tf.data.TFRecordDataset(input_file) 438 | if is_training: 439 | d = d.repeat() 440 | d = d.shuffle(buffer_size=100) 441 | 442 | d = d.apply( 443 | tf.contrib.data.map_and_batch( 444 | lambda record: _decode_record(record, name_to_features), 445 | batch_size=batch_size, 446 | drop_remainder=drop_remainder)) 447 | 448 | return d 449 | 450 | return input_fn 451 | 452 | 453 | def create_model( 454 | bert_config, 455 | is_training, 456 | input_ids0, 457 | input_mask0, 458 | segment_ids0, 459 | input_ids1, 460 | input_mask1, 461 | segment_ids1, 462 | input_ids2, 463 | input_mask2, 464 | segment_ids2, 465 | input_ids3, 466 | input_mask3, 467 | segment_ids3, 468 | input_ids4, 469 | input_mask4, 470 | segment_ids4, 471 | labels, 472 | num_labels, 473 | use_one_hot_embeddings 474 | ): 475 | """Creates a classification model.""" 476 | input_ids = tf.stack( 477 | [ 478 | input_ids0, 479 | input_ids1, 480 | input_ids2, 481 | input_ids3, 482 | input_ids4 483 | ], 484 | axis=1) 485 | input_mask = tf.stack( 486 | [ 487 | input_mask0, 488 | input_mask1, 489 | input_mask2, 490 | input_mask3, 491 | input_mask4 492 | ], 493 | axis=1) 494 | segment_ids = tf.stack( 495 | [ 496 | segment_ids0, 497 | segment_ids1, 498 | segment_ids2, 499 | segment_ids3, 500 | segment_ids4 501 | ], 502 | axis=1) 503 | 504 | _, num_choices, seq_length = input_ids.shape 505 | 506 | input_ids = tf.reshape(input_ids, (-1, seq_length)) 507 | input_mask = tf.reshape(input_mask, (-1, seq_length)) 508 | segment_ids = tf.reshape(segment_ids, (-1, seq_length)) 509 | 510 | output_layer = modeling.BertModel( 511 | config=bert_config, 512 | is_training=is_training, 513 | input_ids=input_ids, 514 | input_mask=input_mask, 515 | token_type_ids=segment_ids, 516 | use_one_hot_embeddings=use_one_hot_embeddings 517 | ).get_pooled_output() 518 | 519 | hidden_size = output_layer.shape[-1].value 520 | 521 | softmax_weights = tf.get_variable( 522 | "softmax_weights", [hidden_size, 1], 523 | initializer=tf.truncated_normal_initializer(stddev=0.02)) 524 | 525 | with tf.variable_scope("loss"): 526 | if is_training: 527 | # I.e., 0.1 dropout 528 | output_layer = tf.nn.dropout(output_layer, keep_prob=0.9) 529 | 530 | logits = tf.reshape( 531 | tf.matmul(output_layer, softmax_weights), 532 | (-1, num_choices)) 533 | 534 | probabilities = tf.nn.softmax(logits, axis=-1) 535 | log_probs = tf.nn.log_softmax(logits, axis=-1) 536 | 537 | one_hot_labels = tf.one_hot(labels, depth=num_labels, dtype=tf.float32) 538 | 539 | per_example_loss = -tf.reduce_sum(one_hot_labels * log_probs, axis=-1) 540 | loss = tf.reduce_mean(per_example_loss) 541 | 542 | return (loss, per_example_loss, logits, probabilities, output_layer) 543 | 544 | 545 | def model_fn_builder( 546 | bert_config, 547 | num_labels, 548 | init_checkpoint, 549 | learning_rate, 550 | num_train_steps, 551 | num_warmup_steps, 552 | use_tpu, 553 | use_one_hot_embeddings 554 | ): 555 | """Returns `model_fn` closure for TPUEstimator.""" 556 | 557 | def 
model_fn(features, labels, mode, params): # pylint: disable=unused-argument 558 | """The `model_fn` for TPUEstimator.""" 559 | 560 | tf.logging.info("*** Features ***") 561 | for name in sorted(features.keys()): 562 | tf.logging.info(" name = %s, shape = %s" % (name, features[name].shape)) 563 | 564 | input_ids0 = features["input_ids0"] 565 | input_mask0 = features["input_mask0"] 566 | segment_ids0 = features["segment_ids0"] 567 | input_ids1 = features["input_ids1"] 568 | input_mask1 = features["input_mask1"] 569 | segment_ids1 = features["segment_ids1"] 570 | input_ids2 = features["input_ids2"] 571 | input_mask2 = features["input_mask2"] 572 | segment_ids2 = features["segment_ids2"] 573 | input_ids3 = features["input_ids3"] 574 | input_mask3 = features["input_mask3"] 575 | segment_ids3 = features["segment_ids3"] 576 | input_ids4 = features["input_ids4"] 577 | input_mask4 = features["input_mask4"] 578 | segment_ids4 = features["segment_ids4"] 579 | label_ids = features["label_ids"] 580 | 581 | is_training = (mode == tf.estimator.ModeKeys.TRAIN) 582 | 583 | (total_loss, per_example_loss, logits, probabilities, output_layer) = create_model( 584 | bert_config, 585 | is_training, 586 | input_ids0, 587 | input_mask0, 588 | segment_ids0, 589 | input_ids1, 590 | input_mask1, 591 | segment_ids1, 592 | input_ids2, 593 | input_mask2, 594 | segment_ids2, 595 | input_ids3, 596 | input_mask3, 597 | segment_ids3, 598 | input_ids4, 599 | input_mask4, 600 | segment_ids4, 601 | label_ids, 602 | num_labels, 603 | use_one_hot_embeddings) 604 | 605 | tvars = tf.trainable_variables() 606 | initialized_variable_names = {} 607 | scaffold_fn = None 608 | if init_checkpoint: 609 | (assignment_map, initialized_variable_names 610 | ) = modeling.get_assignment_map_from_checkpoint(tvars, init_checkpoint) 611 | if use_tpu: 612 | 613 | def tpu_scaffold(): 614 | tf.train.init_from_checkpoint(init_checkpoint, assignment_map) 615 | return tf.train.Scaffold() 616 | 617 | scaffold_fn = tpu_scaffold 618 | else: 619 | tf.train.init_from_checkpoint(init_checkpoint, assignment_map) 620 | 621 | tf.logging.info("**** Trainable Variables ****") 622 | for var in tvars: 623 | init_string = "" 624 | if var.name in initialized_variable_names: 625 | init_string = ", *INIT_FROM_CKPT*" 626 | tf.logging.info(" name = %s, shape = %s%s", var.name, var.shape, 627 | init_string) 628 | 629 | output_spec = None 630 | if mode == tf.estimator.ModeKeys.TRAIN: 631 | 632 | train_op = optimization.create_optimizer( 633 | total_loss, learning_rate, num_train_steps, num_warmup_steps, use_tpu) 634 | 635 | output_spec = tf.contrib.tpu.TPUEstimatorSpec( 636 | mode=mode, 637 | loss=total_loss, 638 | train_op=train_op, 639 | scaffold_fn=scaffold_fn) 640 | elif mode == tf.estimator.ModeKeys.EVAL: 641 | 642 | def metric_fn(per_example_loss, label_ids, logits): 643 | predictions = tf.argmax(logits, axis=-1, output_type=tf.int32) 644 | accuracy = tf.metrics.accuracy(label_ids, predictions) 645 | loss = tf.metrics.mean(per_example_loss) 646 | return { 647 | "eval_accuracy": accuracy, 648 | "eval_loss": loss, 649 | } 650 | 651 | eval_metrics = (metric_fn, [per_example_loss, label_ids, logits]) 652 | output_spec = tf.contrib.tpu.TPUEstimatorSpec( 653 | mode=mode, 654 | loss=total_loss, 655 | eval_metrics=eval_metrics, 656 | scaffold_fn=scaffold_fn) 657 | else: 658 | output_spec = tf.contrib.tpu.TPUEstimatorSpec( 659 | mode=mode, predictions=probabilities, scaffold_fn=scaffold_fn) 660 | return output_spec 661 | 662 | return model_fn 663 | 664 | 665 | def main(_): 
666 | tf.logging.set_verbosity(tf.logging.INFO) 667 | 668 | if not FLAGS.do_train and not FLAGS.do_eval and not FLAGS.do_predict: 669 | raise ValueError( 670 | "At least one of `do_train`, `do_eval` or `do_predict' must be True.") 671 | 672 | bert_config = modeling.BertConfig.from_json_file(FLAGS.bert_config_file) 673 | 674 | if FLAGS.max_seq_length > bert_config.max_position_embeddings: 675 | raise ValueError( 676 | "Cannot use sequence length %d because the BERT model " 677 | "was only trained up to sequence length %d" % 678 | (FLAGS.max_seq_length, bert_config.max_position_embeddings)) 679 | 680 | tf.gfile.MakeDirs(FLAGS.output_dir) 681 | 682 | processor = CommonsenseQAProcessor(split=FLAGS.split, annotator_idx=FLAGS.annotator_idx, 683 | augment_ratio=FLAGS.augment_ratio, take_number=FLAGS.take_number, 684 | swap_trn_dev=FLAGS.swap_trn_dev) 685 | 686 | label_list = processor.get_labels() 687 | 688 | tokenizer = tokenization.FullTokenizer( 689 | vocab_file=FLAGS.vocab_file, do_lower_case=FLAGS.do_lower_case) 690 | 691 | tpu_cluster_resolver = None 692 | if FLAGS.use_tpu and FLAGS.tpu_name: 693 | tpu_cluster_resolver = tf.contrib.cluster_resolver.TPUClusterResolver( 694 | FLAGS.tpu_name, zone=FLAGS.tpu_zone, project=FLAGS.gcp_project) 695 | 696 | is_per_host = tf.contrib.tpu.InputPipelineConfig.PER_HOST_V2 697 | run_config = tf.contrib.tpu.RunConfig( 698 | cluster=tpu_cluster_resolver, 699 | master=FLAGS.master, 700 | model_dir=FLAGS.output_dir, 701 | save_checkpoints_steps=FLAGS.save_checkpoints_steps, 702 | tpu_config=tf.contrib.tpu.TPUConfig( 703 | iterations_per_loop=FLAGS.iterations_per_loop, 704 | num_shards=FLAGS.num_tpu_cores, 705 | per_host_input_for_training=is_per_host)) 706 | 707 | train_examples = None 708 | num_train_steps = None 709 | num_warmup_steps = None 710 | if FLAGS.do_train: 711 | train_examples = processor.get_train_examples(FLAGS.data_dir) 712 | num_train_steps = int( 713 | len(train_examples) / FLAGS.train_batch_size * FLAGS.num_train_epochs) 714 | num_warmup_steps = int(num_train_steps * FLAGS.warmup_proportion) 715 | 716 | model_fn = model_fn_builder( 717 | bert_config=bert_config, 718 | num_labels=len(label_list), 719 | init_checkpoint=FLAGS.init_checkpoint, 720 | learning_rate=FLAGS.learning_rate, 721 | num_train_steps=num_train_steps, 722 | num_warmup_steps=num_warmup_steps, 723 | use_tpu=FLAGS.use_tpu, 724 | use_one_hot_embeddings=FLAGS.use_tpu) 725 | 726 | # If TPU is not available, this will fall back to normal Estimator on CPU 727 | # or GPU. 
728 | estimator = tf.contrib.tpu.TPUEstimator( 729 | use_tpu=FLAGS.use_tpu, 730 | model_fn=model_fn, 731 | config=run_config, 732 | train_batch_size=FLAGS.train_batch_size, 733 | eval_batch_size=FLAGS.eval_batch_size, 734 | predict_batch_size=FLAGS.predict_batch_size) 735 | 736 | if FLAGS.do_train: 737 | train_file = os.path.join(FLAGS.output_dir, "train.tf_record") 738 | train_seq_length = file_based_convert_examples_to_features( 739 | train_examples, label_list, FLAGS.max_seq_length, tokenizer, train_file) 740 | tf.logging.info("***** Running training *****") 741 | tf.logging.info(" Num examples = %d", len(train_examples)) 742 | tf.logging.info(" Batch size = %d", FLAGS.train_batch_size) 743 | tf.logging.info(" Num steps = %d", num_train_steps) 744 | tf.logging.info(" Longest training sequence = %d", train_seq_length) 745 | train_input_fn = file_based_input_fn_builder( 746 | input_file=train_file, 747 | seq_length=FLAGS.max_seq_length, 748 | is_training=True, 749 | drop_remainder=True) 750 | estimator.train(input_fn=train_input_fn, max_steps=num_train_steps) 751 | 752 | if FLAGS.do_eval: 753 | eval_examples = processor.get_dev_examples(FLAGS.data_dir) 754 | eval_file = os.path.join(FLAGS.output_dir, "eval.tf_record") 755 | eval_seq_length = file_based_convert_examples_to_features( 756 | eval_examples, label_list, FLAGS.max_seq_length, tokenizer, eval_file) 757 | 758 | tf.logging.info("***** Running evaluation *****") 759 | tf.logging.info(" Num examples = %d", len(eval_examples)) 760 | tf.logging.info(" Batch size = %d", FLAGS.eval_batch_size) 761 | tf.logging.info(" Longest eval sequence = %d", eval_seq_length) 762 | 763 | # This tells the estimator to run through the entire set. 764 | eval_steps = None 765 | # However, if running eval on the TPU, you will need to specify the 766 | # number of steps. 767 | if FLAGS.use_tpu: 768 | # Eval will be slightly WRONG on the TPU because it will truncate 769 | # the last batch. 
770 |       eval_steps = int(len(eval_examples) / FLAGS.eval_batch_size)
771 |
772 |     eval_drop_remainder = True if FLAGS.use_tpu else False
773 |     eval_input_fn = file_based_input_fn_builder(
774 |         input_file=eval_file,
775 |         seq_length=FLAGS.max_seq_length,
776 |         is_training=False,
777 |         drop_remainder=eval_drop_remainder)
778 |
779 |     result = estimator.evaluate(input_fn=eval_input_fn, steps=eval_steps)
780 |
781 |     output_eval_file = os.path.join(FLAGS.output_dir, "eval_results.txt")
782 |     with tf.gfile.GFile(output_eval_file, "w") as writer:
783 |       tf.logging.info("***** Eval results *****")
784 |       for key in sorted(result.keys()):
785 |         tf.logging.info("  %s = %s", key, str(result[key]))
786 |         writer.write("%s = %s\n" % (key, str(result[key])))
787 |
788 |   if FLAGS.do_predict:
789 |     predict_examples = processor.get_test_examples(FLAGS.data_dir)
790 |     predict_file = os.path.join(FLAGS.output_dir, "predict.tf_record")
791 |     predict_seq_length = file_based_convert_examples_to_features(
792 |         predict_examples, label_list,
793 |         FLAGS.max_seq_length, tokenizer,
794 |         predict_file)
795 |
796 |     tf.logging.info("***** Running prediction*****")
797 |     tf.logging.info("  Num examples = %d", len(predict_examples))
798 |     tf.logging.info("  Batch size = %d", FLAGS.predict_batch_size)
799 |     tf.logging.info("  Longest predict sequence = %d", predict_seq_length)
800 |
801 |     if FLAGS.use_tpu:
802 |       # Warning: According to tpu_estimator.py Prediction on TPU is an
803 |       # experimental feature and hence not supported here
804 |       raise ValueError("Prediction in TPU not supported")
805 |
806 |     predict_drop_remainder = True if FLAGS.use_tpu else False
807 |     predict_input_fn = file_based_input_fn_builder(
808 |         input_file=predict_file,
809 |         seq_length=FLAGS.max_seq_length,
810 |         is_training=False,
811 |         drop_remainder=predict_drop_remainder)
812 |
813 |     result = estimator.predict(input_fn=predict_input_fn)
814 |
815 |     output_predict_file = os.path.join(FLAGS.output_dir, "test_results.tsv")
816 |     with tf.gfile.GFile(output_predict_file, "w") as writer:
817 |       tf.logging.info("***** Predict results *****")
818 |       for prediction in result:
819 |         output_line = "\t".join(
820 |             str(class_probability) for class_probability in prediction) + "\n"
821 |         writer.write(output_line)
822 |
823 |
824 | if __name__ == "__main__":
825 |   flags.mark_flag_as_required("data_dir")
826 |   flags.mark_flag_as_required("vocab_file")
827 |   flags.mark_flag_as_required("bert_config_file")
828 |   flags.mark_flag_as_required("output_dir")
829 |   tf.app.run()
830 |
--------------------------------------------------------------------------------
/model_fine_tuning_scripts/run_commonsense_qa_recognition.py:
--------------------------------------------------------------------------------
1 | """Run BERT on CommonsenseQA for annotator ID prediction."""
2 |
3 | from __future__ import absolute_import
4 | from __future__ import division
5 | from __future__ import print_function
6 |
7 | import collections
8 | import json
9 | import os
10 |
11 | import numpy as np
12 | import tensorflow as tf
13 |
14 | import modeling
15 | import optimization
16 | import tokenization
17 |
18 |
19 | flags = tf.flags
20 |
21 | FLAGS = flags.FLAGS
22 |
23 | ## Required parameters
24 | flags.DEFINE_string(
25 |     "data_dir", None,
26 |     "The input data dir. Should contain the .tsv files (or other data files) "
27 |     "for the task.")
28 |
29 | flags.DEFINE_string(
30 |     "bert_config_file", None,
31 |     "The config json file corresponding to the pre-trained BERT model. 
" 32 | "This specifies the model architecture.") 33 | 34 | flags.DEFINE_string("vocab_file", None, 35 | "The vocabulary file that the BERT model was trained on.") 36 | 37 | flags.DEFINE_string( 38 | "output_dir", None, 39 | "The output directory where the model checkpoints will be written.") 40 | 41 | flags.DEFINE_string( 42 | "split", None, 43 | "The split you'd like to run on, either 'annotator' or 'rand'.") 44 | 45 | flags.DEFINE_integer( 46 | "annotator_idx", 0, 47 | "Index of the annotator to split by.") 48 | 49 | flags.DEFINE_float( 50 | "augment_ratio", 0.0, 51 | "Proportion of dev examples moved to training set.") 52 | 53 | flags.DEFINE_integer( 54 | "take_number", 1, 55 | "In case this is a re-execution of previous experiment.") 56 | 57 | 58 | ## Other parameters 59 | 60 | flags.DEFINE_string( 61 | "init_checkpoint", None, 62 | "Initial checkpoint (usually from a pre-trained BERT model).") 63 | 64 | flags.DEFINE_bool( 65 | "do_lower_case", True, 66 | "Whether to lower case the input text. Should be True for uncased " 67 | "models and False for cased models.") 68 | 69 | flags.DEFINE_integer( 70 | "max_seq_length", 128, 71 | "The maximum total input sequence length after WordPiece tokenization. " 72 | "Sequences longer than this will be truncated, and sequences shorter " 73 | "than this will be padded.") 74 | 75 | flags.DEFINE_bool("do_train", False, "Whether to run training.") 76 | 77 | flags.DEFINE_bool("do_eval", False, "Whether to run eval on the dev set.") 78 | 79 | flags.DEFINE_bool( 80 | "do_predict", False, 81 | "Whether to run the model in inference mode on the test set.") 82 | 83 | flags.DEFINE_integer("train_batch_size", 32, "Total batch size for training.") 84 | 85 | flags.DEFINE_integer("eval_batch_size", 8, "Total batch size for eval.") 86 | 87 | flags.DEFINE_integer("predict_batch_size", 8, "Total batch size for predict.") 88 | 89 | flags.DEFINE_float("learning_rate", 5e-5, "The initial learning rate for Adam.") 90 | 91 | flags.DEFINE_float("num_train_epochs", 3.0, 92 | "Total number of training epochs to perform.") 93 | 94 | flags.DEFINE_float( 95 | "warmup_proportion", 0.1, 96 | "Proportion of training to perform linear learning rate warmup for. " 97 | "E.g., 0.1 = 10% of training.") 98 | 99 | flags.DEFINE_integer("save_checkpoints_steps", 1000, 100 | "How often to save the model checkpoint.") 101 | 102 | flags.DEFINE_integer("iterations_per_loop", 1000, 103 | "How many steps to make in each estimator call.") 104 | 105 | flags.DEFINE_bool("use_tpu", False, "Whether to use TPU or GPU/CPU.") 106 | 107 | tf.flags.DEFINE_string( 108 | "tpu_name", None, 109 | "The Cloud TPU to use for training. This should be either the name " 110 | "used when creating the Cloud TPU, or a grpc://ip.address.of.tpu:8470 " 111 | "url.") 112 | 113 | tf.flags.DEFINE_string( 114 | "tpu_zone", None, 115 | "[Optional] GCE zone where the Cloud TPU is located in. If not " 116 | "specified, we will attempt to automatically detect the GCE project from " 117 | "metadata.") 118 | 119 | tf.flags.DEFINE_string( 120 | "gcp_project", None, 121 | "[Optional] Project name for the Cloud TPU-enabled project. If not " 122 | "specified, we will attempt to automatically detect the GCE project from " 123 | "metadata.") 124 | 125 | tf.flags.DEFINE_string("master", None, "[Optional] TensorFlow master URL.") 126 | 127 | flags.DEFINE_integer( 128 | "num_tpu_cores", 8, 129 | "Only used if `use_tpu` is True. 
Total number of TPU cores to use.")
130 |
131 |
132 | class InputExample(object):
133 |   """A single training/test example for simple sequence classification."""
134 |
135 |   def __init__(self, guid, text_a, text_b=None, label=None):
136 |     """Constructs an InputExample.
137 |
138 |     Args:
139 |       guid: Unique id for the example.
140 |       text_a: string. The untokenized text of the first sequence. For single
141 |         sequence tasks, only this sequence must be specified.
142 |       text_b: (Optional) string. The untokenized text of the second sequence.
143 |         Only must be specified for sequence pair tasks.
144 |       label: (Optional) string. The label of the example. This should be
145 |         specified for train and dev examples, but not for test examples.
146 |     """
147 |     self.guid = guid
148 |     self.text_a = text_a
149 |     self.text_b = text_b
150 |     self.label = label
151 |
152 |
153 | class PaddingInputExample(object):
154 |   """Fake example so the num input examples is a multiple of the batch size.
155 |
156 |   When running eval/predict on the TPU, we need to pad the number of examples
157 |   to be a multiple of the batch size, because the TPU requires a fixed batch
158 |   size. The alternative is to drop the last batch, which is bad because it means
159 |   the entire output data won't be generated.
160 |
161 |   We use this class instead of `None` because treating `None` as padding
162 |   batches could cause silent errors.
163 |   """
164 |
165 |
166 | class InputFeatures(object):
167 |   """A single set of features of data."""
168 |
169 |   def __init__(self,
170 |                input_ids,
171 |                input_mask,
172 |                segment_ids,
173 |                label_id,
174 |                is_real_example=True):
175 |     self.input_ids = input_ids
176 |     self.input_mask = input_mask
177 |     self.segment_ids = segment_ids
178 |     self.label_id = label_id
179 |     self.is_real_example = is_real_example
180 |
181 |
182 | class DataProcessor(object):
183 |   """Base class for data converters for sequence classification data sets."""
184 |
185 |   def get_train_examples(self, data_dir):
186 |     """Gets a collection of `InputExample`s for the train set."""
187 |     raise NotImplementedError()
188 |
189 |   def get_dev_examples(self, data_dir):
190 |     """Gets a collection of `InputExample`s for the dev set."""
191 |     raise NotImplementedError()
192 |
193 |   def get_test_examples(self, data_dir):
194 |     """Gets a collection of `InputExample`s for prediction."""
195 |     raise NotImplementedError()
196 |
197 |   def get_labels(self):
198 |     """Gets the list of labels for this data set."""
199 |     raise NotImplementedError()
200 |
201 |   @classmethod
202 |   def _read_json(cls, input_file):
203 |     """Reads a JSON file."""
204 |     with tf.gfile.Open(input_file, "r") as f:
205 |       return json.load(f)
206 |
207 |
208 | class CommonsenseQAProcessor(DataProcessor):
209 |   """Processor for the CommonsenseQA data set (annotator recognition)."""
210 |
211 |   SPLIT_TO_NAME = {
212 |     'annotator': 'annotator_{annotator_idx}{augment_ratio}{take_number}',
213 |     'annotator_multi': 'annotator_multi_{annotator_idx}{augment_ratio}{take_number}',
214 |     'rand': 'rand_{annotator_idx}{augment_ratio}{take_number}',
215 |     'rand_multi': 'rand_multi_{annotator_idx}{augment_ratio}{take_number}',
216 |     'with_annotator': 'with_annotator_id',
217 |     'without_annotator': 'without_annotator_id'
218 |   }
219 |
220 |   TRAIN_FILE_NAME = 'train_{split_name}.json'
221 |   DEV_FILE_NAME = 'dev_{split_name}.json'
222 |   TEST_FILE_NAME = 'dev_{split_name}.json'
223 |
224 |   def __init__(self, split, annotator_idx, augment_ratio, take_number):
225 |     if split not in self.SPLIT_TO_NAME.keys():
226 |
raise ValueError(
227 |         f'split must be one of {", ".join(self.SPLIT_TO_NAME.keys())}.')
228 |
229 |     self.split = split
230 |     self.annotator_idx = annotator_idx
231 |     self.augment_ratio = "_augment{}".format(augment_ratio) if augment_ratio > 0 else ""
232 |     self.take = "_take{}".format(take_number) if take_number > 1 else ""
233 |
234 |   def get_train_examples(self, data_dir):
235 |     train_file_name = self.TRAIN_FILE_NAME.format(
236 |       split_name=self.SPLIT_TO_NAME[self.split].format(annotator_idx=self.annotator_idx, augment_ratio=self.augment_ratio, take_number=self.take))
237 |
238 |     return self._create_examples(
239 |       self._read_json(os.path.join(data_dir, train_file_name)),
240 |       'train')
241 |
242 |   def get_dev_examples(self, data_dir):
243 |     dev_file_name = self.DEV_FILE_NAME.format(
244 |       split_name=self.SPLIT_TO_NAME[self.split].format(annotator_idx=self.annotator_idx, augment_ratio=self.augment_ratio, take_number=self.take))
245 |
246 |     return self._create_examples(
247 |       self._read_json(os.path.join(data_dir, dev_file_name)),
248 |       'dev')
249 |
250 |   def get_test_examples(self, data_dir):
251 |     test_file_name = self.TEST_FILE_NAME.format(
252 |       split_name=self.SPLIT_TO_NAME[self.split].format(annotator_idx=self.annotator_idx, augment_ratio=self.augment_ratio, take_number=self.take))
253 |
254 |     return self._create_examples(
255 |       self._read_json(os.path.join(data_dir, test_file_name)),
256 |       'test')
257 |
258 |   def get_labels(self):
259 |     """See base class."""
260 |     # These are anonymized annotator IDs.
261 |     return ["ANNOT1", "ANNOT2", "ANNOT3", "ANNOT4", "ANNOT5", "OTHER"]
262 |
263 |   def _create_examples(self, lines, set_type):
264 |     """Creates examples for the training and dev sets."""
265 |     labels = self.get_labels()
266 |     examples = []
267 |     for i, line in enumerate(lines):
268 |       qid = "%s-%s" % (set_type, i)
269 |
270 |       answers = ' ; '.join(
271 |         [line['distractor_{}'.format(i)] for i in range(4)] + [line['correct_answer']]
272 |       )
273 |
274 |       text_a = tokenization.convert_to_unicode(line['question'])
275 |       text_b = tokenization.convert_to_unicode(answers)
276 |
277 |       if set_type == "test":
278 |         label = "OTHER"
279 |       else:
280 |         if line["turkIdAnonymized"] in labels:
281 |           label = tokenization.convert_to_unicode(line["turkIdAnonymized"])
282 |         else:
283 |           label = "OTHER"
284 |       examples.append(
285 |         InputExample(guid=qid, text_a=text_a, text_b=text_b, label=label))
286 |
287 |     return examples
288 |
289 |
290 | def convert_single_example(ex_index, example, label_list, max_seq_length,
291 |                            tokenizer):
292 |   """Converts a single `InputExample` into a single `InputFeatures`."""
293 |
294 |   if isinstance(example, PaddingInputExample):
295 |     return InputFeatures(
296 |         input_ids=[0] * max_seq_length,
297 |         input_mask=[0] * max_seq_length,
298 |         segment_ids=[0] * max_seq_length,
299 |         label_id=0,
300 |         is_real_example=False)
301 |
302 |   label_map = {}
303 |   for (i, label) in enumerate(label_list):
304 |     label_map[label] = i
305 |
306 |   tokens_a = tokenizer.tokenize(example.text_a)
307 |   tokens_b = None
308 |   if example.text_b:
309 |     tokens_b = tokenizer.tokenize(example.text_b)
310 |
311 |   if tokens_b:
312 |     # Modifies `tokens_a` and `tokens_b` in place so that the total
313 |     # length is less than the specified length.
314 | # Account for [CLS], [SEP], [SEP] with "- 3" 315 | _truncate_seq_pair(tokens_a, tokens_b, max_seq_length - 3) 316 | else: 317 | # Account for [CLS] and [SEP] with "- 2" 318 | if len(tokens_a) > max_seq_length - 2: 319 | tokens_a = tokens_a[0:(max_seq_length - 2)] 320 | 321 | # The convention in BERT is: 322 | # (a) For sequence pairs: 323 | # tokens: [CLS] is this jack ##son ##ville ? [SEP] no it is not . [SEP] 324 | # type_ids: 0 0 0 0 0 0 0 0 1 1 1 1 1 1 325 | # (b) For single sequences: 326 | # tokens: [CLS] the dog is hairy . [SEP] 327 | # type_ids: 0 0 0 0 0 0 0 328 | # 329 | # Where "type_ids" are used to indicate whether this is the first 330 | # sequence or the second sequence. The embedding vectors for `type=0` and 331 | # `type=1` were learned during pre-training and are added to the wordpiece 332 | # embedding vector (and position vector). This is not *strictly* necessary 333 | # since the [SEP] token unambiguously separates the sequences, but it makes 334 | # it easier for the model to learn the concept of sequences. 335 | # 336 | # For classification tasks, the first vector (corresponding to [CLS]) is 337 | # used as the "sentence vector". Note that this only makes sense because 338 | # the entire model is fine-tuned. 339 | tokens = [] 340 | segment_ids = [] 341 | tokens.append("[CLS]") 342 | segment_ids.append(0) 343 | for token in tokens_a: 344 | tokens.append(token) 345 | segment_ids.append(0) 346 | tokens.append("[SEP]") 347 | segment_ids.append(0) 348 | 349 | if tokens_b: 350 | for token in tokens_b: 351 | tokens.append(token) 352 | segment_ids.append(1) 353 | tokens.append("[SEP]") 354 | segment_ids.append(1) 355 | 356 | input_ids = tokenizer.convert_tokens_to_ids(tokens) 357 | 358 | # The mask has 1 for real tokens and 0 for padding tokens. Only real 359 | # tokens are attended to. 360 | input_mask = [1] * len(input_ids) 361 | 362 | # Zero-pad up to the sequence length. 
363 | while len(input_ids) < max_seq_length: 364 | input_ids.append(0) 365 | input_mask.append(0) 366 | segment_ids.append(0) 367 | 368 | assert len(input_ids) == max_seq_length 369 | assert len(input_mask) == max_seq_length 370 | assert len(segment_ids) == max_seq_length 371 | 372 | label_id = label_map[example.label] 373 | if ex_index < 5: 374 | tf.logging.info("*** Example ***") 375 | tf.logging.info("guid: %s" % (example.guid)) 376 | tf.logging.info("tokens: %s" % " ".join( 377 | [tokenization.printable_text(x) for x in tokens])) 378 | tf.logging.info("input_ids: %s" % " ".join([str(x) for x in input_ids])) 379 | tf.logging.info("input_mask: %s" % " ".join([str(x) for x in input_mask])) 380 | tf.logging.info("segment_ids: %s" % " ".join([str(x) for x in segment_ids])) 381 | tf.logging.info("label: %s (id = %d)" % (example.label, label_id)) 382 | 383 | feature = InputFeatures( 384 | input_ids=input_ids, 385 | input_mask=input_mask, 386 | segment_ids=segment_ids, 387 | label_id=label_id, 388 | is_real_example=True) 389 | return feature 390 | 391 | 392 | def file_based_convert_examples_to_features( 393 | examples, label_list, max_seq_length, tokenizer, output_file): 394 | """Convert a set of `InputExample`s to a TFRecord file.""" 395 | 396 | writer = tf.python_io.TFRecordWriter(output_file) 397 | 398 | for (ex_index, example) in enumerate(examples): 399 | if ex_index % 10000 == 0: 400 | tf.logging.info("Writing example %d of %d" % (ex_index, len(examples))) 401 | 402 | feature = convert_single_example(ex_index, example, label_list, 403 | max_seq_length, tokenizer) 404 | 405 | def create_int_feature(values): 406 | f = tf.train.Feature(int64_list=tf.train.Int64List(value=list(values))) 407 | return f 408 | 409 | features = collections.OrderedDict() 410 | features["input_ids"] = create_int_feature(feature.input_ids) 411 | features["input_mask"] = create_int_feature(feature.input_mask) 412 | features["segment_ids"] = create_int_feature(feature.segment_ids) 413 | features["label_ids"] = create_int_feature([feature.label_id]) 414 | features["is_real_example"] = create_int_feature( 415 | [int(feature.is_real_example)]) 416 | 417 | tf_example = tf.train.Example(features=tf.train.Features(feature=features)) 418 | writer.write(tf_example.SerializeToString()) 419 | writer.close() 420 | 421 | 422 | def file_based_input_fn_builder(input_file, seq_length, is_training, 423 | drop_remainder): 424 | """Creates an `input_fn` closure to be passed to TPUEstimator.""" 425 | 426 | name_to_features = { 427 | "input_ids": tf.FixedLenFeature([seq_length], tf.int64), 428 | "input_mask": tf.FixedLenFeature([seq_length], tf.int64), 429 | "segment_ids": tf.FixedLenFeature([seq_length], tf.int64), 430 | "label_ids": tf.FixedLenFeature([], tf.int64), 431 | "is_real_example": tf.FixedLenFeature([], tf.int64), 432 | } 433 | 434 | def _decode_record(record, name_to_features): 435 | """Decodes a record to a TensorFlow example.""" 436 | example = tf.parse_single_example(record, name_to_features) 437 | 438 | # tf.Example only supports tf.int64, but the TPU only supports tf.int32. 439 | # So cast all int64 to int32. 440 | for name in list(example.keys()): 441 | t = example[name] 442 | if t.dtype == tf.int64: 443 | t = tf.to_int32(t) 444 | example[name] = t 445 | 446 | return example 447 | 448 | def input_fn(params): 449 | """The actual input function.""" 450 | batch_size = params["batch_size"] 451 | 452 | # For training, we want a lot of parallel reading and shuffling. 
453 | # For eval, we want no shuffling and parallel reading doesn't matter. 454 | d = tf.data.TFRecordDataset(input_file) 455 | if is_training: 456 | d = d.repeat() 457 | d = d.shuffle(buffer_size=100) 458 | 459 | d = d.apply( 460 | tf.contrib.data.map_and_batch( 461 | lambda record: _decode_record(record, name_to_features), 462 | batch_size=batch_size, 463 | drop_remainder=drop_remainder)) 464 | 465 | return d 466 | 467 | return input_fn 468 | 469 | 470 | def _truncate_seq_pair(tokens_a, tokens_b, max_length): 471 | """Truncates a sequence pair in place to the maximum length.""" 472 | 473 | # This is a simple heuristic which will always truncate the longer sequence 474 | # one token at a time. This makes more sense than truncating an equal percent 475 | # of tokens from each, since if one sequence is very short then each token 476 | # that's truncated likely contains more information than a longer sequence. 477 | while True: 478 | total_length = len(tokens_a) + len(tokens_b) 479 | if total_length <= max_length: 480 | break 481 | if len(tokens_a) > len(tokens_b): 482 | tokens_a.pop() 483 | else: 484 | tokens_b.pop() 485 | 486 | 487 | def create_model(bert_config, is_training, input_ids, input_mask, segment_ids, 488 | labels, num_labels, use_one_hot_embeddings): 489 | """Creates a classification model.""" 490 | model = modeling.BertModel( 491 | config=bert_config, 492 | is_training=is_training, 493 | input_ids=input_ids, 494 | input_mask=input_mask, 495 | token_type_ids=segment_ids, 496 | use_one_hot_embeddings=use_one_hot_embeddings) 497 | 498 | # In the demo, we are doing a simple classification task on the entire 499 | # segment. 500 | # 501 | # If you want to use the token-level output, use model.get_sequence_output() 502 | # instead. 503 | output_layer = model.get_pooled_output() 504 | 505 | hidden_size = output_layer.shape[-1].value 506 | 507 | output_weights = tf.get_variable( 508 | "output_weights", [num_labels, hidden_size], 509 | initializer=tf.truncated_normal_initializer(stddev=0.02)) 510 | 511 | output_bias = tf.get_variable( 512 | "output_bias", [num_labels], initializer=tf.zeros_initializer()) 513 | 514 | with tf.variable_scope("loss"): 515 | if is_training: 516 | # I.e., 0.1 dropout 517 | output_layer = tf.nn.dropout(output_layer, keep_prob=0.9) 518 | 519 | logits = tf.matmul(output_layer, output_weights, transpose_b=True) 520 | logits = tf.nn.bias_add(logits, output_bias) 521 | probabilities = tf.nn.softmax(logits, axis=-1) 522 | log_probs = tf.nn.log_softmax(logits, axis=-1) 523 | 524 | one_hot_labels = tf.one_hot(labels, depth=num_labels, dtype=tf.float32) 525 | 526 | per_example_loss = -tf.reduce_sum(one_hot_labels * log_probs, axis=-1) 527 | loss = tf.reduce_mean(per_example_loss) 528 | 529 | return (loss, per_example_loss, logits, probabilities) 530 | 531 | 532 | def model_fn_builder(bert_config, num_labels, init_checkpoint, learning_rate, 533 | num_train_steps, num_warmup_steps, use_tpu, 534 | use_one_hot_embeddings): 535 | """Returns `model_fn` closure for TPUEstimator.""" 536 | 537 | def model_fn(features, labels, mode, params): # pylint: disable=unused-argument 538 | """The `model_fn` for TPUEstimator.""" 539 | 540 | tf.logging.info("*** Features ***") 541 | for name in sorted(features.keys()): 542 | tf.logging.info(" name = %s, shape = %s" % (name, features[name].shape)) 543 | 544 | input_ids = features["input_ids"] 545 | input_mask = features["input_mask"] 546 | segment_ids = features["segment_ids"] 547 | label_ids = features["label_ids"] 548 | is_real_example = 
None 549 | if "is_real_example" in features: 550 | is_real_example = tf.cast(features["is_real_example"], dtype=tf.float32) 551 | else: 552 | is_real_example = tf.ones(tf.shape(label_ids), dtype=tf.float32) 553 | 554 | is_training = (mode == tf.estimator.ModeKeys.TRAIN) 555 | 556 | (total_loss, per_example_loss, logits, probabilities) = create_model( 557 | bert_config, is_training, input_ids, input_mask, segment_ids, label_ids, 558 | num_labels, use_one_hot_embeddings) 559 | 560 | tvars = tf.trainable_variables() 561 | initialized_variable_names = {} 562 | scaffold_fn = None 563 | if init_checkpoint: 564 | (assignment_map, initialized_variable_names 565 | ) = modeling.get_assignment_map_from_checkpoint(tvars, init_checkpoint) 566 | if use_tpu: 567 | 568 | def tpu_scaffold(): 569 | tf.train.init_from_checkpoint(init_checkpoint, assignment_map) 570 | return tf.train.Scaffold() 571 | 572 | scaffold_fn = tpu_scaffold 573 | else: 574 | tf.train.init_from_checkpoint(init_checkpoint, assignment_map) 575 | 576 | tf.logging.info("**** Trainable Variables ****") 577 | for var in tvars: 578 | init_string = "" 579 | if var.name in initialized_variable_names: 580 | init_string = ", *INIT_FROM_CKPT*" 581 | tf.logging.info(" name = %s, shape = %s%s", var.name, var.shape, 582 | init_string) 583 | 584 | output_spec = None 585 | if mode == tf.estimator.ModeKeys.TRAIN: 586 | 587 | train_op = optimization.create_optimizer( 588 | total_loss, learning_rate, num_train_steps, num_warmup_steps, use_tpu) 589 | 590 | output_spec = tf.contrib.tpu.TPUEstimatorSpec( 591 | mode=mode, 592 | loss=total_loss, 593 | train_op=train_op, 594 | scaffold_fn=scaffold_fn) 595 | elif mode == tf.estimator.ModeKeys.EVAL: 596 | 597 | def metric_fn(per_example_loss, label_ids, logits, is_real_example): 598 | predictions = tf.argmax(logits, axis=-1, output_type=tf.int32) 599 | accuracy = tf.metrics.accuracy( 600 | labels=label_ids, predictions=predictions, weights=is_real_example) 601 | loss = tf.metrics.mean(values=per_example_loss, weights=is_real_example) 602 | return { 603 | "eval_accuracy": accuracy, 604 | "eval_loss": loss, 605 | } 606 | 607 | eval_metrics = (metric_fn, 608 | [per_example_loss, label_ids, logits, is_real_example]) 609 | output_spec = tf.contrib.tpu.TPUEstimatorSpec( 610 | mode=mode, 611 | loss=total_loss, 612 | eval_metrics=eval_metrics, 613 | scaffold_fn=scaffold_fn) 614 | else: 615 | output_spec = tf.contrib.tpu.TPUEstimatorSpec( 616 | mode=mode, 617 | predictions={"probabilities": probabilities}, 618 | scaffold_fn=scaffold_fn) 619 | return output_spec 620 | 621 | return model_fn 622 | 623 | 624 | # This function is not used by this file but is still used by the Colab and 625 | # people who depend on it. 626 | def input_fn_builder(features, seq_length, is_training, drop_remainder): 627 | """Creates an `input_fn` closure to be passed to TPUEstimator.""" 628 | 629 | all_input_ids = [] 630 | all_input_mask = [] 631 | all_segment_ids = [] 632 | all_label_ids = [] 633 | 634 | for feature in features: 635 | all_input_ids.append(feature.input_ids) 636 | all_input_mask.append(feature.input_mask) 637 | all_segment_ids.append(feature.segment_ids) 638 | all_label_ids.append(feature.label_id) 639 | 640 | def input_fn(params): 641 | """The actual input function.""" 642 | batch_size = params["batch_size"] 643 | 644 | num_examples = len(features) 645 | 646 | # This is for demo purposes and does NOT scale to large data sets. 
We do 647 | # not use Dataset.from_generator() because that uses tf.py_func which is 648 | # not TPU compatible. The right way to load data is with TFRecordReader. 649 | d = tf.data.Dataset.from_tensor_slices({ 650 | "input_ids": 651 | tf.constant( 652 | all_input_ids, shape=[num_examples, seq_length], 653 | dtype=tf.int32), 654 | "input_mask": 655 | tf.constant( 656 | all_input_mask, 657 | shape=[num_examples, seq_length], 658 | dtype=tf.int32), 659 | "segment_ids": 660 | tf.constant( 661 | all_segment_ids, 662 | shape=[num_examples, seq_length], 663 | dtype=tf.int32), 664 | "label_ids": 665 | tf.constant(all_label_ids, shape=[num_examples], dtype=tf.int32), 666 | }) 667 | 668 | if is_training: 669 | d = d.repeat() 670 | d = d.shuffle(buffer_size=100) 671 | 672 | d = d.batch(batch_size=batch_size, drop_remainder=drop_remainder) 673 | return d 674 | 675 | return input_fn 676 | 677 | 678 | # This function is not used by this file but is still used by the Colab and 679 | # people who depend on it. 680 | def convert_examples_to_features(examples, label_list, max_seq_length, 681 | tokenizer): 682 | """Convert a set of `InputExample`s to a list of `InputFeatures`.""" 683 | 684 | features = [] 685 | for (ex_index, example) in enumerate(examples): 686 | if ex_index % 10000 == 0: 687 | tf.logging.info("Writing example %d of %d" % (ex_index, len(examples))) 688 | 689 | feature = convert_single_example(ex_index, example, label_list, 690 | max_seq_length, tokenizer) 691 | 692 | features.append(feature) 693 | return features 694 | 695 | 696 | def main(_): 697 | tf.logging.set_verbosity(tf.logging.INFO) 698 | 699 | if not FLAGS.do_train and not FLAGS.do_eval and not FLAGS.do_predict: 700 | raise ValueError( 701 | "At least one of `do_train`, `do_eval` or `do_predict' must be True.") 702 | 703 | bert_config = modeling.BertConfig.from_json_file(FLAGS.bert_config_file) 704 | 705 | if FLAGS.max_seq_length > bert_config.max_position_embeddings: 706 | raise ValueError( 707 | "Cannot use sequence length %d because the BERT model " 708 | "was only trained up to sequence length %d" % 709 | (FLAGS.max_seq_length, bert_config.max_position_embeddings)) 710 | 711 | tf.gfile.MakeDirs(FLAGS.output_dir) 712 | 713 | processor = CommonsenseQAProcessor(split=FLAGS.split, annotator_idx=FLAGS.annotator_idx, 714 | augment_ratio=FLAGS.augment_ratio, take_number=FLAGS.take_number) 715 | 716 | label_list = processor.get_labels() 717 | 718 | tokenizer = tokenization.FullTokenizer( 719 | vocab_file=FLAGS.vocab_file, do_lower_case=FLAGS.do_lower_case) 720 | 721 | tpu_cluster_resolver = None 722 | if FLAGS.use_tpu and FLAGS.tpu_name: 723 | tpu_cluster_resolver = tf.contrib.cluster_resolver.TPUClusterResolver( 724 | FLAGS.tpu_name, zone=FLAGS.tpu_zone, project=FLAGS.gcp_project) 725 | 726 | is_per_host = tf.contrib.tpu.InputPipelineConfig.PER_HOST_V2 727 | run_config = tf.contrib.tpu.RunConfig( 728 | cluster=tpu_cluster_resolver, 729 | master=FLAGS.master, 730 | model_dir=FLAGS.output_dir, 731 | save_checkpoints_steps=FLAGS.save_checkpoints_steps, 732 | tpu_config=tf.contrib.tpu.TPUConfig( 733 | iterations_per_loop=FLAGS.iterations_per_loop, 734 | num_shards=FLAGS.num_tpu_cores, 735 | per_host_input_for_training=is_per_host)) 736 | 737 | train_examples = None 738 | num_train_steps = None 739 | num_warmup_steps = None 740 | if FLAGS.do_train: 741 | train_examples = processor.get_train_examples(FLAGS.data_dir) 742 | num_train_steps = int( 743 | len(train_examples) / FLAGS.train_batch_size * FLAGS.num_train_epochs) 744 | 
num_warmup_steps = int(num_train_steps * FLAGS.warmup_proportion) 745 | 746 | model_fn = model_fn_builder( 747 | bert_config=bert_config, 748 | num_labels=len(label_list), 749 | init_checkpoint=FLAGS.init_checkpoint, 750 | learning_rate=FLAGS.learning_rate, 751 | num_train_steps=num_train_steps, 752 | num_warmup_steps=num_warmup_steps, 753 | use_tpu=FLAGS.use_tpu, 754 | use_one_hot_embeddings=FLAGS.use_tpu) 755 | 756 | # If TPU is not available, this will fall back to normal Estimator on CPU 757 | # or GPU. 758 | estimator = tf.contrib.tpu.TPUEstimator( 759 | use_tpu=FLAGS.use_tpu, 760 | model_fn=model_fn, 761 | config=run_config, 762 | train_batch_size=FLAGS.train_batch_size, 763 | eval_batch_size=FLAGS.eval_batch_size, 764 | predict_batch_size=FLAGS.predict_batch_size) 765 | 766 | if FLAGS.do_train: 767 | train_file = os.path.join(FLAGS.output_dir, "train.tf_record") 768 | train_seq_length = file_based_convert_examples_to_features( 769 | train_examples, label_list, FLAGS.max_seq_length, tokenizer, train_file) 770 | tf.logging.info("***** Running training *****") 771 | tf.logging.info(" Num examples = %d", len(train_examples)) 772 | tf.logging.info(" Batch size = %d", FLAGS.train_batch_size) 773 | tf.logging.info(" Num steps = %d", num_train_steps) 774 | tf.logging.info(" Longest training sequence = %d", train_seq_length) 775 | train_input_fn = file_based_input_fn_builder( 776 | input_file=train_file, 777 | seq_length=FLAGS.max_seq_length, 778 | is_training=True, 779 | drop_remainder=True) 780 | estimator.train(input_fn=train_input_fn, max_steps=num_train_steps) 781 | 782 | if FLAGS.do_eval: 783 | eval_examples = processor.get_dev_examples(FLAGS.data_dir) 784 | eval_file = os.path.join(FLAGS.output_dir, "eval.tf_record") 785 | eval_seq_length = file_based_convert_examples_to_features( 786 | eval_examples, label_list, FLAGS.max_seq_length, tokenizer, eval_file) 787 | 788 | tf.logging.info("***** Running evaluation *****") 789 | tf.logging.info(" Num examples = %d", len(eval_examples)) 790 | tf.logging.info(" Batch size = %d", FLAGS.eval_batch_size) 791 | tf.logging.info(" Longest eval sequence = %d", eval_seq_length) 792 | 793 | # This tells the estimator to run through the entire set. 794 | eval_steps = None 795 | # However, if running eval on the TPU, you will need to specify the 796 | # number of steps. 797 | if FLAGS.use_tpu: 798 | # Eval will be slightly WRONG on the TPU because it will truncate 799 | # the last batch. 
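      # Illustrative arithmetic (ours, not part of the original script): with
      # 1,003 eval examples and eval_batch_size=8, eval_steps = int(1003 / 8)
      # = 125, so 125 * 8 = 1,000 examples are scored and the last 3 are dropped.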
800 | eval_steps = int(len(eval_examples) / FLAGS.eval_batch_size) 801 | 802 | eval_drop_remainder = True if FLAGS.use_tpu else False 803 | eval_input_fn = file_based_input_fn_builder( 804 | input_file=eval_file, 805 | seq_length=FLAGS.max_seq_length, 806 | is_training=False, 807 | drop_remainder=eval_drop_remainder) 808 | 809 | result = estimator.evaluate(input_fn=eval_input_fn, steps=eval_steps) 810 | 811 | output_eval_file = os.path.join(FLAGS.output_dir, "eval_results.txt") 812 | with tf.gfile.GFile(output_eval_file, "w") as writer: 813 | tf.logging.info("***** Eval results *****") 814 | for key in sorted(result.keys()): 815 | tf.logging.info(" %s = %s", key, str(result[key])) 816 | writer.write("%s = %s\n" % (key, str(result[key]))) 817 | 818 | if FLAGS.do_predict: 819 | predict_examples = processor.get_test_examples(FLAGS.data_dir) 820 | num_actual_predict_examples = len(predict_examples) 821 | predict_file = os.path.join(FLAGS.output_dir, "predict.tf_record") 822 | predict_seq_length = file_based_convert_examples_to_features( 823 | predict_examples, label_list, 824 | FLAGS.max_seq_length, tokenizer, 825 | predict_file) 826 | 827 | tf.logging.info("***** Running prediction*****") 828 | tf.logging.info(" Num examples = %d", len(predict_examples)) 829 | tf.logging.info(" Batch size = %d", FLAGS.predict_batch_size) 830 | tf.logging.info(" Longest predict sequence = %d", predict_seq_length) 831 | 832 | if FLAGS.use_tpu: 833 | # Warning: According to tpu_estimator.py Prediction on TPU is an 834 | # experimental feature and hence not supported here 835 | raise ValueError("Prediction in TPU not supported") 836 | 837 | predict_drop_remainder = True if FLAGS.use_tpu else False 838 | predict_input_fn = file_based_input_fn_builder( 839 | input_file=predict_file, 840 | seq_length=FLAGS.max_seq_length, 841 | is_training=False, 842 | drop_remainder=predict_drop_remainder) 843 | 844 | result = estimator.predict(input_fn=predict_input_fn) 845 | 846 | output_predict_file = os.path.join(FLAGS.output_dir, "test_results.tsv") 847 | with tf.gfile.GFile(output_predict_file, "w") as writer: 848 | num_written_lines = 0 849 | tf.logging.info("***** Predict results *****") 850 | for (i, prediction) in enumerate(result): 851 | probabilities = prediction["probabilities"] 852 | if i >= num_actual_predict_examples: 853 | break 854 | output_line = "\t".join( 855 | str(class_probability) 856 | for class_probability in probabilities) + "\n" 857 | writer.write(output_line) 858 | num_written_lines += 1 859 | assert num_written_lines == num_actual_predict_examples 860 | 861 | 862 | if __name__ == "__main__": 863 | flags.mark_flag_as_required("data_dir") 864 | flags.mark_flag_as_required("vocab_file") 865 | flags.mark_flag_as_required("bert_config_file") 866 | flags.mark_flag_as_required("output_dir") 867 | tf.app.run() 868 | -------------------------------------------------------------------------------- /model_fine_tuning_scripts/run_mnli.py: -------------------------------------------------------------------------------- 1 | """Run BERT on MNLI.""" 2 | 3 | from __future__ import absolute_import 4 | from __future__ import division 5 | from __future__ import print_function 6 | 7 | import collections 8 | import json 9 | import os 10 | 11 | import numpy as np 12 | import tensorflow as tf 13 | 14 | import modeling 15 | import optimization 16 | import tokenization 17 | 18 | 19 | flags = tf.flags 20 | 21 | FLAGS = flags.FLAGS 22 | 23 | ## Required parameters 24 | flags.DEFINE_string( 25 | "data_dir", None, 26 | "The input 
data dir. Should contain the .tsv files (or other data files) " 27 | "for the task.") 28 | 29 | flags.DEFINE_string( 30 | "bert_config_file", None, 31 | "The config json file corresponding to the pre-trained BERT model. " 32 | "This specifies the model architecture.") 33 | 34 | flags.DEFINE_string("vocab_file", None, 35 | "The vocabulary file that the BERT model was trained on.") 36 | 37 | flags.DEFINE_string( 38 | "output_dir", None, 39 | "The output directory where the model checkpoints will be written.") 40 | 41 | flags.DEFINE_string( 42 | "split", None, 43 | "The split you'd like to run on, either 'annotator' or 'rand'.") 44 | 45 | flags.DEFINE_integer( 46 | "annotator_idx", 0, 47 | "Index of the annotator to split by.") 48 | 49 | flags.DEFINE_float( 50 | "augment_ratio", 0.0, 51 | "Proportion of dev examples moved to training set.") 52 | 53 | flags.DEFINE_integer( 54 | "take_number", 1, 55 | "In case this is a re-execution of previous experiment.") 56 | 57 | flags.DEFINE_bool( 58 | "annotator_labels", False, 59 | "Whether to use top 5 annotator ids as labels " 60 | "(+ other label for all other annotators).") 61 | 62 | 63 | ## Other parameters 64 | 65 | flags.DEFINE_string( 66 | "init_checkpoint", None, 67 | "Initial checkpoint (usually from a pre-trained BERT model).") 68 | 69 | flags.DEFINE_bool( 70 | "do_lower_case", True, 71 | "Whether to lower case the input text. Should be True for uncased " 72 | "models and False for cased models.") 73 | 74 | flags.DEFINE_integer( 75 | "max_seq_length", 128, 76 | "The maximum total input sequence length after WordPiece tokenization. " 77 | "Sequences longer than this will be truncated, and sequences shorter " 78 | "than this will be padded.") 79 | 80 | flags.DEFINE_bool("do_train", False, "Whether to run training.") 81 | 82 | flags.DEFINE_bool("do_eval", False, "Whether to run eval on the dev set.") 83 | 84 | flags.DEFINE_bool( 85 | "do_predict", False, 86 | "Whether to run the model in inference mode on the test set.") 87 | 88 | flags.DEFINE_integer("train_batch_size", 32, "Total batch size for training.") 89 | 90 | flags.DEFINE_integer("eval_batch_size", 8, "Total batch size for eval.") 91 | 92 | flags.DEFINE_integer("predict_batch_size", 8, "Total batch size for predict.") 93 | 94 | flags.DEFINE_float("learning_rate", 5e-5, "The initial learning rate for Adam.") 95 | 96 | flags.DEFINE_float("num_train_epochs", 3.0, 97 | "Total number of training epochs to perform.") 98 | 99 | flags.DEFINE_float( 100 | "warmup_proportion", 0.1, 101 | "Proportion of training to perform linear learning rate warmup for. " 102 | "E.g., 0.1 = 10% of training.") 103 | 104 | flags.DEFINE_integer("save_checkpoints_steps", 1000, 105 | "How often to save the model checkpoint.") 106 | 107 | flags.DEFINE_integer("iterations_per_loop", 1000, 108 | "How many steps to make in each estimator call.") 109 | 110 | flags.DEFINE_bool("use_tpu", False, "Whether to use TPU or GPU/CPU.") 111 | 112 | tf.flags.DEFINE_string( 113 | "tpu_name", None, 114 | "The Cloud TPU to use for training. This should be either the name " 115 | "used when creating the Cloud TPU, or a grpc://ip.address.of.tpu:8470 " 116 | "url.") 117 | 118 | tf.flags.DEFINE_string( 119 | "tpu_zone", None, 120 | "[Optional] GCE zone where the Cloud TPU is located in. If not " 121 | "specified, we will attempt to automatically detect the GCE project from " 122 | "metadata.") 123 | 124 | tf.flags.DEFINE_string( 125 | "gcp_project", None, 126 | "[Optional] Project name for the Cloud TPU-enabled project. 
If not " 127 | "specified, we will attempt to automatically detect the GCE project from " 128 | "metadata.") 129 | 130 | tf.flags.DEFINE_string("master", None, "[Optional] TensorFlow master URL.") 131 | 132 | flags.DEFINE_integer( 133 | "num_tpu_cores", 8, 134 | "Only used if `use_tpu` is True. Total number of TPU cores to use.") 135 | 136 | 137 | class InputExample(object): 138 | """A single training/test example for simple sequence classification.""" 139 | 140 | def __init__(self, guid, text_a, text_b=None, label=None): 141 | """Constructs a InputExample. 142 | 143 | Args: 144 | guid: Unique id for the example. 145 | text_a: string. The untokenized text of the first sequence. For single 146 | sequence tasks, only this sequence must be specified. 147 | text_b: (Optional) string. The untokenized text of the second sequence. 148 | Only must be specified for sequence pair tasks. 149 | label: (Optional) string. The label of the example. This should be 150 | specified for train and dev examples, but not for test examples. 151 | """ 152 | self.guid = guid 153 | self.text_a = text_a 154 | self.text_b = text_b 155 | self.label = label 156 | 157 | 158 | class PaddingInputExample(object): 159 | """Fake example so the num input examples is a multiple of the batch size. 160 | 161 | When running eval/predict on the TPU, we need to pad the number of examples 162 | to be a multiple of the batch size, because the TPU requires a fixed batch 163 | size. The alternative is to drop the last batch, which is bad because it means 164 | the entire output data won't be generated. 165 | 166 | We use this class instead of `None` because treating `None` as padding 167 | battches could cause silent errors. 168 | """ 169 | 170 | 171 | class InputFeatures(object): 172 | """A single set of features of data.""" 173 | 174 | def __init__(self, 175 | input_ids, 176 | input_mask, 177 | segment_ids, 178 | label_id, 179 | is_real_example=True): 180 | self.input_ids = input_ids 181 | self.input_mask = input_mask 182 | self.segment_ids = segment_ids 183 | self.label_id = label_id 184 | self.is_real_example = is_real_example 185 | 186 | 187 | class DataProcessor(object): 188 | """Base class for data converters for sequence classification data sets.""" 189 | 190 | def get_train_examples(self, data_dir): 191 | """Gets a collection of `InputExample`s for the train set.""" 192 | raise NotImplementedError() 193 | 194 | def get_dev_examples(self, data_dir): 195 | """Gets a collection of `InputExample`s for the dev set.""" 196 | raise NotImplementedError() 197 | 198 | def get_test_examples(self, data_dir): 199 | """Gets a collection of `InputExample`s for prediction.""" 200 | raise NotImplementedError() 201 | 202 | def get_labels(self): 203 | """Gets the list of labels for this data set.""" 204 | raise NotImplementedError() 205 | 206 | @classmethod 207 | def _read_json(cls, input_file): 208 | """Reads a JSON file.""" 209 | with tf.gfile.Open(input_file, "r") as f: 210 | return json.load(f) 211 | 212 | 213 | class MnliProcessor(DataProcessor): 214 | """Processor for the MultiNLI data set (GLUE version).""" 215 | 216 | SPLIT_TO_NAME = { 217 | 'annotator': 'annotator_{annotator_idx}{augment_ratio}{take_number}', 218 | 'annotator_multi': 'annotator_multi_{annotator_idx}{augment_ratio}{take_number}', 219 | 'rand': 'rand_{annotator_idx}{augment_ratio}{take_number}', 220 | 'rand_multi': 'rand_multi_{annotator_idx}{augment_ratio}{take_number}', 221 | 'with_annotator': 'with_annotator_id', 222 | 'without_annotator': 'without_annotator_id' 223 | 
224 | 
225 |   TRAIN_FILE_NAME = 'train_{split_name}.json'
226 |   DEV_FILE_NAME = 'dev_{split_name}.json'
227 |   TEST_FILE_NAME = 'dev_{split_name}.json'  # note: 'test' reads the same file as 'dev'
228 | 
229 |   def __init__(self, split, annotator_idx, augment_ratio, take_number, annotator_labels):
230 |     if split not in self.SPLIT_TO_NAME.keys():
231 |       raise ValueError(
232 |           f'split must be one of {", ".join(self.SPLIT_TO_NAME.keys())}.')
233 | 
234 |     self.split = split
235 |     self.annotator_idx = annotator_idx
236 |     self.augment_ratio = "_augment{}".format(augment_ratio) if augment_ratio > 0 else ""
237 |     self.take = "_take{}".format(take_number) if take_number > 1 else ""
238 |     self.annotator_labels = annotator_labels
239 | 
240 |   def get_train_examples(self, data_dir):
241 |     train_file_name = self.TRAIN_FILE_NAME.format(
242 |         split_name=self.SPLIT_TO_NAME[self.split].format(annotator_idx=self.annotator_idx, augment_ratio=self.augment_ratio, take_number=self.take))
243 | 
244 |     return self._create_examples(
245 |         self._read_json(os.path.join(data_dir, train_file_name)),
246 |         'train')
247 | 
248 |   def get_dev_examples(self, data_dir):
249 |     dev_file_name = self.DEV_FILE_NAME.format(
250 |         split_name=self.SPLIT_TO_NAME[self.split].format(annotator_idx=self.annotator_idx, augment_ratio=self.augment_ratio, take_number=self.take))
251 | 
252 |     return self._create_examples(
253 |         self._read_json(os.path.join(data_dir, dev_file_name)),
254 |         'dev')
255 | 
256 |   def get_test_examples(self, data_dir):
257 |     test_file_name = self.TEST_FILE_NAME.format(
258 |         split_name=self.SPLIT_TO_NAME[self.split].format(annotator_idx=self.annotator_idx, augment_ratio=self.augment_ratio, take_number=self.take))
259 | 
260 |     return self._create_examples(
261 |         self._read_json(os.path.join(data_dir, test_file_name)),
262 |         'test')
263 | 
264 |   def get_labels(self):
265 |     """See base class."""
266 |     if self.annotator_labels:
267 |       # These are anonymized annotator IDs.
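      # The five most prolific annotators are mapped to ANNOT1..ANNOT5; every
      # other annotator is collapsed into the catch-all OTHER class (see
      # _create_examples below).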
268 | return ["ANNOT1", "ANNOT2", "ANNOT3", "ANNOT4", "ANNOT5", "OTHER"] 269 | else: 270 | return ["contradiction", "entailment", "neutral"] 271 | 272 | def _create_examples(self, lines, set_type): 273 | """Creates examples for the training and dev sets.""" 274 | labels = self.get_labels() 275 | examples = [] 276 | for i, line in enumerate(lines): 277 | guid = tokenization.convert_to_unicode(line["pairID"]) 278 | 279 | text_a = tokenization.convert_to_unicode(line["sentence1"]) 280 | text_b = tokenization.convert_to_unicode(line["sentence2"]) 281 | if self.annotator_labels: 282 | if set_type == "test": 283 | label = "OTHER" 284 | else: 285 | if line["turkIdAnonymized"] in labels: 286 | label = tokenization.convert_to_unicode(line["turkIdAnonymized"]) 287 | else: 288 | label = "OTHER" 289 | else: 290 | if set_type == "test": 291 | label = "contradiction" 292 | else: 293 | label = tokenization.convert_to_unicode(line["gold_label"]) 294 | examples.append( 295 | InputExample(guid=guid, text_a=text_a, text_b=text_b, label=label)) 296 | 297 | return examples 298 | 299 | 300 | def convert_single_example(ex_index, example, label_list, max_seq_length, 301 | tokenizer): 302 | """Converts a single `InputExample` into a single `InputFeatures`.""" 303 | 304 | if isinstance(example, PaddingInputExample): 305 | return InputFeatures( 306 | input_ids=[0] * max_seq_length, 307 | input_mask=[0] * max_seq_length, 308 | segment_ids=[0] * max_seq_length, 309 | label_id=0, 310 | is_real_example=False) 311 | 312 | label_map = {} 313 | for (i, label) in enumerate(label_list): 314 | label_map[label] = i 315 | 316 | tokens_a = tokenizer.tokenize(example.text_a) 317 | tokens_b = None 318 | if example.text_b: 319 | tokens_b = tokenizer.tokenize(example.text_b) 320 | 321 | if tokens_b: 322 | # Modifies `tokens_a` and `tokens_b` in place so that the total 323 | # length is less than the specified length. 324 | # Account for [CLS], [SEP], [SEP] with "- 3" 325 | _truncate_seq_pair(tokens_a, tokens_b, max_seq_length - 3) 326 | else: 327 | # Account for [CLS] and [SEP] with "- 2" 328 | if len(tokens_a) > max_seq_length - 2: 329 | tokens_a = tokens_a[0:(max_seq_length - 2)] 330 | 331 | # The convention in BERT is: 332 | # (a) For sequence pairs: 333 | # tokens: [CLS] is this jack ##son ##ville ? [SEP] no it is not . [SEP] 334 | # type_ids: 0 0 0 0 0 0 0 0 1 1 1 1 1 1 335 | # (b) For single sequences: 336 | # tokens: [CLS] the dog is hairy . [SEP] 337 | # type_ids: 0 0 0 0 0 0 0 338 | # 339 | # Where "type_ids" are used to indicate whether this is the first 340 | # sequence or the second sequence. The embedding vectors for `type=0` and 341 | # `type=1` were learned during pre-training and are added to the wordpiece 342 | # embedding vector (and position vector). This is not *strictly* necessary 343 | # since the [SEP] token unambiguously separates the sequences, but it makes 344 | # it easier for the model to learn the concept of sequences. 345 | # 346 | # For classification tasks, the first vector (corresponding to [CLS]) is 347 | # used as the "sentence vector". Note that this only makes sense because 348 | # the entire model is fine-tuned. 
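  # Illustrative encoding (ours) of an MNLI pair under this convention:
  #   text_a: "The man is playing guitar."  text_b: "A person makes music."
  #   tokens:   [CLS] the man is playing guitar . [SEP] a person makes music . [SEP]
  #   type_ids:   0    0   0  0    0       0   0   0    1    1      1     1   1   1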
349 | tokens = [] 350 | segment_ids = [] 351 | tokens.append("[CLS]") 352 | segment_ids.append(0) 353 | for token in tokens_a: 354 | tokens.append(token) 355 | segment_ids.append(0) 356 | tokens.append("[SEP]") 357 | segment_ids.append(0) 358 | 359 | if tokens_b: 360 | for token in tokens_b: 361 | tokens.append(token) 362 | segment_ids.append(1) 363 | tokens.append("[SEP]") 364 | segment_ids.append(1) 365 | 366 | input_ids = tokenizer.convert_tokens_to_ids(tokens) 367 | 368 | # The mask has 1 for real tokens and 0 for padding tokens. Only real 369 | # tokens are attended to. 370 | input_mask = [1] * len(input_ids) 371 | 372 | # Zero-pad up to the sequence length. 373 | while len(input_ids) < max_seq_length: 374 | input_ids.append(0) 375 | input_mask.append(0) 376 | segment_ids.append(0) 377 | 378 | assert len(input_ids) == max_seq_length 379 | assert len(input_mask) == max_seq_length 380 | assert len(segment_ids) == max_seq_length 381 | 382 | label_id = label_map[example.label] 383 | if ex_index < 5: 384 | tf.logging.info("*** Example ***") 385 | tf.logging.info("guid: %s" % (example.guid)) 386 | tf.logging.info("tokens: %s" % " ".join( 387 | [tokenization.printable_text(x) for x in tokens])) 388 | tf.logging.info("input_ids: %s" % " ".join([str(x) for x in input_ids])) 389 | tf.logging.info("input_mask: %s" % " ".join([str(x) for x in input_mask])) 390 | tf.logging.info("segment_ids: %s" % " ".join([str(x) for x in segment_ids])) 391 | tf.logging.info("label: %s (id = %d)" % (example.label, label_id)) 392 | 393 | feature = InputFeatures( 394 | input_ids=input_ids, 395 | input_mask=input_mask, 396 | segment_ids=segment_ids, 397 | label_id=label_id, 398 | is_real_example=True) 399 | return feature 400 | 401 | 402 | def file_based_convert_examples_to_features( 403 | examples, label_list, max_seq_length, tokenizer, output_file): 404 | """Convert a set of `InputExample`s to a TFRecord file.""" 405 | 406 | writer = tf.python_io.TFRecordWriter(output_file) 407 | 408 | for (ex_index, example) in enumerate(examples): 409 | if ex_index % 10000 == 0: 410 | tf.logging.info("Writing example %d of %d" % (ex_index, len(examples))) 411 | 412 | feature = convert_single_example(ex_index, example, label_list, 413 | max_seq_length, tokenizer) 414 | 415 | def create_int_feature(values): 416 | f = tf.train.Feature(int64_list=tf.train.Int64List(value=list(values))) 417 | return f 418 | 419 | features = collections.OrderedDict() 420 | features["input_ids"] = create_int_feature(feature.input_ids) 421 | features["input_mask"] = create_int_feature(feature.input_mask) 422 | features["segment_ids"] = create_int_feature(feature.segment_ids) 423 | features["label_ids"] = create_int_feature([feature.label_id]) 424 | features["is_real_example"] = create_int_feature( 425 | [int(feature.is_real_example)]) 426 | 427 | tf_example = tf.train.Example(features=tf.train.Features(feature=features)) 428 | writer.write(tf_example.SerializeToString()) 429 | writer.close() 430 | 431 | 432 | def file_based_input_fn_builder(input_file, seq_length, is_training, 433 | drop_remainder): 434 | """Creates an `input_fn` closure to be passed to TPUEstimator.""" 435 | 436 | name_to_features = { 437 | "input_ids": tf.FixedLenFeature([seq_length], tf.int64), 438 | "input_mask": tf.FixedLenFeature([seq_length], tf.int64), 439 | "segment_ids": tf.FixedLenFeature([seq_length], tf.int64), 440 | "label_ids": tf.FixedLenFeature([], tf.int64), 441 | "is_real_example": tf.FixedLenFeature([], tf.int64), 442 | } 443 | 444 | def _decode_record(record, 
name_to_features): 445 | """Decodes a record to a TensorFlow example.""" 446 | example = tf.parse_single_example(record, name_to_features) 447 | 448 | # tf.Example only supports tf.int64, but the TPU only supports tf.int32. 449 | # So cast all int64 to int32. 450 | for name in list(example.keys()): 451 | t = example[name] 452 | if t.dtype == tf.int64: 453 | t = tf.to_int32(t) 454 | example[name] = t 455 | 456 | return example 457 | 458 | def input_fn(params): 459 | """The actual input function.""" 460 | batch_size = params["batch_size"] 461 | 462 | # For training, we want a lot of parallel reading and shuffling. 463 | # For eval, we want no shuffling and parallel reading doesn't matter. 464 | d = tf.data.TFRecordDataset(input_file) 465 | if is_training: 466 | d = d.repeat() 467 | d = d.shuffle(buffer_size=100) 468 | 469 | d = d.apply( 470 | tf.contrib.data.map_and_batch( 471 | lambda record: _decode_record(record, name_to_features), 472 | batch_size=batch_size, 473 | drop_remainder=drop_remainder)) 474 | 475 | return d 476 | 477 | return input_fn 478 | 479 | 480 | def _truncate_seq_pair(tokens_a, tokens_b, max_length): 481 | """Truncates a sequence pair in place to the maximum length.""" 482 | 483 | # This is a simple heuristic which will always truncate the longer sequence 484 | # one token at a time. This makes more sense than truncating an equal percent 485 | # of tokens from each, since if one sequence is very short then each token 486 | # that's truncated likely contains more information than a longer sequence. 487 | while True: 488 | total_length = len(tokens_a) + len(tokens_b) 489 | if total_length <= max_length: 490 | break 491 | if len(tokens_a) > len(tokens_b): 492 | tokens_a.pop() 493 | else: 494 | tokens_b.pop() 495 | 496 | 497 | def create_model(bert_config, is_training, input_ids, input_mask, segment_ids, 498 | labels, num_labels, use_one_hot_embeddings): 499 | """Creates a classification model.""" 500 | model = modeling.BertModel( 501 | config=bert_config, 502 | is_training=is_training, 503 | input_ids=input_ids, 504 | input_mask=input_mask, 505 | token_type_ids=segment_ids, 506 | use_one_hot_embeddings=use_one_hot_embeddings) 507 | 508 | # In the demo, we are doing a simple classification task on the entire 509 | # segment. 510 | # 511 | # If you want to use the token-level output, use model.get_sequence_output() 512 | # instead. 
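  # Shape note (ours): get_pooled_output() is the transformed [CLS] vector of
  # shape [batch_size, hidden_size] (768 for BERT-base), whereas
  # get_sequence_output() would be [batch_size, seq_length, hidden_size].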
513 | output_layer = model.get_pooled_output() 514 | 515 | hidden_size = output_layer.shape[-1].value 516 | 517 | output_weights = tf.get_variable( 518 | "output_weights", [num_labels, hidden_size], 519 | initializer=tf.truncated_normal_initializer(stddev=0.02)) 520 | 521 | output_bias = tf.get_variable( 522 | "output_bias", [num_labels], initializer=tf.zeros_initializer()) 523 | 524 | with tf.variable_scope("loss"): 525 | if is_training: 526 | # I.e., 0.1 dropout 527 | output_layer = tf.nn.dropout(output_layer, keep_prob=0.9) 528 | 529 | logits = tf.matmul(output_layer, output_weights, transpose_b=True) 530 | logits = tf.nn.bias_add(logits, output_bias) 531 | probabilities = tf.nn.softmax(logits, axis=-1) 532 | log_probs = tf.nn.log_softmax(logits, axis=-1) 533 | 534 | one_hot_labels = tf.one_hot(labels, depth=num_labels, dtype=tf.float32) 535 | 536 | per_example_loss = -tf.reduce_sum(one_hot_labels * log_probs, axis=-1) 537 | loss = tf.reduce_mean(per_example_loss) 538 | 539 | return (loss, per_example_loss, logits, probabilities) 540 | 541 | 542 | def model_fn_builder(bert_config, num_labels, init_checkpoint, learning_rate, 543 | num_train_steps, num_warmup_steps, use_tpu, 544 | use_one_hot_embeddings): 545 | """Returns `model_fn` closure for TPUEstimator.""" 546 | 547 | def model_fn(features, labels, mode, params): # pylint: disable=unused-argument 548 | """The `model_fn` for TPUEstimator.""" 549 | 550 | tf.logging.info("*** Features ***") 551 | for name in sorted(features.keys()): 552 | tf.logging.info(" name = %s, shape = %s" % (name, features[name].shape)) 553 | 554 | input_ids = features["input_ids"] 555 | input_mask = features["input_mask"] 556 | segment_ids = features["segment_ids"] 557 | label_ids = features["label_ids"] 558 | is_real_example = None 559 | if "is_real_example" in features: 560 | is_real_example = tf.cast(features["is_real_example"], dtype=tf.float32) 561 | else: 562 | is_real_example = tf.ones(tf.shape(label_ids), dtype=tf.float32) 563 | 564 | is_training = (mode == tf.estimator.ModeKeys.TRAIN) 565 | 566 | (total_loss, per_example_loss, logits, probabilities) = create_model( 567 | bert_config, is_training, input_ids, input_mask, segment_ids, label_ids, 568 | num_labels, use_one_hot_embeddings) 569 | 570 | tvars = tf.trainable_variables() 571 | initialized_variable_names = {} 572 | scaffold_fn = None 573 | if init_checkpoint: 574 | (assignment_map, initialized_variable_names 575 | ) = modeling.get_assignment_map_from_checkpoint(tvars, init_checkpoint) 576 | if use_tpu: 577 | 578 | def tpu_scaffold(): 579 | tf.train.init_from_checkpoint(init_checkpoint, assignment_map) 580 | return tf.train.Scaffold() 581 | 582 | scaffold_fn = tpu_scaffold 583 | else: 584 | tf.train.init_from_checkpoint(init_checkpoint, assignment_map) 585 | 586 | tf.logging.info("**** Trainable Variables ****") 587 | for var in tvars: 588 | init_string = "" 589 | if var.name in initialized_variable_names: 590 | init_string = ", *INIT_FROM_CKPT*" 591 | tf.logging.info(" name = %s, shape = %s%s", var.name, var.shape, 592 | init_string) 593 | 594 | output_spec = None 595 | if mode == tf.estimator.ModeKeys.TRAIN: 596 | 597 | train_op = optimization.create_optimizer( 598 | total_loss, learning_rate, num_train_steps, num_warmup_steps, use_tpu) 599 | 600 | output_spec = tf.contrib.tpu.TPUEstimatorSpec( 601 | mode=mode, 602 | loss=total_loss, 603 | train_op=train_op, 604 | scaffold_fn=scaffold_fn) 605 | elif mode == tf.estimator.ModeKeys.EVAL: 606 | 607 | def metric_fn(per_example_loss, label_ids, logits, 
is_real_example): 608 | predictions = tf.argmax(logits, axis=-1, output_type=tf.int32) 609 | accuracy = tf.metrics.accuracy( 610 | labels=label_ids, predictions=predictions, weights=is_real_example) 611 | loss = tf.metrics.mean(values=per_example_loss, weights=is_real_example) 612 | return { 613 | "eval_accuracy": accuracy, 614 | "eval_loss": loss, 615 | } 616 | 617 | eval_metrics = (metric_fn, 618 | [per_example_loss, label_ids, logits, is_real_example]) 619 | output_spec = tf.contrib.tpu.TPUEstimatorSpec( 620 | mode=mode, 621 | loss=total_loss, 622 | eval_metrics=eval_metrics, 623 | scaffold_fn=scaffold_fn) 624 | else: 625 | output_spec = tf.contrib.tpu.TPUEstimatorSpec( 626 | mode=mode, 627 | predictions={"probabilities": probabilities}, 628 | scaffold_fn=scaffold_fn) 629 | return output_spec 630 | 631 | return model_fn 632 | 633 | 634 | # This function is not used by this file but is still used by the Colab and 635 | # people who depend on it. 636 | def input_fn_builder(features, seq_length, is_training, drop_remainder): 637 | """Creates an `input_fn` closure to be passed to TPUEstimator.""" 638 | 639 | all_input_ids = [] 640 | all_input_mask = [] 641 | all_segment_ids = [] 642 | all_label_ids = [] 643 | 644 | for feature in features: 645 | all_input_ids.append(feature.input_ids) 646 | all_input_mask.append(feature.input_mask) 647 | all_segment_ids.append(feature.segment_ids) 648 | all_label_ids.append(feature.label_id) 649 | 650 | def input_fn(params): 651 | """The actual input function.""" 652 | batch_size = params["batch_size"] 653 | 654 | num_examples = len(features) 655 | 656 | # This is for demo purposes and does NOT scale to large data sets. We do 657 | # not use Dataset.from_generator() because that uses tf.py_func which is 658 | # not TPU compatible. The right way to load data is with TFRecordReader. 659 | d = tf.data.Dataset.from_tensor_slices({ 660 | "input_ids": 661 | tf.constant( 662 | all_input_ids, shape=[num_examples, seq_length], 663 | dtype=tf.int32), 664 | "input_mask": 665 | tf.constant( 666 | all_input_mask, 667 | shape=[num_examples, seq_length], 668 | dtype=tf.int32), 669 | "segment_ids": 670 | tf.constant( 671 | all_segment_ids, 672 | shape=[num_examples, seq_length], 673 | dtype=tf.int32), 674 | "label_ids": 675 | tf.constant(all_label_ids, shape=[num_examples], dtype=tf.int32), 676 | }) 677 | 678 | if is_training: 679 | d = d.repeat() 680 | d = d.shuffle(buffer_size=100) 681 | 682 | d = d.batch(batch_size=batch_size, drop_remainder=drop_remainder) 683 | return d 684 | 685 | return input_fn 686 | 687 | 688 | # This function is not used by this file but is still used by the Colab and 689 | # people who depend on it. 
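# (In-memory counterpart of file_based_convert_examples_to_features above: it
# returns a Python list of InputFeatures instead of writing a TFRecord file.)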
690 | def convert_examples_to_features(examples, label_list, max_seq_length, 691 | tokenizer): 692 | """Convert a set of `InputExample`s to a list of `InputFeatures`.""" 693 | 694 | features = [] 695 | for (ex_index, example) in enumerate(examples): 696 | if ex_index % 10000 == 0: 697 | tf.logging.info("Writing example %d of %d" % (ex_index, len(examples))) 698 | 699 | feature = convert_single_example(ex_index, example, label_list, 700 | max_seq_length, tokenizer) 701 | 702 | features.append(feature) 703 | return features 704 | 705 | 706 | def main(_): 707 | tf.logging.set_verbosity(tf.logging.INFO) 708 | 709 | if not FLAGS.do_train and not FLAGS.do_eval and not FLAGS.do_predict: 710 | raise ValueError( 711 | "At least one of `do_train`, `do_eval` or `do_predict' must be True.") 712 | 713 | bert_config = modeling.BertConfig.from_json_file(FLAGS.bert_config_file) 714 | 715 | if FLAGS.max_seq_length > bert_config.max_position_embeddings: 716 | raise ValueError( 717 | "Cannot use sequence length %d because the BERT model " 718 | "was only trained up to sequence length %d" % 719 | (FLAGS.max_seq_length, bert_config.max_position_embeddings)) 720 | 721 | tf.gfile.MakeDirs(FLAGS.output_dir) 722 | 723 | processor = MnliProcessor(split=FLAGS.split, annotator_idx=FLAGS.annotator_idx, 724 | augment_ratio=FLAGS.augment_ratio, take_number=FLAGS.take_number, 725 | annotator_labels=FLAGS.annotator_labels) 726 | 727 | label_list = processor.get_labels() 728 | 729 | tokenizer = tokenization.FullTokenizer( 730 | vocab_file=FLAGS.vocab_file, do_lower_case=FLAGS.do_lower_case) 731 | 732 | tpu_cluster_resolver = None 733 | if FLAGS.use_tpu and FLAGS.tpu_name: 734 | tpu_cluster_resolver = tf.contrib.cluster_resolver.TPUClusterResolver( 735 | FLAGS.tpu_name, zone=FLAGS.tpu_zone, project=FLAGS.gcp_project) 736 | 737 | is_per_host = tf.contrib.tpu.InputPipelineConfig.PER_HOST_V2 738 | run_config = tf.contrib.tpu.RunConfig( 739 | cluster=tpu_cluster_resolver, 740 | master=FLAGS.master, 741 | model_dir=FLAGS.output_dir, 742 | save_checkpoints_steps=FLAGS.save_checkpoints_steps, 743 | tpu_config=tf.contrib.tpu.TPUConfig( 744 | iterations_per_loop=FLAGS.iterations_per_loop, 745 | num_shards=FLAGS.num_tpu_cores, 746 | per_host_input_for_training=is_per_host)) 747 | 748 | train_examples = None 749 | num_train_steps = None 750 | num_warmup_steps = None 751 | if FLAGS.do_train: 752 | train_examples = processor.get_train_examples(FLAGS.data_dir) 753 | num_train_steps = int( 754 | len(train_examples) / FLAGS.train_batch_size * FLAGS.num_train_epochs) 755 | num_warmup_steps = int(num_train_steps * FLAGS.warmup_proportion) 756 | 757 | model_fn = model_fn_builder( 758 | bert_config=bert_config, 759 | num_labels=len(label_list), 760 | init_checkpoint=FLAGS.init_checkpoint, 761 | learning_rate=FLAGS.learning_rate, 762 | num_train_steps=num_train_steps, 763 | num_warmup_steps=num_warmup_steps, 764 | use_tpu=FLAGS.use_tpu, 765 | use_one_hot_embeddings=FLAGS.use_tpu) 766 | 767 | # If TPU is not available, this will fall back to normal Estimator on CPU 768 | # or GPU. 
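  # Note (ours): TPUEstimator fixes all batch sizes up front and passes the
  # relevant one to each input_fn via params["batch_size"], which is why the
  # input builders above read the batch size from params rather than FLAGS.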
769 | estimator = tf.contrib.tpu.TPUEstimator( 770 | use_tpu=FLAGS.use_tpu, 771 | model_fn=model_fn, 772 | config=run_config, 773 | train_batch_size=FLAGS.train_batch_size, 774 | eval_batch_size=FLAGS.eval_batch_size, 775 | predict_batch_size=FLAGS.predict_batch_size) 776 | 777 | if FLAGS.do_train: 778 | train_file = os.path.join(FLAGS.output_dir, "train.tf_record") 779 | train_seq_length = file_based_convert_examples_to_features( 780 | train_examples, label_list, FLAGS.max_seq_length, tokenizer, train_file) 781 | tf.logging.info("***** Running training *****") 782 | tf.logging.info(" Num examples = %d", len(train_examples)) 783 | tf.logging.info(" Batch size = %d", FLAGS.train_batch_size) 784 | tf.logging.info(" Num steps = %d", num_train_steps) 785 | tf.logging.info(" Longest training sequence = %d", train_seq_length) 786 | train_input_fn = file_based_input_fn_builder( 787 | input_file=train_file, 788 | seq_length=FLAGS.max_seq_length, 789 | is_training=True, 790 | drop_remainder=True) 791 | estimator.train(input_fn=train_input_fn, max_steps=num_train_steps) 792 | 793 | if FLAGS.do_eval: 794 | eval_examples = processor.get_dev_examples(FLAGS.data_dir) 795 | eval_file = os.path.join(FLAGS.output_dir, "eval.tf_record") 796 | eval_seq_length = file_based_convert_examples_to_features( 797 | eval_examples, label_list, FLAGS.max_seq_length, tokenizer, eval_file) 798 | 799 | tf.logging.info("***** Running evaluation *****") 800 | tf.logging.info(" Num examples = %d", len(eval_examples)) 801 | tf.logging.info(" Batch size = %d", FLAGS.eval_batch_size) 802 | tf.logging.info(" Longest eval sequence = %d", eval_seq_length) 803 | 804 | # This tells the estimator to run through the entire set. 805 | eval_steps = None 806 | # However, if running eval on the TPU, you will need to specify the 807 | # number of steps. 808 | if FLAGS.use_tpu: 809 | # Eval will be slightly WRONG on the TPU because it will truncate 810 | # the last batch. 
811 | eval_steps = int(len(eval_examples) / FLAGS.eval_batch_size) 812 | 813 | eval_drop_remainder = True if FLAGS.use_tpu else False 814 | eval_input_fn = file_based_input_fn_builder( 815 | input_file=eval_file, 816 | seq_length=FLAGS.max_seq_length, 817 | is_training=False, 818 | drop_remainder=eval_drop_remainder) 819 | 820 | result = estimator.evaluate(input_fn=eval_input_fn, steps=eval_steps) 821 | 822 | output_eval_file = os.path.join(FLAGS.output_dir, "eval_results.txt") 823 | with tf.gfile.GFile(output_eval_file, "w") as writer: 824 | tf.logging.info("***** Eval results *****") 825 | for key in sorted(result.keys()): 826 | tf.logging.info(" %s = %s", key, str(result[key])) 827 | writer.write("%s = %s\n" % (key, str(result[key]))) 828 | 829 | if FLAGS.do_predict: 830 | predict_examples = processor.get_test_examples(FLAGS.data_dir) 831 | num_actual_predict_examples = len(predict_examples) 832 | predict_file = os.path.join(FLAGS.output_dir, "predict.tf_record") 833 | predict_seq_length = file_based_convert_examples_to_features( 834 | predict_examples, label_list, 835 | FLAGS.max_seq_length, tokenizer, 836 | predict_file) 837 | 838 | tf.logging.info("***** Running prediction*****") 839 | tf.logging.info(" Num examples = %d", len(predict_examples)) 840 | tf.logging.info(" Batch size = %d", FLAGS.predict_batch_size) 841 | tf.logging.info(" Longest predict sequence = %d", predict_seq_length) 842 | 843 | if FLAGS.use_tpu: 844 | # Warning: According to tpu_estimator.py Prediction on TPU is an 845 | # experimental feature and hence not supported here 846 | raise ValueError("Prediction in TPU not supported") 847 | 848 | predict_drop_remainder = True if FLAGS.use_tpu else False 849 | predict_input_fn = file_based_input_fn_builder( 850 | input_file=predict_file, 851 | seq_length=FLAGS.max_seq_length, 852 | is_training=False, 853 | drop_remainder=predict_drop_remainder) 854 | 855 | result = estimator.predict(input_fn=predict_input_fn) 856 | 857 | output_predict_file = os.path.join(FLAGS.output_dir, "test_results.tsv") 858 | with tf.gfile.GFile(output_predict_file, "w") as writer: 859 | num_written_lines = 0 860 | tf.logging.info("***** Predict results *****") 861 | for (i, prediction) in enumerate(result): 862 | probabilities = prediction["probabilities"] 863 | if i >= num_actual_predict_examples: 864 | break 865 | output_line = "\t".join( 866 | str(class_probability) 867 | for class_probability in probabilities) + "\n" 868 | writer.write(output_line) 869 | num_written_lines += 1 870 | assert num_written_lines == num_actual_predict_examples 871 | 872 | 873 | if __name__ == "__main__": 874 | flags.mark_flag_as_required("data_dir") 875 | flags.mark_flag_as_required("vocab_file") 876 | flags.mark_flag_as_required("bert_config_file") 877 | flags.mark_flag_as_required("output_dir") 878 | tf.app.run() 879 | -------------------------------------------------------------------------------- /model_fine_tuning_scripts/run_openbookqa.py: -------------------------------------------------------------------------------- 1 | """Run BERT on OpenBookQA.""" 2 | 3 | from __future__ import absolute_import 4 | from __future__ import division 5 | from __future__ import print_function 6 | 7 | import collections 8 | import json 9 | import os 10 | 11 | import numpy as np 12 | import tensorflow as tf 13 | 14 | import modeling 15 | import optimization 16 | import tokenization 17 | 18 | 19 | flags = tf.flags 20 | 21 | FLAGS = flags.FLAGS 22 | 23 | ## Required parameters 24 | flags.DEFINE_string( 25 | "data_dir", None, 26 | 
"The input data dir. Should contain the .tsv files (or other data files) " 27 | "for the task.") 28 | 29 | flags.DEFINE_string( 30 | "bert_config_file", None, 31 | "The config json file corresponding to the pre-trained BERT model. " 32 | "This specifies the model architecture.") 33 | 34 | flags.DEFINE_string("vocab_file", None, 35 | "The vocabulary file that the BERT model was trained on.") 36 | 37 | flags.DEFINE_string( 38 | "output_dir", None, 39 | "The output directory where the model checkpoints will be written.") 40 | 41 | flags.DEFINE_string( 42 | "split", None, 43 | "The split you'd like to run on, either 'annotator' or 'rand'.") 44 | 45 | flags.DEFINE_integer( 46 | "annotator_idx", 0, 47 | "Index of the annotator to split by.") 48 | 49 | flags.DEFINE_float( 50 | "augment_ratio", 0.0, 51 | "Proportion of dev examples moved to training set.") 52 | 53 | flags.DEFINE_integer( 54 | "take_number", 1, 55 | "In case this is a re-execution of previous experiment.") 56 | 57 | 58 | ## Other parameters 59 | 60 | flags.DEFINE_string( 61 | "init_checkpoint", None, 62 | "Initial checkpoint (usually from a pre-trained BERT model).") 63 | 64 | flags.DEFINE_bool( 65 | "do_lower_case", True, 66 | "Whether to lower case the input text. Should be True for uncased " 67 | "models and False for cased models.") 68 | 69 | flags.DEFINE_integer( 70 | "max_seq_length", 128, 71 | "The maximum total input sequence length after WordPiece tokenization. " 72 | "Sequences longer than this will be truncated, and sequences shorter " 73 | "than this will be padded.") 74 | 75 | flags.DEFINE_bool("do_train", False, "Whether to run training.") 76 | 77 | flags.DEFINE_bool("do_eval", False, "Whether to run eval on the dev set.") 78 | 79 | flags.DEFINE_bool( 80 | "do_predict", False, 81 | "Whether to run the model in inference mode on the test set.") 82 | 83 | flags.DEFINE_integer("train_batch_size", 32, "Total batch size for training.") 84 | 85 | flags.DEFINE_integer("eval_batch_size", 8, "Total batch size for eval.") 86 | 87 | flags.DEFINE_integer("predict_batch_size", 8, "Total batch size for predict.") 88 | 89 | flags.DEFINE_float("learning_rate", 5e-5, "The initial learning rate for Adam.") 90 | 91 | flags.DEFINE_float("num_train_epochs", 3.0, 92 | "Total number of training epochs to perform.") 93 | 94 | flags.DEFINE_float( 95 | "warmup_proportion", 0.1, 96 | "Proportion of training to perform linear learning rate warmup for. " 97 | "E.g., 0.1 = 10% of training.") 98 | 99 | flags.DEFINE_integer("save_checkpoints_steps", 1000, 100 | "How often to save the model checkpoint.") 101 | 102 | flags.DEFINE_integer("iterations_per_loop", 1000, 103 | "How many steps to make in each estimator call.") 104 | 105 | flags.DEFINE_bool("use_tpu", False, "Whether to use TPU or GPU/CPU.") 106 | 107 | tf.flags.DEFINE_string( 108 | "tpu_name", None, 109 | "The Cloud TPU to use for training. This should be either the name " 110 | "used when creating the Cloud TPU, or a grpc://ip.address.of.tpu:8470 " 111 | "url.") 112 | 113 | tf.flags.DEFINE_string( 114 | "tpu_zone", None, 115 | "[Optional] GCE zone where the Cloud TPU is located in. If not " 116 | "specified, we will attempt to automatically detect the GCE project from " 117 | "metadata.") 118 | 119 | tf.flags.DEFINE_string( 120 | "gcp_project", None, 121 | "[Optional] Project name for the Cloud TPU-enabled project. 
If not " 122 | "specified, we will attempt to automatically detect the GCE project from " 123 | "metadata.") 124 | 125 | tf.flags.DEFINE_string("master", None, "[Optional] TensorFlow master URL.") 126 | 127 | flags.DEFINE_integer( 128 | "num_tpu_cores", 8, 129 | "Only used if `use_tpu` is True. Total number of TPU cores to use.") 130 | 131 | 132 | class InputExample(object): 133 | """A single multiple choice question.""" 134 | 135 | def __init__( 136 | self, 137 | qid, 138 | question, 139 | answers, 140 | label): 141 | """Construct an instance.""" 142 | self.qid = qid 143 | self.question = question 144 | self.answers = answers 145 | self.label = label 146 | 147 | 148 | class DataProcessor(object): 149 | """Base class for data converters for sequence classification data sets.""" 150 | 151 | def get_train_examples(self, data_dir): 152 | """Gets a collection of `InputExample`s for the train set.""" 153 | raise NotImplementedError() 154 | 155 | def get_dev_examples(self, data_dir): 156 | """Gets a collection of `InputExample`s for the dev set.""" 157 | raise NotImplementedError() 158 | 159 | def get_test_examples(self, data_dir): 160 | """Gets a collection of `InputExample`s for prediction.""" 161 | raise NotImplementedError() 162 | 163 | def get_labels(self): 164 | """Gets the list of labels for this data set.""" 165 | raise NotImplementedError() 166 | 167 | @classmethod 168 | def _read_json(cls, input_file): 169 | """Reads a JSON file.""" 170 | with tf.gfile.Open(input_file, "r") as f: 171 | return json.load(f) 172 | 173 | 174 | class CommonsenseQAProcessor(DataProcessor): 175 | """Processor for the CommonsenseQA data set.""" 176 | 177 | SPLIT_TO_NAME = { 178 | 'annotator': 'annotator_{annotator_idx}{augment_ratio}{take_number}', 179 | 'annotator_multi': 'annotator_multi_{annotator_idx}{augment_ratio}{take_number}', 180 | 'rand': 'rand_{annotator_idx}{augment_ratio}{take_number}', 181 | 'rand_multi': 'rand_multi_{annotator_idx}{augment_ratio}{take_number}', 182 | 'with_annotator': 'with_annotator_id', 183 | 'without_annotator': 'without_annotator_id', 184 | } 185 | 186 | TRAIN_FILE_NAME = 'train_{split_name}.json' 187 | DEV_FILE_NAME = 'dev_{split_name}.json' 188 | TEST_FILE_NAME = 'dev_{split_name}.json' 189 | 190 | def __init__(self, split, annotator_idx, augment_ratio, take_number): 191 | if split not in self.SPLIT_TO_NAME.keys(): 192 | raise ValueError( 193 | 'split must be one of {", ".join(self.SPLIT_TO_NAME.keys())}.') 194 | 195 | self.split = split 196 | self.annotator_idx = annotator_idx 197 | self.augment_ratio = "_augment{}".format(augment_ratio) if augment_ratio > 0 else "" 198 | self.take = "_take{}".format(take_number) if take_number > 1 else "" 199 | 200 | def get_train_examples(self, data_dir): 201 | train_file_name = self.TRAIN_FILE_NAME.format( 202 | split_name=self.SPLIT_TO_NAME[self.split].format(annotator_idx=self.annotator_idx, augment_ratio=self.augment_ratio, take_number=self.take)) 203 | 204 | return self._create_examples( 205 | self._read_json(os.path.join(data_dir, train_file_name)), 206 | 'train') 207 | 208 | def get_dev_examples(self, data_dir): 209 | dev_file_name = self.DEV_FILE_NAME.format( 210 | split_name=self.SPLIT_TO_NAME[self.split].format(annotator_idx=self.annotator_idx, augment_ratio=self.augment_ratio, take_number=self.take)) 211 | 212 | return self._create_examples( 213 | self._read_json(os.path.join(data_dir, dev_file_name)), 214 | 'dev') 215 | 216 | def get_test_examples(self, data_dir): 217 | test_file_name = self.TEST_FILE_NAME.format( 218 | 
split_name=self.SPLIT_TO_NAME[self.split].format(annotator_idx=self.annotator_idx, augment_ratio=self.augment_ratio, take_number=self.take)) 219 | 220 | return self._create_examples( 221 | self._read_json(os.path.join(data_dir, test_file_name)), 222 | 'test') 223 | 224 | def get_labels(self): 225 | return [0, 1, 2, 3] 226 | 227 | def _create_examples(self, lines, set_type): 228 | examples = [] 229 | for i, line in enumerate(lines): 230 | qid = "%s-%s" % (set_type, i) 231 | 232 | question = tokenization.convert_to_unicode(line['question']) 233 | 234 | answers = np.array([ 235 | tokenization.convert_to_unicode(line['choice0']), 236 | tokenization.convert_to_unicode(line['choice1']), 237 | tokenization.convert_to_unicode(line['choice2']), 238 | tokenization.convert_to_unicode(line['choice3']) 239 | ]) 240 | 241 | label = np.argwhere(answers == line['correct_answer']) 242 | assert len(label) == 1 243 | label = label[0][0] 244 | 245 | examples.append( 246 | InputExample( 247 | qid=qid, 248 | question=question, 249 | answers=answers, 250 | label=label)) 251 | 252 | return examples 253 | 254 | 255 | def example_to_token_ids_segment_ids_label_ids( 256 | ex_index, 257 | example, 258 | max_seq_length, 259 | tokenizer): 260 | """Converts an ``InputExample`` to token ids and segment ids.""" 261 | if ex_index < 5: 262 | tf.logging.info(f"*** Example {ex_index} ***") 263 | tf.logging.info("qid: %s" % (example.qid)) 264 | 265 | question_tokens = tokenizer.tokenize(example.question) 266 | answers_tokens = map(tokenizer.tokenize, example.answers) 267 | 268 | token_ids = [] 269 | segment_ids = [] 270 | for choice_idx, answer_tokens in enumerate(answers_tokens): 271 | truncated_question_tokens = question_tokens[ 272 | :max((max_seq_length - 3)//2, max_seq_length - (len(answer_tokens) + 3))] 273 | truncated_answer_tokens = answer_tokens[ 274 | :max((max_seq_length - 3)//2, max_seq_length - (len(question_tokens) + 3))] 275 | 276 | choice_tokens = [] 277 | choice_segment_ids = [] 278 | choice_tokens.append("[CLS]") 279 | choice_segment_ids.append(0) 280 | for question_token in truncated_question_tokens: 281 | choice_tokens.append(question_token) 282 | choice_segment_ids.append(0) 283 | choice_tokens.append("[SEP]") 284 | choice_segment_ids.append(0) 285 | for answer_token in truncated_answer_tokens: 286 | choice_tokens.append(answer_token) 287 | choice_segment_ids.append(1) 288 | choice_tokens.append("[SEP]") 289 | choice_segment_ids.append(1) 290 | 291 | choice_token_ids = tokenizer.convert_tokens_to_ids(choice_tokens) 292 | 293 | token_ids.append(choice_token_ids) 294 | segment_ids.append(choice_segment_ids) 295 | 296 | if ex_index < 5: 297 | tf.logging.info("choice %s" % choice_idx) 298 | tf.logging.info("tokens: %s" % " ".join( 299 | [tokenization.printable_text(t) for t in choice_tokens])) 300 | tf.logging.info("token ids: %s" % " ".join( 301 | [str(x) for x in choice_token_ids])) 302 | tf.logging.info("segment ids: %s" % " ".join( 303 | [str(x) for x in choice_segment_ids])) 304 | 305 | label_ids = [example.label] 306 | 307 | if ex_index < 5: 308 | tf.logging.info("label: %s (id = %d)" % (example.label, label_ids[0])) 309 | 310 | return token_ids, segment_ids, label_ids 311 | 312 | 313 | def file_based_convert_examples_to_features( 314 | examples, 315 | label_list, 316 | max_seq_length, 317 | tokenizer, 318 | output_file 319 | ): 320 | """Convert a set of ``InputExamples`` to a TFRecord file.""" 321 | 322 | # encode examples into token_ids and segment_ids 323 | token_ids_segment_ids_label_ids = [ 324 | 
example_to_token_ids_segment_ids_label_ids( 325 | ex_index, 326 | example, 327 | max_seq_length, 328 | tokenizer) 329 | for ex_index, example in enumerate(examples) 330 | ] 331 | 332 | # compute the maximum sequence length for any of the inputs 333 | seq_length = max([ 334 | max([len(choice_token_ids) for choice_token_ids in token_ids]) 335 | for token_ids, _, _ in token_ids_segment_ids_label_ids 336 | ]) 337 | 338 | # encode the inputs into fixed-length vectors 339 | writer = tf.python_io.TFRecordWriter(output_file) 340 | 341 | for idx, (token_ids, segment_ids, label_ids) in enumerate( 342 | token_ids_segment_ids_label_ids 343 | ): 344 | if idx % 10000 == 0: 345 | tf.logging.info("Writing %d of %d" % ( 346 | idx, 347 | len(token_ids_segment_ids_label_ids))) 348 | 349 | features = collections.OrderedDict() 350 | for i, (choice_token_ids, choice_segment_ids) in enumerate( 351 | zip(token_ids, segment_ids)): 352 | input_ids = np.zeros(max_seq_length) 353 | input_ids[:len(choice_token_ids)] = np.array(choice_token_ids) 354 | 355 | input_mask = np.zeros(max_seq_length) 356 | input_mask[:len(choice_token_ids)] = 1 357 | 358 | segment_ids = np.zeros(max_seq_length) 359 | segment_ids[:len(choice_segment_ids)] = np.array(choice_segment_ids) 360 | 361 | features[f'input_ids{i}'] = tf.train.Feature( 362 | int64_list=tf.train.Int64List(value=list(input_ids.astype(np.int64)))) 363 | features[f'input_mask{i}'] = tf.train.Feature( 364 | int64_list=tf.train.Int64List(value=list(input_mask.astype(np.int64)))) 365 | features[f'segment_ids{i}'] = tf.train.Feature( 366 | int64_list=tf.train.Int64List(value=list(segment_ids.astype(np.int64)))) 367 | 368 | features['label_ids'] = tf.train.Feature( 369 | int64_list=tf.train.Int64List(value=label_ids)) 370 | 371 | tf_example = tf.train.Example(features=tf.train.Features(feature=features)) 372 | writer.write(tf_example.SerializeToString()) 373 | 374 | return seq_length 375 | 376 | 377 | def file_based_input_fn_builder( 378 | input_file, 379 | seq_length, 380 | is_training, 381 | drop_remainder 382 | ): 383 | """Creates an `input_fn` closure to be passed to TPUEstimator.""" 384 | 385 | name_to_features = { 386 | "input_ids0": tf.FixedLenFeature([seq_length], tf.int64), 387 | "input_mask0": tf.FixedLenFeature([seq_length], tf.int64), 388 | "segment_ids0": tf.FixedLenFeature([seq_length], tf.int64), 389 | "input_ids1": tf.FixedLenFeature([seq_length], tf.int64), 390 | "input_mask1": tf.FixedLenFeature([seq_length], tf.int64), 391 | "segment_ids1": tf.FixedLenFeature([seq_length], tf.int64), 392 | "input_ids2": tf.FixedLenFeature([seq_length], tf.int64), 393 | "input_mask2": tf.FixedLenFeature([seq_length], tf.int64), 394 | "segment_ids2": tf.FixedLenFeature([seq_length], tf.int64), 395 | "input_ids3": tf.FixedLenFeature([seq_length], tf.int64), 396 | "input_mask3": tf.FixedLenFeature([seq_length], tf.int64), 397 | "segment_ids3": tf.FixedLenFeature([seq_length], tf.int64), 398 | "label_ids": tf.FixedLenFeature([], tf.int64), 399 | } 400 | 401 | def _decode_record(record, name_to_features): 402 | """Decodes a record to a TensorFlow example.""" 403 | example = tf.parse_single_example(record, name_to_features) 404 | 405 | # tf.Example only supports tf.int64, but the TPU only supports tf.int32. 406 | # So cast all int64 to int32. 
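    # (Added note: every feature above is parsed as tf.int64, so each one is
    # cast here; e.g. "label_ids" leaves this loop as a tf.int32 scalar.)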
407 | for name in list(example.keys()): 408 | t = example[name] 409 | if t.dtype == tf.int64: 410 | t = tf.to_int32(t) 411 | example[name] = t 412 | 413 | return example 414 | 415 | def input_fn(params): 416 | """The actual input function.""" 417 | batch_size = params["batch_size"] 418 | 419 | # For training, we want a lot of parallel reading and shuffling. 420 | # For eval, we want no shuffling and parallel reading doesn't matter. 421 | d = tf.data.TFRecordDataset(input_file) 422 | if is_training: 423 | d = d.repeat() 424 | d = d.shuffle(buffer_size=100) 425 | 426 | d = d.apply( 427 | tf.contrib.data.map_and_batch( 428 | lambda record: _decode_record(record, name_to_features), 429 | batch_size=batch_size, 430 | drop_remainder=drop_remainder)) 431 | 432 | return d 433 | 434 | return input_fn 435 | 436 | 437 | def create_model( 438 | bert_config, 439 | is_training, 440 | input_ids0, 441 | input_mask0, 442 | segment_ids0, 443 | input_ids1, 444 | input_mask1, 445 | segment_ids1, 446 | input_ids2, 447 | input_mask2, 448 | segment_ids2, 449 | input_ids3, 450 | input_mask3, 451 | segment_ids3, 452 | labels, 453 | num_labels, 454 | use_one_hot_embeddings 455 | ): 456 | """Creates a classification model.""" 457 | input_ids = tf.stack( 458 | [ 459 | input_ids0, 460 | input_ids1, 461 | input_ids2, 462 | input_ids3 463 | ], 464 | axis=1) 465 | input_mask = tf.stack( 466 | [ 467 | input_mask0, 468 | input_mask1, 469 | input_mask2, 470 | input_mask3 471 | ], 472 | axis=1) 473 | segment_ids = tf.stack( 474 | [ 475 | segment_ids0, 476 | segment_ids1, 477 | segment_ids2, 478 | segment_ids3 479 | ], 480 | axis=1) 481 | 482 | _, num_choices, seq_length = input_ids.shape 483 | 484 | input_ids = tf.reshape(input_ids, (-1, seq_length)) 485 | input_mask = tf.reshape(input_mask, (-1, seq_length)) 486 | segment_ids = tf.reshape(segment_ids, (-1, seq_length)) 487 | 488 | output_layer = modeling.BertModel( 489 | config=bert_config, 490 | is_training=is_training, 491 | input_ids=input_ids, 492 | input_mask=input_mask, 493 | token_type_ids=segment_ids, 494 | use_one_hot_embeddings=use_one_hot_embeddings 495 | ).get_pooled_output() 496 | 497 | hidden_size = output_layer.shape[-1].value 498 | 499 | softmax_weights = tf.get_variable( 500 | "softmax_weights", [hidden_size, 1], 501 | initializer=tf.truncated_normal_initializer(stddev=0.02)) 502 | 503 | with tf.variable_scope("loss"): 504 | if is_training: 505 | # I.e., 0.1 dropout 506 | output_layer = tf.nn.dropout(output_layer, keep_prob=0.9) 507 | 508 | logits = tf.reshape( 509 | tf.matmul(output_layer, softmax_weights), 510 | (-1, num_choices)) 511 | 512 | probabilities = tf.nn.softmax(logits, axis=-1) 513 | log_probs = tf.nn.log_softmax(logits, axis=-1) 514 | 515 | one_hot_labels = tf.one_hot(labels, depth=num_labels, dtype=tf.float32) 516 | 517 | per_example_loss = -tf.reduce_sum(one_hot_labels * log_probs, axis=-1) 518 | loss = tf.reduce_mean(per_example_loss) 519 | 520 | return (loss, per_example_loss, logits, probabilities, output_layer) 521 | 522 | 523 | def model_fn_builder( 524 | bert_config, 525 | num_labels, 526 | init_checkpoint, 527 | learning_rate, 528 | num_train_steps, 529 | num_warmup_steps, 530 | use_tpu, 531 | use_one_hot_embeddings 532 | ): 533 | """Returns `model_fn` closure for TPUEstimator.""" 534 | 535 | def model_fn(features, labels, mode, params): # pylint: disable=unused-argument 536 | """The `model_fn` for TPUEstimator.""" 537 | 538 | tf.logging.info("*** Features ***") 539 | for name in sorted(features.keys()): 540 | tf.logging.info(" name = 
%s, shape = %s" % (name, features[name].shape)) 541 | 542 | input_ids0 = features["input_ids0"] 543 | input_mask0 = features["input_mask0"] 544 | segment_ids0 = features["segment_ids0"] 545 | input_ids1 = features["input_ids1"] 546 | input_mask1 = features["input_mask1"] 547 | segment_ids1 = features["segment_ids1"] 548 | input_ids2 = features["input_ids2"] 549 | input_mask2 = features["input_mask2"] 550 | segment_ids2 = features["segment_ids2"] 551 | input_ids3 = features["input_ids3"] 552 | input_mask3 = features["input_mask3"] 553 | segment_ids3 = features["segment_ids3"] 554 | label_ids = features["label_ids"] 555 | 556 | is_training = (mode == tf.estimator.ModeKeys.TRAIN) 557 | 558 | (total_loss, per_example_loss, logits, probabilities, output_layer) = create_model( 559 | bert_config, 560 | is_training, 561 | input_ids0, 562 | input_mask0, 563 | segment_ids0, 564 | input_ids1, 565 | input_mask1, 566 | segment_ids1, 567 | input_ids2, 568 | input_mask2, 569 | segment_ids2, 570 | input_ids3, 571 | input_mask3, 572 | segment_ids3, 573 | label_ids, 574 | num_labels, 575 | use_one_hot_embeddings) 576 | 577 | tvars = tf.trainable_variables() 578 | initialized_variable_names = {} 579 | scaffold_fn = None 580 | if init_checkpoint: 581 | (assignment_map, initialized_variable_names 582 | ) = modeling.get_assignment_map_from_checkpoint(tvars, init_checkpoint) 583 | if use_tpu: 584 | 585 | def tpu_scaffold(): 586 | tf.train.init_from_checkpoint(init_checkpoint, assignment_map) 587 | return tf.train.Scaffold() 588 | 589 | scaffold_fn = tpu_scaffold 590 | else: 591 | tf.train.init_from_checkpoint(init_checkpoint, assignment_map) 592 | 593 | tf.logging.info("**** Trainable Variables ****") 594 | for var in tvars: 595 | init_string = "" 596 | if var.name in initialized_variable_names: 597 | init_string = ", *INIT_FROM_CKPT*" 598 | tf.logging.info(" name = %s, shape = %s%s", var.name, var.shape, 599 | init_string) 600 | 601 | output_spec = None 602 | if mode == tf.estimator.ModeKeys.TRAIN: 603 | 604 | train_op = optimization.create_optimizer( 605 | total_loss, learning_rate, num_train_steps, num_warmup_steps, use_tpu) 606 | 607 | output_spec = tf.contrib.tpu.TPUEstimatorSpec( 608 | mode=mode, 609 | loss=total_loss, 610 | train_op=train_op, 611 | scaffold_fn=scaffold_fn) 612 | elif mode == tf.estimator.ModeKeys.EVAL: 613 | 614 | def metric_fn(per_example_loss, label_ids, logits): 615 | predictions = tf.argmax(logits, axis=-1, output_type=tf.int32) 616 | accuracy = tf.metrics.accuracy(label_ids, predictions) 617 | loss = tf.metrics.mean(per_example_loss) 618 | return { 619 | "eval_accuracy": accuracy, 620 | "eval_loss": loss, 621 | } 622 | 623 | eval_metrics = (metric_fn, [per_example_loss, label_ids, logits]) 624 | output_spec = tf.contrib.tpu.TPUEstimatorSpec( 625 | mode=mode, 626 | loss=total_loss, 627 | eval_metrics=eval_metrics, 628 | scaffold_fn=scaffold_fn,) 629 | else: 630 | output_spec = tf.contrib.tpu.TPUEstimatorSpec( 631 | mode=mode, predictions=probabilities, scaffold_fn=scaffold_fn) 632 | return output_spec 633 | 634 | return model_fn 635 | 636 | 637 | def main(_): 638 | tf.logging.set_verbosity(tf.logging.INFO) 639 | 640 | if not FLAGS.do_train and not FLAGS.do_eval and not FLAGS.do_predict: 641 | raise ValueError( 642 | "At least one of `do_train`, `do_eval` or `do_predict' must be True.") 643 | 644 | bert_config = modeling.BertConfig.from_json_file(FLAGS.bert_config_file) 645 | 646 | if FLAGS.max_seq_length > bert_config.max_position_embeddings: 647 | raise ValueError( 648 | "Cannot use 
sequence length %d because the BERT model "
649 |         "was only trained up to sequence length %d" %
650 |         (FLAGS.max_seq_length, bert_config.max_position_embeddings))
651 | 
652 |   tf.gfile.MakeDirs(FLAGS.output_dir)
653 | 
654 |   processor = CommonsenseQAProcessor(split=FLAGS.split, annotator_idx=FLAGS.annotator_idx, augment_ratio=FLAGS.augment_ratio, take_number=FLAGS.take_number)
655 | 
656 |   label_list = processor.get_labels()
657 | 
658 |   tokenizer = tokenization.FullTokenizer(
659 |       vocab_file=FLAGS.vocab_file, do_lower_case=FLAGS.do_lower_case)
660 | 
661 |   tpu_cluster_resolver = None
662 |   if FLAGS.use_tpu and FLAGS.tpu_name:
663 |     tpu_cluster_resolver = tf.contrib.cluster_resolver.TPUClusterResolver(
664 |         FLAGS.tpu_name, zone=FLAGS.tpu_zone, project=FLAGS.gcp_project)
665 | 
666 |   is_per_host = tf.contrib.tpu.InputPipelineConfig.PER_HOST_V2
667 |   run_config = tf.contrib.tpu.RunConfig(
668 |       cluster=tpu_cluster_resolver,
669 |       master=FLAGS.master,
670 |       model_dir=FLAGS.output_dir,
671 |       save_checkpoints_steps=FLAGS.save_checkpoints_steps,
672 |       tpu_config=tf.contrib.tpu.TPUConfig(
673 |           iterations_per_loop=FLAGS.iterations_per_loop,
674 |           num_shards=FLAGS.num_tpu_cores,
675 |           per_host_input_for_training=is_per_host))
676 | 
677 |   train_examples = None
678 |   num_train_steps = None
679 |   num_warmup_steps = None
680 |   if FLAGS.do_train:
681 |     train_examples = processor.get_train_examples(FLAGS.data_dir)
682 |     num_train_steps = int(
683 |         len(train_examples) / FLAGS.train_batch_size * FLAGS.num_train_epochs)
684 |     num_warmup_steps = int(num_train_steps * FLAGS.warmup_proportion)
685 | 
686 |   model_fn = model_fn_builder(
687 |       bert_config=bert_config,
688 |       num_labels=len(label_list),
689 |       init_checkpoint=FLAGS.init_checkpoint,
690 |       learning_rate=FLAGS.learning_rate,
691 |       num_train_steps=num_train_steps,
692 |       num_warmup_steps=num_warmup_steps,
693 |       use_tpu=FLAGS.use_tpu,
694 |       use_one_hot_embeddings=FLAGS.use_tpu)
695 | 
696 |   # If TPU is not available, this will fall back to normal Estimator on CPU
697 |   # or GPU.
698 | estimator = tf.contrib.tpu.TPUEstimator( 699 | use_tpu=FLAGS.use_tpu, 700 | model_fn=model_fn, 701 | config=run_config, 702 | train_batch_size=FLAGS.train_batch_size, 703 | eval_batch_size=FLAGS.eval_batch_size, 704 | predict_batch_size=FLAGS.predict_batch_size) 705 | 706 | if FLAGS.do_train: 707 | train_file = os.path.join(FLAGS.output_dir, "train.tf_record") 708 | train_seq_length = file_based_convert_examples_to_features( 709 | train_examples, label_list, FLAGS.max_seq_length, tokenizer, train_file) 710 | tf.logging.info("***** Running training *****") 711 | tf.logging.info(" Num examples = %d", len(train_examples)) 712 | tf.logging.info(" Batch size = %d", FLAGS.train_batch_size) 713 | tf.logging.info(" Num steps = %d", num_train_steps) 714 | tf.logging.info(" Longest training sequence = %d", train_seq_length) 715 | train_input_fn = file_based_input_fn_builder( 716 | input_file=train_file, 717 | seq_length=FLAGS.max_seq_length, 718 | is_training=True, 719 | drop_remainder=True) 720 | estimator.train(input_fn=train_input_fn, max_steps=num_train_steps) 721 | 722 | if FLAGS.do_eval: 723 | eval_examples = processor.get_dev_examples(FLAGS.data_dir) 724 | eval_file = os.path.join(FLAGS.output_dir, "eval.tf_record") 725 | eval_seq_length = file_based_convert_examples_to_features( 726 | eval_examples, label_list, FLAGS.max_seq_length, tokenizer, eval_file) 727 | 728 | tf.logging.info("***** Running evaluation *****") 729 | tf.logging.info(" Num examples = %d", len(eval_examples)) 730 | tf.logging.info(" Batch size = %d", FLAGS.eval_batch_size) 731 | tf.logging.info(" Longest eval sequence = %d", eval_seq_length) 732 | 733 | # This tells the estimator to run through the entire set. 734 | eval_steps = None 735 | # However, if running eval on the TPU, you will need to specify the 736 | # number of steps. 737 | if FLAGS.use_tpu: 738 | # Eval will be slightly WRONG on the TPU because it will truncate 739 | # the last batch. 
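      # (The int() below floors the division, so any examples in a final
      # partial batch are silently skipped during TPU evaluation.)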
740 | eval_steps = int(len(eval_examples) / FLAGS.eval_batch_size) 741 | 742 | eval_drop_remainder = True if FLAGS.use_tpu else False 743 | eval_input_fn = file_based_input_fn_builder( 744 | input_file=eval_file, 745 | seq_length=FLAGS.max_seq_length, 746 | is_training=False, 747 | drop_remainder=eval_drop_remainder) 748 | 749 | result = estimator.evaluate(input_fn=eval_input_fn, steps=eval_steps) 750 | 751 | output_eval_file = os.path.join(FLAGS.output_dir, "eval_results.txt") 752 | with tf.gfile.GFile(output_eval_file, "w") as writer: 753 | tf.logging.info("***** Eval results *****") 754 | for key in sorted(result.keys()): 755 | tf.logging.info(" %s = %s", key, str(result[key])) 756 | writer.write("%s = %s\n" % (key, str(result[key]))) 757 | 758 | if FLAGS.do_predict: 759 | predict_examples = processor.get_test_examples(FLAGS.data_dir) 760 | predict_file = os.path.join(FLAGS.output_dir, "predict.tf_record") 761 | predict_seq_length = file_based_convert_examples_to_features( 762 | predict_examples, label_list, 763 | FLAGS.max_seq_length, tokenizer, 764 | predict_file) 765 | 766 | tf.logging.info("***** Running prediction*****") 767 | tf.logging.info(" Num examples = %d", len(predict_examples)) 768 | tf.logging.info(" Batch size = %d", FLAGS.predict_batch_size) 769 | tf.logging.info(" Longest predict sequence = %d", predict_seq_length) 770 | 771 | if FLAGS.use_tpu: 772 | # Warning: According to tpu_estimator.py Prediction on TPU is an 773 | # experimental feature and hence not supported here 774 | raise ValueError("Prediction in TPU not supported") 775 | 776 | predict_drop_remainder = True if FLAGS.use_tpu else False 777 | predict_input_fn = file_based_input_fn_builder( 778 | input_file=predict_file, 779 | seq_length=FLAGS.max_seq_length, 780 | is_training=False, 781 | drop_remainder=predict_drop_remainder) 782 | 783 | result = estimator.predict(input_fn=predict_input_fn) 784 | 785 | output_predict_file = os.path.join(FLAGS.output_dir, "test_results.tsv") 786 | with tf.gfile.GFile(output_predict_file, "w") as writer: 787 | tf.logging.info("***** Predict results *****") 788 | for prediction in result: 789 | output_line = "\t".join( 790 | str(class_probability) for class_probability in prediction) + "\n" 791 | writer.write(output_line) 792 | 793 | 794 | if __name__ == "__main__": 795 | flags.mark_flag_as_required("data_dir") 796 | flags.mark_flag_as_required("vocab_file") 797 | flags.mark_flag_as_required("bert_config_file") 798 | flags.mark_flag_as_required("output_dir") 799 | tf.app.run() 800 | -------------------------------------------------------------------------------- /model_fine_tuning_scripts/run_openbookqa_recognition.py: -------------------------------------------------------------------------------- 1 | """Run BERT on OpenBookQA for annotator ID prediction.""" 2 | 3 | from __future__ import absolute_import 4 | from __future__ import division 5 | from __future__ import print_function 6 | 7 | import collections 8 | import json 9 | import os 10 | 11 | import numpy as np 12 | import tensorflow as tf 13 | 14 | import modeling 15 | import optimization 16 | import tokenization 17 | 18 | 19 | flags = tf.flags 20 | 21 | FLAGS = flags.FLAGS 22 | 23 | ## Required parameters 24 | flags.DEFINE_string( 25 | "data_dir", None, 26 | "The input data dir. Should contain the .tsv files (or other data files) " 27 | "for the task.") 28 | 29 | flags.DEFINE_string( 30 | "bert_config_file", None, 31 | "The config json file corresponding to the pre-trained BERT model. 
" 32 | "This specifies the model architecture.") 33 | 34 | flags.DEFINE_string("vocab_file", None, 35 | "The vocabulary file that the BERT model was trained on.") 36 | 37 | flags.DEFINE_string( 38 | "output_dir", None, 39 | "The output directory where the model checkpoints will be written.") 40 | 41 | flags.DEFINE_string( 42 | "split", None, 43 | "The split you'd like to run on, either 'annotator' or 'rand'.") 44 | 45 | flags.DEFINE_integer( 46 | "annotator_idx", 0, 47 | "Index of the annotator to split by.") 48 | 49 | flags.DEFINE_float( 50 | "augment_ratio", 0.0, 51 | "Proportion of dev examples moved to training set.") 52 | 53 | flags.DEFINE_integer( 54 | "take_number", 1, 55 | "In case this is a re-execution of previous experiment.") 56 | 57 | 58 | ## Other parameters 59 | 60 | flags.DEFINE_string( 61 | "init_checkpoint", None, 62 | "Initial checkpoint (usually from a pre-trained BERT model).") 63 | 64 | flags.DEFINE_bool( 65 | "do_lower_case", True, 66 | "Whether to lower case the input text. Should be True for uncased " 67 | "models and False for cased models.") 68 | 69 | flags.DEFINE_integer( 70 | "max_seq_length", 128, 71 | "The maximum total input sequence length after WordPiece tokenization. " 72 | "Sequences longer than this will be truncated, and sequences shorter " 73 | "than this will be padded.") 74 | 75 | flags.DEFINE_bool("do_train", False, "Whether to run training.") 76 | 77 | flags.DEFINE_bool("do_eval", False, "Whether to run eval on the dev set.") 78 | 79 | flags.DEFINE_bool( 80 | "do_predict", False, 81 | "Whether to run the model in inference mode on the test set.") 82 | 83 | flags.DEFINE_integer("train_batch_size", 32, "Total batch size for training.") 84 | 85 | flags.DEFINE_integer("eval_batch_size", 8, "Total batch size for eval.") 86 | 87 | flags.DEFINE_integer("predict_batch_size", 8, "Total batch size for predict.") 88 | 89 | flags.DEFINE_float("learning_rate", 5e-5, "The initial learning rate for Adam.") 90 | 91 | flags.DEFINE_float("num_train_epochs", 3.0, 92 | "Total number of training epochs to perform.") 93 | 94 | flags.DEFINE_float( 95 | "warmup_proportion", 0.1, 96 | "Proportion of training to perform linear learning rate warmup for. " 97 | "E.g., 0.1 = 10% of training.") 98 | 99 | flags.DEFINE_integer("save_checkpoints_steps", 1000, 100 | "How often to save the model checkpoint.") 101 | 102 | flags.DEFINE_integer("iterations_per_loop", 1000, 103 | "How many steps to make in each estimator call.") 104 | 105 | flags.DEFINE_bool("use_tpu", False, "Whether to use TPU or GPU/CPU.") 106 | 107 | tf.flags.DEFINE_string( 108 | "tpu_name", None, 109 | "The Cloud TPU to use for training. This should be either the name " 110 | "used when creating the Cloud TPU, or a grpc://ip.address.of.tpu:8470 " 111 | "url.") 112 | 113 | tf.flags.DEFINE_string( 114 | "tpu_zone", None, 115 | "[Optional] GCE zone where the Cloud TPU is located in. If not " 116 | "specified, we will attempt to automatically detect the GCE project from " 117 | "metadata.") 118 | 119 | tf.flags.DEFINE_string( 120 | "gcp_project", None, 121 | "[Optional] Project name for the Cloud TPU-enabled project. If not " 122 | "specified, we will attempt to automatically detect the GCE project from " 123 | "metadata.") 124 | 125 | tf.flags.DEFINE_string("master", None, "[Optional] TensorFlow master URL.") 126 | 127 | flags.DEFINE_integer( 128 | "num_tpu_cores", 8, 129 | "Only used if `use_tpu` is True. 
Total number of TPU cores to use.")
130 | 
131 | 
132 | class InputExample(object):
133 |   """A single training/test example for simple sequence classification."""
134 | 
135 |   def __init__(self, guid, text_a, text_b=None, label=None):
136 |     """Constructs an InputExample.
137 | 
138 |     Args:
139 |       guid: Unique id for the example.
140 |       text_a: string. The untokenized text of the first sequence. For single
141 |         sequence tasks, only this sequence must be specified.
142 |       text_b: (Optional) string. The untokenized text of the second sequence.
143 |         Only must be specified for sequence pair tasks.
144 |       label: (Optional) string. The label of the example. This should be
145 |         specified for train and dev examples, but not for test examples.
146 |     """
147 |     self.guid = guid
148 |     self.text_a = text_a
149 |     self.text_b = text_b
150 |     self.label = label
151 | 
152 | 
153 | class PaddingInputExample(object):
154 |   """Fake example so the num input examples is a multiple of the batch size.
155 | 
156 |   When running eval/predict on the TPU, we need to pad the number of examples
157 |   to be a multiple of the batch size, because the TPU requires a fixed batch
158 |   size. The alternative is to drop the last batch, which is bad because it means
159 |   the entire output data won't be generated.
160 | 
161 |   We use this class instead of `None` because treating `None` as padding
162 |   batches could cause silent errors.
163 |   """
164 | 
165 | 
166 | class InputFeatures(object):
167 |   """A single set of features of data."""
168 | 
169 |   def __init__(self,
170 |                input_ids,
171 |                input_mask,
172 |                segment_ids,
173 |                label_id,
174 |                is_real_example=True):
175 |     self.input_ids = input_ids
176 |     self.input_mask = input_mask
177 |     self.segment_ids = segment_ids
178 |     self.label_id = label_id
179 |     self.is_real_example = is_real_example
180 | 
181 | 
182 | class DataProcessor(object):
183 |   """Base class for data converters for sequence classification data sets."""
184 | 
185 |   def get_train_examples(self, data_dir):
186 |     """Gets a collection of `InputExample`s for the train set."""
187 |     raise NotImplementedError()
188 | 
189 |   def get_dev_examples(self, data_dir):
190 |     """Gets a collection of `InputExample`s for the dev set."""
191 |     raise NotImplementedError()
192 | 
193 |   def get_test_examples(self, data_dir):
194 |     """Gets a collection of `InputExample`s for prediction."""
195 |     raise NotImplementedError()
196 | 
197 |   def get_labels(self):
198 |     """Gets the list of labels for this data set."""
199 |     raise NotImplementedError()
200 | 
201 |   @classmethod
202 |   def _read_json(cls, input_file):
203 |     """Reads a JSON file."""
204 |     with tf.gfile.Open(input_file, "r") as f:
205 |       return json.load(f)
206 | 
207 | 
208 | class CommonsenseQAProcessor(DataProcessor):
209 |   """Processor for OpenBookQA, adapted here for the annotator recognition task."""
210 | 
211 |   SPLIT_TO_NAME = {
212 |     'annotator': 'annotator_{annotator_idx}{augment_ratio}{take_number}',
213 |     'annotator_multi': 'annotator_multi_{annotator_idx}{augment_ratio}{take_number}',
214 |     'rand': 'rand_{annotator_idx}{augment_ratio}{take_number}',
215 |     'rand_multi': 'rand_multi_{annotator_idx}{augment_ratio}{take_number}',
216 |     'with_annotator': 'with_annotator_id',
217 |     'without_annotator': 'without_annotator_id'
218 |   }
219 | 
220 |   TRAIN_FILE_NAME = 'train_{split_name}.json'
221 |   DEV_FILE_NAME = 'dev_{split_name}.json'
222 |   TEST_FILE_NAME = 'dev_{split_name}.json'  # the dev file is reused as the test set
223 | 
224 |   def __init__(self, split, annotator_idx, augment_ratio, take_number):
225 |     if split not in self.SPLIT_TO_NAME.keys():
226 |       raise ValueError(
227 |         f'split must be one of {", ".join(self.SPLIT_TO_NAME.keys())}.')
228 | 
229 |     self.split = split
230 |     self.annotator_idx = annotator_idx
231 |     self.augment_ratio = "_augment{}".format(augment_ratio) if augment_ratio > 0 else ""
232 |     self.take = "_take{}".format(take_number) if take_number > 1 else ""
233 | 
234 |   def get_train_examples(self, data_dir):
235 |     train_file_name = self.TRAIN_FILE_NAME.format(
236 |       split_name=self.SPLIT_TO_NAME[self.split].format(annotator_idx=self.annotator_idx, augment_ratio=self.augment_ratio, take_number=self.take))
237 | 
238 |     return self._create_examples(
239 |       self._read_json(os.path.join(data_dir, train_file_name)),
240 |       'train')
241 | 
242 |   def get_dev_examples(self, data_dir):
243 |     dev_file_name = self.DEV_FILE_NAME.format(
244 |       split_name=self.SPLIT_TO_NAME[self.split].format(annotator_idx=self.annotator_idx, augment_ratio=self.augment_ratio, take_number=self.take))
245 | 
246 |     return self._create_examples(
247 |       self._read_json(os.path.join(data_dir, dev_file_name)),
248 |       'dev')
249 | 
250 |   def get_test_examples(self, data_dir):
251 |     test_file_name = self.TEST_FILE_NAME.format(
252 |       split_name=self.SPLIT_TO_NAME[self.split].format(annotator_idx=self.annotator_idx, augment_ratio=self.augment_ratio, take_number=self.take))
253 | 
254 |     return self._create_examples(
255 |       self._read_json(os.path.join(data_dir, test_file_name)),
256 |       'test')
257 | 
258 |   def get_labels(self):
259 |     """See base class."""
260 |     # These are the anonymized IDs of the top 5 annotators of OBQA,
261 |     # as they appear in the released dataset.
262 |     return ["b356d338b7", "7b06152ffc", "dc78319bb0",
263 |             "1e6dc77bb6", "cee82219a0", "OTHER"]
264 | 
265 |   def _create_examples(self, lines, set_type):
266 |     """Creates examples for the training and dev sets."""
267 |     labels = self.get_labels()
268 |     examples = []
269 |     for i, line in enumerate(lines):
270 |       qid = "%s-%s" % (set_type, i)
271 | 
272 |       answers = ' ; '.join(
273 |         [line['choice{}'.format(i)] for i in range(4)] + [line['correct_answer']]
274 |       )
275 | 
276 |       text_a = tokenization.convert_to_unicode(line['question'])
277 |       text_b = tokenization.convert_to_unicode(answers)
278 | 
279 |       if set_type == "test":
280 |         label = "OTHER"
281 |       else:
282 |         if line["turkIdAnonymized"] in labels:
283 |           label = tokenization.convert_to_unicode(line["turkIdAnonymized"])
284 |         else:
285 |           label = "OTHER"
286 |       examples.append(
287 |           InputExample(guid=qid, text_a=text_a, text_b=text_b, label=label))
288 | 
289 |     return examples
290 | 
291 | 
292 | def convert_single_example(ex_index, example, label_list, max_seq_length,
293 |                            tokenizer):
294 |   """Converts a single `InputExample` into a single `InputFeatures`."""
295 | 
296 |   if isinstance(example, PaddingInputExample):
297 |     return InputFeatures(
298 |         input_ids=[0] * max_seq_length,
299 |         input_mask=[0] * max_seq_length,
300 |         segment_ids=[0] * max_seq_length,
301 |         label_id=0,
302 |         is_real_example=False)
303 | 
304 |   label_map = {}
305 |   for (i, label) in enumerate(label_list):
306 |     label_map[label] = i
307 | 
308 |   tokens_a = tokenizer.tokenize(example.text_a)
309 |   tokens_b = None
310 |   if example.text_b:
311 |     tokens_b = tokenizer.tokenize(example.text_b)
312 | 
313 |   if tokens_b:
314 |     # Modifies `tokens_a` and `tokens_b` in place so that the total
315 |     # length is less than the specified length.
316 | # Account for [CLS], [SEP], [SEP] with "- 3" 317 | _truncate_seq_pair(tokens_a, tokens_b, max_seq_length - 3) 318 | else: 319 | # Account for [CLS] and [SEP] with "- 2" 320 | if len(tokens_a) > max_seq_length - 2: 321 | tokens_a = tokens_a[0:(max_seq_length - 2)] 322 | 323 | # The convention in BERT is: 324 | # (a) For sequence pairs: 325 | # tokens: [CLS] is this jack ##son ##ville ? [SEP] no it is not . [SEP] 326 | # type_ids: 0 0 0 0 0 0 0 0 1 1 1 1 1 1 327 | # (b) For single sequences: 328 | # tokens: [CLS] the dog is hairy . [SEP] 329 | # type_ids: 0 0 0 0 0 0 0 330 | # 331 | # Where "type_ids" are used to indicate whether this is the first 332 | # sequence or the second sequence. The embedding vectors for `type=0` and 333 | # `type=1` were learned during pre-training and are added to the wordpiece 334 | # embedding vector (and position vector). This is not *strictly* necessary 335 | # since the [SEP] token unambiguously separates the sequences, but it makes 336 | # it easier for the model to learn the concept of sequences. 337 | # 338 | # For classification tasks, the first vector (corresponding to [CLS]) is 339 | # used as the "sentence vector". Note that this only makes sense because 340 | # the entire model is fine-tuned. 341 | tokens = [] 342 | segment_ids = [] 343 | tokens.append("[CLS]") 344 | segment_ids.append(0) 345 | for token in tokens_a: 346 | tokens.append(token) 347 | segment_ids.append(0) 348 | tokens.append("[SEP]") 349 | segment_ids.append(0) 350 | 351 | if tokens_b: 352 | for token in tokens_b: 353 | tokens.append(token) 354 | segment_ids.append(1) 355 | tokens.append("[SEP]") 356 | segment_ids.append(1) 357 | 358 | input_ids = tokenizer.convert_tokens_to_ids(tokens) 359 | 360 | # The mask has 1 for real tokens and 0 for padding tokens. Only real 361 | # tokens are attended to. 362 | input_mask = [1] * len(input_ids) 363 | 364 | # Zero-pad up to the sequence length. 
365 | while len(input_ids) < max_seq_length: 366 | input_ids.append(0) 367 | input_mask.append(0) 368 | segment_ids.append(0) 369 | 370 | assert len(input_ids) == max_seq_length 371 | assert len(input_mask) == max_seq_length 372 | assert len(segment_ids) == max_seq_length 373 | 374 | label_id = label_map[example.label] 375 | if ex_index < 5: 376 | tf.logging.info("*** Example ***") 377 | tf.logging.info("guid: %s" % (example.guid)) 378 | tf.logging.info("tokens: %s" % " ".join( 379 | [tokenization.printable_text(x) for x in tokens])) 380 | tf.logging.info("input_ids: %s" % " ".join([str(x) for x in input_ids])) 381 | tf.logging.info("input_mask: %s" % " ".join([str(x) for x in input_mask])) 382 | tf.logging.info("segment_ids: %s" % " ".join([str(x) for x in segment_ids])) 383 | tf.logging.info("label: %s (id = %d)" % (example.label, label_id)) 384 | 385 | feature = InputFeatures( 386 | input_ids=input_ids, 387 | input_mask=input_mask, 388 | segment_ids=segment_ids, 389 | label_id=label_id, 390 | is_real_example=True) 391 | return feature 392 | 393 | 394 | def file_based_convert_examples_to_features( 395 | examples, label_list, max_seq_length, tokenizer, output_file): 396 | """Convert a set of `InputExample`s to a TFRecord file.""" 397 | 398 | writer = tf.python_io.TFRecordWriter(output_file) 399 | 400 | for (ex_index, example) in enumerate(examples): 401 | if ex_index % 10000 == 0: 402 | tf.logging.info("Writing example %d of %d" % (ex_index, len(examples))) 403 | 404 | feature = convert_single_example(ex_index, example, label_list, 405 | max_seq_length, tokenizer) 406 | 407 | def create_int_feature(values): 408 | f = tf.train.Feature(int64_list=tf.train.Int64List(value=list(values))) 409 | return f 410 | 411 | features = collections.OrderedDict() 412 | features["input_ids"] = create_int_feature(feature.input_ids) 413 | features["input_mask"] = create_int_feature(feature.input_mask) 414 | features["segment_ids"] = create_int_feature(feature.segment_ids) 415 | features["label_ids"] = create_int_feature([feature.label_id]) 416 | features["is_real_example"] = create_int_feature( 417 | [int(feature.is_real_example)]) 418 | 419 | tf_example = tf.train.Example(features=tf.train.Features(feature=features)) 420 | writer.write(tf_example.SerializeToString()) 421 | writer.close() 422 | 423 | 424 | def file_based_input_fn_builder(input_file, seq_length, is_training, 425 | drop_remainder): 426 | """Creates an `input_fn` closure to be passed to TPUEstimator.""" 427 | 428 | name_to_features = { 429 | "input_ids": tf.FixedLenFeature([seq_length], tf.int64), 430 | "input_mask": tf.FixedLenFeature([seq_length], tf.int64), 431 | "segment_ids": tf.FixedLenFeature([seq_length], tf.int64), 432 | "label_ids": tf.FixedLenFeature([], tf.int64), 433 | "is_real_example": tf.FixedLenFeature([], tf.int64), 434 | } 435 | 436 | def _decode_record(record, name_to_features): 437 | """Decodes a record to a TensorFlow example.""" 438 | example = tf.parse_single_example(record, name_to_features) 439 | 440 | # tf.Example only supports tf.int64, but the TPU only supports tf.int32. 441 | # So cast all int64 to int32. 442 | for name in list(example.keys()): 443 | t = example[name] 444 | if t.dtype == tf.int64: 445 | t = tf.to_int32(t) 446 | example[name] = t 447 | 448 | return example 449 | 450 | def input_fn(params): 451 | """The actual input function.""" 452 | batch_size = params["batch_size"] 453 | 454 | # For training, we want a lot of parallel reading and shuffling. 
455 | # For eval, we want no shuffling and parallel reading doesn't matter. 456 | d = tf.data.TFRecordDataset(input_file) 457 | if is_training: 458 | d = d.repeat() 459 | d = d.shuffle(buffer_size=100) 460 | 461 | d = d.apply( 462 | tf.contrib.data.map_and_batch( 463 | lambda record: _decode_record(record, name_to_features), 464 | batch_size=batch_size, 465 | drop_remainder=drop_remainder)) 466 | 467 | return d 468 | 469 | return input_fn 470 | 471 | 472 | def _truncate_seq_pair(tokens_a, tokens_b, max_length): 473 | """Truncates a sequence pair in place to the maximum length.""" 474 | 475 | # This is a simple heuristic which will always truncate the longer sequence 476 | # one token at a time. This makes more sense than truncating an equal percent 477 | # of tokens from each, since if one sequence is very short then each token 478 | # that's truncated likely contains more information than a longer sequence. 479 | while True: 480 | total_length = len(tokens_a) + len(tokens_b) 481 | if total_length <= max_length: 482 | break 483 | if len(tokens_a) > len(tokens_b): 484 | tokens_a.pop() 485 | else: 486 | tokens_b.pop() 487 | 488 | 489 | def create_model(bert_config, is_training, input_ids, input_mask, segment_ids, 490 | labels, num_labels, use_one_hot_embeddings): 491 | """Creates a classification model.""" 492 | model = modeling.BertModel( 493 | config=bert_config, 494 | is_training=is_training, 495 | input_ids=input_ids, 496 | input_mask=input_mask, 497 | token_type_ids=segment_ids, 498 | use_one_hot_embeddings=use_one_hot_embeddings) 499 | 500 | # In the demo, we are doing a simple classification task on the entire 501 | # segment. 502 | # 503 | # If you want to use the token-level output, use model.get_sequence_output() 504 | # instead. 505 | output_layer = model.get_pooled_output() 506 | 507 | hidden_size = output_layer.shape[-1].value 508 | 509 | output_weights = tf.get_variable( 510 | "output_weights", [num_labels, hidden_size], 511 | initializer=tf.truncated_normal_initializer(stddev=0.02)) 512 | 513 | output_bias = tf.get_variable( 514 | "output_bias", [num_labels], initializer=tf.zeros_initializer()) 515 | 516 | with tf.variable_scope("loss"): 517 | if is_training: 518 | # I.e., 0.1 dropout 519 | output_layer = tf.nn.dropout(output_layer, keep_prob=0.9) 520 | 521 | logits = tf.matmul(output_layer, output_weights, transpose_b=True) 522 | logits = tf.nn.bias_add(logits, output_bias) 523 | probabilities = tf.nn.softmax(logits, axis=-1) 524 | log_probs = tf.nn.log_softmax(logits, axis=-1) 525 | 526 | one_hot_labels = tf.one_hot(labels, depth=num_labels, dtype=tf.float32) 527 | 528 | per_example_loss = -tf.reduce_sum(one_hot_labels * log_probs, axis=-1) 529 | loss = tf.reduce_mean(per_example_loss) 530 | 531 | return (loss, per_example_loss, logits, probabilities) 532 | 533 | 534 | def model_fn_builder(bert_config, num_labels, init_checkpoint, learning_rate, 535 | num_train_steps, num_warmup_steps, use_tpu, 536 | use_one_hot_embeddings): 537 | """Returns `model_fn` closure for TPUEstimator.""" 538 | 539 | def model_fn(features, labels, mode, params): # pylint: disable=unused-argument 540 | """The `model_fn` for TPUEstimator.""" 541 | 542 | tf.logging.info("*** Features ***") 543 | for name in sorted(features.keys()): 544 | tf.logging.info(" name = %s, shape = %s" % (name, features[name].shape)) 545 | 546 | input_ids = features["input_ids"] 547 | input_mask = features["input_mask"] 548 | segment_ids = features["segment_ids"] 549 | label_ids = features["label_ids"] 550 | is_real_example = 
None 551 | if "is_real_example" in features: 552 | is_real_example = tf.cast(features["is_real_example"], dtype=tf.float32) 553 | else: 554 | is_real_example = tf.ones(tf.shape(label_ids), dtype=tf.float32) 555 | 556 | is_training = (mode == tf.estimator.ModeKeys.TRAIN) 557 | 558 | (total_loss, per_example_loss, logits, probabilities) = create_model( 559 | bert_config, is_training, input_ids, input_mask, segment_ids, label_ids, 560 | num_labels, use_one_hot_embeddings) 561 | 562 | tvars = tf.trainable_variables() 563 | initialized_variable_names = {} 564 | scaffold_fn = None 565 | if init_checkpoint: 566 | (assignment_map, initialized_variable_names 567 | ) = modeling.get_assignment_map_from_checkpoint(tvars, init_checkpoint) 568 | if use_tpu: 569 | 570 | def tpu_scaffold(): 571 | tf.train.init_from_checkpoint(init_checkpoint, assignment_map) 572 | return tf.train.Scaffold() 573 | 574 | scaffold_fn = tpu_scaffold 575 | else: 576 | tf.train.init_from_checkpoint(init_checkpoint, assignment_map) 577 | 578 | tf.logging.info("**** Trainable Variables ****") 579 | for var in tvars: 580 | init_string = "" 581 | if var.name in initialized_variable_names: 582 | init_string = ", *INIT_FROM_CKPT*" 583 | tf.logging.info(" name = %s, shape = %s%s", var.name, var.shape, 584 | init_string) 585 | 586 | output_spec = None 587 | if mode == tf.estimator.ModeKeys.TRAIN: 588 | 589 | train_op = optimization.create_optimizer( 590 | total_loss, learning_rate, num_train_steps, num_warmup_steps, use_tpu) 591 | 592 | output_spec = tf.contrib.tpu.TPUEstimatorSpec( 593 | mode=mode, 594 | loss=total_loss, 595 | train_op=train_op, 596 | scaffold_fn=scaffold_fn) 597 | elif mode == tf.estimator.ModeKeys.EVAL: 598 | 599 | def metric_fn(per_example_loss, label_ids, logits, is_real_example): 600 | predictions = tf.argmax(logits, axis=-1, output_type=tf.int32) 601 | accuracy = tf.metrics.accuracy( 602 | labels=label_ids, predictions=predictions, weights=is_real_example) 603 | loss = tf.metrics.mean(values=per_example_loss, weights=is_real_example) 604 | return { 605 | "eval_accuracy": accuracy, 606 | "eval_loss": loss, 607 | } 608 | 609 | eval_metrics = (metric_fn, 610 | [per_example_loss, label_ids, logits, is_real_example]) 611 | output_spec = tf.contrib.tpu.TPUEstimatorSpec( 612 | mode=mode, 613 | loss=total_loss, 614 | eval_metrics=eval_metrics, 615 | scaffold_fn=scaffold_fn) 616 | else: 617 | output_spec = tf.contrib.tpu.TPUEstimatorSpec( 618 | mode=mode, 619 | predictions={"probabilities": probabilities}, 620 | scaffold_fn=scaffold_fn) 621 | return output_spec 622 | 623 | return model_fn 624 | 625 | 626 | # This function is not used by this file but is still used by the Colab and 627 | # people who depend on it. 628 | def input_fn_builder(features, seq_length, is_training, drop_remainder): 629 | """Creates an `input_fn` closure to be passed to TPUEstimator.""" 630 | 631 | all_input_ids = [] 632 | all_input_mask = [] 633 | all_segment_ids = [] 634 | all_label_ids = [] 635 | 636 | for feature in features: 637 | all_input_ids.append(feature.input_ids) 638 | all_input_mask.append(feature.input_mask) 639 | all_segment_ids.append(feature.segment_ids) 640 | all_label_ids.append(feature.label_id) 641 | 642 | def input_fn(params): 643 | """The actual input function.""" 644 | batch_size = params["batch_size"] 645 | 646 | num_examples = len(features) 647 | 648 | # This is for demo purposes and does NOT scale to large data sets. 
We do 649 | # not use Dataset.from_generator() because that uses tf.py_func which is 650 | # not TPU compatible. The right way to load data is with TFRecordReader. 651 | d = tf.data.Dataset.from_tensor_slices({ 652 | "input_ids": 653 | tf.constant( 654 | all_input_ids, shape=[num_examples, seq_length], 655 | dtype=tf.int32), 656 | "input_mask": 657 | tf.constant( 658 | all_input_mask, 659 | shape=[num_examples, seq_length], 660 | dtype=tf.int32), 661 | "segment_ids": 662 | tf.constant( 663 | all_segment_ids, 664 | shape=[num_examples, seq_length], 665 | dtype=tf.int32), 666 | "label_ids": 667 | tf.constant(all_label_ids, shape=[num_examples], dtype=tf.int32), 668 | }) 669 | 670 | if is_training: 671 | d = d.repeat() 672 | d = d.shuffle(buffer_size=100) 673 | 674 | d = d.batch(batch_size=batch_size, drop_remainder=drop_remainder) 675 | return d 676 | 677 | return input_fn 678 | 679 | 680 | # This function is not used by this file but is still used by the Colab and 681 | # people who depend on it. 682 | def convert_examples_to_features(examples, label_list, max_seq_length, 683 | tokenizer): 684 | """Convert a set of `InputExample`s to a list of `InputFeatures`.""" 685 | 686 | features = [] 687 | for (ex_index, example) in enumerate(examples): 688 | if ex_index % 10000 == 0: 689 | tf.logging.info("Writing example %d of %d" % (ex_index, len(examples))) 690 | 691 | feature = convert_single_example(ex_index, example, label_list, 692 | max_seq_length, tokenizer) 693 | 694 | features.append(feature) 695 | return features 696 | 697 | 698 | def main(_): 699 | tf.logging.set_verbosity(tf.logging.INFO) 700 | 701 | if not FLAGS.do_train and not FLAGS.do_eval and not FLAGS.do_predict: 702 | raise ValueError( 703 | "At least one of `do_train`, `do_eval` or `do_predict' must be True.") 704 | 705 | bert_config = modeling.BertConfig.from_json_file(FLAGS.bert_config_file) 706 | 707 | if FLAGS.max_seq_length > bert_config.max_position_embeddings: 708 | raise ValueError( 709 | "Cannot use sequence length %d because the BERT model " 710 | "was only trained up to sequence length %d" % 711 | (FLAGS.max_seq_length, bert_config.max_position_embeddings)) 712 | 713 | tf.gfile.MakeDirs(FLAGS.output_dir) 714 | 715 | processor = CommonsenseQAProcessor(split=FLAGS.split, annotator_idx=FLAGS.annotator_idx, 716 | augment_ratio=FLAGS.augment_ratio, take_number=FLAGS.take_number) 717 | 718 | label_list = processor.get_labels() 719 | 720 | tokenizer = tokenization.FullTokenizer( 721 | vocab_file=FLAGS.vocab_file, do_lower_case=FLAGS.do_lower_case) 722 | 723 | tpu_cluster_resolver = None 724 | if FLAGS.use_tpu and FLAGS.tpu_name: 725 | tpu_cluster_resolver = tf.contrib.cluster_resolver.TPUClusterResolver( 726 | FLAGS.tpu_name, zone=FLAGS.tpu_zone, project=FLAGS.gcp_project) 727 | 728 | is_per_host = tf.contrib.tpu.InputPipelineConfig.PER_HOST_V2 729 | run_config = tf.contrib.tpu.RunConfig( 730 | cluster=tpu_cluster_resolver, 731 | master=FLAGS.master, 732 | model_dir=FLAGS.output_dir, 733 | save_checkpoints_steps=FLAGS.save_checkpoints_steps, 734 | tpu_config=tf.contrib.tpu.TPUConfig( 735 | iterations_per_loop=FLAGS.iterations_per_loop, 736 | num_shards=FLAGS.num_tpu_cores, 737 | per_host_input_for_training=is_per_host)) 738 | 739 | train_examples = None 740 | num_train_steps = None 741 | num_warmup_steps = None 742 | if FLAGS.do_train: 743 | train_examples = processor.get_train_examples(FLAGS.data_dir) 744 | num_train_steps = int( 745 | len(train_examples) / FLAGS.train_batch_size * FLAGS.num_train_epochs) 746 | 
num_warmup_steps = int(num_train_steps * FLAGS.warmup_proportion) 747 | 748 | model_fn = model_fn_builder( 749 | bert_config=bert_config, 750 | num_labels=len(label_list), 751 | init_checkpoint=FLAGS.init_checkpoint, 752 | learning_rate=FLAGS.learning_rate, 753 | num_train_steps=num_train_steps, 754 | num_warmup_steps=num_warmup_steps, 755 | use_tpu=FLAGS.use_tpu, 756 | use_one_hot_embeddings=FLAGS.use_tpu) 757 | 758 | # If TPU is not available, this will fall back to normal Estimator on CPU 759 | # or GPU. 760 | estimator = tf.contrib.tpu.TPUEstimator( 761 | use_tpu=FLAGS.use_tpu, 762 | model_fn=model_fn, 763 | config=run_config, 764 | train_batch_size=FLAGS.train_batch_size, 765 | eval_batch_size=FLAGS.eval_batch_size, 766 | predict_batch_size=FLAGS.predict_batch_size) 767 | 768 | if FLAGS.do_train: 769 | train_file = os.path.join(FLAGS.output_dir, "train.tf_record") 770 | train_seq_length = file_based_convert_examples_to_features( 771 | train_examples, label_list, FLAGS.max_seq_length, tokenizer, train_file) 772 | tf.logging.info("***** Running training *****") 773 | tf.logging.info(" Num examples = %d", len(train_examples)) 774 | tf.logging.info(" Batch size = %d", FLAGS.train_batch_size) 775 | tf.logging.info(" Num steps = %d", num_train_steps) 776 | tf.logging.info(" Longest training sequence = %d", train_seq_length) 777 | train_input_fn = file_based_input_fn_builder( 778 | input_file=train_file, 779 | seq_length=FLAGS.max_seq_length, 780 | is_training=True, 781 | drop_remainder=True) 782 | estimator.train(input_fn=train_input_fn, max_steps=num_train_steps) 783 | 784 | if FLAGS.do_eval: 785 | eval_examples = processor.get_dev_examples(FLAGS.data_dir) 786 | eval_file = os.path.join(FLAGS.output_dir, "eval.tf_record") 787 | eval_seq_length = file_based_convert_examples_to_features( 788 | eval_examples, label_list, FLAGS.max_seq_length, tokenizer, eval_file) 789 | 790 | tf.logging.info("***** Running evaluation *****") 791 | tf.logging.info(" Num examples = %d", len(eval_examples)) 792 | tf.logging.info(" Batch size = %d", FLAGS.eval_batch_size) 793 | tf.logging.info(" Longest eval sequence = %d", eval_seq_length) 794 | 795 | # This tells the estimator to run through the entire set. 796 | eval_steps = None 797 | # However, if running eval on the TPU, you will need to specify the 798 | # number of steps. 799 | if FLAGS.use_tpu: 800 | # Eval will be slightly WRONG on the TPU because it will truncate 801 | # the last batch. 
802 | eval_steps = int(len(eval_examples) / FLAGS.eval_batch_size) 803 | 804 | eval_drop_remainder = True if FLAGS.use_tpu else False 805 | eval_input_fn = file_based_input_fn_builder( 806 | input_file=eval_file, 807 | seq_length=FLAGS.max_seq_length, 808 | is_training=False, 809 | drop_remainder=eval_drop_remainder) 810 | 811 | result = estimator.evaluate(input_fn=eval_input_fn, steps=eval_steps) 812 | 813 | output_eval_file = os.path.join(FLAGS.output_dir, "eval_results.txt") 814 | with tf.gfile.GFile(output_eval_file, "w") as writer: 815 | tf.logging.info("***** Eval results *****") 816 | for key in sorted(result.keys()): 817 | tf.logging.info(" %s = %s", key, str(result[key])) 818 | writer.write("%s = %s\n" % (key, str(result[key]))) 819 | 820 | if FLAGS.do_predict: 821 | predict_examples = processor.get_test_examples(FLAGS.data_dir) 822 | num_actual_predict_examples = len(predict_examples) 823 | predict_file = os.path.join(FLAGS.output_dir, "predict.tf_record") 824 | predict_seq_length = file_based_convert_examples_to_features( 825 | predict_examples, label_list, 826 | FLAGS.max_seq_length, tokenizer, 827 | predict_file) 828 | 829 | tf.logging.info("***** Running prediction*****") 830 | tf.logging.info(" Num examples = %d", len(predict_examples)) 831 | tf.logging.info(" Batch size = %d", FLAGS.predict_batch_size) 832 | tf.logging.info(" Longest predict sequence = %d", predict_seq_length) 833 | 834 | if FLAGS.use_tpu: 835 | # Warning: According to tpu_estimator.py Prediction on TPU is an 836 | # experimental feature and hence not supported here 837 | raise ValueError("Prediction in TPU not supported") 838 | 839 | predict_drop_remainder = True if FLAGS.use_tpu else False 840 | predict_input_fn = file_based_input_fn_builder( 841 | input_file=predict_file, 842 | seq_length=FLAGS.max_seq_length, 843 | is_training=False, 844 | drop_remainder=predict_drop_remainder) 845 | 846 | result = estimator.predict(input_fn=predict_input_fn) 847 | 848 | output_predict_file = os.path.join(FLAGS.output_dir, "test_results.tsv") 849 | with tf.gfile.GFile(output_predict_file, "w") as writer: 850 | num_written_lines = 0 851 | tf.logging.info("***** Predict results *****") 852 | for (i, prediction) in enumerate(result): 853 | probabilities = prediction["probabilities"] 854 | if i >= num_actual_predict_examples: 855 | break 856 | output_line = "\t".join( 857 | str(class_probability) 858 | for class_probability in probabilities) + "\n" 859 | writer.write(output_line) 860 | num_written_lines += 1 861 | assert num_written_lines == num_actual_predict_examples 862 | 863 | 864 | if __name__ == "__main__": 865 | flags.mark_flag_as_required("data_dir") 866 | flags.mark_flag_as_required("vocab_file") 867 | flags.mark_flag_as_required("bert_config_file") 868 | flags.mark_flag_as_required("output_dir") 869 | tf.app.run() 870 | -------------------------------------------------------------------------------- /obqa_create_data_splits.py: -------------------------------------------------------------------------------- 1 | 2 | import argparse 3 | import json 4 | import pandas as pd 5 | 6 | 7 | pd.set_option('display.width', 1000) 8 | pd.set_option('display.max_columns', 1000) 9 | 10 | 11 | def load_data(): 12 | trn_file_path = "./OpenBookQA-V1-Sep2018/Data/Additional/train_complete.jsonl" 13 | dev_file_path = "./OpenBookQA-V1-Sep2018/Data/Additional/dev_complete.jsonl" 14 | 15 | examples = [] 16 | with open(trn_file_path, "r") as fd: 17 | examples.extend(fd.readlines()) 18 | num_trn_examples = len(examples) 19 | print("read 
{} training examples.".format(num_trn_examples))
20 | 
21 |     with open(dev_file_path, "r") as fd:
22 |         examples.extend(fd.readlines())
23 |     print("read {} development examples.".format(len(examples) - num_trn_examples))
24 | 
25 |     def parse_json_line(jline):
26 |         line = json.loads(jline)
27 |         parsed_line = {
28 |             "id": line["id"],
29 |             "turkIdAnonymized": line["turkIdAnonymized"],
30 |             "answerKey": line["answerKey"],
31 |             "clarity": line["clarity"],
32 |             "fact1": line["fact1"],
33 |             "humanScore": line["humanScore"],
34 |             "question": line["question"]["stem"],
35 |             "choice0": line["question"]["choices"][0]["text"],
36 |             "choice1": line["question"]["choices"][1]["text"],
37 |             "choice2": line["question"]["choices"][2]["text"],
38 |             "choice3": line["question"]["choices"][3]["text"]
39 |         }
40 |         anskey_to_choice = {"A": "choice0", "B": "choice1", "C": "choice2", "D": "choice3"}
41 |         parsed_line["correct_answer"] = parsed_line[anskey_to_choice[parsed_line["answerKey"]]]
42 | 
43 |         return parsed_line
44 | 
45 |     print("total number of examples: {}.".format(len(examples)))
46 |     df = pd.DataFrame([parse_json_line(jline) for jline in examples])
47 |     df = df.sample(frac=1.0)  # shuffle the examples
48 | 
49 |     return df
50 | 
51 | 
52 | def dump_to_json(example_set, split_type, split_partition, split_index, augment):
53 |     example_set_filtered_cols = example_set.loc[:, ['id', 'turkIdAnonymized', 'question', 'correct_answer',
54 |                                                     'answerKey', 'choice0', 'choice1', 'choice2', 'choice3']]
55 | 
56 |     data_dict = example_set_filtered_cols.reset_index().to_dict(orient='records')
57 |     file_name = split_partition + '_' + split_type + '_' + str(split_index) + augment + ".json"
58 |     print('Saving %s with %d examples' % (file_name, len(data_dict)))
59 |     with open(file_name, "w") as f:
60 |         json.dump(data_dict, f, sort_keys=True, indent=4)
61 | 
62 | 
63 | def print_annotator_distribution(data):
64 |     dff = data.groupby(['turkIdAnonymized'])[
65 |         [c for c in data.columns if c != 'turkIdAnonymized']].nunique().sort_values(
66 |         'id', ascending=False).reset_index()
67 |     total_number_of_examples = dff.id.sum()
68 |     print("total number of unique examples: ", total_number_of_examples)
69 |     dff["percentage"] = round(dff.id * 100.0 / total_number_of_examples, 1)
70 |     print(dff)
71 | 
72 | 
73 | def split_dev_sets(dev_sets, num_splits, augment_ratio):
74 |     dev_sets_new = []
75 |     dev_sets_move = []
76 |     for i in range(num_splits):
77 |         split_annotators = dev_sets[i]['turkIdAnonymized'].unique()
78 |         split_annotator_splits = [
79 |             dev_sets[i][dev_sets[i]['turkIdAnonymized'] == annotator]
80 |             for annotator in split_annotators
81 |         ]
82 |         annotator_split_idx = [
83 |             int(augment_ratio * len(split_annotator_splits[j]))
84 |             for j in range(len(split_annotators))
85 |         ]
86 | 
87 |         dev_set_new = []
88 |         dev_set_move = []
89 |         for j in range(len(split_annotators)):
90 |             dev_set_new.append(split_annotator_splits[j][annotator_split_idx[j]:])
91 |             dev_set_move.append(split_annotator_splits[j][0:annotator_split_idx[j]])
92 | 
93 |         dev_sets_new.append(pd.concat(dev_set_new))
94 |         dev_sets_move.append(pd.concat(dev_set_move))
95 | 
96 |     dev_split_idx = [
97 |         len(dev_sets_move[i])
98 |         for i in range(num_splits)
99 |     ]
100 | 
101 |     return dev_sets_new, dev_sets_move, dev_split_idx
102 | 
103 | 
104 | def augment_data(train_sets, dev_sets, num_splits, augment_ratio,
105 |                  by_annotator, dev_split_idx=None, normalize=True):
106 |     if by_annotator:
107 |         dev_sets_new, dev_sets_move, dev_split_idx = split_dev_sets(
108 |             dev_sets, num_splits, augment_ratio)
109 |     else:
110 |         if not 
dev_split_idx: 111 | dev_split_idx = [ 112 | int(augment_ratio * len(dev_sets[i])) 113 | for i in range(num_splits) 114 | ] 115 | dev_sets_new = [ 116 | dev_sets[i][dev_split_idx[i]:] 117 | for i in range(num_splits) 118 | ] 119 | dev_sets_move = [ 120 | dev_sets[i][0:dev_split_idx[i]] 121 | for i in range(num_splits) 122 | ] 123 | 124 | if normalize: 125 | augmented_train_sets = [ 126 | train_sets[i].sample(n=len(train_sets[i]) - len(dev_sets_move[i])) 127 | for i in range(num_splits) 128 | ] 129 | else: 130 | augmented_train_sets = [ 131 | train_sets[i].sample(frac=1.0) 132 | for i in range(num_splits) 133 | ] 134 | 135 | augmented_train_sets = [ 136 | pd.concat([augmented_train_sets[i], dev_sets_move[i]], 137 | ignore_index=True) 138 | for i in range(num_splits) 139 | ] 140 | 141 | return augmented_train_sets, dev_sets_new, dev_split_idx 142 | 143 | 144 | def print_split_sizes(train_sets_by_annotator, dev_sets_by_annotator, num_splits): 145 | for i in range(num_splits): 146 | print("annotator {}: train set size {}, dev set size {}, total size {}".format( 147 | i, len(train_sets_by_annotator[i]), len(dev_sets_by_annotator[i]), 148 | len(train_sets_by_annotator[i]) + len(dev_sets_by_annotator[i]))) 149 | 150 | 151 | def get_annotator_splits(all_data, dev_annotator_list, num_splits, multi_mode): 152 | if multi_mode: 153 | dev_sets_by_annotator = [ 154 | all_data.loc[all_data["turkIdAnonymized"].isin(dev_annotator_list[i])] 155 | for i in range(num_splits) 156 | ] 157 | train_sets_by_annotator = [ 158 | all_data.loc[~all_data["turkIdAnonymized"].isin(dev_annotator_list[i])] 159 | for i in range(num_splits) 160 | ] 161 | else: 162 | dev_sets_by_annotator = [ 163 | all_data.loc[all_data["turkIdAnonymized"] == dev_annotator_list[i]] 164 | for i in range(num_splits) 165 | ] 166 | train_sets_by_annotator = [ 167 | all_data.loc[all_data["turkIdAnonymized"] != dev_annotator_list[i]] 168 | for i in range(num_splits) 169 | ] 170 | 171 | return train_sets_by_annotator, dev_sets_by_annotator 172 | 173 | 174 | def get_random_splits(all_data, train_sets_by_annotator, dev_sets_by_annotator, num_splits): 175 | shuffled_data_frames = [ 176 | all_data.sample(frac=1.0) 177 | for _ in range(num_splits) 178 | ] 179 | 180 | dev_sets_random = [ 181 | shuffled_data_frames[i][0:dev_sets_by_annotator[i].shape[0]] 182 | for i in range(num_splits) 183 | ] 184 | train_sets_random = [ 185 | shuffled_data_frames[i][dev_sets_by_annotator[i].shape[0]: 186 | dev_sets_by_annotator[i].shape[0] + train_sets_by_annotator[i].shape[0]] 187 | for i in range(num_splits) 188 | ] 189 | 190 | return train_sets_random, dev_sets_random 191 | 192 | 193 | def write_split_files(train_sets_by_annotator, dev_sets_by_annotator, 194 | train_sets_random, dev_sets_random, num_splits, 195 | multi_mode, augment_ratio, take_number, only_random, only_annotator): 196 | type_add = "_multi" if multi_mode else "" 197 | for i in range(0, num_splits): 198 | augment = "_augment{}".format(augment_ratio) if augment_ratio > 0 else "" 199 | take = "_take{}".format(take_number) if take_number > 1 else "" 200 | properties = augment + take 201 | 202 | if not only_random: 203 | dump_to_json(train_sets_by_annotator[i], "annotator" + type_add, "train", i, properties) 204 | dump_to_json(dev_sets_by_annotator[i], "annotator" + type_add, "dev", i, properties) 205 | 206 | if not only_annotator: 207 | dump_to_json(train_sets_random[i], "rand" + type_add, "train", i, properties) 208 | dump_to_json(dev_sets_random[i], "rand" + type_add, "dev", i, properties) 209 | 210 | 
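# Note on file naming (illustrative): a call such as
#   dump_to_json(train_set, "annotator", "train", 0, "_augment0.2_take3")
# writes train_annotator_0_augment0.2_take3.json, which is exactly the name the
# SPLIT_TO_NAME templates in the fine-tuning scripts resolve to for
# --split=annotator --annotator_idx=0 --augment_ratio=0.2 --take_number=3.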
211 | def create_random_augmented_series(multi_mode, take_number): 212 | all_data = load_data() 213 | print_annotator_distribution(all_data) 214 | 215 | sorted_annotators = all_data.groupby(['turkIdAnonymized']).nunique().sort_values( 216 | 'id', ascending=False)['id'].keys() 217 | 218 | if multi_mode: 219 | dev_annotator_list = [sorted_annotators[i:i + 5].values for i in range(20, 45, 5)] 220 | num_splits = 5 221 | else: 222 | dev_annotator_list = [sorted_annotators[i] for i in range(5)] 223 | num_splits = 5 224 | 225 | train_sets_by_annotator, dev_sets_by_annotator = get_annotator_splits( 226 | all_data, dev_annotator_list, num_splits, multi_mode) 227 | 228 | train_sets_random, dev_sets_random = get_random_splits( 229 | all_data, train_sets_by_annotator, dev_sets_by_annotator, num_splits) 230 | 231 | write_split_files([], [], train_sets_random, dev_sets_random, num_splits, 232 | multi_mode, 0.0, take_number, 233 | only_random=True, only_annotator=False) 234 | 235 | for augment_ratio in [0.1, 0.2, 0.3]: 236 | print("before augmentation.") 237 | print_split_sizes(train_sets_random, dev_sets_random, num_splits) 238 | 239 | augmented_train_sets_by_annotator, augmented_dev_sets_by_annotator, dev_split_idx = augment_data( 240 | train_sets_by_annotator, dev_sets_by_annotator, num_splits, augment_ratio, 241 | by_annotator=True) 242 | 243 | augmented_train_sets_random, augmented_dev_sets_random, _ = augment_data( 244 | train_sets_random, dev_sets_random, num_splits, augment_ratio, 245 | by_annotator=False, dev_split_idx=dev_split_idx) 246 | 247 | print("after augmentation.") 248 | print_split_sizes(augmented_train_sets_random, augmented_dev_sets_random, 249 | num_splits) 250 | 251 | for i in range(num_splits): 252 | assert augmented_train_sets_by_annotator[i].shape == augmented_train_sets_random[i].shape 253 | assert augmented_dev_sets_by_annotator[i].shape == augmented_dev_sets_random[i].shape 254 | assert augmented_train_sets_random[i].shape == train_sets_random[i].shape 255 | print("split {}: train {}, dev {}".format( 256 | i, augmented_train_sets_random[i].shape, augmented_dev_sets_random[i].shape)) 257 | 258 | write_split_files([], [], augmented_train_sets_random, augmented_dev_sets_random, 259 | num_splits, multi_mode, augment_ratio, take_number, 260 | only_random=True, only_annotator=False) 261 | 262 | 263 | def create_data_splits(multi_mode=True, augment_ratio=0.0, take_number=1, only_random=False, only_annotator=False): 264 | all_data = load_data() 265 | print_annotator_distribution(all_data) 266 | 267 | sorted_annotators = all_data.groupby(['turkIdAnonymized']).nunique().sort_values( 268 | 'id', ascending=False)['id'].keys() 269 | 270 | if multi_mode: 271 | dev_annotator_list = [sorted_annotators[i:i+5].values for i in range(20, 45, 5)] 272 | num_splits = 5 273 | else: 274 | dev_annotator_list = [sorted_annotators[i] for i in range(5)] 275 | num_splits = 5 276 | 277 | train_sets_by_annotator, dev_sets_by_annotator = get_annotator_splits( 278 | all_data, dev_annotator_list, num_splits, multi_mode) 279 | 280 | train_sets_random, dev_sets_random = get_random_splits( 281 | all_data, train_sets_by_annotator, dev_sets_by_annotator, num_splits) 282 | 283 | if augment_ratio > 0: 284 | print("before augmentation.") 285 | print_split_sizes(train_sets_by_annotator, dev_sets_by_annotator, num_splits) 286 | 287 | train_sets_by_annotator, dev_sets_by_annotator, dev_split_idx = augment_data( 288 | train_sets_by_annotator, dev_sets_by_annotator, num_splits, augment_ratio, 289 | by_annotator=True) 290 | 
291 |         train_sets_random, dev_sets_random, _ = augment_data(
292 |             train_sets_random, dev_sets_random, num_splits, augment_ratio,
293 |             by_annotator=False, dev_split_idx=dev_split_idx)
294 | 
295 |         print("after augmentation.")
296 |         print_split_sizes(train_sets_by_annotator, dev_sets_by_annotator, num_splits)
297 | 
298 |         for i in range(num_splits):
299 |             assert train_sets_random[i].shape == train_sets_by_annotator[i].shape
300 |             assert dev_sets_random[i].shape == dev_sets_by_annotator[i].shape
301 |             print("split {}: train {}, dev {}".format(i, train_sets_random[i].shape, dev_sets_random[i].shape))
302 | 
303 |     write_split_files(train_sets_by_annotator, dev_sets_by_annotator,
304 |                       train_sets_random, dev_sets_random, num_splits,
305 |                       multi_mode, augment_ratio, take_number, only_random, only_annotator)
306 | 
307 | 
308 | def main(args):
309 |     for i in range(args.repeat):
310 |         if args.augment_random_series:
311 |             create_random_augmented_series(args.multi_mode,
312 |                                            args.take_number + i)
313 |         else:
314 |             create_data_splits(args.multi_mode,
315 |                                args.augment_ratio,
316 |                                args.take_number + i,
317 |                                args.only_random,
318 |                                args.only_annotator)
319 | 
320 | 
321 | if __name__ == '__main__':
322 |     parser = argparse.ArgumentParser(description="OpenBookQA: create data splits")
323 |     parser.add_argument('--augment_ratio', type=float, default=0.0,
324 |                         help='fraction of annotator examples to augment the train set with.')
325 |     parser.add_argument('--take_number', type=int, default=1,
326 |                         help='the number of times the specified split is being generated. '
327 |                              'If repeat > 1, this argument is the starting index of the generated splits.')
328 |     parser.add_argument('--repeat', type=int, default=1,
329 |                         help='how many times to generate the specified split.')
330 | 
331 |     parser.add_argument('--multi_mode', action='store_true', default=False,
332 |                         help='multi-annotator mode.')
333 |     parser.add_argument('--augment_random_series', action='store_true', default=False,
334 |                         help='a series of augmentation splits, starting from random splits '
335 |                              'that match the annotator splits in size.')
336 |     parser.add_argument('--only_random', action='store_true', default=False,
337 |                         help='generate only random splits.')
338 |     parser.add_argument('--only_annotator', action='store_true', default=False,
339 |                         help='generate only annotator splits.')
340 | 
341 |     args = parser.parse_args()
342 | 
343 |     main(args)
344 | 
345 | 
--------------------------------------------------------------------------------
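As a quick sanity check, a generated split can be inspected directly. The sketch below is illustrative only (it is not part of the repository) and assumes a split produced with the default flags, i.e. a file named `train_annotator_0.json`:

```python
import json

# Illustrative only: load an annotator-based train split produced by
# obqa_create_data_splits.py with default flags (annotator index 0,
# no augmentation, take 1).
with open("train_annotator_0.json") as f:
    examples = json.load(f)

print("number of examples:", len(examples))

# Each record carries the columns selected in dump_to_json.
ex = examples[0]
print(ex["turkIdAnonymized"])                         # anonymized annotator ID
print(ex["question"])                                 # question stem
print([ex["choice{}".format(i)] for i in range(4)])   # the four answer choices
print(ex["answerKey"], ex["correct_answer"])          # gold answer
```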