├── LICENSE ├── NaNoGenMo_50K_words_sample.txt ├── README.md ├── cleaner_on_bert_weights.py ├── do_critic.py ├── gpt1finetune.py ├── gpt1sample.py ├── gpt1tokenize_trainset.py ├── handwriting.png ├── paranoid_transformer.pdf ├── paranoid_transformer.png ├── paranoid_transformer_back.png ├── paranoid_transformer_w_pics.pdf ├── pics_samples.png ├── simple_cleaner.py ├── train_classifier.py ├── vocab.txt └── weight_samples.py /LICENSE: -------------------------------------------------------------------------------- 1 | MIT License 2 | 3 | Copyright (c) 2020 Aleksey Tikhonov 4 | 5 | Permission is hereby granted, free of charge, to any person obtaining a copy 6 | of this software and associated documentation files (the "Software"), to deal 7 | in the Software without restriction, including without limitation the rights 8 | to use, copy, modify, merge, publish, distribute, sublicense, and/or sell 9 | copies of the Software, and to permit persons to whom the Software is 10 | furnished to do so, subject to the following conditions: 11 | 12 | The above copyright notice and this permission notice shall be included in all 13 | copies or substantial portions of the Software. 14 | 15 | THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR 16 | IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, 17 | FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE 18 | AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER 19 | LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, 20 | OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE 21 | SOFTWARE. 22 | -------------------------------------------------------------------------------- /README.md: -------------------------------------------------------------------------------- 1 | # Paranoid Transformer 2 | 3 | ## TLDR 4 | 5 | In the end, this project turned into a published neural-network-generated book. [Check the story behind it in my Medium post](https://medium.com/altsoph/paranoid-transformer-80a960ddc90a). 6 | 7 | ## Overview 8 | 9 | 10 | This is an attempt to build an unsupervised text generator that produces text with specific stylistic and formal characteristics. 11 | Originally it was published as an entry for [NaNoGenMo 2019](https://github.com/NaNoGenMo/2019/issues/142) (the _National Novel Generation Month_ contest). 12 | 13 | The general idea behind the _Paranoid Transformer_ project is to build a paranoiac-critical system based on two neural networks. 14 | The first network (the _Paranoiac-intrusive Generator_) is a GPT-based, fine-tuned conditional language model, and the second one (the _Critic subsystem_) is a BERT-based classifier that works as a filtering subsystem, selecting the best passages from the flow of generated text. Finally, I used an existing handwriting synthesis neural network implementation to generate a nervous handwritten diary where the degree of shakiness depends on the sentiment strength of a given sentence. 15 | 16 | ## Generator subsystem 17 | 18 | The first network, the Paranoiac-intrusive subsystem AKA the Generator, uses the [OpenAI GPT](https://github.com/openai/finetune-transformer-lm) architecture and the [implementation from huggingface](https://github.com/huggingface/transformers). I took a publicly available model already pre-trained on the huge fiction [BooksCorpus dataset](https://arxiv.org/pdf/1506.06724.pdf) with approximately 10K books and 1B words.
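For reference, here is a minimal sketch of how such a pre-trained GPT-1 model is loaded with the old `pytorch_pretrained_bert` package; it mirrors the setup in `gpt1finetune.py` further down. The special-token names below are placeholders for the conditioning labels described in the next section (the actual list lives in `SPECIAL_TOKENS` inside that script):

```python
import torch
from pytorch_pretrained_bert import OpenAIGPTLMHeadModel, OpenAIGPTTokenizer

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# Publicly available GPT-1 weights pre-trained on BooksCorpus.
tokenizer = OpenAIGPTTokenizer.from_pretrained('openai-gpt')
model = OpenAIGPTLMHeadModel.from_pretrained('openai-gpt')

# Register the conditioning labels as special tokens (placeholder names here).
SPECIAL_TOKENS = ["<quote>", "<long>", "<cyber>", "<other>", "<text>", "<meta1>", "<meta2>", "<pad>"]
tokenizer.set_special_tokens(SPECIAL_TOKENS)
model.set_num_special_tokens(len(SPECIAL_TOKENS))
model.to(device)
```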
19 | 20 | Next, I fine-tuned it on several additional handcrafted text corpora (altogether ~50 MB of text): 21 | - a collection of Crypto Texts (the Crypto Anarchist Manifesto, the Cyphernomicon, etc.), 22 | - another collection of fiction books (from such cyberpunk authors as Dick, Gibson, and others, plus non-cyberpunk authors, for example, Kafka and Rumi), 23 | - transcripts and subtitles from some cyberpunk movies and series, 24 | - several thousand quotes and fortune cookie messages collected from different sources. 25 | 26 | During the fine-tuning phase, I used special labels for conditional training of the model: 27 | - _QUOTE_ for any short quote or fortune, _LONG_ for everything else, 28 | - _CYBER_ for cyber-themed texts and _OTHER_ for the rest. 29 | Each text got two labels: for example, the Cyphernomicon was _LONG_+_CYBER_, Kafka was _LONG_+_OTHER_, and fortune cookie messages were _QUOTE_+_OTHER_. Note that there were almost no texts labeled _QUOTE_+_CYBER_, just a few nerd jokes. 30 | 31 | Finally, in generation mode, I kindly asked the model to generate only _QUOTE_+_CYBER_ texts. 32 | The raw results were already promising enough: 33 | 34 | > terosexuality is pleasures a turn off ; and to me not to be a true blossoming beautiful being is on the other side. the wind is our song, the emotions are our wind and a piano, new things change, new smells kick off in time, a spiritually shifting dust. let your eyes sing music for a while. let your ears measure the bass beat of your soul, the gentle winding of the song. then your ears achieve harmony. you can listen to french playstation on live music together forever, in the philly coffeehouse, in them congressional district of the franklin gap building. let painting melt away every other shred of reason and pain, just lew the paint to move thoughts away from blizzes in death. let it dry out, and turn to cosmic delights, to laugh on the big charms and saxophones and fudatron steames of the sales titanium. we are god's friends, the golden hands on the shoulders of our fears. do you knock my cleaning table over? i snap awake at some dawn. the patrons researching the blues instructor's theories around me, then give me a glass of jim beam. boom! the business group soon concludes. caught one miracle? survive the tedious rituals you refuse to provide? whatever happens, i throw shit in your face. joy ries away? you could give acapindulgent half your life away, though i am nothing especially sexy. this sift, this being sveng? do impotent and desperate oozing drug as i shake and shine? you adored me. brains run out when people charitable that into you. 35 | 36 | Now it was time to do some cleaning. 37 | 38 | ## Heuristic filters 39 | 40 | The next big task was to filter the really good passages out of this endless flow of text. 41 | 42 | First, I made a script with some simple heuristic filters that: 43 | - reject the creation of new, non-existent words, 44 | - reject phrases with two unconnected verbs in a row, 45 | - reject phrases with several duplicated words, 46 | - reject phrases with no punctuation or with too many punctuation marks. 47 | 48 | Applying this script cut the initial text flow into a sequence of valid chunks. 49 |
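A condensed sketch of the kind of checks this involves is shown here; the real script, `simple_cleaner.py`, appears below in full and handles many more special cases (verb-pair exceptions, sentence length, bad endings). This simplified variant assumes NLTK and the whitelist in `vocab.txt`:

```python
from nltk import pos_tag
from nltk.tokenize import word_tokenize

VERB_TAGS = {'VB', 'VBD', 'VBG', 'VBN', 'VBP', 'VBZ'}
PUNCT = {'...', '!', '?', ',', '--', '-', ';', ':', '`', '"', '.'}
vocab = set(line.strip() for line in open('vocab.txt', encoding='utf-8'))

def looks_valid(sentence):
    words = word_tokenize(sentence)
    tags = [tag for _, tag in pos_tag(words)]
    if set(words) - vocab - PUNCT:                      # made-up, non-existent words
        return False
    if any(a == b for a, b in zip(words, words[1:])):   # duplicated words in a row
        return False
    if any(a in VERB_TAGS and b in VERB_TAGS            # two verbs in a row
           for a, b in zip(tags, tags[1:])):            # (no exception list in this sketch)
        return False
    puncts = sum(w in PUNCT for w in words)
    if puncts == 0 or puncts > len(words) // 3:         # no punctuation at all, or far too much
        return False
    return True
```

The example below shows the kind of chunk sequence that survives this sort of filtering.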
50 | > a slave has no more say in his language but he hasn't to speak out! 51 | > 52 | > the doll has a variety of languages, so its feelings have to fill up some time of the day - to - day journals. 53 | > the doll is used only when he remains private. 54 | > and it is always effective. 55 | > 56 | > leave him with his monk - like body. 57 | > 58 | > a little of technique on can be helpful. 59 | > 60 | > out of his passions remain in embarrassment and never wake. 61 | > 62 | > adolescence is the university of manchester. 63 | > the senior class of manchester... the senior class of manchester. 64 | 65 | ## Critic subsystem 66 | 67 | Finally, I trained the Critic subsystem. 68 | This neural network uses the [BERT](https://github.com/google-research/bert) architecture, again in the implementation from [huggingface](https://github.com/huggingface/transformers). Again, I took a publicly available pre-trained model and fine-tuned it on my dataset of 1K labeled chunks to predict the label of any given chunk. 69 | 70 | The chunks were labeled manually with two classes, GOOD/BAD. Most of the labeling was done by a friend of mine, Ivan [@kr0niker](https://www.yamshchikov.info/) Yamshchikov, and some I did myself. We marked a chunk as BAD if it was grammatically incorrect, too boring, or too stupid. Overall, I used approximately 1K labeled chunks, balanced between the classes (one half GOOD, the other half BAD). 71 | 72 | Finally, I built a pipeline that includes the Generator subsystem, the heuristic filters, and the Critic subsystem. 73 | Here is a short sample of the final results: 74 | 75 | > a sudden feeling of austin lemons, a gentle stab of disgust. 76 | > i'm what i'm. 77 | > 78 | > humans whirl in night and distance. 79 | > 80 | > by the wonders of them. 81 | > 82 | > we shall never suffer this. 83 | > if the human race came along tomorrow, none of us would be as wise as they already would have been. 84 | > there is a beginning and an end. 85 | > 86 | > both of our grandparents and brothers are overdue. 87 | > he either can not agree or he can look for someone to blame for his death. 88 | > 89 | > he has reappeared from the world of revenge, revenge, separation, hatred. 90 | > he has ceased all who have offended him. 91 | > 92 | > he is the one who can remember that nothing remotely resembles the trip begun in retrospect. 93 | > what's up? 94 | > 95 | > and i don't want the truth. 96 | > not for an hour. 97 | 98 | [The huge blob of generated text can be found here](https://github.com/altsoph/paranoid_transforner/blob/master/NaNoGenMo_50K_words_sample.txt). 99 | 100 | ## Code overview 101 | 102 | Here is a short description of the scripts in this project: 103 | - gpt1tokenize_trainset.py -- used to tokenize the fine-tuning dataset and add the conditioning labels 104 | - gpt1finetune.py -- used to fine-tune the Generator network on the prepared dataset 105 | - gpt1sample.py -- used to sample texts from the Generator network 106 | 107 | - simple_cleaner.py -- holds the heuristic filters 108 | 109 | - train_classifier.py -- used to train the BERT-based classifier (Critic) 110 | - do_critic.py -- applies the Critic to the samples 111 | - weight_samples.py + cleaner_on_bert_weights.py -- used to filter samples based on Critic scores 112 | 113 | 114 | ## Nervous handwriting 115 | 116 | Since the resulting text strongly reminded me of neurotic/paranoid notes, I decided to lean into this effect and make it even deeper. 117 | 118 | I took an [implementation by Sean Vasquez](https://github.com/sjvasquez/handwriting-synthesis) of the handwriting synthesis experiments from the paper [Generating Sequences with Recurrent Neural Networks by Alex Graves](https://arxiv.org/abs/1308.0850) and patched it a little. Specifically, I used the bias parameter to make the handwriting shakiness depend on the sentiment strength of a given sentence.
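The patch itself is not shown here, but the idea is simple: the sampling bias in Graves-style handwriting synthesis controls how neat the strokes are (a low bias gives messier, shakier writing), so stronger sentiment can be mapped to a lower bias. Here is a sketch of one such mapping, assuming NLTK's VADER as the sentiment scorer and a hypothetical `hand.write(...)` entry point into the patched synthesis code (neither is necessarily what the project actually used):

```python
from nltk.sentiment.vader import SentimentIntensityAnalyzer

analyzer = SentimentIntensityAnalyzer()

def sentence_bias(sentence, calm_bias=1.0, nervous_bias=0.15):
    """Strong positive or negative sentiment -> lower bias -> shakier strokes."""
    strength = abs(analyzer.polarity_scores(sentence)['compound'])  # in [0, 1]
    return calm_bias - strength * (calm_bias - nervous_bias)

# Hypothetical call into the patched handwriting-synthesis code:
# hand.write(filename='page.svg', lines=[sentence], biases=[sentence_bias(sentence)])
```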
119 | 120 | Take a look at the example: 121 | 122 | drawing 123 | 124 | ## Freehand drawings 125 | 126 | At some point, I realized that this diary lacked freehand drawings, so I decided to add some. I used my modification of a [pytorch implementation](https://github.com/alexis-jacq/Pytorch-Sketch-RNN) of [arXiv:1704.03477](https://arxiv.org/abs/1704.03477) trained on the 127 | [Quick, Draw! Dataset](https://github.com/googlecreativelab/quickdraw-dataset). Each time one of the dataset categories appears on a page, I generate a random picture and add it somewhere nearby. 128 | 129 | drawing 130 | 131 | ## Covers and PDF compilation 132 | 133 | I drew some covers and used the [rsvg-convert](https://en.wikipedia.org/wiki/Librsvg) tool to build a PDF file from the separate SVG pages. 134 | 135 | Covers: 136 | 137 | drawing drawing 138 | 139 | The resulting diary (40 MB): 140 | https://github.com/altsoph/paranoid_transformer/raw/master/paranoid_transformer_w_pics.pdf 141 | 142 | ## Papers, publications, releases, links 143 | 144 | * [ICCC 2020 Proceedings, P.146-152](http://computationalcreativity.net/iccc20/wp-content/uploads/2020/09/ICCC20_Proceedings.pdf): Paranoid Transformer. Yana Agafonova, Alexey Tikhonov and Ivan Yamshchikov 145 | * Future Internet Journal: [Paranoid Transformer: Reading Narrative of Madness as Computational Approach to Creativity](https://www.mdpi.com/1999-5903/12/11/182/htm) 146 | * [Pre-order the book](https://deadalivemagazine.com/press/paranoid-transformer.html) 147 | -------------------------------------------------------------------------------- /cleaner_on_bert_weights.py: -------------------------------------------------------------------------------- 1 | import sys 2 | # Usage: python cleaner_on_bert_weights.py <scored_tsv> <output_txt> -- keeps only lines whose Critic score (the 4th tab-separated column) is at least 0.9; separator lines and rejected lines collapse into single blank lines. 3 | ofh = open(sys.argv[2], 'w', encoding='utf-8') 4 | 5 | prev_blank = True 6 | for ln, line in enumerate(open(sys.argv[1], encoding='utf-8')): 7 | text,_,_,score = line.strip().split('\t') 8 | 9 | if text == '----------' or float(score)<0.9: 10 | if not prev_blank: 11 | print(file=ofh) 12 | prev_blank = True 13 | else: 14 | print(text,file=ofh) 15 | prev_blank = False 16 | 17 | ofh.close() -------------------------------------------------------------------------------- /do_critic.py: -------------------------------------------------------------------------------- 1 | # coding=utf-8 2 | # Copyright 2018 The Google AI Language Team Authors and The HuggingFace Inc. team. 3 | # Copyright (c) 2018, NVIDIA CORPORATION. All rights reserved. 4 | # 5 | # Licensed under the Apache License, Version 2.0 (the "License"); 6 | # you may not use this file except in compliance with the License. 7 | # You may obtain a copy of the License at 8 | # 9 | # http://www.apache.org/licenses/LICENSE-2.0 10 | # 11 | # Unless required by applicable law or agreed to in writing, software 12 | # distributed under the License is distributed on an "AS IS" BASIS, 13 | # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. 14 | # See the License for the specific language governing permissions and 15 | # limitations under the License.
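# Note on this file: it appears to be based on the BERT fine-tuning example shipped with the
# old pytorch_pretrained_bert package (see the copyright header above), trimmed down to
# prediction only. It loads the fine-tuned Critic checkpoint from --output_dir, reads the
# candidate chunks from pred.tsv in --data_dir (via ColaProcessor.get_pred_examples), and
# prints one tab-separated line per chunk: the text followed by its two class logits.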
16 | """BERT finetuning runner.""" 17 | 18 | from __future__ import absolute_import, division, print_function 19 | 20 | import argparse 21 | import csv 22 | import logging 23 | import os 24 | import random 25 | import sys 26 | 27 | import numpy as np 28 | import torch 29 | from torch.utils.data import (DataLoader, RandomSampler, SequentialSampler, 30 | TensorDataset) 31 | from torch.utils.data.distributed import DistributedSampler 32 | from tqdm import tqdm, trange 33 | 34 | from torch.nn import CrossEntropyLoss, MSELoss 35 | from scipy.stats import pearsonr, spearmanr 36 | from sklearn.metrics import matthews_corrcoef, f1_score 37 | 38 | from pytorch_pretrained_bert.file_utils import PYTORCH_PRETRAINED_BERT_CACHE, WEIGHTS_NAME, CONFIG_NAME 39 | from pytorch_pretrained_bert.modeling import BertForSequenceClassification, BertConfig 40 | from pytorch_pretrained_bert.tokenization import BertTokenizer 41 | from pytorch_pretrained_bert.optimization import BertAdam, WarmupLinearSchedule 42 | 43 | logger = logging.getLogger(__name__) 44 | 45 | 46 | class InputExample(object): 47 | """A single training/test example for simple sequence classification.""" 48 | 49 | def __init__(self, guid, text_a, text_b=None, label=None): 50 | """Constructs a InputExample. 51 | 52 | Args: 53 | guid: Unique id for the example. 54 | text_a: string. The untokenized text of the first sequence. For single 55 | sequence tasks, only this sequence must be specified. 56 | text_b: (Optional) string. The untokenized text of the second sequence. 57 | Only must be specified for sequence pair tasks. 58 | label: (Optional) string. The label of the example. This should be 59 | specified for train and dev examples, but not for test examples. 60 | """ 61 | self.guid = guid 62 | self.text_a = text_a 63 | self.text_b = text_b 64 | self.label = label 65 | 66 | 67 | class InputFeatures(object): 68 | """A single set of features of data.""" 69 | 70 | def __init__(self, input_ids, input_mask, segment_ids, label_id): 71 | self.input_ids = input_ids 72 | self.input_mask = input_mask 73 | self.segment_ids = segment_ids 74 | self.label_id = label_id 75 | 76 | 77 | class DataProcessor(object): 78 | """Base class for data converters for sequence classification data sets.""" 79 | 80 | def get_train_examples(self, data_dir): 81 | """Gets a collection of `InputExample`s for the train set.""" 82 | raise NotImplementedError() 83 | 84 | def get_dev_examples(self, data_dir): 85 | """Gets a collection of `InputExample`s for the dev set.""" 86 | raise NotImplementedError() 87 | 88 | def get_labels(self): 89 | """Gets the list of labels for this data set.""" 90 | raise NotImplementedError() 91 | 92 | @classmethod 93 | def _read_tsv(cls, input_file, quotechar=None): 94 | """Reads a tab separated value file.""" 95 | with open(input_file, "r", encoding="utf-8") as f: 96 | reader = csv.reader(f, delimiter="\t", quotechar=quotechar) 97 | lines = [] 98 | for line in reader: 99 | if sys.version_info[0] == 2: 100 | line = list(unicode(cell, 'utf-8') for cell in line) 101 | lines.append(line) 102 | return lines 103 | 104 | 105 | 106 | class ColaProcessor(DataProcessor): 107 | """Processor for the CoLA data set (GLUE version).""" 108 | 109 | def get_train_examples(self, data_dir): 110 | """See base class.""" 111 | return self._create_examples( 112 | self._read_tsv(os.path.join(data_dir, "train.tsv")), "train") 113 | 114 | def get_dev_examples(self, data_dir): 115 | """See base class.""" 116 | return self._create_examples( 117 | self._read_tsv(os.path.join(data_dir, 
"dev.tsv")), "dev") 118 | 119 | def get_pred_examples(self, data_dir): 120 | """See base class.""" 121 | lines = self._read_tsv(os.path.join(data_dir, "pred.tsv")) 122 | examples = [] 123 | for (i, line) in enumerate(lines): 124 | if i == 0: 125 | continue 126 | guid = "%s-%s" % ('pred', i) 127 | text_a = line[0] 128 | text_b = None 129 | label = str(i%2) 130 | examples.append( 131 | InputExample(guid=guid, text_a=text_a, text_b=text_b, label=label)) 132 | return examples 133 | 134 | def get_labels(self): 135 | """See base class.""" 136 | return ["0", "1"] 137 | 138 | def _create_examples(self, lines, set_type): 139 | """Creates examples for the training and dev sets.""" 140 | examples = [] 141 | for (i, line) in enumerate(lines): 142 | guid = "%s-%s" % (set_type, i) 143 | text_a = line[3] 144 | label = line[1] 145 | examples.append( 146 | InputExample(guid=guid, text_a=text_a, text_b=None, label=label)) 147 | return examples 148 | 149 | def convert_examples_to_features(examples, label_list, max_seq_length, 150 | tokenizer, output_mode='classification'): 151 | """Loads a data file into a list of `InputBatch`s.""" 152 | 153 | label_map = {label : i for i, label in enumerate(label_list)} 154 | 155 | features = [] 156 | for (ex_index, example) in enumerate(examples): 157 | if ex_index % 10000 == 0: 158 | logger.info("Writing example %d of %d" % (ex_index, len(examples))) 159 | 160 | tokens_a = tokenizer.tokenize(example.text_a) 161 | 162 | tokens_b = None 163 | if example.text_b: 164 | tokens_b = tokenizer.tokenize(example.text_b) 165 | # Modifies `tokens_a` and `tokens_b` in place so that the total 166 | # length is less than the specified length. 167 | # Account for [CLS], [SEP], [SEP] with "- 3" 168 | _truncate_seq_pair(tokens_a, tokens_b, max_seq_length - 3) 169 | else: 170 | # Account for [CLS] and [SEP] with "- 2" 171 | if len(tokens_a) > max_seq_length - 2: 172 | tokens_a = tokens_a[:(max_seq_length - 2)] 173 | 174 | # The convention in BERT is: 175 | # (a) For sequence pairs: 176 | # tokens: [CLS] is this jack ##son ##ville ? [SEP] no it is not . [SEP] 177 | # type_ids: 0 0 0 0 0 0 0 0 1 1 1 1 1 1 178 | # (b) For single sequences: 179 | # tokens: [CLS] the dog is hairy . [SEP] 180 | # type_ids: 0 0 0 0 0 0 0 181 | # 182 | # Where "type_ids" are used to indicate whether this is the first 183 | # sequence or the second sequence. The embedding vectors for `type=0` and 184 | # `type=1` were learned during pre-training and are added to the wordpiece 185 | # embedding vector (and position vector). This is not *strictly* necessary 186 | # since the [SEP] token unambiguously separates the sequences, but it makes 187 | # it easier for the model to learn the concept of sequences. 188 | # 189 | # For classification tasks, the first vector (corresponding to [CLS]) is 190 | # used as as the "sentence vector". Note that this only makes sense because 191 | # the entire model is fine-tuned. 192 | tokens = ["[CLS]"] + tokens_a + ["[SEP]"] 193 | segment_ids = [0] * len(tokens) 194 | 195 | if tokens_b: 196 | tokens += tokens_b + ["[SEP]"] 197 | segment_ids += [1] * (len(tokens_b) + 1) 198 | 199 | input_ids = tokenizer.convert_tokens_to_ids(tokens) 200 | 201 | # The mask has 1 for real tokens and 0 for padding tokens. Only real 202 | # tokens are attended to. 203 | input_mask = [1] * len(input_ids) 204 | 205 | # Zero-pad up to the sequence length. 
206 | padding = [0] * (max_seq_length - len(input_ids)) 207 | input_ids += padding 208 | input_mask += padding 209 | segment_ids += padding 210 | 211 | assert len(input_ids) == max_seq_length 212 | assert len(input_mask) == max_seq_length 213 | assert len(segment_ids) == max_seq_length 214 | 215 | if output_mode == "classification": 216 | label_id = label_map[example.label] 217 | elif output_mode == "regression": 218 | label_id = float(example.label) 219 | else: 220 | raise KeyError(output_mode) 221 | 222 | if ex_index < 5: 223 | logger.info("*** Example ***") 224 | logger.info("guid: %s" % (example.guid)) 225 | logger.info("tokens: %s" % " ".join( 226 | [str(x) for x in tokens])) 227 | logger.info("input_ids: %s" % " ".join([str(x) for x in input_ids])) 228 | logger.info("input_mask: %s" % " ".join([str(x) for x in input_mask])) 229 | logger.info( 230 | "segment_ids: %s" % " ".join([str(x) for x in segment_ids])) 231 | logger.info("label: %s (id = %d)" % (example.label, label_id)) 232 | 233 | features.append( 234 | InputFeatures(input_ids=input_ids, 235 | input_mask=input_mask, 236 | segment_ids=segment_ids, 237 | label_id=label_id)) 238 | return features 239 | 240 | 241 | def _truncate_seq_pair(tokens_a, tokens_b, max_length): 242 | """Truncates a sequence pair in place to the maximum length.""" 243 | 244 | # This is a simple heuristic which will always truncate the longer sequence 245 | # one token at a time. This makes more sense than truncating an equal percent 246 | # of tokens from each, since if one sequence is very short then each token 247 | # that's truncated likely contains more information than a longer sequence. 248 | while True: 249 | total_length = len(tokens_a) + len(tokens_b) 250 | if total_length <= max_length: 251 | break 252 | if len(tokens_a) > len(tokens_b): 253 | tokens_a.pop() 254 | else: 255 | tokens_b.pop() 256 | 257 | 258 | def simple_accuracy(preds, labels): 259 | return (preds == labels).mean() 260 | 261 | 262 | def compute_metrics(task_name, preds, labels): 263 | assert len(preds) == len(labels) 264 | return {"mcc": matthews_corrcoef(labels, preds),"acc": simple_accuracy(preds, labels)} 265 | 266 | 267 | def main(): 268 | parser = argparse.ArgumentParser() 269 | 270 | ## Required parameters 271 | parser.add_argument("--data_dir", 272 | default=None, 273 | type=str, 274 | required=True, 275 | help="The input data dir. Should contain the .tsv files (or other data files) for the task.") 276 | parser.add_argument("--bert_model", default=None, type=str, required=True, 277 | help="Bert pre-trained model selected in the list: bert-base-uncased, " 278 | "bert-large-uncased, bert-base-cased, bert-large-cased, bert-base-multilingual-uncased, " 279 | "bert-base-multilingual-cased, bert-base-chinese.") 280 | parser.add_argument("--output_dir", 281 | default=None, 282 | type=str, 283 | required=True, 284 | help="The output directory where the model predictions and checkpoints will be written.") 285 | 286 | ## Other parameters 287 | parser.add_argument("--cache_dir", 288 | default="", 289 | type=str, 290 | help="Where do you want to store the pre-trained models downloaded from s3") 291 | parser.add_argument("--max_seq_length", 292 | default=128, 293 | type=int, 294 | help="The maximum total input sequence length after WordPiece tokenization. 
\n" 295 | "Sequences longer than this will be truncated, and sequences shorter \n" 296 | "than this will be padded.") 297 | parser.add_argument("--do_eval", 298 | action='store_true', 299 | help="Whether to run eval on the dev set.") 300 | parser.add_argument("--do_lower_case", 301 | action='store_true', 302 | help="Set this flag if you are using an uncased model.") 303 | parser.add_argument("--eval_batch_size", 304 | default=8, 305 | type=int, 306 | help="Total batch size for eval.") 307 | parser.add_argument("--no_cuda", 308 | action='store_true', 309 | help="Whether not to use CUDA when available") 310 | parser.add_argument("--local_rank", 311 | type=int, 312 | default=-1, 313 | help="local_rank for distributed training on gpus") 314 | parser.add_argument('--seed', 315 | type=int, 316 | default=42, 317 | help="random seed for initialization") 318 | 319 | args = parser.parse_args() 320 | 321 | device = torch.device("cuda" if torch.cuda.is_available() and not args.no_cuda else "cpu") 322 | n_gpu = torch.cuda.device_count() 323 | 324 | logging.basicConfig(format = '%(asctime)s - %(levelname)s - %(name)s - %(message)s', 325 | datefmt = '%m/%d/%Y %H:%M:%S', 326 | level = logging.INFO if args.local_rank in [-1, 0] else logging.WARN) 327 | 328 | logger.info("device: {} n_gpu: {}, distributed training: {}, 16-bits training: -".format( 329 | device, n_gpu, bool(args.local_rank != -1) )) 330 | 331 | random.seed(args.seed) 332 | np.random.seed(args.seed) 333 | torch.manual_seed(args.seed) 334 | if n_gpu > 0: 335 | torch.cuda.manual_seed_all(args.seed) 336 | 337 | if not os.path.exists(args.output_dir): 338 | # os.makedirs(args.output_dir) 339 | raise ValueError("No model output dir found.") 340 | 341 | processor = ColaProcessor() # processors[task_name]() 342 | 343 | label_list = processor.get_labels() 344 | num_labels = len(label_list) 345 | 346 | global_step = 0 347 | 348 | model = BertForSequenceClassification.from_pretrained(args.output_dir, num_labels=num_labels) 349 | tokenizer = BertTokenizer.from_pretrained(args.output_dir, do_lower_case=args.do_lower_case) 350 | model.to(device) 351 | 352 | # if args.do_eval and (args.local_rank == -1 or torch.distributed.get_rank() == 0): 353 | eval_examples = processor.get_pred_examples(args.data_dir) 354 | eval_features = convert_examples_to_features( 355 | eval_examples, label_list, args.max_seq_length, tokenizer) # , output_mode) 356 | logger.info("***** Running evaluation *****") 357 | logger.info(" Num examples = %d", len(eval_examples)) 358 | logger.info(" Batch size = %d", args.eval_batch_size) 359 | # print(eval_examples[:10]) 360 | all_input_idxs = torch.tensor([idx for idx,f in enumerate(eval_features)], dtype=torch.long) 361 | all_input_ids = torch.tensor([f.input_ids for f in eval_features], dtype=torch.long) 362 | all_input_mask = torch.tensor([f.input_mask for f in eval_features], dtype=torch.long) 363 | all_segment_ids = torch.tensor([f.segment_ids for f in eval_features], dtype=torch.long) 364 | 365 | all_label_ids = torch.tensor([f.label_id for f in eval_features], dtype=torch.long) 366 | 367 | eval_data = TensorDataset(all_input_ids, all_input_mask, all_segment_ids, all_label_ids, all_input_idxs) 368 | eval_sampler = SequentialSampler(eval_data) 369 | eval_dataloader = DataLoader(eval_data, sampler=eval_sampler, batch_size=args.eval_batch_size) 370 | 371 | model.eval() 372 | # model.predict() 373 | eval_loss = 0 374 | nb_eval_steps = 0 375 | preds = [] 376 | 377 | for input_ids, input_mask, segment_ids, label_ids, text_idxs in 
tqdm(eval_dataloader, desc="Evaluating", disable=True): 378 | input_ids = input_ids.to(device) 379 | input_mask = input_mask.to(device) 380 | segment_ids = segment_ids.to(device) 381 | label_ids = label_ids.to(device) 382 | 383 | with torch.no_grad(): 384 | logits = model(input_ids, segment_ids, input_mask, labels=None) 385 | for idx,logit in zip(list(text_idxs.data),list(logits.data)): 386 | # print(idx,logit) 387 | print("%s\t%f\t%f" % ( eval_examples[idx.item()].text_a,logit[0].item(),logit[1].item()) ) 388 | 389 | 390 | if __name__ == "__main__": 391 | main() 392 | -------------------------------------------------------------------------------- /gpt1finetune.py: -------------------------------------------------------------------------------- 1 | import os 2 | import random 3 | import json 4 | 5 | import nltk 6 | import torch 7 | # from apex import amp 8 | from tqdm import tqdm, trange 9 | from pytorch_pretrained_bert import OpenAIGPTLMHeadModel, OpenAIGPTTokenizer, OpenAIAdam 10 | 11 | SPECIAL_TOKENS = ["", "", "", "", "", "", "",""] 12 | LR = 6.25e-5 13 | MAX_LEN = 500 14 | BATCH_SIZE = 13 15 | 16 | OUTPUT_DIR = "/home/altsoph/current" 17 | random.seed(0xDEADFEED) 18 | 19 | 20 | 21 | device = torch.device("cuda" if torch.cuda.is_available() else "cpu") 22 | n_gpu = torch.cuda.device_count() 23 | model = OpenAIGPTLMHeadModel.from_pretrained('openai-gpt') 24 | tokenizer = OpenAIGPTTokenizer.from_pretrained('openai-gpt') 25 | 26 | tokenizer.set_special_tokens(SPECIAL_TOKENS) 27 | model.set_num_special_tokens(len(SPECIAL_TOKENS)) 28 | model.to(device) 29 | optimizer = OpenAIAdam(model.parameters(), 30 | lr=LR, 31 | warmup=0.002, 32 | max_grad_norm=1, 33 | weight_decay=0.01) 34 | 35 | TAG_TEXT, TAG_META1, TAG_META2, TAG_PAD = tokenizer.convert_tokens_to_ids(("", "", "", "")) 36 | 37 | def pad(x, padding, padding_length): 38 | return x + [padding] * (padding_length - len(x)) 39 | 40 | dataset = [] 41 | for line in open('gpt1_trainset_tokens.tsv'): 42 | chunks = line.strip().split('\t') 43 | tokens = list(map(int,chunks[2].split(','))) 44 | if len(tokens)<8: continue 45 | segments = [TAG_META1, TAG_META2] + [TAG_TEXT for _ in tokens[2:]] 46 | positions = list(range(len(tokens))) 47 | lm_targets = [-1, -1, -1] + tokens[3:] 48 | dataset.append( (len(tokens), tokens, segments, positions, lm_targets) ) 49 | 50 | model.train() 51 | 52 | for epoch in range(10): 53 | exp_average_loss = None 54 | nb_tr_steps = 0 55 | tr_loss = 0 56 | 57 | dataset = list(sorted(dataset,key=lambda x:random.random())) 58 | 59 | tqdm_bar = tqdm(range(0,len(dataset),BATCH_SIZE), desc="Training", mininterval=6.0) 60 | for batch_num,batch_start in enumerate(tqdm_bar): 61 | 62 | batch_raw = dataset[batch_start:batch_start+BATCH_SIZE] 63 | pad_size = max(map(lambda x:x[0],batch_raw)) 64 | 65 | input_words = [] 66 | input_segments = [] 67 | input_targets = [] 68 | 69 | for _,words,segments,_,targets in batch_raw: 70 | input_words.append( pad(words,TAG_PAD,pad_size) ) 71 | input_segments.append( pad(segments,TAG_PAD,pad_size) ) 72 | input_targets.append( pad(targets,-1,pad_size) ) 73 | 74 | input_ids = torch.tensor(input_words, dtype=torch.long) 75 | token_type_ids = torch.tensor(input_segments, dtype=torch.long) 76 | lm_labels = torch.tensor(input_targets, dtype=torch.long) 77 | 78 | loss = model(input_ids.to(device), lm_labels=lm_labels.to(device), token_type_ids=token_type_ids.to(device)) 79 | 80 | loss.backward() 81 | optimizer.step() 82 | optimizer.zero_grad() 83 | tr_loss += loss.item() 84 | exp_average_loss = 
loss.item() if exp_average_loss is None else 0.7*exp_average_loss+0.3*loss.item() 85 | nb_tr_steps += 1 86 | tqdm_bar.desc = "Epoch {:02}, batch {:05}/{:05}. Training loss: {:.2e} lr: {:.2e}".format(epoch, batch_num, len(dataset)//BATCH_SIZE, exp_average_loss, optimizer.get_lr()[0]) 87 | 88 | model_to_save = model.module if hasattr(model, 'module') else model # Only save the model it-self 89 | torch.save(model_to_save.state_dict(), os.path.join(OUTPUT_DIR, "pytorch_model.bin")) 90 | model_to_save.config.to_json_file(os.path.join(OUTPUT_DIR, "config.json")) 91 | tokenizer.save_vocabulary(OUTPUT_DIR) 92 | -------------------------------------------------------------------------------- /gpt1sample.py: -------------------------------------------------------------------------------- 1 | #!/usr/bin/env python3 2 | 3 | import os 4 | import random 5 | import json 6 | 7 | import numpy as np 8 | import nltk 9 | import torch 10 | import torch.nn.functional as F 11 | 12 | # from apex import amp 13 | from tqdm import tqdm, trange 14 | from pytorch_pretrained_bert import OpenAIGPTLMHeadModel, OpenAIGPTTokenizer, OpenAIAdam 15 | 16 | 17 | SAMPLES = 16384 18 | BATCH_SIZE = 32 19 | 20 | MAX_LEN = 500 21 | MODEL_DIR = "/home/altsoph/current" 22 | SEED = 0xDEADFEED 23 | 24 | 25 | def top_k_logits(logits, k): 26 | """ 27 | Masks everything but the k top entries as -infinity (1e10). 28 | Used to mask logits such that e^-infinity -> 0 won't contribute to the 29 | sum of the denominator. 30 | """ 31 | if k == 0: 32 | return logits 33 | else: 34 | values = torch.topk(logits, k)[0] 35 | batch_mins = values[:, -1].view(-1, 1).expand_as(logits) 36 | return torch.where(logits < batch_mins, torch.ones_like(logits) * -1e10, logits) 37 | 38 | def sample_sequence(model, length, segments=None, batch_size=None, context=None, temperature=1, top_k=0, device='cuda', sample=True, text_tag=0): 39 | context = torch.tensor(context, device=device, dtype=torch.long).unsqueeze(0).repeat(batch_size, 1) 40 | segments = torch.tensor(segments, device=device, dtype=torch.long).unsqueeze(0).repeat(batch_size, 1) 41 | text_tag_tpl = torch.tensor([text_tag,], device=device, dtype=torch.long).unsqueeze(0).repeat(batch_size, 1) 42 | 43 | prev = context 44 | output = context 45 | prev_segments = segments 46 | past = None 47 | with torch.no_grad(): 48 | for i in trange(length): 49 | # model(input_ids.to(device), lm_labels=lm_labels.to(device), token_type_ids=token_type_ids.to(device)) 50 | logits = model(output, token_type_ids=prev_segments) 51 | logits = logits[:, -1, :] / temperature 52 | logits = top_k_logits(logits, k=top_k) 53 | log_probs = F.softmax(logits, dim=-1) 54 | if sample: 55 | prev = torch.multinomial(log_probs, num_samples=1) 56 | else: 57 | _, prev = torch.topk(log_probs, k=1, dim=-1) 58 | output = torch.cat((output, prev), dim=1) 59 | prev_segments = torch.cat((prev_segments, text_tag_tpl), dim=1) 60 | return output 61 | 62 | random.seed(SEED) 63 | torch.random.manual_seed(SEED) 64 | torch.cuda.manual_seed(SEED) 65 | 66 | device = torch.device("cuda" if torch.cuda.is_available() else "cpu") 67 | n_gpu = torch.cuda.device_count() 68 | model = OpenAIGPTLMHeadModel.from_pretrained(MODEL_DIR) 69 | tokenizer = OpenAIGPTTokenizer.from_pretrained(MODEL_DIR) 70 | 71 | model.to(device) 72 | 73 | TAG_QUOTES, TAG_CYBER, TAG_TEXT, TAG_META1, TAG_META2, TAG_PAD = tokenizer.convert_tokens_to_ids( 74 | ("", "", "", "", "", "")) 75 | 76 | context_tokens = [TAG_QUOTES, TAG_CYBER] 77 | context_segments = [TAG_META1, TAG_META2] 78 | 79 | 
generated = 0 80 | 81 | for _ in range(SAMPLES // BATCH_SIZE): 82 | out = sample_sequence( 83 | model=model, length=MAX_LEN, 84 | context=context_tokens, 85 | segments=context_segments, 86 | batch_size=BATCH_SIZE, 87 | temperature=1, top_k=0, device=device, 88 | text_tag = TAG_TEXT 89 | ) 90 | out = out[:, len(context_tokens):].tolist() 91 | for i in range(BATCH_SIZE): 92 | generated += 1 93 | text = tokenizer.decode(out[i]) 94 | print("=" * 35 + " SAMPLE " + str(generated) + " " + "=" * (36-len(str(generated))) ) 95 | print(text) 96 | 97 | -------------------------------------------------------------------------------- /gpt1tokenize_trainset.py: -------------------------------------------------------------------------------- 1 | import random 2 | import nltk 3 | from pytorch_pretrained_bert import OpenAIGPTDoubleHeadsModel, OpenAIGPTTokenizer 4 | 5 | model = OpenAIGPTDoubleHeadsModel.from_pretrained('openai-gpt') 6 | tokenizer = OpenAIGPTTokenizer.from_pretrained('openai-gpt') 7 | 8 | SPECIAL_TOKENS = ["", "", "", "", "", "", ""] 9 | 10 | # We can add these special tokens to the vocabulary and the embeddings of the model: 11 | tokenizer.set_special_tokens(SPECIAL_TOKENS) 12 | model.set_num_special_tokens(len(SPECIAL_TOKENS)) 13 | 14 | MAX_LEN = 500 15 | 16 | dataset = [] 17 | for fn,meta1,meta2 in (('long_cyberpunk.txt','',''),('quotes_cyberpunk.txt','',''), 18 | ('long_others.txt','',''),('quotes_others.txt','','')): 19 | meta_tokens = tokenizer.convert_tokens_to_ids((meta1,meta2)) 20 | for line in open(fn, encoding='utf-8', errors='ignore'): 21 | if not line.strip(): continue 22 | # meta_tokens = tokenizer.encode("%s %s" %(meta1,meta2)) 23 | # segments = tokenizer.convert_tokens_to_ids(segments) 24 | tokens = tokenizer.encode(line.strip()) 25 | if len(tokens)>MAX_LEN: 26 | # print('too long',len(tokens)) 27 | sentences = nltk.sent_tokenize(line.strip()) 28 | # print(sentences) 29 | sentences_tokens = [tokenizer.encode(sentence) for sentence in sentences] 30 | # print(sentences_tokens) 31 | collected = [] 32 | for sentence_tokens in sentences_tokens: 33 | if 0 in sentences_tokens or len(collected)+len(sentence_tokens)>MAX_LEN: 34 | # print(len(collected),collected) 35 | dataset.append( (meta1,meta2,meta_tokens+collected) ) 36 | collected = [] 37 | if len(sentence_tokens)<=MAX_LEN: 38 | collected.extend(sentence_tokens) 39 | if collected: 40 | # print(len(collected),collected) 41 | dataset.append( (meta1,meta2,meta_tokens+collected) ) 42 | # exit() 43 | else: 44 | dataset.append( (meta1,meta2,meta_tokens+tokens) ) 45 | for m1,m2,token_ids in dataset: 46 | print("%s\t%s\t%s" % (m1,m2,",".join(map(str,token_ids)))) 47 | -------------------------------------------------------------------------------- /handwriting.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/altsoph/paranoid_transformer/86d77697a4dd8c61e9552a8e03fab2b01ad1103e/handwriting.png -------------------------------------------------------------------------------- /paranoid_transformer.pdf: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/altsoph/paranoid_transformer/86d77697a4dd8c61e9552a8e03fab2b01ad1103e/paranoid_transformer.pdf -------------------------------------------------------------------------------- /paranoid_transformer.png: -------------------------------------------------------------------------------- 
https://raw.githubusercontent.com/altsoph/paranoid_transformer/86d77697a4dd8c61e9552a8e03fab2b01ad1103e/paranoid_transformer.png -------------------------------------------------------------------------------- /paranoid_transformer_back.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/altsoph/paranoid_transformer/86d77697a4dd8c61e9552a8e03fab2b01ad1103e/paranoid_transformer_back.png -------------------------------------------------------------------------------- /paranoid_transformer_w_pics.pdf: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/altsoph/paranoid_transformer/86d77697a4dd8c61e9552a8e03fab2b01ad1103e/paranoid_transformer_w_pics.pdf -------------------------------------------------------------------------------- /pics_samples.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/altsoph/paranoid_transformer/86d77697a4dd8c61e9552a8e03fab2b01ad1103e/pics_samples.png -------------------------------------------------------------------------------- /simple_cleaner.py: -------------------------------------------------------------------------------- 1 | import sys 2 | from nltk.tokenize import sent_tokenize, word_tokenize 3 | from collections import defaultdict 4 | from nltk import pos_tag 5 | 6 | vocab = set() 7 | for line in open("vocab.txt", encoding='utf-8'): 8 | vocab.add( line.strip() ) 9 | 10 | outfh = open(sys.argv[2], "w", encoding='utf-8') 11 | 12 | lines_cnt = sentences_cnt = 0 13 | cases = defaultdict(int) 14 | for ln, line in enumerate(open(sys.argv[1], encoding='utf-8')): 15 | if line[0] == '=': continue 16 | lines_cnt += 1 17 | 18 | sents = sent_tokenize(line) 19 | sentences_cnt += len(sents) 20 | print(file=outfh) 21 | for sent in sents: 22 | words = word_tokenize(sent) 23 | tmp = sent.replace('#',' ').replace('...','#').replace('!','#').replace('?','#').replace(',','#').replace('--','#').replace('-','#').replace(';','#')\ 24 | .replace(':','#').replace('`','#').replace('"','#').replace('.','#').replace(' ','') 25 | 26 | no_punct = [] 27 | size = 0 28 | for npidx,w in enumerate(words): 29 | if w in ('...','!','?',',','--','-',';',':','`','"','.'): 30 | no_punct.append(size) 31 | size = 0 32 | else: 33 | size += 1 34 | no_punct.append(size) 35 | 36 | pos = pos_tag(words) 37 | skip = False 38 | for idx,(w,p) in enumerate(pos[:-1]): 39 | # VB verb, base form take 40 | # VBD verb, past tense took 41 | # VBG verb, gerund/present participle taking 42 | # VBN verb, past participle taken 43 | # VBP verb, sing. present, non-3d take 44 | # VBZ verb, 3rd person sing. 
present takes 45 | if p in ('VB','VBD','VBG','VBN','VBP','VBZ') and pos[idx+1][1] in ('VB','VBD','VBG','VBN','VBP','VBZ'): 46 | if p == 'VBD' and pos[idx+1][1] == 'VBN': continue 47 | if p == 'VB' and pos[idx+1][1] == 'VBN': continue 48 | if p == 'VBP' and pos[idx+1][1] == 'VBN': continue 49 | if p == 'VBZ' and pos[idx+1][1] == 'VBN': continue 50 | if w == 'been' and pos[idx+1][1] == 'VBN': continue 51 | if w in ('be','was','are','is',"'re","'s","been","have") and pos[idx+1][1] == 'VBG': continue 52 | if w == 'i': continue 53 | # print('VERB', (w,p), pos[idx+1], sent) 54 | cases['verbverb'] += 1 55 | skip = True 56 | break 57 | # it's bad if several verbs in a row 58 | if set(words)-vocab: 59 | cases['new_word'] += 1 60 | skip = True 61 | elif max(no_punct)>25: 62 | cases['no_punct'] += 1 63 | skip = True 64 | elif len(words)>=60: 65 | cases['to_long'] += 1 66 | skip = True 67 | elif "###" in tmp: 68 | cases['manypuncts'] += 1 69 | skip = True 70 | for idx,w in enumerate(words[:-1]): 71 | if w == words[idx+1]: 72 | cases['duplicate_words'] += 1 73 | skip = True 74 | break 75 | if sent[-1] not in '.!?': 76 | cases['badend'] += 1 77 | skip = True 78 | 79 | if skip: 80 | print(file=outfh) 81 | else: 82 | print(sent,file=outfh) 83 | 84 | print(lines_cnt, sentences_cnt, cases.items()) 85 | 86 | outfh.close() 87 | -------------------------------------------------------------------------------- /train_classifier.py: -------------------------------------------------------------------------------- 1 | # coding=utf-8 2 | # Copyright 2018 The Google AI Language Team Authors and The HuggingFace Inc. team. 3 | # Copyright (c) 2018, NVIDIA CORPORATION. All rights reserved. 4 | # 5 | # Licensed under the Apache License, Version 2.0 (the "License"); 6 | # you may not use this file except in compliance with the License. 7 | # You may obtain a copy of the License at 8 | # 9 | # http://www.apache.org/licenses/LICENSE-2.0 10 | # 11 | # Unless required by applicable law or agreed to in writing, software 12 | # distributed under the License is distributed on an "AS IS" BASIS, 13 | # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. 14 | # See the License for the specific language governing permissions and 15 | # limitations under the License. 16 | """BERT finetuning runner.""" 17 | 18 | from __future__ import absolute_import, division, print_function 19 | 20 | import argparse 21 | import csv 22 | import logging 23 | import os 24 | import random 25 | import sys 26 | 27 | import numpy as np 28 | import torch 29 | from torch.utils.data import (DataLoader, RandomSampler, SequentialSampler, 30 | TensorDataset) 31 | from torch.utils.data.distributed import DistributedSampler 32 | from tqdm import tqdm, trange 33 | 34 | from torch.nn import CrossEntropyLoss, MSELoss 35 | from scipy.stats import pearsonr, spearmanr 36 | from sklearn.metrics import matthews_corrcoef, f1_score 37 | 38 | from pytorch_pretrained_bert.file_utils import PYTORCH_PRETRAINED_BERT_CACHE, WEIGHTS_NAME, CONFIG_NAME 39 | from pytorch_pretrained_bert.modeling import BertForSequenceClassification, BertConfig 40 | from pytorch_pretrained_bert.tokenization import BertTokenizer 41 | from pytorch_pretrained_bert.optimization import BertAdam, WarmupLinearSchedule 42 | 43 | logger = logging.getLogger(__name__) 44 | 45 | 46 | class InputExample(object): 47 | """A single training/test example for simple sequence classification.""" 48 | 49 | def __init__(self, guid, text_a, text_b=None, label=None): 50 | """Constructs a InputExample. 
51 | 52 | Args: 53 | guid: Unique id for the example. 54 | text_a: string. The untokenized text of the first sequence. For single 55 | sequence tasks, only this sequence must be specified. 56 | text_b: (Optional) string. The untokenized text of the second sequence. 57 | Only must be specified for sequence pair tasks. 58 | label: (Optional) string. The label of the example. This should be 59 | specified for train and dev examples, but not for test examples. 60 | """ 61 | self.guid = guid 62 | self.text_a = text_a 63 | self.text_b = text_b 64 | self.label = label 65 | 66 | 67 | class InputFeatures(object): 68 | """A single set of features of data.""" 69 | 70 | def __init__(self, input_ids, input_mask, segment_ids, label_id): 71 | self.input_ids = input_ids 72 | self.input_mask = input_mask 73 | self.segment_ids = segment_ids 74 | self.label_id = label_id 75 | 76 | 77 | class DataProcessor(object): 78 | """Base class for data converters for sequence classification data sets.""" 79 | 80 | def get_train_examples(self, data_dir): 81 | """Gets a collection of `InputExample`s for the train set.""" 82 | raise NotImplementedError() 83 | 84 | def get_dev_examples(self, data_dir): 85 | """Gets a collection of `InputExample`s for the dev set.""" 86 | raise NotImplementedError() 87 | 88 | def get_labels(self): 89 | """Gets the list of labels for this data set.""" 90 | raise NotImplementedError() 91 | 92 | @classmethod 93 | def _read_tsv(cls, input_file, quotechar=None): 94 | """Reads a tab separated value file.""" 95 | with open(input_file, "r", encoding="utf-8") as f: 96 | reader = csv.reader(f, delimiter="\t", quotechar=quotechar) 97 | lines = [] 98 | for line in reader: 99 | if sys.version_info[0] == 2: 100 | line = list(unicode(cell, 'utf-8') for cell in line) 101 | lines.append(line) 102 | return lines 103 | 104 | 105 | class MrpcProcessor(DataProcessor): 106 | """Processor for the MRPC data set (GLUE version).""" 107 | 108 | def get_train_examples(self, data_dir): 109 | """See base class.""" 110 | logger.info("LOOKING AT {}".format(os.path.join(data_dir, "train.tsv"))) 111 | return self._create_examples( 112 | self._read_tsv(os.path.join(data_dir, "train.tsv")), "train") 113 | 114 | def get_dev_examples(self, data_dir): 115 | """See base class.""" 116 | return self._create_examples( 117 | self._read_tsv(os.path.join(data_dir, "dev.tsv")), "dev") 118 | 119 | def get_labels(self): 120 | """See base class.""" 121 | return ["0", "1"] 122 | 123 | def _create_examples(self, lines, set_type): 124 | """Creates examples for the training and dev sets.""" 125 | examples = [] 126 | for (i, line) in enumerate(lines): 127 | if i == 0: 128 | continue 129 | guid = "%s-%s" % (set_type, i) 130 | text_a = line[3] 131 | text_b = line[4] 132 | label = line[0] 133 | examples.append( 134 | InputExample(guid=guid, text_a=text_a, text_b=text_b, label=label)) 135 | return examples 136 | 137 | 138 | class MnliProcessor(DataProcessor): 139 | """Processor for the MultiNLI data set (GLUE version).""" 140 | 141 | def get_train_examples(self, data_dir): 142 | """See base class.""" 143 | return self._create_examples( 144 | self._read_tsv(os.path.join(data_dir, "train.tsv")), "train") 145 | 146 | def get_dev_examples(self, data_dir): 147 | """See base class.""" 148 | return self._create_examples( 149 | self._read_tsv(os.path.join(data_dir, "dev_matched.tsv")), 150 | "dev_matched") 151 | 152 | def get_labels(self): 153 | """See base class.""" 154 | return ["contradiction", "entailment", "neutral"] 155 | 156 | def 
_create_examples(self, lines, set_type): 157 | """Creates examples for the training and dev sets.""" 158 | examples = [] 159 | for (i, line) in enumerate(lines): 160 | if i == 0: 161 | continue 162 | guid = "%s-%s" % (set_type, line[0]) 163 | text_a = line[8] 164 | text_b = line[9] 165 | label = line[-1] 166 | examples.append( 167 | InputExample(guid=guid, text_a=text_a, text_b=text_b, label=label)) 168 | return examples 169 | 170 | 171 | class MnliMismatchedProcessor(MnliProcessor): 172 | """Processor for the MultiNLI Mismatched data set (GLUE version).""" 173 | 174 | def get_dev_examples(self, data_dir): 175 | """See base class.""" 176 | return self._create_examples( 177 | self._read_tsv(os.path.join(data_dir, "dev_mismatched.tsv")), 178 | "dev_matched") 179 | 180 | 181 | class ColaProcessor(DataProcessor): 182 | """Processor for the CoLA data set (GLUE version).""" 183 | 184 | def get_train_examples(self, data_dir): 185 | """See base class.""" 186 | return self._create_examples( 187 | self._read_tsv(os.path.join(data_dir, "train.tsv")), "train") 188 | 189 | def get_dev_examples(self, data_dir): 190 | """See base class.""" 191 | return self._create_examples( 192 | self._read_tsv(os.path.join(data_dir, "dev.tsv")), "dev") 193 | 194 | def get_labels(self): 195 | """See base class.""" 196 | return ["0", "1"] 197 | 198 | def _create_examples(self, lines, set_type): 199 | """Creates examples for the training and dev sets.""" 200 | examples = [] 201 | for (i, line) in enumerate(lines): 202 | guid = "%s-%s" % (set_type, i) 203 | text_a = line[3] 204 | label = line[1] 205 | examples.append( 206 | InputExample(guid=guid, text_a=text_a, text_b=None, label=label)) 207 | return examples 208 | 209 | class OnionProcessor(ColaProcessor): 210 | def _create_examples(self, lines, set_type): 211 | """Creates examples for the training and dev sets.""" 212 | examples = [] 213 | for (i, line) in enumerate(lines): 214 | left, right = line[0], line[1] 215 | target = 0 216 | guid = str(i) 217 | 218 | # tmp_right = right.split() 219 | # np.random.shuffle(tmp_right) 220 | # right = " ".join(tmp_right) 221 | if np.random.rand() > 0.5: 222 | left, right = right, left 223 | target = 1 224 | 225 | # text_a = tokenization.convert_to_unicode(left) 226 | # text_b = tokenization.convert_to_unicode(right) 227 | text_a = left 228 | text_b = right 229 | label = str(target) 230 | examples.append( 231 | InputExample(guid=guid, text_a=text_a, text_b=text_b, label=label)) 232 | 233 | return examples 234 | 235 | class Sst2Processor(DataProcessor): 236 | """Processor for the SST-2 data set (GLUE version).""" 237 | 238 | def get_train_examples(self, data_dir): 239 | """See base class.""" 240 | return self._create_examples( 241 | self._read_tsv(os.path.join(data_dir, "train.tsv")), "train") 242 | 243 | def get_dev_examples(self, data_dir): 244 | """See base class.""" 245 | return self._create_examples( 246 | self._read_tsv(os.path.join(data_dir, "dev.tsv")), "dev") 247 | 248 | def get_labels(self): 249 | """See base class.""" 250 | return ["0", "1"] 251 | 252 | def _create_examples(self, lines, set_type): 253 | """Creates examples for the training and dev sets.""" 254 | examples = [] 255 | for (i, line) in enumerate(lines): 256 | if i == 0: 257 | continue 258 | guid = "%s-%s" % (set_type, i) 259 | text_a = line[0] 260 | label = line[1] 261 | examples.append( 262 | InputExample(guid=guid, text_a=text_a, text_b=None, label=label)) 263 | return examples 264 | 265 | 266 | class StsbProcessor(DataProcessor): 267 | """Processor for the 
STS-B data set (GLUE version).""" 268 | 269 | def get_train_examples(self, data_dir): 270 | """See base class.""" 271 | return self._create_examples( 272 | self._read_tsv(os.path.join(data_dir, "train.tsv")), "train") 273 | 274 | def get_dev_examples(self, data_dir): 275 | """See base class.""" 276 | return self._create_examples( 277 | self._read_tsv(os.path.join(data_dir, "dev.tsv")), "dev") 278 | 279 | def get_labels(self): 280 | """See base class.""" 281 | return [None] 282 | 283 | def _create_examples(self, lines, set_type): 284 | """Creates examples for the training and dev sets.""" 285 | examples = [] 286 | for (i, line) in enumerate(lines): 287 | if i == 0: 288 | continue 289 | guid = "%s-%s" % (set_type, line[0]) 290 | text_a = line[7] 291 | text_b = line[8] 292 | label = line[-1] 293 | examples.append( 294 | InputExample(guid=guid, text_a=text_a, text_b=text_b, label=label)) 295 | return examples 296 | 297 | 298 | class QqpProcessor(DataProcessor): 299 | """Processor for the STS-B data set (GLUE version).""" 300 | 301 | def get_train_examples(self, data_dir): 302 | """See base class.""" 303 | return self._create_examples( 304 | self._read_tsv(os.path.join(data_dir, "train.tsv")), "train") 305 | 306 | def get_dev_examples(self, data_dir): 307 | """See base class.""" 308 | return self._create_examples( 309 | self._read_tsv(os.path.join(data_dir, "dev.tsv")), "dev") 310 | 311 | def get_labels(self): 312 | """See base class.""" 313 | return ["0", "1"] 314 | 315 | def _create_examples(self, lines, set_type): 316 | """Creates examples for the training and dev sets.""" 317 | examples = [] 318 | for (i, line) in enumerate(lines): 319 | if i == 0: 320 | continue 321 | guid = "%s-%s" % (set_type, line[0]) 322 | try: 323 | text_a = line[3] 324 | text_b = line[4] 325 | label = line[5] 326 | except IndexError: 327 | continue 328 | examples.append( 329 | InputExample(guid=guid, text_a=text_a, text_b=text_b, label=label)) 330 | return examples 331 | 332 | 333 | class QnliProcessor(DataProcessor): 334 | """Processor for the STS-B data set (GLUE version).""" 335 | 336 | def get_train_examples(self, data_dir): 337 | """See base class.""" 338 | return self._create_examples( 339 | self._read_tsv(os.path.join(data_dir, "train.tsv")), "train") 340 | 341 | def get_dev_examples(self, data_dir): 342 | """See base class.""" 343 | return self._create_examples( 344 | self._read_tsv(os.path.join(data_dir, "dev.tsv")), 345 | "dev_matched") 346 | 347 | def get_labels(self): 348 | """See base class.""" 349 | return ["entailment", "not_entailment"] 350 | 351 | def _create_examples(self, lines, set_type): 352 | """Creates examples for the training and dev sets.""" 353 | examples = [] 354 | for (i, line) in enumerate(lines): 355 | if i == 0: 356 | continue 357 | guid = "%s-%s" % (set_type, line[0]) 358 | text_a = line[1] 359 | text_b = line[2] 360 | label = line[-1] 361 | examples.append( 362 | InputExample(guid=guid, text_a=text_a, text_b=text_b, label=label)) 363 | return examples 364 | 365 | 366 | class RteProcessor(DataProcessor): 367 | """Processor for the RTE data set (GLUE version).""" 368 | 369 | def get_train_examples(self, data_dir): 370 | """See base class.""" 371 | return self._create_examples( 372 | self._read_tsv(os.path.join(data_dir, "train.tsv")), "train") 373 | 374 | def get_dev_examples(self, data_dir): 375 | """See base class.""" 376 | return self._create_examples( 377 | self._read_tsv(os.path.join(data_dir, "dev.tsv")), "dev") 378 | 379 | def get_labels(self): 380 | """See base class.""" 381 | 
return ["entailment", "not_entailment"] 382 | 383 | def _create_examples(self, lines, set_type): 384 | """Creates examples for the training and dev sets.""" 385 | examples = [] 386 | for (i, line) in enumerate(lines): 387 | if i == 0: 388 | continue 389 | guid = "%s-%s" % (set_type, line[0]) 390 | text_a = line[1] 391 | text_b = line[2] 392 | label = line[-1] 393 | examples.append( 394 | InputExample(guid=guid, text_a=text_a, text_b=text_b, label=label)) 395 | return examples 396 | 397 | 398 | class WnliProcessor(DataProcessor): 399 | """Processor for the WNLI data set (GLUE version).""" 400 | 401 | def get_train_examples(self, data_dir): 402 | """See base class.""" 403 | return self._create_examples( 404 | self._read_tsv(os.path.join(data_dir, "train.tsv")), "train") 405 | 406 | def get_dev_examples(self, data_dir): 407 | """See base class.""" 408 | return self._create_examples( 409 | self._read_tsv(os.path.join(data_dir, "dev.tsv")), "dev") 410 | 411 | def get_labels(self): 412 | """See base class.""" 413 | return ["0", "1"] 414 | 415 | def _create_examples(self, lines, set_type): 416 | """Creates examples for the training and dev sets.""" 417 | examples = [] 418 | for (i, line) in enumerate(lines): 419 | if i == 0: 420 | continue 421 | guid = "%s-%s" % (set_type, line[0]) 422 | text_a = line[1] 423 | text_b = line[2] 424 | label = line[-1] 425 | examples.append( 426 | InputExample(guid=guid, text_a=text_a, text_b=text_b, label=label)) 427 | return examples 428 | 429 | 430 | def convert_examples_to_features(examples, label_list, max_seq_length, 431 | tokenizer, output_mode): 432 | """Loads a data file into a list of `InputBatch`s.""" 433 | 434 | label_map = {label : i for i, label in enumerate(label_list)} 435 | 436 | features = [] 437 | for (ex_index, example) in enumerate(examples): 438 | if ex_index % 10000 == 0: 439 | logger.info("Writing example %d of %d" % (ex_index, len(examples))) 440 | 441 | tokens_a = tokenizer.tokenize(example.text_a) 442 | 443 | tokens_b = None 444 | if example.text_b: 445 | tokens_b = tokenizer.tokenize(example.text_b) 446 | # Modifies `tokens_a` and `tokens_b` in place so that the total 447 | # length is less than the specified length. 448 | # Account for [CLS], [SEP], [SEP] with "- 3" 449 | _truncate_seq_pair(tokens_a, tokens_b, max_seq_length - 3) 450 | else: 451 | # Account for [CLS] and [SEP] with "- 2" 452 | if len(tokens_a) > max_seq_length - 2: 453 | tokens_a = tokens_a[:(max_seq_length - 2)] 454 | 455 | # The convention in BERT is: 456 | # (a) For sequence pairs: 457 | # tokens: [CLS] is this jack ##son ##ville ? [SEP] no it is not . [SEP] 458 | # type_ids: 0 0 0 0 0 0 0 0 1 1 1 1 1 1 459 | # (b) For single sequences: 460 | # tokens: [CLS] the dog is hairy . [SEP] 461 | # type_ids: 0 0 0 0 0 0 0 462 | # 463 | # Where "type_ids" are used to indicate whether this is the first 464 | # sequence or the second sequence. The embedding vectors for `type=0` and 465 | # `type=1` were learned during pre-training and are added to the wordpiece 466 | # embedding vector (and position vector). This is not *strictly* necessary 467 | # since the [SEP] token unambiguously separates the sequences, but it makes 468 | # it easier for the model to learn the concept of sequences. 469 | # 470 | # For classification tasks, the first vector (corresponding to [CLS]) is 471 | # used as as the "sentence vector". Note that this only makes sense because 472 | # the entire model is fine-tuned. 
473 | tokens = ["[CLS]"] + tokens_a + ["[SEP]"] 474 | segment_ids = [0] * len(tokens) 475 | 476 | if tokens_b: 477 | tokens += tokens_b + ["[SEP]"] 478 | segment_ids += [1] * (len(tokens_b) + 1) 479 | 480 | input_ids = tokenizer.convert_tokens_to_ids(tokens) 481 | 482 | # The mask has 1 for real tokens and 0 for padding tokens. Only real 483 | # tokens are attended to. 484 | input_mask = [1] * len(input_ids) 485 | 486 | # Zero-pad up to the sequence length. 487 | padding = [0] * (max_seq_length - len(input_ids)) 488 | input_ids += padding 489 | input_mask += padding 490 | segment_ids += padding 491 | 492 | assert len(input_ids) == max_seq_length 493 | assert len(input_mask) == max_seq_length 494 | assert len(segment_ids) == max_seq_length 495 | 496 | if output_mode == "classification": 497 | label_id = label_map[example.label] 498 | elif output_mode == "regression": 499 | label_id = float(example.label) 500 | else: 501 | raise KeyError(output_mode) 502 | 503 | if ex_index < 5: 504 | logger.info("*** Example ***") 505 | logger.info("guid: %s" % (example.guid)) 506 | logger.info("tokens: %s" % " ".join( 507 | [str(x) for x in tokens])) 508 | logger.info("input_ids: %s" % " ".join([str(x) for x in input_ids])) 509 | logger.info("input_mask: %s" % " ".join([str(x) for x in input_mask])) 510 | logger.info( 511 | "segment_ids: %s" % " ".join([str(x) for x in segment_ids])) 512 | logger.info("label: %s (id = %d)" % (example.label, label_id)) 513 | 514 | features.append( 515 | InputFeatures(input_ids=input_ids, 516 | input_mask=input_mask, 517 | segment_ids=segment_ids, 518 | label_id=label_id)) 519 | return features 520 | 521 | 522 | def _truncate_seq_pair(tokens_a, tokens_b, max_length): 523 | """Truncates a sequence pair in place to the maximum length.""" 524 | 525 | # This is a simple heuristic which will always truncate the longer sequence 526 | # one token at a time. This makes more sense than truncating an equal percent 527 | # of tokens from each, since if one sequence is very short then each token 528 | # that's truncated likely contains more information than a longer sequence. 
529 | while True: 530 | total_length = len(tokens_a) + len(tokens_b) 531 | if total_length <= max_length: 532 | break 533 | if len(tokens_a) > len(tokens_b): 534 | tokens_a.pop() 535 | else: 536 | tokens_b.pop() 537 | 538 | 539 | def simple_accuracy(preds, labels): 540 | return (preds == labels).mean() 541 | 542 | 543 | def acc_and_f1(preds, labels): 544 | acc = simple_accuracy(preds, labels) 545 | f1 = f1_score(y_true=labels, y_pred=preds) 546 | return { 547 | "acc": acc, 548 | "f1": f1, 549 | "acc_and_f1": (acc + f1) / 2, 550 | } 551 | 552 | 553 | def pearson_and_spearman(preds, labels): 554 | pearson_corr = pearsonr(preds, labels)[0] 555 | spearman_corr = spearmanr(preds, labels)[0] 556 | return { 557 | "pearson": pearson_corr, 558 | "spearmanr": spearman_corr, 559 | "corr": (pearson_corr + spearman_corr) / 2, 560 | } 561 | 562 | 563 | def compute_metrics(task_name, preds, labels): 564 | assert len(preds) == len(labels) 565 | if task_name == "cola": 566 | return {"mcc": matthews_corrcoef(labels, preds),"acc": simple_accuracy(preds, labels)} 567 | if task_name == "onion": 568 | return {"mcc": matthews_corrcoef(labels, preds),"acc": simple_accuracy(preds, labels)} 569 | elif task_name == "sst-2": 570 | return {"acc": simple_accuracy(preds, labels)} 571 | elif task_name == "mrpc": 572 | return acc_and_f1(preds, labels) 573 | elif task_name == "sts-b": 574 | return pearson_and_spearman(preds, labels) 575 | elif task_name == "qqp": 576 | return acc_and_f1(preds, labels) 577 | elif task_name == "mnli": 578 | return {"acc": simple_accuracy(preds, labels)} 579 | elif task_name == "mnli-mm": 580 | return {"acc": simple_accuracy(preds, labels)} 581 | elif task_name == "qnli": 582 | return {"acc": simple_accuracy(preds, labels)} 583 | elif task_name == "rte": 584 | return {"acc": simple_accuracy(preds, labels)} 585 | elif task_name == "wnli": 586 | return {"acc": simple_accuracy(preds, labels)} 587 | else: 588 | raise KeyError(task_name) 589 | 590 | 591 | def main(): 592 | parser = argparse.ArgumentParser() 593 | 594 | ## Required parameters 595 | parser.add_argument("--data_dir", 596 | default=None, 597 | type=str, 598 | required=True, 599 | help="The input data dir. Should contain the .tsv files (or other data files) for the task.") 600 | parser.add_argument("--bert_model", default=None, type=str, required=True, 601 | help="Bert pre-trained model selected in the list: bert-base-uncased, " 602 | "bert-large-uncased, bert-base-cased, bert-large-cased, bert-base-multilingual-uncased, " 603 | "bert-base-multilingual-cased, bert-base-chinese.") 604 | parser.add_argument("--task_name", 605 | default=None, 606 | type=str, 607 | required=True, 608 | help="The name of the task to train.") 609 | parser.add_argument("--output_dir", 610 | default=None, 611 | type=str, 612 | required=True, 613 | help="The output directory where the model predictions and checkpoints will be written.") 614 | 615 | ## Other parameters 616 | parser.add_argument("--cache_dir", 617 | default="", 618 | type=str, 619 | help="Where do you want to store the pre-trained models downloaded from s3") 620 | parser.add_argument("--max_seq_length", 621 | default=128, 622 | type=int, 623 | help="The maximum total input sequence length after WordPiece tokenization. 
\n" 624 | "Sequences longer than this will be truncated, and sequences shorter \n" 625 | "than this will be padded.") 626 | parser.add_argument("--do_train", 627 | action='store_true', 628 | help="Whether to run training.") 629 | parser.add_argument("--do_eval", 630 | action='store_true', 631 | help="Whether to run eval on the dev set.") 632 | parser.add_argument("--do_lower_case", 633 | action='store_true', 634 | help="Set this flag if you are using an uncased model.") 635 | parser.add_argument("--train_batch_size", 636 | default=32, 637 | type=int, 638 | help="Total batch size for training.") 639 | parser.add_argument("--eval_batch_size", 640 | default=8, 641 | type=int, 642 | help="Total batch size for eval.") 643 | parser.add_argument("--learning_rate", 644 | default=5e-5, 645 | type=float, 646 | help="The initial learning rate for Adam.") 647 | parser.add_argument("--num_train_epochs", 648 | default=3.0, 649 | type=float, 650 | help="Total number of training epochs to perform.") 651 | parser.add_argument("--warmup_proportion", 652 | default=0.1, 653 | type=float, 654 | help="Proportion of training to perform linear learning rate warmup for. " 655 | "E.g., 0.1 = 10%% of training.") 656 | parser.add_argument("--no_cuda", 657 | action='store_true', 658 | help="Whether not to use CUDA when available") 659 | parser.add_argument("--local_rank", 660 | type=int, 661 | default=-1, 662 | help="local_rank for distributed training on gpus") 663 | parser.add_argument('--seed', 664 | type=int, 665 | default=42, 666 | help="random seed for initialization") 667 | parser.add_argument('--gradient_accumulation_steps', 668 | type=int, 669 | default=1, 670 | help="Number of updates steps to accumulate before performing a backward/update pass.") 671 | parser.add_argument('--fp16', 672 | action='store_true', 673 | help="Whether to use 16-bit float precision instead of 32-bit") 674 | parser.add_argument('--loss_scale', 675 | type=float, default=0, 676 | help="Loss scaling to improve fp16 numeric stability. 
Only used when fp16 is set to True.\n" 677 | "0 (default value): dynamic loss scaling.\n" 678 | "Positive power of 2: static loss scaling value.\n") 679 | parser.add_argument('--server_ip', type=str, default='', help="Can be used for distant debugging.") 680 | parser.add_argument('--server_port', type=str, default='', help="Can be used for distant debugging.") 681 | args = parser.parse_args() 682 | 683 | if args.server_ip and args.server_port: 684 | # Distant debugging - see https://code.visualstudio.com/docs/python/debugging#_attach-to-a-local-script 685 | import ptvsd 686 | print("Waiting for debugger attach") 687 | ptvsd.enable_attach(address=(args.server_ip, args.server_port), redirect_output=True) 688 | ptvsd.wait_for_attach() 689 | 690 | processors = { 691 | "cola": ColaProcessor, 692 | "onion": OnionProcessor, 693 | "mnli": MnliProcessor, 694 | "mnli-mm": MnliMismatchedProcessor, 695 | "mrpc": MrpcProcessor, 696 | "sst-2": Sst2Processor, 697 | "sts-b": StsbProcessor, 698 | "qqp": QqpProcessor, 699 | "qnli": QnliProcessor, 700 | "rte": RteProcessor, 701 | "wnli": WnliProcessor, 702 | } 703 | 704 | output_modes = { 705 | "cola": "classification", 706 | "onion": "classification", 707 | "mnli": "classification", 708 | "mrpc": "classification", 709 | "sst-2": "classification", 710 | "sts-b": "regression", 711 | "qqp": "classification", 712 | "qnli": "classification", 713 | "rte": "classification", 714 | "wnli": "classification", 715 | } 716 | 717 | if args.local_rank == -1 or args.no_cuda: 718 | device = torch.device("cuda" if torch.cuda.is_available() and not args.no_cuda else "cpu") 719 | n_gpu = torch.cuda.device_count() 720 | else: 721 | torch.cuda.set_device(args.local_rank) 722 | device = torch.device("cuda", args.local_rank) 723 | n_gpu = 1 724 | # Initializes the distributed backend which will take care of synchronizing nodes/GPUs 725 | torch.distributed.init_process_group(backend='nccl') 726 | 727 | logging.basicConfig(format = '%(asctime)s - %(levelname)s - %(name)s - %(message)s', 728 | datefmt = '%m/%d/%Y %H:%M:%S', 729 | level = logging.INFO if args.local_rank in [-1, 0] else logging.WARN) 730 | 731 | logger.info("device: {} n_gpu: {}, distributed training: {}, 16-bits training: {}".format( 732 | device, n_gpu, bool(args.local_rank != -1), args.fp16)) 733 | 734 | if args.gradient_accumulation_steps < 1: 735 | raise ValueError("Invalid gradient_accumulation_steps parameter: {}, should be >= 1".format( 736 | args.gradient_accumulation_steps)) 737 | 738 | args.train_batch_size = args.train_batch_size // args.gradient_accumulation_steps 739 | 740 | random.seed(args.seed) 741 | np.random.seed(args.seed) 742 | torch.manual_seed(args.seed) 743 | if n_gpu > 0: 744 | torch.cuda.manual_seed_all(args.seed) 745 | 746 | if not args.do_train and not args.do_eval: 747 | raise ValueError("At least one of `do_train` or `do_eval` must be True.") 748 | 749 | if os.path.exists(args.output_dir) and os.listdir(args.output_dir) and args.do_train: 750 | raise ValueError("Output directory ({}) already exists and is not empty.".format(args.output_dir)) 751 | if not os.path.exists(args.output_dir): 752 | os.makedirs(args.output_dir) 753 | 754 | task_name = args.task_name.lower() 755 | 756 | if task_name not in processors: 757 | raise ValueError("Task not found: %s" % (task_name)) 758 | 759 | processor = processors[task_name]() 760 | output_mode = output_modes[task_name] 761 | 762 | label_list = processor.get_labels() 763 | num_labels = len(label_list) 764 | 765 | tokenizer = 
BertTokenizer.from_pretrained(args.bert_model, do_lower_case=args.do_lower_case) 766 | 767 | train_examples = None 768 | num_train_optimization_steps = None 769 | if args.do_train: 770 | train_examples = processor.get_train_examples(args.data_dir) 771 | num_train_optimization_steps = int( 772 | len(train_examples) / args.train_batch_size / args.gradient_accumulation_steps) * args.num_train_epochs 773 | if args.local_rank != -1: 774 | num_train_optimization_steps = num_train_optimization_steps // torch.distributed.get_world_size() 775 | 776 | # Prepare model 777 | cache_dir = args.cache_dir if args.cache_dir else os.path.join(str(PYTORCH_PRETRAINED_BERT_CACHE), 'distributed_{}'.format(args.local_rank)) 778 | model = BertForSequenceClassification.from_pretrained(args.bert_model, 779 | cache_dir=cache_dir, 780 | num_labels=num_labels) 781 | if args.fp16: 782 | model.half() 783 | model.to(device) 784 | if args.local_rank != -1: 785 | try: 786 | from apex.parallel import DistributedDataParallel as DDP 787 | except ImportError: 788 | raise ImportError("Please install apex from https://www.github.com/nvidia/apex to use distributed and fp16 training.") 789 | 790 | model = DDP(model) 791 | elif n_gpu > 1: 792 | model = torch.nn.DataParallel(model) 793 | 794 | # Prepare optimizer 795 | param_optimizer = list(model.named_parameters()) 796 | no_decay = ['bias', 'LayerNorm.bias', 'LayerNorm.weight'] 797 | optimizer_grouped_parameters = [ 798 | {'params': [p for n, p in param_optimizer if not any(nd in n for nd in no_decay)], 'weight_decay': 0.01}, 799 | {'params': [p for n, p in param_optimizer if any(nd in n for nd in no_decay)], 'weight_decay': 0.0} 800 | ] 801 | if args.fp16: 802 | try: 803 | from apex.optimizers import FP16_Optimizer 804 | from apex.optimizers import FusedAdam 805 | except ImportError: 806 | raise ImportError("Please install apex from https://www.github.com/nvidia/apex to use distributed and fp16 training.") 807 | 808 | optimizer = FusedAdam(optimizer_grouped_parameters, 809 | lr=args.learning_rate, 810 | bias_correction=False, 811 | max_grad_norm=1.0) 812 | if args.loss_scale == 0: 813 | optimizer = FP16_Optimizer(optimizer, dynamic_loss_scale=True) 814 | else: 815 | optimizer = FP16_Optimizer(optimizer, static_loss_scale=args.loss_scale) 816 | warmup_linear = WarmupLinearSchedule(warmup=args.warmup_proportion, 817 | t_total=num_train_optimization_steps) 818 | 819 | else: 820 | optimizer = BertAdam(optimizer_grouped_parameters, 821 | lr=args.learning_rate, 822 | warmup=args.warmup_proportion, 823 | t_total=num_train_optimization_steps) 824 | 825 | global_step = 0 826 | nb_tr_steps = 0 827 | tr_loss = 0 828 | if args.do_train: 829 | train_features = convert_examples_to_features( 830 | train_examples, label_list, args.max_seq_length, tokenizer, output_mode) 831 | logger.info("***** Running training *****") 832 | logger.info(" Num examples = %d", len(train_examples)) 833 | logger.info(" Batch size = %d", args.train_batch_size) 834 | logger.info(" Num steps = %d", num_train_optimization_steps) 835 | all_input_ids = torch.tensor([f.input_ids for f in train_features], dtype=torch.long) 836 | all_input_mask = torch.tensor([f.input_mask for f in train_features], dtype=torch.long) 837 | all_segment_ids = torch.tensor([f.segment_ids for f in train_features], dtype=torch.long) 838 | 839 | if output_mode == "classification": 840 | all_label_ids = torch.tensor([f.label_id for f in train_features], dtype=torch.long) 841 | elif output_mode == "regression": 842 | all_label_ids = 
torch.tensor([f.label_id for f in train_features], dtype=torch.float) 843 | 844 | train_data = TensorDataset(all_input_ids, all_input_mask, all_segment_ids, all_label_ids) 845 | if args.local_rank == -1: 846 | train_sampler = RandomSampler(train_data) 847 | else: 848 | train_sampler = DistributedSampler(train_data) 849 | train_dataloader = DataLoader(train_data, sampler=train_sampler, batch_size=args.train_batch_size) 850 | 851 | model.train() 852 | for _ in trange(int(args.num_train_epochs), desc="Epoch"): 853 | tr_loss = 0 854 | nb_tr_examples, nb_tr_steps = 0, 0 855 | for step, batch in enumerate(tqdm(train_dataloader, desc="Iteration")): 856 | batch = tuple(t.to(device) for t in batch) 857 | input_ids, input_mask, segment_ids, label_ids = batch 858 | 859 | # define a new function to compute loss values for both output_modes 860 | logits = model(input_ids, segment_ids, input_mask, labels=None) 861 | 862 | if output_mode == "classification": 863 | loss_fct = CrossEntropyLoss() 864 | loss = loss_fct(logits.view(-1, num_labels), label_ids.view(-1)) 865 | elif output_mode == "regression": 866 | loss_fct = MSELoss() 867 | loss = loss_fct(logits.view(-1), label_ids.view(-1)) 868 | 869 | if n_gpu > 1: 870 | loss = loss.mean() # mean() to average on multi-gpu. 871 | if args.gradient_accumulation_steps > 1: 872 | loss = loss / args.gradient_accumulation_steps 873 | 874 | if args.fp16: 875 | optimizer.backward(loss) 876 | else: 877 | loss.backward() 878 | 879 | tr_loss += loss.item() 880 | nb_tr_examples += input_ids.size(0) 881 | nb_tr_steps += 1 882 | if (step + 1) % args.gradient_accumulation_steps == 0: 883 | if args.fp16: 884 | # modify learning rate with special warm up BERT uses 885 | # if args.fp16 is False, BertAdam is used that handles this automatically 886 | lr_this_step = args.learning_rate * warmup_linear.get_lr(global_step/num_train_optimization_steps, 887 | args.warmup_proportion) 888 | for param_group in optimizer.param_groups: 889 | param_group['lr'] = lr_this_step 890 | optimizer.step() 891 | optimizer.zero_grad() 892 | global_step += 1 893 | 894 | if args.do_train and (args.local_rank == -1 or torch.distributed.get_rank() == 0): 895 | # Save a trained model, configuration and tokenizer 896 | model_to_save = model.module if hasattr(model, 'module') else model # Only save the model itself 897 | 898 | # If we save using the predefined names, we can load using `from_pretrained` 899 | output_model_file = os.path.join(args.output_dir, WEIGHTS_NAME) 900 | output_config_file = os.path.join(args.output_dir, CONFIG_NAME) 901 | 902 | torch.save(model_to_save.state_dict(), output_model_file) 903 | model_to_save.config.to_json_file(output_config_file) 904 | tokenizer.save_vocabulary(args.output_dir) 905 | 906 | # Load a trained model and vocabulary that you have fine-tuned 907 | model = BertForSequenceClassification.from_pretrained(args.output_dir, num_labels=num_labels) 908 | tokenizer = BertTokenizer.from_pretrained(args.output_dir, do_lower_case=args.do_lower_case) 909 | else: 910 | model = BertForSequenceClassification.from_pretrained(args.bert_model, num_labels=num_labels) 911 | model.to(device) 912 | 913 | if args.do_eval and (args.local_rank == -1 or torch.distributed.get_rank() == 0): 914 | eval_examples = processor.get_dev_examples(args.data_dir) 915 | eval_features = convert_examples_to_features( 916 | eval_examples, label_list, args.max_seq_length, tokenizer, output_mode) 917 | logger.info("***** Running evaluation *****") 918 | logger.info(" Num examples = %d", 
len(eval_examples)) 919 | logger.info(" Batch size = %d", args.eval_batch_size) 920 | all_input_ids = torch.tensor([f.input_ids for f in eval_features], dtype=torch.long) 921 | all_input_mask = torch.tensor([f.input_mask for f in eval_features], dtype=torch.long) 922 | all_segment_ids = torch.tensor([f.segment_ids for f in eval_features], dtype=torch.long) 923 | 924 | if output_mode == "classification": 925 | all_label_ids = torch.tensor([f.label_id for f in eval_features], dtype=torch.long) 926 | elif output_mode == "regression": 927 | all_label_ids = torch.tensor([f.label_id for f in eval_features], dtype=torch.float) 928 | 929 | eval_data = TensorDataset(all_input_ids, all_input_mask, all_segment_ids, all_label_ids) 930 | # Run prediction for full data 931 | eval_sampler = SequentialSampler(eval_data) 932 | eval_dataloader = DataLoader(eval_data, sampler=eval_sampler, batch_size=args.eval_batch_size) 933 | 934 | model.eval() 935 | eval_loss = 0 936 | nb_eval_steps = 0 937 | preds = [] 938 | 939 | for input_ids, input_mask, segment_ids, label_ids in tqdm(eval_dataloader, desc="Evaluating"): 940 | input_ids = input_ids.to(device) 941 | input_mask = input_mask.to(device) 942 | segment_ids = segment_ids.to(device) 943 | label_ids = label_ids.to(device) 944 | 945 | with torch.no_grad(): 946 | logits = model(input_ids, segment_ids, input_mask, labels=None) 947 | 948 | # create eval loss and other metric required by the task 949 | if output_mode == "classification": 950 | loss_fct = CrossEntropyLoss() 951 | tmp_eval_loss = loss_fct(logits.view(-1, num_labels), label_ids.view(-1)) 952 | elif output_mode == "regression": 953 | loss_fct = MSELoss() 954 | tmp_eval_loss = loss_fct(logits.view(-1), label_ids.view(-1)) 955 | 956 | eval_loss += tmp_eval_loss.mean().item() 957 | nb_eval_steps += 1 958 | if len(preds) == 0: 959 | preds.append(logits.detach().cpu().numpy()) 960 | else: 961 | preds[0] = np.append( 962 | preds[0], logits.detach().cpu().numpy(), axis=0) 963 | 964 | eval_loss = eval_loss / nb_eval_steps 965 | preds = preds[0] 966 | if output_mode == "classification": 967 | preds = np.argmax(preds, axis=1) 968 | elif output_mode == "regression": 969 | preds = np.squeeze(preds) 970 | result = compute_metrics(task_name, preds, all_label_ids.numpy()) 971 | loss = tr_loss/nb_tr_steps if args.do_train else None 972 | 973 | result['eval_loss'] = eval_loss 974 | result['global_step'] = global_step 975 | result['loss'] = loss 976 | 977 | output_eval_file = os.path.join(args.output_dir, "eval_results.txt") 978 | with open(output_eval_file, "w") as writer: 979 | logger.info("***** Eval results *****") 980 | for key in sorted(result.keys()): 981 | logger.info(" %s = %s", key, str(result[key])) 982 | writer.write("%s = %s\n" % (key, str(result[key]))) 983 | 984 | # hack for MNLI-MM 985 | if task_name == "mnli": 986 | task_name = "mnli-mm" 987 | processor = processors[task_name]() 988 | 989 | if os.path.exists(args.output_dir + '-MM') and os.listdir(args.output_dir + '-MM') and args.do_train: 990 | raise ValueError("Output directory ({}) already exists and is not empty.".format(args.output_dir)) 991 | if not os.path.exists(args.output_dir + '-MM'): 992 | os.makedirs(args.output_dir + '-MM') 993 | 994 | eval_examples = processor.get_dev_examples(args.data_dir) 995 | eval_features = convert_examples_to_features( 996 | eval_examples, label_list, args.max_seq_length, tokenizer, output_mode) 997 | logger.info("***** Running evaluation *****") 998 | logger.info(" Num examples = %d", len(eval_examples)) 999 | 
logger.info(" Batch size = %d", args.eval_batch_size) 1000 | all_input_ids = torch.tensor([f.input_ids for f in eval_features], dtype=torch.long) 1001 | all_input_mask = torch.tensor([f.input_mask for f in eval_features], dtype=torch.long) 1002 | all_segment_ids = torch.tensor([f.segment_ids for f in eval_features], dtype=torch.long) 1003 | all_label_ids = torch.tensor([f.label_id for f in eval_features], dtype=torch.long) 1004 | 1005 | eval_data = TensorDataset(all_input_ids, all_input_mask, all_segment_ids, all_label_ids) 1006 | # Run prediction for full data 1007 | eval_sampler = SequentialSampler(eval_data) 1008 | eval_dataloader = DataLoader(eval_data, sampler=eval_sampler, batch_size=args.eval_batch_size) 1009 | 1010 | model.eval() 1011 | eval_loss = 0 1012 | nb_eval_steps = 0 1013 | preds = [] 1014 | 1015 | for input_ids, input_mask, segment_ids, label_ids in tqdm(eval_dataloader, desc="Evaluating"): 1016 | input_ids = input_ids.to(device) 1017 | input_mask = input_mask.to(device) 1018 | segment_ids = segment_ids.to(device) 1019 | label_ids = label_ids.to(device) 1020 | 1021 | with torch.no_grad(): 1022 | logits = model(input_ids, segment_ids, input_mask, labels=None) 1023 | 1024 | loss_fct = CrossEntropyLoss() 1025 | tmp_eval_loss = loss_fct(logits.view(-1, num_labels), label_ids.view(-1)) 1026 | 1027 | eval_loss += tmp_eval_loss.mean().item() 1028 | nb_eval_steps += 1 1029 | if len(preds) == 0: 1030 | preds.append(logits.detach().cpu().numpy()) 1031 | else: 1032 | preds[0] = np.append( 1033 | preds[0], logits.detach().cpu().numpy(), axis=0) 1034 | 1035 | eval_loss = eval_loss / nb_eval_steps 1036 | preds = preds[0] 1037 | preds = np.argmax(preds, axis=1) 1038 | result = compute_metrics(task_name, preds, all_label_ids.numpy()) 1039 | loss = tr_loss/nb_tr_steps if args.do_train else None 1040 | 1041 | result['eval_loss'] = eval_loss 1042 | result['global_step'] = global_step 1043 | result['loss'] = loss 1044 | 1045 | output_eval_file = os.path.join(args.output_dir + '-MM', "eval_results.txt") 1046 | with open(output_eval_file, "w") as writer: 1047 | logger.info("***** Eval results *****") 1048 | for key in sorted(result.keys()): 1049 | logger.info(" %s = %s", key, str(result[key])) 1050 | writer.write("%s = %s\n" % (key, str(result[key]))) 1051 | 1052 | if __name__ == "__main__": 1053 | main() 1054 | -------------------------------------------------------------------------------- /weight_samples.py: -------------------------------------------------------------------------------- 1 | import sys 2 | import math 3 | 4 | collected = [] 5 | 6 | def sm(x): 7 | a = math.exp(x[1]) 8 | b = math.exp(x[2]) 9 | return b/(a+b+1e-8) 10 | 11 | ofh = open(sys.argv[2], 'w', encoding='utf-8') 12 | 13 | for line in open(sys.argv[1], encoding='utf-8'): 14 | chunks = line.strip().split('\t') 15 | chunks[1] = float(chunks[1]) 16 | chunks[2] = float(chunks[2]) 17 | chunks.append( sm(chunks) ) 18 | print("\t".join(map(str,chunks)),file=ofh) 19 | 20 | ofh.close() 21 | --------------------------------------------------------------------------------