├── LICENSE ├── NaNoGenMo_50K_words_sample.txt ├── README.md ├── cleaner_on_bert_weights.py ├── do_critic.py ├── gpt1finetune.py ├── gpt1sample.py ├── gpt1tokenize_trainset.py ├── handwriting.png ├── paranoid_transformer.pdf ├── paranoid_transformer.png ├── paranoid_transformer_back.png ├── paranoid_transformer_w_pics.pdf ├── pics_samples.png ├── simple_cleaner.py ├── train_classifier.py ├── vocab.txt └── weight_samples.py /LICENSE: -------------------------------------------------------------------------------- 1 | MIT License 2 | 3 | Copyright (c) 2020 Aleksey Tikhonov 4 | 5 | Permission is hereby granted, free of charge, to any person obtaining a copy 6 | of this software and associated documentation files (the "Software"), to deal 7 | in the Software without restriction, including without limitation the rights 8 | to use, copy, modify, merge, publish, distribute, sublicense, and/or sell 9 | copies of the Software, and to permit persons to whom the Software is 10 | furnished to do so, subject to the following conditions: 11 | 12 | The above copyright notice and this permission notice shall be included in all 13 | copies or substantial portions of the Software. 14 | 15 | THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR 16 | IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, 17 | FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE 18 | AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER 19 | LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, 20 | OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE 21 | SOFTWARE. 22 | -------------------------------------------------------------------------------- /README.md: -------------------------------------------------------------------------------- 1 | # Paranoid Transformer 2 | 3 | ## TLDR 4 | 5 | In the end, this project turned into a published neural-network-generated book. [Check the story behind it in my Medium post](https://medium.com/altsoph/paranoid-transformer-80a960ddc90a). 6 | 7 | ## Overview 8 | 9 | 10 | This is an attempt to build an unsupervised text generator that produces text with specific stylistic and formal characteristics. 11 | Originally it was published as an entry for [NaNoGenMo 2019](https://github.com/NaNoGenMo/2019/issues/142) (the _National Novel Generation Month_ contest). 12 | 13 | The general idea behind the _Paranoid Transformer_ project is to build a paranoiac-critical system based on two neural networks. 14 | The first network (the _Paranoiac-intrusive Generator_) is a GPT-based, fine-tuned conditional language model, and the second one (the _Critic subsystem_) is a BERT-based classifier that works as a filtering subsystem, selecting the best passages from the flow of generated text. Finally, I used an existing handwriting synthesis neural network implementation to generate a nervous handwritten diary where the degree of shakiness depends on the sentiment strength of a given sentence. 15 | 16 | ## Generator subsystem 17 | 18 | The first network, the Paranoiac-intrusive subsystem AKA the Generator, uses the [OpenAI GPT](https://github.com/openai/finetune-transformer-lm) architecture and the [implementation from huggingface](https://github.com/huggingface/transformers). I took a publicly available model already pre-trained on the huge fiction [BooksCorpus dataset](https://arxiv.org/pdf/1506.06724.pdf) with approximately 10K books and 1B words.
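For reference, here is a minimal sketch of how such a pre-trained GPT-1 model is loaded with the old `pytorch_pretrained_bert` package; it mirrors the setup in `gpt1finetune.py` further down. The special-token names below are placeholders for the conditioning labels described in the next section (the actual list lives in `SPECIAL_TOKENS` inside that script):

```python
import torch
from pytorch_pretrained_bert import OpenAIGPTLMHeadModel, OpenAIGPTTokenizer

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# Publicly available GPT-1 weights pre-trained on BooksCorpus.
tokenizer = OpenAIGPTTokenizer.from_pretrained('openai-gpt')
model = OpenAIGPTLMHeadModel.from_pretrained('openai-gpt')

# Register the conditioning labels as special tokens (placeholder names here).
SPECIAL_TOKENS = ["<quote>", "<long>", "<cyber>", "<other>", "<text>", "<meta1>", "<meta2>", "<pad>"]
tokenizer.set_special_tokens(SPECIAL_TOKENS)
model.set_num_special_tokens(len(SPECIAL_TOKENS))
model.to(device)
```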
19 | 20 | Next, I fine-tuned it on several additional handcrafted text corpora (altogether ~50 MB of text): 21 | - a collection of Crypto Texts (the Crypto Anarchist Manifesto, the Cyphernomicon, etc.), 22 | - another collection of fiction books (from such cyberpunk authors as Dick, Gibson, and others, plus non-cyberpunk authors, for example, Kafka and Rumi), 23 | - transcripts and subtitles from some cyberpunk movies and series, 24 | - several thousand quotes and fortune cookie messages collected from different sources. 25 | 26 | During the fine-tuning phase, I used special labels for conditional training of the model: 27 | - _QUOTE_ for any short quote or fortune, _LONG_ for everything else, 28 | - _CYBER_ for cyber-themed texts and _OTHER_ for the rest. 29 | Each text got two labels: for example, the Cyphernomicon was _LONG_+_CYBER_, Kafka was _LONG_+_OTHER_, and fortune cookie messages were _QUOTE_+_OTHER_. Note that there were almost no texts labeled _QUOTE_+_CYBER_, just a few nerd jokes. 30 | 31 | Finally, in generation mode, I kindly asked the model to generate only _QUOTE_+_CYBER_ texts. 32 | The raw results were already promising enough: 33 | 34 | > terosexuality is pleasures a turn off ; and to me not to be a true blossoming beautiful being is on the other side. the wind is our song, the emotions are our wind and a piano, new things change, new smells kick off in time, a spiritually shifting dust. let your eyes sing music for a while. let your ears measure the bass beat of your soul, the gentle winding of the song. then your ears achieve harmony. you can listen to french playstation on live music together forever, in the philly coffeehouse, in them congressional district of the franklin gap building. let painting melt away every other shred of reason and pain, just lew the paint to move thoughts away from blizzes in death. let it dry out, and turn to cosmic delights, to laugh on the big charms and saxophones and fudatron steames of the sales titanium. we are god's friends, the golden hands on the shoulders of our fears. do you knock my cleaning table over? i snap awake at some dawn. the patrons researching the blues instructor's theories around me, then give me a glass of jim beam. boom! the business group soon concludes. caught one miracle? survive the tedious rituals you refuse to provide? whatever happens, i throw shit in your face. joy ries away? you could give acapindulgent half your life away, though i am nothing especially sexy. this sift, this being sveng? do impotent and desperate oozing drug as i shake and shine? you adored me. brains run out when people charitable that into you. 35 | 36 | Now it was time to do some cleaning. 37 | 38 | ## Heuristic filters 39 | 40 | The next big task was to filter the really good passages out of this endless flow of text. 41 | 42 | First, I made a script with some simple heuristic filters that: 43 | - reject the creation of new, non-existent words, 44 | - reject phrases with two unconnected verbs in a row, 45 | - reject phrases with several duplicated words, 46 | - reject phrases with no punctuation or with too many punctuation marks. 47 | 48 | Applying this script cut the initial text flow into a sequence of valid chunks. 49 |
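A condensed sketch of the kind of checks this involves is shown here; the real script, `simple_cleaner.py`, appears below in full and handles many more special cases (verb-pair exceptions, sentence length, bad endings). This simplified variant assumes NLTK and the whitelist in `vocab.txt`:

```python
from nltk import pos_tag
from nltk.tokenize import word_tokenize

VERB_TAGS = {'VB', 'VBD', 'VBG', 'VBN', 'VBP', 'VBZ'}
PUNCT = {'...', '!', '?', ',', '--', '-', ';', ':', '`', '"', '.'}
vocab = set(line.strip() for line in open('vocab.txt', encoding='utf-8'))

def looks_valid(sentence):
    words = word_tokenize(sentence)
    tags = [tag for _, tag in pos_tag(words)]
    if set(words) - vocab - PUNCT:                      # made-up, non-existent words
        return False
    if any(a == b for a, b in zip(words, words[1:])):   # duplicated words in a row
        return False
    if any(a in VERB_TAGS and b in VERB_TAGS            # two verbs in a row
           for a, b in zip(tags, tags[1:])):            # (no exception list in this sketch)
        return False
    puncts = sum(w in PUNCT for w in words)
    if puncts == 0 or puncts > len(words) // 3:         # no punctuation at all, or far too much
        return False
    return True
```

The example below shows the kind of chunk sequence that survives this sort of filtering.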
50 | > a slave has no more say in his language but he hasn't to speak out! 51 | > 52 | > the doll has a variety of languages, so its feelings have to fill up some time of the day - to - day journals. 53 | > the doll is used only when he remains private. 54 | > and it is always effective. 55 | > 56 | > leave him with his monk - like body. 57 | > 58 | > a little of technique on can be helpful. 59 | > 60 | > out of his passions remain in embarrassment and never wake. 61 | > 62 | > adolescence is the university of manchester. 63 | > the senior class of manchester... the senior class of manchester. 64 | 65 | ## Critic subsystem 66 | 67 | Finally, I trained the Critic subsystem. 68 | This neural network uses the [BERT](https://github.com/google-research/bert) architecture, again in the implementation from [huggingface](https://github.com/huggingface/transformers). Again, I took a publicly available pre-trained model and fine-tuned it on my dataset of 1K labeled chunks to predict the label of any given chunk. 69 | 70 | The chunks were labeled manually with two classes, GOOD/BAD. Most of the labeling was done by a friend of mine, Ivan [@kr0niker](https://www.yamshchikov.info/) Yamshchikov, and some I did myself. We marked a chunk as BAD if it was grammatically incorrect, too boring, or too stupid. Overall, I used approximately 1K labeled chunks, balanced between the classes (one half GOOD, the other half BAD). 71 | 72 | Finally, I built a pipeline that includes the Generator subsystem, the heuristic filters, and the Critic subsystem. 73 | Here is a short sample of the final results: 74 | 75 | > a sudden feeling of austin lemons, a gentle stab of disgust. 76 | > i'm what i'm. 77 | > 78 | > humans whirl in night and distance. 79 | > 80 | > by the wonders of them. 81 | > 82 | > we shall never suffer this. 83 | > if the human race came along tomorrow, none of us would be as wise as they already would have been. 84 | > there is a beginning and an end. 85 | > 86 | > both of our grandparents and brothers are overdue. 87 | > he either can not agree or he can look for someone to blame for his death. 88 | > 89 | > he has reappeared from the world of revenge, revenge, separation, hatred. 90 | > he has ceased all who have offended him. 91 | > 92 | > he is the one who can remember that nothing remotely resembles the trip begun in retrospect. 93 | > what's up? 94 | > 95 | > and i don't want the truth. 96 | > not for an hour. 97 | 98 | [The huge blob of generated text can be found here](https://github.com/altsoph/paranoid_transforner/blob/master/NaNoGenMo_50K_words_sample.txt). 99 | 100 | ## Code overview 101 | 102 | Here is a short description of the scripts in this project: 103 | - gpt1tokenize_trainset.py -- used to tokenize the fine-tuning dataset and add the conditioning labels 104 | - gpt1finetune.py -- used to fine-tune the Generator network on the prepared dataset 105 | - gpt1sample.py -- used to sample texts from the Generator network 106 | 107 | - simple_cleaner.py -- holds the heuristic filters 108 | 109 | - train_classifier.py -- used to train the BERT-based classifier (Critic) 110 | - do_critic.py -- applies the Critic to the samples 111 | - weight_samples.py + cleaner_on_bert_weights.py -- used to filter samples based on Critic scores 112 | 113 | 114 | ## Nervous handwriting 115 | 116 | Since the resulting text strongly reminded me of neurotic/paranoid notes, I decided to lean into this effect and make it even deeper. 117 | 118 | I took an [implementation by Sean Vasquez](https://github.com/sjvasquez/handwriting-synthesis) of the handwriting synthesis experiments from the paper [Generating Sequences with Recurrent Neural Networks by Alex Graves](https://arxiv.org/abs/1308.0850) and patched it a little. Specifically, I used the bias parameter to make the handwriting shakiness depend on the sentiment strength of a given sentence.
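The patch itself is not shown here, but the idea is simple: the sampling bias in Graves-style handwriting synthesis controls how neat the strokes are (a low bias gives messier, shakier writing), so stronger sentiment can be mapped to a lower bias. Here is a sketch of one such mapping, assuming NLTK's VADER as the sentiment scorer and a hypothetical `hand.write(...)` entry point into the patched synthesis code (neither is necessarily what the project actually used):

```python
from nltk.sentiment.vader import SentimentIntensityAnalyzer

analyzer = SentimentIntensityAnalyzer()

def sentence_bias(sentence, calm_bias=1.0, nervous_bias=0.15):
    """Strong positive or negative sentiment -> lower bias -> shakier strokes."""
    strength = abs(analyzer.polarity_scores(sentence)['compound'])  # in [0, 1]
    return calm_bias - strength * (calm_bias - nervous_bias)

# Hypothetical call into the patched handwriting-synthesis code:
# hand.write(filename='page.svg', lines=[sentence], biases=[sentence_bias(sentence)])
```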
119 | 120 | Take a look at the example: 121 | 122 | drawing 123 | 124 | ## Freehand drawings 125 | 126 | At some point, I realized that this diary lacked freehand drawings, so I decided to add some. I used my modification of a [pytorch implementation](https://github.com/alexis-jacq/Pytorch-Sketch-RNN) of [arXiv:1704.03477](https://arxiv.org/abs/1704.03477) trained on the 127 | [Quick, Draw! Dataset](https://github.com/googlecreativelab/quickdraw-dataset). Each time one of the dataset categories appears on a page, I generate a random picture and add it somewhere nearby. 128 | 129 | drawing 130 | 131 | ## Covers and PDF compilation 132 | 133 | I drew some covers and used the [rsvg-convert](https://en.wikipedia.org/wiki/Librsvg) tool to build a PDF file from the separate SVG pages. 134 | 135 | Covers: 136 | 137 | drawing drawing 138 | 139 | The resulting diary (40 MB): 140 | https://github.com/altsoph/paranoid_transformer/raw/master/paranoid_transformer_w_pics.pdf 141 | 142 | ## Papers, publications, releases, links 143 | 144 | * [ICCC 2020 Proceedings, P.146-152](http://computationalcreativity.net/iccc20/wp-content/uploads/2020/09/ICCC20_Proceedings.pdf): Paranoid Transformer. Yana Agafonova, Alexey Tikhonov and Ivan Yamshchikov 145 | * Future Internet Journal: [Paranoid Transformer: Reading Narrative of Madness as Computational Approach to Creativity](https://www.mdpi.com/1999-5903/12/11/182/htm) 146 | * [Pre-order the book](https://deadalivemagazine.com/press/paranoid-transformer.html) 147 | -------------------------------------------------------------------------------- /cleaner_on_bert_weights.py: -------------------------------------------------------------------------------- 1 | import sys 2 | # Usage: python cleaner_on_bert_weights.py <scored_tsv> <output_txt> -- keeps only lines whose Critic score (the 4th tab-separated column) is at least 0.9; separator lines and rejected lines collapse into single blank lines. 3 | ofh = open(sys.argv[2], 'w', encoding='utf-8') 4 | 5 | prev_blank = True 6 | for ln, line in enumerate(open(sys.argv[1], encoding='utf-8')): 7 | text,_,_,score = line.strip().split('\t') 8 | 9 | if text == '----------' or float(score)<0.9: 10 | if not prev_blank: 11 | print(file=ofh) 12 | prev_blank = True 13 | else: 14 | print(text,file=ofh) 15 | prev_blank = False 16 | 17 | ofh.close() -------------------------------------------------------------------------------- /do_critic.py: -------------------------------------------------------------------------------- 1 | # coding=utf-8 2 | # Copyright 2018 The Google AI Language Team Authors and The HuggingFace Inc. team. 3 | # Copyright (c) 2018, NVIDIA CORPORATION. All rights reserved. 4 | # 5 | # Licensed under the Apache License, Version 2.0 (the "License"); 6 | # you may not use this file except in compliance with the License. 7 | # You may obtain a copy of the License at 8 | # 9 | # http://www.apache.org/licenses/LICENSE-2.0 10 | # 11 | # Unless required by applicable law or agreed to in writing, software 12 | # distributed under the License is distributed on an "AS IS" BASIS, 13 | # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. 14 | # See the License for the specific language governing permissions and 15 | # limitations under the License.
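# Note on this file: it appears to be based on the BERT fine-tuning example shipped with the
# old pytorch_pretrained_bert package (see the copyright header above), trimmed down to
# prediction only. It loads the fine-tuned Critic checkpoint from --output_dir, reads the
# candidate chunks from pred.tsv in --data_dir (via ColaProcessor.get_pred_examples), and
# prints one tab-separated line per chunk: the text followed by its two class logits.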
16 | """BERT finetuning runner.""" 17 | 18 | from __future__ import absolute_import, division, print_function 19 | 20 | import argparse 21 | import csv 22 | import logging 23 | import os 24 | import random 25 | import sys 26 | 27 | import numpy as np 28 | import torch 29 | from torch.utils.data import (DataLoader, RandomSampler, SequentialSampler, 30 | TensorDataset) 31 | from torch.utils.data.distributed import DistributedSampler 32 | from tqdm import tqdm, trange 33 | 34 | from torch.nn import CrossEntropyLoss, MSELoss 35 | from scipy.stats import pearsonr, spearmanr 36 | from sklearn.metrics import matthews_corrcoef, f1_score 37 | 38 | from pytorch_pretrained_bert.file_utils import PYTORCH_PRETRAINED_BERT_CACHE, WEIGHTS_NAME, CONFIG_NAME 39 | from pytorch_pretrained_bert.modeling import BertForSequenceClassification, BertConfig 40 | from pytorch_pretrained_bert.tokenization import BertTokenizer 41 | from pytorch_pretrained_bert.optimization import BertAdam, WarmupLinearSchedule 42 | 43 | logger = logging.getLogger(__name__) 44 | 45 | 46 | class InputExample(object): 47 | """A single training/test example for simple sequence classification.""" 48 | 49 | def __init__(self, guid, text_a, text_b=None, label=None): 50 | """Constructs a InputExample. 51 | 52 | Args: 53 | guid: Unique id for the example. 54 | text_a: string. The untokenized text of the first sequence. For single 55 | sequence tasks, only this sequence must be specified. 56 | text_b: (Optional) string. The untokenized text of the second sequence. 57 | Only must be specified for sequence pair tasks. 58 | label: (Optional) string. The label of the example. This should be 59 | specified for train and dev examples, but not for test examples. 60 | """ 61 | self.guid = guid 62 | self.text_a = text_a 63 | self.text_b = text_b 64 | self.label = label 65 | 66 | 67 | class InputFeatures(object): 68 | """A single set of features of data.""" 69 | 70 | def __init__(self, input_ids, input_mask, segment_ids, label_id): 71 | self.input_ids = input_ids 72 | self.input_mask = input_mask 73 | self.segment_ids = segment_ids 74 | self.label_id = label_id 75 | 76 | 77 | class DataProcessor(object): 78 | """Base class for data converters for sequence classification data sets.""" 79 | 80 | def get_train_examples(self, data_dir): 81 | """Gets a collection of `InputExample`s for the train set.""" 82 | raise NotImplementedError() 83 | 84 | def get_dev_examples(self, data_dir): 85 | """Gets a collection of `InputExample`s for the dev set.""" 86 | raise NotImplementedError() 87 | 88 | def get_labels(self): 89 | """Gets the list of labels for this data set.""" 90 | raise NotImplementedError() 91 | 92 | @classmethod 93 | def _read_tsv(cls, input_file, quotechar=None): 94 | """Reads a tab separated value file.""" 95 | with open(input_file, "r", encoding="utf-8") as f: 96 | reader = csv.reader(f, delimiter="\t", quotechar=quotechar) 97 | lines = [] 98 | for line in reader: 99 | if sys.version_info[0] == 2: 100 | line = list(unicode(cell, 'utf-8') for cell in line) 101 | lines.append(line) 102 | return lines 103 | 104 | 105 | 106 | class ColaProcessor(DataProcessor): 107 | """Processor for the CoLA data set (GLUE version).""" 108 | 109 | def get_train_examples(self, data_dir): 110 | """See base class.""" 111 | return self._create_examples( 112 | self._read_tsv(os.path.join(data_dir, "train.tsv")), "train") 113 | 114 | def get_dev_examples(self, data_dir): 115 | """See base class.""" 116 | return self._create_examples( 117 | self._read_tsv(os.path.join(data_dir, 
"dev.tsv")), "dev") 118 | 119 | def get_pred_examples(self, data_dir): 120 | """See base class.""" 121 | lines = self._read_tsv(os.path.join(data_dir, "pred.tsv")) 122 | examples = [] 123 | for (i, line) in enumerate(lines): 124 | if i == 0: 125 | continue 126 | guid = "%s-%s" % ('pred', i) 127 | text_a = line[0] 128 | text_b = None 129 | label = str(i%2) 130 | examples.append( 131 | InputExample(guid=guid, text_a=text_a, text_b=text_b, label=label)) 132 | return examples 133 | 134 | def get_labels(self): 135 | """See base class.""" 136 | return ["0", "1"] 137 | 138 | def _create_examples(self, lines, set_type): 139 | """Creates examples for the training and dev sets.""" 140 | examples = [] 141 | for (i, line) in enumerate(lines): 142 | guid = "%s-%s" % (set_type, i) 143 | text_a = line[3] 144 | label = line[1] 145 | examples.append( 146 | InputExample(guid=guid, text_a=text_a, text_b=None, label=label)) 147 | return examples 148 | 149 | def convert_examples_to_features(examples, label_list, max_seq_length, 150 | tokenizer, output_mode='classification'): 151 | """Loads a data file into a list of `InputBatch`s.""" 152 | 153 | label_map = {label : i for i, label in enumerate(label_list)} 154 | 155 | features = [] 156 | for (ex_index, example) in enumerate(examples): 157 | if ex_index % 10000 == 0: 158 | logger.info("Writing example %d of %d" % (ex_index, len(examples))) 159 | 160 | tokens_a = tokenizer.tokenize(example.text_a) 161 | 162 | tokens_b = None 163 | if example.text_b: 164 | tokens_b = tokenizer.tokenize(example.text_b) 165 | # Modifies `tokens_a` and `tokens_b` in place so that the total 166 | # length is less than the specified length. 167 | # Account for [CLS], [SEP], [SEP] with "- 3" 168 | _truncate_seq_pair(tokens_a, tokens_b, max_seq_length - 3) 169 | else: 170 | # Account for [CLS] and [SEP] with "- 2" 171 | if len(tokens_a) > max_seq_length - 2: 172 | tokens_a = tokens_a[:(max_seq_length - 2)] 173 | 174 | # The convention in BERT is: 175 | # (a) For sequence pairs: 176 | # tokens: [CLS] is this jack ##son ##ville ? [SEP] no it is not . [SEP] 177 | # type_ids: 0 0 0 0 0 0 0 0 1 1 1 1 1 1 178 | # (b) For single sequences: 179 | # tokens: [CLS] the dog is hairy . [SEP] 180 | # type_ids: 0 0 0 0 0 0 0 181 | # 182 | # Where "type_ids" are used to indicate whether this is the first 183 | # sequence or the second sequence. The embedding vectors for `type=0` and 184 | # `type=1` were learned during pre-training and are added to the wordpiece 185 | # embedding vector (and position vector). This is not *strictly* necessary 186 | # since the [SEP] token unambiguously separates the sequences, but it makes 187 | # it easier for the model to learn the concept of sequences. 188 | # 189 | # For classification tasks, the first vector (corresponding to [CLS]) is 190 | # used as as the "sentence vector". Note that this only makes sense because 191 | # the entire model is fine-tuned. 192 | tokens = ["[CLS]"] + tokens_a + ["[SEP]"] 193 | segment_ids = [0] * len(tokens) 194 | 195 | if tokens_b: 196 | tokens += tokens_b + ["[SEP]"] 197 | segment_ids += [1] * (len(tokens_b) + 1) 198 | 199 | input_ids = tokenizer.convert_tokens_to_ids(tokens) 200 | 201 | # The mask has 1 for real tokens and 0 for padding tokens. Only real 202 | # tokens are attended to. 203 | input_mask = [1] * len(input_ids) 204 | 205 | # Zero-pad up to the sequence length. 
206 | padding = [0] * (max_seq_length - len(input_ids)) 207 | input_ids += padding 208 | input_mask += padding 209 | segment_ids += padding 210 | 211 | assert len(input_ids) == max_seq_length 212 | assert len(input_mask) == max_seq_length 213 | assert len(segment_ids) == max_seq_length 214 | 215 | if output_mode == "classification": 216 | label_id = label_map[example.label] 217 | elif output_mode == "regression": 218 | label_id = float(example.label) 219 | else: 220 | raise KeyError(output_mode) 221 | 222 | if ex_index < 5: 223 | logger.info("*** Example ***") 224 | logger.info("guid: %s" % (example.guid)) 225 | logger.info("tokens: %s" % " ".join( 226 | [str(x) for x in tokens])) 227 | logger.info("input_ids: %s" % " ".join([str(x) for x in input_ids])) 228 | logger.info("input_mask: %s" % " ".join([str(x) for x in input_mask])) 229 | logger.info( 230 | "segment_ids: %s" % " ".join([str(x) for x in segment_ids])) 231 | logger.info("label: %s (id = %d)" % (example.label, label_id)) 232 | 233 | features.append( 234 | InputFeatures(input_ids=input_ids, 235 | input_mask=input_mask, 236 | segment_ids=segment_ids, 237 | label_id=label_id)) 238 | return features 239 | 240 | 241 | def _truncate_seq_pair(tokens_a, tokens_b, max_length): 242 | """Truncates a sequence pair in place to the maximum length.""" 243 | 244 | # This is a simple heuristic which will always truncate the longer sequence 245 | # one token at a time. This makes more sense than truncating an equal percent 246 | # of tokens from each, since if one sequence is very short then each token 247 | # that's truncated likely contains more information than a longer sequence. 248 | while True: 249 | total_length = len(tokens_a) + len(tokens_b) 250 | if total_length <= max_length: 251 | break 252 | if len(tokens_a) > len(tokens_b): 253 | tokens_a.pop() 254 | else: 255 | tokens_b.pop() 256 | 257 | 258 | def simple_accuracy(preds, labels): 259 | return (preds == labels).mean() 260 | 261 | 262 | def compute_metrics(task_name, preds, labels): 263 | assert len(preds) == len(labels) 264 | return {"mcc": matthews_corrcoef(labels, preds),"acc": simple_accuracy(preds, labels)} 265 | 266 | 267 | def main(): 268 | parser = argparse.ArgumentParser() 269 | 270 | ## Required parameters 271 | parser.add_argument("--data_dir", 272 | default=None, 273 | type=str, 274 | required=True, 275 | help="The input data dir. Should contain the .tsv files (or other data files) for the task.") 276 | parser.add_argument("--bert_model", default=None, type=str, required=True, 277 | help="Bert pre-trained model selected in the list: bert-base-uncased, " 278 | "bert-large-uncased, bert-base-cased, bert-large-cased, bert-base-multilingual-uncased, " 279 | "bert-base-multilingual-cased, bert-base-chinese.") 280 | parser.add_argument("--output_dir", 281 | default=None, 282 | type=str, 283 | required=True, 284 | help="The output directory where the model predictions and checkpoints will be written.") 285 | 286 | ## Other parameters 287 | parser.add_argument("--cache_dir", 288 | default="", 289 | type=str, 290 | help="Where do you want to store the pre-trained models downloaded from s3") 291 | parser.add_argument("--max_seq_length", 292 | default=128, 293 | type=int, 294 | help="The maximum total input sequence length after WordPiece tokenization. 
\n" 295 | "Sequences longer than this will be truncated, and sequences shorter \n" 296 | "than this will be padded.") 297 | parser.add_argument("--do_eval", 298 | action='store_true', 299 | help="Whether to run eval on the dev set.") 300 | parser.add_argument("--do_lower_case", 301 | action='store_true', 302 | help="Set this flag if you are using an uncased model.") 303 | parser.add_argument("--eval_batch_size", 304 | default=8, 305 | type=int, 306 | help="Total batch size for eval.") 307 | parser.add_argument("--no_cuda", 308 | action='store_true', 309 | help="Whether not to use CUDA when available") 310 | parser.add_argument("--local_rank", 311 | type=int, 312 | default=-1, 313 | help="local_rank for distributed training on gpus") 314 | parser.add_argument('--seed', 315 | type=int, 316 | default=42, 317 | help="random seed for initialization") 318 | 319 | args = parser.parse_args() 320 | 321 | device = torch.device("cuda" if torch.cuda.is_available() and not args.no_cuda else "cpu") 322 | n_gpu = torch.cuda.device_count() 323 | 324 | logging.basicConfig(format = '%(asctime)s - %(levelname)s - %(name)s - %(message)s', 325 | datefmt = '%m/%d/%Y %H:%M:%S', 326 | level = logging.INFO if args.local_rank in [-1, 0] else logging.WARN) 327 | 328 | logger.info("device: {} n_gpu: {}, distributed training: {}, 16-bits training: -".format( 329 | device, n_gpu, bool(args.local_rank != -1) )) 330 | 331 | random.seed(args.seed) 332 | np.random.seed(args.seed) 333 | torch.manual_seed(args.seed) 334 | if n_gpu > 0: 335 | torch.cuda.manual_seed_all(args.seed) 336 | 337 | if not os.path.exists(args.output_dir): 338 | # os.makedirs(args.output_dir) 339 | raise ValueError("No model output dir found.") 340 | 341 | processor = ColaProcessor() # processors[task_name]() 342 | 343 | label_list = processor.get_labels() 344 | num_labels = len(label_list) 345 | 346 | global_step = 0 347 | 348 | model = BertForSequenceClassification.from_pretrained(args.output_dir, num_labels=num_labels) 349 | tokenizer = BertTokenizer.from_pretrained(args.output_dir, do_lower_case=args.do_lower_case) 350 | model.to(device) 351 | 352 | # if args.do_eval and (args.local_rank == -1 or torch.distributed.get_rank() == 0): 353 | eval_examples = processor.get_pred_examples(args.data_dir) 354 | eval_features = convert_examples_to_features( 355 | eval_examples, label_list, args.max_seq_length, tokenizer) # , output_mode) 356 | logger.info("***** Running evaluation *****") 357 | logger.info(" Num examples = %d", len(eval_examples)) 358 | logger.info(" Batch size = %d", args.eval_batch_size) 359 | # print(eval_examples[:10]) 360 | all_input_idxs = torch.tensor([idx for idx,f in enumerate(eval_features)], dtype=torch.long) 361 | all_input_ids = torch.tensor([f.input_ids for f in eval_features], dtype=torch.long) 362 | all_input_mask = torch.tensor([f.input_mask for f in eval_features], dtype=torch.long) 363 | all_segment_ids = torch.tensor([f.segment_ids for f in eval_features], dtype=torch.long) 364 | 365 | all_label_ids = torch.tensor([f.label_id for f in eval_features], dtype=torch.long) 366 | 367 | eval_data = TensorDataset(all_input_ids, all_input_mask, all_segment_ids, all_label_ids, all_input_idxs) 368 | eval_sampler = SequentialSampler(eval_data) 369 | eval_dataloader = DataLoader(eval_data, sampler=eval_sampler, batch_size=args.eval_batch_size) 370 | 371 | model.eval() 372 | # model.predict() 373 | eval_loss = 0 374 | nb_eval_steps = 0 375 | preds = [] 376 | 377 | for input_ids, input_mask, segment_ids, label_ids, text_idxs in 
tqdm(eval_dataloader, desc="Evaluating", disable=True): 378 | input_ids = input_ids.to(device) 379 | input_mask = input_mask.to(device) 380 | segment_ids = segment_ids.to(device) 381 | label_ids = label_ids.to(device) 382 | 383 | with torch.no_grad(): 384 | logits = model(input_ids, segment_ids, input_mask, labels=None) 385 | for idx,logit in zip(list(text_idxs.data),list(logits.data)): 386 | # print(idx,logit) 387 | print("%s\t%f\t%f" % ( eval_examples[idx.item()].text_a,logit[0].item(),logit[1].item()) ) 388 | 389 | 390 | if __name__ == "__main__": 391 | main() 392 | -------------------------------------------------------------------------------- /gpt1finetune.py: -------------------------------------------------------------------------------- 1 | import os 2 | import random 3 | import json 4 | 5 | import nltk 6 | import torch 7 | # from apex import amp 8 | from tqdm import tqdm, trange 9 | from pytorch_pretrained_bert import OpenAIGPTLMHeadModel, OpenAIGPTTokenizer, OpenAIAdam 10 | 11 | SPECIAL_TOKENS = ["", "", "", "", "", "", "",""] 12 | LR = 6.25e-5 13 | MAX_LEN = 500 14 | BATCH_SIZE = 13 15 | 16 | OUTPUT_DIR = "/home/altsoph/current" 17 | random.seed(0xDEADFEED) 18 | 19 | 20 | 21 | device = torch.device("cuda" if torch.cuda.is_available() else "cpu") 22 | n_gpu = torch.cuda.device_count() 23 | model = OpenAIGPTLMHeadModel.from_pretrained('openai-gpt') 24 | tokenizer = OpenAIGPTTokenizer.from_pretrained('openai-gpt') 25 | 26 | tokenizer.set_special_tokens(SPECIAL_TOKENS) 27 | model.set_num_special_tokens(len(SPECIAL_TOKENS)) 28 | model.to(device) 29 | optimizer = OpenAIAdam(model.parameters(), 30 | lr=LR, 31 | warmup=0.002, 32 | max_grad_norm=1, 33 | weight_decay=0.01) 34 | 35 | TAG_TEXT, TAG_META1, TAG_META2, TAG_PAD = tokenizer.convert_tokens_to_ids(("", "", "", "")) 36 | 37 | def pad(x, padding, padding_length): 38 | return x + [padding] * (padding_length - len(x)) 39 | 40 | dataset = [] 41 | for line in open('gpt1_trainset_tokens.tsv'): 42 | chunks = line.strip().split('\t') 43 | tokens = list(map(int,chunks[2].split(','))) 44 | if len(tokens)<8: continue 45 | segments = [TAG_META1, TAG_META2] + [TAG_TEXT for _ in tokens[2:]] 46 | positions = list(range(len(tokens))) 47 | lm_targets = [-1, -1, -1] + tokens[3:] 48 | dataset.append( (len(tokens), tokens, segments, positions, lm_targets) ) 49 | 50 | model.train() 51 | 52 | for epoch in range(10): 53 | exp_average_loss = None 54 | nb_tr_steps = 0 55 | tr_loss = 0 56 | 57 | dataset = list(sorted(dataset,key=lambda x:random.random())) 58 | 59 | tqdm_bar = tqdm(range(0,len(dataset),BATCH_SIZE), desc="Training", mininterval=6.0) 60 | for batch_num,batch_start in enumerate(tqdm_bar): 61 | 62 | batch_raw = dataset[batch_start:batch_start+BATCH_SIZE] 63 | pad_size = max(map(lambda x:x[0],batch_raw)) 64 | 65 | input_words = [] 66 | input_segments = [] 67 | input_targets = [] 68 | 69 | for _,words,segments,_,targets in batch_raw: 70 | input_words.append( pad(words,TAG_PAD,pad_size) ) 71 | input_segments.append( pad(segments,TAG_PAD,pad_size) ) 72 | input_targets.append( pad(targets,-1,pad_size) ) 73 | 74 | input_ids = torch.tensor(input_words, dtype=torch.long) 75 | token_type_ids = torch.tensor(input_segments, dtype=torch.long) 76 | lm_labels = torch.tensor(input_targets, dtype=torch.long) 77 | 78 | loss = model(input_ids.to(device), lm_labels=lm_labels.to(device), token_type_ids=token_type_ids.to(device)) 79 | 80 | loss.backward() 81 | optimizer.step() 82 | optimizer.zero_grad() 83 | tr_loss += loss.item() 84 | exp_average_loss = 
loss.item() if exp_average_loss is None else 0.7*exp_average_loss+0.3*loss.item() 85 | nb_tr_steps += 1 86 | tqdm_bar.desc = "Epoch {:02}, batch {:05}/{:05}. Training loss: {:.2e} lr: {:.2e}".format(epoch, batch_num, len(dataset)//BATCH_SIZE, exp_average_loss, optimizer.get_lr()[0]) 87 | 88 | model_to_save = model.module if hasattr(model, 'module') else model # Only save the model it-self 89 | torch.save(model_to_save.state_dict(), os.path.join(OUTPUT_DIR, "pytorch_model.bin")) 90 | model_to_save.config.to_json_file(os.path.join(OUTPUT_DIR, "config.json")) 91 | tokenizer.save_vocabulary(OUTPUT_DIR) 92 | -------------------------------------------------------------------------------- /gpt1sample.py: -------------------------------------------------------------------------------- 1 | #!/usr/bin/env python3 2 | 3 | import os 4 | import random 5 | import json 6 | 7 | import numpy as np 8 | import nltk 9 | import torch 10 | import torch.nn.functional as F 11 | 12 | # from apex import amp 13 | from tqdm import tqdm, trange 14 | from pytorch_pretrained_bert import OpenAIGPTLMHeadModel, OpenAIGPTTokenizer, OpenAIAdam 15 | 16 | 17 | SAMPLES = 16384 18 | BATCH_SIZE = 32 19 | 20 | MAX_LEN = 500 21 | MODEL_DIR = "/home/altsoph/current" 22 | SEED = 0xDEADFEED 23 | 24 | 25 | def top_k_logits(logits, k): 26 | """ 27 | Masks everything but the k top entries as -infinity (1e10). 28 | Used to mask logits such that e^-infinity -> 0 won't contribute to the 29 | sum of the denominator. 30 | """ 31 | if k == 0: 32 | return logits 33 | else: 34 | values = torch.topk(logits, k)[0] 35 | batch_mins = values[:, -1].view(-1, 1).expand_as(logits) 36 | return torch.where(logits < batch_mins, torch.ones_like(logits) * -1e10, logits) 37 | 38 | def sample_sequence(model, length, segments=None, batch_size=None, context=None, temperature=1, top_k=0, device='cuda', sample=True, text_tag=0): 39 | context = torch.tensor(context, device=device, dtype=torch.long).unsqueeze(0).repeat(batch_size, 1) 40 | segments = torch.tensor(segments, device=device, dtype=torch.long).unsqueeze(0).repeat(batch_size, 1) 41 | text_tag_tpl = torch.tensor([text_tag,], device=device, dtype=torch.long).unsqueeze(0).repeat(batch_size, 1) 42 | 43 | prev = context 44 | output = context 45 | prev_segments = segments 46 | past = None 47 | with torch.no_grad(): 48 | for i in trange(length): 49 | # model(input_ids.to(device), lm_labels=lm_labels.to(device), token_type_ids=token_type_ids.to(device)) 50 | logits = model(output, token_type_ids=prev_segments) 51 | logits = logits[:, -1, :] / temperature 52 | logits = top_k_logits(logits, k=top_k) 53 | log_probs = F.softmax(logits, dim=-1) 54 | if sample: 55 | prev = torch.multinomial(log_probs, num_samples=1) 56 | else: 57 | _, prev = torch.topk(log_probs, k=1, dim=-1) 58 | output = torch.cat((output, prev), dim=1) 59 | prev_segments = torch.cat((prev_segments, text_tag_tpl), dim=1) 60 | return output 61 | 62 | random.seed(SEED) 63 | torch.random.manual_seed(SEED) 64 | torch.cuda.manual_seed(SEED) 65 | 66 | device = torch.device("cuda" if torch.cuda.is_available() else "cpu") 67 | n_gpu = torch.cuda.device_count() 68 | model = OpenAIGPTLMHeadModel.from_pretrained(MODEL_DIR) 69 | tokenizer = OpenAIGPTTokenizer.from_pretrained(MODEL_DIR) 70 | 71 | model.to(device) 72 | 73 | TAG_QUOTES, TAG_CYBER, TAG_TEXT, TAG_META1, TAG_META2, TAG_PAD = tokenizer.convert_tokens_to_ids( 74 | ("", "", "", "", "", "")) 75 | 76 | context_tokens = [TAG_QUOTES, TAG_CYBER] 77 | context_segments = [TAG_META1, TAG_META2] 78 | 79 | 
generated = 0 80 | 81 | for _ in range(SAMPLES // BATCH_SIZE): 82 | out = sample_sequence( 83 | model=model, length=MAX_LEN, 84 | context=context_tokens, 85 | segments=context_segments, 86 | batch_size=BATCH_SIZE, 87 | temperature=1, top_k=0, device=device, 88 | text_tag = TAG_TEXT 89 | ) 90 | out = out[:, len(context_tokens):].tolist() 91 | for i in range(BATCH_SIZE): 92 | generated += 1 93 | text = tokenizer.decode(out[i]) 94 | print("=" * 35 + " SAMPLE " + str(generated) + " " + "=" * (36-len(str(generated))) ) 95 | print(text) 96 | 97 | -------------------------------------------------------------------------------- /gpt1tokenize_trainset.py: -------------------------------------------------------------------------------- 1 | import random 2 | import nltk 3 | from pytorch_pretrained_bert import OpenAIGPTDoubleHeadsModel, OpenAIGPTTokenizer 4 | 5 | model = OpenAIGPTDoubleHeadsModel.from_pretrained('openai-gpt') 6 | tokenizer = OpenAIGPTTokenizer.from_pretrained('openai-gpt') 7 | 8 | SPECIAL_TOKENS = ["", "", "", "", "", "", ""] 9 | 10 | # We can add these special tokens to the vocabulary and the embeddings of the model: 11 | tokenizer.set_special_tokens(SPECIAL_TOKENS) 12 | model.set_num_special_tokens(len(SPECIAL_TOKENS)) 13 | 14 | MAX_LEN = 500 15 | 16 | dataset = [] 17 | for fn,meta1,meta2 in (('long_cyberpunk.txt','',''),('quotes_cyberpunk.txt','',''), 18 | ('long_others.txt','',''),('quotes_others.txt','','')): 19 | meta_tokens = tokenizer.convert_tokens_to_ids((meta1,meta2)) 20 | for line in open(fn, encoding='utf-8', errors='ignore'): 21 | if not line.strip(): continue 22 | # meta_tokens = tokenizer.encode("%s %s" %(meta1,meta2)) 23 | # segments = tokenizer.convert_tokens_to_ids(segments) 24 | tokens = tokenizer.encode(line.strip()) 25 | if len(tokens)>MAX_LEN: 26 | # print('too long',len(tokens)) 27 | sentences = nltk.sent_tokenize(line.strip()) 28 | # print(sentences) 29 | sentences_tokens = [tokenizer.encode(sentence) for sentence in sentences] 30 | # print(sentences_tokens) 31 | collected = [] 32 | for sentence_tokens in sentences_tokens: 33 | if 0 in sentences_tokens or len(collected)+len(sentence_tokens)>MAX_LEN: 34 | # print(len(collected),collected) 35 | dataset.append( (meta1,meta2,meta_tokens+collected) ) 36 | collected = [] 37 | if len(sentence_tokens)<=MAX_LEN: 38 | collected.extend(sentence_tokens) 39 | if collected: 40 | # print(len(collected),collected) 41 | dataset.append( (meta1,meta2,meta_tokens+collected) ) 42 | # exit() 43 | else: 44 | dataset.append( (meta1,meta2,meta_tokens+tokens) ) 45 | for m1,m2,token_ids in dataset: 46 | print("%s\t%s\t%s" % (m1,m2,",".join(map(str,token_ids)))) 47 | -------------------------------------------------------------------------------- /handwriting.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/altsoph/paranoid_transformer/86d77697a4dd8c61e9552a8e03fab2b01ad1103e/handwriting.png -------------------------------------------------------------------------------- /paranoid_transformer.pdf: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/altsoph/paranoid_transformer/86d77697a4dd8c61e9552a8e03fab2b01ad1103e/paranoid_transformer.pdf -------------------------------------------------------------------------------- /paranoid_transformer.png: -------------------------------------------------------------------------------- 
https://raw.githubusercontent.com/altsoph/paranoid_transformer/86d77697a4dd8c61e9552a8e03fab2b01ad1103e/paranoid_transformer.png -------------------------------------------------------------------------------- /paranoid_transformer_back.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/altsoph/paranoid_transformer/86d77697a4dd8c61e9552a8e03fab2b01ad1103e/paranoid_transformer_back.png -------------------------------------------------------------------------------- /paranoid_transformer_w_pics.pdf: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/altsoph/paranoid_transformer/86d77697a4dd8c61e9552a8e03fab2b01ad1103e/paranoid_transformer_w_pics.pdf -------------------------------------------------------------------------------- /pics_samples.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/altsoph/paranoid_transformer/86d77697a4dd8c61e9552a8e03fab2b01ad1103e/pics_samples.png -------------------------------------------------------------------------------- /simple_cleaner.py: -------------------------------------------------------------------------------- 1 | import sys 2 | from nltk.tokenize import sent_tokenize, word_tokenize 3 | from collections import defaultdict 4 | from nltk import pos_tag 5 | 6 | vocab = set() 7 | for line in open("vocab.txt", encoding='utf-8'): 8 | vocab.add( line.strip() ) 9 | 10 | outfh = open(sys.argv[2], "w", encoding='utf-8') 11 | 12 | lines_cnt = sentences_cnt = 0 13 | cases = defaultdict(int) 14 | for ln, line in enumerate(open(sys.argv[1], encoding='utf-8')): 15 | if line[0] == '=': continue 16 | lines_cnt += 1 17 | 18 | sents = sent_tokenize(line) 19 | sentences_cnt += len(sents) 20 | print(file=outfh) 21 | for sent in sents: 22 | words = word_tokenize(sent) 23 | tmp = sent.replace('#',' ').replace('...','#').replace('!','#').replace('?','#').replace(',','#').replace('--','#').replace('-','#').replace(';','#')\ 24 | .replace(':','#').replace('`','#').replace('"','#').replace('.','#').replace(' ','') 25 | 26 | no_punct = [] 27 | size = 0 28 | for npidx,w in enumerate(words): 29 | if w in ('...','!','?',',','--','-',';',':','`','"','.'): 30 | no_punct.append(size) 31 | size = 0 32 | else: 33 | size += 1 34 | no_punct.append(size) 35 | 36 | pos = pos_tag(words) 37 | skip = False 38 | for idx,(w,p) in enumerate(pos[:-1]): 39 | # VB verb, base form take 40 | # VBD verb, past tense took 41 | # VBG verb, gerund/present participle taking 42 | # VBN verb, past participle taken 43 | # VBP verb, sing. present, non-3d take 44 | # VBZ verb, 3rd person sing. 
present takes 45 | if p in ('VB','VBD','VBG','VBN','VBP','VBZ') and pos[idx+1][1] in ('VB','VBD','VBG','VBN','VBP','VBZ'): 46 | if p == 'VBD' and pos[idx+1][1] == 'VBN': continue 47 | if p == 'VB' and pos[idx+1][1] == 'VBN': continue 48 | if p == 'VBP' and pos[idx+1][1] == 'VBN': continue 49 | if p == 'VBZ' and pos[idx+1][1] == 'VBN': continue 50 | if w == 'been' and pos[idx+1][1] == 'VBN': continue 51 | if w in ('be','was','are','is',"'re","'s","been","have") and pos[idx+1][1] == 'VBG': continue 52 | if w == 'i': continue 53 | # print('VERB', (w,p), pos[idx+1], sent) 54 | cases['verbverb'] += 1 55 | skip = True 56 | break 57 | # it's bad if several verbs in a row 58 | if set(words)-vocab: 59 | cases['new_word'] += 1 60 | skip = True 61 | elif max(no_punct)>25: 62 | cases['no_punct'] += 1 63 | skip = True 64 | elif len(words)>=60: 65 | cases['to_long'] += 1 66 | skip = True 67 | elif "###" in tmp: 68 | cases['manypuncts'] += 1 69 | skip = True 70 | for idx,w in enumerate(words[:-1]): 71 | if w == words[idx+1]: 72 | cases['duplicate_words'] += 1 73 | skip = True 74 | break 75 | if sent[-1] not in '.!?': 76 | cases['badend'] += 1 77 | skip = True 78 | 79 | if skip: 80 | print(file=outfh) 81 | else: 82 | print(sent,file=outfh) 83 | 84 | print(lines_cnt, sentences_cnt, cases.items()) 85 | 86 | outfh.close() 87 | -------------------------------------------------------------------------------- /train_classifier.py: -------------------------------------------------------------------------------- 1 | # coding=utf-8 2 | # Copyright 2018 The Google AI Language Team Authors and The HuggingFace Inc. team. 3 | # Copyright (c) 2018, NVIDIA CORPORATION. All rights reserved. 4 | # 5 | # Licensed under the Apache License, Version 2.0 (the "License"); 6 | # you may not use this file except in compliance with the License. 7 | # You may obtain a copy of the License at 8 | # 9 | # http://www.apache.org/licenses/LICENSE-2.0 10 | # 11 | # Unless required by applicable law or agreed to in writing, software 12 | # distributed under the License is distributed on an "AS IS" BASIS, 13 | # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. 14 | # See the License for the specific language governing permissions and 15 | # limitations under the License. 16 | """BERT finetuning runner.""" 17 | 18 | from __future__ import absolute_import, division, print_function 19 | 20 | import argparse 21 | import csv 22 | import logging 23 | import os 24 | import random 25 | import sys 26 | 27 | import numpy as np 28 | import torch 29 | from torch.utils.data import (DataLoader, RandomSampler, SequentialSampler, 30 | TensorDataset) 31 | from torch.utils.data.distributed import DistributedSampler 32 | from tqdm import tqdm, trange 33 | 34 | from torch.nn import CrossEntropyLoss, MSELoss 35 | from scipy.stats import pearsonr, spearmanr 36 | from sklearn.metrics import matthews_corrcoef, f1_score 37 | 38 | from pytorch_pretrained_bert.file_utils import PYTORCH_PRETRAINED_BERT_CACHE, WEIGHTS_NAME, CONFIG_NAME 39 | from pytorch_pretrained_bert.modeling import BertForSequenceClassification, BertConfig 40 | from pytorch_pretrained_bert.tokenization import BertTokenizer 41 | from pytorch_pretrained_bert.optimization import BertAdam, WarmupLinearSchedule 42 | 43 | logger = logging.getLogger(__name__) 44 | 45 | 46 | class InputExample(object): 47 | """A single training/test example for simple sequence classification.""" 48 | 49 | def __init__(self, guid, text_a, text_b=None, label=None): 50 | """Constructs a InputExample. 
51 | 52 | Args: 53 | guid: Unique id for the example. 54 | text_a: string. The untokenized text of the first sequence. For single 55 | sequence tasks, only this sequence must be specified. 56 | text_b: (Optional) string. The untokenized text of the second sequence. 57 | Only must be specified for sequence pair tasks. 58 | label: (Optional) string. The label of the example. This should be 59 | specified for train and dev examples, but not for test examples. 60 | """ 61 | self.guid = guid 62 | self.text_a = text_a 63 | self.text_b = text_b 64 | self.label = label 65 | 66 | 67 | class InputFeatures(object): 68 | """A single set of features of data.""" 69 | 70 | def __init__(self, input_ids, input_mask, segment_ids, label_id): 71 | self.input_ids = input_ids 72 | self.input_mask = input_mask 73 | self.segment_ids = segment_ids 74 | self.label_id = label_id 75 | 76 | 77 | class DataProcessor(object): 78 | """Base class for data converters for sequence classification data sets.""" 79 | 80 | def get_train_examples(self, data_dir): 81 | """Gets a collection of `InputExample`s for the train set.""" 82 | raise NotImplementedError() 83 | 84 | def get_dev_examples(self, data_dir): 85 | """Gets a collection of `InputExample`s for the dev set.""" 86 | raise NotImplementedError() 87 | 88 | def get_labels(self): 89 | """Gets the list of labels for this data set.""" 90 | raise NotImplementedError() 91 | 92 | @classmethod 93 | def _read_tsv(cls, input_file, quotechar=None): 94 | """Reads a tab separated value file.""" 95 | with open(input_file, "r", encoding="utf-8") as f: 96 | reader = csv.reader(f, delimiter="\t", quotechar=quotechar) 97 | lines = [] 98 | for line in reader: 99 | if sys.version_info[0] == 2: 100 | line = list(unicode(cell, 'utf-8') for cell in line) 101 | lines.append(line) 102 | return lines 103 | 104 | 105 | class MrpcProcessor(DataProcessor): 106 | """Processor for the MRPC data set (GLUE version).""" 107 | 108 | def get_train_examples(self, data_dir): 109 | """See base class.""" 110 | logger.info("LOOKING AT {}".format(os.path.join(data_dir, "train.tsv"))) 111 | return self._create_examples( 112 | self._read_tsv(os.path.join(data_dir, "train.tsv")), "train") 113 | 114 | def get_dev_examples(self, data_dir): 115 | """See base class.""" 116 | return self._create_examples( 117 | self._read_tsv(os.path.join(data_dir, "dev.tsv")), "dev") 118 | 119 | def get_labels(self): 120 | """See base class.""" 121 | return ["0", "1"] 122 | 123 | def _create_examples(self, lines, set_type): 124 | """Creates examples for the training and dev sets.""" 125 | examples = [] 126 | for (i, line) in enumerate(lines): 127 | if i == 0: 128 | continue 129 | guid = "%s-%s" % (set_type, i) 130 | text_a = line[3] 131 | text_b = line[4] 132 | label = line[0] 133 | examples.append( 134 | InputExample(guid=guid, text_a=text_a, text_b=text_b, label=label)) 135 | return examples 136 | 137 | 138 | class MnliProcessor(DataProcessor): 139 | """Processor for the MultiNLI data set (GLUE version).""" 140 | 141 | def get_train_examples(self, data_dir): 142 | """See base class.""" 143 | return self._create_examples( 144 | self._read_tsv(os.path.join(data_dir, "train.tsv")), "train") 145 | 146 | def get_dev_examples(self, data_dir): 147 | """See base class.""" 148 | return self._create_examples( 149 | self._read_tsv(os.path.join(data_dir, "dev_matched.tsv")), 150 | "dev_matched") 151 | 152 | def get_labels(self): 153 | """See base class.""" 154 | return ["contradiction", "entailment", "neutral"] 155 | 156 | def 
_create_examples(self, lines, set_type): 157 | """Creates examples for the training and dev sets.""" 158 | examples = [] 159 | for (i, line) in enumerate(lines): 160 | if i == 0: 161 | continue 162 | guid = "%s-%s" % (set_type, line[0]) 163 | text_a = line[8] 164 | text_b = line[9] 165 | label = line[-1] 166 | examples.append( 167 | InputExample(guid=guid, text_a=text_a, text_b=text_b, label=label)) 168 | return examples 169 | 170 | 171 | class MnliMismatchedProcessor(MnliProcessor): 172 | """Processor for the MultiNLI Mismatched data set (GLUE version).""" 173 | 174 | def get_dev_examples(self, data_dir): 175 | """See base class.""" 176 | return self._create_examples( 177 | self._read_tsv(os.path.join(data_dir, "dev_mismatched.tsv")), 178 | "dev_matched") 179 | 180 | 181 | class ColaProcessor(DataProcessor): 182 | """Processor for the CoLA data set (GLUE version).""" 183 | 184 | def get_train_examples(self, data_dir): 185 | """See base class.""" 186 | return self._create_examples( 187 | self._read_tsv(os.path.join(data_dir, "train.tsv")), "train") 188 | 189 | def get_dev_examples(self, data_dir): 190 | """See base class.""" 191 | return self._create_examples( 192 | self._read_tsv(os.path.join(data_dir, "dev.tsv")), "dev") 193 | 194 | def get_labels(self): 195 | """See base class.""" 196 | return ["0", "1"] 197 | 198 | def _create_examples(self, lines, set_type): 199 | """Creates examples for the training and dev sets.""" 200 | examples = [] 201 | for (i, line) in enumerate(lines): 202 | guid = "%s-%s" % (set_type, i) 203 | text_a = line[3] 204 | label = line[1] 205 | examples.append( 206 | InputExample(guid=guid, text_a=text_a, text_b=None, label=label)) 207 | return examples 208 | 209 | class OnionProcessor(ColaProcessor): 210 | def _create_examples(self, lines, set_type): 211 | """Creates examples for the training and dev sets.""" 212 | examples = [] 213 | for (i, line) in enumerate(lines): 214 | left, right = line[0], line[1] 215 | target = 0 216 | guid = str(i) 217 | 218 | # tmp_right = right.split() 219 | # np.random.shuffle(tmp_right) 220 | # right = " ".join(tmp_right) 221 | if np.random.rand() > 0.5: 222 | left, right = right, left 223 | target = 1 224 | 225 | # text_a = tokenization.convert_to_unicode(left) 226 | # text_b = tokenization.convert_to_unicode(right) 227 | text_a = left 228 | text_b = right 229 | label = str(target) 230 | examples.append( 231 | InputExample(guid=guid, text_a=text_a, text_b=text_b, label=label)) 232 | 233 | return examples 234 | 235 | class Sst2Processor(DataProcessor): 236 | """Processor for the SST-2 data set (GLUE version).""" 237 | 238 | def get_train_examples(self, data_dir): 239 | """See base class.""" 240 | return self._create_examples( 241 | self._read_tsv(os.path.join(data_dir, "train.tsv")), "train") 242 | 243 | def get_dev_examples(self, data_dir): 244 | """See base class.""" 245 | return self._create_examples( 246 | self._read_tsv(os.path.join(data_dir, "dev.tsv")), "dev") 247 | 248 | def get_labels(self): 249 | """See base class.""" 250 | return ["0", "1"] 251 | 252 | def _create_examples(self, lines, set_type): 253 | """Creates examples for the training and dev sets.""" 254 | examples = [] 255 | for (i, line) in enumerate(lines): 256 | if i == 0: 257 | continue 258 | guid = "%s-%s" % (set_type, i) 259 | text_a = line[0] 260 | label = line[1] 261 | examples.append( 262 | InputExample(guid=guid, text_a=text_a, text_b=None, label=label)) 263 | return examples 264 | 265 | 266 | class StsbProcessor(DataProcessor): 267 | """Processor for the 
STS-B data set (GLUE version).""" 268 | 269 | def get_train_examples(self, data_dir): 270 | """See base class.""" 271 | return self._create_examples( 272 | self._read_tsv(os.path.join(data_dir, "train.tsv")), "train") 273 | 274 | def get_dev_examples(self, data_dir): 275 | """See base class.""" 276 | return self._create_examples( 277 | self._read_tsv(os.path.join(data_dir, "dev.tsv")), "dev") 278 | 279 | def get_labels(self): 280 | """See base class.""" 281 | return [None] 282 | 283 | def _create_examples(self, lines, set_type): 284 | """Creates examples for the training and dev sets.""" 285 | examples = [] 286 | for (i, line) in enumerate(lines): 287 | if i == 0: 288 | continue 289 | guid = "%s-%s" % (set_type, line[0]) 290 | text_a = line[7] 291 | text_b = line[8] 292 | label = line[-1] 293 | examples.append( 294 | InputExample(guid=guid, text_a=text_a, text_b=text_b, label=label)) 295 | return examples 296 | 297 | 298 | class QqpProcessor(DataProcessor): 299 | """Processor for the STS-B data set (GLUE version).""" 300 | 301 | def get_train_examples(self, data_dir): 302 | """See base class.""" 303 | return self._create_examples( 304 | self._read_tsv(os.path.join(data_dir, "train.tsv")), "train") 305 | 306 | def get_dev_examples(self, data_dir): 307 | """See base class.""" 308 | return self._create_examples( 309 | self._read_tsv(os.path.join(data_dir, "dev.tsv")), "dev") 310 | 311 | def get_labels(self): 312 | """See base class.""" 313 | return ["0", "1"] 314 | 315 | def _create_examples(self, lines, set_type): 316 | """Creates examples for the training and dev sets.""" 317 | examples = [] 318 | for (i, line) in enumerate(lines): 319 | if i == 0: 320 | continue 321 | guid = "%s-%s" % (set_type, line[0]) 322 | try: 323 | text_a = line[3] 324 | text_b = line[4] 325 | label = line[5] 326 | except IndexError: 327 | continue 328 | examples.append( 329 | InputExample(guid=guid, text_a=text_a, text_b=text_b, label=label)) 330 | return examples 331 | 332 | 333 | class QnliProcessor(DataProcessor): 334 | """Processor for the STS-B data set (GLUE version).""" 335 | 336 | def get_train_examples(self, data_dir): 337 | """See base class.""" 338 | return self._create_examples( 339 | self._read_tsv(os.path.join(data_dir, "train.tsv")), "train") 340 | 341 | def get_dev_examples(self, data_dir): 342 | """See base class.""" 343 | return self._create_examples( 344 | self._read_tsv(os.path.join(data_dir, "dev.tsv")), 345 | "dev_matched") 346 | 347 | def get_labels(self): 348 | """See base class.""" 349 | return ["entailment", "not_entailment"] 350 | 351 | def _create_examples(self, lines, set_type): 352 | """Creates examples for the training and dev sets.""" 353 | examples = [] 354 | for (i, line) in enumerate(lines): 355 | if i == 0: 356 | continue 357 | guid = "%s-%s" % (set_type, line[0]) 358 | text_a = line[1] 359 | text_b = line[2] 360 | label = line[-1] 361 | examples.append( 362 | InputExample(guid=guid, text_a=text_a, text_b=text_b, label=label)) 363 | return examples 364 | 365 | 366 | class RteProcessor(DataProcessor): 367 | """Processor for the RTE data set (GLUE version).""" 368 | 369 | def get_train_examples(self, data_dir): 370 | """See base class.""" 371 | return self._create_examples( 372 | self._read_tsv(os.path.join(data_dir, "train.tsv")), "train") 373 | 374 | def get_dev_examples(self, data_dir): 375 | """See base class.""" 376 | return self._create_examples( 377 | self._read_tsv(os.path.join(data_dir, "dev.tsv")), "dev") 378 | 379 | def get_labels(self): 380 | """See base class.""" 381 | 
return ["entailment", "not_entailment"] 382 | 383 | def _create_examples(self, lines, set_type): 384 | """Creates examples for the training and dev sets.""" 385 | examples = [] 386 | for (i, line) in enumerate(lines): 387 | if i == 0: 388 | continue 389 | guid = "%s-%s" % (set_type, line[0]) 390 | text_a = line[1] 391 | text_b = line[2] 392 | label = line[-1] 393 | examples.append( 394 | InputExample(guid=guid, text_a=text_a, text_b=text_b, label=label)) 395 | return examples 396 | 397 | 398 | class WnliProcessor(DataProcessor): 399 | """Processor for the WNLI data set (GLUE version).""" 400 | 401 | def get_train_examples(self, data_dir): 402 | """See base class.""" 403 | return self._create_examples( 404 | self._read_tsv(os.path.join(data_dir, "train.tsv")), "train") 405 | 406 | def get_dev_examples(self, data_dir): 407 | """See base class.""" 408 | return self._create_examples( 409 | self._read_tsv(os.path.join(data_dir, "dev.tsv")), "dev") 410 | 411 | def get_labels(self): 412 | """See base class.""" 413 | return ["0", "1"] 414 | 415 | def _create_examples(self, lines, set_type): 416 | """Creates examples for the training and dev sets.""" 417 | examples = [] 418 | for (i, line) in enumerate(lines): 419 | if i == 0: 420 | continue 421 | guid = "%s-%s" % (set_type, line[0]) 422 | text_a = line[1] 423 | text_b = line[2] 424 | label = line[-1] 425 | examples.append( 426 | InputExample(guid=guid, text_a=text_a, text_b=text_b, label=label)) 427 | return examples 428 | 429 | 430 | def convert_examples_to_features(examples, label_list, max_seq_length, 431 | tokenizer, output_mode): 432 | """Loads a data file into a list of `InputBatch`s.""" 433 | 434 | label_map = {label : i for i, label in enumerate(label_list)} 435 | 436 | features = [] 437 | for (ex_index, example) in enumerate(examples): 438 | if ex_index % 10000 == 0: 439 | logger.info("Writing example %d of %d" % (ex_index, len(examples))) 440 | 441 | tokens_a = tokenizer.tokenize(example.text_a) 442 | 443 | tokens_b = None 444 | if example.text_b: 445 | tokens_b = tokenizer.tokenize(example.text_b) 446 | # Modifies `tokens_a` and `tokens_b` in place so that the total 447 | # length is less than the specified length. 448 | # Account for [CLS], [SEP], [SEP] with "- 3" 449 | _truncate_seq_pair(tokens_a, tokens_b, max_seq_length - 3) 450 | else: 451 | # Account for [CLS] and [SEP] with "- 2" 452 | if len(tokens_a) > max_seq_length - 2: 453 | tokens_a = tokens_a[:(max_seq_length - 2)] 454 | 455 | # The convention in BERT is: 456 | # (a) For sequence pairs: 457 | # tokens: [CLS] is this jack ##son ##ville ? [SEP] no it is not . [SEP] 458 | # type_ids: 0 0 0 0 0 0 0 0 1 1 1 1 1 1 459 | # (b) For single sequences: 460 | # tokens: [CLS] the dog is hairy . [SEP] 461 | # type_ids: 0 0 0 0 0 0 0 462 | # 463 | # Where "type_ids" are used to indicate whether this is the first 464 | # sequence or the second sequence. The embedding vectors for `type=0` and 465 | # `type=1` were learned during pre-training and are added to the wordpiece 466 | # embedding vector (and position vector). This is not *strictly* necessary 467 | # since the [SEP] token unambiguously separates the sequences, but it makes 468 | # it easier for the model to learn the concept of sequences. 469 | # 470 | # For classification tasks, the first vector (corresponding to [CLS]) is 471 | # used as as the "sentence vector". Note that this only makes sense because 472 | # the entire model is fine-tuned. 
473 | tokens = ["[CLS]"] + tokens_a + ["[SEP]"] 474 | segment_ids = [0] * len(tokens) 475 | 476 | if tokens_b: 477 | tokens += tokens_b + ["[SEP]"] 478 | segment_ids += [1] * (len(tokens_b) + 1) 479 | 480 | input_ids = tokenizer.convert_tokens_to_ids(tokens) 481 | 482 | # The mask has 1 for real tokens and 0 for padding tokens. Only real 483 | # tokens are attended to. 484 | input_mask = [1] * len(input_ids) 485 | 486 | # Zero-pad up to the sequence length. 487 | padding = [0] * (max_seq_length - len(input_ids)) 488 | input_ids += padding 489 | input_mask += padding 490 | segment_ids += padding 491 | 492 | assert len(input_ids) == max_seq_length 493 | assert len(input_mask) == max_seq_length 494 | assert len(segment_ids) == max_seq_length 495 | 496 | if output_mode == "classification": 497 | label_id = label_map[example.label] 498 | elif output_mode == "regression": 499 | label_id = float(example.label) 500 | else: 501 | raise KeyError(output_mode) 502 | 503 | if ex_index < 5: 504 | logger.info("*** Example ***") 505 | logger.info("guid: %s" % (example.guid)) 506 | logger.info("tokens: %s" % " ".join( 507 | [str(x) for x in tokens])) 508 | logger.info("input_ids: %s" % " ".join([str(x) for x in input_ids])) 509 | logger.info("input_mask: %s" % " ".join([str(x) for x in input_mask])) 510 | logger.info( 511 | "segment_ids: %s" % " ".join([str(x) for x in segment_ids])) 512 | logger.info("label: %s (id = %d)" % (example.label, label_id)) 513 | 514 | features.append( 515 | InputFeatures(input_ids=input_ids, 516 | input_mask=input_mask, 517 | segment_ids=segment_ids, 518 | label_id=label_id)) 519 | return features 520 | 521 | 522 | def _truncate_seq_pair(tokens_a, tokens_b, max_length): 523 | """Truncates a sequence pair in place to the maximum length.""" 524 | 525 | # This is a simple heuristic which will always truncate the longer sequence 526 | # one token at a time. This makes more sense than truncating an equal percent 527 | # of tokens from each, since if one sequence is very short then each token 528 | # that's truncated likely contains more information than a longer sequence. 
529 | while True: 530 | total_length = len(tokens_a) + len(tokens_b) 531 | if total_length <= max_length: 532 | break 533 | if len(tokens_a) > len(tokens_b): 534 | tokens_a.pop() 535 | else: 536 | tokens_b.pop() 537 | 538 | 539 | def simple_accuracy(preds, labels): 540 | return (preds == labels).mean() 541 | 542 | 543 | def acc_and_f1(preds, labels): 544 | acc = simple_accuracy(preds, labels) 545 | f1 = f1_score(y_true=labels, y_pred=preds) 546 | return { 547 | "acc": acc, 548 | "f1": f1, 549 | "acc_and_f1": (acc + f1) / 2, 550 | } 551 | 552 | 553 | def pearson_and_spearman(preds, labels): 554 | pearson_corr = pearsonr(preds, labels)[0] 555 | spearman_corr = spearmanr(preds, labels)[0] 556 | return { 557 | "pearson": pearson_corr, 558 | "spearmanr": spearman_corr, 559 | "corr": (pearson_corr + spearman_corr) / 2, 560 | } 561 | 562 | 563 | def compute_metrics(task_name, preds, labels): 564 | assert len(preds) == len(labels) 565 | if task_name == "cola": 566 | return {"mcc": matthews_corrcoef(labels, preds),"acc": simple_accuracy(preds, labels)} 567 | if task_name == "onion": 568 | return {"mcc": matthews_corrcoef(labels, preds),"acc": simple_accuracy(preds, labels)} 569 | elif task_name == "sst-2": 570 | return {"acc": simple_accuracy(preds, labels)} 571 | elif task_name == "mrpc": 572 | return acc_and_f1(preds, labels) 573 | elif task_name == "sts-b": 574 | return pearson_and_spearman(preds, labels) 575 | elif task_name == "qqp": 576 | return acc_and_f1(preds, labels) 577 | elif task_name == "mnli": 578 | return {"acc": simple_accuracy(preds, labels)} 579 | elif task_name == "mnli-mm": 580 | return {"acc": simple_accuracy(preds, labels)} 581 | elif task_name == "qnli": 582 | return {"acc": simple_accuracy(preds, labels)} 583 | elif task_name == "rte": 584 | return {"acc": simple_accuracy(preds, labels)} 585 | elif task_name == "wnli": 586 | return {"acc": simple_accuracy(preds, labels)} 587 | else: 588 | raise KeyError(task_name) 589 | 590 | 591 | def main(): 592 | parser = argparse.ArgumentParser() 593 | 594 | ## Required parameters 595 | parser.add_argument("--data_dir", 596 | default=None, 597 | type=str, 598 | required=True, 599 | help="The input data dir. Should contain the .tsv files (or other data files) for the task.") 600 | parser.add_argument("--bert_model", default=None, type=str, required=True, 601 | help="Bert pre-trained model selected in the list: bert-base-uncased, " 602 | "bert-large-uncased, bert-base-cased, bert-large-cased, bert-base-multilingual-uncased, " 603 | "bert-base-multilingual-cased, bert-base-chinese.") 604 | parser.add_argument("--task_name", 605 | default=None, 606 | type=str, 607 | required=True, 608 | help="The name of the task to train.") 609 | parser.add_argument("--output_dir", 610 | default=None, 611 | type=str, 612 | required=True, 613 | help="The output directory where the model predictions and checkpoints will be written.") 614 | 615 | ## Other parameters 616 | parser.add_argument("--cache_dir", 617 | default="", 618 | type=str, 619 | help="Where do you want to store the pre-trained models downloaded from s3") 620 | parser.add_argument("--max_seq_length", 621 | default=128, 622 | type=int, 623 | help="The maximum total input sequence length after WordPiece tokenization. 
\n" 624 | "Sequences longer than this will be truncated, and sequences shorter \n" 625 | "than this will be padded.") 626 | parser.add_argument("--do_train", 627 | action='store_true', 628 | help="Whether to run training.") 629 | parser.add_argument("--do_eval", 630 | action='store_true', 631 | help="Whether to run eval on the dev set.") 632 | parser.add_argument("--do_lower_case", 633 | action='store_true', 634 | help="Set this flag if you are using an uncased model.") 635 | parser.add_argument("--train_batch_size", 636 | default=32, 637 | type=int, 638 | help="Total batch size for training.") 639 | parser.add_argument("--eval_batch_size", 640 | default=8, 641 | type=int, 642 | help="Total batch size for eval.") 643 | parser.add_argument("--learning_rate", 644 | default=5e-5, 645 | type=float, 646 | help="The initial learning rate for Adam.") 647 | parser.add_argument("--num_train_epochs", 648 | default=3.0, 649 | type=float, 650 | help="Total number of training epochs to perform.") 651 | parser.add_argument("--warmup_proportion", 652 | default=0.1, 653 | type=float, 654 | help="Proportion of training to perform linear learning rate warmup for. " 655 | "E.g., 0.1 = 10%% of training.") 656 | parser.add_argument("--no_cuda", 657 | action='store_true', 658 | help="Whether not to use CUDA when available") 659 | parser.add_argument("--local_rank", 660 | type=int, 661 | default=-1, 662 | help="local_rank for distributed training on gpus") 663 | parser.add_argument('--seed', 664 | type=int, 665 | default=42, 666 | help="random seed for initialization") 667 | parser.add_argument('--gradient_accumulation_steps', 668 | type=int, 669 | default=1, 670 | help="Number of updates steps to accumulate before performing a backward/update pass.") 671 | parser.add_argument('--fp16', 672 | action='store_true', 673 | help="Whether to use 16-bit float precision instead of 32-bit") 674 | parser.add_argument('--loss_scale', 675 | type=float, default=0, 676 | help="Loss scaling to improve fp16 numeric stability. 
Only used when fp16 is set to True.\n" 677 | "0 (default value): dynamic loss scaling.\n" 678 | "Positive power of 2: static loss scaling value.\n") 679 | parser.add_argument('--server_ip', type=str, default='', help="Can be used for distant debugging.") 680 | parser.add_argument('--server_port', type=str, default='', help="Can be used for distant debugging.") 681 | args = parser.parse_args() 682 | 683 | if args.server_ip and args.server_port: 684 | # Distant debugging - see https://code.visualstudio.com/docs/python/debugging#_attach-to-a-local-script 685 | import ptvsd 686 | print("Waiting for debugger attach") 687 | ptvsd.enable_attach(address=(args.server_ip, args.server_port), redirect_output=True) 688 | ptvsd.wait_for_attach() 689 | 690 | processors = { 691 | "cola": ColaProcessor, 692 | "onion": OnionProcessor, 693 | "mnli": MnliProcessor, 694 | "mnli-mm": MnliMismatchedProcessor, 695 | "mrpc": MrpcProcessor, 696 | "sst-2": Sst2Processor, 697 | "sts-b": StsbProcessor, 698 | "qqp": QqpProcessor, 699 | "qnli": QnliProcessor, 700 | "rte": RteProcessor, 701 | "wnli": WnliProcessor, 702 | } 703 | 704 | output_modes = { 705 | "cola": "classification", 706 | "onion": "classification", 707 | "mnli": "classification", 708 | "mrpc": "classification", 709 | "sst-2": "classification", 710 | "sts-b": "regression", 711 | "qqp": "classification", 712 | "qnli": "classification", 713 | "rte": "classification", 714 | "wnli": "classification", 715 | } 716 | 717 | if args.local_rank == -1 or args.no_cuda: 718 | device = torch.device("cuda" if torch.cuda.is_available() and not args.no_cuda else "cpu") 719 | n_gpu = torch.cuda.device_count() 720 | else: 721 | torch.cuda.set_device(args.local_rank) 722 | device = torch.device("cuda", args.local_rank) 723 | n_gpu = 1 724 | # Initializes the distributed backend which will take care of synchronizing nodes/GPUs 725 | torch.distributed.init_process_group(backend='nccl') 726 | 727 | logging.basicConfig(format = '%(asctime)s - %(levelname)s - %(name)s - %(message)s', 728 | datefmt = '%m/%d/%Y %H:%M:%S', 729 | level = logging.INFO if args.local_rank in [-1, 0] else logging.WARN) 730 | 731 | logger.info("device: {} n_gpu: {}, distributed training: {}, 16-bits training: {}".format( 732 | device, n_gpu, bool(args.local_rank != -1), args.fp16)) 733 | 734 | if args.gradient_accumulation_steps < 1: 735 | raise ValueError("Invalid gradient_accumulation_steps parameter: {}, should be >= 1".format( 736 | args.gradient_accumulation_steps)) 737 | 738 | args.train_batch_size = args.train_batch_size // args.gradient_accumulation_steps 739 | 740 | random.seed(args.seed) 741 | np.random.seed(args.seed) 742 | torch.manual_seed(args.seed) 743 | if n_gpu > 0: 744 | torch.cuda.manual_seed_all(args.seed) 745 | 746 | if not args.do_train and not args.do_eval: 747 | raise ValueError("At least one of `do_train` or `do_eval` must be True.") 748 | 749 | if os.path.exists(args.output_dir) and os.listdir(args.output_dir) and args.do_train: 750 | raise ValueError("Output directory ({}) already exists and is not empty.".format(args.output_dir)) 751 | if not os.path.exists(args.output_dir): 752 | os.makedirs(args.output_dir) 753 | 754 | task_name = args.task_name.lower() 755 | 756 | if task_name not in processors: 757 | raise ValueError("Task not found: %s" % (task_name)) 758 | 759 | processor = processors[task_name]() 760 | output_mode = output_modes[task_name] 761 | 762 | label_list = processor.get_labels() 763 | num_labels = len(label_list) 764 | 765 | tokenizer = 
BertTokenizer.from_pretrained(args.bert_model, do_lower_case=args.do_lower_case) 766 | 767 | train_examples = None 768 | num_train_optimization_steps = None 769 | if args.do_train: 770 | train_examples = processor.get_train_examples(args.data_dir) 771 | num_train_optimization_steps = int( 772 | len(train_examples) / args.train_batch_size / args.gradient_accumulation_steps) * args.num_train_epochs 773 | if args.local_rank != -1: 774 | num_train_optimization_steps = num_train_optimization_steps // torch.distributed.get_world_size() 775 | 776 | # Prepare model 777 | cache_dir = args.cache_dir if args.cache_dir else os.path.join(str(PYTORCH_PRETRAINED_BERT_CACHE), 'distributed_{}'.format(args.local_rank)) 778 | model = BertForSequenceClassification.from_pretrained(args.bert_model, 779 | cache_dir=cache_dir, 780 | num_labels=num_labels) 781 | if args.fp16: 782 | model.half() 783 | model.to(device) 784 | if args.local_rank != -1: 785 | try: 786 | from apex.parallel import DistributedDataParallel as DDP 787 | except ImportError: 788 | raise ImportError("Please install apex from https://www.github.com/nvidia/apex to use distributed and fp16 training.") 789 | 790 | model = DDP(model) 791 | elif n_gpu > 1: 792 | model = torch.nn.DataParallel(model) 793 | 794 | # Prepare optimizer 795 | param_optimizer = list(model.named_parameters()) 796 | no_decay = ['bias', 'LayerNorm.bias', 'LayerNorm.weight'] 797 | optimizer_grouped_parameters = [ 798 | {'params': [p for n, p in param_optimizer if not any(nd in n for nd in no_decay)], 'weight_decay': 0.01}, 799 | {'params': [p for n, p in param_optimizer if any(nd in n for nd in no_decay)], 'weight_decay': 0.0} 800 | ] 801 | if args.fp16: 802 | try: 803 | from apex.optimizers import FP16_Optimizer 804 | from apex.optimizers import FusedAdam 805 | except ImportError: 806 | raise ImportError("Please install apex from https://www.github.com/nvidia/apex to use distributed and fp16 training.") 807 | 808 | optimizer = FusedAdam(optimizer_grouped_parameters, 809 | lr=args.learning_rate, 810 | bias_correction=False, 811 | max_grad_norm=1.0) 812 | if args.loss_scale == 0: 813 | optimizer = FP16_Optimizer(optimizer, dynamic_loss_scale=True) 814 | else: 815 | optimizer = FP16_Optimizer(optimizer, static_loss_scale=args.loss_scale) 816 | warmup_linear = WarmupLinearSchedule(warmup=args.warmup_proportion, 817 | t_total=num_train_optimization_steps) 818 | 819 | else: 820 | optimizer = BertAdam(optimizer_grouped_parameters, 821 | lr=args.learning_rate, 822 | warmup=args.warmup_proportion, 823 | t_total=num_train_optimization_steps) 824 | 825 | global_step = 0 826 | nb_tr_steps = 0 827 | tr_loss = 0 828 | if args.do_train: 829 | train_features = convert_examples_to_features( 830 | train_examples, label_list, args.max_seq_length, tokenizer, output_mode) 831 | logger.info("***** Running training *****") 832 | logger.info(" Num examples = %d", len(train_examples)) 833 | logger.info(" Batch size = %d", args.train_batch_size) 834 | logger.info(" Num steps = %d", num_train_optimization_steps) 835 | all_input_ids = torch.tensor([f.input_ids for f in train_features], dtype=torch.long) 836 | all_input_mask = torch.tensor([f.input_mask for f in train_features], dtype=torch.long) 837 | all_segment_ids = torch.tensor([f.segment_ids for f in train_features], dtype=torch.long) 838 | 839 | if output_mode == "classification": 840 | all_label_ids = torch.tensor([f.label_id for f in train_features], dtype=torch.long) 841 | elif output_mode == "regression": 842 | all_label_ids = 
torch.tensor([f.label_id for f in train_features], dtype=torch.float) 843 | 844 | train_data = TensorDataset(all_input_ids, all_input_mask, all_segment_ids, all_label_ids) 845 | if args.local_rank == -1: 846 | train_sampler = RandomSampler(train_data) 847 | else: 848 | train_sampler = DistributedSampler(train_data) 849 | train_dataloader = DataLoader(train_data, sampler=train_sampler, batch_size=args.train_batch_size) 850 | 851 | model.train() 852 | for _ in trange(int(args.num_train_epochs), desc="Epoch"): 853 | tr_loss = 0 854 | nb_tr_examples, nb_tr_steps = 0, 0 855 | for step, batch in enumerate(tqdm(train_dataloader, desc="Iteration")): 856 | batch = tuple(t.to(device) for t in batch) 857 | input_ids, input_mask, segment_ids, label_ids = batch 858 | 859 | # define a new function to compute loss values for both output_modes 860 | logits = model(input_ids, segment_ids, input_mask, labels=None) 861 | 862 | if output_mode == "classification": 863 | loss_fct = CrossEntropyLoss() 864 | loss = loss_fct(logits.view(-1, num_labels), label_ids.view(-1)) 865 | elif output_mode == "regression": 866 | loss_fct = MSELoss() 867 | loss = loss_fct(logits.view(-1), label_ids.view(-1)) 868 | 869 | if n_gpu > 1: 870 | loss = loss.mean() # mean() to average on multi-gpu. 871 | if args.gradient_accumulation_steps > 1: 872 | loss = loss / args.gradient_accumulation_steps 873 | 874 | if args.fp16: 875 | optimizer.backward(loss) 876 | else: 877 | loss.backward() 878 | 879 | tr_loss += loss.item() 880 | nb_tr_examples += input_ids.size(0) 881 | nb_tr_steps += 1 882 | if (step + 1) % args.gradient_accumulation_steps == 0: 883 | if args.fp16: 884 | # modify learning rate with special warm up BERT uses 885 | # if args.fp16 is False, BertAdam is used that handles this automatically 886 | lr_this_step = args.learning_rate * warmup_linear.get_lr(global_step/num_train_optimization_steps, 887 | args.warmup_proportion) 888 | for param_group in optimizer.param_groups: 889 | param_group['lr'] = lr_this_step 890 | optimizer.step() 891 | optimizer.zero_grad() 892 | global_step += 1 893 | 894 | if args.do_train and (args.local_rank == -1 or torch.distributed.get_rank() == 0): 895 | # Save a trained model, configuration and tokenizer 896 | model_to_save = model.module if hasattr(model, 'module') else model # Only save the model itself 897 | 898 | # If we save using the predefined names, we can load using `from_pretrained` 899 | output_model_file = os.path.join(args.output_dir, WEIGHTS_NAME) 900 | output_config_file = os.path.join(args.output_dir, CONFIG_NAME) 901 | 902 | torch.save(model_to_save.state_dict(), output_model_file) 903 | model_to_save.config.to_json_file(output_config_file) 904 | tokenizer.save_vocabulary(args.output_dir) 905 | 906 | # Load a trained model and vocabulary that you have fine-tuned 907 | model = BertForSequenceClassification.from_pretrained(args.output_dir, num_labels=num_labels) 908 | tokenizer = BertTokenizer.from_pretrained(args.output_dir, do_lower_case=args.do_lower_case) 909 | else: 910 | model = BertForSequenceClassification.from_pretrained(args.bert_model, num_labels=num_labels) 911 | model.to(device) 912 | 913 | if args.do_eval and (args.local_rank == -1 or torch.distributed.get_rank() == 0): 914 | eval_examples = processor.get_dev_examples(args.data_dir) 915 | eval_features = convert_examples_to_features( 916 | eval_examples, label_list, args.max_seq_length, tokenizer, output_mode) 917 | logger.info("***** Running evaluation *****") 918 | logger.info(" Num examples = %d", 
len(eval_examples)) 919 | logger.info(" Batch size = %d", args.eval_batch_size) 920 | all_input_ids = torch.tensor([f.input_ids for f in eval_features], dtype=torch.long) 921 | all_input_mask = torch.tensor([f.input_mask for f in eval_features], dtype=torch.long) 922 | all_segment_ids = torch.tensor([f.segment_ids for f in eval_features], dtype=torch.long) 923 | 924 | if output_mode == "classification": 925 | all_label_ids = torch.tensor([f.label_id for f in eval_features], dtype=torch.long) 926 | elif output_mode == "regression": 927 | all_label_ids = torch.tensor([f.label_id for f in eval_features], dtype=torch.float) 928 | 929 | eval_data = TensorDataset(all_input_ids, all_input_mask, all_segment_ids, all_label_ids) 930 | # Run prediction for full data 931 | eval_sampler = SequentialSampler(eval_data) 932 | eval_dataloader = DataLoader(eval_data, sampler=eval_sampler, batch_size=args.eval_batch_size) 933 | 934 | model.eval() 935 | eval_loss = 0 936 | nb_eval_steps = 0 937 | preds = [] 938 | 939 | for input_ids, input_mask, segment_ids, label_ids in tqdm(eval_dataloader, desc="Evaluating"): 940 | input_ids = input_ids.to(device) 941 | input_mask = input_mask.to(device) 942 | segment_ids = segment_ids.to(device) 943 | label_ids = label_ids.to(device) 944 | 945 | with torch.no_grad(): 946 | logits = model(input_ids, segment_ids, input_mask, labels=None) 947 | 948 | # create eval loss and other metric required by the task 949 | if output_mode == "classification": 950 | loss_fct = CrossEntropyLoss() 951 | tmp_eval_loss = loss_fct(logits.view(-1, num_labels), label_ids.view(-1)) 952 | elif output_mode == "regression": 953 | loss_fct = MSELoss() 954 | tmp_eval_loss = loss_fct(logits.view(-1), label_ids.view(-1)) 955 | 956 | eval_loss += tmp_eval_loss.mean().item() 957 | nb_eval_steps += 1 958 | if len(preds) == 0: 959 | preds.append(logits.detach().cpu().numpy()) 960 | else: 961 | preds[0] = np.append( 962 | preds[0], logits.detach().cpu().numpy(), axis=0) 963 | 964 | eval_loss = eval_loss / nb_eval_steps 965 | preds = preds[0] 966 | if output_mode == "classification": 967 | preds = np.argmax(preds, axis=1) 968 | elif output_mode == "regression": 969 | preds = np.squeeze(preds) 970 | result = compute_metrics(task_name, preds, all_label_ids.numpy()) 971 | loss = tr_loss/nb_tr_steps if args.do_train else None 972 | 973 | result['eval_loss'] = eval_loss 974 | result['global_step'] = global_step 975 | result['loss'] = loss 976 | 977 | output_eval_file = os.path.join(args.output_dir, "eval_results.txt") 978 | with open(output_eval_file, "w") as writer: 979 | logger.info("***** Eval results *****") 980 | for key in sorted(result.keys()): 981 | logger.info(" %s = %s", key, str(result[key])) 982 | writer.write("%s = %s\n" % (key, str(result[key]))) 983 | 984 | # hack for MNLI-MM 985 | if task_name == "mnli": 986 | task_name = "mnli-mm" 987 | processor = processors[task_name]() 988 | 989 | if os.path.exists(args.output_dir + '-MM') and os.listdir(args.output_dir + '-MM') and args.do_train: 990 | raise ValueError("Output directory ({}) already exists and is not empty.".format(args.output_dir)) 991 | if not os.path.exists(args.output_dir + '-MM'): 992 | os.makedirs(args.output_dir + '-MM') 993 | 994 | eval_examples = processor.get_dev_examples(args.data_dir) 995 | eval_features = convert_examples_to_features( 996 | eval_examples, label_list, args.max_seq_length, tokenizer, output_mode) 997 | logger.info("***** Running evaluation *****") 998 | logger.info(" Num examples = %d", len(eval_examples)) 999 | 
logger.info(" Batch size = %d", args.eval_batch_size) 1000 | all_input_ids = torch.tensor([f.input_ids for f in eval_features], dtype=torch.long) 1001 | all_input_mask = torch.tensor([f.input_mask for f in eval_features], dtype=torch.long) 1002 | all_segment_ids = torch.tensor([f.segment_ids for f in eval_features], dtype=torch.long) 1003 | all_label_ids = torch.tensor([f.label_id for f in eval_features], dtype=torch.long) 1004 | 1005 | eval_data = TensorDataset(all_input_ids, all_input_mask, all_segment_ids, all_label_ids) 1006 | # Run prediction for full data 1007 | eval_sampler = SequentialSampler(eval_data) 1008 | eval_dataloader = DataLoader(eval_data, sampler=eval_sampler, batch_size=args.eval_batch_size) 1009 | 1010 | model.eval() 1011 | eval_loss = 0 1012 | nb_eval_steps = 0 1013 | preds = [] 1014 | 1015 | for input_ids, input_mask, segment_ids, label_ids in tqdm(eval_dataloader, desc="Evaluating"): 1016 | input_ids = input_ids.to(device) 1017 | input_mask = input_mask.to(device) 1018 | segment_ids = segment_ids.to(device) 1019 | label_ids = label_ids.to(device) 1020 | 1021 | with torch.no_grad(): 1022 | logits = model(input_ids, segment_ids, input_mask, labels=None) 1023 | 1024 | loss_fct = CrossEntropyLoss() 1025 | tmp_eval_loss = loss_fct(logits.view(-1, num_labels), label_ids.view(-1)) 1026 | 1027 | eval_loss += tmp_eval_loss.mean().item() 1028 | nb_eval_steps += 1 1029 | if len(preds) == 0: 1030 | preds.append(logits.detach().cpu().numpy()) 1031 | else: 1032 | preds[0] = np.append( 1033 | preds[0], logits.detach().cpu().numpy(), axis=0) 1034 | 1035 | eval_loss = eval_loss / nb_eval_steps 1036 | preds = preds[0] 1037 | preds = np.argmax(preds, axis=1) 1038 | result = compute_metrics(task_name, preds, all_label_ids.numpy()) 1039 | loss = tr_loss/nb_tr_steps if args.do_train else None 1040 | 1041 | result['eval_loss'] = eval_loss 1042 | result['global_step'] = global_step 1043 | result['loss'] = loss 1044 | 1045 | output_eval_file = os.path.join(args.output_dir + '-MM', "eval_results.txt") 1046 | with open(output_eval_file, "w") as writer: 1047 | logger.info("***** Eval results *****") 1048 | for key in sorted(result.keys()): 1049 | logger.info(" %s = %s", key, str(result[key])) 1050 | writer.write("%s = %s\n" % (key, str(result[key]))) 1051 | 1052 | if __name__ == "__main__": 1053 | main() 1054 | -------------------------------------------------------------------------------- /weight_samples.py: -------------------------------------------------------------------------------- 1 | import sys 2 | import math 3 | 4 | collected = [] 5 | 6 | def sm(x): 7 | a = math.exp(x[1]) 8 | b = math.exp(x[2]) 9 | return b/(a+b+1e-8) 10 | 11 | ofh = open(sys.argv[2], 'w', encoding='utf-8') 12 | 13 | for line in open(sys.argv[1], encoding='utf-8'): 14 | chunks = line.strip().split('\t') 15 | chunks[1] = float(chunks[1]) 16 | chunks[2] = float(chunks[2]) 17 | chunks.append( sm(chunks) ) 18 | print("\t".join(map(str,chunks)),file=ofh) 19 | 20 | ofh.close() 21 | --------------------------------------------------------------------------------