├── LICENSE
├── NaNoGenMo_50K_words_sample.txt
├── README.md
├── cleaner_on_bert_weights.py
├── do_critic.py
├── gpt1finetune.py
├── gpt1sample.py
├── gpt1tokenize_trainset.py
├── handwriting.png
├── paranoid_transformer.pdf
├── paranoid_transformer.png
├── paranoid_transformer_back.png
├── paranoid_transformer_w_pics.pdf
├── pics_samples.png
├── simple_cleaner.py
├── train_classifier.py
├── vocab.txt
└── weight_samples.py
/LICENSE:
--------------------------------------------------------------------------------
1 | MIT License
2 |
3 | Copyright (c) 2020 Aleksey Tikhonov
4 |
5 | Permission is hereby granted, free of charge, to any person obtaining a copy
6 | of this software and associated documentation files (the "Software"), to deal
7 | in the Software without restriction, including without limitation the rights
8 | to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
9 | copies of the Software, and to permit persons to whom the Software is
10 | furnished to do so, subject to the following conditions:
11 |
12 | The above copyright notice and this permission notice shall be included in all
13 | copies or substantial portions of the Software.
14 |
15 | THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
16 | IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
17 | FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
18 | AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
19 | LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
20 | OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
21 | SOFTWARE.
22 |
--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
1 | # Paranoid Transformer
2 |
3 | ## TLDR
4 |
5 | In the end, this project turned into a published, neural-network-generated book. [Check out the story behind it in my Medium post](https://medium.com/altsoph/paranoid-transformer-80a960ddc90a).
6 |
7 | ## Overview
8 |
9 |
10 | This is an attempt to build an unsupervised text generator that produces text with specific characteristics of style and form.
11 | Originally, it was published as an entry for [NaNoGenMo 2019](https://github.com/NaNoGenMo/2019/issues/142) (the _National Novel Generation Month_ contest).
12 |
13 | The general idea behind the _Paranoid Transformer_ project is to build a paranoiac-critical system based on two neural networks.
14 | The first network (the _Paranoiac-intrusive Generator_) is a fine-tuned, GPT-based conditional language model, and the second one (the _Critic subsystem_) is a BERT-based classifier that works as a filter, selecting the best passages from the flow of generated text. Finally, I used an existing handwriting synthesis implementation to render a nervous handwritten diary in which the degree of shakiness depends on the sentiment strength of each sentence.
15 |
16 | ## Generator subsystem
17 |
18 | The first network, the Paranoiac-intrusive subsystem AKA Generator, uses the [OpenAI GPT](https://github.com/openai/finetune-transformer-lm) architecture and the [implementation from huggingface](https://github.com/huggingface/transformers). I took a publicly available model already pre-trained on the huge fiction [BooksCorpus dataset](https://arxiv.org/pdf/1506.06724.pdf) of roughly 10K books and ~1B words.
19 |
20 | Next, I fine-tuned it on several additional handcrafted text corpora (~50 MB of text altogether):
21 | - a collection of crypto texts (the Crypto Anarchist Manifesto, the Cyphernomicon, etc.),
22 | - another collection of fiction books (from cyberpunk authors such as Dick and Gibson, plus non-cyberpunk authors such as Kafka and Rumi),
23 | - transcripts and subtitles from some cyberpunk movies and series,
24 | - several thousand quotes and fortune cookie messages collected from different sources.
25 |
26 | During the fine-tuning phase, I used special labels for conditional training of the model:
27 | - _QUOTE_ for any short quote or fortune, _LONG_ for everything else;
28 | - _CYBER_ for cyber-themed texts and _OTHER_ for everything else.
29 | Each text got two labels: for example, _LONG_+_CYBER_ for the Cyphernomicon, _LONG_+_OTHER_ for Kafka, and _QUOTE_+_OTHER_ for fortune cookie messages. Note that there were almost no texts labeled _QUOTE_+_CYBER_, just a few nerd jokes.
30 |
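For illustration, here is a minimal sketch of how this conditioning can be wired up with the same tokenizer the project uses; the token strings and the example text are simplified assumptions, and the actual code lives in gpt1tokenize_trainset.py and gpt1sample.py.

```python
# Minimal sketch of label conditioning (token strings are assumptions, not the exact repo tokens).
from pytorch_pretrained_bert import OpenAIGPTTokenizer

SPECIAL_TOKENS = ["<quotes>", "<long>", "<cyber>", "<other>", "<text>", "<meta1>", "<meta2>", "<pad>"]

tokenizer = OpenAIGPTTokenizer.from_pretrained('openai-gpt')
tokenizer.set_special_tokens(SPECIAL_TOKENS)

def make_example(text, length_label, theme_label):
    """Encode a training text with its two conditioning labels prepended."""
    label_ids = tokenizer.convert_tokens_to_ids([length_label, theme_label])
    return label_ids + tokenizer.encode(text)

# A fortune cookie message gets QUOTE+OTHER; the Cyphernomicon would get LONG+CYBER.
example = make_example("he who laughs last is laughing at you.", "<quotes>", "<other>")

# At sampling time, the model is primed with the labels we actually want: QUOTE+CYBER.
generation_prefix = tokenizer.convert_tokens_to_ids(["<quotes>", "<cyber>"])
```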
31 | At last, in generation mode, I kindly asked the model to generate only _QUOTE_+_CYBER_ texts.
32 | The raw results were already promising enough:
33 |
34 | > terosexuality is pleasures a turn off ; and to me not to be a true blossoming beautiful being is on the other side. the wind is our song, the emotions are our wind and a piano, new things change, new smells kick off in time, a spiritually shifting dust. let your eyes sing music for a while. let your ears measure the bass beat of your soul, the gentle winding of the song. then your ears achieve harmony. you can listen to french playstation on live music together forever, in the philly coffeehouse, in them congressional district of the franklin gap building. let painting melt away every other shred of reason and pain, just lew the paint to move thoughts away from blizzes in death. let it dry out, and turn to cosmic delights, to laugh on the big charms and saxophones and fudatron steames of the sales titanium. we are god's friends, the golden hands on the shoulders of our fears. do you knock my cleaning table over? i snap awake at some dawn. the patrons researching the blues instructor's theories around me, then give me a glass of jim beam. boom! the business group soon concludes. caught one miracle? survive the tedious rituals you refuse to provide? whatever happens, i throw shit in your face. joy ries away? you could give acapindulgent half your life away, though i am nothing especially sexy. this sift, this being sveng? do impotent and desperate oozing drug as i shake and shine? you adored me. brains run out when people charitable that into you.
35 |
36 | Now it was time to do some cleaning.
37 |
38 | ## Heuristic filters
39 |
40 | The next big task was to filter the really good passages out of this endless flow of text.
41 |
42 | At first, I made a script with some simple heuristic filters (see the sketch after this list), such as:
43 | - reject phrases containing newly invented, non-existent words,
44 | - reject phrases with two unconnected verbs in a row,
45 | - reject phrases with several duplicated words,
46 | - reject phrases with no punctuation or with too many punctuation marks.
47 |
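A minimal sketch of what such checks can look like is below; the full version, including the POS-based verb checks, is in simple_cleaner.py.

```python
# Minimal sketch of the heuristic filters (simplified; see simple_cleaner.py for the real thing).
from nltk.tokenize import word_tokenize

PUNCT = {'...', '!', '?', ',', '--', '-', ';', ':', '`', '"', '.'}
vocab = set(line.strip() for line in open('vocab.txt', encoding='utf-8'))

def looks_valid(sentence):
    words = word_tokenize(sentence)
    # reject sentences that invent new, out-of-vocabulary words
    if set(words) - vocab - PUNCT:
        return False
    # reject sentences that repeat the same word twice in a row
    if any(a == b for a, b in zip(words, words[1:])):
        return False
    # reject overly long sentences and long runs of words without any punctuation
    if len(words) >= 60:
        return False
    run = 0
    for w in words:
        run = 0 if w in PUNCT else run + 1
        if run > 25:
            return False
    # reject sentences that do not end with proper punctuation
    return bool(sentence.strip()) and sentence.strip()[-1] in '.!?'
```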
48 | Applying this script cut the initial text flow down to a sequence of valid chunks:
49 |
50 | > a slave has no more say in his language but he hasn't to speak out!
51 | >
52 | > the doll has a variety of languages, so its feelings have to fill up some time of the day - to - day journals.
53 | > the doll is used only when he remains private.
54 | > and it is always effective.
55 | >
56 | > leave him with his monk - like body.
57 | >
58 | > a little of technique on can be helpful.
59 | >
60 | > out of his passions remain in embarrassment and never wake.
61 | >
62 | > adolescence is the university of manchester.
63 | > the senior class of manchester... the senior class of manchester.
64 |
65 | ## Critic subsystem
66 |
67 | At last, I trained the Critic subsystem.
68 | This neural network uses the [BERT](https://github.com/google-research/bert) architecture, again in the [huggingface](https://github.com/huggingface/transformers) implementation. I took a publicly available pre-trained model and fine-tuned it on my dataset of ~1K labeled chunks to predict the label of any given chunk.
69 |
70 | The chunks were labeled manually with two classes, GOOD/BAD. Most of the labeling was done by my friend Ivan [@kr0niker](https://www.yamshchikov.info/) Yamshchikov, and some I did myself. We marked a chunk as BAD if it was grammatically incorrect, too boring, or just too stupid. Overall, I used roughly 1K labeled chunks, balanced between the classes (one half GOOD, the other half BAD).
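Conceptually, this step boils down to scoring every chunk with the fine-tuned classifier and keeping only the confidently GOOD ones. A rough sketch is below; the model directory, the chunks file, and the assumption that class index 1 means GOOD are placeholders, while the 0.9 threshold mirrors cleaner_on_bert_weights.py (the actual pipeline is split across do_critic.py, weight_samples.py, and cleaner_on_bert_weights.py).

```python
# Minimal sketch of the Critic filtering step (paths and the GOOD class index are assumptions).
import torch
from pytorch_pretrained_bert import BertTokenizer, BertForSequenceClassification

model = BertForSequenceClassification.from_pretrained('critic_model_dir', num_labels=2)
tokenizer = BertTokenizer.from_pretrained('critic_model_dir', do_lower_case=True)
model.eval()

def critic_score(chunk, max_len=128):
    """Return the Critic's softmax probability for the GOOD class."""
    tokens = ['[CLS]'] + tokenizer.tokenize(chunk)[:max_len - 2] + ['[SEP]']
    input_ids = torch.tensor([tokenizer.convert_tokens_to_ids(tokens)])
    with torch.no_grad():
        logits = model(input_ids)  # [1, 2] logits; no labels passed, so raw scores come back
    return torch.softmax(logits[0], dim=-1)[1].item()

chunks = [line.strip() for line in open('chunks.txt', encoding='utf-8') if line.strip()]
kept = [chunk for chunk in chunks if critic_score(chunk) >= 0.9]
```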
71 |
72 | Finally, I made a pipeline that includes the Generator subsystem, some heuristic filters, and the Critic subsystem.
73 | Here is a short sample of the final results:
74 |
75 | > a sudden feeling of austin lemons, a gentle stab of disgust.
76 | > i'm what i'm.
77 | >
78 | > humans whirl in night and distance.
79 | >
80 | > by the wonders of them.
81 | >
82 | > we shall never suffer this.
83 | > if the human race came along tomorrow, none of us would be as wise as they already would have been.
84 | > there is a beginning and an end.
85 | >
86 | > both of our grandparents and brothers are overdue.
87 | > he either can not agree or he can look for someone to blame for his death.
88 | >
89 | > he has reappeared from the world of revenge, revenge, separation, hatred.
90 | > he has ceased all who have offended him.
91 | >
92 | > he is the one who can remember that nothing remotely resembles the trip begun in retrospect.
93 | > what's up?
94 | >
95 | > and i don't want the truth.
96 | > not for an hour.
97 |
98 | [A huge blob of generated text can be found here](https://github.com/altsoph/paranoid_transforner/blob/master/NaNoGenMo_50K_words_sample.txt).
99 |
100 | ## Code overview
101 |
102 | Here is a short description of the scripts in this project:
103 | - gpt1tokenize_trainset.py -- used to tokenize the fine-tuning dataset and add the conditioning labels
104 | - gpt1finetune.py -- used to fine-tune the Generator network on the prepared dataset
105 | - gpt1sample.py -- used to sample texts from the Generator network
106 |
107 | - simple_cleaner.py -- holds the heuristic filters
108 |
109 | - train_classifier.py -- used to train the BERT-based classifier (Critic)
110 | - do_critic.py -- applies Critic to the samples
111 | - weight_samples.py + cleaner_on_bert_weights.py -- used to filter samples based on Critic scores
112 |
113 |
114 | ## Nervous handwriting
115 |
116 | Since the resulting text strongly reminded me of neurotic/paranoid notes, I decided to embrace this effect and push it further:
117 |
118 | I took an [implementation by Sean Vasquez](https://github.com/sjvasquez/handwriting-synthesis) of the handwriting synthesis experiments from the paper [Generating Sequences with Recurrent Neural Networks by Alex Graves](https://arxiv.org/abs/1308.0850) and patched it a little. Specifically, I used the bias parameter to make the handwriting shakiness depend on the sentiment strength of a given sentence.
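A minimal sketch of the idea is below. The Hand interface comes from the demo script of that repository, the sentiment scores come from NLTK's VADER analyzer, and the exact sentiment-to-bias mapping here is an illustrative assumption rather than the actual patch.

```python
# Shakier handwriting (lower bias) for sentences with stronger sentiment.
from nltk.sentiment.vader import SentimentIntensityAnalyzer  # needs the 'vader_lexicon' nltk data
from demo import Hand  # Hand class from sjvasquez/handwriting-synthesis

analyzer = SentimentIntensityAnalyzer()
hand = Hand()

lines = ["we shall never suffer this.", "what's up?"]
# VADER's compound score lies in [-1, 1]; stronger sentiment -> lower bias -> shakier strokes.
biases = [1.0 - 0.8 * abs(analyzer.polarity_scores(line)['compound']) for line in lines]

hand.write(filename='diary_page.svg', lines=lines, biases=biases)
```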
119 |
120 | Take a look at the example:
121 |
122 |
123 | ![Handwriting example](handwriting.png)
124 | ## Freehand drawings
125 |
126 | At some point, I realized that the diary lacked freehand drawings, so I decided to add some. I used my modification of a [pytorch implementation](https://github.com/alexis-jacq/Pytorch-Sketch-RNN) of Sketch-RNN ([arXiv:1704.03477](https://arxiv.org/abs/1704.03477)) trained on the
127 | [Quick, Draw! Dataset](https://github.com/googlecreativelab/quickdraw-dataset). Each time a word matching one of the dataset categories appears on a page, I generate a random drawing of it and place it somewhere nearby.
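The trigger logic is roughly the following; quickdraw_categories.txt stands in for a list of the dataset's category names and is an assumed file, and the chosen category is then fed to the Sketch-RNN sampler to produce the actual doodle.

```python
# Minimal sketch of the trigger: find QuickDraw categories mentioned on a page and pick one.
import random
from nltk.tokenize import word_tokenize

# assumed file with one category name per line ("cat", "bicycle", "lightning", ...)
categories = set(line.strip() for line in open('quickdraw_categories.txt', encoding='utf-8'))

def pick_drawing_category(page_text):
    """Return a random QuickDraw category mentioned on the page, or None if there is none."""
    mentioned = [w for w in word_tokenize(page_text.lower()) if w in categories]
    return random.choice(mentioned) if mentioned else None
```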
128 |
129 |
130 | ![Drawing samples](pics_samples.png)
131 | ## Covers and PDF compilation
132 |
133 | I drew some covers and used the [rsvg-convert tool](https://en.wikipedia.org/wiki/Librsvg) (part of librsvg) to build a PDF file from the separate SVG pages.
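The assembly step is essentially one rsvg-convert call over all page files; a minimal sketch (the pages/page_*.svg naming is an assumption) could look like this:

```python
# Merge the per-page SVGs into a single multi-page PDF with rsvg-convert.
import glob
import subprocess

pages = sorted(glob.glob('pages/page_*.svg'))
subprocess.run(['rsvg-convert', '-f', 'pdf', '-o', 'paranoid_transformer.pdf'] + pages, check=True)
```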
134 |
135 | Covers:
136 |
137 |
138 |
139 | The resulting diary (40 MB):
140 | https://github.com/altsoph/paranoid_transformer/raw/master/paranoid_transformer_w_pics.pdf
141 |
142 | ## Papers, publications, releases, links
143 |
144 | * [ICCC 2020 Proceedings, P.146-152](http://computationalcreativity.net/iccc20/wp-content/uploads/2020/09/ICCC20_Proceedings.pdf): Paranoid Transformer. Yana Agafonova, Alexey Tikhonov and Ivan Yamshchikov
145 | * Future Internet Journal: [Paranoid Transformer: Reading Narrative of Madness as Computational Approach to Creativity](https://www.mdpi.com/1999-5903/12/11/182/htm)
146 | * [Pre-order the book](https://deadalivemagazine.com/press/paranoid-transformer.html)
147 |
--------------------------------------------------------------------------------
/cleaner_on_bert_weights.py:
--------------------------------------------------------------------------------
1 | import sys
2 |
3 | ofh = open(sys.argv[2], 'w', encoding='utf-8')
4 |
5 | prev_blank = True
6 | for ln, line in enumerate(open(sys.argv[1], encoding='utf-8')):
7 | text,_,_,score = line.strip().split('\t')
8 |
9 | if text == '----------' or float(score)<0.9:
10 | if not prev_blank:
11 | print(file=ofh)
12 | prev_blank = True
13 | else:
14 | print(text,file=ofh)
15 | prev_blank = False
16 |
17 | ofh.close()
--------------------------------------------------------------------------------
/do_critic.py:
--------------------------------------------------------------------------------
1 | # coding=utf-8
2 | # Copyright 2018 The Google AI Language Team Authors and The HuggingFace Inc. team.
3 | # Copyright (c) 2018, NVIDIA CORPORATION. All rights reserved.
4 | #
5 | # Licensed under the Apache License, Version 2.0 (the "License");
6 | # you may not use this file except in compliance with the License.
7 | # You may obtain a copy of the License at
8 | #
9 | # http://www.apache.org/licenses/LICENSE-2.0
10 | #
11 | # Unless required by applicable law or agreed to in writing, software
12 | # distributed under the License is distributed on an "AS IS" BASIS,
13 | # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
14 | # See the License for the specific language governing permissions and
15 | # limitations under the License.
16 | """BERT finetuning runner."""
17 |
18 | from __future__ import absolute_import, division, print_function
19 |
20 | import argparse
21 | import csv
22 | import logging
23 | import os
24 | import random
25 | import sys
26 |
27 | import numpy as np
28 | import torch
29 | from torch.utils.data import (DataLoader, RandomSampler, SequentialSampler,
30 | TensorDataset)
31 | from torch.utils.data.distributed import DistributedSampler
32 | from tqdm import tqdm, trange
33 |
34 | from torch.nn import CrossEntropyLoss, MSELoss
35 | from scipy.stats import pearsonr, spearmanr
36 | from sklearn.metrics import matthews_corrcoef, f1_score
37 |
38 | from pytorch_pretrained_bert.file_utils import PYTORCH_PRETRAINED_BERT_CACHE, WEIGHTS_NAME, CONFIG_NAME
39 | from pytorch_pretrained_bert.modeling import BertForSequenceClassification, BertConfig
40 | from pytorch_pretrained_bert.tokenization import BertTokenizer
41 | from pytorch_pretrained_bert.optimization import BertAdam, WarmupLinearSchedule
42 |
43 | logger = logging.getLogger(__name__)
44 |
45 |
46 | class InputExample(object):
47 | """A single training/test example for simple sequence classification."""
48 |
49 | def __init__(self, guid, text_a, text_b=None, label=None):
50 | """Constructs a InputExample.
51 |
52 | Args:
53 | guid: Unique id for the example.
54 | text_a: string. The untokenized text of the first sequence. For single
55 | sequence tasks, only this sequence must be specified.
56 | text_b: (Optional) string. The untokenized text of the second sequence.
57 | Only must be specified for sequence pair tasks.
58 | label: (Optional) string. The label of the example. This should be
59 | specified for train and dev examples, but not for test examples.
60 | """
61 | self.guid = guid
62 | self.text_a = text_a
63 | self.text_b = text_b
64 | self.label = label
65 |
66 |
67 | class InputFeatures(object):
68 | """A single set of features of data."""
69 |
70 | def __init__(self, input_ids, input_mask, segment_ids, label_id):
71 | self.input_ids = input_ids
72 | self.input_mask = input_mask
73 | self.segment_ids = segment_ids
74 | self.label_id = label_id
75 |
76 |
77 | class DataProcessor(object):
78 | """Base class for data converters for sequence classification data sets."""
79 |
80 | def get_train_examples(self, data_dir):
81 | """Gets a collection of `InputExample`s for the train set."""
82 | raise NotImplementedError()
83 |
84 | def get_dev_examples(self, data_dir):
85 | """Gets a collection of `InputExample`s for the dev set."""
86 | raise NotImplementedError()
87 |
88 | def get_labels(self):
89 | """Gets the list of labels for this data set."""
90 | raise NotImplementedError()
91 |
92 | @classmethod
93 | def _read_tsv(cls, input_file, quotechar=None):
94 | """Reads a tab separated value file."""
95 | with open(input_file, "r", encoding="utf-8") as f:
96 | reader = csv.reader(f, delimiter="\t", quotechar=quotechar)
97 | lines = []
98 | for line in reader:
99 | if sys.version_info[0] == 2:
100 | line = list(unicode(cell, 'utf-8') for cell in line)
101 | lines.append(line)
102 | return lines
103 |
104 |
105 |
106 | class ColaProcessor(DataProcessor):
107 | """Processor for the CoLA data set (GLUE version)."""
108 |
109 | def get_train_examples(self, data_dir):
110 | """See base class."""
111 | return self._create_examples(
112 | self._read_tsv(os.path.join(data_dir, "train.tsv")), "train")
113 |
114 | def get_dev_examples(self, data_dir):
115 | """See base class."""
116 | return self._create_examples(
117 | self._read_tsv(os.path.join(data_dir, "dev.tsv")), "dev")
118 |
119 | def get_pred_examples(self, data_dir):
120 | """See base class."""
121 | lines = self._read_tsv(os.path.join(data_dir, "pred.tsv"))
122 | examples = []
123 | for (i, line) in enumerate(lines):
124 | if i == 0:
125 | continue
126 | guid = "%s-%s" % ('pred', i)
127 | text_a = line[0]
128 | text_b = None
129 | label = str(i%2)
130 | examples.append(
131 | InputExample(guid=guid, text_a=text_a, text_b=text_b, label=label))
132 | return examples
133 |
134 | def get_labels(self):
135 | """See base class."""
136 | return ["0", "1"]
137 |
138 | def _create_examples(self, lines, set_type):
139 | """Creates examples for the training and dev sets."""
140 | examples = []
141 | for (i, line) in enumerate(lines):
142 | guid = "%s-%s" % (set_type, i)
143 | text_a = line[3]
144 | label = line[1]
145 | examples.append(
146 | InputExample(guid=guid, text_a=text_a, text_b=None, label=label))
147 | return examples
148 |
149 | def convert_examples_to_features(examples, label_list, max_seq_length,
150 | tokenizer, output_mode='classification'):
151 | """Loads a data file into a list of `InputBatch`s."""
152 |
153 | label_map = {label : i for i, label in enumerate(label_list)}
154 |
155 | features = []
156 | for (ex_index, example) in enumerate(examples):
157 | if ex_index % 10000 == 0:
158 | logger.info("Writing example %d of %d" % (ex_index, len(examples)))
159 |
160 | tokens_a = tokenizer.tokenize(example.text_a)
161 |
162 | tokens_b = None
163 | if example.text_b:
164 | tokens_b = tokenizer.tokenize(example.text_b)
165 | # Modifies `tokens_a` and `tokens_b` in place so that the total
166 | # length is less than the specified length.
167 | # Account for [CLS], [SEP], [SEP] with "- 3"
168 | _truncate_seq_pair(tokens_a, tokens_b, max_seq_length - 3)
169 | else:
170 | # Account for [CLS] and [SEP] with "- 2"
171 | if len(tokens_a) > max_seq_length - 2:
172 | tokens_a = tokens_a[:(max_seq_length - 2)]
173 |
174 | # The convention in BERT is:
175 | # (a) For sequence pairs:
176 | # tokens: [CLS] is this jack ##son ##ville ? [SEP] no it is not . [SEP]
177 | # type_ids: 0 0 0 0 0 0 0 0 1 1 1 1 1 1
178 | # (b) For single sequences:
179 | # tokens: [CLS] the dog is hairy . [SEP]
180 | # type_ids: 0 0 0 0 0 0 0
181 | #
182 | # Where "type_ids" are used to indicate whether this is the first
183 | # sequence or the second sequence. The embedding vectors for `type=0` and
184 | # `type=1` were learned during pre-training and are added to the wordpiece
185 | # embedding vector (and position vector). This is not *strictly* necessary
186 | # since the [SEP] token unambiguously separates the sequences, but it makes
187 | # it easier for the model to learn the concept of sequences.
188 | #
189 | # For classification tasks, the first vector (corresponding to [CLS]) is
190 | # used as as the "sentence vector". Note that this only makes sense because
191 | # the entire model is fine-tuned.
192 | tokens = ["[CLS]"] + tokens_a + ["[SEP]"]
193 | segment_ids = [0] * len(tokens)
194 |
195 | if tokens_b:
196 | tokens += tokens_b + ["[SEP]"]
197 | segment_ids += [1] * (len(tokens_b) + 1)
198 |
199 | input_ids = tokenizer.convert_tokens_to_ids(tokens)
200 |
201 | # The mask has 1 for real tokens and 0 for padding tokens. Only real
202 | # tokens are attended to.
203 | input_mask = [1] * len(input_ids)
204 |
205 | # Zero-pad up to the sequence length.
206 | padding = [0] * (max_seq_length - len(input_ids))
207 | input_ids += padding
208 | input_mask += padding
209 | segment_ids += padding
210 |
211 | assert len(input_ids) == max_seq_length
212 | assert len(input_mask) == max_seq_length
213 | assert len(segment_ids) == max_seq_length
214 |
215 | if output_mode == "classification":
216 | label_id = label_map[example.label]
217 | elif output_mode == "regression":
218 | label_id = float(example.label)
219 | else:
220 | raise KeyError(output_mode)
221 |
222 | if ex_index < 5:
223 | logger.info("*** Example ***")
224 | logger.info("guid: %s" % (example.guid))
225 | logger.info("tokens: %s" % " ".join(
226 | [str(x) for x in tokens]))
227 | logger.info("input_ids: %s" % " ".join([str(x) for x in input_ids]))
228 | logger.info("input_mask: %s" % " ".join([str(x) for x in input_mask]))
229 | logger.info(
230 | "segment_ids: %s" % " ".join([str(x) for x in segment_ids]))
231 | logger.info("label: %s (id = %d)" % (example.label, label_id))
232 |
233 | features.append(
234 | InputFeatures(input_ids=input_ids,
235 | input_mask=input_mask,
236 | segment_ids=segment_ids,
237 | label_id=label_id))
238 | return features
239 |
240 |
241 | def _truncate_seq_pair(tokens_a, tokens_b, max_length):
242 | """Truncates a sequence pair in place to the maximum length."""
243 |
244 | # This is a simple heuristic which will always truncate the longer sequence
245 | # one token at a time. This makes more sense than truncating an equal percent
246 | # of tokens from each, since if one sequence is very short then each token
247 | # that's truncated likely contains more information than a longer sequence.
248 | while True:
249 | total_length = len(tokens_a) + len(tokens_b)
250 | if total_length <= max_length:
251 | break
252 | if len(tokens_a) > len(tokens_b):
253 | tokens_a.pop()
254 | else:
255 | tokens_b.pop()
256 |
257 |
258 | def simple_accuracy(preds, labels):
259 | return (preds == labels).mean()
260 |
261 |
262 | def compute_metrics(task_name, preds, labels):
263 | assert len(preds) == len(labels)
264 | return {"mcc": matthews_corrcoef(labels, preds),"acc": simple_accuracy(preds, labels)}
265 |
266 |
267 | def main():
268 | parser = argparse.ArgumentParser()
269 |
270 | ## Required parameters
271 | parser.add_argument("--data_dir",
272 | default=None,
273 | type=str,
274 | required=True,
275 | help="The input data dir. Should contain the .tsv files (or other data files) for the task.")
276 | parser.add_argument("--bert_model", default=None, type=str, required=True,
277 | help="Bert pre-trained model selected in the list: bert-base-uncased, "
278 | "bert-large-uncased, bert-base-cased, bert-large-cased, bert-base-multilingual-uncased, "
279 | "bert-base-multilingual-cased, bert-base-chinese.")
280 | parser.add_argument("--output_dir",
281 | default=None,
282 | type=str,
283 | required=True,
284 | help="The output directory where the model predictions and checkpoints will be written.")
285 |
286 | ## Other parameters
287 | parser.add_argument("--cache_dir",
288 | default="",
289 | type=str,
290 | help="Where do you want to store the pre-trained models downloaded from s3")
291 | parser.add_argument("--max_seq_length",
292 | default=128,
293 | type=int,
294 | help="The maximum total input sequence length after WordPiece tokenization. \n"
295 | "Sequences longer than this will be truncated, and sequences shorter \n"
296 | "than this will be padded.")
297 | parser.add_argument("--do_eval",
298 | action='store_true',
299 | help="Whether to run eval on the dev set.")
300 | parser.add_argument("--do_lower_case",
301 | action='store_true',
302 | help="Set this flag if you are using an uncased model.")
303 | parser.add_argument("--eval_batch_size",
304 | default=8,
305 | type=int,
306 | help="Total batch size for eval.")
307 | parser.add_argument("--no_cuda",
308 | action='store_true',
309 | help="Whether not to use CUDA when available")
310 | parser.add_argument("--local_rank",
311 | type=int,
312 | default=-1,
313 | help="local_rank for distributed training on gpus")
314 | parser.add_argument('--seed',
315 | type=int,
316 | default=42,
317 | help="random seed for initialization")
318 |
319 | args = parser.parse_args()
320 |
321 | device = torch.device("cuda" if torch.cuda.is_available() and not args.no_cuda else "cpu")
322 | n_gpu = torch.cuda.device_count()
323 |
324 | logging.basicConfig(format = '%(asctime)s - %(levelname)s - %(name)s - %(message)s',
325 | datefmt = '%m/%d/%Y %H:%M:%S',
326 | level = logging.INFO if args.local_rank in [-1, 0] else logging.WARN)
327 |
328 | logger.info("device: {} n_gpu: {}, distributed training: {}, 16-bits training: -".format(
329 | device, n_gpu, bool(args.local_rank != -1) ))
330 |
331 | random.seed(args.seed)
332 | np.random.seed(args.seed)
333 | torch.manual_seed(args.seed)
334 | if n_gpu > 0:
335 | torch.cuda.manual_seed_all(args.seed)
336 |
337 | if not os.path.exists(args.output_dir):
338 | # os.makedirs(args.output_dir)
339 | raise ValueError("No model output dir found.")
340 |
341 | processor = ColaProcessor() # processors[task_name]()
342 |
343 | label_list = processor.get_labels()
344 | num_labels = len(label_list)
345 |
346 | global_step = 0
347 |
348 | model = BertForSequenceClassification.from_pretrained(args.output_dir, num_labels=num_labels)
349 | tokenizer = BertTokenizer.from_pretrained(args.output_dir, do_lower_case=args.do_lower_case)
350 | model.to(device)
351 |
352 | # if args.do_eval and (args.local_rank == -1 or torch.distributed.get_rank() == 0):
353 | eval_examples = processor.get_pred_examples(args.data_dir)
354 | eval_features = convert_examples_to_features(
355 | eval_examples, label_list, args.max_seq_length, tokenizer) # , output_mode)
356 | logger.info("***** Running evaluation *****")
357 | logger.info(" Num examples = %d", len(eval_examples))
358 | logger.info(" Batch size = %d", args.eval_batch_size)
359 | # print(eval_examples[:10])
360 | all_input_idxs = torch.tensor([idx for idx,f in enumerate(eval_features)], dtype=torch.long)
361 | all_input_ids = torch.tensor([f.input_ids for f in eval_features], dtype=torch.long)
362 | all_input_mask = torch.tensor([f.input_mask for f in eval_features], dtype=torch.long)
363 | all_segment_ids = torch.tensor([f.segment_ids for f in eval_features], dtype=torch.long)
364 |
365 | all_label_ids = torch.tensor([f.label_id for f in eval_features], dtype=torch.long)
366 |
367 | eval_data = TensorDataset(all_input_ids, all_input_mask, all_segment_ids, all_label_ids, all_input_idxs)
368 | eval_sampler = SequentialSampler(eval_data)
369 | eval_dataloader = DataLoader(eval_data, sampler=eval_sampler, batch_size=args.eval_batch_size)
370 |
371 | model.eval()
372 | # model.predict()
373 | eval_loss = 0
374 | nb_eval_steps = 0
375 | preds = []
376 |
377 | for input_ids, input_mask, segment_ids, label_ids, text_idxs in tqdm(eval_dataloader, desc="Evaluating", disable=True):
378 | input_ids = input_ids.to(device)
379 | input_mask = input_mask.to(device)
380 | segment_ids = segment_ids.to(device)
381 | label_ids = label_ids.to(device)
382 |
383 | with torch.no_grad():
384 | logits = model(input_ids, segment_ids, input_mask, labels=None)
385 | for idx,logit in zip(list(text_idxs.data),list(logits.data)):
386 | # print(idx,logit)
387 | print("%s\t%f\t%f" % ( eval_examples[idx.item()].text_a,logit[0].item(),logit[1].item()) )
388 |
389 |
390 | if __name__ == "__main__":
391 | main()
392 |
--------------------------------------------------------------------------------
/gpt1finetune.py:
--------------------------------------------------------------------------------
1 | import os
2 | import random
3 | import json
4 |
5 | import nltk
6 | import torch
7 | # from apex import amp
8 | from tqdm import tqdm, trange
9 | from pytorch_pretrained_bert import OpenAIGPTLMHeadModel, OpenAIGPTTokenizer, OpenAIAdam
10 |
11 | SPECIAL_TOKENS = ["<quotes>", "<long>", "<cyber>", "<other>", "<text>", "<meta1>", "<meta2>", "<pad>"]  # conditioning/control tokens (exact strings are assumed)
12 | LR = 6.25e-5
13 | MAX_LEN = 500
14 | BATCH_SIZE = 13
15 |
16 | OUTPUT_DIR = "/home/altsoph/current"
17 | random.seed(0xDEADFEED)
18 |
19 |
20 |
21 | device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
22 | n_gpu = torch.cuda.device_count()
23 | model = OpenAIGPTLMHeadModel.from_pretrained('openai-gpt')
24 | tokenizer = OpenAIGPTTokenizer.from_pretrained('openai-gpt')
25 |
26 | tokenizer.set_special_tokens(SPECIAL_TOKENS)
27 | model.set_num_special_tokens(len(SPECIAL_TOKENS))
28 | model.to(device)
29 | optimizer = OpenAIAdam(model.parameters(),
30 | lr=LR,
31 | warmup=0.002,
32 | max_grad_norm=1,
33 | weight_decay=0.01)
34 |
35 | TAG_TEXT, TAG_META1, TAG_META2, TAG_PAD = tokenizer.convert_tokens_to_ids(("<text>", "<meta1>", "<meta2>", "<pad>"))
36 |
37 | def pad(x, padding, padding_length):
38 | return x + [padding] * (padding_length - len(x))
39 |
40 | dataset = []
41 | for line in open('gpt1_trainset_tokens.tsv'):
42 | chunks = line.strip().split('\t')
43 | tokens = list(map(int,chunks[2].split(',')))
44 | if len(tokens)<8: continue
45 | segments = [TAG_META1, TAG_META2] + [TAG_TEXT for _ in tokens[2:]]
46 | positions = list(range(len(tokens)))
47 | lm_targets = [-1, -1, -1] + tokens[3:]
48 | dataset.append( (len(tokens), tokens, segments, positions, lm_targets) )
49 |
50 | model.train()
51 |
52 | for epoch in range(10):
53 | exp_average_loss = None
54 | nb_tr_steps = 0
55 | tr_loss = 0
56 |
57 | dataset = list(sorted(dataset,key=lambda x:random.random()))
58 |
59 | tqdm_bar = tqdm(range(0,len(dataset),BATCH_SIZE), desc="Training", mininterval=6.0)
60 | for batch_num,batch_start in enumerate(tqdm_bar):
61 |
62 | batch_raw = dataset[batch_start:batch_start+BATCH_SIZE]
63 | pad_size = max(map(lambda x:x[0],batch_raw))
64 |
65 | input_words = []
66 | input_segments = []
67 | input_targets = []
68 |
69 | for _,words,segments,_,targets in batch_raw:
70 | input_words.append( pad(words,TAG_PAD,pad_size) )
71 | input_segments.append( pad(segments,TAG_PAD,pad_size) )
72 | input_targets.append( pad(targets,-1,pad_size) )
73 |
74 | input_ids = torch.tensor(input_words, dtype=torch.long)
75 | token_type_ids = torch.tensor(input_segments, dtype=torch.long)
76 | lm_labels = torch.tensor(input_targets, dtype=torch.long)
77 |
78 | loss = model(input_ids.to(device), lm_labels=lm_labels.to(device), token_type_ids=token_type_ids.to(device))
79 |
80 | loss.backward()
81 | optimizer.step()
82 | optimizer.zero_grad()
83 | tr_loss += loss.item()
84 | exp_average_loss = loss.item() if exp_average_loss is None else 0.7*exp_average_loss+0.3*loss.item()
85 | nb_tr_steps += 1
86 | tqdm_bar.desc = "Epoch {:02}, batch {:05}/{:05}. Training loss: {:.2e} lr: {:.2e}".format(epoch, batch_num, len(dataset)//BATCH_SIZE, exp_average_loss, optimizer.get_lr()[0])
87 |
88 | model_to_save = model.module if hasattr(model, 'module') else model # Only save the model it-self
89 | torch.save(model_to_save.state_dict(), os.path.join(OUTPUT_DIR, "pytorch_model.bin"))
90 | model_to_save.config.to_json_file(os.path.join(OUTPUT_DIR, "config.json"))
91 | tokenizer.save_vocabulary(OUTPUT_DIR)
92 |
--------------------------------------------------------------------------------
/gpt1sample.py:
--------------------------------------------------------------------------------
1 | #!/usr/bin/env python3
2 |
3 | import os
4 | import random
5 | import json
6 |
7 | import numpy as np
8 | import nltk
9 | import torch
10 | import torch.nn.functional as F
11 |
12 | # from apex import amp
13 | from tqdm import tqdm, trange
14 | from pytorch_pretrained_bert import OpenAIGPTLMHeadModel, OpenAIGPTTokenizer, OpenAIAdam
15 |
16 |
17 | SAMPLES = 16384
18 | BATCH_SIZE = 32
19 |
20 | MAX_LEN = 500
21 | MODEL_DIR = "/home/altsoph/current"
22 | SEED = 0xDEADFEED
23 |
24 |
25 | def top_k_logits(logits, k):
26 | """
27 | Masks everything but the k top entries as -infinity (1e10).
28 | Used to mask logits such that e^-infinity -> 0 won't contribute to the
29 | sum of the denominator.
30 | """
31 | if k == 0:
32 | return logits
33 | else:
34 | values = torch.topk(logits, k)[0]
35 | batch_mins = values[:, -1].view(-1, 1).expand_as(logits)
36 | return torch.where(logits < batch_mins, torch.ones_like(logits) * -1e10, logits)
37 |
38 | def sample_sequence(model, length, segments=None, batch_size=None, context=None, temperature=1, top_k=0, device='cuda', sample=True, text_tag=0):
39 | context = torch.tensor(context, device=device, dtype=torch.long).unsqueeze(0).repeat(batch_size, 1)
40 | segments = torch.tensor(segments, device=device, dtype=torch.long).unsqueeze(0).repeat(batch_size, 1)
41 | text_tag_tpl = torch.tensor([text_tag,], device=device, dtype=torch.long).unsqueeze(0).repeat(batch_size, 1)
42 |
43 | prev = context
44 | output = context
45 | prev_segments = segments
46 | past = None
47 | with torch.no_grad():
48 | for i in trange(length):
49 | # model(input_ids.to(device), lm_labels=lm_labels.to(device), token_type_ids=token_type_ids.to(device))
50 | logits = model(output, token_type_ids=prev_segments)
51 | logits = logits[:, -1, :] / temperature
52 | logits = top_k_logits(logits, k=top_k)
53 | log_probs = F.softmax(logits, dim=-1)
54 | if sample:
55 | prev = torch.multinomial(log_probs, num_samples=1)
56 | else:
57 | _, prev = torch.topk(log_probs, k=1, dim=-1)
58 | output = torch.cat((output, prev), dim=1)
59 | prev_segments = torch.cat((prev_segments, text_tag_tpl), dim=1)
60 | return output
61 |
62 | random.seed(SEED)
63 | torch.random.manual_seed(SEED)
64 | torch.cuda.manual_seed(SEED)
65 |
66 | device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
67 | n_gpu = torch.cuda.device_count()
68 | model = OpenAIGPTLMHeadModel.from_pretrained(MODEL_DIR)
69 | tokenizer = OpenAIGPTTokenizer.from_pretrained(MODEL_DIR)
70 |
71 | model.to(device)
72 |
73 | TAG_QUOTES, TAG_CYBER, TAG_TEXT, TAG_META1, TAG_META2, TAG_PAD = tokenizer.convert_tokens_to_ids(
74 | ("", "", "", "", "", ""))
75 |
76 | context_tokens = [TAG_QUOTES, TAG_CYBER]
77 | context_segments = [TAG_META1, TAG_META2]
78 |
79 | generated = 0
80 |
81 | for _ in range(SAMPLES // BATCH_SIZE):
82 | out = sample_sequence(
83 | model=model, length=MAX_LEN,
84 | context=context_tokens,
85 | segments=context_segments,
86 | batch_size=BATCH_SIZE,
87 | temperature=1, top_k=0, device=device,
88 | text_tag = TAG_TEXT
89 | )
90 | out = out[:, len(context_tokens):].tolist()
91 | for i in range(BATCH_SIZE):
92 | generated += 1
93 | text = tokenizer.decode(out[i])
94 | print("=" * 35 + " SAMPLE " + str(generated) + " " + "=" * (36-len(str(generated))) )
95 | print(text)
96 |
97 |
--------------------------------------------------------------------------------
/gpt1tokenize_trainset.py:
--------------------------------------------------------------------------------
1 | import random
2 | import nltk
3 | from pytorch_pretrained_bert import OpenAIGPTDoubleHeadsModel, OpenAIGPTTokenizer
4 |
5 | model = OpenAIGPTDoubleHeadsModel.from_pretrained('openai-gpt')
6 | tokenizer = OpenAIGPTTokenizer.from_pretrained('openai-gpt')
7 |
8 | SPECIAL_TOKENS = ["<quotes>", "<long>", "<cyber>", "<other>", "<text>", "<meta1>", "<meta2>"]  # conditioning/control tokens (exact strings are assumed)
9 |
10 | # We can add these special tokens to the vocabulary and the embeddings of the model:
11 | tokenizer.set_special_tokens(SPECIAL_TOKENS)
12 | model.set_num_special_tokens(len(SPECIAL_TOKENS))
13 |
14 | MAX_LEN = 500
15 |
16 | dataset = []
17 | for fn,meta1,meta2 in (('long_cyberpunk.txt','<long>','<cyber>'),('quotes_cyberpunk.txt','<quotes>','<cyber>'),
18 |                        ('long_others.txt','<long>','<other>'),('quotes_others.txt','<quotes>','<other>')):
19 | meta_tokens = tokenizer.convert_tokens_to_ids((meta1,meta2))
20 | for line in open(fn, encoding='utf-8', errors='ignore'):
21 | if not line.strip(): continue
22 | # meta_tokens = tokenizer.encode("%s %s" %(meta1,meta2))
23 | # segments = tokenizer.convert_tokens_to_ids(segments)
24 | tokens = tokenizer.encode(line.strip())
25 | if len(tokens)>MAX_LEN:
26 | # print('too long',len(tokens))
27 | sentences = nltk.sent_tokenize(line.strip())
28 | # print(sentences)
29 | sentences_tokens = [tokenizer.encode(sentence) for sentence in sentences]
30 | # print(sentences_tokens)
31 | collected = []
32 | for sentence_tokens in sentences_tokens:
33 | if 0 in sentences_tokens or len(collected)+len(sentence_tokens)>MAX_LEN:
34 | # print(len(collected),collected)
35 | dataset.append( (meta1,meta2,meta_tokens+collected) )
36 | collected = []
37 | if len(sentence_tokens)<=MAX_LEN:
38 | collected.extend(sentence_tokens)
39 | if collected:
40 | # print(len(collected),collected)
41 | dataset.append( (meta1,meta2,meta_tokens+collected) )
42 | # exit()
43 | else:
44 | dataset.append( (meta1,meta2,meta_tokens+tokens) )
45 | for m1,m2,token_ids in dataset:
46 | print("%s\t%s\t%s" % (m1,m2,",".join(map(str,token_ids))))
47 |
--------------------------------------------------------------------------------
/handwriting.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/altsoph/paranoid_transformer/86d77697a4dd8c61e9552a8e03fab2b01ad1103e/handwriting.png
--------------------------------------------------------------------------------
/paranoid_transformer.pdf:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/altsoph/paranoid_transformer/86d77697a4dd8c61e9552a8e03fab2b01ad1103e/paranoid_transformer.pdf
--------------------------------------------------------------------------------
/paranoid_transformer.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/altsoph/paranoid_transformer/86d77697a4dd8c61e9552a8e03fab2b01ad1103e/paranoid_transformer.png
--------------------------------------------------------------------------------
/paranoid_transformer_back.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/altsoph/paranoid_transformer/86d77697a4dd8c61e9552a8e03fab2b01ad1103e/paranoid_transformer_back.png
--------------------------------------------------------------------------------
/paranoid_transformer_w_pics.pdf:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/altsoph/paranoid_transformer/86d77697a4dd8c61e9552a8e03fab2b01ad1103e/paranoid_transformer_w_pics.pdf
--------------------------------------------------------------------------------
/pics_samples.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/altsoph/paranoid_transformer/86d77697a4dd8c61e9552a8e03fab2b01ad1103e/pics_samples.png
--------------------------------------------------------------------------------
/simple_cleaner.py:
--------------------------------------------------------------------------------
1 | import sys
2 | from nltk.tokenize import sent_tokenize, word_tokenize
3 | from collections import defaultdict
4 | from nltk import pos_tag
5 |
6 | vocab = set()
7 | for line in open("vocab.txt", encoding='utf-8'):
8 | vocab.add( line.strip() )
9 |
10 | outfh = open(sys.argv[2], "w", encoding='utf-8')
11 |
12 | lines_cnt = sentences_cnt = 0
13 | cases = defaultdict(int)
14 | for ln, line in enumerate(open(sys.argv[1], encoding='utf-8')):
15 | if line[0] == '=': continue
16 | lines_cnt += 1
17 |
18 | sents = sent_tokenize(line)
19 | sentences_cnt += len(sents)
20 | print(file=outfh)
21 | for sent in sents:
22 | words = word_tokenize(sent)
23 | tmp = sent.replace('#',' ').replace('...','#').replace('!','#').replace('?','#').replace(',','#').replace('--','#').replace('-','#').replace(';','#')\
24 | .replace(':','#').replace('`','#').replace('"','#').replace('.','#').replace(' ','')
25 |
26 | no_punct = []
27 | size = 0
28 | for npidx,w in enumerate(words):
29 | if w in ('...','!','?',',','--','-',';',':','`','"','.'):
30 | no_punct.append(size)
31 | size = 0
32 | else:
33 | size += 1
34 | no_punct.append(size)
35 |
36 | pos = pos_tag(words)
37 | skip = False
38 | for idx,(w,p) in enumerate(pos[:-1]):
39 | # VB verb, base form take
40 | # VBD verb, past tense took
41 | # VBG verb, gerund/present participle taking
42 | # VBN verb, past participle taken
43 | # VBP verb, sing. present, non-3d take
44 | # VBZ verb, 3rd person sing. present takes
45 | if p in ('VB','VBD','VBG','VBN','VBP','VBZ') and pos[idx+1][1] in ('VB','VBD','VBG','VBN','VBP','VBZ'):
46 | if p == 'VBD' and pos[idx+1][1] == 'VBN': continue
47 | if p == 'VB' and pos[idx+1][1] == 'VBN': continue
48 | if p == 'VBP' and pos[idx+1][1] == 'VBN': continue
49 | if p == 'VBZ' and pos[idx+1][1] == 'VBN': continue
50 | if w == 'been' and pos[idx+1][1] == 'VBN': continue
51 | if w in ('be','was','are','is',"'re","'s","been","have") and pos[idx+1][1] == 'VBG': continue
52 | if w == 'i': continue
53 | # print('VERB', (w,p), pos[idx+1], sent)
54 | cases['verbverb'] += 1
55 | skip = True
56 | break
57 | # it's bad if several verbs in a row
58 | if set(words)-vocab:
59 | cases['new_word'] += 1
60 | skip = True
61 | elif max(no_punct)>25:
62 | cases['no_punct'] += 1
63 | skip = True
64 | elif len(words)>=60:
65 | cases['to_long'] += 1
66 | skip = True
67 | elif "###" in tmp:
68 | cases['manypuncts'] += 1
69 | skip = True
70 | for idx,w in enumerate(words[:-1]):
71 | if w == words[idx+1]:
72 | cases['duplicate_words'] += 1
73 | skip = True
74 | break
75 | if sent[-1] not in '.!?':
76 | cases['badend'] += 1
77 | skip = True
78 |
79 | if skip:
80 | print(file=outfh)
81 | else:
82 | print(sent,file=outfh)
83 |
84 | print(lines_cnt, sentences_cnt, cases.items())
85 |
86 | outfh.close()
87 |
--------------------------------------------------------------------------------
/train_classifier.py:
--------------------------------------------------------------------------------
1 | # coding=utf-8
2 | # Copyright 2018 The Google AI Language Team Authors and The HuggingFace Inc. team.
3 | # Copyright (c) 2018, NVIDIA CORPORATION. All rights reserved.
4 | #
5 | # Licensed under the Apache License, Version 2.0 (the "License");
6 | # you may not use this file except in compliance with the License.
7 | # You may obtain a copy of the License at
8 | #
9 | # http://www.apache.org/licenses/LICENSE-2.0
10 | #
11 | # Unless required by applicable law or agreed to in writing, software
12 | # distributed under the License is distributed on an "AS IS" BASIS,
13 | # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
14 | # See the License for the specific language governing permissions and
15 | # limitations under the License.
16 | """BERT finetuning runner."""
17 |
18 | from __future__ import absolute_import, division, print_function
19 |
20 | import argparse
21 | import csv
22 | import logging
23 | import os
24 | import random
25 | import sys
26 |
27 | import numpy as np
28 | import torch
29 | from torch.utils.data import (DataLoader, RandomSampler, SequentialSampler,
30 | TensorDataset)
31 | from torch.utils.data.distributed import DistributedSampler
32 | from tqdm import tqdm, trange
33 |
34 | from torch.nn import CrossEntropyLoss, MSELoss
35 | from scipy.stats import pearsonr, spearmanr
36 | from sklearn.metrics import matthews_corrcoef, f1_score
37 |
38 | from pytorch_pretrained_bert.file_utils import PYTORCH_PRETRAINED_BERT_CACHE, WEIGHTS_NAME, CONFIG_NAME
39 | from pytorch_pretrained_bert.modeling import BertForSequenceClassification, BertConfig
40 | from pytorch_pretrained_bert.tokenization import BertTokenizer
41 | from pytorch_pretrained_bert.optimization import BertAdam, WarmupLinearSchedule
42 |
43 | logger = logging.getLogger(__name__)
44 |
45 |
46 | class InputExample(object):
47 | """A single training/test example for simple sequence classification."""
48 |
49 | def __init__(self, guid, text_a, text_b=None, label=None):
50 | """Constructs a InputExample.
51 |
52 | Args:
53 | guid: Unique id for the example.
54 | text_a: string. The untokenized text of the first sequence. For single
55 | sequence tasks, only this sequence must be specified.
56 | text_b: (Optional) string. The untokenized text of the second sequence.
57 | Only must be specified for sequence pair tasks.
58 | label: (Optional) string. The label of the example. This should be
59 | specified for train and dev examples, but not for test examples.
60 | """
61 | self.guid = guid
62 | self.text_a = text_a
63 | self.text_b = text_b
64 | self.label = label
65 |
66 |
67 | class InputFeatures(object):
68 | """A single set of features of data."""
69 |
70 | def __init__(self, input_ids, input_mask, segment_ids, label_id):
71 | self.input_ids = input_ids
72 | self.input_mask = input_mask
73 | self.segment_ids = segment_ids
74 | self.label_id = label_id
75 |
76 |
77 | class DataProcessor(object):
78 | """Base class for data converters for sequence classification data sets."""
79 |
80 | def get_train_examples(self, data_dir):
81 | """Gets a collection of `InputExample`s for the train set."""
82 | raise NotImplementedError()
83 |
84 | def get_dev_examples(self, data_dir):
85 | """Gets a collection of `InputExample`s for the dev set."""
86 | raise NotImplementedError()
87 |
88 | def get_labels(self):
89 | """Gets the list of labels for this data set."""
90 | raise NotImplementedError()
91 |
92 | @classmethod
93 | def _read_tsv(cls, input_file, quotechar=None):
94 | """Reads a tab separated value file."""
95 | with open(input_file, "r", encoding="utf-8") as f:
96 | reader = csv.reader(f, delimiter="\t", quotechar=quotechar)
97 | lines = []
98 | for line in reader:
99 | if sys.version_info[0] == 2:
100 | line = list(unicode(cell, 'utf-8') for cell in line)
101 | lines.append(line)
102 | return lines
103 |
104 |
105 | class MrpcProcessor(DataProcessor):
106 | """Processor for the MRPC data set (GLUE version)."""
107 |
108 | def get_train_examples(self, data_dir):
109 | """See base class."""
110 | logger.info("LOOKING AT {}".format(os.path.join(data_dir, "train.tsv")))
111 | return self._create_examples(
112 | self._read_tsv(os.path.join(data_dir, "train.tsv")), "train")
113 |
114 | def get_dev_examples(self, data_dir):
115 | """See base class."""
116 | return self._create_examples(
117 | self._read_tsv(os.path.join(data_dir, "dev.tsv")), "dev")
118 |
119 | def get_labels(self):
120 | """See base class."""
121 | return ["0", "1"]
122 |
123 | def _create_examples(self, lines, set_type):
124 | """Creates examples for the training and dev sets."""
125 | examples = []
126 | for (i, line) in enumerate(lines):
127 | if i == 0:
128 | continue
129 | guid = "%s-%s" % (set_type, i)
130 | text_a = line[3]
131 | text_b = line[4]
132 | label = line[0]
133 | examples.append(
134 | InputExample(guid=guid, text_a=text_a, text_b=text_b, label=label))
135 | return examples
136 |
137 |
138 | class MnliProcessor(DataProcessor):
139 | """Processor for the MultiNLI data set (GLUE version)."""
140 |
141 | def get_train_examples(self, data_dir):
142 | """See base class."""
143 | return self._create_examples(
144 | self._read_tsv(os.path.join(data_dir, "train.tsv")), "train")
145 |
146 | def get_dev_examples(self, data_dir):
147 | """See base class."""
148 | return self._create_examples(
149 | self._read_tsv(os.path.join(data_dir, "dev_matched.tsv")),
150 | "dev_matched")
151 |
152 | def get_labels(self):
153 | """See base class."""
154 | return ["contradiction", "entailment", "neutral"]
155 |
156 | def _create_examples(self, lines, set_type):
157 | """Creates examples for the training and dev sets."""
158 | examples = []
159 | for (i, line) in enumerate(lines):
160 | if i == 0:
161 | continue
162 | guid = "%s-%s" % (set_type, line[0])
163 | text_a = line[8]
164 | text_b = line[9]
165 | label = line[-1]
166 | examples.append(
167 | InputExample(guid=guid, text_a=text_a, text_b=text_b, label=label))
168 | return examples
169 |
170 |
171 | class MnliMismatchedProcessor(MnliProcessor):
172 | """Processor for the MultiNLI Mismatched data set (GLUE version)."""
173 |
174 | def get_dev_examples(self, data_dir):
175 | """See base class."""
176 | return self._create_examples(
177 | self._read_tsv(os.path.join(data_dir, "dev_mismatched.tsv")),
178 | "dev_matched")
179 |
180 |
181 | class ColaProcessor(DataProcessor):
182 | """Processor for the CoLA data set (GLUE version)."""
183 |
184 | def get_train_examples(self, data_dir):
185 | """See base class."""
186 | return self._create_examples(
187 | self._read_tsv(os.path.join(data_dir, "train.tsv")), "train")
188 |
189 | def get_dev_examples(self, data_dir):
190 | """See base class."""
191 | return self._create_examples(
192 | self._read_tsv(os.path.join(data_dir, "dev.tsv")), "dev")
193 |
194 | def get_labels(self):
195 | """See base class."""
196 | return ["0", "1"]
197 |
198 | def _create_examples(self, lines, set_type):
199 | """Creates examples for the training and dev sets."""
200 | examples = []
201 | for (i, line) in enumerate(lines):
202 | guid = "%s-%s" % (set_type, i)
203 | text_a = line[3]
204 | label = line[1]
205 | examples.append(
206 | InputExample(guid=guid, text_a=text_a, text_b=None, label=label))
207 | return examples
208 |
209 | class OnionProcessor(ColaProcessor):
210 | def _create_examples(self, lines, set_type):
211 | """Creates examples for the training and dev sets."""
212 | examples = []
213 | for (i, line) in enumerate(lines):
214 | left, right = line[0], line[1]
215 | target = 0
216 | guid = str(i)
217 |
218 | # tmp_right = right.split()
219 | # np.random.shuffle(tmp_right)
220 | # right = " ".join(tmp_right)
221 | if np.random.rand() > 0.5:
222 | left, right = right, left
223 | target = 1
224 |
225 | # text_a = tokenization.convert_to_unicode(left)
226 | # text_b = tokenization.convert_to_unicode(right)
227 | text_a = left
228 | text_b = right
229 | label = str(target)
230 | examples.append(
231 | InputExample(guid=guid, text_a=text_a, text_b=text_b, label=label))
232 |
233 | return examples
234 |
235 | class Sst2Processor(DataProcessor):
236 | """Processor for the SST-2 data set (GLUE version)."""
237 |
238 | def get_train_examples(self, data_dir):
239 | """See base class."""
240 | return self._create_examples(
241 | self._read_tsv(os.path.join(data_dir, "train.tsv")), "train")
242 |
243 | def get_dev_examples(self, data_dir):
244 | """See base class."""
245 | return self._create_examples(
246 | self._read_tsv(os.path.join(data_dir, "dev.tsv")), "dev")
247 |
248 | def get_labels(self):
249 | """See base class."""
250 | return ["0", "1"]
251 |
252 | def _create_examples(self, lines, set_type):
253 | """Creates examples for the training and dev sets."""
254 | examples = []
255 | for (i, line) in enumerate(lines):
256 | if i == 0:
257 | continue
258 | guid = "%s-%s" % (set_type, i)
259 | text_a = line[0]
260 | label = line[1]
261 | examples.append(
262 | InputExample(guid=guid, text_a=text_a, text_b=None, label=label))
263 | return examples
264 |
265 |
266 | class StsbProcessor(DataProcessor):
267 | """Processor for the STS-B data set (GLUE version)."""
268 |
269 | def get_train_examples(self, data_dir):
270 | """See base class."""
271 | return self._create_examples(
272 | self._read_tsv(os.path.join(data_dir, "train.tsv")), "train")
273 |
274 | def get_dev_examples(self, data_dir):
275 | """See base class."""
276 | return self._create_examples(
277 | self._read_tsv(os.path.join(data_dir, "dev.tsv")), "dev")
278 |
279 | def get_labels(self):
280 | """See base class."""
281 | return [None]
282 |
283 | def _create_examples(self, lines, set_type):
284 | """Creates examples for the training and dev sets."""
285 | examples = []
286 | for (i, line) in enumerate(lines):
287 | if i == 0:
288 | continue
289 | guid = "%s-%s" % (set_type, line[0])
290 | text_a = line[7]
291 | text_b = line[8]
292 | label = line[-1]
293 | examples.append(
294 | InputExample(guid=guid, text_a=text_a, text_b=text_b, label=label))
295 | return examples
296 |
297 |
298 | class QqpProcessor(DataProcessor):
299 | """Processor for the STS-B data set (GLUE version)."""
300 |
301 | def get_train_examples(self, data_dir):
302 | """See base class."""
303 | return self._create_examples(
304 | self._read_tsv(os.path.join(data_dir, "train.tsv")), "train")
305 |
306 | def get_dev_examples(self, data_dir):
307 | """See base class."""
308 | return self._create_examples(
309 | self._read_tsv(os.path.join(data_dir, "dev.tsv")), "dev")
310 |
311 | def get_labels(self):
312 | """See base class."""
313 | return ["0", "1"]
314 |
315 | def _create_examples(self, lines, set_type):
316 | """Creates examples for the training and dev sets."""
317 | examples = []
318 | for (i, line) in enumerate(lines):
319 | if i == 0:
320 | continue
321 | guid = "%s-%s" % (set_type, line[0])
322 | try:
323 | text_a = line[3]
324 | text_b = line[4]
325 | label = line[5]
326 | except IndexError:
327 | continue
328 | examples.append(
329 | InputExample(guid=guid, text_a=text_a, text_b=text_b, label=label))
330 | return examples
331 |
332 |
333 | class QnliProcessor(DataProcessor):
334 | """Processor for the STS-B data set (GLUE version)."""
335 |
336 | def get_train_examples(self, data_dir):
337 | """See base class."""
338 | return self._create_examples(
339 | self._read_tsv(os.path.join(data_dir, "train.tsv")), "train")
340 |
341 | def get_dev_examples(self, data_dir):
342 | """See base class."""
343 | return self._create_examples(
344 | self._read_tsv(os.path.join(data_dir, "dev.tsv")),
345 | "dev_matched")
346 |
347 | def get_labels(self):
348 | """See base class."""
349 | return ["entailment", "not_entailment"]
350 |
351 | def _create_examples(self, lines, set_type):
352 | """Creates examples for the training and dev sets."""
353 | examples = []
354 | for (i, line) in enumerate(lines):
355 | if i == 0:
356 | continue
357 | guid = "%s-%s" % (set_type, line[0])
358 | text_a = line[1]
359 | text_b = line[2]
360 | label = line[-1]
361 | examples.append(
362 | InputExample(guid=guid, text_a=text_a, text_b=text_b, label=label))
363 | return examples
364 |
365 |
366 | class RteProcessor(DataProcessor):
367 | """Processor for the RTE data set (GLUE version)."""
368 |
369 | def get_train_examples(self, data_dir):
370 | """See base class."""
371 | return self._create_examples(
372 | self._read_tsv(os.path.join(data_dir, "train.tsv")), "train")
373 |
374 | def get_dev_examples(self, data_dir):
375 | """See base class."""
376 | return self._create_examples(
377 | self._read_tsv(os.path.join(data_dir, "dev.tsv")), "dev")
378 |
379 | def get_labels(self):
380 | """See base class."""
381 | return ["entailment", "not_entailment"]
382 |
383 | def _create_examples(self, lines, set_type):
384 | """Creates examples for the training and dev sets."""
385 | examples = []
386 | for (i, line) in enumerate(lines):
387 | if i == 0:
388 | continue
389 | guid = "%s-%s" % (set_type, line[0])
390 | text_a = line[1]
391 | text_b = line[2]
392 | label = line[-1]
393 | examples.append(
394 | InputExample(guid=guid, text_a=text_a, text_b=text_b, label=label))
395 | return examples
396 |
397 |
398 | class WnliProcessor(DataProcessor):
399 | """Processor for the WNLI data set (GLUE version)."""
400 |
401 | def get_train_examples(self, data_dir):
402 | """See base class."""
403 | return self._create_examples(
404 | self._read_tsv(os.path.join(data_dir, "train.tsv")), "train")
405 |
406 | def get_dev_examples(self, data_dir):
407 | """See base class."""
408 | return self._create_examples(
409 | self._read_tsv(os.path.join(data_dir, "dev.tsv")), "dev")
410 |
411 | def get_labels(self):
412 | """See base class."""
413 | return ["0", "1"]
414 |
415 | def _create_examples(self, lines, set_type):
416 | """Creates examples for the training and dev sets."""
417 | examples = []
418 | for (i, line) in enumerate(lines):
419 | if i == 0:
420 | continue
421 | guid = "%s-%s" % (set_type, line[0])
422 | text_a = line[1]
423 | text_b = line[2]
424 | label = line[-1]
425 | examples.append(
426 | InputExample(guid=guid, text_a=text_a, text_b=text_b, label=label))
427 | return examples
428 |
429 |
430 | def convert_examples_to_features(examples, label_list, max_seq_length,
431 | tokenizer, output_mode):
432 | """Loads a data file into a list of `InputBatch`s."""
433 |
434 | label_map = {label : i for i, label in enumerate(label_list)}
435 |
436 | features = []
437 | for (ex_index, example) in enumerate(examples):
438 | if ex_index % 10000 == 0:
439 | logger.info("Writing example %d of %d" % (ex_index, len(examples)))
440 |
441 | tokens_a = tokenizer.tokenize(example.text_a)
442 |
443 | tokens_b = None
444 | if example.text_b:
445 | tokens_b = tokenizer.tokenize(example.text_b)
446 | # Modifies `tokens_a` and `tokens_b` in place so that the total
447 | # length is less than the specified length.
448 | # Account for [CLS], [SEP], [SEP] with "- 3"
449 | _truncate_seq_pair(tokens_a, tokens_b, max_seq_length - 3)
450 | else:
451 | # Account for [CLS] and [SEP] with "- 2"
452 | if len(tokens_a) > max_seq_length - 2:
453 | tokens_a = tokens_a[:(max_seq_length - 2)]
454 |
455 | # The convention in BERT is:
456 | # (a) For sequence pairs:
457 | # tokens: [CLS] is this jack ##son ##ville ? [SEP] no it is not . [SEP]
458 | # type_ids: 0 0 0 0 0 0 0 0 1 1 1 1 1 1
459 | # (b) For single sequences:
460 | # tokens: [CLS] the dog is hairy . [SEP]
461 | # type_ids: 0 0 0 0 0 0 0
462 | #
463 | # Where "type_ids" are used to indicate whether this is the first
464 | # sequence or the second sequence. The embedding vectors for `type=0` and
465 | # `type=1` were learned during pre-training and are added to the wordpiece
466 | # embedding vector (and position vector). This is not *strictly* necessary
467 | # since the [SEP] token unambiguously separates the sequences, but it makes
468 | # it easier for the model to learn the concept of sequences.
469 | #
470 | # For classification tasks, the first vector (corresponding to [CLS]) is
471 |         # used as the "sentence vector". Note that this only makes sense because
472 | # the entire model is fine-tuned.
473 | tokens = ["[CLS]"] + tokens_a + ["[SEP]"]
474 | segment_ids = [0] * len(tokens)
475 |
476 | if tokens_b:
477 | tokens += tokens_b + ["[SEP]"]
478 | segment_ids += [1] * (len(tokens_b) + 1)
479 |
480 | input_ids = tokenizer.convert_tokens_to_ids(tokens)
481 |
482 | # The mask has 1 for real tokens and 0 for padding tokens. Only real
483 | # tokens are attended to.
484 | input_mask = [1] * len(input_ids)
485 |
486 | # Zero-pad up to the sequence length.
487 | padding = [0] * (max_seq_length - len(input_ids))
488 | input_ids += padding
489 | input_mask += padding
490 | segment_ids += padding
491 |
492 | assert len(input_ids) == max_seq_length
493 | assert len(input_mask) == max_seq_length
494 | assert len(segment_ids) == max_seq_length
495 |
496 | if output_mode == "classification":
497 | label_id = label_map[example.label]
498 | elif output_mode == "regression":
499 | label_id = float(example.label)
500 | else:
501 | raise KeyError(output_mode)
502 |
503 | if ex_index < 5:
504 | logger.info("*** Example ***")
505 | logger.info("guid: %s" % (example.guid))
506 | logger.info("tokens: %s" % " ".join(
507 | [str(x) for x in tokens]))
508 | logger.info("input_ids: %s" % " ".join([str(x) for x in input_ids]))
509 | logger.info("input_mask: %s" % " ".join([str(x) for x in input_mask]))
510 | logger.info(
511 | "segment_ids: %s" % " ".join([str(x) for x in segment_ids]))
512 | logger.info("label: %s (id = %d)" % (example.label, label_id))
513 |
514 | features.append(
515 | InputFeatures(input_ids=input_ids,
516 | input_mask=input_mask,
517 | segment_ids=segment_ids,
518 | label_id=label_id))
519 | return features
520 |
521 |
522 | def _truncate_seq_pair(tokens_a, tokens_b, max_length):
523 | """Truncates a sequence pair in place to the maximum length."""
524 |
525 |     # This is a simple heuristic which always truncates the longer sequence
526 |     # one token at a time. This makes more sense than truncating an equal
527 |     # percentage of tokens from each, since if one sequence is very short, each
528 |     # token it loses likely carries more information than one from the longer sequence.
529 | while True:
530 | total_length = len(tokens_a) + len(tokens_b)
531 | if total_length <= max_length:
532 | break
533 | if len(tokens_a) > len(tokens_b):
534 | tokens_a.pop()
535 | else:
536 | tokens_b.pop()
537 |
538 |
539 | def simple_accuracy(preds, labels):
540 | return (preds == labels).mean()
541 |
542 |
543 | def acc_and_f1(preds, labels):
544 | acc = simple_accuracy(preds, labels)
545 | f1 = f1_score(y_true=labels, y_pred=preds)
546 | return {
547 | "acc": acc,
548 | "f1": f1,
549 | "acc_and_f1": (acc + f1) / 2,
550 | }
551 |
552 |
553 | def pearson_and_spearman(preds, labels):
554 | pearson_corr = pearsonr(preds, labels)[0]
555 | spearman_corr = spearmanr(preds, labels)[0]
556 | return {
557 | "pearson": pearson_corr,
558 | "spearmanr": spearman_corr,
559 | "corr": (pearson_corr + spearman_corr) / 2,
560 | }
561 |
562 |
563 | def compute_metrics(task_name, preds, labels):
564 | assert len(preds) == len(labels)
565 | if task_name == "cola":
566 |         return {"mcc": matthews_corrcoef(labels, preds), "acc": simple_accuracy(preds, labels)}
567 |     elif task_name == "onion":
568 |         return {"mcc": matthews_corrcoef(labels, preds), "acc": simple_accuracy(preds, labels)}
569 | elif task_name == "sst-2":
570 | return {"acc": simple_accuracy(preds, labels)}
571 | elif task_name == "mrpc":
572 | return acc_and_f1(preds, labels)
573 | elif task_name == "sts-b":
574 | return pearson_and_spearman(preds, labels)
575 | elif task_name == "qqp":
576 | return acc_and_f1(preds, labels)
577 | elif task_name == "mnli":
578 | return {"acc": simple_accuracy(preds, labels)}
579 | elif task_name == "mnli-mm":
580 | return {"acc": simple_accuracy(preds, labels)}
581 | elif task_name == "qnli":
582 | return {"acc": simple_accuracy(preds, labels)}
583 | elif task_name == "rte":
584 | return {"acc": simple_accuracy(preds, labels)}
585 | elif task_name == "wnli":
586 | return {"acc": simple_accuracy(preds, labels)}
587 | else:
588 | raise KeyError(task_name)
589 |
590 |
591 | def main():
592 | parser = argparse.ArgumentParser()
593 |
594 | ## Required parameters
595 | parser.add_argument("--data_dir",
596 | default=None,
597 | type=str,
598 | required=True,
599 | help="The input data dir. Should contain the .tsv files (or other data files) for the task.")
600 | parser.add_argument("--bert_model", default=None, type=str, required=True,
601 | help="Bert pre-trained model selected in the list: bert-base-uncased, "
602 | "bert-large-uncased, bert-base-cased, bert-large-cased, bert-base-multilingual-uncased, "
603 | "bert-base-multilingual-cased, bert-base-chinese.")
604 | parser.add_argument("--task_name",
605 | default=None,
606 | type=str,
607 | required=True,
608 | help="The name of the task to train.")
609 | parser.add_argument("--output_dir",
610 | default=None,
611 | type=str,
612 | required=True,
613 | help="The output directory where the model predictions and checkpoints will be written.")
614 |
615 | ## Other parameters
616 | parser.add_argument("--cache_dir",
617 | default="",
618 | type=str,
619 | help="Where do you want to store the pre-trained models downloaded from s3")
620 | parser.add_argument("--max_seq_length",
621 | default=128,
622 | type=int,
623 | help="The maximum total input sequence length after WordPiece tokenization. \n"
624 | "Sequences longer than this will be truncated, and sequences shorter \n"
625 | "than this will be padded.")
626 | parser.add_argument("--do_train",
627 | action='store_true',
628 | help="Whether to run training.")
629 | parser.add_argument("--do_eval",
630 | action='store_true',
631 | help="Whether to run eval on the dev set.")
632 | parser.add_argument("--do_lower_case",
633 | action='store_true',
634 | help="Set this flag if you are using an uncased model.")
635 | parser.add_argument("--train_batch_size",
636 | default=32,
637 | type=int,
638 | help="Total batch size for training.")
639 | parser.add_argument("--eval_batch_size",
640 | default=8,
641 | type=int,
642 | help="Total batch size for eval.")
643 | parser.add_argument("--learning_rate",
644 | default=5e-5,
645 | type=float,
646 | help="The initial learning rate for Adam.")
647 | parser.add_argument("--num_train_epochs",
648 | default=3.0,
649 | type=float,
650 | help="Total number of training epochs to perform.")
651 | parser.add_argument("--warmup_proportion",
652 | default=0.1,
653 | type=float,
654 | help="Proportion of training to perform linear learning rate warmup for. "
655 | "E.g., 0.1 = 10%% of training.")
656 | parser.add_argument("--no_cuda",
657 | action='store_true',
658 |                         help="Disable CUDA even when it is available.")
659 | parser.add_argument("--local_rank",
660 | type=int,
661 | default=-1,
662 | help="local_rank for distributed training on gpus")
663 | parser.add_argument('--seed',
664 | type=int,
665 | default=42,
666 | help="random seed for initialization")
667 | parser.add_argument('--gradient_accumulation_steps',
668 | type=int,
669 | default=1,
670 |                         help="Number of update steps to accumulate before performing a backward/update pass.")
671 | parser.add_argument('--fp16',
672 | action='store_true',
673 | help="Whether to use 16-bit float precision instead of 32-bit")
674 | parser.add_argument('--loss_scale',
675 | type=float, default=0,
676 | help="Loss scaling to improve fp16 numeric stability. Only used when fp16 set to True.\n"
677 | "0 (default value): dynamic loss scaling.\n"
678 | "Positive power of 2: static loss scaling value.\n")
679 | parser.add_argument('--server_ip', type=str, default='', help="Can be used for distant debugging.")
680 | parser.add_argument('--server_port', type=str, default='', help="Can be used for distant debugging.")
681 | args = parser.parse_args()
682 |
683 | if args.server_ip and args.server_port:
684 | # Distant debugging - see https://code.visualstudio.com/docs/python/debugging#_attach-to-a-local-script
685 | import ptvsd
686 | print("Waiting for debugger attach")
687 | ptvsd.enable_attach(address=(args.server_ip, args.server_port), redirect_output=True)
688 | ptvsd.wait_for_attach()
689 |
690 | processors = {
691 | "cola": ColaProcessor,
692 | "onion": OnionProcessor,
693 | "mnli": MnliProcessor,
694 | "mnli-mm": MnliMismatchedProcessor,
695 | "mrpc": MrpcProcessor,
696 | "sst-2": Sst2Processor,
697 | "sts-b": StsbProcessor,
698 | "qqp": QqpProcessor,
699 | "qnli": QnliProcessor,
700 | "rte": RteProcessor,
701 | "wnli": WnliProcessor,
702 | }
703 |
704 | output_modes = {
705 | "cola": "classification",
706 | "onion": "classification",
707 | "mnli": "classification",
708 | "mrpc": "classification",
709 | "sst-2": "classification",
710 | "sts-b": "regression",
711 | "qqp": "classification",
712 | "qnli": "classification",
713 | "rte": "classification",
714 | "wnli": "classification",
715 | }
716 |
717 | if args.local_rank == -1 or args.no_cuda:
718 | device = torch.device("cuda" if torch.cuda.is_available() and not args.no_cuda else "cpu")
719 | n_gpu = torch.cuda.device_count()
720 | else:
721 | torch.cuda.set_device(args.local_rank)
722 | device = torch.device("cuda", args.local_rank)
723 | n_gpu = 1
724 |         # Initializes the distributed backend which will take care of synchronizing nodes/GPUs
725 | torch.distributed.init_process_group(backend='nccl')
726 |
727 | logging.basicConfig(format = '%(asctime)s - %(levelname)s - %(name)s - %(message)s',
728 | datefmt = '%m/%d/%Y %H:%M:%S',
729 | level = logging.INFO if args.local_rank in [-1, 0] else logging.WARN)
730 |
731 | logger.info("device: {} n_gpu: {}, distributed training: {}, 16-bits training: {}".format(
732 | device, n_gpu, bool(args.local_rank != -1), args.fp16))
733 |
734 | if args.gradient_accumulation_steps < 1:
735 | raise ValueError("Invalid gradient_accumulation_steps parameter: {}, should be >= 1".format(
736 | args.gradient_accumulation_steps))
737 |
738 | args.train_batch_size = args.train_batch_size // args.gradient_accumulation_steps
739 |
740 | random.seed(args.seed)
741 | np.random.seed(args.seed)
742 | torch.manual_seed(args.seed)
743 | if n_gpu > 0:
744 | torch.cuda.manual_seed_all(args.seed)
745 |
746 | if not args.do_train and not args.do_eval:
747 | raise ValueError("At least one of `do_train` or `do_eval` must be True.")
748 |
749 | if os.path.exists(args.output_dir) and os.listdir(args.output_dir) and args.do_train:
750 | raise ValueError("Output directory ({}) already exists and is not empty.".format(args.output_dir))
751 | if not os.path.exists(args.output_dir):
752 | os.makedirs(args.output_dir)
753 |
754 | task_name = args.task_name.lower()
755 |
756 | if task_name not in processors:
757 | raise ValueError("Task not found: %s" % (task_name))
758 |
759 | processor = processors[task_name]()
760 | output_mode = output_modes[task_name]
761 |
762 | label_list = processor.get_labels()
763 | num_labels = len(label_list)
764 |
765 | tokenizer = BertTokenizer.from_pretrained(args.bert_model, do_lower_case=args.do_lower_case)
766 |
767 | train_examples = None
768 | num_train_optimization_steps = None
769 | if args.do_train:
770 | train_examples = processor.get_train_examples(args.data_dir)
771 | num_train_optimization_steps = int(
772 | len(train_examples) / args.train_batch_size / args.gradient_accumulation_steps) * args.num_train_epochs
773 | if args.local_rank != -1:
774 | num_train_optimization_steps = num_train_optimization_steps // torch.distributed.get_world_size()
775 |
776 | # Prepare model
777 | cache_dir = args.cache_dir if args.cache_dir else os.path.join(str(PYTORCH_PRETRAINED_BERT_CACHE), 'distributed_{}'.format(args.local_rank))
778 | model = BertForSequenceClassification.from_pretrained(args.bert_model,
779 | cache_dir=cache_dir,
780 | num_labels=num_labels)
781 | if args.fp16:
782 | model.half()
783 | model.to(device)
784 | if args.local_rank != -1:
785 | try:
786 | from apex.parallel import DistributedDataParallel as DDP
787 | except ImportError:
788 | raise ImportError("Please install apex from https://www.github.com/nvidia/apex to use distributed and fp16 training.")
789 |
790 | model = DDP(model)
791 | elif n_gpu > 1:
792 | model = torch.nn.DataParallel(model)
793 |
794 | # Prepare optimizer
795 | param_optimizer = list(model.named_parameters())
796 | no_decay = ['bias', 'LayerNorm.bias', 'LayerNorm.weight']
797 | optimizer_grouped_parameters = [
798 | {'params': [p for n, p in param_optimizer if not any(nd in n for nd in no_decay)], 'weight_decay': 0.01},
799 | {'params': [p for n, p in param_optimizer if any(nd in n for nd in no_decay)], 'weight_decay': 0.0}
800 | ]
801 | if args.fp16:
802 | try:
803 | from apex.optimizers import FP16_Optimizer
804 | from apex.optimizers import FusedAdam
805 | except ImportError:
806 | raise ImportError("Please install apex from https://www.github.com/nvidia/apex to use distributed and fp16 training.")
807 |
808 | optimizer = FusedAdam(optimizer_grouped_parameters,
809 | lr=args.learning_rate,
810 | bias_correction=False,
811 | max_grad_norm=1.0)
812 | if args.loss_scale == 0:
813 | optimizer = FP16_Optimizer(optimizer, dynamic_loss_scale=True)
814 | else:
815 | optimizer = FP16_Optimizer(optimizer, static_loss_scale=args.loss_scale)
816 | warmup_linear = WarmupLinearSchedule(warmup=args.warmup_proportion,
817 | t_total=num_train_optimization_steps)
818 |
819 | else:
820 | optimizer = BertAdam(optimizer_grouped_parameters,
821 | lr=args.learning_rate,
822 | warmup=args.warmup_proportion,
823 | t_total=num_train_optimization_steps)
824 |
825 | global_step = 0
826 | nb_tr_steps = 0
827 | tr_loss = 0
828 | if args.do_train:
829 | train_features = convert_examples_to_features(
830 | train_examples, label_list, args.max_seq_length, tokenizer, output_mode)
831 | logger.info("***** Running training *****")
832 | logger.info(" Num examples = %d", len(train_examples))
833 | logger.info(" Batch size = %d", args.train_batch_size)
834 | logger.info(" Num steps = %d", num_train_optimization_steps)
835 | all_input_ids = torch.tensor([f.input_ids for f in train_features], dtype=torch.long)
836 | all_input_mask = torch.tensor([f.input_mask for f in train_features], dtype=torch.long)
837 | all_segment_ids = torch.tensor([f.segment_ids for f in train_features], dtype=torch.long)
838 |
839 | if output_mode == "classification":
840 | all_label_ids = torch.tensor([f.label_id for f in train_features], dtype=torch.long)
841 | elif output_mode == "regression":
842 | all_label_ids = torch.tensor([f.label_id for f in train_features], dtype=torch.float)
843 |
844 | train_data = TensorDataset(all_input_ids, all_input_mask, all_segment_ids, all_label_ids)
845 | if args.local_rank == -1:
846 | train_sampler = RandomSampler(train_data)
847 | else:
848 | train_sampler = DistributedSampler(train_data)
849 | train_dataloader = DataLoader(train_data, sampler=train_sampler, batch_size=args.train_batch_size)
850 |
851 | model.train()
852 | for _ in trange(int(args.num_train_epochs), desc="Epoch"):
853 | tr_loss = 0
854 | nb_tr_examples, nb_tr_steps = 0, 0
855 | for step, batch in enumerate(tqdm(train_dataloader, desc="Iteration")):
856 | batch = tuple(t.to(device) for t in batch)
857 | input_ids, input_mask, segment_ids, label_ids = batch
858 |
859 |                 # compute the loss according to the task's output mode (classification or regression)
860 | logits = model(input_ids, segment_ids, input_mask, labels=None)
861 |
862 | if output_mode == "classification":
863 | loss_fct = CrossEntropyLoss()
864 | loss = loss_fct(logits.view(-1, num_labels), label_ids.view(-1))
865 | elif output_mode == "regression":
866 | loss_fct = MSELoss()
867 | loss = loss_fct(logits.view(-1), label_ids.view(-1))
868 |
869 | if n_gpu > 1:
870 | loss = loss.mean() # mean() to average on multi-gpu.
871 | if args.gradient_accumulation_steps > 1:
872 | loss = loss / args.gradient_accumulation_steps
873 |
874 | if args.fp16:
875 | optimizer.backward(loss)
876 | else:
877 | loss.backward()
878 |
879 | tr_loss += loss.item()
880 | nb_tr_examples += input_ids.size(0)
881 | nb_tr_steps += 1
882 | if (step + 1) % args.gradient_accumulation_steps == 0:
883 | if args.fp16:
884 |                     # modify learning rate with the special warmup schedule BERT uses;
885 |                     # if args.fp16 is False, BertAdam is used and handles this automatically
886 | lr_this_step = args.learning_rate * warmup_linear.get_lr(global_step/num_train_optimization_steps,
887 | args.warmup_proportion)
888 | for param_group in optimizer.param_groups:
889 | param_group['lr'] = lr_this_step
890 | optimizer.step()
891 | optimizer.zero_grad()
892 | global_step += 1
893 |
894 | if args.do_train and (args.local_rank == -1 or torch.distributed.get_rank() == 0):
895 | # Save a trained model, configuration and tokenizer
896 |         model_to_save = model.module if hasattr(model, 'module') else model # Only save the model itself
897 |
898 | # If we save using the predefined names, we can load using `from_pretrained`
899 | output_model_file = os.path.join(args.output_dir, WEIGHTS_NAME)
900 | output_config_file = os.path.join(args.output_dir, CONFIG_NAME)
901 |
902 | torch.save(model_to_save.state_dict(), output_model_file)
903 | model_to_save.config.to_json_file(output_config_file)
904 | tokenizer.save_vocabulary(args.output_dir)
905 |
906 | # Load a trained model and vocabulary that you have fine-tuned
907 | model = BertForSequenceClassification.from_pretrained(args.output_dir, num_labels=num_labels)
908 | tokenizer = BertTokenizer.from_pretrained(args.output_dir, do_lower_case=args.do_lower_case)
909 | else:
910 | model = BertForSequenceClassification.from_pretrained(args.bert_model, num_labels=num_labels)
911 | model.to(device)
912 |
913 | if args.do_eval and (args.local_rank == -1 or torch.distributed.get_rank() == 0):
914 | eval_examples = processor.get_dev_examples(args.data_dir)
915 | eval_features = convert_examples_to_features(
916 | eval_examples, label_list, args.max_seq_length, tokenizer, output_mode)
917 | logger.info("***** Running evaluation *****")
918 | logger.info(" Num examples = %d", len(eval_examples))
919 | logger.info(" Batch size = %d", args.eval_batch_size)
920 | all_input_ids = torch.tensor([f.input_ids for f in eval_features], dtype=torch.long)
921 | all_input_mask = torch.tensor([f.input_mask for f in eval_features], dtype=torch.long)
922 | all_segment_ids = torch.tensor([f.segment_ids for f in eval_features], dtype=torch.long)
923 |
924 | if output_mode == "classification":
925 | all_label_ids = torch.tensor([f.label_id for f in eval_features], dtype=torch.long)
926 | elif output_mode == "regression":
927 | all_label_ids = torch.tensor([f.label_id for f in eval_features], dtype=torch.float)
928 |
929 | eval_data = TensorDataset(all_input_ids, all_input_mask, all_segment_ids, all_label_ids)
930 | # Run prediction for full data
931 | eval_sampler = SequentialSampler(eval_data)
932 | eval_dataloader = DataLoader(eval_data, sampler=eval_sampler, batch_size=args.eval_batch_size)
933 |
934 | model.eval()
935 | eval_loss = 0
936 | nb_eval_steps = 0
937 | preds = []
938 |
939 | for input_ids, input_mask, segment_ids, label_ids in tqdm(eval_dataloader, desc="Evaluating"):
940 | input_ids = input_ids.to(device)
941 | input_mask = input_mask.to(device)
942 | segment_ids = segment_ids.to(device)
943 | label_ids = label_ids.to(device)
944 |
945 | with torch.no_grad():
946 | logits = model(input_ids, segment_ids, input_mask, labels=None)
947 |
948 |             # compute the eval loss for the task's output mode
949 | if output_mode == "classification":
950 | loss_fct = CrossEntropyLoss()
951 | tmp_eval_loss = loss_fct(logits.view(-1, num_labels), label_ids.view(-1))
952 | elif output_mode == "regression":
953 | loss_fct = MSELoss()
954 | tmp_eval_loss = loss_fct(logits.view(-1), label_ids.view(-1))
955 |
956 | eval_loss += tmp_eval_loss.mean().item()
957 | nb_eval_steps += 1
958 | if len(preds) == 0:
959 | preds.append(logits.detach().cpu().numpy())
960 | else:
961 | preds[0] = np.append(
962 | preds[0], logits.detach().cpu().numpy(), axis=0)
963 |
964 | eval_loss = eval_loss / nb_eval_steps
965 | preds = preds[0]
966 | if output_mode == "classification":
967 | preds = np.argmax(preds, axis=1)
968 | elif output_mode == "regression":
969 | preds = np.squeeze(preds)
970 | result = compute_metrics(task_name, preds, all_label_ids.numpy())
971 | loss = tr_loss/nb_tr_steps if args.do_train else None
972 |
973 | result['eval_loss'] = eval_loss
974 | result['global_step'] = global_step
975 | result['loss'] = loss
976 |
977 | output_eval_file = os.path.join(args.output_dir, "eval_results.txt")
978 | with open(output_eval_file, "w") as writer:
979 | logger.info("***** Eval results *****")
980 | for key in sorted(result.keys()):
981 | logger.info(" %s = %s", key, str(result[key]))
982 | writer.write("%s = %s\n" % (key, str(result[key])))
983 |
984 | # hack for MNLI-MM
985 | if task_name == "mnli":
986 | task_name = "mnli-mm"
987 | processor = processors[task_name]()
988 |
989 | if os.path.exists(args.output_dir + '-MM') and os.listdir(args.output_dir + '-MM') and args.do_train:
990 |             raise ValueError("Output directory ({}) already exists and is not empty.".format(args.output_dir + '-MM'))
991 | if not os.path.exists(args.output_dir + '-MM'):
992 | os.makedirs(args.output_dir + '-MM')
993 |
994 | eval_examples = processor.get_dev_examples(args.data_dir)
995 | eval_features = convert_examples_to_features(
996 | eval_examples, label_list, args.max_seq_length, tokenizer, output_mode)
997 | logger.info("***** Running evaluation *****")
998 | logger.info(" Num examples = %d", len(eval_examples))
999 | logger.info(" Batch size = %d", args.eval_batch_size)
1000 | all_input_ids = torch.tensor([f.input_ids for f in eval_features], dtype=torch.long)
1001 | all_input_mask = torch.tensor([f.input_mask for f in eval_features], dtype=torch.long)
1002 | all_segment_ids = torch.tensor([f.segment_ids for f in eval_features], dtype=torch.long)
1003 | all_label_ids = torch.tensor([f.label_id for f in eval_features], dtype=torch.long)
1004 |
1005 | eval_data = TensorDataset(all_input_ids, all_input_mask, all_segment_ids, all_label_ids)
1006 | # Run prediction for full data
1007 | eval_sampler = SequentialSampler(eval_data)
1008 | eval_dataloader = DataLoader(eval_data, sampler=eval_sampler, batch_size=args.eval_batch_size)
1009 |
1010 | model.eval()
1011 | eval_loss = 0
1012 | nb_eval_steps = 0
1013 | preds = []
1014 |
1015 | for input_ids, input_mask, segment_ids, label_ids in tqdm(eval_dataloader, desc="Evaluating"):
1016 | input_ids = input_ids.to(device)
1017 | input_mask = input_mask.to(device)
1018 | segment_ids = segment_ids.to(device)
1019 | label_ids = label_ids.to(device)
1020 |
1021 | with torch.no_grad():
1022 | logits = model(input_ids, segment_ids, input_mask, labels=None)
1023 |
1024 | loss_fct = CrossEntropyLoss()
1025 | tmp_eval_loss = loss_fct(logits.view(-1, num_labels), label_ids.view(-1))
1026 |
1027 | eval_loss += tmp_eval_loss.mean().item()
1028 | nb_eval_steps += 1
1029 | if len(preds) == 0:
1030 | preds.append(logits.detach().cpu().numpy())
1031 | else:
1032 | preds[0] = np.append(
1033 | preds[0], logits.detach().cpu().numpy(), axis=0)
1034 |
1035 | eval_loss = eval_loss / nb_eval_steps
1036 | preds = preds[0]
1037 | preds = np.argmax(preds, axis=1)
1038 | result = compute_metrics(task_name, preds, all_label_ids.numpy())
1039 | loss = tr_loss/nb_tr_steps if args.do_train else None
1040 |
1041 | result['eval_loss'] = eval_loss
1042 | result['global_step'] = global_step
1043 | result['loss'] = loss
1044 |
1045 | output_eval_file = os.path.join(args.output_dir + '-MM', "eval_results.txt")
1046 | with open(output_eval_file, "w") as writer:
1047 | logger.info("***** Eval results *****")
1048 | for key in sorted(result.keys()):
1049 | logger.info(" %s = %s", key, str(result[key]))
1050 | writer.write("%s = %s\n" % (key, str(result[key])))
1051 |
1052 | if __name__ == "__main__":
1053 | main()
1054 |
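1055 | # Example invocation for the custom "onion" critic task (a sketch, assuming this
1056 | # script is train_classifier.py; the data/output paths below are placeholders, not
1057 | # the exact settings used for the project):
1058 | #
1059 | #   python train_classifier.py \
1060 | #       --task_name onion --do_train --do_eval --do_lower_case \
1061 | #       --data_dir ./critic_data --bert_model bert-base-uncased \
1062 | #       --max_seq_length 128 --train_batch_size 32 --output_dir ./critic_model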
--------------------------------------------------------------------------------
/weight_samples.py:
--------------------------------------------------------------------------------
1 | import sys
2 | import math
3 |
4 | def sm(x):
5 |     # Softmax over the two numeric columns: returns the normalized weight of the
6 |     # second score (x[2]) against the first (x[1]).
7 |     a = math.exp(x[1])
8 |     b = math.exp(x[2])
9 |     return b / (a + b + 1e-8)
10 |
11 | # Output file: the input rows with one extra (softmax) column appended.
12 | ofh = open(sys.argv[2], 'w', encoding='utf-8')
13 |
14 | # Each input line is tab-separated, with numeric scores in columns 1 and 2.
15 | for line in open(sys.argv[1], encoding='utf-8'):
16 |     chunks = line.strip().split('\t')
17 |     chunks[1] = float(chunks[1])
18 |     chunks[2] = float(chunks[2])
19 |     chunks.append(sm(chunks))
20 |     print("\t".join(map(str, chunks)), file=ofh)
21 |
22 | ofh.close()
23 |
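24 | # Example invocation (a sketch; the filenames below are placeholders):
25 | #   python weight_samples.py critic_scores.tsv critic_scores_weighted.tsv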
--------------------------------------------------------------------------------