├── .gitignore
├── README.md
├── checkpoints
│   └── .gitignore
├── seq2seq.ipynb
└── tensorboard-logs
    └── .gitignore

/.gitignore:
--------------------------------------------------------------------------------
1 | .ipynb_checkpoints
2 | __pycache__
3 | pred.txt
4 | 
--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
1 | Batched Seq2Seq Example
2 | Based on the [`seq2seq-translation-batched.ipynb`](https://github.com/spro/practical-pytorch/blob/master/seq2seq-translation/seq2seq-translation-batched.ipynb) from *practical-pytorch*, but with extra features.
3 | 
4 | This example runs a grammatical error correction task: the source sequence is a grammatically erroneous English sentence and the target sequence is its grammatically correct counterpart. The corpus and evaluation script can be downloaded at: https://github.com/keisks/jfleg.
5 | 
6 | ### Extra features
7 | - Cleaner codebase
8 | - Very detailed comments for learners
9 | - Implements PyTorch's native `Dataset` and `DataLoader` for batching
10 | - Correctly handles the hidden state of the bidirectional encoder and passes it to the decoder as the initial hidden state.
11 | - Fully batched attention computation (only `general attention` is implemented, but it is sufficient). Note: the original code still uses a for-loop to compute attention, which is very slow.
12 | - Supports LSTM in addition to GRU
13 | - Shared embeddings (encoder's input embedding and decoder's input embedding)
14 | - Pretrained GloVe embeddings
15 | - Fixed (frozen) embeddings
16 | - Tied embeddings (decoder's input embedding and decoder's output embedding)
17 | - TensorBoard visualization
18 | - Checkpoint saving and loading
19 | - Replaces unknown words with the source token that has the highest attention score. (Translation)
20 | 
21 | ### Cons
22 | Compared to the state-of-the-art seq2seq library OpenNMT-py, a few things are not optimized in this codebase:
23 | - Use cuDNN when possible (always on the encoder; on the decoder when `input_feed`=0)
24 | - Always avoid indexing / loops and use torch primitives instead.
25 | - When possible, batch softmax operations across time. (This is the second most complicated part of the code.)
26 | - Batched inference and beam search for translation (this is the most complicated part of the code)
27 | 
28 | ### How to speed up RNN training?
29 | Several ways to speed up RNN training (a minimal sketch of dynamic padding and bucketing follows at the end of this section):
30 | - Batching
31 | - Static padding
32 | - Dynamic padding
33 | - Bucketing
34 | - Truncated BPTT
35 | 
36 | See ["Sequence Models and the RNN API (TensorFlow Dev Summit 2017)"](https://www.youtube.com/watch?v=RIR_-Xlbp7s&t=490s) for an explanation of these techniques.
37 | 
38 | You can use [torchtext](http://torchtext.readthedocs.io/en/latest/index.html) or OpenNMT's data iterator to speed up training. It can be about 7x faster (e.g., 7 hours per epoch -> 1 hour).
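Below is a minimal, self-contained sketch of the dynamic padding and bucketing ideas listed above (it is **not** part of this repo). The helper names `make_buckets` and `pad_batch` are made up for illustration, and `torch.nn.utils.rnn.pad_sequence` assumes a newer PyTorch than the 0.3.x-era code in the notebook:

```python
import torch
from torch.nn.utils.rnn import pad_sequence

def make_buckets(seqs, batch_size):
    """Bucketing: sort by length, then slice into batches so that each batch
    holds sequences of similar length (less padding wasted per batch)."""
    order = sorted(range(len(seqs)), key=lambda i: len(seqs[i]))
    return [order[i:i + batch_size] for i in range(0, len(order), batch_size)]

def pad_batch(batch_seqs, pad_id=0):
    """Dynamic padding: pad only up to the longest sequence in *this* batch."""
    tensors = [torch.tensor(s, dtype=torch.long) for s in batch_seqs]
    return pad_sequence(tensors, padding_value=pad_id)  # (max_len_in_batch, batch_size)

# Toy token-id sequences of different lengths.
seqs = [[5, 6, 7], [5, 6], [5, 6, 7, 8, 9], [5]]
for bucket in make_buckets(seqs, batch_size=2):
    padded = pad_batch([seqs[i] for i in bucket])
    print(padded.shape)  # torch.Size([2, 2]) then torch.Size([5, 2])
```

The `collate_fn` in `seq2seq.ipynb` implements the dynamic-padding part (padding to the longest sequence of the current mini-batch); bucketing is what iterators such as torchtext's `BucketIterator` automate.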
39 | 
40 | ### Acknowledgement
41 | Thanks to @srush, the author of OpenNMT-py, for answering my questions! See https://github.com/OpenNMT/OpenNMT-py/issues/552
42 | 
--------------------------------------------------------------------------------
/checkpoints/.gitignore:
--------------------------------------------------------------------------------
1 | *
2 | !.gitignore
3 | 
--------------------------------------------------------------------------------
/seq2seq.ipynb:
--------------------------------------------------------------------------------
1 | {
2 |  "cells": [
3 |   {
4 |    "cell_type": "markdown",
5 |    "metadata": {},
6 |    "source": [
7 |     "## Batched Seq2Seq Example\n",
8 |     "Based on the [`seq2seq-translation-batched.ipynb`](https://github.com/spro/practical-pytorch/blob/master/seq2seq-translation/seq2seq-translation-batched.ipynb) from *practical-pytorch*, but with extra features.\n",
9 |     "\n",
10 |     "This example runs a grammatical error correction task: the source sequence is a grammatically erroneous English sentence and the target sequence is its grammatically correct counterpart. The corpus and evaluation script can be downloaded at: https://github.com/keisks/jfleg.\n",
11 |     "\n",
12 |     "### Extra features\n",
13 |     "- Cleaner codebase\n",
14 |     "- Very detailed comments for learners\n",
15 |     "- Implements PyTorch's native `Dataset` and `DataLoader` for batching\n",
16 |     "- Correctly handles the hidden state of the bidirectional encoder and passes it to the decoder as the initial hidden state.\n",
17 |     "- Fully batched attention computation (only `general attention` is implemented, but it is sufficient). Note: the original code still uses a for-loop to compute attention, which is very slow.\n",
18 |     "- Supports LSTM in addition to GRU\n",
19 |     "- Shared embeddings (encoder's input embedding and decoder's input embedding)\n",
20 |     "- Pretrained GloVe embeddings\n",
21 |     "- Fixed (frozen) embeddings\n",
22 |     "- Tied embeddings (decoder's input embedding and decoder's output embedding)\n",
23 |     "- TensorBoard visualization\n",
24 |     "- Checkpoint saving and loading\n",
25 |     "- Replaces unknown words with the source token that has the highest attention score. (Translation)\n",
26 |     "\n",
27 |     "### Cons\n",
28 |     "Compared to the state-of-the-art seq2seq library OpenNMT-py, a few things are not optimized in this codebase:\n",
29 |     "- Use cuDNN when possible (always on the encoder; on the decoder when `input_feed`=0)\n",
30 |     "- Always avoid indexing / loops and use torch primitives instead.\n",
31 |     "- When possible, batch softmax operations across time. (This is the second most complicated part of the code.)\n",
32 |     "- Batched inference and beam search for translation (this is the most complicated part of the code)\n",
33 |     "\n",
34 |     "Thanks to @srush, the author of OpenNMT-py, for answering my questions! See https://github.com/OpenNMT/OpenNMT-py/issues/552"
35 |    ]
36 |   },
37 |   {
38 |    "cell_type": "code",
39 |    "execution_count": 1,
40 |    "metadata": {
41 |     "collapsed": true
42 |    },
43 |    "outputs": [],
44 |    "source": [
45 |     "import os\n",
46 |     "import subprocess\n",
47 |     "import codecs\n",
48 |     "import numpy as np\n",
49 |     "\n",
50 |     "import torch\n",
51 |     "import torch.nn as nn\n",
52 |     "from torch.autograd import Variable\n",
53 |     "from torch import optim\n",
54 |     "import torch.nn.functional as F\n",
55 |     "from torch.utils.data import Dataset, DataLoader"
56 |    ]
57 |   },
58 |   {
59 |    "cell_type": "code",
60 |    "execution_count": 2,
61 |    "metadata": {
62 |     "collapsed": true
63 |    },
64 |    "outputs": [],
65 |    "source": [
66 |     "\"\"\" Please install spaCy and its English model:\n",
67 |     "1. Install spacy: https://spacy.io/usage/\n",
68 |     "2. 
Install model: https://spacy.io/usage/models\n", 69 | "Recommend to install spacy since it is a very powerful NLP tool\n", 70 | "\"\"\"\n", 71 | "\n", 72 | "import spacy\n", 73 | "nlp = spacy.load('en_core_web_lg') # For the glove embeddings" 74 | ] 75 | }, 76 | { 77 | "cell_type": "code", 78 | "execution_count": 3, 79 | "metadata": {}, 80 | "outputs": [ 81 | { 82 | "name": "stdout", 83 | "output_type": "stream", 84 | "text": [ 85 | "Use_CUDA=True\n", 86 | "current_device=0\n" 87 | ] 88 | } 89 | ], 90 | "source": [ 91 | "\"\"\" Enable GPU training \"\"\"\n", 92 | "USE_CUDA = torch.cuda.is_available()\n", 93 | "print('Use_CUDA={}'.format(USE_CUDA))\n", 94 | "if USE_CUDA:\n", 95 | " # You can change device by `torch.cuda.set_device(device_id)`\n", 96 | " print('current_device={}'.format(torch.cuda.current_device()))" 97 | ] 98 | }, 99 | { 100 | "cell_type": "markdown", 101 | "metadata": {}, 102 | "source": [ 103 | "## Build vocabulary, dataset and data loader" 104 | ] 105 | }, 106 | { 107 | "cell_type": "code", 108 | "execution_count": 4, 109 | "metadata": { 110 | "collapsed": true 111 | }, 112 | "outputs": [], 113 | "source": [ 114 | "import codecs\n", 115 | "from tqdm import tqdm\n", 116 | "from collections import Counter, namedtuple\n", 117 | "from torch.utils.data import Dataset, DataLoader\n", 118 | "\n", 119 | "PAD = 0\n", 120 | "BOS = 1\n", 121 | "EOS = 2\n", 122 | "UNK = 3\n", 123 | "\n", 124 | "class AttrDict(dict):\n", 125 | " \"\"\" Access dictionary keys like attribute \n", 126 | " https://stackoverflow.com/questions/4984647/accessing-dict-keys-like-an-attribute\n", 127 | " \"\"\"\n", 128 | " def __init__(self, *av, **kav):\n", 129 | " dict.__init__(self, *av, **kav)\n", 130 | " self.__dict__ = self\n", 131 | "\n", 132 | "class NMTDataset(Dataset):\n", 133 | " def __init__(self, src_path, tgt_path, src_vocab=None, tgt_vocab=None, max_vocab_size=50000, share_vocab=True):\n", 134 | " \"\"\" Note: If src_vocab, tgt_vocab is not given, it will build both vocabs.\n", 135 | " Args: \n", 136 | " - src_path, tgt_path: text file with tokenized sentences.\n", 137 | " - src_vocab, tgt_vocab: data structure is same as self.build_vocab().\n", 138 | " \"\"\"\n", 139 | " print('='*100)\n", 140 | " print('Dataset preprocessing log:')\n", 141 | " \n", 142 | " print('- Loading and tokenizing source sentences...')\n", 143 | " self.src_sents = self.load_sents(src_path)\n", 144 | " print('- Loading and tokenizing target sentences...')\n", 145 | " self.tgt_sents = self.load_sents(tgt_path)\n", 146 | " \n", 147 | " if src_vocab is None or tgt_vocab is None:\n", 148 | " print('- Building source counter...')\n", 149 | " self.src_counter = self.build_counter(self.src_sents)\n", 150 | " print('- Building target counter...')\n", 151 | " self.tgt_counter = self.build_counter(self.tgt_sents)\n", 152 | "\n", 153 | " if share_vocab:\n", 154 | " print('- Building source vocabulary...')\n", 155 | " self.src_vocab = self.build_vocab(self.src_counter + self.tgt_counter, max_vocab_size)\n", 156 | " print('- Building target vocabulary...')\n", 157 | " self.tgt_vocab = self.src_vocab\n", 158 | " else:\n", 159 | " print('- Building source vocabulary...')\n", 160 | " self.src_vocab = self.build_vocab(self.src_counter, max_vocab_size)\n", 161 | " print('- Building target vocabulary...')\n", 162 | " self.tgt_vocab = self.build_vocab(self.tgt_counter, max_vocab_size)\n", 163 | " else:\n", 164 | " self.src_vocab = src_vocab\n", 165 | " self.tgt_vocab = tgt_vocab\n", 166 | " share_vocab = src_vocab == tgt_vocab\n", 167 | " 
\n", 168 | " print('='*100)\n", 169 | " print('Dataset Info:')\n", 170 | " print('- Number of source sentences: {}'.format(len(self.src_sents)))\n", 171 | " print('- Number of target sentences: {}'.format(len(self.tgt_sents)))\n", 172 | " print('- Source vocabulary size: {}'.format(len(self.src_vocab.token2id)))\n", 173 | " print('- Target vocabulary size: {}'.format(len(self.tgt_vocab.token2id)))\n", 174 | " print('- Shared vocabulary: {}'.format(share_vocab))\n", 175 | " print('='*100 + '\\n')\n", 176 | " \n", 177 | " def __len__(self):\n", 178 | " return len(self.src_sents)\n", 179 | " \n", 180 | " def __getitem__(self, index):\n", 181 | " src_sent = self.src_sents[index]\n", 182 | " tgt_sent = self.tgt_sents[index]\n", 183 | " src_seq = self.tokens2ids(src_sent, self.src_vocab.token2id, append_BOS=False, append_EOS=True)\n", 184 | " tgt_seq = self.tokens2ids(tgt_sent, self.tgt_vocab.token2id, append_BOS=False, append_EOS=True)\n", 185 | "\n", 186 | " return src_sent, tgt_sent, src_seq, tgt_seq\n", 187 | " \n", 188 | " def load_sents(self, file_path):\n", 189 | " sents = []\n", 190 | " with codecs.open(file_path) as file:\n", 191 | " for sent in tqdm(file.readlines()):\n", 192 | " tokens = [token for token in sent.split()]\n", 193 | " sents.append(tokens)\n", 194 | " return sents\n", 195 | " \n", 196 | " def build_counter(self, sents):\n", 197 | " counter = Counter()\n", 198 | " for sent in tqdm(sents):\n", 199 | " counter.update(sent)\n", 200 | " return counter\n", 201 | " \n", 202 | " def build_vocab(self, counter, max_vocab_size):\n", 203 | " vocab = AttrDict()\n", 204 | " vocab.token2id = {'': PAD, '': BOS, '': EOS, '': UNK}\n", 205 | " vocab.token2id.update({token: _id+4 for _id, (token, count) in tqdm(enumerate(counter.most_common(max_vocab_size)))})\n", 206 | " vocab.id2token = {v:k for k,v in tqdm(vocab.token2id.items())} \n", 207 | " return vocab\n", 208 | " \n", 209 | " def tokens2ids(self, tokens, token2id, append_BOS=True, append_EOS=True):\n", 210 | " seq = []\n", 211 | " if append_BOS: seq.append(BOS)\n", 212 | " seq.extend([token2id.get(token, UNK) for token in tokens])\n", 213 | " if append_EOS: seq.append(EOS)\n", 214 | " return seq\n", 215 | " \n", 216 | "def collate_fn(data):\n", 217 | " \"\"\"\n", 218 | " Creates mini-batch tensors from (src_sent, tgt_sent, src_seq, tgt_seq).\n", 219 | " We should build a custom collate_fn rather than using default collate_fn,\n", 220 | " because merging sequences (including padding) is not supported in default.\n", 221 | " Seqeuences are padded to the maximum length of mini-batch sequences (dynamic padding).\n", 222 | " \n", 223 | " Args:\n", 224 | " data: list of tuple (src_sents, tgt_sents, src_seqs, tgt_seqs)\n", 225 | " - src_sents, tgt_sents: batch of original tokenized sentences\n", 226 | " - src_seqs, tgt_seqs: batch of original tokenized sentence ids\n", 227 | " Returns:\n", 228 | " - src_sents, tgt_sents (tuple): batch of original tokenized sentences\n", 229 | " - src_seqs, tgt_seqs (variable): (max_src_len, batch_size)\n", 230 | " - src_lens, tgt_lens (tensor): (batch_size)\n", 231 | " \n", 232 | " \"\"\"\n", 233 | " def _pad_sequences(seqs):\n", 234 | " lens = [len(seq) for seq in seqs]\n", 235 | " padded_seqs = torch.zeros(len(seqs), max(lens)).long()\n", 236 | " for i, seq in enumerate(seqs):\n", 237 | " end = lens[i]\n", 238 | " padded_seqs[i, :end] = torch.LongTensor(seq[:end])\n", 239 | " return padded_seqs, lens\n", 240 | "\n", 241 | " # Sort a list by *source* sequence length (descending order) to use 
`pack_padded_sequence`.\n", 242 | " # The *target* sequence is not sorted <-- It's ok, cause `pack_padded_sequence` only takes\n", 243 | " # *source* sequence, which is in the EncoderRNN\n", 244 | " data.sort(key=lambda x: len(x[0]), reverse=True)\n", 245 | "\n", 246 | " # Seperate source and target sequences.\n", 247 | " src_sents, tgt_sents, src_seqs, tgt_seqs = zip(*data)\n", 248 | " \n", 249 | " # Merge sequences (from tuple of 1D tensor to 2D tensor)\n", 250 | " src_seqs, src_lens = _pad_sequences(src_seqs)\n", 251 | " tgt_seqs, tgt_lens = _pad_sequences(tgt_seqs)\n", 252 | " \n", 253 | " # (batch, seq_len) => (seq_len, batch)\n", 254 | " src_seqs = src_seqs.transpose(0,1)\n", 255 | " tgt_seqs = tgt_seqs.transpose(0,1)\n", 256 | "\n", 257 | " return src_sents, tgt_sents, src_seqs, tgt_seqs, src_lens, tgt_lens" 258 | ] 259 | }, 260 | { 261 | "cell_type": "markdown", 262 | "metadata": {}, 263 | "source": [ 264 | "## Build models\n", 265 | "### Encoder" 266 | ] 267 | }, 268 | { 269 | "cell_type": "code", 270 | "execution_count": 5, 271 | "metadata": { 272 | "collapsed": true 273 | }, 274 | "outputs": [], 275 | "source": [ 276 | "class EncoderRNN(nn.Module):\n", 277 | " def __init__(self, embedding=None, rnn_type='LSTM', hidden_size=128, num_layers=1, dropout=0.3, bidirectional=True):\n", 278 | " super(EncoderRNN, self).__init__()\n", 279 | " \n", 280 | " self.num_layers = num_layers\n", 281 | " self.dropout = dropout\n", 282 | " self.bidirectional = bidirectional\n", 283 | " self.num_directions = 2 if bidirectional else 1\n", 284 | " self.hidden_size = hidden_size // self.num_directions\n", 285 | " \n", 286 | " self.embedding = embedding\n", 287 | " self.word_vec_size = self.embedding.embedding_dim\n", 288 | " \n", 289 | " self.rnn_type = rnn_type\n", 290 | " self.rnn = getattr(nn, self.rnn_type)(\n", 291 | " input_size=self.word_vec_size,\n", 292 | " hidden_size=self.hidden_size,\n", 293 | " num_layers=self.num_layers,\n", 294 | " dropout=self.dropout, \n", 295 | " bidirectional=self.bidirectional)\n", 296 | " \n", 297 | " def forward(self, src_seqs, src_lens, hidden=None):\n", 298 | " \"\"\"\n", 299 | " Args:\n", 300 | " - src_seqs: (max_src_len, batch_size)\n", 301 | " - src_lens: (batch_size)\n", 302 | " Returns:\n", 303 | " - outputs: (max_src_len, batch_size, hidden_size * num_directions)\n", 304 | " - hidden : (num_layers, batch_size, hidden_size * num_directions)\n", 305 | " \"\"\"\n", 306 | " \n", 307 | " # (max_src_len, batch_size) => (max_src_len, batch_size, word_vec_size)\n", 308 | " emb = self.embedding(src_seqs)\n", 309 | "\n", 310 | " # packed_emb:\n", 311 | " # - data: (sum(batch_sizes), word_vec_size)\n", 312 | " # - batch_sizes: list of batch sizes\n", 313 | " packed_emb = nn.utils.rnn.pack_padded_sequence(emb, src_lens)\n", 314 | "\n", 315 | " # rnn(gru) returns:\n", 316 | " # - packed_outputs: shape same as packed_emb\n", 317 | " # - hidden: (num_layers * num_directions, batch_size, hidden_size) \n", 318 | " packed_outputs, hidden = self.rnn(packed_emb, hidden)\n", 319 | "\n", 320 | " # outputs: (max_src_len, batch_size, hidden_size * num_directions)\n", 321 | " # output_lens == src_lensˇ\n", 322 | " outputs, output_lens = nn.utils.rnn.pad_packed_sequence(packed_outputs)\n", 323 | " \n", 324 | " if self.bidirectional:\n", 325 | " # (num_layers * num_directions, batch_size, hidden_size) \n", 326 | " # => (num_layers, batch_size, hidden_size * num_directions)\n", 327 | " hidden = self._cat_directions(hidden)\n", 328 | " \n", 329 | " return outputs, hidden\n", 330 | " 
\n", 331 | " def _cat_directions(self, hidden):\n", 332 | " \"\"\" If the encoder is bidirectional, do the following transformation.\n", 333 | " Ref: https://github.com/IBM/pytorch-seq2seq/blob/master/seq2seq/models/DecoderRNN.py#L176\n", 334 | " -----------------------------------------------------------\n", 335 | " In: (num_layers * num_directions, batch_size, hidden_size)\n", 336 | " (ex: num_layers=2, num_directions=2)\n", 337 | "\n", 338 | " layer 1: forward__hidden(1)\n", 339 | " layer 1: backward_hidden(1)\n", 340 | " layer 2: forward__hidden(2)\n", 341 | " layer 2: backward_hidden(2)\n", 342 | "\n", 343 | " -----------------------------------------------------------\n", 344 | " Out: (num_layers, batch_size, hidden_size * num_directions)\n", 345 | "\n", 346 | " layer 1: forward__hidden(1) backward_hidden(1)\n", 347 | " layer 2: forward__hidden(2) backward_hidden(2)\n", 348 | " \"\"\"\n", 349 | " def _cat(h):\n", 350 | " return torch.cat([h[0:h.size(0):2], h[1:h.size(0):2]], 2)\n", 351 | " \n", 352 | " if isinstance(hidden, tuple):\n", 353 | " # LSTM hidden contains a tuple (hidden state, cell state)\n", 354 | " hidden = tuple([_cat(h) for h in hidden])\n", 355 | " else:\n", 356 | " # GRU hidden\n", 357 | " hidden = _cat(hidden)\n", 358 | " \n", 359 | " return hidden" 360 | ] 361 | }, 362 | { 363 | "cell_type": "markdown", 364 | "metadata": {}, 365 | "source": [ 366 | "### Decoder with \"general attention\" mechanism" 367 | ] 368 | }, 369 | { 370 | "cell_type": "code", 371 | "execution_count": 6, 372 | "metadata": { 373 | "collapsed": true 374 | }, 375 | "outputs": [], 376 | "source": [ 377 | "class LuongAttnDecoderRNN(nn.Module):\n", 378 | " def __init__(self, encoder, embedding=None, attention=True, bias=True, tie_embeddings=False, dropout=0.3):\n", 379 | " \"\"\" General attention in `Effective Approaches to Attention-based Neural Machine Translation`\n", 380 | " Ref: https://arxiv.org/abs/1508.04025\n", 381 | " \n", 382 | " Share input and output embeddings:\n", 383 | " Ref:\n", 384 | " - \"Using the Output Embedding to Improve Language Models\" (Press & Wolf 2016)\n", 385 | " https://arxiv.org/abs/1608.05859\n", 386 | " - \"Tying Word Vectors and Word Classifiers: A Loss Framework for Language Modeling\" (Inan et al. 
2016)\n", 387 | " https://arxiv.org/abs/1611.01462\n", 388 | " \"\"\"\n", 389 | " super(LuongAttnDecoderRNN, self).__init__()\n", 390 | " \n", 391 | " self.hidden_size = encoder.hidden_size * encoder.num_directions\n", 392 | " self.num_layers = encoder.num_layers\n", 393 | " self.dropout = dropout\n", 394 | " self.embedding = embedding\n", 395 | " self.attention = attention\n", 396 | " self.tie_embeddings = tie_embeddings\n", 397 | " \n", 398 | " self.vocab_size = self.embedding.num_embeddings\n", 399 | " self.word_vec_size = self.embedding.embedding_dim\n", 400 | " \n", 401 | " self.rnn_type = encoder.rnn_type\n", 402 | " self.rnn = getattr(nn, self.rnn_type)(\n", 403 | " input_size=self.word_vec_size,\n", 404 | " hidden_size=self.hidden_size,\n", 405 | " num_layers=self.num_layers,\n", 406 | " dropout=self.dropout)\n", 407 | " \n", 408 | " if self.attention:\n", 409 | " self.W_a = nn.Linear(encoder.hidden_size * encoder.num_directions,\n", 410 | " self.hidden_size, bias=bias)\n", 411 | " self.W_c = nn.Linear(encoder.hidden_size * encoder.num_directions + self.hidden_size, \n", 412 | " self.hidden_size, bias=bias)\n", 413 | " \n", 414 | " if self.tie_embeddings:\n", 415 | " self.W_proj = nn.Linear(self.hidden_size, self.word_vec_size, bias=bias)\n", 416 | " self.W_s = nn.Linear(self.word_vec_size, self.vocab_size, bias=bias)\n", 417 | " self.W_s.weight = self.embedding.weight\n", 418 | " else:\n", 419 | " self.W_s = nn.Linear(self.hidden_size, self.vocab_size, bias=bias)\n", 420 | " \n", 421 | " def forward(self, input_seq, decoder_hidden, encoder_outputs, src_lens):\n", 422 | " \"\"\" Args:\n", 423 | " - input_seq : (batch_size)\n", 424 | " - decoder_hidden : (t=0) last encoder hidden state (num_layers * num_directions, batch_size, hidden_size) \n", 425 | " (t>0) previous decoder hidden state (num_layers, batch_size, hidden_size)\n", 426 | " - encoder_outputs: (max_src_len, batch_size, hidden_size * num_directions)\n", 427 | " \n", 428 | " Returns:\n", 429 | " - output : (batch_size, vocab_size)\n", 430 | " - decoder_hidden : (num_layers, batch_size, hidden_size)\n", 431 | " - attention_weights: (batch_size, max_src_len)\n", 432 | " \"\"\" \n", 433 | " # (batch_size) => (seq_len=1, batch_size)\n", 434 | " input_seq = input_seq.unsqueeze(0)\n", 435 | " \n", 436 | " # (seq_len=1, batch_size) => (seq_len=1, batch_size, word_vec_size) \n", 437 | " emb = self.embedding(input_seq)\n", 438 | " \n", 439 | " # rnn returns:\n", 440 | " # - decoder_output: (seq_len=1, batch_size, hidden_size)\n", 441 | " # - decoder_hidden: (num_layers, batch_size, hidden_size)\n", 442 | " decoder_output, decoder_hidden = self.rnn(emb, decoder_hidden)\n", 443 | "\n", 444 | " # (seq_len=1, batch_size, hidden_size) => (batch_size, seq_len=1, hidden_size)\n", 445 | " decoder_output = decoder_output.transpose(0,1)\n", 446 | " \n", 447 | " \"\"\" \n", 448 | " ------------------------------------------------------------------------------------------\n", 449 | " Notes of computing attention scores\n", 450 | " ------------------------------------------------------------------------------------------\n", 451 | " # For-loop version:\n", 452 | "\n", 453 | " max_src_len = encoder_outputs.size(0)\n", 454 | " batch_size = encoder_outputs.size(1)\n", 455 | " attention_scores = Variable(torch.zeros(batch_size, max_src_len))\n", 456 | "\n", 457 | " # For every batch, every time step of encoder's hidden state, calculate attention score.\n", 458 | " for b in range(batch_size):\n", 459 | " for t in range(max_src_len):\n", 460 | " # 
Loung. eq(8) -- general form content-based attention:\n", 461 | " attention_scores[b,t] = decoder_output[b].dot(attention.W_a(encoder_outputs[t,b]))\n", 462 | "\n", 463 | " ------------------------------------------------------------------------------------------\n", 464 | " # Vectorized version:\n", 465 | "\n", 466 | " 1. decoder_output: (batch_size, seq_len=1, hidden_size)\n", 467 | " 2. encoder_outputs: (max_src_len, batch_size, hidden_size * num_directions)\n", 468 | " 3. W_a(encoder_outputs): (max_src_len, batch_size, hidden_size)\n", 469 | " .transpose(0,1) : (batch_size, max_src_len, hidden_size) \n", 470 | " .transpose(1,2) : (batch_size, hidden_size, max_src_len)\n", 471 | " 4. attention_scores: \n", 472 | " (batch_size, seq_len=1, hidden_size) * (batch_size, hidden_size, max_src_len) \n", 473 | " => (batch_size, seq_len=1, max_src_len)\n", 474 | " \"\"\"\n", 475 | " \n", 476 | " if self.attention:\n", 477 | " # attention_scores: (batch_size, seq_len=1, max_src_len)\n", 478 | " attention_scores = torch.bmm(decoder_output, self.W_a(encoder_outputs).transpose(0,1).transpose(1,2))\n", 479 | "\n", 480 | " # attention_mask: (batch_size, seq_len=1, max_src_len)\n", 481 | " attention_mask = sequence_mask(src_lens).unsqueeze(1)\n", 482 | "\n", 483 | " # Fills elements of tensor with `-float('inf')` where `mask` is 1.\n", 484 | " attention_scores.data.masked_fill_(1 - attention_mask.data, -float('inf'))\n", 485 | "\n", 486 | " # attention_weights: (batch_size, seq_len=1, max_src_len) => (batch_size, max_src_len) for `F.softmax` \n", 487 | " # => (batch_size, seq_len=1, max_src_len)\n", 488 | " try: # torch 0.3.x\n", 489 | " attention_weights = F.softmax(attention_scores.squeeze(1), dim=1).unsqueeze(1)\n", 490 | " except:\n", 491 | " attention_weights = F.softmax(attention_scores.squeeze(1)).unsqueeze(1)\n", 492 | "\n", 493 | " # context_vector:\n", 494 | " # (batch_size, seq_len=1, max_src_len) * (batch_size, max_src_len, encoder_hidden_size * num_directions)\n", 495 | " # => (batch_size, seq_len=1, encoder_hidden_size * num_directions)\n", 496 | " context_vector = torch.bmm(attention_weights, encoder_outputs.transpose(0,1))\n", 497 | "\n", 498 | " # concat_input: (batch_size, seq_len=1, encoder_hidden_size * num_directions + decoder_hidden_size)\n", 499 | " concat_input = torch.cat([context_vector, decoder_output], -1)\n", 500 | "\n", 501 | " # (batch_size, seq_len=1, encoder_hidden_size * num_directions + decoder_hidden_size) => (batch_size, seq_len=1, decoder_hidden_size)\n", 502 | " concat_output = F.tanh(self.W_c(concat_input))\n", 503 | " \n", 504 | " # Prepare returns:\n", 505 | " # (batch_size, seq_len=1, max_src_len) => (batch_size, max_src_len)\n", 506 | " attention_weights = attention_weights.squeeze(1)\n", 507 | " else:\n", 508 | " attention_weights = None\n", 509 | " concat_output = decoder_output\n", 510 | " \n", 511 | " # If input and output embeddings are tied,\n", 512 | " # project `decoder_hidden_size` to `word_vec_size`.\n", 513 | " if self.tie_embeddings:\n", 514 | " output = self.W_s(self.W_proj(concat_output))\n", 515 | " else:\n", 516 | " # (batch_size, seq_len=1, decoder_hidden_size) => (batch_size, seq_len=1, vocab_size)\n", 517 | " output = self.W_s(concat_output) \n", 518 | " \n", 519 | " # Prepare returns:\n", 520 | " # (batch_size, seq_len=1, vocab_size) => (batch_size, vocab_size)\n", 521 | " output = output.squeeze(1)\n", 522 | " \n", 523 | " del src_lens\n", 524 | " \n", 525 | " return output, decoder_hidden, attention_weights" 526 | ] 527 | }, 528 | { 529 | 
"cell_type": "markdown", 530 | "metadata": {}, 531 | "source": [ 532 | "## Utils" 533 | ] 534 | }, 535 | { 536 | "cell_type": "code", 537 | "execution_count": 7, 538 | "metadata": { 539 | "collapsed": true 540 | }, 541 | "outputs": [], 542 | "source": [ 543 | "def load_spacy_glove_embedding(spacy_nlp, vocab):\n", 544 | " \n", 545 | " vocab_size = len(vocab.token2id)\n", 546 | " word_vec_size = spacy_nlp.vocab.vectors_length\n", 547 | " embedding = np.zeros((vocab_size, word_vec_size))\n", 548 | " unk_count = 0\n", 549 | " \n", 550 | " print('='*100)\n", 551 | " print('Loading spacy glove embedding:')\n", 552 | " print('- Vocabulary size: {}'.format(vocab_size))\n", 553 | " print('- Word vector size: {}'.format(word_vec_size))\n", 554 | " \n", 555 | " for token, index in tqdm(vocab.token2id.items()):\n", 556 | " if token == vocab.id2token[PAD]: \n", 557 | " continue\n", 558 | " elif token in [vocab.id2token[BOS], vocab.id2token[EOS], vocab.id2token[UNK]]: \n", 559 | " vector = np.random.rand(word_vec_size,)\n", 560 | " elif spacy_nlp.vocab[token].has_vector: \n", 561 | " vector = spacy_nlp.vocab[token].vector\n", 562 | " else:\n", 563 | " vector = embedding[UNK] \n", 564 | " unk_count += 1\n", 565 | " \n", 566 | " embedding[index] = vector\n", 567 | " \n", 568 | " print('- Unknown word count: {}'.format(unk_count))\n", 569 | " print('='*100 + '\\n')\n", 570 | " \n", 571 | " return torch.from_numpy(embedding).float()\n", 572 | "\n", 573 | "def sequence_mask(sequence_length, max_len=None):\n", 574 | " \"\"\"\n", 575 | " Caution: Input and Return are VARIABLE.\n", 576 | " \"\"\"\n", 577 | " if max_len is None:\n", 578 | " max_len = sequence_length.data.max()\n", 579 | " batch_size = sequence_length.size(0)\n", 580 | " seq_range = torch.arange(0, max_len).long()\n", 581 | " seq_range_expand = seq_range.unsqueeze(0).expand(batch_size, max_len)\n", 582 | " seq_range_expand = Variable(seq_range_expand)\n", 583 | " if sequence_length.is_cuda:\n", 584 | " seq_range_expand = seq_range_expand.cuda()\n", 585 | " seq_length_expand = (sequence_length.unsqueeze(1)\n", 586 | " .expand_as(seq_range_expand))\n", 587 | " mask = seq_range_expand < seq_length_expand\n", 588 | " \n", 589 | " return mask\n", 590 | "\n", 591 | "def masked_cross_entropy(logits, target, length):\n", 592 | " \"\"\"\n", 593 | " Args:\n", 594 | " logits: A Variable containing a FloatTensor of size\n", 595 | " (batch, max_len, num_classes) which contains the\n", 596 | " unnormalized probability for each class.\n", 597 | " target: A Variable containing a LongTensor of size\n", 598 | " (batch, max_len) which contains the index of the true\n", 599 | " class for each corresponding step.\n", 600 | " length: A Variable containing a LongTensor of size (batch,)\n", 601 | " which contains the length of each data in a batch.\n", 602 | " Returns:\n", 603 | " loss: An average loss value masked by the length.\n", 604 | " \n", 605 | " The code is same as:\n", 606 | " \n", 607 | " weight = torch.ones(tgt_vocab_size)\n", 608 | " weight[padding_idx] = 0\n", 609 | " criterion = nn.CrossEntropyLoss(weight.cuda(), size_average)\n", 610 | " loss = criterion(logits_flat, losses_flat)\n", 611 | " \"\"\"\n", 612 | " # logits_flat: (batch * max_len, num_classes)\n", 613 | " logits_flat = logits.view(-1, logits.size(-1))\n", 614 | " # log_probs_flat: (batch * max_len, num_classes)\n", 615 | " log_probs_flat = F.log_softmax(logits_flat)\n", 616 | " # target_flat: (batch * max_len, 1)\n", 617 | " target_flat = target.view(-1, 1)\n", 618 | " # losses_flat: (batch * 
max_len, 1)\n", 619 | " losses_flat = -torch.gather(log_probs_flat, dim=1, index=target_flat)\n", 620 | " # losses: (batch, max_len)\n", 621 | " losses = losses_flat.view(*target.size())\n", 622 | " # mask: (batch, max_len)\n", 623 | " mask = sequence_mask(sequence_length=length, max_len=target.size(1))\n", 624 | " # Note: mask need to bed casted to float!\n", 625 | " losses = losses * mask.float()\n", 626 | " loss = losses.sum() / mask.float().sum()\n", 627 | " \n", 628 | " # (batch_size * max_tgt_len,)\n", 629 | " pred_flat = log_probs_flat.max(1)[1]\n", 630 | " # (batch_size * max_tgt_len,) => (batch_size, max_tgt_len) => (max_tgt_len, batch_size)\n", 631 | " pred_seqs = pred_flat.view(*target.size()).transpose(0,1).contiguous()\n", 632 | " # (batch_size, max_len) => (batch_size * max_tgt_len,)\n", 633 | " mask_flat = mask.view(-1)\n", 634 | " \n", 635 | " # `.float()` IS VERY IMPORTANT !!!\n", 636 | " # https://discuss.pytorch.org/t/batch-size-and-validation-accuracy/4066/3\n", 637 | " num_corrects = int(pred_flat.eq(target_flat.squeeze(1)).masked_select(mask_flat).float().data.sum())\n", 638 | " num_words = length.data.sum()\n", 639 | "\n", 640 | " return loss, pred_seqs, num_corrects, num_words\n", 641 | "\n", 642 | "def load_checkpoint(checkpoint_path):\n", 643 | " # It's weird that if `map_location` is not given, it will be extremely slow.\n", 644 | " return torch.load(checkpoint_path, map_location=lambda storage, loc: storage)\n", 645 | "\n", 646 | "def save_checkpoint(opts, experiment_name, encoder, decoder, encoder_optim, decoder_optim,\n", 647 | " total_accuracy, total_loss, global_step):\n", 648 | " checkpoint = {\n", 649 | " 'opts': opts,\n", 650 | " 'global_step': global_step,\n", 651 | " 'encoder_state_dict': encoder.state_dict(),\n", 652 | " 'decoder_state_dict': decoder.state_dict(),\n", 653 | " 'encoder_optim_state_dict': encoder_optim.state_dict(),\n", 654 | " 'decoder_optim_state_dict': decoder_optim.state_dict()\n", 655 | " }\n", 656 | " \n", 657 | " checkpoint_path = 'checkpoints/%s_acc_%.2f_loss_%.2f_step_%d.pt' % (experiment_name, total_accuracy, total_loss, global_step)\n", 658 | " \n", 659 | " directory, filename = os.path.split(os.path.abspath(checkpoint_path))\n", 660 | "\n", 661 | " if not os.path.exists(directory):\n", 662 | " os.makedirs(directory)\n", 663 | " \n", 664 | " torch.save(checkpoint, checkpoint_path)\n", 665 | " \n", 666 | " return checkpoint_path\n", 667 | "\n", 668 | "def variable2numpy(var):\n", 669 | " \"\"\" For tensorboard visualization \"\"\"\n", 670 | " return var.data.cpu().numpy()\n", 671 | "\n", 672 | "def write_to_tensorboard(writer, global_step, total_loss, total_corrects, total_words, total_accuracy,\n", 673 | " encoder_grad_norm, decoder_grad_norm, clipped_encoder_grad_norm, clipped_decoder_grad_norm,\n", 674 | " encoder, decoder, gpu_memory_usage=None):\n", 675 | " # scalars\n", 676 | " if gpu_memory_usage is not None:\n", 677 | " writer.add_scalar('curr_gpu_memory_usage', gpu_memory_usage['curr'], global_step)\n", 678 | " writer.add_scalar('diff_gpu_memory_usage', gpu_memory_usage['diff'], global_step)\n", 679 | " \n", 680 | " writer.add_scalar('total_loss', total_loss, global_step)\n", 681 | " writer.add_scalar('total_accuracy', total_accuracy, global_step)\n", 682 | " writer.add_scalar('total_corrects', total_corrects, global_step)\n", 683 | " writer.add_scalar('total_words', total_words, global_step)\n", 684 | " writer.add_scalar('encoder_grad_norm', encoder_grad_norm, global_step)\n", 685 | " 
writer.add_scalar('decoder_grad_norm', decoder_grad_norm, global_step)\n", 686 | " writer.add_scalar('clipped_encoder_grad_norm', clipped_encoder_grad_norm, global_step)\n", 687 | " writer.add_scalar('clipped_decoder_grad_norm', clipped_decoder_grad_norm, global_step)\n", 688 | " \n", 689 | " # histogram\n", 690 | " for name, param in encoder.named_parameters():\n", 691 | " name = name.replace('.', '/')\n", 692 | " writer.add_histogram('encoder/{}'.format(name), variable2numpy(param), global_step, bins='doane')\n", 693 | " if param.grad is not None:\n", 694 | " writer.add_histogram('encoder/{}/grad'.format(name), variable2numpy(param.grad), global_step, bins='doane')\n", 695 | "\n", 696 | " for name, param in decoder.named_parameters():\n", 697 | " name = name.replace('.', '/')\n", 698 | " writer.add_histogram('decoder/{}'.format(name), variable2numpy(param), global_step, bins='doane')\n", 699 | " if param.grad is not None:\n", 700 | " writer.add_histogram('decoder/{}/grad'.format(name), variable2numpy(param.grad), global_step, bins='doane')\n", 701 | " \n", 702 | "def detach_hidden(hidden):\n", 703 | " \"\"\" Wraps hidden states in new Variables, to detach them from their history. Prevent OOM.\n", 704 | " After detach, the hidden's requires_grad=Fasle and grad_fn=None.\n", 705 | " Issues:\n", 706 | " - Memory leak problem in LSTM and RNN: https://github.com/pytorch/pytorch/issues/2198\n", 707 | " - https://github.com/pytorch/examples/blob/master/word_language_model/main.py\n", 708 | " - https://discuss.pytorch.org/t/help-clarifying-repackage-hidden-in-word-language-model/226\n", 709 | " - https://discuss.pytorch.org/t/solved-why-we-need-to-detach-variable-which-contains-hidden-representation/1426\n", 710 | " - \n", 711 | " \"\"\"\n", 712 | " if type(hidden) == Variable:\n", 713 | " hidden.detach_() # same as creating a new variable.\n", 714 | " else:\n", 715 | " for h in hidden: h.detach_()\n", 716 | "\n", 717 | "def get_gpu_memory_usage(device_id):\n", 718 | " \"\"\"Get the current gpu usage. \"\"\"\n", 719 | " result = subprocess.check_output(\n", 720 | " [\n", 721 | " 'nvidia-smi', '--query-gpu=memory.used',\n", 722 | " '--format=csv,nounits,noheader'\n", 723 | " ], encoding='utf-8')\n", 724 | " # Convert lines into a dictionary\n", 725 | " gpu_memory = [int(x) for x in result.strip().split('\\n')]\n", 726 | " gpu_memory_map = dict(zip(range(len(gpu_memory)), gpu_memory))\n", 727 | " return gpu_memory_map[device_id]" 728 | ] 729 | }, 730 | { 731 | "cell_type": "markdown", 732 | "metadata": {}, 733 | "source": [ 734 | "## Trainer" 735 | ] 736 | }, 737 | { 738 | "cell_type": "code", 739 | "execution_count": 8, 740 | "metadata": { 741 | "collapsed": true 742 | }, 743 | "outputs": [], 744 | "source": [ 745 | "def compute_grad_norm(parameters, norm_type=2):\n", 746 | " \"\"\" Ref: http://pytorch.org/docs/0.3.0/_modules/torch/nn/utils/clip_grad.html#clip_grad_norm\n", 747 | " \"\"\"\n", 748 | " parameters = list(filter(lambda p: p.grad is not None, parameters))\n", 749 | " norm_type = float(norm_type)\n", 750 | " if norm_type == float('inf'):\n", 751 | " total_norm = max(p.grad.data.abs().max() for p in parameters)\n", 752 | " else:\n", 753 | " total_norm = 0\n", 754 | " for p in parameters:\n", 755 | " param_norm = p.grad.data.norm(norm_type)\n", 756 | " total_norm += param_norm ** norm_type\n", 757 | " total_norm = total_norm ** (1. 
/ norm_type)\n", 758 | " return total_norm\n", 759 | "\n", 760 | "def train(src_sents, tgt_sents, src_seqs, tgt_seqs, src_lens, tgt_lens,\n", 761 | " encoder, decoder, encoder_optim, decoder_optim, opts): \n", 762 | " # -------------------------------------\n", 763 | " # Prepare input and output placeholders\n", 764 | " # -------------------------------------\n", 765 | " # Last batch might not have the same size as we set to the `batch_size`\n", 766 | " batch_size = src_seqs.size(1)\n", 767 | " assert(batch_size == tgt_seqs.size(1))\n", 768 | " \n", 769 | " # Pack tensors to variables for neural network inputs (in order to autograd)\n", 770 | " src_seqs = Variable(src_seqs)\n", 771 | " tgt_seqs = Variable(tgt_seqs)\n", 772 | " src_lens = Variable(torch.LongTensor(src_lens))\n", 773 | " tgt_lens = Variable(torch.LongTensor(tgt_lens))\n", 774 | "\n", 775 | " # Decoder's input\n", 776 | " input_seq = Variable(torch.LongTensor([BOS] * batch_size))\n", 777 | " \n", 778 | " # Decoder's output sequence length = max target sequence length of current batch.\n", 779 | " max_tgt_len = tgt_lens.data.max()\n", 780 | " \n", 781 | " # Store all decoder's outputs.\n", 782 | " # **CRUTIAL** \n", 783 | " # Don't set:\n", 784 | " # >> decoder_outputs = Variable(torch.zeros(max_tgt_len, batch_size, decoder.vocab_size))\n", 785 | " # Varying tensor size could cause GPU allocate a new memory causing OOM, \n", 786 | " # so we intialize tensor with fixed size instead:\n", 787 | " # `opts.max_seq_len` is a fixed number, unlike `max_tgt_len` always varys.\n", 788 | " decoder_outputs = Variable(torch.zeros(opts.max_seq_len, batch_size, decoder.vocab_size))\n", 789 | "\n", 790 | " # Move variables from CPU to GPU.\n", 791 | " if USE_CUDA:\n", 792 | " src_seqs = src_seqs.cuda()\n", 793 | " tgt_seqs = tgt_seqs.cuda()\n", 794 | " src_lens = src_lens.cuda()\n", 795 | " tgt_lens = tgt_lens.cuda()\n", 796 | " input_seq = input_seq.cuda()\n", 797 | " decoder_outputs = decoder_outputs.cuda()\n", 798 | " \n", 799 | " # -------------------------------------\n", 800 | " # Training mode (enable dropout)\n", 801 | " # -------------------------------------\n", 802 | " encoder.train()\n", 803 | " decoder.train()\n", 804 | " \n", 805 | " # -------------------------------------\n", 806 | " # Zero gradients, since optimizers will accumulate gradients for every backward.\n", 807 | " # -------------------------------------\n", 808 | " encoder_optim.zero_grad()\n", 809 | " decoder_optim.zero_grad()\n", 810 | " \n", 811 | " # -------------------------------------\n", 812 | " # Forward encoder\n", 813 | " # -------------------------------------\n", 814 | " encoder_outputs, encoder_hidden = encoder(src_seqs, src_lens.data.tolist())\n", 815 | "\n", 816 | " # -------------------------------------\n", 817 | " # Forward decoder\n", 818 | " # -------------------------------------\n", 819 | " # Initialize decoder's hidden state as encoder's last hidden state.\n", 820 | " decoder_hidden = encoder_hidden\n", 821 | " \n", 822 | " # Run through decoder one time step at a time.\n", 823 | " for t in range(max_tgt_len):\n", 824 | " \n", 825 | " # decoder returns:\n", 826 | " # - decoder_output : (batch_size, vocab_size)\n", 827 | " # - decoder_hidden : (num_layers, batch_size, hidden_size)\n", 828 | " # - attention_weights: (batch_size, max_src_len)\n", 829 | " decoder_output, decoder_hidden, attention_weights = decoder(input_seq, decoder_hidden,\n", 830 | " encoder_outputs, src_lens)\n", 831 | "\n", 832 | " # Store decoder outputs.\n", 833 | " 
decoder_outputs[t] = decoder_output\n", 834 | " \n", 835 | " # Next input is current target\n", 836 | " input_seq = tgt_seqs[t]\n", 837 | " \n", 838 | " # Detach hidden state:\n", 839 | " detach_hidden(decoder_hidden)\n", 840 | " \n", 841 | " # -------------------------------------\n", 842 | " # Compute loss\n", 843 | " # -------------------------------------\n", 844 | " loss, pred_seqs, num_corrects, num_words = masked_cross_entropy(\n", 845 | " decoder_outputs[:max_tgt_len].transpose(0,1).contiguous(), \n", 846 | " tgt_seqs.transpose(0,1).contiguous(),\n", 847 | " tgt_lens\n", 848 | " )\n", 849 | " \n", 850 | " pred_seqs = pred_seqs[:max_tgt_len]\n", 851 | " \n", 852 | " # -------------------------------------\n", 853 | " # Backward and optimize\n", 854 | " # -------------------------------------\n", 855 | " # Backward to get gradients w.r.t parameters in model.\n", 856 | " loss.backward()\n", 857 | " \n", 858 | " # Clip gradients\n", 859 | " encoder_grad_norm = nn.utils.clip_grad_norm(encoder.parameters(), opts.max_grad_norm)\n", 860 | " decoder_grad_norm = nn.utils.clip_grad_norm(decoder.parameters(), opts.max_grad_norm)\n", 861 | " clipped_encoder_grad_norm = compute_grad_norm(encoder.parameters())\n", 862 | " clipped_decoder_grad_norm = compute_grad_norm(decoder.parameters())\n", 863 | " \n", 864 | " # Update parameters with optimizers\n", 865 | " encoder_optim.step()\n", 866 | " decoder_optim.step()\n", 867 | " \n", 868 | " return loss.data[0], pred_seqs, attention_weights, num_corrects, num_words,\\\n", 869 | " encoder_grad_norm, decoder_grad_norm, clipped_encoder_grad_norm, clipped_decoder_grad_norm" 870 | ] 871 | }, 872 | { 873 | "cell_type": "markdown", 874 | "metadata": {}, 875 | "source": [ 876 | "## Main\n", 877 | "\n", 878 | "### Load dataset\n", 879 | "You can download the small grammatical error correction dataset from [here](https://github.com/keisks/jfleg)." 
880 | ] 881 | }, 882 | { 883 | "cell_type": "code", 884 | "execution_count": 9, 885 | "metadata": {}, 886 | "outputs": [ 887 | { 888 | "name": "stdout", 889 | "output_type": "stream", 890 | "text": [ 891 | "====================================================================================================\n", 892 | "Dataset preprocessing log:\n", 893 | "- Loading and tokenizing source sentences...\n" 894 | ] 895 | }, 896 | { 897 | "name": "stderr", 898 | "output_type": "stream", 899 | "text": [ 900 | "100%|██████████| 2443191/2443191 [00:07<00:00, 343302.86it/s]\n", 901 | " 0%| | 6142/2443191 [00:00<00:46, 52749.90it/s]" 902 | ] 903 | }, 904 | { 905 | "name": "stdout", 906 | "output_type": "stream", 907 | "text": [ 908 | "- Loading and tokenizing target sentences...\n" 909 | ] 910 | }, 911 | { 912 | "name": "stderr", 913 | "output_type": "stream", 914 | "text": [ 915 | "100%|██████████| 2443191/2443191 [00:06<00:00, 365671.01it/s]\n", 916 | " 1%| | 25731/2443191 [00:00<00:09, 257275.90it/s]" 917 | ] 918 | }, 919 | { 920 | "name": "stdout", 921 | "output_type": "stream", 922 | "text": [ 923 | "- Building source counter...\n" 924 | ] 925 | }, 926 | { 927 | "name": "stderr", 928 | "output_type": "stream", 929 | "text": [ 930 | "100%|██████████| 2443191/2443191 [00:08<00:00, 272308.93it/s]\n", 931 | " 1%| | 29309/2443191 [00:00<00:08, 293065.82it/s]" 932 | ] 933 | }, 934 | { 935 | "name": "stdout", 936 | "output_type": "stream", 937 | "text": [ 938 | "- Building target counter...\n" 939 | ] 940 | }, 941 | { 942 | "name": "stderr", 943 | "output_type": "stream", 944 | "text": [ 945 | "100%|██████████| 2443191/2443191 [00:10<00:00, 240597.14it/s]\n" 946 | ] 947 | }, 948 | { 949 | "name": "stdout", 950 | "output_type": "stream", 951 | "text": [ 952 | "- Building source vocabulary...\n" 953 | ] 954 | }, 955 | { 956 | "name": "stderr", 957 | "output_type": "stream", 958 | "text": [ 959 | "50000it [00:00, 1461165.22it/s]\n", 960 | "100%|██████████| 50004/50004 [00:00<00:00, 2374494.52it/s]" 961 | ] 962 | }, 963 | { 964 | "name": "stdout", 965 | "output_type": "stream", 966 | "text": [ 967 | "- Building target vocabulary...\n", 968 | "====================================================================================================\n", 969 | "Dataset Info:\n", 970 | "- Number of source sentences: 2443191\n", 971 | "- Number of target sentences: 2443191\n", 972 | "- Source vocabulary size: 50004\n", 973 | "- Target vocabulary size: 50004\n", 974 | "- Shared vocabulary: True\n", 975 | "====================================================================================================\n", 976 | "\n" 977 | ] 978 | }, 979 | { 980 | "name": "stderr", 981 | "output_type": "stream", 982 | "text": [ 983 | "\n" 984 | ] 985 | } 986 | ], 987 | "source": [ 988 | "# train_dataset = NMTDataset(src_path='../dataset/jfleg/dev/dev.src',\n", 989 | "# tgt_path='../dataset/jfleg/dev/dev.ref1')\n", 990 | "\n", 991 | "train_dataset = NMTDataset(src_path='../dataset/efcamdat/efcamdat2.changed.src.txt',\n", 992 | " tgt_path='../dataset/efcamdat/efcamdat2.changed.tgt.txt')" 993 | ] 994 | }, 995 | { 996 | "cell_type": "code", 997 | "execution_count": 10, 998 | "metadata": {}, 999 | "outputs": [ 1000 | { 1001 | "name": "stderr", 1002 | "output_type": "stream", 1003 | "text": [ 1004 | "100%|██████████| 754/754 [00:00<00:00, 359334.76it/s]\n", 1005 | "100%|██████████| 754/754 [00:00<00:00, 333597.60it/s]" 1006 | ] 1007 | }, 1008 | { 1009 | "name": "stdout", 1010 | "output_type": "stream", 1011 | "text": [ 1012 | 
"====================================================================================================\n", 1013 | "Dataset preprocessing log:\n", 1014 | "- Loading and tokenizing source sentences...\n", 1015 | "- Loading and tokenizing target sentences...\n", 1016 | "====================================================================================================\n", 1017 | "Dataset Info:\n", 1018 | "- Number of source sentences: 754\n", 1019 | "- Number of target sentences: 754\n", 1020 | "- Source vocabulary size: 50004\n", 1021 | "- Target vocabulary size: 50004\n", 1022 | "- Shared vocabulary: True\n", 1023 | "====================================================================================================\n", 1024 | "\n" 1025 | ] 1026 | }, 1027 | { 1028 | "name": "stderr", 1029 | "output_type": "stream", 1030 | "text": [ 1031 | "\n" 1032 | ] 1033 | } 1034 | ], 1035 | "source": [ 1036 | "valid_dataset = NMTDataset(src_path='../dataset/jfleg/dev/dev.src',\n", 1037 | " tgt_path='../dataset/jfleg/dev/dev.ref0',\n", 1038 | " src_vocab=train_dataset.src_vocab,\n", 1039 | " tgt_vocab=train_dataset.tgt_vocab)" 1040 | ] 1041 | }, 1042 | { 1043 | "cell_type": "markdown", 1044 | "metadata": {}, 1045 | "source": [ 1046 | "### Batchify dataset using dataloader" 1047 | ] 1048 | }, 1049 | { 1050 | "cell_type": "code", 1051 | "execution_count": 11, 1052 | "metadata": { 1053 | "collapsed": true 1054 | }, 1055 | "outputs": [], 1056 | "source": [ 1057 | "batch_size = 48\n", 1058 | "\n", 1059 | "train_iter = DataLoader(dataset=train_dataset,\n", 1060 | " batch_size=batch_size,\n", 1061 | " shuffle=True,\n", 1062 | " num_workers=4,\n", 1063 | " collate_fn=collate_fn)\n", 1064 | "\n", 1065 | "valid_iter = DataLoader(dataset=valid_dataset,\n", 1066 | " batch_size=batch_size, \n", 1067 | " shuffle=False,\n", 1068 | " num_workers=4,\n", 1069 | " collate_fn=collate_fn)" 1070 | ] 1071 | }, 1072 | { 1073 | "cell_type": "markdown", 1074 | "metadata": {}, 1075 | "source": [ 1076 | "### Hyperparameters" 1077 | ] 1078 | }, 1079 | { 1080 | "cell_type": "code", 1081 | "execution_count": 12, 1082 | "metadata": { 1083 | "collapsed": true 1084 | }, 1085 | "outputs": [], 1086 | "source": [ 1087 | "# If enabled, load checkpoint.\n", 1088 | "LOAD_CHECKPOINT = True\n", 1089 | "\n", 1090 | "if LOAD_CHECKPOINT:\n", 1091 | " # Modify this path.\n", 1092 | " checkpoint_path = './checkpoints/seq2seq_2018-02-07 20:30:47_acc_88.15_loss_12.85_step_135000.pt'\n", 1093 | " checkpoint = load_checkpoint(checkpoint_path)\n", 1094 | " opts = checkpoint['opts'] \n", 1095 | "else:\n", 1096 | " opts = AttrDict()\n", 1097 | "\n", 1098 | " # Configure models\n", 1099 | " opts.word_vec_size = 300\n", 1100 | " opts.rnn_type = 'LSTM'\n", 1101 | " opts.hidden_size = 512\n", 1102 | " opts.num_layers = 2\n", 1103 | " opts.dropout = 0.3\n", 1104 | " opts.bidirectional = True\n", 1105 | " opts.attention = True\n", 1106 | " opts.share_embeddings = True\n", 1107 | " opts.pretrained_embeddings = True\n", 1108 | " opts.fixed_embeddings = True\n", 1109 | " opts.tie_embeddings = True # Tie decoder's input and output embeddings\n", 1110 | "\n", 1111 | " # Configure optimization\n", 1112 | " opts.max_grad_norm = 2\n", 1113 | " opts.learning_rate = 0.001\n", 1114 | " opts.weight_decay = 1e-5 # L2 weight regularization\n", 1115 | " \n", 1116 | " # Configure training\n", 1117 | " opts.max_seq_len = 100 # max sequence length to prevent OOM.\n", 1118 | " opts.num_epochs = 5\n", 1119 | " opts.print_every_step = 20\n", 1120 | " opts.save_every_step = 5000" 1121 
| ] 1122 | }, 1123 | { 1124 | "cell_type": "code", 1125 | "execution_count": 13, 1126 | "metadata": {}, 1127 | "outputs": [ 1128 | { 1129 | "name": "stdout", 1130 | "output_type": "stream", 1131 | "text": [ 1132 | "====================================================================================================\n", 1133 | "Options log:\n", 1134 | "- Load from checkpoint: True\n", 1135 | "- Global step: 135000\n", 1136 | "- word_vec_size: 300\n", 1137 | "- rnn_type: LSTM\n", 1138 | "- hidden_size: 512\n", 1139 | "- num_layers: 2\n", 1140 | "- dropout: 0.3\n", 1141 | "- bidirectional: True\n", 1142 | "- attention: True\n", 1143 | "- share_embeddings: True\n", 1144 | "- pretrained_embeddings: True\n", 1145 | "- fixed_embeddings: True\n", 1146 | "- tie_embeddings: True\n", 1147 | "- max_grad_norm: 2\n", 1148 | "- learning_rate: 0.001\n", 1149 | "- weight_decay: 1e-05\n", 1150 | "- max_seq_len: 100\n", 1151 | "- num_epochs: 5\n", 1152 | "- print_every_step: 20\n", 1153 | "- save_every_step: 5000\n", 1154 | "====================================================================================================\n", 1155 | "\n" 1156 | ] 1157 | } 1158 | ], 1159 | "source": [ 1160 | "print('='*100)\n", 1161 | "print('Options log:')\n", 1162 | "print('- Load from checkpoint: {}'.format(LOAD_CHECKPOINT))\n", 1163 | "if LOAD_CHECKPOINT: print('- Global step: {}'.format(checkpoint['global_step']))\n", 1164 | "for k,v in opts.items(): print('- {}: {}'.format(k, v))\n", 1165 | "print('='*100 + '\\n')" 1166 | ] 1167 | }, 1168 | { 1169 | "cell_type": "markdown", 1170 | "metadata": {}, 1171 | "source": [ 1172 | "### Initialize embeddings and models" 1173 | ] 1174 | }, 1175 | { 1176 | "cell_type": "code", 1177 | "execution_count": 14, 1178 | "metadata": {}, 1179 | "outputs": [ 1180 | { 1181 | "name": "stderr", 1182 | "output_type": "stream", 1183 | "text": [ 1184 | " 38%|███▊ | 19055/50004 [00:00<00:00, 190512.48it/s]" 1185 | ] 1186 | }, 1187 | { 1188 | "name": "stdout", 1189 | "output_type": "stream", 1190 | "text": [ 1191 | "====================================================================================================\n", 1192 | "Loading spacy glove embedding:\n", 1193 | "- Vocabulary size: 50004\n", 1194 | "- Word vector size: 300\n" 1195 | ] 1196 | }, 1197 | { 1198 | "name": "stderr", 1199 | "output_type": "stream", 1200 | "text": [ 1201 | "100%|██████████| 50004/50004 [00:00<00:00, 144659.40it/s]\n" 1202 | ] 1203 | }, 1204 | { 1205 | "name": "stdout", 1206 | "output_type": "stream", 1207 | "text": [ 1208 | "- Unknown word count: 9362\n", 1209 | "====================================================================================================\n", 1210 | "\n" 1211 | ] 1212 | } 1213 | ], 1214 | "source": [ 1215 | "# Initialize vocabulary size.\n", 1216 | "src_vocab_size = len(train_dataset.src_vocab.token2id)\n", 1217 | "tgt_vocab_size = len(train_dataset.tgt_vocab.token2id)\n", 1218 | "\n", 1219 | "# Initialize embeddings.\n", 1220 | "# We can actually put all modules in one module like `NMTModel`)\n", 1221 | "# See: https://github.com/spro/practical-pytorch/issues/34\n", 1222 | "word_vec_size = opts.word_vec_size if not opts.pretrained_embeddings else nlp.vocab.vectors_length\n", 1223 | "src_embedding = nn.Embedding(src_vocab_size, word_vec_size, padding_idx=PAD)\n", 1224 | "tgt_embedding = nn.Embedding(tgt_vocab_size, word_vec_size, padding_idx=PAD)\n", 1225 | "\n", 1226 | "if opts.share_embeddings:\n", 1227 | " assert(src_vocab_size == tgt_vocab_size)\n", 1228 | " tgt_embedding.weight = 
src_embedding.weight\n", 1229 | "\n", 1230 | "# Initialize models.\n", 1231 | "encoder = EncoderRNN(embedding=src_embedding,\n", 1232 | " rnn_type=opts.rnn_type,\n", 1233 | " hidden_size=opts.hidden_size,\n", 1234 | " num_layers=opts.num_layers,\n", 1235 | " dropout=opts.dropout,\n", 1236 | " bidirectional=opts.bidirectional)\n", 1237 | "\n", 1238 | "decoder = LuongAttnDecoderRNN(encoder, embedding=tgt_embedding,\n", 1239 | " attention=opts.attention,\n", 1240 | " tie_embeddings=opts.tie_embeddings,\n", 1241 | " dropout=opts.dropout)\n", 1242 | "\n", 1243 | "if opts.pretrained_embeddings:\n", 1244 | " glove_embeddings = load_spacy_glove_embedding(nlp, train_dataset.src_vocab)\n", 1245 | " encoder.embedding.weight.data.copy_(glove_embeddings)\n", 1246 | " decoder.embedding.weight.data.copy_(glove_embeddings)\n", 1247 | " if opts.fixed_embeddings:\n", 1248 | " encoder.embedding.weight.requires_grad = False\n", 1249 | " decoder.embedding.weight.requires_grad = False\n", 1250 | " \n", 1251 | "if LOAD_CHECKPOINT:\n", 1252 | " encoder.load_state_dict(checkpoint['encoder_state_dict'])\n", 1253 | " decoder.load_state_dict(checkpoint['decoder_state_dict'])\n", 1254 | " \n", 1255 | "# Move models to GPU (need time for initial run)\n", 1256 | "if USE_CUDA:\n", 1257 | " encoder.cuda()\n", 1258 | " decoder.cuda()" 1259 | ] 1260 | }, 1261 | { 1262 | "cell_type": "markdown", 1263 | "metadata": {}, 1264 | "source": [ 1265 | "### Fine-tuning embeddings\n", 1266 | "Recommend to use fine-tune after training for a while until the training loss don't decrease.\n", 1267 | "\n", 1268 | "TODO: Should be controlled in training loop." 1269 | ] 1270 | }, 1271 | { 1272 | "cell_type": "code", 1273 | "execution_count": 15, 1274 | "metadata": { 1275 | "collapsed": true 1276 | }, 1277 | "outputs": [], 1278 | "source": [ 1279 | "FINE_TUNE = True\n", 1280 | "if FINE_TUNE:\n", 1281 | " encoder.embedding.weight.requires_grad = True" 1282 | ] 1283 | }, 1284 | { 1285 | "cell_type": "code", 1286 | "execution_count": 16, 1287 | "metadata": {}, 1288 | "outputs": [ 1289 | { 1290 | "name": "stdout", 1291 | "output_type": "stream", 1292 | "text": [ 1293 | "====================================================================================================\n", 1294 | "Model log:\n", 1295 | "\n", 1296 | "EncoderRNN (\n", 1297 | " (embedding): Embedding(50004, 300, padding_idx=0)\n", 1298 | " (rnn): LSTM(300, 256, num_layers=2, dropout=0.3, bidirectional=True)\n", 1299 | ")\n", 1300 | "LuongAttnDecoderRNN (\n", 1301 | " (embedding): Embedding(50004, 300, padding_idx=0)\n", 1302 | " (rnn): LSTM(300, 512, num_layers=2, dropout=0.3)\n", 1303 | " (W_a): Linear (512 -> 512)\n", 1304 | " (W_c): Linear (1024 -> 512)\n", 1305 | " (W_proj): Linear (512 -> 300)\n", 1306 | " (W_s): Linear (300 -> 50004)\n", 1307 | ")\n", 1308 | "- Encoder input embedding requires_grad=True\n", 1309 | "- Decoder input embedding requires_grad=True\n", 1310 | "- Decoder output embedding requires_grad=True\n", 1311 | "====================================================================================================\n", 1312 | "\n" 1313 | ] 1314 | } 1315 | ], 1316 | "source": [ 1317 | "print('='*100)\n", 1318 | "print('Model log:\\n')\n", 1319 | "print(encoder)\n", 1320 | "print(decoder)\n", 1321 | "print('- Encoder input embedding requires_grad={}'.format(encoder.embedding.weight.requires_grad))\n", 1322 | "print('- Decoder input embedding requires_grad={}'.format(decoder.embedding.weight.requires_grad))\n", 1323 | "print('- Decoder output embedding 
requires_grad={}'.format(decoder.W_s.weight.requires_grad))\n", 1324 | "print('='*100 + '\\n')" 1325 | ] 1326 | }, 1327 | { 1328 | "cell_type": "markdown", 1329 | "metadata": {}, 1330 | "source": [ 1331 | "### Initialize optimizers\n", 1332 | "TODO: Different learning rate for fine tuning embeddings: https://discuss.pytorch.org/t/how-to-perform-finetuning-in-pytorch/419/7" 1333 | ] 1334 | }, 1335 | { 1336 | "cell_type": "code", 1337 | "execution_count": 17, 1338 | "metadata": { 1339 | "collapsed": true 1340 | }, 1341 | "outputs": [], 1342 | "source": [ 1343 | "# Initialize optimizers (we can experiment different learning rates)\n", 1344 | "encoder_optim = optim.Adam([p for p in encoder.parameters() if p.requires_grad], lr=opts.learning_rate, weight_decay=opts.weight_decay)\n", 1345 | "decoder_optim = optim.Adam([p for p in decoder.parameters() if p.requires_grad], lr=opts.learning_rate, weight_decay=opts.weight_decay)" 1346 | ] 1347 | }, 1348 | { 1349 | "cell_type": "markdown", 1350 | "metadata": {}, 1351 | "source": [ 1352 | "## Training" 1353 | ] 1354 | }, 1355 | { 1356 | "cell_type": "code", 1357 | "execution_count": null, 1358 | "metadata": { 1359 | "collapsed": true 1360 | }, 1361 | "outputs": [], 1362 | "source": [] 1363 | }, 1364 | { 1365 | "cell_type": "code", 1366 | "execution_count": null, 1367 | "metadata": { 1368 | "collapsed": true 1369 | }, 1370 | "outputs": [], 1371 | "source": [ 1372 | "\"\"\" Open port 6006 and see tensorboard.\n", 1373 | " Ref: https://medium.com/@dexterhuang/%E7%B5%A6-pytorch-%E7%94%A8%E7%9A%84-tensorboard-bb341ce3f837\n", 1374 | "\"\"\"\n", 1375 | "from datetime import datetime\n", 1376 | "from tensorboardX import SummaryWriter\n", 1377 | "# --------------------------\n", 1378 | "# Configure tensorboard\n", 1379 | "# --------------------------\n", 1380 | "model_name = 'seq2seq'\n", 1381 | "datetime = ('%s' % datetime.now()).split('.')[0]\n", 1382 | "experiment_name = '{}_{}'.format(model_name, datetime)\n", 1383 | "tensorboard_log_dir = './tensorboard-logs/{}/'.format(experiment_name)\n", 1384 | "writer = SummaryWriter(tensorboard_log_dir)\n", 1385 | "\n", 1386 | "# --------------------------\n", 1387 | "# Configure training\n", 1388 | "# --------------------------\n", 1389 | "num_epochs = opts.num_epochs\n", 1390 | "print_every_step = opts.print_every_step\n", 1391 | "save_every_step = opts.save_every_step\n", 1392 | "# For saving checkpoint and tensorboard\n", 1393 | "global_step = 0 if not LOAD_CHECKPOINT else checkpoint['global_step']\n", 1394 | "\n", 1395 | "# --------------------------\n", 1396 | "# Start training\n", 1397 | "# --------------------------\n", 1398 | "total_loss = 0\n", 1399 | "total_corrects = 0\n", 1400 | "total_words = 0\n", 1401 | "prev_gpu_memory_usage = 0" 1402 | ] 1403 | }, 1404 | { 1405 | "cell_type": "code", 1406 | "execution_count": null, 1407 | "metadata": { 1408 | "collapsed": true, 1409 | "scrolled": true 1410 | }, 1411 | "outputs": [], 1412 | "source": [ 1413 | "for epoch in range(num_epochs):\n", 1414 | " for batch_id, batch_data in tqdm(enumerate(train_iter)):\n", 1415 | "\n", 1416 | " # Unpack batch data\n", 1417 | " src_sents, tgt_sents, src_seqs, tgt_seqs, src_lens, tgt_lens = batch_data\n", 1418 | " \n", 1419 | " # Ignore batch if there is a long sequence.\n", 1420 | " max_seq_len = max(src_lens + tgt_lens)\n", 1421 | " if max_seq_len > opts.max_seq_len:\n", 1422 | " print('[!] 
Ignore batch: sequence length={} > max sequence length={}'.format(max_seq_len, opts.max_seq_len))\n", 1423 | " continue\n", 1424 | " \n", 1425 | " # Train.\n", 1426 | " loss, pred_seqs, attention_weights, num_corrects, num_words, \\\n", 1427 | " encoder_grad_norm, decoder_grad_norm, clipped_encoder_grad_norm, clipped_decoder_grad_norm \\\n", 1428 | " = train(src_sents, tgt_sents, src_seqs, tgt_seqs, src_lens, tgt_lens, encoder, decoder, encoder_optim, decoder_optim, opts)\n", 1429 | "\n", 1430 | " # Statistics.\n", 1431 | " global_step += 1\n", 1432 | " total_loss += loss\n", 1433 | " total_corrects += num_corrects\n", 1434 | " total_words += num_words\n", 1435 | " total_accuracy = 100 * (total_corrects / total_words)\n", 1436 | " \n", 1437 | " # Save checkpoint.\n", 1438 | " if global_step % save_every_step == 0:\n", 1439 | " \n", 1440 | " checkpoint_path = save_checkpoint(opts, experiment_name, encoder, decoder, encoder_optim, decoder_optim, \n", 1441 | " total_accuracy, total_loss, global_step)\n", 1442 | " \n", 1443 | " print('='*100)\n", 1444 | " print('Save checkpoint to \"{}\".'.format(checkpoint_path))\n", 1445 | " print('='*100 + '\\n')\n", 1446 | "\n", 1447 | " # Print statistics and write to Tensorboard.\n", 1448 | " if global_step % print_every_step == 0:\n", 1449 | " \n", 1450 | " curr_gpu_memory_usage = get_gpu_memory_usage(device_id=torch.cuda.current_device())\n", 1451 | " diff_gpu_memory_usage = curr_gpu_memory_usage - prev_gpu_memory_usage\n", 1452 | " prev_gpu_memory_usage = curr_gpu_memory_usage\n", 1453 | " \n", 1454 | " print('='*100)\n", 1455 | " print('Training log:')\n", 1456 | " print('- Epoch: {}/{}'.format(epoch, num_epochs))\n", 1457 | " print('- Global step: {}'.format(global_step))\n", 1458 | " print('- Total loss: {}'.format(total_loss))\n", 1459 | " print('- Total corrects: {}'.format(total_corrects))\n", 1460 | " print('- Total words: {}'.format(total_words))\n", 1461 | " print('- Total accuracy: {}'.format(total_accuracy))\n", 1462 | " print('- Current GPU memory usage: {}'.format(curr_gpu_memory_usage))\n", 1463 | " print('- Diff GPU memory usage: {}'.format(diff_gpu_memory_usage))\n", 1464 | " print('='*100 + '\\n')\n", 1465 | " \n", 1466 | " write_to_tensorboard(writer, global_step, total_loss, total_corrects, total_words, total_accuracy,\n", 1467 | " encoder_grad_norm, decoder_grad_norm, clipped_encoder_grad_norm, clipped_decoder_grad_norm,\n", 1468 | " encoder, decoder,\n", 1469 | " gpu_memory_usage={\n", 1470 | " 'curr': curr_gpu_memory_usage,\n", 1471 | " 'diff': diff_gpu_memory_usage\n", 1472 | " })\n", 1473 | " \n", 1474 | " total_loss = 0\n", 1475 | " total_corrects = 0\n", 1476 | " total_words = 0\n", 1477 | "\n", 1478 | " # Free memory\n", 1479 | " del src_sents, tgt_sents, src_seqs, tgt_seqs, src_lens, tgt_lens, \\\n", 1480 | " loss, pred_seqs, attention_weights, num_corrects, num_words, \\\n", 1481 | " encoder_grad_norm, decoder_grad_norm, clipped_encoder_grad_norm, clipped_decoder_grad_norm\n", 1482 | " " 1483 | ] 1484 | }, 1485 | { 1486 | "cell_type": "code", 1487 | "execution_count": null, 1488 | "metadata": { 1489 | "collapsed": true 1490 | }, 1491 | "outputs": [], 1492 | "source": [ 1493 | "checkpoint_path = save_checkpoint(opts, experiment_name, encoder, decoder, encoder_optim, decoder_optim, \n", 1494 | " total_accuracy, total_loss, global_step)\n", 1495 | " \n", 1496 | "print('='*100)\n", 1497 | "print('Save checkpoint to \"{}\".'.format(checkpoint_path))\n", 1498 | "print('='*100 + '\\n')" 1499 | ] 1500 | }, 1501 | { 1502 | 
"cell_type": "markdown", 1503 | "metadata": {}, 1504 | "source": [ 1505 | "## Evaluation" 1506 | ] 1507 | }, 1508 | { 1509 | "cell_type": "code", 1510 | "execution_count": 18, 1511 | "metadata": { 1512 | "collapsed": true 1513 | }, 1514 | "outputs": [], 1515 | "source": [ 1516 | "def evaluate(src_sents, tgt_sents, src_seqs, tgt_seqs, src_lens, tgt_lens, encoder, decoder):\n", 1517 | " # -------------------------------------\n", 1518 | " # Prepare input and output placeholders\n", 1519 | " # -------------------------------------\n", 1520 | " # Last batch might not have the same size as we set to the `batch_size`\n", 1521 | " batch_size = src_seqs.size(1)\n", 1522 | " assert(batch_size == tgt_seqs.size(1))\n", 1523 | " \n", 1524 | " # Pack tensors to variables for neural network inputs (in order to autograd)\n", 1525 | " src_seqs = Variable(src_seqs, volatile=True)\n", 1526 | " tgt_seqs = Variable(tgt_seqs, volatile=True)\n", 1527 | " src_lens = Variable(torch.LongTensor(src_lens), volatile=True)\n", 1528 | " tgt_lens = Variable(torch.LongTensor(tgt_lens), volatile=True)\n", 1529 | "\n", 1530 | " # Decoder's input\n", 1531 | " input_seq = Variable(torch.LongTensor([BOS] * batch_size), volatile=True)\n", 1532 | " \n", 1533 | " # Decoder's output sequence length = max target sequence length of current batch.\n", 1534 | " max_tgt_len = tgt_lens.data.max()\n", 1535 | " \n", 1536 | " # Store all decoder's outputs.\n", 1537 | " # **CRUTIAL** \n", 1538 | " # Don't set:\n", 1539 | " # >> decoder_outputs = Variable(torch.zeros(max_tgt_len, batch_size, decoder.vocab_size))\n", 1540 | " # Varying tensor size could cause GPU allocate a new memory causing OOM, \n", 1541 | " # so we intialize tensor with fixed size instead:\n", 1542 | " # `opts.max_seq_len` is a fixed number, unlike `max_tgt_len` always varys.\n", 1543 | " decoder_outputs = Variable(torch.zeros(opts.max_seq_len, batch_size, decoder.vocab_size), volatile=True)\n", 1544 | "\n", 1545 | " # Move variables from CPU to GPU.\n", 1546 | " if USE_CUDA:\n", 1547 | " src_seqs = src_seqs.cuda()\n", 1548 | " tgt_seqs = tgt_seqs.cuda()\n", 1549 | " src_lens = src_lens.cuda()\n", 1550 | " tgt_lens = tgt_lens.cuda()\n", 1551 | " input_seq = input_seq.cuda()\n", 1552 | " decoder_outputs = decoder_outputs.cuda()\n", 1553 | " \n", 1554 | " # -------------------------------------\n", 1555 | " # Evaluation mode (disable dropout)\n", 1556 | " # -------------------------------------\n", 1557 | " encoder.eval()\n", 1558 | " decoder.eval()\n", 1559 | " \n", 1560 | " # -------------------------------------\n", 1561 | " # Forward encoder\n", 1562 | " # -------------------------------------\n", 1563 | " encoder_outputs, encoder_hidden = encoder(src_seqs, src_lens.data.tolist())\n", 1564 | " \n", 1565 | " # -------------------------------------\n", 1566 | " # Forward decoder\n", 1567 | " # -------------------------------------\n", 1568 | " # Initialize decoder's hidden state as encoder's last hidden state.\n", 1569 | " decoder_hidden = encoder_hidden\n", 1570 | " \n", 1571 | " # Run through decoder one time step at a time.\n", 1572 | " for t in range(max_tgt_len):\n", 1573 | " \n", 1574 | " # decoder returns:\n", 1575 | " # - decoder_output : (batch_size, vocab_size)\n", 1576 | " # - decoder_hidden : (num_layers, batch_size, hidden_size)\n", 1577 | " # - attention_weights: (batch_size, max_src_len)\n", 1578 | " decoder_output, decoder_hidden, attention_weights = decoder(input_seq, decoder_hidden,\n", 1579 | " encoder_outputs, src_lens)\n", 1580 | "\n", 1581 | " # 
Store decoder outputs.\n", 1582 | " decoder_outputs[t] = decoder_output\n", 1583 | " \n", 1584 | " # Next input is current target\n", 1585 | " input_seq = tgt_seqs[t]\n", 1586 | " \n", 1587 | " # Detach hidden state (may not need this, since no BPTT)\n", 1588 | " detach_hidden(decoder_hidden)\n", 1589 | " \n", 1590 | " # -------------------------------------\n", 1591 | " # Compute loss\n", 1592 | " # -------------------------------------\n", 1593 | " loss, pred_seqs, num_corrects, num_words = masked_cross_entropy(\n", 1594 | " decoder_outputs[:max_tgt_len].transpose(0,1).contiguous(), \n", 1595 | " tgt_seqs.transpose(0,1).contiguous(),\n", 1596 | " tgt_lens\n", 1597 | " )\n", 1598 | " \n", 1599 | " pred_seqs = pred_seqs[:max_tgt_len]\n", 1600 | " \n", 1601 | " return loss.data[0], pred_seqs, attention_weights, num_corrects, num_words" 1602 | ] 1603 | }, 1604 | { 1605 | "cell_type": "code", 1606 | "execution_count": 19, 1607 | "metadata": {}, 1608 | "outputs": [ 1609 | { 1610 | "name": "stderr", 1611 | "output_type": "stream", 1612 | "text": [ 1613 | "16it [00:04, 3.73it/s]" 1614 | ] 1615 | }, 1616 | { 1617 | "name": "stdout", 1618 | "output_type": "stream", 1619 | "text": [ 1620 | "====================================================================================================\n", 1621 | "Validation log:\n", 1622 | "- Total loss: 23.030829787254333\n", 1623 | "- Total corrects: 11675\n", 1624 | "- Total words: 14994\n", 1625 | "- Total accuracy: 77.86447912498332\n", 1626 | "====================================================================================================\n", 1627 | "\n" 1628 | ] 1629 | }, 1630 | { 1631 | "name": "stderr", 1632 | "output_type": "stream", 1633 | "text": [ 1634 | "\n" 1635 | ] 1636 | } 1637 | ], 1638 | "source": [ 1639 | "total_loss = 0\n", 1640 | "total_corrects = 0\n", 1641 | "total_words = 0\n", 1642 | "\n", 1643 | "for batch_id, batch_data in tqdm(enumerate(valid_iter)):\n", 1644 | " src_sents, tgt_sents, src_seqs, tgt_seqs, src_lens, tgt_lens = batch_data\n", 1645 | " \n", 1646 | " loss, pred_seqs, attention_weights, num_corrects, num_words \\\n", 1647 | " = evaluate(src_sents, tgt_sents, src_seqs, tgt_seqs, src_lens, tgt_lens, encoder, decoder)\n", 1648 | " \n", 1649 | " total_loss += loss\n", 1650 | " total_corrects += num_corrects\n", 1651 | " total_words += num_words\n", 1652 | " total_accuracy = 100 * (total_corrects / total_words)\n", 1653 | "\n", 1654 | "print('='*100)\n", 1655 | "print('Validation log:')\n", 1656 | "print('- Total loss: {}'.format(total_loss))\n", 1657 | "print('- Total corrects: {}'.format(total_corrects))\n", 1658 | "print('- Total words: {}'.format(total_words))\n", 1659 | "print('- Total accuracy: {}'.format(total_accuracy))\n", 1660 | "print('='*100 + '\\n')" 1661 | ] 1662 | }, 1663 | { 1664 | "cell_type": "markdown", 1665 | "metadata": {}, 1666 | "source": [ 1667 | "## Translate (Inference)" 1668 | ] 1669 | }, 1670 | { 1671 | "cell_type": "code", 1672 | "execution_count": 50, 1673 | "metadata": { 1674 | "collapsed": true 1675 | }, 1676 | "outputs": [], 1677 | "source": [ 1678 | "def translate(src_text, train_dataset, encoder, decoder, max_seq_len, replace_unk=True):\n", 1679 | " # -------------------------------------\n", 1680 | " # Prepare input and output placeholders\n", 1681 | " # -------------------------------------\n", 1682 | " # Like dataset's `__getitem__()` and dataloader's `collate_fn()`.\n", 1683 | " src_sent = src_text.split()\n", 1684 | " src_seqs = 
torch.LongTensor([train_dataset.tokens2ids(tokens=src_text.split(),\n", 1685 | " token2id=train_dataset.src_vocab.token2id,\n", 1686 | " append_BOS=False, append_EOS=True)]).transpose(0,1)\n", 1687 | " src_lens = [len(src_seqs)]\n", 1688 | " \n", 1689 | " # Last batch might not have the same size as we set to the `batch_size`\n", 1690 | " batch_size = src_seqs.size(1)\n", 1691 | " \n", 1692 | " # Pack tensors to variables for neural network inputs (in order to autograd)\n", 1693 | " src_seqs = Variable(src_seqs, volatile=True)\n", 1694 | " src_lens = Variable(torch.LongTensor(src_lens), volatile=True)\n", 1695 | "\n", 1696 | " # Decoder's input\n", 1697 | " input_seq = Variable(torch.LongTensor([BOS] * batch_size), volatile=True)\n", 1698 | " # Store output words and attention states\n", 1699 | " out_sent = []\n", 1700 | " all_attention_weights = torch.zeros(max_seq_len, len(src_seqs))\n", 1701 | " \n", 1702 | " # Move variables from CPU to GPU.\n", 1703 | " if USE_CUDA:\n", 1704 | " src_seqs = src_seqs.cuda()\n", 1705 | " src_lens = src_lens.cuda()\n", 1706 | " input_seq = input_seq.cuda()\n", 1707 | " \n", 1708 | " # -------------------------------------\n", 1709 | " # Evaluation mode (disable dropout)\n", 1710 | " # -------------------------------------\n", 1711 | " encoder.eval()\n", 1712 | " decoder.eval()\n", 1713 | " \n", 1714 | " # -------------------------------------\n", 1715 | " # Forward encoder\n", 1716 | " # -------------------------------------\n", 1717 | " encoder_outputs, encoder_hidden = encoder(src_seqs, src_lens.data.tolist())\n", 1718 | "\n", 1719 | " # -------------------------------------\n", 1720 | " # Forward decoder\n", 1721 | " # -------------------------------------\n", 1722 | " # Initialize decoder's hidden state as encoder's last hidden state.\n", 1723 | " decoder_hidden = encoder_hidden\n", 1724 | " \n", 1725 | " # Run through decoder one time step at a time.\n", 1726 | " for t in range(max_seq_len):\n", 1727 | " \n", 1728 | " # decoder returns:\n", 1729 | " # - decoder_output : (batch_size, vocab_size)\n", 1730 | " # - decoder_hidden : (num_layers, batch_size, hidden_size)\n", 1731 | " # - attention_weights: (batch_size, max_src_len)\n", 1732 | " decoder_output, decoder_hidden, attention_weights = decoder(input_seq, decoder_hidden,\n", 1733 | " encoder_outputs, src_lens)\n", 1734 | "\n", 1735 | " # Store attention weights.\n", 1736 | " # .squeeze(0): remove `batch_size` dimension since batch_size=1\n", 1737 | " all_attention_weights[t] = attention_weights.squeeze(0).cpu().data \n", 1738 | " \n", 1739 | " # Choose top word from decoder's output\n", 1740 | " prob, token_id = decoder_output.data.topk(1)\n", 1741 | " token_id = token_id[0][0] # get value\n", 1742 | " if token_id == EOS:\n", 1743 | " break\n", 1744 | " else:\n", 1745 | " if token_id == UNK and replace_unk:\n", 1746 | " # Replace unk by selecting the source token with the highest attention score.\n", 1747 | " score, idx = all_attention_weights[t].max(0)\n", 1748 | " token = src_sent[idx[0]]\n", 1749 | " else:\n", 1750 | " # \n", 1751 | " token = train_dataset.tgt_vocab.id2token[token_id]\n", 1752 | " \n", 1753 | " out_sent.append(token)\n", 1754 | " \n", 1755 | " # Next input is chosen word\n", 1756 | " input_seq = Variable(torch.LongTensor([token_id]), volatile=True)\n", 1757 | " if USE_CUDA: input_seq = input_seq.cuda()\n", 1758 | " \n", 1759 | " # Repackage hidden state (may not need this, since no BPTT)\n", 1760 | " detach_hidden(decoder_hidden)\n", 1761 | " \n", 1762 | " src_text = ' 
'.join([train_dataset.src_vocab.id2token[token_id] for token_id in src_seqs.data.squeeze(1).tolist()])\n", 1763 | " out_text = ' '.join(out_sent)\n", 1764 | " \n", 1765 | " # all_attention_weights: (out_len, src_len)\n", 1766 | " return src_text, out_text, all_attention_weights[:len(out_sent)]" 1767 | ] 1768 | }, 1769 | { 1770 | "cell_type": "markdown", 1771 | "metadata": {}, 1772 | "source": [ 1773 | "### Small test for translation" 1774 | ] 1775 | }, 1776 | { 1777 | "cell_type": "code", 1778 | "execution_count": 23, 1779 | "metadata": {}, 1780 | "outputs": [ 1781 | { 1782 | "data": { 1783 | "text/plain": [ 1784 | "('He have a car ', 'He has a car', \n", 1785 | " 0.8339 0.0667 0.0158 0.0305 0.0530\n", 1786 | " 0.0471 0.8918 0.0325 0.0141 0.0144\n", 1787 | " 0.0654 0.2109 0.5132 0.1534 0.0572\n", 1788 | " 0.0083 0.0270 0.0291 0.8793 0.0564\n", 1789 | " [torch.FloatTensor of size 4x5])" 1790 | ] 1791 | }, 1792 | "execution_count": 23, 1793 | "metadata": {}, 1794 | "output_type": "execute_result" 1795 | } 1796 | ], 1797 | "source": [ 1798 | "src_text, out_text, all_attention_weights = translate('He have a car', train_dataset, encoder, decoder, max_seq_len=opts.max_seq_len)\n", 1799 | "src_text, out_text, all_attention_weights" 1800 | ] 1801 | }, 1802 | { 1803 | "cell_type": "code", 1804 | "execution_count": null, 1805 | "metadata": { 1806 | "collapsed": true 1807 | }, 1808 | "outputs": [], 1809 | "source": [ 1810 | "# check attention weight sum == 1\n", 1811 | "[all_attention_weights[t].sum() for t in range(all_attention_weights.size(0))]" 1812 | ] 1813 | }, 1814 | { 1815 | "cell_type": "markdown", 1816 | "metadata": {}, 1817 | "source": [ 1818 | "### Translate a given text file" 1819 | ] 1820 | }, 1821 | { 1822 | "cell_type": "code", 1823 | "execution_count": 24, 1824 | "metadata": { 1825 | "collapsed": true 1826 | }, 1827 | "outputs": [], 1828 | "source": [ 1829 | "test_src_texts = []\n", 1830 | "with codecs.open('../dataset/jfleg/test/test.src', 'r', 'utf-8') as f:\n", 1831 | " test_src_texts = f.readlines()" 1832 | ] 1833 | }, 1834 | { 1835 | "cell_type": "code", 1836 | "execution_count": 56, 1837 | "metadata": {}, 1838 | "outputs": [ 1839 | { 1840 | "data": { 1841 | "text/plain": [ 1842 | "['New and new technology has been introduced to the society .\\n',\n", 1843 | " 'One possible outcome is that an environmentally-induced reduction in motorization levels in the richer countries will outweigh any rise in motorization levels in the poorer countries .\\n',\n", 1844 | " 'Every person needs to know a bit about math , sciences , arts , literature and history in order to stand out in society .\\n',\n", 1845 | " 'While the travel company will most likely show them some interesting sites in order for their customers to advertise for their company to their family and friends , it is highly unlikely , that the company will tell about the sites that were not included in the tour -- for example due to entrance fees that would make the total package price overly expensive .\\n',\n", 1846 | " 'Disadvantage is parking their car is very difficult .\\n']" 1847 | ] 1848 | }, 1849 | "execution_count": 56, 1850 | "metadata": {}, 1851 | "output_type": "execute_result" 1852 | } 1853 | ], 1854 | "source": [ 1855 | "test_src_texts[:5]" 1856 | ] 1857 | }, 1858 | { 1859 | "cell_type": "code", 1860 | "execution_count": 51, 1861 | "metadata": { 1862 | "collapsed": true 1863 | }, 1864 | "outputs": [], 1865 | "source": [ 1866 | "out_texts = []\n", 1867 | "for src_text in test_src_texts:\n", 1868 | " _, out_text, _ 
= translate(src_text.strip(), train_dataset, encoder, decoder, max_seq_len=opts.max_seq_len)\n", 1869 | "    out_texts.append(out_text)" 1870 | ] 1871 | }, 1872 | { 1873 | "cell_type": "code", 1874 | "execution_count": 57, 1875 | "metadata": {}, 1876 | "outputs": [ 1877 | { 1878 | "data": { 1879 | "text/plain": [ 1880 | "['The new and new technology has been introduced to the society .',\n", 1881 | " 'One possible outcome is that an environmentally-induced reduction in motorization levels in the higher countries will outweigh any rise in motorization levels in the high countries .',\n", 1882 | " 'Every person needs to know a bit about math , sciences , arts , literature and history in order to stand out in society .',\n", 1883 | " 'While the travel company will most likely show them some interesting sites in order for their customers to advertise for their company to their family and friends , it is highly unlikely , that the company will tell about the sites that were not included in the tour -- for example due to entrance fees that would make the total of price overly expensive .',\n", 1884 | " 'The price is parking their cars are very difficult .']" 1885 | ] 1886 | }, 1887 | "execution_count": 57, 1888 | "metadata": {}, 1889 | "output_type": "execute_result" 1890 | } 1891 | ], 1892 | "source": [ 1893 | "out_texts[:5]" 1894 | ] 1895 | }, 1896 | { 1897 | "cell_type": "markdown", 1898 | "metadata": {}, 1899 | "source": [ 1900 | "### Save the predictions to text file" 1901 | ] 1902 | }, 1903 | { 1904 | "cell_type": "code", 1905 | "execution_count": 55, 1906 | "metadata": { 1907 | "collapsed": true 1908 | }, 1909 | "outputs": [], 1910 | "source": [ 1911 | "with codecs.open('./pred.txt', 'w', 'utf-8') as f:\n", 1912 | "    for text in out_texts:\n", 1913 | "        f.write(text + '\\n')" 1914 | ] 1915 | }, 1916 | { 1917 | "cell_type": "markdown", 1918 | "metadata": {}, 1919 | "source": [ 1920 | "### Evaluate with GLEU metric\n", 1921 | "If you're working with the grammatical error correction (GEC) corpus (jfleg),\n", 1922 | "it has an evaluation script specifically for the GEC task:\n", 1923 | "\n", 1924 | "Run:\n", 1925 | "```\n", 1926 | "python jfleg/eval/gleu.py \\\n", 1927 | "-s jfleg/test/test.src \\\n", 1928 | "-r jfleg/test/test.ref[0-3] \\\n", 1929 | "--hyp ./pred.txt\n", 1930 | "```\n", 1931 | "\n", 1932 | "Output (GLEU score, std, confidence interval):\n", 1933 | "Note: OpenNMT-py achieves a higher GLEU score of ~0.49 with the same model settings.\n", 1934 | "TODO: Try to optimize the code.\n", 1935 | "```\n", 1936 | "Running GLEU...\n", 1937 | "./pred.txt\n", 1938 | "[['0.451747', '0.007620', '(0.437,0.467)']]\n", 1939 | "```" 1940 | ] 1941 | }, 1942 | { 1943 | "cell_type": "markdown", 1944 | "metadata": {}, 1945 | "source": [ 1946 | "### Notes:\n", 1947 | "- Setting `MAX_LENGTH` for the training sequences is important to prevent OOM.\n", 1948 | "  - This affects: `decoder_outputs = Variable(torch.zeros(max_tgt_len, batch_size, decoder.vocab_size))`\n", 1949 | "- Do not call `next(iter(data_loader))` in the training for-loop; it can be very slow.\n", 1950 | "- When computing `num_corrects`, cast the `ByteTensor` with `.float()` before calling `.sum()`, otherwise the result will overflow. Ref: https://discuss.pytorch.org/t/batch-size-and-validation-accuracy/4066/3\n", 1951 | "- Crucial for GPU memory usage: don't set `MAX_LENGTH` to `max(tgt_lens)`. 
A varying tensor size could cause the GPU to allocate new memory, so we use a fixed tensor size instead: `decoder_outputs = Variable(torch.zeros(**MAX_LENGTH**, batch_size, decoder.vocab_size))`\n", 1952 | "- If you only want a `Variable`'s data for some operation, for example `sum()`, use `Variable(...).data.sum()` instead of `Variable(...).sum().data[0]`. The latter creates a new computational graph, and doing this in a for-loop can keep increasing memory.\n", 1953 | "- Be careful not to misuse `Variable`.\n", 1954 | "- `detach` the RNN's hidden states, or memory usage might grow when doing backprop.\n", 1955 | "- If you restart but GPU memory is not released, kill all python processes: `>> ps x |grep python|awk '{print $1}'|xargs kill`\n", 1956 | "- The decoder forward pass is time-consuming (for-loop).\n", 1957 | "- Calling `backward()` frees memory: https://discuss.pytorch.org/t/calling-loss-backward-reduce-memory-usage/2735\n", 1958 | "\n", 1959 | "### Try to:\n", 1960 | "- Implement scheduled sampling for training (see the sketch at the end of this document).\n", 1961 | "- Implement beam search for evaluation and translation.\n", 1962 | "- Understand and interpret parameter visualizations on Tensorboard.\n", 1963 | "- Implement more RNN optimization and regularization tricks:\n", 1964 | "    - Set `max_seq_len` to prevent RNN OOM\n", 1965 | "    - Xavier initializer\n", 1966 | "    - Weight normalization and layer normalization: https://github.com/pytorch/pytorch/issues/1601\n", 1967 | "    - Embedding dropout\n", 1968 | "    - Weight dropping\n", 1969 | "    - Variational dropout: [part1](https://becominghuman.ai/learning-note-dropout-in-recurrent-networks-part-1-57a9c19a2307), [part2](https://towardsdatascience.com/learning-note-dropout-in-recurrent-networks-part-2-f209222481f8), [part3](https://towardsdatascience.com/learning-note-dropout-in-recurrent-networks-part-3-1b161d030cd4)\n", 1970 | "    - Zoneout\n", 1971 | "    - Fraternal dropout\n", 1972 | "    - Activation regularization (AR), and temporal activation regularization (TAR)\n", 1973 | "    - Read more: [Regularizing and Optimizing LSTM Language Models](https://arxiv.org/pdf/1708.02182.pdf)\n" 1974 | ] 1975 | }, 1976 | { 1977 | "cell_type": "code", 1978 | "execution_count": null, 1979 | "metadata": { 1980 | "collapsed": true 1981 | }, 1982 | "outputs": [], 1983 | "source": [] 1984 | } 1985 | ], 1986 | "metadata": { 1987 | "kernelspec": { 1988 | "display_name": "Python 3", 1989 | "language": "python", 1990 | "name": "python3" 1991 | }, 1992 | "language_info": { 1993 | "codemirror_mode": { 1994 | "name": "ipython", 1995 | "version": 3 1996 | }, 1997 | "file_extension": ".py", 1998 | "mimetype": "text/x-python", 1999 | "name": "python", 2000 | "nbconvert_exporter": "python", 2001 | "pygments_lexer": "ipython3", 2002 | "version": "3.6.2" 2003 | } 2004 | }, 2005 | "nbformat": 4, 2006 | "nbformat_minor": 2 2007 | } 2008 | -------------------------------------------------------------------------------- /tensorboard-logs/.gitignore: -------------------------------------------------------------------------------- 1 | * 2 | !.gitignore 3 | --------------------------------------------------------------------------------
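
The notebook's "Try to" list mentions scheduled sampling. As a rough illustration only (not part of this codebase), the loop below sketches what a scheduled-sampling decoder pass could look like, assuming the decoder call signature used in `evaluate()` and `translate()` above (`decoder(input_seq, decoder_hidden, encoder_outputs, src_lens)`). The function name `decode_with_scheduled_sampling`, the `bos_id` argument, and the fixed `sampling_prob` are hypothetical, and the current tensor API is used rather than the notebook's `Variable` wrappers.

```python
import random

import torch


def decode_with_scheduled_sampling(decoder, decoder_hidden, encoder_outputs,
                                   src_lens, tgt_seqs, bos_id, sampling_prob):
    """At each time step, feed back either the gold token (teacher forcing)
    or the model's own greedy prediction, choosing the model prediction
    with probability `sampling_prob`."""
    max_tgt_len, batch_size = tgt_seqs.size(0), tgt_seqs.size(1)
    # First decoder input is the BOS token for every sequence in the batch.
    input_seq = torch.full((batch_size,), bos_id,
                           dtype=torch.long, device=tgt_seqs.device)
    decoder_outputs = []
    for t in range(max_tgt_len):
        decoder_output, decoder_hidden, _ = decoder(
            input_seq, decoder_hidden, encoder_outputs, src_lens)
        decoder_outputs.append(decoder_output)
        if random.random() < sampling_prob:
            # Feed back the model's own prediction (greedy argmax over vocab).
            input_seq = decoder_output.argmax(dim=1)
        else:
            # Teacher forcing: feed back the gold target token for step t.
            input_seq = tgt_seqs[t]
    # (max_tgt_len, batch_size, vocab_size)
    return torch.stack(decoder_outputs), decoder_hidden
```

In practice, `sampling_prob` is usually annealed from 0 toward some maximum over the course of training, so early epochs stay close to pure teacher forcing and the model is only gradually exposed to its own predictions.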