├── .gitignore
├── README.md
├── checkpoints
│   └── .gitignore
├── seq2seq.ipynb
└── tensorboard-logs
    └── .gitignore

/.gitignore:
--------------------------------------------------------------------------------
1 | .ipynb_checkpoints
2 | __pycache__
3 | pred.txt
4 | 
--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
1 | Batched Seq2Seq Example
2 | Based on the [`seq2seq-translation-batched.ipynb`](https://github.com/spro/practical-pytorch/blob/master/seq2seq-translation/seq2seq-translation-batched.ipynb) from *practical-pytorch*, but with extra features.
3 | 
4 | This example runs a grammatical error correction task: the source sequence is a grammatically erroneous English sentence and the target sequence is its grammatically correct counterpart. The corpus and evaluation script can be downloaded at: https://github.com/keisks/jfleg.
5 | 
6 | ### Extra features
7 | - Cleaner codebase
8 | - Very detailed comments for learners
9 | - Implements PyTorch's native `Dataset` and `DataLoader` for batching
10 | - Correctly handles the hidden state of the bidirectional encoder and passes it to the decoder as the initial hidden state.
11 | - Fully batched attention computation (only `general attention` is implemented, but it is sufficient). Note: the original code still uses a for-loop to compute attention, which is very slow.
12 | - Supports LSTM in addition to GRU
13 | - Shared embeddings (encoder's input embedding and decoder's input embedding)
14 | - Pretrained GloVe embeddings
15 | - Fixed (frozen) embeddings
16 | - Tied embeddings (decoder's input embedding and decoder's output embedding)
17 | - TensorBoard visualization
18 | - Checkpoint saving and loading
19 | - Replaces unknown words with the source token that has the highest attention score. (Translation)
20 | 
21 | ### Cons
22 | Compared to the state-of-the-art seq2seq library OpenNMT-py, a few things are not optimized in this codebase:
23 | - Use cuDNN when possible (always on the encoder; on the decoder when `input_feed`=0)
24 | - Always avoid indexing / loops and use torch primitives instead.
25 | - When possible, batch softmax operations across time. (This is the second most complicated part of the code.)
26 | - Batched inference and beam search for translation (this is the most complicated part of the code)
27 | 
28 | ### How to speed up RNN training?
29 | Several ways to speed up RNN training (a minimal sketch of dynamic padding and bucketing follows at the end of this section):
30 | - Batching
31 | - Static padding
32 | - Dynamic padding
33 | - Bucketing
34 | - Truncated BPTT
35 | 
36 | See ["Sequence Models and the RNN API (TensorFlow Dev Summit 2017)"](https://www.youtube.com/watch?v=RIR_-Xlbp7s&t=490s) for an explanation of these techniques.
37 | 
38 | You can use [torchtext](http://torchtext.readthedocs.io/en/latest/index.html) or OpenNMT's data iterator to speed up training. It can be about 7x faster (e.g., 7 hours per epoch -> 1 hour).
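Below is a minimal, self-contained sketch of the dynamic padding and bucketing ideas listed above (it is **not** part of this repo). The helper names `make_buckets` and `pad_batch` are made up for illustration, and `torch.nn.utils.rnn.pad_sequence` assumes a newer PyTorch than the 0.3.x-era code in the notebook:

```python
import torch
from torch.nn.utils.rnn import pad_sequence

def make_buckets(seqs, batch_size):
    """Bucketing: sort by length, then slice into batches so that each batch
    holds sequences of similar length (less padding wasted per batch)."""
    order = sorted(range(len(seqs)), key=lambda i: len(seqs[i]))
    return [order[i:i + batch_size] for i in range(0, len(order), batch_size)]

def pad_batch(batch_seqs, pad_id=0):
    """Dynamic padding: pad only up to the longest sequence in *this* batch."""
    tensors = [torch.tensor(s, dtype=torch.long) for s in batch_seqs]
    return pad_sequence(tensors, padding_value=pad_id)  # (max_len_in_batch, batch_size)

# Toy token-id sequences of different lengths.
seqs = [[5, 6, 7], [5, 6], [5, 6, 7, 8, 9], [5]]
for bucket in make_buckets(seqs, batch_size=2):
    padded = pad_batch([seqs[i] for i in bucket])
    print(padded.shape)  # torch.Size([2, 2]) then torch.Size([5, 2])
```

The `collate_fn` in `seq2seq.ipynb` implements the dynamic-padding part (padding to the longest sequence of the current mini-batch); bucketing is what iterators such as torchtext's `BucketIterator` automate.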
39 | 
40 | ### Acknowledgement
41 | Thanks to @srush, the author of OpenNMT-py, for answering my questions! See https://github.com/OpenNMT/OpenNMT-py/issues/552
42 | 
--------------------------------------------------------------------------------
/checkpoints/.gitignore:
--------------------------------------------------------------------------------
1 | *
2 | !.gitignore
3 | 
--------------------------------------------------------------------------------
/seq2seq.ipynb:
--------------------------------------------------------------------------------
1 | {
2 |  "cells": [
3 |   {
4 |    "cell_type": "markdown",
5 |    "metadata": {},
6 |    "source": [
7 |     "## Batched Seq2Seq Example\n",
8 |     "Based on the [`seq2seq-translation-batched.ipynb`](https://github.com/spro/practical-pytorch/blob/master/seq2seq-translation/seq2seq-translation-batched.ipynb) from *practical-pytorch*, but with extra features.\n",
9 |     "\n",
10 |     "This example runs a grammatical error correction task: the source sequence is a grammatically erroneous English sentence and the target sequence is its grammatically correct counterpart. The corpus and evaluation script can be downloaded at: https://github.com/keisks/jfleg.\n",
11 |     "\n",
12 |     "### Extra features\n",
13 |     "- Cleaner codebase\n",
14 |     "- Very detailed comments for learners\n",
15 |     "- Implements PyTorch's native `Dataset` and `DataLoader` for batching\n",
16 |     "- Correctly handles the hidden state of the bidirectional encoder and passes it to the decoder as the initial hidden state.\n",
17 |     "- Fully batched attention computation (only `general attention` is implemented, but it is sufficient). Note: the original code still uses a for-loop to compute attention, which is very slow.\n",
18 |     "- Supports LSTM in addition to GRU\n",
19 |     "- Shared embeddings (encoder's input embedding and decoder's input embedding)\n",
20 |     "- Pretrained GloVe embeddings\n",
21 |     "- Fixed (frozen) embeddings\n",
22 |     "- Tied embeddings (decoder's input embedding and decoder's output embedding)\n",
23 |     "- TensorBoard visualization\n",
24 |     "- Checkpoint saving and loading\n",
25 |     "- Replaces unknown words with the source token that has the highest attention score. (Translation)\n",
26 |     "\n",
27 |     "### Cons\n",
28 |     "Compared to the state-of-the-art seq2seq library OpenNMT-py, a few things are not optimized in this codebase:\n",
29 |     "- Use cuDNN when possible (always on the encoder; on the decoder when `input_feed`=0)\n",
30 |     "- Always avoid indexing / loops and use torch primitives instead.\n",
31 |     "- When possible, batch softmax operations across time. (This is the second most complicated part of the code.)\n",
32 |     "- Batched inference and beam search for translation (this is the most complicated part of the code)\n",
33 |     "\n",
34 |     "Thanks to @srush, the author of OpenNMT-py, for answering my questions! See https://github.com/OpenNMT/OpenNMT-py/issues/552"
35 |    ]
36 |   },
37 |   {
38 |    "cell_type": "code",
39 |    "execution_count": 1,
40 |    "metadata": {
41 |     "collapsed": true
42 |    },
43 |    "outputs": [],
44 |    "source": [
45 |     "import os\n",
46 |     "import subprocess\n",
47 |     "import codecs\n",
48 |     "import numpy as np\n",
49 |     "\n",
50 |     "import torch\n",
51 |     "import torch.nn as nn\n",
52 |     "from torch.autograd import Variable\n",
53 |     "from torch import optim\n",
54 |     "import torch.nn.functional as F\n",
55 |     "from torch.utils.data import Dataset, DataLoader"
56 |    ]
57 |   },
58 |   {
59 |    "cell_type": "code",
60 |    "execution_count": 2,
61 |    "metadata": {
62 |     "collapsed": true
63 |    },
64 |    "outputs": [],
65 |    "source": [
66 |     "\"\"\" Please install spaCy and its English model:\n",
67 |     "1. Install spacy: https://spacy.io/usage/\n",
68 |     "2. 
Install model: https://spacy.io/usage/models\n", 69 | "Recommend to install spacy since it is a very powerful NLP tool\n", 70 | "\"\"\"\n", 71 | "\n", 72 | "import spacy\n", 73 | "nlp = spacy.load('en_core_web_lg') # For the glove embeddings" 74 | ] 75 | }, 76 | { 77 | "cell_type": "code", 78 | "execution_count": 3, 79 | "metadata": {}, 80 | "outputs": [ 81 | { 82 | "name": "stdout", 83 | "output_type": "stream", 84 | "text": [ 85 | "Use_CUDA=True\n", 86 | "current_device=0\n" 87 | ] 88 | } 89 | ], 90 | "source": [ 91 | "\"\"\" Enable GPU training \"\"\"\n", 92 | "USE_CUDA = torch.cuda.is_available()\n", 93 | "print('Use_CUDA={}'.format(USE_CUDA))\n", 94 | "if USE_CUDA:\n", 95 | " # You can change device by `torch.cuda.set_device(device_id)`\n", 96 | " print('current_device={}'.format(torch.cuda.current_device()))" 97 | ] 98 | }, 99 | { 100 | "cell_type": "markdown", 101 | "metadata": {}, 102 | "source": [ 103 | "## Build vocabulary, dataset and data loader" 104 | ] 105 | }, 106 | { 107 | "cell_type": "code", 108 | "execution_count": 4, 109 | "metadata": { 110 | "collapsed": true 111 | }, 112 | "outputs": [], 113 | "source": [ 114 | "import codecs\n", 115 | "from tqdm import tqdm\n", 116 | "from collections import Counter, namedtuple\n", 117 | "from torch.utils.data import Dataset, DataLoader\n", 118 | "\n", 119 | "PAD = 0\n", 120 | "BOS = 1\n", 121 | "EOS = 2\n", 122 | "UNK = 3\n", 123 | "\n", 124 | "class AttrDict(dict):\n", 125 | " \"\"\" Access dictionary keys like attribute \n", 126 | " https://stackoverflow.com/questions/4984647/accessing-dict-keys-like-an-attribute\n", 127 | " \"\"\"\n", 128 | " def __init__(self, *av, **kav):\n", 129 | " dict.__init__(self, *av, **kav)\n", 130 | " self.__dict__ = self\n", 131 | "\n", 132 | "class NMTDataset(Dataset):\n", 133 | " def __init__(self, src_path, tgt_path, src_vocab=None, tgt_vocab=None, max_vocab_size=50000, share_vocab=True):\n", 134 | " \"\"\" Note: If src_vocab, tgt_vocab is not given, it will build both vocabs.\n", 135 | " Args: \n", 136 | " - src_path, tgt_path: text file with tokenized sentences.\n", 137 | " - src_vocab, tgt_vocab: data structure is same as self.build_vocab().\n", 138 | " \"\"\"\n", 139 | " print('='*100)\n", 140 | " print('Dataset preprocessing log:')\n", 141 | " \n", 142 | " print('- Loading and tokenizing source sentences...')\n", 143 | " self.src_sents = self.load_sents(src_path)\n", 144 | " print('- Loading and tokenizing target sentences...')\n", 145 | " self.tgt_sents = self.load_sents(tgt_path)\n", 146 | " \n", 147 | " if src_vocab is None or tgt_vocab is None:\n", 148 | " print('- Building source counter...')\n", 149 | " self.src_counter = self.build_counter(self.src_sents)\n", 150 | " print('- Building target counter...')\n", 151 | " self.tgt_counter = self.build_counter(self.tgt_sents)\n", 152 | "\n", 153 | " if share_vocab:\n", 154 | " print('- Building source vocabulary...')\n", 155 | " self.src_vocab = self.build_vocab(self.src_counter + self.tgt_counter, max_vocab_size)\n", 156 | " print('- Building target vocabulary...')\n", 157 | " self.tgt_vocab = self.src_vocab\n", 158 | " else:\n", 159 | " print('- Building source vocabulary...')\n", 160 | " self.src_vocab = self.build_vocab(self.src_counter, max_vocab_size)\n", 161 | " print('- Building target vocabulary...')\n", 162 | " self.tgt_vocab = self.build_vocab(self.tgt_counter, max_vocab_size)\n", 163 | " else:\n", 164 | " self.src_vocab = src_vocab\n", 165 | " self.tgt_vocab = tgt_vocab\n", 166 | " share_vocab = src_vocab == tgt_vocab\n", 167 | " 
\n", 168 | " print('='*100)\n", 169 | " print('Dataset Info:')\n", 170 | " print('- Number of source sentences: {}'.format(len(self.src_sents)))\n", 171 | " print('- Number of target sentences: {}'.format(len(self.tgt_sents)))\n", 172 | " print('- Source vocabulary size: {}'.format(len(self.src_vocab.token2id)))\n", 173 | " print('- Target vocabulary size: {}'.format(len(self.tgt_vocab.token2id)))\n", 174 | " print('- Shared vocabulary: {}'.format(share_vocab))\n", 175 | " print('='*100 + '\\n')\n", 176 | " \n", 177 | " def __len__(self):\n", 178 | " return len(self.src_sents)\n", 179 | " \n", 180 | " def __getitem__(self, index):\n", 181 | " src_sent = self.src_sents[index]\n", 182 | " tgt_sent = self.tgt_sents[index]\n", 183 | " src_seq = self.tokens2ids(src_sent, self.src_vocab.token2id, append_BOS=False, append_EOS=True)\n", 184 | " tgt_seq = self.tokens2ids(tgt_sent, self.tgt_vocab.token2id, append_BOS=False, append_EOS=True)\n", 185 | "\n", 186 | " return src_sent, tgt_sent, src_seq, tgt_seq\n", 187 | " \n", 188 | " def load_sents(self, file_path):\n", 189 | " sents = []\n", 190 | " with codecs.open(file_path) as file:\n", 191 | " for sent in tqdm(file.readlines()):\n", 192 | " tokens = [token for token in sent.split()]\n", 193 | " sents.append(tokens)\n", 194 | " return sents\n", 195 | " \n", 196 | " def build_counter(self, sents):\n", 197 | " counter = Counter()\n", 198 | " for sent in tqdm(sents):\n", 199 | " counter.update(sent)\n", 200 | " return counter\n", 201 | " \n", 202 | " def build_vocab(self, counter, max_vocab_size):\n", 203 | " vocab = AttrDict()\n", 204 | " vocab.token2id = {'': PAD, '': BOS, '': EOS, '': UNK}\n", 205 | " vocab.token2id.update({token: _id+4 for _id, (token, count) in tqdm(enumerate(counter.most_common(max_vocab_size)))})\n", 206 | " vocab.id2token = {v:k for k,v in tqdm(vocab.token2id.items())} \n", 207 | " return vocab\n", 208 | " \n", 209 | " def tokens2ids(self, tokens, token2id, append_BOS=True, append_EOS=True):\n", 210 | " seq = []\n", 211 | " if append_BOS: seq.append(BOS)\n", 212 | " seq.extend([token2id.get(token, UNK) for token in tokens])\n", 213 | " if append_EOS: seq.append(EOS)\n", 214 | " return seq\n", 215 | " \n", 216 | "def collate_fn(data):\n", 217 | " \"\"\"\n", 218 | " Creates mini-batch tensors from (src_sent, tgt_sent, src_seq, tgt_seq).\n", 219 | " We should build a custom collate_fn rather than using default collate_fn,\n", 220 | " because merging sequences (including padding) is not supported in default.\n", 221 | " Seqeuences are padded to the maximum length of mini-batch sequences (dynamic padding).\n", 222 | " \n", 223 | " Args:\n", 224 | " data: list of tuple (src_sents, tgt_sents, src_seqs, tgt_seqs)\n", 225 | " - src_sents, tgt_sents: batch of original tokenized sentences\n", 226 | " - src_seqs, tgt_seqs: batch of original tokenized sentence ids\n", 227 | " Returns:\n", 228 | " - src_sents, tgt_sents (tuple): batch of original tokenized sentences\n", 229 | " - src_seqs, tgt_seqs (variable): (max_src_len, batch_size)\n", 230 | " - src_lens, tgt_lens (tensor): (batch_size)\n", 231 | " \n", 232 | " \"\"\"\n", 233 | " def _pad_sequences(seqs):\n", 234 | " lens = [len(seq) for seq in seqs]\n", 235 | " padded_seqs = torch.zeros(len(seqs), max(lens)).long()\n", 236 | " for i, seq in enumerate(seqs):\n", 237 | " end = lens[i]\n", 238 | " padded_seqs[i, :end] = torch.LongTensor(seq[:end])\n", 239 | " return padded_seqs, lens\n", 240 | "\n", 241 | " # Sort a list by *source* sequence length (descending order) to use 
`pack_padded_sequence`.\n", 242 | " # The *target* sequence is not sorted <-- It's ok, cause `pack_padded_sequence` only takes\n", 243 | " # *source* sequence, which is in the EncoderRNN\n", 244 | " data.sort(key=lambda x: len(x[0]), reverse=True)\n", 245 | "\n", 246 | " # Seperate source and target sequences.\n", 247 | " src_sents, tgt_sents, src_seqs, tgt_seqs = zip(*data)\n", 248 | " \n", 249 | " # Merge sequences (from tuple of 1D tensor to 2D tensor)\n", 250 | " src_seqs, src_lens = _pad_sequences(src_seqs)\n", 251 | " tgt_seqs, tgt_lens = _pad_sequences(tgt_seqs)\n", 252 | " \n", 253 | " # (batch, seq_len) => (seq_len, batch)\n", 254 | " src_seqs = src_seqs.transpose(0,1)\n", 255 | " tgt_seqs = tgt_seqs.transpose(0,1)\n", 256 | "\n", 257 | " return src_sents, tgt_sents, src_seqs, tgt_seqs, src_lens, tgt_lens" 258 | ] 259 | }, 260 | { 261 | "cell_type": "markdown", 262 | "metadata": {}, 263 | "source": [ 264 | "## Build models\n", 265 | "### Encoder" 266 | ] 267 | }, 268 | { 269 | "cell_type": "code", 270 | "execution_count": 5, 271 | "metadata": { 272 | "collapsed": true 273 | }, 274 | "outputs": [], 275 | "source": [ 276 | "class EncoderRNN(nn.Module):\n", 277 | " def __init__(self, embedding=None, rnn_type='LSTM', hidden_size=128, num_layers=1, dropout=0.3, bidirectional=True):\n", 278 | " super(EncoderRNN, self).__init__()\n", 279 | " \n", 280 | " self.num_layers = num_layers\n", 281 | " self.dropout = dropout\n", 282 | " self.bidirectional = bidirectional\n", 283 | " self.num_directions = 2 if bidirectional else 1\n", 284 | " self.hidden_size = hidden_size // self.num_directions\n", 285 | " \n", 286 | " self.embedding = embedding\n", 287 | " self.word_vec_size = self.embedding.embedding_dim\n", 288 | " \n", 289 | " self.rnn_type = rnn_type\n", 290 | " self.rnn = getattr(nn, self.rnn_type)(\n", 291 | " input_size=self.word_vec_size,\n", 292 | " hidden_size=self.hidden_size,\n", 293 | " num_layers=self.num_layers,\n", 294 | " dropout=self.dropout, \n", 295 | " bidirectional=self.bidirectional)\n", 296 | " \n", 297 | " def forward(self, src_seqs, src_lens, hidden=None):\n", 298 | " \"\"\"\n", 299 | " Args:\n", 300 | " - src_seqs: (max_src_len, batch_size)\n", 301 | " - src_lens: (batch_size)\n", 302 | " Returns:\n", 303 | " - outputs: (max_src_len, batch_size, hidden_size * num_directions)\n", 304 | " - hidden : (num_layers, batch_size, hidden_size * num_directions)\n", 305 | " \"\"\"\n", 306 | " \n", 307 | " # (max_src_len, batch_size) => (max_src_len, batch_size, word_vec_size)\n", 308 | " emb = self.embedding(src_seqs)\n", 309 | "\n", 310 | " # packed_emb:\n", 311 | " # - data: (sum(batch_sizes), word_vec_size)\n", 312 | " # - batch_sizes: list of batch sizes\n", 313 | " packed_emb = nn.utils.rnn.pack_padded_sequence(emb, src_lens)\n", 314 | "\n", 315 | " # rnn(gru) returns:\n", 316 | " # - packed_outputs: shape same as packed_emb\n", 317 | " # - hidden: (num_layers * num_directions, batch_size, hidden_size) \n", 318 | " packed_outputs, hidden = self.rnn(packed_emb, hidden)\n", 319 | "\n", 320 | " # outputs: (max_src_len, batch_size, hidden_size * num_directions)\n", 321 | " # output_lens == src_lensˇ\n", 322 | " outputs, output_lens = nn.utils.rnn.pad_packed_sequence(packed_outputs)\n", 323 | " \n", 324 | " if self.bidirectional:\n", 325 | " # (num_layers * num_directions, batch_size, hidden_size) \n", 326 | " # => (num_layers, batch_size, hidden_size * num_directions)\n", 327 | " hidden = self._cat_directions(hidden)\n", 328 | " \n", 329 | " return outputs, hidden\n", 330 | " 
\n", 331 | " def _cat_directions(self, hidden):\n", 332 | " \"\"\" If the encoder is bidirectional, do the following transformation.\n", 333 | " Ref: https://github.com/IBM/pytorch-seq2seq/blob/master/seq2seq/models/DecoderRNN.py#L176\n", 334 | " -----------------------------------------------------------\n", 335 | " In: (num_layers * num_directions, batch_size, hidden_size)\n", 336 | " (ex: num_layers=2, num_directions=2)\n", 337 | "\n", 338 | " layer 1: forward__hidden(1)\n", 339 | " layer 1: backward_hidden(1)\n", 340 | " layer 2: forward__hidden(2)\n", 341 | " layer 2: backward_hidden(2)\n", 342 | "\n", 343 | " -----------------------------------------------------------\n", 344 | " Out: (num_layers, batch_size, hidden_size * num_directions)\n", 345 | "\n", 346 | " layer 1: forward__hidden(1) backward_hidden(1)\n", 347 | " layer 2: forward__hidden(2) backward_hidden(2)\n", 348 | " \"\"\"\n", 349 | " def _cat(h):\n", 350 | " return torch.cat([h[0:h.size(0):2], h[1:h.size(0):2]], 2)\n", 351 | " \n", 352 | " if isinstance(hidden, tuple):\n", 353 | " # LSTM hidden contains a tuple (hidden state, cell state)\n", 354 | " hidden = tuple([_cat(h) for h in hidden])\n", 355 | " else:\n", 356 | " # GRU hidden\n", 357 | " hidden = _cat(hidden)\n", 358 | " \n", 359 | " return hidden" 360 | ] 361 | }, 362 | { 363 | "cell_type": "markdown", 364 | "metadata": {}, 365 | "source": [ 366 | "### Decoder with \"general attention\" mechanism" 367 | ] 368 | }, 369 | { 370 | "cell_type": "code", 371 | "execution_count": 6, 372 | "metadata": { 373 | "collapsed": true 374 | }, 375 | "outputs": [], 376 | "source": [ 377 | "class LuongAttnDecoderRNN(nn.Module):\n", 378 | " def __init__(self, encoder, embedding=None, attention=True, bias=True, tie_embeddings=False, dropout=0.3):\n", 379 | " \"\"\" General attention in `Effective Approaches to Attention-based Neural Machine Translation`\n", 380 | " Ref: https://arxiv.org/abs/1508.04025\n", 381 | " \n", 382 | " Share input and output embeddings:\n", 383 | " Ref:\n", 384 | " - \"Using the Output Embedding to Improve Language Models\" (Press & Wolf 2016)\n", 385 | " https://arxiv.org/abs/1608.05859\n", 386 | " - \"Tying Word Vectors and Word Classifiers: A Loss Framework for Language Modeling\" (Inan et al. 
2016)\n", 387 | " https://arxiv.org/abs/1611.01462\n", 388 | " \"\"\"\n", 389 | " super(LuongAttnDecoderRNN, self).__init__()\n", 390 | " \n", 391 | " self.hidden_size = encoder.hidden_size * encoder.num_directions\n", 392 | " self.num_layers = encoder.num_layers\n", 393 | " self.dropout = dropout\n", 394 | " self.embedding = embedding\n", 395 | " self.attention = attention\n", 396 | " self.tie_embeddings = tie_embeddings\n", 397 | " \n", 398 | " self.vocab_size = self.embedding.num_embeddings\n", 399 | " self.word_vec_size = self.embedding.embedding_dim\n", 400 | " \n", 401 | " self.rnn_type = encoder.rnn_type\n", 402 | " self.rnn = getattr(nn, self.rnn_type)(\n", 403 | " input_size=self.word_vec_size,\n", 404 | " hidden_size=self.hidden_size,\n", 405 | " num_layers=self.num_layers,\n", 406 | " dropout=self.dropout)\n", 407 | " \n", 408 | " if self.attention:\n", 409 | " self.W_a = nn.Linear(encoder.hidden_size * encoder.num_directions,\n", 410 | " self.hidden_size, bias=bias)\n", 411 | " self.W_c = nn.Linear(encoder.hidden_size * encoder.num_directions + self.hidden_size, \n", 412 | " self.hidden_size, bias=bias)\n", 413 | " \n", 414 | " if self.tie_embeddings:\n", 415 | " self.W_proj = nn.Linear(self.hidden_size, self.word_vec_size, bias=bias)\n", 416 | " self.W_s = nn.Linear(self.word_vec_size, self.vocab_size, bias=bias)\n", 417 | " self.W_s.weight = self.embedding.weight\n", 418 | " else:\n", 419 | " self.W_s = nn.Linear(self.hidden_size, self.vocab_size, bias=bias)\n", 420 | " \n", 421 | " def forward(self, input_seq, decoder_hidden, encoder_outputs, src_lens):\n", 422 | " \"\"\" Args:\n", 423 | " - input_seq : (batch_size)\n", 424 | " - decoder_hidden : (t=0) last encoder hidden state (num_layers * num_directions, batch_size, hidden_size) \n", 425 | " (t>0) previous decoder hidden state (num_layers, batch_size, hidden_size)\n", 426 | " - encoder_outputs: (max_src_len, batch_size, hidden_size * num_directions)\n", 427 | " \n", 428 | " Returns:\n", 429 | " - output : (batch_size, vocab_size)\n", 430 | " - decoder_hidden : (num_layers, batch_size, hidden_size)\n", 431 | " - attention_weights: (batch_size, max_src_len)\n", 432 | " \"\"\" \n", 433 | " # (batch_size) => (seq_len=1, batch_size)\n", 434 | " input_seq = input_seq.unsqueeze(0)\n", 435 | " \n", 436 | " # (seq_len=1, batch_size) => (seq_len=1, batch_size, word_vec_size) \n", 437 | " emb = self.embedding(input_seq)\n", 438 | " \n", 439 | " # rnn returns:\n", 440 | " # - decoder_output: (seq_len=1, batch_size, hidden_size)\n", 441 | " # - decoder_hidden: (num_layers, batch_size, hidden_size)\n", 442 | " decoder_output, decoder_hidden = self.rnn(emb, decoder_hidden)\n", 443 | "\n", 444 | " # (seq_len=1, batch_size, hidden_size) => (batch_size, seq_len=1, hidden_size)\n", 445 | " decoder_output = decoder_output.transpose(0,1)\n", 446 | " \n", 447 | " \"\"\" \n", 448 | " ------------------------------------------------------------------------------------------\n", 449 | " Notes of computing attention scores\n", 450 | " ------------------------------------------------------------------------------------------\n", 451 | " # For-loop version:\n", 452 | "\n", 453 | " max_src_len = encoder_outputs.size(0)\n", 454 | " batch_size = encoder_outputs.size(1)\n", 455 | " attention_scores = Variable(torch.zeros(batch_size, max_src_len))\n", 456 | "\n", 457 | " # For every batch, every time step of encoder's hidden state, calculate attention score.\n", 458 | " for b in range(batch_size):\n", 459 | " for t in range(max_src_len):\n", 460 | " # 
Loung. eq(8) -- general form content-based attention:\n", 461 | " attention_scores[b,t] = decoder_output[b].dot(attention.W_a(encoder_outputs[t,b]))\n", 462 | "\n", 463 | " ------------------------------------------------------------------------------------------\n", 464 | " # Vectorized version:\n", 465 | "\n", 466 | " 1. decoder_output: (batch_size, seq_len=1, hidden_size)\n", 467 | " 2. encoder_outputs: (max_src_len, batch_size, hidden_size * num_directions)\n", 468 | " 3. W_a(encoder_outputs): (max_src_len, batch_size, hidden_size)\n", 469 | " .transpose(0,1) : (batch_size, max_src_len, hidden_size) \n", 470 | " .transpose(1,2) : (batch_size, hidden_size, max_src_len)\n", 471 | " 4. attention_scores: \n", 472 | " (batch_size, seq_len=1, hidden_size) * (batch_size, hidden_size, max_src_len) \n", 473 | " => (batch_size, seq_len=1, max_src_len)\n", 474 | " \"\"\"\n", 475 | " \n", 476 | " if self.attention:\n", 477 | " # attention_scores: (batch_size, seq_len=1, max_src_len)\n", 478 | " attention_scores = torch.bmm(decoder_output, self.W_a(encoder_outputs).transpose(0,1).transpose(1,2))\n", 479 | "\n", 480 | " # attention_mask: (batch_size, seq_len=1, max_src_len)\n", 481 | " attention_mask = sequence_mask(src_lens).unsqueeze(1)\n", 482 | "\n", 483 | " # Fills elements of tensor with `-float('inf')` where `mask` is 1.\n", 484 | " attention_scores.data.masked_fill_(1 - attention_mask.data, -float('inf'))\n", 485 | "\n", 486 | " # attention_weights: (batch_size, seq_len=1, max_src_len) => (batch_size, max_src_len) for `F.softmax` \n", 487 | " # => (batch_size, seq_len=1, max_src_len)\n", 488 | " try: # torch 0.3.x\n", 489 | " attention_weights = F.softmax(attention_scores.squeeze(1), dim=1).unsqueeze(1)\n", 490 | " except:\n", 491 | " attention_weights = F.softmax(attention_scores.squeeze(1)).unsqueeze(1)\n", 492 | "\n", 493 | " # context_vector:\n", 494 | " # (batch_size, seq_len=1, max_src_len) * (batch_size, max_src_len, encoder_hidden_size * num_directions)\n", 495 | " # => (batch_size, seq_len=1, encoder_hidden_size * num_directions)\n", 496 | " context_vector = torch.bmm(attention_weights, encoder_outputs.transpose(0,1))\n", 497 | "\n", 498 | " # concat_input: (batch_size, seq_len=1, encoder_hidden_size * num_directions + decoder_hidden_size)\n", 499 | " concat_input = torch.cat([context_vector, decoder_output], -1)\n", 500 | "\n", 501 | " # (batch_size, seq_len=1, encoder_hidden_size * num_directions + decoder_hidden_size) => (batch_size, seq_len=1, decoder_hidden_size)\n", 502 | " concat_output = F.tanh(self.W_c(concat_input))\n", 503 | " \n", 504 | " # Prepare returns:\n", 505 | " # (batch_size, seq_len=1, max_src_len) => (batch_size, max_src_len)\n", 506 | " attention_weights = attention_weights.squeeze(1)\n", 507 | " else:\n", 508 | " attention_weights = None\n", 509 | " concat_output = decoder_output\n", 510 | " \n", 511 | " # If input and output embeddings are tied,\n", 512 | " # project `decoder_hidden_size` to `word_vec_size`.\n", 513 | " if self.tie_embeddings:\n", 514 | " output = self.W_s(self.W_proj(concat_output))\n", 515 | " else:\n", 516 | " # (batch_size, seq_len=1, decoder_hidden_size) => (batch_size, seq_len=1, vocab_size)\n", 517 | " output = self.W_s(concat_output) \n", 518 | " \n", 519 | " # Prepare returns:\n", 520 | " # (batch_size, seq_len=1, vocab_size) => (batch_size, vocab_size)\n", 521 | " output = output.squeeze(1)\n", 522 | " \n", 523 | " del src_lens\n", 524 | " \n", 525 | " return output, decoder_hidden, attention_weights" 526 | ] 527 | }, 528 | { 529 | 
"cell_type": "markdown", 530 | "metadata": {}, 531 | "source": [ 532 | "## Utils" 533 | ] 534 | }, 535 | { 536 | "cell_type": "code", 537 | "execution_count": 7, 538 | "metadata": { 539 | "collapsed": true 540 | }, 541 | "outputs": [], 542 | "source": [ 543 | "def load_spacy_glove_embedding(spacy_nlp, vocab):\n", 544 | " \n", 545 | " vocab_size = len(vocab.token2id)\n", 546 | " word_vec_size = spacy_nlp.vocab.vectors_length\n", 547 | " embedding = np.zeros((vocab_size, word_vec_size))\n", 548 | " unk_count = 0\n", 549 | " \n", 550 | " print('='*100)\n", 551 | " print('Loading spacy glove embedding:')\n", 552 | " print('- Vocabulary size: {}'.format(vocab_size))\n", 553 | " print('- Word vector size: {}'.format(word_vec_size))\n", 554 | " \n", 555 | " for token, index in tqdm(vocab.token2id.items()):\n", 556 | " if token == vocab.id2token[PAD]: \n", 557 | " continue\n", 558 | " elif token in [vocab.id2token[BOS], vocab.id2token[EOS], vocab.id2token[UNK]]: \n", 559 | " vector = np.random.rand(word_vec_size,)\n", 560 | " elif spacy_nlp.vocab[token].has_vector: \n", 561 | " vector = spacy_nlp.vocab[token].vector\n", 562 | " else:\n", 563 | " vector = embedding[UNK] \n", 564 | " unk_count += 1\n", 565 | " \n", 566 | " embedding[index] = vector\n", 567 | " \n", 568 | " print('- Unknown word count: {}'.format(unk_count))\n", 569 | " print('='*100 + '\\n')\n", 570 | " \n", 571 | " return torch.from_numpy(embedding).float()\n", 572 | "\n", 573 | "def sequence_mask(sequence_length, max_len=None):\n", 574 | " \"\"\"\n", 575 | " Caution: Input and Return are VARIABLE.\n", 576 | " \"\"\"\n", 577 | " if max_len is None:\n", 578 | " max_len = sequence_length.data.max()\n", 579 | " batch_size = sequence_length.size(0)\n", 580 | " seq_range = torch.arange(0, max_len).long()\n", 581 | " seq_range_expand = seq_range.unsqueeze(0).expand(batch_size, max_len)\n", 582 | " seq_range_expand = Variable(seq_range_expand)\n", 583 | " if sequence_length.is_cuda:\n", 584 | " seq_range_expand = seq_range_expand.cuda()\n", 585 | " seq_length_expand = (sequence_length.unsqueeze(1)\n", 586 | " .expand_as(seq_range_expand))\n", 587 | " mask = seq_range_expand < seq_length_expand\n", 588 | " \n", 589 | " return mask\n", 590 | "\n", 591 | "def masked_cross_entropy(logits, target, length):\n", 592 | " \"\"\"\n", 593 | " Args:\n", 594 | " logits: A Variable containing a FloatTensor of size\n", 595 | " (batch, max_len, num_classes) which contains the\n", 596 | " unnormalized probability for each class.\n", 597 | " target: A Variable containing a LongTensor of size\n", 598 | " (batch, max_len) which contains the index of the true\n", 599 | " class for each corresponding step.\n", 600 | " length: A Variable containing a LongTensor of size (batch,)\n", 601 | " which contains the length of each data in a batch.\n", 602 | " Returns:\n", 603 | " loss: An average loss value masked by the length.\n", 604 | " \n", 605 | " The code is same as:\n", 606 | " \n", 607 | " weight = torch.ones(tgt_vocab_size)\n", 608 | " weight[padding_idx] = 0\n", 609 | " criterion = nn.CrossEntropyLoss(weight.cuda(), size_average)\n", 610 | " loss = criterion(logits_flat, losses_flat)\n", 611 | " \"\"\"\n", 612 | " # logits_flat: (batch * max_len, num_classes)\n", 613 | " logits_flat = logits.view(-1, logits.size(-1))\n", 614 | " # log_probs_flat: (batch * max_len, num_classes)\n", 615 | " log_probs_flat = F.log_softmax(logits_flat)\n", 616 | " # target_flat: (batch * max_len, 1)\n", 617 | " target_flat = target.view(-1, 1)\n", 618 | " # losses_flat: (batch * 
max_len, 1)\n", 619 | " losses_flat = -torch.gather(log_probs_flat, dim=1, index=target_flat)\n", 620 | " # losses: (batch, max_len)\n", 621 | " losses = losses_flat.view(*target.size())\n", 622 | " # mask: (batch, max_len)\n", 623 | " mask = sequence_mask(sequence_length=length, max_len=target.size(1))\n", 624 | " # Note: mask need to bed casted to float!\n", 625 | " losses = losses * mask.float()\n", 626 | " loss = losses.sum() / mask.float().sum()\n", 627 | " \n", 628 | " # (batch_size * max_tgt_len,)\n", 629 | " pred_flat = log_probs_flat.max(1)[1]\n", 630 | " # (batch_size * max_tgt_len,) => (batch_size, max_tgt_len) => (max_tgt_len, batch_size)\n", 631 | " pred_seqs = pred_flat.view(*target.size()).transpose(0,1).contiguous()\n", 632 | " # (batch_size, max_len) => (batch_size * max_tgt_len,)\n", 633 | " mask_flat = mask.view(-1)\n", 634 | " \n", 635 | " # `.float()` IS VERY IMPORTANT !!!\n", 636 | " # https://discuss.pytorch.org/t/batch-size-and-validation-accuracy/4066/3\n", 637 | " num_corrects = int(pred_flat.eq(target_flat.squeeze(1)).masked_select(mask_flat).float().data.sum())\n", 638 | " num_words = length.data.sum()\n", 639 | "\n", 640 | " return loss, pred_seqs, num_corrects, num_words\n", 641 | "\n", 642 | "def load_checkpoint(checkpoint_path):\n", 643 | " # It's weird that if `map_location` is not given, it will be extremely slow.\n", 644 | " return torch.load(checkpoint_path, map_location=lambda storage, loc: storage)\n", 645 | "\n", 646 | "def save_checkpoint(opts, experiment_name, encoder, decoder, encoder_optim, decoder_optim,\n", 647 | " total_accuracy, total_loss, global_step):\n", 648 | " checkpoint = {\n", 649 | " 'opts': opts,\n", 650 | " 'global_step': global_step,\n", 651 | " 'encoder_state_dict': encoder.state_dict(),\n", 652 | " 'decoder_state_dict': decoder.state_dict(),\n", 653 | " 'encoder_optim_state_dict': encoder_optim.state_dict(),\n", 654 | " 'decoder_optim_state_dict': decoder_optim.state_dict()\n", 655 | " }\n", 656 | " \n", 657 | " checkpoint_path = 'checkpoints/%s_acc_%.2f_loss_%.2f_step_%d.pt' % (experiment_name, total_accuracy, total_loss, global_step)\n", 658 | " \n", 659 | " directory, filename = os.path.split(os.path.abspath(checkpoint_path))\n", 660 | "\n", 661 | " if not os.path.exists(directory):\n", 662 | " os.makedirs(directory)\n", 663 | " \n", 664 | " torch.save(checkpoint, checkpoint_path)\n", 665 | " \n", 666 | " return checkpoint_path\n", 667 | "\n", 668 | "def variable2numpy(var):\n", 669 | " \"\"\" For tensorboard visualization \"\"\"\n", 670 | " return var.data.cpu().numpy()\n", 671 | "\n", 672 | "def write_to_tensorboard(writer, global_step, total_loss, total_corrects, total_words, total_accuracy,\n", 673 | " encoder_grad_norm, decoder_grad_norm, clipped_encoder_grad_norm, clipped_decoder_grad_norm,\n", 674 | " encoder, decoder, gpu_memory_usage=None):\n", 675 | " # scalars\n", 676 | " if gpu_memory_usage is not None:\n", 677 | " writer.add_scalar('curr_gpu_memory_usage', gpu_memory_usage['curr'], global_step)\n", 678 | " writer.add_scalar('diff_gpu_memory_usage', gpu_memory_usage['diff'], global_step)\n", 679 | " \n", 680 | " writer.add_scalar('total_loss', total_loss, global_step)\n", 681 | " writer.add_scalar('total_accuracy', total_accuracy, global_step)\n", 682 | " writer.add_scalar('total_corrects', total_corrects, global_step)\n", 683 | " writer.add_scalar('total_words', total_words, global_step)\n", 684 | " writer.add_scalar('encoder_grad_norm', encoder_grad_norm, global_step)\n", 685 | " 
writer.add_scalar('decoder_grad_norm', decoder_grad_norm, global_step)\n", 686 | " writer.add_scalar('clipped_encoder_grad_norm', clipped_encoder_grad_norm, global_step)\n", 687 | " writer.add_scalar('clipped_decoder_grad_norm', clipped_decoder_grad_norm, global_step)\n", 688 | " \n", 689 | " # histogram\n", 690 | " for name, param in encoder.named_parameters():\n", 691 | " name = name.replace('.', '/')\n", 692 | " writer.add_histogram('encoder/{}'.format(name), variable2numpy(param), global_step, bins='doane')\n", 693 | " if param.grad is not None:\n", 694 | " writer.add_histogram('encoder/{}/grad'.format(name), variable2numpy(param.grad), global_step, bins='doane')\n", 695 | "\n", 696 | " for name, param in decoder.named_parameters():\n", 697 | " name = name.replace('.', '/')\n", 698 | " writer.add_histogram('decoder/{}'.format(name), variable2numpy(param), global_step, bins='doane')\n", 699 | " if param.grad is not None:\n", 700 | " writer.add_histogram('decoder/{}/grad'.format(name), variable2numpy(param.grad), global_step, bins='doane')\n", 701 | " \n", 702 | "def detach_hidden(hidden):\n", 703 | " \"\"\" Wraps hidden states in new Variables, to detach them from their history. Prevent OOM.\n", 704 | " After detach, the hidden's requires_grad=Fasle and grad_fn=None.\n", 705 | " Issues:\n", 706 | " - Memory leak problem in LSTM and RNN: https://github.com/pytorch/pytorch/issues/2198\n", 707 | " - https://github.com/pytorch/examples/blob/master/word_language_model/main.py\n", 708 | " - https://discuss.pytorch.org/t/help-clarifying-repackage-hidden-in-word-language-model/226\n", 709 | " - https://discuss.pytorch.org/t/solved-why-we-need-to-detach-variable-which-contains-hidden-representation/1426\n", 710 | " - \n", 711 | " \"\"\"\n", 712 | " if type(hidden) == Variable:\n", 713 | " hidden.detach_() # same as creating a new variable.\n", 714 | " else:\n", 715 | " for h in hidden: h.detach_()\n", 716 | "\n", 717 | "def get_gpu_memory_usage(device_id):\n", 718 | " \"\"\"Get the current gpu usage. \"\"\"\n", 719 | " result = subprocess.check_output(\n", 720 | " [\n", 721 | " 'nvidia-smi', '--query-gpu=memory.used',\n", 722 | " '--format=csv,nounits,noheader'\n", 723 | " ], encoding='utf-8')\n", 724 | " # Convert lines into a dictionary\n", 725 | " gpu_memory = [int(x) for x in result.strip().split('\\n')]\n", 726 | " gpu_memory_map = dict(zip(range(len(gpu_memory)), gpu_memory))\n", 727 | " return gpu_memory_map[device_id]" 728 | ] 729 | }, 730 | { 731 | "cell_type": "markdown", 732 | "metadata": {}, 733 | "source": [ 734 | "## Trainer" 735 | ] 736 | }, 737 | { 738 | "cell_type": "code", 739 | "execution_count": 8, 740 | "metadata": { 741 | "collapsed": true 742 | }, 743 | "outputs": [], 744 | "source": [ 745 | "def compute_grad_norm(parameters, norm_type=2):\n", 746 | " \"\"\" Ref: http://pytorch.org/docs/0.3.0/_modules/torch/nn/utils/clip_grad.html#clip_grad_norm\n", 747 | " \"\"\"\n", 748 | " parameters = list(filter(lambda p: p.grad is not None, parameters))\n", 749 | " norm_type = float(norm_type)\n", 750 | " if norm_type == float('inf'):\n", 751 | " total_norm = max(p.grad.data.abs().max() for p in parameters)\n", 752 | " else:\n", 753 | " total_norm = 0\n", 754 | " for p in parameters:\n", 755 | " param_norm = p.grad.data.norm(norm_type)\n", 756 | " total_norm += param_norm ** norm_type\n", 757 | " total_norm = total_norm ** (1. 
/ norm_type)\n", 758 | " return total_norm\n", 759 | "\n", 760 | "def train(src_sents, tgt_sents, src_seqs, tgt_seqs, src_lens, tgt_lens,\n", 761 | " encoder, decoder, encoder_optim, decoder_optim, opts): \n", 762 | " # -------------------------------------\n", 763 | " # Prepare input and output placeholders\n", 764 | " # -------------------------------------\n", 765 | " # Last batch might not have the same size as we set to the `batch_size`\n", 766 | " batch_size = src_seqs.size(1)\n", 767 | " assert(batch_size == tgt_seqs.size(1))\n", 768 | " \n", 769 | " # Pack tensors to variables for neural network inputs (in order to autograd)\n", 770 | " src_seqs = Variable(src_seqs)\n", 771 | " tgt_seqs = Variable(tgt_seqs)\n", 772 | " src_lens = Variable(torch.LongTensor(src_lens))\n", 773 | " tgt_lens = Variable(torch.LongTensor(tgt_lens))\n", 774 | "\n", 775 | " # Decoder's input\n", 776 | " input_seq = Variable(torch.LongTensor([BOS] * batch_size))\n", 777 | " \n", 778 | " # Decoder's output sequence length = max target sequence length of current batch.\n", 779 | " max_tgt_len = tgt_lens.data.max()\n", 780 | " \n", 781 | " # Store all decoder's outputs.\n", 782 | " # **CRUTIAL** \n", 783 | " # Don't set:\n", 784 | " # >> decoder_outputs = Variable(torch.zeros(max_tgt_len, batch_size, decoder.vocab_size))\n", 785 | " # Varying tensor size could cause GPU allocate a new memory causing OOM, \n", 786 | " # so we intialize tensor with fixed size instead:\n", 787 | " # `opts.max_seq_len` is a fixed number, unlike `max_tgt_len` always varys.\n", 788 | " decoder_outputs = Variable(torch.zeros(opts.max_seq_len, batch_size, decoder.vocab_size))\n", 789 | "\n", 790 | " # Move variables from CPU to GPU.\n", 791 | " if USE_CUDA:\n", 792 | " src_seqs = src_seqs.cuda()\n", 793 | " tgt_seqs = tgt_seqs.cuda()\n", 794 | " src_lens = src_lens.cuda()\n", 795 | " tgt_lens = tgt_lens.cuda()\n", 796 | " input_seq = input_seq.cuda()\n", 797 | " decoder_outputs = decoder_outputs.cuda()\n", 798 | " \n", 799 | " # -------------------------------------\n", 800 | " # Training mode (enable dropout)\n", 801 | " # -------------------------------------\n", 802 | " encoder.train()\n", 803 | " decoder.train()\n", 804 | " \n", 805 | " # -------------------------------------\n", 806 | " # Zero gradients, since optimizers will accumulate gradients for every backward.\n", 807 | " # -------------------------------------\n", 808 | " encoder_optim.zero_grad()\n", 809 | " decoder_optim.zero_grad()\n", 810 | " \n", 811 | " # -------------------------------------\n", 812 | " # Forward encoder\n", 813 | " # -------------------------------------\n", 814 | " encoder_outputs, encoder_hidden = encoder(src_seqs, src_lens.data.tolist())\n", 815 | "\n", 816 | " # -------------------------------------\n", 817 | " # Forward decoder\n", 818 | " # -------------------------------------\n", 819 | " # Initialize decoder's hidden state as encoder's last hidden state.\n", 820 | " decoder_hidden = encoder_hidden\n", 821 | " \n", 822 | " # Run through decoder one time step at a time.\n", 823 | " for t in range(max_tgt_len):\n", 824 | " \n", 825 | " # decoder returns:\n", 826 | " # - decoder_output : (batch_size, vocab_size)\n", 827 | " # - decoder_hidden : (num_layers, batch_size, hidden_size)\n", 828 | " # - attention_weights: (batch_size, max_src_len)\n", 829 | " decoder_output, decoder_hidden, attention_weights = decoder(input_seq, decoder_hidden,\n", 830 | " encoder_outputs, src_lens)\n", 831 | "\n", 832 | " # Store decoder outputs.\n", 833 | " 
decoder_outputs[t] = decoder_output\n", 834 | " \n", 835 | " # Next input is current target\n", 836 | " input_seq = tgt_seqs[t]\n", 837 | " \n", 838 | " # Detach hidden state:\n", 839 | " detach_hidden(decoder_hidden)\n", 840 | " \n", 841 | " # -------------------------------------\n", 842 | " # Compute loss\n", 843 | " # -------------------------------------\n", 844 | " loss, pred_seqs, num_corrects, num_words = masked_cross_entropy(\n", 845 | " decoder_outputs[:max_tgt_len].transpose(0,1).contiguous(), \n", 846 | " tgt_seqs.transpose(0,1).contiguous(),\n", 847 | " tgt_lens\n", 848 | " )\n", 849 | " \n", 850 | " pred_seqs = pred_seqs[:max_tgt_len]\n", 851 | " \n", 852 | " # -------------------------------------\n", 853 | " # Backward and optimize\n", 854 | " # -------------------------------------\n", 855 | " # Backward to get gradients w.r.t parameters in model.\n", 856 | " loss.backward()\n", 857 | " \n", 858 | " # Clip gradients\n", 859 | " encoder_grad_norm = nn.utils.clip_grad_norm(encoder.parameters(), opts.max_grad_norm)\n", 860 | " decoder_grad_norm = nn.utils.clip_grad_norm(decoder.parameters(), opts.max_grad_norm)\n", 861 | " clipped_encoder_grad_norm = compute_grad_norm(encoder.parameters())\n", 862 | " clipped_decoder_grad_norm = compute_grad_norm(decoder.parameters())\n", 863 | " \n", 864 | " # Update parameters with optimizers\n", 865 | " encoder_optim.step()\n", 866 | " decoder_optim.step()\n", 867 | " \n", 868 | " return loss.data[0], pred_seqs, attention_weights, num_corrects, num_words,\\\n", 869 | " encoder_grad_norm, decoder_grad_norm, clipped_encoder_grad_norm, clipped_decoder_grad_norm" 870 | ] 871 | }, 872 | { 873 | "cell_type": "markdown", 874 | "metadata": {}, 875 | "source": [ 876 | "## Main\n", 877 | "\n", 878 | "### Load dataset\n", 879 | "You can download the small grammatical error correction dataset from [here](https://github.com/keisks/jfleg)." 
880 | ] 881 | }, 882 | { 883 | "cell_type": "code", 884 | "execution_count": 9, 885 | "metadata": {}, 886 | "outputs": [ 887 | { 888 | "name": "stdout", 889 | "output_type": "stream", 890 | "text": [ 891 | "====================================================================================================\n", 892 | "Dataset preprocessing log:\n", 893 | "- Loading and tokenizing source sentences...\n" 894 | ] 895 | }, 896 | { 897 | "name": "stderr", 898 | "output_type": "stream", 899 | "text": [ 900 | "100%|██████████| 2443191/2443191 [00:07<00:00, 343302.86it/s]\n", 901 | " 0%| | 6142/2443191 [00:00<00:46, 52749.90it/s]" 902 | ] 903 | }, 904 | { 905 | "name": "stdout", 906 | "output_type": "stream", 907 | "text": [ 908 | "- Loading and tokenizing target sentences...\n" 909 | ] 910 | }, 911 | { 912 | "name": "stderr", 913 | "output_type": "stream", 914 | "text": [ 915 | "100%|██████████| 2443191/2443191 [00:06<00:00, 365671.01it/s]\n", 916 | " 1%| | 25731/2443191 [00:00<00:09, 257275.90it/s]" 917 | ] 918 | }, 919 | { 920 | "name": "stdout", 921 | "output_type": "stream", 922 | "text": [ 923 | "- Building source counter...\n" 924 | ] 925 | }, 926 | { 927 | "name": "stderr", 928 | "output_type": "stream", 929 | "text": [ 930 | "100%|██████████| 2443191/2443191 [00:08<00:00, 272308.93it/s]\n", 931 | " 1%| | 29309/2443191 [00:00<00:08, 293065.82it/s]" 932 | ] 933 | }, 934 | { 935 | "name": "stdout", 936 | "output_type": "stream", 937 | "text": [ 938 | "- Building target counter...\n" 939 | ] 940 | }, 941 | { 942 | "name": "stderr", 943 | "output_type": "stream", 944 | "text": [ 945 | "100%|██████████| 2443191/2443191 [00:10<00:00, 240597.14it/s]\n" 946 | ] 947 | }, 948 | { 949 | "name": "stdout", 950 | "output_type": "stream", 951 | "text": [ 952 | "- Building source vocabulary...\n" 953 | ] 954 | }, 955 | { 956 | "name": "stderr", 957 | "output_type": "stream", 958 | "text": [ 959 | "50000it [00:00, 1461165.22it/s]\n", 960 | "100%|██████████| 50004/50004 [00:00<00:00, 2374494.52it/s]" 961 | ] 962 | }, 963 | { 964 | "name": "stdout", 965 | "output_type": "stream", 966 | "text": [ 967 | "- Building target vocabulary...\n", 968 | "====================================================================================================\n", 969 | "Dataset Info:\n", 970 | "- Number of source sentences: 2443191\n", 971 | "- Number of target sentences: 2443191\n", 972 | "- Source vocabulary size: 50004\n", 973 | "- Target vocabulary size: 50004\n", 974 | "- Shared vocabulary: True\n", 975 | "====================================================================================================\n", 976 | "\n" 977 | ] 978 | }, 979 | { 980 | "name": "stderr", 981 | "output_type": "stream", 982 | "text": [ 983 | "\n" 984 | ] 985 | } 986 | ], 987 | "source": [ 988 | "# train_dataset = NMTDataset(src_path='../dataset/jfleg/dev/dev.src',\n", 989 | "# tgt_path='../dataset/jfleg/dev/dev.ref1')\n", 990 | "\n", 991 | "train_dataset = NMTDataset(src_path='../dataset/efcamdat/efcamdat2.changed.src.txt',\n", 992 | " tgt_path='../dataset/efcamdat/efcamdat2.changed.tgt.txt')" 993 | ] 994 | }, 995 | { 996 | "cell_type": "code", 997 | "execution_count": 10, 998 | "metadata": {}, 999 | "outputs": [ 1000 | { 1001 | "name": "stderr", 1002 | "output_type": "stream", 1003 | "text": [ 1004 | "100%|██████████| 754/754 [00:00<00:00, 359334.76it/s]\n", 1005 | "100%|██████████| 754/754 [00:00<00:00, 333597.60it/s]" 1006 | ] 1007 | }, 1008 | { 1009 | "name": "stdout", 1010 | "output_type": "stream", 1011 | "text": [ 1012 | 
"====================================================================================================\n", 1013 | "Dataset preprocessing log:\n", 1014 | "- Loading and tokenizing source sentences...\n", 1015 | "- Loading and tokenizing target sentences...\n", 1016 | "====================================================================================================\n", 1017 | "Dataset Info:\n", 1018 | "- Number of source sentences: 754\n", 1019 | "- Number of target sentences: 754\n", 1020 | "- Source vocabulary size: 50004\n", 1021 | "- Target vocabulary size: 50004\n", 1022 | "- Shared vocabulary: True\n", 1023 | "====================================================================================================\n", 1024 | "\n" 1025 | ] 1026 | }, 1027 | { 1028 | "name": "stderr", 1029 | "output_type": "stream", 1030 | "text": [ 1031 | "\n" 1032 | ] 1033 | } 1034 | ], 1035 | "source": [ 1036 | "valid_dataset = NMTDataset(src_path='../dataset/jfleg/dev/dev.src',\n", 1037 | " tgt_path='../dataset/jfleg/dev/dev.ref0',\n", 1038 | " src_vocab=train_dataset.src_vocab,\n", 1039 | " tgt_vocab=train_dataset.tgt_vocab)" 1040 | ] 1041 | }, 1042 | { 1043 | "cell_type": "markdown", 1044 | "metadata": {}, 1045 | "source": [ 1046 | "### Batchify dataset using dataloader" 1047 | ] 1048 | }, 1049 | { 1050 | "cell_type": "code", 1051 | "execution_count": 11, 1052 | "metadata": { 1053 | "collapsed": true 1054 | }, 1055 | "outputs": [], 1056 | "source": [ 1057 | "batch_size = 48\n", 1058 | "\n", 1059 | "train_iter = DataLoader(dataset=train_dataset,\n", 1060 | " batch_size=batch_size,\n", 1061 | " shuffle=True,\n", 1062 | " num_workers=4,\n", 1063 | " collate_fn=collate_fn)\n", 1064 | "\n", 1065 | "valid_iter = DataLoader(dataset=valid_dataset,\n", 1066 | " batch_size=batch_size, \n", 1067 | " shuffle=False,\n", 1068 | " num_workers=4,\n", 1069 | " collate_fn=collate_fn)" 1070 | ] 1071 | }, 1072 | { 1073 | "cell_type": "markdown", 1074 | "metadata": {}, 1075 | "source": [ 1076 | "### Hyperparameters" 1077 | ] 1078 | }, 1079 | { 1080 | "cell_type": "code", 1081 | "execution_count": 12, 1082 | "metadata": { 1083 | "collapsed": true 1084 | }, 1085 | "outputs": [], 1086 | "source": [ 1087 | "# If enabled, load checkpoint.\n", 1088 | "LOAD_CHECKPOINT = True\n", 1089 | "\n", 1090 | "if LOAD_CHECKPOINT:\n", 1091 | " # Modify this path.\n", 1092 | " checkpoint_path = './checkpoints/seq2seq_2018-02-07 20:30:47_acc_88.15_loss_12.85_step_135000.pt'\n", 1093 | " checkpoint = load_checkpoint(checkpoint_path)\n", 1094 | " opts = checkpoint['opts'] \n", 1095 | "else:\n", 1096 | " opts = AttrDict()\n", 1097 | "\n", 1098 | " # Configure models\n", 1099 | " opts.word_vec_size = 300\n", 1100 | " opts.rnn_type = 'LSTM'\n", 1101 | " opts.hidden_size = 512\n", 1102 | " opts.num_layers = 2\n", 1103 | " opts.dropout = 0.3\n", 1104 | " opts.bidirectional = True\n", 1105 | " opts.attention = True\n", 1106 | " opts.share_embeddings = True\n", 1107 | " opts.pretrained_embeddings = True\n", 1108 | " opts.fixed_embeddings = True\n", 1109 | " opts.tie_embeddings = True # Tie decoder's input and output embeddings\n", 1110 | "\n", 1111 | " # Configure optimization\n", 1112 | " opts.max_grad_norm = 2\n", 1113 | " opts.learning_rate = 0.001\n", 1114 | " opts.weight_decay = 1e-5 # L2 weight regularization\n", 1115 | " \n", 1116 | " # Configure training\n", 1117 | " opts.max_seq_len = 100 # max sequence length to prevent OOM.\n", 1118 | " opts.num_epochs = 5\n", 1119 | " opts.print_every_step = 20\n", 1120 | " opts.save_every_step = 5000" 1121 
| ] 1122 | }, 1123 | { 1124 | "cell_type": "code", 1125 | "execution_count": 13, 1126 | "metadata": {}, 1127 | "outputs": [ 1128 | { 1129 | "name": "stdout", 1130 | "output_type": "stream", 1131 | "text": [ 1132 | "====================================================================================================\n", 1133 | "Options log:\n", 1134 | "- Load from checkpoint: True\n", 1135 | "- Global step: 135000\n", 1136 | "- word_vec_size: 300\n", 1137 | "- rnn_type: LSTM\n", 1138 | "- hidden_size: 512\n", 1139 | "- num_layers: 2\n", 1140 | "- dropout: 0.3\n", 1141 | "- bidirectional: True\n", 1142 | "- attention: True\n", 1143 | "- share_embeddings: True\n", 1144 | "- pretrained_embeddings: True\n", 1145 | "- fixed_embeddings: True\n", 1146 | "- tie_embeddings: True\n", 1147 | "- max_grad_norm: 2\n", 1148 | "- learning_rate: 0.001\n", 1149 | "- weight_decay: 1e-05\n", 1150 | "- max_seq_len: 100\n", 1151 | "- num_epochs: 5\n", 1152 | "- print_every_step: 20\n", 1153 | "- save_every_step: 5000\n", 1154 | "====================================================================================================\n", 1155 | "\n" 1156 | ] 1157 | } 1158 | ], 1159 | "source": [ 1160 | "print('='*100)\n", 1161 | "print('Options log:')\n", 1162 | "print('- Load from checkpoint: {}'.format(LOAD_CHECKPOINT))\n", 1163 | "if LOAD_CHECKPOINT: print('- Global step: {}'.format(checkpoint['global_step']))\n", 1164 | "for k,v in opts.items(): print('- {}: {}'.format(k, v))\n", 1165 | "print('='*100 + '\\n')" 1166 | ] 1167 | }, 1168 | { 1169 | "cell_type": "markdown", 1170 | "metadata": {}, 1171 | "source": [ 1172 | "### Initialize embeddings and models" 1173 | ] 1174 | }, 1175 | { 1176 | "cell_type": "code", 1177 | "execution_count": 14, 1178 | "metadata": {}, 1179 | "outputs": [ 1180 | { 1181 | "name": "stderr", 1182 | "output_type": "stream", 1183 | "text": [ 1184 | " 38%|███▊ | 19055/50004 [00:00<00:00, 190512.48it/s]" 1185 | ] 1186 | }, 1187 | { 1188 | "name": "stdout", 1189 | "output_type": "stream", 1190 | "text": [ 1191 | "====================================================================================================\n", 1192 | "Loading spacy glove embedding:\n", 1193 | "- Vocabulary size: 50004\n", 1194 | "- Word vector size: 300\n" 1195 | ] 1196 | }, 1197 | { 1198 | "name": "stderr", 1199 | "output_type": "stream", 1200 | "text": [ 1201 | "100%|██████████| 50004/50004 [00:00<00:00, 144659.40it/s]\n" 1202 | ] 1203 | }, 1204 | { 1205 | "name": "stdout", 1206 | "output_type": "stream", 1207 | "text": [ 1208 | "- Unknown word count: 9362\n", 1209 | "====================================================================================================\n", 1210 | "\n" 1211 | ] 1212 | } 1213 | ], 1214 | "source": [ 1215 | "# Initialize vocabulary size.\n", 1216 | "src_vocab_size = len(train_dataset.src_vocab.token2id)\n", 1217 | "tgt_vocab_size = len(train_dataset.tgt_vocab.token2id)\n", 1218 | "\n", 1219 | "# Initialize embeddings.\n", 1220 | "# We can actually put all modules in one module like `NMTModel`)\n", 1221 | "# See: https://github.com/spro/practical-pytorch/issues/34\n", 1222 | "word_vec_size = opts.word_vec_size if not opts.pretrained_embeddings else nlp.vocab.vectors_length\n", 1223 | "src_embedding = nn.Embedding(src_vocab_size, word_vec_size, padding_idx=PAD)\n", 1224 | "tgt_embedding = nn.Embedding(tgt_vocab_size, word_vec_size, padding_idx=PAD)\n", 1225 | "\n", 1226 | "if opts.share_embeddings:\n", 1227 | " assert(src_vocab_size == tgt_vocab_size)\n", 1228 | " tgt_embedding.weight = 
src_embedding.weight\n", 1229 | "\n", 1230 | "# Initialize models.\n", 1231 | "encoder = EncoderRNN(embedding=src_embedding,\n", 1232 | " rnn_type=opts.rnn_type,\n", 1233 | " hidden_size=opts.hidden_size,\n", 1234 | " num_layers=opts.num_layers,\n", 1235 | " dropout=opts.dropout,\n", 1236 | " bidirectional=opts.bidirectional)\n", 1237 | "\n", 1238 | "decoder = LuongAttnDecoderRNN(encoder, embedding=tgt_embedding,\n", 1239 | " attention=opts.attention,\n", 1240 | " tie_embeddings=opts.tie_embeddings,\n", 1241 | " dropout=opts.dropout)\n", 1242 | "\n", 1243 | "if opts.pretrained_embeddings:\n", 1244 | " glove_embeddings = load_spacy_glove_embedding(nlp, train_dataset.src_vocab)\n", 1245 | " encoder.embedding.weight.data.copy_(glove_embeddings)\n", 1246 | " decoder.embedding.weight.data.copy_(glove_embeddings)\n", 1247 | " if opts.fixed_embeddings:\n", 1248 | " encoder.embedding.weight.requires_grad = False\n", 1249 | " decoder.embedding.weight.requires_grad = False\n", 1250 | " \n", 1251 | "if LOAD_CHECKPOINT:\n", 1252 | " encoder.load_state_dict(checkpoint['encoder_state_dict'])\n", 1253 | " decoder.load_state_dict(checkpoint['decoder_state_dict'])\n", 1254 | " \n", 1255 | "# Move models to GPU (need time for initial run)\n", 1256 | "if USE_CUDA:\n", 1257 | " encoder.cuda()\n", 1258 | " decoder.cuda()" 1259 | ] 1260 | }, 1261 | { 1262 | "cell_type": "markdown", 1263 | "metadata": {}, 1264 | "source": [ 1265 | "### Fine-tuning embeddings\n", 1266 | "Recommend to use fine-tune after training for a while until the training loss don't decrease.\n", 1267 | "\n", 1268 | "TODO: Should be controlled in training loop." 1269 | ] 1270 | }, 1271 | { 1272 | "cell_type": "code", 1273 | "execution_count": 15, 1274 | "metadata": { 1275 | "collapsed": true 1276 | }, 1277 | "outputs": [], 1278 | "source": [ 1279 | "FINE_TUNE = True\n", 1280 | "if FINE_TUNE:\n", 1281 | " encoder.embedding.weight.requires_grad = True" 1282 | ] 1283 | }, 1284 | { 1285 | "cell_type": "code", 1286 | "execution_count": 16, 1287 | "metadata": {}, 1288 | "outputs": [ 1289 | { 1290 | "name": "stdout", 1291 | "output_type": "stream", 1292 | "text": [ 1293 | "====================================================================================================\n", 1294 | "Model log:\n", 1295 | "\n", 1296 | "EncoderRNN (\n", 1297 | " (embedding): Embedding(50004, 300, padding_idx=0)\n", 1298 | " (rnn): LSTM(300, 256, num_layers=2, dropout=0.3, bidirectional=True)\n", 1299 | ")\n", 1300 | "LuongAttnDecoderRNN (\n", 1301 | " (embedding): Embedding(50004, 300, padding_idx=0)\n", 1302 | " (rnn): LSTM(300, 512, num_layers=2, dropout=0.3)\n", 1303 | " (W_a): Linear (512 -> 512)\n", 1304 | " (W_c): Linear (1024 -> 512)\n", 1305 | " (W_proj): Linear (512 -> 300)\n", 1306 | " (W_s): Linear (300 -> 50004)\n", 1307 | ")\n", 1308 | "- Encoder input embedding requires_grad=True\n", 1309 | "- Decoder input embedding requires_grad=True\n", 1310 | "- Decoder output embedding requires_grad=True\n", 1311 | "====================================================================================================\n", 1312 | "\n" 1313 | ] 1314 | } 1315 | ], 1316 | "source": [ 1317 | "print('='*100)\n", 1318 | "print('Model log:\\n')\n", 1319 | "print(encoder)\n", 1320 | "print(decoder)\n", 1321 | "print('- Encoder input embedding requires_grad={}'.format(encoder.embedding.weight.requires_grad))\n", 1322 | "print('- Decoder input embedding requires_grad={}'.format(decoder.embedding.weight.requires_grad))\n", 1323 | "print('- Decoder output embedding 
requires_grad={}'.format(decoder.W_s.weight.requires_grad))\n", 1324 | "print('='*100 + '\\n')" 1325 | ] 1326 | }, 1327 | { 1328 | "cell_type": "markdown", 1329 | "metadata": {}, 1330 | "source": [ 1331 | "### Initialize optimizers\n", 1332 | "TODO: Different learning rate for fine tuning embeddings: https://discuss.pytorch.org/t/how-to-perform-finetuning-in-pytorch/419/7" 1333 | ] 1334 | }, 1335 | { 1336 | "cell_type": "code", 1337 | "execution_count": 17, 1338 | "metadata": { 1339 | "collapsed": true 1340 | }, 1341 | "outputs": [], 1342 | "source": [ 1343 | "# Initialize optimizers (we can experiment different learning rates)\n", 1344 | "encoder_optim = optim.Adam([p for p in encoder.parameters() if p.requires_grad], lr=opts.learning_rate, weight_decay=opts.weight_decay)\n", 1345 | "decoder_optim = optim.Adam([p for p in decoder.parameters() if p.requires_grad], lr=opts.learning_rate, weight_decay=opts.weight_decay)" 1346 | ] 1347 | }, 1348 | { 1349 | "cell_type": "markdown", 1350 | "metadata": {}, 1351 | "source": [ 1352 | "## Training" 1353 | ] 1354 | }, 1355 | { 1356 | "cell_type": "code", 1357 | "execution_count": null, 1358 | "metadata": { 1359 | "collapsed": true 1360 | }, 1361 | "outputs": [], 1362 | "source": [] 1363 | }, 1364 | { 1365 | "cell_type": "code", 1366 | "execution_count": null, 1367 | "metadata": { 1368 | "collapsed": true 1369 | }, 1370 | "outputs": [], 1371 | "source": [ 1372 | "\"\"\" Open port 6006 and see tensorboard.\n", 1373 | " Ref: https://medium.com/@dexterhuang/%E7%B5%A6-pytorch-%E7%94%A8%E7%9A%84-tensorboard-bb341ce3f837\n", 1374 | "\"\"\"\n", 1375 | "from datetime import datetime\n", 1376 | "from tensorboardX import SummaryWriter\n", 1377 | "# --------------------------\n", 1378 | "# Configure tensorboard\n", 1379 | "# --------------------------\n", 1380 | "model_name = 'seq2seq'\n", 1381 | "datetime = ('%s' % datetime.now()).split('.')[0]\n", 1382 | "experiment_name = '{}_{}'.format(model_name, datetime)\n", 1383 | "tensorboard_log_dir = './tensorboard-logs/{}/'.format(experiment_name)\n", 1384 | "writer = SummaryWriter(tensorboard_log_dir)\n", 1385 | "\n", 1386 | "# --------------------------\n", 1387 | "# Configure training\n", 1388 | "# --------------------------\n", 1389 | "num_epochs = opts.num_epochs\n", 1390 | "print_every_step = opts.print_every_step\n", 1391 | "save_every_step = opts.save_every_step\n", 1392 | "# For saving checkpoint and tensorboard\n", 1393 | "global_step = 0 if not LOAD_CHECKPOINT else checkpoint['global_step']\n", 1394 | "\n", 1395 | "# --------------------------\n", 1396 | "# Start training\n", 1397 | "# --------------------------\n", 1398 | "total_loss = 0\n", 1399 | "total_corrects = 0\n", 1400 | "total_words = 0\n", 1401 | "prev_gpu_memory_usage = 0" 1402 | ] 1403 | }, 1404 | { 1405 | "cell_type": "code", 1406 | "execution_count": null, 1407 | "metadata": { 1408 | "collapsed": true, 1409 | "scrolled": true 1410 | }, 1411 | "outputs": [], 1412 | "source": [ 1413 | "for epoch in range(num_epochs):\n", 1414 | " for batch_id, batch_data in tqdm(enumerate(train_iter)):\n", 1415 | "\n", 1416 | " # Unpack batch data\n", 1417 | " src_sents, tgt_sents, src_seqs, tgt_seqs, src_lens, tgt_lens = batch_data\n", 1418 | " \n", 1419 | " # Ignore batch if there is a long sequence.\n", 1420 | " max_seq_len = max(src_lens + tgt_lens)\n", 1421 | " if max_seq_len > opts.max_seq_len:\n", 1422 | " print('[!] 
Ignore batch: sequence length={} > max sequence length={}'.format(max_seq_len, opts.max_seq_len))\n", 1423 | " continue\n", 1424 | " \n", 1425 | " # Train.\n", 1426 | " loss, pred_seqs, attention_weights, num_corrects, num_words, \\\n", 1427 | " encoder_grad_norm, decoder_grad_norm, clipped_encoder_grad_norm, clipped_decoder_grad_norm \\\n", 1428 | " = train(src_sents, tgt_sents, src_seqs, tgt_seqs, src_lens, tgt_lens, encoder, decoder, encoder_optim, decoder_optim, opts)\n", 1429 | "\n", 1430 | " # Statistics.\n", 1431 | " global_step += 1\n", 1432 | " total_loss += loss\n", 1433 | " total_corrects += num_corrects\n", 1434 | " total_words += num_words\n", 1435 | " total_accuracy = 100 * (total_corrects / total_words)\n", 1436 | " \n", 1437 | " # Save checkpoint.\n", 1438 | " if global_step % save_every_step == 0:\n", 1439 | " \n", 1440 | " checkpoint_path = save_checkpoint(opts, experiment_name, encoder, decoder, encoder_optim, decoder_optim, \n", 1441 | " total_accuracy, total_loss, global_step)\n", 1442 | " \n", 1443 | " print('='*100)\n", 1444 | " print('Save checkpoint to \"{}\".'.format(checkpoint_path))\n", 1445 | " print('='*100 + '\\n')\n", 1446 | "\n", 1447 | " # Print statistics and write to Tensorboard.\n", 1448 | " if global_step % print_every_step == 0:\n", 1449 | " \n", 1450 | " curr_gpu_memory_usage = get_gpu_memory_usage(device_id=torch.cuda.current_device())\n", 1451 | " diff_gpu_memory_usage = curr_gpu_memory_usage - prev_gpu_memory_usage\n", 1452 | " prev_gpu_memory_usage = curr_gpu_memory_usage\n", 1453 | " \n", 1454 | " print('='*100)\n", 1455 | " print('Training log:')\n", 1456 | " print('- Epoch: {}/{}'.format(epoch, num_epochs))\n", 1457 | " print('- Global step: {}'.format(global_step))\n", 1458 | " print('- Total loss: {}'.format(total_loss))\n", 1459 | " print('- Total corrects: {}'.format(total_corrects))\n", 1460 | " print('- Total words: {}'.format(total_words))\n", 1461 | " print('- Total accuracy: {}'.format(total_accuracy))\n", 1462 | " print('- Current GPU memory usage: {}'.format(curr_gpu_memory_usage))\n", 1463 | " print('- Diff GPU memory usage: {}'.format(diff_gpu_memory_usage))\n", 1464 | " print('='*100 + '\\n')\n", 1465 | " \n", 1466 | " write_to_tensorboard(writer, global_step, total_loss, total_corrects, total_words, total_accuracy,\n", 1467 | " encoder_grad_norm, decoder_grad_norm, clipped_encoder_grad_norm, clipped_decoder_grad_norm,\n", 1468 | " encoder, decoder,\n", 1469 | " gpu_memory_usage={\n", 1470 | " 'curr': curr_gpu_memory_usage,\n", 1471 | " 'diff': diff_gpu_memory_usage\n", 1472 | " })\n", 1473 | " \n", 1474 | " total_loss = 0\n", 1475 | " total_corrects = 0\n", 1476 | " total_words = 0\n", 1477 | "\n", 1478 | " # Free memory\n", 1479 | " del src_sents, tgt_sents, src_seqs, tgt_seqs, src_lens, tgt_lens, \\\n", 1480 | " loss, pred_seqs, attention_weights, num_corrects, num_words, \\\n", 1481 | " encoder_grad_norm, decoder_grad_norm, clipped_encoder_grad_norm, clipped_decoder_grad_norm\n", 1482 | " " 1483 | ] 1484 | }, 1485 | { 1486 | "cell_type": "code", 1487 | "execution_count": null, 1488 | "metadata": { 1489 | "collapsed": true 1490 | }, 1491 | "outputs": [], 1492 | "source": [ 1493 | "checkpoint_path = save_checkpoint(opts, experiment_name, encoder, decoder, encoder_optim, decoder_optim, \n", 1494 | " total_accuracy, total_loss, global_step)\n", 1495 | " \n", 1496 | "print('='*100)\n", 1497 | "print('Save checkpoint to \"{}\".'.format(checkpoint_path))\n", 1498 | "print('='*100 + '\\n')" 1499 | ] 1500 | }, 1501 | { 1502 | 
"cell_type": "markdown", 1503 | "metadata": {}, 1504 | "source": [ 1505 | "## Evaluation" 1506 | ] 1507 | }, 1508 | { 1509 | "cell_type": "code", 1510 | "execution_count": 18, 1511 | "metadata": { 1512 | "collapsed": true 1513 | }, 1514 | "outputs": [], 1515 | "source": [ 1516 | "def evaluate(src_sents, tgt_sents, src_seqs, tgt_seqs, src_lens, tgt_lens, encoder, decoder):\n", 1517 | " # -------------------------------------\n", 1518 | " # Prepare input and output placeholders\n", 1519 | " # -------------------------------------\n", 1520 | " # Last batch might not have the same size as we set to the `batch_size`\n", 1521 | " batch_size = src_seqs.size(1)\n", 1522 | " assert(batch_size == tgt_seqs.size(1))\n", 1523 | " \n", 1524 | " # Pack tensors to variables for neural network inputs (in order to autograd)\n", 1525 | " src_seqs = Variable(src_seqs, volatile=True)\n", 1526 | " tgt_seqs = Variable(tgt_seqs, volatile=True)\n", 1527 | " src_lens = Variable(torch.LongTensor(src_lens), volatile=True)\n", 1528 | " tgt_lens = Variable(torch.LongTensor(tgt_lens), volatile=True)\n", 1529 | "\n", 1530 | " # Decoder's input\n", 1531 | " input_seq = Variable(torch.LongTensor([BOS] * batch_size), volatile=True)\n", 1532 | " \n", 1533 | " # Decoder's output sequence length = max target sequence length of current batch.\n", 1534 | " max_tgt_len = tgt_lens.data.max()\n", 1535 | " \n", 1536 | " # Store all decoder's outputs.\n", 1537 | " # **CRUTIAL** \n", 1538 | " # Don't set:\n", 1539 | " # >> decoder_outputs = Variable(torch.zeros(max_tgt_len, batch_size, decoder.vocab_size))\n", 1540 | " # Varying tensor size could cause GPU allocate a new memory causing OOM, \n", 1541 | " # so we intialize tensor with fixed size instead:\n", 1542 | " # `opts.max_seq_len` is a fixed number, unlike `max_tgt_len` always varys.\n", 1543 | " decoder_outputs = Variable(torch.zeros(opts.max_seq_len, batch_size, decoder.vocab_size), volatile=True)\n", 1544 | "\n", 1545 | " # Move variables from CPU to GPU.\n", 1546 | " if USE_CUDA:\n", 1547 | " src_seqs = src_seqs.cuda()\n", 1548 | " tgt_seqs = tgt_seqs.cuda()\n", 1549 | " src_lens = src_lens.cuda()\n", 1550 | " tgt_lens = tgt_lens.cuda()\n", 1551 | " input_seq = input_seq.cuda()\n", 1552 | " decoder_outputs = decoder_outputs.cuda()\n", 1553 | " \n", 1554 | " # -------------------------------------\n", 1555 | " # Evaluation mode (disable dropout)\n", 1556 | " # -------------------------------------\n", 1557 | " encoder.eval()\n", 1558 | " decoder.eval()\n", 1559 | " \n", 1560 | " # -------------------------------------\n", 1561 | " # Forward encoder\n", 1562 | " # -------------------------------------\n", 1563 | " encoder_outputs, encoder_hidden = encoder(src_seqs, src_lens.data.tolist())\n", 1564 | " \n", 1565 | " # -------------------------------------\n", 1566 | " # Forward decoder\n", 1567 | " # -------------------------------------\n", 1568 | " # Initialize decoder's hidden state as encoder's last hidden state.\n", 1569 | " decoder_hidden = encoder_hidden\n", 1570 | " \n", 1571 | " # Run through decoder one time step at a time.\n", 1572 | " for t in range(max_tgt_len):\n", 1573 | " \n", 1574 | " # decoder returns:\n", 1575 | " # - decoder_output : (batch_size, vocab_size)\n", 1576 | " # - decoder_hidden : (num_layers, batch_size, hidden_size)\n", 1577 | " # - attention_weights: (batch_size, max_src_len)\n", 1578 | " decoder_output, decoder_hidden, attention_weights = decoder(input_seq, decoder_hidden,\n", 1579 | " encoder_outputs, src_lens)\n", 1580 | "\n", 1581 | " # 
Store decoder outputs.\n", 1582 | " decoder_outputs[t] = decoder_output\n", 1583 | " \n", 1584 | " # Next input is current target\n", 1585 | " input_seq = tgt_seqs[t]\n", 1586 | " \n", 1587 | " # Detach hidden state (may not need this, since no BPTT)\n", 1588 | " detach_hidden(decoder_hidden)\n", 1589 | " \n", 1590 | " # -------------------------------------\n", 1591 | " # Compute loss\n", 1592 | " # -------------------------------------\n", 1593 | " loss, pred_seqs, num_corrects, num_words = masked_cross_entropy(\n", 1594 | " decoder_outputs[:max_tgt_len].transpose(0,1).contiguous(), \n", 1595 | " tgt_seqs.transpose(0,1).contiguous(),\n", 1596 | " tgt_lens\n", 1597 | " )\n", 1598 | " \n", 1599 | " pred_seqs = pred_seqs[:max_tgt_len]\n", 1600 | " \n", 1601 | " return loss.data[0], pred_seqs, attention_weights, num_corrects, num_words" 1602 | ] 1603 | }, 1604 | { 1605 | "cell_type": "code", 1606 | "execution_count": 19, 1607 | "metadata": {}, 1608 | "outputs": [ 1609 | { 1610 | "name": "stderr", 1611 | "output_type": "stream", 1612 | "text": [ 1613 | "16it [00:04, 3.73it/s]" 1614 | ] 1615 | }, 1616 | { 1617 | "name": "stdout", 1618 | "output_type": "stream", 1619 | "text": [ 1620 | "====================================================================================================\n", 1621 | "Validation log:\n", 1622 | "- Total loss: 23.030829787254333\n", 1623 | "- Total corrects: 11675\n", 1624 | "- Total words: 14994\n", 1625 | "- Total accuracy: 77.86447912498332\n", 1626 | "====================================================================================================\n", 1627 | "\n" 1628 | ] 1629 | }, 1630 | { 1631 | "name": "stderr", 1632 | "output_type": "stream", 1633 | "text": [ 1634 | "\n" 1635 | ] 1636 | } 1637 | ], 1638 | "source": [ 1639 | "total_loss = 0\n", 1640 | "total_corrects = 0\n", 1641 | "total_words = 0\n", 1642 | "\n", 1643 | "for batch_id, batch_data in tqdm(enumerate(valid_iter)):\n", 1644 | " src_sents, tgt_sents, src_seqs, tgt_seqs, src_lens, tgt_lens = batch_data\n", 1645 | " \n", 1646 | " loss, pred_seqs, attention_weights, num_corrects, num_words \\\n", 1647 | " = evaluate(src_sents, tgt_sents, src_seqs, tgt_seqs, src_lens, tgt_lens, encoder, decoder)\n", 1648 | " \n", 1649 | " total_loss += loss\n", 1650 | " total_corrects += num_corrects\n", 1651 | " total_words += num_words\n", 1652 | " total_accuracy = 100 * (total_corrects / total_words)\n", 1653 | "\n", 1654 | "print('='*100)\n", 1655 | "print('Validation log:')\n", 1656 | "print('- Total loss: {}'.format(total_loss))\n", 1657 | "print('- Total corrects: {}'.format(total_corrects))\n", 1658 | "print('- Total words: {}'.format(total_words))\n", 1659 | "print('- Total accuracy: {}'.format(total_accuracy))\n", 1660 | "print('='*100 + '\\n')" 1661 | ] 1662 | }, 1663 | { 1664 | "cell_type": "markdown", 1665 | "metadata": {}, 1666 | "source": [ 1667 | "## Translate (Inference)" 1668 | ] 1669 | }, 1670 | { 1671 | "cell_type": "code", 1672 | "execution_count": 50, 1673 | "metadata": { 1674 | "collapsed": true 1675 | }, 1676 | "outputs": [], 1677 | "source": [ 1678 | "def translate(src_text, train_dataset, encoder, decoder, max_seq_len, replace_unk=True):\n", 1679 | " # -------------------------------------\n", 1680 | " # Prepare input and output placeholders\n", 1681 | " # -------------------------------------\n", 1682 | " # Like dataset's `__getitem__()` and dataloader's `collate_fn()`.\n", 1683 | " src_sent = src_text.split()\n", 1684 | " src_seqs = 
torch.LongTensor([train_dataset.tokens2ids(tokens=src_text.split(),\n", 1685 | " token2id=train_dataset.src_vocab.token2id,\n", 1686 | " append_BOS=False, append_EOS=True)]).transpose(0,1)\n", 1687 | " src_lens = [len(src_seqs)]\n", 1688 | " \n", 1689 | " # Last batch might not have the same size as we set to the `batch_size`\n", 1690 | " batch_size = src_seqs.size(1)\n", 1691 | " \n", 1692 | " # Pack tensors to variables for neural network inputs (in order to autograd)\n", 1693 | " src_seqs = Variable(src_seqs, volatile=True)\n", 1694 | " src_lens = Variable(torch.LongTensor(src_lens), volatile=True)\n", 1695 | "\n", 1696 | " # Decoder's input\n", 1697 | " input_seq = Variable(torch.LongTensor([BOS] * batch_size), volatile=True)\n", 1698 | " # Store output words and attention states\n", 1699 | " out_sent = []\n", 1700 | " all_attention_weights = torch.zeros(max_seq_len, len(src_seqs))\n", 1701 | " \n", 1702 | " # Move variables from CPU to GPU.\n", 1703 | " if USE_CUDA:\n", 1704 | " src_seqs = src_seqs.cuda()\n", 1705 | " src_lens = src_lens.cuda()\n", 1706 | " input_seq = input_seq.cuda()\n", 1707 | " \n", 1708 | " # -------------------------------------\n", 1709 | " # Evaluation mode (disable dropout)\n", 1710 | " # -------------------------------------\n", 1711 | " encoder.eval()\n", 1712 | " decoder.eval()\n", 1713 | " \n", 1714 | " # -------------------------------------\n", 1715 | " # Forward encoder\n", 1716 | " # -------------------------------------\n", 1717 | " encoder_outputs, encoder_hidden = encoder(src_seqs, src_lens.data.tolist())\n", 1718 | "\n", 1719 | " # -------------------------------------\n", 1720 | " # Forward decoder\n", 1721 | " # -------------------------------------\n", 1722 | " # Initialize decoder's hidden state as encoder's last hidden state.\n", 1723 | " decoder_hidden = encoder_hidden\n", 1724 | " \n", 1725 | " # Run through decoder one time step at a time.\n", 1726 | " for t in range(max_seq_len):\n", 1727 | " \n", 1728 | " # decoder returns:\n", 1729 | " # - decoder_output : (batch_size, vocab_size)\n", 1730 | " # - decoder_hidden : (num_layers, batch_size, hidden_size)\n", 1731 | " # - attention_weights: (batch_size, max_src_len)\n", 1732 | " decoder_output, decoder_hidden, attention_weights = decoder(input_seq, decoder_hidden,\n", 1733 | " encoder_outputs, src_lens)\n", 1734 | "\n", 1735 | " # Store attention weights.\n", 1736 | " # .squeeze(0): remove `batch_size` dimension since batch_size=1\n", 1737 | " all_attention_weights[t] = attention_weights.squeeze(0).cpu().data \n", 1738 | " \n", 1739 | " # Choose top word from decoder's output\n", 1740 | " prob, token_id = decoder_output.data.topk(1)\n", 1741 | " token_id = token_id[0][0] # get value\n", 1742 | " if token_id == EOS:\n", 1743 | " break\n", 1744 | " else:\n", 1745 | " if token_id == UNK and replace_unk:\n", 1746 | " # Replace unk by selecting the source token with the highest attention score.\n", 1747 | " score, idx = all_attention_weights[t].max(0)\n", 1748 | " token = src_sent[idx[0]]\n", 1749 | " else:\n", 1750 | " # \n", 1751 | " token = train_dataset.tgt_vocab.id2token[token_id]\n", 1752 | " \n", 1753 | " out_sent.append(token)\n", 1754 | " \n", 1755 | " # Next input is chosen word\n", 1756 | " input_seq = Variable(torch.LongTensor([token_id]), volatile=True)\n", 1757 | " if USE_CUDA: input_seq = input_seq.cuda()\n", 1758 | " \n", 1759 | " # Repackage hidden state (may not need this, since no BPTT)\n", 1760 | " detach_hidden(decoder_hidden)\n", 1761 | " \n", 1762 | " src_text = ' 
'.join([train_dataset.src_vocab.id2token[token_id] for token_id in src_seqs.data.squeeze(1).tolist()])\n", 1763 | " out_text = ' '.join(out_sent)\n", 1764 | " \n", 1765 | " # all_attention_weights: (out_len, src_len)\n", 1766 | " return src_text, out_text, all_attention_weights[:len(out_sent)]" 1767 | ] 1768 | }, 1769 | { 1770 | "cell_type": "markdown", 1771 | "metadata": {}, 1772 | "source": [ 1773 | "### Small test for translation" 1774 | ] 1775 | }, 1776 | { 1777 | "cell_type": "code", 1778 | "execution_count": 23, 1779 | "metadata": {}, 1780 | "outputs": [ 1781 | { 1782 | "data": { 1783 | "text/plain": [ 1784 | "('He have a car ', 'He has a car', \n", 1785 | " 0.8339 0.0667 0.0158 0.0305 0.0530\n", 1786 | " 0.0471 0.8918 0.0325 0.0141 0.0144\n", 1787 | " 0.0654 0.2109 0.5132 0.1534 0.0572\n", 1788 | " 0.0083 0.0270 0.0291 0.8793 0.0564\n", 1789 | " [torch.FloatTensor of size 4x5])" 1790 | ] 1791 | }, 1792 | "execution_count": 23, 1793 | "metadata": {}, 1794 | "output_type": "execute_result" 1795 | } 1796 | ], 1797 | "source": [ 1798 | "src_text, out_text, all_attention_weights = translate('He have a car', train_dataset, encoder, decoder, max_seq_len=opts.max_seq_len)\n", 1799 | "src_text, out_text, all_attention_weights" 1800 | ] 1801 | }, 1802 | { 1803 | "cell_type": "code", 1804 | "execution_count": null, 1805 | "metadata": { 1806 | "collapsed": true 1807 | }, 1808 | "outputs": [], 1809 | "source": [ 1810 | "# check attention weight sum == 1\n", 1811 | "[all_attention_weights[t].sum() for t in range(all_attention_weights.size(0))]" 1812 | ] 1813 | }, 1814 | { 1815 | "cell_type": "markdown", 1816 | "metadata": {}, 1817 | "source": [ 1818 | "### Translate a given text file" 1819 | ] 1820 | }, 1821 | { 1822 | "cell_type": "code", 1823 | "execution_count": 24, 1824 | "metadata": { 1825 | "collapsed": true 1826 | }, 1827 | "outputs": [], 1828 | "source": [ 1829 | "test_src_texts = []\n", 1830 | "with codecs.open('../dataset/jfleg/test/test.src', 'r', 'utf-8') as f:\n", 1831 | " test_src_texts = f.readlines()" 1832 | ] 1833 | }, 1834 | { 1835 | "cell_type": "code", 1836 | "execution_count": 56, 1837 | "metadata": {}, 1838 | "outputs": [ 1839 | { 1840 | "data": { 1841 | "text/plain": [ 1842 | "['New and new technology has been introduced to the society .\\n',\n", 1843 | " 'One possible outcome is that an environmentally-induced reduction in motorization levels in the richer countries will outweigh any rise in motorization levels in the poorer countries .\\n',\n", 1844 | " 'Every person needs to know a bit about math , sciences , arts , literature and history in order to stand out in society .\\n',\n", 1845 | " 'While the travel company will most likely show them some interesting sites in order for their customers to advertise for their company to their family and friends , it is highly unlikely , that the company will tell about the sites that were not included in the tour -- for example due to entrance fees that would make the total package price overly expensive .\\n',\n", 1846 | " 'Disadvantage is parking their car is very difficult .\\n']" 1847 | ] 1848 | }, 1849 | "execution_count": 56, 1850 | "metadata": {}, 1851 | "output_type": "execute_result" 1852 | } 1853 | ], 1854 | "source": [ 1855 | "test_src_texts[:5]" 1856 | ] 1857 | }, 1858 | { 1859 | "cell_type": "code", 1860 | "execution_count": 51, 1861 | "metadata": { 1862 | "collapsed": true 1863 | }, 1864 | "outputs": [], 1865 | "source": [ 1866 | "out_texts = []\n", 1867 | "for src_text in test_src_texts:\n", 1868 | " _, out_text, _ 
= translate(src_text.strip(), train_dataset, encoder, decoder, max_seq_len=opts.max_seq_len)\n", 1869 | "    out_texts.append(out_text)" 1870 | ] 1871 | }, 1872 | { 1873 | "cell_type": "code", 1874 | "execution_count": 57, 1875 | "metadata": {}, 1876 | "outputs": [ 1877 | { 1878 | "data": { 1879 | "text/plain": [ 1880 | "['The new and new technology has been introduced to the society .',\n", 1881 | " 'One possible outcome is that an environmentally-induced reduction in motorization levels in the higher countries will outweigh any rise in motorization levels in the high countries .',\n", 1882 | " 'Every person needs to know a bit about math , sciences , arts , literature and history in order to stand out in society .',\n", 1883 | " 'While the travel company will most likely show them some interesting sites in order for their customers to advertise for their company to their family and friends , it is highly unlikely , that the company will tell about the sites that were not included in the tour -- for example due to entrance fees that would make the total of price overly expensive .',\n", 1884 | " 'The price is parking their cars are very difficult .']" 1885 | ] 1886 | }, 1887 | "execution_count": 57, 1888 | "metadata": {}, 1889 | "output_type": "execute_result" 1890 | } 1891 | ], 1892 | "source": [ 1893 | "out_texts[:5]" 1894 | ] 1895 | }, 1896 | { 1897 | "cell_type": "markdown", 1898 | "metadata": {}, 1899 | "source": [ 1900 | "### Save the predictions to text file" 1901 | ] 1902 | }, 1903 | { 1904 | "cell_type": "code", 1905 | "execution_count": 55, 1906 | "metadata": { 1907 | "collapsed": true 1908 | }, 1909 | "outputs": [], 1910 | "source": [ 1911 | "with codecs.open('./pred.txt', 'w', 'utf-8') as f:\n", 1912 | "    for text in out_texts:\n", 1913 | "        f.write(text + '\\n')" 1914 | ] 1915 | }, 1916 | { 1917 | "cell_type": "markdown", 1918 | "metadata": {}, 1919 | "source": [ 1920 | "### Evaluate with GLEU metric\n", 1921 | "If you're working with the grammatical error correction (GEC) corpus (jfleg),\n", 1922 | "it has an evaluation script specifically for the GEC task:\n", 1923 | "\n", 1924 | "Run:\n", 1925 | "```\n", 1926 | "python jfleg/eval/gleu.py \\\n", 1927 | "-s jfleg/test/test.src \\\n", 1928 | "-r jfleg/test/test.ref[0-3] \\\n", 1929 | "--hyp ./pred.txt\n", 1930 | "```\n", 1931 | "\n", 1932 | "Output (GLEU score, std, confidence interval):\n", 1933 | "Note: OpenNMT-py achieves a higher GLEU score of ~0.49 with the same model settings.\n", 1934 | "TODO: Try to optimize the code.\n", 1935 | "```\n", 1936 | "Running GLEU...\n", 1937 | "./pred.txt\n", 1938 | "[['0.451747', '0.007620', '(0.437,0.467)']]\n", 1939 | "```" 1940 | ] 1941 | }, 1942 | { 1943 | "cell_type": "markdown", 1944 | "metadata": {}, 1945 | "source": [ 1946 | "### Notes:\n", 1947 | "- Setting `MAX_LENGTH` for the training sequences is important to prevent OOM.\n", 1948 | "  - This affects: `decoder_outputs = Variable(torch.zeros(max_tgt_len, batch_size, decoder.vocab_size))`\n", 1949 | "- Do not call `next(iter(data_loader))` in the training for-loop; it can be very slow.\n", 1950 | "- When computing `num_corrects`, cast the `ByteTensor` with `.float()` before calling `.sum()`, otherwise the result will overflow. Ref: https://discuss.pytorch.org/t/batch-size-and-validation-accuracy/4066/3\n", 1951 | "- Crucial for GPU memory usage: don't set `MAX_LENGTH` to `max(tgt_lens)`. 
A varying tensor size could cause the GPU to allocate new memory, so we use a fixed tensor size instead: `decoder_outputs = Variable(torch.zeros(**MAX_LENGTH**, batch_size, decoder.vocab_size))`\n", 1952 | "- If you only want a `Variable`'s data for some operation, for example `sum()`, use `Variable(...).data.sum()` instead of `Variable(...).sum().data[0]`. The latter creates a new computational graph, and doing this in a for-loop can keep increasing memory.\n", 1953 | "- Be careful not to misuse `Variable`.\n", 1954 | "- `detach` the RNN's hidden states, or memory usage might grow when doing backprop.\n", 1955 | "- If you restart but GPU memory is not released, kill all python processes: `>> ps x |grep python|awk '{print $1}'|xargs kill`\n", 1956 | "- The decoder forward pass is time-consuming (for-loop).\n", 1957 | "- Calling `backward()` frees memory: https://discuss.pytorch.org/t/calling-loss-backward-reduce-memory-usage/2735\n", 1958 | "\n", 1959 | "### Try to:\n", 1960 | "- Implement scheduled sampling for training (see the sketch at the end of this document).\n", 1961 | "- Implement beam search for evaluation and translation.\n", 1962 | "- Understand and interpret parameter visualizations on Tensorboard.\n", 1963 | "- Implement more RNN optimization and regularization tricks:\n", 1964 | "    - Set `max_seq_len` to prevent RNN OOM\n", 1965 | "    - Xavier initializer\n", 1966 | "    - Weight normalization and layer normalization: https://github.com/pytorch/pytorch/issues/1601\n", 1967 | "    - Embedding dropout\n", 1968 | "    - Weight dropping\n", 1969 | "    - Variational dropout: [part1](https://becominghuman.ai/learning-note-dropout-in-recurrent-networks-part-1-57a9c19a2307), [part2](https://towardsdatascience.com/learning-note-dropout-in-recurrent-networks-part-2-f209222481f8), [part3](https://towardsdatascience.com/learning-note-dropout-in-recurrent-networks-part-3-1b161d030cd4)\n", 1970 | "    - Zoneout\n", 1971 | "    - Fraternal dropout\n", 1972 | "    - Activation regularization (AR), and temporal activation regularization (TAR)\n", 1973 | "    - Read more: [Regularizing and Optimizing LSTM Language Models](https://arxiv.org/pdf/1708.02182.pdf)\n" 1974 | ] 1975 | }, 1976 | { 1977 | "cell_type": "code", 1978 | "execution_count": null, 1979 | "metadata": { 1980 | "collapsed": true 1981 | }, 1982 | "outputs": [], 1983 | "source": [] 1984 | } 1985 | ], 1986 | "metadata": { 1987 | "kernelspec": { 1988 | "display_name": "Python 3", 1989 | "language": "python", 1990 | "name": "python3" 1991 | }, 1992 | "language_info": { 1993 | "codemirror_mode": { 1994 | "name": "ipython", 1995 | "version": 3 1996 | }, 1997 | "file_extension": ".py", 1998 | "mimetype": "text/x-python", 1999 | "name": "python", 2000 | "nbconvert_exporter": "python", 2001 | "pygments_lexer": "ipython3", 2002 | "version": "3.6.2" 2003 | } 2004 | }, 2005 | "nbformat": 4, 2006 | "nbformat_minor": 2 2007 | } 2008 | -------------------------------------------------------------------------------- /tensorboard-logs/.gitignore: -------------------------------------------------------------------------------- 1 | * 2 | !.gitignore 3 | --------------------------------------------------------------------------------
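
The notebook's "Try to" list mentions scheduled sampling. As a rough illustration only (not part of this codebase), the loop below sketches what a scheduled-sampling decoder pass could look like, assuming the decoder call signature used in `evaluate()` and `translate()` above (`decoder(input_seq, decoder_hidden, encoder_outputs, src_lens)`). The function name `decode_with_scheduled_sampling`, the `bos_id` argument, and the fixed `sampling_prob` are hypothetical, and the current tensor API is used rather than the notebook's `Variable` wrappers.

```python
import random

import torch


def decode_with_scheduled_sampling(decoder, decoder_hidden, encoder_outputs,
                                   src_lens, tgt_seqs, bos_id, sampling_prob):
    """At each time step, feed back either the gold token (teacher forcing)
    or the model's own greedy prediction, choosing the model prediction
    with probability `sampling_prob`."""
    max_tgt_len, batch_size = tgt_seqs.size(0), tgt_seqs.size(1)
    # First decoder input is the BOS token for every sequence in the batch.
    input_seq = torch.full((batch_size,), bos_id,
                           dtype=torch.long, device=tgt_seqs.device)
    decoder_outputs = []
    for t in range(max_tgt_len):
        decoder_output, decoder_hidden, _ = decoder(
            input_seq, decoder_hidden, encoder_outputs, src_lens)
        decoder_outputs.append(decoder_output)
        if random.random() < sampling_prob:
            # Feed back the model's own prediction (greedy argmax over vocab).
            input_seq = decoder_output.argmax(dim=1)
        else:
            # Teacher forcing: feed back the gold target token for step t.
            input_seq = tgt_seqs[t]
    # (max_tgt_len, batch_size, vocab_size)
    return torch.stack(decoder_outputs), decoder_hidden
```

In practice, `sampling_prob` is usually annealed from 0 toward some maximum over the course of training, so early epochs stay close to pure teacher forcing and the model is only gradually exposed to its own predictions.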