├── .gitignore
├── LICENCE
├── README.md
├── data
│   └── ptb
│       ├── test.txt
│       ├── train.txt
│       └── valid.txt
├── evaluate.lua
├── get_data.sh
├── main.lua
├── model
│   ├── HighwayMLP.lua
│   ├── LSTMTDNN.lua
│   └── TDNN.lua
├── run_models.sh
└── util
    ├── BatchLoaderUnk.lua
    ├── Diag.lua
    ├── HLogSoftMax.lua
    ├── HSMClass.lua
    ├── OneHot.lua
    ├── OuterProd.lua
    ├── TensorProd.lua
    ├── misc.lua
    └── model_utils.lua

/.gitignore:
--------------------------------------------------------------------------------
1 | *.t7
2 | *.out
3 | *.err
4 | *.txt
5 | *.zip
6 | *.tsv
--------------------------------------------------------------------------------
/LICENCE:
--------------------------------------------------------------------------------
1 | The MIT License (MIT)
2 | 
3 | Copyright (c) <2015>
4 | 
5 | Permission is hereby granted, free of charge, to any person obtaining a copy
6 | of this software and associated documentation files (the "Software"), to deal
7 | in the Software without restriction, including without limitation the rights
8 | to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
9 | copies of the Software, and to permit persons to whom the Software is
10 | furnished to do so, subject to the following conditions:
11 | 
12 | The above copyright notice and this permission notice shall be included in
13 | all copies or substantial portions of the Software.
14 | 
15 | THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
16 | IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
17 | FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
18 | AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
19 | LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
20 | OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN
21 | THE SOFTWARE.
22 | 
--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
1 | ## Character-Aware Neural Language Models
2 | Code for the paper [Character-Aware Neural Language Models](http://arxiv.org/abs/1508.06615)
3 | (AAAI 2016).
4 | 
5 | A neural language model (NLM) built on character inputs only. Predictions
6 | are still made at the word-level. The model employs a convolutional neural network (CNN)
7 | over characters, whose output is used as the input to a long short-term memory (LSTM)
8 | recurrent neural network language model (RNN-LM). The output from the CNN can
9 | optionally be passed through a [Highway Network](http://arxiv.org/abs/1507.06228),
10 | which improves performance.
11 | 
12 | Much of the base code is from
13 | [Andrej Karpathy's excellent character RNN implementation](https://github.com/karpathy/char-rnn).
14 | 
15 | ### Requirements
16 | Code is written in Lua and requires Torch. It also requires
17 | the `nngraph` and `luautf8` packages, which can be installed via:
18 | ```
19 | luarocks install nngraph
20 | luarocks install luautf8
21 | ```
22 | GPU usage will additionally require the `cutorch` and `cunn` packages:
23 | ```
24 | luarocks install cutorch
25 | luarocks install cunn
26 | ```
27 | 
28 | `cudnn` will result in a good (8x-10x) speed-up for convolutions, so it is
29 | highly recommended. This makes the training time of a character-level model
30 | roughly competitive with that of a word-level model (1500 tokens/sec vs. 3000 tokens/sec for
31 | the large character/word-level models described below).
32 | It can be installed via:
33 | ```
34 | git clone https://github.com/soumith/cudnn.torch.git
35 | cd cudnn.torch
36 | luarocks make cudnn-scm-1.rockspec
37 | ```
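Before training, you can sanity-check that the Lua dependencies load. This one-liner is our suggestion and not part of the original scripts (`th -e` runs a string in the Torch interpreter; note that `luautf8` is required under the module name `lua-utf8`):
```
th -e "require 'nngraph'; require 'lua-utf8'; print('dependencies OK')"
```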
38 | ### Data
39 | Data should be put into the `data/` directory, split into `train.txt`,
40 | `valid.txt`, and `test.txt`.
41 | 
42 | Each line of the .txt file should be a sentence. The English Penn
43 | Treebank (PTB) data (Tomas Mikolov's pre-processed version with vocab size equal to 10K,
44 | widely used by the language modeling community) is given as the default.
45 | 
46 | The paper also runs the models on non-English data (Czech, French, German, Russian, and Spanish), from the ICML 2014
47 | paper [Compositional Morphology for Word Representations and Language Modelling](http://arxiv.org/abs/1405.4273)
48 | by Jan Botha and Phil Blunsom. This can be downloaded from [Jan's website](https://bothameister.github.io).
49 | 
50 | For ease of use, we provide a script to download the non-English data (`get_data.sh`).
51 | The script also saves the downloaded data into the relevant folders.
52 | 
53 | #### Note on PTB
54 | The PTB data above does not have end-of-sentence tokens for each sentence, and hence these must be
55 | manually appended. This can be done by adding `-EOS '+'` to the script (obviously you
56 | can use characters other than `+` to represent an end-of-sentence token---we recommend a single
57 | unused character).
58 | 
59 | The non-English data already have end-of-sentence tokens for each line, so you should add
60 | `-EOS ''` to the command line.
61 | 
62 | #### Unicode in Lua
63 | Lua is unicode-agnostic (each string is just a sequence of bytes), so we use
64 | the `luautf8` package to deal with languages where a character can be more than one byte
65 | (e.g. Russian). Many thanks to [vseledkin](https://github.com/vseledkin) for alerting us
66 | to the fact that previous versions of the code did not take this into account!
67 | 
68 | ### Model
69 | Here are some example scripts. Add `-gpuid 0` to each line to use a GPU (which is
70 | required to get any reasonable speed with the CNN), and `-cudnn 1` to use the
71 | cudnn package. Scripts to reproduce the results of the paper can be found under `run_models.sh`.
72 | 
73 | #### Character-level models
74 | Large character-level model (LSTM-CharCNN-Large in the paper).
75 | This is the default: it should get ~82 on valid and ~79 on test. Takes ~5 hours with `cudnn`.
76 | ```
77 | th main.lua -savefile char-large -EOS '+'
78 | ```
79 | Small character-level model (LSTM-CharCNN-Small in the paper).
80 | This should get ~96 on valid and ~93 on test. Takes ~2 hours with `cudnn`.
81 | ```
82 | th main.lua -savefile char-small -rnn_size 300 -highway_layers 1 \
83 | -kernels '{1,2,3,4,5,6}' -feature_maps '{25,50,75,100,125,150}' -EOS '+'
84 | ```
85 | 
86 | #### Word-level models
87 | Large word-level model (LSTM-Word-Large in the paper).
88 | This should get ~89 on valid and ~85 on test.
89 | ```
90 | th main.lua -savefile word-large -word_vec_size 650 -highway_layers 0 \
91 | -use_chars 0 -use_words 1 -EOS '+'
92 | ```
93 | Small word-level model (LSTM-Word-Small in the paper).
94 | This should get ~101 on valid and ~98 on test.
95 | ```
96 | th main.lua -savefile word-small -word_vec_size 200 -highway_layers 0 \
97 | -use_chars 0 -use_words 1 -rnn_size 200 -EOS '+'
98 | ```
99 | 
100 | #### Combining both
101 | Note that if `-use_chars` and `-use_words` are both set to 1, the model
102 | will concatenate the output from the CNN with the word embedding. We've
103 | found this model to underperform a purely character-level model, though.
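If you nonetheless want to try it, both flags can simply be set together. This command is illustrative (it is not one of the paper's configurations):
```
th main.lua -savefile char-word -use_chars 1 -use_words 1 -EOS '+'
```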
104 | 
105 | ### Evaluation
106 | By default `main.lua` will evaluate the model on test data after training,
107 | but this will use the last epoch's model, and will also be slow due to
108 | the way the data is set up.
109 | 
110 | Evaluation on test can be performed via the following script:
111 | ```
112 | th evaluate.lua -model model_file.t7 -data_dir data/ptb -savefile model_results.t7
113 | ```
114 | where `model_file.t7` is the path to the best-performing (on validation) model.
115 | This will also save some basic statistics (e.g. perplexity by token) in
116 | `model_results.t7`.
117 | 
118 | ### Hierarchical Softmax
119 | Training on a larger vocabulary (e.g. 100K+) will require hierarchical softmax (HSM)
120 | to train at a reasonable speed. You can use the `-hsm` option to do this.
121 | For example `-hsm 500` will randomly split the vocabulary into 500 clusters of
122 | (approximately) equal size. `-hsm 0` is the default and will not use HSM.
123 | `-hsm -1` will automatically choose the number of clusters for you, by choosing the integer
124 | closest to sqrt(|V|).
125 | 
126 | ### Batch Size
127 | If training on bigger datasets you should probably use a
128 | larger batch size (e.g. `-batch_size 100`).
129 | 
130 | ### Licence
131 | MIT
132 | 
133 | 
134 | 
135 | 
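The saved results file can then be inspected from the Torch REPL. A hypothetical session (the field names are the ones `evaluate.lua` saves, and `vocab[1]` is the idx2word table):
```
results = torch.load('model_results.t7')
print(results.perp)  -- test-set perplexity
-- average loss for the ten most frequent tokens
local _, idx = torch.sort(results.token_count, 1, true)
for i = 1, 10 do
    local w = idx[i]
    print(results.vocab[1][w], results.token_loss[w] / math.max(results.token_count[w], 1))
end
```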
--------------------------------------------------------------------------------
/evaluate.lua:
--------------------------------------------------------------------------------
1 | --[[
2 | Evaluates a trained model
3 | 
4 | Much of the code is borrowed from the following implementations
5 | https://github.com/karpathy/char-rnn
6 | https://github.com/wojzaremba/lstm
7 | ]]--
8 | 
9 | require 'torch'
10 | require 'nn'
11 | require 'nngraph'
12 | require 'optim'
13 | require 'lfs'
14 | require 'util.misc'
15 | require 'util.HLogSoftMax'
16 | 
17 | HSMClass = require 'util.HSMClass'
18 | BatchLoader = require 'util.BatchLoaderUnk'
19 | model_utils = require 'util.model_utils'
20 | 
21 | local stringx = require('pl.stringx')
22 | 
23 | cmd = torch.CmdLine()
24 | cmd:text('Options')
25 | -- data
26 | cmd:option('-data_dir','data/ptb','data directory. Should contain train.txt/valid.txt/test.txt with input data')
27 | cmd:option('-savefile', 'model_results.t7', 'save results to')
28 | cmd:option('-model', 'en-large-word-model.t7', 'model checkpoint file')
29 | -- GPU/CPU: these params must be passed in because they affect the constructors
30 | cmd:option('-gpuid', -1,'which gpu to use. -1 = use CPU')
31 | cmd:option('-cudnn', 0,'use cudnn (1 = yes, 0 = no)')
32 | 
33 | cmd:text()
34 | 
35 | -- parse input params
36 | opt2 = cmd:parse(arg)
37 | if opt2.gpuid >= 0 then
38 |     print('using CUDA on GPU ' .. opt2.gpuid .. '...')
39 |     require 'cutorch'
40 |     require 'cunn'
41 |     cutorch.setDevice(opt2.gpuid + 1)
42 | end
43 | 
44 | if opt2.cudnn == 1 then
45 |     assert(opt2.gpuid >= 0, 'GPU must be used if using cudnn')
46 |     print('using cudnn')
47 |     require 'cudnn'
48 | end
49 | 
50 | HighwayMLP = require 'model.HighwayMLP'
51 | TDNN = require 'model.TDNN'
52 | LSTMTDNN = require 'model.LSTMTDNN'
53 | 
54 | checkpoint = torch.load(opt2.model)
55 | opt = checkpoint.opt
56 | protos = checkpoint.protos
57 | print('opt: ')
58 | print(opt)
59 | print('val_losses: ')
60 | print(checkpoint.val_losses)
61 | idx2word, word2idx, idx2char, char2idx = table.unpack(checkpoint.vocab)
62 | 
63 | -- recreate the data loader class
64 | loader = BatchLoader.create(opt.data_dir, opt.batch_size, opt.seq_length, opt.max_word_l) -- create takes four arguments
65 | print('Word vocab size: ' .. #loader.idx2word .. ', Char vocab size: ' .. #loader.idx2char
66 |     .. ', Max word length (incl. padding): ', loader.max_word_l)
67 | 
68 | -- the initial state of the cell/hidden states
69 | init_state = {}
70 | for L=1,opt.num_layers do
71 |     local h_init = torch.zeros(2, opt.rnn_size)
72 |     if opt.gpuid >=0 then h_init = h_init:cuda() end
73 |     table.insert(init_state, h_init:clone())
74 |     table.insert(init_state, h_init:clone())
75 | end
76 | 
77 | -- ship the model to the GPU if desired
78 | if opt.gpuid >= 0 then
79 |     for k,v in pairs(protos) do v:cuda() end
80 | end
81 | 
82 | params, grad_params = model_utils.combine_all_parameters(protos.rnn)
83 | if opt.hsm > 0 then
84 |     hsm_params, hsm_grad_params = model_utils.combine_all_parameters(protos.criterion)
85 |     print('number of parameters in the model: ' .. params:nElement() + hsm_params:nElement())
86 | else
87 |     print('number of parameters in the model: ' .. params:nElement())
88 | end
89 | 
90 | -- for easy switch between using words/chars (or both)
91 | function get_input(x, x_char, t, prev_states)
92 |     local u = {}
93 |     if opt.use_chars == 1 then
94 |         table.insert(u, x_char[{{1,2},t}])
95 |     end
96 |     if opt.use_words == 1 then
97 |         table.insert(u, x[{{1,2},t}])
98 |     end
99 |     for i = 1, #prev_states do table.insert(u, prev_states[i]) end
100 |     return u
101 | end
102 | 
103 | -- evaluate the loss over an entire split
104 | function eval_split_full(split_idx)
105 |     print('evaluating loss over split index ' .. 
split_idx) 106 | if opt.hsm > 0 then 107 | protos.criterion:change_bias() 108 | end 109 | local n = loader.split_sizes[split_idx] 110 | loader:reset_batch_pointer(split_idx) -- move batch iteration pointer for this split to front 111 | local loss = 0 112 | local token_count = torch.zeros(#idx2word) 113 | local token_loss = torch.zeros(#idx2word) 114 | local rnn_state = {[0] = init_state} 115 | local x, y, x_char = loader:next_batch(split_idx) 116 | if opt.gpuid >= 0 then 117 | x = x:float():cuda() 118 | y = y:float():cuda() 119 | x_char = x_char:float():cuda() 120 | end 121 | protos.rnn:evaluate() 122 | for t = 1, x:size(2) do 123 | local lst = protos.rnn:forward(get_input(x, x_char, t, rnn_state[0])) 124 | rnn_state[0] = {} 125 | for i=1,#init_state do table.insert(rnn_state[0], lst[i]) end 126 | prediction = lst[#lst] 127 | local singleton_loss = protos.criterion:forward(prediction, y[{{1,2},t}]) 128 | loss = loss + singleton_loss 129 | local token_idx = x[1][t] 130 | token_count[token_idx] = token_count[token_idx] + 1 131 | token_loss[token_idx] = token_loss[token_idx] + singleton_loss 132 | end 133 | loss = loss / x:size(2) 134 | local total_perp = torch.exp(loss) 135 | return total_perp, token_loss:float(), token_count:float() 136 | end 137 | 138 | total_perp, token_loss, token_count = eval_split_full(3) 139 | print(total_perp) 140 | test_results = {} 141 | test_results.perp = total_perp 142 | test_results.token_loss = token_loss 143 | test_results.token_count = token_count 144 | test_results.vocab = {idx2word, word2idx, idx2char, char2idx} 145 | test_results.opt = opt 146 | test_results.val_losses = checkpoint.val_losses 147 | torch.save(opt2.savefile, test_results) 148 | collectgarbage() 149 | -------------------------------------------------------------------------------- /get_data.sh: -------------------------------------------------------------------------------- 1 | #!/bin/bash 2 | 3 | #download non-english data 4 | wget https://github.com/bothameister/bothameister.github.io/raw/master/icml14-data.tar.bz2 5 | tar xf icml14-data.tar.bz2 6 | 7 | mkdir data/de/ 8 | cp en-de/1m-mono/train.in data/de/train.txt 9 | cp en-de/1m-mono/test.in data/de/valid.txt 10 | cp en-de/1m-mono/finaltest.in data/de/test.txt 11 | 12 | mkdir data/es/ 13 | cp en-es/1m-mono/train.in data/es/train.txt 14 | cp en-es/1m-mono/test.in data/es/valid.txt 15 | cp en-es/1m-mono/finaltest.in data/es/test.txt 16 | 17 | mkdir data/cs/ 18 | cp en-cs/1m-mono/train.in data/cs/train.txt 19 | cp en-cs/1m-mono/test.in data/cs/valid.txt 20 | cp en-cs/1m-mono/finaltest.in data/cs/test.txt 21 | 22 | mkdir data/fr/ 23 | cp en-fr/1m-mono/train.in data/fr/train.txt 24 | cp en-fr/1m-mono/test.in data/fr/valid.txt 25 | cp en-fr/1m-mono/finaltest.in data/fr/test.txt 26 | 27 | mkdir data/ru/ 28 | cp en-ru/1m-mono/train.in data/ru/train.txt 29 | cp en-ru/1m-mono/test.in data/ru/valid.txt 30 | cp en-ru/1m-mono/finaltest.in data/ru/test.txt 31 | -------------------------------------------------------------------------------- /main.lua: -------------------------------------------------------------------------------- 1 | --[[ 2 | Trains a word-level or character-level (for inputs) lstm language model 3 | Predictions are still made at the word-level. 
4 | 
5 | Much of the code is borrowed from the following implementations
6 | https://github.com/karpathy/char-rnn
7 | https://github.com/wojzaremba/lstm
8 | ]]--
9 | 
10 | require 'torch'
11 | require 'nn'
12 | require 'nngraph'
13 | require 'lfs'
14 | require 'util.misc'
15 | 
16 | BatchLoader = require 'util.BatchLoaderUnk'
17 | model_utils = require 'util.model_utils'
18 | 
19 | local stringx = require('pl.stringx')
20 | 
21 | cmd = torch.CmdLine()
22 | cmd:text()
23 | cmd:text('Train a word+character-level language model')
24 | cmd:text()
25 | cmd:text('Options')
26 | -- data
27 | cmd:option('-data_dir','data/ptb','data directory. Should contain train.txt/valid.txt/test.txt with input data')
28 | -- model params
29 | cmd:option('-rnn_size', 650, 'size of LSTM internal state')
30 | cmd:option('-use_words', 0, 'use words (1=yes)')
31 | cmd:option('-use_chars', 1, 'use characters (1=yes)')
32 | cmd:option('-highway_layers', 2, 'number of highway layers')
33 | cmd:option('-word_vec_size', 650, 'dimensionality of word embeddings')
34 | cmd:option('-char_vec_size', 15, 'dimensionality of character embeddings')
35 | cmd:option('-feature_maps', '{50,100,150,200,200,200,200}', 'number of feature maps in the CNN')
36 | cmd:option('-kernels', '{1,2,3,4,5,6,7}', 'conv net kernel widths')
37 | cmd:option('-num_layers', 2, 'number of layers in the LSTM')
38 | cmd:option('-dropout',0.5,'dropout. 0 = no dropout')
39 | -- optimization
40 | cmd:option('-hsm',0,'number of clusters to use for hsm. 0 = normal softmax, -1 = use sqrt(|V|)')
41 | cmd:option('-learning_rate',1,'starting learning rate')
42 | cmd:option('-learning_rate_decay',0.5,'learning rate decay')
43 | cmd:option('-decay_when',1,'decay if validation perplexity does not improve by more than this much')
44 | cmd:option('-param_init', 0.05, 'initialize parameters uniformly in [-param_init, param_init]')
45 | cmd:option('-batch_norm', 0, 'use batch normalization over input embeddings (1=yes)')
46 | cmd:option('-seq_length',35,'number of timesteps to unroll for')
47 | cmd:option('-batch_size',20,'number of sequences to train on in parallel')
48 | cmd:option('-max_epochs',25,'number of full passes through the training data')
49 | cmd:option('-max_grad_norm',5,'renormalize gradients to have at most this norm')
50 | cmd:option('-max_word_l',65,'maximum word length')
51 | -- bookkeeping
52 | cmd:option('-seed',3435,'torch manual random number generator seed')
53 | cmd:option('-print_every',500,'how many steps/minibatches between printing out the loss')
54 | cmd:option('-save_every', 5, 'save every n epochs')
55 | cmd:option('-checkpoint_dir', 'cv', 'output directory where checkpoints get written')
56 | cmd:option('-savefile','char','filename to autosave the checkpoint to. Will be inside checkpoint_dir/')
57 | cmd:option('-EOS', '+', '<EOS> symbol. should be a single unused character (like +) for PTB and blank for others')
58 | cmd:option('-time', 0, 'print batch times')
59 | -- GPU/CPU
60 | cmd:option('-gpuid', -1,'which gpu to use. -1 = use CPU')
61 | cmd:option('-cudnn', 0,'use cudnn (1=yes). this should greatly speed up convolutions')
62 | cmd:text()
63 | 
64 | -- parse input params
65 | opt = cmd:parse(arg)
66 | torch.manualSeed(opt.seed)
67 | 
68 | assert(opt.use_words == 1 or opt.use_words == 0, '-use_words has to be 0 or 1')
69 | assert(opt.use_chars == 1 or opt.use_chars == 0, '-use_chars has to be 0 or 1')
70 | assert((opt.use_chars + opt.use_words) > 0, 'has to use at least one of words or chars')
71 | 
72 | -- some housekeeping
73 | loadstring('opt.kernels = ' .. 
opt.kernels)() -- get kernel sizes 74 | loadstring('opt.feature_maps = ' .. opt.feature_maps)() -- get feature map sizes 75 | 76 | -- global constants for certain tokens 77 | opt.tokens = {} 78 | opt.tokens.EOS = opt.EOS 79 | opt.tokens.UNK = '|' -- unk word token 80 | opt.tokens.START = '{' -- start-of-word token 81 | opt.tokens.END = '}' -- end-of-word token 82 | opt.tokens.ZEROPAD = ' ' -- zero-pad token 83 | 84 | -- load necessary packages depending on config options 85 | if opt.gpuid >= 0 then 86 | print('using CUDA on GPU ' .. opt.gpuid .. '...') 87 | require 'cutorch' 88 | require 'cunn' 89 | cutorch.setDevice(opt.gpuid + 1) 90 | end 91 | 92 | if opt.cudnn == 1 then 93 | assert(opt.gpuid >= 0, 'GPU must be used if using cudnn') 94 | print('using cudnn...') 95 | require 'cudnn' 96 | end 97 | 98 | -- create the data loader class 99 | loader = BatchLoader.create(opt.data_dir, opt.batch_size, opt.seq_length, opt.max_word_l) 100 | print('Word vocab size: ' .. #loader.idx2word .. ', Char vocab size: ' .. #loader.idx2char 101 | .. ', Max word length (incl. padding): ', loader.max_word_l) 102 | opt.max_word_l = loader.max_word_l 103 | 104 | -- if number of clusters is not explicitly provided 105 | if opt.hsm == -1 then 106 | opt.hsm = torch.round(torch.sqrt(#loader.idx2word)) 107 | end 108 | 109 | if opt.hsm > 0 then 110 | -- partition into opt.hsm clusters 111 | -- we want roughly equal number of words in each cluster 112 | HSMClass = require 'util.HSMClass' 113 | require 'util.HLogSoftMax' 114 | mapping = torch.LongTensor(#loader.idx2word, 2):zero() 115 | local n_in_each_cluster = #loader.idx2word / opt.hsm 116 | local _, idx = torch.sort(torch.randn(#loader.idx2word), 1, true) 117 | local n_in_cluster = {} --number of tokens in each cluster 118 | local c = 1 119 | for i = 1, idx:size(1) do 120 | local word_idx = idx[i] 121 | if n_in_cluster[c] == nil then 122 | n_in_cluster[c] = 1 123 | else 124 | n_in_cluster[c] = n_in_cluster[c] + 1 125 | end 126 | mapping[word_idx][1] = c 127 | mapping[word_idx][2] = n_in_cluster[c] 128 | if n_in_cluster[c] >= n_in_each_cluster then 129 | c = c+1 130 | end 131 | if c > opt.hsm then --take care of some corner cases 132 | c = opt.hsm 133 | end 134 | end 135 | print(string.format('using hierarchical softmax with %d classes', opt.hsm)) 136 | end 137 | 138 | 139 | -- load model objects. we do this here because of cudnn and hsm options 140 | TDNN = require 'model.TDNN' 141 | LSTMTDNN = require 'model.LSTMTDNN' 142 | HighwayMLP = require 'model.HighwayMLP' 143 | 144 | -- make sure output directory exists 145 | if not path.exists(opt.checkpoint_dir) then lfs.mkdir(opt.checkpoint_dir) end 146 | 147 | -- define the model: prototypes for one timestep, then clone them in time 148 | protos = {} 149 | print('creating an LSTM-CNN with ' .. opt.num_layers .. 
' layers')
150 | protos.rnn = LSTMTDNN.lstmtdnn(opt.rnn_size, opt.num_layers, opt.dropout, #loader.idx2word,
151 |                 opt.word_vec_size, #loader.idx2char, opt.char_vec_size, opt.feature_maps,
152 |                 opt.kernels, loader.max_word_l, opt.use_words, opt.use_chars,
153 |                 opt.batch_norm, opt.highway_layers, opt.hsm)
154 | -- training criterion (negative log likelihood)
155 | if opt.hsm > 0 then
156 |     protos.criterion = nn.HLogSoftMax(mapping, opt.rnn_size)
157 | else
158 |     protos.criterion = nn.ClassNLLCriterion()
159 | end
160 | 
161 | -- the initial state of the cell/hidden states
162 | init_state = {}
163 | for L=1,opt.num_layers do
164 |     local h_init = torch.zeros(opt.batch_size, opt.rnn_size)
165 |     if opt.gpuid >=0 then h_init = h_init:cuda() end
166 |     table.insert(init_state, h_init:clone())
167 |     table.insert(init_state, h_init:clone())
168 | end
169 | 
170 | -- ship the model to the GPU if desired
171 | if opt.gpuid >= 0 then
172 |     for k,v in pairs(protos) do v:cuda() end
173 | end
174 | 
175 | -- put the above things into one flattened parameters tensor
176 | params, grad_params = model_utils.combine_all_parameters(protos.rnn)
177 | -- hsm has its own params
178 | if opt.hsm > 0 then
179 |     hsm_params, hsm_grad_params = protos.criterion:getParameters()
180 |     hsm_params:uniform(-opt.param_init, opt.param_init)
181 |     print('number of parameters in the model: ' .. params:nElement() + hsm_params:nElement())
182 | else
183 |     print('number of parameters in the model: ' .. params:nElement())
184 | end
185 | 
186 | -- initialization
187 | params:uniform(-opt.param_init, opt.param_init) -- small numbers uniform
188 | 
189 | -- get layers which will be referenced later (during SGD or introspection)
190 | function get_layer(layer)
191 |     local tn = torch.typename(layer)
192 |     if layer.name ~= nil then
193 |         if layer.name == 'word_vecs' then
194 |             word_vecs = layer
195 |         elseif layer.name == 'char_vecs' then
196 |             char_vecs = layer
197 |         elseif layer.name == 'cnn' then
198 |             cnn = layer
199 |         end
200 |     end
201 | end
202 | protos.rnn:apply(get_layer)
203 | 
204 | -- make a bunch of clones after flattening, as that reallocates memory
205 | -- not really sure how this part works
206 | clones = {}
207 | for name,proto in pairs(protos) do
208 |     print('cloning ' .. name)
209 |     clones[name] = model_utils.clone_many_times(proto, opt.seq_length, not proto.parameters)
210 | end
211 | 
212 | -- for easy switch between using words/chars (or both)
213 | function get_input(x, x_char, t, prev_states)
214 |     local u = {}
215 |     if opt.use_chars == 1 then table.insert(u, x_char[{{},t}]) end
216 |     if opt.use_words == 1 then table.insert(u, x[{{},t}]) end
217 |     for i = 1, #prev_states do table.insert(u, prev_states[i]) end
218 |     return u
219 | end
220 | 
221 | 
222 | -- evaluate the loss over an entire split
223 | function eval_split(split_idx, max_batches)
224 |     print('evaluating loss over split index ' .. split_idx)
225 |     local n = loader.split_sizes[split_idx]
226 |     if opt.hsm > 0 then
227 |         protos.criterion:change_bias()
228 |     end
229 | 
230 |     if max_batches ~= nil then n = math.min(max_batches, n) end
231 | 
232 |     loader:reset_batch_pointer(split_idx) -- move batch iteration pointer for this split to front
233 |     local loss = 0
234 |     local rnn_state = {[0] = init_state}
235 |     if split_idx<=2 then -- batch eval
236 |         for i = 1,n do -- iterate over batches in the split
237 |             -- fetch a batch
238 |             local x, y, x_char = loader:next_batch(split_idx)
239 |             if opt.gpuid >= 0 then -- ship the input arrays to GPU
240 |                 -- have to convert to float because integers can't be cuda()'d
241 |                 x = x:float():cuda()
242 |                 y = y:float():cuda()
243 |                 x_char = x_char:float():cuda()
244 |             end
245 |             -- forward pass
246 |             for t=1,opt.seq_length do
247 |                 clones.rnn[t]:evaluate() -- for dropout proper functioning
248 |                 local lst = clones.rnn[t]:forward(get_input(x, x_char, t, rnn_state[t-1]))
249 |                 rnn_state[t] = {}
250 |                 for i=1,#init_state do
251 |                     table.insert(rnn_state[t], lst[i])
252 |                 end
253 |                 prediction = lst[#lst]
254 |                 loss = loss + clones.criterion[t]:forward(prediction, y[{{}, t}])
255 |             end
256 |             -- carry over lstm state
257 |             rnn_state[0] = rnn_state[#rnn_state]
258 |         end
259 |         loss = loss / opt.seq_length / n
260 |     else -- full eval on test set
261 |         token_perp = torch.zeros(#loader.idx2word, 2) -- intentionally not local: returned below, outside this branch
262 |         local x, y, x_char = loader:next_batch(split_idx)
263 |         if opt.gpuid >= 0 then -- ship the input arrays to GPU
264 |             -- have to convert to float because integers can't be cuda()'d
265 |             x = x:float():cuda()
266 |             y = y:float():cuda()
267 |             x_char = x_char:float():cuda()
268 |         end
269 |         protos.rnn:evaluate() -- just need one clone
270 |         for t = 1, x:size(2) do
271 |             local lst = protos.rnn:forward(get_input(x, x_char, t, rnn_state[0]))
272 |             rnn_state[0] = {}
273 |             for i=1,#init_state do table.insert(rnn_state[0], lst[i]) end
274 |             prediction = lst[#lst]
275 |             local tok_perp
276 |             tok_perp = protos.criterion:forward(prediction, y[{{},t}])
277 |             loss = loss + tok_perp
278 |             token_perp[y[1][t]][1] = token_perp[y[1][t]][1] + 1 --count
279 |             token_perp[y[1][t]][2] = token_perp[y[1][t]][2] + tok_perp
280 |         end
281 |         loss = loss / x:size(2)
282 |     end
283 |     local perp = torch.exp(loss)
284 |     return perp, token_perp
285 | end
286 | 
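-- (Editorial note, not in the original source.) eval_split returns perplexity,
-- i.e. the exponential of the average per-token negative log-likelihood.
-- Typical calls:
--   eval_split(2)      -- full pass over the validation split
--   eval_split(2, 100) -- validation loss capped at 100 batches
--   eval_split(3)      -- test split; also returns per-token counts/losses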
287 | -- do fwd/bwd and return loss, grad_params
288 | local init_state_global = clone_list(init_state)
289 | function feval(x)
290 |     if x ~= params then
291 |         params:copy(x)
292 |     end
293 |     grad_params:zero()
294 |     if opt.hsm > 0 then
295 |         hsm_grad_params:zero()
296 |     end
297 |     ------------------ get minibatch -------------------
298 |     local x, y, x_char = loader:next_batch(1) --from train
299 |     if opt.gpuid >= 0 then -- ship the input arrays to GPU
300 |         -- have to convert to float because integers can't be cuda()'d
301 |         x = x:float():cuda()
302 |         y = y:float():cuda()
303 |         x_char = x_char:float():cuda()
304 |     end
305 |     ------------------- forward pass -------------------
306 |     local rnn_state = {[0] = init_state_global}
307 |     local predictions = {} -- softmax outputs
308 |     local loss = 0
309 |     for t=1,opt.seq_length do
310 |         clones.rnn[t]:training() -- make sure we are in correct mode (this is cheap, sets flag)
311 |         local lst = clones.rnn[t]:forward(get_input(x, x_char, t, rnn_state[t-1]))
312 |         rnn_state[t] = {}
313 |         for i=1,#init_state do
314 |             table.insert(rnn_state[t], lst[i])
315 |         end -- extract the state, without output
316 |         predictions[t] = lst[#lst] -- last 
element is the prediction 317 | loss = loss + clones.criterion[t]:forward(predictions[t], y[{{}, t}]) 318 | end 319 | loss = loss / opt.seq_length 320 | ------------------ backward pass ------------------- 321 | -- initialize gradient at time t to be zeros (there's no influence from future) 322 | local drnn_state = {[opt.seq_length] = clone_list(init_state, true)} -- true also zeros the clones 323 | for t=opt.seq_length,1,-1 do 324 | -- backprop through loss, and softmax/linear 325 | local doutput_t = clones.criterion[t]:backward(predictions[t], y[{{}, t}]) 326 | table.insert(drnn_state[t], doutput_t) 327 | table.insert(rnn_state[t-1], drnn_state[t]) 328 | local dlst = clones.rnn[t]:backward(get_input(x, x_char, t, rnn_state[t-1]), drnn_state[t]) 329 | drnn_state[t-1] = {} 330 | local tmp = opt.use_words + opt.use_chars -- not the safest way but quick 331 | for k,v in pairs(dlst) do 332 | if k > tmp then -- k == 1 is gradient on x, which we dont need 333 | -- note we do k-1 because first item is dembeddings, and then follow the 334 | -- derivatives of the state, starting at index 2. I know... 335 | drnn_state[t-1][k-tmp] = v 336 | end 337 | end 338 | end 339 | 340 | ------------------------ misc ---------------------- 341 | -- transfer final state to initial state (BPTT) 342 | init_state_global = rnn_state[#rnn_state] -- NOTE: I don't think this needs to be a clone, right? 343 | 344 | -- renormalize gradients 345 | local grad_norm, shrink_factor 346 | if opt.hsm==0 then 347 | grad_norm = grad_params:norm() 348 | else 349 | grad_norm = torch.sqrt(grad_params:norm()^2 + hsm_grad_params:norm()^2) 350 | end 351 | if grad_norm > opt.max_grad_norm then 352 | shrink_factor = opt.max_grad_norm / grad_norm 353 | grad_params:mul(shrink_factor) 354 | if opt.hsm > 0 then 355 | hsm_grad_params:mul(shrink_factor) 356 | end 357 | end 358 | params:add(grad_params:mul(-lr)) -- update params 359 | if opt.hsm > 0 then 360 | hsm_params:add(hsm_grad_params:mul(-lr)) 361 | end 362 | return torch.exp(loss) 363 | end 364 | 365 | 366 | -- start optimization here 367 | train_losses = {} 368 | val_losses = {} 369 | lr = opt.learning_rate -- starting learning rate which will be decayed 370 | local iterations = opt.max_epochs * loader.split_sizes[1] 371 | if char_vecs ~= nil then char_vecs.weight[1]:zero() end -- zero-padding vector is always zero 372 | for i = 1, iterations do 373 | local epoch = i / loader.split_sizes[1] 374 | 375 | local timer = torch.Timer() 376 | local time = timer:time().real 377 | 378 | train_loss = feval(params) -- fwd/backprop and update params 379 | if char_vecs ~= nil then -- zero-padding vector is always zero 380 | char_vecs.weight[1]:zero() 381 | char_vecs.gradWeight[1]:zero() 382 | end 383 | train_losses[i] = train_loss 384 | 385 | -- every now and then or on last iteration 386 | if i % loader.split_sizes[1] == 0 then 387 | -- evaluate loss on validation data 388 | local val_loss = eval_split(2) -- 2 = validation 389 | val_losses[#val_losses+1] = val_loss 390 | local savefile = string.format('%s/lm_%s_epoch%.2f_%.2f.t7', opt.checkpoint_dir, opt.savefile, epoch, val_loss) 391 | local checkpoint = {} 392 | checkpoint.protos = protos 393 | checkpoint.opt = opt 394 | checkpoint.train_losses = train_losses 395 | checkpoint.val_loss = val_loss 396 | checkpoint.val_losses = val_losses 397 | checkpoint.i = i 398 | checkpoint.epoch = epoch 399 | checkpoint.vocab = {loader.idx2word, loader.word2idx, loader.idx2char, loader.char2idx} 400 | checkpoint.lr = lr 401 | print('saving checkpoint to ' .. 
savefile) 402 | if epoch == opt.max_epochs or epoch % opt.save_every == 0 then 403 | torch.save(savefile, checkpoint) 404 | end 405 | end 406 | 407 | -- decay learning rate after epoch 408 | if i % loader.split_sizes[1] == 0 and #val_losses > 2 then 409 | if val_losses[#val_losses-1] - val_losses[#val_losses] < opt.decay_when then 410 | lr = lr * opt.learning_rate_decay 411 | end 412 | end 413 | 414 | if i % opt.print_every == 0 then 415 | print(string.format("%d/%d (epoch %.2f), train_loss = %6.4f", i, iterations, epoch, train_loss)) 416 | end 417 | if i % 10 == 0 then collectgarbage() end 418 | if opt.time ~= 0 then 419 | print("Batch Time:", timer:time().real - time) 420 | end 421 | end 422 | 423 | --evaluate on full test set. this just uses the model from the last epoch 424 | --rather than best-performing model. it is also incredibly inefficient 425 | --because of batch size issues. for faster evaluation, use evaluate.lua, i.e. 426 | --th evaluate.lua -model m 427 | --where m is the path to the best-performing model 428 | 429 | test_perp, token_perp = eval_split(3) 430 | print('Perplexity on test set: ' .. test_perp) 431 | torch.save('token_perp-ss.t7', {token_perp, loader.idx2word}) 432 | 433 | -------------------------------------------------------------------------------- /model/HighwayMLP.lua: -------------------------------------------------------------------------------- 1 | local HighwayMLP = {} 2 | 3 | function HighwayMLP.mlp(size, num_layers, bias, f) 4 | -- size = dimensionality of inputs 5 | -- num_layers = number of hidden layers (default = 1) 6 | -- bias = bias for transform gate (default = -2) 7 | -- f = non-linearity (default = ReLU) 8 | 9 | local output, transform_gate, carry_gate 10 | local num_layers = num_layers or 1 11 | local bias = bias or -2 12 | local f = f or nn.ReLU() 13 | local input = nn.Identity()() 14 | local inputs = {[1]=input} 15 | for i = 1, num_layers do 16 | output = f(nn.Linear(size, size)(inputs[i])) 17 | transform_gate = nn.Sigmoid()(nn.AddConstant(bias)(nn.Linear(size, size)(inputs[i]))) 18 | carry_gate = nn.AddConstant(1)(nn.MulConstant(-1)(transform_gate)) 19 | output = nn.CAddTable()({ 20 | nn.CMulTable()({transform_gate, output}), 21 | nn.CMulTable()({carry_gate, inputs[i]}) }) 22 | table.insert(inputs, output) 23 | end 24 | return nn.gModule({input},{output}) 25 | end 26 | 27 | return HighwayMLP -------------------------------------------------------------------------------- /model/LSTMTDNN.lua: -------------------------------------------------------------------------------- 1 | local LSTMTDNN = {} 2 | 3 | local ok, cunn = pcall(require, 'fbcunn') 4 | LookupTable = nn.LookupTable 5 | 6 | function LSTMTDNN.lstmtdnn(rnn_size, n, dropout, word_vocab_size, word_vec_size, char_vocab_size, char_vec_size, 7 | feature_maps, kernels, length, use_words, use_chars, batch_norm, highway_layers, hsm) 8 | -- rnn_size = dimensionality of hidden layers 9 | -- n = number of layers 10 | -- dropout = dropout probability 11 | -- word_vocab_size = num words in the vocab 12 | -- word_vec_size = dimensionality of word embeddings 13 | -- char_vocab_size = num chars in the character vocab 14 | -- char_vec_size = dimensionality of char embeddings 15 | -- feature_maps = table of feature map sizes for each kernel width 16 | -- kernels = table of kernel widths 17 | -- length = max length of a word 18 | -- use_words = 1 if use word embeddings, otherwise not 19 | -- use_chars = 1 if use char embeddings, otherwise not 20 | -- highway_layers = number of highway layers to 
use, if any
21 | 
22 |     dropout = dropout or 0 
23 | 
24 |     -- there will be 2*n+1 inputs if using only words or only chars;
25 |     -- if using both words and chars, there will be 2*n+2 inputs
26 |     local char_vec_layer, word_vec_layer, x, input_size_L, word_vec, char_vec
27 |     local highway_layers = highway_layers or 0
28 |     local length = length
29 |     local inputs = {}
30 |     if use_chars == 1 then
31 |         table.insert(inputs, nn.Identity()()) -- batch_size x word length (char indices)
32 |         char_vec_layer = LookupTable(char_vocab_size, char_vec_size)
33 |         char_vec_layer.name = 'char_vecs' -- change name so we can refer to it easily later
34 |     end
35 |     if use_words == 1 then
36 |         table.insert(inputs, nn.Identity()()) -- batch_size x 1 (word indices)
37 |         word_vec_layer = LookupTable(word_vocab_size, word_vec_size)
38 |         word_vec_layer.name = 'word_vecs' -- change name so we can refer to it easily later
39 |     end
40 |     for L = 1,n do
41 |         table.insert(inputs, nn.Identity()()) -- prev_c[L]
42 |         table.insert(inputs, nn.Identity()()) -- prev_h[L]
43 |     end
44 |     local outputs = {}
45 |     for L = 1,n do
46 |         -- c,h from previous timesteps. offsets depend on if we are using both word/chars
47 |         local prev_h = inputs[L*2+use_words+use_chars]
48 |         local prev_c = inputs[L*2+use_words+use_chars-1]
49 |         -- the input to this layer
50 |         if L == 1 then
51 |             if use_chars == 1 then
52 |                 char_vec = char_vec_layer(inputs[1])
53 |                 local char_cnn = TDNN.tdnn(length, char_vec_size, feature_maps, kernels)
54 |                 char_cnn.name = 'cnn' -- change name so we can refer to it later
55 |                 local cnn_output = char_cnn(char_vec)
56 |                 input_size_L = torch.Tensor(feature_maps):sum()
57 |                 if use_words == 1 then
58 |                     word_vec = word_vec_layer(inputs[2])
59 |                     x = nn.JoinTable(2)({cnn_output, word_vec})
60 |                     input_size_L = input_size_L + word_vec_size
61 |                 else
62 |                     x = nn.Identity()(cnn_output)
63 |                 end
64 |             else -- word_vecs only
65 |                 x = word_vec_layer(inputs[1])
66 |                 input_size_L = word_vec_size
67 |             end
68 |             if batch_norm == 1 then
69 |                 x = nn.BatchNormalization(0)(x)
70 |             end
71 |             if highway_layers > 0 then
72 |                 local highway_mlp = HighwayMLP.mlp(input_size_L, highway_layers)
73 |                 highway_mlp.name = 'highway'
74 |                 x = highway_mlp(x)
75 |             end
76 |         else
77 |             x = outputs[(L-1)*2] -- prev_h
78 |             if dropout > 0 then x = nn.Dropout(dropout)(x) end -- apply dropout, if any
79 |             input_size_L = rnn_size
80 |         end
81 |         -- evaluate the input sums at once for efficiency
82 |         local i2h = nn.Linear(input_size_L, 4 * rnn_size)(x)
83 |         local h2h = nn.Linear(rnn_size, 4 * rnn_size)(prev_h)
84 |         local all_input_sums = nn.CAddTable()({i2h, h2h})
85 | 
86 |         local sigmoid_chunk = nn.Narrow(2, 1, 3*rnn_size)(all_input_sums)
87 |         sigmoid_chunk = nn.Sigmoid()(sigmoid_chunk)
88 |         local in_gate = nn.Narrow(2,1,rnn_size)(sigmoid_chunk)
89 |         local out_gate = nn.Narrow(2, rnn_size+1, rnn_size)(sigmoid_chunk)
90 |         local forget_gate = nn.Narrow(2, 2*rnn_size + 1, rnn_size)(sigmoid_chunk)
91 |         local in_transform = nn.Tanh()(nn.Narrow(2,3*rnn_size + 1, rnn_size)(all_input_sums))
92 | 
93 |         -- perform the LSTM update
94 |         local next_c = nn.CAddTable()({
95 |             nn.CMulTable()({forget_gate, prev_c}),
96 |             nn.CMulTable()({in_gate, in_transform})
97 |         })
98 |         -- gated cells form the output
99 |         local next_h = nn.CMulTable()({out_gate, nn.Tanh()(next_c)})
100 | 
101 |         table.insert(outputs, next_c)
102 |         table.insert(outputs, next_h)
103 |     end
104 | 
105 |     -- set up the decoder
106 |     local top_h = outputs[#outputs]
107 |     if dropout > 0 then
108 |         top_h = nn.Dropout(dropout)(top_h)
109 |     else
110 |         top_h = nn.Identity()(top_h) -- to be compatible with dropout=0 and hsm>0
111 |     end
112 | 
113 |     if hsm > 0 then -- if HSM is used then softmax will be done later
114 |         table.insert(outputs, top_h)
115 |     else
116 |         local proj = nn.Linear(rnn_size, word_vocab_size)(top_h)
117 |         local logsoft = nn.LogSoftMax()(proj)
118 |         table.insert(outputs, logsoft)
119 |     end
120 |     return nn.gModule(inputs, outputs)
121 | end
122 | 
123 | return LSTMTDNN
124 | 
125 | 
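As a rough orientation for the inputs and outputs of the resulting `nngraph` module, here is a minimal sketch (an editorial addition, not part of the repo; the argument values mirror the large character model, and the char vocab size of 51 is made up):
```
require 'nn'; require 'nngraph'
opt = {cudnn = 0}                         -- TDNN consults the global opt
TDNN = require 'model.TDNN'
HighwayMLP = require 'model.HighwayMLP'
local LSTMTDNN = require 'model.LSTMTDNN'
-- char-only model: inputs are {x_char, prev_c1, prev_h1, prev_c2, prev_h2}
local rnn = LSTMTDNN.lstmtdnn(650, 2, 0.5, 10000, 650, 51, 15,
    {50,100,150,200,200,200,200}, {1,2,3,4,5,6,7}, 21, 0, 1, 0, 2, 0)
local x_char = torch.LongTensor(20, 21):fill(1)   -- batch 20, max_word_l 21
local h0 = torch.zeros(20, 650)
local out = rnn:forward({x_char, h0:clone(), h0:clone(), h0:clone(), h0:clone()})
-- outputs: {c1, h1, c2, h2, log-probabilities over the 10K word vocab}
print(out[#out]:size())                           -- 20 x 10000
```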
--------------------------------------------------------------------------------
/model/TDNN.lua:
--------------------------------------------------------------------------------
1 | -- Time-delayed Neural Network (i.e. 1-d CNN) with multiple filter widths
2 | 
3 | local TDNN = {}
4 | --local cudnn_status, cudnn = pcall(require, 'cudnn')
5 | 
6 | function TDNN.tdnn(length, input_size, feature_maps, kernels)
7 |     -- length = length of sentences/words (zero padded to be of same length)
8 |     -- input_size = embedding_size
9 |     -- feature_maps = table of feature maps (for each kernel width)
10 |     -- kernels = table of kernel widths
11 |     local layer1_concat, output
12 |     local input = nn.Identity()() --input is batch_size x length x input_size
13 | 
14 |     local layer1 = {}
15 |     for i = 1, #kernels do
16 |         local reduced_l = length - kernels[i] + 1 
17 |         local pool_layer
18 |         if opt.cudnn == 1 then
19 |             -- Use CuDNN for temporal convolution.
20 |             if not cudnn then require 'cudnn' end
21 |             -- Fake the spatial convolution.
22 |             local conv = cudnn.SpatialConvolution(1, feature_maps[i], input_size,
23 |                             kernels[i], 1, 1, 0)
24 |             conv.name = 'conv_filter_' .. kernels[i] .. '_' .. feature_maps[i]
25 |             local conv_layer = conv(nn.View(1, -1, input_size):setNumInputDims(2)(input))
26 |             --pool_layer = nn.Max(3)(nn.Max(3)(nn.Tanh()(conv_layer)))
27 |             pool_layer = nn.Squeeze()(cudnn.SpatialMaxPooling(1, reduced_l, 1, 1, 0, 0)(nn.Tanh()(conv_layer)))
28 |         else
29 |             -- Temporal convolution: much slower
30 |             local conv = nn.TemporalConvolution(input_size, feature_maps[i], kernels[i])
31 |             local conv_layer = conv(input)
32 |             conv.name = 'conv_filter_' .. kernels[i] .. '_' .. feature_maps[i]
33 |             --pool_layer = nn.Max(2)(nn.Tanh()(conv_layer))
34 |             pool_layer = nn.TemporalMaxPooling(reduced_l)(nn.Tanh()(conv_layer))
35 |             pool_layer = nn.Squeeze()(pool_layer)
36 | 
37 |         end
38 |         table.insert(layer1, pool_layer)
39 |     end
40 |     if #kernels > 1 then
41 |         layer1_concat = nn.JoinTable(2)(layer1)
42 |         output = layer1_concat
43 |     else
44 |         output = layer1[1]
45 |     end
46 |     return nn.gModule({input}, {output})
47 | end
48 | 
49 | return TDNN
50 | 
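A quick shape check for the TDNN module (an editorial sketch with arbitrary values): each kernel of width w produces feature_maps[i] values after max-over-time pooling, and the per-kernel outputs are concatenated:
```
require 'nn'; require 'nngraph'
opt = {cudnn = 0}                          -- TDNN consults the global opt
local TDNN = require 'model.TDNN'
local net = TDNN.tdnn(21, 15, {25, 50}, {2, 3})
local chars = torch.randn(20, 21, 15)      -- batch 20, word length 21, char_vec_size 15
print(net:forward(chars):size())           -- 20 x 75 (25 + 50 feature maps)
```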
--------------------------------------------------------------------------------
/run_models.sh:
--------------------------------------------------------------------------------
1 | #!/bin/bash
2 | 
3 | #run get_data.sh to get all the relevant data
4 | #add -gpuid 0 to use GPU
5 | #add -cudnn 1 to use cudnn
6 | 
7 | #To reproduce Table 3:
8 | #LSTM-Word-Small
9 | th main.lua -data_dir data/ptb -savefile ptb-word-small -EOS '+' -rnn_size 200 -use_chars 0 \
10 | -use_words 1 -word_vec_size 200 -highway_layers 0
11 | #LSTM-CharCNN-Small
12 | th main.lua -data_dir data/ptb -savefile ptb-char-small -EOS '+' -rnn_size 300 -use_chars 1 \
13 | -use_words 0 -char_vec_size 15 -highway_layers 1 -kernels '{1,2,3,4,5,6}' -feature_maps '{25,50,75,100,125,150}'
14 | #LSTM-Word-Large
15 | th main.lua -data_dir data/ptb -savefile ptb-word-large -EOS '+' -rnn_size 650 -use_chars 0 \
16 | -use_words 1 -word_vec_size 650 -highway_layers 0
17 | #LSTM-CharCNN-Large
18 | th main.lua -data_dir data/ptb -savefile ptb-char-large -EOS '+' -rnn_size 650 -use_chars 1 -use_words 0 \
19 | -char_vec_size 15 -highway_layers 2 -kernels '{1,2,3,4,5,6,7}' -feature_maps '{50,100,150,200,200,200,200}'
20 | 
21 | #To reproduce Table 4, run the same scripts as above but change data_dir/savefile,
22 | #and use -EOS ''. So for German (DE), use the following scripts
23 | #LSTM-Word-Small
24 | th main.lua -data_dir data/de -savefile de-word-small -rnn_size 200 -use_chars 0 \
25 | -use_words 1 -word_vec_size 200 -highway_layers 0 -EOS ''
26 | #LSTM-CharCNN-Small
27 | th main.lua -data_dir data/de -savefile de-char-small -rnn_size 300 -use_chars 1 \
28 | -use_words 0 -char_vec_size 15 -highway_layers 1 -kernels '{1,2,3,4,5,6}' -feature_maps '{25,50,75,100,125,150}' -EOS ''
29 | #LSTM-Word-Large
30 | th main.lua -data_dir data/de -savefile de-word-large -rnn_size 650 -use_chars 0 \
31 | -use_words 1 -word_vec_size 650 -highway_layers 0 -EOS ''
32 | #LSTM-CharCNN-Large
33 | th main.lua -data_dir data/de -savefile de-char-large -rnn_size 650 -use_chars 1 -use_words 0 \
34 | -char_vec_size 15 -highway_layers 2 -kernels '{1,2,3,4,5,6,7}' -feature_maps '{50,100,150,200,200,200,200}' -EOS ''
--------------------------------------------------------------------------------
/util/BatchLoaderUnk.lua:
--------------------------------------------------------------------------------
1 | 
2 | -- Modified from https://github.com/karpathy/char-rnn
3 | -- This version is for cases where one has already segmented train/val/test splits
4 | 
5 | local BatchLoaderUnk = {}
6 | local stringx = require('pl.stringx')
7 | BatchLoaderUnk.__index = BatchLoaderUnk
8 | utf8 = require 'lua-utf8'
9 | 
10 | function BatchLoaderUnk.create(data_dir, batch_size, seq_length, max_word_l)
11 |     local self = {}
12 |     setmetatable(self, BatchLoaderUnk)
13 | 
14 |     local train_file = path.join(data_dir, 'train.txt')
15 |     local valid_file = path.join(data_dir, 'valid.txt')
16 |     local test_file = path.join(data_dir, 'test.txt')
17 |     local input_files = {train_file, valid_file, test_file}
18 |     local vocab_file = path.join(data_dir, 
'vocab.t7') 19 | local tensor_file = path.join(data_dir, 'data.t7') 20 | local char_file = path.join(data_dir, 'data_char.t7') 21 | 22 | -- construct a tensor with all the data 23 | if not (path.exists(vocab_file) or path.exists(tensor_file) or path.exists(char_file)) then 24 | print('one-time setup: preprocessing input train/valid/test files in dir: ' .. data_dir) 25 | BatchLoaderUnk.text_to_tensor(input_files, vocab_file, tensor_file, char_file, max_word_l) 26 | end 27 | 28 | print('loading data files...') 29 | local all_data = torch.load(tensor_file) -- train, valid, test tensors 30 | local all_data_char = torch.load(char_file) -- train, valid, test character indices 31 | local vocab_mapping = torch.load(vocab_file) 32 | self.idx2word, self.word2idx, self.idx2char, self.char2idx = table.unpack(vocab_mapping) 33 | self.vocab_size = #self.idx2word 34 | print(string.format('Word vocab size: %d, Char vocab size: %d', #self.idx2word, #self.idx2char)) 35 | -- create word-char mappings 36 | self.max_word_l = all_data_char[1]:size(2) 37 | -- cut off the end for train/valid sets so that it divides evenly 38 | -- test set is not cut off 39 | self.batch_size = batch_size 40 | self.seq_length = seq_length 41 | self.split_sizes = {} 42 | self.all_batches = {} 43 | print('reshaping tensors...') 44 | local x_batches, y_batches, nbatches 45 | for split, data in ipairs(all_data) do 46 | local len = data:size(1) 47 | if len % (batch_size * seq_length) ~= 0 and split < 3 then 48 | data = data:sub(1, batch_size * seq_length * math.floor(len / (batch_size * seq_length))) 49 | end 50 | local ydata = data:clone() 51 | ydata:sub(1,-2):copy(data:sub(2,-1)) 52 | ydata[-1] = data[1] 53 | local data_char = torch.zeros(data:size(1), self.max_word_l):long() 54 | for i = 1, data:size(1) do 55 | data_char[i] = all_data_char[split][i] 56 | end 57 | if split < 3 then 58 | x_batches = data:view(batch_size, -1):split(seq_length, 2) 59 | y_batches = ydata:view(batch_size, -1):split(seq_length, 2) 60 | x_char_batches = data_char:view(batch_size, -1, self.max_word_l):split(seq_length,2) 61 | nbatches = #x_batches 62 | self.split_sizes[split] = nbatches 63 | assert(#x_batches == #y_batches) 64 | assert(#x_batches == #x_char_batches) 65 | else --for test we repeat dimensions to batch size (easier but inefficient evaluation) 66 | x_batches = {data:resize(1, data:size(1)):expand(batch_size, data:size(2))} 67 | y_batches = {ydata:resize(1, ydata:size(1)):expand(batch_size, ydata:size(2))} 68 | data_char = data_char:resize(1, data_char:size(1), data_char:size(2)) 69 | x_char_batches = {data_char:expand(batch_size, data_char:size(2), data_char:size(3))} 70 | self.split_sizes[split] = 1 71 | end 72 | self.all_batches[split] = {x_batches, y_batches, x_char_batches} 73 | end 74 | self.batch_idx = {0,0,0} 75 | print(string.format('data load done. 
Number of batches in train: %d, val: %d, test: %d',
76 |           self.split_sizes[1], self.split_sizes[2], self.split_sizes[3]))
77 |     collectgarbage()
78 |     return self
79 | end
80 | 
81 | function BatchLoaderUnk:reset_batch_pointer(split_idx, batch_idx)
82 |     batch_idx = batch_idx or 0
83 |     self.batch_idx[split_idx] = batch_idx
84 | end
85 | 
86 | function BatchLoaderUnk:next_batch(split_idx)
87 |     -- split_idx is integer: 1 = train, 2 = val, 3 = test
88 |     self.batch_idx[split_idx] = self.batch_idx[split_idx] + 1
89 |     if self.batch_idx[split_idx] > self.split_sizes[split_idx] then
90 |         self.batch_idx[split_idx] = 1 -- cycle around to beginning
91 |     end
92 |     -- pull out the correct next batch
93 |     local idx = self.batch_idx[split_idx]
94 |     return self.all_batches[split_idx][1][idx], self.all_batches[split_idx][2][idx], self.all_batches[split_idx][3][idx]
95 | end
96 | 
97 | function BatchLoaderUnk.text_to_tensor(input_files, out_vocabfile, out_tensorfile, out_charfile, max_word_l)
98 |     print('Processing text into tensors...')
99 |     local tokens = opt.tokens -- inherit global constants for tokens
100 |     local f, rawdata
101 |     local output_tensors = {} -- output tensors for train/val/test
102 |     local output_chars = {} -- output character tensors for train/val/test sets
103 |     local vocab_count = {} -- vocab count
104 |     local max_word_l_tmp = 0 -- max word length of the corpus
105 |     local idx2word = {tokens.UNK} -- unknown word token
106 |     local word2idx = {}; word2idx[tokens.UNK] = 1
107 |     local idx2char = {tokens.ZEROPAD, tokens.START, tokens.END} -- zero-pad, start-of-word, end-of-word tokens
108 |     local char2idx = {}; char2idx[tokens.ZEROPAD] = 1; char2idx[tokens.START] = 2; char2idx[tokens.END] = 3
109 |     local split_counts = {}
110 | 
111 |     -- first go through train/valid/test to get max word length
112 |     -- if actual max word length is smaller than specified
113 |     -- we use that instead. this is inefficient, but only a one-off thing so should be fine
114 |     -- also counts the number of tokens
115 |     for split = 1,3 do -- split = 1 (train), 2 (val), or 3 (test)
116 |         f = io.open(input_files[split], 'r')
117 |         local counts = 0
118 |         for line in f:lines() do
119 |             line = stringx.replace(line, '<unk>', tokens.UNK) -- replace <unk> with a single character
120 |             line = stringx.replace(line, tokens.START, '') --start-of-word token is reserved
121 |             line = stringx.replace(line, tokens.END, '') --end-of-word token is reserved
122 |             for word in line:gmatch'([^%s]+)' do
123 |                 max_word_l_tmp = math.max(max_word_l_tmp, utf8.len(word) + 2) -- add 2 for start/end chars
124 |                 counts = counts + 1
125 |             end
126 |             if tokens.EOS ~= '' then
127 |                 counts = counts + 1 --PTB uses \n for <eos>, so need to add one more token at the end
128 |             end
129 |         end
130 |         f:close()
131 |         split_counts[split] = counts
132 |     end
133 | 
134 |     print('After first pass of data, max word length is: ' .. max_word_l_tmp)
135 |     print(string.format('Token count: train %d, val %d, test %d',
136 |           split_counts[1], split_counts[2], split_counts[3]))
137 | 
138 |     -- if actual max word length is less than the limit, use that
139 |     max_word_l = math.min(max_word_l_tmp, max_word_l)
140 | 
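    -- (Editorial illustration, not in the original source.) With the default
    -- tokens from main.lua (ZEROPAD ' ' = 1, START '{' = 2, END '}' = 3) and
    -- max_word_l = 7, the word "cat" is stored in output_chars as
    --     { 2, idx('c'), idx('a'), idx('t'), 3, 1, 1 }
    -- i.e. start-of-word, the character indices, end-of-word, then zero-padding
    -- (the char tensors are initialized with torch.ones below, so index 1 pads).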
141 |     for split = 1,3 do -- split = 1 (train), 2 (val), or 3 (test)
142 |         -- Preallocate the tensors we will need.
143 |         -- Watch out: the second one needs a lot of RAM.
144 |         output_tensors[split] = torch.LongTensor(split_counts[split])
145 |         output_chars[split] = torch.ones(split_counts[split], max_word_l):long()
146 | 
147 |         f = io.open(input_files[split], 'r')
148 |         local word_num = 0
149 |         for line in f:lines() do
150 |             line = stringx.replace(line, '<unk>', tokens.UNK)
151 |             line = stringx.replace(line, tokens.START, '') -- start and end of word tokens are reserved
152 |             line = stringx.replace(line, tokens.END, '')
153 |             for rword in line:gmatch'([^%s]+)' do
154 |                 function append(word)
155 |                     word_num = word_num + 1
156 |                     -- Collect garbage.
157 |                     if word_num % 10000 == 0 then
158 |                         collectgarbage()
159 |                     end
160 |                     local chars = {char2idx[tokens.START]} -- start-of-word symbol
161 |                     if string.sub(word,1,1) == tokens.UNK and word:len() > 1 then -- unk token with character info available
162 |                         word = string.sub(word, 3)
163 |                         output_tensors[split][word_num] = word2idx[tokens.UNK]
164 |                     else
165 |                         if word2idx[word]==nil then
166 |                             idx2word[#idx2word + 1] = word -- create word-idx/idx-word mappings
167 |                             word2idx[word] = #idx2word
168 |                         end
169 |                         output_tensors[split][word_num] = word2idx[word]
170 |                     end
171 |                     local l = utf8.len(word)
172 |                     for _, char in utf8.next, word do
173 |                         char = utf8.char(char) -- save as actual characters
174 |                         if char2idx[char]==nil then
175 |                             idx2char[#idx2char + 1] = char -- create char-idx/idx-char mappings
176 |                             char2idx[char] = #idx2char
177 |                         end
178 |                         chars[#chars + 1] = char2idx[char]
179 |                     end
180 |                     chars[#chars + 1] = char2idx[tokens.END] -- end-of-word symbol
181 |                     for i = 1, math.min(#chars, max_word_l) do
182 |                         output_chars[split][word_num][i] = chars[i]
183 |                     end
184 |                     if #chars > max_word_l then -- truncated: make sure the word still ends with the end-of-word symbol
185 |                         output_chars[split][word_num][max_word_l] = char2idx[tokens.END]
186 |                     end
187 |                 end
188 |                 append(rword)
189 |             end
190 |             if tokens.EOS ~= '' then --PTB does not have <eos> so we add an extra token for it
191 |                 append(tokens.EOS) --other datasets don't need this
192 |             end
193 |         end
194 |     end
195 |     print "done"
196 |     -- save output preprocessed files
197 |     print('saving ' .. out_vocabfile)
198 |     torch.save(out_vocabfile, {idx2word, word2idx, idx2char, char2idx})
199 |     print('saving ' .. out_tensorfile)
200 |     torch.save(out_tensorfile, output_tensors)
201 |     print('saving ' .. 
out_charfile) 202 | torch.save(out_charfile, output_chars) 203 | end 204 | 205 | return BatchLoaderUnk 206 | 207 | -------------------------------------------------------------------------------- /util/Diag.lua: -------------------------------------------------------------------------------- 1 | local Diag, parent = torch.class('nn.Diag', 'nn.Module') 2 | 3 | function Diag:__init(pos, length) 4 | parent.__init(self) 5 | self.weight = torch.Tensor(pos, length) 6 | self.gradWeight = torch.Tensor(pos, length) 7 | self:reset() 8 | end 9 | 10 | function Diag:reset(stdv) 11 | stdv = stdv or 1./math.sqrt(self.weight:size(1) + self.weight:size(2)) 12 | self.weight:uniform(-stdv, stdv) 13 | end 14 | 15 | function Diag:updateOutput(input) 16 | if input:dim()==2 then 17 | self.output:resize(self.weight:size(1), self.weight:size(2)) 18 | self.output:cmul(input, self.weight) 19 | else 20 | local batch_size = input:size(1) 21 | if not self.tmp or self.tmp:size(1) ~= batch_size then 22 | self.tmp = torch.expand(self.weight:view(1,self.weight:size(1), 23 | self.weight:size(2)), batch_size, self.weight:size(1), self.weight:size(2)) 24 | end 25 | self.output:resize(batch_size, self.weight:size(1), self.weight:size(2)) 26 | self.output:cmul(input, self.tmp) 27 | end 28 | return self.output 29 | end 30 | 31 | function Diag:updateGradInput(input, gradOutput) 32 | if input:dim()==2 then 33 | self.gradInput:resize(self.weight:size(1), self.weight:size(2)) 34 | self.gradInput:cmul(gradOutput, self.weight) 35 | else 36 | local batch_size = input:size(1) 37 | self.gradInput:resize(batch_size, self.weight:size(1), self.weight:size(2)) 38 | self.gradInput:cmul(gradOutput, self.tmp) 39 | end 40 | return self.gradInput 41 | end 42 | 43 | function Diag:accGradParameters(input, gradOutput) 44 | if input:dim()==2 then 45 | self.gradWeight:addcmul(gradOutput, input) 46 | else 47 | local batch_size = input:size(1) 48 | if not self.tmpGrad or self.tmpGrad:size(1) ~= batch_size then 49 | self.tmpGrad = torch.Tensor(batch_size, self.weight:size(1), 50 | self.weight:size(2)) 51 | end 52 | self.tmpGrad:cmul(gradOutput, input) 53 | self.gradWeight:add(self.tmpGrad:sum(1):squeeze()) 54 | end 55 | end -------------------------------------------------------------------------------- /util/HLogSoftMax.lua: -------------------------------------------------------------------------------- 1 | local HLogSoftMax, parent = torch.class('nn.HLogSoftMax', 'nn.Criterion') 2 | 3 | function HLogSoftMax:__init(mapping, input_size) 4 | -- different implementation of the fbnn.HSM module 5 | -- variable names are mostly the same as in fbnn.HSM 6 | -- only supports batch inputs 7 | 8 | parent.__init(self) 9 | if type(mapping) == 'table' then 10 | self.mapping = torch.LongTensor(mapping) 11 | else 12 | self.mapping = mapping 13 | end 14 | self.input_size = input_size 15 | self.n_classes = self.mapping:size(1) 16 | self.n_clusters = self.mapping[{{},1}]:max() 17 | self.n_class_in_cluster = torch.LongTensor(self.n_clusters):zero() 18 | for i = 1, self.mapping:size(1) do 19 | local c = self.mapping[i][1] 20 | self.n_class_in_cluster[c] = self.n_class_in_cluster[c] + 1 21 | end 22 | self.n_max_class_in_cluster = self.mapping[{{},2}]:max() 23 | 24 | --cluster softmax/loss 25 | self.cluster_model = nn.Sequential() 26 | self.cluster_model:add(nn.Linear(input_size, self.n_clusters)) 27 | self.cluster_model:add(nn.LogSoftMax()) 28 | self.logLossCluster = nn.ClassNLLCriterion() 29 | 30 | --class softmax/loss 31 | self.class_model = HSMClass.hsm(self.input_size, 
self.n_clusters, self.n_max_class_in_cluster) 32 | local get_layer = function (layer) 33 | if layer.name ~= nil then 34 | if layer.name == 'class_bias' then 35 | self.class_bias = layer 36 | elseif layer.name == 'class_weight' then 37 | self.class_weight = layer 38 | end 39 | end 40 | end 41 | self.class_model:apply(get_layer) 42 | self.logLossClass = nn.ClassNLLCriterion() 43 | 44 | self:change_bias() 45 | self.gradInput = torch.Tensor(input_size) 46 | end 47 | 48 | function HLogSoftMax:clone(...) 49 | return nn.Module.clone(self, ...) 50 | end 51 | 52 | function HLogSoftMax:parameters() 53 | return {self.cluster_model.modules[1].weight, 54 | self.cluster_model.modules[1].bias, 55 | self.class_bias.weight, 56 | self.class_weight.weight} , 57 | {self.cluster_model.modules[1].gradWeight, 58 | self.cluster_model.modules[1].gradBias, 59 | self.class_bias.gradWeight, 60 | self.class_weight.gradWeight} 61 | end 62 | 63 | function HLogSoftMax:getParameters() 64 | return nn.Module.getParameters(self) 65 | end 66 | 67 | function HLogSoftMax:updateOutput(input, target) 68 | self.batch_size = input:size(1) 69 | local new_target = self.mapping:index(1, target) 70 | local cluster_loss = self.logLossCluster:forward( 71 | self.cluster_model:forward(input), 72 | new_target:select(2,1)) 73 | local class_loss = self.logLossClass:forward( 74 | self.class_model:forward({input, new_target:select(2,1)}), 75 | new_target:select(2,2)) 76 | self.output = cluster_loss + class_loss 77 | return self.output 78 | end 79 | 80 | function HLogSoftMax:updateGradInput(input, target) 81 | self.gradInput:resizeAs(input) 82 | local new_target = self.mapping:index(1, target) 83 | -- backprop clusters 84 | self.logLossCluster:updateGradInput(self.cluster_model.output, 85 | new_target:select(2,1)) 86 | self.gradInput:copy(self.cluster_model:backward(input, 87 | self.logLossCluster.gradInput)) 88 | -- backprop classes 89 | self.logLossClass:updateGradInput(self.class_model.output, 90 | new_target:select(2,2)) 91 | self.gradInput:add(self.class_model:backward(input, 92 | self.logLossClass.gradInput)[1]) 93 | return self.gradInput 94 | end 95 | 96 | 97 | function HLogSoftMax:backward(input, target, scale) 98 | self:updateGradInput(input, target) 99 | return self.gradInput 100 | end 101 | 102 | function HLogSoftMax:change_bias() 103 | -- hacky way to deal with variable cluster sizes 104 | for i = 1, self.n_clusters do 105 | local c = self.n_class_in_cluster[i] 106 | for j = c+1, self.n_max_class_in_cluster do 107 | self.class_bias.weight[i][j] = math.log(0) 108 | end 109 | end 110 | end 111 | 112 | -------------------------------------------------------------------------------- /util/HSMClass.lua: -------------------------------------------------------------------------------- 1 | local HSMClass = {} 2 | 3 | function HSMClass.hsm(input_size, n_clusters, n_max_class_in_cluster) 4 | --inputs[1] is the input (batch_size by input_size) 5 | --inputs[2] is the target cluster (batch_size) 6 | 7 | local inputs = {nn.Identity()(), nn.Identity()()} 8 | local class_bias_layer = nn.LookupTable(n_clusters, n_max_class_in_cluster) 9 | local class_vec_layer = nn.LookupTable(n_clusters, input_size*n_max_class_in_cluster) 10 | local class_mat = nn.View(n_max_class_in_cluster, input_size)( 11 | class_vec_layer(inputs[2])) 12 | class_bias_layer.name = 'class_bias' 13 | class_vec_layer.name = 'class_weight' 14 | local input_mat = nn.View(input_size, 1)(inputs[1]) 15 | local class_scores = nn.Squeeze()(nn.MM()({class_mat, input_mat})) 16 | local output 
--------------------------------------------------------------------------------
/util/OneHot.lua:
--------------------------------------------------------------------------------
1 | require 'nn'
2 | local OneHot, parent = torch.class('OneHot', 'nn.Module')
3 | 
4 | function OneHot:__init(outputSize)
5 |    parent.__init(self)
6 |    self.outputSize = outputSize
7 |    -- We'll construct one-hot encodings by using the index method to
8 |    -- reshuffle the rows of an identity matrix. To avoid recreating
9 |    -- it every iteration we'll cache it.
10 |    self._eye = torch.eye(outputSize)
11 | end
12 | 
13 | function OneHot:updateOutput(input)
14 |    self.output:resize(input:size(1), self.outputSize):zero()
15 |    if self._eye == nil then self._eye = torch.eye(self.outputSize) end
16 |    self._eye = self._eye:float() -- a no-op after the first call; keeps the cached type consistent
17 |    local longInput = input:long()
18 |    self.output:copy(self._eye:index(1, longInput))
19 |    return self.output
20 | end
--------------------------------------------------------------------------------
/util/OuterProd.lua:
--------------------------------------------------------------------------------
1 | --[[
2 | Outer product between two vectors, with mini-batch support
3 | --]]
4 | 
5 | require 'nn'
6 | local unpack = table.unpack or unpack -- Lua 5.2 / LuaJIT (5.1) compatibility
7 | 
8 | local OuterProd, parent = torch.class('nn.OuterProd', 'nn.Module')
9 | 
10 | function OuterProd:__init()
11 |    parent.__init(self)
12 |    self.gradInput = {torch.Tensor(), torch.Tensor()}
13 | end
14 | 
15 | function OuterProd:updateOutput(input)
16 |    assert(#input == 2, 'only supports outer products of 2 vectors')
17 |    local a, b = unpack(input)
18 |    assert(a:nDimension() == 1 or a:nDimension() == 2, 'input tensors must be 1D or 2D')
19 |    if a:nDimension() == 1 then
20 |       self.output:resize(a:size(1), b:size(1))
21 |       self.output:ger(a, b)
22 |    else -- mini-batch processing
23 |       self.output:resize(a:size(1), a:size(2), b:size(2))
24 |       for i = 1, a:size(1) do
25 |          self.output[i]:ger(a[i], b[i])
26 |       end
27 |    end
28 |    return self.output
29 | end
30 | 
31 | function OuterProd:updateGradInput(input, gradOutput)
32 |    local a, b = unpack(input)
33 |    self.gradInput[1]:resizeAs(a)
34 |    self.gradInput[2]:resizeAs(b)
35 |    if a:nDimension() == 1 then
36 |       self.gradInput[1]:mv(gradOutput, b)
37 |       self.gradInput[2]:mv(gradOutput:t(), a)
38 |    else -- mini-batch processing
39 |       for i = 1, gradOutput:size(1) do
40 |          self.gradInput[1][i]:mv(gradOutput[i], b[i])
41 |          self.gradInput[2][i]:mv(gradOutput[i]:t(), a[i])
42 |       end
43 |    end
44 |    return self.gradInput
45 | end
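A minimal sketch of the two modules above (not part of the repository; assumes the repo root is on the Lua package path):
```
require 'nn'
require 'util.OneHot'
require 'util.OuterProd'

-- OneHot: a batch of indices -> one-hot rows
local oh = OneHot(5)
print(oh:forward(torch.Tensor{2, 4}))   -- 2x5 matrix with ones at columns 2 and 4

-- OuterProd: batched outer product, (3x4) x (3x6) -> 3x4x6
local op = nn.OuterProd()
local a, b = torch.randn(3, 4), torch.randn(3, 6)
local out = op:forward({a, b})
local grads = op:backward({a, b}, torch.ones(3, 4, 6))  -- grads[1]: 3x4, grads[2]: 3x6
```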
--------------------------------------------------------------------------------
/util/TensorProd.lua:
--------------------------------------------------------------------------------
1 | --[[
2 | Torch module for the bilinear tensor product
3 | vec1 = first input vector (size n)
4 | vec2 = second input vector (size m)
5 | T = 3D weight tensor (size o x n x m, where o is the output size)
6 | output = vec1 x T x vec2 + bias (size o)
7 | --]]
8 | 
9 | require 'nn'
10 | local unpack = table.unpack or unpack -- Lua 5.2 / LuaJIT (5.1) compatibility
11 | 
12 | local TensorProd, parent = torch.class('nn.TensorProd', 'nn.Module')
13 | 
14 | function TensorProd:__init(vec1_size, vec2_size, output_size)
15 |    parent.__init(self)
16 |    self.bias = torch.Tensor(output_size)
17 |    self.gradBias = torch.Tensor(output_size)
18 |    self.weight = torch.Tensor(output_size, vec1_size, vec2_size)
19 |    self.gradWeight = torch.Tensor(output_size, vec1_size, vec2_size)
20 |    self.tmp = torch.Tensor() -- scratch tensor for intermediate results
21 |    self.gradInput = {torch.Tensor(), torch.Tensor()}
22 |    self.ab = torch.Tensor(vec1_size, vec2_size) -- outer-product scratch used in accGradParameters
23 |    self:reset()
24 | end
25 | 
26 | function TensorProd:reset(stdv)
27 |    stdv = stdv or 1./math.sqrt(self.weight:size(2) + self.weight:size(3))
28 |    self.weight:uniform(-stdv, stdv)
29 |    self.bias:uniform(-stdv, stdv)
30 | end
31 | 
32 | function TensorProd:updateOutput(input)
33 |    local a, b = unpack(input) -- output = a x weight x b + bias
34 |    assert(a:dim() == b:dim(), 'input tensors should have same number of dims (1 or 2)')
35 |    if a:dim() == 1 then
36 |       self.output:resize(self.weight:size(1))
37 |       self.tmp:resize(self.weight:size(1), self.weight:size(2))
38 |       self.output:copy(self.bias)
39 |       for i = 1, self.weight:size(1) do
40 |          self.tmp[i]:mv(self.weight[i], b)
41 |       end
42 |       self.output:addmv(1, self.tmp, a)
43 |    elseif a:dim() == 2 then -- mini-batch processing
44 |       local batch_size = a:size(1)
45 |       self.output:resize(batch_size, self.weight:size(1))
46 |       if not self.buffer or self.buffer:nElement() ~= batch_size then
47 |          self.buffer = torch.ones(batch_size)
48 |          self.buffer2 = torch.ones(a:size(2))
49 |       end
50 |       self.tmp:resize(self.weight:size(1), batch_size, self.weight:size(2))
51 |       for i = 1, self.weight:size(1) do
52 |          self.tmp[i]:addmm(0, self.tmp[i], 1, b, self.weight[i]:t())
53 |          self.tmp[i]:cmul(a)
54 |          self.output[{{},i}]:mv(self.tmp[i], self.buffer2)
55 |       end
56 |       self.output:addr(1, self.buffer, self.bias) -- add the bias to every row
57 |    else
58 |       error("input must be 1D or 2D tensors")
59 |    end
60 |    return self.output
61 | end
62 | 
63 | function TensorProd:updateGradInput(input, gradOutput)
64 |    local a, b = unpack(input)
65 |    self.gradInput[1]:resizeAs(a):zero() -- zero first: the loops below accumulate
66 |    self.gradInput[2]:resizeAs(b):zero()
67 |    if a:dim() == 1 then
68 |       for i = 1, self.weight:size(1) do
69 |          self.gradInput[1]:addmv(gradOutput[i], self.weight[i], b)
70 |          self.gradInput[2]:addmv(gradOutput[i], self.weight[i]:t(), a)
71 |       end
72 |    else -- mini-batch processing
73 |       local gradOutput1 = gradOutput:view(a:size(1), self.weight:size(1),
74 |          1):expand(a:size(1), self.weight:size(1), self.weight:size(2))
75 |       local gradOutput2 = gradOutput:view(a:size(1), self.weight:size(1),
76 |          1):expand(a:size(1), self.weight:size(1), self.weight:size(3))
77 |       for i = 1, self.weight:size(1) do
78 |          self.gradInput[1]:add(torch.cmul(torch.mm(b, self.weight[i]:t()), gradOutput1[{{},i}]))
79 |          self.gradInput[2]:add(torch.cmul(torch.mm(a, self.weight[i]), gradOutput2[{{},i}]))
80 |       end
81 |    end
82 |    return self.gradInput
83 | end
84 | 
85 | function TensorProd:accGradParameters(input, gradOutput, scale)
86 |    scale = scale or 1
87 |    local a, b = unpack(input)
88 |    if a:dim() == 1 then
89 |       self.ab:ger(a, b)
90 |       self.gradBias:add(scale, gradOutput)
91 |       for i = 1, self.weight:size(1) do
92 |          self.gradWeight[i]:add(scale*gradOutput[i], self.ab)
93 |       end
94 |    else -- mini-batch processing
95 |       self.gradBias:add(scale, gradOutput:sum(1))
96 |       for i = 1, a:size(1) do
97 |          self.ab:ger(a[i], b[i])
98 |          for j = 1, self.weight:size(1) do
99 |             self.gradWeight[j]:add(scale*gradOutput[i][j], self.ab)
100 |          end
101 |       end
102 |    end
103 | end
104 | 
105 | function TensorProd:__tostring__()
106 |    return torch.type(self) ..
107 |       string.format('(%d x %d -> %d)', self.weight:size(2), self.weight:size(3), self.weight:size(1))
108 | end
109 | 
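A minimal sketch of `nn.TensorProd` as a bilinear layer, output[o] = a' * W[o] * b + bias[o] (not part of the repository; assumes the repo root is on the Lua package path, and the sizes are made up):
```
require 'nn'
require 'util.TensorProd'

local tp = nn.TensorProd(4, 6, 3)        -- vec1_size=4, vec2_size=6, output_size=3
local a, b = torch.randn(2, 4), torch.randn(2, 6)
local out = tp:forward({a, b})           -- 2x3
print(tp)                                -- nn.TensorProd(4 x 6 -> 3)

tp:zeroGradParameters()
local grads = tp:backward({a, b}, torch.ones(2, 3))  -- grads[1]: 2x4, grads[2]: 2x6
```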
--------------------------------------------------------------------------------
/util/misc.lua:
--------------------------------------------------------------------------------
1 | 
2 | -- misc utilities
3 | 
4 | function clone_list(tensor_list, zero_too)
5 |    -- takes a list of tensors and returns a list of cloned tensors
6 |    local out = {}
7 |    for k,v in pairs(tensor_list) do
8 |       out[k] = v:clone()
9 |       if zero_too then out[k]:zero() end
10 |    end
11 |    return out
12 | end
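A one-liner usage sketch for `clone_list` (not part of the repository; assumes the repo root is on the Lua package path):
```
require 'torch'
require 'util.misc'

-- independent copies of a list of tensors, e.g. to snapshot RNN states;
-- pass true as the second argument to get zero-filled tensors of the same shapes
local states = {torch.randn(2, 3), torch.randn(2, 3)}
local snapshot = clone_list(states)
local zeros = clone_list(states, true)
```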
--------------------------------------------------------------------------------
/util/model_utils.lua:
--------------------------------------------------------------------------------
1 | 
2 | -- adapted from https://github.com/wojciechz/learning_to_execute
3 | -- utilities for combining/flattening parameters in a model
4 | -- the code in this script is more general than it needs to be, which is
5 | -- why it is kind of large
6 | 
7 | require 'torch'
8 | local unpack = unpack or table.unpack -- LuaJIT (5.1) / Lua 5.2 compatibility
9 | local model_utils = {}
10 | function model_utils.combine_all_parameters(...)
11 |    --[[ like module:getParameters, but operates on many modules ]]--
12 | 
13 |    -- get parameters
14 |    local networks = {...}
15 |    local parameters = {}
16 |    local gradParameters = {}
17 |    for i = 1, #networks do
18 |       local net_params, net_grads = networks[i]:parameters()
19 | 
20 |       if net_params then
21 |          for _, p in pairs(net_params) do
22 |             parameters[#parameters + 1] = p
23 |          end
24 |          for _, g in pairs(net_grads) do
25 |             gradParameters[#gradParameters + 1] = g
26 |          end
27 |       end
28 |    end
29 | 
30 |    local function storageInSet(set, storage)
31 |       local storageAndOffset = set[torch.pointer(storage)]
32 |       if storageAndOffset == nil then
33 |          return nil
34 |       end
35 |       local _, offset = unpack(storageAndOffset)
36 |       return offset
37 |    end
38 | 
39 |    -- this function flattens arbitrary lists of parameters,
40 |    -- even complex shared ones
41 |    local function flatten(parameters)
42 |       if not parameters or #parameters == 0 then
43 |          return torch.Tensor()
44 |       end
45 |       local Tensor = parameters[1].new
46 | 
47 |       local storages = {}
48 |       local nParameters = 0
49 |       for k = 1,#parameters do
50 |          local storage = parameters[k]:storage()
51 |          if not storageInSet(storages, storage) then
52 |             storages[torch.pointer(storage)] = {storage, nParameters}
53 |             nParameters = nParameters + storage:size()
54 |          end
55 |       end
56 | 
57 |       local flatParameters = Tensor(nParameters):fill(1)
58 |       local flatStorage = flatParameters:storage()
59 | 
60 |       for k = 1,#parameters do
61 |          local storageOffset = storageInSet(storages, parameters[k]:storage())
62 |          parameters[k]:set(flatStorage,
63 |             storageOffset + parameters[k]:storageOffset(),
64 |             parameters[k]:size(),
65 |             parameters[k]:stride())
66 |          parameters[k]:zero()
67 |       end
68 | 
69 |       local maskParameters = flatParameters:float():clone()
70 |       local cumSumOfHoles = flatParameters:float():cumsum(1)
71 |       local nUsedParameters = nParameters - cumSumOfHoles[#cumSumOfHoles]
72 |       local flatUsedParameters = Tensor(nUsedParameters)
73 |       local flatUsedStorage = flatUsedParameters:storage()
74 | 
75 |       for k = 1,#parameters do
76 |          local offset = cumSumOfHoles[parameters[k]:storageOffset()]
77 |          parameters[k]:set(flatUsedStorage,
78 |             parameters[k]:storageOffset() - offset,
79 |             parameters[k]:size(),
80 |             parameters[k]:stride())
81 |       end
82 | 
83 |       for _, storageAndOffset in pairs(storages) do
84 |          local k, v = unpack(storageAndOffset)
85 |          flatParameters[{{v+1,v+k:size()}}]:copy(Tensor():set(k))
86 |       end
87 | 
88 |       if cumSumOfHoles:sum() == 0 then
89 |          flatUsedParameters:copy(flatParameters)
90 |       else
91 |          local counter = 0
92 |          for k = 1,flatParameters:nElement() do
93 |             if maskParameters[k] == 0 then
94 |                counter = counter + 1
95 |                flatUsedParameters[counter] = flatParameters[counter+cumSumOfHoles[k]]
96 |             end
97 |          end
98 |          assert(counter == nUsedParameters)
99 |       end
100 |       return flatUsedParameters
101 |    end
102 | 
103 |    -- flatten parameters and gradients
104 |    local flatParameters = flatten(parameters)
105 |    local flatGradParameters = flatten(gradParameters)
106 | 
107 |    -- return new flat vector that contains all discrete parameters
108 |    return flatParameters, flatGradParameters
109 | end
110 | 
111 | 
112 | 
113 | 
114 | function model_utils.clone_many_times(net, T)
115 |    local clones = {}
116 | 
117 |    local params, gradParams
118 |    if net.parameters then
119 |       params, gradParams = net:parameters()
120 |       if params == nil then
121 |          params = {}
122 |       end
123 |    end
124 | 
125 |    local paramsNoGrad
126 |    if net.parametersNoGrad then
127 |       paramsNoGrad = net:parametersNoGrad()
128 |    end
129 | 
130 |    local mem = torch.MemoryFile("w"):binary()
131 |    mem:writeObject(net)
132 | 
133 |    for t = 1, T do
134 |       -- We need to use a new reader for each clone.
135 |       -- We don't want to use the pointers to already read objects.
136 |       local reader = torch.MemoryFile(mem:storage(), "r"):binary()
137 |       local clone = reader:readObject()
138 |       reader:close()
139 | 
140 |       if net.parameters then
141 |          local cloneParams, cloneGradParams = clone:parameters()
142 |          local cloneParamsNoGrad
143 |          for i = 1, #params do
144 |             cloneParams[i]:set(params[i]) -- share storage with the master copy
145 |             cloneGradParams[i]:set(gradParams[i])
146 |          end
147 |          if paramsNoGrad then
148 |             cloneParamsNoGrad = clone:parametersNoGrad()
149 |             for i = 1, #paramsNoGrad do
150 |                cloneParamsNoGrad[i]:set(paramsNoGrad[i])
151 |             end
152 |          end
153 |       end
154 | 
155 |       clones[t] = clone
156 |       collectgarbage()
157 |    end
158 | 
159 |    mem:close()
160 |    return clones
161 | end
162 | 
163 | return model_utils
--------------------------------------------------------------------------------
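A minimal sketch of the two `model_utils` entry points (not part of the repository; the two `nn.Linear` modules are toy stand-ins, and it assumes the repo root is on the Lua package path):
```
require 'nn'
local model_utils = require 'util.model_utils'

local enc = nn.Linear(10, 8)
local dec = nn.Linear(8, 10)

-- one flat parameter/gradient vector spanning both modules,
-- so a single optimizer step can update everything at once
local params, grad_params = model_utils.combine_all_parameters(enc, dec)
print(params:nElement())   -- (10*8 + 8) + (8*10 + 10) = 178

-- T weight-sharing clones of a module, one per unrolled timestep
local clones = model_utils.clone_many_times(dec, 5)
```
Each clone returned by `clone_many_times` has its parameter tensors `set` to the master copy's storage, so updating the flat `params` vector updates every timestep's clone at once.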