├── .gitignore ├── LICENSE.md ├── README.md ├── attention_lstm.py ├── generate_insurance_qa_embeddings.py ├── install.sh ├── insurance_qa_eval.py ├── keras_models.py ├── results.notes └── word2vec_100_dim.embeddings /.gitignore: -------------------------------------------------------------------------------- 1 | # data / models (also potentially very large) 2 | data/ 3 | models/ 4 | models 5 | treq_eval* 6 | 7 | # pyc files aren't necessary 8 | *.pyc 9 | 10 | # the pycharm part isn't either 11 | .idea 12 | 13 | # large "data" files 14 | *.h5 15 | *.pkl 16 | *.txt 17 | *.dict 18 | *.model 19 | *.embeddings 20 | 21 | # virtual environments 22 | venv/ 23 | ENV/ 24 | 25 | # keras install 26 | Keras.* 27 | dist/ 28 | 29 | -------------------------------------------------------------------------------- /LICENSE.md: -------------------------------------------------------------------------------- 1 | The MIT License (MIT) 2 | 3 | Copyright (c) 2016 Benjamin Bolte 4 | 5 | Permission is hereby granted, free of charge, to any person obtaining a copy 6 | of this software and associated documentation files (the "Software"), to deal 7 | in the Software without restriction, including without limitation the rights 8 | to use, copy, modify, merge, publish, distribute, sublicense, and/or sell 9 | copies of the Software, and to permit persons to whom the Software is 10 | furnished to do so, subject to the following conditions: 11 | 12 | The above copyright notice and this permission notice shall be included in all 13 | copies or substantial portions of the Software. 14 | 15 | THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR 16 | IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, 17 | FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE 18 | AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER 19 | LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, 20 | OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE 21 | SOFTWARE. 22 | -------------------------------------------------------------------------------- /README.md: -------------------------------------------------------------------------------- 1 | # keras-language-modeling 2 | 3 | Some code for doing language modeling with Keras, in particular for question-answering tasks. I wrote a very long blog post that explains how a lot of this works, which can be found [here](http://benjaminbolte.com/blog/2016/keras-language-modeling.html). 4 | 5 | ### Stuff that might be of interest 6 | 7 | - `attention_lstm.py`: Attentional LSTM, based on one of the papers referenced in the blog post and others. One application used it for [image captioning](http://arxiv.org/pdf/1502.03044.pdf). It is initialized with an attention vector which provides the attention component for the neural network. 8 | - `insurance_qa_eval.py`: Evaluation framework for the InsuranceQA dataset. To get this working, clone the [data repository](https://github.com/codekansas/insurance_qa_python) and set the `INSURANCE_QA` environment variable to the cloned repository. Changing `config` will adjust how the model is trained. 9 | - `keras_models.py`: The `LanguageModel` class uses the `config` settings to generate a training model and a testing model. The model can be trained by passing a question vector, a ground truth answer vector, and a bad answer vector to `fit`. Then `predict` calculates the similarity between a question and answer.
Override the `build` method with whatever language model you want in order to get a trainable model. Examples are provided at the bottom of the file, including the `EmbeddingModel`, `ConvolutionModel`, and `ConvolutionalLSTM` (see the usage sketch below). 10 | 11 | ### Getting Started 12 | 13 | ````bash 14 | # Install Keras (may also need dependencies) 15 | git clone https://github.com/fchollet/keras 16 | cd keras 17 | sudo python setup.py install 18 | 19 | # Clone InsuranceQA dataset 20 | git clone https://github.com/codekansas/insurance_qa_python 21 | export INSURANCE_QA=$(pwd)/insurance_qa_python 22 | 23 | # Run insurance_qa_eval.py 24 | git clone https://github.com/codekansas/keras-language-modeling 25 | cd keras-language-modeling/ 26 | python insurance_qa_eval.py 27 | ```` 28 | 29 | Alternatively, I wrote a script to get started on a Google Cloud Platform instance (Ubuntu 16.04), which can be run via 30 | 31 | ````bash 32 | cd ~ 33 | git clone https://github.com/codekansas/keras-language-modeling 34 | cd keras-language-modeling 35 | source install.sh 36 | ```` 37 | 38 | I've been working on making these models available out-of-the-box. You need to install the Git branch of Keras (and maybe make some modifications) in order to run some of these models; the Keras project can be found [here](https://github.com/fchollet/keras). 39 | 40 | The runnable program is `insurance_qa_eval.py`. This will create a `models/` directory which stores a history of the model's weights as it trains. You need to set the `INSURANCE_QA` environment variable to tell it where the InsuranceQA dataset is. 41 | 42 | Finally, my setup (which I think is pretty common) is to have an SSD with my operating system, and an HDD with larger data files. So I would recommend creating a `models/` symlink from the project directory to somewhere on your HDD, if you have a similar setup. 43 | 44 | ### Serving to a port 45 | 46 | I added a command-line argument that uses Flask to serve the training log to a port. Once you've [installed Flask](http://flask.pocoo.org/docs/0.11/installation/), you can run: 47 | 48 | ````bash 49 | python insurance_qa_eval.py serve 50 | ```` 51 | 52 | This is useful in combination with [ngrok](https://ngrok.com/) for monitoring training progress away from your desktop.
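### Example usage

As a rough sketch of how the `LanguageModel` classes fit together (the real training loop lives in `insurance_qa_eval.py`; the config values below mirror the ones defined there, while the toy index arrays, batch size, and epoch count are made up for illustration). The sketch assumes `word2vec_100_dim.embeddings` has already been generated with `generate_insurance_qa_embeddings.py`:

````python
import numpy as np

from keras_models import ConvolutionModel

# Illustrative config; insurance_qa_eval.py holds the settings actually used for the reported runs.
conf = {
    'n_words': 22353,
    'question_len': 150,
    'answer_len': 150,
    'margin': 0.009,
    'initial_embed_weights': 'word2vec_100_dim.embeddings',  # produced by generate_insurance_qa_embeddings.py
    'similarity': {'mode': 'cosine', 'dropout': 0.5},
}

model = ConvolutionModel(conf)
model.compile(optimizer='adam')

# Toy batches of padded word-index arrays; real data comes from the InsuranceQA pickles.
questions = np.random.randint(1, conf['n_words'], size=(32, conf['question_len']))
good_answers = np.random.randint(1, conf['n_words'], size=(32, conf['answer_len']))
bad_answers = np.random.randint(1, conf['n_words'], size=(32, conf['answer_len']))

# Training minimizes relu(margin - sim(question, good) + sim(question, bad)).
model.fit([questions, good_answers, bad_answers], epochs=1, batch_size=32)

# Prediction returns the similarity between each question/answer pair.
similarities = model.predict([questions, good_answers])
````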
53 | 54 | ### Additionally 55 | 56 | - The official implementation can be found [here](https://github.com/white127/insuranceQA-cnn-lstm) 57 | 58 | ### Data 59 | 60 | - L6 from [Yahoo Webscope](http://webscope.sandbox.yahoo.com/) 61 | - [InsuranceQA data](https://github.com/shuzi/insuranceQA) 62 | - [Pythonic version](https://github.com/codekansas/insurance_qa_python) 63 | 64 | -------------------------------------------------------------------------------- /attention_lstm.py: -------------------------------------------------------------------------------- 1 | from __future__ import absolute_import 2 | 3 | from keras import backend as K 4 | from keras.engine import InputSpec 5 | from keras.layers import LSTM, activations, Wrapper 6 | 7 | 8 | class AttentionLSTM(LSTM): 9 | def __init__(self, output_dim, attention_vec, attn_activation='tanh', single_attention_param=False, **kwargs): 10 | self.attention_vec = attention_vec 11 | self.attn_activation = activations.get(attn_activation) 12 | self.single_attention_param = single_attention_param 13 | 14 | super(AttentionLSTM, self).__init__(output_dim, **kwargs) 15 | 16 | def build(self, input_shape): 17 | super(AttentionLSTM, self).build(input_shape) 18 | 19 | if hasattr(self.attention_vec, '_keras_shape'): 20 | attention_dim = self.attention_vec._keras_shape[1] 21 | else: 22 | raise Exception('Layer could not be build: No information about expected input shape.') 23 | 24 | self.U_a = self.inner_init((self.output_dim, self.output_dim), 25 | name='{}_U_a'.format(self.name)) 26 | self.b_a = K.zeros((self.output_dim,), name='{}_b_a'.format(self.name)) 27 | 28 | self.U_m = self.inner_init((attention_dim, self.output_dim), 29 | name='{}_U_m'.format(self.name)) 30 | self.b_m = K.zeros((self.output_dim,), name='{}_b_m'.format(self.name)) 31 | 32 | if self.single_attention_param: 33 | self.U_s = self.inner_init((self.output_dim, 1), 34 | name='{}_U_s'.format(self.name)) 35 | self.b_s = K.zeros((1,), name='{}_b_s'.format(self.name)) 36 | else: 37 | self.U_s = self.inner_init((self.output_dim, self.output_dim), 38 | name='{}_U_s'.format(self.name)) 39 | self.b_s = K.zeros((self.output_dim,), name='{}_b_s'.format(self.name)) 40 | 41 | self.trainable_weights += [self.U_a, self.U_m, self.U_s, self.b_a, self.b_m, self.b_s] 42 | 43 | if self.initial_weights is not None: 44 | self.set_weights(self.initial_weights) 45 | del self.initial_weights 46 | 47 | def step(self, x, states): 48 | h, [h, c] = super(AttentionLSTM, self).step(x, states) 49 | attention = states[4] 50 | 51 | m = self.attn_activation(K.dot(h, self.U_a) * attention + self.b_a) 52 | # Intuitively it makes more sense to use a sigmoid (was getting some NaN problems 53 | # which I think might have been caused by the exponential function -> gradients blow up) 54 | s = K.sigmoid(K.dot(m, self.U_s) + self.b_s) 55 | 56 | if self.single_attention_param: 57 | h = h * K.repeat_elements(s, self.output_dim, axis=1) 58 | else: 59 | h = h * s 60 | 61 | return h, [h, c] 62 | 63 | def get_constants(self, x): 64 | constants = super(AttentionLSTM, self).get_constants(x) 65 | constants.append(K.dot(self.attention_vec, self.U_m) + self.b_m) 66 | return constants 67 | 68 | 69 | class AttentionLSTMWrapper(Wrapper): 70 | def __init__(self, layer, attention_vec, attn_activation='tanh', single_attention_param=False, **kwargs): 71 | assert isinstance(layer, LSTM) 72 | self.supports_masking = True 73 | self.attention_vec = attention_vec 74 | self.attn_activation = activations.get(attn_activation) 75 | 
self.single_attention_param = single_attention_param 76 | super(AttentionLSTMWrapper, self).__init__(layer, **kwargs) 77 | 78 | def build(self, input_shape): 79 | assert len(input_shape) >= 3 80 | self.input_spec = [InputSpec(shape=input_shape)] 81 | 82 | if not self.layer.built: 83 | self.layer.build(input_shape) 84 | self.layer.built = True 85 | 86 | super(AttentionLSTMWrapper, self).build() 87 | 88 | if hasattr(self.attention_vec, '_keras_shape'): 89 | attention_dim = self.attention_vec._keras_shape[1] 90 | else: 91 | raise Exception('Layer could not be build: No information about expected input shape.') 92 | 93 | self.U_a = self.layer.inner_init((self.layer.output_dim, self.layer.output_dim), name='{}_U_a'.format(self.name)) 94 | self.b_a = K.zeros((self.layer.output_dim,), name='{}_b_a'.format(self.name)) 95 | 96 | self.U_m = self.layer.inner_init((attention_dim, self.layer.output_dim), name='{}_U_m'.format(self.name)) 97 | self.b_m = K.zeros((self.layer.output_dim,), name='{}_b_m'.format(self.name)) 98 | 99 | if self.single_attention_param: 100 | self.U_s = self.layer.inner_init((self.layer.output_dim, 1), name='{}_U_s'.format(self.name)) 101 | self.b_s = K.zeros((1,), name='{}_b_s'.format(self.name)) 102 | else: 103 | self.U_s = self.layer.inner_init((self.layer.output_dim, self.layer.output_dim), name='{}_U_s'.format(self.name)) 104 | self.b_s = K.zeros((self.layer.output_dim,), name='{}_b_s'.format(self.name)) 105 | 106 | self.trainable_weights = [self.U_a, self.U_m, self.U_s, self.b_a, self.b_m, self.b_s] 107 | 108 | def get_output_shape_for(self, input_shape): 109 | return self.layer.get_output_shape_for(input_shape) 110 | 111 | def step(self, x, states): 112 | h, [h, c] = self.layer.step(x, states) 113 | attention = states[4] 114 | 115 | m = self.attn_activation(K.dot(h, self.U_a) * attention + self.b_a) 116 | s = K.sigmoid(K.dot(m, self.U_s) + self.b_s) 117 | 118 | if self.single_attention_param: 119 | h = h * K.repeat_elements(s, self.layer.output_dim, axis=1) 120 | else: 121 | h = h * s 122 | 123 | return h, [h, c] 124 | 125 | def get_constants(self, x): 126 | constants = self.layer.get_constants(x) 127 | constants.append(K.dot(self.attention_vec, self.U_m) + self.b_m) 128 | return constants 129 | 130 | def call(self, x, mask=None): 131 | # input shape: (nb_samples, time (padded with zeros), input_dim) 132 | # note that the .build() method of subclasses MUST define 133 | # self.input_spec with a complete input shape. 134 | input_shape = self.input_spec[0].shape 135 | if K._BACKEND == 'tensorflow': 136 | if not input_shape[1]: 137 | raise Exception('When using TensorFlow, you should define ' 138 | 'explicitly the number of timesteps of ' 139 | 'your sequences.\n' 140 | 'If your first layer is an Embedding, ' 141 | 'make sure to pass it an "input_length" ' 142 | 'argument. Otherwise, make sure ' 143 | 'the first layer has ' 144 | 'an "input_shape" or "batch_input_shape" ' 145 | 'argument, including the time axis. 
' 146 | 'Found input shape at layer ' + self.name + 147 | ': ' + str(input_shape)) 148 | if self.layer.stateful: 149 | initial_states = self.layer.states 150 | else: 151 | initial_states = self.layer.get_initial_states(x) 152 | constants = self.get_constants(x) 153 | preprocessed_input = self.layer.preprocess_input(x) 154 | 155 | last_output, outputs, states = K.rnn(self.step, preprocessed_input, 156 | initial_states, 157 | go_backwards=self.layer.go_backwards, 158 | mask=mask, 159 | constants=constants, 160 | unroll=self.layer.unroll, 161 | input_length=input_shape[1]) 162 | if self.layer.stateful: 163 | self.updates = [] 164 | for i in range(len(states)): 165 | self.updates.append((self.layer.states[i], states[i])) 166 | 167 | if self.layer.return_sequences: 168 | return outputs 169 | else: 170 | return last_output 171 | -------------------------------------------------------------------------------- /generate_insurance_qa_embeddings.py: -------------------------------------------------------------------------------- 1 | #!/usr/bin/env python 2 | 3 | """ 4 | Command-line script for generating embeddings 5 | Useful if you want to generate larger embeddings for some models 6 | """ 7 | 8 | from __future__ import print_function 9 | 10 | import os 11 | import sys 12 | import random 13 | import pickle 14 | import argparse 15 | import logging 16 | 17 | random.seed(42) 18 | 19 | 20 | def load(path, name): 21 | return pickle.load(open(os.path.join(path, name), 'rb')) 22 | 23 | 24 | def revert(vocab, indices): 25 | return [vocab.get(i, 'X') for i in indices] 26 | 27 | try: 28 | data_path = os.environ['INSURANCE_QA'] 29 | except KeyError: 30 | print('INSURANCE_QA is not set. Set it to your clone of https://github.com/codekansas/insurance_qa_python') 31 | sys.exit(1) 32 | 33 | # parse arguments 34 | parser = argparse.ArgumentParser(description='Generate embeddings for the InsuranceQA dataset') 35 | parser.add_argument('--iter', metavar='N', type=int, default=10, help='number of times to run') 36 | parser.add_argument('--size', metavar='D', type=int, default=100, help='dimensions in embedding') 37 | args = parser.parse_args() 38 | 39 | # configure logging 40 | logger = logging.getLogger(os.path.basename(sys.argv[0])) 41 | logging.basicConfig(format='%(asctime)s: %(levelname)s: %(message)s') 42 | logging.root.setLevel(level=logging.INFO) 43 | logger.info('running %s' % ' '.join(sys.argv)) 44 | 45 | # imports go down here because they are time-consuming 46 | from gensim.models import Word2Vec 47 | from keras_models import * 48 | 49 | vocab = load(data_path, 'vocabulary') 50 | 51 | answers = load(data_path, 'answers') 52 | sentences = [revert(vocab, txt) for txt in answers.values()] 53 | sentences += [revert(vocab, q['question']) for q in load(data_path, 'train')] 54 | 55 | # run model 56 | model = Word2Vec(sentences, size=args.size, min_count=5, window=5, sg=1, iter=args.iter) 57 | weights = model.syn0 58 | d = dict([(k, v.index) for k, v in model.vocab.items()]) 59 | emb = np.zeros(shape=(len(vocab)+1, args.size), dtype='float32') 60 | 61 | for i, w in vocab.items(): 62 | if w not in d: continue 63 | emb[i, :] = weights[d[w], :] 64 | 65 | np.save(open('word2vec_%d_dim.embeddings' % args.size, 'wb'), emb) 66 | logger.info('saved to "word2vec_%d_dim.embeddings"' % args.size) 67 | 68 | -------------------------------------------------------------------------------- /install.sh: -------------------------------------------------------------------------------- 1 | #!/usr/bin/env bash 2 | 3 | # This script 
will get you up and running on a Google Compute Engine instance 4 | # Ubuntu 16.04 (as many CPUs as you like) 5 | 6 | # exit on failure 7 | # set -e 8 | 9 | # make models directory 10 | if [ ! -d "models/" ]; then 11 | mkdir models 12 | fi 13 | 14 | # install pip (not installed by default on GCE) 15 | sudo apt install python-pip 16 | 17 | # install virtualenv 18 | sudo pip install virtualenv 19 | 20 | # create and activate virtual environment 21 | if [ ! -d "venv" ]; then 22 | virtualenv venv 23 | fi 24 | source venv/bin/activate 25 | 26 | # install h5py 27 | pip install h5py 28 | 29 | # install blas/lapack 30 | sudo apt install libblas-dev liblapack-dev libatlas-base-dev gfortran 31 | 32 | # install tensorflow 33 | export TF_BINARY_URL=https://storage.googleapis.com/tensorflow/linux/cpu/tensorflow-0.10.0rc0-cp27-none-linux_x86_64.whl 34 | pip install --upgrade $TF_BINARY_URL 35 | 36 | # install keras from source in the home directory 37 | export KERAS_DIRECTORY=~/keras 38 | if [ ! -d "${KERAS_DIRECTORY}" ]; then 39 | git clone https://github.com/fchollet/keras ${KERAS_DIRECTORY} 40 | fi 41 | cd $KERAS_DIRECTORY 42 | python setup.py install 43 | cd - 44 | if [ ! -d ~/.keras ]; then 45 | mkdir ~/.keras 46 | fi 47 | echo '{"epsilon": 1e-07, "floatx": "float32", "backend": "tensorflow"}' > ~/.keras/keras.json 48 | 49 | # download insurance qa files 50 | export INSURANCE_QA=~/insurance_qa 51 | if [ ! -d $INSURANCE_QA ]; then 52 | git clone https://github.com/codekansas/insurance_qa_python $INSURANCE_QA 53 | fi 54 | 55 | # alert user that we're done 56 | echo ">==< Successfully installed dependencies >==<" 57 | 58 | -------------------------------------------------------------------------------- /insurance_qa_eval.py: -------------------------------------------------------------------------------- 1 | from __future__ import print_function 2 | 3 | import os 4 | 5 | import sys 6 | import random 7 | from time import strftime, gmtime, time 8 | 9 | import pickle 10 | import json 11 | 12 | import thread 13 | from scipy.stats import rankdata 14 | 15 | random.seed(42) 16 | 17 | 18 | def log(x): 19 | print(x) 20 | 21 | 22 | class Evaluator: 23 | def __init__(self, conf, model, optimizer=None): 24 | try: 25 | data_path = os.environ['INSURANCE_QA'] 26 | except KeyError: 27 | print("INSURANCE_QA is not set. 
Set it to your clone of https://github.com/codekansas/insurance_qa_python") 28 | sys.exit(1) 29 | if isinstance(conf, str): 30 | conf = json.load(open(conf, 'rb')) 31 | self.model = model(conf) 32 | self.path = data_path 33 | self.conf = conf 34 | self.params = conf['training'] 35 | optimizer = self.params['optimizer'] if optimizer is None else optimizer 36 | self.model.compile(optimizer) 37 | self.answers = self.load('answers') # self.load('generated') 38 | self._vocab = None 39 | self._reverse_vocab = None 40 | self._eval_sets = None 41 | 42 | ##### Resources ##### 43 | 44 | def load(self, name): 45 | return pickle.load(open(os.path.join(self.path, name), 'rb')) 46 | 47 | def vocab(self): 48 | if self._vocab is None: 49 | self._vocab = self.load('vocabulary') 50 | return self._vocab 51 | 52 | def reverse_vocab(self): 53 | if self._reverse_vocab is None: 54 | vocab = self.vocab() 55 | self._reverse_vocab = dict((v.lower(), k) for k, v in vocab.items()) 56 | return self._reverse_vocab 57 | 58 | ##### Loading / saving ##### 59 | 60 | def save_epoch(self, epoch): 61 | if not os.path.exists('models/'): 62 | os.makedirs('models/') 63 | self.model.save_weights('models/weights_epoch_%d.h5' % epoch, overwrite=True) 64 | 65 | def load_epoch(self, epoch): 66 | assert os.path.exists('models/weights_epoch_%d.h5' % epoch), 'Weights at epoch %d not found' % epoch 67 | self.model.load_weights('models/weights_epoch_%d.h5' % epoch) 68 | 69 | ##### Converting / reverting ##### 70 | 71 | def convert(self, words): 72 | rvocab = self.reverse_vocab() 73 | if type(words) == str: 74 | words = words.strip().lower().split(' ') 75 | return [rvocab.get(w, 0) for w in words] 76 | 77 | def revert(self, indices): 78 | vocab = self.vocab() 79 | return [vocab.get(i, 'X') for i in indices] 80 | 81 | ##### Padding ##### 82 | 83 | def padq(self, data): 84 | return self.pad(data, self.conf.get('question_len', None)) 85 | 86 | def pada(self, data): 87 | return self.pad(data, self.conf.get('answer_len', None)) 88 | 89 | def pad(self, data, len=None): 90 | from keras.preprocessing.sequence import pad_sequences 91 | return pad_sequences(data, maxlen=len, padding='post', truncating='post', value=0) 92 | 93 | ##### Training ##### 94 | 95 | def get_time(self): 96 | return strftime('%Y-%m-%d %H:%M:%S', gmtime()) 97 | 98 | def train(self): 99 | batch_size = self.params['batch_size'] 100 | nb_epoch = self.params['nb_epoch'] 101 | validation_split = self.params['validation_split'] 102 | 103 | training_set = self.load('train') 104 | # top_50 = self.load('top_50') 105 | 106 | questions = list() 107 | good_answers = list() 108 | indices = list() 109 | 110 | for j, q in enumerate(training_set): 111 | questions += [q['question']] * len(q['answers']) 112 | good_answers += [self.answers[i] for i in q['answers']] 113 | indices += [j] * len(q['answers']) 114 | log('Began training at %s on %d samples' % (self.get_time(), len(questions))) 115 | 116 | questions = self.padq(questions) 117 | good_answers = self.pada(good_answers) 118 | 119 | val_loss = {'loss': 1., 'epoch': 0} 120 | 121 | # def get_bad_samples(indices, top_50): 122 | # return [self.answers[random.choice(top_50[i])] for i in indices] 123 | 124 | for i in range(1, nb_epoch+1): 125 | # sample from all answers to get bad answers 126 | # if i % 2 == 0: 127 | # bad_answers = self.pada(random.sample(self.answers.values(), len(good_answers))) 128 | # else: 129 | # bad_answers = self.pada(get_bad_samples(indices, top_50)) 130 | bad_answers = self.pada(random.sample(self.answers.values(), 
len(good_answers))) 131 | 132 | print('Fitting epoch %d' % i, file=sys.stderr) 133 | hist = self.model.fit([questions, good_answers, bad_answers], epochs=1, batch_size=batch_size, 134 | validation_split=validation_split, verbose=1) 135 | 136 | if hist.history['val_loss'][0] < val_loss['loss']: 137 | val_loss = {'loss': hist.history['val_loss'][0], 'epoch': i} 138 | log('%s -- Epoch %d ' % (self.get_time(), i) + 139 | 'Loss = %.4f, Validation Loss = %.4f ' % (hist.history['loss'][0], hist.history['val_loss'][0]) + 140 | '(Best: Loss = %.4f, Epoch = %d)' % (val_loss['loss'], val_loss['epoch'])) 141 | 142 | self.save_epoch(i) 143 | 144 | return val_loss 145 | 146 | ##### Evaluation ##### 147 | 148 | def prog_bar(self, so_far, total, n_bars=20): 149 | n_complete = int(so_far * n_bars / total) 150 | if n_complete >= n_bars - 1: 151 | print('\r[' + '=' * n_bars + ']', end='', file=sys.stderr) 152 | else: 153 | s = '\r[' + '=' * (n_complete - 1) + '>' + '.' * (n_bars - n_complete) + ']' 154 | print(s, end='', file=sys.stderr) 155 | 156 | def eval_sets(self): 157 | if self._eval_sets is None: 158 | self._eval_sets = dict([(s, self.load(s)) for s in ['dev', 'test1', 'test2']]) 159 | return self._eval_sets 160 | 161 | def get_score(self, verbose=False): 162 | top1_ls = [] 163 | mrr_ls = [] 164 | for name, data in self.eval_sets().items(): 165 | print('----- %s -----' % name) 166 | 167 | random.shuffle(data) 168 | 169 | if 'n_eval' in self.params: 170 | data = data[:self.params['n_eval']] 171 | 172 | c_1, c_2 = 0, 0 173 | 174 | for i, d in enumerate(data): 175 | self.prog_bar(i, len(data)) 176 | 177 | indices = d['good'] + d['bad'] 178 | answers = self.pada([self.answers[i] for i in indices]) 179 | question = self.padq([d['question']] * len(indices)) 180 | 181 | sims = self.model.predict([question, answers]) 182 | 183 | n_good = len(d['good']) 184 | max_r = np.argmax(sims) 185 | max_n = np.argmax(sims[:n_good]) 186 | 187 | r = rankdata(sims, method='max') 188 | 189 | if verbose: 190 | min_r = np.argmin(sims) 191 | amin_r = self.answers[indices[min_r]] 192 | amax_r = self.answers[indices[max_r]] 193 | amax_n = self.answers[indices[max_n]] 194 | 195 | print(' '.join(self.revert(d['question']))) 196 | print('Predicted: ({}) '.format(sims[max_r]) + ' '.join(self.revert(amax_r))) 197 | print('Expected: ({}) Rank = {} '.format(sims[max_n], r[max_n]) + ' '.join(self.revert(amax_n))) 198 | print('Worst: ({})'.format(sims[min_r]) + ' '.join(self.revert(amin_r))) 199 | 200 | c_1 += 1 if max_r == max_n else 0 201 | c_2 += 1 / float(r[max_r] - r[max_n] + 1) 202 | 203 | top1 = c_1 / float(len(data)) 204 | mrr = c_2 / float(len(data)) 205 | 206 | del data 207 | print('Top-1 Precision: %f' % top1) 208 | print('MRR: %f' % mrr) 209 | top1_ls.append(top1) 210 | mrr_ls.append(mrr) 211 | return top1_ls, mrr_ls 212 | 213 | 214 | if __name__ == '__main__': 215 | if len(sys.argv) >= 2 and sys.argv[1] == 'serve': 216 | from flask import Flask 217 | app = Flask(__name__) 218 | port = 5000 219 | lines = list() 220 | def log(x): 221 | lines.append(x) 222 | 223 | @app.route('/') 224 | def home(): 225 | return ('

<html><body><h1>Training Log</h1>' + 226 | ''.join(['<code>{}</code><br/>
'.format(line) for line in lines]) + 227 | '') 228 | 229 | def start_server(): 230 | app.run(debug=False, use_evalex=False, port=port) 231 | 232 | thread.start_new_thread(start_server, tuple()) 233 | print('Serving to port %d' % port, file=sys.stderr) 234 | 235 | import numpy as np 236 | 237 | conf = { 238 | 'n_words': 22353, 239 | 'question_len': 150, 240 | 'answer_len': 150, 241 | 'margin': 0.009, 242 | 'initial_embed_weights': 'word2vec_100_dim.embeddings', 243 | 244 | 'training': { 245 | 'batch_size': 100, 246 | 'nb_epoch': 2000, 247 | 'validation_split': 0.1, 248 | }, 249 | 250 | 'similarity': { 251 | 'mode': 'cosine', 252 | 'gamma': 1, 253 | 'c': 1, 254 | 'd': 2, 255 | 'dropout': 0.5, 256 | } 257 | } 258 | 259 | from keras_models import EmbeddingModel, ConvolutionModel, ConvolutionalLSTM 260 | evaluator = Evaluator(conf, model=ConvolutionModel, optimizer='adam') 261 | 262 | # train the model 263 | best_loss = evaluator.train() 264 | 265 | # evaluate mrr for a particular epoch 266 | evaluator.load_epoch(best_loss['epoch']) 267 | top1, mrr = evaluator.get_score(verbose=False) 268 | log(' - Top-1 Precision:') 269 | log(' - %.3f on test 1' % top1[0]) 270 | log(' - %.3f on test 2' % top1[1]) 271 | log(' - %.3f on dev' % top1[2]) 272 | log(' - MRR:') 273 | log(' - %.3f on test 1' % mrr[0]) 274 | log(' - %.3f on test 2' % mrr[1]) 275 | log(' - %.3f on dev' % mrr[2]) 276 | -------------------------------------------------------------------------------- /keras_models.py: -------------------------------------------------------------------------------- 1 | from __future__ import print_function 2 | 3 | from abc import abstractmethod 4 | 5 | from keras.engine import Input 6 | from keras.layers import merge, Embedding, Dropout, Conv1D, Lambda, LSTM, Dense, concatenate, TimeDistributed 7 | from keras import backend as K 8 | from keras.models import Model 9 | 10 | import numpy as np 11 | 12 | 13 | class LanguageModel: 14 | def __init__(self, config): 15 | self.question = Input(shape=(config['question_len'],), dtype='int32', name='question_base') 16 | self.answer_good = Input(shape=(config['answer_len'],), dtype='int32', name='answer_good_base') 17 | self.answer_bad = Input(shape=(config['answer_len'],), dtype='int32', name='answer_bad_base') 18 | 19 | self.config = config 20 | self.params = config.get('similarity', dict()) 21 | 22 | # initialize a bunch of variables that will be set later 23 | self._models = None 24 | self._similarities = None 25 | self._answer = None 26 | self._qa_model = None 27 | 28 | self.training_model = None 29 | self.prediction_model = None 30 | 31 | def get_answer(self): 32 | if self._answer is None: 33 | self._answer = Input(shape=(self.config['answer_len'],), dtype='int32', name='answer') 34 | return self._answer 35 | 36 | @abstractmethod 37 | def build(self): 38 | return 39 | 40 | def get_similarity(self): 41 | ''' Specify similarity in configuration under 'similarity' -> 'mode' 42 | If a parameter is needed for the model, specify it in 'similarity' 43 | 44 | Example configuration: 45 | 46 | config = { 47 | ... other parameters ... 
48 | 'similarity': { 49 | 'mode': 'gesd', 50 | 'gamma': 1, 51 | 'c': 1, 52 | } 53 | } 54 | 55 | cosine: dot(a, b) / sqrt(dot(a, a) * dot(b, b)) 56 | polynomial: (gamma * dot(a, b) + c) ^ d 57 | sigmoid: tanh(gamma * dot(a, b) + c) 58 | rbf: exp(-gamma * l2_norm(a-b) ^ 2) 59 | euclidean: 1 / (1 + l2_norm(a - b)) 60 | exponential: exp(-gamma * l2_norm(a - b)) 61 | gesd: euclidean * sigmoid 62 | aesd: (euclidean + sigmoid) / 2 63 | ''' 64 | 65 | params = self.params 66 | similarity = params['mode'] 67 | 68 | dot = lambda a, b: K.batch_dot(a, b, axes=1) 69 | l2_norm = lambda a, b: K.sqrt(K.sum(K.square(a - b), axis=1, keepdims=True)) 70 | 71 | if similarity == 'cosine': 72 | return lambda x: dot(x[0], x[1]) / K.maximum(K.sqrt(dot(x[0], x[0]) * dot(x[1], x[1])), K.epsilon()) 73 | elif similarity == 'polynomial': 74 | return lambda x: (params['gamma'] * dot(x[0], x[1]) + params['c']) ** params['d'] 75 | elif similarity == 'sigmoid': 76 | return lambda x: K.tanh(params['gamma'] * dot(x[0], x[1]) + params['c']) 77 | elif similarity == 'rbf': 78 | return lambda x: K.exp(-1 * params['gamma'] * l2_norm(x[0], x[1]) ** 2) 79 | elif similarity == 'euclidean': 80 | return lambda x: 1 / (1 + l2_norm(x[0], x[1])) 81 | elif similarity == 'exponential': 82 | return lambda x: K.exp(-1 * params['gamma'] * l2_norm(x[0], x[1])) 83 | elif similarity == 'gesd': 84 | euclidean = lambda x: 1 / (1 + l2_norm(x[0], x[1])) 85 | sigmoid = lambda x: 1 / (1 + K.exp(-1 * params['gamma'] * (dot(x[0], x[1]) + params['c']))) 86 | return lambda x: euclidean(x) * sigmoid(x) 87 | elif similarity == 'aesd': 88 | euclidean = lambda x: 0.5 / (1 + l2_norm(x[0], x[1])) 89 | sigmoid = lambda x: 0.5 / (1 + K.exp(-1 * params['gamma'] * (dot(x[0], x[1]) + params['c']))) 90 | return lambda x: euclidean(x) + sigmoid(x) 91 | else: 92 | raise Exception('Invalid similarity: {}'.format(similarity)) 93 | 94 | def get_qa_model(self): 95 | if self._models is None: 96 | self._models = self.build() 97 | 98 | if self._qa_model is None: 99 | question_output, answer_output = self._models 100 | dropout = Dropout(self.params.get('dropout', 0.2)) 101 | similarity = self.get_similarity() 102 | # qa_model = merge([dropout(question_output), dropout(answer_output)], 103 | # mode=similarity, output_shape=lambda _: (None, 1)) 104 | qa_model = Lambda(similarity, output_shape=lambda _: (None, 1))([dropout(question_output), 105 | dropout(answer_output)]) 106 | self._qa_model = Model(inputs=[self.question, self.get_answer()], outputs=qa_model, name='qa_model') 107 | 108 | return self._qa_model 109 | 110 | def compile(self, optimizer, **kwargs): 111 | qa_model = self.get_qa_model() 112 | 113 | good_similarity = qa_model([self.question, self.answer_good]) 114 | bad_similarity = qa_model([self.question, self.answer_bad]) 115 | 116 | # loss = merge([good_similarity, bad_similarity], 117 | # mode=lambda x: K.relu(self.config['margin'] - x[0] + x[1]), 118 | # output_shape=lambda x: x[0]) 119 | 120 | loss = Lambda(lambda x: K.relu(self.config['margin'] - x[0] + x[1]), 121 | output_shape=lambda x: x[0])([good_similarity, bad_similarity]) 122 | 123 | self.prediction_model = Model(inputs=[self.question, self.answer_good], outputs=good_similarity, 124 | name='prediction_model') 125 | self.prediction_model.compile(loss=lambda y_true, y_pred: y_pred, optimizer=optimizer, **kwargs) 126 | 127 | self.training_model = Model(inputs=[self.question, self.answer_good, self.answer_bad], outputs=loss, 128 | name='training_model') 129 | self.training_model.compile(loss=lambda y_true, 
y_pred: y_pred, optimizer=optimizer, **kwargs) 130 | 131 | def fit(self, x, **kwargs): 132 | assert self.training_model is not None, 'Must compile the model before fitting data' 133 | y = np.zeros(shape=(x[0].shape[0],)) # doesn't get used 134 | return self.training_model.fit(x, y, **kwargs) 135 | 136 | def predict(self, x): 137 | assert self.prediction_model is not None and isinstance(self.prediction_model, Model) 138 | return self.prediction_model.predict_on_batch(x) 139 | 140 | def save_weights(self, file_name, **kwargs): 141 | assert self.prediction_model is not None, 'Must compile the model before saving weights' 142 | self.prediction_model.save_weights(file_name, **kwargs) 143 | 144 | def load_weights(self, file_name, **kwargs): 145 | assert self.prediction_model is not None, 'Must compile the model loading weights' 146 | self.prediction_model.load_weights(file_name, **kwargs) 147 | 148 | 149 | class EmbeddingModel(LanguageModel): 150 | def build(self): 151 | question = self.question 152 | answer = self.get_answer() 153 | 154 | # add embedding layers 155 | weights = np.load(self.config['initial_embed_weights']) 156 | embedding = Embedding(input_dim=self.config['n_words'], 157 | output_dim=weights.shape[1], 158 | mask_zero=True, 159 | # dropout=0.2, 160 | weights=[weights]) 161 | question_embedding = embedding(question) 162 | answer_embedding = embedding(answer) 163 | 164 | # maxpooling 165 | maxpool = Lambda(lambda x: K.max(x, axis=1, keepdims=False), output_shape=lambda x: (x[0], x[2])) 166 | maxpool.supports_masking = True 167 | question_pool = maxpool(question_embedding) 168 | answer_pool = maxpool(answer_embedding) 169 | 170 | return question_pool, answer_pool 171 | 172 | 173 | class ConvolutionModel(LanguageModel): 174 | def build(self): 175 | assert self.config['question_len'] == self.config['answer_len'] 176 | 177 | question = self.question 178 | answer = self.get_answer() 179 | 180 | # add embedding layers 181 | weights = np.load(self.config['initial_embed_weights']) 182 | embedding = Embedding(input_dim=self.config['n_words'], 183 | output_dim=weights.shape[1], 184 | weights=[weights]) 185 | question_embedding = embedding(question) 186 | answer_embedding = embedding(answer) 187 | 188 | hidden_layer = TimeDistributed(Dense(200, activation='tanh')) 189 | 190 | question_hl = hidden_layer(question_embedding) 191 | answer_hl = hidden_layer(answer_embedding) 192 | 193 | # cnn 194 | cnns = [Conv1D(kernel_size=kernel_size, 195 | filters=1000, 196 | activation='tanh', 197 | padding='same') for kernel_size in [2, 3, 5, 7]] 198 | # question_cnn = merge([cnn(question_embedding) for cnn in cnns], mode='concat') 199 | question_cnn = concatenate([cnn(question_hl) for cnn in cnns], axis=-1) 200 | # answer_cnn = merge([cnn(answer_embedding) for cnn in cnns], mode='concat') 201 | answer_cnn = concatenate([cnn(answer_hl) for cnn in cnns], axis=-1) 202 | 203 | # maxpooling 204 | maxpool = Lambda(lambda x: K.max(x, axis=1, keepdims=False), output_shape=lambda x: (x[0], x[2])) 205 | maxpool.supports_masking = True 206 | # enc = Dense(100, activation='tanh') 207 | # question_pool = enc(maxpool(question_cnn)) 208 | # answer_pool = enc(maxpool(answer_cnn)) 209 | question_pool = maxpool(question_cnn) 210 | answer_pool = maxpool(answer_cnn) 211 | 212 | return question_pool, answer_pool 213 | 214 | 215 | class ConvolutionalLSTM(LanguageModel): 216 | def build(self): 217 | question = self.question 218 | answer = self.get_answer() 219 | 220 | # add embedding layers 221 | weights = 
np.load(self.config['initial_embed_weights']) 222 | embedding = Embedding(input_dim=self.config['n_words'], 223 | output_dim=weights.shape[1], 224 | weights=[weights]) 225 | question_embedding = embedding(question) 226 | answer_embedding = embedding(answer) 227 | 228 | f_rnn = LSTM(141, return_sequences=True, implementation=1) 229 | b_rnn = LSTM(141, return_sequences=True, implementation=1, go_backwards=True) 230 | 231 | qf_rnn = f_rnn(question_embedding) 232 | qb_rnn = b_rnn(question_embedding) 233 | # question_pool = merge([qf_rnn, qb_rnn], mode='concat', concat_axis=-1) 234 | question_pool = concatenate([qf_rnn, qb_rnn], axis=-1) 235 | 236 | af_rnn = f_rnn(answer_embedding) 237 | ab_rnn = b_rnn(answer_embedding) 238 | # answer_pool = merge([af_rnn, ab_rnn], mode='concat', concat_axis=-1) 239 | answer_pool = concatenate([af_rnn, ab_rnn], axis=-1) 240 | 241 | # cnn 242 | cnns = [Conv1D(kernel_size=kernel_size, 243 | filters=500, 244 | activation='tanh', 245 | padding='same') for kernel_size in [1, 2, 3, 5]] 246 | # question_cnn = merge([cnn(question_pool) for cnn in cnns], mode='concat') 247 | question_cnn = concatenate([cnn(question_pool) for cnn in cnns], axis=-1) 248 | # answer_cnn = merge([cnn(answer_pool) for cnn in cnns], mode='concat') 249 | answer_cnn = concatenate([cnn(answer_pool) for cnn in cnns], axis=-1) 250 | 251 | maxpool = Lambda(lambda x: K.max(x, axis=1, keepdims=False), output_shape=lambda x: (x[0], x[2])) 252 | maxpool.supports_masking = True 253 | question_pool = maxpool(question_cnn) 254 | answer_pool = maxpool(answer_cnn) 255 | 256 | return question_pool, answer_pool 257 | 258 | 259 | class AttentionModel(LanguageModel): 260 | def build(self): 261 | question = self.question 262 | answer = self.get_answer() 263 | 264 | # add embedding layers 265 | weights = np.load(self.config['initial_embed_weights']) 266 | embedding = Embedding(input_dim=self.config['n_words'], 267 | output_dim=weights.shape[1], 268 | # mask_zero=True, 269 | weights=[weights]) 270 | question_embedding = embedding(question) 271 | answer_embedding = embedding(answer) 272 | 273 | # question rnn part 274 | f_rnn = LSTM(141, return_sequences=True, consume_less='mem') 275 | b_rnn = LSTM(141, return_sequences=True, consume_less='mem', go_backwards=True) 276 | question_f_rnn = f_rnn(question_embedding) 277 | question_b_rnn = b_rnn(question_embedding) 278 | 279 | # maxpooling 280 | maxpool = Lambda(lambda x: K.max(x, axis=1, keepdims=False), output_shape=lambda x: (x[0], x[2])) 281 | maxpool.supports_masking = True 282 | question_pool = merge([maxpool(question_f_rnn), maxpool(question_b_rnn)], mode='concat', concat_axis=-1) 283 | 284 | # answer rnn part 285 | from attention_lstm import AttentionLSTMWrapper 286 | f_rnn = AttentionLSTMWrapper(f_rnn, question_pool, single_attention_param=True) 287 | b_rnn = AttentionLSTMWrapper(b_rnn, question_pool, single_attention_param=True) 288 | 289 | answer_f_rnn = f_rnn(answer_embedding) 290 | answer_b_rnn = b_rnn(answer_embedding) 291 | answer_pool = merge([maxpool(answer_f_rnn), maxpool(answer_b_rnn)], mode='concat', concat_axis=-1) 292 | 293 | return question_pool, answer_pool 294 | -------------------------------------------------------------------------------- /results.notes: -------------------------------------------------------------------------------- 1 | Best results achieved for each model: 2 | 3 | Embedding + Max Pooling: 4 | - Top 1 Precision: 5 | - 0.492 on test 1 6 | - 0.483 on test 2 7 | - 0.495 on dev 8 | - MRR: 9 | - 0.624 on test 1 10 | - 0.611 on 
test 2 11 | - 0.624 on dev 12 | 13 | Attentional LSTM + Max Pooling: 14 | - Top 1 precision: 15 | - 0.480 on test 1 16 | - 0.465 on test 2 17 | - 0.487 on dev 18 | - MRR: 19 | - 0.627 on test 1 20 | - 0.613 on test 2 21 | - 0.635 on dev 22 | 23 | Unsupervised RNN language model + trained embeddings: 24 | - Top 1 precision: 25 | - 0.546 on test 1 26 | - 0.527 on test 2 27 | - 0.552 on dev 28 | - MRR: 29 | - 0.670 on test 1 30 | - 0.651 on test 2 31 | - 0.671 on dev 32 | 33 | Training ConvolutionalLSTM model for a long time (~4 days): 34 | - Top-1 Precision: 35 | - 0.564 on test 1 36 | - 0.543 on test 2 37 | - 0.573 on dev 38 | - MRR: 39 | - 0.681 on test 1 40 | - 0.661 on test 2 41 | - 0.686 on dev 42 | -------------------------------------------------------------------------------- /word2vec_100_dim.embeddings: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/codekansas/keras-language-modeling/14d6c319ad0bd2dea70f401400e7e0e4e6fcb55b/word2vec_100_dim.embeddings --------------------------------------------------------------------------------
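For reference, the `word2vec_100_dim.embeddings` file linked above is simply a NumPy array saved by `generate_insurance_qa_embeddings.py`, with one row per vocabulary index. A minimal sketch of inspecting it (the printed shape is approximate and depends on the vocabulary size and the `--size` argument used when generating it):

````python
import numpy as np

# The embeddings file is a saved NumPy array of shape (vocabulary size + 1, embedding dimension).
emb = np.load('word2vec_100_dim.embeddings')
print(emb.shape)  # e.g. (22354, 100) for the default 100-dimensional embeddings
````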