├── .gitignore ├── LICENSE.md ├── README.md ├── attention_lstm.py ├── generate_insurance_qa_embeddings.py ├── install.sh ├── insurance_qa_eval.py ├── keras_models.py ├── results.notes └── word2vec_100_dim.embeddings /.gitignore: -------------------------------------------------------------------------------- 1 | # data / models (also potentially very large) 2 | data/ 3 | models/ 4 | models 5 | treq_eval* 6 | 7 | # pyc files aren't necessary 8 | *.pyc 9 | 10 | # the pycharm part isn't either 11 | .idea 12 | 13 | # large "data" files 14 | *.h5 15 | *.pkl 16 | *.txt 17 | *.dict 18 | *.model 19 | *.embeddings 20 | 21 | # virtual environments 22 | venv/ 23 | ENV/ 24 | 25 | # keras install 26 | Keras.* 27 | dist/ 28 | 29 | -------------------------------------------------------------------------------- /LICENSE.md: -------------------------------------------------------------------------------- 1 | The MIT License (MIT) 2 | 3 | Copyright (c) 2016 Benjamin Bolte 4 | 5 | Permission is hereby granted, free of charge, to any person obtaining a copy 6 | of this software and associated documentation files (the "Software"), to deal 7 | in the Software without restriction, including without limitation the rights 8 | to use, copy, modify, merge, publish, distribute, sublicense, and/or sell 9 | copies of the Software, and to permit persons to whom the Software is 10 | furnished to do so, subject to the following conditions: 11 | 12 | The above copyright notice and this permission notice shall be included in all 13 | copies or substantial portions of the Software. 14 | 15 | THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR 16 | IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, 17 | FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE 18 | AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER 19 | LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, 20 | OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE 21 | SOFTWARE. 22 | -------------------------------------------------------------------------------- /README.md: -------------------------------------------------------------------------------- 1 | # keras-language-modeling 2 | 3 | Some code for doing language modeling with Keras, in particular for question-answering tasks. I wrote a very long blog post that explains how a lot of this works, which can be found [here](http://benjaminbolte.com/blog/2016/keras-language-modeling.html). 4 | 5 | ### Stuff that might be of interest 6 | 7 | - `attention_lstm.py`: Attentional LSTM, based on one of the papers referenced in the blog post and others. One application used it for [image captioning](http://arxiv.org/pdf/1502.03044.pdf). It is initialized with an attention vector which provides the attention component for the neural network. 8 | - `insurance_qa_eval.py`: Evaluation framework for the InsuranceQA dataset. To get this working, clone the [data repository](https://github.com/codekansas/insurance_qa_python) and set the `INSURANCE_QA` environment variable to the cloned repository. Changing `config` will adjust how the model is trained. 9 | - `keras_models.py`: The `LanguageModel` class uses the `config` settings to generate a training model and a testing model. The model can be trained by passing a question vector, a ground truth answer vector, and a bad answer vector to `fit`. Then `predict` calculates the similarity between a question and answer.
Override the `build` method with whatever language model you want in order to get a trainable model. Examples are provided at the bottom of the file, including the `EmbeddingModel`, `ConvolutionModel`, and `ConvolutionalLSTM` (see the usage sketch below). 10 | 11 | ### Getting Started 12 | 13 | ````bash 14 | # Install Keras (may also need dependencies) 15 | git clone https://github.com/fchollet/keras 16 | cd keras 17 | sudo python setup.py install 18 | 19 | # Clone InsuranceQA dataset 20 | git clone https://github.com/codekansas/insurance_qa_python 21 | export INSURANCE_QA=$(pwd)/insurance_qa_python 22 | 23 | # Run insurance_qa_eval.py 24 | git clone https://github.com/codekansas/keras-language-modeling 25 | cd keras-language-modeling/ 26 | python insurance_qa_eval.py 27 | ```` 28 | 29 | Alternatively, I wrote a script to get started on a Google Cloud Platform instance (Ubuntu 16.04), which can be run via 30 | 31 | ````bash 32 | cd ~ 33 | git clone https://github.com/codekansas/keras-language-modeling 34 | cd keras-language-modeling 35 | source install.sh 36 | ```` 37 | 38 | I've been working on making these models available out-of-the-box. You need to install the Git branch of Keras (and maybe make some modifications) in order to run some of these models; the Keras project can be found [here](https://github.com/fchollet/keras). 39 | 40 | The runnable program is `insurance_qa_eval.py`. This will create a `models/` directory which stores a history of the model's weights as it trains. You need to set the `INSURANCE_QA` environment variable to tell it where the InsuranceQA dataset is. 41 | 42 | Finally, my setup (which I think is pretty common) is to have an SSD with my operating system, and an HDD with larger data files. So I would recommend creating a `models/` symlink from the project directory to somewhere on your HDD, if you have a similar setup. 43 | 44 | ### Serving to a port 45 | 46 | I added a command-line argument that uses Flask to serve the training log to a port. Once you've [installed Flask](http://flask.pocoo.org/docs/0.11/installation/), you can run: 47 | 48 | ````bash 49 | python insurance_qa_eval.py serve 50 | ```` 51 | 52 | This is useful in combination with [ngrok](https://ngrok.com/) for monitoring training progress away from your desktop.
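### Example usage

As a rough sketch of how the `LanguageModel` classes fit together (the real training loop lives in `insurance_qa_eval.py`; the config values below mirror the ones defined there, while the toy index arrays, batch size, and epoch count are made up for illustration). The sketch assumes `word2vec_100_dim.embeddings` has already been generated with `generate_insurance_qa_embeddings.py`:

````python
import numpy as np

from keras_models import ConvolutionModel

# Illustrative config; insurance_qa_eval.py holds the settings actually used for the reported runs.
conf = {
    'n_words': 22353,
    'question_len': 150,
    'answer_len': 150,
    'margin': 0.009,
    'initial_embed_weights': 'word2vec_100_dim.embeddings',  # produced by generate_insurance_qa_embeddings.py
    'similarity': {'mode': 'cosine', 'dropout': 0.5},
}

model = ConvolutionModel(conf)
model.compile(optimizer='adam')

# Toy batches of padded word-index arrays; real data comes from the InsuranceQA pickles.
questions = np.random.randint(1, conf['n_words'], size=(32, conf['question_len']))
good_answers = np.random.randint(1, conf['n_words'], size=(32, conf['answer_len']))
bad_answers = np.random.randint(1, conf['n_words'], size=(32, conf['answer_len']))

# Training minimizes relu(margin - sim(question, good) + sim(question, bad)).
model.fit([questions, good_answers, bad_answers], epochs=1, batch_size=32)

# Prediction returns the similarity between each question/answer pair.
similarities = model.predict([questions, good_answers])
````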
53 | 54 | ### Additionally 55 | 56 | - The official implementation can be found [here](https://github.com/white127/insuranceQA-cnn-lstm) 57 | 58 | ### Data 59 | 60 | - L6 from [Yahoo Webscope](http://webscope.sandbox.yahoo.com/) 61 | - [InsuranceQA data](https://github.com/shuzi/insuranceQA) 62 | - [Pythonic version](https://github.com/codekansas/insurance_qa_python) 63 | 64 | -------------------------------------------------------------------------------- /attention_lstm.py: -------------------------------------------------------------------------------- 1 | from __future__ import absolute_import 2 | 3 | from keras import backend as K 4 | from keras.engine import InputSpec 5 | from keras.layers import LSTM, activations, Wrapper 6 | 7 | 8 | class AttentionLSTM(LSTM): 9 | def __init__(self, output_dim, attention_vec, attn_activation='tanh', single_attention_param=False, **kwargs): 10 | self.attention_vec = attention_vec 11 | self.attn_activation = activations.get(attn_activation) 12 | self.single_attention_param = single_attention_param 13 | 14 | super(AttentionLSTM, self).__init__(output_dim, **kwargs) 15 | 16 | def build(self, input_shape): 17 | super(AttentionLSTM, self).build(input_shape) 18 | 19 | if hasattr(self.attention_vec, '_keras_shape'): 20 | attention_dim = self.attention_vec._keras_shape[1] 21 | else: 22 | raise Exception('Layer could not be build: No information about expected input shape.') 23 | 24 | self.U_a = self.inner_init((self.output_dim, self.output_dim), 25 | name='{}_U_a'.format(self.name)) 26 | self.b_a = K.zeros((self.output_dim,), name='{}_b_a'.format(self.name)) 27 | 28 | self.U_m = self.inner_init((attention_dim, self.output_dim), 29 | name='{}_U_m'.format(self.name)) 30 | self.b_m = K.zeros((self.output_dim,), name='{}_b_m'.format(self.name)) 31 | 32 | if self.single_attention_param: 33 | self.U_s = self.inner_init((self.output_dim, 1), 34 | name='{}_U_s'.format(self.name)) 35 | self.b_s = K.zeros((1,), name='{}_b_s'.format(self.name)) 36 | else: 37 | self.U_s = self.inner_init((self.output_dim, self.output_dim), 38 | name='{}_U_s'.format(self.name)) 39 | self.b_s = K.zeros((self.output_dim,), name='{}_b_s'.format(self.name)) 40 | 41 | self.trainable_weights += [self.U_a, self.U_m, self.U_s, self.b_a, self.b_m, self.b_s] 42 | 43 | if self.initial_weights is not None: 44 | self.set_weights(self.initial_weights) 45 | del self.initial_weights 46 | 47 | def step(self, x, states): 48 | h, [h, c] = super(AttentionLSTM, self).step(x, states) 49 | attention = states[4] 50 | 51 | m = self.attn_activation(K.dot(h, self.U_a) * attention + self.b_a) 52 | # Intuitively it makes more sense to use a sigmoid (was getting some NaN problems 53 | # which I think might have been caused by the exponential function -> gradients blow up) 54 | s = K.sigmoid(K.dot(m, self.U_s) + self.b_s) 55 | 56 | if self.single_attention_param: 57 | h = h * K.repeat_elements(s, self.output_dim, axis=1) 58 | else: 59 | h = h * s 60 | 61 | return h, [h, c] 62 | 63 | def get_constants(self, x): 64 | constants = super(AttentionLSTM, self).get_constants(x) 65 | constants.append(K.dot(self.attention_vec, self.U_m) + self.b_m) 66 | return constants 67 | 68 | 69 | class AttentionLSTMWrapper(Wrapper): 70 | def __init__(self, layer, attention_vec, attn_activation='tanh', single_attention_param=False, **kwargs): 71 | assert isinstance(layer, LSTM) 72 | self.supports_masking = True 73 | self.attention_vec = attention_vec 74 | self.attn_activation = activations.get(attn_activation) 75 | 
self.single_attention_param = single_attention_param 76 | super(AttentionLSTMWrapper, self).__init__(layer, **kwargs) 77 | 78 | def build(self, input_shape): 79 | assert len(input_shape) >= 3 80 | self.input_spec = [InputSpec(shape=input_shape)] 81 | 82 | if not self.layer.built: 83 | self.layer.build(input_shape) 84 | self.layer.built = True 85 | 86 | super(AttentionLSTMWrapper, self).build() 87 | 88 | if hasattr(self.attention_vec, '_keras_shape'): 89 | attention_dim = self.attention_vec._keras_shape[1] 90 | else: 91 | raise Exception('Layer could not be build: No information about expected input shape.') 92 | 93 | self.U_a = self.layer.inner_init((self.layer.output_dim, self.layer.output_dim), name='{}_U_a'.format(self.name)) 94 | self.b_a = K.zeros((self.layer.output_dim,), name='{}_b_a'.format(self.name)) 95 | 96 | self.U_m = self.layer.inner_init((attention_dim, self.layer.output_dim), name='{}_U_m'.format(self.name)) 97 | self.b_m = K.zeros((self.layer.output_dim,), name='{}_b_m'.format(self.name)) 98 | 99 | if self.single_attention_param: 100 | self.U_s = self.layer.inner_init((self.layer.output_dim, 1), name='{}_U_s'.format(self.name)) 101 | self.b_s = K.zeros((1,), name='{}_b_s'.format(self.name)) 102 | else: 103 | self.U_s = self.layer.inner_init((self.layer.output_dim, self.layer.output_dim), name='{}_U_s'.format(self.name)) 104 | self.b_s = K.zeros((self.layer.output_dim,), name='{}_b_s'.format(self.name)) 105 | 106 | self.trainable_weights = [self.U_a, self.U_m, self.U_s, self.b_a, self.b_m, self.b_s] 107 | 108 | def get_output_shape_for(self, input_shape): 109 | return self.layer.get_output_shape_for(input_shape) 110 | 111 | def step(self, x, states): 112 | h, [h, c] = self.layer.step(x, states) 113 | attention = states[4] 114 | 115 | m = self.attn_activation(K.dot(h, self.U_a) * attention + self.b_a) 116 | s = K.sigmoid(K.dot(m, self.U_s) + self.b_s) 117 | 118 | if self.single_attention_param: 119 | h = h * K.repeat_elements(s, self.layer.output_dim, axis=1) 120 | else: 121 | h = h * s 122 | 123 | return h, [h, c] 124 | 125 | def get_constants(self, x): 126 | constants = self.layer.get_constants(x) 127 | constants.append(K.dot(self.attention_vec, self.U_m) + self.b_m) 128 | return constants 129 | 130 | def call(self, x, mask=None): 131 | # input shape: (nb_samples, time (padded with zeros), input_dim) 132 | # note that the .build() method of subclasses MUST define 133 | # self.input_spec with a complete input shape. 134 | input_shape = self.input_spec[0].shape 135 | if K._BACKEND == 'tensorflow': 136 | if not input_shape[1]: 137 | raise Exception('When using TensorFlow, you should define ' 138 | 'explicitly the number of timesteps of ' 139 | 'your sequences.\n' 140 | 'If your first layer is an Embedding, ' 141 | 'make sure to pass it an "input_length" ' 142 | 'argument. Otherwise, make sure ' 143 | 'the first layer has ' 144 | 'an "input_shape" or "batch_input_shape" ' 145 | 'argument, including the time axis. 
' 146 | 'Found input shape at layer ' + self.name + 147 | ': ' + str(input_shape)) 148 | if self.layer.stateful: 149 | initial_states = self.layer.states 150 | else: 151 | initial_states = self.layer.get_initial_states(x) 152 | constants = self.get_constants(x) 153 | preprocessed_input = self.layer.preprocess_input(x) 154 | 155 | last_output, outputs, states = K.rnn(self.step, preprocessed_input, 156 | initial_states, 157 | go_backwards=self.layer.go_backwards, 158 | mask=mask, 159 | constants=constants, 160 | unroll=self.layer.unroll, 161 | input_length=input_shape[1]) 162 | if self.layer.stateful: 163 | self.updates = [] 164 | for i in range(len(states)): 165 | self.updates.append((self.layer.states[i], states[i])) 166 | 167 | if self.layer.return_sequences: 168 | return outputs 169 | else: 170 | return last_output 171 | -------------------------------------------------------------------------------- /generate_insurance_qa_embeddings.py: -------------------------------------------------------------------------------- 1 | #!/usr/bin/env python 2 | 3 | """ 4 | Command-line script for generating embeddings 5 | Useful if you want to generate larger embeddings for some models 6 | """ 7 | 8 | from __future__ import print_function 9 | 10 | import os 11 | import sys 12 | import random 13 | import pickle 14 | import argparse 15 | import logging 16 | 17 | random.seed(42) 18 | 19 | 20 | def load(path, name): 21 | return pickle.load(open(os.path.join(path, name), 'rb')) 22 | 23 | 24 | def revert(vocab, indices): 25 | return [vocab.get(i, 'X') for i in indices] 26 | 27 | try: 28 | data_path = os.environ['INSURANCE_QA'] 29 | except KeyError: 30 | print('INSURANCE_QA is not set. Set it to your clone of https://github.com/codekansas/insurance_qa_python') 31 | sys.exit(1) 32 | 33 | # parse arguments 34 | parser = argparse.ArgumentParser(description='Generate embeddings for the InsuranceQA dataset') 35 | parser.add_argument('--iter', metavar='N', type=int, default=10, help='number of times to run') 36 | parser.add_argument('--size', metavar='D', type=int, default=100, help='dimensions in embedding') 37 | args = parser.parse_args() 38 | 39 | # configure logging 40 | logger = logging.getLogger(os.path.basename(sys.argv[0])) 41 | logging.basicConfig(format='%(asctime)s: %(levelname)s: %(message)s') 42 | logging.root.setLevel(level=logging.INFO) 43 | logger.info('running %s' % ' '.join(sys.argv)) 44 | 45 | # imports go down here because they are time-consuming 46 | from gensim.models import Word2Vec 47 | from keras_models import * 48 | 49 | vocab = load(data_path, 'vocabulary') 50 | 51 | answers = load(data_path, 'answers') 52 | sentences = [revert(vocab, txt) for txt in answers.values()] 53 | sentences += [revert(vocab, q['question']) for q in load(data_path, 'train')] 54 | 55 | # run model 56 | model = Word2Vec(sentences, size=args.size, min_count=5, window=5, sg=1, iter=args.iter) 57 | weights = model.syn0 58 | d = dict([(k, v.index) for k, v in model.vocab.items()]) 59 | emb = np.zeros(shape=(len(vocab)+1, args.size), dtype='float32') 60 | 61 | for i, w in vocab.items(): 62 | if w not in d: continue 63 | emb[i, :] = weights[d[w], :] 64 | 65 | np.save(open('word2vec_%d_dim.embeddings' % args.size, 'wb'), emb) 66 | logger.info('saved to "word2vec_%d_dim.embeddings"' % args.size) 67 | 68 | -------------------------------------------------------------------------------- /install.sh: -------------------------------------------------------------------------------- 1 | #!/usr/bin/env bash 2 | 3 | # This script 
will get you up and running on a Google Compute Engine instance 4 | # Ubuntu 16.04 (as many CPUs as you like) 5 | 6 | # exit on failure 7 | # set -e 8 | 9 | # make models directory 10 | if [ ! -d "models/" ]; then 11 | mkdir models 12 | fi 13 | 14 | # install pip (not installed by default on GCE) 15 | sudo apt install python-pip 16 | 17 | # install virtualenv 18 | sudo pip install virtualenv 19 | 20 | # create and activate virtual environment 21 | if [ ! -d "venv" ]; then 22 | virtualenv venv 23 | fi 24 | source venv/bin/activate 25 | 26 | # install h5py 27 | pip install h5py 28 | 29 | # install blas/lapack 30 | sudo apt install libblas-dev liblapack-dev libatlas-base-dev gfortran 31 | 32 | # install tensorflow 33 | export TF_BINARY_URL=https://storage.googleapis.com/tensorflow/linux/cpu/tensorflow-0.10.0rc0-cp27-none-linux_x86_64.whl 34 | pip install --upgrade $TF_BINARY_URL 35 | 36 | # install keras from source in the home directory 37 | export KERAS_DIRECTORY=~/keras 38 | if [ ! -d "${KERAS_DIRECTORY}" ]; then 39 | git clone https://github.com/fchollet/keras ${KERAS_DIRECTORY} 40 | fi 41 | cd $KERAS_DIRECTORY 42 | python setup.py install 43 | cd - 44 | if [ ! -d ~/.keras ]; then 45 | mkdir ~/.keras 46 | fi 47 | echo '{"epsilon": 1e-07, "floatx": "float32", "backend": "tensorflow"}' > ~/.keras/keras.json 48 | 49 | # download insurance qa files 50 | export INSURANCE_QA=~/insurance_qa 51 | if [ ! -d $INSURANCE_QA ]; then 52 | git clone https://github.com/codekansas/insurance_qa_python $INSURANCE_QA 53 | fi 54 | 55 | # alert user that we're done 56 | echo ">==< Successfully installed dependencies >==<" 57 | 58 | -------------------------------------------------------------------------------- /insurance_qa_eval.py: -------------------------------------------------------------------------------- 1 | from __future__ import print_function 2 | 3 | import os 4 | 5 | import sys 6 | import random 7 | from time import strftime, gmtime, time 8 | 9 | import pickle 10 | import json 11 | 12 | import thread 13 | from scipy.stats import rankdata 14 | 15 | random.seed(42) 16 | 17 | 18 | def log(x): 19 | print(x) 20 | 21 | 22 | class Evaluator: 23 | def __init__(self, conf, model, optimizer=None): 24 | try: 25 | data_path = os.environ['INSURANCE_QA'] 26 | except KeyError: 27 | print("INSURANCE_QA is not set. 
Set it to your clone of https://github.com/codekansas/insurance_qa_python") 28 | sys.exit(1) 29 | if isinstance(conf, str): 30 | conf = json.load(open(conf, 'rb')) 31 | self.model = model(conf) 32 | self.path = data_path 33 | self.conf = conf 34 | self.params = conf['training'] 35 | optimizer = self.params['optimizer'] if optimizer is None else optimizer 36 | self.model.compile(optimizer) 37 | self.answers = self.load('answers') # self.load('generated') 38 | self._vocab = None 39 | self._reverse_vocab = None 40 | self._eval_sets = None 41 | 42 | ##### Resources ##### 43 | 44 | def load(self, name): 45 | return pickle.load(open(os.path.join(self.path, name), 'rb')) 46 | 47 | def vocab(self): 48 | if self._vocab is None: 49 | self._vocab = self.load('vocabulary') 50 | return self._vocab 51 | 52 | def reverse_vocab(self): 53 | if self._reverse_vocab is None: 54 | vocab = self.vocab() 55 | self._reverse_vocab = dict((v.lower(), k) for k, v in vocab.items()) 56 | return self._reverse_vocab 57 | 58 | ##### Loading / saving ##### 59 | 60 | def save_epoch(self, epoch): 61 | if not os.path.exists('models/'): 62 | os.makedirs('models/') 63 | self.model.save_weights('models/weights_epoch_%d.h5' % epoch, overwrite=True) 64 | 65 | def load_epoch(self, epoch): 66 | assert os.path.exists('models/weights_epoch_%d.h5' % epoch), 'Weights at epoch %d not found' % epoch 67 | self.model.load_weights('models/weights_epoch_%d.h5' % epoch) 68 | 69 | ##### Converting / reverting ##### 70 | 71 | def convert(self, words): 72 | rvocab = self.reverse_vocab() 73 | if type(words) == str: 74 | words = words.strip().lower().split(' ') 75 | return [rvocab.get(w, 0) for w in words] 76 | 77 | def revert(self, indices): 78 | vocab = self.vocab() 79 | return [vocab.get(i, 'X') for i in indices] 80 | 81 | ##### Padding ##### 82 | 83 | def padq(self, data): 84 | return self.pad(data, self.conf.get('question_len', None)) 85 | 86 | def pada(self, data): 87 | return self.pad(data, self.conf.get('answer_len', None)) 88 | 89 | def pad(self, data, len=None): 90 | from keras.preprocessing.sequence import pad_sequences 91 | return pad_sequences(data, maxlen=len, padding='post', truncating='post', value=0) 92 | 93 | ##### Training ##### 94 | 95 | def get_time(self): 96 | return strftime('%Y-%m-%d %H:%M:%S', gmtime()) 97 | 98 | def train(self): 99 | batch_size = self.params['batch_size'] 100 | nb_epoch = self.params['nb_epoch'] 101 | validation_split = self.params['validation_split'] 102 | 103 | training_set = self.load('train') 104 | # top_50 = self.load('top_50') 105 | 106 | questions = list() 107 | good_answers = list() 108 | indices = list() 109 | 110 | for j, q in enumerate(training_set): 111 | questions += [q['question']] * len(q['answers']) 112 | good_answers += [self.answers[i] for i in q['answers']] 113 | indices += [j] * len(q['answers']) 114 | log('Began training at %s on %d samples' % (self.get_time(), len(questions))) 115 | 116 | questions = self.padq(questions) 117 | good_answers = self.pada(good_answers) 118 | 119 | val_loss = {'loss': 1., 'epoch': 0} 120 | 121 | # def get_bad_samples(indices, top_50): 122 | # return [self.answers[random.choice(top_50[i])] for i in indices] 123 | 124 | for i in range(1, nb_epoch+1): 125 | # sample from all answers to get bad answers 126 | # if i % 2 == 0: 127 | # bad_answers = self.pada(random.sample(self.answers.values(), len(good_answers))) 128 | # else: 129 | # bad_answers = self.pada(get_bad_samples(indices, top_50)) 130 | bad_answers = self.pada(random.sample(self.answers.values(), 
len(good_answers))) 131 | 132 | print('Fitting epoch %d' % i, file=sys.stderr) 133 | hist = self.model.fit([questions, good_answers, bad_answers], epochs=1, batch_size=batch_size, 134 | validation_split=validation_split, verbose=1) 135 | 136 | if hist.history['val_loss'][0] < val_loss['loss']: 137 | val_loss = {'loss': hist.history['val_loss'][0], 'epoch': i} 138 | log('%s -- Epoch %d ' % (self.get_time(), i) + 139 | 'Loss = %.4f, Validation Loss = %.4f ' % (hist.history['loss'][0], hist.history['val_loss'][0]) + 140 | '(Best: Loss = %.4f, Epoch = %d)' % (val_loss['loss'], val_loss['epoch'])) 141 | 142 | self.save_epoch(i) 143 | 144 | return val_loss 145 | 146 | ##### Evaluation ##### 147 | 148 | def prog_bar(self, so_far, total, n_bars=20): 149 | n_complete = int(so_far * n_bars / total) 150 | if n_complete >= n_bars - 1: 151 | print('\r[' + '=' * n_bars + ']', end='', file=sys.stderr) 152 | else: 153 | s = '\r[' + '=' * (n_complete - 1) + '>' + '.' * (n_bars - n_complete) + ']' 154 | print(s, end='', file=sys.stderr) 155 | 156 | def eval_sets(self): 157 | if self._eval_sets is None: 158 | self._eval_sets = dict([(s, self.load(s)) for s in ['dev', 'test1', 'test2']]) 159 | return self._eval_sets 160 | 161 | def get_score(self, verbose=False): 162 | top1_ls = [] 163 | mrr_ls = [] 164 | for name, data in self.eval_sets().items(): 165 | print('----- %s -----' % name) 166 | 167 | random.shuffle(data) 168 | 169 | if 'n_eval' in self.params: 170 | data = data[:self.params['n_eval']] 171 | 172 | c_1, c_2 = 0, 0 173 | 174 | for i, d in enumerate(data): 175 | self.prog_bar(i, len(data)) 176 | 177 | indices = d['good'] + d['bad'] 178 | answers = self.pada([self.answers[i] for i in indices]) 179 | question = self.padq([d['question']] * len(indices)) 180 | 181 | sims = self.model.predict([question, answers]) 182 | 183 | n_good = len(d['good']) 184 | max_r = np.argmax(sims) 185 | max_n = np.argmax(sims[:n_good]) 186 | 187 | r = rankdata(sims, method='max') 188 | 189 | if verbose: 190 | min_r = np.argmin(sims) 191 | amin_r = self.answers[indices[min_r]] 192 | amax_r = self.answers[indices[max_r]] 193 | amax_n = self.answers[indices[max_n]] 194 | 195 | print(' '.join(self.revert(d['question']))) 196 | print('Predicted: ({}) '.format(sims[max_r]) + ' '.join(self.revert(amax_r))) 197 | print('Expected: ({}) Rank = {} '.format(sims[max_n], r[max_n]) + ' '.join(self.revert(amax_n))) 198 | print('Worst: ({})'.format(sims[min_r]) + ' '.join(self.revert(amin_r))) 199 | 200 | c_1 += 1 if max_r == max_n else 0 201 | c_2 += 1 / float(r[max_r] - r[max_n] + 1) 202 | 203 | top1 = c_1 / float(len(data)) 204 | mrr = c_2 / float(len(data)) 205 | 206 | del data 207 | print('Top-1 Precision: %f' % top1) 208 | print('MRR: %f' % mrr) 209 | top1_ls.append(top1) 210 | mrr_ls.append(mrr) 211 | return top1_ls, mrr_ls 212 | 213 | 214 | if __name__ == '__main__': 215 | if len(sys.argv) >= 2 and sys.argv[1] == 'serve': 216 | from flask import Flask 217 | app = Flask(__name__) 218 | port = 5000 219 | lines = list() 220 | def log(x): 221 | lines.append(x) 222 | 223 | @app.route('/') 224 | def home(): 225 | return ('

<html><body><h1>Training Log</h1>' + 226 | ''.join(['<code>{}</code><br/>
'.format(line) for line in lines]) + 227 | '') 228 | 229 | def start_server(): 230 | app.run(debug=False, use_evalex=False, port=port) 231 | 232 | thread.start_new_thread(start_server, tuple()) 233 | print('Serving to port %d' % port, file=sys.stderr) 234 | 235 | import numpy as np 236 | 237 | conf = { 238 | 'n_words': 22353, 239 | 'question_len': 150, 240 | 'answer_len': 150, 241 | 'margin': 0.009, 242 | 'initial_embed_weights': 'word2vec_100_dim.embeddings', 243 | 244 | 'training': { 245 | 'batch_size': 100, 246 | 'nb_epoch': 2000, 247 | 'validation_split': 0.1, 248 | }, 249 | 250 | 'similarity': { 251 | 'mode': 'cosine', 252 | 'gamma': 1, 253 | 'c': 1, 254 | 'd': 2, 255 | 'dropout': 0.5, 256 | } 257 | } 258 | 259 | from keras_models import EmbeddingModel, ConvolutionModel, ConvolutionalLSTM 260 | evaluator = Evaluator(conf, model=ConvolutionModel, optimizer='adam') 261 | 262 | # train the model 263 | best_loss = evaluator.train() 264 | 265 | # evaluate mrr for a particular epoch 266 | evaluator.load_epoch(best_loss['epoch']) 267 | top1, mrr = evaluator.get_score(verbose=False) 268 | log(' - Top-1 Precision:') 269 | log(' - %.3f on test 1' % top1[0]) 270 | log(' - %.3f on test 2' % top1[1]) 271 | log(' - %.3f on dev' % top1[2]) 272 | log(' - MRR:') 273 | log(' - %.3f on test 1' % mrr[0]) 274 | log(' - %.3f on test 2' % mrr[1]) 275 | log(' - %.3f on dev' % mrr[2]) 276 | -------------------------------------------------------------------------------- /keras_models.py: -------------------------------------------------------------------------------- 1 | from __future__ import print_function 2 | 3 | from abc import abstractmethod 4 | 5 | from keras.engine import Input 6 | from keras.layers import merge, Embedding, Dropout, Conv1D, Lambda, LSTM, Dense, concatenate, TimeDistributed 7 | from keras import backend as K 8 | from keras.models import Model 9 | 10 | import numpy as np 11 | 12 | 13 | class LanguageModel: 14 | def __init__(self, config): 15 | self.question = Input(shape=(config['question_len'],), dtype='int32', name='question_base') 16 | self.answer_good = Input(shape=(config['answer_len'],), dtype='int32', name='answer_good_base') 17 | self.answer_bad = Input(shape=(config['answer_len'],), dtype='int32', name='answer_bad_base') 18 | 19 | self.config = config 20 | self.params = config.get('similarity', dict()) 21 | 22 | # initialize a bunch of variables that will be set later 23 | self._models = None 24 | self._similarities = None 25 | self._answer = None 26 | self._qa_model = None 27 | 28 | self.training_model = None 29 | self.prediction_model = None 30 | 31 | def get_answer(self): 32 | if self._answer is None: 33 | self._answer = Input(shape=(self.config['answer_len'],), dtype='int32', name='answer') 34 | return self._answer 35 | 36 | @abstractmethod 37 | def build(self): 38 | return 39 | 40 | def get_similarity(self): 41 | ''' Specify similarity in configuration under 'similarity' -> 'mode' 42 | If a parameter is needed for the model, specify it in 'similarity' 43 | 44 | Example configuration: 45 | 46 | config = { 47 | ... other parameters ... 
48 | 'similarity': { 49 | 'mode': 'gesd', 50 | 'gamma': 1, 51 | 'c': 1, 52 | } 53 | } 54 | 55 | cosine: dot(a, b) / sqrt(dot(a, a) * dot(b, b)) 56 | polynomial: (gamma * dot(a, b) + c) ^ d 57 | sigmoid: tanh(gamma * dot(a, b) + c) 58 | rbf: exp(-gamma * l2_norm(a-b) ^ 2) 59 | euclidean: 1 / (1 + l2_norm(a - b)) 60 | exponential: exp(-gamma * l2_norm(a - b)) 61 | gesd: euclidean * sigmoid 62 | aesd: (euclidean + sigmoid) / 2 63 | ''' 64 | 65 | params = self.params 66 | similarity = params['mode'] 67 | 68 | dot = lambda a, b: K.batch_dot(a, b, axes=1) 69 | l2_norm = lambda a, b: K.sqrt(K.sum(K.square(a - b), axis=1, keepdims=True)) 70 | 71 | if similarity == 'cosine': 72 | return lambda x: dot(x[0], x[1]) / K.maximum(K.sqrt(dot(x[0], x[0]) * dot(x[1], x[1])), K.epsilon()) 73 | elif similarity == 'polynomial': 74 | return lambda x: (params['gamma'] * dot(x[0], x[1]) + params['c']) ** params['d'] 75 | elif similarity == 'sigmoid': 76 | return lambda x: K.tanh(params['gamma'] * dot(x[0], x[1]) + params['c']) 77 | elif similarity == 'rbf': 78 | return lambda x: K.exp(-1 * params['gamma'] * l2_norm(x[0], x[1]) ** 2) 79 | elif similarity == 'euclidean': 80 | return lambda x: 1 / (1 + l2_norm(x[0], x[1])) 81 | elif similarity == 'exponential': 82 | return lambda x: K.exp(-1 * params['gamma'] * l2_norm(x[0], x[1])) 83 | elif similarity == 'gesd': 84 | euclidean = lambda x: 1 / (1 + l2_norm(x[0], x[1])) 85 | sigmoid = lambda x: 1 / (1 + K.exp(-1 * params['gamma'] * (dot(x[0], x[1]) + params['c']))) 86 | return lambda x: euclidean(x) * sigmoid(x) 87 | elif similarity == 'aesd': 88 | euclidean = lambda x: 0.5 / (1 + l2_norm(x[0], x[1])) 89 | sigmoid = lambda x: 0.5 / (1 + K.exp(-1 * params['gamma'] * (dot(x[0], x[1]) + params['c']))) 90 | return lambda x: euclidean(x) + sigmoid(x) 91 | else: 92 | raise Exception('Invalid similarity: {}'.format(similarity)) 93 | 94 | def get_qa_model(self): 95 | if self._models is None: 96 | self._models = self.build() 97 | 98 | if self._qa_model is None: 99 | question_output, answer_output = self._models 100 | dropout = Dropout(self.params.get('dropout', 0.2)) 101 | similarity = self.get_similarity() 102 | # qa_model = merge([dropout(question_output), dropout(answer_output)], 103 | # mode=similarity, output_shape=lambda _: (None, 1)) 104 | qa_model = Lambda(similarity, output_shape=lambda _: (None, 1))([dropout(question_output), 105 | dropout(answer_output)]) 106 | self._qa_model = Model(inputs=[self.question, self.get_answer()], outputs=qa_model, name='qa_model') 107 | 108 | return self._qa_model 109 | 110 | def compile(self, optimizer, **kwargs): 111 | qa_model = self.get_qa_model() 112 | 113 | good_similarity = qa_model([self.question, self.answer_good]) 114 | bad_similarity = qa_model([self.question, self.answer_bad]) 115 | 116 | # loss = merge([good_similarity, bad_similarity], 117 | # mode=lambda x: K.relu(self.config['margin'] - x[0] + x[1]), 118 | # output_shape=lambda x: x[0]) 119 | 120 | loss = Lambda(lambda x: K.relu(self.config['margin'] - x[0] + x[1]), 121 | output_shape=lambda x: x[0])([good_similarity, bad_similarity]) 122 | 123 | self.prediction_model = Model(inputs=[self.question, self.answer_good], outputs=good_similarity, 124 | name='prediction_model') 125 | self.prediction_model.compile(loss=lambda y_true, y_pred: y_pred, optimizer=optimizer, **kwargs) 126 | 127 | self.training_model = Model(inputs=[self.question, self.answer_good, self.answer_bad], outputs=loss, 128 | name='training_model') 129 | self.training_model.compile(loss=lambda y_true, 
y_pred: y_pred, optimizer=optimizer, **kwargs) 130 | 131 | def fit(self, x, **kwargs): 132 | assert self.training_model is not None, 'Must compile the model before fitting data' 133 | y = np.zeros(shape=(x[0].shape[0],)) # doesn't get used 134 | return self.training_model.fit(x, y, **kwargs) 135 | 136 | def predict(self, x): 137 | assert self.prediction_model is not None and isinstance(self.prediction_model, Model) 138 | return self.prediction_model.predict_on_batch(x) 139 | 140 | def save_weights(self, file_name, **kwargs): 141 | assert self.prediction_model is not None, 'Must compile the model before saving weights' 142 | self.prediction_model.save_weights(file_name, **kwargs) 143 | 144 | def load_weights(self, file_name, **kwargs): 145 | assert self.prediction_model is not None, 'Must compile the model loading weights' 146 | self.prediction_model.load_weights(file_name, **kwargs) 147 | 148 | 149 | class EmbeddingModel(LanguageModel): 150 | def build(self): 151 | question = self.question 152 | answer = self.get_answer() 153 | 154 | # add embedding layers 155 | weights = np.load(self.config['initial_embed_weights']) 156 | embedding = Embedding(input_dim=self.config['n_words'], 157 | output_dim=weights.shape[1], 158 | mask_zero=True, 159 | # dropout=0.2, 160 | weights=[weights]) 161 | question_embedding = embedding(question) 162 | answer_embedding = embedding(answer) 163 | 164 | # maxpooling 165 | maxpool = Lambda(lambda x: K.max(x, axis=1, keepdims=False), output_shape=lambda x: (x[0], x[2])) 166 | maxpool.supports_masking = True 167 | question_pool = maxpool(question_embedding) 168 | answer_pool = maxpool(answer_embedding) 169 | 170 | return question_pool, answer_pool 171 | 172 | 173 | class ConvolutionModel(LanguageModel): 174 | def build(self): 175 | assert self.config['question_len'] == self.config['answer_len'] 176 | 177 | question = self.question 178 | answer = self.get_answer() 179 | 180 | # add embedding layers 181 | weights = np.load(self.config['initial_embed_weights']) 182 | embedding = Embedding(input_dim=self.config['n_words'], 183 | output_dim=weights.shape[1], 184 | weights=[weights]) 185 | question_embedding = embedding(question) 186 | answer_embedding = embedding(answer) 187 | 188 | hidden_layer = TimeDistributed(Dense(200, activation='tanh')) 189 | 190 | question_hl = hidden_layer(question_embedding) 191 | answer_hl = hidden_layer(answer_embedding) 192 | 193 | # cnn 194 | cnns = [Conv1D(kernel_size=kernel_size, 195 | filters=1000, 196 | activation='tanh', 197 | padding='same') for kernel_size in [2, 3, 5, 7]] 198 | # question_cnn = merge([cnn(question_embedding) for cnn in cnns], mode='concat') 199 | question_cnn = concatenate([cnn(question_hl) for cnn in cnns], axis=-1) 200 | # answer_cnn = merge([cnn(answer_embedding) for cnn in cnns], mode='concat') 201 | answer_cnn = concatenate([cnn(answer_hl) for cnn in cnns], axis=-1) 202 | 203 | # maxpooling 204 | maxpool = Lambda(lambda x: K.max(x, axis=1, keepdims=False), output_shape=lambda x: (x[0], x[2])) 205 | maxpool.supports_masking = True 206 | # enc = Dense(100, activation='tanh') 207 | # question_pool = enc(maxpool(question_cnn)) 208 | # answer_pool = enc(maxpool(answer_cnn)) 209 | question_pool = maxpool(question_cnn) 210 | answer_pool = maxpool(answer_cnn) 211 | 212 | return question_pool, answer_pool 213 | 214 | 215 | class ConvolutionalLSTM(LanguageModel): 216 | def build(self): 217 | question = self.question 218 | answer = self.get_answer() 219 | 220 | # add embedding layers 221 | weights = 
np.load(self.config['initial_embed_weights']) 222 | embedding = Embedding(input_dim=self.config['n_words'], 223 | output_dim=weights.shape[1], 224 | weights=[weights]) 225 | question_embedding = embedding(question) 226 | answer_embedding = embedding(answer) 227 | 228 | f_rnn = LSTM(141, return_sequences=True, implementation=1) 229 | b_rnn = LSTM(141, return_sequences=True, implementation=1, go_backwards=True) 230 | 231 | qf_rnn = f_rnn(question_embedding) 232 | qb_rnn = b_rnn(question_embedding) 233 | # question_pool = merge([qf_rnn, qb_rnn], mode='concat', concat_axis=-1) 234 | question_pool = concatenate([qf_rnn, qb_rnn], axis=-1) 235 | 236 | af_rnn = f_rnn(answer_embedding) 237 | ab_rnn = b_rnn(answer_embedding) 238 | # answer_pool = merge([af_rnn, ab_rnn], mode='concat', concat_axis=-1) 239 | answer_pool = concatenate([af_rnn, ab_rnn], axis=-1) 240 | 241 | # cnn 242 | cnns = [Conv1D(kernel_size=kernel_size, 243 | filters=500, 244 | activation='tanh', 245 | padding='same') for kernel_size in [1, 2, 3, 5]] 246 | # question_cnn = merge([cnn(question_pool) for cnn in cnns], mode='concat') 247 | question_cnn = concatenate([cnn(question_pool) for cnn in cnns], axis=-1) 248 | # answer_cnn = merge([cnn(answer_pool) for cnn in cnns], mode='concat') 249 | answer_cnn = concatenate([cnn(answer_pool) for cnn in cnns], axis=-1) 250 | 251 | maxpool = Lambda(lambda x: K.max(x, axis=1, keepdims=False), output_shape=lambda x: (x[0], x[2])) 252 | maxpool.supports_masking = True 253 | question_pool = maxpool(question_cnn) 254 | answer_pool = maxpool(answer_cnn) 255 | 256 | return question_pool, answer_pool 257 | 258 | 259 | class AttentionModel(LanguageModel): 260 | def build(self): 261 | question = self.question 262 | answer = self.get_answer() 263 | 264 | # add embedding layers 265 | weights = np.load(self.config['initial_embed_weights']) 266 | embedding = Embedding(input_dim=self.config['n_words'], 267 | output_dim=weights.shape[1], 268 | # mask_zero=True, 269 | weights=[weights]) 270 | question_embedding = embedding(question) 271 | answer_embedding = embedding(answer) 272 | 273 | # question rnn part 274 | f_rnn = LSTM(141, return_sequences=True, consume_less='mem') 275 | b_rnn = LSTM(141, return_sequences=True, consume_less='mem', go_backwards=True) 276 | question_f_rnn = f_rnn(question_embedding) 277 | question_b_rnn = b_rnn(question_embedding) 278 | 279 | # maxpooling 280 | maxpool = Lambda(lambda x: K.max(x, axis=1, keepdims=False), output_shape=lambda x: (x[0], x[2])) 281 | maxpool.supports_masking = True 282 | question_pool = merge([maxpool(question_f_rnn), maxpool(question_b_rnn)], mode='concat', concat_axis=-1) 283 | 284 | # answer rnn part 285 | from attention_lstm import AttentionLSTMWrapper 286 | f_rnn = AttentionLSTMWrapper(f_rnn, question_pool, single_attention_param=True) 287 | b_rnn = AttentionLSTMWrapper(b_rnn, question_pool, single_attention_param=True) 288 | 289 | answer_f_rnn = f_rnn(answer_embedding) 290 | answer_b_rnn = b_rnn(answer_embedding) 291 | answer_pool = merge([maxpool(answer_f_rnn), maxpool(answer_b_rnn)], mode='concat', concat_axis=-1) 292 | 293 | return question_pool, answer_pool 294 | -------------------------------------------------------------------------------- /results.notes: -------------------------------------------------------------------------------- 1 | Best results achieved for each model: 2 | 3 | Embedding + Max Pooling: 4 | - Top 1 Precision: 5 | - 0.492 on test 1 6 | - 0.483 on test 2 7 | - 0.495 on dev 8 | - MRR: 9 | - 0.624 on test 1 10 | - 0.611 on 
test 2 11 | - 0.624 on dev 12 | 13 | Attentional LSTM + Max Pooling: 14 | - Top 1 precision: 15 | - 0.480 on test 1 16 | - 0.465 on test 2 17 | - 0.487 on dev 18 | - MRR: 19 | - 0.627 on test 1 20 | - 0.613 on test 2 21 | - 0.635 on dev 22 | 23 | Unsupervised RNN language model + trained embeddings: 24 | - Top 1 precision: 25 | - 0.546 on test 1 26 | - 0.527 on test 2 27 | - 0.552 on dev 28 | - MRR: 29 | - 0.670 on test 1 30 | - 0.651 on test 2 31 | - 0.671 on dev 32 | 33 | Training ConvolutionalLSTM model for a long time (~4 days): 34 | - Top-1 Precision: 35 | - 0.564 on test 1 36 | - 0.543 on test 2 37 | - 0.573 on dev 38 | - MRR: 39 | - 0.681 on test 1 40 | - 0.661 on test 2 41 | - 0.686 on dev 42 | -------------------------------------------------------------------------------- /word2vec_100_dim.embeddings: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/codekansas/keras-language-modeling/14d6c319ad0bd2dea70f401400e7e0e4e6fcb55b/word2vec_100_dim.embeddings --------------------------------------------------------------------------------
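For reference, the `word2vec_100_dim.embeddings` file linked above is simply a NumPy array saved by `generate_insurance_qa_embeddings.py`, with one row per vocabulary index. A minimal sketch of inspecting it (the printed shape is approximate and depends on the vocabulary size and the `--size` argument used when generating it):

````python
import numpy as np

# The embeddings file is a saved NumPy array of shape (vocabulary size + 1, embedding dimension).
emb = np.load('word2vec_100_dim.embeddings')
print(emb.shape)  # e.g. (22354, 100) for the default 100-dimensional embeddings
````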