├── .gitignore
├── README.md
├── config.py
├── decoder.py
├── embedding.py
├── generate.py
├── images
│   ├── ex1.jpg
│   ├── ex2.jpg
│   ├── ex3.jpg
│   └── ex4.jpg
├── search.py
└── skipthoughts.py

/.gitignore:
--------------------------------------------------------------------------------
1 | # Byte-compiled / optimized / DLL files
2 | __pycache__/
3 | *.py[cod]
4 | 
5 | # C extensions
6 | *.so
7 | 
8 | # Distribution / packaging
9 | .Python
10 | env/
11 | build/
12 | develop-eggs/
13 | dist/
14 | downloads/
15 | eggs/
16 | .eggs/
17 | lib/
18 | lib64/
19 | parts/
20 | sdist/
21 | var/
22 | *.egg-info/
23 | .installed.cfg
24 | *.egg
25 | 
26 | # PyInstaller
27 | # Usually these files are written by a python script from a template
28 | # before PyInstaller builds the exe, so as to inject date/other infos into it.
29 | *.manifest
30 | *.spec
31 | 
32 | # Installer logs
33 | pip-log.txt
34 | pip-delete-this-directory.txt
35 | 
36 | # Unit test / coverage reports
37 | htmlcov/
38 | .tox/
39 | .coverage
40 | .coverage.*
41 | .cache
42 | nosetests.xml
43 | coverage.xml
44 | *,cover
45 | 
46 | # Translations
47 | *.mo
48 | *.pot
49 | 
50 | # Django stuff:
51 | *.log
52 | 
53 | # Sphinx documentation
54 | docs/_build/
55 | 
56 | # PyBuilder
57 | target/
58 | 
59 | # Ignore changes to configuration file
60 | config.py
61 | 
--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
1 | # neural-storyteller
2 | 
3 | neural-storyteller is a recurrent neural network that generates little stories about images. This repository contains code for generating stories with your own images, as well as instructions for training new models.
4 | 
5 | 
6 | *We were barely able to catch the breeze at the beach , and it felt as if someone stepped out of my mind . She was in love with him for the first time in months , so she had no intention of escaping . The sun had risen from the ocean , making her feel more alive than normal . She 's beautiful , but the truth is that I do n't know what to do . The sun was just starting to fade away , leaving people scattered around the Atlantic Ocean . I d seen the men in his life , who guided me at the beach once more .*
7 | 
8 | [Samim](http://samim.io/) has made an awesome blog post with lots of results [here](https://medium.com/@samim/generating-stories-about-images-d163ba41e4ed).
9 | 
10 | Some more results from an older model trained on Adventure books can be found [here](http://www.cs.toronto.edu/~rkiros/adv_L.html).
11 | 
12 | The whole approach contains 4 components:
13 | * [skip-thought vectors](https://github.com/ryankiros/skip-thoughts)
14 | * [image-sentence embeddings](https://github.com/ryankiros/visual-semantic-embedding)
15 | * [conditional neural language models](https://github.com/ryankiros/skip-thoughts/tree/master/decoding)
16 | * style shifting (described in this project)
17 | 
18 | The 'style-shifting' operation is what allows our model to transfer standard image captions to the style of stories from novels. The only source of supervision in our models is from [Microsoft COCO](http://mscoco.org/) captions. That is, we did not collect any new training data to directly predict stories given images.
19 | 
20 | Style shifting was inspired by [A Neural Algorithm of Artistic Style](http://arxiv.org/abs/1508.06576), but the technical details are completely different.
21 | 
22 | ## How does it work?
23 | 
24 | We first train a recurrent neural network (RNN) decoder on romance novels. Each passage from a novel is mapped to a skip-thought vector. The RNN then conditions on the skip-thought vector and aims to generate the passage that it has encoded. We use romance novels collected from the BookCorpus [dataset](http://www.cs.toronto.edu/~mbweb/).
25 | 
26 | In parallel to this, we train a visual-semantic embedding between COCO images and captions. In this model, captions and images are mapped into a common vector space. After training, we can embed new images and retrieve captions.
27 | 
28 | Given these models, we need a way to bridge the gap between retrieved image captions and passages in novels. That is, if we had a function F that maps a collection of image caption vectors **x** to a book passage vector F(**x**), then we could feed F(**x**) to the decoder to get our story. There is no such parallel data, so we need to construct F another way.
29 | 
30 | It turns out that skip-thought vectors have some intriguing properties that allow us to construct F in a really simple way. Suppose we have three vectors: an image caption **x**, a "caption style" vector **c**, and a "book style" vector **b**. Then we define F as
31 | 
32 | F(**x**) = **x** - **c** + **b**
33 | 
34 | which intuitively means: keep the "thought" of the caption, but replace the image caption style with that of a story. Then, we simply feed F(**x**) to the decoder.
35 | 
36 | How do we construct **c** and **b**? Here, **c** is the mean of the skip-thought vectors for the Microsoft COCO training captions. We set **b** to be the mean of the skip-thought vectors for romance novel passages that are of length > 100.
37 | 
38 | #### What kind of biases work?
39 | 
40 | Skip-thought vectors are sensitive to:
41 | 
42 | - length (if you bias by really long passages, it will decode really long stories)
43 | - punctuation
44 | - vocabulary
45 | - syntactic style (loosely speaking)
46 | 
47 | For the last point: if you bias using text that is all written the same way, the stories you get will also be written the same way.
48 | 
49 | #### What can the decoder be trained on?
50 | 
51 | We use romance novels, but only because we have over 14 million passages to train on. Anything should work, provided you have a lot of text! If you want to train your own decoder, you can use the code available [here](https://github.com/ryankiros/skip-thoughts/tree/master/decoding). Any models trained there can be substituted here.
52 | 
53 | ## Dependencies
54 | 
55 | This code is written in Python. To use it you will need:
56 | 
57 | * Python 2.7
58 | * A recent version of [NumPy](http://www.numpy.org/) and [SciPy](http://www.scipy.org/)
59 | * [Lasagne](https://github.com/Lasagne/Lasagne)
60 | * A version of Theano that Lasagne supports
61 | 
62 | For running on CPU, you will need to install [Caffe](http://caffe.berkeleyvision.org) and its Python interface.
63 | 
64 | 
65 | ## Getting started
66 | 
67 | You will first need to download some pre-trained models and style vectors. Most of the materials are available in a single compressed file, which you can obtain by running
68 | 
69 |     wget http://www.cs.toronto.edu/~rkiros/neural_storyteller.zip
70 | 
71 | Included are a decoder pre-trained on romance novels, the decoder dictionary, caption and romance style vectors, MS COCO training captions, and a pre-trained image-sentence embedding model.
72 | 
73 | Next, you need to obtain the pre-trained skip-thoughts encoder. Go [here](https://github.com/ryankiros/skip-thoughts) and follow the instructions on the main page to obtain the pre-trained model.
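Once the encoder is set up, you can sanity-check it from an interpreter. This is a minimal check, assuming the skip-thoughts paths in `config.py` already point at your downloaded model files:

    import config
    import skipthoughts
    # load the pre-trained uni-skip and bi-skip encoders plus their tables
    stv = skipthoughts.load_model(config.paths['skmodels'], config.paths['sktables'])
    vecs = skipthoughts.encode(stv, ['A man is standing on the beach .'], verbose=False)
    print vecs.shape

Each sentence should come back as a single combine-skip vector (4800 dimensions: the uni-skip and bi-skip features concatenated); this is the space in which the style vectors **c** and **b** described above live.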
74 | 
75 | Finally, we need the VGG-19 ConvNet parameters. You can obtain them by running
76 | 
77 |     wget https://s3.amazonaws.com/lasagne/recipes/pretrained/imagenet/vgg19.pkl
78 | 
79 | Note that this model is for non-commercial use only. Once you have all the materials, open `config.py` and specify the locations of all of the models and style vectors that you downloaded.
80 | 
81 | For running on CPU, you will need to download the VGG-19 prototxt and model by running:
82 | 
83 |     wget http://www.robots.ox.ac.uk/~vgg/software/very_deep/caffe/VGG_ILSVRC_19_layers.caffemodel
84 |     wget https://gist.githubusercontent.com/ksimonyan/3785162f95cd2d5fee77/raw/bb2b4fe0a9bb0669211cf3d0bc949dfdda173e9e/VGG_ILSVRC_19_layers_deploy.prototxt
85 | 
86 | You will also need to set the pycaffe and Caffe model paths in `config.py`, and set the flag on line 8 to:
87 | 
88 |     FLAG_CPU_MODE = True
89 | 
90 | ## Generating a story
91 | 
92 | The images directory contains some sample images that you can try the model on. In order to generate a story, open IPython and run the following:
93 | 
94 |     import generate
95 |     z = generate.load_all()
96 |     generate.story(z, './images/ex1.jpg')
97 | 
98 | If everything works, it will first print out the nearest COCO captions to the image (predicted by the visual-semantic embedding model). Then it will print out a story.
99 | 
100 | #### Generation options
101 | 
102 | There are two knobs that can be tuned for generation: the number of retrieved captions to condition on, and the beam search width. The defaults are
103 | 
104 |     generate.story(z, './images/ex1.jpg', k=100, bw=50)
105 | 
106 | where k is the number of captions to condition on and bw is the beam width. These are reasonable defaults, but playing around with them can give you very different outputs! The higher the beam width, the longer it takes to generate a story.
107 | 
108 | If you bias by song lyrics, you can turn on the lyric flag, which splits the output into multiple lines at the commas. `neural_storyteller.zip` contains an additional bias vector called `swift_style.npy`, which is the mean of the skip-thought vectors across Taylor Swift lyrics. If you point `path_to_posbias` to this vector in `config.py`, you can generate captions in the style of Taylor Swift lyrics. For example:
109 | 
110 |     generate.story(z, './images/ex1.jpg', lyric=True)
111 | 
112 | should output
113 | 
114 |     You re the only person on the beach right now
115 |     you know
116 |     I do n't think I will ever fall in love with you
117 |     and when the sea breeze hits me
118 |     I thought
119 |     Hey
120 | 
121 | ## Reference
122 | 
123 | This project does not have an associated paper. If you found this code useful, please consider citing:
124 | 
125 | Ryan Kiros, Yukun Zhu, Ruslan Salakhutdinov, Richard S. Zemel, Antonio Torralba, Raquel Urtasun, and Sanja Fidler. **"Skip-Thought Vectors."** *arXiv preprint arXiv:1506.06726 (2015).*
126 | 
127 |     @article{kiros2015skip,
128 |       title={Skip-Thought Vectors},
129 |       author={Kiros, Ryan and Zhu, Yukun and Salakhutdinov, Ruslan and Zemel, Richard S and Torralba, Antonio and Urtasun, Raquel and Fidler, Sanja},
130 |       journal={arXiv preprint arXiv:1506.06726},
131 |       year={2015}
132 |     }
133 | 
134 | If you also use the BookCorpus data for training new models, please consider citing:
135 | 
136 | Yukun Zhu, Ryan Kiros, Richard Zemel, Ruslan Salakhutdinov, Raquel Urtasun, Antonio Torralba, Sanja Fidler.
137 | **"Aligning Books and Movies: Towards Story-like Visual Explanations by Watching Movies and Reading Books."** *arXiv preprint arXiv:1506.06724 (2015).* 138 | 139 | @article{zhu2015aligning, 140 | title={Aligning Books and Movies: Towards Story-like Visual Explanations by Watching Movies and Reading Books}, 141 | author={Zhu, Yukun and Kiros, Ryan and Zemel, Richard and Salakhutdinov, Ruslan and Urtasun, Raquel and Torralba, Antonio and Fidler, Sanja}, 142 | journal={arXiv preprint arXiv:1506.06724}, 143 | year={2015} 144 | } 145 | -------------------------------------------------------------------------------- /config.py: -------------------------------------------------------------------------------- 1 | """ 2 | Configuration for the generate module 3 | """ 4 | 5 | #-----------------------------------------------------------------------------# 6 | # Flags for running on CPU 7 | #-----------------------------------------------------------------------------# 8 | FLAG_CPU_MODE = True 9 | 10 | #-----------------------------------------------------------------------------# 11 | # Paths to models and biases 12 | #-----------------------------------------------------------------------------# 13 | paths = dict() 14 | 15 | # Skip-thoughts 16 | paths['skmodels'] = '/u/rkiros/public_html/models/' 17 | paths['sktables'] = '/u/rkiros/public_html/models/' 18 | 19 | # Decoder 20 | paths['decmodel'] = '/ais/gobi3/u/rkiros/storyteller/romance.npz' 21 | paths['dictionary'] = '/ais/gobi3/u/rkiros/storyteller/romance_dictionary.pkl' 22 | 23 | # Image-sentence embedding 24 | paths['vsemodel'] = '/ais/gobi3/u/rkiros/storyteller/coco_embedding.npz' 25 | 26 | # VGG-19 convnet 27 | paths['vgg'] = '/ais/gobi3/u/rkiros/vgg/vgg19.pkl' 28 | paths['pycaffe'] = '/u/yukun/Projects/caffe-run/python' 29 | paths['vgg_proto_caffe'] = '/ais/guppy9/movie2text/neural-storyteller/models/VGG_ILSVRC_19_layers_deploy.prototxt' 30 | paths['vgg_model_caffe'] = '/ais/guppy9/movie2text/neural-storyteller/models/VGG_ILSVRC_19_layers.caffemodel' 31 | 32 | 33 | # COCO training captions 34 | paths['captions'] = '/ais/gobi3/u/rkiros/storyteller/coco_train_caps.txt' 35 | 36 | # Biases 37 | paths['negbias'] = '/ais/gobi3/u/rkiros/storyteller/caption_style.npy' 38 | paths['posbias'] = '/ais/gobi3/u/rkiros/storyteller/romance_style.npy' 39 | -------------------------------------------------------------------------------- /decoder.py: -------------------------------------------------------------------------------- 1 | """ 2 | Decoder 3 | """ 4 | import theano 5 | import theano.tensor as tensor 6 | from theano.sandbox.rng_mrg import MRG_RandomStreams as RandomStreams 7 | 8 | import cPickle as pkl 9 | import numpy 10 | 11 | from search import gen_sample 12 | from collections import OrderedDict 13 | 14 | 15 | def load_model(path_to_model, path_to_dictionary): 16 | """ 17 | Load a trained model for decoding 18 | """ 19 | # Load the worddict 20 | with open(path_to_dictionary, 'rb') as f: 21 | worddict = pkl.load(f) 22 | 23 | # Create inverted dictionary 24 | word_idict = dict() 25 | for kk, vv in worddict.iteritems(): 26 | word_idict[vv] = kk 27 | word_idict[0] = '' 28 | word_idict[1] = 'UNK' 29 | 30 | # Load model options 31 | with open('%s.pkl'%path_to_model, 'rb') as f: 32 | options = pkl.load(f) 33 | if 'doutput' not in options.keys(): 34 | options['doutput'] = True 35 | 36 | # Load parameters 37 | params = init_params(options) 38 | params = load_params(path_to_model, params) 39 | tparams = init_tparams(params) 40 | 41 | # Sampler. 
42 |     trng = RandomStreams(1234)
43 |     f_init, f_next = build_sampler(tparams, options, trng)
44 | 
45 |     # Pack everything up
46 |     dec = dict()
47 |     dec['options'] = options
48 |     dec['trng'] = trng
49 |     dec['worddict'] = worddict
50 |     dec['word_idict'] = word_idict
51 |     dec['tparams'] = tparams
52 |     dec['f_init'] = f_init
53 |     dec['f_next'] = f_next
54 |     return dec
55 | 
56 | def run_sampler(dec, c, beam_width=1, stochastic=False, use_unk=False):
57 |     """
58 |     Generate text conditioned on c
59 |     """
60 |     sample, score = gen_sample(dec['tparams'], dec['f_init'], dec['f_next'],
61 |                                c.reshape(1, dec['options']['dimctx']), dec['options'],
62 |                                trng=dec['trng'], k=beam_width, maxlen=1000, stochastic=stochastic,
63 |                                use_unk=use_unk)
64 |     text = []
65 |     if stochastic:
66 |         sample = [sample]
67 |     for c in sample:
68 |         text.append(' '.join([dec['word_idict'][w] for w in c[:-1]]))
69 | 
70 |     # Sort beams by their NLL, return the best result
71 |     lengths = numpy.array([len(s.split()) for s in text])
72 |     if lengths[0] == 0:  # in case the model only predicts <eos>
73 |         lengths = lengths[1:]
74 |         score = score[1:]
75 |         text = text[1:]
76 |     sidx = numpy.argmin(score)
77 |     text = text[sidx]
78 |     score = score[sidx]
79 | 
80 |     return text
81 | 
82 | def _p(pp, name):
83 |     """
84 |     make prefix-appended name
85 |     """
86 |     return '%s_%s'%(pp, name)
87 | 
88 | def init_tparams(params):
89 |     """
90 |     initialize Theano shared variables according to the initial parameters
91 |     """
92 |     tparams = OrderedDict()
93 |     for kk, pp in params.iteritems():
94 |         tparams[kk] = theano.shared(params[kk], name=kk)
95 |     return tparams
96 | 
97 | def load_params(path, params):
98 |     """
99 |     load parameters
100 |     """
101 |     pp = numpy.load(path)
102 |     for kk, vv in params.iteritems():
103 |         if kk not in pp:
104 |             warnings.warn('%s is not in the archive'%kk)
105 |             continue
106 |         params[kk] = pp[kk]
107 |     return params
108 | 
109 | # layers: 'name': ('parameter initializer', 'feedforward')
110 | layers = {'ff': ('param_init_fflayer', 'fflayer'),
111 |           'gru': ('param_init_gru', 'gru_layer')}
112 | 
113 | def get_layer(name):
114 |     fns = layers[name]
115 |     return (eval(fns[0]), eval(fns[1]))
116 | 
117 | def init_params(options):
118 |     """
119 |     Initialize all parameters
120 |     """
121 |     params = OrderedDict()
122 | 
123 |     # Word embedding
124 |     params['Wemb'] = norm_weight(options['n_words'], options['dim_word'])
125 | 
126 |     # init state
127 |     params = get_layer('ff')[0](options, params, prefix='ff_state', nin=options['dimctx'], nout=options['dim'])
128 | 
129 |     # Decoder
130 |     params = get_layer(options['decoder'])[0](options, params, prefix='decoder',
131 |                                               nin=options['dim_word'], dim=options['dim'])
132 | 
133 |     # Output layer
134 |     if options['doutput']:
135 |         params = get_layer('ff')[0](options, params, prefix='ff_hid', nin=options['dim'], nout=options['dim_word'])
136 |         params = get_layer('ff')[0](options, params, prefix='ff_logit', nin=options['dim_word'], nout=options['n_words'])
137 |     else:
138 |         params = get_layer('ff')[0](options, params, prefix='ff_logit', nin=options['dim'], nout=options['n_words'])
139 | 
140 |     return params
141 | 
142 | def build_sampler(tparams, options, trng):
143 |     """
144 |     Forward sampling
145 |     """
146 |     ctx = tensor.matrix('ctx', dtype='float32')
147 |     ctx0 = ctx
148 | 
149 |     init_state = get_layer('ff')[1](tparams, ctx, options, prefix='ff_state', activ='tanh')
150 |     f_init = theano.function([ctx], init_state, name='f_init', profile=False)
151 | 
152 |     # y: 1 x 1
153 |     y = tensor.vector('y_sampler', dtype='int64')
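    # Inputs for one step of f_next: y holds the previous word index (-1 on
    # the first step, which the switch below maps to an all-zero embedding)
    # and init_state holds the current GRU state; the compiled function
    # returns the next-word distribution, a sampled word, and the new state.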
154 | init_state = tensor.matrix('init_state', dtype='float32') 155 | 156 | # if it's the first word, emb should be all zero 157 | emb = tensor.switch(y[:,None] < 0, tensor.alloc(0., 1, tparams['Wemb'].shape[1]), 158 | tparams['Wemb'][y]) 159 | 160 | # decoder 161 | proj = get_layer(options['decoder'])[1](tparams, emb, init_state, options, 162 | prefix='decoder', 163 | mask=None, 164 | one_step=True) 165 | next_state = proj[0] 166 | 167 | # output 168 | if options['doutput']: 169 | hid = get_layer('ff')[1](tparams, next_state, options, prefix='ff_hid', activ='tanh') 170 | logit = get_layer('ff')[1](tparams, hid, options, prefix='ff_logit', activ='linear') 171 | else: 172 | logit = get_layer('ff')[1](tparams, next_state, options, prefix='ff_logit', activ='linear') 173 | next_probs = tensor.nnet.softmax(logit) 174 | next_sample = trng.multinomial(pvals=next_probs).argmax(1) 175 | 176 | # next word probability 177 | inps = [y, init_state] 178 | outs = [next_probs, next_sample, next_state] 179 | f_next = theano.function(inps, outs, name='f_next', profile=False) 180 | 181 | return f_init, f_next 182 | 183 | def linear(x): 184 | """ 185 | Linear activation function 186 | """ 187 | return x 188 | 189 | def tanh(x): 190 | """ 191 | Tanh activation function 192 | """ 193 | return tensor.tanh(x) 194 | 195 | def ortho_weight(ndim): 196 | """ 197 | Orthogonal weight init, for recurrent layers 198 | """ 199 | W = numpy.random.randn(ndim, ndim) 200 | u, s, v = numpy.linalg.svd(W) 201 | return u.astype('float32') 202 | 203 | def norm_weight(nin,nout=None, scale=0.1, ortho=True): 204 | """ 205 | Uniform initalization from [-scale, scale] 206 | If matrix is square and ortho=True, use ortho instead 207 | """ 208 | if nout == None: 209 | nout = nin 210 | if nout == nin and ortho: 211 | W = ortho_weight(nin) 212 | else: 213 | W = numpy.random.uniform(low=-scale, high=scale, size=(nin, nout)) 214 | return W.astype('float32') 215 | 216 | # Feedforward layer 217 | def param_init_fflayer(options, params, prefix='ff', nin=None, nout=None, ortho=True): 218 | """ 219 | Affine transformation + point-wise nonlinearity 220 | """ 221 | if nin == None: 222 | nin = options['dim_proj'] 223 | if nout == None: 224 | nout = options['dim_proj'] 225 | params[_p(prefix,'W')] = norm_weight(nin, nout) 226 | params[_p(prefix,'b')] = numpy.zeros((nout,)).astype('float32') 227 | 228 | return params 229 | 230 | def fflayer(tparams, state_below, options, prefix='rconv', activ='lambda x: tensor.tanh(x)', **kwargs): 231 | """ 232 | Feedforward pass 233 | """ 234 | return eval(activ)(tensor.dot(state_below, tparams[_p(prefix,'W')])+tparams[_p(prefix,'b')]) 235 | 236 | # GRU layer 237 | def param_init_gru(options, params, prefix='gru', nin=None, dim=None): 238 | """ 239 | Gated Recurrent Unit (GRU) 240 | """ 241 | if nin == None: 242 | nin = options['dim_proj'] 243 | if dim == None: 244 | dim = options['dim_proj'] 245 | W = numpy.concatenate([norm_weight(nin,dim), 246 | norm_weight(nin,dim)], axis=1) 247 | params[_p(prefix,'W')] = W 248 | params[_p(prefix,'b')] = numpy.zeros((2 * dim,)).astype('float32') 249 | U = numpy.concatenate([ortho_weight(dim), 250 | ortho_weight(dim)], axis=1) 251 | params[_p(prefix,'U')] = U 252 | 253 | Wx = norm_weight(nin, dim) 254 | params[_p(prefix,'Wx')] = Wx 255 | Ux = ortho_weight(dim) 256 | params[_p(prefix,'Ux')] = Ux 257 | params[_p(prefix,'bx')] = numpy.zeros((dim,)).astype('float32') 258 | 259 | return params 260 | 261 | def gru_layer(tparams, state_below, init_state, options, prefix='gru', mask=None, 
one_step=False, **kwargs):
262 |     """
263 |     Feedforward pass through GRU
264 |     """
265 |     nsteps = state_below.shape[0]
266 |     if state_below.ndim == 3:
267 |         n_samples = state_below.shape[1]
268 |     else:
269 |         n_samples = 1
270 | 
271 |     dim = tparams[_p(prefix,'Ux')].shape[1]
272 | 
273 |     if init_state == None:
274 |         init_state = tensor.alloc(0., n_samples, dim)
275 | 
276 |     if mask == None:
277 |         mask = tensor.alloc(1., state_below.shape[0], 1)
278 | 
279 |     def _slice(_x, n, dim):
280 |         if _x.ndim == 3:
281 |             return _x[:, :, n*dim:(n+1)*dim]
282 |         return _x[:, n*dim:(n+1)*dim]
283 | 
284 |     state_below_ = tensor.dot(state_below, tparams[_p(prefix, 'W')]) + tparams[_p(prefix, 'b')]
285 |     state_belowx = tensor.dot(state_below, tparams[_p(prefix, 'Wx')]) + tparams[_p(prefix, 'bx')]
286 |     U = tparams[_p(prefix, 'U')]
287 |     Ux = tparams[_p(prefix, 'Ux')]
288 | 
289 |     def _step_slice(m_, x_, xx_, h_, U, Ux):
290 |         preact = tensor.dot(h_, U)
291 |         preact += x_
292 | 
293 |         r = tensor.nnet.sigmoid(_slice(preact, 0, dim))
294 |         u = tensor.nnet.sigmoid(_slice(preact, 1, dim))
295 | 
296 |         preactx = tensor.dot(h_, Ux)
297 |         preactx = preactx * r
298 |         preactx = preactx + xx_
299 | 
300 |         h = tensor.tanh(preactx)
301 | 
302 |         h = u * h_ + (1. - u) * h
303 |         h = m_[:,None] * h + (1. - m_)[:,None] * h_
304 | 
305 |         return h
306 | 
307 |     seqs = [mask, state_below_, state_belowx]
308 |     _step = _step_slice
309 | 
310 |     if one_step:
311 |         rval = _step(*(seqs+[init_state, tparams[_p(prefix, 'U')], tparams[_p(prefix, 'Ux')]]))
312 |     else:
313 |         rval, updates = theano.scan(_step,
314 |                                     sequences=seqs,
315 |                                     outputs_info = [init_state],
316 |                                     non_sequences = [tparams[_p(prefix, 'U')],
317 |                                                      tparams[_p(prefix, 'Ux')]],
318 |                                     name=_p(prefix, '_layers'),
319 |                                     n_steps=nsteps,
320 |                                     profile=False,
321 |                                     strict=True)
322 |     rval = [rval]
323 |     return rval
324 | 
325 | 
--------------------------------------------------------------------------------
/embedding.py:
--------------------------------------------------------------------------------
1 | """
2 | Joint image-sentence embedding space
3 | """
4 | import theano
5 | import theano.tensor as tensor
6 | from theano.sandbox.rng_mrg import MRG_RandomStreams as RandomStreams
7 | 
8 | import cPickle as pkl
9 | import numpy
10 | import nltk
11 | import warnings
12 | from collections import OrderedDict, defaultdict
13 | from scipy.linalg import norm
14 | 
15 | 
16 | def load_model(path_to_model):
17 |     """
18 |     Load all model components
19 |     """
20 |     # Load the worddict
21 |     with open('%s.dictionary.pkl'%path_to_model, 'rb') as f:
22 |         worddict = pkl.load(f)
23 | 
24 |     # Create inverted dictionary
25 |     word_idict = dict()
26 |     for kk, vv in worddict.iteritems():
27 |         word_idict[vv] = kk
28 |     word_idict[0] = '<eos>'
29 |     word_idict[1] = 'UNK'
30 | 
31 |     # Load model options
32 |     with open('%s.pkl'%path_to_model, 'rb') as f:
33 |         options = pkl.load(f)
34 | 
35 |     # Load parameters
36 |     params = init_params(options)
37 |     params = load_params(path_to_model, params)
38 |     tparams = init_tparams(params)
39 | 
40 |     # Extractor functions
41 |     trng = RandomStreams(1234)
42 |     trng, [x, x_mask], sentences = build_sentence_encoder(tparams, options)
43 |     f_senc = theano.function([x, x_mask], sentences, name='f_senc')
44 | 
45 |     trng, [im], images = build_image_encoder(tparams, options)
46 |     f_ienc = theano.function([im], images, name='f_ienc')
47 | 
48 |     # Store everything we need in a dictionary
49 |     model = {}
50 |     model['options'] = options
51 |     model['worddict'] = worddict
52 |     model['word_idict'] = word_idict
53 |     model['f_senc']
= f_senc 54 | model['f_ienc'] = f_ienc 55 | return model 56 | 57 | def encode_sentences(model, X, verbose=False, batch_size=128): 58 | """ 59 | Encode sentences into the joint embedding space 60 | """ 61 | features = numpy.zeros((len(X), model['options']['dim']), dtype='float32') 62 | 63 | # length dictionary 64 | ds = defaultdict(list) 65 | captions = [s.split() for s in X] 66 | for i,s in enumerate(captions): 67 | ds[len(s)].append(i) 68 | 69 | # quick check if a word is in the dictionary 70 | d = defaultdict(lambda : 0) 71 | for w in model['worddict'].keys(): 72 | d[w] = 1 73 | 74 | # Get features. This encodes by length, in order to avoid wasting computation 75 | for k in ds.keys(): 76 | if verbose: 77 | print k 78 | numbatches = len(ds[k]) / batch_size + 1 79 | for minibatch in range(numbatches): 80 | caps = ds[k][minibatch::numbatches] 81 | caption = [captions[c] for c in caps] 82 | 83 | seqs = [] 84 | for i, cc in enumerate(caption): 85 | seqs.append([model['worddict'][w] if d[w] > 0 and model['worddict'][w] < model['options']['n_words'] else 1 for w in cc]) 86 | x = numpy.zeros((k+1, len(caption))).astype('int64') 87 | x_mask = numpy.zeros((k+1, len(caption))).astype('float32') 88 | for idx, s in enumerate(seqs): 89 | x[:k,idx] = s 90 | x_mask[:k+1,idx] = 1. 91 | ff = model['f_senc'](x, x_mask) 92 | for ind, c in enumerate(caps): 93 | features[c] = ff[ind] 94 | 95 | return features 96 | 97 | def encode_images(model, IM): 98 | """ 99 | Encode images into the joint embedding space 100 | """ 101 | images = model['f_ienc'](IM) 102 | return images 103 | 104 | def _p(pp, name): 105 | """ 106 | make prefix-appended name 107 | """ 108 | return '%s_%s'%(pp, name) 109 | 110 | def init_tparams(params): 111 | """ 112 | initialize Theano shared variables according to the initial parameters 113 | """ 114 | tparams = OrderedDict() 115 | for kk, pp in params.iteritems(): 116 | tparams[kk] = theano.shared(params[kk], name=kk) 117 | return tparams 118 | 119 | def load_params(path, params): 120 | """ 121 | load parameters 122 | """ 123 | pp = numpy.load(path) 124 | for kk, vv in params.iteritems(): 125 | if kk not in pp: 126 | warnings.warn('%s is not in the archive'%kk) 127 | continue 128 | params[kk] = pp[kk] 129 | return params 130 | 131 | # layers: 'name': ('parameter initializer', 'feedforward') 132 | layers = {'ff': ('param_init_fflayer', 'fflayer'), 133 | 'gru': ('param_init_gru', 'gru_layer')} 134 | 135 | def get_layer(name): 136 | fns = layers[name] 137 | return (eval(fns[0]), eval(fns[1])) 138 | 139 | def init_params(options): 140 | """ 141 | Initialize all parameters 142 | """ 143 | params = OrderedDict() 144 | 145 | # Word embedding 146 | params['Wemb'] = norm_weight(options['n_words'], options['dim_word']) 147 | 148 | # Sentence encoder 149 | params = get_layer(options['encoder'])[0](options, params, prefix='encoder', 150 | nin=options['dim_word'], dim=options['dim']) 151 | 152 | # Image encoder 153 | params = get_layer('ff')[0](options, params, prefix='ff_image', nin=options['dim_image'], nout=options['dim']) 154 | 155 | return params 156 | 157 | def build_sentence_encoder(tparams, options): 158 | """ 159 | Encoder only, for sentences 160 | """ 161 | opt_ret = dict() 162 | 163 | trng = RandomStreams(1234) 164 | 165 | # description string: #words x #samples 166 | x = tensor.matrix('x', dtype='int64') 167 | mask = tensor.matrix('x_mask', dtype='float32') 168 | 169 | n_timesteps = x.shape[0] 170 | n_samples = x.shape[1] 171 | 172 | # Word embedding 173 | emb = 
tparams['Wemb'][x.flatten()].reshape([n_timesteps, n_samples, options['dim_word']]) 174 | 175 | # Encode sentences 176 | proj = get_layer(options['encoder'])[1](tparams, emb, None, options, 177 | prefix='encoder', 178 | mask=mask) 179 | sents = proj[0][-1] 180 | sents = l2norm(sents) 181 | 182 | return trng, [x, mask], sents 183 | 184 | def build_image_encoder(tparams, options): 185 | """ 186 | Encoder only, for images 187 | """ 188 | opt_ret = dict() 189 | 190 | trng = RandomStreams(1234) 191 | 192 | # image features 193 | im = tensor.matrix('im', dtype='float32') 194 | 195 | # Encode images 196 | images = get_layer('ff')[1](tparams, im, options, prefix='ff_image', activ='linear') 197 | images = l2norm(images) 198 | 199 | return trng, [im], images 200 | 201 | def linear(x): 202 | """ 203 | Linear activation function 204 | """ 205 | return x 206 | 207 | def tanh(x): 208 | """ 209 | Tanh activation function 210 | """ 211 | return tensor.tanh(x) 212 | 213 | def l2norm(X): 214 | """ 215 | Compute L2 norm, row-wise 216 | """ 217 | norm = tensor.sqrt(tensor.pow(X, 2).sum(1)) 218 | X /= norm[:, None] 219 | return X 220 | 221 | def ortho_weight(ndim): 222 | """ 223 | Orthogonal weight init, for recurrent layers 224 | """ 225 | W = numpy.random.randn(ndim, ndim) 226 | u, s, v = numpy.linalg.svd(W) 227 | return u.astype('float32') 228 | 229 | def norm_weight(nin,nout=None, scale=0.1, ortho=True): 230 | """ 231 | Uniform initalization from [-scale, scale] 232 | If matrix is square and ortho=True, use ortho instead 233 | """ 234 | if nout == None: 235 | nout = nin 236 | if nout == nin and ortho: 237 | W = ortho_weight(nin) 238 | else: 239 | W = numpy.random.uniform(low=-scale, high=scale, size=(nin, nout)) 240 | return W.astype('float32') 241 | 242 | def xavier_weight(nin,nout=None): 243 | """ 244 | Xavier init 245 | """ 246 | if nout == None: 247 | nout = nin 248 | r = numpy.sqrt(6.) 
/ numpy.sqrt(nin + nout) 249 | W = numpy.random.rand(nin, nout) * 2 * r - r 250 | return W.astype('float32') 251 | 252 | # Feedforward layer 253 | def param_init_fflayer(options, params, prefix='ff', nin=None, nout=None, ortho=True): 254 | """ 255 | Affine transformation + point-wise nonlinearity 256 | """ 257 | if nin == None: 258 | nin = options['dim_proj'] 259 | if nout == None: 260 | nout = options['dim_proj'] 261 | params[_p(prefix,'W')] = xavier_weight(nin, nout) 262 | params[_p(prefix,'b')] = numpy.zeros((nout,)).astype('float32') 263 | 264 | return params 265 | 266 | def fflayer(tparams, state_below, options, prefix='rconv', activ='lambda x: tensor.tanh(x)', **kwargs): 267 | """ 268 | Feedforward pass 269 | """ 270 | return eval(activ)(tensor.dot(state_below, tparams[_p(prefix,'W')])+tparams[_p(prefix,'b')]) 271 | 272 | # GRU layer 273 | def param_init_gru(options, params, prefix='gru', nin=None, dim=None): 274 | """ 275 | Gated Recurrent Unit (GRU) 276 | """ 277 | if nin == None: 278 | nin = options['dim_proj'] 279 | if dim == None: 280 | dim = options['dim_proj'] 281 | W = numpy.concatenate([norm_weight(nin,dim), 282 | norm_weight(nin,dim)], axis=1) 283 | params[_p(prefix,'W')] = W 284 | params[_p(prefix,'b')] = numpy.zeros((2 * dim,)).astype('float32') 285 | U = numpy.concatenate([ortho_weight(dim), 286 | ortho_weight(dim)], axis=1) 287 | params[_p(prefix,'U')] = U 288 | 289 | Wx = norm_weight(nin, dim) 290 | params[_p(prefix,'Wx')] = Wx 291 | Ux = ortho_weight(dim) 292 | params[_p(prefix,'Ux')] = Ux 293 | params[_p(prefix,'bx')] = numpy.zeros((dim,)).astype('float32') 294 | 295 | return params 296 | 297 | def gru_layer(tparams, state_below, init_state, options, prefix='gru', mask=None, one_step=False, **kwargs): 298 | """ 299 | Feedforward pass through GRU 300 | """ 301 | nsteps = state_below.shape[0] 302 | if state_below.ndim == 3: 303 | n_samples = state_below.shape[1] 304 | else: 305 | n_samples = 1 306 | 307 | dim = tparams[_p(prefix,'Ux')].shape[1] 308 | 309 | if init_state == None: 310 | init_state = tensor.alloc(0., n_samples, dim) 311 | 312 | if mask == None: 313 | mask = tensor.alloc(1., state_below.shape[0], 1) 314 | 315 | def _slice(_x, n, dim): 316 | if _x.ndim == 3: 317 | return _x[:, :, n*dim:(n+1)*dim] 318 | return _x[:, n*dim:(n+1)*dim] 319 | 320 | state_below_ = tensor.dot(state_below, tparams[_p(prefix, 'W')]) + tparams[_p(prefix, 'b')] 321 | state_belowx = tensor.dot(state_below, tparams[_p(prefix, 'Wx')]) + tparams[_p(prefix, 'bx')] 322 | U = tparams[_p(prefix, 'U')] 323 | Ux = tparams[_p(prefix, 'Ux')] 324 | 325 | def _step_slice(m_, x_, xx_, h_, U, Ux): 326 | preact = tensor.dot(h_, U) 327 | preact += x_ 328 | 329 | r = tensor.nnet.sigmoid(_slice(preact, 0, dim)) 330 | u = tensor.nnet.sigmoid(_slice(preact, 1, dim)) 331 | 332 | preactx = tensor.dot(h_, Ux) 333 | preactx = preactx * r 334 | preactx = preactx + xx_ 335 | 336 | h = tensor.tanh(preactx) 337 | 338 | h = u * h_ + (1. - u) * h 339 | h = m_[:,None] * h + (1. 
- m_)[:,None] * h_ 340 | 341 | return h 342 | 343 | seqs = [mask, state_below_, state_belowx] 344 | _step = _step_slice 345 | 346 | if one_step: 347 | rval = _step(*(seqs+[init_state, tparams[_p(prefix, 'U')], tparams[_p(prefix, 'Ux')]])) 348 | else: 349 | rval, updates = theano.scan(_step, 350 | sequences=seqs, 351 | outputs_info = [init_state], 352 | non_sequences = [tparams[_p(prefix, 'U')], 353 | tparams[_p(prefix, 'Ux')]], 354 | name=_p(prefix, '_layers'), 355 | n_steps=nsteps, 356 | profile=False, 357 | strict=True) 358 | rval = [rval] 359 | return rval 360 | 361 | 362 | -------------------------------------------------------------------------------- /generate.py: -------------------------------------------------------------------------------- 1 | """ 2 | Story generation 3 | """ 4 | import cPickle as pkl 5 | import numpy 6 | import copy 7 | import sys 8 | import skimage.transform 9 | 10 | import skipthoughts 11 | import decoder 12 | import embedding 13 | 14 | import config 15 | 16 | import lasagne 17 | from lasagne.layers import InputLayer, DenseLayer, NonlinearityLayer, DropoutLayer 18 | from lasagne.layers import MaxPool2DLayer as PoolLayer 19 | from lasagne.nonlinearities import softmax 20 | from lasagne.utils import floatX 21 | if not config.FLAG_CPU_MODE: 22 | from lasagne.layers.corrmm import Conv2DMMLayer as ConvLayer 23 | 24 | from scipy import optimize, stats 25 | from collections import OrderedDict, defaultdict, Counter 26 | from numpy.random import RandomState 27 | from scipy.linalg import norm 28 | 29 | from PIL import Image 30 | from PIL import ImageFile 31 | ImageFile.LOAD_TRUNCATED_IMAGES = True 32 | 33 | 34 | def story(z, image_loc, k=100, bw=50, lyric=False): 35 | """ 36 | Generate a story for an image at location image_loc 37 | """ 38 | # Load the image 39 | rawim, im = load_image(image_loc) 40 | 41 | # Run image through convnet 42 | feats = compute_features(z['net'], im).flatten() 43 | feats /= norm(feats) 44 | 45 | # Embed image into joint space 46 | feats = embedding.encode_images(z['vse'], feats[None,:]) 47 | 48 | # Compute the nearest neighbours 49 | scores = numpy.dot(feats, z['cvec'].T).flatten() 50 | sorted_args = numpy.argsort(scores)[::-1] 51 | sentences = [z['cap'][a] for a in sorted_args[:k]] 52 | 53 | print 'NEAREST-CAPTIONS: ' 54 | for s in sentences[:5]: 55 | print s 56 | print '' 57 | 58 | # Compute skip-thought vectors for sentences 59 | svecs = skipthoughts.encode(z['stv'], sentences, verbose=False) 60 | 61 | # Style shifting 62 | shift = svecs.mean(0) - z['bneg'] + z['bpos'] 63 | 64 | # Generate story conditioned on shift 65 | passage = decoder.run_sampler(z['dec'], shift, beam_width=bw) 66 | print 'OUTPUT: ' 67 | if lyric: 68 | for line in passage.split(','): 69 | if line[0] != ' ': 70 | print line 71 | else: 72 | print line[1:] 73 | else: 74 | print passage 75 | 76 | 77 | def load_all(): 78 | """ 79 | Load everything we need for generating 80 | """ 81 | print config.paths['decmodel'] 82 | 83 | # Skip-thoughts 84 | print 'Loading skip-thoughts...' 85 | stv = skipthoughts.load_model(config.paths['skmodels'], 86 | config.paths['sktables']) 87 | 88 | # Decoder 89 | print 'Loading decoder...' 90 | dec = decoder.load_model(config.paths['decmodel'], 91 | config.paths['dictionary']) 92 | 93 | # Image-sentence embedding 94 | print 'Loading image-sentence embedding...' 95 | vse = embedding.load_model(config.paths['vsemodel']) 96 | 97 | # VGG-19 98 | print 'Loading and initializing ConvNet...' 
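    # Two ConvNet backends are supported: in CPU mode the Caffe reference
    # VGG-19 is loaded through pycaffe; otherwise a Lasagne/Theano VGG-19
    # is constructed and its pickled weights loaded (see build_convnet below).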
99 | 100 | if config.FLAG_CPU_MODE: 101 | sys.path.insert(0, config.paths['pycaffe']) 102 | import caffe 103 | caffe.set_mode_cpu() 104 | net = caffe.Net(config.paths['vgg_proto_caffe'], 105 | config.paths['vgg_model_caffe'], 106 | caffe.TEST) 107 | else: 108 | net = build_convnet(config.paths['vgg']) 109 | 110 | # Captions 111 | print 'Loading captions...' 112 | cap = [] 113 | with open(config.paths['captions'], 'rb') as f: 114 | for line in f: 115 | cap.append(line.strip()) 116 | 117 | # Caption embeddings 118 | print 'Embedding captions...' 119 | cvec = embedding.encode_sentences(vse, cap, verbose=False) 120 | 121 | # Biases 122 | print 'Loading biases...' 123 | bneg = numpy.load(config.paths['negbias']) 124 | bpos = numpy.load(config.paths['posbias']) 125 | 126 | # Pack up 127 | z = {} 128 | z['stv'] = stv 129 | z['dec'] = dec 130 | z['vse'] = vse 131 | z['net'] = net 132 | z['cap'] = cap 133 | z['cvec'] = cvec 134 | z['bneg'] = bneg 135 | z['bpos'] = bpos 136 | 137 | return z 138 | 139 | def load_image(file_name): 140 | """ 141 | Load and preprocess an image 142 | """ 143 | MEAN_VALUE = numpy.array([103.939, 116.779, 123.68]).reshape((3,1,1)) 144 | image = Image.open(file_name) 145 | im = numpy.array(image) 146 | 147 | # Resize so smallest dim = 256, preserving aspect ratio 148 | if len(im.shape) == 2: 149 | im = im[:, :, numpy.newaxis] 150 | im = numpy.repeat(im, 3, axis=2) 151 | h, w, _ = im.shape 152 | if h < w: 153 | im = skimage.transform.resize(im, (256, w*256/h), preserve_range=True) 154 | else: 155 | im = skimage.transform.resize(im, (h*256/w, 256), preserve_range=True) 156 | 157 | # Central crop to 224x224 158 | h, w, _ = im.shape 159 | im = im[h//2-112:h//2+112, w//2-112:w//2+112] 160 | 161 | rawim = numpy.copy(im).astype('uint8') 162 | 163 | # Shuffle axes to c01 164 | im = numpy.swapaxes(numpy.swapaxes(im, 1, 2), 0, 1) 165 | 166 | # Convert to BGR 167 | im = im[::-1, :, :] 168 | 169 | im = im - MEAN_VALUE 170 | return rawim, floatX(im[numpy.newaxis]) 171 | 172 | def compute_features(net, im): 173 | """ 174 | Compute fc7 features for im 175 | """ 176 | if config.FLAG_CPU_MODE: 177 | net.blobs['data'].reshape(* im.shape) 178 | net.blobs['data'].data[...] 
= im 179 | net.forward() 180 | fc7 = net.blobs['fc7'].data 181 | else: 182 | fc7 = numpy.array(lasagne.layers.get_output(net['fc7'], im, 183 | deterministic=True).eval()) 184 | return fc7 185 | 186 | def build_convnet(path_to_vgg): 187 | """ 188 | Construct VGG-19 convnet 189 | """ 190 | net = {} 191 | net['input'] = InputLayer((None, 3, 224, 224)) 192 | net['conv1_1'] = ConvLayer(net['input'], 64, 3, pad=1) 193 | net['conv1_2'] = ConvLayer(net['conv1_1'], 64, 3, pad=1) 194 | net['pool1'] = PoolLayer(net['conv1_2'], 2) 195 | net['conv2_1'] = ConvLayer(net['pool1'], 128, 3, pad=1) 196 | net['conv2_2'] = ConvLayer(net['conv2_1'], 128, 3, pad=1) 197 | net['pool2'] = PoolLayer(net['conv2_2'], 2) 198 | net['conv3_1'] = ConvLayer(net['pool2'], 256, 3, pad=1) 199 | net['conv3_2'] = ConvLayer(net['conv3_1'], 256, 3, pad=1) 200 | net['conv3_3'] = ConvLayer(net['conv3_2'], 256, 3, pad=1) 201 | net['conv3_4'] = ConvLayer(net['conv3_3'], 256, 3, pad=1) 202 | net['pool3'] = PoolLayer(net['conv3_4'], 2) 203 | net['conv4_1'] = ConvLayer(net['pool3'], 512, 3, pad=1) 204 | net['conv4_2'] = ConvLayer(net['conv4_1'], 512, 3, pad=1) 205 | net['conv4_3'] = ConvLayer(net['conv4_2'], 512, 3, pad=1) 206 | net['conv4_4'] = ConvLayer(net['conv4_3'], 512, 3, pad=1) 207 | net['pool4'] = PoolLayer(net['conv4_4'], 2) 208 | net['conv5_1'] = ConvLayer(net['pool4'], 512, 3, pad=1) 209 | net['conv5_2'] = ConvLayer(net['conv5_1'], 512, 3, pad=1) 210 | net['conv5_3'] = ConvLayer(net['conv5_2'], 512, 3, pad=1) 211 | net['conv5_4'] = ConvLayer(net['conv5_3'], 512, 3, pad=1) 212 | net['pool5'] = PoolLayer(net['conv5_4'], 2) 213 | net['fc6'] = DenseLayer(net['pool5'], num_units=4096) 214 | net['fc7'] = DenseLayer(net['fc6'], num_units=4096) 215 | net['fc8'] = DenseLayer(net['fc7'], num_units=1000, nonlinearity=None) 216 | net['prob'] = NonlinearityLayer(net['fc8'], softmax) 217 | 218 | print 'Loading parameters...' 
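    # vgg19.pkl holds a dict whose 'param values' entry lists the weights of
    # every layer in network order; set_all_param_values assigns them to all
    # layers below net['prob'] in one call.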
219 | output_layer = net['prob'] 220 | model = pkl.load(open(path_to_vgg)) 221 | lasagne.layers.set_all_param_values(output_layer, model['param values']) 222 | 223 | return net 224 | 225 | 226 | -------------------------------------------------------------------------------- /images/ex1.jpg: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/ryankiros/neural-storyteller/61e12a7a0453bdc62013c7c07b7f7c331059d360/images/ex1.jpg -------------------------------------------------------------------------------- /images/ex2.jpg: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/ryankiros/neural-storyteller/61e12a7a0453bdc62013c7c07b7f7c331059d360/images/ex2.jpg -------------------------------------------------------------------------------- /images/ex3.jpg: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/ryankiros/neural-storyteller/61e12a7a0453bdc62013c7c07b7f7c331059d360/images/ex3.jpg -------------------------------------------------------------------------------- /images/ex4.jpg: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/ryankiros/neural-storyteller/61e12a7a0453bdc62013c7c07b7f7c331059d360/images/ex4.jpg -------------------------------------------------------------------------------- /search.py: -------------------------------------------------------------------------------- 1 | """ 2 | Code for sequence generation 3 | """ 4 | import numpy 5 | import copy 6 | 7 | def gen_sample(tparams, f_init, f_next, ctx, options, trng=None, k=1, maxlen=30, 8 | stochastic=True, argmax=False, use_unk=False): 9 | """ 10 | Generate a sample, using either beam search or stochastic sampling 11 | """ 12 | if k > 1: 13 | assert not stochastic, 'Beam search does not support stochastic sampling' 14 | 15 | sample = [] 16 | sample_score = [] 17 | if stochastic: 18 | sample_score = 0 19 | 20 | live_k = 1 21 | dead_k = 0 22 | 23 | hyp_samples = [[]] * live_k 24 | hyp_scores = numpy.zeros(live_k).astype('float32') 25 | hyp_states = [] 26 | 27 | next_state = f_init(ctx) 28 | next_w = -1 * numpy.ones((1,)).astype('int64') 29 | 30 | for ii in xrange(maxlen): 31 | inps = [next_w, next_state] 32 | ret = f_next(*inps) 33 | next_p, next_w, next_state = ret[0], ret[1], ret[2] 34 | 35 | if stochastic: 36 | if argmax: 37 | nw = next_p[0].argmax() 38 | else: 39 | nw = next_w[0] 40 | sample.append(nw) 41 | sample_score += next_p[0,nw] 42 | if nw == 0: 43 | break 44 | else: 45 | cand_scores = hyp_scores[:,None] - numpy.log(next_p) 46 | cand_flat = cand_scores.flatten() 47 | 48 | if not use_unk: 49 | voc_size = next_p.shape[1] 50 | for xx in range(len(cand_flat) / voc_size): 51 | cand_flat[voc_size * xx + 1] = 1e20 52 | 53 | ranks_flat = cand_flat.argsort()[:(k-dead_k)] 54 | 55 | voc_size = next_p.shape[1] 56 | trans_indices = ranks_flat / voc_size 57 | word_indices = ranks_flat % voc_size 58 | costs = cand_flat[ranks_flat] 59 | 60 | new_hyp_samples = [] 61 | new_hyp_scores = numpy.zeros(k-dead_k).astype('float32') 62 | new_hyp_states = [] 63 | 64 | for idx, [ti, wi] in enumerate(zip(trans_indices, word_indices)): 65 | new_hyp_samples.append(hyp_samples[ti]+[wi]) 66 | new_hyp_scores[idx] = copy.copy(costs[idx]) 67 | new_hyp_states.append(copy.copy(next_state[ti])) 68 | 69 | # check the finished samples 70 | new_live_k = 0 71 | hyp_samples = [] 72 | hyp_scores = [] 73 | 
hyp_states = [] 74 | 75 | for idx in xrange(len(new_hyp_samples)): 76 | if new_hyp_samples[idx][-1] == 0: 77 | sample.append(new_hyp_samples[idx]) 78 | sample_score.append(new_hyp_scores[idx]) 79 | dead_k += 1 80 | else: 81 | new_live_k += 1 82 | hyp_samples.append(new_hyp_samples[idx]) 83 | hyp_scores.append(new_hyp_scores[idx]) 84 | hyp_states.append(new_hyp_states[idx]) 85 | hyp_scores = numpy.array(hyp_scores) 86 | live_k = new_live_k 87 | 88 | if new_live_k < 1: 89 | break 90 | if dead_k >= k: 91 | break 92 | 93 | next_w = numpy.array([w[-1] for w in hyp_samples]) 94 | next_state = numpy.array(hyp_states) 95 | 96 | if not stochastic: 97 | # dump every remaining one 98 | if live_k > 0: 99 | for idx in xrange(live_k): 100 | sample.append(hyp_samples[idx]) 101 | sample_score.append(hyp_scores[idx]) 102 | 103 | return sample, sample_score 104 | 105 | 106 | -------------------------------------------------------------------------------- /skipthoughts.py: -------------------------------------------------------------------------------- 1 | ''' 2 | Skip-thought vectors 3 | ''' 4 | import os 5 | 6 | import theano 7 | import theano.tensor as tensor 8 | 9 | import cPickle as pkl 10 | import numpy 11 | import copy 12 | import nltk 13 | 14 | from collections import OrderedDict, defaultdict 15 | from scipy.linalg import norm 16 | from nltk.tokenize import word_tokenize 17 | 18 | profile = False 19 | 20 | 21 | def load_model(path_to_models, path_to_tables): 22 | """ 23 | Load the model with saved tables 24 | """ 25 | path_to_umodel = path_to_models + 'uni_skip.npz' 26 | path_to_bmodel = path_to_models + 'bi_skip.npz' 27 | 28 | # Load model options 29 | with open('%s.pkl'%path_to_umodel, 'rb') as f: 30 | uoptions = pkl.load(f) 31 | with open('%s.pkl'%path_to_bmodel, 'rb') as f: 32 | boptions = pkl.load(f) 33 | 34 | # Load parameters 35 | uparams = init_params(uoptions) 36 | uparams = load_params(path_to_umodel, uparams) 37 | utparams = init_tparams(uparams) 38 | bparams = init_params_bi(boptions) 39 | bparams = load_params(path_to_bmodel, bparams) 40 | btparams = init_tparams(bparams) 41 | 42 | # Extractor functions 43 | embedding, x_mask, ctxw2v = build_encoder(utparams, uoptions) 44 | f_w2v = theano.function([embedding, x_mask], ctxw2v, name='f_w2v') 45 | embedding, x_mask, ctxw2v = build_encoder_bi(btparams, boptions) 46 | f_w2v2 = theano.function([embedding, x_mask], ctxw2v, name='f_w2v2') 47 | 48 | # Tables 49 | utable, btable = load_tables(path_to_tables) 50 | 51 | # Store everything we need in a dictionary 52 | model = {} 53 | model['uoptions'] = uoptions 54 | model['boptions'] = boptions 55 | model['utable'] = utable 56 | model['btable'] = btable 57 | model['f_w2v'] = f_w2v 58 | model['f_w2v2'] = f_w2v2 59 | 60 | return model 61 | 62 | def load_tables(path_to_tables): 63 | """ 64 | Load the tables 65 | """ 66 | words = [] 67 | utable = numpy.load(path_to_tables + 'utable.npy') 68 | btable = numpy.load(path_to_tables + 'btable.npy') 69 | f = open(path_to_tables + 'dictionary.txt', 'rb') 70 | for line in f: 71 | words.append(line.decode('utf-8').strip()) 72 | f.close() 73 | utable = OrderedDict(zip(words, utable)) 74 | btable = OrderedDict(zip(words, btable)) 75 | return utable, btable 76 | 77 | def encode(model, X, use_norm=True, verbose=True, batch_size=128, use_eos=False): 78 | """ 79 | Encode sentences in the list X. 
Each entry will return a vector
80 |     """
81 |     # first, do preprocessing
82 |     X = preprocess(X)
83 | 
84 |     # word dictionary and init
85 |     d = defaultdict(lambda : 0)
86 |     for w in model['utable'].keys():
87 |         d[w] = 1
88 |     ufeatures = numpy.zeros((len(X), model['uoptions']['dim']), dtype='float32')
89 |     bfeatures = numpy.zeros((len(X), 2 * model['boptions']['dim']), dtype='float32')
90 | 
91 |     # length dictionary
92 |     ds = defaultdict(list)
93 |     captions = [s.split() for s in X]
94 |     for i,s in enumerate(captions):
95 |         ds[len(s)].append(i)
96 | 
97 |     # Get features. This encodes by length, in order to avoid wasting computation
98 |     for k in ds.keys():
99 |         if verbose:
100 |             print k
101 |         numbatches = len(ds[k]) / batch_size + 1
102 |         for minibatch in range(numbatches):
103 |             caps = ds[k][minibatch::numbatches]
104 | 
105 |             if use_eos:
106 |                 uembedding = numpy.zeros((k+1, len(caps), model['uoptions']['dim_word']), dtype='float32')
107 |                 bembedding = numpy.zeros((k+1, len(caps), model['boptions']['dim_word']), dtype='float32')
108 |             else:
109 |                 uembedding = numpy.zeros((k, len(caps), model['uoptions']['dim_word']), dtype='float32')
110 |                 bembedding = numpy.zeros((k, len(caps), model['boptions']['dim_word']), dtype='float32')
111 |             for ind, c in enumerate(caps):
112 |                 caption = captions[c]
113 |                 for j in range(len(caption)):
114 |                     if d[caption[j]] > 0:
115 |                         uembedding[j,ind] = model['utable'][caption[j]]
116 |                         bembedding[j,ind] = model['btable'][caption[j]]
117 |                     else:
118 |                         uembedding[j,ind] = model['utable']['UNK']
119 |                         bembedding[j,ind] = model['btable']['UNK']
120 |                 if use_eos:
121 |                     uembedding[-1,ind] = model['utable']['<eos>']
122 |                     bembedding[-1,ind] = model['btable']['<eos>']
123 |             if use_eos:
124 |                 uff = model['f_w2v'](uembedding, numpy.ones((len(caption)+1,len(caps)), dtype='float32'))
125 |                 bff = model['f_w2v2'](bembedding, numpy.ones((len(caption)+1,len(caps)), dtype='float32'))
126 |             else:
127 |                 uff = model['f_w2v'](uembedding, numpy.ones((len(caption),len(caps)), dtype='float32'))
128 |                 bff = model['f_w2v2'](bembedding, numpy.ones((len(caption),len(caps)), dtype='float32'))
129 |             if use_norm:
130 |                 for j in range(len(uff)):
131 |                     uff[j] /= norm(uff[j])
132 |                     bff[j] /= norm(bff[j])
133 |             for ind, c in enumerate(caps):
134 |                 ufeatures[c] = uff[ind]
135 |                 bfeatures[c] = bff[ind]
136 | 
137 |     features = numpy.c_[ufeatures, bfeatures]
138 |     return features
139 | 
140 | def preprocess(text):
141 |     """
142 |     Preprocess text for encoder
143 |     """
144 |     X = []
145 |     sent_detector = nltk.data.load('tokenizers/punkt/english.pickle')
146 |     for t in text:
147 |         sents = sent_detector.tokenize(t)
148 |         result = ''
149 |         for s in sents:
150 |             tokens = word_tokenize(s)
151 |             result += ' ' + ' '.join(tokens)
152 |         X.append(result)
153 |     return X
154 | 
155 | def _p(pp, name):
156 |     """
157 |     make prefix-appended name
158 |     """
159 |     return '%s_%s'%(pp, name)
160 | 
161 | def init_tparams(params):
162 |     """
163 |     initialize Theano shared variables according to the initial parameters
164 |     """
165 |     tparams = OrderedDict()
166 |     for kk, pp in params.iteritems():
167 |         tparams[kk] = theano.shared(params[kk], name=kk)
168 |     return tparams
169 | 
170 | def load_params(path, params):
171 |     """
172 |     load parameters
173 |     """
174 |     pp = numpy.load(path)
175 |     for kk, vv in params.iteritems():
176 |         if kk not in pp:
177 |             warnings.warn('%s is not in the archive'%kk)
178 |             continue
179 |         params[kk] = pp[kk]
180 |     return params
181 | 
182 | # layers: 'name': ('parameter initializer', 'feedforward')
183 | layers = {'gru':
('param_init_gru', 'gru_layer')} 184 | 185 | def get_layer(name): 186 | fns = layers[name] 187 | return (eval(fns[0]), eval(fns[1])) 188 | 189 | def init_params(options): 190 | """ 191 | initialize all parameters needed for the encoder 192 | """ 193 | params = OrderedDict() 194 | 195 | # embedding 196 | params['Wemb'] = norm_weight(options['n_words_src'], options['dim_word']) 197 | 198 | # encoder: GRU 199 | params = get_layer(options['encoder'])[0](options, params, prefix='encoder', 200 | nin=options['dim_word'], dim=options['dim']) 201 | return params 202 | 203 | def init_params_bi(options): 204 | """ 205 | initialize all paramters needed for bidirectional encoder 206 | """ 207 | params = OrderedDict() 208 | 209 | # embedding 210 | params['Wemb'] = norm_weight(options['n_words_src'], options['dim_word']) 211 | 212 | # encoder: GRU 213 | params = get_layer(options['encoder'])[0](options, params, prefix='encoder', 214 | nin=options['dim_word'], dim=options['dim']) 215 | params = get_layer(options['encoder'])[0](options, params, prefix='encoder_r', 216 | nin=options['dim_word'], dim=options['dim']) 217 | return params 218 | 219 | def build_encoder(tparams, options): 220 | """ 221 | build an encoder, given pre-computed word embeddings 222 | """ 223 | # word embedding (source) 224 | embedding = tensor.tensor3('embedding', dtype='float32') 225 | x_mask = tensor.matrix('x_mask', dtype='float32') 226 | 227 | # encoder 228 | proj = get_layer(options['encoder'])[1](tparams, embedding, options, 229 | prefix='encoder', 230 | mask=x_mask) 231 | ctx = proj[0][-1] 232 | 233 | return embedding, x_mask, ctx 234 | 235 | def build_encoder_bi(tparams, options): 236 | """ 237 | build bidirectional encoder, given pre-computed word embeddings 238 | """ 239 | # word embedding (source) 240 | embedding = tensor.tensor3('embedding', dtype='float32') 241 | embeddingr = embedding[::-1] 242 | x_mask = tensor.matrix('x_mask', dtype='float32') 243 | xr_mask = x_mask[::-1] 244 | 245 | # encoder 246 | proj = get_layer(options['encoder'])[1](tparams, embedding, options, 247 | prefix='encoder', 248 | mask=x_mask) 249 | projr = get_layer(options['encoder'])[1](tparams, embeddingr, options, 250 | prefix='encoder_r', 251 | mask=xr_mask) 252 | 253 | ctx = tensor.concatenate([proj[0][-1], projr[0][-1]], axis=1) 254 | 255 | return embedding, x_mask, ctx 256 | 257 | # some utilities 258 | def ortho_weight(ndim): 259 | W = numpy.random.randn(ndim, ndim) 260 | u, s, v = numpy.linalg.svd(W) 261 | return u.astype('float32') 262 | 263 | def norm_weight(nin,nout=None, scale=0.1, ortho=True): 264 | if nout == None: 265 | nout = nin 266 | if nout == nin and ortho: 267 | W = ortho_weight(nin) 268 | else: 269 | W = numpy.random.uniform(low=-scale, high=scale, size=(nin, nout)) 270 | return W.astype('float32') 271 | 272 | def param_init_gru(options, params, prefix='gru', nin=None, dim=None): 273 | """ 274 | parameter init for GRU 275 | """ 276 | if nin == None: 277 | nin = options['dim_proj'] 278 | if dim == None: 279 | dim = options['dim_proj'] 280 | W = numpy.concatenate([norm_weight(nin,dim), 281 | norm_weight(nin,dim)], axis=1) 282 | params[_p(prefix,'W')] = W 283 | params[_p(prefix,'b')] = numpy.zeros((2 * dim,)).astype('float32') 284 | U = numpy.concatenate([ortho_weight(dim), 285 | ortho_weight(dim)], axis=1) 286 | params[_p(prefix,'U')] = U 287 | 288 | Wx = norm_weight(nin, dim) 289 | params[_p(prefix,'Wx')] = Wx 290 | Ux = ortho_weight(dim) 291 | params[_p(prefix,'Ux')] = Ux 292 | params[_p(prefix,'bx')] = 
numpy.zeros((dim,)).astype('float32') 293 | 294 | return params 295 | 296 | def gru_layer(tparams, state_below, options, prefix='gru', mask=None, **kwargs): 297 | """ 298 | Forward pass through GRU layer 299 | """ 300 | nsteps = state_below.shape[0] 301 | if state_below.ndim == 3: 302 | n_samples = state_below.shape[1] 303 | else: 304 | n_samples = 1 305 | 306 | dim = tparams[_p(prefix,'Ux')].shape[1] 307 | 308 | if mask == None: 309 | mask = tensor.alloc(1., state_below.shape[0], 1) 310 | 311 | def _slice(_x, n, dim): 312 | if _x.ndim == 3: 313 | return _x[:, :, n*dim:(n+1)*dim] 314 | return _x[:, n*dim:(n+1)*dim] 315 | 316 | state_below_ = tensor.dot(state_below, tparams[_p(prefix, 'W')]) + tparams[_p(prefix, 'b')] 317 | state_belowx = tensor.dot(state_below, tparams[_p(prefix, 'Wx')]) + tparams[_p(prefix, 'bx')] 318 | U = tparams[_p(prefix, 'U')] 319 | Ux = tparams[_p(prefix, 'Ux')] 320 | 321 | def _step_slice(m_, x_, xx_, h_, U, Ux): 322 | preact = tensor.dot(h_, U) 323 | preact += x_ 324 | 325 | r = tensor.nnet.sigmoid(_slice(preact, 0, dim)) 326 | u = tensor.nnet.sigmoid(_slice(preact, 1, dim)) 327 | 328 | preactx = tensor.dot(h_, Ux) 329 | preactx = preactx * r 330 | preactx = preactx + xx_ 331 | 332 | h = tensor.tanh(preactx) 333 | 334 | h = u * h_ + (1. - u) * h 335 | h = m_[:,None] * h + (1. - m_)[:,None] * h_ 336 | 337 | return h 338 | 339 | seqs = [mask, state_below_, state_belowx] 340 | _step = _step_slice 341 | 342 | rval, updates = theano.scan(_step, 343 | sequences=seqs, 344 | outputs_info = [tensor.alloc(0., n_samples, dim)], 345 | non_sequences = [tparams[_p(prefix, 'U')], 346 | tparams[_p(prefix, 'Ux')]], 347 | name=_p(prefix, '_layers'), 348 | n_steps=nsteps, 349 | profile=profile, 350 | strict=True) 351 | rval = [rval] 352 | return rval 353 | 354 | 355 | --------------------------------------------------------------------------------