├── .gitignore ├── README.md ├── coco └── coco_preprocess.ipynb ├── dataloader.py ├── dataloaderraw.py ├── eval.py ├── eval_utils.py ├── misc ├── AttentionModel.py ├── ShowAttendTellModel.py ├── ShowAttendTellModel_old.py ├── ShowTellModel.py ├── __init__.py └── utils.py ├── models.py ├── opts.py ├── prepro.py ├── test ├── test_model.py └── test_simpleloader.py ├── train.py ├── vgg.py └── vis ├── index.html └── jquery-1.8.3.min.js /.gitignore: -------------------------------------------------------------------------------- 1 | *.pyc 2 | models 3 | 4 | -------------------------------------------------------------------------------- /README.md: -------------------------------------------------------------------------------- 1 | # Neuraltalk2-tensorflow 2 | This is a toy project for myself to start learning tensorflow. 3 | 4 | I started to learn torch by learning from neuraltalk2, so I am starting my tensorflow work with this project too. 5 | 6 | I think this project is good for those who are familiar with neuraltalk2 in torch, because the main pipeline is almost the same. I don't know if it's a good tutorial for learning tensorflow, because the comments are still limited so far. 7 | 8 | Without finetuning VGG, my code gives a CIDEr score of ~0.65 on the validation set (in 50,000 iterations). 9 | 10 | Currently, if you want to use my code, you need to train the model from scratch (except for VGG-16). 11 | 12 | # TODO: 13 | - ~~Finetuning VGG doesn't seem to work. Needs to be fixed.~~ 14 | - ~~No need to initialize from npy when a saved weight exists.~~ 15 | - Tensorflow-style file loading. (Multi-threaded image loading) 16 | - ~~Test of stacked LSTMs, and also GRUs~~ 17 | - Pretrained model 18 | - ~~Test code on a single image~~ 19 | - Scheduled sampling 20 | - ~~sample_max~~ 21 | - ~~eval on unseen images~~ 22 | - eval on test 23 | - visualize attention map 24 | 25 | # Requirements 26 | Python 2.7 27 | 28 | [Tensorflow 1.0](https://github.com/tensorflow/tensorflow); please follow the instructions on the tensorflow website to install it. 29 | 30 | # Train your own network on COCO 31 | **(Copy from neuraltalk2)** 32 | 33 | Great, first we need to do some preprocessing. Head over to the `coco/` folder and run the IPython notebook to download the dataset and do some very simple preprocessing. The notebook will combine the train/val data together and create a very simple and small json file that contains a large list of image paths and raw captions for each image, of the form: 34 | 35 | ``` 36 | [{ "file_path": "path/img.jpg", "captions": ["a caption", "a second caption of this image", ...] }, ...] 37 | ``` 38 | 39 | Once we have this, we're ready to invoke the `prepro.py` script, which will read all of this in and create a dataset (an hdf5 file and a json file) ready for consumption by the training code. For example, for MS COCO we can run the prepro script as follows: 40 | 41 | ```bash 42 | $ python prepro.py --input_json coco/coco_raw.json --num_val 5000 --num_test 5000 --images_root coco/images --word_count_threshold 5 --output_json coco/cocotalk.json --output_h5 coco/cocotalk.h5 43 | ``` 44 | 45 | This tells the script to read in all the data (the images and the captions), allocate 5000 images each for the val and test splits, and map all words that occur <= 5 times to a special `UNK` token. The resulting `json` and `h5` files are about 30GB and contain everything we want to know about the dataset. 46 |
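If you want to sanity-check the intermediate `coco_raw.json` before running `prepro.py`, a few lines of Python are enough. This is a minimal sketch written against the format shown above; the script name is just an example, and the path assumes you kept the notebook's default output location:

```python
# check_coco_raw.py -- sanity-check the raw json produced by the notebook
import json

with open('coco/coco_raw.json') as f:
    imgs = json.load(f)

print('number of images: %d' % len(imgs))
print('example entry: %s' % imgs[0])

# every entry should carry an image path and at least one caption
for img in imgs:
    assert 'file_path' in img and 'captions' in img, 'malformed entry: %s' % img
    assert len(img['captions']) > 0, 'no captions for %s' % img['file_path']
```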
47 | **Warning**: the prepro script will fail with the default MSCOCO data because one of their images is corrupted. See [this issue](https://github.com/karpathy/neuraltalk2/issues/4) for the fix; it involves manually replacing one image in the dataset. 48 | 49 | **(Copy end.)** 50 | 51 | Note that the split used here cannot be used for research. You can email me to ask for the preprocessing code for the COCO "standard" split, or you can modify the code yourself if you are familiar with it. 52 | 53 | ~~Download or generate a tensorflow version of the pretrained vgg-16: [tensorflow-vgg16](https://github.com/ry/tensorflow-vgg16).~~ 54 | 55 | I borrowed [machrisaa/tensorflow-vgg](https://github.com/machrisaa/tensorflow-vgg) and made some modifications: 56 | - Added a variable `training` to switch between the training and evaluation modes of the model (in principle it controls the dropout probability). 57 | - Defined all the weights and biases as Variables (previously constants). 58 | 59 | You need to download the npy file of VGG: [vgg16](https://dl.dropboxusercontent.com/u/50333326/vgg16.npy) or [vgg19](https://dl.dropboxusercontent.com/u/50333326/vgg19.npy). Put the file somewhere (e.g. a `models` directory), and we're ready to train! 60 | 61 | ```bash 62 | $ python train.py --input_json coco/cocotalk.json --input_h5 coco/cocotalk.h5 --checkpoint_path ./log --save_checkpoint_every 2000 --val_images_use 3200 63 | ``` 64 | 65 | The train script will take over, and start dumping checkpoints into the folder specified by `checkpoint_path` (default = current folder). For more options, see `opts.py`. 66 | 67 | If you'd like to evaluate BLEU/METEOR/CIDEr scores during training in addition to the validation cross entropy loss, use the `--language_eval 1` option, but don't forget to download the [coco-caption code](https://github.com/tylin/coco-caption) into a `coco-caption` directory. 68 | 69 | **A few notes on training.** To give you an idea, with the default settings one epoch of MS COCO images is about 7500 iterations. One epoch of training (with no finetuning - notice this is the default) takes about 45 minutes and results in a validation loss of ~2.7 and a CIDEr score of ~0.5. By iteration 50,000 CIDEr climbs up to about 0.65 (validation loss at about 2.4). 70 | 71 | ### Caption images after training 72 | 73 | Now place all your images of interest into a folder, e.g. `blah`, and run 74 | the eval script: 75 | 76 | ```bash 77 | $ python eval.py --model model.ckpt-**** --infos_path infos_.pkl --image_folder blah --num_images 10 78 | ``` 79 | 80 | This tells the `eval` script to run up to 10 images from the given folder. If you have a big GPU you can speed up the evaluation by increasing `batch_size` (default = 1). Use `--num_images -1` to process all images. The eval script will create a `vis.json` file inside the `vis` folder, which can then be visualized with the provided HTML interface: 81 | 82 | ```bash 83 | $ cd vis 84 | $ python -m SimpleHTTPServer 85 | ``` 86 | 87 | Now visit `localhost:8000` in your browser and you should see your predicted captions. 88 | 89 | **Beam Search**. Beam search is enabled by default because it improves the quality of the argmax decoding sequence. However, it is a little more expensive, so if you'd like to evaluate images faster, at a small cost in performance, use `--beam_size 1`. ~~For example, in one of my experiments beam size 2 gives CIDEr 0.922, and beam size 1 gives CIDEr 0.886.~~ 90 |
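**Reading the dump programmatically**. If you'd rather consume the predictions without the HTML interface, the `vis/vis.json` file written by `eval.py` is simply a list of `{"image_id": ..., "caption": ...}` records, so a few lines of Python are enough (a minimal sketch; the script name is just an example):

```python
# read_predictions.py -- print the captions dumped by eval.py into vis/vis.json
import json

with open('vis/vis.json') as f:
    predictions = json.load(f)

for p in predictions:
    print('image %s: %s' % (p['image_id'], p['caption']))
```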
91 | **Running on MSCOCO images**. If you train on MSCOCO (see how below), you will have generated preprocessed MSCOCO images, which you can use directly in the eval script. In this case simply leave out the `image_folder` option and instead pass in the `input_h5` and `input_json` paths to your preprocessed files. 92 | 93 | # Acknowledgements 94 | I learned a lot from the following repositories. 95 | 96 | - [neuraltalk2](https://github.com/karpathy/neuraltalk2) (of course) 97 | - [colornet](https://github.com/pavelgonchar/colornet) (for using pretrained vgg-16) 98 | - [tensorflow-vgg16](https://github.com/ry/tensorflow-vgg16.git) (tensorflow version of vgg-16) 99 | - [machrisaa/tensorflow-vgg](https://github.com/machrisaa/tensorflow-vgg) (for better loading of vgg-16, but still not perfect) 100 | - [huyng/tensorflow-vgg](https://github.com/huyng/tensorflow-vgg) (this may be my next attempt) 101 | - [char-rnn-tensorflow](https://github.com/sherjilozair/char-rnn-tensorflow) (for using the RNN wrapper provided by tensorflow) 102 | - [show_and_tell.tensorflow](https://github.com/jazzsaxmafia/show_and_tell.tensorflow) (gave me the idea of how to dump option information; furthermore, it implements the same algorithm as mine but with a different code structure) 103 | - [TF-mRNN](https://github.com/mjhucla/TF-mRNN) (I borrowed the beam search code; this is also a very good caption generation model) 104 | -------------------------------------------------------------------------------- /coco/coco_preprocess.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "metadata": {}, 6 | "source": [ 7 | "# COCO data preprocessing\n", 8 | "\n", 9 | "This code will download the caption annotations for coco and preprocess them into an hdf5 file and a json file. \n", 10 | "\n", 11 | "These will then be read by the COCO data loader and trained on." 12 | ] 13 | }, 14 | { 15 | "cell_type": "code", 16 | "execution_count": 2, 17 | "metadata": { 18 | "collapsed": false 19 | }, 20 | "outputs": [ 21 | { 22 | "data": { 23 | "text/plain": [ 24 | "0" 25 | ] 26 | }, 27 | "execution_count": 2, 28 | "metadata": {}, 29 | "output_type": "execute_result" 30 | } 31 | ], 32 | "source": [ 33 | "# lets download the annotations from http://mscoco.org/dataset/#download\n", 34 | "import os\n", 35 | "os.system('wget http://msvocds.blob.core.windows.net/annotations-1-0-3/captions_train-val2014.zip') # ~19MB" 36 | ] 37 | }, 38 | { 39 | "cell_type": "code", 40 | "execution_count": 3, 41 | "metadata": { 42 | "collapsed": false 43 | }, 44 | "outputs": [ 45 | { 46 | "data": { 47 | "text/plain": [ 48 | "0" 49 | ] 50 | }, 51 | "execution_count": 3, 52 | "metadata": {}, 53 | "output_type": "execute_result" 54 | } 55 | ], 56 | "source": [ 57 | "os.system('unzip captions_train-val2014.zip')" 58 | ] 59 | }, 60 | { 61 | "cell_type": "code", 62 | "execution_count": 2, 63 | "metadata": { 64 | "collapsed": false 65 | }, 66 | "outputs": [], 67 | "source": [ 68 | "import json\n", 69 | "val = json.load(open('annotations/captions_val2014.json', 'r'))\n", 70 | "train = json.load(open('annotations/captions_train2014.json', 'r'))" 71 | ] 72 | }, 73 | { 74 | "cell_type": "code", 75 | "execution_count": 3, 76 | "metadata": { 77 | "collapsed": false 78 | }, 79 | "outputs": [ 80 | { 81 | "name": "stdout", 82 | "output_type": "stream", 83 | "text": [ 84 | "[u'info', u'images', u'licenses', u'annotations']\n", 85 | "{u'description': u'This is stable 1.0 version of the 2014 MS COCO dataset.', u'url': u'http://mscoco.org', u'version': u'1.0', u'year': 2014, u'contributor': u'Microsoft COCO group', u'date_created': u'2015-01-27 09:11:52.357475'}\n",
86 | "40504\n", 87 | "202654\n", 88 | "{u'license': 3, u'file_name': u'COCO_val2014_000000391895.jpg', u'coco_url': u'http://mscoco.org/images/391895', u'height': 360, u'width': 640, u'date_captured': u'2013-11-14 11:18:45', u'flickr_url': u'http://farm9.staticflickr.com/8186/8119368305_4e622c8349_z.jpg', u'id': 391895}\n", 89 | "{u'image_id': 203564, u'id': 37, u'caption': u'A bicycle replica with a clock as the front wheel.'}\n" 90 | ] 91 | } 92 | ], 93 | "source": [ 94 | "print val.keys()\n", 95 | "print val['info']\n", 96 | "print len(val['images'])\n", 97 | "print len(val['annotations'])\n", 98 | "print val['images'][0]\n", 99 | "print val['annotations'][0]" 100 | ] 101 | }, 102 | { 103 | "cell_type": "code", 104 | "execution_count": 6, 105 | "metadata": { 106 | "collapsed": false 107 | }, 108 | "outputs": [], 109 | "source": [ 110 | "import json\n", 111 | "import os\n", 112 | "\n", 113 | "# combine all images and annotations together\n", 114 | "imgs = val['images'] + train['images']\n", 115 | "annots = val['annotations'] + train['annotations']\n", 116 | "\n", 117 | "# for efficiency lets group annotations by image\n", 118 | "itoa = {}\n", 119 | "for a in annots:\n", 120 | " imgid = a['image_id']\n", 121 | " if not imgid in itoa: itoa[imgid] = []\n", 122 | " itoa[imgid].append(a)\n", 123 | "\n", 124 | "# create the json blob\n", 125 | "out = []\n", 126 | "for i,img in enumerate(imgs):\n", 127 | " imgid = img['id']\n", 128 | " \n", 129 | " # coco specific here, they store train/val images separately\n", 130 | " loc = 'train2014' if 'train' in img['file_name'] else 'val2014'\n", 131 | " \n", 132 | " jimg = {}\n", 133 | " jimg['file_path'] = os.path.join(loc, img['file_name'])\n", 134 | " jimg['id'] = imgid\n", 135 | " \n", 136 | " sents = []\n", 137 | " annotsi = itoa[imgid]\n", 138 | " for a in annotsi:\n", 139 | " sents.append(a['caption'])\n", 140 | " jimg['captions'] = sents\n", 141 | " out.append(jimg)\n", 142 | " \n", 143 | "json.dump(out, open('coco_raw.json', 'w'))" 144 | ] 145 | }, 146 | { 147 | "cell_type": "code", 148 | "execution_count": 7, 149 | "metadata": { 150 | "collapsed": false 151 | }, 152 | "outputs": [ 153 | { 154 | "name": "stdout", 155 | "output_type": "stream", 156 | "text": [ 157 | "{'captions': [u'A man with a red helmet on a small moped on a dirt road. ', u'Man riding a motor bike on a dirt road on the countryside.', u'A man riding on the back of a motorcycle.', u'A dirt path with a young person on a motor bike rests to the foreground of a verdant area with a bridge and a background of cloud-wreathed mountains. 
', u'A man in a red shirt and a red hat is on a motorcycle on a hill side.'], 'file_path': u'val2014/COCO_val2014_000000391895.jpg', 'id': 391895}\n" 158 | ] 159 | } 160 | ], 161 | "source": [ 162 | "# lets see what they look like\n", 163 | "print out[0]" 164 | ] 165 | } 166 | ], 167 | "metadata": { 168 | "kernelspec": { 169 | "display_name": "Python 2", 170 | "language": "python", 171 | "name": "python2" 172 | }, 173 | "language_info": { 174 | "codemirror_mode": { 175 | "name": "ipython", 176 | "version": 2 177 | }, 178 | "file_extension": ".py", 179 | "mimetype": "text/x-python", 180 | "name": "python", 181 | "nbconvert_exporter": "python", 182 | "pygments_lexer": "ipython2", 183 | "version": "2.7.6" 184 | } 185 | }, 186 | "nbformat": 4, 187 | "nbformat_minor": 0 188 | } 189 | -------------------------------------------------------------------------------- /dataloader.py: -------------------------------------------------------------------------------- 1 | from __future__ import absolute_import 2 | from __future__ import division 3 | from __future__ import print_function 4 | 5 | import json 6 | import h5py 7 | import os 8 | import tensorflow as tf 9 | import numpy as np 10 | import random 11 | import skimage 12 | import skimage.io 13 | import scipy.misc 14 | 15 | class DataLoader(): 16 | 17 | def __init__(self, opt): 18 | self.opt = opt 19 | self.batch_size = self.opt.batch_size 20 | self.seq_per_img = self.opt.seq_per_img 21 | 22 | # load the json file which contains additional information about the dataset 23 | print('DataLoader loading json file: ', opt.input_json) 24 | self.info = json.load(open(self.opt.input_json)) 25 | self.ix_to_word = self.info['ix_to_word'] 26 | self.vocab_size = len(self.ix_to_word) 27 | print('vocab size is ', self.vocab_size) 28 | 29 | # open the hdf5 file 30 | print('DataLoader loading h5 file: ', opt.input_h5) 31 | self.h5_file = h5py.File(self.opt.input_h5) 32 | 33 | 34 | # extract image size from dataset 35 | images_size = self.h5_file['images'].shape 36 | assert len(images_size) == 4, 'images should be a 4D tensor' 37 | assert images_size[2] == images_size[3], 'width and height must match' 38 | self.num_images = images_size[0] 39 | self.num_channels = images_size[1] 40 | self.max_image_size = images_size[2] 41 | print('read %d images of size %dx%dx%d' %(self.num_images, 42 | self.num_channels, self.max_image_size, self.max_image_size)) 43 | 44 | # load in the sequence data 45 | seq_size = self.h5_file['labels'].shape 46 | self.seq_length = seq_size[1] 47 | print('max sequence length in data is', self.seq_length) 48 | # load the pointers in full to RAM (should be small enough) 49 | self.label_start_ix = self.h5_file['label_start_ix'][:] 50 | self.label_end_ix = self.h5_file['label_end_ix'][:] 51 | 52 | # separate out indexes for each of the provided splits 53 | self.split_ix = {'train': [], 'val': [], 'test': []} 54 | for ix in range(len(self.info['images'])): 55 | img = self.info['images'][ix] 56 | if img['split'] == 'train': 57 | self.split_ix['train'].append(ix) 58 | elif img['split'] == 'val': 59 | self.split_ix['val'].append(ix) 60 | elif img['split'] == 'test': 61 | self.split_ix['test'].append(ix) 62 | elif opt.train_only == 0: # restval 63 | self.split_ix['train'].append(ix) 64 | 65 | print('assigned %d images to split train' %len(self.split_ix['train'])) 66 | print('assigned %d images to split val' %len(self.split_ix['val'])) 67 | print('assigned %d images to split test' %len(self.split_ix['test'])) 68 | 69 | self.iterators = {'train': 0, 'val': 
0, 'test': 0} 70 | 71 | def get_vocab_size(self): 72 | return self.vocab_size 73 | 74 | def get_vocab(self): 75 | return self.ix_to_word 76 | 77 | def get_seq_length(self): 78 | return self.seq_length 79 | 80 | def get_batch(self, split, batch_size=None): 81 | split_ix = self.split_ix[split] 82 | batch_size = batch_size or self.batch_size 83 | 84 | img_batch = np.ndarray([batch_size, 224,224,3], dtype = 'float32') 85 | label_batch = np.zeros([batch_size * self.seq_per_img, self.seq_length + 2], dtype = 'int') 86 | mask_batch = np.zeros([batch_size * self.seq_per_img, self.seq_length + 2], dtype = 'float32') 87 | 88 | max_index = len(split_ix) 89 | wrapped = False 90 | 91 | infos = [] 92 | 93 | for i in range(batch_size): 94 | ri = self.iterators[split] 95 | ri_next = ri + 1 96 | if ri_next >= max_index: 97 | ri_next = 0 98 | wrapped = True 99 | self.iterators[split] = ri_next 100 | ix = split_ix[ri] 101 | 102 | # fetch image 103 | #img = self.load_image(self.image_info[ix]['filename']) 104 | img = self.h5_file['images'][ix, :, :, :].transpose(1, 2, 0) 105 | img_batch[i] = img[16:240, 16:240, :].astype('float32')/255.0 106 | 107 | # fetch the sequence labels 108 | ix1 = self.label_start_ix[ix] - 1 #label_start_ix starts from 1 109 | ix2 = self.label_end_ix[ix] - 1 110 | ncap = ix2 - ix1 + 1 # number of captions available for this image 111 | assert ncap > 0, 'an image does not have any label. this can be handled but right now isn\'t' 112 | 113 | if ncap < self.seq_per_img: 114 | # we need to subsample (with replacement) 115 | seq = np.zeros([self.seq_per_img, self.seq_length], dtype = 'int') 116 | for q in range(self.seq_per_img): 117 | ixl = random.randint(ix1,ix2) 118 | seq[q, :] = self.h5_file['labels'][ixl, :self.seq_length] 119 | else: 120 | ixl = random.randint(ix1, ix2 - self.seq_per_img + 1) 121 | seq = self.h5_file['labels'][ixl: ixl + self.seq_per_img, :self.seq_length] 122 | 123 | label_batch[i * self.seq_per_img : (i + 1) * self.seq_per_img, 1 : self.seq_length + 1] = seq 124 | 125 | # record associated info as well 126 | info_dict = {} 127 | info_dict['id'] = self.info['images'][ix]['id'] 128 | info_dict['file_path'] = self.info['images'][ix]['file_path'] 129 | infos.append(info_dict) 130 | 131 | # generate mask 132 | nonzeros = np.array(map(lambda x: (x != 0).sum()+2, label_batch)) 133 | for ix, row in enumerate(mask_batch): 134 | row[:nonzeros[ix]] = 1 135 | 136 | data = {} 137 | data['images'] = img_batch 138 | data['labels'] = label_batch 139 | data['masks'] = mask_batch 140 | data['bounds'] = {'it_pos_now': self.iterators[split], 'it_max': len(split_ix), 'wrapped': wrapped} 141 | data['infos'] = infos 142 | 143 | return data 144 | 145 | def reset_iterator(self, split): 146 | self.iterators[split] = 0 147 | -------------------------------------------------------------------------------- /dataloaderraw.py: -------------------------------------------------------------------------------- 1 | from __future__ import absolute_import 2 | from __future__ import division 3 | from __future__ import print_function 4 | 5 | import json 6 | import h5py 7 | import os 8 | import tensorflow as tf 9 | import numpy as np 10 | import random 11 | import skimage 12 | import skimage.io 13 | import scipy.misc 14 | 15 | class DataLoaderRaw(): 16 | 17 | def __init__(self, opt): 18 | self.opt = opt 19 | self.coco_json = opt.get('coco_json', '') 20 | self.folder_path = opt.get('folder_path', '') 21 | 22 | self.batch_size = opt.get('batch_size', 1) 23 | 24 | # load the json file which contains 
additional information about the dataset 25 | print('DataLoaderRaw loading images from folder: ', self.folder_path) 26 | 27 | self.files = [] 28 | self.ids = [] 29 | 30 | print(len(self.coco_json)) 31 | if len(self.coco_json) > 0: 32 | print('reading from ' + opt.coco_json) 33 | # read in filenames from the coco-style json file 34 | self.coco_annotation = json.load(open(self.coco_json)) 35 | for k,v in enumerate(self.coco_annotation['images']): 36 | fullpath = os.path.join(self.folder_path, v['file_name']) 37 | self.files.append(fullpath) 38 | self.ids.append(v['id']) 39 | else: 40 | # read in all the filenames from the folder 41 | print('listing all images in directory ' + self.folder_path) 42 | def isImage(f): 43 | supportedExt = ['.jpg','.JPG','.jpeg','.JPEG','.png','.PNG','.ppm','.PPM'] 44 | for ext in supportedExt: 45 | start_idx = f.rfind(ext) 46 | if start_idx >= 0 and start_idx + len(ext) == len(f): 47 | return True 48 | return False 49 | 50 | n = 1 51 | for root, dirs, files in os.walk(self.folder_path, topdown=False): 52 | for file in files: 53 | fullpath = os.path.join(self.folder_path, file) 54 | if isImage(fullpath): 55 | self.files.append(fullpath) 56 | self.ids.append(str(n)) # just order them sequentially 57 | n = n + 1 58 | 59 | self.N = len(self.files) 60 | print('DataLoaderRaw found ', self.N, ' images') 61 | 62 | self.iterator = 0 63 | 64 | def get_batch(self, split, batch_size=None): 65 | batch_size = batch_size or self.batch_size 66 | 67 | # pick an index of the datapoint to load next 68 | img_batch = np.ndarray([batch_size, 224,224,3], dtype = 'float32') 69 | max_index = self.N 70 | wrapped = False 71 | infos = [] 72 | 73 | for i in range(batch_size): 74 | ri = self.iterator 75 | ri_next = ri + 1 76 | if ri_next >= max_index: 77 | ri_next = 0 78 | wrapped = True 79 | # wrap back around 80 | self.iterator = ri_next 81 | 82 | img = skimage.io.imread(self.files[ri]) 83 | 84 | if len(img.shape) == 2: 85 | img = img[:,:,np.newaxis] 86 | img = img.concatenate((img, img, img), axis=2) 87 | 88 | img_batch[i] = img[16:240, 16:240, :].astype('float32')/255.0 89 | 90 | info_struct = {} 91 | info_struct['id'] = self.ids[ri] 92 | info_struct['file_path'] = self.files[ri] 93 | infos.append(info_struct) 94 | 95 | data = {} 96 | data['images'] = img_batch 97 | data['bounds'] = {'it_pos_now': self.iterator, 'it_max': self.N, 'wrapped': wrapped} 98 | data['infos'] = infos 99 | 100 | return data 101 | 102 | def reset_iterator(self, split): 103 | self.iterator = 0 104 | -------------------------------------------------------------------------------- /eval.py: -------------------------------------------------------------------------------- 1 | from __future__ import absolute_import 2 | from __future__ import division 3 | from __future__ import print_function 4 | 5 | import json 6 | import numpy as np 7 | import tensorflow as tf 8 | 9 | import time 10 | import os 11 | from six.moves import cPickle 12 | 13 | import opts 14 | import models 15 | from dataloader import * 16 | from dataloaderraw import * 17 | import eval_utils 18 | import argparse 19 | import misc.utils as utils 20 | 21 | NUM_THREADS = 2 #int(os.environ['OMP_NUM_THREADS']) 22 | 23 | # Input arguments and options 24 | parser = argparse.ArgumentParser() 25 | # Input paths 26 | parser.add_argument('--model', type=str, default='', 27 | help='path to model to evaluate') 28 | parser.add_argument('--infos_path', type=str, default='', 29 | help='path to infos to evaluate') 30 | # Basic options 31 | parser.add_argument('--batch_size', 
type=int, default=0, 32 | help='if > 0 then overrule, otherwise load from checkpoint.') 33 | parser.add_argument('--num_images', type=int, default=-1, 34 | help='how many images to use when periodically evaluating the loss? (-1 = all)') 35 | parser.add_argument('--language_eval', type=int, default=0, 36 | help='Evaluate language as well (1 = yes, 0 = no)? BLEU/CIDEr/METEOR/ROUGE_L? requires coco-caption code from Github.') 37 | parser.add_argument('--dump_images', type=int, default=1, 38 | help='Dump images into vis/imgs folder for vis? (1=yes,0=no)') 39 | parser.add_argument('--dump_json', type=int, default=1, 40 | help='Dump json with predictions into vis folder? (1=yes,0=no)') 41 | parser.add_argument('--dump_path', type=int, default=0, 42 | help='Write image paths along with predictions into vis json? (1=yes,0=no)') 43 | 44 | # Sampling options 45 | parser.add_argument('--sample_max', type=int, default=1, 46 | help='1 = sample argmax words. 0 = sample from distributions.') 47 | parser.add_argument('--beam_size', type=int, default=2, 48 | help='used when sample_max = 1, indicates number of beams in beam search. Usually 2 or 3 works well. More is not better. Set this to 1 for faster runtime but a bit worse performance.') 49 | parser.add_argument('--temperature', type=float, default=1.0, 50 | help='temperature when sampling from distributions (i.e. when sample_max = 0). Lower = "safer" predictions.') 51 | # For evaluation on a folder of images: 52 | parser.add_argument('--image_folder', type=str, default='', 53 | help='If this is nonempty then will predict on the images in this folder path') 54 | parser.add_argument('--image_root', type=str, default='', 55 | help='In case the image paths have to be preprended with a root path to an image folder') 56 | # For evaluation on MSCOCO images from some split: 57 | parser.add_argument('--input_h5', type=str, default='', 58 | help='path to the h5file containing the preprocessed dataset. empty = fetch from model checkpoint.') 59 | parser.add_argument('--input_json', type=str, default='', 60 | help='path to the json file containing additional info and vocab. empty = fetch from model checkpoint.') 61 | parser.add_argument('--split', type=str, default='test', 62 | help='if running on MSCOCO images, which split to use: val|test|train') 63 | parser.add_argument('--coco_json', type=str, default='', 64 | help='if nonempty then use this file in DataLoaderRaw (see docs there). Used only in MSCOCO test evaluation, where we have a specific json file of only test set images.') 65 | # misc 66 | parser.add_argument('--id', type=str, default='evalscript', 67 | help='an id identifying this run/job. 
used only if language_eval = 1 for appending to intermediate files') 68 | 69 | opt = parser.parse_args() 70 | 71 | # Load infos 72 | with open(opt.infos_path) as f: 73 | infos = cPickle.load(f) 74 | 75 | # override and collect parameters 76 | if len(opt.input_h5) == 0: 77 | opt.input_h5 = infos['opt'].input_h5 78 | if len(opt.input_json) == 0: 79 | opt.input_json = infos['opt'].input_json 80 | if opt.batch_size == 0: 81 | opt.batch_size = infos['opt'].batch_size 82 | ignore = ["id", "batch_size", "beam_size", "start_from"] 83 | for k in vars(infos['opt']).keys(): 84 | if k not in ignore: 85 | if k in vars(opt): 86 | assert vars(opt)[k] == vars(infos['opt'])[k], k + ' option not consistent' 87 | else: 88 | vars(opt).update({k: vars(infos['opt'])[k]}) # copy over options from model 89 | 90 | vocab = infos['vocab'] # ix -> word mapping 91 | 92 | # Setup the model 93 | model = models.setup(opt) 94 | model.build_model() 95 | model.build_generator() 96 | model.build_decoder() 97 | 98 | # Create the Data Loader instance 99 | if len(opt.image_folder) == 0: 100 | loader = DataLoader(opt) 101 | else: 102 | loader = DataLoaderRaw({'folder_path': opt.image_folder, 103 | 'coco_json': opt.coco_json, 104 | 'batch_size': opt.batch_size}) 105 | 106 | # Evaluation fun(ction) 107 | def eval_split(sess, model, loader, eval_kwargs): 108 | verbose = eval_kwargs.get('verbose', True) 109 | num_images = eval_kwargs.get('num_images', -1) 110 | split = eval_kwargs.get('split', 'test') 111 | language_eval = eval_kwargs.get('language_eval', 0) 112 | dataset = eval_kwargs.get('dataset', 'coco') 113 | 114 | # Make sure in the evaluation mode 115 | sess.run(tf.assign(model.training, False)) 116 | sess.run(tf.assign(model.cnn_training, False)) 117 | 118 | loader.reset_iterator(split) 119 | 120 | n = 0 121 | loss_sum = 0 122 | loss_evals = 1e-8 123 | predictions = [] 124 | 125 | while True: 126 | # fetch a batch of data 127 | if opt.beam_size > 1: 128 | data = loader.get_batch(split, 1) 129 | n = n + 1 130 | else: 131 | data = loader.get_batch(split, opt.batch_size) 132 | n = n + opt.batch_size 133 | 134 | #evaluate loss if we have the labels 135 | loss = 0 136 | if data.get('labels', None) is not None: 137 | # forward the model to get loss 138 | feed = {model.images: data['images'], model.labels: data['labels'], model.masks: data['masks']} 139 | loss = sess.run(model.cost, feed) 140 | loss_sum = loss_sum + loss 141 | loss_evals = loss_evals + 1 142 | 143 | # forward the model to also get generated samples for each image 144 | if opt.beam_size == 1: 145 | # forward the model to also get generated samples for each image 146 | feed = {model.images: data['images']} 147 | #g_o,g_l,g_p, seq = sess.run([model.g_output, model.g_logits, model.g_probs, model.generator], feed) 148 | seq = sess.run(model.generator, feed) 149 | 150 | #set_trace() 151 | sents = utils.decode_sequence(vocab, seq) 152 | 153 | for k, sent in enumerate(sents): 154 | entry = {'image_id': data['infos'][k]['id'], 'caption': sent} 155 | predictions.append(entry) 156 | if verbose: 157 | print('image %s: %s' %(entry['image_id'], entry['caption'])) 158 | else: 159 | seq = model.decode(data['images'], opt.beam_size, sess) 160 | sents = [' '.join([vocab.get(str(ix), '') for ix in sent]).strip() for sent in seq] 161 | sents = [sents[0]] 162 | entry = {'image_id': data['infos'][0]['id'], 'caption': sents[0]} 163 | predictions.append(entry) 164 | if verbose: 165 | for sent in sents: 166 | print('image %s: %s' %(entry['image_id'], sent)) 167 | 168 | for k, sent in 
enumerate(sents): 169 | entry = {'image_id': data['infos'][k]['id'], 'caption': sent} 170 | if opt.dump_path == 1: 171 | entry['file_name'] = data['infos'][k]['file_path'] 172 | table.insert(predictions, entry) 173 | if opt.dump_images == 1: 174 | # dump the raw image to vis/ folder 175 | cmd = 'cp "' + os.path.join(opt.image_root, data['infos'][k]['file_path']) + '" vis/imgs/img' + str(len(predictions)) + '.jpg' # bit gross 176 | print(cmd) 177 | os.system(cmd) 178 | 179 | if verbose: 180 | print('image %s: %s' %(entry['image_id'], entry['caption'])) 181 | 182 | # if we wrapped around the split or used up val imgs budget then bail 183 | ix0 = data['bounds']['it_pos_now'] 184 | ix1 = data['bounds']['it_max'] 185 | if num_images != -1: 186 | ix1 = min(ix1, num_images) 187 | for i in range(n - ix1): 188 | predictions.pop() 189 | 190 | if verbose: 191 | print('evaluating validation preformance... %d/%d (%f)' %(ix0 - 1, ix1, loss)) 192 | 193 | if data['bounds']['wrapped']: 194 | break 195 | if num_images >= 0 and n >= num_images: 196 | break 197 | 198 | lang_stats = None 199 | if language_eval == 1: 200 | lang_stats = eval_utils.language_eval(dataset, predictions) 201 | 202 | # Switch back to training mode 203 | sess.run(tf.assign(model.training, True)) 204 | sess.run(tf.assign(model.cnn_training, True)) 205 | return loss_sum/loss_evals, predictions, lang_stats 206 | 207 | tf_config = tf.ConfigProto() 208 | tf_config.intra_op_parallelism_threads=NUM_THREADS 209 | tf_config.gpu_options.allow_growth = True 210 | with tf.Session(config=tf_config) as sess: 211 | # Initilize the variables 212 | sess.run(tf.global_variables_initializer()) 213 | # Load the model checkpoint to evaluate 214 | assert len(opt.model) > 0, 'must provide a model' 215 | tf.train.Saver(tf.trainable_variables()).restore(sess, opt.model) 216 | 217 | # Set sample options 218 | sess.run(tf.assign(model.sample_max, opt.sample_max == 1)) 219 | sess.run(tf.assign(model.sample_temperature, opt.temperature)) 220 | 221 | loss, split_predictions, lang_stats = eval_split(sess, model, loader, 222 | {'num_images': opt.num_images, 223 | 'language_eval': opt.language_eval, 224 | 'split': opt.split}) 225 | 226 | print('loss: ', loss) 227 | if lang_stats: 228 | print(lang_stats) 229 | 230 | if opt.dump_json == 1: 231 | # dump the json 232 | json.dump(split_predictions, open('vis/vis.json', 'w')) 233 | -------------------------------------------------------------------------------- /eval_utils.py: -------------------------------------------------------------------------------- 1 | import json 2 | from json import encoder 3 | 4 | def language_eval(dataset, preds): 5 | import sys 6 | if 'coco' in dataset: 7 | sys.path.append("coco-caption") 8 | annFile = 'coco-caption/annotations/captions_val2014.json' 9 | else: 10 | sys.path.append("f30k-caption") 11 | annFile = 'f30k-caption/annotations/dataset_flickr30k.json' 12 | from pycocotools.coco import COCO 13 | from pycocoevalcap.eval import COCOEvalCap 14 | 15 | encoder.FLOAT_REPR = lambda o: format(o, '.3f') 16 | 17 | coco = COCO(annFile) 18 | valids = coco.getImgIds() 19 | 20 | # filter results to only those in MSCOCO validation set (will be about a third) 21 | preds_filt = [p for p in preds if p['image_id'] in valids] 22 | print 'using %d/%d predictions' % (len(preds_filt), len(preds)) 23 | json.dump(preds_filt, open('tmp.json', 'w')) # serialize to temporary json file. Sigh, COCO API... 
24 | 25 | resFile = 'tmp.json' 26 | cocoRes = coco.loadRes(resFile) 27 | cocoEval = COCOEvalCap(coco, cocoRes) 28 | cocoEval.params['image_id'] = cocoRes.getImgIds() 29 | cocoEval.evaluate() 30 | 31 | # create output dictionary 32 | out = {} 33 | for metric, score in cocoEval.eval.items(): 34 | out[metric] = score 35 | 36 | return out -------------------------------------------------------------------------------- /misc/AttentionModel.py: -------------------------------------------------------------------------------- 1 | from __future__ import absolute_import 2 | from __future__ import division 3 | from __future__ import print_function 4 | 5 | import tensorflow as tf 6 | import tensorflow.contrib.slim as slim 7 | import os 8 | import vgg 9 | import copy 10 | 11 | import numpy as np 12 | import misc.utils as utils 13 | 14 | # The maximimum step during generation 15 | MAX_STEPS = 30 16 | 17 | class AttentionModel(): 18 | """ 19 | This model is not using the show attend tell algorithm, but given seq2seq attention decoder. 20 | """ 21 | 22 | def initialize(self, sess): 23 | # Initialize the variables 24 | sess.run(tf.global_variables_initializer()) 25 | # Initialize the saver 26 | self.saver = tf.train.Saver(tf.trainable_variables(), max_to_keep=1) 27 | # Load weights from the checkpoint 28 | if vars(self.opt).get('start_from', None): 29 | self.saver.restore(sess, self.opt.ckpt.model_checkpoint_path) 30 | # Initialize the summary writer 31 | self.summary_writer = tf.summary.FileWriter(self.opt.checkpoint_path, sess.graph) 32 | 33 | def __init__(self, opt): 34 | self.vocab_size = opt.vocab_size 35 | self.input_encoding_size = opt.input_encoding_size 36 | self.rnn_size = opt.rnn_size 37 | self.num_layers = opt.num_layers 38 | self.drop_prob_lm = opt.drop_prob_lm 39 | self.seq_length = opt.seq_length 40 | self.vocab_size = opt.vocab_size 41 | self.seq_per_img = opt.seq_per_img 42 | 43 | self.opt = opt 44 | 45 | # Variable indicating in training mode or evaluation mode 46 | self.training = tf.Variable(True, trainable = False, name = "training") 47 | 48 | # Input variables 49 | self.images = tf.placeholder(tf.float32, [None, 224, 224, 3], name = "images") 50 | self.labels = tf.placeholder(tf.int32, [None, self.seq_length + 2], name = "labels") 51 | self.masks = tf.placeholder(tf.float32, [None, self.seq_length + 2], name = "masks") 52 | 53 | # Build CNN 54 | if vars(self.opt).get('start_from', None): 55 | cnn_weight = None 56 | else: 57 | cnn_weight = self.opt.cnn_weight 58 | if self.opt.cnn_model == 'vgg16': 59 | self.cnn = vgg.Vgg16(cnn_weight) 60 | if self.opt.cnn_model == 'vgg19': 61 | self.cnn = vgg.Vgg19(cnn_weight) 62 | 63 | with tf.variable_scope("cnn"): 64 | self.cnn.build(self.images) 65 | 66 | if self.opt.cnn_model == 'vgg16': 67 | self.context = self.cnn.conv5_3 68 | if self.opt.cnn_model == 'vgg19': 69 | self.context = self.cnn.conv5_4 70 | 71 | self.cnn_training = self.cnn.training 72 | 73 | # Variable in language model 74 | with tf.variable_scope("rnnlm"): 75 | # Word Embedding table 76 | self.Wemb = tf.Variable(tf.random_uniform([self.vocab_size + 1, self.input_encoding_size], -0.1, 0.1), name='Wemb') 77 | 78 | # RNN cell 79 | if opt.rnn_type == 'rnn': 80 | self.cell_fn = cell_fn = tf.contrib.rnn.BasicRNNCell 81 | elif opt.rnn_type == 'gru': 82 | self.cell_fn = cell_fn = tf.contrib.rnn.GRUCell 83 | elif opt.rnn_type == 'lstm': 84 | self.cell_fn = cell_fn = tf.contrib.rnn.LSTMCell 85 | else: 86 | raise Exception("RNN type not supported: {}".format(opt.rnn_type)) 87 | 88 | # 
keep_prob is a function of training flag 89 | self.keep_prob = tf.cond(self.training, 90 | lambda : tf.constant(1 - self.drop_prob_lm), 91 | lambda : tf.constant(1.0), name = 'keep_prob') 92 | # basic cell has dropout wrapper 93 | self.basic_cell = cell = tf.contrib.rnn.DropoutWrapper(cell_fn(self.rnn_size, state_is_tuple = True), 1.0, self.keep_prob) 94 | # cell is the final cell of each timestep 95 | self.cell = tf.contrib.rnn.MultiRNNCell([cell] * opt.num_layers, state_is_tuple = True) 96 | 97 | def build_model(self): 98 | with tf.name_scope("batch_size"): 99 | # Get batch_size from the first dimension of self.images 100 | self.batch_size = tf.shape(self.images)[0] 101 | with tf.variable_scope("rnnlm"): 102 | flattened_ctx = tf.reshape(self.context, [self.batch_size, 196, 512]) 103 | ctx_mean = tf.reduce_mean(flattened_ctx, 1) 104 | 105 | # Initialize the first hidden state with the mean context 106 | initial_state = utils.get_initial_state(ctx_mean, self.cell.state_size) 107 | # Replicate self.seq_per_img times for each state and image embedding 108 | self.initial_state = initial_state = utils.expand_feat(initial_state, self.seq_per_img) 109 | self.flattened_ctx = flattened_ctx = tf.reshape(tf.tile(tf.expand_dims(flattened_ctx, 1), [1, self.seq_per_img, 1, 1]), 110 | [self.batch_size * self.seq_per_img, 196, 512]) 111 | 112 | rnn_inputs = tf.split(axis=1, num_or_size_splits=self.seq_length + 1, value=tf.nn.embedding_lookup(self.Wemb, self.labels[:,:self.seq_length + 1])) 113 | rnn_inputs = [tf.squeeze(input_, [1]) for input_ in rnn_inputs] 114 | 115 | outputs, last_state = tf.contrib.legacy_seq2seq.attention_decoder(rnn_inputs, initial_state, flattened_ctx, self.cell, loop_function=None) 116 | outputs = tf.concat(axis=0, values=outputs) 117 | 118 | self.logits = slim.fully_connected(outputs, self.vocab_size + 1, activation_fn = None, scope = 'logit') 119 | self.logits = tf.split(axis=0, num_or_size_splits=len(rnn_inputs), value=self.logits) 120 | 121 | with tf.variable_scope("loss"): 122 | loss = tf.contrib.legacy_seq2seq.sequence_loss_by_example(self.logits, 123 | [tf.squeeze(label, [1]) for label in tf.split(axis=1, num_or_size_splits=self.seq_length + 1, value=self.labels[:, 1:])], # self.labels[:,1:] is the target 124 | [tf.squeeze(mask, [1]) for mask in tf.split(axis=1, num_or_size_splits=self.seq_length + 1, value=self.masks[:, 1:])]) 125 | self.cost = tf.reduce_mean(loss) 126 | 127 | self.final_state = last_state 128 | self.lr = tf.Variable(0.0, trainable=False) 129 | self.cnn_lr = tf.Variable(0.0, trainable=False) 130 | 131 | # Collect the rnn variables, and create the optimizer of rnn 132 | tvars = tf.get_collection(tf.GraphKeys.TRAINABLE_VARIABLES, scope='rnnlm') 133 | grads = utils.clip_by_value(tf.gradients(self.cost, tvars), -self.opt.grad_clip, self.opt.grad_clip) 134 | #grads, _ = tf.clip_by_global_norm(tf.gradients(self.cost, tvars), 135 | # self.opt.grad_clip) 136 | optimizer = utils.get_optimizer(self.opt, self.lr) 137 | self.train_op = optimizer.apply_gradients(zip(grads, tvars)) 138 | 139 | # Collect the cnn variables, and create the optimizer of cnn 140 | cnn_tvars = tf.get_collection(tf.GraphKeys.TRAINABLE_VARIABLES, scope='cnn') 141 | cnn_grads = utils.clip_by_value(tf.gradients(self.cost, cnn_tvars), -self.opt.grad_clip, self.opt.grad_clip) 142 | #cnn_grads, _ = tf.clip_by_global_norm(tf.gradients(self.cost, cnn_tvars), 143 | # self.opt.grad_clip) 144 | cnn_optimizer = utils.get_cnn_optimizer(self.opt, self.cnn_lr) 145 | self.cnn_train_op = 
cnn_optimizer.apply_gradients(zip(cnn_grads, cnn_tvars)) 146 | 147 | tf.summary.scalar('training loss', self.cost) 148 | tf.summary.scalar('learning rate', self.lr) 149 | tf.summary.scalar('cnn learning rate', self.cnn_lr) 150 | self.summaries = tf.summary.merge_all() 151 | 152 | def build_generator(self): 153 | """ 154 | Generator for generating captions 155 | Support sample max or sample from distribution 156 | No Beam search here; beam search is in decoder 157 | """ 158 | # Variables for the sample setting 159 | self.sample_max = tf.Variable(True, trainable = False, name = "sample_max") 160 | self.sample_temperature = tf.Variable(1.0, trainable = False, name = "temperature") 161 | 162 | self.generator = [] 163 | with tf.variable_scope("rnnlm") as rnnlm_scope: 164 | flattened_ctx = tf.reshape(self.context, [self.batch_size, 196, 512]) 165 | ctx_mean = tf.reduce_mean(flattened_ctx, 1) 166 | 167 | tf.get_variable_scope().reuse_variables() 168 | 169 | initial_state = utils.get_initial_state(ctx_mean, self.cell.state_size) 170 | 171 | rnn_inputs = [tf.nn.embedding_lookup(self.Wemb, tf.zeros([self.batch_size], tf.int32))] + [0] * (MAX_STEPS - 1) 172 | 173 | # Always pick the word with largest probability as the input of next time step 174 | def loop(prev, i): 175 | with tf.variable_scope(rnnlm_scope): 176 | prev = slim.fully_connected(prev, self.vocab_size + 1, activation_fn = None, scope = 'logit') 177 | prev_symbol = tf.stop_gradient(tf.cond(self.sample_max, 178 | lambda: tf.argmax(prev, 1), # pick the word with largest probability as the input of next time step 179 | lambda: tf.squeeze( 180 | tf.multinomial(tf.nn.log_softmax(prev) / self.sample_temperature, 1), 1))) # Sample from the distribution 181 | self.generator.append(prev_symbol) 182 | return tf.nn.embedding_lookup(self.Wemb, prev_symbol) 183 | 184 | outputs, last_state = tf.contrib.legacy_seq2seq.attention_decoder(rnn_inputs, initial_state, flattened_ctx, self.cell, loop_function=loop) 185 | self.g_outputs = outputs = tf.reshape(tf.concat(axis=1, values=outputs), [-1, self.rnn_size]) 186 | self.g_logits = logits = slim.fully_connected(outputs, self.vocab_size + 1, activation_fn = None, scope = 'logit') 187 | self.g_probs = probs = tf.reshape(tf.nn.softmax(logits), [self.batch_size, MAX_STEPS, self.vocab_size + 1]) 188 | 189 | self.generator = tf.transpose(tf.reshape(tf.concat(axis=0, values=self.generator), [MAX_STEPS - 1, -1])) 190 | 191 | def build_decoder_rnn(self, first_step): 192 | """ 193 | This function build a decoder 194 | if first_step is true, the state is initialized by mean context 195 | if first_step is not true, the states are placeholder, and should be assigned. 
196 | """ 197 | with tf.variable_scope("rnnlm"): 198 | flattened_ctx = tf.reshape(self.context, [self.batch_size, 196, 512]) 199 | ctx_mean = tf.reduce_mean(flattened_ctx, 1) 200 | 201 | self.decoder_prev_word = tf.placeholder(tf.int32, [None]) 202 | if first_step: 203 | rnn_input = tf.nn.embedding_lookup(self.Wemb, tf.zeros([self.batch_size], tf.int32)) 204 | else: 205 | rnn_input = tf.nn.embedding_lookup(self.Wemb, self.decoder_prev_word) 206 | 207 | tf.get_variable_scope().reuse_variables() 208 | if not first_step: 209 | initial_state = utils.get_placeholder_state(self.cell.state_size) 210 | self.decoder_flattened_state = utils.flatten_state(initial_state) 211 | else: 212 | initial_state = utils.get_initial_state(ctx_mean, self.cell.state_size) 213 | 214 | outputs, state = tf.contrib.legacy_seq2seq.attention_decoder([rnn_input], initial_state, flattened_ctx, self.cell, initial_state_attention = not first_step) 215 | logits = slim.fully_connected(outputs[0], self.vocab_size + 1, activation_fn = None, scope = 'logit') 216 | decoder_probs = tf.reshape(tf.nn.softmax(logits), [self.batch_size, self.vocab_size + 1]) 217 | decoder_state = utils.flatten_state(state) 218 | 219 | # output the probability and flattened state to next time step 220 | return [decoder_probs, decoder_state] 221 | 222 | 223 | def build_decoder(self): 224 | self.decoder_model_init = self.build_decoder_rnn(True) # Used for the first step 225 | self.decoder_model_cont = self.build_decoder_rnn(False) 226 | 227 | def decode(self, img, beam_size, sess, max_steps=30): 228 | """Decode an image with a sentences.""" 229 | 230 | # Initilize beam search variables 231 | # Candidate will be represented with a dictionary 232 | # "indexes": a list with indexes denoted a sentence; 233 | # "words": word in the decoded sentence without 234 | # "score": log-likelihood of the sentence 235 | # "state": RNN state when generating the last word of the candidate 236 | good_sentences = [] # store sentences already ended with 237 | cur_best_cand = [] # store current best candidates 238 | highest_score = 0.0 # hightest log-likelihodd in good sentences 239 | 240 | # Get the initial logit and state 241 | cand = {'indexes': [], 'score': 0} 242 | cur_best_cand.append(cand) 243 | 244 | # Expand the current best candidates until max_steps or no candidate 245 | for i in xrange(max_steps + 1): 246 | # expand candidates 247 | cand_pool = [] 248 | if i == 0: 249 | all_probs, all_states = self.get_probs_init(img, sess) 250 | else: 251 | states = [np.vstack([cand['state'][i] for cand in cur_best_cand]) for i in xrange(len(cur_best_cand[0]['state']))] 252 | indexes = [cand['indexes'][-1] for cand in cur_best_cand] 253 | imgs = np.vstack([img] * len(cur_best_cand)) 254 | all_probs, all_states = self.get_probs_cont(states, imgs, indexes, sess) 255 | 256 | # Construct new beams 257 | for ind_cand in range(len(cur_best_cand)): 258 | cand = cur_best_cand[ind_cand] 259 | probs = all_probs[ind_cand] 260 | state = [x[ind_cand] for x in all_states] 261 | 262 | probs = np.squeeze(probs) 263 | probs_order = np.argsort(-probs) 264 | # append new end terminal at the end of this beam 265 | for ind_b in xrange(beam_size): 266 | cand_e = copy.deepcopy(cand) 267 | cand_e['indexes'].append(probs_order[ind_b]) 268 | cand_e['score'] -= np.log(probs[probs_order[ind_b]]) 269 | cand_e['state'] = state 270 | cand_pool.append(cand_e) 271 | # get best beams 272 | cur_best_cand = sorted(cand_pool, key=lambda cand: cand['score']) 273 | cur_best_cand = utils.truncate_list(cur_best_cand, 
beam_size) 274 | 275 | # move candidates end with to good_sentences or remove it 276 | cand_left = [] 277 | for cand in cur_best_cand: 278 | if len(good_sentences) > beam_size and cand['score'] > highest_score: 279 | continue # No need to expand that candidate 280 | if cand['indexes'][-1] == 0: #end of sentence 281 | good_sentences.append(cand) 282 | highest_score = max(highest_score, cand['score']) 283 | else: 284 | cand_left.append(cand) 285 | cur_best_cand = cand_left 286 | if not cur_best_cand: 287 | break 288 | 289 | # Add candidate left in cur_best_cand to good sentences 290 | for cand in cur_best_cand: 291 | if len(good_sentences) > beam_size and cand['score'] > highest_score: 292 | continue 293 | if cand['indexes'][-1] != 0: 294 | cand['indexes'].append(0) 295 | good_sentences.append(cand) 296 | highest_score = max(highest_score, cand['score']) 297 | 298 | # Sort good sentences and return the final list 299 | good_sentences = sorted(good_sentences, key=lambda cand: cand['score']) 300 | good_sentences = utils.truncate_list(good_sentences, beam_size) 301 | 302 | return [sent['indexes'] for sent in good_sentences] 303 | 304 | 305 | def get_probs_init(self, img, sess): 306 | """Use the model to get initial logit""" 307 | m = self.decoder_model_init 308 | 309 | probs, state = sess.run(m, {self.images: img}) 310 | 311 | return (probs, state) 312 | 313 | def get_probs_cont(self, prev_state, img, prev_word, sess): 314 | """Use the model to get continued logit""" 315 | m = self.decoder_model_cont 316 | prev_word = np.array(prev_word, dtype='int32') 317 | 318 | placeholders = [self.images, self.decoder_prev_word] + self.decoder_flattened_state 319 | feeded = [img, prev_word] + prev_state 320 | 321 | probs, state = sess.run(m, {placeholders[i]: feeded[i] for i in xrange(len(placeholders))}) 322 | 323 | return (probs, state) -------------------------------------------------------------------------------- /misc/ShowAttendTellModel.py: -------------------------------------------------------------------------------- 1 | from __future__ import absolute_import 2 | from __future__ import division 3 | from __future__ import print_function 4 | 5 | import tensorflow as tf 6 | import tensorflow.contrib.slim as slim 7 | import os 8 | import vgg 9 | import copy 10 | 11 | import numpy as np 12 | import misc.utils as utils 13 | 14 | # The maximimum step during generation 15 | MAX_STEPS = 30 16 | 17 | class ShowAttendTellModel(): 18 | 19 | def initialize(self, sess): 20 | # Initialize the variables 21 | sess.run(tf.global_variables_initializer()) 22 | # Initialize the saver 23 | self.saver = tf.train.Saver(tf.trainable_variables(), max_to_keep=1) 24 | # Load weights from the checkpoint 25 | if vars(self.opt).get('start_from', None): 26 | self.saver.restore(sess, self.opt.ckpt.model_checkpoint_path) 27 | # Initialize the summary writer 28 | self.summary_writer = tf.summary.FileWriter(self.opt.checkpoint_path, sess.graph) 29 | 30 | def __init__(self, opt): 31 | self.vocab_size = opt.vocab_size 32 | self.input_encoding_size = opt.input_encoding_size 33 | self.rnn_size = opt.rnn_size 34 | self.num_layers = opt.num_layers 35 | self.drop_prob_lm = opt.drop_prob_lm 36 | self.seq_length = opt.seq_length 37 | self.vocab_size = opt.vocab_size 38 | self.seq_per_img = opt.seq_per_img 39 | self.att_hid_size = opt.att_hid_size 40 | 41 | self.opt = opt 42 | 43 | # Variable indicating in training mode or evaluation mode 44 | self.training = tf.Variable(True, trainable = False, name = "training") 45 | 46 | # Input variables 
47 | self.images = tf.placeholder(tf.float32, [None, 224, 224, 3], name = "images") 48 | self.labels = tf.placeholder(tf.int32, [None, self.seq_length + 2], name = "labels") 49 | self.masks = tf.placeholder(tf.float32, [None, self.seq_length + 2], name = "masks") 50 | 51 | # Build CNN 52 | if vars(self.opt).get('start_from', None): 53 | cnn_weight = None 54 | else: 55 | cnn_weight = vars(self.opt).get('cnn_weight', None) 56 | if self.opt.cnn_model == 'vgg16': 57 | self.cnn = vgg.Vgg16(cnn_weight) 58 | if self.opt.cnn_model == 'vgg19': 59 | self.cnn = vgg.Vgg19(cnn_weight) 60 | 61 | with tf.variable_scope("cnn"): 62 | self.cnn.build(self.images) 63 | 64 | if self.opt.cnn_model == 'vgg16': 65 | self.context = self.cnn.conv5_3 66 | if self.opt.cnn_model == 'vgg19': 67 | self.context = self.cnn.conv5_4 68 | self.fc7 = self.cnn.drop7 69 | self.cnn_training = self.cnn.training 70 | 71 | # Variable in language model 72 | with tf.variable_scope("rnnlm"): 73 | # Word Embedding table 74 | self.Wemb = tf.Variable(tf.random_uniform([self.vocab_size + 1, self.input_encoding_size], -0.1, 0.1), name='Wemb') 75 | 76 | # RNN cell 77 | if opt.rnn_type == 'rnn': 78 | self.cell_fn = cell_fn = tf.contrib.rnn.BasicRNNCell 79 | elif opt.rnn_type == 'gru': 80 | self.cell_fn = cell_fn = tf.contrib.rnn.GRUCell 81 | elif opt.rnn_type == 'lstm': 82 | self.cell_fn = cell_fn = tf.contrib.rnn.LSTMCell 83 | else: 84 | raise Exception("RNN type not supported: {}".format(opt.rnn_type)) 85 | 86 | # keep_prob is a function of training flag 87 | self.keep_prob = tf.cond(self.training, 88 | lambda : tf.constant(1 - self.drop_prob_lm), 89 | lambda : tf.constant(1.0), name = 'keep_prob') 90 | 91 | # basic cell has dropout wrapper 92 | self.basic_cell = cell = tf.contrib.rnn.DropoutWrapper(cell_fn(self.rnn_size), 1.0, self.keep_prob) 93 | # cell is the final cell of each timestep 94 | self.cell = tf.contrib.rnn.MultiRNNCell([cell] * opt.num_layers) 95 | 96 | def get_alpha(self, prev_h, pctx): 97 | # projected state 98 | if self.att_hid_size == 0: 99 | pstate = slim.fully_connected(prev_h, 1, activation_fn = None, scope = 'h_att') # (batch * seq_per_img) * 1 100 | alpha = pctx + tf.expand_dims(pstate, 1) #(batch * seq_per_img) * 196 * 1 101 | alpha = tf.squeeze(alpha, [2]) # (batch * seq_per_img) * 196 102 | alpha = tf.nn.softmax(alpha) 103 | else: 104 | pstate = slim.fully_connected(prev_h, self.att_hid_size, activation_fn = None, scope = 'h_att') # (batch * seq_per_img) * att_hid_size 105 | pctx_ = pctx + tf.expand_dims(pstate, 1) #(batch * seq_per_img) * 196 * att_hid_size 106 | pctx_ = tf.nn.tanh(pctx_) # (batch * seq_per_img) * 196 * att_hid_size 107 | alpha = slim.fully_connected(pctx_, 1, activation_fn = None, scope = 'alpha') # (batch * seq_per_img) * 196 * 1 108 | alpha = tf.squeeze(alpha, [2]) # (batch * seq_per_img) * 196 109 | alpha = tf.nn.softmax(alpha) 110 | return alpha 111 | 112 | def build_model(self): 113 | with tf.name_scope("batch_size"): 114 | # Get batch_size from the first dimension of self.images 115 | self.batch_size = tf.shape(self.images)[0] 116 | with tf.variable_scope("rnnlm"): 117 | # Flatten the context 118 | flattened_ctx = tf.reshape(self.context, [self.batch_size, 196, 512]) 119 | 120 | # Initialize the first hidden state with the mean context 121 | initial_state = utils.get_initial_state(self.fc7, self.cell.state_size) 122 | # Replicate self.seq_per_img times for each state and image embedding 123 | self.initial_state = initial_state = utils.expand_feat(initial_state, self.seq_per_img) 124 | 
self.flattened_ctx = flattened_ctx = tf.reshape(tf.tile(tf.expand_dims(flattened_ctx, 1), [1, self.seq_per_img, 1, 1]), 125 | [self.batch_size * self.seq_per_img, 196, 512]) 126 | 127 | #projected context 128 | # This is used in attention module; do this outside the loop to reduce redundant computations 129 | # with tf.variable_scope("attention"): 130 | if self.att_hid_size == 0: 131 | pctx = slim.fully_connected(self.flattened_ctx, 1, activation_fn = None, scope = 'ctx_att') # (batch * seq_per_img) * 196 * 1 132 | else: 133 | pctx = slim.fully_connected(self.flattened_ctx, self.att_hid_size, activation_fn = None, scope = 'ctx_att') # (batch * seq_per_img) * 196 * att_hid_size 134 | 135 | rnn_inputs = tf.split(axis=1, num_or_size_splits=self.seq_length + 1, value=tf.nn.embedding_lookup(self.Wemb, self.labels[:,:self.seq_length + 1])) 136 | rnn_inputs = [tf.squeeze(input_, [1]) for input_ in rnn_inputs] 137 | 138 | prev_h = utils.last_hidden_vec(initial_state) 139 | 140 | self.alphas = [] 141 | self.logits = [] 142 | outputs = [] 143 | state = initial_state 144 | for ind in range(self.seq_length + 1): 145 | if ind > 0: 146 | # Reuse the variables after the first timestep. 147 | tf.get_variable_scope().reuse_variables() 148 | 149 | with tf.variable_scope("attention"): 150 | alpha = self.get_alpha(prev_h, pctx) 151 | self.alphas.append(alpha) 152 | weighted_context = tf.reduce_sum(flattened_ctx * tf.expand_dims(alpha, 2), 1) 153 | 154 | output, state = self.cell(tf.concat(axis=1, values=[weighted_context, rnn_inputs[ind]]), state) 155 | # Save the current output for next time step attention 156 | prev_h = output 157 | # Get the score of each word in vocabulary, 0 is end token. 158 | self.logits.append(slim.fully_connected(output, self.vocab_size + 1, activation_fn = None, scope = 'logit')) 159 | 160 | with tf.variable_scope("loss"): 161 | loss = tf.contrib.legacy_seq2seq.sequence_loss_by_example( 162 | self.logits, 163 | [tf.squeeze(label, [1]) for label in tf.split(axis=1, num_or_size_splits=self.seq_length + 1, value=self.labels[:, 1:])], # self.labels[:,1:] is the target; ignore the first start token 164 | [tf.squeeze(mask, [1]) for mask in tf.split(axis=1, num_or_size_splits=self.seq_length + 1, value=self.masks[:, 1:])]) 165 | self.cost = tf.reduce_mean(loss) 166 | 167 | self.final_state = state 168 | self.lr = tf.Variable(0.0, trainable=False) 169 | self.cnn_lr = tf.Variable(0.0, trainable=False) 170 | 171 | # Collect the rnn variables, and create the optimizer of rnn 172 | tvars = tf.get_collection(tf.GraphKeys.TRAINABLE_VARIABLES, scope='rnnlm') 173 | grads = utils.clip_by_value(tf.gradients(self.cost, tvars), -self.opt.grad_clip, self.opt.grad_clip) 174 | optimizer = utils.get_optimizer(self.opt, self.lr) 175 | self.train_op = optimizer.apply_gradients(zip(grads, tvars)) 176 | 177 | # Collect the cnn variables, and create the optimizer of cnn 178 | cnn_tvars = tf.get_collection(tf.GraphKeys.TRAINABLE_VARIABLES, scope='cnn') 179 | cnn_grads = utils.clip_by_value(tf.gradients(self.cost, cnn_tvars), -self.opt.grad_clip, self.opt.grad_clip) 180 | cnn_optimizer = utils.get_cnn_optimizer(self.opt, self.cnn_lr) 181 | self.cnn_train_op = cnn_optimizer.apply_gradients(zip(cnn_grads, cnn_tvars)) 182 | 183 | tf.summary.scalar('training loss', self.cost) 184 | tf.summary.scalar('learning rate', self.lr) 185 | tf.summary.scalar('cnn learning rate', self.cnn_lr) 186 | self.summaries = tf.summary.merge_all() 187 | 188 | def build_generator(self): 189 | """ 190 | Generator for generating captions 
191 | Support sample max or sample from distribution 192 | No Beam search here; beam search is in decoder 193 | """ 194 | # Variables for the sample setting 195 | self.sample_max = tf.Variable(True, trainable = False, name = "sample_max") 196 | self.sample_temperature = tf.Variable(1.0, trainable = False, name = "temperature") 197 | 198 | self.generator = [] 199 | with tf.variable_scope("rnnlm"): 200 | flattened_ctx = tf.reshape(self.context, [self.batch_size, 196, 512]) 201 | 202 | tf.get_variable_scope().reuse_variables() 203 | 204 | initial_state = utils.get_initial_state(self.fc7, self.cell.state_size) 205 | 206 | #projected context 207 | # This is used in attention module; do this outside the loop to reduce redundant computations 208 | # with tf.variable_scope("attention"): 209 | if self.att_hid_size == 0: 210 | pctx = slim.fully_connected(flattened_ctx, 1, activation_fn = None, scope = 'ctx_att') # (batch) * 196 * 1 211 | else: 212 | pctx = slim.fully_connected(flattened_ctx, self.att_hid_size, activation_fn = None, scope = 'ctx_att') # (batch) * 196 * att_hid_size 213 | 214 | rnn_input = tf.nn.embedding_lookup(self.Wemb, tf.zeros([self.batch_size], tf.int32)) 215 | 216 | prev_h = utils.last_hidden_vec(initial_state) 217 | 218 | self.g_alphas = [] 219 | outputs = [] 220 | state = initial_state 221 | for ind in range(MAX_STEPS): 222 | 223 | with tf.variable_scope("attention"): 224 | alpha = self.get_alpha(prev_h, pctx) 225 | self.g_alphas.append(alpha) 226 | weighted_context = tf.reduce_sum(flattened_ctx * tf.expand_dims(alpha, 2), 1) 227 | 228 | output, state = self.cell(tf.concat(axis=1, values=[weighted_context, rnn_input]), state) 229 | outputs.append(output) 230 | prev_h = output 231 | 232 | # Get the input of next timestep 233 | prev_logit = slim.fully_connected(prev_h, self.vocab_size + 1, activation_fn = None, scope = 'logit') 234 | prev_symbol = tf.stop_gradient(tf.cond(self.sample_max, 235 | lambda: tf.argmax(prev_logit, 1), # pick the word with largest probability as the input of next time step 236 | lambda: tf.squeeze( 237 | tf.multinomial(tf.nn.log_softmax(prev_logit) / self.sample_temperature, 1), 1))) # Sample from the distribution 238 | self.generator.append(prev_symbol) 239 | rnn_input = tf.nn.embedding_lookup(self.Wemb, prev_symbol) 240 | 241 | self.g_output = output = tf.reshape(tf.concat(axis=1, values=outputs), [-1, self.rnn_size]) # outputs[1:], because we don't calculate loss on time 0. 
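            # Every timestep of this generator predicts a word (the image enters through the
            # attention context and the initial state, not as an RNN input), so all MAX_STEPS
            # outputs are kept; the shared 'logit' layer below scores them in one pass and
            # g_probs reshapes the softmax back to [batch_size, MAX_STEPS, vocab_size + 1].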
242 | self.g_logits = logits = slim.fully_connected(output, self.vocab_size + 1, activation_fn = None, scope = 'logit') 243 | self.g_probs = probs = tf.reshape(tf.nn.softmax(logits), [self.batch_size, MAX_STEPS, self.vocab_size + 1]) 244 | 245 | self.generator = tf.transpose(tf.reshape(tf.concat(axis=0, values=self.generator), [MAX_STEPS, -1])) 246 | 247 | def build_decoder_rnn(self, first_step): 248 | with tf.variable_scope("rnnlm"): 249 | flattened_ctx = tf.reshape(self.context, [self.batch_size, 196, 512]) 250 | 251 | tf.get_variable_scope().reuse_variables() 252 | 253 | if not first_step: 254 | initial_state = utils.get_placeholder_state(self.cell.state_size) 255 | self.decoder_flattened_state = utils.flatten_state(initial_state) 256 | else: 257 | initial_state = utils.get_initial_state(self.fc7, self.cell.state_size) 258 | 259 | self.decoder_prev_word = tf.placeholder(tf.int32, [None]) 260 | 261 | if first_step: 262 | rnn_input = tf.nn.embedding_lookup(self.Wemb, tf.zeros([self.batch_size], tf.int32)) 263 | else: 264 | rnn_input = tf.nn.embedding_lookup(self.Wemb, self.decoder_prev_word) 265 | 266 | #projected context 267 | # This is used in attention module; do this outside the loop to reduce redundant computations 268 | # with tf.variable_scope("attention"): 269 | if self.att_hid_size == 0: 270 | pctx = slim.fully_connected(flattened_ctx, 1, activation_fn = None, scope = 'ctx_att') # (batch * seq_per_img) * 196 * 1 271 | else: 272 | pctx = slim.fully_connected(flattened_ctx, self.att_hid_size, activation_fn = None, scope = 'ctx_att') # (batch * seq_per_img) * 196 * att_hid_size 273 | 274 | prev_h = utils.last_hidden_vec(initial_state) 275 | 276 | alphas = [] 277 | outputs = [] 278 | 279 | with tf.variable_scope("attention"): 280 | alpha = self.get_alpha(prev_h, pctx) 281 | alphas.append(alpha) 282 | weighted_context = tf.reduce_sum(flattened_ctx * tf.expand_dims(alpha, 2), 1) 283 | 284 | output, state = self.cell(tf.concat(axis=1, values=[weighted_context, rnn_input]), initial_state) 285 | logits = slim.fully_connected(output, self.vocab_size + 1, activation_fn = None, scope = 'logit') 286 | decoder_probs = tf.reshape(tf.nn.softmax(logits), [self.batch_size, self.vocab_size + 1]) 287 | decoder_state = utils.flatten_state(state) 288 | return [decoder_probs, decoder_state] 289 | 290 | def build_decoder(self): 291 | self.decoder_model_init = self.build_decoder_rnn(True) 292 | self.decoder_model_cont = self.build_decoder_rnn(False) 293 | 294 | def decode(self, img, beam_size, sess, max_steps=MAX_STEPS): 295 | """Decode an image with a sentences.""" 296 | 297 | # Initilize beam search variables 298 | # Candidate will be represented with a dictionary 299 | # "indexes": a list with indexes denoted a sentence; 300 | # "words": word in the decoded sentence without 301 | # "score": log-likelihood of the sentence 302 | # "state": RNN state when generating the last word of the candidate 303 | good_sentences = [] # store sentences already ended with 304 | cur_best_cand = [] # store current best candidates 305 | highest_score = 0.0 # hightest log-likelihodd in good sentences 306 | 307 | # Get the initial logit and state 308 | cand = {'indexes': [], 'score': 0} 309 | cur_best_cand.append(cand) 310 | 311 | # Expand the current best candidates until max_steps or no candidate 312 | for i in xrange(max_steps + 1): 313 | # expand candidates 314 | cand_pool = [] 315 | #for cand in cur_best_cand: 316 | #probs, state = self.get_probs_cont(cand['state'], cand['indexes'][-1], sess) 317 | if i == 0: 318 | 
all_probs, all_states = self.get_probs_init(img, sess) 319 | else: 320 | states = [np.vstack([cand['state'][i] for cand in cur_best_cand]) for i in xrange(len(cur_best_cand[0]['state']))] 321 | indexes = [cand['indexes'][-1] for cand in cur_best_cand] 322 | imgs = np.vstack([img] * len(cur_best_cand)) 323 | all_probs, all_states = self.get_probs_cont(states, imgs, indexes, sess) 324 | for ind_cand in range(len(cur_best_cand)): 325 | cand = cur_best_cand[ind_cand] 326 | probs = all_probs[ind_cand] 327 | state = [x[ind_cand] for x in all_states] 328 | 329 | probs = np.squeeze(probs) 330 | probs_order = np.argsort(-probs) 331 | for ind_b in xrange(beam_size): 332 | cand_e = copy.deepcopy(cand) 333 | cand_e['indexes'].append(probs_order[ind_b]) 334 | cand_e['score'] -= np.log(probs[probs_order[ind_b]]) 335 | cand_e['state'] = state 336 | cand_pool.append(cand_e) 337 | # get final cand_pool 338 | cur_best_cand = sorted(cand_pool, key=lambda cand: cand['score']) 339 | cur_best_cand = utils.truncate_list(cur_best_cand, beam_size) 340 | 341 | # move candidates end with to good_sentences or remove it 342 | cand_left = [] 343 | for cand in cur_best_cand: 344 | if len(good_sentences) > beam_size and cand['score'] > highest_score: 345 | continue # No need to expand that candidate 346 | if cand['indexes'][-1] == 0: #end of sentence 347 | good_sentences.append(cand) 348 | highest_score = max(highest_score, cand['score']) 349 | else: 350 | cand_left.append(cand) 351 | cur_best_cand = cand_left 352 | if not cur_best_cand: 353 | break 354 | 355 | # Add candidate left in cur_best_cand to good sentences 356 | for cand in cur_best_cand: 357 | if len(good_sentences) > beam_size and cand['score'] > highest_score: 358 | continue 359 | if cand['indexes'][-1] != 0: 360 | cand['indexes'].append(0) 361 | good_sentences.append(cand) 362 | highest_score = max(highest_score, cand['score']) 363 | 364 | # Sort good sentences and return the final list 365 | good_sentences = sorted(good_sentences, key=lambda cand: cand['score']) 366 | good_sentences = utils.truncate_list(good_sentences, beam_size) 367 | 368 | return [sent['indexes'] for sent in good_sentences] 369 | 370 | def get_probs_init(self, img, sess): 371 | """Use the model to get initial logit""" 372 | m = self.decoder_model_init 373 | 374 | probs, state = sess.run(m, {self.images: img}) 375 | 376 | return (probs, state) 377 | 378 | def get_probs_cont(self, prev_state, img, prev_word, sess): 379 | """Use the model to get continued logit""" 380 | m = self.decoder_model_cont 381 | prev_word = np.array(prev_word, dtype='int32') 382 | 383 | # Feed images, input words, and the flattened state of previous time step. 
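        # The placeholders and the values fed into them are kept in the same order:
        # image batch, previous word ids, then the flattened RNN state pieces returned by the
        # previous sess.run, so they can simply be zipped into the feed_dict below.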
384 | placeholders = [self.images, self.decoder_prev_word] + self.decoder_flattened_state 385 | feeded = [img, prev_word] + prev_state 386 | 387 | probs, state = sess.run(m, {placeholders[i]: feeded[i] for i in xrange(len(placeholders))}) 388 | 389 | return (probs, state) -------------------------------------------------------------------------------- /misc/ShowAttendTellModel_old.py: -------------------------------------------------------------------------------- 1 | from __future__ import absolute_import 2 | from __future__ import division 3 | from __future__ import print_function 4 | 5 | import tensorflow as tf 6 | import tensorflow.contrib.slim as slim 7 | import os 8 | import vgg 9 | import copy 10 | 11 | import numpy as np 12 | import misc.utils as utils 13 | 14 | # The maximimum step during generation 15 | MAX_STEPS = 30 16 | 17 | class ShowAttendTellModel(): 18 | 19 | def initialize(self, sess): 20 | # Initialize the variables 21 | sess.run(tf.global_variables_initializer()) 22 | # Initialize the saver 23 | self.saver = tf.train.Saver(tf.trainable_variables(), max_to_keep=1) 24 | # Load weights from the checkpoint 25 | if vars(self.opt).get('start_from', None): 26 | self.saver.restore(sess, self.opt.ckpt.model_checkpoint_path) 27 | # Initialize the summary writer 28 | self.summary_writer = tf.summary.FileWriter(self.opt.checkpoint_path, sess.graph) 29 | 30 | def __init__(self, opt): 31 | self.vocab_size = opt.vocab_size 32 | self.input_encoding_size = opt.input_encoding_size 33 | self.rnn_size = opt.rnn_size 34 | self.num_layers = opt.num_layers 35 | self.drop_prob_lm = opt.drop_prob_lm 36 | self.seq_length = opt.seq_length 37 | self.vocab_size = opt.vocab_size 38 | self.seq_per_img = opt.seq_per_img 39 | self.att_hid_size = opt.att_hid_size 40 | 41 | self.opt = opt 42 | 43 | # Variable indicating in training mode or evaluation mode 44 | self.training = tf.Variable(True, trainable = False, name = "training") 45 | 46 | # Input variables 47 | self.images = tf.placeholder(tf.float32, [None, 224, 224, 3], name = "images") 48 | self.labels = tf.placeholder(tf.int32, [None, self.seq_length + 2], name = "labels") 49 | self.masks = tf.placeholder(tf.float32, [None, self.seq_length + 2], name = "masks") 50 | 51 | # Build CNN 52 | if vars(self.opt).get('start_from', None): 53 | cnn_weight = None 54 | else: 55 | cnn_weight = vars(self.opt).get('cnn_weight', None) 56 | if self.opt.cnn_model == 'vgg16': 57 | self.cnn = vgg.Vgg16(cnn_weight) 58 | if self.opt.cnn_model == 'vgg19': 59 | self.cnn = vgg.Vgg19(cnn_weight) 60 | 61 | with tf.variable_scope("cnn"): 62 | self.cnn.build(self.images) 63 | 64 | if self.opt.cnn_model == 'vgg16': 65 | self.context = self.cnn.conv5_3 66 | if self.opt.cnn_model == 'vgg19': 67 | self.context = self.cnn.conv5_4 68 | 69 | self.cnn_training = self.cnn.training 70 | 71 | # Variable in language model 72 | with tf.variable_scope("rnnlm"): 73 | # Word Embedding table 74 | self.Wemb = tf.Variable(tf.random_uniform([self.vocab_size + 1, self.input_encoding_size], -0.1, 0.1), name='Wemb') 75 | 76 | # RNN cell 77 | if opt.rnn_type == 'rnn': 78 | self.cell_fn = cell_fn = tf.contrib.rnn.BasicRNNCell 79 | elif opt.rnn_type == 'gru': 80 | self.cell_fn = cell_fn = tf.contrib.rnn.GRUCell 81 | elif opt.rnn_type == 'lstm': 82 | self.cell_fn = cell_fn = tf.contrib.rnn.LSTMCell 83 | else: 84 | raise Exception("RNN type not supported: {}".format(opt.rnn_type)) 85 | 86 | # keep_prob is a function of training flag 87 | self.keep_prob = tf.cond(self.training, 88 | lambda : tf.constant(1 
- self.drop_prob_lm), 89 | lambda : tf.constant(1.0), name = 'keep_prob') 90 | 91 | # basic cell has dropout wrapper 92 | self.basic_cell = cell = tf.contrib.rnn.DropoutWrapper(cell_fn(self.rnn_size), 1.0, self.keep_prob) 93 | # cell is the final cell of each timestep 94 | self.cell = tf.contrib.rnn.MultiRNNCell([cell] * opt.num_layers) 95 | 96 | def get_alpha(self, prev_h, pctx): 97 | # projected state 98 | if self.att_hid_size == 0: 99 | pstate = slim.fully_connected(prev_h, 1, activation_fn = None, scope = 'h_att') # (batch * seq_per_img) * 1 100 | alpha = pctx + tf.expand_dims(pstate, 1) #(batch * seq_per_img) * 196 * 1 101 | alpha = tf.squeeze(alpha, [2]) # (batch * seq_per_img) * 196 102 | alpha = tf.nn.softmax(alpha) 103 | else: 104 | pstate = slim.fully_connected(prev_h, self.att_hid_size, activation_fn = None, scope = 'h_att') # (batch * seq_per_img) * att_hid_size 105 | pctx_ = pctx + tf.expand_dims(pstate, 1) #(batch * seq_per_img) * 196 * att_hid_size 106 | pctx_ = tf.nn.tanh(pctx_) # (batch * seq_per_img) * 196 * att_hid_size 107 | alpha = slim.fully_connected(pctx_, 1, activation_fn = None, scope = 'alpha') # (batch * seq_per_img) * 196 * 1 108 | alpha = tf.squeeze(alpha, [2]) # (batch * seq_per_img) * 196 109 | alpha = tf.nn.softmax(alpha) 110 | return alpha 111 | 112 | def build_model(self): 113 | with tf.name_scope("batch_size"): 114 | # Get batch_size from the first dimension of self.images 115 | self.batch_size = tf.shape(self.images)[0] 116 | with tf.variable_scope("rnnlm"): 117 | # Flatten the context 118 | flattened_ctx = tf.reshape(self.context, [self.batch_size, 196, 512]) 119 | ctx_mean = tf.reduce_mean(flattened_ctx, 1) 120 | 121 | # Initialize the first hidden state with the mean context 122 | initial_state = utils.get_initial_state(ctx_mean, self.cell.state_size) 123 | # Replicate self.seq_per_img times for each state and image embedding 124 | self.initial_state = initial_state = utils.expand_feat(initial_state, self.seq_per_img) 125 | self.flattened_ctx = flattened_ctx = tf.reshape(tf.tile(tf.expand_dims(flattened_ctx, 1), [1, self.seq_per_img, 1, 1]), 126 | [self.batch_size * self.seq_per_img, 196, 512]) 127 | 128 | #projected context 129 | # This is used in attention module; do this outside the loop to reduce redundant computations 130 | # with tf.variable_scope("attention"): 131 | if self.att_hid_size == 0: 132 | pctx = slim.fully_connected(self.flattened_ctx, 1, activation_fn = None, scope = 'ctx_att') # (batch * seq_per_img) * 196 * 1 133 | else: 134 | pctx = slim.fully_connected(self.flattened_ctx, self.att_hid_size, activation_fn = None, scope = 'ctx_att') # (batch * seq_per_img) * 196 * att_hid_size 135 | 136 | rnn_inputs = tf.split(axis=1, num_or_size_splits=self.seq_length + 1, value=tf.nn.embedding_lookup(self.Wemb, self.labels[:,:self.seq_length + 1])) 137 | rnn_inputs = [tf.squeeze(input_, [1]) for input_ in rnn_inputs] 138 | 139 | prev_h = utils.last_hidden_vec(initial_state) 140 | 141 | self.alphas = [] 142 | self.logits = [] 143 | outputs = [] 144 | state = initial_state 145 | for ind in range(self.seq_length + 1): 146 | if ind > 0: 147 | # Reuse the variables after the first timestep. 
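                    # (all timesteps share the same attention and 'logit' weights, so variable
                    # reuse has to be enabled once the first timestep has created them)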
148 | tf.get_variable_scope().reuse_variables() 149 | 150 | with tf.variable_scope("attention"): 151 | alpha = self.get_alpha(prev_h, pctx) 152 | self.alphas.append(alpha) 153 | weighted_context = tf.reduce_sum(flattened_ctx * tf.expand_dims(alpha, 2), 1) 154 | 155 | output, state = self.cell(tf.concat(axis=1, values=[weighted_context, rnn_inputs[ind]]), state) 156 | # Save the current output for next time step attention 157 | prev_h = output 158 | # Get the score of each word in vocabulary, 0 is end token. 159 | self.logits.append(slim.fully_connected(output, self.vocab_size + 1, activation_fn = None, scope = 'logit')) 160 | 161 | with tf.variable_scope("loss"): 162 | loss = tf.contrib.legacy_seq2seq.sequence_loss_by_example( 163 | self.logits, 164 | [tf.squeeze(label, [1]) for label in tf.split(axis=1, num_or_size_splits=self.seq_length + 1, value=self.labels[:, 1:])], # self.labels[:,1:] is the target; ignore the first start token 165 | [tf.squeeze(mask, [1]) for mask in tf.split(axis=1, num_or_size_splits=self.seq_length + 1, value=self.masks[:, 1:])]) 166 | self.cost = tf.reduce_mean(loss) 167 | 168 | self.final_state = state 169 | self.lr = tf.Variable(0.0, trainable=False) 170 | self.cnn_lr = tf.Variable(0.0, trainable=False) 171 | 172 | # Collect the rnn variables, and create the optimizer of rnn 173 | tvars = tf.get_collection(tf.GraphKeys.TRAINABLE_VARIABLES, scope='rnnlm') 174 | grads = utils.clip_by_value(tf.gradients(self.cost, tvars), -self.opt.grad_clip, self.opt.grad_clip) 175 | optimizer = utils.get_optimizer(self.opt, self.lr) 176 | self.train_op = optimizer.apply_gradients(zip(grads, tvars)) 177 | 178 | # Collect the cnn variables, and create the optimizer of cnn 179 | cnn_tvars = tf.get_collection(tf.GraphKeys.TRAINABLE_VARIABLES, scope='cnn') 180 | cnn_grads = utils.clip_by_value(tf.gradients(self.cost, cnn_tvars), -self.opt.grad_clip, self.opt.grad_clip) 181 | cnn_optimizer = utils.get_cnn_optimizer(self.opt, self.cnn_lr) 182 | self.cnn_train_op = cnn_optimizer.apply_gradients(zip(cnn_grads, cnn_tvars)) 183 | 184 | tf.summary.scalar('training loss', self.cost) 185 | tf.summary.scalar('learning rate', self.lr) 186 | tf.summary.scalar('cnn learning rate', self.cnn_lr) 187 | self.summaries = tf.summary.merge_all() 188 | 189 | def build_generator(self): 190 | """ 191 | Generator for generating captions 192 | Support sample max or sample from distribution 193 | No Beam search here; beam search is in decoder 194 | """ 195 | # Variables for the sample setting 196 | self.sample_max = tf.Variable(True, trainable = False, name = "sample_max") 197 | self.sample_temperature = tf.Variable(1.0, trainable = False, name = "temperature") 198 | 199 | self.generator = [] 200 | with tf.variable_scope("rnnlm"): 201 | flattened_ctx = tf.reshape(self.context, [self.batch_size, 196, 512]) 202 | ctx_mean = tf.reduce_mean(flattened_ctx, 1) 203 | 204 | tf.get_variable_scope().reuse_variables() 205 | 206 | initial_state = utils.get_initial_state(ctx_mean, self.cell.state_size) 207 | 208 | #projected context 209 | # This is used in attention module; do this outside the loop to reduce redundant computations 210 | # with tf.variable_scope("attention"): 211 | if self.att_hid_size == 0: 212 | pctx = slim.fully_connected(flattened_ctx, 1, activation_fn = None, scope = 'ctx_att') # (batch) * 196 * 1 213 | else: 214 | pctx = slim.fully_connected(flattened_ctx, self.att_hid_size, activation_fn = None, scope = 'ctx_att') # (batch) * 196 * att_hid_size 215 | 216 | rnn_input = 
tf.nn.embedding_lookup(self.Wemb, tf.zeros([self.batch_size], tf.int32)) 217 | 218 | prev_h = utils.last_hidden_vec(initial_state) 219 | 220 | self.g_alphas = [] 221 | outputs = [] 222 | state = initial_state 223 | for ind in range(MAX_STEPS): 224 | 225 | with tf.variable_scope("attention"): 226 | alpha = self.get_alpha(prev_h, pctx) 227 | self.g_alphas.append(alpha) 228 | weighted_context = tf.reduce_sum(flattened_ctx * tf.expand_dims(alpha, 2), 1) 229 | 230 | output, state = self.cell(tf.concat(axis=1, values=[weighted_context, rnn_input]), state) 231 | outputs.append(output) 232 | prev_h = output 233 | 234 | # Get the input of next timestep 235 | prev_logit = slim.fully_connected(prev_h, self.vocab_size + 1, activation_fn = None, scope = 'logit') 236 | prev_symbol = tf.stop_gradient(tf.cond(self.sample_max, 237 | lambda: tf.argmax(prev_logit, 1), # pick the word with largest probability as the input of next time step 238 | lambda: tf.squeeze( 239 | tf.multinomial(tf.nn.log_softmax(prev_logit) / self.sample_temperature, 1), 1))) # Sample from the distribution 240 | self.generator.append(prev_symbol) 241 | rnn_input = tf.nn.embedding_lookup(self.Wemb, prev_symbol) 242 | 243 | self.g_output = output = tf.reshape(tf.concat(axis=1, values=outputs), [-1, self.rnn_size]) # outputs[1:], because we don't calculate loss on time 0. 244 | self.g_logits = logits = slim.fully_connected(output, self.vocab_size + 1, activation_fn = None, scope = 'logit') 245 | self.g_probs = probs = tf.reshape(tf.nn.softmax(logits), [self.batch_size, MAX_STEPS, self.vocab_size + 1]) 246 | 247 | self.generator = tf.transpose(tf.reshape(tf.concat(axis=0, values=self.generator), [MAX_STEPS, -1])) 248 | 249 | def build_decoder_rnn(self, first_step): 250 | with tf.variable_scope("rnnlm"): 251 | flattened_ctx = tf.reshape(self.context, [self.batch_size, 196, 512]) 252 | ctx_mean = tf.reduce_mean(flattened_ctx, 1) 253 | 254 | tf.get_variable_scope().reuse_variables() 255 | 256 | if not first_step: 257 | initial_state = utils.get_placeholder_state(self.cell.state_size) 258 | self.decoder_flattened_state = utils.flatten_state(initial_state) 259 | else: 260 | initial_state = utils.get_initial_state(ctx_mean, self.cell.state_size) 261 | 262 | self.decoder_prev_word = tf.placeholder(tf.int32, [None]) 263 | 264 | if first_step: 265 | rnn_input = tf.nn.embedding_lookup(self.Wemb, tf.zeros([self.batch_size], tf.int32)) 266 | else: 267 | rnn_input = tf.nn.embedding_lookup(self.Wemb, self.decoder_prev_word) 268 | 269 | #projected context 270 | # This is used in attention module; do this outside the loop to reduce redundant computations 271 | # with tf.variable_scope("attention"): 272 | if self.att_hid_size == 0: 273 | pctx = slim.fully_connected(flattened_ctx, 1, activation_fn = None, scope = 'ctx_att') # (batch * seq_per_img) * 196 * 1 274 | else: 275 | pctx = slim.fully_connected(flattened_ctx, self.att_hid_size, activation_fn = None, scope = 'ctx_att') # (batch * seq_per_img) * 196 * att_hid_size 276 | 277 | prev_h = utils.last_hidden_vec(initial_state) 278 | 279 | alphas = [] 280 | outputs = [] 281 | 282 | with tf.variable_scope("attention"): 283 | alpha = self.get_alpha(prev_h, pctx) 284 | alphas.append(alpha) 285 | weighted_context = tf.reduce_sum(flattened_ctx * tf.expand_dims(alpha, 2), 1) 286 | 287 | output, state = self.cell(tf.concat(axis=1, values=[weighted_context, rnn_input]), initial_state) 288 | logits = slim.fully_connected(output, self.vocab_size + 1, activation_fn = None, scope = 'logit') 289 | decoder_probs = 
tf.reshape(tf.nn.softmax(logits), [self.batch_size, self.vocab_size + 1]) 290 | decoder_state = utils.flatten_state(state) 291 | return [decoder_probs, decoder_state] 292 | 293 | 294 | def build_decoder(self): 295 | self.decoder_model_init = self.build_decoder_rnn(True) 296 | self.decoder_model_cont = self.build_decoder_rnn(False) 297 | 298 | def decode(self, img, beam_size, sess, max_steps=30): 299 | """Decode an image with a sentences.""" 300 | 301 | # Initilize beam search variables 302 | # Candidate will be represented with a dictionary 303 | # "indexes": a list with indexes denoted a sentence; 304 | # "words": word in the decoded sentence without 305 | # "score": log-likelihood of the sentence 306 | # "state": RNN state when generating the last word of the candidate 307 | good_sentences = [] # store sentences already ended with 308 | cur_best_cand = [] # store current best candidates 309 | highest_score = 0.0 # hightest log-likelihodd in good sentences 310 | 311 | # Get the initial logit and state 312 | cand = {'indexes': [], 'score': 0} 313 | cur_best_cand.append(cand) 314 | 315 | # Expand the current best candidates until max_steps or no candidate 316 | for i in xrange(max_steps + 1): 317 | # expand candidates 318 | cand_pool = [] 319 | #for cand in cur_best_cand: 320 | #probs, state = self.get_probs_cont(cand['state'], cand['indexes'][-1], sess) 321 | if i == 0: 322 | all_probs, all_states = self.get_probs_init(img, sess) 323 | else: 324 | states = [np.vstack([cand['state'][i] for cand in cur_best_cand]) for i in xrange(len(cur_best_cand[0]['state']))] 325 | indexes = [cand['indexes'][-1] for cand in cur_best_cand] 326 | imgs = np.vstack([img] * len(cur_best_cand)) 327 | all_probs, all_states = self.get_probs_cont(states, imgs, indexes, sess) 328 | for ind_cand in range(len(cur_best_cand)): 329 | cand = cur_best_cand[ind_cand] 330 | probs = all_probs[ind_cand] 331 | state = [x[ind_cand] for x in all_states] 332 | 333 | probs = np.squeeze(probs) 334 | probs_order = np.argsort(-probs) 335 | for ind_b in xrange(beam_size): 336 | cand_e = copy.deepcopy(cand) 337 | cand_e['indexes'].append(probs_order[ind_b]) 338 | cand_e['score'] -= np.log(probs[probs_order[ind_b]]) 339 | cand_e['state'] = state 340 | cand_pool.append(cand_e) 341 | # get final cand_pool 342 | cur_best_cand = sorted(cand_pool, key=lambda cand: cand['score']) 343 | cur_best_cand = utils.truncate_list(cur_best_cand, beam_size) 344 | 345 | # move candidates end with to good_sentences or remove it 346 | cand_left = [] 347 | for cand in cur_best_cand: 348 | if len(good_sentences) > beam_size and cand['score'] > highest_score: 349 | continue # No need to expand that candidate 350 | if cand['indexes'][-1] == 0: #end of sentence 351 | good_sentences.append(cand) 352 | highest_score = max(highest_score, cand['score']) 353 | else: 354 | cand_left.append(cand) 355 | cur_best_cand = cand_left 356 | if not cur_best_cand: 357 | break 358 | 359 | # Add candidate left in cur_best_cand to good sentences 360 | for cand in cur_best_cand: 361 | if len(good_sentences) > beam_size and cand['score'] > highest_score: 362 | continue 363 | if cand['indexes'][-1] != 0: 364 | cand['indexes'].append(0) 365 | good_sentences.append(cand) 366 | highest_score = max(highest_score, cand['score']) 367 | 368 | # Sort good sentences and return the final list 369 | good_sentences = sorted(good_sentences, key=lambda cand: cand['score']) 370 | good_sentences = utils.truncate_list(good_sentences, beam_size) 371 | 372 | return [sent['indexes'] for sent in 
good_sentences] 373 | 374 | def get_probs_init(self, img, sess): 375 | """Use the model to get initial logit""" 376 | m = self.decoder_model_init 377 | 378 | probs, state = sess.run(m, {self.images: img}) 379 | 380 | return (probs, state) 381 | 382 | def get_probs_cont(self, prev_state, img, prev_word, sess): 383 | """Use the model to get continued logit""" 384 | m = self.decoder_model_cont 385 | prev_word = np.array(prev_word, dtype='int32') 386 | 387 | # Feed images, input words, and the flattened state of previous time step. 388 | placeholders = [self.images, self.decoder_prev_word] + self.decoder_flattened_state 389 | feeded = [img, prev_word] + prev_state 390 | 391 | probs, state = sess.run(m, {placeholders[i]: feeded[i] for i in xrange(len(placeholders))}) 392 | 393 | return (probs, state) -------------------------------------------------------------------------------- /misc/ShowTellModel.py: -------------------------------------------------------------------------------- 1 | from __future__ import absolute_import 2 | from __future__ import division 3 | from __future__ import print_function 4 | 5 | import tensorflow as tf 6 | import tensorflow.contrib.slim as slim 7 | import os 8 | import vgg 9 | import copy 10 | 11 | import numpy as np 12 | import misc.utils as utils 13 | 14 | # The maximimum step during generation 15 | MAX_STEPS = 30 16 | 17 | class ShowTellModel(): 18 | 19 | def initialize(self, sess): 20 | # Initialize the variables 21 | sess.run(tf.global_variables_initializer()) 22 | # Initialize the saver 23 | self.saver = tf.train.Saver(tf.trainable_variables(), max_to_keep=1) 24 | # Load weights from the checkpoint 25 | if vars(self.opt).get('start_from', None): 26 | self.saver.restore(sess, self.opt.ckpt.model_checkpoint_path) 27 | # Initialize the summary writer 28 | self.summary_writer = tf.summary.FileWriter(self.opt.checkpoint_path, sess.graph) 29 | 30 | def __init__(self, opt): 31 | self.vocab_size = opt.vocab_size 32 | self.input_encoding_size = opt.input_encoding_size 33 | self.rnn_size = opt.rnn_size 34 | self.num_layers = opt.num_layers 35 | self.drop_prob_lm = opt.drop_prob_lm 36 | self.seq_length = opt.seq_length 37 | self.vocab_size = opt.vocab_size 38 | self.seq_per_img = opt.seq_per_img 39 | 40 | self.opt = opt 41 | 42 | # Variable indicating in training mode or evaluation mode 43 | self.training = tf.Variable(True, trainable = False, name = "training") 44 | 45 | # Input variables 46 | self.images = tf.placeholder(tf.float32, [None, 224, 224, 3], name = "images") 47 | self.labels = tf.placeholder(tf.int32, [None, self.seq_length + 2], name = "labels") 48 | self.masks = tf.placeholder(tf.float32, [None, self.seq_length + 2], name = "masks") 49 | 50 | # Build CNN 51 | if vars(self.opt).get('start_from', None): 52 | cnn_weight = None 53 | else: 54 | cnn_weight = vars(self.opt).get('cnn_weight', None) 55 | if self.opt.cnn_model == 'vgg16': 56 | self.cnn = vgg.Vgg16(cnn_weight) 57 | if self.opt.cnn_model == 'vgg19': 58 | self.cnn = vgg.Vgg19(cnn_weight) 59 | 60 | with tf.variable_scope("cnn"): 61 | self.cnn.build(self.images) 62 | self.fc7 = self.cnn.drop7 63 | self.cnn_training = self.cnn.training 64 | 65 | # Variable in language model 66 | with tf.variable_scope("rnnlm"): 67 | # Word Embedding table 68 | self.Wemb = tf.Variable(tf.random_uniform([self.vocab_size + 1, self.input_encoding_size], -0.1, 0.1), name='Wemb') 69 | 70 | # RNN cell 71 | if opt.rnn_type == 'rnn': 72 | self.cell_fn = cell_fn = tf.contrib.rnn.BasicRNNCell 73 | elif opt.rnn_type == 'gru': 74 | 
self.cell_fn = cell_fn = tf.contrib.rnn.GRUCell 75 | elif opt.rnn_type == 'lstm': 76 | self.cell_fn = cell_fn = tf.contrib.rnn.LSTMCell 77 | else: 78 | raise Exception("RNN type not supported: {}".format(opt.rnn_type)) 79 | 80 | # keep_prob is a function of training flag 81 | self.keep_prob = tf.cond(self.training, 82 | lambda : tf.constant(1 - self.drop_prob_lm), 83 | lambda : tf.constant(1.0), name = 'keep_prob') 84 | 85 | # basic cell has dropout wrapper 86 | self.basic_cell = cell = tf.contrib.rnn.DropoutWrapper(cell_fn(self.rnn_size), 1.0, self.keep_prob) 87 | # cell is the final cell of each timestep 88 | self.cell = tf.contrib.rnn.MultiRNNCell([cell] * opt.num_layers) 89 | 90 | def build_model(self): 91 | with tf.name_scope("batch_size"): 92 | # Get batch_size from the first dimension of self.images 93 | self.batch_size = tf.shape(self.images)[0] 94 | 95 | with tf.variable_scope("cnn"): 96 | image_emb = slim.fully_connected(self.fc7, self.input_encoding_size, activation_fn=None, scope='encode_image') 97 | with tf.variable_scope("rnnlm"): 98 | # Replicate self.seq_per_img times for each image embedding 99 | image_emb = tf.reshape(tf.tile(tf.expand_dims(image_emb, 1), [1, self.seq_per_img, 1]), [self.batch_size * self.seq_per_img, self.input_encoding_size]) 100 | 101 | # rnn_inputs is a list of input, each element is the input of rnn at each time step 102 | # time step 0 is the image embedding 103 | rnn_inputs = tf.split(axis=1, num_or_size_splits=self.seq_length + 1, value=tf.nn.embedding_lookup(self.Wemb, self.labels[:,:self.seq_length + 1])) 104 | rnn_inputs = [tf.squeeze(input_, [1]) for input_ in rnn_inputs] 105 | rnn_inputs = [image_emb] + rnn_inputs 106 | 107 | # The initial sate is zero 108 | initial_state = self.cell.zero_state(self.batch_size * self.seq_per_img, tf.float32) 109 | 110 | outputs, last_state = tf.contrib.legacy_seq2seq.rnn_decoder(rnn_inputs, initial_state, self.cell, loop_function=None) 111 | 112 | outputs = tf.concat(axis=0, values=outputs[1:]) 113 | self.logits = slim.fully_connected(outputs, self.vocab_size + 1, activation_fn = None, scope = 'logit') 114 | self.logits = tf.split(axis=0, num_or_size_splits=len(rnn_inputs) - 1, value=self.logits) 115 | 116 | with tf.variable_scope("loss"): 117 | loss = tf.contrib.legacy_seq2seq.sequence_loss_by_example(self.logits, 118 | [tf.squeeze(label, [1]) for label in tf.split(axis=1, num_or_size_splits=self.seq_length + 1, value=self.labels[:, 1:])], # self.labels[:,1:] is the target 119 | [tf.squeeze(mask, [1]) for mask in tf.split(axis=1, num_or_size_splits=self.seq_length + 1, value=self.masks[:, 1:])]) 120 | self.cost = tf.reduce_mean(loss) 121 | 122 | self.final_state = last_state 123 | self.lr = tf.Variable(0.0, trainable=False) 124 | self.cnn_lr = tf.Variable(0.0, trainable=False) 125 | 126 | # Collect the rnn variables, and create the optimizer of rnn 127 | tvars = tf.get_collection(tf.GraphKeys.TRAINABLE_VARIABLES, scope='rnnlm') 128 | grads = utils.clip_by_value(tf.gradients(self.cost, tvars), -self.opt.grad_clip, self.opt.grad_clip) 129 | #grads, _ = tf.clip_by_global_norm(tf.gradients(self.cost, tvars), 130 | # self.opt.grad_clip) 131 | optimizer = utils.get_optimizer(self.opt, self.lr) 132 | self.train_op = optimizer.apply_gradients(zip(grads, tvars)) 133 | 134 | # Collect the cnn variables, and create the optimizer of cnn 135 | cnn_tvars = tf.get_collection(tf.GraphKeys.TRAINABLE_VARIABLES, scope='cnn') 136 | cnn_grads = utils.clip_by_value(tf.gradients(self.cost, cnn_tvars), -self.opt.grad_clip, 
self.opt.grad_clip) 137 | #cnn_grads, _ = tf.clip_by_global_norm(tf.gradients(self.cost, cnn_tvars), 138 | # self.opt.grad_clip) 139 | cnn_optimizer = utils.get_cnn_optimizer(self.opt, self.cnn_lr) 140 | self.cnn_train_op = cnn_optimizer.apply_gradients(zip(cnn_grads, cnn_tvars)) 141 | 142 | tf.summary.scalar('training loss', self.cost) 143 | tf.summary.scalar('learning rate', self.lr) 144 | tf.summary.scalar('cnn learning rate', self.cnn_lr) 145 | self.summaries = tf.summary.merge_all() 146 | 147 | def build_generator(self): 148 | """ 149 | Generator for generating captions 150 | Support sample max or sample from distribution 151 | No Beam search here; beam search is in decoder 152 | """ 153 | # Variables for the sample setting 154 | self.sample_max = tf.Variable(True, trainable = False, name = "sample_max") 155 | self.sample_temperature = tf.Variable(1.0, trainable = False, name = "temperature") 156 | 157 | self.generator = [] 158 | with tf.variable_scope("cnn"): 159 | image_emb = slim.fully_connected(self.fc7, self.input_encoding_size, activation_fn=None, reuse=True, scope='encode_image') 160 | with tf.variable_scope("rnnlm") as rnnlm_scope: 161 | rnn_inputs = [image_emb] + [tf.nn.embedding_lookup(self.Wemb, tf.zeros([self.batch_size], tf.int32))] + [0] * (MAX_STEPS - 1) 162 | initial_state = self.cell.zero_state(self.batch_size, tf.float32) 163 | 164 | tf.get_variable_scope().reuse_variables() 165 | 166 | def loop(prev, i): 167 | if i == 1: 168 | return rnn_inputs[1] 169 | with tf.variable_scope(rnnlm_scope): 170 | prev = slim.fully_connected(prev, self.vocab_size + 1, activation_fn = None, scope = 'logit') 171 | prev_symbol = tf.stop_gradient(tf.cond(self.sample_max, 172 | lambda: tf.argmax(prev, 1), # pick the word with largest probability as the input of next time step 173 | lambda: tf.squeeze( 174 | tf.multinomial(tf.nn.log_softmax(prev) / self.sample_temperature, 1), 1))) # Sample from the distribution 175 | self.generator.append(prev_symbol) 176 | return tf.nn.embedding_lookup(self.Wemb, prev_symbol) 177 | 178 | outputs, last_state = tf.contrib.legacy_seq2seq.rnn_decoder(rnn_inputs, initial_state, self.cell, loop_function=loop) 179 | self.g_output = output = tf.reshape(tf.concat(axis=1, values=outputs[1:]), [-1, self.rnn_size]) # outputs[1:], because we don't calculate loss on time 0. 180 | self.g_logits = logits = slim.fully_connected(output, self.vocab_size + 1, activation_fn = None, scope = 'logit') 181 | self.g_probs = probs = tf.reshape(tf.nn.softmax(logits), [self.batch_size, MAX_STEPS, self.vocab_size + 1]) 182 | 183 | self.generator = tf.transpose(tf.reshape(tf.concat(axis=0, values=self.generator), [MAX_STEPS - 1, -1])) 184 | 185 | # Decoders are used for beam search. More complicated than sample max.
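    # build_decoder_rnn(True) feeds the image embedding with a zero initial state;
    # build_decoder_rnn(False) feeds the previous word plus placeholder states, so decode()
    # below can advance the RNN one step per sess.run call.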
186 | # Decoder decodes the image one time step at a time 187 | def build_decoder_rnn(self, first_step): 188 | 189 | with tf.variable_scope("cnn"): 190 | image_emb = slim.fully_connected(self.fc7, self.input_encoding_size, reuse=True, activation_fn=None, scope='encode_image') 191 | with tf.variable_scope("rnnlm"): 192 | if first_step: 193 | rnn_input = image_emb # At the first step, the input is the embedded image 194 | else: 195 | # The input of later time step, is the embedding of the previous word 196 | # The previous word is a placeholder 197 | self.decoder_prev_word = tf.placeholder(tf.int32, [None]) 198 | rnn_input = tf.nn.embedding_lookup(self.Wemb, self.decoder_prev_word) 199 | 200 | batch_size = tf.shape(rnn_input)[0] 201 | 202 | tf.get_variable_scope().reuse_variables() 203 | 204 | if not first_step: 205 | # If not first step, the states are also placeholders. 206 | self.decoder_initial_state = initial_state = utils.get_placeholder_state(self.cell.state_size) 207 | self.decoder_flattened_state = utils.flatten_state(initial_state) 208 | else: 209 | # The states for the first step are zero. 210 | initial_state = self.cell.zero_state(batch_size, tf.float32) 211 | 212 | outputs, state = tf.contrib.legacy_seq2seq.rnn_decoder([rnn_input], initial_state, self.cell) 213 | logits = slim.fully_connected(outputs[0], self.vocab_size + 1, activation_fn = None, scope = 'logit') 214 | decoder_probs = tf.reshape(tf.nn.softmax(logits), [batch_size, self.vocab_size + 1]) 215 | decoder_state = utils.flatten_state(state) 216 | # output the current word distribution and states 217 | return [decoder_probs, decoder_state] 218 | 219 | 220 | def build_decoder(self): 221 | self.decoder_model_init = self.build_decoder_rnn(True) 222 | self.decoder_model_cont = self.build_decoder_rnn(False) 223 | 224 | def decode(self, img, beam_size, sess, max_steps=30): 225 | """Decode an image with a sentences.""" 226 | 227 | # Initilize beam search variables 228 | # Candidate will be represented with a dictionary 229 | # "indexes": a list with indexes denoted a sentence; 230 | # "words": word in the decoded sentence without 231 | # "score": log-likelihood of the sentence 232 | # "state": RNN state when generating the last word of the candidate 233 | good_sentences = [] # store sentences already ended with 234 | cur_best_cand = [] # store current best candidates 235 | highest_score = 0.0 # hightest log-likelihodd in good sentences 236 | 237 | # Get the initial logit and state 238 | probs_init, state_init = self.get_probs_init(img, sess) 239 | cand = {'indexes': [0], 'score': 0, 'state': state_init} 240 | cur_best_cand.append(cand) 241 | 242 | # Expand the current best candidates until max_steps or no candidate 243 | for i in xrange(max_steps): 244 | # expand candidates 245 | cand_pool = [] 246 | states = [np.vstack([cand['state'][i] for cand in cur_best_cand]) for i in xrange(len(cur_best_cand[0]['state']))] 247 | indexes = [cand['indexes'][-1] for cand in cur_best_cand] 248 | all_probs, all_states = self.get_probs_cont(states, indexes, sess) 249 | for ind_cand in range(len(cur_best_cand)): 250 | cand = cur_best_cand[ind_cand] 251 | probs = all_probs[ind_cand] 252 | state = [x[ind_cand] for x in all_states] 253 | 254 | probs = np.squeeze(probs) 255 | probs_order = np.argsort(-probs) 256 | for ind_b in xrange(beam_size): 257 | cand_e = copy.deepcopy(cand) 258 | cand_e['indexes'].append(probs_order[ind_b]) 259 | cand_e['score'] -= np.log(probs[probs_order[ind_b]]) 260 | cand_e['state'] = state 261 | cand_pool.append(cand_e) 
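                # For illustration, each candidate in cand_pool at this point looks like
                #   {'indexes': [0, 12, 7], 'score': 3.41, 'state': [np.array(...), ...]}
                # where 'score' is the accumulated negative log-likelihood, so smaller is better.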
262 | # get final cand_pool 263 | cur_best_cand = sorted(cand_pool, key=lambda cand: cand['score']) 264 | cur_best_cand = utils.truncate_list(cur_best_cand, beam_size) 265 | 266 | # move candidates end with to good_sentences or remove it 267 | cand_left = [] 268 | for cand in cur_best_cand: 269 | if len(good_sentences) > beam_size and cand['score'] > highest_score: 270 | continue # No need to expand that candidate 271 | if cand['indexes'][-1] == 0: #end of sentence 272 | good_sentences.append(cand) 273 | highest_score = max(highest_score, cand['score']) 274 | else: 275 | cand_left.append(cand) 276 | cur_best_cand = cand_left 277 | if not cur_best_cand: 278 | break 279 | 280 | # Add candidate left in cur_best_cand to good sentences 281 | for cand in cur_best_cand: 282 | if len(good_sentences) > beam_size and cand['score'] > highest_score: 283 | continue 284 | if cand['indexes'][-1] != 0: 285 | cand['indexes'].append(0) 286 | good_sentences.append(cand) 287 | highest_score = max(highest_score, cand['score']) 288 | 289 | # Sort good sentences and return the final list 290 | good_sentences = sorted(good_sentences, key=lambda cand: cand['score']) 291 | good_sentences = utils.truncate_list(good_sentences, beam_size) 292 | 293 | return [sent['indexes'][1:] for sent in good_sentences] 294 | 295 | def get_probs_init(self, img, sess): 296 | """Use the model to get initial logit""" 297 | m = self.decoder_model_init 298 | 299 | probs, state = sess.run(m, {self.images: img}) 300 | 301 | return (probs, state) 302 | 303 | def get_probs_cont(self, prev_state, prev_word, sess): 304 | """Use the model to get continued logit""" 305 | m = self.decoder_model_cont 306 | prev_word = np.array(prev_word, dtype='int32') 307 | 308 | placeholders = [self.decoder_prev_word] + self.decoder_flattened_state 309 | feeded = [prev_word] + prev_state 310 | 311 | probs, state = sess.run(m, {placeholders[i]: feeded[i] for i in xrange(len(placeholders))}) 312 | 313 | return (probs, state) -------------------------------------------------------------------------------- /misc/__init__.py: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/ruotianluo/neuraltalk2-tensorflow/65cd3ad5383b0785c63ed3baba5f2cd51df7b59c/misc/__init__.py -------------------------------------------------------------------------------- /misc/utils.py: -------------------------------------------------------------------------------- 1 | from __future__ import absolute_import 2 | from __future__ import division 3 | from __future__ import print_function 4 | 5 | import tensorflow as tf 6 | import tensorflow.contrib.slim as slim 7 | import collections 8 | import six 9 | 10 | # My own clip by value which could input a list of tensors 11 | def clip_by_value(t_list, clip_value_min, clip_value_max, name=None): 12 | if (not isinstance(t_list, collections.Sequence) 13 | or isinstance(t_list, six.string_types)): 14 | raise TypeError("t_list should be a sequence") 15 | t_list = list(t_list) 16 | 17 | with tf.name_scope(name or "clip_by_value") as name: 18 | values = [ 19 | tf.convert_to_tensor( 20 | t.values if isinstance(t, tf.IndexedSlices) else t, 21 | name="t_%d" % i) 22 | if t is not None else t 23 | for i, t in enumerate(t_list)] 24 | values_clipped = [] 25 | for i, v in enumerate(values): 26 | if v is None: 27 | values_clipped.append(None) 28 | else: 29 | with tf.get_default_graph().colocate_with(v): 30 | values_clipped.append( 31 | tf.clip_by_value(v, clip_value_min, clip_value_max)) 32 | 33 | 
list_clipped = [ 34 | tf.IndexedSlices(c_v, t.indices, t.dense_shape) 35 | if isinstance(t, tf.IndexedSlices) 36 | else c_v 37 | for (c_v, t) in zip(values_clipped, t_list)] 38 | 39 | return list_clipped 40 | 41 | # Truncate the list of beam given a maximum length 42 | def truncate_list(l, max_len): 43 | if max_len == -1: 44 | max_len = len(l) 45 | return l[:min(len(l), max_len)] 46 | 47 | # Turn nested state into a flattened list 48 | # Used both for flattening the nested placeholder states and for output states value of previous time step 49 | def flatten_state(state): 50 | if isinstance(state, tf.contrib.rnn.LSTMStateTuple): 51 | return [state.c, state.h] 52 | elif isinstance(state, tuple): 53 | result = [] 54 | for i in xrange(len(state)): 55 | result += flatten_state(state[i]) 56 | return result 57 | else: 58 | return [state] 59 | 60 | # When decoding step by step: we need to initialize the state of next timestep according to the previous time step. 61 | # Because states could be nested tuples or lists, so we get the states recursively. 62 | def get_placeholder_state(state_size, scope = 'placeholder_state'): 63 | with tf.variable_scope(scope): 64 | if isinstance(state_size, tf.contrib.rnn.LSTMStateTuple): 65 | c = tf.placeholder(tf.float32, [None, state_size.c], name='LSTM_c') 66 | h = tf.placeholder(tf.float32, [None, state_size.h], name='LSTM_h') 67 | return tf.contrib.rnn.LSTMStateTuple(c,h) 68 | elif isinstance(state_size, tuple): 69 | result = [get_placeholder_state(state_size[i], "layer_"+str(i)) for i in xrange(len(state_size))] 70 | return tuple(result) 71 | elif isinstance(state_size, int): 72 | return tf.placeholder(tf.float32, [None, state_size], name='state') 73 | 74 | # Get the last hidden vector. (The hidden vector of the deepest layer) 75 | # For the input of the attention model of next time step. 76 | def last_hidden_vec(state): 77 | if isinstance(state, tuple): 78 | return last_hidden_vec(state[len(state) - 1]) 79 | elif isinstance(state, tf.contrib.rnn.LSTMStateTuple): 80 | return state.h 81 | else: 82 | return state 83 | 84 | # Input: seq, N*D numpy array, with element 0 .. vocab_size. 0 is END token. 85 | def decode_sequence(ix_to_word, seq): 86 | N, D = seq.shape 87 | out = [] 88 | for i in range(N): 89 | txt = '' 90 | for j in range(D): 91 | ix = seq[i,j] 92 | if ix > 0 : 93 | if j >= 1: 94 | txt = txt + ' ' 95 | txt = txt + ix_to_word[str(ix)] 96 | else: 97 | break 98 | out.append(txt) 99 | return out 100 | 101 | def get_initial_state(input, state_size, scope = 'init_state'): 102 | """ 103 | Recursively initialize the first state. 104 | 105 | state_size is a nested of tuple and LSTMStateTuple and integer. 106 | 107 | It is so complicated because we use state_is_tuple 108 | """ 109 | 110 | with tf.variable_scope(scope): 111 | if isinstance(state_size, tf.contrib.rnn.LSTMStateTuple): 112 | c = slim.fully_connected(input, state_size.c, activation_fn=tf.nn.tanh, scope='LSTM_c') 113 | h = slim.fully_connected(input, state_size.h, activation_fn=tf.nn.tanh, scope='LSTM_h') 114 | return tf.contrib.rnn.LSTMStateTuple(c,h) 115 | elif isinstance(state_size, tuple): 116 | result = [get_initial_state(input, state_size[i], "layer_"+str(i)) for i in xrange(len(state_size))] 117 | return tuple(result) 118 | elif isinstance(state_size, int): 119 | return slim.fully_connected(input, state_size, activation_fn=tf.nn.tanh, scope='state') 120 | 121 | def expand_feat(input, multiples, scope = 'expand_feat'): 122 | """ 123 | Expand the dimension of states; 124 | According to multiples. 
125 | 126 | Similar reason why it's so complicated. 127 | """ 128 | with tf.variable_scope(scope): 129 | if isinstance(input, tf.contrib.rnn.LSTMStateTuple): 130 | c = expand_feat(input.c, multiples, scope='expand_LSTM_c') 131 | h = expand_feat(input.h, multiples, scope='expand_LSTM_c') 132 | return tf.contrib.rnn.LSTMStateTuple(c,h) 133 | elif isinstance(input, tuple): 134 | result = [expand_feat(input[i], multiples, "expand_layer_"+str(i)) for i in xrange(len(input))] 135 | return tuple(result) 136 | else: 137 | return tf.reshape(tf.tile(tf.expand_dims(input, 1), [1, multiples, 1]), [tf.shape(input)[0] * multiples, input.get_shape()[1].value]) 138 | 139 | def get_optimizer(opt, lr): 140 | if opt.optim == 'rmsprop': 141 | return tf.train.RMSPropOptimizer(lr, momentum=opt.optim_alpha, epsilon=opt.optim_epsilon) 142 | elif opt.optim == 'adagrad': 143 | return tf.train.AdagradOptimizer(lr) 144 | elif opt.optim == 'sgd': 145 | return tf.train.GradientDescentOptimizer(lr) 146 | elif opt.optim == 'sgdm': 147 | return tf.train.MomentumOptimizer(lr, opt.optim_alpha) 148 | elif opt.optim == 'sgdmom': 149 | return tf.train.MomentumOptimizer(lr, opt.optim_alpha, use_nesterov=True) 150 | elif opt.optim == 'adam': 151 | return tf.train.AdamOptimizer(lr, beta1=opt.optim_alpha, beta2=opt.optim_beta, epsilon=opt.optim_epsilon) 152 | else: 153 | raise Exception('bad option opt.optim') 154 | 155 | def get_cnn_optimizer(opt, cnn_lr): 156 | if opt.cnn_optim == 'rmsprop': 157 | return tf.train.RMSPropOptimizer(cnn_lr, momentum=opt.cnn_optim_alpha, epsilon=opt.optim_epsilon) 158 | elif opt.cnn_optim == 'adagrad': 159 | return tf.train.AdagradOptimizer(cnn_lr) 160 | elif opt.cnn_optim == 'sgd': 161 | return tf.train.GradientDescentOptimizer(cnn_lr) 162 | elif opt.cnn_optim == 'sgdm': 163 | return tf.train.MomentumOptimizer(cnn_lr, opt.cnn_optim_alpha) 164 | elif opt.cnn_optim == 'sgdmom': 165 | return tf.train.MomentumOptimizer(cnn_lr, opt.cnn_optim_alpha, use_nesterov=True) 166 | elif opt.cnn_optim == 'adam': 167 | return tf.train.AdamOptimizer(cnn_lr, beta1=opt.cnn_optim_alpha, beta2=opt.cnn_optim_beta, epsilon=opt.optim_epsilon) 168 | else: 169 | raise Exception('bad option opt.cnn_optim') 170 | -------------------------------------------------------------------------------- /models.py: -------------------------------------------------------------------------------- 1 | import tensorflow as tf 2 | import tensorflow.contrib.slim as slim 3 | import os 4 | import vgg 5 | import copy 6 | 7 | import numpy as np 8 | import misc.utils as utils 9 | 10 | from misc.ShowTellModel import ShowTellModel 11 | from misc.AttentionModel import AttentionModel 12 | from misc.ShowAttendTellModel import ShowAttendTellModel 13 | 14 | def setup(opt): 15 | 16 | # check compatibility if training is continued from previously saved model 17 | if vars(opt).get('start_from', None) is not None: 18 | # check if all necessary files exist 19 | assert os.path.isdir(opt.start_from)," %s must be a a path" % opt.start_from 20 | assert os.path.isfile(os.path.join(opt.start_from,"infos_"+opt.id+".pkl")),"infos.pkl file does not exist in path %s"%opt.start_from 21 | ckpt = tf.train.get_checkpoint_state(opt.start_from) 22 | assert ckpt,"No checkpoint found" 23 | assert ckpt.model_checkpoint_path,"No model path found in checkpoint" 24 | opt.ckpt = ckpt 25 | if opt.caption_model == 'show_tell': 26 | return ShowTellModel(opt) 27 | elif opt.caption_model == 'attention': 28 | return AttentionModel(opt) 29 | elif opt.caption_model == 'show_attend_tell': 
30 | return ShowAttendTellModel(opt) 31 | else: 32 | raise Exception("Caption model not supported: {}".format(opt.caption_model)) 33 | -------------------------------------------------------------------------------- /opts.py: -------------------------------------------------------------------------------- 1 | import argparse 2 | 3 | def parse_opt(): 4 | parser = argparse.ArgumentParser() 5 | # Data input settings 6 | parser.add_argument('--input_json', type=str, default='data/coco.json', 7 | help='path to the json file containing additional info and vocab') 8 | parser.add_argument('--input_h5', type=str, default='data/coco.json', 9 | help='path to the h5file containing the preprocessed dataset') 10 | parser.add_argument('--cnn_model', type=str, default='vgg16', 11 | help='vgg16 or vgg19') 12 | parser.add_argument('--cnn_weight', type=str, default='models/vgg16.npy', 13 | help='path to CNN tf model. Note this MUST be a vgg16 right now.') 14 | parser.add_argument('--start_from', type=str, default=None, 15 | help="""continue training from saved model at this path. Path must contain files saved by previous training process: 16 | 'infos.pkl' : configuration; 17 | 'checkpoint' : paths to model file(s) (created by tf). 18 | Note: this file contains absolute paths, be careful when moving files around; 19 | 'model.ckpt-*' : file(s) with model definition (created by tf) 20 | """) 21 | 22 | # Model settings 23 | parser.add_argument('--caption_model', type=str, default="show_tell", 24 | help='show_tell, show_attend_tell, attention') 25 | parser.add_argument('--rnn_size', type=int, default=512, 26 | help='size of the rnn in number of hidden nodes in each layer') 27 | parser.add_argument('--num_layers', type=int, default=1, 28 | help='number of layers in the RNN') 29 | parser.add_argument('--rnn_type', type=str, default='lstm', 30 | help='rnn, gru, or lstm') 31 | parser.add_argument('--input_encoding_size', type=int, default=512, 32 | help='the encoding size of each token in the vocabulary, and the image.') 33 | parser.add_argument('--att_hid_size', type=int, default=512, 34 | help='the hidden size of the attention MLP; only useful in show_attend_tell; 0 if not using hidden layer') 35 | 36 | # Optimization: General 37 | parser.add_argument('--max_epochs', type=int, default=-1, 38 | help='number of epochs') 39 | parser.add_argument('--batch_size', type=int, default=16, 40 | help='minibatch size') 41 | parser.add_argument('--grad_clip', type=float, default=0.1, #5., 42 | help='clip gradients at this value') 43 | parser.add_argument('--drop_prob_lm', type=float, default=0.5, 44 | help='strength of dropout in the Language Model RNN') 45 | parser.add_argument('--finetune_cnn_after', type=int, default=-1, 46 | help='After what iteration do we start finetuning the CNN? (-1 = disable; never finetune, 0 = finetune from start)') 47 | parser.add_argument('--seq_per_img', type=int, default=5, 48 | help='number of captions to sample for each image during training. Done for efficiency since CNN forward pass is expensive. E.g. coco has 5 sents/image') 49 | parser.add_argument('--beam_size', type=int, default=1, 50 | help='used when sample_max = 1, indicates number of beams in beam search. Usually 2 or 3 works well. More is not better. Set this to 1 for faster runtime but a bit worse performance.') 51 | 52 | #Optimization: for the Language Model 53 | parser.add_argument('--optim', type=str, default='adam', 54 | help='what update to use? 
rmsprop|sgd|sgdmom|adagrad|adam') 55 | parser.add_argument('--learning_rate', type=float, default=4e-4, 56 | help='learning rate') 57 | parser.add_argument('--learning_rate_decay_start', type=int, default=-1, 58 | help='at what iteration to start decaying learning rate? (-1 = dont) (in epoch)') 59 | parser.add_argument('--learning_rate_decay_every', type=int, default=10, 60 | help='every how many iterations thereafter to drop LR by half?(in epoch)') 61 | parser.add_argument('--optim_alpha', type=float, default=0.8, 62 | help='alpha for adam') 63 | parser.add_argument('--optim_beta', type=float, default=0.999, 64 | help='beta used for adam') 65 | parser.add_argument('--optim_epsilon', type=float, default=1e-8, 66 | help='epsilon that goes into denominator for smoothing') 67 | 68 | #Optimization: for the CNN 69 | parser.add_argument('--cnn_optim', type=str, default='adam', 70 | help='optimization to use for CNN') 71 | parser.add_argument('--cnn_optim_alpha', type=float, default=0.8, 72 | help='alpha for momentum of CNN') 73 | parser.add_argument('--cnn_optim_beta', type=float, default=0.999, 74 | help='beta for momentum of CNN') 75 | parser.add_argument('--cnn_learning_rate', type=float, default=1e-5, 76 | help='learning rate for the CNN') 77 | parser.add_argument('--cnn_weight_decay', type=float, default=0, 78 | help='L2 weight decay just for the CNN') 79 | 80 | # Evaluation/Checkpointing 81 | parser.add_argument('--val_images_use', type=int, default=3200, 82 | help='how many images to use when periodically evaluating the validation loss? (-1 = all)') 83 | parser.add_argument('--save_checkpoint_every', type=int, default=2500, 84 | help='how often to save a model checkpoint (in iterations)?') 85 | parser.add_argument('--checkpoint_path', type=str, default='save', 86 | help='directory to store checkpointed models') 87 | parser.add_argument('--language_eval', type=int, default=0, 88 | help='Evaluate language as well (1 = yes, 0 = no)? BLEU/CIDEr/METEOR/ROUGE_L? requires coco-caption code from Github.') 89 | parser.add_argument('--losses_log_every', type=int, default=25, 90 | help='How often do we snapshot losses, for inclusion in the progress dump? (0 = disable)') 91 | parser.add_argument('--load_best_score', type=int, default=1, 92 | help='Do we load previous best score when resuming training.') 93 | 94 | # misc 95 | parser.add_argument('--id', type=str, default='', 96 | help='an id identifying this run/job. 
used in cross-val and appended when writing progress files') 97 | parser.add_argument('--train_only', type=int, default=0, 98 | help='if true then use 80k, else use 110k') 99 | 100 | args = parser.parse_args() 101 | 102 | # Check if args are valid 103 | assert args.rnn_size > 0, "rnn_size should be greater than 0" 104 | assert args.num_layers > 0, "num_layers should be greater than 0" 105 | assert args.input_encoding_size > 0, "input_encoding_size should be greater than 0" 106 | assert args.batch_size > 0, "batch_size should be greater than 0" 107 | assert args.drop_prob_lm >= 0 and args.drop_prob_lm < 1, "drop_prob_lm should be between 0 and 1" 108 | assert args.seq_per_img > 0, "seq_per_img should be greater than 0" 109 | assert args.beam_size > 0, "beam_size should be greater than 0" 110 | assert args.save_checkpoint_every > 0, "save_checkpoint_every should be greater than 0" 111 | assert args.losses_log_every > 0, "losses_log_every should be greater than 0" 112 | assert args.language_eval == 0 or args.language_eval == 1, "language_eval should be 0 or 1" 113 | assert args.load_best_score == 0 or args.load_best_score == 1, "load_best_score should be 0 or 1" 114 | assert args.train_only == 0 or args.train_only == 1, "train_only should be 0 or 1" 115 | 116 | return args -------------------------------------------------------------------------------- /prepro.py: -------------------------------------------------------------------------------- 1 | """ 2 | Preprocess a raw json dataset into hdf5/json files for use in data_loader.lua 3 | 4 | Input: json file that has the form 5 | [{ file_path: 'path/img.jpg', captions: ['a caption', ...] }, ...] 6 | example element in this list would look like 7 | {'captions': [u'A man with a red helmet on a small moped on a dirt road. ', u'Man riding a motor bike on a dirt road on the countryside.', u'A man riding on the back of a motorcycle.', u'A dirt path with a young person on a motor bike rests to the foreground of a verdant area with a bridge and a background of cloud-wreathed mountains. ', u'A man in a red shirt and a red hat is on a motorcycle on a hill side.'], 'file_path': u'val2014/COCO_val2014_000000391895.jpg', 'id': 391895} 8 | 9 | This script reads this json, does some basic preprocessing on the captions 10 | (e.g. lowercase, etc.), creates a special UNK token, and encodes everything to arrays 11 | 12 | Output: a json file and an hdf5 file 13 | The hdf5 file contains several fields: 14 | /images is (N,3,256,256) uint8 array of raw image data in RGB format 15 | /labels is (M,max_length) uint32 array of encoded labels, zero padded 16 | /label_start_ix and /label_end_ix are (N,) uint32 arrays of pointers to the 17 | first and last indices (in range 1..M) of labels for each image 18 | /label_length stores the length of the sequence for each of the M sequences 19 | 20 | The json file has a dict that contains: 21 | - an 'ix_to_word' field storing the vocab in form {ix:'word'}, where ix is 1-indexed 22 | - an 'images' field that is a list holding auxiliary information for each image, 23 | such as in particular the 'split' it was assigned to.
24 | """ 25 | 26 | import os 27 | import json 28 | import argparse 29 | from random import shuffle, seed 30 | import string 31 | # non-standard dependencies: 32 | import h5py 33 | import numpy as np 34 | from scipy.misc import imread, imresize 35 | 36 | def prepro_captions(imgs): 37 | 38 | # preprocess all the captions 39 | print 'example processed tokens:' 40 | for i,img in enumerate(imgs): 41 | img['processed_tokens'] = [] 42 | for j,s in enumerate(img['captions']): 43 | txt = str(s).lower().translate(None, string.punctuation).strip().split() 44 | img['processed_tokens'].append(txt) 45 | if i < 10 and j == 0: print txt 46 | 47 | def build_vocab(imgs, params): 48 | count_thr = params['word_count_threshold'] 49 | 50 | # count up the number of words 51 | counts = {} 52 | for img in imgs: 53 | for txt in img['processed_tokens']: 54 | for w in txt: 55 | counts[w] = counts.get(w, 0) + 1 56 | cw = sorted([(count,w) for w,count in counts.iteritems()], reverse=True) 57 | print 'top words and their counts:' 58 | print '\n'.join(map(str,cw[:20])) 59 | 60 | # print some stats 61 | total_words = sum(counts.itervalues()) 62 | print 'total words:', total_words 63 | bad_words = [w for w,n in counts.iteritems() if n <= count_thr] 64 | vocab = [w for w,n in counts.iteritems() if n > count_thr] 65 | bad_count = sum(counts[w] for w in bad_words) 66 | print 'number of bad words: %d/%d = %.2f%%' % (len(bad_words), len(counts), len(bad_words)*100.0/len(counts)) 67 | print 'number of words in vocab would be %d' % (len(vocab), ) 68 | print 'number of UNKs: %d/%d = %.2f%%' % (bad_count, total_words, bad_count*100.0/total_words) 69 | 70 | # lets look at the distribution of lengths as well 71 | sent_lengths = {} 72 | for img in imgs: 73 | for txt in img['processed_tokens']: 74 | nw = len(txt) 75 | sent_lengths[nw] = sent_lengths.get(nw, 0) + 1 76 | max_len = max(sent_lengths.keys()) 77 | print 'max length sentence in raw data: ', max_len 78 | print 'sentence length distribution (count, number of words):' 79 | sum_len = sum(sent_lengths.values()) 80 | for i in xrange(max_len+1): 81 | print '%2d: %10d %f%%' % (i, sent_lengths.get(i,0), sent_lengths.get(i,0)*100.0/sum_len) 82 | 83 | # lets now produce the final annotations 84 | if bad_count > 0: 85 | # additional special UNK token we will use below to map infrequent words to 86 | print 'inserting the special UNK token' 87 | vocab.append('UNK') 88 | 89 | for img in imgs: 90 | img['final_captions'] = [] 91 | for txt in img['processed_tokens']: 92 | caption = [w if counts.get(w,0) > count_thr else 'UNK' for w in txt] 93 | img['final_captions'].append(caption) 94 | 95 | return vocab 96 | 97 | def assign_splits(imgs, params): 98 | num_val = params['num_val'] 99 | num_test = params['num_test'] 100 | 101 | for i,img in enumerate(imgs): 102 | if i < num_val: 103 | img['split'] = 'val' 104 | elif i < num_val + num_test: 105 | img['split'] = 'test' 106 | else: 107 | img['split'] = 'train' 108 | 109 | print 'assigned %d to val, %d to test.' % (num_val, num_test) 110 | 111 | def encode_captions(imgs, params, wtoi): 112 | """ 113 | encode all captions into one large array, which will be 1-indexed. 114 | also produces label_start_ix and label_end_ix which store 1-indexed 115 | and inclusive (Lua-style) pointers to the first and last caption for 116 | each image in the dataset. 
117 | """ 118 | 119 | max_length = params['max_length'] 120 | N = len(imgs) 121 | M = sum(len(img['final_captions']) for img in imgs) # total number of captions 122 | 123 | label_arrays = [] 124 | label_start_ix = np.zeros(N, dtype='uint32') # note: these will be one-indexed 125 | label_end_ix = np.zeros(N, dtype='uint32') 126 | label_length = np.zeros(M, dtype='uint32') 127 | caption_counter = 0 128 | counter = 1 129 | for i,img in enumerate(imgs): 130 | n = len(img['final_captions']) 131 | assert n > 0, 'error: some image has no captions' 132 | 133 | Li = np.zeros((n, max_length), dtype='uint32') 134 | for j,s in enumerate(img['final_captions']): 135 | label_length[caption_counter] = min(max_length, len(s)) # record the length of this sequence 136 | caption_counter += 1 137 | for k,w in enumerate(s): 138 | if k < max_length: 139 | Li[j,k] = wtoi[w] 140 | 141 | # note: word indices are 1-indexed, and captions are padded with zeros 142 | label_arrays.append(Li) 143 | label_start_ix[i] = counter 144 | label_end_ix[i] = counter + n - 1 145 | 146 | counter += n 147 | 148 | L = np.concatenate(label_arrays, axis=0) # put all the labels together 149 | assert L.shape[0] == M, 'lengths don\'t match? that\'s weird' 150 | assert np.all(label_length > 0), 'error: some caption had no words?' 151 | 152 | print 'encoded captions to array of size ', `L.shape` 153 | return L, label_start_ix, label_end_ix, label_length 154 | 155 | def main(params): 156 | 157 | imgs = json.load(open(params['input_json'], 'r')) 158 | seed(123) # make reproducible 159 | shuffle(imgs) # shuffle the order 160 | 161 | # tokenization and preprocessing 162 | prepro_captions(imgs) 163 | 164 | # create the vocab 165 | vocab = build_vocab(imgs, params) 166 | itow = {i+1:w for i,w in enumerate(vocab)} # a 1-indexed vocab translation table 167 | wtoi = {w:i+1 for i,w in enumerate(vocab)} # inverse table 168 | 169 | # assign the splits 170 | assign_splits(imgs, params) 171 | 172 | # encode captions in large arrays, ready to ship to hdf5 file 173 | L, label_start_ix, label_end_ix, label_length = encode_captions(imgs, params, wtoi) 174 | 175 | # create output h5 file 176 | N = len(imgs) 177 | f = h5py.File(params['output_h5'], "w") 178 | f.create_dataset("labels", dtype='uint32', data=L) 179 | f.create_dataset("label_start_ix", dtype='uint32', data=label_start_ix) 180 | f.create_dataset("label_end_ix", dtype='uint32', data=label_end_ix) 181 | f.create_dataset("label_length", dtype='uint32', data=label_length) 182 | dset = f.create_dataset("images", (N,3,256,256), dtype='uint8') # space for resized images 183 | for i,img in enumerate(imgs): 184 | # load the image 185 | I = imread(os.path.join(params['images_root'], img['file_path'])) 186 | try: 187 | Ir = imresize(I, (256,256)) 188 | except: 189 | print 'failed resizing image %s - see http://git.io/vBIE0' % (img['file_path'],) 190 | raise 191 | # handle grayscale input images 192 | if len(Ir.shape) == 2: 193 | Ir = Ir[:,:,np.newaxis] 194 | Ir = np.concatenate((Ir,Ir,Ir), axis=2) 195 | # and swap order of axes from (256,256,3) to (3,256,256) 196 | Ir = Ir.transpose(2,0,1) 197 | # write to h5 198 | dset[i] = Ir 199 | if i % 1000 == 0: 200 | print 'processing %d/%d (%.2f%% done)' % (i, N, i*100.0/N) 201 | f.close() 202 | print 'wrote ', params['output_h5'] 203 | 204 | # create output json file 205 | out = {} 206 | out['ix_to_word'] = itow # encode the (1-indexed) vocab 207 | out['images'] = [] 208 | for i,img in enumerate(imgs): 209 | 210 | jimg = {} 211 | jimg['split'] = img['split'] 212 | 
if 'file_path' in img: jimg['file_path'] = img['file_path'] # copy it over, might need 213 | if 'id' in img: jimg['id'] = img['id'] # copy over & mantain an id, if present (e.g. coco ids, useful) 214 | 215 | out['images'].append(jimg) 216 | 217 | json.dump(out, open(params['output_json'], 'w')) 218 | print 'wrote ', params['output_json'] 219 | 220 | if __name__ == "__main__": 221 | 222 | parser = argparse.ArgumentParser() 223 | 224 | # input json 225 | parser.add_argument('--input_json', required=True, help='input json file to process into hdf5') 226 | parser.add_argument('--num_val', required=True, type=int, help='number of images to assign to validation data (for CV etc)') 227 | parser.add_argument('--output_json', default='data.json', help='output json file') 228 | parser.add_argument('--output_h5', default='data.h5', help='output h5 file') 229 | 230 | # options 231 | parser.add_argument('--max_length', default=16, type=int, help='max length of a caption, in number of words. captions longer than this get clipped.') 232 | parser.add_argument('--images_root', default='', help='root location in which images are stored, to be prepended to file_path in input json') 233 | parser.add_argument('--word_count_threshold', default=5, type=int, help='only words that occur more than this number of times will be put in vocab') 234 | parser.add_argument('--num_test', default=0, type=int, help='number of test images (to withold until very very end)') 235 | 236 | args = parser.parse_args() 237 | params = vars(args) # convert to ordinary dict 238 | print 'parsed input parameters:' 239 | print json.dumps(params, indent = 2) 240 | main(params) -------------------------------------------------------------------------------- /test/test_model.py: -------------------------------------------------------------------------------- 1 | import tensorflow as tf 2 | import models 3 | import opts 4 | import numpy as np 5 | 6 | 7 | opt = opts.parse_opt() 8 | opt.batch_size = 2 9 | opt.seq_length = 5 10 | opt.seq_per_img = 2 11 | sess = tf.InteractiveSession() 12 | 13 | data = {} 14 | im1 = np.random.random([1,224,224,3]) 15 | data['images'] = np.vstack([im1, -im1]) 16 | data['labels'] = np.array([[0,1,2,3,4,0,0],[0,6,7,8,9,10,0],[0,1,2,3,4,0,0],[0,6,7,8,9,10,0]]) 17 | data['masks'] = np.array([[0,1,1,1,1,0,0],[0,1,1,1,1,1,0],[0,1,1,1,1,0,0],[0,1,1,1,1,1,0]]) 18 | 19 | opt.vocab_size = 10 20 | model = models.Model(opt) 21 | 22 | model.build_model() 23 | model.build_generator() 24 | tf.global_variables_initializer().run() 25 | sess.run(tf.assign(model.lr, 0.01)) 26 | feed = {model.images: data['images'], model.labels: data['labels'], model.masks: data['masks'], model.keep_prob: 1.0} 27 | train_loss, _ = sess.run([model.cost, model.train_op], feed) 28 | 29 | seq = sess.run(model.generator, feed) 30 | print(seq) -------------------------------------------------------------------------------- /test/test_simpleloader.py: -------------------------------------------------------------------------------- 1 | from simpleloader import * 2 | import tensorflow as tf 3 | 4 | import opts 5 | 6 | opt = opts.parse_opt() 7 | loader = DataLoader(opt) 8 | sess = tf.InteractiveSession() 9 | loader.assign_session(sess) 10 | 11 | count = 0 12 | start = time.time() 13 | while True: 14 | data = loader.get_batch(0) 15 | count += 1 16 | if data['bounds']['wrapped']: 17 | break 18 | end = time.time() 19 | print 'Time in total:', end-start 20 | print 'Total batch number:', count 21 | print 'Average time:', (end-start)/count 22 | 23 | 24 | 
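For a quick sanity check of what `prepro.py` writes, here is a minimal sketch that reads the hdf5/json pair back and decodes the captions of the first image. It assumes the default `data.json`/`data.h5` output names from the argparse defaults above, and relies on the 1-indexed, inclusive `label_start_ix`/`label_end_ix` pointers described in the docstring:

```python
import json
import h5py

# the vocab mapping has string keys after the json round-trip
info = json.load(open('data.json'))
ix_to_word = info['ix_to_word']

with h5py.File('data.h5', 'r') as f:
    start = int(f['label_start_ix'][0])     # first caption of image 0 (1-indexed)
    end = int(f['label_end_ix'][0])         # last caption of image 0 (inclusive)
    for row in f['labels'][start - 1:end]:  # convert to a 0-indexed row slice
        words = [ix_to_word[str(ix)] for ix in row if ix > 0]  # index 0 is padding
        print(' '.join(words))
```

This mirrors how `dataloader.py` is expected to slice the label array for each image.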
-------------------------------------------------------------------------------- /train.py: -------------------------------------------------------------------------------- 1 | from __future__ import absolute_import 2 | from __future__ import division 3 | from __future__ import print_function 4 | 5 | import numpy as np 6 | import tensorflow as tf 7 | 8 | import time 9 | import os 10 | from six.moves import cPickle 11 | 12 | import opts 13 | import models 14 | from dataloader import * 15 | import eval_utils 16 | import misc.utils as utils 17 | 18 | import os 19 | NUM_THREADS = 2 #int(os.environ['OMP_NUM_THREADS']) 20 | 21 | #from ipdb import set_trace 22 | 23 | def train(opt): 24 | loader = DataLoader(opt) 25 | opt.vocab_size = loader.vocab_size 26 | opt.seq_length = loader.seq_length 27 | model = models.setup(opt) 28 | 29 | infos = {} 30 | if opt.start_from is not None: 31 | # open old infos and check if models are compatible 32 | with open(os.path.join(opt.start_from, 'infos_'+opt.id+'.pkl')) as f: 33 | infos = cPickle.load(f) 34 | saved_model_opt = infos['opt'] 35 | need_be_same=["caption_model", "rnn_type", "rnn_size", "num_layers"] 36 | for checkme in need_be_same: 37 | assert vars(saved_model_opt)[checkme] == vars(opt)[checkme], "Command line argument and saved model disagree on '%s' " % checkme 38 | 39 | iteration = infos.get('iter', 0) 40 | epoch = infos.get('epoch', 0) 41 | val_result_history = infos.get('val_result_history', {}) 42 | loss_history = infos.get('loss_history', {}) 43 | 44 | loader.iterators = infos.get('iterators', loader.iterators) 45 | if opt.load_best_score == 1: 46 | best_val_score = infos.get('best_val_score', None) 47 | 48 | model.build_model() 49 | model.build_generator() 50 | model.build_decoder() 51 | 52 | tf_config = tf.ConfigProto() 53 | tf_config.intra_op_parallelism_threads=NUM_THREADS 54 | tf_config.gpu_options.allow_growth = True 55 | with tf.Session(config=tf_config) as sess: 56 | # Initialize the variables, and restore the variables form checkpoint if there is. 
57 | # and initialize the writer 58 | model.initialize(sess) 59 | 60 | # Assign the learning rate 61 | if epoch > opt.learning_rate_decay_start and opt.learning_rate_decay_start >= 0: 62 | frac = (epoch - opt.learning_rate_decay_start) / opt.learning_rate_decay_every 63 | decay_factor = 0.5 ** frac 64 | sess.run(tf.assign(model.lr, opt.learning_rate * decay_factor)) # set the decayed rate 65 | sess.run(tf.assign(model.cnn_lr, opt.cnn_learning_rate * decay_factor)) 66 | else: 67 | sess.run(tf.assign(model.lr, opt.learning_rate)) 68 | sess.run(tf.assign(model.cnn_lr, opt.cnn_learning_rate)) 69 | # Assure in training mode 70 | sess.run(tf.assign(model.training, True)) 71 | sess.run(tf.assign(model.cnn_training, True)) 72 | 73 | while True: 74 | start = time.time() 75 | # Load data from train split (0) 76 | data = loader.get_batch('train') 77 | print('Read data:', time.time() - start) 78 | 79 | start = time.time() 80 | feed = {model.images: data['images'], model.labels: data['labels'], model.masks: data['masks']} 81 | if iteration <= opt.finetune_cnn_after or opt.finetune_cnn_after == -1: 82 | train_loss, merged, _ = sess.run([model.cost, model.summaries, model.train_op], feed) 83 | else: 84 | # Finetune the cnn 85 | train_loss, merged, _, __ = sess.run([model.cost, model.summaries, model.train_op, model.cnn_train_op], feed) 86 | end = time.time() 87 | print("iter {} (epoch {}), train_loss = {:.3f}, time/batch = {:.3f}" \ 88 | .format(iteration, epoch, train_loss, end - start)) 89 | 90 | # Update the iteration and epoch 91 | iteration += 1 92 | if data['bounds']['wrapped']: 93 | epoch += 1 94 | 95 | # Write the training loss summary 96 | if (iteration % opt.losses_log_every == 0): 97 | model.summary_writer.add_summary(merged, iteration) 98 | model.summary_writer.flush() 99 | loss_history[iteration] = train_loss 100 | 101 | # make evaluation on validation set, and save model 102 | if (iteration % opt.save_checkpoint_every == 0): 103 | # eval model 104 | eval_kwargs = {'val_images_use': opt.val_images_use, 105 | 'split': 'val', 106 | 'language_eval': opt.language_eval, 107 | 'dataset': opt.input_json} 108 | val_loss, predictions, lang_stats = eval_split(sess, model, loader, eval_kwargs) 109 | 110 | # Write validation result into summary 111 | summary = tf.Summary(value=[tf.Summary.Value(tag='validation loss', simple_value=val_loss)]) 112 | model.summary_writer.add_summary(summary, iteration) 113 | for k,v in lang_stats.iteritems(): 114 | summary = tf.Summary(value=[tf.Summary.Value(tag=k, simple_value=v)]) 115 | model.summary_writer.add_summary(summary, iteration) 116 | model.summary_writer.flush() 117 | val_result_history[iteration] = {'loss': val_loss, 'lang_stats': lang_stats, 'predictions': predictions} 118 | 119 | # Save model if is improving on validation result 120 | if opt.language_eval == 1: 121 | current_score = lang_stats['CIDEr'] 122 | else: 123 | current_score = - val_loss 124 | 125 | if best_val_score is None or current_score > best_val_score: # if true 126 | best_val_score = current_score 127 | checkpoint_path = os.path.join(opt.checkpoint_path, 'model.ckpt') 128 | model.saver.save(sess, checkpoint_path, global_step = iteration) 129 | print("model saved to {}".format(checkpoint_path)) 130 | 131 | # Dump miscalleous informations 132 | infos['iter'] = iteration 133 | infos['epoch'] = epoch 134 | infos['iterators'] = loader.iterators 135 | infos['best_val_score'] = best_val_score 136 | infos['opt'] = opt 137 | infos['val_result_history'] = val_result_history 138 | 
infos['loss_history'] = loss_history 139 | infos['vocab'] = loader.get_vocab() 140 | with open(os.path.join(opt.checkpoint_path, 'infos_'+opt.id+'.pkl'), 'wb') as f: 141 | cPickle.dump(infos, f) 142 | 143 | # Stop if reaching max epochs 144 | if epoch >= opt.max_epochs and opt.max_epochs != -1: 145 | break 146 | 147 | def eval_split(sess, model, loader, eval_kwargs): 148 | verbose = eval_kwargs.get('verbose', True) 149 | val_images_use = eval_kwargs.get('val_images_use', -1) 150 | split = eval_kwargs.get('split', 'val') 151 | language_eval = eval_kwargs.get('language_eval', 1) 152 | dataset = eval_kwargs.get('dataset', 'coco') 153 | 154 | # Make sure we are in evaluation mode 155 | sess.run(tf.assign(model.training, False)) 156 | sess.run(tf.assign(model.cnn_training, False)) 157 | 158 | loader.reset_iterator(split) 159 | lang_stats = {} # stays empty when language_eval == 0, so the return value is always defined 160 | n = 0 161 | loss_sum = 0 162 | loss_evals = 0 163 | predictions = [] 164 | while True: 165 | if opt.beam_size > 1: # note: 'opt' is the module-level options object parsed at the bottom of this file 166 | data = loader.get_batch(split, 1) 167 | n = n + 1 168 | else: 169 | data = loader.get_batch(split) 170 | n = n + loader.batch_size 171 | 172 | # forward the model to get loss 173 | feed = {model.images: data['images'], model.labels: data['labels'], model.masks: data['masks']} 174 | loss = sess.run(model.cost, feed) 175 | 176 | loss_sum = loss_sum + loss 177 | loss_evals = loss_evals + 1 178 | 179 | if opt.beam_size == 1: 180 | # forward the model to also get generated samples for each image 181 | feed = {model.images: data['images']} 182 | #g_o,g_l,g_p, seq = sess.run([model.g_output, model.g_logits, model.g_probs, model.generator], feed) 183 | seq = sess.run(model.generator, feed) 184 | 185 | #set_trace() 186 | sents = utils.decode_sequence(loader.get_vocab(), seq) 187 | 188 | for k, sent in enumerate(sents): 189 | entry = {'image_id': data['infos'][k]['id'], 'caption': sent} 190 | predictions.append(entry) 191 | if verbose: 192 | print('image %s: %s' %(entry['image_id'], entry['caption'])) 193 | else: 194 | seq = model.decode(data['images'], opt.beam_size, sess) 195 | sents = [' '.join([loader.ix_to_word.get(str(ix), '') for ix in sent]).strip() for sent in seq] 196 | entry = {'image_id': data['infos'][0]['id'], 'caption': sents[0]} 197 | predictions.append(entry) 198 | if verbose: 199 | for sent in sents: 200 | print('image %s: %s' %(entry['image_id'], sent)) 201 | 202 | ix0 = data['bounds']['it_pos_now'] 203 | ix1 = data['bounds']['it_max'] 204 | if val_images_use != -1: 205 | ix1 = min(ix1, val_images_use) 206 | for i in range(n - ix1): 207 | predictions.pop() 208 | if verbose: 209 | print('evaluating validation performance... 
%d/%d (%f)' %(ix0 - 1, ix1, loss)) 210 | 211 | if data['bounds']['wrapped']: 212 | break 213 | if n>= val_images_use: 214 | break 215 | 216 | if language_eval == 1: 217 | lang_stats = eval_utils.language_eval(dataset, predictions) 218 | 219 | # Switch back to training mode 220 | sess.run(tf.assign(model.training, True)) 221 | sess.run(tf.assign(model.cnn_training, True)) 222 | return loss_sum/loss_evals, predictions, lang_stats 223 | 224 | opt = opts.parse_opt() 225 | train(opt) 226 | -------------------------------------------------------------------------------- /vgg.py: -------------------------------------------------------------------------------- 1 | import os 2 | 3 | import numpy as np 4 | import tensorflow as tf 5 | import time 6 | 7 | VGG_MEAN = [103.939, 116.779, 123.68] 8 | 9 | 10 | class Vgg16: 11 | def __init__(self, vgg16_npy_path=None): 12 | if vgg16_npy_path is None: 13 | self.data_dict = {} 14 | else: 15 | assert os.path.isfile(vgg16_npy_path), vgg16_npy_path + " doesn't exist." 16 | self.data_dict = np.load(vgg16_npy_path).item() 17 | print "npy file loaded" 18 | 19 | def build(self, rgb): 20 | """ 21 | load variable from npy to build the VGG 22 | 23 | :param rgb: rgb image [batch, height, width, 3] values scaled [0, 1] 24 | """ 25 | 26 | start_time = time.time() 27 | print "build model started" 28 | rgb_scaled = rgb * 255.0 29 | 30 | # Convert RGB to BGR 31 | red, green, blue = tf.split(axis=3, num_or_size_splits=3, value=rgb_scaled) 32 | assert red.get_shape().as_list()[1:] == [224, 224, 1] 33 | assert green.get_shape().as_list()[1:] == [224, 224, 1] 34 | assert blue.get_shape().as_list()[1:] == [224, 224, 1] 35 | bgr = tf.concat(axis=3, values=[ 36 | blue - VGG_MEAN[0], 37 | green - VGG_MEAN[1], 38 | red - VGG_MEAN[2], 39 | ]) 40 | assert bgr.get_shape().as_list()[1:] == [224, 224, 3] 41 | 42 | self.training = tf.Variable(True, trainable = False, name = "training") 43 | 44 | self.conv1_1 = self.conv_layer(bgr, "conv1_1") 45 | self.conv1_2 = self.conv_layer(self.conv1_1, "conv1_2") 46 | self.pool1 = self.max_pool(self.conv1_2, 'pool1') 47 | 48 | self.conv2_1 = self.conv_layer(self.pool1, "conv2_1") 49 | self.conv2_2 = self.conv_layer(self.conv2_1, "conv2_2") 50 | self.pool2 = self.max_pool(self.conv2_2, 'pool2') 51 | 52 | self.conv3_1 = self.conv_layer(self.pool2, "conv3_1") 53 | self.conv3_2 = self.conv_layer(self.conv3_1, "conv3_2") 54 | self.conv3_3 = self.conv_layer(self.conv3_2, "conv3_3") 55 | self.pool3 = self.max_pool(self.conv3_3, 'pool3') 56 | 57 | self.conv4_1 = self.conv_layer(self.pool3, "conv4_1") 58 | self.conv4_2 = self.conv_layer(self.conv4_1, "conv4_2") 59 | self.conv4_3 = self.conv_layer(self.conv4_2, "conv4_3") 60 | self.pool4 = self.max_pool(self.conv4_3, 'pool4') 61 | 62 | self.conv5_1 = self.conv_layer(self.pool4, "conv5_1") 63 | self.conv5_2 = self.conv_layer(self.conv5_1, "conv5_2") 64 | self.conv5_3 = self.conv_layer(self.conv5_2, "conv5_3") 65 | self.pool5 = self.max_pool(self.conv5_3, 'pool5') 66 | 67 | self.keep_prob = tf.cond(self.training, lambda : tf.constant(0.5), lambda : tf.constant(1.0), name = "keep_prob") 68 | 69 | self.fc6 = self.fc_layer(self.pool5, "fc6") 70 | assert self.fc6.get_shape().as_list()[1:] == [4096] 71 | self.relu6 = tf.nn.relu(self.fc6, name = "relu6") 72 | self.drop6 = tf.nn.dropout(self.relu6, self.keep_prob, name = "drop6") 73 | 74 | self.fc7 = self.fc_layer(self.drop6, "fc7") 75 | self.relu7 = tf.nn.relu(self.fc7, name = "relu7") 76 | self.drop7 = tf.nn.dropout(self.relu7, self.keep_prob, name = "drop7") 77 | 78 
| self.fc8 = self.fc_layer(self.drop7, "fc8") 79 | 80 | self.prob = tf.nn.softmax(self.fc8, name="prob") 81 | 82 | self.data_dict = None 83 | print "build model finished: %ds" % (time.time() - start_time) 84 | 85 | def avg_pool(self, bottom, name): 86 | return tf.nn.avg_pool(bottom, ksize=[1, 2, 2, 1], strides=[1, 2, 2, 1], padding='SAME', name=name) 87 | 88 | def max_pool(self, bottom, name): 89 | return tf.nn.max_pool(bottom, ksize=[1, 2, 2, 1], strides=[1, 2, 2, 1], padding='SAME', name=name) 90 | 91 | def conv_layer(self, bottom, name): 92 | with tf.variable_scope(name): 93 | filt = self.get_conv_filter(bottom, name) 94 | 95 | conv = tf.nn.conv2d(bottom, filt, [1, 1, 1, 1], padding='SAME') 96 | 97 | conv_biases = self.get_bias(bottom, name) 98 | bias = tf.nn.bias_add(conv, conv_biases) 99 | 100 | relu = tf.nn.relu(bias) 101 | return relu 102 | 103 | def fc_layer(self, bottom, name): 104 | with tf.variable_scope(name): 105 | shape = bottom.get_shape().as_list() 106 | dim = 1 107 | for d in shape[1:]: 108 | dim *= d 109 | x = tf.reshape(bottom, [-1, dim]) 110 | 111 | weights = self.get_fc_weight(x, name) 112 | biases = self.get_bias(x, name) 113 | 114 | # Fully connected layer. Note that the '+' operation automatically 115 | # broadcasts the biases. 116 | fc = tf.nn.bias_add(tf.matmul(x, weights), biases) 117 | 118 | return fc 119 | 120 | def get_n_out(self, name): 121 | if name[:4] == 'conv': 122 | n_out = 64 * (2 ** (min(int(name[4]),4) - 1)) 123 | else: 124 | if name[2] == '8': 125 | n_out = 1000 126 | else: 127 | n_out = 4096 128 | return n_out 129 | 130 | 131 | def get_conv_filter(self, bottom, name): 132 | if self.data_dict.get(name, None) is None: 133 | print 'No pretrained weight for', name, 'filter' 134 | n_in = bottom.get_shape()[-1].value 135 | n_out = self.get_n_out(name) 136 | print 'n_in', n_in, 'n_out', n_out 137 | return tf.get_variable("filter", 138 | shape=[3, 3, n_in, n_out], 139 | dtype=tf.float32, 140 | initializer=tf.contrib.layers.xavier_initializer_conv2d()) 141 | return tf.Variable(self.data_dict[name][0], name="filter") 142 | 143 | def get_bias(self, bottom, name): 144 | if self.data_dict.get(name, None) is None: 145 | print 'No pretrained weight for', name, 'biases' 146 | n_out = self.get_n_out(name) 147 | print 'n_out', n_out 148 | return tf.Variable(tf.constant(0.0, shape=[n_out], dtype=tf.float32), trainable=True, name='biases') 149 | return tf.Variable(self.data_dict[name][1], name="biases") 150 | 151 | def get_fc_weight(self, bottom, name): 152 | if self.data_dict.get(name, None) is None: 153 | print 'No pretrained weight for', name, 'weights' 154 | n_in = bottom.get_shape()[-1].value 155 | n_out = self.get_n_out(name) 156 | print 'n_in', n_in, 'n_out', n_out 157 | return tf.get_variable("weights", 158 | shape=[n_in, n_out], 159 | dtype=tf.float32, 160 | initializer=tf.contrib.layers.xavier_initializer()) 161 | return tf.Variable(self.data_dict[name][0], name="weights") 162 | 163 | 164 | class Vgg19: 165 | def __init__(self, vgg19_npy_path=None): 166 | if vgg19_npy_path is None: 167 | self.data_dict = {} 168 | else: 169 | assert os.path.isfile(vgg19_npy_path), vgg19_npy_path + " doesn't exist." 
170 | self.data_dict = np.load(vgg19_npy_path).item() 171 | print "npy file loaded" 172 | 173 | def build(self, rgb): 174 | """ 175 | load variable from npy to build the VGG 176 | :param rgb: rgb image [batch, height, width, 3] values scaled [0, 1] 177 | """ 178 | 179 | start_time = time.time() 180 | print("build model started") 181 | rgb_scaled = rgb * 255.0 182 | 183 | # Convert RGB to BGR 184 | red, green, blue = tf.split(axis=3, num_or_size_splits=3, value=rgb_scaled) 185 | assert red.get_shape().as_list()[1:] == [224, 224, 1] 186 | assert green.get_shape().as_list()[1:] == [224, 224, 1] 187 | assert blue.get_shape().as_list()[1:] == [224, 224, 1] 188 | bgr = tf.concat(axis=3, values=[ 189 | blue - VGG_MEAN[0], 190 | green - VGG_MEAN[1], 191 | red - VGG_MEAN[2], 192 | ]) 193 | assert bgr.get_shape().as_list()[1:] == [224, 224, 3] 194 | 195 | self.training = tf.Variable(True, trainable = False, name = "training") 196 | 197 | self.conv1_1 = self.conv_layer(bgr, "conv1_1") 198 | self.conv1_2 = self.conv_layer(self.conv1_1, "conv1_2") 199 | self.pool1 = self.max_pool(self.conv1_2, 'pool1') 200 | 201 | self.conv2_1 = self.conv_layer(self.pool1, "conv2_1") 202 | self.conv2_2 = self.conv_layer(self.conv2_1, "conv2_2") 203 | self.pool2 = self.max_pool(self.conv2_2, 'pool2') 204 | 205 | self.conv3_1 = self.conv_layer(self.pool2, "conv3_1") 206 | self.conv3_2 = self.conv_layer(self.conv3_1, "conv3_2") 207 | self.conv3_3 = self.conv_layer(self.conv3_2, "conv3_3") 208 | self.conv3_4 = self.conv_layer(self.conv3_3, "conv3_4") 209 | self.pool3 = self.max_pool(self.conv3_4, 'pool3') 210 | 211 | self.conv4_1 = self.conv_layer(self.pool3, "conv4_1") 212 | self.conv4_2 = self.conv_layer(self.conv4_1, "conv4_2") 213 | self.conv4_3 = self.conv_layer(self.conv4_2, "conv4_3") 214 | self.conv4_4 = self.conv_layer(self.conv4_3, "conv4_4") 215 | self.pool4 = self.max_pool(self.conv4_4, 'pool4') 216 | 217 | self.conv5_1 = self.conv_layer(self.pool4, "conv5_1") 218 | self.conv5_2 = self.conv_layer(self.conv5_1, "conv5_2") 219 | self.conv5_3 = self.conv_layer(self.conv5_2, "conv5_3") 220 | self.conv5_4 = self.conv_layer(self.conv5_3, "conv5_4") 221 | self.pool5 = self.max_pool(self.conv5_4, 'pool5') 222 | 223 | self.keep_prob = tf.cond(self.training, lambda : tf.constant(0.5), lambda : tf.constant(1.0), name = "keep_prob") 224 | 225 | self.fc6 = self.fc_layer(self.pool5, "fc6") 226 | assert self.fc6.get_shape().as_list()[1:] == [4096] 227 | self.relu6 = tf.nn.relu(self.fc6, name = "relu6") 228 | self.drop6 = tf.nn.dropout(self.relu6, self.keep_prob, name = "drop6") 229 | 230 | self.fc7 = self.fc_layer(self.drop6, "fc7") 231 | self.relu7 = tf.nn.relu(self.fc7, name = 'relu7') 232 | self.drop7 = tf.nn.dropout(self.relu7, self.keep_prob, name = "drop7") 233 | 234 | self.fc8 = self.fc_layer(self.drop7, "fc8") 235 | 236 | self.prob = tf.nn.softmax(self.fc8, name="prob") 237 | 238 | self.data_dict = None 239 | print("build model finished: %ds" % (time.time() - start_time)) 240 | 241 | def avg_pool(self, bottom, name): 242 | return tf.nn.avg_pool(bottom, ksize=[1, 2, 2, 1], strides=[1, 2, 2, 1], padding='SAME', name=name) 243 | 244 | def max_pool(self, bottom, name): 245 | return tf.nn.max_pool(bottom, ksize=[1, 2, 2, 1], strides=[1, 2, 2, 1], padding='SAME', name=name) 246 | 247 | def conv_layer(self, bottom, name): 248 | with tf.variable_scope(name): 249 | filt = self.get_conv_filter(bottom, name) 250 | 251 | conv = tf.nn.conv2d(bottom, filt, [1, 1, 1, 1], padding='SAME') 252 | 253 | conv_biases = 
self.get_bias(bottom, name) 254 | bias = tf.nn.bias_add(conv, conv_biases) 255 | 256 | relu = tf.nn.relu(bias) 257 | return relu 258 | 259 | def fc_layer(self, bottom, name): 260 | with tf.variable_scope(name): 261 | shape = bottom.get_shape().as_list() 262 | dim = 1 263 | for d in shape[1:]: 264 | dim *= d 265 | x = tf.reshape(bottom, [-1, dim]) 266 | 267 | weights = self.get_fc_weight(x, name) 268 | biases = self.get_bias(x, name) 269 | 270 | # Fully connected layer. Note that the '+' operation automatically 271 | # broadcasts the biases. 272 | fc = tf.nn.bias_add(tf.matmul(x, weights), biases) 273 | 274 | return fc 275 | 276 | def get_n_out(self, name): 277 | if name[:4] == 'conv': 278 | n_out = 64 * (2 ** (min(int(name[4]),4) - 1)) 279 | else: 280 | if name[2] == '8': 281 | n_out = 1000 282 | else: 283 | n_out = 4096 284 | return n_out 285 | 286 | def get_conv_filter(self, bottom, name): 287 | if self.data_dict.get(name, None) is None: 288 | print 'No pretrained weight for', name, 'filter' 289 | n_in = bottom.get_shape()[-1].value 290 | n_out = self.get_n_out(name) 291 | print 'n_in', n_in, 'n_out', n_out 292 | return tf.get_variable("filter", 293 | shape=[3, 3, n_in, n_out], 294 | dtype=tf.float32, 295 | initializer=tf.contrib.layers.xavier_initializer_conv2d()) 296 | return tf.Variable(self.data_dict[name][0], name="filter") 297 | 298 | def get_bias(self, bottom, name): 299 | if self.data_dict.get(name, None) is None: 300 | print 'No pretrained weight for', name, 'biases' 301 | n_out = self.get_n_out(name) 302 | print 'n_out', n_out 303 | return tf.Variable(tf.constant(0.0, shape=[n_out], dtype=tf.float32), trainable=True, name='biases') 304 | return tf.Variable(self.data_dict[name][1], name="biases") 305 | 306 | def get_fc_weight(self, bottom, name): 307 | if self.data_dict.get(name, None) is None: 308 | print 'No pretrained weight for', name, 'weights' 309 | n_in = bottom.get_shape()[-1].value 310 | n_out = self.get_n_out(name) 311 | print 'n_in', n_in, 'n_out', n_out 312 | return tf.get_variable("weights", 313 | shape=[n_in, n_out], 314 | dtype=tf.float32, 315 | initializer=tf.contrib.layers.xavier_initializer()) 316 | return tf.Variable(self.data_dict[name][0], name="weights") 317 | -------------------------------------------------------------------------------- /vis/index.html: -------------------------------------------------------------------------------- 1 | 2 | 3 | 4 | 5 | 6 | neuraltalk2 results visualization 7 | 8 | 42 | 43 | 44 |
45 | 72 | 73 | 74 | --------------------------------------------------------------------------------