├── .gitignore ├── README.md ├── coco └── coco_preprocess.ipynb ├── dataloader.py ├── dataloaderraw.py ├── eval.py ├── eval_utils.py ├── misc ├── AttentionModel.py ├── ShowAttendTellModel.py ├── ShowAttendTellModel_old.py ├── ShowTellModel.py ├── __init__.py └── utils.py ├── models.py ├── opts.py ├── prepro.py ├── test ├── test_model.py └── test_simpleloader.py ├── train.py ├── vgg.py └── vis ├── index.html └── jquery-1.8.3.min.js /.gitignore: -------------------------------------------------------------------------------- 1 | *.pyc 2 | models 3 | 4 | -------------------------------------------------------------------------------- /README.md: -------------------------------------------------------------------------------- 1 | # Neuraltalk2-tensorflow 2 | This is a toy project for myself to start learning tensorflow. 3 | 4 | I started to learn torch by learning from neuraltalk2, so I am starting my tensorflow work with this project too. 5 | 6 | I think this project is good for those who are familiar with neuraltalk2 in torch, because the main pipeline is almost the same. I don't know if it's a good tutorial for learning tensorflow, because the comments are still limited so far. 7 | 8 | Without finetuning VGG, my code gives a CIDEr score of ~0.65 on the validation set (in 50,000 iterations). 9 | 10 | Currently, if you want to use my code, you need to train the model from scratch (except for VGG-16). 11 | 12 | # TODO: 13 | - ~~Finetuning VGG doesn't seem to work. Needs to be fixed.~~ 14 | - ~~No need to initialize from npy when a saved weight exists.~~ 15 | - Tensorflow-style file loading. (Multi-threaded image loading) 16 | - ~~Test of stacked LSTMs, and also GRUs~~ 17 | - Pretrained model 18 | - ~~Test code on a single image~~ 19 | - Scheduled sampling 20 | - ~~sample_max~~ 21 | - ~~eval on unseen images~~ 22 | - eval on test 23 | - visualize attention map 24 | 25 | # Requirements 26 | Python 2.7 27 | 28 | [Tensorflow 1.0](https://github.com/tensorflow/tensorflow); please follow the instructions on the tensorflow website to install it. 29 | 30 | # Train your own network on COCO 31 | **(Copy from neuraltalk2)** 32 | 33 | Great, first we need to do some preprocessing. Head over to the `coco/` folder and run the IPython notebook to download the dataset and do some very simple preprocessing. The notebook will combine the train/val data together and create a very simple and small json file that contains a large list of image paths and raw captions for each image, of the form: 34 | 35 | ``` 36 | [{ "file_path": "path/img.jpg", "captions": ["a caption", "a second caption of this image", ...] }, ...] 37 | ``` 38 | 39 | Once we have this, we're ready to invoke the `prepro.py` script, which will read all of this in and create a dataset (an hdf5 file and a json file) ready for consumption by the training code. For example, for MS COCO we can run the prepro script as follows: 40 | 41 | ```bash 42 | $ python prepro.py --input_json coco/coco_raw.json --num_val 5000 --num_test 5000 --images_root coco/images --word_count_threshold 5 --output_json coco/cocotalk.json --output_h5 coco/cocotalk.h5 43 | ``` 44 | 45 | This tells the script to read in all the data (the images and the captions), allocate 5000 images each for the val and test splits, and map all words that occur <= 5 times to a special `UNK` token. The resulting `json` and `h5` files are about 30GB and contain everything we want to know about the dataset. 46 |
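If you want to sanity-check the intermediate `coco_raw.json` before running `prepro.py`, a few lines of Python are enough. This is a minimal sketch written against the format shown above; the script name is just an example, and the path assumes you kept the notebook's default output location:

```python
# check_coco_raw.py -- sanity-check the raw json produced by the notebook
import json

with open('coco/coco_raw.json') as f:
    imgs = json.load(f)

print('number of images: %d' % len(imgs))
print('example entry: %s' % imgs[0])

# every entry should carry an image path and at least one caption
for img in imgs:
    assert 'file_path' in img and 'captions' in img, 'malformed entry: %s' % img
    assert len(img['captions']) > 0, 'no captions for %s' % img['file_path']
```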
47 | **Warning**: the prepro script will fail with the default MSCOCO data because one of their images is corrupted. See [this issue](https://github.com/karpathy/neuraltalk2/issues/4) for the fix; it involves manually replacing one image in the dataset. 48 | 49 | **(Copy end.)** 50 | 51 | Note that the split used here cannot be used for research. You can email me to ask for the preprocessing code for the COCO "standard" split, or you can modify the code yourself if you are familiar with it. 52 | 53 | ~~Download or generate a tensorflow version of the pretrained vgg-16: [tensorflow-vgg16](https://github.com/ry/tensorflow-vgg16).~~ 54 | 55 | I borrowed [machrisaa/tensorflow-vgg](https://github.com/machrisaa/tensorflow-vgg) and made some modifications: 56 | - Added a variable `training` to switch between the training and evaluation modes of the model (in principle it controls the dropout probability). 57 | - Defined all the weights and biases as Variables (previously constants). 58 | 59 | You need to download the npy file of VGG: [vgg16](https://dl.dropboxusercontent.com/u/50333326/vgg16.npy) or [vgg19](https://dl.dropboxusercontent.com/u/50333326/vgg19.npy). Put the file somewhere (e.g. a `models` directory), and we're ready to train! 60 | 61 | ```bash 62 | $ python train.py --input_json coco/cocotalk.json --input_h5 coco/cocotalk.h5 --checkpoint_path ./log --save_checkpoint_every 2000 --val_images_use 3200 63 | ``` 64 | 65 | The train script will take over, and start dumping checkpoints into the folder specified by `checkpoint_path` (default = current folder). For more options, see `opts.py`. 66 | 67 | If you'd like to evaluate BLEU/METEOR/CIDEr scores during training in addition to the validation cross entropy loss, use the `--language_eval 1` option, but don't forget to download the [coco-caption code](https://github.com/tylin/coco-caption) into a `coco-caption` directory. 68 | 69 | **A few notes on training.** To give you an idea, with the default settings one epoch of MS COCO images is about 7500 iterations. One epoch of training (with no finetuning - notice this is the default) takes about 45 minutes and results in a validation loss of ~2.7 and a CIDEr score of ~0.5. By iteration 50,000 CIDEr climbs up to about 0.65 (validation loss at about 2.4). 70 | 71 | ### Caption images after training 72 | 73 | Now place all your images of interest into a folder, e.g. `blah`, and run 74 | the eval script: 75 | 76 | ```bash 77 | $ python eval.py --model model.ckpt-**** --infos_path infos_.pkl --image_folder blah --num_images 10 78 | ``` 79 | 80 | This tells the `eval` script to run up to 10 images from the given folder. If you have a big GPU you can speed up the evaluation by increasing `batch_size` (default = 1). Use `--num_images -1` to process all images. The eval script will create a `vis.json` file inside the `vis` folder, which can then be visualized with the provided HTML interface: 81 | 82 | ```bash 83 | $ cd vis 84 | $ python -m SimpleHTTPServer 85 | ``` 86 | 87 | Now visit `localhost:8000` in your browser and you should see your predicted captions. 88 | 89 | **Beam Search**. Beam search is enabled by default because it improves the quality of the argmax decoding sequence. However, it is a little more expensive, so if you'd like to evaluate images faster, at a small cost in performance, use `--beam_size 1`. ~~For example, in one of my experiments beam size 2 gives CIDEr 0.922, and beam size 1 gives CIDEr 0.886.~~ 90 |
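**Reading the dump programmatically**. If you'd rather consume the predictions without the HTML interface, the `vis/vis.json` file written by `eval.py` is simply a list of `{"image_id": ..., "caption": ...}` records, so a few lines of Python are enough (a minimal sketch; the script name is just an example):

```python
# read_predictions.py -- print the captions dumped by eval.py into vis/vis.json
import json

with open('vis/vis.json') as f:
    predictions = json.load(f)

for p in predictions:
    print('image %s: %s' % (p['image_id'], p['caption']))
```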
91 | **Running on MSCOCO images**. If you train on MSCOCO (see how below), you will have generated preprocessed MSCOCO images, which you can use directly in the eval script. In this case simply leave out the `image_folder` option and instead pass in the `input_h5` and `input_json` paths to your preprocessed files. 92 | 93 | # Acknowledgements 94 | I learned a lot from the following repositories. 95 | 96 | - [neuraltalk2](https://github.com/karpathy/neuraltalk2) (of course) 97 | - [colornet](https://github.com/pavelgonchar/colornet) (for using pretrained vgg-16) 98 | - [tensorflow-vgg16](https://github.com/ry/tensorflow-vgg16.git) (tensorflow version of vgg-16) 99 | - [machrisaa/tensorflow-vgg](https://github.com/machrisaa/tensorflow-vgg) (for better loading of vgg-16, but still not perfect) 100 | - [huyng/tensorflow-vgg](https://github.com/huyng/tensorflow-vgg) (this may be my next attempt) 101 | - [char-rnn-tensorflow](https://github.com/sherjilozair/char-rnn-tensorflow) (for using the RNN wrapper provided by tensorflow) 102 | - [show_and_tell.tensorflow](https://github.com/jazzsaxmafia/show_and_tell.tensorflow) (gave me the idea of how to dump option information; furthermore, it implements the same algorithm as mine but with a different code structure) 103 | - [TF-mRNN](https://github.com/mjhucla/TF-mRNN) (I borrowed the beam search code; this is also a very good caption generation model) 104 | -------------------------------------------------------------------------------- /coco/coco_preprocess.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "metadata": {}, 6 | "source": [ 7 | "# COCO data preprocessing\n", 8 | "\n", 9 | "This code will download the caption annotations for coco and preprocess them into an hdf5 file and a json file. \n", 10 | "\n", 11 | "These will then be read by the COCO data loader and trained on." 12 | ] 13 | }, 14 | { 15 | "cell_type": "code", 16 | "execution_count": 2, 17 | "metadata": { 18 | "collapsed": false 19 | }, 20 | "outputs": [ 21 | { 22 | "data": { 23 | "text/plain": [ 24 | "0" 25 | ] 26 | }, 27 | "execution_count": 2, 28 | "metadata": {}, 29 | "output_type": "execute_result" 30 | } 31 | ], 32 | "source": [ 33 | "# lets download the annotations from http://mscoco.org/dataset/#download\n", 34 | "import os\n", 35 | "os.system('wget http://msvocds.blob.core.windows.net/annotations-1-0-3/captions_train-val2014.zip') # ~19MB" 36 | ] 37 | }, 38 | { 39 | "cell_type": "code", 40 | "execution_count": 3, 41 | "metadata": { 42 | "collapsed": false 43 | }, 44 | "outputs": [ 45 | { 46 | "data": { 47 | "text/plain": [ 48 | "0" 49 | ] 50 | }, 51 | "execution_count": 3, 52 | "metadata": {}, 53 | "output_type": "execute_result" 54 | } 55 | ], 56 | "source": [ 57 | "os.system('unzip captions_train-val2014.zip')" 58 | ] 59 | }, 60 | { 61 | "cell_type": "code", 62 | "execution_count": 2, 63 | "metadata": { 64 | "collapsed": false 65 | }, 66 | "outputs": [], 67 | "source": [ 68 | "import json\n", 69 | "val = json.load(open('annotations/captions_val2014.json', 'r'))\n", 70 | "train = json.load(open('annotations/captions_train2014.json', 'r'))" 71 | ] 72 | }, 73 | { 74 | "cell_type": "code", 75 | "execution_count": 3, 76 | "metadata": { 77 | "collapsed": false 78 | }, 79 | "outputs": [ 80 | { 81 | "name": "stdout", 82 | "output_type": "stream", 83 | "text": [ 84 | "[u'info', u'images', u'licenses', u'annotations']\n", 85 | "{u'description': u'This is stable 1.0 version of the 2014 MS COCO dataset.', u'url': u'http://mscoco.org', u'version': u'1.0', u'year': 2014, u'contributor': u'Microsoft COCO group', u'date_created': u'2015-01-27 09:11:52.357475'}\n",
86 | "40504\n", 87 | "202654\n", 88 | "{u'license': 3, u'file_name': u'COCO_val2014_000000391895.jpg', u'coco_url': u'http://mscoco.org/images/391895', u'height': 360, u'width': 640, u'date_captured': u'2013-11-14 11:18:45', u'flickr_url': u'http://farm9.staticflickr.com/8186/8119368305_4e622c8349_z.jpg', u'id': 391895}\n", 89 | "{u'image_id': 203564, u'id': 37, u'caption': u'A bicycle replica with a clock as the front wheel.'}\n" 90 | ] 91 | } 92 | ], 93 | "source": [ 94 | "print val.keys()\n", 95 | "print val['info']\n", 96 | "print len(val['images'])\n", 97 | "print len(val['annotations'])\n", 98 | "print val['images'][0]\n", 99 | "print val['annotations'][0]" 100 | ] 101 | }, 102 | { 103 | "cell_type": "code", 104 | "execution_count": 6, 105 | "metadata": { 106 | "collapsed": false 107 | }, 108 | "outputs": [], 109 | "source": [ 110 | "import json\n", 111 | "import os\n", 112 | "\n", 113 | "# combine all images and annotations together\n", 114 | "imgs = val['images'] + train['images']\n", 115 | "annots = val['annotations'] + train['annotations']\n", 116 | "\n", 117 | "# for efficiency lets group annotations by image\n", 118 | "itoa = {}\n", 119 | "for a in annots:\n", 120 | " imgid = a['image_id']\n", 121 | " if not imgid in itoa: itoa[imgid] = []\n", 122 | " itoa[imgid].append(a)\n", 123 | "\n", 124 | "# create the json blob\n", 125 | "out = []\n", 126 | "for i,img in enumerate(imgs):\n", 127 | " imgid = img['id']\n", 128 | " \n", 129 | " # coco specific here, they store train/val images separately\n", 130 | " loc = 'train2014' if 'train' in img['file_name'] else 'val2014'\n", 131 | " \n", 132 | " jimg = {}\n", 133 | " jimg['file_path'] = os.path.join(loc, img['file_name'])\n", 134 | " jimg['id'] = imgid\n", 135 | " \n", 136 | " sents = []\n", 137 | " annotsi = itoa[imgid]\n", 138 | " for a in annotsi:\n", 139 | " sents.append(a['caption'])\n", 140 | " jimg['captions'] = sents\n", 141 | " out.append(jimg)\n", 142 | " \n", 143 | "json.dump(out, open('coco_raw.json', 'w'))" 144 | ] 145 | }, 146 | { 147 | "cell_type": "code", 148 | "execution_count": 7, 149 | "metadata": { 150 | "collapsed": false 151 | }, 152 | "outputs": [ 153 | { 154 | "name": "stdout", 155 | "output_type": "stream", 156 | "text": [ 157 | "{'captions': [u'A man with a red helmet on a small moped on a dirt road. ', u'Man riding a motor bike on a dirt road on the countryside.', u'A man riding on the back of a motorcycle.', u'A dirt path with a young person on a motor bike rests to the foreground of a verdant area with a bridge and a background of cloud-wreathed mountains. 
', u'A man in a red shirt and a red hat is on a motorcycle on a hill side.'], 'file_path': u'val2014/COCO_val2014_000000391895.jpg', 'id': 391895}\n" 158 | ] 159 | } 160 | ], 161 | "source": [ 162 | "# lets see what they look like\n", 163 | "print out[0]" 164 | ] 165 | } 166 | ], 167 | "metadata": { 168 | "kernelspec": { 169 | "display_name": "Python 2", 170 | "language": "python", 171 | "name": "python2" 172 | }, 173 | "language_info": { 174 | "codemirror_mode": { 175 | "name": "ipython", 176 | "version": 2 177 | }, 178 | "file_extension": ".py", 179 | "mimetype": "text/x-python", 180 | "name": "python", 181 | "nbconvert_exporter": "python", 182 | "pygments_lexer": "ipython2", 183 | "version": "2.7.6" 184 | } 185 | }, 186 | "nbformat": 4, 187 | "nbformat_minor": 0 188 | } 189 | -------------------------------------------------------------------------------- /dataloader.py: -------------------------------------------------------------------------------- 1 | from __future__ import absolute_import 2 | from __future__ import division 3 | from __future__ import print_function 4 | 5 | import json 6 | import h5py 7 | import os 8 | import tensorflow as tf 9 | import numpy as np 10 | import random 11 | import skimage 12 | import skimage.io 13 | import scipy.misc 14 | 15 | class DataLoader(): 16 | 17 | def __init__(self, opt): 18 | self.opt = opt 19 | self.batch_size = self.opt.batch_size 20 | self.seq_per_img = self.opt.seq_per_img 21 | 22 | # load the json file which contains additional information about the dataset 23 | print('DataLoader loading json file: ', opt.input_json) 24 | self.info = json.load(open(self.opt.input_json)) 25 | self.ix_to_word = self.info['ix_to_word'] 26 | self.vocab_size = len(self.ix_to_word) 27 | print('vocab size is ', self.vocab_size) 28 | 29 | # open the hdf5 file 30 | print('DataLoader loading h5 file: ', opt.input_h5) 31 | self.h5_file = h5py.File(self.opt.input_h5) 32 | 33 | 34 | # extract image size from dataset 35 | images_size = self.h5_file['images'].shape 36 | assert len(images_size) == 4, 'images should be a 4D tensor' 37 | assert images_size[2] == images_size[3], 'width and height must match' 38 | self.num_images = images_size[0] 39 | self.num_channels = images_size[1] 40 | self.max_image_size = images_size[2] 41 | print('read %d images of size %dx%dx%d' %(self.num_images, 42 | self.num_channels, self.max_image_size, self.max_image_size)) 43 | 44 | # load in the sequence data 45 | seq_size = self.h5_file['labels'].shape 46 | self.seq_length = seq_size[1] 47 | print('max sequence length in data is', self.seq_length) 48 | # load the pointers in full to RAM (should be small enough) 49 | self.label_start_ix = self.h5_file['label_start_ix'][:] 50 | self.label_end_ix = self.h5_file['label_end_ix'][:] 51 | 52 | # separate out indexes for each of the provided splits 53 | self.split_ix = {'train': [], 'val': [], 'test': []} 54 | for ix in range(len(self.info['images'])): 55 | img = self.info['images'][ix] 56 | if img['split'] == 'train': 57 | self.split_ix['train'].append(ix) 58 | elif img['split'] == 'val': 59 | self.split_ix['val'].append(ix) 60 | elif img['split'] == 'test': 61 | self.split_ix['test'].append(ix) 62 | elif opt.train_only == 0: # restval 63 | self.split_ix['train'].append(ix) 64 | 65 | print('assigned %d images to split train' %len(self.split_ix['train'])) 66 | print('assigned %d images to split val' %len(self.split_ix['val'])) 67 | print('assigned %d images to split test' %len(self.split_ix['test'])) 68 | 69 | self.iterators = {'train': 0, 'val': 
0, 'test': 0} 70 | 71 | def get_vocab_size(self): 72 | return self.vocab_size 73 | 74 | def get_vocab(self): 75 | return self.ix_to_word 76 | 77 | def get_seq_length(self): 78 | return self.seq_length 79 | 80 | def get_batch(self, split, batch_size=None): 81 | split_ix = self.split_ix[split] 82 | batch_size = batch_size or self.batch_size 83 | 84 | img_batch = np.ndarray([batch_size, 224,224,3], dtype = 'float32') 85 | label_batch = np.zeros([batch_size * self.seq_per_img, self.seq_length + 2], dtype = 'int') 86 | mask_batch = np.zeros([batch_size * self.seq_per_img, self.seq_length + 2], dtype = 'float32') 87 | 88 | max_index = len(split_ix) 89 | wrapped = False 90 | 91 | infos = [] 92 | 93 | for i in range(batch_size): 94 | ri = self.iterators[split] 95 | ri_next = ri + 1 96 | if ri_next >= max_index: 97 | ri_next = 0 98 | wrapped = True 99 | self.iterators[split] = ri_next 100 | ix = split_ix[ri] 101 | 102 | # fetch image 103 | #img = self.load_image(self.image_info[ix]['filename']) 104 | img = self.h5_file['images'][ix, :, :, :].transpose(1, 2, 0) 105 | img_batch[i] = img[16:240, 16:240, :].astype('float32')/255.0 106 | 107 | # fetch the sequence labels 108 | ix1 = self.label_start_ix[ix] - 1 #label_start_ix starts from 1 109 | ix2 = self.label_end_ix[ix] - 1 110 | ncap = ix2 - ix1 + 1 # number of captions available for this image 111 | assert ncap > 0, 'an image does not have any label. this can be handled but right now isn\'t' 112 | 113 | if ncap < self.seq_per_img: 114 | # we need to subsample (with replacement) 115 | seq = np.zeros([self.seq_per_img, self.seq_length], dtype = 'int') 116 | for q in range(self.seq_per_img): 117 | ixl = random.randint(ix1,ix2) 118 | seq[q, :] = self.h5_file['labels'][ixl, :self.seq_length] 119 | else: 120 | ixl = random.randint(ix1, ix2 - self.seq_per_img + 1) 121 | seq = self.h5_file['labels'][ixl: ixl + self.seq_per_img, :self.seq_length] 122 | 123 | label_batch[i * self.seq_per_img : (i + 1) * self.seq_per_img, 1 : self.seq_length + 1] = seq 124 | 125 | # record associated info as well 126 | info_dict = {} 127 | info_dict['id'] = self.info['images'][ix]['id'] 128 | info_dict['file_path'] = self.info['images'][ix]['file_path'] 129 | infos.append(info_dict) 130 | 131 | # generate mask 132 | nonzeros = np.array(map(lambda x: (x != 0).sum()+2, label_batch)) 133 | for ix, row in enumerate(mask_batch): 134 | row[:nonzeros[ix]] = 1 135 | 136 | data = {} 137 | data['images'] = img_batch 138 | data['labels'] = label_batch 139 | data['masks'] = mask_batch 140 | data['bounds'] = {'it_pos_now': self.iterators[split], 'it_max': len(split_ix), 'wrapped': wrapped} 141 | data['infos'] = infos 142 | 143 | return data 144 | 145 | def reset_iterator(self, split): 146 | self.iterators[split] = 0 147 | -------------------------------------------------------------------------------- /dataloaderraw.py: -------------------------------------------------------------------------------- 1 | from __future__ import absolute_import 2 | from __future__ import division 3 | from __future__ import print_function 4 | 5 | import json 6 | import h5py 7 | import os 8 | import tensorflow as tf 9 | import numpy as np 10 | import random 11 | import skimage 12 | import skimage.io 13 | import scipy.misc 14 | 15 | class DataLoaderRaw(): 16 | 17 | def __init__(self, opt): 18 | self.opt = opt 19 | self.coco_json = opt.get('coco_json', '') 20 | self.folder_path = opt.get('folder_path', '') 21 | 22 | self.batch_size = opt.get('batch_size', 1) 23 | 24 | # load the json file which contains 
additional information about the dataset 25 | print('DataLoaderRaw loading images from folder: ', self.folder_path) 26 | 27 | self.files = [] 28 | self.ids = [] 29 | 30 | print(len(self.coco_json)) 31 | if len(self.coco_json) > 0: 32 | print('reading from ' + opt.coco_json) 33 | # read in filenames from the coco-style json file 34 | self.coco_annotation = json.load(open(self.coco_json)) 35 | for k,v in enumerate(self.coco_annotation['images']): 36 | fullpath = os.path.join(self.folder_path, v['file_name']) 37 | self.files.append(fullpath) 38 | self.ids.append(v['id']) 39 | else: 40 | # read in all the filenames from the folder 41 | print('listing all images in directory ' + self.folder_path) 42 | def isImage(f): 43 | supportedExt = ['.jpg','.JPG','.jpeg','.JPEG','.png','.PNG','.ppm','.PPM'] 44 | for ext in supportedExt: 45 | start_idx = f.rfind(ext) 46 | if start_idx >= 0 and start_idx + len(ext) == len(f): 47 | return True 48 | return False 49 | 50 | n = 1 51 | for root, dirs, files in os.walk(self.folder_path, topdown=False): 52 | for file in files: 53 | fullpath = os.path.join(self.folder_path, file) 54 | if isImage(fullpath): 55 | self.files.append(fullpath) 56 | self.ids.append(str(n)) # just order them sequentially 57 | n = n + 1 58 | 59 | self.N = len(self.files) 60 | print('DataLoaderRaw found ', self.N, ' images') 61 | 62 | self.iterator = 0 63 | 64 | def get_batch(self, split, batch_size=None): 65 | batch_size = batch_size or self.batch_size 66 | 67 | # pick an index of the datapoint to load next 68 | img_batch = np.ndarray([batch_size, 224,224,3], dtype = 'float32') 69 | max_index = self.N 70 | wrapped = False 71 | infos = [] 72 | 73 | for i in range(batch_size): 74 | ri = self.iterator 75 | ri_next = ri + 1 76 | if ri_next >= max_index: 77 | ri_next = 0 78 | wrapped = True 79 | # wrap back around 80 | self.iterator = ri_next 81 | 82 | img = skimage.io.imread(self.files[ri]) 83 | 84 | if len(img.shape) == 2: 85 | img = img[:,:,np.newaxis] 86 | img = img.concatenate((img, img, img), axis=2) 87 | 88 | img_batch[i] = img[16:240, 16:240, :].astype('float32')/255.0 89 | 90 | info_struct = {} 91 | info_struct['id'] = self.ids[ri] 92 | info_struct['file_path'] = self.files[ri] 93 | infos.append(info_struct) 94 | 95 | data = {} 96 | data['images'] = img_batch 97 | data['bounds'] = {'it_pos_now': self.iterator, 'it_max': self.N, 'wrapped': wrapped} 98 | data['infos'] = infos 99 | 100 | return data 101 | 102 | def reset_iterator(self, split): 103 | self.iterator = 0 104 | -------------------------------------------------------------------------------- /eval.py: -------------------------------------------------------------------------------- 1 | from __future__ import absolute_import 2 | from __future__ import division 3 | from __future__ import print_function 4 | 5 | import json 6 | import numpy as np 7 | import tensorflow as tf 8 | 9 | import time 10 | import os 11 | from six.moves import cPickle 12 | 13 | import opts 14 | import models 15 | from dataloader import * 16 | from dataloaderraw import * 17 | import eval_utils 18 | import argparse 19 | import misc.utils as utils 20 | 21 | NUM_THREADS = 2 #int(os.environ['OMP_NUM_THREADS']) 22 | 23 | # Input arguments and options 24 | parser = argparse.ArgumentParser() 25 | # Input paths 26 | parser.add_argument('--model', type=str, default='', 27 | help='path to model to evaluate') 28 | parser.add_argument('--infos_path', type=str, default='', 29 | help='path to infos to evaluate') 30 | # Basic options 31 | parser.add_argument('--batch_size', 
type=int, default=0, 32 | help='if > 0 then overrule, otherwise load from checkpoint.') 33 | parser.add_argument('--num_images', type=int, default=-1, 34 | help='how many images to use when periodically evaluating the loss? (-1 = all)') 35 | parser.add_argument('--language_eval', type=int, default=0, 36 | help='Evaluate language as well (1 = yes, 0 = no)? BLEU/CIDEr/METEOR/ROUGE_L? requires coco-caption code from Github.') 37 | parser.add_argument('--dump_images', type=int, default=1, 38 | help='Dump images into vis/imgs folder for vis? (1=yes,0=no)') 39 | parser.add_argument('--dump_json', type=int, default=1, 40 | help='Dump json with predictions into vis folder? (1=yes,0=no)') 41 | parser.add_argument('--dump_path', type=int, default=0, 42 | help='Write image paths along with predictions into vis json? (1=yes,0=no)') 43 | 44 | # Sampling options 45 | parser.add_argument('--sample_max', type=int, default=1, 46 | help='1 = sample argmax words. 0 = sample from distributions.') 47 | parser.add_argument('--beam_size', type=int, default=2, 48 | help='used when sample_max = 1, indicates number of beams in beam search. Usually 2 or 3 works well. More is not better. Set this to 1 for faster runtime but a bit worse performance.') 49 | parser.add_argument('--temperature', type=float, default=1.0, 50 | help='temperature when sampling from distributions (i.e. when sample_max = 0). Lower = "safer" predictions.') 51 | # For evaluation on a folder of images: 52 | parser.add_argument('--image_folder', type=str, default='', 53 | help='If this is nonempty then will predict on the images in this folder path') 54 | parser.add_argument('--image_root', type=str, default='', 55 | help='In case the image paths have to be preprended with a root path to an image folder') 56 | # For evaluation on MSCOCO images from some split: 57 | parser.add_argument('--input_h5', type=str, default='', 58 | help='path to the h5file containing the preprocessed dataset. empty = fetch from model checkpoint.') 59 | parser.add_argument('--input_json', type=str, default='', 60 | help='path to the json file containing additional info and vocab. empty = fetch from model checkpoint.') 61 | parser.add_argument('--split', type=str, default='test', 62 | help='if running on MSCOCO images, which split to use: val|test|train') 63 | parser.add_argument('--coco_json', type=str, default='', 64 | help='if nonempty then use this file in DataLoaderRaw (see docs there). Used only in MSCOCO test evaluation, where we have a specific json file of only test set images.') 65 | # misc 66 | parser.add_argument('--id', type=str, default='evalscript', 67 | help='an id identifying this run/job. 
used only if language_eval = 1 for appending to intermediate files') 68 | 69 | opt = parser.parse_args() 70 | 71 | # Load infos 72 | with open(opt.infos_path) as f: 73 | infos = cPickle.load(f) 74 | 75 | # override and collect parameters 76 | if len(opt.input_h5) == 0: 77 | opt.input_h5 = infos['opt'].input_h5 78 | if len(opt.input_json) == 0: 79 | opt.input_json = infos['opt'].input_json 80 | if opt.batch_size == 0: 81 | opt.batch_size = infos['opt'].batch_size 82 | ignore = ["id", "batch_size", "beam_size", "start_from"] 83 | for k in vars(infos['opt']).keys(): 84 | if k not in ignore: 85 | if k in vars(opt): 86 | assert vars(opt)[k] == vars(infos['opt'])[k], k + ' option not consistent' 87 | else: 88 | vars(opt).update({k: vars(infos['opt'])[k]}) # copy over options from model 89 | 90 | vocab = infos['vocab'] # ix -> word mapping 91 | 92 | # Setup the model 93 | model = models.setup(opt) 94 | model.build_model() 95 | model.build_generator() 96 | model.build_decoder() 97 | 98 | # Create the Data Loader instance 99 | if len(opt.image_folder) == 0: 100 | loader = DataLoader(opt) 101 | else: 102 | loader = DataLoaderRaw({'folder_path': opt.image_folder, 103 | 'coco_json': opt.coco_json, 104 | 'batch_size': opt.batch_size}) 105 | 106 | # Evaluation fun(ction) 107 | def eval_split(sess, model, loader, eval_kwargs): 108 | verbose = eval_kwargs.get('verbose', True) 109 | num_images = eval_kwargs.get('num_images', -1) 110 | split = eval_kwargs.get('split', 'test') 111 | language_eval = eval_kwargs.get('language_eval', 0) 112 | dataset = eval_kwargs.get('dataset', 'coco') 113 | 114 | # Make sure in the evaluation mode 115 | sess.run(tf.assign(model.training, False)) 116 | sess.run(tf.assign(model.cnn_training, False)) 117 | 118 | loader.reset_iterator(split) 119 | 120 | n = 0 121 | loss_sum = 0 122 | loss_evals = 1e-8 123 | predictions = [] 124 | 125 | while True: 126 | # fetch a batch of data 127 | if opt.beam_size > 1: 128 | data = loader.get_batch(split, 1) 129 | n = n + 1 130 | else: 131 | data = loader.get_batch(split, opt.batch_size) 132 | n = n + opt.batch_size 133 | 134 | #evaluate loss if we have the labels 135 | loss = 0 136 | if data.get('labels', None) is not None: 137 | # forward the model to get loss 138 | feed = {model.images: data['images'], model.labels: data['labels'], model.masks: data['masks']} 139 | loss = sess.run(model.cost, feed) 140 | loss_sum = loss_sum + loss 141 | loss_evals = loss_evals + 1 142 | 143 | # forward the model to also get generated samples for each image 144 | if opt.beam_size == 1: 145 | # forward the model to also get generated samples for each image 146 | feed = {model.images: data['images']} 147 | #g_o,g_l,g_p, seq = sess.run([model.g_output, model.g_logits, model.g_probs, model.generator], feed) 148 | seq = sess.run(model.generator, feed) 149 | 150 | #set_trace() 151 | sents = utils.decode_sequence(vocab, seq) 152 | 153 | for k, sent in enumerate(sents): 154 | entry = {'image_id': data['infos'][k]['id'], 'caption': sent} 155 | predictions.append(entry) 156 | if verbose: 157 | print('image %s: %s' %(entry['image_id'], entry['caption'])) 158 | else: 159 | seq = model.decode(data['images'], opt.beam_size, sess) 160 | sents = [' '.join([vocab.get(str(ix), '') for ix in sent]).strip() for sent in seq] 161 | sents = [sents[0]] 162 | entry = {'image_id': data['infos'][0]['id'], 'caption': sents[0]} 163 | predictions.append(entry) 164 | if verbose: 165 | for sent in sents: 166 | print('image %s: %s' %(entry['image_id'], sent)) 167 | 168 | for k, sent in 
enumerate(sents): 169 | entry = {'image_id': data['infos'][k]['id'], 'caption': sent} 170 | if opt.dump_path == 1: 171 | entry['file_name'] = data['infos'][k]['file_path'] 172 | table.insert(predictions, entry) 173 | if opt.dump_images == 1: 174 | # dump the raw image to vis/ folder 175 | cmd = 'cp "' + os.path.join(opt.image_root, data['infos'][k]['file_path']) + '" vis/imgs/img' + str(len(predictions)) + '.jpg' # bit gross 176 | print(cmd) 177 | os.system(cmd) 178 | 179 | if verbose: 180 | print('image %s: %s' %(entry['image_id'], entry['caption'])) 181 | 182 | # if we wrapped around the split or used up val imgs budget then bail 183 | ix0 = data['bounds']['it_pos_now'] 184 | ix1 = data['bounds']['it_max'] 185 | if num_images != -1: 186 | ix1 = min(ix1, num_images) 187 | for i in range(n - ix1): 188 | predictions.pop() 189 | 190 | if verbose: 191 | print('evaluating validation preformance... %d/%d (%f)' %(ix0 - 1, ix1, loss)) 192 | 193 | if data['bounds']['wrapped']: 194 | break 195 | if num_images >= 0 and n >= num_images: 196 | break 197 | 198 | lang_stats = None 199 | if language_eval == 1: 200 | lang_stats = eval_utils.language_eval(dataset, predictions) 201 | 202 | # Switch back to training mode 203 | sess.run(tf.assign(model.training, True)) 204 | sess.run(tf.assign(model.cnn_training, True)) 205 | return loss_sum/loss_evals, predictions, lang_stats 206 | 207 | tf_config = tf.ConfigProto() 208 | tf_config.intra_op_parallelism_threads=NUM_THREADS 209 | tf_config.gpu_options.allow_growth = True 210 | with tf.Session(config=tf_config) as sess: 211 | # Initilize the variables 212 | sess.run(tf.global_variables_initializer()) 213 | # Load the model checkpoint to evaluate 214 | assert len(opt.model) > 0, 'must provide a model' 215 | tf.train.Saver(tf.trainable_variables()).restore(sess, opt.model) 216 | 217 | # Set sample options 218 | sess.run(tf.assign(model.sample_max, opt.sample_max == 1)) 219 | sess.run(tf.assign(model.sample_temperature, opt.temperature)) 220 | 221 | loss, split_predictions, lang_stats = eval_split(sess, model, loader, 222 | {'num_images': opt.num_images, 223 | 'language_eval': opt.language_eval, 224 | 'split': opt.split}) 225 | 226 | print('loss: ', loss) 227 | if lang_stats: 228 | print(lang_stats) 229 | 230 | if opt.dump_json == 1: 231 | # dump the json 232 | json.dump(split_predictions, open('vis/vis.json', 'w')) 233 | -------------------------------------------------------------------------------- /eval_utils.py: -------------------------------------------------------------------------------- 1 | import json 2 | from json import encoder 3 | 4 | def language_eval(dataset, preds): 5 | import sys 6 | if 'coco' in dataset: 7 | sys.path.append("coco-caption") 8 | annFile = 'coco-caption/annotations/captions_val2014.json' 9 | else: 10 | sys.path.append("f30k-caption") 11 | annFile = 'f30k-caption/annotations/dataset_flickr30k.json' 12 | from pycocotools.coco import COCO 13 | from pycocoevalcap.eval import COCOEvalCap 14 | 15 | encoder.FLOAT_REPR = lambda o: format(o, '.3f') 16 | 17 | coco = COCO(annFile) 18 | valids = coco.getImgIds() 19 | 20 | # filter results to only those in MSCOCO validation set (will be about a third) 21 | preds_filt = [p for p in preds if p['image_id'] in valids] 22 | print 'using %d/%d predictions' % (len(preds_filt), len(preds)) 23 | json.dump(preds_filt, open('tmp.json', 'w')) # serialize to temporary json file. Sigh, COCO API... 
24 | 25 | resFile = 'tmp.json' 26 | cocoRes = coco.loadRes(resFile) 27 | cocoEval = COCOEvalCap(coco, cocoRes) 28 | cocoEval.params['image_id'] = cocoRes.getImgIds() 29 | cocoEval.evaluate() 30 | 31 | # create output dictionary 32 | out = {} 33 | for metric, score in cocoEval.eval.items(): 34 | out[metric] = score 35 | 36 | return out -------------------------------------------------------------------------------- /misc/AttentionModel.py: -------------------------------------------------------------------------------- 1 | from __future__ import absolute_import 2 | from __future__ import division 3 | from __future__ import print_function 4 | 5 | import tensorflow as tf 6 | import tensorflow.contrib.slim as slim 7 | import os 8 | import vgg 9 | import copy 10 | 11 | import numpy as np 12 | import misc.utils as utils 13 | 14 | # The maximimum step during generation 15 | MAX_STEPS = 30 16 | 17 | class AttentionModel(): 18 | """ 19 | This model is not using the show attend tell algorithm, but given seq2seq attention decoder. 20 | """ 21 | 22 | def initialize(self, sess): 23 | # Initialize the variables 24 | sess.run(tf.global_variables_initializer()) 25 | # Initialize the saver 26 | self.saver = tf.train.Saver(tf.trainable_variables(), max_to_keep=1) 27 | # Load weights from the checkpoint 28 | if vars(self.opt).get('start_from', None): 29 | self.saver.restore(sess, self.opt.ckpt.model_checkpoint_path) 30 | # Initialize the summary writer 31 | self.summary_writer = tf.summary.FileWriter(self.opt.checkpoint_path, sess.graph) 32 | 33 | def __init__(self, opt): 34 | self.vocab_size = opt.vocab_size 35 | self.input_encoding_size = opt.input_encoding_size 36 | self.rnn_size = opt.rnn_size 37 | self.num_layers = opt.num_layers 38 | self.drop_prob_lm = opt.drop_prob_lm 39 | self.seq_length = opt.seq_length 40 | self.vocab_size = opt.vocab_size 41 | self.seq_per_img = opt.seq_per_img 42 | 43 | self.opt = opt 44 | 45 | # Variable indicating in training mode or evaluation mode 46 | self.training = tf.Variable(True, trainable = False, name = "training") 47 | 48 | # Input variables 49 | self.images = tf.placeholder(tf.float32, [None, 224, 224, 3], name = "images") 50 | self.labels = tf.placeholder(tf.int32, [None, self.seq_length + 2], name = "labels") 51 | self.masks = tf.placeholder(tf.float32, [None, self.seq_length + 2], name = "masks") 52 | 53 | # Build CNN 54 | if vars(self.opt).get('start_from', None): 55 | cnn_weight = None 56 | else: 57 | cnn_weight = self.opt.cnn_weight 58 | if self.opt.cnn_model == 'vgg16': 59 | self.cnn = vgg.Vgg16(cnn_weight) 60 | if self.opt.cnn_model == 'vgg19': 61 | self.cnn = vgg.Vgg19(cnn_weight) 62 | 63 | with tf.variable_scope("cnn"): 64 | self.cnn.build(self.images) 65 | 66 | if self.opt.cnn_model == 'vgg16': 67 | self.context = self.cnn.conv5_3 68 | if self.opt.cnn_model == 'vgg19': 69 | self.context = self.cnn.conv5_4 70 | 71 | self.cnn_training = self.cnn.training 72 | 73 | # Variable in language model 74 | with tf.variable_scope("rnnlm"): 75 | # Word Embedding table 76 | self.Wemb = tf.Variable(tf.random_uniform([self.vocab_size + 1, self.input_encoding_size], -0.1, 0.1), name='Wemb') 77 | 78 | # RNN cell 79 | if opt.rnn_type == 'rnn': 80 | self.cell_fn = cell_fn = tf.contrib.rnn.BasicRNNCell 81 | elif opt.rnn_type == 'gru': 82 | self.cell_fn = cell_fn = tf.contrib.rnn.GRUCell 83 | elif opt.rnn_type == 'lstm': 84 | self.cell_fn = cell_fn = tf.contrib.rnn.LSTMCell 85 | else: 86 | raise Exception("RNN type not supported: {}".format(opt.rnn_type)) 87 | 88 | # 
keep_prob is a function of training flag 89 | self.keep_prob = tf.cond(self.training, 90 | lambda : tf.constant(1 - self.drop_prob_lm), 91 | lambda : tf.constant(1.0), name = 'keep_prob') 92 | # basic cell has dropout wrapper 93 | self.basic_cell = cell = tf.contrib.rnn.DropoutWrapper(cell_fn(self.rnn_size, state_is_tuple = True), 1.0, self.keep_prob) 94 | # cell is the final cell of each timestep 95 | self.cell = tf.contrib.rnn.MultiRNNCell([cell] * opt.num_layers, state_is_tuple = True) 96 | 97 | def build_model(self): 98 | with tf.name_scope("batch_size"): 99 | # Get batch_size from the first dimension of self.images 100 | self.batch_size = tf.shape(self.images)[0] 101 | with tf.variable_scope("rnnlm"): 102 | flattened_ctx = tf.reshape(self.context, [self.batch_size, 196, 512]) 103 | ctx_mean = tf.reduce_mean(flattened_ctx, 1) 104 | 105 | # Initialize the first hidden state with the mean context 106 | initial_state = utils.get_initial_state(ctx_mean, self.cell.state_size) 107 | # Replicate self.seq_per_img times for each state and image embedding 108 | self.initial_state = initial_state = utils.expand_feat(initial_state, self.seq_per_img) 109 | self.flattened_ctx = flattened_ctx = tf.reshape(tf.tile(tf.expand_dims(flattened_ctx, 1), [1, self.seq_per_img, 1, 1]), 110 | [self.batch_size * self.seq_per_img, 196, 512]) 111 | 112 | rnn_inputs = tf.split(axis=1, num_or_size_splits=self.seq_length + 1, value=tf.nn.embedding_lookup(self.Wemb, self.labels[:,:self.seq_length + 1])) 113 | rnn_inputs = [tf.squeeze(input_, [1]) for input_ in rnn_inputs] 114 | 115 | outputs, last_state = tf.contrib.legacy_seq2seq.attention_decoder(rnn_inputs, initial_state, flattened_ctx, self.cell, loop_function=None) 116 | outputs = tf.concat(axis=0, values=outputs) 117 | 118 | self.logits = slim.fully_connected(outputs, self.vocab_size + 1, activation_fn = None, scope = 'logit') 119 | self.logits = tf.split(axis=0, num_or_size_splits=len(rnn_inputs), value=self.logits) 120 | 121 | with tf.variable_scope("loss"): 122 | loss = tf.contrib.legacy_seq2seq.sequence_loss_by_example(self.logits, 123 | [tf.squeeze(label, [1]) for label in tf.split(axis=1, num_or_size_splits=self.seq_length + 1, value=self.labels[:, 1:])], # self.labels[:,1:] is the target 124 | [tf.squeeze(mask, [1]) for mask in tf.split(axis=1, num_or_size_splits=self.seq_length + 1, value=self.masks[:, 1:])]) 125 | self.cost = tf.reduce_mean(loss) 126 | 127 | self.final_state = last_state 128 | self.lr = tf.Variable(0.0, trainable=False) 129 | self.cnn_lr = tf.Variable(0.0, trainable=False) 130 | 131 | # Collect the rnn variables, and create the optimizer of rnn 132 | tvars = tf.get_collection(tf.GraphKeys.TRAINABLE_VARIABLES, scope='rnnlm') 133 | grads = utils.clip_by_value(tf.gradients(self.cost, tvars), -self.opt.grad_clip, self.opt.grad_clip) 134 | #grads, _ = tf.clip_by_global_norm(tf.gradients(self.cost, tvars), 135 | # self.opt.grad_clip) 136 | optimizer = utils.get_optimizer(self.opt, self.lr) 137 | self.train_op = optimizer.apply_gradients(zip(grads, tvars)) 138 | 139 | # Collect the cnn variables, and create the optimizer of cnn 140 | cnn_tvars = tf.get_collection(tf.GraphKeys.TRAINABLE_VARIABLES, scope='cnn') 141 | cnn_grads = utils.clip_by_value(tf.gradients(self.cost, cnn_tvars), -self.opt.grad_clip, self.opt.grad_clip) 142 | #cnn_grads, _ = tf.clip_by_global_norm(tf.gradients(self.cost, cnn_tvars), 143 | # self.opt.grad_clip) 144 | cnn_optimizer = utils.get_cnn_optimizer(self.opt, self.cnn_lr) 145 | self.cnn_train_op = 
cnn_optimizer.apply_gradients(zip(cnn_grads, cnn_tvars)) 146 | 147 | tf.summary.scalar('training loss', self.cost) 148 | tf.summary.scalar('learning rate', self.lr) 149 | tf.summary.scalar('cnn learning rate', self.cnn_lr) 150 | self.summaries = tf.summary.merge_all() 151 | 152 | def build_generator(self): 153 | """ 154 | Generator for generating captions 155 | Support sample max or sample from distribution 156 | No Beam search here; beam search is in decoder 157 | """ 158 | # Variables for the sample setting 159 | self.sample_max = tf.Variable(True, trainable = False, name = "sample_max") 160 | self.sample_temperature = tf.Variable(1.0, trainable = False, name = "temperature") 161 | 162 | self.generator = [] 163 | with tf.variable_scope("rnnlm") as rnnlm_scope: 164 | flattened_ctx = tf.reshape(self.context, [self.batch_size, 196, 512]) 165 | ctx_mean = tf.reduce_mean(flattened_ctx, 1) 166 | 167 | tf.get_variable_scope().reuse_variables() 168 | 169 | initial_state = utils.get_initial_state(ctx_mean, self.cell.state_size) 170 | 171 | rnn_inputs = [tf.nn.embedding_lookup(self.Wemb, tf.zeros([self.batch_size], tf.int32))] + [0] * (MAX_STEPS - 1) 172 | 173 | # Always pick the word with largest probability as the input of next time step 174 | def loop(prev, i): 175 | with tf.variable_scope(rnnlm_scope): 176 | prev = slim.fully_connected(prev, self.vocab_size + 1, activation_fn = None, scope = 'logit') 177 | prev_symbol = tf.stop_gradient(tf.cond(self.sample_max, 178 | lambda: tf.argmax(prev, 1), # pick the word with largest probability as the input of next time step 179 | lambda: tf.squeeze( 180 | tf.multinomial(tf.nn.log_softmax(prev) / self.sample_temperature, 1), 1))) # Sample from the distribution 181 | self.generator.append(prev_symbol) 182 | return tf.nn.embedding_lookup(self.Wemb, prev_symbol) 183 | 184 | outputs, last_state = tf.contrib.legacy_seq2seq.attention_decoder(rnn_inputs, initial_state, flattened_ctx, self.cell, loop_function=loop) 185 | self.g_outputs = outputs = tf.reshape(tf.concat(axis=1, values=outputs), [-1, self.rnn_size]) 186 | self.g_logits = logits = slim.fully_connected(outputs, self.vocab_size + 1, activation_fn = None, scope = 'logit') 187 | self.g_probs = probs = tf.reshape(tf.nn.softmax(logits), [self.batch_size, MAX_STEPS, self.vocab_size + 1]) 188 | 189 | self.generator = tf.transpose(tf.reshape(tf.concat(axis=0, values=self.generator), [MAX_STEPS - 1, -1])) 190 | 191 | def build_decoder_rnn(self, first_step): 192 | """ 193 | This function build a decoder 194 | if first_step is true, the state is initialized by mean context 195 | if first_step is not true, the states are placeholder, and should be assigned. 
196 | """ 197 | with tf.variable_scope("rnnlm"): 198 | flattened_ctx = tf.reshape(self.context, [self.batch_size, 196, 512]) 199 | ctx_mean = tf.reduce_mean(flattened_ctx, 1) 200 | 201 | self.decoder_prev_word = tf.placeholder(tf.int32, [None]) 202 | if first_step: 203 | rnn_input = tf.nn.embedding_lookup(self.Wemb, tf.zeros([self.batch_size], tf.int32)) 204 | else: 205 | rnn_input = tf.nn.embedding_lookup(self.Wemb, self.decoder_prev_word) 206 | 207 | tf.get_variable_scope().reuse_variables() 208 | if not first_step: 209 | initial_state = utils.get_placeholder_state(self.cell.state_size) 210 | self.decoder_flattened_state = utils.flatten_state(initial_state) 211 | else: 212 | initial_state = utils.get_initial_state(ctx_mean, self.cell.state_size) 213 | 214 | outputs, state = tf.contrib.legacy_seq2seq.attention_decoder([rnn_input], initial_state, flattened_ctx, self.cell, initial_state_attention = not first_step) 215 | logits = slim.fully_connected(outputs[0], self.vocab_size + 1, activation_fn = None, scope = 'logit') 216 | decoder_probs = tf.reshape(tf.nn.softmax(logits), [self.batch_size, self.vocab_size + 1]) 217 | decoder_state = utils.flatten_state(state) 218 | 219 | # output the probability and flattened state to next time step 220 | return [decoder_probs, decoder_state] 221 | 222 | 223 | def build_decoder(self): 224 | self.decoder_model_init = self.build_decoder_rnn(True) # Used for the first step 225 | self.decoder_model_cont = self.build_decoder_rnn(False) 226 | 227 | def decode(self, img, beam_size, sess, max_steps=30): 228 | """Decode an image with a sentences.""" 229 | 230 | # Initilize beam search variables 231 | # Candidate will be represented with a dictionary 232 | # "indexes": a list with indexes denoted a sentence; 233 | # "words": word in the decoded sentence without 234 | # "score": log-likelihood of the sentence 235 | # "state": RNN state when generating the last word of the candidate 236 | good_sentences = [] # store sentences already ended with 237 | cur_best_cand = [] # store current best candidates 238 | highest_score = 0.0 # hightest log-likelihodd in good sentences 239 | 240 | # Get the initial logit and state 241 | cand = {'indexes': [], 'score': 0} 242 | cur_best_cand.append(cand) 243 | 244 | # Expand the current best candidates until max_steps or no candidate 245 | for i in xrange(max_steps + 1): 246 | # expand candidates 247 | cand_pool = [] 248 | if i == 0: 249 | all_probs, all_states = self.get_probs_init(img, sess) 250 | else: 251 | states = [np.vstack([cand['state'][i] for cand in cur_best_cand]) for i in xrange(len(cur_best_cand[0]['state']))] 252 | indexes = [cand['indexes'][-1] for cand in cur_best_cand] 253 | imgs = np.vstack([img] * len(cur_best_cand)) 254 | all_probs, all_states = self.get_probs_cont(states, imgs, indexes, sess) 255 | 256 | # Construct new beams 257 | for ind_cand in range(len(cur_best_cand)): 258 | cand = cur_best_cand[ind_cand] 259 | probs = all_probs[ind_cand] 260 | state = [x[ind_cand] for x in all_states] 261 | 262 | probs = np.squeeze(probs) 263 | probs_order = np.argsort(-probs) 264 | # append new end terminal at the end of this beam 265 | for ind_b in xrange(beam_size): 266 | cand_e = copy.deepcopy(cand) 267 | cand_e['indexes'].append(probs_order[ind_b]) 268 | cand_e['score'] -= np.log(probs[probs_order[ind_b]]) 269 | cand_e['state'] = state 270 | cand_pool.append(cand_e) 271 | # get best beams 272 | cur_best_cand = sorted(cand_pool, key=lambda cand: cand['score']) 273 | cur_best_cand = utils.truncate_list(cur_best_cand, 
beam_size) 274 | 275 | # move candidates end with to good_sentences or remove it 276 | cand_left = [] 277 | for cand in cur_best_cand: 278 | if len(good_sentences) > beam_size and cand['score'] > highest_score: 279 | continue # No need to expand that candidate 280 | if cand['indexes'][-1] == 0: #end of sentence 281 | good_sentences.append(cand) 282 | highest_score = max(highest_score, cand['score']) 283 | else: 284 | cand_left.append(cand) 285 | cur_best_cand = cand_left 286 | if not cur_best_cand: 287 | break 288 | 289 | # Add candidate left in cur_best_cand to good sentences 290 | for cand in cur_best_cand: 291 | if len(good_sentences) > beam_size and cand['score'] > highest_score: 292 | continue 293 | if cand['indexes'][-1] != 0: 294 | cand['indexes'].append(0) 295 | good_sentences.append(cand) 296 | highest_score = max(highest_score, cand['score']) 297 | 298 | # Sort good sentences and return the final list 299 | good_sentences = sorted(good_sentences, key=lambda cand: cand['score']) 300 | good_sentences = utils.truncate_list(good_sentences, beam_size) 301 | 302 | return [sent['indexes'] for sent in good_sentences] 303 | 304 | 305 | def get_probs_init(self, img, sess): 306 | """Use the model to get initial logit""" 307 | m = self.decoder_model_init 308 | 309 | probs, state = sess.run(m, {self.images: img}) 310 | 311 | return (probs, state) 312 | 313 | def get_probs_cont(self, prev_state, img, prev_word, sess): 314 | """Use the model to get continued logit""" 315 | m = self.decoder_model_cont 316 | prev_word = np.array(prev_word, dtype='int32') 317 | 318 | placeholders = [self.images, self.decoder_prev_word] + self.decoder_flattened_state 319 | feeded = [img, prev_word] + prev_state 320 | 321 | probs, state = sess.run(m, {placeholders[i]: feeded[i] for i in xrange(len(placeholders))}) 322 | 323 | return (probs, state) -------------------------------------------------------------------------------- /misc/ShowAttendTellModel.py: -------------------------------------------------------------------------------- 1 | from __future__ import absolute_import 2 | from __future__ import division 3 | from __future__ import print_function 4 | 5 | import tensorflow as tf 6 | import tensorflow.contrib.slim as slim 7 | import os 8 | import vgg 9 | import copy 10 | 11 | import numpy as np 12 | import misc.utils as utils 13 | 14 | # The maximimum step during generation 15 | MAX_STEPS = 30 16 | 17 | class ShowAttendTellModel(): 18 | 19 | def initialize(self, sess): 20 | # Initialize the variables 21 | sess.run(tf.global_variables_initializer()) 22 | # Initialize the saver 23 | self.saver = tf.train.Saver(tf.trainable_variables(), max_to_keep=1) 24 | # Load weights from the checkpoint 25 | if vars(self.opt).get('start_from', None): 26 | self.saver.restore(sess, self.opt.ckpt.model_checkpoint_path) 27 | # Initialize the summary writer 28 | self.summary_writer = tf.summary.FileWriter(self.opt.checkpoint_path, sess.graph) 29 | 30 | def __init__(self, opt): 31 | self.vocab_size = opt.vocab_size 32 | self.input_encoding_size = opt.input_encoding_size 33 | self.rnn_size = opt.rnn_size 34 | self.num_layers = opt.num_layers 35 | self.drop_prob_lm = opt.drop_prob_lm 36 | self.seq_length = opt.seq_length 37 | self.vocab_size = opt.vocab_size 38 | self.seq_per_img = opt.seq_per_img 39 | self.att_hid_size = opt.att_hid_size 40 | 41 | self.opt = opt 42 | 43 | # Variable indicating in training mode or evaluation mode 44 | self.training = tf.Variable(True, trainable = False, name = "training") 45 | 46 | # Input variables 
47 | self.images = tf.placeholder(tf.float32, [None, 224, 224, 3], name = "images") 48 | self.labels = tf.placeholder(tf.int32, [None, self.seq_length + 2], name = "labels") 49 | self.masks = tf.placeholder(tf.float32, [None, self.seq_length + 2], name = "masks") 50 | 51 | # Build CNN 52 | if vars(self.opt).get('start_from', None): 53 | cnn_weight = None 54 | else: 55 | cnn_weight = vars(self.opt).get('cnn_weight', None) 56 | if self.opt.cnn_model == 'vgg16': 57 | self.cnn = vgg.Vgg16(cnn_weight) 58 | if self.opt.cnn_model == 'vgg19': 59 | self.cnn = vgg.Vgg19(cnn_weight) 60 | 61 | with tf.variable_scope("cnn"): 62 | self.cnn.build(self.images) 63 | 64 | if self.opt.cnn_model == 'vgg16': 65 | self.context = self.cnn.conv5_3 66 | if self.opt.cnn_model == 'vgg19': 67 | self.context = self.cnn.conv5_4 68 | self.fc7 = self.cnn.drop7 69 | self.cnn_training = self.cnn.training 70 | 71 | # Variable in language model 72 | with tf.variable_scope("rnnlm"): 73 | # Word Embedding table 74 | self.Wemb = tf.Variable(tf.random_uniform([self.vocab_size + 1, self.input_encoding_size], -0.1, 0.1), name='Wemb') 75 | 76 | # RNN cell 77 | if opt.rnn_type == 'rnn': 78 | self.cell_fn = cell_fn = tf.contrib.rnn.BasicRNNCell 79 | elif opt.rnn_type == 'gru': 80 | self.cell_fn = cell_fn = tf.contrib.rnn.GRUCell 81 | elif opt.rnn_type == 'lstm': 82 | self.cell_fn = cell_fn = tf.contrib.rnn.LSTMCell 83 | else: 84 | raise Exception("RNN type not supported: {}".format(opt.rnn_type)) 85 | 86 | # keep_prob is a function of training flag 87 | self.keep_prob = tf.cond(self.training, 88 | lambda : tf.constant(1 - self.drop_prob_lm), 89 | lambda : tf.constant(1.0), name = 'keep_prob') 90 | 91 | # basic cell has dropout wrapper 92 | self.basic_cell = cell = tf.contrib.rnn.DropoutWrapper(cell_fn(self.rnn_size), 1.0, self.keep_prob) 93 | # cell is the final cell of each timestep 94 | self.cell = tf.contrib.rnn.MultiRNNCell([cell] * opt.num_layers) 95 | 96 | def get_alpha(self, prev_h, pctx): 97 | # projected state 98 | if self.att_hid_size == 0: 99 | pstate = slim.fully_connected(prev_h, 1, activation_fn = None, scope = 'h_att') # (batch * seq_per_img) * 1 100 | alpha = pctx + tf.expand_dims(pstate, 1) #(batch * seq_per_img) * 196 * 1 101 | alpha = tf.squeeze(alpha, [2]) # (batch * seq_per_img) * 196 102 | alpha = tf.nn.softmax(alpha) 103 | else: 104 | pstate = slim.fully_connected(prev_h, self.att_hid_size, activation_fn = None, scope = 'h_att') # (batch * seq_per_img) * att_hid_size 105 | pctx_ = pctx + tf.expand_dims(pstate, 1) #(batch * seq_per_img) * 196 * att_hid_size 106 | pctx_ = tf.nn.tanh(pctx_) # (batch * seq_per_img) * 196 * att_hid_size 107 | alpha = slim.fully_connected(pctx_, 1, activation_fn = None, scope = 'alpha') # (batch * seq_per_img) * 196 * 1 108 | alpha = tf.squeeze(alpha, [2]) # (batch * seq_per_img) * 196 109 | alpha = tf.nn.softmax(alpha) 110 | return alpha 111 | 112 | def build_model(self): 113 | with tf.name_scope("batch_size"): 114 | # Get batch_size from the first dimension of self.images 115 | self.batch_size = tf.shape(self.images)[0] 116 | with tf.variable_scope("rnnlm"): 117 | # Flatten the context 118 | flattened_ctx = tf.reshape(self.context, [self.batch_size, 196, 512]) 119 | 120 | # Initialize the first hidden state with the mean context 121 | initial_state = utils.get_initial_state(self.fc7, self.cell.state_size) 122 | # Replicate self.seq_per_img times for each state and image embedding 123 | self.initial_state = initial_state = utils.expand_feat(initial_state, self.seq_per_img) 124 | 
self.flattened_ctx = flattened_ctx = tf.reshape(tf.tile(tf.expand_dims(flattened_ctx, 1), [1, self.seq_per_img, 1, 1]), 125 | [self.batch_size * self.seq_per_img, 196, 512]) 126 | 127 | #projected context 128 | # This is used in attention module; do this outside the loop to reduce redundant computations 129 | # with tf.variable_scope("attention"): 130 | if self.att_hid_size == 0: 131 | pctx = slim.fully_connected(self.flattened_ctx, 1, activation_fn = None, scope = 'ctx_att') # (batch * seq_per_img) * 196 * 1 132 | else: 133 | pctx = slim.fully_connected(self.flattened_ctx, self.att_hid_size, activation_fn = None, scope = 'ctx_att') # (batch * seq_per_img) * 196 * att_hid_size 134 | 135 | rnn_inputs = tf.split(axis=1, num_or_size_splits=self.seq_length + 1, value=tf.nn.embedding_lookup(self.Wemb, self.labels[:,:self.seq_length + 1])) 136 | rnn_inputs = [tf.squeeze(input_, [1]) for input_ in rnn_inputs] 137 | 138 | prev_h = utils.last_hidden_vec(initial_state) 139 | 140 | self.alphas = [] 141 | self.logits = [] 142 | outputs = [] 143 | state = initial_state 144 | for ind in range(self.seq_length + 1): 145 | if ind > 0: 146 | # Reuse the variables after the first timestep. 147 | tf.get_variable_scope().reuse_variables() 148 | 149 | with tf.variable_scope("attention"): 150 | alpha = self.get_alpha(prev_h, pctx) 151 | self.alphas.append(alpha) 152 | weighted_context = tf.reduce_sum(flattened_ctx * tf.expand_dims(alpha, 2), 1) 153 | 154 | output, state = self.cell(tf.concat(axis=1, values=[weighted_context, rnn_inputs[ind]]), state) 155 | # Save the current output for next time step attention 156 | prev_h = output 157 | # Get the score of each word in vocabulary, 0 is end token. 158 | self.logits.append(slim.fully_connected(output, self.vocab_size + 1, activation_fn = None, scope = 'logit')) 159 | 160 | with tf.variable_scope("loss"): 161 | loss = tf.contrib.legacy_seq2seq.sequence_loss_by_example( 162 | self.logits, 163 | [tf.squeeze(label, [1]) for label in tf.split(axis=1, num_or_size_splits=self.seq_length + 1, value=self.labels[:, 1:])], # self.labels[:,1:] is the target; ignore the first start token 164 | [tf.squeeze(mask, [1]) for mask in tf.split(axis=1, num_or_size_splits=self.seq_length + 1, value=self.masks[:, 1:])]) 165 | self.cost = tf.reduce_mean(loss) 166 | 167 | self.final_state = state 168 | self.lr = tf.Variable(0.0, trainable=False) 169 | self.cnn_lr = tf.Variable(0.0, trainable=False) 170 | 171 | # Collect the rnn variables, and create the optimizer of rnn 172 | tvars = tf.get_collection(tf.GraphKeys.TRAINABLE_VARIABLES, scope='rnnlm') 173 | grads = utils.clip_by_value(tf.gradients(self.cost, tvars), -self.opt.grad_clip, self.opt.grad_clip) 174 | optimizer = utils.get_optimizer(self.opt, self.lr) 175 | self.train_op = optimizer.apply_gradients(zip(grads, tvars)) 176 | 177 | # Collect the cnn variables, and create the optimizer of cnn 178 | cnn_tvars = tf.get_collection(tf.GraphKeys.TRAINABLE_VARIABLES, scope='cnn') 179 | cnn_grads = utils.clip_by_value(tf.gradients(self.cost, cnn_tvars), -self.opt.grad_clip, self.opt.grad_clip) 180 | cnn_optimizer = utils.get_cnn_optimizer(self.opt, self.cnn_lr) 181 | self.cnn_train_op = cnn_optimizer.apply_gradients(zip(cnn_grads, cnn_tvars)) 182 | 183 | tf.summary.scalar('training loss', self.cost) 184 | tf.summary.scalar('learning rate', self.lr) 185 | tf.summary.scalar('cnn learning rate', self.cnn_lr) 186 | self.summaries = tf.summary.merge_all() 187 | 188 | def build_generator(self): 189 | """ 190 | Generator for generating captions 
191 | Support sample max or sample from distribution 192 | No Beam search here; beam search is in decoder 193 | """ 194 | # Variables for the sample setting 195 | self.sample_max = tf.Variable(True, trainable = False, name = "sample_max") 196 | self.sample_temperature = tf.Variable(1.0, trainable = False, name = "temperature") 197 | 198 | self.generator = [] 199 | with tf.variable_scope("rnnlm"): 200 | flattened_ctx = tf.reshape(self.context, [self.batch_size, 196, 512]) 201 | 202 | tf.get_variable_scope().reuse_variables() 203 | 204 | initial_state = utils.get_initial_state(self.fc7, self.cell.state_size) 205 | 206 | #projected context 207 | # This is used in attention module; do this outside the loop to reduce redundant computations 208 | # with tf.variable_scope("attention"): 209 | if self.att_hid_size == 0: 210 | pctx = slim.fully_connected(flattened_ctx, 1, activation_fn = None, scope = 'ctx_att') # (batch) * 196 * 1 211 | else: 212 | pctx = slim.fully_connected(flattened_ctx, self.att_hid_size, activation_fn = None, scope = 'ctx_att') # (batch) * 196 * att_hid_size 213 | 214 | rnn_input = tf.nn.embedding_lookup(self.Wemb, tf.zeros([self.batch_size], tf.int32)) 215 | 216 | prev_h = utils.last_hidden_vec(initial_state) 217 | 218 | self.g_alphas = [] 219 | outputs = [] 220 | state = initial_state 221 | for ind in range(MAX_STEPS): 222 | 223 | with tf.variable_scope("attention"): 224 | alpha = self.get_alpha(prev_h, pctx) 225 | self.g_alphas.append(alpha) 226 | weighted_context = tf.reduce_sum(flattened_ctx * tf.expand_dims(alpha, 2), 1) 227 | 228 | output, state = self.cell(tf.concat(axis=1, values=[weighted_context, rnn_input]), state) 229 | outputs.append(output) 230 | prev_h = output 231 | 232 | # Get the input of next timestep 233 | prev_logit = slim.fully_connected(prev_h, self.vocab_size + 1, activation_fn = None, scope = 'logit') 234 | prev_symbol = tf.stop_gradient(tf.cond(self.sample_max, 235 | lambda: tf.argmax(prev_logit, 1), # pick the word with largest probability as the input of next time step 236 | lambda: tf.squeeze( 237 | tf.multinomial(tf.nn.log_softmax(prev_logit) / self.sample_temperature, 1), 1))) # Sample from the distribution 238 | self.generator.append(prev_symbol) 239 | rnn_input = tf.nn.embedding_lookup(self.Wemb, prev_symbol) 240 | 241 | self.g_output = output = tf.reshape(tf.concat(axis=1, values=outputs), [-1, self.rnn_size]) # outputs[1:], because we don't calculate loss on time 0. 
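            # Every timestep of this generator predicts a word (the image enters through the
            # attention context and the initial state, not as an RNN input), so all MAX_STEPS
            # outputs are kept; the shared 'logit' layer below scores them in one pass and
            # g_probs reshapes the softmax back to [batch_size, MAX_STEPS, vocab_size + 1].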
242 | self.g_logits = logits = slim.fully_connected(output, self.vocab_size + 1, activation_fn = None, scope = 'logit') 243 | self.g_probs = probs = tf.reshape(tf.nn.softmax(logits), [self.batch_size, MAX_STEPS, self.vocab_size + 1]) 244 | 245 | self.generator = tf.transpose(tf.reshape(tf.concat(axis=0, values=self.generator), [MAX_STEPS, -1])) 246 | 247 | def build_decoder_rnn(self, first_step): 248 | with tf.variable_scope("rnnlm"): 249 | flattened_ctx = tf.reshape(self.context, [self.batch_size, 196, 512]) 250 | 251 | tf.get_variable_scope().reuse_variables() 252 | 253 | if not first_step: 254 | initial_state = utils.get_placeholder_state(self.cell.state_size) 255 | self.decoder_flattened_state = utils.flatten_state(initial_state) 256 | else: 257 | initial_state = utils.get_initial_state(self.fc7, self.cell.state_size) 258 | 259 | self.decoder_prev_word = tf.placeholder(tf.int32, [None]) 260 | 261 | if first_step: 262 | rnn_input = tf.nn.embedding_lookup(self.Wemb, tf.zeros([self.batch_size], tf.int32)) 263 | else: 264 | rnn_input = tf.nn.embedding_lookup(self.Wemb, self.decoder_prev_word) 265 | 266 | #projected context 267 | # This is used in attention module; do this outside the loop to reduce redundant computations 268 | # with tf.variable_scope("attention"): 269 | if self.att_hid_size == 0: 270 | pctx = slim.fully_connected(flattened_ctx, 1, activation_fn = None, scope = 'ctx_att') # (batch * seq_per_img) * 196 * 1 271 | else: 272 | pctx = slim.fully_connected(flattened_ctx, self.att_hid_size, activation_fn = None, scope = 'ctx_att') # (batch * seq_per_img) * 196 * att_hid_size 273 | 274 | prev_h = utils.last_hidden_vec(initial_state) 275 | 276 | alphas = [] 277 | outputs = [] 278 | 279 | with tf.variable_scope("attention"): 280 | alpha = self.get_alpha(prev_h, pctx) 281 | alphas.append(alpha) 282 | weighted_context = tf.reduce_sum(flattened_ctx * tf.expand_dims(alpha, 2), 1) 283 | 284 | output, state = self.cell(tf.concat(axis=1, values=[weighted_context, rnn_input]), initial_state) 285 | logits = slim.fully_connected(output, self.vocab_size + 1, activation_fn = None, scope = 'logit') 286 | decoder_probs = tf.reshape(tf.nn.softmax(logits), [self.batch_size, self.vocab_size + 1]) 287 | decoder_state = utils.flatten_state(state) 288 | return [decoder_probs, decoder_state] 289 | 290 | def build_decoder(self): 291 | self.decoder_model_init = self.build_decoder_rnn(True) 292 | self.decoder_model_cont = self.build_decoder_rnn(False) 293 | 294 | def decode(self, img, beam_size, sess, max_steps=MAX_STEPS): 295 | """Decode an image with a sentences.""" 296 | 297 | # Initilize beam search variables 298 | # Candidate will be represented with a dictionary 299 | # "indexes": a list with indexes denoted a sentence; 300 | # "words": word in the decoded sentence without 301 | # "score": log-likelihood of the sentence 302 | # "state": RNN state when generating the last word of the candidate 303 | good_sentences = [] # store sentences already ended with 304 | cur_best_cand = [] # store current best candidates 305 | highest_score = 0.0 # hightest log-likelihodd in good sentences 306 | 307 | # Get the initial logit and state 308 | cand = {'indexes': [], 'score': 0} 309 | cur_best_cand.append(cand) 310 | 311 | # Expand the current best candidates until max_steps or no candidate 312 | for i in xrange(max_steps + 1): 313 | # expand candidates 314 | cand_pool = [] 315 | #for cand in cur_best_cand: 316 | #probs, state = self.get_probs_cont(cand['state'], cand['indexes'][-1], sess) 317 | if i == 0: 318 | 
all_probs, all_states = self.get_probs_init(img, sess) 319 | else: 320 | states = [np.vstack([cand['state'][i] for cand in cur_best_cand]) for i in xrange(len(cur_best_cand[0]['state']))] 321 | indexes = [cand['indexes'][-1] for cand in cur_best_cand] 322 | imgs = np.vstack([img] * len(cur_best_cand)) 323 | all_probs, all_states = self.get_probs_cont(states, imgs, indexes, sess) 324 | for ind_cand in range(len(cur_best_cand)): 325 | cand = cur_best_cand[ind_cand] 326 | probs = all_probs[ind_cand] 327 | state = [x[ind_cand] for x in all_states] 328 | 329 | probs = np.squeeze(probs) 330 | probs_order = np.argsort(-probs) 331 | for ind_b in xrange(beam_size): 332 | cand_e = copy.deepcopy(cand) 333 | cand_e['indexes'].append(probs_order[ind_b]) 334 | cand_e['score'] -= np.log(probs[probs_order[ind_b]]) 335 | cand_e['state'] = state 336 | cand_pool.append(cand_e) 337 | # get final cand_pool 338 | cur_best_cand = sorted(cand_pool, key=lambda cand: cand['score']) 339 | cur_best_cand = utils.truncate_list(cur_best_cand, beam_size) 340 | 341 | # move candidates end with to good_sentences or remove it 342 | cand_left = [] 343 | for cand in cur_best_cand: 344 | if len(good_sentences) > beam_size and cand['score'] > highest_score: 345 | continue # No need to expand that candidate 346 | if cand['indexes'][-1] == 0: #end of sentence 347 | good_sentences.append(cand) 348 | highest_score = max(highest_score, cand['score']) 349 | else: 350 | cand_left.append(cand) 351 | cur_best_cand = cand_left 352 | if not cur_best_cand: 353 | break 354 | 355 | # Add candidate left in cur_best_cand to good sentences 356 | for cand in cur_best_cand: 357 | if len(good_sentences) > beam_size and cand['score'] > highest_score: 358 | continue 359 | if cand['indexes'][-1] != 0: 360 | cand['indexes'].append(0) 361 | good_sentences.append(cand) 362 | highest_score = max(highest_score, cand['score']) 363 | 364 | # Sort good sentences and return the final list 365 | good_sentences = sorted(good_sentences, key=lambda cand: cand['score']) 366 | good_sentences = utils.truncate_list(good_sentences, beam_size) 367 | 368 | return [sent['indexes'] for sent in good_sentences] 369 | 370 | def get_probs_init(self, img, sess): 371 | """Use the model to get initial logit""" 372 | m = self.decoder_model_init 373 | 374 | probs, state = sess.run(m, {self.images: img}) 375 | 376 | return (probs, state) 377 | 378 | def get_probs_cont(self, prev_state, img, prev_word, sess): 379 | """Use the model to get continued logit""" 380 | m = self.decoder_model_cont 381 | prev_word = np.array(prev_word, dtype='int32') 382 | 383 | # Feed images, input words, and the flattened state of previous time step. 
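        # The placeholders and the values fed into them are kept in the same order:
        # image batch, previous word ids, then the flattened RNN state pieces returned by the
        # previous sess.run, so they can simply be zipped into the feed_dict below.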
384 | placeholders = [self.images, self.decoder_prev_word] + self.decoder_flattened_state 385 | feeded = [img, prev_word] + prev_state 386 | 387 | probs, state = sess.run(m, {placeholders[i]: feeded[i] for i in xrange(len(placeholders))}) 388 | 389 | return (probs, state) -------------------------------------------------------------------------------- /misc/ShowAttendTellModel_old.py: -------------------------------------------------------------------------------- 1 | from __future__ import absolute_import 2 | from __future__ import division 3 | from __future__ import print_function 4 | 5 | import tensorflow as tf 6 | import tensorflow.contrib.slim as slim 7 | import os 8 | import vgg 9 | import copy 10 | 11 | import numpy as np 12 | import misc.utils as utils 13 | 14 | # The maximimum step during generation 15 | MAX_STEPS = 30 16 | 17 | class ShowAttendTellModel(): 18 | 19 | def initialize(self, sess): 20 | # Initialize the variables 21 | sess.run(tf.global_variables_initializer()) 22 | # Initialize the saver 23 | self.saver = tf.train.Saver(tf.trainable_variables(), max_to_keep=1) 24 | # Load weights from the checkpoint 25 | if vars(self.opt).get('start_from', None): 26 | self.saver.restore(sess, self.opt.ckpt.model_checkpoint_path) 27 | # Initialize the summary writer 28 | self.summary_writer = tf.summary.FileWriter(self.opt.checkpoint_path, sess.graph) 29 | 30 | def __init__(self, opt): 31 | self.vocab_size = opt.vocab_size 32 | self.input_encoding_size = opt.input_encoding_size 33 | self.rnn_size = opt.rnn_size 34 | self.num_layers = opt.num_layers 35 | self.drop_prob_lm = opt.drop_prob_lm 36 | self.seq_length = opt.seq_length 37 | self.vocab_size = opt.vocab_size 38 | self.seq_per_img = opt.seq_per_img 39 | self.att_hid_size = opt.att_hid_size 40 | 41 | self.opt = opt 42 | 43 | # Variable indicating in training mode or evaluation mode 44 | self.training = tf.Variable(True, trainable = False, name = "training") 45 | 46 | # Input variables 47 | self.images = tf.placeholder(tf.float32, [None, 224, 224, 3], name = "images") 48 | self.labels = tf.placeholder(tf.int32, [None, self.seq_length + 2], name = "labels") 49 | self.masks = tf.placeholder(tf.float32, [None, self.seq_length + 2], name = "masks") 50 | 51 | # Build CNN 52 | if vars(self.opt).get('start_from', None): 53 | cnn_weight = None 54 | else: 55 | cnn_weight = vars(self.opt).get('cnn_weight', None) 56 | if self.opt.cnn_model == 'vgg16': 57 | self.cnn = vgg.Vgg16(cnn_weight) 58 | if self.opt.cnn_model == 'vgg19': 59 | self.cnn = vgg.Vgg19(cnn_weight) 60 | 61 | with tf.variable_scope("cnn"): 62 | self.cnn.build(self.images) 63 | 64 | if self.opt.cnn_model == 'vgg16': 65 | self.context = self.cnn.conv5_3 66 | if self.opt.cnn_model == 'vgg19': 67 | self.context = self.cnn.conv5_4 68 | 69 | self.cnn_training = self.cnn.training 70 | 71 | # Variable in language model 72 | with tf.variable_scope("rnnlm"): 73 | # Word Embedding table 74 | self.Wemb = tf.Variable(tf.random_uniform([self.vocab_size + 1, self.input_encoding_size], -0.1, 0.1), name='Wemb') 75 | 76 | # RNN cell 77 | if opt.rnn_type == 'rnn': 78 | self.cell_fn = cell_fn = tf.contrib.rnn.BasicRNNCell 79 | elif opt.rnn_type == 'gru': 80 | self.cell_fn = cell_fn = tf.contrib.rnn.GRUCell 81 | elif opt.rnn_type == 'lstm': 82 | self.cell_fn = cell_fn = tf.contrib.rnn.LSTMCell 83 | else: 84 | raise Exception("RNN type not supported: {}".format(opt.rnn_type)) 85 | 86 | # keep_prob is a function of training flag 87 | self.keep_prob = tf.cond(self.training, 88 | lambda : tf.constant(1 
- self.drop_prob_lm), 89 | lambda : tf.constant(1.0), name = 'keep_prob') 90 | 91 | # basic cell has dropout wrapper 92 | self.basic_cell = cell = tf.contrib.rnn.DropoutWrapper(cell_fn(self.rnn_size), 1.0, self.keep_prob) 93 | # cell is the final cell of each timestep 94 | self.cell = tf.contrib.rnn.MultiRNNCell([cell] * opt.num_layers) 95 | 96 | def get_alpha(self, prev_h, pctx): 97 | # projected state 98 | if self.att_hid_size == 0: 99 | pstate = slim.fully_connected(prev_h, 1, activation_fn = None, scope = 'h_att') # (batch * seq_per_img) * 1 100 | alpha = pctx + tf.expand_dims(pstate, 1) #(batch * seq_per_img) * 196 * 1 101 | alpha = tf.squeeze(alpha, [2]) # (batch * seq_per_img) * 196 102 | alpha = tf.nn.softmax(alpha) 103 | else: 104 | pstate = slim.fully_connected(prev_h, self.att_hid_size, activation_fn = None, scope = 'h_att') # (batch * seq_per_img) * att_hid_size 105 | pctx_ = pctx + tf.expand_dims(pstate, 1) #(batch * seq_per_img) * 196 * att_hid_size 106 | pctx_ = tf.nn.tanh(pctx_) # (batch * seq_per_img) * 196 * att_hid_size 107 | alpha = slim.fully_connected(pctx_, 1, activation_fn = None, scope = 'alpha') # (batch * seq_per_img) * 196 * 1 108 | alpha = tf.squeeze(alpha, [2]) # (batch * seq_per_img) * 196 109 | alpha = tf.nn.softmax(alpha) 110 | return alpha 111 | 112 | def build_model(self): 113 | with tf.name_scope("batch_size"): 114 | # Get batch_size from the first dimension of self.images 115 | self.batch_size = tf.shape(self.images)[0] 116 | with tf.variable_scope("rnnlm"): 117 | # Flatten the context 118 | flattened_ctx = tf.reshape(self.context, [self.batch_size, 196, 512]) 119 | ctx_mean = tf.reduce_mean(flattened_ctx, 1) 120 | 121 | # Initialize the first hidden state with the mean context 122 | initial_state = utils.get_initial_state(ctx_mean, self.cell.state_size) 123 | # Replicate self.seq_per_img times for each state and image embedding 124 | self.initial_state = initial_state = utils.expand_feat(initial_state, self.seq_per_img) 125 | self.flattened_ctx = flattened_ctx = tf.reshape(tf.tile(tf.expand_dims(flattened_ctx, 1), [1, self.seq_per_img, 1, 1]), 126 | [self.batch_size * self.seq_per_img, 196, 512]) 127 | 128 | #projected context 129 | # This is used in attention module; do this outside the loop to reduce redundant computations 130 | # with tf.variable_scope("attention"): 131 | if self.att_hid_size == 0: 132 | pctx = slim.fully_connected(self.flattened_ctx, 1, activation_fn = None, scope = 'ctx_att') # (batch * seq_per_img) * 196 * 1 133 | else: 134 | pctx = slim.fully_connected(self.flattened_ctx, self.att_hid_size, activation_fn = None, scope = 'ctx_att') # (batch * seq_per_img) * 196 * att_hid_size 135 | 136 | rnn_inputs = tf.split(axis=1, num_or_size_splits=self.seq_length + 1, value=tf.nn.embedding_lookup(self.Wemb, self.labels[:,:self.seq_length + 1])) 137 | rnn_inputs = [tf.squeeze(input_, [1]) for input_ in rnn_inputs] 138 | 139 | prev_h = utils.last_hidden_vec(initial_state) 140 | 141 | self.alphas = [] 142 | self.logits = [] 143 | outputs = [] 144 | state = initial_state 145 | for ind in range(self.seq_length + 1): 146 | if ind > 0: 147 | # Reuse the variables after the first timestep. 
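                    # (all timesteps share the same attention and 'logit' weights, so variable
                    # reuse has to be enabled once the first timestep has created them)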
148 | tf.get_variable_scope().reuse_variables() 149 | 150 | with tf.variable_scope("attention"): 151 | alpha = self.get_alpha(prev_h, pctx) 152 | self.alphas.append(alpha) 153 | weighted_context = tf.reduce_sum(flattened_ctx * tf.expand_dims(alpha, 2), 1) 154 | 155 | output, state = self.cell(tf.concat(axis=1, values=[weighted_context, rnn_inputs[ind]]), state) 156 | # Save the current output for next time step attention 157 | prev_h = output 158 | # Get the score of each word in vocabulary, 0 is end token. 159 | self.logits.append(slim.fully_connected(output, self.vocab_size + 1, activation_fn = None, scope = 'logit')) 160 | 161 | with tf.variable_scope("loss"): 162 | loss = tf.contrib.legacy_seq2seq.sequence_loss_by_example( 163 | self.logits, 164 | [tf.squeeze(label, [1]) for label in tf.split(axis=1, num_or_size_splits=self.seq_length + 1, value=self.labels[:, 1:])], # self.labels[:,1:] is the target; ignore the first start token 165 | [tf.squeeze(mask, [1]) for mask in tf.split(axis=1, num_or_size_splits=self.seq_length + 1, value=self.masks[:, 1:])]) 166 | self.cost = tf.reduce_mean(loss) 167 | 168 | self.final_state = state 169 | self.lr = tf.Variable(0.0, trainable=False) 170 | self.cnn_lr = tf.Variable(0.0, trainable=False) 171 | 172 | # Collect the rnn variables, and create the optimizer of rnn 173 | tvars = tf.get_collection(tf.GraphKeys.TRAINABLE_VARIABLES, scope='rnnlm') 174 | grads = utils.clip_by_value(tf.gradients(self.cost, tvars), -self.opt.grad_clip, self.opt.grad_clip) 175 | optimizer = utils.get_optimizer(self.opt, self.lr) 176 | self.train_op = optimizer.apply_gradients(zip(grads, tvars)) 177 | 178 | # Collect the cnn variables, and create the optimizer of cnn 179 | cnn_tvars = tf.get_collection(tf.GraphKeys.TRAINABLE_VARIABLES, scope='cnn') 180 | cnn_grads = utils.clip_by_value(tf.gradients(self.cost, cnn_tvars), -self.opt.grad_clip, self.opt.grad_clip) 181 | cnn_optimizer = utils.get_cnn_optimizer(self.opt, self.cnn_lr) 182 | self.cnn_train_op = cnn_optimizer.apply_gradients(zip(cnn_grads, cnn_tvars)) 183 | 184 | tf.summary.scalar('training loss', self.cost) 185 | tf.summary.scalar('learning rate', self.lr) 186 | tf.summary.scalar('cnn learning rate', self.cnn_lr) 187 | self.summaries = tf.summary.merge_all() 188 | 189 | def build_generator(self): 190 | """ 191 | Generator for generating captions 192 | Support sample max or sample from distribution 193 | No Beam search here; beam search is in decoder 194 | """ 195 | # Variables for the sample setting 196 | self.sample_max = tf.Variable(True, trainable = False, name = "sample_max") 197 | self.sample_temperature = tf.Variable(1.0, trainable = False, name = "temperature") 198 | 199 | self.generator = [] 200 | with tf.variable_scope("rnnlm"): 201 | flattened_ctx = tf.reshape(self.context, [self.batch_size, 196, 512]) 202 | ctx_mean = tf.reduce_mean(flattened_ctx, 1) 203 | 204 | tf.get_variable_scope().reuse_variables() 205 | 206 | initial_state = utils.get_initial_state(ctx_mean, self.cell.state_size) 207 | 208 | #projected context 209 | # This is used in attention module; do this outside the loop to reduce redundant computations 210 | # with tf.variable_scope("attention"): 211 | if self.att_hid_size == 0: 212 | pctx = slim.fully_connected(flattened_ctx, 1, activation_fn = None, scope = 'ctx_att') # (batch) * 196 * 1 213 | else: 214 | pctx = slim.fully_connected(flattened_ctx, self.att_hid_size, activation_fn = None, scope = 'ctx_att') # (batch) * 196 * att_hid_size 215 | 216 | rnn_input = 
tf.nn.embedding_lookup(self.Wemb, tf.zeros([self.batch_size], tf.int32)) 217 | 218 | prev_h = utils.last_hidden_vec(initial_state) 219 | 220 | self.g_alphas = [] 221 | outputs = [] 222 | state = initial_state 223 | for ind in range(MAX_STEPS): 224 | 225 | with tf.variable_scope("attention"): 226 | alpha = self.get_alpha(prev_h, pctx) 227 | self.g_alphas.append(alpha) 228 | weighted_context = tf.reduce_sum(flattened_ctx * tf.expand_dims(alpha, 2), 1) 229 | 230 | output, state = self.cell(tf.concat(axis=1, values=[weighted_context, rnn_input]), state) 231 | outputs.append(output) 232 | prev_h = output 233 | 234 | # Get the input of next timestep 235 | prev_logit = slim.fully_connected(prev_h, self.vocab_size + 1, activation_fn = None, scope = 'logit') 236 | prev_symbol = tf.stop_gradient(tf.cond(self.sample_max, 237 | lambda: tf.argmax(prev_logit, 1), # pick the word with largest probability as the input of next time step 238 | lambda: tf.squeeze( 239 | tf.multinomial(tf.nn.log_softmax(prev_logit) / self.sample_temperature, 1), 1))) # Sample from the distribution 240 | self.generator.append(prev_symbol) 241 | rnn_input = tf.nn.embedding_lookup(self.Wemb, prev_symbol) 242 | 243 | self.g_output = output = tf.reshape(tf.concat(axis=1, values=outputs), [-1, self.rnn_size]) # outputs[1:], because we don't calculate loss on time 0. 244 | self.g_logits = logits = slim.fully_connected(output, self.vocab_size + 1, activation_fn = None, scope = 'logit') 245 | self.g_probs = probs = tf.reshape(tf.nn.softmax(logits), [self.batch_size, MAX_STEPS, self.vocab_size + 1]) 246 | 247 | self.generator = tf.transpose(tf.reshape(tf.concat(axis=0, values=self.generator), [MAX_STEPS, -1])) 248 | 249 | def build_decoder_rnn(self, first_step): 250 | with tf.variable_scope("rnnlm"): 251 | flattened_ctx = tf.reshape(self.context, [self.batch_size, 196, 512]) 252 | ctx_mean = tf.reduce_mean(flattened_ctx, 1) 253 | 254 | tf.get_variable_scope().reuse_variables() 255 | 256 | if not first_step: 257 | initial_state = utils.get_placeholder_state(self.cell.state_size) 258 | self.decoder_flattened_state = utils.flatten_state(initial_state) 259 | else: 260 | initial_state = utils.get_initial_state(ctx_mean, self.cell.state_size) 261 | 262 | self.decoder_prev_word = tf.placeholder(tf.int32, [None]) 263 | 264 | if first_step: 265 | rnn_input = tf.nn.embedding_lookup(self.Wemb, tf.zeros([self.batch_size], tf.int32)) 266 | else: 267 | rnn_input = tf.nn.embedding_lookup(self.Wemb, self.decoder_prev_word) 268 | 269 | #projected context 270 | # This is used in attention module; do this outside the loop to reduce redundant computations 271 | # with tf.variable_scope("attention"): 272 | if self.att_hid_size == 0: 273 | pctx = slim.fully_connected(flattened_ctx, 1, activation_fn = None, scope = 'ctx_att') # (batch * seq_per_img) * 196 * 1 274 | else: 275 | pctx = slim.fully_connected(flattened_ctx, self.att_hid_size, activation_fn = None, scope = 'ctx_att') # (batch * seq_per_img) * 196 * att_hid_size 276 | 277 | prev_h = utils.last_hidden_vec(initial_state) 278 | 279 | alphas = [] 280 | outputs = [] 281 | 282 | with tf.variable_scope("attention"): 283 | alpha = self.get_alpha(prev_h, pctx) 284 | alphas.append(alpha) 285 | weighted_context = tf.reduce_sum(flattened_ctx * tf.expand_dims(alpha, 2), 1) 286 | 287 | output, state = self.cell(tf.concat(axis=1, values=[weighted_context, rnn_input]), initial_state) 288 | logits = slim.fully_connected(output, self.vocab_size + 1, activation_fn = None, scope = 'logit') 289 | decoder_probs = 
tf.reshape(tf.nn.softmax(logits), [self.batch_size, self.vocab_size + 1]) 290 | decoder_state = utils.flatten_state(state) 291 | return [decoder_probs, decoder_state] 292 | 293 | 294 | def build_decoder(self): 295 | self.decoder_model_init = self.build_decoder_rnn(True) 296 | self.decoder_model_cont = self.build_decoder_rnn(False) 297 | 298 | def decode(self, img, beam_size, sess, max_steps=30): 299 | """Decode an image with a sentences.""" 300 | 301 | # Initilize beam search variables 302 | # Candidate will be represented with a dictionary 303 | # "indexes": a list with indexes denoted a sentence; 304 | # "words": word in the decoded sentence without 305 | # "score": log-likelihood of the sentence 306 | # "state": RNN state when generating the last word of the candidate 307 | good_sentences = [] # store sentences already ended with 308 | cur_best_cand = [] # store current best candidates 309 | highest_score = 0.0 # hightest log-likelihodd in good sentences 310 | 311 | # Get the initial logit and state 312 | cand = {'indexes': [], 'score': 0} 313 | cur_best_cand.append(cand) 314 | 315 | # Expand the current best candidates until max_steps or no candidate 316 | for i in xrange(max_steps + 1): 317 | # expand candidates 318 | cand_pool = [] 319 | #for cand in cur_best_cand: 320 | #probs, state = self.get_probs_cont(cand['state'], cand['indexes'][-1], sess) 321 | if i == 0: 322 | all_probs, all_states = self.get_probs_init(img, sess) 323 | else: 324 | states = [np.vstack([cand['state'][i] for cand in cur_best_cand]) for i in xrange(len(cur_best_cand[0]['state']))] 325 | indexes = [cand['indexes'][-1] for cand in cur_best_cand] 326 | imgs = np.vstack([img] * len(cur_best_cand)) 327 | all_probs, all_states = self.get_probs_cont(states, imgs, indexes, sess) 328 | for ind_cand in range(len(cur_best_cand)): 329 | cand = cur_best_cand[ind_cand] 330 | probs = all_probs[ind_cand] 331 | state = [x[ind_cand] for x in all_states] 332 | 333 | probs = np.squeeze(probs) 334 | probs_order = np.argsort(-probs) 335 | for ind_b in xrange(beam_size): 336 | cand_e = copy.deepcopy(cand) 337 | cand_e['indexes'].append(probs_order[ind_b]) 338 | cand_e['score'] -= np.log(probs[probs_order[ind_b]]) 339 | cand_e['state'] = state 340 | cand_pool.append(cand_e) 341 | # get final cand_pool 342 | cur_best_cand = sorted(cand_pool, key=lambda cand: cand['score']) 343 | cur_best_cand = utils.truncate_list(cur_best_cand, beam_size) 344 | 345 | # move candidates end with to good_sentences or remove it 346 | cand_left = [] 347 | for cand in cur_best_cand: 348 | if len(good_sentences) > beam_size and cand['score'] > highest_score: 349 | continue # No need to expand that candidate 350 | if cand['indexes'][-1] == 0: #end of sentence 351 | good_sentences.append(cand) 352 | highest_score = max(highest_score, cand['score']) 353 | else: 354 | cand_left.append(cand) 355 | cur_best_cand = cand_left 356 | if not cur_best_cand: 357 | break 358 | 359 | # Add candidate left in cur_best_cand to good sentences 360 | for cand in cur_best_cand: 361 | if len(good_sentences) > beam_size and cand['score'] > highest_score: 362 | continue 363 | if cand['indexes'][-1] != 0: 364 | cand['indexes'].append(0) 365 | good_sentences.append(cand) 366 | highest_score = max(highest_score, cand['score']) 367 | 368 | # Sort good sentences and return the final list 369 | good_sentences = sorted(good_sentences, key=lambda cand: cand['score']) 370 | good_sentences = utils.truncate_list(good_sentences, beam_size) 371 | 372 | return [sent['indexes'] for sent in 
good_sentences] 373 | 374 | def get_probs_init(self, img, sess): 375 | """Use the model to get initial logit""" 376 | m = self.decoder_model_init 377 | 378 | probs, state = sess.run(m, {self.images: img}) 379 | 380 | return (probs, state) 381 | 382 | def get_probs_cont(self, prev_state, img, prev_word, sess): 383 | """Use the model to get continued logit""" 384 | m = self.decoder_model_cont 385 | prev_word = np.array(prev_word, dtype='int32') 386 | 387 | # Feed images, input words, and the flattened state of previous time step. 388 | placeholders = [self.images, self.decoder_prev_word] + self.decoder_flattened_state 389 | feeded = [img, prev_word] + prev_state 390 | 391 | probs, state = sess.run(m, {placeholders[i]: feeded[i] for i in xrange(len(placeholders))}) 392 | 393 | return (probs, state) -------------------------------------------------------------------------------- /misc/ShowTellModel.py: -------------------------------------------------------------------------------- 1 | from __future__ import absolute_import 2 | from __future__ import division 3 | from __future__ import print_function 4 | 5 | import tensorflow as tf 6 | import tensorflow.contrib.slim as slim 7 | import os 8 | import vgg 9 | import copy 10 | 11 | import numpy as np 12 | import misc.utils as utils 13 | 14 | # The maximimum step during generation 15 | MAX_STEPS = 30 16 | 17 | class ShowTellModel(): 18 | 19 | def initialize(self, sess): 20 | # Initialize the variables 21 | sess.run(tf.global_variables_initializer()) 22 | # Initialize the saver 23 | self.saver = tf.train.Saver(tf.trainable_variables(), max_to_keep=1) 24 | # Load weights from the checkpoint 25 | if vars(self.opt).get('start_from', None): 26 | self.saver.restore(sess, self.opt.ckpt.model_checkpoint_path) 27 | # Initialize the summary writer 28 | self.summary_writer = tf.summary.FileWriter(self.opt.checkpoint_path, sess.graph) 29 | 30 | def __init__(self, opt): 31 | self.vocab_size = opt.vocab_size 32 | self.input_encoding_size = opt.input_encoding_size 33 | self.rnn_size = opt.rnn_size 34 | self.num_layers = opt.num_layers 35 | self.drop_prob_lm = opt.drop_prob_lm 36 | self.seq_length = opt.seq_length 37 | self.vocab_size = opt.vocab_size 38 | self.seq_per_img = opt.seq_per_img 39 | 40 | self.opt = opt 41 | 42 | # Variable indicating in training mode or evaluation mode 43 | self.training = tf.Variable(True, trainable = False, name = "training") 44 | 45 | # Input variables 46 | self.images = tf.placeholder(tf.float32, [None, 224, 224, 3], name = "images") 47 | self.labels = tf.placeholder(tf.int32, [None, self.seq_length + 2], name = "labels") 48 | self.masks = tf.placeholder(tf.float32, [None, self.seq_length + 2], name = "masks") 49 | 50 | # Build CNN 51 | if vars(self.opt).get('start_from', None): 52 | cnn_weight = None 53 | else: 54 | cnn_weight = vars(self.opt).get('cnn_weight', None) 55 | if self.opt.cnn_model == 'vgg16': 56 | self.cnn = vgg.Vgg16(cnn_weight) 57 | if self.opt.cnn_model == 'vgg19': 58 | self.cnn = vgg.Vgg19(cnn_weight) 59 | 60 | with tf.variable_scope("cnn"): 61 | self.cnn.build(self.images) 62 | self.fc7 = self.cnn.drop7 63 | self.cnn_training = self.cnn.training 64 | 65 | # Variable in language model 66 | with tf.variable_scope("rnnlm"): 67 | # Word Embedding table 68 | self.Wemb = tf.Variable(tf.random_uniform([self.vocab_size + 1, self.input_encoding_size], -0.1, 0.1), name='Wemb') 69 | 70 | # RNN cell 71 | if opt.rnn_type == 'rnn': 72 | self.cell_fn = cell_fn = tf.contrib.rnn.BasicRNNCell 73 | elif opt.rnn_type == 'gru': 74 | 
self.cell_fn = cell_fn = tf.contrib.rnn.GRUCell 75 | elif opt.rnn_type == 'lstm': 76 | self.cell_fn = cell_fn = tf.contrib.rnn.LSTMCell 77 | else: 78 | raise Exception("RNN type not supported: {}".format(opt.rnn_type)) 79 | 80 | # keep_prob is a function of training flag 81 | self.keep_prob = tf.cond(self.training, 82 | lambda : tf.constant(1 - self.drop_prob_lm), 83 | lambda : tf.constant(1.0), name = 'keep_prob') 84 | 85 | # basic cell has dropout wrapper 86 | self.basic_cell = cell = tf.contrib.rnn.DropoutWrapper(cell_fn(self.rnn_size), 1.0, self.keep_prob) 87 | # cell is the final cell of each timestep 88 | self.cell = tf.contrib.rnn.MultiRNNCell([cell] * opt.num_layers) 89 | 90 | def build_model(self): 91 | with tf.name_scope("batch_size"): 92 | # Get batch_size from the first dimension of self.images 93 | self.batch_size = tf.shape(self.images)[0] 94 | 95 | with tf.variable_scope("cnn"): 96 | image_emb = slim.fully_connected(self.fc7, self.input_encoding_size, activation_fn=None, scope='encode_image') 97 | with tf.variable_scope("rnnlm"): 98 | # Replicate self.seq_per_img times for each image embedding 99 | image_emb = tf.reshape(tf.tile(tf.expand_dims(image_emb, 1), [1, self.seq_per_img, 1]), [self.batch_size * self.seq_per_img, self.input_encoding_size]) 100 | 101 | # rnn_inputs is a list of input, each element is the input of rnn at each time step 102 | # time step 0 is the image embedding 103 | rnn_inputs = tf.split(axis=1, num_or_size_splits=self.seq_length + 1, value=tf.nn.embedding_lookup(self.Wemb, self.labels[:,:self.seq_length + 1])) 104 | rnn_inputs = [tf.squeeze(input_, [1]) for input_ in rnn_inputs] 105 | rnn_inputs = [image_emb] + rnn_inputs 106 | 107 | # The initial sate is zero 108 | initial_state = self.cell.zero_state(self.batch_size * self.seq_per_img, tf.float32) 109 | 110 | outputs, last_state = tf.contrib.legacy_seq2seq.rnn_decoder(rnn_inputs, initial_state, self.cell, loop_function=None) 111 | 112 | outputs = tf.concat(axis=0, values=outputs[1:]) 113 | self.logits = slim.fully_connected(outputs, self.vocab_size + 1, activation_fn = None, scope = 'logit') 114 | self.logits = tf.split(axis=0, num_or_size_splits=len(rnn_inputs) - 1, value=self.logits) 115 | 116 | with tf.variable_scope("loss"): 117 | loss = tf.contrib.legacy_seq2seq.sequence_loss_by_example(self.logits, 118 | [tf.squeeze(label, [1]) for label in tf.split(axis=1, num_or_size_splits=self.seq_length + 1, value=self.labels[:, 1:])], # self.labels[:,1:] is the target 119 | [tf.squeeze(mask, [1]) for mask in tf.split(axis=1, num_or_size_splits=self.seq_length + 1, value=self.masks[:, 1:])]) 120 | self.cost = tf.reduce_mean(loss) 121 | 122 | self.final_state = last_state 123 | self.lr = tf.Variable(0.0, trainable=False) 124 | self.cnn_lr = tf.Variable(0.0, trainable=False) 125 | 126 | # Collect the rnn variables, and create the optimizer of rnn 127 | tvars = tf.get_collection(tf.GraphKeys.TRAINABLE_VARIABLES, scope='rnnlm') 128 | grads = utils.clip_by_value(tf.gradients(self.cost, tvars), -self.opt.grad_clip, self.opt.grad_clip) 129 | #grads, _ = tf.clip_by_global_norm(tf.gradients(self.cost, tvars), 130 | # self.opt.grad_clip) 131 | optimizer = utils.get_optimizer(self.opt, self.lr) 132 | self.train_op = optimizer.apply_gradients(zip(grads, tvars)) 133 | 134 | # Collect the cnn variables, and create the optimizer of cnn 135 | cnn_tvars = tf.get_collection(tf.GraphKeys.TRAINABLE_VARIABLES, scope='cnn') 136 | cnn_grads = utils.clip_by_value(tf.gradients(self.cost, cnn_tvars), -self.opt.grad_clip, 
self.opt.grad_clip) 137 | #cnn_grads, _ = tf.clip_by_global_norm(tf.gradients(self.cost, cnn_tvars), 138 | # self.opt.grad_clip) 139 | cnn_optimizer = utils.get_cnn_optimizer(self.opt, self.cnn_lr) 140 | self.cnn_train_op = cnn_optimizer.apply_gradients(zip(cnn_grads, cnn_tvars)) 141 | 142 | tf.summary.scalar('training loss', self.cost) 143 | tf.summary.scalar('learning rate', self.lr) 144 | tf.summary.scalar('cnn learning rate', self.cnn_lr) 145 | self.summaries = tf.summary.merge_all() 146 | 147 | def build_generator(self): 148 | """ 149 | Generator for generating captions 150 | Support sample max or sample from distribution 151 | No Beam search here; beam search is in decoder 152 | """ 153 | # Variables for the sample setting 154 | self.sample_max = tf.Variable(True, trainable = False, name = "sample_max") 155 | self.sample_temperature = tf.Variable(1.0, trainable = False, name = "temperature") 156 | 157 | self.generator = [] 158 | with tf.variable_scope("cnn"): 159 | image_emb = slim.fully_connected(self.fc7, self.input_encoding_size, activation_fn=None, reuse=True, scope='encode_image') 160 | with tf.variable_scope("rnnlm") as rnnlm_scope: 161 | rnn_inputs = [image_emb] + [tf.nn.embedding_lookup(self.Wemb, tf.zeros([self.batch_size], tf.int32))] + [0] * (MAX_STEPS - 1) 162 | initial_state = self.cell.zero_state(self.batch_size, tf.float32) 163 | 164 | tf.get_variable_scope().reuse_variables() 165 | 166 | def loop(prev, i): 167 | if i == 1: 168 | return rnn_inputs[1] 169 | with tf.variable_scope(rnnlm_scope): 170 | prev = slim.fully_connected(prev, self.vocab_size + 1, activation_fn = None, scope = 'logit') 171 | prev_symbol = tf.stop_gradient(tf.cond(self.sample_max, 172 | lambda: tf.argmax(prev, 1), # pick the word with largest probability as the input of next time step 173 | lambda: tf.squeeze( 174 | tf.multinomial(tf.nn.log_softmax(prev) / self.sample_temperature, 1), 1))) # Sample from the distribution 175 | self.generator.append(prev_symbol) 176 | return tf.nn.embedding_lookup(self.Wemb, prev_symbol) 177 | 178 | outputs, last_state = tf.contrib.legacy_seq2seq.rnn_decoder(rnn_inputs, initial_state, self.cell, loop_function=loop) 179 | self.g_output = output = tf.reshape(tf.concat(axis=1, values=outputs[1:]), [-1, self.rnn_size]) # outputs[1:], because we don't calculate loss on time 0. 180 | self.g_logits = logits = slim.fully_connected(output, self.vocab_size + 1, activation_fn = None, scope = 'logit') 181 | self.g_probs = probs = tf.reshape(tf.nn.softmax(logits), [self.batch_size, MAX_STEPS, self.vocab_size + 1]) 182 | 183 | self.generator = tf.transpose(tf.reshape(tf.concat(axis=0, values=self.generator), [MAX_STEPS - 1, -1])) 184 | 185 | # Decoders are used for beam search. More complicated than sample max.
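    # build_decoder_rnn(True) feeds the image embedding with a zero initial state;
    # build_decoder_rnn(False) feeds the previous word plus placeholder states, so decode()
    # below can advance the RNN one step per sess.run call.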
186 | # Decoder decodes the image one time step at a time 187 | def build_decoder_rnn(self, first_step): 188 | 189 | with tf.variable_scope("cnn"): 190 | image_emb = slim.fully_connected(self.fc7, self.input_encoding_size, reuse=True, activation_fn=None, scope='encode_image') 191 | with tf.variable_scope("rnnlm"): 192 | if first_step: 193 | rnn_input = image_emb # At the first step, the input is the embedded image 194 | else: 195 | # The input of later time step, is the embedding of the previous word 196 | # The previous word is a placeholder 197 | self.decoder_prev_word = tf.placeholder(tf.int32, [None]) 198 | rnn_input = tf.nn.embedding_lookup(self.Wemb, self.decoder_prev_word) 199 | 200 | batch_size = tf.shape(rnn_input)[0] 201 | 202 | tf.get_variable_scope().reuse_variables() 203 | 204 | if not first_step: 205 | # If not first step, the states are also placeholders. 206 | self.decoder_initial_state = initial_state = utils.get_placeholder_state(self.cell.state_size) 207 | self.decoder_flattened_state = utils.flatten_state(initial_state) 208 | else: 209 | # The states for the first step are zero. 210 | initial_state = self.cell.zero_state(batch_size, tf.float32) 211 | 212 | outputs, state = tf.contrib.legacy_seq2seq.rnn_decoder([rnn_input], initial_state, self.cell) 213 | logits = slim.fully_connected(outputs[0], self.vocab_size + 1, activation_fn = None, scope = 'logit') 214 | decoder_probs = tf.reshape(tf.nn.softmax(logits), [batch_size, self.vocab_size + 1]) 215 | decoder_state = utils.flatten_state(state) 216 | # output the current word distribution and states 217 | return [decoder_probs, decoder_state] 218 | 219 | 220 | def build_decoder(self): 221 | self.decoder_model_init = self.build_decoder_rnn(True) 222 | self.decoder_model_cont = self.build_decoder_rnn(False) 223 | 224 | def decode(self, img, beam_size, sess, max_steps=30): 225 | """Decode an image with a sentences.""" 226 | 227 | # Initilize beam search variables 228 | # Candidate will be represented with a dictionary 229 | # "indexes": a list with indexes denoted a sentence; 230 | # "words": word in the decoded sentence without 231 | # "score": log-likelihood of the sentence 232 | # "state": RNN state when generating the last word of the candidate 233 | good_sentences = [] # store sentences already ended with 234 | cur_best_cand = [] # store current best candidates 235 | highest_score = 0.0 # hightest log-likelihodd in good sentences 236 | 237 | # Get the initial logit and state 238 | probs_init, state_init = self.get_probs_init(img, sess) 239 | cand = {'indexes': [0], 'score': 0, 'state': state_init} 240 | cur_best_cand.append(cand) 241 | 242 | # Expand the current best candidates until max_steps or no candidate 243 | for i in xrange(max_steps): 244 | # expand candidates 245 | cand_pool = [] 246 | states = [np.vstack([cand['state'][i] for cand in cur_best_cand]) for i in xrange(len(cur_best_cand[0]['state']))] 247 | indexes = [cand['indexes'][-1] for cand in cur_best_cand] 248 | all_probs, all_states = self.get_probs_cont(states, indexes, sess) 249 | for ind_cand in range(len(cur_best_cand)): 250 | cand = cur_best_cand[ind_cand] 251 | probs = all_probs[ind_cand] 252 | state = [x[ind_cand] for x in all_states] 253 | 254 | probs = np.squeeze(probs) 255 | probs_order = np.argsort(-probs) 256 | for ind_b in xrange(beam_size): 257 | cand_e = copy.deepcopy(cand) 258 | cand_e['indexes'].append(probs_order[ind_b]) 259 | cand_e['score'] -= np.log(probs[probs_order[ind_b]]) 260 | cand_e['state'] = state 261 | cand_pool.append(cand_e) 
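                # For illustration, each candidate in cand_pool at this point looks like
                #   {'indexes': [0, 12, 7], 'score': 3.41, 'state': [np.array(...), ...]}
                # where 'score' is the accumulated negative log-likelihood, so smaller is better.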
262 | # get final cand_pool 263 | cur_best_cand = sorted(cand_pool, key=lambda cand: cand['score']) 264 | cur_best_cand = utils.truncate_list(cur_best_cand, beam_size) 265 | 266 | # move candidates end with to good_sentences or remove it 267 | cand_left = [] 268 | for cand in cur_best_cand: 269 | if len(good_sentences) > beam_size and cand['score'] > highest_score: 270 | continue # No need to expand that candidate 271 | if cand['indexes'][-1] == 0: #end of sentence 272 | good_sentences.append(cand) 273 | highest_score = max(highest_score, cand['score']) 274 | else: 275 | cand_left.append(cand) 276 | cur_best_cand = cand_left 277 | if not cur_best_cand: 278 | break 279 | 280 | # Add candidate left in cur_best_cand to good sentences 281 | for cand in cur_best_cand: 282 | if len(good_sentences) > beam_size and cand['score'] > highest_score: 283 | continue 284 | if cand['indexes'][-1] != 0: 285 | cand['indexes'].append(0) 286 | good_sentences.append(cand) 287 | highest_score = max(highest_score, cand['score']) 288 | 289 | # Sort good sentences and return the final list 290 | good_sentences = sorted(good_sentences, key=lambda cand: cand['score']) 291 | good_sentences = utils.truncate_list(good_sentences, beam_size) 292 | 293 | return [sent['indexes'][1:] for sent in good_sentences] 294 | 295 | def get_probs_init(self, img, sess): 296 | """Use the model to get initial logit""" 297 | m = self.decoder_model_init 298 | 299 | probs, state = sess.run(m, {self.images: img}) 300 | 301 | return (probs, state) 302 | 303 | def get_probs_cont(self, prev_state, prev_word, sess): 304 | """Use the model to get continued logit""" 305 | m = self.decoder_model_cont 306 | prev_word = np.array(prev_word, dtype='int32') 307 | 308 | placeholders = [self.decoder_prev_word] + self.decoder_flattened_state 309 | feeded = [prev_word] + prev_state 310 | 311 | probs, state = sess.run(m, {placeholders[i]: feeded[i] for i in xrange(len(placeholders))}) 312 | 313 | return (probs, state) -------------------------------------------------------------------------------- /misc/__init__.py: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/ruotianluo/neuraltalk2-tensorflow/65cd3ad5383b0785c63ed3baba5f2cd51df7b59c/misc/__init__.py -------------------------------------------------------------------------------- /misc/utils.py: -------------------------------------------------------------------------------- 1 | from __future__ import absolute_import 2 | from __future__ import division 3 | from __future__ import print_function 4 | 5 | import tensorflow as tf 6 | import tensorflow.contrib.slim as slim 7 | import collections 8 | import six 9 | 10 | # My own clip by value which could input a list of tensors 11 | def clip_by_value(t_list, clip_value_min, clip_value_max, name=None): 12 | if (not isinstance(t_list, collections.Sequence) 13 | or isinstance(t_list, six.string_types)): 14 | raise TypeError("t_list should be a sequence") 15 | t_list = list(t_list) 16 | 17 | with tf.name_scope(name or "clip_by_value") as name: 18 | values = [ 19 | tf.convert_to_tensor( 20 | t.values if isinstance(t, tf.IndexedSlices) else t, 21 | name="t_%d" % i) 22 | if t is not None else t 23 | for i, t in enumerate(t_list)] 24 | values_clipped = [] 25 | for i, v in enumerate(values): 26 | if v is None: 27 | values_clipped.append(None) 28 | else: 29 | with tf.get_default_graph().colocate_with(v): 30 | values_clipped.append( 31 | tf.clip_by_value(v, clip_value_min, clip_value_max)) 32 | 33 | 
list_clipped = [ 34 | tf.IndexedSlices(c_v, t.indices, t.dense_shape) 35 | if isinstance(t, tf.IndexedSlices) 36 | else c_v 37 | for (c_v, t) in zip(values_clipped, t_list)] 38 | 39 | return list_clipped 40 | 41 | # Truncate the list of beam given a maximum length 42 | def truncate_list(l, max_len): 43 | if max_len == -1: 44 | max_len = len(l) 45 | return l[:min(len(l), max_len)] 46 | 47 | # Turn nested state into a flattened list 48 | # Used both for flattening the nested placeholder states and for output states value of previous time step 49 | def flatten_state(state): 50 | if isinstance(state, tf.contrib.rnn.LSTMStateTuple): 51 | return [state.c, state.h] 52 | elif isinstance(state, tuple): 53 | result = [] 54 | for i in xrange(len(state)): 55 | result += flatten_state(state[i]) 56 | return result 57 | else: 58 | return [state] 59 | 60 | # When decoding step by step: we need to initialize the state of next timestep according to the previous time step. 61 | # Because states could be nested tuples or lists, so we get the states recursively. 62 | def get_placeholder_state(state_size, scope = 'placeholder_state'): 63 | with tf.variable_scope(scope): 64 | if isinstance(state_size, tf.contrib.rnn.LSTMStateTuple): 65 | c = tf.placeholder(tf.float32, [None, state_size.c], name='LSTM_c') 66 | h = tf.placeholder(tf.float32, [None, state_size.h], name='LSTM_h') 67 | return tf.contrib.rnn.LSTMStateTuple(c,h) 68 | elif isinstance(state_size, tuple): 69 | result = [get_placeholder_state(state_size[i], "layer_"+str(i)) for i in xrange(len(state_size))] 70 | return tuple(result) 71 | elif isinstance(state_size, int): 72 | return tf.placeholder(tf.float32, [None, state_size], name='state') 73 | 74 | # Get the last hidden vector. (The hidden vector of the deepest layer) 75 | # For the input of the attention model of next time step. 76 | def last_hidden_vec(state): 77 | if isinstance(state, tuple): 78 | return last_hidden_vec(state[len(state) - 1]) 79 | elif isinstance(state, tf.contrib.rnn.LSTMStateTuple): 80 | return state.h 81 | else: 82 | return state 83 | 84 | # Input: seq, N*D numpy array, with element 0 .. vocab_size. 0 is END token. 85 | def decode_sequence(ix_to_word, seq): 86 | N, D = seq.shape 87 | out = [] 88 | for i in range(N): 89 | txt = '' 90 | for j in range(D): 91 | ix = seq[i,j] 92 | if ix > 0 : 93 | if j >= 1: 94 | txt = txt + ' ' 95 | txt = txt + ix_to_word[str(ix)] 96 | else: 97 | break 98 | out.append(txt) 99 | return out 100 | 101 | def get_initial_state(input, state_size, scope = 'init_state'): 102 | """ 103 | Recursively initialize the first state. 104 | 105 | state_size is a nested of tuple and LSTMStateTuple and integer. 106 | 107 | It is so complicated because we use state_is_tuple 108 | """ 109 | 110 | with tf.variable_scope(scope): 111 | if isinstance(state_size, tf.contrib.rnn.LSTMStateTuple): 112 | c = slim.fully_connected(input, state_size.c, activation_fn=tf.nn.tanh, scope='LSTM_c') 113 | h = slim.fully_connected(input, state_size.h, activation_fn=tf.nn.tanh, scope='LSTM_h') 114 | return tf.contrib.rnn.LSTMStateTuple(c,h) 115 | elif isinstance(state_size, tuple): 116 | result = [get_initial_state(input, state_size[i], "layer_"+str(i)) for i in xrange(len(state_size))] 117 | return tuple(result) 118 | elif isinstance(state_size, int): 119 | return slim.fully_connected(input, state_size, activation_fn=tf.nn.tanh, scope='state') 120 | 121 | def expand_feat(input, multiples, scope = 'expand_feat'): 122 | """ 123 | Expand the dimension of states; 124 | According to multiples. 
125 | 126 | Similar reason why it's so complicated. 127 | """ 128 | with tf.variable_scope(scope): 129 | if isinstance(input, tf.contrib.rnn.LSTMStateTuple): 130 | c = expand_feat(input.c, multiples, scope='expand_LSTM_c') 131 | h = expand_feat(input.h, multiples, scope='expand_LSTM_c') 132 | return tf.contrib.rnn.LSTMStateTuple(c,h) 133 | elif isinstance(input, tuple): 134 | result = [expand_feat(input[i], multiples, "expand_layer_"+str(i)) for i in xrange(len(input))] 135 | return tuple(result) 136 | else: 137 | return tf.reshape(tf.tile(tf.expand_dims(input, 1), [1, multiples, 1]), [tf.shape(input)[0] * multiples, input.get_shape()[1].value]) 138 | 139 | def get_optimizer(opt, lr): 140 | if opt.optim == 'rmsprop': 141 | return tf.train.RMSPropOptimizer(lr, momentum=opt.optim_alpha, epsilon=opt.optim_epsilon) 142 | elif opt.optim == 'adagrad': 143 | return tf.train.AdagradOptimizer(lr) 144 | elif opt.optim == 'sgd': 145 | return tf.train.GradientDescentOptimizer(lr) 146 | elif opt.optim == 'sgdm': 147 | return tf.train.MomentumOptimizer(lr, opt.optim_alpha) 148 | elif opt.optim == 'sgdmom': 149 | return tf.train.MomentumOptimizer(lr, opt.optim_alpha, use_nesterov=True) 150 | elif opt.optim == 'adam': 151 | return tf.train.AdamOptimizer(lr, beta1=opt.optim_alpha, beta2=opt.optim_beta, epsilon=opt.optim_epsilon) 152 | else: 153 | raise Exception('bad option opt.optim') 154 | 155 | def get_cnn_optimizer(opt, cnn_lr): 156 | if opt.cnn_optim == 'rmsprop': 157 | return tf.train.RMSPropOptimizer(cnn_lr, momentum=opt.cnn_optim_alpha, epsilon=opt.optim_epsilon) 158 | elif opt.cnn_optim == 'adagrad': 159 | return tf.train.AdagradOptimizer(cnn_lr) 160 | elif opt.cnn_optim == 'sgd': 161 | return tf.train.GradientDescentOptimizer(cnn_lr) 162 | elif opt.cnn_optim == 'sgdm': 163 | return tf.train.MomentumOptimizer(cnn_lr, opt.cnn_optim_alpha) 164 | elif opt.cnn_optim == 'sgdmom': 165 | return tf.train.MomentumOptimizer(cnn_lr, opt.cnn_optim_alpha, use_nesterov=True) 166 | elif opt.cnn_optim == 'adam': 167 | return tf.train.AdamOptimizer(cnn_lr, beta1=opt.cnn_optim_alpha, beta2=opt.cnn_optim_beta, epsilon=opt.optim_epsilon) 168 | else: 169 | raise Exception('bad option opt.cnn_optim') 170 | -------------------------------------------------------------------------------- /models.py: -------------------------------------------------------------------------------- 1 | import tensorflow as tf 2 | import tensorflow.contrib.slim as slim 3 | import os 4 | import vgg 5 | import copy 6 | 7 | import numpy as np 8 | import misc.utils as utils 9 | 10 | from misc.ShowTellModel import ShowTellModel 11 | from misc.AttentionModel import AttentionModel 12 | from misc.ShowAttendTellModel import ShowAttendTellModel 13 | 14 | def setup(opt): 15 | 16 | # check compatibility if training is continued from previously saved model 17 | if vars(opt).get('start_from', None) is not None: 18 | # check if all necessary files exist 19 | assert os.path.isdir(opt.start_from)," %s must be a a path" % opt.start_from 20 | assert os.path.isfile(os.path.join(opt.start_from,"infos_"+opt.id+".pkl")),"infos.pkl file does not exist in path %s"%opt.start_from 21 | ckpt = tf.train.get_checkpoint_state(opt.start_from) 22 | assert ckpt,"No checkpoint found" 23 | assert ckpt.model_checkpoint_path,"No model path found in checkpoint" 24 | opt.ckpt = ckpt 25 | if opt.caption_model == 'show_tell': 26 | return ShowTellModel(opt) 27 | elif opt.caption_model == 'attention': 28 | return AttentionModel(opt) 29 | elif opt.caption_model == 'show_attend_tell': 
30 | return ShowAttendTellModel(opt) 31 | else: 32 | raise Exception("Caption model not supported: {}".format(opt.caption_model)) 33 | -------------------------------------------------------------------------------- /opts.py: -------------------------------------------------------------------------------- 1 | import argparse 2 | 3 | def parse_opt(): 4 | parser = argparse.ArgumentParser() 5 | # Data input settings 6 | parser.add_argument('--input_json', type=str, default='data/coco.json', 7 | help='path to the json file containing additional info and vocab') 8 | parser.add_argument('--input_h5', type=str, default='data/coco.json', 9 | help='path to the h5file containing the preprocessed dataset') 10 | parser.add_argument('--cnn_model', type=str, default='vgg16', 11 | help='vgg16 or vgg19') 12 | parser.add_argument('--cnn_weight', type=str, default='models/vgg16.npy', 13 | help='path to CNN tf model. Note this MUST be a vgg16 right now.') 14 | parser.add_argument('--start_from', type=str, default=None, 15 | help="""continue training from saved model at this path. Path must contain files saved by previous training process: 16 | 'infos.pkl' : configuration; 17 | 'checkpoint' : paths to model file(s) (created by tf). 18 | Note: this file contains absolute paths, be careful when moving files around; 19 | 'model.ckpt-*' : file(s) with model definition (created by tf) 20 | """) 21 | 22 | # Model settings 23 | parser.add_argument('--caption_model', type=str, default="show_tell", 24 | help='show_tell, show_attend_tell, attention') 25 | parser.add_argument('--rnn_size', type=int, default=512, 26 | help='size of the rnn in number of hidden nodes in each layer') 27 | parser.add_argument('--num_layers', type=int, default=1, 28 | help='number of layers in the RNN') 29 | parser.add_argument('--rnn_type', type=str, default='lstm', 30 | help='rnn, gru, or lstm') 31 | parser.add_argument('--input_encoding_size', type=int, default=512, 32 | help='the encoding size of each token in the vocabulary, and the image.') 33 | parser.add_argument('--att_hid_size', type=int, default=512, 34 | help='the hidden size of the attention MLP; only useful in show_attend_tell; 0 if not using hidden layer') 35 | 36 | # Optimization: General 37 | parser.add_argument('--max_epochs', type=int, default=-1, 38 | help='number of epochs') 39 | parser.add_argument('--batch_size', type=int, default=16, 40 | help='minibatch size') 41 | parser.add_argument('--grad_clip', type=float, default=0.1, #5., 42 | help='clip gradients at this value') 43 | parser.add_argument('--drop_prob_lm', type=float, default=0.5, 44 | help='strength of dropout in the Language Model RNN') 45 | parser.add_argument('--finetune_cnn_after', type=int, default=-1, 46 | help='After what iteration do we start finetuning the CNN? (-1 = disable; never finetune, 0 = finetune from start)') 47 | parser.add_argument('--seq_per_img', type=int, default=5, 48 | help='number of captions to sample for each image during training. Done for efficiency since CNN forward pass is expensive. E.g. coco has 5 sents/image') 49 | parser.add_argument('--beam_size', type=int, default=1, 50 | help='used when sample_max = 1, indicates number of beams in beam search. Usually 2 or 3 works well. More is not better. Set this to 1 for faster runtime but a bit worse performance.') 51 | 52 | #Optimization: for the Language Model 53 | parser.add_argument('--optim', type=str, default='adam', 54 | help='what update to use? 
rmsprop|sgd|sgdmom|adagrad|adam') 55 | parser.add_argument('--learning_rate', type=float, default=4e-4, 56 | help='learning rate') 57 | parser.add_argument('--learning_rate_decay_start', type=int, default=-1, 58 | help='at what iteration to start decaying learning rate? (-1 = dont) (in epoch)') 59 | parser.add_argument('--learning_rate_decay_every', type=int, default=10, 60 | help='every how many iterations thereafter to drop LR by half?(in epoch)') 61 | parser.add_argument('--optim_alpha', type=float, default=0.8, 62 | help='alpha for adam') 63 | parser.add_argument('--optim_beta', type=float, default=0.999, 64 | help='beta used for adam') 65 | parser.add_argument('--optim_epsilon', type=float, default=1e-8, 66 | help='epsilon that goes into denominator for smoothing') 67 | 68 | #Optimization: for the CNN 69 | parser.add_argument('--cnn_optim', type=str, default='adam', 70 | help='optimization to use for CNN') 71 | parser.add_argument('--cnn_optim_alpha', type=float, default=0.8, 72 | help='alpha for momentum of CNN') 73 | parser.add_argument('--cnn_optim_beta', type=float, default=0.999, 74 | help='beta for momentum of CNN') 75 | parser.add_argument('--cnn_learning_rate', type=float, default=1e-5, 76 | help='learning rate for the CNN') 77 | parser.add_argument('--cnn_weight_decay', type=float, default=0, 78 | help='L2 weight decay just for the CNN') 79 | 80 | # Evaluation/Checkpointing 81 | parser.add_argument('--val_images_use', type=int, default=3200, 82 | help='how many images to use when periodically evaluating the validation loss? (-1 = all)') 83 | parser.add_argument('--save_checkpoint_every', type=int, default=2500, 84 | help='how often to save a model checkpoint (in iterations)?') 85 | parser.add_argument('--checkpoint_path', type=str, default='save', 86 | help='directory to store checkpointed models') 87 | parser.add_argument('--language_eval', type=int, default=0, 88 | help='Evaluate language as well (1 = yes, 0 = no)? BLEU/CIDEr/METEOR/ROUGE_L? requires coco-caption code from Github.') 89 | parser.add_argument('--losses_log_every', type=int, default=25, 90 | help='How often do we snapshot losses, for inclusion in the progress dump? (0 = disable)') 91 | parser.add_argument('--load_best_score', type=int, default=1, 92 | help='Do we load previous best score when resuming training.') 93 | 94 | # misc 95 | parser.add_argument('--id', type=str, default='', 96 | help='an id identifying this run/job. 
used in cross-val and appended when writing progress files') 97 | parser.add_argument('--train_only', type=int, default=0, 98 | help='if true then use 80k, else use 110k') 99 | 100 | args = parser.parse_args() 101 | 102 | # Check if args are valid 103 | assert args.rnn_size > 0, "rnn_size should be greater than 0" 104 | assert args.num_layers > 0, "num_layers should be greater than 0" 105 | assert args.input_encoding_size > 0, "input_encoding_size should be greater than 0" 106 | assert args.batch_size > 0, "batch_size should be greater than 0" 107 | assert args.drop_prob_lm >= 0 and args.drop_prob_lm < 1, "drop_prob_lm should be between 0 and 1" 108 | assert args.seq_per_img > 0, "seq_per_img should be greater than 0" 109 | assert args.beam_size > 0, "beam_size should be greater than 0" 110 | assert args.save_checkpoint_every > 0, "save_checkpoint_every should be greater than 0" 111 | assert args.losses_log_every > 0, "losses_log_every should be greater than 0" 112 | assert args.language_eval == 0 or args.language_eval == 1, "language_eval should be 0 or 1" 113 | assert args.load_best_score == 0 or args.load_best_score == 1, "load_best_score should be 0 or 1" 114 | assert args.train_only == 0 or args.train_only == 1, "train_only should be 0 or 1" 115 | 116 | return args -------------------------------------------------------------------------------- /prepro.py: -------------------------------------------------------------------------------- 1 | """ 2 | Preprocess a raw json dataset into hdf5/json files for use in data_loader.lua 3 | 4 | Input: json file that has the form 5 | [{ file_path: 'path/img.jpg', captions: ['a caption', ...] }, ...] 6 | example element in this list would look like 7 | {'captions': [u'A man with a red helmet on a small moped on a dirt road. ', u'Man riding a motor bike on a dirt road on the countryside.', u'A man riding on the back of a motorcycle.', u'A dirt path with a young person on a motor bike rests to the foreground of a verdant area with a bridge and a background of cloud-wreathed mountains. ', u'A man in a red shirt and a red hat is on a motorcycle on a hill side.'], 'file_path': u'val2014/COCO_val2014_000000391895.jpg', 'id': 391895} 8 | 9 | This script reads this json, does some basic preprocessing on the captions 10 | (e.g. lowercase, etc.), creates a special UNK token, and encodes everything to arrays 11 | 12 | Output: a json file and an hdf5 file 13 | The hdf5 file contains several fields: 14 | /images is (N,3,256,256) uint8 array of raw image data in RGB format 15 | /labels is (M,max_length) uint32 array of encoded labels, zero padded 16 | /label_start_ix and /label_end_ix are (N,) uint32 arrays of pointers to the 17 | first and last indices (in range 1..M) of labels for each image 18 | /label_length stores the length of the sequence for each of the M sequences 19 | 20 | The json file has a dict that contains: 21 | - an 'ix_to_word' field storing the vocab in form {ix:'word'}, where ix is 1-indexed 22 | - an 'images' field that is a list holding auxiliary information for each image, 23 | such as in particular the 'split' it was assigned to.
24 | """ 25 | 26 | import os 27 | import json 28 | import argparse 29 | from random import shuffle, seed 30 | import string 31 | # non-standard dependencies: 32 | import h5py 33 | import numpy as np 34 | from scipy.misc import imread, imresize 35 | 36 | def prepro_captions(imgs): 37 | 38 | # preprocess all the captions 39 | print 'example processed tokens:' 40 | for i,img in enumerate(imgs): 41 | img['processed_tokens'] = [] 42 | for j,s in enumerate(img['captions']): 43 | txt = str(s).lower().translate(None, string.punctuation).strip().split() 44 | img['processed_tokens'].append(txt) 45 | if i < 10 and j == 0: print txt 46 | 47 | def build_vocab(imgs, params): 48 | count_thr = params['word_count_threshold'] 49 | 50 | # count up the number of words 51 | counts = {} 52 | for img in imgs: 53 | for txt in img['processed_tokens']: 54 | for w in txt: 55 | counts[w] = counts.get(w, 0) + 1 56 | cw = sorted([(count,w) for w,count in counts.iteritems()], reverse=True) 57 | print 'top words and their counts:' 58 | print '\n'.join(map(str,cw[:20])) 59 | 60 | # print some stats 61 | total_words = sum(counts.itervalues()) 62 | print 'total words:', total_words 63 | bad_words = [w for w,n in counts.iteritems() if n <= count_thr] 64 | vocab = [w for w,n in counts.iteritems() if n > count_thr] 65 | bad_count = sum(counts[w] for w in bad_words) 66 | print 'number of bad words: %d/%d = %.2f%%' % (len(bad_words), len(counts), len(bad_words)*100.0/len(counts)) 67 | print 'number of words in vocab would be %d' % (len(vocab), ) 68 | print 'number of UNKs: %d/%d = %.2f%%' % (bad_count, total_words, bad_count*100.0/total_words) 69 | 70 | # lets look at the distribution of lengths as well 71 | sent_lengths = {} 72 | for img in imgs: 73 | for txt in img['processed_tokens']: 74 | nw = len(txt) 75 | sent_lengths[nw] = sent_lengths.get(nw, 0) + 1 76 | max_len = max(sent_lengths.keys()) 77 | print 'max length sentence in raw data: ', max_len 78 | print 'sentence length distribution (count, number of words):' 79 | sum_len = sum(sent_lengths.values()) 80 | for i in xrange(max_len+1): 81 | print '%2d: %10d %f%%' % (i, sent_lengths.get(i,0), sent_lengths.get(i,0)*100.0/sum_len) 82 | 83 | # lets now produce the final annotations 84 | if bad_count > 0: 85 | # additional special UNK token we will use below to map infrequent words to 86 | print 'inserting the special UNK token' 87 | vocab.append('UNK') 88 | 89 | for img in imgs: 90 | img['final_captions'] = [] 91 | for txt in img['processed_tokens']: 92 | caption = [w if counts.get(w,0) > count_thr else 'UNK' for w in txt] 93 | img['final_captions'].append(caption) 94 | 95 | return vocab 96 | 97 | def assign_splits(imgs, params): 98 | num_val = params['num_val'] 99 | num_test = params['num_test'] 100 | 101 | for i,img in enumerate(imgs): 102 | if i < num_val: 103 | img['split'] = 'val' 104 | elif i < num_val + num_test: 105 | img['split'] = 'test' 106 | else: 107 | img['split'] = 'train' 108 | 109 | print 'assigned %d to val, %d to test.' % (num_val, num_test) 110 | 111 | def encode_captions(imgs, params, wtoi): 112 | """ 113 | encode all captions into one large array, which will be 1-indexed. 114 | also produces label_start_ix and label_end_ix which store 1-indexed 115 | and inclusive (Lua-style) pointers to the first and last caption for 116 | each image in the dataset. 
117 | """ 118 | 119 | max_length = params['max_length'] 120 | N = len(imgs) 121 | M = sum(len(img['final_captions']) for img in imgs) # total number of captions 122 | 123 | label_arrays = [] 124 | label_start_ix = np.zeros(N, dtype='uint32') # note: these will be one-indexed 125 | label_end_ix = np.zeros(N, dtype='uint32') 126 | label_length = np.zeros(M, dtype='uint32') 127 | caption_counter = 0 128 | counter = 1 129 | for i,img in enumerate(imgs): 130 | n = len(img['final_captions']) 131 | assert n > 0, 'error: some image has no captions' 132 | 133 | Li = np.zeros((n, max_length), dtype='uint32') 134 | for j,s in enumerate(img['final_captions']): 135 | label_length[caption_counter] = min(max_length, len(s)) # record the length of this sequence 136 | caption_counter += 1 137 | for k,w in enumerate(s): 138 | if k < max_length: 139 | Li[j,k] = wtoi[w] 140 | 141 | # note: word indices are 1-indexed, and captions are padded with zeros 142 | label_arrays.append(Li) 143 | label_start_ix[i] = counter 144 | label_end_ix[i] = counter + n - 1 145 | 146 | counter += n 147 | 148 | L = np.concatenate(label_arrays, axis=0) # put all the labels together 149 | assert L.shape[0] == M, 'lengths don\'t match? that\'s weird' 150 | assert np.all(label_length > 0), 'error: some caption had no words?' 151 | 152 | print 'encoded captions to array of size ', `L.shape` 153 | return L, label_start_ix, label_end_ix, label_length 154 | 155 | def main(params): 156 | 157 | imgs = json.load(open(params['input_json'], 'r')) 158 | seed(123) # make reproducible 159 | shuffle(imgs) # shuffle the order 160 | 161 | # tokenization and preprocessing 162 | prepro_captions(imgs) 163 | 164 | # create the vocab 165 | vocab = build_vocab(imgs, params) 166 | itow = {i+1:w for i,w in enumerate(vocab)} # a 1-indexed vocab translation table 167 | wtoi = {w:i+1 for i,w in enumerate(vocab)} # inverse table 168 | 169 | # assign the splits 170 | assign_splits(imgs, params) 171 | 172 | # encode captions in large arrays, ready to ship to hdf5 file 173 | L, label_start_ix, label_end_ix, label_length = encode_captions(imgs, params, wtoi) 174 | 175 | # create output h5 file 176 | N = len(imgs) 177 | f = h5py.File(params['output_h5'], "w") 178 | f.create_dataset("labels", dtype='uint32', data=L) 179 | f.create_dataset("label_start_ix", dtype='uint32', data=label_start_ix) 180 | f.create_dataset("label_end_ix", dtype='uint32', data=label_end_ix) 181 | f.create_dataset("label_length", dtype='uint32', data=label_length) 182 | dset = f.create_dataset("images", (N,3,256,256), dtype='uint8') # space for resized images 183 | for i,img in enumerate(imgs): 184 | # load the image 185 | I = imread(os.path.join(params['images_root'], img['file_path'])) 186 | try: 187 | Ir = imresize(I, (256,256)) 188 | except: 189 | print 'failed resizing image %s - see http://git.io/vBIE0' % (img['file_path'],) 190 | raise 191 | # handle grayscale input images 192 | if len(Ir.shape) == 2: 193 | Ir = Ir[:,:,np.newaxis] 194 | Ir = np.concatenate((Ir,Ir,Ir), axis=2) 195 | # and swap order of axes from (256,256,3) to (3,256,256) 196 | Ir = Ir.transpose(2,0,1) 197 | # write to h5 198 | dset[i] = Ir 199 | if i % 1000 == 0: 200 | print 'processing %d/%d (%.2f%% done)' % (i, N, i*100.0/N) 201 | f.close() 202 | print 'wrote ', params['output_h5'] 203 | 204 | # create output json file 205 | out = {} 206 | out['ix_to_word'] = itow # encode the (1-indexed) vocab 207 | out['images'] = [] 208 | for i,img in enumerate(imgs): 209 | 210 | jimg = {} 211 | jimg['split'] = img['split'] 212 | 
if 'file_path' in img: jimg['file_path'] = img['file_path'] # copy it over, might need 213 | if 'id' in img: jimg['id'] = img['id'] # copy over & mantain an id, if present (e.g. coco ids, useful) 214 | 215 | out['images'].append(jimg) 216 | 217 | json.dump(out, open(params['output_json'], 'w')) 218 | print 'wrote ', params['output_json'] 219 | 220 | if __name__ == "__main__": 221 | 222 | parser = argparse.ArgumentParser() 223 | 224 | # input json 225 | parser.add_argument('--input_json', required=True, help='input json file to process into hdf5') 226 | parser.add_argument('--num_val', required=True, type=int, help='number of images to assign to validation data (for CV etc)') 227 | parser.add_argument('--output_json', default='data.json', help='output json file') 228 | parser.add_argument('--output_h5', default='data.h5', help='output h5 file') 229 | 230 | # options 231 | parser.add_argument('--max_length', default=16, type=int, help='max length of a caption, in number of words. captions longer than this get clipped.') 232 | parser.add_argument('--images_root', default='', help='root location in which images are stored, to be prepended to file_path in input json') 233 | parser.add_argument('--word_count_threshold', default=5, type=int, help='only words that occur more than this number of times will be put in vocab') 234 | parser.add_argument('--num_test', default=0, type=int, help='number of test images (to withold until very very end)') 235 | 236 | args = parser.parse_args() 237 | params = vars(args) # convert to ordinary dict 238 | print 'parsed input parameters:' 239 | print json.dumps(params, indent = 2) 240 | main(params) -------------------------------------------------------------------------------- /test/test_model.py: -------------------------------------------------------------------------------- 1 | import tensorflow as tf 2 | import models 3 | import opts 4 | import numpy as np 5 | 6 | 7 | opt = opts.parse_opt() 8 | opt.batch_size = 2 9 | opt.seq_length = 5 10 | opt.seq_per_img = 2 11 | sess = tf.InteractiveSession() 12 | 13 | data = {} 14 | im1 = np.random.random([1,224,224,3]) 15 | data['images'] = np.vstack([im1, -im1]) 16 | data['labels'] = np.array([[0,1,2,3,4,0,0],[0,6,7,8,9,10,0],[0,1,2,3,4,0,0],[0,6,7,8,9,10,0]]) 17 | data['masks'] = np.array([[0,1,1,1,1,0,0],[0,1,1,1,1,1,0],[0,1,1,1,1,0,0],[0,1,1,1,1,1,0]]) 18 | 19 | opt.vocab_size = 10 20 | model = models.Model(opt) 21 | 22 | model.build_model() 23 | model.build_generator() 24 | tf.global_variables_initializer().run() 25 | sess.run(tf.assign(model.lr, 0.01)) 26 | feed = {model.images: data['images'], model.labels: data['labels'], model.masks: data['masks'], model.keep_prob: 1.0} 27 | train_loss, _ = sess.run([model.cost, model.train_op], feed) 28 | 29 | seq = sess.run(model.generator, feed) 30 | print(seq) -------------------------------------------------------------------------------- /test/test_simpleloader.py: -------------------------------------------------------------------------------- 1 | from simpleloader import * 2 | import tensorflow as tf 3 | 4 | import opts 5 | 6 | opt = opts.parse_opt() 7 | loader = DataLoader(opt) 8 | sess = tf.InteractiveSession() 9 | loader.assign_session(sess) 10 | 11 | count = 0 12 | start = time.time() 13 | while True: 14 | data = loader.get_batch(0) 15 | count += 1 16 | if data['bounds']['wrapped']: 17 | break 18 | end = time.time() 19 | print 'Time in total:', end-start 20 | print 'Total batch number:', count 21 | print 'Average time:', (end-start)/count 22 | 23 | 24 | 
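For a quick sanity check of what `prepro.py` writes, here is a minimal sketch that reads the hdf5/json pair back and decodes the captions of the first image. It assumes the default `data.json`/`data.h5` output names from the argparse defaults above, and relies on the 1-indexed, inclusive `label_start_ix`/`label_end_ix` pointers described in the docstring:

```python
import json
import h5py

# the vocab mapping has string keys after the json round-trip
info = json.load(open('data.json'))
ix_to_word = info['ix_to_word']

with h5py.File('data.h5', 'r') as f:
    start = int(f['label_start_ix'][0])     # first caption of image 0 (1-indexed)
    end = int(f['label_end_ix'][0])         # last caption of image 0 (inclusive)
    for row in f['labels'][start - 1:end]:  # convert to a 0-indexed row slice
        words = [ix_to_word[str(ix)] for ix in row if ix > 0]  # index 0 is padding
        print(' '.join(words))
```

This mirrors how `dataloader.py` is expected to slice the label array for each image.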
-------------------------------------------------------------------------------- /train.py: -------------------------------------------------------------------------------- 1 | from __future__ import absolute_import 2 | from __future__ import division 3 | from __future__ import print_function 4 | 5 | import numpy as np 6 | import tensorflow as tf 7 | 8 | import time 9 | import os 10 | from six.moves import cPickle 11 | 12 | import opts 13 | import models 14 | from dataloader import * 15 | import eval_utils 16 | import misc.utils as utils 17 | 18 | import os 19 | NUM_THREADS = 2 #int(os.environ['OMP_NUM_THREADS']) 20 | 21 | #from ipdb import set_trace 22 | 23 | def train(opt): 24 | loader = DataLoader(opt) 25 | opt.vocab_size = loader.vocab_size 26 | opt.seq_length = loader.seq_length 27 | model = models.setup(opt) 28 | 29 | infos = {} 30 | if opt.start_from is not None: 31 | # open old infos and check if models are compatible 32 | with open(os.path.join(opt.start_from, 'infos_'+opt.id+'.pkl')) as f: 33 | infos = cPickle.load(f) 34 | saved_model_opt = infos['opt'] 35 | need_be_same=["caption_model", "rnn_type", "rnn_size", "num_layers"] 36 | for checkme in need_be_same: 37 | assert vars(saved_model_opt)[checkme] == vars(opt)[checkme], "Command line argument and saved model disagree on '%s' " % checkme 38 | 39 | iteration = infos.get('iter', 0) 40 | epoch = infos.get('epoch', 0) 41 | val_result_history = infos.get('val_result_history', {}) 42 | loss_history = infos.get('loss_history', {}) 43 | 44 | loader.iterators = infos.get('iterators', loader.iterators) 45 | if opt.load_best_score == 1: 46 | best_val_score = infos.get('best_val_score', None) 47 | 48 | model.build_model() 49 | model.build_generator() 50 | model.build_decoder() 51 | 52 | tf_config = tf.ConfigProto() 53 | tf_config.intra_op_parallelism_threads=NUM_THREADS 54 | tf_config.gpu_options.allow_growth = True 55 | with tf.Session(config=tf_config) as sess: 56 | # Initialize the variables, and restore the variables form checkpoint if there is. 
57 | # and initialize the writer 58 | model.initialize(sess) 59 | 60 | # Assign the learning rate 61 | if epoch > opt.learning_rate_decay_start and opt.learning_rate_decay_start >= 0: 62 | frac = (epoch - opt.learning_rate_decay_start) / opt.learning_rate_decay_every 63 | decay_factor = 0.5 ** frac 64 | sess.run(tf.assign(model.lr, opt.learning_rate * decay_factor)) # set the decayed rate 65 | sess.run(tf.assign(model.cnn_lr, opt.cnn_learning_rate * decay_factor)) 66 | else: 67 | sess.run(tf.assign(model.lr, opt.learning_rate)) 68 | sess.run(tf.assign(model.cnn_lr, opt.cnn_learning_rate)) 69 | # Assure in training mode 70 | sess.run(tf.assign(model.training, True)) 71 | sess.run(tf.assign(model.cnn_training, True)) 72 | 73 | while True: 74 | start = time.time() 75 | # Load data from train split (0) 76 | data = loader.get_batch('train') 77 | print('Read data:', time.time() - start) 78 | 79 | start = time.time() 80 | feed = {model.images: data['images'], model.labels: data['labels'], model.masks: data['masks']} 81 | if iteration <= opt.finetune_cnn_after or opt.finetune_cnn_after == -1: 82 | train_loss, merged, _ = sess.run([model.cost, model.summaries, model.train_op], feed) 83 | else: 84 | # Finetune the cnn 85 | train_loss, merged, _, __ = sess.run([model.cost, model.summaries, model.train_op, model.cnn_train_op], feed) 86 | end = time.time() 87 | print("iter {} (epoch {}), train_loss = {:.3f}, time/batch = {:.3f}" \ 88 | .format(iteration, epoch, train_loss, end - start)) 89 | 90 | # Update the iteration and epoch 91 | iteration += 1 92 | if data['bounds']['wrapped']: 93 | epoch += 1 94 | 95 | # Write the training loss summary 96 | if (iteration % opt.losses_log_every == 0): 97 | model.summary_writer.add_summary(merged, iteration) 98 | model.summary_writer.flush() 99 | loss_history[iteration] = train_loss 100 | 101 | # make evaluation on validation set, and save model 102 | if (iteration % opt.save_checkpoint_every == 0): 103 | # eval model 104 | eval_kwargs = {'val_images_use': opt.val_images_use, 105 | 'split': 'val', 106 | 'language_eval': opt.language_eval, 107 | 'dataset': opt.input_json} 108 | val_loss, predictions, lang_stats = eval_split(sess, model, loader, eval_kwargs) 109 | 110 | # Write validation result into summary 111 | summary = tf.Summary(value=[tf.Summary.Value(tag='validation loss', simple_value=val_loss)]) 112 | model.summary_writer.add_summary(summary, iteration) 113 | for k,v in lang_stats.iteritems(): 114 | summary = tf.Summary(value=[tf.Summary.Value(tag=k, simple_value=v)]) 115 | model.summary_writer.add_summary(summary, iteration) 116 | model.summary_writer.flush() 117 | val_result_history[iteration] = {'loss': val_loss, 'lang_stats': lang_stats, 'predictions': predictions} 118 | 119 | # Save model if is improving on validation result 120 | if opt.language_eval == 1: 121 | current_score = lang_stats['CIDEr'] 122 | else: 123 | current_score = - val_loss 124 | 125 | if best_val_score is None or current_score > best_val_score: # if true 126 | best_val_score = current_score 127 | checkpoint_path = os.path.join(opt.checkpoint_path, 'model.ckpt') 128 | model.saver.save(sess, checkpoint_path, global_step = iteration) 129 | print("model saved to {}".format(checkpoint_path)) 130 | 131 | # Dump miscalleous informations 132 | infos['iter'] = iteration 133 | infos['epoch'] = epoch 134 | infos['iterators'] = loader.iterators 135 | infos['best_val_score'] = best_val_score 136 | infos['opt'] = opt 137 | infos['val_result_history'] = val_result_history 138 | 
infos['loss_history'] = loss_history 139 | infos['vocab'] = loader.get_vocab() 140 | with open(os.path.join(opt.checkpoint_path, 'infos_'+opt.id+'.pkl'), 'wb') as f: 141 | cPickle.dump(infos, f) 142 | 143 | # Stop if reaching max epochs 144 | if epoch >= opt.max_epochs and opt.max_epochs != -1: 145 | break 146 | 147 | def eval_split(sess, model, loader, eval_kwargs): 148 | verbose = eval_kwargs.get('verbose', True) 149 | val_images_use = eval_kwargs.get('val_images_use', -1) 150 | split = eval_kwargs.get('split', 'val') 151 | language_eval = eval_kwargs.get('language_eval', 1) 152 | dataset = eval_kwargs.get('dataset', 'coco') 153 | 154 | # Make sure we are in evaluation mode 155 | sess.run(tf.assign(model.training, False)) 156 | sess.run(tf.assign(model.cnn_training, False)) 157 | 158 | loader.reset_iterator(split) 159 | lang_stats = {} # stays empty when language_eval == 0, so the return value is always defined 160 | n = 0 161 | loss_sum = 0 162 | loss_evals = 0 163 | predictions = [] 164 | while True: 165 | if opt.beam_size > 1: # note: 'opt' is the module-level options object parsed at the bottom of this file 166 | data = loader.get_batch(split, 1) 167 | n = n + 1 168 | else: 169 | data = loader.get_batch(split) 170 | n = n + loader.batch_size 171 | 172 | # forward the model to get loss 173 | feed = {model.images: data['images'], model.labels: data['labels'], model.masks: data['masks']} 174 | loss = sess.run(model.cost, feed) 175 | 176 | loss_sum = loss_sum + loss 177 | loss_evals = loss_evals + 1 178 | 179 | if opt.beam_size == 1: 180 | # forward the model to also get generated samples for each image 181 | feed = {model.images: data['images']} 182 | #g_o,g_l,g_p, seq = sess.run([model.g_output, model.g_logits, model.g_probs, model.generator], feed) 183 | seq = sess.run(model.generator, feed) 184 | 185 | #set_trace() 186 | sents = utils.decode_sequence(loader.get_vocab(), seq) 187 | 188 | for k, sent in enumerate(sents): 189 | entry = {'image_id': data['infos'][k]['id'], 'caption': sent} 190 | predictions.append(entry) 191 | if verbose: 192 | print('image %s: %s' %(entry['image_id'], entry['caption'])) 193 | else: 194 | seq = model.decode(data['images'], opt.beam_size, sess) 195 | sents = [' '.join([loader.ix_to_word.get(str(ix), '') for ix in sent]).strip() for sent in seq] 196 | entry = {'image_id': data['infos'][0]['id'], 'caption': sents[0]} 197 | predictions.append(entry) 198 | if verbose: 199 | for sent in sents: 200 | print('image %s: %s' %(entry['image_id'], sent)) 201 | 202 | ix0 = data['bounds']['it_pos_now'] 203 | ix1 = data['bounds']['it_max'] 204 | if val_images_use != -1: 205 | ix1 = min(ix1, val_images_use) 206 | for i in range(n - ix1): 207 | predictions.pop() 208 | if verbose: 209 | print('evaluating validation performance... 
%d/%d (%f)' %(ix0 - 1, ix1, loss)) 210 | 211 | if data['bounds']['wrapped']: 212 | break 213 | if n>= val_images_use: 214 | break 215 | 216 | if language_eval == 1: 217 | lang_stats = eval_utils.language_eval(dataset, predictions) 218 | 219 | # Switch back to training mode 220 | sess.run(tf.assign(model.training, True)) 221 | sess.run(tf.assign(model.cnn_training, True)) 222 | return loss_sum/loss_evals, predictions, lang_stats 223 | 224 | opt = opts.parse_opt() 225 | train(opt) 226 | -------------------------------------------------------------------------------- /vgg.py: -------------------------------------------------------------------------------- 1 | import os 2 | 3 | import numpy as np 4 | import tensorflow as tf 5 | import time 6 | 7 | VGG_MEAN = [103.939, 116.779, 123.68] 8 | 9 | 10 | class Vgg16: 11 | def __init__(self, vgg16_npy_path=None): 12 | if vgg16_npy_path is None: 13 | self.data_dict = {} 14 | else: 15 | assert os.path.isfile(vgg16_npy_path), vgg16_npy_path + " doesn't exist." 16 | self.data_dict = np.load(vgg16_npy_path).item() 17 | print "npy file loaded" 18 | 19 | def build(self, rgb): 20 | """ 21 | load variable from npy to build the VGG 22 | 23 | :param rgb: rgb image [batch, height, width, 3] values scaled [0, 1] 24 | """ 25 | 26 | start_time = time.time() 27 | print "build model started" 28 | rgb_scaled = rgb * 255.0 29 | 30 | # Convert RGB to BGR 31 | red, green, blue = tf.split(axis=3, num_or_size_splits=3, value=rgb_scaled) 32 | assert red.get_shape().as_list()[1:] == [224, 224, 1] 33 | assert green.get_shape().as_list()[1:] == [224, 224, 1] 34 | assert blue.get_shape().as_list()[1:] == [224, 224, 1] 35 | bgr = tf.concat(axis=3, values=[ 36 | blue - VGG_MEAN[0], 37 | green - VGG_MEAN[1], 38 | red - VGG_MEAN[2], 39 | ]) 40 | assert bgr.get_shape().as_list()[1:] == [224, 224, 3] 41 | 42 | self.training = tf.Variable(True, trainable = False, name = "training") 43 | 44 | self.conv1_1 = self.conv_layer(bgr, "conv1_1") 45 | self.conv1_2 = self.conv_layer(self.conv1_1, "conv1_2") 46 | self.pool1 = self.max_pool(self.conv1_2, 'pool1') 47 | 48 | self.conv2_1 = self.conv_layer(self.pool1, "conv2_1") 49 | self.conv2_2 = self.conv_layer(self.conv2_1, "conv2_2") 50 | self.pool2 = self.max_pool(self.conv2_2, 'pool2') 51 | 52 | self.conv3_1 = self.conv_layer(self.pool2, "conv3_1") 53 | self.conv3_2 = self.conv_layer(self.conv3_1, "conv3_2") 54 | self.conv3_3 = self.conv_layer(self.conv3_2, "conv3_3") 55 | self.pool3 = self.max_pool(self.conv3_3, 'pool3') 56 | 57 | self.conv4_1 = self.conv_layer(self.pool3, "conv4_1") 58 | self.conv4_2 = self.conv_layer(self.conv4_1, "conv4_2") 59 | self.conv4_3 = self.conv_layer(self.conv4_2, "conv4_3") 60 | self.pool4 = self.max_pool(self.conv4_3, 'pool4') 61 | 62 | self.conv5_1 = self.conv_layer(self.pool4, "conv5_1") 63 | self.conv5_2 = self.conv_layer(self.conv5_1, "conv5_2") 64 | self.conv5_3 = self.conv_layer(self.conv5_2, "conv5_3") 65 | self.pool5 = self.max_pool(self.conv5_3, 'pool5') 66 | 67 | self.keep_prob = tf.cond(self.training, lambda : tf.constant(0.5), lambda : tf.constant(1.0), name = "keep_prob") 68 | 69 | self.fc6 = self.fc_layer(self.pool5, "fc6") 70 | assert self.fc6.get_shape().as_list()[1:] == [4096] 71 | self.relu6 = tf.nn.relu(self.fc6, name = "relu6") 72 | self.drop6 = tf.nn.dropout(self.relu6, self.keep_prob, name = "drop6") 73 | 74 | self.fc7 = self.fc_layer(self.drop6, "fc7") 75 | self.relu7 = tf.nn.relu(self.fc7, name = "relu7") 76 | self.drop7 = tf.nn.dropout(self.relu7, self.keep_prob, name = "drop7") 77 | 78 
| self.fc8 = self.fc_layer(self.drop7, "fc8") 79 | 80 | self.prob = tf.nn.softmax(self.fc8, name="prob") 81 | 82 | self.data_dict = None 83 | print "build model finished: %ds" % (time.time() - start_time) 84 | 85 | def avg_pool(self, bottom, name): 86 | return tf.nn.avg_pool(bottom, ksize=[1, 2, 2, 1], strides=[1, 2, 2, 1], padding='SAME', name=name) 87 | 88 | def max_pool(self, bottom, name): 89 | return tf.nn.max_pool(bottom, ksize=[1, 2, 2, 1], strides=[1, 2, 2, 1], padding='SAME', name=name) 90 | 91 | def conv_layer(self, bottom, name): 92 | with tf.variable_scope(name): 93 | filt = self.get_conv_filter(bottom, name) 94 | 95 | conv = tf.nn.conv2d(bottom, filt, [1, 1, 1, 1], padding='SAME') 96 | 97 | conv_biases = self.get_bias(bottom, name) 98 | bias = tf.nn.bias_add(conv, conv_biases) 99 | 100 | relu = tf.nn.relu(bias) 101 | return relu 102 | 103 | def fc_layer(self, bottom, name): 104 | with tf.variable_scope(name): 105 | shape = bottom.get_shape().as_list() 106 | dim = 1 107 | for d in shape[1:]: 108 | dim *= d 109 | x = tf.reshape(bottom, [-1, dim]) 110 | 111 | weights = self.get_fc_weight(x, name) 112 | biases = self.get_bias(x, name) 113 | 114 | # Fully connected layer. Note that the '+' operation automatically 115 | # broadcasts the biases. 116 | fc = tf.nn.bias_add(tf.matmul(x, weights), biases) 117 | 118 | return fc 119 | 120 | def get_n_out(self, name): 121 | if name[:4] == 'conv': 122 | n_out = 64 * (2 ** (min(int(name[4]),4) - 1)) 123 | else: 124 | if name[2] == '8': 125 | n_out = 1000 126 | else: 127 | n_out = 4096 128 | return n_out 129 | 130 | 131 | def get_conv_filter(self, bottom, name): 132 | if self.data_dict.get(name, None) is None: 133 | print 'No pretrained weight for', name, 'filter' 134 | n_in = bottom.get_shape()[-1].value 135 | n_out = self.get_n_out(name) 136 | print 'n_in', n_in, 'n_out', n_out 137 | return tf.get_variable("filter", 138 | shape=[3, 3, n_in, n_out], 139 | dtype=tf.float32, 140 | initializer=tf.contrib.layers.xavier_initializer_conv2d()) 141 | return tf.Variable(self.data_dict[name][0], name="filter") 142 | 143 | def get_bias(self, bottom, name): 144 | if self.data_dict.get(name, None) is None: 145 | print 'No pretrained weight for', name, 'biases' 146 | n_out = self.get_n_out(name) 147 | print 'n_out', n_out 148 | return tf.Variable(tf.constant(0.0, shape=[n_out], dtype=tf.float32), trainable=True, name='biases') 149 | return tf.Variable(self.data_dict[name][1], name="biases") 150 | 151 | def get_fc_weight(self, bottom, name): 152 | if self.data_dict.get(name, None) is None: 153 | print 'No pretrained weight for', name, 'weights' 154 | n_in = bottom.get_shape()[-1].value 155 | n_out = self.get_n_out(name) 156 | print 'n_in', n_in, 'n_out', n_out 157 | return tf.get_variable("weights", 158 | shape=[n_in, n_out], 159 | dtype=tf.float32, 160 | initializer=tf.contrib.layers.xavier_initializer()) 161 | return tf.Variable(self.data_dict[name][0], name="weights") 162 | 163 | 164 | class Vgg19: 165 | def __init__(self, vgg19_npy_path=None): 166 | if vgg19_npy_path is None: 167 | self.data_dict = {} 168 | else: 169 | assert os.path.isfile(vgg19_npy_path), vgg19_npy_path + " doesn't exist." 
170 | self.data_dict = np.load(vgg19_npy_path).item() 171 | print "npy file loaded" 172 | 173 | def build(self, rgb): 174 | """ 175 | load variable from npy to build the VGG 176 | :param rgb: rgb image [batch, height, width, 3] values scaled [0, 1] 177 | """ 178 | 179 | start_time = time.time() 180 | print("build model started") 181 | rgb_scaled = rgb * 255.0 182 | 183 | # Convert RGB to BGR 184 | red, green, blue = tf.split(axis=3, num_or_size_splits=3, value=rgb_scaled) 185 | assert red.get_shape().as_list()[1:] == [224, 224, 1] 186 | assert green.get_shape().as_list()[1:] == [224, 224, 1] 187 | assert blue.get_shape().as_list()[1:] == [224, 224, 1] 188 | bgr = tf.concat(axis=3, values=[ 189 | blue - VGG_MEAN[0], 190 | green - VGG_MEAN[1], 191 | red - VGG_MEAN[2], 192 | ]) 193 | assert bgr.get_shape().as_list()[1:] == [224, 224, 3] 194 | 195 | self.training = tf.Variable(True, trainable = False, name = "training") 196 | 197 | self.conv1_1 = self.conv_layer(bgr, "conv1_1") 198 | self.conv1_2 = self.conv_layer(self.conv1_1, "conv1_2") 199 | self.pool1 = self.max_pool(self.conv1_2, 'pool1') 200 | 201 | self.conv2_1 = self.conv_layer(self.pool1, "conv2_1") 202 | self.conv2_2 = self.conv_layer(self.conv2_1, "conv2_2") 203 | self.pool2 = self.max_pool(self.conv2_2, 'pool2') 204 | 205 | self.conv3_1 = self.conv_layer(self.pool2, "conv3_1") 206 | self.conv3_2 = self.conv_layer(self.conv3_1, "conv3_2") 207 | self.conv3_3 = self.conv_layer(self.conv3_2, "conv3_3") 208 | self.conv3_4 = self.conv_layer(self.conv3_3, "conv3_4") 209 | self.pool3 = self.max_pool(self.conv3_4, 'pool3') 210 | 211 | self.conv4_1 = self.conv_layer(self.pool3, "conv4_1") 212 | self.conv4_2 = self.conv_layer(self.conv4_1, "conv4_2") 213 | self.conv4_3 = self.conv_layer(self.conv4_2, "conv4_3") 214 | self.conv4_4 = self.conv_layer(self.conv4_3, "conv4_4") 215 | self.pool4 = self.max_pool(self.conv4_4, 'pool4') 216 | 217 | self.conv5_1 = self.conv_layer(self.pool4, "conv5_1") 218 | self.conv5_2 = self.conv_layer(self.conv5_1, "conv5_2") 219 | self.conv5_3 = self.conv_layer(self.conv5_2, "conv5_3") 220 | self.conv5_4 = self.conv_layer(self.conv5_3, "conv5_4") 221 | self.pool5 = self.max_pool(self.conv5_4, 'pool5') 222 | 223 | self.keep_prob = tf.cond(self.training, lambda : tf.constant(0.5), lambda : tf.constant(1.0), name = "keep_prob") 224 | 225 | self.fc6 = self.fc_layer(self.pool5, "fc6") 226 | assert self.fc6.get_shape().as_list()[1:] == [4096] 227 | self.relu6 = tf.nn.relu(self.fc6, name = "relu6") 228 | self.drop6 = tf.nn.dropout(self.relu6, self.keep_prob, name = "drop6") 229 | 230 | self.fc7 = self.fc_layer(self.drop6, "fc7") 231 | self.relu7 = tf.nn.relu(self.fc7, name = 'relu7') 232 | self.drop7 = tf.nn.dropout(self.relu7, self.keep_prob, name = "drop7") 233 | 234 | self.fc8 = self.fc_layer(self.drop7, "fc8") 235 | 236 | self.prob = tf.nn.softmax(self.fc8, name="prob") 237 | 238 | self.data_dict = None 239 | print("build model finished: %ds" % (time.time() - start_time)) 240 | 241 | def avg_pool(self, bottom, name): 242 | return tf.nn.avg_pool(bottom, ksize=[1, 2, 2, 1], strides=[1, 2, 2, 1], padding='SAME', name=name) 243 | 244 | def max_pool(self, bottom, name): 245 | return tf.nn.max_pool(bottom, ksize=[1, 2, 2, 1], strides=[1, 2, 2, 1], padding='SAME', name=name) 246 | 247 | def conv_layer(self, bottom, name): 248 | with tf.variable_scope(name): 249 | filt = self.get_conv_filter(bottom, name) 250 | 251 | conv = tf.nn.conv2d(bottom, filt, [1, 1, 1, 1], padding='SAME') 252 | 253 | conv_biases = 
self.get_bias(bottom, name) 254 | bias = tf.nn.bias_add(conv, conv_biases) 255 | 256 | relu = tf.nn.relu(bias) 257 | return relu 258 | 259 | def fc_layer(self, bottom, name): 260 | with tf.variable_scope(name): 261 | shape = bottom.get_shape().as_list() 262 | dim = 1 263 | for d in shape[1:]: 264 | dim *= d 265 | x = tf.reshape(bottom, [-1, dim]) 266 | 267 | weights = self.get_fc_weight(x, name) 268 | biases = self.get_bias(x, name) 269 | 270 | # Fully connected layer. Note that the '+' operation automatically 271 | # broadcasts the biases. 272 | fc = tf.nn.bias_add(tf.matmul(x, weights), biases) 273 | 274 | return fc 275 | 276 | def get_n_out(self, name): 277 | if name[:4] == 'conv': 278 | n_out = 64 * (2 ** (min(int(name[4]),4) - 1)) 279 | else: 280 | if name[2] == '8': 281 | n_out = 1000 282 | else: 283 | n_out = 4096 284 | return n_out 285 | 286 | def get_conv_filter(self, bottom, name): 287 | if self.data_dict.get(name, None) is None: 288 | print 'No pretrained weight for', name, 'filter' 289 | n_in = bottom.get_shape()[-1].value 290 | n_out = self.get_n_out(name) 291 | print 'n_in', n_in, 'n_out', n_out 292 | return tf.get_variable("filter", 293 | shape=[3, 3, n_in, n_out], 294 | dtype=tf.float32, 295 | initializer=tf.contrib.layers.xavier_initializer_conv2d()) 296 | return tf.Variable(self.data_dict[name][0], name="filter") 297 | 298 | def get_bias(self, bottom, name): 299 | if self.data_dict.get(name, None) is None: 300 | print 'No pretrained weight for', name, 'biases' 301 | n_out = self.get_n_out(name) 302 | print 'n_out', n_out 303 | return tf.Variable(tf.constant(0.0, shape=[n_out], dtype=tf.float32), trainable=True, name='biases') 304 | return tf.Variable(self.data_dict[name][1], name="biases") 305 | 306 | def get_fc_weight(self, bottom, name): 307 | if self.data_dict.get(name, None) is None: 308 | print 'No pretrained weight for', name, 'weights' 309 | n_in = bottom.get_shape()[-1].value 310 | n_out = self.get_n_out(name) 311 | print 'n_in', n_in, 'n_out', n_out 312 | return tf.get_variable("weights", 313 | shape=[n_in, n_out], 314 | dtype=tf.float32, 315 | initializer=tf.contrib.layers.xavier_initializer()) 316 | return tf.Variable(self.data_dict[name][0], name="weights") 317 | -------------------------------------------------------------------------------- /vis/index.html: -------------------------------------------------------------------------------- 1 | 2 | 3 | 4 | 5 | 6 | neuraltalk2 results visualization 7 | 8 | 42 | 43 | 44 |
45 | 72 | 73 | 74 | --------------------------------------------------------------------------------